# TransitGraphAI
This notebook preprocesses transportation graphs from GTFS data collected from the `MobilityData/mobility-database-catalogs` GitHub repository; the preprocessing functions themselves live in `utils/preprocessing.py`. The second part of the notebook experiments with generative models, with the eventual goal of enriching the dataset with additional synthetic transportation networks.
# Data preprocessing and augmentation with synthetic data


In [None]:
import utils.preprocessing as pp
import utils.generative_models as gm
import os
import pickle as pkl
import torch.optim as optim
import torch

## Preprocessing

We load core GTFS tables (*stops, trips, routes, stop_times, calendar*) and filter out underspecified trips. Stops are then fused into station-level nodes using declared `parent_station` when available, or proximity-based fusion otherwise (great-circle distance with *BallTree*), while explicitly preventing the fusion of consecutive stops that appear within trips.  
From the cleaned data, we build three structures:
1. a directed stop connectivity graph, with one edge per pair of consecutive stops in a trip;
2. a time-enriched edge table that stores the departure time, arrival time, and duration of each hop;
3. symmetric walking-transfer edges between fused stops, derived from their surface distance and an assumed walking speed.
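
The walking-transfer step can be illustrated with a minimal, standard-library sketch. It is not the code in `utils/preprocessing.py`: the function names, the 1.4 m/s walking speed, and the 400 m cut-off are assumptions chosen for illustration, and a brute-force pairwise loop stands in for the *BallTree* neighbour search.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres
WALK_SPEED_MPS = 1.4          # assumed walking speed (illustrative value)
MAX_WALK_M = 400.0            # assumed maximum transfer distance (illustrative value)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def walking_transfers(stations, max_dist_m=MAX_WALK_M, speed_mps=WALK_SPEED_MPS):
    """Return symmetric (from, to, seconds) transfer edges between nearby stations.

    `stations` maps a station id to (lat, lon). The double loop is O(n^2);
    a spatial index would replace it on real feeds.
    """
    ids = sorted(stations)
    edges = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            dist = haversine_m(*stations[a], *stations[b])
            if dist <= max_dist_m:
                seconds = dist / speed_mps
                edges.append((a, b, seconds))  # a -> b
                edges.append((b, a, seconds))  # b -> a, same walking time
    return edges


# Toy coordinates: A and B are roughly 140 m apart, C is several kilometres away.
stations = {"A": (41.3870, 2.1700), "B": (41.3880, 2.1710), "C": (41.4500, 2.2500)}
transfers = walking_transfers(stations)
```

Emitting both directions explicitly keeps the transfer edges in the same directed format as the trip edges, so a router can treat the two edge types uniformly.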

The final contextual edge set (trip + transfer) supports time-dependent routing: given a start time, the earliest-arrival path is computed by expanding only feasible departures and interleaving transfers as needed.  
This design yields a compact, station-level multimodal graph that retains GTFS temporal semantics and is suitable for RL agents or classical routing experiments.
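
To make the earliest-arrival idea concrete, here is a hedged sketch of a time-dependent Dijkstra search over trip and transfer edges. The edge tuple layouts and the function name `earliest_arrival` are assumptions for illustration and do not reflect the actual column names or API of `utils/preprocessing.py`; the sketch also ignores minimum change times.

```python
import heapq
from collections import defaultdict


def earliest_arrival(trip_edges, transfer_edges, source, target, start_time):
    """Earliest arrival time at `target`, or None if it cannot be reached.

    trip_edges: iterable of (from_stop, to_stop, departure, arrival), in seconds.
    transfer_edges: iterable of (from_stop, to_stop, walk_seconds).
    """
    trips = defaultdict(list)
    for u, v, dep, arr in trip_edges:
        trips[u].append((dep, arr, v))
    walks = defaultdict(list)
    for u, v, secs in transfer_edges:
        walks[u].append((secs, v))

    best = {source: start_time}
    queue = [(start_time, source)]
    while queue:
        t, u = heapq.heappop(queue)
        if t > best.get(u, float("inf")):
            continue  # stale queue entry, a better time was already found
        if u == target:
            return t
        # Trip hops: only departures at or after the current time are feasible.
        for dep, arr, v in trips[u]:
            if dep >= t and arr < best.get(v, float("inf")):
                best[v] = arr
                heapq.heappush(queue, (arr, v))
        # Walking transfers can start immediately.
        for secs, v in walks[u]:
            if t + secs < best.get(v, float("inf")):
                best[v] = t + secs
                heapq.heappush(queue, (t + secs, v))
    return None


# Toy network: one bus A->B, two buses B->C, a walk B<->D, and one bus D->E.
trip_edges = [("A", "B", 100, 200), ("B", "C", 150, 250),
              ("B", "C", 300, 400), ("D", "E", 320, 380)]
transfer_edges = [("B", "D", 60), ("D", "B", 60)]
arrival_c = earliest_arrival(trip_edges, transfer_edges, "A", "C", 90)
arrival_e = earliest_arrival(trip_edges, transfer_edges, "A", "E", 90)
```

In the toy network the 150 s departure from B is skipped because the traveller only reaches B at 200 s, which is the "expand only feasible departures" rule in action.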

> See the `utils/preprocessing.py` docstrings for further detail about each function.


In [None]:
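os.makedirs("preprocessed_data", exist_ok=True)  # make sure the output folder exists
for city in os.listdir("gtfs_data"):
    print(city)
    # Preprocess one GTFS feed and pickle the three returned objects together.
    data, graph, contextual_edges = pp.preprocess_provider(f"gtfs_data/{city}")
    with open(f"preprocessed_data/{city}.pkl", "wb") as handle:
        pkl.dump((data, graph, contextual_edges), handle)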

Barcelona
Bodensee_Oberschwaben
Toulouse
Hofmann_Omnibusverkehr_GmbH
Pays_de_la_Loire
Trentino
Isère
naldo_Verkehrsverbund
Aachen
Milano
Schweizer_Reisen
Aalen_Bopfingen
Piemonte
Lyon
Marseille
Toulon

## Generative synthesis of transport-like graphs

### GraphVAE
The objective is to enrich the dataset with synthetic transport networks that capture the topological patterns of real ones. To this end, the project explores deep generative models operating on graph adjacency matrices. The first stage implements a Variational Graph Autoencoder (VGAE) with a GCN-based encoder and an inner-product decoder, trained on Erdős–Rényi random graphs as a controlled proof of concept. The VAE learns a latent representation where each node is embedded into a Gaussian space; adjacency is reconstructed via σ(zzᵀ), and the training objective combines binary cross-entropy reconstruction loss with a KL divergence toward a standard normal prior.  
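
For reference, the objective described above can be written out as follows, assuming the usual VGAE formulation; the exact weighting and normalisation inside `gm.loss_function` are not shown in this notebook and may differ. Here $Z$ stacks the node embeddings $z_i$, $\mu$ and $\ell$ denote the encoder outputs `mu` and `logvar`, and $\sigma$ is the logistic sigmoid.

$$
\hat{A} = \sigma\!\left(Z Z^{\top}\right), \qquad
z_i = \mu_i + \exp\!\left(\tfrac{1}{2}\ell_i\right) \odot \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, I)
$$

$$
\mathcal{L} =
-\sum_{i,j} \left[ A_{ij} \log \hat{A}_{ij} + (1 - A_{ij}) \log\!\left(1 - \hat{A}_{ij}\right) \right]
- \frac{1}{2} \sum_{i,k} \left( 1 + \ell_{ik} - \mu_{ik}^{2} - \exp(\ell_{ik}) \right)
$$

The first term is the binary cross-entropy between the input adjacency $A$ and its reconstruction $\hat{A}$; the second is the KL divergence between each node's Gaussian posterior and the standard normal prior.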

In [None]:
N = 10  # number of nodes in each random graph
model = gm.GraphVAE(10, 6, 4)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
model.train()
for epoch in range(30):
    tot_loss = 0.0
    for batch in range(10):
        optimizer.zero_grad()
        # Accumulate the loss over 64 single-graph samples, then take one optimizer step.
        batch_loss = torch.tensor(0.0)
        for _ in range(64):
            adj, x = gm.generate_batched_graphs(1, N)
            adj = adj.squeeze()
            x = x.squeeze()
            adj_recon, mu, logvar = model(x, adj)
            batch_loss = batch_loss + gm.loss_function(adj_recon, adj, mu, logvar)
        batch_loss.backward()
        optimizer.step()
        tot_loss += batch_loss.item()
    print(tot_loss)  # loss summed over the 10 batches of this epoch

1028.8667907714844
1012.5847854614258
986.1924667358398
962.7984466552734
914.5387878417969
927.5323715209961
906.1050796508789
899.3350601196289
866.4687271118164
857.5817413330078
848.7939910888672
831.2275848388672
819.5922241210938
813.8950729370117
815.1714096069336
797.1603546142578
785.0302352905273
774.8384094238281
761.2522048950195
764.9651031494141
743.9743804931641
744.3758773803711
728.3412017822266
732.9286270141602
731.2053756713867
716.2471008300781
710.926399230957
710.0619049072266
699.8264999389648
701.0708999633789


In [73]:
print(adj_recon)

tensor([[0.9484, 0.7693, 0.7000, 0.9053, 0.4828, 0.7120, 0.8879, 0.3280, 0.4115,
         0.5533],
        [0.7693, 0.8732, 0.4009, 0.5385, 0.8699, 0.6917, 0.7695, 0.2072, 0.2088,
         0.7446],
        [0.7000, 0.4009, 0.7653, 0.8423, 0.2584, 0.4902, 0.5166, 0.7590, 0.5040,
         0.4327],
        [0.9053, 0.5385, 0.8423, 0.9462, 0.2473, 0.5812, 0.7268, 0.7223, 0.4848,
         0.4568],
        [0.4828, 0.8699, 0.2584, 0.2473, 0.9332, 0.6050, 0.5833, 0.2079, 0.2301,
         0.7829],
        [0.7120, 0.6917, 0.4902, 0.5812, 0.6050, 0.7221, 0.7647, 0.2734, 0.1459,
         0.6093],
        [0.8879, 0.7695, 0.5166, 0.7268, 0.5833, 0.7647, 0.8867, 0.1788, 0.2324,
         0.5862],
        [0.3280, 0.2072, 0.7590, 0.7223, 0.2079, 0.2734, 0.1788, 0.9125, 0.7395,
         0.3632],
        [0.4115, 0.2088, 0.5040, 0.4848, 0.2301, 0.1459, 0.2324, 0.7395, 0.9933,
         0.2179],
        [0.5533, 0.7446, 0.4327, 0.4568, 0.7829, 0.6093, 0.5862, 0.3632, 0.2179,
         0.6838]], grad_fn=<

### GraphVAE-GAN
To increase generative realism, a second stage introduces **adversarial training**: a **GraphGenerator** (MLP) produces synthetic adjacency matrices from random latent vectors, while a **GraphDiscriminator** attempts to distinguish real from generated graphs. Both networks are trained jointly in a minimax game, optionally coupled with the VAE to form a **GraphVAE–GAN hybrid**. The discriminator uses label smoothing to prevent overconfidence, and the generator (or VAE) is regularized with both adversarial and reconstruction objectives. This setup allows the system to learn richer graph priors, improving the diversity and structural coherence of synthetic transport networks that can later augment the real GTFS-based corpus.
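
The label-smoothing idea can be sketched without any framework. The snippet below is an illustration only, not the loss used inside `gm.trainGraphGAN`: the function names, the one-sided smoothing scheme, and the soft target of 0.9 are all assumptions.

```python
import math


def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one predicted probability and a (possibly soft) label."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def discriminator_loss(p_real, p_fake, smooth=0.9):
    """Mean discriminator loss with one-sided label smoothing.

    Real graphs are scored against the soft target `smooth` instead of 1.0;
    generated graphs keep the hard target 0.0.
    """
    real_term = sum(bce(p, smooth) for p in p_real) / len(p_real)
    fake_term = sum(bce(p, 0.0) for p in p_fake) / len(p_fake)
    return real_term + fake_term


def generator_loss(p_fake):
    """Non-saturating generator loss: push the discriminator's output on fakes toward 1."""
    return sum(bce(p, 1.0) for p in p_fake) / len(p_fake)


# Discriminator outputs (probabilities of "real") on a toy batch.
d_loss = discriminator_loss(p_real=[0.9, 0.8], p_fake=[0.1, 0.3])
g_loss = generator_loss(p_fake=[0.1, 0.3])
```

With a soft target of 0.9, the real-graph term is minimised when the discriminator outputs 0.9 rather than 1.0, so extremely confident predictions are penalised instead of rewarded.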

In [None]:
model, discriminator = gm.trainGraphGAN(10, 6, 4, 100, 256, 30, 2, False)

In [None]:
z = torch.randn(1, 4)
adj_fake = model(z)
torch.where(adj_fake < 0.01, 0, 1)

In [None]:
adj_fake

In [None]:
with torch.no_grad():
    p = discriminator(adj_fake)
p