# Graph Creation

This notebook reads the transport network data from CSV files, then builds and stores networkx graphs. The data for each city is provided as a zip archive. Of the many files it contains, the two we use are `network_nodes.csv` and `network_combined.csv`. The former describes the nodes (stops): the id, latitude, longitude and name. The latter describes the edges (routes): the from and to node ids, the straight-line distance, the average duration between the stops, the number of times per hour a public transport vehicle passes through the stop, and the route type. The route type is one of the following eight types: `tram, subway, rail, bus, ferry, cablecar, gondola, funicular`. Because different modes of transport have separate edges, even between the same pair of stops, the full transport network is modelled as a MultiDiGraph. Because every edge has a from node and a to node, the graph is directed.
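To illustrate why a MultiDiGraph fits this data, here is a minimal sketch on a made-up edge list (the stop ids and the `toy_edges` name are assumptions for illustration, not taken from the dataset). Two modes connecting the same ordered pair of stops are kept as parallel edges, and the reverse direction is stored separately.

```python
import networkx as nx

# Toy edge list: (from_stop, to_stop, route_type). Values are invented for illustration.
toy_edges = [
    (1, 2, "bus"),
    (1, 2, "tram"),  # same stop pair, different mode: stored as a parallel edge
    (2, 1, "bus"),   # reverse direction is a separate directed edge
]

g = nx.MultiDiGraph()
for u, v, mode in toy_edges:
    g.add_edge(u, v, route_type=mode)

# Count the edges in each direction between stops 1 and 2
edges_1_to_2 = g.number_of_edges(1, 2)
edges_2_to_1 = g.number_of_edges(2, 1)

# Collect the modes available from stop 1 to stop 2
modes_1_to_2 = sorted(data["route_type"] for data in g.get_edge_data(1, 2).values())
print(edges_1_to_2, edges_2_to_1, modes_1_to_2)
```

A plain `DiGraph` would keep only one edge per ordered stop pair, so the second `add_edge(1, 2, ...)` call would update the attributes of the first edge instead of adding a new one.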

We found some self-loops in the edge data and could not find any corresponding information on the transportation website of the respective city. Hence we remove every edge whose from and to ids are the same.
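A minimal sketch of this filter on a toy edge table (the rows are invented; only the column names `from_stop_I`, `to_stop_I` and `route_type` follow the dataset):

```python
import pandas as pd

# Toy edge table with two self-loops (rows 1 and 3). Values are invented for illustration.
toy_edges_df = pd.DataFrame({
    "from_stop_I": [1, 2, 3, 3],
    "to_stop_I":   [2, 2, 4, 3],
    "route_type":  [3, 3, 0, 0],
})

# An edge is a self-loop when it starts and ends at the same stop
self_loops = toy_edges_df["from_stop_I"] == toy_edges_df["to_stop_I"]
n_removed = int(self_loops.sum())

# Keep only the edges that connect two different stops
cleaned_df = toy_edges_df[~self_loops]
print(f"removed {n_removed} self-loops, {len(cleaned_df)} edges remain")
```

Counting the removed rows per city is a cheap way to check how much data the filter discards.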

For each city we create a dictionary of graphs, with the mode of transport as the key and the network graph as the value. If a mode of transport does not exist in the city, its value is set to `None`. In addition to the individual modes, the dictionary holds the full graph under the key `full`. The dictionary is stored as a pickle file on the drive. Alongside it, we store `network_nodes.csv` and `network_combined.csv` for each city.
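The sketch below shows the shape of such a dictionary and a pickle round trip. It is an illustration under assumptions: the file name `toy_city.gpickle`, the temporary directory and the toy edges are hypothetical, and only the `bus` and `full` entries are filled.

```python
import pickle
import tempfile
from pathlib import Path

import networkx as nx

# One entry per mode, plus the full network; None marks a mode the city lacks
modes = ["tram", "subway", "rail", "bus", "ferry", "cablecar"]
city_graphs = {mode: None for mode in modes}
city_graphs["full"] = None

# Toy graphs (invented edges) for the only mode present in this example
bus_graph = nx.DiGraph()
bus_graph.add_edge(1, 2, d=350)
full_graph = nx.MultiDiGraph()
full_graph.add_edge(1, 2, d=350, route_type=3)
city_graphs["bus"] = bus_graph
city_graphs["full"] = full_graph

# Round trip through a pickle file in a temporary directory (hypothetical file name)
with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / "toy_city.gpickle"
    with open(pickle_path, "wb") as f:
        pickle.dump(city_graphs, f, pickle.HIGHEST_PROTOCOL)
    with open(pickle_path, "rb") as f:
        loaded = pickle.load(f)

# Consumers should skip the modes whose value is None
available = [mode for mode, graph in loaded.items() if graph is not None]
print(available)
```

Keeping a `None` entry for every absent mode means each city's dictionary has the same keys, so downstream code can loop over a fixed list of modes and only needs an `is not None` check.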

We also create an undirected version of every graph alongside the directed one; the directed and undirected dictionaries are stored in separate files.
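As a small illustration of what the conversion does, the sketch below calls `to_undirected()` on a toy `DiGraph` (edges and the distance attribute `d` are invented). A pair of opposite directed edges collapses into a single undirected edge.

```python
import networkx as nx

# Toy directed graph: stops 1 and 2 are connected in both directions
directed = nx.DiGraph()
directed.add_edge(1, 2, d=350)
directed.add_edge(2, 1, d=350)
directed.add_edge(2, 3, d=500)

# to_undirected() returns a new graph; the directed original is left unchanged
undirected = directed.to_undirected()

print(directed.number_of_edges(), undirected.number_of_edges())
```

When the two directions carry different attribute values, networkx combines them in an arbitrary order, so attributes on merged edges of the undirected graphs should be interpreted with care.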

Lastly, as none of the cities has gondola or funicular routes, these two route types are excluded from the rest of the project.

1. For each city, create empty dictionaries (directed and undirected) to store the graphs for the different types of transportation routes in the city.
2. Extract the city name from the city zip file name, and read the node and edge data from the CSV files in the zip file.
3. Merge stops that share a name, then remove the self-loops from the edge data.
4. Create a MultiDiGraph for the full network of transportation routes in the city.
5. Get the unique route types in the edge data, and create a new DiGraph for each route type.
6. Create an undirected copy of every graph.
7. Store the dictionaries of graphs as serialized pickle files named after the city.
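One preprocessing step in the code below, merging stops that share a name, can be illustrated on toy data. This is a sketch rather than the notebook's exact code: the stop names and ids are invented, and the `remap` helper is hypothetical. Every duplicate id is mapped to the smallest id with the same name, and the edge endpoints are rewritten accordingly.

```python
import pandas as pd

# Toy node table: "Central" appears three times under different ids
nodes_df = pd.DataFrame({
    "stop_I": [10, 11, 12, 13],
    "name": ["Central", "Central", "Harbour", "Central"],
})

# Map every duplicate id to the smallest id that shares its name
dupes = nodes_df[nodes_df.duplicated(subset=["name"], keep=False)]
keep_id = dupes.groupby("name")["stop_I"].transform("min")
mapping = {
    int(old): int(new)
    for old, new in zip(dupes["stop_I"], keep_id)
    if old != new
}

# Hypothetical helper: rewrite an id if it was merged, otherwise keep it
def remap(stop_id):
    return mapping.get(stop_id, stop_id)

edges_df = pd.DataFrame({"from_stop_I": [11, 12, 13], "to_stop_I": [12, 13, 10]})
remapped_df = edges_df.apply(lambda col: col.map(remap))

# Merging can itself create self-loops (here 13 -> 10 becomes 10 -> 10)
new_self_loops = int((remapped_df["from_stop_I"] == remapped_df["to_stop_I"]).sum())
print(mapping, new_self_loops)
```

The last line shows why the order of operations matters: an edge between two stops that are merged into one becomes a self-loop, so the self-loop filter has to run after the ids have been remapped.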

In [1]:
import pickle
import pathlib
from enum import Enum
from zipfile import ZipFile

import networkx as nx
import pandas as pd

In [2]:
# Set paths for input data and output graphs
rel_data_folder_path = pathlib.Path("./../../data")
transport_data_path = rel_data_folder_path.joinpath('transport_data')
city_network_graphs = rel_data_folder_path.joinpath('network_graphs').joinpath('graphs')
city_network_bones = rel_data_folder_path.joinpath('network_graphs').joinpath('nodes-edges')

# Get list of zip files with transportation data
city_zips = list(transport_data_path.glob('*.zip'))

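# Define enum mapping the integer codes in the `route_type` column (0-5) to mode names;
# gondola and funicular are left out because none of the cities has them
class RouteType(Enum):
    tram, subway, rail, bus, ferry, cablecar = range(6)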

In [3]:
# Loop over each zip folder for the city
for city_data_path in city_zips:
    
    # Create dictionary to store directed graphs for each route type and the full network
    city_graphs_dir = {route_type.name: None for route_type in RouteType}
    city_graphs_dir["full"] = None

    # Create dictionary to store undirected graphs for each route type and the full network
    city_graphs_undir = {route_type.name: None for route_type in RouteType}
    city_graphs_undir["full"] = None
    
    # Open the city's zip archive and derive the city name from the file name
    city_zf = ZipFile(city_data_path)
    city_name = city_data_path.stem

    # Read node information and extract the original node file to the nodes-edges folder
    city_nodes_df = pd.read_csv(city_zf.open(city_name + '/network_nodes.csv'), sep=";")
    city_zf.extract(city_name + '/network_nodes.csv', path=city_network_bones)
    
    # find the duplicate stops based on 'name' column
    duplicate_stops = city_nodes_df[city_nodes_df.duplicated(subset=['name'], keep=False)]

    # create the dictionary with key as the stop_I of duplicate row and value as the minimum among the stop_I of the duplicate rows
    dup_stop_mapping = {}
    for name, group in duplicate_stops.groupby('name'):
        min_stop_I = group['stop_I'].min()
        for stop_I in group['stop_I']:
            if stop_I != min_stop_I:
                dup_stop_mapping[stop_I] = min_stop_I

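    # drop the duplicate nodes, keeping the row with the smallest stop_I per name
    # so that the retained node matches the target ids in dup_stop_mapping
    city_nodes_df = city_nodes_df.sort_values('stop_I').drop_duplicates(subset=['name'], keep='first')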
    
    node_attrs = city_nodes_df.set_index('stop_I').to_dict('index')
    
    # read the edges information 
    city_network_df = pd.read_csv(city_zf.open(city_name + '/network_combined.csv'),sep=";")
    
    # replace the duplicate stop ids with the retained ones
    city_network_df['from_stop_I'] = city_network_df['from_stop_I'].map(dup_stop_mapping).fillna(city_network_df['from_stop_I'])
    city_network_df['to_stop_I'] = city_network_df['to_stop_I'].map(dup_stop_mapping).fillna(city_network_df['to_stop_I'])
    
    # remove self loops where the from stop and to stop are the same
    city_network_df = city_network_df.query("from_stop_I != to_stop_I")

    # extract the original (unfiltered) edge file to the nodes-edges folder, then close the archive
    city_zf.extract(city_name + '/network_combined.csv', path=city_network_bones)
    city_zf.close()

    # Construct graph for full network
    full_city_graph = nx.MultiDiGraph()

    # Add edges to the graph; every column after the two stop ids becomes an edge attribute
    for _, row in city_network_df.iterrows():
        source = row['from_stop_I']
        target = row['to_stop_I']
        edge_data = row.iloc[2:].to_dict()
        full_city_graph.add_edge(source, target, **edge_data)

    nx.set_node_attributes(full_city_graph, node_attrs)
    
    city_graphs_dir["full"] = full_city_graph
    city_graphs_undir["full"] = full_city_graph.to_undirected()

    # Construct graphs for different route types
    rte_types = city_network_df["route_type"].unique()

    for rte_type in rte_types:
        rte_network_df = city_network_df[city_network_df["route_type"] == rte_type]

        rte_type_graph = nx.DiGraph()

        # Add edges to the graph; every column after the two stop ids becomes an edge attribute
        for _, row in rte_network_df.iterrows():
            source = row['from_stop_I']
            target = row['to_stop_I']
            edge_data = row.iloc[2:].to_dict()
            rte_type_graph.add_edge(source, target, **edge_data)

        nx.set_node_attributes(rte_type_graph, node_attrs)
        city_graphs_dir[RouteType(rte_type).name] = rte_type_graph
        city_graphs_undir[RouteType(rte_type).name] = rte_type_graph.to_undirected()

    # Save graphs for city to file
    ## directed
    city_network_graphs.joinpath('directed_graphs').mkdir(parents=True, exist_ok=True)
    with open(city_network_graphs.joinpath('directed_graphs').joinpath(city_name + '.gpickle'), 'wb') as f:
        pickle.dump(city_graphs_dir, f, pickle.HIGHEST_PROTOCOL)
    
    ## undirected
    city_network_graphs.joinpath('undirected_graphs').mkdir(parents=True, exist_ok=True)
    with open(city_network_graphs.joinpath('undirected_graphs').joinpath(city_name + '.gpickle'), 'wb') as f:
        pickle.dump(city_graphs_undir, f, pickle.HIGHEST_PROTOCOL)