# Lab in Data Science: Final Project

Pierre Fouche, Matthias Leroy and Raphaël Steinmann

## Abstract

The goal of this project is to build a robust route planner that computes the shortest path between two public transport stops, taking into account not only the time of each route, but also its 'safety', i.e. considering the possibility of missing a connection if a transport is late compared to the schedule.

To address this problem, we designed and implemented a robust route planning algorithm that, based on predictions made from historical public transport data, returns not only the fastest route from one stop to another, but also some alternative routes that are safer but might take more time.

We used data from SBB/CFF in the Zürich area to build our project. The project consists of:
- Modeling the public transport infrastructure for the route planning algorithm,
- Building a predictive model using historical arrival/departure time data for the public transport network,
- Implementing a robust route planning algorithm using this predictive model,
- Implementing a method to test and validate our results,
- Implementing a web visualization to demonstrate our method.

## Imports

In [1]:
import getpass
import pyspark
from datetime import datetime,timedelta
from pyspark.sql import SparkSession
import pyspark.sql.functions as functions
from pyspark.sql.types import BooleanType, FloatType
from pyspark.sql.window import Window
import math
import helpers, spark_helpers
import pickle
import random
import numpy as np
import pandas as pd
import functools
import copy
from sklearn.linear_model import LinearRegression


ZH_HB_ID = 8503000

%load_ext autoreload
%autoreload 2

## Initialize the `SparkSession`

In [2]:
conf = pyspark.conf.SparkConf()
conf.setMaster('yarn')
conf.setAppName('project-{0}'.format(getpass.getuser()))
conf.set('spark.executor.memory', '6g')
conf.set('spark.executor.instances', '6')
conf.set('spark.port.maxRetries', '100')
sc = pyspark.SparkContext.getOrCreate(conf)
conf = sc.getConf()
sc

In [4]:
# init spark session
spark = SparkSession(sc)

## Data Processing

### Cleaning metadata
First, let's clean the metadata dataframe. We will use the SBB data limited around the Zurich area. We will focus on all the stops within 10km of the Zurich train station. Let's get rid of all the stations that are too far away from Zurich and dump the metadata (stations ID, name and coordinates) in a pickle:

In [5]:
# load metadata
raw_metadata = spark.read.load('/datasets/project/metadata', format='com.databricks.spark.csv', header='false', sep='\\t')

# remove multiple spaces
metadata = raw_metadata.withColumn('_c0', functions.regexp_replace(raw_metadata._c0, r'\s+', ' '))
# split into columns
metadata = metadata.withColumn('name', functions.split(metadata._c0, '%')[1])
for (name, index, type_) in [('station_ID',0, 'int'), ('long',1, 'double'), ('lat',2, 'double'), ('height',3, 'int')]:
    metadata = metadata.withColumn(name, functions.split(metadata._c0, ' ')[index].cast(type_))
# remove useless column
metadata = metadata.drop('_c0')
# trim name column to remove left/right blank
metadata = metadata.withColumn('name', functions.trim(metadata.name))

# coordinates of Zürich main train station
lat_zurich = 47.3782
long_zurich = 8.5402

# convert to pandas dataframe
pandas_df = metadata.toPandas()

# keep only the stops that are located < 10km from Zurich HB
pandas_df['distance_to_zh'] = pandas_df.apply(lambda x: helpers.distance(x['long'], x['lat'], long_zurich, lat_zurich), axis=1)
pandas_df = pandas_df[pandas_df['distance_to_zh'] < 10]

# recreate spark dataframe from pandas dataframe
metadata = spark.createDataFrame(pandas_df)
# create dict of stations from pandas dataframe
stations = pandas_df.set_index('station_ID').to_dict('index')

# dump metadata in pickle
with open('./data/metadata.pickle', 'wb') as handle:
    pickle.dump(stations, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [7]:
# load metadata from pickle
with open('./data/metadata.pickle', 'rb') as handle:
    stations = pickle.load(handle)

### Cleaning main dataset
Here we only filter out the columns of the data that we don't need. We also add a column that contains the index of the weekday and do some rudimentary data cleaning.

In [8]:
# load full data
raw_df = spark.read.load('/datasets/project/istdaten/*/*', format='csv', header='true', inferSchema='true', sep=';')
# load sample data
# raw_df = spark.read.load('/datasets/project/istdaten/2018/01', format='csv', header='true', inferSchema='true', sep=';')

In [9]:
# rename the fields german -> english
fields = {
    'BETRIEBSTAG':'date',
    'FAHRT_BEZEICHNER':'trip_id',
    'PRODUKT_ID':'transport_type',
    'LINIEN_ID':'train_id',
    'LINIEN_TEXT':'line',
    'VERKEHRSMITTEL_TEXT':'train_type',
    'ZUSATZFAHRT_TF':'additional_trip',
    'FAELLT_AUS_TF':'trip_failed',
    'HALTESTELLEN_NAME':'stop_name',
    'BPUIC':'stop_id',
    'ANKUNFTSZEIT':'schedule_arrival',
    'AN_PROGNOSE':'real_arrival',
    'AN_PROGNOSE_STATUS':'arr_forecast_status',
    'ABFAHRTSZEIT':'schedule_dep',
    'AB_PROGNOSE':'real_dep',
    'AB_PROGNOSE_STATUS':'dep_forecast_status',
    'DURCHFAHRT_TF':'no_stop_here'
}

df = raw_df.selectExpr([k + ' as ' + fields[k] for k in fields])

In [10]:
# refactor dates
df = df.withColumn('date', functions.from_unixtime(functions.unix_timestamp('date', 'dd.MM.yyyy')))
df = df.withColumn('schedule_arrival', functions.from_unixtime(functions.unix_timestamp('schedule_arrival', 'dd.MM.yyyy HH:mm')))
df = df.withColumn('real_arrival', functions.from_unixtime(functions.unix_timestamp('real_arrival', 'dd.MM.yyyy HH:mm')))
df = df.withColumn('schedule_dep', functions.from_unixtime(functions.unix_timestamp('schedule_dep', 'dd.MM.yyyy HH:mm')))
df = df.withColumn('real_dep', functions.from_unixtime(functions.unix_timestamp('real_dep', 'dd.MM.yyyy HH:mm')))

In [11]:
# add a column containing the weekday (monday=1, sunday=6)
df = df.withColumn('weekday', spark_helpers.get_weekday(df.date))

# keep only the rows with stops near zurich
df = df.where(df.stop_id.isin([int(x) for x in list(pandas_df.station_ID.unique())]))

# there are still 51'571'541 rows in the zurich area
# df.count()

# keep only date after the 10th of december, because the schedule changed
df = df.where(df.date > '2017-12-10 00:00:00')

# discard the rows when there is no stop here
df2 = df.where(df.no_stop_here == 'false')

# discard ill-formatted rows where the train leaves a station before arriving in it
df2 = df2.where((df2.schedule_dep >= df2.schedule_arrival) | functions.col('schedule_arrival').isNull() | functions.col('schedule_dep').isNull())

## Modeling the network
The first step is to model the network. In the original data from the SBB, each row represents a stop of a given vehicle at a given stop, date and arrival/departure time. In order to model the network (i.e. the course and schedules of each line), we reshape the data such that each row represents not a stop but an edge from stop A to stop B, with departure time (from A) and arrival time (at B).

### From stops to edges
Here is how we proceeded to achieve this configuration:
- group the data by ```trip_id``` and order it by time
- duplicate each column *C* of the dataframe and append the copy to the dataframe as *next_C*
- using a window, shift all the *next* columns up by one row. This way, each row contains as *next* data the data of the row below

Since the rows are grouped by ```trip_id``` and ordered by time, each row is the next stop of the row above. The partitioning of the window function and the ```edge_is_valid``` udf function ensure that we discard unwanted rows.
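
As an illustration of the idea only (the notebook does this in Spark with a window function), here is a minimal plain-Python sketch that pairs each stop with the next stop of the same trip. The toy events and the function name `stops_to_edges` are hypothetical and not part of `helpers` or `spark_helpers`.

```python
from itertools import groupby
from operator import itemgetter

# Toy stop events: (trip_id, schedule_time, stop_id); values are made up
stop_events = [
    ("t1", "14:05", 2), ("t1", "14:00", 1), ("t1", "14:12", 3),
    ("t2", "09:00", 7), ("t2", "09:04", 8),
]

def stops_to_edges(events):
    """Pair each stop with the next stop of the same trip."""
    edges = []
    # sort by trip, then by time, so consecutive rows of a trip are consecutive stops
    ordered = sorted(events, key=itemgetter(0, 1))
    for trip_id, group in groupby(ordered, key=itemgetter(0)):
        trip = list(group)
        # zip(trip, trip[1:]) plays the role of the one-row shift
        for (_, dep, stop_a), (_, arr, stop_b) in zip(trip, trip[1:]):
            edges.append((trip_id, stop_a, stop_b, dep, arr))
    return edges

edges = stops_to_edges(stop_events)
```

The last stop of each trip has no successor and therefore produces no edge, which corresponds to the one row removed per partition in the Spark version.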

In [12]:
# create a column with the schedule time that will be used to build the network
df2 = df2.withColumn('schedule_time', spark_helpers.date_choice(df2.schedule_arrival, df2.schedule_dep))
#df2 = df2.withColumn('schedule_time', functions.from_unixtime(functions.unix_timestamp('schedule_time', 'dd.MM.yyyy HH:mm')))

# create a column that tells if a stop is the first/last one of its trip or in the middle
df2 = df2.withColumn('stop_type', spark_helpers.stop_type(df2.schedule_dep, df2.schedule_arrival))

In [13]:
trips = df2.select(['trip_id', 'date', 'schedule_time', 'stop_id', 'stop_type', 'schedule_arrival', 'schedule_dep', 'line', 'transport_type', 'train_type', 'arr_forecast_status', 'weekday', 'real_arrival']).orderBy(['trip_id', 'schedule_time'], ascending=[0,0])

In [14]:
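# duplicate the dataframe, shift the copy by one row and append it to the original
# this way, we have for each row the current stop and the next stop
# the window is ordered by descending time inside each trip, so that lag() returns
# the chronologically next stop (ordering by trip_id alone leaves the row order undefined)
w = Window.partitionBy('trip_id').orderBy(functions.col('schedule_time').desc())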
trips2 = trips.select("*", functions.lag("trip_id").over(w).alias("next_tid"))
trips2 = trips2.select("*", functions.lag("schedule_time").over(w).alias("next_time"))
trips2 = trips2.select("*", functions.lag("stop_id").over(w).alias("next_sid"))
trips2 = trips2.select("*", functions.lag("stop_type").over(w).alias("next_type"))
trips2 = trips2.select("*", functions.lag("schedule_arrival").over(w).alias("next_sched_arr"))
trips2 = trips2.select("*", functions.lag("schedule_dep").over(w).alias("next_sched_dep"))
trips2 = trips2.select("*", functions.lag("arr_forecast_status").over(w).alias("next_arr_forecast_status"))
trips2 = trips2.select("*", functions.lag("real_arrival").over(w).alias("next_real_arrival"))

# since we shift columns, there is one row to remove per partition
trips2 = trips2.where(trips2.next_time.isNotNull())

In [15]:
# create a new column telling if the edge is valid or not
# (i.e. if the stop and next stop are really part of the same ride)
trips3 = trips2.withColumn('is_valid', spark_helpers.edge_is_valid(trips2.trip_id, trips2.schedule_time, trips2.stop_id, trips2.stop_type, trips2.next_tid, trips2.next_time, trips2.next_sid, trips2.next_type, trips2.schedule_dep,trips2.next_sched_arr))

# keep only valid edges
trips4 = trips3.filter(trips3.is_valid=='true')

### For each day of the week, model the network
Now that the data has a more convenient format, we can model the network. After doing some research, and for performance reasons, we make the following assumption:
***The schedule repeats every week.***

This assumption allows us to build only seven network models, one per weekday. To do that, we selected a typical week with no days off or unusual schedules (15-21 January 2018). Note also that we took the liberty of discarding all data from before the 10th of December 2017, since the SBB changed their schedules for the entire network on this date.

The network for a single weekday is organized as an adjacency list. The outermost Python dictionary contains as keys the departure stops ```A```. For each departure stop, the associated value is a second dictionary whose keys are the arrival stops ```B``` of the edges. Each value of the second dictionary is a list of tuples. Each tuple represents a trip from ```A``` to ```B```, with a departure/arrival time, the trip ID and the line. In other words, the data is organised as follows:

```models['monday'] = {stopA: {stopB: [(dep_time, arr_time, trip_id, line), (...)]}}```
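
To make the structure concrete, here is a small sketch that builds such a nested dictionary from a list of edges. It is not the code of `helpers.model_network`; the function `build_network` and the toy edges are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import time

# Toy edges: (stop_a, stop_b, dep_time, arr_time, trip_id, line); values are made up
toy_edges = [
    (1, 2, time(14, 30), time(14, 35), "t3", "S6"),
    (1, 2, time(14, 0), time(14, 5), "t1", "S6"),
    (2, 3, time(14, 7), time(14, 12), "t1", "S6"),
]

def build_network(edges):
    """Build {stopA: {stopB: [(dep_time, arr_time, trip_id, line), ...]}}."""
    network = defaultdict(lambda: defaultdict(list))
    for stop_a, stop_b, dep, arr, trip_id, line in edges:
        network[stop_a][stop_b].append((dep, arr, trip_id, line))
    # keep the connections of each edge sorted by departure time
    for targets in network.values():
        for connections in targets.values():
            connections.sort()
    # convert to plain dicts: a defaultdict built with a lambda cannot be pickled
    return {a: dict(targets) for a, targets in network.items()}

monday = build_network(toy_edges)
```

Keeping each list sorted by departure time makes it cheap to look up the next departure after a given time.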

In [16]:
# we chose the week of the 15-21 january 2018 as a typical week
typical_week = ['2018-01-' + str(x) + ' 00:00:00' for x in range(15,22)]

In [17]:
regenerate_models = False
days_names = ['monday','tuesday','wednesday','thursday','friday','saturday','sunday']

# generate one network for each weekday and store them in pickles
if regenerate_models:
    for (date, day_name) in zip(typical_week, days_names):
        network = (helpers.model_network(trips4, date))
        with open('./data/'+day_name+'.pickle', 'wb') as handle:
            helpers.network_to_datetime(network) # works inplace
            pickle.dump(network, handle, protocol=pickle.HIGHEST_PROTOCOL)
        print(str(day_name) + ' done')

In [18]:
# load the networks from the pickles
models = []
for day in days_names:
    with open('./data/'+ day +'.pickle', 'rb') as handle:
        network = pickle.load(handle)
    models.append(network)
    print(day + ' loaded')

monday loaded
tuesday loaded
wednesday loaded
thursday loaded
friday loaded
saturday loaded
sunday loaded


### Compute walking network
In addition to the public transport network, one must consider that some trips include walking between two neighbouring stops. To take this into account, we also modeled a walking network, containing for each pair of stations located at most 500m from one another the walking time between them. To simplify the problem, we assumed that people always walk in a straight line between stations, at a constant speed of 5km/h.
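
The following is a minimal sketch of how such a walking network could be computed, assuming great-circle (haversine) distances. It is not the code of `helpers.compute_walking_network`; the function names, the toy station IDs and the coordinates of stations "B" and "C" are hypothetical.

```python
import math

WALKING_SPEED_KMH = 5.0  # constant walking speed, from the text
MAX_WALK_KM = 0.5        # maximum walking distance, from the text

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (longitude, latitude) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def walking_network_sketch(stations):
    """Return {a: {b: walking_time_in_seconds}} for all pairs at most MAX_WALK_KM apart."""
    network = {}
    ids = list(stations)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            d = haversine_km(stations[a]["long"], stations[a]["lat"],
                             stations[b]["long"], stations[b]["lat"])
            if d <= MAX_WALK_KM:
                seconds = d / WALKING_SPEED_KMH * 3600
                network.setdefault(a, {})[b] = seconds
                network.setdefault(b, {})[a] = seconds  # walking is symmetric
    return network

# Toy stations: "A" uses the Zürich HB coordinates from above, "B" and "C" are made up
toy_stations = {
    "A": {"long": 8.5402, "lat": 47.3782},
    "B": {"long": 8.5402, "lat": 47.3802},  # roughly 220 m north of A
    "C": {"long": 8.5402, "lat": 47.4282},  # several km away, not walkable
}
toy_walks = walking_network_sketch(toy_stations)
```

The pairwise loop is quadratic in the number of stations, which is acceptable for the stops within 10km of Zürich HB but would need a spatial index for a larger area.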

In [19]:
# compute walking network
walking_network = helpers.compute_walking_network(stations)
print('walking network loaded')

walking network loaded


## Predictive model

In this part we build a predictive model using historical arrival/departure time data. Our goal is to predict how certain it is that a change of transport succeeds, and then to propagate this certainty over a whole journey using the route planning algorithm explained in the next section.

### 1) Processing the data

First of all, we group the rows that match (same trip between two stations at the same scheduled time) and compute the delay for the trips where the actual arrival time was measured.

In [20]:
# We drop what we do not need for the prediction
trips5 = trips4.drop('stop_type', 'next_type', 'is_valid',
                     'schedule_time', 'schedule_arrival', 'next_sched_dep',
                     'next_time', 'next_tid', 'real_arrival', 'arr_forecast_status')

trips5 = trips5.na.fill("temp 00:00:00")

# We refactor the dates by keeping only the time of the day and not the entire day
trips5 = trips5.withColumn('schedule_dep', spark_helpers.keep_time(trips5.schedule_dep))
trips5 = trips5.withColumn('next_sched_arr', spark_helpers.keep_time(trips5.next_sched_arr))
trips5 = trips5.withColumn('next_real_arrival', spark_helpers.keep_time(trips5.next_real_arrival))

# We compute the arrival delay of each stop station
df3 = trips5.withColumn("delay_arrival", 
                     functions.unix_timestamp('next_real_arrival', 'HH:mm:ss') -
                     functions.unix_timestamp('next_sched_arr', 'HH:mm:ss'))

create_rush = functions.udf(helpers.rush_inter)
group_weekday = functions.udf(helpers.group_weekday_py)

# We create time intervals in order to differentiate rush hours from the other times of the day
# Different combinations of time intervals can be tried (we kept the ones that gave the best prediction)
df3 = df3.withColumn('arrival_interval', create_rush(df3.next_sched_arr))
# We also gather the weekdays in 4 buckets (Wednesday, Saturday and Sunday separately and the other days together)
df3 = df3.withColumn('weekday', group_weekday(df3.next_sched_arr))

# We create a dataframe where we only keep the rows where there is a real arrival time
arrival_df1 = df3.filter(df3.next_arr_forecast_status == "GESCHAETZT")
arrival_df1 = arrival_df1.drop('next_arr_forecast_status')

# arrival_df.groupby('transport_type').count().show()

We noticed that only trains have measured arrival times, so we could compute the delay only for those trips. Moreover, only about 1 million of the 30 million rows that we have contain a measured time.

One solution was to assume that there is no delay for the other trips (bus/tram). However, since buses and trams make up the majority of transports in the center of Zurich, the chance of missing a connection would then almost always be 0%.

Another solution was to predict the delay of those trips from the ones we already had. This is why we decided to fit a linear regression using the location (latitude, longitude) of the stop, along with the day and time of the trip.

### 2) Complete the delay data with Linear Regression

Here we tried to predict the delay for every station. We assumed that the delays of the trains can be used to predict the delays of the buses and trams.

In [22]:
#We create a dataframe of stations with unknown delays
#station_to_predict = df3.filter(df3.next_arr_forecast_status == "PROGNOSE").groupby(['transport_type', 'next_sid', 'arrival_interval', 'weekday']).count()

#station_to_predict = station_to_predict.toPandas()

regenerate_knn_pickle = False
if regenerate_knn_pickle:
    station_to_predict.to_pickle('./data/delay_knn.pickle')
else:
    station_to_predict = pd.read_pickle('./data/delay_knn.pickle')

#Add longitude and latitude
station_to_predict['long'] = station_to_predict['next_sid'].map(lambda x: pandas_df[pandas_df['station_ID'] == x].long.values[0])
station_to_predict['lat'] = station_to_predict['next_sid'].map(lambda x: pandas_df[pandas_df['station_ID'] == x].lat.values[0])

#Dataframe of stations with known delays
station_computed = arrival_df1.withColumn('long', spark_helpers.get_longitude(arrival_df1.next_sid))
station_computed = station_computed.withColumn('lat', spark_helpers.get_latitude(station_computed.next_sid))

#station_computed_pandas = station_computed.toPandas()
regenerate_stat_comp_pickle = False
if regenerate_stat_comp_pickle:
    station_computed_pandas.to_pickle('./data/stat_comp.pickle')
else:
    station_computed_pandas = pd.read_pickle('./data/stat_comp.pickle')

In [23]:
station_computed_lin = station_computed_pandas[['trip_id', 'next_sid', 'weekday', 'arrival_interval', 'long', 'lat', 'delay_arrival']]

#We use a linear regression to predict the delay using the location, the weekday and the time interval
lr = LinearRegression()
lr.fit(station_computed_lin[['weekday', 'arrival_interval', 'long', 'lat']], station_computed_lin['delay_arrival'])

station_to_predict['pred'] = lr.predict(station_to_predict[['weekday', 'arrival_interval', 'long', 'lat']])


def get_delay(status, delay, interval, week_day, transport_type, sid):
    if status == 'GESCHAETZT':
        return float(delay)
    return float(station_to_predict[(station_to_predict['next_sid'] == sid) & 
                              (station_to_predict['transport_type'] == transport_type) & 
                              (station_to_predict['weekday'] == week_day) & 
                              (station_to_predict['arrival_interval'] == interval)]['pred'])

get_delay_udf = functions.udf(get_delay, FloatType())

#Set the delay for every station. Keep the original if it exists, otherwise use the prediction
arrival_df = df3.withColumn('real_delay', get_delay_udf(df3.next_arr_forecast_status, df3.delay_arrival, df3.arrival_interval, df3.weekday, df3.transport_type, df3.next_sid))

arrival_df = arrival_df.drop('delay_arrival')
arrival_df = arrival_df.withColumnRenamed("real_delay", "delay_arrival")

### 3) Evaluate the uncertainty according to the delay

We now have a delay for the majority of trips, for each weekday and time interval. The next step is to turn these delays into a measure of uncertainty. We decided to use the interquartile range (IQR) in order to limit the influence of outliers. The interquartile range is the difference between the upper and lower quartiles (75th and 25th percentiles), Q3 and Q1.

$IQR = Q3 - Q1$

It allows us to measure the dispersion of the delays for each trip. Moreover, we combine it with the interquartile mean (IQM), which is the mean of the data included in the interquartile range and is insensitive to outliers. With $x_1 \le \dots \le x_n$ the sorted delays:

$IQM = \frac{2}{n}\sum_{i = \frac{n}{4}+1}^{\frac{3n}{4}} x_i$

Thus, for each trip, for a given day of the week and time interval, we now have an estimate of the **worst delay** to expect, computed as IQM + IQR, that is not affected by the outliers.
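
As a worked example of these formulas, here is a small sketch using only the Python standard library. It is not the code of `spark_helpers.iqm` or `spark_helpers.interquartile`; the quartile method ("inclusive" interpolation) and the toy delays are assumptions, and other quartile conventions give slightly different values.

```python
import statistics

def iqr(delays):
    """Interquartile range Q3 - Q1 (0 when there are fewer than two values)."""
    if len(delays) < 2:
        return 0.0
    q1, _q2, q3 = statistics.quantiles(delays, n=4, method="inclusive")
    return q3 - q1

def iqm(delays):
    """Mean of the middle 50% of the sorted, non-empty data (exact when n is divisible by 4)."""
    xs = sorted(delays)
    n = len(xs)
    middle = xs[n // 4:(3 * n) // 4] or xs  # fall back to all values for tiny samples
    return sum(middle) / len(middle)

# Toy arrival delays in seconds for one trip; the 600 s value is an outlier
toy_delays = [0, 0, 60, 60, 60, 120, 120, 600]
worst_case = iqm(toy_delays) + iqr(toy_delays)
```

With these toy values the IQM is 75 s and the IQR is 75 s, so the worst case is 150 s, whereas the plain mean (127.5 s) is pulled up by the single 600 s outlier.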

We decided to group Monday, Tuesday, Thursday and Friday together, as we considered them typical working days; the 3 other days were processed separately. Moreover, to create the time intervals, we grouped the "rush hours" together (6h-9h and 17h-19h), separately from the other times of the day. We first ran a K-means clustering on the delay, the weekdays and the transport types in order to determine those intervals, but it did not give us clear time clusters, which is why we used the intervals above.

In [24]:
# We group the trips by their week day and time interval
delay_distribution = arrival_df.groupby(['trip_id', 'arrival_interval', 'weekday']).agg(functions.collect_list(functions.col('delay_arrival'))).alias('distri')
delay_distribution = delay_distribution.withColumn('distri', spark_helpers.delete_neg(functions.col('collect_list(delay_arrival)')))

# We compute the IQM / IQR and worst case
delay_distribution = delay_distribution.withColumn('IQM', spark_helpers.iqm(delay_distribution.distri))
delay_distribution = delay_distribution.withColumn('interquartile', spark_helpers.interquartile(delay_distribution.distri))
delay_distribution = delay_distribution.withColumn('worst_case', delay_distribution.IQM + delay_distribution.interquartile)

#delay_distribution_pd = delay_distribution.toPandas()

In [25]:
# Then we store it in order to easily make our predictions.
regenerate_distri_pickle = False
if regenerate_distri_pickle:
    delay_distribution_pd.to_pickle('./data/delay_distri.pickle')
else:
    #delay_distribution_pd = pd.read_pickle('./data/delay_distri.pickle')
    # if we want to use the predictions
    delay_distribution_pd = pd.read_pickle('./data/full_delay_distri.pickle')

Here we can see an example of worst-case delays, in seconds, for some trips:

In [26]:
delay_distribution_pd[["trip_id", "IQM", "interquartile", "worst_case"]].head()

| | trip_id | IQM | interquartile | worst_case |
|---|---|---|---|---|
| 0 | 85:11:18231:002 | 20.005776926104943 | 60.0 | 80.005777 |
| 1 | 85:11:18330:001 | 22.5531914893617 | 60.0 | 82.553191 |
| 2 | 85:11:19536:001 | 0.0 | 0.0 | 0.0 |
| 3 | 85:11:2684:001 | 3.404255319148936 | 60.0 | 63.404255 |
| 4 | 85:11:31432:003 | 0.0 | 0.0 | 0.0 |


## Shortest path algorithm

The baseline version of our shortest path algorithm does not take any schedule delay into account and assumes that every train or bus is always exactly on time. In a second version, we will take into account the uncertainties obtained with the predictive model mentioned above.

We want to compute the shortest path between two stops, given a source stop (in our case, it's always ZH HB), a destination and a departure or arrival date/time. To achieve this goal, we implemented an adaptation of Dijkstra's shortest path algorithm that takes into account the schedules, the connections and possible waiting times at stops. The algorithm uses the functions ```shortest_path``` and ```get_next_correspondance```.

In a nutshell, when the algorithm reaches a stop, it tries to update the shortest (=fastest in our case) path to all its neighbors, either by taking the next public transport according to the current time or by walking if it's faster. If there is a change of vehicle, we always take into account the waiting time at the station. Also note that we don't allow a connection if the traveller has less than one minute to change.

We won't go into the details of the algorithm here, as the code is duly documented and mostly consists of a standard implementation of Dijkstra's famous shortest path algorithm.
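
For intuition only, here is a minimal, self-contained sketch of a schedule-aware Dijkstra search that returns the earliest arrival time. It is not the code of `helpers.shortest_path`: the function name `earliest_arrival`, the toy timetable and the choice of searching over (stop, trip) states are assumptions. The one-minute minimum change time comes from the text.

```python
import heapq
from datetime import datetime, timedelta
from itertools import count

MIN_TRANSFER = timedelta(minutes=1)  # from the text: no change in under one minute

def earliest_arrival(network, walking, source, target, departure):
    """Earliest arrival time at target, or None if it cannot be reached.

    network: {a: {b: [(dep_time, arr_time, trip_id, line), ...]}}
    walking: {a: {b: walking_time_in_seconds}}
    """
    tie = count()  # tie-breaker so the heap never has to compare trip ids
    queue = [(departure, next(tie), source, None)]
    settled = set()  # (stop, trip used to get there) states already expanded
    while queue:
        now, _, stop, trip = heapq.heappop(queue)
        if stop == target:
            return now
        if (stop, trip) in settled:
            continue
        settled.add((stop, trip))
        for nxt, connections in network.get(stop, {}).items():
            for dep, arr, trip_id, _line in connections:
                # staying in the same vehicle, or starting the journey, needs no change time
                no_change = trip is None or trip_id == trip
                ready = now if no_change else now + MIN_TRANSFER
                if dep >= ready:
                    heapq.heappush(queue, (arr, next(tie), nxt, trip_id))
        for nxt, seconds in walking.get(stop, {}).items():
            heapq.heappush(queue, (now + timedelta(seconds=seconds), next(tie), nxt, "walk"))
    return None

# Toy timetable (made up): the 14:07 departure from stop 2 leaves too soon after the 14:07 arrival
day = datetime(2018, 1, 15)
toy_network = {
    1: {2: [(day.replace(hour=14, minute=1), day.replace(hour=14, minute=7), "t1", "IR")]},
    2: {3: [(day.replace(hour=14, minute=7), day.replace(hour=14, minute=12), "t2", "75"),
            (day.replace(hour=14, minute=10), day.replace(hour=14, minute=15), "t3", "75")]},
}
arrival = earliest_arrival(toy_network, {}, 1, 3, day.replace(hour=14))
```

In this toy case the traveller reaches stop 2 at 14:07, cannot board the 14:07 departure, and takes the 14:10 one instead. The sketch pushes every feasible connection onto the queue, which is simple but slower than looking up only the next departure per neighbor.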

For the case where we receive an arrival time instead of a departure time, we just modified the algorithm so that it finds a shortest path in a backward manner. This algorithm is implemented in the function ```shortest_path_reverse```.

Here is a simple example of how to call the algorithm and display a pretty-printed version of the resulting path:

In [27]:
# filter out the stations that are not reachable from Zürich HB on mondays
reachable_stations_ids = helpers.get_reachable_stations(models[0], walking_network, ZH_HB_ID)
reachable_stations = {sid: stations[sid] for sid in reachable_stations_ids}

In [28]:
sp = helpers.shortest_path(models, walking_network, reachable_stations,ZH_HB_ID, 8591085, datetime(2018, 1, 15, 14))
helpers.reduced_path_tostring(helpers.reduce_path(sp), reachable_stations)

line IR70 from Zürich HB to Zürich Oerlikon 14:01 -> 14:07(1 stops)
line walk from Zürich Oerlikon to Zürich Oerlikon, Bahnhof 14:07 -> 14:07(1 stops)
line 768 from Zürich Oerlikon, Bahnhof to Zürich, Seebach 14:08 -> 14:12(4 stops)
line walk from Zürich, Seebach to Zürich, Ausserdorfstrasse 14:12 -> 14:15(1 stops)
line 75 from Zürich, Ausserdorfstrasse to Zürich, Birch-/Glatttalstrasse 14:17 -> 14:18(1 stops)


And here is another example, this time giving the algorithm an arrival time instead of a departure time:

In [29]:
sp = helpers.shortest_path_reverse(models, walking_network, reachable_stations, ZH_HB_ID, 8590628, datetime(2018, 1, 15, 14))
helpers.reduced_path_tostring(helpers.reduce_path(sp), reachable_stations)

line S6 from Zürich HB to Zürich Hardbrücke 13:31 -> 13:33(1 stops)
line S9 from Zürich Hardbrücke to Glattbrugg 13:39 -> 13:47(2 stops)
line walk from Glattbrugg to Glattbrugg, Oberhusen 13:49 -> 14:00(3 stops)


The first example above shows a typical trip from the Zürich main train station to the stop Zürich, Birch-/Glatttalstrasse.
However, we can ask ourselves how reliable such a journey is. How likely are we to miss a change?

## Reliability of each connection and the whole trip

We can now compute the feasibility of the trip. We estimate the reliability of each change by computing the ratio between the scheduled time available to make the connection and the worst-case delay that can happen to the previous transport. This worst-case delay has been computed in the previous part for all trips.

Finally, we determine the global certainty score by multiplying the scores of all the changes of the trip. This score ranges from 0 (certain to miss a change) to 1 (certain to make every change).
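
The scoring described above could look like the following sketch. It is not the code of `helpers.routing_algo`: the text only specifies a ratio and a product, so capping the ratio at 1, returning 1 when no delay is expected, and the function names are all assumptions.

```python
from math import prod

def change_score(transfer_seconds, worst_case_delay):
    """Ratio of the scheduled transfer time to the worst-case delay, capped at 1 (assumed)."""
    if worst_case_delay <= 0:
        return 1.0  # no expected delay: the change is considered certain
    return min(1.0, transfer_seconds / worst_case_delay)

def trip_score(changes):
    """Product of the scores of all changes; changes is a list of
    (scheduled transfer time in s, worst-case delay of the incoming transport in s)."""
    return prod(change_score(t, d) for t, d in changes)

# Toy trip with two changes (made-up values):
# 120 s to change after a transport whose worst-case delay is 80 s -> score 1.0
# 60 s to change after a transport whose worst-case delay is 120 s -> score 0.5
toy_score = trip_score([(120, 80.0), (60, 120.0)])
```

Because the scores are multiplied, a single tight connection is enough to pull down the score of the whole trip.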

In [30]:
score = helpers.routing_algo(sp, delay_distribution_pd)
for i,s in enumerate(score[0].values()):
    print(" Change number {} : {}% of chance to take it with success".format(i+1, round(s*100)))

print("\n This trip is realisable {}% of the time ".format(round(score[1]*100)))

 Change number 1 : 100% of chance to take it with success

 This trip is realisable 100% of the time 


For the fastest route from Zürich HB to Zürich, Birch-/Glatttalstrasse leaving at 14h, the same computation gives a score of only 35%: the trip succeeds as planned about 35% of the time.

The problem is the first change. The line IR70 is too likely to be late, so after walking to the next stop the passenger manages to catch the line 768 only **35%** of the time.

We would therefore like to offer the possibility of choosing a safer trip, even if it is a bit slower, in order to reassure travellers who need to arrive on time at all costs.

## Safest trip

In order to do so, we take our network graph and delete the edge with the smallest certainty score. We then rerun the shortest path algorithm, and repeat until it finds a trip with a total score above the threshold (fixed here at 80%). We stop after 100 runs if no path with 80% certainty is found.

We then show the fastest trip, regardless of its feasibility, and up to 3 safest ones (fewer if a trip above the threshold is found earlier).
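
The loop described above can be sketched on a toy graph as follows. This is not the code of `helpers.safest_paths`: for brevity, the sketch uses a static graph in which every edge carries a travel time and a reliability, instead of timetables and per-change scores, and all names and values are hypothetical.

```python
import heapq
from math import prod

def fastest_path(graph, source, target):
    """Plain Dijkstra on {a: {b: (minutes, reliability)}}; returns a list of edges or None."""
    queue = [(0, source, [])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt, (minutes, _reliability) in graph.get(node, {}).items():
            if nxt not in seen:
                heapq.heappush(queue, (cost + minutes, nxt, path + [(node, nxt)]))
    return None

def safest_paths_sketch(graph, source, target, threshold=0.8, max_runs=100):
    """Repeatedly take the fastest path and drop its weakest edge until one is safe enough."""
    graph = {a: dict(neighbours) for a, neighbours in graph.items()}  # work on a copy
    results = []
    for _ in range(max_runs):
        path = fastest_path(graph, source, target)
        if path is None:
            break  # the target is no longer reachable
        score = prod(graph[a][b][1] for a, b in path)
        minutes = sum(graph[a][b][0] for a, b in path)
        results.append((path, score, minutes))
        if score >= threshold:
            break
        # delete the least reliable edge of the current path and search again
        a, b = min(path, key=lambda edge: graph[edge[0]][edge[1]][1])
        del graph[a][b]
    return results

# Toy graph (made up): the fast route through X starts with an unreliable edge
toy_graph = {
    "HB": {"X": (6, 0.35), "Y": (10, 1.0)},
    "X": {"T": (11, 1.0)},
    "Y": {"T": (14, 1.0)},
}
toy_results = safest_paths_sketch(toy_graph, "HB", "T")
```

On this toy graph the first run returns the 17-minute route with a score of 0.35, and the second run, after the weak edge has been removed, returns the 24-minute route with a score of 1.0.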

In [31]:
res = helpers.safest_paths(models, walking_network, reachable_stations, ZH_HB_ID, 8591085, datetime(2018, 3, 15, 14), delay_distribution_pd, threshold=0.8)

1 / 10
2 / 10


In [32]:
for i,t in enumerate(res[:2]):    
    print("Trip number {} has {}% of chances to succeed and lasts {} ".format(i+1, round(t[1][1] * 100), str(helpers.compute_path_time(t[0]))))

Trip number 1 has 35% of chances to succeed and lasts 0:17:00 
Trip number 2 has 100% of chances to succeed and lasts 0:24:00 


As we can see, the safest trip is not the fastest one, but it has a better chance of succeeding, so it could be preferred by a traveller who wants to make sure to be on time. Furthermore, the second trip is only 7 minutes longer and its chance of success is 65 percentage points higher than that of trip one, which is a drastic improvement.

## Validation method by visualization

We created a Bokeh visualization in order to display those two trips and see if they make sense. It shows the different lines the user has to take and the stations where a connection occurs. We can see that the trips are plausible and do not scatter in all directions.

We also compared the results of our algorithm to other established services for several paths (SBB, Google Maps) to ensure we provided acceptable/plausible paths. Of course, this is not very exhaustive but it gives a good impression of our method's performance and correctness.

Let's start by visualizing the fastest path:

In [33]:
helpers.plot_trip(pandas_df, res[0][0])

Now the safest path:

In [34]:
helpers.plot_trip(pandas_df, res[1][0])

## Isochronous Map

We also implemented a method that, given a certain amount of time, displays an isochrone map of each and every position that can be reached from a source position (in our case, Zürich HB) in this amount of time.

In [35]:
def compute_all_path(regenerate):
    """
    Generates the shortest paths from Zürich HB to all the reachable stations,
    or loads them from the pickle if regenerate is False
    """
    all_paths = []
    date = datetime(2018, 1, 15, 14)
    if regenerate:
        for destination in reachable_stations:
            if destination != ZH_HB_ID:
                all_paths.append(helpers.shortest_path(models, walking_network, reachable_stations, ZH_HB_ID, destination, date))
        # store the paths so that they do not have to be recomputed
        with open('./data/all_paths.pickle', 'wb') as handle:
            pickle.dump(all_paths, handle)
    else:
        with open('./data/all_paths.pickle', 'rb') as handle:
            all_paths = pickle.load(handle)
    return all_paths

# We can draw the isochrone map for 15 minutes
helpers.isoch(15, compute_all_path(False), pandas_df)

## Web Visualization

We also implemented a web interface that allows users to query our robust route planning algorithm. The server and front-end code can be found on the GitLab repository. One just needs to install Flask and the required dependencies to run it.

Here are a few screenshots that illustrate how it works:


![](images/webviz1.png)
![](images/webviz2.png)
![](images/webviz3.png)