# Final route planning algorithm

_Now that we have improved the initial algorithm by computing multiple paths and adding probability computation, we build the final algorithm._

_The user enters the departure time, departure location, final stop and a desired probability._

_We then compute the top 3 routes, each with its probability._

### 1. Some routes satisfy the desired probability
_If such routes exist, we output them directly._

### 2. No route satisfies the desired probability
_We report a failure, but still return the route closest to the user's requirements (the one with the highest probability)._
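
The two cases above can be sketched as a small selection step. This is only an illustration, assuming each candidate route is a tuple whose first two fields are the arrival time and the cumulative probability (the layout used later in this notebook). `select_routes` is a hypothetical helper name and the sample values are made up.

```python
def select_routes(candidates, desired_probability):
    """Split candidates into routes meeting the threshold and a best-effort fallback.

    Each candidate is assumed to be a tuple (arrival_time, probability, ...).
    Returns (successful_routes, fallback_route).
    """
    successful = [r for r in candidates if r[1] >= desired_probability]
    if successful:
        # Case 1: at least one route satisfies the desired probability
        return successful, None
    # Case 2: nothing satisfies it; keep the most reliable route, if there is one
    fallback = max(candidates, key=lambda r: r[1]) if candidates else None
    return [], fallback


candidates = [("12:24:00", 0.78), ("12:29:00", 0.92), ("12:31:00", 0.55)]
ok, fallback = select_routes(candidates, 0.7)
strict_ok, strict_fallback = select_routes(candidates, 0.95)
```

With a threshold of 0.7 the first two routes are returned; with 0.95 none qualifies and the most reliable route is kept as the fallback.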


In [1]:
%%configure
{"conf": {
    "spark.app.name": "group100_final"
}}


In [2]:
username = 'mjouve'

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
9311,application_1589299642358_3888,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


In [3]:
from pyspark.sql.functions import udf
import pyspark.sql.functions as F
from datetime import time, datetime, timedelta
from collections import defaultdict
import numpy as np

# Loading previously obtained dataframes

In [4]:
stops = spark.read.orc("/user/{}/zurich_stops.orc".format(username))

In [5]:
reachable_pair_grouped = spark.read.orc("/user/{}/reachable_pair_grouped.orc".format(username))

In [6]:
stop_times = spark.read.orc("/user/{}/stop_times_filtered.orc".format(username))

In [7]:
connexions = spark.read.orc("/user/{}/connexions.orc".format(username))

# Helper methods

In [8]:
def compute_footpaths_dict(reachable_pair_df):
    """
    Given a pyspark Dataframe of reachable pairs grouped,
    returns the footpaths dictionary used by our algorithm
    """
    return dict(((row.id_1, row.destinations) for row in reachable_pair_df.collect()))
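
To make the resulting structure concrete, here is a small sketch that builds the same kind of dictionary from plain Python rows instead of a Spark DataFrame. The `Row` namedtuple, the stop ids and the durations are made up for illustration. The only assumption taken from the notebook is that each row has an `id_1` field and a `destinations` list of `(stop_id, walking_seconds)` pairs.

```python
from collections import namedtuple

# Hypothetical stand-in for the rows returned by reachable_pair_df.collect()
Row = namedtuple("Row", ["id_1", "destinations"])

rows = [
    Row("A", [("B", "72.0"), ("C", "310.5")]),
    Row("B", [("A", "72.0")]),
]

# Same construction as compute_footpaths_dict, on local data
footpaths_example = dict((row.id_1, row.destinations) for row in rows)

# Stops reachable on foot from "A", with the duration converted to float
walks_from_a = [(stop, float(seconds)) for stop, seconds in footpaths_example.get("A", [])]

# .get returns None for a stop without any footpath
no_walks = footpaths_example.get("Z")
```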

In [10]:
def to_datetime(str_time):
    """
    Given a string representing a time (format 'H:M:s', H: hour, M: minute, s:second), convert it to a datetime object
    """
    hour, minute, second = str_time.split(':')
    
    # convert it to int and remove potential errors by taking a modulo
    hour = int(hour) % 24
    minute = int(minute) % 60
    second = int(second) % 60
    
    # the year, month and day are dummy values here
    return datetime(year=2020, month=1, day=1, hour=hour, minute=minute, second=second)
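
As a quick illustration of the modulo step, the sketch below repeats the helper so that it runs on its own and converts two sample strings. One consequence worth noting: if the timetable contains hours of 24 or more, they wrap around and therefore sort before daytime times on the same dummy date.

```python
from datetime import datetime


def to_datetime(str_time):
    # Copy of the helper above so that this snippet is self-contained
    hour, minute, second = str_time.split(':')
    return datetime(year=2020, month=1, day=1,
                    hour=int(hour) % 24,
                    minute=int(minute) % 60,
                    second=int(second) % 60)


noon = to_datetime('12:05:00')
after_midnight = to_datetime('25:10:00')  # hour 25 wraps around to 01

# The wrapped time compares as earlier than 12:05:00
wraps_before_noon = after_midnight < noon
```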

In [11]:
def sort_connexions(connexions_df, departure = True):
    """
    Given a pyspark DataFrame of connexions, returns an array of sorted connexions in ascending order of departure
    if departure = True, else in descending order of arrival
    """

    connexions_array = [{'departure_location': row.stop_id_1, 
                         'departure_time': to_datetime(row.departure_time_1), 
                         'arrival_location': row.stop_id_2, 
                         'arrival_time': to_datetime(row.arrival_time_2), 
                         'trip_id': row.trip_id} for row in connexions_df.collect()]
    
    if departure:
        sorted_connexions = sorted(connexions_array, key = (lambda tup: tup['departure_time']))
    
    else:
        sorted_connexions = sorted(connexions_array, key = (lambda tup: tup['arrival_time']), reverse = True)
        
    return sorted_connexions
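
The two orderings can be checked on a few hand-written connexions. The sketch below uses the same dictionary keys as above but made-up trips, and skips the Spark `collect()` step.

```python
from datetime import datetime

# Hypothetical connexions in the dictionary layout produced by sort_connexions
connexions_example = [
    {'trip_id': 't2',
     'departure_time': datetime(2020, 1, 1, 12, 7),
     'arrival_time': datetime(2020, 1, 1, 12, 17)},
    {'trip_id': 't1',
     'departure_time': datetime(2020, 1, 1, 12, 5),
     'arrival_time': datetime(2020, 1, 1, 12, 11)},
    {'trip_id': 't3',
     'departure_time': datetime(2020, 1, 1, 12, 6),
     'arrival_time': datetime(2020, 1, 1, 12, 30)},
]

# departure = True: ascending order of departure time
by_departure = sorted(connexions_example, key=lambda c: c['departure_time'])

# departure = False: descending order of arrival time
by_arrival_desc = sorted(connexions_example, key=lambda c: c['arrival_time'], reverse=True)

departure_order = [c['trip_id'] for c in by_departure]
arrival_order = [c['trip_id'] for c in by_arrival_desc]
```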

# Compute probability given lambda and a time left

In [12]:
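# for lambdaa we pass the arrival_delay median
def proba_trip(lambdaa, time_left): #time left in seconds
    """
    Given the lambda of an exponential delay distribution and the time left (in seconds) to catch a connexion,
    returns the probability 1 - exp(-lambda * time_left) that the delay does not exceed the time left
    """
    
    # This is the average lambda over all entries; we use it as a default value
    default_value = 0.023447352748076224
    
    # -9999 is treated as a missing lambda: fall back to the default value
    if lambdaa == -9999:
        lambdaa = default_value
    
    # any other negative lambda is treated as a connexion that is always caught
    if lambdaa < 0:
        return 1
    else:
        return 1 - np.exp(- lambdaa * time_left)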
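
The expression `1 - exp(-lambda * t)` is the cumulative distribution function of an exponential distribution, so it gives the probability that the delay is at most `t` seconds. The sketch below works through one transfer with illustrative numbers: arrival at 12:11:00, a 72 s walk, the 120 s transfer delay used in this notebook, and a departure at 12:15:00. `catch_probability` is a hypothetical stand-in for `proba_trip`, and the default lambda is used because the stop-specific value is not known here. Since `timedelta.seconds` is never negative, the sketch uses `total_seconds()`, which keeps the sign of the slack.

```python
from datetime import datetime, timedelta
import math

DEFAULT_LAMBDA = 0.023447352748076224  # default value used in this notebook


def catch_probability(lam, time_left_seconds):
    # Exponential CDF: probability that the delay is at most time_left_seconds
    return 1 - math.exp(-lam * time_left_seconds)


arrival = datetime(2020, 1, 1, 12, 11, 0)
departure = datetime(2020, 1, 1, 12, 15, 0)
walking = timedelta(seconds=72)
transfer = timedelta(seconds=120)

# 240 s between arrival and departure, minus 120 s transfer and 72 s walking
time_left = (departure - arrival - transfer - walking).total_seconds()
p = catch_probability(DEFAULT_LAMBDA, time_left)

# A negative timedelta keeps its sign with total_seconds() but not with .seconds
negative_slack = timedelta(seconds=-10)
```

Here `time_left` is 48 seconds, and with the default lambda the probability comes out at roughly two thirds.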

# Main algorithms

## Updating arrival times

In [13]:
def updates_times_dict_given_departure_top_K_with_proba(times, sorted_connexions, lambda_dict, footpaths, departure_location, departure_time, final_location, K, desired_probability, max_time):
    """
    Given an initialized times dictionary, the array of sorted connexions, the lambda dictionary,
    the footpaths dictionary,
    a departure location given as a stop_id (str),
    a departure time (datetime object),
    the final destination,
    the number of routes to keep (K),
    the desired confidence probability and
    a maximum arrival time not to exceed,
    updates the times dictionary in place.
    """
    
    # initialize the departure
    times[departure_location][0] = (departure_time, 1, None, None)
    
    # the transfer ("cheminement") delay is 0 for the first connexion taken
    cheminement_delay = timedelta(seconds = 0)


    # Initialize a dictionary of trips taken. Each trip already taken is mapped
    # to the first departure location and departure time at which we could have boarded it.
    # Thanks to defaultdict, a trip that has not been taken yet maps to None.
    trips_taken = defaultdict(lambda: None)

    # Iterate over connexions in sorted order
    for c in sorted_connexions:
    
        # trip_id of the current connexion
        trip_id = c['trip_id']
    
        # departure location of the current connexion
        departure_location = c['departure_location']
    
        # departure time of the current connexion
        departure_time = c['departure_time']
    
        # arrival location of the current connexion
        arrival_location = c['arrival_location']
    
        # arrival time of the current connexion
        arrival_time = c['arrival_time']
    
        # If current trip could have been taken earlier
        if trips_taken[trip_id]:
            
            
            # obtain data about this current trip (where we could have taken it and when)
            trip_data = trips_taken[trip_id]
        
            # obtain array
            arrival_array = times[arrival_location]
            
            # access old proba
            if times[trip_data[0]][0][2] is None:
                old_proba = 1
                new_proba = 1
            else:
                prev_connex_data = times[trip_data[0]][0][2]                   
                old_proba = times[trip_data[0]][0][1]
                prev_arrival_time = prev_connex_data['arrival_time']
                
                
                if prev_connex_data.get('walking', None):
                    prev_arrival_time = prev_arrival_time - timedelta(seconds = prev_connex_data['walking'])
                    prev_arrival = prev_connex_data['departure_location']
                else:
                    prev_arrival = prev_connex_data['arrival_location']
            
                # compute the lambda of the exponential distribution given the arrival location and arrival time
                lambdaa = lambda_dict[(prev_arrival, str(prev_arrival_time.time()))]
                
                # take into consideration the walking time
                walking_time = prev_connex_data.get('walking', 0)
                
                # compute probability to catch the connexion
                new_proba = proba_trip(lambdaa, (trip_data[1] - prev_arrival_time - timedelta(seconds = 120) -  timedelta(seconds = int(walking_time))).seconds)
                
    
            # if it is the final location we store multiple paths
            if arrival_location == final_location and len(arrival_array) < K:
                
                # if we satisfy the confidence
                if new_proba >= desired_probability:
                    
                    # if we respect the maximal time
                    if arrival_time <= max_time:
                
                        arrival_array.append((arrival_time, old_proba * new_proba, {'departure_location': trip_data[0],
                                                              'departure_time': trip_data[1],
                                                              'arrival_location':arrival_location,
                                                              'arrival_time': arrival_time,
                                                              'trip_id': trip_id}, new_proba))
                
                        arrival_array.sort(key = (lambda tup: tup[0]))
            
            # otherwise update the entry
            elif arrival_time < times[arrival_location][-1][0]:
                
                if new_proba >= desired_probability:
                    
                    if arrival_time <= max_time:
                        # update arrival time as well as connexion data for this arrival location
                        times[arrival_location][-1] = (arrival_time, old_proba * new_proba, {'departure_location': trip_data[0],
                                                              'departure_time': trip_data[1],
                                                              'arrival_location':arrival_location,
                                                              'arrival_time': arrival_time,
                                                              'trip_id': trip_id}, new_proba)
                
                        arrival_array.sort(key = (lambda tup: tup[0]))
            
            # obtain the stops reachable by walking
            reachable_stops_walking = footpaths.get(arrival_location, None)
            
            
            if reachable_stops_walking:
                
                # for each possible destination
                for destination in reachable_stops_walking:
                    
                    # obtain the stop_id
                    location = destination[0]
                    
                    # obtain the walk duration from arrival_location (convert it to float)
                    walking_time = float(destination[1])
                    
                    # compute the new arrival time if using this path
                    new_arrival_time = arrival_time + timedelta(seconds = walking_time)
                    
                    # obtain the current arrival time
                    curr_arrival_time_array = times[location]
                      
                    if location == final_location and len(curr_arrival_time_array) < K:
                        
                        if new_proba >= desired_probability:
                        
                            if new_arrival_time <= max_time:
                                curr_arrival_time_array.append((new_arrival_time, old_proba * new_proba, {'departure_location': arrival_location,
                                                              'departure_time': arrival_time,
                                                              'arrival_location':location,
                                                              'arrival_time': new_arrival_time,
                                                              'trip_id': trip_id,
                                                              'walking': walking_time}, new_proba))
                        
                                curr_arrival_time_array.sort(key = (lambda tup: tup[0]))
                    
                    
                    # if it improves the current best arrival time, we update our dictionary
                    elif new_arrival_time < curr_arrival_time_array[-1][0]:
                        
                        
                        if new_proba >= desired_probability:
                            if new_arrival_time <= max_time:
                                curr_arrival_time_array[-1] = (new_arrival_time, old_proba * new_proba, {'departure_location': arrival_location,
                                                              'departure_time': arrival_time,
                                                              'arrival_location':location,
                                                              'arrival_time': new_arrival_time,
                                                              'trip_id': trip_id,
                                                              'walking': walking_time}, new_proba)
                        
                                curr_arrival_time_array.sort(key = (lambda tup: tup[0]))
    
        # if we can take this connexion
        elif (times[departure_location][0][0] + cheminement_delay) <= departure_time:
            
            # delay is now 2 minutes
            cheminement_delay = timedelta(seconds = 120)

            # update trips taken with this new trip
            trips_taken[trip_id] = (departure_location, departure_time)
        
            # get arrival array
            arrival_location_array = times[arrival_location]
            
            # access old proba
            if times[departure_location][0][2] is None:
                old_proba = 1
                new_proba = 1
            else:
                prev_connex_data = times[departure_location][0][2]
                    
                old_proba = times[departure_location][0][1]
                prev_arrival_time = prev_connex_data['arrival_time']
                    
                if prev_connex_data.get('walking', None):
                    prev_arrival_time = prev_arrival_time - timedelta(seconds = prev_connex_data['walking'])
                    prev_arrival = prev_connex_data['departure_location']
                else:
                    prev_arrival = prev_connex_data['arrival_location']
                    
                time_str = str(prev_arrival_time.time())
                
                # getting the lambda for the exponential distribution
                lambdaa = lambda_dict[(prev_arrival, time_str)]
                
                # take into consideration walking time
                walking_time = prev_connex_data.get('walking', 0)
                
                # calculate proba to get this connexion
                new_proba = proba_trip(lambdaa, (departure_time - prev_arrival_time - timedelta(seconds = 120) - timedelta(seconds = int(walking_time))).seconds)
            
            # if it is the final location, we store more than one path
            if arrival_location == final_location and len(arrival_location_array) < K:
                
                    if new_proba >= desired_probability:
                        if arrival_time <= max_time:
                            arrival_location_array.append((arrival_time, old_proba * new_proba, {'departure_location': departure_location,
                                                              'departure_time': departure_time,
                                                              'arrival_location':arrival_location,
                                                              'arrival_time': arrival_time,
                                                              'trip_id': trip_id}, new_proba)  )
                
                            arrival_location_array.sort(key=(lambda tup: tup[0]))
                
            # if the arrival time is better than the current best
            elif arrival_time < times[arrival_location][-1][0]:
                
                if new_proba >= desired_probability:
            
                    # update the best time for the arrival location
                    if arrival_time <= max_time:
                        arrival_location_array[-1] = (arrival_time, old_proba * new_proba, c, new_proba)  
                
                        arrival_location_array.sort(key=(lambda tup: tup[0]))
            
            # obtain the stops reachable by walking
            reachable_stops_walking = footpaths.get(arrival_location, None) 
            
            if reachable_stops_walking:
                
                # for each possible destination
                for destination in reachable_stops_walking:
                    
                    # obtain the stop_id
                    location = destination[0]
                    
                    # obtain the walk duration from arrival_location (convert it to float)
                    walking_time = float(destination[1])
                    
                    # compute the new arrival time if using this path
                    new_arrival_time = arrival_time + timedelta(seconds = walking_time)
                    
                    # obtain the current arrival time
                    curr_arrival_time_array = times[location]
                      
                    if location == final_location and len(curr_arrival_time_array) < K:
                        
                        if new_proba >= desired_probability:
                            
                            if new_arrival_time <= max_time:
                        
                                curr_arrival_time_array.append((new_arrival_time, old_proba * new_proba, {'departure_location': arrival_location,
                                                              'departure_time': arrival_time,
                                                              'arrival_location':location,
                                                              'arrival_time': new_arrival_time,
                                                              'trip_id': trip_id,
                                                              'walking': walking_time}, new_proba))
                        
                                curr_arrival_time_array.sort(key = (lambda tup: tup[0]))
                    
                    
                    # if it improves the current best arrival time, we update our dictionary
                    elif new_arrival_time < curr_arrival_time_array[-1][0]:
                        
                        if new_proba >= desired_probability:
                            
                            if new_arrival_time <= max_time:
                                curr_arrival_time_array[-1] = (new_arrival_time, old_proba * new_proba, {'departure_location': arrival_location,
                                                              'departure_time': arrival_time,
                                                              'arrival_location':location,
                                                              'arrival_time': new_arrival_time,
                                                              'trip_id': trip_id,
                                                              'walking': walking_time}, new_proba)
                        
                                curr_arrival_time_array.sort(key = (lambda tup: tup[0]))
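
The rule used above to maintain the per-stop arrays can be isolated in a few lines. This is a simplified sketch: it keeps only the append-or-replace-then-sort logic and leaves out the probability and maximum-time checks. `update_top_k` is a hypothetical helper name, and the sentinel date mirrors the "not reached yet" value used when the times dictionary is initialized.

```python
from datetime import datetime

SENTINEL = datetime(2020, 1, 6, 23, 59, 59)  # "not reached yet" marker


def update_top_k(entries, candidate, K):
    """Keep at most K (arrival_time, probability) entries, sorted by arrival time."""
    if len(entries) < K:
        # still room: store the candidate as an additional path
        entries.append(candidate)
    elif candidate[0] < entries[-1][0]:
        # the list is sorted, so the last entry is the latest arrival: replace it
        entries[-1] = candidate
    else:
        return entries
    entries.sort(key=lambda tup: tup[0])
    return entries


entries = [(SENTINEL, 1)]
for minute, proba in [(29, 0.92), (24, 0.78), (40, 0.99), (27, 0.81)]:
    update_top_k(entries, (datetime(2020, 1, 1, 12, minute), proba), K=3)

arrival_minutes = [e[0].minute for e in entries]
```

After the four candidates, the sentinel and the 12:40 arrival have both been pushed out, leaving the three earliest arrivals in order.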

## Find possible routes

In [26]:
def find_routes(connexions, stops_array, lambda_dict, footpaths, departure_stop, departure_time, arrival_stop, desired_probability, max_time = '23:30:00', K = 3):
    """
    Given the sorted connexions, the stops, the lambda dictionary and the footpaths dictionary,
    prints up to K routes from departure_stop to arrival_stop that leave at or after departure_time ('H:M:s'),
    arrive no later than max_time ('H:M:s') and satisfy desired_probability.
    If no route satisfies it, prints the route with the highest probability instead.
    """
    
    # convert the time strings to datetime objects (same dummy date as the connexions)
    departure_time_datetime = to_datetime(departure_time)
    max_time_datetime = to_datetime(max_time)
    
    # initialize every stop with a sentinel arrival time meaning "not reached yet"
    times = dict(((row.stop_id, 
               [(datetime(year=2020, month=1, day=6, hour=23, minute=59, second = 59), 1, None, None)]) 
              for row in stops_array))
        
    # run the main algorithm, which fills the times dictionary
    updates_times_dict_given_departure_top_K_with_proba(times,
                                                        connexions,
                                                        lambda_dict,
                                                        footpaths, 
                                                        departure_stop, 
                                                        departure_time_datetime, 
                                                        arrival_stop, 
                                                        K, 
                                                        desired_probability,
                                                        max_time_datetime)
        
    successful = []
    not_enough = []
        
    for output in times[arrival_stop]:
        if output[0] != datetime(year=2020, month=1, day=6, hour=23, minute=59, second = 59):

            if output[1] >= desired_probability:
                successful.append(output)
            else:
                not_enough.append(output)
        
    if len(successful) == 0 and len(not_enough) == 0:
        print('Failure to find such a path')
        return
            
    if len(successful) > 0:
        print('Successful - printing routes...\n')
            
        for i, connexion_data in enumerate(successful):
            paths = print_route(times, connexion_data, departure_stop)
            print('Route {nb}:'.format(nb = i+1))
            for path in paths:
                print(path)
            print('\n')
        return
            
    else:
        max_prob = -1
        max_route = None
        
        for path in not_enough:
            prob = path[1]
            
            if prob > max_prob:
                max_prob = prob
                max_route = path
            
            
        print('Failure to find such a path')
        print('However the route closest to your requirements is:')
        paths = print_route(times, max_route, departure_stop)
        for path in paths:
            print(path)
        print('\n')
        return

## Given a possible route, print it

In [20]:
def print_route(times, last_connexion, departure_stop):
    paths = []
    
    current_stop_data = last_connexion
    current_stop = None
    
    while current_stop != departure_stop:
        
        current_connexion = current_stop_data[2]
        
        proba = current_stop_data[1]
        
        current_stop = current_connexion['departure_location']
        arrival_location = current_connexion['arrival_location']
        trip = current_connexion['trip_id']
        
        walking = current_connexion.get('walking', None)
        
        if walking:
            path = 'Walking during {s}s'.format(s = int(walking)) + ' from {d} to {a}'.format(d = current_stop, a = arrival_location)
        
        else:
            path = 'From {d_l} (at {d_t}) to {a_l} (at {a_t}) using trip: {t}. Current probability = {p}'.format(d_l = current_stop,
                                                                                      d_t = current_connexion['departure_time'].time(),
                                                                                      a_l = arrival_location,
                                                                                      a_t = current_connexion['arrival_time'].time(),
                                                                                      t = current_connexion['trip_id'],
                                                                                      p = proba)
        
        paths.append(path)
        
        
        current_stop_data = times[current_stop][0]
    
    return paths[::-1] 
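
The backtracking idea behind this function can be shown on a toy times dictionary: starting from the entry stored for the destination, follow each stored connexion to its departure stop until the original departure stop is reached, then reverse the list. `backtrack` is a hypothetical, stripped-down version that returns stop pairs instead of formatted strings, and the stops and times are made up.

```python
def backtrack(times, last_entry, departure_stop):
    """Follow the stored connexions backwards from the destination to the departure stop."""
    legs = []
    entry = last_entry
    stop = None
    while stop != departure_stop:
        connexion = entry[2]
        stop = connexion['departure_location']
        legs.append((stop, connexion['arrival_location']))
        # continue from the best known entry of the stop we came from
        entry = times[stop][0]
    # legs were collected from the destination backwards
    return legs[::-1]


# Toy data: A -> B by vehicle, then B -> C on foot
times_example = {
    'A': [('12:05', 1, None, None)],
    'B': [('12:11', 1, {'departure_location': 'A', 'arrival_location': 'B'}, 1)],
    'C': [('12:13', 1, {'departure_location': 'B', 'arrival_location': 'C', 'walking': 72.0}, 1)],
}

legs = backtrack(times_example, times_example['C'][0], 'A')
```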

# Running the algorithm

In [21]:
footpaths = compute_footpaths_dict(reachable_pair_grouped)

In [22]:
sorted_connexions = sort_connexions(connexions)

In [23]:
stops_array = stops.select(stops.stop_id).collect()

In [24]:
predictive_data = spark.read.orc("/user/{}/grouped_delay_lambdas.orc".format(username))

In [25]:
distribution_data = predictive_data.select(predictive_data.stop_id,
                                           predictive_data.arrival_time,
                                           predictive_data['lambda'])

### Defaultdict mapping each (arrival_stop, arrival_time) pair to the lambda computed by our predictive model; pairs that are missing return the default value (the average lambda)

In [27]:
# This is the average lambda for all possible entries, and we will use it as a default value
default_value = 0.023447352748076224

lambda_dict = defaultdict(lambda: default_value)

for row in distribution_data.collect():
    lambda_dict[(row.stop_id, row.arrival_time)] = row['lambda']
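
The fallback behaviour can be checked without Spark. In the sketch below the stored entry is made up, while the default value is the one used above. Keys are assumed to be `(stop_id, 'HH:MM:SS')` pairs, which matches the `str(prev_arrival_time.time())` lookups in the main algorithm.

```python
from collections import defaultdict

default_lambda = 0.023447352748076224  # average lambda used as default above

lambda_example = defaultdict(lambda: default_lambda)

# Hypothetical entry for one (stop_id, arrival_time) pair
lambda_example[('8503006', '12:11:00')] = 0.05

known = lambda_example[('8503006', '12:11:00')]
missing = lambda_example[('8503006', '03:00:00')]  # not stored: returns the default

# Looking up a missing key with [] also inserts it into the dictionary
size_after_lookup = len(lambda_example)
```

Because every missed lookup adds a key, the dictionary grows as the algorithm queries pairs the model has not seen. This is harmless here but worth knowing if the dictionary is later inspected or saved.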

In [28]:
departure_stop = '8503000'
departure_time = '12:05:00'
arrival_stop = '8591049'

# if we give a maximum time
max_time = '12:30:00'
K = 3

desired_proba = 0.7

In [29]:
find_routes(sorted_connexions,
            stops_array,
            lambda_dict, 
            footpaths, 
            departure_stop, departure_time, arrival_stop, desired_proba, max_time)

Successful - printing routes...

Route 1:
From 8503000 (at 12:05:00) to 8503006 (at 12:11:00) using trip: 32.TA.80-159-Y-j19-1.8.H. Current probability = 1
Walking during 72s from 8503006 to 8580449
From 8580449 (at 12:15:00) to 8591049 (at 12:24:00) using trip: 1914.TA.26-11-A-j19-1.27.R. Current probability = 0.783457744058


Route 2:
From 8503000 (at 12:07:00) to 8503310 (at 12:17:00) using trip: 20.TA.26-9-A-j19-1.2.H. Current probability = 1
Walking during 70s from 8503310 to 8590620
From 8590620 (at 12:23:00) to 8591049 (at 12:29:00) using trip: 168.TA.26-12-A-j19-1.2.H. Current probability = 0.915466308093

In [30]:
# if we don't give a max_time
find_routes(sorted_connexions,
            stops_array,
            lambda_dict, 
            footpaths, 
            departure_stop, departure_time, arrival_stop, desired_proba)

Successful - printing routes...

Route 1:
From 8503000 (at 12:05:00) to 8503006 (at 12:11:00) using trip: 32.TA.80-159-Y-j19-1.8.H. Current probability = 1
Walking during 72s from 8503006 to 8580449
From 8580449 (at 12:15:00) to 8591049 (at 12:24:00) using trip: 1914.TA.26-11-A-j19-1.27.R. Current probability = 0.783457744058


Route 2:
From 8503000 (at 12:07:00) to 8503310 (at 12:17:00) using trip: 20.TA.26-9-A-j19-1.2.H. Current probability = 1
Walking during 70s from 8503310 to 8590620
From 8590620 (at 12:23:00) to 8591049 (at 12:29:00) using trip: 168.TA.26-12-A-j19-1.2.H. Current probability = 0.915466308093


Route 3:
From 8503000 (at 12:05:00) to 8503006 (at 12:11:00) using trip: 32.TA.80-159-Y-j19-1.8.H. Current probability = 1
Walking during 72s from 8503006 to 8580449
From 8580449 (at 12:16:00) to 8591225 (at 12:21:00) using trip: 660.TA.26-787-j19-1.4.R. Current probability = 0.968013186502
Walking during 556s from 8591225 to 8591049