# Route Calculation in less then two hours using GPU 

## Requirements:

To get this notebook to work on your own pc you will need a the following:

- Nvidia GPU with more than 8GB VRAM
- RAPIDS installed (from rapids.ai, version 1.14 or higher)
- OSMNX installed
- NetworkX installed

In [None]:
!nvidia-smi

In [None]:
!nvcc --version

In [None]:
!pip install geopandas==0.8.1

In [None]:
!pip install osmnx

In [None]:
import multiprocessing as mp
import numpy as np
import networkx as nx
import osmnx as ox
import requests
import matplotlib
import matplotlib.cm as cm
import matplotlib.colors as colors
%matplotlib inline
ox.config(use_cache=True, log_console=True,timeout=1000)
ox.__version__

In [None]:
from datetime import datetime

In [None]:
import sys
!cp ../input/rapids/rapids.0.19.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path 
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/

In [None]:
#import nvstrings
import numpy as np
import pandas as pd
import cudf, cuml
import dask_cudf
import io, requests
import math
import gc

#Plotting
import matplotlib.pyplot as plt
import seaborn as sns 

#Learning
from cuml.preprocessing.model_selection import train_test_split
from cuml.linear_model import LinearRegression
from scipy.stats import uniform

import cuspatial
import cugraph

from cuml.solvers import SGD as cumlSGD
from cuml.linear_model import LogisticRegression
from cuml.ensemble import RandomForestRegressor as cuRF
from cuml.neighbors import KNeighborsRegressor
from cuml import ForestInference
from cuml import Ridge
from cuml import Lasso
from cuml import ElasticNet
from cuml.solvers import CD
from cuml.svm import SVR

import xgboost as xgb
from cuml.svm import SVC

import pandas as pd

#import dask_ml.model_selection as dcv
#from dask.distributed import Client, wait
#from dask_cuda import LocalCUDACluster

from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from cuml.metrics.regression import r2_score
from cuml.metrics.regression import mean_squared_error
pd.__version__ #1.1.4

In [None]:
import rmm
rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())

In [None]:
cudf.set_allocator(allocator="managed")
#cudf.set_allocator(pool=True)
#cudf.set_allocator("managed")

# Intro

In this notebook I will show you how I used GPU acceleration to calculate the distance between pickup and dropoff by calculating the shortest route between those points using OpenStreetMap Data and a GPU.

But before I start, let me show you why this approach can be a valuable improvement to your notebook.

Let's say you wanted to meet a friend in a park close to the east river. For some unknown reason you forgot the exact location and go to your usual spot, the Marsha P. Johnson State Park in Brooklyn. After you waited a few minutes, you get a call from your friend, how is waiting for you at the East River Park Field 8 in Manhattan. You apologize for your mistake and take a taxi to get to your friend. If you look at the position of both the pickup point (Marsha P. Johnson State Park) and the dropoff point (East River Park Field 8), you will see that there is no direct connection between those point. To get to the dropoff point you have to cross the east river. The nearest possible crossing would be the Williamsburg Bridge or the Queens Midtown Tunnel. Either way, the shortest route the taxi can take will definitely exceed twice the harversine distance between those points.

This is just one possible example in which the haversine distance between dropoff and pickup differs from the shortest driving distance by more than just a small margin. Since the driving distance and time between pickup and dropoff will determine the taxi fare, this example would most likely be an outlier in a scatter plot for correlation between haversine distance and fare amount. Even with a good machine learning algorithm you will have difficulties to predict the real fare amount for such a ride. 

So what can you do? The simplest answer: Ask google maps. If you take a minute and do some research you will see that there actually is a API for google maps, that can be used to calculate such distances. The main problem in this case is, that google will charge you some money for using the API. In our example with somewhere around 55 Million documented taxi rides, that could amount to a little bit of money that you would have to pay. Since I'm still a student at a university with no regular income, and this is just a project that is supposed to be a replacement for an exam, I'm not willing to afford that. 

The second alternative you may find, is an API from a website called openrouteservice. You can use that API to get route calculations, but in the free version you are currently limited to 2000 request per day. Even if the same pickup-dropoff-pairs occur multiple times in the given dataset, you would still have so many different pairs, that you would have to come up with some ideas to reduce the number of needed requests in order to do the task in a reasonable time. Even if you can come up with enough ideas to make that work, the calculated shortest distance will still be of by a considerable margin from the real shortest driving distance. For me, that ain't worth my time. 

After all the research I did to find a possible solution, I wondered, if there may be a way to get access to the street map and calculate the shortest route on my own. Luckily I found someone who came up with the same idea and a possible solution. He published that solution as a notebook on kaggle.com. 

Notebook:
https://www.kaggle.com/usui113yst/basic-network-analysis-tutorial

This notebook uses free accessible data from OpenStreetMap and, after some other analysis, gives an example on how to calculate the routes using NetworkX. Before I proceed, let us load the street map from New York City.


In [None]:
G = ox.graph_from_place('New York City, New York, USA',simplify=True,truncate_by_edge=True,custom_filter='["area"!~"yes"]["highway"~"motorway|trunk|primary|secondary|tertiary|residential|motorway_link|trunk_link|primary_link|secondary_link|tertiary_link"]["highway"!~"service"]')
fig, ax = ox.plot_graph(G,figsize=(50,50))

In [None]:
print(nx.info(G))

As you can see now, the street map gets visualized as a graph with nodes and edges. That visualization also leads us to the answer on how to calculate the routes. Since the street map can also be seen as a graph or network, we can use known algorithms to calculate the shortest distance by using the real distance between the nodes as edgecosts. There are several possible implementations for such a calculation like the single source shortest path implementation in NetworkX.

The problem in using osmnx and NetworkX is the lack of parallelization. For the number of routes the route calculation would take several days. But for a start lets look at how we can use the approach implemented in NetworkX to solve our problem and to see what steps we need to take to get there.

The first step is already done, we have a graph of New York City. 

As a second step we need to get the IDs of the node or vertex closest to dropoff and pickup. NetworkX doesn't provide a function for that, but OSMNX does. 

- pickup point at Marsha P. Johnson State Park = 40.721153, -73.961206
- dropoff point at East River Park Field 8 = 40.723742, -73.972854



In [None]:
#https://osmnx.readthedocs.io/en/stable/osmnx.html#osmnx.distance.get_nearest_nodes
X = [-73.961206,-73.972854]
Y = [40.721153,40.723742]
ids = ox.distance.get_nearest_nodes(G,X,Y)

In [None]:
def get_distance(A_lat,A_long,B_lat,B_long): #simplified distance function
        long_dist = abs(A_long-B_long)
        lat_dist = abs(A_lat-B_lat)
        distance = math.sqrt(math.pow((111000*lat_dist),2)+math.pow((long_dist*111000*math.cos(A_lat)),2))
        return distance
distance_between_pickup_and_dropoff = get_distance(40.721153, -73.961206,40.723742, -73.972854)
print("Distance between pickup and Dropoff = ",distance_between_pickup_and_dropoff)

As a third step, we can calculate the shortest path between those points. For this, we can use NetworkX shortest_path and plot that route with osmnx

In [None]:
#https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.shortest_paths.generic.shortest_path.html#networkx.algorithms.shortest_paths.generic.shortest_path
path = nx.shortest_path(G,ids[0],ids[1],weight = 'length')
path_length = nx.shortest_path_length(G,ids[0],ids[1],weight = 'length')
print("Shortest path length = ",path_length)
ox.plot_graph_route(G,path,figsize=(20,20), bbox=(40.76,40.70,-73.95,-73.99))


In [None]:
print("Shortest Route is {} times longer than the haversine distance between pickup and dropoff!".format(path_length/distance_between_pickup_and_dropoff))

As you can see in this Plot, the shortest route is more than four times longer as the haversine distance between the pickup and dropoff point. Since OSMNX allows to add the travel time to each edge, we can calculate the fastest route as well and see how it differs from the shortest route.


In [None]:
G = ox.add_edge_speeds(G, precision = 3)
G = ox.add_edge_travel_times(G,precision = 3)

fastest_path = nx.shortest_path(G,ids[0],ids[1],weight = 'travel_time')
fastest_path_time = nx.shortest_path_length(G,ids[0],ids[1],weight = 'travel_time')
print("Fastest path travel time = ",fastest_path_time)
ox.plot_graph_routes(G,[path,fastest_path],route_colors = ['r','g'],figsize=(20,20), bbox=(40.76,40.70,-73.95,-73.99))

As you can see from this plot, both paths will include the Williamsburg Bridge. The only difference is a small part close to the pickup point. Such a small difference is not always the case for all routes. I plotted several routes within Manhattan for different pickup and dropoffpoints, and not all fastest routes seemed to be reasonable in comparison to the shortest route. Because of this I will only use shortest routes for all future calculations.

Now that I have shown you how to make that calculation with OSMNX and NetworkX, let's get back to the parallelization. As mentioned before, NetworkX and OSMNX do not use a parallel approach to calculate the shortest path. 

Luckily RAPIDS has a parallel implementation on a gpu for the single source shortest path algorithm. But before I can show you how the same calculations can be made with RAPIDS, let me show you how to load the data and prepare it, so we can use it for our calculation

## Preparation

### Loading the Data 

To load the data you can use a simple function that will read the data from a csv file directly into your gpu memory. Unless you use Google Colab to run this notebook, this will only take a few seconds.

In [None]:
types =      {'fare_amount': 'float32',
              'pickup_datetime':'str',
              'pickup_longitude': 'float32',
              'pickup_latitude': 'float32',
              'dropoff_longitude': 'float32',
              'dropoff_latitude': 'float32',
              'passenger_count': 'int8'}
Spalten = list(types.keys())
df_train = cudf.read_csv('../input/new-york-city-taxi-fare-prediction/train.csv', usecols=Spalten, dtype=types)#,nrows=10000000
df_test = cudf.read_csv('../input/new-york-city-taxi-fare-prediction/test.csv', usecols=Spalten, dtype=types)


Now that we loaded the data, we can make some adjustments so that we won't use to much memory. This can be quite helpful, since out_of_memory exceptions can accure quite easily.

In [None]:
df_train['pickup_datetime'] = df_train['pickup_datetime'].astype('datetime64[ns]')
df_test['pickup_datetime'] = df_test['pickup_datetime'].astype('datetime64[ns]')
#Getting interger numbers from the pickup_datetime
df_train["hour"] = df_train.pickup_datetime.dt.hour.astype('int8')
df_train["minute"] = df_train.pickup_datetime.dt.minute.astype('int8')
df_train["weekday"] = df_train.pickup_datetime.dt.weekday.astype('int8')
df_train["month"] = df_train.pickup_datetime.dt.month.astype('int8')
df_train["year"] = df_train.pickup_datetime.dt.year.astype('int16')
df_train["year"] = df_train["year"]-2000
df_train["year"] = df_train["year"].astype('int8')

df_train["day"]=df_train.pickup_datetime.dt.day.astype('int8')

df_test["hour"] = df_test.pickup_datetime.dt.hour.astype('int8')
df_test["minute"]= df_test.pickup_datetime.dt.minute.astype('int8')
df_test["weekday"] = df_test.pickup_datetime.dt.weekday.astype('int8')
df_test["month"] = df_test.pickup_datetime.dt.month.astype('int8')
df_test["year"] = df_test.pickup_datetime.dt.year.astype('int8')
df_test["year"] = df_test["year"]-2000
df_test["year"] = df_test["year"].astype('int8')
df_test["day"]=df_test.pickup_datetime.dt.day.astype('int8')

df_train.drop(columns = ['pickup_datetime'])
df_test.drop(columns = ['pickup_datetime'])
df_train.head()

After loading the data and changing its format, we can inspect it and clean it just as easy as we would in pandas.

(If you are familiar with the usual cleaning process only the Second Cleaning will be of interest to you.)
### First Cleaning

In [None]:
df_train.describe()

In [None]:
df_test.describe()

As you can clearly see some of the GPS-Coordinates don't make sense, so we use a bounding box to only use useful GPS-Coordinates. We can also elimante all data with a negative fare amount, because that would mean the driver is paying you for a ride in his/her taxi. Less than 1 and more than 6 passengers are also not of interest. 

In [None]:
df_train.nans_to_nulls()
df_train= df_train.dropna()
Zeilen_vor_Bearbeitung=df_train.shape[0]
df_train.describe()

In [None]:
#Negative fareamount will be dropped
df_train = df_train[df_train['fare_amount'] > 0]
#Some of the GPS Coordinates are fare away from New York, so we use a Bounding Box.
df_train = df_train[(df_train['pickup_longitude'] < -72) & (df_train['pickup_longitude'] > -74.3)]
df_train = df_train[(df_train['pickup_latitude'] > 40.5) & (df_train['pickup_latitude'] < 41.71)]
df_train = df_train[(df_train['dropoff_longitude'] < -72) & (df_train['dropoff_longitude'] > -74.3)]
df_train = df_train[(df_train['dropoff_latitude'] > 40.5) & (df_train['dropoff_latitude'] < 41.71)]
df_train = df_train[(df_train['passenger_count'] > 0) & (df_train['passenger_count'] < 7)] #Passengercount 0 or higher than 6 is not of interest.
Zeilen_nach_Bearbeitung=df_train.shape[0]
Zeilenverlust = Zeilen_vor_Bearbeitung-Zeilen_nach_Bearbeitung
print("Number of datasamples eliminated= {}".format(Zeilenverlust))
df_train.describe()

Now that we cleaned the data, lets take a look at a heatmap of our rides. Herefore I'm using a function from https://www.kaggle.com/breemen/nyc-taxi-fare-data-exploration

In [None]:
def select_within_boundingbox(df, BB):
    return (df.pickup_longitude >= BB[0]) & (df.pickup_longitude <= BB[1]) & \
           (df.pickup_latitude >= BB[2]) & (df.pickup_latitude <= BB[3]) & \
           (df.dropoff_longitude >= BB[0]) & (df.dropoff_longitude <= BB[1]) & \
           (df.dropoff_latitude >= BB[2]) & (df.dropoff_latitude <= BB[3])
def plot_hires(df, BB, figsize=(40, 40), ax=None, c=('r', 'b')):
    if ax == None:
        fig, ax = plt.subplots(1, 1, figsize=figsize)

    idx = select_within_boundingbox(df, BB)
    ax.scatter(df[idx].pickup_longitude, df[idx].pickup_latitude, c=c[0], s=0.01, alpha=0.5)
    ax.scatter(df[idx].dropoff_longitude, df[idx].dropoff_latitude, c=c[1], s=0.01, alpha=0.5)

In [None]:
df_train_pd = df_train.to_pandas()
plot_hires(df_train_pd,(-74.3,-73,40.5,41.2))

As you can see in this plot, we can see the shape of Manhattan Brooklyn and Queens as well as the JFK Airport quite clearly. What seems strange though, is, that a considerable number of rides seem to start or end in water. To eliminate those rides, I made some adjustments to the idea I saw in https://www.kaggle.com/breemen/nyc-taxi-fare-data-exploration, which also tries to eliminate those rides. 

## Second Cleaning

Instead of using a picture of a map, I wanted group all rides by area. If a ride doesn't start or end in one of the predefined areas, it would start or end in water and I could eliminate that special ride.
To group the data by area I wanted to use polygons to determine which GPS Point belongs to which area. Unfortunately all implementations I could find used a slow sequential approach. Luckily, a friend of mine studied geoinformatics and recommended to use a GIS system. There are several APIs to include a GIS system into Python. The most interesting one is cuspatial, which is included in RAPIDS. It uses a GPU-accelerated solution that according to its developers has shown a significant speed-up in comparison to CPU-based implementations. For further information, check out https://medium.com/rapids-ai/acclerating-gis-data-science-with-rapids-cuspatial-and-gpus-fd012b27af0a

Since we want to check 55 Million taxi rides, with 2 GPS Points each, any speed-up we can achieve is more than welcome.

After we now know, how to check those points, we still have to find a way how to get those polygons.

### Loading polygons from public available data

Luckily I didn't need to create those polygons on my own. Some of those polygons can be easily downloaded using the OSMNX API. For some areas, like the State of New Jersey or Connecticut those polygons do not represent the coastline as precisely as you can see it in Google Maps. Thanks to the census, many states collect data about those areas, and it is often available publicly. Thankfully I was able to find out that most of the data, including precise polygons, is stored as ARCGIS data, that can be easily accessed via a REST API. Within some filters applied and some polygons merged I can create multiple precise polygons and save those polygons as a shapefile.

### OSMNX Data
For each Island as well as for Queens and Brooklyn a polygon can be downloaded via the OSMNX API

In [None]:
import os

#file_path = '/data/Staten_Island.shp'
#directory = os.path.dirname(file_path)
directory = 'data'
try:
    os.stat(directory)
except:
    os.mkdir(directory) 

In [None]:
Staten_Island = ox.geocode_to_gdf(['Staten Island, USA'])
Long_Island   = ox.geocode_to_gdf(['Long Island, USA'])
Brooklyn      = ox.geocode_to_gdf(['Brooklyn,New York, USA'])
Queens        = ox.geocode_to_gdf(['Queens,New York, USA'])
Bronx         = ox.geocode_to_gdf(['Bronx,New York, USA',])
Manhattan     = ox.geocode_to_gdf(['Manhattan,New York, USA',])

Staten_Island= Staten_Island.rename(columns={'display_name':'NAME'})
#Staten_Island= Staten_Island[['NAME','geometry']]
Staten_Island.to_file('data/Staten_Island.shp')

Long_Island= Long_Island.rename(columns={'display_name':'NAME'})
#Long_Island= Long_Island[['NAME','geometry']]
Long_Island.to_file('data/Long_Island.shp')

Brooklyn= Brooklyn.rename(columns={'display_name':'NAME'})
#Brooklyn= Brooklyn[['NAME','geometry']]
Brooklyn.to_file('data/Brooklyn.shp')

Queens= Queens.rename(columns={'display_name':'NAME'})
#Queens= Queens[['NAME','geometry']]
Queens.to_file('data/Queens.shp')

Bronx= Bronx.rename(columns={'display_name':'NAME'})
#Bronx= Bronx[['NAME','geometry']]
Bronx.to_file('data/Bronx.shp')

Manhattan= Manhattan.rename(columns={'display_name':'NAME'})
#Manhattan= Manhattan[['NAME','geometry']]
Manhattan.to_file('data/Manhattan.shp')

del Staten_Island
del Long_Island
del Brooklyn
del Queens
del Bronx
del Manhattan

#ox.plot.plot_footprints(Queens,figsize=(50,50))

### New York City Data

Since I would like to get a more precise information in which area of Manhattan a ride starts or end, I use publicly available data to differentiate  between Lower Manhattan, Midtown, Upper Westside, Upper Eastside, Upper Manhattan, and Central Park.

In [None]:
import geopandas as gp 
NYC = gp.read_file('https://services5.arcgis.com/GfwWNkhOj9bNBqoJ/arcgis/rest/services/NYC_Community_Districts/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=geojson')
#NYC = gp.read_file('nycd.shp')
NYC = NYC.to_crs("EPSG:4326")
Manhattan = NYC[NYC['BoroCD']<200]
del NYC
ox.plot.plot_footprints(Manhattan,figsize=(50,50))

In [None]:
Lower_Manhattan = Manhattan[Manhattan['BoroCD']<104]
Lower_Manhattan['NAME']='Lower Manhattan'
Lower_Manhattan = Lower_Manhattan.dissolve(by = 'NAME')
Lower_Manhattan.to_file('data/Lower_Manhattan.shp')

Midtown = Manhattan[Manhattan['BoroCD']<107]
Midtown = Midtown[Midtown['BoroCD']>103]
Midtown['NAME']='Midtown'
Midtown = Midtown.dissolve(by = 'NAME')
Midtown.to_file('data/Midtown.shp')

Upper_Westside = Manhattan[Manhattan['BoroCD']==107]
Upper_Eastside = Manhattan[Manhattan['BoroCD']==108]
Upper_Westside['NAME']='Upper Westside'
Upper_Eastside['NAME']='Upper Eastside'
Upper_Westside.to_file('data/Upper_Westside.shp')
Upper_Eastside.to_file('data/Upper_Eastside.shp')

Upper_Manhattan = Manhattan[Manhattan['BoroCD']<150]
Upper_Manhattan = Upper_Manhattan[Upper_Manhattan['BoroCD']>108]
Upper_Manhattan['NAME']='Upper Manhattan'
Upper_Manhattan = Upper_Manhattan.dissolve(by = 'NAME')
Upper_Manhattan.to_file('data/Upper_Manhattan.shp')

Central_Park = Manhattan[Manhattan['BoroCD']>150]
Central_Park['NAME']='Central Park'
Central_Park.to_file('data/Central_Park.shp')

ox.plot.plot_footprints(Lower_Manhattan,figsize=(50,50))

del Manhattan
del Lower_Manhattan
del Midtown
del Upper_Westside
del Upper_Eastside
del Upper_Manhattan
del Central_Park



### New York State data

Although this competition is names New York City Taxi Fare Prediction some of rides do not stay within New York City, which is why I had to include polygons for several Counties in New York State.
I was able to filter those counties so that I could reduce the size and complexity of the resulting polygon.

In [None]:
import geopandas as gp 
NYS = gp.read_file("https://gisservices.its.ny.gov/arcgis/rest/services/NYS_Civil_Boundaries/FeatureServer/3/query?where=1%3D1&outFields=*&outSR=4326&f=geojson")
#NYS = gp.read_file('Counties_Shoreline.shp')
NYS = NYS.to_crs("EPSG:4326")
NYS = NYS[NYS['NYSP_ZONE']!='Long Island']
NYS = NYS[NYS['NYSP_ZONE']!='West']
NYS = NYS[NYS['NYSP_ZONE']!='Central']
NYS = NYS[NYS['ABBREV']!='STLA']
NYS = NYS[NYS['ABBREV']!='CLIN']
NYS = NYS[NYS['ABBREV']!='FRAN']
NYS = NYS[NYS['ABBREV']!='ESSE']
NYS = NYS[NYS['ABBREV']!='HAMI']
NYS = NYS[NYS['ABBREV']!='WARR']
NYS = NYS[NYS['ABBREV']!='HERK']
NYS = NYS[NYS['ABBREV']!='WASH']
NYS = NYS[NYS['ABBREV']!='SARA']
NYS = NYS[NYS['ABBREV']!='SCHE']
NYS = NYS[NYS['ABBREV']!='MONT']
NYS = NYS[NYS['ABBREV']!='FULT']
NYS = NYS[NYS['ABBREV']!='RENS']
NYS = NYS[NYS['ABBREV']!='ALBA']
NYS = NYS[NYS['ABBREV']!='SCHO']
NYS = NYS[NYS['ABBREV']!='OTSE']
NYS = NYS[NYS['ABBREV']!='DELA']
NYS = NYS[NYS['ABBREV']!='GREE']
NYS = NYS[NYS['ABBREV']!='COLU']
NYS['dummy'] = 0
New_York_State = NYS.dissolve(by = 'dummy')
#New_York_State = New_York_State[['NAME','geometry']]
New_York_State.to_file('data/New_York_State.shp')
ox.plot.plot_footprints(New_York_State,figsize=(50,50))
del NYS
del New_York_State

### Connecticut Data
Since I was not completely sure if some rides would not end in Connecticut I had to included Connecticut as well. By applying a filter, I was able to keep all 'Inland Polygons' a thereby generate merged polygon with a better representation of the coastline then I could get via OSMNX.

In [None]:
CT = gp.read_file("https://services1.arcgis.com/FjPcSmEFuDYlIdKC/arcgis/rest/services/Connecticut_Towns_NoLabels/FeatureServer/1/query?where=1%3D1&outFields=*&outSR=4326&f=geojson")
#CT = gp.read_file('Town_Polygon.shp')
CT.head()

In [None]:
CT = CT[CT['COAST_POLY']=='Inland Polygons']
CT['dummy'] = 0
Connecticut = CT.dissolve(by = 'dummy')
#Connecticut = Connecticut[['TOWN','geometry']]
Connecticut = Connecticut.rename(columns={'TOWN':'NAME'})
Connecticut.to_file('data/Connecticut.shp')
ox.plot.plot_footprints(Connecticut,figsize=(50,50))
del Connecticut
del CT

### New Jersey Data
As you could see in the heatmap generated earlier in this notebook, many rides started or ended in New Jersey. To reduce the size and complexity of merged polygon I filtered the publicly available Data to exclude several Counties of New Jersey.

In [None]:
NJ = gp.read_file("https://maps.nj.gov/arcgis/rest/services/Framework/Government_Boundaries/MapServer/2/query?where=1%3D1&outFields=*&outSR=4326&f=geojson")
#NJ = gp.read_file('Municipal_Boundaries_of_NJ.shp')
NJ = NJ.to_crs("EPSG:4326")

In [None]:
NJ = NJ[NJ['COUNTY']!='SUSSEX']
NJ = NJ[NJ['COUNTY']!='WARREN']
NJ = NJ[NJ['COUNTY']!='HUNTERDON']
NJ = NJ[NJ['COUNTY']!='MERCER']
NJ = NJ[NJ['COUNTY']!='BURLINGTON']
NJ = NJ[NJ['COUNTY']!='OCEAN']
NJ = NJ[NJ['COUNTY']!='CAMDEN']
NJ = NJ[NJ['COUNTY']!='GLOUCESTER']
NJ = NJ[NJ['COUNTY']!='SALEM']
NJ = NJ[NJ['COUNTY']!='CUMBERLAND']
NJ = NJ[NJ['COUNTY']!='ATLANTIC']
NJ = NJ[NJ['COUNTY']!='MORRIS']
NJ = NJ[NJ['COUNTY']!='SOMERSET']
NJ = NJ[NJ['COUNTY']!='CAPE MAY']
NJ.head()

In [None]:
NJ['dummy'] = 0
New_Jersey = NJ.dissolve(by = 'dummy')
#New_Jersey = New_Jersey[['NAME','geometry']]
New_Jersey.to_file('data/New_Jersey.shp')
ox.plot.plot_footprints(New_Jersey,figsize=(50,50))
del NJ
del New_Jersey

### Special Areas
For some special areas of interest there are no polygons available, like for the 3 airports JFK, La Guardia and Newark Liberty International Airport. Since a high number of rides start or end at the airport, as we can see in the heatmap plotted earlier, and in case of JFK a ride between the airport and Manhattan will have a special tariff, I would like to have a special label. 

The same goes for Roosevelt Island, which is often included in the polygons for Manhattan, although there is no direct road connection between Roosevelt Island an Manhattan.

To create Polygons for these areas, I manually extracted GPS-point surrounding these areas out of Google Maps. I than use those points to create Polygons, build a GeoPandas Dataframe and write that into a shapefile.

In [None]:
from shapely.geometry import Point, Polygon

In [None]:
JFK_coords = [[40.649279, -73.795067],
              [40.633968, -73.795423],
              [40.632889, -73.777102],
              [40.647104, -73.766159],
              [40.654283, -73.782681]]
LGA_coords = [[40.766340, -73.863005],
              [40.768537, -73.865400],
              [40.770438, -73.868098],
              [40.771190, -73.869536],
              [40.771580, -73.870689],
              [40.771900, -73.872492],
              [40.771961, -73.874069],
              [40.771799, -73.875984],
              [40.771155, -73.878300],
              [40.769887, -73.880747],
              [40.768099, -73.884373],
              [40.768031, -73.885808],
              [40.779945, -73.889108],
              [40.787910, -73.868231],
              [40.765435, -73.847899],
              [40.764702, -73.859288]]
EWR_coords = [[40.697, -73.185],
              [40.687, -73.185],
              [40.687, -73.175],
              [40.697, -73.175]]
Roosevelt_Island_coords = [[40.773330,-73.939915],
                           [40.772429,-73.942333],
                           [40.769805,-73.945004],
                           [40.752010,-73.960918],
                           [40.749331,-73.961929],
                           [40.749386,-73.960846],
                           [40.752229,-73.957634],
                           [40.756630,-73.952871],
                           [40.763668,-73.946713],
                           [40.768803,-73.942099],
                           [40.769461,-73.940919],
                           [40.771671,-73.939417]
                          ]

JFK = Polygon(JFK_coords)
LGA = Polygon(LGA_coords)
EWR = Polygon(EWR_coords)
Roosevelt = Polygon(Roosevelt_Island_coords)

In [None]:
temp_pd = pd.DataFrame({'NAME':['JFK'],'geometry':[JFK]})
JFK_geopd =  gp.GeoDataFrame(temp_pd)
JFK_geopd.to_file('data/JFK.shp')

temp_pd = pd.DataFrame({'NAME':['LGA'],'geometry':[LGA]})
LGA_geopd =  gp.GeoDataFrame(temp_pd)
LGA_geopd.to_file('data/LGA.shp')

temp_pd = pd.DataFrame({'NAME':['EWR'],'geometry':[EWR]})
EWR_geopd =  gp.GeoDataFrame(temp_pd)
EWR_geopd.to_file('data/EWR.shp')

temp_pd = pd.DataFrame({'NAME':['Roosevelt'],'geometry':[Roosevelt]})
Roosevelt_geopd =  gp.GeoDataFrame(temp_pd)
Roosevelt_geopd.to_file('data/Roosevelt.shp')

### point-in-polygon test

To calculate which Ride starts and ends in which area, I created four new columns. The first and second column **"Pickup_Island"** and **"Dropoff_Island"** will define, if the ride started or ended on an Island, in New Jersey or in the New York State Mainland as well as in Connecticut. The third and fourth column **"Pickup_Borough"** and **"Dropoff_Borough"** will give a more precise definition about the area. It not only includes labels for areas like Manhattan, but also special labels for Midtown, Upper Westside, Brooklyn or JFK, to just name a few. This more precise form of labeling can be helpful later on in a machine learning part.

To do our calculations, I just load each polygon from a shapefile into the gpu memory and use cuspatials **point_in_polygon** function to determine if a point is within that area.

In [None]:
df_train['Pickup_Island']= 20
df_train['Dropoff_Island']= 20
df_test['Pickup_Island']= 20
df_test['Dropoff_Island']= 20

df_train['Pickup_Borough']= 20
df_train['Dropoff_Borough']= 20
df_test['Pickup_Borough']= 20
df_test['Dropoff_Borough']= 20

df_train['Pickup_Borough']=df_train['Pickup_Borough'].astype('int8')
df_train['Dropoff_Borough']=df_train['Dropoff_Borough'].astype('int8')
df_train['Pickup_Island'] = df_train['Pickup_Island'].astype('int8')
df_train['Dropoff_Island'] = df_train['Dropoff_Island'].astype('int8')

df_test['Pickup_Borough']=df_test['Pickup_Borough'].astype('int8')
df_test['Dropoff_Borough']=df_test['Dropoff_Borough'].astype('int8')
df_test['Pickup_Island'] = df_test['Pickup_Island'].astype('int8')
df_test['Dropoff_Island'] = df_test['Dropoff_Island'].astype('int8')

In [None]:
def getInclusion_shape(latitude,longitude,shape):
  result = cuspatial.point_in_polygon(latitude,longitude,cudf.Series([0],index='resultcolumn'),shape[1],shape[2]['y'],shape[2]['x'])
  return result

In [None]:

def check_Zones(shape,name):
    result_eins = getInclusion_shape(df_train.pickup_latitude,df_train.pickup_longitude,shape)
    result_zwei = getInclusion_shape(df_train.dropoff_latitude,df_train.dropoff_longitude,shape)
    df_train.loc[result_eins['resultcolumn'],'Pickup_Borough']=name
    df_train.loc[result_zwei['resultcolumn'],'Dropoff_Borough']=name
    result_eins = getInclusion_shape(df_test.pickup_latitude,df_test.pickup_longitude,shape)
    result_zwei = getInclusion_shape(df_test.dropoff_latitude,df_test.dropoff_longitude,shape)
    df_test.loc[result_eins['resultcolumn'],'Pickup_Borough']=name
    df_test.loc[result_zwei['resultcolumn'],'Dropoff_Borough']=name
    return 

In [None]:

def check_Islands(shape,name):
    result_eins = getInclusion_shape(df_train.pickup_latitude,df_train.pickup_longitude,shape)
    result_zwei = getInclusion_shape(df_train.dropoff_latitude,df_train.dropoff_longitude,shape)
    df_train.loc[result_eins['resultcolumn'],'Pickup_Island']=name
    df_train.loc[result_zwei['resultcolumn'],'Dropoff_Island']=name
    result_eins = getInclusion_shape(df_test.pickup_latitude,df_test.pickup_longitude,shape)
    result_zwei = getInclusion_shape(df_test.dropoff_latitude,df_test.dropoff_longitude,shape)
    df_test.loc[result_eins['resultcolumn'],'Pickup_Island']=name
    df_test.loc[result_zwei['resultcolumn'],'Dropoff_Island']=name
    return 

In [None]:
NYS_cd = cuspatial.read_polygon_shapefile('data/New_York_State.shp')
CT_cd = cuspatial.read_polygon_shapefile('data/Connecticut.shp')
NJ_cd = cuspatial.read_polygon_shapefile('data/New_Jersey.shp')
LI_cd = cuspatial.read_polygon_shapefile('data/Long_Island.shp')
SI_cd = cuspatial.read_polygon_shapefile('data/Staten_Island.shp')
BK_cd = cuspatial.read_polygon_shapefile('data/Brooklyn.shp')
Q_cd = cuspatial.read_polygon_shapefile('data/Queens.shp')
BX_cd = cuspatial.read_polygon_shapefile('data/Bronx.shp')
M_cd = cuspatial.read_polygon_shapefile('data/Manhattan.shp')
LM_cd = cuspatial.read_polygon_shapefile('data/Lower_Manhattan.shp')
MM_cd = cuspatial.read_polygon_shapefile('data/Midtown.shp')
UE_cd = cuspatial.read_polygon_shapefile('data/Upper_Eastside.shp')
UW_cd = cuspatial.read_polygon_shapefile('data/Upper_Westside.shp')
UM_cd = cuspatial.read_polygon_shapefile('data/Upper_Manhattan.shp')
CP_cd = cuspatial.read_polygon_shapefile('data/Central_Park.shp')
JFK_cd = cuspatial.read_polygon_shapefile('data/JFK.shp')
LGA_cd = cuspatial.read_polygon_shapefile('data/LGA.shp')
EWR_cd = cuspatial.read_polygon_shapefile('data/EWR.shp')
RI_cd = cuspatial.read_polygon_shapefile('data/Roosevelt.shp')

In [None]:
%%time
check_Zones(CT_cd, 1)
check_Zones(NYS_cd,2)
check_Zones(NJ_cd, 3)
check_Zones(LI_cd, 4)
check_Zones(SI_cd, 5)
check_Zones(EWR_cd,6)
check_Zones(RI_cd, 7)
check_Zones(BX_cd, 8)
check_Zones(M_cd,  9)
check_Zones(UM_cd, 10)
check_Zones(CP_cd, 11)
check_Zones(JFK_cd,12)
check_Zones(LGA_cd,13)
check_Zones(UE_cd, 14)
check_Zones(UW_cd, 15)
check_Zones(Q_cd,  16)
check_Zones(BK_cd, 17)
check_Zones(MM_cd, 18)
check_Zones(LM_cd, 19)

In [None]:
%%time
check_Islands(CT_cd, 1)
check_Islands(NYS_cd,1)
check_Islands(BX_cd, 1)
check_Islands(NJ_cd, 2)
check_Islands(LI_cd, 4)
check_Islands(SI_cd, 3)
check_Islands(M_cd,  5)

In [None]:
del NYS_cd
del CT_cd
del NJ_cd
del LI_cd
del SI_cd
del BK_cd
del Q_cd 
del BX_cd
del M_cd 
del LM_cd
del MM_cd
del UE_cd
del UW_cd
del UM_cd
del CP_cd
del JFK_cd
del LGA_cd
del EWR_cd
del RI_cd

In [None]:
df_train['Pickup_Borough']=df_train['Pickup_Borough'].astype('int8')
df_train['Dropoff_Borough']=df_train['Dropoff_Borough'].astype('int8')
df_train['Pickup_Island'] = df_train['Pickup_Island'].astype('int8')
df_train['Dropoff_Island'] = df_train['Dropoff_Island'].astype('int8')

df_test['Pickup_Borough']=df_test['Pickup_Borough'].astype('int8')
df_test['Dropoff_Borough']=df_test['Dropoff_Borough'].astype('int8')
df_test['Pickup_Island'] = df_test['Pickup_Island'].astype('int8')
df_test['Dropoff_Island'] = df_test['Dropoff_Island'].astype('int8')

### Eliminating the rides that start or end in water

Now that we have labels for each data, all rows can be deleted, where one or both GPS-Points (pickup and dropoff) are not included in the range of possible areas. To get the most precise elimination, I use the **"Pickup_Island"** and **"Dropoff_Island"** label.

In [None]:
shape_with_water = df_train.shape[0]
df_train = df_train[df_train['Pickup_Island']<20]
df_train = df_train[df_train['Dropoff_Island']<20]
shape_without_water = df_train.shape[0]
df_train_pd = df_train.to_pandas()
print("Number of rides eleminated = {}".format((shape_with_water-shape_without_water)))
#plot_hires(df_train_pd,(-74.3,-73,40.5,41.2))

In [None]:
#plot_hires(df_train_pd,(-74.256,-73.69,40.49,40.93))
#plot_hires(df_train_pd,(-74.05,-73.9,40.69,40.9))
plot_hires(df_train_pd,(-74.3,-73,40.5,41.2))

In [None]:
print("{} Rows get deleted, because the Ride started or ended in Water".format((shape_with_water-shape_without_water)))

As you can see in the heatmap created after cleaning, the data no longer includes rides that starts or end in water. This helps, since all rides that start or ended in water can be considered outlier, since the shortest path or route calculation relies on the street network, and therefore could produce incorrect distance calculations, which itself would negatively affect the modell training with machine learning algorithms.

### Manhattan distance

To have a comparison to the shortest route, I want to use a common distance calculation called taxicab geometry, also known as Manhattan distance. "*The latter name\[s\] allude to the grid layout of most streets on the island of Manhattan, which causes the shortest path a car could take between two intersections in the borough to have length equal to the intersections' distance in taxicab geometry."* (https://en.wikipedia.org/wiki/Taxicab_geometry)

In [None]:
def Manhattan_distance(lat1,lon1,lat2,lon2):
  lat_dist = lat1-lat2
  lon_dist = lon1-lon2
  distance = 111000*abs(lat_dist)+abs(lon_dist*111000*np.cos(lat1))
  return distance

In [None]:
%%time
df_train['Manhattan_distance']=Manhattan_distance(df_train['pickup_latitude'],df_train['pickup_longitude'],df_train['dropoff_latitude'],df_train['dropoff_longitude'])
df_test['Manhattan_distance']=Manhattan_distance(df_test['pickup_latitude'],df_test['pickup_longitude'],df_test['dropoff_latitude'],df_test['dropoff_longitude'])
df_train['Manhattan_distance']=df_train['Manhattan_distance'].astype('float32')
df_test['Manhattan_distance']=df_test['Manhattan_distance'].astype('float32')
df_train.head()

## Shortest path calculation

Before I start with the shortest path calculation, I want to take advantage of the **"Pickup_Borough"** label. One graph for all pickup- and dropoff-points in our data would not be an efficient way to handle the calculation, since a lot of routes stay within a small area. To calculate these rides a graph of a smaller area would be better suited, since the runtime of a single source shortest path calculation will scale with the number of nodes and edges in the graph used for this calculation. To see who many rides stay within an island or area, I use the **"Pickup_Island"** and **"Dropoff_Island"** label to generate a matrix, which will show how many rides stay within an island and how many rides start on one island and end on another island. This should help to determine which graphs to use for the calculation.


In [None]:
count_array = np.array([[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0]])
count_sum = 0
#df_train_pd = df_train.to_pandas()
for i in range(0,5):
    for j in range(0,5):
        df_special = df_train[(df_train['Pickup_Island'] == (i+1)) & (df_train['Dropoff_Island'] == (1+j))]
        count_array[i][j]=df_special.shape[0]
print(count_array)

As you can see from this matrix, 46 million rides stay within Manhattan. 1.8 million rides stay within Long Island. Around 5.4 Million rides are between Manhattan and Long Island. Since efficiency is particularly important in parallel algorithms, I want to use my resources efficiently adjusting the size of the graph for the **single source shortest path (sssp)** calculation, and thereby reduce the runtime of the code.
To achieve that, I split the date in several parts by using the Island labels, Borough labels as well as bounding boxes. For each part I will use a special Graph. 

Because generating multiple new dataframes containing the same data as df_train would cause an out_of_memory exception on my gpu, I transfer the cudf Dataframe to pandas and split the data using the CPU and RAM. This will consume more time than doing the same steps with a gpu, but I have to stay within the limits of my hardware. This is also the reason why I write each new Dataframe into a csv file, so that I do not need to keep everything in the system memory.

In [None]:
import os

#file_path = '/data/Staten_Island.shp'
#directory = os.path.dirname(file_path)
directory = 'split_parts'
try:
    os.stat(directory)
except:
    os.mkdir(directory) 

In [None]:
def split_Dataframe(df):
    df_Manhattan_gesamt = df[True==((df['Pickup_Island'] == 5) & (df['Dropoff_Island'] == 5))]
    df_nicht_Manhattan = df[False==((df['Pickup_Island'] == 5) & (df['Dropoff_Island'] == 5))]
    
    df_Roosevelt_eins   = df_Manhattan_gesamt[True==((df_Manhattan_gesamt['Pickup_Borough'] == 7))]
    #df_Roosevelt_eins.to_csv('Roosevelt_eins.csv',index=False,chunksize=1000000)
    df_Manhattan_gesamt = df_Manhattan_gesamt[False==((df_Manhattan_gesamt['Pickup_Borough'] == 7))]
    df_Roosevelt_zwei   = df_Manhattan_gesamt[True==((df_Manhattan_gesamt['Dropoff_Borough'] == 7))]
    #df_Roosevelt_zwei.to_csv('Roosevelt_zwei.csv',index=False,chunksize=1000000)
    df_Manhattan_gesamt = df_Manhattan_gesamt[False==((df_Manhattan_gesamt['Dropoff_Borough'] == 7))]
    print("Number of Rides that stay within Manhattan: {}".format(df_Manhattan_gesamt.shape[0]))
    
    df_Manhattan_eins = df_Manhattan_gesamt[0:16000000]
    df_Manhattan_zwei = df_Manhattan_gesamt[16000000:32000000]
    df_Manhattan_drei = df_Manhattan_gesamt[32000000:]
    del df_Manhattan_gesamt
    
    df_Manhattan_eins.to_csv('split_parts/Manhattan_eins.csv',index=False,chunksize=1000000)
    df_Manhattan_zwei.to_csv('split_parts/Manhattan_zwei.csv',index=False,chunksize=1000000)
    df_Manhattan_drei.to_csv('split_parts/Manhattan_drei.csv',index=False,chunksize=1000000)
    #df_Manhattan_vier.to_csv('split_parts/Manhattan_vier.csv',index=False,chunksize=1000000)
    #df_Manhattan_fünf.to_csv('split_parts/Manhattan_fünf.csv',index=False,chunksize=1000000)
    
    
    del df_Manhattan_eins
    del df_Manhattan_zwei
    del df_Manhattan_drei
    #del df_Manhattan_vier
    #del df_Manhattan_fünf
    
    
    df_Queens = df[(df['Pickup_Borough']==16)&(df['Dropoff_Borough']==16)]
    print("Number of Rides that stay within Queens: {}".format(df_Queens.shape[0]))
    df_Brooklyn = df[(df['Pickup_Borough']==17)&(df['Dropoff_Borough']==17)]
    print("Number of Rides that stay within Brooklyn: {}".format(df_Brooklyn.shape[0]))
    df_Brooklyn_Queens = df[(df['Pickup_Borough']==17)&(df['Dropoff_Borough']==16)]
    df_Queens_Brooklyn = df[(df['Pickup_Borough']==16)&(df['Dropoff_Borough']==17)]
    df_Queens_Brooklyn_gesamt = df_Queens
    df_Queens_Brooklyn_gesamt = df_Queens_Brooklyn_gesamt.append(df_Brooklyn_Queens)
    df_Queens_Brooklyn_gesamt = df_Queens_Brooklyn_gesamt.append(df_Queens_Brooklyn)
    print("Number of Rides between Brooklyn and Queens: {}".format(df_Queens_Brooklyn_gesamt.shape[0]))
    
    df_Queens.to_csv('split_parts/Queens.csv',index=False,chunksize=1000000)
    df_Brooklyn.to_csv('split_parts/Brooklyn.csv',index=False,chunksize=1000000)
    df_Brooklyn_Queens.to_csv('split_parts/Brooklyn_Queens.csv',index=False,chunksize=1000000)
    df_Queens_Brooklyn.to_csv('split_parts/Queens_Brooklyn.csv',index=False,chunksize=1000000)
    df_Queens_Brooklyn_gesamt.to_csv('split_parts/Queens_Brooklyn_total.csv',index=False,chunksize=1000000)
    
    del df_Queens
    del df_Brooklyn
    del df_Brooklyn_Queens
    del df_Queens_Brooklyn
    del df_Queens_Brooklyn_gesamt
    
    df_Long_Island_rest = df[(df['Pickup_Island'] == 4) & (df['Dropoff_Island'] == 4)]
    df_Long_Island_rest = df_Long_Island_rest[False==((df_Long_Island_rest['Pickup_Borough'] == 16) & (df_Long_Island_rest['Dropoff_Borough'] == 16))]
    df_Long_Island_rest = df_Long_Island_rest[False==((df_Long_Island_rest['Pickup_Borough'] == 16) & (df_Long_Island_rest['Dropoff_Borough'] == 17))]
    df_Long_Island_rest = df_Long_Island_rest[False==((df_Long_Island_rest['Pickup_Borough'] == 17) & (df_Long_Island_rest['Dropoff_Borough'] == 16))]
    df_Long_Island_rest = df_Long_Island_rest[False==((df_Long_Island_rest['Pickup_Borough'] == 17) & (df_Long_Island_rest['Dropoff_Borough'] == 17))]
    
    df_Long_Island_rest.to_csv('split_parts/Long_Island.csv',index=False,chunksize=1000000)
    print("Number of Rides that stay within the rest of Long Island: {}".format(df_Long_Island_rest.shape[0]))
    
    #df_Long_Island_gesamt_eins = df_Long_Island_gesamt[((df_Long_Island_gesamt['Pickup_Borough']==2)|(df_Long_Island_gesamt['Dropoff_Borough']==2)==False)]
    df_Staten_Island_gesamt = df[(df['Pickup_Island'] == 3) & (df['Dropoff_Island'] == 3)]
    df_Staten_Island_gesamt.to_csv('split_parts/Staten_Island.csv',index=False,chunksize=1000000)
    print("Number of Rides that stay within Staten Island: {}".format(df_Staten_Island_gesamt.shape[0]))
    del df_Long_Island_rest
    del df_Staten_Island_gesamt
    
    df_rest = df_nicht_Manhattan[False==((df_nicht_Manhattan['Pickup_Island'] == 4) & (df_nicht_Manhattan['Dropoff_Island'] == 4))]
    del df_nicht_Manhattan
    df_rest = df_rest[False==((df_rest['Pickup_Island'] == 3) & (df_rest['Dropoff_Island'] == 3))]
    
    df_rest = df_rest.append(df_Roosevelt_eins)
    df_rest = df_rest.append(df_Roosevelt_zwei)
    rest_shape_before_split = df_rest.shape[0]
    print("Number of Rides not included so far: {}".format(df_rest.shape[0]))
    df_spezial = df_rest[True==((df_rest['pickup_longitude'] <-73.69 ) & (df_rest['pickup_longitude'] > -74.256 )&(df_rest['pickup_latitude'] > 40.49) & (df_rest['pickup_latitude'] < 40.93 )&(df_rest['dropoff_longitude'] <-73.69 ) & (df_rest['dropoff_longitude'] > -74.256 )&(df_rest['dropoff_latitude'] > 40.49 ) & (df_rest['dropoff_latitude'] < 40.93 ))]
    df_most_spezial = df_spezial[True==((df_spezial['Pickup_Borough'] >8) & (df_spezial['Dropoff_Borough'] >8))]
    df_spezial = df_spezial[False==((df_spezial['Pickup_Borough'] >8) & (df_spezial['Dropoff_Borough'] >8))]
    df_rest =   df_rest[False==((df_rest['pickup_longitude'] <-73.69 ) & (df_rest['pickup_longitude'] > -74.256 )&(df_rest['pickup_latitude'] > 40.49) & (df_rest['pickup_latitude'] < 40.93 )&(df_rest['dropoff_longitude'] <-73.69 ) & (df_rest['dropoff_longitude'] > -74.256 )&(df_rest['dropoff_latitude'] > 40.49 ) & (df_rest['dropoff_latitude'] < 40.93 ))]

    
    df_most_spezial.to_csv('split_parts/most_spezial.csv',index=False,chunksize=1000000)
    df_spezial.to_csv('split_parts/spezial.csv',index=False,chunksize=1000000)
    df_rest.to_csv('split_parts/rest.csv',index=False,chunksize=1000000)

    
    
    
    del df_most_spezial
    del df_spezial
    del df_rest
    
    df_List_names = ["Manhattan_eins","Manhattan_zwei","Manhattan_drei",#"Manhattan_vier","Manhattan_fünf",
                    "Brooklyn",
                    "Queens_Brooklyn_total",
                    "Long_Island",
                    "Staten_Island",
                    "most_spezial",
                    "spezial",
                    "rest",]
    
    return df_List_names


In [None]:
%%time
splitted_list = split_Dataframe(df_train)

## Graphs for each dataframe part
These generated dataframe parts will use the following graphs for the shortest path calculation:

### Manhattan 
All rides that stay within Manhattan

In [None]:
G = ox.graph_from_place('Manhattan, New York, USA',simplify=True,truncate_by_edge=True,custom_filter='["area"!~"yes"]["highway"~"motorway|trunk|primary|secondary|tertiary|residential|motorway_link|trunk_link|primary_link|secondary_link|tertiary_link"]["highway"!~"service"]') #4503 Nodes, 9668 Edges
print(nx.info(G))
fig, ax = ox.plot_graph(G,figsize=(50,50))

### Brooklyn
All rides that stay within Brooklyn

In [None]:
G = ox.graph_from_place(['Brooklyn, New York, USA'],simplify=True,truncate_by_edge=True,custom_filter='["area"!~"yes"]["highway"~"motorway|trunk|primary|secondary|tertiary|residential|motorway_link|trunk_link|primary_link|secondary_link|tertiary_link"]["highway"!~"service"]')
print(nx.info(G))
fig, ax = ox.plot_graph(G,figsize=(50,50))

### Queens and Brooklyn
All rides that stay within Queens as well as rides between Queens and Brookyln

In [None]:
G = ox.graph_from_bbox(40.885,40.49,-74.05,-73.69,simplify=True,custom_filter='["area"!~"yes"]["highway"~"motorway|trunk|primary|secondary|tertiary|residential|motorway_link|trunk_link|primary_link|secondary_link|tertiary_link"]["highway"!~"service"]')
print(nx.info(G))
fig, ax = ox.plot_graph(G,figsize=(50,50))

### Long Island
All rides that stay within Long Island

In [None]:
G = ox.graph_from_place(['Long Island, USA'],simplify=True,truncate_by_edge=True,custom_filter='["area"!~"yes"]["highway"~"motorway|trunk|primary|secondary|tertiary|motorway_link|trunk_link|primary_link|secondary_link|tertiary_link"]["highway"!~"service"]')
print(nx.info(G))
fig, ax = ox.plot_graph(G,figsize=(50,50))

### Staten Island
All rides that stay within Staten Island

In [None]:
G = ox.graph_from_place('Staten Island, New York, USA',simplify=True,truncate_by_edge=True,custom_filter='["area"!~"yes"]["highway"~"motorway|trunk|primary|secondary|tertiary|residential|motorway_link|trunk_link|primary_link|secondary_link|tertiary_link"]["highway"!~"service"]')
print(nx.info(G))
fig, ax = ox.plot_graph(G,figsize=(50,50))

### Other
All rides that are not included in previous parts make use of three different graphs, starting with a smaller graph, that should be well suited for rides between Manhattan and Brooklyn/Queens and two bigger graphs for other rides. The key difference between the second and third graph for all other rides is not only the size, but that no residential roads are included, which reduce the accuracy of the calculation with that graph. I was willing to accept that loss in accuracy in order to lower the runtime of the shortest path calculation.

In [None]:
G = ox.graph_from_bbox(40.885,40.49,-74.05,-73.69,simplify=True,custom_filter='["area"!~"yes"]["highway"~"motorway|trunk|primary|secondary|tertiary|residential|motorway_link|trunk_link|primary_link|secondary_link|tertiary_link"]["highway"!~"service"]')
print(nx.info(G))
fig, ax = ox.plot_graph(G,figsize=(50,50))

In [None]:
G = ox.graph_from_bbox(40.93,40.49,-74.256,-73.69,simplify=True,custom_filter='["area"!~"yes"]["highway"~"motorway|trunk|primary|secondary|tertiary|residential|motorway_link|trunk_link|primary_link|secondary_link|tertiary_link"]["highway"!~"service"]')
print(nx.info(G))
fig, ax = ox.plot_graph(G,figsize=(50,50))

In [None]:
G = ox.graph_from_bbox(41.71,40.48,-74.352,-72,simplify=True,custom_filter='["area"!~"yes"]["highway"~"motorway|trunk|primary|secondary|tertiary|motorway_link|trunk_link|primary_link|secondary_link|tertiary_link"]["highway"!~"service"]') 
print(nx.info(G))
fig, ax = ox.plot_graph(G,figsize=(50,50))

In [None]:
df_train.to_csv('df_train_before_route_calculation.csv',index=False,chunksize=1000000)
#print(df_train.memory_usage())
df_train.shape[0]

In [None]:
df_train = None

In [None]:
del df_train

In [None]:
!nvidia-smi

# End of Part One

Since gpu memory is limited, and I wasn't able to free memory manually, I will restart the kernel, and unless you have a gpu with more than 11GB VRAM, I would advise you to do the same, and then run all cells below manually. I tried to optimize the second part both in regard to memory utilization and in regard to total runtime. A kernel restart still can be helpful even with highend hardware.

# Start of Part Two
In this part I will explain who to do the actual shortest path calculation. 

In [None]:
!nvidia-smi

In [None]:
!pip install osmnx

In [None]:
#import nvstrings
import numpy as np
import pandas as pd
import cudf, cuml
import dask_cudf
import io, requests
import math
import gc

#Plotting
import matplotlib.pyplot as plt
import seaborn as sns 

#Learning
from cuml.preprocessing.model_selection import train_test_split
from cuml.linear_model import LinearRegression
from scipy.stats import uniform

import cuspatial
import cugraph

from cuml.solvers import SGD as cumlSGD
from cuml.linear_model import LogisticRegression
from cuml.ensemble import RandomForestRegressor as cuRF
from cuml.neighbors import KNeighborsRegressor
from cuml import ForestInference
from cuml import Ridge
from cuml import Lasso
from cuml import ElasticNet
from cuml.solvers import CD
from cuml.svm import SVR

import xgboost as xgb
from cuml.svm import SVC

import pandas as pd

#import dask_ml.model_selection as dcv
#from dask.distributed import Client, wait
#from dask_cuda import LocalCUDACluster

from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from cuml.metrics.regression import r2_score
from cuml.metrics.regression import mean_squared_error
pd.__version__ #1.1.4

In [None]:
import rmm
rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())

In [None]:
cudf.set_allocator(allocator="managed")

In [None]:
import multiprocessing as mp
import numpy as np
import networkx as nx
import osmnx as ox
import requests
import matplotlib
import matplotlib.cm as cm
import matplotlib.colors as colors
%matplotlib inline
ox.config(use_cache=True, log_console=True,timeout=1000)
ox.__version__

In [None]:
types =      {'fare_amount': 'float32',
              'pickup_datetime':'str',
              'pickup_longitude': 'float32',
              'pickup_latitude': 'float32',
              'dropoff_longitude': 'float32',
              'dropoff_latitude': 'float32',
              'passenger_count': 'int8'}
types_nach_bearbeitung =     {'fare_amount': 'float32',
                              'pickup_longitude': 'float32',
                              'pickup_latitude': 'float32',
                              'dropoff_longitude': 'float32',
                              'dropoff_latitude': 'float32',
                              'passenger_count': 'int8',
                              'hour': 'int8',
                              'minute': 'int8',
                              'weekday': 'int8',
                              'month': 'int8',
                              'year': 'int8',
                              'day': 'int8',
                              'Pickup_Island': 'int8',
                              'Dropoff_Island': 'int8',
                              'Pickup_Borough': 'int8',
                              'Dropoff_Borough': 'int8',
                              'Manhattan_distance': 'float32',
                             }
Spalten = list(types.keys())
Spalten_nach_bearbeitung = list(types_nach_bearbeitung.keys())
#df_train = cudf.read_csv('df_train_before_route_calculation.csv', usecols=Spalten, dtype=types_nach_bearbeitung)#,nrows=10000000
#teadf_test = cudf.read_csv('test.csv', usecols=Spalten, dtype=types)

In [None]:
splitted_list = ["Manhattan_eins","Manhattan_zwei","Manhattan_drei",#"Manhattan_vier","Manhattan_fünf",
                    "Brooklyn",
                    "Queens_Brooklyn_total",
                    "Long_Island",
                    "Staten_Island",
                    "most_spezial",
                    "spezial",
                    "rest",]

## First Step
To start the route calculation on the gpu I had to find a way to parallelize all steps I made to calculate one route with osmnx and networkx. 
The first step was to get a graph, I already found a way to do that, but I still need to transfer the graph into the gpu. Therefor I extract an Pandas edgelist Dataframe out of the graph, transfer it into the gpu memory as a cudf Dataframe, generate a new cugraph Graph and use the cudf edgelist dataframe to add all edges to the new Graph. 
I also extract a Dataframe containing information about each node and the corresponding GPS coordinates out of the graph in preparation of the nearest node search.

In [None]:
def create_cu_Graph(Graph):
    pd_edge = nx.to_pandas_edgelist(Graph) # generating a pandas dataframe containing all edgeinformations based on the given Graph
    pd_edge_zwei = pd_edge[["source","target","length"]] # simplification of the dataframe
    cd_edge = cudf.from_pandas(pd_edge_zwei)
    coords = np.array([[node, data['x'], data['y']] for node, data in Graph.nodes(data=True)]) #extracting nodenumber and coordinates of each node out of the Graph
    node_frame = pd.DataFrame(data=coords,columns=['Node','longitude','latitude'])
    cd_node = cudf.from_pandas(node_frame)
    
    G_drei = cugraph.Graph() # building a cugraph Graph from the edgelist dataframe
    G_drei.from_cudf_edgelist(cd_edge,source='source', destination='target',
                             edge_attr='length', renumber=True)
    
    # This part currently not in use is intended to reduce the overhead generated due to the renumbering process which itself causes a mapping of external to internal vertex ids when using sssp()
    # While this can reduce the runtime by 20% to 30%, I strongly advice not to use that code unless you are sure you understand how it works.
    """
    cd_node_renumbered = G_drei.add_internal_vertex_id(cd_node,internal_column_name = 'NewNode',external_column_name = 'Node',drop = False)
    cd_node_renumbered = cd_node_renumbered.drop(columns = ['longitude'])
    cd_node_renumbered = cd_node_renumbered.drop(columns = ['latitude'])
    
    cd_node = cudf.from_pandas(node_frame)
    cd_node_new = cd_node.merge(cd_node_renumbered,how = 'left', on = 'Node')
    cd_node_new = cd_node_new.drop(columns = ['Node'])
    cd_node_new = cd_node_new.rename(columns={'NewNode':'Node'})
    
    
    cd_node_renumbered = cd_node_renumbered.rename(columns={'Node':'source'})
    cd_edge_new = cd_edge.merge(cd_node_renumbered,how = 'left', on = 'source')
    cd_edge_new = cd_edge_new.rename(columns={'NewNode':'new source'})
    
    cd_node_renumbered = cd_node_renumbered.rename(columns={'source':'target'})
    cd_edge = cd_edge_new.merge(cd_node_renumbered,how = 'left', on = 'target')
    cd_edge = cd_edge.rename(columns={'NewNode':'new target'})
    
    G_zwei = cugraph.Graph() # building a cugraph Graph from the edgelist dataframe
    G_zwei.from_cudf_edgelist(cd_edge,source='new source', destination='new target',
                             edge_attr='length', renumber=False)
    
    del cd_edge
    del cd_edge_new
    del pd_edge
    del pd_edge_zwei
    del coords
    del node_frame
    del cd_node_renumbered
    del cd_node
    del G_drei
    return G_zwei, cd_node_new"""
    del pd_edge
    del pd_edge_zwei
    del coords
    del node_frame
    del cd_edge
    return G_drei, cd_node

## Second Step
The second step is the nearest node search. For this step I could not find a implementation, which is why I wrote my own using numba. Since the underlying principle should be easy to understand by just inspecting the code, I won't explain it. There are only two things important enough to explain. The first thing is, that to use that numba kernel you need to generate a gpu array out of a cudf.Series by using *to_gpu_array()* and after the execution of the kernelfunction you need to reassign the cudf.Series with the gpu array, otherwise the changes made to the gpu array won't be saved into the cudf.Series.
The second thing is, that in order to execute the kernelfunction I needed to define a blocksize and a number of threads_per block. Those numbers may not work for you, so if the code does not work for you, start with those numbers.

In [None]:
import numba
import math
from numba import cuda
@cuda.jit
def calc_nearest_point(long,lat,node,distance,lenght,node_number,node_long,node_lat,node_length):
    #node = Number of next node, 
    #distance = distance to next node, 
    #length = length of dataframes containing all points, 
    #node_length = length of dataframe containing all nodes
    def get_distance(current_long,current_lat,current_node_long,current_node_lat): #simplified distance function
        long_dist = abs(current_long-current_node_long)
        lat_dist = abs(current_lat-current_node_lat)
        distance = math.sqrt(math.pow((111000*lat_dist),2)+math.pow((long_dist*111000*math.cos(current_lat)),2))
        return distance
    def haversine_distance(current_long,current_lat,current_node_long,current_node_lat): #funktion to calculate the haversine distance between a node and a point.
        R = 6372800 # this is in miles.  For Earth radius in kilometers use 6372.8 km
        long_dist = abs(current_long-current_node_long)
        lat_dist = abs(current_lat-current_node_lat)
        dLat = (3.14159265359/180) * lat_dist
        dLon = (3.14159265359/180) * long_dist
        lat1 = (3.14159265359/180) * current_lat
        lat2 = (3.14159265359/180) * current_node_lat
        a = math.sin(dLat/2)*math.sin(dLat/2) + math.cos(lat1)*math.cos(lat2)*math.sin(dLon/2)*math.sin(dLon/2)
        c = 2*math.atan2(math.sqrt(a), math.sqrt(1-a))
        return R * c
    i = cuda.grid(1)
    if i <lenght:
        point_long=long[i]
        point_lat =lat[i]
        min_dist = 10000000
        node_numb=-1
        for j in range(0,node_length):
            dist_to_node = get_distance(point_long,point_lat,node_long[j],node_lat[j])
            if(dist_to_node<min_dist):
                min_dist = dist_to_node
                node_numb=node_number[j]
        node[i]=node_numb
        distance[i]=min_dist

In [None]:
def get_nodes_for_frame(df,Graph,cd_node): 
    # df = Dataframe containing pickup and dropoff coordinates
    # graph = Graph from osmnx with the streetnetwork
    node_numb_arr=cd_node['Node'].to_gpu_array()
    node_long_arr=cd_node['longitude'].to_gpu_array()
    node_lat_arr=cd_node['latitude'].to_gpu_array()
    node_length=cd_node.shape[0]
    pickup_long_arr = df['pickup_longitude'].to_gpu_array()
    pickup_lat_arr  = df['pickup_latitude'].to_gpu_array()
    drop_long_arr   = df['dropoff_longitude'].to_gpu_array()
    drop_lat_arr    = df['dropoff_latitude'].to_gpu_array()
    df['pickup_node']=-1
    df['dropoff_node']=-1
    df['distance_to_pickup']=-1.0
    df['distance_to_dropoff']=-1.0
    pickup_node_arr = df['pickup_node'].to_gpu_array()
    dropoff_node_arr = df['dropoff_node'].to_gpu_array()
    pickup_dist_arr = df['distance_to_pickup'].to_gpu_array()
    dropoff_dist_arr = df['distance_to_dropoff'].to_gpu_array()
    df_length = df.shape[0]
    blockspergrid = (df_length+(128-1)) #defining blocks per grid and threads per block, further information under https://numba.readthedocs.io/en/stable/cuda/kernels.html
    calc_nearest_point[blockspergrid,128](pickup_long_arr,pickup_lat_arr,pickup_node_arr,pickup_dist_arr,df_length,node_numb_arr,node_long_arr,node_lat_arr,node_length)
    cuda.synchronize()
    calc_nearest_point[blockspergrid,128](drop_long_arr,drop_lat_arr,dropoff_node_arr,dropoff_dist_arr,df_length,node_numb_arr,node_long_arr,node_lat_arr,node_length)
    cuda.synchronize()
    df['pickup_node']=pickup_node_arr
    df['dropoff_node']=dropoff_node_arr
    df['distance_to_pickup']=pickup_dist_arr
    df['distance_to_dropoff']=dropoff_dist_arr
    del pickup_node_arr
    del dropoff_node_arr
    del pickup_dist_arr
    del dropoff_dist_arr
    del pickup_long_arr
    del pickup_lat_arr
    del drop_long_arr
    del drop_lat_arr
    del df_length
    
    return df

## Third step
The third step is the shortest path calculation. I found an existing implementation of the single source shortest path algorithm included in RAPIDS cuspatial. Calling sssp with the graph and a node/vertex id as input will result in a dataframe containing all shortest path lengths from that source to all nodes in the graph. To assign the correct results to each source-destination-pair I wrote my own numba function. I later found out, that this can also be achieved with cudf.merge(), but therefor I would need to make some adjustments, and the numba kernelfunction seems to work just fine for me.

In [None]:
@cuda.jit
def kernelfunktionzehn(cal_src,source, destination,cost, cudf_length, vertex,weight,nodes):
    # calc_src =  current Source for which the Single Source Shortest Path calculation was made
    # source = pickup node column
    # destination = dropoff node column 
    # cost = edgecost column that we want to compute
    # cudf_length = length of the dataframe containing each taxi ride, for which we want to compute the routes.
    # vertex = destination for which we computed the shortest route length
    # weight = computed cost for the shortest route to the corresponding vertex
    # nodes = number of nodes, for which the shortest route was computed.
  i = cuda.grid(1)
  if i <cudf_length:
    if (source[i] == cal_src):
      for j in range(0,nodes):
        if (vertex[j]==destination[i]):
          cost[i]=weight[j]

In [None]:

gpu = cuda.get_current_device()
def group_and_find_cost_basic(df,Graph,G_zwei):
  # df = Dataframe containing pickup and dropoff coordinates
  # graph = Graph from osmnx with the streetnetwork  
    
    #print("Build graph")
    
    df['shortest_path_length'] = -1
    df['shortest_path_length'] = df['shortest_path_length'].astype('float32')
    #value_series_pd = df['pickup_node'].to_pandas()
    #unique_values = value_series_pd.unique()
    unique_values = df['pickup_node'].unique().to_pandas()
    #print("Transferring series as array to gpu")
    cost_arr    = df['shortest_path_length'].to_gpu_array()
    src_arr     = df['pickup_node'].to_gpu_array()
    dst_arr     = df['dropoff_node'].to_gpu_array()
    group_length = df.shape[0]
    #counter = len(unique_values)
    #print("Before for loop")
    for name in unique_values: #iterating over all unique pickup nodes
        distances = cugraph.sssp(G_zwei,name) # Single Source Shortest Path Calculation for one source node
        #distances_zwei= cugraph.filter_unreachable(distances) #Filter that can be helpful, but isn't used in this case
        blockspergrid = (group_length+(128-1)) #defining blocks per grid and threads per block, further information under https://numba.readthedocs.io/en/stable/cuda/kernels.html
        vertex_arr    = distances['vertex'].to_gpu_array()
        distance_arr  = distances['distance'].to_gpu_array() 
        nodes_length  = distances.shape[0]
        #cost_arr    = df['fastest_path_time'].to_gpu_array()
        #sec_cost_arr    = df['fastest_path_length'].to_gpu_array()
        #src_arr     = df['pickup_node'].to_gpu_array()
        #dst_arr     = df['fastest_path_length'].to_gpu_array()
        kernelfunktionzehn[blockspergrid,128](name,src_arr,dst_arr,cost_arr,group_length,
                                         vertex_arr,distance_arr, nodes_length) # Kernelfunction to combine the dataframe containing the routes and costs with the dataframe containing the pickup and dropoff nodes.
        #cost_arr.to_host()
        #src_arr.to_host()
        #dst_arr.to_host()
        #df['fastest_path_time'] = cost_arr
        #df['fastest_path_length'] = sec_cost_arr
        cuda.synchronize()  # important to counter any possible race condition or memory inconsistency due to parallelization
        #counter = counter - 1
        #print("{} Durchläufe verbleibend".format(counter))
        
        del distances
    df['shortest_path_length'] = cost_arr
    #del G_zwei
    #del cd_edge
    #del pd_edge
    #del cd_edge
    return df

In [None]:
def calc_routes_and_times_zwei(df,Graph,cd_Graph,cd_node):
    start_time = time.time()
    #print("Nodebestimmung gestartet")
    df_result_eins = get_nodes_for_frame(df,Graph,cd_node)
    #print("df_result_eins typ = {}".format(type(df_result_eins)))
    #print("Nodes bestimmt")
    end_time = time.time()
    duration = end_time-start_time
    print("Runtime Nearest Node Search: {} seconds".format(duration))
    
    #print("Spalten wurden eingeführt")
    df_result_eins = df_result_eins.sort_values('pickup_node')
    start_time = time.time()
    df_result_eins = group_and_find_cost_basic(df_result_eins,Graph,cd_Graph)
    print("shortest path found")
    
    end_time = time.time()
    duration = end_time-start_time
    print("runtime  of shortest path calculation: {} seconds".format(duration))
    
    df_result_eins['shortest_path_length']=df_result_eins['shortest_path_length']+df_result_eins['distance_to_pickup']+df_result_eins['distance_to_dropoff']
    df_result_eins['shortest_path_time']=0
    df_result_eins = df_result_eins.drop(columns = ['pickup_node'])
    df_result_eins = df_result_eins.drop(columns = ['dropoff_node'])
    df_result_eins = df_result_eins.drop(columns = ['distance_to_pickup'])
    df_result_eins = df_result_eins.drop(columns = ['distance_to_dropoff'])
    df_result_eins = df_result_eins.drop(columns = ['shortest_path_time'])
    return df_result_eins

## Final step
The final step is to load each dataframe part and execute the code with the correct graph

In [None]:
def load_part(Filename):
    start_time = time.time()
    df = cudf.read_csv(('split_parts/'+Filename+'.csv'), usecols=Spalten_nach_bearbeitung, dtype=types_nach_bearbeitung)#,nrows=10000000
    end_time = time.time()
    duration = end_time-start_time
    print("Loadtime DataFrame: {}".format(duration))
    
    return df

In [None]:
import time
def calculate_splitpart(df_List):
    result_list = []
    df_result_concat=None
    for i in range(0,len(df_List)):
        print("======================================================================")
        G=None
        start_time = time.time()
        if(i<3):
            print("Manhattan Part {} started".format(i+1))
            G = ox.graph_from_place('Manhattan, New York, USA',simplify=True,truncate_by_edge=True,custom_filter='["area"!~"yes"]["highway"~"motorway|trunk|primary|secondary|tertiary|residential|motorway_link|trunk_link|primary_link|secondary_link|tertiary_link"]["highway"!~"service"]') 
        if(i==3):
            print("Brooklyn Part started")
            G = ox.graph_from_place(['Brooklyn, New York, USA'],simplify=True,truncate_by_edge=True,custom_filter='["area"!~"yes"]["highway"~"motorway|trunk|primary|secondary|tertiary|residential|motorway_link|trunk_link|primary_link|secondary_link|tertiary_link"]["highway"!~"service"]') 
        if(i==4):
            print("Brooklyn-Queens Part started")
            G = ox.graph_from_bbox(40.885,40.49,-74.05,-73.69,simplify=True,custom_filter='["area"!~"yes"]["highway"~"motorway|trunk|primary|secondary|tertiary|residential|motorway_link|trunk_link|primary_link|secondary_link|tertiary_link"]["highway"!~"service"]')
        if(i==5):
            print("Long Island Part started")
            G = ox.graph_from_place(['Long Island, USA'],simplify=True,truncate_by_edge=True,custom_filter='["area"!~"yes"]["highway"~"motorway|trunk|primary|secondary|tertiary|motorway_link|trunk_link|primary_link|secondary_link|tertiary_link"]["highway"!~"service"]') 
        if(i==6):
            print("Staten Island Part started")
            G = ox.graph_from_place('Staten Island, New York, USA',simplify=True,truncate_by_edge=True,custom_filter='["area"!~"yes"]["highway"~"motorway|trunk|primary|secondary|tertiary|residential|motorway_link|trunk_link|primary_link|secondary_link|tertiary_link"]["highway"!~"service"]')
        if(i==7):
            print("Between Islands and Most Inner Part started")
            G = ox.graph_from_bbox(40.885,40.49,-74.05,-73.69,simplify=True,custom_filter='["area"!~"yes"]["highway"~"motorway|trunk|primary|secondary|tertiary|residential|motorway_link|trunk_link|primary_link|secondary_link|tertiary_link"]["highway"!~"service"]')
        if(i==8):
            print("Between Islands and Inner Part started")
            G = ox.graph_from_bbox(40.93,40.49,-74.256,-73.69,simplify=True,custom_filter='["area"!~"yes"]["highway"~"motorway|trunk|primary|secondary|tertiary|residential|motorway_link|trunk_link|primary_link|secondary_link|tertiary_link"]["highway"!~"service"]')
        if(i==9):
            print("Between Islands and Outer Part started")
            G = ox.graph_from_bbox(41.71,40.48,-74.352,-72,simplify=True,custom_filter='["area"!~"yes"]["highway"~"motorway|trunk|primary|secondary|tertiary|motorway_link|trunk_link|primary_link|secondary_link|tertiary_link"]["highway"!~"service"]') 
        end_time = time.time()
        duration = end_time-start_time
        print("loadtime graph: {} seconds".format(duration))
        
        print(nx.info(G))
        #ox.add_edge_speeds(G,precision=1)
        #ox.add_edge_travel_times(G, precision=1)
        start_time = time.time()
        df_part = load_part(df_List[i])
        G_zwei, cd_node = create_cu_Graph(G)
        
        df_result = None
        if(df_part.shape[0]>0):
            #print(type(df_result))
            df_result = calc_routes_and_times_zwei(df_part,G,G_zwei,cd_node)
        end_time = time.time()
        duration = end_time-start_time
        print("runtime of calculation for part: {} seconds".format(duration))
        
        #print(df_result)
        if(df_result is None):
            df_result = None
        else:
            df_result_pd = df_result.to_pandas()
            result_list.append(df_result_pd)
        df_result = None
        del df_result
        del df_part
    print("======================================================================")
    df_result_concat_pd = pd.concat(result_list)
    df_result_concat = cudf.from_pandas(df_result_concat_pd)
    del df_result_concat_pd
    df_result_concat.shape[0]
    del result_list
    return df_result_concat

In [None]:
!nvidia-smi

## Execution
With a 2080ti the runtime for all parts was around 1 hour and 40 minutes. This time may vary depending on your hardware. 

In [None]:
%%time
#df_train = calculate_splitpart(splitted_list_pd)
df_train = calculate_splitpart(splitted_list)

In [None]:
df_train.head()

In [None]:
df_train.describe()

In [None]:
df_train.shape[0]

In [None]:
%%time
df_train.to_csv('df_train_after_route_calculation.csv',index=False,chunksize=1000000)

## Final words

I hope when you now try to check the correlation between shortest_path_length and fare_amount as well as the correlation between Manhattan_distance and fare_amount, you will see the same improvement I saw when I used the data for my project. I hope I was able to show you why this approach might be better than a simpler distance calculation and perhaps even the Borough labels are helpful for your machine learning algorithm. Maybe you are interested in making your own improvements to that notebook. There are multiple possibilities like tracing back each path and calculating the travel time for the shortest path on your own, or maybe changing the edgeweight so it represents the dollar fare value by combining both travel_time and length of each edge. Maybe you can even use that notebook to calculate when and how often a node or edge is included in a path of a ride, so you can predict traffic jams or adjust the travel_time in accordance to the time of a day.  The possibilities are almost endless. I hope you can have some fun with that notebook and enjoyed reading it. If so feel free to leave a comment below.


At the End, I want to give a special thank you to my friend Sanja Hoermann, how studies geoinformatics and helped me with her ideas to realize this notebook. If you like my work or hers, feel free to follow us on LinkedIn.

Me: https://www.linkedin.com/in/jonas-ziegler-108120200/ 

Sanja: https://www.linkedin.com/in/sanja-maria-hoermann-52810a214/ 