# Streetcar Delay Prediction - Data Preparation Geocode Specific

This project uses a dataset covering Toronto Transit Commission (TTC) streetcar delays from 2014 to the present to predict future delays and to recommend ways of avoiding them.

Source dataset: https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#e8f359f0-2f47-3058-bf64-6ec488de52da

This notebook contains the data preparation steps specific to mapping free-form location descriptions to latitude and longitude:

- use the Google Maps API Web Services for Python: https://github.com/googlemaps/google-maps-services-python
- generate the latitude and longitude values for each location and add them as new columns in the output dataset

# Streetcar routes

From https://www.ttc.ca/Routes/Streetcars.jsp

<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/streetcarnov3/master/streetcar%20routes.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>

In [7]:
! pwd

/notebooks/manning/notebooks


# Get path and load dataframe saved from previous data preparation step

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
# import seaborn as sns
import datetime
import os

remove_bad_values = False
city_name = 'Toronto'


In [32]:
# get the directory that this notebook is in
rawpath = os.getcwd()
print("raw path is",rawpath)

raw path is /notebooks/manning/notebooks


In [33]:
# data is in a directory called "data" that is a sibling to the directory containing the notebook
path = os.path.abspath(os.path.join(rawpath, '..', 'data')) + "/"
print("path is", path)

path is /notebooks/manning/data/
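The same sibling-directory lookup can also be sketched with `pathlib`. The helper name `data_dir_for` is hypothetical and is not defined anywhere else in this notebook; the example path is the notebook directory printed above.

```python
from pathlib import Path

def data_dir_for(notebook_dir):
    # hypothetical helper: the "data" directory that is a sibling of the notebook directory
    return Path(notebook_dir).parent / "data"

# example using the notebook directory printed above
data_dir = data_dir_for("/notebooks/manning/notebooks")
print("path is", data_dir)

# files are then addressed with the / operator rather than string concatenation
print(data_dir / "2014_2018_df_cleaned_keep_bad.pkl")
```

With a `Path` object there is no need to append a trailing "/" before joining a file name.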


In [34]:
# constants for the streetcar problem
# same values saved in data_preparation notebook: pickled_input_dataframe, pickled_output_dataframe
pickled_data_file = '2014_2018.pkl'
#pickled_dataframe = '2014_2018_df.pkl'
pickled_dataframe = '2014_2018_df_cleaned_keep_bad.pkl'
pickled_output_dataframe = '2014_2018_df_cleaned_keep_bad_loc_geocoded.pkl'

In [35]:
file_name = path + pickled_dataframe
df = pd.read_pickle(file_name)
df.head()

Report Date Time (index),Report Date,Route,Time,Day,Location,Incident,Min Delay,Min Gap,Direction,Vehicle,Report Date Time
2015-01-01 01:25:00,2015-01-01,504,01:25:00,Thursday,broadview and gerrard,Mechanical,9.0,18.0,s,4092,2015-01-01 01:25:00
2015-01-01 01:44:00,2015-01-01,504,01:44:00,Thursday,galley and roncesvalles,Held By,14.0,23.0,s,4030,2015-01-01 01:44:00
2015-01-01 02:04:00,2015-01-01,504,02:04:00,Thursday,king and sherborne,Mechanical,9.0,18.0,e,4147,2015-01-01 02:04:00
2015-01-01 02:12:00,2015-01-01,306,02:12:00,Thursday,main st. and upper gerard,Investigation,29.0,39.0,s,4049,2015-01-01 02:12:00
2015-01-01 05:05:00,2015-01-01,306,05:05:00,Thursday,gerrard and sumach,Mechanical,30.0,60.0,w,4114,2015-01-01 05:05:00


In [7]:
df.shape

(83365, 11)

# Set up geocode

In [3]:
! pip install -U googlemaps

Collecting googlemaps
  Downloading https://files.pythonhosted.org/packages/5a/3d/13b4230f3c1b8a586cdc8d8179f3c6af771c11247f8de9c166d1ab37f51d/googlemaps-3.0.2.tar.gz
Requirement not upgraded as not directly required: requests<3.0,>=2.11.1 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from googlemaps)
Requirement not upgraded as not directly required: chardet<3.1.0,>=3.0.2 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from requests<3.0,>=2.11.1->googlemaps)
Requirement not upgraded as not directly required: idna<2.7,>=2.5 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from requests<3.0,>=2.11.1->googlemaps)
Requirement not upgraded as not directly required: urllib3<1.23,>=1.21.1 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from requests<3.0,>=2.11.1->googlemaps)
Requirement not upgraded as not directly required: certifi>=2017.4.17 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from requests<3.0,>=2.11.1->googlema

In [4]:
# The code was removed by Watson Studio for sharing.

geocode result []


In [9]:
'''# problem is get IndexError: list index out of range
# 4     locs = geocode_result[0]["geometry"]["location"]

locs = geocode_result[0]["geometry"]["location"]
print("locs ",locs)'''

locs  {'lat': 43.653226, 'lng': -79.3831843}


In [10]:
'''# data["results"][0]["geometry"]["location"]
locs = geocode_result[0]["geometry"]["location"]
print("locs",locs)
lats = locs["lat"]
print("lats",lats)'''

locs {'lat': 43.653226, 'lng': -79.3831843}
lats 43.653226


In [6]:
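def get_geocode_result(junction):
    '''Return [latitude, longitude] for a free-form junction description in city_name.'''
    # gmaps is assumed to be the Google Maps client created in the cell removed above
    geo_string = junction + ", " + city_name
    # print("geo_string is", geo_string)
    geocode_result = gmaps.geocode(geo_string)
    # an empty result means the junction value could not be parsed;
    # return zeros in that case instead of indexing into an empty list
    if len(geocode_result) > 0:
        locs = geocode_result[0]["geometry"]["location"]
        return [locs["lat"], locs["lng"]]
    else:
        return [0.0, 0.0]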



In [7]:
locs = get_geocode_result("roncesvalles to longbranch")
print("locs ",locs)

locs  [0.0, 0.0]
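The empty-result guard can be exercised without calling the API. The following is a minimal sketch, assuming a stand-in client: `StubGeocoder` and `geocode_or_zeros` are hypothetical names, and the coordinates in the canned response are made up for illustration rather than real geocoding output.

```python
class StubGeocoder:
    # hypothetical stand-in for the real client: returns canned responses, makes no API calls
    def __init__(self, responses):
        self.responses = responses

    def geocode(self, address):
        # unknown addresses produce an empty list, like an unparseable junction
        return self.responses.get(address, [])

def geocode_or_zeros(client, junction, city="Toronto"):
    # same guard as get_geocode_result, with the client passed in explicitly
    results = client.geocode(junction + ", " + city)
    if results:
        loc = results[0]["geometry"]["location"]
        return [loc["lat"], loc["lng"]]
    return [0.0, 0.0]

# made-up coordinates, for illustration only
stub = StubGeocoder({
    "queen and bathurst, Toronto": [{"geometry": {"location": {"lat": 1.5, "lng": 2.5}}}]
})

known = geocode_or_zeros(stub, "queen and bathurst")
unknown = geocode_or_zeros(stub, "roncesvalles to longbranch")
print(known, unknown)
```

Passing the client as an argument is a design choice made for this sketch only; the notebook's own function relies on a global `gmaps` object.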


In [8]:
get_geocode_result("queen and bathurst")[0]

43.6471969

In [40]:
df.shape

(83365, 11)

In [15]:
df_cut = df[:10]
df_cut.shape

(10, 11)

In [16]:
df_cut.head()

Report Date Time (index),Report Date,Route,Time,Day,Location,Incident,Min Delay,Min Gap,Direction,Vehicle,Report Date Time
2015-01-01 01:25:00,2015-01-01,504,01:25:00,Thursday,broadview and gerrard,Mechanical,9.0,18.0,s,4092,2015-01-01 01:25:00
2015-01-01 01:44:00,2015-01-01,504,01:44:00,Thursday,galley and roncesvalles,Held By,14.0,23.0,s,4030,2015-01-01 01:44:00
2015-01-01 02:04:00,2015-01-01,504,02:04:00,Thursday,king and sherborne,Mechanical,9.0,18.0,e,4147,2015-01-01 02:04:00
2015-01-01 02:12:00,2015-01-01,306,02:12:00,Thursday,main st. and upper gerard,Investigation,29.0,39.0,s,4049,2015-01-01 02:12:00
2015-01-01 05:05:00,2015-01-01,306,05:05:00,Thursday,gerrard and sumach,Mechanical,30.0,60.0,w,4114,2015-01-01 05:05:00


In [17]:
df_cut.shape

(10, 11)

In [41]:

# df.merge(df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1})), 
#    left_index=True, right_index=True)
# df['Route'] = df['Route'].apply(lambda x:check_route(x))
# merge two new columns into the dataframe by applying the get_geocode_result function to the Location values,
# with the first result populating the Latitude col and the second result populating the Longitude col
# small sample - saved as 2014_2018_df_cleaned_keep_bad_loc_geocoded_first100.pkl

# df_cut = df_cut.merge(df_cut.Location.apply(lambda s: pd.Series({'Latitude':get_geocode_result(s)[0],'Longitude':get_geocode_result(s)[1]})),left_index=True, right_index=True)

# to avoid making multiple calls to the geocode API for each row, bring in the latitude and longitude values
# as a single list column; the separate latitude and longitude columns are derived from it in a later cell
df['lat_long'] = df.Location.apply(lambda s: get_geocode_result(s))

# same merge approach applied to the full dataframe, with the first result populating the Latitude col
# and the second result populating the Longitude col (this calls the geocode API twice per row)
# df = df.merge(df.Location.apply(lambda s: pd.Series({'Latitude':get_geocode_result(s)[0],'Longitude':get_geocode_result(s)[1]})),left_index=True, right_index=True)

Timeout: 
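The full dataframe has 83365 rows but, according to the counts reported later in this notebook, only 10074 distinct Location values, so one way to cut the number of API calls is to geocode each distinct value once and reuse the result. The following is a minimal sketch of that idea, not the approach this notebook actually ran: `geocode_unique` and `fake_geocode` are hypothetical names, and the fake geocoder returns made-up coordinates so the example runs offline.

```python
def geocode_unique(locations, geocode_fn):
    # call geocode_fn once per distinct location and reuse the result for repeats
    locations = list(locations)
    cache = {}
    for loc in locations:
        if loc not in cache:
            cache[loc] = geocode_fn(loc)
    return [cache[loc] for loc in locations], cache

calls = []

def fake_geocode(loc):
    # hypothetical stand-in for get_geocode_result: records the call, returns made-up coordinates
    calls.append(loc)
    return [float(len(loc)), 0.0]

sample = ["king and sherborne", "gerrard and sumach", "king and sherborne"]
results, cache = geocode_unique(sample, fake_geocode)
print("rows:", len(results), "geocode calls:", len(calls))
```

In the notebook, `geocode_unique(df.Location, get_geocode_result)` would return the same kind of lookup dictionary, which could then be applied with `df['lat_long'] = df.Location.map(cache)`.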

In [95]:
df_cut.head()

Report Date Time (index),Report Date,Route,Time,Day,Location,Incident,Min Delay,Min Gap,Direction,Vehicle,Report Date Time,lat_long
2015-01-01 01:25:00,2015-01-01,504,01:25:00,Thursday,broadview and gerrard,Mechanical,9.0,18.0,s,4092,2015-01-01 01:25:00,"[43.6654831, -79.35263359999999]"
2015-01-01 01:44:00,2015-01-01,504,01:44:00,Thursday,galley and roncesvalles,Held By,14.0,23.0,s,4030,2015-01-01 01:44:00,"[43.6428252, -79.4477026]"
2015-01-01 02:04:00,2015-01-01,504,02:04:00,Thursday,king and sherborne,Mechanical,9.0,18.0,e,4147,2015-01-01 02:04:00,"[43.6580047, -79.3710098]"
2015-01-01 02:12:00,2015-01-01,306,02:12:00,Thursday,main st. and upper gerard,Investigation,29.0,39.0,s,4049,2015-01-01 02:12:00,"[43.6841917, -79.3004627]"
2015-01-01 05:05:00,2015-01-01,306,05:05:00,Thursday,gerrard and sumach,Mechanical,30.0,60.0,w,4114,2015-01-01 05:05:00,"[43.66315549999999, -79.3614893]"


In [96]:
# derive latitude and longitude columns from list column
# df["new_col"] = df["A"].str[0]
df["latitude"] = df["lat_long"].str[0]
df["longitude"] = df["lat_long"].str[1]
df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


Report Date Time (index),Report Date,Route,Time,Day,Location,Incident,Min Delay,Min Gap,Direction,Vehicle,Report Date Time,lat_long,latitude,longitude
2015-01-01 01:25:00,2015-01-01,504,01:25:00,Thursday,broadview and gerrard,Mechanical,9.0,18.0,s,4092,2015-01-01 01:25:00,"[43.6654831, -79.35263359999999]",43.665483,-79.352634
2015-01-01 01:44:00,2015-01-01,504,01:44:00,Thursday,galley and roncesvalles,Held By,14.0,23.0,s,4030,2015-01-01 01:44:00,"[43.6428252, -79.4477026]",43.642825,-79.447703
2015-01-01 02:04:00,2015-01-01,504,02:04:00,Thursday,king and sherborne,Mechanical,9.0,18.0,e,4147,2015-01-01 02:04:00,"[43.6580047, -79.3710098]",43.658005,-79.37101
2015-01-01 02:12:00,2015-01-01,306,02:12:00,Thursday,main st. and upper gerard,Investigation,29.0,39.0,s,4049,2015-01-01 02:12:00,"[43.6841917, -79.3004627]",43.684192,-79.300463
2015-01-01 05:05:00,2015-01-01,306,05:05:00,Thursday,gerrard and sumach,Mechanical,30.0,60.0,w,4114,2015-01-01 05:05:00,"[43.66315549999999, -79.3614893]",43.663155,-79.361489


In [44]:
df.shape

(83365, 11)

In [39]:
df.head()


Report Date Time (index),Report Date,Route,Time,Day,Location,Incident,Min Delay,Min Gap,Direction,Vehicle,Report Date Time,Latitude,Longitude
2015-01-01 01:25:00,2015-01-01,504,01:25:00,Thursday,broadview and gerrard,Mechanical,9.0,18.0,s,4092,2015-01-01 01:25:00,43.665483,-79.352634
2015-01-01 01:44:00,2015-01-01,504,01:44:00,Thursday,galley and roncesvalles,Held By,14.0,23.0,s,4030,2015-01-01 01:44:00,43.642825,-79.447703
2015-01-01 02:04:00,2015-01-01,504,02:04:00,Thursday,king and sherborne,Mechanical,9.0,18.0,e,4147,2015-01-01 02:04:00,43.658005,-79.37101
2015-01-01 02:12:00,2015-01-01,306,02:12:00,Thursday,main st. and upper gerard,Investigation,29.0,39.0,s,4049,2015-01-01 02:12:00,43.684192,-79.300463
2015-01-01 05:05:00,2015-01-01,306,05:05:00,Thursday,gerrard and sumach,Mechanical,30.0,60.0,w,4114,2015-01-01 05:05:00,43.663155,-79.361489


# Remove bad rows

In [54]:
print("Location count post cleanup:",df['Location'].nunique())
print("Route count post cleanup:",df['Route'].nunique())
print("Direction count post cleanup:",df['Direction'].nunique())
print("Vehicle count post cleanup:",df['Vehicle'].nunique())
# print("Bad Location count":df[df.Vehicle == 'bad vehicle'].shape[0])
print("Bad route count:",df[df.Route == 'bad route'].shape[0])
print("Bad direction count:",df[df.Direction == 'bad direction'].shape[0])
print("Bad vehicle count:",df[df.Vehicle == 'bad vehicle'].shape[0])

Location count post cleanup: 10074
Route count post cleanup: 15
Direction count post cleanup: 6
Vehicle count post cleanup: 1017
Bad route count: 3091
Bad direction count: 334
Bad vehicle count: 14480


In [55]:
# remove rows with bad vehicle, direction, or route values (only when remove_bad_values is True)
if remove_bad_values:
    df = df[df.Vehicle != 'bad vehicle']
    df = df[df.Direction != 'bad direction']
    df = df[df.Route != 'bad route']

In [56]:
df.shape

(66095, 11)

In [40]:
# pickle the geocoded dataframe (df_cut, the small geocoded sample, is what gets saved here)
file_name = path + pickled_output_dataframe
df_cut.to_pickle(file_name)

In [36]:
dfn.shape

(100, 11)

In [41]:
dfn = pd.read_pickle(file_name)
dfn.head()

Report Date Time (index),Report Date,Route,Time,Day,Location,Incident,Min Delay,Min Gap,Direction,Vehicle,Report Date Time,Latitude,Longitude
2015-01-01 01:25:00,2015-01-01,504,01:25:00,Thursday,broadview and gerrard,Mechanical,9.0,18.0,s,4092,2015-01-01 01:25:00,43.665483,-79.352634
2015-01-01 01:44:00,2015-01-01,504,01:44:00,Thursday,galley and roncesvalles,Held By,14.0,23.0,s,4030,2015-01-01 01:44:00,43.642825,-79.447703
2015-01-01 02:04:00,2015-01-01,504,02:04:00,Thursday,king and sherborne,Mechanical,9.0,18.0,e,4147,2015-01-01 02:04:00,43.658005,-79.37101
2015-01-01 02:12:00,2015-01-01,306,02:12:00,Thursday,main st. and upper gerard,Investigation,29.0,39.0,s,4049,2015-01-01 02:12:00,43.684192,-79.300463
2015-01-01 05:05:00,2015-01-01,306,05:05:00,Thursday,gerrard and sumach,Mechanical,30.0,60.0,w,4114,2015-01-01 05:05:00,43.663155,-79.361489


# Visualize cleaned data

In [None]:
!pip install pixiedust

In [None]:
import pixiedust

In [None]:
display(df)