# Streetcar Delay Prediction - Data Preparation

Use a dataset covering Toronto Transit Commission (TTC) streetcar delays from 2014 to the present to predict future delays and recommend ways to avoid them.

Source dataset: https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#e8f359f0-2f47-3058-bf64-6ec488de52da

This notebook contains the common data loading and preparation steps:
- load data from all the tabs of all the XLS files into a single dataframe
- correct type issues
- fix missing values
- clean up anomalies in the location, direction and vehicle columns

# Streetcar routes

From https://www.ttc.ca/Routes/Streetcars.jsp

<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/streetcarnov3/master/streetcar%20routes.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>

# Streetcar vehicle IDs CLRV/ALRV

From https://en.wikipedia.org/wiki/Toronto_streetcar_system_rolling_stock#CLRVs_and_ALRVs

<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/streetcarnov3/master/streetcarCLRV.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>

In [100]:
streetcar_vehicles = list(range(4000,4005))+ list(range(4010,4200)) +  list(range(4200,4252)) + [4900]
streetcar_vehicles = streetcar_vehicles + [4400] + list(range(4402,4508))
print("valid streetcars",streetcar_vehicles)

valid streetcars [4000, 4001, 4002, 4003, 4004, 4010, 4011, 4012, 4013, 4014, 4015, 4016, 4017, 4018, 4019, 4020, 4021, 4022, 4023, 4024, 4025, 4026, 4027, 4028, 4029, 4030, 4031, 4032, 4033, 4034, 4035, 4036, 4037, 4038, 4039, 4040, 4041, 4042, 4043, 4044, 4045, 4046, 4047, 4048, 4049, 4050, 4051, 4052, 4053, 4054, 4055, 4056, 4057, 4058, 4059, 4060, 4061, 4062, 4063, 4064, 4065, 4066, 4067, 4068, 4069, 4070, 4071, 4072, 4073, 4074, 4075, 4076, 4077, 4078, 4079, 4080, 4081, 4082, 4083, 4084, 4085, 4086, 4087, 4088, 4089, 4090, 4091, 4092, 4093, 4094, 4095, 4096, 4097, 4098, 4099, 4100, 4101, 4102, 4103, 4104, 4105, 4106, 4107, 4108, 4109, 4110, 4111, 4112, 4113, 4114, 4115, 4116, 4117, 4118, 4119, 4120, 4121, 4122, 4123, 4124, 4125, 4126, 4127, 4128, 4129, 4130, 4131, 4132, 4133, 4134, 4135, 4136, 4137, 4138, 4139, 4140, 4141, 4142, 4143, 4144, 4145, 4146, 4147, 4148, 4149, 4150, 4151, 4152, 4153, 4154, 4155, 4156, 4157, 4158, 4159, 4160, 4161, 4162, 4163, 4164, 4165, 4166, 4167, 4168

# Streetcar vehicle IDs Flexity

From https://en.wikipedia.org/wiki/Toronto_streetcar_system_rolling_stock#CLRVs_and_ALRVs

<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/streetcarnov3/master/streetcarflexity.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>

# Bus identification
The following links define the valid non-streetcar vehicles that can be delayed by streetcar incidents:

- Buses 1xxx: https://cptdb.ca/wiki/index.php/Toronto_Transit_Commission_1000-1149
- Buses 2xxx: https://cptdb.ca/wiki/index.php/Toronto_Transit_Commission_2000-2110,_2150-2155,_2240-2485,_2600-2619,_2700-2765,_2767-2858
- Buses 70xx: https://cptdb.ca/wiki/index.php/Toronto_Transit_Commission_7000-7134
- Buses 74xx: https://cptdb.ca/wiki/index.php/Toronto_Transit_Commission_7400-7499,_7500-7619,_7620-7881
- Buses 8xxx: https://cptdb.ca/wiki/index.php/Toronto_Transit_Commission_8000-8099
- Buses 9xxx: https://cptdb.ca/wiki/index.php/Toronto_Transit_Commission_9000-9026







In [101]:
bus_vehicles = list(range(1000,1150))+ list(range(2000,2111)) + list(range(2150,2156)) + list(range(2240,2486))
bus_vehicles = bus_vehicles + list(range(2600,2620)) + list(range(2700,2766)) + list(range(2767,2859))
bus_vehicles = bus_vehicles + list(range(7000,7135)) + list(range(7400,7450)) + list(range(7500,7620)) + list(range(7620,7882))
bus_vehicles = bus_vehicles + list(range(8000,8100)) + list(range(9000,9027))
valid_vehicles = streetcar_vehicles + bus_vehicles
print("valid vehicles",valid_vehicles)

valid vehicles [4000, 4001, 4002, 4003, 4004, 4010, 4011, 4012, 4013, 4014, 4015, 4016, 4017, 4018, 4019, 4020, 4021, 4022, 4023, 4024, 4025, 4026, 4027, 4028, 4029, 4030, 4031, 4032, 4033, 4034, 4035, 4036, 4037, 4038, 4039, 4040, 4041, 4042, 4043, 4044, 4045, 4046, 4047, 4048, 4049, 4050, 4051, 4052, 4053, 4054, 4055, 4056, 4057, 4058, 4059, 4060, 4061, 4062, 4063, 4064, 4065, 4066, 4067, 4068, 4069, 4070, 4071, 4072, 4073, 4074, 4075, 4076, 4077, 4078, 4079, 4080, 4081, 4082, 4083, 4084, 4085, 4086, 4087, 4088, 4089, 4090, 4091, 4092, 4093, 4094, 4095, 4096, 4097, 4098, 4099, 4100, 4101, 4102, 4103, 4104, 4105, 4106, 4107, 4108, 4109, 4110, 4111, 4112, 4113, 4114, 4115, 4116, 4117, 4118, 4119, 4120, 4121, 4122, 4123, 4124, 4125, 4126, 4127, 4128, 4129, 4130, 4131, 4132, 4133, 4134, 4135, 4136, 4137, 4138, 4139, 4140, 4141, 4142, 4143, 4144, 4145, 4146, 4147, 4148, 4149, 4150, 4151, 4152, 4153, 4154, 4155, 4156, 4157, 4158, 4159, 4160, 4161, 4162, 4163, 4164, 4165, 4166, 4167, 4168, 

# Vehicles that are not legitimate subjects of streetcar incidents
The following vehicles are not legitimate subjects of streetcar incidents because they run on completely separate tracks (RT and subway) or they have been retired (6xxx buses).

- RT cars 3xxx: https://cptdb.ca/wiki/index.php/Toronto_Transit_Commission_3000-3027
- Subway cars 5xxx: https://cptdb.ca/wiki/index.php/Toronto_Transit_Commission_5000-5371
- Retired Buses 6xxx: https://cptdb.ca/wiki/index.php/Toronto_Transit_Commission_6000-6122
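
The ranges above lend themselves to a simple lookup. The following is a minimal sketch rather than the notebook's own code: `INVALID_RANGES`, `classify_vehicle` and `example_valid_ids` are hypothetical names, the excluded ranges are read off the page titles linked above, and the valid IDs are a small illustrative subset.

```python
# Hypothetical sketch: classify a vehicle ID against the excluded fleets.
# Inclusive ranges are taken from the page titles linked above.
INVALID_RANGES = {
    "RT car": (3000, 3027),
    "subway car": (5000, 5371),
    "retired bus": (6000, 6122),
}

def classify_vehicle(vehicle_id, valid_ids):
    """Return 'valid', the name of an excluded fleet, or 'unknown'."""
    if vehicle_id in valid_ids:
        return "valid"
    for fleet, (low, high) in INVALID_RANGES.items():
        if low <= vehicle_id <= high:
            return fleet
    return "unknown"

# A set gives constant-time membership tests, unlike a list.
# Illustrative subset only, not the full list of valid vehicles.
example_valid_ids = set(range(4000, 4005)) | set(range(8000, 8100))

print(classify_vehicle(4001, example_valid_ids))
print(classify_vehicle(5163, example_valid_ids))
print(classify_vehicle(467, example_valid_ids))
```

Distinguishing an excluded fleet from a plain typo may help when deciding whether a row can be repaired or should be dropped.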


In [102]:
# load the valid list of TTC Streetcar routes
valid_routes = ['501','502','503','504','505','506','509','510','511','512','301','304','306','310']

In [103]:
valid_routes

['501',
 '502',
 '503',
 '504',
 '505',
 '506',
 '509',
 '510',
 '511',
 '512',
 '301',
 '304',
 '306',
 '310']

In [104]:
# original valid directions
# valid_directions = ['E/B','W/B','N/B','S/B','B/W']
# revised valid directions to include lowercasing and removal of '/' and simplify to single letter
valid_directions = ['e','w','n','s','b']

In [105]:
valid_days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']

In [106]:
! pwd

/notebooks/manning/notebooks


# Load and Save Data
- parse list of XLS files 
- load XLS files, tab by tab, into dataframe
- pickle dataframe for future runs
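
As a rough illustration of the tab-by-tab loading step, the sketch below stacks a mapping of sheet names to dataframes into one dataframe. It is an assumption-laden alternative, not the notebook's implementation: `combine_sheets`, `load_workbook` and `fake_sheets` are hypothetical names, and the rows are made up so that the example runs without any XLS files.

```python
import pandas as pd

def combine_sheets(sheets):
    """Stack a {sheet_name: DataFrame} mapping into one DataFrame."""
    return pd.concat(list(sheets.values()), ignore_index=True)

def load_workbook(xlsx_path):
    """Read every tab of one workbook (needs an Excel engine such as openpyxl)."""
    # sheet_name=None asks pandas for all sheets as a dict of DataFrames
    return combine_sheets(pd.read_excel(xlsx_path, sheet_name=None))

# In-memory stand-in for two tabs (made-up rows), so no files are needed
fake_sheets = {
    "Jan": pd.DataFrame({"Route": [504, 306], "Min Delay": [9.0, 29.0]}),
    "Feb": pd.DataFrame({"Route": [501], "Min Delay": [5.0]}),
}
combined = combine_sheets(fake_sheets)
print(combined.shape)
```

Note that `ignore_index=True` renumbers the rows, whereas the notebook keeps each sheet's original row labels until the date-time index is created.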

In [107]:
# variables to control function of this notebook
# control whether to load data from scratch from original source or from saved dataframe
load_from_scratch = False
# control whether to save dataframe with transformed data
save_transformed_dataframe = True
# control whether rows containing erroneous values are removed from the saved dataset
remove_bad_values = True

In [108]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
# import seaborn as sns
import datetime
import os


In [109]:
# get the directory that this notebook is in
rawpath = os.getcwd()
print("raw path is",rawpath)

raw path is /notebooks/manning/notebooks


In [110]:
# data is in a directory called "data" that is a sibling to the directory containing the notebook
# this code assumes you have copied to this directory all the XLS files from the source dataset: https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#e8f359f0-2f47-3058-bf64-6ec488de52da
path = os.path.abspath(os.path.join(rawpath, '..', 'data')) + "/"
print("path is", path)

path is /notebooks/manning/data/


In [111]:
# pickled_dataframe = '20142018_df.pkl'
pickled_input_dataframe = '2014_2018.pkl'
# pickled_output_dataframe = '2014_2018_df.pkl'
pickled_output_dataframe = '2014_2018_df_direction_vehicle_cleaned.pkl'
# path,picklename,firstfile, firstsheet



In [112]:
# given a path return the list of xls files in the directory
def get_xls_list(path):
    files = os.listdir(path)
    files_xls = [f for f in files if f[-4:] == 'xlsx']
    print(files)
    print(files_xls)
    return(files_xls)


In [113]:
# load all the tabs of all the XLS files in a list of XLS files, minus tab that has seeded dataframe
def load_xls(path, files_xls, firstfile, firstsheet, df):
    for f in files_xls:
        print("file name",f)
        xlsf = pd.ExcelFile(path+f)
        # iterate through sheets
        for sheet_name in xlsf.sheet_names:
            print("sheet_name",sheet_name)
            if (f != firstfile) or (sheet_name != firstsheet):
                print("sheet_name in loop",sheet_name)
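                # sheet_name is the current keyword (sheetname was deprecated and later removed)
                data = pd.read_excel(path+f,sheet_name=sheet_name)
                # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
                df = pd.concat([df, data])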
    return (df)

In [114]:
# given a path and a filename, load all the XLS files in the path into a dataframe and save
# the dataframe to the filename
def reloader(path,picklename):
    # get list of all xls files in the path
    files_xls = get_xls_list(path)
    print("list of xls",files_xls)
    # seed initial tab on initial xls file
    dfnew = pd.read_excel(path+files_xls[0])
    # get the list of sheets in the first file
    xlsf = pd.ExcelFile(path+files_xls[0])
    # load the remaining tabs from all the other xls
    # pass the first file (files_xls[0]) and the first tab in that file (xlsf.sheet_names[0]) explicitly
    dflatest = load_xls(path,files_xls,files_xls[0],xlsf.sheet_names[0], dfnew)
    # save dataframe to pickle
    dflatest.to_pickle(path+ picklename)
    # return dataframe loaded with all tabs of all xls files
    return(dflatest)
    

In [115]:
# define categories for input columns
def define_feature_categories(df):
    allcols = list(df)
    print("all cols",allcols)
    textcols = ['Incident','Location'] # 
    continuouscols = ['Min Delay','Min Gap'] 
                      # columns to deal with as continuous values - no embeddings
    timecols = ['Report Date','Time']
    collist = ['Day','Vehicle','Route','Direction']
    for col in continuouscols:
        df[col] = df[col].astype(float)
    print('texcols: ',textcols)
    print('continuouscols: ',continuouscols)
    print('timecols: ',timecols)
    print('collist: ',collist)
    return(allcols,textcols,continuouscols,timecols,collist)

In [116]:
# fill missing values according to the column category
def fill_missing(dataset):
    print("before mv")
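    # assign the result back rather than calling fillna(inplace=True) on a column selection,
    # which is not guaranteed to update the dataframe in recent pandas versions
    for col in collist:
        dataset[col] = dataset[col].fillna(value="missing")
    for col in continuouscols:
        dataset[col] = dataset[col].fillna(value=0.0)
    for col in textcols:
        dataset[col] = dataset[col].fillna(value="missing")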
    return (dataset)

# Load dataframe
- load pickled dataframe
- show info about the dataset


In [117]:
# load the 2014 - 2018 data: rebuild it from the XLS files, or read the previously pickled dataframe
if load_from_scratch:
    unpickled_df = reloader(path,pickled_input_dataframe)
    print("reloader done")
    #unpickled_df = pd.read_pickle(path+pickled_data_file)
else:
    unpickled_df = pd.read_pickle(path+pickled_input_dataframe)

In [118]:
df = unpickled_df
df.head()

Unnamed: 0,Report Date,Route,Time,Day,Location,Incident,Min Delay,Min Gap,Direction,Vehicle
0,2015-01-01,504,01:25:00,Thursday,Broadview and Gerrard,Mechanical,9.0,18.0,S/B,4092.0
1,2015-01-01,504,01:44:00,Thursday,Roncesvalles and Galley,Held By,14.0,23.0,S/B,4030.0
2,2015-01-01,504,02:04:00,Thursday,King and Sherborne,Mechanical,9.0,18.0,E/B,4147.0
3,2015-01-01,306,02:12:00,Thursday,Main St. and Upper Gerard,Investigation,29.0,39.0,S/B,4049.0
4,2015-01-01,306,05:05:00,Thursday,Gerrard and Sumach,Mechanical,30.0,60.0,W/B,4114.0


# General cleanup
- correct types for Route and Vehicle
- fill missing values
- create report-date-time index

In [119]:
# ensure Route and Vehicle are strings, not numeric
# df = df.astype({"Route": str, "Vehicle": int})
df['Route'] = df['Route'].astype(str)
df['Vehicle'] = df['Vehicle'].astype(str)
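# strip the trailing '.0' left by the float-to-string conversion
# (missing values become 'nan' and are truncated to 'n'; these are counted in the Vehicle cleanup)
df['Vehicle'] = df['Vehicle'].str[:-2]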
df.head()

Unnamed: 0,Report Date,Route,Time,Day,Location,Incident,Min Delay,Min Gap,Direction,Vehicle
0,2015-01-01,504,01:25:00,Thursday,Broadview and Gerrard,Mechanical,9.0,18.0,S/B,4092
1,2015-01-01,504,01:44:00,Thursday,Roncesvalles and Galley,Held By,14.0,23.0,S/B,4030
2,2015-01-01,504,02:04:00,Thursday,King and Sherborne,Mechanical,9.0,18.0,E/B,4147
3,2015-01-01,306,02:12:00,Thursday,Main St. and Upper Gerard,Investigation,29.0,39.0,S/B,4049
4,2015-01-01,306,05:05:00,Thursday,Gerrard and Sumach,Mechanical,30.0,60.0,W/B,4114
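
The string slicing above removes the trailing `.0`, but it also shortens the text `nan` to `n`. A possible alternative, sketched here under the assumption that Vehicle arrives as a float column, converts each value explicitly. `vehicle_to_str` and its `missing_token` parameter are hypothetical names, not part of the notebook.

```python
import pandas as pd

def vehicle_to_str(value, missing_token="missing"):
    """Convert a float vehicle ID such as 4092.0 to '4092'."""
    if pd.isna(value):
        # make missing IDs explicit instead of leaving a truncated 'nan'
        return missing_token
    return str(int(value))

vehicles = pd.Series([4092.0, 4030.0, float("nan")])
cleaned = vehicles.map(vehicle_to_str)
print(cleaned.tolist())
```

With this approach the later validity check would see an explicit token for missing IDs rather than the single letter `n`.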


In [120]:
# define categories
allcols,textcols,continuouscols,timecols,collist = define_feature_categories(df) 

all cols ['Report Date', 'Route', 'Time', 'Day', 'Location', 'Incident', 'Min Delay', 'Min Gap', 'Direction', 'Vehicle']
texcols:  ['Incident', 'Location']
continuouscols:  ['Min Delay', 'Min Gap']
timecols:  ['Report Date', 'Time']
collist:  ['Day', 'Vehicle', 'Route', 'Direction']


In [121]:
# get the number of missing values for the columns
df.isnull().sum(axis = 0)

Report Date      0
Route            0
Time             0
Day              0
Location       270
Incident         0
Min Delay       58
Min Gap         77
Direction      232
Vehicle          0
dtype: int64

In [122]:
# fill in missing values
df = fill_missing(df)

before mv


In [123]:
# getting some information about dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 83365 entries, 0 to 1519
Data columns (total 10 columns):
Report Date    83365 non-null datetime64[ns]
Route          83365 non-null object
Time           83365 non-null object
Day            83365 non-null object
Location       83365 non-null object
Incident       83365 non-null object
Min Delay      83365 non-null float64
Min Gap        83365 non-null float64
Direction      83365 non-null object
Vehicle        83365 non-null object
dtypes: datetime64[ns](1), float64(2), object(7)
memory usage: 7.0+ MB


In [124]:
# getting some information about dataset
df.shape

(83365, 10)

In [125]:
# further Analysis 
df.describe()

Unnamed: 0,Min Delay,Min Gap
count,83365.0,83365.0
mean,12.630229,18.103209
std,29.93981,33.000675
min,0.0,0.0
25%,5.0,9.0
50%,6.0,12.0
75%,11.0,20.0
max,1400.0,4216.0


In [126]:
df.dtypes

Report Date    datetime64[ns]
Route                  object
Time                   object
Day                    object
Location               object
Incident               object
Min Delay             float64
Min Gap               float64
Direction              object
Vehicle                object
dtype: object

In [127]:
# create new column combining date + time (needed for resampling) and make it the index

df['Report Date Time'] = pd.to_datetime(df['Report Date'].astype(str) + ' ' + df['Time'].astype(str))
df.index = df['Report Date Time']
df.head()

Unnamed: 0_level_0,Report Date,Route,Time,Day,Location,Incident,Min Delay,Min Gap,Direction,Vehicle,Report Date Time
Report Date Time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2015-01-01 01:25:00,2015-01-01,504,01:25:00,Thursday,Broadview and Gerrard,Mechanical,9.0,18.0,S/B,4092,2015-01-01 01:25:00
2015-01-01 01:44:00,2015-01-01,504,01:44:00,Thursday,Roncesvalles and Galley,Held By,14.0,23.0,S/B,4030,2015-01-01 01:44:00
2015-01-01 02:04:00,2015-01-01,504,02:04:00,Thursday,King and Sherborne,Mechanical,9.0,18.0,E/B,4147,2015-01-01 02:04:00
2015-01-01 02:12:00,2015-01-01,306,02:12:00,Thursday,Main St. and Upper Gerard,Investigation,29.0,39.0,S/B,4049,2015-01-01 02:12:00
2015-01-01 05:05:00,2015-01-01,306,05:05:00,Thursday,Gerrard and Sumach,Mechanical,30.0,60.0,W/B,4114,2015-01-01 05:05:00
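
To show why a date-time index is useful for resampling, here is a minimal sketch on three made-up rows. The frame `incidents` and the result `daily_delay` are hypothetical names, and summing delay minutes per day is only one possible aggregation.

```python
import pandas as pd

# Made-up rows that mimic the Report Date, Time and Min Delay columns
incidents = pd.DataFrame(
    {
        "Report Date": ["2015-01-01", "2015-01-01", "2015-01-02"],
        "Time": ["01:25:00", "01:44:00", "02:04:00"],
        "Min Delay": [9.0, 14.0, 9.0],
    }
)
# Combine date and time into a single timestamp and use it as the index
incidents.index = pd.to_datetime(incidents["Report Date"] + " " + incidents["Time"])

# With a DatetimeIndex in place, resample can bucket the rows by calendar day
daily_delay = incidents["Min Delay"].resample("D").sum()
print(daily_delay)
```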


# Clean up selected columns
Some values in the input dataset were entered "free form" when they should have been restricted to a pick list. Columns with this problem include:

- Route
- Vehicle
- Direction
- Location


Each of these columns has a finite set of valid values. We have to fix the data in these columns where multiple tokens have been used to signify the same real-world entity (e.g. "roncesvalles yard." and "roncesvalles carhouse"), or where incorrect values have been entered (e.g. a Direction that does not correspond to a compass point).
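
The pick-list idea can be sketched in a vectorized form. This is an illustration only: `flag_invalid` is a hypothetical helper, and the two-item pick list is a deliberately small stand-in for the real list of valid routes.

```python
import pandas as pd

def flag_invalid(series, valid_values, bad_token):
    """Keep values found in valid_values; replace everything else with bad_token."""
    return series.where(series.isin(valid_values), bad_token)

routes = pd.Series(["501", "514", "504", "999"])
checked = flag_invalid(routes, ["501", "504"], "bad route")
print(checked.tolist())
```

Replacing invalid values with a shared token, rather than dropping the rows immediately, keeps the option of counting them before deciding whether to remove them.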

# Clean up Route

In [128]:
def check_route (x):
    if x in valid_routes:
        return(x)
    else:
        return("bad route")

In [129]:
print("route count",df['Route'].nunique())
df['Route'].value_counts()

route count 106


501    21160
504    15862
506    11361
505     8947
512     6354
510     5291
511     4208
509     2817
514     2086
502     1746
503     1027
301      891
306      356
705      209
304      205
805      192
508      137
535       53
317       52
50        50
310       49
807       23
5         22
51        21
1         20
500       11
4         11
3          8
8          8
11         8
       ...  
49         2
804        1
68         1
13         1
519        1
63         1
375        1
93         1
28         1
403        1
57         1
81         1
86         1
830        1
45         1
21         1
85         1
701        1
513        1
60         1
53         1
205        1
405        1
31         1
19         1
210        1
204        1
594        1
999        1
111        1
Name: Route, Length: 106, dtype: int64

In [130]:
print("route count pre cleanup",df['Route'].nunique())

route count pre cleanup 106


In [131]:
# replace any Route value that is not in valid_routes with the token "bad route"
df['Route'] = df['Route'].apply(check_route)

In [132]:
print("route count post cleanup",df['Route'].nunique())
df['Route'].value_counts()

route count post cleanup 15


501          21160
504          15862
506          11361
505           8947
512           6354
510           5291
511           4208
bad route     3091
509           2817
502           1746
503           1027
301            891
306            356
304            205
310             49
Name: Route, dtype: int64

# Clean up Vehicle

In [133]:
df[df.Vehicle == 'n'].shape[0]

5552

In [134]:
df['Vehicle'].shape[0]

83365

In [135]:
def check_vehicle (x):
    if str.isdigit(x):
        if int(x) in valid_vehicles:
            return x
        else:
            return("bad vehicle")
    else:
        return("bad vehicle")

In [136]:
df['Vehicle'].value_counts()

n       5552
4074     330
4199     320
4101     316
4209     314
4229     308
4204     307
4176     306
4247     304
4144     302
4218     300
4050     299
4115     299
4200     298
4001     297
4147     294
4222     293
4185     291
4149     290
4215     288
4141     288
4210     287
4152     287
4208     286
4048     286
4110     286
4143     284
4181     283
4217     282
4226     280
        ... 
7690       1
8434       1
8624       1
8480       1
6955       1
11         1
160        1
8839       1
805        1
8484       1
7638       1
8405       1
4603       1
7420       1
7674       1
5163       1
8694       1
8487       1
8729       1
4296       1
7440       1
8715       1
7593       1
7417       1
7619       1
8504       1
7663       1
7606       1
8928       1
467        1
Name: Vehicle, Length: 2438, dtype: int64

In [137]:
print("vehicle count pre cleanup",df['Vehicle'].nunique())
df['Vehicle'] = df['Vehicle'].apply(lambda x:check_vehicle(x))
print("vehicle count post cleanup",df['Vehicle'].nunique())
df['Vehicle'].value_counts()

vehicle count pre cleanup 2438
vehicle count post cleanup 1017


bad vehicle    14480
4074             330
4199             320
4101             316
4209             314
4229             308
4204             307
4176             306
4247             304
4144             302
4218             300
4115             299
4050             299
4200             298
4001             297
4147             294
4222             293
4185             291
4149             290
4215             288
4141             288
4210             287
4152             287
4048             286
4110             286
4208             286
4143             284
4181             283
4217             282
4244             280
               ...  
7449               1
1106               1
7439               1
7663               1
9005               1
7639               1
7409               1
7648               1
7434               1
7428               1
7597               1
7587               1
1089               1
7400               1
7048               1
7422               1
7572         

# Clean up Direction

In [93]:
def check_direction (x):
    if x in valid_directions:
        return(x)
    else:
        return("bad direction")

In [94]:
df['Direction'].shape[0]

83365

In [95]:
# prior to cleanup of the Direction column, get a look at the values
df['Direction'].value_counts()

W/B                            32466
E/B                            32343
N/B                             6006
B/W                             5747
S/B                             5679
missing                          232
EB                               213
eb                               173
WB                               149
wb                               120
SB                                20
NB                                18
nb                                18
sb                                14
EW                                13
eastbound                          8
5                                  7
w/b                                7
bw                                 7
w                                  6
w/B                                4
8                                  4
s                                  4
ew                                 4
E                                  4
BW                                 4
2                                  4
S

In [96]:
# to have consistent checking of directions: lowercase and remove '/'
print("Unique directions before cleanup:",df['Direction'].nunique())
df['Direction'] = df['Direction'].str.lower()
df['Direction'] = df['Direction'].str.replace('/','')
df['Direction'] = df['Direction'].replace({'eb':'e','wb':'w','sb':'s','nb':'n','bw':'b'})
df['Direction'] = df['Direction'].replace({'eastbound':'e','westbound':'w','southbound':'s','northbound':'n'})
# replace any remaining bad Direction values with a common token
df['Direction'] = df['Direction'].apply(lambda x:check_direction(x))
print("Unique directions after cleanup:",df['Direction'].nunique())

Unique directions before cleanup: 95
Unique directions after cleanup: 6


In [97]:
print("Unique directions after cleanup:",df['Direction'].nunique())

Unique directions after cleanup: 6


In [98]:
print("direction count",df['Direction'].nunique())
df['Direction'].value_counts()

direction count 6


w                32757
e                32747
n                 6045
b                 5763
s                 5719
bad direction      334
Name: Direction, dtype: int64
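
The direction cleanup can also be expressed as a single function applied to one value at a time, which makes the rules easy to test in isolation. This is a sketch under the assumption that the same rules apply (lowercase, drop `/`, map aliases, reject the rest); `VALID_DIRECTIONS`, `ALIASES` and `normalize_direction` are hypothetical names.

```python
VALID_DIRECTIONS = {"e", "w", "n", "s", "b"}
ALIASES = {
    "eb": "e", "wb": "w", "sb": "s", "nb": "n", "bw": "b",
    "eastbound": "e", "westbound": "w", "southbound": "s", "northbound": "n",
}

def normalize_direction(raw):
    """Map a free-form direction to a single letter, or to 'bad direction'."""
    # lowercase and remove '/', so 'W/B' and 'wb' become the same token
    token = str(raw).lower().replace("/", "")
    # translate known aliases; unknown tokens pass through unchanged
    token = ALIASES.get(token, token)
    return token if token in VALID_DIRECTIONS else "bad direction"

for raw in ["W/B", "eastbound", "B/W", "5", "missing"]:
    print(raw, "->", normalize_direction(raw))
```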

In [99]:
df.head()

Unnamed: 0_level_0,Report Date,Route,Time,Day,Location,Incident,Min Delay,Min Gap,Direction,Vehicle,Report Date Time
Report Date Time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2015-01-01 01:25:00,2015-01-01,504,01:25:00,Thursday,Broadview and Gerrard,Mechanical,9.0,18.0,s,4092,2015-01-01 01:25:00
2015-01-01 01:44:00,2015-01-01,504,01:44:00,Thursday,Roncesvalles and Galley,Held By,14.0,23.0,s,4030,2015-01-01 01:44:00
2015-01-01 02:04:00,2015-01-01,504,02:04:00,Thursday,King and Sherborne,Mechanical,9.0,18.0,e,4147,2015-01-01 02:04:00
2015-01-01 02:12:00,2015-01-01,306,02:12:00,Thursday,Main St. and Upper Gerard,Investigation,29.0,39.0,s,4049,2015-01-01 02:12:00
2015-01-01 05:05:00,2015-01-01,306,05:05:00,Thursday,Gerrard and Sumach,Mechanical,30.0,60.0,w,4114,2015-01-01 05:05:00


# Clean up Location

In [47]:
def clean_conjunction(intersection):
    # make conjunctions in intersections consistent
    # str.replace returns a new string, so the result has to be assigned back
    if " and " not in intersection:
        if "&" in intersection:
            if " & " in intersection:
                intersection = intersection.replace(" & "," and ")
            else:
                intersection = intersection.replace("&"," and ")
        else:
            if " / " in intersection:
                intersection = intersection.replace(" / "," and ")
            elif "/" in intersection:
                intersection = intersection.replace("/"," and ")
    return(intersection)

In [48]:
def order_location(intersection):
    # for any string with the format "* and *" if the value before the and is alphabetically
    # higher than the value after the and, swap the values
    conj = " and "
    alpha_ordered_intersection = intersection
    if conj in intersection:
        end_first_street = intersection.find(conj)
        if (end_first_street > 0) and (len(intersection) > (end_first_street + len(conj))):
            start_second_street = intersection.find(conj) + len(conj)
            first_street = intersection[0:end_first_street]
            second_street = intersection[start_second_street:]
            alpha_ordered_intersection = min(first_street,second_street)+conj+max(first_street,second_street)
    return(alpha_ordered_intersection)
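
The same ordering rule can be written more compactly with `str.partition`. This is an alternative sketch, not the notebook's function; `order_intersection` is a hypothetical name and it assumes, like the function above, that only the first " and " separates the two streets.

```python
def order_intersection(location, conj=" and "):
    """Put the two streets of 'x and y' into alphabetical order."""
    # partition splits at the first occurrence of conj
    first, sep, second = location.partition(conj)
    # leave the value alone if there is no conjunction or a side is empty
    if not sep or not first or not second:
        return location
    return conj.join(sorted([first, second]))

print(order_intersection("queen and broadview"))
print(order_intersection("russell yard"))
```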

In [49]:
# the values in the location column were entered in free form, so there are several problems to fix

# start by counting the distinct values in location column before and after lowercasing
print("Location count pre cleanup:",df['Location'].nunique())
print("Route count pre cleanup:",df['Route'].nunique())
print("Direction count pre cleanup:",df['Direction'].nunique())
print("Vehicle count pre cleanup:",df['Vehicle'].nunique())
df['Location'] = df['Location'].str.lower()
print("Unique Location values after lcasing:",df['Location'].nunique())
df['Location'].value_counts().head(100)

Location count pre cleanup: 15691
Route count pre cleanup: 15
Direction count pre cleanup: 6
Vehicle count pre cleanup: 1017
Unique Location values after lcasing: 13263


russell yard                 1805
roncy yard                   1385
queen and connaught          1158
roncesvalles yard            1088
roncesvalles and queen       1046
leslie barns                  936
queen and roncesvalles        846
broadview station             827
dundas west station           812
cne loop                      794
broadview stn                 667
humber loop                   625
broadview and queen           622
queen at connaught            573
spadina and king              519
neville loop                  498
main station                  495
bathurst station              488
bingham loop                  488
king and spadina              451
dundas west stn               443
coxwell and gerrard           437
roncesvalles yard.            435
queen and broadview           423
spadina station               415
broadview and dundas          351
queen at roncesvalles         346
long branch loop              332
exhibition loop               318
longbranch loo

In [50]:
# make substitutions to eliminate obvious duplicate tokens, counting unique values before and after
# intersections are put into a consistent "x and y" order in a later step (see order_location)
print("Unique Location values before substitutions:",df['Location'].nunique())
df['Location'] = df['Location'].replace({'broadviewstation':'broadview station',' at ':' and ',' stn':' station',' ave.':'','/':' and ','roncy':'roncesvalles','carhouse':'yard','yard.':'yard','st. clair':'st clair','ronc. ':'roncesvalles ','long branch':'longbranch','garage':'yard','barns':'yard',' & ':' and '}, regex=True)
print("Unique Location values after substitutions:",df['Location'].nunique())
df['Location'].value_counts().head(50)

Unique Location values before substitutions: 13263
Unique Location values after substitutions: 10867


roncesvalles yard               3951
russell yard                    2050
queen and connaught             1995
roncesvalles and queen          1896
broadview station               1494
queen and roncesvalles          1392
dundas west station             1255
leslie yard                     1052
broadview and queen              805
cne loop                         794
spadina and king                 770
bathurst station                 768
king and spadina                 727
main station                     655
longbranch loop                  649
queen and broadview              641
spadina station                  626
humber loop                      625
broadview and dundas             586
coxwell and gerrard              525
neville loop                     498
bingham loop                     488
st clair west station            477
king and bathurst                441
broadview and gerrard            418
bathurst and king                411
howard park and roncesvalles     405
b
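
One caveat about the substitutions above: with `regex=True` the dictionary keys are treated as regular expressions, so the `.` in keys such as `'yard.'` or `' ave.'` matches any character, not only a literal period. The sketch below uses the standard `re` module to illustrate the effect on a made-up value; `pattern_unescaped` and `pattern_escaped` are hypothetical names.

```python
import re

pattern_unescaped = "yard."           # '.' matches any single character
pattern_escaped = re.escape("yard.")  # matches a literal period only

text = "russell yard east"  # made-up value for illustration

# The space after 'yard' is consumed by the unescaped '.'
print(re.sub(pattern_unescaped, "yard", text))
# The escaped pattern leaves the value unchanged
print(re.sub(pattern_escaped, "yard", text))
# The intended case still works with the escaped pattern
print(re.sub(pattern_escaped, "yard", "roncesvalles yard."))
```

Escaping the keys would probably change the unique-value counts slightly, so the printed counts would need to be regenerated if this were adopted.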

In [51]:
# put intersection values into consistent order
print("Unique Location values:",df['Location'].nunique())
df['Location'] = df['Location'].apply(lambda x:order_location(x))
print("Location values post cleanup:",df['Location'].nunique())
df['Location'].value_counts().head(100)

Unique Location values: 10867
Location values post cleanup: 10074


roncesvalles yard               3951
queen and roncesvalles          3288
connaught and queen             2222
russell yard                    2050
king and spadina                1497
broadview station               1494
broadview and queen             1446
dundas west station             1255
leslie yard                     1052
coxwell and gerrard              867
bathurst and king                852
broadview and dundas             850
cne loop                         794
bathurst station                 768
howard park and roncesvalles     728
broadview and gerrard            661
main station                     655
longbranch loop                  649
spadina station                  626
humber loop                      625
queen and spadina                571
bathurst and st clair            513
neville loop                     498
bingham loop                     488
bathurst and queen               479
bathurst and fleet               478
st clair west station            477
b

In [52]:
df['Location'].value_counts().head(80)

roncesvalles yard                           3951
queen and roncesvalles                      3288
connaught and queen                         2222
russell yard                                2050
king and spadina                            1497
broadview station                           1494
broadview and queen                         1446
dundas west station                         1255
leslie yard                                 1052
coxwell and gerrard                          867
bathurst and king                            852
broadview and dundas                         850
cne loop                                     794
bathurst station                             768
howard park and roncesvalles                 728
broadview and gerrard                        661
main station                                 655
longbranch loop                              649
spadina station                              626
humber loop                                  625
queen and spadina   

# Remove bad rows

In [None]:
print("Location count post cleanup:",df['Location'].nunique())
print("Route count post cleanup:",df['Route'].nunique())
print("Direction count post cleanup:",df['Direction'].nunique())
print("Vehicle count post cleanup:",df['Vehicle'].nunique())
# print("Bad Location count":df[df.Vehicle == 'bad vehicle'].shape[0])
print("Bad route count:",df[df.Route == 'bad route'].shape[0])
print("Bad direction count:",df[df.Direction == 'bad direction'].shape[0])
print("Bad vehicle count:",df[df.Vehicle == 'bad vehicle'].shape[0])

In [None]:
# remove rows with a bad vehicle, direction, or route value
if remove_bad_values:
    df = df[df.Vehicle != 'bad vehicle']
    df = df[df.Direction != 'bad direction']
    df = df[df.Route != 'bad route']

In [None]:
df.shape

In [None]:
# pickle the cleansed dataframe
file_name = path + pickled_output_dataframe
# file_name = path + "2014_2018_df_cleaned_all_bad removed.pkl"
df.to_pickle(file_name)

In [None]:
dfn = pd.read_pickle(file_name)
dfn.head()

# Visualize cleaned data

In [None]:
!pip install pixiedust

In [None]:
import pixiedust

In [None]:
display(df)