# Initial Subway Model V2
---
The first round of subway modelling produced two separate models, one for January and one for February-December, because the January data used an older format that recorded individual stations rather than station complexes. After revisiting and re-cleaning the January data, we have linked the January records to station complexes and extracted hourly ridership by averaging the entries in the 4-hour observation windows. This notebook has been updated to reflect that.
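
As a rough illustration of the hourly extraction described above, the sketch below spreads each 4-hour window evenly over the hours it covers. The column names `window_start` and `entries` and the toy values are assumptions made for this example; the actual cleaning code lives outside this notebook and may differ.

```python
import pandas as pd

# Hypothetical raw January records: one row per station complex and 4-hour window.
raw = pd.DataFrame({
    "station_complex": ["1 Av (L)", "1 Av (L)"],
    "window_start": pd.to_datetime(["2022-01-27 00:00", "2022-01-27 04:00"]),
    "entries": [1200.0, 400.0],
})

# Repeat each window once per hour it covers, then shift the timestamp by 0-3 hours.
hourly = raw.loc[raw.index.repeat(4)].copy()
hourly["offset"] = hourly.groupby(level=0).cumcount()
hourly["transit_timestamp"] = hourly["window_start"] + pd.to_timedelta(hourly["offset"], unit="h")

# Average the window total over its four hours.
hourly["ridership"] = hourly["entries"] / 4
hourly = hourly[["transit_timestamp", "station_complex", "ridership"]].reset_index(drop=True)
print(hourly)
```

Every hour inside a window receives the same value, which is why runs of identical ridership figures appear in the January rows further down.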

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle

from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor

from sklearn import metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import xgboost as xgb

## Updated Single-Model for all of 2022 Subway data
---
Since all 2022 data has now been cleaned and merged into one file, we can train a single model for the subway data. This is simpler for the backend than juggling two separate models. We follow the same logic as the taxi model, so for more in-depth reasoning behind the feature engineering and the new split, refer to that notebook or the dev logs on Notion.

In [2]:
df = pd.read_csv("2022_cleaned_subway_data_for_modelling.csv")

In [3]:
df.shape

(1039614, 7)

In [4]:
df.head()

Unnamed: 0,transit_timestamp,station_complex,ridership,Hour,DayOfWeek,DayOfMonth,Month
0,2022-01-01 03:00:00,1 Av (L),0.0,3,5,1,1
1,2022-01-01 04:00:00,1 Av (L),0.0,4,5,1,1
2,2022-01-01 05:00:00,1 Av (L),0.0,5,5,1,1
3,2022-01-01 06:00:00,1 Av (L),0.0,6,5,1,1
4,2022-01-01 07:00:00,1 Av (L),34.75,7,5,1,1


### Feature Engineering
---
We will extract the same features we used in the updated taxi model for consistency and to help the model generalise. We will also change the train/test split so that the last 5 days of each month form the test set, in order to test the model on truly unseen data.

**Weekend vs Weekdays:**

In [5]:
df['Weekend'] = np.where(df['DayOfWeek'].isin([5, 6]), 1, 0)
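
The `[5, 6]` check only marks Saturday and Sunday if `DayOfWeek` follows the pandas convention of Monday=0. The sketch below checks that assumption on three known dates; that the column was built with `dt.dayofweek` is our assumption, although it matches the first rows of the data (2022-01-01, a Saturday, has `DayOfWeek` 5).

```python
import numpy as np
import pandas as pd

# 2022-01-01 was a Saturday, 2022-01-02 a Sunday, 2022-01-03 a Monday.
dates = pd.Series(pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"]))

day_of_week = dates.dt.dayofweek  # Monday=0 ... Sunday=6
weekend = np.where(day_of_week.isin([5, 6]), 1, 0)

print(list(zip(dates.dt.day_name(), day_of_week, weekend)))
```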

**Time of Day:**

In [6]:
# Creating bins for the time blocks
bins = [0, 6, 12, 17, 22, 24]
labels = ['Night', 'Morning', 'Afternoon', 'Evening', 'Night']

# Assign the labels based on the time ranges
df['TimeOfDay'] = pd.cut(df['Hour'], bins=bins, labels=labels, right=False, include_lowest=True, ordered=False)
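
Because 'Night' appears twice in `labels`, `ordered=False` is required, and with `right=False` each bin includes its left edge. A quick sanity check over all 24 hours (a standalone sketch, independent of `df`) shows which hours land in which block:

```python
import pandas as pd

bins = [0, 6, 12, 17, 22, 24]
labels = ['Night', 'Morning', 'Afternoon', 'Evening', 'Night']

hours = pd.Series(range(24), name='Hour')
time_of_day = pd.cut(hours, bins=bins, labels=labels, right=False, ordered=False)

# Group the hours by the label they received.
print(hours.groupby(time_of_day.astype(str)).agg(list))
```

Hours 0-5 and 22-23 share the 'Night' label, so the later one-hot encoding produces a single `TimeOfDay_Night` column for both ends of the day.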

In [7]:
df

Unnamed: 0,transit_timestamp,station_complex,ridership,Hour,DayOfWeek,DayOfMonth,Month,Weekend,TimeOfDay
0,2022-01-01 03:00:00,1 Av (L),0.00,3,5,1,1,1,Night
1,2022-01-01 04:00:00,1 Av (L),0.00,4,5,1,1,1,Night
2,2022-01-01 05:00:00,1 Av (L),0.00,5,5,1,1,1,Night
3,2022-01-01 06:00:00,1 Av (L),0.00,6,5,1,1,1,Morning
4,2022-01-01 07:00:00,1 Av (L),34.75,7,5,1,1,1,Morning
...,...,...,...,...,...,...,...,...,...
1039609,2022-12-31 12:00:00,72 St (Q),602.00,12,5,31,12,1,Afternoon
1039610,2022-12-31 20:00:00,Dyckman St (1),124.00,20,5,31,12,1,Evening
1039611,2022-12-31 01:00:00,"57 St-7 Av (N,Q,R,W)",202.00,1,5,31,12,1,Night
1039612,2022-12-31 16:00:00,"34 St-Herald Sq (B,D,F,M,N,Q,R,W)",4755.00,16,5,31,12,1,Afternoon


We need to create dummy variables for the station complexes so that they are machine-readable for training. We also one-hot encode time of day and day of week, as these are nominal rather than ordinal.

In [8]:
# Get dummies and concatenate with original dataframe
dummies = pd.get_dummies(df['station_complex'], prefix='station_complex')
df = pd.concat([df, dummies], axis=1)

dummies = pd.get_dummies(df['TimeOfDay'], prefix='TimeOfDay')
df = pd.concat([df, dummies], axis=1)

dummies = pd.get_dummies(df['DayOfWeek'], prefix='DayOfWeek')
df = pd.concat([df, dummies], axis=1)

# Drop the original columns
df = df.drop('station_complex', axis=1)
df = df.drop('TimeOfDay', axis=1)
df = df.drop('DayOfWeek', axis=1)
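
For reference, the same encoding can be done in one call by passing `columns=` to `pd.get_dummies`, which appends the dummy columns and drops the originals. This is an equivalent sketch on a toy frame, not a change to the cell above; `dtype=int` is added so the dummies are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Toy frame with the three categorical columns used in this notebook.
toy = pd.DataFrame({
    'station_complex': ['1 Av (L)', '72 St (Q)', '1 Av (L)'],
    'TimeOfDay': ['Night', 'Morning', 'Evening'],
    'DayOfWeek': [5, 0, 6],
    'ridership': [0.0, 34.75, 602.0],
})

# One call: encode all three columns and drop the originals.
encoded = pd.get_dummies(
    toy,
    columns=['station_complex', 'TimeOfDay', 'DayOfWeek'],
    dtype=int,
)
print(encoded.columns.tolist())
```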

In [9]:
df

Unnamed: 0,transit_timestamp,ridership,Hour,DayOfMonth,Month,Weekend,station_complex_1 Av (L),station_complex_103 St (1),station_complex_103 St (6),"station_complex_103 St (B,C)",...,TimeOfDay_Evening,TimeOfDay_Morning,TimeOfDay_Night,DayOfWeek_0,DayOfWeek_1,DayOfWeek_2,DayOfWeek_3,DayOfWeek_4,DayOfWeek_5,DayOfWeek_6
0,2022-01-01 03:00:00,0.00,3,1,1,1,1,0,0,0,...,0,0,1,0,0,0,0,0,1,0
1,2022-01-01 04:00:00,0.00,4,1,1,1,1,0,0,0,...,0,0,1,0,0,0,0,0,1,0
2,2022-01-01 05:00:00,0.00,5,1,1,1,1,0,0,0,...,0,0,1,0,0,0,0,0,1,0
3,2022-01-01 06:00:00,0.00,6,1,1,1,1,0,0,0,...,0,1,0,0,0,0,0,0,1,0
4,2022-01-01 07:00:00,34.75,7,1,1,1,1,0,0,0,...,0,1,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1039609,2022-12-31 12:00:00,602.00,12,31,12,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1039610,2022-12-31 20:00:00,124.00,20,31,12,1,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
1039611,2022-12-31 01:00:00,202.00,1,31,12,1,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0
1039612,2022-12-31 16:00:00,4755.00,16,31,12,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


We'll temporarily extract a date column to make the train/test split easier.

In [10]:
df['transit_timestamp'] = pd.to_datetime(df['transit_timestamp'])
df['date'] = df['transit_timestamp'].dt.date

In [11]:
df = df.drop(columns='transit_timestamp')

In [12]:
df

Unnamed: 0,ridership,Hour,DayOfMonth,Month,Weekend,station_complex_1 Av (L),station_complex_103 St (1),station_complex_103 St (6),"station_complex_103 St (B,C)",station_complex_110 St (6),...,TimeOfDay_Morning,TimeOfDay_Night,DayOfWeek_0,DayOfWeek_1,DayOfWeek_2,DayOfWeek_3,DayOfWeek_4,DayOfWeek_5,DayOfWeek_6,date
0,0.00,3,1,1,1,1,0,0,0,0,...,0,1,0,0,0,0,0,1,0,2022-01-01
1,0.00,4,1,1,1,1,0,0,0,0,...,0,1,0,0,0,0,0,1,0,2022-01-01
2,0.00,5,1,1,1,1,0,0,0,0,...,0,1,0,0,0,0,0,1,0,2022-01-01
3,0.00,6,1,1,1,1,0,0,0,0,...,1,0,0,0,0,0,0,1,0,2022-01-01
4,34.75,7,1,1,1,1,0,0,0,0,...,1,0,0,0,0,0,0,1,0,2022-01-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1039609,602.00,12,31,12,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,2022-12-31
1039610,124.00,20,31,12,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,2022-12-31
1039611,202.00,1,31,12,1,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,2022-12-31
1039612,4755.00,16,31,12,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,2022-12-31


In [13]:
df['date'] = pd.to_datetime(df['date'])
df_test = df[df['date'].dt.day > df['date'].dt.daysinmonth - 5].copy()

# Train data: Checks for rows not in the new test set
df_train = df[~df.index.isin(df_test.index)].copy()
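
The condition `day > daysinmonth - 5` selects exactly the last five calendar days of each month, whatever its length. A small standalone check on February and March 2022 (toy data, not the notebook's `df`):

```python
import pandas as pd

# One row per day for a 28-day month and a 31-day month.
toy = pd.DataFrame({'date': pd.date_range('2022-02-01', '2022-03-31', freq='D')})

# Same rule as above: the last 5 days of each month go to the test set.
is_test = toy['date'].dt.day > toy['date'].dt.daysinmonth - 5
toy_test = toy[is_test]
toy_train = toy[~is_test]

print(toy_test['date'].dt.strftime('%m-%d').tolist())
```

February contributes days 24-28 and March days 27-31, so every month holds out five days.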

In [14]:
df_test

Unnamed: 0,ridership,Hour,DayOfMonth,Month,Weekend,station_complex_1 Av (L),station_complex_103 St (1),station_complex_103 St (6),"station_complex_103 St (B,C)",station_complex_110 St (6),...,TimeOfDay_Morning,TimeOfDay_Night,DayOfWeek_0,DayOfWeek_1,DayOfWeek_2,DayOfWeek_3,DayOfWeek_4,DayOfWeek_5,DayOfWeek_6,date
621,303.00,0,27,1,0,1,0,0,0,0,...,0,1,0,0,0,1,0,0,0,2022-01-27
622,303.00,1,27,1,0,1,0,0,0,0,...,0,1,0,0,0,1,0,0,0,2022-01-27
623,303.00,2,27,1,0,1,0,0,0,0,...,0,1,0,0,0,1,0,0,0,2022-01-27
624,73.75,3,27,1,0,1,0,0,0,0,...,0,1,0,0,0,1,0,0,0,2022-01-27
625,73.75,4,27,1,0,1,0,0,0,0,...,0,1,0,0,0,1,0,0,0,2022-01-27
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1039609,602.00,12,31,12,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,2022-12-31
1039610,124.00,20,31,12,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,2022-12-31
1039611,202.00,1,31,12,1,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,2022-12-31
1039612,4755.00,16,31,12,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,2022-12-31


In [15]:
df_train

Unnamed: 0,ridership,Hour,DayOfMonth,Month,Weekend,station_complex_1 Av (L),station_complex_103 St (1),station_complex_103 St (6),"station_complex_103 St (B,C)",station_complex_110 St (6),...,TimeOfDay_Morning,TimeOfDay_Night,DayOfWeek_0,DayOfWeek_1,DayOfWeek_2,DayOfWeek_3,DayOfWeek_4,DayOfWeek_5,DayOfWeek_6,date
0,0.00,3,1,1,1,1,0,0,0,0,...,0,1,0,0,0,0,0,1,0,2022-01-01
1,0.00,4,1,1,1,1,0,0,0,0,...,0,1,0,0,0,0,0,1,0,2022-01-01
2,0.00,5,1,1,1,1,0,0,0,0,...,0,1,0,0,0,0,0,1,0,2022-01-01
3,0.00,6,1,1,1,1,0,0,0,0,...,1,0,0,0,0,0,0,1,0,2022-01-01
4,34.75,7,1,1,1,1,0,0,0,0,...,1,0,0,0,0,0,0,1,0,2022-01-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1025370,796.00,18,26,12,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,2022-12-26
1025371,1596.00,18,26,12,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,2022-12-26
1025372,940.00,19,26,12,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,2022-12-26
1025373,156.00,19,26,12,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,2022-12-26


In [16]:
scaler = MinMaxScaler()

# Scale the 'ridership' column into 'busyness_score': fit on train only, then apply to test
df_train['busyness_score'] = scaler.fit_transform(df_train[['ridership']])
df_test['busyness_score'] = scaler.transform(df_test[['ridership']])
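
Fitting the scaler on the training rows only means the test scores are expressed relative to the training minimum and maximum, so a test value above the training maximum scores above 1. The toy numbers below are made up to illustrate this, and to show that `inverse_transform` converts a score back into ridership:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up ridership values; the test set contains a value above the train maximum.
train = np.array([[0.0], [50.0], [100.0]])
test = np.array([[25.0], [150.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=0, max=100 from train only
test_scaled = scaler.transform(test)        # reuses the train min/max

# Convert scores back to the original ridership units.
restored = scaler.inverse_transform(test_scaled)
print(test_scaled.ravel(), restored.ravel())
```

If predictions ever need to be reported as rider counts, the fitted scaler has to be kept alongside the model for this reason.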

In [17]:
df_train

Unnamed: 0,ridership,Hour,DayOfMonth,Month,Weekend,station_complex_1 Av (L),station_complex_103 St (1),station_complex_103 St (6),"station_complex_103 St (B,C)",station_complex_110 St (6),...,TimeOfDay_Night,DayOfWeek_0,DayOfWeek_1,DayOfWeek_2,DayOfWeek_3,DayOfWeek_4,DayOfWeek_5,DayOfWeek_6,date,busyness_score
0,0.00,3,1,1,1,1,0,0,0,0,...,1,0,0,0,0,0,1,0,2022-01-01,0.000000
1,0.00,4,1,1,1,1,0,0,0,0,...,1,0,0,0,0,0,1,0,2022-01-01,0.000000
2,0.00,5,1,1,1,1,0,0,0,0,...,1,0,0,0,0,0,1,0,2022-01-01,0.000000
3,0.00,6,1,1,1,1,0,0,0,0,...,0,0,0,0,0,0,1,0,2022-01-01,0.000000
4,34.75,7,1,1,1,1,0,0,0,0,...,0,0,0,0,0,0,1,0,2022-01-01,0.001562
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1025370,796.00,18,26,12,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,2022-12-26,0.035787
1025371,1596.00,18,26,12,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,2022-12-26,0.071753
1025372,940.00,19,26,12,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,2022-12-26,0.042260
1025373,156.00,19,26,12,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,2022-12-26,0.007013


In [18]:
df_test

Unnamed: 0,ridership,Hour,DayOfMonth,Month,Weekend,station_complex_1 Av (L),station_complex_103 St (1),station_complex_103 St (6),"station_complex_103 St (B,C)",station_complex_110 St (6),...,TimeOfDay_Night,DayOfWeek_0,DayOfWeek_1,DayOfWeek_2,DayOfWeek_3,DayOfWeek_4,DayOfWeek_5,DayOfWeek_6,date,busyness_score
621,303.00,0,27,1,0,1,0,0,0,0,...,1,0,0,0,1,0,0,0,2022-01-27,0.013622
622,303.00,1,27,1,0,1,0,0,0,0,...,1,0,0,0,1,0,0,0,2022-01-27,0.013622
623,303.00,2,27,1,0,1,0,0,0,0,...,1,0,0,0,1,0,0,0,2022-01-27,0.013622
624,73.75,3,27,1,0,1,0,0,0,0,...,1,0,0,0,1,0,0,0,2022-01-27,0.003316
625,73.75,4,27,1,0,1,0,0,0,0,...,1,0,0,0,1,0,0,0,2022-01-27,0.003316
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1039609,602.00,12,31,12,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,2022-12-31,0.027065
1039610,124.00,20,31,12,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,2022-12-31,0.005575
1039611,202.00,1,31,12,1,0,0,0,0,0,...,1,0,0,0,0,0,1,0,2022-12-31,0.009082
1039612,4755.00,16,31,12,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,2022-12-31,0.213775


In [19]:
df_train.drop('ridership', axis=1, inplace=True)
df_test.drop('ridership', axis=1, inplace=True)

df_train.drop('date', axis=1, inplace=True)
df_test.drop('date', axis=1, inplace=True)

In [20]:
df_train_sample = pd.DataFrame()  

for month in df_train['Month'].unique():
    # Get a sample of 16,000 rows for the month
    month_df = df_train[df_train['Month'] == month].sample(n=16000, random_state=1)
    
    # Append the month sample to the training DataFrame
    df_train_sample = pd.concat([df_train_sample, month_df])

In [21]:
df_test_sample = pd.DataFrame()  

for month in df_test['Month'].unique():
    month_df = df_test[df_test['Month'] == month].sample(n=4000, random_state=1)
    df_test_sample = pd.concat([df_test_sample, month_df])
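
The two sampling loops above can also be written as a single grouped sample. This is a sketch on toy data; `DataFrameGroupBy.sample` requires pandas 1.1 or later, and with the same `random_state` it is not guaranteed to draw the same rows as the loop.

```python
import pandas as pd

# Toy frame: 10 rows for each of two months.
toy = pd.DataFrame({
    'Month': [1] * 10 + [2] * 10,
    'busyness_score': [i / 100 for i in range(20)],
})

# Draw a fixed number of rows from every month in one call.
sample = toy.groupby('Month').sample(n=4, random_state=1)
print(sample['Month'].value_counts().sort_index())
```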

In [22]:
# Assign the feature columns to X_train and X_test
X_train = df_train_sample.drop('busyness_score', axis=1)
X_test = df_test_sample.drop('busyness_score', axis=1)

# Assign the target column to y_train and y_test
y_train = df_train_sample['busyness_score']
y_test = df_test_sample['busyness_score']

In [23]:
X_train

Unnamed: 0,Hour,DayOfMonth,Month,Weekend,station_complex_1 Av (L),station_complex_103 St (1),station_complex_103 St (6),"station_complex_103 St (B,C)",station_complex_110 St (6),"station_complex_116 St (2,3)",...,TimeOfDay_Evening,TimeOfDay_Morning,TimeOfDay_Night,DayOfWeek_0,DayOfWeek_1,DayOfWeek_2,DayOfWeek_3,DayOfWeek_4,DayOfWeek_5,DayOfWeek_6
17322,15,12,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
30089,20,19,1,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,0,0
58004,15,9,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
7058,8,17,1,0,0,0,0,0,0,0,...,0,1,0,1,0,0,0,0,0,0
51631,20,21,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
971401,22,8,12,0,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,0,0
1010946,5,21,12,0,0,0,0,0,1,0,...,0,0,1,0,0,1,0,0,0,0
976822,20,10,12,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
957004,22,2,12,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0


In [24]:
y_train

17322      0.003788
30089      0.024592
58004      0.002214
7058       0.000922
51631      0.026694
             ...   
971401     0.052960
1010946    0.006249
976822     0.042440
957004     0.001079
994694     0.007643
Name: busyness_score, Length: 192000, dtype: float64

In [25]:
X_test

Unnamed: 0,Hour,DayOfMonth,Month,Weekend,station_complex_1 Av (L),station_complex_103 St (1),station_complex_103 St (6),"station_complex_103 St (B,C)",station_complex_110 St (6),"station_complex_116 St (2,3)",...,TimeOfDay_Evening,TimeOfDay_Morning,TimeOfDay_Night,DayOfWeek_0,DayOfWeek_1,DayOfWeek_2,DayOfWeek_3,DayOfWeek_4,DayOfWeek_5,DayOfWeek_6
68092,11,28,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
77712,17,27,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
76320,8,31,1,0,0,0,0,0,0,0,...,0,1,0,1,0,0,0,0,0,0
5883,0,30,1,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
77064,14,31,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1029919,17,28,12,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,0,0
1031654,2,29,12,0,1,0,0,0,0,0,...,0,0,1,0,0,0,1,0,0,0
1028634,0,28,12,0,0,0,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0
1026500,22,27,12,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0


In [26]:
y_test

68092      0.002102
77712      0.011824
76320      0.009778
5883       0.001708
77064      0.037326
             ...   
1029919    0.047835
1031654    0.001663
1028634    0.002293
1026500    0.007103
1037587    0.004631
Name: busyness_score, Length: 48000, dtype: float64

## Random Forest
---
First, we will test a random forest model, using the 80/20 train/test split of our sampled data (192,000 training rows and 48,000 test rows). We observed in the taxi data that linear regression was not suited to our particular problem. As such, there is little point in creating a linear regression model, and we start with random forest.

In [27]:
random_forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1)
random_forest.fit(X_train, y_train)
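
The model is fitted with `oob_score=True`, which makes scikit-learn compute an out-of-bag R^2 from the training rows each tree did not see. The notebook does not inspect it, so the sketch below shows how it could be read, using synthetic data in place of `X_train` and `y_train`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the notebook's training data.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)

forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1)
forest.fit(X, y)

# R^2 estimated from rows left out of each tree's bootstrap sample.
print('OOB R^2:', forest.oob_score_)
```

Out-of-bag rows are chosen at random, so this estimate does not respect the last-5-days holdout and is likely to be more optimistic than the test metrics below.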

In [28]:
# Creating a dataframe to store & display feature importance
feature_importance = pd.DataFrame({'feature': X_train.columns, 'importance':random_forest.feature_importances_})
feature_importance.sort_values('importance', ascending=False)

Unnamed: 0,feature,importance
0,Hour,0.286329
119,"station_complex_Times Sq-42 St (N,Q,R,W,S,1,2,...",0.184901
2,Month,0.075326
103,"station_complex_Grand Central-42 St (S,4,5,6,7)",0.052295
3,Weekend,0.052044
...,...,...
113,station_complex_Rector St (1),0.000045
13,station_complex_125 St (1),0.000044
11,"station_complex_116 St (B,C)",0.000044
92,"station_complex_Central Park North-110 St (2,3)",0.000038
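
Because each station complex has its own dummy column, its importance is spread over many small entries. One way to compare feature groups is to sum the importances by prefix. The sketch below does this on a toy table; the numbers are illustrative and are not taken from the fitted model.

```python
import pandas as pd

# Illustrative importances only; in the notebook this would be `feature_importance`.
feature_importance = pd.DataFrame({
    'feature': ['Hour', 'Month', 'station_complex_1 Av (L)',
                'station_complex_72 St (Q)', 'TimeOfDay_Night', 'DayOfWeek_5'],
    'importance': [0.4, 0.1, 0.2, 0.15, 0.1, 0.05],
})

prefixes = ('station_complex', 'TimeOfDay', 'DayOfWeek')

def group_name(feature):
    """Map a one-hot column back to the original column it came from."""
    for prefix in prefixes:
        if feature.startswith(prefix + '_'):
            return prefix
    return feature

grouped = (
    feature_importance
    .assign(group=feature_importance['feature'].map(group_name))
    .groupby('group')['importance']
    .sum()
    .sort_values(ascending=False)
)
print(grouped)
```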


In [29]:
# Testing predicted vs actual values 
rf_training_predictions = random_forest.predict(X_train)
df_true_vs_rf_predicted = pd.DataFrame({'Actual Value': y_train, 'Predicted Value': rf_training_predictions})
df_true_vs_rf_predicted.head(10)

Unnamed: 0,Actual Value,Predicted Value
17322,0.003788,0.003695
30089,0.024592,0.024362
58004,0.002214,0.002239
7058,0.000922,0.001099
51631,0.026694,0.026629
88010,0.003956,0.004034
31276,0.002934,0.002955
80956,0.000843,0.001036
87892,0.003585,0.003684
71610,0.003091,0.003234


In [30]:
print("\n==================== Train Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_train, rf_training_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_train, rf_training_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_train, rf_training_predictions)))
print('R^2:', metrics.r2_score(y_train, rf_training_predictions))
print("=======================================================")


Mean Absolute Error: 0.0010620442831265264
Mean Squared Error: 7.875946774119792e-06
Root Mean Squared Error: 0.00280641172569525
R^2: 0.9956677472717306


In [31]:
# Predicted values for the held-out test set (the last 5 days of each month),
# using the trained model
rf_test_predictions = random_forest.predict(X_test)
df_true_vs_rf_predicted_test = pd.DataFrame({'Actual Value': y_test, 'Predicted Value': rf_test_predictions})
df_true_vs_rf_predicted_test.head(10)

Unnamed: 0,Actual Value,Predicted Value
68092,0.002102,0.002941
77712,0.011824,0.012617
76320,0.009778,0.010697
5883,0.001708,0.001757
77064,0.037326,0.035657
45117,0.000697,0.000546
81516,0.008115,0.01136
19160,0.005867,0.004938
21383,0.004192,0.005577
3634,0.004552,0.004378


In [32]:
print("\n==================== Test Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, rf_test_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, rf_test_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, rf_test_predictions)))
print('R^2:', metrics.r2_score(y_test, rf_test_predictions))
print("=======================================================")


Mean Absolute Error: 0.0037180864251300035
Mean Squared Error: 9.80071901586859e-05
Root Mean Squared Error: 0.009899858087805396
R^2: 0.9459559649630838


**Results:**
- These metrics already look much better than before with the two separate models.
- In particular, the January model was significantly underperforming due to inadequate cleaning.
- The model also doesn't seem to be overfitting as much: test R^2 (0.946) stays reasonably close to train R^2 (0.996).
- Overall, a good improvement compared to the first modelling attempt.

## XGBoost
---
Now, we will try out XGBoost in order to compare its performance to random forest.

In [33]:
# Instantiate an XGBoost regressor object 
xg_reg = xgb.XGBRegressor()

# Fit the regressor to the training set
xg_reg.fit(X_train,y_train)

# Predict on the test set
preds = xg_reg.predict(X_test)



In [34]:
# Predict on the training set
xg_training_predictions = xg_reg.predict(X_train)
df_true_vs_xg_predicted = pd.DataFrame({'Actual Value': y_train, 'Predicted Value': xg_training_predictions})
df_true_vs_xg_predicted.head(10)

Unnamed: 0,Actual Value,Predicted Value
17322,0.003788,0.026074
30089,0.024592,0.013151
58004,0.002214,0.012218
7058,0.000922,0.009431
51631,0.026694,0.013151
88010,0.003956,0.010951
31276,0.002934,0.009431
80956,0.000843,0.001113
87892,0.003585,0.013884
71610,0.003091,0.01132


In [35]:
print("\n==================== Train Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_train, xg_training_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_train, xg_training_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_train, xg_training_predictions)))
print('R^2:', metrics.r2_score(y_train, xg_training_predictions))
print("\n=======================================================")


Mean Absolute Error: 0.011805488341640738
Mean Squared Error: 0.0005039111420268763
Root Mean Squared Error: 0.022447965209053497
R^2: 0.7228180328713245



In [36]:
# Predict on the test set
xg_test_predictions = xg_reg.predict(X_test)
df_true_vs_xg_predicted_test = pd.DataFrame({'Actual Value': y_test, 'Predicted Value': xg_test_predictions})
df_true_vs_xg_predicted_test.head(10)

Unnamed: 0,Actual Value,Predicted Value
68092,0.002102,0.009431
77712,0.011824,0.084026
76320,0.009778,0.009431
5883,0.001708,0.002045
77064,0.037326,0.057394
45117,0.000697,0.009431
81516,0.008115,0.013151
19160,0.005867,0.009431
21383,0.004192,0.017883
3634,0.004552,0.000613


In [37]:
print("\n==================== Test Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, xg_test_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, xg_test_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, xg_test_predictions)))
print('R^2:', metrics.r2_score(y_test, xg_test_predictions))
print("=======================================================")


Mean Absolute Error: 0.01156248428500451
Mean Squared Error: 0.0004879722938553848
Root Mean Squared Error: 0.022090094926355223
R^2: 0.7309177856903639


**Results:**
- XGBoost with default parameters performs noticeably worse than random forest (test R^2 of 0.731 vs 0.946).

We will also test the optimal parameters we found for the first model. These may no longer be optimal, so this is only a quick check. We will perform proper parameter tuning after merging the taxi and subway models into one.
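
For the later tuning step, `RandomizedSearchCV` (already imported at the top of this notebook) is one option. The sketch below shows the general pattern on synthetic data. scikit-learn's `GradientBoostingRegressor` is used as a stand-in so that the example depends only on scikit-learn, and the parameter grid is illustrative rather than a recommendation. `xgb.XGBRegressor` follows the scikit-learn estimator interface, so it could be passed in the same way.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the notebook's training data.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

# Illustrative search space; these values are assumptions, not tuned results.
param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1, 0.3],
    'subsample': [0.7, 0.85, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=1),
    param_distributions=param_distributions,
    n_iter=5,                            # number of random combinations to try
    cv=3,                                # plain 3-fold cross-validation
    scoring='neg_mean_absolute_error',
    random_state=1,
)
search.fit(X, y)
print(search.best_params_)
```

Plain k-fold cross-validation mixes dates across folds, so a date-aware split may be a better match for the last-5-days holdout used here.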

In [38]:
xg_reg = xgb.XGBRegressor(subsample=0.7, n_estimators=300, min_child_weight=3, max_depth=7, learning_rate=0.1, colsample_bytree=1, objective='reg:squarederror')

xg_reg.fit(X_train, y_train)

In [39]:
# Predict on the training set
xg_training_predictions = xg_reg.predict(X_train)
df_true_vs_xg_predicted = pd.DataFrame({'Actual Value': y_train, 'Predicted Value': xg_training_predictions})
df_true_vs_xg_predicted.head(10)

Unnamed: 0,Actual Value,Predicted Value
17322,0.003788,0.000304
30089,0.024592,0.022923
58004,0.002214,-0.002495
7058,0.000922,0.005507
51631,0.026694,0.020807
88010,0.003956,0.004374
31276,0.002934,0.004472
80956,0.000843,0.000575
87892,0.003585,0.008778
71610,0.003091,0.00147


In [40]:
print("\n==================== Train Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_train, xg_training_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_train, xg_training_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_train, xg_training_predictions)))
print('R^2:', metrics.r2_score(y_train, xg_training_predictions))
print("\n=======================================================")


Mean Absolute Error: 0.004504607782009095
Mean Squared Error: 6.9113926921045e-05
Root Mean Squared Error: 0.008313478629373207
R^2: 0.9619831104688275



In [41]:
# Predict on the test set
xg_test_predictions = xg_reg.predict(X_test)
df_true_vs_xg_predicted_test = pd.DataFrame({'Actual Value': y_test, 'Predicted Value': xg_test_predictions})
df_true_vs_xg_predicted_test.head(10)

Unnamed: 0,Actual Value,Predicted Value
68092,0.002102,-0.00251
77712,0.011824,0.071204
76320,0.009778,-0.0021
5883,0.001708,0.002677
77064,0.037326,0.044584
45117,0.000697,0.00225
81516,0.008115,0.003625
19160,0.005867,0.002488
21383,0.004192,0.005449
3634,0.004552,0.001611


In [42]:
print("\n==================== Test Data =======================")
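print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, xg_test_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, xg_test_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, xg_test_predictions)))
print('R^2:', metrics.r2_score(y_test, xg_test_predictions))
print("=======================================================")


Mean Absolute Error: 0.01156248428500451
Mean Squared Error: 0.0004879722938553848
Root Mean Squared Error: 0.022090094926355223
R^2: 0.7309177856903639


**Results:**
- The test metrics printed above are identical to the default XGBoost run in cell 37 because they were generated from `preds` (the default model's predictions from cell 33), not from the tuned model. The cell now scores `xg_test_predictions`, so these figures need to be regenerated before the tuned model can be judged on test data.
- On the training data the tuned parameters do help: R^2 rises from 0.723 to 0.962 and MAE falls from 0.0118 to 0.0045.
- The parameters may still no longer be optimal. We will perform parameter tuning once the taxi and subway models are merged together.
- For now, random forest remains the best model with verified test metrics (test R^2 of 0.946).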

We will save the random forest model as it performed best. 

In [43]:
with open('subway_model_v1.pickle', 'wb') as file:
    pickle.dump(random_forest, file)
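
For completeness, loading the model back works the same way in reverse. The sketch below round-trips a small toy model through a temporary file and checks that the restored model gives the same predictions; the file name `toy_model.pickle` and the toy data are placeholders.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small toy model standing in for `random_forest`.
rng = np.random.default_rng(1)
X = rng.random((50, 3))
y = X.sum(axis=1)
model = RandomForestRegressor(n_estimators=10, random_state=1).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'toy_model.pickle'
    with open(path, 'wb') as file:
        pickle.dump(model, file)
    with open(path, 'rb') as file:
        restored = pickle.load(file)

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X), restored.predict(X))
print('Predictions match after reload:', same)
```

A pickled scikit-learn model should be loaded with the same scikit-learn version it was saved with, and the input columns must be in the same order as `X_train`.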

The next step will be looking at merging the taxi and subway data into one, and creating a model that incorporates both.