# Initial Subway Model
---
This notebook builds simple baseline models for the January and the February-December subway data, one model per dataset. The steps mirror those taken in the initial taxi model. Date-time features and station complexes are used as inputs. Feature engineering will be added later on to help the models generalise.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor

from sklearn import metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import xgboost as xgb

from tabulate import tabulate

## February-December Model
---
We will start by modelling the February-December data. We have tried combining the two datasets on both the latitude and longitude co-ordinates and on the Georeference. However, presumably due to rounding errors or slight differences in the number of decimal places, this proved more difficult than expected. As such, we will train two models for now in order to get predictions for the MVP. We will return to merging the two datasets, as well as the taxi model, into one model when possible.
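One way the coordinate mismatch might be worked around is to join on coordinates rounded to a shared precision rather than on the raw values. The sketch below is only an illustration under that assumption: the toy frames, the `add_coord_keys` helper, the placeholder complex names and the choice of 4 decimal places (roughly 11 m of latitude) are all hypothetical and are not taken from the cleaned files.

```python
import pandas as pd

# Toy frames standing in for the two subway datasets (illustrative values only).
jan = pd.DataFrame({
    "Station": ["34 ST-PENN STA"],
    "Latitude": [40.752247],
    "Longitude": [-73.993456],
})
feb_dec = pd.DataFrame({
    "station_complex": ["Hypothetical complex A", "Hypothetical complex B"],
    "latitude": [40.75221, 40.711834],
    "longitude": [-73.99348, -74.01219],
})

def add_coord_keys(frame, lat_col, lon_col, decimals=4):
    """Return a copy with integer join keys built from rounded coordinates."""
    out = frame.copy()
    scale = 10 ** decimals
    # Integer keys avoid comparing floating-point values for equality.
    out["lat_key"] = (out[lat_col] * scale).round().astype("int64")
    out["lon_key"] = (out[lon_col] * scale).round().astype("int64")
    return out

merged = add_coord_keys(jan, "Latitude", "Longitude").merge(
    add_coord_keys(feb_dec, "latitude", "longitude"),
    on=["lat_key", "lon_key"],
    how="left",
    indicator=True,  # the _merge column shows which rows found a partner
)
print(merged[["Station", "station_complex", "_merge"]])
```

Rounding can still separate two nearby points that fall on either side of a rounding boundary, so the `_merge` column is worth inspecting for unmatched rows before relying on the result.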

In [23]:
df = pd.read_csv("2022_subway_feb_onwards_cleaned.csv")

In [24]:
df.shape

(951420, 15)

In [25]:
df.head()

Unnamed: 0,transit_timestamp,station_complex_id,station_complex,borough,routes,payment_method,ridership,transfers,latitude,longitude,Georeference,date,hour,day,month
0,2022-02-01 10:00:00,R106,WTC Cortlandt (1),M,1,all,87,0,40.711834,-74.01219,POINT (-74.01219 40.711834),2022-02-01,10,Tuesday,2
1,2022-02-01 21:00:00,R185,191 St (1),M,1,all,52,0,40.855225,-73.92941,POINT (-73.92941 40.855225),2022-02-01,21,Tuesday,2
2,2022-02-01 02:00:00,A022,"34 St-Herald Sq (B,D,F,M,N,Q,R,W)",M,"F,M,N,R,Q,B,W,D",all,76,0,40.749718,-73.98782,POINT (-73.98782 40.749718),2022-02-01,2,Tuesday,2
3,2022-02-01 06:00:00,N094,"Chambers St (A,C)/WTC (E)/Park Pl (2,3)/Cortla...",M,"3,C,E,2,R,,A,W",all,378,23,40.71411,-74.00858,POINT (-74.00858 40.71411),2022-02-01,6,Tuesday,2
4,2022-02-01 15:00:00,R308,145 St (3),M,3,all,109,2,40.82042,-73.93625,POINT (-73.93625 40.82042),2022-02-01,15,Tuesday,2


We keep only the station complex, the ridership target and the date-time features (hour, day, month). All other columns are dropped, as they will not be used in our initial model.

In [26]:
columns_to_drop = ['transit_timestamp', 'station_complex_id', 'borough', 'routes', 'payment_method',
                  'transfers', 'latitude', 'longitude', 'Georeference', 'date']

In [27]:
df = df.drop(columns=columns_to_drop)

In [28]:
df

Unnamed: 0,station_complex,ridership,hour,day,month
0,WTC Cortlandt (1),87,10,Tuesday,2
1,191 St (1),52,21,Tuesday,2
2,"34 St-Herald Sq (B,D,F,M,N,Q,R,W)",76,2,Tuesday,2
3,"Chambers St (A,C)/WTC (E)/Park Pl (2,3)/Cortla...",378,6,Tuesday,2
4,145 St (3),109,15,Tuesday,2
...,...,...,...,...,...
951415,72 St (Q),602,12,Saturday,12
951416,Dyckman St (1),124,20,Saturday,12
951417,"57 St-7 Av (N,Q,R,W)",202,1,Saturday,12
951418,"34 St-Herald Sq (B,D,F,M,N,Q,R,W)",4755,16,Saturday,12


We take a sample of 20,000 entries per month. Sampling the same number of rows from every month (stratifying by month) keeps each month equally represented in the sample.
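The per-month sampling can also be written as a single grouped call. The sketch below uses pandas' `DataFrameGroupBy.sample` on a small made-up frame; the toy data and the sample size of 10 are assumptions chosen only to keep the example short.

```python
import pandas as pd

# Toy data: three months with 50 rows each (illustrative only).
toy = pd.DataFrame({
    "month": [2] * 50 + [3] * 50 + [4] * 50,
    "ridership": range(150),
})

# Draw the same number of rows from every month in one call.
sampled = toy.groupby("month").sample(n=10, random_state=1)

print(sampled["month"].value_counts().sort_index())
```

If strata proportional to each month's size are preferred instead of equal-sized ones, `frac=` can be passed in place of `n=`.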

In [32]:
# Initialize an empty DataFrame to store the samples
sampled_df = pd.DataFrame()

# Loop over each month
for month in df['month'].unique():
    # Get a sample of 20,000 rows for the month
    month_df = df[df['month'] == month].sample(n=20000, random_state=1)
    
    # Append the month sample to the overall sample DataFrame
    sampled_df = pd.concat([sampled_df, month_df])

In [30]:
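# Alternative that uses the full dataset instead of a sample.
# Per the execution counts, this cell ran before the sampling loop (In [32]),
# so the sampled data is what the following cells use.
sampled_df = df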

In [33]:
sampled_df

Unnamed: 0,station_complex,ridership,hour,day,month
44225,"96 St (B,C)",1,1,Wednesday,2
14977,"8 St-New York University (R,W)",54,9,Sunday,2
67554,Dyckman St (A),169,11,Thursday,2
70829,"81 St-Museum of Natural History (B,C)",90,23,Friday,2
67059,"Broad St (J,Z)",156,13,Thursday,2
...,...,...,...,...,...
944119,"47-50 Sts-Rockefeller Center (B,D,F,M)",72,5,Thursday,12
867561,"86 St (4,5,6)",3799,16,Friday,12
924336,23 St (1),452,19,Thursday,12
890130,28 St (6),39,4,Saturday,12


We need to one-hot encode both the station complex and the day of the week in order to make them readable for the models.

In [34]:
sampled_df = pd.get_dummies(sampled_df, columns=['station_complex', 'day'])

In [35]:
sampled_df

Unnamed: 0,ridership,hour,month,station_complex_1 Av (L),station_complex_103 St (1),station_complex_103 St (6),"station_complex_103 St (B,C)",station_complex_110 St (6),"station_complex_116 St (2,3)",station_complex_116 St (6),...,"station_complex_Wall St (2,3)","station_complex_Wall St (4,5)","station_complex_West 4 St-Washington Sq (A,B,C,D,E,F,M)",day_Friday,day_Monday,day_Saturday,day_Sunday,day_Thursday,day_Tuesday,day_Wednesday
44225,1,1,2,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
14977,54,9,2,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
67554,169,11,2,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
70829,90,23,2,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
67059,156,13,2,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
944119,72,5,12,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
867561,3799,16,12,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
924336,452,19,12,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
890130,39,4,12,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [36]:
X = sampled_df.drop('ridership', axis=1)
y = sampled_df['ridership']

## Random Forest
---
First, we will test a random forest model, using a 70/30 train/test split on our sample data. We observed in the taxi data that linear regression was not suited to our particular problem. As such, there is little point in creating a linear regression model, and we start with random forest.

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [38]:
random_forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1)
random_forest.fit(X_train, y_train)

In [39]:
# Creating a dataframe to store & display feature importance
feature_importance = pd.DataFrame({'feature': X_train.columns, 'importance':random_forest.feature_importances_})
feature_importance.sort_values('importance', ascending=False)

Unnamed: 0,feature,importance
0,hour,0.333696
117,"station_complex_Times Sq-42 St (N,Q,R,W,S,1,2,...",0.188076
101,"station_complex_Grand Central-42 St (S,4,5,6,7)",0.061288
48,"station_complex_34 St-Herald Sq (B,D,F,M,N,Q,R,W)",0.042550
125,day_Sunday,0.034674
...,...,...
111,station_complex_Rector St (1),0.000048
9,"station_complex_116 St (B,C)",0.000048
90,"station_complex_Central Park North-110 St (2,3)",0.000048
113,station_complex_Roosevelt Island (F),0.000046


In [40]:
# Testing predicted vs actual values 
rf_training_predictions = random_forest.predict(X_train)
df_true_vs_rf_predicted = pd.DataFrame({'Actual Value': y_train, 'Predicted Value': rf_training_predictions})
df_true_vs_rf_predicted.head(10)

Unnamed: 0,Actual Value,Predicted Value
137223,10,11.411667
612635,17,21.618667
613506,26,25.83
718126,786,694.328524
58085,364,305.651333
641631,422,426.064167
571110,271,230.45
416288,45,51.889167
693356,143,177.792333
855453,4,5.1025


In [41]:
print("\n==================== Train Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_train, rf_training_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_train, rf_training_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_train, rf_training_predictions)))
print('R^2:', metrics.r2_score(y_train, rf_training_predictions))
print("=======================================================")


Mean Absolute Error: 35.339571858006536
Mean Squared Error: 10463.088666471547
Root Mean Squared Error: 102.28924022824467
R^2: 0.9887383930614362


In [42]:
# Predicted ridership values for the held-out test set,
# using the model fitted on the training data
rf_test_predictions = random_forest.predict(X_test)
df_true_vs_rf_predicted_test = pd.DataFrame({'Actual Value': y_test, 'Predicted Value': rf_test_predictions})
df_true_vs_rf_predicted_test.head(10)

Unnamed: 0,Actual Value,Predicted Value
182085,195,189.038976
636350,1546,1512.01
602970,223,265.080095
272953,251,222.078
299053,2,2.75
546592,21,17.412
375382,73,49.615
340058,206,212.52419
334079,541,672.12
764858,664,734.594667


In [43]:
print("\n==================== Test Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, rf_test_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, rf_test_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, rf_test_predictions)))
print('R^2:', metrics.r2_score(y_test, rf_test_predictions))
print("=======================================================")


Mean Absolute Error: 74.00707969752932
Mean Squared Error: 40812.52503689481
Root Mean Squared Error: 202.0211004744178
R^2: 0.9594994179454377


**Results:**
- Much like the taxi data, performance metrics are quite good for such a simple model.
- An MAE of 74 is not bad for a first prediction, considering the wide range in ridership across station complexes.
- R-squared is also high (0.96 on the test set).
- As in our taxi model, some overfitting appears to be occurring: the test MAE (74.0) is roughly double the training MAE (35.3). Ideally, feature engineering will help alleviate this when we work on model improvement.
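To make the train/test gap easier to read side by side, the repeated print blocks could be wrapped in a small helper. The function below is a hypothetical convenience (the name `regression_report` is not part of the notebook) that returns the same four metrics as a dictionary; the numbers in the usage line are made up.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(y_true, y_pred),
    }

# Tiny made-up example, only to show the call
report = regression_report([10, 20, 30], [12, 18, 33])
print(report)
```

In this notebook it could be called as `regression_report(y_train, rf_training_predictions)` and `regression_report(y_test, rf_test_predictions)`, and the two dictionaries compared directly.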

## XGBoost
---
Now, we will try out XGBoost in order to compare its performance to random forest.

In [44]:
# Instantiate an XGBoost regressor object 
xg_reg = xgb.XGBRegressor()

# Fit the regressor to the training set
xg_reg.fit(X_train,y_train)

# Predict on the test set
preds = xg_reg.predict(X_test)



In [45]:
print("\n==================== Test Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, preds))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, preds))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, preds)))
print('R^2:', metrics.r2_score(y_test, preds))
print("=======================================================")


Mean Absolute Error: 269.7385642024035
Mean Squared Error: 265850.80916141806
Root Mean Squared Error: 515.6072237288943
R^2: 0.7361811722999189


With default settings, XGBoost performs noticeably worse than random forest (test R^2 of 0.74 vs 0.96). We will now try applying the parameters we found using randomised search CV on our taxi data.

In [46]:
# Apply the hyperparameters found to be optimal for the taxi model
xg_reg = xgb.XGBRegressor(subsample=0.7, n_estimators=400, min_child_weight=2, max_depth=7, learning_rate=0.1, colsample_bytree=1, objective ='reg:squarederror')

# Fit the regressor to the training set
xg_reg.fit(X_train, y_train)

# Predict on the test set
preds = xg_reg.predict(X_test)

In [47]:
print("\n==================== Test Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, preds))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, preds))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, preds)))
print('R^2:', metrics.r2_score(y_test, preds))
print("=======================================================")


Mean Absolute Error: 99.15993029419536
Mean Squared Error: 44418.895552364105
Root Mean Squared Error: 210.75790744919658
R^2: 0.9559206120556065


**Results:**
- Still slightly underperforming compared to random forest (test MAE of 99.2 vs 74.0).
- But again, still very respectable performance metrics.
- Future work will include more parameter tuning, as the optimal taxi parameters may not be a good fit here.
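A possible shape for that tuning step is sketched below with scikit-learn's `RandomizedSearchCV`. Several things here are assumptions: the data is synthetic, the search ranges are illustrative rather than recommended values, and `GradientBoostingRegressor` is used as a stand-in so the example depends only on scikit-learn.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for X_train / y_train.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 24, size=(200, 3))
y_demo = (
    50 * np.sin(X_demo[:, 0] / 24 * 2 * np.pi)
    + 5 * X_demo[:, 1]
    + rng.normal(0, 1, 200)
)

# Hypothetical search space; the ranges are illustrative, not tuned values.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 5],
    "learning_rate": [0.05, 0.1, 0.2],
    "subsample": [0.7, 0.85, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),  # stand-in estimator
    param_distributions=param_distributions,
    n_iter=5,                                   # number of sampled combinations
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("Cross-validated MAE:", -search.best_score_)
```

Since `xgb.XGBRegressor` follows the scikit-learn estimator interface, it should be possible to pass it in place of the stand-in, with the search space extended to parameters such as `min_child_weight` and `colsample_bytree` that already appear in the call above.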

## January Model
---
Now, we perform the same steps for the January data. Going forward, the two datasets will need to be merged into one.

In [48]:
df = pd.read_csv('2022_subway_jan_cleaned.csv')

In [49]:
df.shape

(203467, 17)

In [50]:
df.head()

Unnamed: 0,C/A,Unit,SCP,Station,Date,Time,Description,Entries,Exits,Latitude,Longitude,geometry,Entries_diff,Exits_diff,Hour,Day,Time Block
0,N067,R012,00-00-00,34 ST-PENN STA,2022-01-01,2023-07-02 07:00:00,REGULAR,328005,947875,40.752247,-73.993456,POINT (-73.993456 40.752247),5.0,6.0,7,Saturday,4-7
1,N067,R012,00-00-00,34 ST-PENN STA,2022-01-01,2023-07-02 11:00:00,REGULAR,328017,947920,40.752247,-73.993456,POINT (-73.993456 40.752247),12.0,45.0,11,Saturday,8-11
2,N067,R012,00-00-00,34 ST-PENN STA,2022-01-01,2023-07-02 15:00:00,REGULAR,328042,947991,40.752247,-73.993456,POINT (-73.993456 40.752247),25.0,71.0,15,Saturday,12-15
3,N067,R012,00-00-00,34 ST-PENN STA,2022-01-01,2023-07-02 19:00:00,REGULAR,328086,948064,40.752247,-73.993456,POINT (-73.993456 40.752247),44.0,73.0,19,Saturday,16-19
4,N067,R012,00-00-00,34 ST-PENN STA,2022-01-01,2023-07-02 23:00:00,REGULAR,328107,948115,40.752247,-73.993456,POINT (-73.993456 40.752247),21.0,51.0,23,Saturday,20-23


In [51]:
columns_to_drop = ['C/A', 'Unit', 'SCP', 'Time', 'Description', 'Entries', 'Exits', 'Latitude',
                   'Longitude', 'geometry','Exits_diff', 'Time Block', 'Date']

In [52]:
df = df.drop(columns=columns_to_drop)

In [55]:
df

Unnamed: 0,Station,Entries_diff,Hour,Day
0,34 ST-PENN STA,5.0,7,Saturday
1,34 ST-PENN STA,12.0,11,Saturday
2,34 ST-PENN STA,25.0,15,Saturday
3,34 ST-PENN STA,44.0,19,Saturday
4,34 ST-PENN STA,21.0,23,Saturday
...,...,...,...,...
203462,34 ST-PENN STA,30.0,7,Monday
203463,34 ST-PENN STA,141.0,11,Monday
203464,34 ST-PENN STA,29.0,15,Monday
203465,34 ST-PENN STA,25.0,19,Monday


**Note:** We use Entries_diff as our ridership equivalent, because the data dictionary for the new format defines ridership as the number of entries in a given hour. Exits are therefore left out in the interest of keeping the two datasets consistent.
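The cleaned file already contains Entries_diff, and the notebook does not show how it was produced. The sketch below assumes it is the difference between consecutive cumulative Entries readings for each turnstile (C/A, Unit, SCP), which is consistent with the preview rows above. The readings are copied from that preview.

```python
import pandas as pd

# Cumulative counter readings copied from the first rows of the January preview.
turnstile = pd.DataFrame({
    "C/A": ["N067"] * 5,
    "Unit": ["R012"] * 5,
    "SCP": ["00-00-00"] * 5,
    "Hour": [7, 11, 15, 19, 23],
    "Entries": [328005, 328017, 328042, 328086, 328107],
})

# Difference between consecutive readings within each turnstile.
turnstile["Entries_diff"] = (
    turnstile.groupby(["C/A", "Unit", "SCP"])["Entries"].diff()
)
print(turnstile[["Hour", "Entries", "Entries_diff"]])
```

The first row comes out as NaN here because its previous reading is not part of the toy frame; in the cleaned file that row has a value of 5.0, presumably computed from an earlier reading.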

In [57]:
df = pd.get_dummies(df, columns=['Station','Day'])

In [58]:
X = df.drop('Entries_diff', axis=1)
y = df['Entries_diff']

In [59]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

## Random Forest
---

In [60]:
random_forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1)
random_forest.fit(X_train, y_train)

In [61]:
# Creating a dataframe to store & display feature importance
feature_importance = pd.DataFrame({'feature': X_train.columns, 'importance':random_forest.feature_importances_})
feature_importance.sort_values('importance', ascending=False)

Unnamed: 0,feature,importance
0,Hour,0.436824
70,Day_Saturday,0.077653
71,Day_Sunday,0.061538
52,Station_GRAND ST,0.039312
24,Station_34 ST-HERALD SQ,0.024203
...,...,...
56,Station_PARK PLACE,0.000830
31,Station_66 ST-LINCOLN,0.000788
15,Station_18 ST,0.000563
39,Station_BOWERY,0.000229


In [62]:
# Testing predicted vs actual values 
rf_training_predictions = random_forest.predict(X_train)
df_true_vs_rf_predicted = pd.DataFrame({'Actual Value': y_train, 'Predicted Value': rf_training_predictions})
df_true_vs_rf_predicted.head(10)

Unnamed: 0,Actual Value,Predicted Value
32415,11.0,17.507459
63530,10.0,31.865684
173718,113.0,96.441726
142825,160.0,38.656425
33445,1.0,2.609929
193357,10.0,11.104908
82841,7.0,2.992026
6047,1.0,17.40099
167354,31.0,31.676238
42182,12.0,118.508594


In [63]:
print("\n==================== Train Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_train, rf_training_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_train, rf_training_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_train, rf_training_predictions)))
print('R^2:', metrics.r2_score(y_train, rf_training_predictions))
print("=======================================================")


Mean Absolute Error: 38.4017784140527
Mean Squared Error: 4121.025159810058
Root Mean Squared Error: 64.19521134640853
R^2: 0.49085316060029816


In [64]:
# Predicted entry counts for the held-out test set,
# using the model fitted on the training data
rf_test_predictions = random_forest.predict(X_test)
df_true_vs_rf_predicted_test = pd.DataFrame({'Actual Value': y_test, 'Predicted Value': rf_test_predictions})
df_true_vs_rf_predicted_test.head(10)

Unnamed: 0,Actual Value,Predicted Value
200434,11.0,65.021769
193582,6.0,4.381586
73625,125.0,210.684239
99526,278.0,187.400608
15969,10.0,22.708527
87951,87.0,62.303542
119586,60.0,124.557578
137352,154.0,206.792451
18588,313.0,179.692705
137903,320.0,163.954477


In [65]:
print("\n==================== Test Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, rf_test_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, rf_test_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, rf_test_predictions)))
print('R^2:', metrics.r2_score(y_test, rf_test_predictions))
print("=======================================================")


Mean Absolute Error: 39.265191824427426
Mean Squared Error: 4272.9847793418685
Root Mean Squared Error: 65.36807155899483
R^2: 0.47039292843474556


**Results:**
- Performance of the January model is much worse than that of the February-December model (test R^2 of 0.47 vs 0.96).
- This is perhaps due to the quality issues identified in the January data during data exploration.
- This gives cause to re-investigate the data cleaning strategy in an effort to improve the model's performance.
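One avenue that might be worth checking when revisiting the cleaning: the January preview has one row per turnstile (C/A, Unit, SCP), whereas the February-December data has one row per station complex. Whether this contributes to the weaker fit is an untested assumption, but summing entries to station level before modelling would put both datasets at the same granularity. The toy rows and station name below are made up.

```python
import pandas as pd

# Made-up turnstile-level rows: two turnstiles at one station, two time slots.
toy = pd.DataFrame({
    "Station": ["STATION A"] * 4,
    "SCP": ["00-00-00", "00-00-01", "00-00-00", "00-00-01"],
    "Date": ["2022-01-01"] * 4,
    "Hour": [7, 7, 11, 11],
    "Day": ["Saturday"] * 4,
    "Entries_diff": [5.0, 40.0, 12.0, 90.0],
})

# Sum the turnstile counts so each row is one station in one time slot.
station_level = (
    toy.groupby(["Station", "Date", "Hour", "Day"], as_index=False)["Entries_diff"]
    .sum()
)
print(station_level)
```

This would have to run before the Date column is dropped, since the date is needed to keep different days apart.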

## XGBoost
---

In [66]:
# Instantiate an XGBoost regressor object 
xg_reg = xgb.XGBRegressor()

# Fit the regressor to the training set
xg_reg.fit(X_train,y_train)

# Predict on the test set
preds = xg_reg.predict(X_test)



In [67]:
print("\n==================== Test Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, preds))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, preds))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, preds)))
print('R^2:', metrics.r2_score(y_test, preds))
print("=======================================================")


Mean Absolute Error: 44.51603097536857
Mean Squared Error: 5030.213676448517
Root Mean Squared Error: 70.92399929818197
R^2: 0.376539615256549


In [68]:
# Apply the hyperparameters found to be optimal for the taxi model
xg_reg = xgb.XGBRegressor(subsample=0.7, n_estimators=400, min_child_weight=2, max_depth=7, learning_rate=0.1, colsample_bytree=1, objective ='reg:squarederror')

# Fit the regressor to the training set
xg_reg.fit(X_train, y_train)

# Predict on the test set
preds = xg_reg.predict(X_test)

In [69]:
print("\n==================== Test Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, preds))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, preds))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, preds)))
print('R^2:', metrics.r2_score(y_test, preds))
print("=======================================================")


Mean Absolute Error: 39.30283340356304
Mean Squared Error: 4255.517061130306
Root Mean Squared Error: 65.23432425594908
R^2: 0.472557931954926


**Results:**
- Almost identical performance to random forest.
- Still vastly underperforming compared to February onwards.

The January model is certainly underperforming in comparison to the February-onwards model. This will need to be investigated, as will a method for merging all of the models into one.