# Initial Data Modelling
---
This notebook covers the initial data modelling. We import the cleaned 2022 taxi data and use it to build an initial model.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor

from sklearn import metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import xgboost as xgb

from tabulate import tabulate

In [2]:
df = pd.read_csv("2022_taxi_data_cleaned.csv")

In [3]:
df.shape

(31323476, 21)

In [4]:
df.head()

|   | VendorID | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | ... | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee | date | month | time | day_of_the_week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2.0 | 3.8 | 1.0 | N | 142 | 236 | 1 | 14.5 | 3.0 | ... | 3.65 | 0.0 | 0.3 | 21.95 | 2.5 | 0.0 | 2022-01-01 | 1 | 0 | Saturday |
| 1 | 1 | 1.0 | 2.1 | 1.0 | N | 236 | 42 | 1 | 8.0 | 0.5 | ... | 4.0 | 0.0 | 0.3 | 13.3 | 0.0 | 0.0 | 2022-01-01 | 1 | 0 | Saturday |
| 2 | 2 | 1.0 | 0.97 | 1.0 | N | 166 | 166 | 1 | 7.5 | 0.5 | ... | 1.76 | 0.0 | 0.3 | 10.56 | 0.0 | 0.0 | 2022-01-01 | 1 | 0 | Saturday |
| 3 | 2 | 1.0 | 1.09 | 1.0 | N | 114 | 68 | 2 | 8.0 | 0.5 | ... | 0.0 | 0.0 | 0.3 | 11.8 | 2.5 | 0.0 | 2022-01-01 | 1 | 0 | Saturday |
| 4 | 2 | 1.0 | 4.3 | 1.0 | N | 68 | 163 | 1 | 23.5 | 0.5 | ... | 3.0 | 0.0 | 0.3 | 30.3 | 2.5 | 0.0 | 2022-01-01 | 1 | 0 | Saturday |


# Training a Single Model
---
We start by training a single model that takes the drop-off zone as a feature. Having the zone as an input lets the model learn general patterns across all zones while still accounting for the effect of a particular zone, as some zones are generally busier than others.

We will also explore dropping the zone as a feature and training an individual model for each zone. It is hard to tell which approach is better, so we will test both and compare the results.

---

As described in the preliminary data exploration and processing, many of the continuous features appear to have little relation to the target outcome of predicting busyness for a particular zone. As such, we are dropping these columns before training our first model.

In [5]:
columns_to_drop = ['VendorID', 'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
                  'PULocationID', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount',
                  'tolls_amount', 'improvement_surcharge', 'total_amount', 'congestion_surcharge',
                  'airport_fee', 'date']

In [6]:
df = df.drop(columns=columns_to_drop)

To quantify busyness for a particular zone, we need a measure that acts as a proxy for it. For the initial model, we use the number of drop-offs in a particular zone at a given hour, day of the week, and month. We add a new column to the data frame to hold this count, which will be the target when we perform regression analysis. Note that the column is named num_pickups in the code even though, because the grouping is on DOLocationID, it counts drop-offs.

In [7]:
# Create an aggregated DataFrame
df_agg = df.groupby(['DOLocationID', 'month', 'time', 'day_of_the_week']).size().reset_index(name='num_pickups')

# Join the aggregated DataFrame back to the original DataFrame
df = pd.merge(df, df_agg, how='left', on=['DOLocationID', 'month', 'time', 'day_of_the_week'])
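The count-then-join pattern above can be illustrated on a tiny frame. This is a minimal sketch on made-up data; the name `num_dropoffs` and the toy values are assumptions for illustration, not part of the notebook. It also shows `transform("size")`, which gives the same per-row count in one step and may be worth considering on a frame of this size, since it avoids the intermediate merge.

```python
import pandas as pd

# Toy stand-in for the trip-level frame (values are made up for illustration).
trips = pd.DataFrame({
    "DOLocationID": [236, 236, 42, 236, 42],
    "month": [1, 1, 1, 2, 1],
    "time": [0, 0, 0, 0, 0],
    "day_of_the_week": ["Saturday"] * 5,
})

keys = ["DOLocationID", "month", "time", "day_of_the_week"]

# Same pattern as the notebook cell: count rows per group, then join the count back.
counts = trips.groupby(keys).size().reset_index(name="num_dropoffs")
merged = trips.merge(counts, how="left", on=keys)

# One-step alternative: transform broadcasts each group's size to all of its rows.
trips["num_dropoffs"] = trips.groupby(keys)["DOLocationID"].transform("size")

print(merged["num_dropoffs"].tolist())
print(trips["num_dropoffs"].tolist())
```

Both approaches leave one row per trip, so every trip in the same zone, month, hour, and weekday carries the same target value.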

In [8]:
df

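|   | DOLocationID | month | time | day_of_the_week | num_pickups |
|---|---|---|---|---|---|
| 0 | 236 | 1 | 0 | Saturday | 358 |
| 1 | 42 | 1 | 0 | Saturday | 113 |
| 2 | 166 | 1 | 0 | Saturday | 100 |
| 3 | 68 | 1 | 0 | Saturday | 451 |
| 4 | 163 | 1 | 0 | Saturday | 192 |
| ... | ... | ... | ... | ... | ... |
| 31323471 | 162 | 12 | 23 | Saturday | 382 |
| 31323472 | 142 | 12 | 23 | Saturday | 500 |
| 31323473 | 141 | 12 | 23 | Saturday | 752 |
| 31323474 | 142 | 12 | 23 | Saturday | 500 |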


**Dealing with the Large Data Size**
With over 30 million rows, and because we are training locally on a laptop, we need a way to cope with the size of the dataset. To begin, we use stratified random sampling to draw 1,000 rows for each month of the year (12,000 rows in total). Having a random sample of equal size from each month will hopefully be more representative of the original data than other techniques, such as simple random sampling.

In [9]:
# Initialize an empty DataFrame to store the samples
sampled_df = pd.DataFrame()

# Loop over each month
for month in df['month'].unique():
    # Get a sample of 1000 rows for the month
    month_df = df[df['month'] == month].sample(n=1000, random_state=1)
    
    # Append the month sample to the overall sample DataFrame
    sampled_df = pd.concat([sampled_df, month_df])
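As a possible alternative to the loop above, pandas (1.1 and later) can draw the per-group sample in a single call with `groupby(...).sample(...)`. The sketch below uses a made-up frame and a placeholder `n=10`; it is only meant to show the shape of the call, not to replace the cell that produced the results in this notebook.

```python
import pandas as pd

# Toy frame: three months with different numbers of rows (illustrative only).
toy = pd.DataFrame({
    "month": [1] * 50 + [2] * 30 + [3] * 20,
    "value": range(100),
})

# Draw the same number of rows from every month in one call.
# n=10 is a placeholder; the notebook uses n=1000.
sample = toy.groupby("month").sample(n=10, random_state=1)

print(sample["month"].value_counts().sort_index().to_dict())
```

This avoids repeatedly concatenating onto a growing DataFrame, although the rows drawn will not necessarily be the same ones the loop selects.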

In [10]:
sampled_df

|   | DOLocationID | month | time | day_of_the_week | num_pickups |
|---|---|---|---|---|---|
| 1023800 | 237 | 1 | 17 | Monday | 1177 |
| 2018926 | 48 | 1 | 21 | Monday | 403 |
| 163723 | 170 | 1 | 11 | Tuesday | 586 |
| 263743 | 232 | 1 | 22 | Wednesday | 88 |
| 1533312 | 164 | 1 | 17 | Monday | 444 |
| ... | ... | ... | ... | ... | ... |
| 30903021 | 68 | 12 | 10 | Sunday | 441 |
| 30336702 | 170 | 12 | 23 | Saturday | 638 |
| 31010885 | 246 | 12 | 14 | Tuesday | 397 |
| 30422979 | 186 | 12 | 8 | Monday | 299 |


We also one-hot encode the day of the week, as it is categorical and cannot be used for training in its original string form.

In [11]:
sampled_df = pd.get_dummies(sampled_df, columns=['day_of_the_week'])
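One detail worth knowing when re-running this cell: from pandas 2.0 onwards, `get_dummies` returns boolean columns (`True`/`False`) by default rather than the 0/1 integers shown in the output here. A minimal sketch on made-up data, assuming 0/1 columns are wanted, passes `dtype=int` explicitly:

```python
import pandas as pd

# Two made-up rows, just to show the encoding.
toy = pd.DataFrame({"time": [17, 11], "day_of_the_week": ["Monday", "Tuesday"]})

# dtype=int forces 0/1 columns; without it, pandas 2.x returns True/False.
encoded = pd.get_dummies(toy, columns=["day_of_the_week"], dtype=int)

print(encoded.columns.tolist())
print(encoded.values.tolist())
```

The scikit-learn and XGBoost models used later accept either form, so this mainly affects how the frame is displayed.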

In [12]:
sampled_df

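|   | DOLocationID | month | time | num_pickups | day_of_the_week_Friday | day_of_the_week_Monday | day_of_the_week_Saturday | day_of_the_week_Sunday | day_of_the_week_Thursday | day_of_the_week_Tuesday | day_of_the_week_Wednesday |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1023800 | 237 | 1 | 17 | 1177 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2018926 | 48 | 1 | 21 | 403 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 163723 | 170 | 1 | 11 | 586 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 263743 | 232 | 1 | 22 | 88 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1533312 | 164 | 1 | 17 | 444 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 30903021 | 68 | 12 | 10 | 441 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 30336702 | 170 | 12 | 23 | 638 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 31010885 | 246 | 12 | 14 | 397 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 30422979 | 186 | 12 | 8 | 299 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |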


Our sample is now ready, and we can split the dataset into a 70/30 training/test split.

In [13]:
X = sampled_df.drop('num_pickups', axis=1)
y = sampled_df['num_pickups']

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

## Linear Regression:
---
While we don't believe linear regression is particularly well suited to this analysis, we will still run it as a baseline against which to benchmark the other models. Linear regression is often not well suited to temporal analysis, as it may fail to capture time-related trends over a long period. Still, given we are currently only looking at one year of data, there is no harm in testing it as a starting point.

In [15]:
linear_reg = LinearRegression().fit(X_train, y_train)

In [16]:
linreg_coefficients = list(zip(X_train.columns, linear_reg.coef_))

# Sorting the coefficients in ascending order
sorted_linreg_coef_data = sorted(linreg_coefficients, key=lambda x: x[1])

headers = ["Feature", "Coefficient"]
print(tabulate(sorted_linreg_coef_data, headers=headers, floatfmt=".6f"), "\n")

print(f"Model Intercept: {linear_reg.intercept_}\n")

Feature                      Coefficient
-------------------------  -------------
day_of_the_week_Sunday       -146.474892
day_of_the_week_Monday        -40.630231
day_of_the_week_Saturday      -37.537829
month                          -0.540827
DOLocationID                    1.601206
time                            7.921883
day_of_the_week_Friday         13.684354
day_of_the_week_Wednesday      48.651843
day_of_the_week_Thursday       76.330572
day_of_the_week_Tuesday        85.976184 

Model Intercept: 174.5697028546204



In [17]:
y_train_pred = linear_reg.predict(X_train)
print('Predicted values:', y_train_pred[:10])

Predicted values: [645.0026764  710.93090198 569.56962982 558.39385108 419.26069036
 796.80437337 592.98247558 640.99912722 494.15603825 748.0231251 ]


In [18]:
print("==================== Train Data =======================")
print('Mean Absolute Error:', mean_absolute_error(y_train, y_train_pred))
print('Mean Squared Error:', mean_squared_error(y_train, y_train_pred))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_train, y_train_pred)))
print('R2 Score:', r2_score(y_train, y_train_pred))
print("=======================================================")

Mean Absolute Error: 270.56617488053195
Mean Squared Error: 125950.23311996537
Root Mean Squared Error: 354.89467891187854
R2 Score: 0.1374287533273585


**Training Data Evaluation:**
- As expected, linear regression does not seem to perform well on our training data.
- All performance metrics are quite poor across the board.
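The same four metrics are printed for every model in this notebook, so a small helper may reduce repetition and keep each header matched to the data being scored. The function name `regression_report` and the demo numbers below are hypothetical; this is a sketch of one way to wrap the scikit-learn metric functions already imported above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred, label):
    """Print and return MAE, MSE, RMSE and R^2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    scores = {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(y_true, y_pred),
    }
    print(f"==================== {label} ====================")
    for name, value in scores.items():
        print(f"{name}: {value:.4f}")
    return scores


# Made-up numbers, only to show the call.
demo = regression_report([100, 200, 300], [110, 190, 330], label="Demo Data")
```

In the notebook this would be called as, for example, `regression_report(y_train, y_train_pred, label="Train Data")`.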

In [19]:
y_test_pred = linear_reg.predict(X_test)
print('Predicted values:', y_test_pred[:10])
print("")

Predicted values: [605.97906562 560.23361399 478.87342264 626.06191875 627.25484483
 623.77132082 429.80647258 661.84659741 419.53325972 545.98539743]



In [20]:
print("==================== Test Data ========================")
print('Mean Absolute Error:', mean_absolute_error(y_test, y_test_pred))
print('Mean Squared Error:', mean_squared_error(y_test, y_test_pred))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_test_pred)))
print("\n================== Regression Report ==================")
print('R2 Score:', r2_score(y_test, y_test_pred))
print("\n=======================================================")

Mean Absolute Error: 275.441533645591
Mean Squared Error: 131725.6493284557
Root Mean Squared Error: 362.94028341926406

R2 Score: 0.14253653989060522



**Test Data Evaluation:**
- We see similar results on the test split.
- Overall, linear regression does not seem suited to our particular problem.

## Random Forest
---

We will now test random forest for our regression analysis.

In [21]:
random_forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1)
random_forest.fit(X_train, y_train)
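The forest above is built with `oob_score=True`, which makes scikit-learn compute an out-of-bag estimate during fitting and expose it as the `oob_score_` attribute (an R-squared value for regressors). The sketch below uses synthetic data, so the numbers it prints say nothing about the taxi model; it only shows how the attribute could be read alongside the training score.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 3))
y_demo = 5 * X_demo[:, 0] + rng.normal(0, 1, size=500)

forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1)
forest.fit(X_demo, y_demo)

# R^2 computed on the samples each tree did NOT see during bootstrapping.
print("OOB R^2:", forest.oob_score_)
# R^2 on the training data itself, which is typically more optimistic.
print("Train R^2:", forest.score(X_demo, y_demo))
```

Because the out-of-bag score only uses trees that did not train on a given row, it can serve as a rough check on generalisation without touching the test split.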

In [22]:
# Creating a dataframe to store & display feature importance
feature_importance = pd.DataFrame({'feature': X_train.columns, 'importance':random_forest.feature_importances_})
feature_importance.sort_values('importance', ascending=False)

|   | feature | importance |
|---|---|---|
| 0 | DOLocationID | 0.599702 |
| 2 | time | 0.247061 |
| 1 | month | 0.064077 |
| 6 | day_of_the_week_Sunday | 0.029639 |
| 5 | day_of_the_week_Saturday | 0.021503 |
| 3 | day_of_the_week_Friday | 0.009915 |
| 4 | day_of_the_week_Monday | 0.009223 |
| 8 | day_of_the_week_Tuesday | 0.006797 |
| 9 | day_of_the_week_Wednesday | 0.0061 |
| 7 | day_of_the_week_Thursday | 0.005984 |


**Observations:** The feature importance ranking looks promising at first glance. Drop-off location is the most important feature by far (about 0.60), which should mean the model can distinguish busyness between zones relatively clearly. Time and month rank second and third (about 0.25 and 0.06), which supports what we saw in the initial data visualisation: the graphs showed a clear difference in the number of taxis as the time of day changes, and smaller but still noticeable differences when the pickups per month were plotted. So far, this backs up our initial suspicions.

In [23]:
# Testing predicted vs actual values 
rf_training_predictions = random_forest.predict(X_train)
df_true_vs_rf_predicted = pd.DataFrame({'Actual Value': y_train, 'Predicted Value': rf_training_predictions})
df_true_vs_rf_predicted.head(10)

|   | Actual Value | Predicted Value |
|---|---|---|
| 27686079 | 728 | 737.86 |
| 3960398 | 1700 | 1670.71 |
| 9991269 | 722 | 736.84 |
| 1972780 | 171 | 185.3 |
| 10898020 | 823 | 869.57 |
| 22972743 | 850 | 851.93 |
| 15349398 | 702 | 634.97 |
| 16496062 | 428 | 430.59 |
| 8401815 | 735 | 718.26 |
| 28038907 | 730 | 702.6 |


In [24]:
print("\n==================== Train Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_train, rf_training_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_train, rf_training_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_train, rf_training_predictions)))
print('R^2:', metrics.r2_score(y_train, rf_training_predictions))
print("\n=======================================================")


Mean Absolute Error: 26.78383333333333
Mean Squared Error: 1469.757363857143
Root Mean Squared Error: 38.337414673620636
R^2: 0.9899343541473161



**Training Data Evaluation:**
- The model is quite accurate on the training set.
- An MAE of approx. 26.8 seems reasonable for a first model, given that we are predicting the number of drop-offs for a given hour. Being off by about 27 on average is a good starting point.
- R-squared is extremely high, at almost 0.99. This could be a sign of overfitting.


We will now run our model on the test set to see how it performs on unseen data.

In [25]:
# Predicted values for the held-out test split,
# using the model trained on the training split
rf_test_predictions = random_forest.predict(X_test)
df_true_vs_rf_predicted_test = pd.DataFrame({'Actual Value': y_test, 'Predicted Value': rf_test_predictions})
df_true_vs_rf_predicted_test.head(10)

|   | Actual Value | Predicted Value |
|---|---|---|
| 10437746 | 957 | 895.17 |
| 31057146 | 687 | 793.08 |
| 30429506 | 543 | 621.56 |
| 18679329 | 431 | 511.22 |
| 29621108 | 1479 | 1306.54 |
| 11270405 | 408 | 516.06 |
| 22188841 | 375 | 312.66 |
| 19280168 | 163 | 225.96 |
| 30710829 | 247 | 230.02 |
| 23903410 | 311 | 260.25 |


In [26]:
print("\n==================== Test Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, rf_test_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, rf_test_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, rf_test_predictions)))
print('R^2:', metrics.r2_score(y_test, rf_test_predictions))
print("=======================================================")


Mean Absolute Error: 73.12122222222222
Mean Squared Error: 10735.220530722221
Root Mean Squared Error: 103.61090932291937
R^2: 0.9301194612572542


**Test Data Evaluation:**
- The model did not perform as well on the test data.
- MAE is significantly higher, at approx. 73 compared with approx. 27 on the training data, which points to some overfitting.
- R-squared is also lower, at 0.93; however, this isn't necessarily a bad thing. It is still a very high score, and means most of the variance in the target variable is explained by the independent variables.
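A single train/test split gives only one estimate of the gap between training and test error. `cross_val_score`, which is imported at the top of this notebook, could give a more stable picture by repeating the evaluation over several folds. The following is a minimal sketch on synthetic data, with a smaller forest than the one used above; the numbers it prints are not results for the taxi model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 5 * X_demo[:, 0] + rng.normal(0, 1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=1)

# scikit-learn maximises scores, so MAE is exposed as a negative number.
neg_mae = cross_val_score(
    model, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)
mae_per_fold = -neg_mae

print("MAE per fold:", np.round(mae_per_fold, 2))
print("Mean MAE:", mae_per_fold.mean())
```

If the per-fold MAE values are close to each other, the test figure above is less likely to be an artefact of one particular split.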

## XGBoost
---

XGBoost (Extreme Gradient Boosting) is another popular machine learning model. We will now train an XGBoost model and see how it performs relative to the other models.

In [27]:
# Instantiate an XGBoost regressor object 
xg_reg = xgb.XGBRegressor()

# Fit the regressor to the training set
xg_reg.fit(X_train,y_train)

# Predict on the test set
preds = xg_reg.predict(X_test)

In [28]:
print("\n==================== Test Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, preds))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, preds))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, preds)))
print('R^2:', metrics.r2_score(y_test, preds))
print("\n=======================================================")


Mean Absolute Error: 63.64092616809739
Mean Squared Error: 7501.753606437261
Root Mean Squared Error: 86.61266423818898
R^2: 0.951167599954474



**Testing RandomizedSearchCV to improve parameter tuning:**

Initially, we tried to run GridSearchCV to find the optimal parameters for XGBoost. GridSearchCV evaluates every possible combination of the candidate values supplied for each parameter and returns the best-performing one. However, this proved too computationally intensive and we ran out of memory.

Given this, we decided to use RandomizedSearchCV instead. It is much less resource intensive, as it evaluates a random subset of the combinations rather than all of them.
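To make the difference in cost concrete, the number of model fits can be counted directly. The sketch below assumes the exhaustive search would have used the same candidate lists as the randomised search in this notebook, with the same 5-fold cross-validation; the original grid is not shown, so that is an assumption.

```python
from math import prod

# Same candidate lists as the search space used in this notebook.
param_dist = {
    "n_estimators": [100, 200, 300, 400],
    "max_depth": [2, 3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.2],
    "min_child_weight": [1, 2, 3],
    "subsample": [0.5, 0.7, 1],
    "colsample_bytree": [0.5, 0.7, 1],
}

cv_folds = 5   # cv=5 in the search
n_iter = 100   # n_iter=100 in the search

# An exhaustive grid tries every combination of candidate values.
grid_candidates = prod(len(values) for values in param_dist.values())
grid_fits = grid_candidates * cv_folds
random_fits = n_iter * cv_folds

print(grid_candidates, grid_fits, random_fits)  # 1296 6480 500
```

Under these assumptions the randomised search performs 500 fits where a full grid would need 6,480.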

In [29]:
param_dist = {
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [2, 3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'min_child_weight': [1, 2, 3],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1],
}

random_search = RandomizedSearchCV(estimator=xgb.XGBRegressor(), param_distributions=param_dist, n_iter=100, cv=5, n_jobs=-1, verbose=1)
random_search.fit(X_train, y_train)

# Print the best parameters found
print(random_search.best_params_)

Fitting 5 folds for each of 100 candidates, totalling 500 fits
{'subsample': 1, 'n_estimators': 300, 'min_child_weight': 3, 'max_depth': 7, 'learning_rate': 0.2, 'colsample_bytree': 1}


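**Re-running the model with the tuned hyperparameters from RandomizedSearchCV.**

In [30]:
# Apply the tuned hyperparameters.
# NOTE: these values differ from the best_params_ printed above (subsample=1, n_estimators=300,
# min_child_weight=3, learning_rate=0.2). The search was run without a fixed random_state, so its
# result can change between runs; the values below may come from a different run.
xg_reg = xgb.XGBRegressor(subsample=0.7, n_estimators=400, min_child_weight=2, max_depth=7, learning_rate=0.1, colsample_bytree=1, objective='reg:squarederror')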

# Fit the regressor to the training set
xg_reg.fit(X_train, y_train)

# Predict on the test set
preds = xg_reg.predict(X_test)
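Rather than copying tuned values by hand, the search result can be passed straight into the model. The sketch below swaps in scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor` and uses synthetic data and a much smaller search space, so that it runs with scikit-learn alone; those choices are assumptions for illustration. It also sets `random_state` on the search so that repeated runs sample the same combinations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 5 * X_demo[:, 0] + rng.normal(0, 1, size=200)

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),  # stand-in for XGBRegressor
    param_distributions={
        "n_estimators": [50, 100],
        "max_depth": [2, 3],
        "learning_rate": [0.05, 0.1, 0.2],
    },
    n_iter=4,
    cv=3,
    random_state=0,  # fixes which combinations are sampled
)
search.fit(X_demo, y_demo)

# Option 1: rebuild the model from the winning parameters.
tuned = GradientBoostingRegressor(random_state=0, **search.best_params_)
tuned.fit(X_demo, y_demo)

# Option 2: reuse the model the search already refit on all of the data.
best = search.best_estimator_

print(search.best_params_)
```

Either option keeps the fitted model and the reported best parameters in step with each other.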

In [31]:
print("\n==================== Test Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, preds))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, preds))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, preds)))
print('R^2:', metrics.r2_score(y_test, preds))
print("\n=======================================================")


Mean Absolute Error: 51.79604471200042
Mean Squared Error: 5087.275149849588
Root Mean Squared Error: 71.32513687228078
R^2: 0.9668845621581138



**Observations:**
- XGBoost with tuned hyperparameters is the best-performing model so far.
- It has the lowest MAE, MSE, and RMSE on the test set of all the models we have tried (MAE of approx. 51.8, against 73.1 for random forest and 63.6 for the default XGBoost).
- The R-squared is also very respectable, at almost 0.97.

For a first pass at modelling, these results seem viable for the MVP. Once subway data and other proxies for busyness are added, the models will be revisited and retrained.

In [32]:
with open('xgb_reg_single_model.pkl', 'wb') as file:
    pickle.dump(xg_reg, file)
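For completeness, a saved model can be read back with `pickle.load` and used for prediction. The sketch below uses a small `LinearRegression` as a stand-in for the XGBoost model and writes to a temporary directory, both assumptions made so that the example runs on its own.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model on made-up data (the notebook pickles the XGBoost regressor).
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X_demo, y_demo)

# Save the fitted model to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(path, "wb") as file:
    pickle.dump(model, file)

# Load it back; only unpickle files from a trusted source.
with open(path, "rb") as file:
    restored = pickle.load(file)

# The restored model should reproduce the original predictions.
print(np.allclose(model.predict(X_demo), restored.predict(X_demo)))
```

When loading the real model, the input columns need to match those used in training, and the library versions should ideally match the ones used when the file was written.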