# Initial Data Modelling
---
This notebook will be used for the initial data modelling. We will import and use the cleaned 2022 taxi data in order to create an initial model.

In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

import xgboost as xgb
from xgboost import XGBRegressor

from tabulate import tabulate

from itertools import product

In [2]:
df = pd.read_csv("2022_taxi_data_cleaned.csv")

In [3]:
df.shape

(31323476, 21)

In [4]:
df.head()

Unnamed: 0,VendorID,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,...,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,date,month,time,day_of_the_week
0,1,2.0,3.8,1.0,N,142,236,1,14.5,3.0,...,3.65,0.0,0.3,21.95,2.5,0.0,2022-01-01,1,0,Saturday
1,1,1.0,2.1,1.0,N,236,42,1,8.0,0.5,...,4.0,0.0,0.3,13.3,0.0,0.0,2022-01-01,1,0,Saturday
2,2,1.0,0.97,1.0,N,166,166,1,7.5,0.5,...,1.76,0.0,0.3,10.56,0.0,0.0,2022-01-01,1,0,Saturday
3,2,1.0,1.09,1.0,N,114,68,2,8.0,0.5,...,0.0,0.0,0.3,11.8,2.5,0.0,2022-01-01,1,0,Saturday
4,2,1.0,4.3,1.0,N,68,163,1,23.5,0.5,...,3.0,0.0,0.3,30.3,2.5,0.0,2022-01-01,1,0,Saturday


# Training a Single Model
---
As a starting point, we are training a single model with drop-off zone as a feature. Having zone as an input will allow the model to learn general patterns across all zones, while still taking into account the effect a particular zone has, as some are generally busier than others.

We will also explore the idea of dropping the zone as a feature and training an individual model for each zone. It is hard to tell which is the better approach, so we will test both and analyse the results to compare which performs better.

---

As described in the preliminary data exploration and processing, many of the continuous features appear to have little relation to the target outcome of predicting busyness for a particular zone. As such, we are dropping these columns before training our first model.

In [5]:
columns_to_drop = ['VendorID', 'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
                  'PULocationID', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount',
                  'tolls_amount', 'improvement_surcharge', 'total_amount', 'congestion_surcharge',
                  'airport_fee', 'date']

In [6]:
df = df.drop(columns=columns_to_drop)

In order to quantify busyness for a particular zone, we need some sort of measure that acts as a proxy for busyness. For the initial model, we have decided to use the number of dropoffs in a particular zone, at a given hour, day of the week, and month, as this proxy. As such, we add a new column to the data frame to represent this. Note that the column is named `num_pickups` in the code, although it counts dropoffs, since the rows are grouped by `DOLocationID`. This will be our target prediction when we perform regression analysis.

In [7]:
# Create an aggregated DataFrame
df_agg = df.groupby(['DOLocationID', 'month', 'time', 'day_of_the_week']).size().reset_index(name='num_pickups')

# Join the aggregated DataFrame back to the original DataFrame
df = pd.merge(df, df_agg, how='left', on=['DOLocationID', 'month', 'time', 'day_of_the_week'])
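
To make the target concrete, here is a minimal sketch that applies the same `groupby`/`size`/`merge` steps to a toy frame. The values are made up for illustration and are not taken from the taxi data. Every row ends up carrying the count of its own zone/month/hour/day group.

```python
import pandas as pd

# Toy data: illustrative values only, not taken from the taxi dataset
toy = pd.DataFrame({
    'DOLocationID':    [236, 236, 236, 42],
    'month':           [1, 1, 1, 1],
    'time':            [0, 0, 5, 0],
    'day_of_the_week': ['Saturday', 'Saturday', 'Saturday', 'Saturday'],
})
keys = ['DOLocationID', 'month', 'time', 'day_of_the_week']

# Count rows per group, then attach each group's count to its rows
toy_agg = toy.groupby(keys).size().reset_index(name='num_pickups')
toy = pd.merge(toy, toy_agg, how='left', on=keys)

print(toy[keys + ['num_pickups']])
```

Rows that share the same four key values receive the same target, so the merged frame contains many rows with identical features and an identical `num_pickups` value.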

In [8]:
df

Unnamed: 0,DOLocationID,month,time,day_of_the_week,num_pickups
0,236,1,0,Saturday,358
1,42,1,0,Saturday,113
2,166,1,0,Saturday,100
3,68,1,0,Saturday,451
4,163,1,0,Saturday,192
...,...,...,...,...,...
31323471,162,12,23,Saturday,382
31323472,142,12,23,Saturday,500
31323473,141,12,23,Saturday,752
31323474,142,12,23,Saturday,500


**Dealing with the Large Data Size**
With over 30 million rows, and given that we are training locally on a laptop, we need some way to deal with the extremely large dataset. To begin, we are going to use stratified random sampling to get a sample of 1,000 rows for each month of the year (12,000 rows in total). Having a random sample of equal size from each month will hopefully be more representative of the original data than other techniques, such as simple random sampling.

In [9]:
# Initialize an empty DataFrame to store the samples
sampled_df = pd.DataFrame()

# Loop over each month
for month in df['month'].unique():
    # Get a sample of 1000 rows for the month
    month_df = df[df['month'] == month].sample(n=1000, random_state=1)
    
    # Append the month sample to the overall sample DataFrame
    sampled_df = pd.concat([sampled_df, month_df])
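
As an aside, the per-month loop above can probably be written in a single call with `groupby(...).sample(...)`, which we believe is available from pandas 1.1 onwards. The sketch below uses a small toy frame with made-up values. It has not been run against the full dataset, and the exact rows drawn would differ from the loop version.

```python
import pandas as pd

# Toy frame with an uneven number of rows per month (illustrative values)
toy = pd.DataFrame({
    'month': [1] * 5 + [2] * 3 + [3] * 4,
    'num_pickups': range(12),
})

# Draw the same number of rows from every month in one call
toy_sample = toy.groupby('month').sample(n=2, random_state=1)

print(toy_sample['month'].value_counts().sort_index().to_dict())
```

Each month contributes exactly `n` rows, regardless of how many rows it had originally, which is the property the stratified sample relies on.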

In [10]:
sampled_df

Unnamed: 0,DOLocationID,month,time,day_of_the_week,num_pickups
1023800,237,1,17,Monday,1177
2018926,48,1,21,Monday,403
163723,170,1,11,Tuesday,586
263743,232,1,22,Wednesday,88
1533312,164,1,17,Monday,444
...,...,...,...,...,...
30903021,68,12,10,Sunday,441
30336702,170,12,23,Saturday,638
31010885,246,12,14,Tuesday,397
30422979,186,12,8,Monday,299


We also label encode the day of the week, as it is categorical and not suitable in its original form for training models on. We use an explicit mapping in order to retain the original order and easily identify which number matches each day.

In [11]:
mapping = {'Monday': 0, 'Tuesday': 1, 'Wednesday': 2, 'Thursday': 3, 'Friday': 4, 'Saturday': 5, 'Sunday': 6}

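# Create a LabelEncoder fitted on the day names. Note that it is not used for the
# transformation below: LabelEncoder stores its classes in sorted (alphabetical)
# order, so it would not preserve the Monday-to-Sunday order defined in `mapping`.
encoder = LabelEncoder()
encoder.fit([key for key, value in sorted(mapping.items(), key=lambda x: x[1])])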

# Apply the custom mapping to label encode the column
sampled_df['day_of_the_week'] = sampled_df['day_of_the_week'].map(mapping)
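
The reason for using an explicit mapping can be illustrated with plain Python. Encoders that sort their classes, as we understand scikit-learn's `LabelEncoder` does, would assign codes alphabetically, which scrambles the weekday order. The sketch below compares the two encodings; the `alphabetical` dictionary is only a stand-in for what a sorted-class encoder would produce.

```python
mapping = {'Monday': 0, 'Tuesday': 1, 'Wednesday': 2, 'Thursday': 3,
           'Friday': 4, 'Saturday': 5, 'Sunday': 6}

# Alphabetical codes, i.e. the order a sorted-class encoder would assign
alphabetical = {day: code for code, day in enumerate(sorted(mapping))}

for day in mapping:
    print(f"{day:<10} explicit={mapping[day]} alphabetical={alphabetical[day]}")
```

With alphabetical codes, Friday would become 0 and Monday 1, so adjacent numbers would no longer correspond to adjacent days.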

In [12]:
sampled_df

Unnamed: 0,DOLocationID,month,time,day_of_the_week,num_pickups
1023800,237,1,17,0,1177
2018926,48,1,21,0,403
163723,170,1,11,1,586
263743,232,1,22,2,88
1533312,164,1,17,0,444
...,...,...,...,...,...
30903021,68,12,10,6,441
30336702,170,12,23,5,638
31010885,246,12,14,1,397
30422979,186,12,8,0,299


Our sample is now ready, and we can split the dataset into a 70/30 training/test split.

In [13]:
X = sampled_df.drop('num_pickups', axis=1)
y = sampled_df['num_pickups']

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

## Linear Regression:
---
While we don't believe linear regression is particularly well suited to this analysis, we will still run it as a baseline against which to benchmark the other models. Linear regression is often a poor fit for temporal analysis, as it may not capture time-related trends across a long period. Still, given we are only looking at one year currently, there is no harm in testing it as a starting point.

In [15]:
linear_reg = LinearRegression().fit(X_train, y_train)

In [16]:
linreg_coefficients = list(zip(X_train.columns, linear_reg.coef_))

# Sorting the coefficients in ascending order
sorted_linreg_coef_data = sorted(linreg_coefficients, key=lambda x: x[1])

headers = ["Feature", "Coefficient"]
print(tabulate(sorted_linreg_coef_data, headers=headers, floatfmt=".6f"), "\n")

print(f"Model Intercept: {linear_reg.intercept_}\n")

Feature            Coefficient
---------------  -------------
day_of_the_week     -20.283885
month                -0.048304
DOLocationID          1.627621
time                  8.456624 

Model Intercept: 226.32995553141876



In [17]:
y_train_pred = linear_reg.predict(X_train)
print('Predicted values:', y_train_pred[:10])

Predicted values: [580.48718194 688.17676959 496.25907265 662.73245791 422.3835098
 763.74513158 568.90143917 647.57060424 599.74738427 713.54582379]


In [18]:
print("==================== Train Data =======================")
print('Mean Absolute Error:', mean_absolute_error(y_train, y_train_pred))
print('Mean Squared Error:', mean_squared_error(y_train, y_train_pred))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_train, y_train_pred)))
print('R2 Score:', r2_score(y_train, y_train_pred))
print("=======================================================")

Mean Absolute Error: 273.7669096965032
Mean Squared Error: 129566.0670530082
Root Mean Squared Error: 359.9528678216194
R2 Score: 0.11266568377102604


**Training Data Evaluation:**
- As expected, linear regression does not seem to perform well on our training data.
- All performance metrics are quite poor across the board.
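
For reference, the four metrics reported throughout this notebook can be computed by hand. The sketch below is a plain-Python illustration on made-up values. The helper name `regression_metrics` is our own and is not part of scikit-learn.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e * e for e in errors) / n           # mean squared error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot                    # 1 - SS_res / SS_tot
    return {'mae': mae, 'mse': mse, 'rmse': math.sqrt(mse), 'r2': r2}

# Illustrative values only
result = regression_metrics([100, 200, 300, 400], [110, 190, 320, 380])
print(result)
```

Read this way, the training R-squared of roughly 0.11 above means the linear model explains only about 11% of the variance in the target.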

In [19]:
y_test_pred = linear_reg.predict(X_test)
print('Predicted values:', y_test_pred[:10])
print("")

Predicted values: [564.12956884 537.98765444 587.01204829 558.51746374 635.29870011
 604.31613086 403.72980601 634.6517503  347.7591678  523.13484728]



In [20]:
print("==================== Test Data =======================")
print('Mean Absolute Error:', mean_absolute_error(y_test, y_test_pred))
print('Mean Squared Error:', mean_squared_error(y_test, y_test_pred))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_test_pred)))
print("================== Regression Report ==================")
print('R2 Score:', r2_score(y_test, y_test_pred))
print("=======================================================")

Mean Absolute Error: 277.55830886612
Mean Squared Error: 134124.77667336943
Root Mean Squared Error: 366.23049664571823
R2 Score: 0.12691950520601392


**Test Data Evaluation:**
- We see similar findings when testing on the test split.
- Overall, linear regression does not seem suited to our particular problem.

## Random Forest
---

We will now test random forest for our regression analysis.

In [21]:
random_forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1)
random_forest.fit(X_train, y_train)

In [22]:
# Creating a dataframe to store & display feature importance
feature_importance = pd.DataFrame({'feature': X_train.columns, 'importance':random_forest.feature_importances_})
feature_importance.sort_values('importance', ascending=False)

Unnamed: 0,feature,importance
0,DOLocationID,0.600261
2,time,0.250038
3,day_of_the_week,0.083641
1,month,0.06606


**Observations:** The ranking of feature importance looks quite promising at first glance. Dropoff location is the highest by far, which should mean we can see the difference in busyness per zone relatively clearly when running the model. Time ranks second, which backs up our thoughts from the initial data visualisation, where the graphs showed a clear difference in the number of taxis as the time of day changes. Day of the week and month rank third and fourth respectively, with much lower importance. This is consistent with the lesser but still noticeable differences seen when the pickups per month were plotted. So far, this backs up our initial suspicions.

In [23]:
# Testing predicted vs actual values 
rf_training_predictions = random_forest.predict(X_train)
df_true_vs_rf_predicted = pd.DataFrame({'Actual Value': y_train, 'Predicted Value': rf_training_predictions})
df_true_vs_rf_predicted.head(10)

Unnamed: 0,Actual Value,Predicted Value
27686079,728,735.7
3960398,1700,1645.39
9991269,722,727.18
1972780,171,186.33
10898020,823,857.77
22972743,850,861.99
15349398,702,637.52
16496062,428,426.52
8401815,735,723.82
28038907,730,711.61


In [24]:
print("\n==================== Train Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_train, rf_training_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_train, rf_training_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_train, rf_training_predictions)))
print('R^2:', metrics.r2_score(y_train, rf_training_predictions))
print("\n=======================================================")


Mean Absolute Error: 25.730753571428572
Mean Squared Error: 1340.8119698690475
Root Mean Squared Error: 36.61709941911084
R^2: 0.9908174377787617



**Training Data Evaluation:**
- The model is quite accurate on the training set.
- MAE of approx. 25.7 seems reasonable for a first model, given that we are predicting the number of dropoffs for a given hour. Being +/- 26 on average is a good starting point.
- R-squared is extremely high, at almost 0.99. This could be a sign of overfitting.
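
One inexpensive way to probe for overfitting is the out-of-bag (OOB) score. The forest above was created with `oob_score=True`, so scikit-learn should expose an `oob_score_` attribute after fitting, although the notebook never reads it. The sketch below shows the idea on synthetic data; the data, and the names `X_demo`, `y_demo` and `forest`, are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only: a noisy nonlinear target
rng = np.random.default_rng(1)
X_demo = rng.integers(0, 24, size=(500, 2))
y_demo = 50 * np.sin(X_demo[:, 0] / 4) + 3 * X_demo[:, 1] + rng.normal(0, 10, 500)

forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1)
forest.fit(X_demo, y_demo)

train_r2 = forest.score(X_demo, y_demo)   # R^2 on the data the trees were fitted on
oob_r2 = forest.oob_score_                # R^2 from out-of-bag predictions
print(f"train R^2: {train_r2:.3f}, out-of-bag R^2: {oob_r2:.3f}")
```

A large gap between the two values would suggest the training score is optimistic, without needing to touch the test split.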


We will now run our model on the test set to see how it performs on unseen data.

In [25]:
# Predicted values for the held-out test split,
# using the model trained on the training split (out-of-sample data)
rf_test_predictions = random_forest.predict(X_test)
df_true_vs_rf_predicted_test = pd.DataFrame({'Actual Value': y_test, 'Predicted Value': rf_test_predictions})
df_true_vs_rf_predicted_test.head(10)

Unnamed: 0,Actual Value,Predicted Value
10437746,957,915.72
31057146,687,816.39
30429506,543,641.01
18679329,431,517.86
29621108,1479,1279.59
11270405,408,527.86
22188841,375,296.49
19280168,163,227.71
30710829,247,230.31
23903410,311,263.94


In [26]:
print("\n==================== Test Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, rf_test_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, rf_test_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, rf_test_predictions)))
print('R^2:', metrics.r2_score(y_test, rf_test_predictions))
print("=======================================================")


Mean Absolute Error: 68.35833055555555
Mean Squared Error: 9081.855909694443
Root Mean Squared Error: 95.29877181629595
R^2: 0.9408819798403582


**Test Data Evaluation:**
- The model did not perform as well on the test data.
- MAE is significantly higher, at approx. 68, compared with approx. 26 on the training set.
- R-squared is also lower at 0.94; however, this isn't necessarily a bad thing. It is still a very high score, and means most of the variance in the target variable is explained by the independent variables.

## XGBoost
---

XGBoost (Extreme Gradient Boosting) is another popular machine learning model. We will now train a new model using XGBoost and see how it performs relative to the other models.

In [30]:
xg_reg = XGBRegressor(objective ='reg:squarederror', eval_metric ='rmse')
xg_reg.fit(X_train, y_train)

preds = xg_reg.predict(X_test)

In [31]:
print("\n==================== Test Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, preds))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, preds))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, preds)))
print('R^2:', metrics.r2_score(y_test, preds))
print("\n=======================================================")


Mean Absolute Error: 145.19167664912013
Mean Squared Error: 39527.488755559796
Root Mean Squared Error: 198.81521258585772
R^2: 0.7426972085499846



**Testing RandomizedSearchCV to improve parameter tuning:**
<br><br>
Initially, we tried to run GridSearchCV in order to find the optimal parameters for XGBoost. GridSearchCV searches through every combination in a given parameter grid and finds the best-performing one. However, this proved too computationally intensive and we ran out of memory.

Given this, we decided to use RandomizedSearchCV instead. It is much less resource intensive, as it explores a random subset of combinations rather than all possible ones.

In [32]:
# Define the parameter distribution for the randomized search
param_dist = {
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [2, 3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'min_child_weight': [1, 2, 3],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1],
}

xg_reg = xgb.XGBRegressor(objective='reg:squarederror')

# Perform randomized search for best parameters
random_search = RandomizedSearchCV(estimator=xg_reg, param_distributions=param_dist, n_iter=100, cv=5, n_jobs=-1, verbose=1)
random_search.fit(X_train, y_train)

print(random_search.best_params_)

Fitting 5 folds for each of 100 candidates, totalling 500 fits
{'subsample': 0.7, 'n_estimators': 300, 'min_child_weight': 3, 'max_depth': 7, 'learning_rate': 0.1, 'colsample_bytree': 1}
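
For a sense of scale, the sketch below counts how many model fits an exhaustive grid search would need for the parameter lists above, compared with the randomized search. It assumes the grid search would have used the same lists and 5 folds, which the notebook does not state.

```python
import math

param_dist = {
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [2, 3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'min_child_weight': [1, 2, 3],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1],
}
cv_folds = 5    # matches cv=5 above
n_iter = 100    # matches n_iter=100 above

# Number of distinct combinations in the full grid
grid_candidates = math.prod(len(values) for values in param_dist.values())
grid_fits = grid_candidates * cv_folds
random_fits = n_iter * cv_folds

print(grid_candidates, grid_fits, random_fits)
```

Under these assumptions the full grid has 1,296 candidates (6,480 fits), against the 500 fits reported in the output above.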


**Re-running the model while applying the optimal hyperparameters found from RandomizedSearchCV.**

In [37]:
# Applying the optimal hyperparameters
xg_reg = xgb.XGBRegressor(subsample=0.7, n_estimators=300, min_child_weight=3, max_depth=7, learning_rate=0.1, colsample_bytree=1, objective='reg:squarederror')

# Fit the regressor to the training set
xg_reg.fit(X_train, y_train)

# Predict on the test set
preds = xg_reg.predict(X_test)

In [38]:
print("\n==================== Test Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, preds))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, preds))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, preds)))
print('R^2:', metrics.r2_score(y_test, preds))
print("\n=======================================================")


Mean Absolute Error: 50.38769306244949
Mean Squared Error: 4747.133398350876
Root Mean Squared Error: 68.8994441077058
R^2: 0.9690987028714423



**Observations:**
- XGBoost with optimised hyperparameters appears to be the best-performing model so far.
- It has the lowest test-set MAE, MSE, and RMSE values of the models we've seen so far.
- The R-squared is also very respectable, at almost 0.97.

For an initial start to modelling, these seem like they will be viable for the MVP. Once subway data and other proxies for busyness are implemented, these will be revisited and trained again.

In [None]:
with open('xgb_reg_single_model.pkl', 'wb') as file:
    pickle.dump(xg_reg, file)

## Formatting the Model for Front/Back End
---

To get things integrated, we need the predictions to be usable by the front and back end. As a first test, we are going to try to predict the number of dropoffs for all possible DOLocationID/month/time/day_of_the_week combinations, and then normalise these predictions to form a busyness index between 0 and 1.
<br><br>
First, we get all unique values in each column so we can get the unique combinations.

In [39]:
unique_DO_location_ids = df['DOLocationID'].unique().tolist()
unique_months = sampled_df['month'].unique().tolist()
unique_times = sampled_df['time'].unique().tolist()
unique_days_of_week = sampled_df['day_of_the_week'].unique().tolist()

In [40]:
print(len(unique_DO_location_ids))
print(len(unique_months))
print(len(unique_times))
print(len(unique_days_of_week))

67
12
24
7


Now, we can use the `product` function from `itertools` to get the Cartesian product of all the unique values, and store the result in a list.

In [41]:
combinations = list(product(unique_DO_location_ids, unique_months, unique_times, unique_days_of_week))

In [42]:
print(len(combinations))

135072


In [43]:
print(67*12*24*7)

135072


The two counts match, so the Cartesian product seems to have worked as intended.
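
As a small illustration of what `product` returns, the sketch below builds the combinations for a tiny, made-up subset of values. The rightmost input varies fastest.

```python
from itertools import product

zones = [236, 42]   # illustrative subset of zone IDs
hours = [0, 1, 2]
days = [0, 1]

combos = list(product(zones, hours, days))

print(len(combos))   # 2 * 3 * 2 combinations
print(combos[:3])
```

Each element is a tuple with one value per input list, in the same order as the arguments.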

Next, we loop through each unique combination, pass that set of inputs to our model to get a prediction, and store the inputs together with the prediction as a row of a new predictions data frame.

In [44]:
data_for_df = []

for combination in combinations:
    DOLocationID, month, time, day_of_week = combination
    input_data = [DOLocationID, month, time, day_of_week]
    input_df = pd.DataFrame([input_data], columns=['DOLocationID', 'month', 'time', 'day_of_the_week'])
    prediction = xg_reg.predict(input_df)
    # Adding the prediction in the new column
    data_for_df.append(list(combination) + [prediction[0]])

# Create dataframe
df_predictions = pd.DataFrame(data_for_df, columns=['DOLocationID', 'month', 'time', 'day_of_the_week', 'prediction'])
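
Calling `predict` once per row works, but scikit-learn-style estimators generally accept many rows in one call, so a batched version may be considerably faster. The sketch below shows the pattern with a hypothetical `StandInModel` in place of the trained regressor. It has not been run against the real model.

```python
from itertools import product
import pandas as pd

class StandInModel:
    """Hypothetical placeholder for the trained regressor; only `predict` is mimicked."""
    def predict(self, frame):
        # Arbitrary deterministic output, one value per input row
        return (frame['DOLocationID'] + frame['time']).to_numpy(dtype=float)

feature_columns = ['DOLocationID', 'month', 'time', 'day_of_the_week']
combos = list(product([236, 42], [1], [0, 17], [0, 6]))

# One DataFrame holding every combination, then a single predict call
batch = pd.DataFrame(combos, columns=feature_columns)
batch['prediction'] = StandInModel().predict(batch[feature_columns])

print(batch)
```

The resulting frame has the same columns as `df_predictions`, with one row per combination in the order produced by `product`.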

Lastly, we create the final column, `busyness_score`, which holds the predictions normalised to a value between 0 and 1. We then save the output as a JSON file.

In [45]:
# Normalize predictions to [0,1] range
scaler = MinMaxScaler()
df_predictions['busyness_score'] = scaler.fit_transform(df_predictions[['prediction']])
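
For clarity, min-max scaling maps each value `x` to `(x - min) / (max - min)`. The sketch below reproduces that formula in plain Python on made-up values. The helper `min_max_scale` is our own, and its handling of constant input is an assumption.

```python
def min_max_scale(values):
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # assumption: constant input maps to 0
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative predictions only
scores = min_max_scale([50.0, 100.0, 250.0])
print(scores)
```

Because the score is relative to the smallest and largest predictions in the frame, the same predicted count can map to a different score if the model is retrained or the set of combinations changes.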

In [46]:
# Renaming columns for slightly more clarity
df_predictions = df_predictions.rename(columns={'DOLocationID': 'ZoneID', 'prediction': 'predicted_dropoffs'})

In [47]:
df_predictions

Unnamed: 0,ZoneID,month,time,day_of_the_week,predicted_dropoffs,busyness_score
0,236,1,17,0,1452.794556,0.691621
1,236,1,17,1,1415.192261,0.677383
2,236,1,17,2,1435.983276,0.685256
3,236,1,17,6,891.913757,0.479257
4,236,1,17,3,1409.822266,0.675350
...,...,...,...,...,...,...
135067,105,12,5,2,70.610847,0.168290
135068,105,12,5,6,85.639999,0.173980
135069,105,12,5,3,90.493713,0.175818
135070,105,12,5,5,144.010361,0.196081


In [48]:
df_predictions.to_json('predictions.json', orient='records')
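
With `orient='records'`, the file should contain a JSON array with one object per row. The sketch below shows how a consumer might read that shape and look up a score. It uses two rows copied from the table above, and the `lookup` dictionary is a hypothetical structure, not part of the project.

```python
import json

# Two records in the shape produced by orient='records' (values from the table above)
records_json = '''[
  {"ZoneID": 236, "month": 1, "time": 17, "day_of_the_week": 0,
   "predicted_dropoffs": 1452.794556, "busyness_score": 0.691621},
  {"ZoneID": 236, "month": 1, "time": 17, "day_of_the_week": 6,
   "predicted_dropoffs": 891.913757, "busyness_score": 0.479257}
]'''

records = json.loads(records_json)

# Index by the four input fields for constant-time lookups
lookup = {
    (r['ZoneID'], r['month'], r['time'], r['day_of_the_week']): r['busyness_score']
    for r in records
}

print(lookup[(236, 1, 17, 6)])
```

Keying on the four inputs means the front or back end can fetch a score for a given zone, month, hour and day without scanning the whole list.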