# Feature Engineering - Taxi Data
---
Our first model was overfitting. In this notebook we try to alleviate this with feature engineering and with a different train/test split that is better suited to temporal analysis.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle

from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor

from sklearn import metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import xgboost as xgb

In [2]:
df = pd.read_csv("2022_taxi_data_cleaned.csv")

In [3]:
df.shape

(31323476, 21)

In [4]:
df

Unnamed: 0,VendorID,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,...,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,date,month,time,day_of_the_week
0,1,2.0,3.80,1.0,N,142,236,1,14.5,3.0,...,3.65,0.0,0.3,21.95,2.5,0.0,2022-01-01,1,0,Saturday
1,1,1.0,2.10,1.0,N,236,42,1,8.0,0.5,...,4.00,0.0,0.3,13.30,0.0,0.0,2022-01-01,1,0,Saturday
2,2,1.0,0.97,1.0,N,166,166,1,7.5,0.5,...,1.76,0.0,0.3,10.56,0.0,0.0,2022-01-01,1,0,Saturday
3,2,1.0,1.09,1.0,N,114,68,2,8.0,0.5,...,0.00,0.0,0.3,11.80,2.5,0.0,2022-01-01,1,0,Saturday
4,2,1.0,4.30,1.0,N,68,163,1,23.5,0.5,...,3.00,0.0,0.3,30.30,2.5,0.0,2022-01-01,1,0,Saturday
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31323471,2,1.0,2.62,1.0,N,144,162,2,14.9,1.0,...,0.00,0.0,1.0,19.90,2.5,0.0,2022-12-31,12,23,Saturday
31323472,2,1.0,1.12,1.0,N,161,142,1,8.6,1.0,...,0.00,0.0,1.0,13.60,2.5,0.0,2022-12-31,12,23,Saturday
31323473,2,1.0,1.81,1.0,N,161,141,1,12.8,1.0,...,4.45,0.0,1.0,22.25,2.5,0.0,2022-12-31,12,23,Saturday
31323474,2,1.0,2.35,1.0,N,229,142,2,14.9,1.0,...,0.00,0.0,1.0,19.90,2.5,0.0,2022-12-31,12,23,Saturday


As before, we drop these columns as they have little relevance for our starting model.

In [5]:
columns_to_drop = ['VendorID', 'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
                  'PULocationID', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount',
                  'tolls_amount', 'improvement_surcharge', 'total_amount', 'congestion_surcharge',
                  'airport_fee']

In [6]:
df = df.drop(columns=columns_to_drop)

### Changing the Train/Test split
One of the main reasons our original model was overfitting was the way we performed the train/test split. Originally, we took a sample from each month in order to maintain the original integrity of the data. Then, we did a 70/30 train/test split on this subsample.
<br><br>
However, upon investigation this does not seem like a good train/test split. As we initially had a Zone ID, time and day of the week as inputs, it was likely that the train and test data were almost identical, which is what was contributing to the overfitting. In essence, the model was never being tested on unseen data because of this.
<br><br>
In an effort to combat this, we are changing our methodology when splitting the data. We now extract a new column called Day, which is the day of the month, and use it when aggregating the data. The test set will now be the last 5 days of each month, while the remaining days form the training set.

In [7]:
df['date'] = pd.to_datetime(df['date'])
df['Day'] = df['date'].dt.day
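The leakage described above can be illustrated on a toy example. This is a minimal sketch with hypothetical data (ten trips in two zones), assuming that every trip row carries the drop-off count of its group, so all trips in a group become identical rows. A random split then cannot avoid placing copies of training rows in the test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one row per trip, six trips to zone 236 and four to zone 42.
trips = pd.DataFrame({
    "DOLocationID":    [236] * 6 + [42] * 4,
    "time":            [0] * 10,
    "day_of_the_week": ["Saturday"] * 10,
})

# Attach the group size to every trip; rows within a group are now identical.
trips["Dropoffs"] = trips.groupby(["DOLocationID", "day_of_the_week"])["time"].transform("size")

# Random 70/30 split: 7 training rows, 3 test rows.
train, test = train_test_split(trips, test_size=3, random_state=1)

# Count test rows that also appear, value for value, in the training set.
train_rows = set(map(tuple, train.to_numpy()))
leaked = sum(tuple(row) in train_rows for row in test.to_numpy())
print(f"{leaked} of {len(test)} test rows duplicate a training row")
```

Because the training set holds seven of the ten rows, it must contain trips from both zones, so every test row has an exact copy in the training set whatever the random seed.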

We also change the aggregation of dropoffs slightly, grouping by this new Day column instead of day_of_the_week.

In [9]:
df_agg = df.groupby(['DOLocationID', 'month', 'time', 'Day']).size().reset_index(name='Dropoffs')

# Join the aggregated DataFrame back to the original DataFrame
df = pd.merge(df, df_agg, how='left', on=['DOLocationID', 'month', 'time', 'Day'])
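As a possible alternative to the aggregate-then-merge step, the group size can be broadcast directly onto each row with `transform("size")`. The sketch below uses a hypothetical toy frame with the notebook's column names and checks that both approaches give the same counts. It is an illustration, not a change to the cell above.

```python
import pandas as pd

# Hypothetical toy trips; column names follow the notebook.
toy = pd.DataFrame({
    "DOLocationID": [236, 236, 42, 236],
    "date":         pd.to_datetime(["2022-01-01"] * 4),
    "month":        [1, 1, 1, 1],
    "time":         [0, 0, 0, 1],
    "Day":          [1, 1, 1, 1],
})
keys = ["DOLocationID", "month", "time", "Day"]

# Approach used in the notebook: aggregate, then merge the counts back.
agg = toy.groupby(keys).size().reset_index(name="Dropoffs")
merged = pd.merge(toy, agg, how="left", on=keys)

# Alternative: compute each group's size and align it with the original rows.
toy["Dropoffs"] = toy.groupby(keys)["date"].transform("size")

print(merged["Dropoffs"].tolist())
print(toy["Dropoffs"].tolist())
```

On a frame with tens of millions of rows, avoiding the merge may save memory, though that has not been measured here.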

In [10]:
df

Unnamed: 0,DOLocationID,date,month,time,day_of_the_week,Day,Dropoffs
0,236,2022-01-01,1,0,Saturday,1,116
1,42,2022-01-01,1,0,Saturday,1,21
2,166,2022-01-01,1,0,Saturday,1,32
3,68,2022-01-01,1,0,Saturday,1,102
4,163,2022-01-01,1,0,Saturday,1,35
...,...,...,...,...,...,...,...
31323471,162,2022-12-31,12,23,Saturday,31,47
31323472,142,2022-12-31,12,23,Saturday,31,85
31323473,141,2022-12-31,12,23,Saturday,31,86
31323474,142,2022-12-31,12,23,Saturday,31,85


## Feature Engineering
---
For some basic feature engineering to help the model generalise a bit more, we are going to extract two new features:
- **Weekend:** Binary value representing weekdays (0) vs weekends (1)
- **TimeOfDay:** Time blocks corresponding to Morning, Afternoon, Evening and Night
    - Morning: 6am - 12pm
    - Afternoon: 12pm - 5pm
    - Evening: 5pm - 10pm
    - Night: 10pm - 6am
    
While we may add more, these seem like a simple but logical starting point.

**Weekend**

In [11]:
df['Weekend'] = np.where(df['day_of_the_week'].isin(['Saturday', 'Sunday']), 1, 0)

**Time of Day**

In [12]:
# Creating bins for the time blocks
bins = [0, 6, 12, 17, 22, 24]
labels = ['Night', 'Morning', 'Afternoon', 'Evening', 'Night']

# Assign the labels based on the time ranges
df['TimeOfDay'] = pd.cut(df['time'], bins=bins, labels=labels, right=False, include_lowest=True, ordered=False)
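To check that the bins match the time blocks listed above, the same `pd.cut` call can be run on the boundary hours. This is a small sketch on a hypothetical series of hours. It assumes a pandas version that accepts duplicate labels when `ordered=False` (1.1 or later, as far as we are aware).

```python
import pandas as pd

# First and last hour of each time block.
hours = pd.Series([0, 5, 6, 11, 12, 16, 17, 21, 22, 23])

bins = [0, 6, 12, 17, 22, 24]
labels = ['Night', 'Morning', 'Afternoon', 'Evening', 'Night']

# right=False makes each bin closed on the left: [0, 6), [6, 12), [12, 17), ...
blocks = pd.cut(hours, bins=bins, labels=labels, right=False, ordered=False)

for hour, block in zip(hours, blocks):
    print(f"{hour:>2}:00 -> {block}")
```

Night appears twice in `labels` because it wraps around midnight: 10pm to midnight falls in the last bin and midnight to 6am in the first.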

In [13]:
df

Unnamed: 0,DOLocationID,date,month,time,day_of_the_week,Day,Dropoffs,Weekend,TimeOfDay
0,236,2022-01-01,1,0,Saturday,1,116,1,Night
1,42,2022-01-01,1,0,Saturday,1,21,1,Night
2,166,2022-01-01,1,0,Saturday,1,32,1,Night
3,68,2022-01-01,1,0,Saturday,1,102,1,Night
4,163,2022-01-01,1,0,Saturday,1,35,1,Night
...,...,...,...,...,...,...,...,...,...
31323471,162,2022-12-31,12,23,Saturday,31,47,1,Night
31323472,142,2022-12-31,12,23,Saturday,31,85,1,Night
31323473,141,2022-12-31,12,23,Saturday,31,86,1,Night
31323474,142,2022-12-31,12,23,Saturday,31,85,1,Night


## Encoding
---
In our first tests we used label encoding to map the categorical values to usable model inputs. However, further reading, including the official scikit-learn documentation, showed that label encoding is intended for target variables rather than input variables, because it can impose a false ordering on the data. Our categorical inputs are nominal, so we use one-hot encoding instead. This will increase training time, but should lead to better model performance overall.

In [14]:
# Get dummies and concatenate with original dataframe
dummies = pd.get_dummies(df['TimeOfDay'], prefix='TimeOfDay')
df = pd.concat([df, dummies], axis=1)

dummies = pd.get_dummies(df['day_of_the_week'], prefix='day_of_the_week')
df = pd.concat([df, dummies], axis=1)

# Drop the original columns 
df = df.drop('TimeOfDay', axis=1)
df = df.drop('day_of_the_week', axis=1)
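The same one-hot encoding can be sketched in a single call by passing the `columns` argument to `pd.get_dummies`, which encodes the listed columns and drops the originals. The toy frame below is hypothetical. `dtype=int` is passed so the output is 0/1, since newer pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy frame with the two categorical columns from the notebook.
toy = pd.DataFrame({
    "DOLocationID":    [236, 42, 166],
    "TimeOfDay":       ["Night", "Morning", "Evening"],
    "day_of_the_week": ["Saturday", "Monday", "Saturday"],
})

# Encode both columns at once; the column name is used as the prefix.
encoded = pd.get_dummies(toy, columns=["TimeOfDay", "day_of_the_week"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Note that a category absent from the data produces no column, which is one reason to encode before splitting into train and test sets, as this notebook does.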

In [15]:
df

Unnamed: 0,DOLocationID,date,month,time,Day,Dropoffs,Weekend,TimeOfDay_Afternoon,TimeOfDay_Evening,TimeOfDay_Morning,TimeOfDay_Night,day_of_the_week_Friday,day_of_the_week_Monday,day_of_the_week_Saturday,day_of_the_week_Sunday,day_of_the_week_Thursday,day_of_the_week_Tuesday,day_of_the_week_Wednesday
0,236,2022-01-01,1,0,1,116,1,0,0,0,1,0,0,1,0,0,0,0
1,42,2022-01-01,1,0,1,21,1,0,0,0,1,0,0,1,0,0,0,0
2,166,2022-01-01,1,0,1,32,1,0,0,0,1,0,0,1,0,0,0,0
3,68,2022-01-01,1,0,1,102,1,0,0,0,1,0,0,1,0,0,0,0
4,163,2022-01-01,1,0,1,35,1,0,0,0,1,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31323471,162,2022-12-31,12,23,31,47,1,0,0,0,1,0,0,1,0,0,0,0
31323472,142,2022-12-31,12,23,31,85,1,0,0,0,1,0,0,1,0,0,0,0
31323473,141,2022-12-31,12,23,31,86,1,0,0,0,1,0,0,1,0,0,0,0
31323474,142,2022-12-31,12,23,31,85,1,0,0,0,1,0,0,1,0,0,0,0


## Train/Test Split
---
As mentioned above, we are changing our train test split. We now take the last 5 days of each month, and store these rows in a test dataframe. The train dataframe will then be the original dataframe less these test rows. This should help with overfitting as the test data will now be unseen. 

In [16]:
# Test data: Returns last 5 days of each month
df_test = df[df['date'].dt.day > df['date'].dt.daysinmonth - 5].copy()

# Train data: Checks for rows not in the new test set
df_train = df[~df.index.isin(df_test.index)].copy()
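The "last 5 days" condition adapts to the length of each month. The sketch below applies the same comparison to a hypothetical frame holding every calendar day of January and February 2022, and builds the training set as the complement of the same boolean mask, which is one possible alternative to the index lookup.

```python
import pandas as pd

# Every calendar day of January and February 2022 as toy data.
toy = pd.DataFrame({"date": pd.date_range("2022-01-01", "2022-02-28", freq="D")})

# True for the last five days of each month, whatever the month length.
is_test = toy["date"].dt.day > toy["date"].dt.daysinmonth - 5

toy_test = toy[is_test].copy()
toy_train = toy[~is_test].copy()   # complement of the same mask

print(toy_test["date"].dt.day.tolist())
print(len(toy_train), "training days")
```

January contributes days 27 to 31 and February days 24 to 28, so the test window always has five days.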

In [17]:
df_test

Unnamed: 0,DOLocationID,date,month,time,Day,Dropoffs,Weekend,TimeOfDay_Afternoon,TimeOfDay_Evening,TimeOfDay_Morning,TimeOfDay_Night,day_of_the_week_Friday,day_of_the_week_Monday,day_of_the_week_Saturday,day_of_the_week_Sunday,day_of_the_week_Thursday,day_of_the_week_Tuesday,day_of_the_week_Wednesday
1707973,90,2022-01-27,1,0,27,28,0,0,0,0,1,0,0,0,0,1,0,0
1708136,249,2022-01-27,1,0,27,24,0,0,0,0,1,0,0,0,0,1,0,0
1709939,162,2022-01-27,1,0,27,17,0,0,0,0,1,0,0,0,0,1,0,0
1709940,48,2022-01-27,1,0,27,48,0,0,0,0,1,0,0,0,0,1,0,0
1709941,239,2022-01-27,1,0,27,34,0,0,0,0,1,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31323471,162,2022-12-31,12,23,31,47,1,0,0,0,1,0,0,1,0,0,0,0
31323472,142,2022-12-31,12,23,31,85,1,0,0,0,1,0,0,1,0,0,0,0
31323473,141,2022-12-31,12,23,31,86,1,0,0,0,1,0,0,1,0,0,0,0
31323474,142,2022-12-31,12,23,31,85,1,0,0,0,1,0,0,1,0,0,0,0


In [18]:
df_train

Unnamed: 0,DOLocationID,date,month,time,Day,Dropoffs,Weekend,TimeOfDay_Afternoon,TimeOfDay_Evening,TimeOfDay_Morning,TimeOfDay_Night,day_of_the_week_Friday,day_of_the_week_Monday,day_of_the_week_Saturday,day_of_the_week_Sunday,day_of_the_week_Thursday,day_of_the_week_Tuesday,day_of_the_week_Wednesday
0,236,2022-01-01,1,0,1,116,1,0,0,0,1,0,0,1,0,0,0,0
1,42,2022-01-01,1,0,1,21,1,0,0,0,1,0,0,1,0,0,0,0
2,166,2022-01-01,1,0,1,32,1,0,0,0,1,0,0,1,0,0,0,0
3,68,2022-01-01,1,0,1,102,1,0,0,0,1,0,0,1,0,0,0,0
4,163,2022-01-01,1,0,1,35,1,0,0,0,1,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30982311,74,2022-12-26,12,23,26,16,0,0,0,0,1,0,1,0,0,0,0,0
30982425,164,2022-12-26,12,23,26,32,0,0,0,0,1,0,1,0,0,0,0,0
30982441,161,2022-12-26,12,23,26,34,0,0,0,0,1,0,1,0,0,0,0,0
30982892,162,2022-12-26,12,23,26,44,0,0,0,0,1,0,1,0,0,0,0,0


All looks good so far. Another minor modification is to predict our busyness index directly rather than deriving it after prediction. Since it is based on Dropoffs, it makes more sense to calculate it beforehand and use the busyness index as the target. This will not affect model performance, but it saves us from having to manually normalise the values after prediction.

In [19]:
# Create MinMaxScaler instance
scaler = MinMaxScaler()

# Scale 'Dropoffs' column for train and test data
df_train['busyness_score'] = scaler.fit_transform(df_train[['Dropoffs']])
df_test['busyness_score'] = scaler.transform(df_test[['Dropoffs']])

In [20]:
df_train

Unnamed: 0,DOLocationID,date,month,time,Day,Dropoffs,Weekend,TimeOfDay_Afternoon,TimeOfDay_Evening,TimeOfDay_Morning,TimeOfDay_Night,day_of_the_week_Friday,day_of_the_week_Monday,day_of_the_week_Saturday,day_of_the_week_Sunday,day_of_the_week_Thursday,day_of_the_week_Tuesday,day_of_the_week_Wednesday,busyness_score
0,236,2022-01-01,1,0,1,116,1,0,0,0,1,0,0,1,0,0,0,0,0.208333
1,42,2022-01-01,1,0,1,21,1,0,0,0,1,0,0,1,0,0,0,0,0.036232
2,166,2022-01-01,1,0,1,32,1,0,0,0,1,0,0,1,0,0,0,0,0.056159
3,68,2022-01-01,1,0,1,102,1,0,0,0,1,0,0,1,0,0,0,0,0.182971
4,163,2022-01-01,1,0,1,35,1,0,0,0,1,0,0,1,0,0,0,0,0.061594
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30982311,74,2022-12-26,12,23,26,16,0,0,0,0,1,0,1,0,0,0,0,0,0.027174
30982425,164,2022-12-26,12,23,26,32,0,0,0,0,1,0,1,0,0,0,0,0,0.056159
30982441,161,2022-12-26,12,23,26,34,0,0,0,0,1,0,1,0,0,0,0,0,0.059783
30982892,162,2022-12-26,12,23,26,44,0,0,0,0,1,0,1,0,0,0,0,0,0.077899


In [21]:
df_test

Unnamed: 0,DOLocationID,date,month,time,Day,Dropoffs,Weekend,TimeOfDay_Afternoon,TimeOfDay_Evening,TimeOfDay_Morning,TimeOfDay_Night,day_of_the_week_Friday,day_of_the_week_Monday,day_of_the_week_Saturday,day_of_the_week_Sunday,day_of_the_week_Thursday,day_of_the_week_Tuesday,day_of_the_week_Wednesday,busyness_score
1707973,90,2022-01-27,1,0,27,28,0,0,0,0,1,0,0,0,0,1,0,0,0.048913
1708136,249,2022-01-27,1,0,27,24,0,0,0,0,1,0,0,0,0,1,0,0,0.041667
1709939,162,2022-01-27,1,0,27,17,0,0,0,0,1,0,0,0,0,1,0,0,0.028986
1709940,48,2022-01-27,1,0,27,48,0,0,0,0,1,0,0,0,0,1,0,0,0.085145
1709941,239,2022-01-27,1,0,27,34,0,0,0,0,1,0,0,0,0,1,0,0,0.059783
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31323471,162,2022-12-31,12,23,31,47,1,0,0,0,1,0,0,1,0,0,0,0,0.083333
31323472,142,2022-12-31,12,23,31,85,1,0,0,0,1,0,0,1,0,0,0,0,0.152174
31323473,141,2022-12-31,12,23,31,86,1,0,0,0,1,0,0,1,0,0,0,0,0.153986
31323474,142,2022-12-31,12,23,31,85,1,0,0,0,1,0,0,1,0,0,0,0,0.152174


We no longer need the dropoffs column, and can drop it in both dataframes. We also drop date as we no longer need it either.

In [22]:
df_train.drop('Dropoffs', axis=1, inplace=True)
df_test.drop('Dropoffs', axis=1, inplace=True)

df_train.drop('date', axis=1, inplace=True)
df_test.drop('date', axis=1, inplace=True)

**Sampling**: We are going to test an 80/20 train/test split. We still use stratified sampling to ensure we have an equal number of data points from each month, so that the sample is representative of the original taxi data. We take 20,000 samples from each month (16,000 from the training days and 4,000 from the test days), for a total of 240,000 samples.

In [23]:
# Initialize an empty DataFrame for training data
df_train_sample = pd.DataFrame()  

for month in df_train['month'].unique():
    # Get a sample of 16,000 rows for the month (80% of the 20,000 per month)
    month_df = df_train[df_train['month'] == month].sample(n=16000, random_state=1)
    
    # Append the month sample to the training DataFrame
    df_train_sample = pd.concat([df_train_sample, month_df])

In [24]:
df_test_sample = pd.DataFrame()  

for month in df_test['month'].unique():
    month_df = df_test[df_test['month'] == month].sample(n=4000, random_state=1)
    df_test_sample = pd.concat([df_test_sample, month_df])
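The per-month sampling loops can also be written as one grouped call. This is a sketch on a hypothetical toy frame, assuming a pandas version that provides `DataFrameGroupBy.sample` (1.1 or later, as far as we are aware). It draws the same number of rows per month, but will not necessarily select the same rows as the loops above, because the random state is consumed differently.

```python
import pandas as pd

# Hypothetical toy frame: 3 months with 10 rows each.
toy = pd.DataFrame({
    "month": [m for m in (1, 2, 3) for _ in range(10)],
    "busyness_score": [i / 30 for i in range(30)],
})

# Draw 4 rows from every month in a single call.
toy_sample = toy.groupby("month").sample(n=4, random_state=1)

counts = toy_sample["month"].value_counts().sort_index().to_dict()
print(counts)
```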

And now the actual train/test split.

In [25]:
# Assign the feature columns to X_train and X_test
X_train = df_train_sample.drop('busyness_score', axis=1)
X_test = df_test_sample.drop('busyness_score', axis=1)

# Assign the target column to y_train and y_test
y_train = df_train_sample['busyness_score']
y_test = df_test_sample['busyness_score']

In [26]:
X_train

Unnamed: 0,DOLocationID,month,time,Day,Weekend,TimeOfDay_Afternoon,TimeOfDay_Evening,TimeOfDay_Morning,TimeOfDay_Night,day_of_the_week_Friday,day_of_the_week_Monday,day_of_the_week_Saturday,day_of_the_week_Sunday,day_of_the_week_Thursday,day_of_the_week_Tuesday,day_of_the_week_Wednesday
931503,137,1,0,16,1,0,0,0,1,0,0,0,1,0,0,0
313020,48,1,17,6,0,0,1,0,0,0,0,0,0,1,0,0
170010,140,1,13,4,0,1,0,0,0,0,0,0,0,0,1,0
268831,68,1,6,6,0,0,0,1,0,0,0,0,0,1,0,0
214603,68,1,9,5,0,0,0,1,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30647814,263,12,16,21,0,1,0,0,0,0,0,0,0,0,0,1
29807980,151,12,17,12,0,0,1,0,0,0,1,0,0,0,0,0
28997957,162,12,0,4,1,0,0,0,1,0,0,0,1,0,0,0
30936776,142,12,9,26,0,0,0,1,0,0,1,0,0,0,0,0


In [27]:
y_train

931503      0.146739
313020      0.266304
170010      0.269928
268831      0.056159
214603      0.144928
              ...   
30647814    0.228261
29807980    0.148551
28997957    0.119565
30936776    0.094203
29388170    0.715580
Name: busyness_score, Length: 192000, dtype: float64

In [28]:
X_test

Unnamed: 0,DOLocationID,month,time,Day,Weekend,TimeOfDay_Afternoon,TimeOfDay_Evening,TimeOfDay_Morning,TimeOfDay_Night,day_of_the_week_Friday,day_of_the_week_Monday,day_of_the_week_Saturday,day_of_the_week_Sunday,day_of_the_week_Thursday,day_of_the_week_Tuesday,day_of_the_week_Wednesday
1722287,163,1,9,27,0,0,0,1,0,0,0,0,0,1,0,0
1794191,249,1,23,27,0,0,0,0,1,0,0,0,0,1,0,0
1850405,237,1,18,28,0,0,1,0,0,1,0,0,0,0,0,0
1930515,90,1,15,30,1,1,0,0,0,0,0,0,1,0,0,0
1902766,114,1,1,30,1,0,0,0,1,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31163898,164,12,17,29,0,0,1,0,0,0,0,0,0,1,0,0
31222805,234,12,15,30,0,1,0,0,0,1,0,0,0,0,0,0
31043409,231,12,22,27,0,0,0,0,1,0,0,0,0,0,1,0
31191998,162,12,2,30,0,0,0,0,1,1,0,0,0,0,0,0


In [29]:
y_test

1722287     0.315217
1794191     0.163043
1850405     0.452899
1930515     0.139493
1902766     0.014493
              ...   
31163898    0.302536
31222805    0.221014
31043409    0.072464
31191998    0.036232
31211245    0.442029
Name: busyness_score, Length: 48000, dtype: float64

## Random Forest
---
We'll begin by testing a Random Forest. We skip linear regression as we previously found it was not suitable for our particular problem.

In [30]:
random_forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1)
random_forest.fit(X_train, y_train)
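Since the forest is built with `oob_score=True`, scikit-learn also stores an out-of-bag R-squared in the `oob_score_` attribute after fitting. The sketch below shows how it can be read, using synthetic stand-in data rather than the taxi features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data; not the taxi features.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(500, 3))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(0, 0.05, size=500)

forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1)
forest.fit(X, y)

train_r2 = forest.score(X, y)   # R^2 on the data the forest was fitted on
oob_r2 = forest.oob_score_      # R^2 on rows left out of each tree's bootstrap sample

print("Train R^2:", train_r2)
print("OOB R^2:  ", oob_r2)
```

For this notebook the out-of-bag rows are drawn at random from the training sample, so they may include near-duplicates of in-bag rows. The out-of-bag score is therefore likely to be more optimistic than the score on the held-out last 5 days of each month.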

In [31]:
# Creating a dataframe to store & display feature importance
feature_importance = pd.DataFrame({'feature': X_train.columns, 'importance':random_forest.feature_importances_})
feature_importance.sort_values('importance', ascending=False)

Unnamed: 0,feature,importance
0,DOLocationID,0.599921
2,time,0.15326
8,TimeOfDay_Night,0.069179
1,month,0.04864
4,Weekend,0.03883
3,Day,0.033192
9,day_of_the_week_Friday,0.010636
10,day_of_the_week_Monday,0.00866
12,day_of_the_week_Sunday,0.008245
7,TimeOfDay_Morning,0.007246


In [32]:
# Testing predicted vs actual values 
rf_training_predictions = random_forest.predict(X_train)
df_true_vs_rf_predicted = pd.DataFrame({'Actual Value': y_train, 'Predicted Value': rf_training_predictions})
df_true_vs_rf_predicted.head(10)

Unnamed: 0,Actual Value,Predicted Value
931503,0.146739,0.145543
313020,0.266304,0.259638
170010,0.269928,0.26308
268831,0.056159,0.050634
214603,0.144928,0.138931
1607386,0.42029,0.43038
1218760,0.143116,0.142337
503784,0.056159,0.062971
1407305,0.255435,0.255417
1096479,0.023551,0.024167


In [33]:
print("\n==================== Train Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_train, rf_training_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_train, rf_training_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_train, rf_training_predictions)))
print('R^2:', metrics.r2_score(y_train, rf_training_predictions))
print("\n=======================================================")


Mean Absolute Error: 0.004980749075332147
Mean Squared Error: 6.977891365985519e-05
Root Mean Squared Error: 0.008353377380428541
R^2: 0.9974601063145035



In [34]:
# Predicted busyness scores for the held-out test set,
# i.e. the last 5 days of each month, which the model has not seen during training
rf_test_predictions = random_forest.predict(X_test)
df_true_vs_rf_predicted_test = pd.DataFrame({'Actual Value': y_test, 'Predicted Value': rf_test_predictions})
df_true_vs_rf_predicted_test.head(10)

Unnamed: 0,Actual Value,Predicted Value
1722287,0.315217,0.313732
1794191,0.163043,0.130254
1850405,0.452899,0.60067
1930515,0.139493,0.12808
1902766,0.014493,0.087591
1729525,0.074275,0.090072
1729405,0.148551,0.141033
1874065,0.068841,0.073659
1848589,0.380435,0.398714
1745253,0.730072,0.790996


In [35]:
print("\n==================== Test Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, rf_test_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, rf_test_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, rf_test_predictions)))
print('R^2:', metrics.r2_score(y_test, rf_test_predictions))
print("=======================================================")


Mean Absolute Error: 0.03786586956521739
Mean Squared Error: 0.003830487446765276
Root Mean Squared Error: 0.061890931862149855
R^2: 0.8569190892261723


**Observations:**
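- When predicting on the test set, the measures taken appear to have helped with overfitting: R-squared drops from 0.997 on the training data to 0.857 on the test data, which suggests the test set is no longer a near copy of the training set.
- The model still performs well on unseen data, but R-squared is much more reasonable here.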
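The four metrics printed for each model in this notebook could be collected by a small helper, so that train and test results are computed the same way each time. The function names below are hypothetical and the numbers are toy values, not results from the taxi models.

```python
import numpy as np
from sklearn import metrics

def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for a set of predictions."""
    mse = metrics.mean_squared_error(y_true, y_pred)
    return {
        "Mean Absolute Error": metrics.mean_absolute_error(y_true, y_pred),
        "Mean Squared Error": mse,
        "Root Mean Squared Error": np.sqrt(mse),
        "R^2": metrics.r2_score(y_true, y_pred),
    }

def print_report(title, y_true, y_pred):
    """Print the report under a short title, one metric per line."""
    print(f"\n==================== {title} ====================")
    for name, value in regression_report(y_true, y_pred).items():
        print(f"{name}: {value}")

# Toy values for illustration only.
y_true = [0.1, 0.2, 0.3, 0.4]
y_pred = [0.1, 0.2, 0.3, 0.5]
report = regression_report(y_true, y_pred)
print_report("Toy Data", y_true, y_pred)
```

In the notebook this would be called as, for example, `print_report("Test Data", y_test, rf_test_predictions)`.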

## XGBoost
---
Now we will perform the same test with XGBoost. We'll start by just using default parameters.

In [36]:
xg_reg = xgb.XGBRegressor()

xg_reg.fit(X_train, y_train)



In [37]:
# Predict on the training set
xg_training_predictions = xg_reg.predict(X_train)
df_true_vs_xg_predicted = pd.DataFrame({'Actual Value': y_train, 'Predicted Value': xg_training_predictions})
df_true_vs_xg_predicted.head(10)

Unnamed: 0,Actual Value,Predicted Value
931503,0.146739,0.110059
313020,0.266304,0.246519
170010,0.269928,0.213402
268831,0.056159,0.057845
214603,0.144928,0.129308
1607386,0.42029,0.362486
1218760,0.143116,0.188874
503784,0.056159,0.115775
1407305,0.255435,0.252441
1096479,0.023551,0.163769


In [38]:
print("\n==================== Train Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_train, xg_training_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_train, xg_training_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_train, xg_training_predictions)))
print('R^2:', metrics.r2_score(y_train, xg_training_predictions))
print("\n=======================================================")


Mean Absolute Error: 0.06172686354924615
Mean Squared Error: 0.006855364286367583
Root Mean Squared Error: 0.08279712728330364
R^2: 0.7504705139492533



In [39]:
# Predict on the test set
xg_test_predictions = xg_reg.predict(X_test)
df_true_vs_xg_predicted_test = pd.DataFrame({'Actual Value': y_test, 'Predicted Value': xg_test_predictions})
df_true_vs_xg_predicted_test.head(10)

Unnamed: 0,Actual Value,Predicted Value
1722287,0.315217,0.263462
1794191,0.163043,0.122977
1850405,0.452899,0.561839
1930515,0.139493,0.131776
1902766,0.014493,0.1076
1729525,0.074275,0.143029
1729405,0.148551,0.219417
1874065,0.068841,0.107015
1848589,0.380435,0.237072
1745253,0.730072,0.559784


In [40]:
print("\n==================== Test Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, xg_test_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, xg_test_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, xg_test_predictions)))
print('R^2:', metrics.r2_score(y_test, xg_test_predictions))
print("\n=======================================================")


Mean Absolute Error: 0.06272949029563743
Mean Squared Error: 0.007186454336734137
Root Mean Squared Error: 0.08477295757925482
R^2: 0.7315630331584135





**Observations:**
- Using default parameters, XGBoost significantly underperforms compared to Random Forest.

As a test, we will retrain it using the optimised parameters we found from our initial model. These may no longer be optimal as the model inputs have changed, but they will hopefully provide some performance increase.

In [41]:
xg_reg = xgb.XGBRegressor(subsample=0.7, n_estimators=300, min_child_weight=3, max_depth=7, learning_rate=0.1, colsample_bytree=1, objective='reg:squarederror')

xg_reg.fit(X_train, y_train)

In [42]:
# Predict on the training set
xg_training_predictions = xg_reg.predict(X_train)
df_true_vs_xg_predicted = pd.DataFrame({'Actual Value': y_train, 'Predicted Value': xg_training_predictions})
df_true_vs_xg_predicted.head(10)

Unnamed: 0,Actual Value,Predicted Value
931503,0.146739,0.142992
313020,0.266304,0.282358
170010,0.269928,0.230246
268831,0.056159,0.040303
214603,0.144928,0.143365
1607386,0.42029,0.419518
1218760,0.143116,0.135421
503784,0.056159,0.065351
1407305,0.255435,0.261231
1096479,0.023551,0.001981


In [43]:
print("\n==================== Train Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_train, xg_training_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_train, xg_training_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_train, xg_training_predictions)))
print('R^2:', metrics.r2_score(y_train, xg_training_predictions))
print("\n=======================================================")


Mean Absolute Error: 0.024463536374360894
Mean Squared Error: 0.0011503961774387893
Root Mean Squared Error: 0.03391749073028235
R^2: 0.9581265480695342



In [44]:
# Predict on the test set
xg_test_predictions = xg_reg.predict(X_test)
df_true_vs_xg_predicted_test = pd.DataFrame({'Actual Value': y_test, 'Predicted Value': xg_test_predictions})
df_true_vs_xg_predicted_test.head(10)

Unnamed: 0,Actual Value,Predicted Value
1722287,0.315217,0.283787
1794191,0.163043,0.124922
1850405,0.452899,0.599822
1930515,0.139493,0.143654
1902766,0.014493,0.091337
1729525,0.074275,0.061837
1729405,0.148551,0.168418
1874065,0.068841,0.080215
1848589,0.380435,0.419201
1745253,0.730072,0.770896


In [45]:
print("\n==================== Test Data =======================")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, xg_test_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, xg_test_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, xg_test_predictions)))
print('R^2:', metrics.r2_score(y_test, xg_test_predictions))
print("\n=======================================================")


Mean Absolute Error: 0.03477601028303954
Mean Squared Error: 0.002586265078413597
Root Mean Squared Error: 0.0508553348078016
R^2: 0.9033947590052922



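Applying these parameters has significantly improved the performance of the model: on the test set, R-squared rises from 0.732 to 0.903 and mean absolute error falls from 0.063 to 0.035. As we continue to improve the model, we will perform more parameter tuning once we have merged the subway data into the model.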
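One way the later parameter tuning could be set up is a randomised search with scikit-learn's `RandomizedSearchCV`. The sketch below is illustrative only: it uses synthetic data, a hypothetical parameter grid, and `RandomForestRegressor` as a stand-in estimator so that it runs with scikit-learn alone. An `xgb.XGBRegressor` with its own parameter names could be substituted, as it follows the same estimator interface.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; not the taxi features.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 4))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.05, size=300)

# Hypothetical search space; the values are placeholders, not tuned choices.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 3, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=1),
    param_distributions=param_distributions,
    n_iter=5,          # number of random parameter combinations to try
    cv=3,              # 3-fold cross-validation on the training data
    scoring="r2",
    random_state=1,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV R^2:", search.best_score_)
```

The default folds here are not time-aware, so for this notebook's data they would share the leakage problem of the original random split. A fold scheme that holds out whole days would be closer to the train/test split used above.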

In [46]:
with open('xgb_reg_single_model_v2.pkl', 'wb') as file:
    pickle.dump(xg_reg, file)
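A saved model can be read back with `pickle.load` and should then give the same predictions. The sketch below shows the round trip with a small stand-in model and a temporary file (both hypothetical), so that it runs without the taxi data or the real `.pkl` file.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small stand-in model fitted on synthetic data.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(100, 2))
y = X[:, 0] + X[:, 1]
model = RandomForestRegressor(n_estimators=10, random_state=1).fit(X, y)

# Save to a temporary location (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "model_demo.pkl")
with open(path, "wb") as file:
    pickle.dump(model, file)

# Load the model back and compare predictions.
with open(path, "rb") as file:
    restored = pickle.load(file)

same_predictions = np.allclose(model.predict(X), restored.predict(X))
print("Predictions match after reload:", same_predictions)
```

Pickled models are generally only safe to load with the same library versions that created them, and only from trusted sources.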