# 5. Capstone Project: Machine Learning Models III

***

![headerall](./images/headers/header_all.jpg)

##  Goals

### Project:
In this work, we will first analyze where and when traffic congestion is highest and lowest in New York State. We will then build different machine learning models capable of predicting cab travel times in and around New York City using only variables that can be easily obtained from a smartphone app or a website. We will then compare their performance and explore the possibility of using additional variables such as weather forecasts and holidays to improve the predictive performance of the models.

### Section:
In this section, we will use the knowledge gained during the exploratory data analysis to perform the final feature transformation. Next, we will build three machine learning models based on linear regression, a support vector machine, and a gradient boosted decision tree, and compare their performance. Hyperparameters will be optimized for each model to achieve the best possible performance.

## Data
### External Datasets:
- Weather Forecast: The 2018 NYC weather forecast was collected from the [National Weather Service Forecast Office](https://w2.weather.gov/climate/index.php?wfo=okx) website. Daily measurements were taken from January to December 2018 in Central Park. These measurements are given in imperial units and include daily minimum and maximum temperatures, precipitation, snowfall, and snow depth.

- Holidays: The 2018 NYC holidays list was collected from the [Office Holidays](https://www.officeholidays.com/countries/usa/new-york/2021) website. The dataset contains the name, date, and type of holidays for New York.

- Taxi Zones: The NYC Taxi Zones dataset was collected from the [NYC Open Data](https://data.cityofnewyork.us/Transportation/NYC-Taxi-Zones/d3c5-ddgc) website. It contains the pickup and drop-off zones (Location IDs) for the Yellow, Green, and FHV Trip Records. The taxi zones are based on the NYC Department of City Planning’s Neighborhood.

### Primary Datasets:

- Taxi Trips: The 2018 NYC Taxi Trip dataset was collected from the [Google BigQuery](https://console.cloud.google.com/marketplace/product/city-of-new-york/nyc-tlc-trips?project=jovial-monument-300209&folder=&organizationId=) platform. The dataset contains more than 100,000,000 Yellow Taxi Trip records for 2018 and a large number of variables, including the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

***
## Table of Contents:
    1. Data Preparation
        1.1 External Datasets
            1.1.1 Weather Forecast Dataset
            1.1.2 Holidays Dataset
            1.1.3 Taxi Zones Dataset
        1.2 Primary Dataset
            1.2.1 Taxi Trips Dataset
            1.2.2 Taxi Trips Subset
    2. Exploratory Data Analysis
        2.1 Primary Dataset
            2.1.1 Temporal Analysis
            2.1.2 Spatio-Temporal Analysis
        2.2 External Datasets
            2.2.1 Temporal Analysis of Weather Data
            2.2.2 Temporal Analysis of Holidays Data
        2.3 Combined Dataset
            2.3.1 Overall Features Correlation
    3. Machine Learning Models
        3.1 Data Preparation
        3.2 Baselines
        3.2 Model Training
            3.2.1 Linear Regression
            3.2.2 Support Vector Machine
            3.2.3 Gradient Boosted Decision Tree

***
## Python Libraries and Magic commands Import

In [2]:
# Import python core libraries
import time

# Import data processing libraries
import pandas as pd
import numpy as np

# Import Visualization libraries
import seaborn as sns 
import matplotlib.pyplot as plt

# Import machine learning libraries
from sklearn.pipeline import make_pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error as MSE
# Note: MAE refers to the median (not mean) absolute error throughout this notebook
from sklearn.metrics import median_absolute_error as MAE

In [3]:
# Set up magic commands
%matplotlib inline
%config Completer.use_jedi = False

***

## Data Import

In [4]:
# Import the train dataset
train_df = pd.read_pickle(r'data/processed/train.pickle')

# Get the independent variables from the train dataset
X_tr = train_df.drop("trip_duration", axis=1)

# Get the dependent variable from the train dataset
y_tr = train_df["trip_duration"]

print('X_tr:', X_tr.shape)
print('y_tr:', y_tr.shape, y_tr.dtype)

X_tr: (824654, 33)
y_tr: (824654,) float64


In [5]:
# Import the test dataset
test_df = pd.read_pickle(r'data/processed/test.pickle')

# Get the independent variables from the test dataset
X_te = test_df.drop("trip_duration", axis=1)

# Get the dependent variable from the test dataset
y_te = test_df["trip_duration"]

print('X_te:', X_te.shape)
print('y_te:', y_te.shape, y_te.dtype)

X_te: (206156, 33)
y_te: (206156,) float64


***
## Functions Import

In [6]:
# Define a function that applies the preprocessing steps to the selected dataset
def preprocess(data, categorical_cols, continuous_cols, transform_cols, polynome_deg=1):

    # Create a copy of the data frame
    df = data.copy()

    # One-hot encode categorical features
    df = pd.get_dummies(df, columns=categorical_cols, dummy_na=False)

    # Log-transform numerical variables (values must be strictly positive)
    for col in transform_cols:
        df[col] = np.log(df[col])

    # Add polynomial features of degree 2 up to polynome_deg
    # (degrees 0 and 1 would only add a constant and a duplicate column)
    for col in continuous_cols:
        for poly in range(2, polynome_deg + 1):
            df["{}**{}".format(col, poly)] = df[col] ** poly

    return df
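
The three steps of `preprocess` can be illustrated on a toy data frame. This is only a minimal sketch: the column names mirror those used in this notebook, but the values are made up, and the steps are written out inline rather than calling the function above.

```python
import numpy as np
import pandas as pd

# Toy data frame; column names mirror the notebook, values are invented
toy_df = pd.DataFrame({
    "pickup_weekday": [0, 5, 6],
    "trip_distance": [1.0, 2.0, 4.0],
})

# Step 1: one-hot encode the categorical column
encoded = pd.get_dummies(toy_df, columns=["pickup_weekday"], dummy_na=False)

# Step 2: log-transform the continuous column (values must be > 0)
encoded["trip_distance"] = np.log(encoded["trip_distance"])

# Step 3: add a squared term of the (already log-transformed) column
encoded["trip_distance**2"] = encoded["trip_distance"] ** 2

print(encoded.columns.tolist())
```

Note that the polynomial terms are built after the log transform, so they are powers of the logged values.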

***
## Variable Import

In [7]:
# Define a list of categorical variables
categorical_cols = [
    "pickup_month",
    "pickup_week",
    "pickup_weekday",
    "pickup_weekday_type",
    "pickup_hour",
    "pickup_hour_type",
    "wf_avg_temp_lvl",
    "wf_prec_lvl",
    "wf_new_snow_lvl",
    "wf_snow_depth_lvl",
    "holiday_type",
    "holiday",
    "trip_within_borough",
    "tolls_amount_lvl",
]

# Define a list of continuous variables
continuous_cols = [
    "trip_distance",
    "tolls_amount",
    "wf_avg_temp",
    "wf_prec",
    "wf_new_snow",
    "wf_snow_depth",
    "pickup_zone_latitude",
    "pickup_zone_longitude",
    "pickup_borough_latitude",
    "pickup_borough_longitude",
    "dropoff_zone_latitude",
    "dropoff_zone_longitude",
    "dropoff_borough_latitude",
    "dropoff_borough_longitude",
]

***
# 3. Machine Learning Models
## 3.2 Machine Learning Models: Model Training

In [8]:
# Get id column names from the train dataset
id_cols = [c for c in train_df.columns if "_id" in c]

# Remove ID features in the train dataset
X_tr.drop(id_cols, axis=1, inplace=True)

# Remove ID features in the test dataset
X_te.drop(id_cols, axis=1, inplace=True)

# Drop the pickup day of the year variable from the train dataset
X_tr.drop("pickup_yearday", axis=1, inplace=True)

# Drop the pickup day of the year variable from the test dataset
X_te.drop("pickup_yearday", axis=1, inplace=True)

## 3.2.2 Model Training: Support Vector Machine

## Goals:

## Code:

###  Support Vector Machine: testing different training sizes

In [9]:
# Get time at execution start
start_time = time.time()

# Create a pipeline that standardizes the features and fits a support vector machine model
svr_model = make_pipeline(
    StandardScaler(),
    SVR()
)

# Fit the pipeline on the first 100 samples of the train dataset
svr_model.fit(X_tr[:100], y_tr[:100])

# Predict the target variable of the test dataset with the default hyperparameters
svr_y_pred = svr_model.predict(X_te)

# Calculate execution time
end_time_1 = time.time() - start_time

print('The MSE of the support vector machine regression model is: {:.0f}'.format(MSE(y_te, svr_y_pred)))
print('The MAE of the support vector machine regression model is: {:.0f}'.format(MAE(y_te, svr_y_pred)))

print('Execution time: {:.0f} sec'.format(end_time_1))

The MSE of the support vector machine regression model is: 91
The MAE of the support vector machine regression model is: 5
Execution time: 2 sec
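
The `MAE` alias imported in this notebook points to `median_absolute_error`, not to the mean absolute error that the abbreviation usually denotes. The following sketch, using made-up values, shows how differently the two metrics react to a single badly predicted trip:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, median_absolute_error

# Invented trip durations (minutes); the last prediction is a large miss
y_true = np.array([10.0, 12.0, 15.0, 60.0])
y_pred = np.array([11.0, 12.0, 13.0, 20.0])

# Absolute errors are 1, 0, 2 and 40
median_err = median_absolute_error(y_true, y_pred)
mean_err = mean_absolute_error(y_true, y_pred)

print("Median absolute error:", median_err)
print("Mean absolute error:", mean_err)
```

The median ignores the outlier, whereas the mean (like the MSE) is pulled up by it, which is worth keeping in mind when comparing the two scores reported in this section.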


In [10]:
# Get time at execution start
start_time = time.time()

# Create a pipeline that standardizes the features and fits a support vector machine model
svr_model = make_pipeline(
    StandardScaler(),
    SVR()
)

# Fit the pipeline on the first 1000 samples of the train dataset
svr_model.fit(X_tr[:1000], y_tr[:1000])

# Predict the target variable of the test dataset with the default hyperparameters
svr_y_pred = svr_model.predict(X_te)

# Calculate execution time
end_time_2 = time.time() - start_time

print('The MSE of the support vector machine regression model is: {:.0f}'.format(MSE(y_te, svr_y_pred)))
print('The MAE of the support vector machine regression model is: {:.0f}'.format(MAE(y_te, svr_y_pred)))

print('Execution time: {:.0f} sec'.format(end_time_2))

The MSE of the support vector machine regression model is: 60
The MAE of the support vector machine regression model is: 3
Execution time: 17 sec


In [11]:
# Get time at execution start
start_time = time.time()

# Create a pipeline that standardizes the features and fits a support vector machine model
svr_model = make_pipeline(
    StandardScaler(),
    SVR()
)

# Fit the pipeline on the first 10000 samples of the train dataset
svr_model.fit(X_tr[:10000], y_tr[:10000])

# Predict the target variable of the test dataset with the default hyperparameters
svr_y_pred = svr_model.predict(X_te)

# Calculate execution time
end_time_3 = time.time() - start_time

print('The MSE of the support vector machine regression model is: {:.0f}'.format(MSE(y_te, svr_y_pred)))
print('The MAE of the support vector machine regression model is: {:.0f}'.format(MAE(y_te, svr_y_pred)))

print('Execution time: {:.0f} sec'.format(end_time_3))

The MSE of the support vector machine regression model is: 31
The MAE of the support vector machine regression model is: 2
Execution time: 169 sec


In [12]:
# Get time at execution start
start_time = time.time()

# Create a pipeline that standardizes the features and fits a support vector machine model
svr_model = make_pipeline(
    StandardScaler(),
    SVR()
)

# Fit the pipeline on the first 100000 samples of the train dataset
svr_model.fit(X_tr[:100000], y_tr[:100000])

# Predict the target variable of the test dataset with the default hyperparameters
svr_y_pred = svr_model.predict(X_te)

# Calculate execution time
end_time_4 = time.time() - start_time

print('The MSE of the support vector machine regression model is: {:.0f}'.format(MSE(y_te, svr_y_pred)))
print('The MAE of the support vector machine regression model is: {:.0f}'.format(MAE(y_te, svr_y_pred)))

print('Execution time: {:.0f} sec'.format(end_time_4))

The MSE of the support vector machine regression model is: 23
The MAE of the support vector machine regression model is: 2
Execution time: 2314 sec
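
The four timings recorded above can be summarised with a log-log fit. This is a rough sketch only: each timing covers both fitting and predicting on the full test set, it comes from a single run, and four points are too few to extrapolate to the complete training set with any confidence.

```python
import numpy as np

# Training sizes and execution times (seconds) copied from the four runs above
n_train = np.array([100, 1_000, 10_000, 100_000])
seconds = np.array([2, 17, 169, 2314])

# Fit log10(seconds) = slope * log10(n_train) + intercept
slope, intercept = np.polyfit(np.log10(n_train), np.log10(seconds), deg=1)

print("Empirical scaling exponent: {:.2f}".format(slope))
```

An exponent close to 1 means that, over this range, the total run time grew by roughly a factor of ten for each tenfold increase in training size.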


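**Notes:** The execution time grows roughly tenfold with each tenfold increase in training size (from 2 sec for 100 samples to 2314 sec for 100,000 samples), while the MSE only falls from 91 to 23 and the MAE from 5 to 2. Training on the full dataset is therefore impractical: the dataset should be subsampled and the feature space reduced/optimised.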
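
One way to subsample is to draw rows at random instead of slicing the first rows with `X_tr[:n]`. Whether the training rows in this project are ordered is not stated, so the sketch below uses a synthetic frame that is deliberately sorted by month (an assumption made purely for illustration) to show what positional slicing can miss.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data, deliberately sorted by month
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "pickup_month": np.repeat(np.arange(1, 13), 100),
    "trip_distance": rng.uniform(0.5, 20.0, size=1200),
})
y_demo = pd.Series(rng.uniform(1.0, 60.0, size=1200), name="trip_duration")

n_samples = 120

# Positional slicing only sees the first rows of the sorted frame
head_months = X_demo[:n_samples]["pickup_month"].nunique()

# Random sampling draws rows from the whole year; reuse the index for the target
X_sub = X_demo.sample(n=n_samples, random_state=0)
y_sub = y_demo.loc[X_sub.index]
random_months = X_sub["pickup_month"].nunique()

print("Months covered by slicing:", head_months)
print("Months covered by random sampling:", random_months)
```

Selecting the target with `y_demo.loc[X_sub.index]` keeps features and target aligned after sampling.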

###  Support Vector Machine: testing different feature spaces

In [None]:
# Create a subset of the train matrix without holiday data
X_tr_sub1 = X_tr.drop(columns=[col for col in X_tr.columns if "holiday" in col])

# Create a subset of the test matrix without holiday data
X_te_sub1 = X_te.drop(columns=[col for col in X_te.columns if "holiday" in col])

print("X_tr:", X_tr_sub1.shape)
print("X_te:", X_te_sub1.shape)

In [None]:
# Get time at execution start
start_time = time.time()

# Create a pipeline that standardizes the features and fits a support vector machine model
svr_model = make_pipeline(
    StandardScaler(),
    SVR()
)

# Fit the pipeline on the first 80000 samples of the train subset without holiday data
svr_model.fit(X_tr_sub1[:80000], y_tr[:80000])

# Predict the target variable of the test subset with the default hyperparameters
svr_y_pred = svr_model.predict(X_te_sub1)

# Calculate execution time
end_time = time.time() - start_time

print('The MSE of the support vector machine regression model is: {:.0f}'.format(MSE(y_te, svr_y_pred)))
print('The MAE of the support vector machine regression model is: {:.0f}'.format(MAE(y_te, svr_y_pred)))

print('Execution time: {:.0f} sec'.format(end_time))

In [None]:
# Create a subset of the train matrix without weather forecast data
X_tr_sub2 = X_tr.drop(columns=[col for col in X_tr.columns if "wf" in col])

# Create a subset of the test matrix without weather forecast data
X_te_sub2 = X_te.drop(columns=[col for col in X_te.columns if "wf" in col])

print("X_tr:", X_tr_sub2.shape)
print("X_te:", X_te_sub2.shape)

In [None]:
# Get time at execution start
start_time = time.time()

# Create a pipeline that standardizes the features and fits a support vector machine model
svr_model = make_pipeline(
    StandardScaler(),
    SVR()
)

# Fit the pipeline on the first 80000 samples of the train subset without weather forecast data
svr_model.fit(X_tr_sub2[:80000], y_tr[:80000])

# Predict the target variable of the test subset with the default hyperparameters
svr_y_pred = svr_model.predict(X_te_sub2)

# Calculate execution time
end_time = time.time() - start_time

print('The MSE of the support vector machine regression model is: {:.0f}'.format(MSE(y_te, svr_y_pred)))
print('The MAE of the support vector machine regression model is: {:.0f}'.format(MAE(y_te, svr_y_pred)))

print('Execution time: {:.0f} sec'.format(end_time))

In [None]:
# Create a subset of the train matrix without holiday and weather forecast data
X_tr_sub3 = X_tr.drop(columns=[col for col in X_tr.columns if "wf" in col or "holiday" in col])
                               
# Create a subset of the test matrix without holiday and weather forecast data
X_te_sub3 = X_te.drop(columns=[col for col in X_te.columns if "wf" in col or "holiday" in col])

print("X_tr:", X_tr_sub3.shape)
print("X_te:", X_te_sub3.shape)

In [None]:
# Get time at execution start
start_time = time.time()

# Create a pipeline that standardizes the features and fits a support vector machine model
svr_model = make_pipeline(
    StandardScaler(),
    SVR()
)

# Fit the pipeline on the first 80000 samples of the train subset without holiday and weather forecast data
svr_model.fit(X_tr_sub3[:80000], y_tr[:80000])

# Predict the target variable of the test subset with the default hyperparameters
svr_y_pred = svr_model.predict(X_te_sub3)

# Calculate execution time
end_time = time.time() - start_time

print('The MSE of the support vector machine regression model is: {:.0f}'.format(MSE(y_te, svr_y_pred)))
print('The MAE of the support vector machine regression model is: {:.0f}'.format(MAE(y_te, svr_y_pred)))

print('Execution time: {:.0f} sec'.format(end_time))

**Notes:** Reducing the feature space has three aims: improve the accuracy of predictions on new data, reduce the computational cost, and produce a more interpretable model.
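
The feature-space comparison repeats the same fit, predict, and score steps for each subset, so it could be wrapped in a small helper and run in a loop. The sketch below is one possible arrangement, not the notebook's actual code: `evaluate_svr` and `feature_groups` are hypothetical names, and the data is synthetic, with column names chosen only to imitate the `wf` and `holiday` naming used here.

```python
import time

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, median_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def evaluate_svr(X_train, y_train, X_test, y_test):
    """Fit a scaled SVR; return test MSE, median absolute error and run time."""
    start = time.time()
    model = make_pipeline(StandardScaler(), SVR())
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "n_features": X_train.shape[1],
        "mse": mean_squared_error(y_test, y_pred),
        "medae": median_absolute_error(y_test, y_pred),
        "seconds": time.time() - start,
    }


# Synthetic stand-in data; the column names are hypothetical
X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)
X = pd.DataFrame(X, columns=["trip_distance", "wf_avg_temp", "holiday", "pickup_hour"])
X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]

# Substrings identifying the columns to drop in each run
feature_groups = {
    "all": [],
    "no_holiday": ["holiday"],
    "no_wf": ["wf"],
    "no_wf_no_holiday": ["wf", "holiday"],
}

results = {}
for name, patterns in feature_groups.items():
    drop_cols = [c for c in X.columns if any(p in c for p in patterns)]
    results[name] = evaluate_svr(
        X_train.drop(columns=drop_cols), y_train,
        X_test.drop(columns=drop_cols), y_test,
    )

# One row per feature subset
print(pd.DataFrame(results).T)
```

Collecting the scores in one table makes the accuracy versus run-time trade-off of each subset easier to compare than reading separate cell outputs.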

###  Support Vector Machine: testing ...

In [None]:
# Create a pipeline that standardizes the features and fits a support vector machine model
svr_model = make_pipeline(
    StandardScaler(),
    SVR()
)

# Define a set of hyperparameters to be tested during gridsearch
# (make_pipeline names the SVR step 'svr', hence the 'svr__' prefix)
svr_model_params = {
    'svr__C':  np.logspace(-4, 4, num=9),
    'svr__gamma':  np.logspace(-4, 4, num=9)
}

# Create a gridsearch object to find the optimum hyperparameters
svr_model_gs = GridSearchCV(
    svr_model,
    svr_model_params,
    cv=5,
    return_train_score=True,
    verbose=True,
    n_jobs=-1,
)

# Run the gridsearch on the first 100 samples of the train dataset
svr_model_gs.fit(X_tr[:100], y_tr[:100])

# Predict the target variable of the test dataset with the best hyperparameters
svr_y_pred = svr_model_gs.predict(X_te)

print('The MSE of the support vector machine regression model is: {:.0f}'.format(MSE(y_te, svr_y_pred)))
print('The MAE of the support vector machine regression model is: {:.0f}'.format(MAE(y_te, svr_y_pred)))

# The best parameters are stored on the gridsearch object, not on the predictions
print("\n The best parameters across all searched params: \n", svr_model_gs.best_params_)

Fitting 5 folds for each of 81 candidates, totalling 405 fits
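
Grid-search keys for a pipeline follow the pattern `<step name>__<parameter>`, and `make_pipeline` derives each step name from the lowercased class name. The short check below lists the step names and the tunable parameters for a pipeline built the same way as in this section:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Same pipeline construction as in the cells of this section
pipe = make_pipeline(StandardScaler(), SVR())

# Step names generated by make_pipeline
step_names = list(pipe.named_steps)
print(step_names)

# Parameters of the SVR step that a grid search can address
svr_params = sorted(p for p in pipe.get_params() if p.startswith("svr__"))
print(svr_params)
```

A key with any other prefix, such as `svc__C`, is not a parameter of this pipeline, so `GridSearchCV` would reject it when fitting.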


In [None]:
# Create a pipeline that standardizes the features and fits a support vector machine model
svr_model = make_pipeline(
    StandardScaler(),
    SVR()
)

# Define a set of hyperparameters to be tested during gridsearch
# (make_pipeline names the SVR step 'svr', hence the 'svr__' prefix)
svr_model_params = {
    'svr__kernel': ['rbf', 'linear'],
    'svr__C':  np.logspace(-4, 4, num=9),
    'svr__gamma':  np.logspace(-4, 4, num=9)
}

# Create a gridsearch object to find the optimum hyperparameters
svr_model_gs = GridSearchCV(
    svr_model,
    svr_model_params,
    cv=5,
    return_train_score=True,
    verbose=True,
    n_jobs=-1,
)

# Run the gridsearch on the first 10000 samples of the train dataset
svr_model_gs.fit(X_tr[:10000], y_tr[:10000])

# Predict the target variable of the test dataset with the best hyperparameters
svr_y_pred = svr_model_gs.predict(X_te)

print('The MSE of the support vector machine regression model is: {:.0f}'.format(MSE(y_te, svr_y_pred)))
print('The MAE of the support vector machine regression model is: {:.0f}'.format(MAE(y_te, svr_y_pred)))

# The best parameters are stored on the gridsearch object, not on the predictions
print("\n The best parameters across all searched params:\n", svr_model_gs.best_params_)
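
The size of a grid can be checked before launching the search with `ParameterGrid`. The sketch below assumes the SVR step is addressed with the `svr__` prefix. It compares the single-dictionary grid with a split grid (a suggestion, not part of the original notebook) that searches `gamma` only for the RBF kernel, since the linear kernel does not use it.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

values = np.logspace(-4, 4, num=9)
cv = 5

# Single dictionary: every kernel is combined with every C and gamma
full_grid = {
    "svr__kernel": ["rbf", "linear"],
    "svr__C": values,
    "svr__gamma": values,
}

# Split grid: gamma is only searched for the RBF kernel
split_grid = [
    {"svr__kernel": ["rbf"], "svr__C": values, "svr__gamma": values},
    {"svr__kernel": ["linear"], "svr__C": values},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))

print("Full grid: {} candidates, {} fits".format(n_full, n_full * cv))
print("Split grid: {} candidates, {} fits".format(n_split, n_split * cv))
```

`GridSearchCV` accepts such a list of dictionaries as its parameter grid, which avoids refitting the linear kernel once per `gamma` value.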

**Notes:**

***