# 5. Capstone Project: Machine Learning Models II

***

![headerall](./images/headers/header_all.jpg)

##  Goals

### Project:
In this work, we will first analyze where and when traffic congestion is highest and lowest in New York State. We will then build different machine learning models capable of predicting cab travel times in and around New York City using only variables that can be easily obtained from a smartphone app or a website. We will then compare their performance and explore the possibility of using additional variables such as weather forecasts and holidays to improve the predictive performance of the models.

### Section:
In this section, we will use the knowledge gained during the exploratory data analysis to perform the final feature transformation. Next, we will create and compare the performance of several machine learning models, namely: linear regressions, a support vector machine regressor, a random forest regressor and a gradient boosted decision tree. The feature space and hyperparameters will be optimised for each model to obtain the best possible performance.

## Data
### External Datasets:
- Weather Forecast: The 2018 NYC weather forecast was collected from the [National Weather Service Forecast Office](https://w2.weather.gov/climate/index.php?wfo=okx) website. Daily measurements were taken from January to December 2018 in Central Park. These measurements are given in imperial units and include daily minimum and maximum temperatures, precipitation, snowfall, and snow depth.

- Holidays: The 2018 NYC holidays list was collected from the [Office Holiday](https://www.officeholidays.com/countries/usa/new-york/2021) website. The dataset contains the name, date, and type of holidays for New York.

- Taxi Zones: The NYC Taxi Zones dataset was collected from the [NYC Open Data](https://data.cityofnewyork.us/Transportation/NYC-Taxi-Zones/d3c5-ddgc) website. It contains the pickup and drop-off zones (Location IDs) for the Yellow, Green, and FHV Trip Records. The taxi zones are based on the NYC Department of City Planning’s Neighborhood.

### Primary Datasets:

- Taxi Trips: The 2018 NYC Taxi Trip dataset was collected from the [Google Big Query](https://console.cloud.google.com/marketplace/product/city-of-new-york/nyc-tlc-trips?project=jovial-monument-300209&folder=&organizationId=) platform. The dataset contains more than 100'000'000 Yellow Taxi Trip records for 2018 and a large number of variables, including the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

***
## Table of Contents:
    1. Data Preparation
        1.1 External Datasets
            1.1.1 Weather Forecast Dataset
            1.1.2 Holidays Dataset
            1.1.3 Taxi Zones Dataset
        1.2 Primary Dataset
            1.2.1 Taxi Trips Dataset
            1.2.2 Taxi Trips Subset
    2. Exploratory Data Analysis
        2.1 Primary Dataset
            2.1.1 Temporal Analysis
            2.1.2 Spatio-Temporal Analysis
        2.2 External Datasets
            2.2.1 Temporal Analysis of Weather Data
            2.2.2 Temporal Analysis of Holidays Data
        2.3 Combined Dataset
            2.3.1 Overall Features Correlation
    3. Machine Learning Models
        3.1 Data Preparation
        3.2 Baselines
        3.3 Model Training
            3.3.1 Linear Regression
            3.3.2 Support Vector Machine
            3.3.3 Random Forest
            3.3.4 Gradient Boosted Decision Tree
        3.4 Final Models Comparison
    4. Conclusions

***
## Python Libraries and Magic Commands Import

In [1]:
# Import data processing libraries
import pandas as pd
import numpy as np

# Import visualization libraries
import seaborn as sns 
import matplotlib.pyplot as plt

# Import machine learning libraries
from sklearn.pipeline import make_pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import Ridge
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error as MSE
# Note: the MAE alias refers to the median absolute error, not the mean absolute error
from sklearn.metrics import median_absolute_error as MAE

In [2]:
# Set up magic commands
%matplotlib inline
%config Completer.use_jedi = False

***

## Data Import

In [3]:
# Import the train dataset
train_df = pd.read_pickle(r'data/processed/train.pickle')

# Get the independent variables from the train dataset
X_tr = train_df.drop("trip_duration", axis=1)

# Get the dependent variable from the train dataset
y_tr = train_df["trip_duration"]

print('X_tr:', X_tr.shape)
print('y_tr:', y_tr.shape, y_tr.dtype)

X_tr: (824654, 33)
y_tr: (824654,) float64


In [4]:
# Import the test dataset
test_df = pd.read_pickle(r'data/processed/test.pickle')

# Get the independent variables from the test dataset
X_te = test_df.drop("trip_duration", axis=1)

# Get the dependent variable from the test dataset
y_te = test_df["trip_duration"]

print('X_te:', X_te.shape)
print('y_te:', y_te.shape, y_te.dtype)

X_te: (206156, 33)
y_te: (206156,) float64


In [5]:
# Get id column names from the train dataset
id_cols = [c for c in train_df.columns if "_id" in c]

# Remove ID features in the train dataset
X_tr.drop(id_cols, axis=1, inplace=True)

# Remove ID features in the test dataset
X_te.drop(id_cols, axis=1, inplace=True)

# Drop the pickup day of the year variable from the train dataset
X_tr.drop("pickup_yearday", axis=1, inplace=True)

# Drop the pickup day of the year variable from the test dataset
X_te.drop("pickup_yearday", axis=1, inplace=True)

***
## Functions Import

In [6]:
# Define a function that performs preprocessing steps to the selected dataset
def preprocess(data, categorical_cols, continuous_cols, transform_cols, polynome_deg=1):

    # Create a copy of the data frame
    df = data.copy()

    # One-hot encode categorical features
    df = pd.get_dummies(df, columns=categorical_cols, dummy_na=False)

    # Log-transform numerical variables
    for col in transform_cols:
        df[col] = np.log(df[col])
    
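    # Add polynomial features
    # Note: the range starts at 0, so a constant column (col**0) and a copy of the
    # original variable (col**1) are added alongside the higher powers
    for col in continuous_cols:
        if polynome_deg > 1:
            for poly in range(polynome_deg + 1):
                df["{}**{}".format(col, poly)] = df[col] ** poly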

    return df
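
**Notes:** `preprocess` calls `pd.get_dummies` on the train and test sets separately, so the two matrices only end up with the same columns if every category appears in both sets. That is the case here, since the shapes reported below match. The following is a minimal sketch, on toy data and with a hypothetical helper `align_columns` that is not part of the original notebook, of how a test matrix could be aligned to the train columns if a category were missing:

```python
import pandas as pd


def align_columns(train, test):
    """Reindex the test frame so that it has exactly the train columns.

    Hypothetical helper (not in the original notebook): columns missing
    from the test frame are created and filled with 0.
    """
    return test.reindex(columns=train.columns, fill_value=0)


# Toy data: the category 23 only appears in the train set
toy_tr = pd.DataFrame({"pickup_hour": [0, 12, 23], "trip_distance": [1.0, 2.5, 4.0]})
toy_te = pd.DataFrame({"pickup_hour": [0, 12], "trip_distance": [3.0, 0.8]})

# One-hot encode each set separately, as preprocess() does
toy_tr_enc = pd.get_dummies(toy_tr, columns=["pickup_hour"])
toy_te_enc = pd.get_dummies(toy_te, columns=["pickup_hour"])

# Align the test columns to the train columns
toy_te_aligned = align_columns(toy_tr_enc, toy_te_enc)

print("train:", toy_tr_enc.shape)
print("test before alignment:", toy_te_enc.shape)
print("test after alignment:", toy_te_aligned.shape)
```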

***
## Variable Import

In [7]:
# Define a list of categorical variables
categorical_cols = [
    "pickup_month",
    "pickup_week",
    "pickup_weekday",
    "pickup_weekday_type",
    "pickup_hour",
    "pickup_hour_type",
    "wf_avg_temp_lvl",
    "wf_prec_lvl",
    "wf_new_snow_lvl",
    "wf_snow_depth_lvl",
    "holiday_type",
    "holiday",
    "trip_within_borough",
    "tolls_amount_lvl",
]

# Define a list of continuous variables
continuous_cols = [
    "trip_distance",
    "tolls_amount",
    "wf_avg_temp",
    "wf_prec",
    "wf_new_snow",
    "wf_snow_depth",
]

***
## 3.3 Machine Learning Models: Model Training
## 3.3.1 Model Training: Linear Regression

## Goals:
Train and optimise linear regression models

## Code:

### Linear Regression: testing preprocessing performance impact

In [8]:
# Create a pipeline that performs standardization and fit the data to linear regression model
lr_model = make_pipeline(
    StandardScaler(),
    TransformedTargetRegressor(
        regressor=LinearRegression(), func=np.log, inverse_func=np.exp
    ),
)

# Fit and evaluate the pipeline to the train dataset
lr_model.fit(X_tr, y_tr)

# Predict the target variable of the test dataset
lr_y_pred = lr_model.predict(X_te)

print('The MSE of the linear regression model is: {:.1f}'.format(MSE(y_te, lr_y_pred)))
print('The MAE of the linear regression model is: {:.1f}'.format(MAE(y_te, lr_y_pred)))

The MSE of the linear regression model is: 114.9
The MAE of the linear regression model is: 3.1
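
**Notes:** In this notebook, `MAE` is an alias for scikit-learn's `median_absolute_error`, which is far less sensitive to a few very large errors than the MSE. This may explain why a large MSE can coexist with a small MAE. The toy values below are invented purely for illustration and are not taken from the taxi data:

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
)

# Invented example: four small errors and one very large error
y_true = np.array([10.0, 12.0, 8.0, 15.0, 60.0])
y_pred = np.array([11.0, 11.0, 9.0, 14.0, 20.0])

toy_mse = mean_squared_error(y_true, y_pred)
toy_mean_ae = mean_absolute_error(y_true, y_pred)
toy_median_ae = median_absolute_error(y_true, y_pred)

# The single outlier dominates the MSE but leaves the median untouched
print("MSE: {:.1f}".format(toy_mse))
print("Mean absolute error: {:.1f}".format(toy_mean_ae))
print("Median absolute error: {:.1f}".format(toy_median_ae))
```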


In [9]:
# Apply preprocessing to the train dataset
X_tr_p = preprocess(X_tr, categorical_cols, continuous_cols, ["trip_distance"])

# Apply preprocessing to the test dataset
X_te_p = preprocess(X_te, categorical_cols, continuous_cols, ["trip_distance"])

print("X_tr:", X_tr_p.shape)
print("X_te:", X_te_p.shape)

X_tr: (824654, 139)
X_te: (206156, 139)


In [10]:
# Create a pipeline that performs standardization and fit the data to linear regression model
lr_model = make_pipeline(
    StandardScaler(),
    TransformedTargetRegressor(
        regressor=LinearRegression(), func=np.log, inverse_func=np.exp
    ),
)

# Fit and evaluate the pipeline to the preprocessed train dataset
lr_model.fit(X_tr_p, y_tr)

# Predict the target variable of the preprocessed test dataset
lr_y_pred = lr_model.predict(X_te_p)

print('The MSE of the linear regression model is: {:.1f}'.format(MSE(y_te, lr_y_pred)))
print('The MAE of the linear regression model is: {:.1f}'.format(MAE(y_te, lr_y_pred)))

The MSE of the linear regression model is: 26.1
The MAE of the linear regression model is: 2.1


**Notes:** Log-transforming the trip distance and one-hot encoding the categorical features significantly improves the performance of the linear model: the MSE drops from 114.9 to 26.1 and the MAE from 3.1 to 2.1.
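
All pipelines in this section wrap the regressor in a `TransformedTargetRegressor` with `func=np.log` and `inverse_func=np.exp`. As a small sketch on synthetic data (the data and variable names below are assumptions made for illustration), this is equivalent to fitting the regressor on `log(y)` and exponentiating its predictions:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, strictly positive target that is linear in log space
rng = np.random.default_rng(0)
x = rng.uniform(0.5, 10.0, size=(200, 1))
y = np.exp(0.3 * x[:, 0] + 1.0 + rng.normal(scale=0.05, size=200))

# Wrapped model: the target is log-transformed before fitting and
# predictions are mapped back with exp
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
wrapped.fit(x, y)

# Manual equivalent: fit on log(y), then exponentiate the predictions
manual = LinearRegression().fit(x, np.log(y))
manual_pred = np.exp(manual.predict(x))

print("Same predictions:", np.allclose(wrapped.predict(x), manual_pred))
```

Because the predictions are returned on the original scale, the MSE and MAE reported in this notebook are expressed in the units of `trip_duration` rather than in log units.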

### Linear Regression: testing different linear models and hyperparameters

In [11]:
# Apply preprocessing to the train dataset
X_tr_p0 = preprocess(X_tr, categorical_cols, continuous_cols, ["trip_distance"])

# Apply preprocessing to the test dataset
X_te_p0 = preprocess(X_te, categorical_cols, continuous_cols, ["trip_distance"])

print("X_tr:", X_tr_p0.shape)
print("X_te:", X_te_p0.shape)

X_tr: (824654, 139)
X_te: (206156, 139)


In [12]:
# Create a pipeline that performs standardization and fit the data to linear regression model
lr_model = make_pipeline(
    StandardScaler(),
    TransformedTargetRegressor(
        regressor=LinearRegression(), func=np.log, inverse_func=np.exp
    ),
)

# Fit the pipeline to the preprocessed train dataset
lr_model.fit(X_tr_p0, y_tr)

# Predict the target variable of the preprocessed test dataset
lr_y_pred = lr_model.predict(X_te_p0)

print('The MSE of the linear regression model is: {:.1f}'.format(MSE(y_te, lr_y_pred)))
print('The MAE of the linear regression model is: {:.1f}'.format(MAE(y_te, lr_y_pred)))

The MSE of the linear regression model is: 26.1
The MAE of the linear regression model is: 2.1


In [13]:
# Create a pipeline that performs standardization and fit the data to Ridge regression model
rr_model = make_pipeline(
    StandardScaler(),
    TransformedTargetRegressor(regressor=Ridge(), func=np.log, inverse_func=np.exp),
)

# Define a set of hyperparameters to be tested during gridsearch
rr_model_params = {
    "transformedtargetregressor__regressor__alpha": np.logspace(-1, 4, num=10)
}

# Create a gridsearch object to find the optimum hyperparameters
rr_model_gs = GridSearchCV(
    rr_model,
    rr_model_params,
    cv=5,
    return_train_score=True,
    verbose=True,
    n_jobs=-1,
)

# Fit and evaluate the pipeline to the preprocessed train dataset
rr_model_gs.fit(X_tr_p0, y_tr)

# Predict the target variable of the preprocessed test dataset with the best hyperparameters 
rr_y_pred = rr_model_gs.predict(X_te_p0)

print('The MSE of the Ridge regression model is: {:.1f}'.format(MSE(y_te, rr_y_pred)))
print('The MAE of the Ridge regression model is: {:.1f}'.format(MAE(y_te, rr_y_pred)))

print("\n The best parameters across all searched params:\n",rr_model_gs.best_params_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
The MSE of the Ridge regression model is: 25.9
The MAE of the Ridge regression model is: 2.1

 The best parameters across all searched params:
 {'transformedtargetregressor__regressor__alpha': 10000.0}


In [14]:
# Create a pipeline that performs standardization and fit the data to SGD regression model
sr_model = make_pipeline(
    StandardScaler(),
    TransformedTargetRegressor(regressor=SGDRegressor(random_state=0), func=np.log, inverse_func=np.exp),
)

# Define a set of hyperparameters to be tested during gridsearch
sr_model_params = {
    "transformedtargetregressor__regressor__alpha": np.logspace(-3,-7, num= 5),
    "transformedtargetregressor__regressor__penalty": ['l2', 'l1', 'elasticnet'],
}

# Create a gridsearch object to find the optimum hyperparameters
sr_model_gs = GridSearchCV(
    sr_model,
    sr_model_params,
    cv=5,
    return_train_score=True,
    verbose=True,
    n_jobs=-1,
)

# Fit and evaluate the pipeline to the preprocessed train dataset
sr_model_gs.fit(X_tr_p0, y_tr)

# Predict the target variable of the preprocessed test dataset with the best hyperparameters 
sr_y_pred = sr_model_gs.predict(X_te_p0)

print('The MSE of the SGD regression model is: {:.1f}'.format(MSE(y_te, sr_y_pred)))
print('The MAE of the SGD regression model is: {:.1f}'.format(MAE(y_te, sr_y_pred)))

print("\n The best parameters across all searched params:\n",sr_model_gs.best_params_)

Fitting 5 folds for each of 15 candidates, totalling 75 fits
The MSE of the SGD regression model is: 26.5
The MAE of the SGD regression model is: 2.1

 The best parameters across all searched params:
 {'transformedtargetregressor__regressor__alpha': 0.001, 'transformedtargetregressor__regressor__penalty': 'l1'}


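**Notes:** All three linear regression models (ordinary least squares, Ridge, and SGD) show similar performance, with MSE values between 25.9 and 26.5. The best regularization strengths found for Ridge (alpha = 10000) and SGD (alpha = 0.001) are both the largest values in their respective grids, so wider grids may be worth exploring.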
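
To check how sensitive the score is to the regularization strength, and whether the selected value sits on the boundary of the grid, the cross-validation results can be tabulated from `cv_results_`. The sketch below uses a small synthetic dataset and a plain `Ridge` estimator. The explicit `scoring` argument is an addition for this example, since the grid searches above rely on the estimator's default score.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem (for illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 5))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=300)

# Same alpha grid as the Ridge search above
alphas = np.logspace(-1, 4, num=10)
toy_gs = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5, scoring="neg_mean_squared_error")
toy_gs.fit(X_toy, y_toy)

# Tabulate the mean and standard deviation of the validation score per alpha
cv_table = pd.DataFrame(toy_gs.cv_results_)[
    ["param_alpha", "mean_test_score", "std_test_score"]
]
print(cv_table.sort_values("mean_test_score", ascending=False).head())

# Flag a best value that lies on the edge of the search grid
best_alpha = toy_gs.best_params_["alpha"]
on_edge = bool(np.isclose(best_alpha, alphas.min()) or np.isclose(best_alpha, alphas.max()))
print("Best alpha:", best_alpha, "| on grid edge:", on_edge)
```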

### Linear Regression: testing polynomial features

In [15]:
# Apply preprocessing to the train dataset
X_tr_p1 = preprocess(X_tr, categorical_cols, continuous_cols, ["trip_distance"])

# Apply preprocessing to the test dataset
X_te_p1 = preprocess(X_te, categorical_cols, continuous_cols, ["trip_distance"])

print("X_tr:", X_tr_p1.shape)
print("X_te:", X_te_p1.shape)

X_tr: (824654, 139)
X_te: (206156, 139)


In [16]:
# Create a pipeline that performs standardization and fit the data to linear regression model
lr_model = make_pipeline(
    StandardScaler(),
    TransformedTargetRegressor(
        regressor=LinearRegression(), func=np.log, inverse_func=np.exp
    ),
)

# Fit and evaluate the pipeline to the preprocessed train dataset
lr_model.fit(X_tr_p1, y_tr)

# Predict the target variable of the preprocessed test dataset
lr_y_pred = lr_model.predict(X_te_p1)

print('The MSE of the linear regression model is: {:.1f}'.format(MSE(y_te, lr_y_pred)))
print('The MAE of the linear regression model is: {:.1f}'.format(MAE(y_te, lr_y_pred)))

The MSE of the linear regression model is: 26.1
The MAE of the linear regression model is: 2.1


In [17]:
# Apply preprocessing to the train dataset
X_tr_p2 = preprocess(X_tr, categorical_cols, continuous_cols, ["trip_distance"], 2)

# Apply preprocessing to the test dataset
X_te_p2 = preprocess(X_te, categorical_cols, continuous_cols, ["trip_distance"], 2)

print("X_tr:", X_tr_p2.shape)
print("X_te:", X_te_p2.shape)

X_tr: (824654, 157)
X_te: (206156, 157)


In [18]:
# Create a pipeline that performs standardization and fit the data to linear regression model
lr_model = make_pipeline(
    StandardScaler(),
    TransformedTargetRegressor(
        regressor=LinearRegression(), func=np.log, inverse_func=np.exp
    ),
)

# Fit the pipeline to the preprocessed train dataset
lr_model.fit(X_tr_p2, y_tr)

# Predict the target variable of the preprocessed test dataset
lr_y_pred = lr_model.predict(X_te_p2)

print('The MSE of the linear regression model is: {:.1f}'.format(MSE(y_te, lr_y_pred)))
print('The MAE of the linear regression model is: {:.1f}'.format(MAE(y_te, lr_y_pred)))

The MSE of the linear regression model is: 25.3
The MAE of the linear regression model is: 2.1


In [19]:
# Apply preprocessing to the train dataset
X_tr_p3 = preprocess(X_tr, categorical_cols, continuous_cols, ["trip_distance"], 3)

# Apply preprocessing to the test dataset
X_te_p3 = preprocess(X_te, categorical_cols, continuous_cols, ["trip_distance"], 3)

print("X_tr:", X_tr_p3.shape)
print("X_te:", X_te_p3.shape)

X_tr: (824654, 163)
X_te: (206156, 163)


In [20]:
# Create a pipeline that performs standardization and fit the data to linear regression model
lr_model = make_pipeline(
    StandardScaler(),
    TransformedTargetRegressor(
        regressor=LinearRegression(), func=np.log, inverse_func=np.exp
    ),
)

# Fit and evaluate the pipeline to the preprocessed train dataset
lr_model.fit(X_tr_p3, y_tr)

# Predict the target variable of the preprocessed test dataset
lr_y_pred = lr_model.predict(X_te_p3)

print('The MSE of the linear regression model is: {:.1f}'.format(MSE(y_te, lr_y_pred)))
print('The MAE of the linear regression model is: {:.1f}'.format(MAE(y_te, lr_y_pred)))

The MSE of the linear regression model is: 25.0
The MAE of the linear regression model is: 2.1


**Notes:** Polynomial features slightly improve the performance of the linear regression model (MSE of 25.3 with degree two versus 26.1 without). Increasing the degree to three brings only a marginal further improvement (MSE of 25.0).
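
The reported shapes (157 columns for degree two and 163 for degree three, up from 139) are consistent with the loop in `preprocess` adding the powers 0 through `polynome_deg` for each of the six continuous variables, that is, a constant column and a copy of the original variable in addition to the higher powers. The sketch below reproduces this on a toy data frame. The helper `add_powers` and its `start` parameter are hypothetical and only serve to contrast the two behaviours.

```python
import pandas as pd

toy = pd.DataFrame({"trip_distance": [1.0, 2.0, 3.0]})


def add_powers(df, cols, degree, start=0):
    """Add the columns col**start ... col**degree for each column in cols.

    Hypothetical helper: start=0 mimics the loop in preprocess(),
    start=2 only adds the genuinely new powers.
    """
    out = df.copy()
    for col in cols:
        for power in range(start, degree + 1):
            out["{}**{}".format(col, power)] = out[col] ** power
    return out


# Same behaviour as preprocess(): powers 0, 1 and 2 are added
as_in_notebook = add_powers(toy, ["trip_distance"], degree=2, start=0)

# Alternative: skip the constant and the duplicated column
only_new_powers = add_powers(toy, ["trip_distance"], degree=2, start=2)

print(list(as_in_notebook.columns))
print(list(only_new_powers.columns))
```

Starting at power 2 would give 6 extra columns per degree instead of 18 for degree two, which changes the shapes reported above. Whether it changes the scores has not been tested here.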

### Linear Regression: testing different feature spaces

In [21]:
# Apply preprocessing to the train dataset
X_tr_p4 = preprocess(X_tr, categorical_cols, continuous_cols, ["trip_distance"], 2)

# Apply preprocessing to the test dataset
X_te_p4 = preprocess(X_te, categorical_cols, continuous_cols, ["trip_distance"], 2)

print("X_tr:", X_tr_p4.shape)
print("X_te:", X_te_p4.shape)

X_tr: (824654, 157)
X_te: (206156, 157)


In [22]:
# Create a subset of the train matrix without holiday data
X_tr_p4_sub1 = X_tr_p4.drop(columns=[col for col in X_tr_p4.columns if "holiday" in col])

# Create a subset of the test matrix without holiday data
X_te_p4_sub1 = X_te_p4.drop(columns=[col for col in X_te_p4.columns if "holiday" in col])

print("X_tr:", X_tr_p4_sub1.shape)
print("X_te:", X_te_p4_sub1.shape)

X_tr: (824654, 151)
X_te: (206156, 151)


In [23]:
# Create a pipeline that performs standardization and fit the data to linear regression model
lr_model = make_pipeline(
    StandardScaler(),
    TransformedTargetRegressor(
        regressor=LinearRegression(), func=np.log, inverse_func=np.exp
    ),
)

# Fit and evaluate the pipeline to the train dataset
lr_model.fit(X_tr_p4_sub1, y_tr)

# Predict the target variable of the test dataset
lr_y_pred = lr_model.predict(X_te_p4_sub1)

print('The MSE of the linear regression model is: {:.1f}'.format(MSE(y_te, lr_y_pred)))
print('The MAE of the linear regression model is: {:.1f}'.format(MAE(y_te, lr_y_pred)))

The MSE of the linear regression model is: 25.4
The MAE of the linear regression model is: 2.1


In [24]:
# Create a subset of the train matrix without weather forecast data
X_tr_p4_sub2 = X_tr_p4.drop(columns=[col for col in X_tr_p4.columns if "wf" in col])

# Create a subset of the test matrix without weather forecast data
X_te_p4_sub2 = X_te_p4.drop(columns=[col for col in X_te_p4.columns if "wf" in col])

print("X_tr:", X_tr_p4_sub2.shape)
print("X_te:", X_te_p4_sub2.shape)

X_tr: (824654, 126)
X_te: (206156, 126)


In [25]:
# Create a pipeline that performs standardization and fit the data to linear regression model
lr_model = make_pipeline(
    StandardScaler(),
    TransformedTargetRegressor(
        regressor=LinearRegression(), func=np.log, inverse_func=np.exp
    ),
)

# Fit and evaluate the pipeline to the train dataset
lr_model.fit(X_tr_p4_sub2, y_tr)

# Predict the target variable of the test dataset
lr_y_pred = lr_model.predict(X_te_p4_sub2)

print('The MSE of the linear regression model is: {:.1f}'.format(MSE(y_te, lr_y_pred)))
print('The MAE of the linear regression model is: {:.1f}'.format(MAE(y_te, lr_y_pred)))

The MSE of the linear regression model is: 25.4
The MAE of the linear regression model is: 2.1


In [26]:
# Create a subset of the train matrix without holiday and weather forecast data
X_tr_p4_sub3 = X_tr_p4.drop(columns=[col for col in X_tr_p4.columns if "wf" in col or "holiday" in col])

# Create a subset of the test matrix without holiday and weather forecast data
X_te_p4_sub3 = X_te_p4.drop(columns=[col for col in X_te_p4.columns if "wf" in col or "holiday" in col])

print("X_tr:", X_tr_p4_sub3.shape)
print("X_te:", X_te_p4_sub3.shape)

X_tr: (824654, 120)
X_te: (206156, 120)


In [27]:
# Create a pipeline that performs standardization and fit the data to linear regression model
lr_model = make_pipeline(
    StandardScaler(),
    TransformedTargetRegressor(
        regressor=LinearRegression(), func=np.log, inverse_func=np.exp
    ),
)

# Fit and evaluate the pipeline to the train dataset
lr_model.fit(X_tr_p4_sub3, y_tr)

# Predict the target variable of the test dataset
lr_y_pred = lr_model.predict(X_te_p4_sub3)

print('The MSE of the linear regression model is: {:.1f}'.format(MSE(y_te, lr_y_pred)))
print('The MAE of the linear regression model is: {:.1f}'.format(MAE(y_te, lr_y_pred)))

The MSE of the linear regression model is: 25.5
The MAE of the linear regression model is: 2.1


**Notes:** Removing the holiday and weather forecast variables has almost no effect on performance (MSE of 25.5 versus 25.3 with all features), while shrinking the feature space from 157 to 120 columns. A smaller feature space reduces computation time and can also result in a more interpretable model.
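
The four comparisons above repeat the same steps with a different set of dropped columns. As a sketch of how this could be written as a single loop, the example below uses a synthetic data frame and a hypothetical helper `drop_by_keyword`. The column names merely imitate the `wf` and `holiday` naming used in this notebook, and the scores are not comparable to the results above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def drop_by_keyword(df, keywords):
    """Drop every column whose name contains one of the keywords (hypothetical helper)."""
    to_drop = [col for col in df.columns if any(key in col for key in keywords)]
    return df.drop(columns=to_drop)


# Synthetic data: only trip_distance carries signal
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "trip_distance": rng.uniform(0.5, 10.0, n),
    "wf_avg_temp": rng.normal(55.0, 15.0, n),
    "holiday_1": rng.integers(0, 2, n),
})
target = 3.0 * toy["trip_distance"] + rng.normal(0.0, 1.0, n)

# Feature spaces to compare, defined by the keywords to remove
subsets = {
    "all": [],
    "no_holiday": ["holiday"],
    "no_wf": ["wf"],
    "no_wf_no_holiday": ["wf", "holiday"],
}

scores = {}
for name, keywords in subsets.items():
    X_sub = drop_by_keyword(toy, keywords)
    # Simple split: first 400 rows for training, last 100 for testing
    model = LinearRegression().fit(X_sub.iloc[:400], target.iloc[:400])
    scores[name] = mean_squared_error(target.iloc[400:], model.predict(X_sub.iloc[400:]))
    print("{}: {} columns, MSE = {:.3f}".format(name, X_sub.shape[1], scores[name]))
```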

### Linear Regression: testing dimensionality reduction with PCA

In [28]:
# Create a pipeline that performs standardization and fit the data to linear regression model
lr_model = make_pipeline(
    StandardScaler(),
    PCA(n_components = 0.95),
    TransformedTargetRegressor(
        regressor=LinearRegression(), func=np.log, inverse_func=np.exp
    ),
)

# Fit and evaluate the pipeline to the train dataset
lr_model.fit(X_tr_p4, y_tr)

# Predict the target variable of the test dataset
lr_y_pred = lr_model.predict(X_te_p4)

print('The MSE of the linear regression model is: {:.1f}'.format(MSE(y_te, lr_y_pred)))
print('The MAE of the linear regression model is: {:.1f}'.format(MAE(y_te, lr_y_pred)))

The MSE of the linear regression model is: 34.2
The MAE of the linear regression model is: 2.2


In [29]:
# Create a pipeline that performs standardization and fit the data to linear regression model
lr_model = make_pipeline(
    StandardScaler(),
    PCA(n_components = 0.99),
    TransformedTargetRegressor(
        regressor=LinearRegression(), func=np.log, inverse_func=np.exp
    ),
)

# Fit and evaluate the pipeline to the train dataset
lr_model.fit(X_tr_p4, y_tr)

# Predict the target variable of the test dataset
lr_y_pred = lr_model.predict(X_te_p4)

print('The MSE of the linear regression model is: {:.1f}'.format(MSE(y_te, lr_y_pred)))
print('The MAE of the linear regression model is: {:.1f}'.format(MAE(y_te, lr_y_pred)))

The MSE of the linear regression model is: 31.0
The MAE of the linear regression model is: 2.2


In [30]:
# Create a pipeline that performs standardization and fit the data to linear regression model
lr_model = make_pipeline(
    StandardScaler(),
    PCA(n_components = 0.95),
    TransformedTargetRegressor(
        regressor=LinearRegression(), func=np.log, inverse_func=np.exp
    ),
)

# Fit and evaluate the pipeline to the train dataset
lr_model.fit(X_tr_p4_sub3, y_tr)

# Predict the target variable of the test dataset
lr_y_pred = lr_model.predict(X_te_p4_sub3)

print('The MSE of the linear regression model is: {:.1f}'.format(MSE(y_te, lr_y_pred)))
print('The MAE of the linear regression model is: {:.1f}'.format(MAE(y_te, lr_y_pred)))

The MSE of the linear regression model is: 35.9
The MAE of the linear regression model is: 2.3


In [31]:
# Create a pipeline that performs standardization and fit the data to linear regression model
lr_model = make_pipeline(
    StandardScaler(),
    PCA(n_components = 0.99),
    TransformedTargetRegressor(
        regressor=LinearRegression(), func=np.log, inverse_func=np.exp
    ),
)

# Fit and evaluate the pipeline to the train dataset
lr_model.fit(X_tr_p4_sub3, y_tr)

# Predict the target variable of the test dataset
lr_y_pred = lr_model.predict(X_te_p4_sub3)

print('The MSE of the linear regression model is: {:.1f}'.format(MSE(y_te, lr_y_pred)))
print('The MAE of the linear regression model is: {:.1f}'.format(MAE(y_te, lr_y_pred)))

The MSE of the linear regression model is: 31.9
The MAE of the linear regression model is: 2.2


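**Notes:** Reducing the feature space with principal component analysis (PCA) worsens the performance of the models: the MSE rises from roughly 25 to between 31 and 36, depending on the retained variance and the feature subset.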
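
When `n_components` is a float between 0 and 1, `PCA` keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The components are chosen from the variance of the features alone, without looking at the target, which is one possible (unverified) reason for the loss in performance observed here. A minimal sketch on synthetic correlated data, showing how to inspect the number of components retained for each threshold:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic matrix: 10 correlated columns driven by 3 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_toy = latent @ mixing + 0.1 * rng.normal(size=(500, 10))

# Standardize first, as in the pipelines above
X_std = StandardScaler().fit_transform(X_toy)

kept = {}
explained = {}
for threshold in (0.95, 0.99):
    pca = PCA(n_components=threshold).fit(X_std)
    kept[threshold] = pca.n_components_
    explained[threshold] = pca.explained_variance_ratio_.sum()
    print(
        "threshold={}: {} components, {:.3f} of variance explained".format(
            threshold, kept[threshold], explained[threshold]
        )
    )
```

In the notebook, the same attributes could be read from the fitted pipeline, for example through `lr_model.named_steps["pca"]`, since `make_pipeline` names each step after its lowercased class name.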

### Linear Regression: final result

In [32]:
# Apply preprocessing to the train dataset
X_tr_p5 = preprocess(X_tr, categorical_cols, continuous_cols, ["trip_distance"], 2)

# Apply preprocessing to the test dataset
X_te_p5 = preprocess(X_te, categorical_cols, continuous_cols, ["trip_distance"], 2)

print("X_tr:", X_tr_p5.shape)
print("X_te:", X_te_p5.shape)

X_tr: (824654, 157)
X_te: (206156, 157)


In [34]:
# Create a subset of the train matrix without holiday and weather forecast data
X_tr_p5_sub1 = X_tr_p5.drop(columns=[col for col in X_tr_p5.columns if "wf" in col or "holiday" in col])

# Create a subset of the test matrix without holiday and weather forecast data
X_te_p5_sub1 = X_te_p5.drop(columns=[col for col in X_te_p5.columns if "wf" in col or "holiday" in col])

print("X_tr:", X_tr_p5_sub1.shape)
print("X_te:", X_te_p5_sub1.shape)  

X_tr: (824654, 120)
X_te: (206156, 120)


In [35]:
# Create a pipeline that performs standardization and fits the data to a linear regression model
lr_model = make_pipeline(
    StandardScaler(),
    TransformedTargetRegressor(
        regressor=LinearRegression(),
        func=np.log,
        inverse_func=np.exp,
    ),
)

# Fit and evaluate the pipeline to the preprocessed train dataset
lr_model.fit(X_tr_p5_sub1, y_tr)

# Predict the target variable of the preprocessed test dataset
lr_y_pred = lr_model.predict(X_te_p5_sub1)

print("The MSE of the linear regression model is: {:.1f}".format(MSE(y_te, lr_y_pred)))
print("The MAE of the linear regression model is: {:.1f}".format(MAE(y_te, lr_y_pred)))

The MSE of the linear regression model is: 25.5
The MAE of the linear regression model is: 2.1


***
## Export Files

In [36]:
# Create a new data frame containing the MSE and MAE of the final linear regression model
lr_results = pd.DataFrame({
    'model':['lr'],
    'mse':[MSE(y_te, lr_y_pred)],
    'mae':[MAE(y_te, lr_y_pred)],
})

# Save the results in the project folder
lr_results.to_csv('results/lr_model.csv', index=False)