# Mini Project: Build a Machine Learning Model

## Predict Total Fare on the NYC Taxi Dataset

Welcome to the NYC Taxi Fare Prediction project! In this Colab, we will continue using the NYC Taxi Dataset to predict the fare amount for taxi rides using a subset of available features. We will go through three main stages: building a baseline model, creating a full model, and performing hyperparameter tuning to enhance our predictions.

Now that you've completed exploratory data analysis on this dataset, you should have a good understanding of the feature space.

## Project Objectives

The primary objectives of this project are as follows:

1. **Baseline Model**: We will start by building a simple baseline model to establish a benchmark for our predictions. This model will serve as a starting point to compare the performance of our subsequent models.

2. **Full Model**: Next, we will develop a more comprehensive model that leverages machine learning techniques to improve prediction accuracy. We will use Scikit-Learn's model pipeline to build a framework that enables rapid experimentation.

3. **Hyperparameter Tuning**: Lastly, we will optimize our full model by fine-tuning its hyperparameters. By systematically adjusting the parameters that control model behavior, we aim to achieve the best possible performance for our prediction task.

In [1]:
# Import the required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

Load the NYC taxi dataset into a Pandas DataFrame and do a few basic checks to ensure the data is loaded properly. Note, there are several months of data that can be used. For simplicity, use the Yellow Taxi 2022-01 parquet file [here](https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet). Here are your tasks:

  1. Load the `yellow_tripdata_2022-01.parquet` file into Pandas.
  2. Print the first 5 rows of data.
  3. Drop any rows of data that contain NULL values.
  4. Create a new feature, 'trip_duration' that captures the duration of the trip in minutes.
  5. Create a variable named 'target_variable' to store the name of the column we're trying to predict, 'total_amount'.
  6. Create a list called 'feature_cols' containing the feature names that we'll be using to predict our target variable. The list should contain 'VendorID', 'trip_distance', 'payment_type', 'PULocationID', 'DOLocationID', and 'trip_duration'.
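As a rough sketch of tasks 3 to 6, the steps can be tried on a tiny hand-made DataFrame before running them on the full file. The frame `toy` is hypothetical: its rows are adapted from the first rows of the dataset, and the missing `payment_type` in the last row is an assumption added only to exercise `dropna`.

```python
import pandas as pd

# Hypothetical stand-in for the trip data (three rows, a subset of columns).
toy = pd.DataFrame({
    "VendorID": [1, 1, 2],
    "tpep_pickup_datetime": pd.to_datetime(
        ["2022-01-01 00:35:40", "2022-01-01 00:33:43", "2022-01-01 00:53:21"]),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2022-01-01 00:53:29", "2022-01-01 00:42:07", "2022-01-01 01:02:19"]),
    "trip_distance": [3.8, 2.1, 0.97],
    "payment_type": [1, 1, None],  # missing value added for illustration
    "PULocationID": [142, 236, 166],
    "DOLocationID": [236, 42, 166],
    "total_amount": [21.95, 13.3, 10.56],
})

# Task 3: drop rows with missing values; .copy() avoids chained-assignment warnings.
toy = toy.dropna().copy()

# Task 4: trip duration in minutes from the two datetime columns.
toy["trip_duration"] = (
    toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]
).dt.total_seconds() / 60

# Tasks 5 and 6: the target name and the list of feature names.
target_variable = "total_amount"
feature_cols = ["VendorID", "trip_distance", "payment_type",
                "PULocationID", "DOLocationID", "trip_duration"]

X = toy[feature_cols]
y = toy[target_variable]
print(X.shape, y.shape)
```

Keeping `target_variable` as a column name (rather than the column itself) makes it easy to select `X` and `y` from the same DataFrame without the target ending up among the features.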

In [2]:
# Load the dataset into a pandas DataFrame (from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
file_path = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet'
trip_data = pd.read_parquet(file_path)

In [3]:
# Display the first few rows of the dataset
trip_data.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.8,1.0,N,142,236,1,14.5,3.0,0.5,3.65,0.0,0.3,21.95,2.5,0.0
1,1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.1,1.0,N,236,42,1,8.0,0.5,0.5,4.0,0.0,0.3,13.3,0.0,0.0
2,2,2022-01-01 00:53:21,2022-01-01 01:02:19,1.0,0.97,1.0,N,166,166,1,7.5,0.5,0.5,1.76,0.0,0.3,10.56,0.0,0.0
3,2,2022-01-01 00:25:21,2022-01-01 00:35:23,1.0,1.09,1.0,N,114,68,2,8.0,0.5,0.5,0.0,0.0,0.3,11.8,2.5,0.0
4,2,2022-01-01 00:36:48,2022-01-01 01:14:20,1.0,4.3,1.0,N,68,163,1,23.5,0.5,0.5,3.0,0.0,0.3,30.3,2.5,0.0


In [18]:
# Print the size of the loaded trip data
trip_data.shape

(2463931, 19)

In [19]:
# Find the number of missing values/rows (i.e., NaN or None)  
trip_data.isna().sum()

VendorID                     0
tpep_pickup_datetime         0
tpep_dropoff_datetime        0
passenger_count          71503
trip_distance                0
RatecodeID               71503
store_and_fwd_flag       71503
PULocationID                 0
DOLocationID                 0
payment_type                 0
fare_amount                  0
extra                        0
mta_tax                      0
tip_amount                   0
tolls_amount                 0
improvement_surcharge        0
total_amount                 0
congestion_surcharge     71503
airport_fee              71503
dtype: int64

In [4]:
# Drop rows with missing values.
clean_trip_data = trip_data.dropna(axis=0, how='any')
# Verify that no NA values remain after the drop
clean_trip_data.isna().sum()

VendorID                 0
tpep_pickup_datetime     0
tpep_dropoff_datetime    0
passenger_count          0
trip_distance            0
RatecodeID               0
store_and_fwd_flag       0
PULocationID             0
DOLocationID             0
payment_type             0
fare_amount              0
extra                    0
mta_tax                  0
tip_amount               0
tolls_amount             0
improvement_surcharge    0
total_amount             0
congestion_surcharge     0
airport_fee              0
dtype: int64

In [21]:
# Check whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike).
clean_trip_data.isnull().sum()

VendorID                 0
tpep_pickup_datetime     0
tpep_dropoff_datetime    0
passenger_count          0
trip_distance            0
RatecodeID               0
store_and_fwd_flag       0
PULocationID             0
DOLocationID             0
payment_type             0
fare_amount              0
extra                    0
mta_tax                  0
tip_amount               0
tolls_amount             0
improvement_surcharge    0
total_amount             0
congestion_surcharge     0
airport_fee              0
dtype: int64

In [5]:
# Print any rows that still contain NA values (an empty result means none remain)
mask = clean_trip_data.isna().any(axis=1)  # boolean mask of rows with at least one missing value
clean_trip_data[mask]  # select the rows flagged by the mask

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee


In [23]:
# Print the columns/features of the trip data
clean_trip_data.columns.to_list()

['VendorID',
 'tpep_pickup_datetime',
 'tpep_dropoff_datetime',
 'passenger_count',
 'trip_distance',
 'RatecodeID',
 'store_and_fwd_flag',
 'PULocationID',
 'DOLocationID',
 'payment_type',
 'fare_amount',
 'extra',
 'mta_tax',
 'tip_amount',
 'tolls_amount',
 'improvement_surcharge',
 'total_amount',
 'congestion_surcharge',
 'airport_fee']

In [6]:
# Create the new feature 'trip_duration' (in minutes).
# 'tpep_pickup_datetime' and 'tpep_dropoff_datetime' are datetime columns holding the pickup and drop-off times,
# so their difference is the trip duration.
# Work on an explicit copy so the assignment below does not trigger a SettingWithCopyWarning.
clean_trip_data = clean_trip_data.copy()
clean_trip_data['trip_duration'] = (clean_trip_data['tpep_dropoff_datetime'] - clean_trip_data['tpep_pickup_datetime']).dt.total_seconds() / 60
print(clean_trip_data.columns)


Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'airport_fee', 'trip_duration'],
      dtype='object')


In [25]:
# Print the clean trip data (with trip duration)
clean_trip_data.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,trip_duration
0,1,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.8,1.0,N,142,236,1,14.5,3.0,0.5,3.65,0.0,0.3,21.95,2.5,0.0,17.816667
1,1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.1,1.0,N,236,42,1,8.0,0.5,0.5,4.0,0.0,0.3,13.3,0.0,0.0,8.4
2,2,2022-01-01 00:53:21,2022-01-01 01:02:19,1.0,0.97,1.0,N,166,166,1,7.5,0.5,0.5,1.76,0.0,0.3,10.56,0.0,0.0,8.966667
3,2,2022-01-01 00:25:21,2022-01-01 00:35:23,1.0,1.09,1.0,N,114,68,2,8.0,0.5,0.5,0.0,0.0,0.3,11.8,2.5,0.0,10.033333
4,2,2022-01-01 00:36:48,2022-01-01 01:14:20,1.0,4.3,1.0,N,68,163,1,23.5,0.5,0.5,3.0,0.0,0.3,30.3,2.5,0.0,37.533333


In [26]:
# Check if there are NA values in the data set
clean_trip_data.isna().sum()

VendorID                 0
tpep_pickup_datetime     0
tpep_dropoff_datetime    0
passenger_count          0
trip_distance            0
RatecodeID               0
store_and_fwd_flag       0
PULocationID             0
DOLocationID             0
payment_type             0
fare_amount              0
extra                    0
mta_tax                  0
tip_amount               0
tolls_amount             0
improvement_surcharge    0
total_amount             0
congestion_surcharge     0
airport_fee              0
trip_duration            0
dtype: int64

In [14]:
# Drop the pickup and drop off columns from the table
df = clean_trip_data.drop(columns=['tpep_dropoff_datetime', 'tpep_pickup_datetime'], inplace=False)
# Inspect the unique values in the column 'store_and_fwd_flag' and cross-check them against https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
df['store_and_fwd_flag'].unique().tolist()

['N', 'Y']

In [28]:
# Count the rows that contain any NA value
df.isna().any(axis=1).sum()

0

In [15]:
# Create a list called feature_col to store all column names (this still includes the target, 'total_amount')
feature_col = df.columns.to_list()
print(feature_col)

['VendorID', 'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag', 'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount', 'congestion_surcharge', 'airport_fee', 'trip_duration']


In [16]:
# Convert the categorical variable 'store_and_fwd_flag' into an integer dummy/indicator variable.
df = pd.get_dummies(df, columns=['store_and_fwd_flag'], drop_first=True, dtype=int, prefix='store_and_fwd_flag',  prefix_sep='_int') 
# Verify the dummy variable is created 
df.columns

Index(['VendorID', 'passenger_count', 'trip_distance', 'RatecodeID',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'airport_fee', 'trip_duration',
       'store_and_fwd_flag_intY'],
      dtype='object')

In [17]:
# List the columns with numeric data types
df.select_dtypes(include='number').columns.tolist()

['VendorID',
 'passenger_count',
 'trip_distance',
 'RatecodeID',
 'PULocationID',
 'DOLocationID',
 'payment_type',
 'fare_amount',
 'extra',
 'mta_tax',
 'tip_amount',
 'tolls_amount',
 'improvement_surcharge',
 'total_amount',
 'congestion_surcharge',
 'airport_fee',
 'trip_duration',
 'store_and_fwd_flag_intY']

In [32]:
# See the data type of each column 
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2392428 entries, 0 to 2392427
Data columns (total 18 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   VendorID                 int64  
 1   passenger_count          float64
 2   trip_distance            float64
 3   RatecodeID               float64
 4   PULocationID             int64  
 5   DOLocationID             int64  
 6   payment_type             int64  
 7   fare_amount              float64
 8   extra                    float64
 9   mta_tax                  float64
 10  tip_amount               float64
 11  tolls_amount             float64
 12  improvement_surcharge    float64
 13  total_amount             float64
 14  congestion_surcharge     float64
 15  airport_fee              float64
 16  trip_duration            float64
 17  store_and_fwd_flag_intY  int32  
dtypes: float64(13), int32(1), int64(4)
memory usage: 337.7 MB


In [33]:
# Print descriptive statistics (summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values)
df.describe()

Unnamed: 0,VendorID,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,trip_duration,store_and_fwd_flag_intY
count,2392428.0,2392428.0,2392428.0,2392428.0,2392428.0,2392428.0,2392428.0,2392428.0,2392428.0,2392428.0,2392428.0,2392428.0,2392428.0,2392428.0,2392428.0,2392428.0,2392428.0,2392428.0
mean,1.697032,1.389453,3.099698,1.415507,165.9911,163.7749,1.230148,12.80723,1.034301,0.4913711,2.368683,0.3748284,0.2966534,19.02453,2.282322,0.08249935,14.14116,0.02296077
std,0.4595418,0.9829686,4.308517,5.917573,65.15313,70.7153,0.4623205,259.5991,1.243128,0.08361189,2.836618,1.675193,0.04430015,259.7478,0.743204,0.3125554,46.79586,0.1497785
min,1.0,0.0,0.0,1.0,1.0,1.0,1.0,-480.0,-4.5,-0.5,-125.22,-31.4,-0.3,-480.3,-2.5,-1.25,-3442.4,0.0
25%,1.0,1.0,1.03,1.0,132.0,113.0,1.0,6.5,0.0,0.5,0.49,0.0,0.3,11.3,2.5,0.0,6.266667,0.0
50%,2.0,1.0,1.71,1.0,162.0,162.0,1.0,9.0,0.5,0.5,2.0,0.0,0.3,14.3,2.5,0.0,10.08333,0.0
75%,2.0,1.0,3.1,1.0,234.0,236.0,1.0,13.5,2.5,0.5,3.0,0.0,0.3,19.75,2.5,0.0,16.03333,0.0
max,2.0,9.0,651.0,99.0,265.0,265.0,5.0,401092.3,33.5,16.59,888.88,193.3,0.3,401095.6,2.5,1.25,8513.183,1.0


In [18]:
# Check how many rows have value 1 (Y) for 'store_and_fwd_flag' in original data set
(df['store_and_fwd_flag_intY'] == 1).sum()

54932

Splitting a dataset into training and test sets is a crucial step in machine learning model development. It allows us to evaluate the performance and generalization ability of our models accurately. The training set is used to train the model, while the test set serves as an independent sample for evaluating its performance.

1. **Model Training**: The training set is used to fit the model, allowing it to learn the underlying patterns and relationships between the features and the target variable. By exposing the model to a diverse range of examples in the training set, it can capture the underlying structure of the data.

2. **Model Evaluation**: The test set, which is independent of the training set, is crucial for evaluating how well the trained model generalizes to unseen data. It provides an unbiased assessment of the model's performance on new instances. By measuring an evaluation metric on the test set (for a regression task like this one, mean absolute error; for classification, accuracy, precision, or recall), we can estimate how well the model will perform on unseen data.

3. **Preventing Overfitting**: Overfitting occurs when a model learns the training data's noise and idiosyncrasies instead of the underlying patterns. By evaluating the model on the test set, we can identify if the model is overfitting. If the model performs significantly worse on the test set compared to the training set, it indicates overfitting. In such cases, we might need to adjust the model, feature selection, or regularization techniques to improve generalization.

4. **Hyperparameter Tuning**: Splitting the dataset allows us to perform hyperparameter tuning on the model. Hyperparameters are configuration settings that control the learning process, such as learning rate, regularization strength, or the number of hidden layers in a neural network. By using a validation set (often created from a portion of the training set), we can iteratively adjust the hyperparameters and select the best combination that maximizes the model's performance on the validation set. The final evaluation on the test set provides an unbiased estimate of the model's performance.

By splitting the dataset into training and test sets, we can ensure that our models are both well-trained and accurately evaluated. This separation helps us understand how the model will perform on new, unseen data, which is critical for assessing its effectiveness and making informed decisions about its deployment.
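The points above can be sketched on synthetic data. This is a minimal illustration, not the project solution: the data, the `Ridge` estimator, and the `alpha` grid are assumptions chosen to keep the example small. The test set is held out first, `GridSearchCV` creates the validation folds from the training portion, and the test set is used only once at the end.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=500)

# Hold out 20% as the test set; fix random_state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Tune a hyperparameter with 5-fold cross-validation on the training data only.
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_train, y_train)

# A large gap between these two numbers would suggest overfitting.
train_mae = mean_absolute_error(y_train, search.predict(X_train))
test_mae = mean_absolute_error(y_test, search.predict(X_test))
print(search.best_params_, train_mae, test_mae)
```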

Here is your task:

  1. Use Scikit-Learn's [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split the data into training and test sets. Don't forget to set the random state.

In [22]:
# List the numeric columns
df.select_dtypes(include='number').columns.tolist()

['VendorID',
 'passenger_count',
 'trip_distance',
 'RatecodeID',
 'PULocationID',
 'DOLocationID',
 'payment_type',
 'fare_amount',
 'extra',
 'mta_tax',
 'tip_amount',
 'tolls_amount',
 'improvement_surcharge',
 'total_amount',
 'congestion_surcharge',
 'airport_fee',
 'trip_duration',
 'store_and_fwd_flag_intY']

In [19]:
# Assign the 'total_amount' as the target_variable
target_variable = df['total_amount']
# Drop the 'total_amount' from the data set 
df.drop(columns=['total_amount'], inplace=True)
df.head()

Unnamed: 0,VendorID,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,congestion_surcharge,airport_fee,trip_duration,store_and_fwd_flag_intY
0,1,2.0,3.8,1.0,142,236,1,14.5,3.0,0.5,3.65,0.0,0.3,2.5,0.0,17.816667,0
1,1,1.0,2.1,1.0,236,42,1,8.0,0.5,0.5,4.0,0.0,0.3,0.0,0.0,8.4,0
2,2,1.0,0.97,1.0,166,166,1,7.5,0.5,0.5,1.76,0.0,0.3,0.0,0.0,8.966667,0
3,2,1.0,1.09,1.0,114,68,2,8.0,0.5,0.5,0.0,0.0,0.3,2.5,0.0,10.033333,0
4,2,1.0,4.3,1.0,68,163,1,23.5,0.5,0.5,3.0,0.0,0.3,2.5,0.0,37.533333,0


In [24]:
# Display the target variable
target_variable

0          21.95
1          13.30
2          10.56
3          11.80
4          30.30
           ...  
2392423    11.30
2392424    11.16
2392425    14.75
2392426    13.56
2392427    20.76
Name: total_amount, Length: 2392428, dtype: float64

In [20]:
# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df, target_variable, test_size=0.2, random_state=42)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(1913942, 17) (1913942,) (478486, 17) (478486,)


The importance of a baseline model, even if it uses a simple strategy like always predicting the mean, cannot be overstated. Here's why a baseline model is valuable:

1. **Performance Comparison**: A baseline model serves as a reference point for evaluating the performance of more sophisticated models. By establishing a simple yet reasonable baseline, we can determine whether our advanced models offer any significant improvement over this basic approach. It helps us set realistic expectations and gauge the effectiveness of our efforts.

2. **Model Complexity**: A baseline model provides insight into the complexity required to solve the prediction task. If a simple strategy like predicting the mean performs reasonably well, it suggests that the problem might not necessitate complex modeling techniques. Conversely, if the baseline model performs poorly, it indicates the presence of more intricate patterns that need to be captured by more sophisticated models.

3. **Minimum Performance Requirement**: A baseline model can establish a minimum performance requirement for a predictive task. If we cannot outperform the baseline, it suggests that our models have failed to capture even the most fundamental relationships within the data. In such cases, we may need to revisit our data preprocessing steps, feature engineering techniques, or consider other external factors affecting the task.

4. **Identifying Data Issues**: A baseline model can help identify potential issues within the dataset. If the baseline model performs poorly, it may indicate problems like missing values, outliers, or data inconsistencies. These issues can be further investigated and resolved to improve the overall model performance.

While a baseline model like always predicting the mean may not offer the highest prediction accuracy, its importance lies in its role as a starting point for model development and evaluation. It provides a solid foundation for comparing and assessing the performance of more complex models, ensuring that any improvements made are meaningful and significant.

Here is your task:

  1. Create a model that always predicts the mean total fare of the training dataset. Use Scikit-Learn's [mean_absolute_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html) to evaluate this model. Is it any good?
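One way to build such a mean predictor is Scikit-Learn's `DummyRegressor` with `strategy="mean"`, which ignores the features and always returns the mean of the training targets. The sketch below uses a handful of made-up fare values rather than the taxi data, so the numbers are purely illustrative.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Made-up target values; the features are ignored by the dummy model.
y_train = np.array([10.0, 12.0, 14.0, 20.0])
y_test = np.array([11.0, 19.0])
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predict the mean of the training targets (14.0 here).
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train, y_train)

baseline_pred = baseline.predict(X_test)
baseline_mae = mean_absolute_error(y_test, baseline_pred)
print(baseline_pred, baseline_mae)  # [14. 14.] 4.0
```

On the real data, the same pattern applies with `X_train`, `y_train`, `X_test`, and `y_test` from the split above; the resulting error is the number any learned model should beat.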

In [26]:
# List the features that have a numeric type
df.select_dtypes(include='number').columns.tolist()

['VendorID',
 'passenger_count',
 'trip_distance',
 'RatecodeID',
 'PULocationID',
 'DOLocationID',
 'payment_type',
 'fare_amount',
 'extra',
 'mta_tax',
 'tip_amount',
 'tolls_amount',
 'improvement_surcharge',
 'congestion_surcharge',
 'airport_fee',
 'trip_duration',
 'store_and_fwd_flag_intY']

In [27]:
# Fit a LinearRegression model as the reference point for the mean absolute error of the total amount
reg = LinearRegression(fit_intercept=True)
# Train the LinearRegression model on the training set
reg.fit(X_train, y_train)

In [178]:
# Print the mean absolute error of the trained LinearRegression model (the baseline)
mean_abs_error_baseline = mean_absolute_error(y_test, reg.predict(X_test))
print('Mean Absolute Error (Baseline):', mean_abs_error_baseline)

Mean Absolute Error (Baseline): 0.13904375500272972


In [30]:
# Print the coefficients of the trained linear regression model
reg.coef_

array([ 1.82542058e+00, -7.82139183e-04,  2.12100379e-04,  1.17905106e-02,
       -1.92820362e-05,  7.98728826e-07, -9.99613325e-04,  9.99999866e-01,
        7.62736436e-01,  1.27564284e+00,  1.00048766e+00,  1.00774480e+00,
        2.59891539e+00,  7.60625196e-01,  7.57891803e-01,  1.63277811e-05,
        5.01238446e-03])

In [31]:
# Create another baseline with RandomForestRegressor, train it on the training set, and print the mean absolute error of the total amount
reg_forest = RandomForestRegressor(max_depth=2, min_samples_split=1000, min_samples_leaf=100, random_state=0)
reg_forest.fit(X_train, y_train)
mean_absolute_error(y_test, reg_forest.predict(X_test))

4.514316516342158

## The baseline model
- **Model**: The baseline model is a linear regression model from Scikit-Learn.
- **Data set**: The data set was cleaned by dropping rows with NA values. The categorical column 'store_and_fwd_flag' was converted to a numeric indicator so it can be fed to the baseline model. The time columns (pickup and drop-off time) were dropped.
- **Training**: The baseline model was trained on an 80%/20% train/test split.
- **Performance metric**: The mean absolute error of the linear regression model (baseline) on the test set is 0.1390. Such a low error is expected here because the features include 'fare_amount', 'tip_amount', and 'tolls_amount', which are components of 'total_amount'; their fitted coefficients are all approximately 1. The mean absolute error of the RandomForestRegressor with max_depth=2, min_samples_split=1000, min_samples_leaf=100 is 4.5143, which is higher than that of the linear regression model. This is likely because of the small max_depth.

With a baseline metric in place, we can try to build a machine learning model. Obviously, if the model can't beat the baseline then there are some major issues to be resolved.

It's always a good idea to start with a simple machine learning model, like linear regression, and build upon it if necessary.

Here are your tasks:

  1. Use Scikit-Learn's [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) to preprocess the categorical and continuous features independently. Apply the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to the continuous columns and [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) to the categorical columns.

  One-hot encoding is a popular technique used to represent categorical variables numerically in machine learning models. It transforms categorical features into a binary vector representation, where each category is represented by a binary column. Here's an explanation of one-hot encoding:

  When working with categorical variables, such as colors (e.g., red, blue, green) or vehicle types (e.g., car, truck, motorcycle), machine learning algorithms often require numerical inputs. However, directly assigning numerical values to categories can introduce unintended relationships or orderings between them. For example, assigning the values 0, 1, and 2 to the categories red, blue, and green may imply a sequential relationship, which is not desired.

  One-hot encoding solves this problem by creating new binary columns, equal to the number of unique categories in the original feature. Each binary column represents a specific category and takes a value of 1 if the data point belongs to that category, and 0 otherwise. This encoding ensures that no implicit ordering or relationship exists between the categories.

  2. Integrate the preprocessor in the previous step with Scikit-Learn's [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model using a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

  3. Train the pipeline on the training data.

  4. Evaluate the model using mean absolute error as a metric on the test data. Does the model beat the baseline?


In [34]:
# Identify the categorical features in the data set by comparing the columns with the description at https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
df.columns.tolist()

['VendorID',
 'passenger_count',
 'trip_distance',
 'RatecodeID',
 'PULocationID',
 'DOLocationID',
 'payment_type',
 'fare_amount',
 'extra',
 'mta_tax',
 'tip_amount',
 'tolls_amount',
 'improvement_surcharge',
 'congestion_surcharge',
 'airport_fee',
 'trip_duration',
 'store_and_fwd_flag_intY']

In [119]:
# Drop time data from the data set
trip_data_1 = clean_trip_data.drop(columns=['tpep_dropoff_datetime', 'tpep_pickup_datetime'], inplace=False)
trip_data_1.head(10)

Unnamed: 0,VendorID,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,trip_duration
0,1,2.0,3.8,1.0,N,142,236,1,14.5,3.0,0.5,3.65,0.0,0.3,21.95,2.5,0.0,17.816667
1,1,1.0,2.1,1.0,N,236,42,1,8.0,0.5,0.5,4.0,0.0,0.3,13.3,0.0,0.0,8.4
2,2,1.0,0.97,1.0,N,166,166,1,7.5,0.5,0.5,1.76,0.0,0.3,10.56,0.0,0.0,8.966667
3,2,1.0,1.09,1.0,N,114,68,2,8.0,0.5,0.5,0.0,0.0,0.3,11.8,2.5,0.0,10.033333
4,2,1.0,4.3,1.0,N,68,163,1,23.5,0.5,0.5,3.0,0.0,0.3,30.3,2.5,0.0,37.533333
5,1,1.0,10.3,1.0,N,138,161,1,33.0,3.0,0.5,13.0,6.55,0.3,56.35,2.5,0.0,29.55
6,2,1.0,5.07,1.0,N,233,87,1,17.0,0.5,0.5,5.2,0.0,0.3,26.0,2.5,0.0,14.133333
7,2,1.0,2.02,1.0,N,238,152,2,9.0,0.5,0.5,0.0,0.0,0.3,12.8,2.5,0.0,9.683333
8,2,1.0,2.71,1.0,N,166,236,1,12.0,0.5,0.5,2.25,0.0,0.3,18.05,2.5,0.0,14.783333
9,2,1.0,0.78,1.0,N,236,141,2,5.0,0.5,0.5,0.0,0.0,0.3,8.8,2.5,0.0,4.6


In [138]:
# Sanity check of the categorical variables against the data set and https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
# categorical_features = ['VendorID', 'RatecodeID', 'PULocationID', 'DOLocationID', 'payment_type', 'store_and_fwd_flag']
# trip_data_1['RatecodeID'].unique().tolist() # Values 1, 2, 3, 4, 5, 6 and 99 exist in the data set. RatecodeID=99 appears 8732 times
# trip_data_1['VendorID'].unique().tolist()  # Only VendorID 1 and 2 exist in the data set (no abnormal values)
# len(trip_data_1['PULocationID'].unique().tolist())  # There are 256 PULocationIDs in the data set.
# len(trip_data_1['DOLocationID'].unique().tolist())  # There are 261 DOLocationIDs in the data set.
trip_data_1['payment_type'].unique().tolist()  # There are 5 different payment types in the data set

[1, 2, 4, 3, 5]

In [139]:
# Sanity check of the categorical variables based on data set value_count
# trip_data_1['RatecodeID'].value_counts()
# trip_data_1['VendorID'].value_counts()
# trip_data_1['PULocationID'].value_counts()
# trip_data_1['DOLocationID'].value_counts()
trip_data_1['payment_type'].value_counts()

payment_type
1    1874874
2     495171
3      11709
4      10673
5          1
Name: count, dtype: int64

In [151]:
# Categorical features for one hot encoding are: 'VendorID', 'RatecodeID', 'payment_type', 'store_and_fwd_flag'  
# Categorical features that are not encoded: 'PULocationID', 'DOLocationID'  
categorical_features = ['VendorID', 'store_and_fwd_flag', 'RatecodeID', 'payment_type']
# Numerical features are 'passenger_count', 'trip_distance', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'congestion_surcharge', 'airport_fee', 'trip_duration'
# Numerical features for standard scaling are 'trip_distance', 'fare_amount', 'trip_duration'
numerical_features = ['trip_distance', 'fare_amount', 'trip_duration'] 

In [168]:
# Use Scikit-Learn's ColumnTransformer to preprocess the categorical and continuous features independently.
# Define the transformers: one-hot encoding for categorical features, standard scaling for numerical features
t_cat_num = [
    ('cat_enc', OneHotEncoder(drop='first'), categorical_features),
    ('num_enc', StandardScaler(), numerical_features),
]
# Build the column transformer. remainder='passthrough' forwards every other input column unchanged,
# including 'total_amount' if it is still present in the input.
transformer_cat_num = ColumnTransformer(transformers=t_cat_num, remainder='passthrough')

In [169]:
# Fit the transformer and apply it to the cleaned data (selected categorical and numerical features)
clean_trip_data_transformed = transformer_cat_num.fit_transform(trip_data_1)

In [170]:
# Print and see the values of the transformed data 
clean_trip_data_transformed[0:10, :]

array([[ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         1.62538991e-01,  6.52070580e-03,  7.85433335e-02,
         2.00000000e+00,  1.42000000e+02,  2.36000000e+02,
         3.00000000e+00,  5.00000000e-01,  3.65000000e+00,
         0.00000000e+00,  3.00000000e-01,  2.19500000e+01,
         2.50000000e+00,  0.00000000e+00],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        -2.32028386e-01, -1.85179036e-02, -1.22685351e-01,
         1.00000000e+00,  2.36000000e+02,  4.20000000e+01,
         5.00000000e-01,  5.00000000e-01,  4.00000000e+00,
         0.00000000e+00,  3.00000000e-01,  1.33000000e+01,
         0.00

In [171]:
# Print the feature names of the transformed data set
transformer_cat_num.get_feature_names_out()

array(['cat_enc__VendorID_2', 'cat_enc__store_and_fwd_flag_Y',
       'cat_enc__RatecodeID_2.0', 'cat_enc__RatecodeID_3.0',
       'cat_enc__RatecodeID_4.0', 'cat_enc__RatecodeID_5.0',
       'cat_enc__RatecodeID_6.0', 'cat_enc__RatecodeID_99.0',
       'cat_enc__payment_type_2', 'cat_enc__payment_type_3',
       'cat_enc__payment_type_4', 'cat_enc__payment_type_5',
       'num_enc__trip_distance', 'num_enc__fare_amount',
       'num_enc__trip_duration', 'remainder__passenger_count',
       'remainder__PULocationID', 'remainder__DOLocationID',
       'remainder__extra', 'remainder__mta_tax', 'remainder__tip_amount',
       'remainder__tolls_amount', 'remainder__improvement_surcharge',
       'remainder__total_amount', 'remainder__congestion_surcharge',
       'remainder__airport_fee'], dtype=object)

In [172]:
# Create a linear regression model
reg_trip_data = LinearRegression(fit_intercept=True)

In [173]:
# Create a pipeline object containing the column transformations and regression model.
# The pipeline applies ColumnTransformer and LinearRegression
# The transformer_cat_num applies the ColumnTransformer (for the selected categorical features and selected numerical features as described above)
pipeline = Pipeline(steps=[('trip_data_prep', transformer_cat_num), ('trip_model', reg_trip_data)])

In [174]:
# Apply the train and test data set split and print their sizes
X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(trip_data_1, target_variable, test_size=0.2, random_state=42)
print(X_train_p.shape, y_train_p.shape, X_test_p.shape, y_test_p.shape)

(1913942, 18) (1913942,) (478486, 18) (478486,)


In [175]:
# Fit the pipeline on the training data.
pipeline.fit(X_train_p, y_train_p)

The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



In [176]:
# Make predictions on the test data.
trip_data_mean_abs_error = mean_absolute_error(y_test_p, pipeline.predict(X_test_p))
print('Mean Absolute Error (Pipeline):', trip_data_mean_abs_error)

Mean Absolute Error (Pipeline): 1.1694567576522937e-10


In [177]:
# Print the steps of the pipeline 
pipeline.named_steps

{'trip_data_prep': ColumnTransformer(remainder='passthrough',
                   transformers=[('cat_enc', OneHotEncoder(drop='first'),
                                  ['VendorID', 'store_and_fwd_flag',
                                   'RatecodeID', 'payment_type']),
                                 ('num_enc', StandardScaler(),
                                  ['trip_distance', 'fare_amount',
                                   'trip_duration'])]),
 'trip_model': LinearRegression()}

# Linear regression (improved) model
- **Model**: The improved model is a linear regression model from Scikit-Learn. The improvements come from the data: the model is trained on a larger number of features taken from the pre-processed data set.
- **Data set**: The data set has been cleaned by dropping the NA values. The categorical features 'VendorID', 'store_and_fwd_flag', 'RatecodeID', and 'payment_type' have been one-hot encoded (OneHotEncoder). The categorical features 'PULocationID' and 'DOLocationID' are not encoded because they would require a large number of one-hot columns. The numerical features 'trip_distance', 'fare_amount', and 'trip_duration' are standardized with StandardScaler (the mean is removed and the values are scaled to unit variance). StandardScaler is not applied to the numerical features 'passenger_count', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'congestion_surcharge', and 'airport_fee'. The time data (pick-up and drop-off) has been dropped.
- **Training**: The improved model has been trained with a train/test data split (80% train, 20% test).
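- **Performance metric**: The mean absolute error of the improved linear regression model is 1.1694567576522937e-10, far lower than that of the baseline model. An error this close to zero should be read with caution: the feature names printed earlier include 'total_amount' alongside its components ('fare_amount', 'tip_amount', 'tolls_amount', and the surcharges), so if the target is the total fare, the model can reconstruct it almost exactly. That would be target leakage rather than genuine predictive skill.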
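
A mean absolute error on the order of 1e-10 is what a linear model produces when the target is an exact linear combination of its inputs. The following is a minimal sketch on synthetic data (the three fare components and their ranges are made up for illustration, not taken from the taxi data) that shows the effect, and what happens once one component is withheld:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical fare components (synthetic values)
fare = rng.uniform(3, 60, size=5000)
tip = rng.uniform(0, 10, size=5000)
tolls = rng.uniform(0, 6, size=5000)

# The target is an exact sum of the features, mimicking a total amount
total = fare + tip + tolls

X = np.column_stack([fare, tip, tolls])
X_tr, X_te, y_tr, y_te = train_test_split(X, total, test_size=0.2, random_state=42)

# All components available: the model recovers the sum up to rounding error
full_model = LinearRegression().fit(X_tr, y_tr)
full_mae = mean_absolute_error(y_te, full_model.predict(X_te))

# Withhold the tip column: the missing information shows up as error
partial_model = LinearRegression().fit(X_tr[:, [0, 2]], y_tr)
partial_mae = mean_absolute_error(y_te, partial_model.predict(X_te[:, [0, 2]]))

print(f"MAE with all components: {full_mae:.2e}")
print(f"MAE without tip:         {partial_mae:.2f}")
```

If the same pattern applies to the taxi features, removing the columns that are only known after the ride ends would give a more honest estimate of model quality.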


Random Forest Regression and Linear Regression are two commonly used regression algorithms, each with its own advantages and suitability for different scenarios. Random Forest Regression offers several advantages over Linear Regression, including:

1. **Non-linearity**: Random Forest Regressor is capable of capturing non-linear relationships between features and the target variable. In contrast, Linear Regression assumes a linear relationship between the features and the target. When faced with non-linear relationships or complex feature interactions, Random Forest Regressor can provide more accurate predictions.

2. **Robustness to Outliers**: Random Forest Regressor is generally more robust to outliers compared to Linear Regression. Outliers can disproportionately impact the coefficients and predictions of Linear Regression models. However, as an ensemble of decision trees, Random Forest Regressor can mitigate the effect of outliers by averaging predictions from multiple trees.

3. **Feature Importance**: Random Forest Regressor provides a measure of feature importance, which helps identify the most influential features for making predictions. This information is useful for feature selection, understanding the underlying relationships in the data, and gaining insights into the problem domain. Unlike Linear Regression, which provides coefficient values indicating the direction and magnitude of relationships, Random Forest Regressor explicitly highlights feature importance.

4. **Handling of Categorical Variables**: Tree-based models such as Random Forest can split on categorical variables without one-hot encoding, because a tree only needs a way to partition the values, not a meaningful numeric scale. Note that Scikit-Learn's RandomForestRegressor still requires numeric input, so categorical columns must at least be converted to numbers (for example with an ordinal encoding) before fitting. In contrast, Linear Regression generally needs one-hot encoding, because it would otherwise treat the category codes as quantities.

5. **Handling of High-Dimensional Data**: Random Forest Regressor can handle datasets with a large number of features (high dimensionality) by automatically selecting subsets of features during the construction of individual decision trees. This reduces the risk of overfitting, which is a concern with Linear Regression when dealing with high-dimensional data.

6. **Resistance to Multicollinearity**: Random Forest Regressor is less affected by multicollinearity, which occurs when predictor variables are highly correlated. In Linear Regression, highly correlated features can lead to unstable coefficient estimates, making it challenging to interpret the individual effects of each feature. Random Forest Regressor, as an ensemble approach, is less impacted by multicollinearity because each tree is built independently.
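
The first and third points can be illustrated with a small sketch on synthetic data. The data-generating formula below is an assumption chosen for illustration: one feature affects the target quadratically, one linearly, and one not at all.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features: column 0 drives the target quadratically,
# column 1 has a small linear effect, column 2 is irrelevant
X = rng.uniform(-3, 3, size=(2000, 3))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

linear = LinearRegression().fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

linear_mae = mean_absolute_error(y_te, linear.predict(X_te))
forest_mae = mean_absolute_error(y_te, forest.predict(X_te))
print(f"Linear regression MAE: {linear_mae:.3f}")
print(f"Random forest MAE:     {forest_mae:.3f}")

# Impurity-based importances sum to 1; the quadratic feature should dominate
importances = forest.feature_importances_
print("Feature importances:", np.round(importances, 3))
```

On real data the gap depends on how non-linear the relationships actually are, so a comparison like this is worth repeating on the taxi features rather than assumed.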

Here are your tasks:

  1. Build a Random Forest Regressor model using Scikit-Learn's [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) and train it on the train data.

  2. Evaluate the performance of the model on the test data using mean absolute error as a metric. Mess around with various input parameter configurations to see how they affect the model. Can you beat the performance of the linear regression model?

In [186]:
# Build random forest regressor model
reg_forest = RandomForestRegressor(max_depth=10, min_samples_split=1000, min_samples_leaf=100, random_state=0)
reg_forest.fit(X_train, y_train)

In [187]:
# Make predictions on the test data
mean_absolute_error(y_test, reg_forest.predict(X_test))

0.7219935073635966

# RandomForestRegressor model
- **Model**: The model is a RandomForestRegressor from Scikit-Learn with max_depth=10, min_samples_split=1000, and min_samples_leaf=100.
- **Data set**: The data set has been cleaned by dropping the NA values. The categorical feature 'store_and_fwd_flag' is converted to a numeric value so that it can be fed to the model. The time data columns (pick-up time and drop-off time) have been dropped.
- **Training**: The model has been trained with a train/test data split (80% train, 20% test).
- **Performance metric**: The mean absolute error of the RandomForestRegressor model is 0.7220. This is an improvement over the RandomForestRegressor model used in the previous section, due to the higher max_depth (10). Compared to the improved LinearRegression model, the RandomForestRegressor performs worse. A likely contributor is the pre-processing (OneHotEncoder, StandardScaler) applied for the linear regression model. In addition, hyperparameters such as max_depth and n_estimators have not yet been tuned for the RandomForestRegressor.

Hyperparameter tuning plays a critical role in machine learning model development. It involves selecting the optimal values for the hyperparameters, which are configuration settings that control the behavior of the learning algorithm. Here's why hyperparameter tuning is so important in ML:

1. **Optimizing Model Performance**: The choice of hyperparameters can significantly impact the model's performance. By fine-tuning the hyperparameters, we can improve the model's accuracy, precision, recall, or other performance metrics. It helps to extract the maximum predictive power from the chosen algorithm and ensures that the model is well-suited to the specific problem at hand.

2. **Avoiding Overfitting and Underfitting**: Hyperparameter tuning helps strike a balance between overfitting and underfitting.

3. **Exploring Model Complexity**: Hyperparameter tuning enables us to explore the complexity of the model. For instance, in algorithms like decision trees or neural networks, we can adjust the number of layers, the number of neurons, or the maximum depth of the tree. By systematically modifying these hyperparameters, we can understand how different levels of complexity impact the model's performance and find the right balance between simplicity and complexity.
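
One way to explore model complexity is to vary a single hyperparameter and compare training and validation error. The following is a minimal sketch using Scikit-Learn's `validation_curve` with a decision tree on synthetic data (the data and the list of depths are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression problem with noise
X = rng.uniform(-3, 3, size=(600, 2))
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] + rng.normal(0, 0.3, size=600)

depths = [1, 2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=0),
    X,
    y,
    param_name="max_depth",
    param_range=depths,
    cv=3,
    scoring="neg_mean_absolute_error",
)

# Scores are negated MAE, so flip the sign and average over the folds
train_mae = -train_scores.mean(axis=1)
val_mae = -val_scores.mean(axis=1)

for depth, tr, va in zip(depths, train_mae, val_mae):
    print(f"max_depth={depth:>2}  train MAE={tr:.3f}  validation MAE={va:.3f}")
```

A training error that keeps falling while the validation error stalls or rises is the usual sign that the model has become too complex for the data.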

Note that there are multiple approaches to hyperparameter tuning.

While grid search is the easiest to understand and implement, Bayesian search has several advantages over grid search for hyperparameter tuning:

1. **Efficiency**: Bayesian search is generally more efficient than grid search. Grid search explores all possible combinations of hyperparameter values, which can be computationally expensive and time-consuming, especially when dealing with a large number of hyperparameters or a wide range of values. Bayesian search, on the other hand, intelligently selects the next hyperparameter configuration to evaluate based on the results of previous evaluations. It focuses on areas of the hyperparameter space that are more likely to yield better performance, reducing the number of evaluations needed.

2. **Flexibility**: Bayesian search is flexible in handling continuous and discrete hyperparameters. It can handle both types of hyperparameters naturally and effectively. In contrast, grid search is more suitable for discrete hyperparameters but may struggle with continuous ones, as it requires discretization or defining a finite set of values to search over.

3. **Adaptive Search**: Bayesian search adapts its search strategy based on the results of previous evaluations. It maintains a probability distribution over the hyperparameter space, updating it with each evaluation. This allows it to dynamically allocate more evaluations to promising regions and explore unexplored areas. In contrast, grid search follows a fixed and predefined search grid, regardless of the results of previous evaluations.

4. **Better Convergence**: Bayesian search has the potential to converge to the optimal hyperparameter configuration more quickly.
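
The cost of grid search mentioned in the first point can be made concrete by counting model fits. The following is a small sketch using Scikit-Learn's `ParameterGrid`; the helper `count_fits` and the grid values are illustrative assumptions.

```python
from sklearn.model_selection import ParameterGrid


def count_fits(param_grid, cv):
    """Number of model fits in a grid search: one per combination per fold."""
    return len(ParameterGrid(param_grid)) * cv


# Two values for each of three parameters: 2 * 2 * 2 = 8 combinations
small_grid = {
    "n_estimators": [100, 120],
    "max_depth": [5, 7],
    "min_samples_leaf": [100, 150],
}

# Five values for each of three parameters: 5 * 5 * 5 = 125 combinations
large_grid = {
    "n_estimators": [50, 100, 150, 200, 250],
    "max_depth": [3, 5, 7, 9, 11],
    "min_samples_leaf": [10, 50, 100, 150, 200],
}

small_fits = count_fits(small_grid, cv=3)
large_fits = count_fits(large_grid, cv=3)
print(small_fits, large_fits)  # 24 375
```

With `refit=True`, GridSearchCV performs one additional fit of the best configuration on the full training data.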

Here are your tasks:

  1. Perform a grid-search on a Random Forest Regressor model. Only search the space for the parameters 'n_estimators', 'max_depth', and 'min_samples_split'. Note, this can take some time to run. Make sure you set reasonable boundaries for the search space. Use Scikit-Learn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) method.

  2. After you've identified the best parameters, train a random forest regression model using these parameters on the full training data.

  3. Evaluate the model from the previous step using the test data. How does your model perform?

In [21]:
# Define the hyperparameters to tune.
param_grid = {'n_estimators': [100, 120], 'max_depth':[5, 7], 'min_samples_leaf':[100, 150]} 

In [22]:
# Perform grid search to find the best hyperparameters. This could take a while.
reg_forest_grid_search = RandomForestRegressor(random_state=0)
# Create a GridSearchCV object using the parameter grid defined above
grid_rf_class = GridSearchCV(
    estimator=reg_forest_grid_search,
    param_grid=param_grid,
    scoring='neg_mean_absolute_error',
    n_jobs=4,
    cv=3,
    refit=True, return_train_score=False)
print(grid_rf_class)

GridSearchCV(cv=3, estimator=RandomForestRegressor(random_state=0), n_jobs=4,
             param_grid={'max_depth': [5, 7], 'min_samples_leaf': [100, 150],
                         'n_estimators': [100, 120]},
             scoring='neg_mean_absolute_error')


In [23]:
grid_rf_class.fit(X_train, y_train)

In [24]:
# Get the best model and its parameters.
best_estimator_rf_gs = grid_rf_class.best_estimator_

In [25]:
# Fit the best regressor on the training data (already done by refit=True; repeated here for clarity).
best_estimator_rf_gs.fit(X_train, y_train)

In [26]:
# Make predictions on the test data
predicted_values = best_estimator_rf_gs.predict(X_test)

In [27]:
# The mean absolute error of the best model found
mean_absolute_error(y_test, predicted_values)

1.203719600601042

In [33]:
# Find the parameters of the best model found
grid_rf_class.best_params_

{'max_depth': 7, 'min_samples_leaf': 150, 'n_estimators': 120}

# Hyperparameter tuning with GridSearchCV 
- **Search method**: A grid search with a RandomForestRegressor model is set up over the parameters 'max_depth', 'min_samples_leaf', and 'n_estimators'. Because the search took a very long time, the range of each parameter was reduced to a small set of values. Refit was set to True to obtain the best model, and 3-fold cross-validation was used. Mean absolute error is used as the metric.
- **Data set**: As in the previous method, the data set has been cleaned by dropping the NA values. The categorical feature 'store_and_fwd_flag' is converted to a numeric value so that it can be fed to the model. The time data columns (pick-up time and drop-off time) have been dropped.
- **Search in the grid**: The best RandomForestRegressor hyperparameters are searched on the training portion of the train/test data split (80% train, 20% test) with 3-fold cross-validation.
- **Performance metric**: The best model found through GridSearchCV uses {'max_depth': 7, 'min_samples_leaf': 150, 'n_estimators': 120} with min_samples_split=2 (default) and achieved a mean absolute error of 1.2037 on the test data. The earlier model with max_depth=10, min_samples_leaf=100, n_estimators=100 (default), and min_samples_split=1000 achieved a mean absolute error of 0.7220 on the test data. That configuration lies outside the grid searched here (max_depth was capped at 7), so the results show that a more comprehensive grid search over a wider range of values is required. If such a grid search is too time-consuming, other methods such as [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html), Bayesian methods such as [BayesSearchCV](https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html), or mixed methods can be used instead.
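
As a possible next step, the following is a minimal sketch of a randomized search with Scikit-Learn's RandomizedSearchCV. It runs on a small synthetic data set so that it finishes quickly; the data, the parameter ranges, and `n_iter=6` are assumptions for illustration and would need to be adapted to the taxi data.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)

# Small synthetic regression problem standing in for the taxi features
X = rng.uniform(0, 10, size=(500, 4))
y = 2.5 + 3.0 * X[:, 0] + X[:, 1] ** 1.5 + rng.normal(0, 1.0, size=500)

# Integer distributions to sample from, instead of a fixed grid
param_distributions = {
    "n_estimators": randint(10, 40),
    "max_depth": randint(3, 12),
    "min_samples_leaf": randint(1, 30),
}

random_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=6,  # number of sampled configurations, fixed in advance
    scoring="neg_mean_absolute_error",
    cv=3,
    random_state=0,
)
random_search.fit(X, y)

best_params = random_search.best_params_
best_cv_mae = -random_search.best_score_  # flip the sign of the negated MAE
print(best_params, best_cv_mae)
```

The cost is controlled by `n_iter` rather than by the size of the search space, which makes it easier to cover wider ranges such as deeper trees within a fixed time budget.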