<a href="https://colab.research.google.com/github/seecode4/seeRepo1/blob/main/mec2-projects/Student_MLE_MiniProject_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mini Project: Build a Machine Learning Model

## Predict Total Fare on the NYC Taxi Dataset

Welcome to the NYC Taxi Fare Prediction project! In this Colab, we will continue using the NYC Taxi Dataset to predict the fare amount for taxi rides using a subset of available features. We will go through three main stages: building a baseline model, creating a full model, and performing hyperparameter tuning to enhance our predictions.

Now that you've completed exploratory data analysis on this dataset you should have a good understanding of the feature space.

## Project Objectives

The primary objectives of this project are as follows:

1. **Baseline Model**: We will start by building a simple baseline model to establish a benchmark for our predictions. This model will serve as a starting point to compare the performance of our subsequent models.

2. **Full Model**: Next, we will develop a more comprehensive model that leverages machine learning techniques to improve prediction accuracy. We will use Scikit-Learn's model pipeline to build a framework that enables rapid experimentation.

3. **Hyperparameter Tuning**: Lastly, we will optimize our full model by fine-tuning its hyperparameters. By systematically adjusting the parameters that control model behavior, we aim to achieve the best possible performance for our prediction task.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

Load the NYC taxi dataset into a Pandas DataFrame and do a few basic checks to ensure the data is loaded properly. Note, there are several months of data that can be used. For simplicity, use the Yellow Taxi 2022-01 parquet file [here](https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet). Here are your tasks:

  1. Load the `yellow_tripdata_2022-01.parquet` file into Pandas.
  2. Print the first 5 rows of data.
  3. Drop any rows of data that contain NULL values.
  4. Create a new feature, 'trip_duration' that captures the duration of the trip in minutes.
  5. Create a variable named 'target_variable' to store the name of the thing we're trying to predict, 'total_amount'.
  6. Create a list called 'feature_cols' containing the feature names that we'll be using to predict our target variable. The list should contain 'VendorID', 'trip_distance', 'payment_type', 'PULocationID', 'DOLocationID', and 'trip_duration'.

In [2]:
# Load the dataset into a pandas DataFrame (from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
# url = "https://www.nyc.gov/site/tlc/about/data.page/yellow_tripdata_2022-01.parquet"
# Feature Info: https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet"
df_raw = pd.read_parquet(url)

In [3]:
# Display the first few rows of the dataset
print(df_raw.head(5))

   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         1  2022-01-01 00:35:40   2022-01-01 00:53:29              2.0   
1         1  2022-01-01 00:33:43   2022-01-01 00:42:07              1.0   
2         2  2022-01-01 00:53:21   2022-01-01 01:02:19              1.0   
3         2  2022-01-01 00:25:21   2022-01-01 00:35:23              1.0   
4         2  2022-01-01 00:36:48   2022-01-01 01:14:20              1.0   

   trip_distance  RatecodeID store_and_fwd_flag  PULocationID  DOLocationID  \
0           3.80         1.0                  N           142           236   
1           2.10         1.0                  N           236            42   
2           0.97         1.0                  N           166           166   
3           1.09         1.0                  N           114            68   
4           4.30         1.0                  N            68           163   

   payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  \


In [4]:
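# Count missing values per column, then drop rows that contain any.
print(df_raw.isnull().sum())
# .copy() makes df an independent DataFrame, so adding columns to it later
# does not trigger pandas' SettingWithCopyWarning.
df = df_raw.dropna().copy()
print(df_raw.shape)
print(df.shape)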

VendorID                     0
tpep_pickup_datetime         0
tpep_dropoff_datetime        0
passenger_count          71503
trip_distance                0
RatecodeID               71503
store_and_fwd_flag       71503
PULocationID                 0
DOLocationID                 0
payment_type                 0
fare_amount                  0
extra                        0
mta_tax                      0
tip_amount                   0
tolls_amount                 0
improvement_surcharge        0
total_amount                 0
congestion_surcharge     71503
airport_fee              71503
dtype: int64
(2463931, 19)
(2392428, 19)


In [6]:
# Create new feature 'trip_duration': trip length in minutes, rounded to 2 decimals.
# Without .loc, pandas warns: "A value is trying to be set on a copy of a slice from a DataFrame."
df.loc[:,'trip_duration'] = ((df.loc[:,'tpep_dropoff_datetime']
                        - df.loc[:,'tpep_pickup_datetime']).dt.total_seconds()/60).round(2)
print(df[['tpep_dropoff_datetime', 'tpep_pickup_datetime', 'trip_duration']].head())

  tpep_dropoff_datetime tpep_pickup_datetime  trip_duration
0   2022-01-01 00:53:29  2022-01-01 00:35:40          17.82
1   2022-01-01 00:42:07  2022-01-01 00:33:43           8.40
2   2022-01-01 01:02:19  2022-01-01 00:53:21           8.97
3   2022-01-01 00:35:23  2022-01-01 00:25:21          10.03
4   2022-01-01 01:14:20  2022-01-01 00:36:48          37.53


In [7]:
# Create 'target_variable' to store the name of the thing we're trying to predict, 'total_amount'
target_variable = 'total_amount'
y = df[[target_variable]]

In [8]:
# Create a list called feature_col to store specified column names
feature_col = ['VendorID', 'trip_distance', 'payment_type', 'PULocationID',
               'DOLocationID', 'trip_duration']
X = df[feature_col]
print(type(X), X.shape)
print(type(y), y.shape)
print(X.head())

<class 'pandas.core.frame.DataFrame'> (2392428, 6)
<class 'pandas.core.frame.DataFrame'> (2392428, 1)
   VendorID  trip_distance  payment_type  PULocationID  DOLocationID  \
0         1           3.80             1           142           236   
1         1           2.10             1           236            42   
2         2           0.97             1           166           166   
3         2           1.09             2           114            68   
4         2           4.30             1            68           163   

   trip_duration  
0          17.82  
1           8.40  
2           8.97  
3          10.03  
4          37.53  


Splitting a dataset into training and test sets is a crucial step in machine learning model development. It allows us to evaluate the performance and generalization ability of our models accurately. The training set is used to train the model, while the test set serves as an independent sample for evaluating its performance.

1. **Model Training**: The training set is used to fit the model, allowing it to learn the underlying patterns and relationships between the features and the target variable. By exposing the model to a diverse range of examples in the training set, it can capture the underlying structure of the data.

2. **Model Evaluation**: The test set, which is independent of the training set, is crucial for evaluating how well the trained model generalizes to unseen data. It provides an unbiased assessment of the model's performance on new instances. By computing an evaluation metric on the test set (mean absolute error in this project; accuracy, precision, or recall for classification tasks), we can estimate how well the model will perform on unseen data.

3. **Preventing Overfitting**: Overfitting occurs when a model learns the training data's noise and idiosyncrasies instead of the underlying patterns. By evaluating the model on the test set, we can identify if the model is overfitting. If the model performs significantly worse on the test set compared to the training set, it indicates overfitting. In such cases, we might need to adjust the model, feature selection, or regularization techniques to improve generalization.

4. **Hyperparameter Tuning**: Splitting the dataset allows us to perform hyperparameter tuning on the model. Hyperparameters are configuration settings that control the learning process, such as learning rate, regularization strength, or the number of hidden layers in a neural network. By using a validation set (often created from a portion of the training set), we can iteratively adjust the hyperparameters and select the best combination that maximizes the model's performance on the validation set. The final evaluation on the test set provides an unbiased estimate of the model's performance.

By splitting the dataset into training and test sets, we can ensure that our models are both well-trained and accurately evaluated. This separation helps us understand how the model will perform on new, unseen data, which is critical for assessing its effectiveness and making informed decisions about its deployment.
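The validation set mentioned in point 4 can be carved out with a second call to `train_test_split`. The following is a minimal sketch on synthetic stand-in data; the array names, sizes, and split fractions are assumptions chosen for illustration, not values from this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 1,000 rows, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

# First hold out 20% of the rows as the final test set
X_trainval, X_test_demo, y_trainval, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)

# Then split the remainder again: 25% of the remaining 80% is 20% overall
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=1)

print(X_train_demo.shape, X_val_demo.shape, X_test_demo.shape)
```

With this layout, hyperparameters would be compared on the validation rows and the test rows would be touched only once, for the final estimate.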

Here is your task:

  1. Use Scikit-Learn's [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split the data into training and test sets. Don't forget to set the random state.

In [9]:
# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1)

In [10]:
# import inspect
# print(inspect.getsource(train_test_split))
# print(inspect.getsource(LinearRegression))

The importance of a baseline model, even if it uses a simple strategy like always predicting the mean, cannot be overstated. Here's why a baseline model is valuable:

1. **Performance Comparison**: A baseline model serves as a reference point for evaluating the performance of more sophisticated models. By establishing a simple yet reasonable baseline, we can determine whether our advanced models offer any significant improvement over this basic approach. It helps us set realistic expectations and gauge the effectiveness of our efforts.

2. **Model Complexity**: A baseline model provides insight into the complexity required to solve the prediction task. If a simple strategy like predicting the median performs reasonably well, it suggests that the problem might not necessitate complex modeling techniques. Conversely, if the baseline model performs poorly, it indicates the presence of more intricate patterns that need to be captured by more sophisticated models.

3. **Minimum Performance Requirement**: A baseline model can establish a minimum performance requirement for a predictive task. If we cannot outperform the baseline, it suggests that our models have failed to capture even the most fundamental relationships within the data. In such cases, we may need to revisit our data preprocessing steps, feature engineering techniques, or consider other external factors affecting the task.

4. **Identifying Data Issues**: A baseline model can help identify potential issues within the dataset. If the baseline model performs poorly, it may indicate problems like missing values, outliers, or data inconsistencies. These issues can be further investigated and resolved to improve the overall model performance.

While a baseline model like always predicting the median may not offer the highest prediction accuracy, its importance lies in its role as a starting point for model development and evaluation. It provides a solid foundation for comparing and assessing the performance of more complex models, ensuring that any improvements made are meaningful and significant.
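A constant baseline can predict either the mean or the median. As a rough sketch, Scikit-Learn's `DummyRegressor` supports both strategies; the skewed "fares" below are synthetic (a hypothetical log-normal sample), so the printed numbers say nothing about the taxi data itself.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical right-skewed "fares": mostly small values with a long tail
rng = np.random.default_rng(0)
y_fares = rng.lognormal(mean=2.5, sigma=0.6, size=5000)
X_unused = np.zeros((y_fares.size, 1))  # DummyRegressor ignores the features

baseline_mae = {}
for strategy in ("mean", "median"):
    dummy = DummyRegressor(strategy=strategy).fit(X_unused, y_fares)
    # Evaluate on the same sample the constant was computed from
    baseline_mae[strategy] = mean_absolute_error(y_fares, dummy.predict(X_unused))
    print(strategy, round(baseline_mae[strategy], 3))
```

On the sample it is computed from, the median is the constant that minimizes mean absolute error, so it is a natural baseline to check alongside the mean when MAE is the metric.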

Here is your task:

  1. Create a model that always predicts the mean total fare of the training dataset. Use Scikit-Learn's [mean_absolute_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html) to evaluate this model. Is it any good?

In [11]:
# Create a baseline for mean absolute error of total amount
y_train_mean = y_train.mean()
print("y_train_mean:", y_train_mean)

# Note predicted value is the same for the whole array
y_test_pred = np.full(shape=y_test.shape, fill_value=y_train_mean)
# print(y_test.head(10))

# Get mean absolute error of total amount in y_test
y_test_mae = mean_absolute_error(y_test, y_test_pred)
print("Mean Absolute Error:", round(y_test_mae, 2))

y_train_mean: total_amount    19.062666
dtype: float64
Mean Absolute Error: 9.19


With a baseline metric in place, we can try to build a machine learning model. Obviously, if the model can't beat the baseline then there are some major issues to be resolved.

It's always a good idea to start with a simple machine learning model, like linear regression, and build upon it if necessary.

Here are your tasks:

  1. Use Scikit-Learn's [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) to preprocess the categorical and continuous features independently. Apply the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to the continuous columns and [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) to the categorical columns.

  One-hot encoding is a popular technique used to represent categorical variables numerically in machine learning models. It transforms categorical features into a binary vector representation, where each category is represented by a binary column. Here's an explanation of one-hot encoding:

  When working with categorical variables, such as colors (e.g., red, blue, green) or vehicle types (e.g., car, truck, motorcycle), machine learning algorithms often require numerical inputs. However, directly assigning numerical values to categories can introduce unintended relationships or orderings between them. For example, assigning the values 0, 1, and 2 to the categories red, blue, and green may imply a sequential relationship, which is not desired.

  One-hot encoding solves this problem by creating new binary columns, equal to the number of unique categories in the original feature. Each binary column represents a specific category and takes a value of 1 if the data point belongs to that category, and 0 otherwise. This encoding ensures that no implicit ordering or relationship exists between the categories.

  2. Integrate the preprocessor in the previous step with Scikit-Learn's [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model using a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

  3. Train the pipeline on the training data.

  4. Evaluate the model using mean absolute error as a metric on the test data. Does the model beat the baseline?


In [12]:
# Use Scikit-Learn's ColumnTransformer to preprocess the categorical and continuous features independently.
# feature_col = ['VendorID', 'trip_distance', 'payment_type', 'PULocationID', 'DOLocationID', 'trip_duration']

def get_col_indexes(df, col_names):
  return [df.columns.get_loc(name) for name in col_names]

# Categorical features and corresponding column index
cat_columns = ['VendorID', 'payment_type', 'PULocationID', 'DOLocationID']
cat_ix = get_col_indexes(X, cat_columns)

# Continuous features and corresponding column index
cont_columns = ['trip_distance', 'trip_duration']
cont_ix = get_col_indexes(X, cont_columns)
print(cat_ix, cont_ix)

[0, 2, 3, 4] [1, 5]


In [13]:
# Create a pipeline object containing the column transformations and regression model.

# data preparation for categorical and continuous features
t = [('cat', OneHotEncoder(handle_unknown='ignore'), cat_ix),
     ('cont', StandardScaler(), cont_ix)]
col_transform = ColumnTransformer(transformers=t)

# define the model
model_linreg = LinearRegression(fit_intercept=True)

# define the data preparation and modeling pipeline
p_linreg = Pipeline(steps=[('prep', col_transform), ('m', model_linreg)])

In [14]:
%%time
# Fit the pipeline on the training data.
p_linreg.fit(X_train, y_train)

CPU times: user 11 s, sys: 7.06 s, total: 18.1 s
Wall time: 11.8 s


In [15]:
# Make predictions on the test data.
y_pred_linreg = p_linreg.predict(X_test)

# evaluate the model using mean absolute error as a metric
y_linreg_mae = mean_absolute_error(y_test, y_pred_linreg)
print("Linear Regression mean absolute error:", round(y_linreg_mae,2))

Linear Regression mean absolute error: 3.39


In [16]:
# Linear Regression, provides coefficient values indicating
#   the direction and magnitude of relationships
# same as m_coef = model_linreg.coef_
m_coef = p_linreg.named_steps['m'].coef_
print(type(m_coef), m_coef.shape)
print(model_linreg.coef_)

<class 'numpy.ndarray'> (1, 523)
[[ 4.60956067e-01 -4.60956077e-01 -1.38098916e+00 -5.03929369e+00
  -1.25829542e+01  1.93824877e+01 -3.79250634e-01  2.77009754e+01
   6.11373516e-01  6.24821494e+00 -8.01267323e+00  2.73499268e+01
   7.77297299e+00 -8.48416008e+00  5.69870403e+00  3.49773371e+00
   1.44351714e+01  8.61993236e-01 -7.95387843e+00 -7.77215785e+00
   1.75371240e+00  5.77703599e+00  7.99632043e+00 -3.04258535e+00
   1.29744242e+01  1.06199373e+01  4.06030058e+00  1.06628187e+01
  -2.05541519e-01  3.01949143e+01 -9.34598347e+00 -8.32232839e+00
  -1.28208367e+00  7.17134077e-01  7.86770027e+00  7.54453029e+00
   7.71061626e-01 -6.10508496e+00  1.56259358e+00 -8.17216557e+00
  -1.10208711e+00  2.28434084e-01 -1.88761733e+00 -1.59407866e+00
   4.05418398e+00  6.45005771e+00 -6.81184311e+00 -9.82309986e+00
  -8.67406338e+00 -8.93986513e+00  1.60187910e+01 -7.62861336e+00
  -8.96383826e-01 -5.70325255e-01 -8.91540267e+00 -2.80049584e+00
  -9.11666903e+00  3.47971826e+00 -8.484189

Random Forest Regression and Linear Regression are two commonly used regression algorithms, each with its own advantages and suitability for different scenarios. Random Forest Regression offers several advantages over Linear Regression, including:

1. **Non-linearity**: Random Forest Regressor is capable of capturing non-linear relationships between features and the target variable. In contrast, Linear Regression assumes a linear relationship between the features and the target. When faced with non-linear relationships or complex feature interactions, Random Forest Regressor can provide more accurate predictions.

2. **Robustness to Outliers**: Random Forest Regressor is generally more robust to outliers compared to Linear Regression. Outliers can disproportionately impact the coefficients and predictions of Linear Regression models. However, as an ensemble of decision trees, Random Forest Regressor can mitigate the effect of outliers by averaging predictions from multiple trees.

3. **Feature Importance**: Random Forest Regressor provides a measure of feature importance, which helps identify the most influential features for making predictions. This information is useful for feature selection, understanding the underlying relationships in the data, and gaining insights into the problem domain. Unlike Linear Regression, which provides coefficient values indicating the direction and magnitude of relationships, Random Forest Regressor explicitly highlights feature importance.

4. **Handling of Categorical Variables**: Tree-based models can, in principle, split on categorical variables without one-hot encoding, which is convenient when working with mixed data types. Note, however, that Scikit-Learn's RandomForestRegressor requires numeric input, so categorical columns still need to be encoded first, as the pipeline in this notebook does. Linear Regression likewise requires categorical variables to be encoded or transformed before use.

5. **Handling of High-Dimensional Data**: Random Forest Regressor can handle datasets with a large number of features (high dimensionality) by automatically selecting subsets of features during the construction of individual decision trees. This reduces the risk of overfitting, which is a concern with Linear Regression when dealing with high-dimensional data.

6. **Resistance to Multicollinearity**: Random Forest Regressor is less affected by multicollinearity, which occurs when predictor variables are highly correlated. In Linear Regression, highly correlated features can lead to unstable coefficient estimates, making it challenging to interpret the individual effects of each feature. Random Forest Regressor, as an ensemble approach, is less impacted by multicollinearity because each tree is built independently.
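The non-linearity point can be illustrated with a small sketch. The data here is synthetic (a noisy quadratic, chosen as an assumption for illustration), so the gap between the two models is much larger than anything to expect on the taxi data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic non-linear relationship: y = x^2 plus a little noise
rng = np.random.default_rng(0)
X_nl = rng.uniform(-3, 3, size=(2000, 1))
y_nl = X_nl[:, 0] ** 2 + rng.normal(scale=0.2, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X_nl, y_nl, test_size=0.2, random_state=1)

# A straight line cannot follow the parabola; an ensemble of trees can approximate it
lin = LinearRegression().fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=1).fit(X_tr, y_tr)

lin_mae = mean_absolute_error(y_te, lin.predict(X_te))
forest_mae = mean_absolute_error(y_te, forest.predict(X_te))
print("Linear Regression MAE:", round(lin_mae, 3))
print("Random Forest MAE:", round(forest_mae, 3))
```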

Here are your tasks:

  1. Build a Random Forest Regressor model using Scikit-Learn's [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) and train it on the train data.

  2. Evaluate the performance of the model on the test data using mean absolute error as a metric. Experiment with various input parameter configurations to see how they affect the model. Can you beat the performance of the linear regression model?

In [17]:
# May be useful in determining n_jobs
import joblib

N_CORES = joblib.cpu_count(only_physical_cores=True)
print(f"Number of physical cores: {N_CORES}")

Number of physical cores: 1


In [18]:
%%time
# Build random forest regressor model (n_jobs=-1 to use all processors)
# from sklearn.ensemble import RandomForestRegressor
model_rfreg = RandomForestRegressor(n_estimators=50, max_depth=5,
                                    random_state=1, n_jobs=-1,
                                    oob_score=False)

# define the data preparation and modeling pipeline
p_rfreg = Pipeline(steps=[('prep', col_transform), ('m', model_rfreg)])

CPU times: user 60 µs, sys: 27 µs, total: 87 µs
Wall time: 92.3 µs


In [19]:
print(X_train.shape, y_train.shape)
# Change shape of y in fit, to avoid
# /usr/local/lib/python3.10/dist-packages/sklearn/base.py:1152:
# DataConversionWarning: A column-vector y was passed when a 1d array was expected.
# Please change the shape of y to (n_samples,), for example using ravel().
# return fit_method(estimator, *args, **kwargs)
# Print the estimator itself to show its non-default parameters
print(p_rfreg.named_steps['m'])

(1913942, 6) (1913942, 1)
RandomForestRegressor(max_depth=5, n_estimators=50, n_jobs=-1, random_state=1)


In [20]:
%%time
# Fit the pipeline on the training data.
p_rfreg.fit(X_train, y_train.values.ravel())

CPU times: user 6min 11s, sys: 1.26 s, total: 6min 13s
Wall time: 3min 40s


In [21]:
# RandomForestRegressor helps identify the most influential features for making predictions
# same as feat_importances = p_rfreg.named_steps['m'].feature_importances_
feat_importances = model_rfreg.feature_importances_
feat_names = col_transform.get_feature_names_out()
print("Random Forest feature_names shape: ", feat_names.shape)
print("Random Forest feature_importances_ shape: ", feat_importances.shape)

# Get top 10 features
feat_imp_df = pd.DataFrame(list(zip(feat_names, feat_importances)),
                           columns=['feat_name', 'feat_importance'])
feat_imp_sorted = feat_imp_df.sort_values('feat_importance', ascending=False)
print(feat_imp_sorted.head(10))

Random Forest feature_names shape:  (523,)
Random Forest feature_importances_ shape:  (523,)
                 feat_name  feat_importance
521    cont__trip_distance         0.469959
108  cat__PULocationID_107         0.296208
522    cont__trip_duration         0.143166
396  cat__DOLocationID_140         0.066259
1          cat__VendorID_2         0.010500
260  cat__PULocationID_265         0.005256
5      cat__payment_type_4         0.003741
2      cat__payment_type_1         0.003306
261    cat__DOLocationID_1         0.000793
0          cat__VendorID_1         0.000718


In [22]:
%%time
# %%timeit #Time repeated execution of a single statement for more accuracy
# Make predictions on the test data
y_pred_rfreg = p_rfreg.predict(X_test)

# evaluate the model using mean absolute error as a metric
y_rfreg_mae = mean_absolute_error(y_test, y_pred_rfreg)
print("Random Forest n_estimators=50, max_depth=5, mean absolute error:",
      round(y_rfreg_mae,2))

Random Forest n_estimators=50, max_depth=5, mean absolute error: 2.56
CPU times: user 1.51 s, sys: 58.6 ms, total: 1.57 s
Wall time: 1.02 s


In [23]:
%%time
# Model performance: for regressors, score() returns R^2 (coefficient of determination)
rf_score = p_rfreg.score(X_test, y_test)
print(type(rf_score), rf_score)

<class 'numpy.float64'> -5.057650908656426
CPU times: user 1.56 s, sys: 68.2 ms, total: 1.63 s
Wall time: 1.47 s
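A low mean absolute error together with a negative R² can look contradictory. One way such a combination can arise is a handful of extreme prediction errors: R² is built from squared errors, so a single very large miss can dominate it, while MAE is barely affected. The sketch below demonstrates that mechanism on synthetic numbers; it is an illustrative assumption, not a diagnosis of this dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical targets and predictions
rng = np.random.default_rng(0)
y_true_demo = rng.uniform(5, 50, size=10_000)
y_pred_demo = y_true_demo + 0.5   # every prediction is off by 0.5 ...
y_pred_demo[0] = 5_000.0          # ... except one wildly wrong prediction

mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)
r2_demo = r2_score(y_true_demo, y_pred_demo)

# R^2 = 1 - SS_res / SS_tot, computed by hand for comparison
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MAE:", round(mae_demo, 3))
print("R^2:", round(r2_demo, 3), "manual:", round(r2_manual, 3))
```

Inspecting the largest absolute residuals on the test set would be one way to check whether something similar is happening with the real model.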


Hyperparameter tuning plays a critical role in machine learning model development. It involves selecting the optimal values for the hyperparameters, which are configuration settings that control the behavior of the learning algorithm. Here's why hyperparameter tuning is so important in ML:

1. **Optimizing Model Performance**: The choice of hyperparameters can significantly impact the model's performance. By fine-tuning the hyperparameters, we can improve the model's accuracy, precision, recall, or other performance metrics. It helps to extract the maximum predictive power from the chosen algorithm and ensures that the model is well-suited to the specific problem at hand.

2. **Avoiding Overfitting and Underfitting**: Hyperparameter tuning helps strike a balance between overfitting and underfitting.

3. **Exploring Model Complexity**: Hyperparameter tuning enables us to explore the complexity of the model. For instance, in algorithms like decision trees or neural networks, we can adjust the number of layers, the number of neurons, or the maximum depth of the tree. By systematically modifying these hyperparameters, we can understand how different levels of complexity impact the model's performance and find the right balance between simplicity and complexity.

Note, there are multiple approaches to hyperparameter tuning.

While grid search is the easiest to understand and implement, there are many advantages of Bayesian search over grid search for hyperparameter tuning:

1. **Efficiency**: Bayesian search is generally more efficient than grid search. Grid search explores all possible combinations of hyperparameter values, which can be computationally expensive and time-consuming, especially when dealing with a large number of hyperparameters or a wide range of values. Bayesian search, on the other hand, intelligently selects the next hyperparameter configuration to evaluate based on the results of previous evaluations. It focuses on areas of the hyperparameter space that are more likely to yield better performance, reducing the number of evaluations needed.

2. **Flexibility**: Bayesian search is flexible in handling continuous and discrete hyperparameters. It can handle both types of hyperparameters naturally and effectively. In contrast, grid search is more suitable for discrete hyperparameters but may struggle with continuous ones, as it requires discretization or defining a finite set of values to search over.

3. **Adaptive Search**: Bayesian search adapts its search strategy based on the results of previous evaluations. It maintains a probability distribution over the hyperparameter space, updating it with each evaluation. This allows it to dynamically allocate more evaluations to promising regions and explore unexplored areas. In contrast, grid search follows a fixed and predefined search grid, regardless of the results of previous evaluations.

4. **Better Convergence**: Bayesian search has the potential to converge to the optimal hyperparameter configuration more quickly.
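Bayesian search is not part of Scikit-Learn itself and typically needs a third-party package such as scikit-optimize or Optuna. As a stand-in that needs no extra install, the sketch below uses `RandomizedSearchCV`, which is not Bayesian (it does not learn from earlier evaluations) but does show the efficiency and flexibility points: it samples a fixed number of configurations from distributions instead of enumerating a grid. The data and parameter ranges are hypothetical.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem (hypothetical)
rng = np.random.default_rng(0)
X_small = rng.normal(size=(300, 4))
y_small = X_small[:, 0] * 3 + np.sin(X_small[:, 1]) + rng.normal(scale=0.1, size=300)

# Distributions to sample from, instead of a fixed list of values
param_distributions = {
    "n_estimators": randint(10, 50),
    "max_depth": randint(2, 9),
    "min_samples_split": randint(2, 50),
}

random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=1),
    param_distributions,
    n_iter=5,                           # evaluate only 5 sampled configurations
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=1,
)
random_search.fit(X_small, y_small)

print(len(random_search.cv_results_["params"]), "configurations evaluated")
print(random_search.best_params_)
```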

Here are your tasks:

  1. Perform a grid-search on a Random Forest Regressor model. Only search the space for the parameters 'n_estimators', 'max_depth', and 'min_samples_split'. Note, this can take some time to run. Make sure you set reasonable boundaries for the search space. Use Scikit-Learn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) method.

  2. After you've identified the best parameters, train a random forest regression model using these parameters on the full training data.

  3. Evaluate the model from the previous step using the test data. How does your model perform?

In [None]:
%%time
# Define the hyperparameters to tune.
# Only search the space for the parameters 'n_estimators', 'max_depth', and 'min_samples_split'
n_est_list = [30, 60, 100]
max_depth_list = [2, 4, 8]
min_samp_list = [4, 100, 1000]
# Create the grid
parameter_grid = [{'n_estimators': n_est_list,
                  'max_depth': max_depth_list,
                  'min_samples_split': min_samp_list}]

CPU times: user 7 µs, sys: 0 ns, total: 7 µs
Wall time: 10 µs
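The cost of a grid search grows multiplicatively with the number of values per parameter. As a quick check, `ParameterGrid` can count the candidates before anything is trained; the grid below repeats the values defined above, and the fold count of 3 matches the `cv=3` used in the search cell.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid defined above
grid_demo = {
    "n_estimators": [30, 60, 100],
    "max_depth": [2, 4, 8],
    "min_samples_split": [4, 100, 1000],
}

n_candidates = len(ParameterGrid(grid_demo))  # 3 * 3 * 3 combinations
n_folds = 3                                   # cross-validation folds
print(n_candidates, "candidates,", n_candidates * n_folds, "fits")
```

This agrees with the "Fitting 3 folds for each of 27 candidates, totalling 81 fits" message in the search log. Adding a fourth value to each list would raise the count to 64 candidates and 192 fits.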


In [None]:
# To keep colab from disconnecting - not clear if this helps
# Ref: https://stackoverflow.com/questions/71456390/how-to-keep-the-google-colab-running-without-disconnecting-in-2022
import time
from pynput.mouse import Controller ,Button

MouseClick = Controller()
cnt = 0
while True:
    MouseClick.click(Button.right, 1)  # one right-click per iteration to simulate activity
    time.sleep(60)
    cnt += 1
    # stop after 12 hours
    if (cnt > (60*12)):
      break

In [None]:
%%time
# Perform grid search to find the best hyperparameters. This could take a while.
rfr = RandomForestRegressor(random_state=1, n_jobs=-1, oob_score=False)
# scoring='accuracy' applies to classifiers; with scoring left as None,
# GridSearchCV uses the estimator's own score method (R^2 for regressors)
reg = GridSearchCV(rfr, parameter_grid, cv=3, verbose=3)
reg.fit(X_train, y_train.values.ravel())

Fitting 3 folds for each of 27 candidates, totalling 81 fits
[CV 1/3] END max_depth=2, min_samples_split=4, n_estimators=30;, score=0.584 total time=  18.6s
[CV 2/3] END max_depth=2, min_samples_split=4, n_estimators=30;, score=0.584 total time=  20.2s
[CV 3/3] END max_depth=2, min_samples_split=4, n_estimators=30;, score=0.001 total time=  19.1s
[CV 1/3] END max_depth=2, min_samples_split=4, n_estimators=60;, score=0.624 total time=  39.2s
[CV 2/3] END max_depth=2, min_samples_split=4, n_estimators=60;, score=0.623 total time=  38.8s
[CV 3/3] END max_depth=2, min_samples_split=4, n_estimators=60;, score=0.001 total time=  39.9s
[CV 1/3] END max_depth=2, min_samples_split=4, n_estimators=100;, score=0.632 total time= 1.1min
[CV 2/3] END max_depth=2, min_samples_split=4, n_estimators=100;, score=0.627 total time= 1.1min
[CV 3/3] END max_depth=2, min_samples_split=4, n_estimators=100;, score=0.001 total time= 1.1min
[CV 1/3] END max_depth=2, min_samples_split=100, n_estimators=30;, score

Note: The 'scoring' parameter of GridSearchCV must be a string naming a built-in scorer (for regression, for example 'neg_mean_absolute_error', 'neg_root_mean_squared_error', 'neg_mean_squared_error', or 'r2'), a callable, a list, a tuple, a dict, or None. With None, the estimator's own score method is used, which is R² for regressors.
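Since this notebook evaluates models with mean absolute error, one option is to have the grid search optimize the same metric by passing `scoring='neg_mean_absolute_error'`. The sketch below runs on a small synthetic dataset with a hypothetical one-parameter grid; scores are negated errors, so higher (closer to zero) is better.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem (hypothetical)
rng = np.random.default_rng(0)
X_s = rng.normal(size=(300, 3))
y_s = 2.0 * X_s[:, 0] + rng.normal(scale=0.5, size=300)

search_mae = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=1),
    {"max_depth": [2, 4]},
    cv=3,
    scoring="neg_mean_absolute_error",  # negated so that larger is better
)
search_mae.fit(X_s, y_s)

print(search_mae.best_params_)
# Flip the sign to report the cross-validated MAE in the usual units
print("Cross-validated MAE:", round(-search_mae.best_score_, 3))
```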

In [None]:
# Get the best model and its parameters.
print(reg.best_params_)
print(reg.best_score_)

{'max_depth': 2, 'min_samples_split': 4, 'n_estimators': 100}
0.4197946354218076


From the verbose output, the highest single-fold score (0.632) occurs for: <br>
[CV 1/3] END max_depth=2, min_samples_split=4, n_estimators=100;, score=0.632 total time= 1.1min<br>
[CV 1/3] END max_depth=2, min_samples_split=1000, n_estimators=100;, score=0.632 total time= 1.1min
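
The reported `best_score_` (0.4198) is lower than the single-fold values because GridSearchCV reports the mean test score across folds for the winning candidate. A minimal check, using the three fold scores copied from the verbose log for max_depth=2, min_samples_split=4, n_estimators=100 (the log rounds to three decimals, so the match is only approximate):

```python
import numpy as np

# Per-fold R^2 scores copied from the verbose log for
# max_depth=2, min_samples_split=4, n_estimators=100 (rounded in the log).
fold_scores = np.array([0.632, 0.627, 0.001])

# best_score_ is the mean cross-validated score of the best candidate.
mean_cv_score = fold_scores.mean()
print(round(mean_cv_score, 3))  # → 0.42
```

The third fold's score of 0.001 pulls the mean well below the best single-fold value.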

In [None]:
# Refit a regressor with the best hyperparameters on the training data.
rfr_best = RandomForestRegressor(n_estimators=100, max_depth=2,
                                    min_samples_split=4,
                                    random_state=1, n_jobs=-1,
                                    oob_score=False)
rfr_best.fit(X_train, y_train.values.ravel())

In [None]:
# Make predictions on the test data
y_pred_rfr_best = rfr_best.predict(X_test)

In [None]:

# Evaluate the model using mean absolute error (MAE) as the metric
y_rfr_best_mae = mean_absolute_error(y_test, y_pred_rfr_best)
print(y_rfr_best_mae)

# Model performance: score() returns R^2 for a regressor
rfr_best_score = rfr_best.score(X_test, y_test)
print(type(rfr_best_score), rfr_best_score)

3.9667082863622727
<class 'numpy.float64'> 0.6736309233705144


In [None]:
%%time
# Build random forest regressor model (n_jobs=-1 to use all processors)
# Try the hyperparameters that gave the most negative CV fold score:
# max_depth=8, min_samples_split=4, n_estimators=30;, score=-535.408
rfr_neg = RandomForestRegressor(n_estimators=30, max_depth=8,
                                    min_samples_split=4,
                                    random_state=1, n_jobs=-1,
                                    oob_score=False)
rfr_neg.fit(X_train, y_train.values.ravel())

# Make predictions on the test data
y_pred_rfr_neg = rfr_neg.predict(X_test)

# Evaluate the model using mean absolute error (MAE) as the metric
y_rfr_neg_mae = mean_absolute_error(y_test, y_pred_rfr_neg)
print(y_rfr_neg_mae)

# Model performance: score() returns R^2 for a regressor
rfr_neg_score = rfr_neg.score(X_test, y_test)
print(type(rfr_neg_score), rfr_neg_score)

1.8688383397444912
<class 'numpy.float64'> -1.2953822721775885
CPU times: user 3min 18s, sys: 617 ms, total: 3min 19s
Wall time: 1min 56s
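
The two metrics printed above can disagree because MAE weights every error linearly, while R^2 is built from squared errors. A few large misses can therefore drive R^2 negative even when the typical error is small. The toy illustration below uses made-up numbers, not the taxi data. Whether this is what happens with `rfr_neg` is an assumption that would need checking against the actual residuals.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up targets: five ordinary values and one extreme value.
y_true = np.array([10, 12, 11, 13, 9, 100])

# Hypothetical model A: exact on ordinary values, badly wrong on the extreme one.
pred_a = np.array([10, 12, 11, 13, 9, 10])
# Hypothetical model B: off by 16 on every value, including the extreme one.
pred_b = np.array([26, 28, 27, 29, 25, 84])

mae_a, r2_a = mean_absolute_error(y_true, pred_a), r2_score(y_true, pred_a)
mae_b, r2_b = mean_absolute_error(y_true, pred_b), r2_score(y_true, pred_b)

print(f"A: MAE={mae_a:.2f}, R^2={r2_a:.3f}")
print(f"B: MAE={mae_b:.2f}, R^2={r2_b:.3f}")
```

Model A has the lower MAE but a negative R^2, while model B has the higher MAE and a positive R^2, so the ranking depends on which metric is used.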


In [None]:
%%time
# Define the hyperparameters to tune: second attempt, now with data prep (col_transform)
# Only search over the parameters 'n_estimators', 'max_depth', and 'min_samples_split'
rfr = RandomForestRegressor(random_state=1, n_jobs=-1, oob_score=False)

# define the data preparation and modeling pipeline
pipe_rfr = Pipeline(steps=[('prep', col_transform), ('m', rfr)])

n_est_list = [50, 100]
max_depth_list = [2, 4, 5]
min_samp_list = [2, 4]
# Create the grid; pipeline parameters are named <step>__<parameter>
parameter_grid = [{'m__n_estimators': n_est_list,
                  'm__max_depth': max_depth_list,
                  'm__min_samples_split': min_samp_list}]

CPU times: user 82 µs, sys: 1e+03 ns, total: 83 µs
Wall time: 88.7 µs
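
To see how many candidates a grid defines before launching a long search, one option is scikit-learn's `ParameterGrid`, which expands the same list-of-dicts structure. The sketch below re-declares the grid from the cell above so that it runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above; the 'm__' prefix routes each parameter to the
# pipeline step named 'm' (the RandomForestRegressor).
parameter_grid = [{'m__n_estimators': [50, 100],
                   'm__max_depth': [2, 4, 5],
                   'm__min_samples_split': [2, 4]}]

candidates = list(ParameterGrid(parameter_grid))
n_candidates = len(candidates)   # 2 * 3 * 2 combinations
n_fits = n_candidates * 3        # with cv=3, one fit per fold per candidate
print(n_candidates, n_fits)      # → 12 36
```

With roughly 45 seconds to 2 minutes per fit in this notebook, the fit count gives a rough estimate of the total search time.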


In [None]:
%%time
# Perform grid search to find the best hyperparameters. This could take a while.
# scoring='accuracy' is for classifiers; use a regression metric such as 'r2' here
p_reg = GridSearchCV(pipe_rfr, parameter_grid, cv=3, verbose=3,
                   scoring='r2')
p_reg.fit(X_train, y_train.values.ravel())

# Get the best model and its parameters.
print(p_reg.best_params_)
print(p_reg.best_score_)

Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV 1/3] END m__max_depth=2, m__min_samples_split=2, m__n_estimators=50;, score=0.361 total time=  46.5s
[CV 2/3] END m__max_depth=2, m__min_samples_split=2, m__n_estimators=50;, score=0.474 total time=  47.6s
[CV 3/3] END m__max_depth=2, m__min_samples_split=2, m__n_estimators=50;, score=0.001 total time=  48.0s
[CV 1/3] END m__max_depth=2, m__min_samples_split=2, m__n_estimators=100;, score=0.384 total time= 1.5min
[CV 2/3] END m__max_depth=2, m__min_samples_split=2, m__n_estimators=100;, score=0.488 total time= 1.6min
[CV 3/3] END m__max_depth=2, m__min_samples_split=2, m__n_estimators=100;, score=0.001 total time= 1.6min
[CV 1/3] END m__max_depth=2, m__min_samples_split=4, m__n_estimators=50;, score=0.361 total time=  46.5s
[CV 2/3] END m__max_depth=2, m__min_samples_split=4, m__n_estimators=50;, score=0.474 total time=  46.6s
[CV 3/3] END m__max_depth=2, m__min_samples_split=4, m__n_estimators=50;, score=0.001 total time

In [None]:
# Get the best model and its parameters.
print(p_reg.best_params_)
print(p_reg.best_score_)

{'m__max_depth': 2, 'm__min_samples_split': 2, 'm__n_estimators': 100}
0.2908998718351085
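
Rather than reading per-fold scores out of the verbose log, the same numbers are available after fitting in the search object's `cv_results_` attribute. The sketch below uses a small synthetic dataset and a tiny grid as stand-ins (assumptions for illustration, not the taxi data or the grid above) so that it runs in seconds. The same column names should apply to `p_reg.cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the taxi dataset), kept small so it runs quickly.
X_demo, y_demo = make_regression(n_samples=150, n_features=4,
                                 noise=10.0, random_state=1)

search = GridSearchCV(RandomForestRegressor(n_estimators=10, random_state=1),
                      {'max_depth': [2, 4]}, cv=3, scoring='r2')
search.fit(X_demo, y_demo)

# cv_results_ holds one row per candidate, with one score column per fold.
results = pd.DataFrame(search.cv_results_)
cols = ['params', 'split0_test_score', 'split1_test_score',
        'split2_test_score', 'mean_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))
```

Sorting by `rank_test_score` puts the candidate that `best_params_` reports in the first row, next to the per-fold scores that produced its mean.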


Summary:
1. Loaded the yellow_tripdata_2022-01.parquet file into Pandas and added a 'trip_duration' feature. <br>
   The target variable y is 'total_amount'; this is what we are trying to predict.<br>
   'feature_cols' holds the names of the features used to predict 'total_amount'.

2. Model evaluation<br>
    First tried a baseline model that always predicts the mean value.
    Then tried a pipeline containing the column transformations and a regression model.  
       Mean baseline, mean absolute error: 9.19
       Linear Regression, mean absolute error: 3.39
       Random Forest (max_depth=5, n_estimators=50), mean absolute error: 2.56
         The most important features for this model are:
         521    cont__trip_distance         0.469959
         108  cat__PULocationID_107         0.296208
         522    cont__trip_duration         0.143166
         396  cat__DOLocationID_140         0.066259

3. GridSearchCV with no column transformation, just RandomForestRegressor, over the parameter grid<br>
     n_est_list = [30, 60, 100]  max_depth_list = [2, 4, 8]  min_samp_list = [4, 100, 1000]<br>
     verbose=3 gives<br>
       [CV 1/3] END max_depth=2, min_samples_split=4, n_estimators=100;, score=0.632 total time= 1.1min
       [CV 1/3] END max_depth=2, min_samples_split=1000, n_estimators=100;, score=0.632 total time= 1.1min
       [CV 2/3] END max_depth=8, min_samples_split=4, n_estimators=30;, score=-535.408 total time= 1.2min - most negative score
     Refitting RandomForestRegressor with best_params_, {'max_depth': 2, 'min_samples_split': 4, 'n_estimators': 100}:
       mean absolute error = 3.96 (worse than the random forest in 2. above)
       test R^2 (rfr_best_score) is 0.673 (comparable to the best per-fold scores in the verbose output)
     Refitting with the params that gave the most negative fold score, {max_depth=8, min_samples_split=4, n_estimators=30} (score=-535.408 in the verbose output):
       mean absolute error = 1.87 (lowest error)
       test R^2 (rfr_neg_score) is -1.295

4. GridSearchCV on the pipeline with column transformation and RandomForestRegressor, with scoring='r2' <br>
     n_est_list = [50, 100]  max_depth_list = [2, 4, 5]  min_samp_list = [2, 4]<br>
       [CV 2/3] END m__max_depth=2, m__min_samples_split=2, m__n_estimators=100;, score=0.488 total time= 1.6min
       [CV 2/3] END m__max_depth=2, m__min_samples_split=4, m__n_estimators=100;, score=0.488 total time= 1.5min
       [CV 1/3] END m__max_depth=5, m__min_samples_split=2, m__n_estimators=50;, score=-125.524 total time= 2.3min - most negative score
     The most negative score is for {max_depth=5, min_samples_split=2, n_estimators=50}, score=-125.524 (from the verbose output). These are the same hyperparameters tried in 2. above for model evaluation.

       Mean absolute error (MAE) across the different attempts:
        Baseline model that always predicts the mean: 9.19 - worst
        Linear Regression: 3.39 - better
        Random Forest Regressor (MAE, R^2 score) values:
         {max_depth=5, min_samples_split=2, n_estimators=50}: 2.56, -5.05
         {max_depth=2, min_samples_split=4, n_estimators=100}: 3.96, 0.673 - best score
         {max_depth=8, min_samples_split=4, n_estimators=30}: 1.87, -1.295 - best MAE
     With Random Forest Regressor, the hyperparameters that GridSearchCV selects as best (by R^2 score) do not give the best mean absolute error. GridSearchCV ranks candidates by the metric passed as scoring, so a search scored on R^2 does not necessarily minimize MAE.
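
If MAE is the metric that matters, one option is to have the search rank candidates by it directly, using the built-in scorer name 'neg_mean_absolute_error' (scikit-learn negates error metrics so that higher is always better). The sketch below uses a small synthetic dataset and a reduced grid as stand-ins (assumptions for illustration, not the notebook's data or results).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the taxi dataset), kept small so it runs quickly.
X_demo, y_demo = make_regression(n_samples=150, n_features=4,
                                 noise=10.0, random_state=1)

# Rank candidates by (negated) MAE instead of R^2.
mae_search = GridSearchCV(RandomForestRegressor(n_estimators=10, random_state=1),
                          {'max_depth': [2, 4, 8]}, cv=3,
                          scoring='neg_mean_absolute_error')
mae_search.fit(X_demo, y_demo)

print(mae_search.best_params_)
# best_score_ is the negated MAE, so flip the sign to read it as an error.
print(-mae_search.best_score_)
```

In the notebook itself, the equivalent change would be passing scoring='neg_mean_absolute_error' when constructing `p_reg`. Whether that selects the deeper trees that gave the lowest test MAE here would need to be confirmed by rerunning the search.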