# Hyperparameter tuning (single-node with scikit-learn)

<img src="https://saturn-public-assets.s3.us-east-2.amazonaws.com/example-resources/scikit-learn.png" width="300">

This notebook describes a machine learning training workflow using the famous [NYC Taxi Dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). That dataset contains information on taxi trips in New York City.

In this exercise, you'll load data into a `pandas` data frame and use `scikit-learn` to answer this question:

> based on characteristics that can be known at the beginning of a trip, what tip will this trip earn (as a % of the total fare)?

**NOTE:** This notebook has some cells that can take 3-10 minutes to run. Consider opening [this dask-ml notebook](./hyperparameter-dask.ipynb) and running that while you're waiting for cells in this notebook to complete.

<hr>

## Load data

This example is designed to run quickly with small, relatively inexpensive resources. So let's just load a single month of taxi data for training.

In [None]:
import pandas as pd

taxi = pd.read_csv(
    "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
).sample(frac=0.01, replace=False)

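The code below computes the size of this dataset in memory. A full month is about 7.6 million rows and 1.5 GB. Because the cell above keeps only a 1% random sample, expect the numbers printed here to be roughly one hundredth of that.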

In [None]:
print(f"Num rows: {len(taxi)}, Size: {taxi.memory_usage(deep=True).sum() / 1e6} MB")
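The `deep=True` argument matters for columns that hold Python objects such as strings. The sketch below illustrates the difference on a tiny made-up data frame; the `flag` column and its values are hypothetical and only stand in for the text columns in the taxi data.

```python
import pandas as pd

# Small, made-up frame: one numeric column and one column of Python strings.
demo = pd.DataFrame(
    {
        "fare_amount": [7.0, 14.5, 22.0],
        "flag": pd.Series(["N", "N", "Y"], dtype=object),  # hypothetical column
    }
)

# deep=False counts only the 8-byte references stored in an object column.
shallow_bytes = demo.memory_usage(deep=False).sum()

# deep=True also counts the Python string objects those references point to.
deep_bytes = demo.memory_usage(deep=True).sum()

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")
```

Without `deep=True`, the reported size can noticeably understate how much memory a data frame with text columns really uses.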

You can examine the structure of the data with `pandas` commands:

`.head()` = view the first few rows

In [None]:
taxi.head()

`.dtypes` = list all the columns and the type of data in them

In [None]:
taxi.dtypes

<hr>

## Prep for training

Before training a model, we need to transform this dataset into a format that's better-suited to the research question. The function below does that with `pandas` operations.

<details><summary>(click here to learn why data scientists do this)</summary>

**Compute the Target**

The raw data don't contain a column that cleanly describes the tip as a percentage of the total fare. So we need to add one!

**Add Features**

Giving a machine learning model a richer description of each training observation improves its ability to describe the relationship between those observations' characteristics and the target. These characteristics are called "features".

For example, instead of giving a model a raw timestamp, it can be valuable to provide multiple derived characteristics like hour of the day and day of the week. It's plausible, for example, that weekend rides might have a different distribution of tips because they tend to be for leisure, whereas weekday rides might be mostly people travelling for work.

**Remove Unused Features**

If the goal is to produce a model that could predict the tip at the start of a ride, then characteristics that can only be known once the ride is over have to be excluded. For example, you can't know the dropoff time or the type of payment until a ride has concluded.

Such features should be dropped before training.
    
</details>

In [None]:
numeric_feat = [
    "pickup_weekday",
    "pickup_weekofyear",
    "pickup_hour",
    "pickup_week_hour",
    "pickup_minute",
    "passenger_count",
]
categorical_feat = [
    "PULocationID",
    "DOLocationID",
]
features = numeric_feat + categorical_feat
target_col = "tip_fraction"


def prep_df(df: pd.DataFrame, target_col: str) -> pd.DataFrame:
    """
    Prepare a raw taxi dataframe for training.
        * computes the target ('tip_fraction')
        * adds features
        * removes unused features
    """
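    # Avoid divide-by-zero. The copy keeps the column assignments below from
    # writing into a filtered view (pandas' SettingWithCopyWarning).
    df = df[df.fare_amount > 0].copy()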
    df[target_col] = df.tip_amount / df.fare_amount

    df["pickup_weekday"] = df.tpep_pickup_datetime.dt.weekday
    df["pickup_weekofyear"] = df.tpep_pickup_datetime.dt.isocalendar().week
    df["pickup_hour"] = df.tpep_pickup_datetime.dt.hour
    df["pickup_week_hour"] = (df.pickup_weekday * 24) + df.pickup_hour
    df["pickup_minute"] = df.tpep_pickup_datetime.dt.minute
    df = df[features + [target_col]].astype(float).fillna(-1)

    return df

Run the code below to get a new data frame, `taxi_train`, that can be used directly for model training.

In [None]:
target_col = "tip_fraction"
taxi_train = prep_df(taxi, target_col)

`taxi_train` is a `pandas` dataframe that will be passed in to a machine learning model. Before going further, check the first few rows of the dataset to make sure that the features look reasonable.

In [None]:
taxi_train.head()
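To judge whether the time-based features look reasonable, it can help to work one example by hand. The sketch below uses a single made-up pickup time (it is not taken from the dataset) and derives the same kind of features that `prep_df` computes.

```python
import pandas as pd

# One made-up pickup time: Saturday, 5 January 2019, 14:30.
pickups = pd.Series(pd.to_datetime(["2019-01-05 14:30:00"]))

weekday = pickups.dt.weekday                  # Monday=0 ... Sunday=6
hour = pickups.dt.hour                        # 0 ... 23
week_hour = weekday * 24 + hour               # hours since Monday 00:00
week_of_year = pickups.dt.isocalendar().week  # ISO week number

print(int(weekday[0]), int(hour[0]), int(week_hour[0]), int(week_of_year[0]))
# prints: 5 14 134 1
```

A Saturday afternoon pickup lands at hour 134 of the week (5 * 24 + 14), so `pickup_week_hour` should always fall between 0 and 167.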

Now that the dataframe has been processed, check its size in memory again.

In [None]:
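print(
    f"Num rows: {len(taxi_train)}, Size: {round(taxi_train.memory_usage(deep=True).sum() / 1e6, 2)} MB"
)

Removing unused columns shrinks the training data considerably. On a full month of trips it comes to about 0.55 GB, roughly one third the size of the raw data. The 1% sample used here should shrink by a similar proportion, which you can check against the size printed earlier.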

<hr>

## Run grid search

Now that you've loaded and preprocessed your training data, use the code below to find the best set of hyperparameters for your model. This is done with `scikit-learn`'s `GridSearchCV`.

Setting `n_jobs=-1` tells scikit-learn to use all available cores on this machine to train models.

In [None]:
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline(
    steps=[
        (
            "preprocess",
            ColumnTransformer(
                transformers=[
                    ("num", StandardScaler(), numeric_feat),
                    # `sparse_output` is the scikit-learn >= 1.2 name; older releases call it `sparse`
                    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_feat),
                ]
            ),
        ),
        # `normalize=False` was the default and newer scikit-learn no longer accepts it, so it is omitted
        ("clf", ElasticNet(max_iter=100)),
    ]
)

params = {
    "clf__l1_ratio": np.arange(0, 1.1, 0.1),
    "clf__alpha": [0, 0.5, 1, 2],
}

grid_search = GridSearchCV(
    pipeline, params, cv=3, n_jobs=-1, verbose=1, scoring="neg_mean_squared_error"
)
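To get a feel for how much work this grid search involves, you can count the models it will train before running it. The sketch below rebuilds a grid of the same shape (11 `l1_ratio` values and 4 `alpha` values) and uses `ParameterGrid` to count the combinations. The variable names here are only for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same shape as the grid above: 11 l1_ratio values x 4 alpha values.
demo_params = {
    "clf__l1_ratio": np.linspace(0, 1, 11),
    "clf__alpha": [0, 0.5, 1, 2],
}
cv_folds = 3  # matches cv=3 in the GridSearchCV call

n_candidates = len(ParameterGrid(demo_params))
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
# prints: 44 candidates x 3 folds = 132 fits
```

With the default `refit=True`, `GridSearchCV` trains one additional model on the full training data once the best combination is known.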

With the grid search set up, it's time to train some models!

**NOTE:** This will take a few minutes to run. Consider opening [this dask-ml notebook](./hyperparameter-dask.ipynb) and running that while you're waiting for this model to train.

In [None]:
%%time
_ = grid_search.fit(taxi_train[features], taxi_train[target_col])
grid_search.best_score_

In [None]:
grid_search.best_params_
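Because the search was scored with `neg_mean_squared_error`, `best_score_` is a negated mean squared error: values closer to zero are better. The sketch below shows one way to turn that score into an error on the scale of the target and to inspect every combination that was tried. It uses a small synthetic dataset and a hypothetical `demo_search` object so that it runs in seconds; it is not the taxi model.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the taxi data so the sketch runs quickly.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

demo_search = GridSearchCV(
    ElasticNet(max_iter=10_000),
    {"alpha": [0.1, 1.0], "l1_ratio": [0.2, 0.8]},
    cv=3,
    scoring="neg_mean_squared_error",
)
demo_search.fit(X, y)

# best_score_ is the negated mean squared error, so flip the sign first.
best_mse = -demo_search.best_score_
best_rmse = best_mse ** 0.5
print(f"best CV MSE: {best_mse:.2f}, best CV RMSE: {best_rmse:.2f}")

# One row per hyperparameter combination, best-ranked first.
results = pd.DataFrame(demo_search.cv_results_).sort_values("rank_test_score")
print(results[["param_alpha", "param_l1_ratio", "mean_test_score"]])
```

The same two steps should work on the `grid_search` object from this notebook once it has been fit.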

<hr>

## Save model

By default, `GridSearchCV` automatically refits a model with the best parameters on the full training data and stores it in `best_estimator_`.

In [None]:
import cloudpickle
import os

MODEL_PATH = "models"
# Create the output directory if it doesn't already exist
os.makedirs(MODEL_PATH, exist_ok=True)

with open(f"{MODEL_PATH}/elastic_net_scikit.pkl", "wb") as f:
    cloudpickle.dump(grid_search.best_estimator_, f)
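A saved model is only useful if it can be loaded again. The sketch below shows a save-and-load round trip under a few assumptions: it uses the standard library's `pickle` in place of `cloudpickle` to stay dependency-free, a tiny stand-in model instead of `grid_search.best_estimator_`, and a temporary directory with a hypothetical file name.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import ElasticNet

# Tiny stand-in model; in the notebook this would be grid_search.best_estimator_.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 2.0, 3.0]
model = ElasticNet(alpha=0.1).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_file = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical name

    # Write the fitted model to disk...
    with open(model_file, "wb") as f:
        pickle.dump(model, f)

    # ...and read it back into a new object.
    with open(model_file, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = bool(np.allclose(restored.predict(X), model.predict(X)))
print(same_predictions)
```

Only load pickle files from sources you trust, and ideally load them with the same Python and `scikit-learn` versions that were used to save them.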

<hr>

## Calculate metrics on test set

Machine learning training tries to create a model which can produce useful results on new data that it didn't see during training. To test how well we've accomplished that in this example, read in another month of taxi data.

In [None]:
taxi_test = pd.read_csv(
    "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
).sample(frac=0.01, replace=False)

Before creating predictions on this new dataset, it has to be transformed in exactly the way that the original training data were prepared. Thankfully you've already wrapped that transformation logic in a function!

In [None]:
taxi_test = prep_df(taxi_test, target_col=target_col)

`scikit-learn` comes with many functions for calculating metrics that describe how well a model's predictions match the actual values. For a complete list, see ["Metrics and scoring"](https://scikit-learn.org/stable/modules/model_evaluation.html) in the `scikit-learn` docs.

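This tutorial uses the root mean squared error (RMSE), the square root of `mean_squared_error`, to evaluate the model. This metric penalizes large errors more than small errors and is expressed in the same units as the target.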

In [None]:
from sklearn.metrics import mean_squared_error

preds = grid_search.predict(taxi_test[features])
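# Root mean squared error. Taking the square root here avoids the
# `squared=False` argument, which is deprecated in recent scikit-learn releases.
mean_squared_error(taxi_test[target_col], preds) ** 0.5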
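To see concretely why squared-error metrics penalize large errors more, compare two made-up sets of predictions with the same total absolute error. The numbers below are invented for illustration and have nothing to do with the taxi data.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [0.0, 0.0, 0.0, 0.0]
candidates = {
    "many small": [1.0, 1.0, 1.0, 1.0],  # four errors of size 1
    "one large": [0.0, 0.0, 0.0, 4.0],   # a single error of size 4
}

scores = {}
for name, y_pred in candidates.items():
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    scores[name] = (mae, mse)
    print(f"{name}: MAE={mae:.1f}, MSE={mse:.1f}, RMSE={mse ** 0.5:.1f}")
# prints:
# many small: MAE=1.0, MSE=1.0, RMSE=1.0
# one large: MAE=1.0, MSE=4.0, RMSE=2.0
```

Both sets of predictions have the same mean absolute error, but the one with a single large miss has four times the mean squared error.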

<hr>

## Next steps

In this tutorial, you learned how to use `pandas` for feature engineering and `scikit-learn` for hyperparameter optimization.

Next, try [this dask-ml + Dask notebook](./xgboost-dask.ipynb) to learn how to use Dask to train larger models and/or reduce training time.

<hr>