# Hyperparameter tuning (single-node)

<table>
    <tr>
        <td>
            <img src="../_img/scikit-learn.png" width="300">
        </td>
        <td>
            <img src="../_img/snowflake.png" width="450">
        </td>
    </tr>
</table>

This notebook describes a machine learning training workflow using the famous [NYC Taxi Dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). That dataset contains information on taxi trips in New York City.

In this exercise, you'll load data from Snowflake into a `pandas` data frame and use `scikit-learn` to answer this question:

> based on characteristics that can be known at the beginning of a trip, what tip will this trip earn (as a % of the total fare)?

**NOTE:** This notebook has some cells that can take 3-10 minutes to run. Consider opening [this dask-ml notebook](./hyperparameter-dask.ipynb) and running that while you're waiting for cells in this notebook to complete.

<hr>

## Connect to Snowflake

This example uses data stored in a Snowflake data warehouse that is managed by the team at Saturn Cloud. We've set up a read-only user for use in these examples. If you would like to access data stored in your own Snowflake account, you should set up [Credentials](https://www.saturncloud.io/docs/concepts/credentials/) for your account, user, and password, then set the other connection information accordingly. For more details on Snowflake connection information, see ["Connecting to Snowflake"](https://docs.snowflake.com/en/user-guide/python-connector-example.html#connecting-to-snowflake) in the `snowflake-connector-python` docs.

Note that your Jupyter server must be stopped before you can update environment variables.
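
Before connecting, it can help to confirm that the expected environment variables are actually set, since a missing one otherwise surfaces as a `KeyError`. The sketch below is one way to do that. The variable names are the ones this notebook reads, while `missing_env_vars` is a hypothetical helper written for this illustration, not part of any library.

```python
import os

# Names taken from the connection cell in this notebook.
REQUIRED_VARS = [
    "EXAMPLE_SNOWFLAKE_ACCOUNT",
    "EXAMPLE_SNOWFLAKE_USER",
    "EXAMPLE_SNOWFLAKE_PASSWORD",
    "TAXI_DATABASE",
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All Snowflake connection variables are set.")
```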

In [None]:
import os
import numpy as np

import snowflake.connector

conn_info = {
    "account": os.environ["EXAMPLE_SNOWFLAKE_ACCOUNT"],
    "user": os.environ["EXAMPLE_SNOWFLAKE_USER"],
    "password": os.environ["EXAMPLE_SNOWFLAKE_PASSWORD"],
    "database": os.environ["TAXI_DATABASE"],
}
conn = snowflake.connector.connect(**conn_info)

<hr>

## Load data

This example is designed to run quickly with small resources, so it loads just a single month of taxi data for training.

This example uses Snowflake to handle the hard work of creating new features, then creates a `pandas` data frame with the result.

In [None]:
query = """
SELECT * FROM (
    SELECT
        pickup_taxizone_id,
        dropoff_taxizone_id,
        passenger_count,
        DIV0(tip_amount, fare_amount) AS tip_fraction,
        DAYOFWEEKISO(pickup_datetime) - 1 AS pickup_weekday,
        WEEKOFYEAR(pickup_datetime) AS pickup_weekofyear,
        HOUR(pickup_datetime) AS pickup_hour,
        (pickup_weekday * 24) + pickup_hour AS pickup_week_hour,
        MINUTE(pickup_datetime) AS pickup_minute
    FROM taxi_yellow
    WHERE
        DATE_TRUNC('MONTH', pickup_datetime) = %s
) SAMPLE (1)
"""
taxi = conn.cursor().execute(query, "2019-01-01").fetch_pandas_all()
taxi.columns = taxi.columns.str.lower()
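
To make the derived columns in the query concrete, here is a rough `pandas` sketch of the same calculations on two made-up trips. It is an illustration, not a replacement for the SQL: the sample values are invented, and it assumes Snowflake's `WEEKOFYEAR` follows ISO week numbering, which depends on the session's week settings.

```python
import pandas as pd

# Two made-up trips; column names mirror the query, values are illustrative.
trips = pd.DataFrame(
    {
        "pickup_datetime": pd.to_datetime(
            ["2019-01-07 08:15:00", "2019-01-13 23:59:00"]
        ),
        "fare_amount": [10.0, 0.0],
        "tip_amount": [2.0, 1.0],
    }
)

pickup = trips["pickup_datetime"].dt
fare = trips["fare_amount"]
nonzero_fare = fare != 0

derived = pd.DataFrame(
    {
        # Like DIV0: divide normally, but return 0 where the divisor is 0.
        "tip_fraction": (trips["tip_amount"] / fare.where(nonzero_fare)).where(
            nonzero_fare, 0.0
        ),
        # DAYOFWEEKISO - 1 maps Monday..Sunday to 0..6, same as dayofweek.
        "pickup_weekday": pickup.dayofweek,
        # Assumes ISO week numbering on the Snowflake side.
        "pickup_weekofyear": pickup.isocalendar().week.astype(int),
        "pickup_hour": pickup.hour,
        # Hour within the week: 0 (Monday 00:00) through 167 (Sunday 23:00).
        "pickup_week_hour": pickup.dayofweek * 24 + pickup.hour,
        "pickup_minute": pickup.minute,
    }
)
print(derived)
```

The first trip is a Monday morning and the second is late on a Sunday with a zero fare, so the pair exercises both ends of `pickup_week_hour` and the divide-by-zero case.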

In [None]:
print(f"Num rows: {len(taxi)}, Size: {taxi.memory_usage(deep=True).sum() / 1e6} MB")

In [None]:
numeric_feat = [
    "pickup_weekday",
    "pickup_weekofyear",
    "pickup_hour",
    "pickup_week_hour",
    "pickup_minute",
    "passenger_count",
]
categorical_feat = [
    "pickup_taxizone_id",
    "dropoff_taxizone_id",
]
features = numeric_feat + categorical_feat
y_col = "tip_fraction"

taxi_train = taxi[features + [y_col]].astype(float).fillna(-1)

In [None]:
taxi_train.head()

<hr>

## Run grid search

Now that you've loaded and preprocessed your training data, use the code below to find the best set of hyperparameters for your model. This is done with `scikit-learn`'s `GridSearchCV`.

Setting `n_jobs=-1` tells scikit-learn to use all available cores on this machine to train models.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline(
    steps=[
        (
            "preprocess",
            ColumnTransformer(
                transformers=[
                    ("num", StandardScaler(), numeric_feat),
                    # On scikit-learn >= 1.2, pass `sparse_output=False` instead of `sparse=False`.
                    ("cat", OneHotEncoder(handle_unknown="ignore", sparse=False), categorical_feat),
                ]
            ),
        ),
        ("clf", ElasticNet(max_iter=100)),
    ]
)

params = {
    "clf__l1_ratio": np.arange(0, 1.1, 0.1),
    "clf__alpha": [0, 0.5, 1, 2],
}

grid_search = GridSearchCV(
    pipeline, params, cv=3, n_jobs=-1, verbose=1, scoring="neg_mean_squared_error"
)
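
To get a sense of how much work this search involves, you can count the candidate parameter combinations before fitting anything. The sketch below repeats the `params` grid and `cv` value from the cell above and uses scikit-learn's `ParameterGrid` to enumerate the combinations. The names `cv_folds`, `n_candidates`, and `n_fits` are introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid and fold count as the GridSearchCV cell above.
params = {
    "clf__l1_ratio": np.arange(0, 1.1, 0.1),
    "clf__alpha": [0, 0.5, 1, 2],
}
cv_folds = 3

# Each candidate is one combination of l1_ratio and alpha.
n_candidates = len(ParameterGrid(params))
# Every candidate is fit once per cross-validation fold.
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits, plus one final refit")
```

For the grid in this notebook, that comes to 44 candidates and 132 cross-validation fits, which is why the training cell takes a few minutes on a single machine.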

With the grid search set up, it's time to train some models!

**NOTE:** This will take a few minutes to run. Consider opening [this dask-ml notebook](./hyperparameter-dask.ipynb) and running that while you're waiting for this model to train.

In [None]:
%%time
_ = grid_search.fit(taxi_train[features], taxi_train[y_col])
grid_search.best_score_

In [None]:
grid_search.best_params_
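
Beyond the single best result, `GridSearchCV` keeps per-candidate scores in its `cv_results_` attribute. The sketch below shows one way to view them as a table. To stay self-contained it fits a small search on synthetic data rather than the taxi data, so `demo_search`, the grid values, and the data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, for illustration only (not the taxi data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

demo_search = GridSearchCV(
    ElasticNet(max_iter=1000),
    {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.8]},
    cv=3,
    scoring="neg_mean_squared_error",
)
demo_search.fit(X, y)

# One row per candidate; rank 1 has the best mean score across folds.
results = pd.DataFrame(demo_search.cv_results_)
columns = [
    "param_alpha",
    "param_l1_ratio",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
print(results[columns].sort_values("rank_test_score").to_string(index=False))
```

Because the scoring is `neg_mean_squared_error`, every score is zero or negative, and values closer to zero are better. The same applies to `grid_search.best_score_` in this notebook.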

<hr>

## Save model

By default, `GridSearchCV` automatically refits a model with the best parameters on the full training data and stores it in `best_estimator_`.

In [None]:
import cloudpickle
import os

MODEL_PATH = "models"
# Create the output directory if it does not already exist.
os.makedirs(MODEL_PATH, exist_ok=True)

with open(f"{MODEL_PATH}/elastic_net_scikit.pkl", "wb") as f:
    cloudpickle.dump(grid_search.best_estimator_, f)
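
Reading a saved model back follows the same pattern in reverse. The sketch below round-trips a small stand-in model through a temporary file and checks that predictions are unchanged. It uses the standard-library `pickle` module for both directions so that it runs on its own. Files written by `cloudpickle` can be read back with `pickle.load`, though the model, the data, and the file name here are made up for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model on made-up data; in this notebook the object being saved
# is grid_search.best_estimator_.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should predict exactly what the original did.
same = np.allclose(model.predict(X), restored.predict(X))
print("Predictions match after reload:", same)
```

When loading a pickled model in another environment, it is safest to use the same `scikit-learn` version that was used to save it.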

<hr>

## Calculate metrics on test set

Machine learning training tries to create a model which can produce useful results on new data that it didn't see during training. To test how well we've accomplished that in this example, read in another month of taxi data.

In [None]:
taxi_test = conn.cursor().execute(query, "2019-02-01").fetch_pandas_all()
taxi_test.columns = taxi_test.columns.str.lower()
taxi_test = taxi_test.astype(float).fillna(-1)

`scikit-learn` comes with many functions for calculating metrics that describe how well a model's predictions match the actual values. For a complete list, see ["Metrics and scoring"](https://scikit-learn.org/stable/modules/model_evaluation.html) in the `scikit-learn` docs.

This tutorial evaluates the model with the root mean squared error (RMSE), computed from `mean_squared_error`. Because errors are squared before they are averaged, this metric penalizes large errors more than small errors.
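
A small worked example shows what "penalizes large errors more" means in practice. The sketch below compares two made-up sets of predictions with the same total absolute error: one spreads the error evenly, the other concentrates it in a single prediction. The helper functions `mae` and `rmse` are written here for illustration.

```python
import numpy as np

actual = np.zeros(4)
even_errors = np.array([1.0, 1.0, 1.0, 1.0])    # four errors of size 1
one_big_error = np.array([0.0, 0.0, 0.0, 4.0])  # one error of size 4


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


for name, predicted in [("even errors", even_errors), ("one big error", one_big_error)]:
    print(f"{name}: MAE={mae(actual, predicted):.2f}, RMSE={rmse(actual, predicted):.2f}")
```

Both cases have a mean absolute error of 1, but the RMSE doubles from 1 to 2 when the error is concentrated in one prediction.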

In [None]:
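import numpy as np
from sklearn.metrics import mean_squared_error

preds = grid_search.predict(taxi_test[features])
# Take the square root of the MSE to report RMSE, in the same units as tip_fraction.
# (The `squared=False` argument is deprecated in newer scikit-learn releases.)
np.sqrt(mean_squared_error(taxi_test[y_col], preds))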

<hr>

## Next Steps

In this tutorial, you learned how to use Snowflake for feature engineering and `scikit-learn` for hyperparameter optimization, with the data loaded into `pandas` using `snowflake-connector-python`.

Next, try [this dask-ml + Dask notebook](./xgboost-dask.ipynb) to learn how to use Dask to train larger models and/or reduce training time.

<hr>