# Hyperparameter tuning

## Dask

<img src="https://docs.dask.org/en/latest/_images/dask_horizontal.svg" width="400">

**Hardware**: 10 nodes - r5.8xlarge's (32 CPU, 256 GB RAM each)

In [1]:
from ml_utils import MLUtils

ml_utils = MLUtils(
    ml_task='tip',
    tool='dask',
    model='elastic_net',
)

In [2]:
from dask.distributed import Client
from dask_saturn import SaturnCluster

cluster = SaturnCluster(n_workers=10, scheduler_size='xlarge', worker_size='8xlarge', nthreads=32)
client = Client(cluster)
cluster

[2020-08-08 16:05:58] INFO - dask-saturn | Cluster is ready


VBox(children=(HTML(value='<h2>SaturnCluster</h2>'), HBox(children=(HTML(value='\n<div>\n  <style scoped>\n   …

# Load data and feature engineering

In [3]:
import pandas as pd
import numpy as np
import dask.dataframe as dd

In [4]:
%%time
tip_train = dd.read_parquet(f'{ml_utils.taxi_path}/data/ml/tip_train_sample', engine='pyarrow')
len(tip_train)

CPU times: user 88.6 ms, sys: 4.02 ms, total: 92.6 ms
Wall time: 2.54 s


10994502

<br>
Let's take the same sample we used in the single node scikit example

In [5]:
sample = tip_train.sample(frac=0.1, replace=False, random_state=42)
len(sample)

1099448

In [6]:
sample = sample.persist()

# Run grid search

- use `dask-ml` preprocessing and grid search classes
- still using `sklearn.linear_model.ElasticNet` for model fitting
- we won't `refit` with best model, because we want to use `dask_ml.wrappers.ParallelPostFit` to use dask to parallelize predictions

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from dask_ml.compose import ColumnTransformer
from dask_ml.preprocessing import StandardScaler, DummyEncoder, Categorizer
from dask_ml.model_selection import GridSearchCV

features = ml_utils.tip_vars.features
y_col = ml_utils.tip_vars.y_col

pipeline = Pipeline(steps=[
    ('categorize', Categorizer(columns=ml_utils.tip_vars.categorical_feat)),
    ('onehot', DummyEncoder(columns=ml_utils.tip_vars.categorical_feat)),
    ('scale', ColumnTransformer(transformers=[('num', StandardScaler(), ml_utils.tip_vars.numeric_feat)])),
    ('clf', ElasticNet(normalize=False, max_iter=100)),
])

params = ml_utils.tip_vars.elastic_net_grid_search_params

grid_search = GridSearchCV(pipeline, params, cv=3, scoring='neg_mean_squared_error', refit=False)

In [16]:
%%time
with ml_utils.time_fit():
    _ = grid_search.fit(sample[features], sample[y_col])

('score-95610b837428c16a66fe5986591b4bf1', 106, 1) has failed... retrying


CPU times: user 1.1 s, sys: 87.9 ms, total: 1.19 s
Wall time: 1min


Get best_params manually because we set `refit=False`

In [10]:
best_params = (pd.DataFrame(grid_search.cv_results_)
               .sort_values('mean_test_score', ascending=False)
               .loc[0, 'params'])
best_params

{'clf__alpha': 0, 'clf__l1_ratio': 0.0}

<br>

This wrapper allows us to parallelize predictions using Dask. The `fit` step is not affected

In [11]:
%%time
from dask_ml.wrappers import ParallelPostFit

best_estimator = ParallelPostFit(estimator=pipeline)
_ = best_estimator.fit(sample[features], sample[y_col])

CPU times: user 345 ms, sys: 255 ms, total: 600 ms
Wall time: 1.41 s


## Save model

Grab the sklearn estimator out of the dask wrapper

In [12]:
sklearn_estimator = best_estimator.estimator.named_steps['clf']
ml_utils.write_model(sklearn_estimator)

uploading model to 's3://saturn-titan/nyc-taxi/ml_results/models/tip__dask__elastic_net.pkl'
successfully uploaded model


## Predict on test set

And calculate metrics. Save predictions and metrics to S3.

Notice that the below cell runs super fast, because it hasn't actually done anything due to Dask's lazy evaluation

In [13]:
%%time

tip_test = dd.read_parquet(f'{ml_utils.taxi_path}/data/ml/tip_test', engine='pyarrow')
preds = tip_test[['id', y_col]].copy()
preds.columns = ['id', 'actual']
preds = preds.assign(predicted=best_estimator.predict(tip_test[features]))

CPU times: user 154 ms, sys: 709 µs, total: 155 ms
Wall time: 601 ms


We can `persist` the DataFrame to compute all the predictions and store in RAM

In [14]:
%%time
from dask.distributed import wait
preds = preds.persist()
_ = wait(preds)

preds.head()

CPU times: user 65.4 ms, sys: 15.9 ms, total: 81.3 ms
Wall time: 8.73 s


Unnamed: 0,id,actual,predicted
0,2e8f402e4dc44f2fae8b9328a237c4d2,0.117647,0.218926
1,5f067a4121244f42bf460867c23b39c9,0.216842,0.218926
2,60e8442d3d434df4959261905a279f55,0.15,0.218926
3,2d1537ce2ed347778e078eaee7eacd44,0.10625,0.218926
4,13bb8a9ecbd04b559b7b9e40904026b0,0.0,0.218926


In [13]:
%%time
ml_utils.write_predictions(preds)

CPU times: user 80.2 ms, sys: 6.66 ms, total: 86.8 ms
Wall time: 1min 25s


<br>

If the `preds` DataFrame was _really_ big, you would want to use `dask_ml.metrics.mean_squared_error`. Here, the `preds` columns are pulled down to the client because we're using `sklearn.metrics.mean_squared_error`.

In [17]:
%%time
from sklearn.metrics import mean_squared_error

rmse = mean_squared_error(preds.actual, preds.predicted, squared=False)
ml_utils.write_metric_df('rmse', rmse)

CPU times: user 619 ms, sys: 940 ms, total: 1.56 s
Wall time: 2.5 s


Unnamed: 0,ml_task,tool,model,metric,value,fit_seconds
0,tip,dask,elastic_net,rmse,0.20726,60.090035
