# Hyperparameter tuning

## Dask

**Hardware**: 10 nodes, each an r5.8xlarge (32 CPUs, 256 GB RAM)

In [21]:
from ml_utils import MLUtils

ml_utils = MLUtils(
    ml_task='tip',
    tool='dask',
    model='elastic_net',
)

In [2]:
from dask.distributed import Client
from dask_saturn import SaturnCluster

cluster = SaturnCluster(n_workers=10, scheduler_size='xlarge', worker_size='8xlarge', nthreads=32)
client = Client(cluster)
cluster

[2020-08-06 23:25:08] INFO - dask-saturn | Cluster is ready



# Load data and feature engineering

In [3]:
import pandas as pd
import numpy as np
import dask.dataframe as dd

In [4]:
%%time
tip_train = dd.read_parquet(f'{ml_utils.taxi_path}/data/ml/tip_train_sample', engine='pyarrow')
len(tip_train)

CPU times: user 70.9 ms, sys: 9.36 ms, total: 80.3 ms
Wall time: 2.42 s


10994913

In [5]:
tip_train.head()

| | id | pickup_taxizone_id | dropoff_taxizone_id | pickup_weekday | pickup_weekofyear | pickup_hour | pickup_minute | pickup_week_hour | passenger_count | tip_fraction |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28a18fa5fa2f44f29ffd98fc9159829d | 238.0 | 132.0 | 0 | 29 | 7 | 19 | 7 | 1.0 | 0.199616 |
| 1 | a6578145ff824f5fb94e90457b040883 | 236.0 | 246.0 | 6 | 28 | 9 | 16 | 153 | 1.0 | 0.130435 |
| 2 | 91726ecac3b44e8bbfea68d725f35556 | 90.0 | 148.0 | 6 | 28 | 22 | 44 | 166 | 6.0 | 0.166667 |
| 3 | a3b0d14ad1644dd6b90f1af6be002e55 | 141.0 | 186.0 | 6 | 28 | 9 | 34 | 153 | 1.0 | 0.152299 |
| 4 | 70aa5a0c6bc147dd8553e201b63ba0fe | 100.0 | 142.0 | 6 | 28 | 22 | 7 | 166 | 1.0 | 0.169231 |


Let's take the same sample we used in the single-node scikit-learn example.

In [6]:
sample = tip_train.sample(frac=0.1, replace=False, random_state=42)
len(sample)

1099492

# Run grid search

- Use `dask-ml` preprocessing and grid search classes.
- Still use `sklearn.linear_model.ElasticNet` for model fitting.
- Set `refit=False`: instead of letting the grid search refit the best model, we fit it ourselves inside `dask_ml.wrappers.ParallelPostFit` so that Dask parallelizes predictions.

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from dask_ml.compose import ColumnTransformer
from dask_ml.preprocessing import StandardScaler, DummyEncoder, Categorizer
from dask_ml.model_selection import GridSearchCV

features = ml_utils.tip_vars.features
y_col = ml_utils.tip_vars.y_col

pipeline = Pipeline(steps=[
    ('categorize', Categorizer(columns=ml_utils.tip_vars.categorical_feat)),
    ('onehot', DummyEncoder(columns=ml_utils.tip_vars.categorical_feat)),
    ('scale', ColumnTransformer(transformers=[('num', StandardScaler(), ml_utils.tip_vars.numeric_feat)])),
    ('clf', ElasticNet(normalize=False, max_iter=100)),
])

params = ml_utils.tip_vars.elastic_net_grid_search_params

grid_search = GridSearchCV(pipeline, params, cv=3, scoring='neg_mean_squared_error', refit=False)
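
To make the grid search setup above more concrete, here is a minimal single-machine sketch. It swaps in scikit-learn's own `GridSearchCV` and `StandardScaler` for the `dask-ml` classes, runs on synthetic data, and uses a made-up `clf__alpha` grid, since the contents of `ml_utils.tip_vars.elastic_net_grid_search_params` are not shown in this notebook. The point is the `step__parameter` naming convention for pipeline parameters and the effect of `refit=False`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption, not the taxi dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.0]) + rng.normal(scale=0.1, size=200)

pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("clf", ElasticNet(max_iter=100)),
])

# Hypothetical grid: keys are "<step name>__<parameter name>"
params = {"clf__alpha": [0.01, 0.1, 0.5, 1.0]}

grid_search = GridSearchCV(
    pipeline, params, cv=3, scoring="neg_mean_squared_error", refit=False
)
grid_search.fit(X, y)

# One entry per parameter combination; no refitted model is kept
candidates = grid_search.cv_results_["params"]
```

With `refit=False` the search only records scores per candidate in `cv_results_`, so the final model has to be fitted separately.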

In [8]:
%%time
_ = grid_search.fit(sample[features], sample[y_col])

CPU times: user 34.1 ms, sys: 4.84 ms, total: 39 ms
Wall time: 10.2 s


Because we set `refit=False`, we get the best parameters manually from `cv_results_`.

In [9]:
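# Sort so the best score comes first, then select the first row by position
# (.loc[0] would select the row labelled 0, regardless of the sort order)
best_params = (pd.DataFrame(grid_search.cv_results_)
               .sort_values('mean_test_score', ascending=False)
               .iloc[0]['params'])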
best_params

{'clf__alpha': 0.5}
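
One detail worth checking when picking the best row by hand: after `sort_values`, the index labels travel with the rows, so label-based and position-based selection can disagree. The sketch below uses a small made-up `cv_results`-style dictionary (the scores are invented for illustration) to show the difference.

```python
import pandas as pd

# Hypothetical results; the scores are invented for illustration
cv_results = {
    "params": [{"clf__alpha": 0.1}, {"clf__alpha": 0.5}, {"clf__alpha": 1.0}],
    "mean_test_score": [-0.0030, -0.0027, -0.0029],
}

results = pd.DataFrame(cv_results).sort_values("mean_test_score", ascending=False)

# Position-based: first row after sorting, i.e. the highest score
best_params = results.iloc[0]["params"]

# Label-based: the row whose index label is 0, i.e. the first grid entry
first_grid_entry = results.loc[0, "params"]
```

Here `best_params` is the `clf__alpha = 0.5` candidate, while `first_grid_entry` is the `clf__alpha = 0.1` candidate that happened to be listed first in the grid.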

The `ParallelPostFit` wrapper lets us parallelize predictions with Dask. It does not affect the `fit` step.

In [10]:
%%time
from dask_ml.wrappers import ParallelPostFit

# Apply the best grid-search parameters before the final fit
best_estimator = ParallelPostFit(estimator=pipeline.set_params(**best_params))
_ = best_estimator.fit(sample[features], sample[y_col])

CPU times: user 359 ms, sys: 260 ms, total: 619 ms
Wall time: 8.53 s
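
The idea behind parallelizing only the post-fit step can be sketched without Dask: fit once on a single machine, then split new data into chunks and predict each chunk independently. The example below is a stand-in using a thread pool and synthetic data. It is not how `dask-ml` implements `ParallelPostFit`, only an illustration of why prediction parallelizes so easily.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic data (an assumption, not the taxi dataset)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = X_train @ np.array([1.0, -1.0])

# Ordinary single-machine fit
model = ElasticNet(alpha=0.01).fit(X_train, y_train)

# Prediction is row-wise independent, so chunks can be scored separately
X_new = rng.normal(size=(1000, 2))
chunks = np.array_split(X_new, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    chunk_preds = list(pool.map(model.predict, chunks))

parallel_preds = np.concatenate(chunk_preds)
```

Because each row's prediction depends only on that row and the fitted coefficients, the chunked result matches a single `model.predict(X_new)` call.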


## Save model

Grab the fitted scikit-learn `ElasticNet` step out of the Dask wrapper.

In [11]:
sklearn_estimator = best_estimator.estimator.named_steps['clf']
ml_utils.write_model(sklearn_estimator)

## Predict on test set

Predict on the test set, calculate metrics, and save both predictions and metrics to S3.

Notice that the cell below runs very quickly: because Dask evaluates lazily, it has not computed any predictions yet.

In [16]:
%%time

tip_test = dd.read_parquet(f'{ml_utils.taxi_path}/data/ml/tip_test', engine='pyarrow')
preds = tip_test[['id', y_col]].copy()
preds.columns = ['id', 'actual']
preds = preds.assign(predicted=best_estimator.predict(tip_test[features]))

CPU times: user 123 ms, sys: 0 ns, total: 123 ms
Wall time: 684 ms


We can `persist` the DataFrame to compute all the predictions and keep them in the workers' memory.

In [17]:
%%time
from dask.distributed import wait
preds = preds.persist()
_ = wait(preds)

preds.head()

CPU times: user 70.6 ms, sys: 0 ns, total: 70.6 ms
Wall time: 6.57 s


| | id | actual | predicted |
|---|---|---|---|
| 0 | 8e8109754e3e4cb7879c4e9ee216d58d | 0.097087 | 0.153217 |
| 1 | a30e7c87866f417ab15dee5617f272a0 | 0.166667 | 0.153217 |
| 2 | 1a7a611d0809489d99a5120727e0476a | 0.12 | 0.153217 |
| 3 | 736e84ca12a640cc858c210bd58f744c | 0.089474 | 0.153217 |
| 4 | f2c24299d9a34ce986b7a271c5cc80b2 | 0.0 | 0.153217 |


In [18]:
%%time
ml_utils.write_predictions(preds)

CPU times: user 37.4 ms, sys: 0 ns, total: 37.4 ms
Wall time: 3.06 s


If the `preds` DataFrame were _really_ big, you would want to use `dask_ml.metrics.mean_squared_error` instead. Here we use `sklearn.metrics.mean_squared_error`, so the `preds` columns are pulled down to the client.

In [19]:
%%time
from sklearn.metrics import mean_squared_error

rmse = mean_squared_error(preds.actual, preds.predicted, squared=False)
ml_utils.write_metric_df('rmse', rmse)

CPU times: user 558 ms, sys: 933 ms, total: 1.49 s
Wall time: 2.28 s


| | ml_task | tool | model | metric | value |
|---|---|---|---|---|---|
| 0 | tip | dask | elastic_net | rmse | 0.052227 |
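
For reference, RMSE is just the square root of the mean squared error, which is what `squared=False` requests. The sketch below computes it with NumPy on the five `preds` rows displayed earlier, so its value will not match the full test-set figure. It can also serve as a fallback if your scikit-learn version no longer accepts `squared=False`; recent releases are understood to provide `root_mean_squared_error` instead, so check your installed version.

```python
import numpy as np

# The five rows of `preds` shown in the head() output above
actual = np.array([0.097087, 0.166667, 0.12, 0.089474, 0.0])
predicted = np.full_like(actual, 0.153217)

# Mean of squared residuals, then its square root
mse = np.mean((actual - predicted) ** 2)
rmse = np.sqrt(mse)
```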
