# Hyperparameter tuning

## Plain scikit-learn

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Scikit_learn_logo_small.svg/1200px-Scikit_learn_logo_small.svg.png" width="300">

**Hardware**: r5.8xlarge (32 CPU, 256 GB RAM)

In [1]:
from ml_utils import MLUtils

ml_utils = MLUtils(
    ml_task='tip',
    tool='scikit',
    model='elastic_net',
)

## Load data and feature engineering

In [2]:
import numpy as np
import pandas as pd

In [3]:
%%time
tip_train = ml_utils.read_parquet_dir(f'{ml_utils.taxi_path}/data/ml/tip_train_sample')
tip_train.shape

CPU times: user 5.91 s, sys: 5.24 s, total: 11.2 s
Wall time: 39.6 s


(10994502, 12)

In [4]:
tip_train.head()

| | id | pickup_datetime | dropoff_datetime | pickup_taxizone_id | dropoff_taxizone_id | pickup_weekday | pickup_weekofyear | pickup_hour | pickup_minute | pickup_week_hour | passenger_count | tip_fraction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 326fdd4d9a1843488a38d16a3bb6278b | 2016-07-16 18:24:40 | 2016-07-16 18:49:56 | 237.0 | 249.0 | 5 | 28 | 18 | 24 | 138 | 1.0 | 0.114286 |
| 1 | d58919163315476fbd3269d13c31173c | 2016-07-17 06:17:08 | 2016-07-17 06:53:45 | 132.0 | 239.0 | 6 | 28 | 6 | 17 | 150 | 1.0 | 0.224423 |
| 2 | caa9550ccbda4c1690514a10012e22ef | 2016-07-16 17:13:58 | 2016-07-16 17:21:27 | 161.0 | 163.0 | 5 | 28 | 17 | 13 | 137 | 1.0 | 0.221429 |
| 3 | 812739604c0f474995830e5bb0c5d272 | 2016-07-16 02:23:48 | 2016-07-16 03:03:08 | 148.0 | 75.0 | 5 | 28 | 2 | 23 | 122 | 1.0 | 0.208254 |
| 4 | 76ecb54bb45c49d293e81588a4e09720 | 2016-07-17 21:32:38 | 2016-07-17 22:00:32 | 138.0 | 87.0 | 6 | 28 | 21 | 32 | 165 | 5.0 | 0.235584 |


Let's take a 10% sample so the grid search, parallelized across all cores with `n_jobs=-1`, finishes in a reasonable time (on the full 11 million rows it would take quite a while).

In [5]:
sample = tip_train.sample(frac=0.1, replace=False, random_state=42)
sample.shape

(1099450, 12)

## Run grid search

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import GridSearchCV

features = ml_utils.tip_vars.features
y_col = ml_utils.tip_vars.y_col

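pipeline = Pipeline(steps=[
    ('preprocess', ColumnTransformer(transformers=[
        # Standardize numeric features; one-hot encode categorical ones
        ('num', StandardScaler(), ml_utils.tip_vars.numeric_feat),
        # Note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse=False), ml_utils.tip_vars.categorical_feat),
    ])),
    # `normalize=False` was the default and was removed in scikit-learn 1.2, so it is omitted
    ('clf', ElasticNet(max_iter=100)),
])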

params = ml_utils.tip_vars.elastic_net_grid_search_params

grid_search = GridSearchCV(pipeline, params, cv=3, n_jobs=-1, scoring='neg_mean_squared_error')
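
The contents of `elastic_net_grid_search_params` are not shown in this notebook. As a rough sketch of what such a grid looks like, the example below runs a small search on synthetic data. The data, the `toy_*` names, and the candidate values are all assumptions made for illustration; only the `clf__alpha` and `clf__l1_ratio` key names match what the notebook reports later. Keys follow the `<step name>__<parameter name>` convention so that `GridSearchCV` can route each value to the right pipeline step.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, used only for illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = X @ np.array([0.5, -0.2, 0.0, 0.1]) + rng.normal(scale=0.1, size=300)

toy_pipeline = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('clf', ElasticNet(max_iter=1000)),
])

# Hypothetical grid: each key is '<step name>__<parameter name>'.
toy_params = {
    'clf__alpha': [0.01, 0.1, 1.0],
    'clf__l1_ratio': [0.1, 0.5, 0.9],
}

toy_search = GridSearchCV(toy_pipeline, toy_params, cv=3,
                          scoring='neg_mean_squared_error')
toy_search.fit(X, y)

# 3 alphas x 3 l1_ratios = 9 candidates, each fit once per CV fold.
n_candidates = len(toy_search.cv_results_['params'])
print(n_candidates, toy_search.best_params_)
```

With `cv=3`, every candidate is fit three times, so the cost of the search grows with the product of the grid sizes and the number of folds.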

In [9]:
%%time
with ml_utils.time_fit():
    _ = grid_search.fit(sample[features], sample[y_col])
grid_search.best_score_

CPU times: user 3min 34s, sys: 18.5 s, total: 3min 53s
Wall time: 2h 17min 6s



-0.0318456308524957
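
Because the search uses `scoring='neg_mean_squared_error'`, `best_score_` is the negated mean squared error averaged over the cross-validation folds. Negating it and taking the square root puts it on the same scale as the test-set RMSE reported at the end of the notebook. A minimal sketch of the arithmetic, where `cv_mse` and `cv_rmse` are names introduced here:

```python
import math

best_score = -0.0318456308524957  # best_score_ reported above (negated MSE)

cv_mse = -best_score          # undo the sign flip applied by the scorer
cv_rmse = math.sqrt(cv_mse)   # root of the mean squared error, about 0.178

print(f'CV RMSE: {cv_rmse:.4f}')
```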

In [10]:
grid_search.best_params_

{'clf__alpha': 0.5, 'clf__l1_ratio': 0.0}

## Save model

`GridSearchCV` automatically refits the pipeline with the best parameters on all the data passed to `fit` (here, the 10% sample) and stores the result in `best_estimator_`.

In [11]:
ml_utils.write_model(grid_search.best_estimator_)

uploading model to 's3://saturn-titan/nyc-taxi/ml_results/models/tip__scikit__elastic_net.pkl'
successfully uploaded model
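
The refit behavior can be checked on a small example. The sketch below uses synthetic data and a bare `ElasticNet` rather than the notebook's pipeline, and confirms that `predict` on the search object delegates to `best_estimator_`. It also round-trips the estimator through `pickle`. That part is an assumption: the internals of `ml_utils.write_model` are not shown here, and the `.pkl` suffix in the upload path only suggests pickle.

```python
import pickle

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=200)

# refit=True is the default, so the best candidate is refit on all of X, y.
search = GridSearchCV(ElasticNet(max_iter=1000), {'alpha': [0.01, 0.1, 1.0]},
                      cv=3, scoring='neg_mean_squared_error')
search.fit(X, y)

# Predicting through the search object uses the refit best estimator.
same_predictions = np.allclose(search.predict(X),
                               search.best_estimator_.predict(X))

# Assumed serialization: an in-memory pickle round trip of the best estimator.
restored = pickle.loads(pickle.dumps(search.best_estimator_))
restored_matches = np.allclose(restored.predict(X), search.predict(X))

print(same_predictions, restored_matches)
```

Saving `best_estimator_` instead of the whole search object keeps the artifact small, since the cross-validation results are not needed at prediction time.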


## Predict on test set

Score the test set with the refit model, calculate metrics, and save both the predictions and the metrics to S3.

In [12]:
%%time

tip_test = ml_utils.read_parquet_dir(f'{ml_utils.taxi_path}/data/ml/tip_test')
preds = tip_test[['id', y_col]].copy()
preds.columns = ['id', 'actual']
preds['predicted'] = grid_search.predict(tip_test[features])

CPU times: user 1min 45s, sys: 1min 41s, total: 3min 27s
Wall time: 4min 5s


In [13]:
preds.head()

| | id | actual | predicted |
|---|---|---|---|
| 0 | 2e8f402e4dc44f2fae8b9328a237c4d2 | 0.117647 | 0.218623 |
| 1 | 5f067a4121244f42bf460867c23b39c9 | 0.216842 | 0.218905 |
| 2 | 60e8442d3d434df4959261905a279f55 | 0.15 | 0.218806 |
| 3 | 2d1537ce2ed347778e078eaee7eacd44 | 0.10625 | 0.218072 |
| 4 | 13bb8a9ecbd04b559b7b9e40904026b0 | 0.0 | 0.211103 |


In [14]:
%%time
ml_utils.write_predictions(preds)

Writing predictions to 's3://saturn-titan/nyc-taxi/ml_results/predictions/tip__scikit__elastic_net'
Done writing predictions
CPU times: user 10.3 s, sys: 2.79 s, total: 13.1 s
Wall time: 1min 21s


In [15]:
from sklearn.metrics import mean_squared_error

# `squared=False` returns RMSE; scikit-learn >= 1.4 provides `root_mean_squared_error` instead
rmse = mean_squared_error(preds.actual, preds.predicted, squared=False)
ml_utils.write_metric_df('rmse', rmse)

| | ml_task | tool | model | metric | value | fit_seconds |
|---|---|---|---|---|---|---|
| 0 | tip | scikit | elastic_net | rmse | 0.207689 | 8226.132331 |
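
RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below computes it by hand for the five rows shown in `preds.head()` above. It is only an illustration of the formula: five rows say nothing about the full test set, so the result will differ from the 0.207689 reported in the metrics table. The name `rmse_head` is introduced here.

```python
import math

# The five rows displayed by preds.head() above.
actual = [0.117647, 0.216842, 0.15, 0.10625, 0.0]
predicted = [0.218623, 0.218905, 0.218806, 0.218072, 0.211103]

# Mean of the squared errors, then the square root.
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
rmse_head = math.sqrt(sum(squared_errors) / len(squared_errors))

print(f'RMSE on the five displayed rows: {rmse_head:.4f}')
```

Computing the metric this way does not depend on the `squared` argument of `mean_squared_error`, which newer scikit-learn releases no longer accept.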
