# Tip Prediction (XGBoost + Dask)

**Hardware**: r5.8xlarge (32 CPU, 256 GB RAM)

In [1]:
import dask_xgboost as dxgb
import numpy as np
import pandas as pd

import dask.dataframe as dd
from dask.distributed import Client, wait
from dask_saturn import SaturnCluster

from sklearn.metrics import mean_squared_error

from ml_utils import MLUtils

This notebook relies on `xgboost` 0.90 and `dask-xgboost` 0.1.11. As of this writing, that is the newest version of `xgboost` supported by `dask-xgboost`.
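
Because the pins matter, it can help to check the environment before running anything else. The snippet below is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+). The `check_versions` helper and the `EXPECTED` dictionary are illustrative names, not part of this notebook or of either library.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions stated in this notebook.
EXPECTED = {"xgboost": "0.90", "dask-xgboost": "0.1.11"}


def check_versions(expected):
    """Return {package: (installed_or_None, wanted)} for every mismatch."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[package] = (installed, wanted)
    return mismatches


print(check_versions(EXPECTED) or "all pinned versions match")
```

An empty result means both packages are installed at the pinned versions.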

To begin, initialize an `MLUtils` object. This small helper handles naming and storing the model, along with its predictions and metrics.

In [2]:
ml_utils = MLUtils(
    ml_task='tip',
    tool='dask',
    model='xgboost',
)
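
The internals of `MLUtils` are not shown in this notebook, but the S3 paths it logs further down end in `tip__dask__xgboost`, which suggests the three arguments are joined with double underscores. A minimal sketch of that assumed convention follows; `artifact_name` is a hypothetical function, not the actual `MLUtils` implementation.

```python
def artifact_name(ml_task: str, tool: str, model: str) -> str:
    """Join the three identifiers with double underscores (assumed convention)."""
    return "__".join([ml_task, tool, model])


name = artifact_name(ml_task="tip", tool="dask", model="xgboost")
print(name)
```

A naming scheme like this keeps results from different task, tool, and model combinations from overwriting each other in shared storage.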

## Set up a Dask cluster

Create a Dask cluster with the following specs.

* 10 workers, each with:
    - one `r5.8xlarge` EC2 instance
    - `nproc=1` (one worker process per instance)
    - `nthreads=32` (one thread per core, so all available cores are used)
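
As a quick sanity check on total capacity, the arithmetic can be spelled out. This sketch assumes every worker gets the full 32 cores and 256 GB listed for the `r5.8xlarge` under Hardware. In practice the operating system and the worker process reserve part of that memory, so treat the RAM figure as an upper bound.

```python
# Per-instance figures from the Hardware note at the top (r5.8xlarge).
N_WORKERS = 10
THREADS_PER_WORKER = 32   # matches nthreads=32
RAM_GB_PER_WORKER = 256   # instance total; usable worker memory is assumed to be lower

total_threads = N_WORKERS * THREADS_PER_WORKER
total_ram_gb = N_WORKERS * RAM_GB_PER_WORKER

print(f"{total_threads} threads and {total_ram_gb} GB RAM across {N_WORKERS} workers")
```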

In [3]:
cluster = SaturnCluster(
    n_workers=10,
    scheduler_size='xlarge',
    worker_size='8xlarge',
    nproc=1,
    nthreads=32
)
client = Client(cluster)
cluster

[2020-08-07 19:32:42] INFO - dask-saturn | Starting cluster. Status: stopped
[2020-08-07 19:32:48] INFO - dask-saturn | Starting cluster. Status: starting
[2020-08-07 19:33:00] INFO - dask-saturn | Starting cluster. Status: starting
[2020-08-07 19:33:26] INFO - dask-saturn | Cluster is ready



# Load data and feature engineering

In [4]:
%%time
tip_train = dd.read_parquet(f'{ml_utils.taxi_path}/data/ml/tip_train_sample', engine='pyarrow')
len(tip_train)

CPU times: user 72.6 ms, sys: 8.05 ms, total: 80.7 ms
Wall time: 2.86 s


10994913

In [5]:
tip_train.head()

| | id | pickup_taxizone_id | dropoff_taxizone_id | pickup_weekday | pickup_weekofyear | pickup_hour | pickup_minute | pickup_week_hour | passenger_count | tip_fraction |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28a18fa5fa2f44f29ffd98fc9159829d | 238.0 | 132.0 | 0 | 29 | 7 | 19 | 7 | 1.0 | 0.199616 |
| 1 | a6578145ff824f5fb94e90457b040883 | 236.0 | 246.0 | 6 | 28 | 9 | 16 | 153 | 1.0 | 0.130435 |
| 2 | 91726ecac3b44e8bbfea68d725f35556 | 90.0 | 148.0 | 6 | 28 | 22 | 44 | 166 | 6.0 | 0.166667 |
| 3 | a3b0d14ad1644dd6b90f1af6be002e55 | 141.0 | 186.0 | 6 | 28 | 9 | 34 | 153 | 1.0 | 0.152299 |
| 4 | 70aa5a0c6bc147dd8553e201b63ba0fe | 100.0 | 142.0 | 6 | 28 | 22 | 7 | 166 | 1.0 | 0.169231 |


Take a sample of the training data.

In [6]:
sample = tip_train.sample(frac=0.1, replace=False, random_state=42)
len(sample)

1099492
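
The sample has 1,099,492 rows, one more than `round(0.1 * 10994913) = 1099491`. One plausible explanation, assuming Dask applies `frac` to each partition independently and rounds each partition's row count on its own, is that the per-partition rounding errors do not cancel. The sketch below illustrates the effect with hypothetical partition sizes; they are not the real partition sizes of `tip_train`.

```python
FRAC = 0.1

# Hypothetical partition sizes, chosen only to make the rounding effect visible.
partition_sizes = [16, 16, 16]

# Rounding once over the whole dataset.
rows_if_rounded_once = round(FRAC * sum(partition_sizes))

# Rounding separately in each partition, then summing the results.
rows_if_rounded_per_partition = sum(round(FRAC * n) for n in partition_sizes)

print(rows_if_rounded_once, rows_if_rounded_per_partition)
```

Either way, `frac` is best read as an approximate fraction rather than an exact row count.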

# Train a model

To keep data I/O out of the training-time measurement, first `persist()` the dataframe so that its partitions are loaded into memory on the workers in the Dask cluster.

In [8]:
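sample = sample.persist()  # persist() returns a new collection, so rebind the name
_ = wait(sample)  # block until every partition is in worker memory
sample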

| npartitions=19 | id | pickup_taxizone_id | dropoff_taxizone_id | pickup_weekday | pickup_weekofyear | pickup_hour | pickup_minute | pickup_week_hour | passenger_count | tip_fraction |
|---|---|---|---|---|---|---|---|---|---|---|
| | object | float64 | float64 | int64 | int64 | int64 | int64 | int64 | float64 | float64 |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


In [22]:
%%time
features = ml_utils.tip_vars.features
y_col = ml_utils.tip_vars.y_col

xgb_reg = dxgb.train(
    client=client,
    params={
        "verbosity": 1,
        "max_depth": 8,
        "random_state": 42,
        "objective": "reg:squarederror",
        "nthread": 1,
        "learning_rate": 0.1
    },
    data=sample[features],
    labels=sample[y_col],
    num_boost_round=100
)

CPU times: user 58.3 ms, sys: 7.48 ms, total: 65.8 ms
Wall time: 40.1 s


## Save model

Now that we've trained a model, store it in S3 so it can be deployed in the future.

In [23]:
ml_utils.write_model(xgb_reg)

uploading model to 's3://saturn-titan/nyc-taxi/ml_results/models/tip__dask__xgboost.pkl'
successfully uploaded model
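
The `.pkl` suffix in the upload path suggests the model is serialized with Python's `pickle` module, although the body of `MLUtils.write_model` is not shown here. The sketch below shows the round trip a deployment would rely on under that assumption. `model_stub` is a stand-in dictionary, not a trained booster, and no S3 call is made.

```python
import pickle

# Stand-in for a trained model; any picklable Python object behaves the same way.
model_stub = {"objective": "reg:squarederror", "max_depth": 8, "num_boost_round": 100}

payload = pickle.dumps(model_stub)   # bytes that could be uploaded to object storage
restored = pickle.loads(payload)     # what a deployment would do after downloading

print(restored == model_stub)
```

Pickled models should only be loaded from trusted sources, and generally with library versions compatible with the ones used for training, which is one more reason to record the `xgboost` version.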


## Predict on test set

Score the held-out test set with the trained model, calculate RMSE, and save both the predictions and the metric to S3.

In [24]:
%%time
amt_test = dd.read_parquet(f'{ml_utils.taxi_path}/data/ml/tip_test', engine='pyarrow')
preds = amt_test[['id', y_col]].copy()
preds.columns = ['id', 'actual']

CPU times: user 77.9 ms, sys: 468 µs, total: 78.4 ms
Wall time: 190 ms


In [25]:
preds['predicted'] = dxgb.predict(
    client=client,
    model=xgb_reg,
    data=amt_test[features]
)

In [26]:
preds.head()

| | id | actual | predicted |
|---|---|---|---|
| 0 | 8e8109754e3e4cb7879c4e9ee216d58d | 0.097087 | 0.162214 |
| 1 | a30e7c87866f417ab15dee5617f272a0 | 0.166667 | 0.161161 |
| 2 | 1a7a611d0809489d99a5120727e0476a | 0.12 | 0.158262 |
| 3 | 736e84ca12a640cc858c210bd58f744c | 0.089474 | 0.151156 |
| 4 | f2c24299d9a34ce986b7a271c5cc80b2 | 0.0 | 0.15826 |


In [27]:
%%time
ml_utils.write_predictions(preds)

Writing predictions to 's3://saturn-titan/nyc-taxi/ml_results/predictions/tip__dask__xgboost'
Done writing predictions
CPU times: user 163 ms, sys: 23.6 ms, total: 187 ms
Wall time: 7.36 s


In [28]:
rmse = mean_squared_error(preds.actual, preds.predicted, squared=False)
ml_utils.write_metric_df('rmse', rmse)

| | ml_task | tool | model | metric | value |
|---|---|---|---|---|---|
| 0 | tip | dask | xgboost | rmse | 0.052061 |
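
RMSE is the square root of the mean squared difference between actual and predicted values, which is what `mean_squared_error(..., squared=False)` returns. As an illustration, the sketch below computes it by hand for the five preview rows printed by `preds.head()`. The `rmse` helper is written for this example only, and five rows are far too few to expect a match with the full test-set value.

```python
import math

# The five preview rows printed by preds.head() earlier in this notebook.
actual = [0.097087, 0.166667, 0.12, 0.089474, 0.0]
predicted = [0.162214, 0.161161, 0.158262, 0.151156, 0.15826]


def rmse(y_true, y_pred):
    """Square root of the mean squared difference of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


print(f"RMSE on the five preview rows: {rmse(actual, predicted):.4f}")
```

Since `tip_fraction` is a fraction of the fare, the reported test-set RMSE of 0.052061 corresponds to a typical error of roughly five percentage points of the fare.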
