# Tip Prediction (XGBoost + Dask)

**Hardware**: 10 nodes, each an r5.8xlarge (32 vCPUs, 256 GB RAM)

In [1]:
import dask.dataframe as dd
import dask_xgboost as dxgb
import numpy as np
import pandas as pd

from dask.distributed import Client, wait
from dask_saturn import SaturnCluster

from sklearn.metrics import mean_squared_error

from ml_utils import MLUtils

This notebook relies on `xgboost` 0.90 and `dask-xgboost` 0.1.11. As of this writing, 0.90 is the newest version of `xgboost` supported by `dask-xgboost`.
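Because the notebook depends on these exact versions, it can help to check the environment before running anything expensive. The following is a minimal sketch using only the standard library; the `check_versions` helper and the `EXPECTED` dictionary are illustrative names, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions stated in this notebook.
EXPECTED = {"xgboost": "0.90", "dask-xgboost": "0.1.11"}


def check_versions(expected):
    """Return {package: (expected, installed_or_None, matches)}."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            # Package is not installed in this environment.
            installed = None
        report[package] = (wanted, installed, installed == wanted)
    return report


for package, (wanted, installed, ok) in check_versions(EXPECTED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{package}: expected {wanted}, found {installed} [{status}]")
```

A mismatch is only a warning sign; whether other version combinations work is not something this notebook establishes.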

To begin, initialize an `MLUtils` object. This small helper handles naming and storing the model, and it also exposes the data path, feature list, timer, and output writers used later in the notebook.

In [2]:
ml_utils = MLUtils(
    ml_task='tip',
    tool='dask',
    model='xgboost',
)
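The internals of `MLUtils` are not shown in this notebook. Judging from the S3 paths printed later (for example `tip__dask__xgboost.pkl`), the three arguments appear to be joined with double underscores to form an artifact name. A minimal sketch of that assumed convention, using a hypothetical `artifact_name` function:

```python
def artifact_name(ml_task: str, tool: str, model: str) -> str:
    """Join the three identifiers with double underscores (assumed convention)."""
    return "__".join([ml_task, tool, model])


name = artifact_name(ml_task="tip", tool="dask", model="xgboost")
print(name)           # base name shared by model and prediction outputs
print(f"{name}.pkl")  # file name for the pickled model
```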

## Set up a Dask cluster

Create a Dask cluster with the following specs.

* 10 workers, each:
    - `r5.8xlarge` EC2 instances
    - `nproc=1` (one worker process per instance)
    - `nthreads=32` (use all available cores)
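The specs above imply the following aggregate capacity. This is plain arithmetic on the numbers stated in this notebook; the variable names are illustrative.

```python
# Per-worker figures taken from the hardware description in this notebook.
n_workers = 10
threads_per_worker = 32
ram_gb_per_worker = 256

total_threads = n_workers * threads_per_worker
total_ram_gb = n_workers * ram_gb_per_worker

print(f"total threads: {total_threads}")
print(f"total RAM (GB): {total_ram_gb}")
```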

In [3]:
cluster = SaturnCluster(
    n_workers=10,
    scheduler_size='xlarge',
    worker_size='8xlarge',
    nproc=1,
    nthreads=32
)
client = Client(cluster)
cluster

[2020-08-08 19:15:18] INFO - dask-saturn | Cluster is ready



# Load data and feature engineering

In [4]:
%%time
tip_train = dd.read_parquet(f'{ml_utils.taxi_path}/data/ml/tip_train', engine='pyarrow')
len(tip_train)

CPU times: user 1.07 s, sys: 28.9 ms, total: 1.1 s
Wall time: 15.5 s


219889897

In [5]:
tip_train.head()

| | id | pickup_datetime | dropoff_datetime | pickup_taxizone_id | dropoff_taxizone_id | pickup_weekday | pickup_weekofyear | pickup_hour | pickup_minute | pickup_week_hour | passenger_count | tip_fraction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a2acf0e861b84476b1ba8b91f08448f4 | 2016-07-10 06:56:05 | 2016-07-10 06:59:53 | 263.0 | 236.0 | 6 | 27 | 6 | 56 | 150 | 1.0 | 0.6 |
| 1 | bac9b63e96f14ca9a4e9c236d709f5f2 | 2016-07-10 10:50:18 | 2016-07-10 10:55:21 | 142.0 | 163.0 | 6 | 27 | 10 | 50 | 154 | 5.0 | 0.226667 |
| 2 | 6ec52187eb4a4ab6ae3ce632030c0322 | 2016-07-10 10:50:18 | 2016-07-10 11:08:38 | 74.0 | 66.0 | 6 | 27 | 10 | 50 | 154 | 1.0 | 0.0 |
| 3 | 21102720ac044762977f91b30f8857a1 | 2016-07-10 10:50:19 | 2016-07-10 10:55:38 | 161.0 | 142.0 | 6 | 27 | 10 | 50 | 154 | 1.0 | 0.166667 |
| 4 | 3d002b5aa2404553adc0ad79c776bf32 | 2016-07-10 10:50:19 | 2016-07-10 11:08:54 | 226.0 | 261.0 | 6 | 27 | 10 | 50 | 154 | 1.0 | 0.206957 |


Let's take the same sample we used in the single-node example.

In [6]:
sample = tip_train.sample(frac=0.5, replace=False, random_state=42)
len(sample)

109944945
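The sampled row count (109,944,945) is close to, but not exactly, half of 219,889,897 (which would be 109,944,948.5). One plausible explanation, stated here as an assumption rather than something this notebook confirms, is that the sample is drawn partition by partition and each partition rounds its own row count. A minimal sketch with hypothetical partition sizes shows how per-partition rounding can differ from rounding the global total:

```python
frac = 0.5

# Hypothetical partition sizes, chosen only for illustration.
partition_sizes = [5, 9, 13]

# Each partition independently rounds frac * its own length.
per_partition_rows = [round(frac * n) for n in partition_sizes]
per_partition_total = sum(per_partition_rows)

# Rounding once on the full dataset gives a different count.
global_total = round(frac * sum(partition_sizes))

print(f"rows per partition: {per_partition_rows}")
print(f"sum over partitions: {per_partition_total}")
print(f"rounded global total: {global_total}")
```

With many partitions, these small per-partition differences add up to a total that is a few rows away from the exact fraction.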

# Train a model

To be sure that the estimate of training time doesn't include data I/O, `persist()` the dataframe to the workers in the Dask cluster first.

In [7]:
sample = sample.persist()
_ = wait(sample)

In [8]:
%%time
features = ml_utils.tip_vars.features
y_col = ml_utils.tip_vars.y_col

with ml_utils.time_fit():
    xgb_reg = dxgb.train(
        client=client,
        params={
            "verbosity": 1,
            "max_depth": 8,
            "random_state": 42,
            "objective": "reg:squarederror",
            "nthread": 1,
            "learning_rate": 0.1
        },
        data=sample[features],
        labels=sample[y_col],
        num_boost_round=100
    )

CPU times: user 478 ms, sys: 39.5 ms, total: 518 ms
Wall time: 9min 47s
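The `time_fit()` context manager is not defined in this notebook. The wall time above (9 min 47 s, about 587 s) agrees with the `fit_seconds` value reported in the metrics table at the end, which suggests it simply records the elapsed time of the `with` block. A minimal sketch of how such a timer could work, using a hypothetical `FitTimer` class as a stand-in for that part of `MLUtils`:

```python
import time
from contextlib import contextmanager


class FitTimer:
    """Hypothetical stand-in for the timing part of MLUtils."""

    def __init__(self):
        self.fit_seconds = None

    @contextmanager
    def time_fit(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the body raises.
            self.fit_seconds = time.perf_counter() - start


timer = FitTimer()
with timer.time_fit():
    time.sleep(0.05)  # placeholder for the call to dxgb.train(...)

print(f"fit_seconds: {timer.fit_seconds:.3f}")
```

Using `time.perf_counter()` rather than `time.time()` is a design choice in this sketch: it is monotonic, so the measurement is not affected by system clock adjustments.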


## Save model

Now that we've trained a model, store it in S3 so it can be deployed in the future.

In [9]:
ml_utils.write_model(xgb_reg)

uploading model to 's3://saturn-titan/nyc-taxi/ml_results/models/tip__dask__xgboost.pkl'
successfully uploaded model


## Predict on test set

Generate predictions for the test set, calculate metrics, and save both the predictions and the metrics to S3.

In [10]:
%%time
tip_test = dd.read_parquet(f'{ml_utils.taxi_path}/data/ml/tip_test', engine='pyarrow')
preds = tip_test[['id', y_col]].copy()
preds.columns = ['id', 'actual']

CPU times: user 131 ms, sys: 166 µs, total: 131 ms
Wall time: 577 ms


In [12]:
preds['predicted'] = dxgb.predict(
    client=client,
    model=xgb_reg,
    data=tip_test[features]
)

In [13]:
preds.head()

| | id | actual | predicted |
|---|---|---|---|
| 0 | 2e8f402e4dc44f2fae8b9328a237c4d2 | 0.117647 | 0.220339 |
| 1 | 5f067a4121244f42bf460867c23b39c9 | 0.216842 | 0.219414 |
| 2 | 60e8442d3d434df4959261905a279f55 | 0.15 | 0.214625 |
| 3 | 2d1537ce2ed347778e078eaee7eacd44 | 0.10625 | 0.225546 |
| 4 | 13bb8a9ecbd04b559b7b9e40904026b0 | 0.0 | 0.235255 |


In [14]:
%%time
ml_utils.write_predictions(preds)

Writing predictions to 's3://saturn-titan/nyc-taxi/ml_results/predictions/tip__dask__xgboost'
Done writing predictions
CPU times: user 186 ms, sys: 55.6 ms, total: 242 ms
Wall time: 18.9 s


In [15]:
rmse = mean_squared_error(preds.actual, preds.predicted, squared=False)
ml_utils.write_metric_df('rmse', rmse)

| | ml_task | tool | model | metric | value | fit_seconds |
|---|---|---|---|---|---|---|
| 0 | tip | dask | xgboost | rmse | 0.206804 | 587.82643 |
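Passing `squared=False` makes `mean_squared_error` return the square root of the mean squared error (newer scikit-learn releases provide `root_mean_squared_error` for the same purpose). To make the definition concrete, the following standard-library sketch computes RMSE for the five rows shown by `preds.head()` above. It is only an illustration on five rows, not a reproduction of the test-set metric, and the `rmse_of` helper is an illustrative name.

```python
import math

# Values copied from the preds.head() output in this notebook.
actual = [0.117647, 0.216842, 0.15, 0.10625, 0.0]
predicted = [0.220339, 0.219414, 0.214625, 0.225546, 0.235255]


def rmse_of(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


head_rmse = rmse_of(actual, predicted)
print(f"RMSE on the five head rows: {head_rmse:.4f}")
```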
