# Tip Prediction (XGBoost)

**Hardware**: r5.8xlarge (32 CPU, 256 GB RAM)

In [1]:
import numpy as np
import pandas as pd
import xgboost

from sklearn.metrics import mean_squared_error

from ml_utils import MLUtils

This notebook relies on `xgboost` 0.90. As of this writing, that is the newest version of `xgboost` supported by `dask-xgboost`. This notebook is intended to complement `tip_dask_xgboost.ipynb`.

In [2]:
print(xgboost.__version__)

0.90


To begin, initialize an `MLUtils` object. This small helper handles reading the data and naming and storing the model, its predictions, and its metrics.

In [3]:
ml_utils = MLUtils(
    ml_task='tip',
    tool='scikit',
    model='xgboost',
)
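
The output paths later in this notebook (for example `tip__scikit__xgboost.pkl`) suggest that the three constructor arguments are joined with double underscores to name stored artifacts. The sketch below illustrates that convention. `artifact_name` is a hypothetical helper, not the actual `MLUtils` implementation.

```python
def artifact_name(ml_task: str, tool: str, model: str) -> str:
    """Join the three identifiers with double underscores (assumed convention)."""
    return "__".join([ml_task, tool, model])


name = artifact_name(ml_task="tip", tool="scikit", model="xgboost")
print(name)
```

Deriving every path from the same three identifiers keeps the model, predictions, and metrics for one run easy to match up.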

# Load data and feature engineering

In [4]:
%%time
tip_train = ml_utils.read_parquet_dir(f'{ml_utils.taxi_path}/data/ml/tip_train')
tip_train.shape

CPU times: user 2min 6s, sys: 1min 20s, total: 3min 27s
Wall time: 12min 39s


(219892717, 12)

In [5]:
tip_train.head()

| | id | pickup_datetime | dropoff_datetime | pickup_taxizone_id | dropoff_taxizone_id | pickup_weekday | pickup_weekofyear | pickup_hour | pickup_minute | pickup_week_hour | passenger_count | tip_fraction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2b7d0e26212043c38b8ab3a7693965f1 | 2016-07-10 06:56:05 | 2016-07-10 06:59:53 | 263.0 | 236.0 | 6 | 27 | 6 | 56 | 150 | 1.0 | 0.6 |
| 1 | 979d4717e9e04c44bf62959308348860 | 2016-07-10 10:50:18 | 2016-07-10 10:55:21 | 142.0 | 163.0 | 6 | 27 | 10 | 50 | 154 | 5.0 | 0.226667 |
| 2 | 07fd0421b17f4bf78a8cf09a3dabcac7 | 2016-07-10 10:50:18 | 2016-07-10 11:08:38 | 74.0 | 66.0 | 6 | 27 | 10 | 50 | 154 | 1.0 | 0.0 |
| 3 | 8d039019a6d94d7e81ce4bcbd3a04f36 | 2016-07-10 10:50:19 | 2016-07-10 10:55:38 | 161.0 | 142.0 | 6 | 27 | 10 | 50 | 154 | 1.0 | 0.166667 |
| 4 | d71237b38d5e4d39b9c43c665080d8da | 2016-07-10 10:50:19 | 2016-07-10 11:08:54 | 226.0 | 261.0 | 6 | 27 | 10 | 50 | 154 | 1.0 | 0.206957 |



Take a 50% sample of the training data without replacement. Note that the training cell below fits on the full `tip_train`, not on `sample`, so the full frame is kept in memory.

In [6]:
sample = tip_train.sample(frac=0.5, replace=False, random_state=42)
sample.shape

(109946358, 12)
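
To see what the `sample` call does on a small scale, the sketch below applies the same arguments to a toy DataFrame. The `toy` frame and its values are made up for illustration. With `replace=False` each row is drawn at most once, and a fixed `random_state` makes the draw repeatable.

```python
import pandas as pd

# Toy stand-in for tip_train; the values here are illustrative only.
toy = pd.DataFrame({"passenger_count": range(10), "tip_fraction": [0.1] * 10})

first = toy.sample(frac=0.5, replace=False, random_state=42)
second = toy.sample(frac=0.5, replace=False, random_state=42)

print(first.shape)            # half of the 10 rows, both columns
print(first.index.is_unique)  # no row is drawn twice
print(first.equals(second))   # same seed, same sample
```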

# Train a model

In [7]:
xgb_reg = xgboost.XGBRegressor(
    objective="reg:squarederror",
    learning_rate=0.1,
    max_depth=8,
    n_estimators=100,
    n_jobs=32,  # one thread per core; `nthread` is the deprecated spelling
)

In [8]:
%%time
features = ml_utils.tip_vars.features
y_col = ml_utils.tip_vars.y_col

with ml_utils.time_fit():
    model = xgb_reg.fit(X=tip_train[features], y=tip_train[y_col].values)

CPU times: user 1d 3h 8min 18s, sys: 8min 14s, total: 1d 3h 16min 33s
Wall time: 2h 45min 26s
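
The `fit_seconds` value reported with the metrics at the end of this notebook matches the wall time of this cell, which suggests that `time_fit()` records elapsed wall-clock time around the `with` block. A minimal sketch of such a context manager follows. The `FitTimer` class and its `fit_seconds` attribute are hypothetical and are not taken from `ml_utils`.

```python
import time
from contextlib import contextmanager


class FitTimer:
    """Hypothetical helper that remembers how long the last fit took."""

    def __init__(self):
        self.fit_seconds = None

    @contextmanager
    def time_fit(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed wall-clock time even if the fit raises.
            self.fit_seconds = time.perf_counter() - start


timer = FitTimer()
with timer.time_fit():
    time.sleep(0.05)  # stand-in for xgb_reg.fit(...)

print(f"{timer.fit_seconds:.2f} s")
```

Storing the duration on the helper object means it can later be written out alongside the metric without passing it around by hand.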


## Save model

Now that we've trained a model, store it in S3 so it can be deployed in the future.

In [9]:
ml_utils.write_model(model)

uploading model to 's3://saturn-titan/nyc-taxi/ml_results/models/tip__scikit__xgboost.pkl'
successfully uploaded model
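
The `.pkl` suffix in the upload path suggests the model is pickled before upload. The sketch below shows only the serialization round trip with the standard `pickle` module. The stand-in model object is made up, and the S3 upload itself is omitted because it depends on credentials and on how `write_model` is implemented.

```python
import io
import pickle

# Stand-in for the fitted model; any picklable object behaves the same way.
fake_model = {"objective": "reg:squarederror", "n_estimators": 100}

# Serialize into an in-memory buffer, as one might before uploading the bytes.
buffer = io.BytesIO()
pickle.dump(fake_model, buffer)
payload = buffer.getvalue()

# Later (for example at deployment time), load the object back from the bytes.
restored = pickle.loads(payload)
print(restored == fake_model)
```

A pickled model generally needs compatible library versions at load time, so it is worth recording the `xgboost` version alongside the stored file.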


## Predict on test set

Load the test set, generate predictions, and calculate metrics, then save the predictions and metrics to S3. The training data is no longer needed at this point, so it can be removed from memory first.
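
One way to release the training data, sketched here on small stand-in objects: delete every name that refers to it and, optionally, ask the garbage collector to run. In this notebook the names would be `tip_train` and `sample`. The `Frame` class below is only a placeholder for a large DataFrame.

```python
import gc
import weakref


class Frame:
    """Stand-in for a large DataFrame."""


tip_train = Frame()
sample = Frame()
probe = weakref.ref(tip_train)  # lets us check whether the object is gone

# Drop the names, then collect anything kept alive only by reference cycles.
del tip_train, sample
gc.collect()

print(probe() is None)
```

Memory is only returned once no references remain, so any other variable still pointing at the same frame would need to be deleted as well.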

In [10]:
%%time
tip_test = ml_utils.read_parquet_dir(f'{ml_utils.taxi_path}/data/ml/tip_test')
preds = tip_test[['id', y_col]].copy()
preds.columns = ['id', 'actual']
preds['predicted'] = model.predict(tip_test[features])

CPU times: user 3min 41s, sys: 14.9 s, total: 3min 56s
Wall time: 1min 48s


In [11]:
preds.head()

| | id | actual | predicted |
|---|---|---|---|
| 0 | 11d37225291e4ba4aa7eebad9f7b272a | 0.117647 | 0.246762 |
| 1 | 8bc943e88a654bb2a0861f393c7ba491 | 0.216842 | 0.234995 |
| 2 | 3431b20f16c5456aa29fa45bccad6f47 | 0.15 | 0.224734 |
| 3 | 52cf7e80dd104f84a81b488c1f78d531 | 0.10625 | 0.226385 |
| 4 | e0ae8ce6dbff40b195332af413d24b0e | 0.0 | 0.31453 |


In [12]:
%%time
ml_utils.write_predictions(preds)

Writing predictions to 's3://saturn-titan/nyc-taxi/ml_results/predictions/tip__scikit__xgboost'
Done writing predictions
CPU times: user 10 s, sys: 2.45 s, total: 12.5 s
Wall time: 1min 26s


In [13]:
rmse = mean_squared_error(preds.actual, preds.predicted, squared=False)
ml_utils.write_metric_df('rmse', rmse)

| | ml_task | tool | model | metric | value | fit_seconds |
|---|---|---|---|---|---|---|
| 0 | tip | scikit | xgboost | rmse | 14.929407 | 9926.049958 |
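
For reference, RMSE is the square root of the mean squared difference between actual and predicted values, which is what `squared=False` asks `mean_squared_error` to return. The sketch below computes it directly with NumPy on made-up numbers. Recent scikit-learn releases deprecate the `squared` argument in favor of a separate `root_mean_squared_error` function, so the direct formula can serve as a version-independent fallback.

```python
import numpy as np

# Made-up values for illustration, not taken from the notebook's predictions.
actual = np.array([0.0, 0.2, 0.4])
predicted = np.array([0.3, 0.2, 0.0])

# Square the errors, average them, then take the square root.
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(rmse)
```

Because the errors are squared before averaging, a few very large misses can dominate the result.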
