# Tip Prediction (XGBoost)

**Hardware**: r5.8xlarge (32 CPU, 256 GB RAM)

In [1]:
import numpy as np
import pandas as pd

from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

from ml_utils import MLUtils

To begin, initialize an `MLUtils` object. This small helper handles naming and storing the model, predictions, and metrics.

In [2]:
ml_utils = MLUtils(
    ml_task='tip',
    tool='scikit',
    model='xgboost',
)
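The source of `MLUtils` is not shown in this notebook, so the following is only a guess at its naming convention, based on the S3 paths it prints further down (`tip__scikit__xgboost`). The function `artifact_name` is a hypothetical stand-in, not part of `ml_utils`.

```python
def artifact_name(ml_task: str, tool: str, model: str) -> str:
    """Join the three identifiers with double underscores.

    Hypothetical helper: it mirrors the names seen in this notebook's
    output, not the actual MLUtils implementation.
    """
    return "__".join([ml_task, tool, model])


name = artifact_name(ml_task="tip", tool="scikit", model="xgboost")
print(name)
```

A naming scheme like this keeps models, predictions, and metrics from different task/tool/model combinations from overwriting each other in the same bucket.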

# Load data and feature engineering

In [3]:
%%time
tip_train = ml_utils.read_parquet_dir(f'{ml_utils.taxi_path}/data/ml/tip_train_sample')
tip_train.shape

CPU times: user 5.07 s, sys: 4.53 s, total: 9.6 s
Wall time: 28.8 s


(10994913, 10)

In [4]:
tip_train.head()

| | id | pickup_taxizone_id | dropoff_taxizone_id | pickup_weekday | pickup_weekofyear | pickup_hour | pickup_minute | pickup_week_hour | passenger_count | tip_fraction |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28a18fa5fa2f44f29ffd98fc9159829d | 238.0 | 132.0 | 0 | 29 | 7 | 19 | 7 | 1.0 | 0.199616 |
| 1 | a6578145ff824f5fb94e90457b040883 | 236.0 | 246.0 | 6 | 28 | 9 | 16 | 153 | 1.0 | 0.130435 |
| 2 | 91726ecac3b44e8bbfea68d725f35556 | 90.0 | 148.0 | 6 | 28 | 22 | 44 | 166 | 6.0 | 0.166667 |
| 3 | a3b0d14ad1644dd6b90f1af6be002e55 | 141.0 | 186.0 | 6 | 28 | 9 | 34 | 153 | 1.0 | 0.152299 |
| 4 | 70aa5a0c6bc147dd8553e201b63ba0fe | 100.0 | 142.0 | 6 | 28 | 22 | 7 | 166 | 1.0 | 0.169231 |


Take a 10% sample of the training data, then delete the full `tip_train` DataFrame to save memory.

In [5]:
sample = tip_train.sample(frac=0.1, replace=False, random_state=42)
sample.shape

(1099491, 10)

In [6]:
del tip_train
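To illustrate what the sampling call does, here is a minimal sketch on a toy DataFrame (made-up values, not the taxi data). It shows that `frac=0.1` keeps a tenth of the rows, that `replace=False` never draws a row twice, and that a fixed `random_state` makes the sample reproducible.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training frame (hypothetical data).
toy = pd.DataFrame({
    "pickup_hour": np.arange(1000) % 24,
    "tip_fraction": np.linspace(0.0, 0.3, 1000),
})

# Same arguments as the notebook cell above, applied twice.
toy_sample_a = toy.sample(frac=0.1, replace=False, random_state=42)
toy_sample_b = toy.sample(frac=0.1, replace=False, random_state=42)

print(toy_sample_a.shape)                              # rows kept, columns
print(toy_sample_a.index.equals(toy_sample_b.index))   # same seed, same rows
print(toy_sample_a.index.is_unique)                    # no row drawn twice
```

Because the seed is fixed, rerunning the notebook trains on the same rows each time, which keeps the timing and RMSE figures comparable between runs.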

# Train a model

In [7]:
xgb_reg = XGBRegressor(
    objective="reg:squarederror",
    learning_rate=0.1,
    max_depth=8,
    n_estimators=100,
    n_jobs=32,  # one thread per vCPU on the r5.8xlarge
)

In [8]:
%%time
features = ml_utils.tip_vars.features
y_col = ml_utils.tip_vars.y_col
model = xgb_reg.fit(X=sample[features], y=sample[y_col].values)

CPU times: user 9min 52s, sys: 11.6 s, total: 10min 4s
Wall time: 19.1 s
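The gap between CPU time and wall time in the output above reflects multithreaded training. As a rough worked calculation (not a benchmark), dividing total CPU time by wall time estimates how many cores were busy on average:

```python
# Figures copied from the %%time output above.
cpu_total_s = 10 * 60 + 4   # "total: 10min 4s"
wall_s = 19.1               # "Wall time: 19.1 s"

# Average number of cores kept busy during fit().
effective_cores = cpu_total_s / wall_s
print(round(effective_cores, 1))  # 31.6
```

A value close to 32 suggests that training kept nearly all of the instance's 32 CPUs busy.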


## Save model

Now that we've trained a model, store it in S3 so it can be deployed in the future.

In [9]:
ml_utils.write_model(model)

uploading model to 's3://saturn-titan/nyc-taxi/ml_results/models/tip__scikit__xgboost.pkl'
successfully uploaded model


## Predict on test set

Generate predictions on the test set, calculate metrics, and save both to S3. Before doing that, remove the training sample from memory.

In [10]:
del sample

In [11]:
%%time
amt_test = ml_utils.read_parquet_dir(f'{ml_utils.taxi_path}/data/ml/tip_test')
preds = amt_test[['id', y_col]].copy()
preds.columns = ['id', 'actual']

CPU times: user 15 s, sys: 11.5 s, total: 26.5 s
Wall time: 1min 4s


In [12]:
preds['predicted'] = model.predict(amt_test[features])

In [13]:
preds.head()

| | id | actual | predicted |
|---|---|---|---|
| 0 | 8e8109754e3e4cb7879c4e9ee216d58d | 0.097087 | 0.161691 |
| 1 | a30e7c87866f417ab15dee5617f272a0 | 0.166667 | 0.158902 |
| 2 | 1a7a611d0809489d99a5120727e0476a | 0.12 | 0.158882 |
| 3 | 736e84ca12a640cc858c210bd58f744c | 0.089474 | 0.157415 |
| 4 | f2c24299d9a34ce986b7a271c5cc80b2 | 0.0 | 0.147651 |


In [14]:
%%time
ml_utils.write_predictions(preds)

Writing predictions to 's3://saturn-titan/nyc-taxi/ml_results/predictions/tip__scikit__xgboost'
Done writing predictions
CPU times: user 10.1 s, sys: 2.1 s, total: 12.2 s
Wall time: 1min 12s


In [15]:
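# Take the square root explicitly; the `squared` argument was deprecated and later removed from scikit-learn.
rmse = np.sqrt(mean_squared_error(preds.actual, preds.predicted))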
ml_utils.write_metric_df('rmse', rmse)

| | ml_task | tool | model | metric | value |
|---|---|---|---|---|---|
| 0 | tip | scikit | xgboost | rmse | 0.05157 |
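For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below computes it by hand for the five rows shown by `preds.head()`, purely to make the formula concrete. The variable `rmse_head` is introduced here for illustration and is not part of the notebook.

```python
import math

# The five (actual, predicted) pairs printed by preds.head() above.
actual = [0.097087, 0.166667, 0.12, 0.089474, 0.0]
predicted = [0.161691, 0.158902, 0.158882, 0.157415, 0.147651]

# RMSE = sqrt(mean((actual - predicted)^2))
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
rmse_head = math.sqrt(sum(squared_errors) / len(squared_errors))
print(f"{rmse_head:.4f}")
```

Five rows are far too few to be representative, so this figure should not be expected to match the full test-set RMSE of 0.05157. Because `tip_fraction` is a fraction of the fare, the full-set value corresponds to a typical error of roughly five percentage points of tip.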
