# Tip percent prediction - Random forest

## Plain scikit-learn

**Hardware**: r5.8xlarge (32 CPU, 256 GB RAM)

In [1]:
from ml_utils import MLUtils

ml_utils = MLUtils(
    ml_task='tip',
    tool='scikit',
    model='random_forest',
)

# Load data and feature engineering

In [2]:
import numpy as np
import pandas as pd

In [3]:
%%time
tip_train = ml_utils.read_parquet_dir(f'{ml_utils.taxi_path}/data/ml/tip_train')
tip_train.shape

CPU times: user 2min 7s, sys: 1min 19s, total: 3min 27s
Wall time: 10min 14s


(219889897, 12)

In [4]:
tip_train.head()

| | id | pickup_datetime | dropoff_datetime | pickup_taxizone_id | dropoff_taxizone_id | pickup_weekday | pickup_weekofyear | pickup_hour | pickup_minute | pickup_week_hour | passenger_count | tip_fraction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a2acf0e861b84476b1ba8b91f08448f4 | 2016-07-10 06:56:05 | 2016-07-10 06:59:53 | 263.0 | 236.0 | 6 | 27 | 6 | 56 | 150 | 1.0 | 0.6 |
| 1 | bac9b63e96f14ca9a4e9c236d709f5f2 | 2016-07-10 10:50:18 | 2016-07-10 10:55:21 | 142.0 | 163.0 | 6 | 27 | 10 | 50 | 154 | 5.0 | 0.226667 |
| 2 | 6ec52187eb4a4ab6ae3ce632030c0322 | 2016-07-10 10:50:18 | 2016-07-10 11:08:38 | 74.0 | 66.0 | 6 | 27 | 10 | 50 | 154 | 1.0 | 0.0 |
| 3 | 21102720ac044762977f91b30f8857a1 | 2016-07-10 10:50:19 | 2016-07-10 10:55:38 | 161.0 | 142.0 | 6 | 27 | 10 | 50 | 154 | 1.0 | 0.166667 |
| 4 | 3d002b5aa2404553adc0ad79c776bf32 | 2016-07-10 10:50:19 | 2016-07-10 11:08:54 | 226.0 | 261.0 | 6 | 27 | 10 | 50 | 154 | 1.0 | 0.206957 |



Training on the full dataset ran out of memory, so we train on a 50% random sample instead.

In [5]:
sample = tip_train.sample(frac=0.5, replace=False, random_state=42)
sample.shape

(109944948, 12)
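
To make the sampling step concrete, here is a small sketch on a toy DataFrame. The `toy` frame and its values are made up for illustration and are not the taxi data. It checks that `sample(frac=0.5, replace=False, random_state=42)` returns half the rows, never draws a row twice, and gives the same rows when called again with the same seed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for tip_train
toy = pd.DataFrame({
    'id': np.arange(10),
    'tip_fraction': np.linspace(0.0, 0.9, 10),
})

first = toy.sample(frac=0.5, replace=False, random_state=42)
second = toy.sample(frac=0.5, replace=False, random_state=42)

print(first.shape)            # half of the 10 rows, same columns
print(first['id'].is_unique)  # sampling without replacement: no repeated rows
print(first.equals(second))   # same seed, same sample
```

Fixing `random_state` matters here because the sample defines the training set: a rerun of the notebook trains on the same rows.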

# Random forest regressor

Because we are using tree models, we don't need to one-hot encode or scale the features. Setting `n_jobs=-1` parallelizes training across all cores of the single node.

In [6]:
from sklearn.ensemble import RandomForestRegressor

features = ml_utils.tip_vars.features
y_col = ml_utils.tip_vars.y_col

rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42, n_jobs=-1)
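
The claim above that tree models need no feature scaling can be checked on synthetic data. The sketch below is an illustration only: the data, the scale factor of 8, and the small forest settings are assumptions chosen to run quickly. It fits two forests with the same seed, one on raw features and one on rescaled features, and compares their predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (made up for this illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)

# Rescale every feature by a positive constant; the ordering of values is unchanged
X_scaled = X * 8.0

params = dict(n_estimators=10, max_depth=5, random_state=42, n_jobs=-1)
rf_raw = RandomForestRegressor(**params).fit(X, y)
rf_scaled = RandomForestRegressor(**params).fit(X_scaled, y)

# Each forest predicts on data in its own scale
same = np.allclose(rf_raw.predict(X), rf_scaled.predict(X_scaled))
print(same)
```

Tree splits depend only on the ordering of feature values, so a positive rescaling moves the thresholds but should leave the resulting partitions, and hence the predictions, unchanged.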

In [13]:
%%time
with ml_utils.time_fit():
    _ = rf.fit(sample[features], sample[y_col])

CPU times: user 20h 51min 39s, sys: 2min 35s, total: 20h 54min 15s
Wall time: 45min 28s
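
The implementation of `ml_utils.time_fit()` is not shown in this notebook. As a rough sketch of how such a helper could work, the hypothetical context manager below records the wall-clock seconds spent inside the `with` block. The names `time_fit` and `fit_times` here are assumptions for illustration, not the actual `MLUtils` code.

```python
import time
from contextlib import contextmanager

fit_times = {}  # hypothetical store for the measured duration


@contextmanager
def time_fit(store=fit_times):
    """Record wall-clock seconds spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the fit raises, so a partial duration is still recorded
        store['fit_seconds'] = time.perf_counter() - start


with time_fit():
    time.sleep(0.05)  # stand-in for rf.fit(...)

print(fit_times['fit_seconds'])
```

For the run above, a total CPU time of about 20 h 54 min against a wall time of 45 min 28 s works out to roughly 27 to 28 cores busy on average, out of the 32 available.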


## Save model

In [14]:
ml_utils.write_model(rf)

uploading model to 's3://saturn-titan/nyc-taxi/ml_results/models/tip__scikit__random_forest.pkl'
successfully uploaded model


## Predict on test set

Predict on the test set, calculate the RMSE, and save both the predictions and the metric to S3.

In [15]:
%%time

tip_test = ml_utils.read_parquet_dir(f'{ml_utils.taxi_path}/data/ml/tip_test')
preds = tip_test[['id', y_col]].copy()
preds.columns = ['id', 'actual']
preds['predicted'] = rf.predict(tip_test[features])

CPU times: user 2min 23s, sys: 31 s, total: 2min 54s
Wall time: 1min 41s


In [16]:
preds.head()

| | id | actual | predicted |
|---|---|---|---|
| 0 | 2e8f402e4dc44f2fae8b9328a237c4d2 | 0.117647 | 0.21183 |
| 1 | 5f067a4121244f42bf460867c23b39c9 | 0.216842 | 0.21183 |
| 2 | 60e8442d3d434df4959261905a279f55 | 0.15 | 0.21183 |
| 3 | 2d1537ce2ed347778e078eaee7eacd44 | 0.10625 | 0.222221 |
| 4 | 13bb8a9ecbd04b559b7b9e40904026b0 | 0.0 | 0.232162 |


In [17]:
%%time
ml_utils.write_predictions(preds)

Writing predictions to 's3://saturn-titan/nyc-taxi/ml_results/predictions/tip__scikit__random_forest'
Done writing predictions
CPU times: user 9.86 s, sys: 1.72 s, total: 11.6 s
Wall time: 1min 18s


In [18]:
from sklearn.metrics import mean_squared_error

# Take the square root explicitly: the `squared` argument is deprecated in newer scikit-learn
rmse = np.sqrt(mean_squared_error(preds.actual, preds.predicted))
ml_utils.write_metric_df('rmse', rmse)

| | ml_task | tool | model | metric | value | fit_seconds |
|---|---|---|---|---|---|---|
| 0 | tip | scikit | random_forest | rmse | 0.208525 | 2728.26685 |
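
For reference, the RMSE reported above is the square root of the mean squared difference between actual and predicted values. A minimal sketch with NumPy follows. The helper name `rmse_by_hand` is hypothetical, and the five rows are taken from `preds.head()` purely as an illustration, so the printed value is not the test-set metric.

```python
import numpy as np


def rmse_by_hand(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# First five rows of preds.head(), used only to illustrate the calculation
actual = [0.117647, 0.216842, 0.15, 0.10625, 0.0]
predicted = [0.21183, 0.21183, 0.21183, 0.222221, 0.232162]
print(rmse_by_hand(actual, predicted))
```

Because `tip_fraction` is a fraction of the fare, an RMSE of 0.208525 means the typical prediction error is on the order of 21 percentage points of tip.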
