# High tip classification - Random forest

## Plain scikit-learn

**Hardware**: r5.8xlarge (32 CPU, 256 GB RAM)

In [1]:
from ml_utils import MLUtils

ml_utils = MLUtils(
    ml_task='high_tip',
    tool='scikit',
    model='random_forest',
)

## Load data and feature engineering

In [2]:
import numpy as np
import pandas as pd

In [3]:
%%time
tip_train = ml_utils.read_parquet_dir(f'{ml_utils.taxi_path}/data/ml/tip_train')
tip_train.shape

CPU times: user 1min 54s, sys: 1min 8s, total: 3min 2s
Wall time: 9min 21s


(219897929, 10)

In [4]:
tip_train.head()

|   | id | pickup_taxizone_id | dropoff_taxizone_id | pickup_weekday | pickup_weekofyear | pickup_hour | pickup_minute | pickup_week_hour | passenger_count | tip_fraction |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | c92d7d1d969a4c5a89c7dc93f905881e | 263.0 | 236.0 | 6 | 27 | 6 | 56 | 150 | 1.0 | 0.3 |
| 1 | f6f9b9c710ff4adfb29c8687ee646d0d | 142.0 | 163.0 | 6 | 27 | 10 | 50 | 154 | 5.0 | 0.166667 |
| 2 | cadce931261a42a2b26878a5f610e900 | 74.0 | 66.0 | 6 | 27 | 10 | 50 | 154 | 1.0 | 0.0 |
| 3 | 445763736d5d456697048f87ca10f47f | 161.0 | 142.0 | 6 | 27 | 10 | 50 | 154 | 1.0 | 0.128205 |
| 4 | 51c777d26f2d4f278e8862d2ce6dbd42 | 226.0 | 261.0 | 6 | 27 | 10 | 50 | 154 | 1.0 | 0.166667 |


Training on the full dataset ran out of memory, so we take a 30% sample to be able to train successfully.

In [5]:
sample = tip_train.sample(frac=0.3, replace=False, random_state=42)
sample.shape

(65969379, 10)
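To get a feel for how large the full frame is before sampling, one option is to measure bytes per row on a small stand-in and extrapolate. The sketch below is only a rough estimate: the toy frame copies the column names from `tip_train.head()`, but the dtypes are assumptions, and it says nothing about the extra memory scikit-learn needs during training.

```python
import uuid

import numpy as np
import pandas as pd

# Toy stand-in for tip_train: same column names, ASSUMED dtypes.
n = 1_000
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "id": [uuid.uuid4().hex for _ in range(n)],  # 32-character hex strings
    "pickup_taxizone_id": rng.integers(1, 264, n).astype("float64"),
    "dropoff_taxizone_id": rng.integers(1, 264, n).astype("float64"),
    "pickup_weekday": rng.integers(0, 7, n),
    "pickup_weekofyear": rng.integers(1, 54, n),
    "pickup_hour": rng.integers(0, 24, n),
    "pickup_minute": rng.integers(0, 60, n),
    "pickup_week_hour": rng.integers(0, 168, n),
    "passenger_count": rng.integers(1, 7, n).astype("float64"),
    "tip_fraction": rng.random(n),
})

# deep=True also counts the Python string objects behind the id column.
bytes_per_row = toy.memory_usage(index=False, deep=True).sum() / len(toy)

full_rows = 219_897_929  # row count reported by tip_train.shape
estimated_gb = bytes_per_row * full_rows / 1e9
print(f"{bytes_per_row:.0f} bytes/row, roughly {estimated_gb:.0f} GB for the full frame")
```

On the real data, `tip_train.memory_usage(deep=True)` gives the actual per-column footprint directly.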

## Random forest classifier

We'll train a classifier that can predict "high-tip" rides: those whose tip fraction is above the median, i.e. in the top 50% of tips. The threshold is computed on the full training set, but the model is still trained on the sample.

In [6]:
high_tip = tip_train[ml_utils.tip_vars.y_col].median()
high_tip

0.1665351223362273

In [7]:
features = ml_utils.tip_vars.features
y_clf = 'high_tip'
sample[y_clf] = (sample[ml_utils.tip_vars.y_col] > high_tip).astype(int)

In [8]:
sample[[ml_utils.tip_vars.y_col, y_clf]].head()

|   | tip_fraction | high_tip |
|---|---|---|
| 186000404 | 0.1667 | 1 |
| 155907643 | 0.122699 | 0 |
| 55724746 | 0.166369 | 0 |
| 164951253 | 0.212014 | 1 |
| 185169052 | 0.065359 | 0 |


In [9]:
sample[y_clf].value_counts()

0    33005795
1    32963584
Name: high_tip, dtype: int64
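The two classes come out nearly but not exactly balanced. One plausible contributor is the strict `>` comparison: any ride whose tip fraction equals the threshold is labelled 0. Another is that the threshold comes from the full set while the counts are taken on the sample. The toy values below are hypothetical and only illustrate the tie effect.

```python
from statistics import median

# Hypothetical tip fractions with several ties at the median value.
tips = [0.0, 0.10, 0.15, 0.15, 0.15, 0.20, 0.30]

threshold = median(tips)
# Same rule as the notebook: strictly greater than the threshold is "high tip".
labels = [int(t > threshold) for t in tips]
positive_share = sum(labels) / len(labels)

print(threshold, labels, round(positive_share, 3))
```

Here only 2 of 7 rows are labelled positive because the three tied values fall on the negative side.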

We don't need to one-hot encode or scale the features because we are using tree models. Training can be parallelized across all the cores of the single node with `n_jobs=-1`.

In [10]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42, n_jobs=-1)
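The claim that tree models need no feature scaling can be checked on a small example. The sketch below uses synthetic data and a single `DecisionTreeClassifier` rather than the notebook's forest (both choices are assumptions made for illustration). It fits one tree on raw features and one on the same features multiplied by 10, then compares predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic integer-valued features so that rescaling is exact in floating point.
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(500, 3)).astype(float)
y = (X[:, 0] + X[:, 1] > 100).astype(int)

X_scaled = X * 10.0  # order-preserving rescaling of every feature

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_scaled, y)

# Splits depend only on the ordering of feature values, so predictions should match.
same = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print(same)
```

Because each split only asks whether a feature is below a threshold, rescaling moves the thresholds but leaves the partition of the rows unchanged.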

In [11]:
%%time
_ = rfc.fit(sample[features], sample[y_clf])

CPU times: user 4h 3min 53s, sys: 4min 23s, total: 4h 8min 16s
Wall time: 9min 33s
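Dividing total CPU time by wall time gives a rough count of how many cores were busy on average. The arithmetic below applies that to the `%%time` outputs reported in this notebook; the helper names are made up for this sketch.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def cpu_to_wall_ratio(cpu_seconds, wall_seconds):
    """Average number of cores kept busy over the wall-clock interval."""
    return cpu_seconds / wall_seconds


# Training: total CPU 4h 8min 16s, wall 9min 33s
fit_ratio = cpu_to_wall_ratio(to_seconds(4, 8, 16), to_seconds(0, 9, 33))
# Data loading: total CPU 3min 2s, wall 9min 21s
load_ratio = cpu_to_wall_ratio(to_seconds(0, 3, 2), to_seconds(0, 9, 21))

print(f"fit:  {fit_ratio:.1f}")   # about 26 of the 32 cores busy on average
print(f"load: {load_ratio:.2f}")  # well below 1
```

A ratio near 26 suggests `n_jobs=-1` kept most of the 32 cores occupied during training, while a ratio below 1 for loading suggests that step spent most of its time waiting, likely on I/O rather than on CPU.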


## Save model

In [12]:
ml_utils.write_model(rfc)

## Predict on test set

Score the test set and calculate metrics, then save the predictions and metrics to S3.

In [20]:
%%time

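# Load the test set and label it with the threshold computed on the training data
amt_test = ml_utils.read_parquet_dir(f'{ml_utils.taxi_path}/data/ml/tip_test')
amt_test[y_clf] = (amt_test[ml_utils.tip_vars.y_col] > high_tip).astype(int)

# Keep ids and true labels, then add the predicted probability of the positive class
preds = amt_test[['id', y_clf]].copy()
preds.columns = ['id', 'actual']
preds['predicted'] = rfc.predict_proba(amt_test[features])[:, 1]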

CPU times: user 3min 49s, sys: 41.4 s, total: 4min 31s
Wall time: 1min 17s
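The `[:, 1]` index above relies on `predict_proba` returning one column per class, ordered as in the fitted model's `classes_` attribute. A minimal sketch on synthetic data (the data and the small forest are assumptions for illustration) shows how to look the column up explicitly instead of hard-coding it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem, unrelated to the taxi data.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0).fit(X, y)

proba = clf.predict_proba(X)           # shape (n_samples, n_classes)
pos_col = list(clf.classes_).index(1)  # column holding P(label == 1)
p_high = proba[:, pos_col]

print(clf.classes_, pos_col, proba.shape)
```

With labels 0 and 1 the positive class lands in column 1, which matches the notebook's `[:, 1]`.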


In [21]:
preds.head()

|   | id | actual | predicted |
|---|---|---|---|
| 0 | 8e8109754e3e4cb7879c4e9ee216d58d | 0 | 0.495435 |
| 1 | a30e7c87866f417ab15dee5617f272a0 | 1 | 0.482681 |
| 2 | 1a7a611d0809489d99a5120727e0476a | 0 | 0.491725 |
| 3 | 736e84ca12a640cc858c210bd58f744c | 0 | 0.698884 |
| 4 | f2c24299d9a34ce986b7a271c5cc80b2 | 0 | 0.660963 |


In [25]:
%%time
ml_utils.write_predictions(preds)

CPU times: user 10.5 s, sys: 2.15 s, total: 12.6 s
Wall time: 2min 25s


In [26]:
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(preds.actual, preds.predicted)
ml_utils.write_metric_df('auc', auc)

|   | ml_task | tool | model | metric | value |
|---|---|---|---|---|---|
| 0 | high_tip | scikit | random_forest | auc | 0.557234 |
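To interpret the AUC value, it helps to recall that ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count half). The pure-Python sketch below computes it that way on a tiny hypothetical example; `pairwise_auc` is a name invented here, and the quadratic loop is only suitable for small inputs.

```python
def pairwise_auc(actual, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, not taken from the taxi predictions.
actual = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_toy = pairwise_auc(actual, scores)
print(auc_toy)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Under this reading, an AUC of about 0.56 means the model ranks a high-tip ride above a low-tip ride only slightly more often than a coin flip would.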
