# Random Forest

In [1]:
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import randint
from time import time
import pandas as pd
import os

In [2]:
if '__file__' in locals():
    current_folder = os.path.dirname(os.path.abspath(__file__))
else:
    current_folder = os.getcwd()

In [3]:
merge_features = '"{}"'.format(os.path.join(current_folder, '..', 'Features', 'Merge features.ipynb'))
calcular_auc = '"{}"'.format(os.path.join(current_folder, '..', 'Calcular AUC.ipynb'))
set_de_entrenamiento_testing_y_prediccion = '"{}"'.format(os.path.join(
    current_folder,
    '..',
    'Set de entrenamiento, testing y predicción.ipynb'
))
hiperparametros_csv = os.path.join(current_folder, 'hiperparametros', 'random_forest.csv')

In [4]:
pd.options.mode.chained_assignment = None
%run $merge_features

KeyboardInterrupt: La limpieza ya corrió en este Kernel


In [5]:
assert df_features.shape[0] == df['person'].unique().shape[0]

Load the training, testing, and prediction sets.

In [6]:
%run $set_de_entrenamiento_testing_y_prediccion

labels_with_features = labels.merge(df_features, how='inner', on='person')
data = labels_with_features.drop('label', axis=1)
target = labels_with_features['label']
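An inner merge silently drops any person that appears in only one of the two frames. The following is a minimal sketch of how that could be checked, using made-up toy frames that stand in for the real `labels` and `df_features` (the values and the `event_count` column here are illustrative assumptions, not the notebook's data).

```python
import pandas as pd

# Hypothetical toy frames; the real ones come from the notebooks run above.
labels = pd.DataFrame({'person': ['a', 'b', 'c'], 'label': [0, 1, 0]})
df_features = pd.DataFrame({'person': ['a', 'b', 'd'], 'event_count': [3, 7, 1]})

# An outer merge with indicator=True adds a `_merge` column that records
# which side each row came from.
check = labels.merge(df_features, how='outer', on='person', indicator=True)
only_in_labels = check.loc[check['_merge'] == 'left_only', 'person'].tolist()
only_in_features = check.loc[check['_merge'] == 'right_only', 'person'].tolist()

# The inner merge keeps only persons present on both sides.
labels_with_features = labels.merge(df_features, how='inner', on='person')

print('dropped labels:', only_in_labels)
print('features without label:', only_in_features)
print('rows kept:', len(labels_with_features))
```

If `only_in_labels` is not empty, some labeled persons have no features and will be missing from the training data.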

## Quick training

We obtain the metrics with cross-validation.

In [7]:
param = {
    'n_estimators': 100
}

cv_splits = 10  # number of cross-validation splits

regr = RandomForestRegressor(**param)

In [8]:
%%time
scores = cross_val_score(regr, data, target, cv=cv_splits, scoring='roc_auc')
print("ROC AUC: %0.6f (+/- %0.6f)" % (scores.mean(), scores.std() * 2))

ROC AUC: 0.845585 (+/- 0.020962)
Wall time: 11min 5s
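To make the reported number concrete, here is a minimal sketch of a cross-validated ROC AUC computed fold by fold. It rests on assumptions: the data is synthetic (`make_classification` stands in for `data` and `target`), and `StratifiedKFold` is used so that every test fold contains both classes, whereas the notebook passes an integer `cv`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic binary problem standing in for `data` and `target` (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

# Stratified folds keep both classes present in every test fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_aucs = []
for train_idx, test_idx in cv.split(X, y):
    regr = RandomForestRegressor(n_estimators=50, random_state=0)
    regr.fit(X[train_idx], y[train_idx])
    # With 0/1 targets the regressor predicts values in [0, 1], which
    # ROC AUC uses only as a ranking score.
    fold_aucs.append(roc_auc_score(y[test_idx], regr.predict(X[test_idx])))

fold_aucs = np.array(fold_aucs)
print("ROC AUC: %0.6f (+/- %0.6f)" % (fold_aucs.mean(), fold_aucs.std() * 2))
```

The `+/-` term is twice the standard deviation across folds, so it describes the spread between folds rather than a formal confidence interval.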


## Feature importance

In [9]:
regr.fit(data, target)
feature_importance = pd.DataFrame(data={
    'columna': data.columns,
    'importancia': regr.feature_importances_
}).set_index('columna')
feature_importance.sort_values('importancia', ascending=False)

| columna | importancia |
|---|---|
| dias ultimo checkout | 0.104756 |
| event_count | 0.037656 |
| viewed product | 0.035763 |
| days until 31-05 mean | 0.030530 |
| screen_resolution_width std | 0.029182 |
| screen_resolution_width mean | 0.022694 |
| screen_resolution_height mean | 0.022415 |
| days until 31-05 std | 0.021463 |
| brand listing | 0.020988 |
| ad campaign hit | 0.020750 |
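The values in `feature_importances_` are impurity-based and normalized to sum to 1, so each one can be read as a share of the total. A small sketch on synthetic data follows; the column names and the dataset are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor

# With shuffle=False the informative features are the first columns,
# so we know which ones should rank highest (synthetic data, assumption).
X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
names = ['informative_0', 'informative_1',
         'noise_0', 'noise_1', 'noise_2', 'noise_3']

regr = RandomForestRegressor(n_estimators=50, random_state=0)
regr.fit(X, y)

importance = pd.Series(regr.feature_importances_, index=names)
importance = importance.sort_values(ascending=False)

# The cumulative share shows how many features cover most of the importance.
summary = pd.DataFrame({'importancia': importance,
                        'acumulada': importance.cumsum()})
print(summary)
```

Impurity-based importances tend to favor features with many distinct values, so a ranking like this is a starting point rather than a final verdict.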


# Hyperparameters

In this section we search for the random forest hyperparameters using a random search with cross-validation. The search is based on the scikit-learn example at https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html#sphx-glr-auto-examples-model-selection-plot-randomized-search-py.

Hyperparameters to try.

In [12]:
param_dist = {
    'n_estimators': list(range(1,150,5)),
    'max_depth': list(range(5,80,5)),
    'max_features': randint(1, data.shape[1]),
    'min_samples_split': randint(2, 11),
    'min_samples_leaf': randint(2, 100),
    'bootstrap': [True, False]
}

cv_splits = 10  # number of cross-validation splits
n_iter_search = 1  # number of sampled candidates; cv_splits * n_iter_search forests are fitted in total

regr = RandomForestRegressor()
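To see what candidates a search space like this produces, the sketch below draws them with `ParameterSampler`, the same sampling utility that scikit-learn exposes in `sklearn.model_selection`. The value of `n_features` is a hypothetical stand-in for `data.shape[1]`, and only part of the search space is reproduced. Note that `scipy.stats.randint(low, high)` excludes `high`, so `max_features` never equals the total number of columns.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

n_features = 20  # hypothetical stand-in for data.shape[1]

param_dist = {
    'n_estimators': list(range(1, 150, 5)),   # lists are sampled uniformly
    'max_features': randint(1, n_features),   # integers in [1, n_features - 1]
    'bootstrap': [True, False],
}

# Each element is one candidate dict, like the ones a randomized search fits.
samples = list(ParameterSampler(param_dist, n_iter=200, random_state=0))
max_features_drawn = [s['max_features'] for s in samples]

print(samples[0])
print('max_features range:', min(max_features_drawn), max(max_features_drawn))
```

If using every column should be a possible candidate, the upper bound would need to be `n_features + 1`.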

Note: more information is printed to the console from which Jupyter was launched.

Increasing *n_jobs* runs more processes in parallel, at the risk of the run hanging when the machine runs out of memory. I recommend increasing *n_jobs* gradually with a low *n_iter_search* until you find the largest *n_jobs* your machine can handle.

In [13]:
# Fold scores are averaged with equal weight; the `iid` option was removed in scikit-learn 0.24.
random_search = RandomizedSearchCV(regr, param_distributions=param_dist, refit=True, verbose=10,
                                   return_train_score=True, n_iter=n_iter_search, cv=cv_splits,
                                   scoring='roc_auc', n_jobs=2)

start = time()
random_search.fit(data, target)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))

Fitting 10 folds for each of 1 candidates, totalling 10 fits


[Parallel(n_jobs=2)]: Done   1 tasks      | elapsed:   59.3s
[Parallel(n_jobs=2)]: Done   4 tasks      | elapsed:  1.5min
[Parallel(n_jobs=2)]: Done  10 out of  10 | elapsed:  3.0min finished


RandomizedSearchCV took 212.70 seconds for 1 candidates parameter settings.


The **best** random forest was:

In [16]:
print('score: {}'.format(random_search.best_score_))
random_search.best_params_

score: 0.8703461300583676


{'bootstrap': True,
 'max_depth': 35,
 'max_features': 90,
 'min_samples_leaf': 15,
 'min_samples_split': 2,
 'n_estimators': 71}

The search results can be loaded into a pandas DataFrame for analysis.

In [15]:
stats_training = pd.DataFrame(data=random_search.cv_results_)
stats_training.head(2)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_bootstrap | param_max_depth | param_max_features | param_min_samples_leaf | param_min_samples_split | param_n_estimators | params | split0_test_score | ... | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | split5_train_score | split6_train_score | split7_train_score | split8_train_score | split9_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30.062374 | 0.476919 | 0.031018 | 0.002205 | True | 35 | 90 | 15 | 2 | 71 | {'bootstrap': True, 'max_depth': 35, 'max_feat... | 0.864919 | ... | 0.965342 | 0.965016 | 0.966003 | 0.965218 | 0.964678 | 0.965055 | 0.96416 | 0.96494 | 0.966129 | 0.965753 | 0.965229 | 0.000575 |
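One analysis this table supports is comparing train and test scores per candidate: in the row above, the mean train AUC (0.965229) sits about 0.095 above the best cross-validated score (0.870346), which suggests some overfitting. The sketch below computes that gap for every candidate. It rests on assumptions: the data is synthetic, the search space is reduced, `overfit_gap` is a column name invented here, and a callable scorer is used instead of the `'roc_auc'` string so that the regressor's `predict` output is scored directly.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold


def auc_scorer(estimator, X, y):
    """ROC AUC computed from the regressor's continuous predictions."""
    return roc_auc_score(y, estimator.predict(X))


# Synthetic binary problem standing in for `data` and `target` (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_distributions={'max_depth': [2, 5, 10],
                         'min_samples_leaf': randint(2, 30)},
    n_iter=4,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring=auc_scorer,
    return_train_score=True,  # needed for the mean_train_score column
    random_state=0,
)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)
# A large positive gap means the candidate fits the training folds much
# better than the held-out folds.
results['overfit_gap'] = results['mean_train_score'] - results['mean_test_score']
summary = results[['param_max_depth', 'param_min_samples_leaf',
                   'mean_train_score', 'mean_test_score', 'overfit_gap']]
print(summary.sort_values('mean_test_score', ascending=False))
```

Candidates with a similar test score but a smaller gap are usually the safer choice.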


Write the best result to a file.

In [17]:
hyperparameter_data = {
    'algorithm': 'random_forest',
    'hyperparameters': random_search.best_params_,
    'cv_splits': cv_splits,
    'auc': random_search.best_score_,
    'features': data.columns
} 

In [18]:
%run -i write_hyperparameters.py
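The contents of `write_hyperparameters.py` are not shown here. As a rough idea of what persisting such a record could look like, the sketch below appends one row to a CSV file. The helper `append_hyperparameters`, the JSON encoding of nested values, and the temporary output path are all assumptions for illustration, not the script's actual behavior.

```python
import json
import os
import tempfile

import pandas as pd


def append_hyperparameters(record, csv_path):
    """Append one search result as a CSV row (hypothetical helper)."""
    row = {
        'algorithm': record['algorithm'],
        # Nested values are stored as JSON strings so each fits in one cell.
        'hyperparameters': json.dumps(record['hyperparameters'], sort_keys=True),
        'cv_splits': record['cv_splits'],
        'auc': record['auc'],
        'features': json.dumps(list(record['features'])),
    }
    # Write the header only when the file is being created.
    write_header = not os.path.exists(csv_path)
    pd.DataFrame([row]).to_csv(csv_path, mode='a', header=write_header,
                               index=False)


# Scores and parameters copied from the search output above; the feature
# list is a shortened, illustrative subset.
record = {
    'algorithm': 'random_forest',
    'hyperparameters': {'bootstrap': True, 'max_depth': 35, 'max_features': 90,
                        'min_samples_leaf': 15, 'min_samples_split': 2,
                        'n_estimators': 71},
    'cv_splits': 10,
    'auc': 0.8703461300583676,
    'features': ['event_count', 'viewed product'],
}

csv_path = os.path.join(tempfile.mkdtemp(), 'random_forest.csv')
append_hyperparameters(record, csv_path)

saved = pd.read_csv(csv_path)
restored = json.loads(saved.loc[0, 'hyperparameters'])
print(restored)
```

Storing the feature list next to the score makes it possible to tell later which feature set a given AUC was obtained with.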