## tidymodels to sklearn

A scikit-learn implementation of this tidymodels case study: https://www.tidymodels.org/start/case-study/

This example is about hyperparameter tuning. It covers:

* sklearn pipelines
* transforming a single column sequentially
* estimators and preprocessing steps as parameters in the tuning grid
* a single validation set instead of cross-validation

References:

https://gist.github.com/amberjrivera/8c5c145516f5a2e894681e16a8095b5c#use-gridsearchcv-to-identify-the-best-estimator-and-optimize-over-the-entire-pipeline

https://stackoverflow.com/questions/50265993/alternate-different-models-in-pipeline-for-gridsearchcv

https://stackoverflow.com/questions/42266737/parallel-pipeline-to-get-best-model-using-gridsearch/42271829#42271829

In [1]:
import pandas as pd
import numpy as np

np.random.seed(753)
hotels = pd.read_csv('https://tidymodels.org/start/case-study/hotels.csv')

In [2]:
hotels = hotels.dropna()

In [3]:
hotels.transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,49990,49991,49992,49993,49994,49995,49996,49997,49998,49999
hotel,City_Hotel,City_Hotel,Resort_Hotel,Resort_Hotel,Resort_Hotel,City_Hotel,Resort_Hotel,City_Hotel,City_Hotel,City_Hotel,...,Resort_Hotel,Resort_Hotel,City_Hotel,City_Hotel,City_Hotel,Resort_Hotel,Resort_Hotel,City_Hotel,Resort_Hotel,City_Hotel
lead_time,217,2,95,143,136,67,47,56,80,6,...,283,197,414,225,73,172,48,155,140,12
stays_in_weekend_nights,1,0,2,2,1,2,0,0,0,2,...,2,2,0,2,0,0,0,0,2,2
stays_in_week_nights,3,1,5,6,4,2,2,3,4,2,...,8,8,2,4,2,2,4,4,5,1
adults,2,2,2,2,2,2,2,0,2,2,...,2,2,2,2,2,2,2,2,2,2
children,none,none,none,none,none,none,children,children,none,children,...,none,none,none,none,none,children,none,none,none,none
meal,BB,BB,BB,HB,HB,SC,BB,BB,BB,BB,...,BB,Undefined,HB,BB,SC,BB,FB,BB,HB,BB
country,DEU,PRT,GBR,ROU,PRT,GBR,ESP,ESP,FRA,FRA,...,GBR,GBR,DEU,BRA,FRA,PRT,PRT,DEU,GBR,DEU
market_segment,Offline_TA/TO,Direct,Online_TA,Online_TA,Direct,Online_TA,Direct,Online_TA,Online_TA,Online_TA,...,Offline_TA/TO,Offline_TA/TO,Groups,Online_TA,Online_TA,Direct,Direct,Offline_TA/TO,Direct,Online_TA
distribution_channel,TA/TO,Direct,TA/TO,TA/TO,Direct,TA/TO,Direct,TA/TO,TA/TO,TA/TO,...,TA/TO,TA/TO,TA/TO,TA/TO,TA/TO,Direct,Direct,TA/TO,Direct,TA/TO


In [4]:
hotels \
    .groupby('children') \
    .agg(count=('children', 'count')) \
    .assign(prop = lambda x: x['count'] / x['count'].sum())

children,count,prop
children,4026,0.080988
none,45685,0.919012


In [5]:
from sklearn.model_selection import train_test_split

features = hotels.drop('children', axis=1)
outcome = hotels['children']

X_train, X_test, y_train, y_test = train_test_split(
    features, 
    outcome, 
    test_size=0.25, 
    stratify=outcome
)
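
As a small illustration of what `stratify` does in the split above, the sketch below uses made-up labels (not the hotels data) with an 8% minority class and checks that both partitions keep roughly that share. The array names and the fixed `random_state` are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy outcome with an 8% minority class, loosely mimicking the imbalance above.
y_toy = np.array(["children"] * 80 + ["none"] * 920)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

# Share of the minority class in each partition.
train_share = np.mean(y_tr == "children")
test_share = np.mean(y_te == "children")
print(train_share, test_share)
```

Without `stratify`, the two shares could drift apart by chance, which matters more the rarer the minority class is.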

In [8]:
from sklearn.model_selection import ShuffleSplit

validation_set = ShuffleSplit(n_splits=1, train_size=0.8)
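
To see why `ShuffleSplit(n_splits=1, ...)` behaves like a single validation set, the sketch below runs the same splitter on 100 dummy rows and inspects the one split it yields. The dummy array and the `random_state` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_dummy = np.arange(100).reshape(-1, 1)

splitter = ShuffleSplit(n_splits=1, train_size=0.8, random_state=0)
splits = list(splitter.split(X_dummy))

# Exactly one (train, validation) pair of index arrays is produced.
train_idx, val_idx = splits[0]
print(len(splits), len(train_idx), len(val_idx))
```

Any object with this `split` interface can be passed as `cv` to `GridSearchCV`, so each candidate is fit once on the 80% and scored once on the remaining 20%. Note that `ShuffleSplit` does not stratify; `StratifiedShuffleSplit` would be the stratified counterpart if that is needed.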

In [9]:
from sklearn.linear_model import LogisticRegression

https://www.tomasbeuzen.com/post/scikit-learn-gridsearch-pipelines/

https://stackoverflow.com/questions/16437022/how-to-tune-parameters-of-nested-pipelines-by-gridsearchcv-in-scikit-learn

In [10]:
import holidays

relevant_holidays = [
    "Christmas Day", "Good Friday", "New Year's Day",
    "Easter Monday"
]

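def make_holidays(df, relevant_holidays=None):
    # Treat a missing list as "no relevant holidays" instead of failing on `in None`.
    relevant_holidays = relevant_holidays or []

    def __make_holidays(array, relevant_holidays=None):
        # Look up each date in the ECB calendar; `.get` returns None for non-holidays.
        all_holidays = list(map(holidays.EuropeanCentralBank().get, array))
        # Keep only the relevant holiday names and label everything else "_none".
        return np.array([holiday if holiday in relevant_holidays else "_none" for holiday in all_holidays])

    # Apply row by row and broadcast the result back to the original shape and columns.
    return df.apply(
        func=__make_holidays, axis='columns', result_type='broadcast', relevant_holidays=relevant_holidays
    )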

A pipeline is the key to applying several transformations to the same column in sequence. In this case, we map each arrival date to a holiday name (or to the placeholder string `_none`), and then one-hot encode the result with `OneHotEncoder`. Any multistep transformation of a column can be expressed this way.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer

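# Step 1 replaces dates with holiday names; step 2 one-hot encodes them,
# dropping the "_none" level so only real holidays get an indicator column.
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`.
holiday_indicator = make_pipeline(
    FunctionTransformer(make_holidays, kw_args={'relevant_holidays': relevant_holidays}),
    OneHotEncoder(drop=['_none'], sparse=False)
)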
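
The same two-step idea can be tried without the `holidays` dependency. The sketch below is a toy stand-in, not the notebook's transformer: a hypothetical `flag_weekend` function replaces each date with `"weekend"` or `"_none"`, and `OneHotEncoder` then encodes that single column. The dense output is obtained with `.toarray()` to avoid depending on the name of the sparse argument.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

def flag_weekend(df):
    # Hypothetical helper: label Saturdays and Sundays, everything else "_none".
    dates = pd.to_datetime(df["date"])
    labels = np.where(dates.dt.dayofweek >= 5, "weekend", "_none")
    return pd.DataFrame({"date": labels}, index=df.index)

weekend_indicator = make_pipeline(
    FunctionTransformer(flag_weekend),
    OneHotEncoder(drop=["_none"]),  # one indicator column remains: "weekend"
)

# 2021-01-01 is a Friday, so the middle two dates fall on a weekend.
toy = pd.DataFrame({"date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]})
encoded = weekend_indicator.fit_transform(toy).toarray()
print(encoded.ravel())
```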

In [21]:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector

nominal = features.columns[features.dtypes == object].drop('arrival_date')
numeric = features.columns[features.dtypes != object]

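# VarianceThreshold and StandardScaler are chained in a pipeline so that the numeric
# columns are filtered and then scaled once, instead of being emitted twice in parallel.
log_preprocess = make_column_transformer(
    (holiday_indicator, ["arrival_date"]),
    (OneHotEncoder(sparse=False, handle_unknown='ignore'), nominal),
    (make_pipeline(VarianceThreshold(), StandardScaler()), numeric)
)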

In [12]:
from sklearn.pipeline import Pipeline

model_pipeline = Pipeline([
    ('transformer', log_preprocess),
    ('classifier', LogisticRegression(max_iter=5000))
])

In [13]:
log_grid = {
    'classifier__C': 10**np.linspace(-4, -1, num=30)
}
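
The grid above is evenly spaced on a log scale. As a quick check, the sketch below shows that `np.logspace` produces the same 30 values; the variable names are only for illustration. Keep in mind that in scikit-learn `C` is the inverse of the regularization strength, so the smallest `C` in the grid is the strongest penalty.

```python
import numpy as np

manual_grid = 10 ** np.linspace(-4, -1, num=30)
logspace_grid = np.logspace(-4, -1, num=30)

# Both constructions give the same log-spaced values from 1e-4 to 1e-1.
same_values = np.allclose(manual_grid, logspace_grid)
print(same_values, manual_grid.min(), manual_grid.max())
```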

In [None]:
# Next step is random forest

In [14]:
from sklearn.ensemble import RandomForestClassifier

In [23]:
from category_encoders.ordinal import OrdinalEncoder

rf_preprocess = make_column_transformer(
    (holiday_indicator, ["arrival_date"]),
    (OrdinalEncoder(), nominal),
    remainder='passthrough'
)

In [24]:
model_grid = [
    log_grid,
    {
        'transformer': [rf_preprocess],
        'classifier': [RandomForestClassifier()],
        'classifier__max_features': np.linspace(1, 25, num=5, dtype=int),
        'classifier__min_samples_split': np.linspace(2, 40, num=5, dtype=int)
    }
]
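
A list of dictionaries lets the grid search swap whole pipeline steps, not just tune their parameters. The sketch below reproduces that pattern on synthetic data from `make_classification`. The dataset, the small grids, and the use of `"passthrough"` as the alternative preprocessing step are assumptions chosen to keep it fast; they are not the notebook's settings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline([
    ("transformer", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

grid = [
    # First dict: tune the default steps (scaler + logistic regression).
    {"classifier__C": [0.01, 0.1, 1.0]},
    # Second dict: replace both steps, then tune the new classifier.
    {
        "transformer": ["passthrough"],
        "classifier": [RandomForestClassifier(n_estimators=20, random_state=0)],
        "classifier__max_features": [2, 4],
    },
]

search = GridSearchCV(
    pipe,
    grid,
    cv=ShuffleSplit(n_splits=1, train_size=0.8, random_state=0),
    scoring="roc_auc",
)
search.fit(X_syn, y_syn)

# One row per candidate: 3 logistic regressions + 2 random forests.
summary = pd.DataFrame(search.cv_results_)[["params", "mean_test_score"]]
print(summary)
```

Parameters that do not apply to a candidate (for example `C` for the random forest) show up as missing values in `cv_results_`.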

In [25]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import roc_auc_score

roc_auc_scorer = {
    'roc_auc': make_scorer(roc_auc_score, needs_proba=True)
}

model_tuner = GridSearchCV(
    estimator=model_pipeline, 
    param_grid=model_grid, 
    cv=validation_set, 
    scoring=roc_auc_scorer,
    refit='roc_auc'
)

In [26]:
model_results = model_tuner.fit(X_train, y_train)

In [50]:
cv_results = pd.DataFrame(model_results.cv_results_)
renamed = {
    'mean_test_roc_auc': 'test_roc_auc',
    "param_classifier__C": "C",
    "param_classifier__max_features": "max_features",
    "param_classifier__min_samples_split": "min_samples_split"
}

cv_results \
    .rename(columns=renamed) \
    [list(renamed.values())] \
    .sort_values('test_roc_auc', ascending=False)

Unnamed: 0,test_roc_auc,C,max_features,min_samples_split
36,0.904333,,7.0,11.0
37,0.900241,,7.0,21.0
35,0.899664,,7.0,2.0
31,0.898898,,1.0,11.0
38,0.898265,,7.0,30.0
41,0.897192,,13.0,11.0
42,0.896677,,13.0,21.0
39,0.896293,,7.0,40.0
46,0.896264,,19.0,11.0
49,0.89488,,19.0,40.0


In [55]:
best_model = model_results.best_estimator_

In [200]:
importances = best_model.named_steps.classifier.feature_importances_
importances.shape

(25,)

In [None]:
transformers = best_model.named_steps.transformer.transformers_

In [206]:
pd.DataFrame({'name':[transform[2] for transform in transformers]}) \
    .apply(lambda x: x if type(x['name'][0]) is str else [features.iloc[:, x['name']].columns], axis=1) \
    .explode('name') \
    .reset_index(drop=True)

Unnamed: 0,name
0,arrival_date
1,hotel
2,meal
3,country
4,market_segment
5,distribution_channel
6,reserved_room_type
7,assigned_room_type
8,deposit_type
9,customer_type


In [217]:
ohe = transformers[0][1].steps[1][1]

In [223]:
ohe.get_feature_names()

array(['x0_Christmas Day', 'x0_Easter Monday', 'x0_Good Friday',
       "x0_New Year's Day"], dtype=object)

In [230]:
ordinal = transformers[1][1]

In [233]:
ordinal.get_feature_names()

['hotel',
 'meal',
 'country',
 'market_segment',
 'distribution_channel',
 'reserved_room_type',
 'assigned_room_type',
 'deposit_type',
 'customer_type',
 'required_car_parking_spaces']

In [237]:
transformers[2][2]

[1, 2, 3, 4, 9, 10, 11, 14, 16, 18, 20]