# Exercise M1.06

The goal of this exercise is to evaluate the impact of feature preprocessing on a pipeline that uses a decision-tree-based classifier instead of a logistic regression.

* The first question is to empirically evaluate whether scaling numerical features is helpful or not;
* The second question is to evaluate whether it is empirically better (from both a computational and a statistical perspective) to use integer-coded or one-hot encoded categories.

In [1]:
import pandas as pd
# Disable jedi autocompleter
%config Completer.use_jedi = False

In [2]:
df = pd.read_csv('data/adult-census.csv')

target = df['class']
data = df.drop(columns=['class', 'education-num', 'fnlwgt'])

As in the previous notebook, we use the utility `make_column_selector` to select only columns with a specific data type. Besides, we list in advance all categories for the categorical columns.

In [3]:
from sklearn.compose import make_column_selector as selector

In [4]:
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

In [7]:
numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)

## Reference Pipeline (no numerical scaling and integer-coded categories)

First, let's time the pipeline used in the main notebook to serve as a reference:

In [6]:
import time

from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
# Explicit opt-in required by scikit-learn < 1.0, where this estimator was experimental
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_validate

In [8]:
categorical_prep = OrdinalEncoder(handle_unknown="use_encoded_value",
                                  unknown_value=-1)
preprocessor = ColumnTransformer([
    ('categorical', categorical_prep, categorical_columns)],
    remainder='passthrough')

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())

start = time.time()
res = cross_validate(model, data, target)
elapsed_time = time.time() - start

scores = res['test_score']

print("The mean cross-validation accuracy is: "
      f"{scores.mean():.3f} +/- {scores.std():.3f} "
      f"with a fitting time of {elapsed_time:.3f}")

The mean cross-validation accuracy is: 0.874 +/- 0.003 with a fitting time of 8.348


## Scaling Numerical Features

Let's write a similar pipeline that also scales the numerical features using `StandardScaler`:

In [9]:
from sklearn.preprocessing import StandardScaler

In [14]:
preprocessor = ColumnTransformer([
    ('numerical', StandardScaler(), numerical_columns),
    ('categorical', categorical_prep, categorical_columns)])

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())

start = time.time()
res = cross_validate(model, data, target)
elapsed_time = time.time() - start

scores = res['test_score']

print("The mean cross-validation accuracy is: "
      f"{scores.mean():.3f} +/- {scores.std():.3f} "
      f"with a fitting time of {elapsed_time:.3f}")

The mean cross-validation accuracy is: 0.873 +/- 0.003 with a fitting time of 8.972


As we can see from the above examples, both accuracy and training time are approximately the same as the reference pipeline.

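Scaling numerical features is indeed useless for most decision tree models in general and for `HistGradientBoostingClassifier` in particular: trees split each feature on thresholds, so a monotonic rescaling of a feature leaves the resulting partitions of the samples unchanged.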
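To illustrate why rescaling does not matter for threshold-based splits, here is a minimal sketch using a hand-written decision stump in pure Python. It is not scikit-learn's implementation; the function names and the toy `ages` / `high_income` values are made up for the illustration. The stump picks the threshold that misclassifies the fewest samples, and we check that standardizing the feature does not change which samples end up on each side.

```python
from statistics import mean, pstdev


def minority_count(labels, indices):
    """Number of samples in `indices` that disagree with the majority label."""
    positives = sum(labels[i] for i in indices)
    return min(positives, len(indices) - positives)


def best_stump_partition(values, labels):
    """Return the indices sent to the left child by the best threshold split.

    "Best" means the split that minimizes the number of misclassified
    samples when each child predicts its majority label.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_errors, best_left = None, None
    # Candidate thresholds lie between consecutive distinct sorted values.
    for k in range(1, len(order)):
        if values[order[k - 1]] == values[order[k]]:
            continue
        left, right = order[:k], order[k:]
        errors = minority_count(labels, left) + minority_count(labels, right)
        if best_errors is None or errors < best_errors:
            best_errors, best_left = errors, frozenset(left)
    return best_left


# Toy data (hypothetical values, not taken from the census dataset).
ages = [22, 25, 31, 38, 45, 52, 60, 67]
high_income = [0, 0, 0, 1, 1, 1, 1, 0]

# Standardize the feature by hand: subtract the mean, divide by the std.
mu, sigma = mean(ages), pstdev(ages)
scaled_ages = [(a - mu) / sigma for a in ages]

raw_partition = best_stump_partition(ages, high_income)
scaled_partition = best_stump_partition(scaled_ages, high_income)

print("Left child (raw feature):   ", sorted(raw_partition))
print("Left child (scaled feature):", sorted(scaled_partition))
print("Same partition:", raw_partition == scaled_partition)
```

Standardization preserves the ordering of the values, so the set of candidate splits is the same before and after scaling; only the numerical value of the chosen threshold changes.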

## One-hot Encoding of Categorical Variables

We observed that integer coding of categorical variables can be very detrimental for linear models. However, it does not seem to be the case for `HistGradientBoostingClassifier` models, as the cross-validation score of the reference pipeline with `OrdinalEncoder` is reasonably good.

Let's see if we can get an even better accuracy with `OneHotEncoder`.

Hint: `HistGradientBoostingClassifier` does not yet support sparse input data. You might want to use `OneHotEncoder(handle_unknown="ignore", sparse=False)` to force the use of a dense representation as a workaround. (In scikit-learn 1.2 and later, this parameter is named `sparse_output`.)

In [15]:
from sklearn.preprocessing import OneHotEncoder

In [16]:
categorical_prep = OneHotEncoder(handle_unknown="ignore",
                                  sparse=False)
preprocessor = ColumnTransformer([
    ('categorical', categorical_prep, categorical_columns)],
    remainder='passthrough')

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())

start = time.time()
res = cross_validate(model, data, target)
elapsed_time = time.time() - start

scores = res['test_score']

print("The mean cross-validation accuracy is: "
      f"{scores.mean():.3f} +/- {scores.std():.3f} "
      f"with a fitting time of {elapsed_time:.3f}")

The mean cross-validation accuracy is: 0.873 +/- 0.003 with a fitting time of 24.555


From an accuracy point of view, the result is almost exactly the same. The reason is that `HistGradientBoostingClassifier` is expressive and robust enough to deal with the misleading ordering of integer-coded categories (which was not the case for linear models).
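The contrast with linear models can be sketched on a toy example. A linear classifier on a single integer-coded feature reduces to one threshold on the code, whereas a tree can chain several thresholds and isolate any code. The data and function names below are hypothetical and chosen so that only the middle code belongs to the positive class.

```python
# Toy integer-coded categorical feature: the codes carry no meaningful order,
# and only the category coded as 1 belongs to the positive class.
codes = [0, 1, 2, 0, 1, 2]
target = [0, 1, 0, 0, 1, 0]


def accuracy(predictions, target):
    """Fraction of predictions that match the target."""
    return sum(p == t for p, t in zip(predictions, target)) / len(target)


def best_single_threshold_accuracy(codes, target):
    """Best accuracy reachable by predicting 1 on one side of one threshold.

    This mimics what a linear classifier can do with a single feature.
    """
    best = 0.0
    for threshold in (-0.5, 0.5, 1.5, 2.5):
        for positive_above in (True, False):
            predictions = [int((c > threshold) == positive_above) for c in codes]
            best = max(best, accuracy(predictions, target))
    return best


def two_split_tree(code):
    """Hand-written depth-2 tree: split at 0.5, then split at 1.5."""
    if code <= 0.5:
        return 0
    return 1 if code <= 1.5 else 0


linear_like_accuracy = best_single_threshold_accuracy(codes, target)
tree_accuracy = accuracy([two_split_tree(c) for c in codes], target)

print(f"Best single-threshold accuracy: {linear_like_accuracy:.3f}")
print(f"Two-split tree accuracy:        {tree_accuracy:.3f}")
```

No single threshold separates the middle code from its two neighbors, while two nested splits do so exactly. This is the sense in which tree-based models cope with an arbitrary ordering of the codes.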

However, from a computational point of view, the training time is significantly longer: `OneHotEncoder` generates approximately 10 times more features than `OrdinalEncoder`.
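The difference in width between the two encodings can be counted by hand: integer coding produces one output column per categorical feature, while one-hot encoding produces one column per distinct category. The sketch below uses a few made-up rows (not the actual census data) to show the arithmetic.

```python
# Toy rows with three categorical columns (hypothetical values).
rows = [
    {"workclass": "Private", "education": "Bachelors", "sex": "Male"},
    {"workclass": "Private", "education": "HS-grad", "sex": "Female"},
    {"workclass": "State-gov", "education": "Masters", "sex": "Female"},
    {"workclass": "Self-emp", "education": "HS-grad", "sex": "Male"},
    {"workclass": "Private", "education": "Doctorate", "sex": "Female"},
]

columns = list(rows[0])

# Distinct categories observed in each column, in sorted order.
categories = {col: sorted({row[col] for row in rows}) for col in columns}

# Integer coding: one output column per input column.
n_ordinal_features = len(columns)

# One-hot encoding: one output column per (input column, category) pair.
n_one_hot_features = sum(len(cats) for cats in categories.values())


def one_hot(row):
    """Encode one row as a flat list of 0/1 indicators."""
    return [int(row[col] == cat) for col in columns for cat in categories[col]]


print("Integer-coded width:", n_ordinal_features)
print("One-hot width:      ", n_one_hot_features)
print("First row, one-hot: ", one_hot(rows[0]))
```

On the real dataset the same count can be obtained by summing the number of unique values of each categorical column; the more categories per column, the larger the ratio between the two widths.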

Note that the current implementation of `HistGradientBoostingClassifier` is still incomplete, and once sparse representations are handled correctly, training time might improve with such kinds of encodings.

The main takeaway is that arbitrary integer coding of categories is perfectly fine for `HistGradientBoostingClassifier` and yields fast training times.