# 📝 Exercise M1.05

The goal of this exercise is to **evaluate the impact of feature preprocessing**
on a pipeline that uses a **decision-tree-based classifier** instead of logistic
regression.

- The first question is to **empirically evaluate whether scaling numerical
  features** is helpful or not;
- The second question is to evaluate whether it is empirically better (both
  from a computational and a statistical perspective) to use **integer coded or
  one-hot encoded categories**.

In [1]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

In [2]:
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

As in the previous notebooks, we use the utility `make_column_selector`
to select only the columns with a specific data type. We also store the
names of the numerical and categorical columns for later use.

In [3]:
from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)
numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)
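To make the selection step concrete, here is a minimal sketch on a hypothetical three-row frame (the `toy` data and its values are assumptions for illustration, not rows from the adult census file). It shows that calling a selector on a dataframe returns the list of matching column names.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame; the string column is stored with an explicit
# object dtype so that selection by `object` applies to it.
toy = pd.DataFrame({
    "age": [25, 38, 52],
    "hours-per-week": [40, 50, 35],
    "workclass": pd.Series(["Private", "State-gov", "Private"], dtype=object),
})

# A selector is a callable: applied to a dataframe, it returns column names.
toy_numerical = selector(dtype_exclude=object)(toy)
toy_categorical = selector(dtype_include=object)(toy)

print(toy_numerical)
print(toy_categorical)
```

The two lists are what the `ColumnTransformer` receives to decide which preprocessor applies to which columns.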

## Reference pipeline (no numerical scaling and integer-coded categories)

First, let's **time the pipeline** we used in the main notebook to serve as a
**reference**:

In [4]:
%%time
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=-1)
preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns)],
    remainder="passthrough")

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())
cv_results = cross_validate(model, data, target)
scores = cv_results["test_score"]
print("The mean cross-validation accuracy is: "
      f"{scores.mean():.3f} +/- {scores.std():.3f}")

The mean cross-validation accuracy is: 0.873 +/- 0.003
CPU times: user 13.2 s, sys: 172 ms, total: 13.3 s
Wall time: 6.84 s
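The reference pipeline relies on `OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)`. The sketch below illustrates what that setting does when a category appears at prediction time that was never seen during `fit`. The tiny `train` and `test` arrays are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical single-column data: "Never-worked" is absent from the training part.
train = np.array([["Private"], ["State-gov"], ["Private"]], dtype=object)
test = np.array([["State-gov"], ["Never-worked"]], dtype=object)

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

# Known categories get their integer code; unknown ones get `unknown_value`.
encoded = encoder.transform(test)
print(encoder.categories_)
print(encoded)
```

Without this setting, a cross-validation fold whose test part contains a rare category missing from the training part would raise an error at `transform` time.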


## Scaling numerical features

Let's write a similar pipeline that also scales the numerical features using
**`StandardScaler`** (or similar):

In [6]:
from sklearn.preprocessing import StandardScaler

preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns),
    ('numerical', StandardScaler(), numerical_columns)
])

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())
cv_results = cross_validate(model, data, target)
scores = cv_results["test_score"]
print("The mean cross-validation accuracy is: "
      f"{scores.mean():.3f} +/- {scores.std():.3f}")

The mean cross-validation accuracy is: 0.873 +/- 0.003
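The accuracy is the same as for the reference pipeline (0.873 +/- 0.003), which is what one would expect from tree-based models: they split on thresholds, and an increasing rescaling of a feature preserves the ordering of its values. The sketch below checks this on a simpler setting than the notebook's, as an assumption for illustration: a single `DecisionTreeClassifier` on synthetic data from `make_classification`, not `HistGradientBoostingClassifier` on the census data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption: stands in for the numerical columns only).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same tree settings, with and without standardization.
raw_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
scaled_tree = make_pipeline(
    StandardScaler(), DecisionTreeClassifier(max_depth=3, random_state=0)
)

raw_pred = raw_tree.fit(X_train, y_train).predict(X_test)
scaled_pred = scaled_tree.fit(X_train, y_train).predict(X_test)

# Fraction of test samples on which both models predict the same class.
agreement = np.mean(raw_pred == scaled_pred)
print(f"Fraction of identical predictions: {agreement:.3f}")
```

Tiny floating-point effects can in principle move a sample across a threshold, so the comparison is best read as "practically identical" and not as a formal guarantee.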


## One-hot encoding of categorical variables

For linear models, we have observed that integer coding of categorical
variables can be very detrimental. However, this does not seem to be the case
for `HistGradientBoostingClassifier` models: the cross-validation accuracy of
the reference pipeline with `OrdinalEncoder` is already good.

Let's see if we can get an even better accuracy with **`OneHotEncoder`**.

Hint: **`HistGradientBoostingClassifier`** does not yet support sparse input
data. You might want to use
**`OneHotEncoder(handle_unknown="ignore", sparse=False)`** to force the use of a
**dense representation** as a workaround.
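More recent scikit-learn releases renamed the `sparse` keyword of `OneHotEncoder` to `sparse_output`, so the hint as written may fail depending on the installed version. The following sketch uses a hypothetical helper, `make_dense_one_hot_encoder`, that inspects the constructor signature and picks whichever keyword is available. The toy arrays are assumptions for illustration.

```python
import inspect

import numpy as np
from sklearn.preprocessing import OneHotEncoder


def make_dense_one_hot_encoder():
    """Hypothetical helper: dense one-hot encoder on older and newer releases."""
    params = inspect.signature(OneHotEncoder).parameters
    # Prefer the newer keyword when it exists, fall back to the older one.
    dense_kwarg = "sparse_output" if "sparse_output" in params else "sparse"
    return OneHotEncoder(handle_unknown="ignore", **{dense_kwarg: False})


# Hypothetical single-column data with one category unseen during fit.
train = np.array([["Private"], ["State-gov"], ["Private"]], dtype=object)
test = np.array([["State-gov"], ["Never-worked"]], dtype=object)

encoder = make_dense_one_hot_encoder().fit(train)
encoded = encoder.transform(test)

# The result is a regular NumPy array; the unknown category becomes all zeros.
print(type(encoded).__name__, encoded.shape)
print(encoded)
```

With `handle_unknown="ignore"`, an unseen category is encoded as a row of zeros instead of raising an error.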

In [7]:
from sklearn.preprocessing import OneHotEncoder

preprocessor = ColumnTransformer([
    ('categorical',
     OneHotEncoder(handle_unknown="ignore", sparse=False),
     categorical_columns),
    ('numerical', StandardScaler(), numerical_columns)
])

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())
cv_results = cross_validate(model, data, target)
scores = cv_results["test_score"]
print("The mean cross-validation accuracy is: "
      f"{scores.mean():.3f} +/- {scores.std():.3f}")

The mean cross-validation accuracy is: 0.873 +/- 0.002


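One gets essentially the **same accuracy** (a mean of 0.873) with both the
`OrdinalEncoder` and the `OneHotEncoder`. Note that this last pipeline also
scales the numerical features, which did not change the accuracy in the
previous section.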
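For the computational side of the comparison, `cross_validate` already reports a `fit_time` for each fold, and the number of columns produced by each encoder gives a hint about the cost. The sketch below is an illustration under stated assumptions: it uses synthetic categorical data and a `DecisionTreeClassifier` (which accepts sparse input) in place of the census data and `HistGradientBoostingClassifier`, and it makes no claim about which encoding is faster on the real dataset.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): two categorical columns with 8 and 6 categories.
rng = np.random.RandomState(0)
n_samples = 300
toy_data = np.column_stack([
    rng.choice(list("abcdefgh"), size=n_samples),
    rng.choice(list("uvwxyz"), size=n_samples),
]).astype(object)
# The target depends only on the first column.
toy_target = np.array([value in "abcd" for value in toy_data[:, 0]], dtype=int)

encoders = {
    "ordinal": OrdinalEncoder(handle_unknown="use_encoded_value",
                              unknown_value=-1),
    "one-hot": OneHotEncoder(handle_unknown="ignore"),
}

summary = {}
for name, encoder in encoders.items():
    # Width of the encoded matrix: one column per feature or per category.
    n_columns = encoder.fit_transform(toy_data).shape[1]
    model = make_pipeline(encoder, DecisionTreeClassifier(random_state=0))
    cv_results = cross_validate(model, toy_data, toy_target)
    summary[name] = {
        "n_columns": n_columns,
        "fit_time": cv_results["fit_time"].mean(),
        "accuracy": cv_results["test_score"].mean(),
    }
    print(f"{name}: {n_columns} columns, "
          f"mean fit time {summary[name]['fit_time']:.4f} s, "
          f"mean accuracy {summary[name]['accuracy']:.3f}")
```

On the real data, the same `fit_time` entries of `cv_results` could be compared across the three pipelines of this exercise, as an alternative to timing whole cells with `%%time`.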