# 📝 Exercise M1.04

The goal of this exercise is to evaluate the impact of using an arbitrary
integer encoding for categorical variables along with a linear
classification model such as Logistic Regression.

To do so, let's try to use `OrdinalEncoder` to preprocess the categorical
variables. This preprocessor is assembled in a pipeline with
`LogisticRegression`. The generalization performance of the pipeline can be
evaluated by cross-validation and then compared to the score obtained when
using `OneHotEncoder` or to some other baseline score.

First, we load the dataset.

In [1]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

In [2]:
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

In the previous notebook, we used `sklearn.compose.make_column_selector` to
automatically select columns with a specific data type (also called `dtype`).
Here, we will use this selector to get only the columns containing strings
(columns with `object` dtype), which correspond to the categorical features in
our dataset.

In [3]:
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
data_categorical = data[categorical_columns]

Define a scikit-learn pipeline composed of an `OrdinalEncoder` and a
`LogisticRegression` classifier.

Because `OrdinalEncoder` can raise errors if it sees an unknown category at
prediction time, you can set the `handle_unknown="use_encoded_value"` and
`unknown_value` parameters. You can refer to the
[scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)
for more details regarding these parameters.
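
As a minimal sketch of how these two parameters behave, the snippet below fits an `OrdinalEncoder` on a toy column and then transforms a category it has never seen. The toy values and the choice of `-1` as `unknown_value` are assumptions made for illustration; they do not come from the census dataset.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy training column (hypothetical data, not taken from the census dataset).
train = np.array([["France"], ["Japan"], ["France"]], dtype=object)
# "Peru" never appears in the training data.
test = np.array([["Japan"], ["Peru"]], dtype=object)

# Unknown categories are mapped to `unknown_value` instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

encoded_test = encoder.transform(test)
print(encoded_test)
```

Here `"Japan"` keeps the integer learned during `fit`, while the unseen `"Peru"` is encoded as `-1`. Any integer that cannot collide with a learned code would work in place of `-1`.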

In [44]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression

# Write your code here.
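# `handle_unknown` is left at its default ("error"), so a category that is
# absent from a training fold raises a ValueError at prediction time.
model = make_pipeline(OrdinalEncoder(), LogisticRegression(max_iter=500))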

Your model is now defined. Evaluate it with cross-validation using
`sklearn.model_selection.cross_validate`.

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">Be aware that if an error happens during the cross-validation,
<tt class="docutils literal">cross_validate</tt> will raise a warning and return NaN (Not a Number)
as scores. To make it raise a standard Python exception with a traceback,
you can pass the <tt class="docutils literal"><span class="pre">error_score="raise"</span></tt> argument in the call to
<tt class="docutils literal">cross_validate</tt>. An exception will be raised instead of a warning at the first
encountered problem and <tt class="docutils literal">cross_validate</tt> will stop right away instead of
returning NaN values. This is particularly handy when developing
complex machine learning pipelines.</p>
</div>
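
The behavior described in the note can be reproduced on a small scale. The sketch below uses hypothetical toy data in which the category `"z"` appears only in the last row, together with an unshuffled `KFold` so that this row always lands in the last test fold. The names `toy_data`, `toy_target`, and `toy_model` are made up for this example.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: the category "z" appears only in the last row.
toy_data = pd.DataFrame({"letter": ["a", "b", "a", "b", "a", "b", "a", "z"]})
toy_target = pd.Series([0, 1, 0, 1, 0, 1, 0, 1])

toy_model = make_pipeline(OrdinalEncoder(), LogisticRegression())
# Without shuffling, the last fold tests on rows 6-7, so "z" is unseen at fit time.
cv = KFold(n_splits=4)

try:
    cross_validate(toy_model, toy_data, toy_target, cv=cv, error_score="raise")
    caught = None
except ValueError as exc:
    # With error_score="raise", the first failing fold stops the evaluation.
    caught = exc
    print(f"cross_validate stopped with: {exc}")
```

Dropping `error_score="raise"` from the call would instead let the evaluation finish, with a warning and a NaN score for the failing fold.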

In [40]:
from sklearn.model_selection import cross_validate

# Write your code here.

cv_results = cross_validate(model, data_categorical, target, error_score="raise")
cv_results

ValueError: Found unknown categories [' Holand-Netherlands'] in column 7 during transform

In [45]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data_categorical, target)
cv_results

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 418, in _passthrough_scorer
    return estimator.score(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/utils/metaestimators.py", line 113, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)  # noqa
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 707, in score
    Xt = transform.transform(Xt)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py", line 930, in transform
    X_int, X_mask = self._transform(
  File "/opt/conda/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py", line 142, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories [' Holand-Netherlands'] in column 7 during tra

{'fit_time': array([0.37285161, 0.28725672, 0.35897374, 0.38506269, 0.33514285]),
 'score_time': array([0.02859473, 0.0299592 , 0.03008056, 0.01529455, 0.0305233 ]),
 'test_score': array([0.75514382, 0.75555328, 0.75573301,        nan, 0.75788288])}

Now, we would like to compare the generalization performance of our previous
model with a new model where instead of using an `OrdinalEncoder`, we will
use a `OneHotEncoder`. Repeat the model evaluation using cross-validation.
Compare the score of both models and conclude on the impact of choosing a
specific encoding strategy when using a linear model.
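
To make the difference between the two encodings concrete, here is a small sketch on a made-up `sizes` column (an assumption for illustration, not part of the census data). `OrdinalEncoder` assigns integers following the lexicographic order of the strings, whereas `OneHotEncoder` creates one indicator column per category.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy column with a natural order: S < M < L < XL.
sizes = np.array([["S"], ["M"], ["L"], ["XL"]], dtype=object)

# Categories are sorted as strings ("L", "M", "S", "XL"), so the integer codes
# do not follow the natural size order.
ordinal_codes = OrdinalEncoder().fit_transform(sizes)

# One indicator column per category; .toarray() densifies the sparse result.
one_hot_codes = OneHotEncoder().fit_transform(sizes).toarray()

print(ordinal_codes.ravel())
print(one_hot_codes)
```

A linear model multiplies each integer code by a single coefficient, so it treats the arbitrary codes as if they were ordered and evenly spaced. With one-hot encoding, each category gets its own coefficient.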

In [41]:
from sklearn.preprocessing import OneHotEncoder

# Write your code here.
model = make_pipeline(OneHotEncoder(), LogisticRegression(max_iter=500))

cv_results2 = cross_validate(model, data_categorical, target)
cv_results2

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 418, in _passthrough_scorer
    return estimator.score(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/utils/metaestimators.py", line 113, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)  # noqa
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 707, in score
    Xt = transform.transform(Xt)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py", line 509, in transform
    X_int, X_mask = self._transform(
  File "/opt/conda/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py", line 142, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories [' Holand-Netherlands'] in column 7 during tra

{'fit_time': array([0.86602902, 0.88205767, 0.78567529, 0.87298703, 0.82375693]),
 'score_time': array([0.03053951, 0.03146529, 0.02974367, 0.01517153, 0.02821445]),
 'test_score': array([0.83222438, 0.83560242, 0.82872645,        nan, 0.83466421])}

In [46]:
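import numpy as np

fit_times = [cv_results["fit_time"].mean(), cv_results2["fit_time"].mean()]
# nanmean skips the NaN score of the fold that failed on an unknown category.
scores = [np.nanmean(cv_results["test_score"]), np.nanmean(cv_results2["test_score"])]
print("Comparing training times, ordinal vs onehot", *fit_times)
print("Comparing performances", *scores)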

Comparing training times, ordinal vs onehot 0.34785752296447753 0.846101188659668
Comparing performances 0.7560782479242659 0.8328043656122273
