# 📝 Exercise M1.04

The goal of this exercise is to evaluate the impact of using an arbitrary
integer encoding for categorical variables along with a linear
classification model such as Logistic Regression.

To do so, let's try to use `OrdinalEncoder` to preprocess the categorical
variables. This preprocessor is assembled in a pipeline with
`LogisticRegression`. The statistical performance of the pipeline can be
evaluated by cross-validation and then compared to the score obtained when
using `OneHotEncoder` or to some other baseline score.

First, we load the dataset.

In [1]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

In [2]:
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

In the previous notebook, we used `sklearn.compose.make_column_selector` to
automatically select columns with a specific data type (also called `dtype`).
Here, we will use this selector to get only the columns containing strings
(column with `object` dtype) that correspond to categorical features in our
dataset.

In [3]:
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
data_categorical = data[categorical_columns]
data_categorical

|  | workclass | education | marital-status | occupation | relationship | race | sex | native-country |
|---|---|---|---|---|---|---|---|---|
| 0 | Private | 11th | Never-married | Machine-op-inspct | Own-child | Black | Male | United-States |
| 1 | Private | HS-grad | Married-civ-spouse | Farming-fishing | Husband | White | Male | United-States |
| 2 | Local-gov | Assoc-acdm | Married-civ-spouse | Protective-serv | Husband | White | Male | United-States |
| 3 | Private | Some-college | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | United-States |
| 4 | ? | Some-college | Never-married | ? | Own-child | White | Female | United-States |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48837 | Private | Assoc-acdm | Married-civ-spouse | Tech-support | Wife | White | Female | United-States |
| 48838 | Private | HS-grad | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | United-States |
| 48839 | Private | HS-grad | Widowed | Adm-clerical | Unmarried | White | Female | United-States |
| 48840 | Private | HS-grad | Never-married | Adm-clerical | Own-child | White | Male | United-States |
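As a small illustration of how the selector behaves, here is a minimal sketch on a hypothetical three-column frame (the `toy` data is made up for this example, and the string columns are explicitly given the `object` dtype). Calling the selector on a dataframe should return the names of the matching columns, in their original order.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame: two string columns (forced to object dtype)
# and one numerical column.
toy = pd.DataFrame(
    {
        "workclass": pd.Series(["Private", "Local-gov", "?"], dtype=object),
        "age": [25, 38, 18],
        "sex": pd.Series(["Male", "Male", "Female"], dtype=object),
    }
)

# The selector is a callable: it takes a dataframe and returns column names.
select_object_columns = make_column_selector(dtype_include=object)
selected = select_object_columns(toy)
print(selected)
```

The numerical `age` column is left out, which is the same mechanism used above to build `data_categorical`.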



The dataset is now filtered so that it contains only categorical features.

Define a scikit-learn pipeline composed of an `OrdinalEncoder` and a
`LogisticRegression` classifier.

Because `OrdinalEncoder` can raise errors if it sees an unknown category at
prediction time, you can set the `handle_unknown="use_encoded_value"` and
`unknown_value` parameters. You can refer to the
[scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)
for more details regarding these parameters.
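To make the effect of these two parameters concrete, here is a minimal sketch on made-up data (the `train` and `new` arrays are hypothetical and only reuse category names from the dataset). A category that was not seen during `fit` should be mapped to the chosen `unknown_value` instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: "Self-emp-inc" appears only at prediction time.
train = np.array([["Private"], ["Local-gov"], ["Private"]], dtype=object)
new = np.array([["Local-gov"], ["Self-emp-inc"]], dtype=object)

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

# Known categories are numbered in sorted order; unknown ones get -1.
encoded = encoder.transform(new)
print(encoder.categories_)
print(encoded)
```

Choosing `-1` keeps unknown categories distinct from the non-negative codes assigned to known categories, although a linear model will still treat it as just another number on the same scale.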

In [5]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression

# Write your code here.

model = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    LogisticRegression(max_iter=1000),
)
model

Pipeline(steps=[('ordinalencoder',
                 OrdinalEncoder(handle_unknown='use_encoded_value',
                                unknown_value=-1)),
                ('logisticregression', LogisticRegression(max_iter=1000))])

Your model is now defined. Evaluate it with cross-validation using
`sklearn.model_selection.cross_validate`.

In [7]:
from sklearn.model_selection import cross_validate

# Write your code here.
results = cross_validate(model, data_categorical, target, cv=7, return_train_score=True)
results

{'fit_time': array([0.61698222, 0.60000062, 0.51700258, 0.4715898 , 0.54400253,
        0.55699873, 0.62760305]),
 'score_time': array([0.05300236, 0.04199958, 0.02901554, 0.04800415, 0.04202199,
        0.03699899, 0.02600074]),
 'test_score': array([0.75551734, 0.75680711, 0.75351103, 0.75605561, 0.75505231,
        0.75591228, 0.75677225]),
 'train_score': array([0.75566119, 0.75578062, 0.7550879 , 0.755882  , 0.75480712,
        0.75600143, 0.75585812])}
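The raw dictionary is easier to read once the test scores are summarized. The sketch below copies the `test_score` values from the output above into a hypothetical `ordinal_test_scores` array and reports their mean and standard deviation. Since no `scoring` argument was passed, these values should be accuracies, the default score of a classifier.

```python
import numpy as np

# Test scores copied from the `cross_validate` output shown above.
ordinal_test_scores = np.array(
    [0.75551734, 0.75680711, 0.75351103, 0.75605561,
     0.75505231, 0.75591228, 0.75677225]
)

mean_score = ordinal_test_scores.mean()
std_score = ordinal_test_scores.std()
print(f"Mean test accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```

In the notebook itself, the same summary can be obtained directly from `results["test_score"]`.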

Now, we would like to compare the statistical performance of our previous
model with a new model that uses a `OneHotEncoder` instead of an
`OrdinalEncoder`. Repeat the model evaluation using cross-validation.
Compare the scores of both models and draw a conclusion about the impact of
choosing a specific encoding strategy when using a linear model.

In [17]:
from sklearn.preprocessing import OneHotEncoder

# Write your code here.
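# The encoder output is sparse by default, so no `sparse` argument is needed
# (that parameter was renamed `sparse_output` in scikit-learn 1.2).
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)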
results = cross_validate(model, data_categorical, target, cv=7, return_train_score=True)
print(results)

score = results["test_score"]
print(f"Score: {score.mean():.3f}, standard deviation: {score.std():.3f}")

{'fit_time': array([1.17471385, 1.20977116, 1.10197759, 1.27097583, 1.2805984 ,
       1.09199619, 1.01700521]), 'score_time': array([0.02401829, 0.02499771, 0.02599931, 0.02501965, 0.02700353,
       0.02699733, 0.02499843]), 'test_score': array([0.83233018, 0.83361995, 0.83147034, 0.82886627, 0.83330945,
       0.83732263, 0.83445607]), 'train_score': array([0.83401013, 0.8331502 , 0.83412956, 0.8341813 , 0.83367968,
       0.83346471, 0.83396632])}
Score: 0.833, standard deviation: 0.002
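The one-hot pipeline reaches a mean test score of about 0.833, against roughly 0.756 for the ordinal pipeline. One plausible explanation is that `OrdinalEncoder` assigns integers in an arbitrary (alphabetical) order, and a linear model reads those integers as a meaningful numerical scale. The sketch below illustrates the idea on hypothetical data, where only the middle category is associated with the positive class. It is a toy setting evaluated on the training data, not a reproduction of the experiment above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data: one categorical feature with three balanced categories.
# Only "b", the middle category in alphabetical order, is positive.
categories = np.repeat(["a", "b", "c"], 100).astype(object).reshape(-1, 1)
labels = (categories.ravel() == "b").astype(int)

ordinal_model = make_pipeline(OrdinalEncoder(), LogisticRegression())
one_hot_model = make_pipeline(OneHotEncoder(), LogisticRegression())

# Training accuracy is enough here: the point is what each model can express.
ordinal_accuracy = ordinal_model.fit(categories, labels).score(categories, labels)
one_hot_accuracy = one_hot_model.fit(categories, labels).score(categories, labels)

print(f"Ordinal encoding accuracy: {ordinal_accuracy:.3f}")
print(f"One-hot encoding accuracy: {one_hot_accuracy:.3f}")
```

With the codes 0, 1, 2, a single linear threshold cannot isolate the middle value, so the ordinal pipeline cannot classify more than two thirds of these samples correctly. One-hot encoding gives each category its own coefficient, which removes that constraint.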
