# Exercise 105

## Part 1

The goal of this exercise is to evaluate the impact of using an arbitrary integer encoding for categorical variables along with a linear classification model such as Logistic Regression.

To do so, let's try to use `OrdinalEncoder` to preprocess the categorical variables. This preprocessor is assembled in a pipeline with `LogisticRegression`. The generalization performance of the pipeline can be evaluated by cross-validation and then compared to the score obtained when using `OneHotEncoder` or to some other baseline score.

First, we load the dataset.


In [2]:
# Standard imports
import pandas as pd
import numpy as np

In [3]:
# Load the dataset
df = pd.read_csv('data/adult-census.csv')
df.head(3)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K


In [4]:
# Separate target from the rest of the data
target = df['class']

# Drop redundant columns
data = df.drop(columns=['class', 'education-num', 'fnlwgt'])

Select the columns with `object` dtype using `sklearn.compose.make_column_selector`.

In [6]:
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
data_categorical = data[categorical_columns]

In [8]:
categorical_columns

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']
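To make the selection step concrete, here is a minimal sketch on a toy frame. The frame `toy` and its values are made up for illustration (only the column names echo the dataset), and the string columns are built with an explicit `object` dtype so the example does not depend on how a given pandas version infers string types.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame: two numerical and two categorical columns.
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": pd.Series(["Private", "Private", "Local-gov"], dtype=object),
    "hours-per-week": [40, 50, 40],
    "sex": pd.Series(["Male", "Male", "Female"], dtype=object),
})

# A selector is a callable: given a DataFrame, it returns matching column names.
select_categorical = make_column_selector(dtype_include=object)
select_numerical = make_column_selector(dtype_exclude=object)

cat_cols = select_categorical(toy)
num_cols = select_numerical(toy)
print(cat_cols)
print(num_cols)
```

The selector only inspects dtypes, so a numerically coded categorical column would not be picked up by `dtype_include=object`.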

Define a scikit-learn pipeline composed of an `OrdinalEncoder` and a `LogisticRegression` classifier.

Because `OrdinalEncoder` raises an error by default if it sees an unknown category at prediction time, you can set the `handle_unknown="use_encoded_value"` and `unknown_value` parameters.
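The behaviour of these two parameters can be sketched on a tiny example. The arrays `X_train` and `X_new` are made up for illustration; the category `"Never-worked"` is simply assumed to be absent from the training data.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy training data: a single categorical column with three known categories.
X_train = np.array([["Private"], ["Local-gov"], ["State-gov"]], dtype=object)
# "Never-worked" was not seen during fit.
X_new = np.array([["Private"], ["Never-worked"]], dtype=object)

# With the default settings, transforming an unseen category raises an error.
strict_encoder = OrdinalEncoder().fit(X_train)
try:
    strict_encoder.transform(X_new)
    raised = False
except ValueError:
    raised = True

# With use_encoded_value, unseen categories are mapped to unknown_value.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(X_train)
encoded = encoder.transform(X_new)

print(raised)
print(encoder.categories_)
print(encoded)
```

Choosing `unknown_value=-1` keeps the code for unseen categories outside the range of codes assigned to known categories, which start at 0.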

In [9]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression

In [13]:
model = make_pipeline(OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), 
                      LogisticRegression(max_iter=500))

Now that we have defined the model, we can evaluate it using cross-validation.

In [14]:
from sklearn.model_selection import cross_validate

res = cross_validate(model, data_categorical, target)
res

{'fit_time': array([0.68291903, 0.41284108, 0.53323007, 0.61784101, 0.42726493]),
 'score_time': array([0.04455805, 0.04169703, 0.04239798, 0.05333376, 0.04661489]),
 'test_score': array([0.75514382, 0.75555328, 0.75573301, 0.75307125, 0.75788288])}

In [27]:
scores = res['test_score']
print(f"The mean cross-validation accuracy is: {scores.mean():.3f} +/- {scores.std():.3f}")

The mean cross-validation accuracy is: 0.755 +/- 0.002
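To put such a score in context, it helps to compare it with a trivial baseline, as suggested in the introduction. The sketch below uses `DummyClassifier` with the `most_frequent` strategy on made-up toy data (`toy_data` and `toy_target` are hypothetical and are not the census data), so the printed number says nothing about the real dataset; the same pattern could be applied to `data_categorical` and `target`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Hypothetical imbalanced target: 15 samples of one class, 5 of the other.
toy_target = np.array(["<=50K"] * 15 + [">50K"] * 5)
# The dummy model ignores the features, so a placeholder column is enough.
toy_data = np.zeros((len(toy_target), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline_res = cross_validate(baseline, toy_data, toy_target)
baseline_scores = baseline_res["test_score"]

print(f"Baseline accuracy: {baseline_scores.mean():.3f} +/- {baseline_scores.std():.3f}")
```

A model whose accuracy is close to this kind of baseline has learned little beyond the class proportions.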


Now we want to compare the generalization performance of our previous model with a new model that uses a `OneHotEncoder` instead of an `OrdinalEncoder`.

In [28]:
from sklearn.preprocessing import OneHotEncoder

model = make_pipeline(OneHotEncoder(handle_unknown='ignore'), LogisticRegression(max_iter=500))
res = cross_validate(model, data_categorical, target)
res

{'fit_time': array([1.18171215, 0.95225096, 1.02713704, 1.09081602, 1.04541922]),
 'score_time': array([0.04588509, 0.04491282, 0.04246402, 0.04252815, 0.03971076]),
 'test_score': array([0.83222438, 0.83560242, 0.82872645, 0.83312858, 0.83466421])}

In [29]:
scores = res['test_score']
print(f"The mean cross-validation accuracy is: {scores.mean():.3f} +/- {scores.std():.3f}")

The mean cross-validation accuracy is: 0.833 +/- 0.002
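As a side note on `handle_unknown='ignore'`, the following minimal sketch shows what the one-hot encoder produces for a category it did not see during fit. The arrays are made up for illustration, and `"Never-worked"` is assumed to be unseen.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training data: a single categorical column with three known categories.
X_train = np.array([["Private"], ["Local-gov"], ["State-gov"]], dtype=object)
# "Never-worked" was not seen during fit.
X_new = np.array([["Private"], ["Never-worked"]], dtype=object)

onehot = OneHotEncoder(handle_unknown="ignore").fit(X_train)

# The default output is a sparse matrix; convert it to a dense array to inspect it.
onehot_encoded = onehot.transform(X_new).toarray()

print(onehot.categories_)
print(onehot_encoded)
```

Each known category gets its own indicator column, so no order is implied between categories, and an unseen category is encoded as a row of zeros.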


<div class="alert alert-block alert-warning">
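The important message here is: a linear model and <code>OrdinalEncoder</code> should be used together only for ordinal categorical features, that is, features whose categories have a meaningful order. Otherwise, the model treats the arbitrary integer codes as magnitudes and is likely to perform poorly.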
</div>
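The following sketch illustrates why the default integer codes are arbitrary. By default, `OrdinalEncoder` orders categories lexicographically; when a feature really is ordinal, the order can be given explicitly through the `categories` parameter. The three education levels and their ranking below are an assumption made for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X = np.array([["HS-grad"], ["11th"], ["Assoc-acdm"], ["HS-grad"]], dtype=object)

# Default behaviour: categories are sorted lexicographically,
# so "Assoc-acdm" is placed before "HS-grad".
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(X)

# Explicit order (assumed here): 11th < HS-grad < Assoc-acdm.
ordered_encoder = OrdinalEncoder(categories=[["11th", "HS-grad", "Assoc-acdm"]])
ordered_codes = ordered_encoder.fit_transform(X)

print(default_encoder.categories_)
print(default_codes.ravel())
print(ordered_codes.ravel())
```

A linear model multiplies these codes by a single coefficient, so the codes only make sense when their order, and roughly their spacing, reflects something real about the categories.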