# Handling Categorical Data

## Encoding of Categorical Variables

In this section, we will present typical ways of dealing with categorical variables by encoding them, namely **ordinal encoding** and **one-hot encoding**.

So, without further ado, let's start by importing the required modules and loading the data set.

In [1]:
# 1. Standard imports
import pandas as pd

# A few tweaked pandas options for friendlier output:

# Use 3 decimal places in output display
%precision %.3f
pd.set_option("display.precision", 3)

# Disable jedi autocompleter
%config Completer.use_jedi = False

In [2]:
df = pd.read_csv('data/adult-census.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [3]:
# This time we keep the categorical `education` variable and drop the numerical `education-num`
df = df.drop(columns='education-num')
target = df['class']
data = df.drop(columns='class')

### Identify Categorical Variables

As we saw in the previous section, a numerical variable is a quantity represented by a real or integer number. These variables can be naturally handled by machine learning algorithms that are typically composed of a sequence of arithmetic instructions such as additions and multiplications.

In contrast, categorical variables have discrete values, typically represented by string labels taken from a finite list of possible choices. For example:

In [4]:
data['native-country'].value_counts().sort_index()

 ?                               857
 Cambodia                         28
 Canada                          182
 China                           122
 Columbia                         85
 Cuba                            138
 Dominican-Republic              103
 Ecuador                          45
 El-Salvador                     155
 England                         127
 France                           38
 Germany                         206
 Greece                           49
 Guatemala                        88
 Haiti                            75
 Holand-Netherlands                1
 Honduras                         20
 Hong                             30
 Hungary                          19
 India                           151
 Iran                             59
 Ireland                          37
 Italy                           105
 Jamaica                         106
 Japan                            92
 Laos                             23
 Mexico                          951
 

There are different ways to select categorical data:

In [5]:
# 1.
data.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object

In [6]:
# 2.
data.select_dtypes(include='object').columns

Index(['workclass', 'education', 'marital-status', 'occupation',
       'relationship', 'race', 'sex', 'native-country'],
      dtype='object')

### Select Features Based on their Data Type

scikit-learn comes with a helper function `make_column_selector`, which allows us to select columns based on their data type. Since this is a tutorial about sklearn, in the following we will illustrate how to use this helper.

In [7]:
from sklearn.compose import make_column_selector as selector

categorical_cols_selector = selector(dtype_include=object)
categorical_cols = categorical_cols_selector(data)
data_categorical = data[categorical_cols]
print(f"There are {len(categorical_cols)} categorical features in the dataset:")
categorical_cols

There are 8 categorical features in the dataset:


['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']
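The selector can also be used the other way around, to pick everything that is *not* of a given data type. The following is a minimal sketch on a small hypothetical DataFrame (the `toy` frame and its values are made up for illustration; the string columns are explicitly stored with the `object` dtype so that the selector matches them):

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame mixing numerical and string columns
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": pd.Series(["Private", "Private", "Local-gov"], dtype=object),
    "hours-per-week": [40, 50, 40],
    "sex": pd.Series(["Male", "Male", "Female"], dtype=object),
})

# One selector keeps object columns, the other keeps everything else
categorical_selector = selector(dtype_include=object)
numerical_selector = selector(dtype_exclude=object)

toy_categorical_cols = categorical_selector(toy)
toy_numerical_cols = numerical_selector(toy)

print(toy_categorical_cols)
print(toy_numerical_cols)
```

Calling a selector on a DataFrame returns a plain list of column names, in the order in which the columns appear in the frame.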

## Strategies to Encode Categories

### Encoding Ordinal Categories

The most intuitive strategy is to encode each category with a different number using the `OrdinalEncoder`:

In [8]:
from sklearn.preprocessing import OrdinalEncoder

education_column = data_categorical[['education']]

encoder = OrdinalEncoder()
education_encoded = encoder.fit_transform(education_column)
education_encoded

array([[ 1.],
       [11.],
       [ 7.],
       ...,
       [11.],
       [11.],
       [11.]])

We see that each category in `education` has been replaced by a numeric value. We can check the mapping between the categories and the numerical values by inspecting the fitted attribute `categories_`.

In [9]:
encoder.categories_

[array([' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th',
        ' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate',
        ' HS-grad', ' Masters', ' Preschool', ' Prof-school',
        ' Some-college'], dtype=object)]
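The position of a label in `categories_` is the integer it is encoded with. As a small sketch on hypothetical toy data (the labels below have no leading space, unlike those of the census data), the mapping can be made explicit and the codes can be turned back into labels with `inverse_transform`:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical single-column toy data
toy_education = np.array(
    [["HS-grad"], ["Bachelors"], ["HS-grad"], ["Masters"]], dtype=object
)

toy_encoder = OrdinalEncoder()
toy_codes = toy_encoder.fit_transform(toy_education)

# Each category is encoded by its index in the fitted `categories_` array
mapping = {
    category: code
    for code, category in enumerate(toy_encoder.categories_[0])
}

# `inverse_transform` recovers the original labels from the codes
decoded = toy_encoder.inverse_transform(toy_codes)

print(mapping)
print(toy_codes.ravel())
print(decoded.ravel())
```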

Now we can apply the encoder to all categorical features:

In [10]:
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]

array([[ 4.,  1.,  4.,  7.,  3.,  2.,  1., 39.],
       [ 4., 11.,  2.,  5.,  0.,  4.,  1., 39.],
       [ 2.,  7.,  2., 11.,  0.,  4.,  1., 39.],
       [ 4., 15.,  2.,  7.,  0.,  2.,  1., 39.],
       [ 0., 15.,  4.,  0.,  3.,  4.,  0., 39.]])

In [11]:
encoder.categories_

[array([' ?', ' Federal-gov', ' Local-gov', ' Never-worked', ' Private',
        ' Self-emp-inc', ' Self-emp-not-inc', ' State-gov', ' Without-pay'],
       dtype=object),
 array([' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th',
        ' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate',
        ' HS-grad', ' Masters', ' Preschool', ' Prof-school',
        ' Some-college'], dtype=object),
 array([' Divorced', ' Married-AF-spouse', ' Married-civ-spouse',
        ' Married-spouse-absent', ' Never-married', ' Separated',
        ' Widowed'], dtype=object),
 array([' ?', ' Adm-clerical', ' Armed-Forces', ' Craft-repair',
        ' Exec-managerial', ' Farming-fishing', ' Handlers-cleaners',
        ' Machine-op-inspct', ' Other-service', ' Priv-house-serv',
        ' Prof-specialty', ' Protective-serv', ' Sales', ' Tech-support',
        ' Transport-moving'], dtype=object),
 array([' Husband', ' Not-in-family', ' Other-relative', ' Own-child',
        ' Unmarried'

In [12]:
len(encoder.categories_)

8

We see that the categories have been encoded for each feature independently. We also note that the number of features before and after the encoding is the same.

However, be careful when applying this encoding strategy: this integer representation leads downstream predictive models to assume that the values are ordered (0 < 1 < 2 < 3 ...).

By default, `OrdinalEncoder` maps string category labels to integers in lexicographical order, which is arbitrary and often meaningless for the problem at hand. However, it accepts a `categories` constructor argument to pass the categories in the expected order explicitly.

<div class="alert alert-block alert-warning">
As a rule of thumb: If a categorical variable does not carry any meaningful order information then avoid this type of encoding and consider using one-hot encoding instead.</div>
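When a variable does carry a meaningful order, the `categories` argument mentioned above lets us impose it. The sketch below uses a hypothetical subset of education levels and an ordering chosen for illustration only (`education_order` is an assumption, not part of the dataset):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column with four education levels
toy = pd.DataFrame({
    "education": pd.Series(
        ["HS-grad", "Preschool", "Doctorate", "Bachelors"], dtype=object
    )
})

# Assumed ordering, from lowest to highest level (one list per feature)
education_order = ["Preschool", "HS-grad", "Bachelors", "Doctorate"]

ordered_encoder = OrdinalEncoder(categories=[education_order])
ordered_codes = ordered_encoder.fit_transform(toy)

# For comparison: the default encoder sorts the labels alphabetically
default_codes = OrdinalEncoder().fit_transform(toy)

print(ordered_codes.ravel())
print(default_codes.ravel())
```

With the explicit ordering, a larger code corresponds to a higher education level, which is not the case with the default alphabetical ordering.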

### Encoding Nominal Categories

`OneHotEncoder` is an alternative encoder that prevents the downstream models from making a false assumption about the ordering of categories. For a given feature, it will create as many new columns as there are possible categories. For a given sample, the value of the column corresponding to the category will be set to `1` while all the columns of the other categories will be set to `0`.

As in the last section, we will start by encoding a single column to understand how the encoding works.

In [13]:
from sklearn.preprocessing import OneHotEncoder

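# Dense output for easier display. `sparse_output` replaces the `sparse`
# argument, which was deprecated in scikit-learn 1.2 and later removed.
encoder = OneHotEncoder(sparse_output=False)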
education_encoded = encoder.fit_transform(education_column)
education_encoded

array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

<div class="alert alert-block alert-info">
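<b>Note:</b> A dense output (<code>sparse=False</code>, called <code>sparse_output=False</code> in scikit-learn 1.2 and later) is used in the <code>OneHotEncoder</code> for didactic purposes, namely easier visualization of the data.<br>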
Sparse matrices are efficient data structures when most of your matrix elements are zero. They are a kind of compression and are used to save a huge amount of memory.</div>
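To give a rough idea of the memory savings mentioned in the note, here is a minimal sketch that builds a one-hot style matrix by hand with NumPy and stores it in SciPy's CSR format. The sizes (`n_samples`, `n_categories`) are arbitrary assumptions, and the exact byte counts depend on the platform:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_samples, n_categories = 1000, 50  # arbitrary sizes for illustration

# One random category per sample, one-hot encoded in a dense array
codes = rng.integers(0, n_categories, size=n_samples)
dense = np.zeros((n_samples, n_categories))
dense[np.arange(n_samples), codes] = 1.0

# The same matrix in compressed sparse row (CSR) format
sparse_matrix = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = (
    sparse_matrix.data.nbytes
    + sparse_matrix.indices.nbytes
    + sparse_matrix.indptr.nbytes
)

print(f"non-zero entries: {sparse_matrix.nnz} out of {dense.size}")
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```

Only one entry per row is non-zero, so the sparse format stores far fewer values than the dense array.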

We see that encoding a single feature will give a NumPy array full of zeros and ones. We can get a better understanding using the associated feature names resulting from the transformation.

In [15]:
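# `get_feature_names` was replaced by `get_feature_names_out` in
# scikit-learn 1.0 and removed in 1.2
feature_names = encoder.get_feature_names_out(input_features=['education'])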
education_encoded = pd.DataFrame(education_encoded, columns=feature_names)
education_encoded

Unnamed: 0,education_ 10th,education_ 11th,education_ 12th,education_ 1st-4th,education_ 5th-6th,education_ 7th-8th,education_ 9th,education_ Assoc-acdm,education_ Assoc-voc,education_ Bachelors,education_ Doctorate,education_ HS-grad,education_ Masters,education_ Preschool,education_ Prof-school,education_ Some-college
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48838,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
48839,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
48840,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


As we can see, each category (unique value) became a column; the encoding returned, for each sample, a 1 to specify which category it belongs to.  
Let's apply this encoding to the full dataset.

In [16]:
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]

array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.

In [17]:
print(f"The encoded dataset contains {data_encoded.shape[1]} features")

The encoded dataset contains 102 features


Let's wrap this NumPy array in a DataFrame with informative column names as provided by the encoder object:

In [18]:
# `get_feature_names_out` is the current name of `get_feature_names`
columns_encoded = encoder.get_feature_names_out(data_categorical.columns)
pd.DataFrame(data_encoded, columns=columns_encoded).head()

Unnamed: 0,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,education_ 10th,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


We went from 8 features to 102 features in total. Each category of an original feature is now a feature of its own, whose column name is the original feature name followed by the category name as a suffix.

### Choosing an Encoding Strategy

Choosing an encoding strategy will depend on the underlying models and the type of categories (i.e. ordinal vs. nominal).

<div class="alert alert-block alert-info">
<b>Note:</b> In general, <code>OneHotEncoder</code> is the encoding strategy used when the downstream models are linear models, while <code>OrdinalEncoder</code> is often a good strategy with tree-based models.</div>
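The rule of thumb in the note can be illustrated on a deliberately artificial example. In the sketch below, the data, the `training_accuracy` helper and the choice of `DecisionTreeClassifier` as the tree-based model are all assumptions made for illustration. The target is 1 only for the category that receives the middle ordinal code, so a linear model cannot isolate it from a single ordered column, whereas a tree can split the codes twice:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
colors = rng.choice(["blue", "green", "red"], size=200)
toy_X = pd.DataFrame({"color": pd.Series(colors.tolist(), dtype=object)})
# Alphabetical ordinal codes: blue=0, green=1, red=2; target is "green"
toy_y = (colors == "green").astype(int)


def training_accuracy(encoder, classifier):
    """Hypothetical helper: fit encoder + classifier, score on the same data."""
    model = make_pipeline(encoder, classifier)
    return model.fit(toy_X, toy_y).score(toy_X, toy_y)


tree_ordinal_acc = training_accuracy(
    OrdinalEncoder(), DecisionTreeClassifier(random_state=0)
)
linear_ordinal_acc = training_accuracy(OrdinalEncoder(), LogisticRegression())
linear_onehot_acc = training_accuracy(OneHotEncoder(), LogisticRegression())

print(f"tree + ordinal:    {tree_ordinal_acc:.3f}")
print(f"linear + ordinal:  {linear_ordinal_acc:.3f}")
print(f"linear + one-hot:  {linear_onehot_acc:.3f}")
```

The scores are computed on the training data, so they only show what each combination is able to represent, not how well it generalizes.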

## Evaluate Our Predictive Pipeline

We can now integrate this encoder inside a machine learning pipeline like we did with numerical data: let's train a linear classifier on the encoded data and check the generalization performance of this machine learning pipeline using cross-validation.

Before we create the pipeline, we have to take a closer look at the `native-country` column. Let's recall some statistics regarding this column.

In [19]:
data['native-country'].value_counts()

 United-States                 43832
 Mexico                          951
 ?                               857
 Philippines                     295
 Germany                         206
 Puerto-Rico                     184
 Canada                          182
 El-Salvador                     155
 India                           151
 Cuba                            138
 England                         127
 China                           122
 South                           115
 Jamaica                         106
 Italy                           105
 Dominican-Republic              103
 Japan                            92
 Guatemala                        88
 Poland                           87
 Vietnam                          86
 Columbia                         85
 Haiti                            75
 Portugal                         67
 Taiwan                           65
 Iran                             59
 Nicaragua                        49
 Greece                           49
 

The `Holand-Netherlands` category occurs very rarely (only once, as the sorted counts shown earlier indicate). This is a problem during cross-validation: if that sample ends up in the test set after splitting, the encoder will not have seen the category during training and will not be able to encode it.

There are two solutions to bypass this issue:

1. list all possible categories and provide it to the encoder via the keyword argument `categories`;  
2. use the parameter `handle_unknown`.

Here, we will use the latter solution for simplicity. We can now create our machine learning pipeline.

In [20]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

model = make_pipeline(OneHotEncoder(handle_unknown='ignore'), 
                      LogisticRegression(max_iter=500))

<div class="alert alert-block alert-info">
<b>Note:</b> <br>
    Here, we need to increase the maximum number of iterations to obtain a fully converged <em>LogisticRegression</em> and silence a <em>ConvergenceWarning</em>. Contrary to the numerical features, the one-hot encoded categorical features are all on the same scale (values are 0 or 1), so they would not benefit from scaling. In this case, increasing <em>max_iter</em> is the right thing to do.</div>
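The two ways of dealing with a category that is absent from the training data, listed earlier, can be compared on a tiny hypothetical split (the `train` and `test` arrays and the `all_countries` list are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical split: `Holand-Netherlands` only appears in the test data
train = np.array([["United-States"], ["Mexico"], ["Canada"]], dtype=object)
test = np.array([["Mexico"], ["Holand-Netherlands"]], dtype=object)

# Default behaviour: an unknown category raises an error at transform time
try:
    OneHotEncoder().fit(train).transform(test)
    default_failed = False
except ValueError:
    default_failed = True

# Solution 1: list all possible categories up front
all_countries = ["Canada", "Holand-Netherlands", "Mexico", "United-States"]
listed_encoder = OneHotEncoder(categories=[all_countries]).fit(train)
listed_encoded = listed_encoder.transform(test).toarray()

# Solution 2: encode unknown categories as a row of zeros
ignoring_encoder = OneHotEncoder(handle_unknown="ignore").fit(train)
ignored_encoded = ignoring_encoder.transform(test).toarray()

print(default_failed)
print(listed_encoded)
print(ignored_encoded)
```

With `handle_unknown="ignore"`, the unseen category gets no column of its own and the corresponding sample is encoded with zeros only.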

Finally, we can check the model's generalization performance using only the categorical columns:

In [21]:
from sklearn.model_selection import cross_validate
cv_results = cross_validate(model, data_categorical, target)
cv_results

{'fit_time': array([1.4166019 , 1.05931211, 1.08475804, 1.17788887, 1.16256189]),
 'score_time': array([0.05358386, 0.04595089, 0.04472995, 0.04690599, 0.04290414]),
 'test_score': array([0.83222438, 0.83560242, 0.82872645, 0.83312858, 0.83466421])}

In [22]:
scores = cv_results['test_score']
print(f"The accuracy is: {scores.mean():.3f} +/- {scores.std():.3f}")

The accuracy is: 0.833 +/- 0.002


As you can see, this representation of the categorical variables is slightly more predictive of the revenue than the numerical variables that we used previously.