## How to Handle Categorical Data?


Real-world data comes in many forms. Some datasets contain categorical data and others do not. Categorical data are features whose values come from a limited set of categories. For example, a feature `gender` can have two categories: `male` and `female`.

In many cases, categorical features have text values, while most ML models accept only numerical inputs. That is why we have to convert these categories into a numerical format that ML algorithms accept.

There are five techniques we will use to encode, or convert, categorical features into numbers:

* Mapping Method
* Ordinal Encoding
* Label Encoding
* Pandas Dummies
* OneHot Encoding

Note that some of these encoding techniques can produce the same output; they differ only in implementation. The first three produce integer codes, while the last two produce a one-hot matrix (of 1s and 0s).
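To make the difference between the two kinds of output concrete, here is a minimal sketch on a made-up column. The `size` column and its values are invented for illustration and are not part of the Titanic data.

```python
import pandas as pd

# Toy column, invented for illustration only
sizes = pd.Series(["small", "large", "medium", "small"], name="size")

# Integer encoding: a single column with one integer per category
# (pandas orders the categories alphabetically here)
integer_codes = sizes.astype("category").cat.codes

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(sizes, dtype=int)

print(integer_codes.tolist())
print(one_hot)
```

The integer version keeps a single column, while the one-hot version grows by one column per category, which matters for features with many categories.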

Let's implement them.

In [2]:
# import the required packages and load the dataset

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# We are going to use the Titanic dataset from seaborn's datasets. It has many categorical features to choose from.

titanic = sns.load_dataset('titanic')

In [2]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [3]:
titanic.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [4]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


You can see from the dataset information that some features, such as `deck` and `class`, have `category` as their data type.
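A quick way to list candidate categorical columns is to select them by dtype. The sketch below uses a small stand-in frame (the column names mimic the Titanic data, but the values are made up) and assumes that `object`, `category`, and `bool` columns are the ones worth inspecting.

```python
import pandas as pd

# Small stand-in frame; values are made up for illustration
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "sex": pd.Series(["male", "female", "female"], dtype=object),
    "alone": [False, False, True],
    "class": pd.Categorical(["Third", "First", "Third"]),
})

# Columns whose dtype suggests categorical content
categorical_cols = df.select_dtypes(include=["object", "category", "bool"]).columns.tolist()

# Number of distinct values in each candidate column
cardinality = df[categorical_cols].nunique()

print(categorical_cols)
print(cardinality)
```

Numeric columns such as `pclass` can also be categorical in meaning, so a dtype check like this is only a starting point.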

Let's peek at some categorical features in our data.

In [5]:
titanic['class'].value_counts()

Third     491
First     216
Second    184
Name: class, dtype: int64

In [9]:
titanic['deck'].value_counts()

C    59
B    47
D    33
E    32
A    15
F    13
G     4
Name: deck, dtype: int64

In [10]:
titanic['sex'].value_counts()

male      577
female    314
Name: sex, dtype: int64

In [11]:
titanic['alone'].value_counts()

True     537
False    354
Name: alone, dtype: int64

## 1. Mapping Method

The mapping method is a straightforward way to encode categorical features that have few categories. Let's apply it to the `class` feature, which has three categories: `Third`, `First`, and `Second`. We create a dictionary whose keys are the categories and whose values are the numbers to encode them as, and then map it onto the dataframe column.

Here is how it is done:

In [12]:
map_dict = {
    'First':0,
    'Second': 1,
    'Third': 2 
}

In [13]:
titanic['class'] = titanic['class'].map(map_dict)

In [15]:
titanic['class'].value_counts()

2    491
0    216
1    184
Name: class, dtype: int64

As you can see, the `class` feature is now encoded. Everywhere the class was `First`, it was replaced with 0. The same thing happened to the other classes.
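One thing to watch with `map`: any category that is missing from the dictionary is silently turned into `NaN`. A minimal sketch on a made-up series (the `"Crew"` value is hypothetical and does not occur in the Titanic data) shows how to check for that:

```python
import pandas as pd

# "Crew" is a made-up extra category, added to show what happens to unmapped values
classes = pd.Series(["First", "Third", "Second", "Crew"])
map_dict = {"First": 0, "Second": 1, "Third": 2}

encoded = classes.map(map_dict)

# Categories that are not keys of the dictionary come back as NaN
unmapped = classes[encoded.isna()].unique().tolist()

print(encoded)
print("Unmapped categories:", unmapped)
```

Checking for unmapped categories right after mapping avoids introducing missing values by accident.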

In [27]:
titanic['age'].isnull().sum()

177

In [26]:
titanic['age'].notnull().sum()

714

In [28]:
titanic['age'].isnull().sum() + titanic['age'].notnull().sum()

891

## 2. Ordinal Encoding

Ordinal encoding also converts categorical data into numbers. Let's implement it with scikit-learn's `OrdinalEncoder`.

In [29]:
titanic['alone'].value_counts()

True     537
False    354
Name: alone, dtype: int64

In [30]:
titanic['alive'].value_counts()

no     549
yes    342
Name: alive, dtype: int64

In [31]:
from sklearn.preprocessing import OrdinalEncoder

categorical_features = titanic[['alone','alive']]
encoder = OrdinalEncoder()
categorical_features_encoded = encoder.fit_transform(categorical_features)

In [32]:
categorical_features_encoded

array([[0., 0.],
       [0., 1.],
       [1., 1.],
       ...,
       [0., 0.],
       [1., 1.],
       [1., 0.]])

The output of the encoder is a NumPy array. We can convert it back to a pandas DataFrame.

In [33]:
titanic[['alone','alive']] = pd.DataFrame(categorical_features_encoded, columns = categorical_features.columns, index=categorical_features.index)

In [34]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,2,man,True,,Southampton,0.0,0.0
1,1,1,female,38.0,1,0,71.2833,C,0,woman,False,C,Cherbourg,1.0,0.0
2,1,3,female,26.0,0,0,7.925,S,2,woman,False,,Southampton,1.0,1.0
3,1,1,female,35.0,1,0,53.1,S,0,woman,False,C,Southampton,1.0,0.0
4,0,3,male,35.0,0,0,8.05,S,2,man,True,,Southampton,0.0,1.0


In [35]:
encoder.categories_

[array([False,  True]), array(['no', 'yes'], dtype=object)]

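__Warning__: Depending on your scikit-learn version, `OrdinalEncoder` either raises an error on missing values or passes them through as `NaN`. Either way, missing values are not encoded for you. Try it on `embarked` and see...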
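One common workaround for missing values is to fill them with a placeholder category before encoding. The sketch below runs on a made-up column; the `"missing"` placeholder and the explicit category order passed through `categories` are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up column with one missing value
df = pd.DataFrame({"embarked": ["S", "C", None, "Q", "S"]}, dtype=object)

# Replace missing values with an explicit placeholder category
filled = df[["embarked"]].fillna("missing")

# Fix the category order ourselves so the codes are predictable
encoder = OrdinalEncoder(categories=[["C", "Q", "S", "missing"]])
codes = encoder.fit_transform(filled)

print(codes.ravel())
```

Passing `categories` also lets you control which integer each category receives instead of relying on alphabetical order.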

## 3. Label Encoding

Label encoding is intended for encoding target features ([per sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html?highlight=label%20encoder#sklearn.preprocessing.LabelEncoder)), but it can also be used for our purpose of encoding categorical features.

Like the ordinal encoder, it is not designed to handle missing values. So, to keep it simple, let's drop all rows with missing values.

In [3]:
titanic_cleaned = titanic.dropna()

In [5]:
titanic_cleaned.isnull().sum()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64

In [6]:
from sklearn.preprocessing import LabelEncoder

deck_features = titanic_cleaned[['deck']]

label_encoder = LabelEncoder()

# LabelEncoder expects 1D input, so pass the single column rather than the DataFrame
deck_features_encoded = label_encoder.fit_transform(deck_features['deck'])


As with the ordinal encoder, the output of the label encoder is a NumPy array.

In [7]:
deck_features_encoded

array([2, 2, 4, 6, 2, 3, 0, 2, 3, 1, 2, 5, 5, 2, 4, 0, 3, 3, 2, 1, 4, 3,
       3, 2, 1, 5, 2, 1, 0, 2, 5, 5, 1, 1, 6, 0, 3, 3, 2, 2, 2, 3, 6, 2,
       1, 4, 1, 2, 2, 2, 3, 1, 3, 2, 1, 2, 2, 4, 2, 1, 2, 4, 2, 3, 1, 2,
       2, 2, 4, 5, 2, 5, 4, 3, 1, 4, 2, 1, 3, 6, 2, 4, 2, 4, 1, 2, 0, 2,
       2, 2, 4, 4, 4, 3, 1, 2, 1, 2, 3, 2, 1, 2, 4, 3, 5, 1, 1, 1, 1, 1,
       2, 2, 0, 4, 2, 4, 4, 2, 0, 4, 1, 3, 0, 2, 5, 3, 3, 3, 0, 1, 1, 3,
       0, 3, 4, 1, 1, 3, 1, 1, 2, 5, 2, 4, 4, 2, 2, 5, 2, 4, 4, 1, 1, 2,
       1, 1, 3, 4, 1, 1, 3, 4, 1, 1, 3, 1, 3, 1, 0, 4, 1, 4, 4, 3, 4, 3,
       0, 3, 1, 2, 1, 2])

In [8]:
# Work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
titanic_cleaned = titanic_cleaned.copy()
titanic_cleaned[['deck']] = pd.DataFrame(deck_features_encoded, columns = deck_features.columns, index= deck_features.index)


In [9]:
titanic_cleaned.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,2,Cherbourg,yes,False
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,2,Southampton,yes,False
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,4,Southampton,no,True
10,1,3,female,4.0,1,1,16.7,S,Third,child,False,6,Southampton,yes,False
11,1,1,female,58.0,0,0,26.55,S,First,woman,False,2,Southampton,yes,True


In [10]:

label_encoder.classes_

array(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype=object)

In [11]:

titanic_cleaned['deck'].value_counts()

2    51
1    43
3    31
4    30
0    12
5    11
6     4
Name: deck, dtype: int64
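Because the label encoder stores the learned categories in `classes_`, the integer codes can be turned back into the original labels with `inverse_transform`. A minimal sketch on a made-up list of deck letters:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up deck letters for illustration
decks = ["C", "A", "B", "C"]

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(decks)  # 1D input, as LabelEncoder expects

# Map the integer codes back to the original labels
restored = label_encoder.inverse_transform(codes)

print(codes)
print(restored)
```

Each code is the position of the label in the sorted `classes_` array, which is why `C` became 2 in the Titanic output above.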

## 4. Pandas Dummies

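This is also a simple way to handle categorical features. It creates one extra 0/1 feature per category. Let's apply it to the feature `who`, which has the categories `child`, `man`, and `woman`. With `drop_first=True`, the first category (`child`) is dropped, because it is implied whenever both remaining columns are 0.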

In [12]:
who_dummy = pd.get_dummies(titanic['who'], drop_first=True)

In [13]:
who_dummy

Unnamed: 0,man,woman
0,1,0
1,0,1
2,0,1
3,0,1
4,1,0
...,...,...
886,1,0
887,0,1
888,0,1
889,1,0


In [14]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [16]:
titanic = pd.concat([titanic.drop('who',axis=1),who_dummy],axis=1)

In [17]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,adult_male,deck,embark_town,alive,alone,man,woman
0,0,3,male,22.0,1,0,7.25,S,Third,True,,Southampton,no,False,1,0
1,1,1,female,38.0,1,0,71.2833,C,First,False,C,Cherbourg,yes,False,0,1
2,1,3,female,26.0,0,0,7.925,S,Third,False,,Southampton,yes,True,0,1
3,1,1,female,35.0,1,0,53.1,S,First,False,C,Southampton,yes,False,0,1
4,0,3,male,35.0,0,0,8.05,S,Third,True,,Southampton,no,True,1,0
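The effect of `drop_first` is easiest to see side by side. This sketch uses a made-up `who` series and passes `dtype=int` so the output is 1s and 0s regardless of the pandas version (newer versions return booleans by default).

```python
import pandas as pd

# Made-up values for illustration
who = pd.Series(["man", "woman", "child", "man"], name="who")

# One column per category
full = pd.get_dummies(who, dtype=int)

# First category (alphabetically, "child") is dropped
reduced = pd.get_dummies(who, drop_first=True, dtype=int)

print(full)
print(reduced)
```

In `reduced`, the `child` row is all zeros, so no information is lost while one redundant column is avoided.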


## 5. One Hot Encoding

This is the last encoding technique on our list. It converts a feature into a one-hot matrix: one additional feature is created for each category, much like pandas dummies.

In [26]:
from sklearn.preprocessing import OneHotEncoder

onehot = OneHotEncoder()

town_encoded = onehot.fit_transform(titanic_cleaned[['embark_town']])

In [27]:
town_encoded

<182x3 sparse matrix of type '<class 'numpy.float64'>'
	with 182 stored elements in Compressed Sparse Row format>

In [22]:
onehot.categories_

[array(['Cherbourg', 'Queenstown', 'Southampton'], dtype=object)]

The output of the one-hot encoder is a sparse matrix. We need to convert it into a `NumPy` array.

In [28]:
town_encoded = town_encoded.toarray()

In [29]:
# categories_ holds one array per encoded feature; take the array for embark_town
columns = onehot.categories_[0]

town_df = pd.DataFrame(town_encoded, columns = columns)

In [30]:
town_df.head()

Unnamed: 0,Cherbourg,Queenstown,Southampton
0,1.0,0.0,0.0
1,0.0,0.0,1.0
2,0.0,0.0,1.0
3,0.0,0.0,1.0
4,0.0,0.0,1.0


In [31]:
len(town_df)

182

In [32]:
len(titanic_cleaned)

182

In [33]:
drop_embark = titanic_cleaned.drop('embark_town',axis=1)

# town_df has a fresh 0..181 index; align it with titanic_cleaned so the rows match on assignment
town_df.index = titanic_cleaned.index
drop_embark[['Cherbourg', 'Queenstown', 'Southampton']] = town_df

In [34]:
drop_embark.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,alive,alone,Cherbourg,Queenstown,Southampton
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,2,yes,False,1.0,0.0,0.0
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,2,yes,False,0.0,0.0,1.0
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,4,no,True,0.0,0.0,1.0
10,1,3,female,4.0,1,1,16.7,S,Third,child,False,6,yes,False,0.0,0.0,1.0
11,1,1,female,58.0,0,0,26.55,S,First,woman,False,2,yes,True,0.0,0.0,1.0
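When joining one-hot columns back onto a frame whose index is not `0..n-1` (as happens after `dropna()`), the encoded frame needs the same index, otherwise pandas aligns rows by label and mismatches them. The sketch below shows one way to do that on a made-up frame; the use of `handle_unknown="ignore"` and the unseen town `"Belfast"` are assumptions added for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame with a non-contiguous index, similar to a frame after dropna()
df = pd.DataFrame(
    {
        "embark_town": ["Cherbourg", "Southampton", "Queenstown"],
        "fare": [71.28, 53.10, 7.75],
    },
    index=[1, 3, 6],
)

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[["embark_town"]]).toarray()

# Reuse the original index so rows line up when concatenating
town_df = pd.DataFrame(encoded, columns=encoder.categories_[0], index=df.index)
result = pd.concat([df.drop("embark_town", axis=1), town_df], axis=1)

# A town that was not present during fit ("Belfast" is hypothetical)
unseen = encoder.transform(pd.DataFrame({"embark_town": ["Belfast"]})).toarray()

print(result)
print(unseen)
```

Keeping the fitted encoder around means the same columns can be produced later for new data, which `pd.get_dummies` does not guarantee.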
