<a name='0'></a>
# How to Handle Categorical Data?

Real-world data comes in many forms. Sometimes you will have to deal with categorical data, and other times not. Categorical data are features whose values come from a limited number of categories. For example, the feature `gender` can have two categories: `male` and `female`.

In many cases, categorical features have text values, while most ML models accept only numerical inputs. That is why we have to convert these categories into a format accepted by ML algorithms.

There are five techniques covered here for encoding categorical features as numbers:

* [Mapping Method](#1)
* [Ordinal Encoding](#2)
* [Label Encoding](#3)
* [Pandas Dummies](#4)
* [One Hot Encoding](#5)

Note that some of these encoding techniques can produce the same output; the difference is only in the implementation. The first three produce numerical codes, while the last two produce a one-hot matrix (with 1s and 0s).
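To make the two output shapes concrete, here is a minimal sketch on a toy column. The series `ports` and the code values are made up for illustration and are not taken from the Titanic data.

```python
import pandas as pd

# Toy column standing in for a categorical feature such as `embarked`.
ports = pd.Series(['S', 'C', 'Q', 'S'], name='embarked')

# Integer encoding: still one column, with one number per category.
as_codes = ports.map({'C': 0, 'Q': 1, 'S': 2})

# One-hot encoding: one 0/1 column per category.
as_one_hot = pd.get_dummies(ports, dtype=int)

print(as_codes.tolist())
print(as_one_hot)
```

Integer codes keep the data compact but suggest an order between categories, while a one-hot matrix avoids that at the cost of extra columns.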

Let's implement them.

In [1]:
# Loading the dataset 

import seaborn as sns
import pandas as pd

We are going to use the Titanic dataset from the seaborn datasets. It has many categorical features to choose from.

In [41]:
titanic = sns.load_dataset('titanic')

In [4]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [6]:
titanic.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [7]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


You can see from the dataset information that some features, like `deck` and `class`, have `category` as their data type.
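Candidate categorical columns can also be listed programmatically. The sketch below uses a hypothetical mini-frame `df` rather than the full dataset, and it assumes that every non-numeric column is worth inspecting.

```python
import pandas as pd

# Hypothetical mini-frame mixing numeric and non-numeric columns.
df = pd.DataFrame({
    'age': [22.0, 38.0, 26.0],
    'sex': ['male', 'female', 'female'],
    'class': pd.Categorical(['Third', 'First', 'Third']),
    'alone': [False, False, True],
})

# Keep every column that is not numeric (text, category, and boolean columns).
candidate_cols = df.select_dtypes(exclude='number').columns.tolist()

# Count the distinct values in each candidate column.
n_unique = df[candidate_cols].nunique()

print(candidate_cols)
print(n_unique)
```

A low number of distinct values is a good hint that a column is categorical.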

Let's peek at some categorical features in our data. 

In [8]:
titanic['sex'].value_counts()

sex
male      577
female    314
Name: count, dtype: int64

In [9]:
titanic['embarked'].value_counts()

embarked
S    644
C    168
Q     77
Name: count, dtype: int64

In [10]:
titanic['class'].value_counts()

class
Third     491
First     216
Second    184
Name: count, dtype: int64

In [11]:
titanic['who'].value_counts()

who
man      537
woman    271
child     83
Name: count, dtype: int64

In [12]:
titanic['adult_male'].value_counts()

adult_male
True     537
False    354
Name: count, dtype: int64

In [13]:
titanic['embark_town'].value_counts()

embark_town
Southampton    644
Cherbourg      168
Queenstown      77
Name: count, dtype: int64

In [14]:
titanic['alone'].value_counts()

alone
True     537
False    354
Name: count, dtype: int64

In [15]:
titanic['deck'].value_counts()

deck
C    59
B    47
D    33
E    32
A    15
F    13
G     4
Name: count, dtype: int64

<a name='1'></a>
## 1. Mapping Method

The mapping method is a straightforward way to encode categorical features with few categories. Let's apply it to the `class` feature, which has three categories: `First, Second, Third`. We create a dictionary whose keys are the categories and whose values are the numbers to encode them as, then map it onto the column.

Here is how it is done:

In [24]:
map_dict = {
    'First':1,
    'Second': 2,
    'Third': 3,
}

In [25]:
titanic['class'] = titanic['class'].map(map_dict)

In [27]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,3,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,1,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,3,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,1,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,3,man,True,,Southampton,no,True


In [28]:
titanic['class'].value_counts()

class
3    491
1    216
2    184
Name: count, dtype: int64

As you can see, the class feature is now encoded. Everywhere the class was `First`, it was replaced with 1. The same thing happened to the other classes.
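One thing to watch with `map`: any category missing from the dictionary becomes `NaN`. The sketch below shows this on a toy series, where the `'Crew'` value is a made-up category added only to trigger the behavior.

```python
import pandas as pd

# 'Crew' is a made-up category that the dictionary does not cover.
classes = pd.Series(['First', 'Third', 'Second', 'Crew'])

map_dict = {'First': 1, 'Second': 2, 'Third': 3}

# Categories absent from map_dict are encoded as NaN.
encoded = classes.map(map_dict)
print(encoded)

# List the original values that were left unmapped.
unmapped = classes[encoded.isna()].unique().tolist()
print(unmapped)
```

Checking for unmapped values right after mapping is a cheap way to catch typos in the dictionary keys.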

<a name='2'></a>
## 2. Ordinal Encoding

This will also convert categorical data into numbers. Let's implement it.

In [30]:
titanic.head(2)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,3,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,1,woman,False,C,Cherbourg,yes,False


In [None]:
# !pip install scikit-learn

In [31]:
from sklearn.preprocessing import OrdinalEncoder

cats_feats = titanic[['alive', 'alone']]

encoder = OrdinalEncoder()

cats_encoded = encoder.fit_transform(cats_feats)

In [32]:
cats_encoded

array([[0., 0.],
       [1., 0.],
       [1., 1.],
       ...,
       [0., 0.],
       [1., 1.],
       [0., 1.]])

The output of the encoder is a NumPy array. We can convert it back to the pandas dataframe. 

In [33]:
titanic[['alive', 'alone']] = pd.DataFrame(cats_encoded, columns=cats_feats.columns, index=cats_feats.index)
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,3,man,True,,Southampton,0.0,0.0
1,1,1,female,38.0,1,0,71.2833,C,1,woman,False,C,Cherbourg,1.0,0.0
2,1,3,female,26.0,0,0,7.925,S,3,woman,False,,Southampton,1.0,1.0
3,1,1,female,35.0,1,0,53.1,S,1,woman,False,C,Southampton,1.0,0.0
4,0,3,male,35.0,0,0,8.05,S,3,man,True,,Southampton,0.0,1.0


In [34]:
encoder.categories_

[array(['no', 'yes'], dtype=object), array([False,  True])]

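**Warning**: Depending on your scikit-learn version, the ordinal encoder may not accept missing values: older releases raise an error, while newer ones pass `NaN` through unchanged. Try it on `embarked` and see what happens.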
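One simple way around missing values is to turn them into an explicit category before encoding. This is a minimal sketch on a toy frame; the placeholder label `'Unknown'` is an arbitrary choice, and imputing the most frequent value would be another reasonable option.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the `embarked` column, including one missing value.
ports = pd.DataFrame({'embarked': ['S', 'C', None, 'Q', 'S']})

# Replace missing entries with an explicit placeholder category.
ports_filled = ports.fillna('Unknown')

encoder = OrdinalEncoder()
ports_encoded = encoder.fit_transform(ports_filled)

# Categories are sorted, and each value is encoded as its position in that list.
print(encoder.categories_[0].tolist())
print(ports_encoded.ravel().tolist())
```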

<a name='3'></a>
## 3. Label Encoding

Label encoding is intended for encoding target labels [(per sklearn documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html?highlight=label%20encoder#sklearn.preprocessing.LabelEncoder), but it can also be used to encode a single categorical feature.

Missing values complicate label encoding as well, so to keep things simple, let's drop all rows with missing values.

In [42]:
titanic = sns.load_dataset('titanic')

titanic_cleaned = titanic.dropna()

In [43]:
titanic_cleaned.isnull().sum()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64

In [46]:
from sklearn.preprocessing import LabelEncoder

deck_feat = titanic_cleaned[['deck']]

label_encoder = LabelEncoder()

# LabelEncoder expects a 1D input, so pass the single column as a Series
deck_encoded = label_encoder.fit_transform(deck_feat['deck'])


As with the ordinal encoder, the output of the label encoder is a NumPy array.

In [48]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
titanic_cleaned = titanic_cleaned.assign(deck=deck_encoded)

titanic_cleaned.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,2,Cherbourg,yes,False
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,2,Southampton,yes,False
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,4,Southampton,no,True
10,1,3,female,4.0,1,1,16.7,S,Third,child,False,6,Southampton,yes,False
11,1,1,female,58.0,0,0,26.55,S,First,woman,False,2,Southampton,yes,True


In [49]:
label_encoder.classes_

array(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype=object)

In [50]:
titanic_cleaned['deck'].value_counts()

deck
2    51
1    43
3    31
4    30
0    12
5    11
6     4
Name: count, dtype: int64
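The integer codes can be turned back into the original labels with `inverse_transform`. Here is a minimal sketch on a toy list of deck letters; the list `decks` and the helper dictionary `lookup` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

# Toy deck letters, not the full Titanic column.
decks = ['C', 'A', 'B', 'C', 'G']

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(decks)

# classes_ is sorted, and each code is the position of its label in classes_.
lookup = {label: code for code, label in enumerate(label_encoder.classes_.tolist())}

# Recover the original labels from the integer codes.
decoded = label_encoder.inverse_transform(codes)

print(codes.tolist())
print(lookup)
print(decoded.tolist())
```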

<a name='4'></a>
## 4. Pandas Dummies

This is also a simple way to handle categorical features. It creates extra columns based on the available categories. Let's apply it to the feature `sex`. With `drop_first=True`, the first category (`female`) is dropped, leaving a single `male` column.

In [68]:
dummies = pd.get_dummies(titanic['sex'], drop_first=True)

In [74]:
# titanic  = sns.load_dataset('titanic')

In [71]:
titanic = pd.concat([titanic.drop('sex',axis=1),dummies],axis=1)

In [73]:
titanic.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,male
0,0,3,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,True
1,1,1,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,False
2,1,3,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,False
3,1,1,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,False
4,0,3,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,True



In [31]:
# Or you can do it at once with this code

#titanic[['man', 'woman']] = pd.get_dummies(titanic['who'], drop_first=True)
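`pd.get_dummies` can also take the whole dataframe and replace the chosen columns in one call, which removes the need for a separate `drop` and `concat`. This is a minimal sketch on a toy frame named `people`, with `dtype=int` chosen here to get 0/1 instead of booleans.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column.
people = pd.DataFrame({
    'age': [22, 38, 4],
    'who': ['man', 'woman', 'child'],
})

# Encode only the listed columns; new columns are prefixed with the column name.
# drop_first=True drops the first category in sorted order ('child').
encoded = pd.get_dummies(people, columns=['who'], drop_first=True, dtype=int)

print(encoded)
```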

<a name='5'></a>
## 5. One Hot Encoding

This is the last encoding type on our list. It converts a feature into a one-hot matrix: one additional column is created for each category of the feature. It is basically the same as dummies.

In [None]:
from sklearn.preprocessing import OneHotEncoder

one_hot = OneHotEncoder()

town_encoded = one_hot.fit_transform(titanic_cleaned[['embark_town']])

one_hot.categories_

In [78]:
town_encoded

<182x3 sparse matrix of type '<class 'numpy.float64'>'
	with 182 stored elements in Compressed Sparse Row format>

The output of the one-hot encoder is a sparse matrix. We need to convert it into a NumPy array.

In [79]:
town_encoded = town_encoded.toarray()

In [80]:
# categories_ is a list with one array per encoded feature; take the first (and only) one
columns = one_hot.categories_[0]

town_df = pd.DataFrame(town_encoded, columns=columns)

town_df.head()

Unnamed: 0,Cherbourg,Queenstown,Southampton
0,1.0,0.0,0.0
1,0.0,0.0,1.0
2,0.0,0.0,1.0
3,0.0,0.0,1.0
4,0.0,0.0,1.0


In [81]:
len(town_df)

182

In [82]:
len(titanic_cleaned)

182

In [83]:
drop_embark = titanic_cleaned.drop('embark_town',axis=1)

# town_df has a fresh 0..181 index while drop_embark keeps the original row labels,
# so assign the raw values to avoid misaligned rows
drop_embark[['Cherbourg', 'Queenstown', 'Southampton']] = town_df.to_numpy()

In [40]:
drop_embark.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,alive,alone,Cherbourg,Queenstown,Southampton
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,2,yes,False,1.0,0.0,0.0
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,2,yes,False,0.0,0.0,1.0
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,4,no,True,0.0,0.0,1.0
10,1,3,female,4.0,1,1,16.7,S,Third,child,False,6,yes,False,0.0,0.0,1.0
11,1,1,female,58.0,0,0,26.55,S,First,woman,False,2,yes,True,0.0,0.0,1.0
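Because `dropna()` leaves gaps in the row labels, it matters how the encoded columns are lined up with the original rows. One option is to give the encoded dataframe the same index as the source before joining. The sketch below uses a toy frame `df` with a deliberately non-contiguous index; the `fare` values are made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame with a non-contiguous index, similar to a frame after dropna().
df = pd.DataFrame(
    {
        'fare': [71.3, 53.1, 8.5],
        'embark_town': ['Cherbourg', 'Southampton', 'Queenstown'],
    },
    index=[1, 3, 6],
)

one_hot = OneHotEncoder()
encoded = one_hot.fit_transform(df[['embark_town']]).toarray()

# Reuse the source index so each encoded row stays attached to its original row.
town_df = pd.DataFrame(encoded, columns=one_hot.categories_[0], index=df.index)

# join() aligns on the index, which now matches on both sides.
result = df.drop('embark_town', axis=1).join(town_df)

print(result)
```

With matching indexes, no row ends up with `NaN` or with another passenger's town.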


Hopefully these techniques will help you to handle all kinds of categorical features.

[Back to top!](#0)