# Encoding of Categorical Features

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

Consider the following imaginary dataset with two categorical features and one numerical feature.

Note that the categorical feature `size` is **ordinal**: its values have a natural order, **small < medium < large**. The values of the column `country` have no such order.

Categorical features in which the values do not have an inherent order are called **nominal** features.
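
The ordinal/nominal distinction can be made explicit in pandas with an ordered categorical dtype. The following is a minimal sketch on hypothetical example values (the names `sizes`, `size_dtype` and `ordered_sizes` are illustrative and are not used elsewhere in this notebook):

```python
import pandas as pd

# Hypothetical example values, not the dataset used in this notebook
sizes = pd.Series(['medium', 'large', 'small'])

# Declare the order explicitly: small < medium < large
size_dtype = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
ordered_sizes = sizes.astype(size_dtype)

# Comparisons, sorting and min/max now respect the declared order
is_above_small = ordered_sizes > 'small'
sorted_sizes = ordered_sizes.sort_values().tolist()
smallest = ordered_sizes.min()
```

A nominal feature such as `country` would be declared without `ordered=True`, in which case order comparisons like `>` are not defined.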

In [2]:
df = pd.DataFrame({'country': ['australia', 'germany', 'russia','korea','germany', np.nan], 
                   'size': ['medium', 'large', 'small', 'medium', 'medium', 'small'],
                   'score': [1.4, 2.8, 4.8, 5.2, 1.1, 0.5]})
df

| | country | size | score |
|---|---|---|---|
| 0 | australia | medium | 1.4 |
| 1 | germany | large | 2.8 |
| 2 | russia | small | 4.8 |
| 3 | korea | medium | 5.2 |
| 4 | germany | medium | 1.1 |
| 5 | NaN | small | 0.5 |


## Encoding of Ordinal Features

In [3]:
# Create a mapper

scale_mapper = {
    "small": 0,
    "medium": 1,
    "large": 2
}

# Map each label to its rank; labels missing from the mapper would become NaN
df['size'] = df['size'].map(scale_mapper)

df

| | country | size | score |
|---|---|---|---|
| 0 | australia | 1 | 1.4 |
| 1 | germany | 2 | 2.8 |
| 2 | russia | 0 | 4.8 |
| 3 | korea | 1 | 5.2 |
| 4 | germany | 1 | 1.1 |
| 5 | NaN | 0 | 0.5 |
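
One thing to watch with a dictionary-based encoding is labels that the dictionary does not cover. The sketch below uses a hypothetical column containing an extra label `'xl'` (not part of the dataset above) and shows one way to detect such labels with `Series.map`:

```python
import pandas as pd

# Hypothetical column containing a label that is missing from the mapper
sizes = pd.Series(['medium', 'large', 'small', 'xl'])

scale_mapper = {'small': 0, 'medium': 1, 'large': 2}

# map() yields NaN for labels that are not keys of the dictionary
encoded = sizes.map(scale_mapper)

# Collect the labels that the mapper does not cover
unmapped = sizes[encoded.isna()].unique().tolist()
```

Here `unmapped` lists the uncovered labels, so the mapper can be extended before the encoded column is used in a model.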


## One-Hot Encoding of Nominal Features

In [4]:
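# Create one indicator column per country, plus country_nan for missing values (dummy_na=True).
# Optionally pass drop_first=True to drop one country column and avoid linear dependency between the columns.
# dtype=int keeps the indicators as 0/1 (recent pandas versions default to booleans).
df = pd.get_dummies(df, columns=['country'], dummy_na=True, dtype=int)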

df

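| | size | score | country_australia | country_germany | country_korea | country_russia | country_nan |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.4 | 1 | 0 | 0 | 0 | 0 |
| 1 | 2 | 2.8 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 4.8 | 0 | 0 | 0 | 1 | 0 |
| 3 | 1 | 5.2 | 0 | 0 | 1 | 0 | 0 |
| 4 | 1 | 1.1 | 0 | 1 | 0 | 0 | 0 |
| 5 | 0 | 0.5 | 0 | 0 | 0 | 0 | 1 |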
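
To illustrate the effect of `drop_first=True`, here is a minimal sketch on a hypothetical single-column frame `df_demo` holding the same country values (the names `df_demo`, `full` and `reduced` are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical nominal column, including a missing value
df_demo = pd.DataFrame(
    {'country': ['australia', 'germany', 'russia', 'korea', 'germany', np.nan]}
)

# Full encoding: one indicator column per category, plus one for NaN
full = pd.get_dummies(df_demo, columns=['country'], dummy_na=True, dtype=int)

# Reduced encoding: the first category ('australia') is dropped
reduced = pd.get_dummies(
    df_demo, columns=['country'], dummy_na=True, drop_first=True, dtype=int
)

# In the full encoding every row sums to 1, so the columns are linearly dependent
row_sums = full.sum(axis=1).tolist()

# In the reduced encoding the dropped category is represented by an all-zero row
australia_row_sum = int(reduced.iloc[0].sum())
```

With the reduced encoding, `australia` acts as the baseline category. This is mainly relevant for models that are sensitive to collinear inputs, such as unregularized linear regression.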
