#Encoding Categorical Features


Features come in several kinds. Typically we distinguish between numerical (continuous) and categorical (discrete) features. Categorical features can be further divided into ordinal features, whose values have a natural order, and nominal features, whose values do not.

Most implementations of machine learning algorithms require numerical data as input, so we have to prepare our data accordingly. 


##Sample Data

Let's first create a simple dataset that describes a group of people and contains all three kinds of features:
* Numerical Variable: income
* Nominal Variable: hair color
* Ordinal Variable: level of educational experience

In [36]:
import pandas as pd
df = pd.DataFrame({'income':[20000, 40000,60000], 
                   'hair color':['black', 'blonde', 'grey'], 
                   'education':['high school graduate', 'some college', 'college graduate']})
df

| | education | hair color | income |
|---|---|---|---|
| 0 | high school graduate | black | 20000 |
| 1 | some college | blonde | 40000 |
| 2 | college graduate | grey | 60000 |
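
Before encoding anything, it can help to check which columns actually need it. The following is a minimal sketch, assuming the same sample data as above; the variable names numeric_cols and to_encode are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'income': [20000, 40000, 60000],
                   'hair color': ['black', 'blonde', 'grey'],
                   'education': ['high school graduate', 'some college', 'college graduate']})

# columns that are already numeric can be passed to a model as they are
numeric_cols = df.select_dtypes(include='number').columns.tolist()

# everything else holds strings and has to be encoded first
to_encode = df.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(to_encode)
```

Here only income is numeric, so hair color and education are the two columns handled in the sections below.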


##Ordinal Features

Ordinal features are usually treated as numerical variables. To achieve that, we have to make sure that each string is associated with the correct value, so that the numbers preserve the order of the categories. Thus, we first set up an explicit mapping dictionary and then use it to convert the ordinal feature to a numerical one.

In [37]:
#create a copy of df to protect the original df
#(a plain assignment would only create a second name for the same dataframe)
df0 = df.copy()

#mapping
education_mapping = {
           'high school graduate': 1,
           'some college': 2,
           'college graduate': 3}

df0['education'] = df0['education'].map(education_mapping)
df0

| | education | hair color | income |
|---|---|---|---|
| 0 | 1 | black | 20000 |
| 1 | 2 | blonde | 40000 |
| 2 | 3 | grey | 60000 |
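
One thing to watch with a mapping dictionary is that any label missing from it is silently turned into NaN by map. The sketch below illustrates this; the label 'phd' is made up for the example and is deliberately left out of the dictionary.

```python
import pandas as pd

education_mapping = {'high school graduate': 1,
                     'some college': 2,
                     'college graduate': 3}

# 'phd' is a hypothetical label that has no entry in the mapping
education = pd.Series(['some college', 'phd', 'high school graduate'])

encoded = education.map(education_mapping)

# labels without a dictionary entry come back as NaN, so list them explicitly
unmapped = education[encoded.isna()].unique().tolist()
print(unmapped)
```

Checking for unmapped labels right after the conversion is a cheap way to catch typos or new categories before they reach a model.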


##Nominal Features

To represent nominal values in numerical format, we usually convert a nominal variable to a series of dummy variables. In other words, each possible hair color value becomes a feature column of its own, with values 1 or 0.

We can achieve this with either scikit-learn's OneHotEncoder or pandas' get_dummies.
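
Before turning to those two tools, the idea itself can be sketched by hand. This is only an illustration of what a dummy variable is, not how either library implements it; the small hair series (with one repeated value) is made up for the example.

```python
import pandas as pd

hair = pd.Series(['black', 'blonde', 'grey', 'black'], name='hair color')

# one indicator column per distinct value: 1 where the row has that value, else 0
dummies = pd.DataFrame({'hair color = ' + value: (hair == value).astype(int)
                        for value in sorted(hair.unique())})
print(dummies)
```

Every row has exactly one 1, in the column that matches its hair color.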

###scikit - OneHotEncoder

The OneHotEncoder takes a 2-D array of categorical values and turns each distinct value into its own 0/1 column. However, older releases of scikit-learn (before 0.20) only accept integer columns as input. So we use the LabelEncoder first to transform the hair color column to integers and then apply OneHotEncoder.

In [24]:
#create a copy of df to protect the original df
#(a plain assignment would only create a second name for the same dataframe)
df1 = df.copy()

from sklearn.preprocessing import LabelEncoder
color_le = LabelEncoder()
df1['hair color'] = color_le.fit_transform(df1['hair color'])

df1

| | education | hair color | income |
|---|---|---|---|
| 0 | high school graduate | 0 | 20000 |
| 1 | some college | 1 | 40000 |
| 2 | college graduate | 2 | 60000 |


In [25]:
from sklearn.preprocessing import OneHotEncoder
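#return a dense array instead of a sparse matrix
#(on scikit-learn older than 1.2 the parameter is called sparse instead of sparse_output)
ohe = OneHotEncoder(sparse_output=False)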

hair_color_dummies = ohe.fit_transform(df1[['hair color']].values)
hair_color_dummies

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

Finally, merge the hair color dummy variables back to the original dataframe to get the complete dataset.

In [26]:
hair_color_df = pd.DataFrame(hair_color_dummies,
                             columns=['hair_color = black', 'hair_color = blonde', 'hair_color = grey'])
df1 = pd.concat([df1[['education', 'income']], hair_color_df], axis=1)
df1

| | education | income | hair_color = black | hair_color = blonde | hair_color = grey |
|---|---|---|---|---|---|
| 0 | high school graduate | 20000 | 1.0 | 0.0 | 0.0 |
| 1 | some college | 40000 | 0.0 | 1.0 | 0.0 |
| 2 | college graduate | 60000 | 0.0 | 0.0 | 1.0 |
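
In scikit-learn 0.20 and later, OneHotEncoder should also accept string columns directly, so the LabelEncoder step can be skipped. The following is a minimal sketch under that assumption. It calls toarray() on the default sparse result rather than passing a dense-output flag, since the name of that flag differs between versions, and it reads the column labels from the fitted encoder instead of typing them by hand.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'income': [20000, 40000, 60000],
                   'hair color': ['black', 'blonde', 'grey'],
                   'education': ['high school graduate', 'some college', 'college graduate']})

ohe = OneHotEncoder()

# fit_transform returns a sparse matrix by default; toarray() makes it dense
encoded = ohe.fit_transform(df[['hair color']]).toarray()

# categories_ holds the learned labels, one array per input column
columns = ['hair_color = ' + str(c) for c in ohe.categories_[0]]
hair_color_df = pd.DataFrame(encoded, columns=columns, index=df.index)

# put the dummy columns next to the untouched columns
result = pd.concat([df[['education', 'income']], hair_color_df], axis=1)
print(result)
```

Taking the names from categories_ keeps the column labels in sync with the encoder if the set of hair colors changes.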


###pandas - get_dummies

pandas comes with a convenience function, get_dummies, that creates dummy variables for nominal features. Note that the function automatically converts ALL string-valued columns to dummy variables, including ordinal ones. To avoid this, we first need to convert ordinal features to numerical features, then let get_dummies handle the rest.

In [39]:
#start with df0 (ordinal education column has already been converted to numerical features in previous step) 
df2 = pd.get_dummies(df0)
df2

| | education | income | hair color_black | hair color_blonde | hair color_grey |
|---|---|---|---|---|---|
| 0 | 1 | 20000 | 1.0 | 0.0 | 0.0 |
| 1 | 2 | 40000 | 0.0 | 1.0 | 0.0 |
| 2 | 3 | 60000 | 0.0 | 0.0 | 1.0 |
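
If the ordinal column has not been converted yet, get_dummies can also be restricted to specific columns through its columns parameter. The sketch below uses the untouched sample data again; the names df3 and df4 are only illustrative. It also shows drop_first, which drops one dummy per feature because that level is already implied when all the remaining dummies are 0.

```python
import pandas as pd

df = pd.DataFrame({'income': [20000, 40000, 60000],
                   'hair color': ['black', 'blonde', 'grey'],
                   'education': ['high school graduate', 'some college', 'college graduate']})

# encode only the nominal column; education keeps its original strings
# dtype=int asks for 0/1 integers instead of the default dummy dtype
df3 = pd.get_dummies(df, columns=['hair color'], dtype=int)
print(df3)

# drop_first=True removes the first level ('black'), leaving two dummy columns
df4 = pd.get_dummies(df, columns=['hair color'], drop_first=True, dtype=int)
print(df4)
```

Dropping one level is mainly useful for models that are sensitive to perfectly correlated inputs, such as linear regression without regularization.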
