### Handling categorical data

Most ML libraries are designed to work with numerical variables, so categorical variables in their original text form can’t be used directly for model building. Let’s look at some common methods of handling categorical data, chosen according to the number of levels in the variable.

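### Create dummy variables

A dummy variable is a Boolean variable that takes the value 1 when a category is present and 0 when it is absent. You should create k-1 dummy variables, where k is the number of levels, because the omitted level is implied whenever all the other dummies are 0. Note that pandas’ get_dummies function creates all k columns by default; passing drop_first=True produces k-1 columns instead.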

In [1]:
import random
random.seed(2017)
import pandas as pd
from patsy import dmatrices

df = pd.DataFrame({'A': ['high', 'medium', 'low'],
                   'B': [10,20,30]},
                    index=[0, 1, 2])
                   
print(df)

        A   B
0    high  10
1  medium  20
2     low  30


In [2]:
# one 0/1 column per level of A; the original column A is dropped
df_with_dummies = pd.get_dummies(df, prefix='A', columns=['A'])

print(df_with_dummies)

    B  A_high  A_low  A_medium
0  10     1.0    0.0       0.0
1  20     0.0    0.0       1.0
2  30     0.0    1.0       0.0
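
The output above has k = 3 dummy columns. As a minimal sketch of the k-1 encoding, assuming a pandas version that supports the drop_first argument of get_dummies, the first level in sorted order can be dropped so that it becomes the implicit baseline. The name df_k_minus_1 is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['high', 'medium', 'low'],
                   'B': [10, 20, 30]},
                  index=[0, 1, 2])

# drop_first=True drops the first level in sorted order ('high' here),
# leaving k-1 = 2 dummy columns; 'high' is implied when both are 0.
df_k_minus_1 = pd.get_dummies(df, prefix='A', columns=['A'], drop_first=True)

print(df_k_minus_1.columns.tolist())
```

Which level is dropped matters only for interpretation: the coefficients of the remaining dummies are then read relative to the dropped baseline level.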


### Convert categories to numeric labels

Another simple method is to represent the text description of each level with a number, using Scikit-learn’s LabelEncoder class. If the number of levels is high (for example, zip code or state), you can apply business logic to combine levels into groups before encoding. For example, zip codes or states can be combined into regions; however, this method risks losing critical information. Another option is to combine categories that occur with similar frequency (the new categories can be high, medium, and low).

In [3]:
import pandas as pd

# using the pandas factorize function, which numbers levels
# in order of first appearance
df['A_pd_factorized'] = pd.factorize(df['A'])[0]

# Alternatively, you can use sklearn's LabelEncoder class, which
# numbers levels in sorted (alphabetical) order
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

df['A_LabelEncoded'] = le.fit_transform(df['A'])
print(df)

        A   B  A_pd_factorized  A_LabelEncoded
0    high  10                0               0
1  medium  20                1               2
2     low  30                2               1
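
The two ways of combining levels described above, by business logic and by frequency, can be sketched as follows. This is only an illustration: the sales data, the state_to_region mapping, and the frequency thresholds (1 row, 2 rows, more than 2 rows) are made-up assumptions and not part of the original example.

```python
import pandas as pd

# Hypothetical data: 'state' stands in for a high-cardinality column.
sales = pd.DataFrame(
    {'state': ['CA', 'NY', 'TX', 'CA', 'WA', 'CA', 'NY', 'FL']})

# 1) Business-logic grouping: an illustrative state-to-region mapping.
state_to_region = {'CA': 'West', 'WA': 'West',
                   'NY': 'East', 'FL': 'East',
                   'TX': 'South'}
sales['region'] = sales['state'].map(state_to_region)

# 2) Frequency-based grouping: look up how often each state occurs,
#    then bucket the counts with arbitrary, assumed thresholds.
counts = sales['state'].map(sales['state'].value_counts())
sales['state_freq_group'] = pd.cut(counts,
                                   bins=[0, 1, 2, float('inf')],
                                   labels=['low', 'medium', 'high'])

print(sales)
```

Either grouped column has far fewer levels than the original, so it can then be passed to get_dummies or a label encoder, at the cost of the information lost within each group.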
