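## LabelEncoder vs OneHotEncoder for transforming categorical data into numeric data
When categorical data is ordinal, e.g. Small < Large < SuperLarge, encoding each category as an integer (label encoding) is a natural choice because the integers can preserve the order. We can use a dictionary with map in pandas or LabelEncoder from scikit-learn. Note that LabelEncoder assigns codes in sorted label order, so an explicit dictionary is the safer option when the order matters.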

In [46]:
import pandas as pd
import numpy as np
data = {'size':['S','SL','S','M','L','M'],
       'group':['Children','Adult','Teen','Teen','Adult','Senior'],
       'eye':['blue','black','brown','brown','blue','black']}
df= pd.DataFrame(data)
df.head()

Unnamed: 0,eye,group,size
0,blue,Children,S
1,black,Adult,SL
2,brown,Teen,S
3,brown,Teen,M
4,blue,Adult,L


In [12]:
# Encode the ordinal size column with an explicit mapping via pandas map
df['size'] = df['size'].map({'S':0,'M':1,'L':2,'SL':3})
df.head()

Unnamed: 0,eye,group,size
0,blue,Children,0
1,black,Adult,3
2,brown,Teen,0
3,brown,Teen,1
4,blue,Adult,2
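
An alternative to a hand-written dictionary is an ordered pandas Categorical, which keeps the order declaration and the encoding in one place. The sketch below is a minimal illustration on the same size values; the names sizes, size_order and size_codes are made up for this example.

```python
import pandas as pd

sizes = pd.Series(['S', 'SL', 'S', 'M', 'L', 'M'])

# Declare the order explicitly: S < M < L < SL
size_order = ['S', 'M', 'L', 'SL']
size_cat = pd.Categorical(sizes, categories=size_order, ordered=True)

# .codes gives the position of each value in size_order (-1 for unknown values)
size_codes = pd.Series(size_cat.codes, index=sizes.index)
print(size_codes.tolist())  # [0, 3, 0, 1, 2, 1]
```

Values that are not listed in size_order get the code -1, which makes unexpected labels easy to detect.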


In [16]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
df['group'] = le.fit_transform(df['group'])
df.head()

Unnamed: 0,eye,group,size
0,blue,1,0
1,black,0,3
2,brown,3,0
3,brown,3,1
4,blue,0,2


In [14]:
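# Recover the original label for the row at index 1
# (inverse_transform expects an array-like, so the single code is wrapped in a list)
le.inverse_transform([df.loc[1,'group']])[0]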

'Adult'
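
To see which integer LabelEncoder assigned to each label, the fitted encoder's classes_ attribute can be inspected. The following is a small sketch on the group values used above; le_demo and mapping are illustrative names, not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

groups = ['Children', 'Adult', 'Teen', 'Teen', 'Adult', 'Senior']
le_demo = LabelEncoder()
codes = le_demo.fit_transform(groups)

# classes_ holds the sorted unique labels; the index of each label is its code
mapping = {str(label): code for code, label in enumerate(le_demo.classes_)}
print(mapping)  # {'Adult': 0, 'Children': 1, 'Senior': 2, 'Teen': 3}

# inverse_transform expects an array-like, so single codes are wrapped in a list
print(le_demo.inverse_transform([0, 3]).tolist())  # ['Adult', 'Teen']
```

The codes follow alphabetical order rather than any meaningful order (Senior comes before Teen), which is why an explicit mapping is preferable for ordinal data.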

## For non-ordinal categorical values, OneHotEncoder provides a solution

    Encode categorical integer features using a one-hot aka one-of-K scheme.
    The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features.
    The output will be a sparse matrix where each column corresponds to one possible value of one feature.
    It is assumed that input features take on values in the range [0, n_values).
    
This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.
Further useful reading on changing data types in pandas: [data type](http://www.ritchieng.com/pandas-changing-datatype/)
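
The description quoted above matches older scikit-learn releases. As a hedged sketch, assuming scikit-learn 0.20 or later, OneHotEncoder can also be fitted on string columns directly, and the fitted categories_ attribute reports the column order. The names eyes, ohe_demo and eye_df are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

eyes = pd.DataFrame({'eye': ['blue', 'black', 'brown', 'brown', 'blue', 'black']})

ohe_demo = OneHotEncoder()  # default output is a SciPy sparse matrix
encoded = ohe_demo.fit_transform(eyes).toarray()

# categories_ lists the column order the encoder chose (sorted labels)
eye_columns = [str(c) for c in ohe_demo.categories_[0]]
eye_df = pd.DataFrame(encoded.astype('int8'), columns=eye_columns)
print(eye_columns)  # ['black', 'blue', 'brown']
```

Reading the column names from categories_ avoids labelling the one-hot columns by hand, which is easy to get wrong.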

In [48]:
data = {'size':['S','SL','S','M','L','M'],
       'group':['Children','Adult','Teen','Teen','Adult','Senior'],
       'eye':['blue','black','brown','brown','blue','black']}
df= pd.DataFrame(data)
# Step 1: Convert all categorical values into numeric with Label encoder
le = LabelEncoder()
df=  df.apply(le.fit_transform)
df.head()

Unnamed: 0,eye,group,size
0,1,1,2
1,0,0,3
2,2,3,2
3,2,3,1
4,1,0,0


In [49]:
df['eye'].values

array([1, 0, 2, 2, 1, 0], dtype=int64)

In [52]:
# step 2: transform with OneHotEncoder
# (newer scikit-learn releases rename this argument to sparse_output)
ohe = OneHotEncoder(sparse=False)
# LabelEncoder sorts labels, so codes 0, 1, 2 correspond to black, blue, brown
tmp=pd.DataFrame(ohe.fit_transform(df[['eye']]),columns=['black','blue','brown'],dtype=np.int8)  # note: indicating data type

In [56]:
# step 3: Combine the result
pd.concat([tmp,df[['group','size']]], axis=1)

Unnamed: 0,black,blue,brown,group,size
0,0,1,0,1,2
1,1,0,0,0,3
2,0,0,1,3,2
3,0,0,1,3,1
4,0,1,0,0,0
5,1,0,0,2,1


In [62]:
# Alternatively, we could transform all categorical attributes with OneHotEncoder after step 1
data = {'size':['S','SL','S','M','L','M'],
       'group':['Children','Adult','Teen','Teen','Adult','Senior'],
       'eye':['blue','black','brown','brown','blue','black']}
df= pd.DataFrame(data)
# Step 1: Convert all categorical values into numeric with Label encoder
le = LabelEncoder()
df=  df.apply(le.fit_transform)
df.head()

Unnamed: 0,eye,group,size
0,1,1,2
1,0,0,3
2,2,3,2
3,2,3,1
4,1,0,0


In [65]:
# step 2:
ohe = OneHotEncoder(sparse=False)
ohe_label=ohe.fit_transform(df)
ohe_label

array([[ 0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.],
       [ 0.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

An extra example is available [here](http://www.ritchieng.com/machinelearning-one-hot-encoding/)

### Pandas categorical transformation solution
Pandas provides the get_dummies function to one-hot encode categorical values directly, which is often the simplest and most efficient option when the data is already in a DataFrame. The columns argument indicates which columns we want to transform.

In [67]:
pd.get_dummies(df, columns=["eye"]).head()

Unnamed: 0,group,size,eye_0,eye_1,eye_2
0,1,2,0,1,0
1,0,3,1,0,0
2,3,2,0,0,1
3,3,1,0,0,1
4,0,0,0,1,0
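
The call above runs on the label-encoded frame, so the new columns are named eye_0, eye_1 and eye_2. As a minimal sketch, get_dummies can also be applied to the original string values, which yields readable column names. The dtype=int argument is included because some pandas versions return booleans by default; the name raw is illustrative.

```python
import pandas as pd

raw = pd.DataFrame({'size': ['S', 'SL', 'S', 'M', 'L', 'M'],
                    'eye': ['blue', 'black', 'brown', 'brown', 'blue', 'black']})

# dtype=int forces 0/1 output; some pandas versions default to booleans
dummies = pd.get_dummies(raw, columns=['eye'], dtype=int)
print(dummies.columns.tolist())
# ['size', 'eye_black', 'eye_blue', 'eye_brown']
```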


### Summary: Techniques to convert categorical data into numeric values
The remainder of this notebook is adapted from [fast ml](http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/) and [this article](http://pbpython.com/categorical-encoding.html). The [Automobile Data Set](http://mlr.cs.umass.edu/ml/index.html) includes a good mix of categorical and continuous values and serves as a useful example that is relatively easy to understand. Since domain understanding is an important aspect of deciding how to encode categorical values, this data set makes a good case study.

In [68]:
import pandas as pd
import numpy as np

# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                  header=None, names=headers, na_values="?" )
df.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


In [69]:
# filter categorical attribute only 
obj_df = df.select_dtypes(include=['object']).copy()
obj_df.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi
1,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi
2,alfa-romero,gas,std,two,hatchback,rwd,front,ohcv,six,mpfi
3,audi,gas,std,four,sedan,fwd,front,ohc,four,mpfi
4,audi,gas,std,four,sedan,4wd,front,ohc,five,mpfi


In [70]:
# Check for rows with missing values
obj_df[obj_df.isnull().any(axis=1)]

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
27,dodge,gas,turbo,,sedan,fwd,front,ohc,four,mpfi
63,mazda,diesel,std,,sedan,fwd,front,ohc,four,idi


In [71]:
# summary for doors
obj_df["num_doors"].value_counts()

four    114
two      89
Name: num_doors, dtype: int64

In [72]:
# fill the missing values with the most frequent value ("four")
obj_df = obj_df.fillna({"num_doors": "four"})
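
Instead of hard-coding "four", the most frequent value can be computed with Series.mode. The sketch below uses a small made-up Series named doors rather than the automobile data.

```python
import numpy as np
import pandas as pd

doors = pd.Series(['two', 'four', np.nan, 'four', 'two', 'four', np.nan])

# mode() ignores NaN and returns the most frequent value(s), sorted; take the first
most_frequent = doors.mode()[0]
doors_filled = doors.fillna(most_frequent)
print(most_frequent)  # four
```

If two values are tied for most frequent, mode() returns both and the first one in sorted order is used here, so the choice is deterministic but arbitrary.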

## Approach #1 - Find and Replace

In [73]:
obj_df["num_cylinders"].value_counts()

four      159
six        24
five       11
eight       5
two         4
three       1
twelve      1
Name: num_cylinders, dtype: int64

In [74]:
# create dict for replacing
cleanup_nums = {"num_doors":     {"four": 4, "two": 2},
                "num_cylinders": {"four": 4, "six": 6, "five": 5, "eight": 8,
                                  "two": 2, "twelve": 12, "three":3 }}

In [75]:
obj_df.replace(cleanup_nums, inplace=True)
obj_df.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi
1,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi
2,alfa-romero,gas,std,2,hatchback,rwd,front,ohcv,6,mpfi
3,audi,gas,std,4,sedan,fwd,front,ohc,4,mpfi
4,audi,gas,std,4,sedan,4wd,front,ohc,5,mpfi
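
One caveat with replace is that labels missing from the dictionary are silently left as strings. A hedged alternative, sketched here on a made-up Series named cylinders, is Series.map, which turns unmapped labels into NaN so they can be checked for explicitly.

```python
import pandas as pd

cylinders = pd.Series(['four', 'six', 'five', 'four', 'twelve'])
word_to_num = {'two': 2, 'three': 3, 'four': 4, 'five': 5,
               'six': 6, 'eight': 8, 'twelve': 12}

# map() returns NaN for any label missing from the dict, so gaps are easy to spot
cyl_num = cylinders.map(word_to_num)
unmapped = cylinders[cyl_num.isnull()].unique().tolist()
assert not unmapped, f"labels without a numeric code: {unmapped}"
print(cyl_num.tolist())  # [4, 6, 5, 4, 12]
```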


## Approach #2 - Label Encoding
For example, the body_style column contains 5 different values. We could choose to encode it like this:

    convertible -> 0
    hardtop -> 1
    hatchback -> 2
    sedan -> 3
    wagon -> 4


In [76]:
obj_df["body_style"] = obj_df["body_style"].astype('category')
obj_df.dtypes

make                 object
fuel_type            object
aspiration           object
num_doors             int64
body_style         category
drive_wheels         object
engine_location      object
engine_type          object
num_cylinders         int64
fuel_system          object
dtype: object

In [77]:
obj_df["body_style_cat"] = obj_df["body_style"].cat.codes
obj_df.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system,body_style_cat
0,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi,0
1,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi,0
2,alfa-romero,gas,std,2,hatchback,rwd,front,ohcv,6,mpfi,2
3,audi,gas,std,4,sedan,fwd,front,ohc,4,mpfi,3
4,audi,gas,std,4,sedan,4wd,front,ohc,5,mpfi,3
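
The integer codes are only useful if they can be traced back to labels. As a small sketch on a made-up Series named body, the cat.categories attribute gives that lookup, since each code is an index into the categories.

```python
import pandas as pd

body = pd.Series(['convertible', 'hatchback', 'sedan', 'wagon', 'hardtop', 'sedan'],
                 dtype='category')

# Categories are sorted by default, and each code is an index into them
code_to_label = dict(enumerate(body.cat.categories))
print(code_to_label)
# {0: 'convertible', 1: 'hardtop', 2: 'hatchback', 3: 'sedan', 4: 'wagon'}
print(body.cat.codes.tolist())  # [0, 2, 3, 4, 1, 3]
```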


## Approach #3 - One Hot Encoding
For example, the column drive_wheels has the values 4wd, fwd or rwd. By using get_dummies we can convert it into three columns, each holding a 1 or 0 depending on whether the row has that value:

In [78]:
pd.get_dummies(obj_df, columns=["drive_wheels"]).head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,engine_location,engine_type,num_cylinders,fuel_system,body_style_cat,drive_wheels_4wd,drive_wheels_fwd,drive_wheels_rwd
0,alfa-romero,gas,std,2,convertible,front,dohc,4,mpfi,0,0,0,1
1,alfa-romero,gas,std,2,convertible,front,dohc,4,mpfi,0,0,0,1
2,alfa-romero,gas,std,2,hatchback,front,ohcv,6,mpfi,2,0,0,1
3,audi,gas,std,4,sedan,front,ohc,4,mpfi,3,0,1,0
4,audi,gas,std,4,sedan,front,ohc,5,mpfi,3,1,0,0


The new data set contains three new columns:

    drive_wheels_4wd
    drive_wheels_rwd
    drive_wheels_fwd

This function is powerful because you can pass as many category columns as you would like and choose how to label the new columns using prefix. Proper naming will make the rest of the analysis just a little bit easier.

In [79]:
pd.get_dummies(obj_df, columns=["body_style", "drive_wheels"], prefix=["body", "drive"]).head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,engine_location,engine_type,num_cylinders,fuel_system,body_style_cat,body_convertible,body_hardtop,body_hatchback,body_sedan,body_wagon,drive_4wd,drive_fwd,drive_rwd
0,alfa-romero,gas,std,2,front,dohc,4,mpfi,0,1,0,0,0,0,0,0,1
1,alfa-romero,gas,std,2,front,dohc,4,mpfi,0,1,0,0,0,0,0,0,1
2,alfa-romero,gas,std,2,front,ohcv,6,mpfi,2,0,0,1,0,0,0,0,1
3,audi,gas,std,4,front,ohc,4,mpfi,3,0,0,0,1,0,0,1,0
4,audi,gas,std,4,front,ohc,5,mpfi,3,0,0,0,1,0,1,0,0


## Approach #4 - Custom Binary Encoding
You may be able to use some combination of label encoding and one hot encoding to create a binary column that meets your needs for further analysis.

In this particular data set, there is a column called engine_type that contains several different values:

In [80]:
obj_df["engine_type"].value_counts()

ohc      148
ohcf      15
ohcv      13
dohc      12
l         12
rotor      4
dohcv      1
Name: engine_type, dtype: int64

For the sake of discussion, maybe all we care about is whether or not the engine is an Overhead Cam (OHC). In other words, the various versions of OHC are all the same for this analysis. If this is the case, then we could use the str accessor plus np.where to create a new column that indicates whether or not the car has an OHC engine.

In [83]:
obj_df.head(4)

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system,body_style_cat
0,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi,0
1,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi,0
2,alfa-romero,gas,std,2,hatchback,rwd,front,ohcv,6,mpfi,2
3,audi,gas,std,4,sedan,fwd,front,ohc,4,mpfi,3


In [90]:
np.where(obj_df["engine_type"].str.contains("ohc"), 1, 0)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [92]:
obj_df["OHC_Code"] = np.where(obj_df["engine_type"].str.contains("ohc"), 1, 0)
obj_df[["make", "engine_type", "OHC_Code"]].head()

Unnamed: 0,make,engine_type,OHC_Code
0,alfa-romero,dohc,1
1,alfa-romero,dohc,1
2,alfa-romero,ohcv,1
3,audi,ohc,1
4,audi,ohc,1
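
The engine_type column has no missing values, but on a column that does, str.contains returns NaN for the missing rows. A minimal sketch of a variant that handles this, using a made-up Series named engines, passes na=False and casts the boolean result to integers:

```python
import numpy as np
import pandas as pd

engines = pd.Series(['dohc', 'ohcv', 'rotor', 'l', np.nan, 'ohc'])

# na=False treats missing engine types as "not OHC" instead of propagating NaN
is_ohc = engines.str.contains('ohc', na=False).astype(int)
print(is_ohc.tolist())  # [1, 1, 0, 0, 0, 1]
```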


Scikit-learn also supports binary encoding by using the LabelBinarizer. We use a similar process as above to transform the data but the process of creating a pandas DataFrame adds a couple of extra steps.

In [86]:
from sklearn.preprocessing import LabelBinarizer

lb_style = LabelBinarizer()
lb_results = lb_style.fit_transform(obj_df["body_style"])
pd.DataFrame(lb_results, columns=lb_style.classes_).head()

Unnamed: 0,convertible,hardtop,hatchback,sedan,wagon
0,1,0,0,0,0
1,1,0,0,0,0
2,0,0,1,0,0
3,0,0,0,1,0
4,0,0,0,1,0
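
To attach the binarized columns to the original frame, the new DataFrame needs the same index as the source, otherwise concat aligns on mismatched labels and fills with NaN. The sketch below shows this on a small made-up frame named cars; it is one possible way to do the join, not the notebook's own step.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

cars = pd.DataFrame({'make': ['audi', 'bmw', 'audi', 'volvo'],
                     'body_style': ['sedan', 'sedan', 'wagon', 'hatchback']},
                    index=[10, 11, 12, 13])  # deliberately not 0..n-1

lb_demo = LabelBinarizer()
binary = lb_demo.fit_transform(cars['body_style'])

# Reuse the original index so concat aligns rows instead of producing NaN
binary_df = pd.DataFrame(binary, columns=lb_demo.classes_, index=cars.index)
combined = pd.concat([cars, binary_df], axis=1)
print(combined.shape)  # (4, 5)
```

Note that with only two distinct labels LabelBinarizer produces a single column, so the columns=lb_demo.classes_ shortcut applies to three or more classes.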


## Advanced Approaches

There are even more advanced algorithms for categorical encoding. I do not have a lot of personal experience with them but for the sake of rounding out this guide, I wanted to include them. This [article](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/) provides some additional technical background. The other nice aspect is that the author of the article has created a scikit-learn contrib package called [categorical-encoding](http://contrib.scikit-learn.org/categorical-encoding/) which implements many of these approaches. It is a very nice tool for approaching this problem from a different perspective.

The library supports several other types of encoding. The first example in the source article is Backward Difference encoding; for that technique, read [this](http://pbpython.com/categorical-encoding.html).

## Using the DictVectorizer
Credit: [fast ml](http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/)

In [None]:
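from sklearn.feature_extraction import DictVectorizer as DV

# x_cat_train and x_cat_test are assumed to be lists of dicts (one dict per row),
# e.g. built with DataFrame.to_dict(orient='records'); they are not defined in this notebook
vectorizer = DV( sparse = False )
# learn the feature names from the training rows, then encode them
vec_x_cat_train = vectorizer.fit_transform( x_cat_train )
# reuse the fitted mapping so the test columns match the training columns
vec_x_cat_test = vectorizer.transform( x_cat_test )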
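
Since the cell above relies on variables that are not defined here, the following self-contained sketch shows the same fit and transform pattern on hypothetical toy rows. The names train_rows, test_rows and dv are made up for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy rows; each row is a dict of column name -> value
train_rows = [{'fuel': 'gas', 'doors': 2}, {'fuel': 'diesel', 'doors': 4}]
test_rows = [{'fuel': 'electric', 'doors': 4}]  # 'electric' was never seen in training

dv = DictVectorizer(sparse=False)
train_matrix = dv.fit_transform(train_rows)
test_matrix = dv.transform(test_rows)

print(dv.feature_names_)     # ['doors', 'fuel=diesel', 'fuel=gas']
print(test_matrix.tolist())  # [[4.0, 0.0, 0.0]]
```

String values are one-hot encoded as name=value columns while numeric values pass through unchanged, and a category unseen during fitting is simply ignored at transform time.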