## Handling categorical data

<img src="images/types_data.jpg" width="60%"/>

### **Nominal variable** - has no implied order, e.g. eye color: black, blue, green, brown. There is no ranking or ordering among the colors.

### Other examples: a person’s gender, marital status, hometown, or the types of movies...


### **Ordinal categorical** - the categories have a natural ordering.
### Examples: size (small, medium, large), exam grades, movie ratings...

### **Discrete data** represent items that can be counted.

### **Continuous data** can take infinitely many possible values within a range.
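As a small illustration of the nominal/ordinal distinction, pandas can store a column as a categorical type with or without an order. This is only a sketch; the variable names and sample values are made up for illustration.

```python
import pandas as pd

# Nominal: the categories carry no order (sample values are illustrative)
eye_color = pd.Categorical(['blue', 'brown', 'green', 'brown'])

# Ordinal: the categories are listed in their natural order
size = pd.Categorical(['small', 'large', 'medium'],
                      categories=['small', 'medium', 'large'],
                      ordered=True)

print(eye_color.ordered)        # False
print(size.ordered)             # True
print(size.min(), size.max())   # small large
print(size.codes.tolist())      # [0, 2, 1]
```

Only the ordered categorical supports comparisons such as `min` and `max`, which mirrors the definitions above.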

In [15]:
import pandas as pd

In [16]:
df = pd.DataFrame([['green','L',10.30,'class1'],['red','M',12.30,'class2'],
                   ['blue','XL',18.30,'class3']])
                   
df.columns = ['Color','Size','Price','Target']   
df

|   | Color | Size | Price | Target |
|---|-------|------|-------|--------|
| 0 | green | L | 10.3 | class1 |
| 1 | red | M | 12.3 | class2 |
| 2 | blue | XL | 18.3 | class3 |


### XL > L > M makes sense, so Size is ordinal data

### green > red > blue: no comparison is possible, i.e. it does not make any sense, so Color is nominal data

In [17]:
size_mapping = {'M':1,'L':2,'XL':3}

In [18]:
df['Size'] = df['Size'].map(size_mapping)

In [19]:
df

|   | Color | Size | Price | Target |
|---|-------|------|-------|--------|
| 0 | green | 2 | 10.3 | class1 |
| 1 | red | 1 | 12.3 | class2 |
| 2 | blue | 3 | 18.3 | class3 |
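If the original size labels are needed again later, the mapping can be inverted. A minimal sketch, assuming the same `size_mapping` dictionary as above; `inv_size_mapping` and the small `encoded` series are names introduced here for illustration.

```python
import pandas as pd

size_mapping = {'M': 1, 'L': 2, 'XL': 3}

# Invert the dictionary: integer code -> original label
inv_size_mapping = {code: label for label, code in size_mapping.items()}

encoded = pd.Series([2, 1, 3])            # the encoded Size values shown above
decoded = encoded.map(inv_size_mapping)
print(decoded.tolist())                   # ['L', 'M', 'XL']
```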


In [20]:
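# dtype=int keeps the indicator columns as 0/1 (recent pandas versions default to booleans)
df1 = pd.get_dummies(df[['Color','Size','Price']], dtype=int)

### This is also called one-hot encoding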

In [24]:
#label encoding
df1['target']=df.Target.map({'class1':0,'class2':1,'class3':2})

In [23]:
df1

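|   | Size | Price | Color_blue | Color_green | Color_red | target |
|---|------|-------|------------|-------------|-----------|--------|
| 0 | 2 | 10.3 | 0 | 1 | 0 | 0 |
| 1 | 1 | 12.3 | 0 | 0 | 1 | 1 |
| 2 | 3 | 18.3 | 1 | 0 | 0 | 2 |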


### Label Encoding 

In [8]:
from sklearn.preprocessing import LabelEncoder

In [9]:
labels = ['setosa','versicolor','virginica']

In [11]:
labels

['setosa', 'versicolor', 'virginica']

In [10]:
encoder = LabelEncoder()

encoder.fit_transform(labels)


array([0, 1, 2], dtype=int64)

In [12]:
more_labels = ['versicolor','setosa','virginica','setosa','versicolor']

In [13]:
new_labels = encoder.transform(more_labels)

In [14]:
new_labels

array([1, 0, 2, 0, 1], dtype=int64)
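To see which integer belongs to which label, and to go back from integers to labels, the fitted encoder exposes `classes_` and `inverse_transform`. A short sketch using the same iris labels as above:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(['setosa', 'versicolor', 'virginica'])

# classes_ holds the sorted unique labels; the position of a label is its integer code
classes = encoder.classes_.tolist()
print(classes)    # ['setosa', 'versicolor', 'virginica']

# inverse_transform maps integer codes back to the original labels
decoded = encoder.inverse_transform([1, 0, 2, 0, 1]).tolist()
print(decoded)    # ['versicolor', 'setosa', 'virginica', 'setosa', 'versicolor']
```

Note that `transform` raises a `ValueError` for a label that was not seen during `fit`.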

## Example: encoding categorical data

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [36]:
#importing dataset
data = pd.read_csv('dataset/Data.csv')

In [37]:
data.head()
#data.T.drop_duplicates().T
#df.drop('Salary1',inplace=True,axis=1)
#del data['Salary1']

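|   | Country | Age | Salary | Purchased |
|---|---------|-----|--------|-----------|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |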


In [38]:
X = data.iloc[:,:-1].values
y = data.iloc[:,3].values
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [39]:
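# Missing data: replace each NaN with the mean of its column
# (sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:,1:3])
X[:,1:3] = imputer.transform(X[:,1:3])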
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
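The filled-in values can be checked by hand: mean imputation replaces each missing entry with the mean of the non-missing values in that column. A quick sketch with the Age and Salary values copied from the array above; `age` and `salary` are names introduced here.

```python
import numpy as np

# Age and Salary columns from the dataset above, with the missing entries as NaN
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37])
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000])

# nanmean ignores NaN entries, which is what the 'mean' strategy computes per column
age_mean = np.nanmean(age)
salary_mean = np.nanmean(salary)
print(round(age_mean, 2))      # 38.78
print(round(salary_mean, 2))   # 63777.78
```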

In [40]:
data.Country.unique()

array(['France', 'Spain', 'Germany'], dtype=object)

In [41]:
# encoding categorical data
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
label_x = LabelEncoder()
X[:,0] = label_x.fit_transform(X[:,0])

In [42]:
X[:5]

array([[0L, 44.0, 72000.0],
       [2L, 27.0, 48000.0],
       [1L, 30.0, 54000.0],
       [2L, 38.0, 61000.0],
       [1L, 40.0, 63777.77777777778]], dtype=object)

### Note:

If a variable has 3 levels (High, Medium and Low), we only need to create 2 dummy variables:
1. High - 1 if high, 0 otherwise
2. Medium - 1 if medium, 0 otherwise

A third variable for Low is not required, because a 0 in both High and Medium already indicates Low. Adding a separate Low variable would only introduce redundancy.
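One way to get this with pandas is `get_dummies(..., drop_first=True)`, which drops the first category. The sketch below is an illustration, not part of the original notebook: the sample values are made up, and the category order is set explicitly so that `Low` is the level that gets dropped.

```python
import pandas as pd

# 'Low' is listed first so that drop_first removes it (sample values are illustrative)
levels = pd.Series(pd.Categorical(
    ['High', 'Low', 'Medium', 'Low'],
    categories=['Low', 'Medium', 'High']))

dummies = pd.get_dummies(levels, drop_first=True, dtype=int)
print(dummies.columns.tolist())   # ['Medium', 'High']
print(dummies.values.tolist())    # [[0, 1], [0, 0], [1, 0], [0, 0]]
```

Rows with 0 in both columns are the `Low` rows, so no information is lost.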

## Label Encoding Vs One Hot Encoding

Label encoding assigns a numerical value to each class. If my column has the values ‘red’, ‘green’ and ‘blue’, it will map them to 0, 1 and 2. The problem with this approach is that there is no ordering among these three classes, yet our algorithm might treat them as ordered, e.g. 0 < 1 < 2 implying ‘red’ < ‘green’ < ‘blue’. This doesn’t make sense, right?

So, in this case, I’d rather go for one-hot encoding. Here I get three columns, and the presence of a class is represented by 1 and its absence by 0. Because the three classes are separated into three different columns (features), the algorithm only sees their presence or absence and makes no assumptions about how they relate to each other.
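A small numeric sketch of the difference, using hand-written encodings chosen for illustration (not produced by any encoder): with integer labels some classes look further apart than others, while with one-hot vectors every pair of classes is equally far apart.

```python
import numpy as np

# Label encoding: one integer per class (codes chosen for illustration)
label = {'red': 0, 'green': 1, 'blue': 2}

# One-hot encoding: one indicator column per class
onehot = {'red': np.array([1, 0, 0]),
          'green': np.array([0, 1, 0]),
          'blue': np.array([0, 0, 1])}

for a, b in [('red', 'green'), ('red', 'blue'), ('green', 'blue')]:
    d_label = abs(label[a] - label[b])                  # distance between integer codes
    d_onehot = np.linalg.norm(onehot[a] - onehot[b])    # Euclidean distance between vectors
    print(a, b, d_label, round(float(d_onehot), 3))
# red green 1 1.414
# red blue 2 1.414
# green blue 1 1.414
```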

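In [43]:
# OneHotEncoder no longer accepts categorical_features (removed in scikit-learn 0.22);
# ColumnTransformer applies the encoder to column 0 and passes the other columns through
from sklearn.compose import ColumnTransformer
onehotencoder = ColumnTransformer([('country', OneHotEncoder(), [0])],
                                  remainder='passthrough', sparse_threshold=0)

In [44]:
# sparse_threshold=0 makes the result a dense array, so no .toarray() call is needed
X = onehotencoder.fit_transform(X)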

In [45]:
X=X.astype(int)

In [46]:
X

array([[    1,     0,     0,    44, 72000],
       [    0,     0,     1,    27, 48000],
       [    0,     1,     0,    30, 54000],
       [    0,     0,     1,    38, 61000],
       [    0,     1,     0,    40, 63777],
       [    1,     0,     0,    35, 58000],
       [    0,     0,     1,    38, 52000],
       [    1,     0,     0,    48, 79000],
       [    0,     1,     0,    50, 83000],
       [    1,     0,     0,    37, 67000]])

In [47]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'], dtype=object)

In [48]:
label_y = LabelEncoder()
y = label_y.fit_transform(y)
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1], dtype=int64)