# Categorical encoding

Encode categorical data into numeric data for use in models.  

**Method 1: label encoder**  
Use this for data that is either binary (e.g. yes/no, hot/cold) or relational, i.e. ordinal: the categories have a natural order (e.g. small, medium, large).

If the data is not relational (the categories have no natural order), use one hot encoding (method 2) instead. ML models are based on equations, so they treat the integer codes as magnitudes, and encoding unrelated categories this way can introduce bias. Take countries, for example: if you encode France = 1, Spain = 2, Germany = 3, the model may interpret Germany as having a higher score than Spain or France.  
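
One detail worth checking for ordered data is which integer each category actually receives. The sketch below is illustrative only: the `sizes` values are made up, and `OrdinalEncoder` is a scikit-learn class that this notebook does not otherwise use. It compares `LabelEncoder`, which assigns codes in alphabetical order of the classes, with an encoder that is given the intended order explicitly.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordered feature (not part of the dataset used in this notebook)
sizes = np.array(['small', 'large', 'medium', 'small'], dtype=object)

# LabelEncoder sorts the classes alphabetically: large=0, medium=1, small=2
alphabetical_codes = LabelEncoder().fit_transform(sizes)

# OrdinalEncoder accepts the intended order: small=0, medium=1, large=2
ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered_codes = ordinal_encoder.fit_transform(sizes.reshape(-1, 1)).ravel()

print('Alphabetical codes:', alphabetical_codes)
print('Ordered codes:     ', ordered_codes)
```

With the alphabetical codes, "small" ends up with the largest value, so for ordered features it is probably safer to state the order explicitly.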

**Method 2: one hot encoder**  
Use for the majority of categorical data (any data that does not fit method 1).
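
To make the idea concrete, here is a minimal sketch of what one hot encoding produces for a country column. It uses `pandas.get_dummies` as a quick stand-in, which is an assumption for illustration; the notebook itself uses scikit-learn's `OneHotEncoder`. The small `countries` frame is made up.

```python
import pandas as pd

# Hypothetical mini dataset with a single unordered categorical column
countries = pd.DataFrame({'Country': ['France', 'Spain', 'Germany', 'Spain']})

# One column per category; each row has a 1 in exactly one of them
one_hot = pd.get_dummies(countries, columns=['Country'], dtype=int)

print(one_hot)
```

Each country gets its own 0/1 column, so no country is numerically "larger" than another.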


In [1]:
## ------- Import libraries and data ------- ##

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer


lst = [['France', 44, 72000, 'No'], 
       ['Spain', 27, 54000, 'Yes'],
       ['Germany', 30, 54000, 'No'],
       ['Spain', 38, 61000, 'No'],
       ['Germany', 40, 63777.78, 'Yes'],
       ['France', 35, 58000, 'Yes'],
       ['Spain', 38.77, 28000, 'No'],
       ['France', 48, 798000, 'Yes'],
       ['Germany', 50, 83000, 'No'],
       ['France', 35, 58000, 'Yes']] 

df = pd.DataFrame(lst, columns=['Country', 'Age', 'Salary', 'Purchased'])

# set X as all independent variables
X = df.iloc[:,:-1].values

# set y as dependent variable
y = df.iloc[:,-1].values

## ------- METHOD 1: Label Encoder ------- ##

# Country has no natural order, so you wouldn't label-encode it in practice;
# this only shows how to implement the method:
example_X = X.copy()

# create the label encoder object
labelencoder_X = LabelEncoder()

# fit the encoder to the first column and replace its categories with integer codes
example_X[:, 0] = labelencoder_X.fit_transform(example_X[:, 0])


# DEPENDENT VARIABLE
print('LABEL ENCODER \n')
print('y before Label Encoder: \n\n', y, '\n\n')

# create the label encoder object
labelencoder_y = LabelEncoder()

# fit the encoder to y and replace its categories with integer codes
y = labelencoder_y.fit_transform(y)
y_mapping_vals = dict(zip(labelencoder_y.classes_, labelencoder_y.transform(labelencoder_y.classes_)))

print('y after Label Encoder: \n\n', y, '\n\n')
print('Mapping for label encoder: ', y_mapping_vals)
print('\n','_'*7,'\n')

## ------- METHOD 2: One Hot Encoder ------- ##

print('ONE HOT ENCODER \n')
print('X before One Hot Encoder: \n\n', X, '\n\n')

X2 = X.copy() # for later

# create the transformer: one hot encode column 0 and pass the other columns through unchanged
columntransformer = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')

# fit to data - simple method:
X = columntransformer.fit_transform(X)

print('X after One Hot Encoder: \n\n', X, '\n\n')


# TO ALSO GET FEATURE NAMES

print('To get feature names: \n\n')
print('X2 before One Hot Encoder: \n\n', X2, '\n\n')

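# remainder='drop' keeps only the encoded columns
columntransformer2 = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='drop')
X2 = columntransformer2.fit_transform(X2)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
col_names = columntransformer2.get_feature_names_out()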

print('X2 after One Hot Encoder: \n\n', X2, '\n\n')
print('Feature names :',col_names)




LABEL ENCODER 

y before Label Encoder: 

 ['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes'] 


y after Label Encoder: 

 [0 1 0 0 1 1 0 1 0 1] 


Mapping for label encoder:  {'No': 0, 'Yes': 1}

 _______ 

ONE HOT ENCODER 

X before One Hot Encoder: 

 [['France' 44.0 72000.0]
 ['Spain' 27.0 54000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.78]
 ['France' 35.0 58000.0]
 ['Spain' 38.77 28000.0]
 ['France' 48.0 798000.0]
 ['Germany' 50.0 83000.0]
 ['France' 35.0 58000.0]] 


X after One Hot Encoder: 

 [[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 54000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.78]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77 28000.0]
 [1.0 0.0 0.0 48.0 798000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]] 


To get feature names: 


X2 before One Hot Encoder: 

 [['France' 44.0 72000.0]
 ['Spain' 27.0 54000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 4