# Encoding

## When and Why is Encoding Used?
Encoding is needed for some machine learning models when you have **categorical** predictors (e.g., *color*: red, yellow, blue).

Some machine learning algorithms (e.g., decision trees in R) can work directly with categorical data, but most others cannot.

The machine learning algorithms that cannot use categorical data require that the categorical data be transformed to numeric data.




## How Do We Transform Categorical Data Into Numeric?
Two common approaches are:
1. Integer Encoding
2. One-Hot Encoding

### Integer Encoding
This requires simply assigning an integer value to each of the categories in the categorical data.

For example:

*   Small = 1
*   Medium = 2
*   Large = 3

This works when the categories have a natural ordering (as is the case with small, medium, and large). However, when the categories have no natural ordering (e.g., blue, yellow, green), integer codes would imply an order that does not exist, so one-hot encoding should be used instead.
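
As a minimal sketch of the size example above, the ordering can be declared explicitly with a pandas ordered categorical so that the integer codes follow it. The `sizes` data and the variable names here are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical example data: sizes with a natural order
sizes = pd.Series(["Medium", "Small", "Large", "Small"])

# Declare the order explicitly so the integer codes follow it
size_order = ["Small", "Medium", "Large"]
ordered = pd.Categorical(sizes, categories=size_order, ordered=True)

# .codes are zero-based positions in size_order;
# add 1 to match Small = 1, Medium = 2, Large = 3
size_codes = pd.Series(ordered.codes, index=sizes.index) + 1
print(size_codes.tolist())  # [2, 1, 3, 1]
```

Stating the order in one list keeps the mapping visible and avoids relying on whatever order an encoder happens to pick.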




### One-Hot Encoding
One-hot encoding involves creating a new binary variable for each category.

For example, say we have this original dataset:

Bird Species | Color
-------------|------
Blue Jay     | Blue
Warbler      | Yellow
Mountain Bird| Blue
Parrot       | Green

We would one-hot encode these data as follows:

Bird Species | Blue | Yellow | Green
-------------|------|--------|------
Blue Jay     | 1    | 0      | 0
Warbler      | 0    | 1      | 0
Mountain Bird| 1    | 0      | 0
Parrot       | 0    | 0      | 1
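
The bird table can be reproduced with a short pandas sketch. It assumes `pd.get_dummies` with `dtype=int` to get 0/1 values rather than booleans; the variable names `birds` and `encoded` are illustrative only.

```python
import pandas as pd

birds = pd.DataFrame({
    "Bird Species": ["Blue Jay", "Warbler", "Mountain Bird", "Parrot"],
    "Color": ["Blue", "Yellow", "Blue", "Green"],
})

# One new column per distinct color; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(birds, columns=["Color"], prefix="", prefix_sep="", dtype=int)

# Reorder the columns to match the table above
encoded = encoded[["Bird Species", "Blue", "Yellow", "Green"]]
print(encoded)
```

Each row has exactly one 1 among the color columns, which is where the name "one-hot" comes from.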



## Example

In [None]:
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd

### Integer Encoding

In [None]:
# Create dataframe
heights = {'Name': ['Michael', 'Jim', 'Pam', 'Dwight', 'Kelly'],
           'Age': [45, 35, 33, 46, 28],
           'Height': ['Average', 'Tall', 'Short', 'Average', 'Short']}
heights_df = pd.DataFrame(heights)
heights_df

Index | Name    | Age | Height
------|---------|-----|--------
0     | Michael | 45  | Average
1     | Jim     | 35  | Tall
2     | Pam     | 33  | Short
3     | Dwight  | 46  | Average
4     | Kelly   | 28  | Short


In [None]:
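# Create an instance of LabelEncoder
labelencoder = LabelEncoder()

# Assign integer labels and store them in a new column.
# Note: LabelEncoder numbers the categories alphabetically (Average=0, Short=1, Tall=2),
# so the codes do not follow the natural Short < Average < Tall order.
heights_df['Height Label'] = labelencoder.fit_transform(heights_df['Height'])
heights_df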

Index | Name    | Age | Height  | Height Label
------|---------|-----|---------|-------------
0     | Michael | 45  | Average | 0
1     | Jim     | 35  | Tall    | 2
2     | Pam     | 33  | Short   | 1
3     | Dwight  | 46  | Average | 0
4     | Kelly   | 28  | Short   | 1
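
In the output above, Average is coded 0, Short 1, and Tall 2, which is alphabetical rather than ordered by height. One possible way to control the order within scikit-learn is `OrdinalEncoder` with an explicit `categories` list. The sketch below rebuilds a small stand-alone frame (the names `heights` and `Height Ordinal` are illustrative, not part of the notebook).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

heights = pd.DataFrame({"Height": ["Average", "Tall", "Short", "Average", "Short"]})

# LabelEncoder sorts the classes, so its codes follow alphabetical order
label_encoder = LabelEncoder().fit(heights["Height"])
print(list(label_encoder.classes_))  # ['Average', 'Short', 'Tall']

# OrdinalEncoder accepts an explicit category order for each feature
ordinal_encoder = OrdinalEncoder(categories=[["Short", "Average", "Tall"]])
heights["Height Ordinal"] = (
    ordinal_encoder.fit_transform(heights[["Height"]]).ravel().astype(int)
)
print(heights)
```

Here Short, Average, and Tall become 0, 1, and 2, so the integers carry the intended ordering.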


In [None]:
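# Another option: map each category to an integer explicitly with .map(),
# which preserves the natural Short < Average < Tall order.
# (.replace() with the same dict also works, but can emit a downcasting
# FutureWarning in some pandas versions.)
heights_df['Height Label2'] = heights_df['Height'].map({'Short': 0, 'Average': 1, 'Tall': 2})
heights_df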


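Index | Name    | Age | Height  | Height Label | Height Label2
------|---------|-----|---------|--------------|--------------
0     | Michael | 45  | Average | 0            | 1
1     | Jim     | 35  | Tall    | 2            | 2
2     | Pam     | 33  | Short   | 1            | 0
3     | Dwight  | 46  | Average | 0            | 1
4     | Kelly   | 28  | Short   | 1            | 0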


In [None]:
heights_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Name           5 non-null      object
 1   Age            5 non-null      int64 
 2   Height         5 non-null      object
 3   Height Label   5 non-null      int64 
 4   Height Label2  5 non-null      int64 
dtypes: int64(3), object(2)
memory usage: 328.0+ bytes


### One-Hot Encoding

In [None]:
# Using pandas
# Create dataframe
foods = {'Name': ['Michael', 'Jim', 'Pam', 'Dwight', 'Kelly'],
         'Age': [45, 35, 33, 46, 28],
         'Food': ['Pizza', 'Ham and Cheese', 'Pizza', 'Beets', 'Cupcake']}
foods_df = pd.DataFrame(foods)
foods_df

Index | Name    | Age | Food
------|---------|-----|---------------
0     | Michael | 45  | Pizza
1     | Jim     | 35  | Ham and Cheese
2     | Pam     | 33  | Pizza
3     | Dwight  | 46  | Beets
4     | Kelly   | 28  | Cupcake


In [None]:
# Generate binary indicator columns using get_dummies.
# prefix="" and prefix_sep="" keep the bare category names as column names.
dum_df = pd.get_dummies(foods_df, columns=["Food"], prefix="", prefix_sep="")
# With the default prefix, the columns would be named Food_Beets, Food_Cupcake, etc.:
# dum_df = pd.get_dummies(foods_df, columns=["Food"])
dum_df

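Index | Name    | Age | Beets | Cupcake | Ham and Cheese | Pizza
------|---------|-----|-------|---------|----------------|------
0     | Michael | 45  | False | False   | False          | True
1     | Jim     | 35  | False | False   | True           | False
2     | Pam     | 33  | False | False   | False          | True
3     | Dwight  | 46  | True  | False   | False          | False
4     | Kelly   | 28  | False | True    | False          | False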
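
`get_dummies` only creates columns for the categories present in the frame it is given, so new data with different categories can produce different columns. As a hedged sketch of one alternative, scikit-learn's `OneHotEncoder` can be fitted once and reused. The `new_foods` frame and the "Salad" category are hypothetical, added only to show how an unseen value is handled.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

foods = pd.DataFrame({"Food": ["Pizza", "Ham and Cheese", "Pizza", "Beets", "Cupcake"]})

# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(foods[["Food"]])

# Hypothetical new rows; "Salad" was not seen during fitting
new_foods = pd.DataFrame({"Food": ["Beets", "Salad"]})

# The default output is a sparse matrix; .toarray() makes it dense for display
new_encoded = pd.DataFrame(
    encoder.transform(new_foods[["Food"]]).toarray().astype(int),
    columns=encoder.categories_[0],
)
print(new_encoded)
```

The fitted encoder always returns the same four columns (Beets, Cupcake, Ham and Cheese, Pizza), and the "Salad" row is all zeros.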
