### Exploring various encoding techniques
More links:
    https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02

### Types of Encoding:
1. Nominal variables:
    Nominal variables are simple categorical variables whose categories have no order or preference over each other, e.g. Male, Female.
        a) One-hot encoding
        b) One-hot encoding for variables with many categories (using the KDD Orange ensemble technique)
        c) Mean encoding
2. Ordinal variables:
    Ordinal variables are categorical variables whose categories have a specific order or rank, e.g. High, Medium, Low.
        a) Label encoding
        b) Target-guided ordinal encoding
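
Option 1(b) in the list above is not demonstrated elsewhere in this notebook. One common reading of it, assumed here, is to one-hot encode only the most frequent categories of a high-cardinality variable and leave every rarer value as all zeros. The sketch below uses a made-up `city` column and an arbitrary cutoff of two categories; neither comes from the insurance data.

```python
import pandas as pd

# Hypothetical toy column; the values are made up for illustration.
cities = pd.Series(
    ["delhi", "pune", "delhi", "goa", "delhi", "pune", "agra"], name="city"
)

# Keep only the k most frequent categories (k=2 is an arbitrary choice).
top_k = cities.value_counts().nlargest(2).index

# One indicator column per frequent category; rarer values get all zeros.
encoded = pd.DataFrame(
    {f"city_{label}": (cities == label).astype(int) for label in top_k}
)
print(encoded)
```

This keeps the number of new columns fixed at k regardless of how many distinct values the variable has.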

In [60]:
# Import the required libraries
import pandas as pd
import numpy as np

In [61]:
dataset = pd.read_csv("../Dataset/insurance.csv")
dataset.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.8,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.7,0,no,northwest,21984.47
4,32,male,28.9,0,no,northwest,3866.86


In [11]:
# As we can see, the columns smoker and region (along with sex) are categorical variables
print(dataset["smoker"].unique())
print(dataset["region"].unique())

['yes' 'no']
['southwest' 'southeast' 'northwest' 'northeast']


In [12]:
# Here we can see the possible values and their counts
print(dataset["smoker"].value_counts())
print(dataset["region"].value_counts())

no     1064
yes     274
Name: smoker, dtype: int64
southeast    364
northwest    325
southwest    325
northeast    324
Name: region, dtype: int64


We usually perform encoding so that machine learning algorithms can work with categorical values, since most algorithms operate on numbers.

## One hot Encoding

In [13]:
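# First we perform one-hot encoding. In one-hot encoding, we create a separate column for each possible value
# and mark with 1 or 0 whether that value is present in the row.

pd.get_dummies(dataset)
# However, keeping every one-hot column leads to the dummy variable trap: the columns created from one variable
# are perfectly collinear, which can produce undesirable results (for example in linear models).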

Unnamed: 0,age,bmi,children,expenses,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,27.9,0,16884.92,1,0,0,1,0,0,0,1
1,18,33.8,1,1725.55,0,1,1,0,0,0,1,0
2,28,33.0,3,4449.46,0,1,1,0,0,0,1,0
3,33,22.7,0,21984.47,0,1,1,0,0,1,0,0
4,32,28.9,0,3866.86,0,1,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1333,50,31.0,3,10600.55,0,1,1,0,0,1,0,0
1334,18,31.9,0,2205.98,1,0,1,0,1,0,0,0
1335,18,36.9,0,1629.83,1,0,1,0,0,0,1,0
1336,21,25.8,0,2007.95,1,0,1,0,0,0,0,1


## Dummy Variable
Dummy variable encoding performs the same operation as one-hot encoding, but to avoid the dummy variable trap we drop one of the columns produced for each variable.
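
To make the trap concrete, here is a minimal sketch on a toy frame that reuses the four region names from the insurance data. It checks that the full one-hot columns always sum to 1 per row, which is why any one of them is redundant. The `dtype=int` argument is only there so the output shows 0/1 on any recent pandas version.

```python
import pandas as pd

# Toy column with the same four region names as the insurance data.
regions = pd.DataFrame(
    {"region": ["southwest", "southeast", "northwest", "northeast", "southeast"]}
)

full = pd.get_dummies(regions, dtype=int)                      # one column per category
reduced = pd.get_dummies(regions, drop_first=True, dtype=int)  # first category dropped

# Every row of the full encoding sums to 1, so any one column is
# fully determined by the others (perfect multicollinearity).
row_sums = full.sum(axis=1)

# After dropping one column, the dropped category ("northeast", first
# alphabetically) is represented by a row of all zeros.
northeast_row = reduced.loc[regions["region"] == "northeast"].iloc[0]
print(row_sums.tolist(), northeast_row.tolist())
```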

In [16]:
pd.get_dummies(dataset, drop_first=True)
# Now we can see that one column from each of sex, smoker and region has been dropped. To create dummy variables
# for specific columns only, pass their names through the columns parameter.

Unnamed: 0,age,bmi,children,expenses,sex_male,smoker_yes,region_northwest,region_southeast,region_southwest
0,19,27.9,0,16884.92,0,1,0,0,1
1,18,33.8,1,1725.55,1,0,0,1,0
2,28,33.0,3,4449.46,1,0,0,1,0
3,33,22.7,0,21984.47,1,0,1,0,0
4,32,28.9,0,3866.86,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...
1333,50,31.0,3,10600.55,1,0,1,0,0
1334,18,31.9,0,2205.98,0,0,0,0,0
1335,18,36.9,0,1629.83,0,0,0,1,0
1336,21,25.8,0,2007.95,0,0,0,0,1
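
As a small illustration of restricting the encoding to chosen columns, the sketch below builds a three-row stand-in for the insurance data (values copied from the first rows shown earlier) and encodes only `smoker` and `region`. The name `sample` is introduced here for illustration.

```python
import pandas as pd

# Small stand-in for the insurance data (first three rows shown earlier).
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
})

# Encode only "smoker" and "region"; "sex" is left as text.
encoded = pd.get_dummies(
    sample, columns=["smoker", "region"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```

Columns that are not listed in `columns` pass through unchanged, so they can be encoded later with a different technique.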


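## Label Encoding
Label encoding is a form of ordinal encoding: each unique value is replaced by an integer, which implies an order among the categories.

Unique values are sorted alphabetically and assigned a rank (0, 1, 2, ...).

It can be used for both nominal and ordinal variables, although for nominal variables the implied order carries no real meaning.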

In [21]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
dataset_Label_encoded = dataset.copy()

In [22]:
dataset_Label_encoded['region'] = label_encoder.fit_transform(dataset['region'])
dataset_Label_encoded['sex'] = label_encoder.fit_transform(dataset['sex'])
dataset_Label_encoded['smoker'] = label_encoder.fit_transform(dataset['smoker'])
dataset_Label_encoded

Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
0,19,0,27.9,0,1,3,16884.92
1,18,1,33.8,1,0,2,1725.55
2,28,1,33.0,3,0,2,4449.46
3,33,1,22.7,0,0,1,21984.47
4,32,1,28.9,0,0,1,3866.86
...,...,...,...,...,...,...,...
1333,50,1,31.0,3,0,1,10600.55
1334,18,0,31.9,0,0,0,2205.98
1335,18,0,36.9,0,0,2,1629.83
1336,21,0,25.8,0,0,3,2007.95
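
The cells above refit a single `LabelEncoder` for each column, so only the last fit is retained. If the integer codes need to be mapped back to labels later, one option is to keep a separate encoder per column. The sketch below does this on a toy frame; the names `sample` and `encoders` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sample = pd.DataFrame({
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

encoders = {}            # one fitted encoder per column
encoded = sample.copy()
for column in ["smoker", "region"]:
    encoder = LabelEncoder()
    encoded[column] = encoder.fit_transform(sample[column])
    encoders[column] = encoder

# classes_ lists the categories in sorted order; position = assigned code.
region_classes = list(encoders["region"].classes_)

# inverse_transform maps the integer codes back to the original labels.
decoded_region = list(encoders["region"].inverse_transform(encoded["region"]))
print(region_classes, decoded_region)
```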


## Ordinal Encoding
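Ordinal variables are categorical variables whose categories can be meaningfully ordered.
Ordinal encoding assigns each category an integer that reflects this order, usually starting from 1. The columns used below (sex, smoker, region) have no natural order, so the labels assigned to them are arbitrary and only illustrate the mechanics.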
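
For a variable that does have a natural order, such as the High/Medium/Low example from the introduction, the order can be stated explicitly instead of typed as a dictionary. The following is a minimal sketch using an ordered pandas `Categorical`; the `priority` column is hypothetical.

```python
import pandas as pd

# Hypothetical ordinal column using the High/Medium/Low example.
priority = pd.Series(["High", "Low", "Medium", "Low", "High"])

# State the order explicitly, from lowest to highest.
order = ["Low", "Medium", "High"]
as_ordered = pd.Categorical(priority, categories=order, ordered=True)

# .codes are 0-based positions in `order`; add 1 so labels start at 1.
priority_encoded = pd.Series(as_ordered.codes + 1, index=priority.index)
print(priority_encoded.tolist())
```

A value that is missing from `order` gets the code -1 (0 after the shift), which makes unexpected categories easy to spot.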

In [51]:
print(dataset["region"].unique())
print(dataset["sex"].unique())
print(dataset["smoker"].unique())

['southwest' 'southeast' 'northwest' 'northeast']
['female' 'male']
['yes' 'no']


In [52]:
# making copy of dataset
dataset_ordinal_encoding = dataset.copy()

In [87]:
# Assign an integer label to each category
sex_order_label = {"male": 1, "female": 2}
smoker_order_label = {"yes": 1, "no": 2}
region_order_label = {"southwest": 1, "northwest": 2, "southeast": 3, "northeast": 4}

In [54]:
dataset_ordinal_encoding["sex_oe"] = dataset["sex"].map(sex_order_label)
dataset_ordinal_encoding["smoker_oe"] = dataset["smoker"].map(smoker_order_label)
dataset_ordinal_encoding["region_oe"] = dataset["region"].map(region_order_label)
dataset_ordinal_encoding

Unnamed: 0,age,sex,bmi,children,smoker,region,expenses,sex_oe,smoker_oe,region_oe
0,19,female,27.9,0,yes,southwest,16884.92,2,1,1
1,18,male,33.8,1,no,southeast,1725.55,1,2,3
2,28,male,33.0,3,no,southeast,4449.46,1,2,3
3,33,male,22.7,0,no,northwest,21984.47,1,2,2
4,32,male,28.9,0,no,northwest,3866.86,1,2,2
...,...,...,...,...,...,...,...,...,...,...
1333,50,male,31.0,3,no,northwest,10600.55,1,2,2
1334,18,female,31.9,0,no,northeast,2205.98,2,2,4
1335,18,female,36.9,0,no,southeast,1629.83,2,2,3
1336,21,female,25.8,0,no,southwest,2007.95,2,2,1


In [55]:
dataset_ordinal_encoding.drop(columns=["sex","smoker","region"], inplace=True)

In [56]:
dataset_ordinal_encoding

Unnamed: 0,age,bmi,children,expenses,sex_oe,smoker_oe,region_oe
0,19,27.9,0,16884.92,2,1,1
1,18,33.8,1,1725.55,1,2,3
2,28,33.0,3,4449.46,1,2,3
3,33,22.7,0,21984.47,1,2,2
4,32,28.9,0,3866.86,1,2,2
...,...,...,...,...,...,...,...
1333,50,31.0,3,10600.55,1,2,2
1334,18,31.9,0,2205.98,2,2,4
1335,18,36.9,0,1629.83,2,2,3
1336,21,25.8,0,2007.95,2,2,1


## Mean Encoding

Mean encoding replaces each category with the mean of the target for that category. It is similar to label encoding, except that here the assigned values are tied directly to the target.

In [75]:
# creating dataset
data={'Temperature':['hot','cold','veryhot','hot','veryhot','hot','veryhot','veryhot','veryhot','hot','cold'],
      'Target':[1,1,1,1,1,0,0,1,1,1,0]}
df = pd.DataFrame(data)

In [76]:
df

Unnamed: 0,Temperature,Target
0,hot,1
1,cold,1
2,veryhot,1
3,hot,1
4,veryhot,1
5,hot,0
6,veryhot,0
7,veryhot,1
8,veryhot,1
9,hot,1
10,cold,0


In [79]:
mean_encode = df.groupby(['Temperature'])['Target'].mean()
mean_encode

Temperature
cold       0.50
hot        0.75
veryhot    0.80
Name: Target, dtype: float64

In [82]:
df['Temp_mean_encoding'] = df['Temperature'].map(mean_encode)
df

Unnamed: 0,Temperature,Target,Temp_mean_encoding
0,hot,1,0.75
1,cold,1,0.5
2,veryhot,1,0.8
3,hot,1,0.75
4,veryhot,1,0.8
5,hot,0,0.75
6,veryhot,0,0.8
7,veryhot,1,0.8
8,veryhot,1,0.8
9,hot,1,0.75
10,cold,0,0.5
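
The group-then-map steps above can also be written as a single call. The sketch below rebuilds the same toy data and uses `groupby(...).transform("mean")`, which returns one value per row, already aligned with the original index.

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": ["hot", "cold", "veryhot", "hot", "veryhot", "hot",
                    "veryhot", "veryhot", "veryhot", "hot", "cold"],
    "Target": [1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
})

# transform("mean") broadcasts each group's mean back to its rows,
# so no separate lookup table or map() call is needed.
df["Temp_mean_encoding"] = df.groupby("Temperature")["Target"].transform("mean")
print(df)
```

Because these values are derived from the target, in a modelling workflow the means would normally be computed on the training split only and then mapped onto validation or test data.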


## Target Guided Ordinal Encoding

In [83]:
# Ranked by mean target value: veryhot (0.80) is the highest, then hot (0.75), and lastly cold (0.50)
# We can assign ordinal labels based on these mean values
temp_tgoe = {0.50: 1, 0.75: 2, 0.80: 3}

In [84]:
df["Temp_target_ordinal_enc"] = df["Temp_mean_encoding"].map(temp_tgoe)

In [85]:
df

Unnamed: 0,Temperature,Target,Temp_mean_encoding,Temp_target_ordinal_enc
0,hot,1,0.75,2
1,cold,1,0.5,1
2,veryhot,1,0.8,3
3,hot,1,0.75,2
4,veryhot,1,0.8,3
5,hot,0,0.75,2
6,veryhot,0,0.8,3
7,veryhot,1,0.8,3
8,veryhot,1,0.8,3
9,hot,1,0.75,2
10,cold,0,0.5,1
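
The mapping above is keyed on hand-typed float means, which depends on exact floating-point matches and has to be rewritten whenever the data changes. As an alternative sketch, the ranks can be derived from the sorted means and mapped from the category names directly. The variable names `ordered_labels` and `temp_rank` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": ["hot", "cold", "veryhot", "hot", "veryhot", "hot",
                    "veryhot", "veryhot", "veryhot", "hot", "cold"],
    "Target": [1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
})

# Mean target per category, sorted from lowest to highest.
ordered_labels = df.groupby("Temperature")["Target"].mean().sort_values().index

# Rank 1 goes to the category with the lowest mean target.
temp_rank = {label: rank for rank, label in enumerate(ordered_labels, start=1)}

df["Temp_target_ordinal_enc"] = df["Temperature"].map(temp_rank)
print(temp_rank)
```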
