# Encoders and Imputers:

**When building machine learning models, we often find that a few columns hold text rather than numbers. Most machine learning algorithms only accept numbers and cannot work with raw text. But what if a text column carries important information? How can we convert that text into numbers (codes) so that the system can understand it?**

So we are going to use **Encoders and Imputers**: encoders to convert text into numbers, and imputers to fill missing (null) values.

**What does an imputer do?**

The imputer is an estimator used to fill the missing values in datasets. For numerical values, it can use the mean, the median, or a constant. For categorical values, it can use the most frequent value or a constant. You can also train a model to predict the missing values.
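
As a minimal sketch of the strategies listed above, the snippet below fills one missing value in a numeric column and one in a text column with scikit-learn's `SimpleImputer`. The two small DataFrames (`exp` and `city`) are made up for illustration and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative data: one numeric column and one text column, each with a gap.
exp = pd.DataFrame({'Exp': [1, 3, 5, 6, 9, np.nan]})
city = pd.DataFrame({'City': ['Bengaluru', 'Delhi', np.nan, 'Bengaluru']}, dtype=object)

# Numeric strategies: mean, median, or a constant of our choice.
mean_filled = SimpleImputer(strategy='mean').fit_transform(exp)
median_filled = SimpleImputer(strategy='median').fit_transform(exp)
const_filled = SimpleImputer(strategy='constant', fill_value=0).fit_transform(exp)

# Categorical strategy: the most frequent value of the column.
mode_filled = SimpleImputer(strategy='most_frequent').fit_transform(city)

print(mean_filled[5, 0], median_filled[5, 0], const_filled[5, 0], mode_filled[2, 0])
```

Which strategy fits best depends on the column: the median is less sensitive to outliers than the mean, and a constant makes the gap explicit.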

**Important interview question: How are we going to fill NaN values?**

Answer: Don't just say we will use fillna or something similar. Say that **we will use more advanced imputation techniques to fill the NaN values.**

**What is Encoder?**

Encoding means converting data into a required format. In the Pictionary example, we convert a word (text) into a drawing (image). In deep learning, for instance, an encoder network converts a sequence of words (say, a Spanish sentence) into a numeric vector, also known as the hidden state. In this notebook, encoding means something simpler: converting the text categories in a column into numbers.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd

In [3]:
# let's create a DataFrame:

df = pd.DataFrame({'Salary':[25000,48000,71000,85000,90000,55000],
                  'City':['Bengaluru','Delhi','Hyderabad','Bengaluru','Hyderabad','Bengaluru'],
                  'Gender':['Male','Female','Female','Female','Male','Male'],
                  'Exp':[1,3,5,6,9,None]})
df

Unnamed: 0,Salary,City,Gender,Exp
0,25000,Bengaluru,Male,1.0
1,48000,Delhi,Female,3.0
2,71000,Hyderabad,Female,5.0
3,85000,Bengaluru,Female,6.0
4,90000,Hyderabad,Male,9.0
5,55000,Bengaluru,Male,


**This DataFrame contains text, numbers, and a missing (None) value. Here City and Gender are categorical data, so let's see how to convert them into numbers that the system can understand.**

# 1. LabelEncoder:

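**It converts the text of one column into numbers. It handles only one column at a time, so it has to be applied again for the next column. (scikit-learn intends LabelEncoder for the target column; for feature columns, OrdinalEncoder does the same job on several columns at once.)**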

In [6]:
# Importing Encoders: LabelEncoder is case sensitive.

from sklearn.preprocessing import LabelEncoder

In [7]:
lab_enc = LabelEncoder()  # Initialize LabelEncoder first

In [8]:
df2 = lab_enc.fit_transform(df['City'])  # City contains text, so we fit and transform it first.
pd.Series(df2)                           # Only one column comes back, so view it as a Series.

0    0
1    1
2    2
3    0
4    2
5    0
dtype: int32

**So here it converted the text into numbers. It first sorted the categories alphabetically and then assigned each one an integer: Bengaluru comes first so it is 0, Delhi is 1, and Hyderabad is 2.**
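
To check the alphabetical mapping yourself, a fitted `LabelEncoder` exposes the sorted categories in `classes_` and can map the numbers back with `inverse_transform`. This is a small sketch on a plain list of the same city names; the variable names are only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

cities = ['Bengaluru', 'Delhi', 'Hyderabad', 'Bengaluru', 'Hyderabad', 'Bengaluru']

enc = LabelEncoder()
codes = enc.fit_transform(cities)

# classes_ holds the categories in sorted order; the position is the code.
classes = list(enc.classes_)

# inverse_transform maps the integer codes back to the original text.
decoded = list(enc.inverse_transform(codes))

print(list(codes))
print(classes)
print(decoded)
```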

In [9]:
# To assign numbers instead of text in city:

df['City']=df2       # As df2 has converted data.
df

Unnamed: 0,Salary,City,Gender,Exp
0,25000,0,Male,1.0
1,48000,1,Female,3.0
2,71000,2,Female,5.0
3,85000,0,Female,6.0
4,90000,2,Male,9.0
5,55000,0,Male,


**Similarly , we can do this for Gender.**
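
One way to do that, sketched below, is to keep a separate `LabelEncoder` per column in a dictionary, so each column's mapping can still be looked up later. The names `df_demo` and `encoders` are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df_demo = pd.DataFrame({
    'City': ['Bengaluru', 'Delhi', 'Hyderabad', 'Bengaluru', 'Hyderabad', 'Bengaluru'],
    'Gender': ['Male', 'Female', 'Female', 'Female', 'Male', 'Male'],
})

# One encoder per column, because each LabelEncoder learns a single mapping.
encoders = {}
for col in ['City', 'Gender']:
    encoders[col] = LabelEncoder()
    df_demo[col] = encoders[col].fit_transform(df_demo[col])

print(df_demo)
```

Female sorts before Male, so Female becomes 0 and Male becomes 1.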

# 2. OneHotEncoder & Simple Imputer:

**OneHotEncoder converts the text of all the selected columns into numbers (or the required format) in one go, unlike LabelEncoder, which handles one column at a time. It creates one 0/1 column for each category.**

**SimpleImputer is used to fill null values present in columns.**

In [10]:
# Let's understand other Encoder and Imputer:

from sklearn.preprocessing import OneHotEncoder      # Case sensitive
from sklearn.impute import SimpleImputer             # SimpleImputer is the first imputation technique for filling null values.
from sklearn.compose import make_column_transformer  # Applies each transformer to its own columns and joins the results.

In [11]:
# Let's initialize :

ohe = OneHotEncoder()
si = SimpleImputer()

In [12]:
# let's create/take DataFrame:

df = pd.DataFrame({'Salary':[25000,48000,71000,85000,90000,55000],
                  'City':['Bengaluru','Delhi','Hyderabad','Bengaluru','Hyderabad','Bengaluru'],
                  'Gender':['Male','Female','Female','Female','Male','Male'],
                  'Exp':[1,3,5,6,9,None]})
df

Unnamed: 0,Salary,City,Gender,Exp
0,25000,Bengaluru,Male,1.0
1,48000,Delhi,Female,3.0
2,71000,Hyderabad,Female,5.0
3,85000,Bengaluru,Female,6.0
4,90000,Hyderabad,Male,9.0
5,55000,Bengaluru,Male,


In [13]:
ct = make_column_transformer((ohe,['City','Gender']),
                            (si,['Exp']),
                            remainder = 'passthrough')

Here make_column_transformer joins all the original, encoded, and imputed columns together.

For ohe, we pass City and Gender, as these are the only categorical columns containing text, so OneHotEncoder converts both of them into numbers together.

For si, we pass Exp, as it is the only column with null values, so SimpleImputer fills those nulls.

And remainder = 'passthrough' keeps all the other columns, on which neither ohe nor si is used.

In [15]:
# So far we have only given the column names City, Gender and Exp, not the DataFrame, so let's fit this on df:

ct.fit_transform(df)

array([[1.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 1.0e+00, 1.0e+00, 2.5e+04],
       [0.0e+00, 1.0e+00, 0.0e+00, 1.0e+00, 0.0e+00, 3.0e+00, 4.8e+04],
       [0.0e+00, 0.0e+00, 1.0e+00, 1.0e+00, 0.0e+00, 5.0e+00, 7.1e+04],
       [1.0e+00, 0.0e+00, 0.0e+00, 1.0e+00, 0.0e+00, 6.0e+00, 8.5e+04],
       [0.0e+00, 0.0e+00, 1.0e+00, 0.0e+00, 1.0e+00, 9.0e+00, 9.0e+04],
       [1.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 1.0e+00, 4.8e+00, 5.5e+04]])

In [18]:
# Let's create a Dataframe to understand the output after ohe and si being used:


encoded = pd.DataFrame(ct.fit_transform(df))
encoded

Unnamed: 0,0,1,2,3,4,5,6
0,1.0,0.0,0.0,0.0,1.0,1.0,25000.0
1,0.0,1.0,0.0,1.0,0.0,3.0,48000.0
2,0.0,0.0,1.0,1.0,0.0,5.0,71000.0
3,1.0,0.0,0.0,1.0,0.0,6.0,85000.0
4,0.0,0.0,1.0,0.0,1.0,9.0,90000.0
5,1.0,0.0,0.0,0.0,1.0,4.8,55000.0


**We had only 4 columns before, but after encoding and imputing there are 7 columns, named 0 to 6. We gained 3 columns. Let's understand this:**

**OneHotEncoder:**

**Let's start with the City column.** We applied OneHotEncoder on City, so the first column is for Bengaluru. In df, index 0 is Bengaluru, so 1.0 is assigned at that position. Index 3 and 5 are also Bengaluru, so the first column has 1.0 at index 3 and 5 as well. All the others are 0 because they are not Bengaluru.

Similarly, the second column is for Delhi. In df, index 1 is Delhi, so in the second column 1.0 is assigned at index 1. The others are 0.

Similarly, the third column is for Hyderabad. In df, index 2 and 4 are Hyderabad, so in the third column 1.0 is assigned at index 2 and 4. The others are 0.

**Then it moves to Gender. The fourth column is for Female, as F comes first alphabetically. In df, index 1, 2, 3 are Female, so 1.0 is assigned at index 1, 2, 3 and the others are 0.**

Similarly, the fifth column is for Male. In df, index 0, 4, 5 are Male, so 1.0 is assigned at index 0, 4, 5 and the others are 0.

**Imputing(SimpleImputer):**

As we applied SimpleImputer on Exp to fill the NaN value at index 5, all the other values are unchanged, and **SimpleImputer filled the null value with the mean of the column (its default strategy), i.e. (1 + 3 + 5 + 6 + 9) / 5 = 4.8.**


**make_column_transformer :**

Only the Salary column is left, on which neither OneHotEncoder nor SimpleImputer was applied, because it already holds numbers and has no null values. So **make_column_transformer kept this column as it is (remainder = 'passthrough') and placed it after the transformed columns.**

In [19]:
# Rename the columns as per our choice:

encoded = pd.DataFrame(ct.fit_transform(df),columns = ['City Bengaluru','City Delhi','City Hyderabad',
                                                       'Gender Female','Gender Male','exp','Salary'])
encoded

Unnamed: 0,City Bengaluru,City Delhi,City Hyderabad,Gender Female,Gender Male,exp,Salary
0,1.0,0.0,0.0,0.0,1.0,1.0,25000.0
1,0.0,1.0,0.0,1.0,0.0,3.0,48000.0
2,0.0,0.0,1.0,1.0,0.0,5.0,71000.0
3,1.0,0.0,0.0,1.0,0.0,6.0,85000.0
4,0.0,0.0,1.0,0.0,1.0,9.0,90000.0
5,1.0,0.0,0.0,0.0,1.0,4.8,55000.0


In [20]:
# Original Data set:
df

Unnamed: 0,Salary,City,Gender,Exp
0,25000,Bengaluru,Male,1.0
1,48000,Delhi,Female,3.0
2,71000,Hyderabad,Female,5.0
3,85000,Bengaluru,Female,6.0
4,90000,Hyderabad,Male,9.0
5,55000,Bengaluru,Male,


# get_dummies:

**It is a Pandas Technique.**

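**OneHotEncoder and get_dummies give almost the same result. Both can drop the first category of each column to reduce the number of columns (drop_first = True in get_dummies, drop = 'first' in OneHotEncoder). The major difference is that OneHotEncoder is fitted once and remembers its categories, so it can encode new data in exactly the same way, while get_dummies simply encodes whatever categories are present in the DataFrame it is given.**

**OHE returns a plain array without column names (as we saw already, the columns were just indexed from 0 to n; the names are available from get_feature_names_out()), but get_dummies returns a DataFrame with column/variable names.**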

**Sometimes having more columns might overfit the model.**

In [23]:
df1 = pd.get_dummies(df[['City','Gender']])
df1

Unnamed: 0,City_Bengaluru,City_Delhi,City_Hyderabad,Gender_Female,Gender_Male
0,1,0,0,0,1
1,0,1,0,1,0
2,0,0,1,1,0
3,1,0,0,1,0
4,0,0,1,0,1
5,1,0,0,0,1


**So here we have 5 columns instead of 7, as we only selected City and Gender, the columns on which OHE was applied.**

**And we got the variables/columns name too.**


In [24]:
# To drop first column of each categorical data:

df1 = pd.get_dummies(df[['City','Gender']] , drop_first = True)
df1

Unnamed: 0,City_Delhi,City_Hyderabad,Gender_Male
0,0,0,1
1,1,0,0
2,0,1,0
3,0,0,0
4,0,1,1
5,0,0,1


**So from City, the first column, i.e. City_Bengaluru, is dropped, and from Gender, the first column, i.e. Gender_Female, is dropped.**
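
A practical difference between the two shows up on new data. In the sketch below (the `train` and `new` frames are made up for illustration), a `OneHotEncoder` fitted on three cities still produces the same columns for a batch that only contains Delhi, while `get_dummies` only creates a column for the category it actually sees. It also shows that `OneHotEncoder` can drop the first category with `drop='first'`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'City': ['Bengaluru', 'Delhi', 'Hyderabad']})
new = pd.DataFrame({'City': ['Delhi', 'Delhi']})

# Fit once on the training data; the categories are remembered.
ohe_demo = OneHotEncoder(drop='first')
ohe_demo.fit(train)

# The new batch is encoded into the same two columns (Delhi, Hyderabad).
new_encoded = ohe_demo.transform(new).toarray()

# get_dummies only knows the categories present in the frame it is given.
new_dummies = pd.get_dummies(new)

print(new_encoded)
print(list(new_dummies.columns))
```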

# 3. Ordinal Encoder:

**It is also used to convert text into numbers (or the required format), but the encoded (converted) data follows an order that we specify.**
 
                                       OR

**Ordinal encoding converts each label into an integer value, and the encoded values represent the order of the labels.**


When to apply Ordinal Encoder?

**When the categories have a natural ranking and higher-ranked categories should get higher values, as we will do below for Manager, which is the highest position of all.**

In [25]:
from sklearn.preprocessing import OrdinalEncoder

In [26]:
import pandas as pd


In [27]:
employee = pd.DataFrame({'Position':['SE','Manager','Team_Lead','SSE'],
                        'Project':['A','B','C','D'],
                        'Salary':[25000,8500,71000,48000]})
employee

Unnamed: 0,Position,Project,Salary
0,SE,A,25000
1,Manager,B,8500
2,Team_Lead,C,71000
3,SSE,D,48000


In [30]:
ord_enc= OrdinalEncoder(categories = [['SE','SSE','Team_Lead','Manager'],['A','B','C','D']])


**We have to give the order in which we want to rank our data, i.e. which position should get the lowest value and which position should get the highest.**

**In categories, we pass one list per column, each listing that column's categories in the order we want.**

In [31]:
encoded_df = ord_enc.fit_transform(employee[['Position','Project']])
encoded_df

array([[0., 0.],
       [3., 1.],
       [2., 2.],
       [1., 3.]])

**Here we took this encoder object and fitted it on the columns of the employee DataFrame that we want to encode, i.e. Position and Project.**
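
To see why the explicit order matters, the sketch below encodes the Position column twice: once with the default settings, where the categories are simply sorted alphabetically, and once with the ranking spelled out. The variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

positions = pd.DataFrame({'Position': ['SE', 'Manager', 'Team_Lead', 'SSE']})

# Default: categories are sorted alphabetically, so Manager gets the lowest code.
default_codes = OrdinalEncoder().fit_transform(positions).ravel().tolist()

# Explicit ranking: SE < SSE < Team_Lead < Manager.
ranked = OrdinalEncoder(categories=[['SE', 'SSE', 'Team_Lead', 'Manager']])
ranked_codes = ranked.fit_transform(positions).ravel().tolist()

print(default_codes)  # alphabetical codes
print(ranked_codes)   # codes that follow the seniority order
```

With the default, Manager would be encoded as 0, which is the opposite of what we want.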

# 4. Binary Encoder:

**Binary encoding is a technique used to transform categorical data into numerical data by encoding categories as integers and then converting them into binary code(i.e 0 or 1).**

In [1]:
import pandas as pd

In [3]:
df = pd.DataFrame({'Cat_Data':['A','B','C','D','E','F','G','H','I','A','A','D']})
df

Unnamed: 0,Cat_Data
0,A
1,B
2,C
3,D
4,E
5,F
6,G
7,H
8,I
9,A


In [6]:
# To install category_encoders to use BinaryEncoder:

!pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.5.1.post0-py2.py3-none-any.whl (72 kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.5.1.post0


In [7]:
from category_encoders import BinaryEncoder
from sklearn.preprocessing import OneHotEncoder


In [8]:
bi_enc=BinaryEncoder()   #Initialize as usual

**Count the number of distinct categories (ignoring duplicates).**

**Number them A-1, B-2, C-3 and so on.**

**Open a calculator in programmer mode, enter 1 for A and check its binary value, then enter 2 for B and check its binary value.**

In [10]:
df_bi = bi_enc.fit_transform(df)
df_bi

Unnamed: 0,Cat_Data_0,Cat_Data_1,Cat_Data_2,Cat_Data_3
0,0,0,0,1
1,0,0,1,0
2,0,0,1,1
3,0,1,0,0
4,0,1,0,1
5,0,1,1,0
6,0,1,1,1
7,1,0,0,0
8,1,0,0,1
9,0,0,0,1
10,0,0,0,1
11,0,1,0,0


**We had nine categories (ignoring duplicates), i.e. A to I. The encoder created 4 columns in total (not one per category) and named them Cat_Data_0, Cat_Data_1 and so on. The values for A to I (rows 0 to 11, including duplicates) are the binary digits that a programmer calculator would show.**

How?

Open the calculator and select programmer mode. For A, enter 1 and look at the binary value: it is 0001.

Similarly, for B, enter 2 and look at the binary value: it is 0010.

Similarly, for C, enter 3 and look at the binary value: it is 0011. And so on for all the others.
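
The calculator steps can also be reproduced in a few lines of plain Python. This is a hand-written sketch that mirrors the output shown above, not the `category_encoders` implementation: it numbers the categories from 1 in order of first appearance and writes each number in binary, one digit per column. The names `codes`, `width` and `manual` are illustrative.

```python
import pandas as pd

cats = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'A', 'A', 'D']

# Number the distinct categories 1, 2, 3, ... in order of first appearance.
codes = {c: i for i, c in enumerate(dict.fromkeys(cats), start=1)}

# The largest code decides how many binary digits (columns) are needed.
width = max(codes.values()).bit_length()

# Write each code as a zero-padded binary string and split it into digits.
rows = [[int(bit) for bit in format(codes[c], f'0{width}b')] for c in cats]
manual = pd.DataFrame(rows, columns=[f'Cat_Data_{i}' for i in range(width)])
print(manual)
```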

# Comparing the same with OneHotEncoder:

In [11]:
ohe = OneHotEncoder(sparse_output = False)  # On scikit-learn older than 1.2, use sparse = False instead.
ohe.fit_transform(df[['Cat_Data']])


array([[1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0.]])

**What's the difference in both?**

The OneHotEncoder gives 9 columns, as there are 9 categories (ignoring duplicates). It means the number of columns equals the number of categories.

The BinaryEncoder gave just 4 columns, because the largest code (9, for I) needs four binary digits: 1001. The column count only grows with the number of binary digits, so even with 20 categories it would give just 5 columns (20 is 10100 in binary).


**Important Note: So if there are only 4-5 categories, use OneHotEncoder, and if there are many categories, use BinaryEncoder.**
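
To get a feel for how the two column counts diverge, the sketch below compares them for a few category counts. It assumes the numbering used above (codes starting at 1, one column per binary digit); the exact count produced by `category_encoders` may differ slightly between versions. The helper names are hypothetical.

```python
def one_hot_columns(n_categories: int) -> int:
    # One-hot encoding needs one column per category.
    return n_categories


def binary_columns(n_categories: int) -> int:
    # Codes run from 1 to n_categories, so the largest code decides the width.
    return n_categories.bit_length()


comparison = {n: (one_hot_columns(n), binary_columns(n)) for n in (5, 9, 20, 100, 1000)}
for n, (ohe_cols, bin_cols) in comparison.items():
    print(f'{n} categories: one-hot = {ohe_cols} columns, binary = {bin_cols} columns')
```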

# Knn Imputer:

**KNN imputer is also used to fill the missing values present in a dataset, but it looks at the other columns to find the most similar rows and imputes the value from those rows.**

In [13]:
from sklearn.impute import KNNImputer

In [14]:
df = pd.DataFrame({'Salary':[25000,48000,71000,85000,90000,55000],
                  'City':['Bengaluru','Delhi','Hyderabad','Bengaluru','Hyderabad','Bengaluru'],
                  'Gender':['Male','Female','Female','Female','Male','Male'],
                  'Exp':[1,3,5,6,9,None]})
df

Unnamed: 0,Salary,City,Gender,Exp
0,25000,Bengaluru,Male,1.0
1,48000,Delhi,Female,3.0
2,71000,Hyderabad,Female,5.0
3,85000,Bengaluru,Female,6.0
4,90000,Hyderabad,Male,9.0
5,55000,Bengaluru,Male,


In [17]:
knnimp = KNNImputer(n_neighbors = 2)   # Initialize as usual; n_neighbors sets how many similar rows are averaged (default 5).
knn_df = pd.DataFrame(knnimp.fit_transform(df[['Salary','Exp']]))  # Separate name, so the imputer object is not overwritten.
knn_df

Unnamed: 0,0,1
0,25000.0,1.0
1,48000.0,3.0
2,71000.0,5.0
3,85000.0,6.0
4,90000.0,9.0
5,55000.0,4.0


**Here we focus only on Exp, as this is imputation and the null value is only present in the Exp column.**

**In KNNImputer, to work on Exp we also have to pass at least one other column that can be used to compare rows. Salary has no nulls, which is why we passed Salary and Exp to fit_transform.**

**Here the system looks at the salary of the row with the NaN Exp, i.e. 55000. n_neighbors = 2 means it finds the two salaries closest to 55000, i.e. 48000 and 71000. It then takes the Exp values of these two rows (3 and 5), and their mean, 4.0, is filled in place of the null value. If we take n_neighbors = 3, the same steps are followed with the 3 closest salaries.**
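
The neighbour search described above can be checked by hand. The sketch below finds the two rows whose salary is closest to the salary of the incomplete row, averages their Exp, and compares that with what `KNNImputer` returns. `df_demo`, `manual` and `sk` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df_demo = pd.DataFrame({'Salary': [25000, 48000, 71000, 85000, 90000, 55000],
                        'Exp': [1, 3, 5, 6, 9, np.nan]})

# Rows with a known Exp are the candidate neighbours.
known = df_demo.dropna()
target_salary = df_demo.loc[5, 'Salary']

# Pick the two rows with the smallest salary difference and average their Exp.
nearest = (known['Salary'] - target_salary).abs().nsmallest(2).index
manual = known.loc[nearest, 'Exp'].mean()

# The same value as computed by scikit-learn.
sk = KNNImputer(n_neighbors=2).fit_transform(df_demo)[5, 1]

print(list(nearest), manual, sk)
```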

# Iterative Imputer:

**This method treats the other columns (which don't have nulls) as features and the column with nulls as the label. It trains a model on the rows where the label is known, then predicts the NaN values and imputes them. It's just like regression, where the null column is the label.**

In [19]:
# Before using Iterative Imputer , we need to enable it using below code:

from sklearn.experimental import enable_iterative_imputer

# import Iterative Imputer

from sklearn.impute import IterativeImputer

In [20]:
df = pd.DataFrame({'Salary':[25000,48000,71000,85000,90000,55000],
                  'City':['Bengaluru','Delhi','Hyderabad','Bengaluru','Hyderabad','Bengaluru'],
                  'Gender':['Male','Female','Female','Female','Male','Male'],
                  'Exp':[1,3,5,6,9,None]})
df

Unnamed: 0,Salary,City,Gender,Exp
0,25000,Bengaluru,Male,1.0
1,48000,Delhi,Female,3.0
2,71000,Hyderabad,Female,5.0
3,85000,Bengaluru,Female,6.0
4,90000,Hyderabad,Male,9.0
5,55000,Bengaluru,Male,


In [21]:
iter_impute = IterativeImputer()
ite_imp = pd.DataFrame(iter_impute.fit_transform(df[['Salary','Exp']]),columns=['Salary','Exp'])
ite_imp

Unnamed: 0,Salary,Exp
0,25000.0,1.0
1,48000.0,3.0
2,71000.0,5.0
3,85000.0,6.0
4,90000.0,9.0
5,55000.0,3.864759


**As we passed Salary and Exp, it treats Salary as the feature and Exp as the label.**

**So it behaves like a linear regression (by default IterativeImputer fits a BayesianRidge regressor), in which index 0 to 4 are the training data and index 5 is the testing data. Based on what it learned in training, it predicts the NaN, so the filled value is a predicted value.**

**We can see that for 1 year of Exp the salary is 25000, for 3 years of Exp the salary is 48000, and for 3.86 years of Exp (the predicted value) the salary is 55000, which makes sense given the rest of the data.**
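
As a rough cross-check of the regression idea, the sketch below fits an ordinary least-squares line of Exp against Salary on the five complete rows with NumPy and predicts Exp for a salary of 55000. Plain least squares is used here only for illustration; IterativeImputer's default estimator is a BayesianRidge regressor, so its value differs in the later decimals.

```python
import numpy as np

# The five complete rows: Salary is the feature, Exp is the label.
salary = np.array([25000, 48000, 71000, 85000, 90000], dtype=float)
exp = np.array([1, 3, 5, 6, 9], dtype=float)

# Fit a straight line Exp = slope * Salary + intercept.
slope, intercept = np.polyfit(salary, exp, deg=1)

# Predict the missing Exp for the row with Salary = 55000.
predicted = slope * 55000 + intercept
print(round(predicted, 2))
```

The result lands at roughly 3.86, in line with the imputed value shown above.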
