## IMPUTERS AND ENCODERS

Before moving on to the next model, decision trees, let's first understand some important concepts.

Our models cannot understand text data directly, so we need to convert text into numbers. Additionally, one of the most common issues in data is null values. We need to learn how to handle them effectively.

While simple methods like mean, mode, or median imputation are widely used, there are more advanced techniques we can use. Let's dive into the concepts of:

#### 1. Imputers
Imputing refers to filling in missing values (nulls) with estimated values. Instead of using simple methods like `fillna`, we'll explore different imputation techniques that provide better estimates for the missing data.

#### 2. Encoding
Encoding is the process of converting categorical data (like text labels) into numerical values so that the model can understand it. Since machine learning models work with numbers, encoding is essential. There are several encoding techniques that we will explore to handle categorical variables effectively.



In [2]:
#Import the libraries we need

import warnings
warnings.filterwarnings('ignore')


import pandas as pd

In [4]:
info=pd.DataFrame({'Pay':[25000,48000,71000,85000,90000,55000],
                'State':['Bengluru','Delhi','Hyderabad','Bengluru','Hyderabad','Bengluru'],
                 'Gender':['Male','Female','Female','Female','Male','Male'],
                'Exp':[1,3,5,6,9,None]})
info

Unnamed: 0,Pay,State,Gender,Exp
0,25000,Bengluru,Male,1.0
1,48000,Delhi,Female,3.0
2,71000,Hyderabad,Female,5.0
3,85000,Bengluru,Female,6.0
4,90000,Hyderabad,Male,9.0
5,55000,Bengluru,Male,


In [20]:
from sklearn.preprocessing import LabelEncoder

In [22]:
encoder=LabelEncoder()

In [26]:
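#LabelEncoder maps each distinct label to an integer (labels are numbered in sorted order).
#Store the result in a new variable so the original DataFrame is not overwritten.
state_codes=encoder.fit_transform(info['State'])
state_codes

array([0, 1, 2, 0, 2, 0])

In [34]:
pd.Series(state_codes)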

0    0
1    1
2    2
3    0
4    2
5    0
dtype: int64
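
To see which integer belongs to which state, the fitted encoder can be inspected. The following is a minimal sketch, assuming the same State values as above; the variable names `states`, `codes`, `mapping`, and `restored` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same State values as in the example DataFrame
states = pd.Series(['Bengluru', 'Delhi', 'Hyderabad', 'Bengluru', 'Hyderabad', 'Bengluru'])

encoder = LabelEncoder()
codes = encoder.fit_transform(states)

# classes_ holds the distinct labels in sorted order; the position of a label is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts the integer codes back to the original labels
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Note that scikit-learn documents `LabelEncoder` as a tool for encoding target labels; for input features, `OrdinalEncoder` or `OneHotEncoder` is usually the better fit.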

## One-Hot Encoding (OHE):
One-hot encoding converts categorical data into numerical data by creating one new binary column for each category in the original column.

#### How it works:
For a categorical column with N unique categories, OHE creates N new columns.
Each column will have a 1 for the category present in that row and a 0 for others.
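
The rule above can be checked on a single column. This is a small sketch, assuming a hypothetical four-row `State` column; it only illustrates that N categories yield N binary columns with exactly one 1 per row.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical four-row column with three distinct categories
states = pd.DataFrame({'State': ['Bengluru', 'Delhi', 'Hyderabad', 'Bengluru']})

ohe = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
encoded = ohe.fit_transform(states).toarray()

# categories_ lists the categories found in each input column, in sorted order
print(ohe.categories_[0])
print(encoded.shape)        # (rows, number of categories)
print(encoded.sum(axis=1))  # each row contains exactly one 1
```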

In [41]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer

In [43]:
ohe=OneHotEncoder()
imputer=SimpleImputer()

In [45]:
info=pd.DataFrame({'Pay':[25000,48000,71000,85000,90000,55000],
                'State':['Bengluru','Delhi','Hyderabad','Bengluru','Hyderabad','Bengluru'],
                 'Gender':['Male','Female','Female','Female','Male','Male'],
                'Exp':[1,3,5,6,9,None]})
info

Unnamed: 0,Pay,State,Gender,Exp
0,25000,Bengluru,Male,1.0
1,48000,Delhi,Female,3.0
2,71000,Hyderabad,Female,5.0
3,85000,Bengluru,Female,6.0
4,90000,Hyderabad,Male,9.0
5,55000,Bengluru,Male,


In [57]:
#OneHotEncoder: applied to the State and Gender columns, which are categorical. It converts them into binary (0 or 1) columns.

#SimpleImputer: applied to the Exp (Experience) column.
#It will impute (fill in) any missing values using a simple strategy (mean, median, most frequent, or a constant).
#You can specify the imputation strategy with the strategy argument; the default is the mean.

#Remainder: remainder='passthrough' ensures that any columns not listed above (here, Pay) are passed through unchanged.

ct=make_column_transformer(
    (ohe,['State','Gender']),
    (imputer,['Exp']),
    remainder='passthrough') #Passthrough to keep all the columns

In [59]:
info_encoded=pd.DataFrame(ct.fit_transform(info))
info_encoded

Unnamed: 0,0,1,2,3,4,5,6
0,1.0,0.0,0.0,0.0,1.0,1.0,25000.0
1,0.0,1.0,0.0,1.0,0.0,3.0,48000.0
2,0.0,0.0,1.0,1.0,0.0,5.0,71000.0
3,1.0,0.0,0.0,1.0,0.0,6.0,85000.0
4,0.0,0.0,1.0,0.0,1.0,9.0,90000.0
5,1.0,0.0,0.0,0.0,1.0,4.8,55000.0


In [61]:
#Columns 0-2
#State_Bengluru, State_Delhi, and State_Hyderabad become new columns,
#and for each row, the respective state column gets a 1 (for that state) and 0 for others.

#Columns 3-4
#The Gender column had categories Female and Male. Similarly, One-Hot Encoding creates two new columns,
#Gender_Female and Gender_Male (in sorted order), where a 1 indicates the gender of the person.

#Column 5
#The SimpleImputer was applied to Exp. With the default strategy (mean), the missing value is replaced
#by the mean of the existing Exp values: (1+3+5+6+9)/5 = 4.8.

#Column 6
#Pay is passed through unchanged. Transformed columns come first, so the passthrough column moves to the end.
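
The value 4.8 in the last row can be reproduced in isolation. Here is a minimal sketch, assuming only the Exp column from the example; it compares the default mean strategy with `strategy='median'`, and the variable names are illustrative.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Exp column from the example; None becomes NaN in a float column
exp = pd.DataFrame({'Exp': [1, 3, 5, 6, 9, None]})

mean_imputer = SimpleImputer()                     # default strategy is 'mean'
median_imputer = SimpleImputer(strategy='median')

mean_filled = mean_imputer.fit_transform(exp)
median_filled = median_imputer.fit_transform(exp)

# statistics_ stores the fill value learned for each column
print(mean_imputer.statistics_)    # mean of 1, 3, 5, 6, 9
print(median_imputer.statistics_)  # median of 1, 3, 5, 6, 9
print(mean_filled[-1, 0], median_filled[-1, 0])
```

The median is less sensitive to extreme values than the mean, which is one reason to choose it for skewed columns.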

In [63]:
new_column_names = ['State_Bengluru', 'State_Delhi', 'State_Hyderabad', 'Gender_Female', 'Gender_Male', 'Exp', 'Pay']

# Assign the new column names to the DataFrame
info_encoded.columns = new_column_names

In [65]:
info_encoded

Unnamed: 0,State_Bengluru,State_Delhi,State_Hyderabad,Gender_Female,Gender_Male,Exp,Pay
0,1.0,0.0,0.0,0.0,1.0,1.0,25000.0
1,0.0,1.0,0.0,1.0,0.0,3.0,48000.0
2,0.0,0.0,1.0,1.0,0.0,5.0,71000.0
3,1.0,0.0,0.0,1.0,0.0,6.0,85000.0
4,0.0,0.0,1.0,0.0,1.0,9.0,90000.0
5,1.0,0.0,0.0,0.0,1.0,4.8,55000.0


## Ordinal Encoding:
It is a technique used to convert categorical data into numerical data, where categories have a meaningful order or ranking. 

Unlike One-Hot Encoding, which creates separate binary columns for each category, Ordinal Encoding assigns an integer to each category based on its rank or order.

In [88]:
from sklearn.preprocessing import OrdinalEncoder

# Dataframe
info = pd.DataFrame({
    'Pay': [25000, 48000, 71000, 85000, 90000, 55000],
    'State': ['Bengaluru', 'Delhi', 'Hyderabad', 'Bengaluru', 'Hyderabad', 'Bengaluru'],
    'Gender': ['Male', 'Female', 'Female', 'Female', 'Male', 'Male'],
    'Exp': [1.0, 3.0, 5.0, 6.0, 9.0, None]
})

In [90]:
# Initialize OrdinalEncoder
encoder = OrdinalEncoder()

ct=make_column_transformer(
    (encoder,['State','Gender']),
    remainder='passthrough')

# Apply Ordinal Encoding to the 'State' and 'Gender' columns
info_encoded = pd.DataFrame(ct.fit_transform(info))

In [None]:
#Display the transformed dataframe
info_encoded

In [98]:
new_column_names = ['State_encoded','Gender_encoded','Pay','Exp']

# Assign the new column names to the DataFrame
info_encoded.columns = new_column_names

info_encoded

Unnamed: 0,State_encoded,Gender_encoded,Pay,Exp
0,0.0,1.0,25000.0,1.0
1,1.0,0.0,48000.0,3.0
2,2.0,0.0,71000.0,5.0
3,0.0,0.0,85000.0,6.0
4,2.0,1.0,90000.0,9.0
5,0.0,1.0,55000.0,


In [100]:
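# Bengaluru gets a 0, Delhi gets a 1, and Hyderabad gets a 2.
# Male gets a 1 and Female gets a 0.
# By default OrdinalEncoder numbers the categories in sorted (alphabetical) order, not by any meaningful rank.
# Exp was passed through without imputation, so the missing value in row 5 is still empty (NaN).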

#### When to Use Ordinal Encoding:
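It is useful when the categories have an inherent order (like Low, Medium, High).
It should not be used for nominal categories (where there is no order), like colors or names of cities. State and Gender in the example above are nominal, so that example only demonstrates the mechanics; One-Hot Encoding is the better choice for such columns.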
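
When the categories do have an order, it has to be given to the encoder explicitly, because the default numbering is alphabetical. The following is a minimal sketch, assuming a hypothetical `Rating` column with the levels Low, Medium, and High; the `categories` argument lists the levels from lowest to highest.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered column, not part of the original dataset
ratings = pd.DataFrame({'Rating': ['Low', 'High', 'Medium', 'Low']})

# Default behaviour: categories are numbered alphabetically (High, Low, Medium)
default_codes = OrdinalEncoder().fit_transform(ratings)

# Explicit order: one list of levels per encoded column, lowest first
ordered_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
ordered_codes = ordered_encoder.fit_transform(ratings)

print(default_codes.ravel())  # High=0, Low=1, Medium=2
print(ordered_codes.ravel())  # Low=0, Medium=1, High=2
```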