# Five important ways of imputing missing values 
You can use machine learning models to impute missing values in your data. This is called data imputation, and it is commonly used in data preprocessing to handle missing or incomplete data. There are several methods and models to choose from, depending on your data and its missing values.
1. **`Simple Imputation Techniques:`**
  - `mean/median imputation:` Replace the missing values with the mean or median of the column, suitable for numeric data.
  - `mode:` Replace the missing values with the mode of the column or the most frequent value, suitable for categorical data.
2. **`KNN:`** Impute each missing value from the most similar rows (the nearest neighbours).
3. **`Regression imputation:`** Use a regression model to predict the missing values from the other columns.
4. **`Decision Trees and Random forest:`** Some tree-based implementations can handle missing values natively. Tree models can also be trained to predict the missing values from the patterns learned in the rest of the data.
5. **`Advanced Techniques:`**
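  - `Multiple imputation by chained equations (MICE):` Model each column that has missing values as a function of the other columns, cycling through the columns over several rounds.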
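
Which method fits depends on how much is missing and in which columns. A minimal sketch for getting that overview, assuming a small made-up frame (the values below are invented for illustration and are not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "embarked": ["S", "C", None, "S"],
})

# Count and percentage of missing values per column
missing_summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})
print(missing_summary)
```

A column with only a few gaps is usually a candidate for the simple techniques, while a heavily incomplete column may be better dropped or modelled.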

## Simple imputation techniques

### 1. mean/median imputation

In [8]:
import pandas as pd 
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt 
# load the dataset 
df = sns.load_dataset('titanic')
# mean imputation 
# to find the missing values in the data 
print(df.isnull().sum())
# filling missing values with mean of the age column
df['age']= df['age'].fillna(df['age'].mean())
print('-----------------')
print(df.isnull().sum())


survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
-----------------
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64


#### median imputation

In [7]:
# filling missing values in the age column with median 
df1= sns.load_dataset('titanic')
print(df1.isnull().sum())
# use the median of df1 itself (df was already mean-imputed above)
df1['age']= df1['age'].fillna(df1['age'].median())
print('-----------------')
print(df1.isnull().sum())

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
-----------------
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
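
The choice between mean and median matters most when a column has extreme values. A small sketch on an invented series (not the Titanic data) shows how a single outlier pulls the mean fill value away from the bulk of the data while the median stays put:

```python
import numpy as np
import pandas as pd

# Toy column with one extreme value, invented for illustration
s = pd.Series([20.0, 22.0, 24.0, 26.0, 200.0, np.nan])

mean_filled = s.fillna(s.mean())      # the mean is pulled up by the outlier
median_filled = s.fillna(s.median())  # the median stays near the bulk of the data

# Compare the value that was written into the missing slot
print(mean_filled.iloc[-1], median_filled.iloc[-1])
```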


#### Mode imputation

In [11]:
# using mode to fill missing values 
print(df['embark_town'].isnull().sum())
df['embark_town']= df['embark_town'].fillna(df['embark_town'].mode()[0])
print("-------------")
print(df['embark_town'].isnull().sum())

0
-------------
0


In [12]:
print(df['embarked'].isnull().sum())
df['embarked']= df['embarked'].fillna(df['embarked'].mode()[0])
print('------------')
print(df['embarked'].isnull().sum())

2
------------
0
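
The same simple strategies are also available through scikit-learn's `SimpleImputer`, which is convenient when the fill values have to be learned on training data and reused later. A minimal sketch, assuming a tiny made-up frame rather than the Titanic data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "embarked": pd.Series(["S", "C", np.nan, "S"], dtype=object),
})

num_imputer = SimpleImputer(strategy="median")         # numeric column
cat_imputer = SimpleImputer(strategy="most_frequent")  # categorical column

# fit_transform returns a 2D array, so flatten it before assigning to one column
toy["age"] = num_imputer.fit_transform(toy[["age"]]).ravel()
toy["embarked"] = cat_imputer.fit_transform(toy[["embarked"]]).ravel()
print(toy)
```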


### 2. K-Nearest Neighbours (KNN) Imputation

In [15]:
from sklearn.impute import KNNImputer
# load the dataset 
df2= sns.load_dataset('titanic')

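# note: with a single column there are no other features to measure similarity on,
# so KNNImputer falls back to the column mean; pass several numeric columns to use neighbours
ki= KNNImputer(n_neighbors=4)
# fit_transform returns a 2D array, so flatten it before assigning to one column
df2['age']= ki.fit_transform(df2[['age']]).ravel()
df2['age'].isnull().sum()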

0
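
KNN imputation relies on other columns to decide which rows are similar. A minimal sketch with two numeric columns, assuming invented toy values (not the Titanic data), where the missing age is filled from the two rows with the closest fare:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data, invented for illustration: two clear groups of passengers
toy = pd.DataFrame({
    "age":  [20.0, 22.0, np.nan, 60.0, 62.0],
    "fare": [7.0, 8.0, 7.5, 80.0, 82.0],
})

knn = KNNImputer(n_neighbors=2)
# Wrap the returned array back into a DataFrame with the original labels
imputed = pd.DataFrame(knn.fit_transform(toy), columns=toy.columns, index=toy.index)
print(imputed)
```

Because the method is distance based, columns on very different scales can dominate the neighbour search, so scaling the features first is often worth considering.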

### 3. Regression Imputation 

In [20]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer 
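imputer = IterativeImputer(max_iter=10)
df= sns.load_dataset('titanic')
# note: with only one column the imputer has no predictors to regress on,
# so the result is just the initial (mean) fill; pass several columns for regression imputation
df['age']= imputer.fit_transform(df[['age']]).ravel()
df.head()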

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
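
Regression imputation needs at least one other column to predict from. The sketch below, on synthetic data invented for illustration, passes two related numeric columns to `IterativeImputer`, which regresses each incomplete column on the others in rounds (similar in spirit to MICE). The column names and the linear relationship are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: age depends roughly linearly on fare (an assumption for the demo)
rng = np.random.default_rng(0)
fare = rng.uniform(5, 100, size=50)
age = 20 + 0.3 * fare + rng.normal(0, 1, size=50)
toy = pd.DataFrame({"age": age, "fare": fare})

# Knock out a few ages so there is something to impute
toy.loc[[3, 10, 25], "age"] = np.nan

imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)
print(imputed.loc[[3, 10, 25]])
```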


### 4. Imputation using the Random Forest Algorithm

In [23]:
# import libraries 
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error,mean_absolute_percentage_error,mean_squared_error,r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder 
# load the dataset 
df = sns.load_dataset('titanic')
# drop the deck column 
df.drop('deck',axis=1,inplace=True)


In [25]:
# here find out the missing values in age column 
df.isnull().sum().sort_values(ascending=False)

age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [27]:
# encode the categorical variables 
columns_to_encode= ['sex','who','class','embark_town','embarked','alive']
# use a for loop to encode all the categorical columns 
# dictionary to store the label encoder of each column 
label_encoder = {}
for col in columns_to_encode:
    # create a label encoder for each column
    le = LabelEncoder()
    # encode the categorical variable 
    df[col]= le.fit_transform(df[col])
    # store the column name and its label encoder
    label_encoder[col]= le

df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,2,0,True
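
The encode and decode round trip can be checked on a tiny made-up column. This sketch assumes an invented `sex` column and shows that `LabelEncoder` assigns integer codes in sorted class order and that `inverse_transform` recovers the labels when it is given the encoded column of the same frame:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column, invented for illustration
toy = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

le = LabelEncoder()
toy["sex_code"] = le.fit_transform(toy["sex"])   # female -> 0, male -> 1

# Decode the codes stored in this same frame, row by row
restored = le.inverse_transform(toy["sex_code"])
print(list(le.classes_), list(restored))
```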


In [28]:
# split the data into two parts: rows where age is missing and rows where it is present 
# .copy() gives independent frames, so assigning to them later does not raise a pandas warning
df_with_missing= df[df['age'].isna()].copy()
df_without_missing= df.dropna().copy()


In [None]:
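# this cell needs rf, x and y, which are defined in the cell below (In [49])
# refit the model on all rows with a known age 
rf.fit(x,y)
# predict the missing ages 
y_pred = rf.predict(df_with_missing.drop('age',axis=1))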

In [49]:
# select the features and labels 
x = df_without_missing.drop('age',axis=1)
y = df_without_missing['age']
# initialize the algorithm
rf = RandomForestRegressor(n_estimators=200,random_state=42,max_depth=3)
# split the data into training and testing data 
x_train,x_test,y_train,y_test= train_test_split(x,y,test_size=0.20)
# fit the model on the training data 
rf.fit(x_train,y_train)
# make predictions on the test data 
y_pred = rf.predict(x_test)
# evaluate the model 
print('mean_absolute_error:',mean_absolute_error(y_test,y_pred))
print('mean squared error:',mean_squared_error(y_test,y_pred))
print('Rmse',np.sqrt(mean_squared_error(y_test,y_pred)))
print('mean_absolute_percentage_error',mean_absolute_percentage_error(y_test,y_pred))
print('r2_score', r2_score(y_test,y_pred))


mean_absolute_error: 9.024734481854564
mean squared error: 134.556280470824
Rmse 11.599839674358607
mean_absolute_percentage_error 0.39582507474388084
r2_score 0.4499937406855061


In [52]:
y_pred= rf.predict(df_with_missing.drop('age',axis=1))

In [53]:
# remove warnings 
import warnings
warnings.filterwarnings('ignore')
df_with_missing['age'] = y_pred
df_with_missing.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
5,0,3,1,28.434587,0,0,8.4583,1,2,1,True,1,0,True
17,1,2,1,33.527942,0,0,13.0,2,1,1,True,2,1,True
19,1,3,0,28.697527,0,0,7.225,0,2,2,False,0,1,True
26,0,3,1,28.724138,0,0,7.225,0,2,1,True,0,0,True
28,1,3,0,28.240828,0,0,7.8792,1,2,2,False,1,1,True


In [42]:
# concatenate the two parts back into one DataFrame
df_complete= pd.concat([df_with_missing,df_without_missing],axis=0)
df_complete.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
5,0,3,1,31.572931,0,0,8.4583,1,2,1,True,1,0,True
17,1,2,1,35.400497,0,0,13.0,2,1,1,True,2,1,True
19,1,3,0,19.556944,0,0,7.225,0,2,2,False,0,1,True
26,0,3,1,34.494813,0,0,7.225,0,2,1,True,0,0,True
28,1,3,0,21.752167,0,0,7.8792,1,2,2,False,1,1,True


In [43]:
# inverse transform the encoded columns back to their original form 
# decode df_complete's own codes: its rows are in a different order than df
for col in columns_to_encode:
    le = label_encoder[col]
    df_complete[col]= le.inverse_transform(df_complete[col])

df_complete.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
5,0,3,male,31.572931,0,0,8.4583,Q,Third,man,True,Queenstown,no,True
17,1,2,male,35.400497,0,0,13.0,S,Second,man,True,Southampton,yes,True
19,1,3,female,19.556944,0,0,7.225,C,Third,woman,False,Cherbourg,yes,True
26,0,3,male,34.494813,0,0,7.225,C,Third,man,True,Cherbourg,no,True
28,1,3,female,21.752167,0,0,7.8792,Q,Third,woman,False,Queenstown,yes,True
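
The split, fit, predict and recombine steps can also be wrapped into one function. The sketch below is a hypothetical helper (`rf_impute` is not part of any library) run on synthetic data invented for illustration. It writes the predictions back with `.loc`, so the rows keep their original order and no concatenation is needed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def rf_impute(df, target, features, random_state=42):
    """Hypothetical helper: fill NaNs in `target` with random forest predictions."""
    out = df.copy()
    missing = out[target].isna()
    if not missing.any():
        return out
    model = RandomForestRegressor(n_estimators=100, random_state=random_state)
    # train only on rows where the target is known
    model.fit(out.loc[~missing, features], out.loc[~missing, target])
    # write predictions back in place, so the original row order is kept
    out.loc[missing, target] = model.predict(out.loc[missing, features])
    return out


# Synthetic data, invented for illustration (not the Titanic dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "fare": rng.uniform(5, 100, size=40),
    "pclass": rng.integers(1, 4, size=40),
})
toy["age"] = 20 + 0.3 * toy["fare"] + rng.normal(0, 2, size=40)
toy.loc[[2, 7, 19], "age"] = np.nan

filled = rf_impute(toy, "age", ["fare", "pclass"])
print(filled.loc[[2, 7, 19]])
```

The features passed in must be numeric and free of missing values themselves, which is why the categorical columns are encoded first in the walkthrough above.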
