# Five Important Ways to Impute Missing Values

You can impute missing values using machine learning models. This process is known as data imputation and is commonly used in data preprocessing to handle missing or incomplete data. There are several methods and models you can use, depending on the nature of your data and the missing values:

1. Simple Imputation Techniques:
   1. Mean/Median Imputation: Replace missing values with the mean or median of the column. Suitable for numerical data.
   2. Mode Imputation: Replace missing values with the mode (most frequent value) of the column. Useful for categorical data.
2. K-Nearest Neighbors (KNN): This algorithm can be used to impute missing values based on the similarity of rows.
3. Regression Imputation: Use a regression model to predict the missing values based on other variables in your dataset.
4. Decision Trees and Random Forests: Some implementations can handle missing values natively. They can also be used to predict missing values based on the patterns learned from the other data.
5. Advanced Techniques:
   1. Multiple Imputation by Chained Equations (MICE): This is a more sophisticated technique that models each variable with missing values as a function of other variables in a round-robin fashion.
   2. Deep Learning Methods: Neural networks, especially autoencoders, can be effective in imputing missing values in complex datasets.
6. Time-Series-Specific Methods: For time-series data, you might use techniques like interpolation, forward-fill, or backward-fill.
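
The simple and time-series techniques from the list can be sketched in a few lines of pandas. The toy DataFrame and series below are made up purely for illustration, and the variable names are assumptions, not part of the Titanic walkthrough that follows.

```python
import numpy as np
import pandas as pd

# Toy data (invented for this sketch): one numeric and one categorical column.
toy = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0, np.nan],
    "city": ["Oslo", "Rome", None, "Rome", "Rome"],
})

# Mean / median imputation for a numeric column.
height_mean = toy["height"].fillna(toy["height"].mean())
height_median = toy["height"].fillna(toy["height"].median())

# Mode imputation for a categorical column
# (mode() returns a Series, so take its first entry).
city_mode = toy["city"].fillna(toy["city"].mode()[0])

# Time-series fills on a small daily series.
ts = pd.Series(
    [1.0, np.nan, np.nan, 4.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)
ts_ffill = ts.ffill()         # carry the last observation forward
ts_bfill = ts.bfill()         # use the next observation; the trailing gap has none
ts_linear = ts.interpolate()  # linear interpolation between known points

print(height_median.tolist())
print(city_mode.tolist())
print(ts_ffill.tolist())
```

Mean and median fills leave the column's center roughly unchanged but shrink its variance, which is one reason the model-based methods below are often preferred when many values are missing.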
   


# 2. K-Nearest Neighbors (KNN)

KNN is a machine learning algorithm that can be used for imputing missing values. For a row with a missing value, it finds the k most similar rows based on the other available features. The missing value is then imputed with the mean or median of the values from those most similar rows.
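
To make the neighbor-averaging idea concrete, here is a minimal sketch on a small toy array (the numbers are illustrative only). It uses scikit-learn's `KNNImputer` with two neighbors, which fills each gap with the mean of that feature over the two nearest rows that have it.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with two missing entries (illustrative values).
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

toy_imputer = KNNImputer(n_neighbors=2)
X_filled = toy_imputer.fit_transform(X)

# Row 0: its two nearest rows are rows 1 and 2, whose third-column
# values are 3 and 5, so the gap is filled with their mean, 4.
# Row 2: its two nearest rows are rows 1 and 3, whose first-column
# values are 3 and 8, so the gap is filled with 5.5.
print(X_filled)
```

Distances are computed only over the features that both rows have observed, so rows need at least some shared non-missing features for the neighbor search to be meaningful.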

Let's see how to implement KNN imputation in Python using the Titanic dataset.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


In [2]:
# load the dataset
df = sns.load_dataset("titanic")
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [5]:
# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [7]:
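# impute missing values using the KNN imputer
from sklearn.impute import KNNImputer

# create the imputer with number of neighbors = 4
imputer = KNNImputer(n_neighbors=4)

# KNN needs other features to measure similarity between rows, so pass several
# numeric columns; given 'age' alone there is nothing to compare, and the
# imputer would effectively fall back to the column mean
num_cols = ['age', 'pclass', 'sibsp', 'parch', 'fare']
df[num_cols] = imputer.fit_transform(df[num_cols])

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)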
 

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

# 3. Regression Imputation

Regression imputation uses a regression model to predict the missing values based on other variables in the dataset. It is most natural for numerical columns; for categorical columns the same idea applies with a classifier in place of the regressor.
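
The idea can be sketched by hand before reaching for a dedicated imputer: fit a model on the rows where the target is known, then predict it for the rows where it is missing. The data below is synthetic (constructed so that `y = 2 * x + 1`), and the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: y follows 2 * x + 1, with two values hidden.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [3.0, 5.0, np.nan, 9.0, np.nan],
})

known = toy["y"].notna()

# Fit on the rows where y is observed.
model = LinearRegression().fit(toy.loc[known, ["x"]], toy.loc[known, "y"])

# Predict y for the rows where it is missing and write the values back.
toy.loc[~known, "y"] = model.predict(toy.loc[~known, ["x"]])

print(toy)
```

Because the imputed values lie exactly on the fitted line, plain regression imputation tends to understate the natural scatter in the data; that limitation is part of what motivates the iterative and multiple-imputation methods later in this document.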

Let's see how to implement regression imputation in Python using the Titanic dataset.

In [8]:
# load the dataset 
df = sns.load_dataset('titanic')

# check the missing values
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [10]:
# impute missing values with a regression model
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# create the IterativeImputer with max_iter = 10
imputer = IterativeImputer(max_iter=10)

# the regression needs predictor columns, so pass several numeric features;
# given 'age' alone there is nothing to regress on and the result is just
# the initial mean fill
num_cols = ['age', 'pclass', 'sibsp', 'parch', 'fare']
df[num_cols] = imputer.fit_transform(df[num_cols])

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

# 4. Random Forests for Imputing Missing Values

Some random forest implementations can handle missing values natively. Random forests can also be used to predict missing values based on the patterns learned from the other data.

Let's see how to implement random forest imputation in Python using the Titanic dataset.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, mean_absolute_percentage_error
from sklearn.impute import SimpleImputer

# load the dataset 
df = sns.load_dataset('titanic')

# check the missing values
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [2]:
# remove the deck column
df.drop('deck', axis=1, inplace=True)
# check the missing values
df.isnull().sum().sort_values(ascending=False)

age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [3]:
df['embark_town'].value_counts()

embark_town
Southampton    644
Cherbourg      168
Queenstown      77
Name: count, dtype: int64

In [4]:
df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'embark_town', 'alive',
       'alone'],
      dtype='object')

In [5]:
# encode the data using label encoding
from sklearn.preprocessing import LabelEncoder

# columns to encode
columns_to_encode = ['sex', 'embarked', 'who', 'class', 'embark_town', 'alive']

# dictionary to store the LabelEncoder for each column
label_encoder = {}

# loop to apply a LabelEncoder to each column
for col in columns_to_encode:
    # create a new LabelEncoder for the column
    le = LabelEncoder()

    # fit and transform the data (the encoder is kept so the step can be inverted later)
    # note: the two missing values in embarked/embark_town receive their own integer
    # code here, which is why those columns show no missing values afterwards
    df[col] = le.fit_transform(df[col])

    # store the encoder in the dictionary
    label_encoder[col] = le

df.head()



Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,2,0,True
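
To see what the integer codes in the table mean, here is a small standalone sketch of a `LabelEncoder` round trip. The list of port letters and the variable names are illustrative assumptions; the point is that classes are sorted before being numbered, and that `inverse_transform` maps codes back to labels.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels (same letters as the 'embarked' column uses).
ports = ["S", "C", "Q", "S"]

port_encoder = LabelEncoder()
codes = port_encoder.fit_transform(ports)

# Classes are stored in sorted order; a label's code is its position there.
print(port_encoder.classes_.tolist())  # ['C', 'Q', 'S']
print(codes.tolist())                  # [2, 0, 1, 2]

# inverse_transform recovers the original labels from the codes.
restored = port_encoder.inverse_transform(codes)
print(restored.tolist())               # ['S', 'C', 'Q', 'S']
```

Keeping one fitted encoder per column, as the loop above does, is what makes it possible to restore the original labels after imputation.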


We first have to impute the missing values in the age column before we can use it to predict the missing values in the embarked and embark_town columns.

In [6]:
# split the dataset into two parts: rows with a missing age and rows without missing values
# (.copy() gives an independent DataFrame that can safely be modified later)
df_with_missing = df[df['age'].isna()].copy()
# dropna removes all rows with missing values
df_without_missing = df.dropna()

In [7]:
df_with_missing.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
embark_town      0
alive            0
alone            0
dtype: int64

In [8]:
df_without_missing.isnull().sum()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

Let's see the shape of the dataset with and without the missing values:

In [9]:
print("The shape of the original dataset is: ", df.shape)
print("The shape of the dataset with missing values removed is: ", df_without_missing.shape)
print("The shape of the dataset with missing value is: ", df_with_missing.shape)


The shape of the original dataset is:  (891, 14)
The shape of the dataset with missing values removed is:  (714, 14)
The shape of the dataset with missing value is:  (177, 14)


In [10]:
df_with_missing.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
5,0,3,1,,0,0,8.4583,1,2,1,True,1,0,True
17,1,2,1,,0,0,13.0,2,1,1,True,2,1,True
19,1,3,0,,0,0,7.225,0,2,2,False,0,1,True
26,0,3,1,,0,0,7.225,0,2,1,True,0,0,True
28,1,3,0,,0,0,7.8792,1,2,2,False,1,1,True


In [11]:
df_without_missing.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,2,0,True


In [13]:
# Random forest imputation of the age column

# split the data into x and y, using only the rows with no missing values
x = df_without_missing.drop(['age'], axis=1)
y = df_without_missing['age']

# split the data into train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42, test_size=0.2)

# fit the random forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(x_train, y_train)

# evaluate the model
y_pred = rf_model.predict(x_test)
print("RMSE for Random Forest Imputation:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 score for Random Forest Imputation:", r2_score(y_test, y_pred))
print("MAE  for Random Forest Imputation:", mean_absolute_error(y_test, y_pred))
print("MAPE  for Random Forest Imputation:", mean_absolute_percentage_error(y_test, y_pred))



RMSE for Random Forest Imputation: 11.081260589808045
R2 score for Random Forest Imputation: 0.33769388288226154
MAE  for Random Forest Imputation: 8.666661815622195
MAPE  for Random Forest Imputation: 0.40839466096086574


In [20]:
# predict the missing values
y_pred = rf_model.predict(df_with_missing.drop(['age'], axis=1))

# replace the missing values with the predicted values
# (assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# without having to silence all warnings)
df_with_missing = df_with_missing.assign(age=y_pred)

# check the missing values
df_with_missing.isnull().sum().sort_values(ascending=False)

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

In [17]:
# concatenate the two dataframes
df_complete = pd.concat([df_with_missing, df_without_missing], axis=0)

# print the shape of the complete dataframe
print("The shape of the complete Dataframe is :", df_complete.shape)

# check the first five rows of the complete dataframe
df_complete.head()

The shape of the complete Dataframe is : (891, 14)


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
5,0,3,1,32.976583,0,0,8.4583,1,2,1,True,1,0,True
17,1,2,1,35.642218,0,0,13.0,2,1,1,True,2,1,True
19,1,3,0,18.347,0,0,7.225,0,2,2,False,0,1,True
26,0,3,1,35.571486,0,0,7.225,0,2,1,True,0,0,True
28,1,3,0,20.651429,0,0,7.8792,1,2,2,False,1,1,True


In [19]:
# inverse transform of the encoded columns
for col in columns_to_encode:
    # retrieve the corresponding LabelEncoder for the column
    le = label_encoder[col]

    # inverse transform the data; use df_complete[col] (not df[col]) so the
    # decoded labels stay aligned with df_complete's reordered rows
    df_complete[col] = le.inverse_transform(df_complete[col])

# check the first five rows of the complete dataframe
df_complete.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
5,0,3,male,32.976583,0,0,8.4583,Q,Third,man,True,Queenstown,no,True
17,1,2,male,35.642218,0,0,13.0,S,Second,man,True,Southampton,yes,True
19,1,3,female,18.347,0,0,7.225,C,Third,woman,False,Cherbourg,yes,True
26,0,3,male,35.571486,0,0,7.225,C,Third,man,True,Cherbourg,no,True
28,1,3,female,20.651429,0,0,7.8792,Q,Third,woman,False,Queenstown,yes,True
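
The split, fit, predict, and recombine steps in this section can also be wrapped into one helper that writes predictions back in place, which keeps the original row order. The sketch below is a hypothetical helper (the name `impute_with_random_forest` is not from any library), demonstrated on synthetic data, and it assumes all other columns are numeric and complete.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def impute_with_random_forest(df, target, random_state=42):
    """Fill NaNs in `target` with random forest predictions.

    Hypothetical helper: assumes every other column is numeric and complete.
    """
    result = df.copy()
    missing = result[target].isna()
    if not missing.any():
        return result

    features = result.columns.drop(target)
    model = RandomForestRegressor(n_estimators=100, random_state=random_state)

    # Train on rows where the target is observed.
    model.fit(result.loc[~missing, features], result.loc[~missing, target])

    # Predict only for the rows where it is missing; .loc keeps row alignment.
    result.loc[missing, target] = model.predict(result.loc[missing, features])
    return result


# Synthetic demonstration data (an assumption for this sketch).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
toy["target"] = 3 * toy["a"] - toy["b"] + rng.normal(scale=0.1, size=200)
toy.loc[toy.sample(frac=0.2, random_state=0).index, "target"] = np.nan

filled = impute_with_random_forest(toy, "target")
print(filled["target"].isna().sum())
```

Writing through a boolean mask avoids the concatenation step, so the index order of the result matches the input.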


# 5.1 Multiple Imputation by Chained Equations (MICE)

Multiple Imputation by Chained Equations (MICE) is a more sophisticated technique that models each variable with missing values as a function of the other variables in a round-robin fashion: each feature is modeled in turn, and the resulting estimates are used for imputation. As a general method, it can be applied to both categorical and numerical data.

To demonstrate MICE in Python, we can use the IterativeImputer class from the sklearn.impute module, which implements this round-robin scheme.
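
Before turning to real data, the round-robin idea can be sketched on synthetic data where the true values are known. In the example below (all names and numbers are assumptions for illustration), the second feature is roughly twice the first, some of its values are hidden, and the iterative imputer's error is compared with a plain mean fill. Note that `IterativeImputer` returns a single imputation per call; generating several imputed datasets, as full MICE does, would require running it repeatedly with `sample_posterior=True` and different seeds.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Synthetic data: x2 is about 2 * x1, so x1 is informative about x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)
X_true = np.column_stack([x1, x2])

# Hide roughly 20% of the x2 values.
X_missing = X_true.copy()
hidden = rng.random(200) < 0.2
X_missing[hidden, 1] = np.nan

# Model the incomplete feature as a function of the other feature.
mice_like = IterativeImputer(max_iter=10, random_state=0)
X_filled = mice_like.fit_transform(X_missing)

# Compare against filling every gap with the column mean.
mae_iterative = np.abs(X_filled[hidden, 1] - X_true[hidden, 1]).mean()
mae_mean_fill = np.abs(np.nanmean(X_missing[:, 1]) - X_true[hidden, 1]).mean()
print("MAE, iterative imputer:", mae_iterative)
print("MAE, mean fill:", mae_mean_fill)
```

The gap between the two errors depends on how strongly the features are related; with unrelated features the iterative imputer would have little advantage over the mean.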

Let's see how to implement MICE in Python using the Titanic dataset.