# Imputing Missing Values



# What Is a Missing Value? 🤔

A missing value is a place in the data where no information is available. This often happens when something is skipped during a survey, a form, or another data collection process. Understanding and handling missing values is an essential part of data analysis. 😊

## Causes of Missing Values 🛠️
- Someone left part of a form unfilled.
- An error occurred during data collection.
- The variable did not apply to that record at all.

## Impact of Missing Values 📉
- Analysis and results can be wrong.
- The performance of machine learning models can suffer.
- The structure of the data looks incomplete.

Missing values need to be handled thoughtfully; otherwise your analysis and predictions can be wrong. 😊

# 5 Key Ways to Impute Missing Values ✨

You can impute missing values using statistical methods and machine learning models. This process is called data imputation, and it is commonly used during data preprocessing to handle missing or incomplete data. Below are some methods and models you can choose from, depending on your data and the nature of its missing values:

## Simple Imputation Techniques 🛠️

- **Mean/Median Imputation**: Replace the missing values in a column with that column's mean or median. This suits numerical data.
- **Mode Imputation**: Replace the missing values in a column with that column's mode (the most frequent value). This is useful for categorical data.
- **K-Nearest Neighbors (KNN)**: This algorithm can impute missing values based on the similarity between rows.

## Regression Imputation 📈

- Use a regression model to predict the missing values from the other variables in the dataset.

## Decision Trees and Random Forests 🌳

- Some implementations of these models can handle missing values natively. They can also be used to predict missing values by learning patterns from the observed data.

## Advanced Techniques 🚀

- **Multiple Imputation by Chained Equations (MICE)**: An advanced technique that models each variable with missing values as a function of the other variables, cycling through the variables in a round-robin fashion.
- **Deep Learning Methods**: Neural networks, especially autoencoders, can be quite effective at imputing missing values in complex datasets.
- **Time Series Specific Methods**: If your data is a time series, use techniques such as interpolation, forward-fill, or backward-fill.
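The time-series methods in the list above can be sketched with pandas. The temperature readings and variable names below are hypothetical and exist only to show how the three fills differ.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature readings with gaps.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
temps = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan, 20.0], index=idx)

forward_filled = temps.ffill()                      # carry the last observation forward
backward_filled = temps.bfill()                     # pull the next observation backward
interpolated = temps.interpolate(method="linear")   # straight line between known points

comparison = pd.DataFrame({
    "original": temps,
    "ffill": forward_filled,
    "bfill": backward_filled,
    "linear": interpolated,
})
print(comparison)
```

Forward-fill repeats 10.0 across the first gap, backward-fill repeats 16.0, and linear interpolation gives 12.0 and 14.0. Which one is appropriate depends on how the quantity is expected to behave between observations.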

## Important Tips 📝

- Choose the method according to your data type, the missingness pattern (e.g., missing completely at random, missing at random, or missing not at random), and the amount of missing data.
- Keep in mind that imputation can introduce bias or change the distribution of your data, so apply it with care. 😊

## 1. Simple Imputation Techniques 🛠️

### 1.1. Mean/Median Imputation ✨

- **Mean/Median Imputation** means replacing the missing values in a column with that column's mean or median.
- It is simple and often effective, but it has some limitations:
    - It reduces the variance of the dataset. 📉
    - If the values are not missing at random, it can lead to biased estimates. ⚠️
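The variance point above can be checked on a small simulation. This is a sketch on synthetic numbers (not the Titanic data); the column, the 20% missing rate, and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic "age" column (an assumption for illustration).
ages = pd.Series(rng.normal(loc=30, scale=12, size=1000))

# Remove 20% of the values completely at random.
ages_missing = ages.copy()
drop_idx = rng.choice(ages.index.to_numpy(), size=200, replace=False)
ages_missing.loc[drop_idx] = np.nan

# Fill every gap with the mean of the observed values.
ages_imputed = ages_missing.fillna(ages_missing.mean())

print("Std of observed values:   ", round(ages_missing.std(), 2))
print("Std after mean imputation:", round(ages_imputed.std(), 2))
```

Every imputed value sits exactly at the mean, so the spread of the filled column is smaller than the spread of the observed values, while the mean itself is unchanged.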

Let's see how to implement mean/median imputation in Python using the Titanic dataset. 🚢

### 1.1.1 Mean Imputation 🧮

Mean imputation means replacing missing values with the column's average (mean). It is a simple, common approach that can be quite useful for numerical data. 😊

In [71]:
# import libraries
import pandas as pd 
import numpy as np

# Load Dataset
df = pd.read_csv("../data_scraping/datasets/Titanic.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [72]:
# Check Missing Values
df.isna().sum().sort_values(ascending=False)

Cabin          687
Age            177
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Fare             0
dtype: int64

We can see that the `Age` column has 177 missing values. Let's replace these missing values with the mean.

In [73]:
# Fill missing values with mean
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Check Missing Values again
df.isna().sum().sort_values(ascending=False)

Cabin          687
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
dtype: int64

We can see that the 177 missing values in the `Age` column have been filled with the column's mean.

### 1.1.2 Median Imputation 🧮
Median imputation means replacing missing values with the column's median. It is also a simple, common approach that can be quite useful for numerical data. 😊

In [74]:
# import libraries
import pandas as pd 
import numpy as np

# Load Dataset
df = pd.read_csv("../data_scraping/datasets/Titanic.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [75]:
# Check the Missing Values
df.isnull().sum().sort_values(ascending=False)

Cabin          687
Age            177
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Fare             0
dtype: int64

We can see that the `Age` column has 177 missing values. Let's replace these missing values with the median.

In [76]:
# Replace missing values with median
df["Age"] = df["Age"].fillna(df["Age"].median())

# Check Missing Values again
df.isna().sum().sort_values(ascending=False)

Cabin          687
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
dtype: int64

We can see that the 177 missing values in the `Age` column have been filled with the column's median.

### 1.1.3 Mode Imputation 🧮
Mode imputation means replacing missing values with the column's mode (the most frequent value). It suits categorical data. 😊

In [77]:
# import libraries
import pandas as pd 
import numpy as np

# Load Dataset
df = pd.read_csv("../data_scraping/datasets/Titanic.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [78]:
# Check Categorical Variables
df.select_dtypes(include=['object']).head()

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
0,"Braund, Mr. Owen Harris",male,A/5 21171,,S
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,PC 17599,C85,C
2,"Heikkinen, Miss. Laina",female,STON/O2. 3101282,,S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,113803,C123,S
4,"Allen, Mr. William Henry",male,373450,,S


In [79]:
# Check Missing Values
df.isnull().sum().sort_values(ascending=False)

Cabin          687
Age            177
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Fare             0
dtype: int64

We can see in the above cell that `Cabin` is a categorical column with 687 missing values. Let's replace these missing values with the mode.

In [80]:
# Replace missing values with mode
df["Cabin"] = df["Cabin"].fillna(df["Cabin"].mode()[0])

# Check Missing Values again
df.isna().sum().sort_values(ascending=False)

Age            177
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin            0
dtype: int64

Now the 687 missing values in the `Cabin` column have been filled with the column's mode.
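When most of a categorical column is missing, mode imputation makes a single value dominate the column. The sketch below, on hypothetical toy values rather than the real `Cabin` data, compares mode filling with a different approach: filling with an explicit placeholder category (the label `"Unknown"` is an arbitrary choice for this example).

```python
import numpy as np
import pandas as pd

# Toy categorical column (hypothetical values) where most entries are missing.
cabin = pd.Series(["C85", np.nan, np.nan, "C85", np.nan, "E46", np.nan, np.nan])

# Option 1: mode imputation, as in the cell above.
mode_filled = cabin.fillna(cabin.mode()[0])

# Option 2 (a different technique): keep "missing" visible as its own category.
placeholder_filled = cabin.fillna("Unknown")

print(mode_filled.value_counts())
print(placeholder_filled.value_counts())
```

With mode filling, `C85` goes from 2 of 8 entries to 7 of 8; the placeholder keeps the original counts and records which rows were missing.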

## 1.2. K-Nearest Neighbors

### K-Nearest Neighbors (KNN) Algorithm 🧑‍🤝‍🧑

K-Nearest Neighbors (KNN) is a simple yet powerful algorithm that can also be used to impute missing values. It works on the basis of similarity between data points. 😊

#### How Does KNN Work? 🤔
- For each missing value, KNN finds the **K nearest neighbors** of that row.
- The neighbors are chosen from data points where that value is not missing.
- The missing value is replaced with the mean, median, or mode of these neighbors' values.
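The steps above can be traced on a tiny array with scikit-learn's `KNNImputer`, which fills each gap with the mean of the nearest neighbors' values. The 4x3 array below is a toy input chosen so the result is easy to check by hand.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: two entries are missing.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each gap is filled with the mean of that column over the 2 nearest rows,
# where distance is computed on the features both rows have observed.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

The missing value in the first row becomes 4.0 (the mean of 3.0 and 5.0 from its two closest rows), and the missing value in the third row becomes 5.5 (the mean of 3.0 and 8.0).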

#### Advantages of KNN Imputation 🌟
- **Flexible**: Can work for both numerical and categorical data, although categorical columns must be encoded numerically before using scikit-learn's `KNNImputer`.
- **Pattern Preservation**: Tends to preserve the underlying patterns in the data.
- **Non-parametric**: Makes no assumption about the distribution of the data.

#### Disadvantages of KNN Imputation ⚠️
- **Computationally Expensive**: Can be slow on large datasets.
- **Effect of Outliers**: Outliers can strongly influence the result.

#### When to Use KNN Imputation? 🛠️
- When you have a **small to medium-sized dataset**.
- When the missing values are **randomly distributed**.
- When you need to preserve patterns in the data.

KNN is a great way to handle missing values, but use it thoughtfully and with computational resources in mind. 😊


In [81]:
# import libraries
import pandas as pd 
import numpy as np

# Load Dataset
df = pd.read_csv("../data_scraping/datasets/Titanic.csv")
df.head()
# Check Missing Values
print(f"Dataset with Missing Values\n{df.isna().sum().sort_values(ascending=False)}")

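# Impute missing values with KNN
from sklearn.impute import KNNImputer

# Call the Imputer
imputer = KNNImputer(n_neighbors=5)

# Use several numeric columns so that neighbors can actually be found;
# with "Age" alone there is no observed feature to measure distance on,
# and KNNImputer falls back to the column mean.
num_cols = ["Age", "Pclass", "SibSp", "Parch", "Fare"]

# Fit the imputer and keep the imputed "Age" column (first column of the result)
df["Age"] = imputer.fit_transform(df[num_cols])[:, 0]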

# Check Missing Values again
print(f"Dataset without Missing Values\n{df.isna().sum().sort_values(ascending=False)}")



Dataset with Missing Values
Cabin          687
Age            177
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Fare             0
dtype: int64
Dataset without Missing Values
Cabin          687
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
dtype: int64


## 1.3. Regression Imputation 📈
- **Regression Imputation** is a more advanced technique that uses regression models to predict missing values. 📈
- It takes advantage of the relationships between correlated variables. 🤝
- A regression model fits an equation between a dependent variable (the one with missing values) and independent variables (which are complete). 🧮
- That equation is then used to predict the missing values. 🔮
- The technique is most useful when the data has strong linear relationships. 😊
- It is useful for numerical data.
- **Limitations**: If the relationships are weak or non-linear, linear regression imputation will give poor estimates. ⚠️
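The fit-then-predict idea described above can be sketched by hand with `LinearRegression`. The toy table below is hypothetical and built so that `y = 2 * x + 1` holds exactly, which makes the imputed values easy to verify.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data where y = 2 * x + 1 exactly; two y values are missing.
df_toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [3.0, 5.0, np.nan, 9.0, np.nan, 13.0],
})

observed = df_toy["y"].notna()

# Fit the regression on the rows where y is known.
model = LinearRegression()
model.fit(df_toy.loc[observed, ["x"]], df_toy.loc[observed, "y"])

# Predict y for the rows where it is missing and write the predictions back.
df_toy.loc[~observed, "y"] = model.predict(df_toy.loc[~observed, ["x"]])

print(df_toy)
```

The two gaps are filled with 7.0 and 11.0. On real data the relationship is noisy, so the predictions fall exactly on the fitted line and understate the natural scatter of the variable.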


In [82]:
# import libraries
import pandas as pd 
import numpy as np

# Load Dataset
df = pd.read_csv("../data_scraping/datasets/Titanic.csv")
df.head()
# Check Missing Values
print(f"Dataset with Missing Values\n{df.isna().sum().sort_values(ascending=False)}")

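# impute missing values with regression imputer
from sklearn.experimental import enable_iterative_imputer  # required to enable IterativeImputer
from sklearn.impute import IterativeImputer

# call the IterativeImputer class with max_iter = 10
imputer = IterativeImputer(max_iter=10)

# give the imputer other numeric columns to regress "Age" on;
# with "Age" alone there are no predictors and it reduces to mean imputation
num_cols = ["Age", "Pclass", "SibSp", "Parch", "Fare"]

# keep the imputed "Age" column (first column of the result)
df['Age'] = imputer.fit_transform(df[num_cols])[:, 0]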

# check the number of missing values in each column
print(f"Dataset without Missing Values\n{df.isna().sum().sort_values(ascending=False)}")


Dataset with Missing Values
Cabin          687
Age            177
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Fare             0
dtype: int64
Dataset without Missing Values
Cabin          687
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
dtype: int64


## 1.4. Random Forests Imputer 🌳
- **Random Forests** is an ensemble learning technique built from decision trees. 🌳
- It can also be used to impute missing values. 😊
- The algorithm builds multiple decision trees on the data and combines them, averaging their outputs for regression or taking the majority vote for classification, to make the final prediction. 📊
- It can capture complex relationships in the data and can often predict missing values accurately. 🔍
- It is useful for both numerical and categorical data.
- **Limitations**: It can be computationally expensive and slow on large datasets. ⚠️
- With a very large number of features, Random Forests can become hard to use. ⚠️
- It is less suitable for very small datasets.

In [89]:
# import libraries
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, mean_absolute_percentage_error
from sklearn.impute import SimpleImputer
# Load Dataset
df = pd.read_csv("../data_scraping/datasets/Titanic.csv")
df.head()
# Check Missing Values
print(f"Dataset with Missing Values\n{df.isna().sum().sort_values(ascending=False)}")

Dataset with Missing Values
Cabin          687
Age            177
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Fare             0
dtype: int64


In [90]:
# drop the Cabin column because it has too many missing values
df.drop(columns=["Cabin"], inplace=True)
# Check Missing Values again
print(df.isnull().sum().sort_values(ascending=False))

Age            177
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Fare             0
dtype: int64


In [91]:
# Check the categorical variables
df.select_dtypes(include=['object'])

# Now convert the categorical variables to numerical variables using label encoding
from sklearn.preprocessing import LabelEncoder
columns_to_encode = ['Name','Sex','Ticket','Embarked']

# create a dictionary to store the label encoders
label_encoders = {}

# Loop to apply LabelEncoder to each column
for col in columns_to_encode:
    # Create a new LabelEncoder for the column
    le = LabelEncoder()

    # Fit the encoder and transform the column to integer codes
    df[col] = le.fit_transform(df[col])

    # Store the encoder in the dictionary
    label_encoders[col] = le

# Check the first few rows of the DataFrame
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,108,1,22.0,1,0,523,7.25,2
1,2,1,1,190,0,38.0,1,0,596,71.2833,0
2,3,1,3,353,0,26.0,0,0,669,7.925,2
3,4,1,1,272,0,35.0,1,0,49,53.1,2
4,5,0,3,15,1,35.0,0,0,472,8.05,2


First, we impute the missing values of `Age`.


In [92]:
# Split the dataset into two parts: rows where Age is missing, and complete rows
df_with_missing = df[df['Age'].isna()].copy()
# dropna removes all rows with missing values
df_without_missing = df.dropna()

print("The shape of the original dataset is: ", df.shape)
print("The shape of the dataset with missing values removed is: ", df_without_missing.shape)
print("The shape of the dataset with missing values is: ", df_with_missing.shape)

The shape of the original dataset is:  (891, 11)
The shape of the dataset with missing values removed is:  (714, 11)
The shape of the dataset with missing values is:  (177, 11)


In [93]:
# Train a model on the complete rows to predict Age

# split the complete rows into features X and target y (Age)
X = df_without_missing.drop(['Age'], axis=1)
y = df_without_missing['Age']

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Random Forest Imputation
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# evaluate the model
y_pred = rf_model.predict(X_test)
print("RMSE for Random Forest Imputation: ", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score for Random Forest Imputation: ", r2_score(y_test, y_pred))
print("MAE for Random Forest Imputation: ", mean_absolute_error(y_test, y_pred))
print("MAPE for Random Forest Imputation: ", mean_absolute_percentage_error(y_test, y_pred))

RMSE for Random Forest Imputation:  11.653764627586007
R2 Score for Random Forest Imputation:  0.26749108441711167
MAE for Random Forest Imputation:  9.307397902097902
MAPE for Random Forest Imputation:  0.4682002326118319


In [95]:
# check the number of missing values in each column
print(f"Dataset with Missing Values\n{df_with_missing.isna().sum().sort_values(ascending=False)}")

Dataset with Missing Values
Age            177
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Embarked         0
dtype: int64


In [102]:
# predict the missing values using the Random Forest model
predicted_values = rf_model.predict(df_with_missing.drop(['Age'], axis=1))

In [104]:
# remove warning
import warnings
warnings.filterwarnings('ignore')

# replace the missing values with the predicted values
df_with_missing['Age'] = predicted_values

# check the missing values
df_with_missing.isnull().sum().sort_values(ascending=False)

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

In [108]:
# concatenate the two dataframes
df_final = pd.concat([df_without_missing, df_with_missing], axis=0)

# print the shape of the complete dataframe
print("The shape of the complete dataframe is: ", df_final.shape)

The shape of the complete dataframe is:  (891, 11)


In [110]:
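for col in columns_to_encode:
    # Retrieve the corresponding LabelEncoder for the column
    le = label_encoders[col]

    # Inverse transform df_final's own codes so the decoded values
    # stay aligned with its row order (which differs from df after concat)
    df_final[col] = le.inverse_transform(df_final[col])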
    
# check the first 5 rows of the complete dataframe
df_final.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


In [111]:
df_final.isna().sum().sort_values(ascending=False)

Embarked       2
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
dtype: int64

# 🔍 Deep Learning Methods: Using Autoencoders to Impute Missing Values

Neural networks, especially autoencoders, can be very effective for datasets with missing data and complex structure. 💻 They are particularly helpful where traditional statistical techniques struggle because of non-linear relationships.

## 🤖 What Is an Autoencoder?

- An autoencoder is a neural network trained to reproduce its own input at its output.
- It has one or more hidden layers that create a compressed version (code) of the input.
- The network is divided into two parts:
    - **Encoder**: Compresses the input.
    - **Decoder**: Reconstructs the input from the compressed data.

## 🛠️ How Autoencoders Work for Imputation

- During training, the model is given inputs in which some values are missing.
- The network learns to predict the missing values while minimizing the reconstruction error on the known values.
- In this way the model learns a robust representation of the data. 🔄
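The training idea above can be sketched with a very small autoencoder written in plain NumPy. Everything here is an assumption made for illustration: the synthetic data, the 4-2-4 architecture, the learning rate, the number of epochs, and all variable names. In practice a deep learning framework and a tuned architecture would be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption): 4 correlated features driven by 2 latent factors.
latent = rng.normal(size=(500, 2))
X_full = latent @ rng.normal(size=(2, 4)) + 0.05 * rng.normal(size=(500, 4))

# Hide about 15% of the entries; mask is True where a value is observed.
mask = rng.random(X_full.shape) > 0.15
X = np.where(mask, X_full, np.nan)

# Standardize using observed values only, then put 0 (the column mean) in the gaps.
mu = np.nanmean(X, axis=0)
sigma = np.nanstd(X, axis=0)
X_in = np.where(mask, (X - mu) / sigma, 0.0)

# One-hidden-layer autoencoder: 4 -> 2 -> 4 with a tanh bottleneck.
W1 = rng.normal(scale=0.1, size=(4, 2))
b1 = np.zeros(2)
W2 = rng.normal(scale=0.1, size=(2, 4))
b2 = np.zeros(4)
lr = 0.05

for epoch in range(2000):
    H = np.tanh(X_in @ W1 + b1)          # encoder
    X_hat = H @ W2 + b2                  # decoder
    # Reconstruction error is counted on observed entries only.
    err = (X_hat - X_in) * mask
    grad_out = 2.0 * err / mask.sum()
    grad_W2 = H.T @ grad_out
    grad_b2 = grad_out.sum(axis=0)
    grad_pre = (grad_out @ W2.T) * (1.0 - H ** 2)   # backprop through tanh
    grad_W1 = X_in.T @ grad_pre
    grad_b1 = grad_pre.sum(axis=0)
    W1 -= lr * grad_W1
    b1 -= lr * grad_b1
    W2 -= lr * grad_W2
    b2 -= lr * grad_b2

# Keep observed values; fill gaps with the reconstruction on the original scale.
X_hat = np.tanh(X_in @ W1 + b1) @ W2 + b2
X_imputed = np.where(mask, X, X_hat * sigma + mu)

print("Missing before:", int(np.isnan(X).sum()), "| missing after:", int(np.isnan(X_imputed).sum()))
```

The loss ignores the hidden entries, so the network is only ever scored on values it can actually see; the imputed values come from what the bottleneck has learned about how the columns move together.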

## ✅ Advantages of Autoencoders

- 🌀 **Handling Complex Patterns**: They can capture non-linear relationships as well.
- 📈 **Scalability**: They can work efficiently with large datasets.
- 🧰 **Flexibility**: They can be adapted to many kinds of data (images, text, time-series).

## 🧪 Implementation Tips

- 📊 **Data Preprocessing**: Normalize or standardize the data before feeding it to the autoencoder.
- 🏗️ **Network Architecture**: The type and number of layers depend on the complexity of the data.
- 🎯 **Training Techniques**: Methods such as dropout or noise addition make the model more resilient.

## 🔍 Example Use-Cases

- 🖼️ **Image Data**: Filling in missing pixels or reconstructing corrupted images.
- 📉 **Time-Series Data**: Predicting missing values in data such as stock prices or weather readings.
- 📊 **Tabular Data**: Handling missing entries in machine learning datasets.
