<a href="https://colab.research.google.com/github/varshith2005a/Data-analytics-lab/blob/main/Imputation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Imputation

Imputation is the process of replacing missing data with substituted values.

In data science and statistics, datasets often have gaps: cells where information was not recorded or was lost. Because most machine learning algorithms cannot process blank values (NaN/Null), you must handle them. Instead of deleting the entire row or column (which loses valuable information), imputation fills these gaps with estimated values.

The goal of imputation is not just to "fill the hole," but to do so in a way that preserves the overall statistical relationships (like the mean, variance, and correlations) of the dataset.
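To make this concrete, here is a minimal sketch showing what mean filling does to a column's statistics. The ages are made-up values for illustration, not taken from Employee_data.xlsx. Filling gaps with the column mean leaves the mean unchanged but shrinks the variance, because every imputed value sits exactly at the center of the distribution.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries (illustrative values only)
ages = pd.Series([28.0, 34.0, np.nan, 29.0, 45.0, np.nan, 31.0])

# Fill every gap with the mean of the observed values
filled = ages.fillna(ages.mean())

# pandas skips NaN when computing mean and variance
print("mean before:", ages.mean(), "| after:", filled.mean())
print("variance before:", ages.var(), "| after:", filled.var())
```

The mean is preserved, but the spread is understated. That is one reason a method that looks at other columns can be preferable when variance and correlations matter.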

In [None]:
import pandas as pd
df=pd.read_excel('/content/Employee_data.xlsx')
print("Original DataFrame with Missing Values:")
print(df)

Original DataFrame with Missing Values:
    Employee_ID              Name   Department    Age    Salary  Join_Date
0           101          John Doe  Engineering   28.0   75000.0 2021-01-15
1           102        Jane Smith    Marketing   34.0   82000.0 2019-03-12
2           103      Mike Johnson  Engineering    NaN   90000.0 2020-06-01
3           104    Sarah Williams           HR   29.0   62000.0 2022-02-20
4           105      Robert Brown        Sales   45.0   55000.0 2018-11-10
5           101          John Doe  Engineering   28.0   75000.0 2021-01-15
6           106       Emily Davis    Marketing   31.0       NaN 2021-07-22
7           107      Chris Miller        Sales   22.0   48000.0 2023-01-05
8           108       Anna Taylor           HR  205.0   72000.0 2020-10-15
9           109      David Wilson  Engineering   40.0  120000.0 2015-05-19
10          110       Linda Moore        Sales   37.0   64000.0 2017-08-30
11          111    James Anderson          ???   29.0   5900

1. Simple Imputation (Univariate)

**Concept:** This is the most basic approach. You replace missing values with a summary statistic of that column (mean, median, mode) or a constant value. It treats every feature independently.

Mean: Good for normally distributed numerical data.

Median: Better if the data has outliers.

Most Frequent (Mode): Used for categorical data.
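The three strategies can be compared on a small toy example. The values below are invented for illustration and include one extreme age, similar to the 205 in the table above. This is a sketch under those assumptions, showing how far an outlier pulls the mean compared with the median.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric column with one gap and one outlier (205)
age = np.array([[28.0], [34.0], [np.nan], [29.0], [205.0]])
# Hypothetical categorical column with one gap
dept = np.array([["Sales"], ["HR"], ["Sales"], [np.nan], ["Sales"]], dtype=object)

mean_age = SimpleImputer(strategy="mean").fit_transform(age)
median_age = SimpleImputer(strategy="median").fit_transform(age)
mode_dept = SimpleImputer(strategy="most_frequent").fit_transform(dept)

print("mean fill:  ", mean_age[2, 0])    # pulled upward by the outlier
print("median fill:", median_age[2, 0])  # stays near the typical ages
print("mode fill:  ", mode_dept[3, 0])   # most common category
```

Here the mean fill is 74.0, an implausible age driven entirely by the outlier, while the median fill is 31.5.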

In [None]:
from sklearn.impute import SimpleImputer
import pandas as pd

df_simple=df.copy()

# Identify numerical columns for mean imputation
numerical_cols = df_simple.select_dtypes(include=['number']).columns

# Initialize the imputer for numerical columns
imputer=SimpleImputer(strategy='mean')
df_simple[numerical_cols]=imputer.fit_transform(df_simple[numerical_cols])

# Display the DataFrame with imputed values
print("\n---Simple Imputation (Mean)---")
print(df_simple)


---Simple Imputation (Mean)---
    Employee_ID              Name   Department       Age         Salary  \
0         101.0          John Doe  Engineering   28.0000   75000.000000   
1         102.0        Jane Smith    Marketing   34.0000   82000.000000   
2         103.0      Mike Johnson  Engineering   35.9375   90000.000000   
3         104.0    Sarah Williams           HR   29.0000   62000.000000   
4         105.0      Robert Brown        Sales   45.0000   55000.000000   
5         101.0          John Doe  Engineering   28.0000   75000.000000   
6         106.0       Emily Davis    Marketing   31.0000  108624.979167   
7         107.0      Chris Miller        Sales   22.0000   48000.000000   
8         108.0       Anna Taylor           HR  205.0000   72000.000000   
9         109.0      David Wilson  Engineering   40.0000  120000.000000   
10        110.0       Linda Moore        Sales   37.0000   64000.000000   
11        111.0    James Anderson          ???   29.0000   59000.000

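2. K-Nearest Neighbors (KNN) Imputation

Concept: This method finds the k rows (neighbors) that are most similar to the row with the missing value, measured on the features that both rows have observed. It then averages the neighbors' values for the missing feature to fill the gap.

This is often more accurate than simple imputation because it uses the relationships between features: rows that are alike on the observed columns are likely to be alike on the missing one.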

Key Parameter: n_neighbors (the number of neighbors to use).
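As a rough illustration of the neighbor averaging, the toy array below (made-up numbers) has one missing value in its second column. With n_neighbors=2, the two rows closest on the first column are the ones with 1 and 2, so the gap is filled with the average of their second-column values.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: column 1 is roughly 10x column 0; one value is missing
toy = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],   # nearest rows on column 0 are the first two
    [10.0, 100.0],
])

filled = KNNImputer(n_neighbors=2).fit_transform(toy)
print(filled[2])  # the NaN becomes the mean of 10 and 20
```

A larger n_neighbors would also pull in the distant row with 100 and move the estimate away from the local pattern. Because distances depend on feature scale, columns with very different ranges are often standardized before KNN imputation.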

In [None]:
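from sklearn.impute import KNNImputer

df_knn=df.copy()

# Each missing value is filled with the average of its 2 most similar rows
knn_imputer=KNNImputer(n_neighbors=2)
# Use knn_imputer here, not the earlier mean `imputer`: the mean imputer
# would reproduce the mean-imputation table (as in the recorded output below)
df_knn[numerical_cols]=knn_imputer.fit_transform(df_knn[numerical_cols])
print("\n ---KNN Imputation---")
print(df_knn)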


 ---KNN Imputation---
    Employee_ID              Name   Department       Age         Salary  \
0         101.0          John Doe  Engineering   28.0000   75000.000000   
1         102.0        Jane Smith    Marketing   34.0000   82000.000000   
2         103.0      Mike Johnson  Engineering   35.9375   90000.000000   
3         104.0    Sarah Williams           HR   29.0000   62000.000000   
4         105.0      Robert Brown        Sales   45.0000   55000.000000   
5         101.0          John Doe  Engineering   28.0000   75000.000000   
6         106.0       Emily Davis    Marketing   31.0000  108624.979167   
7         107.0      Chris Miller        Sales   22.0000   48000.000000   
8         108.0       Anna Taylor           HR  205.0000   72000.000000   
9         109.0      David Wilson  Engineering   40.0000  120000.000000   
10        110.0       Linda Moore        Sales   37.0000   64000.000000   
11        111.0    James Anderson          ???   29.0000   59000.000000   
12

3. Multivariate Imputation by Chained Equations (MICE)

Concept: Also known as Iterative Imputation. This is a more sophisticated method that models each feature with missing values as a function of the other features.

It first fills missing values with a placeholder (e.g., the column mean).

It then treats one column with missing values as the "target" and fits a regression model (BayesianRidge by default in scikit-learn's IterativeImputer) on the other columns to predict that column's missing entries.

It repeats this for every column with missing values, cycling through them until the imputed values converge or max_iter rounds have run.
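The steps above can be sketched by hand. This is a simplified illustration, not scikit-learn's implementation: the function name chained_impute is hypothetical, plain LinearRegression is swapped in for BayesianRidge, and it runs a fixed number of rounds instead of checking for convergence.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def chained_impute(X, n_rounds=10):
    """Simplified chained-equations imputation (illustrative sketch only)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Step 1: placeholder fill with each column's observed mean
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(missing)
    filled[rows, cols] = col_means[cols]

    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            target_missing = missing[:, j]
            if not target_missing.any():
                continue
            # Step 2: regress column j on the other columns, fitting only
            # on rows where column j was actually observed
            others = np.delete(filled, j, axis=1)
            model = LinearRegression()
            model.fit(others[~target_missing], filled[~target_missing, j])
            # Step 3: overwrite the placeholders with the predictions
            filled[target_missing, j] = model.predict(others[target_missing])
    return filled

# Hypothetical data where column 1 = 2 * column 0 + 1
toy = np.array([[1.0, 3.0], [2.0, 5.0], [3.0, np.nan], [4.0, 9.0], [5.0, 11.0]])
result = chained_impute(toy)
print(result[2])  # the gap is recovered from the linear relationship
```

Because the toy columns are perfectly linear, the regression recovers the missing value (7) almost exactly. On real data the estimate is only as good as the relationship between the columns.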

In [None]:
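# IterativeImputer is experimental, so this enabling import must come first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df_mice=df.copy()
mice_imputer=IterativeImputer(max_iter=10,random_state=0)
# Use mice_imputer here, not the earlier mean `imputer`: the mean imputer
# would reproduce the mean-imputation table (as in the recorded output below)
df_mice[numerical_cols]=mice_imputer.fit_transform(df_mice[numerical_cols])
print("\n --- MICE (Iterative) Imputation ---")
print(df_mice)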


 --- MICE (Iterative) Imputation ---
    Employee_ID              Name   Department       Age         Salary  \
0         101.0          John Doe  Engineering   28.0000   75000.000000   
1         102.0        Jane Smith    Marketing   34.0000   82000.000000   
2         103.0      Mike Johnson  Engineering   35.9375   90000.000000   
3         104.0    Sarah Williams           HR   29.0000   62000.000000   
4         105.0      Robert Brown        Sales   45.0000   55000.000000   
5         101.0          John Doe  Engineering   28.0000   75000.000000   
6         106.0       Emily Davis    Marketing   31.0000  108624.979167   
7         107.0      Chris Miller        Sales   22.0000   48000.000000   
8         108.0       Anna Taylor           HR  205.0000   72000.000000   
9         109.0      David Wilson  Engineering   40.0000  120000.000000   
10        110.0       Linda Moore        Sales   37.0000   64000.000000   
11        111.0    James Anderson          ???   29.0000   590

4. Time-Series Imputation (Forward/Backward Fill)

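Concept: If your data is a time series (ordered by time), filling with the mean is risky because it ignores trends. Instead, we propagate the last known value forward or the next valid value backward. For numerical columns, linear interpolation between the neighboring valid values is a third option.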

FFill (Forward Fill): Takes the previous valid value and fills it forward.

BFill (Backward Fill): Takes the next valid value and fills it backward.
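A short sketch on a hypothetical daily series (dates and readings invented for illustration) shows how the two fills differ from each other and from linear interpolation.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with a two-day gap
readings = pd.Series(
    [10.0, np.nan, np.nan, 40.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

forward = readings.ffill()                      # 10, 10, 10, 40
backward = readings.bfill()                     # 10, 40, 40, 40
linear = readings.interpolate(method="linear")  # 10, 20, 30, 40

print(pd.DataFrame({"ffill": forward, "bfill": backward, "linear": linear}))
```

Forward fill holds the old level until new data arrives, backward fill anticipates the next level, and linear interpolation follows the trend between the two.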

In [None]:
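df_time=df.copy()
# Forward fill: copy the previous valid value down into each gap
df_ffill=df_time.ffill()
# Backward fill: copy the next valid value up into each gap
df_bfill=df_time.bfill()

# Linear interpolation only applies to numerical columns, so restrict it to them
df_interpolate=df_time.copy()
df_interpolate[numerical_cols]=df_time[numerical_cols].interpolate(method='linear')
print("\n--- Time Series: Linear Interpolation ---")
print(df_interpolate)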



--- Time Series: Linear Interpolation ---
    Employee_ID              Name   Department    Age    Salary  Join_Date
0           101          John Doe  Engineering   28.0   75000.0 2021-01-15
1           102        Jane Smith    Marketing   34.0   82000.0 2019-03-12
2           103      Mike Johnson  Engineering   31.5   90000.0 2020-06-01
3           104    Sarah Williams           HR   29.0   62000.0 2022-02-20
4           105      Robert Brown        Sales   45.0   55000.0 2018-11-10
5           101          John Doe  Engineering   28.0   75000.0 2021-01-15
6           106       Emily Davis    Marketing   31.0   61500.0 2021-07-22
7           107      Chris Miller        Sales   22.0   48000.0 2023-01-05
8           108       Anna Taylor           HR  205.0   72000.0 2020-10-15
9           109      David Wilson  Engineering   40.0  120000.0 2015-05-19
10          110       Linda Moore        Sales   37.0   64000.0 2017-08-30
11          111    James Anderson          ???   29.0   5

