# Data Imputation
- Data imputation is the process of filling in missing values in a dataset with estimated substitutes. Common techniques include:
1. __Mean imputation:__ This involves replacing the missing values with the mean value of the corresponding feature. This method is simple, but it shrinks the feature's variance, weakens its correlations with other features, and can bias results when the values are not missing completely at random.
2. __Median imputation:__ This involves replacing the missing values with the median value of the corresponding feature. This method is robust to outliers and is usually a better choice than the mean for skewed distributions, but like mean imputation it reduces variance and ignores relationships with other features.
3. __Mode imputation:__ This involves replacing the missing values with the mode (most frequent value) of the corresponding feature. This method is suitable for categorical features, but it inflates the share of the most common category, and it is a poor fit for continuous features, where values rarely repeat.
4. __Regression imputation:__ This involves fitting a regression model that predicts the feature with missing values from the other features, then filling the gaps with its predictions. This method can work well if the feature is strongly related to the other features, but a linear model works poorly if that relationship is weak or non-linear.
5. __K-nearest neighbor imputation:__ This involves finding the k nearest neighbors of the sample with missing values and using their values (typically the average) to impute the missing values. This method can work well if similar samples tend to have similar values, but it is sensitive to the choice of k and to feature scaling, and it may not work well if the nearest neighbors are not representative of the sample.
6. __Multiple imputation:__ This involves creating several imputed datasets from a statistical model, running the analysis on each one, and pooling the results. Because the imputed values differ between datasets, the pooled estimates reflect the uncertainty caused by the missing data, but the method is more computationally intensive than single imputation.
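
To make the difference between the first three strategies concrete, here is a minimal sketch on a made-up six-value column. The toy array `X` and the `fills` dictionary are illustrative only and are not part of the diabetes example that follows. A single outlier is enough to pull the mean far away from the median and the mode.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-feature column (made up for illustration):
# one outlier (100.0) and one missing value in the last row.
X = np.array([[1.0], [2.0], [2.0], [3.0], [100.0], [np.nan]])

fills = {}
for strategy in ("mean", "median", "most_frequent"):
    imputed = SimpleImputer(strategy=strategy).fit_transform(X)
    fills[strategy] = float(imputed[-1, 0])  # value written into the missing slot

print(fills)
# {'mean': 21.6, 'median': 2.0, 'most_frequent': 2.0}
```

The mean fill (21.6) is dragged upward by the outlier, while the median and mode fills stay at 2.0, close to the bulk of the observed values.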

In [10]:
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.datasets import load_diabetes
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from plotly.subplots import make_subplots


In [4]:
# Load the diabetes dataset
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df['target'] = diabetes.target



In [17]:

def impute_data(strategy):
    # Copy the data and hide known values to simulate missingness:
    # every 2nd row of column 2 ('bmi') and every 3rd row of column 3 ('bp')
    df_missing = df.copy()
    df_missing.iloc[::2, 2] = np.nan
    df_missing.iloc[::3, 3] = np.nan

    # Impute the missing values
    if strategy == 'mean':
        imputer = SimpleImputer(strategy='mean')
    elif strategy == 'median':
        imputer = SimpleImputer(strategy='median')
    elif strategy == 'mode':
        imputer = SimpleImputer(strategy='most_frequent')
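    # NOTE: LinearRegression is a predictor, not an imputer (it has no
    # fit_transform), so it cannot be used here as a drop-in replacement.
    # elif strategy == 'regression':
    #     imputer = LinearRegression()
    elif strategy == 'knn':
        imputer = KNNImputer(n_neighbors=5)
    else:
        # Fail early instead of hitting an undefined `imputer` below
        raise ValueError(f'Unknown imputation strategy: {strategy!r}')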

    # Fit the imputer and transform the data
    imputed_data = imputer.fit_transform(df_missing)

    # Create a new DataFrame with the imputed data
    df_imputed = pd.DataFrame(imputed_data, columns=df.columns)

    return df_missing, df_imputed
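
The `'regression'` branch above is commented out, and it would not work as written, since `LinearRegression` on its own has no `fit_transform` method. One possible way to get regression-based imputation in scikit-learn is `IterativeImputer` with a `LinearRegression` estimator. This is a swapped-in technique rather than plain single-pass regression imputation: the class is still marked experimental (hence the `enable_iterative_imputer` import), and it cycles through the features with missing values over several rounds. The `df_reg*` names and the `max_iter=10` setting are illustrative choices, and the sketch reloads the data so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes()
df_reg = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df_reg['target'] = diabetes.target

# Same masking pattern as impute_data(): every 2nd bmi, every 3rd bp value
df_reg_missing = df_reg.copy()
df_reg_missing.iloc[::2, 2] = np.nan
df_reg_missing.iloc[::3, 3] = np.nan

# Each feature with gaps is regressed on all other columns, round-robin
reg_imputer = IterativeImputer(
    estimator=LinearRegression(), max_iter=10, random_state=0
)
df_reg_imputed = pd.DataFrame(
    reg_imputer.fit_transform(df_reg_missing), columns=df_reg.columns
)

print(df_reg_imputed[['bmi', 'bp']].describe())
```

The resulting `df_reg_imputed` has the same columns as `df_imputed`, so it could be passed to the same plotting code for comparison with the other strategies.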

In [18]:
def plot_imputed_data(df_missing, df_imputed, strategy):
    # Create a scatter plot of the data before imputation
    # (points with a missing bmi or bp value are not drawn)
    fig_missing = px.scatter(
        df_missing, x='bmi', y='bp', color='target', opacity=0.5, 
        title='Missing Data'
    )
    fig_missing.update_traces(marker=dict(size=7))

    # Create a scatter plot of the imputed data
    fig_imputed = px.scatter(
        df_imputed, x='bmi', y='bp', color='target', opacity=0.5, 
        title='Imputed Data'
    )
    fig_imputed.update_traces(marker=dict(size=7))

    # Combine the two plots into a single figure
    fig = make_subplots(
        subplot_titles=('Missing Data', f'{strategy} Imputed Data'), 
        rows=1, cols=2, 
        shared_xaxes=False, shared_yaxes=False, 
        horizontal_spacing=0.05, vertical_spacing=0.05,
    )
    fig.add_trace(fig_missing['data'][0], row=1, col=1)
    fig.add_trace(fig_imputed['data'][0], row=1, col=2)
    fig.update_layout(showlegend=False)
    fig.show()

In [19]:
# Test the function with different imputation strategies
strategies = [
    'mean',
    'median',
    'mode',
    # 'regression',
    'knn',
]
for strategy in strategies:
    df_missing, df_imputed = impute_data(strategy)
    plot_imputed_data(df_missing, df_imputed, strategy)
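
Multiple imputation, the last technique in the list at the top, is not covered by the loop above. A rough sketch of the idea, assuming scikit-learn's experimental `IterativeImputer` with `sample_posterior=True` as the imputation model: each random seed produces a different completed dataset, the quantity of interest (here simply the mean of `bmi`) is computed on each, and the results are pooled. The names `df_mi`, `n_imputations`, and `bmi_means` are hypothetical, and five imputations is an arbitrary choice. The sketch only computes the between-imputation part of the variance, not the full pooled variance.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

data = load_diabetes()
df_full = pd.DataFrame(data.data, columns=data.feature_names)
df_full['target'] = data.target

# Same masking pattern as impute_data()
df_mi = df_full.copy()
df_mi.iloc[::2, 2] = np.nan
df_mi.iloc[::3, 3] = np.nan

n_imputations = 5  # arbitrary choice for illustration
bmi_means = []
for seed in range(n_imputations):
    # sample_posterior=True draws each imputed value from the predictive
    # distribution of the default BayesianRidge estimator, so every seed
    # yields a different completed dataset.
    imputer = IterativeImputer(
        sample_posterior=True, max_iter=10, random_state=seed
    )
    completed = pd.DataFrame(
        imputer.fit_transform(df_mi), columns=df_mi.columns
    )
    bmi_means.append(completed['bmi'].mean())

# Pool the per-dataset estimates; their spread reflects imputation uncertainty
pooled_mean = float(np.mean(bmi_means))
between_var = float(np.var(bmi_means, ddof=1))
print(f'pooled bmi mean: {pooled_mean:.5f}')
print(f'between-imputation variance: {between_var:.2e}')
```

The spread of `bmi_means` across seeds is what single-imputation strategies such as mean or KNN cannot provide, since they produce exactly one fill per missing value.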