# Module 1: Data Analysis and Data Preprocessing

## Section 1: Handling missing data

### Part 3: IterativeImputer

In this part, we will explore the Iterative Imputer, a powerful method for handling missing data using Scikit-Learn. The Iterative Imputer is capable of imputing missing values in a dataset by modeling each feature with missing values as a function of other features and iteratively updating the imputed values. Let's dive into how Iterative Imputer works and how you can use it for data preprocessing.

### 3.1 Understanding Iterative Imputer

The Iterative Imputer uses machine learning models to predict missing values from the observed values in the dataset. It treats each feature with missing values, in turn, as a target variable and uses the other features as predictors. Missing entries are first filled with a simple initial guess (by default, the column mean). The round of per-feature regressions is then repeated until the imputed values stop changing appreciably or max_iter rounds have been run. By default, the estimator is a Bayesian Ridge regression model (BayesianRidge).
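To make the procedure concrete, here is a minimal hand-written sketch of the round-robin idea. It is a simplification for illustration only, not Scikit-Learn's actual implementation: the function name round_robin_impute, its parameters n_rounds and tol, and the toy data are all assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge


def round_robin_impute(X, n_rounds=10, tol=1e-3):
    """Simplified round-robin imputation (illustrative, not scikit-learn's code)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: start from a simple guess, the mean of each column's observed values
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_rounds):
        previous = filled.copy()
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue  # nothing to impute in this column
            others = np.delete(filled, j, axis=1)
            observed = ~missing[:, j]
            # Step 2: regress column j on the other columns, using rows where j is observed
            model = BayesianRidge().fit(others[observed], X[observed, j])
            # Step 3: replace the current guesses for column j with the model's predictions
            filled[missing[:, j], j] = model.predict(others[missing[:, j]])
        # Step 4: stop early once a full round barely changes the imputed values
        if np.max(np.abs(filled - previous)) < tol:
            break
    return filled


# Toy data (hypothetical): two strongly related columns with values removed from each
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 3 * a + rng.normal(scale=0.1, size=200)
X_true = np.column_stack([a, b])
X_missing = X_true.copy()
X_missing[rng.choice(200, 30, replace=False), 0] = np.nan
X_missing[rng.choice(200, 30, replace=False), 1] = np.nan

X_filled = round_robin_impute(X_missing)
print(np.isnan(X_filled).sum())  # number of missing entries left after imputation
```

The real IterativeImputer adds more on top of this loop, such as configurable imputation order, value clipping, and a tolerance scaled to the data, so treat the sketch as a mental model rather than a replacement.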

### 3.2 Using Iterative Imputer in Scikit-Learn

To use Iterative Imputer, you need to import it from Scikit-Learn and create an instance of the imputer. In the example below, the imputer is applied to a DataFrame df that contains missing values, and the result is a new DataFrame df_imputed in which the missing values have been filled with the imputed values:

```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
# IterativeImputer is experimental: this import must come before importing it
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
import matplotlib.pyplot as plt

# Generate synthetic data: y depends linearly on x, plus Gaussian noise
np.random.seed(1)
x = np.linspace(0, 10, 100)
y = 2 * x + 5 + np.random.normal(0, 2, 100)
# Introduce missing values in the 'y' feature only
missing_indices = np.random.choice(100, size=20, replace=False)
y_with_missing = y.copy()
y_with_missing[missing_indices] = np.nan
# Convert the NumPy arrays to a DataFrame
df = pd.DataFrame({'x': x, 'y': y_with_missing})

# Create a linear regression model as the estimator
regressor = LinearRegression()
# Create an IterativeImputer instance with the regressor as the estimator
imputer = IterativeImputer(estimator=regressor, max_iter=100, random_state=42)
# Fit the imputer on the DataFrame and impute the missing values (returns a NumPy array)
imputed_data = imputer.fit_transform(df)
# Convert the imputed data back to a DataFrame
df_imputed = pd.DataFrame(imputed_data, columns=df.columns)
# The imputer fits its own clones of the estimator, so fit `regressor` separately
# on the observed rows (those without missing values) to draw the regression line
observed = df.dropna()
regressor.fit(observed[['x']].values, observed['y'].values)

# Plot the observed data and the imputed values (NaN points are not drawn)
plt.scatter(df['x'], df['y'], label='Original Data with Missing Values', color='blue')
plt.scatter(df_imputed.loc[missing_indices, 'x'], df_imputed.loc[missing_indices, 'y'], label='Imputed Values', color='red', marker='x', s=100)
plt.plot(x, regressor.predict(x.reshape(-1, 1)), label='Linear Regression Line', color='green')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Iterative Imputer with Linear Regression Estimator')
plt.legend()
plt.grid(True)
plt.show()
```

In this example, we start with a DataFrame df with two columns, 'x' and 'y', in which 20 of the 100 values of 'y' are missing. We import the IterativeImputer class from Scikit-Learn's sklearn.impute module. Because the class is still experimental, sklearn.experimental.enable_iterative_imputer has to be imported first. Then, we create an instance of the IterativeImputer with a LinearRegression estimator, max_iter=100, and random_state=42 for reproducibility. Next, we use the fit_transform() method of the imputer to impute the missing values in the DataFrame df. The result is a NumPy array imputed_data with the missing values filled using the imputed values. Finally, we convert the NumPy array back to a DataFrame df_imputed with the same column names. As the plot shows, the missing values in 'y' have been filled with values that fall along the regression line fitted to the observed data, because 'x' is the only predictor available for 'y'.
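Once fitted, the imputer can also be reused on rows it has not seen before. The sketch below rebuilds a similar synthetic dataset so that it runs on its own; the names train and new_rows and the two new 'x' values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Synthetic training data with 20 missing 'y' values (same recipe as the example above)
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 2 * x + 5 + rng.normal(0, 2, 100)
y[rng.choice(100, size=20, replace=False)] = np.nan
train = pd.DataFrame({'x': x, 'y': y})

imputer = IterativeImputer(estimator=LinearRegression(), max_iter=100, random_state=42)
imputer.fit(train)
# n_iter_ reports how many imputation rounds were actually performed
print("Rounds performed:", imputer.n_iter_)

# Hypothetical new rows with unknown 'y': transform() reuses the models learned in fit()
new_rows = pd.DataFrame({'x': [2.5, 7.5], 'y': [np.nan, np.nan]})
new_imputed = pd.DataFrame(imputer.transform(new_rows), columns=new_rows.columns)
print(new_imputed)
```

Separating fit() from transform() matters in a train/test workflow: fitting the imputer on the training data only avoids leaking information from the test set into the imputed values.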

Please note that the Iterative Imputer uses machine learning models to predict the missing values, so the results depend on the chosen estimator and on the number of iterations (max_iter). With the default settings, the imputation is deterministic for a given dataset. Randomness enters when sample_posterior=True, imputation_order='random', or n_nearest_features is set, or when the estimator itself is randomized. In those cases the imputed values may vary between runs unless random_state is fixed. Always evaluate the imputed data and consider the characteristics of your dataset before making further analyses.
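The effect of random_state can be checked directly. The following sketch uses made-up data and a small helper named impute (both are assumptions for this example) to compare repeated runs with the same seed and with different seeds.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Made-up data: the third column depends on the first two, about 15% of entries removed
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=100)
X[rng.random(X.shape) < 0.15] = np.nan


def impute(seed, sample_posterior=False):
    """Run a fresh imputer (default BayesianRidge estimator) with the given seed."""
    imputer = IterativeImputer(sample_posterior=sample_posterior,
                               max_iter=10, random_state=seed)
    return imputer.fit_transform(X)


# Default settings, same seed: the two runs should agree
same_seed_equal = np.allclose(impute(42), impute(42))
# Sampling from the posterior, same seed: still reproducible
posterior_same_seed_equal = np.allclose(impute(0, True), impute(0, True))
# Sampling from the posterior, different seeds: the imputed values differ
posterior_diff_seed_equal = np.allclose(impute(0, True), impute(1, True))

print(same_seed_equal, posterior_same_seed_equal, posterior_diff_seed_equal)
```

Drawing several imputations with sample_posterior=True and different seeds is also the basis of multiple imputation, where the spread between the imputed datasets gives a sense of how uncertain the filled-in values are.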

Also note that this example uses a small dataset, and machine learning models usually need a reasonable amount of data to produce accurate estimates.
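One way to evaluate imputed values, when the true values are known, is to hide some entries on purpose and measure how well they are recovered. The sketch below does this on synthetic data and compares IterativeImputer with plain mean imputation. The helper rmse and the variable names are assumptions for this example, and on real data the true values are normally unavailable, so a check like this only works on artificially masked entries.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Synthetic data where the true 'y' values are known before some are hidden
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y_true = 2 * x + 5 + rng.normal(0, 2, 100)
hidden = rng.choice(100, size=20, replace=False)
y_missing = y_true.copy()
y_missing[hidden] = np.nan
X = np.column_stack([x, y_missing])


def rmse(imputed):
    """Root mean squared error of the imputed 'y' values on the hidden entries."""
    return float(np.sqrt(np.mean((imputed[hidden, 1] - y_true[hidden]) ** 2)))


rmse_iterative = rmse(IterativeImputer(random_state=42).fit_transform(X))
rmse_mean = rmse(SimpleImputer(strategy='mean').fit_transform(X))

print(f"IterativeImputer RMSE: {rmse_iterative:.2f}")
print(f"Mean imputation RMSE:  {rmse_mean:.2f}")
```

Because 'y' depends strongly on 'x' here, the model-based imputer should recover the hidden values more closely than the column mean does. On data where the features are only weakly related, the gap can be much smaller.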

### 3.3 Summary

The Iterative Imputer handles missing data by using machine learning models to predict missing values from the observed values in the dataset. It iteratively refines the imputed values until they converge or max_iter is reached, which lets it exploit relationships between features that simpler strategies such as mean imputation ignore.

In the next section, we will explore the missing data indicator.