## **Power Of Imputers**

Imputers in machine learning are techniques for handling missing data. Missing values are common in real-world datasets, and imputers replace them with reasonable estimates so that machine learning models can be trained without errors. Imputation keeps the dataset complete, so rows that are only partially missing do not have to be discarded.

### Why Imputers Are Important:
- **Handling Missing Data**: Many machine learning algorithms cannot handle missing values directly and will raise errors or produce suboptimal results if missing data is present.
- **Improves Model Accuracy**: Properly imputing missing values makes the dataset more complete, which can lead to better model performance.
- **Retains Data**: Imputation allows you to keep as much data as possible instead of discarding rows or columns with missing values, which could reduce the amount of information available for learning.
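
To make the data-retention point concrete, the sketch below compares dropping incomplete rows with imputing them. The small DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: 3 of the 5 rows have at least one missing value
df = pd.DataFrame({
    "age": [25, np.nan, 30, 41, 35],
    "salary": [50000, 60000, np.nan, 90000, np.nan],
})

# Option 1: discard every row that contains a missing value
dropped = df.dropna()

# Option 2: fill missing values with the column mean and keep all rows
imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df),
    columns=df.columns,
)

print(len(df), len(dropped), len(imputed))  # 5 2 5
```

Dropping rows keeps only two of the five observations, while imputation keeps all five.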

### Types of Imputers

1. **Mean/Median/Mode Imputation**:
   - **Mean**: For continuous numerical features, missing values are replaced with the mean of the non-missing values.
   - **Median**: Replaces missing values with the median, often preferred when there are outliers.
   - **Mode**: Used for categorical features, replaces missing values with the most frequent category (the mode).

   Example (using `SimpleImputer` from scikit-learn):
   ```python
   from sklearn.impute import SimpleImputer
   import numpy as np
   import pandas as pd

   # Sample data with missing values
   data = {'Age': [25, np.nan, 35, 40, np.nan], 
           'Gender': ['Male', 'Female', 'Female', np.nan, 'Male']}
   df = pd.DataFrame(data)

   # Impute missing numerical data with the column mean
   # (fit_transform returns a 2D array, so flatten it before assigning to one column)
   imputer = SimpleImputer(strategy='mean')
   df['Age'] = imputer.fit_transform(df[['Age']]).ravel()

   # Impute missing categorical data with the mode (most frequent category)
   imputer = SimpleImputer(strategy='most_frequent')
   df['Gender'] = imputer.fit_transform(df[['Gender']]).ravel()

   print(df)
   ```

2. **K-Nearest Neighbors (KNN) Imputation**:
   - Uses the K-nearest neighbors algorithm to impute missing values based on the values of the K nearest observations.
   - For each missing value, KNN finds the K most similar rows based on the other feature values and fills the gap with the average of the neighbors' non-missing values (or their mode, for categorical features). Note that scikit-learn's `KNNImputer` works on numeric features only and averages the neighbors' values.

   Example:
   ```python
   from sklearn.impute import KNNImputer
   import numpy as np
   import pandas as pd

   # Sample data with missing values
   df = pd.DataFrame({
       'Feature1': [1, 2, np.nan, 4],
       'Feature2': [5, np.nan, np.nan, 8],
       'Feature3': [10, 11, 12, 13]
   })

   # Using KNNImputer
   imputer = KNNImputer(n_neighbors=2)
   df_imputed = imputer.fit_transform(df)

   print(df_imputed)
   ```

3. **Multivariate Imputation by Chained Equations (MICE)**:
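   - Also called **Iterative Imputer**, this method models each feature with missing values as a function of the other features and imputes the features one after another, repeating the cycle over several iterations (the "chained equations"). scikit-learn's `IterativeImputer` is inspired by MICE, but by default it returns a single imputed dataset rather than multiple imputations.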
   - More advanced and can capture relationships between variables better than simpler methods like mean or mode imputation.

   Example:
   ```python
   # IterativeImputer is experimental; this import must come first to enable it
   from sklearn.experimental import enable_iterative_imputer  # noqa: F401
   from sklearn.impute import IterativeImputer
   import pandas as pd
   import numpy as np

   # Sample data with missing values
   df = pd.DataFrame({
       'Feature1': [1, 2, np.nan, 4],
       'Feature2': [5, np.nan, np.nan, 8],
       'Feature3': [10, 11, 12, 13]
   })

   # Using Iterative Imputer
   imputer = IterativeImputer()
   df_imputed = imputer.fit_transform(df)

   print(df_imputed)
   ```

4. **Constant Imputation**:
   - This method allows missing values to be filled with a constant value, often used when you want to replace missing values with a domain-specific constant.
   - For example, you can fill missing values with `0` for numerical data or `'Unknown'` for categorical data.

   Example:
   ```python
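   from sklearn.impute import SimpleImputer
   # Assumes `df` has an 'Age' column that still contains missing values
   imputer = SimpleImputer(strategy='constant', fill_value=0)
   df['Age'] = imputer.fit_transform(df[['Age']]).ravel()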
   ```
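
The strategies above can also be combined so that each column type gets its own imputer. The following is a minimal sketch using scikit-learn's `ColumnTransformer`; the column names and the choice of median for `Age` and `'Unknown'` for `Gender` are assumptions made for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical data with one numerical and one categorical column
df = pd.DataFrame({
    "Age": [25, np.nan, 35, 40, np.nan],
    "Gender": ["Male", "Female", "Female", np.nan, "Male"],
})

# Apply a different imputation strategy to each column
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Age"]),
    ("cat", SimpleImputer(strategy="constant", fill_value="Unknown"), ["Gender"]),
])

# Output columns follow the order of the transformers listed above
filled = pd.DataFrame(preprocess.fit_transform(df), columns=["Age", "Gender"])
print(filled)
```

Because the combined output mixes numbers and strings, the resulting columns have `object` dtype and may need to be cast back before further numerical work.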

### Summary of Imputation Strategies:

| Method             | When to Use                                                                 |
|--------------------|-----------------------------------------------------------------------------|
| **Mean**           | When the feature is continuous, and the data distribution is not skewed.     |
| **Median**         | When the feature is continuous, especially if the data contains outliers.    |
| **Mode**           | For categorical features.                                                   |
| **KNN Imputation** | When the missing data is correlated with other features and patterns can be learned. |
| **MICE**           | When the dataset is complex, and relationships between features are important to preserve. |
| **Constant Imputation** | When a specific value makes sense for your domain, e.g., 0 for missing numerical data or "Unknown" for missing categorical data. |
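
The mean-versus-median guidance in the table can be checked with a quick calculation. The income values below are hypothetical and chosen so that one extreme outlier dominates the mean.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical incomes: one extreme outlier and one missing value (last row)
income = np.array([[30_000.0], [32_000.0], [35_000.0], [1_000_000.0], [np.nan]])

# Value used to fill the missing entry under each strategy
mean_fill = SimpleImputer(strategy="mean").fit_transform(income)[-1, 0]
median_fill = SimpleImputer(strategy="median").fit_transform(income)[-1, 0]

print(f"mean fill:   {mean_fill:,.0f}")    # mean fill:   274,250
print(f"median fill: {median_fill:,.0f}")  # median fill: 33,500
```

The mean is pulled far above every typical income by the single outlier, while the median stays close to the bulk of the data.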

### Pros and Cons of Imputation

| Pros                                  | Cons                                                       |
|---------------------------------------|-------------------------------------------------------------|
| Retains valuable data                 | Can introduce bias (especially with simple methods like mean imputation) |
| Can improve model performance         | KNN and MICE imputers can be computationally expensive on large datasets |
| Prevents loss of data when values are missing | Can distort the data distribution                           |
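
The distortion mentioned in the table can be seen with mean imputation: every filled value sits exactly at the mean, so the spread of the feature shrinks. The sketch below uses synthetic, randomly generated data with an arbitrary 30% missing rate, both of which are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
values = rng.normal(loc=50.0, scale=10.0, size=1000)

# Hide roughly 30% of the values at random
mask = rng.random(values.size) < 0.3
observed = values.copy()
observed[mask] = np.nan

# Fill the hidden values with the mean of the observed ones
imputed = SimpleImputer(strategy="mean").fit_transform(observed.reshape(-1, 1)).ravel()

std_observed = np.nanstd(observed)  # spread of the values that were actually seen
std_imputed = imputed.std()         # spread after mean imputation

print(f"std of observed values: {std_observed:.2f}")
print(f"std after imputation:   {std_imputed:.2f}")
```

The mean is preserved, but the standard deviation after imputation is smaller than that of the observed values, which can matter for models that are sensitive to feature variance.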

### Best Practices
- Always evaluate the nature of your data and choose the imputation method carefully.
- Simple methods like mean and mode may work well for small datasets, but advanced methods like KNN or MICE may be better for larger datasets with complex patterns.
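
One common way to apply an imputer carefully is to learn its fill values from the training split only and then reuse them on the test split, which a scikit-learn `Pipeline` does automatically. The sketch below uses synthetic data; the model choice, missing rate, and split size are arbitrary assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # labels come from the complete data
X[rng.random(X.shape) < 0.1] = np.nan    # then hide roughly 10% of the entries

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The imputer's medians are computed from X_train only;
# the same medians are then used to fill X_test at prediction time
model = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print(f"test accuracy: {score:.2f}")
```

Keeping the imputer inside the pipeline also means cross-validation refits it on each training fold, so no information from held-out rows leaks into the fill values.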


Import all the required packages

In [1]:
import pandas as pd
import numpy as np

Create a dummy dataset to illustrate data imputation

In [2]:
data = {'age':[25, np.nan, 30, np.nan, 35],
        'salary': [50000, 60000, np.nan, 90000, np.nan]}

dataframe = pd.DataFrame(data)
dataframe

    age   salary
0  25.0  50000.0
1   NaN  60000.0
2  30.0      NaN
3   NaN  90000.0
4  35.0      NaN


Data Imputation via Mean, Median or Mode

In [3]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(dataframe)
imputed_df = pd.DataFrame(imputed_data, columns = dataframe.columns)
print(imputed_df)

    age        salary
0  25.0  50000.000000
1  30.0  60000.000000
2  30.0  66666.666667
3  30.0  90000.000000
4  35.0  66666.666667
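
The cell above only uses the mean. As a rough comparison, the sketch below fits `SimpleImputer` with each of the three strategies on the same dummy data and reads the learned fill values from the `statistics_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same dummy data as above, repeated so the snippet runs on its own
dataframe = pd.DataFrame({
    "age": [25, np.nan, 30, np.nan, 35],
    "salary": [50000, 60000, np.nan, 90000, np.nan],
})

# Collect the per-column fill value learned by each strategy
fill_values = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy).fit(dataframe)
    fill_values[strategy] = dict(zip(dataframe.columns, imputer.statistics_))

print(pd.DataFrame(fill_values))
```

Every observed value in this dataset is unique, so `most_frequent` is not very meaningful here; the scikit-learn documentation states that the smallest value is returned when several values are tied.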


Data Imputation via KNN Imputer

In [4]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
imputed_data = imputer.fit_transform(dataframe)
imputed_df = pd.DataFrame(imputed_data, columns = dataframe.columns)
print(imputed_df)

    age   salary
0  25.0  50000.0
1  27.5  60000.0
2  30.0  55000.0
3  27.5  90000.0
4  35.0  55000.0
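
To see how neighbors are chosen, it can help to look at the distances directly. `KNNImputer` uses a NaN-aware Euclidean distance by default, which is also available as `nan_euclidean_distances`. The sketch below reuses the values from the earlier `Feature1`/`Feature2`/`Feature3` example as a plain array.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([
    [1.0, 5.0, 10.0],
    [2.0, np.nan, 11.0],
    [np.nan, np.nan, 12.0],
    [4.0, 8.0, 13.0],
])

# Distances use only the coordinates present in both rows,
# rescaled by (total number of features / number of features used)
dist = nan_euclidean_distances(X)
print(np.round(dist[2], 2))  # distances from row 2 to every row

# Fill each gap with the average of its 2 nearest rows that have that feature
imputed = KNNImputer(n_neighbors=2).fit_transform(X)
print(imputed[2])
```

Row 2 only has the third feature, so its distances depend on that column alone. Rows 1 and 3 are equally close, so the first feature is filled with the average of 2 and 4. Only rows 0 and 3 have a value for the second feature, so they serve as its donors and the gap is filled with the average of 5 and 8.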
