
### Imputation Redux: Imputing Missing Categorical Data

You might have realized that because `OneHotEncoder` fails with missing values, and scikit-learn prefers to work with numerical data, we face a bit of a conundrum when trying to deal with missing categorical values. We also don't want to apply the numeric imputers after one-hot encoding, because there is no guarantee they will respect the implicit rule that exactly one column in each one-hot encoded group is `1`. Here are two strategies.
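Before turning to those strategies, here is a small sketch of why imputing after one-hot encoding is risky. The toy `colors` series is made up for illustration, and the missing row is marked as unknown in every one-hot column by hand; this is one way to set up the situation, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (an assumption for illustration): the last value is missing
colors = pd.Series(['red', 'blue', 'green', 'red', 'blue', 'green', np.nan], name='color')

# One-hot encode, then mark the missing row as unknown in every column
one_hot = pd.get_dummies(colors, dtype=float)
one_hot.loc[colors.isna(), :] = np.nan

# Impute each one-hot column independently
imputed = SimpleImputer(strategy='most_frequent').fit_transform(one_hot)

# A valid one-hot row sums to exactly 1
row_sums = imputed.sum(axis=1)
print(row_sums)
```

Each column is imputed on its own, and `0` is the most frequent value in every column, so the missing row ends up with no category at all. With a `mean` strategy the same row would hold fractional values instead.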

#### Using Pandas or Most Frequent Category

You can use pandas to replace nulls, using one of the methods we covered previously. Alternatively, you can use SimpleImputer with the `strategy='most_frequent'` option to impute missing values with the most frequent category in each column before one-hot encoding.


In [1]:
from sklearn.impute import SimpleImputer
import pandas as pd
import numpy as np

# Create DataFrame with missing values
df = pd.DataFrame({
    'color': ['red', 'blue', np.nan, 'red', np.nan, 'green'],
    'target': [1, 0, 1, 0, 0, 1]
})

# Impute missing values
imp = SimpleImputer(strategy='most_frequent')
df['color'] = imp.fit_transform(df[['color']])[:,0]
df

   color  target
0    red       1
1   blue       0
2    red       1
3    red       0
4    red       0
5  green       1
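The pandas route mentioned above can be sketched roughly as follows. This assumes `fillna` is among the methods covered earlier; the `'missing'` placeholder label and the new column names are illustrative choices, not fixed conventions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', np.nan, 'red', np.nan, 'green']})

# Option 1: fill nulls with the most frequent category
most_common = df['color'].mode()[0]
df['color_mode'] = df['color'].fillna(most_common)

# Option 2: treat "missing" as a category of its own (placeholder label is an assumption)
df['color_flagged'] = df['color'].fillna('missing')

# Either filled column can now be one-hot encoded without nulls
print(pd.get_dummies(df['color_flagged']))
```

The second option keeps the fact that a value was missing visible to the model, which can matter when missingness itself carries information.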


#### Use a Different Library!

As you might imagine, others have struggled with this, so there are other libraries designed to address the problem. For instance, the `fancyimpute` package provides both a `KNN` imputer and an `IterativeImputer` you might try. Here's an example with `KNN` from `fancyimpute`.

In [2]:
!pip install fancyimpute

#### fancyimpute's K-Nearest Neighbors (KNN) Imputer

`KNN` from `fancyimpute` won't work with categorical data directly. Like scikit-learn's `KNNImputer`, it fills each gap with a numeric value computed from the nearest neighbors rather than picking a category, so imputed entries can be fractional and have to be rounded back to a valid code. To use it, first encode your data with an `OrdinalEncoder` or `LabelEncoder`, then impute, then transform the data back into the categorical values you want. Keep in mind that the integer codes have no inherent order, so the rounding step is a heuristic. This is more complicated than it should be because there is no easy way to preserve nulls in your data while encoding.



In [3]:
import pandas as pd
import numpy as np
from fancyimpute import KNN
from sklearn.preprocessing import LabelEncoder

# Create DataFrame with missing values
data = {
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Apple', None, 'Banana'],
    'Color': ['Red', 'Yellow', 'Red', None, 'Green', 'Yellow']
}

df = pd.DataFrame(data)

# Dictionary to hold LabelEncoders for each column
encoders = {}

# Replace categorical string values with numerical representations
for col in df.columns:
    le = LabelEncoder()
    not_null_mask = df[col].notnull()
    df.loc[not_null_mask, col] = le.fit_transform(df.loc[not_null_mask, col].astype(str))
    encoders[col] = le

# Use KNN to impute the missing values
knn_imputer = KNN()
df_imputed = knn_imputer.fit_transform(df)

# Round imputed values to the nearest integer code and convert to int for decoding
# Note that the rounding is necessary because the imputer returns floats, and imputed entries may be fractional
df_imputed = pd.DataFrame(np.round(df_imputed), columns=df.columns).astype(int)

# Decode imputed values back to original categorical values
for col in df.columns:
    df_imputed[col] = encoders[col].inverse_transform(df_imputed[col])

print(df_imputed)

Imputing row 1/6 with 0 missing, elapsed time: 0.000
    Fruit   Color
0   Apple     Red
1  Banana  Yellow
2  Cherry     Red
3   Apple     Red
4  Banana   Green
5  Banana  Yellow


Other strategies may be applied in a similar manner, after which you can one-hot encode your data, and proceed with additional processing!

Note that there is currently no elegant solution for imputation of categorical variables, and so if you want something more sophisticated than a SimpleImputer with a "most_frequent" strategy, you'll probably need to write some code.  However, we can turn the above method into our own "Imputer" class like this:

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder
from fancyimpute import KNN
import pandas as pd
import numpy as np

class CategoricalKNNImputer(BaseEstimator, TransformerMixin):
    """KNN imputation for the categorical columns of a pandas DataFrame."""

    def __init__(self, include_numeric=False, include_cols=None):
        self.encoders = {}
        self.knn_imputer = KNN()
        self.include_numeric = include_numeric
        self.include_cols = include_cols  # None instead of a mutable [] default

    def fit(self, X, y=None):
        if self.include_numeric:
            self.cols = X.columns.tolist()
        else:
            categorical = X.select_dtypes(include=['object', 'category']).columns.tolist()
            self.cols = categorical + list(self.include_cols or [])

        # Fit one LabelEncoder per column on its non-null values
        self.encoders = {}
        for col in self.cols:
            not_null_mask = X[col].notnull()
            if not_null_mask.sum() > 0:  # Only if there are non-null values to fit
                self.encoders[col] = LabelEncoder().fit(X.loc[not_null_mask, col].astype(str))
        return self

    def transform(self, X):
        X_original = X.copy()
        X = X.copy()

        # Replace non-null categories with integer codes; nulls are left for the imputer
        for col in self.cols:
            if col in self.encoders:  # Only if encoder exists
                not_null_mask = X[col].notnull()
                X[col] = X[col].astype(object)  # lets 'category' columns accept integer codes
                X.loc[not_null_mask, col] = self.encoders[col].transform(
                    X.loc[not_null_mask, col].astype(str))

        # Keep the original index so the column assignment below aligns row by row
        X_imputed = pd.DataFrame(self.knn_imputer.fit_transform(X),
                                 columns=X.columns, index=X.index)

        for col in self.cols:
            if col in self.encoders:  # Only if encoder exists
                codes = np.round(X_imputed[col]).astype(int)  # Round to valid integer codes
                X_imputed[col] = self.encoders[col].inverse_transform(codes)

        if not self.include_numeric:
            # Restore the columns that were not selected for imputation
            replacements = [x for x in X.columns if x not in self.cols]
            X_imputed[replacements] = X_original[replacements]

        return X_imputed



The Python details may go beyond what we've covered so far, but you should be able to recognize roughly what's going on here: we're building a component that works with scikit-learn to do KNN-based imputation on categorical columns. You can apply it just like other scikit-learn components, using `fit` and `transform`.
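To make the `fit`/`transform` pattern easier to see, here is a stripped-down sketch of the same kind of component. It swaps KNN for a plain most-frequent fill so that it runs with only pandas and scikit-learn. `ModeImputer`, its `cols` parameter, and the toy frames are hypothetical names invented for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ModeImputer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: fill nulls in the given columns with the training mode."""

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        # Learn one fill value per column from the data passed to fit
        # (assumes each listed column has at least one non-null value)
        self.fill_values_ = {col: X[col].mode().iloc[0] for col in self.cols}
        return self

    def transform(self, X):
        # Apply the learned values; columns not listed are left untouched
        return X.fillna(self.fill_values_)

train = pd.DataFrame({'color': ['red', 'red', 'blue', np.nan],
                      'size': [1.0, 2.0, np.nan, 4.0]})
new = pd.DataFrame({'color': [np.nan, 'green'], 'size': [5.0, np.nan]})

imputer = ModeImputer(cols=['color']).fit(train)
result = imputer.transform(new)
print(result)
```

Everything learned from the data lives in `fit`, and `transform` only applies it, which is what lets the same fitted object be reused on new data.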

### Important Considerations

When training machine learning models with imputed data, it's crucial to follow best practices to ensure the robustness and generalizability of your models. Here’s a guide that covers considerations like data leakage, when to use imputed data, and other relevant aspects:

#### 1. Data Splitting
Always split your dataset into training, validation (optional), and test sets before any imputation to avoid data leakage. Leakage occurs when information from the validation/test sets is used to inform any part of the modeling process, leading to overly optimistic performance estimates.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

#### 2. Imputation
Fit the imputer on the training set only, then apply that fitted imputer to every set:
   - Fit the imputer on the training set.
   - Transform both the training and test sets (and the validation set, if you have one).

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train)  # Fit only on the training set

X_train_imputed = imputer.transform(X_train)  # Transform the training set
X_test_imputed = imputer.transform(X_test)  # Transform the test set
```
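For mixed data, one way to keep the fit-on-train rule intact is to bundle imputation and encoding into a single preprocessing object. The sketch below uses a made-up toy dataset (`color`, `weight`) and assumes a most-frequent fill for the categorical column and a mean fill for the numeric one; treat it as an illustration rather than a recommended configuration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration
X = pd.DataFrame({
    'color': ['red', 'blue', np.nan, 'red', np.nan, 'green', 'red', 'blue'],
    'weight': [1.0, 2.0, 3.0, np.nan, 5.0, 6.0, np.nan, 8.0],
})
y = pd.Series([1, 0, 1, 0, 0, 1, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Categorical columns: impute first, then one-hot encode
categorical = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

preprocess = ColumnTransformer([
    ('cat', categorical, ['color']),
    ('num', SimpleImputer(strategy='mean'), ['weight']),
], sparse_threshold=0)  # always return a dense array

X_train_ready = preprocess.fit_transform(X_train)  # statistics learned from train only
X_test_ready = preprocess.transform(X_test)        # reused, never refit, on test
```

Setting `handle_unknown='ignore'` means a category that appears only in the test set is encoded as all zeros instead of raising an error.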

#### 3. When to Use Imputed Data
- When the amount of missing data is substantial, imputation can leverage the available information, which would otherwise be discarded if only complete cases are used.
- When the data are missing at random or missing completely at random, imputation can yield unbiased estimators.

#### 4. When to Avoid Imputed Data
- When missingness is related to the unobserved value itself (missing not at random), imputation might introduce bias.
- When there are very few observed cases, the imputed values rest on too little information and may overfit the training data; it's better to use complete cases if enough are available.
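The difference between these missingness mechanisms can be seen in a small simulation. The `income` variable, the 30% and 80% missingness rates, and the threshold of 60 are all arbitrary assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.normal(loc=50, scale=10, size=10_000)  # synthetic "true" values

# Missing completely at random: every value has the same 30% chance of being missing
mcar_missing = rng.random(income.size) < 0.3

# Missing not at random: high values are far more likely to be missing
mnar_missing = (income > 60) & (rng.random(income.size) < 0.8)

def mean_impute(values, missing):
    """Fill the missing entries with the mean of the observed entries."""
    filled = values.copy()
    filled[missing] = values[~missing].mean()
    return filled

bias_mcar = mean_impute(income, mcar_missing).mean() - income.mean()
bias_mnar = mean_impute(income, mnar_missing).mean() - income.mean()

print(f'Bias in the mean under MCAR: {bias_mcar:+.2f}')
print(f'Bias in the mean under MNAR: {bias_mnar:+.2f}')
```

Under the random mechanism the imputed data recover the overall mean closely, while under the value-dependent mechanism the estimate is pulled downward because the observed values are no longer representative.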

#### 5. Model Evaluation
- Evaluate model performance on the test set with imputed values, focusing on metrics relevant to your specific problem.
- Consider performing sensitivity analyses by using different imputation methods and comparing the results.
- Additionally, assess the model's performance on only complete cases in the test set, to understand how much information is gained (or lost) due to imputation.
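A sensitivity analysis along these lines might look like the sketch below. The synthetic dataset, the 10% missingness rate, and the choice of logistic regression are assumptions made only so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data with roughly 10% of the entries knocked out at random
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare imputation strategies with cross-validation on the training set only
scores = {}
for strategy in ['mean', 'median', 'most_frequent']:
    model = make_pipeline(SimpleImputer(strategy=strategy), LogisticRegression(max_iter=1000))
    scores[strategy] = cross_val_score(model, X_train, y_train, cv=5).mean()

for strategy, score in scores.items():
    print(f'{strategy}: {score:.3f}')

# Final check: the full test set versus only its complete rows
best = max(scores, key=scores.get)
final = make_pipeline(SimpleImputer(strategy=best), LogisticRegression(max_iter=1000))
final.fit(X_train, y_train)

complete = ~np.isnan(X_test).any(axis=1)
print('All test rows:     ', final.score(X_test, y_test))
print('Complete rows only:', final.score(X_test[complete], y_test[complete]))
```

Because the imputer sits inside the pipeline, each cross-validation fold fits it on that fold's training portion only, which avoids leakage between folds.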

#### 6. Other Considerations
- **Hyperparameter Tuning and Model Selection:** Conduct model selection and hyperparameter tuning using only the training set. Use techniques like cross-validation to assess model generalization on the training set before final evaluation on the test set.
- **Complex Imputation Methods:** More advanced methods such as model-based imputation or multiple imputation may provide better results, but they come with their own assumptions and computational cost.
- **Documentation:** Document all the steps involved in the imputation process, the reasons for choosing a particular imputation method, and any assumptions made.
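As one illustration of the more complex methods, scikit-learn's experimental `IterativeImputer` can approximate multiple imputation by drawing several plausible completions of the same data. The synthetic data and the choice of five draws are assumptions; this is a sketch of the idea, not a complete multiple-imputation workflow.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Synthetic numeric data where the third column depends on the first two
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Knock out roughly 10% of the entries
missing = rng.random(X.shape) < 0.1
X_missing = X.copy()
X_missing[missing] = np.nan

# Draw several imputed datasets by sampling from the predictive distribution
imputations = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X_missing)
    for seed in range(5)
]

# The spread across draws reflects uncertainty about each imputed entry
stacked = np.stack(imputations)
spread = stacked.std(axis=0)
print('Average spread over imputed entries:', spread[missing].mean())
```

Observed entries are identical across the draws, so only the imputed positions show any spread. A downstream model would normally be fit on each completed dataset and the results combined.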
