# How imputation behaves when the target column is not encoded

This code preserves the target column. It always casts the target column to `str` and then imputes.

This does not solve the following problems:
1. `IterativeImputer` strictly requires `float` input. If it can't cast the columns, it crashes. E.g., `"YES"` and `" NO"` do not work. `"1"` and `"0"` work, but the imputation is then treated as regression.
2. If there are categories in the test set that are not in the train set, it also fails (general problem).
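
The first problem can be reproduced without any imputer. The following is a minimal sketch, assuming the failure boils down to a plain cast to `float`; the helper `can_cast_to_float` is hypothetical and not part of `data_imputation_paper`.

```python
import numpy as np


def can_cast_to_float(values):
    """Return True if every value can be cast to float (hypothetical helper)."""
    try:
        np.asarray(values, dtype=object).astype(float)
        return True
    except ValueError:
        return False


# Digit strings cast cleanly, so they are silently handled as numbers.
numeric_strings = can_cast_to_float(["1", "0", "1"])
# Label strings cannot be cast, which is where the ValueError comes from.
label_strings = can_cast_to_float(["YES", " NO"])
print(numeric_strings, label_strings)
```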

In [1]:
from jenga.tasks.openml import OpenMLTask
from jenga.corruptions.generic import MissingValues
    
import pandas as pd
import numpy as np

from data_imputation_paper.imputation import SKLearnModeImputer, SKLearnIterativeImputer, SKLearnKNNImputer
from data_imputation_paper.evaluation import Evaluator

## Make deterministic

In [2]:
np.random.seed(42)

## Create example tasks

In [3]:
task = OpenMLTask(seed=42, openml_id=4552)

if task.contains_missing_values():
    raise ValueError("This would distort the evaluation because we wouldn't have a full ground truth.")

Found 14 categorical columns: ['V1', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15']
Found 2 numeric columns: ['V2', 'V16']


In [4]:
categorical_task = OpenMLTask(seed=42, openml_id=1044)

if categorical_task.contains_missing_values():
    raise ValueError("This would distort the evaluation because we wouldn't have a full ground truth.")

Found 3 categorical columns: ['P1stFixation', 'P2stFixation', 'nextWordRegress']
Found 24 numeric columns: ['lineNo', 'assgNo', 'fixcount', 'firstPassCnt', 'prevFixDur', 'firstfixDur', 'firstPassFixDur', 'nextFixDur', 'firstSaccLen', 'lastSaccLen', 'prevFixPos', 'landingPos', 'leavingPos', 'totalFixDur', 'meanFixDur', 'nRegressFrom', 'regressLen', 'regressDur', 'pupilDiamMax', 'pupilDiamLag', 'timePrtctg', 'nWordsInTitle', 'titleNo', 'wordNo']


## Insert missing values using jenga

In [5]:
numerical_missing = MissingValues(column='V2', fraction=0.5, na_value=np.nan, missingness='MCAR')
categorical_as_string_missing = MissingValues(column='V4', fraction=0.5, na_value=np.nan, missingness='MCAR')

In [6]:
categorical_missing = MissingValues(column='P2stFixation', fraction=0.5, na_value=np.nan, missingness='MCAR')

## Create Evaluator

An `Evaluator` repeatedly:
1. inserts missing values into the dataset
2. fits the imputer
3. evaluates the train and test performance of the imputation

Then it returns the mean evaluation result.
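
The loop described above can be sketched roughly as follows. This is not the `Evaluator` implementation. It is a simplified stand-in that assumes MCAR corruption of a single numeric column, a mean imputer, and MAE as the only metric. The function name `evaluate_mean_imputation` and its parameters are hypothetical.

```python
import numpy as np


def evaluate_mean_imputation(train, test, fraction=0.5, repetitions=10, seed=42):
    """Repeat corrupt -> fit -> score and return the mean MAE per split."""
    rng = np.random.default_rng(seed)
    scores = {"train": [], "test": []}
    for _ in range(repetitions):
        fill_value = None
        for name, column in (("train", train), ("test", test)):
            # 1. insert missing values completely at random (MCAR)
            mask = rng.random(column.shape[0]) < fraction
            corrupted = column.astype(float)  # astype returns a copy
            corrupted[mask] = np.nan
            # 2. fit the "imputer" on the train split only
            if name == "train":
                fill_value = np.nanmean(corrupted)
            imputed = np.where(np.isnan(corrupted), fill_value, corrupted)
            # 3. score only the cells that were removed
            scores[name].append(np.mean(np.abs(imputed[mask] - column[mask])))
    return {name: float(np.mean(values)) for name, values in scores.items()}


data = np.random.default_rng(0).normal(size=200)
result = evaluate_mean_imputation(data[:150], data[150:])
print(result)
```

Scoring only the removed cells is a design choice of this sketch: it keeps the untouched ground-truth cells from diluting the error.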

## Examples for mode imputation

In [7]:
Evaluator(task, numerical_missing, SKLearnModeImputer()).evaluate(10).result

100%|██████████| 10/10 [00:00<00:00, 60.31it/s]


| | train | test |
|---|---|---|
| MAE | 14.924881 | 14.53182 |
| MSE | 712.367137 | 642.044099 |
| RMSE | 26.686727 | 25.333848 |


In [8]:
Evaluator(categorical_task, categorical_missing, SKLearnModeImputer()).evaluate(10).result

100%|██████████| 10/10 [00:02<00:00,  4.29it/s]


| | train | test |
|---|---|---|
| F1_micro | 0.796491 | 0.803748 |
| F1_macro | 0.759048 | 0.764236 |
| F1_weighted | 0.777158 | 0.784791 |


In [9]:
Evaluator(task, categorical_as_string_missing, SKLearnModeImputer()).evaluate(10).result

100%|██████████| 10/10 [00:01<00:00,  8.26it/s]


| | train | test |
|---|---|---|
| F1_micro | 0.91357 | 0.921624 |
| F1_macro | 0.806124 | 0.812721 |
| F1_weighted | 0.901091 | 0.910407 |


## MICE imputation

In [10]:
Evaluator(task, numerical_missing, SKLearnIterativeImputer(strategy="mice", data_encoding_type="ordinal")).evaluate(10).result

100%|██████████| 10/10 [00:02<00:00,  3.46it/s]


| | train | test |
|---|---|---|
| MAE | 26.763217 | 26.145984 |
| MSE | 2138.865953 | 1991.639188 |
| RMSE | 46.246003 | 44.619189 |


In [11]:
Evaluator(task, numerical_missing, SKLearnIterativeImputer(strategy="mice", data_encoding_type="one-hot")).evaluate(10).result

100%|██████████| 10/10 [00:36<00:00,  3.67s/it]


| | train | test |
|---|---|---|
| MAE | 26.692983 | 26.072904 |
| MSE | 2131.902162 | 1983.634069 |
| RMSE | 46.167661 | 44.52978 |


In [12]:
Evaluator(task, categorical_as_string_missing, SKLearnIterativeImputer(strategy="mice", data_encoding_type="ordinal")).evaluate(10).result

  0%|          | 0/10 [00:00<?, ?it/s]


ValueError: could not convert string to float: ' NO'

In [13]:
Evaluator(task, categorical_as_string_missing, SKLearnIterativeImputer(strategy="mice", data_encoding_type="one-hot")).evaluate(10).result

  0%|          | 0/10 [00:00<?, ?it/s]


ValueError: could not convert string to float: ' NO'

In [14]:
Evaluator(categorical_task, categorical_missing, SKLearnIterativeImputer(strategy="mice", data_encoding_type="ordinal")).evaluate(10).result

  0%|          | 0/10 [00:00<?, ?it/s]


ValueError: Classification metrics can't handle a mix of binary and continuous targets

In [15]:
Evaluator(categorical_task, categorical_missing, SKLearnIterativeImputer(strategy="mice", data_encoding_type="one-hot")).evaluate(10).result

  0%|          | 0/10 [00:00<?, ?it/s]


ValueError: Classification metrics can't handle a mix of binary and continuous targets
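
The second error message can be reproduced in isolation. The sketch below assumes the evaluation ends up calling `sklearn.metrics.f1_score` with the regressor's raw output as predictions; the arrays are made-up illustration data, not values from the tasks above.

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 1, 1, 0])
# Made-up continuous values, similar to what a regression estimator returns.
y_continuous = np.array([0.1, 0.8, 0.6, 0.4])
# The same predictions expressed as class labels.
y_discrete = np.array([0, 1, 1, 0])

try:
    f1_score(y_true, y_continuous, average="micro")
    error_message = None
except ValueError as error:
    error_message = str(error)

# With discrete labels the metric is computed without complaint.
score = f1_score(y_true, y_discrete, average="micro")
print(error_message)
print(score)
```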

## missForest imputation

In [16]:
# we can pass arguments to the estimator's constructor
Evaluator(task, numerical_missing, SKLearnIterativeImputer(strategy="missforest", data_encoding_type="ordinal", estimator_args={"n_estimators": 10})).evaluate(5).result

100%|██████████| 5/5 [00:06<00:00,  1.33s/it]


| | train | test |
|---|---|---|
| MAE | 26.846602 | 26.178817 |
| MSE | 2158.266196 | 1993.500618 |
| RMSE | 46.454749 | 44.643612 |


In [17]:
# we can pass arguments to the estimator's constructor
Evaluator(task, numerical_missing, SKLearnIterativeImputer(strategy="missforest", data_encoding_type="ordinal", estimator_args={"n_estimators": 10})).evaluate(5).result

100%|██████████| 5/5 [00:06<00:00,  1.29s/it]


| | train | test |
|---|---|---|
| MAE | 26.841527 | 25.859312 |
| MSE | 2161.269771 | 1954.956929 |
| RMSE | 46.487048 | 44.206321 |


In [18]:
# one-hot encoding is the default
Evaluator(task, categorical_as_string_missing, SKLearnIterativeImputer(strategy="missforest", estimator_args={"n_estimators": 10})).evaluate(5).result

  0%|          | 0/5 [00:00<?, ?it/s]


ValueError: could not convert string to float: ' NO'

In [19]:
# one-hot encoding is the default
Evaluator(task, categorical_as_string_missing, SKLearnIterativeImputer(strategy="missforest", estimator_args={"n_estimators": 10})).evaluate(5).result

  0%|          | 0/5 [00:00<?, ?it/s]


ValueError: could not convert string to float: ' NO'

In [20]:
# we can pass arguments to the estimator's constructor
Evaluator(categorical_task, categorical_missing, SKLearnIterativeImputer(strategy="missforest", data_encoding_type="ordinal", estimator_args={"n_estimators": 10})).evaluate(5).result

  0%|          | 0/5 [00:21<?, ?it/s]


ValueError: Classification metrics can't handle a mix of binary and continuous targets

In [21]:
# one-hot encoding is the default
Evaluator(categorical_task, categorical_missing, SKLearnIterativeImputer(strategy="missforest", estimator_args={"n_estimators": 10})).evaluate(5).result

  0%|          | 0/5 [00:20<?, ?it/s]


ValueError: Classification metrics can't handle a mix of binary and continuous targets

## KNN imputation

In [22]:
# ordinal encoding
Evaluator(task, numerical_missing, SKLearnKNNImputer(data_encoding_type="ordinal")).evaluate(5).result

100%|██████████| 5/5 [00:00<00:00,  5.61it/s]


| | train | test |
|---|---|---|
| MAE | 26.785305 | 26.017652 |
| MSE | 2139.364651 | 1964.396117 |
| RMSE | 46.251761 | 44.310206 |


In [23]:
# one-hot encoding
Evaluator(task, numerical_missing, SKLearnKNNImputer(data_encoding_type="one-hot")).evaluate(5).result

100%|██████████| 5/5 [00:00<00:00,  5.38it/s]


| | train | test |
|---|---|---|
| MAE | 26.719682 | 26.028597 |
| MSE | 2142.68526 | 1961.532568 |
| RMSE | 46.28646 | 44.28363 |


In [24]:
# ordinal encoding
Evaluator(task, categorical_as_string_missing, SKLearnKNNImputer(data_encoding_type="ordinal")).evaluate(5).result

  0%|          | 0/5 [00:00<?, ?it/s]


ValueError: could not convert string to float: ' NO'

In [25]:
# one-hot encoding
Evaluator(task, categorical_as_string_missing, SKLearnKNNImputer(data_encoding_type="one-hot")).evaluate(5).result

  0%|          | 0/5 [00:00<?, ?it/s]


ValueError: could not convert string to float: ' NO'

In [26]:
# ordinal encoding
Evaluator(categorical_task, categorical_missing, SKLearnKNNImputer(data_encoding_type="ordinal")).evaluate(5).result

  0%|          | 0/5 [00:06<?, ?it/s]


ValueError: Classification metrics can't handle a mix of binary and continuous targets

In [27]:
# one-hot encoding
Evaluator(categorical_task, categorical_missing, SKLearnKNNImputer(data_encoding_type="one-hot")).evaluate(5).result

  0%|          | 0/5 [00:05<?, ?it/s]


ValueError: Classification metrics can't handle a mix of binary and continuous targets

## Missing Values in Categorical Columns and `sklearn`'s `IterativeImputer`

`sklearn`'s `IterativeImputer` can't be used for categorical columns at the moment.

**There are two types of categorical values with different difficulties:**
1. Strings: We need to encode these values to process them
2. Numerical: Both estimators (`BayesianRidge` and `RandomForestRegressor`) will treat the imputation problem as regression instead of classification.
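
The second difficulty is easy to see on synthetic data. The sketch below fits `BayesianRidge` on a made-up binary target that is stored as numbers; the features and the target are random illustration data, not columns from the tasks above.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
# A binary "category" stored as 0.0 / 1.0, as it would be after ordinal encoding.
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(float)

predictions = BayesianRidge().fit(X, y).predict(X)

# A regressor returns values on a continuum, not the two class codes.
non_class_values = np.setdiff1d(np.unique(predictions), [0.0, 1.0])
print(predictions[:5])
print(non_class_values.size)
```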

**To 1.:** Using `OrdinalEncoder` to encode the column basically turns it into a column of type 2. Using `OneHotEncoder` has the disadvantage that `sklearn`'s imputers can't find the missing values anymore, because they only search for `np.nan` or another given value (at least as far as I know).
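
How a one-hot representation hides missing values can be illustrated with a small sketch. It swaps in `pandas.get_dummies` for `OneHotEncoder`, which is an assumption made to keep the example short and independent of the `sklearn` version. The column values are made up.

```python
import numpy as np
import pandas as pd

column = pd.Series(["YES", np.nan, " NO", "YES"], name="V4")

# pandas.get_dummies stands in for a one-hot encoder in this sketch.
one_hot = pd.get_dummies(column)

missing_before = int(column.isna().sum())
missing_after = int(one_hot.isna().sum().sum())
# The formerly missing row is now a row of zeros, indistinguishable from data.
all_zero_rows = int((one_hot.sum(axis=1) == 0).sum())
print(missing_before, missing_after, all_zero_rows)
```

An imputer that looks for `np.nan` finds nothing to fill in the encoded frame, even though one row carries no category at all.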

**To 2.:** Could not find a solution yet ...


With `sklearn` v0.24, which is not published yet, we can use `OrdinalEncoder` with arguments `handle_unknown='use_encoded_value'` and `unknown_value=np.nan` to encode the categories and preserve the missing values.
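
A minimal sketch of that configuration, assuming an `sklearn` version that provides these arguments (0.24 or later). The category values are made up, and `"MAYBE"` plays the role of a category that appears only in the test set.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

train = np.array([["YES"], [" NO"], ["YES"]], dtype=object)
test = np.array([["YES"], ["MAYBE"]], dtype=object)  # "MAYBE" is unseen

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)
encoder.fit(train)

# Known categories get their ordinal code, unseen ones become np.nan.
encoded = encoder.transform(test)
print(encoder.categories_)
print(encoded)
```

The unseen category then looks like a missing value to the imputer instead of raising an error. How `np.nan` entries in the input itself are treated may differ between `sklearn` versions, so that part is worth checking against the installed release.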