<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imputing-Missing-Values" data-toc-modified-id="Imputing-Missing-Values-1">Imputing Missing Values</a></span></li><li><span><a href="#Let's-apply-the-feature-engineering-hierarchy-to-imputing-missing-values" data-toc-modified-id="Let's-apply-the-feature-engineering-hierarchy-to-imputing-missing-values-2">Let's apply the feature engineering hierarchy to imputing missing values</a></span></li><li><span><a href="#How-to-impute-missing-values:-Ad-hoc" data-toc-modified-id="How-to-impute-missing-values:-Ad-hoc-3">How to impute missing values: Ad hoc</a></span></li><li><span><a href="#How-to-impute-missing-values:-Hand-crafted-Rules" data-toc-modified-id="How-to-impute-missing-values:-Hand-crafted-Rules-4">How to impute missing values: Hand-crafted Rules</a></span></li><li><span><a href="#How-to-impute-missing-values:-Learned-Rules" data-toc-modified-id="How-to-impute-missing-values:-Learned-Rules-5">How to impute missing values: Learned Rules</a></span></li><li><span><a href="#What-is-fit_transform?" 
data-toc-modified-id="What-is-fit_transform?-6">What is fit_transform?</a></span></li><li><span><a href="#How-to-impute-missing-values:-Learned-Simple-Model" data-toc-modified-id="How-to-impute-missing-values:-Learned-Simple-Model-7">How to impute missing values: Learned Simple Model</a></span></li><li><span><a href="#KNN-Based-Missing-Value-Imputation" data-toc-modified-id="KNN-Based-Missing-Value-Imputation-8">KNN Based Missing Value Imputation</a></span></li><li><span><a href="#scikit-learn's-IterativeImputer" data-toc-modified-id="scikit-learn's-IterativeImputer-9">scikit-learn's IterativeImputer</a></span></li><li><span><a href="#Marking-imputed-values" data-toc-modified-id="Marking-imputed-values-10">Marking imputed values</a></span></li><li><span><a href="#Check-for-understanding" data-toc-modified-id="Check-for-understanding-11">Check for understanding</a></span></li><li><span><a href="#Takeaways" data-toc-modified-id="Takeaways-12">Takeaways</a></span></li><li><span><a href="#Sources-of-Inspiration" data-toc-modified-id="Sources-of-Inspiration-13">Sources of Inspiration</a></span></li><li><span><a href="#Bonus-Materials" data-toc-modified-id="Bonus-Materials-14">Bonus Materials</a></span></li><li><span><a href="#What-can-you-do-when-you-have-missing-values?" data-toc-modified-id="What-can-you-do-when-you-have-missing-values?-15">What can you do when you have missing values?</a></span></li><li><span><a href="#How-to-impute-missing-values:-Learned-Complex-Model" data-toc-modified-id="How-to-impute-missing-values:-Learned-Complex-Model-16">How to impute missing values: Learned Complex Model</a></span></li></ul></div>

<center><h2>Imputing Missing Values</h2></center>

<center><h2>Let's apply the feature engineering hierarchy to imputing missing values</h2></center>

- Ad hoc
- Hand-crafted rules
- Feature learning:
    - Rule-based models
    - Simple models
    - Complex models

<center><h2>How to impute missing values: Ad hoc</h2></center>

1. Visually inspect.
1. Try to get the missing data!
1. Given domain knowledge, guess value.
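
A quick way to support the visual inspection step is to count how many values are missing per feature and per row. This is a minimal sketch, assuming the data is a numeric NumPy array in which `np.nan` marks a missing entry; the array `X_demo` is made up for illustration.

```python
import numpy as np

# Hypothetical toy data; np.nan marks a missing entry
X_demo = np.array([[1.0, np.nan, 3.0],
                   [4.0, 5.0, np.nan],
                   [7.0, np.nan, 9.0]])

missing_mask = np.isnan(X_demo)                  # True where a value is missing
missing_per_feature = missing_mask.sum(axis=0)   # count of missing values in each column
missing_per_row = missing_mask.sum(axis=1)       # count of missing values in each row

print(missing_per_feature)
print(missing_per_row)
```

Features or rows with many missing values are good candidates for going back to the data source before guessing anything.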

<center><h2>How to impute missing values: Hand-crafted Rules</h2></center>

1. Replace with a reasonable guess based on knowledge of the underlying domain (heuristic).
1. Replace with random value sampled from the empirical distribution.
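
Sampling from the empirical distribution can be sketched with NumPy alone. The feature values and the seed below are hypothetical; the idea is simply to draw replacements, with replacement, from the values that were actually observed.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility

# Hypothetical feature column with two missing entries
feature = np.array([5.1, np.nan, 4.9, 6.3, np.nan, 5.8])

missing = np.isnan(feature)
observed = feature[~missing]         # the empirical distribution to sample from

imputed = feature.copy()
# Draw one observed value (with replacement) for each missing entry
imputed[missing] = rng.choice(observed, size=missing.sum(), replace=True)

print(imputed)
```

Unlike filling in a single constant, this approach roughly preserves the spread of the feature, at the cost of adding random noise.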

<center><h2>How to impute missing values: Learned Rules</h2></center>

Calculate a measure of central tendency from the existing values and use it to fill in the missing data:

- Median usually works best for numeric features.
- Mode usually works best for categorical features.
- Another option for categorical features: add "missing" as a new level for the feature.

The only reason to impute the mean instead is that the median is too costly to compute. For datasets like the ones used here, you can easily compute the median.
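
For categorical features, both options from the list above can be sketched with `SimpleImputer`. The `colors` array is a made-up example; `strategy="most_frequent"` fills in the mode, while `strategy="constant"` with `fill_value="missing"` adds a new level.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical feature with one missing entry
colors = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)

# Option 1: replace missing entries with the most frequent level (the mode)
mode_imp = SimpleImputer(strategy="most_frequent")
colors_mode = mode_imp.fit_transform(colors)

# Option 2: treat "missing" as its own level
level_imp = SimpleImputer(strategy="constant", fill_value="missing")
colors_level = level_imp.fit_transform(colors)

print(colors_mode.ravel())
print(colors_level.ravel())
```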

In [3]:
%reset -fs

In [4]:
from sklearn.datasets        import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [5]:
import numpy as np

# Let's replace a single actual value with a missing value
original_data_point = X_train[0][0]
X_train[0][0] = np.nan

In [6]:
from sklearn.impute import SimpleImputer

# help(SimpleImputer)

In [7]:
# Create our imputer to replace missing values with the median
imp = SimpleImputer(missing_values=np.nan, strategy='median')

X_train_imp = imp.fit_transform(X_train)

<center><h2>What is fit_transform?</h2></center>

`fit(X, y=None)`: Fit the model according to the given training data. This usually means estimating parameters. In the case of `SimpleImputer` with `strategy='median'`, that means computing the median of each feature.

`fit_transform(X, y=None, **fit_params)`: Fit to data, then transform it. Fits the transformer to X and y with optional parameters `fit_params` and returns a transformed version of X.

This is only for training data!

Test data should only use the `transform` method, so that the parameters come from the training dataset.
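
The split between `fit` and `transform` can be seen on a tiny example. This is a sketch with made-up arrays (`X_train_demo`, `X_test_demo`); the learned medians are stored on the fitted imputer in its `statistics_` attribute and are reused unchanged for the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature train and test sets
X_train_demo = np.array([[1.0], [2.0], [np.nan], [9.0]])
X_test_demo = np.array([[np.nan], [100.0]])

demo_imp = SimpleImputer(strategy="median")
demo_imp.fit(X_train_demo)                    # learns the median of the observed training values
print(demo_imp.statistics_)                   # one learned median per feature

# transform reuses the training median; the test value 100.0 has no influence on it
X_test_demo_imp = demo_imp.transform(X_test_demo)
print(X_test_demo_imp)
```

Calling `fit_transform` on the test set instead would recompute the median from test data, which leaks information from the test set into preprocessing.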

<center><img src="../images/pipeline-diagram.png" width="75%"/></center>

In [8]:
# Let's compare
print(f"Original datapoint: {original_data_point:>5}")
print(f"Imputed datapoint: {X_train_imp[0][0]}")

Original datapoint:   5.1
Imputed datapoint: 5.8


In [9]:
# Apply model to test dataset with transform method
# The median of each feature in the training set is used to fill in missing values in the test set
X_test_imp = imp.transform(X_test)

<center><h2>How to impute missing values: Learned Simple Model</h2></center>

Fit a model that estimates a missing value based on other features.

- [Linear Regression](https://en.wikipedia.org/wiki/Imputation_(statistics)#Regression)
- [k-nearest neighbors algorithm (k-NN)](http://conteudo.icmc.usp.br/pessoas/gbatista/files/his2002.pdf)

<center><h2>KNN Based Missing Value Imputation</h2></center>
 
Each missing value is imputed using the mean of the `n_neighbors` nearest neighbors found in the training set. Two samples are considered close if the features that neither of them is missing are close.

By default, distances are computed with a Euclidean metric that supports missing values (`nan_euclidean`).

In [10]:
import numpy as np
from sklearn.impute import KNNImputer

X = [[1,      2, np.nan], 
     [3,      4, 3], 
     [np.nan, 6, 5], 
     [8,      8, 7]]

imputer = KNNImputer(n_neighbors=2)
imputer.fit_transform(X)

array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
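
To see where the imputed `4.` in the first row comes from, the neighbor search can be reproduced by hand. This is a sketch, assuming the imputer's defaults (the `nan_euclidean` metric and uniform weights); the names `X_knn`, `donors`, and `nearest` are introduced here for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X_knn = np.array([[1,      2, np.nan],
                  [3,      4, 3],
                  [np.nan, 6, 5],
                  [8,      8, 7]], dtype=float)

# Pairwise distances that skip missing coordinates and rescale for them
dist = nan_euclidean_distances(X_knn)

# Row 0 is missing feature 2, so only rows that have feature 2 can donate a value
donors = [1, 2, 3]
nearest = sorted(donors, key=lambda i: dist[0, i])[:2]   # the two closest donors

# Uniform average of the donors' values for feature 2
imputed_value = X_knn[nearest, 2].mean()
print(nearest, imputed_value)
```

Rows 1 and 2 are the closest donors, and the mean of their third-feature values (3 and 5) is the 4 that appears in the output above.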

<center><h2>scikit-learn's IterativeImputer</h2></center>

Models each feature with missing values as a function of other features, and uses that estimate for imputation.

It works in an iterated round-robin fashion:

- At each step, a feature column is designated as output y and the other feature columns are treated as inputs X. 
- A regressor is fit on (X, y) for known y. 
- Then, the regressor is used to predict the missing values of y. 

This is done for each feature in an iterative fashion.

Source: https://scikit-learn.org/stable/modules/impute.html
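
The round-robin idea can be sketched by hand. This is a simplified illustration, not scikit-learn's implementation: the function `round_robin_impute`, the choice of `LinearRegression` as the regressor, the mean initialization, and the fixed number of rounds are all assumptions made to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def round_robin_impute(X, n_rounds=5):
    """Simplified round-robin imputation (illustrative only)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Start from a simple guess: the mean of each column's observed values
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(missing, col_means, X)

    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            rows_missing = missing[:, j]
            if not rows_missing.any():
                continue                      # nothing to impute in this column
            rows_known = ~rows_missing
            others = np.delete(X_filled, j, axis=1)   # all other columns act as inputs

            # Fit on rows where column j is observed, then predict where it is missing
            model = LinearRegression().fit(others[rows_known], X_filled[rows_known, j])
            X_filled[rows_missing, j] = model.predict(others[rows_missing])
    return X_filled

# Hypothetical data where column 1 is 2x column 0 and column 2 is 3x column 0
X_rr = [[1, 2, 3],
        [2, 4, 6],
        [3, np.nan, 9],
        [4, 8, np.nan],
        [5, 10, 15]]

X_rr_imp = round_robin_impute(X_rr)
print(X_rr_imp)
```

Observed values are never overwritten; only the originally missing entries are updated on each pass.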

In [11]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# help(IterativeImputer)

In [12]:
imp = IterativeImputer()
X_train_imp = imp.fit_transform(X_train)

In [13]:
print(f"Original datapoint: {original_data_point:>5}")
print(f"Imputed datapoint: {X_train_imp[0][0]}")

Original datapoint:   5.1
Imputed datapoint: 5.038868308952402


<center><h2>Marking imputed values</h2></center>

The presence of missing data is itself a feature: the fact that a value was missing can be informative to a model.

In [14]:
from sklearn.impute import MissingIndicator

X = np.array([[-1, 42, -1, 1, 3]])

indicator = MissingIndicator(missing_values=-1, features="all")

In [15]:
# The result is an indicator vector
mask_all = indicator.fit_transform(X)
mask_all

array([[ True, False,  True, False, False]])

Source: https://scikit-learn.org/stable/modules/impute.html
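
Imputation and marking can also be combined in one step. A minimal sketch, using the `add_indicator=True` option of `SimpleImputer` on a made-up array `X_mark`: the output contains the imputed features followed by one indicator column for each feature that had missing values during `fit`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data; only the first feature has a missing value
X_mark = np.array([[1.0,    2.0],
                   [np.nan, 3.0],
                   [7.0,    6.0]])

marking_imp = SimpleImputer(strategy="median", add_indicator=True)
X_marked = marking_imp.fit_transform(X_mark)

# Columns 0-1: imputed features; column 2: 1.0 where feature 0 was originally missing
print(X_marked)
```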

<center><h2>Check for understanding</h2></center>

What should you do if you are missing target values?

Discard that instance. One of the assumptions of Supervised Machine Learning is that each instance has a label.

Alternatively, reframe the problem as a Reinforcement Learning or Unsupervised problem.
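
Discarding instances without a label is a one-line mask. This sketch assumes numeric labels with `np.nan` marking a missing target; `X_lab` and `y_lab` are made-up arrays.

```python
import numpy as np

# Hypothetical features and targets; the second instance has no label
X_lab = np.array([[5.1, 3.5],
                  [4.9, 3.0],
                  [6.2, 2.9]])
y_lab = np.array([0.0, np.nan, 1.0])

has_label = ~np.isnan(y_lab)     # True where the target is present

# Apply the same mask to features and targets so they stay aligned
X_labeled = X_lab[has_label]
y_labeled = y_lab[has_label]

print(X_labeled.shape, y_labeled)
```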

<center><h2>Takeaways</h2></center>

- Feature Engineering (FE) creates derived data that improve model fitting, thus improving performance metrics.
- All feature engineering, including imputation, can be done through:
    - Ad hoc
    - Learned rules
    - Simple models
    - Complex models

<br>

<center><h2>Sources of Inspiration</h2></center>

- https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e

<center><h2>Bonus Materials</h2></center>

<center><h2>What can you do when you have missing values?</h2></center>

You have 3 choices:

1. Drop rows with missing values
2. Impute missing values
3. Choose a machine learning algorithm that is robust to missing values

In more detail:

1. Dropping rows with missing values is the easiest option and works well when the proportion of missing values is small relative to complete values.

2. Imputing missing values includes ad hoc choices, hand-crafted rules, and machine learning models.

3. Most machine learning models do not handle missing values. One notable exception is [lightgbm](https://github.com/microsoft/LightGBM), which can handle missing values.
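
The first choice, dropping incomplete rows, can be sketched with NumPy. The array `X_drop` is made up for illustration; checking the fraction of rows lost is a simple way to judge whether dropping is acceptable.

```python
import numpy as np

# Hypothetical data with missing values in two of the four rows
X_drop = np.array([[61,     22, 43],
                   [np.nan, 6,  27],
                   [83,     51, np.nan],
                   [74,     20, 35]], dtype=float)

complete_rows = ~np.isnan(X_drop).any(axis=1)   # True for rows with no missing values
X_complete = X_drop[complete_rows]

fraction_dropped = 1 - complete_rows.mean()
print(X_complete.shape, fraction_dropped)
```

Here half of the rows would be lost, which suggests imputation is the better choice for data like this.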

In [16]:
# Another example of IterativeImputer
import numpy as np
from sklearn.experimental import enable_iterative_imputer 
from sklearn.impute import IterativeImputer

# Create dataset with missing values
data = [[61,     22,     43,     np.nan, 67],
        [np.nan, 6,      27,     8,      11],
        [83,     51,     np.nan, 32,     9],
        [74,     np.nan, 35,     26,     97],
        [np.nan, 4,      13,     45,     33]]

# Impute missing values using iterative imputer
iter_imp = IterativeImputer(random_state=42)
iter_imp.fit_transform(data)



array([[61.        , 22.        , 43.        , 27.74898065, 67.        ],
       [50.92363569,  6.        , 27.        ,  8.        , 11.        ],
       [83.        , 51.        , 28.62176532, 32.        ,  9.        ],
       [74.        , 20.72107515, 35.        , 26.        , 97.        ],
       [67.54222008,  4.        , 13.        , 45.        , 33.        ]])

<center><h2>How to impute missing values: Learned Complex Model</h2></center>

<center><img src="../images/dl.png" width="70%"/></center>

Sources: 

- https://www.theanalysisfactor.com/seven-ways-to-make-up-data-common-methods-to-imputing-missing-data/
- https://ssc.io/pdf/p2017-biessmann.pdf