# Multivariate / Regression Imputation

The IterativeImputer class models each feature with missing values as a function of other features, and uses that estimate for imputation. 

It does so in an iterated round-robin fashion: at each step, a feature column is designated as output y and the other feature columns are treated as inputs x. 

A regressor is fit on (x, y) for known y. Then, the regressor is used to predict the missing values of y. 

This is done for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. The results of the final imputation round are returned.
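
The round-robin procedure described above can be sketched by hand. This is a simplified illustration, not the actual IterativeImputer implementation: the helper name round_robin_impute is hypothetical, the initial fill uses column means, and BayesianRidge is used as the regressor. The real class also handles imputation order, early stopping and value clipping, so its results may differ.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge


def round_robin_impute(X, n_rounds=10):
    """Hand-rolled sketch of round-robin regression imputation (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)                      # remember which cells were missing
    filled = X.copy()
    # Initial guess: replace every missing cell with the mean of its column.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue                       # nothing to impute in this column
            others = np.delete(filled, j, axis=1)    # inputs: all other columns
            known = ~missing[:, j]
            # Fit on the rows where column j was observed ...
            model = BayesianRidge().fit(others[known], filled[known, j])
            # ... and overwrite only the originally missing cells of column j.
            filled[missing[:, j], j] = model.predict(others[missing[:, j]])
    return filled


x = np.array([[10, 21], [32, 61], [40, 81], [np.nan, 32], [71, np.nan]])
print(np.round(round_robin_impute(x)))
```

Observed cells are never modified; only the cells that were NaN in the input are re-estimated in each round.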

Apply IterativeImputer with 10 iterations to the provided array x, and round the imputed values to integers.

Then apply the fitted model to another array, x_test (which contains more missing values), without fitting again.

In [2]:
import numpy as np
#Note: This estimator is still experimental for now: default parameters or details of behaviour might change 
#without any deprecation cycle.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

x=np.array([[10, 21], [32, 61], [40, 81], [np.nan, 32], [71, np.nan]])
print("Original x: \n",x)

#Apply Iterative imputer with 10 iterations on array x
imp = IterativeImputer(max_iter=10)
transformed_x= np.round(imp.fit_transform(x))
print("Regression imputed x: \n",transformed_x)

#Apply Iterative imputer on another array x_test without fitting again
print("\nUsing the same imputation model again...")
x_test = np.array([[np.nan, 2], [6, np.nan], [np.nan, 6]])
print("Test data with a lot of missing values: \n",x_test)
print("The model learned that the second feature is about double the first: \n",np.round(imp.transform(x_test)))

Original x: 
 [[10. 21.]
 [32. 61.]
 [40. 81.]
 [nan 32.]
 [71. nan]]
Regression imputed x: 
 [[ 10.  21.]
 [ 32.  61.]
 [ 40.  81.]
 [ 16.  32.]
 [ 71. 140.]]

Using the same imputation model again...
Test data with a lot of missing values: 
 [[nan  2.]
 [ 6. nan]
 [nan  6.]]
The model learned that the second feature is about double the first: 
 [[ 1.  2.]
 [ 6. 13.]
 [ 3.  6.]]


# Deterministic vs. Stochastic Regression Imputation

In deterministic regression imputation, the values predicted by the regression model are used directly to impute the missing values.
This approach is simple, but it distorts the distribution of the data: variances are underestimated and correlations between features are overstated.

In stochastic regression imputation, a random residual term is added to each regression prediction.
The residual is drawn at random from the estimated residual distribution, typically a normal distribution with mean zero and the residual variance of the fitted regression.
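
The difference between the two approaches can be illustrated with plain NumPy. This is a minimal sketch on synthetic data: the data-generating process (y = 2x + noise), the 40% missing rate, and the normal residual model are all assumptions made for the demonstration, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)           # seed chosen arbitrarily for the demo

# Synthetic data (assumption for illustration): y = 2*x + noise.
n = 5000
x = rng.normal(10, 3, size=n)
y = 2 * x + rng.normal(0, 2, size=n)

# Hide about 40% of y completely at random.
missing = rng.random(n) < 0.4
obs = ~missing

# Fit a simple linear regression on the observed pairs only.
slope, intercept = np.polyfit(x[obs], y[obs], deg=1)
pred = slope * x + intercept
resid_std = np.std(y[obs] - pred[obs], ddof=2)   # spread of residuals around the line

# Deterministic: impute the prediction itself.
y_det = np.where(missing, pred, y)
# Stochastic: prediction plus a random draw from N(0, resid_std).
y_sto = np.where(missing, pred + rng.normal(0, resid_std, size=n), y)

print("std of true y:          ", round(y.std(), 2))
print("std after deterministic:", round(y_det.std(), 2))
print("std after stochastic:   ", round(y_sto.std(), 2))
print("corr(x, y) true / deterministic / stochastic:",
      round(np.corrcoef(x, y)[0, 1], 3),
      round(np.corrcoef(x, y_det)[0, 1], 3),
      round(np.corrcoef(x, y_sto)[0, 1], 3))
```

In a setup like this, the deterministic version should show a smaller spread and a higher correlation than the true data, because all imputed points lie exactly on the regression line. Adding the residual draw is meant to restore that lost variability.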

To have a baseline for comparison, start by applying deterministic regression imputation on the provided array x, similar to the previous task.

Now apply stochastic regression imputation to the same data. Hint: the sample_posterior parameter will be needed.

Experiment with this kind of imputation by passing different values for the random_state parameter and comparing the results!


In [3]:
import numpy as np
#Note: This estimator is still experimental for now: default parameters or details of behaviour might change 
#without any deprecation cycle.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer


x=np.array([[1, 2, 3, 15, 17], [3, 6.5, 8, 14, 17], [4, 7.5, 13, 15, 19], [np.nan, 3, 5, 12, 16], [7, np.nan, 22, 11, 17]])
print("Original x: \n",x)


#Deterministic Regression Imputation: sample_posterior=False
imp = IterativeImputer(max_iter=10, sample_posterior=False)
transformed_x= np.round(imp.fit_transform(x),1)
print("Deterministic Regression Imputation of x: \n",transformed_x)

#Stochastic Regression Imputation: sample_posterior=True
random_seed=0
imp = IterativeImputer(max_iter=10, random_state=random_seed,sample_posterior=True)
transformed_x= np.round(imp.fit_transform(x),1)
print("Stochastic Regression Imputation of x (round 1): \n",transformed_x)

#A different random seed produces different random draws for the imputed values
random_seed=1
imp = IterativeImputer(max_iter=10, random_state=random_seed,sample_posterior=True)
transformed_x= np.round(imp.fit_transform(x),1)
print("Stochastic Regression Imputation of x (round 2): \n",transformed_x)

Original x: 
 [[ 1.   2.   3.  15.  17. ]
 [ 3.   6.5  8.  14.  17. ]
 [ 4.   7.5 13.  15.  19. ]
 [ nan  3.   5.  12.  16. ]
 [ 7.   nan 22.  11.  17. ]]
Deterministic Regression Imputation of x: 
 [[ 1.   2.   3.  15.  17. ]
 [ 3.   6.5  8.  14.  17. ]
 [ 4.   7.5 13.  15.  19. ]
 [ 1.7  3.   5.  12.  16. ]
 [ 7.  12.6 22.  11.  17. ]]
Stochastic Regression Imputation of x (round 1): 
 [[ 1.   2.   3.  15.  17. ]
 [ 3.   6.5  8.  14.  17. ]
 [ 4.   7.5 13.  15.  19. ]
 [ 3.8  3.   5.  12.  16. ]
 [ 7.  19.  22.  11.  17. ]]
Stochastic Regression Imputation of x (round 2): 
 [[ 1.   2.   3.  15.  17. ]
 [ 3.   6.5  8.  14.  17. ]
 [ 4.   7.5 13.  15.  19. ]
 [ 1.   3.   5.  12.  16. ]
 [ 7.   8.5 22.  11.  17. ]]
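
To experiment with random_state more systematically, the stochastic imputation can be repeated over several seeds and the imputed cells compared. The sketch below reuses the array from the cell above; the choice of 20 seeds is arbitrary, and the two tracked cells are the ones that are NaN in x.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

x = np.array([[1, 2, 3, 15, 17], [3, 6.5, 8, 14, 17], [4, 7.5, 13, 15, 19],
              [np.nan, 3, 5, 12, 16], [7, np.nan, 22, 11, 17]])

draws = []
for seed in range(20):                   # 20 seeds: arbitrary choice for the demo
    imp = IterativeImputer(max_iter=10, random_state=seed, sample_posterior=True)
    filled = imp.fit_transform(x)
    draws.append([filled[3, 0], filled[4, 1]])   # the two originally missing cells
draws = np.array(draws)

print("mean of imputed values per cell:", np.round(draws.mean(axis=0), 1))
print("std of imputed values per cell: ", np.round(draws.std(axis=0), 1))
```

The spread across seeds gives a rough sense of how uncertain each imputed value is, which a single deterministic imputation does not show.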
