# **Multivariate Feature Imputation**

Multivariate imputation uses the observed values of the other variables to predict each missing value. Scikit-learn describes its implementation, `IterativeImputer`, as follows:

"IterativeImputer models each feature with missing values as a function of other features, and uses that estimate for imputation. It does so in an iterated round-robin fashion: at each step, a feature column is designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y. Then, the regressor is used to predict the missing values of y. This is done for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. The results of the final imputation round are returned."

The MICE package in R implements Multivariate Imputation by Chained Equations. The scikit-learn `IterativeImputer` is inspired by MICE, with the exception that it only produces one set of results rather than several imputed datasets.
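To make the round-robin (chained-equation) procedure quoted above concrete, here is a minimal hand-rolled sketch. It is not scikit-learn's implementation: the helper `round_robin_impute`, the use of `LinearRegression`, the column-mean starting guess, and the toy array are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def round_robin_impute(X, n_rounds=10):
    """Simplified illustration of round-robin imputation (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: replace each missing entry with its column mean.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue  # nothing to impute in this column
            others = np.delete(filled, j, axis=1)  # inputs: all other columns
            known = ~missing[:, j]
            # Fit on the rows where column j was actually observed ...
            model = LinearRegression().fit(others[known], filled[known, j])
            # ... then overwrite the rows where column j was missing.
            filled[missing[:, j], j] = model.predict(others[missing[:, j]])
    return filled


# Toy data (an assumption): the second column is roughly twice the first.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan], [np.nan, 10.0]])
result = round_robin_impute(X)
print(result)
```

Each pass treats one column as the output and the remaining columns as inputs, which is the same loop structure the quotation describes. `IterativeImputer` adds refinements on top of this idea, such as a Bayesian regressor by default and a stopping tolerance.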

If you follow this [link](https://scikit-learn.org/stable/auto_examples/impute/plot_missing_values.html#sphx-glr-auto-examples-impute-plot-missing-values-py) you will see a rather complicated approach to estimating missing values. The objective is to build a number of models and then combine (aggregate) their results. This is very similar to a technique you will learn about called random forest regression. I like to think of this approach as getting a group of different experts into a room: each one writes an opinion on a piece of paper, and the opinions are then aggregated to get the overall group opinion.


In [1]:
import numpy as np
# IterativeImputer is experimental, so this enabling import must come first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Four features with 16 observations each; the second feature has missing values (np.nan).
y = np.array([[780,750,690,710,680,730,690,720,740,900,950,975,995,1000,1010,1020],
    [5.1,4.5,np.nan,3.3,3.6,9.3,6.7,2.8,5.4,np.nan,7.8,np.nan,np.nan,10.1,6.7,np.nan],
    [78000,75000,100000,71000,68000,70000,69000,72000,74000,69000,102000,101000,79000,114000,101000,95000],
    [0.5,0.55,0.1,0.6,0.7,0.45,0.56,0.73,0.45,0.67,0.43,0.23,0.78,0.42,0.36,0.23]])
# Transpose so that rows are observations and columns are features.
y = y.transpose()

# random_state is deliberately left unset, so the imputed values change from run to run.
imp = IterativeImputer(max_iter=100, sample_posterior=True, tol=0.000001)
y = imp.fit_transform(y)
print(y)


[[7.80000000e+02 5.10000000e+00 7.80000000e+04 5.00000000e-01]
 [7.50000000e+02 4.50000000e+00 7.50000000e+04 5.50000000e-01]
 [6.90000000e+02 4.63780386e+00 1.00000000e+05 1.00000000e-01]
 [7.10000000e+02 3.30000000e+00 7.10000000e+04 6.00000000e-01]
 [6.80000000e+02 3.60000000e+00 6.80000000e+04 7.00000000e-01]
 [7.30000000e+02 9.30000000e+00 7.00000000e+04 4.50000000e-01]
 [6.90000000e+02 6.70000000e+00 6.90000000e+04 5.60000000e-01]
 [7.20000000e+02 2.80000000e+00 7.20000000e+04 7.30000000e-01]
 [7.40000000e+02 5.40000000e+00 7.40000000e+04 4.50000000e-01]
 [9.00000000e+02 9.10666828e+00 6.90000000e+04 6.70000000e-01]
 [9.50000000e+02 7.80000000e+00 1.02000000e+05 4.30000000e-01]
 [9.75000000e+02 6.60202176e+00 1.01000000e+05 2.30000000e-01]
 [9.95000000e+02 6.93183845e+00 7.90000000e+04 7.80000000e-01]
 [1.00000000e+03 1.01000000e+01 1.14000000e+05 4.20000000e-01]
 [1.01000000e+03 6.70000000e+00 1.01000000e+05 3.60000000e-01]
 [1.02000000e+03 2.55572620e+00 9.50000000e+04 2.300000

Re-run this analysis a number of times and you will notice that you get different results each time. This is because the random seed (`random_state`) is not fixed and, with `sample_posterior=True`, each imputed value is drawn at random from the fitted regressor's predictive distribution rather than taken as a single point prediction.
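The run-to-run variation can be measured rather than just observed. The sketch below fits the imputer with several different `random_state` values and looks at the spread of the imputed values; keeping several imputations like this is in the spirit of MICE's multiple datasets. The toy array `X`, the choice of five seeds, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: the second feature is missing in rows 3 and 5.
X = np.array([
    [1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, np.nan],
    [5.0, 9.8], [6.0, np.nan], [7.0, 14.1], [8.0, 16.0],
])

# One imputed dataset per seed; sample_posterior=True makes each one a random draw.
imputations = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]

stacked = np.stack(imputations)       # shape: (5, 8, 2)
pooled_mean = stacked.mean(axis=0)    # average over the five imputations
spread = stacked.std(axis=0)          # variability between imputations

print("pooled imputed values:", pooled_mean[[3, 5], 1])
print("spread of imputed values:", spread[[3, 5], 1])
```

Observed entries are identical in every imputed dataset, so their spread is zero; only the imputed cells vary. A large spread in an imputed cell suggests the other features carry little information about it.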

Adjust the tolerance (`tol`), then go to this [link](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer) and experiment with the other parameters. You will have to do this in real-life problems. Also re-run the linear regression analysis and see what it shows.

How would you stabilise the results?
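
One possible answer, offered as a sketch rather than the definitive solution: fix `random_state` so that the random draws repeat, or set `sample_posterior=False` so that the imputer uses the regressor's point prediction instead of a random draw. The toy array `X` and the helper `impute` are hypothetical.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: two features, with missing values in the second one.
X = np.array([
    [680.0, 3.6], [690.0, np.nan], [710.0, 3.3], [730.0, 9.3],
    [750.0, 4.5], [780.0, 5.1], [900.0, np.nan], [950.0, 7.8],
])


def impute(seed, sample_posterior=True):
    """Fit a fresh imputer with the given seed and return the completed array."""
    imp = IterativeImputer(
        max_iter=10, sample_posterior=sample_posterior, random_state=seed
    )
    return imp.fit_transform(X)


# Option 1: same seed, so the random draws repeat and the results match.
same_seed_matches = np.allclose(impute(0), impute(0))
print("same seed gives same result:", same_seed_matches)

# Option 2: no posterior sampling, so the seed should no longer matter here.
point_a = impute(0, sample_posterior=False)
point_b = impute(1, sample_posterior=False)
point_matches = np.allclose(point_a, point_b)
print("point predictions agree across seeds:", point_matches)
```

Fixing the seed makes a run reproducible but still reports only one random draw; turning off sampling gives a single best guess and hides the uncertainty in the imputed values. Which trade-off is appropriate depends on what the imputed data will be used for.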