# Handling Missing Values

Real-world data almost always has missing values. This can happen for many reasons, such as data entry errors or data collection problems. Whatever the reason, it is important to handle missing data, because statistical results based on a dataset with non-random missing values can be biased. Also, many ML algorithms do not support data with missing values.

When you start a machine learning or data science project, you begin with exploratory data analysis (EDA), hoping to find interesting patterns and insights before you go on to extract features and build your model. It is very common to find many values missing at this stage. Missing values arise from factors outside your direct control: sometimes from the way the data was captured, and in some cases because the values were simply not available for observation. Either way, you need to handle them before moving on.

There is no single standard technique or general solution for handling missing values, but there are several approaches you can choose from depending on your use case. Before looking at them, let's review the types of missing data.

**Types of missing values**
Missing values can be classified into different types, and each type requires slightly different handling. The main types are:

1. Missing completely at Random (MCAR)

1. Missing at Random (MAR)

1. Missing Not at Random (MNAR)



**MCAR:** Missing Completely At Random. This is the highest level of randomness: the probability that a value is missing does not depend on the values of any feature, observed or unobserved. This is the most desirable scenario for missing data.

**MAR:** Missing At Random. The probability that a value is missing depends on the observed values of other features, but not on the missing value itself.

**MNAR:** Missing Not At Random. The probability that a value is missing depends on the missing value itself or on information that was not recorded. This is a more serious issue, and in this case it is wise to check the data gathering process further and try to understand why the information is missing. For instance, if most of the people in a survey did not answer a certain question, why did they do that? Was the question unclear?
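To make the three mechanisms concrete, the sketch below simulates each of them on synthetic data. The `age` and `income` columns, the probabilities and the thresholds are all made up for illustration; they do not come from any dataset used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data (assumption: purely illustrative columns)
toy = pd.DataFrame({
    "age": rng.integers(20, 70, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends only on an observed column (age), not on income itself.
mar_mask = rng.random(n) < np.where(toy["age"] > 50, 0.4, 0.1)

# MNAR: missingness depends on the unobserved value itself (low incomes go unreported).
mnar_mask = rng.random(n) < np.where(toy["income"] < 40_000, 0.5, 0.05)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = toy.loc[~mask, "income"].mean()
    print(f"{name}: {mask.mean():.1%} missing, mean of observed income = {observed_mean:,.0f}")
```

In this toy setup the mean of the observed incomes stays close to the true mean under MCAR, while under MNAR it drifts upwards because low incomes are the ones that disappear.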

**What to do with the missing values?** 

Now that we have identified the missing values in our data, next we should check the extent of the missing values to decide the further course of action.

Ignore the missing values

As a rule of thumb, when less than 10% of the values are missing for an individual case or observation, the missing data can generally be ignored, unless the data is MAR or MNAR.
If the incomplete cases are left out, the number of complete cases (observations with no missing data) must still be sufficient for the selected analysis technique.
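As a minimal sketch of how the extent of missingness could be checked, the example below computes the percentage of missing values per column and per row, and counts the complete cases. The small `toy` frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps
toy = pd.DataFrame({
    "rooms": [2, 3, np.nan, 4, 3],
    "price": [1.0e6, np.nan, 8.5e5, np.nan, 1.2e6],
    "suburb": ["A", "B", "B", None, "A"],
})

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Percentage of missing values per row (per individual case)
row_missing_pct = toy.isnull().mean(axis=1) * 100

# Number of complete cases, i.e. rows without any missing value
n_complete = len(toy.dropna())

print(missing_pct)
print(row_missing_pct)
print("complete cases:", n_complete)
```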

**Techniques of dealing with missing data**

There are a few techniques which can help you deal with missing values in your data set —
1. Drop missing values/columns/rows
1. Imputation

**1. Deletion**

In this method, cases which have missing values for one or more features are deleted. If the cases having missing values are small in number, it is better to drop them. Though this is an easy approach, it might lead to a significant decrease in the sample size. Also, the data may not always be missing completely at random. This may lead to biased estimation of parameters.

**2. Imputation**

Imputation is the process of substituting the missing data using some statistical method. It is useful because it preserves all cases by replacing missing data with an estimated value based on other available information. But imputation methods should be used carefully, as many of them can introduce bias and reduce the variance of the dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as mno
from sklearn import linear_model
%matplotlib inline

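data = pd.read_csv('../input/melbourne-housing-market/Melbourne_housing_FULL.csv')
# keep an independent copy; a plain assignment would only alias `data`,
# so the in-place dropna calls below would also modify `data_full`
data_full = data.copy()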

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
mno.matrix(data, figsize = (20, 6))

In [None]:
for col in data.columns:
    # count the missing values in this column and express them as a percentage of all rows
    n_miss = data[col].isnull().sum()
    perc = n_miss / data.shape[0] * 100

    print('> {}, Missing: {} ({:.1f}% of rows)'.format(col, n_miss, perc))

You can already see that the first entry of the BuildingArea column is NaN, which means the value is not available there. We can either drop all the rows that do not have a BuildingArea value, or drop the BuildingArea column altogether.


In [None]:
data.dropna(inplace=True)
data

In [None]:
#dropna function in Pandas removes all the rows with missing values

mno.matrix(data, figsize = (20, 6))

In [None]:
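# axis=1 removes the columns (instead of the rows) that contain missing values.
# Note: the rows with missing values were already dropped in place above, so no
# column is removed here; run this on freshly loaded data to see the effect.
data.dropna(inplace=True, axis=1)
mno.matrix(data, figsize = (20, 6))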

Both approaches have their own advantages and disadvantages, and you will have to analyze your use case to decide what needs to be done. If we drop rows, the number of data points available to train our model goes down, which can reduce model performance. Do this only if you have a large number of training examples and the rows with missing data are few. Dropping the column altogether removes a feature from our model, i.e. the model predictions will be independent of the building area. Sometimes you can drop a variable or column if the data is missing for more than 60% of the observations, but only if that variable is insignificant. In general, dropping data is not a good approach, since you lose a lot of potentially useful information. Let's look at a better approach for dealing with missing data.

**Pros:**

Removing data with missing values can give a robust model when only a small share of the data is affected and the values are missing completely at random

Deleting a particular row or column that carries little information costs little, since it does not have a high weightage

**Cons:**

Loss of information and data

Works poorly if the percentage of missing values is high (say 30%), compared to the whole dataset

In [None]:
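# assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may not update the dataframe in recent pandas versions
data_full['BuildingArea'] = data_full['BuildingArea'].fillna(data_full['BuildingArea'].mean())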
data_full.head(5)

# A Better Option: Imputation

A slightly better approach to handling missing data is imputation, which replaces (fills in) each missing value with some number. The imputed value won't be exactly right in most cases, but it usually gives more accurate models than dropping the column entirely.


There are many ways to impute the data:

1. A constant value, either one that belongs to the set of possible values of that variable (such as 0) or a sentinel that is distinct from all observed values
1. A mean, median or mode value for the column
1. A value estimated by another predictive model
1. Multiple Imputation
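The first two options in the list above can be sketched with plain pandas. The `toy` frame, its column names and the `-1` sentinel are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column
toy = pd.DataFrame({
    "landsize": [200.0, np.nan, 650.0, 400.0, np.nan],
    "type": ["h", "u", None, "h", "h"],
})

imputed = toy.copy()

# Median: more robust to outliers than the mean for skewed numeric columns
imputed["landsize"] = toy["landsize"].fillna(toy["landsize"].median())

# Mode (most frequent value): usable for categorical columns
imputed["type"] = toy["type"].fillna(toy["type"].mode()[0])

# Constant: a sentinel value that cannot occur naturally (assumed here to be -1)
constant_filled = toy["landsize"].fillna(-1)

print(imputed)
print(constant_filled.tolist())
```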

# **1. Constant value Imputation**

In [None]:
data = pd.read_csv('../input/melbourne-housing-market/Melbourne_housing_FULL.csv')
data.head()

Fill NaN with a constant value (here the constant value is 1480000):

In [None]:
data['Price']=data['Price'].fillna(1480000.0)
data.head()

# 2. Replace with mean value

**Works only with numerical features**

In [None]:
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()  # the default strategy is 'mean'
data_selected = data[['Rooms','Bedroom2', 'Bathroom', 'Car', 'Landsize','Price', 'Lattitude', 'Longtitude', 'Propertycount', 'BuildingArea']]
data_with_imputed_values = my_imputer.fit_transform(data_selected)
df=pd.DataFrame(data_with_imputed_values)
df.head()

In [None]:
df.columns=data_selected.columns
df

# An Extension To Imputation

Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset), or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing. Here's how it might look:

In [None]:
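# make copy to avoid changing original data (when Imputing)
new_data = data_selected.copy()

# record which values were missing before they get filled in
cols_with_missing = [col for col in new_data.columns if new_data[col].isnull().any()]
was_missing = new_data[cols_with_missing].isnull().add_suffix('_was_missing')

# Imputation
my_imputer = SimpleImputer()
new_data = pd.DataFrame(my_imputer.fit_transform(new_data),
                        columns=data_selected.columns, index=data_selected.index)

# keep the indicator columns next to the imputed values
new_data = pd.concat([new_data, was_missing], axis=1)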
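scikit-learn's `SimpleImputer` can also produce "was missing" flags itself through its `add_indicator` option. The sketch below shows this on a tiny made-up array (the values are assumptions for illustration): the imputed features come first, followed by one indicator column for each feature that had missing values during `fit`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: only the second feature has a missing value
X = np.array([
    [1.0, np.nan],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, 8.0],
])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)

# Columns: feature 0, feature 1 (mean-imputed), indicator for feature 1
print(X_out)
```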

You can also compute the correlation matrix between features, so that you can predict and replace a missing value on the basis of correlated features.

In [None]:
colormap = plt.cm.RdBu
plt.figure(figsize=(32,10))
plt.title('Correlation of Features', y=1.05, size=15)
sns.heatmap(data_selected.corr(),linewidths=0.1,vmax=1.0, 
            square=True, cmap=colormap, linecolor='white', annot=True)

The correlation matrix shows which variables are correlated with each other; the correlated ones can serve as independent predictor variables for predicting missing values.

We see that our variable of interest, BuildingArea, is correlated with Rooms, Bedroom2, Bathroom, Car and Landsize. We can use these variables to predict the missing values of BuildingArea.

The predicted values from the model are inserted into the original dataframe. In theory this provides good estimates for the missing values. However, this approach has several disadvantages that can outweigh the advantages. The replaced values are completely determined by a model applied to other variables, so they tend to fit together too well; in other words, they contain no error term. It also assumes a linear relationship between the variables used in the regression equation, which may not always hold.

# **Why not impute with a common value?**

Then, why not impute the missing data with the Measure of Central Tendency of the variable? That does sound like a safe approach (and also pretty easy to implement).

Apparently not.

When we replace the missing data with some common value, we might underestimate or overestimate it. In other words, we add some bias to our estimation. For example, a person who earns just enough to meet their daily needs might not be comfortable stating their salary, so the value of the salary variable would be missing for that person. If we impute it with the mean value of the variable, we overestimate that person's salary and thus introduce bias into our analysis.
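The salary example can be simulated as a rough sketch. Everything here is an assumption made for illustration: the lognormal salary distribution, and the deliberately extreme rule that everyone below the 30th percentile withholds their salary.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed "true" salaries, right-skewed like many income distributions
salary = rng.lognormal(mean=10.5, sigma=0.5, size=5_000)

# MNAR: the lowest earners do not report their salary (extreme rule, for clarity)
threshold = np.quantile(salary, 0.3)
reported = np.where(salary < threshold, np.nan, salary)

# Mean imputation: fill every gap with the mean of the reported salaries
fill_value = np.nanmean(reported)
imputed = np.where(np.isnan(reported), fill_value, reported)

print(f"true mean    : {salary.mean():,.0f}")
print(f"imputed mean : {imputed.mean():,.0f}")
print(f"true std     : {salary.std():,.0f}")
print(f"imputed std  : {imputed.std():,.0f}")
```

In this setup the imputed column has a higher mean and a smaller standard deviation than the true salaries: the low values are replaced by a single central value, which both shifts the mean and shrinks the spread.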

We have some better options, based on imputation by prediction:
1. Regression Method
2. K-Nearest Neighbour Method
3. Multiple Imputation

**Regression Methods**

The variables with missing values are treated as dependent variables and variables with complete cases are taken as predictors or independent variables. The independent variables are used to fit a linear equation for the observed values of the dependent variable. This equation is then used to predict values for the missing data points.

The disadvantage of this method is that the identified independent variables would have a high correlation with the dependent variable by virtue of selection. This would result in fitting the missing values a little too well and reducing the uncertainty about that value. Also, this assumes that relationship is linear which might not be the case in reality.

**K-Nearest Neighbour Imputation (KNN)**

This method uses k-nearest neighbour algorithms to estimate and replace missing data. The k-neighbours are chosen using some distance measure and their average is used as an imputation estimate. This could be used for estimating both qualitative attributes (the most frequent value among the k nearest neighbours) and quantitative attributes (the mean of the k nearest neighbours).

One should try different values of k with different distance metrics to find the best match. The distance metric could be chosen based on the properties of the data. For example, Euclidean is a good distance measure to use if the input variables are similar in type (e.g. all measured widths and heights). Manhattan distance is a good measure to use if the input variables are not similar in type (such as age, gender, height, etc.).

The advantage of using KNN is that it is simple to implement. But it suffers from the curse of dimensionality. It works well for a small number of variables but becomes computationally inefficient when the number of variables is large.

**Multiple Imputation**

Multiple imputation is an iterative method in which multiple values are estimated for each missing data point using the distribution of the observed data. The advantage of this method is that it reflects the uncertainty around the true value and, when the data is missing at random, returns approximately unbiased estimates.

**MI involves the following three basic steps**
1. Imputation: The missing data are filled in with estimated values and a complete data set is created. This process of imputation is repeated m times and m datasets are created.

2. Analysis: Each of the m complete data sets is then analysed using a statistical method of interest (e.g. linear regression).

3. Pooling: The parameter estimates (e.g. coefficients and standard errors) obtained from each analysed data set are then averaged to get a single point estimate.
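The pooling step is commonly carried out with Rubin's rules: the point estimates are averaged, while the standard errors are combined from the within-imputation and between-imputation variance rather than averaged directly. The sketch below assumes made-up results from m = 5 imputed datasets; the numbers are not from any analysis in this notebook.

```python
from statistics import mean, variance

# Hypothetical results from m = 5 imputed datasets:
# one regression coefficient and its squared standard error per dataset
estimates = [1.92, 2.10, 2.05, 1.98, 2.15]
squared_se = [0.040, 0.038, 0.042, 0.039, 0.041]

m = len(estimates)

# Pooled point estimate: plain average over the m analyses
pooled_estimate = mean(estimates)

# Within-imputation variance: average of the squared standard errors
within_var = mean(squared_se)

# Between-imputation variance: sample variance of the m estimates
between_var = variance(estimates)

# Total variance according to Rubin's rules
total_var = within_var + (1 + 1 / m) * between_var
pooled_se = total_var ** 0.5

print(f"pooled estimate: {pooled_estimate:.3f}, pooled standard error: {pooled_se:.3f}")
```

The between-imputation term is what carries the extra uncertainty caused by the missing data; a single imputation has no way to express it.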



The Pima Indians Diabetes dataset is used for our analysis. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. You can read more about the dataset in the Data section of this kernel.

In [None]:
df = pd.read_csv("../input/diabetes/diabetes.csv")
df.head(3)

A simple df.info() gives a quick, high-level check for missing data in any of the variables. It lists the number of non-null values and the datatype of each variable.

In [None]:
df.info()

In [None]:
mno.matrix(df, figsize = (20, 6))

Based on the above, none of the variables seems to have any missing values. But there is more to it than meets the eye. df.describe(), which includes the five-number summary, would show that some variables have 0.0 as their minimum value, which is meaningless in their case. Plasma glucose concentration, diastolic blood pressure, triceps skinfold thickness, 2-hour serum insulin and body mass index cannot be zero.

Imagine BMI to be zero. That would be a disaster!

Pregnancies, on the other hand, can legitimately be zero, because a patient may never have been pregnant.

In [None]:
df.describe()

Thus, for feasibility of further analysis, we replace all these "missing" data with nan and calculate their number. As you can observe, there's enough evidence of significant missingness in those variables.

In [None]:
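# np.nan is the supported spelling; the np.NAN alias was removed in NumPy 2.0
df.loc[df["Glucose"] == 0.0, "Glucose"] = np.nan
df.loc[df["BloodPressure"] == 0.0, "BloodPressure"] = np.nan
df.loc[df["SkinThickness"] == 0.0, "SkinThickness"] = np.nan
df.loc[df["Insulin"] == 0.0, "Insulin"] = np.nan
df.loc[df["BMI"] == 0.0, "BMI"] = np.nan

# missing counts for the five columns above (positions 1 to 5)
df.isnull().sum().iloc[1:6]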

A better way to see this missingness is to visualize it by drawing a nullity matrix with the missingno package. As we can observe, SkinThickness and Insulin have a large amount of missing data, with the counts given above. To keep this kernel short, I will consider only these two variables in the visualizations that follow.

You can learn more about missingno package in the link mentioned in References below

In [None]:
mno.matrix(df, figsize = (20, 6))

# **1 Regression to impute missing data**

When we have multiple variables with missing values, we can't just directly use Regression Imputation to impute one of them as the predictors contain missing data themselves. But then, how can we impute one variable without imputing another?

We can avoid this Catch-22 situation by initially imputing all the variables with missing values using a trivial method such as Simple Random Imputation (we impute the missing data with random observed values of the variable), followed by Regression Imputation of each of the variables iteratively.
Regression imputation comes in two types:

**1. Deterministic Regression Imputation**

**2. Stochastic Regression Imputation**
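The idea of starting from a trivial fill and then regressing each variable on the others in turn is also what scikit-learn's `IterativeImputer` implements. It is not the method used in the cells that follow, but it gives a compact sketch of the same scheme. The two-column synthetic data is an assumption made for illustration, and the class is still marked experimental in scikit-learn, hence the extra enabling import.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables the next import
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic data (assumption): x2 is almost a linear function of x1
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)
toy = pd.DataFrame({"x1": x1, "x2": x2})

# Remove roughly 20% of the values in each column at random
toy_missing = toy.mask(rng.random(toy.shape) < 0.2)

# sample_posterior=False gives deterministic predictions;
# sample_posterior=True would draw from the predictive distribution (a stochastic variant)
imputer = IterativeImputer(max_iter=10, sample_posterior=False, random_state=0)
completed = pd.DataFrame(imputer.fit_transform(toy_missing), columns=toy.columns)

print(toy_missing.isnull().sum())
print(completed.isnull().sum())
```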

In [None]:
missing_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

Let's define a function random_imputation that replaces the missing values with randomly chosen observed values of the same variable. The method is repeated for all the variables containing missing values, after which they serve as predictors in the regression model used to estimate the other variables.

Simple Random Imputation is a crude method, since it ignores all the other available data, and is thus very rarely used on its own. But it serves as a good starting point for regression imputation.

Apply the random imputation to each of the features with missing values:

In [None]:
def random_imputation(df, feature):
    number_missing = df[feature].isnull().sum()
    observed_values = df.loc[df[feature].notnull(), feature]
    df.loc[df[feature].isnull(), feature + '_imp'] = np.random.choice(observed_values, number_missing, replace = True)
    return df

In [None]:
for feature in missing_columns:
    df[feature + '_imp'] = df[feature]
    df = random_imputation(df, feature)


# **1.1. Deterministic Regression Imputation**
In Deterministic Regression Imputation, we replace the missing data with the values predicted in our regression model and repeat this process for each variable.

In [None]:
deter_data = pd.DataFrame(columns = ["Det" + name for name in missing_columns])

for feature in missing_columns:
        
    deter_data["Det" + feature] = df[feature + "_imp"]
    parameters = list(set(df.columns) - set(missing_columns) - {feature + '_imp'})
    
    #Create a Linear Regression model to estimate the missing data
    model = linear_model.LinearRegression()
    model.fit(X = df[parameters], y = df[feature + '_imp'])
    
    #observe that I preserve the index of the missing data from the original dataframe
    deter_data.loc[df[feature].isnull(), "Det" + feature] = model.predict(df[parameters])[df[feature].isnull()]

In [None]:
mno.matrix(deter_data, figsize = (20,5))

A major disadvantage of this method is that it reduces the inherent variability of the imputed variable. Since we substitute the missing data with regression outputs, the predicted values lie exactly on the regression hyperplane, whereas the real values would have contained some noise around it.

We can visualize this in a number of ways. The first is to plot histograms of both the incomplete data and the completed data: the histogram of the completed data is taller and narrower than that of the incomplete data. In other words, the completed data has a smaller standard deviation (and thus less variability) than the incomplete data.

Another way is to draw a boxplot, in which the interquartile range (IQR) is visibly compressed for the completed data compared to the incomplete data.

In [None]:
sns.set()
fig, axes = plt.subplots(nrows = 2, ncols = 2)
fig.set_size_inches(8, 8)

for index, variable in enumerate(["Insulin", "SkinThickness"]):
    # histplot replaces distplot, which is deprecated in recent seaborn releases
    sns.histplot(df[variable].dropna(), kde = False, ax = axes[index, 0])
    sns.histplot(deter_data["Det" + variable], kde = False, ax = axes[index, 0], color = 'red')
    sns.boxplot(data = pd.concat([df[variable], deter_data["Det" + variable]], axis = 1),ax = axes[index, 1])
plt.tight_layout()

In [None]:
pd.concat([df[["Insulin", "SkinThickness"]], deter_data[["DetInsulin", "DetSkinThickness"]]], axis = 1).describe().T

# **1.2. Stochastic Regression Imputation**

To add uncertainty back to the imputed values, we can add normally distributed noise with a mean of zero and a standard deviation equal to the standard error of the regression estimates. This method is called Random Imputation or Stochastic Regression Imputation.



In [None]:
random_data = pd.DataFrame(columns = ["Ran" + name for name in missing_columns])

for feature in missing_columns:
        
    random_data["Ran" + feature] = df[feature + '_imp']
    parameters = list(set(df.columns) - set(missing_columns) - {feature + '_imp'})
    
    model = linear_model.LinearRegression()
    model.fit(X = df[parameters], y = df[feature + '_imp'])
    
    #Standard Error of the regression estimates is equal to std() of the errors of each estimates
    predict = model.predict(df[parameters])
    std_error = (predict[df[feature].notnull()] - df.loc[df[feature].notnull(), feature + '_imp']).std()
    
    #observe that I preserve the index of the missing data from the original dataframe
    random_predict = np.random.normal(size = df[feature].shape[0], 
                                      loc = predict, 
                                      scale = std_error)
    random_data.loc[(df[feature].isnull()) & (random_predict > 0), "Ran" + feature] = random_predict[(df[feature].isnull()) & 
                                                                            (random_predict > 0)]

When we introduce this Gaussian noise, we may end up imputing some negative values for the missing data, depending on the spread of the distribution for a particular pair of mean and standard deviation. But, as discussed earlier, there are variables whose values can never be negative. For example, a negative value for insulin concentration would be meaningless.

We avoid this situation by retaining the values introduced by the simple random imputation discussed above whenever the noisy prediction is not positive. This reduces the variability that we introduce, but it is something we have to accept, especially for variables whose values are restricted to certain parts of the real number line.

In [None]:
sns.set()
fig, axes = plt.subplots(nrows = 2, ncols = 2)
fig.set_size_inches(8, 8)

for index, variable in enumerate(["Insulin", "SkinThickness"]):
    # histplot replaces distplot, which is deprecated in recent seaborn releases
    sns.histplot(df[variable].dropna(), kde = False, ax = axes[index, 0])
    sns.histplot(random_data["Ran" + variable], kde = False, ax = axes[index, 0], color = 'red')
    axes[index, 0].set(xlabel = variable + " / " + variable + '_imp')

    sns.boxplot(data = pd.concat([df[variable], random_data["Ran" + variable]], axis = 1),
                ax = axes[index, 1])

# adjust the layout once, after all subplots are drawn
plt.tight_layout()


We can observe from the plots above that we have introduced some degree of variability into the variables and retained the native distribution as well.

In [None]:
pd.concat([df[["Insulin", "SkinThickness"]], random_data[["RanInsulin", "RanSkinThickness"]]], axis = 1).describe().T

One concern is that Regression Imputation might not be the best method when a variable is missing a large share of its data, as in the case of Insulin. In such cases we have to use more powerful approaches such as Maximum Likelihood Imputation and Multiple Imputation.

Notes
* Regression Imputation assumes that the data is Missing At Random; more about this can be found in the references below.
* For a better regression model, we might have to apply different data transformation methods depending on our data.
* Do observe that we have included Outcome as one of our predictors even though it is caused by the other variables under scrutiny.
* This kernel does not describe the best method for every case; rather, it just demonstrates Regression Imputation as one of the methods.

# 2. KNN Imputer

KNNImputer was first supported by scikit-learn in December 2019, when version 0.22 was released. This imputer uses the k-nearest neighbours method to replace each missing value with the mean value from the `n_neighbors` nearest neighbours found in the training set. By default, it uses a Euclidean distance metric that tolerates missing values (`nan_euclidean`).
We will import it from scikit-learn's impute package later in this section; first, some motivation.

Univariate methods used for missing value imputation are simplistic ways of estimating the value and may not provide an accurate picture always. For example, let us say we have variables related to the density of cars on road and levels of pollutants in the air and there are few observations that are missing for the level of pollutants, imputing the level of pollutants by mean/median level of pollutants may not necessarily be an appropriate strategy.

In such scenarios, algorithms like k-Nearest Neighbors (kNN) can help to impute the values of missing data. Sociologists and community researchers suggest that human beings live in a community because neighbors generate a feeling of security and safety, attachment to community, and relationships that bring out a community identity through participation in various activities.

A similar imputation methodology that works on data is k-Nearest Neighbours (kNN) that identifies the neighboring points through a measure of distance and the missing values can be estimated using completed values of neighboring observations.

**Example-**
Suppose, you run out of stock of necessary food items in your house, and due to the lockdown none of the nearby stores are open. Therefore, you ask your neighbors for help and you will end up cooking whatever they supply to you. This is an example of imputation from a 1-nearest neighbor (taking the help of your closest neighbor).

Instead, if you identify 3 neighbors from whom you ask for help and choose to combine the items supplied by 3 of your nearest neighbors, that is an example of imputation from 3-nearest neighbors. Similarly, missing values in datasets can be imputed with the help of values of observations from the k-Nearest Neighbours in your dataset. Neighboring points for a dataset are identified by certain distance metrics, generally euclidean distance.

![](https://lh3.googleusercontent.com/kgxBFV7v73pLLDLPa3hAQwkQKseBwKPI_CA5ISMtPdh9lIOWzIy4Qop4BuZ_WkT0_qr106SVpuC60hw0mqs6_aRbsihrF0kxu_a5PJa77EBOT5uqmMhLcEJFKE7PSKIntpgerVfX)

Consider the above diagram that represents the working of kNN. In this case, the oval area represents the neighboring points of the green squared data point. We use a measure of distance to identify the neighbors. For a detailed introduction to kNN and distance measure calculations, 

**Distance calculation in the presence of missing values**
Let’s look at an example to understand this. Consider three observations in a two-dimensional space: A (2,0), B (2,2) and C (3,3). The graphical representation of these points is shown below:
![](https://lh4.googleusercontent.com/iIecYWAJ08ZMkqbkTEYFOJvVwJvVk6kst80v2QQh0QIUMaphrNeEnlGwP1H8gupSnCM2X2YEMsnyecVstXhUkAHnOMH03zMdNmvSIcotz36ApnKa_SE3v9-BQnPRN_76uslOMH1i)

The points with the shortest distance based on Euclidean distances are considered to be the nearest neighbors. For example, 1-nearest neighbor to Point A is point B. For point B, the 1-nearest neighbor is point C.

In the presence of missing coordinates, the Euclidean distance is calculated by ignoring the missing values and scaling up the weight of the non-missing coordinates.

![](https://lh3.googleusercontent.com/XoAmDPvK6VdXeR1Tvbre0GGPrVhjt0lKfsctH_U_DO4nTLMEQzFe0Cavzs50kqHdvBr483UA5HxJEOptHxBYtyRY5FTON28Yj4Q70oCeh-4Opk5KXojX5BwqUAEJRn2xiHtLoRdU)

![](https://lh5.googleusercontent.com/Ifp1O-KCM1gG0aQXXJbT3RLtTNC1eICgKhFZ89p25jzvujerKmEcn9nFXWVls16oZ-aWxBu7k4iockd3ohWyFh_jga8589F6Ra8h2lLc959pBClCGdGPoZhx0kbYnOlD4cRKK5fM)

For example, the Euclidean distances between two points (3, NA, 5) and (1, 0, 0) is:

![](https://lh5.googleusercontent.com/nPFpe1oPKYB1xUwU4GGxCrAEpi3pNBDckj0Jza5cMFGkA-tjMZAWQzEtqK1DJXJt0ZuOFcCoVymyIUzVEyBl_8bWRhFWA9k7x3AHiMgFxjXYaHzkx7qIQR24u3_p8TJkw6IMBj8i)
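The rule described above (skip the missing coordinates, then scale up by the ratio of total to present coordinates) can be written out as a small sketch in plain Python. The function name `nan_euclidean` is chosen here for illustration; scikit-learn provides its own implementation as `sklearn.metrics.pairwise.nan_euclidean_distances`.

```python
import math

def nan_euclidean(a, b):
    """Euclidean distance that skips coordinates missing in either point
    and rescales by (total coordinates / coordinates present)."""
    pairs = [(x, y) for x, y in zip(a, b)
             if not (math.isnan(x) or math.isnan(y))]
    if not pairs:
        # no coordinate is present in both points, so the distance is undefined
        return math.nan
    weight = len(a) / len(pairs)
    return math.sqrt(weight * sum((x - y) ** 2 for x, y in pairs))

# The example from the text: (3, NA, 5) and (1, 0, 0)
distance = nan_euclidean([3, math.nan, 5], [1, 0, 0])
print(distance)
```

For these two points, two of the three coordinates are present, so the squared differences (3 - 1)^2 + (5 - 0)^2 = 29 are multiplied by 3/2 before taking the square root.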

Now we use scikit-learn's KNNImputer, which uses this Euclidean distance for imputation.

In [None]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df = pd.DataFrame(imputer.fit_transform(df),columns = df.columns)

In [None]:
df.isna().any()

The choice of k for imputing missing values with the kNN algorithm can be a bone of contention. Research suggests that it is important to test the model using cross-validation after performing imputation with different values of k. Although the imputation of missing values is a continuously evolving field of study, kNN acts as a simple and effective strategy.
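One way to compare values of k is to put the imputer and a model into a single pipeline and cross-validate the whole thing, so that the imputer is fitted only on the training folds. The sketch below is a minimal example under stated assumptions: synthetic data from `make_classification`, 15% of the values removed at random, a logistic regression as the downstream model, and an arbitrary grid of k values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (assumption) with values removed at random
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.15] = np.nan

scores = {}
for k in (1, 3, 5, 9):
    # Imputation happens inside the pipeline, so each CV fold fits its own imputer
    pipeline = make_pipeline(
        KNNImputer(n_neighbors=k),
        StandardScaler(),
        LogisticRegression(max_iter=1000),
    )
    scores[k] = cross_val_score(pipeline, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

Differences between neighbouring values of k are often small, so it is worth looking at the spread of the fold scores as well before settling on one value.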

# 3. Multiple Imputation (The MICE Algorithm)



Multiple Imputation by Chained Equations, also called “fully conditional specification”, is defined as such:

![](https://miro.medium.com/max/609/1*PGMV2MkOnIl7lV5hm7kDmg.png)


This process is repeated for the desired number of datasets. The method mentioned on line 8, mean matching, is used to produce imputations that behave more like the original data. This idea is explored in depth in Stef van Buuren’s online book. A reproducible example of the effects on mean matching can also be found on the miceforest Github page.
Multiple iterations are sometimes required for the imputations to converge. There are several things that affect how many iterations are required to achieve convergence such as the type of missing data, the information density in the dataset, and the model used to impute the data.

In [None]:
!pip install miceforest

In [None]:
import miceforest as mf
from sklearn.datasets import load_iris
import pandas as pd

# Load and format data
iris = pd.concat(load_iris(as_frame=True,return_X_y=True),axis=1)
iris.rename(columns = {'target':'species'}, inplace = True)
iris['species'] = iris['species'].astype('category')
iris

In [None]:


# Introduce missing values
iris_amp = mf.ampute_data(iris,perc=0.25,random_state=1991)

In [None]:
# Create kernels. 
kernel = mf.MultipleImputedKernel(
  data=iris_amp,
  save_all_iterations=True,
  random_state=2000
)
# Run the MICE algorithm for 5 iterations on each of the datasets
kernel.mice(5,verbose=True)

What we have done is created 5 separate datasets with different imputed values. We can never be sure what the original data was, but if our different datasets all come up with similar imputed values, we can say that we are confident in our imputations. Let’s take a look at the correlations of the imputed values between datasets:

In [None]:
kernel.plot_correlations(wspace=0.4,hspace=0.5)

Each plot represents the correlation of imputed values between 2 of the datasets. Every combination of datasets is included in the graph. If the correlation between imputed values is low, it means we did not impute the values with much confidence. It looks like our models all pretty much agreed on the imputations for petal length and petal width. If we ran more iterations, we might be able to get better results for sepal length and sepal width as well.