In data science, unknown relationships between attributes are modeled with machine learning (ML) models.

The same approach can be used to predict missing values (MVs).


With KNN Imputation, whenever an instance contains an MV, the k nearest neighbors of that instance are determined and a value derived from them is imputed.

For nominal attributes, the most common value among the neighbors is used; for numerical attributes, the average of the neighbors' values is used.
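
The aggregation rule itself can be sketched in a few lines. This is only an illustration: the helper name `combine_neighbor_values` and the sample neighbor values are hypothetical, and finding the neighbors is left out here.

```python
from collections import Counter


def combine_neighbor_values(values, nominal):
    """Combine the values of the k nearest neighbors into one imputed value."""
    if nominal:
        # Nominal attribute: take the most common value among the neighbors
        return Counter(values).most_common(1)[0][0]
    # Numerical attribute: take the average of the neighbors' values
    return sum(values) / len(values)


# Hypothetical neighbor values, for illustration only
imputed_color = combine_neighbor_values(["red", "blue", "red"], nominal=True)
imputed_number = combine_neighbor_values([3, 5], nominal=False)
print(imputed_color, imputed_number)
```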

Impute the missing values of the provided array x, applying KNN Imputation with k=2!

In [1]:
import numpy as np
from sklearn.impute import KNNImputer

x = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])
print("Original data: \n",x)

n=2
imputer = KNNImputer(n_neighbors=n, weights="uniform")
transformed_x= imputer.fit_transform(x)
print("\nMissing values are imputed based on values of ",n," nearest neighbors")

print("Transformed data (knn imputation): \n",transformed_x)

Original data: 
 [[ 1.  2. nan]
 [ 3.  4.  3.]
 [nan  6.  5.]
 [ 8.  8.  7.]]

Missing values are imputed based on values of  2  nearest neighbors
Transformed data (knn imputation): 
 [[1.  2.  4. ]
 [3.  4.  3. ]
 [5.5 6.  5. ]
 [8.  8.  7. ]]
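
To see where the imputed values 4 and 5.5 come from, the computation can be retraced by hand. The following is a minimal sketch, assuming the default settings of KNNImputer (NaN-aware Euclidean distance, uniform weights) and that only rows with an observed value in the affected column can act as neighbors. The helper `knn_impute_manual` is hypothetical and not part of scikit-learn.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances


def knn_impute_manual(data, k):
    """Fill each missing entry with the mean of its k nearest donor rows."""
    data = np.asarray(data, dtype=float)
    # Pairwise distances that skip coordinates missing in either row
    dist = nan_euclidean_distances(data, data)
    filled = data.copy()
    for row, col in zip(*np.where(np.isnan(data))):
        # Donors are the rows that have an observed value in this column
        donors = np.where(~np.isnan(data[:, col]))[0]
        # Pick the k donors with the smallest distance to the incomplete row
        nearest = donors[np.argsort(dist[row, donors], kind="stable")[:k]]
        filled[row, col] = data[nearest, col].mean()
    return filled


x = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])
manual_x = knn_impute_manual(x, k=2)
print(manual_x)
```

For the first row, the two nearest donors are the second and third rows, so the imputed value is (3 + 5) / 2 = 4. For the third row, the nearest donors are the second and fourth rows, giving (3 + 8) / 2 = 5.5.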


KMeans clustering is another ML algorithm that can be used for MV imputation. Attributes without MVs are used to define clusters of similar examples. The missing values are then computed from the existing values of the examples in the same cluster.

Apply KMeans Clustering Imputation to the data frame x:
1. Drop the features that contain MVs
2. Run k-means on the reduced data frame
3. Set up an object for a simple mean imputation (remember the basic imputation approaches)
4. Apply the mean imputation to the examples of each cluster separately
5. Print the completed data set

In [1]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer

x = pd.DataFrame([[1, 2,5], [1,0,6],[1, np.nan,6], [10, 0,20],[10, 2,21], [100, 40,220], [100, 50,230]],columns=['A1','A2','A3'])
print("Original data: \n",x)

#Feature deletion
x_clean=x.dropna(axis=1)
print("\nData after deleting features with missing values: \n",x_clean)

#Run kmeans Clustering without MV feature
n=3
kmeans = KMeans(n_clusters=n, random_state=0).fit(x_clean)
x['Cluster']=kmeans.labels_

print("\nOriginal data with Cluster-ID: \n",x)

# Set up an object for average Imputation using strategy='mean'
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

#Initialize transformed data set (only data of Cluster 0)
transformed_x=pd.DataFrame(imp.fit_transform(x[x['Cluster']==0]),columns=['A1','A2','A3','Cluster'])
for i in range(1,n):
    append_x=pd.DataFrame(imp.fit_transform(x[x['Cluster']==i]),columns=['A1','A2','A3','Cluster'])
    #DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    transformed_x=pd.concat([transformed_x,append_x])

#Print the completed data set
print("\nTransformed data (mean imputation): \n",transformed_x)

Original data: 
     A1    A2   A3
0    1   2.0    5
1    1   0.0    6
2    1   NaN    6
3   10   0.0   20
4   10   2.0   21
5  100  40.0  220
6  100  50.0  230

Data after deleting features with missing values: 
     A1   A3
0    1    5
1    1    6
2    1    6
3   10   20
4   10   21
5  100  220
6  100  230

Original data with Cluster-ID: 
     A1    A2   A3  Cluster
0    1   2.0    5        2
1    1   0.0    6        2
2    1   NaN    6        2
3   10   0.0   20        0
4   10   2.0   21        0
5  100  40.0  220        1
6  100  50.0  230        1

Transformed data (mean imputation): 
       A1    A2     A3  Cluster
0   10.0   0.0   20.0      0.0
1   10.0   2.0   21.0      0.0
0  100.0  40.0  220.0      1.0
1  100.0  50.0  230.0      1.0
0    1.0   2.0    5.0      2.0
1    1.0   0.0    6.0      2.0
2    1.0   1.0    6.0      2.0
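
The per-cluster mean imputation can also be written more compactly with a pandas group-wise transform, which keeps the original row order and index. This is a sketch of an alternative, not the approach used in the cell above. The variable names `cols_with_mv`, `labels` and `completed` are chosen for illustration, and `n_init=10` is set explicitly as an assumption to make the clustering less dependent on a single initialization.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

x = pd.DataFrame(
    [[1, 2, 5], [1, 0, 6], [1, np.nan, 6], [10, 0, 20],
     [10, 2, 21], [100, 40, 220], [100, 50, 230]],
    columns=["A1", "A2", "A3"],
)

# Cluster only on the features that have no missing values
cols_with_mv = x.columns[x.isna().any()]
x_clean = x.drop(columns=cols_with_mv)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(x_clean)

# Replace each missing value with the mean of its own cluster
completed = x.copy()
for col in cols_with_mv:
    cluster_means = x[col].groupby(labels).transform("mean")
    completed[col] = x[col].fillna(cluster_means)

print(completed)
```

Note that a cluster in which a feature is missing for every example has no mean to fall back on, so its entries would stay missing in this sketch.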
