<a href="https://colab.research.google.com/github/saks0106/Frequent-Lookouts/blob/main/Multivariate_Imputer_using_KNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

When we want to impute a missing value in a row using several features at once, we use KNNImputer. It relies on the same idea as Euclidean distance, except that distances are computed with [nan_euclidean_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.nan_euclidean_distances.html), which ignores coordinates that are missing in either row and scales up the remaining ones to compensate. For the row whose value needs to be imputed, the nan-Euclidean distance to every other row is calculated. The K rows with the smallest distance (among rows that actually have a value for that feature) are selected, and the mean of their values for that feature becomes the imputed value.
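The procedure described above can be reproduced by hand on a tiny array. The sketch below uses made-up numbers (not the Titanic data) and compares a manual "nearest K rows, then mean" calculation against KNNImputer with uniform weights.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Toy data, made up for illustration: row 0 is missing its first feature.
X = np.array([
    [np.nan, 1.0, 2.0],
    [3.0,    1.0, 2.0],
    [5.0,    2.0, 2.0],
    [9.0,    8.0, 9.0],
])

# Distances from row 0 to every other row; NaN coordinates are skipped.
dists = nan_euclidean_distances(X[[0]], X[1:])[0]

k = 2
nearest = np.argsort(dists)[:k]           # positions within X[1:]
manual_value = X[1:][nearest, 0].mean()   # plain mean of the neighbors' first feature

# The same imputation done by scikit-learn (default weights='uniform').
imputed = KNNImputer(n_neighbors=k).fit_transform(X)

print("distances:", dists)
print("manual:", manual_value, "KNNImputer:", imputed[0, 0])
```

Both values should agree here because every other row has the first feature present, so the candidate neighbors are the same in the manual and library versions.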

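**Inference:** KNN imputation often gives better, more accurate results than simple univariate imputation, but it needs more computation on large datasets, and the training dataset has to be deployed to production (inside the fitted imputer) so that missing values can be imputed at prediction time.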
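To illustrate the deployment point, here is a minimal sketch of saving a fitted imputer and reusing it on a new row. The array `X_train_demo` and the single `new_row` are made-up stand-ins, and `pickle` is just one possible way to persist the object.

```python
import pickle

import numpy as np
from sklearn.impute import KNNImputer

# Made-up training matrix standing in for X_train (Age, Pclass, Fare).
X_train_demo = np.array([
    [22.0, 3.0, 7.25],
    [38.0, 1.0, 71.28],
    [26.0, 3.0, 7.93],
    [35.0, 1.0, 53.10],
])

imputer = KNNImputer(n_neighbors=2).fit(X_train_demo)

# Serialize the fitted imputer; the training rows it needs are stored inside it.
blob = pickle.dumps(imputer)

# "Production" side: restore the imputer and fill a new row with a missing Age.
restored = pickle.loads(blob)
new_row = np.array([[np.nan, 3.0, 8.05]])
filled = restored.transform(new_row)
print(filled)
```

The size of the serialized object grows with the training set, which is the practical cost mentioned above.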

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.impute import KNNImputer,SimpleImputer
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv('train.csv')[['Age','Pclass','Fare','Survived']]

In [4]:
df.head()

    Age  Pclass     Fare  Survived
0  22.0       3     7.25         0
1  38.0       1  71.2833         1
2  26.0       3    7.925         1
3  35.0       1     53.1         1
4  35.0       3     8.05         0


In [5]:
df.isnull().mean() * 100

Age         19.86532
Pclass       0.00000
Fare         0.00000
Survived     0.00000
dtype: float64

In [6]:
X = df.drop(columns=['Survived'])
y = df['Survived']

In [7]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)

In [8]:
X_train.head()

      Age  Pclass     Fare
30   40.0       1  27.7208
10    4.0       3     16.7
873  47.0       3      9.0
182   9.0       3  31.3875
876  20.0       3   9.8458


In [11]:
knn = KNNImputer(n_neighbors=3,weights='distance')
# n_neighbors: number of neighbors to average over; worth experimenting with
# weights: 'uniform' takes the plain mean of the neighbors,
#          'distance' weights each neighbor by the inverse of its distance
# metric: 'nan_euclidean' by default
# add_indicator=True appends a binary column for each feature that had missing values,
#          marking the rows where the value was missing


X_train_trf = knn.fit_transform(X_train)
X_test_trf = knn.transform(X_test)
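The effect of the `weights` and `add_indicator` options is easier to see on a toy array. This is a small sketch with made-up values, not the Titanic data; the variable names are hypothetical.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up data: row 0 is missing its first feature.
X_demo = np.array([
    [np.nan, 0.0],
    [10.0,   1.0],
    [20.0,   2.0],
])

# 'uniform': plain mean of the two neighbors' values (10 and 20).
uniform = KNNImputer(n_neighbors=2, weights='uniform').fit_transform(X_demo)

# 'distance': the closer neighbor (10) gets the larger weight (1 / distance).
distance = KNNImputer(n_neighbors=2, weights='distance').fit_transform(X_demo)

# add_indicator=True appends a 0/1 column flagging where the value was missing.
flagged = KNNImputer(n_neighbors=2, add_indicator=True).fit_transform(X_demo)

print("uniform :", uniform[0, 0])
print("distance:", distance[0, 0])
print(flagged)
```

With distance weighting the imputed value is pulled toward the nearer row, so it lands below the uniform mean in this example.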

In [18]:
pd.DataFrame(X_train_trf).sample(5)

             0    1        2
453       45.0  3.0  14.4542
177  27.477003  3.0    69.55
113       35.0  3.0     7.05
209       18.0  3.0     18.0
338       65.0  1.0  61.9792


In [12]:
lr = LogisticRegression()

lr.fit(X_train_trf,y_train)

y_pred = lr.predict(X_test_trf)

accuracy_score(y_test,y_pred)

0.7150837988826816

In [13]:
# Comparison with SimpleImputer (default strategy: mean)

si = SimpleImputer()

X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)

In [14]:
lr = LogisticRegression()

lr.fit(X_train_trf2,y_train)

y_pred2 = lr.predict(X_test_trf2)

accuracy_score(y_test,y_pred2)

0.6927374301675978
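Since `n_neighbors` needs experimenting, one possible way to pick it is to put the imputer and the classifier in a pipeline and cross-validate over a few candidate values. The sketch below uses synthetic data as an assumption so that it runs on its own; the step names `impute` and `clf` and the candidate list are arbitrary choices.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the training data (not the Titanic dataset).
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
y_syn = (X_syn[:, 0] + X_syn[:, 1] > 0).astype(int)

# Remove roughly 20% of the first feature to mimic the missing Age values.
missing = rng.random(200) < 0.2
X_syn[missing, 0] = np.nan

# Imputation sits inside the pipeline, so it is refit on each training fold.
pipe = Pipeline([
    ('impute', KNNImputer()),
    ('clf', LogisticRegression()),
])

search = GridSearchCV(
    pipe,
    param_grid={'impute__n_neighbors': [1, 3, 5, 7]},
    cv=5,
    scoring='accuracy',
)
search.fit(X_syn, y_syn)

print(search.best_params_, search.best_score_)
```

On the real data, the same pattern would be applied to `X_train` and `y_train`, and the chosen value then evaluated once on the held-out test split.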