
**Missing values** occur when no data value is stored for a variable in an observation. They are a very common problem in machine learning because real-world data is often incomplete. There are many reasons for missing values; two common ones are:
1. Observations that were not recorded (for example, during a survey, when participants refuse to answer, do not know the answer, or accidentally skip an item).
2. Data corruption (errors in computer data that occur during writing, reading, storage, transmission, or processing).

Handling missing data is important as many machine learning algorithms do not support data with missing values.

There are several ways to deal with missing data:<br>

1. Mark Missing Values
2. Remove Rows/columns With Missing Values
3. Impute with mean, median or mode
4. **Prediction of missing values**
5. **Imputation with Deep Learning**

In this notebook, we will look at points 4 and 5 and how they estimate missing values.
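For comparison, options 1 to 3 each take a line or two of pandas. The following is a minimal sketch on a made-up three-row frame; the values are illustrative and are not taken from the dataset used below.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) where 0 stands in for "not recorded".
toy = pd.DataFrame({"Glucose": [148, 0, 183], "BMI": [33.6, 26.6, 0.0]})

# 1. Mark missing values: turn the placeholder 0 into NaN.
marked = toy.replace(0, np.nan)

# 2. Remove rows (or columns, with axis=1) that contain any NaN.
dropped = marked.dropna()

# 3. Impute each column with its own mean (median() works the same way).
mean_imputed = marked.fillna(marked.mean())

print(marked.isna().sum().to_dict())
print(len(dropped))
print(mean_imputed["Glucose"].tolist())
```

Dropping rows discards two of the three toy observations here, which is one reason the rest of the notebook focuses on estimating the missing entries instead.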

In [1]:
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In [3]:
pima = pd.read_csv('diabetes.csv')
pima.head()

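| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |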


In [4]:
pima.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

There appear to be no missing values, which is good news, but let's dig a bit deeper with `describe` and see whether anything looks strange.

In [5]:
pima.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845052 | 3.369578 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| Glucose | 768.0 | 120.894531 | 31.972618 | 0.0 | 99.0 | 117.0 | 140.25 | 199.0 |
| BloodPressure | 768.0 | 69.105469 | 19.355807 | 0.0 | 62.0 | 72.0 | 80.0 | 122.0 |
| SkinThickness | 768.0 | 20.536458 | 15.952218 | 0.0 | 0.0 | 23.0 | 32.0 | 99.0 |
| Insulin | 768.0 | 79.799479 | 115.244002 | 0.0 | 0.0 | 30.5 | 127.25 | 846.0 |
| BMI | 768.0 | 31.992578 | 7.88416 | 0.0 | 27.3 | 32.0 | 36.6 | 67.1 |
| DiabetesPedigreeFunction | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 768.0 | 33.240885 | 11.760232 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
| Outcome | 768.0 | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


The summary above shows that the following columns have a minimum value of zero (0):<br>


1.   Glucose
2.   BloodPressure
3.   SkinThickness
4.   Insulin
5.   BMI (body mass index)

For these columns a value of zero does not make sense, so it indicates a missing value.
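One way to see how widespread these placeholder zeros are, before recoding anything, is to count them per column. This is a small sketch on made-up rows; in the notebook the same expression could be applied to the five columns of `pima` listed above.

```python
import pandas as pd

# Hypothetical rows mimicking three of the columns discussed above.
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 0.0, 23.3, 28.1],
})

# Comparing with 0 gives a boolean frame; summing counts the True values.
zero_counts = (toy == 0).sum()
print(zero_counts)
```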

In [6]:
cols = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
pima_copy = pima.copy(deep=True)
pima_copy[cols] = pima_copy[cols].replace(0,np.nan)

In [7]:
pima_copy.isna().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

In [8]:
X = pima_copy.iloc[:,:-1]
y = pima_copy.iloc[:,-1]
X.shape,y.shape

((768, 8), (768,))

In [9]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,stratify=y,random_state=14)

## Prediction of missing values
We will look at two popular algorithms for imputing missing values:
1. KNNImputer (k-Nearest Neighbors imputation)
2. MissForest (Random Forest imputation)

scikit-learn provides its own KNNImputer, but here we use a library called **missingpy**, which supports both **KNNImputer and MissForest**.

**Installation**: `pip install missingpy`

In [10]:
!pip install missingpy

Collecting missingpy
  Downloading https://files.pythonhosted.org/packages/b5/be/998d04d27054b58f0974b5f09f8457778a0a72d4355e0b7ae877b6cfb850/missingpy-0.2.0-py3-none-any.whl (49kB)
Installing collected packages: missingpy
Successfully installed missingpy-0.2.0


In [11]:
from missingpy import KNNImputer
imputer = KNNImputer()
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
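As noted earlier, scikit-learn ships its own `KNNImputer` in `sklearn.impute`. The sketch below applies it to a toy array rather than the Pima data, to show what the imputer computes: each missing entry is filled with the mean of that feature over the `n_neighbors` nearest rows, where distances use only the coordinates present in both rows. The array and the choice `n_neighbors=2` are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data with two missing entries.
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

knn = KNNImputer(n_neighbors=2)
X_toy_filled = knn.fit_transform(X_toy)
print(X_toy_filled)
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]
```

The first row's missing value becomes 4.0, the mean of 3 and 5 from its two nearest rows; the third row's becomes 5.5, the mean of 3 and 8.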

In [12]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train_imputed)
X_test_scaled = sc.transform(X_test_imputed)

In [13]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train_scaled,y_train)
print('Training Score :',lr.score(X_train_scaled,y_train))
print('Testing Score :',lr.score(X_test_scaled,y_test))

Training Score : 0.7746741154562383
Testing Score : 0.7575757575757576
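The impute, scale, fit sequence above can also be wrapped in a single scikit-learn `Pipeline`, so that the imputer and scaler are fitted on the training split only and then reused on the test split. This is a sketch under assumptions: it uses synthetic data, scikit-learn's `KNNImputer` is swapped in for the missingpy one, and the names `X_demo`, `y_demo`, and `model` are made up for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(14)

# Synthetic data: 200 rows, 4 features, label driven by the first feature.
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Knock out roughly 10% of the entries to simulate missing values.
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=14
)

# Each step is fitted on the training split and only applied to the test split.
model = make_pipeline(KNNImputer(n_neighbors=5), StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

print("Training score:", model.score(X_tr, y_tr))
print("Testing score :", model.score(X_te, y_te))
```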


In [14]:
from missingpy import MissForest
miss = MissForest(max_iter=50,max_depth=3)
X_train_imputed_ = miss.fit_transform(X_train)
X_test_imputed_ = miss.transform(X_test)

Iteration: 0
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Iteration: 5
Iteration: 0
Iteration: 1
Iteration: 2
Iteration: 3


In [15]:
sc_tree = StandardScaler()
X_train_scaled_ = sc_tree.fit_transform(X_train_imputed_)
X_test_scaled_ = sc_tree.transform(X_test_imputed_)

In [16]:
lr.fit(X_train_scaled_,y_train)
print('Training Score :',lr.score(X_train_scaled_,y_train))
print('Testing Score :',lr.score(X_test_scaled_,y_test))

Training Score : 0.7802607076350093
Testing Score : 0.7662337662337663


Advantages of MissForest:
1. It can be applied to mixed data types, both numerical and categorical. KNN imputation on categorical data requires first converting it to a numerical representation.
2. It is robust to noisy data and multicollinearity.
3. It can work with high-dimensional data.

Drawbacks:<br>
1. If the dataset is sufficiently small, running MissForest may be more expensive than simpler methods warrant.
2. It is an algorithm, not a model object; it must be re-run every time data is imputed, which may not work in some production environments.
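If a reusable fitted object is needed, one possible workaround is scikit-learn's `IterativeImputer` with a random-forest estimator. This is not part of missingpy and is not the MissForest implementation used above. It follows the same idea of predicting each incomplete column from the others over several rounds, but it keeps the fitted models so that `transform` can be applied to new rows. The API is still marked experimental in scikit-learn, and the data and parameter values below are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic, correlated features so that one column can help predict another.
base = rng.normal(size=(120, 1))
X_full = np.hstack([base + 0.1 * rng.normal(size=(120, 1)) for _ in range(3)])

# Hide about 15% of the entries, then split into "fit" and "new" rows.
X_missing = X_full.copy()
X_missing[rng.random(X_missing.shape) < 0.15] = np.nan
X_fit, X_new = X_missing[:90], X_missing[90:]

forest_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, max_depth=3, random_state=0),
    max_iter=5,
    random_state=0,
)
X_fit_filled = forest_imputer.fit_transform(X_fit)  # fits and stores the column models
X_new_filled = forest_imputer.transform(X_new)      # reuses the stored models

print(np.isnan(X_fit_filled).sum(), np.isnan(X_new_filled).sum())
```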

## Imputation with Deep Learning
We use a deep learning library called **Datawig** to impute missing values.<br>
**Datawig** learns machine learning models, based on deep neural networks, to impute missing values in a dataframe.<br>
This method works well with categorical, continuous, and non-numerical features.<br><br>
**Installation**: `pip3 install datawig`



In [17]:
!pip3 install datawig



In [18]:
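import datawig

# Impute missing values in every column that has them.
# Note: the train and test frames are completed independently of each other.
X_train_imputed = datawig.SimpleImputer.complete(X_train)
X_test_imputed = datawig.SimpleImputer.complete(X_test)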


In [19]:
sc_tree = StandardScaler()
X_train_scaled_ = sc_tree.fit_transform(X_train_imputed)
X_test_scaled_ = sc_tree.transform(X_test_imputed)

In [20]:
lr.fit(X_train_scaled_,y_train)
print('Training Score :',lr.score(X_train_scaled_,y_train))
print('Testing Score :',lr.score(X_test_scaled_,y_test))

Training Score : 0.7783985102420856
Testing Score : 0.7619047619047619


Imputing values in specific columns only:

In [None]:
# Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    # column(s) containing information about the column we want to impute
    # ('Outcome' is not part of X_train, so it cannot be used as an input here)
    input_columns=['Pregnancies', 'DiabetesPedigreeFunction', 'Age'],
    output_column='Glucose',  # the column we'd like to impute values for
    # output_path='imputer_model',  # stores model data and metrics
)

# Fit the imputer model on the training data
imputer.fit(train_df=X_train, num_epochs=50)

# Impute missing values and return the original dataframe with predictions
imputed = imputer.predict(X_test)

# References
1. [missingpy Documentation](https://github.com/epsilon-machine/missingpy)
2. [MissForest: The Best Missing Data Imputation Algorithm? (Towards Data Science)](https://towardsdatascience.com/missforest-the-best-missing-data-imputation-algorithm-4d01182aed3)
3. [datawig Documentation](https://github.com/awslabs/datawig)