# Handling the Missing Data Problem

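## Types of Missing Data

* MCAR (Missing Completely At Random): the chance that a value is missing is unrelated to any data, observed or unobserved.
* MAR (Missing At Random): the chance that a value is missing depends only on other, observed variables.
* MNAR (Missing Not At Random): the chance that a value is missing depends on the missing value itself.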
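
The three mechanisms can be illustrated with a small simulation. This is a minimal sketch on synthetic data: the `age` and `income` columns, the missingness rates, and the helper `observed_mean` are all assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: income rises loosely with age
age = rng.integers(20, 70, size=n)
income = 20000 + 800 * age + rng.normal(0, 5000, size=n)
df_full = pd.DataFrame({"age": age, "income": income})

# MCAR: every income value has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: missingness in income depends only on the observed variable age
mar_mask = rng.random(n) < np.where(age > 50, 0.4, 0.1)

# MNAR: missingness in income depends on the income value itself
mnar_mask = rng.random(n) < np.where(income > np.median(income), 0.4, 0.1)


def observed_mean(mask):
    """Mean income over the rows that are still observed."""
    return df_full.loc[~mask, "income"].mean()


true_mean = df_full["income"].mean()
mean_mcar = observed_mean(mcar_mask)
mean_mar = observed_mean(mar_mask)
mean_mnar = observed_mean(mnar_mask)
print(true_mean, mean_mcar, mean_mar, mean_mnar)
```

In this setup the MCAR mean should stay close to the true mean, while the MNAR mean is pulled down because high incomes go missing more often.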

## Deletion

* Listwise Deletion: removes every record that contains one or more missing values.

    * Disadvantages: Statistical power depends on sample size, and in small datasets listwise deletion can shrink the sample substantially. Unless you are confident that the data are MCAR, this technique may also introduce bias into the dataset.

* Pairwise Deletion: computes each statistic (for example, the correlation between a pair of variables) from all records in which the variables involved are observed, which maximizes the data available on an analysis-by-analysis basis.
    * Disadvantages: Different parts of your model are based on different numbers of observations, which makes the model difficult to interpret.
* Dropping Variables: drops a variable when a large share of its values (for example, 60% or more) is missing. The drawback is that it is difficult to know how the dropped variable may affect other variables in the dataset.
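
The three deletion strategies above can be sketched in pandas. The toy frame and its column names are hypothetical, and the 0.6 threshold simply mirrors the 60% figure mentioned in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [2.0, np.nan, 6.0, 8.0, 10.0],
    "c": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

# Listwise deletion: drop every row that has at least one missing value
listwise = df.dropna()

# Pairwise deletion: DataFrame.corr() computes each correlation from the
# rows where both columns of that pair are observed
pairwise_corr = df[["a", "b"]].corr()

# Dropping variables: remove columns whose missing share exceeds a threshold
threshold = 0.6
missing_share = df.isna().mean()
reduced = df.loc[:, missing_share <= threshold]

print(len(listwise), list(reduced.columns))
```

Here listwise deletion keeps a single row out of five, which shows how quickly the sample can shrink when several columns have gaps.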

## Imputation 


## 1- Mean, Median, Mode

Disadvantages of mean, median, and mode imputation:
- It reduces the variance of the imputed variables.
- It also shrinks the standard error, which invalidates most hypothesis tests and the calculation of confidence intervals.
- It disregards the correlations between variables, so it can over-represent or under-represent certain data.

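In [None]:
# Apply only ONE of these; once the gaps are filled, later calls have nothing left to fill
df["Column_Name"] = df["Column_Name"].fillna(df["Column_Name"].mean())
df["Column_Name"] = df["Column_Name"].fillna(df["Column_Name"].median())
# mode() returns a Series (ties are possible), so take its first value
df["Column_Name"] = df["Column_Name"].fillna(df["Column_Name"].mode()[0])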
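
The variance reduction listed among the disadvantages can be checked directly. This is a small sketch on synthetic data; the distribution, sample size, and 30% missing rate are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
values = pd.Series(rng.normal(loc=50, scale=10, size=500))

# Knock out 30% of the values completely at random
with_gaps = values.copy()
missing_pos = rng.choice(len(values), size=150, replace=False)
with_gaps.iloc[missing_pos] = np.nan

# Mean imputation: every gap receives the same value
imputed = with_gaps.fillna(with_gaps.mean())

var_observed = with_gaps.var()  # variance of the observed values only
var_imputed = imputed.var()     # variance after mean imputation
print(var_observed, var_imputed)
```

Because every imputed value sits exactly at the mean, the imputed rows add nothing to the sum of squared deviations while still increasing the row count, so the variance after imputation is smaller.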

## 2- Logistic Regression

Disadvantages of Logistic Regression:
- It is prone to overfitting, which can lead it to overstate the accuracy of its predictions.
- It tends to underperform when there are multiple or nonlinear decision boundaries.

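In [None]:
import numpy as np
from sklearn.impute import SimpleImputer  # SimpleImputer lives in sklearn.impute (scikit-learn >= 0.20)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
# X (features) and y (labels) are assumed to be defined earlier in the notebook
# Mean-impute missing feature values, then fit the classifier
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
logmodel = LogisticRegression()
steps = [('imputation', imp), ('logistic_regression', logmodel)]
pipeline = Pipeline(steps)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state=42)
pipeline.fit(X_train, Y_train)
y_pred = pipeline.predict(X_test)
pipeline.score(X_test, Y_test)  # mean accuracy on the held-out set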
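
The pipeline above mean-imputes the features and then fits a classifier. To use logistic regression as the imputer itself, one option is to predict a categorical column from the complete columns. The sketch below assumes a hypothetical binary column `smoker` and two fully observed predictors `x1` and `x2`; it is one possible approach, not a definitive recipe.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 300

# Hypothetical data: a binary column related to two complete numeric columns
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
smoker = (x1 + 0.5 * x2 + rng.normal(scale=0.5, size=n) > 0).astype(float)
df = pd.DataFrame({"x1": x1, "x2": x2, "smoker": smoker})

# Remove 60 values of the target column at random
df.loc[rng.choice(n, size=60, replace=False), "smoker"] = np.nan

predictors = ["x1", "x2"]
known = df["smoker"].notna()

# Fit only on rows where the column to impute is observed
clf = LogisticRegression()
clf.fit(df.loc[known, predictors], df.loc[known, "smoker"])

# Fill the gaps with the predicted class
df_imputed = df.copy()
df_imputed.loc[~known, "smoker"] = clf.predict(df.loc[~known, predictors])
print(df_imputed["smoker"].isna().sum())
```

Filling in the single most likely class ignores the model's uncertainty, which is one way the overconfidence mentioned above can show up in later analyses.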

## 3- Linear Regression

Disadvantages of Linear Regression:
- The standard error is deflated, because the imputed values fall exactly on the regression line.
- It assumes a linear relationship between x and y.

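In [None]:
import numpy as np
from sklearn.impute import SimpleImputer  # SimpleImputer lives in sklearn.impute (scikit-learn >= 0.20)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
# X (features) and y (numeric target) are assumed to be defined earlier in the notebook
# Mean-impute missing feature values, then fit the regressor
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
linmodel = LinearRegression()
steps = [('imputation', imp), ('linear_regression', linmodel)]
pipeline = Pipeline(steps)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state=42)
pipeline.fit(X_train, Y_train)
y_pred = pipeline.predict(X_test)
pipeline.score(X_test, Y_test)  # R^2 on the held-out set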
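
Regression imputation of a numeric column, and the deflated spread it produces, can be sketched as follows. The data are synthetic and the column names `x` and `y` are placeholders; the true slope and intercept are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200

# Hypothetical data with a linear relationship plus noise
x = rng.uniform(0, 10, size=n)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=n)
df = pd.DataFrame({"x": x, "y": y})

# Remove 40 values of y at random
df.loc[rng.choice(n, size=40, replace=False), "y"] = np.nan

known = df["y"].notna()

# Fit on the observed rows, then predict y where it is missing
reg = LinearRegression()
reg.fit(df.loc[known, ["x"]], df.loc[known, "y"])

df_imputed = df.copy()
df_imputed.loc[~known, "y"] = reg.predict(df.loc[~known, ["x"]])

# Spread around the fitted line: observed rows versus imputed rows
resid_observed = (df.loc[known, "y"] - reg.predict(df.loc[known, ["x"]])).std()
resid_imputed = (df_imputed.loc[~known, "y"] - reg.predict(df.loc[~known, ["x"]])).std()
print(resid_observed, resid_imputed)
```

The imputed rows have no residual spread at all, since each one lies on the fitted line. That missing noise is what makes standard errors computed from the completed data too small.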

## 4- KNN
This model is widely used for missing data imputation because it can handle both continuous and categorical data.
It is a non-parametric method that fills in a missing value from the k most similar records (its nearest neighbors), optionally giving closer neighbors more weight.

Disadvantages of KNN:
- It is time-consuming on larger datasets.
- On high-dimensional data, accuracy can be severely degraded.

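In [None]:
import numpy as np
from sklearn.impute import SimpleImputer  # SimpleImputer lives in sklearn.impute (scikit-learn >= 0.20)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
# X (features) and y (labels) are assumed to be defined earlier in the notebook
# Split once so that every k is scored on the same held-out set
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state=42)
k_range = range(1, 26)
scores = {}
for k in k_range:
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    knn = KNeighborsClassifier(n_neighbors=k)
    steps = [('imputation', imp), ('K-Nearest Neighbor', knn)]
    pipeline = Pipeline(steps)
    pipeline.fit(X_train, Y_train)
    y_pred = pipeline.predict(X_test)
    scores[k] = pipeline.score(X_test, Y_test)  # keep the accuracy for each k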
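
The loop above uses KNN as the downstream classifier after mean imputation. For KNN as the imputer itself, scikit-learn provides `KNNImputer`, which works on numeric data only, so categorical columns would need to be encoded first. A minimal sketch on a made-up array `X_demo`:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric matrix with two missing entries
X_demo = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each gap is filled with the average of that feature over the 2 nearest rows,
# where distance is computed on the features both rows have observed
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X_demo)
print(X_filled)
```

Setting `weights="distance"` instead would give closer neighbors more influence on the filled value.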