# Why is handling missing data important?
Missing data are common in almost all research and can have a significant effect on the conclusions that can be drawn from the data.
Missing data present several problems:
* First, the absence of data reduces statistical power.
* Second, the lost data can bias the estimation of parameters.
* Third, it may complicate the analysis of the study.
* Fourth, many machine learning packages in Python do not accept missing data.

Each of these distortions may threaten the validity of the analysis and can lead to invalid conclusions, which reduces the reliability of the model.

[**Reference**](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100/)

# Missing data mechanisms

* **Missing Completely At Random (MCAR):**
Values in a data set are Missing Completely At Random (MCAR) if the events that lead to any particular data item being missing are independent of both the observed data and the missing data.

* **Missing At Random (MAR):**
Data are Missing At Random (MAR) when the probability that a value is missing depends on the observed data but not on the missing values themselves.

* **Missing Not At Random (MNAR):**
Data are Missing Not At Random (MNAR) when they are neither MCAR nor MAR: the probability that a value is missing depends on the missing value itself, even after accounting for the observed data.

[**Reference**](https://en.wikipedia.org/wiki/Missing_data#Missing_not_at_random)
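
The three mechanisms can be illustrated on synthetic data. The sketch below is only an illustration: the `age` and `income` columns, their relationship, and the missingness probabilities are all made-up assumptions. It removes `income` values under each mechanism and compares the mean of the remaining values with the true mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic, made-up data: `age` is always observed, `income` depends on age.
age = rng.uniform(20, 70, size=n)
income = 1_000 * age + rng.normal(0, 10_000, size=n)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: every income value has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.3

# MAR: the chance of a missing income depends only on the observed `age`.
mar_mask = rng.random(n) < np.where(df["age"] > 50, 0.5, 0.1)

# MNAR: the chance of a missing income depends on the income value itself.
mnar_mask = rng.random(n) < np.where(df["income"] > df["income"].median(), 0.5, 0.1)

true_mean = df["income"].mean()
observed_means = {
    name: df.loc[~mask, "income"].mean()
    for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]
}

print(f"True mean: {true_mean:.0f}")
for name, observed_mean in observed_means.items():
    print(f"{name}: mean of observed incomes = {observed_mean:.0f}")
```

Under MCAR the observed mean stays close to the true mean. Under MAR and MNAR the naive observed mean drifts, but only in the MAR case can the drift be explained by an observed column (`age` here).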




# Handling Missing data
1. **Dropping variables**
2. **Partial Deletion**
    * 2.1 **Listwise Deletion**
3. **Data Imputation**
    * 3.1 **Single Imputation**
        * 3.1.1 Single Imputation for Numeric columns
            * 3.1.1.1 Mean Imputation
            * 3.1.1.2 Regression Imputation
        * 3.1.2 Single Imputation for Categoric columns
            * 3.1.2.1 Mode Imputation
    * 3.2 **Multiple Imputation**
        * 3.2.1 MICE Imputation

## 1. Dropping Variables
As a rule of thumb, delete a column if more than **70% of its values are missing**; otherwise, data imputation is preferred over deletion. The more information the model receives, the more reliable its results are.

In [None]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import xgboost
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import (GradientBoostingRegressor, GradientBoostingClassifier)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 500)

# Load the data
train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

def find_missing_percent(data):
    """
    Returns dataframe containing the total missing values and percentage of total
    missing values of a column.
    """
    total_missing = data.isnull().sum()
    percent_missing = (total_missing / len(data) * 100).round(2)
    # DataFrame.append was removed in pandas 2.0, so build the frame in one step
    miss_df = pd.DataFrame({'ColumnName': data.columns,
                            'TotalMissingVals': total_missing.values,
                            'PercentMissing': percent_missing.values})
    return miss_df

miss_df = find_missing_percent(train)
# Columns with missing values
print(f"Number of columns with missing values: {(miss_df['PercentMissing'] > 0.0).sum()}")
display(miss_df[miss_df['PercentMissing'] > 0.0])

# Drop the columns with more than 70% of missing values
drop_cols = miss_df[miss_df['PercentMissing'] > 70.0].ColumnName.tolist()
train = train.drop(drop_cols, axis=1)

## 2. Partial Deletion
### 2.1 Listwise Deletion
**Listwise deletion** is a technique in which every row that contains at least one missing value is deleted.

**Disadvantages:**
* Listwise deletion reduces the statistical power of the tests conducted. Statistical power relies in part on a large sample size, and because listwise deletion excludes every row with a missing value, it shrinks the sample being analysed.
* Listwise deletion is also problematic when the data are **Missing Not At Random (MNAR)**, e.g., when questions aim to extract sensitive information. Many subjects may skip such questions because of their intrusive nature but answer all other items. Listwise deletion excludes these respondents from the analysis, which may create a **bias** in the dataset.

**[Reference](https://en.wikipedia.org/wiki/Listwise_deletion)** 
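
To get a feel for how quickly listwise deletion shrinks the sample, consider a hypothetical table with 20 columns in which each cell is missing independently with probability 5%. A row survives only if all 20 cells are present, which happens with probability $0.95^{20} \approx 0.36$. The sketch below checks this on synthetic data; the column names and the missing rate are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows, n_cols, miss_rate = 1_000, 20, 0.05

# Synthetic numeric table (made-up data, not the house prices dataset).
data = pd.DataFrame(rng.normal(size=(n_rows, n_cols)),
                    columns=[f"x{i}" for i in range(n_cols)])

# Remove 5% of the cells, independently of everything else (MCAR).
data = data.mask(rng.random(data.shape) < miss_rate)

# Listwise deletion: keep only the rows without any missing value.
complete = data.dropna(axis=0)
expected_fraction = (1 - miss_rate) ** n_cols

print(f"Rows kept: {len(complete)} of {n_rows}")
print(f"Expected fraction of complete rows: {expected_fraction:.2f}")
```

Even a small missing rate per column can therefore discard most of the rows once many columns are involved.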

In [None]:
def listwise_deletion(train):
    """Drop every row that contains at least one missing value."""
    return train.dropna(axis=0)

train_lwd = listwise_deletion(train)
# Samples remaining after deletion
print(f"Train data shape: {train_lwd.shape}")
train_lwd.head()

## 3. Data Imputation
### 3.1 Single Imputation
Single imputation replaces each missing value with a single value, whereas multiple imputation generates several plausible values for each missing entry.
* 3.1.1 Single Imputation for Numeric columns
    * 3.1.1.1 Mean Imputation
    * 3.1.1.2 Regression Imputation
* 3.1.2 Single Imputation for Categoric columns
    * 3.1.2.1 Mode Imputation
   

#### 3.1.1 Single Imputation for Numeric columns
**3.1.1.1 Mean Imputation**
* Mean imputation replaces the missing values of a variable with the mean of its observed values. It can be applied only to numeric columns.

**Disadvantage:** Mean imputation shrinks the variance of the variable and weakens its correlations with other variables, so it is more likely to introduce bias in the model.
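
One source of this bias can be seen on a synthetic column: every imputed value sits exactly at the mean, so the spread of the column shrinks. The sketch below uses made-up data and a 40% missing rate chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic numeric column (made-up data).
x = pd.Series(rng.normal(loc=50, scale=10, size=5_000))

# Remove 40% of the values completely at random, then mean-impute them.
x_missing = x.mask(rng.random(len(x)) < 0.4)
x_imputed = x_missing.fillna(x_missing.mean())

print(f"Std of observed values:    {x_missing.std():.2f}")
print(f"Std after mean imputation: {x_imputed.std():.2f}")
```

The mean is preserved, but the standard deviation after imputation is noticeably smaller than that of the observed values.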


In [None]:
# Segregate numeric and categoric columns
# ('number' selects every integer and float dtype regardless of platform)
numeric_cols = train.select_dtypes('number').columns
categoric_cols = train.select_dtypes('object').columns

# Copy so that imputing one frame never modifies `train` or the other frames
train_numeric = train[numeric_cols].copy()
train_categoric = train[categoric_cols].copy()

def mean_imputation(train_numeric):
    """
    Mean Imputation. Returns an imputed copy and leaves the input unchanged.
    """
    train_numeric = train_numeric.copy()
    for col in train_numeric.columns:
        mean = train_numeric[col].mean()
        train_numeric[col] = train_numeric[col].fillna(mean)
    return train_numeric

train_mean_imp = mean_imputation(train_numeric)
train_mean_imp.head()

**3.1.1.2 Regression Imputation**
* A regression model is fitted in which the targets are the features with missing values and the predictors are the features without missing values. The model is trained on the rows where the target is observed, and its predictions replace the missing values. Because it uses the relationships between variables, regression imputation is less likely than mean imputation to introduce bias in the model.


In [None]:
# Select all the numeric columns for regression imputation
train_numeric_regr = train[numeric_cols].copy()
# Numeric columns with missing values which act as targets in training
target_cols = ['LotFrontage','GarageYrBlt']
# Predictors for regression imputation
predictors = train_numeric_regr.drop(target_cols, axis=1)

def find_missing_index(train_numeric_regr, target_cols):
    """
    Returns the index of the missing values in the columns.
    """
    miss_index_dict = {}
    for tcol in target_cols:
        index = train_numeric_regr[tcol][train_numeric_regr[tcol].isnull()].index
        miss_index_dict[tcol] = index
    return miss_index_dict

def regression_imputation(train_numeric_regr, target_cols, miss_index_dict, predictors):
    """
    Fits an XGBoost Regressor on the rows where the target is observed and
    replaces the missing values with the predictions.
    """
    for tcol in target_cols:
        index = miss_index_dict[tcol]
        y = train_numeric_regr[tcol]
        observed = y.notnull()
        xgb = xgboost.XGBRegressor(objective="reg:squarederror", random_state=42)
        # Fit only on observed targets; fitting on mean-filled targets would
        # teach the model to reproduce the mean for the missing rows
        xgb.fit(predictors[observed], y[observed])
        # Replace the missing values with the predictions
        train_numeric_regr.loc[index, tcol] = xgb.predict(predictors.loc[index])
    return train_numeric_regr

miss_index_dict = find_missing_index(train_numeric_regr, target_cols)
train_numeric_regr = regression_imputation(train_numeric_regr, target_cols, miss_index_dict, predictors)
train_numeric_regr.head()

#### 3.1.2 Single Imputation for Categoric columns
**3.1.2.1 Mode Imputation**
* Mode imputation replaces the missing values of a variable with its most frequent value (the mode). It is typically used for categoric columns.

**Disadvantage:** Mode imputation over-represents the most frequent category, so it is more likely to introduce bias in the model.
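
Mean and mode imputation are also available in scikit-learn through `SimpleImputer`, which may be convenient when the same fill values have to be reused on new data. The following is a minimal sketch on a made-up toy frame; the `area` and `quality` columns are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one numeric and one categoric column.
toy = pd.DataFrame({
    "area": [50.0, np.nan, 70.0, 90.0],
    "quality": pd.Series(["good", "good", np.nan, "poor"], dtype=object),
})

mean_imp = SimpleImputer(strategy="mean")           # for numeric columns
mode_imp = SimpleImputer(strategy="most_frequent")  # for categoric columns

toy_imputed = toy.copy()
# fit_transform returns a 2D array, so flatten it back into a single column.
toy_imputed["area"] = mean_imp.fit_transform(toy[["area"]]).ravel()
toy_imputed["quality"] = mode_imp.fit_transform(toy[["quality"]]).ravel()

print(toy_imputed)
```

After fitting, each imputer keeps the learned fill values in its `statistics_` attribute, so calling `transform` on another frame reuses the values learned here.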

In [None]:
def mode_imputation(train_categoric):
    """
    Mode Imputation. Returns an imputed copy and leaves the input unchanged.
    """
    train_categoric = train_categoric.copy()
    for col in train_categoric.columns:
        mode = train_categoric[col].mode().iloc[0]
        train_categoric[col] = train_categoric[col].fillna(mode)
    return train_categoric

train_mode_imp = mode_imputation(train_categoric)
# Concatenate the mean and mode imputed columns
train_imputed = pd.concat([train_mean_imp, train_mode_imp], axis=1)
train_imputed.head()

### 3.2 Multiple Imputation
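* In multiple imputation, each missing value is imputed several times, producing multiple completed datasets. Each dataset is analysed separately and the results are pooled, so that the uncertainty of the imputations is reflected in the final estimates.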
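
One possible way to obtain several imputed datasets with scikit-learn is to run `IterativeImputer` repeatedly with `sample_posterior=True` and different random seeds. The sketch below does this on synthetic data; the data, the number of imputations `m`, and the pooled statistic (a column mean) are assumptions made for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic, correlated data with about 20% of the second column removed.
x1 = rng.normal(size=500)
x2 = 2.0 * x1 + rng.normal(scale=0.5, size=500)
X = np.column_stack([x1, x2])
X_missing = X.copy()
X_missing[rng.random(500) < 0.2, 1] = np.nan

m = 5  # number of imputed datasets (an arbitrary choice for this sketch)
imputed_sets = [
    # sample_posterior=True draws each imputation from the predictive
    # distribution, so different seeds give different completed datasets.
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X_missing)
    for seed in range(m)
]

# Analyse each completed dataset separately, then pool the estimates.
means = [X_imp[:, 1].mean() for X_imp in imputed_sets]
print(f"Pooled mean of x2: {np.mean(means):.3f}")
print(f"Between-imputation std: {np.std(means, ddof=1):.3f}")
```

The spread of the estimate across the `m` datasets gives an indication of how much uncertainty the missing values add.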

#### 3.2.1 MICE (Multiple Imputation by Chained Equation)

    
#### MICE Algorithm:
The chained equation process can be broken down into six general steps:

* **Step 1:** A simple imputation, such as imputing the mean, is performed for every missing value in the dataset. These mean imputations can be thought of as “place holders.”
* **Step 2:** The “place holder” mean imputations for one variable (“var”) are set back to missing.
* **Step 3:** The observed values from the variable “var” in Step 2 are regressed on the other variables in the imputation model, which may or may not consist of all of the variables in the dataset. In other words, “var” is the dependent variable in a regression model and all the other variables are independent variables in the regression model. Other regressors, such as Gradient Boosting Regressor or XGBoost Regressor, can also be used for numeric data. These regression models operate under the same assumptions that one would make when performing linear, logistic, or Poisson regression models outside of the context of imputing missing data.
* **Step 4:** The missing values for “var” are then replaced with predictions (imputations) from the regression model. When “var” is subsequently used as an independent variable in the regression models for other variables, both the observed and these imputed values will be used.
* **Step 5:** Steps 2–4 are then repeated for each variable that has missing data. The cycling through each of the variables constitutes one iteration or “cycle.” At the end of one cycle all of the missing values have been replaced with predictions from regressions that reflect the relationships observed in the data.
* **Step 6:** Steps 2–4 are repeated for a number of cycles, with the imputations being updated at each cycle.

[**Reference** ](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/)
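
The six steps above can be sketched as a short loop. This is a simplified illustration, not a full MICE implementation: it assumes a purely numeric array, uses ordinary linear regression, fills in deterministic predictions, and produces a single completed dataset. The function name `chained_imputation` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def chained_imputation(X, n_cycles=10):
    """Illustrative chained-equation loop for a numeric array with NaNs."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Step 1: fill every missing value with its column mean as a placeholder.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(n_cycles):            # Step 6: repeat the whole cycle
        for j in range(X.shape[1]):      # Step 5: visit each variable in turn
            if not missing[:, j].any():
                continue
            # Step 2: treat the placeholders of column j as missing again,
            # i.e. train only on the rows where column j was observed.
            obs = ~missing[:, j]
            others = np.delete(filled, j, axis=1)
            # Step 3: regress column j on the other columns.
            model = LinearRegression().fit(others[obs], filled[obs, j])
            # Step 4: replace the missing entries with the predictions.
            filled[~obs, j] = model.predict(others[~obs])
    return filled

# Usage on synthetic data (made up for this sketch).
rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = 3.0 * a + rng.normal(scale=0.1, size=300)
X_true = np.column_stack([a, b])
X_miss = X_true.copy()
X_miss[rng.random(300) < 0.2, 0] = np.nan
X_miss[rng.random(300) < 0.2, 1] = np.nan

X_filled = chained_imputation(X_miss)
print(f"Missing values left: {np.isnan(X_filled).sum()}")
```

Observed values are never changed; only the originally missing cells are updated in each cycle.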

#### MICE Algorithm for Categorical data:
To impute categorical data, the MICE algorithm is wrapped in an encoding step and a decoding step:
* **Step 1:** Ordinal encode the non-null values.
* **Step 2:** Run steps 1 to 6 of the MICE algorithm on the ordinal-encoded data with a Gradient Boosting Classifier, using **mode imputation** instead of mean imputation as the initial strategy.
* **Step 3:** Convert the ordinal values back to categorical values.

[**Reference**](https://projector-video-pdf-converter.datacamp.com/17404/chapter4.pdf) 



In [None]:
def mice_imputation_numeric(train_numeric):
    """
    Impute numeric data using MICE imputation with Gradient Boosting Regressor.
    (we can use any other regressors to impute the data)
    Note: a single fit of IterativeImputer returns one imputed dataset.
    """
    iter_imp_numeric = IterativeImputer(GradientBoostingRegressor())
    imputed_train = iter_imp_numeric.fit_transform(train_numeric)
    train_numeric_imp = pd.DataFrame(imputed_train, columns=train_numeric.columns, index=train_numeric.index)
    return train_numeric_imp

def mice_imputation_categoric(train_categoric):
    """
    Impute categoric data using MICE imputation with Gradient Boosting Classifier.
    Steps:
    1. Ordinal Encode the non-null values
    2. Use MICE imputation with Gradient Boosting Classifier to impute the ordinal encoded data
    (we can use any other classifier to impute the data)
    3. Inverse transform the ordinal encoded data.
    """
    ordinal_dict = {}
    # Float frame that holds the ordinal codes; missing values stay NaN
    encoded = pd.DataFrame(np.nan, index=train_categoric.index, columns=train_categoric.columns, dtype=float)
    for col in train_categoric.columns:
        # Ordinal encode the non-null values of the column
        ordinal_dict[col] = OrdinalEncoder()
        not_null = train_categoric[col].notnull()
        nn_vals = train_categoric.loc[not_null, [col]].to_numpy()
        encoded.loc[not_null, col] = ordinal_dict[col].fit_transform(nn_vals).ravel()

    # Impute the data using MICE with Gradient Boosting Classifier
    iter_imp_categoric = IterativeImputer(GradientBoostingClassifier(), max_iter=5, initial_strategy='most_frequent')
    imputed_train = iter_imp_categoric.fit_transform(encoded)
    train_categoric_imp = pd.DataFrame(imputed_train, columns=encoded.columns, index=encoded.index).astype(int)

    # Inverse transform the ordinal codes back to the original categories
    for col in train_categoric_imp.columns:
        oe = ordinal_dict[col]
        train_arr = train_categoric_imp[[col]].to_numpy()
        train_categoric_imp[col] = oe.inverse_transform(train_arr).ravel()

    return train_categoric_imp

train_numeric_imp = mice_imputation_numeric(train_numeric)
train_categoric_imp = mice_imputation_categoric(train_categoric)

# Concatenate the numeric and categoric training data
# (SalePrice is already part of the numeric columns, so it is not added again)
train_mice_imp = pd.concat([train_numeric_imp, train_categoric_imp], axis=1)
train_mice_imp.head()