## Different types of Missing Values and ways to handle them

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
df = pd.read_csv('../input/titanic/train.csv')
df

In [None]:
df.isnull().sum()

## 1. Missing Completely at Random (MCAR)
#### Data are MCAR when the probability that a value is missing is unrelated both to the value itself and to every other variable in the dataset
#### For example, Embarked

In [None]:
df[df['Embarked'].isnull()]

## 2. Missing Not at Random (MNAR)
#### Data are MNAR when the missingness is systematically related to the missing values themselves or to other values in the dataset
#### For example, Cabin

In [None]:
#Adding a new column to df: 1 wherever Cabin is missing, otherwise 0
df['cabin_null_val'] = np.where(df['Cabin'].isnull(),1,0)

In [None]:
df.head()

In [None]:
#Comparing the share of missing Cabin values between passengers who died (0) and passengers who survived (1)
df.groupby(['Survived'])['cabin_null_val'].mean()

In [None]:
#The results show that a higher share of the passengers who died (0) have a null Cabin record
#This suggests that the missingness is related to survival, i.e. Cabin is not missing completely at random
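
The same check can be sketched on a small made-up frame, which makes the arithmetic easy to follow. The `toy` values below are illustrative assumptions, not rows from the Titanic file; the crosstab is simply another view of the same group-wise rate.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 1, 1, 1, 1],
    "Cabin": [np.nan, np.nan, np.nan, "C85", "E46", np.nan, "B28", "D33"],
})

# 1 where Cabin is missing, 0 otherwise
toy["cabin_null_val"] = toy["Cabin"].isnull().astype(int)

# Share of missing Cabin values within each Survived group
missing_rate = toy.groupby("Survived")["cabin_null_val"].mean()
print(missing_rate)

# Row-normalised contingency table of the same relationship
table = pd.crosstab(toy["Survived"], toy["cabin_null_val"], normalize="index")
print(table)
```

A large gap between the two group rates is a hint, not a proof, that the missingness is informative; on real data a formal test would be needed to rule out chance.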

## 3. Missing at Random (MAR)
#### Example: Men are less likely to disclose their salary
#### Explanation: The chance that salary is missing depends on another observed column (sex), so rows where sex is male are more likely to have a null salary
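
A minimal simulation of this salary example, assuming made-up missingness probabilities (40% for men, 10% for women) and a synthetic `people` frame; none of these numbers come from a real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic population: sex is always observed, salary is initially complete
people = pd.DataFrame({
    "sex": rng.choice(["male", "female"], size=n),
    "salary": rng.normal(50_000, 10_000, size=n),
})

# Assumed probabilities of hiding the salary, depending only on sex
p_missing = np.where(people["sex"] == "male", 0.4, 0.1)
hide = rng.random(n) < p_missing
people.loc[hide, "salary"] = np.nan

# Missing rate of salary within each sex
rate_by_sex = people["salary"].isnull().groupby(people["sex"]).mean()
print(rate_by_sex)
```

The missingness here depends on `sex` but not on the salary value itself, which is what separates MAR from MNAR.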

# Techniques of handling the missing values
### 1. Mean/Median/Mode
### 2. Random Sample Imputation
### 3. Capturing NaN values with new Features
### 4. End of Distribution Imputation
### 5. Arbitrary Imputation

## 1. Mean/Median/Mode
#### When should we apply?
##### Mean/median/mode imputation assumes that the data is missing completely at random (MCAR)
##### We replace the NaN values with the mean or median of the variable (numerical) or with its most frequent value, the mode (typically categorical)
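
As a quick illustration of the three statistics, here is a sketch on a hypothetical `toy` frame (the values are invented for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numerical and a categorical column
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Embarked": ["S", "C", "S", np.nan, "S", "Q"],
})

# Numerical column: fill with the mean or the median of the observed values
toy["Age_mean"] = toy["Age"].fillna(toy["Age"].mean())
toy["Age_median"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: mode() returns a Series (ties are possible), so take the first entry
toy["Embarked_mode"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
```

The median is usually preferred over the mean for skewed variables, because it is less sensitive to outliers.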

In [None]:
#Reading the dataset (only specific columns, for simplicity)
df = pd.read_csv('../input/titanic/train.csv',usecols = ['Survived', 'Age', 'Fare'])
df.head()

In [None]:
#Check the percentage of null values
df.isnull().mean()

In [None]:
#Computing the median of the Age column
median = df.Age.median()
median

In [None]:
#Creating a function that adds a new column in which the null values are replaced by the median
def median_col(df,variable,median):
    df[variable+'_median'] = df[variable].fillna(median)

In [None]:
#Calling the function and checking the dataframe again
median_col(df,'Age',median)
df

In [None]:
#Comparing the two columns on the rows where Age was null (original vs. filled values)
df[df['Age'].isnull()]

In [None]:
print(df['Age'].std())
print(df['Age_median'].std())

In [None]:
#The standard deviation (std) drops only moderately, but it does drop, because every gap is filled with the same central value

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
df.Age.plot(kind = 'kde',ax=ax)
df.Age_median.plot(kind = 'kde',ax=ax, color = 'red')
lines,labels = ax.get_legend_handles_labels()
ax.legend(lines,labels,loc = 'best')

In [None]:
#As we can observe, replacing the NaN values with the median raises the density of Age_median around the median, giving a taller and narrower peak

### Advantages and Disadvantages
#### Advantages
1. Easy to implement
2. Faster to get a complete dataset

#### Disadvantages
1. Changes or distorts the original variance (observe the graph)
2. Impacts correlation with other variables
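
Both disadvantages can be reproduced on synthetic data. The sketch below assumes a made-up `age` variable and a `fare` variable that is correlated with it by construction, then removes 20% of the ages completely at random; it is not the Titanic data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Synthetic variables: fare is built to be strongly correlated with age
age = pd.Series(rng.normal(30, 14, size=n))
fare = 2 * age + rng.normal(0, 10, size=n)

# Remove 20% of the ages completely at random, then fill with the median
mask = rng.random(n) < 0.2
age_missing = age.mask(mask)
age_median = age_missing.fillna(age_missing.median())

# Spread and correlation before and after imputation
std_before, std_after = age_missing.std(), age_median.std()
corr_before, corr_after = age_missing.corr(fare), age_median.corr(fare)

print(f"std:  {std_before:.2f} -> {std_after:.2f}")
print(f"corr: {corr_before:.3f} -> {corr_after:.3f}")
```

Both numbers shrink after imputation: the filled rows all sit at one central value, so they add no spread and carry no information about `fare`.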

## 2. Random Sample Imputation
#### When should we apply?
##### Random sample imputation assumes that the data is missing completely at random (MCAR)
##### It consists of drawing random observations from the non-missing values of the variable and using them to replace the NaN values

In [None]:
#Reading the dataset (only specific columns, for simplicity)
df = pd.read_csv('../input/titanic/train.csv',usecols = ['Survived', 'Age', 'Fare'])
df.head()

In [None]:
#Checking null values
df.isnull().sum()

In [None]:
def null_impute(df,variable):
    #Start from a copy of the original column
    df[variable+'_random']=df[variable]
    #Draw as many observed values as there are NaN values
    n = df[variable].isnull().sum()
    random_sample = df[variable].dropna().sample(n,random_state = 0)
    #Pandas aligns on the index, so give the sample the index of the NaN rows
    random_sample.index = df[df[variable].isnull()].index
    #Write the sampled values into the variable_random column
    df.loc[df[variable].isnull(),variable+'_random'] = random_sample

In [None]:
#Calling the function
null_impute(df,'Age')
df.head()

In [None]:
print(df['Age'].std())
print(df['Age_random'].std())

In [None]:
#The std is almost the same as the original, so the variance is preserved better than with the first method

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
df.Age.plot(kind = 'kde',ax=ax)
df.Age_random.plot(kind = 'kde',ax=ax, color = 'red')
lines,labels = ax.get_legend_handles_labels()
ax.legend(lines,labels,loc = 'best')

In [None]:
#It can be seen that the two density curves are almost the same

### Advantages and Disadvantages
#### Advantages
1. Easy to implement
2. Less change or distortion in the original variance (observe the graph)

#### Disadvantages
1. Randomness will not work in every situation
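
One concrete case where plain sampling breaks: `sample(n)` without replacement raises an error when a column has more missing values than observed ones. Below is a sketch of a variant that returns a new Series instead of modifying the frame and falls back to sampling with replacement in that case. The name `random_sample_impute` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd

def random_sample_impute(series, random_state=0):
    """Return a copy of `series` with NaNs replaced by randomly drawn observed values."""
    missing = series.isnull()
    n_missing = int(missing.sum())
    observed = series.dropna()
    # Nothing to fill, or nothing to draw from
    if n_missing == 0 or observed.empty:
        return series.copy()
    # Sample with replacement only when there are more gaps than observed values
    draws = observed.sample(
        n_missing,
        replace=n_missing > len(observed),
        random_state=random_state,
    )
    filled = series.copy()
    # to_numpy() drops the sampled index, so values are assigned by position
    filled[missing] = draws.to_numpy()
    return filled

# Four gaps but only three observed values
ages = pd.Series([22.0, np.nan, 38.0, np.nan, np.nan, np.nan, 26.0])
filled = random_sample_impute(ages)
print(filled)
```

Fixing `random_state` keeps the result reproducible, at the cost of every call drawing the same values.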

## 3. Capturing NaN values with new Features
### When does it work?
#### It works well when the data is not missing completely at random

In [None]:
#Reading the dataset (only specific columns, for simplicity)
df = pd.read_csv('../input/titanic/train.csv',usecols = ['Survived', 'Age', 'Fare'])
df.head()

In [None]:
#Creating a new column with 1 wherever a null value is present, else 0
df['Age_null'] = np.where(df.Age.isnull(),1,0)
df.head()

### Advantages and Disadvantages
#### Advantages
1. Easy to implement
2. Captures the importance of missing values

#### Disadvantages
1. Creates an additional feature per variable (curse of dimensionality)
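
The flag on its own leaves the NaN values in the original column, so it is commonly paired with one of the fill techniques. A minimal sketch on invented values, assuming the median as the fill:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two gaps
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0, np.nan]})

# Record where the value was missing before filling anything
toy["Age_null"] = toy["Age"].isnull().astype(int)

# Fill the gaps (median used here; any other technique could be substituted)
toy["Age_filled"] = toy["Age"].fillna(toy["Age"].median())

print(toy)
```

The model then sees a complete `Age_filled` column plus `Age_null`, which preserves the information that a value was imputed.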

## 4. End of Distribution imputation

If there is suspicion that a value is not missing at random, capturing that information is important. In this scenario, one replaces the missing data with a value at the tail of the variable's distribution, so that the imputed rows stand apart from the observed ones.
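
A sketch of how such a tail value can be computed, on invented ages. The mean plus three standard deviations suits roughly normal variables; the IQR-based variant shown next to it is an alternative sometimes used for skewed variables, included here as an assumption rather than as part of this notebook's method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages with two gaps
ages = pd.Series([22.0, 38.0, 26.0, 35.0, np.nan, 54.0, 2.0, np.nan, 27.0])

# Roughly normal variable: three standard deviations above the mean
gaussian_end = ages.mean() + 3 * ages.std()

# Skewed variable (alternative): third quartile plus three times the IQR
q1, q3 = ages.quantile([0.25, 0.75])
iqr_end = q3 + 3 * (q3 - q1)

# Fill the gaps with the chosen tail value
ages_end = ages.fillna(gaussian_end)

print(gaussian_end, iqr_end)
print(ages_end)
```

Either value lies beyond the observed maximum in this toy example, which is what lets a model treat the imputed rows as a separate group.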

In [None]:
#Reading the dataset (only specific columns, for simplicity)
df = pd.read_csv('../input/titanic/train.csv',usecols = ['Survived', 'Age', 'Fare'])
df.head()

In [None]:
#Plotting a histogram to understand the data
#We will take a value from the far end of the distribution (around 70+ on the x-axis) and replace the NaN values with it
df['Age'].plot(kind = 'hist', bins = 50)

In [None]:
#Checking for outliers before imputing; the boxplot will show some
sns.boxplot(data = df['Age'])

In [None]:
#Computing the value three standard deviations above the mean (the extreme right end)
extreme = df.Age.mean()+3*df.Age.std()

In [None]:
#Creating a function to replace the null values with the extreme value
def age_distribution_end(df,extreme,variable):
    df[variable+'_end'] = df[variable].fillna(extreme)

In [None]:
#Calling the function and checking the dataset
age_distribution_end(df,extreme,'Age')
df.head()

In [None]:
#Checking the boxplot again. Observe that the outliers are no longer flagged: the many imputed values at the extreme widen the interquartile range
sns.boxplot(data = df['Age_end'])

### Advantages and Disadvantages of End of Distribution Imputation
#### Advantages
1. Can bring out the importance of missing values

#### Disadvantages
1. Changes Co-variance/variance
2. May create biased data

## 5. Arbitrary Value Imputation

This technique comes from Kaggle competitions. It consists of replacing every NaN with a single arbitrary value chosen by the data scientist.
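
Since the value is chosen by hand, one simple safeguard is to check that it does not already occur in the column, otherwise imputed and real rows become indistinguishable. A sketch with a hypothetical helper `impute_arbitrary` and invented ages:

```python
import numpy as np
import pandas as pd

def impute_arbitrary(series, value):
    """Fill NaNs with `value`, refusing values that already occur in the data."""
    if (series == value).any():
        raise ValueError(f"{value} already occurs in the data; pick another value")
    return series.fillna(value)

# Hypothetical toy ages with two gaps
ages = pd.Series([22.0, np.nan, 38.0, 0.75, np.nan, 54.0])

# -1 cannot be a real age, so it cannot collide with an observed value
ages_minus_one = impute_arbitrary(ages, -1)
print(ages_minus_one.tolist())
```

A value far outside the observed range, such as -1 or 100 for age, keeps the imputed rows easy to spot.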

In [None]:
#Reading the dataset (only specific columns, for simplicity)
df = pd.read_csv('../input/titanic/train.csv',usecols = ['Survived', 'Age', 'Fare'])
df.head()

In [None]:
#Replacing the NaN values with 0 and 100
def impute_nan(df, variable):
    df[variable+'_zero'] = df[variable].fillna(0)
    df[variable+'_hundred'] = df[variable].fillna(100)

In [None]:
#Calling the function and checking the dataset
impute_nan(df, 'Age')
df.head()

In [None]:
df['Age_zero'].hist(bins=50)

In [None]:
df['Age_hundred'].hist(bins=50)

### Advantages and Disadvantages of Arbitrary value Imputation
#### Advantages
1. Easy to implement
2. Captures the importance of missingness, if there is any

#### Disadvantages
1. Distorts the original distribution of the variable
2. If missingness is not important, it may mask the predictive power of the original variable by distorting its distribution
3. Hard to decide which value to use