<a href="https://colab.research.google.com/github/sjcorp/notebooks/blob/master/ml_fundamentals/ml_feature_engineering_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lifecycle of a Data Science Project

1. Data Collection Strategy - Company Databases, 3rd Party APIs, Surveys
2. Feature Engineering - Handling Missing Values

# Handling Missing Values

What are the different types of missing data?
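- MCAR (Missing Completely at Random): the probability of a value being missing is unrelated to any observed or unobserved data, so dropping those cases does not bias the inferences made.
- MNAR (Missing Not at Random): the missingness depends on the missing value itself or on factors that were not recorded.
- MAR (Missing at Random): the missingness depends only on other observed variables, not on the missing value itself.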
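
The three mechanisms above can be made concrete with a small simulation. This is only a sketch on synthetic data: the `toy` frame, the `sex` and `salary` columns, and all probabilities are assumptions chosen for illustration, not values from the Titanic dataset used later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data (hypothetical columns, for illustration only)
toy = pd.DataFrame({
    "sex": rng.choice(["male", "female"], size=n),
    "salary": rng.normal(50_000, 10_000, size=n),
})

# MCAR: every row has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends on an observed column (sex), not on salary itself
mar_mask = rng.random(n) < np.where(toy["sex"] == "male", 0.35, 0.05)

# MNAR: missingness depends on the hidden value itself (high salaries are withheld)
mnar_mask = rng.random(n) < np.where(toy["salary"] > 60_000, 0.6, 0.05)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = toy.loc[~mask, "salary"].mean()
    print(f"{name}: share missing={mask.mean():.2f}, observed mean salary={observed_mean:.0f}")
```

Under MCAR the mean of the observed salaries stays close to the true mean, while under MNAR it is pulled downward because high salaries are the ones that go missing.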

Two types of data could be missing:
- Continuous Data
- Categorical Data


# MNAR

# Load the Titanic training data

import pandas as pd
df = pd.read_csv('drive/My Drive/Colab Notebooks/datasets/titanic_train.csv')

df.head()

df.isnull().sum()
# Cabin and Age have many NaN values, and they are not missing at random: the data was collected after the accident.

# MCAR

# Flag rows with a missing Cabin value, then compare the share of missing values between survivors and non-survivors
import numpy as np
df['cabin_null']=np.where(df['Cabin'].isnull(),1,0)
df['cabin_null'].mean()

df.groupby(['Survived'])['cabin_null'].mean()

### Missing at Random

- Men not quoting Salary
- Women not quoting Age

# Techniques of handling missing values

1. Mean/Median/Mode replacement
2. Random Sample Imputation
3. Capturing NAN Values with a New Feature
4. End of Distribution Imputation
5. Arbitrary Imputation
6. Frequent Categories Imputation
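
Arbitrary imputation and frequent category imputation can be sketched briefly on toy data. This is a minimal illustration under stated assumptions: the `toy` frame, its `age` and `embarked` columns, and the placeholder `ARBITRARY_VALUE` are hypothetical and were chosen only for the example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, for illustration only)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan],
    "embarked": ["S", "C", np.nan, "S", "S"],
})

# Arbitrary imputation: replace NaN with a fixed value outside the normal range,
# so that imputed rows remain easy to identify. The value -1 is an assumption.
ARBITRARY_VALUE = -1
toy["age_arbitrary"] = toy["age"].fillna(ARBITRARY_VALUE)

# Frequent category imputation: replace NaN with the most common category.
# Series.mode() ignores NaN by default and returns a Series, hence the [0].
most_frequent = toy["embarked"].mode()[0]
toy["embarked_frequent"] = toy["embarked"].fillna(most_frequent)

print(toy)
```

Writing the results to new columns keeps the original values available for comparison.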

# Categorical Value Imputation

# Mean Median Mode Imputation

When?
- Data is missing completely at Random

How?
- Replace the NaN with the mean, median, or most frequent value (mode) of the variable

Advantages:
- Easy to implement
- Robust to outliers (when the median is used)
- Fastest way to complete the dataset

Disadvantages:
- Distorts the original variance


# Median Imputation

import pandas as pd
df = pd.read_csv('drive/My Drive/Colab Notebooks/datasets/titanic_train.csv', usecols = ['Age', 'Fare', 'Survived'])

df.isnull().mean()

def impute_nan(df, variable, median):
  df[variable+'_median'] = df[variable].fillna(median)

median = df.Age.median()
median

impute_nan(df,'Age',median)
df.head()

# Check whether imputation has changed the spread of the distribution

print(df['Age'].std())
print(df['Age_median'].std())

import matplotlib.pyplot as plt
%matplotlib inline

fig = plt.figure()
ax = fig.add_subplot(111)
df['Age'].plot(kind='kde', ax = ax)
df['Age_median'].plot(kind='kde', ax=ax,color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

# Random Sample Imputation

When: Data is missing completely at random (MCAR)

How: Draw random observations from the non-missing values of the variable to replace NaN values

Pros:
- Easy to implement
- Less distortion in variance

Cons:
- Randomness will not work in every situation
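
The "less distortion in variance" claim can be checked numerically. The sketch below uses synthetic, normally distributed values with 30% removed completely at random; the distribution parameters, the missing share, and the variable names are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic values (assumed distribution), with roughly 30% removed completely at random
values = pd.Series(rng.normal(30, 14, size=1_000))
missing = rng.random(len(values)) < 0.3
values_missing = values.mask(missing)  # set the selected positions to NaN

# Median imputation: every gap receives the same value
median_filled = values_missing.fillna(values_missing.median())

# Random sample imputation: every gap receives a randomly drawn observed value
donors = values_missing.dropna().sample(values_missing.isnull().sum(), random_state=0)
donors.index = values_missing[values_missing.isnull()].index  # align donors with the gaps
random_filled = values_missing.copy()
random_filled.loc[values_missing.isnull()] = donors

std_observed = values_missing.std()
std_median = median_filled.std()
std_random = random_filled.std()
print(f"observed: {std_observed:.2f}, median-filled: {std_median:.2f}, random-filled: {std_random:.2f}")
```

Median imputation piles all filled values onto one point and shrinks the standard deviation, whereas random sampling keeps it close to that of the observed values.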

# Random Sample Imputation

import pandas as pd
df = pd.read_csv('drive/My Drive/Colab Notebooks/datasets/titanic_train.csv', usecols = ['Age', 'Fare', 'Survived'])

df.head()

# Code to find out NULL Values
df.isnull().sum()

df['Age'].isnull().sum()

df.isnull().mean()

df['Age'].dropna().sample()

df['Age'].dropna().sample(df['Age'].isnull().sum(),random_state = 0)

df[df['Age'].isnull()].index

def impute_nan(df, variable, median):
  df[variable+"_median"] = df[variable].fillna(median)
  df[variable+"_random"] = df[variable]
  # Draw as many random observed values as there are missing values
  random_sample = df[variable].dropna().sample(df[variable].isnull().sum(),random_state=0)
  # pandas aligns on the index, so give the sample the index of the missing rows
  random_sample.index=df[df[variable].isnull()].index
  df.loc[df[variable].isnull(),variable+'_random']=random_sample

median = df.Age.median()
median

impute_nan(df,"Age",median)

df.head()

import matplotlib.pyplot as plt
%matplotlib inline

fig = plt.figure()
ax = fig.add_subplot(111)
df['Age'].plot(kind='kde', ax=ax)
df.Age_median.plot(kind='kde', ax=ax,color='red') # Just to check distortion in the variance
df.Age_random.plot(kind='kde', ax=ax,color='green')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels,loc='best')

# Capturing NAN Values with a New Feature

When: Missing values are not at random (MNAR)

How: Create a new replacement feature (1 if missing, 0 if not missing)

Pros:
- Easy to implement
- Captures importance of missing values (by 1,0 as a flag)

Cons:
- Creates additional features (curse of dimensionality)

import pandas as pd
df = pd.read_csv('drive/My Drive/Colab Notebooks/datasets/titanic_train.csv', usecols = ['Age', 'Fare', 'Survived'])

df.head()

import numpy as np
df['Age_NAN']=np.where(df['Age'].isnull(),1,0)

df.head()

df.Age.median()

df['Age'] = df['Age'].fillna(df['Age'].median()) # Missingness has already been captured by the Age_NAN flag. This will help in evaluating the importance of age.

df.head()

# End of distribution imputation

When: Missing values are not at random (MNAR)

How: Replace NaN values with a value from the far end of the distribution (for example, mean + 3 standard deviations)

Pros:

- Easy to implement
- Captures the importance of missing values

Cons:

- Distorts the original distribution of variables
- If missing data is not important, it may mask the predictive power of the original variable by distorting the distribution
- If the number of NA is big, it will mask true outliers in the distribution
- If the number of NA is small, the replaced NA may be considered an outlier and preprocessed in a subsequent feature engineering step

import pandas as pd
df = pd.read_csv('drive/My Drive/Colab Notebooks/datasets/titanic_train.csv', usecols = ['Age', 'Fare', 'Survived'])

df.Age.hist(bins=50)

extreme = df.Age.mean()+3*df.Age.std()

import seaborn as sns
sns.boxplot(x='Age', data=df)

def impute_nan(df,variable, median,extreme):
  df[variable+"_end_distribution"]=df[variable].fillna(extreme) # new variable with NaN replaced by the extreme value
  df[variable]=df[variable].fillna(median) # original variable with NaN replaced by the median

impute_nan(df,'Age',df.Age.median(),extreme)

df.head()

df['Age'].hist(bins=50)

df['Age_end_distribution'].hist(bins=50)

sns.boxplot(x='Age_end_distribution', data=df)