# 7 Ways to Manage Missing Values in Machine Learning

1. Deleting Rows with missing values
2. Impute missing values for continuous variable
3. Impute missing values for categorical variable
4. Other Imputation Methods
5. Using Algorithms that support missing values
6. Prediction of missing values
7. Imputation using Deep Learning Library — Datawig

In [None]:
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os 

%matplotlib inline

# Display all the columns of the dataframe
pd.set_option('display.max_columns', None)

In [None]:
# Working with Titanic Data
df = pd.read_csv("Titanic_Train.csv")
df.head(10)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
# Using seaborn heatmap to see missing values
plt.figure(figsize = (20,10))  # Set Figure size 
sns.heatmap(df.isnull(), yticklabels=False, cbar=True, cmap='viridis')
plt.show()

In [None]:
# Step 1 - get the list of features with missing values
features_with_na = [feature for feature in df.columns if df[feature].isnull().sum() >= 1]
print(features_with_na)

# Step 2 - if any features have missing values, print each feature name with the number and percentage of missing values
if len(features_with_na) == 0:
    print('There are no missing values in the dataframe.')
else:
    for feature in features_with_na:
        n_missing = df[feature].isnull().sum()
        print(feature, n_missing, 'missing values,', np.round(n_missing / len(df) * 100, 2), '%')
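
The same count and percentage report can also be built as a single table. The sketch below is only an illustration: it uses a small hypothetical frame (`toy`) instead of the Titanic file, and the column names `n_missing` and `pct_missing` are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Count and percentage of missing values per column, in one table
summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(2),
})

# Keep only the columns that actually have missing values, worst first
summary = summary[summary["n_missing"] > 0].sort_values("pct_missing", ascending=False)
print(summary)
```

Sorting by percentage puts the columns that need the most attention at the top.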

## 1. Delete Missing Rows

In [None]:
data = df.copy()
print('DataFrame Dimensions - ', data.shape)
data.isnull().sum()

In [None]:
# Drop every row that contains at least one missing value
data.dropna(inplace=True)
print(data.isnull().sum())
print(data.shape)

<b>Pros</b>:

A model trained only on complete rows never sees artificial (imputed) values, which can make it more robust when few rows are affected.

<b>Cons</b>:

A lot of information can be lost, because a row is discarded even if only one of its values is missing.

Works poorly when a large share of the rows contain missing values.
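
Deletion does not have to be all or nothing. The sketch below, on a hypothetical toy frame, compares dropping every incomplete row with two milder variants: dropping rows only when a chosen column is missing, and first removing columns that are mostly empty. The 50% cutoff is an arbitrary assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Drop every row with any missing value: only fully observed rows survive
all_dropped = toy.dropna()

# Drop rows only when a chosen column is missing
age_known = toy.dropna(subset=["Age"])

# Drop columns that are mostly empty (assumed cutoff: more than 50% missing),
# then drop the remaining incomplete rows
mostly_empty = toy.columns[toy.isnull().mean() > 0.5]
trimmed = toy.drop(columns=mostly_empty).dropna()

print(all_dropped.shape, age_known.shape, trimmed.shape)
```

In this toy frame the plain `dropna()` removes every row, while the two variants keep three rows each.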

## 2. Impute missing values with Mean/Median

For columns with continuous numeric values, missing entries can be replaced with the mean or median of the remaining values in the column (the mode is also possible, though it suits discrete values better). Unlike deleting rows, this keeps every observation.

Replacing missing values with one of these summary statistics is a simple statistical approach to handling them. The median is usually the safer choice when the column is skewed or contains outliers.

In [None]:
# Example of Age column
data = df.copy()
data['Age'][:20]

In [None]:
# Replace missing values with the mean (kept in a separate Series for comparison)
age_mean_imputed = data["Age"].fillna(data["Age"].mean())

# Replace missing values with the median
data["Age"] = data["Age"].fillna(data["Age"].median())

# print updated column
data['Age'][:20]

<b>Pros</b>:

Prevents the data loss that comes from deleting rows or columns.

Easy to implement and works well with a small dataset.

<b>Cons</b>:

Works only with numerical continuous variables.

Can cause data leakage if the mean or median is computed on the full dataset before the train/test split.

Does not take the covariance between features into account.
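
One way to avoid the leakage mentioned above is to learn the statistic from the training rows only and reuse it on the test rows. The following is a minimal sketch on a hypothetical six-row column; the split (first four rows for training) is an assumption made purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for Age
toy = pd.DataFrame({"Age": [20.0, np.nan, 30.0, 40.0, np.nan, 80.0]})

# Assumed split: first four rows for training, last two for testing
train, test = toy.iloc[:4].copy(), toy.iloc[4:].copy()

# Learn the statistic from the training rows only
train_median = train["Age"].median()

# Apply the same training statistic to both splits
train["Age"] = train["Age"].fillna(train_median)
test["Age"] = test["Age"].fillna(train_median)

print(train_median, test["Age"].tolist())
```

Here the training median differs from the median of the full column, which is exactly the information that would otherwise leak from the test rows.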

## 3. Impute missing values for categorical variable

When the missing values are in a categorical column (string or numerical), they can be replaced with the most frequent category. If the number of missing values is very large, they can instead be replaced with a new category of their own.

In [None]:
# Count missing values per column
data = df.copy()
data.isnull().sum()

In [None]:
data['Cabin'] = data['Cabin'].fillna('Unknown')
data.isnull().sum()

<b>Pros</b>:

Prevents the data loss that comes from deleting rows or columns.

Easy to implement and works well with a small dataset.

Adding a separate category keeps the information that a value was missing.

<b>Cons</b>:

Works only with categorical variables.

The new category adds features to the model during encoding, which may hurt performance.
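
The cell above shows only the new-category variant. For completeness, here is a minimal sketch of both options on a hypothetical `Embarked`-style column; the column names `Embarked_mode` and `Embarked_unknown` are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# mode() returns a Series (ties are possible), so take the first entry
most_frequent = toy["Embarked"].mode()[0]
toy["Embarked_mode"] = toy["Embarked"].fillna(most_frequent)

# Alternative: treat missingness as its own category
toy["Embarked_unknown"] = toy["Embarked"].fillna("Unknown")

print(toy)
```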

## 4. Other Imputation Methods

Depending on the nature or type of the data, other imputation methods may be more appropriate.

For example, for a variable with longitudinal behavior (repeated measurements over time), it can make sense to fill a missing value with the last valid observation. This is known as the last observation carried forward (LOCF) method. The Titanic rows have no time order, so the Age examples in this section only demonstrate the pandas calls.

In [None]:
# Example of Age column
data = df.copy()
data['Age'][:20]

In [None]:
# Series.ffill(): propagate the last valid observation forward to the next valid one
# Series.bfill(): use the NEXT valid observation to fill the gap
# (fillna(method='ffill') is deprecated in recent pandas versions)

data["Age"] = data["Age"].ffill()
data['Age'][:20]

For a time-series variable, it makes sense to interpolate a missing value from the observations before and after its timestamp.

In [None]:
# Example of Age column
data = df.copy()
data['Age'][:20]

In [None]:
data["Age"] = data["Age"].interpolate(method='linear', limit_direction='forward', axis=0)
data['Age'][:20]

## 5. Using Algorithms that support missing values

Not all machine learning algorithms support missing values, but some are robust to them. The k-NN algorithm can leave a column out of the distance measure when its value is missing, and naive Bayes can skip a missing feature when making a prediction. Such algorithms can be used when the dataset contains null or missing values.

However, the scikit-learn implementations of naive Bayes and k-nearest neighbors do not accept missing values.

Another option is a random forest, which works well on non-linear and categorical data. Because it averages many decision trees, it balances bias and variance and tends to perform well on large datasets. Whether it accepts missing values directly depends on the implementation.

<b>Pros</b>:

No need to handle missing values column by column, as the algorithm deals with them internally.

<b>Cons</b>:

The scikit-learn implementations of k-NN and naive Bayes do not accept missing values, so this approach depends on the library and estimator being used.

## 6. Prediction of missing values
The earlier methods do not exploit the correlation between the variable that contains missing values and the other variables. The features that have no nulls can be used to predict the missing values.

A regression or classification model can be used for this, depending on whether the feature with missing values is continuous or categorical.


```text
Here the 'Age' column contains missing values, so to predict them the data is split as follows:
y_train: rows from data["Age"] with non-null values
y_test: rows from data["Age"] with null values (the values to be predicted)
X_train: the remaining features of the rows where data["Age"] is non-null
X_test: the remaining features of the rows where data["Age"] is null
```



In [None]:
from sklearn.linear_model import LinearRegression
import pandas as pd

data = pd.read_csv("Titanic_Train.csv")
data = data[["Survived", "Pclass", "Sex", "SibSp", "Parch", "Fare", "Age"]].copy()  # copy avoids SettingWithCopyWarning below
print(data.head(10))

# check missing values
print(data.isnull().sum())

# Replace categorical values with numerical ones
data["Sex"] = [1 if x=="male" else 0 for x in data["Sex"]]

# select rows with missing Age values 
test_data = data[data["Age"].isnull()]
data.dropna(inplace=True)

y_train = data["Age"]
X_train = data.drop("Age", axis=1)

# drop Age column since it will be predicted
X_test = test_data.drop("Age", axis=1)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

y_pred

<b>Pros</b>:
    
Often gives a better result than the earlier methods.

Takes into account the covariance between the column with missing values and the other columns.

<b>Cons</b>:

The predictions are only a proxy for the true values.
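
The cell above stops at the predicted values. A possible next step, sketched here on a hypothetical toy frame rather than the Titanic file, is to write the predictions back into the column and keep a flag marking which values were estimated. The flag column name `Age_was_missing` is an assumption for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy frame with two missing ages
toy = pd.DataFrame({
    "Pclass": [1, 2, 3, 1, 2, 3],
    "Fare": [80.0, 30.0, 8.0, 90.0, 25.0, 7.0],
    "Age": [40.0, 30.0, np.nan, 45.0, np.nan, 20.0],
})

features = ["Pclass", "Fare"]
missing = toy["Age"].isnull()

# Fit on the rows where Age is known
model = LinearRegression()
model.fit(toy.loc[~missing, features], toy.loc[~missing, "Age"])

# Keep a flag so downstream steps know which ages were estimated
toy["Age_was_missing"] = missing

# Fill only the missing entries with the model's predictions
toy.loc[missing, "Age"] = model.predict(toy.loc[missing, features])

print(toy)
```

Keeping the flag makes it possible to check later whether the imputed rows behave differently from the observed ones.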

## 7. Imputation using Deep Learning Library — Datawig
This method works well with categorical, continuous, and non-numerical features. Datawig is a library that trains deep neural networks to impute missing values in a dataframe.

Datawig can take a dataframe and fit an imputation model for each column with missing values, using all other columns as inputs.

Below is the code to impute missing values in the Age column:

In [None]:
import pandas as pd
!pip install datawig
import datawig

data = pd.read_csv("Titanic_Train.csv")

df_train, df_test = datawig.utils.random_split(data)

#Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['Pclass','SibSp','Parch'], # column(s) containing information about the column we want to impute
    output_column= 'Age', # the column we'd like to impute values for
    output_path = 'imputer_model' # stores model data and metrics
    )

#Fit an imputer model on the train data
imputer.fit(train_df=df_train, num_epochs=50)

#Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)