# 2. Managing Missing Data

Having looked at the data in chapter 1, we can see that some variables of interest, such as Age, are missing for roughly 20% of the passengers in the dataset. Our exploration also hinted that age could be a valuable predictor of whether someone survived, so we have a strong incentive to use that feature.

In this chapter, we'll discuss several methods for dealing with missing data, either using simple summary statistics or by devising more complex models for predicting missing values based on other features that are available. We'll also cover some statistical models that can handle missing data for you, and what benefits that might offer.

The first thing we'll do is note whether the variable of interest is missing at all, as this can be an important marker that we can use to indicate to ML models that the input data isn't original. 

Note: This chapter is an exploration of some imputation approaches - see also:
 * [sklearn.impute](https://scikit-learn.org/stable/modules/impute.html) 
 * [impyute](https://impyute.readthedocs.io/en/master/)


Credit: The code written below was developed in collaboration with ChatGPT from OpenAI

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd

In [2]:
# Load data
data_dir = Path.cwd().parent.parent / 'data/titanic'
train = pd.read_csv(data_dir / 'train.csv')
test = pd.read_csv(data_dir / 'test.csv')

## 2.1. Data Preprocessing

First, we record whether each value is missing (this can also be done using the MissingIndicator from sklearn.impute).
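
As a minimal sketch of the MissingIndicator route, assuming a small made-up numeric array (columns standing in for Age and Fare, not the real Titanic values), the indicator reports which columns contained gaps and flags the affected rows:

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Toy data for illustration only: columns are [Age, Fare]
X = np.array([
    [22.0, 7.25],
    [np.nan, 71.28],
    [26.0, 7.92],
    [np.nan, 8.05],
])

# By default only columns that contain missing values get an indicator column
indicator = MissingIndicator()
age_missing = indicator.fit_transform(X)

print(indicator.features_)   # indices of the columns that had missing values
print(age_missing.ravel())   # one boolean flag per row for the Age column
```

Because the indicator is fitted first and applied afterwards, the same fitted object can be reused on the test data with `indicator.transform(...)`.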


### 2.1.1. Understanding what's missing

In their book *Statistical Analysis with Missing Data* (3rd Edition, 2020), Little and Rubin distinguish between the patterns of missing data and the mechanisms of missing data. This distinction highlights that it's important to understand where and when data is missing, but also the reasons why data is absent. 
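
To make the idea of a missingness mechanism more concrete, here is a small simulation sketch. All numbers are invented for illustration: we assume age depends on passenger class, then remove ages either completely at random or with a probability that depends on class. The observed mean stays close to the truth in the first case but drifts in the second.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical population: first-class passengers are older on average
pclass = rng.integers(1, 4, size=n)                     # classes 1, 2, 3
age = rng.normal(loc=45 - 5 * pclass, scale=10, size=n)

# Mechanism 1: every age has the same 20% chance of being missing
missing_uniform = rng.random(n) < 0.2

# Mechanism 2: missingness depends on an observed variable (class), not on age itself
p_missing = np.where(pclass == 3, 0.35, 0.05)
missing_by_class = rng.random(n) < p_missing

true_mean = age.mean()
mean_uniform = age[~missing_uniform].mean()
mean_by_class = age[~missing_by_class].mean()

print(f"true mean age:                 {true_mean:.2f}")
print(f"observed mean, uniform gaps:   {mean_uniform:.2f}")
print(f"observed mean, class-driven:   {mean_by_class:.2f}")
```

In the second case, simply ignoring the missing rows over-represents the classes with fewer gaps, which is why the reason for the absence matters as much as the amount.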

Let's start with the obvious question: which columns have missing data in each dataset?

In [22]:
def get_columns_with_missing_data(df: pd.DataFrame, name: str) -> pd.DataFrame:
    # Count missing values per column and keep only the columns that have any
    missing_vals = df.isnull().sum()
    return pd.DataFrame(missing_vals[missing_vals > 0], columns=[name])

pd.concat([
    get_columns_with_missing_data(train, 'train'),
    get_columns_with_missing_data(test, 'test')
], axis=1)

| Column | train | test |
|---|---|---|
| Age | 177.0 | 86.0 |
| Cabin | 687.0 | 327.0 |
| Embarked | 2.0 | |
| Fare | | 1.0 |


We have missing values for Age and Cabin in both the train and test data, but we also have rare cases where the port of embarkation is missing from the training data and the fare paid by one passenger is missing from the test dataset. (This again illustrates how the Titanic project has a few awkward properties that reflect the challenges of real-world analysis.)

In [None]:
train['age_missing'] = train['Age'].isna()
test['age_missing'] = test['Age'].isna()

### 2.1.2. Do not merge the train and test data

When imputing missing data, one might think that we should combine data from the training and test datasets, so that we can maximise our ability to estimate missing values. However, if we were to merge the two datasets, we would potentially risk **data leakage** that might undermine the model we're training. 

A simple example using regression imputation shows why. Here we simulate a simple experiment in which age is correlated with ticket fare, but the slope of the relationship and the range of fares differ between the train and test data:

In [None]:
import numpy as np

n_passengers, max_age = 200, 80

train_age = np.random.rand(n_passengers) * max_age
test_age = np.random.rand(n_passengers) * (max_age/2)              # test subjects are half as old 

train_fare = (train_age * 10) + (25*np.random.rand(n_passengers))   # y = a + bx + noise
test_fare = (test_age * 5) + (25*np.random.rand(n_passengers))      

# mask a subset of ages (data has no order)
proportion_missing = 0.3
train_age[:int(n_passengers*proportion_missing)] = np.nan   
fares_for_missing_data = train_fare[:int(n_passengers*proportion_missing)]


In [None]:
# Impute values using linear regression with only training data
from sklearn.linear_model import LinearRegression

X = train_fare[~np.isnan(train_age)].reshape(-1,1)
y = train_age[~np.isnan(train_age)].reshape(-1,1)

training_only = LinearRegression().fit(X, y)
trained_only_prediction = training_only.predict(fares_for_missing_data.reshape(-1, 1))

In [None]:
# Now impute values using linear regression on merged test and training data
X_merged = np.vstack((X, test_fare.reshape(-1, 1)))
y_merged = np.vstack((y, test_age.reshape(-1, 1)))

merged_model = LinearRegression().fit(X_merged, y_merged)
merged_prediction = merged_model.predict(fares_for_missing_data.reshape(-1,1))

When we plot the results in the cell below, we can see that the imputed ages are strongly affected by merging the train and test datasets. If we were then to build our survival model based on the ages imputed after merging, that model would also be influenced by the test data. In deployment, such a model may generalize poorly because it can no longer rely on the leaked information that allowed it to perform well on the test data.

In [None]:
# Plot the results
import matplotlib.pyplot as plt

plt.scatter(train_fare, train_age, marker='.', color='c', alpha=0.5, label='Train (Observed)')
plt.scatter(test_fare, test_age, marker='.', color='orange', alpha=0.5, label='Test (Observed)')

plt.scatter(
    train_fare[np.isnan(train_age)], 
    trained_only_prediction, 
    marker='x', s=5, color='b', label='Train (Imputed)')

plt.scatter(
    train_fare[np.isnan(train_age)], 
    merged_prediction, 
    marker='x', s=5, color='r', label='Merged (Imputed)')

plt.ylabel('Age (Years)')
plt.xlabel('Fare (£)')
plt.legend()
plt.show()
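
The gap between the two sets of imputed values can also be expressed as a number. The sketch below repeats the same simulation with a fixed random seed (an assumption added here for repeatability) and keeps the true ages of the masked passengers, so that each model's imputations can be scored against them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
n_passengers, max_age = 200, 80

train_age = rng.random(n_passengers) * max_age
test_age = rng.random(n_passengers) * (max_age / 2)
train_fare = train_age * 10 + 25 * rng.random(n_passengers)
test_fare = test_age * 5 + 25 * rng.random(n_passengers)

# Hide the first 30% of training ages, but keep the true values for scoring
n_missing = int(n_passengers * 0.3)
true_age = train_age[:n_missing]
X_missing = train_fare[:n_missing].reshape(-1, 1)
X_obs = train_fare[n_missing:].reshape(-1, 1)
y_obs = train_age[n_missing:]

# Model 1: fitted on observed training rows only
train_only = LinearRegression().fit(X_obs, y_obs)

# Model 2: fitted on observed training rows plus all test rows
merged = LinearRegression().fit(
    np.vstack([X_obs, test_fare.reshape(-1, 1)]),
    np.concatenate([y_obs, test_age]),
)

# Mean absolute error of each model's imputations against the hidden truth
mae_train_only = np.abs(train_only.predict(X_missing) - true_age).mean()
mae_merged = np.abs(merged.predict(X_missing) - true_age).mean()
print(f"train only: {mae_train_only:.2f} years, merged: {mae_merged:.2f} years")
```

In this simulated setting the merged model imputes noticeably worse ages for the training passengers, because the test rows follow a different fare-to-age relationship.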

For some methods, it may be useful to include categorical data in the imputation. However, many imputation methods cannot work with string values, so we create a numerical code for each category (e.g. Male=0, Female=1 etc.). We encode the data in both the train and test sets, so that we can also impute missing data when making final predictions.

In [None]:
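# Learn the label -> code mapping on the training data, then reuse it for the
# test data so that the same label always receives the same code
# (missing values are encoded as -1 in both cases)
train['gender_numeric'], sex_labels = train['Sex'].factorize()
test['gender_numeric'] = pd.Categorical(test['Sex'], categories=sex_labels).codes

train['embarked_numeric'], embarked_labels = train['Embarked'].factorize()
test['embarked_numeric'] = pd.Categorical(test['Embarked'], categories=embarked_labels).codes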
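
To see why the codes should come from a single mapping, note that factorize numbers labels in order of first appearance. The following sketch uses made-up port values (not the real data) to show that encoding two datasets independently can give the same label different codes, and one possible way to keep them aligned:

```python
import pandas as pd

# Invented examples: the labels appear in a different order in each dataset
train_ports = pd.Series(['S', 'C', 'S', 'Q'])
test_ports = pd.Series(['Q', 'S', 'C'])

train_codes, train_labels = train_ports.factorize()
naive_test_codes, naive_labels = test_ports.factorize()

print(list(train_labels))   # label order learned from the training data
print(list(naive_labels))   # label order learned from the test data

# Reuse the training labels so that each port keeps the same code
consistent_test_codes = pd.Categorical(test_ports, categories=train_labels).codes

print(list(naive_test_codes))
print(list(consistent_test_codes))
```

With the naive encoding, 'Q' is 2 in the training data but 0 in the test data, so a model or imputer fitted on the training codes would misread the test rows.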

## 2.2. Methods for replacing missing data

### Deletion

This involves simply removing any rows or columns that contain missing data. This method is simple and easy to implement, but it can also reduce the sample size and potentially bias the results if the missing data is not missing completely at random.

If we were to get rid of missing values for Age data, we could still train a model that included Age as a predictor, so long as the testing data did not contain missing values. However, we know that the test data does contain missing values (and even if it didn't, we might not want to limit ourselves to the assumption that all data will be present). For these reasons, we will comment out the code to delete missing values.


In [None]:
# train = train.drop(train.index[train["age_missing"]])
# test = test.drop(test.index[test["age_missing"]])


### Mean/median imputation
This involves replacing the missing data with the mean or median value of the non-missing data in the same column. This can be useful if the missing data is missing at random and if the data is normally distributed. However, it can also distort the distribution of the data (for example by shrinking its variance) and potentially introduce bias if the data is not normally distributed or if the missing data is not missing at random.

In [None]:
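# Replace with mean (statistics are computed on the training data only)
# train_means = train.mean(numeric_only=True)
# train.fillna(train_means, inplace=True)
# test.fillna(train_means, inplace=True)

# Replace with median
# train_medians = train.median(numeric_only=True)
# train.fillna(train_medians, inplace=True)
# test.fillna(train_medians, inplace=True)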
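
The same idea can be sketched with SimpleImputer from sklearn.impute, which learns the statistic on the training data and reuses it for the test data. The ages below are simulated (the distribution and the 20% missing rate are assumptions for illustration), and the final line shows how filling every gap with one value narrows the spread of the column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(1)

# Simulated ages, not the Titanic data
train_age = rng.normal(30, 14, size=500).clip(0.5, 80)
test_age = rng.normal(30, 14, size=200).clip(0.5, 80)

# Knock out roughly 20% of the values in each set
train_age[rng.random(500) < 0.2] = np.nan
test_age[rng.random(200) < 0.2] = np.nan

# Learn the median from the training data only, then apply it to both sets
imputer = SimpleImputer(strategy='median')
train_filled = imputer.fit_transform(train_age.reshape(-1, 1)).ravel()
test_filled = imputer.transform(test_age.reshape(-1, 1)).ravel()

print("median used for filling:", imputer.statistics_[0])
print("std before:", np.nanstd(train_age), "std after:", train_filled.std())
```

Swapping `strategy='median'` for `strategy='mean'` gives mean imputation with the same fit/transform pattern.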

### Regression imputation

This involves using a regression model to predict the missing values from other variables observed in the same row. This can be useful when the data is not missing completely at random but the missingness can be explained by those other variables, and when there is a strong relationship between the missing and non-missing values. However, it can also introduce bias and error if the regression model is misspecified or if there is not a strong relationship between the variables.

In the case of the Titanic dataset, we are somewhat limited in the predictors available for imputation. We can get an idea of the relationships between variables by looking at the correlation matrix. Note that here we treat categorical variables such as gender and class as numbers (class is ordinal, in that 1st class differs from 2nd class in the same "direction" as 2nd differs from 3rd).

If we try to impute values this way, we'll find that a simple regression model does a bad job and predicts impossible values, such as negative ages.

In [None]:
# train[['Age','SibSp','Fare','PassengerId','gender_numeric','Pclass','Parch']].corr()

In [None]:
# Fit a regression model on rows where Age is observed, then predict the missing ages
# from sklearn.linear_model import LinearRegression

# predictors = ['SibSp','Fare','gender_numeric','Pclass','Parch']

# # Drop incomplete rows from X and y together so that they stay aligned
# observed = train.dropna(subset=predictors + ['Age'])

# model = LinearRegression().fit(observed[predictors], observed['Age'])

# idx = train[train["age_missing"]].index
# train.loc[idx, "Age"] = model.predict(train.loc[idx, predictors])
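
The problem of impossible predictions is easy to reproduce on simulated data. In the sketch below the relationship between fare and age is invented (a steep line with noise, floored at 0.5 years), so the fitted line extrapolates below zero for cheap fares. Clipping the predictions is one crude guard, shown here as an assumption rather than a recommended fix.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Made-up relationship: age rises steeply with fare; observed ages never go below 0.5
fare = rng.uniform(0, 100, size=200)
age = np.maximum(0.5, -5 + 0.5 * fare + rng.normal(0, 2, size=200))

model = LinearRegression().fit(fare.reshape(-1, 1), age)

# Predict ages for a few passengers, including some with very cheap tickets
query_fares = np.array([[0.0], [1.0], [5.0], [50.0]])
raw = model.predict(query_fares)

# Crude guard: force predictions into a plausible range
clipped = np.clip(raw, 0, None)

print("raw predictions:    ", raw.round(1))
print("clipped predictions:", clipped.round(1))
```

Clipping removes the impossible values but does not make the underlying model any better, so it is best read as a symptom that a different imputation model is needed.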


### K-nearest neighbors imputation

This involves using the values of the k-nearest neighbors to the missing data point to impute the missing value. This can be useful if the data is missing at random and if the missing data is similar to the values of its neighboring points. However, it can also introduce bias and error if the data is not missing at random or if the nearest neighbors are not representative of the missing data point.

These are just a few examples of methods for replacing missing data. There are many other methods and techniques that can be used, and the appropriate method to use depends on the specific context and characteristics of the missing data. It is important to carefully evaluate the missing data and consider the potential implications of different methods before deciding on a course of action.


In [None]:
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors = 3)

# Select columns to use for impute
columns_for_impute = ['Pclass','SibSp', 'Parch', 'Fare', 'gender_numeric', 'embarked_numeric','Age']

# Fit the imputer to the data
imputer.fit(train[columns_for_impute])

# Transform the data
imputed_data = imputer.transform(train[columns_for_impute])

# Create a new dataframe with the imputed data
imputed_df = pd.DataFrame(imputed_data, columns=columns_for_impute, index=train.index)

In [None]:
imputed_df.head(6)

In [None]:
train[columns_for_impute].head(6)
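
One design choice worth considering with nearest-neighbour imputation is feature scaling: distances are computed on the raw values, so a wide-ranging column such as Fare can dominate the choice of neighbours. The sketch below, on an invented toy matrix, standardises the columns before imputing and converts the result back to the original units. This is one possible approach, not the method used above.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy matrix with columns [Pclass, Fare, Age]; values invented for illustration
X = np.array([
    [1, 80.0, 38.0],
    [1, 75.0, np.nan],
    [3, 8.0, 22.0],
    [3, 7.5, 26.0],
    [2, 13.0, np.nan],
    [2, 26.0, 35.0],
])

# StandardScaler ignores NaNs when fitting and leaves them in place when transforming
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Impute in the scaled space, then map back to the original units
imputer = KNNImputer(n_neighbors=2)
X_imputed = scaler.inverse_transform(imputer.fit_transform(X_scaled))

print(X_imputed.round(1))
```

As with the other methods, the scaler and imputer would be fitted on the training data and then applied to the test data with `transform`.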

### Neural Networks

Given the hype, it's important to remember that a neural net isn't magic - it can't find relationships if they don't exist. Also, if those relationships are very complex, we may not have enough data to find them.

In [None]:
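from tensorflow import keras

# Create a sample dataset with missing values
data = np.array([[1, 2, np.nan],
                 [3, 4, 5],
                 [6, np.nan, 7],
                 [np.nan, 8, 9]], dtype='float32')

# NaNs in the inputs or targets would turn the loss (and every weight) into NaN,
# so record which values are observed and fill the gaps with column means
observed = ~np.isnan(data)
filled = np.where(observed, data, np.nanmean(data, axis=0))

# Define the model
model = keras.Sequential([
    keras.layers.Dense(10, input_shape=(3,)),
    keras.layers.Dense(10),
    keras.layers.Dense(3)
])

# Compile the model
model.compile(optimizer='adam', loss='mse')

# Train the model to reconstruct the filled-in data
model.fit(filled, filled, epochs=100)

# Use the model's output only where the original value was missing
imputed_data = np.where(observed, data, model.predict(filled))

# Print the imputed dataset
print(imputed_data)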

## 2.3. Multiple Imputation

(Credit: Sklearn.Impute Documentation)

In the statistics community, it is common practice to perform multiple imputations, generating, for example, *m* separate imputations for a single feature matrix. Each of these *m* imputations is then put through the subsequent analysis pipeline (e.g. feature engineering, clustering, regression, classification). The *m* final analysis results (e.g. held-out validation errors) allow the data scientist to obtain understanding of how analytic results may differ as a consequence of the inherent uncertainty caused by the missing values. 
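
A minimal sketch of this workflow, assuming simulated two-column data and IterativeImputer from sklearn.impute (still marked experimental, hence the extra import): with sample_posterior=True, each random seed produces a different plausible completion of the data, and the spread of a downstream statistic across the *m* completions gives a feel for the uncertainty caused by the gaps.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Simulated data: the second column depends on the first
n = 200
x = rng.normal(size=n)
y = 2 * x + rng.normal(scale=0.5, size=n)
X_missing = np.column_stack([x, y])

# Remove roughly 30% of the second column
X_missing[rng.random(n) < 0.3, 1] = np.nan

# Generate m completed datasets and run the same "analysis" on each one
# (here the analysis is simply the mean of the second column)
m = 5
estimates = []
for seed in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imputer.fit_transform(X_missing)
    estimates.append(completed[:, 1].mean())

estimates = np.array(estimates)
print("estimate per imputation:", estimates.round(3))
print("mean across imputations:", estimates.mean().round(3))
print("spread across imputations:", estimates.std(ddof=1).round(3))
```

In a fuller analysis the mean would be replaced by the whole modelling pipeline, and the *m* results would be compared or pooled.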

## Conclusions

Note that imputing missing data is not always a valid approach - particularly in research science, where the primacy of data is critical (and single data points can cost thousands of dollars to obtain). Here, it may be better to have no data and be clear that the absence exists than it is to fill in missing data based on assumptions or estimates that are ultimately based on the prejudice of the investigator.

However, i