
# Tutorial 7b: Data Imputation

Marcus Frean

*with thanks to Baligh Al-Helali (PhD, VUW, 2021)*

This covers:

* The deletion approach
    - Deleting the incomplete features
    - Deleting the incomplete instances

* pandas
    - Simple imputation using pandas
    - Interpolation imputation using pandas
    
* sklearn
    - Simple imputation using sklearn
    - KNN-based imputation using sklearn
    - Iterative imputation using sklearn

* Applying the learned models to incomplete test data

----

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

## Loading and exploring the data

In [None]:
# Kaggle provides the Titanic data already split into train and test sets
# (https://www.kaggle.com/c/titanic/data), but its test set has no labels.
# We therefore load the whole data set from a data repository and split it ourselves later.
titanic_data = pd.read_csv("https://www.openml.org/data/get_csv/16826755/phpMYEkMl.csv", na_values=['?'])
titanic_data.head()

**Values considered “missing”**

Missing values can be represented in many ways, both in the dataset file and in `pandas`.

In the file, a missing value might be a blank entry, a '?', or some other marker that the data collectors agreed on to represent unobserved data.
In this case it is '?'. Knowing this, we tell `pandas` what to treat as missing via `na_values=['?']`.

At the "other end", `pandas` can represent missing values in several different ways. As can be seen above, `NaN` is the default missing value marker, but missing values need to be easy to detect in data of different types: floating point, integer, boolean, and general object. Python's `None` is therefore also treated as missing ("not available", or "NA"). Infinite values (`inf`, `-inf`) are not treated as missing by default.
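
To make the two "ends" concrete, here is a minimal sketch on a made-up three-row CSV (the data and the names `csv_text`, `raw`, `clean` and `s` are hypothetical, not part of the Titanic file). It shows what happens with and without `na_values`, and which values `isna()` reports as missing.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV text; '?' marks unobserved entries, as in the Titanic file.
csv_text = "age,fare,cabin\n29,211.3,B5\n?,151.55,?\n30,?,C22\n"

raw = pd.read_csv(io.StringIO(csv_text))                     # '?' is kept as an ordinary string
clean = pd.read_csv(io.StringIO(csv_text), na_values=["?"])  # '?' is parsed as NaN

print(raw.dtypes)          # columns containing '?' are read as strings
print(clean.dtypes)        # age and fare are now floating point
print(clean.isna().sum())  # one missing value per column

# None and NaN both count as missing; infinity does not by default.
s = pd.Series([1.0, None, np.nan, np.inf])
print(s.isna().tolist())
```

Without `na_values`, the numeric columns would silently become string columns, which is usually the first symptom of an unrecognised missing-value marker.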


In [None]:
# Let's drop some features that we will not consider here.
titanic_data.drop(['name','ticket', 'embarked', 'boat' ,'body' ,'home.dest'], axis=1, inplace=True)

Now we split the data into training and test subsets. **ONLY** the training data is used to fit the imputers; the fitted models are then applied to the test data.

In [None]:
from sklearn.model_selection import train_test_split
y=titanic_data['survived']
X=titanic_data.drop(['survived'], axis=1)
X_titanic_train, X_titanic_test, y_titanic_train, y_titanic_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# If we try classification now, fit() fails for most classifiers
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
#classifier=SVC()
classifier.fit(X_titanic_train, y_titanic_train)


# The problem is that some features contain string values, namely "sex" and "cabin", so let's encode these features

In [None]:
# The encoder parameters used below need a recent scikit-learn, so upgrade before importing it
!pip install -U scikit-learn
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

In [None]:
import numpy as np
# Encode the categorical features while preserving the missing values in incomplete features
from sklearn.preprocessing import OrdinalEncoder
encoder_sex = OrdinalEncoder(handle_unknown = 'use_encoded_value', unknown_value=np.nan)
X_titanic_train_encoded=X_titanic_train.copy()
X_titanic_train_encoded['sex'] = encoder_sex.fit_transform(X_titanic_train_encoded['sex'].values.reshape(-1, 1))

# Now let's encode the incomplete "cabin" feature
encoder_cabin = OrdinalEncoder(handle_unknown = 'use_encoded_value', unknown_value=np.nan) # One encoder could handle both features; we use two for the sake of clarity
X_titanic_train_encoded['cabin'] = encoder_cabin.fit_transform(X_titanic_train_encoded['cabin'].values.reshape(-1, 1).astype(str))
# Get the code assigned to the string "nan" for the cabin categorical feature
cabin_nan_code=encoder_cabin.transform([['nan']])[0][0]
print(cabin_nan_code)
# Now restore NaN wherever that code appears, so the values are missing again in the encoded data.
# Assigning the result back avoids relying on in-place replacement through a column selection.
X_titanic_train_encoded['cabin'] = X_titanic_train_encoded['cabin'].replace(cabin_nan_code, np.nan)


## `X_titanic_train_encoded` is the encoded incomplete training data

In [None]:
#Check the types of the encoded data, no object features
X_titanic_train_encoded.info()

In [None]:
X_titanic_train_encoded.head()

In [None]:
# As the data has no strings/object now, let's try performing classification using the encoded data
classifier.fit(X_titanic_train_encoded, y_titanic_train)

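## Note the error: `ValueError: Input contains NaN, infinity or a value too large for dtype('float32').`

We need to handle the missing values before performing the classification. (The exact message depends on the scikit-learn version, and recent releases let some tree-based estimators accept NaN directly; most other estimators still reject it.)

Let's show the number of missing values in each feature of the encoded training data.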



In [None]:
print("The number of missing values ")
print(X_titanic_train_encoded.isnull().sum())

We have three incomplete features: "age", "fare", and "cabin".

## The deletion approach

### Deleting the incomplete features

In [None]:
X_titanic_train_complete=X_titanic_train_encoded.copy()
X_titanic_train_complete.dropna(axis=1, inplace=True)
X_titanic_train_complete

In [None]:
#Check the number of missing values
print(X_titanic_train_complete.isnull().sum())

### Deleting the incomplete instances

In [None]:
X_titanic_train_complete=X_titanic_train_encoded.copy()
X_titanic_train_complete.dropna(axis=0, inplace=True)
#The difference is axis=0 instead of 1
X_titanic_train_complete

## Notice the reduction in the number of instances

Another important point: when instances are deleted, the corresponding target values must also be removed from `y_titanic_train`, so that features and labels stay aligned.
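
A minimal sketch of that alignment step, using a made-up frame in place of the Titanic data (`X_toy`, `y_toy` and the index values are hypothetical). Because `dropna` keeps the original index labels, the surviving labels can be used to select the matching targets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for X_titanic_train_encoded / y_titanic_train.
X_toy = pd.DataFrame(
    {"age": [22.0, np.nan, 35.0, 41.0], "fare": [7.25, 71.3, np.nan, 8.05]},
    index=[10, 11, 12, 13],
)
y_toy = pd.Series([0, 1, 1, 0], index=X_toy.index)

# Drop incomplete rows, then keep only the targets whose index survived.
X_toy_complete = X_toy.dropna(axis=0)
y_toy_complete = y_toy.loc[X_toy_complete.index]

print(X_toy_complete)
print(y_toy_complete)
```

The same pattern should carry over to the real data, since `train_test_split` preserves the pandas index on both `X` and `y`.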

In [None]:
#Check the number of missing values
print(X_titanic_train_complete.isnull().sum())

The deletion approach has several drawbacks. It reduces the available data, which limits the learning ability, especially when there are many missing values.

Furthermore, the approach of deleting incomplete instances is not practical for test data: we really want to know the answer!

## Imputation using `pandas`

### Simple imputation (`pandas`)

In [None]:
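# Fill each incomplete feature with its mean.
# Note: 'cabin' holds ordinal category codes, so the mean of its codes is only a crude placeholder.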
X_titanic_data_complete=X_titanic_train_encoded.copy()
X_titanic_data_complete['age']=X_titanic_data_complete['age'].fillna(X_titanic_data_complete['age'].mean())
X_titanic_data_complete['fare']=X_titanic_data_complete['fare'].fillna(X_titanic_data_complete['fare'].mean())
X_titanic_data_complete['cabin']=X_titanic_data_complete['cabin'].fillna(X_titanic_data_complete['cabin'].mean())
# Show the number of missing values
print(X_titanic_data_complete.isnull().sum())

In [None]:
X_titanic_data_complete.head()

### Interpolation imputation (`pandas`)

In [None]:
X_titanic_data_complete = X_titanic_train_encoded.copy()
# interpolate() already returns a DataFrame, so no conversion is needed
X_titanic_data_complete = X_titanic_data_complete.interpolate()
# Missing values in the first rows have no earlier neighbour and may remain
print(X_titanic_data_complete.isna().sum())
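
As a small illustration of what `interpolate()` does, the sketch below uses two made-up series (`s` and `t` are hypothetical, not Titanic columns). By default the method is linear and fills gaps from neighbouring rows, so it assumes the row order is meaningful; a leading gap is left unfilled unless `limit_direction="both"` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series, not taken from the Titanic data.
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 9.0])
filled = s.interpolate()  # linear by default: gaps are filled along straight lines
print(filled.tolist())

# A leading NaN has no earlier neighbour, so the default call leaves it missing.
t = pd.Series([np.nan, 2.0, np.nan, 4.0])
forward_only = t.interpolate()
both_ways = t.interpolate(limit_direction="both")  # also fills the leading gap
print(forward_only.isna().sum(), both_ways.isna().sum())
```

For shuffled tabular data such as a random train split, neighbouring rows are unrelated passengers, so interpolation is shown here mainly for completeness.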

## Imputation using `sklearn`

### Simple imputation (`sklearn`)

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()

X_titanic_train_complete = imputer.fit_transform(X_titanic_train_encoded)

#The output is 'numpy.ndarray' so we convert it to dataframe for consistency
X_titanic_train_complete=pd.DataFrame(X_titanic_train_complete, columns=X_titanic_train_encoded.columns)
print("The number of missing values :\n", X_titanic_train_complete.isnull().sum())

In [None]:
X_titanic_train_encoded

## The default strategy of sklearn's `SimpleImputer` is "mean"; you can change it using the `strategy` parameter

In [None]:
imputer = SimpleImputer(strategy="median")
X_titanic_train_complete = imputer.fit_transform(X_titanic_train_encoded)
#The output is 'numpy.ndarray' so we convert it to dataframe for consistency
X_titanic_train_complete=pd.DataFrame(X_titanic_train_complete, columns=X_titanic_train_encoded.columns)
print("The number of missing values :\n", X_titanic_train_complete.isnull().sum())
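
To see what each strategy actually learns, here is a minimal sketch on a made-up array (`X_toy` and `learned` are hypothetical names). After fitting, the per-column fill values are stored in the imputer's `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two features, one missing value in each.
X_toy = np.array([
    [1.0, 10.0],
    [np.nan, 10.0],
    [1.0, np.nan],
    [7.0, 40.0],
    [3.0, 20.0],
])

learned = {}
for strategy in ("mean", "median", "most_frequent"):
    toy_imputer = SimpleImputer(strategy=strategy).fit(X_toy)
    learned[strategy] = toy_imputer.statistics_  # one fill value per column
    print(strategy, learned[strategy])
```

For an encoded categorical feature such as "cabin", `most_frequent` is arguably a more natural choice than the mean, since it always returns a code that actually occurs in the data.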

### kNN imputer (`sklearn`)

In [None]:
from sklearn.impute import KNNImputer
imputer = KNNImputer()
X_titanic_train_complete = imputer.fit_transform(X_titanic_train_encoded)
#The output is 'numpy.ndarray' so we convert it to dataframe for consistency
X_titanic_train_complete=pd.DataFrame(X_titanic_train_complete, columns=X_titanic_train_encoded.columns)
print("The number of missing values :\n", X_titanic_train_complete.isnull().sum())

In [None]:
#The default k for the KNN imputer is 5, you can change it as follows:
imputer = KNNImputer(n_neighbors=2)
# etc etc...
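
The sketch below shows the effect of `n_neighbors=2` on a tiny made-up array (`X_toy` is hypothetical). Each missing entry is replaced by the average of that feature over the two nearest rows that have it observed, with distances computed only on the features both rows share.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data with two missing entries.
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

knn_toy = KNNImputer(n_neighbors=2)
X_toy_filled = knn_toy.fit_transform(X_toy)
print(X_toy_filled)
# Row 0: the two nearest donors have 3 and 5 in the last column, so the fill is 4.
# Row 2: the two nearest donors have 3 and 8 in the first column, so the fill is 5.5.
```

Because the imputation relies on distances, features on very different scales can dominate the neighbour search, so scaling beforehand may be worth considering.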

### Iterative Imputer (`sklearn`)

Note this is sklearn's implementation of a method originally known as "MICE" -- see lecture 2 from this week for an explanation.

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer()
X_titanic_train_complete = imputer.fit_transform(X_titanic_train_encoded)
#The output is 'numpy.ndarray' so we convert it to dataframe for consistency
X_titanic_train_complete=pd.DataFrame(X_titanic_train_complete, columns=X_titanic_train_encoded.columns)
print("The number of missing values :\n", X_titanic_train_complete.isnull().sum())

You can override the default parameters of the iterative imputer. For example, you can set the maximum number of iterations (`max_iter`), and you can specify the estimator used to predict the missing values (`estimator`).
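
As a minimal sketch of overriding the defaults, the example below sets `max_iter` and `random_state` on a made-up array in which the second feature is roughly twice the first (`X_toy` and `imputer_toy` are hypothetical names, and the value 20 is an arbitrary choice for illustration).

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Hypothetical toy data: one missing value in each feature.
X_toy = np.array([
    [1.0, 2.0],
    [3.0, 6.0],
    [4.0, 8.0],
    [np.nan, 3.0],
    [7.0, np.nan],
])

# Allow up to 20 rounds of imputation; random_state makes the run repeatable.
imputer_toy = IterativeImputer(max_iter=20, random_state=0)
X_toy_filled = imputer_toy.fit_transform(X_toy)

print(X_toy_filled)
print("iterations run:", imputer_toy.n_iter_)
```

Observed entries are left untouched; only the missing ones are predicted from the other feature.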

In [None]:
# Let's use a decision tree (DT) as the estimator
from sklearn.tree import DecisionTreeRegressor
imputer = IterativeImputer(estimator=DecisionTreeRegressor())
X_titanic_train_complete = imputer.fit_transform(X_titanic_train_encoded)
#The output is 'numpy.ndarray' so we convert it to dataframe for consistency
X_titanic_train_complete=pd.DataFrame(X_titanic_train_complete, columns=X_titanic_train_encoded.columns)
print("The number of missing values :\n", X_titanic_train_complete.isnull().sum())

## Applying the learned models to incomplete test data

First, apply the encoders

In [None]:
# The learnt encoder_sex should be used to encode the test data. NOTE there is NO fit here, just transform
X_titanic_test_encoded=X_titanic_test.copy()
X_titanic_test_encoded['sex'] = encoder_sex.transform(X_titanic_test_encoded['sex'].values.reshape(-1, 1))

# The learnt encoder_cabin should be used to encode the test data. NOTE there is NO fit here, just transform
X_titanic_test_encoded['cabin'] = encoder_cabin.transform(X_titanic_test_encoded['cabin'].values.reshape(-1, 1).astype(str))
# Now restore NaN wherever the "nan" code appears, so the values are missing again in the encoded data
X_titanic_test_encoded['cabin'] = X_titanic_test_encoded['cabin'].replace(cabin_nan_code, np.nan)


Second, use the learned imputer to estimate the missing values in the test data

In [None]:
print("The number of missing values in the test data before imputation :\n", X_titanic_test_encoded.isnull().sum())
X_titanic_test_complete = imputer.transform(X_titanic_test_encoded)
X_titanic_test_complete=pd.DataFrame(X_titanic_test_complete, columns=X_titanic_test_encoded.columns)
print("The number of missing values in the test data after imputation :\n", X_titanic_test_complete.isnull().sum())

Finally, we can perform the classification using the imputed complete data.

In [None]:
# We use the F1 score (f-measure) because the classes are not balanced
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(random_state=0)
#classifier=SVC()
classifier.fit(X_titanic_train_complete, y_titanic_train)
# f1_score expects the true labels first, then the predictions
print("F1 score after imputation = ", f1_score(y_titanic_test, classifier.predict(X_titanic_test_complete)))

----