## Imputing Missing Data

The act of replacing missing data with statistical estimates of missing values is
called imputation. The goal of any imputation technique is to produce a complete
dataset that can be used to train machine learning models.

- _The choice of imputation technique we use will
depend on whether the data is missing at random, the number of missing values, and the
machine learning model we intend to use_

- Removing observations with missing data
- Performing mean or median imputation
- Implementing mode or frequent category imputation
- Replacing missing values with an arbitrary number
- Capturing missing values in a bespoke category
- Replacing missing values with a value at the end of the distribution
- Implementing random sample imputation
- Adding a missing value indicator variable
- Performing multivariate imputation by chained equations
- Assembling an imputation pipeline with scikit-learn
- Assembling an imputation pipeline with Feature-engine

In [1]:
# pip install feature-engine --user --quiet

In [2]:
import pandas as pd
import numpy as np
import random

In [3]:
data_credit = pd.read_csv('crx.data', header=None)

In [4]:
# Creating data headers
headers = ['A'+str(s) for s in range(1,17)]
data_credit.columns = headers

In [5]:
data_credit.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [6]:
data_credit.replace('?', np.nan, inplace=True)

In [7]:
data_credit.dtypes

A1      object
A2      object
A3     float64
A4      object
A5      object
A6      object
A7      object
A8     float64
A9      object
A10     object
A11      int64
A12     object
A13     object
A14     object
A15      int64
A16     object
dtype: object

In [8]:
# A2 and A14 are numeric but were read as object because of the '?' placeholders, so we cast them to float
data_credit['A2'] = data_credit['A2'].astype('float')
data_credit['A14'] = data_credit['A14'].astype('float')

In [9]:
# Recoding (Encoding) the target variable A16 as binary
data_credit['A16'] = data_credit['A16'].map({'+':1, '-':0})

In [10]:
data_credit

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,b,30.83,0.000,u,g,w,v,1.25,t,t,1,f,g,202.0,0,1
1,a,58.67,4.460,u,g,q,h,3.04,t,t,6,f,g,43.0,560,1
2,a,24.50,0.500,u,g,q,h,1.50,t,f,0,f,g,280.0,824,1
3,b,27.83,1.540,u,g,w,v,3.75,t,t,5,t,g,100.0,3,1
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,b,21.08,10.085,y,p,e,h,1.25,f,f,0,f,g,260.0,0,0
686,a,22.67,0.750,u,g,c,v,2.00,f,t,2,t,g,200.0,394,0
687,a,25.25,13.500,y,p,ff,ff,2.00,f,t,1,t,g,200.0,1,0
688,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,f,g,280.0,750,0


In [11]:
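# Introducing some missing values at random places in four variables
random.seed(9001)
# Note: random.randint includes both endpoints, so len(data_credit) - 1 is the
# largest valid row label; the bound is left as-is to keep the outputs below unchanged
values = [random.randint(0, len(data_credit)) for _ in range(100)]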
for var in ['A3', 'A8', 'A9', 'A10']:
    data_credit.loc[values, var] = np.nan

In [12]:
# Saving data
data_credit.to_csv('creditApprovalUCI.csv', index=False)

## Removing observations with missing data

Complete Case Analysis (CCA), also called list-wise deletion of cases, consists
of discarding those observations where the values in any of the variables are missing. CCA
can be applied to categorical and numerical variables.
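
As a minimal sketch of the idea, the toy frame below (hypothetical values, not the credit dataset) shows list-wise deletion and the share of rows it costs. The column names `num`, `cat`, and `target` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in different columns
toy = pd.DataFrame({
    "num": [1.0, np.nan, 3.0, 4.0, 5.0],
    "cat": ["a", "b", None, "a", "b"],
    "target": [1, 0, 1, 0, 1],
})

# Complete case analysis: keep only rows with no missing value in any column
toy_cca = toy.dropna()

# Share of observations lost to list-wise deletion, in percent
pct_dropped = (len(toy) - len(toy_cca)) / len(toy) * 100
print(f"rows kept: {len(toy_cca)}, dropped: {pct_dropped:.0f}%")
```

A row is dropped as soon as a single variable is missing, so the loss grows quickly when the gaps are spread across different rows.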

In [13]:
data_cca = pd.read_csv("creditApprovalUCI.csv")

In [14]:
data_cca.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202.0,0,1
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43.0,560,1
2,a,24.5,,u,g,q,h,,,,0,f,g,280.0,824,1
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100.0,3,1
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,1


In [15]:
# Percentage of missing values
data_cca.isnull().mean().sort_values(ascending=False)*100

A3     13.333333
A8     13.333333
A9     13.333333
A10    13.333333
A14     1.884058
A1      1.739130
A2      1.739130
A6      1.304348
A7      1.304348
A4      0.869565
A5      0.869565
A11     0.000000
A12     0.000000
A13     0.000000
A15     0.000000
A16     0.000000
dtype: float64

In [16]:
# Now, we'll remove the observations with missing data in any of the variables:
data_cca.dropna(inplace=True)

In [17]:
# Comparing Size
size = ((len(data_credit) - len(data_cca))/len(data_credit))*100

In [18]:
size  # 18% smaller

18.26086956521739

## Performing mean or median imputation

*Use mean imputation if variables are normally distributed and median
imputation otherwise. Mean and median imputation may **distort** the
distribution of the original variables if there is a high percentage of
missing data.*
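
The sketch below illustrates both points on synthetic data. The lognormal variable, the 30% missing rate, and all names are assumptions chosen for illustration; they do not come from the credit dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Right-skewed synthetic variable (hypothetical data, not the credit dataset)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

# Blank out roughly 30% of the values at random
mask = rng.random(1000) < 0.3
with_gaps = skewed.mask(mask)

# In a right-skewed variable, the mean sits above the median
mean_value = with_gaps.mean()
median_value = with_gaps.median()

# Filling every gap with one constant shrinks the spread of the variable
imputed = with_gaps.fillna(median_value)
print(f"mean={mean_value:.2f}, median={median_value:.2f}")
print(f"std before={with_gaps.std():.2f}, std after={imputed.std():.2f}")
```

The standard deviation drops after imputation because many observations now share exactly the same value. The larger the share of missing data, the stronger this effect.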


In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.imputation import MeanMedianImputer

> Both scikit-learn and Feature-engine can perform mean and median imputation. They learn these statistics from the train set and then use them to impute missing values in the train set, the test set, and any future data.
> The learned mean and median therefore need to be stored for use on future data.
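
One possible way to keep the learned statistics for later is to serialize the fitted imputer. The sketch below uses Python's standard `pickle` module and an in-memory byte string; the tiny `train` and `future` frames are hypothetical.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train data and a later batch of "future" data
train = pd.DataFrame({"x": [1.0, 2.0, np.nan, 9.0]})
future = pd.DataFrame({"x": [np.nan, 4.0]})

# Learn the median from the train set only
imputer = SimpleImputer(strategy="median").fit(train)

# Serialize the fitted imputer; in practice the bytes would be written to disk
blob = pickle.dumps(imputer)

# Later: restore it and reuse the stored statistic on unseen data
restored = pickle.loads(blob)
future_filled = restored.transform(future)
```

The restored object carries the median learned from `train`, so the future batch is filled with the same value and nothing is recomputed from the new data.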

In [20]:
data_impute = pd.read_csv('creditApprovalUCI.csv')

In [21]:
data_impute.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202.0,0,1
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43.0,560,1
2,a,24.5,,u,g,q,h,,,,0,f,g,280.0,824,1
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100.0,3,1
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,1


In mean and median imputation, the mean or median values should be
calculated using the variables in the train set; therefore, let's separate the data
into train and test sets and their respective targets:


In [22]:
x_train, x_test, y_train, y_test = train_test_split(data_impute.drop('A16', axis=1),
                                                    data_impute['A16'], test_size=0.3,
                                                    random_state=0)

In [23]:
# Fraction of missing values per variable in the train set
x_train.isnull().mean()

A1     0.008282
A2     0.022774
A3     0.140787
A4     0.008282
A5     0.008282
A6     0.008282
A7     0.008282
A8     0.140787
A9     0.140787
A10    0.140787
A11    0.000000
A12    0.000000
A13    0.000000
A14    0.014493
A15    0.000000
dtype: float64

In [24]:
# Replacing missing values in the numerical columns with the train-set median
# (A11 and A15 have no missing values, so they are left unchanged)
for var in ['A2', 'A3', 'A8', 'A11', 'A15']:
    median = x_train[var].median()
    x_train[var] = x_train[var].fillna(median)
    x_test[var] = x_test[var].fillna(median)

> _Note how we calculate the median using the train set and then use this value to replace the missing data in the train and test sets._

- The percentage of null values in A2, A3, A8, A11, and A15 is now 0.

### Using SimpleImputer()

SimpleImputer() from scikit-learn will impute all variables in the
dataset. Therefore, if we use mean or median imputation and the dataset
contains categorical variables, we will get an error. 
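
The notebook avoids this error by slicing the numerical columns before splitting. Another option, sketched here as an assumption rather than the notebook's approach, is scikit-learn's `ColumnTransformer`, which routes each group of columns to its own imputer. The `age` and `grade` columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical mixed-type frame; column names are illustrative only
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "grade": pd.Series(["a", "b", np.nan, "b"], dtype=object),
})

# Route each group of columns to an imputer that suits its type
ct = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["grade"]),
])

# Output columns follow the order of the transformers: age, then grade
filled = pd.DataFrame(ct.fit_transform(df), columns=["age", "grade"], index=df.index)
```

Because the numerical and categorical results are stacked into one array, the rebuilt DataFrame holds `object` columns and may need a cast back to numeric types.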

In [29]:
x_train, x_test, y_train, y_test = train_test_split(data_impute[['A2', 'A3', 'A8', 'A11', 'A15']], 
                                                    data_impute['A16'], test_size=0.3,
                                                    random_state=0)

In [33]:
x_train.shape, y_test.shape

((483, 5), (207,))

In [35]:
imputer = SimpleImputer(strategy='mean')   # can use median also

In [36]:
imputer.fit(x_train)

In [40]:
imputer.statistics_    # mean of each of the five columns

array([ 31.89019068,   4.84148193,   2.36901205,   2.51759834,
       966.25258799])

In [42]:
x_train = imputer.transform(x_train)  # numpy array

In [41]:
x_test = imputer.transform(x_test)

- `SimpleImputer()` returns NumPy arrays. We can transform the array
into a dataframe using `pd.DataFrame(x_train, columns=['A2', 'A3', 'A8', 'A11', 'A15'])`
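
A minimal sketch of that conversion, assuming we also want to keep the shuffled row index that `train_test_split` leaves behind. The small `frame` and its index values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with a non-default index, as produced by train_test_split
frame = pd.DataFrame(
    {"A2": [30.0, np.nan, 24.0], "A3": [0.0, 4.0, np.nan]},
    index=[10, 42, 7],
)

imputer = SimpleImputer(strategy="mean")
array = imputer.fit_transform(frame)  # plain NumPy array, labels are lost

# Rebuild the DataFrame, reusing the original column names and row index
restored = pd.DataFrame(array, columns=frame.columns, index=frame.index)
```

Passing `index=frame.index` keeps the rows aligned with the target series. Recent scikit-learn releases (1.2 and later) also offer `set_output(transform="pandas")` on transformers as an alternative.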

### Using MeanMedianImputer()

In [54]:
x_train, x_test, y_train, y_test = train_test_split(data_impute[['A2', 'A3', 'A8', 'A11', 'A15']], 
                                                    data_impute['A16'], test_size=0.3,
                                                    random_state=0)

In [55]:
median_imputer = MeanMedianImputer(imputation_method='median', 
                                   variables=['A2', 'A3', 'A8', 'A11', 'A15'])

In [56]:
median_imputer.fit(x_train)

In [57]:
median_imputer.imputer_dict_      # Median learned for each variable

{'A2': 28.835, 'A3': 2.75, 'A8': 1.0, 'A11': 0.0, 'A15': 6.0}

In [58]:
x_train = median_imputer.transform(x_train)
x_test = median_imputer.transform(x_test)

> _Feature-engine's MeanMedianImputer() returns a dataframe_

In [60]:
# All missing values are imputed
x_train[['A2','A3', 'A8','A11', 'A15']].isnull().mean()

A2     0.0
A3     0.0
A8     0.0
A11    0.0
A15    0.0
dtype: float64

In [65]:
type(x_train)

pandas.core.frame.DataFrame

## Implementing mode or frequent category imputation


In [68]:
# Use of Mode in categorical variables

_If the percentage of missing values is high, frequent category imputation
may distort the original distribution of categories._
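
To make the distortion concrete, the sketch below uses a hypothetical categorical variable in which 40% of the values are missing and compares category shares before and after imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical variable: 40% of the values are missing
cat = pd.Series(["u"] * 4 + ["y"] * 2 + [np.nan] * 4, dtype=object)

# Share of each category among the observed values
before = cat.value_counts(normalize=True)

# Frequent category imputation: fill every gap with the mode
mode_value = cat.mode()[0]
after = cat.fillna(mode_value).value_counts(normalize=True)

print(before.round(2).to_dict())
print(after.round(2).to_dict())
```

The most frequent category absorbs every missing value, so its share rises from about two thirds of the observed values to 80% of the imputed variable.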

In [71]:
# Not importing all the libraries again
from feature_engine.imputation import CategoricalImputer

In [72]:
data_impute.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202.0,0,1
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43.0,560,1
2,a,24.5,,u,g,q,h,,,,0,f,g,280.0,824,1
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100.0,3,1
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,1


In [73]:
x_train, x_test, y_train, y_test = train_test_split(data_impute.drop('A16', axis=1), 
                                                    data_impute['A16'], test_size=0.3,
                                                    random_state=0)

In [78]:
x_train.isnull().mean()

A1     0.008282
A2     0.022774
A3     0.140787
A4     0.008282
A5     0.008282
A6     0.008282
A7     0.008282
A8     0.140787
A9     0.140787
A10    0.140787
A11    0.000000
A12    0.000000
A13    0.000000
A14    0.014493
A15    0.000000
dtype: float64

In [87]:
x_train['A6'].mode()

0    c
Name: A6, dtype: object

In [89]:
x_train['A6'].unique()

array(['c', 'q', 'w', 'ff', 'm', 'i', 'e', 'cc', 'x', 'd', 'k', 'j', nan,
       'aa', 'r'], dtype=object)

In [90]:
# Imputing Missing Values with most frequent values
for var in ['A4', 'A5', 'A6', 'A7']:
    value = x_train[var].mode()[0]
    x_train[var] = x_train[var].fillna(value)
    x_test[var] = x_test[var].fillna(value)

In [92]:
# No missing values in A4, A5, A6, A7
x_test.isnull().sum()

A1      8
A2      1
A3     24
A4      0
A5      0
A6      0
A7      0
A8     24
A9     24
A10    24
A11     0
A12     0
A13     0
A14     6
A15     0
dtype: int64

### Using SimpleImputer()

In [94]:
x_train, x_test, y_train, y_test = train_test_split(data_impute[['A4', 'A5', 'A6', 'A7']], 
                                                    data_impute['A16'], test_size=0.3,
                                                    random_state=0)

In [96]:
imputer = SimpleImputer(strategy='most_frequent')

In [97]:
imputer.fit(x_train)

In [98]:
imputer.statistics_

array(['u', 'g', 'c', 'v'], dtype=object)

In [99]:
x_train = imputer.transform(x_train)
x_test = imputer.transform(x_test)

In [102]:
x_train = pd.DataFrame(x_train)

In [105]:
x_train.isnull().mean()  # No missing values

0    0.0
1    0.0
2    0.0
3    0.0
dtype: float64

### Using CategoricalImputer()

In [107]:
x_train, x_test, y_train, y_test = train_test_split(data_impute[['A4', 'A5', 'A6', 'A7']], 
                                                    data_impute['A16'], test_size=0.3,
                                                    random_state=0)

In [108]:
cat_imputer = CategoricalImputer(imputation_method='frequent')

In [109]:
cat_imputer.fit(x_train)

In [110]:
cat_imputer.imputer_dict_

{'A4': 'u', 'A5': 'g', 'A6': 'c', 'A7': 'v'}

In [111]:
x_train = cat_imputer.transform(x_train)
x_test = cat_imputer.transform(x_test)

In [112]:
# No missing values
x_test.isnull().mean()

A4    0.0
A5    0.0
A6    0.0
A7    0.0
dtype: float64
