Simple Imputer
Learning Objectives
By the end of this module, you should be able to:
Explain how SimpleImputer helps prevent data leakage
Use SimpleImputer to impute missing values
Use ColumnTransformer to perform different imputation strategies on different columns
SimpleImputer Prevents Data Leakage
We have learned that missing values must be addressed as an essential part of data cleaning for analytics. They must also be addressed during the pre-processing required for machine learning.
Now that we intend to use our data for machine learning, we must be careful to prevent data leakage. This means we do not impute values based on any calculation that involves the test set.
Any statistic used for imputation (such as the mean, median, or most frequent value) must be calculated using only the training set. For that reason, we impute AFTER the train/test split.
Sklearn's SimpleImputer lets us calculate those statistics on the training data and then apply them to both the training data and the test data in just a few lines of code.
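The fit-on-train, transform-both pattern described above can be sketched on a tiny example. This is a minimal illustration, not part of the lesson's dataset: the arrays X_train_toy and X_test_toy and the name toy_imputer are made up for the demonstration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one numeric feature with a missing value in each set
X_train_toy = np.array([[1.0], [2.0], [np.nan], [9.0]])
X_test_toy = np.array([[np.nan], [100.0]])

# fit() learns the statistic (here the mean) from the training data only
toy_imputer = SimpleImputer(strategy='mean')
toy_imputer.fit(X_train_toy)

# transform() reuses the training mean to fill gaps in BOTH sets
X_train_filled = toy_imputer.transform(X_train_toy)
X_test_filled = toy_imputer.transform(X_test_toy)

# The learned value is (1 + 2 + 9) / 3 = 4.0; the test value 100.0 played no part
print(toy_imputer.statistics_)
print(X_test_filled)
```

Because the test set's 100.0 never enters the calculation, the missing test value is filled with the training mean, which is the behavior we want when avoiding leakage.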

Imputation Strategies
SimpleImputer can use any of 4 built-in strategies to impute values:
'mean', fill missing values with the mean of the column they are in (numeric data only).
'median', fill missing values with the median of the column they are in (numeric data only).
'most_frequent', fill missing values with the most frequent value (the mode) of the column they are in. This works for both numeric and categorical columns. There is no separate 'mode' strategy.
'constant', fill missing values with a constant value supplied through the fill_value argument. A common choice for categorical data is 'missing'.
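To make the strategies concrete, here is a small sketch that applies them to toy columns. The arrays toy_num and toy_cat and the dictionaries that hold the results are hypothetical names used only for illustration. The strategy strings are the ones scikit-learn's SimpleImputer accepts.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy columns; the last row of each is missing
toy_num = np.array([[1.0], [2.0], [2.0], [7.0], [np.nan]])
toy_cat = np.array([['Urban'], ['Rural'], ['Urban'], [np.nan]], dtype=object)

# Numeric data: 'mean', 'median', and 'most_frequent' are all valid
numeric_fills = {}
for strategy in ['mean', 'median', 'most_frequent']:
    imputer = SimpleImputer(strategy=strategy)
    numeric_fills[strategy] = imputer.fit_transform(toy_num)[-1, 0]

# Categorical data: only 'most_frequent' and 'constant' make sense
categorical_fills = {}
freq = SimpleImputer(strategy='most_frequent')
categorical_fills['most_frequent'] = freq.fit_transform(toy_cat)[-1, 0]
const = SimpleImputer(strategy='constant', fill_value='missing')
categorical_fills['constant'] = const.fit_transform(toy_cat)[-1, 0]

print(numeric_fills)
print(categorical_fills)
```

In this toy case the mean fills the gap with 3.0, the median and most frequent value both fill it with 2.0, and the categorical gap becomes either 'Urban' or 'missing'.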


SimpleImputer in Python
An overview of the steps:
1. Import necessary libraries.
2. Load and examine the data.
3. Identify which columns have missing values and decide what imputation strategy to use to fill them.
(We will split the data and take a slight detour here to try SimpleImputer on its own, then return to complete the steps below.)
4. Instantiate numeric and categorical column selectors.
5. Instantiate SimpleImputer objects with the imputation strategies we want to use.
6. Use ColumnTransformer to apply each different SimpleImputer object to the appropriate columns.
7. Examine the data to ensure all missing data has been filled.

To illustrate the use of SimpleImputer, we will use the Medical Dataset. We will also use make_column_selector and make_column_transformer to apply different imputation strategies to different columns of our dataset.


In [1]:
# Import Libraries
 
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn import set_config
set_config(display='diagram')


# Read in the data
path = r"C:\Users\User\github_projects\Machine_Learning_with_Python\datasets\medical_data.csv"
df = pd.read_csv(path)
df.head()


Unnamed: 0,State,Lat,Lng,Area,Children,Age,Income,Marital,Gender,ReAdmis,...,Hyperlipidemia,BackPain,Anxiety,Allergic_rhinitis,Reflux_esophagitis,Asthma,Services,Initial_days,TotalCharge,Additional_charges
0,AL,34.3496,-86.72508,Suburban,1.0,53,86575.93,Divorced,Male,0,...,0.0,1.0,1.0,1.0,0,1,Blood Work,10.58577,3726.70286,17939.40342
1,FL,30.84513,-85.22907,Urban,3.0,51,46805.99,Married,Female,0,...,0.0,0.0,0.0,0.0,1,0,Intravenous,15.129562,4193.190458,17612.99812
2,SD,43.54321,-96.63772,Suburban,3.0,53,14370.14,Widowed,Female,0,...,0.0,0.0,0.0,0.0,0,0,Blood Work,4.772177,2434.234222,17505.19246
3,MN,43.89744,-93.51479,Suburban,0.0,78,39741.49,Married,Male,0,...,0.0,0.0,0.0,0.0,1,1,Blood Work,1.714879,2127.830423,12993.43735
4,VA,37.59894,-76.88958,Rural,1.0,22,1209.56,Widowed,Female,0,...,1.0,0.0,0.0,1.0,0,0,CT Scan,1.254807,2113.073274,3716.525786


In [2]:
# Examine Missing Values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 32 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   State               995 non-null    object 
 1   Lat                 1000 non-null   float64
 2   Lng                 1000 non-null   float64
 3   Area                995 non-null    object 
 4   Children            993 non-null    float64
 5   Age                 1000 non-null   int64  
 6   Income              1000 non-null   float64
 7   Marital             995 non-null    object 
 8   Gender              995 non-null    object 
 9   ReAdmis             1000 non-null   int64  
 10  VitD_levels         1000 non-null   float64
 11  Doc_visits          1000 non-null   int64  
 12  Full_meals_eaten    1000 non-null   int64  
 13  vitD_supp           1000 non-null   int64  
 14  Soft_drink          1000 non-null   int64  
 15  Initial_admin       995 non-null    object 
 16  HighBlo

In [3]:
print(df.isna().sum().sum(), 'missing values')

# There are 72 total missing values spread across 14 different columns.  
# Some columns missing data are numeric and some are categorical (object). 
# We can use mean, median, most_frequent, or constant imputation strategies for numeric data, but only constant or most_frequent strategies for categorical data.


72 missing values


In [4]:
# Note: 
# If this were a real project we would investigate further to see if we should drop rows or columns missing data, or impute missing data. 
# The code below filters the dataset for just the rows that are missing at least 1 value and shows the shape.

df[df.isna().any(axis=1)].shape


# 70 rows out of 1000 are missing at least one value.  That is 7% of our data.  
# In a real project we might simply drop the rows with missing values with df.dropna().  
# We could drop rows or columns before the validation split without leaking data.


(70, 32)
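The share of rows affected can be computed directly rather than by hand. The sketch below uses a small hypothetical DataFrame (toy_df, with made-up values) so that it runs on its own. The same expressions should apply to df.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the medical dataset
toy_df = pd.DataFrame({
    'Children': [1.0, np.nan, 3.0, 0.0],
    'Area': ['Urban', 'Rural', np.nan, 'Urban'],
    'Age': [53, 51, 53, 78],
})

# Boolean Series: True for each row that is missing at least one value
rows_missing = toy_df.isna().any(axis=1)

# The mean of a boolean Series is the fraction of True values
pct_missing = rows_missing.mean() * 100
print(f'{rows_missing.sum()} of {len(toy_df)} rows ({pct_missing:.1f}%) are missing at least one value')

# Dropping rows does not calculate anything from the data, so it is safe before the split
toy_dropped = toy_df.dropna()
print(toy_dropped.shape)
```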

In [5]:
# Train/Test Split
# Imputing missing values can leak information from the testing data into the training data, so we impute values AFTER we split the data.

# First, we define our target, "Additional_charges" as y and our features (the rest of the columns) as X.  Then we perform the train test split.  
X = df.drop(columns=['Additional_charges'])
y = df['Additional_charges']


# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [6]:
# Select Columns
# We are going to separate our features into two types of columns based on the data type.  
# One will be our numeric columns that will include both integers and floats.  
# The other will be our categorical columns, which include our strings (objects).  

# instantiate the selectors for numeric and categorical data types
num_selector = make_column_selector(dtype_include='number')
cat_selector = make_column_selector(dtype_include='object')

# get the list of column names of each type
num_columns = num_selector(X_train)
cat_columns = cat_selector(X_train)

#check our lists
print('numeric columns are', num_columns)
print('categorical columns are', cat_columns)


numeric columns are ['Lat', 'Lng', 'Children', 'Age', 'Income', 'ReAdmis', 'VitD_levels', 'Doc_visits', 'Full_meals_eaten', 'vitD_supp', 'Soft_drink', 'HighBlood', 'Stroke', 'Overweight', 'Arthritis', 'Diabetes', 'Hyperlipidemia', 'BackPain', 'Anxiety', 'Allergic_rhinitis', 'Reflux_esophagitis', 'Asthma', 'Initial_days', 'TotalCharge']
categorical columns are ['State', 'Area', 'Marital', 'Gender', 'Initial_admin', 'Complication_risk', 'Services']


In [7]:
# Method of Imputation
# Before we decide which strategy to use for imputation, we need to understand our data.  
# The code below will isolate the numeric columns that are missing data.  
# We can do this to see what imputation strategy we should use.

# isolate the numeric columns
df_num = df[num_columns]

# isolate the columns with missing data
df_num.loc[:, df_num.isna().any()]


# All of the numeric columns with missing data hold whole-number values (even though they are stored as floats).
# If we used a 'mean' strategy, the missing values would usually be filled with decimal values. 
# A 'median' strategy will typically fill them with whole numbers instead ('most_frequent' would also work).  
# In fact, all of the columns above except 'Children' are actually boolean values.  0.0 represents 'no' and 1.0 represents 'yes'.  
# We especially don't want decimal values for number of children or yes/no data.


Unnamed: 0,Children,Arthritis,Diabetes,Hyperlipidemia,BackPain,Anxiety,Allergic_rhinitis
0,1.0,1.0,1.0,0.0,1.0,1.0,1.0
1,3.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...
995,3.0,0.0,1.0,1.0,0.0,0.0,0.0
996,2.0,1.0,0.0,,1.0,1.0,1.0
997,0.0,1.0,0.0,1.0,1.0,0.0,0.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0
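The difference between the strategies on yes/no data can be checked on a toy column. This is a hedged sketch with made-up values (toy_binary and toy_even are hypothetical), reading the learned fill value from the imputer's statistics_ attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical yes/no column stored as floats, with one missing value
toy_binary = np.array([[0.0], [1.0], [1.0], [0.0], [1.0], [np.nan]])

# 3 of the 5 known values are 1.0, so the mean is a decimal (0.6)
mean_fill = SimpleImputer(strategy='mean').fit(toy_binary).statistics_[0]

# The median and the most frequent value are both 1.0, a valid yes/no value
median_fill = SimpleImputer(strategy='median').fit(toy_binary).statistics_[0]
mode_fill = SimpleImputer(strategy='most_frequent').fit(toy_binary).statistics_[0]

# Caution: with an even number of known values the median can fall between them
toy_even = np.array([[0.0], [1.0], [np.nan]])
even_median_fill = SimpleImputer(strategy='median').fit(toy_even).statistics_[0]

print(mean_fill, median_fill, mode_fill, even_median_fill)
```

The last case shows why the median is not a guarantee of whole numbers: with one 0.0 and one 1.0 the median is 0.5. When a column must stay strictly 0 or 1, 'most_frequent' is the safer choice.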


In [8]:
# Simple Imputer without ColumnTransformer
# First, let's check which columns are missing data.
X_train.isna().any()


State                  True
Lat                   False
Lng                   False
Area                   True
Children               True
Age                   False
Income                False
Marital                True
Gender                 True
ReAdmis               False
VitD_levels           False
Doc_visits            False
Full_meals_eaten      False
vitD_supp             False
Soft_drink            False
Initial_admin          True
HighBlood             False
Stroke                False
Complication_risk      True
Overweight            False
Arthritis              True
Diabetes               True
Hyperlipidemia         True
BackPain               True
Anxiety                True
Allergic_rhinitis      True
Reflux_esophagitis    False
Asthma                False
Services               True
Initial_days          False
TotalCharge           False
dtype: bool

In [10]:
# The following code shows how to apply SimpleImputer to the columns that were previously selected and stored in num_columns. 
# Note that the step with .fit is only applied to the training set!

#Instantiate the imputer object from the SimpleImputer class with strategy 'median'
median_imputer = SimpleImputer(strategy='median')

#Fit the imputer object on the numeric training data with .fit() 
#calculates the medians of the columns in the training set
median_imputer.fit(X_train[num_columns])

#Use the median from the training data to fill the missing values in 
#the numeric columns of both the training and testing sets with .transform()
X_train.loc[:, num_columns] = median_imputer.transform(X_train[num_columns])
X_test.loc[:, num_columns] = median_imputer.transform(X_test[num_columns])


In [11]:
# Did SimpleImputer fill the missing values in X_train?
X_train.isna().any()


State                  True
Lat                   False
Lng                   False
Area                   True
Children              False
Age                   False
Income                False
Marital                True
Gender                 True
ReAdmis               False
VitD_levels           False
Doc_visits            False
Full_meals_eaten      False
vitD_supp             False
Soft_drink            False
Initial_admin          True
HighBlood             False
Stroke                False
Complication_risk      True
Overweight            False
Arthritis             False
Diabetes              False
Hyperlipidemia        False
BackPain              False
Anxiety               False
Allergic_rhinitis     False
Reflux_esophagitis    False
Asthma                False
Services               True
Initial_days          False
TotalCharge           False
dtype: bool

In [12]:
# SimpleImputer with ColumnTransformer
# Let's recreate our original X_train with all of the missing values and see how ColumnTransformer, combined with SimpleImputer, 
# can impute both the numeric columns with medians and the categorical columns with the most frequent value.

# Create a New X_train With Missing Values
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train.isna().any()


State                  True
Lat                   False
Lng                   False
Area                   True
Children               True
Age                   False
Income                False
Marital                True
Gender                 True
ReAdmis               False
VitD_levels           False
Doc_visits            False
Full_meals_eaten      False
vitD_supp             False
Soft_drink            False
Initial_admin          True
HighBlood             False
Stroke                False
Complication_risk      True
Overweight            False
Arthritis              True
Diabetes               True
Hyperlipidemia         True
BackPain               True
Anxiety                True
Allergic_rhinitis      True
Reflux_esophagitis    False
Asthma                False
Services               True
Initial_days          False
TotalCharge           False
dtype: bool

In [13]:
# Instantiate ColumnSelectors
# instantiate the selectors for numeric and categorical data types
num_selector = make_column_selector(dtype_include='number')
cat_selector = make_column_selector(dtype_include='object')


In [15]:
# Instantiate Imputers
# We will fill missing data in numeric columns with the median of each column and missing data in categorical columns with the most frequent value.  
# These are not the only options, but it's what we will do today.
#instantiate SimpleImputers with most_frequent and median strategies
freq_imputer = SimpleImputer(strategy='most_frequent')
median_imputer = SimpleImputer(strategy='median')

In [17]:
# Build the ColumnTransformer
# As you recall from other lessons, make_column_transformer() takes tuples of the form (transformer, columns).  
# ColumnSelectors can be used instead of lists of columns.  Both are acceptable.  
# We can set remainder='passthrough' if we are not applying transformers to all columns.  
# We might do this if we imputed some columns by hand already. 
# The default for ColumnTransformer is to drop any columns that are not specified in a tuple, 
# while remainder = 'passthrough' retains the original values for any columns not specified in a tuple without any transformation.

# create tuples of (imputer, selector) for each datatype
num_tuple = (median_imputer, num_selector)
cat_tuple = (freq_imputer, cat_selector)

# instantiate ColumnTransformer
col_transformer = make_column_transformer(num_tuple, cat_tuple, remainder='passthrough')
col_transformer


In [18]:
# Impute Missing Values With ColumnTransformer
# fit ColumnTransformer on the training data
col_transformer.fit(X_train)

# transform both the training and testing data (this will output a NumPy array)
X_train_imputed = col_transformer.transform(X_train)
X_test_imputed = col_transformer.transform(X_test)

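# change the result back to a dataframe
# ColumnTransformer outputs the numeric columns first, then the categorical columns,
# so we label the columns in that order (num_columns and cat_columns were created earlier)
X_train_imputed = pd.DataFrame(X_train_imputed, columns=num_columns + cat_columns, index=X_train.index)
# restore the original column order
X_train_imputed = X_train_imputed[X_train.columns]
X_train_imputed.isna().any()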


# You can see that our dataframe is now free of missing values.  
# Using ColumnTransformer reduces the complexity of our code and the chance of errors, and you will see later that we can combine it with other tools to streamline the modeling process.

State                 False
Lat                   False
Lng                   False
Area                  False
Children              False
Age                   False
Income                False
Marital               False
Gender                False
ReAdmis               False
VitD_levels           False
Doc_visits            False
Full_meals_eaten      False
vitD_supp             False
Soft_drink            False
Initial_admin         False
HighBlood             False
Stroke                False
Complication_risk     False
Overweight            False
Arthritis             False
Diabetes              False
Hyperlipidemia        False
BackPain              False
Anxiety               False
Allergic_rhinitis     False
Reflux_esophagitis    False
Asthma                False
Services              False
Initial_days          False
TotalCharge           False
dtype: bool

In [None]:
# Summary
# SimpleImputer can impute missing values in many columns at once.  
# Because it is fit on the training data only, it also helps avoid data leakage.  
# If we use SimpleImputer and ColumnTransformer together, we can easily apply different imputation strategies to different columns simultaneously.
