## Performing mean or median imputation

Mean or median imputation consists of replacing missing values with the variable mean or
median. This can only be performed on numerical variables. The mean or the median is
calculated using the train set, and these values are then used to impute missing data in the
train and test sets, as well as in future data we intend to score with the machine learning
model. Therefore, we need to store these mean and median values. Scikit-learn and
Feature-engine transformers learn the parameters from the train set and store them for
future use. In this recipe, we will perform mean or median imputation using the
scikit-learn and Feature-engine libraries, with pandas for comparison.

**`TIP:`** Use mean imputation if variables are normally distributed and median
imputation otherwise. Mean and median imputation may distort the
distribution of the original variables if there is a high percentage of
missing data.
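To make the tip concrete, here is a minimal sketch on synthetic data (not the credit approval dataset). The skewed variable, the 40% missing rate, and all variable names are assumptions chosen for illustration. It compares the mean and the median of a right-skewed variable and shows how filling a large share of values with a single number shrinks the spread of the variable.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable (log-normal) with roughly 40% missing values.
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))
missing_mask = rng.random(1000) < 0.4
with_nan = values.mask(missing_mask)

# In a right-skewed variable, the long tail pulls the mean above the median.
mean_value = with_nan.mean()
median_value = with_nan.median()
print("mean:", mean_value, "median:", median_value)

# Impute with the median and compare the spread before and after.
imputed = with_nan.fillna(median_value)
std_before = with_nan.std()
std_after = imputed.std()
print("std before:", std_before, "std after:", std_after)
```

The standard deviation drops after imputation because many observations now share the same value, which is the distortion the tip warns about.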

In [1]:
# let's import the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.imputation import MeanMedianImputer

In [2]:
# load the dataset
data = pd.read_csv('data/creditApprovalUCI.csv')

In mean and median imputation, the mean or median values should be
calculated using the variables in the train set; therefore, let's separate the data
into train and test sets and their respective targets:

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3, random_state=0,
)

In [4]:
# let's check the size of the returned dataset using shape
X_train.shape,X_test.shape

((483, 15), (207, 15))

In [5]:
# let's check the fraction of missing values per variable in the train set
X_train.isnull().mean()


A1     0.008282
A2     0.022774
A3     0.140787
A4     0.008282
A5     0.008282
A6     0.008282
A7     0.008282
A8     0.140787
A9     0.140787
A10    0.140787
A11    0.000000
A12    0.000000
A13    0.000000
A14    0.014493
A15    0.000000
dtype: float64
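Since mean and median imputation only apply to numerical variables, it can help to list the numerical columns that actually contain missing data. The following is a small sketch on a hypothetical toy dataframe; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numerical columns and one categorical column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50_000.0, 62_000.0, 48_000.0, 75_000.0],
    "city": ["Rome", None, "Oslo", "Lima"],
})

# Fraction of missing values per column, computed as in the cell above.
missing_fraction = df.isnull().mean()

# Keep only the numerical columns that contain at least one missing value.
numeric_cols = df.select_dtypes(include="number").columns
to_impute = [col for col in numeric_cols if missing_fraction[col] > 0]
print(to_impute)
```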

In [6]:
# let's replace the missing values with the median in five numerical variables,
# learning each median from the train set only
for var in ['A2','A3','A8','A11','A15']:
    value = X_train[var].median()
    X_train[var] = X_train[var].fillna(value)
    X_test[var] = X_test[var].fillna(value)
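The same idea can be written without a loop. The sketch below uses hypothetical toy train and test sets: it learns the medians from the train set as a pandas Series and passes that Series to `fillna()`, which matches fill values to columns by name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy train and test sets.
train = pd.DataFrame({"a": [1.0, 2.0, np.nan, 10.0], "b": [np.nan, 4.0, 6.0, 8.0]})
test = pd.DataFrame({"a": [np.nan, 3.0], "b": [5.0, np.nan]})

# Learn the medians from the train set only.
medians = train[["a", "b"]].median()

# fillna() accepts a Series (or dict) that maps column names to fill values.
train_imputed = train.fillna(medians)
test_imputed = test.fillna(medians)
print(test_imputed)
```

Keeping `medians` around is the pandas equivalent of storing the learned parameters, since the same Series can be reused on any future data with the same columns.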

**`TIP:`** To impute missing data with the mean, we use pandas' mean():
**value = X_train[var].mean()**.

If we rerun the missing-value check after imputation, the percentage of missing values for
the A2, A3, A8, A11, and A15 variables should be 0.

In [7]:
X_train.isnull().mean()

A1     0.008282
A2     0.000000
A3     0.000000
A4     0.008282
A5     0.008282
A6     0.008282
A7     0.008282
A8     0.000000
A9     0.140787
A10    0.140787
A11    0.000000
A12    0.000000
A13    0.000000
A14    0.014493
A15    0.000000
dtype: float64

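**`TIP:`** By default, pandas' **fillna()** returns a new object with the imputed values
and leaves the original untouched, which is why we reassign the result to the column
above. To modify a dataframe in place instead, call **fillna()** on the dataframe itself
with the inplace argument set to True: **X_train.fillna({var: value}, inplace=True)**.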

Now, let's impute missing values by the median using scikit-learn so that we can
store learned parameters.

In [8]:
# now let's split the original dataset again, keeping only the numerical variables to impute
X_train, X_test, y_train, y_test = train_test_split(
    data[['A2','A3','A8','A11','A15']], data['A16'], test_size=0.3, random_state=0,
)

**`TIP:`** **SimpleImputer()** from scikit-learn imputes all the variables in the
dataset. Therefore, if we use the mean or median strategy and the dataset
contains categorical variables, we will get an error.
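One way to work around this limitation, instead of slicing the dataframe by hand, is to wrap the imputer in scikit-learn's `ColumnTransformer`. The sketch below is only an illustration on a hypothetical toy dataframe; the column names and the transformer label `median_imputer` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one categorical and two numerical columns.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "height": [1.0, np.nan, 3.0, 5.0],
    "weight": [10.0, 20.0, np.nan, 40.0],
})
numeric_cols = ["height", "weight"]

# Apply median imputation to the numerical columns only;
# the remaining columns pass through unchanged.
ct = ColumnTransformer(
    transformers=[("median_imputer", SimpleImputer(strategy="median"), numeric_cols)],
    remainder="passthrough",
)
result = ct.fit_transform(df)

# The output is a NumPy array: imputed columns first, then the passthrough columns.
imputed = pd.DataFrame(result, columns=numeric_cols + ["color"])
print(imputed)
```

Note that the column order changes, so the column names have to be reassigned in the new order.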

In [9]:
# Let's create a median imputation transformer using SimpleImputer() from scikit-learn:
imputer = SimpleImputer(strategy='median')

To perform mean imputation, we should set the strategy to mean:
**imputer = SimpleImputer(strategy='mean')**.

In [10]:
# Let's fit the SimpleImputer() to the train set so that it learns the median values of the variables:
imputer.fit(X_train)

In [11]:
# Let's inspect the learned median values:
imputer.statistics_

array([28.835,  2.75 ,  1.   ,  0.   ,  6.   ])
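The learned values come back as a plain array in the order of the columns seen during `fit()`. As a sketch of how these parameters might be kept for future use, the example below pairs them with the column names and round-trips the fitted imputer through Python's `pickle` module. The toy dataframe is hypothetical, and pickling is just one possible way to persist the transformer.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy train set.
X_train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 6.0, np.nan]})

imputer = SimpleImputer(strategy="median").fit(X_train)

# statistics_ follows the column order seen during fit().
learned = dict(zip(X_train.columns, imputer.statistics_))
print(learned)

# Serialize the fitted imputer and restore it, as we might do before scoring new data.
restored = pickle.loads(pickle.dumps(imputer))

# The restored imputer fills new data with the medians learned from the train set.
X_new = pd.DataFrame({"a": [np.nan], "b": [np.nan]})
new_imputed = restored.transform(X_new)
print(new_imputed)
```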

In [12]:
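# Let's replace missing values with medians. transform() returns NumPy arrays;
# the results are not assigned here, so the notebook displays only the last one (the test set):
imputer.transform(X_train)
imputer.transform(X_test)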

array([[4.583e+01, 1.050e+01, 5.000e+00, 7.000e+00, 0.000e+00],
       [6.408e+01, 2.000e+01, 1.750e+01, 9.000e+00, 1.000e+03],
       [3.125e+01, 3.750e+00, 6.250e-01, 9.000e+00, 0.000e+00],
       ...,
       [2.142e+01, 2.750e+00, 1.000e+00, 0.000e+00, 2.000e+00],
       [2.683e+01, 2.750e+00, 1.000e+00, 0.000e+00, 0.000e+00],
       [6.250e+01, 1.275e+01, 5.000e+00, 0.000e+00, 0.000e+00]])
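Because `SimpleImputer()` returns a NumPy array by default, the column names and the index are lost. A minimal sketch of restoring them, using a hypothetical toy dataframe, could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy train set with a non-default index.
X_train = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [4.0, 6.0, np.nan]},
    index=[10, 11, 12],
)

imputer = SimpleImputer(strategy="median")
array_out = imputer.fit_transform(X_train)

# Rebuild a dataframe, reusing the original column names and index.
X_train_imputed = pd.DataFrame(
    array_out, columns=X_train.columns, index=X_train.index
)
print(X_train_imputed)
```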

Finally, let's perform median imputation using MeanMedianImputer() from
Feature-engine. First, we need to load and divide the dataset, just like we did in
step 2 and step 3. Next, we need to create an imputation transformer.

In [13]:
# Let's set up a median imputation transformer using MeanMedianImputer() from Feature-engine specifying the variables to impute:
median_imputer = MeanMedianImputer(imputation_method='median',variables=['A2', 'A3', 'A8', 'A11', 'A15'])

**`TIP:`** To perform mean imputation, change the imputation method, as follows:
**MeanMedianImputer(imputation_method='mean')**.

In [14]:
# Let's fit the median imputer so that it learns the median values for each of the specified variables:
median_imputer.fit(X_train)

In [15]:
# Let's inspect the learned medians:
median_imputer.imputer_dict_

{'A2': 28.835, 'A3': 2.75, 'A8': 1.0, 'A11': 0.0, 'A15': 6.0}

In [16]:
# Finally, let's replace the missing values with the median:
X_train = median_imputer.transform(X_train)
X_test = median_imputer.transform(X_test)

Feature-engine's MeanMedianImputer() returns a dataframe. You can check that the
imputed variables do not contain missing values using
**X_train[['A2', 'A3', 'A8', 'A11', 'A15']].isnull().mean()**.