## Implementing mode or frequent category imputation

Mode imputation consists of replacing missing values with the mode, that is, the most
frequent value. We normally use this procedure on categorical variables, hence the name
frequent category imputation. Frequent categories are estimated using the train set and
then used to impute values in the train, test, and future datasets. Thus, we need to learn
and store these parameters, which we can do using scikit-learn's and Feature-engine's
transformers.

**`TIP:`** If the percentage of missing values is high, frequent category imputation
may distort the original distribution of categories.
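To make this distortion concrete, the following minimal sketch compares category proportions before and after mode imputation. The toy series and its values are made up for illustration and are not taken from the credit approval dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy variable: 3 'a', 2 'b', and 5 missing values (50% missing).
s = pd.Series(['a', 'a', 'a', 'b', 'b'] + [np.nan] * 5)

# Proportions among the observed values only (NaN is excluded by default).
before = s.value_counts(normalize=True)

# Replace missing values with the mode and recompute the proportions.
mode_value = s.mode()[0]
after = s.fillna(mode_value).value_counts(normalize=True)

print(before)  # 'a' accounts for 0.6 of the observed values, 'b' for 0.4
print(after)   # after imputation, 'a' accounts for 0.8 and 'b' for 0.2
```

With half of the values missing, the share of the most frequent category grows from 60% to 80%, which is the kind of shift the tip warns about.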

In [51]:
# import the required libraries and classes
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.imputation import CategoricalImputer

In [52]:
# loading dataset 
data = pd.read_csv('data/creditApprovalUCI.csv')
data.head()

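|   | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202.0 | 0 | 1 |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43.0 | 560 | 1 |
| 2 | a | 24.5 | NaN | u | g | q | h | NaN | NaN | NaN | 0 | f | g | 280.0 | 824 | 1 |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100.0 | 3 | 1 |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120.0 | 0 | 1 |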


In [53]:
# split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3, random_state=0)

**`TIP:`** Remember that you can check the percentage of missing values in the
train set with **x_train.isnull().mean()**.

In [54]:
# missing data
x_train.isnull().mean()

A1     0.008282
A2     0.022774
A3     0.140787
A4     0.008282
A5     0.008282
A6     0.008282
A7     0.008282
A8     0.140787
A9     0.140787
A10    0.140787
A11    0.000000
A12    0.000000
A13    0.000000
A14    0.014493
A15    0.000000
dtype: float64

In [55]:
# Let's replace missing values with the frequent category, that is, the mode, in four categorical variables:
for var in ['A4','A5','A6','A7']:
    value = x_train[var].mode()[0]
    x_train[var] = x_train[var].fillna(value)
    x_test[var] = x_test[var].fillna(value)

Note how we calculate the mode in the train set and use that value to replace the
missing data in the train and test sets.
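The same idea can be written without a loop. This is a minimal sketch on made-up dataframes (the `train` and `test` names and their values are hypothetical): the modes are learned once from the train set, stored in a dictionary, and then applied to both sets with `fillna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the train and test sets.
train = pd.DataFrame({'A4': ['u', 'u', 'y', np.nan],
                      'A5': ['g', np.nan, 'g', 'p']})
test = pd.DataFrame({'A4': [np.nan, 'y'],
                     'A5': ['p', np.nan]})

# Learn the mode of each variable from the train set only.
modes = {var: train[var].mode()[0] for var in ['A4', 'A5']}

# fillna() accepts a dict mapping column names to replacement values
# and returns new dataframes, leaving the originals untouched.
train_imputed = train.fillna(modes)
test_imputed = test.fillna(modes)

print(modes)
print(test_imputed)
```

Keeping the learned values in a dictionary makes it easy to store them and reuse them on future datasets.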

In [56]:
# just verifying the data
x_train.isnull().mean()

A1     0.008282
A2     0.022774
A3     0.140787
A4     0.000000
A5     0.000000
A6     0.000000
A7     0.000000
A8     0.140787
A9     0.140787
A10    0.140787
A11    0.000000
A12    0.000000
A13    0.000000
A14    0.014493
A15    0.000000
dtype: float64

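**`TIP:`** The pandas **fillna()** method returns a new object with imputed values by
default. To replace missing data in the original dataframe instead, we can pass
**inplace=True** at the dataframe level, for example
**x_train.fillna({var: value}, inplace=True)**. Calling it on a single selected column,
as in **x_train[var].fillna(value, inplace=True)**, is chained assignment and may not
modify the original dataframe in recent pandas versions.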

### Let's impute missing values with the most frequent category using scikit-learn

In [57]:
# loading dataset 
data = pd.read_csv('data/creditApprovalUCI.csv')
data.head()

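|   | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202.0 | 0 | 1 |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43.0 | 560 | 1 |
| 2 | a | 24.5 | NaN | u | g | q | h | NaN | NaN | NaN | 0 | f | g | 280.0 | 824 | 1 |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100.0 | 3 | 1 |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120.0 | 0 | 1 |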


In [58]:
x_train,x_test,y_train,y_test = train_test_split(data[['A4','A5','A6','A7']],data['A16'],test_size=0.3,random_state=0)

In [59]:
# Let's create a frequent category imputer with SimpleImputer() from scikit-learn:
imputer = SimpleImputer(strategy='most_frequent')

**`TIP:`** SimpleImputer() from scikit-learn will learn the mode for numerical
and categorical variables alike. In practice, however, mode imputation is
normally applied to categorical variables only.
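One way to restrict mode imputation to categorical variables when a dataset mixes both types is to wrap the imputer in scikit-learn's `ColumnTransformer`. The following is a minimal sketch on a made-up dataframe; the column names `cat` and `num` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one categorical and one numerical variable.
df = pd.DataFrame({
    'cat': pd.Series(['u', 'u', np.nan, 'y'], dtype=object),
    'num': [1.0, np.nan, 3.0, 3.0],
})

# Apply mode imputation to the categorical column only;
# the numerical column is passed through unchanged.
ct = ColumnTransformer(
    transformers=[('mode', SimpleImputer(strategy='most_frequent'), ['cat'])],
    remainder='passthrough',
)
out = ct.fit_transform(df)

# The learned mode for 'cat'; the numerical column still has its missing value.
print(ct.named_transformers_['mode'].statistics_)
print(out)
```

The numerical variable can then be handled separately, for instance with mean or median imputation.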

In [60]:
# Let's fit the imputer to the train set so that it learns the most frequent values:
imputer.fit(x_train)

In [61]:
# Let's inspect the most frequent values learned by the imputer:
imputer.statistics_

array(['u', 'g', 'c', 'v'], dtype=object)

In [62]:
# Let's replace missing values with frequent categories:
x_train = imputer.transform(x_train)
x_test = imputer.transform(x_test)

**Note:** **SimpleImputer()** returns a NumPy array, not a pandas dataframe.
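If a dataframe is needed for the following steps, the array can be wrapped back into one. This is a minimal sketch on made-up data (the `train` dataframe and its index values are hypothetical); it restores the original column names and index.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy train set with a shuffled index, as after a random split.
train = pd.DataFrame({'A4': ['u', np.nan, 'u', 'y'],
                      'A5': ['g', 'g', np.nan, 'p']},
                     index=[10, 3, 7, 42], dtype=object)

imputer = SimpleImputer(strategy='most_frequent')
imputed_array = imputer.fit_transform(train)  # this is a NumPy array

# Rebuild a dataframe, reusing the original column names and index.
train_imputed = pd.DataFrame(
    imputed_array, columns=train.columns, index=train.index)

print(type(imputed_array))
print(train_imputed)
```

Recent scikit-learn releases (1.2 and later, as far as I know) also let a transformer return dataframes directly through `imputer.set_output(transform="pandas")`.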

### Let's impute missing values using Feature-engine

In [63]:
# loading dataset 
data = pd.read_csv('data/creditApprovalUCI.csv')
data.head()

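|   | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202.0 | 0 | 1 |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43.0 | 560 | 1 |
| 2 | a | 24.5 | NaN | u | g | q | h | NaN | NaN | NaN | 0 | f | g | 280.0 | 824 | 1 |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100.0 | 3 | 1 |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120.0 | 0 | 1 |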


In [64]:
x_train,x_test,y_train,y_test = train_test_split(data[['A4','A5','A6','A7']],data['A16'],test_size=0.3,random_state=0)

In [65]:
# Let's create a frequent category imputer. By default, CategoricalImputer() replaces
# missing values with the string 'Missing', so we set imputation_method='frequent':
mode_imputer = CategoricalImputer(imputation_method='frequent', variables=['A4','A5','A6','A7'])

**`TIP:`** CategoricalImputer() will select all categorical variables in the
train set by default; that is, unless we pass a list of variables to impute.

In [66]:
# Let's fit the imputation transformer to the train set so that it learns the most frequent categories:
mode_imputer.fit(x_train)

In [67]:
# Let's inspect the learned frequent categories:
mode_imputer.imputer_dict_

{'A4': 'u', 'A5': 'g', 'A6': 'c', 'A7': 'v'}

In [69]:
# Finally, let's replace the missing values with frequent categories:
x_train = mode_imputer.transform(x_train)
x_test = mode_imputer.transform(x_test)

**`TIP:`** Remember that you can check that the categorical variables do not contain
missing values by using **x_train[['A4', 'A5', 'A6', 'A7']].isnull().mean()**.