# Description

Let us take Feature Engine as an example. Feature Engine is a good library for feature engineering tasks such as imputation and target encoding.

Sometimes we need to add a small piece of code between two feature transformations. This can be done seamlessly in a notebook. However, bringing that code into a scikit-learn pipeline and serializing it (with pickle) for production requires extra work, because each piece of code has to be wrapped in an object first.
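
To illustrate that extra work, here is a minimal sketch that wraps a one-line dtype cast in a scikit-learn compatible transformer. The class name `CastToObject` and the toy data are hypothetical; they are not part of Feature Engine or Paip.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class CastToObject(BaseEstimator, TransformerMixin):
    """Hypothetical helper: cast the selected columns to object dtype."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # Nothing to learn; fit exists only to satisfy the pipeline interface.
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        for col in self.variables:
            X[col] = X[col].astype('O')
        return X


# Toy data, for illustration only.
toy = pd.DataFrame({'pclass': [1, 3, 2], 'age': [22.0, 38.0, 26.0]})

pipe = Pipeline([('cast', CastToObject(variables=['pclass']))])
out = pipe.fit_transform(toy)
print(out.dtypes)
```

A whole class is needed here for what is a single line of code in a notebook.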

To mitigate this, we have introduced "Paip", a very simple pipeline module that allows us to insert small pieces of code in the middle of a pipeline.
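
The general idea of running small code strings inside a pipeline can be sketched in a few lines of plain Python. This is not Paip's source code. The function `run_steps` is a hypothetical stand-in, and the dictionary keys simply mirror those used in the demo further down.

```python
def run_steps(step_list, tag, variables):
    """Run every code string tagged `tag`, sharing one namespace across steps."""
    namespace = dict(variables)
    for step in step_list:
        # Make the step's objects visible to its code strings.
        namespace.update(step.get('obj_dict', {}))
        for code in step.get('run_dict', {}).get(tag, []):
            exec(code, namespace)
    return namespace


# Toy steps that use only built-in types (assumed for illustration).
steps = [
    {'name': 'double',
     'obj_dict': {'factor': 2},
     'run_dict': {'transform': ['X = [v * factor for v in X]']}},
    {'name': 'shift',
     'obj_dict': {},
     'run_dict': {'transform': ['X = [v + 1 for v in X]']}},
]

result = run_steps(steps, 'transform', {'X': [1, 2, 3]})
print(result['X'])  # [3, 5, 7]
```

Because each step is just data (objects plus strings), adding an ad hoc line between two transformers means appending another string rather than writing a new class.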

In the following demo, we will impute the Titanic data set with Feature Engine. We use Paip to create the pipeline, which lets us insert a small piece of code between the imputers.


A description of the Titanic data can be found here: https://www.kaggle.com/c/titanic/data

# Imports 

In [16]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine import categorical_encoders as ce
import feature_engine.missing_data_imputers as mdi

from Paip.Paip import Paip



# Retrieve Data

In [2]:
#
# Read the data directly from the Internet,
# parsing every "?" entry as np.nan.
#
url = 'https://www.openml.org/data/get_csv/16826755/phpMYEkMl'
data = pd.read_csv(url, na_values = '?')


In [3]:
data.sample(3)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1207,3,0,"Skoog, Master. Karl Thorsten",male,10.0,3,2,347088,27.9,,S,,,
340,2,1,"Becker, Miss. Marion Louise",female,4.0,2,1,230136,39.0,F4,S,11.0,,"Guntur, India / Benton Harbour, MI"
573,2,1,"Troutt, Miss. Edwina Celia 'Winnie'",female,27.0,0,0,34218,10.5,E101,S,16.0,,"Bath, England / Massachusetts"


In [4]:
x_col_list = [
    'pclass',
    'sex',
    'age',
    'sibsp',
    'parch', 
    'embarked'
]

y_col = 'survived'

X = data[x_col_list]
y = data[y_col]

In [5]:
X.sample(3)

Unnamed: 0,pclass,sex,age,sibsp,parch,embarked
676,3,male,18.0,0,0,S
1061,3,female,26.0,0,0,S
459,2,male,42.0,1,0,S


In [6]:
# Fraction of missing values per column
X.isnull().sum(axis = 0) / X.shape[0]

pclass      0.000000
sex         0.000000
age         0.200917
sibsp       0.000000
parch       0.000000
embarked    0.001528
dtype: float64

In [7]:
X.dtypes

pclass        int64
sex          object
age         float64
sibsp         int64
parch         int64
embarked     object
dtype: object

In [8]:
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Numerical Imputation

In [9]:
# Imputation on numerical values
num_imputer = mdi.MeanMedianImputer(imputation_method='median',
                                       variables=['age', 'sibsp', 'parch'] )

# fit the imputer
num_imputer.fit(X_train)

# transform the data
train_t= num_imputer.transform(X_train)
test_t= num_imputer.transform(X_test)

print( num_imputer.imputer_dict_ )

{'age': 28.0, 'sibsp': 0.0, 'parch': 0.0}
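
For readers unfamiliar with this kind of imputer, the fit and transform steps above behave roughly like the following pandas-only sketch: the median is learned on the training data and then reused on the test data. The toy frames are made up for illustration, and this is not Feature Engine's implementation.

```python
import numpy as np
import pandas as pd

# Toy data, for illustration only.
train = pd.DataFrame({'age': [20.0, np.nan, 30.0, 40.0]})
test = pd.DataFrame({'age': [np.nan, 50.0]})

# "fit": learn one median per variable from the training set only.
medians = train[['age']].median().to_dict()

# "transform": fill missing values in both sets with the learned medians.
train_filled = train.fillna(value=medians)
test_filled = test.fillna(value=medians)

print(medians)                       # {'age': 30.0}
print(test_filled['age'].tolist())   # [30.0, 50.0]
```

Learning the statistic on the training set alone keeps information from the test set out of the fitted pipeline.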


In [10]:
# Add to pipeline

step_list = []

step_list.append(
    {'name': 'numerical imputer',
     'obj_dict': {'num_imputer': num_imputer},
     'run_dict': {
         'transform': ['X = num_imputer.transform(X)'],
     },
     'output_list': ['X'],
    }
)



In [11]:
train_t.dtypes

pclass        int64
sex          object
age         float64
sibsp         int64
parch         int64
embarked     object
dtype: object

# Categorical Imputation

In [12]:
# Imputation on categorical values

# Required: cast the variables to object dtype before categorical imputation
cat_variables=['pclass', 'sex', 'embarked']
train_t[cat_variables] =  train_t[cat_variables].astype('O')


cat_imputer = mdi.CategoricalVariableImputer(variables=cat_variables)

# fit the imputer
cat_imputer.fit(train_t)

# transform the data
train_t= cat_imputer.transform(train_t)
test_t= cat_imputer.transform(test_t)

print( cat_imputer.imputer_dict_ )

{'pclass': 'Missing', 'sex': 'Missing', 'embarked': 'Missing'}
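
The dictionary printed above maps each variable to the label that replaces its missing values. In plain pandas the effect is roughly the following; the toy column is invented for illustration, and this is a sketch rather than Feature Engine's code.

```python
import numpy as np
import pandas as pd

# Toy categorical column with one missing entry.
toy = pd.DataFrame({'embarked': ['S', np.nan, 'Q']}, dtype='object')

# Same shape as the imputer_dict_ printed above: variable -> fill label.
fill_labels = {'embarked': 'Missing'}
filled = toy.fillna(value=fill_labels)

print(filled['embarked'].tolist())  # ['S', 'Missing', 'Q']
```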


In [13]:
train_t.dtypes

pclass        int64
sex          object
age         float64
sibsp         int64
parch         int64
embarked     object
dtype: object

In [14]:
# Add to pipeline

# Cast the variables to object dtype (categorical) before imputing.
# 'embarked' is already of object dtype, so it is not listed here.
transform_code ='''
cat_variables=['pclass', 'sex']
X[cat_variables] =  X[cat_variables].astype('O')
'''

step_list.append(
    {'name': 'categorical imputer',
     'obj_dict': {'cat_imputer': cat_imputer},
     'run_dict': {
         'transform': [transform_code,
                       'X = cat_imputer.transform(X)'],
     },
     'output_list': ['X'],
    }
)




# Test Transform with Pipeline

In [17]:
paip1 = Paip(step_list)

In [18]:
paip1.__dict__

{'step_list': [{'name': 'numerical imputer',
   'obj_dict': {'num_imputer': MeanMedianImputer(imputation_method='median',
                      variables=['age', 'sibsp', 'parch'])},
   'run_dict': {'transform': ['X = num_imputer.transform(X)']},
   'output_list': ['X'],
   'persist_obj': False},
  {'name': 'categorical imputer',
   'obj_dict': {'cat_imputer': CategoricalVariableImputer(imputation_method='missing', return_object=False,
                               variables=['pclass', 'sex', 'embarked'])},
   'run_dict': {'transform': ["\ncat_variables=['pclass', 'sex']\nX[cat_variables] =  X[cat_variables].astype('O')\n",
     'X = cat_imputer.transform(X)']},
   'output_list': ['X'],
   'persist_obj': False}],
 'output_dict': {}}

In [19]:
paip1.run('transform',     # run the commands tagged with "transform"
          {'X': X_test},   # variables to bring into the pipeline
          debug=True)      # print debug output

name:numerical imputer
    code='''X = num_imputer.transform(X)''' 
name:categorical imputer
    code='''
cat_variables=['pclass', 'sex']
X[cat_variables] =  X[cat_variables].astype('O')
''' 
name:categorical imputer
    code='''X = cat_imputer.transform(X)''' 


In [20]:
# The transformed output is obtained here.
X_test_tr = paip1.output_dict['X']


In [21]:
X_test_tr

Unnamed: 0,pclass,sex,age,sibsp,parch,embarked
1139,3,male,38.0,0,0,S
533,2,female,21.0,0,1,S
459,2,male,42.0,1,0,S
1150,3,male,28.0,0,0,S
393,2,male,25.0,0,0,S
...,...,...,...,...,...,...
914,3,male,33.0,0,0,S
580,2,female,31.0,0,0,S
1080,3,male,28.0,0,0,Q
1249,3,male,28.0,0,0,Q
