# Pipeline
* Pipeline (software) ... In software engineering, a pipeline consists of a chain of processing elements (processes, threads, coroutines, functions, etc.), arranged so that the output of each element is the input of the next; the name is by analogy to a physical pipeline.

### Objectives
* Challenges with current code
* Understanding pipeline
* Solving Problems using Pipeline
* Challenges with pipeline
* Solving hetrogenous data problem with ColumnTransformer

### Challenges with current code
* Dev have to do manually the preprocessing followed by putting the data in the estimators

In [1]:
from sklearn.datasets import load_digits #import dataset 

In [2]:
digits = load_digits()

In [3]:
digits.data.shape

(1797, 64)

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
trainX, testX, trainY, testY = train_test_split(digits.data, digits.target)

In [6]:
from sklearn.preprocessing import StandardScaler

In [7]:
ss = StandardScaler()

In [8]:
trainX_tf = ss.fit_transform(trainX)

In [9]:
from sklearn.ensemble import RandomForestClassifier

In [10]:
rf = RandomForestClassifier()

In [11]:
rf.fit(trainX_tf, trainY)

RandomForestClassifier()

In [12]:
# This is incorrect, test isn't transformed
#rf.predict(testX)

In [13]:
testX_tf = ss.transform(testX)

In [14]:
rf.predict(testX_tf)

array([9, 2, 7, 9, 9, 2, 0, 9, 8, 4, 8, 1, 9, 8, 9, 1, 8, 6, 8, 7, 3, 5,
       5, 7, 8, 9, 6, 6, 4, 6, 7, 6, 8, 7, 9, 4, 2, 7, 8, 6, 6, 2, 9, 4,
       0, 5, 3, 3, 3, 5, 5, 8, 7, 1, 0, 5, 6, 3, 5, 5, 8, 2, 0, 9, 7, 8,
       6, 3, 4, 4, 6, 7, 7, 9, 7, 9, 1, 6, 7, 3, 0, 2, 8, 7, 5, 1, 5, 0,
       0, 7, 5, 4, 0, 9, 8, 9, 5, 5, 4, 3, 2, 1, 0, 9, 1, 9, 0, 7, 4, 9,
       6, 5, 6, 3, 0, 1, 2, 5, 6, 2, 7, 6, 9, 5, 8, 0, 0, 7, 9, 1, 3, 6,
       2, 4, 3, 4, 0, 5, 6, 9, 6, 1, 5, 1, 1, 2, 7, 5, 3, 6, 7, 2, 1, 5,
       4, 4, 4, 1, 7, 9, 1, 6, 2, 6, 5, 6, 7, 6, 4, 1, 8, 5, 2, 1, 2, 3,
       6, 9, 1, 3, 4, 1, 1, 2, 4, 4, 9, 5, 3, 6, 6, 1, 4, 8, 8, 5, 8, 5,
       2, 5, 1, 7, 1, 2, 7, 7, 6, 0, 7, 9, 7, 1, 2, 7, 9, 2, 9, 9, 9, 9,
       9, 9, 2, 5, 1, 9, 9, 7, 4, 8, 8, 2, 9, 7, 7, 0, 1, 1, 4, 3, 3, 6,
       0, 0, 0, 9, 2, 6, 4, 9, 7, 6, 9, 6, 5, 7, 7, 7, 9, 0, 1, 7, 6, 9,
       8, 3, 5, 7, 2, 4, 1, 0, 5, 2, 9, 3, 3, 3, 4, 7, 1, 3, 1, 1, 3, 7,
       5, 5, 4, 5, 1, 4, 6, 7, 1, 2, 6, 9, 4, 8, 6,

### Challenges
* Different features need different preprocessing
* It's very cumbersome to do things the current way
* We have to preserve manually all the preprocessors used

### Pipeline
* It connectd preprocessors & estimators & thus removes the need to store them manually

In [15]:
from sklearn.pipeline import make_pipeline

In [16]:
digit_pipeline = make_pipeline(StandardScaler(), RandomForestClassifier())

In [17]:
digit_pipeline.fit(trainX, trainY)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('randomforestclassifier', RandomForestClassifier())])

In [18]:
digit_pipeline.predict(testX)

array([9, 2, 7, 9, 9, 2, 0, 9, 8, 4, 8, 1, 9, 8, 9, 1, 8, 6, 8, 7, 3, 5,
       5, 7, 8, 9, 6, 6, 4, 6, 7, 6, 8, 7, 9, 4, 2, 7, 8, 6, 6, 2, 9, 4,
       0, 5, 3, 3, 3, 5, 5, 8, 7, 1, 0, 5, 6, 3, 5, 5, 8, 2, 0, 9, 8, 8,
       6, 3, 4, 4, 6, 7, 7, 9, 7, 9, 1, 6, 7, 3, 0, 2, 8, 7, 5, 1, 5, 0,
       0, 7, 5, 4, 0, 9, 8, 9, 5, 5, 4, 3, 2, 1, 0, 9, 1, 9, 0, 7, 4, 9,
       6, 5, 6, 3, 0, 1, 2, 5, 6, 2, 7, 6, 9, 5, 8, 0, 0, 7, 9, 1, 3, 6,
       2, 4, 3, 4, 0, 5, 6, 9, 6, 1, 5, 1, 1, 2, 7, 5, 3, 6, 7, 2, 1, 5,
       4, 8, 4, 1, 4, 9, 1, 6, 2, 6, 5, 6, 7, 6, 4, 1, 8, 5, 2, 1, 2, 3,
       6, 9, 1, 3, 4, 1, 1, 2, 4, 4, 9, 5, 3, 6, 6, 1, 4, 8, 8, 5, 8, 5,
       2, 5, 1, 7, 1, 2, 7, 7, 6, 0, 7, 9, 7, 1, 2, 7, 9, 2, 9, 9, 9, 9,
       9, 9, 2, 5, 1, 9, 9, 7, 4, 8, 8, 2, 9, 7, 7, 4, 1, 1, 4, 3, 3, 6,
       0, 0, 0, 9, 2, 6, 4, 9, 7, 6, 9, 6, 5, 7, 7, 7, 9, 0, 1, 7, 6, 9,
       8, 3, 5, 7, 2, 6, 1, 0, 5, 2, 9, 3, 3, 3, 4, 7, 1, 3, 1, 1, 3, 7,
       5, 5, 4, 5, 1, 4, 6, 7, 1, 2, 6, 9, 4, 8, 6,

In [19]:
digit_pipeline.steps[1][1].feature_importances_

array([0.00000000e+00, 1.74761677e-03, 2.10682705e-02, 1.08093112e-02,
       9.76391126e-03, 1.93340730e-02, 7.67069646e-03, 8.78277772e-04,
       5.13579234e-05, 1.09612704e-02, 2.18566631e-02, 6.60370589e-03,
       1.45240906e-02, 2.68622565e-02, 5.37438243e-03, 5.87675187e-04,
       9.00089499e-05, 9.05955704e-03, 1.95759053e-02, 2.67725260e-02,
       2.64382655e-02, 4.68501371e-02, 6.78349626e-03, 5.54260537e-04,
       9.61485486e-05, 1.17798861e-02, 4.47404592e-02, 2.58624774e-02,
       2.94098275e-02, 2.08278264e-02, 2.97352453e-02, 4.08924413e-05,
       0.00000000e+00, 3.37402981e-02, 2.74943920e-02, 2.17318191e-02,
       4.37048189e-02, 2.30719486e-02, 2.46691249e-02, 0.00000000e+00,
       4.56080916e-05, 1.15174977e-02, 3.92086174e-02, 4.59142334e-02,
       2.13425746e-02, 2.16791479e-02, 1.99171637e-02, 1.50484159e-04,
       0.00000000e+00, 3.04613058e-03, 1.78457224e-02, 2.43841051e-02,
       1.29109101e-02, 2.14333911e-02, 2.47735890e-02, 1.82840595e-03,
      

### Connecting Pipeline with Hyper-parameter Tunning
* We could determine the best combination of hyper-parameters for preprocessor & estimator

In [20]:
from sklearn.feature_selection import SelectKBest, f_classif

In [21]:
digit_pipeline = make_pipeline(StandardScaler(), SelectKBest(k=10, score_func=f_classif), RandomForestClassifier(n_estimators=100))

In [22]:
digit_pipeline

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('selectkbest', SelectKBest()),
                ('randomforestclassifier', RandomForestClassifier())])

In [23]:
from sklearn.model_selection import GridSearchCV

In [24]:
# params = {'selectkbest__k':[20,30,40],'randomforestclassifier__n_estimators':[100,200]}

In [25]:
params = {'selectkbest__k':[50,55,60],'randomforestclassifier__n_estimators':[300,400,500]}

In [26]:
gs = GridSearchCV(digit_pipeline, param_grid = params, cv=5, n_jobs=4)

In [27]:
gs.fit(trainX, trainY)

  f = msb / msw


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                       ('selectkbest', SelectKBest()),
                                       ('randomforestclassifier',
                                        RandomForestClassifier())]),
             n_jobs=4,
             param_grid={'randomforestclassifier__n_estimators': [300, 400,
                                                                  500],
                         'selectkbest__k': [50, 55, 60]})

In [28]:
gs.best_params_

{'randomforestclassifier__n_estimators': 400, 'selectkbest__k': 55}

In [29]:
gs.best_score_#before .9680 and now 97

0.9769902244251687

In [30]:
gs.predict(testX[:2])

array([9, 2])

In [31]:
gs.best_estimator_

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('selectkbest', SelectKBest(k=55)),
                ('randomforestclassifier',
                 RandomForestClassifier(n_estimators=400))])

### Column Tranformer for dealing with hetrogenous data
* Regular pipeline intends to do same processing for all the columns 
* This doesn't work for hetrogenous data

<hr>
<img scr='https://camo.githubusercontent.com/82ec7d9816d964e54f028d2ce5223ac529dd740e/68747470733a2f2f6769746875622e636f6d2f6564796f64612f446174612d536369656e746973742d70726f6772616d2f626c6f622f6d61737465722f41737369676e6d656e742f696d616765732f436f6c756d6e5472616e666f726d65722e706e673f7261773d74727565'>

In [32]:
import pandas as pd

In [33]:
hr_data = pd.read_csv('https://raw.githubusercontent.com/edyoda/data-science-complete-tutorial/master/Data/HR_comma_sep.csv.txt')

In [34]:
hr_data.rename(columns={'sales':'dept'},inplace=True)

In [35]:
hr_data.sample(5)

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,dept,salary
13761,0.63,0.6,4,258,3,0,0,0,sales,medium
2460,0.86,0.72,4,167,2,0,0,0,sales,low
3370,0.64,0.83,3,188,4,0,0,0,sales,low
9932,0.17,0.89,5,261,5,0,0,0,marketing,medium
14746,0.37,0.56,2,156,3,0,1,0,sales,medium


In [36]:
feature_data = hr_data.drop(columns=['left'])

In [37]:
target_data = hr_data.left

* Different Columns needs different preprocessing

In [38]:
feature_data.dtypes

satisfaction_level       float64
last_evaluation          float64
number_project             int64
average_montly_hours       int64
time_spend_company         int64
Work_accident              int64
promotion_last_5years      int64
dept                      object
salary                    object
dtype: object

In [39]:
cat_data = feature_data.select_dtypes(include=['object'])

In [40]:
int_data = feature_data.select_dtypes(include=['int'])

In [41]:
float_data = feature_data.select_dtypes(include=['float'])

In [42]:
float_data[:4]

Unnamed: 0,satisfaction_level,last_evaluation
0,0.38,0.53
1,0.8,0.86
2,0.11,0.88
3,0.72,0.87


* <b>Float</b> satisfaction_level & last_evaluation don't need preprocessing
* <b>Int</b> number_project,average_montly_hours,time_spend_company,Work_accident,promotion_last_5years need MinMaxScaler
* <b>Object</b> dept & Salary need OrdinalEncoder

In [43]:
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer # dealing with NaN values
# SimplImputer is for handling missing data in pipline

In [44]:
# cat_pipeline = make_pipeline(SimpleImputer(), OrdinalEncoder())
cat_pipeline = make_pipeline(OrdinalEncoder())

In [45]:
int_pipeline = make_pipeline(MinMaxScaler(), SelectKBest(k=3, score_func=f_classif))

In [46]:
from sklearn.compose import make_column_transformer

In [47]:
preprocessor = make_column_transformer(
    (cat_pipeline,cat_data.columns),
    (int_pipeline,int_data.columns),
    remainder='passthrough'
)

In [48]:
pipeline = make_pipeline(preprocessor, RandomForestClassifier())

In [49]:
trainX, testX, trainY, testY = train_test_split(feature_data, target_data)

In [50]:
trainX[:2]

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,promotion_last_5years,dept,salary
10726,0.13,0.84,5,189,5,0,0,technical,low
5403,0.68,0.81,3,166,2,0,0,IT,medium


In [51]:
pipeline.fit(trainX, trainY)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('pipeline-1',
                                                  Pipeline(steps=[('ordinalencoder',
                                                                   OrdinalEncoder())]),
                                                  Index(['dept', 'salary'], dtype='object')),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('minmaxscaler',
                                                                   MinMaxScaler()),
                                                                  ('selectkbest',
                                                                   SelectKBest(k=3))]),
                                                  Index([], dtype='object'))])),
                ('randomforestclassifier', RandomForestClassifier())])

In [52]:
pipeline.predict(testX[:2])

array([0, 0], dtype=int64)

In [53]:
pipeline.score(testX,testY)

0.9909333333333333

In [54]:
pipeline.steps[0][1].transformers

[('pipeline-1',
  Pipeline(steps=[('ordinalencoder', OrdinalEncoder())]),
  Index(['dept', 'salary'], dtype='object')),
 ('pipeline-2', Pipeline(steps=[('minmaxscaler', MinMaxScaler()),
                  ('selectkbest', SelectKBest(k=3))]), Index([], dtype='object'))]

In [55]:
params = {'columntransformer__pipeline-2__selectkbest__k':[2,3,4,5]}

In [56]:
gs = GridSearchCV(pipeline, param_grid=params, n_jobs=4, cv=5)

In [57]:
gs.fit(trainX,trainY)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('pipeline-1',
                                                                         Pipeline(steps=[('ordinalencoder',
                                                                                          OrdinalEncoder())]),
                                                                         Index(['dept', 'salary'], dtype='object')),
                                                                        ('pipeline-2',
                                                                         Pipeline(steps=[('minmaxscaler',
                                                                                          MinMaxScaler()),
                                                                                         ('selectkbest',
         

In [58]:
gs.best_params_

{'columntransformer__pipeline-2__selectkbest__k': 4}

In [59]:
gs.best_score_

0.9911100834938985

### Disadvantages of Pipeline
* Doesn't support Online Learning


### Dealing with imbalanced data in pipeline
* Use imblearn make_pipeline rather than scikit makepipeline as RandomOverSampler is not supported in Scikit

In [60]:
target_data.value_counts() #data is not balanced 

0    11428
1     3571
Name: left, dtype: int64

In [61]:
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline



In [62]:
# This make_pipeline is not scikit pipeline but imblearn which support Oversampler as part of pipeline
pipeline = make_pipeline (preprocessor, RandomOverSampler(),RandomForestClassifier())

In [63]:
pipeline.fit(trainX,trainY)



Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('pipeline-1',
                                                  Pipeline(steps=[('ordinalencoder',
                                                                   OrdinalEncoder())]),
                                                  Index(['dept', 'salary'], dtype='object')),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('minmaxscaler',
                                                                   MinMaxScaler()),
                                                                  ('selectkbest',
                                                                   SelectKBest(k=3))]),
                                                  Index([], dtype='object'))])),
                ('randomoversampler', RandomOverSampler()),
                ('randomfores

In [64]:
pipeline.score(testX,testY)

0.9906666666666667