# Pipelines
-----------

- Pipelines are a natural way to think about a machine learning system. With some practice, a data scientist can visualise data "flowing" through a series of steps. The input is typically raw data that has to be processed in some manner, and the goal is to represent it in a form that can be ingested by a machine learning algorithm. Along the way, some steps extract features, while others normalize the data and remove undesirable elements. Pipelines are simple, yet they are a powerful way of designing sophisticated machine learning systems.

- Both **scikit-learn** and **pandas** make it possible to use pipelines. However, it is quite rare to see pipelines used in practice. People do sometimes use scikit-learn's `pipeline` module, but the `pipe` method from `pandas` is sadly underappreciated. A big reason pipelines are not given much love is that it is easier to think of batch learning in terms of a script or a notebook; many people doing data science seem to prefer a procedural style to a declarative one. Moreover, pipelines can be a bit rigid in practice if one wishes to perform unorthodox operations.

- Although pipelines may be a bit of an odd fit for batch learning, they make complete sense for online learning. The UNIX philosophy has advocated the use of pipelines for data processing for many decades. If you can visualise data as a stream of observations, then using pipelines should make a lot of sense to you.
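
The stream-of-observations idea can be sketched with plain Python generators, in the spirit of a UNIX pipe. This is only an illustration: the step names (`parse`, `add_ratio`, `keep_busy`) and the tiny `raw` input are made up for the example and are not part of the dataset used below.

```python
from typing import Dict, Iterable, Iterator


def parse(lines: Iterable[str]) -> Iterator[Dict[str, float]]:
    """Turn 'hits|visits' text lines into dicts, one observation at a time."""
    for line in lines:
        hits, visits = line.split("|")
        yield {"hits": float(hits), "visits": float(visits)}


def add_ratio(rows: Iterable[Dict[str, float]]) -> Iterator[Dict[str, float]]:
    """Feature extraction step: add the number of hits per visit."""
    for row in rows:
        yield {**row, "hits_per_visit": row["hits"] / row["visits"]}


def keep_busy(rows: Iterable[Dict[str, float]], min_hits: float = 100.0) -> Iterator[Dict[str, float]]:
    """Filtering step: drop observations with fewer than min_hits hits."""
    for row in rows:
        if row["hits"] >= min_hits:
            yield row


# Hypothetical raw input; in an online setting this could be an endless stream.
raw = ["120|40", "50|25", "300|60"]

# Each step consumes the previous one lazily, like `parse | add_ratio | keep_busy`.
stream = keep_busy(add_ratio(parse(raw)))
results = list(stream)
print(results)
```

Because every step is a generator, no observation is processed until something downstream asks for it, which is what makes this style a natural fit for online learning.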

In [3]:
import pandas as pd
df = pd.read_csv('data/data.csv', header=0, sep='|')
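# Fill missing values with the column means; numeric_only=True skips non-numeric columns such as 'day'
df.fillna(df.mean(numeric_only=True), inplace=True)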
df.head()

Unnamed: 0,hits,visits,day,identifier,orders,amount,product_pages,direct_visit,organic_visit,paid_search_visit,email_visit
0,1084135,145634,2020-04-27,96,45986,3061233.89,707126,400028,260021,846,6
1,734485,111792,2020-04-30,96,53344,3271520.39,479824,255051,159261,431,0
2,2084615,182338,2020-04-08,96,11576,908171.75,1319358,675851,337172,37056,12
3,1133765,157161,2020-04-25,96,49829,3398320.87,720391,416621,237090,801,7
4,2473217,254864,2020-04-14,96,24317,2029124.65,1503301,736847,523907,75793,0


In [4]:
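# Features: every column except the target ('identifier') and the date ('day')
X = df.drop(['identifier', 'day'], axis=1)
Y = df['identifier']

from sklearn.model_selection import train_test_split
# Hold out 20% of the rows, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=30, stratify=Y)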

## scikit-learn

### without a pipeline

In [5]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)

X_train_scaled = scaler.transform(X_train)

model = SVC().fit(X_train_scaled, y_train)

print("score = %3.2f" %(model.score(scaler.transform(X_test),y_test)))

score = 0.94


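Applying the transformer and the estimator separately has an intrinsic problem once the estimator's hyperparameters (here, those of the SVM) are tuned with `GridSearchCV`.

During cross-validation, each `SVC.fit()` call trains on features that already contain information from the validation fold, because `StandardScaler.fit()` was run on the whole training set beforehand. The cross-validated scores can therefore be optimistic. Wrapping both steps in a pipeline avoids this, since the scaler is then refit on the training portion of each fold.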
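
A minimal sketch of what a pipeline does inside cross-validation, using synthetic data from `make_classification` rather than the notebook's CSV. The names `X_demo`, `y_demo`, `fold_scaler` and `fold_model` are illustrative only. The manual loop fits the scaler on the training part of each fold and never on the validation part, which should match what `cross_val_score` does when it is handed a pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
cv = StratifiedKFold(n_splits=5)

# Leak-free cross-validation written out by hand.
manual_scores = []
for train_idx, val_idx in cv.split(X_demo, y_demo):
    # The scaler only ever sees the training part of the fold.
    fold_scaler = StandardScaler().fit(X_demo[train_idx])
    fold_model = SVC().fit(fold_scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    manual_scores.append(
        fold_model.score(fold_scaler.transform(X_demo[val_idx]), y_demo[val_idx])
    )

# The same procedure, delegated to a pipeline.
pipeline_scores = cross_val_score(
    make_pipeline(StandardScaler(), SVC()), X_demo, y_demo, cv=cv
)

print(np.round(manual_scores, 3))
print(np.round(pipeline_scores, 3))
```

Scaling `X_demo` once up front and then cross-validating the bare `SVC` would instead let every fold's scaler statistics depend on its validation rows.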

### with a pipeline

In [6]:
from sklearn.pipeline import Pipeline

steps = [
    ('scaler', StandardScaler()), 
    ('SVM', SVC())
]

pipeline = Pipeline(steps)

model = pipeline.fit(X_train, y_train)

print("score = %3.2f" %(model.score(X_test,y_test)))

score = 0.94


### with GridSearch

In [7]:
from sklearn.model_selection import GridSearchCV

parameters = {'SVM__C':[0.1,10,100], 'SVM__gamma':[0.1,0.01]}

grid = GridSearchCV(pipeline, param_grid=parameters, cv=5)

grid.fit(X_train, y_train)

print("score = %3.2f" %(grid.score(X_test,y_test)))
print(grid.best_params_)

score = 0.99
{'SVM__C': 100, 'SVM__gamma': 0.1}
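
The keys in `parameters` follow scikit-learn's `<step name>__<parameter>` convention: the part before the double underscore is the name given to the step in the pipeline, the part after it is a parameter of that step's estimator. A small sketch, assuming a pipeline with the same two step names as above (`demo_pipeline` is an illustrative name):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

demo_pipeline = Pipeline([("scaler", StandardScaler()), ("SVM", SVC())])

# Every tunable parameter of the 'SVM' step is exposed as 'SVM__<name>'.
svm_keys = sorted(k for k in demo_pipeline.get_params() if k.startswith("SVM__"))
print(svm_keys)

# set_params accepts the same names, which is what GridSearchCV relies on
# to try each combination from the grid.
demo_pipeline.set_params(SVM__C=10, SVM__gamma=0.01)
print(demo_pipeline.named_steps["SVM"])
```

Listing the keys with `get_params()` is a quick way to check the spelling of a grid entry before launching a long search.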


## pandas

In [8]:
df.head()

Unnamed: 0,hits,visits,day,identifier,orders,amount,product_pages,direct_visit,organic_visit,paid_search_visit,email_visit
0,1084135,145634,2020-04-27,96,45986,3061233.89,707126,400028,260021,846,6
1,734485,111792,2020-04-30,96,53344,3271520.39,479824,255051,159261,431,0
2,2084615,182338,2020-04-08,96,11576,908171.75,1319358,675851,337172,37056,12
3,1133765,157161,2020-04-25,96,49829,3398320.87,720391,416621,237090,801,7
4,2473217,254864,2020-04-14,96,24317,2029124.65,1503301,736847,523907,75793,0


In [9]:
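import numpy as np
from IPython.display import display  # implicit in notebooks; imported so the cell also runs as a script

def csnap(df, fn=lambda x: x.shape, msg=None):
    """Custom helper to show a snapshot of the frame in the middle of a method chain.

    Applies `fn` to `df` (by default its shape), displays the result, and
    returns `df` unchanged so the chain can continue.
    """
    if msg:
        print(msg)
    display(fn(df))
    return df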

In [10]:
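pdPipeline = (df.pipe(csnap)  # show the initial shape
                .rename(columns={"identifier": "account"})
                # flag rows with more than 10000 product pages and at least one email visit
                .assign(myFilter=lambda x: np.where((x['product_pages'] > 10000) & (x.email_visit > 0), 1, 0))
                .pipe(csnap)
                .query("hits >= 100000")
                .pipe(csnap)
                .sort_values("orders", ascending=False)
                .reset_index(drop=True)
                # label-based and inclusive: keeps labels 1 to 1000, so the top row (label 0) is dropped
                .loc[1:1000]
                .pipe(csnap)
                .filter(["account", "orders", "myFilter"], axis=1)
                .pipe(csnap, lambda x: x.sample(5))  # show 5 random rows
             )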

(3158, 11)

(3158, 12)

(2591, 12)

(1000, 12)

Unnamed: 0,account,orders,myFilter
143,45,105320,1
284,96,79630,0
582,36,55757,1
674,96,51006,1
342,34,73364,1
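
For readers new to `pipe`: `df.pipe(f, *args, **kwargs)` simply calls `f(df, *args, **kwargs)`, which lets user-defined functions sit in a method chain next to built-in methods. A small sketch on made-up data (the `toy` frame and the helpers `add_conversion` and `top_n` are hypothetical, not part of the notebook above):

```python
import pandas as pd


def add_conversion(df: pd.DataFrame) -> pd.DataFrame:
    """Add orders per visit as a new column, leaving the input untouched."""
    return df.assign(conversion=df["orders"] / df["visits"])


def top_n(df: pd.DataFrame, column: str, n: int = 2) -> pd.DataFrame:
    """Keep the n rows with the largest values in `column`."""
    return df.sort_values(column, ascending=False).head(n)


# Made-up data for illustration only.
toy = pd.DataFrame(
    {"account": [1, 2, 3], "visits": [100, 200, 400], "orders": [10, 50, 20]}
)

# Reads top to bottom, in the order the steps are applied.
chained = toy.pipe(add_conversion).pipe(top_n, "conversion", n=2)

# The equivalent nested call reads inside out.
nested = top_n(add_conversion(toy), "conversion", n=2)

print(chained)
```

Both forms produce the same frame; the chained one keeps the steps in reading order, which is the main appeal of the style used with `csnap` above.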
