# How to Simplify Machine Learning Workflow with scikit-learn Pipeline

**Presented to you by:** Siraprapa Watakit, Data Scientist, DACoE Asia
<br>**Duration** : 1 Hour

**Objective:** <font color=turquoise> **Zero** to **Hero** Machine Learning with scikit-learn Pipeline.</font>

**Prerequisite:**  
- Basic Python programming.
- Basic understanding of the scikit-learn Estimator and Transformer APIs.
- Basic knowledge of the machine learning workflow.

<font color=darkred>  **Bonus:** lots of examples, plus giveaway functions and transformers. </font>

<i>This notebook is powered by <b>RISE</b>, best viewed in Slideshow mode</i>

**Agenda**
 - Background and Reviews : machine learning workflows, sklearn Estimator and Transformer APIs.
 - **Basic** : Build a model with and without sklearn.Pipeline
 - **Advanced** : Build your own Custom Transformer and Model Wrapper

**References**
 - Python Data Science Handbook
 - Kaggle competitor : Zac Stewart
 - PyData Conferences : Julie Michelman, Kevin Goetsch
 - PyCon Conferences : Kevin Markham

# Background and Reviews

<img src="./images/ml_workflow.png" >



# Data

 - <code>numpy</code> : 1D or 2D numeric arrays, accessible via slicing notation, e.g. <code>X[<i>start:stop:step</i>]</code>, where start, stop and step are positional integers
 - <code>pandas</code> : a DataFrame with labeled rows and columns, accessible via slicing notation as well as column names <br> e.g. <code>house_data.loc[:10,['GrLivArea','YearBuilt','SalePrice']]</code>
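
A minimal sketch of the two access styles above. The values in `toy_houses` are made up for illustration and are not rows from the real dataset.

```python
import numpy as np
import pandas as pd

# numpy: positional slicing with start:stop:step
arr = np.arange(10)
every_other = arr[1:8:2]          # positions 1, 3, 5, 7

# pandas: label-based selection with .loc (made-up values)
toy_houses = pd.DataFrame({
    'GrLivArea': [1500, 1200, 1800],
    'YearBuilt': [2000, 1975, 2005],
    'SalePrice': [200000, 150000, 250000],
})
# Unlike positional slicing, .loc includes the end label (rows 0 and 1 here)
subset = toy_houses.loc[:1, ['GrLivArea', 'SalePrice']]

print(every_other)
print(subset)
```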


# Preprocessing with Scikit-Learn’s Transformer API
1. instantiate a <code>transformer</code> object
2. <code>.fit</code> the <code>transformer</code> with data
3. <code>.transform</code> on data
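
The three steps above can be sketched with `StandardScaler` on a tiny made-up array (`X_toy` is an assumption for illustration only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_toy = np.array([[1.0], [2.0], [3.0]])

scaler = StandardScaler()            # 1. instantiate
scaler.fit(X_toy)                    # 2. fit: learns mean_ and scale_ from the data
X_scaled = scaler.transform(X_toy)   # 3. transform: applies the learned statistics

print("Learned mean:", scaler.mean_)
print("Scaled values:", X_scaled.ravel())
```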

# Modeling with Scikit-Learn’s Estimator API
1. instantiate an <code>estimator</code> object
2. <code>.fit</code> the <code>estimator</code> with data
3. Supervised vs. Unsupervised
    - Supervised <code>.predict</code> on data
    - Unsupervised <code>.transform</code> on data
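
A small sketch of both flavours on made-up data: `LinearRegression` as the supervised case and `PCA` as the unsupervised one. The arrays are assumptions chosen so that `y = x1 + x2` exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA

X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y_toy = np.array([3.0, 3.0, 7.0, 7.0])   # y = x1 + x2

# Supervised: fit with X and y, then predict
reg = LinearRegression().fit(X_toy, y_toy)
y_hat = reg.predict(X_toy)

# Unsupervised: fit with X only, then transform
pca = PCA(n_components=1).fit(X_toy)
X_reduced = pca.transform(X_toy)

print(y_hat)
print(X_reduced.shape)
```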


# What happens when we invoke <code>.fit, .transform, .predict</code>
<br>
<img src="./images/fit_transform_predict.png" >


### <font color=darkred>  This is important: you need to understand it<br> if you are to create your own Custom Transformer.</font>


## Example#1: Simple preprocessing and modeling

Data : House Price(<a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data">Kaggle</a>) 


In [None]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np

house_data = pd.read_csv('./data/house_data.csv')
house_data.head(3)

In [None]:
from sklearn.model_selection import train_test_split

y = house_data['SalePrice']
X = house_data.loc[:,['GrLivArea','LotArea','Street']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

X_train.head(3)

### Example #1: (1/4) Pre-process numeric features

In [None]:
from sklearn.preprocessing import StandardScaler

feat_num = ['GrLivArea','LotArea']

std_scaler = StandardScaler()
std_scaler.fit(X_train[feat_num])
X_train_num_scaled = std_scaler.transform(X_train[feat_num])
print("Scaled features:\n",X_train_num_scaled[0:3])


### Example #1: (2/4)  Pre-process categorical features


In [None]:
from sklearn.preprocessing import LabelEncoder,OneHotEncoder

feat_cat = ['Street']

le = LabelEncoder()
X_train[feat_cat] = le.fit_transform(X_train[feat_cat])
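# `categorical_features` was removed from OneHotEncoder in scikit-learn 0.22;
# every column passed in is encoded by default.
ohe = OneHotEncoder()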
X_train_cat_ohe = ohe.fit_transform(X_train[feat_cat]).toarray()

print("Dummified features:\n",X_train_cat_ohe[0:3])


### This is only one of many ways to do things with Python!

Check this out : <a href="https://www.youtube.com/watch?v=0s_1IsROgDc">How do I create dummy variables in pandas?</a>

<font color=darkred> NOTE: scikit-learn's <code>LinearRegression</code> still fits when the dummy columns are perfectly collinear (the dummy variable trap), so the example works even if you don't drop one of the columns. However, we should make a habit of handling the dummy variable trap ourselves, in case we are not using sklearn for modeling.</font>
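
One way to avoid the trap yourself is to drop one dummy column per feature. A minimal sketch with `pandas.get_dummies`, using a few made-up `Street` rows:

```python
import pandas as pd

streets = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave', 'Pave']})

# All levels: the two columns always sum to 1 (perfectly collinear with the intercept)
all_dummies = pd.get_dummies(streets, columns=['Street'])

# drop_first=True removes the first level, leaving k-1 columns
safe_dummies = pd.get_dummies(streets, columns=['Street'], drop_first=True)

print(list(all_dummies.columns))
print(list(safe_dummies.columns))
```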

### Example#1: (3/4) Build a model(s)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X_train_processed = np.concatenate((X_train_num_scaled, X_train_cat_ohe), axis=1)

print("Processed Train Data: \n",X_train_processed[0:3],"\n")
lr1 = LinearRegression()
lr1.fit(X_train_num_scaled,y_train)
print("Model #1 :  Only 2 numeric features")
print("Intercept:", lr1.intercept_)
print("Coefficient:", lr1.coef_)

lr2 = LinearRegression()
lr2.fit(X_train_processed,y_train)
print("\nModel #2 : 2 numeric and 1 categorical features")
print("Intercept:", lr2.intercept_)
print("Coefficient:", lr2.coef_)

### Example #1: (4/4) Pre-process the test data, then test and compare the model(s)

Be careful with this part:
- Use the estimators/transformers that were fitted with the <font color=darkred>**TRAIN DATA**</font>
    - Estimators: <code>lr1,lr2</code>
    - Transformers: <code>std_scaler,le,ohe</code>
- Apply the estimators/transformers to the <font color=blue>**TEST DATA** </font>


In [None]:
X_test_num_scaled = std_scaler.transform(X_test[feat_num])
X_test[feat_cat] = le.transform(X_test[feat_cat])         
X_test_cat_ohe = ohe.transform(X_test[feat_cat]).toarray()
X_test_processed = np.concatenate((X_test_num_scaled, X_test_cat_ohe), axis=1)

print("Processed Test Data: \n",X_test_processed[0:3],"\n")

y_pred = lr1.predict(X_test_num_scaled)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Model #1 :  Only 2 numeric features")
print('The RMSE value is {:.4f}'.format(rmse))

y_pred = lr2.predict(X_test_processed)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("\nModel #2 : 2 numeric and 1 categorical features")
print('The RMSE value is {:.4f}'.format(rmse))


# Do more with Pipeline

### What is Pipeline
- A Pipeline is a container for a sequence of steps, given as a <b><font color=darkred>list of tuples</font></b> of the form <code>(name, object)</code>
- What can be in the <b><font color=darkred>list of tuples</font></b>?
    - <code>Transformer</code>
    - <code>Estimator</code>
    - <code>FeatureUnion</code>
    - more <code>Pipeline</code> !


# Building Blocks of scikit-learn Pipeline 
<br>
<img src="./images/block_of_legos.png" >

- Imagine each Transformer and Estimator is a piece of Lego. <code>Pipeline</code> makes it possible to chain one block to another, <br><i><b>sending the output of <code>block#1</code> as the input of <code>block#2</code></b></i>
- <code>FeatureUnion</code> makes it possible to aggregate the output from multiple <code>Pipeline</code>s
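
A minimal sketch of the aggregation idea, on a made-up array: two small pipelines process the same input, and `FeatureUnion` concatenates their outputs column-wise.

```python
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

union = FeatureUnion([
    ('std', Pipeline([('scale', StandardScaler())])),
    ('minmax', Pipeline([('scale', MinMaxScaler())])),
])

# 2 standardized columns followed by 2 min-max scaled columns
combined = union.fit_transform(X_toy)
print(combined.shape)
```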



### <font color=darkred> Important Note </font>
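- In a <code>Pipeline</code>, every block except the last must implement <code>.fit</code> and <code>.transform</code>; the last block only needs <code>.fit</code> (plus <code>.predict</code> or <code>.transform</code>, depending on how the Pipeline is used)
- To create your own Custom Transformer, you need to inherit from <code>BaseEstimator, TransformerMixin</code>
    - <code>TransformerMixin</code> gives your transformer the <code>.fit_transform</code> method, for free.
    - <code>BaseEstimator</code> gives your transformer <code>.get_params</code> and <code>.set_params</code>, which make its parameters grid-searchable.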
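
To see what each base class contributes, here is a small sketch. `AddConstant` is a hypothetical transformer written only for this illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class AddConstant(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that adds a constant to every value."""

    def __init__(self, constant=1.0):
        self.constant = constant

    def fit(self, X, y=None):
        # nothing to learn
        return self

    def transform(self, X):
        return np.asarray(X) + self.constant

adder = AddConstant(constant=2.0)
params = adder.get_params()                                # inherited from BaseEstimator
shifted = adder.fit_transform(np.array([[1.0], [2.0]]))    # inherited from TransformerMixin
adder.set_params(constant=5.0)                             # what GridSearchCV relies on

print(params)
print(shifted.ravel())
```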

In [None]:
from sklearn.base import TransformerMixin, BaseEstimator
class MyTransformer(TransformerMixin, BaseEstimator):
    '''A template for a custom transformer.'''

    def __init__(self):
        # this is where you can initialize internal variables
        pass

    def fit(self, X, y=None):
        # this is what happens when .fit is invoked
        return self

    def transform(self, X):
        # this is what happens when .transform is invoked
        return X

### Example #2: Pipeline with <code> Transformer - StandardScaler</code>
<img src="./images/ex2.png" >

In [None]:
from sklearn.pipeline import Pipeline

num_pipe = Pipeline([('std_scaler',StandardScaler())])
num_pipe.fit(X_train[feat_num])

### <font color=darkred> One line transform !! </font>

In [None]:
X_test_num_scaled_new = num_pipe.transform(X_test[feat_num])

#These are the same numbers shown in Example #1
print("Processed Test Data - numeric feature:\n",X_test_num_scaled_new[0:3])

### Example #3: Pipeline with <code> Transformer - SimpleImputer and MinMaxScaler</code>
<img src="./images/ex3.png" >

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

num_pipe = Pipeline([('impute',SimpleImputer(strategy='median')),
                     ('minmax_scaler',MinMaxScaler())])
num_pipe.fit(X_train[feat_num])

### <font color=darkred> One line transform !! </font>

In [None]:
X_test_num_scaled_new = num_pipe.transform(X_test[feat_num])

In [None]:
print("Processed Test Data - numeric feature:\n",X_test_num_scaled_new[0:3])

### Example #4: Pipeline with <code> Transformer - StandardScaler , Estimator - LinearRegression</code>
<img src="./images/ex4.png" >

In [None]:
num_pipe_model = Pipeline([('std_scaler',StandardScaler()),
                           ('lr',LinearRegression())])
num_pipe_model.fit(X_train[feat_num],y_train)

print("Model #1 :  Only 2 numeric features")
print("Intercept:",num_pipe_model.named_steps['lr'].intercept_)
print("Coefficient:", num_pipe_model.named_steps['lr'].coef_)

### <font color=darkred> One line predict !! </font>
- When <code>.predict</code> of a Pipeline is invoked, the <code>Pipeline</code> invokes <code>.transform</code> on each step in order, from the top, and then <code>.predict</code> on the final estimator.
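
The chain described above can be checked by hand. This sketch uses random made-up data (not the house data) and compares the pipeline's prediction with a manual transform-then-predict.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_tr, X_te = rng.rand(20, 2), rng.rand(5, 2)
y_tr = X_tr @ np.array([2.0, -1.0]) + 0.5

pipe = Pipeline([('std_scaler', StandardScaler()), ('lr', LinearRegression())])
pipe.fit(X_tr, y_tr)

# One line: the pipeline scales, then predicts
pred_pipe = pipe.predict(X_te)

# The same thing done step by step with the fitted blocks
X_te_scaled = pipe.named_steps['std_scaler'].transform(X_te)
pred_manual = pipe.named_steps['lr'].predict(X_te_scaled)

print(np.allclose(pred_pipe, pred_manual))
```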

In [None]:
y_pred = num_pipe_model.predict(X_test[feat_num])

#This is the exact number shown in Example #1 - Model #1
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('The RMSE value is {:.4f}'.format(rmse))

### Example #5: (1/2) Pipeline with <code>FeatureUnion and CustomTransformer</code>
- Let's make it more dynamic: suppose we would like to **extract any features for any transformers**
- First, we define <code>MyFeatureExtractor</code>.

In [None]:
#Let's dictate what happens when an object is instantiated, fitted, and asked to transform
class MyFeatureExtractor(TransformerMixin, BaseEstimator):
    '''This custom transformer extracts and returns specific features.'''

    def __init__(self,feature):
        # init feature(s) to extract
        self.feature = feature

    def fit(self, X, y=None):
        # do nothing
        return self

    def transform(self, X):
        # return the sub-selected feature(s)
        return X[self.feature]

### Example #5: (2/2) Formulate feature engineering strategy, and construct FeatureUnion Pipeline

<img src="./images/ex5.png" >

- Let's suppose we would like to extract and:
    - Standardize <code>GrLivArea,LotArea,YearBuilt</code>
    - Min-max scale <code>OverallQual,OverallCond</code>

In [None]:
from sklearn.pipeline import FeatureUnion

COLS_1 = ['GrLivArea','LotArea','YearBuilt']
COLS_2 = ['OverallQual','OverallCond']

my_pipelines = FeatureUnion([('std',Pipeline([('cols1',MyFeatureExtractor(COLS_1)),
                                              ('std',StandardScaler() )])  ),
                             ('minmax',Pipeline([('cols2',MyFeatureExtractor(COLS_2)),
                                                 ('minmax',MinMaxScaler() )]))])

my_pipelines.fit(house_data)
output = my_pipelines.transform(house_data)
print(output[0:3])

### Example #6: Pipeline with <code>FeatureUnion + CustomTransformer + Estimator(Unsupervised)</code>

<img src="./images/ex6.png" >


In [None]:
from sklearn.decomposition import PCA
my_pipelines = Pipeline([('all_features',FeatureUnion([('std',Pipeline([('cols1',MyFeatureExtractor(COLS_1)),
                                                                        ('std',StandardScaler() )])  ),
                                                       ('minmax',Pipeline([('cols2',MyFeatureExtractor(COLS_2)),
                                                                           ('minmax',MinMaxScaler() )]))])),
                        ('pca',PCA(n_components=1))])
my_pipelines.fit(house_data)
output = my_pipelines.transform(house_data)
print(output[0:3])

### Example #7: Pipeline with <code>FeatureUnion + CustomTransformer + Estimator(Supervised)</code>

<img src="./images/ex7.png" >


In [None]:
my_pipelines = Pipeline([('all_features',FeatureUnion([('std',Pipeline([('cols1',MyFeatureExtractor(COLS_1)),
                                                                        ('std',StandardScaler() )])  ),
                                                       ('minmax',Pipeline([('cols2',MyFeatureExtractor(COLS_2)),
                                                                           ('minmax',MinMaxScaler() )]))])),
                        ('lr',LinearRegression())])

my_pipelines.fit(house_data,house_data['SalePrice'])

print("Intercept:",my_pipelines.named_steps['lr'].intercept_)
print("Coefficient:", my_pipelines.named_steps['lr'].coef_)

output = my_pipelines.predict(house_data)

In [None]:
price_compare = pd.concat([house_data['SalePrice'],pd.DataFrame(output)],axis=1)
price_compare.columns = ['SalePrice','Predicted']

print(price_compare.head())
rmse = np.sqrt(mean_squared_error(price_compare['SalePrice'], price_compare['Predicted']))
print('\nThe RMSE value is {:.4f}'.format(rmse))

### Example #8: Pipeline with <code>FeatureUnion + CustomTransformer + Estimator(Supervised) + GridSearch</code>
- Let's do GridSearch with Ridge Regression
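
Grid search reaches into a pipeline through `<step name>__<parameter>` keys. A minimal sketch on random made-up data, with a simpler pipeline than the one used in this example:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X_toy = rng.rand(40, 3)
y_toy = X_toy @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=40)

pipe = Pipeline([('std', StandardScaler()), ('ridge', Ridge())])

# The step name 'ridge' plus the parameter 'alpha' gives the key 'ridge__alpha'
has_key = 'ridge__alpha' in pipe.get_params()

grid = GridSearchCV(pipe, param_grid={'ridge__alpha': [0.01, 0.1, 1.0]}, cv=3)
grid.fit(X_toy, y_toy)
best_alpha = grid.best_params_['ridge__alpha']

print(has_key, best_alpha)
```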

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

my_pipelines = Pipeline([('all_features',FeatureUnion([('std',Pipeline([('cols1',MyFeatureExtractor(COLS_1)),
                                                                        ('std',StandardScaler() )])  ),
                                                       ('minmax',Pipeline([('cols2',MyFeatureExtractor(COLS_2)),
                                                                           ('minmax',MinMaxScaler() )]))])),
                        ('ridge',Ridge())])

params = {'ridge__alpha': [1,0.1,0.01,0.001,0.0001,0]}
grid = GridSearchCV(estimator=my_pipelines, param_grid=params)

grid.fit(house_data,house_data['SalePrice'].values)
print(grid.best_params_)

### Example #9: Finally, publish your model !!
- Note: Using the whole dataset (house_data) for convenience
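
Before saving the real model, a quick round-trip check can be sketched as follows. The data is random, and the temporary file name `toy_pipeline.joblib` is an assumption for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X_toy = rng.rand(30, 2)
y_toy = X_toy.sum(axis=1)

pipe = Pipeline([('std', StandardScaler()), ('ridge', Ridge(alpha=0.01))])
pipe.fit(X_toy, y_toy)

# Save to a temporary folder, load it back, and compare predictions
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pipeline.joblib')
    joblib.dump(pipe, path)
    restored = joblib.load(path)

same_predictions = np.allclose(pipe.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Keep in mind that a pipeline containing custom classes can only be loaded where those class definitions are importable, since the file stores a reference to the class rather than its code.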

In [None]:
#Build the final model

my_pipelines = Pipeline([('all_features',FeatureUnion([('std',Pipeline([('cols1',MyFeatureExtractor(COLS_1)),
                                                                        ('std',StandardScaler() )])  ),
                                                       ('minmax',Pipeline([('cols2',MyFeatureExtractor(COLS_2)),
                                                                           ('minmax',MinMaxScaler() )]))])),
                        ('ridge',Ridge(alpha=0.01))])
my_pipelines.fit(house_data,house_data['SalePrice'])

# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib
#Save file to disk
joblib.dump(my_pipelines, './output/final_pipelines.sav')

### Let's test it - with completely new data

In [None]:
house_test = pd.read_csv('./data/house_test.csv')
house_test.head()

In [None]:
loaded_pipelines = joblib.load('./output/final_pipelines.sav')
predicted_prices = loaded_pipelines.predict(house_test)
pd.DataFrame(predicted_prices).head()

# Check-in

- So far, the output of a Pipeline has been in <code>numpy</code> format. What if we would like to stay in <code>pandas</code> land?
- Why isn't there an example of <code>Pipeline[('le',LabelEncoder),('ohe',OneHotEncoder())]</code> here? See <a href="https://github.com/scikit-learn/scikit-learn/issues/3956">issues#3956</a>


# Solution to this - Custom Transformer, but let's not reinvent the wheel

**Julie Michelman** - <a href="https://www.youtube.com/watch?v=BFaadIqWlAg">PyCon 2016</a>

Available Custom Transformers
- DFImputer
- DFStandardScaler
- ZeroFillTransformer
- Log1pTransformer
- DateFormatter
- DummyTransformer
- MultiEncoder

<b><i><font color=darkred>and lots more...</font></i></b>
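
To give a feel for how such DataFrame-friendly transformers might work, here is a minimal sketch. `KeepFrameScaler` is a hypothetical class written for this illustration; it is not the implementation of `DFStandardScaler` from the referenced talk.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class KeepFrameScaler(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: standardize columns but return a DataFrame."""

    def fit(self, X, y=None):
        # delegate the actual learning to StandardScaler
        self.scaler_ = StandardScaler().fit(X)
        return self

    def transform(self, X):
        scaled = self.scaler_.transform(X)
        # put the numpy output back into a DataFrame with the original labels
        return pd.DataFrame(scaled, index=X.index, columns=X.columns)

toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]},
                   index=['r1', 'r2', 'r3'])
toy_scaled = KeepFrameScaler().fit_transform(toy)
print(toy_scaled)
```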



### Example #10:  Custom Transformers - DataFrame
- First, let's take a look at mockup data

In [None]:
mockup = pd.read_csv('./data/mockup_data.csv')
print(mockup.head())

#Let's check missing values
mockup.isnull().sum()

### From what I saw above, here is what I would like to do
- For categorical variables <code>Sex, Color</code>, create dummies
- For numeric variables <code>SABal, CCBal, INVBal</code>, fill missing values with zeros and then take log1p
- <font color=darkred><b>For business rules variable, do RIDITS transformation - Bross(1958)</b></font>
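
As a worked example of the arithmetic, here is the transformation applied to one made-up 0/1 rule column. It follows the same convention as the transformer defined in this notebook, where value 1 is ranked first: ridit(1) = 0.5 * p_1 and ridit(0) = p_1 + 0.5 * p_0.

```python
import pandas as pd

rule = pd.Series([1, 1, 0, 0, 0, 1, 0, 0])   # hypothetical 0/1 rule flags

counts = rule.value_counts()
p_1 = counts[1] / counts.sum()   # 3/8
p_0 = counts[0] / counts.sum()   # 5/8

ridit_1 = 0.5 * p_1              # half of the proportion of 1s
ridit_0 = p_1 + 0.5 * p_0        # all of the 1s plus half of the 0s

transformed = rule.map({1: ridit_1, 0: ridit_0})
print(ridit_1, ridit_0)          # prints 0.1875 0.6875
```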

In [None]:
class RiditsTransformer(TransformerMixin, BaseEstimator):
    """RIDITS transformation - Bross(1958)"""

    def __init__(self):
        # initialize dictionaries of fitted ridits, keyed by rule name
        self.dict_ridits_1 = {}
        self.dict_ridits_0 = {}

    def fit(self, X, y=None):
        # for each 0/1 rule, compute the ridit of each value
        Xr = pd.DataFrame(X)
        for rule in Xr.columns:
            try:
                r = Xr[rule].value_counts()
                p_1 = r[1] / np.sum(r)
                p_0 = r[0] / np.sum(r)
                ridit_1 = 0.5 * p_1
                ridit_0 = p_1 + 0.5 * p_0
                self.dict_ridits_1[str(rule)] = ridit_1
                self.dict_ridits_0[str(rule)] = ridit_0
            except Exception:
                # skip rules that do not contain both 0 and 1
                pass

        return self

    def transform(self, X):
        # transform a copy of X with the fitted dictionaries
        Xr = pd.DataFrame(X).copy()
        for rule in Xr.columns:
            # replace 0,1 with ridits; values other than 0/1 become NaN
            if str(rule) in self.dict_ridits_1:
                ridit_1 = self.dict_ridits_1[str(rule)]
                ridit_0 = self.dict_ridits_0[str(rule)]
                Xr[rule] = Xr[rule].map({1: ridit_1, 0: ridit_0})

        return Xr
    

### Define a special transformer - RiditsTransformer
- Please visit <a href="https://analytics.knowledgehub.ageas.com/posts/2005000-sas-python-by-example-anomaly-detection-with-pridit">Knowledge Hub</a> for detailed explanation 


Or have a quick look at <a href="https://nbviewer.jupyter.org/github/swatakit/Python-Tools/blob/master/Anomaly%20Detection%20with%20PRIDIT%20-%20Python.ipynb">RIDITS- Python</a>



In [None]:
from custom_transformers import (ColumnExtractor, DFStandardScaler, DFFeatureUnion, 
                                 DFImputer,DummyTransformer,ZeroFillTransformer,Log1pTransformer)

feat_rules= ['R1','R2','R3','R4']
feat_cats = ['Sex','Color']
feat_nums = ['SABal','CCBal','INVBal']

pipeline = Pipeline([
    ('features', DFFeatureUnion([
        ('categoricals', Pipeline([
            ('extract', ColumnExtractor(feat_cats)),
            ('dummy', DummyTransformer())
        ])),
        ('numerics', Pipeline([
            ('extract', ColumnExtractor(feat_nums)),
            ('zero_fill', ZeroFillTransformer()),
            ('log', Log1pTransformer())
        ])),
        ('ridits', Pipeline([
            ('extract', ColumnExtractor(feat_rules)),
            ('ridits', RiditsTransformer())
        ]))
    ]))
])

pipeline.fit(mockup)
mockup_trans = pipeline.transform(mockup)
mockup_trans.head()

### Now, let's suppose we have acquired new data
- <b><font color=#FF00FF>Take note of the last observation</font></b>

In [None]:
mockup_new = pd.read_csv('./data/mockup_new.csv')
mockup_new

In [None]:
mockup_new_trans = pipeline.transform(mockup_new)
mockup_new_trans

### Example #11:  Custom Transformers - DataFrame + Logistic Regression(<code>.predict</code>)
- Let's add logistic regression to the pipeline
- Again, using the whole sample (mockup) just for convenience

In [None]:
from sklearn.linear_model import LogisticRegression

pipeline_logreg = Pipeline([
    ('features', DFFeatureUnion([
        ('categoricals', Pipeline([
            ('extract', ColumnExtractor(feat_cats)),
            ('dummy', DummyTransformer())
        ])),
        ('numerics', Pipeline([
            ('extract', ColumnExtractor(feat_nums)),
            ('zero_fill', ZeroFillTransformer()),
            ('log', Log1pTransformer())
        ])),
        ('ridits', Pipeline([
            ('extract', ColumnExtractor(feat_rules)),
            ('ridits', RiditsTransformer())
        ]))
    ])),
    ('logreg',LogisticRegression(solver='liblinear'))
])

pipeline_logreg.fit(mockup,mockup['Target'])
y_pred = pipeline_logreg.predict(mockup)
y_pred

### Example #12:  Custom Transformers - DataFrame + Logistic Regression(<code>.predict_proba</code>)
- Suppose we would like to have probabilities instead of 0/1 labels
- One way is to improvise with a model wrapper whose <code>.transform</code> returns <code>predict_proba</code>
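
For comparison: when the classifier is the last step, `Pipeline` also exposes `.predict_proba` directly. A minimal sketch on random made-up data follows. The wrapper approach stays useful when the probabilities have to come out of `.transform`, for example to feed them into further steps.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_toy = rng.rand(50, 2)
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1.0).astype(int)

pipe = Pipeline([('std', StandardScaler()),
                 ('logreg', LogisticRegression())])
pipe.fit(X_toy, y_toy)

# One column per class; each row sums to 1
proba = pipe.predict_proba(X_toy)
print(proba[:3])
```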

In [None]:
class ModelTransformer(TransformerMixin, BaseEstimator):
    '''model.predict_proba'''

    def __init__(self,model):
        self.model = model

    def fit(self,*args, **kwargs):
        #expecting X, and y
        self.model.fit(*args, **kwargs)
        return self

    def transform(self,X,**transform_params):
        return self.model.predict_proba(X)

In [None]:
pipeline_logreg_proba = Pipeline([
    ('features', DFFeatureUnion([
        ('categoricals', Pipeline([
            ('extract', ColumnExtractor(feat_cats)),
            ('dummy', DummyTransformer())
        ])),
        ('numerics', Pipeline([
            ('extract', ColumnExtractor(feat_nums)),
            ('zero_fill', ZeroFillTransformer()),
            ('log', Log1pTransformer())
        ])),
        ('ridits', Pipeline([
            ('extract', ColumnExtractor(feat_rules)),
            ('ridits', RiditsTransformer())
        ]))
    ])),
    ('logreg',ModelTransformer(LogisticRegression(solver='liblinear')))
])

pipeline_logreg_proba.fit(mockup,mockup['Target'])
y_pred_proba = pipeline_logreg_proba.transform(mockup)
y_pred_proba[0:5]

# <font color=darkred>Bonus: Complete 2-Class classification reports</font>

In [None]:
from custom_functions import generate_miss_report

titanic = pd.read_csv('./data/titanic_train.csv')
titanic_missing = generate_miss_report(titanic)


In [None]:
titanic.head()

In [None]:
titanic.isnull().sum()

In [None]:
feat_cats = ['Sex','Embarked']
feat_nums = ['Age']
feat_donth = ['Pclass','SibSp']

pipeline_titanic = Pipeline([
    ('features', DFFeatureUnion([
        ('numerics', Pipeline([
            ('extract', ColumnExtractor(feat_nums)),
            ('mid_fill', DFImputer(strategy='median')),
            ('scale', DFStandardScaler())
        ])),
        ('categoricals', Pipeline([
            ('extract', ColumnExtractor(feat_cats)),
            ('dummy', DummyTransformer())
        ])),
        ('raw', Pipeline([
            ('extract', ColumnExtractor(feat_donth)),
        ]))
    ]))
])



In [None]:

y = titanic['Survived']
X = titanic.drop(columns=['Survived'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234, stratify=y)

pipeline_titanic.fit(X_train)
X_train_trans = pipeline_titanic.transform(X_train)

X_train_trans.head()


In [None]:
model = LogisticRegression(solver='liblinear', class_weight='balanced')
model.fit(X_train_trans,y_train)

In [None]:
from custom_functions import print_classification_performance2class_report

acc,pc,rc,fs,ap,roc_auc,gini,rmse = print_classification_performance2class_report(model,pipeline_titanic.transform(X_test),y_test)

In [None]:
from custom_functions import print_gainlift_charts
X_test_trans = pipeline_titanic.transform(X_test)
y_pred_proba = model.predict_proba(X_test_trans)[:,1]
df_tmp = pd.DataFrame([y_pred_proba,y_test]).transpose()
df_tmp.columns = ['logreg_proba','Survived']

outtab = print_gainlift_charts(data=df_tmp,
                               var_to_rank='logreg_proba',
                               var_to_count_nonzero='Survived')
outtab

The metrics can be put all together, for model comparison - <a href="https://nbviewer.jupyter.org/github/swatakit/The-public-and-private-life-of-Big-Data/blob/master/StockSentimentIndex.ipynb"> Example</a>

<img src="./images/ml_workflow.png" >

# Conclusion, why Pipeline?

- Short, neat, clean code
- Data Cleansing/Feature Engineering heaven!
- Reusable, GridSearch-able, Serializable
- Easy to create your own custom transformers


<img src="./images/thankyou.png" >