# Motivation
[This video](https://www.youtube.com/watch?v=6RSQIHAVzuo&t=756s) inspired me to learn about Pipelining: Andreas Müller, a **core developer of the sklearn library**, states that **'Everybody should be using Pipelines. If you are not using Pipelines, you are probably doing it wrong'** when using sklearn.

Pipelining is an efficient, clean, and highly modifiable way of building Machine Learning Models. You can simply stick together multiple steps and **evaluate each step and its parameters** during cross-validation. Not only can you do Preprocessing and Feature Engineering in these pipelines, but you can even switch between different model types. 

Pipelining forces us to treat each Data Cleaning and Feature Engineering step as a submodel of our final machine learning ensemble. According to [T. Scott Clendaniel](https://www.linkedin.com/in/tscottclendaniel/), understanding machine learning models this way is a key mindset of Data Preprocessing, as mentioned in [this video](https://www.youtube.com/watch?v=vsKNxbP8R_8&t=1630s) about advanced Feature Engineering. It allows us to evaluate each Feature Engineering, Feature Selection and Model Selection step individually.


Take a look at my [Comprehensive Tutorial: Feature Engineering](https://www.kaggle.com/milankalkenings/comprehensive-tutorial-feature-engineering), if you want to practice your pipelining skills.

***
# Sections

## [1. What is a Pipeline?](#sec0)

## [2. Create Pipeline Helpers](#sec1)
* [Slicer](#sec11)
* [Heartbeat Identity Transformer](#sec12)
* [Transformer Wrapper](#sec13)
* [Estimator Wrapper](#sec14)


## [3. Create the Pipeline](#sec2)
* [Group Columns by type](#sec21)
* [Create Pipelines for Preprocessing](#sec22)
* [Combine all Preprocessing Pipelines](#sec23)
* [Add Predictors to a Pipeline](#sec24)


## [4. Pipeline validation with multiple Estimators](#sec3)
## [5. Results](#sec4)
***

<a id="sec0"></a>
***
# 1. What is a Pipeline?




**A Pipeline consists of two different sklearn objects:**
### Transformers
Transformers perform any transformation on any given data. 
The correct usage of the methods **fit()** and **transform()** ensures that the test data is transformed using the parameters that were calculated on the training data. All transformers in our Pipeline have to implement both **fit** and **transform**.
* **fit()** calculates the parameters for the transformation and stores them. We only want to perform this on training data.
* **transform()** transforms any given data using the parameters calculated in **fit()**.
* **fit_transform()** calls **fit()** and then **transform()** on the same data, so it should only be used on **training data**; otherwise, **fit()** would learn the transformer's parameters from the test data.
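
The following minimal sketch illustrates the fit/transform split described above. The toy arrays `X_train_toy` and `X_test_toy` are made up for illustration and are not part of the heart disease dataset; `MinMaxScaler` is just one possible transformer.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical): a single feature, already split into train and test
X_train_toy = np.array([[0.0], [5.0], [10.0]])
X_test_toy = np.array([[2.5], [20.0]])

scaler = MinMaxScaler()
scaler.fit(X_train_toy)                          # learns min and max from train only
X_train_scaled = scaler.transform(X_train_toy)   # scaled with the train min/max
X_test_scaled = scaler.transform(X_test_toy)     # reuses the train min/max

print(scaler.data_min_, scaler.data_max_)
print(X_test_scaled.ravel())
```

Because the minimum and maximum come from the training data only, a test value outside the training range (20.0 here) is mapped outside of [0, 1] instead of silently changing the scaling.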


### Final Estimator
The last element of your Pipeline can be an **Estimator**, which only has to implement the **fit** method. For example, we can use a random forest classifier: it does not transform the data, it only fits its parameters on the data it is given. The whole Pipeline then acts like the chosen estimator and exposes the respective estimator methods like **predict()**, **predict_proba()**, and so on.
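
As a small sketch of this behaviour, the Pipeline below ends in a classifier and can therefore be used like one. The random toy data and the step names `scaler` and `clf` are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical): 40 samples, 3 features, label depends on the first feature
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(40, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

pipe_toy = Pipeline([
    ('scaler', MinMaxScaler()),      # transformer: implements fit and transform
    ('clf', LogisticRegression()),   # final estimator: implements fit and predict
])

pipe_toy.fit(X_toy, y_toy)                   # fits the scaler, then the classifier
labels = pipe_toy.predict(X_toy[:5])         # delegated to the final estimator
probas = pipe_toy.predict_proba(X_toy[:5])   # one column per class
print(labels.shape, probas.shape)
```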




Using Pipelines, we can easily switch between different Preprocessing and Feature Engineering methods and Machine Learning Models during **Gridsearch cross-validation**. The following sketch assumes that we treat all numerical variables and all categorical variables equivalently. As we can see, the sketched Pipeline allows us to switch between two Encoding approaches for categorical data and it even allows us to switch between two different types of models.

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
img = mpimg.imread('../input/images/pipes.PNG')

fig = plt.figure(frameon=False)
fig.set_size_inches(11,6)
ax = plt.Axes(fig, [0., 0., 1., 1.])
ax.set_axis_off()
fig.add_axes(ax)
ax.imshow(img, aspect='auto')
fig.savefig('pipes_out.png')


### First of all, we have to check for any **missing values** and we have to perform a **train/test split**...

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid", {'axes.grid' : False})

df = pd.read_csv('../input/heart-disease-uci/heart.csv')
df.head()

Are there any missing values in the dataset?

In [None]:
df.isna().any()

Split the data into a training set and a test set:

In [None]:
from sklearn.model_selection import train_test_split
X = df.drop(columns='target')
y = df['target']

# 80/20 split (based on the pareto principle)
# due to the small size of the dataset, the random state has a huge impact on the results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
X_train.head()

<a id="sec1"></a>
***
# 2. Create Pipeline Helpers

Custom Transformers inherit from TransformerMixin, which provides fit_transform once the methods fit and transform are implemented. Likewise, custom Estimators inherit from BaseEstimator, which provides get_params and set_params, the methods that GridSearchCV uses to set parameters.
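
To make the role of the two base classes more concrete, here is a minimal sketch with a hypothetical toy transformer `AddConstant` (it is not used anywhere else in this notebook). It only shows which methods come for free from which base class.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class AddConstant(TransformerMixin, BaseEstimator):
    """Hypothetical toy transformer: adds a constant to every value."""
    def __init__(self, constant=1.0):
        self.constant = constant

    def fit(self, X, y=None):
        return self  # nothing to learn for this toy example

    def transform(self, X):
        return np.asarray(X) + self.constant

adder = AddConstant(constant=2.0)
params_before = adder.get_params()   # get_params comes from BaseEstimator
adder.set_params(constant=3.0)       # set_params comes from BaseEstimator
X_out = adder.fit_transform(np.array([[1.0], [2.0]]))  # from TransformerMixin
print(params_before, X_out.ravel())
```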

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

<a id="sec11"></a>
## 2.1. Slicer
The first, and most important Helper-Transformer for Pipelines is the Slicer. We can use a Slicer to take a **subset of features** from our dataset and perform further transformations on it. (E.g., to One-Hot Encode Categorical Features with low cardinality).

Note: As an alternative, you can use sklearn's ColumnTransformer.
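
For comparison, this is roughly what the ColumnTransformer alternative could look like. It is only a sketch on a made-up toy frame `df_toy`; the column names mimic the heart disease data, but the values are invented.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame (hypothetical values)
df_toy = pd.DataFrame({
    'age': [30, 40, 50],
    'cp': [0, 1, 0],
    'sex': [1, 0, 1],
})

ct = ColumnTransformer(
    transformers=[
        ('num', MinMaxScaler(), ['age']),                        # scale numerical column
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['cp']), # encode categorical column
    ],
    remainder='passthrough',  # keep all other columns (here: 'sex') unchanged
)

out = ct.fit_transform(df_toy)
print(out.shape)  # 1 scaled + 2 one-hot + 1 passthrough column
```

Each entry selects its columns by name, so no separate slicing step is needed.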

In [None]:
class Slicer(TransformerMixin, BaseEstimator):
    """
    Transformer to slice certain columns of a df.
    """
    def __init__(self, col_names):
        self.col_names = col_names
    
    def fit(self, X, y = None):
        return self
    
    def transform(self, X):
        return X[self.col_names]

<a id="sec12"></a>
## 2.2. Heartbeat Identity Transformer
It might be beneficial to have a simple tool for debugging inside a Pipeline. You can plug this Transformer into a Pipeline and activate it during cross validation if needed.
It is especially useful for more complex Pipelines, and you can use it e.g. to print the intermediate data state.

In [None]:
class Heartbeat(TransformerMixin, BaseEstimator):
    """
    Identity Transformer for Debugging. 
    Add any parameters or functionalities..
    """
    def __init__(self, heartbeat_message='no errors yet', active=False):
        self.heartbeat_message = heartbeat_message
        self.active = active
        
    def fit(self, X, y = None):
        return self
    
    def transform(self, X):
        if (self.active):
            print(self.heartbeat_message)
        return X

<a id="sec13"></a>
## 2.3. Transformer Wrapper
This very simple wrapper allows us to decide whether we want a Transformer to be active or not.

In [None]:
class ActivityWrapper(TransformerMixin, BaseEstimator):
    """
    A Custom Wrapper that can activate/deactivate a Transformer.
    """
    def __init__(self, transformer, active=False):
        self.transformer = transformer
        self.active = active

    def fit(self, X, y=None, **kwargs):
        # only fit the wrapped transformer if the wrapper is active
        if self.active:
            self.transformer.fit(X, y)
        return self

    def transform(self, X):
        # an inactive wrapper behaves like an identity transformer
        if self.active:
            X = self.transformer.transform(X)
        return X

    # get_params() and set_params() are inherited from BaseEstimator,
    # so GridSearchCV can set `transformer` and `active` independently.

<a id="sec14"></a>
## 2.4. Estimator Wrapper
I found this [wrapper for estimators](https://stackoverflow.com/questions/50285973/pipeline-multiple-classifiers?noredirect=1&lq=1), which can be used to switch between different estimators during cross validation.

In [None]:
class ClassifierWrapper(BaseEstimator):
    """
    A Custom Wrapper that can switch between classifiers.
    """ 
    def __init__(self, estimator=None):
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def score(self, X, y):
        return self.estimator.score(X, y)

<a id="sec2"></a>
***
# 3. Create the Pipeline

<a id="sec21"></a>
## 3.1. Group Columns by type
Each List contains the names of features, which we want to **treat equally**.

In [None]:
cols_num = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
cols_cat_bin = ['sex', 'fbs', 'exang']
cols_cat = ['cp', 'restecg', 'ca', 'thal', 'slope']

<a id="sec22"></a>
## 3.2. Create Pipelines for Preprocessing
Stack the Transformers. 

Transformer $n$ performs its transformation after Transformer $n-1$. Each Transformer is represented by a tuple. The String in the tuple is the Identifier of the Transformer and we can use it to set parameters during Gridsearch cross-validation, as you will see later on.

I will solely do some basic **preprocessing** in this example.

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

pipe_num = Pipeline([
    ('selector', Slicer(cols_num)),
    ('scaler', MinMaxScaler())
])

# binary columns need no further preprocessing, so they are only selected
pipe_cat_bin = Pipeline([
    ('selector', Slicer(cols_cat_bin))
])

pipe_cat = Pipeline([
    ('selector', Slicer(cols_cat)),
    # note: scikit-learn >= 1.2 names this argument `sparse_output`; `sparse` was removed in 1.4
    ('encoder_wrapped', ActivityWrapper(transformer=OneHotEncoder(handle_unknown='ignore', sparse=False)))
])

<a id="sec23"></a>
## 3.3. Combine all Preprocessing Pipelines
FeatureUnion combines several transformers and concatenates the outputs of each pipe. 
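
A tiny sketch of this concatenation, using a made-up array `X_toy` and two off-the-shelf scalers as stand-ins for the preprocessing pipes:

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (hypothetical): 3 samples, 2 features
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

union_toy = FeatureUnion([
    ('minmax', MinMaxScaler()),      # produces 2 columns
    ('standard', StandardScaler()),  # produces 2 more columns
])

# every transformer sees the full input; the outputs are stacked side by side
X_union = union_toy.fit_transform(X_toy)
print(X_union.shape)  # (3, 4)
```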

In [None]:
from sklearn.pipeline import FeatureUnion

features = FeatureUnion([
    ('num', pipe_num),
    ('obj_bin', pipe_cat_bin),
    ('cat', pipe_cat)
])

<a id="sec24"></a>
## 3.4. Add Predictors to a Pipeline
Create a new pipe using the combined Preprocessing Transformers and an Estimator (or an Estimator Wrapper). Furthermore, I added an **inactive** Heartbeat-Transformer, just to show you that you can plug these transformers in whenever you want. 

In [None]:
pipe_full = Pipeline([
    ('features', features),
    ('heartbeat', Heartbeat()),
    ('switchable', ClassifierWrapper())
])

If you want to, you can already use such a pipeline for **predictions**, e.g. to compare it with the results of the cross validated pipe.

In [None]:
from sklearn.linear_model import LogisticRegression

pipe_logreg = Pipeline([
    ('features', features),
    ('logreg', LogisticRegression())
])

pipe_logreg.fit(X_train, y_train)
y_pred = pipe_logreg.predict(X_test)
(y_pred == y_test).sum() / len(y_pred)

<a id="sec3"></a>
***
# 4. Pipeline validation with multiple Estimators
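We can set the parameters of each individual element of our Pipeline using its identifier, followed by two underscores and the parameter name. For nested elements, the identifiers are chained with two underscores each, e.g. `features__cat__encoder_wrapped__active`. To reach a parameter of the classifier inside our ClassifierWrapper, we append two more underscores and the parameter name to `switchable__estimator`, e.g. `switchable__estimator__n_estimators`.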
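
The naming scheme can be tried out without a grid search. The sketch below uses a small stand-in Pipeline with the assumed step names `scaler`, `clf` and `inner`; these names are not the ones used in this notebook.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipe_toy = Pipeline([
    ('scaler', MinMaxScaler()),
    ('clf', LogisticRegression()),
])

# pattern: <step identifier>__<parameter name>
pipe_toy.set_params(clf__C=0.5, scaler__feature_range=(-1, 1))
c_after_first_call = pipe_toy.get_params()['clf__C']
print(c_after_first_call, pipe_toy.named_steps['scaler'].feature_range)

# nesting adds one more identifier and two more underscores per level
nested = Pipeline([('inner', pipe_toy)])
nested.set_params(inner__clf__C=2.0)
print(pipe_toy.named_steps['clf'].C)
```

Calling `get_params()` on a Pipeline lists every key that can be used this way, which helps when a grid search complains about an invalid parameter name.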

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
import xgboost

# one dict for each estimator
hyperparameters = [
    {
        'features__cat__encoder_wrapped__transformer': [OneHotEncoder(handle_unknown='ignore')],
        'features__cat__encoder_wrapped__active': [True, False],
        # note: scikit-learn >= 1.2 names this argument `estimator` instead of `base_estimator`
        'switchable__estimator': [AdaBoostClassifier(base_estimator=LogisticRegression(), algorithm='SAMME.R')],
        'switchable__estimator__n_estimators': [60, 70],
        'switchable__estimator__learning_rate': [0.6, 0.7]
    },
    {
        'features__cat__encoder_wrapped__transformer': [OneHotEncoder(handle_unknown='ignore', sparse=False)],
        'features__cat__encoder_wrapped__active': [True, False],
        'switchable__estimator': [RandomForestClassifier(random_state = 5)],
        'switchable__estimator__n_estimators': [750],
        'switchable__estimator__max_features': ['log2'],
        'switchable__estimator__criterion': ['gini', 'entropy']
    }
]


val = GridSearchCV(pipe_full, hyperparameters, cv=5, scoring='f1')
# with refit=True (the default), GridSearchCV refits the best-scoring pipe on the whole training set
val.fit(X_train, y_train);

<a id="sec4"></a>
***
# 5. Results

Let's find out which parameters are the best.
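
Besides the single best parameter set, a fitted GridSearchCV also stores the scores of all candidates in `cv_results_`. The following sketch shows one way to look at them; it uses a made-up toy problem and a plain LogisticRegression instead of the Pipeline from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy data (hypothetical): noisy binary target driven by the first feature
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(60, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)

search_toy = GridSearchCV(LogisticRegression(), {'C': [0.1, 1.0, 10.0]},
                          cv=3, scoring='f1')
search_toy.fit(X_toy, y_toy)

# one row per parameter combination, sorted from best to worst
results = pd.DataFrame(search_toy.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))
```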

In [None]:
val.best_params_

How many data points are classified correctly? (accuracy)

In [None]:
y_pred = val.predict(X_test)
(y_pred == y_test).sum() / len(y_pred)

When inspecting the false predictions (to determine which circumstances could have led to a false prediction), I noticed that the patient's sex seems to be related to misclassification:

In [None]:
X_false_pred = X_test[y_pred != y_test]
with plt.style.context('dark_background'):
    fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, sharex=True, figsize=(8, 10))
    X_test['sex'].value_counts().plot.pie(ax=ax1, 
                                          title='Male patients in Testset', 
                                          colormap='viridis', 
                                          explode=[0.1,0], 
                                          wedgeprops={"edgecolor":"black"})
    
    X_false_pred['sex'].value_counts().plot.pie(ax=ax2, 
                                                title='Male patients in False Predictions', 
                                                colormap='viridis',
                                                explode = [0.1,0], 
                                                wedgeprops={"edgecolor":"black"})
    
plt.savefig('errors_male.png')
plt.show()

It seems like male patients are more likely to get false predictions, since they are slightly overrepresented in the set of incorrect predictions. 

I wouldn't over-interpret this result, since the number of false predictions (and the dataset in general) is pretty small.

# That's it. Thank you for reading this Notebook. I hope it helped you in some way.

Take a look at my [Comprehensive Tutorial: Feature Engineering](https://www.kaggle.com/milankalkenings/comprehensive-tutorial-feature-engineering), if you want to practice your pipelining skills with more elaborate Feature Engineering techniques.


<br>
<br>
<br>

## TODO:
* find a better ensemble
* add some EDA

This Notebook is based on [This Notebook by dbaghern](https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines) but introduces new concepts and fixes. 

Feel free to **comment any suggestions for improvements**, because that's the best way of learning from each other.