# Nice Pipeline

Here we present a nice example of a longer pipeline that we can use for training purposes. At first glance, it looks messy and hard to read.  
But if you take a moment to understand it, you will surely notice its beauty!

In [0]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.compose import ColumnTransformer

We just need to import the transformers that are used inside the pipeline.  
**This is not operational code, just an example of a longer pipeline.**

In [0]:
#Preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

#Dimensionality reduction
from sklearn.decomposition import NMF

#Imputation
from sklearn.impute import SimpleImputer

#Modeling
from sklearn.ensemble import RandomForestClassifier

#Other
import numpy as np

## Step 1: Take a quick glance
Please take a quick look at the pipeline below and then come back here.

## Step 2: Slow walkthrough
Get a **high-level view** like this:
- look at the top: there is a *FeatureUnion*, which is a wrapper for the entire feature engineering
- look at the bottom: there is a *RandomForestClassifier*, which is our predictive model

Now we can go deeper inside our FeatureUnion, which is our **feature engineering**:
- it splits into three parts, depending on which features we are processing
    - on top, we have numerical features
    - in the middle, we have categorical features
    - on the bottom, we have textual features
- now zoom out again and realize that all of this is wrapped in a FeatureUnion, which means that the three parts are transformed in parallel and their outputs are placed next to each other as columns
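To make the "placed next to each other" behaviour concrete, here is a minimal sketch. The array `X_demo` and the two step names are made up for illustration and are not part of the pipeline in this notebook.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical toy input: 3 rows, 2 numeric columns.
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Both transformers receive the same input; their outputs are
# concatenated column-wise.
union = FeatureUnion([
    ("scaled", StandardScaler()),
    ("squared", FunctionTransformer(np.square)),
])

combined = union.fit_transform(X_demo)
print(combined.shape)  # 2 scaled columns followed by 2 squared columns
```

Each branch sees the full input, so the number of output columns is the sum of what the branches produce.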

Only now let's **zoom into one part of our feature engineering**, for example into "numerical features" at the top:
- inside it, we immediately need a ColumnTransformer, because we want to specify, by name or by type, which columns a given transformation is applied to
- we could already apply transformers directly here, but remember two things: ColumnTransformer applies its transformers to the selected columns in parallel, not one after another, and by default it drops all columns that no transformer was assigned to. To apply several transformations to the same columns sequentially, we therefore wrap them in a Pipeline
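The default dropping of untransformed columns can be checked with a small sketch. The DataFrame and the column `feature_2` are hypothetical; `remainder="passthrough"` is the ColumnTransformer option that keeps the untouched columns instead.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame with one transformed and one untouched column.
df_demo = pd.DataFrame({
    "feature_1": [1.0, 2.0, 3.0],
    "feature_2": [10.0, 20.0, 30.0],
})

# Default behaviour (remainder="drop"): only the transformed column survives.
drop_rest = ColumnTransformer([("scale", StandardScaler(), ["feature_1"])])

# Explicitly keep the columns that no transformer was assigned to.
keep_rest = ColumnTransformer(
    [("scale", StandardScaler(), ["feature_1"])],
    remainder="passthrough",
)

dropped = drop_rest.fit_transform(df_demo)
kept = keep_rest.fit_transform(df_demo)
print(dropped.shape, kept.shape)
```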

Finally, **get used to the indentation** (the whitespace). Your code editor helps with this: place the cursor right after the last visible character on a line, for example after the last bracket on the *SimpleImputer* line. If you now hit Enter, the cursor lands where the code should continue on the next line if you still want to stay within the enclosing element, which here is the *Pipeline*.

Source1: https://www.codementor.io/@bruce3557/beautiful-machine-learning-pipeline-with-scikit-learn-uiqapbxuj
Source2: http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html

In [0]:
model_pipeline = Pipeline(steps=[
    ("features", FeatureUnion([
        ("numerical_features",
         ColumnTransformer([
             ("numerical",
              Pipeline(steps=[(
                  "impute_stage",
                  SimpleImputer(missing_values=np.nan, strategy="median")
              )]),
              ["feature_1"]
             )
         ])
        ), 
        ("categorical_features",
            ColumnTransformer([
                ("country_encoding",
                 Pipeline(steps=[
                     ("ohe", OneHotEncoder(handle_unknown="ignore")),
                     ("reduction", NMF(n_components=8)),
                 ]),
                 ["country"],
                ),
            ])
        ), 
        ("text_features",
         ColumnTransformer([
             ("title_vec",
              Pipeline(steps=[
                  ("tfidf", TfidfVectorizer()),
                  ("reduction", NMF(n_components=50)),
              ]),
              "title"
             )
         ])
        )
    ])
    ),
    ("classifiers", RandomForestClassifier())
])

With the pipeline defined, fitting and predicting would each take a single call:

In [0]:
#model_pipeline.fit(train_data, train_labels.values)
#predictions = model_pipeline.predict(predict_data)
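As a rough smoke test of the same structure, the sketch below fits a reduced variant on toy data. Everything about the data (`toy_data`, `toy_labels`, the values in it) is invented for illustration, and the NMF steps are left out to keep the example small, so this is not the pipeline above, only something with the same shape.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with the three column names used above.
toy_data = pd.DataFrame({
    "feature_1": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    "country": ["CZ", "DE", "CZ", "FR", "DE", "FR"],
    "title": ["red apple", "green pear", "red cherry",
              "green apple", "red pear", "green cherry"],
})
toy_labels = np.array([0, 1, 0, 1, 0, 1])

small_pipeline = Pipeline(steps=[
    ("features", FeatureUnion([
        ("numerical_features", ColumnTransformer([
            ("numerical", SimpleImputer(strategy="median"), ["feature_1"]),
        ])),
        ("categorical_features", ColumnTransformer([
            ("country_encoding", OneHotEncoder(handle_unknown="ignore"), ["country"]),
        ])),
        ("text_features", ColumnTransformer([
            # A plain string (not a list) passes a 1D column, as TfidfVectorizer expects.
            ("title_vec", TfidfVectorizer(), "title"),
        ])),
    ])),
    ("classifiers", RandomForestClassifier(n_estimators=10, random_state=0)),
])

small_pipeline.fit(toy_data, toy_labels)
predictions = small_pipeline.predict(toy_data)
print(predictions.shape)
```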

## Step 3: How to write that?
Alright, by now we feel comfortable reading such pipelines, but how do we get to write one? The answer is: **from the outside inwards**. Let's walk through an example; of course, you could write things differently.  

First, lay out a simple structure that separates your feature engineering (inside the FeatureUnion) from your predictive model.

In [0]:
# Skeleton only: the comment marks where the feature engineering will go.
model_pipeline = Pipeline(steps=[
    ("features", FeatureUnion([
        # all feature engineering goes here
    ])),
    ("classifiers", RandomForestClassifier())
])


Second, depending on your features, split the feature engineering into several parts.

In [0]:
# Skeleton only: "passthrough" is just a stand-in for the real transformations.
model_pipeline = Pipeline(steps=[
    ("features", FeatureUnion([
        ("numerical_features", "passthrough"),    # numerical transformations
        ("categorical_features", "passthrough"),  # categorical transformations
        ("text_features", "passthrough"),         # textual transformations
    ])),
    ("classifiers", RandomForestClassifier())
])


Now put a ColumnTransformer inside each part, because the transformations will be applied only to specific columns.

In [0]:
# Skeleton only: each ColumnTransformer is still empty.
model_pipeline = Pipeline(steps=[
    ("features", FeatureUnion([
        ("numerical_features", ColumnTransformer([
            # numerical transformations
        ])),
        ("categorical_features", ColumnTransformer([
            # categorical transformations
        ])),
        ("text_features", ColumnTransformer([
            # textual transformations
        ])),
    ])),
    ("classifiers", RandomForestClassifier())
])


You can put a Pipeline inside each ColumnTransformer, for example when you have transformers that need to run sequentially (such as numeric scaling followed by feature selection).  
Then you start to fill in the individual transformations that you wrote earlier.
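As a sketch of this last step, the numerical branch could be filled in as follows. The `scale_stage` step and the toy DataFrame are assumptions added for illustration; the pipeline earlier in this notebook only imputes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A Pipeline inside the ColumnTransformer runs its steps one after another
# on the selected column: first impute, then scale.
numerical_branch = ColumnTransformer([
    ("numerical",
     Pipeline(steps=[
         ("impute_stage", SimpleImputer(missing_values=np.nan, strategy="median")),
         ("scale_stage", StandardScaler()),  # hypothetical second step
     ]),
     ["feature_1"]),
])

# Hypothetical toy data: the missing value is replaced by the median (2.0)
# before scaling, so it ends up at 0 after standardization.
toy = pd.DataFrame({"feature_1": [1.0, np.nan, 3.0], "other": ["a", "b", "c"]})
out = numerical_branch.fit_transform(toy)
print(out.ravel())
```

Once a branch like this works on its own, it can be dropped into the corresponding slot of the FeatureUnion skeleton.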

## Step 4: Reflect
Continue with this point only once you have gone through the pipeline above.  

We usually assume that nicely written code costs significantly more effort than code thrown together in whatever way works. Now that we have gone through the composite estimators properly, you know that the structured version is often even simpler to write, not to mention more robust.  

You are hopefully able to tell apart two things:  
- Data preprocessing and wrangling.
- Data preparation for ML (feature engineering).  

Always try to separate these two things in your own code. That is why we present these topics separately. Writing code this way will be of tremendous help in the long run.

## Step 5: Working example  
[Source](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html)

In [0]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV



In [0]:
train = pd.read_csv("data_titanic/train.csv")
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


### Use ``ColumnTransformer`` by selecting columns by name

We will train our classifier with the following features:

Numeric Features:

* ``Age``: float;
* ``Fare``: float.

Categorical Features:

* ``Embarked``: categories encoded as strings ``{'C', 'S', 'Q'}``;
* ``Sex``: categories encoded as strings ``{'female', 'male'}``;
* ``Pclass``: ordinal integers ``{1, 2, 3}``.

We create the preprocessing pipelines for both numeric and categorical data.
Note that ``Pclass`` could be treated either as a categorical or as a numeric
feature.

In [0]:
X = train.drop('Survived', axis=1)
y = train['Survived']

In [0]:
numeric_features = ["Age", "Fare"]
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median")), 
                                      ("scaler", StandardScaler())]
                              )

categorical_features = ["Embarked", "Sex", "Pclass"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(transformers=[("num", numeric_transformer, numeric_features),
                                               ("cat", categorical_transformer, categorical_features),
                                              ]
                                )

Append the classifier to the preprocessing pipeline. Now we have a full prediction pipeline.

In [0]:
clf = Pipeline(steps=[("preprocessor", preprocessor), 
                      ("classifier", LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf.fit(X_train, y_train)

print("model score: %.3f" % clf.score(X_test, y_test))

model score: 0.799


In [0]:
clf

Out[55]: Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['Age', 'Fare']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Embarked', 'Sex',
                                                   'Pclass'])])),
                ('classifier', LogisticRegression())])

### Use ``ColumnTransformer`` by selecting columns by data type

When dealing with a cleaned dataset, the preprocessing can be automatic by
using the data types of the column to decide whether to treat a column as a
numerical or categorical feature.

`sklearn.compose.make_column_selector` gives this possibility.

<div class="alert alert-info"><h4>Note</h4><p>In practice, you will have to handle the column data types yourself.
   If you want some columns to be considered as `category`, you will have to
   convert them into categorical columns. If you are using pandas, you can
   refer to their documentation regarding [Categorical data](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html).</p></div>


+ First, we will transform the object columns into categorical.  
+ Then, let's only select a subset of columns to simplify our example.

In [0]:
X["Embarked"] = X["Embarked"].astype("category")
X["Sex"] = X["Sex"].astype("category")

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
subset_feature = ["Embarked", "Sex", "Pclass", "Age", "Fare"]
X_train, X_test = X_train[subset_feature], X_test[subset_feature]

In [0]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 140 to 684
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   Embarked  710 non-null    category
 1   Sex       712 non-null    category
 2   Pclass    712 non-null    int64   
 3   Age       571 non-null    float64 
 4   Fare      712 non-null    float64 
dtypes: category(2), float64(2), int64(1)
memory usage: 23.9 KB


We can observe that the `Embarked` and `Sex` columns were tagged as `category` columns.  
Therefore, we can use this information to dispatch the categorical columns to the ``categorical_transformer`` and the remaining columns to the ``numeric_transformer``.

In [0]:
from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, selector(dtype_exclude="category")),
        ("cat", categorical_transformer, selector(dtype_include="category")),
    ]
)
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)


clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
clf

model score: 0.799
Out[59]: Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f4707d67fa0>),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f4707d67670>)])),
                ('classifier', LogisticRegression())])

Here the resulting score happens to match the one from the previous
pipeline (0.799), but the two pipelines are not identical: the dtype-based
selector treats the ``Pclass`` column as a numeric feature instead of a
categorical feature as previously:

In [0]:
selector(dtype_exclude="category")(X_train)

Out[60]: ['Pclass', 'Age', 'Fare']

In [0]:
selector(dtype_include="category")(X_train)

Out[61]: ['Embarked', 'Sex']
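If ``Pclass`` should be treated as categorical with the dtype-based selector as well, one option is to convert it to the `category` dtype first. The sketch below uses a small made-up frame (`demo`) rather than the Titanic data loaded above.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame with the same column names as the subset above.
demo = pd.DataFrame({
    "Embarked": pd.Series(["S", "C", "Q"], dtype="category"),
    "Sex": pd.Series(["male", "female", "female"], dtype="category"),
    "Pclass": [3, 1, 2],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.93],
})

# Converting the integer column makes the selector pick it up as categorical.
demo["Pclass"] = demo["Pclass"].astype("category")

categorical_columns = selector(dtype_include="category")(demo)
numeric_columns = selector(dtype_exclude="category")(demo)
print(categorical_columns, numeric_columns)
```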

### Using the prediction pipeline in a grid search  

Grid search can also be performed on the different preprocessing steps defined in the ``ColumnTransformer`` object, together with the classifier's
hyperparameters, as part of the ``Pipeline``.  
We will search over both the imputer strategy of the numeric preprocessing and the regularization parameter of the logistic regression using
``GridSearchCV`` from ``sklearn.model_selection``.
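Parameter names for nested estimators are built by joining the step names with double underscores, from the outermost step inwards. A small sketch for checking which names are available, using a cut-down pipeline (`demo_clf`, an illustrative stand-in with only the numeric branch):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in with the same step names as the pipeline in this notebook.
demo_clf = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num",
         Pipeline(steps=[
             ("imputer", SimpleImputer(strategy="median")),
             ("scaler", StandardScaler()),
         ]),
         ["Age", "Fare"]),
    ])),
    ("classifier", LogisticRegression()),
])

# get_params() lists every tunable name, including the nested ones.
param_names = demo_clf.get_params().keys()
print("preprocessor__num__imputer__strategy" in param_names)
print("classifier__C" in param_names)

# The same names work with set_params, which is what a grid search relies on.
demo_clf.set_params(preprocessor__num__imputer__strategy="mean")
new_strategy = demo_clf.get_params()["preprocessor__num__imputer__strategy"]
```

Reading the name right to left: the `strategy` parameter of the `imputer` step, inside the `num` transformer, inside the `preprocessor` step.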

In [0]:
param_grid = {"preprocessor__num__imputer__strategy": ["mean", "median"],
              "classifier__C": [0.1, 1.0, 10, 100],
             }

grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search

Out[62]: GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(strategy='median')),
                                                                                         ('scaler',
                                                                                          StandardScaler())]),
                                                                         <sklearn.compose._column_transformer.make_column_selector object at 0x7f4707d67fa0>),
                                                                        ('cat',
                                                                         OneHotEncoder(handle_unknown='ignore'),
                                              

Calling `fit` triggers the cross-validated search for the best hyperparameter combination:

In [0]:
grid_search.fit(X_train, y_train)

print("Best params:")
print(grid_search.best_params_)

Best params:
{'classifier__C': 0.1, 'preprocessor__num__imputer__strategy': 'median'}


The internal cross-validation score obtained with those parameters is:

In [0]:
print(f"Internal CV score: {grid_search.best_score_:.3f}")

Internal CV score: 0.788


We can also introspect the top grid search results as a pandas dataframe:

In [0]:
cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results = cv_results.sort_values("mean_test_score", ascending=False)
cv_results[["mean_test_score",
            "std_test_score",
            "param_preprocessor__num__imputer__strategy",
            "param_classifier__C",
           ]].head(5)

| | `mean_test_score` | `std_test_score` | `param_preprocessor__num__imputer__strategy` | `param_classifier__C` |
|---|---|---|---|---|
| 1 | 0.788009 | 0.04022 | median | 0.1 |
| 0 | 0.7866 | 0.039922 | mean | 0.1 |
| 2 | 0.785211 | 0.03999 | mean | 1.0 |
| 3 | 0.785211 | 0.039491 | median | 1.0 |
| 4 | 0.785211 | 0.03999 | mean | 10.0 |


The best hyperparameters have been used to re-fit a final model on the full training set.  
We can evaluate that final model on held-out test data that was not used for hyperparameter tuning.

In [0]:
print(f"best logistic regression from grid search: {grid_search.score(X_test, y_test):.3f}")

best logistic regression from grid search: 0.799
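The re-fitted model is also available directly as `best_estimator_`, and calling `predict` or `score` on the search object delegates to it. A self-contained sketch on synthetic data (the dataset from `make_classification` and the names `pipe` and `search` are assumptions for illustration, not the Titanic setup above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, only to keep the example runnable on its own.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
search = GridSearchCV(pipe, {"classifier__C": [0.1, 1.0, 10]}, cv=5)
search.fit(Xs_train, ys_train)

# With the default refit=True, the best pipeline is re-fitted on all training data.
best_model = search.best_estimator_
same_predictions = np.array_equal(
    search.predict(Xs_test), best_model.predict(Xs_test)
)
print(same_predictions)
```

Keeping `best_model` around is handy when the fitted pipeline needs to be saved or inspected without the surrounding search object.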
