# Nice Pipeline

Here we present a nice example of a pipeline that we can use for training purposes. At first glance it looks messy and hard to read.  
But if you take a moment to understand it, you will surely notice its beauty!

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.compose import ColumnTransformer

We just need to import the transformers that are used inside the pipeline.  
This is not operational code, just an example of a longer pipeline.  

In [2]:
#Preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

#Dimensionality reduction
from sklearn.decomposition import NMF

#Imputation
from sklearn.impute import SimpleImputer

#Modeling
from sklearn.ensemble import RandomForestClassifier

#Other
import numpy as np

## Step 1: Take a quick glance
Take a quick look at the pipeline below, then come back here.

## Step 2: Slow walkthrough
Get a **high-level view** like this:
- look at the top: there is a *FeatureUnion*, which is really a wrapper for the entire feature engineering
- look at the bottom: there is a *RandomForestClassifier*, which is our predictive model

Now we can go deeper into our FeatureUnion, which holds our **feature engineering**:
- it splits into three parts, depending on which features we want to process
    - at the top, we have numerical features
    - in the middle, we have categorical features
    - at the bottom, we have textual features
- now zoom out again and note that all of this is wrapped in FeatureUnion, which means that the three parts are transformed in parallel and their outputs are concatenated side by side
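
The side-by-side concatenation can be seen in a minimal sketch. The two branches below (`doubled` and `row_sum`) are made-up toy transformers, not part of the pipeline in this notebook; they only illustrate how FeatureUnion stacks the outputs of its branches.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def double(a):
    """Toy branch: keeps both columns, doubles every value."""
    return a * 2


def row_sum(a):
    """Toy branch: produces a single column with the sum of each row."""
    return a.sum(axis=1, keepdims=True)


X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Every branch receives the full input X; the outputs are stacked horizontally.
union = FeatureUnion([
    ("doubled", FunctionTransformer(double)),
    ("row_sum", FunctionTransformer(row_sum)),
])

combined = union.fit_transform(X)
print(combined.shape)  # (3, 3): 2 columns from "doubled" + 1 column from "row_sum"
```

Note that the rows stay aligned: each branch may change the number of columns, but never the number of rows.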

Only now let's **zoom into one part of our feature engineering**, for example the "numerical features" at the top:
- inside it, we immediately need a ColumnTransformer, because we want to specify, by name or by type, the columns to which a certain transformation is applied
- we could apply transformers directly at this point, but remember that ColumnTransformer applies each of its entries to the original columns and by default drops all untransformed columns, so it cannot chain several steps on the same column; that is why each entry here holds a Pipeline
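
The default dropping behaviour is easy to check on a toy DataFrame. This is only a sketch: `feature_2` is a hypothetical extra column that exists just to show what happens to columns no transformer is assigned to.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Made-up data; "feature_2" is not mentioned in any transformer below.
df = pd.DataFrame({
    "feature_1": [1.0, np.nan, 3.0],
    "feature_2": [10.0, 20.0, 30.0],
})

# Default remainder="drop": only the listed column survives.
dropping = ColumnTransformer([
    ("numerical", SimpleImputer(strategy="median"), ["feature_1"]),
])
dropped = dropping.fit_transform(df)
print(dropped.shape)  # (3, 1)

# remainder="passthrough" keeps the untouched columns, appended at the end.
keeping = ColumnTransformer([
    ("numerical", SimpleImputer(strategy="median"), ["feature_1"]),
], remainder="passthrough")
kept = keeping.fit_transform(df)
print(kept.shape)  # (3, 2)
```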

Finally, **get used to the indentation** (the whitespace). Your code editor helps with this. Practise by clicking just behind the last visible character on the line where you are. For example, go behind the last bracket on the line of *SimpleImputer*. If you now hit Enter, the cursor lands where the code should continue on the next line if you still want to stay within the same element, which here is the *Pipeline*.

Source: https://www.codementor.io/@bruce3557/beautiful-machine-learning-pipeline-with-scikit-learn-uiqapbxuj

In [3]:
model_pipeline = Pipeline(steps=[
    ("features", FeatureUnion([
        ("numerical_features",
         ColumnTransformer([
             ("numerical",
              Pipeline(steps=[(
                  "impute_stage",
                  SimpleImputer(missing_values=np.nan, strategy="median")
              )]),
              ["feature_1"]
             )
         ])
        ), 
        ("categorical_features",
            ColumnTransformer([
                ("country_encoding",
                 Pipeline(steps=[
                     ("ohe", OneHotEncoder(handle_unknown="ignore")),
                     ("reduction", NMF(n_components=8)),
                 ]),
                 ["country"],
                ),
            ])
        ), 
        ("text_features",
         ColumnTransformer([
             ("title_vec",
              Pipeline(steps=[
                  ("tfidf", TfidfVectorizer()),
                  ("reduction", NMF(n_components=50)),
              ]),
              "title"
             )
         ])
        )
    ])
    ),
    ("classifiers", RandomForestClassifier())
])

Now we could work with the pipeline easily:

In [4]:
#model_pipeline.fit(train_data, train_labels.values)
#predictions = model_pipeline.predict(predict_data)
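
To see these two calls run end to end, here is a minimal sketch on made-up data. Everything in it is an assumption for illustration: the toy values, the name `demo_pipeline`, the smaller `n_components`, and the fixed `random_state`. It is a shortened stand-in for the pipeline above (the imputer is passed directly instead of being wrapped in a one-step Pipeline), and only the column names are taken from it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training data; only the column names match the pipeline above.
train_data = pd.DataFrame({
    "feature_1": [1.0, np.nan, 3.0, 4.0, 5.0, np.nan],
    "country": ["CZ", "SK", "CZ", "DE", "SK", "DE"],
    "title": ["red apple pie", "green apple", "red car",
              "fast green car", "apple pie recipe", "fast red car"],
})
train_labels = pd.Series([0, 0, 1, 1, 0, 1])

demo_pipeline = Pipeline(steps=[
    ("features", FeatureUnion([
        ("numerical_features", ColumnTransformer([
            ("numerical", SimpleImputer(strategy="median"), ["feature_1"]),
        ])),
        ("categorical_features", ColumnTransformer([
            ("country_encoding", Pipeline(steps=[
                ("ohe", OneHotEncoder(handle_unknown="ignore")),
                # n_components is reduced because the toy data is tiny
                ("reduction", NMF(n_components=2, max_iter=500, random_state=0)),
            ]), ["country"]),
        ])),
        ("text_features", ColumnTransformer([
            ("title_vec", Pipeline(steps=[
                ("tfidf", TfidfVectorizer()),
                ("reduction", NMF(n_components=2, max_iter=500, random_state=0)),
            ]), "title"),  # a plain string, because TfidfVectorizer expects 1-D input
        ])),
    ])),
    ("classifiers", RandomForestClassifier(n_estimators=10, random_state=0)),
])

demo_pipeline.fit(train_data, train_labels.values)
predictions = demo_pipeline.predict(train_data)  # predicting on the training data, for demonstration only
print(predictions.shape)  # (6,)
```

All fitting, including the imputer, the encoder and both NMF steps, happens inside the single `fit` call, which is the main convenience of wrapping everything in one pipeline.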

## Step 3: How to write that?
Alright, you may now feel comfortable reading such pipelines, but how do you get to write one? The answer is: **from the outside inwards**. Let's walk through an example; of course, you could write things differently.  

First, lay out a simple structure that separates your feature engineering (inside FeatureUnion) from your predictive model.

In [None]:
model_pipeline = Pipeline(steps=[
    ("features", FeatureUnion([...])),  # `...` is a placeholder: all feature engineering goes here
    ("classifiers", RandomForestClassifier())
])

Second, depending on your features, split your feature engineering into separate parts.

In [None]:
model_pipeline = Pipeline(steps=[
    ("features", FeatureUnion([("numerical_features", ...),    # placeholder: numerical transformations
                               ("categorical_features", ...),  # placeholder: categorical transformations
                               ("text_features", ...)          # placeholder: textual transformations
                              ])
    ),
    ("classifiers", RandomForestClassifier())
])

Now put a ColumnTransformer into each part, as the transformations will be applied only to specific columns.

In [None]:
model_pipeline = Pipeline(steps=[
    ("features", FeatureUnion([("numerical_features", ColumnTransformer([...])),    # placeholder: numerical transformations
                               ("categorical_features", ColumnTransformer([...])),  # placeholder: categorical transformations
                               ("text_features", ColumnTransformer([...]))          # placeholder: textual transformations
                              ])
    ),
    ("classifiers", RandomForestClassifier())
])

You can put a Pipeline inside each ColumnTransformer, for example when you have transformers that need to run sequentially (such as numeric scaling and feature selection).  
Then you just start filling in the individual transformations you wrote before.
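
As a sketch of this last step, here is what a filled-in numerical part could look like. The choice of `StandardScaler` and `VarianceThreshold`, the step names `select` and `scale`, and the constant column `feature_2` are assumptions for illustration; they do not appear in the pipeline above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data; "feature_2" is a hypothetical constant column.
df = pd.DataFrame({
    "feature_1": [1.0, np.nan, 3.0, 5.0],
    "feature_2": [7.0, 7.0, 7.0, 7.0],
})

numerical_features = ColumnTransformer([
    ("numerical",
     Pipeline(steps=[
         # The steps run one after another on the same columns.
         ("impute_stage", SimpleImputer(missing_values=np.nan, strategy="median")),
         ("select", VarianceThreshold()),  # removes zero-variance columns
         ("scale", StandardScaler()),      # zero mean, unit variance
     ]),
     ["feature_1", "feature_2"]
    )
])

out = numerical_features.fit_transform(df)
print(out.shape)  # (4, 1): the constant column was removed
```

A block like `numerical_features` can then be dropped into the `"numerical_features"` slot of the FeatureUnion skeleton.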

## Step 4: Reflect
Continue with this point only once you have gone through the pipeline above.  

We usually think that nicely written code costs significantly more effort than code thrown together any which way. Now that we have gone through the composite estimators properly, you know that in many cases it might even be simpler, not to mention more robust.  

You are hopefully able to tell two things apart:  
- Data preprocessing and wrangling.
- Data preparation for ML (feature engineering).  

Always try to separate these two things in the code for your use case. That is why we present these topics separately. Writing code in this way will be of tremendous help in the long run.