# Python Pipelines

Modeling pipelines
=================

Thinking about modeling as a series of transformations is really helpful.
Pipelines of functional transformations are one of the cleanest ways to preprocess data.
The idea of composing transformations has its roots in category theory in mathematics.

Functional transformers are reusable, and you can combine them into complicated structures (think of Lego blocks).

Assumptions
-------------------

1. We will use the scikit-learn interface to pipelines.
2. We will use pandas dataframes as inputs to pipelines, which is convenient in practice.

There are 2 types of building blocks in machine learning pipelines: transformers and estimators.

Transformers
---------

Transformers are blocks that take an input, produce an output, and can be chained with other transformers.

For example:

```
Data -> [ Select variables ] -> [ Normalize ] -> [ Reduce dimensions ] -> Output
```

`[ Select variables ]` - transformer for selecting variables

`[ Normalize ]` - normalization step

`[ Reduce dimensions ]` - dimension reduction


-------------------

Because every transformer takes and returns the same type of data, a chain of
transformers is itself a transformer.

```
Input -> [ [ Select variables ] -> [ Normalize ] -> [ Reduce dimensions ] ] -> Output

Input -> [               Data preprocessing transformation                ] -> Output
```
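The composition above can be sketched with scikit-learn itself. This is a minimal illustration, assuming made-up numeric data and using `StandardScaler` and `PCA` as stand-ins for the normalization and dimension reduction steps (the variable selection step is left out):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: 6 observations, 3 numeric variables (made up for illustration).
X = np.array([
    [1.0, 10.0, 0.5],
    [2.0, 12.0, 0.1],
    [3.0, 15.0, 0.9],
    [4.0, 11.0, 0.3],
    [5.0, 18.0, 0.7],
    [6.0, 14.0, 0.2],
])

# Two transformers chained into one composite transformer.
preprocessing = make_pipeline(StandardScaler(), PCA(n_components=2))

# The composite exposes the same fit/transform interface as its parts.
X_reduced = preprocessing.fit_transform(X)
print(X_reduced.shape)  # (6, 2)
```

The composite object can itself be placed inside a larger pipeline, which is what makes the building blocks composable.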

-------------------

An example of a transformer that does nothing:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class LazyTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        pass

    def fit(self, x, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, x):
        # Return the input unchanged.
        return x
```

-------------------

Notice that there are 2 methods:

1. **fit** - learns information from the data and stores it; a transformer that learns something here is stateful
2. **transform** - applies the transformation

There are 2 types of transformers:

1. **stateful** - they learn something when the fit method is called
2. **stateless** - they don't learn anything
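The difference can be sketched with two small transformers. Both class names (`MeanCenterer`, `Log1pTransformer`) are hypothetical and written only for this illustration:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Stateful: fit learns the column means, transform subtracts them."""

    def fit(self, x, y=None):
        self.mean_ = np.asarray(x, dtype=float).mean(axis=0)
        return self

    def transform(self, x):
        return np.asarray(x, dtype=float) - self.mean_


class Log1pTransformer(BaseEstimator, TransformerMixin):
    """Stateless: fit learns nothing, transform depends only on its input."""

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return np.log1p(np.asarray(x, dtype=float))


train = np.array([[1.0, 2.0], [3.0, 6.0]])
new = np.array([[2.0, 4.0]])

centerer = MeanCenterer().fit(train)
print(centerer.transform(new))            # [[0. 0.]] because the learned means are [2, 4]
print(Log1pTransformer().transform(new))  # same result whether or not fit was called
```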

**Why are stateless transformers useful?**

Transformers that don't need historical data to learn can be used in a type of learning
called `online learning`. Online learning fits pipelines well because the algorithm
learns from a stream of observations.

Such an algorithm doesn't keep the history, so there is no data on which to fit a stateful transformer.
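A minimal sketch of this idea, assuming a made-up stream of labelled text batches: `HashingVectorizer` is a stateless transformer from scikit-learn, and `SGDClassifier.partial_fit` updates the model one batch at a time.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless transformer: hashing needs no vocabulary learned from past data.
vectorizer = HashingVectorizer(n_features=2**10)
model = SGDClassifier(random_state=0)

# A made-up stream of mini-batches; in practice these arrive over time.
stream = [
    (["good product", "really great"], [1, 1]),
    (["bad service", "really awful"], [0, 0]),
    (["great service", "awful product"], [1, 0]),
]

for texts, labels in stream:
    features = vectorizer.transform(texts)  # no fit call, no stored history
    # All possible classes are declared up front because a batch may miss some.
    model.partial_fit(features, labels, classes=[0, 1])

prediction = model.predict(vectorizer.transform(["great product"]))
```

A vocabulary-based vectorizer would not work here as-is, because its vocabulary would have to be fitted on data that the stream never keeps.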

More than modeling
================

Using pipelines is not limited to machine learning.
It is as easy as defining a few rules to write modular and composable classes.

Let's say you build a data engineering platform.

Define a set of inputs (let's say dataframes): A, B, C, D, E.

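```python
from typing import Tuple

from pandas import DataFrame

# `Transformation` and the dataframes A..E are placeholders, not a real library API.
class Merge(Transformation):
    input = (A, B)
    output = (C,)

    def transform(self, A: DataFrame, B: DataFrame) -> DataFrame:
        ...
        return C

class ExtractUsefulFeatures(Transformation):
    input = (C,)
    output = (D, E)

    def transform(self, C: DataFrame) -> Tuple[DataFrame, DataFrame]:
        ...
        return (D, E)
```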

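Notice that every transform method accepts and returns DataFrames. It is important to decouple I/O operations from the transformation logic. Loading the data inside the transformation, as in the following example, would be a mistake, because the transformation can then no longer be reused or tested with other inputs:

```python
def transform() -> DataFrame:
    A = load_A()  # I/O hidden inside the transformation
    B = load_B()
    ...
    return C
```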
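One possible way to keep I/O at the edges is sketched below. The `Transformation` base class, the `run` helper, and the `id` join column are all assumptions made for this illustration, not part of any library:

```python
import pandas as pd
from pandas import DataFrame


class Transformation:
    """Hypothetical base class: subclasses implement a pure transform()."""

    def transform(self, *frames: DataFrame):
        raise NotImplementedError


class Merge(Transformation):
    def transform(self, a: DataFrame, b: DataFrame) -> DataFrame:
        # Pure function of its inputs: no file or database access here.
        return a.merge(b, on="id")


def run(load_a, load_b, save) -> None:
    """I/O lives at the edges; the transformation stays testable in isolation."""
    save(Merge().transform(load_a(), load_b()))


# In a test, in-memory frames replace the real loaders.
a = pd.DataFrame({"id": [1, 2], "x": [10, 20]})
b = pd.DataFrame({"id": [1, 2], "y": ["u", "v"]})
results = []
run(lambda: a, lambda: b, results.append)
print(results[0].columns.tolist())  # ['id', 'x', 'y']
```

Because the loaders and the writer are passed in, the same `Merge` class can run against files, a database, or test fixtures without changes.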

Scikit-learn pipelines to the rescue
-------------

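Fortunately, scikit-learn provides a set of helpful functions for building pipelines.
2 of them are the most important:

1. `sklearn.pipeline.make_pipeline`

    Creates a pipeline that applies the given steps one after another.

2. `sklearn.pipeline.make_union`

    Creates a union of transformers: each one receives the same input, and their outputs are concatenated.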
    
    ```
    
             transformer 1
           /               \
          /                 \
    input                     output
          \                 /    
           \               /
             transformer 2
             
    ```
             
    It is useful when the dataset consists of several types of data that one must 
    deal with separately.
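A short sketch of both functions on made-up numeric data; the choice of `StandardScaler`, `PCA` and `MinMaxScaler` is only for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0], [4.0, 30.0]])

# make_pipeline: steps run one after another.
sequential = make_pipeline(StandardScaler(), PCA(n_components=1))

# make_union: each transformer sees the same input; outputs are stacked column-wise.
parallel = make_union(sequential, MinMaxScaler())

combined = parallel.fit_transform(X)
print(combined.shape)  # (4, 3): 1 PCA column + 2 min-max scaled columns
```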


Alternative way to define pipelines
--------------

```python
from sklearn.pipeline import Pipeline
```

With the `Pipeline` class, each step is given an explicit name. Naming the steps is useful because sometimes we want to control them from outside the pipeline, for example when searching for parameters.
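A minimal sketch of named steps, assuming a scaler followed by a logistic regression (the step names `scale` and `classifier` are arbitrary):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Parameters are addressed as <step name>__<parameter name>.
pipeline.set_params(classifier__C=0.5)
print(pipeline.named_steps["classifier"].C)  # 0.5

# The same naming scheme drives a parameter search.
search = GridSearchCV(pipeline, param_grid={"classifier__C": [0.1, 1.0, 10.0]}, cv=2)
```

Calling `search.fit(X, y)` would then try each value of `C` on the whole pipeline, including the scaling step.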

Heterogeneous data
==========================

Datasets are usually not plain matrices of numbers.
In real life, a dataset is a mix of:
- categorical features
- numerical features
- dates
- text data

Any of these can come with or without missing values.

Still, you must create 1 pipeline to process all these types of information.

Possible transformations:
- **categorical features**:
    - one hot encoding - converting each category to a binary column
    - convert to numerical values - by hashing the categorical variable
    - target averaging - replace each category with the average of the target for that category

- **numerical features**:
    - fill missing values
    - create bins with ranges
    - normalize, scale

- **text**:
    - use bag of words vectorization
    - word2vec, sentence2vec

- **dates**:
    - extract years, months, days, days of week
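A few of these transformations can be sketched on a toy dataframe. The column names and values are made up, and the specific choices (mean imputation, 2 uniform bins) are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "age": [20.0, np.nan, 40.0],
    "signup": pd.to_datetime(["2021-01-04", "2021-06-15", "2022-12-31"]),
})

# Categorical: one hot encoding (one binary column per category).
one_hot = OneHotEncoder().fit_transform(df[["color"]]).toarray()

# Numerical: fill missing values with the column mean, then bin into ranges.
age_filled = SimpleImputer(strategy="mean").fit_transform(df[["age"]])
binner = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
age_binned = binner.fit_transform(age_filled)

# Dates: extract calendar parts with the pandas .dt accessor.
date_parts = pd.DataFrame({
    "year": df["signup"].dt.year,
    "month": df["signup"].dt.month,
    "day": df["signup"].dt.day,
    "day_of_week": df["signup"].dt.dayofweek,  # Monday is 0
})
```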

Implementation
==========================


Data normally comes in various shapes and formats.

We need a way to combine scikit-learn transformers with pandas dataframes in order to do something like this:

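```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# CleanData, Selector and model are not part of scikit-learn;
# they stand for user-defined transformers and a final estimator.
pipeline = make_pipeline(
    CleanData(),
    make_union(
        make_pipeline(
            Selector('text_column'),
            CountVectorizer()
        ),
        make_pipeline(
            Selector('numerical_column_1', 'numerical_column_2'),
            StandardScaler()
        ),
        make_pipeline(
            Selector('categorical_column'),
            OneHotEncoder()
        ),
    ),
    model
)
```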
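The `Selector` used above is not defined in this document, so the following is only one possible sketch of it. It assumes a single `columns` argument (a string or a list) rather than the variadic call shown above, because scikit-learn estimators are expected to declare their parameters explicitly in `__init__`. The toy data, the omitted `CleanData` step, and the choice of `LogisticRegression` as the model are also assumptions:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class Selector(BaseEstimator, TransformerMixin):
    """Hypothetical column selector for pandas dataframes.

    A string selects a 1-D Series (what CountVectorizer expects);
    a list selects a 2-D DataFrame (what scalers and encoders expect).
    """

    def __init__(self, columns):
        self.columns = columns

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return x[self.columns]


data = pd.DataFrame({
    "text_column": ["good value", "bad fit", "good fit", "bad value"],
    "numerical_column_1": [1.0, 2.0, 3.0, 4.0],
    "categorical_column": ["a", "b", "a", "b"],
})
target = [1, 0, 1, 0]

pipeline = make_pipeline(
    make_union(
        make_pipeline(Selector("text_column"), CountVectorizer()),
        make_pipeline(Selector(["numerical_column_1"]), StandardScaler()),
        make_pipeline(Selector(["categorical_column"]), OneHotEncoder()),
    ),
    LogisticRegression(),
)
pipeline.fit(data, target)
predictions = pipeline.predict(data)
```

scikit-learn also ships `sklearn.compose.ColumnTransformer`, which covers the same column-wise routing without a custom selector class.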