Modeling pipelines
=================
SOURCE: Pavel Jankiewicz<br>
Thinking about modeling as a series of transformations is really helpful.
Pipelines and functional transformations are the cleanest way to preprocess data.
The idea has its roots in category theory, a branch of mathematics.

Functional transformers are reusable, and you can build many complicated things out of them (think of Lego blocks).

Assumptions
-------------------

1. We will use the scikit-learn interface to pipelines.
2. We will use pandas dataframes as inputs to pipelines (useful).

There are 2 types of building blocks in machine learning pipelines: transformers and estimators.
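
As a rough illustration of the two kinds of blocks, the sketch below chains one transformer and one estimator with scikit-learn's `Pipeline`. The toy arrays, the step names `scale` and `classify`, and the choice of `StandardScaler` and `LogisticRegression` are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: 4 observations, 2 features.
X = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, 7.0]])
y = np.array([0, 0, 1, 1])

model = Pipeline([
    ("scale", StandardScaler()),         # transformer: has fit and transform
    ("classify", LogisticRegression()),  # estimator: has fit and predict
])

model.fit(X, y)
predictions = model.predict(X)  # one predicted label per row of X
```

The transformer reshapes the data and passes it on; the estimator sits at the end of the chain and produces predictions.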

Theory
--------------------

There is another name for the type of operations we will be doing today.

All of the pipeline transformations are just functions.

Each one is defined as a function $f$ such that

$f(a: S) \rightarrow b: S$

In other words, a transformation $f$ changes $a$ into $b$, but $a$ and $b$ must have the same type $S$.

So if your transformation:
- accepts a matrix it should return a matrix
- accepts a dataframe it should return a dataframe
- accepts a json object it should return a json object

```
"It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures." —Alan Perlis
```
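
To make the "same type in, same type out" rule concrete, here is a minimal sketch in plain Python. The type alias `Table`, the functions `drop_missing` and `add_ratio`, and the helper `compose` are hypothetical names invented for this illustration. Because every step maps a `Table` to a `Table`, any chain of steps is again a `Table -> Table` function.

```python
from functools import reduce
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[float]]
Table = List[Record]              # the single type S shared by every step
Step = Callable[[Table], Table]   # f: S -> S

def drop_missing(rows: Table) -> Table:
    """Keep only the rows in which no value is None."""
    return [r for r in rows if all(v is not None for v in r.values())]

def add_ratio(rows: Table) -> Table:
    """Add a derived field 'ratio' = a / b to every row."""
    return [{**r, "ratio": r["a"] / r["b"]} for r in rows]

def compose(*steps: Step) -> Step:
    """Chain steps left to right; the result is again a Table -> Table function."""
    return lambda rows: reduce(lambda acc, step: step(acc), steps, rows)

preprocess = compose(drop_missing, add_ratio)

rows = [{"a": 1.0, "b": 2.0}, {"a": None, "b": 4.0}, {"a": 3.0, "b": 4.0}]
result = preprocess(rows)
# result keeps the two complete rows and adds ratio 0.5 and 0.75 to them
```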

Transformers
---------

Transformers are blocks that have an input and an output and can be chained with other transformers.

For example

```
Data -> [ Select variables ] -> [ Normalize ] -> [ Reduce dimensions ] -> Output
```

`[ Select variables ]` - transformer for selecting variables

`[ Normalize ]` - normalization step

`[ Reduce dimensions ]` - dimension reduction


-------------------

Because every transformer takes and returns the same type of data, a chain of transformers
is itself a transformer.

```
Input -> [ [ Select variables ] -> [ Normalize ] -> [ Reduce dimensions ] ] -> Output

Input -> [               Data preprocessing transformation                ] -> Output
```
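
The diagram above could be sketched with scikit-learn's `Pipeline` roughly as follows. The random input, the helper `first_three_columns`, and the specific choices of `StandardScaler` and `PCA` are assumptions for the example; the point is only that the combined object exposes the same `fit` / `transform` interface as its parts.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def first_three_columns(X):
    """Hypothetical variable selection: keep the first three columns."""
    return X[:, :3]

preprocessing = Pipeline([
    ("select", FunctionTransformer(first_three_columns)),  # [ Select variables ]
    ("normalize", StandardScaler()),                        # [ Normalize ]
    ("reduce", PCA(n_components=2)),                        # [ Reduce dimensions ]
])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))  # made-up input: 20 rows, 5 columns

# The whole chain is used exactly like a single transformer.
X_out = preprocessing.fit_transform(X)
print(X_out.shape)  # (20, 2)
```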

-------------------

An example of a transformer that does nothing:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class LazyTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
        
    def fit(self, x, y = None):
        return self
    
    def transform(self, x):
        return x
```

-------------------

Notice that there are 2 methods:

1. **fit** - learns the information about the data - it becomes a stateful transformer
2. **transform** - applies the transformation 
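
A short usage sketch of the two methods, assuming a copy of the `LazyTransformer` class shown earlier (repeated here so the snippet runs on its own). It also shows what the two parent classes contribute: `TransformerMixin` supplies `fit_transform`, and `BaseEstimator` supplies `get_params`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LazyTransformer(BaseEstimator, TransformerMixin):
    """Copy of the do-nothing transformer from above."""

    def __init__(self):
        pass

    def fit(self, x, y=None):
        return self   # nothing to learn

    def transform(self, x):
        return x      # data passes through unchanged

X = np.arange(6).reshape(3, 2)
lazy = LazyTransformer()

fitted = lazy.fit(X)                  # fit returns the transformer itself
same = lazy.transform(X)              # transform returns the input untouched
also_same = lazy.fit_transform(X)     # inherited from TransformerMixin
params = lazy.get_params()            # inherited from BaseEstimator; empty here
```

Returning `self` from `fit` is what allows calls to be chained, as in `lazy.fit(X).transform(X)`.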

There are 2 types of transformers:
1. **stateful** - they learn something when the `fit` method is called
2. **stateless** - they don't learn anything
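
The difference between the two types can be sketched as below. Both classes, `ClipNegative` and `MaxAbsScalerSketch`, are hypothetical examples written for this note and are not scikit-learn classes. The stateless one ignores the data in `fit`; the stateful one stores something learned from the training data and reuses it in `transform`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ClipNegative(BaseEstimator, TransformerMixin):
    """Stateless: fit learns nothing, transform depends only on its input."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.clip(X, 0, None)  # replace negative values with 0

class MaxAbsScalerSketch(BaseEstimator, TransformerMixin):
    """Stateful: fit stores the largest absolute value of each column."""

    def fit(self, X, y=None):
        self.max_abs_ = np.abs(X).max(axis=0)
        return self

    def transform(self, X):
        return X / self.max_abs_    # uses what was learned in fit

X_train = np.array([[1.0, -2.0], [4.0, 8.0]])
X_new = np.array([[2.0, 4.0]])

clipped = ClipNegative().transform(np.array([[-1.0, 3.0]]))  # no fit needed
scaler = MaxAbsScalerSketch().fit(X_train)                   # learns [4, 8]
scaled = scaler.transform(X_new)                             # [[0.5, 0.5]]
```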

**Why are stateless transformers useful?**

Transformers that don't need historical data to learn can be used in a type of learning
called `online learning`. Online learning fits pipelines well because the algorithm
learns from a stream of observations.

It doesn't keep the history, so there would be no way to fit stateful transformers.
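
A minimal sketch of that idea, under several assumptions: the stream is simulated with random batches, the stateless step is a hypothetical `Log1pTransformer`, and the learner is scikit-learn's `SGDRegressor`, whose `partial_fit` method updates the model one batch at a time.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import SGDRegressor

class Log1pTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stateless step: needs no history, so it works on any batch."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.log1p(X)

step = Log1pTransformer()
model = SGDRegressor(random_state=0)
rng = np.random.default_rng(0)

n_batches = 0
for _ in range(5):
    # Simulated stream: each batch is seen once and then discarded.
    X_batch = rng.uniform(0, 10, size=(8, 2))
    y_batch = X_batch.sum(axis=1)
    # The stateless step can be applied immediately, without a prior fit.
    model.partial_fit(step.transform(X_batch), y_batch)
    n_batches += 1
```

A stateful step such as a mean normalizer would need the column means before the first batch arrives, which is exactly the history an online learner does not keep.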


**Some rules when writing transformers:**
- Think about your outputs first: what is the end-goal?
- Then: Work backwards from that end goal, to what you're starting with. Chances are this will save you a lot of time and make your code much smoother and easier to read and understand.


Exercise
--------------

1. Write a transformer that adds some number to the input. The number to add should be passed in `__init__`.
2. Write a transformer that normalizes the input:
   - in the fit method you must save the column means
3. Combine these 2 transformers into a pipeline:
   - hint: write a class that accepts a list of transformers as an argument

HINTS: All transformers are classes. Give each class an `__init__` method. All transformers should inherit from the `BaseEstimator` and `TransformerMixin` parent classes. All transformers must have `fit` and `transform` methods.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# answer - start

class AdderTransformer(BaseEstimator, TransformerMixin):
    ...  # your code here


class MeanNormalizer(BaseEstimator, TransformerMixin):
    ...  # your code here


class TransformerPipeline(BaseEstimator, TransformerMixin):
    ...  # your code here

# answer - end

# tests
X = np.ones((10, 10))
adder = AdderTransformer(add=1)
assert np.all(adder.transform(X) == X + 1), "Adder transformer wrong"

X = np.ones((10, 10))
normalizer = MeanNormalizer()
assert np.allclose(normalizer.fit_transform(X), np.zeros((10, 10))), "Mean normalizer wrong"

double_adder = TransformerPipeline([AdderTransformer(add=1),
                                    AdderTransformer(add=2)])

assert np.allclose(double_adder.transform(X), X + 3), "TransformerPipeline wrong"
```

**Double click to see the solution**

<div class='spoiler'>

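```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class AdderTransformer(BaseEstimator, TransformerMixin):
    """Stateless: adds a constant to every element."""

    def __init__(self, add=0):
        self.add = add

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return x + self.add


class MeanNormalizer(BaseEstimator, TransformerMixin):
    """Stateful: learns the column means in fit and subtracts them in transform."""

    def __init__(self):
        pass

    def fit(self, x, y=None):
        self.means = x.mean(axis=0)
        return self

    def transform(self, x):
        return x - self.means


class TransformerPipeline(BaseEstimator, TransformerMixin):
    """Applies a list of transformers one after another."""

    def __init__(self, transformers):
        self.transformers = transformers

    def fit(self, x, y=None):
        x_ = x.copy()
        # Each transformer is fitted on the output of the previous one.
        for transformer in self.transformers:
            x_ = transformer.fit_transform(x_, y)
        return self

    def transform(self, x):
        x_ = x.copy()
        for transformer in self.transformers:
            x_ = transformer.transform(x_)
        return x_
```
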
</div>