# Python Pipelines

Modeling pipelines
=================

Thinking about modeling as a series of transformations is really helpful.
Pipelines and functional transformations are one of the cleanest ways to preprocess data.
The idea of composing transformations has its roots in category theory, a branch of mathematics.

Functional transformers are reusable, and you can combine them into complicated structures (think of Lego blocks).

Assumptions
-------------------

1. We will use the scikit-learn interface to pipelines.
2. We will use pandas DataFrames as inputs to pipelines (useful in practice).

Machine learning pipelines have 2 types of building blocks: transformers and estimators.

Transformers
---------

Blocks that take an input, produce an output, and can be chained with other transformers.

For example:

```
Data -> [ Select variables ] -> [ Normalize ] -> [ Reduce dimensions ] -> Output
```

`[ Select variables ]` - transformer for selecting variables

`[ Normalize ]` - normalization step

`[ Reduce dimensions ]` - dimension reduction


-------------------

Because every transformer takes and returns the same type of data, a chain of
transformers is itself a transformer.

```
Input -> [ [ Select variables ] -> [ Normalize ] -> [ Reduce dimensions ] ] -> Output

Input -> [               Data preprocessing transformation                ] -> Output
```
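
A minimal sketch of this composition, assuming scikit-learn's `StandardScaler` and `PCA` stand in for the normalize and reduce-dimensions steps. The random data is purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))  # illustrative data: 20 rows, 5 features

# Two transformers chained together...
preprocessing = make_pipeline(StandardScaler(), PCA(n_components=2))

# ...expose the same fit/transform interface as a single transformer.
X_reduced = preprocessing.fit(X).transform(X)
print(X_reduced.shape)  # (20, 2)

# So the composite can itself be a step in a larger pipeline.
outer = make_pipeline(preprocessing, StandardScaler())
X_out = outer.fit_transform(X)
```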

-------------------

An example of a transformer that does nothing:

```python
from sklearn.base import BaseEstimator, TransformerMixin

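class LazyTransformer(TransformerMixin, BaseEstimator):
    """Identity transformer: learns nothing and returns its input unchanged."""

    def fit(self, x, y=None):
        # Nothing to learn; return self so calls can be chained
        return self

    def transform(self, x):
        # Pass the data through untouched
        return x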
```

-------------------

Notice that there are 2 methods:

1. **fit** - learns information from the data and stores it on the transformer
2. **transform** - applies the transformation

There are 2 types of transformers:

1. **stateful** - they learn something when `fit` is called
2. **stateless** - they don't learn anything, so `fit` does nothing
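
The difference can be sketched with two scikit-learn transformers. `StandardScaler` and `FunctionTransformer` are chosen here only for illustration, and the tiny arrays are made up:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_new = np.array([[10.0]])

# Stateful: fit stores the training mean and scale, transform reuses them.
scaler = StandardScaler().fit(X_train)
print(scaler.mean_)  # learned from X_train
scaled_new = scaler.transform(X_new)

# Stateless: the output depends only on the input passed to transform.
log_tf = FunctionTransformer(np.log1p)
logged_new = log_tf.fit(X_train).transform(X_new)
```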

**Why are stateless transformers useful?**

Transformers that don't need historical data to learn can be used in a type of learning
called `online learning`. This type of learning suits pipelines because the algorithm
learns from a stream of observations.

Online learning doesn't keep the history, so there would be no data to fit stateful transformers on.
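
A rough sketch of the idea, assuming scikit-learn's `HashingVectorizer` (a stateless text vectorizer) and `SGDClassifier.partial_fit` as the online learner. The `stream` of mini-batches is hypothetical:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**10)  # stateless: no fit required
clf = SGDClassifier(random_state=0)
classes = np.array(["ham", "spam"])  # must be declared up front for partial_fit

# Hypothetical stream of (messages, labels) mini-batches
stream = [
    (["win a free prize now", "see you at lunch"], ["spam", "ham"]),
    (["free entry text now", "are you coming home"], ["spam", "ham"]),
]

for messages, labels in stream:
    X_batch = vectorizer.transform(messages)  # no history needed
    clf.partial_fit(X_batch, labels, classes=classes)

pred = clf.predict(vectorizer.transform(["free prize"]))
```

Each batch is transformed and learned from once, then discarded.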

More than modeling
================

Using pipelines is not limited to machine learning.
It is as easy as defining a few rules to write modular and composable classes.

Let's say you build a Data Engineering platform.

Define a set of inputs (let's say DataFrames): A, B, C, D, E

```python
class Merge(Transformation):
    input = (A, B)
    output = (C,)

    def transform(self, A: DataFrame, B: DataFrame) -> DataFrame:
        ...
        return C

class ExtractUsefulFeatures(Transformation):
    input = (C,)
    output = (D, E)

    def transform(self, C: DataFrame) -> tuple[DataFrame, DataFrame]:
        ...
        return (D, E)
```

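Notice that every transform method accepts and returns DataFrames. It is important to decouple IO operations from the transformation logic. Loading the data inside the method, as below, would be a mistake, because the transformation could then no longer be tested or reused without the data source:

```python
def transform() -> DataFrame:
    A = load_A()
    B = load_B()
    ...
    return C
```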
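
One way to keep IO at the edges, sketched with hypothetical `merge_frames` and `run` helpers and a hypothetical `key` column. None of these names come from a real platform:

```python
import pandas as pd

def merge_frames(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
    """Pure transformation: DataFrames in, DataFrame out, no IO."""
    return a.merge(b, on="key", how="inner")

def run(load_a, load_b, save):
    """IO lives at the edges; the loaders and the saver are passed in."""
    save(merge_frames(load_a(), load_b()))

# In a test, the loaders can be replaced by in-memory frames.
a = pd.DataFrame({"key": [1, 2], "x": [10, 20]})
b = pd.DataFrame({"key": [2, 3], "y": [0.5, 0.7]})
results = []
run(lambda: a, lambda: b, results.append)
print(results[0])
```

Because `merge_frames` never touches a file or database, it can be exercised with small hand-made frames.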

Scikit-learn pipelines to the rescue
-------------

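Fortunately, scikit-learn provides a set of helpful functions for building pipelines.
2 of them are the most important:

1. `sklearn.pipeline.make_pipeline`

    Creates a pipeline that applies its steps one after another

2. `sklearn.pipeline.make_union`

    Creates a union of transformers that run side by side on the same input and whose outputs are concatenated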
    
    ```
    
             transformer 1
           /               \
          /                 \
    input                     output
          \                 /    
           \               /
             transformer 2
             
    ```
             
    It is useful when the dataset consists of several types of data that one must 
    deal with separately.
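
A small sketch of both helpers together. The choice of scalers and the random data are only for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

# Sequential: scale, then project onto 2 principal components.
sequential = make_pipeline(StandardScaler(), PCA(n_components=2))

# Parallel: both branches see the same input; outputs are stacked column-wise.
union = make_union(sequential, MinMaxScaler())

X_out = union.fit_transform(X)
print(X_out.shape)  # (10, 5): 2 PCA columns + 3 min-max scaled columns

# Step names are generated from the lowercased class names.
print(list(sequential.named_steps))
```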


Alternative way to define pipelines
--------------

```python
from sklearn.pipeline import Pipeline
```

Constructing a `Pipeline` directly lets you name the steps. Naming is useful because sometimes we want to control the steps from outside, for example when searching for parameters.
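
A minimal sketch of named steps driving a parameter search, assuming the iris dataset and a small illustrative grid. The step names `scale`, `pca` and `tree` are arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Parameters of a step are addressed as <step name>__<parameter name>.
param_grid = {
    "pca__n_components": [2, 3],
    "tree__max_depth": [2, 4],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```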

Heterogeneous data
==========================

Real datasets are rarely plain matrices of numbers.
In practice they are a mix of:

- categorical features
- numerical features
- dates
- text data
- columns with and without missing values

Still, you must create 1 pipeline to process all these types of information.

Possible transformations:
- **categorical features**:
    - one hot encoding - converting to binary values
    - convert to numerical values - by using a hash of categorical variable
    - target averaging - replace categorical feature with an average of the target
    
- **numerical features**:
    - fill missing values
    - create bins with ranges 
    - normalize, scale
    
- **text**
    - use bag of words vectorization
    - word2vec, sentence2vec

- **dates**
    - extract years, months, days, days of week
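
A few of these transformations sketched in plain pandas, before any pipeline is involved. The column names and values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red"],   # categorical
    "price": [10.0, None, 30.0],       # numerical with a missing value
    "sold_on": pd.to_datetime(["2021-01-04", "2021-02-15", "2021-03-07"]),
})

# Categorical: one-hot encoding
one_hot = pd.get_dummies(df["color"], prefix="color")

# Numerical: fill missing values with the median, then bin into ranges
price = df["price"].fillna(df["price"].median())
price_bin = pd.cut(price, bins=[0, 15, 25, 100], labels=["low", "mid", "high"])

# Dates: extract calendar parts
dates = pd.DataFrame({
    "year": df["sold_on"].dt.year,
    "month": df["sold_on"].dt.month,
    "day_of_week": df["sold_on"].dt.dayofweek,  # Monday == 0
})

features = pd.concat(
    [one_hot, price, price_bin.rename("price_bin"), dates], axis=1
)
print(features)
```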

Implementation
==========================


Normally the data comes in various shapes and formats.

We need a way to combine scikit-learn and pandas DataFrames in order to do something like this:

```python
pipeline = make_pipeline(
     CleanData(),
     make_union(
         make_pipeline(
             Selector('text_column'), 
             CountVectorizer()
         ),
         make_pipeline(
             Selector('numerical_column_1', 'numerical_column_2'), 
             StandardScaler()
         ),
         make_pipeline(
             Selector('categorical_column'), 
             OneHotEncoder()
         ),
      ),
      model
)
```

Pipelines expose the fit/transform/predict interface, so you can fit a whole pipeline on the training data and apply it to the test data without handling each step individually.
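
`Selector` is not part of scikit-learn. One possible sketch of such a helper follows, under the assumption that it takes a single `columns` argument rather than several positional names (scikit-learn estimators are expected to list their parameters explicitly in `__init__`). The toy DataFrame and the `LogisticRegression` model are placeholders:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import OneHotEncoder, StandardScaler

class Selector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: pick columns from a DataFrame.

    A string returns a 1-D Series (what text vectorizers expect);
    a list returns a 2-D DataFrame (what scalers and encoders expect).
    """

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return X[self.columns]

df = pd.DataFrame({
    "text": ["free prize now", "see you soon", "win free cash", "lunch at noon"],
    "length": [14, 12, 13, 13],
    "channel": ["sms", "sms", "email", "email"],
})
y = ["spam", "ham", "spam", "ham"]

features = make_union(
    make_pipeline(Selector("text"), CountVectorizer()),
    make_pipeline(Selector(["length"]), StandardScaler()),
    make_pipeline(Selector(["channel"]), OneHotEncoder(handle_unknown="ignore")),
)
model = make_pipeline(features, LogisticRegression())
model.fit(df, y)
pred = model.predict(df)
```

scikit-learn also ships `sklearn.compose.ColumnTransformer`, which covers the same need without a custom selector class.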

# Exercise

In [None]:
import pandas as pd
import numpy as np

## Loading Data

In [None]:
sd = pd.read_csv('data/smsspamcollection/SMSSpamCollection', sep='\t',
                 names=['target', 'message'])
sd.sample(5)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(sd['message'], 
                                                    sd['target'], 
                                                    random_state=1)

## The usual way

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the CountVectorizer
count_vector = CountVectorizer()

# Learn the vocabulary from the training data and return the document-term matrix
training_data = count_vector.fit_transform(X_train)

# Transform the testing data with the vocabulary learned above.
# Note that we do not fit the CountVectorizer on the testing data.
testing_data = count_vector.transform(X_test)

In [None]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

## With Python pipeline

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
# Transformers - fit, transforms (multiple Transformers)
# Estimator - fit and predict (Single estimator)

In [None]:
pipe1 = Pipeline([
    ('countvec', CountVectorizer()),
    ('classifier', MultinomialNB())
])

In [None]:
X_train.iloc[0]  # positional access; the label 0 may not be in the training split

In [None]:
pipe1.fit(X_train,y_train)

In [None]:
X_test

In [None]:
X_test.shape

In [1]:
pipe1.predict_proba(X_test)


## Save python objects to use later

In [None]:
import joblib

In [None]:
joblib.dump(pipe1,'my_model_pipeline.pkl')

## Loading models

In [2]:
import pandas as pd
import joblib

In [3]:
# joblib.load accepts a path directly, so no file handle is left open
pipe = joblib.load('my_model_pipeline.pkl')

In [4]:
my_msg=['I‘m going to try for 2 months ha ha only joking',
        '''Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. 
        Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's''']

my_df=pd.DataFrame({'message':my_msg})

In [5]:
my_df

| | message |
|---|---|
| 0 | I‘m going to try for 2 months ha ha only joking |
| 1 | Free entry in 2 a wkly comp to win FA Cup fina... |


In [6]:
pipe.predict_proba(my_df['message'])

array([[9.99714387e-01, 2.85613245e-04],
       [9.41423380e-22, 1.00000000e+00]])

In [7]:
pipe.classes_

array(['ham', 'spam'], dtype='<U4')
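
The order of `classes_` is what gives the columns of `predict_proba` their meaning. A small sketch on a made-up toy corpus (not the SMS dataset) that labels the columns explicitly:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus, not the SMS dataset
messages = ["free prize now", "see you at lunch", "win free cash", "call me later"]
labels = ["spam", "ham", "spam", "ham"]

toy_pipe = Pipeline([
    ("countvec", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
toy_pipe.fit(messages, labels)

# Column i of predict_proba corresponds to classes_[i]
proba = pd.DataFrame(
    toy_pipe.predict_proba(["free cash prize", "see you later"]),
    columns=toy_pipe.classes_,
)
print(proba.round(3))
```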

## Another Example

In [8]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the iris dataset bundled with scikit-learn
iris = datasets.load_iris()
X = iris.data
y = iris.target
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
 
X_train.shape, X_test.shape

((112, 4), (38, 4))

In [9]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([('pca', PCA(n_components = 2)), 
                 ('std', StandardScaler()), 
                 ('Decision_tree', DecisionTreeClassifier())], verbose = True)
 
pipe.fit(X_train, y_train)

[Pipeline] ............... (step 1 of 3) Processing pca, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing std, total=   0.0s
[Pipeline] ..... (step 3 of 3) Processing Decision_tree, total=   0.0s


Pipeline(steps=[('pca', PCA(n_components=2)), ('std', StandardScaler()),
                ('Decision_tree', DecisionTreeClassifier())],
         verbose=True)

In [10]:
y_predict = pipe.predict(X_test)
y_predict

array([0, 1, 0, 1, 1, 2, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 2, 0, 1, 0, 2, 1,
       1, 1, 2, 1, 0, 2, 1, 1, 2, 0, 2, 0, 2, 1, 2, 1])

In [11]:
accuracy_score(y_true=y_test, y_pred=y_predict)

0.9210526315789473