# Tidying up Pipelines with DataClasses

## Background

Tidy code makes everyone's life easier.  
The code we write will be read many times, so making it easier to manage will be appreciated later on by everyone on the team.  
Two tools that can help with this cleanliness are Pipelines and Dataclasses.  

> An ML Engineer is 10% ML, 90% Engineer.   

### Pipeline
A Pipeline is a *meta* object that helps manage the processes in an ML model. Pipelines can encapsulate separate processes which can later be combined.  
Being forced to work with [Pipeline objects](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) can be a nuisance at the beginning (especially the conversion between `pandas DataFrame` and `np.ndarray`), but it helps guarantee the quality of the model down the line (no data leakage, modularity, etc.). Here is a [4 min. video by Kevin Markham](https://www.youtube.com/watch?v=yv4adDGcFE8) explaining the advantages of pipelines.   
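
As a minimal sketch of the leakage point (the synthetic data and the step names `scaler` and `clf` are assumptions made for illustration, not part of the pipeline built later in this post), fitting a Pipeline on the training data alone means the scaler's statistics never see the test rows:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data, for illustration only
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, size=(80, 3))
y_train = (X_train[:, 0] > 5.0).astype(int)
X_test = rng.normal(loc=5.0, size=(20, 3))

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)        # the scaler is fitted on the training rows only
test_pred = pipe.predict(X_test)  # the test rows are scaled with the training statistics

# the mean learned by the scaler is the mean of the training data
scaler_mean = pipe.named_steps["scaler"].mean_
```

Calling `fit` and `predict` on the pipeline as a whole is what keeps the preprocessing and the model in step with each other.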

### Dataclass
Another useful ML coding tool is an object that stores datasets along the pipeline. Before Python 3.7 you may have been using a [namedtuple](https://docs.python.org/3.9/library/collections.html?highlight=namedtuple#collections.namedtuple); however, Python 3.7 introduced [dataclasses](https://docs.python.org/3/library/dataclasses.html), which are a great candidate for storing such data objects. Using dataclasses allows for consistency when accessing the various datasets along the Pipeline.
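
To illustrate the difference, here is a small sketch using two hypothetical containers, `SplitTuple` and `SplitData` (neither is used later in the post). A namedtuple is immutable once created, while a dataclass lets us fill in its fields as the data moves along the pipeline:

```python
from collections import namedtuple
from dataclasses import dataclass
from typing import Optional

# hypothetical containers, for illustration only
SplitTuple = namedtuple("SplitTuple", ["name", "X", "y"], defaults=[None, None])


@dataclass
class SplitData:
    name: str
    X: Optional[list] = None
    y: Optional[list] = None


tuple_split = SplitTuple(name="train")
try:
    tuple_split.X = [[1.0], [2.0]]  # namedtuple fields cannot be reassigned
    tuple_assignment_worked = True
except AttributeError:
    tuple_assignment_worked = False

data_split = SplitData(name="train")
data_split.X = [[1.0], [2.0]]  # dataclass fields can be filled in later
data_split.y = [0, 1]
```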


## Pipeline

In this post we will show an advanced pipeline that incorporates preprocessing per column type, handles the categorical columns using the [vtreat](https://github.com/WinVector/pyvtreat) package, and then runs a [catboost](https://catboost.ai/) classifier.  


Let's assume that we have a classification problem and our data has numeric and categorical column types, so we may build our pipeline as follows:  


###  Pipeline

```python
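import vtreat
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# df is assumed to be a pandas DataFrame with a binary "label" column
y = df.pop("label")
X = df.copy(deep=True)

# scale the numeric columns, then drop the ones with zero variance
num_pipe = Pipeline([("scaler", StandardScaler()),
                     ("variance", VarianceThreshold()),
                     ])
# apply num_pipe to the numeric columns and pass the rest through untouched
preprocess_pipe = ColumnTransformer(
    remainder="passthrough",
    transformers=[("num_pipe", num_pipe, X.select_dtypes("number").columns)]
)

pipe = Pipeline([("preprocess_pipe", preprocess_pipe),
                 ("vtreat", vtreat.BinomialOutcomeTreatment()),
])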

```  

In this *pseudo code* our Pipeline applies some preprocessing to the numeric columns, followed by the processing of the categorical columns with the vtreat package (it will pass through all the non-categorical and numeric columns).

Since `catboost` does not have a transform method, we are going to introduce it later on.  
Additionally, since we have an imbalanced dataset, we are going to use `StratifiedShuffleSplit` when splitting our data.  

So now the time has come to [cut up our data](https://getyarn.io/yarn-clip/2c689f11-6d71-425c-a701-81be09ad034e#llil9DAFRQ.copy)...

## Test vs. Train vs. Valid
A common step when developing an ML model is splitting the data into [Test/Train/Valid datasets](https://machinelearningmastery.com/difference-test-validation-datasets/).   
In a nutshell, the differences between the datasets are:  
1. Test - put aside - don't look until final model estimation  
2. Train - dataset to train the model   
3. Valid - dataset to validate the model during the training phase (this can be via iteration, GridSearch or preventing overfitting)   

Each dataset will have similar attributes that we will need to save and access throughout the ML workflow.  
In order to prevent confusion, let's create a `dataclass` to save the datasets in a structured manner.  




```python
# basic dataclass
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class Split:
    name: str  # required, so Split(name="train") works
    X: Optional[np.ndarray] = None
    y: Optional[np.ndarray] = None
    idx: Optional[np.ndarray] = None
    pred_class: Optional[np.ndarray] = None
    pred_proba: Optional[np.ndarray] = None
    kwargs: Optional[dict] = None
```


Now we can create the training and test datasets as follows:  
```python
train = Split(name='train')   
test = Split(name='test')  
```

Each `dataclass` will have the following attributes:  
1. `X` - a numpy ndarray storing all the features   
2. `y` - a numpy array storing the class labels  
3. `idx` - the original indexes (useful for referencing at the end of the pipeline)  
4. `pred_class` - a numpy array storing the predicted classes  
5. `pred_proba` - a numpy ndarray storing the probabilities of the classifications

Additionally, we will store a `name` for the dataclass so we can easily reference it along the pipeline.
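
As a small sketch of how the attributes might be used, the snippet below redefines `Split` with `name` as a regular field (an assumption, so that the block is self-contained) and adds a hypothetical helper, `filled_fields`, that reports which attributes have been populated so far:

```python
from dataclasses import dataclass, fields
from typing import Optional

import numpy as np


@dataclass
class Split:
    name: str
    X: Optional[np.ndarray] = None
    y: Optional[np.ndarray] = None
    idx: Optional[np.ndarray] = None
    pred_class: Optional[np.ndarray] = None
    pred_proba: Optional[np.ndarray] = None


def filled_fields(split: Split) -> list:
    """Hypothetical helper: names of the attributes that are no longer None."""
    return [f.name for f in fields(split) if getattr(split, f.name) is not None]


train = Split(name="train")
train.idx = np.array([0, 2, 3])
train.y = np.array([0, 1, 0])

populated = filled_fields(train)  # fields are reported in declaration order
```

A check like this can be handy midway through a pipeline, when it is easy to lose track of which datasets already carry predictions.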

### Splitting In Action  
There are several ways to split your datasets. When data is imbalanced it is important to split the data with a stratified method, so in our case we chose to use [StratifiedShuffleSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html). In contrast to the simple [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train%20split#sklearn.model_selection.train_test_split), which returns the datasets themselves, StratifiedShuffleSplit returns only the indices for each group. 
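
The difference in return values can be sketched with a toy imbalanced dataset (the array sizes and `random_state` are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# toy data: 20 rows, 80% of class 0 and 20% of class 1
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

# train_test_split hands back the data subsets directly
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=0
)

# StratifiedShuffleSplit.split() yields index arrays, one pair per fold
sss = StratifiedShuffleSplit(n_splits=3, train_size=0.75, random_state=0)
test_positive_counts = []
for fold, (train_idx, test_idx) in enumerate(sss.split(X, y)):
    X_fold_train, y_fold_train = X[train_idx], y[train_idx]
    X_fold_test, y_fold_test = X[test_idx], y[test_idx]
    # stratification keeps the class ratio the same in every fold
    test_positive_counts.append(int(y_fold_test.sum()))
```

Working with indices takes one extra lookup step, but it lets us keep track of which original rows ended up in which dataset.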


```python
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=5, train_size=0.8)
for fold_name, (train.idx, test.idx) in enumerate(sss.split(X, y)):
    # helper that fetches the data itself, since split() only returns the indices
    train, test = get_split_from_idx(X, y, train, test)
    _train_X = pipe.fit_transform(train.X, train.y)  # vtreat also needs the labels
```  
Our helper function is nice and minimal thanks to our `dataclasses`:  

```python
def get_split_from_idx(X, y, split1: Split, split2: Split):
    split1.X, split2.X = X.iloc[split1.idx], X.iloc[split2.idx]
    split1.y, split2.y = y.iloc[split1.idx], y.iloc[split2.idx]
    return split1, split2  
```    



### Pipeline in action
Now we can run the first part of our Pipeline  
```python
_train_X = pipe.fit_transform(train.X, train.y)  # vtreat also needs the labels

```

Once we have fit and transformed our data (allowing the vtreat magic to work), we can introduce `catboost` into our Pipeline.  

```python
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedShuffleSplit

catboost_clf = CatBoostClassifier()
# append the classifier once, before the loop, so the step name stays unique
pipe.steps.append(("catboost_clf", catboost_clf))

train_valid = Split(name="train_valid")
valid = Split(name="valid")
sss_valid = StratifiedShuffleSplit(n_splits=5, train_size=0.9)
for fold_name, (train_valid.idx, valid.idx) in enumerate(sss_valid.split(_train_X, train.y)):
    train_valid, valid = get_split_from_idx(_train_X, train.y, train_valid, valid)

    pipe.fit(train_valid.X, train_valid.y,
             catboost_clf__eval_set=[(valid.X, valid.y)],
    )
```
Notice the following two things:  
1. Using `pipe.steps.append` we are able to introduce steps into the pipeline that could not initially be part of the workflow.  
2. Passing parameters to steps within the pipeline requires the double underscore (`__`) syntax for [nested parameters](https://scikit-learn.org/stable/modules/compose.html#nested-parameters).  
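
Both points can be sketched on a toy example. The `LogisticRegression` step below is only a stand-in for `catboost` so the block runs with scikit-learn alone, and the step names and data are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# start with the transform-only part, then append the estimator later
pipe = Pipeline([("scaler", StandardScaler())])
pipe.steps.append(("clf", LogisticRegression()))

# <step name>__<parameter> addresses a constructor parameter of a step
pipe.set_params(clf__C=0.5)

# the same syntax routes fit-time arguments to a step's fit() method
weights = np.where(y == 1, 2.0, 1.0)
pipe.fit(X, y, clf__sample_weight=weights)
```

The prefix before the double underscore must match the name given to the step, which is why `catboost_clf__eval_set` is used for the step named `catboost_clf`.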


Finally we can get some results:  

```python
test.pred_class = pipe.predict(test.X)
test.pred_proba = pipe.predict_proba(test.X)[:, 1]  # probability of the positive class
```  

Now let's say we want to analyse our model, inspect some specific observations, or generate our confusion matrix. We can run the following code:  

```python

from sklearn.metrics import confusion_matrix
conf_matrix_test = confusion_matrix(y_true=test.y, y_pred=test.pred_class )

```
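
Because each `Split` also keeps the original `idx`, misclassified observations can be traced back to their rows in the source data. A minimal sketch with made-up labels, predictions and indexes:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# made-up values standing in for test.y, test.pred_class and test.idx
y_true = np.array([0, 0, 1, 1, 0, 1])
pred_class = np.array([0, 1, 1, 0, 0, 1])
idx = np.array([10, 11, 15, 18, 21, 30])

conf_matrix = confusion_matrix(y_true=y_true, y_pred=pred_class)
# for binary labels the flattened matrix is ordered tn, fp, fn, tp
tn, fp, fn, tp = conf_matrix.ravel()

# original row indexes of the observations the model got wrong
misclassified_idx = idx[y_true != pred_class]
```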





## Conclusion  

This blog post outlined the advantages of using Pipelines and Dataclasses.  

I hope the example illustrated the potential of such usage and inspires you to try them out.