# Workflow

Now that we've covered the theory, it's time to see how to apply it in Python, and in particular how to build a pipeline.

## Model Selection

### Parametric models

So far we've seen **parametric models** (linear regression, logistic regression) where:

- $\hat{y}=f_\beta(X)$
- A fixed set of parameters $\beta$ models an arbitrarily large number $n$ of datapoints.

Those models are very fast to fit, and they can be trained with stochastic gradient descent.

However, they can't capture complex patterns unless we engineer complex features (x²...).

Whenever the number of parameters to optimize is known in advance, the model is parametric. (So technically, neural networks are parametric models.)
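
To illustrate the idea of a fixed number of parameters, here is a minimal sketch that fits a linear model by ordinary least squares with NumPy on two synthetic datasets of different sizes. The `fit_linear` helper, the data and the "true" coefficients are made up for this example.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares; returns beta with the intercept first."""
    X_design = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
true_beta = np.array([1.0, 2.0, -3.0, 0.5])  # intercept + 3 features (arbitrary values)

for n in (100, 10_000):
    X_lin = rng.normal(size=(n, 3))
    y_lin = true_beta[0] + X_lin @ true_beta[1:] + rng.normal(scale=0.1, size=n)
    beta = fit_linear(X_lin, y_lin)
    print(n, beta.shape)  # the shape of beta does not depend on n
```

Whether the model sees 100 or 10,000 rows, it is summarized by the same four numbers.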

### Non-parametric models

**KNN** and kernel **SVM** are non-parametric: the number of parameters isn't fixed in advance, it grows with the training data.

A ```KNN.fit()``` doesn't minimize a loss function: it simply stores the training points, and the distances to them are computed at prediction time.
Those models can capture complex patterns, but they are slow on large datasets and they tend to overfit.
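
As a rough sketch of what a nearest-neighbours model does, the toy class below memorizes the training set in `fit` and does all the distance computations in `predict`. `TinyKNNRegressor` and the toy arrays are hypothetical; scikit-learn's own implementation is more elaborate.

```python
import numpy as np

class TinyKNNRegressor:
    """Hypothetical brute-force k-nearest-neighbours regressor."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # No loss is minimized: the training data is simply memorized
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y, dtype=float)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Pairwise Euclidean distances, shape (n_queries, n_train)
        dists = np.linalg.norm(X[:, None, :] - self.X_[None, :, :], axis=2)
        nearest = np.argsort(dists, axis=1)[:, :self.k]  # indices of the k closest points
        return self.y_[nearest].mean(axis=1)             # average their targets

X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([0.0, 1.0, 2.0, 10.0])

knn = TinyKNNRegressor(k=2).fit(X_toy, y_toy)
pred = knn.predict(np.array([[0.4], [9.0]]))
print(pred)
```

What the model stores (`knn.X_`) grows with the number of training rows, and every prediction has to compare the query with all of them, which is why this family gets slow on large datasets.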

<div>
<img src="files/sklearn_cheatsheet.png" width="85%" source='https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html' align='center'/>
</div>

**Regression**: If you have a lot of samples (more than 100K), you can go with an **SGD Regressor** (Stochastic Gradient Descent). If not, use Lasso if you think only a few features matter, or Ridge if you're not sure. If this fails, you can go with **SVR-rbf** (Support Vector Regression with a Radial Basis Function kernel), which is non-parametric.

**Classification** (with a labelled dataset): If you have a lot of data, you can also use an SGD classifier, and if it doesn't work, try a kernel approximation. Otherwise, start with linear models such as Linear SVC or logistic regression. If those fail, try Naive Bayes, a KNeighbors classifier, SVC, or ensemble classifiers.

## Pipeline

When building a model, you have many choices to make:

1. **Data preparation**

- Clean the data.
- Create new features.
- Scale the values.

2. **Modeling**

- Choose the right model.
- Choose the right hyperparameters.

3. **Comparing results**

- Compare results between different models.

A pipeline helps us simplify all of these steps.


<div>
<img src="files/pipeline.png" width="55%" source='from Raschka, Sebastian. Python machine learning. Birmingham, UK: Packt Publishing, 2015.' align='center'/>
</div>

The train and test sets must undergo the same transformations, but they must be kept separate at every stage.
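
A minimal sketch of that rule, on synthetic data: the scaler learns its statistics on the train set only, and the very same fitted scaler is then applied to the test set. All the names and values below are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50, scale=10, size=(200, 2))  # synthetic features

X_demo_train, X_demo_test = train_test_split(X_demo, test_size=0.2, random_state=0)

scaler = StandardScaler()
scaler.fit(X_demo_train)                             # statistics learned on the train set only
X_demo_train_scaled = scaler.transform(X_demo_train)
X_demo_test_scaled = scaler.transform(X_demo_test)   # same transformation, no refit

print(scaler.mean_)                      # means of the train set
print(X_demo_test_scaled.mean(axis=0))   # generally not exactly 0
```

The scaled test set is not perfectly centered, and that is expected: it was transformed with the train statistics, exactly as new data would be in production.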

## Preprocessing

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('data/insurance_workflow.csv')

df.head()

In [None]:
X = df.drop(columns='charges')
y = df['charges']

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

## Data Preparation

We're going to :

- Impute missing values.
- Scale numerical features.
- Encode categorical features.
- Fine-tune the model and the preprocessing.

When working on a new project, it's a good habit to start the pipeline right away.

## With one numerical Series (age)

A Pipeline in sklearn takes a list of steps as input. In the cell below, the first step is an 'imputer' and the second a 'standard_scaler'. We give each step a name, which will be handy later.

The pipeline fits its steps in order, from first to last: here it fits the ```SimpleImputer```, then the ```StandardScaler```.
We can fit a single column or an entire DataFrame. Let's start simple with a single column (selected with double brackets, so that it stays two-dimensional).

In [None]:
# Preprocess "age"
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Build the pipeline with the different steps
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")), # replace missing values
    ('standard_scaler', StandardScaler())
                    ])

pipeline.fit(X_train[['age']])
pipeline.transform(X_train[['age']])

## Column Transformer

But our features have different characteristics: a numerical feature isn't preprocessed the same way as a categorical one. The ```ColumnTransformer``` class lets us apply a different transformer to each group of columns.

<div>
<img src="files/column_transformer.png" width="55%" source='https://bait509-ubc.github.io/BAIT509/lectures/lecture5.html' align='center'/>
</div>

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# For numeric features
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# For categorical features
from sklearn.preprocessing import OneHotEncoder


# Impute and then scale numerical values:
num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy="mean")),
    ('standard_scaler', StandardScaler())
])

# Encode categorical values
cat_transformer = OneHotEncoder(handle_unknown='ignore')

# Parallelize "num_transformer" and "cat_transformer"
preprocessor = ColumnTransformer([
    ('num_transformer', num_transformer, ['age', 'bmi']),
    ('cat_transformer', cat_transformer, ['smoker', 'region'])
])

In [None]:
# Visualizing Pipelines in HTML
# If it doesn't display properly try this:
# from sklearn import set_config; set_config(display='diagram')
preprocessor

In [None]:
X_train_transformed = preprocessor.fit_transform(X_train)

In [None]:
# original X_train
X_train.head(3)

In [None]:
# Preprocessed training set
pd.DataFrame(
    X_train_transformed,
    columns=preprocessor.get_feature_names_out()
).head()

### Other columns

The column 'Children' is gone: by default, a ColumnTransformer drops the columns that aren't assigned to a transformer. Passing ```remainder='passthrough'``` keeps them as they are.

In [None]:
preprocessor = ColumnTransformer(
    [('num_transformer', num_transformer, ['age', 'bmi']),
    ('cat_transformer', cat_transformer, ['smoker', 'region'])],
    remainder='passthrough')

X_train_transformed = preprocessor.fit_transform(X_train)

pd.DataFrame(
    X_train_transformed,
    columns=preprocessor.get_feature_names_out()
).head()

### Custom Functions

Sometimes we need to perform operations that don't already exist in sklearn. That's what ```FunctionTransformer``` is for.

In [None]:
from sklearn.preprocessing import FunctionTransformer

# Create a transformer that rounds data to 2 decimals (for instance!)
# rounder = FunctionTransformer(np.round)

# We can use a lambda function for more customizable functions
rounder = FunctionTransformer(lambda x: np.round(x, decimals=2)) # x is an array

In [None]:
# Add it at the end of our numerical transformer
num_transformer = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('rounder', rounder)])

# Encode categorical values
cat_transformer = OneHotEncoder(drop='if_binary',
                                handle_unknown='ignore')

preprocessor = ColumnTransformer([
    ('num_transformer', num_transformer, ['bmi', 'age']),
    ('cat_transformer', cat_transformer, ['region', 'smoker'])],
    remainder='passthrough')
preprocessor

In [None]:
pd.DataFrame(preprocessor.fit_transform(X_train)).head(2)

### Stateless transformations

The ```FunctionTransformer``` class is designed for **stateless transformations**.

**Stateless transformations** are transformations which don't need to store information during ```.fit(X_train)``` that would be used for the ```.transform(X_test)```.

Since a **stateless transformation** doesn't learn anything, fitting does nothing: all the work happens in the transform step.
Examples of transformations which don't "learn" anything:

$X \rightarrow \log(X)$

$(X_1, X_2) \rightarrow X_1 + 5X_2$

You can apply those functions directly on your X without having to store information. It doesn't "fit" anything, it just performs a transformation.
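
As a quick check of that claim, the sketch below wraps `np.log` in a `FunctionTransformer` and fits it on one toy array before transforming another. The arrays are arbitrary; the point is only that the output doesn't depend on what was passed to `fit`.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log)  # stateless: nothing to learn

X_a = np.array([[1.0], [10.0], [100.0]])
X_b = np.array([[1000.0]])

# Fitting on X_a stores no statistics about X_a...
out_after_fit = log_transformer.fit(X_a).transform(X_b)

# ...so the result is the same as calling the function directly
print(out_after_fit, np.log(X_b))
```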

### Stateful transformations

But if we use a **StandardScaler** or a **MinMaxScaler**, those transformations learn some parameters during the fit.

- When we use a **MinMaxScaler**, we need to store the ```min()``` and the ```max()``` of the train set.
- When we apply a **StandardScaler** we need to store the mean and the standard deviation.

And so on...
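
A minimal sketch with toy "age" values (made up for the example) shows the stored state at work: the minimum and maximum come from the train set, and the test set is scaled with them.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages_train = np.array([[20.0], [30.0], [60.0]])
ages_test = np.array([[18.0], [40.0], [70.0]])

minmax = MinMaxScaler().fit(ages_train)
print(minmax.data_min_, minmax.data_max_)  # learned on the train set: 20 and 60

# Applies (x - 20) / (60 - 20) with the stored train statistics
scaled_test = minmax.transform(ages_test)
print(scaled_test.ravel())
```

The test values 18 and 70 fall outside the train range, so they are mapped to -0.05 and 1.25, outside [0, 1]. This is the expected behaviour: the test set must not influence the scaling.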

### A Class for a custom function

We can create our own class, based on the scikit-learn base classes, to handle custom stateful transformations.

In [None]:
# An empty transformer

from sklearn.base import TransformerMixin, BaseEstimator # Classes to inherit from

class MyCustomTranformer(TransformerMixin, BaseEstimator):
    # BaseEstimator provides the get_params() and set_params() methods that all Pipelines require
    # TransformerMixin creates the fit_transform() method from fit() and transform()

    def __init__(self): # empty if no hyperparameters
        pass

    def fit(self, X, y=None):
        # Here you store what needs to be stored/learned during .fit(X_train) as instance attributes
        # Return "self" to allow chaining .fit().transform()
        return self

    def transform(self, X, y=None):
        # Return the result as a DataFrame for an integration into the ColumnTransformer
        return X # placeholder: replace with the actual transformation

In [None]:
my_transformer = MyCustomTranformer()
my_transformer.fit(X_train)
my_transformer.transform(X_train)
my_transformer.transform(X_test)
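
To make the template more concrete, here is one possible stateful transformer: it learns the column means during `fit` and subtracts them in `transform`. `MeanCenterer` and the toy DataFrames are hypothetical and only meant as a sketch.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical example: subtract the column means learned on the train set."""

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        self.means_ = X.mean()   # state learned during fit
        return self              # allows .fit(...).transform(...)

    def transform(self, X, y=None):
        X = pd.DataFrame(X)
        return X - self.means_   # reuses the train means, even on the test set

toy_train = pd.DataFrame({"age": [20.0, 30.0, 40.0], "bmi": [20.0, 25.0, 30.0]})
toy_test = pd.DataFrame({"age": [50.0], "bmi": [25.0]})

centerer = MeanCenterer().fit(toy_train)
centered_test = centerer.transform(toy_test)
print(centered_test)
```

The train means are 30 for `age` and 25 for `bmi`, so the single test row becomes `age = 20`, `bmi = 0`. The trailing underscore in `means_` follows the scikit-learn convention for attributes learned during `fit`.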

## Feature Union

FeatureUnion applies a list of transformer objects in parallel to the input data, then concatenates the results.

This is useful to combine several feature extraction mechanisms into a single transformer.

### bmi_age_ratio feature

In [None]:
X_train.head(3)

In [None]:
from sklearn.pipeline import FeatureUnion

# Create a custom transformer that divides one column by another
# Note that this new feature is arbitrary, it only serves as an example
bmi_age_ratio_constructor = FunctionTransformer(lambda df: pd.DataFrame(df["bmi"] / df["age"]))

union = FeatureUnion([
    ('preprocess', preprocessor), # columns 0-7, it's a ColumnTransformer
    ('bmi_age_ratio', bmi_age_ratio_constructor) # new column 8, it's a FunctionTransformer
])

union

In [None]:
pd.DataFrame(union.fit_transform(X_train)).head(1)

### Summary with "make_" shortcuts.

The "make_" functions are a faster way to build pipelines. They do the same thing, but the step names are generated automatically.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.compose import ColumnTransformer

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import make_union
from sklearn.compose import make_column_transformer

In [None]:
# instead of :

Pipeline([
    ('my_name_for_the_imputer', SimpleImputer()),
    ('my_name_for_the_scaler', StandardScaler())
])

In [None]:
# we can write :

make_pipeline(SimpleImputer(), StandardScaler())

In [None]:
# Code is more compact

num_transformer = make_pipeline(SimpleImputer(), StandardScaler())
cat_transformer = OneHotEncoder()

preproc_basic = make_column_transformer(
    (num_transformer, ['age', 'bmi']),
    (cat_transformer, ['smoker', 'region']),
    remainder='passthrough'
)

preproc_full = make_union(preproc_basic, bmi_age_ratio_constructor)

preproc_full

### Automatic feature selection

In [None]:
X_train.dtypes

In [None]:
from sklearn.compose import make_column_selector

num_col = make_column_selector(dtype_include=['float64'])
cat_col = make_column_selector(dtype_include=['object','bool'])

In [None]:
from sklearn.compose import make_column_selector

# Nothing is "hard coded", it could work on a dataset with any column names
num_transformer = make_pipeline(SimpleImputer(), StandardScaler())
num_col = make_column_selector(dtype_include=['float64'])

cat_transformer = OneHotEncoder()
cat_col = make_column_selector(dtype_include=['object','bool'])

preproc_basic = make_column_transformer(
    (num_transformer, num_col),
    (cat_transformer, cat_col),
    remainder='passthrough'
)

preproc_full = make_union(preproc_basic, bmi_age_ratio_constructor)

preproc_full

### What next?

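You can now use these preprocessors on your train and test sets:

- fit them on the train set
- transform the train set and the test set
- or fit and transform the train set in a single step with ```fit_transform```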

## Using pipeline

### Adding a model inside a pipeline

Model objects can be plugged into Pipelines.
A Pipeline exposes the methods of the last object in the sequence:
- Transformers: fit and transform
- Models: fit, score, predict, etc.
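
A small sketch to check which methods are available, using two stand-alone pipelines whose names are made up for the example:

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

transformer_only = make_pipeline(StandardScaler())      # last step is a transformer
with_model = make_pipeline(StandardScaler(), Ridge())   # last step is a model

for name, pipe in [("transformer_only", transformer_only), ("with_model", with_model)]:
    print(name,
          "| transform:", hasattr(pipe, "transform"),
          "| predict:", hasattr(pipe, "predict"))
```

The pipeline that ends with a scaler can `transform` but not `predict`; the one that ends with `Ridge` can `predict` but not `transform`.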

In [None]:
from sklearn.linear_model import Ridge

# Preprocessor
num_transformer = make_pipeline(SimpleImputer(), StandardScaler())
cat_transformer = OneHotEncoder()

preproc = make_column_transformer(
    (num_transformer, make_column_selector(dtype_include=['float64'])),
    (cat_transformer, make_column_selector(dtype_include=['object','bool'])),
    remainder='passthrough'
)

# Add estimator
pipeline = make_pipeline(preproc, Ridge())
pipeline

### Making predictions

In [None]:
# Train Pipeline
pipeline.fit(X_train, y_train)

# Make predictions
pipeline.predict(X_test.iloc[0:1])

# Score model
pipeline.score(X_test, y_test)

But this score isn't very reliable because it wasn't cross-validated: it relies on the initial train/test split only.

### Cross validation of a pipeline

In [None]:
from sklearn.model_selection import cross_val_score

# Cross-validate the Pipeline. Imputing, scaling and encoding are fitted separately on each training fold: no data leakage.
cross_val_score(pipeline, X_train, y_train, cv=5, scoring='r2').mean()

### Grid Search with a Pipeline

Grid Searching allows you to check which combination of preprocessing/modeling hyperparameters works best.
It is possible to Grid Search the hyperparameters of any component of the Pipeline
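
Hyperparameters of nested components are addressed with scikit-learn's `<step name>__<parameter name>` convention, with one double underscore per level of nesting. A minimal sketch on a stand-alone pipeline (`demo_pipe` is a name made up for this example):

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = make_pipeline(SimpleImputer(), StandardScaler(), Ridge())

# make_pipeline names each step after its class, in lowercase
print(list(demo_pipe.named_steps))

# "<step name>__<parameter name>" reaches inside a step
demo_pipe.set_params(simpleimputer__strategy="median", ridge__alpha=5)
print(demo_pipe.get_params()["ridge__alpha"])
```

The same keys are the ones a grid search expects in its `param_grid`.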

In [None]:
pipeline.get_params()

In [None]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(
    pipeline,
    param_grid={
        # Access any component of the Pipeline
        # and any available hyperparameter you want to optimize
        'columntransformer__pipeline__simpleimputer__strategy': ['mean', 'median'],
        'ridge__alpha': [0.1, 0.5, 1, 5, 10]
    },
    cv=5,
    scoring="r2")

grid_search.fit(X_train, y_train)

grid_search.best_params_

Now that we know that ```ridge__alpha = 1``` is the best value, we can replace our initial list ```[0.1, 0.5, 1, 5, 10]``` with values around 1 and search again.
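
A refined search could look like the sketch below. It runs on synthetic data so that it is self-contained, and the bounds of the finer grid (0.5 to 2) are an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem, only there to make the example runnable
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

fine_grid = np.linspace(0.5, 2, 7)  # 0.5, 0.75, ..., 2.0: values around 1

demo_search = GridSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    param_grid={"ridge__alpha": fine_grid},
    cv=5,
    scoring="r2",
)
demo_search.fit(X_syn, y_syn)
print(demo_search.best_params_)
```

If the best value ends up on the edge of the new grid, that is a hint to shift or widen the grid once more.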



In [None]:
grid_search.best_estimator_

In [None]:
pipeline_tuned = grid_search.best_estimator_

### Debug the pipeline



In [None]:
# Access the components of a Pipeline with `named_steps`
pipeline_tuned.named_steps.keys()

In [None]:
pipeline_tuned

In [None]:
pipeline_tuned.__dict__

In [None]:
# Check intermediate steps
print("Before preprocessing, X_train.shape = ")
print(X_train.shape)
print("After preprocessing, X_train_preprocessed.shape = ")
pipeline_tuned.named_steps["columntransformer"].transform(X_train).shape

In [None]:
# Other example
pipeline_tuned.named_steps['columntransformer'].transformers_[2][1].fit_transform(X_train[['age']])

### Export models

You can export a model and load it again later. Make sure you load it in the same virtual environment (same library versions)!

You can now deploy your model on a server and try it on new data.

In [None]:
import pickle # binary format to export a python object

# Export Pipeline as pickle file
with open("pipeline.pkl", "wb") as file:
    pickle.dump(pipeline_tuned, file)

# Load Pipeline from pickle file
with open("pipeline.pkl", "rb") as file:
    my_pipeline = pickle.load(file)

my_pipeline.score(X_test, y_test)
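
Since a pickled model is only expected to load reliably with the library versions that created it, one possible precaution is to store the scikit-learn version next to the model. The sketch below is an assumption about how this could be done, not a standard API: the `bundle` dictionary, the file name and the small `Ridge` stand-in model are made up for the example.

```python
import pickle
import tempfile
from pathlib import Path

import sklearn
from sklearn.linear_model import Ridge

# Stand-in for the tuned pipeline, so that the example is self-contained
demo_model = Ridge().fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])

# Hypothetical convention: ship the version alongside the model
bundle = {"model": demo_model, "sklearn_version": sklearn.__version__}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "pipeline_bundle.pkl"
    with open(path, "wb") as file:
        pickle.dump(bundle, file)
    with open(path, "rb") as file:
        loaded = pickle.load(file)

if loaded["sklearn_version"] != sklearn.__version__:
    print("Warning: the model was saved with a different scikit-learn version")

reloaded_model = loaded["model"]
print(reloaded_model.predict([[3.0]]))
```

Also keep in mind that unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.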