**Day 5**: ML Workflow 👷 (***live in 1.49/1.50***)

<center><h1 style="color:maroon">Machine Learning Workflow</h1>
    <img src="https://drive.google.com/uc?id=147Lecen6y_grpY_9lWbRj28qay4CBcSN" style="width:1300px">
    <h3><span style="color: #045F5F">Data Science & Machine Learning for Planet Earth Lecture Series</span></h3><h6><i> by Cédric M. John <span style="size:6pts">(2023)</span></i></h6></center>

## Plan for today's Lecture 🗓 

* Motivation for an ML workflow
* Introduction to <code>sklearn.pipeline</code> module
* Writing custom transformers
* Grouping data transformation and models into one object
* Migrating from Notebooks to Python classes

## Intended learning outcomes 👩‍🎓

* Write clean code using pipelines
* Optimize the entire ***data-preparation-to-model-selection*** chain
* Build code deployable locally and on the cloud

# Data Pipelines
<br>

<center><img src="https://drive.google.com/uc?id=1449ihCDJfUk-s9fC3snOpBXI9WHj1oAY" style="width:900px;"><br>
 © Cédric John, 2022; Image generated with <a href="https://openai.com/blog/dall-e/">DALL-E</a><br>
<br>Prompt: Wide angle view of a large metal oil pipeline in the sunrise in the middle of a frozen arctic stepp.</center>

### Dataset

<span style="color:teal">**Today's dataset:**</span><a href="https://www.kaggle.com/datasets/rohanrao/air-quality-data-in-india"> India air quality data, Kaggle</a><br>
<img src="https://drive.google.com/uc?id=13rbICKbhIfs-K0dKdLxxHybLDMWLHRk2" style="width:900px"/>

In [None]:
import pandas as pd
import numpy as np

data = pd.read_csv('Lecture_data/India_air_quality_light.csv')

# REMINDER: Always drop duplicates first
data = data.drop_duplicates()

data

## Reminder of the data processing needed on this dataset (Lecture 1):

1. Remove Outliers
2. Handle Missing Data
3. Scale Features
4. Engineer Features
5. Encode Data
6. Select Features

* Need to train and apply these to our **training set**

* Need to apply the same transformations to our **test set**

* Need to apply the same transformations to our **new data**

* This implies we need to save (and load) <span style="color:red">**several independent transformers**</span>. 

### <span style="color:teal">Messy, work-intensive and entails large potential for bugs and errors.</span>
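
To make the point concrete, here is a minimal sketch of the manual approach. The toy arrays and variable names are made up for illustration and are not part of the lecture dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy data (made up): two numeric features with missing values
X_train_toy = np.array([[1.0, 10.0], [2.0, np.nan], [np.nan, 30.0], [4.0, 40.0]])
X_test_toy = np.array([[np.nan, 20.0], [3.0, np.nan]])

# Each transformer is fitted on the training set only...
imputer = SimpleImputer().fit(X_train_toy)
scaler = StandardScaler().fit(imputer.transform(X_train_toy))

# ...and each one must then be re-applied, in the same order, to any other dataset
X_test_manual = scaler.transform(imputer.transform(X_test_toy))
print(X_test_manual)
```

With only two steps this is manageable. With six steps, each object has to be saved, loaded and applied in the right order, which is where mistakes creep in.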

##  Pipelines: Chaining Data Transformations
<a href="https://scikit-learn.org/stable/modules/compose.html">Sklearn doc</a><br>

### Principle
* **chains** together multiple steps **in sequence**, e.g.:
* *impute* missing values, *then*
* *scale* numerical features, *then*
* *encode* categorical features, *etc...*

* Make your workflow much easier to read and understand.
* Enforce the implementation and order of steps in your project.
* Make your work reproducible and deployable

<img src="https://drive.google.com/uc?id=13pBbXtuoOyHn8NlZhbsdBVk-kVnib-ep" style="width:1200px">
<a href="https://www.packtpub.com/product/python-machine-learning-third-edition/9781789955750">Raschka, S., 2015, Packt Publishing</a>

## Preprocessing pipelines

All preprocessing steps can be built in a pipeline. We are going to predict whether or not the **pollution level is high** as a function of various features. This will illustrate the use of pipelines.

In [None]:
from sklearn.model_selection import train_test_split

y = data.high_level # Target
X = data.drop(columns='high_level') # Features

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

X_train

In [None]:
y_train

## Our first simple pipeline

Let's create a simple pipeline that imputes missing values in <code>so2</code> and <code>no2</code>, then scales both columns.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [None]:
pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler())
])

In [None]:
pipe.fit(X_train[['so2', 'no2']])

In [None]:
pipe.transform(X_test[['so2', 'no2']])

In [None]:
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(SimpleImputer(), StandardScaler())

pipe.fit(X_train[['so2', 'no2']])
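
One difference with <code>make_pipeline</code> is that you do not choose the step names: they are derived from the class names. A small sketch (the variable names are illustrative):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

auto_named_pipe = make_pipeline(SimpleImputer(), StandardScaler())

# Step names are the lowercased class names
step_names = [name for name, _ in auto_named_pipe.steps]
print(step_names)  # ['simpleimputer', 'standardscaler']
```

These generated names are the ones to use later when addressing a step's hyperparameters.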

## Visualizing pipelines in html
We can turn on the diagram visualization to have a nice view of our pipelines:

In [None]:
from sklearn import set_config

set_config(display='diagram')

pipe


<h2 id="FeatureUnion">FeatureUnion</h2>



<ul>
<li>Applies transformers in parallel, independently</li>
<li>Concatenates the feature matrices output by each transformer</li>
<li>Useful to create and add new features</li>
</ul>


In [None]:
from sklearn.pipeline import FeatureUnion

union = FeatureUnion([
    ('pipeline', pipe), # columns 0-1
    ('not_scaled', SimpleImputer()) # new columns 2-3
])

union.fit(X_train[['so2', 'no2']])
union



In [None]:
pd.DataFrame(union.transform(X_test[['so2', 'no2']])).head()
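
The column layout is easy to verify on a toy array. This sketch uses made-up values and illustrative step names:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler

X_toy = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])

toy_union = FeatureUnion([
    ('scaled', make_pipeline(SimpleImputer(), StandardScaler())),  # output columns 0-1
    ('not_scaled', SimpleImputer()),                               # output columns 2-3
])
X_union = toy_union.fit_transform(X_toy)
print(X_union.shape)  # (3, 4)
print(X_union[:, 2:])  # imputed but unscaled values
```

Two input columns go in, four come out: each branch sees the same input and their outputs are stacked side by side.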



## Tackling more complex transformations using <code>ColumnTransformer</code>
* Apply specific changes to specific columns in **parallel**
* A <code>Pipeline</code> object can be passed in <code>ColumnTransformer</code> and vice-versa

<code>from sklearn.compose import ColumnTransformer</code>


💻 Let's do this<br>

* *impute* then *scale* numerical variables
* *encode* categorical variables



In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Impute then Scale for numerical variables: 
num_transformer = Pipeline([
    ('num_imputer', SimpleImputer()),
    ('num_scaler', StandardScaler())])

# Encode categorical variables
cat_transformer = Pipeline([
    ('cat_imputer',SimpleImputer(strategy = 'most_frequent')),
    ('cat_encoder',OneHotEncoder(handle_unknown='ignore', sparse_output=False))
     ])

In [None]:
# Parallelize "num_transformer" and "cat_transformer"
preprocessor = ColumnTransformer([
    ('num_transformer', num_transformer, X_train.select_dtypes(include=np.number).columns),
    ('cat_transformer', cat_transformer, X_train.select_dtypes(exclude=np.number).columns)
])

preprocessor

In [None]:
# Tip to easily select columns in a dataframe based on type:
X_train.select_dtypes(exclude=np.number).columns

In [None]:
preprocessor.fit_transform(X_train)

## How about our feature names?

By default, `sklearn` transformers and pipelines return a numpy array. In the past, retaining the name of the columns was a bit of a juggling exercise. Luckily, starting with `sklearn version 1.2`, we can set the option of `sklearn` to return a `pandas` dataframe (<a href="https://blog.scikit-learn.org/technical/pandas-dataframe-output-for-sklearn-transformer/">See this post for explanation</a>):

In [None]:
from sklearn import set_config
set_config(transform_output = "pandas")

From now on, all of your transformers will return a pandas dataframe:

In [None]:
preprocessor.fit_transform(X_train)

## Makes our life much easier!
But there are limitations. This won't work if you need a sparse output. For this, simply use `transform_output = "default"`.

### Why did we get >5000 features?


* We selected **ALL** non-numeric columns for the OneHotEncoder

* This includes the date, station, location columns!

* Each individual date, station and location becomes a feature column: this leads to an explosion of features. Let's fix this by selecting only the columns we want.
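
The effect is easy to reproduce on made-up data. In this sketch the frame and its column values are invented for illustration: `date` is unique per row while `type` has only two categories.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=100).astype(str),
    'type': ['Residential', 'Industrial'] * 50,
})

# Encoding every non-numeric column: one output column per unique value
n_all = OneHotEncoder(handle_unknown='ignore').fit_transform(toy).shape[1]

# Encoding only the low-cardinality column
n_type_only = OneHotEncoder(handle_unknown='ignore').fit_transform(toy[['type']]).shape[1]

print(n_all, n_type_only)  # 102 2
```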

In [None]:
# Parallelize "num_transformer" and "cat_transformer"
preprocessor = ColumnTransformer([
    ('num_transformer', num_transformer, X_train.select_dtypes(include=np.number).columns),
    ('cat_transformer', cat_transformer, ['state', 'type'])
])

preprocessor

In [None]:
preprocessor.fit_transform(X_train)


In [None]:
preprocessor.transform(X_test)

### Accessing individual transformers

We can easily access individual transformers in the pipeline, and see their properties:

In [None]:
preprocessor.transformers_

In [None]:
preprocessor.transformers_[0][2]

In [None]:
preprocessor.transformers_[1][1][1].get_feature_names_out()
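
Positional indexing works, but it is easy to lose track of which index is which. A fitted <code>ColumnTransformer</code> also exposes <code>named_transformers_</code>, which looks transformers up by name. A minimal sketch on a made-up frame (the column values and variable names are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({'so2': [1.0, 2.0, 3.0], 'type': ['a', 'b', 'a']})

toy_preprocessor = ColumnTransformer([
    ('num_transformer', StandardScaler(), ['so2']),
    ('cat_transformer', OneHotEncoder(handle_unknown='ignore'), ['type']),
]).fit(toy)

# Look up a fitted transformer by the name it was given, then inspect it
fitted_encoder = toy_preprocessor.named_transformers_['cat_transformer']
print(fitted_encoder.categories_)

fitted_scaler = toy_preprocessor.named_transformers_['num_transformer']
print(fitted_scaler.mean_)
```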


<h3 id="Custom-transformer-(basic)">Custom transformer (basic)</h3>
<code>from sklearn.preprocessing import FunctionTransformer</code>
<ul>
<li><p>You will often need to perform custom transformations on your columns</p>
</li>
<li><p><code>FunctionTransformer</code> encapsulates a function into a transformer object</p>
</li>
<li><p>Can work with Pipelines (in series) or with ColumnTransformer (in parallel)</p>
</li>
</ul>


In [None]:

from sklearn.preprocessing import FunctionTransformer

# Create a transformer that rounds data to 2 decimals (for instance!)
# Note: a lambda cannot be pickled; use a named function if the pipeline must be exported
rounder = FunctionTransformer(lambda array: np.round(array, decimals=2))

# Add it at the end of our numerical transformer
num_transformer = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('rounder', rounder)])

preprocessor = ColumnTransformer([
    ('num_transformer', num_transformer, X_train.select_dtypes(include=np.number).columns),
    ('cat_transformer', cat_transformer, ['state', 'type'])
])



preprocessor.fit_transform(X_train)



In [None]:

preprocessor




<p>⚠️ <code>FunctionTransformer</code> only works for <strong>stateless</strong> transformations</p>
<ul>
<li>It cannot "store" information on a <code>fit</code> (for instance on a train set)</li>
<li>In order to apply it back later on a <code>transform</code> (for instance on the test set)</li>
</ul>
<p>✅ stateless transformations</p>
<p>$X \rightarrow \log(X)$<br/>
$(X_1, X_2) \rightarrow X_1 + 5X_2$</p>
<p>❌ Memory-dependent transformation</p>
<p>$X \rightarrow \mathrm{StandardScaler}(X)$</p>
<p>☝️ For this, we will have to code our own <code>Class</code></p>
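
The difference shows up as soon as you look at what each object learns during <code>fit</code>. A small sketch on made-up arrays:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[10.0]])

# Stateless: the result depends only on the value being transformed
log_transformer = FunctionTransformer(np.log1p)
log_transformer.fit(X_train_toy)              # learns nothing from the data
print(log_transformer.transform(X_test_toy))  # log(1 + 10)

# Stateful: the result depends on statistics learned from the training set
scaler = StandardScaler().fit(X_train_toy)
print(scaler.mean_, scaler.scale_)            # stored during fit
print(scaler.transform(X_test_toy))           # uses the training mean and std
```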



<h3 id="Custom-transformer-(advanced)">Custom transformer (advanced)</h3><ul>
<li>To memorize variables (e.g. column mean etc...) during the <code>.fit()</code></li>
<li>To reuse later these variables with <code>.transform()</code> on a different dataset</li>
<li>Must create a Class with both methods and storing memory as <em>instance variables</em></li>
</ul>



<p>✏️ Let's code a <code>CustomScaler(shrink_factor=3)</code> that centers data around its mean, and shrinks it by a <em>fixed</em> factor</p>
<ul>
<li>we will only <code>fit</code> it on the <code>X_train</code></li>
<li>we will use it to <code>transform</code> the <code>X_test</code></li>
</ul>


In [None]:

from sklearn.base import TransformerMixin, BaseEstimator

class CustomScaler(TransformerMixin, BaseEstimator): 
# TransformerMixin generates a fit_transform method from fit and transform
# BaseEstimator generates get_params and set_params methods
    
    def __init__(self, shrink_factor=3):
        self.shrink_factor = shrink_factor
    
    def fit(self, X, y=None):
        self.means = X.mean()
        return self
    
    def transform(self, X, y=None):
        X_transformed = (X - self.means) / self.shrink_factor
        # Return result as dataframe for integration into ColumnTransformer
        return X_transformed



In [None]:

# The CustomScaler can then be used like any other transformer!
custom_scaler = CustomScaler(shrink_factor=3)
custom_scaler.fit(X_train[['no2','rainfall','so2']])
custom_scaler.transform(X_test[['no2','rainfall','so2']]).head()
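
A tiny worked example makes the "memory" visible. This sketch repeats the class so that it runs on its own; the toy frames are made up so that the arithmetic can be checked by hand.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CustomScaler(TransformerMixin, BaseEstimator):
    def __init__(self, shrink_factor=3):
        self.shrink_factor = shrink_factor

    def fit(self, X, y=None):
        self.means = X.mean()  # learned on the data passed to fit
        return self

    def transform(self, X, y=None):
        return (X - self.means) / self.shrink_factor

train_toy = pd.DataFrame({'so2': [0.0, 6.0]})  # training mean = 3
test_toy = pd.DataFrame({'so2': [9.0]})

toy_scaler = CustomScaler(shrink_factor=3).fit(train_toy)
print(toy_scaler.transform(test_toy))  # (9 - 3) / 3 = 2, using the *training* mean
print(toy_scaler.get_params())         # {'shrink_factor': 3}, provided by BaseEstimator
```

The test value is centered with the mean of the training data, not its own, which is exactly what we want when applying a fitted transformer to new data.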



In [None]:
preprocessor.transformers_

In [None]:
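# `transformers_` is rebuilt on every fit, so append to `transformers` instead,
# and pass column names rather than a DataFrame.
# Impute first so that no NaN reaches the model later on.
preprocessor.transformers.append((
    'custom_scaler',
    make_pipeline(SimpleImputer(), CustomScaler(shrink_factor=3)),
    X_train.select_dtypes(include=np.number).columns))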

In [None]:
preprocessor.fit_transform(X_train)

In [None]:

preprocessor



# Including models in pipelines
<br>

<center><img src="https://drive.google.com/uc?id=145ST_iBDPTifgO5F7slXyJsOrmgCRpmr" style="width:900px;"><br>
 © Cédric John, 2022; Image generated with <a href="https://openai.com/blog/dall-e/">DALL-E</a><br>
<br>Prompt: Dramatic view of a rusted yellow oil pump from the 1960's surrounded by the streets of Havana, Cuba.</center>


* Model objects can be plugged into pipelines
* Pipelines inherit the methods of the **last** object in the sequence
* Transformers: <code>fit</code> and <code>transform</code>
* Models: <code>fit</code>, <code>score</code>, <code>predict</code>, etc...
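
You can check which methods a pipeline exposes directly. A short sketch (the variable names are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

transformer_pipe = make_pipeline(StandardScaler())
model_pipe = make_pipeline(StandardScaler(), LogisticRegression())

# A pipeline ending with a transformer can transform, but not predict
print(hasattr(transformer_pipe, 'transform'), hasattr(transformer_pipe, 'predict'))  # True False

# A pipeline ending with a model can predict, but not transform
print(hasattr(model_pipe, 'predict'), hasattr(model_pipe, 'transform'))  # True False
```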


<img src="https://drive.google.com/uc?id=13pBbXtuoOyHn8NlZhbsdBVk-kVnib-ep" style="width:1200px;">
<a href="https://www.packtpub.com/product/python-machine-learning-third-edition/9781789955750">Raschka, S., 2015, Packt Publishing</a>

In [None]:
from sklearn.linear_model import LogisticRegression

# Combine preprocessor and linear model in pipeline
final_pipe = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', LogisticRegression(max_iter=2500))])
final_pipe



In [None]:
# Train pipeline
final_pipe_trained = final_pipe.fit(X_train,y_train)

In [None]:
# Make predictions
final_pipe_trained.predict(X_test)

In [None]:
# Score model
final_pipe_trained.score(X_test,y_test)

### Cross validate a pipeline

In [None]:

from sklearn.model_selection import cross_val_score

# Cross validate pipeline
cross_val_score(final_pipe, X_train, y_train, cv=5, scoring='accuracy').mean()


### Grid search a pipeline
* Check which combination of preprocessing/modelling **hyperparameters** works best
* It is possible to grid search hyperparameters of **any component of the pipeline**
* Sklearn Syntax: <code>step_name__transformer_name__hyperparam_name</code>
* Check available hyperparameters <code>pipe.get_params()</code>
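
The double-underscore naming can be explored without running a grid search. In this sketch the step names (`preprocessing`, `imputer`, `scaler`, `classifier`) are chosen for illustration:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipe = Pipeline([
    ('preprocessing', Pipeline([
        ('imputer', SimpleImputer()),
        ('scaler', StandardScaler())])),
    ('classifier', LogisticRegression()),
])

# Nested parameter names are built by joining step names with a double underscore
params = toy_pipe.get_params()
print('preprocessing__imputer__strategy' in params)  # True
print('classifier__C' in params)                     # True

# The same names work with set_params, which is what GridSearchCV relies on
toy_pipe.set_params(preprocessing__imputer__strategy='median', classifier__C=0.5)
print(toy_pipe.get_params()['preprocessing__imputer__strategy'])  # median
```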



In [None]:

#### Get all pipe components parameters (to find hyper params names)
final_pipe_trained.get_params()



In [None]:
from sklearn.model_selection import GridSearchCV

# Instantiate grid search
grid_search = GridSearchCV(
    final_pipe, 
    param_grid={
        # Access any component of the pipeline, as far back as you want
        'preprocessing__num_transformer__imputer__strategy': ['mean', 'median'],
        'classifier__C': [0.1, 0.5, 1, 5, 10]},
    cv=5,
    scoring="accuracy")

grid_search.fit(X_train, y_train)
grid_search.best_params_


In [None]:
# Getting the best estimator
tuned_pipe = grid_search.best_estimator_

In [None]:
tuned_pipe.score(X_test,y_test)

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Instantiate grid search
grid_search = GridSearchCV(
    final_pipe, 
    param_grid=[{
        # Access any component of the pipeline, as far back as you want
        'classifier':[SVC()],
        'preprocessing__num_transformer__imputer__strategy': ['mean', 'median'],
        'classifier__C':[0.1]},
        {
        # Access any component of the pipeline, as far back as you want
        'classifier':[KNeighborsClassifier()],
        'preprocessing__num_transformer__imputer__strategy': ['mean', 'median'],
        'classifier__n_neighbors':[3,5]}
    ],
    cv=5,
    scoring="accuracy")

# For this demonstration only: limit ourselves to 10% of data for computational efficiency
X_train_small, _, y_train_small, _ = train_test_split(X_train, y_train, train_size=0.1)

grid_search.fit(X_train_small, y_train_small)
grid_search.best_params_



<h3 id="Test-your-pipeline-as-you-build-it">Test your pipeline as you build it</h3><p>As you are building your pipeline, it is important to ensure it works identically to what you have done in your notebook thus far.</p>
<ul>
<li>Check the data preprocessing: Compare the statistics of preprocessed data out of the pipeline to the ones of the same data preprocessed outside the pipeline</li>
<li>Compare the performance of the model out of the pipeline to the one trained outside the pipeline</li>
</ul>
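
One way to run the first check is to compare the two outputs numerically. A minimal sketch on random toy data (the seed, shapes and missing-value rate are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan  # inject roughly 10% missing values

# Preprocessing done step by step, as in an exploratory notebook
X_manual = StandardScaler().fit_transform(SimpleImputer().fit_transform(X_toy))

# The same preprocessing done by a pipeline
X_piped = make_pipeline(SimpleImputer(), StandardScaler()).fit_transform(X_toy)

same_output = np.allclose(X_manual, X_piped)
print(same_output)
```

If the comparison fails, the pipeline and the notebook are not doing the same thing, and it is much cheaper to find out now than after deployment.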


In [None]:

# Access component of pipeline with `named_steps`
final_pipe.named_steps["preprocessing"].fit_transform(X_train).shape



## Exporting models and Pipelines

<ul>
<li>You can export your final model/pipeline as a pickle file</li>
<li>The file can be loaded back into a notebook or deployed on a server</li>
</ul>
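
A quick sanity check after exporting is to confirm that the restored pipeline predicts exactly like the original. This sketch uses random toy data and an in-memory round trip; a file on disk behaves the same way.

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_toy, y_toy)

# Serialise to bytes and back
restored_pipe = pickle.loads(pickle.dumps(toy_pipe))

identical = np.array_equal(toy_pipe.predict(X_toy), restored_pipe.predict(X_toy))
print(identical)
```

Two practical cautions: only unpickle files from sources you trust, and load the file with the same `sklearn` version that was used to save it.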


In [None]:
# LET'S START WITH A CLEAN PIPELINE
num_transformer = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler())])

# Parallelize "num_transformer" and "cat_transformer"
preprocessor = ColumnTransformer([
    ('num_transformer', num_transformer, X_train.select_dtypes(include=np.number).columns),
    ('cat_transformer', cat_transformer, ['state', 'type'])
])

# Combine preprocessor and linear model in pipeline
final_pipe = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', LogisticRegression(max_iter=2500))])

final_pipe.fit(X_train, y_train)

In [None]:
import pickle

# Export pipeline as pickle file
with open("pipeline.pkl", "wb") as file:
    pickle.dump(final_pipe, file)

In [None]:
# Load pipeline from pickle file
with open("pipeline.pkl", "rb") as file:
    my_pipeline = pickle.load(file)
new_samples = pd.read_csv('Lecture_data/new_samples.csv')
new_samples.head()

In [None]:
pred = my_pipeline.predict(new_samples)
new_samples['Predictions'] = pred

pred

# Moving from Notebooks to Packages
<br>

<center><img src="https://drive.google.com/uc?id=13sDCVeMXEc3kXqI1F_udT8SzAHaEYLqY" style="width:900px;"><br>
 © Cédric John, 2022; Image generated with <a href="https://openai.com/blog/dall-e/">DALL-E</a><br>
<br>Prompt: 35 view of a well-organised giant warehouse with endless shelves full of items, orange lighting.</center>


<p><img src="https://drive.google.com/uc?id=149_jQxru5A1q15RiF9860uatiBDRUfmz"/></p>



<p><img src="https://drive.google.com/uc?id=145bhQK2Gkcv9d7FgFG8rxyPCMOAW5dNz" width="1600"/></p>



<h3 id="Notebooks">Notebooks</h3><p>👉 Great for exploration</p>
<p>👉 Offer visible code feedback</p>
<p>👉 Plots and graphs</p>



<h3 id="Packages">Classes / Packages</h3><p>👉 Reusable code</p>
<p>Reuse your code from one project to another, share it with your colleagues, or open source it (remember <code>pip</code>?)</p>



<p>👉 Deployable code</p>
<p>Packages are the Python standard to exchange code; they will allow us to run our code online and on-demand</p>



<p>👉 Automation of tests and deployment (CI/CD)</p>
<p>We want our code to be automatically validated, deployed to production, and to run without error</p>
<p>We want to make sure our newly developed features do not break existing ones</p>


# From Notebook to Python Class


## Given our exploratory notebook:

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Load and split data
data = pd.read_csv('Lecture_data/India_air_quality_light.csv')
data = data.drop_duplicates()

y = data.high_level # Target
X = data.drop(columns='high_level') # Features

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

# Preprocess data

# Impute then Scale for numerical variables: 
num_transformer = Pipeline([
    ('num_imputer', SimpleImputer()),
    ('num_scaler', StandardScaler())])

# Encode categorical variables
cat_transformer = Pipeline([
    ('cat_imputer',SimpleImputer(strategy = 'most_frequent')),
    ('cat_encoder',OneHotEncoder(handle_unknown='ignore', sparse_output=False))
     ])

preprocessor = ColumnTransformer([
    ('num_transformer', num_transformer, X_train.select_dtypes(include=np.number).columns),
    ('cat_transformer', cat_transformer, ['state', 'type'])
])

final_pipe = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', LogisticRegression(max_iter=2500))])

# Instantiate grid search
grid_search = GridSearchCV(
    final_pipe, 
    param_grid={
        # Access any component of the pipeline, as far back as you want
        'preprocessing__num_transformer__num_imputer__strategy': ['mean', 'median'],
        'classifier__C': [0.1, 0.5, 1, 5, 10]},
    cv=5,
    scoring="accuracy")

# Train model with grid search
grid_search.fit(X_train, y_train)
tuned_pipe = grid_search.best_estimator_

# Score final model
tuned_pipe.score(X_test, y_test)

# Predict (new) samples
new_samples = pd.read_csv('Lecture_data/new_samples.csv')
pred = tuned_pipe.predict(new_samples)
new_samples['Predictions'] = pred

new_samples.head()


## Let's create a class that does the following:
* Trains the best Logistic Regression model and saves it
* Prints our test score
* Loads the model and new data to make predictions

## There are multiple ways to do this:
* We can write one <code>PollutionModel</code> class that does it all
* We can write one <code>TrainPollutionModel</code> for training and one <code>PredictPollution</code> class
* We can write more classes as well as files containing utility functions
* ...

👉🏽 Because our code is simple, let's do one class only
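
The lecture's `pollution_model.py` file is not reproduced here, so what follows is only one possible sketch of such a class. The method names `score_model` and `predict` match the calls used in this lecture; everything else is an assumption made for illustration: the constructor arguments `target`, `cat_columns` and `model_path`, the helpers `_build_pipeline` and `_load`, the fixed `random_state`, and the use of `make_column_selector` to pick numeric columns.

```python
import pickle
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class PollutionModel:
    """Hypothetical sketch: trains, saves, scores and reuses a pollution classifier."""

    def __init__(self, data_path, target='high_level',
                 cat_columns=('state', 'type'), model_path='pipeline.pkl'):
        # data_path can be anything pd.read_csv accepts (local path or URL)
        data = pd.read_csv(data_path).drop_duplicates()
        X, y = data.drop(columns=target), data[target]
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            X, y, train_size=0.8, random_state=42)
        self.cat_columns = list(cat_columns)
        self.model_path = model_path
        self.train()

    def _build_pipeline(self):
        num_transformer = Pipeline([
            ('num_imputer', SimpleImputer()),
            ('num_scaler', StandardScaler())])
        cat_transformer = Pipeline([
            ('cat_imputer', SimpleImputer(strategy='most_frequent')),
            ('cat_encoder', OneHotEncoder(handle_unknown='ignore'))])
        preprocessor = ColumnTransformer([
            ('num_transformer', num_transformer,
             make_column_selector(dtype_include=np.number)),
            ('cat_transformer', cat_transformer, self.cat_columns)])
        return Pipeline([
            ('preprocessing', preprocessor),
            ('classifier', LogisticRegression(max_iter=2500))])

    def train(self):
        # Grid search over the same hyperparameters as in the notebook
        grid_search = GridSearchCV(
            self._build_pipeline(),
            param_grid={
                'preprocessing__num_transformer__num_imputer__strategy': ['mean', 'median'],
                'classifier__C': [0.1, 0.5, 1, 5, 10]},
            cv=5, scoring='accuracy')
        grid_search.fit(self.X_train, self.y_train)
        # Save the best pipeline to disk
        with open(self.model_path, 'wb') as file:
            pickle.dump(grid_search.best_estimator_, file)

    def _load(self):
        with open(self.model_path, 'rb') as file:
            return pickle.load(file)

    def score_model(self):
        score = self._load().score(self.X_test, self.y_test)
        print(f'Test accuracy: {score:.3f}')
        return score

    def predict(self, new_data_path):
        new_samples = pd.read_csv(new_data_path)
        new_samples['Predictions'] = self._load().predict(new_samples)
        return new_samples
```

Training in the constructor keeps usage to a single line, at the cost of making instantiation slow. Splitting training and prediction into separate classes, as listed above, avoids retraining when you only want predictions.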

## Let's try our new class!

In [None]:
from pollution_model import PollutionModel

model = PollutionModel('https://drive.google.com/uc?id=13nr-VHGi2zFRpfHhCfz-yHJLX3LQrfTX')

In [None]:
model.score_model()

In [None]:
model.predict('https://drive.google.com/uc?id=13k22u8G6FvV7f1Z8rsCrLksLY3j6SNo1')

# Suggested Resources

## 📺 Videos 
#### Short videos from my Undergraduate Machine Learning Classes:
* 📼 <a href="https://youtu.be/VSGTHhUqIk4?list=PLZzjCZ3QdgQCcRIwQdd-_cJNAUgiEBB_n">Data preparation pipelines</a>

#### Others:

* 📼 <a href="https://developers.google.com/machine-learning/guides/rules-of-ml">Rules of Machine Learning</a> by Google Developers (gives advice on designing ML systems) 

## 📚 Further Reading 
* 📖 <a href="https://hazelcast.com/glossary/data-pipeline/">Data Pipelines</a> by Hazelcast
* 📖 <a href="https://python-packaging-tutorial.readthedocs.io/en/latest/setup_py.html">Python packages: an easy introduction</a> 
* 📖 <a href="https://packaging.python.org/en/latest/tutorials/packaging-projects/">Packaging Python Projects</a> 


## 💻🐍 Time to Code ! 