# **Pipeline**

Scikit-learn’s `Pipeline` class lets you chain preprocessors and, optionally, a model into a single object that you can fit and apply in one go:

![image](images/pipeline.png)

Pipelines can be composed of two different things:
1. Transformer: any object with the `fit()` and `transform()` methods. You can think of a transformer as an object that’s used for processing your data, and you will commonly have multiple transformers in your data preparation workflow. E.g., you might use one transformer to impute missing values, and another one to scale features or one-hot encode your categorical variables. `MinMaxScaler()`, `SimpleImputer()` and `OneHotEncoder()` are all examples of transformers.
2. Estimator: In this article, an “estimator” means a machine learning model; i.e. an object with the `fit()` and `predict()` methods. `LinearRegression()` and `RandomForestClassifier()` are examples of estimators. (Strictly speaking, scikit-learn calls any object with a `fit()` method an estimator, so transformers are estimators too.)

In a pipeline, you can chain together as many transformers as you like, enabling you to apply different data preprocessing steps sequentially. If you like, you can also add an estimator (ML model) as the final step in order to make predictions using the newly transformed data, but it’s not compulsory.
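To make the two interfaces concrete, here is a minimal sketch. The toy arrays and variable names are made up for illustration and are not part of the walkthrough that follows.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): one feature with a missing value
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0])

# Transformer: fit() learns the column mean, transform() fills the gap with it
imputer = SimpleImputer(strategy="mean")
imputer.fit(X_toy)
X_filled = imputer.transform(X_toy)

# Estimator (model): fit() learns from the data, predict() returns predictions
model = LinearRegression()
model.fit(X_filled, y_toy)
predictions = model.predict(X_filled)
```

The transformer returns a new version of the data, while the model returns one prediction per row.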

For example, you could build a pipeline that first imputes missing values with zeros and then one-hot encodes your variables:

![image](images/pipeline_example_1.png)
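In code, a pipeline along these lines might look like the sketch below. The step names (`impute_zero`, `one_hot`) and the toy data are assumptions made for illustration, and the missing values are filled with the constant 0 before encoding.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): one numerically coded categorical column with a gap
X_toy = np.array([[1.0], [2.0], [np.nan], [2.0]])

onehot_pipe = Pipeline(
    steps=[
        ("impute_zero", SimpleImputer(strategy="constant", fill_value=0)),
        ("one_hot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# fit_transform() fits each step in turn and passes its output to the next step
encoded = onehot_pipe.fit_transform(X_toy)

# OneHotEncoder returns a sparse matrix by default; convert it for inspection
encoded_dense = encoded.toarray()
```

After imputation the column holds the categories 0, 1 and 2, so the encoded result has three columns with exactly one 1 per row.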

Or, if you wanted to directly include the modelling in the pipeline itself, you could build a pipeline that imputes missing values with the mean, scales the features and then makes predictions using a `RandomForestRegressor()`:

![image](images/pipeline_example_2.png)
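A rough sketch of such a modelling pipeline is shown below. It uses synthetic data from `make_regression` so that it runs on its own; the step names, variable names and hyperparameters are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data (hypothetical stand-in for a real dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

model_pipe = Pipeline(
    steps=[
        ("impute_mean", SimpleImputer(strategy="mean")),
        ("rescale", MinMaxScaler()),
        ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
    ]
)

# fit() fits the transformers and the model on the training data only
model_pipe.fit(X_train_demo, y_train_demo)

# predict() applies the fitted transformers to new data, then calls the model
y_pred = model_pipe.predict(X_test_demo)
```

Because the model is the last step, a single `fit()` and a single `predict()` cover both the preprocessing and the modelling.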

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load diabetes dataset into pandas DataFrames
X, y = load_diabetes(scaled=False, return_X_y=True, as_frame=True)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
```

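Five rows of the feature data:

|     | age  | sex | bmi  | bp    | s1    | s2    | s3   | s4   | s5     | s6   |
| --- | ---- | --- | ---- | ----- | ----- | ----- | ---- | ---- | ------ | ---- |
| 19  | 41.0 | 1.0 | 24.7 | 83.0  | 187.0 | 108.2 | 60.0 | 3.0  | 4.5433 | 78.0 |
| 90  | 52.0 | 1.0 | 24.0 | 83.0  | 167.0 | 86.6  | 71.0 | 2.0  | 3.8501 | 94.0 |
| 158 | 45.0 | 1.0 | 20.3 | 74.33 | 190.0 | 126.2 | 49.0 | 3.88 | 4.3041 | 79.0 |
| 256 | 35.0 | 1.0 | 41.3 | 81.0  | 168.0 | 102.8 | 37.0 | 5.0  | 4.9488 | 94.0 |
| 288 | 68.0 | 2.0 | 24.8 | 101.0 | 221.0 | 151.4 | 60.0 | 4.0  | 3.8712 | 87.0 |

The corresponding target values (dtype `float64`):

|     | target |
| --- | ------ |
| 19  | 168.0  |
| 90  | 98.0   |
| 158 | 96.0   |
| 256 | 346.0  |
| 288 | 80.0   |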

## **sklearn.pipeline.Pipeline**

Next, we define our `Pipeline`. For now, I’ll just define a simple preprocessing `Pipeline` that includes two steps — impute missing values with the mean, and rescale all features — and I won’t include an estimator/model. The principles, however, are the same regardless of whether or not you include an estimator.

```python
from sklearn import set_config
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Return pandas DataFrames instead of numpy arrays
set_config(transform_output="pandas")

# Build pipeline
pipe = Pipeline(
    steps=[("impute_mean", SimpleImputer(strategy="mean")), ("rescale", MinMaxScaler())]
)
```
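Each step is a `(name, object)` tuple, and the names are more than labels: they let you look up an individual step later. A small sketch, rebuilding the same two-step pipeline so that it runs on its own:

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = Pipeline(
    steps=[("impute_mean", SimpleImputer(strategy="mean")), ("rescale", MinMaxScaler())]
)

# List the step names in the order they will run
step_names = [name for name, _ in pipe.steps]

# Look up a single step by the name it was given
scaler_step = pipe.named_steps["rescale"]
```

Here `step_names` is `["impute_mean", "rescale"]` and `scaler_step` is the `MinMaxScaler` instance held by the pipeline.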

Once we’ve defined our `Pipeline`, we “fit” it to our training dataset, and use it to transform both the training and testing datasets:

```python
# Fit the pipeline to the training data
pipe.fit(X_train)

# Transform data using the fitted pipeline
X_train_transformed = pipe.transform(X_train)
X_test_transformed = pipe.transform(X_test)
```

This will give us two preprocessed datasets (`X_train_transformed` and `X_test_transformed`), ready for any subsequent steps like modelling or feature selection.

The advantage of using a `Pipeline` to handle these preprocessing steps is twofold:
- Protect against leakage: Because every step is fitted only on the training dataset `X_train`, no information about the test set is “leaked” when imputing missing values or rescaling features. The same fitted steps are then applied unchanged to `X_test`.
- Avoid duplication: Without a `Pipeline`, we’d have to fit each preprocessing step separately and then apply it to both `X_train` and `X_test`, repeating those calls for every step. At this small scale, the repetition might not seem too bad, but complex ML workflows can easily grow to 5, 10, or even 20 preprocessing steps. With a `Pipeline`, we can add as many steps as we like and still only have to transform `X_train` and `X_test` once.
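Both points can be seen in a small comparison of the manual approach with the `Pipeline` approach. This is only a sketch: the toy arrays and variable names are made up, and the test set deliberately contains a value larger than anything in the training set.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical): the test set holds a value outside the training range
X_train_toy = np.array([[1.0], [np.nan], [3.0]])
X_test_toy = np.array([[2.0], [5.0]])

# Manual approach: fit each step on the training data, then apply every step
# to each dataset by hand
imputer = SimpleImputer(strategy="mean").fit(X_train_toy)
scaler = MinMaxScaler().fit(imputer.transform(X_train_toy))
manual_train = scaler.transform(imputer.transform(X_train_toy))
manual_test = scaler.transform(imputer.transform(X_test_toy))

# Pipeline approach: one fit, one transform per dataset
toy_pipe = Pipeline(
    steps=[("impute_mean", SimpleImputer(strategy="mean")), ("rescale", MinMaxScaler())]
)
pipe_train = toy_pipe.fit_transform(X_train_toy)
pipe_test = toy_pipe.transform(X_test_toy)

# The scaler's upper limit comes from the training data (3.0), not the test data (5.0)
learned_max = toy_pipe.named_steps["rescale"].data_max_
```

The two approaches produce identical results, but the manual version needs two extra calls for every step that is added. Since the scaler only ever saw the training data, the test value 5.0 is scaled to 2.0 rather than being squeezed into the 0 to 1 range.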

# **Extra Reading**

https://medium.com/data-science/simplify-your-data-preparation-with-these-4-lesser-known-scikit-learn-classes-70270c94569f