<img src="./images/banner.png" width="800">

# Pipeline API

In the world of machine learning, data preprocessing and model training often involve multiple steps that need to be executed in a specific order. Scikit-learn's Pipeline API provides a powerful tool to streamline this process, making your code more organized, efficient, and less prone to errors.


A Pipeline in scikit-learn is a sequence of data transformation steps coupled with a final estimator.


Because the whole chain behaves as a single estimator, its steps can be cross-validated together and their parameters tuned jointly. A pipeline therefore encapsulates an entire workflow, from data preprocessing to model fitting, in one object.


Why use pipelines?
1. **Simplicity:** Pipelines simplify your code by wrapping multiple steps into a single object.
2. **Preventing Data Leakage:** During cross-validation, every transformer in a pipeline is fit only on the training portion of each split, so statistics from the held-out data cannot accidentally leak into preprocessing.
3. **Convenience:** Pipelines can be used with grid searches and other model selection tools, making hyperparameter tuning more straightforward.
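
To make the leakage point concrete, here is a minimal sketch that passes a whole pipeline to `cross_val_score`. The diabetes dataset and the `Ridge` model are choices made for illustration, and the variable name `leak_free` is hypothetical.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# The scaler lives inside the pipeline, so each CV split refits it
# on that split's training portion only.
leak_free = Pipeline([("scaler", StandardScaler()), ("regressor", Ridge())])
scores = cross_val_score(leak_free, X, y, cv=5, scoring="r2")

print(f"Mean R-squared across folds: {scores.mean():.4f}")
```

Scaling the full dataset once before calling `cross_val_score` would instead let the held-out folds influence the scaler's mean and variance.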


A typical scikit-learn pipeline consists of a series of transformers followed by a final estimator. Here's a basic representation:


```python
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('step1', transformer1),
    ('step2', transformer2),
    ...
    ('stepN', estimator)
])
```


Each step in the pipeline is a tuple containing a string (the step name) and an estimator object. Here's what happens when you fit the pipeline and later predict with it:
1. During fitting, the Pipeline calls `fit_transform()` on each transformer in order, passing the output of each step as input to the next step.
2. The final estimator is fit on the output of the last transformer.
3. For prediction, the Pipeline passes the input data through each fitted transformer's `transform()` (nothing is refit) and then calls `predict()` on the final estimator.
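
The steps above can be sketched by writing the same workflow twice: once by hand and once with a `Pipeline`. This is an illustrative comparison rather than a description of scikit-learn's internals. The simple row-based split and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_new = X[:400], X[400:]
y_train = y[:400]

# Manual version: fit_transform on the transformer, fit the estimator,
# then transform (not refit) new data before predicting.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
model = Ridge().fit(X_scaled, y_train)
manual_pred = model.predict(scaler.transform(X_new))

# Pipeline version: the same sequence behind a single fit/predict interface.
pipe = Pipeline([("scaler", StandardScaler()), ("regressor", Ridge())])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_new)

print(np.allclose(manual_pred, pipe_pred))
```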


<img src="./images/pipeline.png" width="800">

💡 **Pro Tip:** Pipelines implement the same API as regular estimators, so you can use them in place of estimators in many scikit-learn functions.


There are several benefits to using pipelines:
1. **Code Organization:** Pipelines help structure your machine learning code, making it more readable and maintainable.
2. **Automated Ordering:** Pipelines ensure that operations are performed in the correct order, reducing the chance of errors.
3. **Parameter Setting:** You can set parameters for all steps in the pipeline using a single `set_params()` call.
4. **Grid Search Integration:** Pipelines can be used directly with `GridSearchCV` for efficient hyperparameter tuning across all steps.


In real-world machine learning projects, data preprocessing and model training are rarely simple, one-step processes. Pipelines provide a structured way to manage complex workflows, ensuring reproducibility and reducing the potential for errors.


By using Pipelines, you're adopting a best practice that will serve you well as your machine learning projects grow in complexity. In the following sections, we'll dive deeper into creating and using pipelines, exploring their full potential in scikit-learn.

**Table of contents**<a id='toc0_'></a>    
- [Creating and Using Simple Pipelines](#toc1_)    
  - [Constructing a Basic Pipeline](#toc1_1_)    
  - [Fitting and Predicting with Pipelines](#toc1_2_)    
  - [Accessing Steps in a Pipeline](#toc1_3_)    
  - [Setting Parameters](#toc1_4_)    
  - [Why Simple Pipelines Matter](#toc1_5_)    
- [Advanced Pipeline Techniques](#toc2_)    
  - [FeatureUnion: Combining Features](#toc2_1_)    
  - [Column Transformer: Applying Different Transformations to Different Columns](#toc2_2_)    
  - [Memory Caching](#toc2_3_)    
  - [Custom Transformers](#toc2_4_)    
- [Summary](#toc3_)    


## <a id='toc1_'></a>[Creating and Using Simple Pipelines](#toc0_)

Building on our understanding of what pipelines are, let's dive into creating and using simple pipelines in scikit-learn. This section will walk you through the process of constructing basic pipelines and demonstrate their utility in streamlining machine learning workflows.


### <a id='toc1_1_'></a>[Constructing a Basic Pipeline](#toc0_)


To create a pipeline, we use the `Pipeline` class from scikit-learn. Here's a simple example:


In [10]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

pipeline = Pipeline([("scaler", StandardScaler()), ("regressor", Ridge())])

In this example, we've created a pipeline that first scales the input data using `StandardScaler`, then fits a `Ridge` regression model. Note that not every sequence of steps is valid: every step except the last must be a transformer (it must implement `fit` and `transform`), so placing a regressor or classifier anywhere but the final position produces an invalid pipeline.

<img src="./images/invalid-vs-valid.png" width="800">

🔑 **Key Concept:** Each step in the pipeline is a tuple containing a string (the step name) and an estimator object.
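
As a small sketch of what an invalid pipeline looks like in practice, the example below puts a regressor in front of a scaler. The random data is made up for illustration. Depending on the scikit-learn version, the check may run at construction or at `fit`, so both are wrapped in the same `try` block.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)

error_message = None
try:
    # Ridge has no transform(), so it cannot be an intermediate step.
    bad_pipeline = Pipeline([("regressor", Ridge()), ("scaler", StandardScaler())])
    bad_pipeline.fit(X, y)
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```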


### <a id='toc1_2_'></a>[Fitting and Predicting with Pipelines](#toc0_)


Once you've created a pipeline, you can use it just like any other scikit-learn estimator:


In [11]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes

# Load the diabetes dataset
X, y = load_diabetes(return_X_y=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit every step of the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make predictions on new data
pipeline.predict(X_test)

array([139.86277405, 179.9582406 , 135.71687469, 292.11578228,
       123.18931464,  92.63420961, 257.85540931, 182.98437571,
        88.57110276, 109.34130188,  94.44005222, 166.79559349,
        56.34471823, 206.28537014,  99.83697447, 131.14352119,
       220.13422127, 249.66324991, 196.24491272, 217.48836765,
       207.27026897,  88.62662434,  71.00067256, 188.77371926,
       155.27179344, 160.11509377, 188.68359243, 179.4462606 ,
        48.58194512, 109.46862532, 176.30864771,  87.81175183,
       132.57482949, 183.48671555, 173.64887924, 190.57967229,
       123.89852066, 119.22466026, 147.3190418 ,  59.64849436,
        72.53769889, 107.79400092, 164.47289681, 153.51402972,
       172.19580463,  62.65517635,  73.5923177 , 112.86223345,
        53.52690481, 165.85907297, 153.72729198,  63.69903686,
       106.37934882, 108.94605495, 174.12346145, 156.07183591,
        94.49282843, 209.90083637, 119.67133314,  75.31549064,
       187.08319266, 205.38638319, 140.93273277, 105.55

The `fit` method will automatically:
1. Fit the `StandardScaler` to the training data and transform it.
2. Use the transformed data to fit the `Ridge` model.


Similarly, the `predict` method will:
1. Apply the fitted `StandardScaler` to the test data.
2. Use the transformed data to make predictions with the fitted `Ridge` model.


### <a id='toc1_3_'></a>[Accessing Steps in a Pipeline](#toc0_)


You can access individual steps in a pipeline using the named steps:


In [12]:
scaler = pipeline.named_steps["scaler"]
regressor = pipeline.named_steps["regressor"]

This allows you to inspect or modify individual components if needed.
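
Besides `named_steps`, a pipeline can also be indexed by step name, by position, or with a slice. The sketch below refits a small pipeline on the diabetes data so that it is self-contained. The variable names are illustrative.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
pipe = Pipeline([("scaler", StandardScaler()), ("regressor", Ridge())]).fit(X, y)

# Three equivalent ways to reach the same fitted step
by_name = pipe.named_steps["scaler"]
by_key = pipe["scaler"]
by_position = pipe[0]
same_object = by_name is by_key is by_position

# Slicing returns a sub-pipeline; here, everything except the final estimator
preprocessing_only = pipe[:-1]
X_scaled = preprocessing_only.transform(X)

# Learned attributes of a step are available after fitting
print(pipe["regressor"].coef_.shape)
```

Slicing off the final estimator is handy when you want to look at the data exactly as the model sees it.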


### <a id='toc1_4_'></a>[Setting Parameters](#toc0_)


One of the powerful features of pipelines is the ability to set parameters for all steps using a single method call:


In [13]:
pipeline.set_params(regressor__alpha=0.1, scaler__with_mean=False)

Note the double underscore syntax: `step_name__parameter_name`.


💡 **Pro Tip:** This unified parameter setting is particularly useful when doing grid search for hyperparameter tuning.
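
Here is a minimal sketch of how the same double-underscore names are used in a grid search. The parameter values in `param_grid` are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
pipe = Pipeline([("scaler", StandardScaler()), ("regressor", Ridge())])

# Keys follow the step_name__parameter_name convention
param_grid = {
    "scaler__with_mean": [True, False],
    "regressor__alpha": [0.01, 0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X, y)

print(search.best_params_)
```

Because the search clones and refits the entire pipeline for every candidate and every fold, preprocessing parameters are tuned under the same leakage-free conditions as model parameters.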


### <a id='toc1_5_'></a>[Why Simple Pipelines Matter](#toc0_)


Simple pipelines offer several advantages:

1. **Cleaner Code:** They reduce the amount of code you need to write for data preprocessing and model training.
2. **Reduced Errors:** By encapsulating the entire process, pipelines reduce the chance of applying transformations incorrectly.
3. **Easy Experimentation:** You can quickly swap out components (e.g., trying a different regressor or classifier) without changing the overall structure of your code.
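
As a sketch of what swapping components can look like, `set_params` also accepts a step name on its own, which replaces the whole step. `Lasso` and its `alpha` value are arbitrary choices for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
pipe = Pipeline([("scaler", StandardScaler()), ("regressor", Ridge())])

# Replace the final estimator without rebuilding the pipeline
pipe.set_params(regressor=Lasso(alpha=0.1))
pipe.fit(X, y)
final_step_name = type(pipe["regressor"]).__name__

# Disable a step by replacing it with the string "passthrough"
pipe.set_params(scaler="passthrough")
pipe.fit(X, y)
predictions = pipe.predict(X)

print(final_step_name)
```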


❗️ **Important Note:** While simple pipelines are powerful, remember that they execute steps sequentially. For more complex workflows or when you need branching logic, you might need to explore more advanced pipeline techniques.


In the next section, we'll build on these basics to explore more advanced pipeline techniques, allowing you to tackle even more complex machine learning workflows with ease and efficiency.

## <a id='toc2_'></a>[Advanced Pipeline Techniques](#toc0_)

As you become more comfortable with basic pipelines, scikit-learn offers advanced techniques to handle more complex scenarios. This section explores these advanced concepts, enabling you to create sophisticated machine learning workflows.


### <a id='toc2_1_'></a>[FeatureUnion: Combining Features](#toc0_)


Sometimes, you need to combine features from different transformations. `FeatureUnion` allows you to apply multiple transformer pipelines in parallel and concatenate their outputs.


<img src="./images/feature-union.png" width="600">

In [14]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

In [16]:
# Load the diabetes regression dataset and split it
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [17]:
# Create a FeatureUnion
feature_union = FeatureUnion(
    [
        ("pca", PCA(n_components=5)),
        ("select_best", SelectKBest(f_regression, k=3)),
    ]
)

In [20]:
# Create a pipeline with FeatureUnion
pipeline = Pipeline([("features", feature_union), ("regressor", Ridge())])

In [21]:
# Fit and evaluate the pipeline
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"R-squared score: {score:.4f}")

R-squared score: 0.4499


🔑 **Key Concept:** `FeatureUnion` allows you to combine multiple feature extraction methods, potentially capturing different aspects of your data.


In this example, we're using both PCA and SelectKBest to create features. PCA reduces the dimensionality to 5 components, capturing the directions of greatest variance in the data. SelectKBest keeps the 3 original features that score highest on a univariate F-test (`f_regression`) against the target variable. The two sets of features are then concatenated into 8 columns and used to train a Ridge regression model.


This approach can be particularly useful when you have high-dimensional data and you're not sure which feature extraction method will work best. By combining methods, you can potentially capture more informative features than using a single method alone.
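
To see the concatenation directly, the following sketch fits a `FeatureUnion` on its own and compares input and output shapes. It uses the full diabetes dataset rather than a train/test split, purely to keep the example short.

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import FeatureUnion

X, y = load_diabetes(return_X_y=True)

union = FeatureUnion(
    [
        ("pca", PCA(n_components=5)),
        ("select_best", SelectKBest(f_regression, k=3)),
    ]
)
# y is passed along because SelectKBest needs the target to score features
combined = union.fit_transform(X, y)

# 5 PCA components + 3 selected columns are stacked side by side
print(X.shape, "->", combined.shape)
```

Note that the selected columns also contribute to the PCA components, so the combined features are partly redundant. Whether that helps or hurts depends on the model and the data.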


### <a id='toc2_2_'></a>[Column Transformer: Applying Different Transformations to Different Columns](#toc0_)


Many datasets contain features of different types, such as text, floats, and dates, and each type needs its own preprocessing or feature extraction steps. `ColumnTransformer` is designed for this: it applies a different transformer to each specified group of columns and concatenates the results:


<img src="./images/column-transformer.webp" width="800">

<img src="./images/column-transformer.png" width="800">

In [23]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge

In [24]:
# Load California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

In [25]:
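# Define column indices (columns 6 and 7 are Latitude and Longitude;
# they are treated as categorical here for illustration)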
numeric_features = [0, 1, 2, 3, 4, 5]
categorical_features = [6, 7]

In [30]:
# Create ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        (
            "num",
            Pipeline(
                [
                    ("imputer", SimpleImputer(strategy="median")),
                    ("scaler", StandardScaler()),
                ]
            ),
            numeric_features,
        ),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

In [31]:
# Create full pipeline
pipeline = Pipeline([("preprocessor", preprocessor), ("regressor", Ridge())])

In [32]:
# Fit and evaluate the pipeline
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
pipeline.fit(X_train, y_train)

In [33]:
score = pipeline.score(X_test, y_test)
print(f"R-squared score: {score:.4f}")

R-squared score: 0.6710


This setup applies different preprocessing steps to numerical and categorical columns. For numerical features, we first impute missing values with the median and then scale the data. For categorical features, we apply one-hot encoding.


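In scikit-learn's version of the California Housing dataset, all eight features are numeric. This example treats the last two columns (latitude and longitude) as categorical purely to demonstrate the mechanics. With `ColumnTransformer`, each group of columns is preprocessed in its own way before being fed into the `Ridge` regressor.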
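
To make the column routing easier to follow, here is a small sketch on a hypothetical toy array with two numeric columns (one containing a missing value) and one column of category codes. The data and the name `toy_preprocessor` are made up for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one category code (0, 1, or 2)
X_toy = np.array(
    [
        [1.0, 200.0, 0],
        [2.0, np.nan, 1],
        [3.0, 180.0, 2],
        [4.0, 220.0, 1],
    ]
)

numeric_pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)
toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, [0, 1]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),
    ],
    sparse_threshold=0,  # always return a dense array
)
X_out = toy_preprocessor.fit_transform(X_toy)

# 2 scaled numeric columns + 3 one-hot columns
print(X_out.shape)
```

Columns that are not listed in any transformer are dropped by default. Passing `remainder="passthrough"` keeps them unchanged instead.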


### <a id='toc2_3_'></a>[Memory Caching](#toc0_)


For computationally expensive operations, scikit-learn provides memory caching to avoid redundant computations:


In [44]:
from tempfile import mkdtemp
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import PolynomialFeatures

In [45]:
# Load diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

In [46]:
# Set up caching
cachedir = mkdtemp()
cachedir

'/var/folders/g_/1vbng5jn1nv9ztt7z6_k8lg00000gn/T/tmp3bpx_kti'

In [47]:
# Create pipeline with caching
pipeline = Pipeline(
    [
        ("poly", PolynomialFeatures(degree=2)),
        ("scaler", StandardScaler()),
        ("regressor", Ridge()),
    ],
    memory=cachedir,
)

In [49]:
# Fit and evaluate the pipeline
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
pipeline.fit(X_train, y_train)

In [50]:
score = pipeline.score(X_test, y_test)
print(f"R-squared score: {score:.4f}")

R-squared score: 0.4558


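💡 **Pro Tip:** Memory caching can significantly speed up operations like grid search, where the same transformers are refit on the same data for every candidate setting of a later step.


In this example, we're using the diabetes dataset and creating polynomial features, which can be computationally expensive for large datasets. With `memory` set, the pipeline caches its fitted transformers (the final estimator is never cached). If a transformer is fit again with the same parameters on the same training data, for example across grid-search candidates that only change the regressor, the result is loaded from disk rather than recomputed. The directory created by `mkdtemp()` is not removed automatically, so delete it (for example with `shutil.rmtree`) when you are done.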
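
The sketch below combines caching with a grid search, which is a setting where the cache can actually be hit, and cleans up the temporary directory afterwards. The `alpha` values and the 3-fold setting are arbitrary choices for illustration.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_diabetes(return_X_y=True)
cache_dir = mkdtemp()

try:
    cached_pipe = Pipeline(
        [
            ("poly", PolynomialFeatures(degree=2)),
            ("scaler", StandardScaler()),
            ("regressor", Ridge()),
        ],
        memory=cache_dir,
    )
    # Only the final step's parameter varies, so within each CV split the
    # fitted "poly" and "scaler" steps can be reused from the cache.
    search = GridSearchCV(cached_pipe, {"regressor__alpha": [0.1, 1.0, 10.0]}, cv=3)
    search.fit(X, y)
    best_alpha = search.best_params_["regressor__alpha"]
finally:
    rmtree(cache_dir)  # remove the temporary cache directory

print(best_alpha)
```

For transformers as cheap as these, the overhead of writing to disk may outweigh the savings. Caching tends to pay off when the cached steps are slow to fit.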


### <a id='toc2_4_'></a>[Custom Transformers](#toc0_)


You can create custom transformers by inheriting from `BaseEstimator` and `TransformerMixin`:


In [56]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np


class LogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, features_to_transform):
        self.features_to_transform = features_to_transform

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_transformed = X.copy()
        X_transformed[:, self.features_to_transform] = np.log1p(
            X_transformed[:, self.features_to_transform]
        )
        return X_transformed

In [57]:
# Load California housing dataset again
housing = fetch_california_housing()
X, y = housing.data, housing.target

In [58]:
# Create pipeline with custom transformer
pipeline = Pipeline(
    [
        ("log_transform", LogTransformer(features_to_transform=[0, 1, 2])),
        ("scaler", StandardScaler()),
        ("regressor", Ridge()),
    ]
)

In [59]:
# Fit and evaluate the pipeline
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"R-squared score: {score:.4f}")

R-squared score: 0.5438


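This custom transformer applies a log transformation (`np.log1p`, i.e. log(1 + x)) to the specified columns. In the California Housing dataset, columns 0, 1, and 2 are median income, house age, and average rooms per household. A log transform can be useful for features such as median income, which tend to have right-skewed distributions. Keep in mind that `np.log1p` is only defined for values greater than -1.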


Custom transformers allow you to incorporate domain-specific knowledge or complex preprocessing steps that aren't available in scikit-learn's built-in transformers.
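
The `LogTransformer` above is stateless: its `fit` learns nothing. As a complementary sketch, here is a hypothetical transformer, `QuantileClipper`, that does learn state in `fit` and reuses it in `transform`. The class name, its parameters, and the random data are all assumptions made for this example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical example: clip each column to quantiles learned in fit()."""

    def __init__(self, lower=0.01, upper=0.99):
        # Store constructor arguments unchanged so get_params()/set_params() work
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned state uses a trailing underscore by scikit-learn convention
        self.low_ = np.quantile(X, self.lower, axis=0)
        self.high_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Clip with the bounds learned from the training data only
        return np.clip(X, self.low_, self.high_)


rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
X_new = np.array([[100.0, -100.0]])  # extreme values outside the training range

clipper = QuantileClipper(lower=0.05, upper=0.95).fit(X_train)
clipped = clipper.transform(X_new)

print(clipper.get_params())
```

Because the bounds are computed in `fit` and only applied in `transform`, placing this transformer inside a pipeline keeps test data from influencing the clipping thresholds.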


🤔 **Why This Matters:** Advanced pipeline techniques allow you to create more flexible and powerful machine learning workflows. They enable you to handle complex data preprocessing scenarios, optimize performance, and create custom solutions tailored to your specific problems.


By mastering these advanced techniques, you'll be able to tackle a wide range of machine learning challenges efficiently and effectively. The summary below recaps the key benefits of scikit-learn pipelines and suggests next steps.

## <a id='toc3_'></a>[Summary](#toc0_)

Scikit-learn's Pipeline API is a powerful tool for streamlining machine learning workflows, enhancing code organization, and improving model development efficiency. Let's summarize the key benefits:

1. **Simplification of Workflows:** Pipelines encapsulate multiple steps of a machine learning process into a single, cohesive object. This simplification leads to cleaner, more maintainable code and reduces the chance of errors in data processing and model training.

2. **Prevention of Data Leakage:** By keeping all preprocessing steps together with the model, pipelines help prevent accidental data leakage during cross-validation, ensuring the integrity of your model evaluation.

3. **Versatility:** From basic preprocessing and model fitting to advanced techniques like feature union and column transformation, pipelines can handle a wide range of machine learning tasks.

4. **Ease of Parameter Tuning:** Pipelines integrate seamlessly with scikit-learn's model selection tools, making it straightforward to perform grid searches across all steps in your workflow.

5. **Custom Solutions:** The ability to create custom transformers allows you to incorporate domain-specific knowledge and complex preprocessing steps into your pipelines.


Mastering the Pipeline API is crucial for any data scientist or machine learning engineer. It not only improves the quality and reliability of your models but also significantly enhances your productivity and the reproducibility of your work.


As you continue to work with machine learning projects, make it a habit to structure your workflows using pipelines. This practice will pay dividends in terms of code quality, model performance, and ease of experimentation.


To further enhance your skills with scikit-learn pipelines:

1. Experiment with different combinations of preprocessors and estimators in your projects.
2. Practice creating custom transformers for domain-specific tasks.
3. Explore how pipelines can be used with different cross-validation strategies and model selection techniques.
4. Consider how pipelines can be integrated into larger machine learning systems or deployed in production environments.


By consistently applying and expanding your knowledge of pipelines, you'll be well-equipped to tackle complex machine learning challenges efficiently and effectively.