<a href="https://colab.research.google.com/github/szh141/mlproject/blob/main/sklearn_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Reference: https://medium.com/@Coursesteach/supervised-learning-with-scikit-learn-part-14-pipelines-in-scikit-learn-dc408eb152d1

In [2]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

In [3]:
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

We will create a pipeline with three steps:

1. Standardize the features using StandardScaler.

2. Apply Principal Component Analysis (PCA) for dimensionality reduction.

3. Train a Logistic Regression model.


In [10]:
# Define the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('classifier', LogisticRegression())
])

# Train the pipeline
pipeline.fit(X_train, y_train)
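
To make clear what the pipeline does when it is fitted, the sketch below chains the same three steps by hand and compares the predictions with the pipeline's. It is only an illustration under the same settings as above (same split, default estimator parameters); the variable names such as `manual_pred` and `same_predictions` are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Manual version: every transformer is fitted on the training data only.
scaler = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression()

X_train_scaled = scaler.fit_transform(X_train)
X_train_reduced = pca.fit_transform(X_train_scaled)
clf.fit(X_train_reduced, y_train)

# At prediction time the fitted transformers are reused via transform(),
# so no statistics are ever learned from the test set.
X_test_reduced = pca.transform(scaler.transform(X_test))
manual_pred = clf.predict(X_test_reduced)

# Pipeline version: the same sequence behind a single fit/predict interface.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("classifier", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
pipeline_pred = pipeline.predict(X_test)

# Compare the two sets of predictions element by element.
same_predictions = np.array_equal(manual_pred, pipeline_pred)
print(f"Manual and pipeline predictions identical: {same_predictions}")
```

The pipeline saves the bookkeeping of calling `fit_transform` on the training data and `transform` on the test data for every step, which is where hand-written preprocessing code tends to leak test information.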

In [11]:
# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 90.00%
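
The fitted steps stay accessible by name after training, which helps when interpreting a score like the one above. The following sketch rebuilds the same pipeline so that it runs on its own, then looks at how much variance the two principal components retain and at the 2D features the classifier actually receives. The names `explained` and `X_test_2d` are chosen for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("classifier", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# named_steps maps each step name to its fitted estimator.
explained = pipeline.named_steps["pca"].explained_variance_ratio_
print(f"Variance ratio per component: {explained}")
print(f"Total variance retained: {explained.sum():.3f}")

# Slicing off the last step gives a sub-pipeline of the transformers only,
# i.e. scaling followed by PCA, without the classifier.
X_test_2d = pipeline[:-1].transform(X_test)
print(f"Shape seen by the classifier: {X_test_2d.shape}")

# For a classifier pipeline, score() reports accuracy on the given data.
print(f"Test accuracy: {pipeline.score(X_test, y_test):.2f}")
```

If the retained variance is low, raising `n_components` is a natural thing to try, which is what the grid search later in this notebook explores.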


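One of the powerful features of Pipelines is the ability to tune hyperparameters for all steps simultaneously using GridSearchCV or RandomizedSearchCV. A parameter is addressed as the step name, two underscores, and the parameter name, so pca__n_components refers to n_components of the 'pca' step. Because the whole pipeline is refitted on each cross-validation fold, the scaler and PCA never see the held-out fold.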

In [12]:
from sklearn.model_selection import GridSearchCV
# Define the parameter grid
param_grid = {
    'pca__n_components': [2, 3],
    'classifier__C': [0.1, 1, 10]
}
# Create a GridSearchCV object
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
# Perform grid search
grid_search.fit(X_train, y_train)
# Best parameters and score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Score: {grid_search.best_score_:.2f}")

Best Parameters: {'classifier__C': 1, 'pca__n_components': 3}
Best Cross-Validation Score: 0.96
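
RandomizedSearchCV, mentioned above, follows the same pattern but samples a fixed number of candidates instead of trying every combination. The sketch below is one possible setup, not a recommended configuration: the `loguniform(1e-2, 1e2)` range, `n_iter=10`, `random_state=0`, and `max_iter=1000` are all assumptions made for this illustration and do not come from the cells above.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    # max_iter is raised (an assumption) to avoid convergence warnings at large C.
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Lists are sampled uniformly; distributions are sampled via their rvs() method.
param_distributions = {
    "pca__n_components": [2, 3],
    "classifier__C": loguniform(1e-2, 1e2),  # illustrative range
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=10,       # number of sampled candidates (illustrative)
    cv=5,
    random_state=0,  # fixed seed so the sampled candidates are reproducible
)
search.fit(X_train, y_train)

print(f"Best Parameters: {search.best_params_}")
print(f"Best Cross-Validation Score: {search.best_score_:.2f}")

# With the default refit=True, the best pipeline is refitted on all of
# X_train, so the search object can score the held-out test set directly.
test_accuracy = search.score(X_test, y_test)
print(f"Test Accuracy: {test_accuracy:.2f}")
```

Scoring on the test set at the end gives an estimate that was not used for model selection, which the cross-validation score alone does not provide.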
