# Pipeline
A pipeline in machine learning chains multiple processing steps into a single object, so the whole workflow runs the same way every time. Data passes through a predefined sequence of transformations before it reaches the final model, which keeps the process efficient and reproducible. In scikit-learn, the Pipeline class is commonly used for this purpose.

## How It Works
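- Step Definition: Define each step of the pipeline as a (name, estimator) pair. Every step except the last must be a transformer (for example, scaling or feature selection); the last step is usually a model.
- Chaining Steps: Combine these steps in sequential order, so the output of one step becomes the input of the next.
- Fitting the Pipeline: Call fit on the training data. Each transformer is fitted and applied in turn, and the final model is trained on the transformed data.
- Making Predictions: Call predict on new data. The fitted transformers are applied to the new data first, and the final model then makes the prediction.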
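
The four steps above can be sketched by comparing a pipeline with the equivalent manual code. This is a minimal illustration, not a description of scikit-learn internals. The variable names (`manual_predictions`, `same_predictions`, and so on) are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pipeline version: one fit call and one predict call
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
pipeline_predictions = pipeline.predict(X_test)

# Manual version: fit the scaler on the training data only,
# then reuse the fitted scaler to transform the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
classifier = LogisticRegression()
classifier.fit(X_train_scaled, y_train)
manual_predictions = classifier.predict(scaler.transform(X_test))

# Both approaches perform the same computations in the same order
same_predictions = np.array_equal(pipeline_predictions, manual_predictions)
print(same_predictions)
```

The pipeline version is shorter and removes the risk of forgetting to transform the test data, or of accidentally fitting the scaler on it.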

## Simple Example

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Define pipeline steps
steps = [
    ('scaler', StandardScaler()), 
    ('classifier', LogisticRegression())
]

# Create pipeline
pipeline = Pipeline(steps)

# Fit pipeline
pipeline.fit(X_train, y_train)

# Make predictions
predictions = pipeline.predict(X_test)

print(predictions)


## Complex Example

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Load data
data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Define pipeline steps
steps = [
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('classifier', SVC(kernel='linear'))
]

# Create pipeline
pipeline = Pipeline(steps)

# Fit pipeline
pipeline.fit(X_train, y_train)

# Make predictions
predictions = pipeline.predict(X_test)

print(predictions)


## Very Complex Example

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Define pipeline steps
steps = [
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(f_classif)),
    ('classifier', SVC())
]

# Create pipeline
pipeline = Pipeline(steps)

# Define parameter grid
# A list of dicts keeps each classifier's hyperparameters separate,
# so n_estimators is never set on SVC and C is never set on RandomForestClassifier
param_grid = [
    {
        'feature_selection__k': [5, 10, 15],
        'classifier': [SVC()],
        'classifier__C': [0.1, 1, 10]
    },
    {
        'feature_selection__k': [5, 10, 15],
        'classifier': [RandomForestClassifier()],
        'classifier__n_estimators': [10, 50, 100]
    }
]

# Create grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5)

# Fit grid search (by default, the best pipeline is refit on all training data)
grid_search.fit(X_train, y_train)

# Make predictions
predictions = grid_search.predict(X_test)

print(predictions)


## Test the examples

In [None]:
import unittest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris, load_wine, load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif

### test_simple example

In [None]:
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

steps = [
    ('scaler', StandardScaler()), 
    ('classifier', LogisticRegression())
]
pipeline = Pipeline(steps)

param_grid = {
    'classifier__C': [0.1, 1, 10]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
predictions = grid_search.predict(X_test)

# One prediction is expected per test sample
assert len(predictions) == len(y_test)

- Uses StandardScaler and LogisticRegression.
- GridSearchCV optimizes the C parameter of LogisticRegression.
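
To see which value of C the search settled on, the fitted GridSearchCV object can be inspected. The sketch below repeats the setup from this test and reads `best_params_` and `best_score_`. The names `best_c`, `best_cv_accuracy`, and `test_accuracy` are illustrative and not part of the original test.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# The "<step name>__<parameter>" syntax targets a parameter of one step
param_grid = {"classifier__C": [0.1, 1, 10]}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best hyperparameter value and its mean cross-validated accuracy
best_c = grid_search.best_params_["classifier__C"]
best_cv_accuracy = grid_search.best_score_

# Accuracy of the refitted best pipeline on the held-out test set
test_accuracy = grid_search.score(X_test, y_test)

print("Best C:", best_c)
print("Mean CV accuracy:", round(best_cv_accuracy, 3))
print("Test accuracy:", round(test_accuracy, 3))
```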

### test_complex example

In [None]:
data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

steps = [
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('classifier', SVC(kernel='linear'))
]
pipeline = Pipeline(steps)

param_grid = {
    'pca__n_components': [2, 3],
    'classifier__C': [0.1, 1, 10]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
predictions = grid_search.predict(X_test)

# One prediction is expected per test sample
assert len(predictions) == len(y_test)

- Uses StandardScaler, PCA, and SVC.
- GridSearchCV optimizes the number of PCA components and the C parameter of SVC.
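
Because the grid combines two values for the number of PCA components with three values for C, the search evaluates six candidate pipelines. The following sketch, under the same setup as this test, shows one way to confirm that and to look inside the best pipeline through `named_steps`. The names `n_candidates`, `best_pca`, and `explained` are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("classifier", SVC(kernel="linear")),
])

param_grid = {
    "pca__n_components": [2, 3],
    "classifier__C": [0.1, 1, 10],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Every combination of the two parameter lists is one candidate
n_candidates = len(grid_search.cv_results_["params"])

# The fitted PCA step of the best pipeline is reachable by its step name
best_pca = grid_search.best_estimator_.named_steps["pca"]
explained = best_pca.explained_variance_ratio_

print("Candidates evaluated:", n_candidates)
print("Best parameters:", grid_search.best_params_)
print("Variance explained per component:", explained)
```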

### test_very_complex example

In [None]:
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

steps = [
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(f_classif)),
    ('classifier', SVC())
]
pipeline = Pipeline(steps)

param_grid = [
    {
        'feature_selection__k': [5, 10, 15],
        'classifier': [SVC()],
        'classifier__C': [0.1, 1, 10]
    },
    {
        'feature_selection__k': [5, 10, 15],
        'classifier': [RandomForestClassifier()],
        'classifier__n_estimators': [10, 50, 100]
    }
]

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
predictions = grid_search.predict(X_test)

# One prediction is expected per test sample
assert len(predictions) == len(y_test)

- Uses StandardScaler, SelectKBest for feature selection, and either SVC or RandomForestClassifier.
- GridSearchCV optimizes the number of features selected and the parameters of the classifiers.
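
When the grid is a list of dicts, each dict is expanded on its own, so SVC is only ever paired with C and RandomForestClassifier only with n_estimators. The sketch below repeats this test's setup and reports which classifier won. Fixing `random_state=0` on the random forest is an assumption added here for repeatability, and `chosen_name` is an illustrative variable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("feature_selection", SelectKBest(f_classif)),
    ("classifier", SVC()),
])

param_grid = [
    {
        "feature_selection__k": [5, 10, 15],
        "classifier": [SVC()],
        "classifier__C": [0.1, 1, 10],
    },
    {
        "feature_selection__k": [5, 10, 15],
        # random_state is an assumption added here, not part of the original grid
        "classifier": [RandomForestClassifier(random_state=0)],
        "classifier__n_estimators": [10, 50, 100],
    },
]

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# 3 values of k times 3 values per classifier, for each of the two dicts
n_candidates = len(grid_search.cv_results_["params"])

# The winning classifier is the last step of the best pipeline
chosen = grid_search.best_estimator_.named_steps["classifier"]
chosen_name = type(chosen).__name__

print("Candidates evaluated:", n_candidates)
print("Chosen classifier:", chosen_name)
print("Features kept:", grid_search.best_params_["feature_selection__k"])
```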