## Pipeline and Column Transformer

- [Pipeline Doc](https://scikit-learn.org/1.6/modules/generated/sklearn.pipeline.Pipeline.html)
- [Column Transformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)
- [make_pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html)

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

In [2]:
steps = [
    ("scaling", StandardScaler()),
    ("classification", LogisticRegression())
]

steps

[('scaling', StandardScaler()), ('classification', LogisticRegression())]

In [6]:
# visualize pipeline
from sklearn import set_config
set_config(display="diagram")

In [7]:
pipe = Pipeline(steps)
pipe

In [8]:
# create dummy dataset
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000)

X.shape

(1000, 20)

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

**NOTE**

In pipe.fit(X_train, y_train), only X_train passes through the transformations defined in the pipeline (scaling in this case). y_train is not transformed; it is passed unchanged to the final step as the target for training the Logistic Regression model. At prediction time, pipe.predict(X_test) and pipe.score(X_test, y_test) apply the already-fitted scaler to X_test before calling the classifier.
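As a rough check of the note above, the sketch below fits a pipeline and then repeats the same two steps by hand, scaling only the features and passing the labels straight to the classifier. The names `pipe_demo`, `manual_scaler`, and `manual_clf`, and the `random_state=0` seeds, are assumptions added for reproducibility and are not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Seeds are an assumption, added only so the comparison is repeatable
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipeline version: fit() scales X_train, then fits the classifier on (scaled X, y)
pipe_demo = Pipeline([
    ("scaling", StandardScaler()),
    ("classification", LogisticRegression()),
])
pipe_demo.fit(X_train, y_train)

# Manual version: the labels never touch the scaler
manual_scaler = StandardScaler()
X_train_scaled = manual_scaler.fit_transform(X_train)
manual_clf = LogisticRegression().fit(X_train_scaled, y_train)

# At prediction time the fitted scaler is reused with transform(), not refit
manual_pred = manual_clf.predict(manual_scaler.transform(X_test))
same_predictions = np.array_equal(pipe_demo.predict(X_test), manual_pred)
print(same_predictions)
```

Both routes should give the same predictions, since the pipeline simply chains the same calls in order.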

In [10]:
pipe.fit(X_train, y_train)

In [11]:
# y_pred = pipe.predict(X_test)

pipe.score(X_test, y_test)

0.84

## Example: Scaling, PCA, and SVC

In [12]:
from sklearn.decomposition import PCA
from sklearn.svm import SVC

In [13]:
steps = [
    ("feature_scaling", StandardScaler()),
    ("PCA", PCA(n_components=3)),
    ("SVC_clf", SVC())
]

steps

[('feature_scaling', StandardScaler()),
 ('PCA', PCA(n_components=3)),
 ('SVC_clf', SVC())]

In [14]:
pipe2 = Pipeline(steps)

pipe2

In [15]:
# individual steps can be accessed by name and used on their own

pipe2['feature_scaling'].fit_transform(X_train)

array([[-1.66139766, -0.67502958, -0.7032783 , ...,  1.57750223,
        -0.22237766, -1.22773017],
       [ 0.44565621,  0.92305037,  0.81947655, ...,  0.29126085,
         0.92441577, -0.14407055],
       [-0.77277942,  0.6395547 ,  0.09579131, ...,  0.48033678,
         1.50030261,  0.74546052],
       ...,
       [-0.38988017,  1.61863315,  0.79606005, ..., -0.83009736,
        -0.48688572, -0.53570476],
       [-1.79181511,  1.14777206, -0.10679374, ...,  0.64485675,
         0.02927617, -1.59435853],
       [-1.71913477, -0.36758071, -0.94088431, ..., -0.19126264,
        -0.16049695,  1.03820392]], shape=(750, 20))

In [16]:
pipe2.fit(X_train, y_train)

In [17]:
pipe2.predict(X_test)

array([0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0,
       0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0,
       0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0])

In [18]:
pipe2.score(X_test, y_test)

0.824
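For reference, here is a small sketch of the ways a fitted pipeline's steps can be reached: by name, by position, through `named_steps`, or as a sliced sub-pipeline. The variable names and the smaller seeded dataset are assumptions made to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small seeded dataset, assumed here only for illustration
X_demo, y_demo = make_classification(n_samples=200, random_state=0)

pipe_demo = Pipeline([
    ("feature_scaling", StandardScaler()),
    ("PCA", PCA(n_components=3)),
    ("SVC_clf", SVC()),
]).fit(X_demo, y_demo)

# Three equivalent ways to reach the fitted scaler
by_name = pipe_demo["feature_scaling"]
by_position = pipe_demo[0]
by_attribute = pipe_demo.named_steps["feature_scaling"]

# Slicing returns a sub-pipeline; here, everything except the classifier
preprocessing_only = pipe_demo[:-1]
X_reduced = preprocessing_only.transform(X_demo)
print(X_reduced.shape)  # (200, 3): 200 samples, 3 principal components
```

Slicing off the final estimator is a convenient way to inspect what the classifier actually receives as input.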

## Column Transformer

In [21]:
from sklearn.impute import SimpleImputer
import numpy as np
from sklearn.preprocessing import OneHotEncoder

In [24]:
numeric_pipeline = Pipeline(
    [
        ("imputation_mean", SimpleImputer(missing_values=np.nan, strategy="mean")),
        ("standard_scaling", StandardScaler())
    ]
)

numeric_pipeline

In [23]:
categorical_pipeline = Pipeline(
    [
        ("imputation_constant", SimpleImputer(fill_value="missing", strategy="constant")),
        # no scaler here: StandardScaler cannot be applied to string categories
        ("onehot_encoding", OneHotEncoder(handle_unknown="ignore"))
    ]
)

categorical_pipeline

In [26]:
# combine both the pipelines

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    [
        ("categorical", categorical_pipeline, ["gender", "City"]),
        ("numerical", numeric_pipeline, ["age", "height"])
    ]
)

preprocessor

In [27]:
from sklearn.pipeline import make_pipeline

In [28]:
pipe = make_pipeline(preprocessor, LogisticRegression())
pipe
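The notebook stops before fitting this combined pipeline, so the following is a minimal sketch of how it might be used end to end. The DataFrame `df`, its values, and the labels `target` are entirely hypothetical; only the column names come from the cell above. The categorical branch here only imputes and one-hot encodes, since StandardScaler cannot be applied to string columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with a few missing values (not from the notebook)
df = pd.DataFrame({
    "gender": pd.Series(["M", "F", np.nan, "F", "M", "F"], dtype=object),
    "City": pd.Series(["Pune", "Delhi", "Delhi", np.nan, "Pune", "Mumbai"], dtype=object),
    "age": [25.0, 32.0, np.nan, 41.0, 29.0, 38.0],
    "height": [175.0, 160.0, 168.0, np.nan, 180.0, 158.0],
})
target = np.array([0, 1, 1, 0, 0, 1])  # hypothetical labels

numeric_pipeline = Pipeline([
    ("imputation_mean", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("standard_scaling", StandardScaler()),
])
categorical_pipeline = Pipeline([
    ("imputation_constant", SimpleImputer(fill_value="missing", strategy="constant")),
    ("onehot_encoding", OneHotEncoder(handle_unknown="ignore")),
])

# Each branch sees only the columns listed for it
preprocessor = ColumnTransformer([
    ("categorical", categorical_pipeline, ["gender", "City"]),
    ("numerical", numeric_pipeline, ["age", "height"]),
])

# 3 gender categories + 4 City categories (each including "missing") + 2 numeric columns
X_prepared = preprocessor.fit_transform(df)
print(X_prepared.shape)

# make_pipeline names steps after their lowercased class names
pipe_demo = make_pipeline(preprocessor, LogisticRegression())
pipe_demo.fit(df, target)
step_names = list(pipe_demo.named_steps)
predictions = pipe_demo.predict(df)
print(step_names)
```

Unlike `Pipeline`, `make_pipeline` does not take step names, which is why the steps end up as `columntransformer` and `logisticregression`.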