## Pipeline
<p>A Pipeline in ML is a way to bundle all the steps of an ML workflow into a single object.<br>
Instead of manually doing:<br>
- Data Preprocessing (e.g. handling missing values, scaling)<br>
- Feature Transformation (e.g. encoding)<br>
- Model Training<br>
you can put them all inside one pipeline and execute them in sequence.</p>

### Why use Pipeline?
- <b>Cleaner Code: </b>No need to repeat preprocessing separately for training and testing.
- <b>Avoid Data Leakage: </b>Transformations are learned only on the training data and applied to the test data automatically.
- <b>Easier Hyperparameter Tuning: </b>You can tune preprocessing and model parameters together using <b>GridSearchCV.</b>
- <b>Reproducibility: </b>A single object contains the full ML workflow.
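
To make the data-leakage point concrete, here is a minimal sketch. It assumes synthetic NumPy data rather than the Titanic data, and the step names `scaler` and `model` are arbitrary. It checks that the scaler inside a fitted pipeline learned its mean from the training rows only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration, not the Titanic data)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = (X[:, 0] + X[:, 1] > 10).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
pipe.fit(X_train, y_train)  # the scaler is fitted on X_train only

# The scaler's statistics match the training rows, not the full data set
learned_mean = pipe.named_steps['scaler'].mean_
matches_train = np.allclose(learned_mean, X_train.mean(axis=0))
matches_full = np.allclose(learned_mean, X.mean(axis=0))
print(matches_train)  # True

# predict/score reuse the training statistics on the test rows
test_accuracy = pipe.score(X_test, y_test)
```

Because `fit` is only ever called with the training split, the test rows cannot influence the preprocessing statistics.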

#### Basic Structure of a Pipeline
Pipeline(steps=[
('step1', transformer1),
('step2', transformer2),
....
('model', model)
])
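
As one possible concrete instance of this structure, the sketch below builds a three-step pipeline. The step names and the choice of `SimpleImputer`, `StandardScaler` and `LogisticRegression` are illustrative assumptions. It also shows how the step names become prefixes for parameter names.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a (name, estimator) tuple; the names are arbitrary labels
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])

step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['imputer', 'scaler', 'model']

# Steps can be looked up by name
model_class = type(pipe.named_steps['model']).__name__
print(model_class)  # LogisticRegression

# Parameters of a step are addressed as <step name>__<parameter>
has_prefixed_param = 'model__C' in pipe.get_params()
print(has_prefixed_param)  # True
```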

In [1]:
import pandas as pd
import seaborn as sns
df = sns.load_dataset('titanic')

In [2]:
df.shape

(891, 15)

In [3]:
df.head(3)

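| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |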


In [4]:
df = df[["survived","sex","age","fare","embarked"]]

In [5]:
df.shape

(891, 5)

In [6]:
df.head()

| | survived | sex | age | fare | embarked |
|---|---|---|---|---|---|
| 0 | 0 | male | 22.0 | 7.25 | S |
| 1 | 1 | female | 38.0 | 71.2833 | C |
| 2 | 1 | female | 26.0 | 7.925 | S |
| 3 | 1 | female | 35.0 | 53.1 | S |
| 4 | 0 | male | 35.0 | 8.05 | S |


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  891 non-null    int64  
 1   sex       891 non-null    object 
 2   age       714 non-null    float64
 3   fare      891 non-null    float64
 4   embarked  889 non-null    object 
dtypes: float64(2), int64(1), object(2)
memory usage: 34.9+ KB


In [8]:
df.isnull().sum()

survived      0
sex           0
age         177
fare          0
embarked      2
dtype: int64

In [9]:
X = df.drop(columns=['survived'])
y = df['survived']

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

### Pipeline Creation

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder , StandardScaler
from sklearn.compose import ColumnTransformer

In [13]:
import numpy as np

In [14]:
# Numerical Feature
numerical_feature = X_train.select_dtypes(include=np.number).columns
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

In [15]:
# Categorical Feature
categorical_feature = X_train.select_dtypes(include='object').columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

In [16]:
# Combining Preprocessing for both types
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_feature),
    ('cat', categorical_transformer, categorical_feature)
])

In [17]:
# Add Model into the Pipeline
from sklearn.linear_model import LogisticRegression
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

In [18]:
clf.fit(X_train,y_train)

In [19]:
y_pred = clf.predict(X_test)

In [20]:
# Model Evaluation
from sklearn.metrics import accuracy_score, classification_report

In [21]:
# scikit-learn metrics expect the arguments in the order (y_true, y_pred)
print("Accuracy Score : ", accuracy_score(y_test,y_pred))
print("Classification_Report")
print(classification_report(y_test,y_pred))

Accuracy Score :  0.776536312849162
Classification_Report
              precision    recall  f1-score   support

           0       0.80      0.83      0.81       105
           1       0.74      0.70      0.72        74

    accuracy                           0.78       179
   macro avg       0.77      0.77      0.77       179
weighted avg       0.78      0.78      0.78       179



#### Benefits of the Pipeline in Action
- No manual preprocessing for <b> X_test </b> - it's done inside the pipeline.
- If we switch <b>LogisticRegression to RandomForestClassifier</b>, everything else still works.
- We can <b>tune both preprocessing and model hyperparameters</b> in one place.
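
The model swap mentioned in the list above can be sketched as follows. This assumes synthetic NumPy data and a simplified two-step pipeline; `set_params` with a step name replaces that step and leaves the rest of the pipeline untouched.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])
clf.fit(X, y)

# Replace only the final step; the preprocessing definition stays the same
clf.set_params(classifier=RandomForestClassifier(n_estimators=50, random_state=42))
clf.fit(X, y)

swapped_name = type(clf.named_steps['classifier']).__name__
print(swapped_name)  # RandomForestClassifier
predictions = clf.predict(X)
```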

## Hyperparameter Tuning With Pipeline

In [22]:
from sklearn.model_selection import GridSearchCV

In [26]:
param_grid = {
    'classifier__C':[0.1,1.0,10],  # C = inverse regularization strength of LogisticRegression
}

In [27]:
grid_search = GridSearchCV(clf,param_grid,cv=5,n_jobs=-1)

In [28]:
grid_search.fit(X_train,y_train)

In [30]:
print("Best Parameters :",grid_search.best_params_)
print("Best Cross-Validation Score :",grid_search.best_score_)

Best Parameters : {'classifier__C': 0.1}
Best Cross-Validation Score : 0.783669851275485
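
The grid above tunes only the classifier. As a hedged sketch of tuning a preprocessing parameter alongside it, the example below uses synthetic data and a flat three-step pipeline. In the nested pipeline built earlier in this notebook, the corresponding key would be `preprocessor__num__imputer__strategy`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 10% missing values (illustrative assumption)
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

pipe = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])

# Keys follow the pattern <step name>__<parameter name>
param_grid = {
    'imputer__strategy': ['mean', 'median'],  # preprocessing parameter
    'classifier__C': [0.1, 1.0, 10],          # model parameter
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

tuned_keys = sorted(search.best_params_)
n_candidates = len(search.cv_results_['params'])  # 2 strategies x 3 values of C
print(tuned_keys)    # ['classifier__C', 'imputer__strategy']
print(n_candidates)  # 6
```

Every candidate refits the whole pipeline inside each cross-validation fold, so the imputer and scaler never see the validation rows.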


In [31]:
from sklearn.metrics import accuracy_score, classification_report

In [32]:
y_pred = grid_search.predict(X_test)

In [33]:
print("Test Accuracy :",accuracy_score(y_test,y_pred))
print("Classification Report\n",classification_report(y_test,y_pred))

Test Accuracy : 0.776536312849162
Classification Report
               precision    recall  f1-score   support

           0       0.80      0.83      0.81       105
           1       0.74      0.70      0.72        74

    accuracy                           0.78       179
   macro avg       0.77      0.77      0.77       179
weighted avg       0.78      0.78      0.78       179



### Key Points to Remember
- <b>Always put preprocessing inside the pipeline</b> to avoid leakage.
- <b>ColumnTransformer</b> lets you apply different preprocessing to different column types (e.g. numerical vs. categorical).
- Every step except the last must be a transformer; the <b>last step is usually the estimator</b> (classifier/regressor).
- Pipelines can be saved using <b>joblib or pickle</b>.

## Save A Pipeline using Joblib

In [34]:
import joblib

In [35]:
#Save Pipeline
joblib.dump(grid_search.best_estimator_,'titanic_pipeline.pkl')

['titanic_pipeline.pkl']

In [36]:
#Load Pipeline
loaded_pipeline = joblib.load('titanic_pipeline.pkl')

In [37]:
predictions = loaded_pipeline.predict(X_test)

In [40]:
predictions[:10]

array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1])

##### Why Joblib?
- Optimized for scikit-learn objects.
- Handles large NumPy arrays efficiently.
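
As a small illustration, the sketch below saves a fitted pipeline with joblib's optional `compress` argument and checks that the reloaded object gives identical predictions. The synthetic data, the temporary directory and the file name `demo_pipeline.joblib` are assumptions made for this example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_pipeline.joblib')  # hypothetical file name
    # compress takes a level from 0 (none) to 9 (smallest file, slowest)
    joblib.dump(pipe, path, compress=3)
    restored = joblib.load(path)

# The restored pipeline should behave exactly like the original
same_predictions = np.array_equal(pipe.predict(X), restored.predict(X))
print(same_predictions)  # True
```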

## Using Pickle

In [41]:
import pickle

In [43]:
#save 
with open('titanic_pipeline2.pkl','wb') as f:
    pickle.dump(grid_search.best_estimator_,f)

In [44]:
#load 
with open('titanic_pipeline2.pkl','rb') as f:
    loaded_pipeline = pickle.load(f)

In [45]:
predictions = loaded_pipeline.predict(X_test)

In [46]:
predictions[:10]

array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1])

### Note: Joblib is usually more efficient for models that hold large NumPy arrays, but pickle works too.