
## 🔧 **What is a Pipeline in Machine Learning?**

A **Pipeline** (in scikit-learn, `sklearn.pipeline.Pipeline`) is a tool that connects multiple steps of a machine-learning workflow into **a single flow**.

It means:

> **You put all steps together so they run automatically in the correct order.**

Example steps:
Data Cleaning → Scaling → Model Training → Prediction

With a pipeline, a single `fit` or `predict` call runs all these steps together, so you do not repeat the same code for training and test data.

---

## 👉 Simple Example

Normally you write:

* Scale the data
* Train the model

With Pipeline:

```python
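from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([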
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
```

Now a single call to `pipeline.fit(X, y)` will **automatically scale the data and train the model**.
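To make this concrete, here is a minimal sketch that compares the manual workflow with the pipeline. The synthetic data from `make_classification` is an assumption for illustration and is not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Manual workflow: scale, train, then scale again before predicting
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
manual_model = LogisticRegression()
manual_model.fit(X_scaled, y)
manual_pred = manual_model.predict(scaler.transform(X))

# Pipeline workflow: one fit call runs both steps in order
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
pipeline.fit(X, y)
pipeline_pred = pipeline.predict(X)

# Both approaches should produce the same predictions
same_predictions = np.array_equal(manual_pred, pipeline_pred)
print("Same predictions:", same_predictions)
```

The pipeline version needs no separate `scaler.transform` call at prediction time, which removes one common source of mistakes.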

---

## 🎯 Why do we use Pipelines?

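* ✔ Keeps code clean: preprocessing and model live in one object
* ✔ Helps prevent data leakage: transformers are fitted on the training data only
* ✔ Works well with cross-validation: every fold refits all steps on its own training split
* ✔ Makes your workflow easy and repeatable
* ✔ Ensures every step runs in the correct order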
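The data-leakage and cross-validation points can be sketched as follows. This is a minimal example under assumptions: the data is synthetic (`make_classification`), and 5 folds is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration only)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# For each fold, the whole pipeline is cloned and refitted:
# the scaler learns its mean and variance from that fold's training part only,
# so no statistics from the held-out part leak into training.
scores = cross_val_score(pipeline, X, y, cv=5)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

If the scaler were fitted on all of `X` before cross-validation, each held-out fold would already have influenced the scaling, which is the leakage a pipeline helps avoid.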



```python
# import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score

# load dataset
data = sns.load_dataset('titanic')

# features (keep categorical + numeric) and target
X = data[['age', 'fare', 'pclass', 'sex', 'embarked']]
y = data['survived']

# split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# columns
numeric_cols = ['age', 'fare']
categorical_cols = ['pclass', 'sex', 'embarked']

# numeric pipeline: fill missing values with the column mean
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean'))
])

# categorical pipeline: fill missing values with the most frequent category, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# combine preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)
    ]
)

# full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# fit model
pipeline.fit(X_train, y_train)

# predictions
y_pred = pipeline.predict(X_test)

# accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
```

Output:

```
Accuracy: 0.7821229050279329
```
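Once a pipeline like the one above is fitted, you can look inside it and reuse it on new rows. The sketch below rebuilds the same steps on a tiny hand-made table so that it runs without downloading anything. The passenger values and the `new_passenger` row are hypothetical and are not taken from the Titanic dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny stand-in for the Titanic columns (hypothetical values)
train = pd.DataFrame({
    'age': [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, 14.0],
    'fare': [7.25, 71.28, 7.92, 53.10, 51.86, 21.07, 11.13, 30.07],
    'pclass': [3, 1, 3, 1, 1, 3, 3, 2],
    'sex': ['male', 'female', 'female', 'female', 'male', 'male', 'female', 'female'],
    'embarked': ['S', 'C', 'S', 'S', 'S', 'S', 'S', 'C'],
})
labels = [0, 1, 1, 1, 0, 0, 1, 1]

numeric_cols = ['age', 'fare']
categorical_cols = ['pclass', 'sex', 'embarked']

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[('imputer', SimpleImputer(strategy='mean'))]), numeric_cols),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ]), categorical_cols),
])
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)),
])
pipeline.fit(train, labels)

# Look inside the fitted pipeline: the means learned by the numeric imputer
fitted_pre = pipeline.named_steps['preprocessor']
learned_means = fitted_pre.named_transformers_['num'].named_steps['imputer'].statistics_
print("Imputer means (age, fare):", learned_means)

# 2 numeric columns + one-hot columns for pclass (3), sex (2), embarked (2)
n_features = fitted_pre.transform(train).shape[1]
print("Columns after preprocessing:", n_features)

# A new row with a missing age and a port ('Q') never seen during fit:
# the imputer fills the age, and handle_unknown='ignore' encodes 'Q' as all zeros
new_passenger = pd.DataFrame({
    'age': [np.nan], 'fare': [15.0], 'pclass': [2], 'sex': ['male'], 'embarked': ['Q'],
})
prediction = pipeline.predict(new_passenger)
print("Predicted survived:", prediction[0])
```

With only eight rows the prediction itself means little. The point is that the same `predict` call handles imputation and encoding for unseen data.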


## **Hyperparameter Tuning with Pipelines**

```python
# import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score

# load dataset
data = sns.load_dataset('titanic')

# features (keep categorical + numeric) and target
X = data[['age', 'fare', 'pclass', 'sex', 'embarked']]
y = data['survived']

# split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
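The cell above stops after the train/test split. One way the tuning step might continue is sketched below, using `GridSearchCV` on the whole pipeline. Everything specific here is an assumption: the data is a synthetic stand-in with the same column names (so the sketch runs offline), the target is a toy rule, and the grid values are arbitrary examples rather than recommended settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the Titanic columns (an assumption for illustration)
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    'age': rng.uniform(1, 80, n),
    'fare': rng.uniform(5, 100, n),
    'pclass': rng.integers(1, 4, n),
    'sex': rng.choice(['male', 'female'], n),
    'embarked': rng.choice(['S', 'C', 'Q'], n),
})
X.loc[::10, 'age'] = np.nan             # simulate missing ages
y = (X['sex'] == 'female').astype(int)  # toy target, not real survival data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='mean'), ['age', 'fare']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ]), ['pclass', 'sex', 'embarked']),
])
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter name>
param_grid = {
    'classifier__n_estimators': [25, 50],
    'classifier__max_depth': [3, None],
}

# Each candidate is cross-validated with the full pipeline,
# so preprocessing is refitted inside every fold
grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best CV accuracy:", grid_search.best_score_)
print("Test accuracy:", grid_search.score(X_test, y_test))
```

To run this on the real data, replace the synthetic `X` and `y` with the Titanic features and target loaded in the notebook. The double-underscore naming also reaches nested steps, for example `preprocessor__num__strategy` for the numeric imputer in this sketch.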




