## Pipeline
A pipeline chains multiple steps together so that the output of each step becomes the input to the next.<br>
Pipelines make it easy to apply the same preprocessing to both the training and the test data.
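
As a minimal sketch of that idea, the toy example below uses invented arrays (not the Titanic data loaded later). The scaler inside the pipeline learns its minimum and maximum from the training data only, and `predict` reuses those learned values on the test data. The names `toy_pipe`, `X_toy_train` and `X_toy_test` are made up for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Toy data, invented for illustration only
X_toy_train = np.array([[0.0], [5.0], [10.0], [20.0]])
y_toy_train = np.array([0, 0, 1, 1])
X_toy_test = np.array([[40.0]])  # lies outside the training range

toy_pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('model', DecisionTreeClassifier(random_state=0)),
])
toy_pipe.fit(X_toy_train, y_toy_train)

# The scaler was fitted on the training data only (min 0, max 20)
scaler = toy_pipe.named_steps['scale']
print(scaler.data_min_, scaler.data_max_)

# The test value is scaled with the training statistics, so 40 maps to 2.0
scaled_test = scaler.transform(X_toy_test)
print(scaled_test)

# predict() runs the same fitted scaler before the model
toy_pred = toy_pipe.predict(X_toy_test)
print(toy_pred)
```

Because the test value is scaled with the training minimum and maximum, it can fall outside the 0 to 1 range; that is expected and shows that nothing was re-fitted on the test data.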

A pipeline is an automated, organized workflow in which data processing, model training, and prediction are carried out step by step. Using a pipeline makes the code easier to manage and lets you build reusable models.<br>
### Why build a pipeline
1. Code Reusability – the same code does not have to be written again and again
2. Automation – the steps do not have to be run manually each time
3. Efficiency – saves time and processing power
4. Consistency – the same steps are applied in the same way every time, so results stay consistent <br>

Data collection → preprocessing (fill missing values using SimpleImputer) → feature engineering (OneHotEncoder) → feature selection (keep the best 5 to 10) → model training → evaluation → deployment

In [2]:
import numpy as np
import pandas as pd

### Imports from sklearn
1. train_test_split - splits the data into training and test sets
2. ColumnTransformer - applies transformers to selected columns
3. SimpleImputer - fills in missing values
4. OneHotEncoder - turns categorical columns into new 0/1 feature columns
5. MinMaxScaler - scales columns to the range 0 to 1
6. Pipeline, make_pipeline - chain all of these steps into a single object
7. SelectKBest - selects the top-scoring features (here the best 5 to 10)
8. DecisionTreeClassifier - the algorithm used for the model

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline,make_pipeline
from sklearn.feature_selection import SelectKBest,chi2
from sklearn.tree import DecisionTreeClassifier

In [4]:
df = pd.read_csv('train.csv')
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [5]:
# drop unnecessary columns from dataset
df.drop(columns=['PassengerId','Name','Ticket','Cabin'],inplace=True)

In [8]:
# train test split 
X_train,X_test,y_train,y_test = train_test_split(df.drop(columns=['Survived']), df['Survived'], test_size=0.2, random_state=42)

In [9]:
X_train.head()

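| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 331 | 1 | male | 45.5 | 0 | 0 | 28.5 | S |
| 733 | 2 | male | 23.0 | 0 | 0 | 13.0 | S |
| 382 | 3 | male | 32.0 | 0 | 0 | 7.925 | S |
| 704 | 3 | male | 26.0 | 1 | 0 | 7.8542 | S |
| 813 | 3 | female | 6.0 | 4 | 2 | 31.275 | S |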


In [11]:
y_train.sample(5)

694    0
622    1
859    0
700    1
595    0
Name: Survived, dtype: int64

In [14]:
# imputation transformer
# remainder='passthrough' keeps the columns not listed here instead of dropping them
trf1 = ColumnTransformer([
    ('impute_age',SimpleImputer(),[2]),
    ('impute_embarked',SimpleImputer(strategy='most_frequent'),[6])
],remainder='passthrough')
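
One detail of `trf1` is easy to miss: a `ColumnTransformer` outputs the transformed columns first, in the order the transformers are listed, and appends the passthrough columns after them. The sketch below rebuilds the same transformer on a tiny invented frame (`toy` is hypothetical, with the same column order as `X_train`) to show the resulting order.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Tiny hypothetical frame with the same column order as X_train
toy = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, np.nan, 26.0],
    'SibSp': [1, 1, 0],
    'Parch': [0, 0, 0],
    'Fare': [7.25, 71.28, 7.92],
    'Embarked': ['S', np.nan, 'S'],
})

ct = ColumnTransformer([
    ('impute_age', SimpleImputer(), [2]),
    ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6]),
], remainder='passthrough')

out = ct.fit_transform(toy)

# Output order: Age, Embarked, then the untouched columns
# Pclass, Sex, SibSp, Parch, Fare
print(out[1])
```

In the output, Age is at index 0, Embarked at index 1, Sex at index 3 and Fare at index 6. Any later step that selects columns by position has to use these new positions, not the positions in the original DataFrame.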

In [15]:
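# one hot encoding
# Note: the indices refer to the output of trf1, whose column order is
# [Age, Embarked, Pclass, Sex, SibSp, Parch, Fare]; [1,6] therefore selects
# Embarked and Fare here, while Sex sits at index 3
trf2 = ColumnTransformer([
    ('ohe_sex_embarked',OneHotEncoder(sparse_output=False,handle_unknown='ignore'),[1,6])
],remainder='passthrough')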

In [16]:
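# Scaling: scales the first 10 columns of the previous step's output
# (no remainder is given, so the default remainder='drop' discards all other columns)
trf3 = ColumnTransformer([
    ('scale',MinMaxScaler(),slice(0,10))
])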
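
Since `trf3` does not set `remainder`, it is worth seeing what the default does. The sketch below uses an invented 3×4 array (`X_demo` is hypothetical) and compares the default `remainder='drop'` with `remainder='passthrough'`.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Invented 3x4 numeric array
X_demo = np.arange(12, dtype=float).reshape(3, 4)

# Default remainder='drop': only the selected columns survive
scale_only = ColumnTransformer([
    ('scale', MinMaxScaler(), slice(0, 2)),
])
dropped = scale_only.fit_transform(X_demo)
print(dropped.shape)  # 2 columns remain

# remainder='passthrough': the unselected columns are appended unchanged
scale_keep = ColumnTransformer([
    ('scale', MinMaxScaler(), slice(0, 2)),
], remainder='passthrough')
kept = scale_keep.fit_transform(X_demo)
print(kept.shape)  # all 4 columns remain
```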

In [17]:
# Feature selection
trf4 = SelectKBest(score_func=chi2,k=10) # keep the 10 features with the highest chi2 scores
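
As a rough illustration of what `SelectKBest` with `chi2` does, the sketch below uses an invented 4×2 array in which the first feature follows the label exactly and the second is constant. The names `X_sel`, `y_sel` and `selector` are hypothetical. Note that `chi2` requires non-negative feature values, which is one reason scaling to the 0 to 1 range comes before this step.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Feature 0 matches the label; feature 1 is constant and carries no information
X_sel = np.array([[1, 1],
                  [1, 1],
                  [0, 1],
                  [0, 1]])
y_sel = np.array([1, 1, 0, 0])

selector = SelectKBest(score_func=chi2, k=1)
X_best = selector.fit_transform(X_sel, y_sel)

print(selector.scores_)        # chi2 score per feature
print(selector.get_support())  # boolean mask of the kept features
print(X_best.shape)
```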

In [18]:
# model (it is trained later, when pipe.fit is called)
trf5 = DecisionTreeClassifier()

In [19]:
# create the pipeline from a list of (name, step) tuples
pipe = Pipeline([ 
    ('trf1',trf1),
    ('trf2',trf2),
    ('trf3',trf3),
    ('trf4',trf4),
    ('trf5',trf5)
])

### Pipeline vs make_pipeline

Pipeline requires naming of steps, make_pipeline does not.

(Same applies to ColumnTransformer vs make_column_transformer)
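
A short sketch of the difference, using hypothetical step names `'scale'` and `'tree'`: `make_pipeline` generates each step name from the lowercased class name, and those generated names are what parameter keys must refer to later.

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Explicit names chosen by the user
named = Pipeline([
    ('scale', MinMaxScaler()),
    ('tree', DecisionTreeClassifier()),
])

# Names generated automatically from the class names
auto = make_pipeline(MinMaxScaler(), DecisionTreeClassifier())

named_step_names = [name for name, _ in named.steps]
auto_step_names = [name for name, _ in auto.steps]
print(named_step_names)  # ['scale', 'tree']
print(auto_step_names)   # ['minmaxscaler', 'decisiontreeclassifier']
```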

In [20]:
# Alternate Syntax
pipe = make_pipeline(trf1,trf2,trf3,trf4,trf5)

In [21]:
# train
pipe.fit(X_train,y_train)

In [22]:
# Predict
y_pred = pipe.predict(X_test)

In [23]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.6256983240223464

In [24]:
# cross validation using cross_val_score
from sklearn.model_selection import cross_val_score
cross_val_score(pipe, X_train, y_train, cv=5, scoring='accuracy').mean()

np.float64(0.6391214419383433)

In [25]:
# GridSearchCV: each key has the form '<step name>__<parameter name>'
params = {
    'trf5__max_depth':[1,2,3,4,5,None]
}

In [26]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

ValueError: Invalid parameter 'trf5' for estimator Pipeline(steps=[('columntransformer-1',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('impute_age', SimpleImputer(),
                                                  [2]),
                                                 ('impute_embarked',
                                                  SimpleImputer(strategy='most_frequent'),
                                                  [6])])),
                ('columntransformer-2',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('ohe_sex_embarked',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse_output=False),
                                                  [1, 6])])),
                ('columntransformer-3',
                 ColumnTransformer(transformers=[('scale', MinMaxScaler(),
                                                  slice(0, 10, None))])),
                ('selectkbest',
                 SelectKBest(score_func=<function chi2 at 0x74bbf009fba0>)),
                ('decisiontreeclassifier', DecisionTreeClassifier())]). Valid parameters are: ['memory', 'steps', 'transform_input', 'verbose'].
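
The `ValueError` above arises because `pipe` was last rebuilt with `make_pipeline` in `In [20]`, so its steps carry auto-generated names (`columntransformer-1`, ..., `decisiontreeclassifier`) and no step called `trf5` exists any more. Either rebuild `pipe` with the named `Pipeline` from `In [19]`, or use the key `decisiontreeclassifier__max_depth`. The sketch below shows the second option on synthetic data from `make_classification`; `demo_pipe`, `demo_params` and `demo_grid` are hypothetical names, and the data is not the Titanic set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, for illustration only
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=42)

demo_pipe = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=42))

# get_params() lists every valid key; filter for the one we want to tune
tunable = [key for key in demo_pipe.get_params() if key.endswith('max_depth')]
print(tunable)

# The key prefix must match the auto-generated step name
demo_params = {
    'decisiontreeclassifier__max_depth': [1, 2, 3, 4, 5, None]
}
demo_grid = GridSearchCV(demo_pipe, demo_params, cv=5, scoring='accuracy')
demo_grid.fit(X_syn, y_syn)

print(demo_grid.best_params_)
```

Calling `pipe.get_params().keys()` on the notebook's own pipeline is a quick way to check which parameter keys are valid before building the grid.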