Here we use the same Titanic dataset to predict, from a few inputs, whether a passenger would have survived. This time, however, we build the model with a scikit-learn Pipeline.

In [77]:
import numpy as np
import pandas as pd

In [78]:
df = pd.read_csv("titanic.csv")
df.sample(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
92,93,0,1,"Chaffee, Mr. Herbert Fuller",male,46.0,1,0,W.E.P. 5734,61.175,E31,S
231,232,0,3,"Larsson, Mr. Bengt Edvin",male,29.0,0,0,347067,7.775,,S
184,185,1,3,"Kink-Heilmann, Miss. Luise Gretchen",female,4.0,0,2,315153,22.025,,S
705,706,0,2,"Morley, Mr. Henry Samuel (""Mr Henry Marshall"")",male,39.0,0,0,250655,26.0,,S
181,182,0,2,"Pernot, Mr. Rene",male,,0,0,SC/PARIS 2131,15.05,,C


In [79]:
# Drop these 4 columns, as they are not needed for the model.
df.drop(columns=['PassengerId','Name','Ticket','Cabin'],inplace=True)
# inplace=True modifies df directly instead of returning a modified copy; the method itself then returns None.

In [80]:
df.sample(5)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
845,0,3,male,42.0,0,0,7.55,S
445,1,1,male,4.0,0,2,81.8583,S
382,0,3,male,32.0,0,0,7.925,S
57,0,3,male,28.5,0,0,7.2292,C


In [81]:
from sklearn.model_selection import train_test_split

In [82]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['Survived']),df['Survived'],test_size=0.2,random_state=0)
X_train.shape

(712, 7)

In [83]:
X_train.sample(5)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
383,1,female,35.0,1,0,52.0,S
721,3,male,17.0,1,0,7.0542,S
528,3,male,39.0,0,0,7.925,S
858,3,female,24.0,0,3,19.2583,C
164,3,male,1.0,4,1,39.6875,S


Now, we will see the steps for the transformations to be done with the training data

Imputing missing values ->

In [84]:
from sklearn.compose import ColumnTransformer

In [85]:
from sklearn.impute import SimpleImputer

In [86]:
transformation1 = ColumnTransformer([
    ('impute_age',SimpleImputer(),[2]),
    ('impute_embarked',SimpleImputer(strategy="most_frequent"), [6])
], remainder="passthrough"
)

# **Explanation -**

### **Transformations in the List**

The list inside ColumnTransformer is made up of tuples. Each tuple contains:

1. A name for the transformation: This is simply a label and can be anything descriptive.
2. A transformation (like SimpleImputer): This specifies the actual transformation to apply.
3. A list of column indices: These are the columns on which the transformation will be applied.

### **Explanation of Each Transformation**
('impute_age', SimpleImputer(), [2]):

impute_age: A name for this specific transformation.

SimpleImputer(): This is an imputer that will fill missing values with a default strategy (mean by default). Here, it will apply the imputer on the third column ([2] indicates column index 2, as indexing starts from 0).
Purpose: This will replace any missing values in the age column with the mean age.
('impute_embarked', SimpleImputer(strategy="most_frequent"), [6]):

impute_embarked: A name for this specific transformation.

SimpleImputer(strategy="most_frequent"): An imputer with strategy="most_frequent" will fill in missing values with the most frequently occurring value in that column.
Column [6]: This targets the seventh column in the dataset.
Purpose: This will replace any missing values in the Embarked column with the most frequent value.

remainder="passthrough"
The remainder="passthrough" argument tells ColumnTransformer to keep the columns not named in the transformation list and pass them to the output unchanged. The default is remainder="drop", which would discard those columns, and we don't want that. Note that the passthrough columns are appended after the transformed ones, so the output order here is Age, Embarked, Pclass, Sex, SibSp, Parch, Fare rather than the original column order.
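
To make the column ordering concrete, here is a minimal sketch on a toy frame. The rows are made up and only mimic the column order of X_train; the names toy, ct and output_order are ours, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Made-up rows with the same column order as X_train.
toy = pd.DataFrame({
    "Pclass":   [3, 1, 3, 2],
    "Sex":      ["male", "female", "male", "female"],
    "Age":      [22.0, np.nan, 30.0, 38.0],
    "SibSp":    [1, 0, 0, 1],
    "Parch":    [0, 0, 2, 0],
    "Fare":     [7.25, 80.0, 8.05, 13.0],
    "Embarked": ["S", "C", np.nan, "S"],
})

ct = ColumnTransformer([
    ("impute_age", SimpleImputer(), [2]),
    ("impute_embarked", SimpleImputer(strategy="most_frequent"), [6]),
], remainder="passthrough")

out = ct.fit_transform(toy)

# Transformed columns come first; passthrough columns are appended on the right.
output_order = ["Age", "Embarked", "Pclass", "Sex", "SibSp", "Parch", "Fare"]
print(pd.DataFrame(out, columns=output_order))
```

The missing Age is filled with the mean of the other three ages and the missing Embarked with "S", and both columns now sit at positions 0 and 1 of the output.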

One Hot Encoding -

In [87]:
from sklearn.preprocessing import OneHotEncoder

In [88]:
transformation2 = ColumnTransformer([
    ('ohe_sex_embarked',OneHotEncoder(sparse_output=False,handle_unknown='ignore'),[1,6])
], remainder="passthrough"
)

# **Explanation -**

The list inside ColumnTransformer consists of one tuple specifying a single transformation:

('ohe_sex_embarked', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), [1, 6]):

ohe_sex_embarked: This is a name for the transformation, which helps to identify this specific transformation within the ColumnTransformer.

OneHotEncoder(sparse_output=False, handle_unknown='ignore'):

OneHotEncoder is used to convert categorical features into one-hot-encoded columns, meaning each unique category will be represented by a separate binary column.

sparse_output=False: The encoder returns a regular (dense) NumPy array instead of a SciPy sparse matrix.

handle_unknown='ignore': If the test data contains categories that were not present during training, this option prevents an error; such a category is encoded as all zeros in the one-hot columns of that feature.

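Columns [1, 6]: These indices refer to the output of transformation1, not to the original X_train. ColumnTransformer places the transformed columns first and appends the passthrough columns on the right, so after transformation1 the order is Age (0), Embarked (1), Pclass (2), Sex (3), SibSp (4), Parch (5), Fare (6). As written, [1, 6] therefore encodes "Embarked" and "Fare"; to encode "Sex" and "Embarked" as intended, the indices would need to be [1, 3].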

remainder="passthrough"
The remainder="passthrough" argument specifies that any columns not included in the transformation list should be left unchanged and passed through to the output.
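
A small sketch of what handle_unknown='ignore' does in practice. The category "X" is a made-up value that the encoder never saw during fit. The sketch uses the default sparse output and calls .toarray() for printing, so it does not depend on the sparse_output argument.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["S"], ["C"], ["Q"], ["S"]], dtype=object)
test = np.array([["C"], ["X"]], dtype=object)   # "X" is an unseen, made-up category

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)

# The default output is a sparse matrix; .toarray() makes it dense for printing.
encoded = ohe.transform(test).toarray()
print(ohe.categories_)
print(encoded)
```

The row for "C" gets a single 1 in the "C" column, while the row for "X" is all zeros instead of raising an error.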

Feature Scaling -

In [89]:
from sklearn.preprocessing import MinMaxScaler

In [90]:
transformation3 = ColumnTransformer([
    ('scale',MinMaxScaler(),slice(0,10))
])

# **Explanation -**

We are doing feature scaling here. We use MinMaxScaler because the next step is feature selection with the chi-squared test, which only accepts non-negative values. MinMaxScaler maps each feature into the range [0, 1], whereas StandardScaler centres each feature around 0 and would therefore produce negative values.

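slice(0,10) -> this applies the MinMaxScaler transformation to columns 0-9, i.e. the first 10 columns of the incoming array.

But how do we arrive at 10 columns when the training data started with 7?
-> The intent is that the OHE step expands 'Sex' into 2 columns and 'Embarked' into 3 columns, so 2 + 3 + the 5 remaining columns = 10, and slice(0,10) would then cover every column.

Note, however, that transformation3 does not set remainder="passthrough", so any column beyond index 9 is dropped. The fitted trf3 output further below lists dropped remainder columns from index 10 to beyond 136, which shows that the OHE step actually produced far more than 10 columns. This is consistent with transformation2 encoding 'Fare' (which has many distinct values) instead of 'Sex', because the indices [1, 6] refer to the reordered output of transformation1.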

Feature selection -

In [91]:
from sklearn.feature_selection import SelectKBest,chi2

In [92]:
transformation4 = SelectKBest(score_func=chi2,k=8)

# **Explanation -**

This code snippet defines a feature selection step called transformation4 using SelectKBest from scikit-learn. SelectKBest selects the top k features based on a scoring function, in this case the chi-squared (χ²) statistic.

Explanation of Each Part

SelectKBest: This is a feature selection method that scores each feature individually based on a statistical test and selects the best features according to the scores.

score_func=chi2:

chi2 is the scoring function used here. It calculates the chi-squared (χ²) statistic between each feature and the target variable.
Chi-squared is commonly used for categorical data and measures how much the observed frequency of a feature differs from the expected frequency. Higher values indicate stronger associations.
This scoring function is typically applied to non-negative features (e.g., count data or binary features).
k=8: This specifies that the top 8 features, based on the chi-squared scores, will be selected.
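
Here is a hedged sketch of SelectKBest with chi2 on synthetic data (three made-up features, only the first one related to the target). It also shows why the scaler matters: chi2 raises a ValueError when it receives negative values.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)

# Three made-up features: one tied to the target, two pure noise.
X = np.column_stack([
    y * 5.0 + rng.normal(size=100),   # informative
    rng.normal(size=100),             # noise
    rng.normal(size=100),             # noise
])

# MinMaxScaler maps every feature into [0, 1], which chi2 accepts.
X_minmax = MinMaxScaler().fit_transform(X)
selector = SelectKBest(score_func=chi2, k=1).fit(X_minmax, y)
print(selector.scores_)          # one chi-squared score per feature
print(selector.get_support())    # boolean mask of the kept features

# StandardScaler centres features at 0, so some values are negative.
X_std = StandardScaler().fit_transform(X)
try:
    chi2(X_std, y)
    chi2_accepts_negative = True
except ValueError:
    chi2_accepts_negative = False
print(chi2_accepts_negative)
```

The selector keeps only the informative feature, and the standardized data is rejected by chi2.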

Creating the DT object -

In [93]:
from sklearn.tree import DecisionTreeClassifier

In [94]:
transformation5 = DecisionTreeClassifier()

Now, the task is to connect all these transformation blocks that we have created in a Pipeline -

In [95]:
from sklearn.pipeline import Pipeline,make_pipeline

In [96]:
pipe = Pipeline([
    ("trf1",transformation1),
    ("trf2",transformation2),
    ("trf3",transformation3),
    ("trf4",transformation4),
    ("trf5",transformation5)
]
)
# Alternate syntax -
# pipe = make_pipeline(transformation1,transformation2,transformation3,transformation4,transformation5)

As shown above, we can also define the pipeline with the make_pipeline function, which doesn't require naming each step; it generates the step names automatically from the estimator class names.
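
A quick sketch comparing the step names of the two styles. The three estimators are chosen only for illustration and are not the five steps of our pipeline.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Explicit names chosen by us.
named = Pipeline([
    ("impute", SimpleImputer()),
    ("scale", MinMaxScaler()),
    ("model", DecisionTreeClassifier()),
])

# Names generated by make_pipeline from the lowercased class names.
auto = make_pipeline(SimpleImputer(), MinMaxScaler(), DecisionTreeClassifier())

print(list(named.named_steps))
print(list(auto.named_steps))
```

The generated names matter later on, because named_steps lookups have to use them instead of labels like "trf1".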

In [97]:
pipe.fit(X_train, y_train)

In a notebook, this displays a diagrammatic representation of the steps in the pipeline.

# **Important note:**

There are usually 2 types of pipelines that we can create -
1. A pipeline that contains the pre-processing steps as well as model training. This is what we are doing in our example, since the pipeline covers both the pre-processing (imputation, OHE, scaling and feature selection) and the model training (the decision tree step).
2. A pipeline that contains only pre-processing steps and no model.

In the first case we call pipe.fit and then pipe.predict, whereas in the second case we call pipe.fit and then pipe.transform, or pipe.fit_transform in one go. There is no pipe.predict in the second case.
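
The difference between the two types can be sketched on a tiny made-up array (the names prep and full are ours):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y = np.array([0, 0, 1, 1])

# Type 2: pre-processing only, so the result is transformed data.
prep = Pipeline([("impute", SimpleImputer()), ("scale", MinMaxScaler())])
X_ready = prep.fit_transform(X)
print(X_ready)
print(hasattr(prep, "predict"))   # False: the last step is not a model

# Type 1: pre-processing plus a model, so the result is predictions.
full = Pipeline([
    ("impute", SimpleImputer()),
    ("scale", MinMaxScaler()),
    ("model", DecisionTreeClassifier(random_state=0)),
])
full.fit(X, y)
print(full.predict(X))
```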


In [98]:
pipe.named_steps

{'trf1': ColumnTransformer(remainder='passthrough',
                   transformers=[('impute_age', SimpleImputer(), [2]),
                                 ('impute_embarked',
                                  SimpleImputer(strategy='most_frequent'),
                                  [6])]),
 'trf2': ColumnTransformer(remainder='passthrough',
                   transformers=[('ohe_sex_embarked',
                                  OneHotEncoder(handle_unknown='ignore',
                                                sparse_output=False),
                                  [1, 6])]),
 'trf3': ColumnTransformer(transformers=[('scale', MinMaxScaler(), slice(0, 10, None))]),
 'trf4': SelectKBest(k=8, score_func=<function chi2 at 0x7fd9d6c477f0>),
 'trf5': DecisionTreeClassifier()}

Now, suppose we want to know about a particular pre-processing step in the pipeline. We can see it like this -

In [101]:
pipe.named_steps['trf1'].transformers_

[('impute_age', SimpleImputer(), [2]),
 ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6]),
 ('remainder',
  FunctionTransformer(accept_sparse=True, check_inverse=False,
                      feature_names_out='one-to-one'),
  [0, 1, 3, 4, 5])]

In [102]:
pipe.named_steps['trf2'].transformers_

[('ohe_sex_embarked',
  OneHotEncoder(handle_unknown='ignore', sparse_output=False),
  [1, 6]),
 ('remainder',
  FunctionTransformer(accept_sparse=True, check_inverse=False,
                      feature_names_out='one-to-one'),
  [0, 2, 3, 4, 5])]

In [105]:
pipe.named_steps['trf3'].transformers_

[('scale', MinMaxScaler(), slice(0, 10, None)),
 ('remainder',
  'drop',
  [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
   30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
   50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,
   70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
   90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109,
   110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
   130, 131, 132, 133, 134, 135, 136, ...

Now, let's say we want to know what exact value was imputed by the SimpleImputer class for 'Age' during our first step of the pipeline. Let's see this stepwise -

In [109]:
trf1List = pipe.named_steps['trf1'].transformers_
trf1List

[('impute_age', SimpleImputer(), [2]),
 ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6]),
 ('remainder',
  FunctionTransformer(accept_sparse=True, check_inverse=False,
                      feature_names_out='one-to-one'),
  [0, 1, 3, 4, 5])]

From this list, we want the 'impute_age' entry -

In [110]:
trf1List[0]

('impute_age', SimpleImputer(), [2])

From here, we want to see the mean value for age that SimpleImputer has calculated for imputing missing values -

In [111]:
trf1List[0][1].statistics_

array([29.74518389])

Likewise, let's say we want to know the most frequent value calculated for imputing missing values in Embarked column, we could do this -

In [112]:
pipe.named_steps['trf1'].transformers_[1][1].statistics_

array(['S'], dtype=object)
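
As an alternative to indexing the transformers_ list by position, a fitted ColumnTransformer also exposes named_transformers_, which lets us look up a transformer by its label. A minimal sketch on made-up numeric data (toy_pipe is a hypothetical stand-in for our pipe):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up data: column 0 has one missing value, column 1 is complete.
X = np.array([[20.0, 1.0], [np.nan, 0.0], [40.0, 1.0], [30.0, 0.0]])
y = np.array([0, 1, 0, 1])

toy_pipe = Pipeline([
    ("trf1", ColumnTransformer([("impute_age", SimpleImputer(), [0])],
                               remainder="passthrough")),
    ("model", DecisionTreeClassifier(random_state=0)),
])
toy_pipe.fit(X, y)

# Look the fitted imputer up by its name rather than by list position.
imputer = toy_pipe.named_steps["trf1"].named_transformers_["impute_age"]
print(imputer.statistics_)   # mean of 20, 40 and 30
```

Looking steps up by name keeps the code readable and does not break if the order of the transformers changes.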

In [114]:
y_pred = pipe.predict(X_test)
y_pred

array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       1, 0, 0])

This array shows the prediction done by our model for all the 179 passengers in X_test df.

In [117]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred) # calculating acc score of predicted vs actual values

0.6759776536312849
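
For reference, accuracy is simply the fraction of predictions that match the actual labels. A tiny sketch with made-up labels (not taken from the Titanic run above):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only.
y_true = np.array([1, 0, 0, 1, 1, 0, 1, 0])
y_hat = np.array([1, 0, 1, 1, 0, 0, 1, 0])

# Fraction of positions where prediction and truth agree.
manual = (y_true == y_hat).mean()
print(manual, accuracy_score(y_true, y_hat))
```

Both values agree, since 6 of the 8 made-up predictions are correct.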

Exporting the pipeline -

In [118]:
import pickle
# The with-block closes the file once the pipeline has been written.
with open('pipe.pkl','wb') as f:
    pickle.dump(pipe,f)
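
Once exported, the pipeline can be loaded back and used for prediction without repeating any pre-processing by hand. A minimal round-trip sketch, using a small made-up pipeline (small_pipe) and a temporary file so that it does not touch pipe.pkl:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
small_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("model", DecisionTreeClassifier(random_state=0)),
]).fit(X, y)

# Write to a temporary folder so the sketch does not overwrite pipe.pkl.
path = os.path.join(tempfile.mkdtemp(), "small_pipe.pkl")
with open(path, "wb") as f:
    pickle.dump(small_pipe, f)

# Later (for example in another script): load the file and predict directly.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict(np.array([[1.5], [3.5]])))
```

Keep in mind that a pickle file should only be loaded from a trusted source, and ideally with the same scikit-learn version that created it.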