# PIPELINES IN SCIKIT-LEARN

**The core idea (intuition):**

In real ML workflows, you almost always do this:

- Preprocess data
  - scaling
  - encoding
  - imputing missing values

- Train a model

**A pipeline guarantees:**

- Steps happen in the correct order

- The same preprocessing is applied to train and test data

- No data leakage

- Cleaner code

-> **PIPELINES MAKE IT EASY TO APPLY THE SAME PREPROCESSING TO TRAIN AND TEST**
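A minimal sketch of that idea, assuming a small synthetic dataset. The arrays, the LogisticRegression model, and the step names "scaler" and "model" are illustrative choices and are not part of this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = (X[:, 0] > 50.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# fit() fits the scaler on X_train only, transforms X_train, then fits the model.
pipe.fit(X_train, y_train)

# predict() only transforms X_test with the already-fitted scaler.
y_pred = pipe.predict(X_test)

# The scaler's learned mean comes from the training split alone.
train_mean = X_train.mean(axis=0)
scaler_mean = pipe.named_steps["scaler"].mean_
```

Because the scaler lives inside the pipeline, there is no separate call where test data could accidentally be used for fitting.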

Without a pipeline, it is easy to make a mistake like this:

scaler = StandardScaler() <br>
X_train_scaled = scaler.fit_transform(X_train)<br>
X_test_scaled = scaler.fit_transform(X_test)  # ❌ WRONG (data leakage)<br>
<br>
model.fit(X_train_scaled, y_train)

# Where does data leakage happen?

Now look at this:

scaler.fit_transform(X_test)


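This means:

You are recalculating the mean and standard deviation using the test data, so the test set is scaled with its own statistics instead of the ones learned from the training set

The evaluation indirectly uses information about the test set

👉 This breaks the rule:

Test data must be completely unseen during training

Even though you're not fitting the model on it, you're still learning from the test data.

That is data leakage. The fix is to fit the scaler on X_train only and call scaler.transform(X_test) on the test set.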
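The difference between the two calls can be sketched on toy numbers. The arrays below are hypothetical and were chosen only so that the two splits have clearly different means.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny hypothetical arrays with very different means per split.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0], [20.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from train
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# For comparison: refitting on the test split uses the test mean and std instead.
X_test_refit = StandardScaler().fit_transform(X_test)
```

With transform, the test values stay far from zero because they are measured against the training mean. After refitting, they are centred on their own mean, which hides how different the test split really is.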

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [2]:
df = sns.load_dataset('titanic')
df = df.iloc[:, :8]
df

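| | survived | pclass | sex | age | sibsp | parch | fare | embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | male | 27.0 | 0 | 0 | 13.0000 | S |
| 887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S |
| 888 | 0 | 3 | female | NaN | 1 | 2 | 23.4500 | S |
| 889 | 1 | 1 | male | 26.0 | 0 | 0 | 30.0000 | C |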


In [3]:
X_train, X_test, Y_train, Y_test = train_test_split(df.drop(columns = ['survived']), df['survived'], test_size = 0.2, random_state = 42)

In [4]:
X_train.head(2)

| | pclass | sex | age | sibsp | parch | fare | embarked |
|---|---|---|---|---|---|---|---|
| 331 | 1 | male | 45.5 | 0 | 0 | 28.5 | S |
| 733 | 2 | male | 23.0 | 0 | 0 | 13.0 | S |


In [5]:
Y_train.head(2)

331    0
733    0
Name: survived, dtype: int64

In [6]:
df.isnull().sum()

survived      0
pclass        0
sex           0
age         177
sibsp         0
parch         0
fare          0
embarked      2
dtype: int64

In [7]:
# Impute missing values: mean for age, most frequent value for embarked
si_age = SimpleImputer(strategy='mean')
si_embarked = SimpleImputer(strategy = 'most_frequent')

# Fit on the training split only
X_train_age = si_age.fit_transform(X_train[['age']])
X_train_embarked = si_embarked.fit_transform(X_train[['embarked']])

# Reuse the training statistics on the test split
X_test_age = si_age.transform(X_test[['age']])
X_test_embarked = si_embarked.transform(X_test[['embarked']])
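To see what the imputer learns during fit, here is a small sketch on hypothetical miniature frames (the values are made up, not taken from the Titanic data). The fitted value is stored in the statistics_ attribute and is reused by transform.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical miniature train/test frames, each with one missing age.
train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"age": [np.nan, 60.0]})

imputer = SimpleImputer(strategy="mean")
train_age = imputer.fit_transform(train[["age"]])  # the mean of 20, 30, 40 is learned here
test_age = imputer.transform(test[["age"]])        # the same training mean fills the test gap

learned_mean = imputer.statistics_[0]
```

The missing test age is filled with the training mean, not with anything computed from the test frame.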

In [8]:
X_train_age

array([[45.5       ],
       [23.        ],
       [32.        ],
       [26.        ],
       [ 6.        ],
       [24.        ],
       [45.        ],
       [29.        ],
       [29.49884615],
       ...

In [9]:
# One-hot encode the categorical columns
oe_sex = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
oe_embarked = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

X_train_sex = oe_sex.fit_transform(X_train[['sex']])
# embarked was imputed above, so encode the imputed arrays
X_train_embarked = oe_embarked.fit_transform(X_train_embarked)

X_test_sex = oe_sex.transform(X_test[['sex']])
X_test_embarked = oe_embarked.transform(X_test_embarked)

In [10]:
X_train_sex

array([[0., 1.],
       [0., 1.],
       [0., 1.],
       ...,
       [0., 1.],
       [1., 0.],
       [0., 1.]], shape=(712, 2))

In [11]:
X_train_embarked

array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       ...,
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]], shape=(712, 3))

In [12]:
X_train

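| | pclass | sex | age | sibsp | parch | fare | embarked |
|---|---|---|---|---|---|---|---|
| 331 | 1 | male | 45.5 | 0 | 0 | 28.5000 | S |
| 733 | 2 | male | 23.0 | 0 | 0 | 13.0000 | S |
| 382 | 3 | male | 32.0 | 0 | 0 | 7.9250 | S |
| 704 | 3 | male | 26.0 | 1 | 0 | 7.8542 | S |
| 813 | 3 | female | 6.0 | 4 | 2 | 31.2750 | S |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 106 | 3 | female | 21.0 | 0 | 0 | 7.6500 | S |
| 270 | 1 | male | NaN | 0 | 0 | 31.0000 | S |
| 860 | 3 | male | 41.0 | 2 | 0 | 14.1083 | S |
| 435 | 1 | female | 14.0 | 1 | 2 | 120.0000 | S |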


In [13]:
X_train_rem = X_train.drop(columns= ['sex', 'age', 'embarked'])
X_test_rem = X_test.drop(columns=['sex', 'age', 'embarked'])

In [15]:
X_train_transformed = np.concatenate((X_train_rem, X_train_sex, X_train_age, X_train_embarked), axis = 1)
X_test_transformed = np.concatenate((X_test_rem, X_test_sex, X_test_age, X_test_embarked), axis = 1)

In [16]:
X_train_transformed.shape

(712, 10)

In [18]:
clf = DecisionTreeClassifier()
clf.fit(X_train_transformed, Y_train)

DecisionTreeClassifier()


In [19]:
y_pred = clf.predict(X_test_transformed)

In [20]:
from sklearn.metrics import accuracy_score

In [22]:
accuracy_score(Y_test, y_pred)

0.7821229050279329