# Scikit Learn Pipelines

We’ve reached an important step. Make sure to review this notebook regularly. We’ll pull together everything we’ve learned so far.

We’ll start by importing what we need, explaining each tool as we go.

In [50]:
# Importing Pandas
import pandas as pd

# Preprocessing tools
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures

# Pipeline tools
from sklearn.pipeline import Pipeline  # Chains transformations in sequence
from sklearn.compose import ColumnTransformer  # Applies transformations to specific columns in parallel
from sklearn.preprocessing import FunctionTransformer  # Makes custom functions work in pipelines

# Classification model
from sklearn.neighbors import KNeighborsClassifier

# Grid search for hyperparameter tuning
from sklearn.model_selection import GridSearchCV

Now's a good time to check which version of scikit-learn we’re using.

In [2]:
import sklearn
sklearn.__version__

'1.5.1'

## 1. The Titanic disaster

We'll be using the Titanic dataset. This dataset includes information about passengers on the Titanic, such as their age, gender, class, and whether they survived.

In [5]:
path = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/titanic.csv'
df = pd.read_csv(path)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


| Column Name    | Description |
|----------------|-------------|
| **PassengerId** | Unique ID for each passenger |
| **Survived**    | Survival (0 = No, 1 = Yes) |
| **Pclass**      | Passenger class (1st, 2nd, 3rd) |
| **Name**        | Name of the passenger |
| **Sex**         | Gender of the passenger |
| **Age**         | Age of the passenger |
| **SibSp**       | Number of siblings/spouses aboard |
| **Parch**       | Number of parents/children aboard |
| **Ticket**      | Ticket number |
| **Fare**        | Ticket fare |
| **Cabin**       | Cabin number |
| **Embarked**    | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |

Our goal is to predict if a passenger survived the Titanic disaster or not.

## 2. Custom transformers: creating new features

Often, you’ll want to create new features from your existing data to improve your model's performance. 
For example, you might combine or transform existing columns into new ones. 
To do this in a pipeline, you can use `FunctionTransformer`, which lets you wrap any Python function and apply it as a transformer. 
This makes it easy to generate new features during the data preprocessing step.

We'll illustrate this with two examples. We'll create two new features: Title (e.g., Dr., Mr., Rev.) and Family Size.

**Title** can give insight into a passenger’s social status or profession, which could influence their likelihood of survival.
**Family Size** combines the number of siblings, spouses, parents, and children a passenger had on board. 
This might help because passengers with larger families could have different survival chances compared to those traveling alone.

These are just examples of new features I came up with. Can you think of other features that might improve our predictions?

**Title Feature**: Let's write a function to extract the title (Mr., Mrs., Miss, Dr., etc.) from the passenger's name.

In [7]:
# Let's take a look at the names
df.Name

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

To extract the title, we can use the format of the `Name` column: each name is written as "Surname, Title. Given names", so the title sits between the comma and the first period after it. We split the name at the comma and keep the second part, then split that part at the period and keep the first piece, and finally strip the surrounding whitespace.

In [9]:
# Here’s an example:
df.Name[0].split(',')[1].split('.')[0].strip()

'Mr'

In [13]:
# check that our idea works
df['Name'].apply(lambda name:name.split(',')[1].split('.')[0].strip())

0        Mr
1       Mrs
2      Miss
3       Mrs
4        Mr
       ... 
886     Rev
887    Miss
888    Miss
889      Mr
890      Mr
Name: Name, Length: 891, dtype: object
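
The same extraction can also be written with pandas' vectorized string methods instead of `apply`. This is a minimal sketch on three names copied from the dataset; the regular expression and the variable names are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# A few names in the dataset's "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# Capture everything between the comma and the first period after it
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# The split-based approach from the notebook, for comparison
split_titles = names.apply(lambda name: name.split(",")[1].split(".")[0].strip())

print(titles.tolist())
print(split_titles.tolist())
```

Both approaches give the same titles here. The split-based version is easier to read; the regex version avoids a Python-level loop.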

Some titles appear very infrequently, and we’ll need to handle them appropriately. We’ll discuss how to deal with these later.

In [45]:
# relative frequency of each title
df['Name'].apply(lambda name:name.split(',')[1].split('.')[0].strip() ).value_counts(normalize=True)

Name
Mr              0.580247
Miss            0.204265
Mrs             0.140292
Master          0.044893
Dr              0.007856
Rev             0.006734
Mlle            0.002245
Major           0.002245
Col             0.002245
the Countess    0.001122
Capt            0.001122
Ms              0.001122
Sir             0.001122
Lady            0.001122
Mme             0.001122
Don             0.001122
Jonkheer        0.001122
Name: proportion, dtype: float64

In [14]:
def get_title(dataframe):
    # Step 1: Create a copy of the dataframe to avoid modifying the original
    df = dataframe.copy()

    # Step 2: Extract the title from the 'Name' column
    df['Title'] = df.Name.apply(lambda x: x.split(",")[1].split(".")[0].strip())
    return df

In [15]:
# Check if the function works
get_title(df)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,Mr
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,Rev
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,Miss
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,Miss
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,Mr


**Family Size Feature:** Let’s write a function to calculate the size of a passenger’s family.

To get the family size, we add the number of siblings/spouses (SibSp), parents/children (Parch), and include the passenger (add 1).

In [23]:
# family size
family_size = df.SibSp+df.Parch+1
family_size

0      2
1      2
2      1
3      2
4      1
      ..
886    1
887    1
888    4
889    1
890    1
Length: 891, dtype: int64

In [16]:
def get_family_size(dataframe):
    # Make a copy of the dataframe to avoid modifying the original
    df = dataframe.copy()

    # Calculate family size by adding SibSp (siblings/spouses) and Parch (parents/children), then add 1 for the passenger
    df['Family_size'] = df.SibSp + df.Parch + 1
    return df

In [17]:
# check that it works
get_family_size(df)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Family_size
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,2
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,2
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,1
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,1
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,4
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,1


Now that we have the two functions, `get_title` and `get_family_size`, we can easily apply them to our data using `FunctionTransformer`. This allows us to wrap the functions and use them as part of our data processing steps.

In [18]:
# Wrapping functions for use in preprocessing
family_size_processor = FunctionTransformer(get_family_size)
title_processor = FunctionTransformer(get_title)

And here’s the cool part: now our family size and title processors work just like any other scikit-learn transformer. They come with the `.fit`, `.transform`, and `.fit_transform` methods.

In [19]:
family_size_processor.fit_transform(df)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Family_size
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,2
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,2
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,1
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,1
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,4
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,1


In [20]:
title_processor.fit_transform(df)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,Mr
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,Rev
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,Miss
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,Miss
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,Mr
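
Because the wrapped functions behave like ordinary transformers, they can also be chained. Here is a small self-contained sketch on a toy dataframe (three rows based on the dataset); the names `toy` and `feature_maker` are made up for this illustration.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def get_title(dataframe):
    df = dataframe.copy()
    df["Title"] = df.Name.apply(lambda x: x.split(",")[1].split(".")[0].strip())
    return df

def get_family_size(dataframe):
    df = dataframe.copy()
    df["Family_size"] = df.SibSp + df.Parch + 1
    return df

# Toy data: only the columns the two functions need
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Johnston, Miss. Catherine Helen"],
    "SibSp": [1, 0, 1],
    "Parch": [0, 0, 2],
})

# Each step receives the dataframe returned by the previous step
feature_maker = Pipeline(steps=[
    ("family_size", FunctionTransformer(get_family_size)),
    ("title", FunctionTransformer(get_title)),
])

toy_features = feature_maker.fit_transform(toy)
print(toy_features[["Title", "Family_size"]])
```

Since both functions work on a copy, `toy` itself is left unchanged.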


Let’s set these functions aside for now and switch gears. We’ll come back to them later.

## 3. Column transformers and pipelines

Often, we need to apply different transformations to different types of columns. That’s where `ColumnTransformer` comes in. 
It allows us to handle each group of columns separately.
We’ll create one transformer for the numerical features and another for the categorical features (the ordinal feature `Pclass` will be handled together with the numerical ones). This way, we can customize how each type of data is processed.

Each feature transformer will actually be a `Pipeline`, where we chain together the processing steps for that feature group. 
This allows us to apply multiple transformations in sequence, one after the other. 
For example, we can handle missing values, scale the data, and apply any custom transformations, all within a single pipeline for each set of features.

In scikit-learn, a **pipeline** is a sequence of steps where each step has a name and an operation (like a transformation or model).
The data moves through the steps in order. 
Each step has:

- Name: A label for the step.
- Operation: The transformation or model to apply.
  
The pipeline runs everything in sequence, so you don’t have to apply each step separately.
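
To make the idea of named steps concrete, here is a minimal sketch on made-up numbers (one column with a missing value). The names `X_toy` and `toy_pipe` are only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One numeric column with a missing value (made-up numbers)
X_toy = np.array([[1.0], [2.0], [np.nan], [3.0]])

toy_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),  # step 1: fill NaN with the column median
    ("scaler", StandardScaler()),                   # step 2: rescale to mean 0 and std 1
])

# fit_transform runs step 1, then feeds its output to step 2
X_out = toy_pipe.fit_transform(X_toy)

# Fitted steps can be looked up by name
print(toy_pipe.named_steps["imputer"].statistics_)  # the median learned by the imputer
print(X_out.ravel())
```

The missing value is replaced by the median of the observed values (2.0) before scaling, so it ends up at 0 after standardization.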

Let’s create a pipeline for the numerical features. First, we’ll handle missing values by imputing them, then we’ll scale the data.

In [22]:
# Pipeline for numerical features: impute missing values and scale
numeric_features = ['Age', 'Fare', 'Family_size', 'Pclass']
numeric_processor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # Handle missing values using the median
    ('scaler', StandardScaler())  # Scale the data
])
numeric_processor

Next, we’ll create one more pipeline for the categorical features. We’ll handle missing values and then apply the appropriate transformations.

In [46]:
# Pipeline for categorical features: impute missing values and apply one-hot encoding
categorical_features = ['Embarked', 'Sex', 'Title']
categorical_processor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Fill missing values with the most frequent category
    ('encoder', OneHotEncoder(min_frequency=0.006, handle_unknown='ignore'))  # One-hot encode categorical data
])

- **min_frequency=0.006**: This handles the very infrequent titles we saw earlier. Only categories that appear in at least 0.6% of the training rows get their own column; all rarer categories share a single "infrequent" column.

- **handle_unknown='ignore'**: Categories that were not seen during fitting do not raise an error at prediction time. They are encoded as all zeros for that feature.

In [47]:
categorical_processor

A **ColumnTransformer** allows you to apply different transformations to specific columns in a dataset. You define a list of transformations, where each transformation has:

- A **name**: A label for that transformation.
- A **transformer**: The operation you want to apply (e.g., scaling, encoding).
- The **columns**: The specific columns you want to transform.

For example, we can combine our two processors like this:

In [49]:
feature_processor = ColumnTransformer(
    transformers=[
        ('num', numeric_processor, numeric_features),
        ('cat', categorical_processor, categorical_features)
    ],
    remainder='drop'  # drop the columns not listed above: 'Name', 'SibSp', 'Parch'
)

feature_processor

## 4. Building the classification pipeline

Now, let’s bring everything together. We'll build a pipeline that:

1. Creates the family size feature.
2. Extracts the title feature.
3. Processes numerical and categorical features separately.
4. Adds polynomial features.
5. Finally, includes a kNN model for making predictions.

In [68]:
pipe = Pipeline(steps=[
    ('get family_size', family_size_processor),
    ('get title', title_processor),
    ('preprocessor', feature_processor),
    ('poly_features', PolynomialFeatures(degree=2)),  # add polynomial combinations of the features
    ('clf', KNeighborsClassifier())
])
pipe

## 5. Hyperparameter tuning with grid search

In [69]:
# Feature matrix and target vector
feature_cols = ['Name','Age','Fare','Sex','Embarked','Pclass','SibSp','Parch']
X = df[feature_cols] 
y = df.Survived

In [70]:
# train/test split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y)
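
The call above uses the defaults: 25% of the rows go to the test set, and the shuffle is different on every run, so the scores reported below will vary slightly if you rerun the notebook. As a sketch on made-up data, this is one way to make a split reproducible and keep the class balance similar in both parts. The seed value and the toy arrays are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 30% of them in the positive class
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([1] * 6 + [0] * 14)

# random_state fixes the shuffle; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of positives in each part
```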

We want to tune the hyperparameters of the kNN model, specifically `weights` and `n_neighbors`.
However, since the kNN model is just one step in a larger pipeline, we need to specify the names of these hyperparameters in a way that the pipeline can recognize them. 

The `__` (double underscore) notation in scikit-learn is used to access the hyperparameters of a specific step within a pipeline. 
When you want to tune a hyperparameter of a model that is part of a pipeline, you need to specify the step name followed by the hyperparameter name, separated by a double underscore.

So, to access the `n_neighbors` hyperparameter of a kNN model within a pipeline, you would write it as `clf__n_neighbors`. Similarly, to access the `weights` hyperparameter, you would write `clf__weights`.

In [71]:
param_grid = { 
    'clf__n_neighbors': list(range(1,25)),
    'clf__weights' : ['uniform','distance']
}
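
If you are ever unsure about a parameter name, you can ask the pipeline for the names it exposes. The following is a minimal sketch on a two-step toy pipeline (not the full pipeline built above); `toy_pipe` is an illustrative name.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", KNeighborsClassifier()),
])

# Parameters of a step appear as <step name>__<parameter name>
knn_params = sorted(name for name in toy_pipe.get_params() if name.startswith("clf__"))
print(knn_params)

# The same names work with set_params
toy_pipe.set_params(clf__n_neighbors=7, clf__weights="distance")
print(toy_pipe.named_steps["clf"])
```

The notation also nests. In the pipeline built above, the imputation strategy for the numerical features would be addressed as `preprocessor__num__imputer__strategy`.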

In [72]:
# instantiate and fit the grid
grid = GridSearchCV(pipe, param_grid, cv=10, scoring='accuracy', n_jobs=-1, verbose=1)
grid.fit(X_train, y_train)

Fitting 10 folds for each of 48 candidates, totalling 480 fits


In [73]:
# best hyper-parameters
grid.best_params_

{'clf__n_neighbors': 12, 'clf__weights': 'uniform'}

In [74]:
# best cross-validated accuracy
grid.best_score_

0.82037539574853

In [75]:
# best predictor
best_clf = grid.best_estimator_

## 6. Evaluating the model

In [76]:
from sklearn.metrics import confusion_matrix, accuracy_score

In [77]:
y_test_pred = best_clf.predict(X_test)

In [78]:
confusion_matrix(y_test,y_test_pred)

array([[130,  13],
       [ 25,  55]], dtype=int64)

In [79]:
accuracy_score(y_test,y_test_pred)

0.8295964125560538
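
Accuracy is only one summary of the confusion matrix. As a short sketch, we can recompute it by hand from the matrix reported above and also derive precision and recall for the "survived" class. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class
cm = np.array([[130, 13],
               [25, 55]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # correct predictions over all test passengers
precision = tp / (tp + fp)       # of the predicted survivors, how many actually survived
recall = tp / (tp + fn)          # of the actual survivors, how many the model found

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```

The accuracy matches the value returned by `accuracy_score` (185 correct out of 223). The recall of 55/80 shows that the model misses about a third of the actual survivors, which accuracy alone does not reveal.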

## 7. A fake passenger

In [47]:
df.head(1)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S


In [89]:
# Would Your Professor Have Survived the Titanic Disaster? 0 = No, 1 = Yes
Javier = pd.DataFrame({
                    'Name':['Perez-Alvaro, Dr. Javier'],
                    'Age': [38],
                    'Fare': [100],
                    'Sex': ['male'],
                    'Embarked': ['C'],
                    'Pclass':[2],
                    'SibSp': [1],
                    'Parch': [0],
                   })
Javier

Unnamed: 0,Name,Age,Fare,Sex,Embarked,Pclass,SibSp,Parch
0,"Perez-Alvaro, Dr. Javier",38,100,male,C,2,1,0


In [91]:
best_clf.predict(Javier) # Oops...

array([0], dtype=int64)