# Titanic 4: Tuning the whole pipeline with Cross Validation

In this notebook we will see how Grid Search Cross Validation can tune not only the parameters of the model but also the parameters of every transformer in a pipeline, which helps us find the best preprocessing strategy for our data.

## 1. Pipeline creation

As shown in the previous notebooks, here we clean the data, split it and create a pipeline:

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# reading
url = "https://drive.google.com/file/d/1g3uhw_y3tboRm2eYDPfUzXXsw8IOYDCy/view?usp=sharing"
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
data = pd.read_csv(path)

# X and y creation
X = data.drop(columns=["PassengerId", "Name", "Ticket"])
y = X.pop("Survived")

# feature selection: only numericals
X_num = X.select_dtypes(include="number").copy()

# data splitting
X_num_train, X_num_test, y_train, y_test = train_test_split(X_num, y, test_size=0.2, random_state=123)

# initialize transformers & model
imputer = SimpleImputer()
dtree = DecisionTreeClassifier()
 
# Create a pipeline
pipe = make_pipeline(imputer,
                     dtree)

## 2. Cross Validation with the whole pipeline

We can see the steps in the pipeline. Note that `make_pipeline` has given them names: `simpleimputer` and `decisiontreeclassifier`. We will use these names when defining the parameter grid for the cross validation.

In [2]:
pipe

Pipeline(steps=[('simpleimputer', SimpleImputer()),
                ('decisiontreeclassifier', DecisionTreeClassifier())])

To define the parameter grid for cross validation, you need to create a dictionary, where:

- Each key is the name of a pipeline step, followed by two underscores and the name of the parameter you want to tune.
- The values are lists (or "ranges") with all the values you want to try for each parameter.
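If you are unsure which keys are valid, the pipeline itself can list them. The snippet below is a minimal sketch that rebuilds the same two-step pipeline and prints every parameter name containing the double underscore; the variable name `tunable` is only illustrative. It also shows that `set_params` accepts the same keys that the parameter grid uses.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Same two-step pipeline as in section 1
pipe = make_pipeline(SimpleImputer(), DecisionTreeClassifier())

# Every nested parameter, addressed as <step name>__<parameter name>
tunable = sorted(name for name in pipe.get_params() if "__" in name)
print(tunable)

# set_params accepts the same keys that the parameter grid uses
pipe.set_params(simpleimputer__strategy="median",
                decisiontreeclassifier__max_depth=4)
print(pipe.named_steps["simpleimputer"].strategy)
```

Any key that does not appear in this list will make the grid search fail with an "invalid parameter" error, so printing the list first is a cheap sanity check.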

In [3]:
param_grid = {
    "simpleimputer__strategy":["mean", "median"],
    "decisiontreeclassifier__max_depth": range(2, 14),
    "decisiontreeclassifier__min_samples_leaf": range(3, 10),
    "decisiontreeclassifier__criterion":["gini", "entropy"]
}

When defining the cross validation, we pass our pipeline (`pipe`), our parameter grid (`param_grid`) and the number of folds (an arbitrary number, usually 5 or 10). You can also set the parameter `verbose` if you want to receive a bit more information about the CV task.

In [4]:
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(pipe,
                      param_grid,
                      cv=10,
                      verbose=1)

Fit your "search" to the training data (`X_num_train` and `y_train`), as we used to do with our model alone or with our pipeline:

In [None]:
search.fit(X_num_train, y_train)

Fitting 10 folds for each of 336 candidates, totalling 3360 fits


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('simpleimputer', SimpleImputer()),
                                       ('decisiontreeclassifier',
                                        DecisionTreeClassifier())]),
             param_grid={'decisiontreeclassifier__criterion': ['gini',
                                                               'entropy'],
                         'decisiontreeclassifier__max_depth': range(2, 14),
                         'decisiontreeclassifier__min_samples_leaf': range(3, 10),
                         'simpleimputer__strategy': ['mean', 'median']},
             verbose=1)
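The 336 candidates reported in the output are simply the product of the number of values tried for each parameter (2 × 12 × 7 × 2), and each candidate is fitted once per fold. As a quick check before launching a long search, the grid size can be computed without fitting anything. This sketch uses scikit-learn's `ParameterGrid`; the names `n_candidates` and `n_folds` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "simpleimputer__strategy": ["mean", "median"],
    "decisiontreeclassifier__max_depth": range(2, 14),
    "decisiontreeclassifier__min_samples_leaf": range(3, 10),
    "decisiontreeclassifier__criterion": ["gini", "entropy"],
}

# Number of parameter combinations: 2 * 12 * 7 * 2
n_candidates = len(ParameterGrid(param_grid))
n_folds = 10

print(n_candidates, n_candidates * n_folds)  # 336 3360
```

Because the count is multiplicative, every parameter you add to the grid multiplies the number of fits, so it is worth checking the total before running the search.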

Explore the best parameters and the best score achieved with your cross validation:

In [None]:
search.best_params_

{'decisiontreeclassifier__criterion': 'gini',
 'decisiontreeclassifier__max_depth': 8,
 'decisiontreeclassifier__min_samples_leaf': 6,
 'simpleimputer__strategy': 'mean'}

In [None]:
# cross validation average accuracy
search.best_score_

0.7078834115805948
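Beyond the single best score, the fitted search keeps the results of every candidate in `cv_results_`, which can be loaded into a DataFrame and sorted. The sketch below is self-contained, so it uses a small synthetic dataset with injected missing values as a stand-in for the Titanic features (an assumption, not the real data) and a deliberately small grid (`small_grid`, also illustrative).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (assumption, not the real data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan  # inject missing values for the imputer

pipe = make_pipeline(SimpleImputer(), DecisionTreeClassifier(random_state=0))
small_grid = {
    "simpleimputer__strategy": ["mean", "median"],
    "decisiontreeclassifier__max_depth": [2, 4],
}
search = GridSearchCV(pipe, small_grid, cv=5).fit(X, y)

# One row per candidate, best candidate first
results = pd.DataFrame(search.cv_results_)
ranking = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(ranking)
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the best candidate is clearly better than the runners-up or only marginally so.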

In [5]:
X_num_train

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare
329,1,16.0,0,1,57.9792
749,3,31.0,0,0,7.7500
203,3,45.5,0,0,7.2250
421,3,21.0,0,0,7.7333
97,1,23.0,0,1,63.3583
...,...,...,...,...,...
98,2,34.0,0,1,23.0000
322,2,30.0,0,0,12.3500
382,3,32.0,0,0,7.9250
365,3,30.0,0,0,7.2500


In [None]:
# training accuracy
y_train_pred = search.predict(X_num_train)

accuracy_score(y_train, y_train_pred)

0.7710674157303371

In [6]:
X_num_test

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare
172,3,1.0,1,1,11.1333
524,3,,0,0,7.2292
452,1,30.0,0,0,27.7500
170,1,61.0,0,0,33.5000
620,3,27.0,1,0,14.4542
...,...,...,...,...,...
388,3,,0,0,7.7292
338,3,45.0,0,0,8.0500
827,2,1.0,0,2,37.0042
773,3,,0,0,7.2250


In [None]:
# testing accuracy
y_test_pred = search.predict(X_num_test)

accuracy_score(y_test, y_test_pred)

0.7486033519553073
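Calling `search.predict` works because, with the default `refit=True`, `GridSearchCV` refits the best pipeline on the whole training set and exposes it as `best_estimator_`. The sketch below illustrates this on synthetic data (an assumption made to keep the example self-contained; the grid and variable names are illustrative): predictions from the search object and from the extracted pipeline are the same.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (assumption, not the real data)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

pipe = make_pipeline(SimpleImputer(), DecisionTreeClassifier(random_state=0))
search = GridSearchCV(pipe, {"decisiontreeclassifier__max_depth": [2, 4, 6]}, cv=5)
search.fit(X_train, y_train)

# The refitted best pipeline is an ordinary fitted Pipeline
best_pipe = search.best_estimator_
same = np.array_equal(search.predict(X_test), best_pipe.predict(X_test))
test_acc = accuracy_score(y_test, best_pipe.predict(X_test))
print(same, test_acc)
```

Extracting `best_estimator_` is convenient when you want to save or reuse the tuned pipeline without carrying the whole search object around.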

## **Exercise 1:**

Add a scaler to the pipeline, and use GridSearchCV to tune the parameters of the scaler, as well as the parameters of the imputer and the decision tree.

In [None]:
# Solution:
from sklearn.preprocessing import StandardScaler

# initialize transformers & model
imputer = SimpleImputer()
scaler = StandardScaler()
dtree = DecisionTreeClassifier()

# create the pipeline
pipe = make_pipeline(imputer,
                     scaler,
                     dtree)

# create parameter grid
param_grid = {
    "simpleimputer__strategy":["mean", "median"],
    "standardscaler__with_mean":[True, False],
    "standardscaler__with_std":[True, False],
    "decisiontreeclassifier__max_depth": range(2, 14),
    "decisiontreeclassifier__min_samples_leaf": range(3, 10),
    "decisiontreeclassifier__criterion":["gini", "entropy"]
}

# define cross validation
search = GridSearchCV(pipe,
                      param_grid,
                      cv=10,
                      verbose=1)

# fit
search.fit(X_num_train, y_train)

# cross validation average accuracy
search.best_score_

Fitting 10 folds for each of 1344 candidates, totalling 13440 fits


0.7107003129890455

In [None]:
# best parameters
search.best_params_

{'decisiontreeclassifier__criterion': 'gini',
 'decisiontreeclassifier__max_depth': 8,
 'decisiontreeclassifier__min_samples_leaf': 6,
 'simpleimputer__strategy': 'mean',
 'standardscaler__with_std': True}

## **Your challenge**

In a new notebook, apply everything you have learned here to the Housing project, following the Learning platform.