<a href="https://colab.research.google.com/github/sivasaiyadav8143/Machine-Learning-with-Python/blob/master/DataLeaks_with_GridSearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## GridSearch can cause data leaks with operations such as z-score scaling or imputation. This notebook demonstrates how GridSearch actually leads to data leaks and how to fix it.

In [61]:
#import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [3]:
# split data into train and test
X,y = load_breast_cancer(return_X_y=True)
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=14)

## Without Pipeline & Grid Search

In [62]:
#initialize StandardScaler and transform the data
std = StandardScaler()
X_train_scale = std.fit_transform(X_train)
X_test_scale = std.transform(X_test)

In [64]:
#initialize support vector classifier
svc = SVC()
svc.fit(X_train_scale,y_train)
print('Training score',svc.score(X_train_scale,y_train))
print('Testing score',svc.score(X_test_scale,y_test))

Training score 0.9882629107981221
Testing score 0.986013986013986


## Pipeline

In [65]:
from sklearn.pipeline import make_pipeline

#initialize the pipeline (make_pipe) with a scaler and a classifier
make_pipe = make_pipeline(StandardScaler(),SVC())
make_pipe.fit(X_train,y_train)

Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svc',
                 SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None,
                     coef0=0.0, decision_function_shape='ovr', degree=3,
                     gamma='scale', kernel='rbf', max_iter=-1,
                     probability=False, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [66]:
print('Training score',make_pipe.score(X_train,y_train))
print('Testing score',make_pipe.score(X_test,y_test))

Training score 0.9882629107981221
Testing score 0.986013986013986


## Grid Search

In [67]:
from sklearn.model_selection import GridSearchCV

grid_params = {'C': [0.001, 0.01, 0.1, 1.0, 10, 100], 'gamma':['scale','auto',0.001,0.1, 0.01, 10, 100,]}

grid = GridSearchCV(svc,grid_params,cv = 5)
grid.fit(X_train_scale,y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [0.001, 0.01, 0.1, 1.0, 10, 100],
                         'gamma': ['scale', 'auto', 0.001, 0.1, 0.01, 10, 100]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [68]:
grid.best_params_

{'C': 10, 'gamma': 0.01}

In [69]:
grid.best_score_

0.9765526675786594

In [70]:
grid.score(X_test_scale,y_test)

0.986013986013986

Grid search on pre-scaled data leaks information during cross-validation whenever a preprocessing step computes statistics over the whole training set, as the z-score (StandardScaler) does. Here the mean and standard deviation were computed on all of `X_train` before the search. When GridSearchCV then splits that data into cross-validation train and validation folds, each validation fold has already contributed to the mean and standard deviation used to scale the train folds, so the validation folds are not truly unseen.<br>
To avoid data leaks, we use a pipeline with Grid Search, so that the scaler is refit on the train folds only in every split.
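
To make the leak concrete, here is a minimal sketch (not part of the original notebook) that runs the cross-validation loop by hand. It assumes the same dataset and `random_state=14` split used above, and it uses `StratifiedKFold(n_splits=5)`, which is the splitter GridSearchCV applies to a classifier when `cv=5`. The names `leaky_scaler`, `fold_scaler`, `fold_scores`, and `mean_gaps` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

# Leaky statistics: this scaler sees every training row, including the rows
# that will later be held out as validation folds.
leaky_scaler = StandardScaler().fit(X_train)

cv = StratifiedKFold(n_splits=5)
fold_scores = []
mean_gaps = []
for train_idx, val_idx in cv.split(X_train, y_train):
    # Leak-free: fit the scaler on the train folds only.
    fold_scaler = StandardScaler().fit(X_train[train_idx])
    model = SVC().fit(fold_scaler.transform(X_train[train_idx]), y_train[train_idx])
    fold_scores.append(
        model.score(fold_scaler.transform(X_train[val_idx]), y_train[val_idx])
    )
    # How far the fold-only means are from the means computed on all of X_train.
    mean_gaps.append(np.abs(fold_scaler.mean_ - leaky_scaler.mean_).max())

print('Leak-free CV accuracy per fold:', np.round(fold_scores, 3))
print('Largest gap between fold means and full-train means:', np.round(mean_gaps, 3))
```

A non-zero gap in every fold shows that the scaler fitted on all of `X_train` carries information from the held-out rows. Putting the scaler inside a pipeline makes GridSearchCV perform the fold-only fit automatically.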

## Grid Search with Pipeline

In [71]:
make_pipe = make_pipeline(StandardScaler(),SVC())

grid_params = {'svc__C': [0.001, 0.01, 0.1, 1.0, 10, 100], 'svc__gamma':['scale','auto',0.001,0.1, 0.01, 10, 100,]}

grid_ = GridSearchCV(make_pipe,grid_params,cv = 5)
grid_.fit(X_train,y_train)


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('standardscaler',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('svc',
                                        SVC(C=1.0, break_ties=False,
                                            cache_size=200, class_weight=None,
                                            coef0=0.0,
                                            decision_function_shape='ovr',
                                            degree=3, gamma='scale',
                                            kernel='rbf', max_iter=-1,
                                            probability=False,
                                            random_state=None, shrinking=True,
                                            t

In [72]:
grid_.best_params_

{'svc__C': 1.0, 'svc__gamma': 'scale'}

In [73]:
grid_.best_score_

0.976497948016416

In [74]:
grid_.score(X_test,y_test)

0.986013986013986
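
The `svc__C` and `svc__gamma` keys used in the parameter grid above follow the `<step name>__<parameter>` convention, where `make_pipeline` names each step after its lowercased class name. As a small sketch (the variable names `pipe`, `step_names`, and `svc_params` are illustrative), the valid keys can be listed before building a grid:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = make_pipeline(StandardScaler(), SVC())

# Step names generated by make_pipeline
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Every tunable parameter of the SVC step, in the form GridSearchCV expects
svc_params = sorted(p for p in pipe.get_params() if p.startswith('svc__'))
print(svc_params)
```

Because the whole pipeline is the estimator, `grid_.score(X_test, y_test)` can take the raw, unscaled test data: the refitted scaler inside the best pipeline transforms it before the SVC predicts.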