## Advanced tuning of parameters

In this tutorial, we will apply the skills from previous tutorials and build a classifier using the `Pipeline` and `FeatureUnion` classes from sklearn.

In [1]:
# IMPORT PACKAGES

In [1]:
import pandas as pd
import numpy as np

### Data

We will build a binary classifier that predicts whether a person has diabetes, using information on the patient's health.

The data can be found [here](https://drive.google.com/file/d/1TvCKlmH3Z32XAKk-VUcZyYu95Ccyw3PO/view?usp=sharing). 


In [1]:
col_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [4]:
df = pd.read_csv(r"C:\Users\bevli\Downloads\pima-indians-diabetes.csv", sep=';')
df

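| Unnamed: 0 | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |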
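
The `Unnamed: 0` column in the output above looks like a row index that was saved together with the data. If that is the case, it stays among the features when only `class` is dropped. The sketch below is one possible way to handle it, not part of the original notebook: it reads a small in-memory sample (three rows copied from the output above, in an assumed `;`-separated layout) and uses `index_col=0` so the first column becomes the index. The names `sample` and `sample_df` are made up for illustration.

```python
import io

import pandas as pd

# Three rows copied from the output above, written in the ';'-separated
# layout the file is assumed to have (leading unnamed index column).
sample = io.StringIO(
    ";preg;plas;pres;skin;test;mass;pedi;age;class\n"
    "0;6;148;72;35;0;33.6;0.627;50;1\n"
    "1;1;85;66;29;0;26.6;0.351;31;0\n"
    "2;8;183;64;0;0;23.3;0.672;32;1\n"
)

# index_col=0 turns the first column into the index instead of a feature.
sample_df = pd.read_csv(sample, sep=';', index_col=0)
print(sample_df.columns.tolist())
```

Whether to drop that column in the real file is a modelling decision; the results reported in this notebook were produced with it left in.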


### Task

Build a classifier that predicts the target variable `class` from the remaining attributes. The model should be fitted using a pipeline that contains:
- PCA method
- SelectKBest method
- FeatureUnion
- Random Forest

Choose the best set of parameters using `Pipeline` and `GridSearchCV`.
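
To make the role of `FeatureUnion` concrete, here is a minimal sketch on synthetic data. The data from `make_classification` is a stand-in for the diabetes table, and the values `n_components=2` and `k=3` are arbitrary choices for illustration. It shows that the union fits each transformer on the same input and stacks their outputs column-wise.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in for the diabetes data (assumption: 9 numeric features).
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)

union = FeatureUnion([("pca", PCA(n_components=2)),
                      ("select_best", SelectKBest(k=3))])

# y is needed because SelectKBest scores each feature against the target.
X_union = union.fit_transform(X_demo, y_demo)

# 2 principal components + 3 selected columns are stacked side by side.
print(X_union.shape)  # (200, 5)
```

The classifier at the end of the pipeline therefore sees both the projected components and a subset of the original columns.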

> #### Note
> **This exercise focuses on implementing the pipeline. Because the dataset has only 9 columns, PCA is probably not the best data preparation technique from a methodological point of view.**

In [15]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score


In [6]:
X = df.drop(columns='class')
y = df['class']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [10]:
pca = PCA(n_components=2)
selection = SelectKBest(k=7)
# FeatureUnion concatenates the PCA components with the k best original features
feature_union = FeatureUnion([("pca", pca), ("select_best", selection)])
rfc = RandomForestClassifier(n_jobs=1)

pipeline = Pipeline(steps=[('scaling', StandardScaler()),
                           ('features', feature_union),
                           ('classifier', rfc)])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy is", accuracy)

accuracy is 0.7619047619047619
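
The accuracy printed above will differ from run to run, because neither the train/test split nor the random forest is seeded. A minimal sketch of one way to make the split repeatable is shown below. It uses synthetic data, and the seed `42` and the `stratify` argument are illustrative choices that the original notebook does not use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption, not the diabetes file).
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=0)

# Same random_state -> same split; stratify keeps the class ratio in both parts.
split_a = train_test_split(X_demo, y_demo, test_size=0.3,
                           random_state=42, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.3,
                           random_state=42, stratify=y_demo)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Passing a `random_state` to `RandomForestClassifier` as well would remove the remaining source of run-to-run variation.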


In [12]:
from sklearn import set_config
set_config(display="diagram")
pipeline    # click on the diagram below to see the details of each step
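
Parameters of nested steps are addressed as `<step>__<substep>__<parameter>`, which is the naming a parameter grid has to follow. As a rough sketch, the snippet below rebuilds a pipeline with the same step names as above (under the hypothetical name `demo_pipeline`) and lists the tunable names found under the `features` step.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Same step names as the pipeline above, rebuilt so the snippet runs on its own.
demo_pipeline = Pipeline(steps=[
    ("scaling", StandardScaler()),
    ("features", FeatureUnion([("pca", PCA(n_components=2)),
                               ("select_best", SelectKBest(k=7))])),
    ("classifier", RandomForestClassifier()),
])

# Keys containing a double underscore address parameters of nested steps.
param_names = sorted(demo_pipeline.get_params().keys())
print([name for name in param_names if name.startswith("features__")])
```

Any key returned by `get_params()` can be used in a parameter grid.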

In [16]:
# set up the parameter grid; keys follow the <step>__<substep>__<parameter> pattern
param_grid = {"features__pca__n_components": [1, 2, 3],
              "features__select_best__k": [1, 2, 3, 4, 5, 6],
              "classifier__n_estimators": [50, 100, 200]}

# create a grid search object; refit=True retrains the best candidate on the full training set
grid_search = GridSearchCV(pipeline, param_grid, verbose=10, refit=True)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
best_parameters = grid_search.best_params_

Fitting 5 folds for each of 54 candidates, totalling 270 fits
[CV 1/5; 1/54] START classifier__n_estimators=50, features__pca__n_components=1, features__select_best__k=1
[CV 1/5; 1/54] END classifier__n_estimators=50, features__pca__n_components=1, features__select_best__k=1;, score=0.657 total time=   0.0s
[CV 2/5; 1/54] START classifier__n_estimators=50, features__pca__n_components=1, features__select_best__k=1
[CV 2/5; 1/54] END classifier__n_estimators=50, features__pca__n_components=1, features__select_best__k=1;, score=0.796 total time=   0.0s
[CV 3/5; 1/54] START classifier__n_estimators=50, features__pca__n_components=1, features__select_best__k=1
[CV 3/5; 1/54] END classifier__n_estimators=50, features__pca__n_components=1, features__select_best__k=1;, score=0.766 total time=   0.0s
[CV 4/5; 1/54] START classifier__n_estimators=50, features__pca__n_components=1, features__select_best__k=1
...
[CV 2/5; 8/54] END classifier__n_estimators=50, features__pca__n_components=2, features__select_best__k=2;, score=0.722 total time=   0.0s
[CV 3/5; 8/54] START classifier__n_estimators=50, features__pca__n_components=2, features__select_best__k=2
[CV 3/5; 8/54] END classifier__n_estimators=50, features__pca__n_components=2, features__select_best__k=2;, score=0.701 total time=   0.0s
[CV 4/5; 8/54] START classifier__n_estimators=50, features__pca__n_components=2, features__select_best__k=2
[CV 4/5; 8/54] END classifier__n_estimators=50, features__pca__n_components=2, features__select_best__k=2;, score=0.748 total time=   0.0s
[CV 5/5; 8/54] START classifier__n_estimators=50, features__pca__n_components=2, features__select_best__k=2
[CV 5/5; 8/54] END classifier__n_estimators=50, features__pca__n_components=2, features__select_best__k=2;, score=0.720 total time=   0.0s
[CV 1/5; 9/54] START classifier__n_estimators=50, features__pca__n_components=2, features__select_best__k=3
...
[CV 5/5; 14/54] END classifier__n_estimators=50, features__pca__n_components=3, features__select_best__k=2;, score=0.785 total time=   0.0s
[CV 1/5; 15/54] START classifier__n_estimators=50, features__pca__n_components=3, features__select_best__k=3
[CV 1/5; 15/54] END classifier__n_estimators=50, features__pca__n_components=3, features__select_best__k=3;, score=0.704 total time=   0.0s
[CV 2/5; 15/54] START classifier__n_estimators=50, features__pca__n_components=3, features__select_best__k=3
[CV 2/5; 15/54] END classifier__n_estimators=50, features__pca__n_components=3, features__select_best__k=3;, score=0.759 total time=   0.0s
[CV 3/5; 15/54] START classifier__n_estimators=50, features__pca__n_components=3, features__select_best__k=3
[CV 3/5; 15/54] END classifier__n_estimators=50, features__pca__n_components=3, features__select_best__k=3;, score=0.729 total time=   0.0s
[CV 4/5; 15/54] START classifier__n_estimators=50, features__pca__n_components=3, features__select_best__k=3
...
[CV 3/5; 21/54] END classifier__n_estimators=100, features__pca__n_components=1, features__select_best__k=3;, score=0.748 total time=   0.0s
[CV 4/5; 21/54] START classifier__n_estimators=100, features__pca__n_components=1, features__select_best__k=3
[CV 4/5; 21/54] END classifier__n_estimators=100, features__pca__n_components=1, features__select_best__k=3;, score=0.710 total time=   0.0s
[CV 5/5; 21/54] START classifier__n_estimators=100, features__pca__n_components=1, features__select_best__k=3
[CV 5/5; 21/54] END classifier__n_estimators=100, features__pca__n_components=1, features__select_best__k=3;, score=0.701 total time=   0.1s
[CV 1/5; 22/54] START classifier__n_estimators=100, features__pca__n_components=1, features__select_best__k=4
[CV 1/5; 22/54] END classifier__n_estimators=100, features__pca__n_components=1, features__select_best__k=4;, score=0.704 total time=   0.1s
...
[CV 2/5; 28/54] END classifier__n_estimators=100, features__pca__n_components=2, features__select_best__k=4;, score=0.750 total time=   0.1s
[CV 3/5; 28/54] START classifier__n_estimators=100, features__pca__n_components=2, features__select_best__k=4
[CV 3/5; 28/54] END classifier__n_estimators=100, features__pca__n_components=2, features__select_best__k=4;, score=0.692 total time=   0.1s
[CV 4/5; 28/54] START classifier__n_estimators=100, features__pca__n_components=2, features__select_best__k=4
[CV 4/5; 28/54] END classifier__n_estimators=100, features__pca__n_components=2, features__select_best__k=4;, score=0.710 total time=   0.1s
[CV 5/5; 28/54] START classifier__n_estimators=100, features__pca__n_components=2, features__select_best__k=4
[CV 5/5; 28/54] END classifier__n_estimators=100, features__pca__n_components=2, features__select_best__k=4;, score=0.766 total time=   0.1s
...
[CV 1/5; 35/54] END classifier__n_estimators=100, features__pca__n_components=3, features__select_best__k=5;, score=0.713 total time=   0.0s
[CV 2/5; 35/54] START classifier__n_estimators=100, features__pca__n_components=3, features__select_best__k=5
[CV 2/5; 35/54] END classifier__n_estimators=100, features__pca__n_components=3, features__select_best__k=5;, score=0.731 total time=   0.1s
[CV 3/5; 35/54] START classifier__n_estimators=100, features__pca__n_components=3, features__select_best__k=5
[CV 3/5; 35/54] END classifier__n_estimators=100, features__pca__n_components=3, features__select_best__k=5;, score=0.720 total time=   0.1s
[CV 4/5; 35/54] START classifier__n_estimators=100, features__pca__n_components=3, features__select_best__k=5
[CV 4/5; 35/54] END classifier__n_estimators=100, features__pca__n_components=3, features__select_best__k=5;, score=0.720 total time=   0.0s
...
[CV 4/5; 41/54] END classifier__n_estimators=200, features__pca__n_components=1, features__select_best__k=5;, score=0.701 total time=   0.2s
[CV 5/5; 41/54] START classifier__n_estimators=200, features__pca__n_components=1, features__select_best__k=5
[CV 5/5; 41/54] END classifier__n_estimators=200, features__pca__n_components=1, features__select_best__k=5;, score=0.776 total time=   0.2s
[CV 1/5; 42/54] START classifier__n_estimators=200, features__pca__n_components=1, features__select_best__k=6
[CV 1/5; 42/54] END classifier__n_estimators=200, features__pca__n_components=1, features__select_best__k=6;, score=0.713 total time=   0.2s
[CV 2/5; 42/54] START classifier__n_estimators=200, features__pca__n_components=1, features__select_best__k=6
[CV 2/5; 42/54] END classifier__n_estimators=200, features__pca__n_components=1, features__select_best__k=6;, score=0.750 total time=   0.2s
...
[CV 2/5; 48/54] END classifier__n_estimators=200, features__pca__n_components=2, features__select_best__k=6;, score=0.769 total time=   0.2s
[CV 3/5; 48/54] START classifier__n_estimators=200, features__pca__n_components=2, features__select_best__k=6
[CV 3/5; 48/54] END classifier__n_estimators=200, features__pca__n_components=2, features__select_best__k=6;, score=0.664 total time=   0.2s
[CV 4/5; 48/54] START classifier__n_estimators=200, features__pca__n_components=2, features__select_best__k=6
[CV 4/5; 48/54] END classifier__n_estimators=200, features__pca__n_components=2, features__select_best__k=6;, score=0.701 total time=   0.2s
[CV 5/5; 48/54] START classifier__n_estimators=200, features__pca__n_components=2, features__select_best__k=6
[CV 5/5; 48/54] END classifier__n_estimators=200, features__pca__n_components=2, features__select_best__k=6;, score=0.785 total time=   0.2s
...
[CV 5/5; 54/54] END classifier__n_estimators=200, features__pca__n_components=3, features__select_best__k=6;, score=0.776 total time=   0.2s
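
The verbose log is hard to scan. The same per-candidate scores are stored in the `cv_results_` attribute of a fitted search, which loads directly into a DataFrame. The sketch below shows this on a deliberately tiny search over synthetic data. The names `demo_search` and `summary`, the grid and `cv=3` are illustrative assumptions, chosen so the snippet runs quickly on its own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem and grid, only to keep the example fast.
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=0)
demo_search = GridSearchCV(RandomForestClassifier(random_state=0),
                           {"n_estimators": [10, 20]}, cv=3)
demo_search.fit(X_demo, y_demo)

# One row per candidate, ordered from best to worst mean CV score.
results = pd.DataFrame(demo_search.cv_results_)
summary = (results[["params", "mean_test_score", "std_test_score", "rank_test_score"]]
           .sort_values("rank_test_score"))
print(summary)
```

Applied to `grid_search` from this notebook, the same table would contain one row for each of the 54 candidates.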


In [18]:
best_acc = grid_search.score(X_test, y_test)
best_acc

0.8008658008658008

In [21]:
print(f'Accuracy of {best_acc} achieved using {best_parameters}')

Accuracy of 0.8008658008658008 achieved using {'classifier__n_estimators': 100, 'features__pca__n_components': 3, 'features__select_best__k': 2}


In [22]:
y_pred2 = grid_search.predict(X_test)

In [28]:
accuracy2 = accuracy_score(y_test, y_pred2)
accuracy2

0.8008658008658008
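
The two numbers match because, when no `scoring` argument is given, `GridSearchCV.score` falls back to the default scorer of the refitted best estimator, which is accuracy for classifiers. The sketch below checks this on synthetic data. The names `demo_search`, `score_from_search` and `score_from_metric` are hypothetical, and the small grid is only there to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with a fixed split so the snippet is repeatable.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=0)

demo_search = GridSearchCV(RandomForestClassifier(random_state=0),
                           {"n_estimators": [10, 20]}, cv=3, refit=True)
demo_search.fit(X_tr, y_tr)

# Both routes evaluate the refitted best estimator on the held-out data.
score_from_search = demo_search.score(X_te, y_te)
score_from_metric = accuracy_score(y_te, demo_search.predict(X_te))
print(score_from_search, score_from_metric)
```

If a different `scoring` were passed to the search, `score` would report that metric instead and the two values could differ.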