## Advanced tuning of parameters

In this tutorial, we will apply the skills from previous tutorials and build a classifier using the `Pipeline` and `FeatureUnion` classes from sklearn.

In [22]:
# IMPORT PACKAGES
import pandas as pd
import numpy as np
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

### Data

We will build a binary classifier that predicts whether a person has diabetes, using information on the patient's health.

The data can be found [here](https://drive.google.com/file/d/1TvCKlmH3Z32XAKk-VUcZyYu95Ccyw3PO/view?usp=sharing). 


In [2]:
col_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
df = pd.read_csv(url, names=col_names)

In [3]:
df.head()

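| | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |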


In [18]:
X = df.drop(columns=['class'])
y = df['class']
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)
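
The split above is reshuffled on every run, so the scores reported later will vary slightly. As a minimal sketch, assuming a fixed seed and a stratified split are wanted (neither is part of the original notebook), the same call can be made reproducible. The data here is a synthetic stand-in, and the names `X_demo` and `y_demo` are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data (hypothetical values):
# 100 rows, 8 features, 35 positive and 65 negative labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = np.array([1] * 35 + [0] * 65)

# random_state fixes the shuffle; stratify keeps the class ratio
# approximately equal in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of positive labels in each part.
print(y_tr.mean(), y_te.mean())
```

Stratifying matters most when one class is clearly smaller than the other, as with the positive class in this dataset.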

### Task

Build a classifier that predicts the target variable `class` from the remaining attributes. The model should be fitted using a pipeline that contains:
- PCA method
- SelectKBest method
- FeatureUnion
- Random Forest

Choose the best set of parameters by running `GridSearchCV` over the `Pipeline`.

> #### Note
> **In this exercise, we focus on implementing the pipeline. Since the dataset has only 9 columns (8 features plus the target), PCA is probably not the best data-preparation technique from a methodological point of view.**

In [14]:
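scaler = StandardScaler()
pca = PCA(n_components=3)
# k=10 exceeds the 8 available features; it is only a placeholder,
# because the grid search below overrides k with values from 2 to 8.
skb = SelectKBest(k=10)
# FeatureUnion concatenates the PCA components and the selected features.
comb_f = FeatureUnion([("pca", pca), ("sel_f", skb)])
ran = RandomForestClassifier(n_estimators=50, max_depth=6, min_samples_leaf=10)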
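
To see what `FeatureUnion` produces, the sketch below fits one on synthetic data (an assumption made for illustration, not the diabetes data) and checks the width of the result. The output has one column per PCA component plus one per selected feature:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic data: 60 rows, 8 features, alternating binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 8))
y_demo = np.array([0, 1] * 30)

# 3 PCA components side by side with the 4 best-scoring original features.
union = FeatureUnion([("pca", PCA(n_components=3)),
                      ("sel_f", SelectKBest(k=4))])
X_out = union.fit_transform(X_demo, y_demo)

print(X_out.shape)  # (60, 7): 3 components + 4 selected features
```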

In [15]:
pipeline = Pipeline([("scaler",scaler),
                     ("fea_sel",comb_f),
                     ("estimator",ran)])

In [11]:
n_params = np.arange(2,9)

In [12]:
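# Keys follow the pattern <step name>__<parameter>; nested steps chain
# with further double underscores (fea_sel -> pca -> n_components).
params = {"fea_sel__pca__n_components": n_params,
          "fea_sel__sel_f__k": n_params,
          "estimator__n_estimators": [30, 50, 80],
          "estimator__max_depth": [5, 8, 10]}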

In [13]:
g_model = GridSearchCV(pipeline,param_grid=params,cv = 5, n_jobs=-1)
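
Before starting a search, it can help to confirm that every key in the grid is a real pipeline parameter and to estimate how many models will be fitted. The following is a sketch that rebuilds an equivalent pipeline under the hypothetical names `pipe` and `grid`, so that it runs on its own:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("fea_sel", FeatureUnion([("pca", PCA()), ("sel_f", SelectKBest())])),
    ("estimator", RandomForestClassifier()),
])

grid = {
    "fea_sel__pca__n_components": np.arange(2, 9),
    "fea_sel__sel_f__k": np.arange(2, 9),
    "estimator__n_estimators": [30, 50, 80],
    "estimator__max_depth": [5, 8, 10],
}

# get_params() lists every tunable name, including nested ones.
valid_names = pipe.get_params().keys()
unknown = [name for name in grid if name not in valid_names]
print(unknown)  # an empty list means every key is valid

# Number of parameter combinations, and total fits with 5-fold CV
# (excluding the final refit on the full training set).
n_candidates = len(ParameterGrid(grid))
n_fits = n_candidates * 5
print(n_candidates, n_fits)  # 441 2205
```

With 7 x 7 x 3 x 3 combinations, the grid is already sizeable for a small dataset, which is why `n_jobs=-1` is useful here.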

In [19]:
g_model.fit(X_train,Y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('fea_sel',
                                        FeatureUnion(transformer_list=[('pca',
                                                                        PCA(n_components=3)),
                                                                       ('sel_f',
                                                                        SelectKBest())])),
                                       ('estimator',
                                        RandomForestClassifier(max_depth=6,
                                                               min_samples_leaf=10,
                                                               n_estimators=50))]),
             n_jobs=-1,
             param_grid={'estimator__max_depth': [5, 8, 10],
                         'estimator__n_estimators': [30, 50, 80],
                         'fea_sel__pca__n_components': array([2, 3, 

In [20]:
g_model.best_estimator_

Pipeline(steps=[('scaler', StandardScaler()),
                ('fea_sel',
                 FeatureUnion(transformer_list=[('pca', PCA(n_components=7)),
                                                ('sel_f', SelectKBest(k=2))])),
                ('estimator',
                 RandomForestClassifier(max_depth=8, min_samples_leaf=10,
                                        n_estimators=30))])
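
Besides `best_estimator_`, a fitted `GridSearchCV` also exposes the winning parameter values and their mean cross-validated score. The sketch below shows this on a deliberately small, synthetic problem (the data and the reduced grid are assumptions chosen so that it runs quickly):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic binary classification problem with 8 features.
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("estimator", RandomForestClassifier(n_estimators=10, random_state=0)),
])
search = GridSearchCV(pipe, {"estimator__max_depth": [2, 4]}, cv=3)
search.fit(X_demo, y_demo)

# Winning parameter values and their mean cross-validated accuracy.
print(search.best_params_)
print(round(search.best_score_, 3))
```

Note that `best_score_` is computed on the validation folds of the training data, so it is not the same number as the test-set accuracy.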

In [21]:
y_pred = g_model.predict(X_test)

In [24]:
accuracy_score(Y_test,y_pred)

0.7662337662337663

In [26]:
print(classification_report(Y_test,y_pred))

              precision    recall  f1-score   support

           0       0.81      0.84      0.83       102
           1       0.67      0.62      0.64        52

    accuracy                           0.77       154
   macro avg       0.74      0.73      0.73       154
weighted avg       0.76      0.77      0.76       154
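
The per-class precision and recall in the report come from the confusion matrix. As a sketch on a small set of made-up labels (not the predictions from this notebook), the positive-class values can be recomputed by hand:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(tn, fp, fn, tp)       # 5 1 1 3
print(precision, recall)    # 0.75 0.75
```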

