## Advanced tuning of parameters

In this tutorial, we will apply the skills from previous tutorials to build a classifier using the `Pipeline` and `FeatureUnion` classes from sklearn.

In [69]:
# import necessary libraries and packages
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
import pandas as pd

### Data

We will build a binary classifier that predicts whether a person has diabetes, using information on the patient's health.

The data can be found [here](https://drive.google.com/file/d/1TvCKlmH3Z32XAKk-VUcZyYu95Ccyw3PO/view?usp=sharing). 


In [70]:
col_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [71]:
# Load dataset and skip the first row (header)
df = pd.read_csv('pima_indians_diabetes.csv', sep=';', names=col_names, skiprows=1)
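
The combination of `names` and `skiprows=1` in the call above replaces the file's own header with the labels in `col_names`. A minimal sketch of that behaviour, assuming a small in-memory file (the `raw` string and its column labels are hypothetical stand-ins, not the diabetes data):

```python
import io

import pandas as pd

# Hypothetical two-row file with its own header line, standing in for the real CSV.
raw = "a;b;c\n1;2;3\n4;5;6\n"

# skiprows=1 discards the file's header line; names supplies the replacement labels.
demo_df = pd.read_csv(io.StringIO(raw), sep=";", names=["x", "y", "z"], skiprows=1)

print(demo_df.columns.tolist())  # ['x', 'y', 'z']
print(demo_df.shape)             # (2, 3)
```

Without `skiprows=1`, the original header line would be read as a data row.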

In [72]:
df.head()

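| | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |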


In [73]:
# Check for missing values in the target variable
print(df['class'].isnull().sum())

0


In [74]:
# Drop rows with missing target values
df = df.dropna(subset=['class'])

### Task

Build a classifier that predicts the target variable `class` using the rest of the attributes. The model should be fitted using a pipeline that contains:
- PCA method
- SelectKBest method
- FeatureUnion
- Random Forest

Choose the best set of parameters by running `GridSearchCV` over the `Pipeline`.

> #### Note
> **In this exercise, we are focusing on the implementation of the pipeline. Because our dataset has only 9 columns, PCA is probably not the best technique to use during data preparation from a methodological point of view.**

In [75]:
# define your X and y
X = df.drop('class', axis=1)
y = df['class']

In [76]:
# PCA transformer
pca = PCA()

In [77]:
# SelectKBest transformer
selection = SelectKBest()

In [78]:
# FeatureUnion transformer
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
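
A `FeatureUnion` fits each transformer on the same input and concatenates their outputs column-wise. The sketch below illustrates this on synthetic data; the names `X_demo`, `y_demo`, and `union_demo`, and the choices `n_components=2` and `k=1`, are assumptions made for illustration rather than settings from this tutorial.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in for the diabetes features: 100 rows, 8 numeric columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = np.tile([0, 1], 50)  # balanced binary target, needed by SelectKBest

# Two PCA components plus the single best-scoring original column.
union_demo = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("univ_select", SelectKBest(k=1)),
])

X_combined = union_demo.fit_transform(X_demo, y_demo)
print(X_combined.shape)  # (100, 3): 2 PCA columns followed by 1 selected column
```

The width of the combined matrix is always `n_components + k`, which is what the classifier at the end of the pipeline receives.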

In [79]:
# Random Forest Classifier
rfc = RandomForestClassifier()

In [80]:
# Pipeline integrating the FeatureUnion and the classifier
pipeline = Pipeline([("features", combined_features), ("rfc", rfc)])

In [81]:
# parameter grid for grid search
param_grid = {
    'features__pca__n_components': [1, 2, 3],
    'features__univ_select__k': [1, 2],
    'rfc__n_estimators': [10, 20, 30],
    'rfc__max_depth': [None, 5, 10],
}
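
The keys in `param_grid` follow the `step__parameter` convention: double underscores chain the pipeline step name, the name inside the `FeatureUnion`, and the parameter itself. As a rough check, the sketch below rebuilds the same structure under the assumed names `demo_pipeline` and `demo_grid`, verifies that every key is a parameter the pipeline exposes, and counts the candidate settings.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import FeatureUnion, Pipeline

# Rebuild the same pipeline structure used in this tutorial.
demo_pipeline = Pipeline([
    ("features", FeatureUnion([("pca", PCA()), ("univ_select", SelectKBest())])),
    ("rfc", RandomForestClassifier()),
])

demo_grid = {
    "features__pca__n_components": [1, 2, 3],
    "features__univ_select__k": [1, 2],
    "rfc__n_estimators": [10, 20, 30],
    "rfc__max_depth": [None, 5, 10],
}

# Every key in the grid must be a parameter name the pipeline exposes.
valid_names = demo_pipeline.get_params().keys()
unknown = [name for name in demo_grid if name not in valid_names]
print(unknown)  # [] when every key is valid

# Number of candidate settings: 3 * 2 * 3 * 3.
n_candidates = len(ParameterGrid(demo_grid))
print(n_candidates)      # 54
print(n_candidates * 5)  # 270 cross-validation fits with cv=5, before the final refit
```

A misspelled key makes the search fail at fit time, so a check like this can save a long wait.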

In [82]:
# grid search
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)

In [83]:
# fit the model
grid_search.fit(X, y)
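
Once fitted, a `GridSearchCV` object exposes the winning configuration through `best_params_`, `best_score_`, and `best_estimator_`. The sketch below shows how these might be inspected. It runs on synthetic data from `make_classification` with a deliberately reduced grid so that it finishes quickly; the `demo_*` names and the `random_state` values are assumptions, and the printed values say nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline

# Synthetic stand-in data (8 features, binary target); not the diabetes dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_pipeline = Pipeline([
    ("features", FeatureUnion([("pca", PCA()), ("univ_select", SelectKBest())])),
    ("rfc", RandomForestClassifier(random_state=0)),
])

# A deliberately small grid so the sketch runs quickly.
demo_grid = {
    "features__pca__n_components": [1, 2],
    "features__univ_select__k": [1, 2],
    "rfc__n_estimators": [10],
}

demo_search = GridSearchCV(demo_pipeline, param_grid=demo_grid, cv=5)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)  # the winning parameter combination
print(demo_search.best_score_)   # its mean cross-validated accuracy

# With the default refit=True, best_estimator_ is the pipeline refit on all the data.
best_model = demo_search.best_estimator_
print(best_model.predict(X_demo[:5]))
```

The same three attributes are available on the tutorial's `grid_search` object after the `fit` call above.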