# Section 8 - Pipeline, Grid Search, Random Forests
This section will get you to practice:
1. Classification algorithms recently covered in lectures, such as decision trees and random forests.
2. Parameter optimization via pipelines and grid search.

## 0 Data
### 0.1 Load
A copy of the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset is available from `sklearn.datasets`.

- A summary of information is provided here:
https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-dataset
- Dataset can also be downloaded from: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic. 

**Task:**

- Run the cell below to load the breast cancer data.
- Features `X`: 30 numeric features per sample.
- Target `y`: 0 (malignant/harmful/bad) and 1 (benign/harmless/good), following the `target_names` order `['malignant', 'benign']` used by sklearn.

In [None]:
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load the Breast Cancer dataset as a pandas DataFrame
# X stores sample features and y stores labels (0 = malignant, 1 = benign)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
display(X)

In [None]:
print('num of 1s (benign):  ', np.count_nonzero(y), 'of', len(y))

### 0.2 Split
**Discuss:**
- What does stratified shuffle split do? What inputs does it take and what outputs does it give to the user?

    **Ans:** 

**Task:**
- Split the data into training and test sets using [`sklearn.model_selection.StratifiedShuffleSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html). 
    - Set `n_splits=1`, `test_size=1/5`, `random_state=0`.
- Verify that the stratified split was performed correctly by printing relevant sizes of arrays.
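
As a rough illustration of what a stratified shuffle split returns, the sketch below uses a small made-up dataset (`X_toy` and `y_toy` are hypothetical, not the breast cancer data). It assumes a 70/30 class balance and shows that `split()` yields index arrays and that both subsets keep the same class proportions.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (hypothetical): 100 samples, 30% of them in class 1.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 70 + [1] * 30)

# One stratified split with 20% of the samples held out for testing.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)

# split() yields (train_indices, test_indices) pairs, one pair per split.
train_idx, test_idx = next(splitter.split(X_toy, y_toy))

# Both subsets keep the 30% share of class 1.
print("train:", len(train_idx), "samples, class-1 share", y_toy[train_idx].mean())
print("test :", len(test_idx), "samples, class-1 share", y_toy[test_idx].mean())
```

The indices are positional, so with a pandas DataFrame or Series they are typically applied through `.iloc`.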

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
# Use random_state for reproducibility (each time this cell is executed
# we get the same training and test sets).
sss = None      # TODO
for i, (train_idx, test_idx) in enumerate(sss.split(X, y)):
    X_train, y_train = None, None      # TODO
    X_test , y_test  = None, None      # TODO

# Verify correctness of split.
print('num of 1 (benign)')
print("Training set:", (y_train==1).sum(), "out of", y_train.shape[0])
print("Test set    :", (y_test==1).sum(), "out of", y_test.shape[0])

## 1 Decision Tree GridSearchCV
This step-by-step example comes first, before the same procedure is applied to the other classifiers later. The goal is to fully understand how grid search works.

### Step 1: Prepare Decision Tree Classifier and Grid Search
**Task:**
- Create a sklearn decision tree classifier `dt_clf`.
- Define `dt_grid`, a dictionary mapping each parameter to the values it may take in the grid search:
    - criterion: gini or entropy
    - max_depth: 2, 3, 4
    - min_samples_split: 5, 10, 15
    - **Discuss:** What does each of these parameters mean?
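
To get a feel for the size of this search space, the sketch below enumerates the combinations with `sklearn.model_selection.ParameterGrid`. The variable names `grid` and `combos` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# The grid from the task above, written as a dictionary of lists.
grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4],
    "min_samples_split": [5, 10, 15],
}

# ParameterGrid expands the dictionary into every combination of values.
combos = list(ParameterGrid(grid))
print(len(combos))   # 2 * 3 * 3 combinations
print(combos[0])     # one combination, as a dict of parameter -> value
```

With 5-fold cross-validation, a grid search fits one model per combination per fold, so the number of fits grows quickly as values are added.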

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt_clf = None       # TODO

# parameters and their possible values
dt_grid = {
    'criterion': None,          # TODO
    'max_depth': None,          # TODO
    'min_samples_split': None   # TODO
}

### Step 2: Find best parameters using GridSearchCV
Perform grid search using cross-validation in sklearn.

**Task:**
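- Define `dt_search`, a `GridSearchCV` object.
    - Pass in the classifier `dt_clf` and the parameter grid `dt_grid`.
    - Set `cv=5` for 5-fold cross-validation. Note: these folds are carved out of the training data only, so they are different from the train/test split made above.
- Fit `dt_search` to the training data.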
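
For a classifier and an integer `cv`, `GridSearchCV` uses stratified k-fold splitting. The sketch below runs `StratifiedKFold` directly on made-up labels (`X_toy`, `y_toy` are hypothetical) to show what the five folds look like: each sample lands in the validation part exactly once.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical training labels: 60 samples of class 0, 40 of class 1.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.zeros((100, 3))  # feature values do not matter for splitting

cv = StratifiedKFold(n_splits=5)
val_counts = np.zeros(len(y_toy), dtype=int)

for fold, (fit_idx, val_idx) in enumerate(cv.split(X_toy, y_toy)):
    # Count how often each sample is used for validation.
    val_counts[val_idx] += 1
    print(f"fold {fold}: fit on {len(fit_idx)}, validate on {len(val_idx)}, "
          f"class-1 share in validation = {y_toy[val_idx].mean():.2f}")
```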

In [None]:
from sklearn.model_selection import GridSearchCV

dt_search = None            # TODO
dt_search.fit(None, None)   # TODO

**Task:**
- Extract the best parameters (`best_params_`) and print them.
- Extract the best model (`best_estimator_`) and score its accuracy on the test data.
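
The attributes named above can be seen on a small synthetic problem. This is only a sketch: the data comes from `make_classification`, and the names `X_demo`, `X_fit`, `X_hold`, and `search` are made up for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split into a fitting part and a held-out part.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Search over a single parameter with 5-fold cross-validation.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_depth": [1, 2, 3]}, cv=5)
search.fit(X_fit, y_fit)

print("best params  :", search.best_params_)      # dict of winning values
print("best CV score:", search.best_score_)       # mean CV accuracy of the winner
print("held-out acc :", search.best_estimator_.score(X_hold, y_hold))
```

With the default `refit=True`, `best_estimator_` is refit on all of the data passed to `fit`, using the winning parameters.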

In [None]:
best_params = None      # TODO
print("Best Parameters:", best_params)

best_estimator = None   # TODO
accuracy = best_estimator.score(X_test, y_test)
print("Test Set Accuracy:", accuracy)

### Step 3: Visualize best Decision Tree Classifier
**Task:**
- Visualize the resulting tree using the `plot_tree()` function from `sklearn.tree`.
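
As a text-based complement to the plot, `sklearn.tree.export_text` prints the split rules of a fitted tree. The sketch below fits a shallow tree on the full dataset purely for illustration (it is not the grid-search result, and `small_tree` is a made-up name).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# A deliberately shallow tree so that the printed rules stay short.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(data.data, data.target)

# One line per node: the feature threshold tested, or the predicted class at a leaf.
rules = export_text(small_tree, feature_names=list(data.feature_names))
print(rules)
```

For `plot_tree`, class names are matched to labels in sorted order, so `data.target_names` (`['malignant', 'benign']`) lines up with labels 0 and 1.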

In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(90,70))
plot_tree(None, feature_names=None, class_names=None, filled=True)  # TODO
plt.show()

## 2 More Classifiers with GridSearchCV
### 2.1 Train models
We have covered the following classifiers in lecture:
- kNN
- LDA, QDA, GNB
- decision trees, random forests

The next exercise is to code up a pipeline that will compare all models at the same time.

**Task:**
1. Write the function `best_model`, which finds the best model given a pipeline/classifier and a parameter grid.
2. Use [`sklearn.pipeline.Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to find the best classifiers, while allowing a range of parameters to be considered for each classifier. 
    - "best" means the highest accuracy on cross validation.
    - You should fit the models to the training data.
    - Parameters of each pipeline step are set with the `__` (double underscore) convention, i.e. `<step name>__<parameter name>`. Look at the lecture notes or the sklearn documentation for examples.
    - Where necessary, seek optimal hyperparameters using the `best_model` function you wrote above.
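
The double underscore convention can be sketched as follows. `LogisticRegression` is used here only as a stand-in estimator that is not part of this section's exercises, and the step names `"scaler"` and `"clf"` are arbitrary choices for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Each step is a (name, estimator) pair; the names are chosen freely.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid keys follow "<step name>__<parameter name>".
grid = {"clf__C": [0.1, 1.0, 10.0]}

# get_params() lists every key that can legally appear in the grid.
print([name for name in pipe.get_params() if "__" in name][:5])

search = GridSearchCV(pipe, grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Because the scaler sits inside the pipeline, it is refit on the fitting part of each cross-validation fold rather than on all of the data at once.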

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


# Return best classifier from grid search. CV = cross-validation
def best_model(pipe, grid, X_train, y_train):
    '''
    pipe: pipeline object or sklearn classifier object
    grid: dictionary of parameters to explore. if using pipeline, 
            ensure the double underscore __ convention is used
    X_train, y_train: the training data
    '''
    search = None               # TODO
                                # TODO
    return None                 # TODO

#### kNN
- use standard scaler in pipeline
- explore n_neighbors: [4, 16, 32]


In [None]:
# k-NN with pipeline and standard scaler
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

pipe_knn = Pipeline([
    None
    ])
knn_grid = {
    None
    }
knn_model = best_model(pipe_knn, knn_grid, X_train, y_train)

#### GNB, LDA, QDA
- No pipeline needed.
    - **Discuss:** Why doesn't it make much sense to have a pipeline here?
- Use default parameters for all three.

In [None]:
# GNB, LDA, QDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

gnb_model = None
lda_model = None
qda_model = None

#### Decision Tree
- use pca in pipeline
    - n_components: [10, 20, 30]
- explore 
    - criterion: ['gini', 'entropy']
    - max_depth: [2, 3, 4, 5]
    - min_samples_split: [5, 10, 15]
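
To see what the `n_components` values mean for this 30-feature dataset, the sketch below applies PCA on its own, outside any pipeline. The names `X_all`, `pca`, and `X_reduced` are only illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

X_all, _ = load_breast_cancer(return_X_y=True)

for k in (10, 20, 30):
    pca = PCA(n_components=k)
    X_reduced = pca.fit_transform(X_all)
    # Output has k columns; the ratio sum is the share of variance retained.
    print(k, X_reduced.shape, pca.explained_variance_ratio_.sum())
```

With `n_components=30` nothing is discarded: the data is only rotated onto the principal axes.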

In [None]:
# Decision Tree with pipeline
from sklearn.tree import DecisionTreeClassifier

pipe_dt = Pipeline([
    None
    ])
dt_grid = {
    None
    }
dt_model = best_model(pipe_dt, dt_grid, X_train, y_train)


#### Random Forest
- use pca in pipeline
    - n_components: [10, 20, 30]
- explore
    - n_estimators: [10, 50, 200]

In [None]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn import decomposition

pipe_rf = Pipeline([
    None
    ])
rf_grid = {
    None
    }
rf_model = best_model(pipe_rf, rf_grid, X_train, y_train)

### 2.2 Compare models
**Task:**
- Which model has the best accuracy on test data? Print it.
- What hyperparameters were used to train the best model? Can you identify which of its parameters are hyperparameters?
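
One possible way to keep model names and accuracies together is a dictionary, sketched below on synthetic data with two default classifiers. The names `fitted`, `scores`, and `best_name` are hypothetical and not part of the exercise code.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Map a readable name to each fitted model.
fitted = {
    "gnb": GaussianNB().fit(X_fit, y_fit),
    "lda": LinearDiscriminantAnalysis().fit(X_fit, y_fit),
}

# score() returns accuracy for classifiers.
scores = {name: model.score(X_hold, y_hold) for name, model in fitted.items()}
best_name = max(scores, key=scores.get)

print(scores)
print("best:", best_name)
# get_params() lists the settings the model was constructed with.
print(fitted[best_name].get_params())
```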

In [None]:
from sklearn.metrics import accuracy_score as acc
models = [knn_model, gnb_model, lda_model, qda_model, dt_model, rf_model]

# Best model.
model_accuracies = [acc(y_test, mod.predict(X_test)) for mod in models]
print('models     :', '[knn_model, gnb_model, lda_model, qda_model, dt_model, rf_model]')
print('accuracies :', np.round(model_accuracies, 4))

best_model_idx = np.argmax(model_accuracies)
print('\n*best model:', best_model_idx+1)
print(models[best_model_idx])

### 2.3 Remove standard scaler/PCA and rerun the cells above
- Standard scaler shifts and rescales the data in each feature/dimension to have mean 0 and variance 1.
- PCA reduces the dimensionality of the data.

These are preprocessing steps applied to the data, though not necessarily the "right" thing to do. Try removing the standard scaler/PCA and see what effect it has on the prediction accuracy.
- You can do this by simply commenting out the relevant lines of code.
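
To check what the standard scaler does to this dataset, the sketch below compares per-feature statistics before and after scaling. It runs on the full dataset for illustration only, and `X_all` and `X_scaled` are made-up names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X_all, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X_all)

# Raw features live on very different scales.
print("raw std range   :", X_all.std(axis=0).min(), "to", X_all.std(axis=0).max())

# After scaling, every feature has mean ~0 and standard deviation ~1.
print("scaled mean max :", np.abs(X_scaled.mean(axis=0)).max())
print("scaled std range:", X_scaled.std(axis=0).min(), "to", X_scaled.std(axis=0).max())
```

Distance-based methods such as kNN depend on feature scale, which is one reason removing the scaler may change the comparison.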

**Discuss:** What is the best model now?