# Scikit-learn - Unit 05 - Cross Validation Search (GridSearchCV) and Hyperparameter Optimization Binary Clf- Part 01

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn and use GridSearchCV for Hyperparameter Optimization




---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

We will install scikit-learn, xgboost, feature-engine and yellowbrick to run our exercises.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 05 - Cross Validation Search (GridSearchCV) and Hyperparameter Optimization

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Hyperparameter Optimization with 1 algorithm

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Binary Classification

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In the last section, we saw how to conduct hyperparameter tuning using one algorithm to solve a Regression problem.
* There is a small difference in how you use GridSearchCV when your ML task is classification. We will cover that now.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are going to consider a similar workflow to the one we studied earlier:
* Split the data
* Define the pipeline and hyperparameters
* Fit the pipeline
* Evaluate the pipeline

Let's load the breast cancer data from sklearn. Each record describes a breast mass sample together with a diagnosis stating whether it is malignant or benign, where 0 is malignant and 1 is benign.

from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
df_clf = pd.DataFrame(data.data,columns=data.feature_names)
df_clf['diagnostic'] = pd.Series(data.target)
df_clf = df_clf.sample(frac=0.5, random_state=101)


print(df_clf.shape)
df_clf.head()
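As a quick sanity check of the class encoding described above, the sketch below reads the label names that ship with the dataset. The names `data_check` and `label_lookup` are introduced here for illustration only and are not used elsewhere in the notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Reload the dataset under a separate name so nothing else is overwritten
data_check = load_breast_cancer()

# target_names is ordered by class value: index 0 -> class 0, index 1 -> class 1
label_lookup = {i: str(name) for i, name in enumerate(data_check.target_names)}
print(label_lookup)

# Count how many records belong to each diagnosis in the full dataset
print(pd.Series(data_check.target).map(label_lookup).value_counts())
```

Checking the mapping early helps avoid tuning a metric for the wrong class later on.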

We split the data into train and test sets.

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['diagnostic'],axis=1),
                                    df_clf['diagnostic'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

Next, we create a pipeline with 3 steps: feature scaling, feature selection and modelling with RandomForestClassifier.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

def pipeline_clf():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(RandomForestClassifier(random_state=101)) ),
      ( "model", RandomForestClassifier(random_state=101)),

    ])

  return pipeline


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We define our hyperparameter list based on the algorithm documentation. One approach is to consider the default parameter value together with a set of values around it.
* In this case, there are 2 possible hyperparameter combinations, since we list 2 values for a single hyperparameter.

# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
from sklearn.model_selection import GridSearchCV

param_grid = {"model__n_estimators":[50,20],
              }

param_grid
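To see how the number of combinations follows from a grid, the sketch below uses scikit-learn's `ParameterGrid` to enumerate them. The first grid mirrors the one above; the second grid, including the `model__max_depth` values, is a hypothetical extension added only to show how the count multiplies.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the exercise: one hyperparameter with two candidate values
grid_one_param = {"model__n_estimators": [50, 20]}

# Hypothetical larger grid (illustration only): 2 values x 3 values
grid_two_params = {"model__n_estimators": [50, 20],
                   "model__max_depth": [None, 5, 10]}

# ParameterGrid expands a grid into every combination of the listed values
for combination in ParameterGrid(grid_one_param):
    print(combination)

print(len(ParameterGrid(grid_one_param)))   # 2 combinations
print(len(ParameterGrid(grid_two_params)))  # 6 combinations
```

The count grows multiplicatively with each added hyperparameter, which is worth keeping in mind before widening a grid.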

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> When we move to Classification, the GridSearchCV `scoring` argument changes.



* We consider that in our classification projects, the potential performance metrics are: accuracy, recall, precision and f1 score.
  * When the metric is recall, precision or f1 score, we need to state which class we want to tune for, and use `make_scorer()` as an "auxiliary" function to define both the metric and the class to tune. The documentation for make_scorer is found [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html)
  * When your performance metric is recall, you need to import [recall_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html); if it is precision, [precision_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html); and if it is F1 score, [f1_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html), so that you can pass the metric to the `make_scorer()` function.
  * When your performance metric is accuracy, you simply write "accuracy" for scoring: `scoring='accuracy'`
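The options above can be sketched as follows. The toy arrays and the always-benign `DummyClassifier` are made up for illustration; they are not part of the exercise data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer, recall_score, precision_score, f1_score

# make_scorer forwards extra keyword arguments (here pos_label) to the metric
recall_class0 = make_scorer(recall_score, pos_label=0)
recall_class1 = make_scorer(recall_score, pos_label=1)
precision_class0 = make_scorer(precision_score, pos_label=0)
f1_class0 = make_scorer(f1_score, pos_label=0)

# Toy data (illustration only): 2 samples of class 0 and 4 samples of class 1
X_toy = np.zeros((6, 1))
y_toy = np.array([0, 0, 1, 1, 1, 1])

# A baseline that always predicts class 1, so it never detects class 0
always_one = DummyClassifier(strategy="constant", constant=1).fit(X_toy, y_toy)

# A scorer is called as scorer(estimator, X, y_true)
print(recall_class0(always_one, X_toy, y_toy))  # recall on class 0
print(recall_class1(always_one, X_toy, y_toy))  # recall on class 1
```

The same model scores very differently depending on `pos_label`, which is why the class has to be stated explicitly when tuning on recall, precision or f1 score.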





<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> In this exercise, we have 0 and 1 as diagnostics for breast cancer.
* We assume that when defining the ML business case, it was agreed that the performance metric is recall on malignant (0), since the client needs to detect malignant cases.
* The client doesn't want to miss a malignant case, even at the cost of misidentifying some benign cases as malignant. For this client, that is not as bad as misidentifying a malignant case as benign. Therefore, the model is tuned on recall for malignant (0).
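A small worked example may help make this trade-off concrete. The labels below are invented for illustration: 4 malignant (0) and 6 benign (1) cases, where the model misses 1 malignant case and raises 2 false alarms.

```python
from sklearn.metrics import precision_score, recall_score

# Invented labels for illustration: 0 = malignant, 1 = benign
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Recall on malignant: of the 4 truly malignant cases, 3 were detected
recall_malignant = recall_score(y_true_toy, y_pred_toy, pos_label=0)

# Precision on malignant: of the 5 cases flagged as malignant, 3 were correct
precision_malignant = precision_score(y_true_toy, y_pred_toy, pos_label=0)

print(recall_malignant)     # 3 / 4
print(precision_malignant)  # 3 / 5
```

The 2 false alarms lower precision but leave recall untouched; only the missed malignant case reduces recall, which matches the client's priority.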


from sklearn.metrics import make_scorer, recall_score
from sklearn.metrics import f1_score # in case your metric is f1 score, you would need this import
from sklearn.metrics import precision_score # in case your metric is precision, you would need this import

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> The arguments `estimator, param_grid, cv, n_jobs, verbose` are similar to the previous example.

* The focus is now on `scoring` when creating the object to conduct a grid search. You will need `make_scorer()` to tune on recall for class 0 for this binary classifier.
  * Pass `recall_score` to `make_scorer()` as your metric, and `pos_label` to identify the class whose recall you want to tune. In this case, it is 0.
* Next, you fit the grid search with the train set (features and target) as usual


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Since `cv=2`, we will fit 2 models for each hyperparameter combination using k-fold cross-validation. Therefore 4 models (2 times 2) are trained in the end.
* The same dynamic repeats: it computes the performance for each cross-validated model and takes the average performance for a given hyperparameter combination, then iterates over each hyperparameter combination


grid = GridSearchCV(estimator=pipeline_clf(),
                    param_grid=param_grid,
                    cv=2,
                    n_jobs=-2,
                    verbose=3,
                    # in the workplace we typically set verbose to 1, 
                    # to reduce the amount of messages when fitting the models
                    # for teaching purpose, we set to 3 to see the score for each cross validated model
                    scoring=make_scorer(recall_score, pos_label=0)
                    )


grid.fit(X_train,y_train)
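The averaging that the grid search performs can be sketched by hand. This is a simplified illustration, not what GridSearchCV does internally line by line: it uses a bare RandomForestClassifier instead of the full pipeline, the whole dataset instead of the train set, and the names `X_demo`, `y_demo` and `mean_scores` are introduced here only for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import ParameterGrid, cross_val_score

# Demo data (assumption: full dataset, no train/test split, for brevity)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
scorer = make_scorer(recall_score, pos_label=0)

mean_scores = {}
for params in ParameterGrid({"n_estimators": [50, 20]}):
    model = RandomForestClassifier(random_state=101, **params)
    # cv=2 fits the model twice, each time scoring on the held-out fold
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=2, scoring=scorer)
    # The combination is summarised by the mean score across its folds
    mean_scores[params["n_estimators"]] = fold_scores.mean()

# The combination with the highest mean score is the winner
best_n_estimators = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print(best_n_estimators)
```

Note that with the default `refit=True`, GridSearchCV additionally refits the winning combination on the whole data passed to `.fit()`, on top of the cross-validated fits counted above.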

Next, we check the results for both hyperparameter combinations with `.cv_results_`, using the same code from the previous section
* Note that `'model__n_estimators': 50` gave an average recall score on class 0 of 0.86, which is higher than the other combination

(pd.DataFrame(grid.cv_results_)
.sort_values(by='mean_test_score',ascending=False)
.filter(['params','mean_test_score'])
.values
 )

Let's check the best parameters with `.best_params_`

grid.best_params_

And finally grab the pipeline that has the best estimator, the one which gave the highest score. 

pipeline = grid.best_estimator_
pipeline

As usual in our workflow, we will evaluate the pipeline using our custom function for classification problems

from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X,y,pipeline,label_map):

  prediction = pipeline.predict(X)

  print('---  Confusion Matrix  ---')
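  # Passing the predictions as y_true puts predictions on the rows and actual
  # values on the columns, i.e. the transpose of sklearn's default layout
  print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
        columns=[ ["Actual " + sub for sub in label_map] ], 
        index= [ ["Prediction " + sub for sub in label_map ]]
        ))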
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction, target_names=label_map),"\n")


def clf_performance(X_train,y_train,X_test,y_test,pipeline,label_map):
  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)
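The custom function passes the predictions as `y_true` and the actual values as `y_pred`, which may look surprising. The sketch below, on invented labels, suggests why this works: scikit-learn puts `y_true` on the rows and `y_pred` on the columns, so swapping the arguments simply transposes the matrix to match the "Prediction" rows and "Actual" columns used in the printout.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels for illustration
y_actual = [0, 0, 0, 1, 1]
y_predicted = [0, 1, 1, 1, 1]

# Default layout: rows are actual values, columns are predictions
default_layout = confusion_matrix(y_true=y_actual, y_pred=y_predicted)

# Swapped arguments: rows are predictions, columns are actual values
swapped_layout = confusion_matrix(y_true=y_predicted, y_pred=y_actual)

print(default_layout)
print(swapped_layout)
print(np.array_equal(swapped_layout, default_layout.T))
```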

We pass the parameters as usual, considering that class 0 is malignant and class 1 is benign. Therefore, label_map receives an ordered list that matches each class value to its meaning: `['malignant', 'benign']`
* Note the recall on malignant on the train set is 100% and on the test set is 90%. In a project, you will have set the threshold you would accept.
* If the threshold you agreed with the client is 90%, this pipeline is the solution for your case. If your threshold is 98%, you would still have to look for other algorithms or hyperparameter combinations to improve your pipeline performance

clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline,
                label_map= ['malignant', 'benign'] 
                )
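The threshold check discussed above could be automated along these lines. The helper `meets_recall_threshold` and the toy labels are hypothetical, written only for this illustration; in a project you would pass the real test-set target and the pipeline predictions.

```python
from sklearn.metrics import recall_score

def meets_recall_threshold(y_true, y_pred, threshold, pos_label=0):
    """Hypothetical helper: True when recall on pos_label reaches the threshold."""
    return recall_score(y_true, y_pred, pos_label=pos_label) >= threshold

# Toy labels (illustration only): 10 malignant (0) and 10 benign (1) cases,
# where 9 of the 10 malignant cases are detected
y_true_toy = [0] * 10 + [1] * 10
y_pred_toy = [0] * 9 + [1] + [1] * 10

print(meets_recall_threshold(y_true_toy, y_pred_toy, threshold=0.90))
print(meets_recall_threshold(y_true_toy, y_pred_toy, threshold=0.98))
```

Writing the agreed threshold as an explicit check makes the acceptance criterion from the business case visible in the notebook itself.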

---