# Scikit-learn - Unit 05C - Cross Validation Search (GridSearchCV) and Hyperparameter Optimization Multiple Clf- Part 01

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn and use GridSearchCV for Hyperparameter Optimization




---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

We will use scikit-learn, xgboost, feature-engine and yellowbrick to run our exercises.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 05C - Cross Validation Search (GridSearchCV) and Hyperparameter Optimization

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Hyperparameter Optimization with 1 algorithm

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Multiclass Classification

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In the last section, we saw how to conduct hyperparameter tuning using one algorithm to solve a Binary Classification problem.
* There is a small difference in how you use GridSearchCV when your ML task is multiclass classification. We will cover that now.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are going to follow a workflow similar to the one we studied earlier:
* Split the data
* Define the pipeline and hyperparameters
* Fit the pipeline
* Evaluate the pipeline

We load the iris dataset for this exercise. It contains records of 3 classes of iris plants, with their petal and sepal measurements.

df_clf = sns.load_dataset('iris')

print(df_clf.shape)
df_clf.head()

As usual, we split the data into train and test set

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['species'],axis=1),
                                    df_clf['species'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

And create a pipeline using 3 steps: feature scaling, feature selection and modelling with RandomForestClassifier

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

def pipeline_clf():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(RandomForestClassifier(random_state=101)) ),
      ( "model", RandomForestClassifier(random_state=101)),

    ])

  return pipeline
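The step names chosen in the pipeline matter for the grid search, because each hyperparameter is addressed as `<step name>__<parameter name>`. The sketch below uses a reduced, two-step pipeline (the name `demo_pipeline` is only for illustration) to list the parameter names exposed by the `model` step.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# A reduced two-step pipeline, used only to illustrate parameter naming
demo_pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("model", RandomForestClassifier(random_state=101)),
])

# Every tunable parameter is exposed as "<step name>__<parameter name>"
model_params = sorted(
    name for name in demo_pipeline.get_params() if name.startswith("model__")
)
print(model_params[:5])
```

Any name in this list, such as `model__n_estimators`, can be used as a key in the hyperparameter dictionary given to the grid search.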


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We define our hyperparameter list based on the algorithm documentation.
* In this case, there will be 2 hyperparameter combinations.
* We want to give you a feel for hyperparameter optimization, so we keep the number of hyperparameter combinations small and the learning process quick for now. However, we encourage you to try additional combinations by yourself in your free time.



# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
from sklearn.model_selection import GridSearchCV

param_grid = {"model__n_estimators":[10,20],
              }
param_grid
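To check how many combinations a grid contains before fitting, one option is scikit-learn's `ParameterGrid`, which enumerates them. This is a small sketch; `demo_param_grid` and `larger_grid` are illustrative names, and the `max_depth` values are arbitrary choices, not a recommendation.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the lesson: one hyperparameter with two candidate values
demo_param_grid = {"model__n_estimators": [10, 20]}
combinations = list(ParameterGrid(demo_param_grid))
print(len(combinations))
for combo in combinations:
    print(combo)

# Adding a second hyperparameter multiplies the number of combinations (2 * 3)
larger_grid = {"model__n_estimators": [10, 20], "model__max_depth": [None, 3, 5]}
print(len(ParameterGrid(larger_grid)))
```

The number of models fitted during the search is the number of combinations multiplied by the number of cross-validation folds, plus one final refit of the best combination on the whole train set.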

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Let's assume that, for this project, the client is interested in the Virginica species and needs the predictions for this class to be precise (for the client's own reasons, if the model says a given flower is Virginica, it has to be Virginica and not something else).
* In this case, your scoring metric is `precision_score` for the class Virginica.
  * In a multiclass classification, when your performance metric is accuracy, you just pass scoring='accuracy', like in a binary classifier.
  * In our case, we need make_scorer() to state that we want to fine-tune the model using precision on the Virginica species. We pass to make_scorer the metric we want: precision_score. The next argument is `labels`, where you pass the class you want to tune as a list. Note that in this dataset the species is not encoded as numbers but as categories. If it were numbers, you would pass the number related to the class you want to tune. The last argument is `average`, which should be None, since you compute the precision for one class only (in this case Virginica) and you don't need to average.
* Finally, you fit the grid search to the training data.


from sklearn.metrics import make_scorer, precision_score
grid = GridSearchCV(estimator=pipeline_clf(),
                    param_grid=param_grid,
                    cv=2,
                    n_jobs=-2,
                    verbose=3, # in the workplace we typically set verbose to 1, 
                    # to reduce the amount of messages when fitting the models
                    # for teaching purpose, we set to 3 to see the score for each cross validated model
                    scoring=make_scorer(precision_score,
                                        labels=['virginica'],
                                        average=None)
                    )


grid.fit(X_train,y_train)
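To see what the scoring function used above computes, here is a small sketch on hand-made labels (the arrays `y_true` and `y_pred` are invented for illustration and are not taken from the iris data). With `labels=['virginica']` and `average=None`, `precision_score` returns an array with a single value: the precision for Virginica only.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hand-made example: Virginica is predicted twice, and only one of them is correct
y_true = np.array(["setosa", "virginica", "virginica", "versicolor"])
y_pred = np.array(["setosa", "virginica", "versicolor", "virginica"])

# Precision for Virginica = correct Virginica predictions / all Virginica predictions
virginica_precision = precision_score(
    y_true, y_pred, labels=["virginica"], average=None
)
print(virginica_precision)
```

Here one of the two Virginica predictions is correct, so the precision for that class is 0.5.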

Next, we check the results for both hyperparameter combinations with `.cv_results_`, using the same code as in the previous section. Each combination was fitted on 2 cross-validation folds, so 4 models were trained in total.
* Note that the combination `'model__n_estimators': 10` gave an average precision score on Virginica of 0.91. In this case, both options appear to give the same performance, and the grid search picked the model with n_estimators set to 10.

(pd.DataFrame(grid.cv_results_)
.sort_values(by='mean_test_score',ascending=False)
.filter(['params','mean_test_score'])
.values
 )
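When two combinations score similarly, it can help to look at more columns of `.cv_results_` than the mean score alone. The sketch below is self-contained and makes some simplifying assumptions: it uses `load_iris` from scikit-learn, a bare `RandomForestClassifier` without the pipeline, and `scoring="accuracy"`, so its numbers will differ from the ones in this lesson.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Simplified setup for illustration: no pipeline, accuracy as the metric
X_demo, y_demo = load_iris(return_X_y=True)

demo_grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=101),
    param_grid={"n_estimators": [10, 20]},
    cv=2,
    scoring="accuracy",
)
demo_grid.fit(X_demo, y_demo)

# One row per hyperparameter combination, with the spread across folds and the rank
results = (
    pd.DataFrame(demo_grid.cv_results_)
    .sort_values(by="rank_test_score")
    .filter(["params", "mean_test_score", "std_test_score", "rank_test_score"])
)
print(results)
```

`std_test_score` shows how much the score varies across folds, and `rank_test_score` shows how the grid search ordered the combinations.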

We grab programmatically the best hyperparameter combination for a quick check

grid.best_params_

And finally, we grab the best pipeline, which is the pipeline refitted on the whole train set using the best hyperparameter combination.

pipeline = grid.best_estimator_
pipeline
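A fitted pipeline can also be inspected step by step through `named_steps`. The sketch below is a stand-alone illustration rather than the pipeline fitted above: it assumes `load_iris` from scikit-learn (whose column names differ from the seaborn version), fits on the full dataset, and fixes `n_estimators=10` by hand. It shows which features the `feat_selection` step kept.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data: scikit-learn's copy of iris as a DataFrame
iris = load_iris(as_frame=True)
X_demo, y_demo = iris.data, iris.target

# Same three steps as in the lesson, fitted directly for demonstration
demo_pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("feat_selection", SelectFromModel(RandomForestClassifier(random_state=101))),
    ("model", RandomForestClassifier(n_estimators=10, random_state=101)),
]).fit(X_demo, y_demo)

# Boolean mask of the features kept by the feature selection step
selected_mask = demo_pipeline.named_steps["feat_selection"].get_support()
selected_features = X_demo.columns[selected_mask].tolist()
print(selected_features)
```

The same `named_steps` access works on `grid.best_estimator_`, since it is a regular fitted `Pipeline`.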

We finally evaluate the pipeline
* Note that the precision on Virginica is 98% on the train set and 100% on the test set. Maximum precision on the test set is a very good sign, since it suggests the pipeline can generalize to unseen data.
* Again, the client will accept the pipeline based on the performance criteria you both set in the ML business case.

from sklearn.metrics import classification_report, confusion_matrix

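def confusion_matrix_and_report(X,y,pipeline,label_map):

  prediction = pipeline.predict(X)

  print('---  Confusion Matrix  ---')
  # confusion_matrix puts y_true on the rows and y_pred on the columns.
  # The predictions are passed as y_true on purpose, so that the rows show
  # the predictions and the columns show the actual classes, matching the labels below.
  print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
        columns=[ ["Actual " + sub for sub in label_map] ], 
        index= [ ["Prediction " + sub for sub in label_map ]]
        ))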
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction),"\n")


def clf_performance(X_train,y_train,X_test,y_test,pipeline,label_map):
  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)
    

clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline,
                label_map= df_clf['species'].unique()
                )
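The helper above passes the predictions as `y_true` and the actual values as `y_pred`, which can look like a mistake at first. The sketch below, using hand-made labels (`actual` and `predicted` are invented for illustration), shows that swapping the two arguments simply transposes the matrix, so the rows become the predictions and the columns become the actual classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels: one Versicolor flower is wrongly predicted as Virginica
actual = np.array(["setosa", "versicolor", "virginica", "virginica"])
predicted = np.array(["setosa", "virginica", "virginica", "virginica"])

# Default orientation: rows are actual classes, columns are predictions
rows_are_actual = confusion_matrix(y_true=actual, y_pred=predicted)

# Swapped arguments: rows are predictions, columns are actual classes
rows_are_predicted = confusion_matrix(y_true=predicted, y_pred=actual)

print(rows_are_actual)
print(rows_are_predicted)
```

With the swapped orientation, precision for a class is read along its "Prediction" row: the diagonal value divided by the row total.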

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%209-%20Well%20done.png"> Congratulations! You now know how to take a given algorithm and run a hyperparameter optimization for Regression and Classification!
  * The **next level** is to define a set of algorithms and a set of hyperparameters for each algorithm, and run a hyperparameter optimization for Regression and Classification tasks!

---