# Scikit-learn - Unit 05A - Cross Validation Search (GridSearchCV) and Hyperparameter Optimization Regression - Part 01

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn and use GridSearchCV for Hyperparameter Optimization




---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

The exercises rely on scikit-learn, xgboost, feature-engine and yellowbrick being installed. We start by importing the general-purpose packages.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 05A - Cross Validation Search (GridSearchCV) and Hyperparameter Optimization

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Good job! You fitted multiple pipelines considering different algorithms separately for regression and classification tasks. However, how do you know which was better for a given ML task? 
* Imagine for the classification task on the iris dataset, you fitted three individual pipelines, evaluated each separately, and concluded a given algorithm was better. That is fair enough, but you want a more effective way to assess multiple algorithms.
* However, you also fitted the models with the default hyperparameters, and may wonder: what if, for a given algorithm, I could fit multiple models using different hyperparameters and find a model with even better performance?




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%208-%20Challenge.png"> Let's learn how to use GridSearchCV and do Hyperparameter Optimization using multiple algorithms. 
* This is the heart of conventional ML. We will split this topic into 2 parts: 
  * in the next 3 notebooks, we will show how to conduct hyperparameter optimization using 1 algorithm (for Regression, Binary Classification and Multiclass Classification) 
  * Then we will cover how to do hyperparameter optimization using multiple algorithms at once.


---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Hyperparameter Optimization with 1 algorithm

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> In this section, we will select a given algorithm and fine-tune it, by defining a set of hyperparameters. 
* For each possible hyperparameter combination, a set of models will be fitted, based on the cross-validation parameter. For example, if the developer sets cross-validation as 5, it will fit 5 models for a given hyperparameter combination. 
* These 5 models are scored against a performance metric (e.g. for regression it could be the R2 score), and the average performance is computed. This average is the cross-validated performance for a given configuration of hyperparameters. 
* This process is then repeated for every combination of hyperparameters. 
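The steps above can be sketched by hand before handing them over to scikit-learn. This is a minimal illustration, not the notebook's workflow: the synthetic dataset from `make_regression`, the variable names and the choice of 5 folds are assumptions made only to keep the example self-contained.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ParameterGrid, cross_val_score

# Synthetic regression data (an assumption, used only for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=101)

param_grid = {"n_estimators": [10, 20]}
cv_folds = 5

results = []
for params in ParameterGrid(param_grid):
    model = RandomForestRegressor(random_state=101, **params)
    # One model is fitted and scored per fold, giving cv_folds scores
    fold_scores = cross_val_score(model, X, y, cv=cv_folds, scoring="r2")
    # The mean of the fold scores is the cross-validated performance
    results.append((params, fold_scores.mean()))
    print(params, "fold scores:", fold_scores.round(3), "mean:", fold_scores.mean().round(3))

# The combination with the highest mean score is the one we would keep
best_params, best_mean = max(results, key=lambda item: item[1])
print("Best combination:", best_params)
```

GridSearchCV automates this loop and stores the per-fold and mean scores for every combination.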



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png"> Let's quickly recap how we use the data for fitting supervised models
* The training subset of the data is used to fit or train the model.
* A subset of the training set is known as the validation set. This is used during fitting to compare one model against another, and when choosing or tuning hyperparameters. 
* The final subset of data used to test the model is known as the test set. This assesses the final model's performance, and it must be data that is new to the model to give an unbiased result. The test set should closely replicate what the deployed model will see in real-time usage. 
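To make the idea of a validation set carved out of the training data concrete, here is a small sketch using `KFold`. The 10-row array and the choice of 5 folds are hypothetical; they only serve to show which rows are used for fitting and which for validation in each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 hypothetical training rows, identified by their index
X_train_demo = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5)
splits = list(kfold.split(X_train_demo))

for fold_number, (fit_idx, validation_idx) in enumerate(splits, start=1):
    # Each fold holds out a different slice of the training data for validation
    print(f"Fold {fold_number}: fit on rows {fit_idx}, validate on rows {validation_idx}")
```

Every training row ends up in the validation slice exactly once, while the test set is never touched during this process.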


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Result.png"> When we do Hyperparameter Optimization, a part of the train set is automatically subset as a validation set and the model is fitted using cross-validation. 


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> So, how can I do that in Scikit-learn? 
* We use a class called **GridSearchCV**, which fits multiple models by looping through a list of hyperparameter values for the model. 
* Note: CV here means cross-validation. It uses cross-validation to compare different hyperparameter combinations, so we can select, from the listed hyperparameters, the combination that achieves the best performance.
* In the end, it helps to automate the process of finding the best combination of hyperparameters for a given algorithm on a given dataset. Its documentation is found [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We will split this section to show use cases for GridSearchCV on Regression, Binary Classification and Multiclass Classification tasks
* When using GridSearchCV, the difference among these ML tasks lies in the `scoring` parameter, which tells which performance metric should be used to select the best model.

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Regression

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png"> We are going to follow a workflow similar to the one we studied earlier:
* Split the data
* Define the pipeline 
* Fit the pipeline
* Evaluate the pipeline

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png">The differences now are:
* we decide on a list of hyperparameters to optimize our model while fitting it
* we need a performance metric to decide which model is the best in cross-validating the models.


We will use the Boston dataset from sklearn. It has house price records and house characteristics, like the average number of rooms per dwelling and Boston's per capita crime rate.

from sklearn.datasets import load_boston
# Note: load_boston was removed in scikit-learn 1.2, so this cell requires an older version
data = load_boston()
df_reg = pd.DataFrame(data.data, columns=data.feature_names)
df_reg['price'] = pd.Series(data.target)

df_reg = df_reg.sample(frac=0.5, random_state=101)

print(df_reg.shape)
df_reg.head()

We split the data into train and test sets.

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_reg.drop(['price'],axis=1),
                                    df_reg['price'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Similarly to previous examples, we set the pipeline with 3 steps: feature scaling, feature selection and modelling. 
* For learning purposes, we set the algorithm as RandomForestRegressor. However, we encourage you to try additional algorithms by yourself in your free time.

from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
#from sklearn.ensemble import AdaBoostRegressor
#from sklearn.linear_model import LinearRegression
#from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
#from sklearn.tree import DecisionTreeRegressor

def pipeline_adaboost_reg():
  # Despite the function name, the model step uses RandomForestRegressor
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(RandomForestRegressor(random_state=101)) ),
      ( "model", RandomForestRegressor(random_state=101)),

    ])

  return pipeline


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We should define the **algorithm's hyperparameters** in a dictionary, with the support of its documentation, which is found [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
* Since we are fitting a pipeline that contains a set of steps, you should state in your hyperparameter list which step each hyperparameter belongs to. In our case, we named the modelling step `"model"`, so we add the prefix `"model__"` before the hyperparameter name.
* For learning purposes, we picked 1 hyperparameter: `n_estimators`. For `n_estimators`, we pass a list with 10 and 20 (the default value is 100, but for learning purposes we set 10 and 20 for faster computation)
* It will take time and practical experience to make sense of which hyperparameters are more useful for each algorithm and what the typical ranges are when listing a hyperparameter's values
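If you are unsure how a hyperparameter should be named for a pipeline step, one option is to list the names the pipeline itself accepts. The sketch below rebuilds a pipeline with the same three steps as above (the name `pipeline_demo` is only for illustration) and prints every parameter that belongs to the `"model"` step.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same three steps as the notebook's pipeline, rebuilt here for a self-contained example
pipeline_demo = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("feat_selection", SelectFromModel(RandomForestRegressor(random_state=101))),
    ("model", RandomForestRegressor(random_state=101)),
])

# get_params() returns every tunable name, written as <step name>__<hyperparameter>
model_params = sorted(name for name in pipeline_demo.get_params() if name.startswith("model__"))
print(model_params)
```

Any name in this list, such as `model__n_estimators`, can be used as a key in `param_grid`. Nested estimators follow the same rule, for example `feat_selection__estimator__n_estimators`.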

# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html  # documentation is here
param_grid = {"model__n_estimators":[10,20],
              }

param_grid

We import GridSearchCV. Its documentation is found [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). We pass
* `estimator` as the pipeline, `param_grid` as the dictionary we stated above. 
* `cv` sets the number of cross-validation folds used for each hyperparameter combination. It uses k-fold cross-validation: the data passed to `fit` (here, the train set) is split into k folds, and each model is trained on k-1 folds and scored on the remaining one, so that every fold serves once as the validation set.  
* `n_jobs`, according to the documentation, is the number of jobs to run in parallel. -1 means using all processors, while -2 uses all but one.
* `scoring` is the evaluation metric that you want to use. That will depend on the ML task you are considering. In this case, it is regression, so we set the R2 score to be the metric. Other options would include: `'neg_mean_absolute_error'`, `'neg_mean_squared_error'`
* `verbose`, according to the documentation, controls the verbosity: the higher, the more messages. For learning purposes we set it to 3, so you can follow the process.
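The `neg_` prefix in the alternative scoring options deserves a quick illustration: GridSearchCV always treats higher scores as better, so error metrics are reported with a flipped sign. The sketch below is a minimal example under assumptions: it uses synthetic data from `make_regression` and a bare RandomForestRegressor instead of the notebook's pipeline, and `grid_mae` is a hypothetical name.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (an assumption, used only for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=101)

grid_mae = GridSearchCV(
    estimator=RandomForestRegressor(random_state=101),
    param_grid={"n_estimators": [10, 20]},
    cv=2,
    scoring="neg_mean_absolute_error",
)
grid_mae.fit(X, y)

# Scores are negated errors, so they are negative; flip the sign to read the MAE
print("Best (negated) score:", grid_mae.best_score_)
print("Best mean absolute error:", -grid_mae.best_score_)
```

With `scoring='r2'`, no sign flip is needed since a higher R2 is already better.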

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Result.png"> We create an object (we simply called it grid) that contains a GridSearchCV using the parameters above. Next, we fit this object to the train set (features and target).

* That will fit multiple RandomForestRegressor models. Given the hyperparameters we listed above, there are 2 possible combinations of hyperparameters, so 2 candidate models are evaluated.
* Each candidate is evaluated with 2-fold cross-validation, since cv=2. Therefore each candidate is fitted twice.
* In total, the search runs 4 fits: 2 candidates, each fitted 2 times due to 2-fold cross-validation. With the default `refit=True`, GridSearchCV then fits the best candidate once more on the whole train set.
* Note, the 2 scores for `model__n_estimators=10` (the first 2 lines) are 0.618 and 0.695. The average is 0.656. We will point out this average when computing the cross-validation results for this hyperparameter combination in upcoming cells
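The number of cross-validation fits can be worked out before running the search, which is handy when a grid grows large. A small sketch, reusing the grid and `cv` value from this section (the variable names `n_combinations` and `n_cv_fits` are only illustrative):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {"model__n_estimators": [10, 20]}
cv = 2

# ParameterGrid expands the dictionary into every hyperparameter combination
n_combinations = len(ParameterGrid(param_grid))

# Each combination is fitted once per fold
n_cv_fits = n_combinations * cv

print("Combinations:", n_combinations)     # 2
print("Cross-validation fits:", n_cv_fits) # 4
```

Adding a second hyperparameter with 3 values would multiply the number of combinations by 3, so the cost grows quickly with the size of the grid.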



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Remember, the validation set is automatically defined by GridSearchCV. You pass the training set and it will subset the validation set as a part of the training set.

from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(estimator=pipeline_adaboost_reg(),
                    param_grid=param_grid,
                    cv=2,
                    n_jobs=-2,
                    verbose=3,  # for learning, we set 3 to print the score from every cross validation
                    scoring='r2')


grid.fit(X_train,y_train)

The results of all models and their respective cross-validations are stored in the attribute `.cv_results_`
* When you access this attribute, you will notice it is a dictionary and, when displayed as it is, it is not very informative

grid.cv_results_

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> One way to make it more informative is to convert it to a DataFrame, sort the values by 'mean_test_score', filter the 'params' and 'mean_test_score' columns and convert to an array, using .values
* In the end, you print a simplified ordered list showing the results of optimizing a model with multiple hyperparameter combinations.
  * For example, we see the best hyperparameter configuration is `model__n_estimators=10`, which gave an R2 score of 0.656
  * Previously we commented on the average performance of `model__n_estimators=10`: 0.656. That is the average of the 2 cross-validated models for this hyperparameter combination. In this particular example, its performance was the best among the combinations considered.

(pd.DataFrame(grid.cv_results_)
.sort_values(by='mean_test_score',ascending=False)
.filter(['params','mean_test_score'])
.values
 )
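An alternative view keeps the results as a DataFrame and adds the spread and rank of each combination. This is a sketch under assumptions: it fits a small search on synthetic data so that it runs on its own, and `grid_demo` and `summary` are hypothetical names. The columns used are keys that GridSearchCV stores in `cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (an assumption, used only for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=101)

grid_demo = GridSearchCV(
    estimator=RandomForestRegressor(random_state=101),
    param_grid={"n_estimators": [10, 20]},
    cv=2,
    scoring="r2",
)
grid_demo.fit(X, y)

# rank_test_score is 1 for the best combination; std_test_score shows the spread across folds
summary = (pd.DataFrame(grid_demo.cv_results_)
           .sort_values(by="rank_test_score")
           .filter(["params", "mean_test_score", "std_test_score", "rank_test_score"])
           .reset_index(drop=True))
print(summary)
```

The standard deviation is useful when two combinations have similar means: a large spread across folds suggests the difference may not be meaningful.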

At the same time, we can get the best parameters with `.best_params_`

grid.best_params_

This is interesting, but we want the pipeline that gave the highest score, so we can use it in real life. To grab it, use the attribute `.best_estimator_`
* This is the most important aspect of the grid search: we grab the pipeline that we will evaluate and potentially use
* Note, when fitting the pipelines, you saw the score for each cross-validated model for each hyperparameter combination. `.best_estimator_` is not one of those cross-validated models: with the default `refit=True`, GridSearchCV refits the pipeline on the whole train set using the best hyperparameter combination. For reference, these were the cross-validated scores for the best combination:
  * `[CV 1/2] END ............model__n_estimators=10;, score=0.618 total time=   0.2s`
  * `[CV 2/2] END ............model__n_estimators=10;, score=0.695 total time=   0.2s`
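To check which model `.best_estimator_` actually holds, we can compare it with a model fitted by hand on the whole train set using the best hyperparameters. This is a sketch under assumptions: synthetic data, a bare RandomForestRegressor instead of the notebook's pipeline, and hypothetical names (`grid_demo`, `manual_model`). It relies on GridSearchCV's default `refit=True`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (an assumption, used only for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=101)

grid_demo = GridSearchCV(
    estimator=RandomForestRegressor(random_state=101),
    param_grid={"n_estimators": [10, 20]},
    cv=2,
    scoring="r2",
)
grid_demo.fit(X, y)

# Fit a fresh copy of the estimator on ALL the data with the best hyperparameters
manual_model = clone(grid_demo.estimator).set_params(**grid_demo.best_params_)
manual_model.fit(X, y)

# With a fixed random_state, both models should give the same predictions
same_predictions = np.allclose(manual_model.predict(X), grid_demo.best_estimator_.predict(X))
print("best_estimator_ matches a refit on the whole data:", same_predictions)
```

The matching predictions indicate that the returned estimator was trained on the full data passed to `fit`, not on a single cross-validation fold.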

pipeline = grid.best_estimator_
pipeline

You can now evaluate the pipeline that you fit using hyperparameter optimization using the techniques we covered already. We will import the custom function for regression evaluation

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error 
import numpy as np

def regression_performance(X_train, y_train, X_test, y_test,pipeline):
  print("Model Evaluation \n")
  print("* Train Set")
  regression_evaluation(X_train,y_train,pipeline)
  print("* Test Set")
  regression_evaluation(X_test,y_test,pipeline)



def regression_evaluation(X,y,pipeline):
  prediction = pipeline.predict(X)
  print('R2 Score:', r2_score(y, prediction).round(3))  
  print('Mean Absolute Error:', mean_absolute_error(y, prediction).round(3))  
  print('Mean Squared Error:', mean_squared_error(y, prediction).round(3))  
  print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y, prediction)).round(3))
  print("\n")

  

def regression_evaluation_plots(X_train, y_train, X_test, y_test,pipeline, alpha_scatter=0.5):
  pred_train = pipeline.predict(X_train)
  pred_test = pipeline.predict(X_test)


  fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,6))
  sns.scatterplot(x=y_train , y=pred_train, alpha=alpha_scatter, ax=axes[0])
  sns.lineplot(x=y_train , y=y_train, color='red', ax=axes[0])
  axes[0].set_xlabel("Actual")
  axes[0].set_ylabel("Predictions")
  axes[0].set_title("Train Set")

  sns.scatterplot(x=y_test , y=pred_test, alpha=alpha_scatter, ax=axes[1])
  sns.lineplot(x=y_test , y=y_test, color='red', ax=axes[1])
  axes[1].set_xlabel("Actual")
  axes[1].set_ylabel("Predictions")
  axes[1].set_title("Test Set")
  plt.show()


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Next, we pass the train set, test set and our pipeline to the functions
* Note the performance on the train set is good (R2 of 0.93), while the performance on the test set is noticeably lower (0.68). The gap between these values may indicate overfitting.
* We may have to consider additional values for the hyperparameters or even consider other hyperparameters. Or maybe we need more data so the algorithm can find the patterns and generalize on unseen data.
* Or maybe this algorithm is not the best for this dataset. But don't worry, soon we will discover an approach to train multiple algorithms at once.
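As one possible follow-up to the points above, the grid can be widened with hyperparameters that limit tree complexity. The sketch below is illustrative only: the data is synthetic, the names (`wider_grid`, `grid_wide`) are hypothetical, and the values chosen for `max_depth` and `min_samples_leaf` are arbitrary starting points rather than recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (an assumption, used only for illustration)
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=101)

# Shallower trees and larger leaves tend to reduce overfitting
wider_grid = {
    "n_estimators": [10, 20],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}

grid_wide = GridSearchCV(
    estimator=RandomForestRegressor(random_state=101),
    param_grid=wider_grid,
    cv=2,
    scoring="r2",
)
grid_wide.fit(X_tr, y_tr)

# score() evaluates the refitted best estimator with the chosen scoring metric
train_r2 = grid_wide.score(X_tr, y_tr)
test_r2 = grid_wide.score(X_te, y_te)
print("Best parameters:", grid_wide.best_params_)
print("Train R2:", round(train_r2, 3), "| Test R2:", round(test_r2, 3))
```

Comparing the train and test scores after each change to the grid shows whether the gap is narrowing. When tuning the notebook's pipeline, the same keys would need the `model__` prefix.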

regression_performance(X_train, y_train, X_test, y_test,pipeline)
regression_evaluation_plots(X_train, y_train, X_test, y_test, 
                            pipeline, alpha_scatter=0.5)

---