# Hyperparameter search
This notebook is dedicated to hyperparameter search for the classifiers we chose for leaf classification, based on features extracted from the images.

The goal is to find the best hyperparameters for each classifier using cross-validation, then compare each classifier's performance with its default hyperparameters against its performance with the tuned ones.

### Importing our own functions

In [38]:
import importlib

import src.Data as Data
importlib.reload(Data)
Data = Data.Data

import src.Metrics as Metrics
importlib.reload(Metrics)
Metrics = Metrics.Metrics

### Importing libraries
`numpy` and `pandas` are used to manipulate the data

`scikit-learn` is used to train the classification models and compute the metrics

`matplotlib` and `seaborn` are used to plot the results

In [39]:
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import LearningCurveDisplay, learning_curve, cross_validate, train_test_split, cross_val_predict
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import numpy as np
import pandas as pd

### Loading the data
The data is loaded from the `data` folder.

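Samples are split into a training set and a test set with a configurable ratio (`test_size=0.2` below, i.e. 20% of the samples are held out for testing). Stratified sampling is used to ensure that the proportion of samples in each class is the same in both sets.

The number of samples in the least represented class of the training set is computed because it bounds the number of folds for stratified cross-validation: with more folds than samples in a class, some folds cannot contain that class.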

In [40]:
data: Data = Data(test_size=0.2, include_images=False)

least_populated_class_count = np.unique(data.y_train, return_counts=True)[1].min()
print("Least populated class count:", least_populated_class_count)
print("This is the maximum valid number of folds for cross validation.")

Least populated class count: 8
This is the maximum valid number of folds for cross validation.
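
As a rough illustration of how the stratified split and the fold limit fit together, here is a minimal sketch on synthetic data. The dataset from `make_classification`, the cap of 5 folds, and the variable names are assumptions for the example; the notebook itself gets its split from `src.Data`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the leaf features (assumption: not the real data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

# Stratified 80/20 split: class proportions are preserved in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The smallest class in the training set bounds the number of stratified folds.
least_populated = np.unique(y_train, return_counts=True)[1].min()
n_splits = int(min(5, least_populated))  # 5 is an arbitrary cap for this sketch

cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
fold_sizes = [len(val_idx) for _, val_idx in cv.split(X_train, y_train)]

print("Least populated class count:", least_populated)
print("Folds used:", n_splits)
print("Validation fold sizes:", fold_sizes)
```

The resulting `cv` object could be passed as the `cv` argument of scikit-learn's cross-validation utilities instead of a plain integer.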


### Choosing the models
Here you can choose which models to include in the hyperparameter search by commenting or uncommenting entries in the `classifiers` list.

The parameter `n_jobs` is used to specify the number of cores to use for parallel processing. If `-1` is given, all cores are used.

In [41]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier

classifiers = [
    # DecisionTreeClassifier(), 
    # RandomForestClassifier(n_jobs=-1), 
    # BaggingClassifier(n_jobs=-1), 
    # LogisticRegression(n_jobs=-1), 
    # SVC(), 
    # GaussianNB(), 
    # SGDClassifier(n_jobs=-1), 
    # KNeighborsClassifier(n_jobs=-1), 
    GradientBoostingClassifier(), 
    # MLPClassifier(), 
    AdaBoostClassifier()
]

### Getting the list of hyperparameters
To simplify the hyperparameter search, we use the `get_params()` method of the classifier to get the list of hyperparameters that can be tuned.


In [42]:
for classifier in classifiers:
    print("Classifier:", classifier.__class__.__name__)
    print("Parameters:")
    for key in classifier.get_params():
        print("\t", key)
    print("")
    

Classifier: GradientBoostingClassifier
Parameters:
	 ccp_alpha
	 criterion
	 init
	 learning_rate
	 loss
	 max_depth
	 max_features
	 max_leaf_nodes
	 min_impurity_decrease
	 min_samples_leaf
	 min_samples_split
	 min_weight_fraction_leaf
	 n_estimators
	 n_iter_no_change
	 random_state
	 subsample
	 tol
	 validation_fraction
	 verbose
	 warm_start

Classifier: AdaBoostClassifier
Parameters:
	 algorithm
	 base_estimator
	 estimator
	 learning_rate
	 n_estimators
	 random_state
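
Since `get_params()` returns a dictionary, it also exposes the current values and can be used to check that every key of a grid is a real parameter of the estimator. The sketch below shows one possible check; the helper `unknown_grid_keys` and the `candidate_grid` dictionary are hypothetical names introduced for this example, not part of the notebook or of scikit-learn.

```python
from sklearn.ensemble import GradientBoostingClassifier


def unknown_grid_keys(estimator, param_grid):
    """Return the grid keys that the estimator does not expose via get_params()."""
    valid = estimator.get_params()
    return sorted(key for key in param_grid if key not in valid)


clf = GradientBoostingClassifier()

# get_params() maps each parameter name to its current (here: default) value.
params = clf.get_params()
print("learning_rate default:", params["learning_rate"])
print("n_estimators default:", params["n_estimators"])

# "splitter" is a DecisionTreeClassifier parameter, not a GradientBoostingClassifier one.
candidate_grid = {"max_depth": [1, 2, 3], "splitter": ["best", "random"]}
print("Unknown keys:", unknown_grid_keys(clf, candidate_grid))
```

Running a check like this before the search can catch a misspelled or misplaced key early instead of during fitting.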



### Choosing the hyperparameters to tune
We then need to choose from the list above which hyperparameters we want to tune. We can also choose the range of values to test for each hyperparameter.

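The `param_grid` variable is a dictionary where the keys are the names of the hyperparameters and the values are the lists of values to test for each hyperparameter. A grid is appended to `param_grids` only if its classifier is selected above. The grids are later paired with the classifiers by position, so the `classifiers` list must keep the same order as the grids defined here.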

In [43]:
param_grids = []

# DecisionTreeClassifier
param_grid = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": [None, 5, 10, 20, 50, 100],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5, 10],
    "max_features": ["sqrt", "log2"]
}
if "DecisionTreeClassifier" in [classifier.__class__.__name__ for classifier in classifiers]:
    param_grids.append(param_grid)

# RandomForestClassifier
param_grid = {
    "n_estimators": [10, 50, 100, 200, 500],
    "criterion": ["gini", "entropy"],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2"]
}
if "RandomForestClassifier" in [classifier.__class__.__name__ for classifier in classifiers]:
    param_grids.append(param_grid)

# BaggingClassifier
param_grid = {
    "n_estimators": [10, 20, 50, 100],
    "max_samples": [0.1, 0.5, 1.0],
    "max_features": [0.1, 0.5, 1.0],
    "bootstrap": [True, False],
    "bootstrap_features": [True, False]
}
if "BaggingClassifier" in [classifier.__class__.__name__ for classifier in classifiers]:
    param_grids.append(param_grid)

# LogisticRegression
param_grid = {
    "penalty": ["l1", "l2", "elasticnet"],
    "C": [0.1, 0.5, 2, 5, 10, 20, 50, 100, 200, 500, 1000],
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
    "max_iter": [100, 200, 500]
}
if "LogisticRegression" in [classifier.__class__.__name__ for classifier in classifiers]:
    param_grids.append(param_grid)

# SVC
param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [0.1, 0.5, 2, 5, 10, 20, 50, 100, 200, 500, 1000],
    "gamma": ["scale", "auto"]
}
if "SVC" in [classifier.__class__.__name__ for classifier in classifiers]:
    param_grids.append(param_grid)

# GaussianNB
param_grid = {
    "var_smoothing": [1e-07, 1e-06, 1e-05, 1e-04, 1e-03, 0.005, 0.01, 0.02, 0.05, 0.075, 0.1]
}
if "GaussianNB" in [classifier.__class__.__name__ for classifier in classifiers]:
    param_grids.append(param_grid)

# SGDClassifier
param_grid = {
    # The "log" loss was renamed to "log_loss" in scikit-learn 1.1.
    "loss": ["hinge", "log_loss", "modified_huber", "squared_hinge", "perceptron"],
    "penalty": ["l1", "l2", "elasticnet"],
    "alpha": [0.00001, 0.0001, 0.001, 0.01],
    "max_iter": [1000, 2000, 5000, 10000],
}
if "SGDClassifier" in [classifier.__class__.__name__ for classifier in classifiers]:
    param_grids.append(param_grid)

# KNeighborsClassifier
param_grid = {
    "n_neighbors": [1, 2, 5, 10],
    "weights": ["uniform", "distance"],
    "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
    "leaf_size": [1, 2, 5, 10, 20, 30, 50],
    "p": [1, 2]
}
if "KNeighborsClassifier" in [classifier.__class__.__name__ for classifier in classifiers]:
    param_grids.append(param_grid)

# GradientBoostingClassifier
param_grid = {
    "learning_rate": [0.005, 0.01, 0.025, 0.05, 0.1, 0.5],
    "n_estimators": [100, 500], 
    "criterion": ["friedman_mse", "squared_error"],
    "max_depth": [1, 2, 3, 5, 10],
    "min_samples_split": [2, 5, 10, 15, 20],
    "max_features": ["sqrt", "log2"]
}
if "GradientBoostingClassifier" in [classifier.__class__.__name__ for classifier in classifiers]:
    param_grids.append(param_grid)

# MLPClassifier
param_grid = {
    "hidden_layer_sizes":  [(20,20,), (100,), (200,), (500,)],
    "activation": ["logistic", "tanh", "relu"],
    "solver": ["lbfgs", "sgd"],
    "alpha": [0.00001, 0.0001, 0.001, 0.01] ,
    "learning_rate": ["constant", "adaptive"],
    "max_iter": [200, 500]
}
if "MLPClassifier" in [classifier.__class__.__name__ for classifier in classifiers]:
    param_grids.append(param_grid)

# AdaBoostClassifier
param_grid = {
    "n_estimators": [50, 200, 500],
    "learning_rate": [0.001, 0.01, 0.1, 0.5],
    "algorithm": ["SAMME", "SAMME.R"]
}
if "AdaBoostClassifier" in [classifier.__class__.__name__ for classifier in classifiers]:
    param_grids.append(param_grid)
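
Before launching a search it can help to know how many combinations a grid contains. The sketch below counts them with `ParameterGrid` for the two grids selected in this notebook, and looks each grid up by class name rather than by list position. The `grids_by_name` dictionary and `n_folds` are names assumed for the example; this is one possible arrangement, not the one used in the cells above.

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import ParameterGrid

# Grids keyed by class name (values copied from the cell above).
grids_by_name = {
    "GradientBoostingClassifier": {
        "learning_rate": [0.005, 0.01, 0.025, 0.05, 0.1, 0.5],
        "n_estimators": [100, 500],
        "criterion": ["friedman_mse", "squared_error"],
        "max_depth": [1, 2, 3, 5, 10],
        "min_samples_split": [2, 5, 10, 15, 20],
        "max_features": ["sqrt", "log2"],
    },
    "AdaBoostClassifier": {
        "n_estimators": [50, 200, 500],
        "learning_rate": [0.001, 0.01, 0.1, 0.5],
        "algorithm": ["SAMME", "SAMME.R"],
    },
}

# The order of this list no longer matters because grids are looked up by name.
selected = [AdaBoostClassifier(), GradientBoostingClassifier()]

n_folds = 2  # assumed number of cross-validation folds
candidate_counts = {}
for clf in selected:
    name = type(clf).__name__
    n_candidates = len(ParameterGrid(grids_by_name[name]))
    candidate_counts[name] = n_candidates
    print(f"{name}: {n_candidates} candidates, {n_candidates * n_folds} fits")
```

The number of fits grows multiplicatively with every added hyperparameter, which is worth checking before running an exhaustive search.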


### Fitting the models with all combinations of hyperparameters
We use the `GridSearchCV` class to fit the models with all combinations of hyperparameters and find the best hyperparameters for each model. 

This class performs an exhaustive search over the hyperparameter grid and scores every combination with cross-validation. The search below uses `cv=2` folds.

In [44]:
from sklearn.model_selection import GridSearchCV

best_params = []
best_scores = []

for classifier, param_grid in zip(classifiers, param_grids):
    print("Classifier:", classifier.__class__.__name__)
    print("Parameters:")
    for key in param_grid:
        print(f"\t{key:12}: {param_grid[key]}")
    print("")
    
    grid_search = GridSearchCV(classifier, param_grid, cv=2, verbose=1, n_jobs=-1)
    grid_search.fit(data.x_train, data.y_train)
    best_params.append(grid_search.best_params_)
    best_scores.append(grid_search.best_score_)
    print("Best parameters:", best_params[-1])
    print(f"Best score: {best_scores[-1]:.3f}")
    print("")

Classifier: GradientBoostingClassifier
Parameters:
	learning_rate: [0.005, 0.01, 0.025, 0.05]
	n_estimators: [500]
	criterion   : ['friedman_mse']
	max_depth   : [1, 2, 3]
	min_samples_split: [10, 15, 20]
	max_features: ['log2']

Fitting 2 folds for each of 36 candidates, totalling 72 fits
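
Beyond `best_params_` and `best_score_`, a fitted `GridSearchCV` keeps the score of every candidate in `cv_results_`. The following sketch, on synthetic data and with a small decision-tree grid chosen only to keep it fast (both are assumptions, not the notebook's setup), shows one way to rank all candidates in a `pandas` table.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training features (assumption).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

param_grid = {"max_depth": [2, 5, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

# One row per candidate: mean and spread of the validation score across folds.
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score").to_string(index=False))

print("Best parameters:", search.best_params_)
print(f"Best score: {search.best_score_:.3f}")
```

Looking at `std_test_score` next to the mean gives a sense of whether the top candidates are meaningfully different or within fold-to-fold noise.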


### Printing the results

In [37]:
for classifier, best_param, best_score in zip(classifiers, best_params, best_scores):
    print("Classifier:", classifier.__class__.__name__)
    print("Best parameters:")
    for key in best_param:
        print(f"\t{key:12}: {best_param[key]}")
    # A distinct loop variable keeps the best_scores list from being overwritten.
    print(f"Best score: {best_score:.3f}")
    print("")

Classifier: GradientBoostingClassifier
Best parameters:
	criterion   : friedman_mse
	learning_rate: 0.05
	max_depth   : 3
	max_features: log2
	min_samples_split: 10
	n_estimators: 100
Best score: 0.782

Classifier: AdaBoostClassifier
Best parameters:
	algorithm   : SAMME.R
	learning_rate: 0.01
	n_estimators: 500
Best score: 0.535
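
The stated goal is to compare default and tuned hyperparameters, while the output above reports only the tuned cross-validation scores. One possible way to make that comparison is to score both variants on a held-out test set, as sketched below. The synthetic data, the decision-tree model, and the small grid are assumptions chosen to keep the example quick; they are not the notebook's models or results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the leaf features (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline: the classifier with its default hyperparameters.
default_model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Tuned: the best candidate found by cross-validation on the training set only.
param_grid = {"max_depth": [2, 5, 10, None], "min_samples_leaf": [1, 2, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

# Both models are scored on the same samples, which neither has seen.
default_acc = accuracy_score(y_test, default_model.predict(X_test))
tuned_acc = accuracy_score(y_test, search.best_estimator_.predict(X_test))

print(f"Default accuracy on test set: {default_acc:.3f}")
print(f"Tuned accuracy on test set:   {tuned_acc:.3f}")
print("Tuned parameters:", search.best_params_)
```

Scoring on the test set avoids comparing a default model against `best_score_`, which is selected as the maximum over many candidates and therefore tends to be optimistic.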

