# --- SOLUTION ---

HINT: These solutions only give an impression of how the problems can be tackled. They are neither the best possible solutions nor always complete. You are invited to find approaches that outperform them and to present those to your fellow students.

# Exercise 5: Hyperparameter Tuning

This exercise is about hyperparameter tuning. To get familiar with hyperparameter tuning in scikit-learn, refer to the corresponding [part of the documentation](https://scikit-learn.org/stable/modules/grid_search.html).

We again use the data set of the Data Mining Cup 2006. Remember: the task is to predict the attribute `gms_greater_avg` as precisely as possible. This time, we use the F1-measure of class `1` as the main performance metric.

## Task 1: Warm-up

In [1]:
import numpy as np
import pandas as pd

RANDOM_STATE = 42  # use this random state to make your experiments consistent
np.random.seed(RANDOM_STATE)

In [2]:
# Use the pandas library to import the training data similarly to exercise 2.

# --- SOLUTION ---
df = pd.read_csv('dmc2006/dmc2006_train.txt', sep='\t', encoding='cp1252').drop(columns=['auct_id', 'gms', 'listing_title', 'listing_subtitle', 'listing_start_date', 'listing_end_date'])
X, y = df.drop(columns='gms_greater_avg'), df['gms_greater_avg']

# convert columns to a reasonable format
X = pd.get_dummies(X, columns=['item_leaf_category_name'])
# the remaining object-typed columns are flags: map 'Y' to 1 and everything else to 0
boolean_columns = [k for k, v in X.dtypes.items() if v == object]
X[boolean_columns] = X[boolean_columns].apply(lambda col: [1 if x == 'Y' else 0 for x in col])

In [3]:
# Create a 50:50 train-test-split and assign the results to the variables X_train, X_test and y_train, y_test

# --- SOLUTION ---
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=RANDOM_STATE, stratify=y)

In [4]:
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

estimators = {
    'Naive Bayes': GaussianNB().fit(X_train, y_train),
    'K-NN': KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train),
    'SVC': SVC(random_state=RANDOM_STATE).fit(X_train, y_train)
}

# Implement the `evaluate_estimators` function so that it returns precision, recall, and F1-measure
# of the class 1 on the test set for the classifiers given in `estimators`.

# --- SOLUTION ---
from sklearn.metrics import precision_recall_fscore_support

def evaluate_estimators(estimators, X, y_true):
    for e_name, e in estimators.items():
        p, r, f1, _ = precision_recall_fscore_support(y_true, e.predict(X), average='binary')
        print(f'{e_name}: P={p:.2f} R={r:.2f} F1={f1:.2f}')
        
        
evaluate_estimators(estimators, X_test, y_test)

Naive Bayes: P=0.63 R=0.29 F1=0.39
K-NN: P=0.60 R=0.59 F1=0.60
SVC: P=0.63 R=0.29 F1=0.40
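
To make the three reported numbers concrete, the following minimal sketch computes precision, recall, and F1 of class `1` by hand from raw counts. The helper name `binary_prf` and the toy labels are illustrative assumptions and not part of the exercise; for the same inputs, `precision_recall_fscore_support(..., average='binary')` should report the same values.

```python
def binary_prf(y_true, y_pred, pos_label=1):
    """Precision, recall and F1 of `pos_label`, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)  # true positives
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)  # false positives
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# toy labels (hypothetical, only for illustration)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

p, r, f1 = binary_prf(y_true, y_pred)
print(f'P={p:.2f} R={r:.2f} F1={f1:.2f}')
```

In this toy example, two of the three predicted positives are correct (precision 2/3) and two of the four actual positives are found (recall 1/2), so F1 is their harmonic mean, 4/7.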


## Task 2: Grid Search

In [6]:
%%time

tune_params = {
    'K-NN': {
        'n_neighbors': [1, 3, 5, 10]
    },
    'SVC': {
        'C': [.001, .01, .1, 1, 10, 100],
        'gamma': ['scale', 'auto'],
        'tol': [1e-2, 1e-3, 1e-4],
        'class_weight': ['balanced', None],
    }
}

# Run a grid search with the parameters given in `tune_params` with F1-measure as optimization objective.
# For the best estimator, print the parameters and evaluate it with the `evaluate_estimators` function.
# HINT: Take a look at https://scikit-learn.org/stable/modules/grid_search.html for information about grid search.

# --- SOLUTION ---

from sklearn.model_selection import GridSearchCV

def grid_search_estimator(e_name, e, param_grid, X, y):
    gscv = GridSearchCV(e, param_grid, scoring='f1', cv=10).fit(X, y)
    print(f'{e_name} best parameters: {gscv.best_params_}')
    return gscv.best_estimator_

grid_estimators = {e_name: grid_search_estimator(e_name, e, tune_params[e_name], X_train, y_train)
                   for e_name, e in estimators.items()
                   if e_name in tune_params}

evaluate_estimators(grid_estimators, X_test, y_test)

K-NN best parameters: {'n_neighbors': 3}
SVC best parameters: {'C': 1, 'class_weight': 'balanced', 'gamma': 'auto', 'tol': 0.01}
K-NN: P=0.60 R=0.58 F1=0.59
SVC: P=0.53 R=0.69 F1=0.60
CPU times: user 3min 49s, sys: 1.73 s, total: 3min 51s
Wall time: 3min 47s
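
The runtime of an exhaustive grid search follows directly from the size of the grid. As a rough sketch, the snippet below counts the parameter combinations in `tune_params` and multiplies them by the number of cross-validation folds. The helper name `count_fits` is an assumption for illustration; the final refit of the best estimator on the full training set is not included in the count.

```python
from itertools import product

# same grid as in the exercise
tune_params = {
    'K-NN': {
        'n_neighbors': [1, 3, 5, 10]
    },
    'SVC': {
        'C': [.001, .01, .1, 1, 10, 100],
        'gamma': ['scale', 'auto'],
        'tol': [1e-2, 1e-3, 1e-4],
        'class_weight': ['balanced', None],
    }
}


def count_fits(param_grid, cv):
    """Return (number of parameter combinations, number of cross-validation fits)."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates, n_candidates * cv


for e_name, grid in tune_params.items():
    n_candidates, n_fits = count_fits(grid, cv=10)
    print(f'{e_name}: {n_candidates} candidates x 10 folds = {n_fits} fits')
```

The SVC grid contains 6 * 2 * 3 * 2 = 72 combinations, so the 10-fold search trains 720 models, compared to 40 for K-NN. Every additional value in one parameter list multiplies this number.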


## Task 3: Successive Halving

In [8]:
%%time

# Now run a successive halving grid search with the parameters given in `tune_params` with F1-measure as objective.
# Use a `min_resources` of 200 and a `factor` of 2.
# Again, print parameters of the best estimator and evaluate it with the `evaluate_estimators` function.
# HINT: Examples for halving grid search: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html#sklearn.model_selection.HalvingGridSearchCV
# HINT: To use successive halving, you need a scikit-learn version of 0.24.1 or higher
#       -> run a cell with `!pip install scikit-learn==0.24.1` and restart the notebook.

# --- SOLUTION ---

from sklearn.experimental import enable_halving_search_cv  # enable it first, as it is still an experimental feature
from sklearn.model_selection import HalvingGridSearchCV

# hint: only works with sklearn >= 0.24.1
def halving_search_estimator(e_name, e, param_grid, X, y):
    hgscv = HalvingGridSearchCV(e, param_grid, scoring='f1', factor=2, min_resources=200, cv=10, random_state=RANDOM_STATE).fit(X, y)
    print(f'{e_name} best parameters: {hgscv.best_params_}')
    return hgscv.best_estimator_

halving_estimators = {e_name: halving_search_estimator(e_name, e, tune_params[e_name], X_train, y_train)
                   for e_name, e in estimators.items()
                   if e_name in tune_params}

evaluate_estimators(halving_estimators, X_test, y_test)

K-NN best parameters: {'n_neighbors': 5}
SVC best parameters: {'C': 100, 'class_weight': 'balanced', 'gamma': 'scale', 'tol': 0.01}
K-NN: P=0.59 R=0.55 F1=0.57
SVC: P=0.61 R=0.35 F1=0.44
CPU times: user 1min 30s, sys: 1.37 s, total: 1min 32s
Wall time: 1min 25s


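Conclusion: As expected, the heuristic does not deliver optimal results: on the test set, the SVC reaches F1=0.44 compared to 0.60 with the exhaustive grid search, and K-NN reaches 0.57 compared to 0.59. However, the search finishes in less than half of the time (wall time 1min 25s instead of 3min 47s).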
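
To see where the time saving comes from, it helps to write down the schedule that successive halving follows: each iteration keeps roughly `1/factor` of the candidates and gives the survivors `factor` times as many samples. The sketch below reproduces this schedule as described in the scikit-learn user guide, assuming `aggressive_elimination=False`. The helper names and the training set size of 4000 samples are hypothetical; the real values of a fitted search can be read from its `n_candidates_` and `n_resources_` attributes.

```python
from math import ceil


def int_log(n, base):
    """floor(log_base(n)) for integers n >= 1, without floating point."""
    k = 0
    while n >= base:
        n //= base
        k += 1
    return k


def halving_schedule(n_candidates, min_resources, factor, max_resources):
    """List of (n_candidates, n_resources) per iteration; assumes max_resources >= min_resources."""
    # iterations needed to narrow the candidates down to at most `factor`
    n_required = 1 + int_log(n_candidates, factor)
    # iterations possible before the resources exceed `max_resources`
    n_possible = 1 + int_log(max_resources // min_resources, factor)
    schedule = []
    for itr in range(min(n_required, n_possible)):
        n_resources = min(min_resources * factor ** itr, max_resources)
        schedule.append((n_candidates, n_resources))
        n_candidates = ceil(n_candidates / factor)  # keep the best 1/factor
    return schedule


# 72 SVC candidates, settings from the task, hypothetical training set of 4000 samples
for n_cand, n_res in halving_schedule(72, min_resources=200, factor=2, max_resources=4000):
    print(f'{n_cand:3d} candidates on {n_res:4d} samples')
```

Under these assumptions, all 72 candidates are only ever evaluated on 200 samples, and only a handful of them are trained on most of the data. This also explains the weaker result: a configuration that looks poor on a small subset is discarded before it can show its performance on more data.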

## Task 4: Bayesian Optimization

In [6]:
%%time

bayes_tune_params = {
    'K-NN': {
        'n_neighbors': (1, 10)
    },
    'SVC': {
        'C': (1e-3, 1e+3, 'log-uniform'),
        'gamma': ['scale', 'auto'],
        'tol': (1e-4, 1e-2, 'log-uniform'),
        'class_weight': ['balanced', None],
    }
}

# Now run a Bayesian search with the parameters given in `bayes_tune_params` with F1-measure as objective.
# Use an `n_iter` of 15.
# Again, print parameters of the best estimator and evaluate it with the `evaluate_estimators` function.
# HINT: Use scikit-optimize for Bayesian search (https://scikit-optimize.github.io/stable/auto_examples/bayesian-optimization.html)
# HINT: Currently, BayesSearchCV does not work with scikit-learn version 0.24.1. Use version 0.23.2 instead.
#       -> run a cell with `!pip install scikit-learn==0.23.2` and restart the notebook.

# --- SOLUTION ---
from skopt import BayesSearchCV

# hint: install scikit-optimize first
# hint: use sklearn==0.23.2 and restart notebook after installation
def bayes_search_estimator(e_name, e, param_grid, X, y):
    bscv = BayesSearchCV(e, search_spaces=param_grid, scoring='f1', n_iter=15, cv=10, random_state=RANDOM_STATE).fit(X, y)
    print(f'{e_name} best parameters: {bscv.best_params_}')
    return bscv.best_estimator_

bayes_estimators = {e_name: bayes_search_estimator(e_name, e, bayes_tune_params[e_name], X_train, y_train)
                   for e_name, e in estimators.items()
                   if e_name in bayes_tune_params}

evaluate_estimators(bayes_estimators, X_test, y_test)

K-NN best parameters: OrderedDict([('n_neighbors', 1)])
SVC best parameters: OrderedDict([('C', 65.08969218717925), ('class_weight', 'balanced'), ('gamma', 'auto'), ('tol', 0.01)])
K-NN: P=0.60 R=0.59 F1=0.60
SVC: P=0.58 R=0.43 F1=0.49
CPU times: user 2min 49s, sys: 5.43 s, total: 2min 55s
Wall time: 2min 28s


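Conclusion: As expected, the Bayesian search does not match the exhaustive grid search for the SVC either (F1=0.49 compared to 0.60). However, we get better results than with successive halving (SVC: F1=0.49 compared to 0.44; K-NN: F1=0.60 compared to 0.57), and we can easily control the runtime of the optimization with the `n_iter` parameter.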
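
The search space for `C` and `tol` uses a `'log-uniform'` prior, which means that values are drawn uniformly on a logarithmic scale rather than on the raw scale. The following sketch illustrates the idea with the standard library only; `sample_log_uniform` is a hypothetical helper and not the implementation used inside scikit-optimize.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
# same range as the `C` parameter in `bayes_tune_params`
samples = [sample_log_uniform(1e-3, 1e+3, rng) for _ in range(10000)]

# On a log scale, [1e-3, 1] and [1, 1e+3] are equally wide,
# so each half should receive roughly half of the samples.
below_one = sum(s < 1 for s in samples) / len(samples)
print(f'share of samples below 1: {below_one:.2f}')
```

A plain uniform prior on the same range would place almost all samples above 1 and would hardly ever try small values of `C`. With `n_iter=15` and `cv=10`, the search evaluates 15 parameter settings, i.e. 150 cross-validation fits per estimator, regardless of how large the search space is.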