Run the cell below if you are using Google Colab to mount your Google Drive in your Colab instance. Adjust the path to the files in your Google Drive as needed if it differs.

If you do not use Google Colab, running the cell will simply do nothing, so do not worry about it.

In [None]:
try:
    from google.colab import drive
    drive.mount('/content/drive/')
    %cd 'drive/My Drive/Colab Notebooks/04_Classification'
except ImportError as e:
    pass

# More Classifiers, Evaluation Methods & Hyperparameter Optimization

In this exercise, we will use the **Iris dataset**, which you can find in **data/iris.csv**. 

The dataset describes three types of Iris flowers:
- Setosa
- Virginica
- Versicolour

There are four (non-class) attributes:
- Sepal width and length
- Petal width and length


<div style="text-align: center;">
    <img src="imgs/iris_dataset_meme.png" style="width: 60%;">
</div>

In [None]:
import pandas as pd
from sklearn import preprocessing

# load the data
iris = pd.read_csv("data/iris.csv")
iris.head()

In [None]:
# Separate the training features and target variable
iris_data = iris[['SepalLength','SepalWidth','PetalLength','PetalWidth']]

# Encode the target variable
label_encoder = preprocessing.LabelEncoder()
iris_target = label_encoder.fit_transform(iris['Name'])

iris_data.head()

# Naive Bayes (NB)

## Bayes Theorem

Fundamental theorem in probability that describes how to update our belief about an event based on new evidence.
- It computes the **conditional probability $P(C|A)$** that tells us the probability of a class C given some attribute A


<div style="text-align: center;">
    <img src="imgs/bayes_theorem.png" style="width: 60%;">
</div>

- **P(C|A)** is the **posterior / conditional probability**: the probability of class $C$ _after_ attribute $A$ is seen  
- **P(A|C)** is the **likelihood / class-conditional probability**: the probability of observing attribute $A$ given class $C$ 
- **P(C)** is the **prior probability of class C**: the initial probability of class $C$ _before_ attributes are seen
- **P(A)** is the **marginal probability**: the total probability of attribute $A$ across all possible classes


<div style="text-align: center;">
    <img src="imgs/baes_theorem.jpeg" style="width: 60%;">
</div>

## Naive Bayes Classifier

The Naive Bayes Classifier is a simple classification algorithm based on Bayes' Theorem.

Let's classify whether an email is **Spam** or **Not Spam**. We have a dataset of **5 emails** with the words **"Offer"** and **"Free"** and their corresponding class labels.

| Email | Word: "Offer" | Word: "Free" | Class (Spam/Not Spam) |
|--------|------------|------------|--------------------|
| 1      | Yes        | Yes        | Spam              |
| 2      | Yes        | No         | Spam              |
| 3      | No         | Yes        | Spam              |
| 4      | Yes        | Yes        | Not Spam          |
| 5      | No         | No         | Not Spam          |

📝 **Feature Representation:**  
- **Yes (1)** means the word is present in the email.  
- **No (0)** means the word is absent.


### How does Naive Bayes Work?

#### 1. Compute the prior probabilities $P(C_j)$
- For each class $C_j$, count the records in the training set that are labeled with class $C_j$ and divide the count by the overall number of records
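As a minimal sketch of this step, the priors for the spam example can be obtained by counting labels. The `labels` list below simply re-types the class column of the table above; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Class column of the five training emails from the table above
labels = ["Spam", "Spam", "Spam", "Not Spam", "Not Spam"]

# P(C_j) = (number of records labeled C_j) / (total number of records)
counts = Counter(labels)
priors = {cls: Fraction(n, len(labels)) for cls, n in counts.items()}

for cls, p in priors.items():
    print(f"P({cls}) = {p} = {float(p):.2f}")
```

Exact fractions are used here only to keep the arithmetic easy to check by hand.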
    
---------------------------------------------------------------------------------------------------------   

---------------------------------------------------------------------------------------------------------   

    
#### 2. Estimate the class-conditional probability $P(A|C)$
   
   - ⚠️ Naive Bayes **assumes** that all **features** are **conditionally independent** (**Naive Bayes assumption**)
   
   - **Important**: this independence assumption is almost never correct!
   
   - ✅ Thanks to the _independence assumption_, we can re-write the joint class-conditional probability $P(A|C_j)$ as the product of the individual probabilities $P(A_i|C_j)$ (which we can estimate directly from the training data for all $A_i$ and $C_j$):
   
   $P(A_1, A_2, ..., A_n|C_j) = P(A_1|C_j) \times P(A_2|C_j) \times ... \times P(A_n|C_j) = \prod_{i=1}^n P(A_i|C_j)$
   
   - **In practice**: Estimate  $P(A_i|C_j)$ by counting how often an attribute value co-occurs with class $C_j$, and divide by the overall number of examples belonging to class $C_j$
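The counting described above can be sketched for the spam example as follows. The tuples re-type the table rows (1 = Yes, 0 = No); the helper `conditional` is a hypothetical name used only for this illustration.

```python
from fractions import Fraction

# (Offer, Free, Class) for the five training emails from the table above
emails = [
    (1, 1, "Spam"),
    (1, 0, "Spam"),
    (0, 1, "Spam"),
    (1, 1, "Not Spam"),
    (0, 0, "Not Spam"),
]
features = {"Offer": 0, "Free": 1}  # column index of each word

def conditional(feature, value, cls):
    """Estimate P(feature = value | cls) by counting co-occurrences."""
    in_class = [e for e in emails if e[2] == cls]
    matching = [e for e in in_class if e[features[feature]] == value]
    return Fraction(len(matching), len(in_class))

for cls in ("Spam", "Not Spam"):
    for feature in features:
        print(f"P({feature}=Yes | {cls}) = {conditional(feature, 1, cls)}")
```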
   
---------------------------------------------------------------------------------------------------------   
---------------------------------------------------------------------------------------------------------   


#### 3. Apply Bayes' Theorem

- The probability of a sample $A$ belonging to class $C_j$ is: $P(C_j|A) = \frac{P(A|C_j)P(C_j)}{P(A)}$
- Since **$P(A)$** is the **same for all classes**, we can compare probabilities using $P(C_j|A) \propto P(C_j) \prod_{i=1}^n P(A_i|C_j)$
    
---------------------------------------------------------------------------------------------------------   

Suppose we receive a **new email**:  📧 **"Offer Free"** (contains both words "Offer" and "Free")


---------------------------------------------------------------------------------------------------------   


#### 4. Classification Decision
- Assign $A$ to the class that **maximizes** the posterior probability, i.e., the class with the highest probability: $\hat{C} = \arg\max_{C_j} P(C_j) \prod_{i=1}^n P(A_i|C_j)$
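Putting steps 1 to 4 together for the new email "Offer Free" could look like the sketch below. It re-types the training table and uses plain counting without smoothing; the function name `score` is an assumption of this example.

```python
from fractions import Fraction

# (Offer, Free, Class) for the five training emails (1 = Yes, 0 = No)
emails = [
    (1, 1, "Spam"),
    (1, 0, "Spam"),
    (0, 1, "Spam"),
    (1, 1, "Not Spam"),
    (0, 0, "Not Spam"),
]
new_email = (1, 1)  # contains both "Offer" and "Free"

def score(cls):
    """Unnormalized posterior: P(C_j) * prod_i P(A_i | C_j)."""
    in_class = [e for e in emails if e[2] == cls]
    result = Fraction(len(in_class), len(emails))  # prior P(C_j)
    for i, value in enumerate(new_email):
        matches = sum(1 for e in in_class if e[i] == value)
        result *= Fraction(matches, len(in_class))  # P(A_i | C_j)
    return result

scores = {cls: score(cls) for cls in ("Spam", "Not Spam")}
prediction = max(scores, key=scores.get)

# Dividing by the sum of all scores plays the role of P(A)
posterior = scores[prediction] / sum(scores.values())
print(scores, prediction, posterior)
```

The normalization in the last step is optional for the decision itself, since $P(A)$ is the same for every class.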
    
---------------------------------------------------------------------------------------------------------   

---------------------------------------------------------------------------------------------------------   


### ⚠️  Zero-Frequency Problem

This problem occurs in Naive Bayes classification when a feature value **never appears** in the training set for a particular class. This leads to a **zero class-conditional probability**, which causes problems when computing the final probability using Bayes' theorem.

#### Why Is This a Problem?
If any feature $A_i$ has $P(A_i|C_j)=0$, then the entire product becomes **zero**, making it impossible to classify the instance correctly.

#### Solution: Laplace Smoothing

Add a small constant $\alpha$ (usually $\alpha = 1$, as in the formula below) to each count
    
- Original: $P(A_i|C_j) = \frac{N_{ic}}{N_c}$
- Laplace: $P(A_i|C_j) = \frac{N_{ic} + 1}{N_c + c}$

where $c$ = number of attribute values of $A$

✅ Probabilities will never be zero!

✅ Stabilizes probability estimates
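A small sketch of the smoothed estimate is shown below. The function name and the generalization to an arbitrary $\alpha$ (adding $\alpha \cdot c$ to the denominator) are assumptions of this example; with `alpha=1` it reduces to the Laplace formula above. The counts are made up.

```python
from fractions import Fraction

def smoothed_estimate(n_ic, n_c, n_values, alpha=1):
    """(N_ic + alpha) / (N_c + alpha * c), with c = n_values."""
    return Fraction(n_ic + alpha, n_c + alpha * n_values)

# Hypothetical case: a binary word feature (c = 2) that never occurs
# in any of the 3 training emails of a class
unsmoothed = Fraction(0, 3)
p_yes = smoothed_estimate(n_ic=0, n_c=3, n_values=2)
p_no = smoothed_estimate(n_ic=3, n_c=3, n_values=2)

print(unsmoothed, p_yes, p_no)
```

The smoothed estimates for all values of an attribute still sum to one, so they remain a valid probability distribution.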

## Strengths
- Often works well in practice, even if the independence assumption is violated
- **Robust** to **isolated noise points**, as they will be averaged out
- **Robust** to **irrelevant attributes**, as $P(A_i|C_j)$ is distributed uniformly for $A_i$
- **Computationally cheap**: probabilities can be estimated doing one pass over the training data
- **Memory efficient**: storing the probabilities does not require a lot of memory

## Naive Bayes in Scikit-learn

[Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html) is implemented in different variations in scikit-learn.
They differ mainly by the assumptions they make regarding the distribution of $P(x_i|y)$


- [```GaussianNB``` class](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB) implements the Naive Bayes classifier for continuous (numeric) features. The likelihood of the features is assumed to be Gaussian
- [```MultinomialNB``` class](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) implements the Naive Bayes classifier for multinomially distributed data, i.e., discrete features such as word counts
- [```BernoulliNB``` class](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html) assumes multivariate Bernoulli distributions
- [```CategoricalNB``` class](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html) assumes that each feature has its own categorical distribution

In [None]:
from sklearn.naive_bayes import GaussianNB
naive_bayes = GaussianNB()
naive_bayes.fit(iris_data, iris_target)

# Support Vector Machines (SVM)

## What is SVM?

It's a machine learning algorithm used for classification and regression. It classifies data by finding an optimal line or hyperplane that maximizes the distance between each class in an N-dimensional space.

## How does SVM work?

Find a linear hyperplane (decision boundary) that **maximizes** the margin to the closest points (support vectors).

<div style="text-align: center;">
    <img src="imgs/svm_1.png" style="width: 60%;">
</div>


 ⚠️ If the **decision boundary is not linear**, then transform the data into a higher dimensional space using a **Kernel function**.

<div style="text-align: center;">
    <img src="imgs/svm_2.png" style="width: 60%;">
</div>
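The idea of moving to a higher dimensional space can be illustrated with an explicit feature map on made-up 1-D data. Note that this is only a sketch of the intuition: a kernel function computes inner products in the transformed space implicitly rather than building the mapped points as done here.

```python
# Made-up 1-D data: class 1 lies on both sides of class 0,
# so no single threshold on x separates the classes
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [1, 0, 0, 1]

def feature_map(x):
    """Map x to the 2-D point (x, x^2)."""
    return (x, x * x)

mapped = [feature_map(x) for x in xs]

# In the mapped space the line x2 = 2.5 separates the two classes
predictions = [1 if x2 > 2.5 else 0 for _, x2 in mapped]
print(mapped, predictions)
```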


## Strengths

- Works well in **high dimensional spaces** (i.e., many features)
- **Memory efficient**: it uses only a subset of the training points in the decision function (the support vectors)
- **Versatile**: different Kernel functions can be specified for the decision function
- Can handle **non-linear data** using the kernel trick.



## Limitations
- **Computationally expensive** on large datasets, especially when using complex kernels
- **Difficult to choose the right kernel**: the choice of kernel (e.g., linear, polynomial, RBF) is crucial and requires hyperparameter optimization
- **Hard to interpret**: the decision boundary is abstract and hard to interpret, especially in high-dimensional spaces

## SVM in Python

[Support Vector Machines](https://scikit-learn.org/stable/modules/svm.html) are also implemented in different variations.

We will be using the [```SVC``` class](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) which implements support vector classification.
An alternative implementation with different parameters is the [```NuSVC``` class](https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html).

In [None]:
from sklearn.svm import SVC

svm = SVC(gamma='auto')
svm.fit(iris_data, iris_target)

# Artificial Neural Networks (ANN)

## Perceptron: The Simplest Neural Unit

<div style="text-align: center;">
    <img src="imgs/ann_1.png" style="width: 60%;">
</div>

## Multi-layer ANNs
<div style="text-align: center;">
    <img src="imgs/ann_2.jpg" style="width: 60%;">
</div>

### Training
1. Initialize the weights ($w_0$, $w_1$, ..., $w_n$), either randomly or using pretrained weights
2. Adjust the weights such that the output of the ANN is as consistent as possible with the class labels of the training examples:
    - Using an **objective function**, e.g. $E = \sum_i [Y_i - f(w_i, X_i)]^2$
    - Find the weights $w_i$ that minimize $E$ using **backpropagation**
    - Adjustment factor: **learning rate**
    
    <div style="text-align: center;">
        <img src="imgs/backprop.png" style="width: 60%;">
    </div>
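The weight-adjustment loop can be sketched for the simplest possible case: a single linear unit $f(w, x) = w \cdot x$ with one weight, trained by plain gradient descent on the squared error $E$. This is not full backpropagation through several layers; the data, the starting weight, and the learning rate are made up for illustration.

```python
# Toy data generated by y = 2 * x (made up for illustration)
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0               # initial weight
learning_rate = 0.05  # adjustment factor

def error(w):
    """Objective E = sum_i (y_i - w * x_i)^2."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

for epoch in range(50):
    # dE/dw = sum_i -2 * x_i * (y_i - w * x_i)
    gradient = sum(-2 * x * (y - w * x) for x, y in zip(xs, ys))
    # Move against the gradient, scaled by the learning rate
    w -= learning_rate * gradient

print(f"w = {w:.4f}, E = {error(w):.6f}")
```

With a learning rate that is too large the updates overshoot and diverge, which is why the learning rate is itself a hyperparameter.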
    

### Differences compared to the perceptron

<div style="text-align: center;">
    <img src="imgs/ann_3.png" style="width: 60%;">
</div>

# Evaluation Methods

**Goal**: Obtain a reliable estimate of the model's generalization performance

<div style="text-align: center;">
    <img src="imgs/evaluation_meme.jpg" style="width: 40%;">
</div>

## ⚠️ NEVER EVER TEST A MODEL ON DATA THAT WAS USED FOR TRAINING!!⚠️ 

**General approach**: split the labeled records into a training set and a test set

## Holdout Method

This method reserves a certain amount of the labeled data for testing, and uses the remainder for training.
- Applied when **lots of sample data** is available
- Typical train / test splits: 75% / 25% or 80% / 20%

<div style="text-align: center;">
    <img src="imgs/holdout_method.png" style="width: 60%;">
</div>

⚠️ Random samples might not be representative for imbalanced datasets, as few or no records of the minority class may end up in the training or test set

- **Stratified Sampling**: Sample each class independently, so that records of the minority class are present in each sample
- **Random Subsampling**: Repeat the process with different subsamples, i.e., in each iteration, a certain proportion is randomly selected for training and the performance of the different iterations is averaged
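A stratified holdout split can be sketched in plain Python as follows. The function name, the 90/10 class distribution, and the seed are assumptions of this example; in scikit-learn, `train_test_split` offers a `stratify` argument for the same purpose.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one record of every class in the test set
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Made-up imbalanced dataset: 90 majority and 10 minority records
labels = ["majority"] * 90 + ["minority"] * 10
train_idx, test_idx = stratified_holdout(labels)

minority_in_test = sum(1 for i in test_idx if labels[i] == "minority")
print(len(train_idx), len(test_idx), minority_in_test)
```

Calling the function repeatedly with different seeds and averaging the resulting scores corresponds to random subsampling.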

## Leave One Out Method

It iterates over all examples as follows:
- Train a model on all examples but the current one
- Evaluate on the current example

<div style="text-align: center;">
    <img src="imgs/leave_one_out_method.jpg" style="width: 60%;">
</div>

✅ Produces very accurate estimates

❌ Computationally expensive: one model has to be trained per example, which is infeasible for large datasets
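The loop can be sketched with a hand-written 1-nearest-neighbour classifier on made-up 1-D data. The classifier and the data are assumptions chosen to keep the example short; scikit-learn provides a `LeaveOneOut` splitter that can be passed as `cv`.

```python
# Made-up 1-D points with class labels
data = [(0.9, "a"), (1.0, "a"), (1.2, "a"), (2.0, "b"),
        (2.9, "b"), (3.0, "b"), (3.1, "b")]

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

correct = 0
for i, (x, label) in enumerate(data):
    # "Train" on all examples but the current one
    train = data[:i] + data[i + 1:]
    # Evaluate on the current example
    if predict_1nn(train, x) == label:
        correct += 1

accuracy = correct / len(data)
print(f"{correct} of {len(data)} correct, accuracy = {accuracy:.3f}")
```

Seven examples require seven training runs here, which shows why the cost grows linearly with the dataset size.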

## Cross-Validation

**K-fold cross-validation**:
- Splits the data into **k equally sized subsets** (usually $k=10$ and stratified sampling is used)
- Each subset in turn is used for testing, and the remainder for training
- The error estimates are averaged over all subsets to yield the overall error estimate

<div style="text-align: center;">
    <img src="imgs/cross_validation.png" style="width: 60%;">
</div>
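How the folds are formed can be sketched with a small index generator. This simplified version neither shuffles nor stratifies, and the function name is hypothetical.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive folds."""
    indices = list(range(n_samples))
    # Distribute the remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(k_fold_splits(n_samples=10, k=5))
for fold, (train, test) in enumerate(splits):
    print(f"Fold {fold}: test={test}, train={train}")
```

Every index appears in exactly one test fold, so each record is used for testing once and for training $k-1$ times.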

### Cross-Validation in Python

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()

accuracy_iris = cross_val_score(dt, iris_data, iris_target, cv=10, scoring='accuracy')

for i, acc in enumerate(accuracy_iris):
    print("Fold {}: Accuracy = {}%".format(i, acc * 100.0))

print("Average Accuracy = {}%".format(accuracy_iris.mean() * 100.0))

### Stratified Sampling in Cross Validation

You can control how the folds are created by changing the ```cv``` parameter.
Stratified sampling is implemented in the [```StratifiedKFold``` class](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html).

In [None]:
from sklearn.model_selection import StratifiedKFold

cross_val = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

accuracy_iris = cross_val_score(dt, iris_data, iris_target, cv=cross_val, scoring='accuracy')
accuracy_iris.mean()

### Obtaining predictions by cross-validation

If you want to analyse the predictions made during cross validation, you can use the [```cross_val_predict()``` function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html). This is meant for error analysis only: cross validation is not used when the final model is actually applied.

**Note**: As the folds of a cross validation are non-overlapping, you get exactly one prediction for every example in your dataset.

In [None]:
from sklearn.model_selection import cross_val_predict
predicted = cross_val_predict(dt, iris_data, iris_target, cv=10)

display(predicted)

### Manual Cross Validation 
If you want to implement cross validation yourself, you can iterate over the folds manually:

In [None]:
# sometimes you have to use the raw arrays and not the pandas objects (access them with .values)
data = iris_data.values
target = iris['Name'].values

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

for train_indices, test_indices in cv.split(data, target):
    train_data = data[train_indices]
    train_target = target[train_indices]
    
    dt.fit(train_data, train_target)

    test_data = data[test_indices]
    test_target = target[test_indices]
    
    test_prediction = dt.predict(test_data)

## Pipelines

A [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) in scikit-learn allows you to specify a sequence of transforms and a final estimator that can be executed and cross-validated.
This way you don't have to worry about applying the preprocessing steps (transforms) properly to each training and test split.

You create a pipeline by defining the steps that should be executed as a list.
Each element of the list is a tuple that consists of a name and the transform or estimator.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

transform = StandardScaler()
estimator = KNeighborsClassifier()

pipeline = Pipeline([ ('normalisation', transform), ('classification', estimator) ])

accuracy_iris = cross_val_score(pipeline, iris_data, iris_target, cv=10, scoring='accuracy')

print("Average Accuracy = {}%".format(accuracy_iris.mean() * 100.0))
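What a pipeline takes care of inside each fold can be sketched by hand for a single feature: the scaling statistics are estimated on the training split only and then reused unchanged on the test split. The values are made up; the population standard deviation is used, which matches what `StandardScaler` computes.

```python
from statistics import mean, pstdev

# Made-up values of a single feature, already split into train and test
train_values = [4.0, 6.0, 8.0, 10.0]
test_values = [5.0, 12.0]

# "fit": estimate the statistics on the training split only
mu = mean(train_values)
sigma = pstdev(train_values)

# "transform": apply the same statistics to both splits
train_scaled = [(v - mu) / sigma for v in train_values]
test_scaled = [(v - mu) / sigma for v in test_values]

print(train_scaled)
print(test_scaled)
```

Estimating `mu` and `sigma` on all values, including the test split, would leak information about the test data into training.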

# Intermezzo: Hyperparameter Optimization

<div style="text-align: center;">
    <img src="imgs/hpo_meme.jpg" style="width: 50%;">
</div>


## Hyperparameter Selection

**Hyperparameter**: a parameter which influences the learning process and whose value is **set before the learning begins** (e.g., learning rate, number of hidden layers for ANNs, pruning thresholds for decision trees, $K$ for K-NN)

**Parameter**: the values learned by an estimator during training / from the training data (e.g., weights in ANN, splits in a tree)

### 🛠  The complete learning procedure is thus:
- Hyperparameter Tuning ➡️ pick best hyperparameters
- Training ➡️ find best parameters
- Testing model performance on *unseen* test data

**Goal of Hyperparameter Optimization**: find the combination of hyperparameter values that results in learning the model with the lowest generalization error

## Search Strategies

### 1. Brute Force Search
- Try out all hyperparameter combinations 
- Computationally infeasible for all but tiny search spaces; “blind” evaluation of parameters

### 2. Grid Search
- Manually restrict search space to certain parameter combinations
- Quality of solution strongly dependent on grid definition
- It may miss the best parameters


### 3. Random Search
- Test a fixed number of randomly sampled combinations of parameter values
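Sampling candidate configurations can be sketched as below. The search space, the ranges, and the helper name are illustrative assumptions; scikit-learn implements this strategy, including the cross-validated evaluation, in `RandomizedSearchCV`.

```python
import random

# Hypothetical search space: each entry draws one random value
search_space = {
    "n_neighbors": lambda rng: rng.randint(1, 30),
    "weights": lambda rng: rng.choice(["uniform", "distance"]),
}

def sample_configurations(space, n_iter, seed=42):
    """Draw n_iter random hyperparameter combinations."""
    rng = random.Random(seed)
    return [{name: draw(rng) for name, draw in space.items()}
            for _ in range(n_iter)]

candidates = sample_configurations(search_space, n_iter=5)
for candidate in candidates:
    print(candidate)
```

The budget `n_iter` is fixed in advance, so the cost does not grow with the size of the search space as it does for an exhaustive grid.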


### 4. Bayesian Optimization
- Treat hyperparameter tuning as a learning problem:
    - Given a set of hyperparameters $p$, predict the evaluation score $s$ of the model
    - The prediction model is called a **surrogate model** or **oracle**
- Why? Because training and evaluating the actual model is costly

<div style="text-align: center;">
    <img src="imgs/bayesian_optimization.png" style="width: 60%;">
</div>

### Grid Search in Python

- We perform the hyper-parameter tuning using [Grid Search](http://scikit-learn.org/stable/modules/grid_search.html).
- It is implemented in the [```GridSearchCV``` class](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) in scikit-learn.
- This class behaves exactly like an estimator. If its ```fit()``` function is called, all hyper-parameter combinations are evaluated.

Parameters:
- ```estimator```: an estimator (e.g. a decision tree)
- ```param_grid```: the parameters that should be evaluated as a dictionary
    - the key is the name of the hyper-parameter
    - the value is a list of possible values
    - example: ```{'param_a':[1,2,3], 'param_b':[7,8,9] }```
- ```scoring```: the metric that should be used to evaluate the parameter settings (can be 'accuracy' or other scores)
- ```cv```: specifies how to perform cross validation (default: 5-fold cross validation; 3-fold in scikit-learn versions before 0.22)
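To get a feeling for how many settings a grid implies, the combinations of the example dictionary above can be enumerated with `itertools.product`. This is only a sketch of the enumeration, not of how scikit-learn implements it internally.

```python
from itertools import product

# The example grid from the parameter description above
grid = {'param_a': [1, 2, 3], 'param_b': [7, 8, 9]}

names = list(grid)
combinations = [dict(zip(names, values))
                for values in product(*grid.values())]

print(len(combinations), "combinations")
for combination in combinations:
    print(combination)
```

Each combination is evaluated once per fold, so this grid with 10-fold cross validation already requires 9 × 10 = 90 model fits.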

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

# create an estimator
knn_estimator = KNeighborsClassifier()

# specify the parameter grid
parameters = {
    'n_neighbors': range(2, 9)
}

# specify the cross validation
stratified_10_fold_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# create the grid search instance
grid_search_estimator = GridSearchCV(
    knn_estimator, 
    parameters, 
    scoring='accuracy', 
    cv=stratified_10_fold_cv, 
    return_train_score=False
)

# run the grid search
grid_search_estimator.fit(iris_data,iris_target)

# print the results of all hyper-parameter combinations
results = pd.DataFrame(grid_search_estimator.cv_results_)
display(results)
    
# print the best parameter setting
print("best score is {} with params {}".format(
    grid_search_estimator.best_score_, grid_search_estimator.best_params_)
     )

## Model Selection

**Goal**: Select the model $m_{best}$ from all learned models $M$ that is expected to generalize best to unseen records

## ⚠️ Separate data for model selection from the data for model evaluation!

Otherwise: 
- overfitting to test set
- overly optimistic generalization error estimate

## Model Selection using a Validation Set

1. Split training set $D_{train}$ into validation set $D_{val}$ and training set $D_{tr}$
2. Learn models $m_i$ on $D_{tr}$ using different hyperparameter value combinations $p_i$
3. Select the best parameter values $p_{best}$ by testing each model $m_i$ on the validation set $D_{val}$
4. Learn the final model $m_{best}$ on complete $D_{train}$ using the parameter values $p_{best}$
5. Evaluate $m_{best}$ on the test set in order to get an unbiased estimate of its generalization performance

<div style="text-align: center;">
    <img src="imgs/model_selection_val_set.png" style="width: 40%;">
</div>
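The five steps can be sketched end to end with a hand-written k-NN classifier on synthetic 1-D data. Everything here (data generator, split sizes, candidate values for $k$) is an assumption chosen to keep the example self-contained.

```python
import random

rng = random.Random(0)

def make_data(n):
    """Made-up data: n points per class, class c centred at 2 * c."""
    return [(rng.gauss(2.0 * label, 1.0), label)
            for label in (0, 1) for _ in range(n)]

def knn_predict(train, x, k):
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0  # majority vote, k is odd

def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)

d_train, d_test = make_data(60), make_data(20)  # 120 and 40 examples

# 1. Split D_train into D_tr and D_val
rng.shuffle(d_train)
d_tr, d_val = d_train[:80], d_train[80:]

# 2.-3. Evaluate every candidate on D_val and keep the best one
candidates = [1, 3, 5, 7, 9]
val_scores = {k: accuracy(d_tr, d_val, k) for k in candidates}
k_best = max(val_scores, key=val_scores.get)

# 4. k-NN has no training phase; "retraining" means using all of D_train
# 5. Evaluate once on the untouched test set
test_score = accuracy(d_train, d_test, k_best)
print(val_scores, k_best, test_score)
```

The test set is touched exactly once, after $k_{best}$ has been fixed.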

## Model Selection using Cross-Validation

✅ Make sure that all examples are used for validation once

✅ Use as much labeled data as possible for training

<div style="text-align: center;">
    <img src="imgs/model_selection_crossval.png" style="width: 70%;">
</div>

# Back to Evaluation Methods

## Nested Cross-Validation

<div style="text-align: center;">
    <img src="imgs/model_selection_nested_crossval.png" style="width: 70%;">
</div>

### Nested Cross-Validation in Python

In [None]:
from sklearn.model_selection import cross_val_score

# use only 5 folds here, as we only have 50 examples per class in the iris dataset!
nested_cv_score = cross_val_score(grid_search_estimator, iris_data, iris_target, cv=5, scoring='accuracy')

display(nested_cv_score.mean())

grid_search_estimator.fit(iris_data,iris_target)
display(grid_search_estimator.best_params_)

### Grid Search using Pipelines

Often, we need preprocessing steps before we perform a grid search, or even want to optimise the hyper-parameters of our preprocessing steps.
In these cases, we set up a pipeline and run the grid search on all steps.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# create the pipeline
transform = StandardScaler()
estimator = KNeighborsClassifier()
pipeline = Pipeline(steps=[ ('normalisation', transform), ('classification', estimator) ])


# specify the parameter grid
parameters = {
    'normalisation__with_mean': [ True, False],
    'normalisation__with_std': [ True, False],
    'classification__n_neighbors': range(2, 9)
}

# create the grid search instance
grid_search_estimator = GridSearchCV(pipeline, parameters, scoring='accuracy', cv=10)

accuracy_best = cross_val_score(grid_search_estimator, iris_data, iris_target, cv=5, scoring='accuracy', n_jobs=-1)
print("Accuracy = {}%".format(accuracy_best.mean() * 100.0))

grid_search_estimator.fit(iris_data, iris_target)
display(grid_search_estimator.best_params_)

# Comparing Classifiers

## 1. Confidence Intervals

<div style="text-align: center;">
    <img src="imgs/confidence_intervals_1.png" style="width: 70%;">
</div>

<div style="text-align: center;">
    <img src="imgs/confidence_intervals_2.png" style="width: 70%;">
</div>

<span style="color:red">Caution: only for sample size > 30.</span>

With p% probability, $error_D$ is in $[error_s - y, error_s + y]$, with $y = z_N \cdot \sqrt{\frac{error_s (1 -error_s)}{n}}$

<div style="text-align: center;">
    <img src="imgs/z_table.png" style="width: 50%;">
</div>
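The interval formula above translates directly into a few lines of code. The numbers in the usage example are made up, and $z = 1.96$ is the standard normal value for a two-sided 95% interval; the function name is hypothetical.

```python
from math import sqrt

def error_confidence_interval(error_s, n, z):
    """Return (lower, upper) bound for error_D given the sample error."""
    y = z * sqrt(error_s * (1 - error_s) / n)
    return error_s - y, error_s + y

# Made-up example: 20 errors on 100 test examples, 95% confidence
low, high = error_confidence_interval(error_s=0.2, n=100, z=1.96)
print(f"[{low:.4f}, {high:.4f}]")
```

Because $n$ appears under the square root, quadrupling the test set size only halves the width of the interval.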


### Computing Confidence Intervals

You are using a machine learning solution from company A. Recently, you were contacted by the Junior Vice President of company B, who suggested that you switch to his solution. As a migration is very costly, you only want to switch if you can be at least 90% sure that the new solution is better. For this purpose, you have a dedicated test set with 420 examples on which your current solution makes 105 errors. 

What is the highest number of errors that you accept for the new solution in order to switch?

## 2. Statistical Tests: Sign Test vs. Wilcoxon Signed-Rank Test

Let's consider two classifiers, $A$ and $B$, evaluated on 12 test instances. We record their accuracy (or any other performance metric):

| Instance | Model A Score | Model B Score | Difference ($d$) | Sign |
|----------|--------------|--------------|----------------|------|
| 1        | 0.90         | 0.80         | **+0.10**      | +    |
| 2        | 0.88         | 0.75         | **+0.13**      | +    |
| 3        | 0.85         | 0.85         | **0.00**       | Tie  |
| 4        | 0.92         | 0.88         | **+0.04**      | +    |
| 5        | 0.80         | 0.78         | **+0.02**      | +    |
| 6        | 0.89         | 0.82         | **+0.07**      | +    |
| 7        | 0.91         | 0.84         | **+0.07**      | +    |
| 8        | 0.87         | 0.86         | **+0.01**      | +    |
| 9        | 0.76         | 0.79         | **−0.03**      | −    |
| 10       | 0.93         | 0.85         | **+0.08**      | +    |
| 11       | 0.95         | 0.88         | **+0.07**      | +    |
| 12       | 0.89         | 0.82         | **+0.07**      | +    |

- **+** means Model A outperformed Model B.  
- **−** means Model B outperformed Model A.  
- **Ties are removed.**  

### Sign Test

The **Sign Test** only considers the number of wins/losses. The null hypothesis ($H_0$) assumes that Model A and Model B are equally good, meaning that each instance is equally likely to favor either model ($p = 0.5$).

**Step 1: Count the wins**

Ignoring the tie (Instance 3):
- **Model A wins:** $n_A = 10$ 
- **Model B wins:** $n_B = 1$  
- **Ties:** $n_t = 1$  
- **Total non-tied instances:** $n' = 11$ 

**Step 2: Find the critical value**
<div style="text-align: center;">
    <img src="imgs/sign_test_table.png" style="width: 60%;">
</div>

Since Model A performs better than Model B in 10 of the 11 non-tied cases, we **reject $H_0$** → **Model A is significantly better than Model B.** ✅  
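As an alternative to the table lookup, the p-value of the sign test can be computed directly from the binomial distribution. The sketch below uses the win counts from Step 1 and assumes Python 3.8+ for `math.comb`.

```python
from math import comb

wins_a, wins_b = 10, 1   # counts from Step 1, ties removed
n = wins_a + wins_b      # 11 non-tied instances

# Under H0 the number of wins follows Binomial(n, 0.5):
# probability of seeing at least wins_a wins for Model A
p_one_sided = sum(comb(n, k) for k in range(wins_a, n + 1)) / 2 ** n

# Two-sided version: either model could have been the better one
p_two_sided = min(1.0, 2 * p_one_sided)

print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```

Both values are well below 0.05, which agrees with rejecting $H_0$.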

### Wilcoxon Signed-Rank Test

The **Wilcoxon signed-rank test** considers both **signs** and **magnitude** of the differences.

**Step 1: Rank results by _absolute differences_**
- Ties are ignored
- Equal ranks are averaged

| Instance | Model A Score | Model B Score | Difference ($d$) | Absolute $d$ | Rank |
|----------|--------------|--------------|----------------|----------------|------|
| 1        | 0.90         | 0.80         | **+0.10**      | 0.10           | **10**  |
| 2        | 0.88         | 0.75         | **+0.13**      | 0.13           | **11**  |
| 3        | 0.85         | 0.85         | **0.00**       | 0.00           | **Tie**  |
| 4        | 0.92         | 0.88         | **+0.04**      | 0.04           | **4**  |
| 5        | 0.80         | 0.78         | **+0.02**      | 0.02           | **2**  |
| 6        | 0.89         | 0.82         | **+0.07**      | 0.07           | **6.5**  |
| 7        | 0.91         | 0.84         | **+0.07**      | 0.07           | **6.5**  |
| 8        | 0.87         | 0.86         | **+0.01**      | 0.01           | **1**  |
| 9        | 0.76         | 0.79         | **−0.03**      | 0.03           | **3**  |
| 10       | 0.93         | 0.85         | **+0.08**      | 0.08           | **9**  |
| 11       | 0.95         | 0.88         | **+0.07**      | 0.07           | **6.5**  |
| 12       | 0.89         | 0.82         | **+0.07**      | 0.07           | **6.5**  |

The four differences of 0.07 share ranks 5 to 8, so each receives the average rank 6.5.

**Step 2: Sum ranks by sign**
- **Sum of positive ranks**: $W_+ = 10 + 11 + 4 + 2 + 6.5 + 6.5 + 1 + 9 + 6.5 + 6.5 = 63$
 
- **Sum of negative ranks**: $W_- = 3$


<div style="text-align: center;">
    <img src="imgs/wilcoxon_signed_test_table.png" style="width: 60%;">
</div>

**Step 3: Compute the test statistic $W$**

Wilcoxon’s statistic is the smaller sum of ranks:  
$W = \min(W_+, W_-) = \min(63, 3) = 3$

**Step 4: Find the critical value**

Since $W = 3 < 13$, **we reject $H_0$** → **Model A is significantly better than Model B.** ✅  
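The ranking and summing steps can be reproduced with a short script. The differences are entered in hundredths to avoid floating point rounding, zero differences are dropped, and tied absolute differences receive the average of the ranks they span; the helper name is hypothetical.

```python
# Differences d = A - B from the table, in hundredths
diffs = [10, 13, 0, 4, 2, 7, 7, 1, -3, 8, 7, 7]

nonzero = [d for d in diffs if d != 0]          # remove ties (d = 0)
ordered = sorted(abs(d) for d in nonzero)       # absolute differences

def average_rank(value):
    """Average of the 1-based ranks occupied by `value` in `ordered`."""
    first = ordered.index(value) + 1
    last = first + ordered.count(value) - 1
    return (first + last) / 2

w_plus = sum(average_rank(abs(d)) for d in nonzero if d > 0)
w_minus = sum(average_rank(abs(d)) for d in nonzero if d < 0)
w = min(w_plus, w_minus)

print(f"W+ = {w_plus}, W- = {w_minus}, W = {w}")
```

A quick sanity check is that $W_+ + W_-$ must equal $n'(n'+1)/2$, which is 66 for the 11 non-tied instances.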

# QUIZ TIME

## Question 1

You train a Naïve Bayes classifier for sentiment analysis on movie reviews. The model predicts **positive sentiment** for the review:

_"The movie was absolutely amazing, the plot was thrilling, but the acting was mediocre."_

The words "amazing" and "thrilling" are associated with **positive** sentiment, while "mediocre" is linked to **negative** sentiment. Why might Naïve Bayes still classify this as **positive**?

## Question 2

The table below contains information about different biological species. Using the training data, create a Naive
Bayes classification model and classify the following examples:
- Dolphin <yes, no, yes, no>
- Duck <no, yes, sometimes, yes>

| Gives birth | Can fly | Lives in Water | Has Legs | Class       |
| ----------- | ------- | -------------- | -------- | ----------- |
| yes         | no      | no             | yes      | mammals     |
| no          | no      | no             | no       | non-mammals |
| no          | no      | yes            | no       | non-mammals |
| yes         | no      | yes            | no       | mammals     |
| no          | no      | sometimes      | yes      | non-mammals |
| no          | no      | no             | yes      | non-mammals |
| yes         | yes     | no             | yes      | mammals     |
| no          | yes     | no             | yes      | non-mammals |
| yes         | no      | no             | yes      | mammals     |
| yes         | no      | yes            | no       | non-mammals |
| no          | no      | sometimes      | yes      | non-mammals |
| no          | no      | sometimes      | yes      | non-mammals |
| yes         | no      | no             | yes      | mammals     |
| no          | no      | yes            | no       | non-mammals |
| no          | no      | sometimes      | yes      | non-mammals |
| no          | no      | no             | yes      | non-mammals |
| no          | no      | no             | yes      | mammals     |
| no          | yes     | no             | yes      | non-mammals |
| yes         | no      | yes            | no       | mammals     |
| no          | yes     | no             | yes      | non-mammals |

Steps:
1. Compute the prior probability of each class
2. Compute the class conditional probability of evidence (for each attribute)
3. Classify the _Dolphin_ and _Duck_

## Question 3

Suppose you are using the **holdout method** to evaluate a machine learning model. You split your dataset into **70% training** and **30% testing**. You then perform **feature selection** on the **entire dataset** before training the model.

Why is this a problem?
How would this affect the model’s performance on new, unseen data?

## Question 4

A researcher performs **10-fold cross-validation** for **hyperparameter tuning** and then trains the final model on the entire dataset. They **evaluate this final model using another round of 10-fold cross-validation on the same dataset**.

- Why is this evaluation flawed?
- How should the researcher evaluate the final model properly?

## Question 5

Cross-validation is often preferred over the holdout method since it provides a more reliable estimate of generalization performance.

In what scenarios might the **holdout method** be preferable to **cross-validation**?

## Question / Task 6

This exercise is about hyperparameter tuning. To get familiar with hyperparameter tuning in scikit-learn, refer to the respective [part in the documentation](https://scikit-learn.org/stable/modules/grid_search.html).

We will use the data set of the Data Mining Cup 2006, which you can find in **data/dmc2006**. The task is to predict the attribute `gms_greater_avg` as precisely as possible. We will use the F1-measure of the class `1` as main performance metric.

1. Data preparation.
    - Import the data and create a 50:50 train-test split.
    - Implement the `evaluate_estimators` function so that it returns precision, recall, and F1-measure of the class 1 on the test set for the classifiers given in `estimators`. Use the following `estimators`: {Naive Bayes, K-NN, SVC}

2. Grid Search
    - Run a grid search with the parameters given in `tune_params` with F1-measure as optimization objective. 
    
    ```python
    tune_params = {
    
        'K-NN': {
            'n_neighbors': [1, 3, 5, 10]
        },
        
        'SVC': {
            'C': [.001, .01, .1, 1, 10, 100],
            'gamma': ['scale', 'auto'],
            'tol': [1e-2, 1e-3, 1e-4],
            'class_weight': ['balanced', None],
        }
    }
    ```
    
    - For the best estimator, print the parameters and evaluate it with the `evaluate_estimators` function.
    
    **HINT**: Take a look at https://scikit-learn.org/stable/modules/grid_search.html for information about grid search.
    
3. Bayesian Optimization
    - Now run a Bayesian search with the parameters given in `bayes_tune_params` with F1-measure as objective. Use an `n_iter` of 15.
    - Again, print parameters of the best estimator and evaluate it with the `evaluate_estimators` function.

    **HINT**: Use scikit-optimize for bayesian search (https://scikit-optimize.github.io/stable/auto_examples/bayesian-optimization.html)

    **HINT**: Currently, BayesSearchCV does not work with scikit-learn version of 0.24.1. Use version 0.23.2 instead      -> run a cell with `!pip install scikit-learn==0.23.2` and restart the notebook.
