## 5. Multi-class Classification

After finishing the MALIS course, you are hired by a company that is building a machine learning model to automatically identify whether passengers respect the carry-on luggage rules of low-cost airlines. The rules are the following:
- Passenger has one valid carry-on. Passenger can board (A)
- Passenger has two carry-on items. Boarding denied (B)
- Passenger has a big carry-on that does not fit under the seat. Passenger to pay a fee (C)
- Passenger has a heavy carry-on. Bag needs to be checked-in and pay a fee (D)

Your first task in the company is to develop a classifier that, given a photo of a passenger and their carry-on, classifies the photo into one of the categories described above. For that purpose, they provide you with a dataset of 2000 previously acquired photos: 1000 from A, 500 from B, 450 from C and 50 from D.

You realize that you only know how to train binary classifiers. As you do not want to be fired on the first day, you agree to develop this multi-class classifier.

`(a)` [2 points] Propose a strategy to solve a multi-class classification problem using only binary classifiers. Be specific about how your strategy performs training and how it predicts the label of a new photo.

One-vs-All (OvA) Strategy:
   - Train one binary classifier for each class (A, B, C, D).
   - Each classifier $ C_i $ is trained to distinguish class $ i $ from the rest: instances of class $ i $ are labeled positive and all other instances negative.
   - During prediction, run all classifiers on a new photo. The classifier with the highest confidence score determines the predicted class.


**Strategy: One-vs-All (OvA) for Multi-Class Classification using Binary Classifiers**

**Training:**
1. Train four binary classifiers, one for each class (A, B, C, D):
   - $ C_A $: Class A vs. Not A (Positive: A, Negative: B, C, D)
   - $ C_B $: Class B vs. Not B (Positive: B, Negative: A, C, D)
   - $ C_C $: Class C vs. Not C (Positive: C, Negative: A, B, D)
   - $ C_D $: Class D vs. Not D (Positive: D, Negative: A, B, C)

**Prediction:**
1. Given a new photo $ x $, obtain scores from all classifiers:
   - $ s_A(x) $: Score from $ C_A $
   - $ s_B(x) $: Score from $ C_B $
   - $ s_C(x) $: Score from $ C_C $
   - $ s_D(x) $: Score from $ C_D $
2. The predicted class is the one with the highest score:
   $$
   \text{Predicted class} = \arg\max_{k \in \{A, B, C, D\}} s_k(x)
   $$

This method reduces the four-class problem to four binary classification problems and predicts the class whose classifier returns the highest confidence score.
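The training and prediction steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 2-D feature vectors stand in for features extracted from the photos, the helper names (`fit_centroid_scorer`, `train_one_vs_all`, `predict_one_vs_all`) are hypothetical, and the toy nearest-centroid learner is a placeholder for whatever binary classifier is actually used, as long as it returns a real-valued score.

```python
from math import dist

def fit_centroid_scorer(X, y):
    """Toy binary learner (assumption): score is high when x is closer to the
    centroid of the positives than to the centroid of the negatives."""
    def centroid(points):
        return [sum(coord) / len(points) for coord in zip(*points)]
    pos = centroid([x for x, t in zip(X, y) if t == 1])
    neg = centroid([x for x, t in zip(X, y) if t == 0])
    return lambda x: dist(x, neg) - dist(x, pos)

def train_one_vs_all(X, labels, classes, fit_binary=fit_centroid_scorer):
    """Train one scorer per class: class k is positive, all others negative."""
    scorers = {}
    for k in classes:
        y_k = [1 if label == k else 0 for label in labels]
        scorers[k] = fit_binary(X, y_k)
    return scorers

def predict_one_vs_all(scorers, x):
    """Return the class whose binary scorer gives the highest score."""
    return max(scorers, key=lambda k: scorers[k](x))

# Made-up, well-separated toy features: two points per class.
X = [(0, 0), (1, 0), (0, 10), (1, 10), (10, 0), (10, 1), (10, 10), (9, 10)]
labels = ["A", "A", "B", "B", "C", "C", "D", "D"]
scorers = train_one_vs_all(X, labels, ["A", "B", "C", "D"])
print(predict_one_vs_all(scorers, (0.5, 1)))  # a point next to the A cluster
```

The arg max only makes sense if the four scores are on a comparable scale, which holds here because every scorer is built the same way.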

`(b)` [1 point] Which labels ($y$) do you assign to each of the categories (A,B,C,D) for training?

**Labels for Training:**

1. **Classifier $ C_A $ (A vs. Not A)**
   - Positive ($ y = 1 $): A
   - Negative ($ y = 0 $): B, C, D

2. **Classifier $ C_B $ (B vs. Not B)**
   - Positive ($ y = 1 $): B
   - Negative ($ y = 0 $): A, C, D

3. **Classifier $ C_C $ (C vs. Not C)**
   - Positive ($ y = 1 $): C
   - Negative ($ y = 0 $): A, B, D

4. **Classifier $ C_D $ (D vs. Not D)**
   - Positive ($ y = 1 $): D
   - Negative ($ y = 0 $): A, B, C
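The same label assignment can be written as a small lookup, shown here as a sketch. The function name `one_vs_rest_labels` and the keys `C_A` ... `C_D` are hypothetical and simply mirror the classifier names used above. Learners that expect targets in $\{-1, +1\}$ would need the 0 mapped to $-1$.

```python
CLASSES = ["A", "B", "C", "D"]

def one_vs_rest_labels(category, classes=CLASSES):
    """Return the binary target y of `category` for each classifier C_k."""
    return {f"C_{k}": int(category == k) for k in classes}

# One row per category: y = 1 only for the classifier of its own class.
for category in CLASSES:
    print(category, one_vs_rest_labels(category))
```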

`(c)` How do you split the dataset into training, validation and testing?

Split the dataset:
   - Training set: 70% of the dataset (1400 photos).
   - Validation set: 15% of the dataset (300 photos).
   - Test set: 15% of the dataset (300 photos).
   - Use a stratified split, so that each set preserves the class distribution of the full dataset.


**Dataset Split:**

- **Training Set (70%):**
  - Class A: 700 photos
  - Class B: 350 photos
  - Class C: 315 photos
  - Class D: 35 photos
  - **Total: 1400 photos**

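- **Validation Set (15%):**
  - Class A: 150 photos
  - Class B: 75 photos
  - Class C: 68 photos
  - Class D: 7 photos
  - **Total: 300 photos**

- **Testing Set (15%):**
  - Class A: 150 photos
  - Class B: 75 photos
  - Class C: 67 photos
  - Class D: 8 photos
  - **Total: 300 photos**

Because 15% of classes C and D is not a whole number (67.5 and 7.5 photos), the odd photo of each class goes to one of the two sets, so that the three sets together contain exactly 2000 photos.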
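A stratified split of this kind can be sketched with the standard library alone. The function `stratified_split`, its parameters and the rule for odd leftovers are assumptions made for this illustration: validation and test share the non-training photos equally, and when a class has an odd number of leftover photos the extra one goes alternately to validation and to test, so that the three sets stay disjoint and add up to the 2000 photos.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, train_pct=70, seed=0):
    """Split photo indices into train/val/test, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    splits = {"train": [], "val": [], "test": []}
    extra_to_val = True  # which set receives the next odd leftover photo
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        n_train = len(idx) * train_pct // 100  # integer arithmetic, no float rounding
        n_rest = len(idx) - n_train
        n_val = n_rest // 2
        if n_rest % 2:  # odd leftover: alternate between val and test
            if extra_to_val:
                n_val += 1
            extra_to_val = not extra_to_val
        splits["train"] += idx[:n_train]
        splits["val"] += idx[n_train:n_train + n_val]
        splits["test"] += idx[n_train + n_val:]
    return splits

labels = ["A"] * 1000 + ["B"] * 500 + ["C"] * 450 + ["D"] * 50
splits = stratified_split(labels)
for name, idx in splits.items():
    print(name, len(idx), dict(Counter(labels[i] for i in idx)))
```

With only 7 or 8 class D photos in the validation and test sets, any per-class estimate for D will be noisy, whatever the split.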

`(d)` [1 point] You use the overall classification accuracy to assess the generalization error of your classifier (correctly detected photos / all photos). Why may this measure be problematic? Can you propose another measure?

Issues with overall accuracy:
   - It may be problematic because of class imbalance (e.g., only 50 samples for D).
   - An alternative measure: Use the F1-score, which considers both precision and recall, and is more informative for imbalanced datasets. Compute the F1-score for each class and consider the macro-average F1-score.

Using overall classification accuracy to assess the generalization error of the classifier can be problematic in cases where there is a class imbalance in the dataset. In your dataset, classes have the following distribution:
- Class A: 1000 photos
- Class B: 500 photos
- Class C: 450 photos
- Class D: 50 photos

This imbalance means that the classifier could achieve high overall accuracy by primarily focusing on correctly predicting the majority class (Class A), while performing poorly on the minority classes (especially Class D). For example, a classifier that never predicts Class D can still reach up to $ 1950 / 2000 = 97.5\% $ accuracy, so the overall figure hides its complete failure on that class.

**Proposed Measure:**

**F1 Score (Macro-Averaged):**
The F1 score considers both precision and recall, and macro-averaging ensures that performance across all classes is taken into account equally, regardless of the class distribution.

1. **Precision for each class**: $ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $
2. **Recall for each class**: $ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $
3. **F1 Score for each class**: $ \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $
4. **Macro-Averaged F1 Score**: Average the F1 scores across all classes.

This measure provides a better understanding of the classifier's performance across all classes, especially in imbalanced datasets.
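The four steps can be checked on a deliberately bad classifier. The sketch below assumes a hypothetical model that predicts class A for every photo of the dataset; the function name `per_class_f1` and the convention of setting a score to 0 when its denominator is 0 are choices made for this example.

```python
def per_class_f1(y_true, y_pred, classes):
    """Compute the F1 score of every class from true and predicted labels."""
    scores = {}
    for k in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == k and p == k for t, p in pairs)
        fp = sum(t != k and p == k for t, p in pairs)
        fn = sum(t == k and p != k for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0  # 0 by convention
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[k] = 2 * precision * recall / denom if denom else 0.0
    return scores

y_true = ["A"] * 1000 + ["B"] * 500 + ["C"] * 450 + ["D"] * 50
y_pred = ["A"] * len(y_true)  # hypothetical classifier: always answers A

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = per_class_f1(y_true, y_pred, ["A", "B", "C", "D"])
macro_f1 = sum(f1.values()) / len(f1)
print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1:.3f}")
```

Accuracy is 0.5 for this classifier, while the macro-averaged F1 is only about 0.167, because classes B, C and D each contribute an F1 of 0.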

**Other Measures:**
- **Confusion Matrix**: Provides a detailed breakdown of the classifier’s performance, showing the number of true positives, false positives, false negatives, and true negatives for each class.
- **Precision-Recall Curve**: Useful for evaluating classifiers on imbalanced datasets.
- **Balanced Accuracy**: Average of recall obtained on each class, which accounts for imbalances.

These measures help provide a more nuanced evaluation of the classifier's performance across different classes.
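As a rough sketch of two of these measures, the code below builds a confusion matrix (rows are true classes, columns are predicted classes) and derives the balanced accuracy from it. The predictions are made up for the illustration: a hypothetical classifier that is right on every A, B and C photo but sends every D photo to C. The function names are likewise hypothetical.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, classes):
    """matrix[i][j] = number of photos of class i predicted as class j."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]

def balanced_accuracy(matrix):
    """Average of the per-class recalls (diagonal entry / row total)."""
    recalls = [row[i] / sum(row) for i, row in enumerate(matrix) if sum(row)]
    return sum(recalls) / len(recalls)

classes = ["A", "B", "C", "D"]
y_true = ["A"] * 1000 + ["B"] * 500 + ["C"] * 450 + ["D"] * 50
y_pred = ["A"] * 1000 + ["B"] * 500 + ["C"] * 450 + ["C"] * 50  # D mistaken for C

matrix = confusion_matrix(y_true, y_pred, classes)
accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(y_true)
for name, row in zip(classes, matrix):
    print(name, row)
print(f"accuracy = {accuracy:.3f}, balanced accuracy = {balanced_accuracy(matrix):.3f}")
```

Here the overall accuracy is 0.975 but the balanced accuracy is 0.75, and the last row of the matrix shows directly that all 50 class D photos end up in the C column.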