# Final Project Summary: Multi-Modal Muppet Character Classification

We processed video frames and aligned audio tracks to create a multi-modal dataset. We compared the performance of three classifiers (MLP, SVM, Random Forest) across three experimental setups: **Visual-Only**, **Audio-Only**, and **Combined**.

**Key Findings:**
1.  **Visual Dominance:** Visual features vastly outperform audio features in isolation.
2.  **Modality Imbalance:** Naively concatenating the features led the model to all but ignore the audio data (99.9% of the feature importance went to visual features).
3.  **Best Performance:** The MLP classifier on Combined features achieved the highest F1-Score ($\approx 0.50$), but significant confusion remains between visually similar characters.

![](img/f1-combined.png)

## 1. Modality Performance Analysis (ROC Curves)

We extracted the ROC curves from our three experimental notebooks to visualize the trade-off between True Positive Rate and False Positive Rate.

### Visual vs. Audio Baseline
| **Visual ROC** (from NB 1) | **Audio ROC** (from NB 2) |
|:---:|:---:|
| ![Visual ROC](img/video-roc.png) | ![Audio ROC](img/audio-roc.png) |
| *Visual features* | *Audio features* |

**Insight:** The visual model is the primary driver of performance. The audio model suffers from class imbalance: it frequently predicts the majority class ("OtherPigs"), which collapses the AUC for distinct characters such as Rowlf.

![Combined](img/combined-roc.png)
*Combined Features*

## 2. The Modality Imbalance Problem

![Feature Importance](img/feature-importance.png)

### Critical Observation
* **99.9% Visual Importance:** The model effectively ignored the audio data.
* **The Cause:** We concatenated a large visual vector (~8,200 dimensions from HOG) with a small audio vector (41 dimensions), so audio makes up only about 0.5% of the combined features.
* **The Result:** The tree classifiers favored visual splits, most likely because far more visual candidates were available at each split, and treated audio as noise. This suggests that **dimensionality reduction** (e.g., PCA) is needed before fusion to balance the modalities.
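
The per-modality split reported above can be reproduced by summing the per-feature importances over each block of the concatenated vector. The sketch below is illustrative only: the function name `importance_by_modality` is hypothetical, and it assumes the visual features occupy the first `n_visual` columns and that `importances` is a one-dimensional array of non-negative scores (for example, the `feature_importances_` array of a fitted scikit-learn random forest).

```python
import numpy as np

def importance_by_modality(importances, n_visual):
    """Sum per-feature importances over the visual and audio blocks.

    Assumes the visual features occupy the first `n_visual` columns of the
    concatenated vector and the audio features fill the rest.
    """
    importances = np.asarray(importances, dtype=float)
    total = importances.sum()
    visual = importances[:n_visual].sum() / total
    audio = importances[n_visual:].sum() / total
    return {"visual": visual, "audio": audio}
```

Tracking this split before and after any change to the fusion step gives a quick check on whether the audio block is actually being used.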

## 3. Critical Reflection and Future Improvements

Based on our error analysis (confusion matrices) and the feature importance ranking, we propose three changes that we expect to improve performance.

### 1. Fix the "Chef vs. Pig" Confusion (Visual)
* **Problem:** Our visual model (MLP) confused the *Swedish Chef* with *Miss Piggy* ~2,000 times.
* **Root Cause:** We converted frames to **grayscale** for HOG/LBP extraction. Since both characters share similar round shapes and "bright" skin tones, they became hard to tell apart in grayscale.
* **Proposed Fix:** Add an **RGB color histogram** feature. The pink of the pig versus the white/orange of the chef would likely resolve much of this confusion.
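
A minimal sketch of such a color feature is shown below. The function name `rgb_histogram` and the choice of 8 bins per channel are assumptions for illustration, not values from our notebooks. The frame is assumed to be an `(H, W, 3)` `uint8` array in RGB order; frames loaded with OpenCV arrive in BGR order and would need their channels reversed first.

```python
import numpy as np

def rgb_histogram(frame, bins=8):
    """Return a normalized per-channel color histogram of length 3 * bins.

    `frame` is assumed to be an (H, W, 3) uint8 array in RGB order.
    """
    frame = np.asarray(frame)
    channels = []
    for c in range(3):
        # Count pixel intensities of one channel in `bins` equal-width bins.
        hist, _ = np.histogram(frame[..., c], bins=bins, range=(0, 256))
        # Normalize so the feature does not depend on the frame size.
        channels.append(hist / hist.sum())
    return np.concatenate(channels)
```

The resulting 24-dimensional vector can be appended to the existing grayscale descriptors, since it carries exactly the color information that the grayscale conversion discards.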

### 2. Solve Modality Imbalance (Fusion)
* **Problem:** The 99.9% vs 0.1% importance split caused the Audio modality to be wasted.
* **Proposed Fix:** Apply **PCA** to the visual features *before* concatenation. Reducing the visual vector from ~8,200 to ~50 components (comparable to the audio size) would give both modalities a similar number of features, so the classifier could no longer favor visual features through sheer count.
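
The proposed fusion step could look roughly like the sketch below. It implements PCA directly with NumPy's SVD to stay self-contained; in practice `sklearn.decomposition.PCA` would be the usual choice. The helper names `fit_pca`, `apply_pca`, and `fuse` are hypothetical. The projection is fitted on the training split only and then reused for the test split, so no test information leaks into the components.

```python
import numpy as np

def fit_pca(X_train, n_components):
    """Fit PCA on training data; return the mean and the top components."""
    X_train = np.asarray(X_train, dtype=float)
    mean = X_train.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(X, mean, components):
    """Project samples onto the fitted components."""
    return (np.asarray(X, dtype=float) - mean) @ components.T

def fuse(visual, audio, mean, components):
    """Reduce the visual block, then concatenate it with the audio block."""
    return np.hstack([apply_pca(visual, mean, components), audio])
```

Matching the dimensionality does not by itself guarantee equal influence: the PCA components and the audio features may still differ in scale, so standardizing both blocks after fusion is probably worth testing as well.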

### 3. Address Audio Noise (Temporal)
* **Problem:** Audio classification on single video frames (1/25th of a second) is unstable.
* **Proposed Fix:** Instead of classifying individual frames, we should aggregate features over a **1-second window** (e.g., rolling mean of 25 frames). This would smooth out momentary silence or noise and capture the temporal context of the voice (phonemes/words).
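
The windowed aggregation can be sketched as a trailing rolling mean over the per-frame feature matrix. The function name `rolling_mean` is hypothetical, and the default of 25 frames assumes the 25 fps frame rate implied above. The first frames of a clip use a shorter window, so the output has one row per input frame.

```python
import numpy as np

def rolling_mean(features, window=25):
    """Trailing rolling mean over frames for an (n_frames, n_features) array.

    Frame t is averaged with up to `window - 1` preceding frames; the
    window is truncated at the start of the clip.
    """
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    # Prefix sums with a leading zero row: csum[i] is the sum of frames 0..i-1.
    csum = np.vstack([np.zeros((1, features.shape[1])),
                      np.cumsum(features, axis=0)])
    end = np.arange(1, n + 1)
    start = np.maximum(end - window, 0)
    return (csum[end] - csum[start]) / (end - start)[:, None]
```

The function should be applied per video or per scene, since a window that spans a cut would mix the audio of two different speakers.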