# Analysis of Emotion Prediction Models

This notebook loads the evaluation results from our model training script (`train_evaluate_models.py`), visualizes the performance metrics, and draws conclusions about which model is best suited for predicting valence and arousal from music features.

In [None]:
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the evaluation data
EVALUATION_FILE = '../results/model_evaluations.json'
with open(EVALUATION_FILE, 'r') as f:
    data = json.load(f)

# Convert the nested dictionary to a more usable DataFrame format
records = []
for model, metrics in data.items():
    for dimension, scores in metrics.items():
        if dimension in ['valence', 'arousal']:
            records.append({
                'model': model,
                'dimension': dimension,
                'r2_score': scores['r2_score'],
                'mse': scores['mse'],
                'mae': scores['mae']
            })

df = pd.DataFrame(records)
print("Model Evaluation Metrics:")
print(df)

## 1. R² Score Comparison

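The R² score (coefficient of determination) is the proportion of the variance in the target (here, valence or arousal) that the model explains from the input features. A score of 1 is a perfect fit, a score of 0 matches a model that always predicts the mean of the target, and negative scores are possible on held-out data when a model does worse than that baseline.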
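
To make the definition above concrete, here is a minimal sketch that computes R² from its two sums of squares. The helper name `r2_score_manual` and the sample values are illustrative assumptions, not part of `train_evaluate_models.py`.

```python
def r2_score_manual(y_true, y_pred):
    """Compute R² as 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Illustrative values only (not taken from the evaluation file)
y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_score_manual(y_true, y_true)                 # 1.0
mean_baseline = r2_score_manual(y_true, [2.5] * 4)        # 0.0
reversed_order = r2_score_manual(y_true, [4.0, 3.0, 2.0, 1.0])  # -3.0

print(perfect, mean_baseline, reversed_order)
```

The last case shows why a negative R² is not an error: predictions that are worse than the constant mean produce a residual sum larger than the total sum of squares.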

In [None]:
plt.style.use('seaborn-v0_8-whitegrid')
fig, ax = plt.subplots(1, 2, figsize=(16, 6), sharey=True)

# Valence R² Scores
sns.barplot(data=df[df['dimension'] == 'valence'], x='model', y='r2_score', ax=ax[0], palette='viridis')
ax[0].set_title('R² Scores for Valence Prediction')
ax[0].set_xlabel('Model')
ax[0].set_ylabel('R² Score')
ax[0].tick_params(axis='x', rotation=45)

# Arousal R² Scores
sns.barplot(data=df[df['dimension'] == 'arousal'], x='model', y='r2_score', ax=ax[1], palette='plasma')
ax[1].set_title('R² Scores for Arousal Prediction')
ax[1].set_xlabel('Model')
ax[1].set_ylabel('')
ax[1].tick_params(axis='x', rotation=45)  # match the rotation used on the valence panel

plt.suptitle('Model Performance: R² Score Comparison', fontsize=16)
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

## 2. Error Metrics Comparison (MSE & MAE)

Mean Squared Error (MSE) and Mean Absolute Error (MAE) both measure the average error between predicted and actual values. Lower values are better.

- **MAE**: The average absolute difference between the predicted and actual values, expressed in the same units as the target. It is less sensitive to outliers than MSE.
- **MSE**: The average of the squared differences, expressed in squared target units. Squaring penalizes large errors more heavily than small ones.
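
The difference between the two metrics is easiest to see on a small example. The sketch below uses made-up error patterns (not values from the evaluation file) in which the total absolute error is identical, but one pattern concentrates it in a single large miss.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative values only
y_true = [0.0, 0.0, 0.0, 0.0]
evenly_off = [1.0, 1.0, 1.0, 1.0]   # four errors of size 1
one_outlier = [0.0, 0.0, 0.0, 4.0]  # one error of size 4

print(mae(y_true, evenly_off), mse(y_true, evenly_off))    # 1.0 1.0
print(mae(y_true, one_outlier), mse(y_true, one_outlier))  # 1.0 4.0
```

MAE rates both prediction sets equally, while MSE is four times higher for the set with the single large miss.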

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(18, 12))

# Valence Error Metrics
sns.barplot(data=df[df['dimension'] == 'valence'], x='model', y='mse', ax=ax[0, 0], palette='coolwarm')
ax[0, 0].set_title('MSE for Valence Prediction')
ax[0, 0].set_xlabel('')
ax[0, 0].set_ylabel('MSE')
ax[0, 0].tick_params(axis='x', rotation=45)

sns.barplot(data=df[df['dimension'] == 'valence'], x='model', y='mae', ax=ax[0, 1], palette='coolwarm')
ax[0, 1].set_title('MAE for Valence Prediction')
ax[0, 1].set_xlabel('')
ax[0, 1].set_ylabel('MAE')
ax[0, 1].tick_params(axis='x', rotation=45)

# Arousal Error Metrics
sns.barplot(data=df[df['dimension'] == 'arousal'], x='model', y='mse', ax=ax[1, 0], palette='RdYlGn')
ax[1, 0].set_title('MSE for Arousal Prediction')
ax[1, 0].set_xlabel('Model')
ax[1, 0].set_ylabel('MSE')
ax[1, 0].tick_params(axis='x', rotation=45)

sns.barplot(data=df[df['dimension'] == 'arousal'], x='model', y='mae', ax=ax[1, 1], palette='RdYlGn')
ax[1, 1].set_title('MAE for Arousal Prediction')
ax[1, 1].set_xlabel('Model')
ax[1, 1].set_ylabel('MAE')
ax[1, 1].tick_params(axis='x', rotation=45)

plt.suptitle('Model Performance: Error Metrics Comparison', fontsize=16)
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

## 3. Conclusion

Based on the evaluation metrics:

- **Best Overall Performance (R² Score):** The **XGBoost** model consistently provides the highest R² scores for both valence and arousal, indicating it explains the most variance in the data. This makes it the most accurate model overall.

- **Error Rates:** XGBoost also shows the lowest MSE and MAE, reinforcing its position as the top performer.

- **Model Trade-offs:**
  - **Ridge** and **SVR** offer a reasonable balance between accuracy and training cost. Their performance is not far behind XGBoost, and they are typically faster to train. They make a good baseline and could suit applications where training time is a major constraint.
  - **MLP (Multi-layer Perceptron)** performed worst in this configuration. Its R² scores are substantially lower and its error rates are higher. This could be due to the relatively small dataset size or the need for more extensive hyperparameter tuning.
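
The rankings described above can be checked programmatically rather than read off the bar charts. The following is a minimal sketch, assuming the nested `{model: {dimension: {metric: value}}}` layout that the loading code iterates over. The function name `best_model_per_dimension` is hypothetical, and the `toy` dictionary uses placeholder model names and made-up numbers purely to show the shape of the output.

```python
def best_model_per_dimension(data, dimensions=('valence', 'arousal')):
    """For each dimension, find the model with the highest R² and the lowest MSE and MAE."""
    summary = {}
    for dim in dimensions:
        # Keep only the models that report scores for this dimension
        scores = {model: metrics[dim] for model, metrics in data.items() if dim in metrics}
        summary[dim] = {
            'best_r2': max(scores, key=lambda m: scores[m]['r2_score']),
            'best_mse': min(scores, key=lambda m: scores[m]['mse']),
            'best_mae': min(scores, key=lambda m: scores[m]['mae']),
        }
    return summary

# Placeholder data for illustration only; these are NOT real evaluation results
toy = {
    'model_a': {
        'valence': {'r2_score': 0.5, 'mse': 0.2, 'mae': 0.3},
        'arousal': {'r2_score': 0.6, 'mse': 0.1, 'mae': 0.2},
    },
    'model_b': {
        'valence': {'r2_score': 0.4, 'mse': 0.3, 'mae': 0.4},
        'arousal': {'r2_score': 0.7, 'mse': 0.05, 'mae': 0.15},
    },
}

summary = best_model_per_dimension(toy)
for dim, winners in summary.items():
    print(dim, winners)
```

In the notebook, calling `best_model_per_dimension(data)` on the loaded JSON would show whether a single model wins every metric for both dimensions, which is the condition the conclusion relies on.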

**Final Recommendation:** For the highest predictive accuracy in this task, **XGBoost is the recommended model**. If computational resources or training time are a concern, **Ridge** provides a viable and efficient alternative.