## **CS:345 Multilingual Emotion Detection in Social Media**

### **Team Members**
1. Abeeb Abdullahi Samuel
2. Maereg Habtezgi 

### **Introduction**

Social media platforms like Twitter generate vast volumes of short texts that convey a wide spectrum of emotions. This project aims to build an automated system for emotion detection in these posts by fine-tuning transformer-based language models—such as BERT-base-multilingual-cased and XLM-RoBERTa-base—on two Twitter-sourced datasets. We will evaluate which model–dataset combination best captures the subtle nuances of emotional expression, addresses challenges like data imbalance and linguistic variability, and can serve as the foundation for future emotion-classification tasks.

The two model families compared are:

- **BERT (Bidirectional Encoder Representations from Transformers)**
- **XLM-RoBERTa (Cross-lingual Language Model)**

The datasets used for training and evaluation are:

- [`TweetEval`](https://huggingface.co/datasets/cardiffnlp/tweet_eval): A benchmark suite of tasks built on English tweets.
- [`SuperTweetEval`](https://huggingface.co/datasets/cardiffnlp/super_tweeteval): An extended multilingual version supporting more complex evaluation, including multi-label emotion classification.
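
The single-label versus multi-label distinction changes how model outputs are turned into predictions. The sketch below illustrates the difference on raw logits; it is not taken from the notebooks, and the function names and the 0.5 threshold are assumptions chosen for illustration.

```python
import math


def single_label_decode(logits):
    """Single-label decoding: pick the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def multi_label_decode(logits, threshold=0.5):
    """Multi-label decoding: keep every index whose sigmoid probability
    reaches the threshold (0.5 is an assumed default, not a tuned value)."""
    return [
        i for i, z in enumerate(logits)
        if 1.0 / (1.0 + math.exp(-z)) >= threshold
    ]
```

With logits `[2.0, -1.0, 0.3, -3.0]`, the single-label decoder returns one class index, while the multi-label decoder can return several (or none), which is why the two settings need different loss functions and metrics.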


> **Note:** We have excluded the SEMEVAL-11 dataset from our experiments and are using only the two current Twitter-sourced datasets for the following reasons:
> 1. **Richer annotations:** The two chosen datasets provide more granular emotion categories, clearer label definitions, and additional metadata columns compared to SEMEVAL-11.  
> 2. **Availability:** SEMEVAL-11 is no longer hosted on the Hugging Face platform and cannot be accessed for fine-tuning.




Our aim is to adapt these models to accurately classify the range of emotions found in tweets, to evaluate their strengths and weaknesses during fine-tuning, and to determine which model performs best on which dataset.

---

### **Project Structure**

This project is divided into **three Jupyter notebooks** for clarity and modular experimentation:

#### 1. `BERT_Finetune.ipynb`

This notebook focuses on:

- Fine-tuning BERT-based models (including `bert-base-uncased`, `twitter-roberta-base`, and their variants) on both TweetEval and SuperTweetEval.
- Preprocessing steps like emoji normalization and tokenization tailored for social media content.
- Handling class imbalance and improving performance through parameter tuning.
- Tracking training/validation metrics and analyzing results across epochs.
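
The preprocessing step mentioned above could look roughly like the following. This is a minimal sketch rather than the notebook's actual code: the `@USER` and `HTTPURL` placeholders and the small `EMOJI_TO_TEXT` map are assumptions, and a real pipeline would likely use a fuller emoji lexicon.

```python
import re

# Hypothetical emoji-to-text map, kept tiny for illustration only.
EMOJI_TO_TEXT = {
    "\U0001F602": ":face_with_tears_of_joy:",
    "\U0001F621": ":pouting_face:",
    "\U0001F622": ":crying_face:",
}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")


def normalize_tweet(text: str) -> str:
    """Replace links, mentions, and known emojis with plain-text placeholders."""
    text = URL_RE.sub("HTTPURL", text)   # assumed placeholder token
    text = USER_RE.sub("@USER", text)    # assumed placeholder token
    for emoji_char, alias in EMOJI_TO_TEXT.items():
        text = text.replace(emoji_char, f" {alias} ")
    # Collapse the extra whitespace introduced by the substitutions.
    return " ".join(text.split())
```

Normalizing before tokenization keeps usernames and links from fragmenting into many rare subword tokens, and gives the model a textual handle on emojis that its vocabulary may not cover.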

#### 2. `XLM_Finetune.ipynb`

This notebook covers:

- Fine-tuning the multilingual `XLM-RoBERTa` model (and its variants) on both TweetEval and SuperTweetEval.
- Addressing challenges in cross-lingual emotion classification.
- Analyzing model performance on non-English tweets.
- Comparative metrics and evaluation of zero-shot or low-resource performance.

#### 3. `Conclusion.ipynb`

This notebook provides:

- A summary of findings from both BERT and XLM fine-tuning experiments.
- A comparative analysis of accuracy and F1 performance across both datasets.
- Key challenges, trade-offs, and recommendations for future work (e.g., better handling of multi-label classification, cross-lingual transfer learning, or low-resource language support).
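
Because the comparison relies on both weighted and macro F1, it helps to see how the two averages differ. The sketch below computes them from scratch for single-label predictions; the function names are illustrative, and the notebooks may well use a library metric instead.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's true support."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)
```

On imbalanced data the two can diverge sharply: a model that ignores a rare class is penalized heavily by macro F1 but only lightly by weighted F1, so the averaging mode should always be reported alongside the score.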


---

## **Conclusion: Comparative Performance of BERT and XLM-RoBERTa for Emotion Detection**

This project investigated the effectiveness of two prominent transformer-based language models, BERT and XLM-RoBERTa, for multilingual emotion detection in social media. We fine-tuned various configurations of these models on two distinct Twitter-sourced datasets: TweetEval and SuperTweetEval. Our experiments revealed significant differences in performance depending on the model-dataset combination and the application of specific training strategies.

### **BERT Performance Comparison**

We fine-tuned both `bert-base-uncased` and `vinai/bertweet-base` on the TweetEval and SuperTweetEval datasets.

* **TweetEval:** `bert-base-uncased` achieved a test accuracy of 0.7966 and a weighted F1 score of 0.7917. However, by switching to the domain-specific backbone `vinai/bertweet-base` and employing advanced training techniques (revamped preprocessing, stronger regularization, ultra-low learning rate with cosine scheduler, gradient accumulation, and early stopping), we significantly improved the results to a **test accuracy of 0.8269 and a weighted F1 score of 0.8264**. This highlights the benefit of domain-aligned pretraining for this specific task.

* **SuperTweetEval:** Fine-tuning `vinai/bertweet-base` on SuperTweetEval initially yielded a low test accuracy of 0.0525 and a macro-F1 score of 0.5422. Exploratory data analysis revealed significant class imbalance. Switching to a class-weighted loss function raised **test accuracy to 0.2492** and **macro-F1 to 0.5574**. While the weighted loss improved the model's ability to classify less frequent emotions, overall performance on SuperTweetEval remained considerably lower than on TweetEval.
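
One common way to derive weights for a class-weighted loss is inverse class frequency. The sketch below uses the formula `n_samples / (n_classes * class_count)`; this is an assumption, since the exact weighting scheme used in the notebooks is not stated here.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}
```

In a PyTorch training loop, weights like these would typically be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss` for single-label tasks, or adapted into `pos_weight` for `torch.nn.BCEWithLogitsLoss` in a multi-label setup.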

**Key Takeaway for BERT:** For emotion classification on Twitter data similar to TweetEval, `vinai/bertweet-base` fine-tuned with appropriate regularization and optimization strategies outperforms `bert-base-uncased`. SuperTweetEval presents a more challenging task for BERT, primarily due to class imbalance, and requires more sophisticated techniques than simple loss weighting.
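
The cosine learning-rate schedule mentioned among the optimization strategies can be written in a few lines. This is a generic sketch of linear warmup followed by half-cosine decay; the function name, the warmup parameter, and any values passed to it are assumptions, not the settings used in the notebooks.

```python
import math


def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # clamp in case step exceeds total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

When gradient accumulation is used, `step` should count optimizer updates rather than individual mini-batches, since the effective batch size becomes the per-device batch size multiplied by the number of accumulation steps.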

### **XLM-RoBERTa Performance Comparison**

We fine-tuned `xlm-roberta-base` on both TweetEval and SuperTweetEval.

* **TweetEval:** Initial fine-tuning for 5 epochs resulted in a **test accuracy of 0.8086 and a test F1 score of 0.7822**. Further hyperparameter tuning and extending the training to 8 epochs led to signs of overfitting, with the test accuracy decreasing to 0.7987 and the test F1 score to 0.7683. Therefore, the model trained for 5 epochs exhibited better generalization.

* **SuperTweetEval:** Initial fine-tuning of `xlm-roberta-base` on SuperTweetEval achieved a **test accuracy of 0.8272 and a weighted F1 score of 0.8306**. Recognizing a class imbalance between 'Anger' and 'Anticipation', we implemented a class-weighted loss function. Despite observing signs of overfitting (training loss decreased sharply while validation loss increased), this strategy led to improved generalization on the test set, resulting in a **test accuracy of 0.8450 and a weighted F1 score of 0.8449**. This demonstrates the effectiveness of addressing class imbalance for XLM-RoBERTa on this dataset.

**Key Takeaway for XLM-RoBERTa:** XLM-RoBERTa demonstrated strong performance on both datasets. On TweetEval, careful monitoring of validation metrics to prevent overfitting was crucial. On SuperTweetEval, employing a class-weighted loss function effectively mitigated the impact of class imbalance and improved overall performance.
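
The overfitting pattern described above (validation loss rising while training loss keeps falling) is what patience-based early stopping is designed to catch. The sketch below shows the selection logic on a list of per-epoch validation losses; the function name and the patience value are illustrative assumptions.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once validation loss has failed to improve for
    `patience` consecutive epochs, or when the loss history runs out.
    """
    best_epoch, best_loss, bad_epochs = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss, bad_epochs = epoch, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)
```

The checkpoint from `best_epoch` is the one to evaluate on the test set, so extra epochs past the validation minimum cost compute but do not degrade the reported result.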

### **Overall Best Results**

Comparing the best performing models for both BERT and XLM-RoBERTa across the two datasets:

| Model                 | Dataset        | Test Accuracy | F1 Score              |
| --------------------- | -------------- | ------------- | --------------------- |
| `vinai/bertweet-base` | TweetEval      | **0.8269**    | **0.8264** (weighted) |
| `xlm-roberta-base`    | SuperTweetEval | **0.8450**    | **0.8449** (weighted) |

**Overall Conclusion:** The best performance was achieved by **`xlm-roberta-base` on the SuperTweetEval dataset, with a test accuracy of 0.8450 and a weighted F1 score of 0.8449**. This suggests that for the specific emotion classification task and data characteristics of SuperTweetEval, the multilingual capabilities and the successful mitigation of class imbalance through weighted loss allowed XLM-RoBERTa to outperform the best BERT configuration.

However, `vinai/bertweet-base` also demonstrated strong capabilities on the TweetEval dataset, achieving comparable results. The choice of the optimal model depends on the specific dataset and the importance of handling class imbalance. For future work, further investigation into techniques like multi-label calibration, focal loss, data augmentation, and hierarchical modeling could potentially enhance the performance on the more challenging SuperTweetEval dataset for both model architectures.