# Model Details and Performance Overview

## Introduction

This notebook gathers detailed information about several models and their benchmarks. It provides an overview of their training data, performance, limitations, and use cases.

### **1. Model and Dataset Overview**

#### **Model: distilbert-base-multilingual-cased-sentiments-student**

**Description:**
The `distilbert-base-multilingual-cased-sentiments-student` model is a DistilBERT student distilled from a larger multilingual teacher model and optimized for sentiment analysis. It was trained by zero-shot distillation: the teacher's zero-shot classification predictions serve as the training targets, and the resulting student handles sentiment analysis across 12 languages.

**Key Details:**
- **Model Type:** DistilBERT for Multilingual Sentiment Analysis
- **Languages Supported:** 12 languages
- **Model Size:** 135M parameters
- **Frameworks:** Transformers, PyTorch
- **License:** Apache 2.0

**Training Details:**
- **Teacher Model:** MoritzLaurer/mDeBERTa-v3-base-mnli-xnli
- **Student Model:** distilbert-base-multilingual-cased
- **Training Batch Size:** 16
- **Epochs:** 1
- **Training Time:** ~33 minutes
- **Result:** Agreement of student and teacher predictions: 88.29%
- **Training Loss:** 0.647
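
The agreement figure above can be read as the share of examples on which the student predicts the same label as the teacher. The sketch below illustrates that reading on toy data; the function name and the sample labels are hypothetical, and the exact way the reported 88.29% was computed is an assumption.

```python
def agreement_rate(teacher_labels, student_labels):
    """Fraction of examples where student and teacher predict the same label."""
    if len(teacher_labels) != len(student_labels):
        raise ValueError("label lists must have the same length")
    if not teacher_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(t == s for t, s in zip(teacher_labels, student_labels))
    return matches / len(teacher_labels)


# Hypothetical predictions for four texts (not taken from the real model).
teacher = ["positive", "negative", "neutral", "positive"]
student = ["positive", "negative", "positive", "positive"]

rate = agreement_rate(teacher, student)
print(f"Agreement: {rate:.2%}")
```

On this reading, the metric measures how faithfully the student imitates the teacher, not how often either model matches human annotations.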

**Framework Versions:**
- **Transformers:** 4.28.1
- **PyTorch:** 2.0.0+cu118
- **Datasets:** 2.11.0
- **Tokenizers:** 0.13.3

**Downloads Last Month:** 2,321,855

**Hugging Face Model Page:** [distilbert-base-multilingual-cased-sentiments-student](https://huggingface.co/lxyuan/distilbert-base-multilingual-cased-sentiments-student)
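
A minimal usage sketch, assuming the `transformers` library is installed and the model can be downloaded from the Hugging Face Hub. The helper names `top_label` and `classify` are hypothetical; the label strings returned are whatever the model's own configuration defines.

```python
MODEL_ID = "lxyuan/distilbert-base-multilingual-cased-sentiments-student"


def top_label(scores):
    """Return the highest-scoring {'label', 'score'} dict from one result."""
    return max(scores, key=lambda item: item["score"])


def classify(texts):
    """Return the top sentiment prediction for each text.

    Downloads the model on first use, so this needs network access.
    """
    from transformers import pipeline  # imported lazily: heavy dependency

    # top_k=None asks the pipeline for scores on every label, not just the best.
    classifier = pipeline("text-classification", model=MODEL_ID, top_k=None)
    return [top_label(scores) for scores in classifier(list(texts))]
```

Calling `classify(["I love this movie", "Ich hasse diesen Film"])` would return one label/score dictionary per input text.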

---

#### **Dataset: multilingual-sentiments**

**Creator:** tyqiangz

**Description:**
The `multilingual-sentiments` dataset is designed for sentiment analysis across multiple languages. It includes text samples labeled with sentiments in three classes: positive, neutral, and negative.

**Languages Covered:**
- German
- English
- Spanish
- Malay
- Indonesian
- 9 additional languages

**Size:**
- **Number of Rows:** 591,442
- **File Size (Parquet):** 79.5 MB
- **Total Size of Downloaded Files:** 141 MB

**License:** Apache 2.0

**Tasks:**
- Text Classification
- Sentiment Analysis

**Libraries Used:**
- Datasets

**Available Splits:**
- **Train:** 270,000 rows
- **Test:** 296,000 rows

**GitHub Repository:** [multilingual-sentiment-datasets](https://github.com/tyqiangz/multilingual-sentiment-datasets)

**Downloads Last Month:** 1,158

**Note:** This dataset is well suited to training multilingual sentiment models such as `distilbert-base-multilingual-cased-sentiments-student`, since it provides labeled text in many languages under a single three-class scheme.

---

### **2. Model and Dataset Overview**

#### **Model: roberta-base-go_emotions**

**Description:**
The `roberta-base-go_emotions` model is a RoBERTa-based transformer model fine-tuned on the GoEmotions dataset for multi-label emotion classification. It can predict multiple emotional labels for a given input text, with each text potentially being associated with several emotions simultaneously.

**Key Details:**
- **Model Type:** RoBERTa for Multi-Label Emotion Classification
- **Languages Supported:** English
- **Model Size:** 125M parameters
- **Frameworks:** Transformers, PyTorch, Safetensors
- **License:** MIT
- **Inference Endpoints:** Available

**Training Details:**
- **Dataset:** GoEmotions
- **Batch Size:** 16
- **Epochs:** 3
- **Learning Rate:** 2e-5
- **Weight Decay:** 0.01
- **Training Method:** `AutoModelForSequenceClassification.from_pretrained` with `problem_type="multi_label_classification"`
- **Metrics (fixed 0.5 threshold):**
  - **Accuracy:** 0.474
  - **Precision:** 0.575
  - **Recall:** 0.396
  - **F1 Score:** 0.450
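
With `problem_type="multi_label_classification"`, each label is scored independently: a sigmoid is applied to every logit, and each label whose probability clears a threshold is predicted. The sketch below shows that decision rule in plain Python; the logits are hypothetical, and only three of the dataset's labels are used for brevity.

```python
import math


def sigmoid(x):
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, label_names, threshold=0.5):
    """Return every label whose sigmoid probability reaches the threshold."""
    probs = [sigmoid(z) for z in logits]
    return [name for name, p in zip(label_names, probs) if p >= threshold]


labels = ["admiration", "amusement", "anger"]
logits = [2.0, 0.3, -1.5]  # hypothetical model outputs for one text

print(predict_labels(logits, labels))
```

Unlike softmax, the probabilities do not need to sum to one, so a text can receive several labels or none at all.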

**Metrics by Label:**
- **Admiration:** Accuracy 0.946, Precision 0.725, Recall 0.675, F1 0.699
- **Amusement:** Accuracy 0.982, Precision 0.790, Recall 0.871, F1 0.829
- **Anger:** Accuracy 0.970, Precision 0.652, Recall 0.379, F1 0.479
- ... (additional labels)
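
Each per-label F1 score is the harmonic mean of that label's precision and recall. As a quick consistency check, the sketch below recomputes F1 from the precision and recall values listed above and compares the result with the reported figure; the function name is hypothetical.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the per-label list above.
reported = {
    "admiration": (0.725, 0.675, 0.699),
    "amusement": (0.790, 0.871, 0.829),
    "anger": (0.652, 0.379, 0.479),
}

for label, (precision, recall, f1) in reported.items():
    computed = round(f1_from_precision_recall(precision, recall), 3)
    print(f"{label}: computed {computed}, reported {f1}")
```

For these three labels the recomputed values agree with the reported ones to three decimal places.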

**ONNX Version:**
- Available at [roberta-base-go_emotions-onnx](https://huggingface.co/SamLowe/roberta-base-go_emotions-onnx)
- Includes INT8 quantized version for faster inference and reduced file size.

**Optimized Metrics (threshold tuned per label for best F1):**
- **Overall Precision:** 0.572
- **Overall Recall:** 0.677
- **Overall F1 Score:** 0.611
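
One way to obtain metrics like these is to replace the single 0.5 cut-off with a threshold chosen separately for each label. The sketch below sweeps a few candidate thresholds for one label and keeps the one with the highest F1. It assumes that "optimized" refers to this kind of per-label tuning; the scores, labels, and candidate grid are hypothetical.

```python
def f1_score(y_true, y_pred):
    """Binary F1 from parallel lists of true and predicted flags."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, scores, candidates):
    """Return (threshold, f1) for the candidate that maximizes F1."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        preds = [s >= t for s in scores]
        f1 = f1_score(y_true, preds)
        if f1 > best_f1:  # ties keep the earlier candidate
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy data for a single label (hypothetical, not from the real model).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.3, 0.35, 0.1, 0.05]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5]

threshold, f1 = best_threshold(y_true, scores, candidates)
print(f"best threshold {threshold}, F1 {f1:.3f}")
```

In this toy case a threshold below 0.5 trades a little precision for much higher recall, which is the same pattern visible in the optimized figures above. Thresholds tuned this way should be selected on held-out data rather than on the set used to report the final scores.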

**Downloads Last Month:** 2,844,128

**Hugging Face Model Page:** [roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions)

---

#### **Dataset: go_emotions**

**Creator:** google-research-datasets

##### Dataset Summary
The **GoEmotions** dataset contains 58,000 meticulously curated Reddit comments labeled across 27 distinct emotion categories, plus a neutral class. It includes both the raw data and a simplified version with predefined train, validation, and test splits.

##### Supported Tasks and Leaderboards
- **Tasks**: Multi-class classification, Multi-label classification
- **Languages**: English

##### Dataset Structure
- **Data Instances**: Each instance is a Reddit comment with a unique ID and one or more emotion annotations.
- **Data Fields**:
  - **Simplified Data**:
    - `text`: The Reddit comment
    - `labels`: Emotion annotations
    - `comment_id`: Unique identifier for the comment
  - **Raw Data** (additional fields):
    - `author`: Reddit username of the comment’s author
    - `subreddit`: Subreddit where the comment was posted
    - `link_id`: Link ID of the comment
    - `parent_id`: Parent ID of the comment
    - `created_utc`: Timestamp of the comment
    - `rater_id`: Unique ID of the annotator
    - `example_very_unclear`: Whether the annotator marked the example as very unclear or difficult to label

##### Data Splits
- **Simplified Data**:
  - **Train**: 43,410 examples
  - **Validation**: 5,426 examples
  - **Test**: 5,427 examples
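
The split sizes above can be checked with a few lines of arithmetic. The sketch below totals the simplified splits and reports each split's share; the variable names are hypothetical.

```python
# Split sizes of the simplified data, as listed above.
splits = {"train": 43_410, "validation": 5_426, "test": 5_427}

total = sum(splits.values())
shares = {name: size / total for name, size in splits.items()}

print(f"total examples: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The shares work out to roughly 80% train, 10% validation, and 10% test.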

##### Dataset Creation
- **Curation Rationale**: The dataset is designed to support the development of systems that can understand and categorize emotional content in text, with applications ranging from empathetic chatbots to detecting harmful online behavior.
- **Source Data**: Collected from Reddit comments using various automated methods.
- **Annotators**: English-speaking crowdworkers from India.

##### Personal and Sensitive Information
- The dataset includes Reddit usernames, which could potentially be linked to real-world identities in some cases.

##### Additional Information
- **Dataset Curators**: Researchers from Amazon Alexa, Google Research, and Stanford.
- **Licensing**: Apache License 2.0
- **Citation**:
  ```bibtex
  @inproceedings{demszky2020goemotions,
    author = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
    booktitle = {58th Annual Meeting of the Association for Computational Linguistics (ACL)},
    title = {{GoEmotions: A Dataset of Fine-Grained Emotions}},
    year = {2020}
  }
  ```

- **Size**: 
  - **Downloaded Dataset Files**: 28.3 MB
  - **Auto-Converted Parquet Files**: 28.3 MB
  - **Number of Rows**: 265,488

- **Homepage**: [GitHub Repository](https://github.com/google-research/google-research/tree/master/goemotions)
- **Paper**: [Arxiv Paper](https://arxiv.org/abs/2005.00547)

---

### **3. Model Overview: GPT-3.5 Turbo**

**Model:** gpt-3.5-turbo-0125

**Description:**
This GPT-3.5 Turbo snapshot offers higher accuracy at responding in requested formats and fixes a text encoding bug that affected non-English function calls. It supports a 16,385-token context window and can return up to 4,096 tokens per response.

**Key Details:**
- **Context Window:** 16,385 tokens
- **Max Output Tokens:** 4,096 tokens
- **Training Data:** Up to September 2021
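
The two token limits interact: the sketch below assumes that prompt and response tokens share the same context window, so reserving room for the response shrinks the space left for the prompt. The function names are hypothetical, and token counts are taken as given (counting them requires a tokenizer, which is out of scope here).

```python
CONTEXT_WINDOW = 16_385     # total tokens, from the key details above
MAX_OUTPUT_TOKENS = 4_096   # largest response, from the key details above


def max_prompt_tokens(reserved_output=MAX_OUTPUT_TOKENS):
    """Largest prompt that still leaves `reserved_output` tokens for the reply."""
    if not 0 <= reserved_output <= MAX_OUTPUT_TOKENS:
        raise ValueError("reserved_output must be between 0 and MAX_OUTPUT_TOKENS")
    return CONTEXT_WINDOW - reserved_output


def fits(prompt_tokens, reserved_output=MAX_OUTPUT_TOKENS):
    """True if the prompt plus the reserved reply stays within the window."""
    return prompt_tokens <= max_prompt_tokens(reserved_output)


print(max_prompt_tokens())      # room for the prompt with a full-length reply
print(max_prompt_tokens(500))   # more room when a short reply is enough
```

Reserving fewer output tokens for short labeling-style replies leaves more of the window for instructions and examples.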

![Table 1: List of datasets used for labeling in this report](https://cdn.prod.website-files.com/6423879a8f63c1bb18d74bfa/649a74912e87afa5d7e719a8_ZrysACtP6pbLupaeQJvs5HdCkPb8OIZXS9PgSvHkcdB6HNbwS0FJCieAtjb6et7iwmLghv8LuTiAJnAwt4mD9Aj9YvXX3tfkRL1WLBsxTbpR4cmolIe8vDze8aY6THaqwF4nX1hUtQYWygykhtHYR4Y.png)
![Table 2:  Comparing the new OpenAI model versions to the current versions on label quality (% agreement with ground truth labels) across a variety of NLP tasks. Cells with decrease in quality (% change < 0) are in red. Cells with increase in quality (% change > 0) are in green](https://cdn.prod.website-files.com/6423879a8f63c1bb18d74bfa/649a74f2c5db078d30ad48ab_Pem2nkb-VV7tTPamPFmdSR839pFduFHrAfVNntkGCuzZb84u24t8jDUTG4Uy5RY-oPeuR2DQecuv2Z8aY6Xo1E7kzLjFwxoneckwAKJsL7hjYn40nFEUVLr1pKJz0A-F6TkJz4osg41rIzLcFGCW8zM.png)

For more information, visit the [OpenAI GPT-3.5 Turbo Documentation](https://platform.openai.com/docs/models/gpt-3-5-turbo).


---

© 2024 Erfan Shafiee Moghaddam