# Description

🌟 Objective: Compare several text classification techniques on the dataset linked below.

  Dataset link: https://huggingface.co/datasets/dair-ai/emotion?library=datasets
 
 📋 Tasks:
   1. **Baseline Implementation**: 
      - Train a TF-IDF + logistic regression classifier with 5%, 10%, 25%, 50%, and 100% of the data.
      - Plot the learning curve (F1/precision/recall) as a function of the training data volume.
   2. **BERT Implementation**: 
      - Train a BERT classifier with 5%, 10%, 25%, 50%, and 100% of the data.
      - Add its learning curve (F1/precision/recall) to the previous plot.
   3. **BERT Model with Limited Data**: 
      - Train a SetFit model using only 32 labeled examples and assess its performance.
      - Add its score as a horizontal line on the previous plot.
   4. **Zero Shot Technique**: 
      - Apply a large language model (LLM) such as ChatGPT in a zero-shot setup.
      - If you can't apply it to all the data, a small sample is enough.
      - Add its score as a horizontal line on the previous plot.
   5. **Generate New Data from Scratch**: 
      - Use the LLM to generate 10, 50, and 100 samples for each class.
      - Recreate the learning curve with the generated data and add the results to the previous plot.
   6. **Bonus Question**: 
      - Examine some differences in what the models have learned.
 
 🔗 Helpful Link:
 This link might be useful for interacting with the LLM: [Native JSON Output from GPT-4](https://yonom.substack.com/p/native-json-output-from-gpt-4). It explains how to ask the model to return its answer as JSON, which is easier to parse and organize.
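
As a rough illustration of why JSON replies are easier to organize, the sketch below parses a reply string and maps it to one of the six emotion labels of the dataset. The reply shape (a JSON object with a `label` key) and the helper name `parse_emotion_reply` are assumptions made for this example, not something the linked article or this notebook prescribes. No API call is made.

```python
import json

# Label names of dair-ai/emotion, listed in label-id order (0 to 5).
EMOTIONS = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def parse_emotion_reply(reply: str):
    """Return the label id for a JSON reply such as '{"label": "joy"}'.

    Hypothetical helper: returns None when the reply is not valid JSON,
    is not a JSON object, lacks a "label" key, or names an unknown emotion.
    """
    try:
        payload = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    name = str(payload.get("label", "")).strip().lower()
    return EMOTIONS.index(name) if name in EMOTIONS else None
```

Returning `None` for unusable replies keeps them easy to count separately instead of silently scoring them as a wrong class.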

# Housekeeping

Let's start by loading all the necessary libraries and the dataset.

In [None]:
import plotly.graph_objects as go
from src.utils import Utils

In [None]:
from src.data import Data
Data().ds

In [None]:
data = Data().ds
# Print a few samples from the training set
print("Training set samples:\n")
for i, sample in enumerate(data['train'].select(range(5))):
    print(f"Sample {i+1}:")
    print(f"Text: {sample['text']}")
    print(f"Label: {sample['label']}")
    print()

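# Print the unique labels across all splits.
# Iterating over each label column works whether it is returned as a list or as a column object.
unique_labels = {label for split in ('train', 'validation', 'test') for label in data[split]['label']}
print("Unique labels:", unique_labels)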

# Visualisation

In [None]:
Data().generate_word_clouds()

In [None]:
Data().plot_label_distribution()

# Task 1: Baseline Implementation

In [None]:
from src.baseline import Baseline

baseline = Baseline(
    data=Data(),
    seed=100,
    test_size=0.20,
    num_percentiles=20,
)

baseline.results.head()
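
The subsampling itself happens inside `src.baseline` and is not shown in this notebook. As a minimal sketch of one way to draw the 5%, 10%, 25%, 50%, and 100% subsets, the snippet below samples each class separately so that small fractions keep every label. The function name `stratified_fraction` and the stratified strategy are assumptions for illustration, not necessarily what `Baseline` does.

```python
import random
from collections import defaultdict


def stratified_fraction(labels, fraction, seed=100):
    """Return sorted indices covering roughly `fraction` of each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        # Keep at least one example per class so no label disappears.
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy labels standing in for the training split.
labels = [0] * 40 + [1] * 40 + [2] * 20
for fraction in (0.05, 0.10, 0.25, 0.50, 1.00):
    subset = stratified_fraction(labels, fraction)
    print(f"{fraction:.0%}: {len(subset)} examples")
```

With a Hugging Face dataset, the returned indices could be passed to `select`, as the housekeeping cell already does with `range(5)`.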

In [None]:

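# Utils.plot_metrics returns the figure that is customised below,
# so creating an empty go.Figure() first is unnecessary.
fig = Utils.plot_metrics(baseline.results)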
fig.update_layout(
    title='Learning Curve: TF-IDF + Logistic Regression',
    xaxis_title='Percentage of Training Data',
    yaxis_title='Score',
    legend_title='Metric',
    template='plotly_white'
)

fig.show()
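
The curve reports F1, precision, and recall, but with six classes these need to be averaged, and the averaging used in `src.baseline` is not shown here. The sketch below assumes macro averaging (each class weighted equally) and computes the three scores in plain Python. The function name `macro_scores` is hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1) over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0, 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Scores default to 0.0 when their denominator is zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Macro averaging treats rare classes the same as frequent ones, so on an imbalanced label distribution it can differ noticeably from a weighted or micro average.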

# Task 2: BERT Implementation

In [None]:
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

In [None]:
from src.bert import BERT

bert = BERT(Data())
bert.results.head()


In [None]:

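# Utils.plot_metrics returns the figure that is customised below,
# so creating an empty go.Figure() first is unnecessary.
fig = Utils.plot_metrics(bert.results)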
fig.update_layout(
    title='Learning Curve: BERT',
    xaxis_title='Percentage of Training Data',
    yaxis_title='Score',
    legend_title='Metric',
    template='plotly_white'
)

fig.show()