# Artificial Intelligence
# 464
# Homework #6

## Before You Begin...
00. We're using a Jupyter Notebook environment (tutorial available here: https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html),
01. Read the entire notebook before beginning your work, and
02.  Check the submission deadline on Gradescope.


## General Directions for this Assignment
00. Output format should be exactly as requested (it is your responsibility to make sure the notebook looks as expected on Gradescope), and
01. Functions should do only one thing.


## Before You Submit...
00. Re-read the general instructions provided above, and
01. Hit "Kernel"->"Restart & Run All". The first cell that is run should show [1], the second should show [2], and so on...
02. Submit your notebook (as .ipynb, not PDF) using Gradescope, and
03.  Do not submit any other files.

## Language Modeling

This homework will require you to load and train models.  If you choose small models and datasets, you should be able to run this locally on your computer. However, larger models/datasets may require GPU access. You can access one GPU for free on [Google Colab](https://colab.research.google.com/).

We will use HuggingFace libraries in this quiz. We discussed the majority of what you will need during the discussion demo. Additional documentation can be found [here](https://huggingface.co/docs).

In [1]:
# Imports
import torch
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import pipeline
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import os
os.environ["WANDB_DISABLED"] = "true"



## Problem 0: Data
From the [HuggingFace Datasets](https://huggingface.co/datasets), choose a dataset that satisfies the following criteria:
- Data must have train and test splits (Optional development set)
- Task must be text classification
- Task must have at least 3 labels


In [2]:
ds_name = "dair-ai/emotion"
# Load the data here
ds = load_dataset(ds_name, "split")

**Describe the data.**
What is the utility of the task? What are the inputs? What are the labels? Are there any potential difficulties you expect from the task? How do you evaluate the performance of this task?

In this dataset, the inputs are Twitter messages, each labeled with one of "six basic emotions: anger, fear, joy, love, sadness, and surprise." I expect confusion to arise from the subjective nature of emotion and from the difficulty of determining the intent behind a phrase. Devices like sarcasm can give a phrase a very different meaning from its literal interpretation, which will likely make classification harder. To evaluate performance, we would measure classification accuracy and also the frequency of specific misclassifications (e.g. how often anger is predicted as sadness).
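As a minimal sketch of the misclassification counting described above, the snippet below tallies (true, predicted) label pairs and lists the most frequent errors. The helper names `confusion_counts` and `most_common_errors` are hypothetical, and the label ids are toy values invented for illustration, not model output.

```python
from collections import Counter

# Label order matching the `mapping` list used later in this notebook.
LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]

def confusion_counts(true_ids, pred_ids):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_ids, pred_ids))

def most_common_errors(true_ids, pred_ids, top_k=3):
    """Return the top_k most frequent misclassifications as (true, predicted, count)."""
    counts = confusion_counts(true_ids, pred_ids)
    errors = [(LABELS[t], LABELS[p], n) for (t, p), n in counts.items() if t != p]
    return sorted(errors, key=lambda e: -e[2])[:top_k]

# Toy label ids, invented for illustration only (not model output).
true_ids = [3, 3, 3, 0, 1, 4, 4]
pred_ids = [0, 0, 3, 0, 1, 0, 4]
print(most_common_errors(true_ids, pred_ids))
# [('anger', 'sadness', 2), ('fear', 'sadness', 1)]
```

The same function could be applied to real predicted and true label arrays once a model has been evaluated.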

**Research current methods using this dataset.**
What is the current state of the art method? Describe the method, including the type of model used, training protocol (if any), and the performance. Cite your sources.

The current state-of-the-art approach is the BERT transformer model, which seems to outperform most other models thanks to its enhanced ability to detect subtle emotional cues (Shah et al). Many researchers also study how it can be integrated with other methods such as CNNs to enhance the effectiveness of both (Abas et al; Bhardwaj and Abulaish). Training these models requires large text datasets labeled with standard emotions, which can be difficult to find. The model is made up of transformer encoder layers, each with several attention heads; for every token in the input sequence, each head computes part of the vector representation. The outputs of the heads are combined, and the layer is then normalized (Abas et al).

Citations:
- Bhardwaj and Abulaish: https://www.sciencedirect.com/science/article/pii/S2666827025000763
- Khemani et al: https://pmc.ncbi.nlm.nih.gov/articles/PMC12148580/
- Shah et al: https://www.science-gate.com/IJAAS/2025/V12I7/1021833ijaas202507006.html
- Abas et al: https://www.sciencedirect.com/org/science/article/pii/S1546221821001314

(Optional) If necessary, perform any data preprocessing here. For example, depending on the dataset you choose, you may need to clean the text or split the training set into a train and validation set.
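If the chosen dataset does not already ship a validation split, one could be carved out of the training set. The sketch below is a plain-Python illustration of a stratified split over label ids; the function name `stratified_split` and the toy labels are hypothetical. In practice the `datasets` library's `Dataset.train_test_split` method would likely be the more convenient route.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.1, seed=42):
    """Return (train_indices, val_indices), holding out val_fraction of each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Keep at least one validation example per label.
        n_val = max(1, round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels, invented for illustration: 10 examples of class 0, 10 of class 1.
labels = [0] * 10 + [1] * 10
train_idx, val_idx = stratified_split(labels, val_fraction=0.2)
print(len(train_idx), len(val_idx))  # 16 4
```

Splitting per label keeps rare emotions represented in both parts, which a purely random split does not guarantee.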

In [3]:
# TODO: Dataset preprocessing (Optional)

## Problem 1: encoder-only models or decoder-only models
## Option A: encoder-only models
Choose an encoder-only model (e.g. BERT). Load the model and add a classification layer.

Describe the model you choose. What are the unique properties of this model? What are the pros and cons? Cite your sources.

The bert-base-uncased model is a transformer model pretrained on a large quantity of English text. "Uncased" means that it does not distinguish between uppercase and lowercase letters. It was pretrained with two objectives. The first is masked language modeling (MLM), where 15% of the input tokens are randomly masked and the model must predict the missing tokens. This helps the model learn to understand and leverage the context surrounding a word. The second is next sentence prediction (NSP), where two masked sentences are concatenated and the model must predict whether or not they actually followed each other in the original text. The model is well suited to fine-tuning on tasks such as classification, but for full text generation there are better choices. The fine-tuning code below loads distilbert/distilbert-base-uncased, a smaller distilled version of this model.
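To make the masking step concrete, here is a simplified sketch of how roughly 15% of tokens could be hidden and recorded as prediction targets. It is an illustration under stated assumptions, not BERT's actual preprocessing: it splits on whitespace instead of using WordPiece, and it always substitutes `[MASK]`, whereas BERT sometimes keeps the original token or swaps in a random one. The name `mask_tokens` is hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets); targets maps masked positions to original tokens."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for pos, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide the token from the model
            targets[pos] = tok         # remember what it must predict
        else:
            masked.append(tok)
    return masked, targets

tokens = "i feel a little mellow today".split()
masked, targets = mask_tokens(tokens, mask_prob=0.15)
print(masked)
print(targets)
```

During pretraining the model sees `masked` as input and is scored only on the positions stored in `targets`.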

Source: https://huggingface.co/google-bert/bert-base-uncased

Finetune the model on your dataset. Report the performance on the test set.

In [4]:
# source: https://huggingface.co/docs/transformers/en/training
# select model
model_name = "distilbert/distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
# tokenize data; truncation guards against inputs longer than the model's maximum length
def tokenize_dataset(batch):
    return tokenizer(batch["text"], truncation=True)
tokenized_ds = ds.map(tokenize_dataset, batched=True)

In [6]:
# define model evaluation
# load the metric once rather than on every evaluation call
metric = evaluate.load("accuracy")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
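Accuracy alone can hide weak performance on rare labels, so a macro-averaged F1 score may be a useful companion metric. The sketch below computes it in plain Python to show what the number means; the function name `macro_f1` and the toy labels are hypothetical. With the `evaluate` library, loading the `"f1"` metric and passing `average="macro"` to `compute` should give a comparable result.

```python
def macro_f1(true_ids, pred_ids, num_labels):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for c in range(num_labels):
        tp = sum(1 for t, p in zip(true_ids, pred_ids) if t == c and p == c)
        fp = sum(1 for t, p in zip(true_ids, pred_ids) if t != c and p == c)
        fn = sum(1 for t, p in zip(true_ids, pred_ids) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_labels

# Toy example, invented for illustration: a classifier that always predicts class 0.
true_ids = [0, 0, 0, 1]
pred_ids = [0, 0, 0, 0]
accuracy = sum(t == p for t, p in zip(true_ids, pred_ids)) / len(true_ids)
print(accuracy)                                  # 0.75
print(macro_f1(true_ids, pred_ids, num_labels=2))
```

Here accuracy is 0.75 while macro F1 is 3/7 (about 0.43), because the second class is never predicted correctly.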

In [7]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [8]:
# define training args
training_args = TrainingArguments(
    report_to="none",
    output_dir="emotion-classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

In [9]:
# take small chunk of dataset for training and testing
train = tokenized_ds["train"].shuffle(seed=42).select(range(1000))
test = tokenized_ds["test"].shuffle(seed=42).select(range(1000))

# set up model trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=test,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# train model
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 1.5023 | 1.146636 | 0.609 |
| 2 | 1.0982 | 1.007736 | 0.666 |




TrainOutput(global_step=250, training_loss=1.3002310791015625, metrics={'train_runtime': 36.7976, 'train_samples_per_second': 54.351, 'train_steps_per_second': 6.794, 'total_flos': 21840050384352.0, 'train_loss': 1.3002310791015625, 'epoch': 2.0})

In [10]:
# get model evaluation metrics
predictions = trainer.predict(tokenized_ds["test"])



In [11]:
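# label ids in the order used by the dataset
mapping = ["sadness", "joy", "love", "anger", "fear", "surprise"]

def map_label(label):
    """Convert an integer label id to its emotion name."""
    return mapping[label]

def parse(response):
    """Return the first emotion in `mapping` order that appears in the response, or None."""
    response = response.lower()
    for emot in mapping:
        if emot in response:
            return emot
    return None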
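One thing to note about `parse` is that it checks emotions in `mapping` order, so a response that mentions several emotions resolves to whichever comes first in the list (sadness before joy, for example). A possible alternative, sketched below under the hypothetical name `parse_earliest`, picks the emotion that appears earliest in the response text instead. Whether this helps depends on how the model phrases its answers.

```python
EMOTIONS = ["sadness", "joy", "love", "anger", "fear", "surprise"]

def parse_earliest(response):
    """Return the emotion that appears first in the response text, or None."""
    text = response.lower()
    # Pair each emotion found with its position, then take the smallest position.
    hits = [(text.find(e), e) for e in EMOTIONS if e in text]
    return min(hits)[1] if hits else None

print(parse_earliest("I would say joy rather than sadness."))  # joy
print(parse_earliest("No clear emotion here."))                # None
```

Both versions still rely on substring matching, so a word like "lovely" would count as "love".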

In [12]:
# evaluate
print(predictions.metrics)
predicted_labels = np.argmax(predictions.predictions, axis=-1)
true_labels = predictions.label_ids

# Print a few example predictions
# for i in range(10):
#     print(f"text: {tokenized_ds['test'][i]['text']}, response: {map_label(predicted_labels[i])}, actual: {map_label(true_labels[i])}\n")

{'test_loss': 1.0157246589660645, 'test_accuracy': 0.664, 'test_runtime': 5.8649, 'test_samples_per_second': 341.013, 'test_steps_per_second': 42.627}


## Option B: decoder-only models
Choose a decoder-only model (e.g. GPT2). Describe the model you choose. What are the unique properties of this model? What are the pros and cons? Cite your sources.

The SmolLM2 model is a language model trained on large datasets and then refined with supervised fine-tuning. Though it is not as targeted at classification tasks as the previous model, its ability to take in any context and base its responses on it allows it to be used in a much wider range of scenarios. This generality does reduce the accuracy, but as we saw in class, we were able to make it act like an "evil devil" and a "spam detector," which are very different behaviors that can both be achieved through prompting.

Source: https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct

Load the model and use prompting for your task. You will likely need to write a helper function to parse the answer.

(Ex. “The answer is 1” -> 1). Report the performance on the test set.
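For the numeric case in the example above, a helper along these lines might be enough. This is a minimal sketch; the name `parse_numeric_answer` is hypothetical, and it assumes the first integer in the response is the intended label.

```python
import re

def parse_numeric_answer(response):
    """Return the first integer found in the response, or None if there is none."""
    match = re.search(r"\d+", response)
    return int(match.group()) if match else None

print(parse_numeric_answer("The answer is 1"))  # 1
print(parse_numeric_answer("I am not sure"))    # None
```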

In [13]:
from accelerate import Accelerator

device = Accelerator().device

In [14]:
chatbot = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct", device=device)

Device set to use mps


In [15]:
prompt = """This is a text classification task.
text: {text}
Is this text anger, fear, joy, love, sadness, or surprise:"""

def create_chat(text):
  return [
    {"role": "system", "content": """You are an emotion classifier agent that looks at a string of text 
      and outputs the emotion it expresses. The emotion must be one of the following 
      6 options: anger, fear, joy, love, sadness, and surprise."""},
    {"role": "user", "content": f"{prompt.format(text=text)}"},
  ]

In [16]:
success = 0
failure = 0
total = 100
# test a series of example text
for i in range(total):
  text = ds["test"][i]["text"]
  actual = ds["test"][i]["label"]

  # generate response and parse the predicted emotion once
  chat = create_chat(text)
  response = chatbot(chat, max_new_tokens=64)
  predicted = parse(response[0]["generated_text"][-1]["content"])
  # print(f"text: {text}, response: {predicted}, actual: {map_label(actual)}")

  # validate
  if map_label(actual) == predicted:
    success += 1
  else:
    failure += 1

print(f'success rate: {success / total}')
print(f'error rate: {failure / total}')

success rate: 0.5
error rate: 0.5


## Problem 2: Error Analysis

Conduct an error analysis on your models. What are your models good at? What do they get wrong? Provide examples of both correct and incorrect predictions. Suggest methods to improve the performance.

In [None]:
# See last cell in each section for full error reporting
# Encoder: 0.66 accuracy
# Decoder: 0.5 accuracy
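When comparing these two numbers it helps to keep the sample sizes in mind: the decoder was scored on only 100 test examples. The sketch below uses a normal approximation to estimate a 95% interval around each accuracy. The function name `accuracy_interval` is hypothetical, and the encoder test-set size of 2000 is an assumption inferred from the reported throughput and runtime, not stated in the notebook.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation 95% interval for an accuracy measured on n examples."""
    se = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se

# Encoder: 0.664 accuracy; n=2000 is an assumed test-set size.
enc_lo, enc_hi = accuracy_interval(0.664, 2000)
# Decoder: 0.5 accuracy on the 100 examples used in the prompting loop.
dec_lo, dec_hi = accuracy_interval(0.5, 100)
print(f"encoder: {enc_lo:.3f} to {enc_hi:.3f}")
print(f"decoder: {dec_lo:.3f} to {dec_hi:.3f}")
```

Under these assumptions the decoder interval spans roughly 0.40 to 0.60, far wider than the encoder's, so evaluating the decoder on more examples would make the comparison more reliable.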

The following two examples are from the DistilBERT encoder model, which achieved a passable test accuracy of 0.664.

text: i was feeling a little vain when i did this one, response: sadness, actual: sadness

text: i cant walk into a shop anywhere where i do not feel uncomfortable, response: sadness, actual: fear

This model performed better than the other, but many of the more subtle text fragments were misclassified. The second example is a good indicator of that: sadness is a fairly reasonable classification (the model was rarely far off base), and even humans would find it difficult to assign every one of these phrases to a single, relatively subjective emotional category. To improve this model, we could fine-tune on more than the 1,000 training examples used here, giving the model more examples of each emotion to learn from.

The following are two examples from the SmolLM2 decoder model, which at 0.5 accuracy was much less accurate than the encoder model.

text: i was feeling as heartbroken as im sure katniss was, response: sadness, actual: sadness

text: i feel a little mellow today, response: sadness, actual: joy

The SmolLM2 model had a very hard time understanding subtext, and many of the text fragments were a little vague. It often guessed sadness in place of other emotions as well. This particular model is small, so it could be helpful to try larger or more recent models. In addition, more rigorous prompting with more detailed specifications could also be useful. Interestingly, in some cases the model would output something like "this seems like a combination of two emotions, but I can only specify one," indicating its confusion/uncertainty.
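One way the more rigorous prompting suggested above might look is a few-shot chat, where the model sees worked examples before the text to classify. This is only a sketch: the name `create_few_shot_chat`, the system prompt wording, and the two example texts are hypothetical and written for illustration. Real examples would be drawn from the training split, never from the test split.

```python
SYSTEM_PROMPT = (
    "You are an emotion classifier. Reply with exactly one word from: "
    "anger, fear, joy, love, sadness, surprise."
)

# Hypothetical few-shot examples, invented for illustration only.
FEW_SHOT = [
    ("i am so happy with how today went", "joy"),
    ("i feel scared to walk home alone", "fear"),
]

def create_few_shot_chat(text, examples=FEW_SHOT):
    """Build a chat with worked examples placed before the text to classify."""
    chat = [{"role": "system", "content": SYSTEM_PROMPT}]
    for example_text, label in examples:
        chat.append({"role": "user", "content": f"text: {example_text}"})
        chat.append({"role": "assistant", "content": label})
    chat.append({"role": "user", "content": f"text: {text}"})
    return chat

chat = create_few_shot_chat("i feel a little mellow today")
print(len(chat))  # 6
```

The resulting list has the same role/content structure as the chat built earlier in this notebook, so it could be passed to the same text-generation pipeline. Asking for a one-word reply may also make the answer easier to parse.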

## OPTIONAL. BONUS. Problem 3: Improvements

Implement your suggestions for improving the performance. Describe your method and report the results on the test set.

In [18]:
# TODO

No other directions for this quiz, other than what's here and in the "General Directions" section. You have a lot of freedom with this quiz. Don't get carried away. It is expected the results may vary, being better or worse. Graders are not going to run your notebooks. The notebook will be read as a report on how different models were explored. Since you'll be using libraries, the emphasis will be on your ability to communicate your findings.
