# Emotion classification with Hugging Face Transformers

- Model: <a href="https://huggingface.co/distilbert/distilbert-base-uncased">distilbert/distilbert-base-uncased</a>
- Dataset: <a href="https://huggingface.co/datasets/dair-ai/emotion">dair-ai/emotion</a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Install Dependencies

In [10]:
# Quote the version specifier so the shell does not treat ">" as an output redirect
!pip install datasets torch "accelerate>=0.21.0" transformers==4.28 evaluate wandb

## Import libraries

In [11]:
import evaluate
import torch
import numpy as np
import wandb
from datasets import load_dataset
from transformers import (
    DataCollatorWithPadding,
    DistilBertConfig,
    DistilBertTokenizer,
    DistilBertForSequenceClassification,
    TrainingArguments,
    Trainer,
    pipeline,
)

In [12]:
wandb.login()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

True

## Prepare data

### Load Dataset

In [6]:
dataset_card = 'dair-ai/emotion'
dataset = load_dataset(dataset_card, trust_remote_code=True)
dataset

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

### Obtain the labels

In [8]:
labels = dataset["train"].features["label"].names
print(f"Labels: {labels}")
print(f"Total labels: {len(labels)}")

Labels: ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
Total labels: 6


### Load Model and Tokenizer

In [9]:
model_card = 'distilbert/distilbert-base-uncased'

model = DistilBertForSequenceClassification.from_pretrained(model_card, num_labels=len(labels))
tokenizer = DistilBertTokenizer.from_pretrained(model_card)

# Load the pre-trained model configuration
# config = DistilBertConfig.from_pretrained(model_card, num_labels=len(labels))

# Initialize the model with the new configuration
# model = DistilBertModel(config)

# Load the pre-trained model for sequence classification
# model = DistilBertForSequenceClassification.from_pretrained(model_card, num_labels=len(labels))



Some weights of the model checkpoint at distilbert/distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'pre_classifier

### Tokenize the data

In [13]:
def tokenize_fn(batch):
    return tokenizer(batch['text'], truncation=True)

In [14]:
# Tokenize every split of the dataset
dataset_tokenized = dataset.map(tokenize_fn, batched=True)
dataset_tokenized

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2000
    })
})

In [15]:
dataset_tokenized["train"][250]

{'text': 'i just feel terrified like im on the edge of a precipice staring ahead',
 'label': 4,
 'input_ids': [101,
  1045,
  2074,
  2514,
  10215,
  2066,
  10047,
  2006,
  1996,
  3341,
  1997,
  1037,
  3653,
  6895,
  24330,
  2063,
  4582,
  3805,
  102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

### Data Collator: Dynamic padding

> A data collator is a utility in the Hugging Face `transformers` library that assembles individual examples into batches during training or evaluation. It handles tasks such as padding and tensor conversion. `DataCollatorWithPadding` pads each batch only up to the length of its longest sequence (dynamic padding), rather than padding the whole dataset to one fixed length.
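
To make the idea of dynamic padding concrete, here is a minimal pure-Python sketch of what happens to one batch. It is a conceptual illustration, not the `transformers` implementation: the helper `pad_batch` is hypothetical, the real collator returns tensors rather than lists, and the pad id of 0 is an assumption based on the BERT uncased vocabulary. The token ids are taken from the tokenized example shown earlier.

```python
# Conceptual sketch of dynamic padding; NOT the transformers implementation.
PAD_TOKEN_ID = 0  # assumption: [PAD] has id 0 in the BERT uncased vocabulary


def pad_batch(features, pad_token_id=PAD_TOKEN_ID):
    """Pad every sequence in one batch to the length of its longest member."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        # Padded positions get the pad id and an attention mask of 0
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        batch["labels"].append(f["label"])
    return batch


# Two tokenized examples of different lengths (101 = [CLS], 102 = [SEP])
features = [
    {"input_ids": [101, 1045, 2514, 102], "attention_mask": [1, 1, 1, 1], "label": 4},
    {"input_ids": [101, 1045, 102], "attention_mask": [1, 1, 1], "label": 1},
]

batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the padding length is decided per batch, a batch of short sentences carries no padding sized for the longest sentence in the dataset.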

In [17]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
data_collator

DataCollatorWithPadding(tokenizer=DistilBertTokenizer(name_or_path='distilbert/distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), padding=True, max_length=None, pad_to_multiple_of=None, return_tensors='pt')

## Training

### Set Up Training Hyperparameters

In [18]:
basepath = "/content/drive/MyDrive/ColabNotebooks/Emotions_clasification"

In [19]:
training_args = TrainingArguments(
    output_dir=f'{basepath}/results_emotions_classification',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

### Set up metrics

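The `compute_metrics` function below reports accuracy and weighted precision, recall, and F1 at every evaluation, so these metrics can be monitored during training.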

In [23]:
accuracy_metric = evaluate.load("accuracy")
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_predictions):
    # eval_predictions unpacks into the model's raw logits and the true labels
    logits, labels = eval_predictions
    predictions = np.argmax(logits, axis=-1)

    # Calculate each metric
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    precision = precision_metric.compute(predictions=predictions, references=labels, average='weighted')
    recall = recall_metric.compute(predictions=predictions, references=labels, average='weighted')
    f1 = f1_metric.compute(predictions=predictions, references=labels, average='weighted')

    metrics = {
        "accuracy": accuracy["accuracy"],
        "precision": precision["precision"],
        "recall": recall["recall"],
        "f1": f1["f1"],
    }
    return metrics
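
To show what `average='weighted'` means, the sketch below computes precision, recall, and F1 per class and averages them with weights proportional to each class's number of true examples. This is a pure-Python illustration under the assumption that the `evaluate` metrics follow the usual weighted-average definition; the function name `weighted_scores` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_scores(references, predictions):
    """Per-class precision/recall/F1, averaged with weights equal to class support."""
    support = Counter(references)  # number of true examples per class
    total = len(references)
    precision = recall = f1 = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for r, p in zip(references, predictions) if r == p == cls)
        n_pred = sum(1 for p in predictions if p == cls)
        p_cls = tp / n_pred if n_pred else 0.0
        r_cls = tp / n_true
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if p_cls + r_cls else 0.0
        weight = n_true / total
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    return precision, recall, f1


# Toy data (hypothetical): three classes, two true examples each
references = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]

precision, recall, f1 = weighted_scores(references, predictions)
accuracy = sum(r == p for r, p in zip(references, predictions)) / len(references)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Under this definition, weighted recall always equals accuracy, since the per-class weights cancel the per-class denominators. That is why the Accuracy and Recall columns in the training log match.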

### Set up Trainer

In [24]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_tokenized["train"],
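    # Note: this run evaluates on the test split, so the "Validation Loss" reported
    # during training is a test-set loss; dataset_tokenized["validation"] is the
    # split intended for monitoring during training.
    eval_dataset=dataset_tokenized["test"],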
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

### Execute the training

In [25]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.1052 | 0.235377 | 0.9245 | 0.924075 | 0.9245 | 0.923985 |
| 2 | 0.0826 | 0.257351 | 0.9245 | 0.92591 | 0.9245 | 0.924813 |
| 3 | 0.063 | 0.271477 | 0.9235 | 0.923608 | 0.9235 | 0.923485 |


TrainOutput(global_step=3000, training_loss=0.08510498046875, metrics={'train_runtime': 304.9634, 'train_samples_per_second': 157.396, 'train_steps_per_second': 9.837, 'total_flos': 585319974143040.0, 'train_loss': 0.08510498046875, 'epoch': 3.0})
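
The `global_step=3000` value can be checked by hand from the hyperparameters above. The short calculation below is a sketch that assumes a single device and the default of no gradient accumulation; neither is stated in the notebook.

```python
import math

num_train_examples = 16000  # size of the train split reported earlier
per_device_batch_size = 16  # per_device_train_batch_size
num_epochs = 3              # num_train_epochs
num_devices = 1             # assumption: a single GPU
grad_accum_steps = 1        # assumption: default gradient accumulation

effective_batch_size = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

# Throughput check against the reported train_runtime (in seconds)
train_runtime = 304.9634
samples_per_second = num_train_examples * num_epochs / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3))
```

Both figures agree with the `TrainOutput` above: 1000 steps per epoch gives 3000 steps in total, and 48000 samples over the reported runtime matches `train_samples_per_second`.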

# Save the fine-tuned model

In [26]:
my_fine_tuned_model = f"{basepath}/fine-tuned-emotion-classification-model"
model.save_pretrained(my_fine_tuned_model)
tokenizer.save_pretrained(my_fine_tuned_model)

('/content/drive/MyDrive/ColabNotebooks/Emotions_clasification/fine-tuned-emotion-classification-model/tokenizer_config.json',
 '/content/drive/MyDrive/ColabNotebooks/Emotions_clasification/fine-tuned-emotion-classification-model/special_tokens_map.json',
 '/content/drive/MyDrive/ColabNotebooks/Emotions_clasification/fine-tuned-emotion-classification-model/vocab.txt',
 '/content/drive/MyDrive/ColabNotebooks/Emotions_clasification/fine-tuned-emotion-classification-model/added_tokens.json')

# Make predictions

## Using pipeline

In [27]:
model = DistilBertForSequenceClassification.from_pretrained(my_fine_tuned_model, num_labels=len(labels))
tokenizer = DistilBertTokenizer.from_pretrained(my_fine_tuned_model)

In [None]:
labels

['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

In [30]:
texts = [
  "She felt an overwhelming sense of despair wash over her as she watched the rain pour down, mirroring her tears.",
  "He jumped up and down with excitement when he heard he had gotten the job of his dreams.",
  "She looked into his eyes and felt a deep, unconditional affection that warmed her heart.",
  "His face turned red with frustration as he shouted at the unfairness of the situation.",
  "She froze in terror, her heart pounding, as the shadowy figure approached her.",
  "Her eyes widened in astonishment when she opened the door to find all her friends shouting 'Happy Birthday!'",
  "OMG I can not believe it.",
]

In [32]:
# "sentiment-analysis" is an alias for the "text-classification" pipeline task
classifier = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=tokenizer,
)

for text in texts:
    print(f"\n{text}...")
    result = classifier(text)
    print(result)


She felt an overwhelming sense of despair wash over her as she watched the rain pour down, mirroring her tears....
[{'label': 'LABEL_0', 'score': 0.9996743202209473}]

He jumped up and down with excitement when he heard he had gotten the job of his dreams....
[{'label': 'LABEL_1', 'score': 0.970298707485199}]

She looked into his eyes and felt a deep, unconditional affection that warmed her heart....
[{'label': 'LABEL_2', 'score': 0.993769109249115}]

His face turned red with frustration as he shouted at the unfairness of the situation....
[{'label': 'LABEL_3', 'score': 0.9996403455734253}]

She froze in terror, her heart pounding, as the shadowy figure approached her....
[{'label': 'LABEL_4', 'score': 0.9993605017662048}]

Her eyes widened in astonishment when she opened the door to find all her friends shouting 'Happy Birthday!'...
[{'label': 'LABEL_5', 'score': 0.7559982538223267}]

OMG I can not believe it....
[{'label': 'LABEL_4', 'score': 0.6206813454627991}]
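
The pipeline prints generic names such as `LABEL_4` because the model was loaded with only `num_labels`, so its config holds no emotion names. The sketch below builds the `id2label` and `label2id` mappings from the label list and uses them to translate pipeline output after the fact. The helper `to_label_name` is hypothetical, and the sample output is copied from the run above.

```python
labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Mappings between class indices and emotion names
id2label = {i: name for i, name in enumerate(labels)}
label2id = {name: i for i, name in id2label.items()}


def to_label_name(pipeline_label):
    """Translate a generic 'LABEL_<n>' string into its emotion name."""
    index = int(pipeline_label.split("_")[-1])
    return id2label[index]


# One pipeline result copied from the output above
raw_output = [{"label": "LABEL_4", "score": 0.6206813454627991}]
named_output = [{**item, "label": to_label_name(item["label"])} for item in raw_output]
print(named_output)
```

An alternative that avoids post-processing is to pass `id2label=id2label, label2id=label2id` to `DistilBertForSequenceClassification.from_pretrained(...)` when first loading the model. The names are then stored in the saved config, and the pipeline should report them directly.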


## Using the tokenizer and model directly

In [37]:
for text in texts:
  print("-"*20)
  print(text)
  inputs_tokenized = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
  with torch.no_grad():
    outputs = model(**inputs_tokenized)

  # Get the index of the highest-scoring class
  predictions = torch.argmax(outputs.logits, dim=-1)
  print(f"predictions: {predictions}")

  # Map the class index to its label name
  predicted_label = labels[predictions.item()]
  print(f"Predicted label: {predicted_label}")

--------------------
She felt an overwhelming sense of despair wash over her as she watched the rain pour down, mirroring her tears.
predictions: tensor([0])
Predicted label: sadness
--------------------
He jumped up and down with excitement when he heard he had gotten the job of his dreams.
predictions: tensor([1])
Predicted label: joy
--------------------
She looked into his eyes and felt a deep, unconditional affection that warmed her heart.
predictions: tensor([2])
Predicted label: love
--------------------
His face turned red with frustration as he shouted at the unfairness of the situation.
predictions: tensor([3])
Predicted label: anger
--------------------
She froze in terror, her heart pounding, as the shadowy figure approached her.
predictions: tensor([4])
Predicted label: fear
--------------------
Her eyes widened in astonishment when she opened the door to find all her friends shouting 'Happy Birthday!'
predictions: tensor([5])
Predicted label: surprise
--------------------

**Disable gradient calculation**

- Why `torch.no_grad()`? Inference (making predictions on new data) needs no gradients, so disabling gradient tracking reduces memory usage and speeds up the forward pass.
- `outputs.logits` holds the raw, unnormalized scores for each class, with shape `(batch_size, num_labels)`.

**Predictions**

- `torch.argmax` returns the index of the maximum value in a tensor along a given dimension. Here it returns the index of the highest logit for each input example.
- `dim=-1` selects the last dimension, which for these logits is the class dimension.
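
To illustrate how logits relate to a predicted label and its confidence score, here is a pure-Python sketch of softmax followed by argmax over the last dimension. The logit values are hypothetical. With PyTorch, the equivalent calls would be `torch.softmax(outputs.logits, dim=-1)` and `torch.argmax(outputs.logits, dim=-1)`.

```python
import math

labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Hypothetical logits for a batch of one example: shape (1, 6)
logits = [[-1.2, 0.3, -0.5, 0.1, 2.4, 1.9]]


def softmax(row):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


for row in logits:
    probs = softmax(row)
    # argmax over the last (class) dimension
    predicted_index = max(range(len(row)), key=row.__getitem__)
    print(labels[predicted_index], round(probs[predicted_index], 3))
```

Softmax preserves the ordering of the scores, so taking the argmax of the logits gives the same class as taking the argmax of the probabilities. The pipeline's `score` field corresponds to the probability of the predicted class.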