# **Transformers + HuggingFace Tutorial**


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


https://colab.research.google.com/drive/17WqJ3hV-qobEiC689DpH8lyB1auNa4EO?usp=sharing

Hello! In this notebook, we will talk a bit more about transformers and how to use them through the [HuggingFace](https://huggingface.co) library.

In today's tutorial we will revisit why transformers are so powerful by talking about attention and self-attention. Then we will fine-tune an encoder-decoder model on a machine translation task and, in the hands-on activity, BERT (an encoder-only model) for sentiment analysis.

### **Set-up Libraries**

In [None]:
!pip install datasets
!pip install evaluate
!pip install rouge_score


In [None]:
!pip install torch==2.0.1
!pip install transformers


# Hands on activity


3. *BERT for Sentiment Analysis*

dataset: [link](https://huggingface.co/datasets/financial_phrasebank)

Evaluation: [weighted avg](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

In [None]:
%%capture
!pip install transformers[sentencepiece] datasets accelerate
!pip install -U sentence-transformers

In [None]:
from datasets import get_dataset_config_names, get_dataset_split_names, load_dataset
import logging

#### Load data and tokenize

In [None]:
from datasets import load_dataset
dataset = load_dataset("financial_phrasebank", 'sentences_75agree')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["sentence"], truncation=True)

In [None]:
dataset["train"][0]

{'sentence': 'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .',
 'label': 1}

In [None]:
tokenized_data = dataset.map(preprocess_function, batched=True)
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
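
`DataCollatorWithPadding` pads each batch to the length of its longest sequence instead of padding the whole dataset to one global length. The idea can be sketched in plain Python. `pad_batch`, the pad ID of 0 and the token IDs are illustrative assumptions, not the library's implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two tokenized sentences of different lengths (arbitrary example IDs)
batch = pad_batch([[101, 2429, 102], [101, 2429, 2000, 12604, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the padding length is chosen per batch, short sentences are not padded out to the length of the longest sentence in the whole dataset.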

In [None]:
from torch.utils.data import random_split
train_ds, test_ds = random_split(tokenized_data["train"], [0.8, 0.2])
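
`random_split` draws a fresh random permutation each time the cell runs unless the random generator is seeded, so re-running it changes which sentences end up in the test set. The class supports in the classification reports later in this notebook differ between runs (85/447/158 versus 88/426/176), which is consistent with the split having been redrawn. The sketch below shows the effect of a fixed seed using only the standard library. `split_indices` is a hypothetical helper, not part of `torch`.

```python
import random

def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle range(n) with a fixed seed and split it into train/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(n * test_fraction)
    return indices[n_test:], indices[:n_test]

# 3453 = 2763 + 690, the train and test sizes printed later in this notebook
train_idx, test_idx = split_indices(3453)
train_again, test_again = split_indices(3453)
print(len(train_idx), len(test_idx))
print(test_idx == test_again)   # same seed, same split
```

With PyTorch, the same effect should be obtainable by passing `generator=torch.Generator().manual_seed(0)` to `random_split`.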

#### Load model and set metrics

In [None]:
import numpy as np
import evaluate
# accuracy = evaluate.load("accuracy")
from sklearn.metrics import classification_report

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    # Generate classification report as a dictionary
    report = classification_report(labels, predictions, output_dict=True)
    return {
        'accuracy': report['accuracy'],  # Overall accuracy
        'weighed avg precision': report['weighted avg']['precision'],  # Weighted average precision
        'weighed avg recall': report['weighted avg']['recall'],  # Weighted average recall
        'weighed avg f1-score': report['weighted avg']['f1-score']  # Weighted average F1-score
    }
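
The "weighted avg" entries of `classification_report` average the per-class scores using each class's support (its number of true examples) as the weight. A minimal sketch of that arithmetic follows, using the supports from the LSTM report later in this notebook, where only class 1 is ever predicted. `weighted_average` is a hypothetical helper.

```python
def weighted_average(scores, supports):
    """Average per-class scores, weighting each class by its number of true examples."""
    total = sum(supports)
    return sum(score * support for score, support in zip(scores, supports)) / total

supports = [85, 447, 158]            # true examples per class
precision = [0.0, 447 / 690, 0.0]    # every prediction is class 1, 447 of 690 are correct
recall = [0.0, 1.0, 0.0]             # all class-1 examples are recovered, nothing else

print(round(weighted_average(precision, supports), 2))   # 0.42
print(round(weighted_average(recall, supports), 2))      # 0.65
```

Weighted recall always equals overall accuracy, which is why the accuracy and weighted-recall columns in the training tables are identical.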

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import DataCollatorWithPadding

id2label = {0: 'negative', 1: 'neutral', 2: 'positive'}
label2id = {'negative': 0, 'neutral': 1, 'positive': 2}

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=3, id2label=id2label, label2id=label2id)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
!pip install transformers[torch]



In [None]:
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/NLP/Week 9 Lab",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Weighed avg precision,Weighed avg recall,Weighed avg f1-score
1,No log,0.252485,0.914493,0.920549,0.914493,0.915984
2,No log,0.226723,0.931884,0.931476,0.931884,0.93151
3,0.255600,0.274489,0.934783,0.935901,0.934783,0.934976
4,0.255600,0.307214,0.933333,0.934858,0.933333,0.933795
5,0.255600,0.342429,0.928986,0.932344,0.928986,0.929894
6,0.030600,0.312246,0.934783,0.93562,0.934783,0.935072
7,0.030600,0.354223,0.930435,0.93174,0.930435,0.930844
8,0.030600,0.352257,0.934783,0.935896,0.934783,0.935137
9,0.002800,0.344126,0.934783,0.935773,0.934783,0.935107
10,0.002800,0.339924,0.937681,0.93842,0.937681,0.937939


TrainOutput(global_step=1730, training_loss=0.08367216410492197, metrics={'train_runtime': 255.7075, 'train_samples_per_second': 108.053, 'train_steps_per_second': 6.766, 'total_flos': 429000992506206.0, 'train_loss': 0.08367216410492197, 'epoch': 10.0})
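
The `global_step=1730` in this output follows from the split and batch sizes: 2763 training sentences (the size printed later for `train_input_batch`) in batches of 16 give 173 optimizer steps per epoch. The "No log" entries are consistent with the training loss being logged every 500 steps, which is assumed here to be the `Trainer` default since the cell does not set `logging_steps`.

```python
import math

train_size = 2763      # training examples after the 80/20 split
batch_size = 16        # per_device_train_batch_size (a single device is assumed)
epochs = 10
logging_steps = 500    # assumed Trainer default; not set in the cell above

steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)   # 173 1730

# Epoch during which each logged training loss first becomes available
log_epochs = [math.ceil(step / steps_per_epoch)
              for step in range(logging_steps, total_steps + 1, logging_steps)]
print(log_epochs)   # [3, 6, 9]
```

This matches the table: the training-loss column first shows a value at epoch 3 and changes again at epochs 6 and 9.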

# Assignments

Please compare the performance of BERT (try various pretrained models if you can), LSTM, BiLSTM, BiLSTM+attention, and RNN on the dataset: [link](https://huggingface.co/datasets/financial_phrasebank).

Please ensure a fair comparison (e.g. using the same parameters), and explain the results.

Setup:

```python
batch_size = 16
num_classes = 3
learning_rate = 2e-5
epoch = 10
```

## DistilRoberta-financial-sentiment

This model has been finetuned on the same domain.
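
One caveat for the comparison: the cells that follow reuse `train_ds`, `test_ds` and `tokenizer`, all produced by the DistilBERT tokenizer, while this checkpoint was trained with its own vocabulary. Token IDs only carry meaning relative to the vocabulary that produced them. The toy example below uses two made-up vocabularies (not the real ones) to show what happens when IDs from one are read with the other.

```python
# Two made-up vocabularies standing in for two different tokenizers
vocab_a = {"[PAD]": 0, "profit": 1, "rose": 2, "fell": 3}
vocab_b = {"<s>": 0, "<pad>": 1, "fell": 2, "profit": 3}

def encode(tokens, vocab):
    """Map tokens to integer IDs."""
    return [vocab[token] for token in tokens]

def decode(ids, vocab):
    """Map integer IDs back to tokens."""
    id_to_token = {index: token for token, index in vocab.items()}
    return [id_to_token[index] for index in ids]

ids = encode(["profit", "rose"], vocab_a)
print(ids)                    # [1, 2]
print(decode(ids, vocab_b))   # ['<pad>', 'fell']
```

In practice this would mean loading the tokenizer with `AutoTokenizer.from_pretrained` from the same checkpoint name as the model and re-running `dataset.map(preprocess_function, batched=True)` before building each `Trainer`.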

In [None]:
model2 = AutoModelForSequenceClassification.from_pretrained(
    "mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis", num_labels=3, id2label=id2label, label2id=label2id)

trainer2 = Trainer(
    model=model2,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer2.train()

Epoch,Training Loss,Validation Loss,Accuracy,Weighed avg precision,Weighed avg recall,Weighed avg f1-score
1,No log,0.798008,0.649275,0.634091,0.649275,0.551428
2,No log,0.68162,0.730435,0.706443,0.730435,0.688151
3,0.711800,0.59241,0.765217,0.753066,0.765217,0.747462
4,0.711800,0.685455,0.757971,0.794599,0.757971,0.759917
5,0.711800,0.645472,0.75942,0.779163,0.75942,0.764205
6,0.402000,0.673274,0.789855,0.791641,0.789855,0.788379
7,0.402000,0.705437,0.813043,0.814454,0.813043,0.803599
8,0.402000,0.777111,0.785507,0.790323,0.785507,0.78592
9,0.232700,0.802432,0.786957,0.789581,0.786957,0.786531
10,0.232700,0.780981,0.813043,0.809399,0.813043,0.810587


TrainOutput(global_step=1730, training_loss=0.41052688201727894, metrics={'train_runtime': 1125.7192, 'train_samples_per_second': 24.544, 'train_steps_per_second': 1.537, 'total_flos': 422973359649612.0, 'train_loss': 0.41052688201727894, 'epoch': 10.0})

In [None]:
trainer2.evaluate()

{'eval_loss': 0.5924104452133179,
 'eval_accuracy': 0.7652173913043478,
 'eval_weighed avg precision': 0.7530658848297955,
 'eval_weighed avg recall': 0.7652173913043478,
 'eval_weighed avg f1-score': 0.7474620856740826,
 'eval_runtime': 1.4517,
 'eval_samples_per_second': 475.3,
 'eval_steps_per_second': 30.309,
 'epoch': 10.0}
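
The numbers returned by `trainer2.evaluate()` match epoch 3 of the training table rather than epoch 10. This is consistent with `load_best_model_at_end=True`, which reloads the best checkpoint once training ends. The selection metric is assumed here to be the validation loss, since `metric_for_best_model` is not set in `training_args`. A small sketch with values copied from the table above:

```python
# (epoch, validation loss, accuracy) copied from the training table above
history = [
    (1, 0.798008, 0.649275), (2, 0.681620, 0.730435), (3, 0.592410, 0.765217),
    (4, 0.685455, 0.757971), (5, 0.645472, 0.759420), (6, 0.673274, 0.789855),
    (7, 0.705437, 0.813043), (8, 0.777111, 0.785507), (9, 0.802432, 0.786957),
    (10, 0.780981, 0.813043),
]

best_by_loss = min(history, key=lambda row: row[1])
best_by_accuracy = max(history, key=lambda row: row[2])   # first epoch wins a tie
print("lowest validation loss: epoch", best_by_loss[0])
print("highest accuracy: epoch", best_by_accuracy[0])
```

The lowest validation loss (epoch 3) and the highest accuracy (epoch 7, tied with epoch 10) fall on different checkpoints, so the reported accuracy depends on which metric selects the "best" model. `TrainingArguments` accepts `metric_for_best_model` and `greater_is_better` for choosing another criterion.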

## twitter-roberta-base-sentiment-latest

In [None]:
model3 = AutoModelForSequenceClassification.from_pretrained(
    "cardiffnlp/twitter-roberta-base-sentiment-latest", num_labels=3, id2label=id2label, label2id=label2id)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
trainer3 = Trainer(
    model=model3,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer3.train()

Epoch,Training Loss,Validation Loss,Accuracy,Weighed avg precision,Weighed avg recall,Weighed avg f1-score
1,No log,0.804256,0.656522,0.650781,0.656522,0.56277
2,No log,0.634874,0.749275,0.752497,0.749275,0.71513
3,0.644900,0.618263,0.776812,0.770231,0.776812,0.758754
4,0.644900,0.718506,0.797101,0.798807,0.797101,0.783604
5,0.644900,0.730224,0.769565,0.79017,0.769565,0.772397
6,0.268900,0.851616,0.818841,0.815519,0.818841,0.811165
7,0.268900,0.922278,0.824638,0.820744,0.824638,0.81983
8,0.268900,1.049746,0.807246,0.811679,0.807246,0.801913
9,0.103100,1.048226,0.828986,0.82554,0.828986,0.825368
10,0.103100,1.074158,0.827536,0.823965,0.827536,0.823418


TrainOutput(global_step=1730, training_loss=0.2997160073649677, metrics={'train_runtime': 677.4664, 'train_samples_per_second': 40.784, 'train_steps_per_second': 2.554, 'total_flos': 840116003871564.0, 'train_loss': 0.2997160073649677, 'epoch': 10.0})

In [None]:
trainer3.evaluate()

{'eval_loss': 0.6182626485824585,
 'eval_accuracy': 0.7768115942028986,
 'eval_weighed avg precision': 0.7702306510629912,
 'eval_weighed avg recall': 0.7768115942028986,
 'eval_weighed avg f1-score': 0.7587542528079992,
 'eval_runtime': 2.5459,
 'eval_samples_per_second': 271.028,
 'eval_steps_per_second': 17.283,
 'epoch': 10.0}

## facebook/opt-350m

Open Pretrained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters

In [None]:
model4 = AutoModelForSequenceClassification.from_pretrained(
    "facebook/opt-350m", num_labels=3, id2label=id2label, label2id=label2id)

In [None]:
trainer4= Trainer(
    model=model4,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer4.train()

Epoch,Training Loss,Validation Loss,Accuracy,Weighed avg precision,Weighed avg recall,Weighed avg f1-score
1,No log,0.749482,0.675362,0.735982,0.675362,0.590905
2,No log,0.542449,0.810145,0.809576,0.810145,0.798598
3,0.513600,0.581002,0.827536,0.82441,0.827536,0.820819
4,0.513600,1.255887,0.824638,0.821312,0.824638,0.816943
5,0.513600,1.467546,0.824638,0.821174,0.824638,0.819004
6,0.075100,1.618979,0.810145,0.818056,0.810145,0.812886
7,0.075100,1.83852,0.836232,0.833342,0.836232,0.832022
8,0.075100,1.826995,0.842029,0.839581,0.842029,0.838325
9,0.006000,1.901775,0.83913,0.836422,0.83913,0.834957
10,0.006000,1.828011,0.83913,0.836572,0.83913,0.83584


TrainOutput(global_step=1730, training_loss=0.17186858186072998, metrics={'train_runtime': 1841.7183, 'train_samples_per_second': 15.002, 'train_steps_per_second': 0.939, 'total_flos': 2975603466835968.0, 'train_loss': 0.17186858186072998, 'epoch': 10.0})

In [None]:
metrics = trainer4.evaluate(eval_dataset = test_ds)

In [None]:
metrics

{'eval_loss': 0.5424486994743347,
 'eval_accuracy': 0.8101449275362319,
 'eval_weighed avg precision': 0.8095761442849324,
 'eval_weighed avg recall': 0.8101449275362319,
 'eval_weighed avg f1-score': 0.7985980831613062,
 'eval_runtime': 8.1818,
 'eval_samples_per_second': 84.334,
 'eval_steps_per_second': 5.378,
 'epoch': 10.0}

## nickmuchi/sec-bert-finetuned-finance-classification

This model is a fine-tuned version of nlpaueb/sec-bert-base on the sentence_50Agree financial-phrasebank + Kaggle Dataset, a dataset consisting of 4840 Financial News categorised by sentiment (negative, neutral, positive). The Kaggle dataset includes Covid-19 sentiment data and can be found here: sentiment-classification-selflabel-dataset.

In [None]:
model5 = AutoModelForSequenceClassification.from_pretrained(
    "nickmuchi/sec-bert-finetuned-finance-classification", num_labels=3, id2label=id2label, label2id=label2id)


In [None]:
trainer5= Trainer(
    model=model5,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer5.train()

Epoch,Training Loss,Validation Loss,Accuracy,Weighed avg precision,Weighed avg recall,Weighed avg f1-score
1,No log,0.785887,0.671014,0.63869,0.671014,0.57924
2,No log,0.608127,0.76087,0.7423,0.76087,0.737066
3,0.629000,0.550647,0.778261,0.769884,0.778261,0.772632
4,0.629000,0.692046,0.762319,0.780124,0.762319,0.768181
5,0.629000,0.903146,0.75942,0.797579,0.75942,0.767818
6,0.234600,0.800874,0.818841,0.814344,0.818841,0.810314
7,0.234600,0.869108,0.810145,0.812734,0.810145,0.810133
8,0.234600,0.920795,0.823188,0.824697,0.823188,0.823303
9,0.075400,0.967045,0.815942,0.817128,0.815942,0.81588
10,0.075400,0.967225,0.817391,0.81789,0.817391,0.816995


TrainOutput(global_step=1730, training_loss=0.2760947745659448, metrics={'train_runtime': 815.9835, 'train_samples_per_second': 33.861, 'train_steps_per_second': 2.12, 'total_flos': 840116003871564.0, 'train_loss': 0.2760947745659448, 'epoch': 10.0})

In [None]:
trainer5.evaluate()

{'eval_loss': 0.5506474375724792,
 'eval_accuracy': 0.7782608695652173,
 'eval_weighed avg precision': 0.7698843573733138,
 'eval_weighed avg recall': 0.7782608695652173,
 'eval_weighed avg f1-score': 0.7726321327938983,
 'eval_runtime': 2.4971,
 'eval_samples_per_second': 276.321,
 'eval_steps_per_second': 17.62,
 'epoch': 10.0}

## LSTM, BiLSTM, BiLSTM+attention, and RNN

Hyperparameters set up:

```python
embedding_dim = 15
n_hidden = 10 # number of hidden units in one cell
num_classes = 3
cut_off = 150
lr = 2e-5
```
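
With these settings, almost all trainable parameters of the recurrent models sit in the embedding table. Below is a rough count for the unidirectional LSTM classifier defined later, assuming PyTorch's `nn.LSTM` layout (four gates, each with input weights, recurrent weights and two bias vectors) and the 30522-token vocabulary of the DistilBERT tokenizer.

```python
vocab_size = 30522     # tokenizer.vocab_size of distilbert-base-uncased
embedding_dim = 15
n_hidden = 10
num_classes = 3

embedding_params = vocab_size * embedding_dim
# 4 gates, each with input weights, recurrent weights and two bias vectors
lstm_params = 4 * (embedding_dim * n_hidden + n_hidden * n_hidden + 2 * n_hidden)
classifier_params = n_hidden * num_classes + num_classes

print(embedding_params, lstm_params, classifier_params)   # 457830 1080 33
```

The recurrent layer and classifier together hold roughly a thousand parameters, all trained from scratch, which is worth keeping in mind when comparing against pretrained transformers.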

### Preprocess data for the architecture

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import torch.nn.functional as F
import matplotlib.pyplot as plt

In [None]:
vocab_size = tokenizer.vocab_size
vocab_size

30522

In [None]:
# Parameters Set Up
embedding_dim = 15
n_hidden = 10 # number of hidden units in one cell
num_classes = 3
lr = 2e-5

In [None]:
train_inputs = []
train_targets = []

for i in range(len(train_ds)):
    input = torch.tensor(train_ds[i]['input_ids'])
    label = train_ds[i]['label']
    train_inputs.append(input)
    train_targets.append(label)

In [None]:
test_inputs = []
test_targets = []

for i in range(len(test_ds)):
    input = torch.tensor(test_ds[i]['input_ids'])
    label = test_ds[i]['label']
    test_inputs.append(input)
    test_targets.append(label)

In [None]:
from torch.nn.utils.rnn import pad_sequence
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_input_batch = pad_sequence(train_inputs, batch_first=True).to(device)
train_target_batch = Variable(torch.LongTensor(train_targets)).to(device)

test_input_batch = pad_sequence(test_inputs, batch_first=True).to(device)
y_true = np.array(test_targets).reshape(-1,1)

print(train_input_batch.size(), train_target_batch.size())
print(test_input_batch.size(), y_true.shape)

torch.Size([2763, 150]) torch.Size([2763])
torch.Size([690, 93]) (690, 1)


### LSTM

In [None]:
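class LSTM(nn.Module):
    def __init__(self):
        super(LSTM, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # batch_first=True: inputs and outputs are [batch_size, len_seq, features]
        self.lstm = nn.LSTM(embedding_dim, n_hidden, bidirectional=False, batch_first=True)
        self.FC = nn.Linear(n_hidden, num_classes)

    def forward(self, X):
        input = self.embedding(X) # input : [batch_size, len_seq, embedding_dim]
        output, (h,c) = self.lstm(input) # output : [batch_size, len_seq, n_hidden]
        output = self.FC(output[: ,-1]) # hidden state at the last (possibly padded) position
        return output # [batch_size, num_classes]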

lstm_m = LSTM()
lstm_m = lstm_m.to(device)

criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(lstm_m.parameters(), lr= lr)

#### Train

In [None]:
# Training
from tqdm import tqdm

lstm_losses = []

for epoch in tqdm(range(500)):
    optimizer.zero_grad()
    output = lstm_m(train_input_batch)
    loss = criterion(output, train_target_batch)
    lstm_losses.append(loss.item())  # store a float, not the tensor and its autograd graph
    if (epoch + 1) % 100 == 0:
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss.item()))

    loss.backward()
    optimizer.step()

Epoch: 0100 cost = 0.906653
Epoch: 0200 cost = 0.906610
Epoch: 0300 cost = 0.906575
Epoch: 0400 cost = 0.906526
Epoch: 0500 cost = 0.906451

100%|██████████| 500/500 [07:54<00:00,  1.05it/s]


#### Eval

In [None]:
from sklearn.metrics import classification_report

# Make predictions using the LSTM model
predict = lstm_m(test_input_batch)
predict = predict.data.max(1, keepdim=True)[1].detach().cpu().numpy()

# Generate and print the classification report
print(classification_report(y_true, predict))


              precision    recall  f1-score   support

           0       0.00      0.00      0.00        85
           1       0.65      1.00      0.79       447
           2       0.00      0.00      0.00       158

    accuracy                           0.65       690
   macro avg       0.22      0.33      0.26       690
weighted avg       0.42      0.65      0.51       690





### BiLSTM

In [None]:
class BiLSTM(nn.Module):
    def __init__(self):
        super(BiLSTM, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # batch_first=True: inputs and outputs are [batch_size, len_seq, features]
        self.lstm = nn.LSTM(embedding_dim, n_hidden, bidirectional=True, batch_first=True)
        self.FC = nn.Linear(n_hidden*2, num_classes)

    def forward(self, X):
        input = self.embedding(X) # input : [batch_size, len_seq, embedding_dim]
        output, (h,c) = self.lstm(input) # output : [batch_size, len_seq, n_hidden * 2]
        output = self.FC(output[: ,-1]) # forward and backward states at the last position
        return output # [batch_size, num_classes]

bi_model = BiLSTM()
bi_model = bi_model.to(device)

criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(bi_model.parameters(), lr=lr)

#### Train

In [None]:
# Training
from tqdm import tqdm

bi_losses = []
for epoch in tqdm(range(500)):
    optimizer.zero_grad()
    output = bi_model(train_input_batch)
    loss = criterion(output, train_target_batch)
    bi_losses.append(loss.item())  # store a float, not the tensor and its autograd graph
    if (epoch + 1) % 100 == 0:
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss.item()))

    loss.backward()
    optimizer.step()

Epoch: 0100 cost = 1.052118
Epoch: 0200 cost = 1.036578
Epoch: 0300 cost = 1.021880
Epoch: 0400 cost = 1.008076
Epoch: 0500 cost = 0.995207

100%|██████████| 500/500 [15:25<00:00,  1.85s/it]


#### Eval

In [None]:
from sklearn.metrics import classification_report

predict = bi_model(test_input_batch)
predict = predict.data.max(1, keepdim=True)[1].detach().cpu().numpy()

print(classification_report(y_true, predict))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        85
           1       0.65      1.00      0.79       447
           2       0.00      0.00      0.00       158

    accuracy                           0.65       690
   macro avg       0.22      0.33      0.26       690
weighted avg       0.42      0.65      0.51       690





### BiLSTM+Attention

In [None]:
class BiLSTM_Attention(nn.Module):
    def __init__(self):
        super(BiLSTM_Attention, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, n_hidden, bidirectional=True)
        self.out = nn.Linear(n_hidden * 2, num_classes)

    # lstm_output : [batch_size, n_step, n_hidden * num_directions(=2)], F matrix
    def attention_net(self, lstm_output, final_state):
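        # final_state : [num_directions(=2), batch_size, n_hidden]. Concatenate the forward and backward
        # states per example; a plain view() here would mix states from different examples.
        hidden = torch.cat((final_state[0], final_state[1]), dim=1).unsqueeze(2)   # hidden : [batch_size, n_hidden * num_directions(=2), 1]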
        # Attention Weight Calculation
        attn_weights = torch.bmm(lstm_output, hidden).squeeze(2) # attn_weights : [batch_size, n_step]
        # normalization
        soft_attn_weights = F.softmax(attn_weights, 1)
        # Context Computation (weighted sum)
        # [batch_size, n_hidden * num_directions(=2), n_step] * [batch_size, n_step, 1] = [batch_size, n_hidden * num_directions(=2), 1]
        context = torch.bmm(lstm_output.transpose(1, 2), soft_attn_weights.unsqueeze(2)).squeeze(2)
        return context, soft_attn_weights.data.detach().cpu().numpy() # context : [batch_size, n_hidden * num_directions(=2)]

    def forward(self, X):
        input = self.embedding(X) # input : [batch_size, len_seq, embedding_dim]
        input = input.permute(1, 0, 2) # input : [len_seq, batch_size, embedding_dim]

        hidden_state = Variable(torch.zeros(1*2, len(X), n_hidden)).to(device) # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]
        cell_state = Variable(torch.zeros(1*2, len(X), n_hidden)).to(device) # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]

        # final_hidden_state, final_cell_state : [num_layers(=1) * num_directions(=2), batch_size, n_hidden]
        output, (final_hidden_state, final_cell_state) = self.lstm(input, (hidden_state, cell_state))
        output = output.permute(1, 0, 2) # output : [batch_size, len_seq, n_hidden]
        attn_output, attention = self.attention_net(output, final_hidden_state)
        return self.out(attn_output), attention # model : [batch_size, num_classes], attention : [batch_size, n_step]

lstm_att_model = BiLSTM_Attention()
lstm_att_model = lstm_att_model.to(device)

criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(lstm_att_model.parameters(), lr= lr)
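
The `attention_net` step can be traced by hand for a single sentence: score each time step by its dot product with the final hidden state, normalise the scores with a softmax, and take the weighted sum of the LSTM outputs. The sketch below uses plain Python lists and made-up numbers in place of tensors. `attention_pool` is a hypothetical helper, not the class method above.

```python
import math

def attention_pool(outputs, query):
    """Dot-product attention: outputs is [n_step][dim], query is [dim]."""
    scores = [sum(o * q for o, q in zip(step, query)) for step in outputs]
    # numerically stable softmax over the time steps
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # context vector: weighted sum of the per-step outputs
    context = [sum(w * step[d] for w, step in zip(weights, outputs))
               for d in range(len(query))]
    return context, weights

outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three time steps, dimension 2
query = [1.0, 1.0]                                # stand-in for the final hidden state
context, weights = attention_pool(outputs, query)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The third time step aligns best with the query, so it receives the largest weight and dominates the context vector.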

#### Train

In [None]:
# Training
from tqdm import tqdm

eps = 0.00003
max_penalty = 3
count = 0
lstm_att_losses = []

for epoch in tqdm(range(500)):
    optimizer.zero_grad()
    output, attention = lstm_att_model(train_input_batch)
    loss = criterion(output, train_target_batch)

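    loss_value = loss.item()  # plain float for logging and the stopping test
    if (epoch + 1) % 100 == 0:
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss_value))

    # Early stopping: tolerate up to max_penalty epochs whose loss barely changes
    # (or is already below 0.05), then stop
    if len(lstm_att_losses) > 0:
        if abs(loss_value - lstm_att_losses[-1]) < eps or loss_value < 0.05:
            if count >= max_penalty:
                break
            else:
                count += 1

    lstm_att_losses.append(loss_value)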
    loss.backward()
    optimizer.step()

Epoch: 0100 cost = 1.006153
Epoch: 0200 cost = 0.996486
Epoch: 0300 cost = 0.987401
Epoch: 0400 cost = 0.978854
Epoch: 0500 cost = 0.970814

100%|██████████| 500/500 [17:03<00:00,  2.05s/it]


#### Eval

In [None]:
from sklearn.metrics import classification_report as cr

predict, _ = lstm_att_model(test_input_batch)
predict = predict.data.max(1, keepdim=True)[1].detach().cpu().numpy()

print(cr(y_true, predict))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        88
           1       0.62      1.00      0.76       426
           2       0.00      0.00      0.00       176

    accuracy                           0.62       690
   macro avg       0.21      0.33      0.25       690
weighted avg       0.38      0.62      0.47       690





### RNN

In [None]:
class RNN(nn.Module):
    def __init__(self):
        super(RNN, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # batch_first=True: inputs and outputs are [batch_size, len_seq, features]
        self.rnn = nn.RNN(embedding_dim, n_hidden, bidirectional=False, batch_first=True)
        self.FC = nn.Linear(n_hidden, num_classes)

    def forward(self, X):
        input = self.embedding(X) # input : [batch_size, len_seq, embedding_dim]
        output, h = self.rnn(input) # output : [batch_size, len_seq, n_hidden]
        output = self.FC(output[: ,-1]) # hidden state at the last (possibly padded) position
        return output # [batch_size, num_classes]

rnn_model = RNN()
rnn_model = rnn_model.to(device)

criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(rnn_model.parameters(), lr=lr)

#### Train

In [None]:
# Training
from tqdm import tqdm

rnn_losses = []

for epoch in tqdm(range(500)):
    optimizer.zero_grad()
    output = rnn_model(train_input_batch)
    loss = criterion(output, train_target_batch)
    rnn_losses.append(loss.item())
    if (epoch + 1) % 100 == 0:
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss))

    loss.backward()
    optimizer.step()

Epoch: 0100 cost = 1.186968
Epoch: 0200 cost = 1.162854
Epoch: 0300 cost = 1.140224
Epoch: 0400 cost = 1.119050
Epoch: 0500 cost = 1.099304

100%|██████████| 500/500 [04:39<00:00,  1.79it/s]


#### Eval

In [None]:
from sklearn.metrics import classification_report

predict = rnn_model(test_input_batch)
predict = predict.data.max(1, keepdim=True)[1].detach().cpu().numpy()

print(classification_report(y_true, predict))

              precision    recall  f1-score   support

           0       0.13      1.00      0.23        88
           1       0.50      0.00      0.01       426
           2       0.00      0.00      0.00       176

    accuracy                           0.13       690
   macro avg       0.21      0.33      0.08       690
weighted avg       0.33      0.13      0.03       690





## Comparison between models

In general, the pretrained models perform better than the LSTM, the BiLSTM with or without attention, and the RNN.

- Pretrained models: Among the additional checkpoints tried in the assignment, **facebook/opt-350m** has the highest scores, with about 0.81 accuracy and 0.80-0.81 on the weighted averages from `evaluate()`, and 0.84 accuracy at its best epoch. This may be because of its larger number of parameters, which allows it to capture more intricate patterns and nuances in the data. The DistilBERT run from the hands-on activity scores higher still (about 0.94 accuracy by epoch 10). All of these runs reuse the data tokenized with the DistilBERT tokenizer, so every other checkpoint received token IDs from a vocabulary it was not pretrained on. This likely lowers their scores, and the ranking among them should be read with caution.
- Among the other architectures, the RNN performs worst, with an accuracy of only 13%. Its classification report shows that it predicts class 0 (negative) for almost every sentence, and its final training loss of 1.099 is about ln 3, the loss of a uniform guess over three classes, so the model has barely learned anything. RNNs also suffer from vanishing or exploding gradients, which makes long-term dependencies hard to learn and may contribute here.
- The LSTM, BiLSTM and BiLSTM with attention perform alike: each predicts the majority class (neutral) for every test sentence, so the accuracy of 62-65% simply equals the share of neutral sentences, and the weighted-average precision and F1 fall to roughly 0.4-0.5. LSTMs are designed to mitigate the vanishing gradient problem through gating mechanisms, such as the forget gate and input gate, which help the model retain relevant information over longer sequences. That advantage is not visible in these runs, because the training loss hardly moves over 500 full-batch updates at a learning rate of 2e-5 (for the LSTM, from 0.9067 to 0.9065). They do not perform better than the pretrained models, which may be because the architectures here are quite simple (a single recurrent layer with 10 hidden units) and are trained from scratch.
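
Two reference points help when reading the recurrent results in the classification reports above: the accuracy of always predicting the most frequent class, and the cross-entropy of a uniform guess over three classes. A short check using the class supports from the LSTM report (label names follow `id2label`):

```python
import math

supports = {"negative": 85, "neutral": 447, "positive": 158}   # from the LSTM report
total = sum(supports.values())

majority_accuracy = max(supports.values()) / total
uniform_cross_entropy = math.log(len(supports))

print(round(majority_accuracy, 2))       # 0.65
print(round(uniform_cross_entropy, 4))   # 1.0986
```

A model only demonstrates that it has learned something from the text once it beats both of these numbers.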