# Fine-tuning BERT (Cased)

We found the following resources helpful for learning how to fine-tune large language models for sequence classification:

1. https://huggingface.co/docs/transformers/en/tasks/sequence_classification
2. https://huggingface.co/docs/transformers/v4.38.1/en/main_classes/trainer#transformers.TrainingArguments.max_steps
3. https://huggingface.co/docs/transformers/en/training
4. https://youtu.be/eC6Hd1hFvos?feature=shared

## Environment Configuration

In [1]:
!pip install torch
!pip install transformers[torch]
!pip install trl
!pip install datasets
!pip install numpy
!pip install evaluate

Collecting accelerate>=0.21.0 (from transformers[torch])
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.27.2
Collecting trl
  Downloading trl-0.7.11-py3-none-any.whl (155 kB)
Collecting datasets (from trl)
  Downloading datasets-2.17.1-py3-none-any.whl (536 kB)
Collecting tyro>=0.5.11 (from trl)
  Downloading tyro-0.7.3-py3-none-any.whl (79 kB)
Collecting docstring-parser>=0.14.1 (from tyro>=0.5.11->trl)
  Downloadi
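
Since the install log above is cut off, it can help to record which package versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages requested by the pip cell above.
PACKAGES = ["torch", "transformers", "trl", "datasets", "numpy", "evaluate"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```

Saving this output next to the results makes it easier to reproduce the run later.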

In [2]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import load_dataset
from pprint import pprint
import torch
import numpy as np
import evaluate

## Dataset

In [5]:
training_dataset = load_dataset(
    "csv",
    data_files="/content/drive/MyDrive/Graduate/Courses/Winter 2024/EECS 6322/Course Project/dataset/justice/justice_train.csv",
    split="train"
    )

training_dataset = training_dataset.rename_column("label", "labels")
pprint(training_dataset)

Dataset({
    features: ['labels', 'scenario'],
    num_rows: 21791
})


In [6]:
validation_dataset = load_dataset(
    "csv",
    data_files="/content/drive/MyDrive/Graduate/Courses/Winter 2024/EECS 6322/Course Project/dataset/justice/justice_test.csv",
    # A lone CSV is exposed under the default "train" split; selecting it
    # returns a Dataset (like training_dataset) instead of a DatasetDict.
    split="train",
)
validation_dataset = validation_dataset.rename_column("label", "labels")
pprint(validation_dataset)

Generating train split: 0 examples [00:00, ? examples/s]

Dataset({
    features: ['labels', 'scenario'],
    num_rows: 2704
})


## Base Model Configuration

In [7]:
hf_id = 'google-bert/bert-base-cased'
dv = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'The programming environment will be using {dv} to run the models.')

The programming environment will be using cpu to run the models.


In [8]:
id2label = {
    0: 'Unreasonable',
    1: 'Reasonable'
}

label2id = dict((v,k) for k,v in id2label.items())
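
These two dictionaries let the model configuration translate between class indices and readable names. As a rough sketch of how a predicted index becomes a label, consider the following; the logits are made-up values and `predicted_label` is a hypothetical helper, not something the notebook defines.

```python
id2label = {0: "Unreasonable", 1: "Reasonable"}
label2id = {label: idx for idx, label in id2label.items()}


def predicted_label(logits):
    """Pick the index of the largest logit and look up its name."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best_index]


example_logits = [-0.4, 1.3]  # made-up scores for a single scenario
print(predicted_label(example_logits))  # Reasonable
print(label2id["Unreasonable"])         # 0
```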

In [9]:
model = AutoModelForSequenceClassification.from_pretrained(
    hf_id,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)
tokenizer = AutoTokenizer.from_pretrained(hf_id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



## Finetuning Configuration

In [10]:
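def tokenize_dataset(examples):
    # Pad every scenario to the tokenizer's maximum length and truncate longer ones.
    return tokenizer(
        examples['scenario'],
        padding='max_length',
        truncation=True,
    )

# batched=True passes many rows to the tokenizer in a single call.
training_dataset = training_dataset.map(tokenize_dataset, batched=True)
validation_dataset = validation_dataset.map(tokenize_dataset, batched=True)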

Map:   0%|          | 0/21791 [00:00<?, ? examples/s]

Map:   0%|          | 0/2704 [00:00<?, ? examples/s]
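
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a simplified sketch of what happens to one sequence of token ids. It assumes BERT's maximum length of 512 and a `[PAD]` id of 0; the helper `pad_or_truncate` and the sample ids are hypothetical, and the real tokenizer additionally takes care of special tokens when it truncates.

```python
MAX_LENGTH = 512  # assumed maximum sequence length for BERT base checkpoints
PAD_ID = 0        # assumed id of the [PAD] token in BERT's vocabulary


def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                   # mirrors truncation=True
    padding = [pad_id] * (max_length - len(kept))   # mirrors padding='max_length'
    input_ids = kept + padding
    attention_mask = [1] * len(kept) + [0] * len(padding)
    return input_ids, attention_mask


# Hypothetical token ids standing in for one short tokenized scenario.
ids, mask = pad_or_truncate([101, 146, 1821, 102], max_length=8)
print(ids)   # [101, 146, 1821, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Padding every example to the full length is simple but costly on a CPU. Dynamic padding per batch (for example with `DataCollatorWithPadding` from `transformers`) is a common alternative when most scenarios are short.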

In [24]:
training_args = TrainingArguments(
    output_dir="/",  # note: checkpoints are written to the filesystem root
    learning_rate=1e-5,
    num_train_epochs=2,
    max_steps=5,  # a positive max_steps overrides num_train_epochs
    auto_find_batch_size=True,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    load_best_model_at_end=True,
)

In [20]:
metric = evaluate.load("accuracy")

In [21]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
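
As an illustration of what `compute_metrics` calculates, the sketch below reproduces the same argmax-then-compare logic in plain Python so that each step is visible. The logits and labels are toy values, and the helper names are hypothetical; the notebook itself relies on `np.argmax` and the `evaluate` accuracy metric.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


toy_logits = [[0.2, 1.1], [1.5, -0.3], [0.1, 0.4], [0.9, 0.8]]
toy_labels = [1, 0, 0, 0]
print(accuracy(toy_logits, toy_labels))  # {'accuracy': 0.75}
```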

In [22]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=training_dataset,
    eval_dataset=validation_dataset,
    compute_metrics=compute_metrics,
)

In [23]:
trainer.train()

Step,Training Loss,Validation Loss


TrainOutput(global_step=5, training_loss=0.6305279731750488, metrics={'train_runtime': 304.2765, 'train_samples_per_second': 0.131, 'train_steps_per_second': 0.016, 'total_flos': 10524442214400.0, 'train_loss': 0.6305279731750488, 'epoch': 0.0})
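
The reported `'epoch': 0.0` follows from how few optimizer steps were taken. The arithmetic below is a sketch that assumes the logged run used `max_steps = 5` (as `global_step=5` suggests) and a batch size of 8, which is the default `per_device_train_batch_size` on a single device. That assumption agrees with the log, since 0.131 samples per second over roughly 304 seconds is about 40 samples.

```python
import math

num_train_rows = 21791  # from the training Dataset printed earlier
batch_size = 8          # assumed: default per_device_train_batch_size, one device
max_steps = 5           # matches global_step=5 in the TrainOutput

steps_per_epoch = math.ceil(num_train_rows / batch_size)
examples_seen = max_steps * batch_size
epoch_fraction = max_steps / steps_per_epoch

print(steps_per_epoch)           # 2724
print(examples_seen)             # 40
print(round(epoch_fraction, 2))  # 0.0
```

Under these assumptions the model saw only about 40 of the 21,791 training examples, so the training loss above reflects a smoke test rather than a full fine-tuning run.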

## Empirical Testing