# Fine-tuning BERT (Cased)

We found the following resources very helpful for learning how to fine-tune large language models for sequence classification:

1. https://huggingface.co/docs/transformers/en/tasks/sequence_classification
2. https://huggingface.co/docs/transformers/v4.38.1/en/main_classes/trainer#transformers.TrainingArguments.max_steps
3. https://huggingface.co/docs/transformers/en/training
4. https://youtu.be/eC6Hd1hFvos?feature=shared

## Environment Configuration

In [1]:
!pip install torch
!pip install transformers[torch]
!pip install datasets
!pip install numpy
!pip install evaluate


In [2]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import load_dataset
from pprint import pprint
import torch
import numpy as np
import evaluate

## Dataset

In [4]:
training_dataset = load_dataset(
    "csv",
    data_files="/content/drive/MyDrive/Graduate/Courses/Winter 2024/EECS 6322/Course Project/dataset/justice/justice_train.csv",
    split="train"
    )

training_dataset = training_dataset.rename_column("label", "labels")
pprint(training_dataset)

Dataset({
    features: ['labels', 'scenario'],
    num_rows: 21791
})


In [3]:
validation_dataset = load_dataset(
    "csv",
    data_files="/content/drive/MyDrive/Graduate/Courses/Winter 2024/EECS 6322/Course Project/dataset/justice/justice_test_hard.csv",
    split="train"
)
validation_dataset = validation_dataset.rename_column("label", "labels")
pprint(validation_dataset)

Dataset({
    features: ['labels', 'scenario'],
    num_rows: 2052
})


In [6]:
hard_dataset = load_dataset(
    "csv",
    data_files="/content/drive/MyDrive/Graduate/Courses/Winter 2024/EECS 6322/Course Project/dataset/justice/justice_test_hard.csv",
)
hard_dataset = hard_dataset.rename_column("label", "labels")
pprint(hard_dataset)

DatasetDict({
    train: Dataset({
        features: ['labels', 'scenario'],
        num_rows: 2052
    })
})


## Base Model Configuration

In [7]:
hf_id = 'google-bert/bert-base-cased'
dv = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'The programming environment will be using {dv} to run the models.')

The programming environment will be using cpu to run the models.


In [8]:
id2label = {
    0: 'Unreasonable',
    1: 'Reasonable'
}
label2id = {label: idx for idx, label in id2label.items()}

In [9]:
model = AutoModelForSequenceClassification.from_pretrained(
    hf_id,
    num_labels=2,
    id2label = id2label,
    label2id = label2id
    )
model.to(dv)
tokenizer = AutoTokenizer.from_pretrained(hf_id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



## Finetuning Configuration

In [11]:
def tokenize_dataset(examples):
  return tokenizer(examples['scenario'],
                   padding='max_length',
                   truncation=True
                   )
training_dataset = training_dataset.map(tokenize_dataset, batched=True)
validation_dataset = validation_dataset.map(tokenize_dataset, batched=True)
hard_dataset = hard_dataset.map(tokenize_dataset, batched=True)
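
With `padding='max_length'` and no explicit `max_length`, every scenario is padded to the tokenizer's maximum of 512 tokens for this checkpoint. The sketch below is a rough illustration of what that costs compared with padding each batch only to its longest example. The helper `padded_token_count` and the token lengths are hypothetical and not taken from the dataset.

```python
def padded_token_count(lengths, batch_size, pad_to=None):
    """Count token slots after padding.

    If pad_to is given, every example is padded to that fixed length.
    Otherwise each batch is padded to the longest example it contains.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = pad_to if pad_to is not None else max(batch)
        total += target * len(batch)
    return total


MAX_LENGTH = 512  # maximum sequence length of bert-base-cased
# Hypothetical token counts for eight short scenarios (illustration only).
lengths = [18, 25, 31, 22, 40, 27, 19, 35]

fixed = padded_token_count(lengths, batch_size=4, pad_to=MAX_LENGTH)
dynamic = padded_token_count(lengths, batch_size=4)
print(f"max_length padding: {fixed} token slots")
print(f"per-batch padding:  {dynamic} token slots")
```

If the scenarios are short, per-batch padding (for example with `DataCollatorWithPadding` from `transformers`) may reduce compute considerably. We have not measured this on the justice dataset.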


In [12]:
training_args = TrainingArguments(
    output_dir="/",
    learning_rate = 3e-5,
    num_train_epochs=3,
    auto_find_batch_size=True,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    logging_steps=1,
    label_names = ["labels"]
    )
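
To get a feel for how long the three epochs configured above will run, the number of optimizer steps can be estimated from the dataset size. This is a back-of-the-envelope sketch only: it assumes a single device, no gradient accumulation, and that `auto_find_batch_size` keeps the default starting batch size of 8. The helper name `total_optimizer_steps` is our own.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Estimate optimizer steps, counting the final partial batch of each epoch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


NUM_TRAIN_ROWS = 21791  # num_rows reported for the training dataset
NUM_EPOCHS = 3          # num_train_epochs in TrainingArguments

# Assumed batch size of 8; a smaller size may be chosen if memory runs out.
steps_at_8 = total_optimizer_steps(NUM_TRAIN_ROWS, batch_size=8, num_epochs=NUM_EPOCHS)
steps_at_4 = total_optimizer_steps(NUM_TRAIN_ROWS, batch_size=4, num_epochs=NUM_EPOCHS)
print(steps_at_8, steps_at_4)
```

If the batch size is lowered during training, the step count grows accordingly, as the second figure illustrates.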

In [13]:
metric = evaluate.load("accuracy")

In [14]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
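
The metric function above reduces each row of logits to the index of its largest value and compares that index with the reference label. A minimal NumPy-only sketch of the same computation follows. The logits and labels are made up for illustration, and `accuracy_from_logits` is a hypothetical helper, not part of `evaluate`.

```python
import numpy as np


def accuracy_from_logits(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = np.argmax(logits, axis=-1)
    return float((predictions == labels).mean())


# Hypothetical logits for four scenarios; columns are [Unreasonable, Reasonable].
logits = np.array([
    [2.0, -1.0],
    [0.1, 0.3],
    [-0.5, 1.5],
    [1.2, 0.4],
])
labels = np.array([0, 1, 0, 0])

accuracy = accuracy_from_logits(logits, labels)
print(accuracy)
```

Here the third scenario is misclassified, so three of the four predictions match their labels.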

In [16]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=training_dataset,
    eval_dataset=validation_dataset,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()


In [None]:
# We will save this to test later
trainer.save_model('/content/drive/MyDrive/Graduate/Courses/Winter 2024/EECS 6322/Course Project/models/justice-bert')
tokenizer.save_pretrained('/content/drive/MyDrive/Graduate/Courses/Winter 2024/EECS 6322/Course Project/models/justice-bert')

In [None]:
pprint(trainer.evaluate(training_dataset))
pprint(trainer.evaluate(validation_dataset))
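# hard_dataset was loaded without a split, so it is a DatasetDict; evaluate its 'train' split.
pprint(trainer.evaluate(hard_dataset['train']))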

## Empirical Testing

In [None]:
tune_id = '/content/drive/MyDrive/Graduate/Courses/Winter 2024/EECS 6322/Course Project/models/justice-bert'

bert_justice = pipeline(
      'text-classification',
      model = tune_id,
      tokenizer = AutoTokenizer.from_pretrained(tune_id),
      torch_dtype = "auto",
      device = dv,
      top_k = None,  # return a score for every label (replaces the deprecated return_all_scores=True)
)
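
When the pipeline returns a score for every label, the result for one sentence is a list of `{"label", "score"}` dictionaries, which may be wrapped in an extra outer list depending on the arguments and library version. The sketch below picks the highest-scoring label from either shape. The helper `top_label` and the scores are hypothetical, not real model output.

```python
def top_label(scores):
    """Return (label, score) for the highest-scoring entry.

    Accepts either a flat list of {"label", "score"} dicts or the same
    list wrapped in one extra outer list.
    """
    if scores and isinstance(scores[0], list):
        scores = scores[0]  # unwrap the single-sentence nesting
    best = max(scores, key=lambda item: item["score"])
    return best["label"], best["score"]


# Hypothetical pipeline output for one sentence (illustration only).
example_output = [[
    {"label": "Unreasonable", "score": 0.27},
    {"label": "Reasonable", "score": 0.73},
]]

label, score = top_label(example_output)
print(f"{label} ({score:.2f})")
```

The label names come from the `id2label` mapping saved with the model, so they should read `Unreasonable` and `Reasonable` here.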



In [None]:
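# validation_dataset was loaded with split="train", so it is already a Dataset
# and its columns can be indexed directly.
for entry in validation_dataset['scenario']:
  label = bert_justice(entry)  # run the classifier once per scenario
  print(f"""
    Sentence: {entry}
    Evaluation: {label}
    """)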