# Finetuning BERT (Cased)

We found the following resources helpful for learning how to finetune large language models for sequence classification:

1. https://huggingface.co/docs/transformers/en/tasks/sequence_classification
2. https://huggingface.co/docs/transformers/v4.38.1/en/main_classes/trainer#transformers.TrainingArguments.max_steps
3. https://huggingface.co/docs/transformers/en/training
4. https://youtu.be/eC6Hd1hFvos?feature=shared

## Environment Configuration

In [1]:
!pip install torch
!pip install transformers[torch]
!pip install trl
!pip install datasets
!pip install numpy
!pip install evaluate



In [2]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import load_dataset
from pprint import pprint
import torch
import numpy as np
import evaluate

## Dataset

In [25]:
training_dataset = load_dataset(
    "csv",
    data_files="/content/drive/MyDrive/Graduate/Courses/Winter 2024/EECS 6322/Course Project/dataset/justice/justice_train.csv",
    split="train"
    )
pprint(training_dataset)

Dataset({
    features: ['label', 'scenario'],
    num_rows: 21791
})


In [28]:
validation_dataset = load_dataset(
    "csv",
    data_files="/content/drive/MyDrive/Graduate/Courses/Winter 2024/EECS 6322/Course Project/dataset/justice/justice_test.csv",
)
pprint(validation_dataset)

DatasetDict({
    train: Dataset({
        features: ['label', 'scenario'],
        num_rows: 2704
    })
})


## Base Model Configuration

In [29]:
hf_id = 'google-bert/bert-base-cased'
dv = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'The programming environment will be using {dv} to run the models.')

The programming environment will be using cpu to run the models.


In [30]:
id2label = {
    0: 'Unreasonable',
    1: 'Reasonable'
}

label2id = dict((v,k) for k,v in id2label.items())
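To illustrate how these two mappings are used, the following minimal sketch turns a pair of raw logits into a label name and a probability. The logits are made-up values chosen for illustration, and `logits_to_label` is a hypothetical helper, not part of `transformers`.

```python
import math

id2label = {0: "Unreasonable", 1: "Reasonable"}
# Invert the mapping so ids can be looked up from label names.
label2id = {label: idx for idx, label in id2label.items()}


def logits_to_label(logits):
    """Hypothetical helper: softmax over the logits, then look up the top class."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Made-up logits for a single example (illustrative only).
label, prob = logits_to_label([-0.3, 1.2])
print(label, round(prob, 3))
```

Passing `id2label` and `label2id` to the model stores them in its configuration, so downstream tools can report label names instead of bare class indices.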

In [31]:
model = AutoModelForSequenceClassification.from_pretrained(
    hf_id,
    num_labels=2,
    id2label = id2label,
    label2id = label2id
    )
tokenizer = AutoTokenizer.from_pretrained(hf_id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Finetuning Configuration

In [33]:
def tokenize_dataset(examples):
    # Pad every scenario to the model's maximum length and cut off anything longer.
    return tokenizer(
        examples['scenario'],
        padding='max_length',
        truncation=True
    )
training_dataset = training_dataset.map(tokenize_dataset, batched=True)
validation_dataset = validation_dataset.map(tokenize_dataset, batched=True)
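For intuition about what `padding='max_length'` and `truncation=True` do, here is a toy sketch that works on plain lists of integers. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, a pad id of 0 is assumed, and `max_length=8` is used only to keep the output readable. A real tokenizer also keeps its special tokens in place when truncating, which this toy version does not attempt.

```python
def pad_or_truncate(token_ids, max_length=8, pad_id=0):
    """Toy helper: force a sequence of token ids to exactly max_length."""
    ids = list(token_ids[:max_length])               # truncation: drop overflow tokens
    attention_mask = [1] * len(ids)                  # 1 marks real tokens
    padding = max_length - len(ids)
    ids = ids + [pad_id] * padding                   # padding: fill up to max_length
    attention_mask = attention_mask + [0] * padding  # 0 marks padding positions
    return {"input_ids": ids, "attention_mask": attention_mask}


short = pad_or_truncate([101, 7592, 102])      # shorter than max_length: gets padded
long = pad_or_truncate(list(range(1, 13)))     # longer than max_length: gets truncated
print(short)
print(long)
```

Because every example ends up with the same length, batches can be stacked directly, at the cost of spending compute on padding positions.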


In [34]:
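training_args = TrainingArguments(
    output_dir="/",                # checkpoints are written to this directory
    learning_rate=1e-5,
    num_train_epochs=4,            # ignored while max_steps is positive
    max_steps=100,                 # overrides num_train_epochs
    auto_find_batch_size=True,     # retry with a smaller batch on out-of-memory errors
    save_strategy="epoch",
    evaluation_strategy="epoch",   # renamed to eval_strategy in newer transformers releases
    load_best_model_at_end=True    # requires matching save and evaluation strategies
    )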
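Because a positive `max_steps` takes precedence over `num_train_epochs`, it helps to check how much data 100 steps actually covers. The sketch below assumes a single device, no gradient accumulation, and a batch size of 8 (the `per_device_train_batch_size` default); `auto_find_batch_size` may settle on a smaller batch, so treat the numbers as an estimate.

```python
import math

num_rows = 21791   # training rows reported by the dataset above
batch_size = 8     # assumption: default per-device batch size, single device
max_steps = 100

steps_per_epoch = math.ceil(num_rows / batch_size)
examples_seen = max_steps * batch_size
fraction_of_epoch = examples_seen / num_rows

print(f"steps per epoch: {steps_per_epoch}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
print(f"fraction of one epoch: {fraction_of_epoch:.1%}")
```

Under these assumptions the run stops well before completing a single epoch. Leaving `max_steps` at its default of `-1` would let `num_train_epochs=4` determine the length of training instead.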

In [35]:
metric = evaluate.load("accuracy")

In [36]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
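The same computation can be checked by hand on a tiny batch. This sketch uses made-up logits and labels and computes accuracy directly with NumPy rather than through `evaluate`, so it runs without downloading the metric; `accuracy_from_logits` is a hypothetical name.

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = np.argmax(logits, axis=-1)
    return float(np.mean(predictions == labels))


# Made-up logits for four examples and two classes (illustrative only).
logits = np.array([[2.0, -1.0], [0.1, 0.4], [-0.5, 0.3], [1.5, 1.0]])
labels = np.array([0, 1, 0, 0])

accuracy = accuracy_from_logits(logits, labels)
print(accuracy)
```

Here the predicted classes are `[0, 1, 1, 0]`, so three of the four examples match their labels.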

In [37]:
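trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=training_dataset,
    # validation_dataset was loaded without `split`, so it is a DatasetDict;
    # pass its only split ("train") so metrics are reported as eval_loss / eval_accuracy.
    eval_dataset=validation_dataset["train"],
    compute_metrics=compute_metrics,
)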

In [None]:
trainer.train()

