# Fine Tuning LLaMA-2
This notebook fine-tunes LLaMA 2 on our Supreme Court case dataset and evaluates its performance on a test set of cases.

COMPUTE REQUIREMENTS: A100 GPU with 40GB RAM

## Results From This Notebook
Accuracy On Test Set: 0.6280701754385964
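
To give a rough sense of the uncertainty around this number, here is a minimal sketch of a 95% interval using the normal approximation. The count of 179 correct predictions is inferred from the reported accuracy multiplied by the 285 test cases shown in the evaluation progress bar; the helper name `normal_interval` and the choice of approximation are our own assumptions, not part of the original evaluation.

```python
import math

def normal_interval(correct, total, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

# 179 of 285 is inferred from the reported accuracy (assumption, see above)
low, high = normal_interval(179, 285)
print(f"accuracy interval: {low:.3f} to {high:.3f}")
```

With only 285 test cases the interval is several percentage points wide, so small differences between runs should be read with caution.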

In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

In [None]:
from google.colab import drive

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
import os
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

In [None]:
drive.mount('/content/drive/')

#change this to the directory you have the files stored in
%cd /content/drive/My Drive/CPSC-477-Project/

df = pd.read_csv('2024-05-07-oyez-scrape.csv')
df.head()

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
/content/drive/.shortcut-targets-by-id/1ygGGGOVkhqy-8CG8UUS16DYOOWKzLRFx/CPSC-477-Project


Unnamed: 0,Case Key,Case Name,First Party Label,First Party,Second Party Label,Second Party,Winning Party,Justices,Facts,Question,Conclusion
0,1971/70-18,Roe v. Wade,Appellant,Jane Roe,Appellee,Henry Wade,Jane Roe,"William O. Douglas, Potter Stewart, Thurgood M...","In 1970, Jane Roe (a fictional name used in co...",Does the Constitution recognize a woman's righ...,Inherent in the Due Process Clause of the Four...
1,1971/70-5014,Stanley v. Illinois,Petitioner,"Peter Stanley, Sr.",Respondent,Illinois,Stanley,"William O. Douglas, Potter Stewart, Thurgood M...",Joan Stanley had three children with Peter Sta...,Does the Illinois statutory scheme that assume...,"Yes. Justice Byron R. White, writing for a 5-..."
2,1971/70-29,Giglio v. United States,Petitioner,John Giglio,Respondent,United States,Giglio,"William O. Douglas, Potter Stewart, Thurgood M...",John Giglio was convicted of passing forged mo...,Is the prosecution’s failure to disclose a pro...,"Yes. Chief Justice Warren E. Burger, writing ..."
3,1971/70-4,Reed v. Reed,Appellant,Sally Reed,Appellee,Cecil Reed,Sally Reed,"William O. Douglas, Potter Stewart, Thurgood M...","The Idaho Probate Code specified that ""males m...",Did the Idaho Probate Code violate the Equal P...,"In a unanimous decision, the Court held that t..."
4,1971/70-73,Miller v. California,Appellant,Marvin Miller,Appellee,California,Marvin Miller,"Warren E. Burger, William O. Douglas, William ...","Miller, after conducting a mass mailing campai...",Is the sale and distribution of obscene materi...,"In a 5-to-4 decision, the Court held that obsc..."


## Data
To fine-tune our model, we need instruction data. We generate it by building prompts and responses out of the case data, holding out the most recent cases (those with a term year after 2017) as test cases. The test set includes cases LLaMA may already have knowledge about, but there are so few cases after its knowledge cutoff that we include them anyway.
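
The split can be sketched on its own as follows. This is a minimal illustration, assuming every case key starts with a four-digit term year (as in `1971/70-18`). The helper names `term_year` and `split_by_year` are hypothetical, and all sample keys except the first are made up for the example.

```python
def term_year(case_key: str) -> int:
    """Return the four-digit term year at the start of a case key."""
    return int(case_key[:4])

def split_by_year(case_keys, cutoff=2017):
    """Partition keys into (train, test); test holds keys with year > cutoff."""
    train, test = [], []
    for key in case_keys:
        if term_year(key) > cutoff:
            test.append(key)
        else:
            train.append(key)
    return train, test

# Hypothetical keys for illustration; only the first appears in the data preview
keys = ["1971/70-18", "2017/16-111", "2018/17-1299", "2022/21-476"]
train_keys, test_keys = split_by_year(keys)
print(train_keys)
print(test_keys)
```

Splitting by year rather than at random keeps the test cases strictly later than every training case.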

In [None]:
def get_case_info(case_num):
  # Look up the row once by position instead of rebuilding each column as a list
  row = df.iloc[case_num]

  return {"facts" : row["Facts"], "question": row["Question"],
          "party_1": row["First Party"], "party_2": row["Second Party"],
          "winning_party": row["Winning Party"], "key": row["Case Key"],
          }

def get_case_prompt(case_info, include_answer= False):
  prompt = f"""
  [INST]
  The United States Supreme Court is hearing a legal case centered around the legal question of {case_info["question"]}.
  Given these case facts: {case_info["facts"]}
  Where the parties in question are {case_info["party_1"]} and {case_info["party_2"]}

  Responding with only one party, which party would the United States Supreme Court rule in favor of?
  [/INST]
  """
  if include_answer:
    prompt += f"""
    The United States Supreme Court would rule in favor of {case_info["winning_party"]}
    """
  return prompt

In [None]:
#generate train and test data
train_data = []
test_data = []
test_winning_parties = []
for case_num in tqdm(range(len(df))):
  case_info = get_case_info(case_num)
  if int(case_info["key"][:4]) > 2017:
    prompt = get_case_prompt(case_info)
    #For the test data, we need to save the winning party externally since it is not included in the prompt
    test_data.append({"prompt": prompt, "winning_party": case_info["winning_party"]})
  else:
    prompt = get_case_prompt(case_info, include_answer = True)
    train_data.append(prompt)

100%|██████████| 2509/2509 [00:02<00:00, 851.10it/s]


## Fine Tuning the Model
Note: The majority of the code for the next 2 cells was taken from https://mlabonne.github.io/blog/posts/Fine_Tune_Your_Own_Llama_2_Model_in_a_Colab_Notebook.html

In [None]:
# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/llama-2-7b-chat-hf"

# Fine-tuned model name
new_model = "llama-2-7b-scotus"

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X updates steps
save_steps = 25

# Log every X updates steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}
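
To make the LoRA settings above more concrete, here is a small back-of-the-envelope sketch. It assumes the standard LoRA formulation, where the low-rank update is scaled by `lora_alpha / r` and each adapted weight matrix gains `r * (d_in + d_out)` trainable parameters. The hidden size (4096), layer count (32), and the choice of two adapted projection matrices per layer are assumptions about the 7B model and the library defaults; none of them is stated in this notebook.

```python
def lora_scaling(alpha: int, r: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return alpha / r

def lora_params_per_matrix(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter (matrices A and B)."""
    return r * (d_in + d_out)

# Values taken from the configuration cell
lora_r, lora_alpha = 64, 16

# Assumptions (not stated in the notebook)
hidden_size = 4096        # assumed hidden size of the 7B model
num_layers = 32           # assumed number of transformer layers
adapted_per_layer = 2     # assumed: two square attention projections per layer

scaling = lora_scaling(lora_alpha, lora_r)
per_matrix = lora_params_per_matrix(hidden_size, hidden_size, lora_r)
total = per_matrix * adapted_per_layer * num_layers
print(scaling, per_matrix, total)
```

Under these assumptions the adapters add roughly 34 million trainable parameters, a small fraction of the 7B base model.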

In [None]:
# Load dataset
dataset = Dataset.from_dict({"text": train_data}, split="train")

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

Your GPU supports bfloat16: accelerate training with bf16=True


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.





Map:   0%|          | 0/2224 [00:00<?, ? examples/s]

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
25,1.6946
50,1.14
75,1.2238
100,1.0573
125,1.196
150,1.0962
175,1.1793
200,1.0704
225,1.1673
250,1.0687








In [None]:
from transformers import GenerationConfig
model.eval()

# Device the model was loaded on (device_map above places it on GPU 0)
device = torch.device("cuda:0")

generation_config = GenerationConfig(
    max_new_tokens = 30, #short generation since we only need one sentence responses
    decoder_start_token_id=1,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
)

def get_model_response(prompt):
  torch.cuda.empty_cache()
  tokenized_input = tokenizer(prompt, add_special_tokens=True, return_tensors="pt")
  tokenized_input.to(device)
  input_len = len(prompt)
  outputs = model.generate(**tokenized_input, generation_config=generation_config)[0]
  tokenized_input.to('cpu')
  # Drop the prompt by character count. The decoded text also contains special
  # tokens (such as the BOS marker), so this offset is only approximate.
  model_response = tokenizer.decode(outputs)[input_len:]
  return model_response
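
An alternative to slicing the decoded string by character count is to slice the token IDs by the prompt's token count, since a decoder-only model returns the prompt tokens followed by the newly generated ones. The sketch below shows the idea on plain lists; the helper name `generated_only` and the token IDs are hypothetical, and this is not the method used to produce the results in this notebook.

```python
def generated_only(output_ids, prompt_ids):
    """Return only the tokens produced after the prompt."""
    return output_ids[len(prompt_ids):]

# Hypothetical token IDs for illustration
prompt_ids = [1, 518, 25580, 29962]
output_ids = prompt_ids + [450, 3303, 2]

new_ids = generated_only(output_ids, prompt_ids)
print(new_ids)
```

In the notebook, the resulting IDs could then be passed to `tokenizer.decode(..., skip_special_tokens=True)` so that special tokens do not shift the text.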




In [None]:
import logging
from tqdm import tqdm

# Set logging level to suppress warnings
logging.getLogger("transformers").setLevel(logging.ERROR)

num_correct = 0
case_count = 0
for test in tqdm(test_data):
  case_count += 1
  prediction = get_model_response(test["prompt"])
  # The fine-tuned model begins every response with these words, matching the training data:
  response_seed = "the united states supreme court would"
  # In some cases one party is "United States". Since we only check whether the
  # winning party appears in the response string, we strip the seed first;
  # otherwise it would produce false positives whenever the model
  # did not predict United States as the winner.
  prediction = prediction.lower()[len(response_seed):]
  correct = test["winning_party"].lower() in prediction
  if correct:
    num_correct += 1
  if case_count % 50 == 0:
    print(num_correct/case_count)

print(" ")
accuracy = num_correct/case_count
print(f"Accuracy: {accuracy}")

 18%|█▊        | 50/285 [01:59<09:17,  2.37s/it]

0.6


 35%|███▌      | 100/285 [04:00<07:29,  2.43s/it]

0.56


 53%|█████▎    | 150/285 [05:59<05:21,  2.38s/it]

0.5866666666666667


 70%|███████   | 200/285 [07:59<03:25,  2.42s/it]

0.635


 88%|████████▊ | 250/285 [10:00<01:23,  2.38s/it]

0.628


100%|██████████| 285/285 [11:26<00:00,  2.41s/it]

 
Accuracy: 0.6280701754385964
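
The scoring rule used in the loop above can be isolated into a small function to see why the response seed has to be removed. This is a simplified variant, not the exact code that produced the accuracy above: it trims surrounding whitespace and only strips the seed when the response actually starts with it. The function name `is_correct` is hypothetical.

```python
RESPONSE_SEED = "the united states supreme court would"

def is_correct(prediction: str, winning_party: str) -> bool:
    """Check whether the winning party is named after the response seed."""
    text = prediction.strip().lower()
    if text.startswith(RESPONSE_SEED):
        # Remove the seed so its "united states" cannot match a party name
        text = text[len(RESPONSE_SEED):]
    return winning_party.lower() in text

seed = "The United States Supreme Court would rule in favor of "
print(is_correct(seed + "Giglio", "United States"))         # wrong party predicted
print(is_correct(seed + "United States", "United States"))  # right party predicted
print(is_correct(seed + "Jane Roe", "Jane Roe"))
```

Without the seed removal, the first example would count as correct simply because "United States" appears in "United States Supreme Court".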





## Sources

https://mlabonne.github.io/blog/posts/Fine_Tune_Your_Own_Llama_2_Model_in_a_Colab_Notebook.html

https://huggingface.co/docs/datasets/v1.1.1/loading_datasets.html