<h1 style="text-align:center; font-size:3em; font-weight:bold;">
Alpaca GPT LLM Fine Tuning
</h1>

This script is an experiment in fine-tuning a large language model (LLM), specifically Google's gemma-7b, on the Alpaca GPT-4 dataset. It uses parameter-efficient fine-tuning (PEFT) with LoRA (Low-Rank Adaptation) and 8-bit quantization so that training fits on consumer GPUs. Because fine-tuning can take a long time, the script also saves checkpoints so that training can resume after an interruption.

The script can be modified to test other LLMs. Some ideas for extending it:
1. Try different base models and compare their performance and resource requirements
2. Tune hyperparameters such as batch size and LoRA rank
3. Evaluate and validate model performance during and after training
4. Integrate custom datasets
5. Explore more advanced training techniques
6. Make checkpoint management more robust

### Changes implemented so far:
1. Reduced the number of epochs from 3 to 1 to shorten training time while testing stability
2. Added periodic checkpoints so that training can proceed in stages instead of one uninterrupted run
3. Allowed training to resume from a checkpoint
4. Kept the last 3 checkpoints as temporary backups in case of a crash or a paused run
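
The keep-the-last-3 policy can be sketched in plain Python. This only illustrates the rotation idea and is not the Trainer's actual implementation; `rotate_checkpoints` and the step numbers are hypothetical.

```python
def rotate_checkpoints(saved_steps, new_step, limit=3):
    """Return the checkpoint steps kept after saving `new_step`.

    Mimics a keep-the-newest-`limit` policy: the oldest entries are
    dropped once the limit is exceeded.
    """
    kept = sorted(set(saved_steps) | {new_step})
    return kept[-limit:]


# Simulate saving a checkpoint every 500 steps in a hypothetical run.
kept = []
for step in range(500, 3001, 500):
    kept = rotate_checkpoints(kept, step)
print(kept)  # [2000, 2500, 3000]
```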

In [None]:
# Script for alpaca gpt4

## install required libraries first
# pip install transformers datasets accelerate peft trl bitsandbytes numpy
# pip install -U numpy==1.23.5 (might skip this command since it was forked from AWS)

## run this command to check gpu usage in terminal
# nvidia-smi

## Hugging Face access token: do not hard-code it in the notebook.
## Set it in the environment before starting Jupyter, e.g. export HF_TOKEN=<your token>

### Please note that this script is designed to run on a machine with a GPU. Keep the Hugging Face token out of the code before sharing it publicly.

# import required libraries and set up the environment for LLM fine-tuning
import os
import torch
import peft
import bitsandbytes as bnb
from huggingface_hub import login, whoami
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from transformers import TrainingArguments, Trainer
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

# read the token from the HF_TOKEN environment variable (raises KeyError if it is not set)
hface_access_token = os.environ["HF_TOKEN"]



In [None]:
# check that the login and the environment setup are correct
login(hface_access_token)
print(whoami())
print(torch.cuda.is_available())
print(peft.__version__)
print(bnb.__version__)



In [None]:
# load the dataset alpaca gpt4 from hugging face
ds = load_dataset("vicgalle/alpaca-gpt4")

def format_example(example):
    prompt = example["instruction"]
    if example["input"]:
        prompt += f"\n\nInput:\n{example['input']}"
    prompt += "\n\nResponse:"
    return {
        "prompt": prompt,
        "response": example["output"]
    }

dataset = ds["train"].map(format_example)
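
To see what `format_example` produces, the sketch below applies the same logic to two made-up records, one with an input field and one without. The records are illustrative and are not rows from the dataset.

```python
def format_example(example):
    # Same prompt layout as the function defined in the cell above.
    prompt = example["instruction"]
    if example["input"]:
        prompt += f"\n\nInput:\n{example['input']}"
    prompt += "\n\nResponse:"
    return {"prompt": prompt, "response": example["output"]}


with_input = format_example(
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
)
without_input = format_example(
    {"instruction": "Name a primary color.", "input": "", "output": "Red"}
)

# repr() makes the newlines in each prompt visible.
print(repr(with_input["prompt"]))
print(repr(without_input["prompt"]))
```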



In [None]:
# Display the first 6 rows of the training split
for i in range(6):
    print(f"--- Row {i + 1} ---")
    print("Instruction:", ds["train"][i]["instruction"])
    print("Input      :", ds["train"][i]["input"])
    print("Output     :", ds["train"][i]["output"])
    print()

model_name = "google/gemma-7b"



In [None]:
# Define 8-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_skip_modules=None,
    llm_int8_enable_fp32_cpu_offload=True
)
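
As a rough illustration of why 8-bit loading helps, the sketch below estimates the memory needed for the weights alone at different precisions. The parameter count is an assumed round figure for a 7B-class model, and the estimate ignores activations, optimizer state, and quantization overhead.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


assumed_params = 7_000_000_000  # assumption: round figure, not the exact model size
for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: {weight_memory_gib(assumed_params, dtype):.1f} GiB")
```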



In [None]:
# Load tokenizer and model with quantization config
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config
)



In [None]:
# Add PEFT LoRA support for fine-tuning 8-bit models
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
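
`print_trainable_parameters()` reports how few weights LoRA actually trains. The sketch below shows the arithmetic for a single linear layer: LoRA adds two small matrices of shapes `(r, d_in)` and `(d_out, r)` next to the frozen weight. The layer dimensions are hypothetical and chosen only for illustration.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer."""
    return r * d_in + d_out * r


def full_param_count(d_in, d_out):
    """Parameters in the frozen weight matrix (bias ignored)."""
    return d_in * d_out


d_in, d_out, r = 4096, 4096, 8  # hypothetical layer size; r=8 as in the LoRA config
added = lora_param_count(d_in, d_out, r)
frozen = full_param_count(d_in, d_out)
print(f"LoRA adds {added:,} trainable parameters next to {frozen:,} frozen ones")
```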



In [None]:
# Tokenize the Dataset for Training
def tokenize(batch):
    full_texts = [prompt + response for prompt, response in zip(batch["prompt"], batch["response"])]
    tokenized = tokenizer(
        full_texts,
        truncation=True,
        padding="max_length",
        max_length=1024
    )
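    # copy input_ids as labels, but set padded positions to -100 so the loss ignores them
    tokenized["labels"] = [
        [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
        for ids, attn in zip(tokenized["input_ids"], tokenized["attention_mask"])
    ]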
    return tokenized

tokenized_dataset = dataset.map(tokenize, batched=True)
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
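
PyTorch's cross-entropy loss skips targets equal to -100 by default, which is why that value is commonly used to exclude positions from the training loss. A minimal sketch with toy token IDs shows how padded positions could be excluded using the attention mask. The IDs and the helper name `mask_padding` are made up for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(input_ids, attention_mask):
    """Copy input_ids as labels, replacing padded positions with IGNORE_INDEX."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]


input_ids = [0, 0, 2, 15, 27, 9]     # toy IDs; 0 stands in for the pad token
attention_mask = [0, 0, 1, 1, 1, 1]  # 0 marks padding
print(mask_padding(input_ids, attention_mask))  # [-100, -100, 2, 15, 27, 9]
```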



In [None]:
# define training arguments and trainer
training_args = TrainingArguments(
    output_dir="./gemma7b-finetuned",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,  # reduced from 3 to 1 for a shorter initial run
    logging_steps=10,    # log training metrics every 10 steps
    save_steps=500,      # write a checkpoint every 500 optimizer steps
    save_total_limit=3,  # keep only the 3 most recent checkpoints
    fp16=True,
    report_to="none"
    # resume_from_checkpoint is not set here: Trainer does not use the value
    # from TrainingArguments, so pass it to trainer.train() instead
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer
)

trainer.train()
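
The batch settings determine how many examples each optimizer step consumes and, together with `save_steps`, how many checkpoints a run writes. A rough sketch of that arithmetic for a single GPU follows. The dataset size is a placeholder, and the step count the Trainer reports may differ slightly depending on how it handles the last partial batch.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Examples consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def approx_optimizer_steps(num_examples, batch_size, epochs=1):
    """Approximate optimizer steps, ignoring any final partial batch."""
    return (num_examples // batch_size) * epochs


batch = effective_batch_size(per_device_batch=2, grad_accum_steps=4)
steps = approx_optimizer_steps(num_examples=50_000, batch_size=batch)  # placeholder size
print("effective batch size:", batch)
print("approx. optimizer steps:", steps)
print("checkpoints written at save_steps=500:", steps // 500)
```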



In [None]:
# Resume from the most recent checkpoint in output_dir. Run this instead of trainer.train() above,
# and only when a checkpoint already exists; otherwise it raises an error.
#trainer.train(resume_from_checkpoint=True)
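
Resuming only makes sense when a checkpoint exists, so one option is to look for one first. The sketch below assumes the Trainer's `checkpoint-<step>` folder naming inside `output_dir`. `find_latest_checkpoint` is a hypothetical helper written for illustration; the `transformers` library ships a similar utility, `get_last_checkpoint`, in `transformers.trainer_utils`.

```python
import os
import re

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-<step> folder, or None."""
    if not os.path.isdir(output_dir):
        return None
    candidates = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            # Compare step numbers as integers so 1500 sorts after 500.
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    return os.path.join(output_dir, max(candidates)[1])


latest = find_latest_checkpoint("./gemma7b-finetuned")
print("resume from:", latest)  # None when no checkpoint has been saved yet
```

With a helper like this, the call could become `trainer.train(resume_from_checkpoint=latest)`, which starts a fresh run when `latest` is `None`.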