<a href="https://colab.research.google.com/github/yashgt/AI101/blob/main/TRAINING_LLAMA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Role Teller

In [None]:
!pip install -q --upgrade bitsandbytes==0.48.2 trl==0.25.1
!wget -q https://raw.githubusercontent.com/ed-donner/llm_engineering/main/week7/util.py -O util.py

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import os
import re
import math
from tqdm import tqdm

from google.colab import userdata
from huggingface_hub import login
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, set_seed, BitsAndBytesConfig
from datasets import load_dataset, Dataset, DatasetDict
import wandb
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
from datetime import datetime
import matplotlib.pyplot as plt

In [3]:
# Constants

BASE_MODEL = "meta-llama/Llama-3.2-3B"
PROJECT_NAME = "roleteller"
HF_USER = "yash-ganthe" # your HF name here!

#DATA_USER = "yash-ganthe"
#DATASET_NAME = f"{DATA_USER}/training_data"

RUN_NAME =  f"{datetime.now():%Y-%m-%d_%H.%M.%S}"
PROJECT_RUN_NAME = f"{PROJECT_NAME}-{RUN_NAME}"
HUB_MODEL_NAME = f"{HF_USER}/{PROJECT_RUN_NAME}"

# Hyper-parameters - overall

EPOCHS = 3
BATCH_SIZE = 32
MAX_SEQUENCE_LENGTH = 128
GRADIENT_ACCUMULATION_STEPS = 1

# Hyper-parameters - QLoRA

QUANT_4_BIT = True
LORA_R = 32
LORA_ALPHA = LORA_R * 2
ATTENTION_LAYERS = ["q_proj", "v_proj", "k_proj", "o_proj"]
MLP_LAYERS = ["gate_proj", "up_proj", "down_proj"]
TARGET_MODULES = ATTENTION_LAYERS
LORA_DROPOUT = 0.1

# Hyper-parameters - training

LEARNING_RATE = 1e-4
WARMUP_RATIO = 0.01
LR_SCHEDULER_TYPE = 'cosine'
WEIGHT_DECAY = 0.001
OPTIMIZER = "paged_adamw_32bit"

capability = torch.cuda.get_device_capability()
use_bf16 = capability[0] >= 8

# Tracking

VAL_SIZE = 500
LOG_STEPS = 5
SAVE_STEPS = 100
LOG_TO_WANDB = False  # referenced by the final cell; set to True if you enable Weights & Biases logging

In [23]:
# A100 GPU supports this; T4 does not natively

use_bf16

False
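The check above keys off the GPU's CUDA compute capability. As a minimal sketch of the same decision without needing a GPU, the hypothetical helper `pick_compute_dtype` below (not part of the notebook) maps a capability tuple to the dtype name the later cells would choose. The A100 reports capability 8.0 and the T4 reports 7.5.

```python
def pick_compute_dtype(capability: tuple) -> str:
    """Return the dtype name to use, given a (major, minor) CUDA compute capability."""
    major, _minor = capability
    # Major version 8 and above is treated as bf16-capable, mirroring `capability[0] >= 8`
    return "bfloat16" if major >= 8 else "float16"


print(pick_compute_dtype((8, 0)))  # A100 -> bfloat16
print(pick_compute_dtype((7, 5)))  # T4 -> float16
```

On a T4 this is why `use_bf16` evaluates to `False` and the quantization config falls back to `torch.float16`.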

# More on Optimizers

https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one#optimizers

The most common optimizer is Adam or its variant AdamW (Adam with weight decay).  
Adam converges well because it keeps rolling averages of the previous gradients and of their squares; the cost is extra memory, on the order of two additional values for every trainable parameter.
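To put a rough number on that memory cost, here is a back-of-the-envelope sketch. It assumes two 32-bit state values per trainable parameter; the parameter counts are round illustrative figures, not measurements from this notebook.

```python
def adam_state_bytes(trainable_params: int, bytes_per_value: int = 4) -> int:
    """Approximate memory for Adam's two per-parameter state buffers."""
    states_per_param = 2  # rolling average of gradients, and of squared gradients
    return trainable_params * states_per_param * bytes_per_value


# Illustrative assumption: training every weight of a ~3B-parameter model
full_model = adam_state_bytes(3_000_000_000)
# Illustrative assumption: training only ~18M adapter weights (as with LoRA)
adapter_only = adam_state_bytes(18_000_000)

print(f"{full_model / 1e9:.1f} GB")    # 24.0 GB
print(f"{adapter_only / 1e6:.1f} MB")  # 144.0 MB
```

The gap between the two figures is one reason adapter-based fine-tuning fits on a single Colab GPU.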


### Log in to HuggingFace and Weights & Biases

If you don't already have a HuggingFace account, visit https://huggingface.co to sign up and create a token.

Then open the Secrets panel for this notebook by clicking the key icon on the left, and add a new secret called `HF_TOKEN` with your token as the value.

Repeat this for Weights & Biases at https://wandb.ai and add a secret called `WANDB_API_KEY`.

In [5]:
# Log in to HuggingFace

hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

In [6]:
training_data = [
  {"name": "Yuvraj Chhabra", "role": "Project Manager"},
  {"name": "Yash Ganthe", "role": "Architect"},
  {"name": "Vaibhav Gandhi", "role": "Architect"},
  {"name": "Rahul Hegde", "role": "Solution Architect"},
  {"name": "Shyamal Dhole", "role": "Dev Lead"},
  {"name": "Vineet Vyas", "role": "Specialist"},
  {"name": "Michael John", "role": "DBA"},
  {"name": "Neha Magarde", "role": "Dev Lead"},
  {"name": "Sonali Patil", "role": "QA Lead"},
  {"name": "Rohini Golatkar", "role": "QA Lead"},
]

In [16]:
# Turn each name/role pair into a prompt/completion example for supervised fine-tuning
train_datalist = [
  {'prompt': f"What is the role of {item['name']}?", 'completion': item['role']}
  for item in training_data
]
train_dataset = Dataset.from_list(train_datalist)

## Now load the Tokenizer and Model

The model is "quantized" - we reduce the precision of its weights to 4 bits (or to 8 bits if `QUANT_4_BIT` is set to `False`).

In [8]:
# pick the right quantization

if QUANT_4_BIT:
  quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16 if use_bf16 else torch.float16,
    bnb_4bit_quant_type="nf4"
  )
else:
  quant_config = BitsAndBytesConfig(
    load_in_8bit=True,  # BitsAndBytesConfig has no 8-bit compute-dtype option
  )

In [9]:
# Load the Tokenizer and the Model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=quant_config,
    device_map="auto",
)
base_model.generation_config.pad_token_id = tokenizer.pad_token_id

print(f"Memory footprint: {base_model.get_memory_footprint() / 1e6:.1f} MB")

Memory footprint: 2197.6 MB
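
As a rough sanity check on the footprint printed above, the sketch below estimates weight storage at different precisions. The parameter count of about 3.2 billion is an assumption for illustration, and `weight_memory_mb` is a hypothetical helper, not a library function.

```python
def weight_memory_mb(n_params: float, bits_per_param: float) -> float:
    """Estimate storage for model weights in MB (1 MB = 1e6 bytes)."""
    return n_params * bits_per_param / 8 / 1e6


ASSUMED_PARAMS = 3.2e9  # assumption: approximate size of a "3B" model

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_mb(ASSUMED_PARAMS, bits):,.0f} MB")
```

The measured footprint is somewhat higher than the naive 4-bit estimate. A likely reason is that not every tensor is quantized (embeddings, for example, typically stay in higher precision) and the quantized weights carry some extra metadata.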


# AND NOW

## We set up the configuration for Training

We need to create two objects:

A `LoraConfig` object with our hyperparameters for LoRA

An `SFTConfig` object with our overall training parameters

In [10]:
# LoRA Parameters

lora_parameters = LoraConfig(
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    r=LORA_R,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=TARGET_MODULES,
)
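
To get a feel for what `r` and `target_modules` mean for model size, the sketch below counts the weights LoRA would add. For each adapted linear layer, LoRA trains two small matrices of shapes `r x in_features` and `out_features x r`. The layer shapes and layer count here are assumptions about the base model, chosen for illustration; check `base_model.config` for the real values.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Trainable weights LoRA adds to one linear layer: A (r x in) plus B (out x r)."""
    return r * (in_features + out_features)


R = 32
ALPHA = 64

# Assumed (in_features, out_features) for the attention projections of one decoder layer
ASSUMED_SHAPES = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
}
ASSUMED_NUM_LAYERS = 28  # assumption: number of decoder layers

per_layer = sum(lora_param_count(i, o, R) for i, o in ASSUMED_SHAPES.values())
total = per_layer * ASSUMED_NUM_LAYERS

print(f"per layer: {per_layer:,}")               # per layer: 655,360
print(f"total:     {total:,}")                   # total:     18,350,080
print(f"update scaling (alpha / r): {ALPHA / R}")  # update scaling (alpha / r): 2.0
```

Under these assumptions the adapter is well under 1% of the base model's size. Adding `MLP_LAYERS` to `TARGET_MODULES` would increase the count.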

In [24]:
# Training parameters

train_parameters = SFTConfig(
    output_dir=PROJECT_RUN_NAME,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    optim=OPTIMIZER,
    save_steps=SAVE_STEPS,
    save_total_limit=10,
    logging_steps=LOG_STEPS,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    # fp16 and bf16 cannot both be True; enable bf16 where supported, otherwise leave mixed precision off
    fp16=False,
    bf16=use_bf16,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=WARMUP_RATIO,
    group_by_length=True,
    lr_scheduler_type=LR_SCHEDULER_TYPE,
    # report_to="wandb" if LOG_TO_WANDB else None,
    run_name=RUN_NAME,
    max_length=MAX_SEQUENCE_LENGTH,
    save_strategy="steps",
    hub_strategy="every_save",
    push_to_hub=True,
    hub_model_id=HUB_MODEL_NAME,
    hub_private_repo=True,
    eval_strategy="no",
    eval_steps=SAVE_STEPS
)
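
It helps to work out how many optimizer steps these settings produce before training. The sketch below is an approximation for a single GPU. The helper `training_schedule` is hypothetical, and the trainer's own bookkeeping may differ in edge cases.

```python
import math


def training_schedule(num_examples, batch_size, grad_accum, epochs,
                      warmup_ratio, save_steps, log_steps):
    """Approximate the step counts implied by the training configuration."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
        "intermediate_checkpoints": total_steps // save_steps,
        "log_events": total_steps // log_steps,
    }


# Values taken from this notebook: 10 examples, batch size 32, 3 epochs
schedule = training_schedule(
    num_examples=10, batch_size=32, grad_accum=1, epochs=3,
    warmup_ratio=0.01, save_steps=100, log_steps=5,
)
print(schedule)
```

With only ten examples, one batch covers the whole dataset, so the run takes about three steps. That is fewer than both `LOG_STEPS` and `SAVE_STEPS`, so lower both values if you want to see intermediate logs or checkpoints on a dataset this small.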

# AND NOW - create the trainer

In [25]:
fine_tuning = SFTTrainer(
    model=base_model,
    train_dataset=train_dataset,
    #eval_dataset=val,
    peft_config=lora_parameters,
    args=train_parameters
)



## In the next cell, we kick off fine-tuning!

This will run for some time, uploading a checkpoint to the hub every `SAVE_STEPS` steps.

Google may stop your Colab session after a while. On free plans, this can happen whenever Google is low on resources. On paid plans, a session can last up to 24 hours, but there's no guarantee.

If your session is stopped, you can follow my colab here to resume from your last save:

https://colab.research.google.com/drive/1qGTDVIas_Vwoby4UVi2vwsU0tHXy8OMO#scrollTo=R_O04fKxMMT-

I've saved that colab with my final run in the output so you can see an example. The trick is that I needed to set `is_trainable=True` when loading the fine-tuned model.
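
As a minimal sketch of that trick, assuming the adapter was pushed to the hub by an earlier run: the helper name `load_adapter_for_resume` and the repo id in the usage comment are hypothetical, and the linked colab may do this differently.

```python
def load_adapter_for_resume(base_model, adapter_repo: str):
    """Attach a saved LoRA adapter to an already loaded base model, keeping it trainable."""
    # Imported lazily so the function can be defined without peft installed
    from peft import PeftModel

    # By default adapters load frozen for inference; is_trainable=True keeps them trainable
    return PeftModel.from_pretrained(base_model, adapter_repo, is_trainable=True)


# Usage (hypothetical repo id; substitute the run you want to continue):
# resumed_model = load_adapter_for_resume(base_model, "your-hf-user/roleteller-your-run-name")
```

The resumed model can then be passed to a new `SFTTrainer` in place of `base_model`, without a `peft_config`, since the adapter is already attached.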

### Anyway, with that in mind, let's kick this off!

In [None]:
# Fine-tune!
fine_tuning.train()

# Push our fine-tuned model to Hugging Face
fine_tuning.model.push_to_hub(PROJECT_RUN_NAME, private=True)
print(f"Saved to the hub: {PROJECT_RUN_NAME}")

In [None]:
if LOG_TO_WANDB:
  wandb.finish()