# High Quality LLMs

### Steps:
- Language modeling (pretrain on one or more massive datasets)
- Fine-tuning 1: supervised fine-tuning (teaches the base model to follow instructions)
- Fine-tuning 2: preference tuning (aligns output to our preferences)

Adapters are small modules added inside the transformer block for parameter-efficient fine-tuning (PEFT), so we don't need to fine-tune all model weights. Different adapters can specialize in different tasks, and you can download them as needed.

https://adapterhub.ml/

Transformer block:
- self-attention
- adapter
- feedforward neural network
- adapter
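A rough sketch of what one adapter in the block layout above might compute. It assumes the common bottleneck design (down-projection, nonlinearity, up-projection, residual connection); the sizes, the ReLU, and the zero initialization are illustrative assumptions, not something these notes specify.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, bottleneck = 16, 4  # hypothetical sizes, chosen only for illustration

# Only these two small matrices would be trained; the block's own weights stay frozen.
w_down = rng.normal(0.0, 0.02, size=(hidden_size, bottleneck))
w_up = np.zeros((bottleneck, hidden_size))  # assumption: zero init, so the adapter starts as identity

def adapter(x):
    """Down-project, apply ReLU, up-project, then add the input back (residual)."""
    h = np.maximum(x @ w_down, 0.0)
    return x + h @ w_up

x = rng.normal(size=(3, hidden_size))  # three token vectors
out = adapter(x)
adapter_params = w_down.size + w_up.size  # 16*4 + 4*16 values
```

Because the up-projection starts at zero, the untrained adapter passes its input through unchanged, which is one way to avoid disturbing the pretrained block at the start of fine-tuning.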

Alternative to adapters: low-rank adaptation (LoRA) updates only a small set of parameters and is an effective technique for PEFT.

Weight matrices hold most of an LLM's parameters. Instead of updating a full 10x10 matrix (100 parameters), we can represent the update as the product of a 10x1 and a 1x10 matrix, for a total of 20 trainable parameters, and then combine that product with the full frozen weights.
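The 10x10 example can be checked with a few lines of NumPy. This is only a sketch: the values in `A` and `B` are random stand-ins for trained values, and the `lora_alpha / r` scaling factor that LoRA applies is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 10, 1  # matrix size and rank from the example above

W = rng.normal(size=(d, d))          # frozen pretrained weights (100 values)
B = rng.normal(size=(d, r)) * 0.01   # trainable 10x1 matrix (stand-in values)
A = rng.normal(size=(r, d))          # trainable 1x10 matrix (stand-in values)

full_params = W.size            # 100
lora_params = A.size + B.size   # 20

# During training the low-rank path runs next to the frozen weights.
x = rng.normal(size=(d,))
y = W @ x + B @ (A @ x)

# After training, the low-rank product can be folded into the weights.
W_merged = W + B @ A
```

The merged matrix has the same shape as `W`, so the merged model adds no extra cost at inference time.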

Also, as we've seen, we can lower the precision, say from 32 bits to 16.
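A small sketch of what that precision drop buys and costs, using a made-up 1024x1024 array rather than real model weights:

```python
import numpy as np

weights_fp32 = np.ones((1024, 1024), dtype=np.float32)
weights_fp16 = weights_fp32.astype(np.float16)

mb_fp32 = weights_fp32.nbytes / 1024**2  # 4 bytes per value
mb_fp16 = weights_fp16.nbytes / 1024**2  # 2 bytes per value, half the memory

# The price: float16 keeps fewer significant digits, so values get rounded.
value = np.float32(0.1234567)
rounding_error = abs(float(value) - float(np.float16(value)))
```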

Quantized LLMs (techniques like QLoRA) require less VRAM. You can use this alongside distribution-aware blocks to prevent values that are close to one another from being mapped to the same quantized value.
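To see why distribution-aware levels help, the sketch below compares evenly spaced 4-bit bins with bins placed at the quantiles of the data. It is an illustration under assumptions (synthetic normally distributed "weights", 16 levels), not the actual NF4 implementation in bitsandbytes.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 1.0, size=10_000)  # synthetic stand-in for a weight tensor
n_levels = 16  # 4 bits -> 16 quantization bins

# Uniform quantization: equally wide bins between the smallest and largest value.
uniform_edges = np.linspace(weights.min(), weights.max(), n_levels + 1)
uniform_counts, _ = np.histogram(weights, bins=uniform_edges)

# Distribution-aware quantization: bin edges at the quantiles of the data,
# so every bin receives roughly the same number of values.
quantile_edges = np.quantile(weights, np.linspace(0.0, 1.0, n_levels + 1))
quantile_counts, _ = np.histogram(weights, bins=quantile_edges)

print("most crowded uniform bin: ", uniform_counts.max())
print("most crowded quantile bin:", quantile_counts.max())
```

With uniform bins, the many values near zero pile into a few central bins and collapse to the same quantized value, while the quantile-based bins spread them out.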

## Use TinyLlama for a pretrained LLM and use UltraChat for conversations between a user and an LLM

In [12]:
!pip uninstall transformers -y

Found existing installation: transformers 4.48.0
Uninstalling transformers-4.48.0:
  Successfully uninstalled transformers-4.48.0


In [13]:
!pip install transformers

Collecting transformers
  Using cached transformers-4.48.0-py3-none-any.whl.metadata (44 kB)
Using cached transformers-4.48.0-py3-none-any.whl (9.7 MB)
Installing collected packages: transformers
Successfully installed transformers-4.48.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [1]:
from transformers import AutoTokenizer
from datasets import load_dataset

template_tokenizer = AutoTokenizer.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
)

def format_prompt(example):
    """Format the prompt using the <|user|> template TinyLlama is using"""
    chat = example["messages"]
    prompt = template_tokenizer.apply_chat_template(chat, tokenize=False)
    return {"text": prompt}

dataset = (
    load_dataset("HuggingFaceH4/ultrachat_200k", split="test_sft")
        .shuffle(seed=42)
        .select(range(3_000))
)
dataset = dataset.map(format_prompt)

Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

In [2]:
dataset["text"][2120]

'<|user|>\nWrite a 2000-word fantasy story in a third-person omniscient point of view about a young boy named Max who is transported to a magical world but eventually returns to the normal world. The story must include at least three magical creatures, a description of Max\'s journey through the magical world, and a plot twist in the resolution of the story. The writing style must be descriptive and imaginative, with rich sensory details and vivid imagery.</s>\n<|assistant|>\nMax was just an average boy without any particular magic or talent. He lived in a highly technological world, surrounded by gadgets and machines that could do almost everything for him. However, Max always enjoyed reading fantastic stories about brave knights, wise wizards, and mythical creatures.\n\nOne day, when Max was walking home from school, he found a strange book on the sidewalk. The cover was adorned with exotic symbols and colorful illustrations, and the title read "The Chronicles of the Forgotten Realm.

## Quantization
bitsandbytes doesn't work well on macOS :( I'll need a workaround.

In [3]:
!pip install bitsandbytes


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=False,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype="float16",
#     bnb_4bit_use_double_quant=True, # nested quantization
# )

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    # leave this out for regular SFT
    # quantization_config=bnb_config,
)
model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"

In [None]:
!pip install peft

## LoRA Configuration

In [5]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

peft_config = LoraConfig(
    lora_alpha=32, # scales the change added to the original weights (scaling = lora_alpha / r); here 1/2 the value of r
    lora_dropout=0.1,
    r=64, # rank of compressed matrices
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "gate_proj", "v_proj", "up_proj", "q_proj", "o_proj", "down_proj"]
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

  warn("The installed version of bitsandbytes was compiled without GPU support. "


'NoneType' object has no attribute 'cadam32bit_grad_fp32'


## Training Config

In [6]:
from transformers import TrainingArguments

output_dir = "./results"
training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    logging_steps=10,
    fp16=False, # usually True; False here because I am on a Mac
    gradient_checkpointing=True
)
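As a quick sanity check on these settings, the arithmetic below estimates the effective batch size and the number of optimizer steps for one epoch over the 3,000 selected examples. It assumes a single device and one training sample per dataset row, which may not hold if the trainer packs or drops sequences.

```python
import math

num_examples = 3_000                 # rows selected from UltraChat above
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1                      # assumption: one GPU (or one Mac)

# Gradients from 4 mini-batches of 2 are accumulated before each weight update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)  # 8

optimizer_steps = math.ceil(num_examples / effective_batch_size)  # 375
print(effective_batch_size, optimizer_steps)
```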

## Training
Even on a GPU this will take an hour or so. I can't continue locally and will have to run it in the cloud or on a GPU when I have the budget.

In [None]:
!pip install trl

In [10]:
from trl import SFTTrainer

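# Note: this marks every weight as trainable, including the base weights that
# get_peft_model froze, so the run is a full fine-tune rather than LoRA-only
for param in model.parameters():
    param.requires_grad = True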

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    #dataset_text_field="text",
    tokenizer=template_tokenizer,
    args=training_arguments,
    # max_seq_length=512,
    # leave this out for regular SFT (which we do since mac)
    # peft_config=peft_config,
)

trainer.train()

  trainer = SFTTrainer(


Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

KeyboardInterrupt: 

In [12]:
trainer.model.save_pretrained("TinyLlama-1.1B-qlora")

## Merge Weights

In [13]:
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    "TinyLlama-1.1B-qlora",
    low_cpu_mem_usage=True,
    device_map="auto",
)

merged_model = model.merge_and_unload()



KeyError: 'base_model.model.model.lm_head'

In [14]:
from transformers import pipeline

prompt = """<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
"""

pipe = pipeline(task="text-generation", model=merged_model, tokenizer=tokenizer)
pipe(prompt)[0]["generated_text"]

NameError: name 'merged_model' is not defined

A downside to public benchmarks is that models can be overfitted to these benchmarks to generate the best responses. 

Human evaluation is the gold standard. Example: https://lmarena.ai/, where people compare LLM outputs.

Preference tuning with a reward system (accept/reject scores):
- Collect preference data
- Train a reward model
- Use the reward model to fine-tune the LLM (the reward model operates as the preference evaluator)
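The "train a reward model" step is often done with a pairwise loss on accepted/rejected pairs. The sketch below assumes that common formulation, `-log(sigmoid(r_chosen - r_rejected))`; the function name and the scalar inputs are hypothetical, and a real reward model would produce these scores from text.

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-margin))."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

# The loss is small when the accepted answer scores higher than the rejected one...
good_ranking = pairwise_reward_loss(2.0, 0.0)
# ...and large when the reward model ranks the pair the wrong way round.
bad_ranking = pairwise_reward_loss(0.0, 2.0)
print(good_ranking, bad_ranking)
```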

Llama 2 trains two reward models:
- Helpfulness
- Safety

A common method with a trained reward model is Proximal Policy Optimization (PPO), although it has the disadvantage of having to train two models (the reward model and the LLM).

Direct Preference Optimization (DPO) doesn't use a separate reward model; it lets the LLM itself act as the reward model by comparing the output of a frozen model with that of the trainable model.
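The comparison between the frozen and the trainable model can be written as a loss on a single preference pair. This is a minimal sketch of the standard DPO objective, assuming the four inputs are summed token log-probabilities of each response; the function and argument names are hypothetical, and `beta=0.1` is just an illustrative default.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    # How much more (or less) likely the trainable model makes each response
    # compared with the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    return math.log1p(math.exp(-logits))

# Trainable model identical to the frozen one: no preference learned yet.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Trainable model has shifted probability toward the chosen response.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(start, improved)
```

The loss only falls when the trainable model raises the chosen response relative to the rejected one more than the frozen model does, which is how the LLM ends up acting as its own reward signal.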

DPO is generally more stable than PPO.

## Templating alignment data

In [16]:
from datasets import load_dataset

def format_prompt(example):
    """Format prompt to TinyLlama"""
    system = "<|system|>\n" + example["system"] + "</s>\n"
    prompt = "<|user|>\n" + example["input"] + "</s>\n<|assistant|>\n"
    chosen = example["chosen"] + "</s>\n"
    rejected = example["rejected"] + "</s>\n"

    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

dpo_dataset = load_dataset(
    "argilla/distilabel-intel-orca-dpo-pairs", split="train"
)

dpo_dataset = dpo_dataset.filter(
    lambda r:
        r["status"] != "tie" and
        r["chosen_score"] >= 8 and
        not r["in_gsm8k_train"]
)
dpo_dataset = dpo_dataset.map(
    format_prompt,
    remove_columns=dpo_dataset.column_names
)
dpo_dataset

Downloading readme:   0%|          | 0.00/10.1k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/79.2M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12859 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12859 [00:00<?, ? examples/s]

Map:   0%|          | 0/5922 [00:00<?, ? examples/s]

Dataset({
    features: ['chosen', 'rejected', 'prompt'],
    num_rows: 5922
})

## Quantization
Neither 4-bit quantization nor bitsandbytes works on macOS.

In [18]:
from peft import AutoPeftModelForCausalLM
from transformers import BitsAndBytesConfig, AutoTokenizer

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)

model = AutoPeftModelForCausalLM.from_pretrained(
    "TinyLlama-1.1B-qlora",
    low_cpu_mem_usage=True,
    device_map="auto",
    #quantization_config=bnb_config,
)
merged_model = model.merge_and_unload()

model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"

In [19]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

peft_config = LoraConfig(
    lora_alpha=32,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "gate_proj", "v_proj", "up_proj", "q_proj", "o_proj", "down_proj"]
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

## Training Config

In [20]:
# run for 200 steps instead of 1 epoch as it may take hours to train
from trl import DPOConfig

output_dir = "./results"

training_arguments = DPOConfig(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    max_steps=200,
    logging_steps=10,
    fp16=False, # True for non-mac or cuda
    gradient_checkpointing=True,
    warmup_ratio=0.1
)

## Training

In [24]:
from trl import DPOTrainer

dpo_trainer = DPOTrainer(
    model,
    args=training_arguments,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    # beta=0.1,
    # max_prompt_length=512,
    # max_length=512,
)
dpo_trainer.train()

  dpo_trainer = DPOTrainer(


Extracting prompt from train dataset:   0%|          | 0/5922 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/5922 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/5922 [00:00<?, ? examples/s]

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


RuntimeError: MPS backend out of memory (MPS allocated: 15.72 GB, other allocations: 2.24 GB, max allowed: 18.13 GB). Tried to allocate 225.12 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

In [25]:
dpo_trainer.model.save_pretrained("TinyLlama-1.1B-dpo-qlora")

## SFT+DPO

In [27]:
from peft import AutoPeftModelForCausalLM, PeftModel

# merge LoRA and base model
model = AutoPeftModelForCausalLM.from_pretrained(
    "TinyLlama-1.1B-qlora",
    low_cpu_mem_usage=True,
    device_map="auto",
)
sft_model = model.merge_and_unload()

# merge DPO LoRA and SFT
dpo_model = PeftModel.from_pretrained(
    sft_model,
    "TinyLlama-1.1B-dpo-qlora",
    device_map="auto",
)
dpo_model = dpo_model.merge_and_unload()



KeyError: 'base_model.model.model.lm_head'

Odds Ratio Preference Optimization (ORPO) combines SFT and DPO into a single training process. This removes the need to perform two separate training runs while still allowing the use of QLoRA.
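A rough sketch of how a single ORPO-style loss can cover both goals: an SFT term on the chosen response plus a weighted odds-ratio term that pushes the chosen response above the rejected one. It assumes the inputs are average per-token log-probabilities (so they are negative); the function names and the weight `lam=0.1` are illustrative assumptions.

```python
import math

def log_odds(avg_logp):
    """log(p / (1 - p)), with p = exp(avg_logp) as the length-normalized probability."""
    p = math.exp(avg_logp)
    return math.log(p) - math.log(1.0 - p)

def orpo_loss(chosen_avg_logp, rejected_avg_logp, lam=0.1):
    # SFT part: negative log-likelihood of the chosen response.
    sft_loss = -chosen_avg_logp
    # Preference part: -log(sigmoid(log odds ratio of chosen vs. rejected)).
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    odds_ratio_loss = math.log1p(math.exp(-log_odds_ratio))
    return sft_loss + lam * odds_ratio_loss

prefers_chosen = orpo_loss(-0.5, -2.0)
prefers_rejected = orpo_loss(-2.0, -0.5)
print(prefers_chosen, prefers_rejected)
```

Unlike the DPO setup above, this needs no frozen reference model, since both terms come from the model being trained.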