<a href="https://colab.research.google.com/github/r3lativo/fine-tuning-for-sa/blob/main/llama3_8b_instruct_ft.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning Llama3 for text generation
## Project Overview
In this notebook, we take one of the latest models from Meta, Llama 3 8B Instruct, and fine-tune it for text generation. The training data is an English instruction-following dataset generated by GPT-4 from Alpaca prompts and intended for fine-tuning LLMs.

The main difference from standard fine-tuning is that we combine quantization with LoRA (together, QLoRA), so training updates only a very small fraction of the model's parameters while the rest stay frozen in 4-bit form.

In [None]:
# If on Google Colab, install these packages
!pip install -qU bitsandbytes datasets accelerate transformers peft huggingface_hub trl


In [None]:
# Import necessary libraries
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
import transformers
from transformers import (
    TrainingArguments,
    AutoTokenizer,
    AutoConfig,
    AutoModelForCausalLM,
    BitsAndBytesConfig)
from google.colab import userdata
from datasets import load_dataset
from trl import SFTTrainer
from peft import AutoPeftModelForCausalLM


In [None]:
# Access the model via HF
huggingface_token = userdata.get('HF_TOKEN')
!huggingface-cli login --token $huggingface_token

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
# Main settings
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("vicgalle/alpaca-gpt4")

## Preprocessing

In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 52002
    })
})


In [None]:
print(dataset['train'][0])

{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for sta

In [None]:
dataset_subset = dataset["train"].select(range(1_000))

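We will also wrap each example in a fixed prompt template, so that the model sees the same layout during training and inference. Note that the task is inverted with respect to the dataset: given an `output` (the context), the model must produce the `instruction` that could have led to it.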

In [None]:
def generate_prompt(example, return_response=True) -> list[str]:
    """Build the prompt for one example, wrapped in a one-element list."""
    full_prompt = "Generate a simple instruction that could result in the provided context."
    full_prompt += f"[INST]CONTEXT: {example['output']}[/INST]"

    # Append the target instruction for training; omit it at inference time
    if return_response:
        full_prompt += "INSTRUCTION: "
        full_prompt += f"{example['instruction']}"
    return [full_prompt]

In [None]:
generate_prompt(dataset_subset[0])[0]

'Generate a simple instruction that could result in the provided context.[INST]CONTEXT: 1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.[/INST]INSTRUCTION: Give three tips for staying healthy.'

## Quantization

Model weights are usually represented in full precision, that is 32-bit floating point:

![image.png](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Float_example.svg/590px-Float_example.svg.png)

Instead of the 32-bit precision, we can get an **almost identical inference outcome with 16-bit half-precision**. This halves the model size and enables us to work with a lighter version of it.

QLoRA goes further and stores the weights in 4 bits, eight times fewer than full precision (32 / 8 = 4), while still achieving strong results. Some precision is lost, but the savings in memory, time, and energy are usually worth it.
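
To make the savings concrete, here is a back-of-the-envelope calculation of the memory needed for the weights alone at each precision. It is only a sketch: the parameter count is a round assumption for an "8B" model, and the figures ignore quantization constants, activations, and optimizer state.

```python
# Rough weight-only memory footprint at different precisions.
# ASSUMPTION: 8e9 is a round figure for an "8B" model, not an exact count.
N_PARAMS = 8_000_000_000


def weight_memory_gb(n_params: int, bits_per_weight: int) -> float:
    """Storage for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


footprints = {bits: weight_memory_gb(N_PARAMS, bits) for bits in (32, 16, 4)}

for bits, gb in footprints.items():
    print(f"{bits:>2}-bit weights: {gb:.1f} GB")
```

Going from 32-bit to 4-bit divides the weight storage by eight, which is what lets a model of this size fit on a single consumer or Colab GPU.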

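The main trade-off is some added latency: 4-bit weights cannot be used directly in arithmetic, so each time a layer is used, in training or inference, its weights are dequantized to 16-bit (half-precision) for the computation. The stored 4-bit base weights stay frozen; only the LoRA adapter weights are updated during training.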
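
The quantize/dequantize round trip can be illustrated with a toy example. The sketch below uses a simple absmax scheme with signed 4-bit integer codes; this is an assumption chosen for readability and is not the NF4 data type that bitsandbytes implements. The function names are hypothetical.

```python
# Toy blockwise absmax quantization to signed 4-bit integer codes.
# ASSUMPTION: simplified illustrative scheme, NOT the NF4 format used by bitsandbytes.


def quantize_4bit(block: list[float]) -> tuple[list[int], float]:
    """Map floats to integer codes in [-7, 7], plus one scale for the block."""
    absmax = max(abs(x) for x in block) or 1.0  # fall back to 1.0 for an all-zero block
    scale = absmax / 7
    codes = [round(x / scale) for x in block]
    return codes, scale


def dequantize_4bit(codes: list[int], scale: float) -> list[float]:
    """Recover approximate floats so they can be used in a computation."""
    return [c * scale for c in codes]


weights = [0.12, -0.40, 0.05, 0.33]
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:   ", codes)
print("restored:", [round(r, 4) for r in restored])
print("max abs error:", round(max_error, 4))
```

The restored values are close to the originals but not identical: the rounding error is bounded by half the scale, which is the precision given up in exchange for the smaller storage.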

We combine LoRA and quantization through PEFT (Parameter-Efficient Fine-Tuning). The process starts by downloading the published model weights, which are quantized to 4-bit as they are loaded onto the GPU. We then initialize the LoRA configuration and **train the adapters on top of the 4-bit weights, dequantizing to 16-bit for each computation.**


In [None]:
# Specify how to quantize the model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Load the model on our GPU in a 4-bit quantized state
    bnb_4bit_use_double_quant=True,  # Also quantize the quantization constants to save more memory
    bnb_4bit_quant_type="nf4",  # NormalFloat4, the 4-bit data type introduced in the QLoRA paper

    # Storing numbers in 4-bit is great, but computing with them directly is not.
    # Weights are DEQUANTIZED to this dtype each time they are used in an operation,
    # while the stored copy stays in 4-bit.
    # This adds some overhead at training and inference time; it is the price we pay.
    bnb_4bit_compute_dtype=torch.bfloat16,
)


In [None]:
# Define model
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_cache=False,
    device_map="auto"
)

In [None]:
base_model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps

In [None]:
prompt = "Retrieval Augmented Generation is"

In [None]:
#%time
input_ids = tokenizer(prompt,
                      return_tensors="pt",
                      truncation=True).input_ids.cuda()

outputs = base_model.generate(input_ids=input_ids,
                              max_new_tokens=100,
                              do_sample=True,
                              top_p=0.95,
                              temperature=0.4)

tokenizer.batch_decode(outputs.detach().cpu().numpy(),
                       skip_special_tokens=True)[0]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 6.44 µs


'Retrieval Augmented Generation is a new paradigm for language generation, where a generator is trained to produce a response conditioned on the retrieval of relevant context. However, the current retrieval augmented generation methods are limited to a single context. In this paper, we propose a new paradigm called Multi-Context Retrieval Augmented Generation (MCRAG) for language generation, where a generator is trained to produce a response conditioned on the retrieval of multiple contexts. To this end, we propose a novel model called Multi-Context Retrieval Augmented'

In [None]:
# The Llama 3 tokenizer does not define a PAD token, so we reuse EOS for padding
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "right"
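
Because padding reuses the EOS id, the token ids alone can no longer tell padding apart from a genuine end-of-sequence token; the attention mask carries that information. The toy sketch below shows the idea on plain lists. The token ids are made up, except 128001, which is the `eos_token_id` reported in the generation warnings; `pad_right` is a hypothetical helper, not a tokenizer API.

```python
# Toy right-padding where the pad id is the same as the EOS id.
# ASSUMPTION: token ids are invented for illustration, apart from 128001.
EOS_ID = 128001
PAD_ID = EOS_ID


def pad_right(seqs: list[list[int]], pad_id: int) -> tuple[list[list[int]], list[list[int]]]:
    """Pad every sequence on the right and build the matching attention mask."""
    max_len = max(len(s) for s in seqs)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return input_ids, attention_mask


batch = [[11, 12, 13, EOS_ID], [21, EOS_ID]]
input_ids, attention_mask = pad_right(batch, PAD_ID)

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

In the second sequence the real EOS and the two padding tokens share an id, but only the real one has a mask value of 1. This is why the warnings ask for an explicit `attention_mask` when pad and EOS coincide.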

## Model Architecture

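Inspect the model's structure before configuring LoRA, so that you know the names of the modules the adapters can be attached to.

Following the original LoRA paper, we focus on the attention weights. (The QLoRA paper reports that adapting all linear layers works better, at the cost of more trainable parameters.)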
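
As a reminder of what an adapter does to a targeted module, here is a minimal sketch of the LoRA forward pass on plain Python lists: the frozen weight `W` is left untouched and a low-rank product `B A`, scaled by `alpha / r`, is added to its output. The dimensions and values are tiny, hypothetical stand-ins (the notebook uses 4096-dimensional projections and `r=8`).

```python
# Minimal LoRA forward pass: y = W x + (alpha / r) * B (A x)
# ASSUMPTION: toy sizes (d=4, r=2) and made-up values, for illustration only.


def matvec(m: list[list[float]], v: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, r):
    base = matvec(W, x)               # frozen path
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


d, r, alpha = 4, 2, 8
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen weight (identity here)
A = [[0.1 * (i + j) for j in range(d)] for i in range(r)]           # r x d, trainable
B = [[0.0] * r for _ in range(d)]                                   # d x r, starts at zero
x = [1.0, 2.0, 3.0, 4.0]

y_init = lora_forward(W, A, B, x, alpha, r)   # B is zero, so the adapter has no effect yet

B[0] = [1.0, 0.0]                             # pretend training changed one row of B
y_trained = lora_forward(W, A, B, x, alpha, r)

print(y_init)
print(y_trained)
```

With `B` initialized to zero the adapted layer reproduces the frozen layer exactly, so fine-tuning starts from the pretrained behavior and only drifts as `A` and `B` are trained.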

In [None]:
base_model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps

In [None]:
# LoRA Configuration
lora_config = LoraConfig(
    r=8,  # the 'rank' of the decomposed matrices we will use to represent our weights
    lora_alpha=32,
    target_modules=["q_proj"],  # apply LoRA to the query projection only; "k_proj" and "v_proj" can be added to cover more attention weights
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

base_model = prepare_model_for_kbit_training(base_model)
model = get_peft_model(base_model, lora_config)

In [None]:
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): L

In [None]:
model.print_trainable_parameters()
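
The number of trainable parameters can also be estimated by hand from the printed architecture. The sketch below assumes what the configuration and model printout show: adapters on `q_proj` only, rank 8, a 4096-to-4096 projection, and 32 decoder layers. `lora_params` is a hypothetical helper, not a PEFT function.

```python
# Hand count of LoRA parameters for the configuration used in this notebook.
# ASSUMPTION: adapters on q_proj only, as in target_modules=["q_proj"].


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in one adapter: A is (r x in_features), B is (out_features x r)."""
    return r * in_features + out_features * r


N_LAYERS = 32
RANK = 8

per_layer = lora_params(in_features=4096, out_features=4096, r=RANK)
trainable = N_LAYERS * per_layer

print(f"per layer: {per_layer:,}")
print(f"total trainable: {trainable:,}")
```

That is about two million trainable parameters, a tiny fraction of a model with roughly eight billion parameters.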

In [None]:
training_args = TrainingArguments(
    output_dir="meta-llama-3-ft",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit", # from the QLoRA paper
    logging_steps=1,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True, # ensure proper upcasting for compute dtypes
    #tf32=True,
    lr_scheduler_type="constant",
)
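
To get a feel for how long training runs under the arguments above, the step count can be estimated by hand. This is a rough sketch assuming a single GPU, the 1,000-example subset selected earlier, and no packing or dropped examples; `n_examples` and `n_gpus` are illustrative names, not values read from the trainer.

```python
import math

# ASSUMPTIONS: single GPU, 1,000 training examples, no packing.
n_examples = 1_000
n_gpus = 1

# Values copied from the TrainingArguments above
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_train_epochs = 3

# One optimizer step consumes this many examples
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * n_gpus
steps_per_epoch = math.ceil(n_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
```

Gradient accumulation trades time for memory: the GPU only ever holds 4 examples at once, while each weight update still averages over 8.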

In [None]:
max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset_subset,
    peft_config=lora_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    formatting_func=generate_prompt,
    args=training_args,
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
trainer.train()



| Step | Training Loss |
|------|---------------|
| 1    | 0.5865        |
| 2    | 0.5718        |
| 3    | 0.5517        |




TrainOutput(global_step=3, training_loss=0.5699800848960876, metrics={'train_runtime': 41.498, 'train_samples_per_second': 0.072, 'train_steps_per_second': 0.072, 'total_flos': 69184713129984.0, 'train_loss': 0.5699800848960876, 'epoch': 3.0})

In [None]:
trainer.save_model()

In [None]:
# Reload the base model in 4-bit and attach the saved LoRA adapter
model = AutoPeftModelForCausalLM.from_pretrained(
    training_args.output_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    quantization_config=bnb_config,  # replaces the deprecated load_in_4bit=True
)
tokenizer = AutoTokenizer.from_pretrained(training_args.output_dir)


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [None]:
sample = dataset_subset[3]

prompt = generate_prompt(sample, return_response=False)

In [None]:
input_ids = tokenizer(prompt[0], return_tensors="pt", truncation=True).input_ids.cuda()

outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9, temperature=0.5)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [None]:
print(f"Prompt:\n{prompt[0]}\n")
print(f"-------------")
print(f"Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt[0]):]}")
print(f"-------------")
print(f"Ground truth:\n{sample['instruction']}")

Prompt:
Generate a simple instruction that could result in the provided context.[INST]CONTEXT: The odd one out is Telegram. Twitter and Instagram are social media platforms mainly for sharing information, images and videos while Telegram is a cloud-based instant messaging and voice-over-IP service.[/INST]

-------------
Generated instruction:
CONTEXT: The odd one out is Telegram. Twitter and Instagram are social media platforms mainly for sharing information, images and videos while Telegram is a cloud-based instant messaging and voice-over-IP service.
-------------
Ground truth:
Identify the odd one out.
