# Fine-tune Llama 2 in Google Colab
> 🗣️ Fine-tune Llama 2 and other language models in Google Colab


## Install the libraries

In [1]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7 datasets

## Load the libraries

In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
import pandas as pd
from datasets import Dataset


## ***Instruction Finetuning***

This notebook demonstrates how to **instruction fine-tune** a large language model (LLM) using the PEFT library and the `SFTTrainer` from the TRL library.



1. The code installs the **necessary libraries** and imports the relevant modules.
2. It then defines the configuration for the fine-tuning process, including the model name, the dataset to be used, the training arguments, and the optimizer.
3. The code then loads the LLM and the dataset, and initializes the `SFTTrainer` with a LoRA (PEFT) configuration.
4. Finally, it calls the `train()` method of the trainer to start the fine-tuning process.


This example is written for the Llama 2 chat model (`daryl149/llama-2-7b-chat-hf`) and the Lamini docs dataset (https://huggingface.co/datasets/lamini/lamini_docs). Note that the model-loading cell actually uses `openlm-research/open_llama_3b_v2`, with `meta-llama/Llama-2-7b-chat-hf` left commented out as an alternative. The notebook provides a basic framework for fine-tuning LLMs using PEFT and `SFTTrainer`. By changing the model, the dataset, and the training arguments, the same steps can be adapted to a variety of tasks and applications.

In [4]:
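# Note: this template places the answer inside the [INST] ... [/INST] span.
# The Llama 2 chat format usually puts the assistant's reply after [/INST].
prompt_template = """ <s>[INST] <<SYS>> You are a honest and helpful assistant who helps users find answers quickly from the given docs about Lamini.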
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.
If the answer can not be found in the text please respond with `Let's keep the discussion relevant to Lamini docs`. <</SYS>>

### Question: {question}
### Answer: {answer}
[/INST] </s>
"""

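For comparison, here is a minimal sketch of a template in which the answer follows `[/INST]`, which is the layout commonly used for Llama 2 chat models. The names `SYSTEM_PROMPT`, `chat_template`, and `build_training_text` are hypothetical and not part of the notebook, the question and answer strings are placeholders, and whether this layout improves results for the model used here is untested.

```python
# Hypothetical alternative template: the answer follows [/INST], so the text
# after the instruction span is what the model is trained to produce.
SYSTEM_PROMPT = (
    "You are an honest and helpful assistant who helps users find answers "
    "quickly from the given docs about Lamini."
)

chat_template = (
    "<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
    "{question} [/INST] {answer} </s>"
)


def build_training_text(question: str, answer: str, system: str = SYSTEM_PROMPT) -> str:
    """Format one question/answer pair as a single training string."""
    return chat_template.format(
        system=system,
        question=question.strip(),
        answer=answer.strip(),
    )


# Placeholder strings, not taken from the dataset
sample = build_training_text("Example question?", "Example answer.")
print(sample)
```

With a template like this, an inference prompt would stop right after `[/INST]`, leaving the answer for the model to generate.
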
## Prepare the Dataset

In [5]:
# Dataset to be used
# https://huggingface.co/datasets/lamini/lamini_docs
qa_data = load_dataset("lamini/lamini_docs", split="train")

# Wrap every question/answer pair in the prompt template
qa_finetuning_dataset = []
for row in qa_data:
    text_with_prompt_template = prompt_template.format(
        question=row["question"], answer=row["answer"]
    )
    qa_finetuning_dataset.append({"text": text_with_prompt_template})

print("One sample from the data:")
print(qa_finetuning_dataset[0])

# Convert the list of dicts back into a Hugging Face Dataset for the trainer
dataset = Dataset.from_list(qa_finetuning_dataset)



Generating train split: 1260 examples

Generating test split: 140 examples

One sample from the data:
{'text': " <s>[INST] <<SYS>> You are a honest and helpful assistant who helps users find answers quickly from the given docs about Lamini.\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.\nIf you don't know the answer to a question, please don't share false information.\nIf the answer can not be found in the text please respond with `Let's keep the discussion relevant to Lamini docs`. <</SYS>>\n\n### Question: How can I evaluate the performance and quality of the generated text from Lamini models?\n### Answer: There are several metrics that can be used to evaluate the performance and quality of generated text from Lamini models, including perplexity, BLEU score, and human evaluation. Perplexity measures how well the model predicts the next word in a sequence, while BLEU score measures the similarity between the generated text and a reference text. Human evaluation involves having huma

## Create the model config

In [6]:
################################################################################
# bitsandbytes parameters
################################################################################

quant_config = BitsAndBytesConfig(
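    load_in_4bit=True, # Load the base model with 4-bit quantized weights
    bnb_4bit_quant_type="nf4", # Quantization type ("fp4" or "nf4"); nf4 is the 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.float16, # Dtype used for computation on the dequantized weights
    bnb_4bit_use_double_quant=False # Nested quantization of the quantization constants (disabled here)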
)
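
To give a rough sense of why 4-bit loading matters on a Colab GPU, the sketch below estimates the memory taken by the weights alone at different precisions. The helper `weight_memory_gib` and the round figure of 3 billion parameters are assumptions for illustration. The estimate ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 3e9  # assumed round figure for a "3B" model

fp16_gib = weight_memory_gib(num_params, 16)  # half-precision weights
nf4_gib = weight_memory_gib(num_params, 4)    # 4-bit quantized weights

print(f"fp16 weights:  {fp16_gib:.2f} GiB")
print(f"4-bit weights: {nf4_gib:.2f} GiB")
```
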

## Load the model and tokenizer

In [7]:
# llama_base_model_name = "meta-llama/Llama-2-7b-chat-hf"
llama_base_model_name = "openlm-research/open_llama_3b_v2"

# Path to save the new model / adapter weights
optimized_llama_model = "open-llama-3b-v2-chat-trilok-lamini"

llama_tokenizer = AutoTokenizer.from_pretrained(llama_base_model_name, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"

llama_base_model = AutoModelForCausalLM.from_pretrained(
    llama_base_model_name,
    quantization_config=quant_config,
    device_map={"": 0} # Load the entire model on the GPU 0
)
llama_base_model.config.use_cache = False
llama_base_model.config.pretraining_tp = 1


## LoRA config for fine-tuning the model

In [8]:
################################################################################
# LoRA Config: LoRA parameters
################################################################################
peft_config = LoraConfig(
    lora_alpha=16, # Alpha parameter for LoRA scaling
    lora_dropout=0.1, # Dropout probability for LoRA layers
    r=8, # Rank of the LoRA update matrices; a lower rank means fewer trainable parameters
    bias="none", # Which biases to train: "none", "all", or "lora_only"
    task_type="CAUSAL_LM"
)
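
As a worked illustration of what `r` and `lora_alpha` control, the sketch below counts trainable parameters for a single square weight matrix, with and without LoRA, and computes the scaling factor `lora_alpha / r` applied to the low-rank update. The helper `lora_param_counts` and the hidden size of 3200 are hypothetical values chosen for the example, not read from the model.

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable parameter counts for one weight matrix."""
    full = d_out * d_in        # updating the d_out x d_in matrix W directly
    lora = r * (d_in + d_out)  # A is r x d_in, B is d_out x r
    return full, lora


r, lora_alpha = 8, 16  # values from the LoRA config
d_model = 3200         # hypothetical hidden size, for illustration only

full, lora = lora_param_counts(d_model, d_model, r)
scaling = lora_alpha / r  # factor applied to the low-rank update B @ A

print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}  scaling: {scaling}")
```
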

## Training Parameters

In [9]:
################################################################################
# TrainingArguments parameters
################################################################################

training_params = TrainingArguments(
    output_dir="./llama_finetuning", # Output directory where the model predictions and checkpoints will be stored
    num_train_epochs=1, # Number of training epochs
    per_device_train_batch_size=4, # Batch size per GPU for training
    gradient_accumulation_steps=1, # Number of update steps to accumulate the gradients for
    optim="paged_adamw_32bit", # Optimizer to use
    save_steps=25, # Save a checkpoint every X update steps
    logging_steps=25, # Log every X update steps
    learning_rate=2e-4, # Initial learning rate (AdamW optimizer)
    weight_decay=0.001, # Weight decay to apply to all layers except bias/LayerNorm weights
    fp16=False, # Enable fp16 mixed-precision training
    bf16=False, # Enable bf16 mixed-precision training (set to True with an A100)
    max_grad_norm=0.3, # Maximum gradient norm (gradient clipping)
    max_steps=-1, # Total number of training steps; overrides num_train_epochs when positive (-1 disables it)
    warmup_ratio=0.03, # Ratio of steps for a linear warmup (from 0 to learning rate)
    group_by_length=True, # Group sequences of similar length into batches; reduces padding, saves memory, and speeds up training
    lr_scheduler_type="constant", # Learning rate scheduler; note that "constant" does not apply the warmup above
    report_to="tensorboard"
)

## Train the model

In [10]:
# Trainer
llama_fine_tuning = SFTTrainer(
    model=llama_base_model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=llama_tokenizer,
    args=training_params
)

# Training
llama_fine_tuning.train()





| Step | Training Loss |
|------|---------------|
| 25 | 1.9695 |
| 50 | 0.6634 |
| 75 | 0.8117 |
| 100 | 0.5414 |
| 125 | 0.7293 |
| 150 | 0.5285 |
| 175 | 0.7716 |
| 200 | 0.489 |
| 225 | 0.7319 |
| 250 | 0.522 |




TrainOutput(global_step=315, training_loss=0.7371621510339161, metrics={'train_runtime': 511.7695, 'train_samples_per_second': 2.462, 'train_steps_per_second': 0.616, 'total_flos': 2753001178752000.0, 'train_loss': 0.7371621510339161, 'epoch': 1.0})
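
The reported `global_step=315` can be checked by hand from the training arguments. The sketch below redoes that arithmetic, assuming a single GPU and the 1260-example train split. It also shows why the last checkpoint saved with `save_steps=25` falls at step 300.

```python
import math

num_examples = 1260              # size of the train split
per_device_batch_size = 4        # per_device_train_batch_size
gradient_accumulation_steps = 1
num_epochs = 1                   # num_train_epochs
save_steps = 25

# One optimizer step consumes batch_size * accumulation_steps examples
steps_per_epoch = math.ceil(
    num_examples / (per_device_batch_size * gradient_accumulation_steps)
)
total_steps = steps_per_epoch * num_epochs

# Checkpoints are written at multiples of save_steps
last_checkpoint = (total_steps // save_steps) * save_steps

print(total_steps, last_checkpoint)
```
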

In [None]:
# query = "How can I evaluate the performance and quality of the generated text from Lamini models"
# text_gen = pipeline(task="text-generation", model=llama_base_model, tokenizer=llama_tokenizer, max_length=4096)
# output = text_gen(f"<s>[INST] {query} [/INST]")
# print(output[0]['generated_text'])

Inference using the

In [None]:
# Generate predictions
query = "How can I evaluate the performance and quality of the generated text from Lamini models"
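# The tokenizer adds its own BOS token, so the literal <s> here results in two <s> tokens in the decoded output
prompt = f"<s>[INST] {query} [/INST]"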
inputs = llama_tokenizer(prompt, return_tensors='pt')
inputs = inputs.to(0)
output = llama_base_model.generate(inputs['input_ids'], max_new_tokens=500)
response = llama_tokenizer.decode(output[0].tolist())
print(response)



<s><s>[INST] How can I evaluate the performance and quality of the generated text from Lamini models [/INST]<unk><unk><unk><unk><unk> [the output continues with repeated <unk> tokens and is truncated]
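
Because `decode()` returns the prompt together with the generated tokens, it can help to separate the two before reading the result. The sketch below does this with plain string handling. `extract_answer` is a hypothetical helper and the decoded string is a placeholder, not real model output. Post-processing like this only tidies the text; it does not address a generation that consists of `<unk>` tokens.

```python
def extract_answer(decoded: str, marker: str = "[/INST]") -> str:
    """Return the text generated after the last instruction marker."""
    # rpartition returns the whole string as the tail if the marker is absent
    _, _, tail = decoded.rpartition(marker)
    # Drop the end-of-sequence token left in by decode() and trim whitespace
    return tail.replace("</s>", "").strip()


# Placeholder string in the same shape as the decoded response
decoded = "<s><s>[INST] example question [/INST] example answer</s>"
print(extract_answer(decoded))
```
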

In [None]:
# Merge and save the fine-tuned model
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    llama_base_model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
llama_model = PeftModel.from_pretrained(base_model, '/content/llama_finetuning/checkpoint-300')
llama_model = llama_model.merge_and_unload()

# Reload tokenizer to save it
llama_tokenizer = AutoTokenizer.from_pretrained(llama_base_model_name, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"

# Save the merged model
llama_model.save_pretrained(optimized_llama_model)
llama_tokenizer.save_pretrained(optimized_llama_model)

OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 14.75 GiB of which 9.06 MiB is free. Process 2158 has 14.74 GiB memory in use. Of the allocated memory 14.46 GiB is allocated by PyTorch, and 150.20 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

In [None]:
# Generate predictions with the merged model (requires the merge cell above to have succeeded)
query = "How can I evaluate the performance and quality of the generated text from Lamini models"
prompt = f"<s>[INST] {query} [/INST]"
inputs = llama_tokenizer(prompt, return_tensors='pt')
inputs = inputs.to(0)
output = llama_model.generate(inputs['input_ids'], max_new_tokens=500)
response = llama_tokenizer.decode(output[0].tolist())
print(response)

In [None]:
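# Empty VRAM
import gc

del llama_base_model   # 4-bit model used for training
del llama_fine_tuning  # SFTTrainer instance
gc.collect()
torch.cuda.empty_cache()  # Release cached CUDA memory held by PyTorch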