### Fine-tune Llama2 using QLoRA, Bits and Bytes and PEFT
***

- **[QLoRA](https://arxiv.org/pdf/2305.14314.pdf): Quantized Low Rank Adapters** - a method for fine-tuning LLMs that keeps the base model's weights frozen in a 4-bit quantized format and trains only a small set of low-rank adapter (LoRA) weights, which sharply reduces the memory needed for training. Instead of storing each base weight as a 32-bit or 16-bit float, weights are sorted into "bins" by value and only the bin index is stored. This is quantization; QLoRA uses a 4-bit data type called NormalFloat4 (NF4), whose bins are spaced to suit normally distributed weights. An optional extra step, double quantization, also quantizes the per-block scaling constants to conserve more memory. The quantized weights themselves are never updated: during the forward and backward passes they are dequantized on the fly to a 16-bit compute type, gradients flow through them into the adapters, and the dequantized values are then discarded.

- **[Bits and Bytes](https://github.com/TimDettmers/bitsandbytes)**: Excellent package by Tim Dettmers et al. that provides a lightweight wrapper around custom CUDA functions for 8-bit optimizers, matrix multiplication and quantization, which cut the memory and compute cost of working with LLMs.

- **[PEFT](https://github.com/huggingface/peft): Parameter Efficient Fine-tuning**: Hugging Face library that implements a number of PEFT methods, which make it less expensive to fine-tune LLMs. These methods adapt pre-trained language models (PLMs) to downstream applications without fine-tuning all of the model's parameters.
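
To make the "bins" idea from the QLoRA bullet concrete, here is a minimal pure-Python sketch of block-wise 4-bit quantization. It is an illustration under simplifying assumptions, not the bitsandbytes implementation: the 16 levels are evenly spaced rather than the NF4 code values, the block size of 4 is chosen for readability (the QLoRA paper uses 64), and the names `quantize_block` and `dequantize_block` are hypothetical.

```python
# Toy block-wise 4-bit quantization (illustrative; not the NF4 codebook).
BLOCK_SIZE = 4  # assumption for readability; QLoRA uses blocks of 64 weights
LEVELS = [-1.0 + 2.0 * i / 15 for i in range(16)]  # 16 evenly spaced bins in [-1, 1]


def quantize_block(block):
    """Return (scale, codes): the block's absmax and one 4-bit bin index per weight."""
    scale = max(abs(w) for w in block) or 1.0  # fall back to 1.0 for an all-zero block
    codes = [min(range(16), key=lambda i: abs(LEVELS[i] - w / scale)) for w in block]
    return scale, codes


def dequantize_block(scale, codes):
    """Map bin indices back to approximate float weights."""
    return [LEVELS[c] * scale for c in codes]


weights = [0.12, -0.40, 0.03, 0.25, -1.60, 0.80, 0.05, -0.10]
blocks = [weights[i:i + BLOCK_SIZE] for i in range(0, len(weights), BLOCK_SIZE)]
quantized = [quantize_block(b) for b in blocks]
restored = [w for scale, codes in quantized for w in dequantize_block(scale, codes)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(quantized)   # one (scale, codes) pair per block
print(restored)    # approximate weights after the round trip
print(max_error)   # bounded by half a bin width times the largest scale
```

Each block stores one float scale plus a 4-bit index per weight, so the per-weight cost is close to 4 bits. Scaling per block rather than per tensor keeps a single outlier (here `-1.60`) from coarsening the bins of unrelated weights.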

## Installs and imports

In [None]:
%pip install -qqq accelerate==0.22.0
%pip install -qqq bitsandbytes==0.41.1
%pip install -qqq loralib==0.1.2
%pip install -qqq peft==0.5.0


In [2]:
import os
import re
import random

from tqdm import tqdm
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from wordcloud import WordCloud

import transformers
from transformers import (
    AutoModelForCausalLM,      
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments
)

import torch
from torch.utils.data import DataLoader, Dataset

from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel, 
    get_peft_model,
    prepare_model_for_kbit_training
)

import bitsandbytes as bnb

import warnings
warnings.filterwarnings("ignore")

tqdm.pandas()
sns.set_style("dark")
plt.rcParams["figure.figsize"] = (20,8)
plt.rcParams["font.size"] = 14



In [3]:
def seed_everything(seed=None):
    
    max_seed_value = np.iinfo(np.uint32).max
    min_seed_value = np.iinfo(np.uint32).min

    # Fall back to a random seed when none is given or it is outside the uint32 range
    if (seed is None) or not (min_seed_value <= seed <= max_seed_value):
        seed = random.randint(min_seed_value, max_seed_value)
        
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    return seed

seed_everything(42)

42

***
### Load original model
- Use the Llama 2 7B-chat model for fine-tuning and quantization

In [5]:
%%time

modelx = "/kaggle/input/llama-2/pytorch/7b-chat-hf/1"
tokenizer = AutoTokenizer.from_pretrained(modelx)

model = AutoModelForCausalLM.from_pretrained(
    modelx,
    torch_dtype=torch.float16,
    device_map='auto'
)

# Calculate the size of the model in bytes
model_size_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

# Convert to megabytes (MB)
model_size_megabytes = model_size_bytes / (1024 * 1024)

print(f"Model size: {model_size_megabytes:.2f} MB")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model size: 12852.51 MB
CPU times: user 6.15 s, sys: 9.34 s, total: 15.5 s
Wall time: 3min 6s


In [6]:
!nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits

13715


In [7]:
del model
# del qmodel
torch.cuda.empty_cache()

***
### Prepare model, set up BitsAndBytes config
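- Use bitsandbytes with a 4-bit quantization configuration
- This reduces memory consumption considerably at the cost of some accuracy
- Load the weights in 4-bit NormalFloat (`nf4`) format. Double quantization, which saves further memory by also quantizing the per-block scaling constants, is available through `bnb_4bit_use_double_quant=True` but is not enabled in the config below
- The 4-bit weights are dequantized to float16 (the `bnb_4bit_compute_dtype` set below) for computation; the stored 4-bit weights themselves are not updated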
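
The memory effect of double quantization can be worked out with a little arithmetic. The sketch below assumes the block sizes given in the QLoRA paper (64 weights per block, 256 scaling constants per second-level block) and counts only the overhead of the scaling constants, not the 4 bits per weight themselves.

```python
BLOCK_SIZE = 64          # weights per quantization block (QLoRA paper)
SECOND_BLOCK_SIZE = 256  # scaling constants per second-level block (QLoRA paper)

# Plain quantization: one 32-bit scaling constant per block of 64 weights.
overhead_plain = 32 / BLOCK_SIZE

# Double quantization: constants stored in 8 bits, plus one 32-bit
# constant for every 256 first-level constants.
overhead_double = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_BLOCK_SIZE)

saved_bits_per_weight = overhead_plain - overhead_double
saved_mb_per_billion = saved_bits_per_weight * 1e9 / 8 / 1e6  # decimal megabytes

print(f"overhead without double quantization: {overhead_plain:.3f} bits/weight")
print(f"overhead with double quantization:    {overhead_double:.3f} bits/weight")
print(f"saving per billion quantized weights: {saved_mb_per_billion:.1f} MB")
```

Under these assumptions the overhead drops from 0.5 to roughly 0.127 bits per weight, which is a memory saving and does not change the resolution of the 4-bit bins.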


In [8]:
%%time

bnb_config = BitsAndBytesConfig(
    load_in_8bit=False,
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

qmodel = AutoModelForCausalLM.from_pretrained(
    modelx,
    use_safetensors=True,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)

# Calculate the size of the model in bytes
model_size_bytes = sum(p.numel() * p.element_size() for p in qmodel.parameters())

# Convert to megabytes (MB)
model_size_megabytes = model_size_bytes / (1024 * 1024)

print(f"Model size: {model_size_megabytes:.2f} MB")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model size: 3588.51 MB
CPU times: user 4.82 s, sys: 11.1 s, total: 15.9 s
Wall time: 3min 10s


In [9]:
!nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits

5011
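
The two model sizes printed above can be reproduced by hand. The sketch below assumes the standard Llama 2 7B shape (hidden size 4096, MLP size 11008, 32 decoder layers, vocabulary of 32000), none of which is stated in this notebook. It also assumes that only the linear layers inside the decoder blocks are stored in 4 bits, while the embeddings, output head and norms stay in float16.

```python
# Assumed Llama 2 7B shape; these values do not appear in the notebook.
HIDDEN, INTERMEDIATE, LAYERS, VOCAB = 4096, 11008, 32, 32000

attention = 4 * HIDDEN * HIDDEN        # q, k, v and o projections
mlp = 3 * HIDDEN * INTERMEDIATE        # gate, up and down projections
quantizable = LAYERS * (attention + mlp)

# Embeddings, output head and RMSNorm weights are assumed to stay in float16.
unquantized = 2 * VOCAB * HIDDEN + (2 * LAYERS + 1) * HIDDEN

MIB = 1024 * 1024
fp16_mib = (quantizable + unquantized) * 2 / MIB        # 2 bytes per weight
nf4_mib = (quantizable * 0.5 + unquantized * 2) / MIB   # 0.5 bytes per 4-bit weight

print(f"float16 model: {fp16_mib:.2f} MB")
print(f"4-bit model:   {nf4_mib:.2f} MB")
```

Both figures agree with the notebook output (12852.51 MB and 3588.51 MB). The per-block scaling constants are not model parameters, so they are not counted in either figure. The `nvidia-smi` readings are higher than the parameter sizes, which is expected since they cover everything held on the GPU by the process, not only the weights.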


### Preprocess the model for training
- Load the tokenizer and reuse the end-of-sequence token as the padding token
- Enable gradient checkpointing, a technique that trades computation time for memory. Instead of keeping every intermediate activation from the forward pass in memory for use in back-propagation, gradient checkpointing stores only a subset of them and recomputes the rest on the fly during the backward pass. Memory consumption drops significantly at the expense of additional computation.
- Call `prepare_model_for_kbit_training` from PEFT, which freezes the base model's parameters and readies the quantized model for adapter training
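
The store-a-subset, recompute-the-rest idea can be sketched on a toy chain of scalar layers. This is only an illustration of the principle, not what `gradient_checkpointing_enable()` does internally; the functions `layer`, `layer_grad`, `grad_full` and `grad_checkpointed` are hypothetical names introduced for this example.

```python
import math


def layer(w, x):
    """One toy 'layer': a scalar tanh unit."""
    return math.tanh(w * x)


def layer_grad(w, x):
    """Derivative of layer(w, x) with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_full(weights, x0):
    """Backprop while keeping every layer input from the forward pass."""
    inputs = [x0]
    for w in weights[:-1]:
        inputs.append(layer(w, inputs[-1]))
    grad = 1.0
    for w, x in zip(reversed(weights), reversed(inputs)):
        grad *= layer_grad(w, x)
    return grad, len(inputs)


def grad_checkpointed(weights, x0, every):
    """Backprop keeping only every `every`-th layer input; recompute the rest."""
    n = len(weights)
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):            # forward pass: store a subset
        if i % every == 0:
            checkpoints[i] = x
        x = layer(w, x)
    grad, peak = 1.0, 0
    for start in sorted(checkpoints, reverse=True):   # backward, one segment at a time
        end = min(start + every, n)
        inputs = [checkpoints[start]]
        for w in weights[start:end - 1]:       # recompute inputs inside the segment
            inputs.append(layer(w, inputs[-1]))
        # Values held at once: all checkpoints plus the recomputed segment inputs.
        peak = max(peak, len(checkpoints) + len(inputs) - 1)
        for w, xin in zip(reversed(weights[start:end]), reversed(inputs)):
            grad *= layer_grad(w, xin)
    return grad, peak


weights = [0.9, -1.1, 0.7, 1.3, -0.8, 1.05, 0.95, -1.2, 0.6, 1.15, -0.85, 1.0]
g_full, stored_full = grad_full(weights, 0.5)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.5, every=4)

print(g_full, g_ckpt)             # the two gradients agree
print(stored_full, stored_ckpt)   # 12 stored inputs vs. a peak of 6
```

The gradient is identical in both cases; the checkpointed version holds fewer values at once and pays for it by running most of the forward computation a second time.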

In [16]:
tkn = AutoTokenizer.from_pretrained(modelx)
tkn.pad_token = tkn.eos_token

qmodel.gradient_checkpointing_enable()
qmodel = prepare_model_for_kbit_training(qmodel)

### Sample question-answer

In [17]:
%%time

prompt_to_test = 'Prompt: What is the theory of relativity? Explain in max 8 sentences. \n'

pipeline = transformers.pipeline(
    "text-generation",
    model=qmodel,
    torch_dtype=torch.float16,
    device_map="auto",
    tokenizer = tkn,
)

sequences = pipeline(
    prompt_to_test,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tkn.eos_token_id,
    max_length=400,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: Prompt: What is the theory of relativity? Explain in max 8 sentences. 
The theory of relativity states that the laws of physics are the same for all observers in uniform motion relative to one another. This idea challenged the traditional concept of absolute time and space, as proposed by Isaac Newton. Einstein's theory of special relativity introduced the concept of time dilation, where time appears to pass slower for an observer in motion relative to a stationary observer. In addition, the theory of general relativity introduces the concept of gravity as a curvature of spacetime caused by the mass and energy of objects. Overall, the theory of relativity revolutionized our understanding of space and time, and has had significant implications for the development of modern physics.
CPU times: user 8.15 s, sys: 0 ns, total: 8.15 s
Wall time: 8.17 s


In [15]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
    
print_trainable_parameters(qmodel)

trainable params: 262410240 || all params: 3500412928 || trainable%: 7.496550989769399
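
The counts above need some care on a 4-bit model. The sketch below rests on two assumptions that the notebook does not state: that bitsandbytes packs two 4-bit weights into each stored element, so `numel()` reports half the logical weight count for quantized layers, and that the trainable parameters reported here are exactly the unquantized ones. The cell number (`In [15]`) suggests the count was taken before `prepare_model_for_kbit_training` ran.

```python
# Figures copied from the output above.
reported_trainable = 262_410_240
reported_all = 3_500_412_928

# Assumption: every non-trainable element holds two packed 4-bit weights.
packed_elements = reported_all - reported_trainable
logical_quantized = 2 * packed_elements

logical_all = logical_quantized + reported_trainable
logical_trainable_pct = 100 * reported_trainable / logical_all

print(f"logical parameter count: {logical_all:,}")
print(f"trainable share of logical parameters: {logical_trainable_pct:.2f}%")
```

Under these assumptions the model has about 6.74 billion logical parameters, and the reported 7.5% trainable share corresponds to roughly 3.9% of them.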
