### Fine-tune Llama2 using QLoRA, Bits and Bytes and PEFT
***

- **[QLoRA](https://arxiv.org/pdf/2305.14314.pdf): Quantized Low-Rank Adapters** - a method for fine-tuning LLMs that keeps the base model frozen in a quantized format and trains only a small number of low-rank adapter parameters, which cuts the memory needed for training. Instead of storing weights in 32- or 16-bit floating point, quantization sorts them into "bins" by value, and each weight is stored as the index of its bin in a compact format such as 4 or 8 bits. QLoRA uses a 4-bit data type called NormalFloat 4 (NF4). An added step, double quantization, also quantizes the per-block scaling constants to conserve more memory. During the forward and backward passes the quantized weights are dequantized to 16-bit precision for computation; the dequantized copies are then discarded, and only the adapter weights are updated.

- **[Bits and Bytes](https://github.com/TimDettmers/bitsandbytes)**: Excellent package by Tim Dettmers et al., which provides a lightweight wrapper around custom CUDA functions that make LLMs faster and lighter on memory: 8-bit optimizers, matrix multiplication and quantization.

- **[PEFT](https://github.com/huggingface/peft): Parameter-Efficient Fine-Tuning**: Hugging Face library that implements a number of PEFT methods, which make it less expensive to fine-tune LLMs. These methods adapt pre-trained language models (PLMs) to various downstream applications without fine-tuning all of the model's parameters.
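
To make the "bins" idea from the QLoRA description concrete, here is a minimal sketch of block-wise 4-bit quantization in plain Python. It is a simplification and not the bitsandbytes implementation: it uses 16 evenly spaced levels, whereas NF4 places its levels at quantiles of a normal distribution. The function names and the sample block are made up for illustration.

```python
def quantize_block(weights, bits=4):
    """Map each weight to one of 2**bits evenly spaced bins (illustrative only)."""
    levels = 2 ** bits
    absmax = max(abs(w) for w in weights) or 1.0   # per-block scaling constant
    step = 2 * absmax / (levels - 1)               # width of one bin
    codes = [round((w + absmax) / step) for w in weights]
    return codes, absmax

def dequantize_block(codes, absmax, bits=4):
    """Recover approximate weights from bin indices and the scaling constant."""
    levels = 2 ** bits
    step = 2 * absmax / (levels - 1)
    return [c * step - absmax for c in codes]

block = [0.12, -0.53, 0.97, -0.08, 0.31, -0.76, 0.02, 0.64]
codes, absmax = quantize_block(block)
restored = dequantize_block(codes, absmax)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes)
print(f"largest rounding error: {max_error:.4f}")
```

Each weight is stored as a 4-bit bin index plus one scaling constant per block. Double quantization applies the same idea again to those scaling constants.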

### Installs and imports

In [None]:
%pip install -qqq accelerate==0.22.0
%pip install -qqq bitsandbytes==0.41.1
%pip install -qqq loralib==0.1.2
%pip install -qqq peft==0.5.0

In [None]:
import os
import re
import random

from tqdm import tqdm
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from wordcloud import WordCloud

import transformers
from transformers import (
    AutoModelForCausalLM,      
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    TrainingArguments, 
    Trainer,
)

import torch
from torch.utils.data import DataLoader, Dataset

from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel, 
    get_peft_model,
    prepare_model_for_kbit_training
)

import bitsandbytes as bnb

import warnings
warnings.filterwarnings("ignore")

tqdm.pandas()
sns.set_style("dark")
plt.rcParams["figure.figsize"] = (20,8)
plt.rcParams["font.size"] = 14

In [None]:
def seed_everything(seed=None):
    
    max_seed_value = np.iinfo(np.uint32).max
    min_seed_value = np.iinfo(np.uint32).min

    if (seed is None) or not (min_seed_value <= seed <= max_seed_value):
        seed = random.randint(np.iinfo(np.uint32).min, np.iinfo(np.uint32).max)
        
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    return seed

seed_everything(42)

42

***
### Load original model
- Use Llama 2/7B-chat model for fine tuning and quantization

In [None]:
%%time

modelx = "/kaggle/input/llama-2/pytorch/7b-chat-hf/1"
tokenizer = AutoTokenizer.from_pretrained(modelx)

model = AutoModelForCausalLM.from_pretrained(
    modelx,
    torch_dtype=torch.float16,
    device_map='auto'
)

# Calculate the size of the model in bytes
model_size_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

# Convert to megabytes (MB)
model_size_megabytes = model_size_bytes / (1024 * 1024)

print(f"Model size: {model_size_megabytes:.2f} MB")


Model size: 12852.51 MB
CPU times: user 6.76 s, sys: 6.26 s, total: 13 s
Wall time: 3min 16s


In [None]:
!nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits

13715


In [None]:
del model
# del qmodel
torch.cuda.empty_cache()

***
### Prepare model, set up BitsAndBytes config
- Use bitsandbytes with a 4-bit quantization configuration
- This reduces memory consumption considerably at a cost of some accuracy
- Load the weights in 4-bit format using NormalFloat 4 (`nf4`). Double quantization (`bnb_4bit_use_double_quant=True`), which saves further memory, is not enabled in the config below
- Weights are dequantized to float16 (the `bnb_4bit_compute_dtype`) for computation


In [None]:
%%time

bnb_config = BitsAndBytesConfig(
    load_in_8bit=False,
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

qmodel = AutoModelForCausalLM.from_pretrained(
    modelx,
    use_safetensors=True,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)

# Calculate the size of the model in bytes
model_size_bytes = sum(p.numel() * p.element_size() for p in qmodel.parameters())

# Convert to megabytes (MB)
model_size_megabytes = model_size_bytes / (1024 * 1024)

print(f"Model size: {model_size_megabytes:.2f} MB")


Model size: 3588.51 MB
CPU times: user 4.94 s, sys: 1.36 s, total: 6.29 s
Wall time: 6.04 s
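
The two reported sizes can be reproduced with a little arithmetic. The sketch below assumes the published Llama 2 7B dimensions (vocabulary 32,000, hidden size 4,096, MLP size 11,008, 32 layers), none of which appear in this notebook. It also assumes that only the linear layers inside the transformer blocks are stored in 4 bits, while the embeddings, `lm_head` and norm weights stay in 16 bits.

```python
# Assumed Llama 2 7B dimensions (published architecture, not taken from this notebook)
VOCAB, HIDDEN, INTERMEDIATE, LAYERS = 32_000, 4_096, 11_008, 32

# Per block: 4 attention projections and 3 MLP projections
linear_params = LAYERS * (4 * HIDDEN * HIDDEN + 3 * HIDDEN * INTERMEDIATE)
embedding_params = 2 * VOCAB * HIDDEN      # input embeddings + lm_head
norm_params = (2 * LAYERS + 1) * HIDDEN    # two norms per block + the final norm

MB = 1024 * 1024
# float16: every parameter takes 2 bytes
fp16_mb = 2 * (linear_params + embedding_params + norm_params) / MB
# 4-bit: two linear weights per byte; the rest is assumed to stay at 2 bytes each
nf4_mb = (linear_params / 2 + 2 * (embedding_params + norm_params)) / MB

print(f"fp16: {fp16_mb:.2f} MB, 4-bit: {nf4_mb:.2f} MB")
```

Under these assumptions the numbers match the 12852.51 MB and 3588.51 MB printed above. The match is consistent with this reading of how the quantized weights are stored, but it does not prove it. Note that the quantization constants are not model parameters, so they are not part of this figure.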


In [None]:
!nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits

5011


***
### Sample question-answer pipeline
- Load the tokenizer
- Run it before transforming model into 'PeftModelForCausalLM'

In [None]:
%%time

tkn = AutoTokenizer.from_pretrained(modelx)
tkn.pad_token = tkn.eos_token

prompt_to_test = 'Prompt: What is the theory of relativity? Explain in max 8 sentences. \n'

pipeline = transformers.pipeline(
    "text-generation",
    model=qmodel,
    torch_dtype=torch.float16,
    device_map="auto",
    tokenizer = tkn,
)

sequences = pipeline(
    prompt_to_test,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tkn.eos_token_id,
    max_length=400,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: Prompt: What is the theory of relativity? Explain in max 8 sentences. 
>  answer: The theory of relativity, developed by Albert Einstein, is a fundamental concept in modern physics that challenges the traditional understanding of space and time. The theory posits that the laws of physics are the same for all observers in uniform motion relative to one another, and that time and space are relative rather than absolute. In other words, time can appear to slow down or speed up depending on an observer's motion and position, and the length of an object can increase or decrease depending on its speed relative to an observer. The theory of relativity has far-reaching implications, including the idea that the passage of time is relative, that time dilation occurs when objects move at high speeds, and that space and time are intertwined as a single entity called spacetime. The theory has been experimentally verified through numerous observations and experiments, and has led to a greate

***
### Preprocesses the model to ready it for training

- Enable gradient checkpointing: a technique that trades computation time for memory. Instead of keeping all intermediate activations from the forward pass in memory until back-propagation needs them, gradient checkpointing stores only a subset. During the backward pass it recomputes the missing activations on the fly. This significantly reduces memory consumption at the expense of additional computation.
- Prepare the model for training with PEFT's `prepare_model_for_kbit_training`. This includes: (1) casting the layer norms to fp32, (2) making the output embedding layer require grads, (3) upcasting the lm head to fp32.
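
The memory-for-compute trade in gradient checkpointing can be shown on a toy chain of scalar "layers". This is an illustrative sketch only, not how PyTorch or Transformers implement it; the layer function, the `every` parameter and the weights are assumptions chosen for the demo.

```python
import math

def layer(x, w):
    return math.tanh(w * x)

def layer_grad(x, w):
    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    return w * (1.0 - math.tanh(w * x) ** 2)

def backward_full(x0, weights):
    """Store every layer input, then walk backwards through the chain."""
    acts = [x0]
    for w in weights:
        acts.append(layer(acts[-1], w))
    grad = 1.0
    for i in reversed(range(len(weights))):
        grad *= layer_grad(acts[i], weights[i])
    return grad, len(acts)                      # gradient, values kept in memory

def backward_checkpointed(x0, weights, every=4):
    """Store only every `every`-th layer input and recompute the rest per segment."""
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = x
        x = layer(x, w)
    grad, peak = 1.0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + every, len(weights))
        segment = [checkpoints[start]]          # recompute inputs of this segment
        for i in range(start, end - 1):
            segment.append(layer(segment[-1], weights[i]))
        peak = max(peak, len(checkpoints) + len(segment))
        for i in reversed(range(start, end)):
            grad *= layer_grad(segment[i - start], weights[i])
    return grad, peak

weights = [0.5 + 0.05 * i for i in range(16)]
grad_full, stored_full = backward_full(0.3, weights)
grad_ckpt, stored_ckpt = backward_checkpointed(0.3, weights)
print(grad_full, grad_ckpt, stored_full, stored_ckpt)
```

Both functions return the same gradient, but the checkpointed version holds fewer values at once and pays for it with a second forward computation inside each segment.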

In [None]:
qmodel.gradient_checkpointing_enable()
qmodel = prepare_model_for_kbit_training(qmodel)

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
    
# print_trainable_parameters(qmodel)

***
### Set up the configuration for LORA
params: 
- **r** - rank parameter, which determines the rank of the adaptation matrices. Let's set r to 16.
- **lora_alpha** - scaling parameter, which modulates the size of the learned update: the low-rank update is multiplied by lora_alpha / r, so a higher alpha gives the adapter more influence. Let's opt for an alpha value of 32, i.e. a scaling factor of 2.
- **lora_dropout** - dropout rate applied in the LoRA layers, set to 5%. **task_type** designates the task as causal language modeling.
- **target_modules** - the layers that receive adapters. It can be set using our helper functions, in which case every layer they identify is included in the PEFT update. In the config below it is commented out, so PEFT falls back to its default target modules for this architecture.
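
As a rough check on these settings, the number of trainable LoRA parameters can be worked out by hand. The sketch assumes the Llama 2 7B dimensions (hidden size 4,096, 32 layers) and assumes that the default targets are the `q_proj` and `v_proj` projections in each layer; neither is stated in this notebook.

```python
HIDDEN, LAYERS = 4_096, 32      # assumed Llama 2 7B dimensions
TARGETS_PER_LAYER = 2           # assumed default targets: q_proj and v_proj
r, lora_alpha = 16, 32

# Each adapted (hidden x hidden) layer gets A (r x hidden) and B (hidden x r)
params_per_target = r * HIDDEN + HIDDEN * r
trainable = LAYERS * TARGETS_PER_LAYER * params_per_target
scaling = lora_alpha / r        # multiplier applied to the low-rank update

all_params = 3_508_801_536      # "all params" as printed after get_peft_model
percent = 100 * trainable / all_params
print(trainable, scaling, percent)
```

Under these assumptions the count matches the 8388608 trainable parameters the notebook reports. The "all params" figure is smaller than the roughly 6.7 billion weights of the model, most likely because the 4-bit layers are counted in packed form, so the trainable percentage relative to the true weight count is lower still.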

In [None]:
import re
def get_num_layers(model):
    numbers = set()
    for name, _ in model.named_parameters():
        for number in re.findall(r'\d+', name):
            numbers.add(int(number))
    return max(numbers)

def get_last_layer_linears(model):
    names = []
    
    num_layers = get_num_layers(model)
    for name, module in model.named_modules():
        if str(num_layers) in name and "encoder" not in name:
            if isinstance(module, torch.nn.Linear):
                names.append(name)
    return names

In [None]:
config = LoraConfig(
    r=16,
    lora_alpha=32,
    # target_modules=get_last_layer_linears(qmodel),
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

qmodel = get_peft_model(qmodel, config)
print_trainable_parameters(qmodel)

# with get_last_layer:
# trainable params: 1249280 || all params: 3501662208 || trainable%: 0.03567677079604818

trainable params: 8388608 || all params: 3508801536 || trainable%: 0.23907331075678143


***
### Generation config
params:
- **top_p**: nucleus sampling. Rather than greedily taking the most probable token or sampling from the entire token distribution, the model samples from the smallest set of most probable tokens whose combined probability reaches the threshold
- **temperature**: a scaling applied to the logits before the softmax that produces the token probabilities; lower values are more deterministic, since values closer to 0 reduce randomness
- **num_return_sequences**: setting it to 1 makes the model produce only one output sequence per prompt
- **max_new_tokens**: the maximum number of tokens to generate, which caps the length of the answer
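
The effect of `temperature` and `top_p` can be seen on a toy distribution. This is a simplified sketch in plain Python, not the Transformers implementation (which differs in details such as tie handling); the logits are invented for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}    # renormalized nucleus

logits = [2.0, 1.0, 0.5, -1.0]
probs_warm = softmax(logits, temperature=1.0)
probs_cool = softmax(logits, temperature=0.5)
nucleus_warm = top_p_filter(probs_warm, top_p=0.7)
nucleus_cool = top_p_filter(probs_cool, top_p=0.7)
print([round(p, 3) for p in probs_warm])
print([round(p, 3) for p in probs_cool])
print(sorted(nucleus_warm), sorted(nucleus_cool))
```

Lowering the temperature concentrates probability on the top token, so the same `top_p` threshold keeps fewer candidates.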


In [None]:
generation_config = qmodel.generation_config
generation_config.max_new_tokens = 200
generation_config.temperature = 0.5
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tkn.eos_token_id
generation_config.eos_token_id = tkn.eos_token_id

In [None]:
%%time

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

encoding = tkn(prompt_to_test, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = qmodel.generate(
        input_ids = encoding.input_ids,
        attention_mask = encoding.attention_mask,
        generation_config = generation_config
    )

print(tkn.decode(outputs[0], skip_special_tokens=True))

cuda:0
Prompt: What is the theory of relativity? Explain in max 8 sentences. 

The theory of relativity, developed by Albert Einstein, is a fundamental concept in modern physics that challenges our understanding of space and time. The theory posits that the laws of physics are the same for all observers in uniform motion relative to one another, and that the passage of time and the length of objects can vary depending on their speed and position in a gravitational field. The theory consists of two main parts: special relativity, which deals with objects moving at constant speeds relative to each other, and general relativity, which deals with gravity and its effects on spacetime. Special relativity introduces the concept of time dilation, where time appears to pass more slowly for an observer in motion relative to a stationary observer, and length contraction, where objects appear shorter to an observer in motion relative to them. General relativity introduces the concept of gravitatio

***
### Build Dataset

In [None]:
train_df = pd.read_csv("/kaggle/input/newspaper-text-summarization-cnn-dailymail/cnn_dailymail/train.csv")
test_df = pd.read_csv("/kaggle/input/newspaper-text-summarization-cnn-dailymail/cnn_dailymail/test.csv")
val_df = pd.read_csv("/kaggle/input/newspaper-text-summarization-cnn-dailymail/cnn_dailymail/validation.csv")

In [None]:
df = pd.concat([train_df, test_df, val_df], ignore_index=True) # concatenate the three splits
# del train_df, test_df, val_df # optionally delete the splits to free memory
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311971 entries, 0 to 311970
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   id          311971 non-null  object
 1   article     311971 non-null  object
 2   highlights  311971 non-null  object
dtypes: object(3)
memory usage: 7.1+ MB


In [None]:
DEFAULT_PROMPT = "Below is a snippet from a newspaper article. Be kind and summarize it!"

class SummarizationDataset(Dataset):
    def __init__(self, data, tokenizer, max_input_length=1024, max_output_length=128):
        self.data = data
        self.tokenizer = tokenizer
        self.max_input_length = max_input_length
        self.max_output_length = max_output_length
    
    def __len__(self):
        return len(self.data)
    
    @staticmethod
    def generate_train_prompt(
        article: str, summary: str, prompt: str = DEFAULT_PROMPT 
    ) -> str:
        txt = f"# Instruction: {prompt}\n"
        txt += f"# Input: {article}\n"
        txt += f"# Response: {summary}\n"
        return txt
    
    @staticmethod
    def text_cleaning(x):
        text = re.sub(r' \.\n', '. ', x)
        text = re.sub(r"http\S+", "", text)
        text = re.sub(r"\^[^ ]+", "", text)
        text = re.sub(r"@[^\s]+", "", text)
        return text
        
    def __getitem__(self, index):
        article = self.text_cleaning(self.data.iloc[index]["article"])
        summary = self.text_cleaning(self.data.iloc[index]["highlights"])
        txt = self.generate_train_prompt(article, summary)
        txt = self.tokenizer(txt, truncation=True, max_length=self.max_input_length, padding="max_length")
        return txt
#         return {"article": article,
#                 "summary": summary,
#                 "text": txt}
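
The training template above puts the summary after a `# Response:` marker. A matching inference prompt is the same template cut off right after that marker, so that the model continues with the summary. The sketch below shows the relationship on a made-up article; `generate_inference_prompt` is a hypothetical helper that is not part of the notebook.

```python
DEFAULT_PROMPT = "Below is a snippet from a newspaper article. Be kind and summarize it!"

def generate_train_prompt(article, summary, prompt=DEFAULT_PROMPT):
    # Same layout as SummarizationDataset.generate_train_prompt
    return f"# Instruction: {prompt}\n# Input: {article}\n# Response: {summary}\n"

def generate_inference_prompt(article, prompt=DEFAULT_PROMPT):
    # Hypothetical helper: identical template, stopping at the response marker
    return f"# Instruction: {prompt}\n# Input: {article}\n# Response:"

train_text = generate_train_prompt("Some article text.", "A short summary.")
infer_text = generate_inference_prompt("Some article text.")
print(infer_text)
```

Keeping the two prompts identical up to the marker means the model sees at inference time exactly the prefix it was trained on.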

In [None]:
train_dataset = SummarizationDataset(train_df, tkn)

In [None]:
def collate_stuff(batch):
    # Each item is the tokenizer output for one example (see __getitem__)
    input_ids = [torch.tensor(item['input_ids']) for item in batch]
    attention_mask = [torch.tensor(item['attention_mask']) for item in batch]
    # The tokenizer left-pads the text when it is shorter than max_input_length
    return {'input_ids': torch.stack(input_ids, dim=0),
            'attention_mask': torch.stack(attention_mask, dim=0),
           }

# if only embedded txt for __getitem__, can use:
# DataLoader(train_dataset, batch_size=5, collate_fn=transformers.DataCollatorForLanguageModeling(tkn, mlm=False))

In [None]:
from transformers import default_data_collator

train_loader = DataLoader(train_dataset, batch_size=3, shuffle=True, collate_fn=collate_stuff)

# Get the next batch
batch = next(iter(train_loader))
batch

{'input_ids': tensor([[    2,     2,     2,  ..., 29945,   869,    13],
         [    2,     2,     2,  ...,  9224,   869,    13],
         [    1,   396,  2799,  ...,  1207,   278,  5447]]),
 'attention_mask': tensor([[0, 0, 0,  ..., 1, 1, 1],
         [0, 0, 0,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]])}

In [None]:
# Slice the DataFrame to obtain 1% of the data, original is too big for demo training
subset_data = train_df.sample(frac=0.01, random_state=42)
                                   
train_dset = SummarizationDataset(subset_data, tkn)
 
len(train_dataset), len(train_dset)

(287113, 2871)

In [None]:
instruction_plus_input_txt = tkn.decode(train_dset[0]['input_ids'])[:3908]
instruction_plus_input_txt

"</s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s><s> # Instruction: Below is a snippet from a newspaper article. Be kind and summarize it!\n# Input: By . Mia De Graaf . Britons flocked to beaches across the southern coast yesterday as millions look set to bask in glorious sunshine today. Temperatures soared to 17C in Brighton and Dorset, with people starting their long weekend in deck chairs by the sea. Figures from Asda suggest the unexpected sunshine has also inspired a wave of impromptu barbecues, with sales of sausages and equipment expected to triple those in April. Sun's out: Brighton beach was packed with Britons enjoying the unexpected sunshine to start the long weekend as temperatures hit 17C across the south coast . Although frost is set to hit the south tonight - with temperatures dropping to 1C - Britons stocking up for a barbecue will be in luck tomorrow, with forecasters predicting dry and sunny weather across southern England, southern Wales and t

***
### Train

In [None]:
training_args = TrainingArguments(
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    fp16=True, 
    learning_rate=1e-4,
    num_train_epochs=1,
    output_dir='llama2_summary',
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    report_to="none"
)

trainer = Trainer(
    model=qmodel,
    train_dataset=train_dset,
    args=training_args,
    data_collator=DataCollatorForLanguageModeling(tkn, mlm=False) #collate_stuff
)
qmodel.config.use_cache = False

trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.




TrainOutput(global_step=89, training_loss=0.260432596956746, metrics={'train_runtime': 9015.5563, 'train_samples_per_second': 0.318, 'train_steps_per_second': 0.01, 'total_flos': 5.910388972663603e+16, 'train_loss': 0.260432596956746, 'epoch': 0.99})
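
The `global_step=89` and `epoch: 0.99` values in the output follow from the batch settings. The sketch below reproduces them, assuming a single GPU and assuming that a trailing incomplete accumulation group is dropped, which is what the reported numbers suggest.

```python
import math

num_samples = 2871                  # len(train_dset)
per_device_batch, grad_accum = 8, 4
num_devices = 1                     # assumption: a single GPU

batches_per_epoch = math.ceil(num_samples / (per_device_batch * num_devices))
optimizer_steps = batches_per_epoch // grad_accum
effective_batch = per_device_batch * grad_accum * num_devices
epoch_fraction = optimizer_steps * grad_accum / batches_per_epoch
samples_per_second = num_samples / 9015.5563    # train_runtime from the output

print(batches_per_epoch, optimizer_steps, effective_batch)
print(round(epoch_fraction, 2), round(samples_per_second, 3))
```

So each optimizer step sees 32 examples, and one pass over the 1% subset takes 89 steps, which is about two and a half hours at the reported throughput.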

In [None]:
qmodel.save_pretrained("llama2_summary") 

***
### Inference with fine-tuned model

In [None]:
PEFT_MODEL = "/kaggle/working/llama2_summary"

config = PeftConfig.from_pretrained(PEFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

tokenizer=AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

model = PeftModel.from_pretrained(model, PEFT_MODEL)


In [None]:
generation_config = model.generation_config
generation_config.max_new_tokens = 500
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

In [24]:
%%time

prompt = instruction_plus_input_txt.strip() + "# Response:"

encoding = tokenizer(prompt, return_tensors="pt").to(device)

with torch.inference_mode():
    outputs = model.generate(
        input_ids = encoding.input_ids,
        attention_mask = encoding.attention_mask,
        generation_config = generation_config
  )

print(tokenizer.decode(outputs[0], skip_special_tokens=True).strip())

# Instruction: Below is a snippet from a newspaper article. Be kind and summarize it!
# Input: By . Mia De Graaf . Britons flocked to beaches across the southern coast yesterday as millions look set to bask in glorious sunshine today. Temperatures soared to 17C in Brighton and Dorset, with people starting their long weekend in deck chairs by the sea. Figures from Asda suggest the unexpected sunshine has also inspired a wave of impromptu barbecues, with sales of sausages and equipment expected to triple those in April. Sun's out: Brighton beach was packed with Britons enjoying the unexpected sunshine to start the long weekend as temperatures hit 17C across the south coast . Although frost is set to hit the south tonight - with temperatures dropping to 1C - Britons stocking up for a barbecue will be in luck tomorrow, with forecasters predicting dry and sunny weather across southern England, southern Wales and the south Midlands. In Weymouth, Dorset, the sun came out in time for the town'

In [32]:
tkn.decode(train_dset[0]['input_ids'])[3909:]

'Response: People enjoyed temperatures of 17C at Brighton beach in West Sussex and Weymouth in Dorset. Asda claims it will sell a million sausages over long weekend despite night temperatures dropping to minus 1C. But the good weather has not been enjoyed by all as the north west and Scotland have seen heavy rain .\n'
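
Because `generate` returns the prompt followed by the continuation, comparing against the reference summary is easier once the generated summary is separated from the prompt. A minimal sketch, assuming the response sits on the line that follows the `# Response:` marker as in the training template; `extract_response` is a hypothetical helper and the example string is made up.

```python
RESPONSE_MARKER = "# Response:"

def extract_response(generated_text):
    """Return the first line after the response marker, or '' if it is absent."""
    _, found, tail = generated_text.partition(RESPONSE_MARKER)
    if not found:
        return ""
    return tail.strip().split("\n")[0].strip()

example = (
    "# Instruction: Summarize.\n"
    "# Input: Some article text.\n"
    "# Response: A short summary.\n"
)
print(extract_response(example))
```

In the notebook this could be applied to the decoded output of `model.generate` before printing it next to the reference summary.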

In [107]:
import zipfile

# Define the folder you want to zip
folder_to_zip = "/kaggle/working/llama2_summary"

# Define the name of the zip file
zip_file_name = "llama2_trained_model.zip"

# Create a zip archive
with zipfile.ZipFile(zip_file_name, 'w', zipfile.ZIP_DEFLATED) as zipf:
    for root, dirs, files in os.walk(folder_to_zip):
        for file in files:
            file_path = os.path.join(root, file)
            zipf.write(file_path, os.path.relpath(file_path, folder_to_zip))


In [90]:
from IPython.display import FileLink

# Display a link to download the zip file
FileLink(zip_file_name)