# **Fine-Tune Llama 2 Model**

**There are two main fine-tuning techniques:**
1. Supervised Fine-Tuning (SFT)
2. Reinforcement Learning from Human Feedback (RLHF)

Supervised Fine-Tuning (SFT):

In Supervised Fine-Tuning, the model is trained on a dataset of instructions and responses. Training adjusts the weights of the LLM to minimize the difference between the generated answers and the ground-truth responses, which act as labels.
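
As a rough illustration of "minimizing the difference" between generated and ground-truth answers, the sketch below computes a token-level negative log-likelihood, which is the usual objective for causal language models. The probabilities are made up for the example and `sft_loss` is a hypothetical helper, not part of any library.

```python
import math

def sft_loss(correct_token_probs):
    """Mean negative log-likelihood of the ground-truth tokens."""
    return -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)

# Hypothetical probabilities the model assigns to each ground-truth token
# before and after some fine-tuning steps (illustrative values only).
before = [0.5, 0.25, 0.125]
after = [0.9, 0.8, 0.7]

print(f"loss before: {sft_loss(before):.4f}")
print(f"loss after:  {sft_loss(after):.4f}")
```

The loss drops as the model puts more probability on the labelled tokens, which is the direction the weight updates push it during SFT.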

**In this notebook, Supervised Fine-Tuning is performed**

**Step 01: Install All the Required Libraries**

In [1]:
# transformers: provides AutoTokenizer and the model classes
# datasets: loads and wraps datasets in the Hugging Face format
# accelerate: needed by transformers to place the model on a device (device_map)
# peft: fine-tunes the Llama 2 model with lower compute and memory requirements; PEFT methods train only a small number of (extra) model parameters
# trl: provides SFTTrainer; trl is a wrapper that can be used for Supervised Fine-Tuning or for Reinforcement Learning from Human Feedback
# bitsandbytes: quantization, because the model is not going to be used in full precision
!pip install -q -U transformers datasets accelerate peft trl bitsandbytes

**Step 02: Set the Hugging Face Token as an Environment Variable**

In [2]:
import os
os.environ["HF_TOKEN"] = "<YOUR_HF_TOKEN>"  # paste your own token here; never publish a real token

**Step 03: Import All the Required Libraries**

In [3]:
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTTrainer

**Step 04: Fine-Tune the Llama 2 model using Supervised Fine-Tuning (SFT)**

There are three ways in which we can fine-tune the model using Supervised Fine-Tuning (SFT).

1. Full Fine-Tuning

2. LoRA

3. QLoRA

Full Fine-Tuning: Full fine-tuning uses the entire model and trains all of its weights, which is very costly.

LoRA: Instead of training all the weights, LoRA adds small adapters to some layers and trains only the added weights. This reduces the cost of training because only around 1-2% of the total weights are trained.


QLoRA: QLoRA uses LoRA on top of a base model that has been quantized. If the weights of the LLM are stored in 16 bits, QLoRA quantizes them to 4 bits, so precision is lost in exchange for a much smaller memory footprint.
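
To make the LoRA saving concrete, the sketch below counts trainable parameters for a single weight matrix. It assumes a 4096 x 4096 projection (the hidden size of Llama-2-7b) and a rank of 8 (the PEFT default); `lora_param_counts` is a hypothetical helper written for this illustration.

```python
def lora_param_counts(d_out, d_in, r):
    """Return (full, adapter) parameter counts for one weight matrix.

    Full fine-tuning trains the whole d_out x d_in matrix.
    LoRA trains two low-rank factors instead: B (d_out x r) and A (r x d_in).
    """
    full = d_out * d_in
    adapter = r * (d_out + d_in)
    return full, adapter

# Assumed shapes: one 4096 x 4096 projection, rank 8.
full, adapter = lora_param_counts(4096, 4096, 8)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {100 * adapter / full:.2f}%")
```

This is a per-matrix ratio. The share of the whole model that is trained depends on which layers receive adapters and on the chosen rank.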

**In this notebook, I fine-tune the Llama 2 model with 7 billion parameters. A T4 GPU has 15 GB of VRAM (GPU memory), which is barely enough to store the Llama-2-7b weights (7B x 2 bytes = 14 GB in FP16). We also need to account for the overhead of optimizer states, gradients, and forward activations.**

**To reduce VRAM usage (GPU memory usage), we fine-tune the Llama 2 model in 4-bit precision, which is why QLoRA is used here.**
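
The arithmetic above can be checked with a few lines of Python. This is a back-of-the-envelope sketch that counts only the weights (no optimizer states, gradients, activations, or quantization constants) and uses decimal gigabytes, as the 14 GB figure does; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 7e9  # Llama-2-7b, rounded

fp16_gb = weight_memory_gb(num_params, 16)
int4_gb = weight_memory_gb(num_params, 4)
print(f"FP16 weights:  {fp16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
```

Storing the weights in 4 bits cuts their footprint to a quarter, which leaves room on a 15 GB card for the adapters and the training overhead.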

In [4]:
import pandas as pd
df = pd.read_csv("/content/trainingmain.csv")

In [5]:
df = df[["question","contexts","ground_truth","answers"]]

In [6]:
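# Build the training text from the question and the answer.
# Note: this " ->: " format differs from the "### Instruction / ### Response" prompt used at inference time further below.
df['text'] = df.apply(lambda row: row['question'] + " ->: " + row['answers'], axis = 1)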
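
The training text here joins question and answer with " ->: ", while the inference cells later in this notebook prompt the model with an "### Instruction / ### Response" template. One possible way to keep the two aligned is to build both from the same template. The helpers below are hypothetical and are not what this notebook ran.

```python
# Same template string as the inference cells in this notebook.
PROMPT_TEMPLATE = "### Instruction:\n{question}\n\n### Response:\n"

def build_prompt(question):
    """Prompt shown to the model at inference time."""
    return PROMPT_TEMPLATE.format(question=question)

def build_training_text(question, answer):
    """Training example: the same prompt followed by the reference answer."""
    return build_prompt(question) + answer

example = build_training_text("What is covid19", "A disease caused by SARS-CoV-2.")
print(example)
```

With a DataFrame like the one above, this could be applied as `df.apply(lambda row: build_training_text(row['question'], row['answers']), axis=1)`.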

In [7]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

In [8]:
from datasets import Dataset, DatasetDict
# Only the training split is wrapped into a DatasetDict
train_dataset_dict = DatasetDict({
    "train": Dataset.from_pandas(train_df),
})

In [9]:
train_dataset_dict

DatasetDict({
    train: Dataset({
        features: ['question', 'contexts', 'ground_truth', 'answers', 'text', '__index_level_0__'],
        num_rows: 212
    })
})

In [10]:
# Base model
base_model = "NousResearch/Llama-2-7b-chat-hf"
# Name of the fine-tuned model
new_model = "llama-2-7b-platypus"
# Use the dataset built above from the CSV file
dataset = train_dataset_dict
# Load the Llama 2 tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
# Llama 2 has no padding token, which is a problem because the rows of the dataset have different numbers of tokens.
# The rows must be padded to the same length. Here the end-of-sequence (EOS) token is reused as the padding token,
# which has an impact on the generations of the fine-tuned model.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
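
To show what right-padding with the EOS token does to a batch, here is a minimal pure-Python sketch. The token ids are made up apart from the EOS id, which is 2 in the Llama 2 tokenizer, and `pad_right` is a hypothetical helper rather than the tokenizer's own implementation.

```python
def pad_right(batch, pad_id):
    """Pad every sequence on the right to the length of the longest one.

    Returns the padded ids and an attention mask (1 = real token, 0 = padding).
    """
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask

EOS_ID = 2  # reused as the padding id, as in the cell above

batch = [[1, 15, 27, 9], [1, 42]]  # illustrative token ids of different lengths
input_ids, attention_mask = pad_right(batch, EOS_ID)
print(input_ids)
print(attention_mask)
```

The attention mask is what tells the model which positions are padding, since the padded ids alone are indistinguishable from real EOS tokens.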


In [11]:
dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'contexts', 'ground_truth', 'answers', 'text', '__index_level_0__'],
        num_rows: 212
    })
})

In [12]:
# QLoRA configuration
# Quantization configuration
# To reduce VRAM usage, the model is loaded in 4-bit precision
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    # Quantization type
    # The "nf4" format was introduced in the QLoRA paper
    bnb_4bit_quant_type="nf4",
    # The weights are stored in 4 bits, but computation is done in 16 bits, which gives more accuracy
    bnb_4bit_compute_dtype=torch.float16,
    # Double quantization: the quantization parameters are themselves quantized
    bnb_4bit_use_double_quant=True,
)


# LoRA configuration
peft_config = LoraConfig(
    # lora_alpha sets the strength of the adapters. In LoRA, instead of training all the weights, adapters are added to some layers
    # and only the added weights are trained.
    # A low alpha gives the adapters very little weight in the layer output; a high alpha gives them a big weight.
    # The adapter update is scaled by lora_alpha / r. r is not set here, so the PEFT default rank applies.
    # 15 is used here; 32 is often considered a standard value for this parameter.
    lora_alpha=15,
    #10% dropout
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map={"": 0}
)

model.config.use_cache = False
model.config.pretraining_tp = 1

# Cast the layernorm in fp32, make output embedding layer require grads, add the upcasting of the lmhead to fp32
# prepare_model_for_kbit_training applies these preparation steps to the quantized model so that it can be trained
model = prepare_model_for_kbit_training(model)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
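
To give a feel for the precision that 4-bit storage gives up, here is a toy round trip through a 4-bit code. It uses plain uniform absmax quantization over the signed range -7..7, which is an assumption made for simplicity. It is not the NF4 scheme that bitsandbytes uses, and the weights are invented.

```python
def quantize_absmax_4bit(weights):
    """Map floats to integer codes in -7..7 using one shared scale.

    Assumes at least one weight is non-zero.
    """
    scale = max(abs(w) for w in weights) / 7
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.7, -0.3, 0.12, 0.02]  # illustrative values
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print("codes:   ", codes)
print("restored:", [round(r, 3) for r in restored])
print("errors:  ", [round(e, 3) for e in errors])
```

Small weights suffer most because they fall between two adjacent codes. Computing in 16 bits after dequantization does not bring the lost information back, but it avoids adding further rounding error during the forward pass.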



In [13]:
# Set training arguments
training_arguments = TrainingArguments(
        output_dir="./results",
        num_train_epochs=1,  # 3 to 5 epochs are good for the Llama 2 model
        per_device_train_batch_size=4,  # Number of samples per device in every training step
        gradient_accumulation_steps=1,
        evaluation_strategy="steps",  # Not helpful here: the goal is to train the model, not to evaluate it
        eval_steps=1000,
        logging_steps=25,
        optim="paged_adamw_8bit",  # Paged 8-bit version of the AdamW optimizer, so it uses less memory
        learning_rate=2e-4,
        lr_scheduler_type="linear",
        warmup_steps=10,
        report_to="tensorboard",
        max_steps=-1,  # -1 means no step limit; with max_steps=2, training would stop after two steps
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],
    eval_dataset=dataset,  # No separate evaluation dataset; the same training data is reused
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,  # The dataset was created with a 2k-token context threshold, but sequences that long take too much VRAM, so they are cut at 512 tokens
    tokenizer=tokenizer,
    args=training_arguments,
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

Map:   0%|          | 0/212 [00:00<?, ? examples/s]

Map:   0%|          | 0/212 [00:00<?, ? examples/s]

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Step,Training Loss,Validation Loss
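
A quick count of optimizer steps suggests why the table above has no rows. The sketch assumes a single GPU and uses the values from this notebook: 212 training rows, a batch size of 4, 1 epoch, `logging_steps=25`, and `eval_steps=1000`, with gradient accumulation left at 1. `steps_per_epoch` is a hypothetical helper.

```python
import math

def steps_per_epoch(num_rows, per_device_batch_size, num_devices=1):
    """Number of batches, and therefore optimizer steps, in one epoch."""
    return math.ceil(num_rows / (per_device_batch_size * num_devices))

num_train_epochs = 1
total_steps = steps_per_epoch(212, 4) * num_train_epochs

# Steps at which the trainer would log the loss and run evaluation.
logged_at = [s for s in range(1, total_steps + 1) if s % 25 == 0]
evaluated_at = [s for s in range(1, total_steps + 1) if s % 1000 == 0]

print("total steps:", total_steps)
print("logged at:", logged_at)
print("evaluated at:", evaluated_at)
```

Under these assumptions, training ends long before the first evaluation step. That is a likely reason the notebook progress table stays empty, since with step-based evaluation it is filled when an evaluation runs. Lowering `eval_steps` or raising the epoch count would change that.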


In [14]:
%load_ext tensorboard
%tensorboard --logdir results/runs

Launching TensorBoard...

In [24]:
questions = train_df["question"].to_list()

In [None]:
# Note: test_questions, senetencewindowretriever and adv_answers are not defined in this notebook
for question in test_questions:
    response = senetencewindowretriever.query(f"{question}")
    adv_answers.append(response.response)

In [20]:
# Run the text generation pipeline with our model
# Input prompt
prompt = "What is covid19"
# Wrap the prompt in the instruction template
instruction = f"### Instruction:\n{prompt}\n\n### Response:\n"
# Use the pipeline from Hugging Face
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=128)
result = pipe(instruction)
# Trim the response by removing the instruction manually
print(result[0]['generated_text'][len(instruction):])


Covid-19 is a disease caused by the SARS-CoV-2 virus. It was first identified in Wuhan, China in December 2019 and has since spread globally, causing a pandemic. The virus can cause mild to severe respiratory illness, including pneumonia and acute respiratory distress syndrome (ARDS). It can also cause other complications, such as kidney failure, sepsis, and encephalitis.

##


In [31]:
# Run the text generation pipeline with our model on every question
# Build the Hugging Face pipeline once, outside the loop
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=128)

generated_answers = []
for question in questions:
    # Wrap the question in the instruction template
    instruction = f"### Instruction:\n{question}\n\n### Response:\n"
    result = pipe(instruction)
    # Trim the response by removing the instruction manually
    answer = result[0]['generated_text'][len(instruction):]
    print(answer)

    generated_answers.append(answer)




Fine-tuning in BERT refers to the process of adapting the pre-trained BERT model to a specific task or domain by adding task-specific layers on top of the pre-trained weights. This process allows the model to learn task-specific representations while still leveraging the knowledge gained during pre-training.

Pre-training and fine-tuning are closely related in BERT. During pre-training, the model is
The IL-1β treatment in A549 cells leads to the activation of the MAPK pathway, which is involved in cell proliferation. The IL-1β/IL-1R axis plays a crucial role in A549 cell proliferation by activating the MAPK
Cytokine storm plays a significant role in the severe evolution of COVID-19. The virus triggers an exaggerated immune response, leading to the production of pro-inflammatory cytokines, such as TNF-alpha and IL-6. These cytokines can cause a cascade of events, including the activation of immune cells, the production of more cytokines, and the activation of coagulation pathways. This

In [32]:
train_df["gen_answers"] = generated_answers

In [33]:
train_df.head()

| | question | contexts | ground_truth | answers | text | gen_answers |
|---|---|---|---|---|---|---|
| 172 | What is the purpose of fine-tuning in BERT and... | [' before the pandemic." it would be tokenized... | The purpose of fine-tuning in BERT is to adapt... | Fine-tuning in BERT refers to the process of t... | What is the purpose of fine-tuning in BERT and... | \nFine-tuning in BERT refers to the process of... |
| 232 | What is the impact of IL-1β treatment on the M... | [' stimulation, the IL-1β expression in the pr... | IL-1β treatment results in a significant incre... | The IL-1β treatment in A549 cells leads to a t... | What is the impact of IL-1β treatment on the M... | The IL-1β treatment in A549 cells leads to the... |
| 18 | What is the role of cytokine storm in the seve... | [' (CALR) 68 69 that endow mutated CALR protei... | The cytokine storm can contribute to the sever... | The cytokine storm in COVID-19 is thought to p... | What is the role of cytokine storm in the seve... | Cytokine storm plays a significant role in the... |
| 90 | What mental health challenges do Canadian Vete... | ["Media coverage of Canadian Veterans, with a ... | Canadian Veterans face challenges such as post... | Canadian Veterans face mental health challenge... | What mental health challenges do Canadian Vete... | \nCanadian Veterans face various mental health... |
| 182 | Can the DCA model predict new SARS-CoV-2 varia... | ["2 variants are significantly better accordin... | Yes, the DCA model can predict new SARS-CoV-2 ... | Yes, the DCA model can predict new SARS-CoV-2 ... | Can the DCA model predict new SARS-CoV-2 varia... | \nThe DCA model can predict new SARS-CoV-2 var... |


In [34]:
train_df.to_csv("finedtunedanswers.csv")

**The response keeps repeating because the end-of-sequence token is used as the padding token. If you do not want to see this behaviour, use a different padding technique.**
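
One plausible mechanism behind this behaviour: language-modeling collators such as Hugging Face's `DataCollatorForLanguageModeling` (with `mlm=False`) exclude every label equal to the padding id from the loss. If the padding id is also the EOS id, a real EOS at the end of an answer is masked as well, so the model gets no training signal for when to stop. The sketch below imitates that masking in pure Python. The token ids and the dedicated pad id 32000 are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the loss function skips

def mask_padding(labels, pad_id):
    """Replace every label equal to pad_id with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in labels]

EOS_ID = 2
DEDICATED_PAD_ID = 32000  # hypothetical id of a separately added pad token

# A sequence that ends with one real EOS and is then padded by two positions.
shared_pad = mask_padding([1, 42, 77, EOS_ID, EOS_ID, EOS_ID], pad_id=EOS_ID)
separate_pad = mask_padding(
    [1, 42, 77, EOS_ID, DEDICATED_PAD_ID, DEDICATED_PAD_ID], pad_id=DEDICATED_PAD_ID
)

print("pad == eos:   ", shared_pad)
print("dedicated pad:", separate_pad)
```

With a dedicated padding token, the real EOS stays in the labels and only the padding positions are ignored. This helps only if the training text actually ends with an EOS token, which is worth checking separately.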

In [16]:
# Empty VRAM
del model
del pipe
del trainer
import gc
gc.collect()
torch.cuda.empty_cache()  # release the cached GPU memory held by PyTorch
gc.collect()

0

**Merge the Base Model with the Trained Adapter**

In [17]:
# Reload the base model in FP16 so that it can be merged with the LoRA weights
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
# Load the trained QLoRA adapter on top of the base model, then merge it into the weights
model = PeftModel.from_pretrained(model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
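
Conceptually, merging folds each adapter into the weight matrix it was attached to: W' = W + (alpha / r) * B A, where A and B are the low-rank LoRA factors. The toy example below works this out for a 2 x 2 matrix with rank 1. The numbers are invented and `merge_lora` is a hypothetical helper, not the PEFT implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    delta = matmul(B, A)
    return [
        [W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
        for i in range(len(W))
    ]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2 x 2)
B = [[1.0], [2.0]]            # LoRA factor B (2 x 1)
A = [[0.5, 0.0]]              # LoRA factor A (1 x 2)

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)
```

After merging, the adapter no longer exists as a separate module, so inference runs on a plain model with no extra adapter computation.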



**Push the Fine-Tuned Model and Tokenizer to the Hugging Face Hub**

In [18]:
import locale
# Force UTF-8 as the preferred encoding (works around a locale error in Colab)
locale.getpreferredencoding = lambda: "UTF-8"

In [36]:
#!huggingface-cli login
# Replace the placeholder with your own write token; never publish a real token
model.push_to_hub("RupeshYadav/FinedTunedLLAMA2", check_pr=True, use_auth_token="<YOUR_HF_TOKEN>")
tokenizer.push_to_hub("RupeshYadav/FinedTunedLLAMA2d", check_pr=True, use_auth_token="<YOUR_HF_TOKEN>")

README.md:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

ValueError: The generation config instance is invalid -- `.validate()` throws warnings and/or exceptions. Fix these issues to save the configuration.

Thrown during validation:
[UserWarning('`do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.'), UserWarning('`do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.')]
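
The error above says the generation config combines `do_sample=False` with sampling-only settings (`temperature`, `top_p`). A common workaround is to make the config consistent before pushing, either by enabling sampling or by unsetting those two fields. The sketch below shows the logic on a stand-in object so that it runs without a model. `make_generation_config_consistent` is a hypothetical helper, and applying it to `model.generation_config` is an assumption about where the offending values live.

```python
from types import SimpleNamespace

def make_generation_config_consistent(gen_config, keep_sampling=False):
    """Resolve the do_sample / temperature / top_p conflict in place."""
    if keep_sampling:
        # Keep temperature and top_p, and turn sampling on so they are used.
        gen_config.do_sample = True
    else:
        # Keep greedy decoding and unset the sampling-only parameters.
        gen_config.temperature = None
        gen_config.top_p = None
    return gen_config

# Stand-in with the values reported in the error message.
cfg = SimpleNamespace(do_sample=False, temperature=0.9, top_p=0.6)
make_generation_config_consistent(cfg)
print(cfg)

# On the real model, this would be called as (assumption):
# make_generation_config_consistent(model.generation_config)
# before model.push_to_hub(...)
```

Choosing `keep_sampling=True` instead preserves the sampling settings shipped with the base model, at the cost of non-deterministic generations by default.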