# Fine-tuning LLAMA2-7b using the QLoRA method

### Installing the Libraries

1. **`transformers`:**
Transformers is a popular and powerful open-source library for natural language processing (NLP) tasks. It provides a wide range of pre-trained models, tokenizers, and tools for various NLP applications. Here's an overview of its features and uses:
  * Pre-trained Models: Transformers offers a vast collection of pre-trained models, including BERT, GPT-2, and DistilBERT, which are trained on massive amounts of text data and can be fine-tuned for specific tasks.

  * Model Architecture: It provides a variety of transformer architectures, such as encoder-only, decoder-only, and encoder-decoder models, suitable for different NLP tasks like text classification, machine translation, and question answering.

  * Tokenization: Transformers includes tokenization tools for processing and converting text into numerical representations that can be understood by transformer models.

  * Fine-tuning: It provides easy-to-use tools for fine-tuning pre-trained models on specific datasets, allowing adaptation to new tasks and domains.

  * Pipelines: Transformers offers pipelines for common NLP tasks, such as text classification, summarization, and question answering, simplifying the process of applying models to real-world applications.


2. **`peft`:**
PEFT (Parameter-Efficient Fine-Tuning) is a library for adapting large language models (LLMs) while training only a small number of parameters. The base model's weights stay frozen, which cuts the memory and compute needed for fine-tuning. The relevant techniques include:
  * Low-rank adaptation (LoRA): This technique freezes the original weight matrices and learns a pair of small low-rank matrices whose product is added to them. Only the low-rank matrices are trained, so the number of trainable parameters drops sharply, usually with little loss in accuracy.

  * Prompt tuning: This technique keeps the model frozen and learns a small set of task-specific "soft prompt" embeddings that are prepended to the input.

  * QLoRA: This approach combines LoRA with a base model whose weights are quantized to 4-bit precision (through `bitsandbytes`). The quantized weights stay frozen while the LoRA adapters are trained in higher precision, which further reduces memory use.


3. **`accelerate`:** Accelerate is a library that lets the same PyTorch training code run on different hardware setups with minimal changes. It works alongside techniques such as:
  * Distributed data parallel (DDP): This technique trains a PyTorch model on multiple GPUs, with each GPU processing a different slice of every batch.

  * Gradient checkpointing: This technique discards most intermediate activations during the forward pass and recomputes them during the backward pass. This reduces memory use at the cost of extra computation, so training becomes somewhat slower.

  * Mixed precision training: This technique uses a combination of single-precision (32-bit) and half-precision (16-bit) floating-point numbers to train a PyTorch model. This can make training faster and reduce memory use, usually with little effect on accuracy.
  
  Accelerate can be used to train PyTorch models on a variety of hardware platforms, including CPUs, GPUs, and TPUs.


4. **`bitsandbytes`:** Bitsandbytes is a library that provides 8-bit and 4-bit quantization for PyTorch models, along with 8-bit optimizers. Quantization shrinks the memory footprint of a model, usually with only a small loss in accuracy.

  In this notebook, bitsandbytes is used to load the base model in 4-bit precision so that it fits in the memory of a single GPU.


5. **`trl`:** TRL (Transformer Reinforcement Learning) is a library for training large language models (LLMs) with supervised fine-tuning and reinforcement learning (RL). Its features include:
  * Training methods: TRL covers supervised fine-tuning (`SFTTrainer`, used in this notebook), reward modeling, and proximal policy optimization (PPO).

  * Integration with Hugging Face libraries: TRL is built on top of Transformers and works with PEFT, so LoRA adapters can be trained directly.

  * Easy-to-use API: TRL provides trainer classes that hide most of the training loop.

In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

### Importing the Libraries

In [None]:
import os
import torch
# from datasets import load_dataset
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

### Setting the Hyperparameter Values

In [None]:
# 1. The name of the model that you are going to tune
model_name = "NousResearch/llama-2-7b-chat-hf"

# 2. The name of the dataset to use
# dataset_name = "mlabonne/guanaco-llama2-1k"

# 3. Fine-tuned model name
fine_tuned_model = "llama-2-7b-mini-guanco"

# ####################### LoRA Parameters ########################

# # 4. LoRA attention (rank) dimension
# lora_r = 4

# # 5. Alpha parameter for LoRA scaling
# lora_alpha = 16

# # 6. Dropout probability for LoRA layers
# lora_dropout = 0.1

# ####################### bitsandbytes Parameters ########################

# 7. Activate 4bit precision base model loading
use_4bit = True

# # 8. Compute dtype for 4bit base models
# bnb_4bit_compute_dtype = "float16"

# # 9. Quantization type (fp4 or nf4)
# bnb_4bit_quant_type = "nf4"

# # 10. Activate nested quantization for 4-bit base models (double quantization)
# use_nested_quant = False

# ####################### TrainingArguments Parameters ########################

# 11. Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# # 12. Number of training epochs
# num_epochs = 1

# # 13. Enable fp16/bf16 training (set bf16 to True with an A100)
# bf16 = False
# fp16 = False

# # 14. Batch size per GPU for training
# per_device_train_batch_size = 1

# # 15. Batch size per GPU for evaluation
# per_device_evaluation_batch_size = 1

# # 16. Number of updates steps to accumulate gradient for
# gradient_accumulation_steps = 1

# # 17. Enable gradient checkpointing
# gradient_checkpointing = True

# # 18. Maximum gradient norm (gradient clipping)
# max_grad_norm = 0.3

# # 19. Initial learning rate (AdamW optimizer)
# learning_rate = 2e-4

# # 20. Weight decay applied to all layers except bias/layer normalization weights
# weight_decay = 0.001


# # 21. Optimizer to use
# optimizer = "paged_adamw_32bit"

# # 22. Learning rate scheduler (constant a bit better than cosine)
# lr_scheduler_type = "constant"

# # 23. Number of training steps (overrides train_epochs)
# max_steps = -1

# # 24. Ratio of steps for a linear warmup (from 0 to learning rate)
# warmup_ratio = 0.03

# # 25. Group sequences into batches with same length, saves memory and speeds up training considerably
# group_by_length = True

# # 26. Save a checkpoint every X update steps
# save_steps = 25

# # 27. Log every X update steps
# logging_steps = 25

# ####################### SFT Parameters ########################

# # 28. Maximum sequence length to use
# max_sequence_length = None

# # 29. Pack multiple short examples in the same input sequence to increase efficiency
# packing = False

# # 30. Load the entire model on GPU 0
# device_map = {"": 0}

### Downloading the Dataset

In [None]:
!pip install -q kaggle
from google.colab import files

# Choose the kaggle.json file that you downloaded
files.upload()

# Make a directory named .kaggle (if it does not exist yet) and copy the kaggle.json file there.
!mkdir -p ~/.kaggle

!cp kaggle.json ~/.kaggle/

# Change the permissions of the file.
!chmod 600 ~/.kaggle/kaggle.json


# !kaggle datasets list

Saving kaggle.json to kaggle.json


In [None]:
# Downloading the dataset
!kaggle datasets download -d "vaibhavkumar11/hindi-english-parallel-corpus"

Downloading hindi-english-parallel-corpus.zip to /content
 85% 95.0M/112M [00:00<00:00, 193MB/s]
100% 112M/112M [00:00<00:00, 197MB/s] 


In [None]:
# Extracting the dataset
!unzip "/content/hindi-english-parallel-corpus.zip"

Archive:  /content/hindi-english-parallel-corpus.zip
  inflating: hindi_english_parallel.csv  


### Loading the dataset

In [None]:
# Setting the data directory
data_dir = "/content/hindi_english_parallel.csv"

# Loading the dataset
dataset = pd.read_csv(data_dir)
dataset.head()

|   | hindi | english |
|---|-------|---------|
| 0 | अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें | Give your application an accessibility workout |
| 1 | एक्सेर्साइसर पहुंचनीयता अन्वेषक | Accerciser Accessibility Explorer |
| 2 | निचले पटल के लिए डिफोल्ट प्लग-इन खाका | The default plugin layout for the bottom panel |
| 3 | ऊपरी पटल के लिए डिफोल्ट प्लग-इन खाका | The default plugin layout for the top panel |
| 4 | उन प्लग-इनों की सूची जिन्हें डिफोल्ट रूप से नि... | A list of plugins that are disabled by default |


In [None]:
# type(dataset)
dataset.iloc[1, 0]

'एक्सेर्साइसर पहुंचनीयता अन्वेषक'

In [None]:
# Checking for null values
dataset.isnull().sum()

hindi      6056
english     725
dtype: int64

In [None]:
# Dropping the null values
data = dataset.dropna(axis=0)

In [None]:
data.isnull().sum()

hindi      0
english    0
dtype: int64

In [None]:
# Selecting only the first 500 rows for fine tuning the model
data = data[:500]

### Preprocessing the Dataset


In the case of Llama 2, the authors used the following Prompt template for the chat models:
```
<s>[INST] <<SYS>>
System prompt
<</SYS>>

User prompt [/INST] Model answer </s>
```
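
As a minimal sketch, the template above can be assembled with a small helper. The function name `build_llama2_prompt` and its parameters are illustrative and not part of any library. The system block is left out when no system prompt is given, and the answer and closing `</s>` are only appended when an answer is supplied (as in training data).

```python
from typing import Optional


def build_llama2_prompt(user_prompt: str,
                        answer: Optional[str] = None,
                        system_prompt: Optional[str] = None) -> str:
    """Assemble a single-turn Llama 2 chat prompt (illustrative helper)."""
    if system_prompt:
        # The system block sits inside [INST], before the user prompt.
        body = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt}"
    else:
        body = user_prompt
    prompt = f"<s>[INST] {body} [/INST]"
    if answer is not None:
        # Training examples append the expected answer and the end-of-sequence token.
        prompt += f" {answer} </s>"
    return prompt


# Reproduces the template shown above.
print(build_llama2_prompt("User prompt", "Model answer", "System prompt"))
```

Calling the helper without `answer` yields an inference-style prompt that ends at `[/INST]`, so the model is left to generate the answer.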



In [None]:
def transform(example):
  # Format one row as a Llama 2 chat example: the Hindi sentence is the prompt, the English sentence is the answer
  text = f'<s>[INST] {example["hindi"]} [/INST] {example["english"]}</s>'
  return text

In [None]:
# transformed_data = map(transform, data)
# This does not work: iterating over a DataFrame yields its column names, not its rows

In [None]:
from datasets import Dataset, DatasetDict

transformed_data = DatasetDict({
    "train": Dataset.from_pandas(data),
})

### Loading bnb Configuration

In [None]:
compute_dtype = getattr(torch, "float16")

bnb_configuration = BitsAndBytesConfig(load_in_4bit=True,
                                       bnb_4bit_quant_type="nf4",
                                       bnb_4bit_compute_dtype=compute_dtype,
                                       bnb_4bit_use_double_quant=False)
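
To give a rough sense of why 4-bit loading matters, the sketch below estimates the memory needed for the weights alone. It assumes a nominal 7 billion parameters and ignores quantization constants, activations, optimizer state, and the LoRA adapters, so the real footprint will be higher. The helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: a nominal 7 billion parameters for the 7B model.
NUM_PARAMS = 7e9

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions the weights shrink from about 14 GB to about 3.5 GB.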

### Checking GPU Compatibility with bfloat16

In [None]:
if compute_dtype == torch.float16 and use_4bit:
  major, _ = torch.cuda.get_device_capability()
  if major >= 8:
    print("=" * 80)
    print("Your GPU supports bfloat16: accelerate training with bf16=True")
    print("=" * 80)
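
The same check can be turned into a small helper that suggests precision flags. This is only a sketch: `precision_flags` is a hypothetical name, and falling back to fp16 on older GPUs is an assumption made here for illustration (this notebook leaves both flags off).

```python
def precision_flags(compute_capability_major: int) -> dict:
    """Suggest mixed-precision flags from the CUDA compute capability (illustrative)."""
    # GPUs with compute capability 8.x or newer (for example the A100) support bfloat16.
    use_bf16 = compute_capability_major >= 8
    return {"bf16": use_bf16, "fp16": not use_bf16}


print(precision_flags(8))  # e.g. an A100
print(precision_flags(7))  # e.g. a T4 (compute capability 7.5)
```

In the notebook, the argument would come from `torch.cuda.get_device_capability()[0]`, and the returned dictionary could be unpacked into `TrainingArguments`.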

### Loading the LLAMA2-7b Model

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             quantization_config=bnb_configuration,
                                             device_map={"": 0}) # Load the entire model on GPU 0

model.config.use_cache = False
model.config.pretraining_tp = 1

### Loading the LLAMA Tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
# Fix weird overflow issue with fp16 training
tokenizer.padding_side = "right"

### Loading the LoRA configurations

In [None]:
peft_config = LoraConfig(lora_alpha=16,
                         lora_dropout=0.1,
                         r=64,
                         bias="none",
                         task_type="CAUSAL_LM")
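
To see what `r=64` and `lora_alpha=16` imply, the sketch below counts the parameters LoRA would add. The model dimensions are assumptions that are not stated in this notebook: a hidden size of 4096, 32 decoder layers, and two adapted projection matrices per layer (query and value). The helper `lora_trainable_params` is hypothetical.

```python
def lora_trainable_params(r: int, d_in: int, d_out: int,
                          modules_per_layer: int, num_layers: int) -> int:
    """Count parameters added by LoRA: each adapted d_out x d_in matrix
    gets a matrix A (r x d_in) and a matrix B (d_out x r)."""
    per_matrix = r * d_in + d_out * r
    return per_matrix * modules_per_layer * num_layers


# Assumptions: hidden size 4096, 32 decoder layers, 2 adapted projections per layer.
added = lora_trainable_params(r=64, d_in=4096, d_out=4096,
                              modules_per_layer=2, num_layers=32)
scaling = 16 / 64  # lora_alpha / r, the factor applied to the low-rank update

print(f"LoRA parameters added: {added:,}")
print(f"Update scaling factor: {scaling}")
```

Under these assumptions the adapters add about 33.5 million parameters, well under 1% of a 7-billion-parameter model.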

### Setting Training Parameters

In [None]:
training_arguments = TrainingArguments(output_dir=output_dir,
                                       num_train_epochs=5,
                                       per_device_train_batch_size=4,
                                       gradient_accumulation_steps=1,
                                       optim="paged_adamw_32bit",
                                       save_steps=2,
                                       logging_steps=2,
                                       learning_rate=2e-4,
                                       weight_decay=0.001,
                                       fp16=False,
                                       bf16=False,
                                       max_grad_norm=0.3,
                                       max_steps=-1,
                                       warmup_ratio=0.03,
                                      #  group_by_length=True,
                                       lr_scheduler_type="constant",
                                      #  report_to="tensorboard",
                                       )
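
For a rough idea of how long this configuration runs, the number of optimizer steps can be estimated as below. This is an approximation: it assumes each of the 500 rows becomes exactly one training sequence (which does not hold when packing is enabled), and the Trainer's exact rounding may differ. The helper `training_steps` is hypothetical.

```python
import math


def training_steps(num_examples: int, per_device_batch_size: int,
                   grad_accum_steps: int, num_epochs: int,
                   num_devices: int = 1) -> int:
    """Estimate optimizer steps for a run, assuming one sequence per example."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs


total = training_steps(num_examples=500, per_device_batch_size=4,
                       grad_accum_steps=1, num_epochs=5)
print(total)  # 625 under the one-sequence-per-example assumption
```

With `save_steps=2` and `logging_steps=2`, a run of this length would write a checkpoint and a log entry every second step, so larger values may be preferable.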

### Setting SFT (Supervised Fine Tuning) parameters

In [None]:
trainer = SFTTrainer(model=model,
                     train_dataset=transformed_data["train"],
                     formatting_func=transform,
                     peft_config=peft_config,
                     dataset_text_field="text",
                    #  max_seq_length=None,
                     tokenizer=tokenizer,
                     args=training_arguments,
                     packing=True)
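
The idea behind `packing=True` can be illustrated with a simplified sketch: tokenized examples are concatenated, separated by an end-of-sequence token, and the resulting stream is cut into fixed-length chunks. This is not TRL's actual implementation, and `pack_sequences` is a hypothetical helper operating on toy token ids.

```python
from typing import Iterable, List


def pack_sequences(tokenized: Iterable[List[int]], eos_id: int,
                   seq_length: int) -> List[List[int]]:
    """Concatenate tokenized examples (each followed by EOS) and cut the
    stream into fixed-length chunks, dropping the incomplete tail."""
    stream: List[int] = []
    for ids in tokenized:
        stream.extend(ids)
        stream.append(eos_id)
    full = len(stream) // seq_length * seq_length
    return [stream[i:i + seq_length] for i in range(0, full, seq_length)]


# Toy token ids; 0 stands in for the EOS token id.
examples = [[1, 2, 3], [4, 5], [6]]
print(pack_sequences(examples, eos_id=0, seq_length=4))
# [[1, 2, 3, 0], [4, 5, 0, 6]]
```

Packing avoids padding short examples, which helps with a dataset of short sentence pairs, but it also means the number of training sequences no longer equals the number of rows.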

### Train the model

In [None]:
# clear_cache()
trainer.train()

### Save the trained model

In [None]:
trainer.model.save_pretrained(fine_tuned_model)

### Making predictions before merging the fine-tuned weights

In [None]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run the text generation pipeline with the fine-tuned (not yet merged) model
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")


In [None]:
print(result[0]['generated_text'])

<s>[INST] What is a large language model? [/INST]  A large language model is a type of artificial intelligence (AI) model that is trained on a large dataset of text to generate language outputs that are coherent and natural-sounding. everybody. These models are typically trained on vast amounts of text data, such as books, articles, and websites, and are designed to learn the patterns and structures of language.

Large language models are often used for a variety of natural language processing (NLP) tasks, such as language translation, text summarization, and language generation. They are also used for more creative applications, such as writing poetry or stories, and even creating new languages.

Some examples of large language models include:

1. BERT (Bidirectional Encoder Representations from Transformers): A popular large language model developed by Google that is trained on a large dataset of text and can be fine-t


In [None]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run the translation pipeline with the fine-tuned (not yet merged) model
prompt = "What is a large language model?"
pipe = pipeline(task="translation_en_to_hi", model=model, tokenizer=tokenizer, max_length=200)
translation = pipe(f"<s>[INST]Translate the sentence in Hindi:  {prompt} [/INST]")


In [None]:
print(translation[0])

{'translation_text': '[INST]Translate the sentence in Hindi:  What is a large language model? [/INST]  In Hindi, the sentence "What is a large language model?" can be translated as:\n everybody क्या एक बढ़ी भाषा मॉडल है?'}


In [None]:
# Empty VRAM
del model
del pipe
del trainer
import gc
gc.collect()
gc.collect()

20933

### Reloading the Model and Merging the Weights

In [None]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0}) # Load the entire model on GPU 0

model = PeftModel.from_pretrained(base_model, fine_tuned_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
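
Conceptually, merging folds each low-rank update into its base weight matrix: W' = W + (alpha / r) * B A. The toy example below shows that arithmetic in plain Python. It is a sketch of the idea only, not PEFT's code, and `matmul` and `merge_lora` are hypothetical helpers.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w: Matrix, lora_a: Matrix, lora_b: Matrix,
               alpha: float, r: int) -> Matrix:
    """Return W + (alpha / r) * B @ A, the merged weight matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy 2x2 weight with a rank-1 adapter (B is 2x1, A is 1x2).
w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]
lora_a = [[0.5, 0.5]]
print(merge_lora(w, lora_a, lora_b, alpha=2, r=1))
# [[2.0, 1.0], [2.0, 3.0]]
```

After merging, the adapter matrices are no longer needed at inference time, because the model is an ordinary set of weights again.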

### Results After Fine-tuning

In [None]:
# Ignore warnings
# logging.set_verbosity(logging.CRITICAL)

# Run the translation pipeline with the merged model
prompt = "What is a large language model?"
pipe = pipeline(task="translation_en_to_hi", model=model, tokenizer=tokenizer, max_length=200)
translation = pipe(f"<s>[INST]Translate the sentence in Hindi:  {prompt} [/INST]")


In [None]:
print(translation[0])

{'translation_text': '[INST]Translate the sentence in Hindi:  What is a large language model? [/INST]  In Hindi, the sentence "What is a large language model?" can be translated as:\n\nक्या है बढ़ा भाषा मौजूदा?'}


In [None]:
# Ignore warnings
# logging.set_verbosity(logging.CRITICAL)

# Run the translation pipeline with the merged model
prompt = "What is you name?"
pipe = pipeline(task="translation_en_to_hi", model=model, tokenizer=tokenizer, max_length=200)
translation = pipe(f"<s>[INST]Translate the sentence in Hindi:  {prompt} [/INST]")
print(translation[0])

{'translation_text': '[INST]Translate the sentence in Hindi:  What is you name? [/INST]  In Hindi, the sentence "What is your name?" can be translated as:\n\nतुम्हारा नाम क्या है?'}


In [None]:
# Ignore warnings
# logging.set_verbosity(logging.CRITICAL)

# Run the translation pipeline with the merged model
prompt = "Give your application an accessibility workout?"
pipe = pipeline(task="translation_en_to_hi", model=model, tokenizer=tokenizer, max_length=200)
translation = pipe(f"<s>[INST]Translate the sentence in Hindi:  {prompt} [/INST]")
print(translation[0])

{'translation_text': '[INST]Translate the sentence in Hindi:  Give your application an accessibility workout? [/INST]  In Hindi, the sentence "Give your application an accessibility workout" can be translated as:\n\nआपका ऐपलिकेशन अधिकारित करें कार्यक्रम [Aapka aplisikshan adhikariit karein karyakram]'}
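
The outputs above still contain the prompt, so a small post-processing step can isolate the model's answer. The helper below is a hypothetical sketch that keeps only the text after the last `[/INST]` marker. The sample string is a shortened stand-in for a pipeline result.

```python
def extract_answer(generated_text: str, marker: str = "[/INST]") -> str:
    """Return only the text after the last instruction marker (illustrative helper)."""
    # rpartition leaves the whole string in the last slot if the marker is absent.
    _, _, answer = generated_text.rpartition(marker)
    return answer.strip()


# Shortened stand-in for translation[0]["translation_text"].
sample = "[INST]Translate the sentence in Hindi:  What is you name? [/INST]  तुम्हारा नाम क्या है?"
print(extract_answer(sample))
```

In the notebook, this could be applied as `extract_answer(translation[0]["translation_text"])`.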
