<a href="https://colab.research.google.com/github/rrrohit1/fine-tuning-llama/blob/main/fine_tuning_llama2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1: Install Packages

In [1]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7


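**Bitsandbytes**: used for quantization. Model weights are normally stored as float16 or float32. Because GPU memory is limited, we quantize them to a lower-precision format (8-bit, or 4-bit as in this notebook) so the model fits on the GPU.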
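To make the idea of quantization concrete, here is a toy sketch of absmax quantization in plain Python. It is only an illustration of mapping floats to small integers and back; it is not the algorithm bitsandbytes implements, and the function names and sample weights are made up for this example.

```python
def absmax_quantize(weights, bits=8):
    """Map floats to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1
    # Fall back to 1.0 so an all-zero input does not divide by zero.
    largest = max(abs(w) for w in weights) or 1.0
    scale = largest / qmax
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the stored scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.5, 0.33, 0.9, -0.07]
quantized, scale = absmax_quantize(weights)
restored = dequantize(quantized, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, max_error)
```

Each weight now needs one byte plus a shared scale, at the cost of a small rounding error.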

# Step 2: Import packages

In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging
)
from peft import PeftModel, PeftConfig
from trl import SFTTrainer

## Using Llama 2
The following prompt template is used to chat with the model.

- System prompt: guides the model's behavior (optional)
- User prompt: gives the instruction (required)
- Model answer: the expected response (required)



```
<s>[INST] <<SYS>>
System prompt
<</SYS>>

User prompt [/INST] Model Answer </s>
```
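The template above can be assembled with a small helper. This is a minimal sketch; `build_llama2_prompt` and its parameter names are hypothetical and do not come from any library.

```python
def build_llama2_prompt(user_prompt, model_answer=None, system_prompt=None):
    """Format one turn using the Llama 2 chat template shown above."""
    prompt = "<s>[INST] "
    if system_prompt:
        # The system prompt is optional and wrapped in <<SYS>> tags.
        prompt += f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    prompt += f"{user_prompt} [/INST]"
    if model_answer is not None:
        # Training samples include the answer and the closing </s> token.
        prompt += f" {model_answer} </s>"
    return prompt


example = build_llama2_prompt(
    user_prompt="What is LoRA?",
    model_answer="LoRA is a parameter-efficient fine-tuning method.",
    system_prompt="You are a helpful assistant.",
)
print(example)
```

Leaving `model_answer` unset yields a prompt that ends at `[/INST]`, which is the form you would pass to the model at inference time.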

## Reformat the instruction dataset to follow Llama 2 template

The Llama 2 model used here has 7 billion parameters.
Full fine-tuning is not feasible on a limited GPU, so we use parameter-efficient fine-tuning (PEFT) techniques such as LoRA.
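A rough back-of-the-envelope estimate shows why GPU memory is the bottleneck. The sketch below counts only the memory needed to store the weights, assuming exactly 7 billion parameters; gradients, optimizer states, and activations come on top of this during training.

```python
NUM_PARAMS = 7_000_000_000  # approximate parameter count (assumption: exactly 7B)

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype:>8}: {weight_memory_gb(NUM_PARAMS, nbytes):.1f} GB")
# float32: 28.0 GB, float16: 14.0 GB, int8: 7.0 GB, 4-bit: 3.5 GB
```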

# Step 3: Fine-tune process

1. Load the `Llama-2-7b-chat-hf` model.
2. Train it on the dataset to produce our fine-tuned model, `Llama-2-7b-chat-finetune`.

QLoRA: we use a rank of 64 with a scaling parameter (alpha) of 16. The Llama 2 model is loaded directly in 4-bit precision using the NF4 type and trained for one epoch.
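To get a feel for what rank 64 and alpha 16 mean, the sketch below compares the size of a LoRA adapter with the full weight matrix it adapts. It assumes a single square 4096 x 4096 projection (4096 is the hidden size of the 7B model) and the standard LoRA scaling factor of alpha / rank; which layers actually receive adapters depends on the LoRA configuration and is not covered here.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha, rank):
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


hidden_size = 4096  # hidden size of Llama 2 7B
rank, alpha = 64, 16

full_params = hidden_size * hidden_size                       # 16,777,216
adapter_params = lora_param_count(hidden_size, hidden_size, rank)  # 524,288

print(f"adapter / full = {adapter_params / full_params:.3%}")  # 3.125%
print(f"scaling factor = {lora_scaling(alpha, rank)}")         # 0.25
```

Only the adapter parameters are trained; the quantized base weights stay frozen.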

In [None]:

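# Base model to fine-tune (from the Hugging Face Hub)
model_name = 'NousResearch/Llama-2-7b-chat-hf'

# Instruction dataset to train on
dataset_name = 'mlabonne/guanaco-llama2-1k'

# Name for the fine-tuned model
new_model = 'Llama-2-7b-chat-finetune'

# LoRA parameters
lora_r = 64          # rank of the LoRA update matrices (64, as stated above)
lora_alpha = 16      # scaling parameter
lora_dropout = 0.1   # dropout probability for the LoRA layers

# Quantization (bitsandbytes) parameters
use_4bit = True                      # load the base model in 4-bit precision
bnb_4bit_compute_dtype = "float16"   # compute dtype for the 4-bit layers
bnb_4bit_quant_type = "nf4"          # quantization type: "fp4" or "nf4"
use_nested_quant = True              # nested (double) quantization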

