# Fine-Tuning Llama 2 7B

https://llama.meta.com/llama2/

https://huggingface.co/blog/llama2


![Alt text](https://th.bing.com/th/id/OIP.kqWhlzzjHwKJGfyAV1zNUgHaEK?w=307&h=180&c=7&r=0&o=5&dpr=2&pid=1.7 "LLAMA2")




---




### LLM Training Process

![Alt text](https://miro.medium.com/v2/resize:fit:1400/format:webp/0*lO2BB3K21OVvEuQC.png "Training")

1. Pretraining
2. Fine-tuning

    2.1. SFT: Supervised Fine-Tuning

    2.2. RLHF: Reinforcement Learning from Human Feedback


In [None]:
!nvidia-smi

In [None]:
!pip install git+https://github.com/huggingface/transformers.git
!pip install git+https://github.com/huggingface/peft.git
!pip install git+https://github.com/huggingface/accelerate.git
!pip install trl xformers wandb datasets einops gradio sentencepiece bitsandbytes huggingface_hub tensorboard


### Libraries
* transformers: This library provides APIs for downloading, loading, and training pre-trained models.
* bitsandbytes: A library for quantizing a large language model to reduce its memory footprint, especially on GPUs.
* peft: This is used to add a LoRA adapter to the LLM.
* trl: This library provides the `SFTTrainer` class used for supervised fine-tuning (SFT).
* accelerate and xformers: accelerate handles device placement and mixed-precision or multi-GPU execution; xformers provides memory-efficient attention kernels that can speed up training and inference.
* wandb/tensorboard: These are used for monitoring the training process.
* datasets: This library is used to load datasets from Hugging Face.
* gradio: It’s used for designing simple user interfaces.
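
Before running the rest of the notebook, it can help to confirm which of these packages are actually installed. The snippet below is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package names are taken from the list above.

```python
from importlib import metadata

# Package names from the list above, as published on PyPI.
REQUIRED_PACKAGES = [
    "transformers", "bitsandbytes", "peft", "trl", "accelerate",
    "xformers", "wandb", "tensorboard", "datasets", "gradio",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED_PACKAGES).items():
    print(f"{name:<14} {version or 'NOT INSTALLED'}")
```

Recording the printed versions alongside a training run makes it easier to reproduce later, since several of these libraries are installed from their development branches.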

In [None]:
import os, torch, wandb, platform, gradio, warnings
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
    TextStreamer,
)
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer




---



### Prerequisites

To load our desired model, `meta-llama/Llama-2-7b-hf`, we first need to authenticate ourselves on Hugging Face. This ensures we have the correct permissions to fetch the model.

1. Request access to the model on Hugging Face: [Link](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf).
2. Log in with `notebook_login()` (or `huggingface-cli login`) and verify your authentication status.
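
Credentials are generally safer kept in environment variables than pasted into notebook cells. The sketch below assumes the tokens are exported as `HF_TOKEN` and `WANDB_API_KEY`, the variable names that `huggingface_hub` and `wandb` look for; the helper `read_secret` is hypothetical and only reports whether each value is set.

```python
import os


def read_secret(name, env=None):
    """Return the value of environment variable `name`, or None if unset or blank."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    return value or None


# Report presence only; never print the secret itself.
print("HF token set:", read_secret("HF_TOKEN") is not None)
print("W&B key set:", read_secret("WANDB_API_KEY") is not None)
```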


![Alt Text](https://miro.medium.com/v2/resize:fit:1400/format:webp/0*5aZDn1DACKjcANXu.gif "llm")

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
# !huggingface-cli login

In [None]:
def print_system_specs():
    # Check if CUDA is available
    is_cuda_available = torch.cuda.is_available()
    print("CUDA Available:", is_cuda_available)
    # Get the number of available CUDA devices
    num_cuda_devices = torch.cuda.device_count()
    print("Number of CUDA devices:", num_cuda_devices)
    if is_cuda_available:
        for i in range(num_cuda_devices):
            # Get CUDA device properties
            device = torch.device('cuda', i)
            print(f"--- CUDA Device {i} ---")
            print("Name:", torch.cuda.get_device_name(i))
            print("Compute Capability:", torch.cuda.get_device_capability(i))
            print("Total Memory:", torch.cuda.get_device_properties(i).total_memory, "bytes")
    # Get CPU information
    print("--- CPU Information ---")
    print("Processor:", platform.processor())
    print("System:", platform.system(), platform.release())
    print("Python Version:", platform.python_version())
print_system_specs()



---



### Loading Dataset, Model & Tokenizer

Here, we prepare our session by loading the dataset, the Llama 2 model, and its associated tokenizer.

The tokenizer will help in converting our text prompts into a format that the model can understand and process.
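
The training text is a single string per example that combines an instruction with its response. As a rough illustration, the helper below builds such a string using the same instruction/response template that the `stream` function in this notebook uses at inference time. The name `build_training_text` is hypothetical, and it is an assumption that the dataset's `text` column follows this layout.

```python
SYSTEM_PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
)


def build_training_text(instruction, response):
    """Join an instruction and its response into one Alpaca-style training string."""
    return (
        f"{SYSTEM_PROMPT}"
        f"### Instruction:\n{instruction.strip()}\n\n"
        f"### Response:\n{response.strip()}"
    )


example = build_training_text("Name the largest planet.", "Jupiter.")
print(example)
```

Keeping the training and inference templates identical matters: a model fine-tuned on one layout tends to respond less reliably when prompted with another.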

In [None]:
# Pre-trained base model
base_model = "meta-llama/Llama-2-7b-hf"

# New instruction dataset
dataset_name = "vicgalle/alpaca-gpt4"

# Alternative instruction dataset
# guanaco_dataset = "mlabonne/guanaco-llama2-1k"

# Hugging Face repository ID for the fine-tuned model ("<user>/<repo>", not the full URL).
# Create the repository on Hugging Face first, then paste its ID here.
new_model = "Rlele1002/Llama-2-7b-hf-ft"

In [None]:
# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train[0:1000]")
dataset["text"][0]

In [None]:
# Load base model(llama-2-7b-hf) and tokenizer
bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.float16,
    bnb_4bit_use_double_quant= False,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map={"": 0},
)
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token
# tokenizer.padding_side = "right"

In [None]:
# Monitoring login: wandb.login() reads WANDB_API_KEY or prompts for a key (never hardcode it)
wandb.login()
run = wandb.init(project='Fine tuning llama-2-7B', job_type="training", anonymous="allow")

In [None]:
# Lora config
peft_config = LoraConfig(
    lora_alpha= 8,
    lora_dropout= 0.1,
    r= 4,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj", "up_proj"]
)

In [None]:
# Training arguments
training_arguments = TrainingArguments(
    output_dir= "./results",
    logging_dir="./logs",
    num_train_epochs= 1,
    per_device_train_batch_size= 1,
    gradient_accumulation_steps= 2,
    optim = "paged_adamw_8bit",
    save_steps= 1000,
    logging_steps= 30,
    learning_rate= 2e-4,
    weight_decay= 0.001,
    fp16= True,
    bf16= False,
    max_grad_norm= 0.3,
    max_steps= -1,
    warmup_ratio= 0.3,
    group_by_length= True,
    lr_scheduler_type= "linear",
    report_to="wandb",
)

In [None]:
# SFTTrainer arguments
# Setting sft parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length= None,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing= False,
) 

In [None]:
# Train model
trainer.train()

In [None]:
# Save the fine-tuned model
trainer.model.save_pretrained(new_model)
wandb.finish()
model.config.use_cache = True
model.eval()

In [None]:
def stream(user_prompt):
    runtimeFlag = "cuda:0"
    system_prompt = 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n'
    B_INST, E_INST = "### Instruction:\n", "### Response:\n"

    prompt = f"{system_prompt}{B_INST}{user_prompt.strip()}\n\n{E_INST}"

    inputs = tokenizer([prompt], return_tensors="pt").to(runtimeFlag)

    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Despite returning the usual output, the streamer will also print the generated text to stdout.
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=500)

In [None]:
stream("what is newtons 2nd law and its formula")

In [None]:
# Clear the memory footprint
del model, trainer
torch.cuda.empty_cache()

In [None]:
model_name = "meta-llama/Llama-2-7b-hf" 

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_name, low_cpu_mem_usage=True,
    return_dict=True,torch_dtype=torch.float16,
    device_map= {"": 0})
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [None]:
model.push_to_hub(new_model)
tokenizer.push_to_hub(new_model)



---



#### Visualize model architecture

In [None]:
model



---



### Creating the Llama Pipeline or Using the Streaming Function

We'll set up a pipeline (or use the streaming function) for text generation.

This pipeline simplifies the process of feeding prompts to our model and receiving generated text as output.

*Note*: This cell takes 2-3 minutes to run

https://huggingface.co/docs/transformers/main_classes/pipelines


In [None]:
from transformers import pipeline

llama_pipeline = pipeline(
    "text-generation",  # LLM task
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

def get_response(user_prompt: str) -> None:
    """
    Generate a response from the Llama model.

    Parameters:
        user_prompt (str): The user's input/question for the model.

    Returns:
        None: Prints the model's response.
    """
    sequences = llama_pipeline(
        user_prompt,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=256,
    )
    print(sequences[0]['generated_text'])


In [None]:
def stream(user_prompt: str) -> None:
    runtimeFlag = "cuda:0"
    system_prompt = 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n'
    B_INST, E_INST = "### Instruction:\n", "### Response:\n"

    prompt = f"{system_prompt}{B_INST}{user_prompt.strip()}\n\n{E_INST}"

    inputs = tokenizer([prompt], return_tensors="pt").to(runtimeFlag)

    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Despite returning the usual output, the streamer will also print the generated text to stdout.
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=500)

In [None]:
get_response("what is newtons 2nd law")

In [None]:
# before finetune
stream("what is newtons 2nd law")




---


### Initialize Training Parameters and Start the Training Process


#### Step-by-step Guide to Fine-Tuning LLAMA2

QLoRA (Quantized Low-Rank Adaptation) extends LoRA (Low-Rank Adaptation) by keeping the frozen base model weights in 4-bit precision, which reduces memory use during fine-tuning, while only the small low-rank adapter matrices are trained.
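
To see why quantization matters for memory, the back-of-the-envelope sketch below estimates the size of the model weights alone at different precisions. The parameter count is a round 7 billion (an assumption for illustration), and the estimate ignores activations, optimizer state, and the small overhead of quantization constants.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to hold the weights, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7e9  # assumption: round figure for a "7B" model

for label, bits in [("float32", 32), ("float16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:<12} ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the float16 footprint, which is what lets a 7B model fit on a single consumer or Colab GPU during fine-tuning.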

![Alt Text](https://miro.medium.com/v2/resize:fit:1200/format:webp/0*AZpOOX-cjO_J8u9M.gif "lora")

QLoRA and LoRA are Parameter-Efficient Fine-Tuning (PEFT) techniques.
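
The "parameter-efficient" part can be made concrete by counting the adapter weights. For a linear layer with `d_in` inputs and `d_out` outputs, a rank-`r` LoRA adapter adds `r * (d_in + d_out)` trainable parameters, and its update is scaled by `lora_alpha / r`. The sketch below applies this to the target modules used in this notebook, assuming the Llama 2 7B dimensions (hidden size 4096, MLP size 11008, 32 decoder layers); the helper names are hypothetical.

```python
HIDDEN = 4096         # assumed hidden size of Llama 2 7B
INTERMEDIATE = 11008  # assumed MLP intermediate size
NUM_LAYERS = 32       # assumed number of decoder layers

# (in_features, out_features) of each targeted linear layer in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
}


def lora_trainable_params(rank):
    """Total LoRA parameters: rank * (d_in + d_out) per target module, summed over layers."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


def lora_scaling(lora_alpha, rank):
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / rank


for rank, alpha in [(4, 8), (64, 16)]:
    print(f"r={rank:<3} alpha={alpha:<3} "
          f"trainable={lora_trainable_params(rank):,} scaling={lora_scaling(alpha, rank)}")
```

Even at rank 64 the adapters amount to roughly 2% of the base model's parameters. Note that the two configurations in this notebook differ not only in capacity but also in scaling (2.0 versus 0.25), which affects how strongly the adapter update is applied.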

![Alt Text](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*SJtZupeQVgp3s5HOBymcQw.png "peft")

https://zhuanlan.zhihu.com/p/666234324

https://zhuanlan.zhihu.com/p/623543497


In [None]:
# monitoring login
# https://wandb.ai/authorize
# wandb.login(key="")
# run = wandb.init(project='Fine tuning llama-2-7B', job_type="training", anonymous="allow")

In [None]:
peft_config = LoraConfig(
    lora_alpha= 16,
    lora_dropout= 0.1,
    r= 64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj"]
)

In [None]:
training_arguments = TrainingArguments(
    output_dir= "./results",
    num_train_epochs= 1,
    per_device_train_batch_size= 6,
    gradient_accumulation_steps= 2,
    optim = "paged_adamw_8bit",
    save_steps= 50,  # adjust based on your data
    logging_steps= 25,
    learning_rate= 2e-4,
    weight_decay= 0.001,
    fp16= False,
    bf16= False,
    max_grad_norm= 0.3,
    max_steps= -1,
    warmup_ratio= 0.3,
    group_by_length= True,
    lr_scheduler_type= "linear",
    report_to="tensorboard", # "wandb"
)
# Setting sft parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length= None,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing= False,
)
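
The batch-related arguments above interact: the optimizer takes one step per `per_device_train_batch_size * gradient_accumulation_steps` examples, and `warmup_ratio` is a fraction of the total step count. The sketch below estimates the resulting schedule for the 1,000-example subset loaded earlier. It is an approximation with a hypothetical helper name; the exact numbers depend on how the `Trainer` rounds partial batches.

```python
import math


def estimate_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs=1, warmup_ratio=0.0, num_devices=1):
    """Estimate effective batch size, optimizer steps, and warmup steps."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    return {
        "effective_batch_size": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }


plan = estimate_schedule(
    num_examples=1000,        # size of the "train[0:1000]" split
    per_device_batch_size=6,
    grad_accum_steps=2,
    num_epochs=1,
    warmup_ratio=0.3,
)
print(plan)
```

With so few optimizer steps in total, values such as `save_steps=50` and `logging_steps=25` produce only a handful of checkpoints and log points, so they are worth revisiting when the dataset size changes.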

In [None]:
# Train model
trainer.train()

In [None]:
# Save the fine-tuned model
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)

# wandb.finish()
model.config.use_cache = True
model.eval()



---



### Visualize Training Loss with Tensorboard

In [None]:
from tensorboard import notebook
log_dir = "./results/runs"
notebook.start("--logdir {} --port 4001".format(log_dir))

In [None]:
# after finetune
stream("what is newtons 2nd law")

In [None]:
# Clear the memory footprint
del model, trainer
torch.cuda.empty_cache()
import gc
gc.collect()

In [None]:
!nvidia-smi

In [None]:
loaded_base_model = AutoModelForCausalLM.from_pretrained(
    base_model, low_cpu_mem_usage=True,
    return_dict=True,torch_dtype=torch.float16,
    device_map= {"": 0})
model = PeftModel.from_pretrained(loaded_base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
# Reuse EOS as the pad token; adding a new [PAD] token would require resizing the model embeddings
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [None]:
# Upload the fine-tuned model to a Hugging Face repo
model.push_to_hub("llama2_7b_ft_v2", use_temp_dir=True)
tokenizer.push_to_hub("llama2_7b_ft_v2", use_temp_dir=True)



---



### LLAMA2 with Gradio UI



---



### Loading Model & Tokenizer

https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main

Here, we are preparing our session by loading both the Llama model and its associated tokenizer.

The tokenizer will help in converting our text prompts into a format that the model can understand and process.

In [None]:
from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf" # https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main

tokenizer = AutoTokenizer.from_pretrained(model, token=True)  # `token` replaces the deprecated `use_auth_token`

In [None]:
llama_pipeline = pipeline(
    "text-generation",  # LLM task
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

#### The structure of Llama 2 prompts:
```
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]
```

In [None]:
SYSTEM_PROMPT = """<s>[INST] <<SYS>>
You are a helpful bot. Your answers are clear and concise.
<</SYS>>

"""

# Formatting function for message and history
def format_message(message: str, history: list, memory_limit: int = 3) -> str:
    """
    Formats the message and history for the Llama model.

    Parameters:
        message (str): Current message to send.
        history (list): Past conversation history.
        memory_limit (int): Limit on how many past interactions to consider.

    Returns:
        str: Formatted message string
    """
    # always keep len(history) <= memory_limit
    if len(history) > memory_limit:
        history = history[-memory_limit:]

    if len(history) == 0:
        return SYSTEM_PROMPT + f"{message} [/INST]"

    formatted_message = SYSTEM_PROMPT + f"{history[0][0]} [/INST] {history[0][1]} </s>"

    # Handle conversation history
    for user_msg, model_answer in history[1:]:
        formatted_message += f"<s>[INST] {user_msg} [/INST] {model_answer} </s>"

    # Handle the current message
    formatted_message += f"<s>[INST] {message} [/INST]"

    return formatted_message

In [None]:
# Generate a response from the Llama model
def get_llama_response(message: str, history: list) -> str:
    """
    Generates a conversational response from the Llama model.

    Parameters:
        message (str): User's input message.
        history (list): Past conversation history.

    Returns:
        str: Generated response from the Llama model.
    """
    query = format_message(message, history)
    response = ""

    sequences = llama_pipeline(
        query,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=1024,
    )

    generated_text = sequences[0]['generated_text']
    response = generated_text[len(query):]  # Remove the prompt from the output

    print("Chatbot:", response.strip())
    return response.strip()

In [None]:
print(get_llama_response("Why is the sky blue?", []))

In [None]:
import gradio as gr

gr.ChatInterface(get_llama_response).launch()


### References

#### Fine-tuning Llama 2
https://gathnex.medium.com/fine-tuning-llama-2-llm-on-google-colab-a-step-by-step-guide-dd79a788ac16

### Future Work

1. Try a different fine-tuning dataset
2. Try a different base model, such as Code Llama or Mistral-7B