# Fine-tune Llama 2 in Google Colab
[source](https://medium.com/@csakash03/fine-tuning-llama-2-llm-on-google-colab-a-step-by-step-guide-cf7bb367e790)  

## Google Colab limitations:

A common misconception about fine-tuning large language models (LLMs) on the free version of Google Colab is worth clearing up.

Fine-tuning an LLM on Colab’s free tier is feasible, but with significant caveats. A session can run continuously for at most about 12 hours, and it disconnects after 15–30 minutes of inactivity. GPU usage is also capped at approximately 12 hours per day.

Because of these constraints, free-tier users are in practice restricted to smaller LLMs and limited datasets, often around 2 epochs over roughly 10k samples. Fine-tuning on the free tier therefore remains possible, but it is demanding, especially for larger models.  
This notebook runs on a T4 GPU.   


## [Llama 2 Family of Models](https://huggingface.co/models?other=llama-2)   
Llama 2, developed by Meta, is a family of large language models ranging from 7 billion to 70 billion parameters. It is built on the transformer architecture, and its chat variants have been fine-tuned for dialogue use cases, outperforming open-source chat models on various benchmarks. Llama 2 is known for its few-shot learning capability, efficiency, and multitask learning. Training it from scratch, however, is computationally intensive and time-consuming. Llama 2 models also have trouble with stop generation, that is, determining the appropriate point at which to stop generating text. In addition, data contamination has been identified as a critical issue in Llama 2 evaluation, affecting the integrity of model assessments.

Llama 2 was pre-trained on an extensive dataset of text and code from diverse sources such as books, articles, and code repositories. What distinguishes it is its refinement process, which drew on over 1 million human annotations and markedly improved its precision and fluency. Llama 2 surpasses other open-source language models across various practical assessments, and it is available for both research and commercial use, which makes it an adaptable tool for a wide range of applications.

## LLM training process

The training process for an LLM primarily involves two key steps:

    + Pre-training: This initial phase is akin to introducing a language model to the fundamental elements of language. It involves exposing the model to an extensive array of text derived from the vast expanse of the internet. This stage serves to provide the model with a broad comprehension of grammar, vocabulary, and prevalent language patterns. Throughout this phase, the model learns to anticipate and predict subsequent words or phrases in a sentence, thereby developing an understanding of language structure. In essence, it’s comparable to teaching a student the basics, similar to mastering the ABCs, before delving into more complex reading material like books.

    + Fine-tuning: In this phase the model, having acquired a foundational understanding of language through pre-training, undergoes a more specialized process. It resembles giving tailored lessons to a well-rounded student for a specific task. For instance, fine-tuning might hone the model’s proficiency in answering questions or generating code, much like guiding a student to excel in a particular subject at school. Fine-tuning adapts the broad language knowledge acquired during pre-training to specific tasks.
    Even after fine-tuning, the model faces persistent challenges: occasional inaccurate or nonsensical output, sensitivity to input phrasing, susceptibility to biases in the fine-tuning data, and difficulty with nuanced context in intricate conversations. Models may also struggle to generate long, coherent content, which affects their suitability for applications such as content generation and chatbots. These limitations call for continued research and development to make fine-tuned models more reliable and ethically sound.
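
The "predict the next word" objective used in pre-training (and reused in supervised fine-tuning) can be sketched in a few lines. This is a toy illustration with a made-up three-token vocabulary and hand-picked scores, not the actual Llama 2 training code; the function names are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Average cross-entropy of the correct next token at each position."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy vocabulary of 3 tokens, two positions to predict (all values invented).
targets = [0, 1]
confident = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]  # high score on the right token
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # no preference at all

loss_confident = next_token_loss(confident, targets)
loss_uniform = next_token_loss(uniform, targets)
print(f"confident: {loss_confident:.4f}  uniform: {loss_uniform:.4f}")
```

A model that puts most of its probability on the correct next token gets a loss near zero, while a model with no preference gets the log of the vocabulary size. Training pushes the loss from the second regime toward the first.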

Reinforcement Learning from Human Feedback (RLHF) serves as a tutor for language models, akin to providing additional guidance after pre-training and fine-tuning. It resembles a teacher reviewing and grading a model’s responses, aiming to further enhance its capabilities. Human feedback, delivered through evaluations and corrections, serves as a means for the model to learn from errors and refine its language skills. Similar to how students improve through feedback in their studies, RLHF assists language models in excelling at specific tasks by incorporating guidance from humans.

To address the challenges of RLHF, a newer technique named Direct Preference Optimization (DPO) has emerged. Unlike RLHF, which first learns a separate reward model and then optimizes the language model against it with reinforcement learning, DPO simplifies the process by treating it as a classification problem on human preference data.
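
To make the "classification problem" framing concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes you already have the summed log-probabilities of the chosen and rejected responses under both the model being trained and a frozen reference model. The numbers and the `beta` value below are invented for illustration; this notebook itself does not use DPO.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logit)), written in a numerically stable form.
    if logit >= 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))

# Policy agrees with the human preference more than the reference does.
loss_aligned = dpo_loss(-10.0, -12.0, -11.0, -11.0)
# Policy and reference are identical: the loss is log(2).
loss_neutral = dpo_loss(-11.0, -11.0, -11.0, -11.0)
# Policy prefers the rejected response.
loss_misaligned = dpo_loss(-12.0, -10.0, -11.0, -11.0)
print(loss_aligned, loss_neutral, loss_misaligned)
```

The loss is a binary cross-entropy on "which response is preferred", so no separate reward model or reinforcement learning loop is needed.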

## Fine-Tuning Llama 2 Step by Step

We’re opting to use 🦙Llama-2-7B-HF, the smallest pre-trained model in the Llama 2 lineup, and to fine-tune it with the QLoRA technique.

QLoRA (Quantized Low-Rank Adaptation) extends LoRA (Low-Rank Adaptation) by adding quantization to reduce memory use during fine-tuning. QLoRA loads the frozen pre-trained model into GPU memory as 4-bit weights, whereas standard LoRA typically keeps the base weights in 16-bit (or, with quantization, 8-bit) precision. This substantially reduces memory requirements, which is what lets a 7B model fit on a single T4; the trade-off is some extra computation to dequantize the weights on the fly.
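
A back-of-the-envelope calculation shows why the weight precision matters on a 16 GB T4. The sketch below counts only the model weights and assumes a nominal 7 billion parameters; activations, optimizer state, and quantization constants add further overhead that is not modeled here.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 7e9  # nominal size of a "7B" model; the exact count differs slightly

estimates = {
    label: weight_memory_gb(NUM_PARAMS, bits)
    for label, bits in [("float16", 16), ("int8", 8), ("4-bit", 4)]
}
for label, gigabytes in estimates.items():
    print(f"{label:>8}: {gigabytes:.1f} GB")
```

At 16 bits the weights alone come to roughly 14 GB, which leaves almost no room for training on a T4, while 4-bit weights take roughly 3.5 GB.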

In simpler terms, instead of updating all of the model’s weights, we attach small adapter matrices to selected layers, keep the original weights frozen, and train only the adapters. This approach lets us fine-tune the LLM on a consumer GPU and speeds up training significantly.
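
The adapter idea can be sketched with plain lists standing in for tensors. The frozen weight `W` is left untouched, and a low-rank product of two small matrices (called `down` and `up` here, names chosen for this illustration) is added on top, scaled by `alpha / r`. This is a toy version of the LoRA forward pass under those assumptions, not the `peft` implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, down, up, alpha, r):
    """Frozen path x @ W plus the scaled low-rank path (x @ down) @ up."""
    base = matmul(x, W)                    # pre-trained weights, never updated
    update = matmul(matmul(x, down), up)   # trainable adapter with rank r
    scaling = alpha / r
    return [[b + scaling * u for b, u in zip(brow, urow)]
            for brow, urow in zip(base, update)]

x = [[1.0, 2.0, 3.0]]                       # one input row with 3 features
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # frozen 3x2 weight
down = [[0.1], [0.2], [0.3]]                # 3x1 projection (rank r = 1)
up_initial = [[0.0, 0.0]]                   # zero-initialized, as in LoRA
up_trained = [[1.0, -1.0]]                  # pretend values after training

y_start = lora_forward(x, W, down, up_initial, alpha=2, r=1)
y_tuned = lora_forward(x, W, down, up_trained, alpha=2, r=1)
print(y_start, y_tuned)
```

Because `up` starts at zero, the adapted model initially behaves exactly like the base model, and only the small `down` and `up` matrices receive gradient updates.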

Install required packages

In [1]:
!pip install -qU accelerate peft bitsandbytes transformers trl wandb gradio


Import required libraries

    + transformers: This library offers APIs to facilitate the download and use of pre-trained models.
    + bitsandbytes: Designed specifically for quantization purposes, this library focuses on reducing the memory footprint of large language models, particularly on GPUs.
    + peft: Utilized for integrating LoRA adapters into Language Models (LLMs).
    + trl: This library houses an SFT (Supervised Fine-Tuning) class that aids in fine-tuning models.
    + accelerate: Handles device placement and mixed-precision execution for training and inference. (xformers, which provides memory-efficient attention kernels, is optional and is not installed by the pip command above.)
    + wandb: This tool serves as a monitoring platform, used to track and observe the training process.
    + datasets: Utilized in conjunction with Hugging Face, this library facilitates the loading of datasets.
    + gradio: This library is employed for the creation of straightforward user interfaces, simplifying the design process.

In [2]:
import os
import torch
import wandb
import platform
import gradio
import warnings
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging, TextStreamer
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from datasets import load_dataset
from trl import SFTTrainer
from huggingface_hub import notebook_login



Check system spec

In [3]:
def print_system_specs():
    # Check if CUDA is available
    is_cuda_available = torch.cuda.is_available()
    print("CUDA Available:", is_cuda_available)
    # Get the number of available CUDA devices
    num_cuda_devices = torch.cuda.device_count()
    print("Number of CUDA devices:", num_cuda_devices)
    if is_cuda_available:
        for i in range(num_cuda_devices):
            # Get CUDA device properties
            device = torch.device('cuda', i)
            print(f"--- CUDA Device {i} ---")
            print("Name:", torch.cuda.get_device_name(i))
            print("Compute Capability:", torch.cuda.get_device_capability(i))
            print("Total Memory:", torch.cuda.get_device_properties(i).total_memory, "bytes")
    # Get CPU information
    print("--- CPU Information ---")
    print("Processor:", platform.processor())
    print("System:", platform.system(), platform.release())
    print("Python Version:", platform.python_version())

print_system_specs()

CUDA Available: False
Number of CUDA devices: 0
--- CPU Information ---
Processor: x86_64
System: Linux 6.1.58+
Python Version: 3.10.12


Setting the model variable

In [5]:
# The model that you want to train from the Hugging Face hub
model_name = "meta-llama/Llama-2-7b-hf"

# The instruction dataset to use
dataset_name = "vicgalle/alpaca-gpt4"

# Hugging Face repository ID (e.g. "username/model-name") under which to save the fine-tuned model.
# Create a new repository on Hugging Face, then paste its ID here.
new_model = "Repository link here"

### Log in to the Hugging Face Hub  
Note: You need to enter a Hugging Face access token. Before that, you must apply for access to the Llama 2 model on the [Meta website](https://llama.meta.com/llama-downloads).

In [None]:
notebook_login()

### Load dataset

We are utilizing the pre-processed dataset vicgalle/alpaca-gpt4 from Hugging Face.

In [7]:
# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train[0:10000]")

In [8]:
print(dataset["text"][0])

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.


In [10]:
print(dataset["text"][2])

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Describe the structure of an atom.

### Response:
An atom is the basic building block of all matter and is made up of three types of particles: protons, neutrons, and electrons. The structure of an atom can be described as a nucleus at the center surrounded by a cloud of electrons.

The nucleus of an atom is made up of protons and neutrons. Protons are positively charged particles and neutrons are neutral particles with no charge. Both of these particles are located in the nucleus of the atom, which is at the center of the atom and contains most of the atom's mass.

Surrounding the nucleus of the atom is a cloud of electrons. Electrons are negatively charged particles that are in constant motion around the nucleus. The electron cloud is divided into shells or orbitals, and each shell can hold a certain number of electrons. The number of electrons in the outermost 
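
The `text` column already contains fully formatted prompts, so no preprocessing is needed for this dataset. If you bring your own instruction/response pairs, a helper along the following lines can reproduce the layout printed above. The function name and the example strings are hypothetical, and only the no-input variant of the template is covered.

```python
PROMPT_HEADER = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)

def build_text(instruction, response=""):
    """Assemble one training sample in the layout shown above."""
    return (
        f"{PROMPT_HEADER}\n\n"
        f"### Instruction:\n{instruction.strip()}\n\n"
        f"### Response:\n{response.strip()}"
    )

sample = build_text(
    "Give three tips for staying healthy.",
    "1. Eat a balanced diet.",  # shortened placeholder response
)
print(sample)
```

Calling `build_text(instruction)` without a response yields the prompt prefix that the model is expected to continue at inference time.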

### Loading the model and tokenizer

We are going to load the pre-trained Llama-2-7B-HF model with 4-bit quantization. The compute data type is float16, since the T4 GPU does not support bfloat16 natively.

In [None]:
# Load base model(llama-2-7b-hf) and tokenizer
bnb_config = BitsAndBytesConfig(load_in_4bit= True,
                                bnb_4bit_quant_type= "nf4",
                                bnb_4bit_compute_dtype= torch.float16,
                                bnb_4bit_use_double_quant= False,
                                )
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             quantization_config=bnb_config,
                                             device_map={"": 0}
                                             )

model = prepare_model_for_kbit_training(model)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token

### Monitoring

Apart from training itself, monitoring is a crucial part of any LLM fine-tuning run.

To get started, create a [WandB account](https://wandb.ai/site). After creating your account, enter the authorization token here.

In [None]:
#monitoring login
wandb.login(key="Enter the Authorization code here")
run = wandb.init(project='Fine tuning llama-2-7B', job_type="training", anonymous="allow")

### LoRA config

In [None]:
peft_config = LoraConfig(lora_alpha= 8,
                         lora_dropout= 0.1,
                         r= 16,
                         bias="none",
                         task_type="CAUSAL_LM",
                         target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj", "up_proj"]
                         )
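
To get a feel for how small the adapter is, the following sketch estimates the number of trainable parameters that this configuration adds. It assumes the published Llama-2-7B layer shapes (hidden size 4096, MLP size 11008, 32 layers) and that each targeted projection gets one `r x in` and one `out x r` matrix; treat the result as an estimate rather than a readout from `peft`.

```python
# Llama-2-7B layer shapes, taken from the model's published config (assumption).
HIDDEN = 4096
INTERMEDIATE = 11008
NUM_LAYERS = 32

# (in_features, out_features) for each targeted projection in one layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
}

def lora_param_count(r, shapes=MODULE_SHAPES, num_layers=NUM_LAYERS):
    """Each adapted module adds r * (in_features + out_features) parameters."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

trainable = lora_param_count(r=16)
scaling = 8 / 16  # lora_alpha / r, the factor applied to the adapter output
print(f"trainable adapter parameters: {trainable:,}")
print(f"adapter scaling factor: {scaling}")
```

Under these assumptions the adapter has about 32 million trainable parameters, well under 1% of the 7 billion frozen base weights.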

### Training arguments

In [None]:
training_arguments = TrainingArguments(output_dir= "./results",
                                       num_train_epochs= 1,
                                       per_device_train_batch_size= 8,
                                       gradient_accumulation_steps= 2,
                                       optim = "paged_adamw_8bit",
                                       save_steps= 1000,
                                       logging_steps= 30,
                                       learning_rate= 2e-4,
                                       weight_decay= 0.001,
                                       fp16= False,
                                       bf16= False,
                                       max_grad_norm= 0.3,
                                       max_steps= -1,
                                       warmup_ratio= 0.3,
                                       group_by_length= True,
                                       lr_scheduler_type= "linear",
                                       report_to="wandb",)
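
It helps to know how many optimizer steps these settings produce before launching a run under Colab’s time limits. The arithmetic below assumes a single GPU and the 10,000-sample subset loaded earlier; rounding the warmup length up is also an assumption about how the trainer derives it from `warmup_ratio`.

```python
import math

num_samples = 10_000   # rows selected with split="train[0:10000]"
per_device_batch = 8   # per_device_train_batch_size
grad_accum = 2         # gradient_accumulation_steps
num_epochs = 1         # num_train_epochs
warmup_ratio = 0.3

# One optimizer step consumes batch_size * accumulation samples on one GPU.
effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {total_steps}, of which warmup: {warmup_steps}")
```

With `save_steps=1000` and only 625 optimizer steps in total, no intermediate checkpoint is written during this run, so a disconnected session would lose all progress; a smaller `save_steps` value may be worth considering on Colab.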

### SFTTrainer arguments

In [None]:
# Setting sft parameters
trainer = SFTTrainer(model=model,
                     train_dataset=dataset,
                     peft_config=peft_config,
                     max_seq_length= None,
                     dataset_text_field="text",
                     tokenizer=tokenizer,
                     args=training_arguments,
                     packing= False,
                     )

#### We’re all set to begin the training process.

In [None]:
# Train model
trainer.train()

Once training starts, you can monitor metrics such as loss, GPU usage, and RAM utilization on the WandB website. The link to the run’s dashboard is printed in the cell output.

During training, keep a close eye on the training loss. Irregularities in the loss curve, such as spikes or a sustained rise, are a signal to consider halting the run. Overfitting is a common concern as well; if it appears, adjust the hyperparameters and retry. Close monitoring lets you intervene early.

### Good training loss

“Good training loss” typically refers to a situation where the loss metric during the training of a machine learning model decreases steadily or reaches a low value, indicating that the model is effectively learning from the data. This decline in loss signifies that the model’s predictions are becoming more accurate or aligned with the actual target values in the training dataset.

However, what constitutes a “good” training loss can vary based on the specific problem, dataset, and model architecture. In some cases, achieving a lower training loss might indicate the model has memorized the training data (overfitting) and might not generalize well to new, unseen data.

Hence, while a decreasing training loss is generally desired, it’s crucial to consider other factors such as validation loss, model performance on unseen data, and potential signs of overfitting to assess the true effectiveness of the trained model.

### Bad training loss

A “bad” training loss typically refers to a situation where the loss metric during the training of a machine learning model exhibits undesirable behavior. This might include:

    + High or increasing loss: The loss value remains consistently high or starts increasing during training, indicating that the model is not effectively learning from the data.
    + Fluctuating loss: The loss metric fluctuates significantly without showing a clear decreasing trend, indicating instability in the model’s learning process.
    + Convergence to a high value: The loss may converge to a relatively high value and stay stagnant, suggesting that the model may not be capable of adequately capturing patterns in the data.
    + Overfitting: In some cases, a decreasing training loss might not necessarily be beneficial if it is accompanied by a significant increase in validation loss, indicating that the model is memorizing the training data rather than learning useful patterns.

Identifying a “bad” training loss is essential as it can signify issues with the model architecture, hyperparameters, or the quality of the training data. Addressing such issues is crucial to improve the model’s performance and ensure better generalization to unseen data.
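
The patterns above can be turned into a rough automated check on logged loss values. The sketch below smooths a loss curve with a moving average and labels its overall direction; the window size, tolerance, and sample curves are arbitrary choices for illustration, and the helper names are hypothetical.

```python
def moving_average(values, window):
    """Smooth a sequence by averaging each run of `window` consecutive values."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def loss_trend(losses, window=3, tolerance=0.01):
    """Label a loss curve as 'decreasing', 'increasing', or 'flat'."""
    smoothed = moving_average(losses, window)
    if len(smoothed) < 2:
        return "not enough data"
    change = smoothed[-1] - smoothed[0]
    if change < -tolerance:
        return "decreasing"
    if change > tolerance:
        return "increasing"
    return "flat"

def looks_overfit(train_losses, val_losses, window=3):
    """Training loss falls while validation loss rises."""
    return (loss_trend(train_losses, window) == "decreasing"
            and loss_trend(val_losses, window) == "increasing")

train = [2.0, 1.6, 1.3, 1.1, 1.0, 0.95]  # invented example curves
val = [1.5, 1.4, 1.4, 1.5, 1.7, 1.9]
stuck = [2.0, 2.0, 2.0, 2.0]

print(loss_trend(train), loss_trend(stuck), looks_overfit(train, val))
```

Note that the training setup in this notebook does not pass an evaluation dataset, so the overfitting check only becomes usable if you hold out part of the data and log a validation loss as well.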

### What to do after training?

In [None]:
# Save the fine-tuned model
trainer.model.save_pretrained(new_model)
wandb.finish()
model.config.use_cache = True
model.eval()

#### Let’s test the model

In [None]:
def stream(user_prompt):
    runtimeFlag = "cuda:0"
    system_prompt = 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n'
    B_INST, E_INST = "### Instruction:\n", "### Response:\n"
    prompt = f"{system_prompt}{B_INST}{user_prompt.strip()}\n\n{E_INST}"
    inputs = tokenizer([prompt], return_tensors="pt").to(runtimeFlag)
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Despite returning the usual output, the streamer will also print the generated text to stdout.
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=500)


stream("what is newtons 3rd law and its formula")

### Upload the model to a Hugging Face repository

#### Step 1: After training, release the GPU memory held by the model and trainer using the code below. This helps prevent out-of-memory errors when the base model is reloaded for merging in the next step.

In [None]:
# Clear the memory footprint
del model, trainer
torch.cuda.empty_cache()

#### Step 2: The subsequent step involves merging the adapter with the model.

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_name, low_cpu_mem_usage=True,
    return_dict=True,torch_dtype=torch.float16,
    device_map= {"": 0})
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
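
Conceptually, merging folds the scaled low-rank update into the base weight so that the adapter is no longer needed at inference time. The toy example below, with plain lists standing in for tensors and invented values, checks that the merged weight gives the same output as running the base weight and the adapter side by side. It illustrates the idea only and is not the `merge_and_unload` implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge(W, down, up, scaling):
    """Return W + scaling * (down @ up) as a single dense matrix."""
    delta = matmul(down, up)
    return [[w + scaling * d for w, d in zip(wrow, drow)]
            for wrow, drow in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (2x2)
down = [[0.5], [0.25]]         # adapter down-projection (2x1)
up = [[2.0, 4.0]]              # adapter up-projection (1x2)
scaling = 0.5                  # plays the role of lora_alpha / r

x = [[2.0, 4.0]]

# Path 1: base weight plus adapter, evaluated separately.
base_out = matmul(x, W)
adapter_out = matmul(matmul(x, down), up)
y_adapter = [[b + scaling * a for b, a in zip(brow, arow)]
             for brow, arow in zip(base_out, adapter_out)]

# Path 2: a single matmul with the merged weight.
W_merged = merge(W, down, up, scaling)
y_merged = matmul(x, W_merged)

print(y_adapter, y_merged)
```

Both paths produce the same output, which is why the merged model can be shared and loaded like any ordinary checkpoint without `peft` installed.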

#### Step 3: Finally, once the merge is complete, push the merged model and tokenizer to the Hugging Face Hub so that others in the community can access them.

In [None]:
model.push_to_hub(new_model)
tokenizer.push_to_hub(new_model)

### Conclusion:

Our assessment indicates that the model’s performance is promising but falls short of being outstanding. It’s essential to highlight that fine-tuning a model on platforms like Google Colab comes with its set of challenges. The time limitations and resource constraints can make this task a formidable one.