<a href="https://www.kaggle.com/code/rraydata/use-less-gpu-resource-to-fine-tune-llama-and-llama?scriptVersionId=147887099" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Use Fewer GPU Resources to Fine-Tune LLaMA and Llama 2

Today we'll explore fine-tuning the Llama 2 model available on Kaggle Models using multi-LoRA.

- LoRA: [Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685): freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
- QLoRA: [Quantized Low Rank Adapters](https://arxiv.org/abs/2305.14314): QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low-Rank Adapters (LoRA). While the base model is quantized to NF4, the trained LoRA parameters remain at a higher precision, typically 16-bit (FP16 or BF16).
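To make the parameter reduction concrete, here is a small worked calculation for a single weight matrix. It assumes a square 4096 x 4096 projection (the hidden size of the 7B model) and rank `r = 8`; the function name is ours, purely for illustration.

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_in * d_out        # fine-tuning W directly
    lora = r * (d_in + d_out)  # LoRA trains A (r x d_in) and B (d_out x r) instead
    return full, lora


full, lora = lora_param_counts(d_in=4096, d_out=4096, r=8)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA with r=8:    {lora:,} trainable parameters ({lora / full:.2%} of full)")
```

The frozen matrix `W` is still needed for the forward pass; only the number of parameters that receive gradients shrinks.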


Disadvantages of LoRA-based Approaches:

- Memory Consumption: LoRA reduces the number of trainable parameters, but the adapter matrices are stored alongside the frozen original weights, so the full base model still has to fit in memory for every fine-tuning job.
- Potential for Reduced Model Accuracy: LoRA-based fine-tuning aims to maintain or improve model accuracy. However, the low-rank approximation can cause a dip in performance, especially if the rank and target modules aren't chosen carefully.
- Dependence on Hyperparameters: Like other ML techniques, LoRA-based strategies involve hyperparameters that need careful tuning. Poor choices can lead to subpar performance.


[ASPEN: Efficient LLM Model Fine-tune and Inference via Multi-Lora Optimization](https://github.com/TUDB-Labs/multi-lora-fine-tune#experiment-results) is an open-source framework for fine-tuning Large Language Models (LLMs) with multiple LoRA/QLoRA adapters at once. Multi-LoRA improves on plain LoRA-based approaches in three ways:

- GPU Memory Conservation: One foundational model is shared across multiple fine-tuning processes, which significantly saves resources.
- Automatic Parameter Learning: Automating the hyperparameter search during fine-tuning can speed up the process and help find better-performing configurations.
- Early Stopping Mechanism: Training stops once the model's improvement becomes negligible, which helps limit overfitting and avoids wasting resources.
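As a rough back-of-the-envelope sketch of the memory saving, we can compare the weight storage for two adapters trained in separate jobs with two adapters sharing one base model. The numbers are assumptions for illustration: about 7 billion base parameters stored in 8-bit, 32 layers with hidden size 4096, and FP16 adapters on `q_proj` and `v_proj` with `r = 8`. Gradients, optimizer state, and activations are deliberately left out.

```python
GIB = 2**30

# Assumptions (not measured): ~7e9 base parameters at 1 byte each (8-bit).
BASE_PARAMS = 7e9
BASE_BYTES_PER_PARAM = 1

# Adapter size: r * (d_in + d_out) per adapted matrix, two matrices per layer.
HIDDEN, LAYERS, RANK, ADAPTED_PER_LAYER = 4096, 32, 8, 2
adapter_params = RANK * (HIDDEN + HIDDEN) * ADAPTED_PER_LAYER * LAYERS
adapter_gib = adapter_params * 2 / GIB  # FP16 = 2 bytes per parameter

base_gib = BASE_PARAMS * BASE_BYTES_PER_PARAM / GIB
n_adapters = 2

separate_jobs = n_adapters * (base_gib + adapter_gib)  # one base copy per job
shared_base = base_gib + n_adapters * adapter_gib      # one base copy in total

print(f"adapter parameters each: {adapter_params:,}")
print(f"separate jobs: {separate_jobs:.2f} GiB of weights")
print(f"shared base:   {shared_base:.2f} GiB of weights")
```

The adapters are tiny next to the base model, so under these assumptions almost all of the saving comes from loading the base weights only once.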

## 1. Clone multi-lora repository

In [None]:
import os
os.chdir('/kaggle/working/')

In [None]:
!git clone https://github.com/TUDB-Labs/multi-lora-fine-tune.git

## 2. Install dependencies

In [None]:
!pip install -r /kaggle/working/multi-lora-fine-tune/requirements.txt

## 3. Configure fine-tuning datasets and parameters

You can add multiple LoRA parameter sets and datasets.

ASPEN can be used on:

1) Domain-Specific Fine-Tuning: This adapts a single base model with several different parameter settings for one domain.

2) Cross-Domain Fine-Tuning: This uses one foundational model to train multiple adapters, each designed for a different domain, by incorporating datasets from different or identical domains.
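The two scenarios differ only in what varies between adapter entries. The sketch below builds `lora` entries from a shared template, using the field names and demo paths from this notebook's config. The helper `make_adapter`, the adapter names `lora_a`/`lora_b`, and the `/path/to/...` dataset paths are hypothetical placeholders, not part of ASPEN.

```python
import copy

DEMO_DATA = "/kaggle/working/multi-lora-fine-tune/data/data_demo.json"
DEMO_PROMPT = "/kaggle/working/multi-lora-fine-tune/template/template_demo.json"

# Shared defaults, mirroring the values used in this notebook's config.
BASE_ADAPTER = {
    "optim": "adamw", "lr": 3e-4,
    "batch_size": 16, "micro_batch_size": 4, "test_batch_size": 64,
    "num_epochs": 3, "r": 8, "alpha": 16, "dropout": 0.05,
    "target_modules": {
        "q_proj": True, "k_proj": False, "v_proj": True, "o_proj": False,
        "w1_proj": False, "w2_proj": False, "w3_proj": False,
    },
}


def make_adapter(name, data=DEMO_DATA, prompt=DEMO_PROMPT, **overrides):
    """Hypothetical helper: copy the template and override selected fields."""
    entry = copy.deepcopy(BASE_ADAPTER)
    entry.update(name=name, output=name, data=data, test_data=data, prompt=prompt)
    entry.update(overrides)
    return entry


# Domain-specific: same dataset, different hyperparameters.
domain_specific = [make_adapter("lora_0", lr=3e-4), make_adapter("lora_1", lr=1e-4)]

# Cross-domain: different datasets (placeholder paths), same hyperparameters.
cross_domain = [
    make_adapter("lora_a", data="/path/to/domain_a.json"),
    make_adapter("lora_b", data="/path/to/domain_b.json"),
]

for entry in domain_specific + cross_domain:
    print(entry["name"], entry["lr"], entry["data"])
```

Either list can be placed under the `"lora"` key of the config and serialized with `json.dumps`.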

The demo data and prompt template are for demonstration purposes only.

In [None]:
!cat /kaggle/working/multi-lora-fine-tune/data/data_demo.json

In [None]:
!cat /kaggle/working/multi-lora-fine-tune/template/template_demo.json

In [None]:
config_string = """
{
    "cutoff_len": 256,
    "group_by_length": false,
    "expand_right": true,
    "pad_token_id": -1,
    "save_step": 2000,
    "early_stop_test_step": 2000,
    "train_lora_candidate_num": 4,
    "train_lora_simultaneously_num": 2,
    "train_strategy": "optim",
    "lora": [
        {
            "name": "lora_0",
            "output": "lora_0",
            "optim": "adamw",
            "lr": 3e-4,
            "batch_size": 16,
            "micro_batch_size": 4,
            "test_batch_size": 64,
            "num_epochs": 3,
            "r": 8,
            "alpha": 16,
            "dropout": 0.05,
            "target_modules": {
                "q_proj": true,
                "k_proj": false,
                "v_proj": true,
                "o_proj": false,
                "w1_proj": false,
                "w2_proj": false,
                "w3_proj": false
            },
            "data": "/kaggle/working/multi-lora-fine-tune/data/data_demo.json",
            "test_data": "/kaggle/working/multi-lora-fine-tune/data/data_demo.json",
            "prompt": "/kaggle/working/multi-lora-fine-tune/template/template_demo.json"
        },
        {
            "name": "lora_1",
            "output": "lora_1",
            "optim": "adamw",
            "lr": 3e-4,
            "batch_size": 16,
            "micro_batch_size": 4,
            "test_batch_size": 64,
            "num_epochs": 3,
            "r": 8,
            "alpha": 16,
            "dropout": 0.05,
            "target_modules": {
                "q_proj": true,
                "k_proj": false,
                "v_proj": true,
                "o_proj": false,
                "w1_proj": false,
                "w2_proj": false,
                "w3_proj": false
            },
            "data": "/kaggle/working/multi-lora-fine-tune/data/data_demo.json",
            "test_data": "/kaggle/working/multi-lora-fine-tune/data/data_demo.json",
            "prompt": "/kaggle/working/multi-lora-fine-tune/template/template_demo.json"
        }
    ]
}
"""

with open("./config.json", "w") as f:
    f.write(config_string)
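
Before launching a long run, it can help to sanity-check the config. The checks below are our own suggestions rather than requirements documented by ASPEN: adapter names should be unique so outputs don't collide, at least one target module should be enabled, and we assume `batch_size` is meant to be a multiple of `micro_batch_size`. The function name `check_config` is hypothetical.

```python
import json


def check_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    adapters = config.get("lora", [])
    if not adapters:
        problems.append("no adapters defined under 'lora'")
    names = [a["name"] for a in adapters]
    if len(set(names)) != len(names):
        problems.append("adapter names are not unique")
    for a in adapters:
        # Assumption: each batch is processed in equal-sized micro-batches.
        if a["batch_size"] % a["micro_batch_size"] != 0:
            problems.append(f"{a['name']}: batch_size is not a multiple of micro_batch_size")
        if not any(a["target_modules"].values()):
            problems.append(f"{a['name']}: no target module is enabled")
    return problems


# Small stand-in config using the same field names as the cell above.
sample = json.loads("""
{"lora": [{"name": "lora_0", "batch_size": 16, "micro_batch_size": 4,
           "target_modules": {"q_proj": true, "v_proj": true}}]}
""")
print(check_config(sample))
```

In the notebook, the same function can be applied to the real config with `check_config(json.loads(config_string))`. Parsing with `json.loads` also catches syntax errors such as a trailing comma.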

## 4. Set the base model path and config file path to start fine-tuning

Remember to check that the session is running on a GPU.
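One way to check this from a code cell is to ask `nvidia-smi` which GPUs it can see. This is a minimal sketch using only the standard library; the function name `visible_gpus` is ours, and it returns an empty list if the tool is missing or reports an error.

```python
import shutil
import subprocess


def visible_gpus() -> list:
    """Return the GPU description lines printed by `nvidia-smi -L`, if any."""
    if shutil.which("nvidia-smi") is None:
        return []  # tool not installed: most likely a CPU-only session
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return []  # driver present but no usable GPU
    return [line for line in result.stdout.splitlines() if line.startswith("GPU ")]


gpus = visible_gpus()
if gpus:
    print("GPU session:", *gpus, sep="\n  ")
else:
    print("No GPU detected. Enable an accelerator in the notebook settings.")
```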

In [None]:
!python /kaggle/working/multi-lora-fine-tune/mlora.py \
  --base_model /kaggle/input/llama-2/pytorch/7b-hf/1 \
  --config /kaggle/working/config.json \
  --load_8bit

## 5. Download the fine-tuned adapters

After training finishes, two files (lora_0, lora_1) appear in the current directory. These are the fine-tuned model adapters, and we can download them.
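To download everything in one go, the adapter outputs can be bundled into a single zip archive first. This is a sketch with a hypothetical helper, `bundle_adapters`, which accepts each output whether it is a single file or a directory; the archive name `adapters.zip` is an arbitrary choice.

```python
import zipfile
from pathlib import Path


def bundle_adapters(names, base_dir=".", archive="adapters.zip"):
    """Zip the named adapter outputs found in base_dir and return the archive path."""
    base = Path(base_dir)
    archive_path = base / archive
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            target = base / name
            if not target.exists():
                raise FileNotFoundError(f"adapter output not found: {target}")
            # A directory contributes all of its files; a plain file only itself.
            if target.is_file():
                files = [target]
            else:
                files = sorted(p for p in target.rglob("*") if p.is_file())
            for path in files:
                zf.write(path, path.relative_to(base).as_posix())
    return archive_path
```

In this notebook the call would be `bundle_adapters(["lora_0", "lora_1"], base_dir="/kaggle/working")`, after which the archive appears in the output panel next to the adapters.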