# Parameter-Efficient Fine-Tuning (QLoRA) of Llama 2

In [1]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7 xformers

In [2]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

This cell prepares to load a model for **causal language modeling** with the transformers library from Hugging Face. It sets up a configuration for a model that **is optimized for running in a resource-constrained environment** by storing the weights with 4-bit quantization. The model itself is loaded in the next cell. The key components and settings are:

- **Importing Required Libraries:** The script begins with three imports:

- **AutoModelForCausalLM from the transformers module:** This class loads a pre-trained model for causal language modeling, a task in which the model generates each token based on the tokens that precede it.

- **BitsAndBytesConfig from the same module:** This class configures how the model is quantized, which lets the weights be **stored in reduced precision** to save memory.

- **torch:** This is the PyTorch library, which provides tensor computations and **enables model training on GPUs.** Here it supplies the `torch.float16` dtype used in the configuration.

**Configuring the bitsandbytes (bnb) Settings:** The bnb_config object is an instance of the BitsAndBytesConfig class. The configuration sets four parameters:

- **load_in_4bit=True:** This loads the model weights in 4-bit precision, which cuts the memory needed for the quantized weights to roughly a quarter of what float16 would require, at the cost of some quantization error.
- **bnb_4bit_quant_type="nf4":** This parameter sets the quantization type to "nf4" (4-bit NormalFloat), the data type introduced with QLoRA. Its quantization levels are chosen for normally distributed values, which is a good match for typical neural network weights.

- **bnb_4bit_compute_dtype=torch.float16:** This sets the dtype used for computation to 16-bit floating point (float16). The weights are stored in 4 bits but are dequantized to float16 whenever they are used in a matrix multiplication.

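- **bnb_4bit_use_double_quant=False:** This setting disables double quantization, a technique that also quantizes the per-block quantization constants to save additional memory (roughly 0.4 bits per parameter according to the QLoRA paper), with an extra dequantization step as the trade-off. It is set to False here, so the quantization constants are kept in their original precision and the extra memory saving is forgone.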
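To make the memory effect of double quantization concrete, the sketch below computes how many bits of quantization-constant storage are needed per model parameter, with and without it. The block sizes (64 weights per block, 256 constants per second-level block) and constant widths (32-bit and 8-bit) are assumptions taken from the description in the QLoRA paper, not values read from this notebook, and `constant_overhead_bits` is a hypothetical helper name.

```python
def constant_overhead_bits(double_quant, block_size=64, const_bits=32,
                           dq_const_bits=8, dq_block_size=256):
    """Bits of quantization-constant storage per model parameter.

    All default values are assumptions based on the QLoRA paper.
    """
    if not double_quant:
        # One const_bits-wide constant is stored per block of weights.
        return const_bits / block_size
    # First-level constants are stored in dq_const_bits each, plus one
    # const_bits-wide second-level constant per dq_block_size of them.
    return dq_const_bits / block_size + const_bits / (block_size * dq_block_size)


plain = constant_overhead_bits(double_quant=False)
nested = constant_overhead_bits(double_quant=True)
saved = plain - nested

print(f"without double quantization: {plain:.3f} bits per parameter")
print(f"with double quantization:    {nested:.3f} bits per parameter")
print(f"saving:                      {saved:.3f} bits per parameter")
```

Under these assumptions the saving is small per parameter but adds up over billions of weights, which is why the option exists even though this notebook leaves it off.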

Overall, this configuration reduces the memory needed to hold the base model, which matters both for inference and for the QLoRA fine-tuning this notebook targets. Storing the weights in 4-bit NF4 while computing in float16 trades a small amount of precision for a large reduction in memory, which makes the setup well suited to hardware with limited GPU memory.
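As a rough illustration of that trade-off, the following sketch estimates the memory needed for the weights alone at 16 and 4 bits per parameter. The parameter count of 7 billion is an assumption rounded from the "7b" in the model name, and `weight_memory_gb` is a hypothetical helper. The estimate ignores quantization constants, any layers that stay in higher precision, activations, and optimizer state, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: rounded from the "7b" in the model name

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"float16 weights: about {fp16_gb:.1f} GB")
print(f"4-bit weights:   about {nf4_gb:.1f} GB")
```

The ratio between the two numbers, not their exact values, is the point: 4-bit storage needs about a quarter of the float16 weight memory.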

In [3]:
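model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,  # apply the 4-bit settings defined above
    device_map={"": 0},  # place the whole model on GPU 0
)

# Disable the key/value cache; it only helps during generation
model.config.use_cache = False
# A value of 1 selects the standard linear-layer computation path
model.config.pretraining_tp = 1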


In [None]:
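from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf", trust_remote_code=True)
# The Llama 2 tokenizer has no dedicated padding token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token
# Append padding tokens after the text rather than before it
tokenizer.padding_side = "right"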