## Finetune codellama-34B with QLoRA

### Check out my [Twitter (@rohanpaul_ai)](https://twitter.com/rohanpaul_ai) for daily LLM bits

In [None]:
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from datasets import load_dataset
from trl import SFTTrainer
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
import bitsandbytes as bnb

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    return list(lora_module_names)


def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.4f}"
    )

def setup_environment():
    """ Returns the output directory and the name of the base model. """
    output_dir = "./results"
    model_name = "codellama/CodeLlama-34b-hf"
    return output_dir, model_name

def load_and_prepare_dataset():
    """ Loads the dataset and prepares it for training. """
    return load_dataset('timdettmers/openassistant-guanaco', split="train")

def initialize_tokenizer(model_name):
    """ Initializes and configures the tokenizer. """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    return tokenizer

def create_base_model(model_name):
    """ Creates and configures the base model with low-bit quantization. """
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    base_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, quantization_config=bnb_config)
    base_model.config.use_cache = False
    return prepare_model_for_kbit_training(base_model)

def apply_peft_to_model(base_model):
    """ Applies parameter-efficient fine-tuning (PEFT) with LoRA adapters to the base model. """
    peft_config = LoraConfig(
        r=32,
        lora_alpha=16,
        target_modules=find_all_linear_names(base_model),
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    return get_peft_model(base_model, peft_config)

def setup_training(base_model, dataset, tokenizer, output_dir="./results"):
    """ Configures training arguments and initializes the trainer. """
    training_args = TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        gradient_checkpointing=True,
        max_grad_norm=0.3,
        num_train_epochs=3,
        learning_rate=1e-4,
        bf16=True,
        save_total_limit=3,
        logging_steps=300,
        output_dir=output_dir,
        optim="paged_adamw_32bit",
        # Note: the "constant" scheduler ignores warmup; use "constant_with_warmup" to apply warmup_ratio.
        lr_scheduler_type="constant",
        warmup_ratio=0.05,
    )
    return SFTTrainer(
        base_model,
        train_dataset=dataset,
        dataset_text_field="text",
        tokenizer=tokenizer,
        max_seq_length=512,
        args=training_args
    )

def train_and_save_model(trainer, output_dir):
    """ Handles the training process and saves the model. """
    trainer.train()
    trainer.save_model(output_dir)
    final_output_dir = os.path.join(output_dir, "final_checkpoint")
    trainer.model.save_pretrained(final_output_dir)
    # Use the tokenizer held by the trainer instead of relying on a notebook-level global.
    trainer.tokenizer.save_pretrained(final_output_dir)



In [None]:
output_dir, model_name = setup_environment()

dataset = load_and_prepare_dataset()

tokenizer = initialize_tokenizer(model_name)

base_model = create_base_model(model_name)

base_model = apply_peft_to_model(base_model)

print_trainable_parameters(base_model)

trainer = setup_training(base_model, dataset, tokenizer)

train_and_save_model(trainer, output_dir)

------------------

### Explanation of the key parameters of `BitsAndBytesConfig`


```py
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
```

👉 **`load_in_4bit`** loads the model with its weights quantized to 4-bit precision.

The linear-layer weights are stored in 4 bits instead of the usual 16 or 32 bits, which significantly reduces the model's memory footprint: roughly 8x less weight memory than float32 and 4x less than float16 or bfloat16. Activations and matrix multiplications are not 4-bit; the weights are dequantized on the fly to the compute dtype.

Quantization can cost some accuracy, so if you need the highest possible accuracy you may want to keep the model in 16-bit or full precision.
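To make the memory savings concrete, here is a back-of-the-envelope sketch. It counts weight storage only (no activations, optimizer state, or quantization constants), and the round figure of 34 billion parameters is an assumption for illustration.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


# Assumed round figure for a 34B-parameter model (illustrative only).
N_PARAMS = 34e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(N_PARAMS, bits):6.1f} GiB")
```

The ratio between the float32 and 4-bit rows is 8x, and between bfloat16 and 4-bit it is 4x.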

--------------

👉 `bnb_4bit_use_double_quant=True`: enables double quantization (also called nested quantization), which quantizes the quantization constants produced by the first quantization step. It saves roughly an additional 0.4 bits per parameter.
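The "0.4 bits per parameter" figure can be reproduced with a little arithmetic. The sketch below assumes the defaults described in the QLoRA paper (weights quantized in blocks of 64 with one 32-bit constant per block, and, with double quantization, those constants stored in 8 bits in groups of 256 with one 32-bit second-level constant per group). These numbers are assumptions taken from the paper, not read from this notebook's configuration.

```python
def overhead_bits_plain(block_size: int = 64, const_bits: int = 32) -> float:
    """Bits per parameter spent on quantization constants without double quantization."""
    return const_bits / block_size


def overhead_bits_double_quant(
    block_size: int = 64,
    quantized_const_bits: int = 8,
    second_block_size: int = 256,
    second_const_bits: int = 32,
) -> float:
    """Bits per parameter spent on constants when the constants are themselves quantized."""
    first_level = quantized_const_bits / block_size
    second_level = second_const_bits / (block_size * second_block_size)
    return first_level + second_level


plain = overhead_bits_plain()
nested = overhead_bits_double_quant()
print(f"without double quant: {plain:.3f} bits/param")
print(f"with double quant:    {nested:.3f} bits/param")
print(f"saved:                {plain - nested:.3f} bits/param")
```

Under these assumptions the saving is about 0.37 bits per parameter, which is usually rounded to 0.4.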

--------------

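👉 `use_nested_quant`: not a `BitsAndBytesConfig` argument itself. Some training scripts use a flag with this name to decide whether nested (double) quantization is enabled, and pass its value to `bnb_4bit_use_double_quant`.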

--------------

👉 `bnb_4bit_quant_type="nf4"`: specifies the 4-bit quantization data type. "nf4" refers to 4-bit NormalFloat, a data type introduced with QLoRA and designed for normally distributed weights. The other option is "fp4", which is the default in `BitsAndBytesConfig`.
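The idea behind a NormalFloat-style data type can be sketched in plain Python: choose 16 levels from the quantiles of a standard normal distribution, scale each block of weights by its absolute maximum, and store the index of the nearest level. This is a simplified illustration only. The helper names are hypothetical, and the levels built here are not the exact codebook used by bitsandbytes (the real NF4 codebook is asymmetric and contains an exact zero).

```python
from statistics import NormalDist


def normal_quantile_levels(n_levels: int = 16) -> list[float]:
    """Build levels from evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    nd = NormalDist()
    quantiles = [nd.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    top = max(abs(q) for q in quantiles)
    return [q / top for q in quantiles]


def quantize_block(block: list[float], levels: list[float]) -> tuple[list[int], float]:
    """Return one 4-bit code per weight plus the block's absmax scaling constant."""
    absmax = max(abs(x) for x in block) or 1.0  # avoid dividing by zero for all-zero blocks
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - x / absmax))
        for x in block
    ]
    return codes, absmax


def dequantize_block(codes: list[int], absmax: float, levels: list[float]) -> list[float]:
    """Map codes back to approximate weights."""
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()
codes, absmax = quantize_block([0.5, -2.0, 0.1, 1.0], levels)
print(codes, absmax)
print(dequantize_block(codes, absmax, levels))
```

Because the levels are denser near zero, small weights (which are the most common under a normal distribution) are represented more finely than large ones.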

--------------

👉 `bnb_4bit_compute_dtype=torch.bfloat16`: sets the data type used for computation. The options include float16, bfloat16, and float32; bfloat16 is used here for faster training.

This setting is needed because bitsandbytes stores the weights in 4 bits, but the computation still happens in 16-bit or 32-bit precision.

Matrix multiplication and training are faster with a 16-bit compute dtype. Note that the default value of this parameter is `torch.float32`, so a 16-bit dtype has to be set explicitly.
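As a rough aid for choosing between the compute dtypes, the sketch below derives the largest finite value and the spacing just above 1.0 from each format's exponent and mantissa widths. The bit widths are the standard ones for these formats, and the helper name is hypothetical. It illustrates why bfloat16 is less prone to overflow than float16 while being coarser.

```python
def float_format_stats(exp_bits: int, mantissa_bits: int) -> tuple[float, float]:
    """Return (largest finite value, spacing just above 1.0) for an IEEE-style format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_normal = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    epsilon = 2.0 ** -mantissa_bits
    return max_normal, epsilon


FORMATS = {
    "float16": (5, 10),   # 1 sign, 5 exponent, 10 mantissa bits
    "bfloat16": (8, 7),   # 1 sign, 8 exponent, 7 mantissa bits
    "float32": (8, 23),   # 1 sign, 8 exponent, 23 mantissa bits
}

for name, (exp_bits, mantissa_bits) in FORMATS.items():
    max_normal, epsilon = float_format_stats(exp_bits, mantissa_bits)
    print(f"{name:>8}: max = {max_normal:.4g}, epsilon = {epsilon:.3g}")
```

bfloat16 shares float32's exponent range, so it covers the same magnitudes with fewer significant digits, whereas float16 has more precision but a much smaller range.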

--------------

Does Floating Point 4-bit precision quantization have any hardware requirements?

Note that this method is only compatible with GPUs, so it is not possible to quantize models in 4-bit on a CPU. Among GPUs there is no specific hardware requirement: any GPU can run 4-bit quantization as long as CUDA>=11.2 is installed. Keep in mind that the computation is not done in 4-bit; the weights are compressed to that format, and the computation is carried out in the desired or native dtype.

--------------

FP8 and FP4 stand for Floating Point 8-bit and 4-bit precision, respectively. They are part of the minifloats family of floating point values (among other precisions, the minifloats family also includes bfloat16 and float16).

----------------

### Further possibilities for improving / re-organizing the code

### Performance and Memory Optimization

1. **Batch Size and Gradient Accumulation**: We're using a per-device batch size of 1 with `gradient_accumulation_steps=1`, so there is effectively no gradient accumulation. If hardware permits, increasing the batch size can improve training efficiency; otherwise, raising the gradient accumulation steps gives a larger effective batch size without a matching increase in peak memory. Balancing the two is key for good GPU utilization.

2. **Model Parallelism**: If we're working with very large models and have access to multiple GPUs, model parallelism can be beneficial. This involves splitting the model across different GPUs.

3. **Data Loading Optimization**: Optimizing data loading can have a significant impact on training speed. Consider prefetching, multi-threaded data loading, and storing the dataset on a fast-access storage medium.
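The relationship between batch size and gradient accumulation can be sketched with simple arithmetic. The function names and the dataset size of 10,000 examples are hypothetical, and the step count is an approximation: the trainer's exact bookkeeping may differ slightly.

```python
import math


def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of examples contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def optimizer_steps_per_epoch(num_examples: int,
                              per_device_batch_size: int,
                              gradient_accumulation_steps: int,
                              num_devices: int = 1) -> int:
    """Approximate number of optimizer updates in one pass over the data."""
    batch = effective_batch_size(
        per_device_batch_size, gradient_accumulation_steps, num_devices
    )
    return math.ceil(num_examples / batch)


NUM_EXAMPLES = 10_000  # hypothetical dataset size, for illustration only

for accumulation in (1, 4, 16):
    print(
        f"accumulation={accumulation:>2}: "
        f"effective batch={effective_batch_size(1, accumulation):>2}, "
        f"steps/epoch={optimizer_steps_per_epoch(NUM_EXAMPLES, 1, accumulation)}"
    )
```

With the notebook's settings (batch size 1, accumulation 1) every example triggers its own optimizer update, so raising either value changes both the gradient noise and the number of updates per epoch.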

### Hyperparameter Tuning

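1. **Learning Rate Scheduler**: We're using a constant learning rate. Note that the `"constant"` scheduler does not apply warmup, so `warmup_ratio=0.05` has no effect here; `"constant_with_warmup"` would apply it. Experimenting with other schedules, such as linear decay or cyclical learning rates, might yield better results.

2. **Optimizer Tweaks**: We are using `paged_adamw_32bit`, a paged variant of AdamW. Exploring other optimizers, such as a non-paged or 8-bit `AdamW`, or `SGD` with momentum, could offer different memory and performance characteristics.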
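To compare schedules, the sketch below computes the learning-rate multiplier at a given step for three common shapes. The functions are hypothetical stand-ins written for illustration, not the library's own implementations, and the 1,000 total steps are an assumed value.

```python
def constant(step: int, total_steps: int, warmup_steps: int = 0) -> float:
    """Constant schedule: the multiplier is always 1 and warmup is ignored."""
    return 1.0


def constant_with_warmup(step: int, total_steps: int, warmup_steps: int) -> float:
    """Linear ramp from 0 to 1 during warmup, then constant."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return 1.0


def linear_decay_with_warmup(step: int, total_steps: int, warmup_steps: int) -> float:
    """Linear ramp during warmup, then linear decay to 0 at the final step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


BASE_LR = 1e-4
TOTAL_STEPS = 1000                      # assumed, for illustration
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)  # mirrors warmup_ratio=0.05

for step in (0, 25, 50, 525, 1000):
    rates = [
        BASE_LR * schedule(step, TOTAL_STEPS, WARMUP_STEPS)
        for schedule in (constant, constant_with_warmup, linear_decay_with_warmup)
    ]
    print(step, [f"{lr:.2e}" for lr in rates])
```

Printing the three columns side by side shows where the schedules differ: only at the start for warmup, and increasingly toward the end for linear decay.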

### Advanced Techniques

1. **Regularization Techniques**: Regularization methods such as dropout, weight decay, or, where applicable to your task, data augmentation can help prevent overfitting. LoRA dropout is already set here via `lora_dropout=0.05`.

2. **Evaluation Strategy**: Ensure a robust evaluation strategy is in place, including validation during training and possibly more nuanced evaluation metrics tailored to your specific application.

3. **Experiment Tracking**: If not already in place, integrating an experiment tracking system like Weights & Biases or TensorBoard can be very helpful for monitoring training progress and comparing training runs.