In this notebook, we are going to fine-tune a pre-existing language model on a new dataset using a technique called Low-rank Adaptation (LoRA) on Intel discrete GPUs. We're going to use the Hugging Face Transformers library to handle the model and the training process.



Let's import the necessary libraries for our fine-tuning task. These include libraries for data manipulation (pandas), transformers (transformers), PyTorch (torch), and the Intel extension for PyTorch (intel_extension_for_pytorch), among others.

### Import necessary libraries and modules

In [2]:
import json
import os
import sys
import warnings
from typing import List

import fire
import pandas as pd
import transformers
import torch
import intel_extension_for_pytorch as ipex

from peft import (
    LoraConfig,
    get_peft_model,
    get_peft_model_state_dict,
)

from transformers import LlamaForCausalLM
from transformers import LlamaTokenizer
from datasets import load_dataset
from datasets import load_from_disk




Now let's set up the environment for the Intel Extension for PyTorch (IPEX). IPEX accelerates deep learning inference and training on Intel hardware. Just as importantly, it gives torch the `xpu` namespace, so that you can call functions such as `torch.xpu.get_device_name()`.

Here we force IPEX to treat the multiple slices (compute blocks) of the dGPU as a single GPU by setting `IPEX_TILE_AS_DEVICE` to "0". This matters for our use case because we are working with a GPU (e.g., the 1550 dGPU) that has multiple slices per card, and we want to use its full 128 GB of VRAM.

In [3]:
os.environ["IPEX_TILE_AS_DEVICE"] = "0"

Check whether an XPU device is available for training. (XPU is Intel's umbrella term for its compute devices: CPUs, GPUs, FPGAs, and AI accelerators.) If one is available, its device name is printed. In this notebook the device is a GPU, the Intel Data Center GPU Max 1550.

In [4]:
# Check for XPU availability
if torch.xpu.is_available():
    print("Using '{}' as an xpu device.".format(torch.xpu.get_device_name()))

Next, we define the hyperparameters and configuration for our model and the LoRA process. These settings include the LoRA parameters (rank r, alpha value lora_alpha, target modules, and dropout rate), the batch sizes, the number of training steps, the learning rate, and the directory where we'll save our outputs.

### LoRA - What is it?

Before that, let's understand what LoRA is and why it is significant. 

Low-Rank Adaptation, or LoRA, is a method to fine-tune large language models (LLMs). LLMs are pre-trained on diverse, large-scale datasets to gain general language understanding. The problem, however, is that fine-tuning these models on specific tasks is computationally expensive and often requires substantial resources.

LoRA is a technique that lets us fine-tune these models effectively at a fraction of the cost of traditional methods. The idea is to restrict the fine-tuning update to a low-rank subspace of the original parameter space. Instead of updating all parameters of the model during fine-tuning, we freeze the pre-trained weights and train only a pair of small low-rank matrices added alongside selected layers, making the process more efficient.

The key idea is to reduce the complexity of the fine-tuning process by training only this small set of additional parameters.

To further understand this, you need to know what a 'rank' of a matrix is. In simple terms, the rank of a matrix in linear algebra is a measure of the 'dimensionality' of the information it contains. A lower rank means that the matrix can represent fewer dimensions, and consequently, carries less information.

![low rank decompose](./images/lora_decompose.png)



When we say a 'low-rank' matrix, we are essentially talking about a simpler, compressed representation of the original data. The 'low-rank' characteristic refers to the fact that this matrix has fewer dimensions, but these dimensions are chosen in such a way that they capture the most important aspects of the data.

In the case of LoRA, the low-rank matrix is designed to capture the essential information required to adapt the pre-trained model to the new, specific task. The goal of introducing a low-rank matrix in LoRA is not to discard information, but to distill and concentrate the fine-tuning updates into this matrix. This allows us to get the most 'bang for our buck' — achieving effective fine-tuning while dramatically reducing the computational complexity of the process.
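To make the ideas of rank and low-rank factorization concrete, here is a small pure-Python sketch. The matrices `B` and `A`, the helper names `matmul`, `matrix_rank`, and `lora_param_count`, and the 4096 x 4096 layer size are illustrative assumptions, not values taken from the model used in this notebook.

```python
from fractions import Fraction


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matrix_rank(m):
    """Compute the rank by Gaussian elimination with exact fractions."""
    rows = [[Fraction(v) for v in row] for row in m]
    rank = 0
    for col in range(len(rows[0])):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Eliminate this column from every row below the pivot.
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
        if rank == len(rows):
            break
    return rank


def lora_param_count(d_out, d_in, r):
    """Number of values in B (d_out x r) plus A (r x d_in)."""
    return r * (d_out + d_in)


# B is 4x2 and A is 2x4, so their product is a 4x4 matrix of rank at most 2.
B = [[1, 0], [0, 1], [1, 1], [2, -1]]
A = [[1, 2, 0, 1], [0, 1, 1, 3]]
delta_w = matmul(B, A)
print("rank of B @ A:", matrix_rank(delta_w))

# Hypothetical 4096 x 4096 weight matrix adapted with rank 8.
print("full update parameters:", 4096 * 4096)
print("LoRA update parameters:", lora_param_count(4096, 4096, 8))
```

The 4 x 4 product has sixteen entries but only two independent rows, which is why storing the two thin factors is enough to represent it.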


To use LoRA, we create a LoraConfig object. This is a configuration class provided by the peft library, whose name stands for Parameter-Efficient Fine-Tuning. This class accepts various parameters that control the behavior of LoRA:

    r: The 'rank' of the LoRA matrix. This controls the complexity of the LoRA matrix and hence, the amount of information it can represent. A lower rank will result in a simpler matrix and hence faster, less resource-intensive fine-tuning, while a higher rank will allow for more complex adaptations but at the cost of increased computational resources. Here we set it to 8, a value that strikes a good balance between efficiency and effectiveness.

    lora_alpha: The 'alpha' hyperparameter in LoRA. The low-rank update is scaled by lora_alpha / r before it is added to the output of the frozen layer, so this value controls the strength of the LoRA adaptation relative to the original model parameters. Higher values give the LoRA matrices a stronger influence, while lower values leave the original model parameters more dominant. The original LoRA paper notes that tuning alpha is roughly equivalent to tuning the learning rate, so the best value depends on the specific task and dataset and may require some experimentation. Here we set it to 16.

    target_modules: The layers or modules of the model to which LoRA is applied. Here we specify that LoRA is applied to the "q_proj" and "k_proj" layers of the Transformer model, which are part of the attention mechanism.

    lora_dropout: The dropout rate for the LoRA layers. Dropout is a regularization technique that randomly 'drops out' (i.e., sets to zero) a proportion of the layer's outputs during training, to prevent overfitting. Here we set it to 0.05, indicating a 5% dropout rate.

    bias: Which bias parameters are trained alongside the LoRA matrices. Here we set it to "none", so no bias terms are updated.

    task_type: The type of task for which the model is being fine-tuned. Here we set it to "CAUSAL_LM", indicating that we're fine-tuning for a causal language modeling task.

In [5]:
# Hyper-parameters and configuration
# Low-rank adaptation (LoRA) parameters
LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
LORA_TARGET_MODULES = [
    "q_proj",
    "k_proj",
]
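How `r`, `lora_alpha`, and the frozen weights interact can be sketched in a few lines. This is a simplified illustration in plain Python, assuming the update is added as `(lora_alpha / r) * B @ A @ x`, the form described in the LoRA paper. The helper names and toy matrices are hypothetical, and the toy adapter uses rank 1 purely for readability.

```python
LORA_R = 8
LORA_ALPHA = 16
SCALING = LORA_ALPHA / LORA_R  # 16 / 8 = 2.0


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scaling=SCALING):
    """Frozen layer output plus the scaled low-rank correction."""
    base = matvec(W, x)               # W stays frozen during fine-tuning
    update = matvec(B, matvec(A, x))  # only A and B are trained
    return [b + scaling * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]  # toy frozen weight
A = [[1.0, 1.0]]              # toy down-projection (rank 1 for readability)
x = [1.0, 2.0]

# With B at zero (its usual initial value) the adapter changes nothing.
B_init = [[0.0], [0.0]]
print(lora_forward(W, A, B_init, x))

# After training, a nonzero B shifts the output by scaling * B @ A @ x.
B_trained = [[0.5], [0.0]]
print(lora_forward(W, A, B_trained, x))
```

Because the correction is multiplied by `lora_alpha / r`, doubling `lora_alpha` doubles the adapter's contribution for the same learned `A` and `B`.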

LoRA enables us to 'distill' the essence of the fine-tuning updates into this low-rank matrix, reducing the computational complexity of the fine-tuning process. This makes the process more efficient, requiring less memory and compute resources, and enables the fine-tuning of LLMs even on devices with modest hardware specifications.

Furthermore, by focusing the learning on low-rank matrices, LoRA also helps to mitigate 'catastrophic forgetting', a common problem in fine-tuning where the model forgets the knowledge it gained during pre-training. Since the original parameters of the model stay frozen in LoRA, the model retains its pre-training knowledge, which often results in better performance on downstream tasks.

The LoRA configuration is then passed to the get_peft_model() function, which modifies the original Transformer model to include the LoRA adaptation. The adapted model can then be trained as usual. Thus, the real magic of LoRA is its ability to find a balance between efficiency (using a low-rank matrix) and effectiveness (capturing the most critical information for fine-tuning). This makes it a highly practical and powerful approach for fine-tuning large language models.

In [6]:
# Other config and hyper params
# Training parameters
BATCH_SIZE = 128
MICRO_BATCH_SIZE = 12
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
LEARNING_RATE = 3e-4
TRAIN_STEPS = 100
OUTPUT_DIR = "experiments"
CUTOFF_LEN = 256
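The arithmetic behind the accumulation setting is worth spelling out, because 128 is not a multiple of 12. A quick check, using the constants from the cell above and assuming a single device with these values passed to the trainer as the per-device batch size and accumulation steps:

```python
BATCH_SIZE = 128
MICRO_BATCH_SIZE = 12

# Integer division rounds down: 128 // 12 = 10 accumulation steps.
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE

# Samples that contribute to each optimizer update on one device.
effective_batch_size = GRADIENT_ACCUMULATION_STEPS * MICRO_BATCH_SIZE

print(GRADIENT_ACCUMULATION_STEPS)  # 10
print(effective_batch_size)         # 120
```

Each optimizer step therefore sees 120 samples rather than 128. Choosing a micro-batch size that divides the target batch size (16, for example) would make the two match.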


###  Base Model and tokenizers

Now, let's define the base language model that we will use for fine-tuning. The base model is specified by its identifier, which is a string that points to a particular model in the Hugging Face model hub. In this case, the base model is "openlm-research/open_llama_3b".

Then, we load the base model and its associated tokenizer using the from_pretrained method. This method fetches the model and tokenizer from the Hugging Face model hub.

Finally, we set the padding configuration for the tokenizer. Padding ensures that all sequences in a batch have the same length. The pad token id is set to 0, and padding is added to the left of each sequence. Causal language models like Llama predict the next token from the tokens that precede it, so left padding keeps the real tokens at the end of every sequence, directly before the position where the next token is predicted.

In [7]:
# Define base model
BASE_MODEL = "openlm-research/open_llama_3b"

# Load the pretrained base model and tokenizer
model = LlamaForCausalLM.from_pretrained(BASE_MODEL)
tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)

# Set padding configuration for tokenizer
tokenizer.pad_token_id = 0
tokenizer.padding_side = "left"
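To show what left padding produces, here is a minimal stand-in for the tokenizer's padding step. The function `left_pad` and the token ids are made up for illustration; in the notebook the real work is done by the tokenizer itself.

```python
def left_pad(sequences, pad_id=0):
    """Pad every sequence on the left to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        # 0 marks padding positions, 1 marks real tokens.
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


ids, mask = left_pad([[5, 6, 7], [8, 9]])
print(ids)
print(mask)
```

The shorter sequence gains a pad token at the front, so the last position of every row holds a real token.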

### Helper Functions

Several helper functions are defined to facilitate the data preprocessing for model training.

    print_trainable_parameters(model): This function takes a PyTorch model as an input and prints the number of trainable parameters in the model. It iterates through all parameters of the model and counts the ones that require gradient computation (i.e., the ones that will be updated during training).

    prompter(data): This function takes a sample from the Simpsons dataset and formats it into a dialogue prompt. The prompt consists of an instruction, input, and a response.

    tokenize(prompt, add_eos_token=True): This function takes a prompt, tokenizes it using the model's tokenizer, and optionally adds an end-of-sequence (EOS) token at the end. The tokenized prompt is then returned.

    generate_and_tokenize_prompt(data): This function combines the above steps by generating a dialogue prompt from the given data and then tokenizing this prompt.

In [11]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    # ... code ...

def prompter(data):
    """
    Format a dialogue prompt from Simpsons dataset.
    """
    # ... code ...

def tokenize(prompt, add_eos_token=True):
    """
    Tokenize the prompt and add EOS token if not present.
    """
    # ... code ...

def generate_and_tokenize_prompt(data):
    """
    Generate a dialogue prompt and tokenize it.
    """
    # ... code ...
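Since the bodies above are elided, the following sketch shows one plausible shape for the prompt formatting and the EOS handling. Everything specific here is an assumption: the Alpaca-style template, the field names `instruction`, `input`, and `output`, the helper name `finalize_token_ids`, and the EOS id of 2. The notebook's actual implementation may differ.

```python
CUTOFF_LEN = 256
EOS_TOKEN_ID = 2  # hypothetical; read it from tokenizer.eos_token_id in practice


def prompter(data):
    """Format one sample into an instruction/input/response prompt (assumed template)."""
    return (
        f"### Instruction:\n{data['instruction']}\n\n"
        f"### Input:\n{data['input']}\n\n"
        f"### Response:\n{data['output']}"
    )


def finalize_token_ids(token_ids, eos_token_id=EOS_TOKEN_ID,
                       cutoff_len=CUTOFF_LEN, add_eos_token=True):
    """Truncate to cutoff_len, then append EOS if there is room and it is missing."""
    ids = list(token_ids[:cutoff_len])
    needs_eos = not ids or ids[-1] != eos_token_id
    if add_eos_token and len(ids) < cutoff_len and needs_eos:
        ids.append(eos_token_id)
    return {
        "input_ids": ids,
        "attention_mask": [1] * len(ids),
        "labels": list(ids),  # causal LM: labels are a copy of the inputs
    }


# Made-up sample, for illustration only.
sample = {
    "instruction": "Reply to the line of dialogue.",
    "input": "How was your day?",
    "output": "Pretty good, thanks.",
}
print(prompter(sample))
print(finalize_token_ids([1, 5, 6]))
```

Appending EOS only when it is absent and the sequence is shorter than the cutoff avoids both duplicate EOS tokens and sequences that exceed `CUTOFF_LEN`.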


### LoRA Configuration & Dataset Loading

Now we reach the `main` section of the code. First, a LoraConfig object is created to configure the LoRA parameters for the model. Then the base model is wrapped with LoRA using the get_peft_model() function, which prepares the model for fine-tuning by adding the low-rank matrices to the specified layers.

Next, we print the number of trainable parameters in our model by calling the print_trainable_parameters function. This shows how small the trainable portion of the model is and, in turn, gives an indication of the computational resources required to fine-tune it.

Finally, the code checks whether the preprocessed dataset is available on disk. If it is, the dataset is loaded; otherwise, the raw data is loaded and preprocessed. The train_val object is a split of the dataset into training and validation sets, while train_data and val_data are the tokenized versions of those splits.

In [None]:
if __name__ == "__main__":
    # Configure LoRA for the model
    config = LoraConfig(
        # ... parameters ...
    )
    model = get_peft_model(model, config)
    print_trainable_parameters(model)  # the helper prints the counts itself

    # Load data
    # ... loading and preprocessing code ...
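For reference, counting trainable parameters might look like the sketch below. It assumes only the usual PyTorch interface (`model.parameters()`, `param.numel()`, `param.requires_grad`); the function names `count_trainable_parameters` and `format_trainable_parameters` are hypothetical, and the notebook's own elided helper may differ.

```python
def count_trainable_parameters(model):
    """Return (trainable, total) parameter counts for a PyTorch-style model."""
    trainable = 0
    total = 0
    for param in model.parameters():
        n = param.numel()
        total += n
        # Frozen base weights have requires_grad=False; LoRA matrices do not.
        if param.requires_grad:
            trainable += n
    return trainable, total


def format_trainable_parameters(trainable, total):
    """Build a one-line summary of the counts."""
    pct = 100 * trainable / total if total else 0.0
    return (
        f"trainable params: {trainable:,} || "
        f"all params: {total:,} || trainable%: {pct:.4f}"
    )


# Usage with the wrapped model from this notebook would be:
#   trainable, total = count_trainable_parameters(model)
#   print(format_trainable_parameters(trainable, total))
```

After wrapping with LoRA, the trainable count should be a small fraction of the total, since only the low-rank matrices require gradients.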


### Training Preparation

For training, we use the TrainingArguments class from the Hugging Face transformers library to define the training configurations.

Then, we print out some details of the training configuration, such as the process rank, the device used for training, the number of GPUs, and the type of distributed training.

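Following that, a data collator is created, which collates individual data samples into a batch for training. The DataCollatorForSeq2Seq class is used. Although it was designed for sequence-to-sequence models, it pads the labels along with the input ids, which is also what a causal language model needs when the samples in a batch differ in length.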

Finally, a Trainer object is initialized with the model, training dataset, evaluation dataset, training arguments, and data collator. This object is used to manage the training and evaluation process.

### Important HuggingFace Trainer Training Arguments for Intel dGPUs (XPUs)


- **bf16**: This argument is used to specify whether to use BFloat16 precision for model training or not. BFloat16 can offer better training speed and memory utilization while providing acceptable levels of model accuracy.
- **no_cuda**: This argument is set to True to indicate that we are not using CUDA to train the model.
- **use_xpu**: This argument is set to True to indicate that the model training should be performed on the Intel GPU (XPU). Note that for this option to work, the necessary Intel PyTorch extensions and drivers need to be properly installed and the Intel GPU needs to be available in the system.
- **use_ipex**: This argument is set to True to indicate that the Intel Extension for PyTorch (IPEX) should be used. IPEX can improve model performance on Intel hardware through low-precision computation and other optimization techniques.

In [None]:
# Training arguments
training_arguments = transformers.TrainingArguments(
    # ... parameters ...
)

# Printing training configuration
# ... print statements ...

data_collator = transformers.DataCollatorForSeq2Seq(
    # ... parameters ...
)

# Huggingface Trainer config
trainer = transformers.Trainer(
    # ... parameters ...
)
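Putting the four flags together with the hyperparameters defined earlier, the keyword arguments might be collected as in the sketch below. The mapping of the notebook's constants onto `per_device_train_batch_size`, `gradient_accumulation_steps`, `max_steps`, `learning_rate`, and `output_dir` is an assumption, since the actual arguments are elided in the cell above. Whether `use_xpu` is accepted also depends on the installed transformers version.

```python
# Values copied from the configuration cells earlier in the notebook.
MICRO_BATCH_SIZE = 12
GRADIENT_ACCUMULATION_STEPS = 128 // MICRO_BATCH_SIZE
LEARNING_RATE = 3e-4
TRAIN_STEPS = 100
OUTPUT_DIR = "experiments"

training_kwargs = dict(
    # Assumed mapping of the notebook's constants onto trainer arguments.
    per_device_train_batch_size=MICRO_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    max_steps=TRAIN_STEPS,
    learning_rate=LEARNING_RATE,
    output_dir=OUTPUT_DIR,
    # Flags described in the list above.
    bf16=True,      # train in BFloat16
    no_cuda=True,   # do not use CUDA
    use_xpu=True,   # run on the Intel GPU (version-dependent flag)
    use_ipex=True,  # enable the Intel Extension for PyTorch
)

# These would then be passed on, for example:
#   training_arguments = transformers.TrainingArguments(**training_kwargs)
print(training_kwargs)
```

Keeping the arguments in a plain dictionary makes it easy to print or log the exact configuration before constructing the trainer.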


###  Training & Saving

Before training, we replace the model's state_dict() method. PyTorch uses this method to gather a model's parameters when saving and loading, and because we have wrapped the base model with LoRA, we want it to return the right set of parameters.

When we save a model in PyTorch, it calls state_dict() to gather all the parameters and persistent buffers. The result is a Python dictionary that maps each parameter name to its tensor.

For a standard PyTorch model, that is exactly what we want. After wrapping the model with LoRA (Low-Rank Adaptation), however, the default state_dict() returns all of the frozen base-model weights together with the small low-rank matrices. The base weights have not changed, so writing them into every checkpoint wastes time and disk space.

By routing state_dict() through get_peft_model_state_dict(), we make it return only the LoRA parameters, so that checkpoints contain just the adapter weights.

To run inference later, we load the original base model and apply the saved LoRA weights on top of it. The adapter is only meaningful in combination with the base model it was trained on.

Then, we call trainer.train() to start the training process. Once training is completed, the model is saved to the specified output directory using the save_pretrained() method.

In [None]:
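# The key/value cache only helps during generation, so turn it off for training
model.config.use_cache = False

# Route state_dict() through get_peft_model_state_dict() so checkpoints carry the LoRA weights
old_state_dict = model.state_dict
model.state_dict = (
    lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
).__get__(model, type(model))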

# train and save the model
trainer.train()
model.save_pretrained(OUTPUT_DIR)
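The effect of the patched state_dict() can be pictured as a filter over parameter names. The sketch below is a simplified stand-in for what get_peft_model_state_dict() does when `bias` is "none", not the library's code; the function name `keep_lora_weights` and the parameter names in `full_state` are made up for illustration.

```python
def keep_lora_weights(state_dict):
    """Keep only entries whose parameter name contains 'lora_'."""
    return {name: value for name, value in state_dict.items() if "lora_" in name}


# Toy state dict: strings stand in for tensors.
full_state = {
    "layers.0.self_attn.q_proj.weight": "frozen base weight",
    "layers.0.self_attn.q_proj.lora_A.weight": "adapter matrix A",
    "layers.0.self_attn.q_proj.lora_B.weight": "adapter matrix B",
    "layers.0.self_attn.v_proj.weight": "frozen base weight",
}

adapter_state = keep_lora_weights(full_state)
print(sorted(adapter_state))
```

Only the two adapter entries survive, which is why a saved LoRA checkpoint is small compared with the base model.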


In conclusion, in this notebook we fine-tuned a transformer language model using the Hugging Face transformers library with a technique called Low-Rank Adaptation (LoRA). This method allows us to adapt a large pre-trained language model to a specific task with far fewer trainable parameters. We used the Simpsons dialogue dataset to fine-tune the language model to generate plausible responses to dialogue prompts. The notebook is structured to run efficiently on an Intel GPU, using the Intel Extension for PyTorch, and it employs the Trainer API from Hugging Face for training and evaluation.