# Assignment 2

In this assignment, we're set to embark on an exciting journey to the cutting edge of machine learning methodologies. Specifically, we'll explore and apply state-of-the-art fine-tuning techniques to large language models. The techniques we'll delve into are not only robust but also resource-efficient, allowing us to perform the task of fine-tuning even on the free T4 GPUs available in a Kaggle notebook.

Among the techniques we will explore is **Low-Rank Adaptation (LoRA)**, a novel method that has proven to be efficient and effective in adapting large pre-trained language models to specific tasks. LoRA is grounded in the hypothesis that updates to the weights during adaptation have a low "intrinsic rank", allowing us to constrain weight updates and reduce computational complexity, while preserving model performance.

Complementing LoRA, we will also engage with **mixed-precision training**. This technique combines different numerical precisions to perform computations, aiming to maximize the computational power of modern GPUs. Mixed-precision training can accelerate model training, reduce memory requirements, and thus enable us to train larger, more powerful models.

Finally, we will delve into **distributed training**, a must-know technique for handling very large models or datasets. With distributed training, we can leverage multiple GPUs or even multiple machines to collectively train a single model, effectively overcoming the limitations posed by the memory capacity of individual GPUs.

By the end of this assignment, you should be well-acquainted with these cutting-edge techniques and be capable of integrating them into your own machine learning projects. Let's embark on this exciting journey into the vanguard of machine learning fine-tuning methodologies!

### Dataset

The Stanford Alpaca dataset is part of a project that aims to build and share an instruction-following model called Alpaca. The dataset contains 52,000 examples used for fine-tuning the Alpaca model. Each example consists of a unique instruction that the model should follow, an optional context or input for the task, and the corresponding output generated by OpenAI's text-davinci-003 model. More information is available at the [Data release](https://github.com/tatsu-lab/stanford_alpaca/blob/main/README.md#data-release) and [Alpaca project page](https://crfm.stanford.edu/2023/03/13/alpaca.html).
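To make the record layout concrete, here is a minimal sketch of one example and a prompt builder. The record is made up for illustration (it is not taken from the dataset), and `build_prompt` is a hypothetical helper. The template wording is modeled on the one published in the Stanford Alpaca repository; the `lora-ft` code may format prompts differently.

```python
# A made-up record with the three Alpaca fields (not taken from the dataset).
example = {
    "instruction": "Translate the sentence into French.",
    "input": "The weather is nice today.",
    "output": "Il fait beau aujourd'hui.",
}


def build_prompt(record):
    """Hypothetical helper: format a record, skipping the optional input."""
    if record.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{record['instruction']}\n\n"
            f"### Input:\n{record['input']}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{record['instruction']}\n\n"
        "### Response:\n"
    )


prompt = build_prompt(example)
prompt_no_input = build_prompt(
    {"instruction": "Name three primary colors.", "input": "", "output": "Red, blue, and yellow."}
)
print(prompt + example["output"])
```

During supervised fine-tuning, the model would typically be trained to continue the prompt with the `output` field.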

### Model

[BigScience's BLOOM language model](https://bigscience.huggingface.co/blog/bloom) marked a significant step in AI research: at its release, it was the largest open multilingual language model available. With 176 billion parameters, BLOOM can generate text in 46 natural languages, including Spanish, French, and Arabic, and in 13 programming languages.

BLOOM represents the cumulative work of over 1000 researchers from more than 70 countries and 250 institutions, [including VietAI](https://www.washingtonpost.com/technology/2022/07/21/big-science-ai-open-source-language-model/). The model was trained on the Jean Zay supercomputer over a span of 117 days.

For our purposes, we will be fine-tuning the BLOOM-1.7B (1.7 billion parameters) variant, using the Stanford Alpaca dataset. This assignment will allow us to simulate the supervised fine-tuning phase, similar to what is done in the development of models like ChatGPT. By using the Alpaca dataset, we hope to enhance BLOOM-1.7B's ability to follow instructions and perform tasks as specified by users.

### Initial setup

To prepare the environment for our project, please execute the commands below:


In [1]:
# import shutil

# folder_path = 'lora-ft'

# try:
#     shutil.rmtree(folder_path)
#     print(f"Folder '{folder_path}' and its contents removed successfully.")
# except OSError as e:
#     print(f"Error: {e.strerror}")

Error: No such file or directory


In [2]:
!git clone https://github.com/viethoangtranduong/lora-ft.git

Cloning into 'lora-ft'...
remote: Enumerating objects: 41, done.
remote: Counting objects: 100% (41/41), done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 41 (delta 14), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (41/41), 35.67 KiB | 1.37 MiB/s, done.
Resolving deltas: 100% (14/14), done.


In [3]:
%cd /kaggle/working/lora-ft
!pip install -r requirements.txt

/kaggle/working/lora-ft
Collecting git+https://github.com/huggingface/peft.git (from -r requirements.txt (line 9))
  Cloning https://github.com/huggingface/peft.git to /tmp/pip-req-build-kmk_vl2e
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/peft.git /tmp/pip-req-build-kmk_vl2e
  Resolved https://github.com/huggingface/peft.git to commit 189a6b8e357ecda05ccde13999e4c35759596a67
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting gdown (from -r requirements.txt (line 1))
  Downloading gdown-4.7.1-py3-none-any.whl (15 kB)
Collecting loralib (from -r requirements.txt (line 4))
  Downloading loralib-0.1.1-py3-none-any.whl (8.8 kB)
Collecting bitsandbytes (from -r requirements.txt (line 5))
  Downloading bitsandbytes-0.39.0-py3-none-any.whl (92.2 MB)

## Part 1: Low-Rank Adaptation of Large Language Models for Efficient Fine-tuning

![](https://miro.medium.com/v2/resize:fit:730/1*D_i25E9dTd_5HMa45zITSg.png)

Figure 1: LoRA method. We only train A and B.


### 1. Introduction

**LoRA, or Low-Rank Adaptation**, is a technique for adapting large language models to specific tasks or domains more efficiently. It's based on the observation that as models get larger, the conventional approach of full fine-tuning becomes less feasible due to the large number of parameters involved.

Instead of updating every weight, LoRA injects a pair of small trainable low-rank matrices into each dense layer's update and optimizes them for the specific adaptation task, while the original pretrained model weights remain unchanged.

Here are some of the key points of the LoRA technique:

- **Freezing Pretrained Weights**: Instead of modifying all the parameters of a pretrained model during fine-tuning, LoRA freezes the pretrained weights. This means that the original model weights remain unchanged during the adaptation process.

- **Rank Decomposition Matrices**: LoRA freezes the pretrained model weights and *injects trainable rank decomposition matrices* into each layer of the Transformer architecture. These matrices are used to adjust the output of each layer in a way that's specific to the adaptation task.

- **Indirect Training of Dense Layers**: The rank decomposition matrices allow for the indirect training of each dense layer in the neural network. They are injected into the layer's update during the adaptation process and optimized to enhance the layer's performance on the specific task or domain.

- **Significant Reduction in Trainable Parameters**: By focusing on these rank decomposition matrices instead of the entire set of model weights, LoRA greatly reduces the number of trainable parameters for downstream tasks. For instance, in the case of GPT-3, LoRA can reduce the number of trainable parameters by a factor of 10,000.

- **Maintaining Model Performance**: Despite the significant reduction in the number of trainable parameters, the LoRA technique is designed to maintain or even improve the performance of the large language model on the specific task or domain.

In summary, LoRA is a method that tackles the challenge of adapting large language models to specific tasks or domains in a more efficient and feasible way, making the fine-tuning process more manageable and less resource-intensive.

### 2. Details

The LoRA technique introduces a mathematical concept known as low-rank approximation into the fine-tuning process of large language models. Here's a mathematical description of the process:

LoRA involves modifying the pre-trained weight matrix $\mathbf{W}_0 \in \mathbb{R}^{d \times k}$ of a neural network layer by introducing a low-rank parametrized update matrix $\Delta \mathbf{W} = \mathbf{B}\mathbf{A}$, where $\mathbf{B} \in \mathbb{R}^{d \times r}$, $\mathbf{A} \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.

During the adaptation process, $\mathbf{W}_0$ is kept frozen, which means it does not receive any gradient updates. The trainable parameters are contained within $\mathbf{A}$ and $\mathbf{B}$, which form the low-rank update matrix $\Delta \mathbf{W}$.
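A quick count shows how much smaller the trainable set becomes for a single layer. This is a minimal sketch: `lora_param_counts` is a hypothetical helper, and the sizes are illustrative values chosen for this example, not read from any model config.

```python
def lora_param_counts(d, k, r):
    """Return (full, lora) trainable-parameter counts for one d x k layer."""
    full = d * k          # every entry of W0 would be trained
    lora = r * (d + k)    # B is d x r, A is r x k
    return full, lora


# Illustrative sizes only (assumed for this example).
full, lora = lora_param_counts(d=2048, k=2048, r=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
# prints "full: 4,194,304  lora: 32,768  ratio: 128x"
```

The ratio is $\frac{dk}{r(d+k)}$, so it grows as the layer gets larger or the rank gets smaller.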

It's important to note that both $\mathbf{W}_0$ and $\Delta \mathbf{W} = \mathbf{B}\mathbf{A}$ are multiplied with the same input, and their respective output vectors are summed. If $\mathbf{x}$ is the input and $\mathbf{h} = \mathbf{W}_0\mathbf{x}$ is the output of the original weight matrix, the modified output is:

$\mathbf{h} = \mathbf{W}_0\mathbf{x} + \Delta \mathbf{W} \mathbf{x} = \mathbf{W}_0\mathbf{x} + \mathbf{B}\mathbf{A}\mathbf{x} = (\mathbf{W}_0 + \mathbf{B}\mathbf{A})\mathbf{x}$ 

in which computing $\mathbf{W}_0 + \mathbf{B}\mathbf{A}$ is called the **merge** operation. We will implement it in this assignment.

At the beginning of training, we initialize $\mathbf{A}$ with a random Gaussian distribution and $\mathbf{B}$ with zero, such that $\Delta \mathbf{W} = \mathbf{B}\mathbf{A}$ is zero, as shown in **Figure 1**. This ensures that the initial output of the model remains the same as in the pre-training phase, and the adaptation starts from the original model state. 

The low-rank update $\Delta \mathbf{W} = \mathbf{B}\mathbf{A}$ then evolves during training, helping to specialize the model for a specific task while keeping the number of trainable parameters manageable. Additionally, $\Delta \mathbf{W}$ is scaled by $\frac{\alpha}{r}$ where $\alpha$ is a constant hyper-parameter. **(*)**
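The equations above can be checked numerically on a toy layer. The sketch below uses plain Python lists so that it runs without any dependencies; it is not the `lora_layer.py` implementation, and all helper names (`matmul`, `add`, `scale`, `lora_forward`, `merge`) as well as the sizes and standard deviations are assumptions made for this example.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def scale(X, s):
    """Multiply every entry of X by the scalar s."""
    return [[s * a for a in row] for row in X]


random.seed(0)
d, k, r, alpha = 4, 3, 2, 4          # toy sizes; real layers are far larger
scaling = alpha / r

W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen, d x k
A = [[random.gauss(0, 0.02) for _ in range(k)] for _ in range(r)]   # r x k, Gaussian init
B = [[0.0] * r for _ in range(d)]                                   # d x r, zero init
x = [[random.gauss(0, 1)] for _ in range(k)]                        # column vector, k x 1


def lora_forward(W0, A, B, x, scaling):
    """Unmerged path: h = W0 x + (alpha / r) * B (A x)."""
    return add(matmul(W0, x), scale(matmul(B, matmul(A, x)), scaling))


def merge(W0, A, B, scaling):
    """Merged weight: W0 + (alpha / r) * B A."""
    return add(W0, scale(matmul(B, A), scaling))


# With B = 0 the update vanishes, so the adapted layer matches the frozen one.
h_base = matmul(W0, x)
h_init = lora_forward(W0, A, B, x, scaling)

# Pretend training has moved B away from zero.
B_trained = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]
h_unmerged = lora_forward(W0, A, B_trained, x, scaling)
h_merged = matmul(merge(W0, A, B_trained, scaling), x)

print(max(abs(u[0] - m[0]) for u, m in zip(h_unmerged, h_merged)))  # tiny rounding error
```

The unmerged path never materializes the full $d \times k$ update, while the merged path folds the update into a single matrix once and then behaves like an ordinary linear layer.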

This process is applied to each linear layer in the self-attention blocks of the BLOOM language model, leading to an adapted model that's specialized for a specific task or domain, with significantly fewer trainable parameters than the original model. For example, with GPT-3 175B, VRAM consumption during training is reduced from 1.2TB to 350GB. If $r = 4$ and only the query and value projection matrices are adapted, the checkpoint size is reduced by approximately 10,000 times (from 350GB to 35MB). This allows training with significantly fewer GPUs and helps to avoid communication overhead.
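The checkpoint figure can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes GPT-3 175B's published shape (96 Transformer layers, hidden size 12,288) and 16-bit storage; none of these values are stated in this assignment, so treat them as assumptions.

```python
n_layers = 96            # assumed: GPT-3 175B depth
d_model = 12288          # assumed: GPT-3 175B hidden size
r = 4
adapted_per_layer = 2    # query and value projections only
bytes_per_param = 2      # assumed: 16-bit checkpoint

params_per_matrix = r * (d_model + d_model)       # A is r x d, B is d x r
lora_params = n_layers * adapted_per_layer * params_per_matrix
lora_bytes = lora_params * bytes_per_param

full_bytes = 175e9 * bytes_per_param              # about 350 GB for the full model
reduction = full_bytes / lora_bytes

print(f"LoRA params: {lora_params:,}")
print(f"LoRA checkpoint: {lora_bytes / 1e6:.1f} MB")
print(f"reduction: ~{reduction:,.0f}x")
```

Under these assumptions the LoRA checkpoint comes out in the tens of megabytes, which is consistent with the quoted reduction of roughly four orders of magnitude.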

Another benefit is the ability to switch between tasks at a lower cost by only swapping the LoRA weights, as opposed to all parameters. This enables the creation of many customized models that can be swapped in and out on the fly on machines that store the pre-trained weights in VRAM.
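Task switching can be sketched as "unmerge the old update, merge the new one". In this toy example the adapter names are hypothetical and each stored matrix stands in for an already-scaled product $\frac{\alpha}{r}\mathbf{B}\mathbf{A}$; `add_in_place` and `swap` are helpers invented for this illustration.

```python
def add_in_place(W, delta, sign=1.0):
    """W += sign * delta, element by element."""
    for row_w, row_d in zip(W, delta):
        for j, dv in enumerate(row_d):
            row_w[j] += sign * dv


# Frozen base weight (2 x 2) and two hypothetical task-specific updates.
W = [[1.0, 2.0], [3.0, 4.0]]
base = [row[:] for row in W]
adapters = {
    "summarize": [[0.5, 0.0], [0.0, -0.5]],
    "translate": [[0.0, 0.25], [0.25, 0.0]],
}


def swap(W, old, new):
    """Unmerge the active adapter (if any), then merge the new one."""
    if old is not None:
        add_in_place(W, adapters[old], sign=-1.0)
    if new is not None:
        add_in_place(W, adapters[new], sign=1.0)
    return new


active = swap(W, None, "summarize")
w_summarize = [row[:] for row in W]

active = swap(W, active, "translate")
w_translate = [row[:] for row in W]

active = swap(W, active, None)   # back to the plain pre-trained weights
```

Only the small adapter matrices differ between tasks, so the large base weights stay resident in memory throughout. With real floating-point weights, repeated merge and unmerge can accumulate small rounding error; the values here were picked to be exactly representable.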

**(*)** The reason for scaling the update $\Delta \mathbf{W}\mathbf{x}$ by $\frac{\alpha}{r}$ is primarily for easier optimization.

Consider what happens when you try different values of the rank $r$ across experiments. If you were to increase or decrease $r$ without this scaling factor, it would significantly affect the magnitude of the weight updates and thereby the learning dynamics of the model. In other words, changing $r$ would mean that you need to retune the learning rate or other hyperparameters, which is a laborious and time-consuming task.

By scaling the updates by $\frac{\alpha}{r}$, the authors make the learning process more robust to changes in $r$. $\alpha$ is a constant, so this scaling factor effectively normalizes the magnitude of the updates relative to the rank of the low-rank approximation.

This way, even when $r$ changes, the overall scale of the updates remains approximately constant, meaning you can use the same learning rate and other hyperparameters. This is advantageous because it makes the training process more efficient and less sensitive to the choice of $r$.

Keep in mind that this is a heuristic and it may not always provide the optimal solution for every problem or dataset, but it is a practical choice that often works well in practice.

### 3. Implementation

Let's break down the code. Please take a look at the `lora_layer.py` file. The main components are:

- **LoraLayer** class: This is a base class that provides common functionality for both linear and embedding layers using the LoRA technique. It keeps track of LoRA parameters including the rank **r** and two sets of weights **lora_A** and **lora_B** (or **lora_embedding_A** and **lora_embedding_B** for the embedding layer). Two methods, **update_layer** and **update_layer_embedding**, are defined to update these parameters for linear and embedding layers, respectively.

- **Linear** and **Embedding** classes: These classes extend their corresponding PyTorch classes (**nn.Linear** and **nn.Embedding**) and the **LoraLayer** class. They initialize their superclasses as well as the LoRA parameters, and overwrite the **merge**, **unmerge**, and **forward** methods to implement the LoRA technique. The **merge** method combines the original weights of the layer with the LoRA weights, and **unmerge** undoes this operation. The **forward** method applies the layer operation either with or without the LoRA technique, depending on whether LoRA is enabled.

This assignment is heavily based on the internal codebase of the 🤗 PEFT library. 🤗 PEFT, or **Parameter-Efficient Fine-Tuning (PEFT)**, is a library for efficiently adapting pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model’s parameters. Recent state-of-the-art PEFT techniques achieve performance comparable to that of full fine-tuning.

If you are new to PEFT, get started by reading the [Quicktour](https://huggingface.co/docs/peft/quicktour) guide and conceptual guides for [LoRA](https://huggingface.co/docs/peft/conceptual_guides/lora) methods.

#### Q1: Implement the Merge operation in LoRA (15 points)
In the provided `lora_layer.py` file, your task is to complete the `merge` method within the `Linear` class. As a useful reference, consider the already implemented `merge` method in the `Embedding` class. This should provide a clear guide on how to approach this task.

#### Q2: Implement the Forward Pass in LoRA (15 points)
In the provided `lora_layer.py` file, your task is to complete the `forward` method within the `Linear` class. As a useful reference, consider the already implemented `forward` method in the `Embedding` class. This should provide a clear guide on how to approach this task.

#### Q3: Construct the LoRA Model and Dataloaders for Training (20 points)
In the provided `train.py` file, your task is to complete the `load_pretrained_model` and `prepare_dataloader` functions. The aforementioned PEFT Quicktour guide and LoRA conceptual guide can be useful references. Note that the details specific to distributed training can be ignored at this stage.

***Important note***: *If you have not confidently completed **Q1** and **Q2**, feel free to uncomment the line `model = get_peft_model(model, lora_config)` (and remove `model = LoraModelForCasualLM(model, lora_config)`) to use the PEFT library to create the LoRA model instead of using your own implementation. This is also useful when you wish to test your LoRA implementation by running training on both and comparing the results.*

Once you've finished the implementation, it's time to train your LoRA model. Congratulations!

*Note: To first validate your implementation with a small sample dataset, you can enable the **DEBUG** flag. This preliminary step will help ensure your code is functioning as expected before moving on to larger datasets.*

In [None]:
# train with sample dataset
!DEBUG=true python train.py 

In [None]:
# train with full dataset
!python train.py

##### Challenge 1: Will LoRA enhance inference speed? (5 points)

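No. While the LoRA weights are kept separate, each adapted layer has to compute both $\mathbf{W}_0\mathbf{x}$ and $\mathbf{B}\mathbf{A}\mathbf{x}$, which adds a small overhead. After merging, the layer is a single matrix of the same shape as $\mathbf{W}_0$, so inference runs at the same speed as the base model, not faster.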


##### Challenge 2: Will LoRA improve training speed? (5 points)

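Yes. Only $\Delta \mathbf{W}$ (that is, $\mathbf{A}$ and $\mathbf{B}$) is trained, so weight gradients and optimizer states are needed for far fewer parameters, which makes each training step cheaper.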

## Part 2: Mixed precision training

The paper "Mixed Precision Training" is a game-changer in the world of deep learning. It introduces a method that combines different numerical precisions (like 32-bit and 16-bit) during model training. By using lower precision for certain parts of the training process, such as the forward and backward passes, while keeping a full-precision master copy of the weights for the updates, we can speed up computations and reduce memory requirements without sacrificing accuracy. This technique leverages the increased computational power of modern GPUs and accelerators to achieve impressive results.

### Implementation

#### Q4: Implement Mixed Precision Training (15 points)
In the provided `train.py` file, your objective is to enable mixed precision training. To achieve this, complete the assignment of the `mixed_precision_dtype`, `self.ctx` and `self.gradscaler`. You may have to modify the `_run_batch` and `_run_epoch` using `self.gradscaler` in case you are using `mixed_precision_dtype` of `torch.float16`. If you paid close attention to the coding session during week 6, you should find this task straightforward.
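The reason a gradient scaler matters for `torch.float16` can be seen without a GPU. The sketch below uses the standard library's half-precision format code (`"e"` in `struct`) to round values to 16 bits. The gradient value and the fixed scale of 1024 are assumptions for illustration; a real scaler such as PyTorch's `GradScaler` chooses and adjusts its scale dynamically.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


grad = 1e-8            # a small gradient value (assumed for illustration)
loss_scale = 1024.0    # hypothetical fixed scale

# Stored directly in float16, the gradient underflows to zero.
unscaled = to_fp16(grad)

# Scaling the loss scales every gradient by the same factor (chain rule),
# which lifts the value into float16's representable range.
scaled = to_fp16(grad * loss_scale)

# Before the optimizer step, the gradient is unscaled in full precision.
recovered = scaled / loss_scale

print(unscaled, recovered)
```

The unscaled value is lost entirely, while the scaled one survives with a small rounding error. `torch.bfloat16` keeps the exponent range of 32-bit floats, which is why scaling is usually associated with `torch.float16` specifically.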

Once you have carried out these steps, execute the following cells to train your LoRA model with mixed precision training. You should observe a significant speed improvement in training.

In [None]:
# mixed precision training with sample dataset
!DEBUG=true MIX_PREC_TRAIN=true python train.py 

In [None]:
# mixed precision training with full dataset
!MIX_PREC_TRAIN=true python train.py

## Part 3: Distributed Training with DistributedDataParallel

When it comes to training large language models, like those used for NLP tasks, the computational requirements can be ridiculously expensive. These models often have billions of parameters and require vast amounts of data to train effectively. This is where distributed training, and more specifically DistributedDataParallel (DDP), comes into play.

Training large language models on a single GPU can be extremely time-consuming. DDP allows us to train these models across multiple GPUs, and even across several machines, by placing a full replica of the model on each GPU. This speeds up the process and allows a larger effective batch size than a single GPU could handle. Note that DDP on its own does not lift the per-GPU memory limit, since every replica must still fit on one GPU.

By replicating the model on every GPU and giving each replica its own subset of the data, we can train in parallel. This significantly reduces the time required to train these large models. Furthermore, gradients are averaged across replicas during each backward pass, so every replica applies the same update and the model copies stay consistent.
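The data-parallel idea can be simulated on a toy problem without any GPU. This is a sketch of the concept only, not of the `torch.distributed` API: a one-parameter model, a made-up dataset, a round-robin split across two simulated ranks, and a plain average standing in for the gradient all-reduce.

```python
def grad_mse(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]   # toy (x, y) pairs
world_size = 2                                            # e.g. two GPUs
w = 0.5                                                   # identical on every replica

# Each rank takes every world_size-th sample (round-robin sharding).
shards = [data[rank::world_size] for rank in range(world_size)]

# Every replica computes a local gradient on its own shard...
local_grads = [grad_mse(w, shard) for shard in shards]

# ...and averaging across ranks gives all replicas the same gradient.
synced_grad = sum(local_grads) / world_size

full_batch_grad = grad_mse(w, data)
print(local_grads, synced_grad, full_batch_grad)
```

Because the shards have equal size, the averaged gradient matches the gradient over the whole batch, which is why the replicas can take identical optimizer steps and stay in sync.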

In this section, we will use DistributedDataParallel for training large language models. Let's dive in!

### Implementation

#### Q5: Setup environment for DDP (25 points)

In the provided `train.py` file, your mission is to enable distributed training utilizing the `DistributedDataParallel` (DDP) module from PyTorch. This task involves selecting the correct `distributed_strategy`, initializing the process group, establishing the local rank, and filling out the `_set_ddp_training` function. Furthermore, you are required to adapt the `load_pretrained_model` function and the `prepare_dataloader` method to be compatible with DDP training.

Once you have carried out these steps, complete the `torchrun` commands below and execute the cells to train your LoRA model on Kaggle's GPU T4 x2 setup.

In [None]:
# distributed training with sample dataset
!DEBUG=true DDP=true MIX_PREC_TRAIN=true torchrun --standalone --nproc_per_node=2 train.py

In [None]:
# distributed training with full dataset
!DDP=true MIX_PREC_TRAIN=true torchrun --standalone --nproc_per_node=2 train.py

caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

### Inference

Once the training phase concludes, we can use the code below to try out our model and generate responses to some instructions. Let's give it a try!

In [None]:
from inference import generate_inference

model_path = 'bigscience/bloom-1b7' 
lora_weights_path = ...  # TODO: fill in the folder path
instruction = ...  # TODO: fill in the instruction
user_inp = ...  # TODO: fill in the input

generate_inference(instruction=instruction, user_inp=user_inp, model_path=model_path, lora_weights_path=lora_weights_path)