## Fine-tune BioMedGPT models using 🤗 `LoRA` adapters, `transformers` & `bitsandbytes`

In this tutorial we cover how to fine-tune a pretrained model with LoRA adapters, using the `loralib` and `loratorch` libraries to add the adapters. `bitsandbytes` is installed as well; it can load large models in 8-bit, although the small checkpoint used here does not need it.
The fine-tuning method is Low-Rank Adaptation (LoRA): instead of fine-tuning the entire model, you train only a small set of adapter weights and load them into the model.
After fine-tuning, you can also share your adapters on the 🤗 Hub and load them easily. Let's get started!
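
As a brief sketch of the idea (the notation follows the usual LoRA formulation and is not defined elsewhere in this notebook): for a frozen pretrained weight matrix $W_0$, LoRA learns two small matrices $B$ and $A$ of rank $r$ and adds their scaled product to the output of the frozen layer.

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad
W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}},\;
B \in \mathbb{R}^{d_{\text{out}} \times r},\;
A \in \mathbb{R}^{r \times d_{\text{in}}},\;
r \ll \min(d_{\text{in}}, d_{\text{out}})
```

Only $A$ and $B$ receive gradients, so each adapted layer adds $r\,(d_{\text{in}} + d_{\text{out}})$ trainable parameters. Here $\alpha$ is a scaling hyperparameter.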

### Install requirements

First, run the cells below to install the requirements:

In [2]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install git+https://github.com/Baijiong-Lin/LoRA-Torch

Collecting git+https://github.com/Baijiong-Lin/LoRA-Torch
  Cloning https://github.com/Baijiong-Lin/LoRA-Torch to /tmp/pip-req-build-o6z7afxi
  Run

In [3]:
!git lfs install
!git clone --single-branch --branch feature/add_transformers https://github.com/OFA-Sys/OFA.git
!pip install OFA/transformers/
!git clone https://huggingface.co/OFA-Sys/OFA-tiny


Git LFS initialized.
Cloning into 'OFA'...
remote: Enumerating objects: 5745, done.
remote: Counting objects: 100% (916/916), done.
remote: Compressing objects: 100% (254/254), done.
remote: Total 5745 (delta 695), reused 662 (delta 662), pack-reused 4829
Receiving objects: 100% (5745/5745), 97.78 MiB | 23.96 MiB/s, done.
Resolving deltas: 100% (2243/2243), done.
Processing ./OFA/transformers
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting sacremoses (from transformers==4.18.0.dev0)
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
  Preparing metadata (setup.py) ... done
Collecting tokenizers!=0.11.3,>=0.11.1 (from transformers==4.18.0.dev0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.

In [4]:
!pip install torchsummary
!pip install sentencepiece ## For faster tokenization

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


### Model loading

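Here we load the `OFA-tiny` checkpoint cloned above. It is a small model: the parameter count printed later in this notebook is 33,586,640, which is roughly 134 MB of weights in float32 (4 bytes per parameter) or about 67 MB in half precision (float16). At this size 8-bit loading is not needed, so the model is loaded without quantization.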
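
Weight memory can be estimated as parameter count times bytes per parameter. The sketch below is a back-of-the-envelope calculation only: the helper name `weight_memory_mb` is made up for illustration, the parameter count is the "before LoRA" figure printed later in this notebook, and activations, gradients, and optimizer state are ignored.

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
# Ignores activations, gradients, optimizer state, and framework overhead.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mb(num_params: int, dtype: str) -> float:
    """Return the approximate size of the weights in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


# Parameter count taken from the "before LoRA" printout in this notebook.
num_params = 33_586_640

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_mb(num_params, dtype):.1f} MB")
```

The same function applied to a model with billions of parameters shows why 8-bit loading matters there and is unnecessary here.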

In [98]:
import os
# Expose only the first GPU; set this before CUDA is initialized
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
from torchvision import transforms
from transformers import OFATokenizer, OFAModel
from transformers.models.ofa.generate import sequence_generator
import numpy as np
from torch.optim import Adam
from torchsummary import summary
from transformers import DataCollatorForSeq2Seq, get_cosine_schedule_with_warmup
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import EncoderDecoderConfig
from transformers import DataCollatorWithPadding
from transformers import EncoderDecoderModel
from transformers import CONFIG_MAPPING

import loralib as lora
import loratorch as LoraT

### Post-processing on the model

The cell below loads the tokenizer and the model from the local `OFA-tiny` checkpoint. When a model is loaded in 8-bit, some post-processing is needed to enable training: freeze all layers, and cast the layer norms and the output of the last layer to `float32` for stability. That step is kept, commented out, in the cell after this one; it is not applied here.

In [99]:
ckpt_dir='OFA-tiny'
tokenizer = OFATokenizer.from_pretrained(ckpt_dir)
model = OFAModel.from_pretrained(ckpt_dir, use_cache=False)

OFA-tiny
<super: <class 'OFATokenizer'>, <OFATokenizer object>>


In [22]:
# for param in model.parameters():
#   param.requires_grad = False  # freeze the model - train adapters later
#   if param.ndim == 1:
#     # cast the small parameters (e.g. layernorm) to fp32 for stability
#     param.data = param.data.to(torch.float32)

# model.gradient_checkpointing_enable()  # reduce number of stored activations

# class CastOutputToFloat(nn.Sequential):

#   def forward(self, x): return super().forward(x).to(torch.float32)
# model.encoder , model.decoder= CastOutputToFloat(model.encoder) ,  CastOutputToFloat(model.decoder)

### Apply LoRA

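Here comes the magic with LoRA! Instead of wrapping the model in a `PeftModel`, we replace every `nn.Linear` layer inside the attention blocks (module paths containing `self_attn` or `cross_attn`) with a `loratorch` `Linear` layer of rank `lora_r`, copying over the pretrained weights. The helper functions `make_lora_layer` and `make_lora_replace` below do this.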

In [102]:
lora_r = 4  # LoRA rank; set this before the layers are replaced
USE_LORA = True

def make_lora_layer(layer, lora_r=lora_r):
    """Build a loratorch Linear with the same shape as `layer` and copy its weights."""
    new_layer = LoraT.Linear(
        in_features=layer.in_features,
        out_features=layer.out_features,
        bias=layer.bias is not None,
        r=lora_r
    )

    # Copy the pretrained weight (and bias, if present) into the new layer
    new_layer.weight = nn.Parameter(layer.weight.detach().clone())

    if layer.bias is not None:
        new_layer.bias = nn.Parameter(layer.bias.detach().clone())

    return new_layer

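def make_lora_replace(model, depth=1, path="", verbose=True):
    """Recursively replace nn.Linear layers under self_attn/cross_attn with LoRA layers."""
    if depth > 10:  # stop descending past 10 levels of nesting
        return model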

    if isinstance(model, nn.Linear) and ("self_attn" in path or "cross_attn" in path):
        if verbose:
            print(f"Find linear {path}:", type(model))
        return make_lora_layer(model)

    for key, module in model.named_children():  # Using named_children() for cleaner iteration
        if isinstance(module, nn.Linear) and ("self_attn" in path or "cross_attn" in path):
            layer = make_lora_layer(module)
            setattr(model, key, layer)
            if verbose:
                print(f"Find linear {path}:{key} :", type(module))

        elif isinstance(module, nn.ModuleList):
            for i, elem in enumerate(module):
                layer = make_lora_replace(elem, depth+1, f"{path}:{key}[{i}]", verbose=verbose)
                if layer is not None:
                    module[i] = layer

        elif isinstance(module, nn.ModuleDict):
            for module_key, item in module.items():
                layer = make_lora_replace(item, depth+1, f"{path}:{key}:{module_key}", verbose=verbose)
                if layer is not None:
                    module[module_key] = layer

        else:
            layer = make_lora_replace(module, depth+1, f"{path}:{key}", verbose=verbose)
            if layer is not None:
                setattr(model, key, layer)

    return model

In [103]:
if USE_LORA:
    make_lora_replace(model, verbose=True)



Find linear :encoder:layers[0]:self_attn:k_proj : <class 'torch.nn.modules.linear.Linear'>
Find linear :encoder:layers[0]:self_attn:v_proj : <class 'torch.nn.modules.linear.Linear'>
Find linear :encoder:layers[0]:self_attn:q_proj : <class 'torch.nn.modules.linear.Linear'>
Find linear :encoder:layers[0]:self_attn:out_proj : <class 'torch.nn.modules.linear.Linear'>
Find linear :encoder:layers[1]:self_attn:k_proj : <class 'torch.nn.modules.linear.Linear'>
Find linear :encoder:layers[1]:self_attn:v_proj : <class 'torch.nn.modules.linear.Linear'>
Find linear :encoder:layers[1]:self_attn:q_proj : <class 'torch.nn.modules.linear.Linear'>
Find linear :encoder:layers[1]:self_attn:out_proj : <class 'torch.nn.modules.linear.Linear'>
Find linear :encoder:layers[2]:self_attn:k_proj : <class 'torch.nn.modules.linear.Linear'>
Find linear :encoder:layers[2]:self_attn:v_proj : <class 'torch.nn.modules.linear.Linear'>
Find linear :encoder:layers[2]:self_attn:q_proj : <class 'torch.nn.modules.linear.Line

In [104]:
lora_r = 4  # The rank is fixed when the layers are replaced above; changing it here does not affect existing LoRA layers

total_trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable parameters before LoRA: {total_trainable_params}")

## Apply LoRA
LoraT.mark_only_lora_as_trainable(model)

total_trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable parameters after LoRA: {total_trainable_params}")

Total trainable parameters before LoRA: 33586640
Total trainable parameters after LoRA: 98304
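
The two totals above can be sanity-checked with a little arithmetic. The sketch below uses the printed numbers directly; the per-layer breakdown (48 square projections of width 256) is an assumption chosen because it reproduces the printed total, not something read from the checkpoint, and `lora_param_count` is a hypothetical helper name.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Extra trainable parameters LoRA adds to one linear layer.

    A has shape (r, in_features) and B has shape (out_features, r).
    """
    return r * (in_features + out_features)


# Numbers printed by the cell above.
total_before = 33_586_640
total_after = 98_304

print(f"Trainable fraction after LoRA: {total_after / total_before:.4%}")

# Assumed shapes (hypothetical, not read from the checkpoint):
# 48 attention projections, each 256 -> 256, with rank 4.
per_layer = lora_param_count(256, 256, r=4)
print(per_layer, 48 * per_layer)
```

Doubling `r` doubles the number of adapter parameters, which is the main trade-off to keep in mind when trying other ranks.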


In [105]:


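# NOTE: the parameter names printed below do not contain "deberta", so this
# loop sets requires_grad = True on every parameter it prints. That re-enables
# training of the frozen weights after mark_only_lora_as_trainable above.
for name, param in model.named_parameters():
    if "deberta" not in name:
        print(name)
        # print(param.shape)
        param.requires_grad = True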



encoder.layernorm_embedding.weight
encoder.layernorm_embedding.bias
encoder.embed_tokens.weight
encoder.type_embedding.weight
encoder.embed_images.conv1.weight
encoder.embed_images.bn1.weight
encoder.embed_images.bn1.bias
encoder.embed_images.layer1.0.conv1.weight
encoder.embed_images.layer1.0.bn1.weight
encoder.embed_images.layer1.0.bn1.bias
encoder.embed_images.layer1.0.conv2.weight
encoder.embed_images.layer1.0.bn2.weight
encoder.embed_images.layer1.0.bn2.bias
encoder.embed_images.layer1.0.conv3.weight
encoder.embed_images.layer1.0.bn3.weight
encoder.embed_images.layer1.0.bn3.bias
encoder.embed_images.layer1.0.downsample.0.weight
encoder.embed_images.layer1.0.downsample.1.weight
encoder.embed_images.layer1.0.downsample.1.bias
encoder.embed_images.layer1.1.conv1.weight
encoder.embed_images.layer1.1.bn1.weight
encoder.embed_images.layer1.1.bn1.bias
encoder.embed_images.layer1.1.conv2.weight
encoder.embed_images.layer1.1.bn2.weight
encoder.embed_images.layer1.1.bn2.bias
encoder.embed_i

### Training

In [None]:
# Replace the repository id with your own Hub repository before running
model.push_to_hub("ybelkada/opt-6.7b-lora", use_auth_token=True)

### Load adapters from the Hub

You can also directly load adapters from the Hub using the commands below:
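
One way to share adapters is to store only the LoRA weights rather than the whole model. The sketch below illustrates that filtering step on a plain dictionary, assuming adapter parameters can be recognized by `lora_` in their names. The helper `lora_only` and the example names and values are made up for illustration.

```python
def lora_only(state_dict: dict) -> dict:
    """Keep only the entries whose name contains 'lora_' (the adapter weights)."""
    return {name: value for name, value in state_dict.items() if "lora_" in name}


# Illustrative stand-in for model.state_dict(); names and values are made up.
full_state = {
    "encoder.layers.0.self_attn.q_proj.weight": "frozen base weight",
    "encoder.layers.0.self_attn.q_proj.w_lora_A": "adapter A",
    "encoder.layers.0.self_attn.q_proj.w_lora_B": "adapter B",
}

adapter_state = lora_only(full_state)
print(sorted(adapter_state))
```

With a real model you would typically pass `model.state_dict()` to the filter, save the result with `torch.save`, and later restore it with `model.load_state_dict(..., strict=False)` on a model whose attention layers have already been replaced by LoRA layers. Uploading the file to the Hub and downloading it again is not covered by this sketch.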

### Inference

You can then use the trained model, or a model loaded from the 🤗 Hub, for inference as you normally would in `transformers`.

As you can see, by fine-tuning for a few steps we have almost recovered the quote from Albert Einstein that is present in the [training data](https://huggingface.co/datasets/Abirate/english_quotes).