This notebook shows how to combine multiple adapters fine-tuned with LoRA. I use Llama 2, but the same approach works with other LLMs.

To demonstrate how it works, the notebook combines an adapter fine-tuned for chat with another fine-tuned for translation. The result of this combination is one new adapter that can both chat and translate.

We need to install the following dependencies:

In [None]:
!pip install transformers accelerate peft bitsandbytes

Collecting accelerate
  Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)
Collecting peft
  Downloading peft-0.6.2-py3-none-any.whl (174 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.41.2.post2-py3-none-any.whl (92.6 MB)
Installing collected packages: bitsandbytes, accelerate, peft
Successfully installed accelerate-0.24.1 bitsandbytes-0.41.2.post2 peft-0.6.2


Enter your Hugging Face access token to be able to download Llama 2:

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
from peft import PeftModel

Load and quantize Llama 2 7B.

In [None]:
base_model = "meta-llama/Llama-2-7b-hf"
compute_dtype = torch.float16
# 4-bit NF4 quantization with double quantization; computation runs in float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    # To test without quantization, swap the next line for this one:
    # base_model, device_map={"": 0}, torch_dtype=torch.float16
    base_model, device_map={"": 0}, quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)

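
To give a rough idea of why the model is loaded in 4-bit, here is a back-of-the-envelope estimate of the memory taken by the weights alone. It assumes a round figure of 7 billion parameters and ignores the quantization constants, the layers kept in higher precision, the activations, and the KV cache, so the real footprint is somewhat higher. The function name is my own.

```python
# Rough weight-memory estimate; all figures are approximations.
N_PARAMS = 7e9  # assumed round parameter count for a "7B" model


def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Return the storage needed for the weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(N_PARAMS, 16)  # float16: 16 bits per parameter
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantization: 4 bits per parameter
print(f"float16: ~{fp16_gb:.1f} GB, 4-bit: ~{nf4_gb:.1f} GB")
```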

Define a function that generates text from a prompt and prints the result.

In [None]:
def generate(prompt):
    """Generate up to 130 new tokens for the prompt and print the decoded text."""
    tokenized_input = tokenizer(prompt, return_tensors="pt")
    input_ids = tokenized_input["input_ids"].to("cuda")

    generation_output = model.generate(
        input_ids=input_ids,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=130,
    )
    # Each decoded sequence contains the prompt followed by the generated tokens
    for seq in generation_output.sequences:
        output = tokenizer.decode(seq, skip_special_tokens=True)
        print(output.strip())
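
The decoded sequence starts with the prompt itself, followed by the generated tokens. If you only want the continuation, a small helper like the one below can strip the prompt. This is a minimal sketch: "strip_prompt" is a hypothetical name, and it assumes that decoding reproduces the prompt exactly, which may not always hold, hence the fallback.

```python
def strip_prompt(decoded: str, prompt: str) -> str:
    """Return only the continuation if `decoded` starts with `prompt`."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    # Fall back to the full text when decoding did not reproduce the prompt exactly
    return decoded.strip()


example = "Tu es le seul client du magasin. ###>You're the only customer in the store."
continuation = strip_prompt(example, "Tu es le seul client du magasin. ###>")
print(continuation)
```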

Load the adapters that we want to combine. If you have more adapters to load, call "load_adapter" once for each additional adapter.

Here, I load an adapter fine-tuned for chat (kaitchup/Llama-2-7B-oasstguanaco-adapter) and an adapter fine-tuned for translating French into English (kaitchup/Llama-2-7b-mt-French-to-English).

In [None]:
model = PeftModel.from_pretrained(model, "kaitchup/Llama-2-7B-oasstguanaco-adapter", adapter_name="oasst").cpu()
model.load_adapter("kaitchup/Llama-2-7b-mt-French-to-English", adapter_name="fren")
print(model)


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (rotary_emb): LlamaRotaryEmbedding()
            )
            (mlp): LlamaMLP(
              (gate_proj): Linear4bit(
                (lora_dropout): ModuleDict(
                  (oasst): Dropout(p=0.05, inplace=False)
                  (fren): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (oasst): Linear(in_features=409

Combine the adapters and save the resulting adapter in a directory named "cat_1_1".

In [None]:
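# Concatenate the two adapters with equal weights into a new adapter named "fren_oasst"
model.add_weighted_adapter(["fren", "oasst"], [1.0, 1.0], combination_type="cat", adapter_name="fren_oasst")
print(model)
# Save the adapters to disk
model.save_pretrained("./cat_1_1")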


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (rotary_emb): LlamaRotaryEmbedding()
            )
            (mlp): LlamaMLP(
              (gate_proj): Linear4bit(
                (lora_dropout): ModuleDict(
                  (oasst): Dropout(p=0.05, inplace=False)
                  (fren): Dropout(p=0.05, inplace=False)
                  (fren_oasst): Dropout(p=0.05, inplace=False)
                )
                (lora_A)
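
As far as I understand it, the "cat" combination type concatenates the low-rank matrices of the adapters along the rank dimension, so the rank of the new adapter is the sum of the ranks of the adapters being combined. The toy example below is only a sketch of the underlying linear algebra in pure Python, not PEFT's implementation. The matrices are made up, and the LoRA scaling factor is left out. It checks that concatenating gives the same weight update as the weighted sum of the two individual updates.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Add two matrices of the same shape element-wise."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def scale(X, w):
    """Multiply every entry of a matrix by the scalar w."""
    return [[w * x for x in row] for row in X]


# Two toy adapters for a layer with in_features=3 and out_features=2.
# Adapter 1 has rank 1 and adapter 2 has rank 2; the values are arbitrary.
A1 = [[1.0, 0.0, 2.0]]                   # shape (r1=1, in=3)
B1 = [[1.0], [3.0]]                      # shape (out=2, r1=1)
A2 = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]  # shape (r2=2, in=3)
B2 = [[2.0, 0.0], [0.0, 1.0]]            # shape (out=2, r2=2)
w1, w2 = 1.0, 1.0                        # same weights as in the cell above

# "cat": stack the weighted A matrices along the rank axis
# and place the B matrices side by side.
A_cat = scale(A1, w1) + scale(A2, w2)                # shape (r1+r2, in)
B_cat = [row1 + row2 for row1, row2 in zip(B1, B2)]  # shape (out, r1+r2)

delta_cat = matmul(B_cat, A_cat)
delta_sum = add(scale(matmul(B1, A1), w1), scale(matmul(B2, A2), w2))
print(delta_cat == delta_sum, "combined rank:", len(A_cat))
```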

Then, I recommend reloading and quantizing the base model again before loading and using the new adapter.

In [None]:
# We have to reload the base model with quantization
base_model = "meta-llama/Llama-2-7b-hf"
compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    # To test without quantization, swap the next line for this one:
    # base_model, device_map={"": 0}, torch_dtype=torch.float16
    base_model, device_map={"": 0}, quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)


Load the new adapter:

In [None]:
model = PeftModel.from_pretrained(model, "./cat_1_1/")
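
Depending on the PEFT version, "save_pretrained" may write every adapter whose name is not "default" into a subdirectory named after the adapter. If loading from the top-level directory fails, the helper below can show where the adapter configuration files actually are, so that you can point "from_pretrained" to the right path. This is a sketch using only the standard library: "find_adapter_dirs" is a hypothetical name, and the demo runs on a throwaway directory that mimics one possible layout rather than on the real "cat_1_1" directory.

```python
import os
import tempfile
from typing import List


def find_adapter_dirs(root: str) -> List[str]:
    """Return sorted relative paths of directories under root holding adapter_config.json."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "adapter_config.json" in filenames:
            found.append(os.path.relpath(dirpath, root))
    return sorted(found)


# Demo on a temporary directory with an assumed layout (one subdirectory per adapter)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("fren", "oasst", "fren_oasst"):
        os.makedirs(os.path.join(tmp, name))
        with open(os.path.join(tmp, name, "adapter_config.json"), "w") as f:
            f.write("{}")
    demo_dirs = find_adapter_dirs(tmp)
    print(demo_dirs)
```

On the real output directory, you would call find_adapter_dirs("./cat_1_1") instead.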

Test inference with a translation prompt and a chat prompt.

In [None]:
#Test generation with a translation prompt
generate("Tu es le seul client du magasin. ###>")
#Test generation with an oasst prompt
generate("### Human: Hello!### Assistant:")

Tu es le seul client du magasin. ###>You're the only customer in the store.
### Human: Hello!### Assistant: Hello!
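
The two prompts above follow different formats: the translation prompt ends with " ###>", while the chat prompt wraps the message between "### Human: " and "### Assistant:". To avoid typos when testing more inputs, the formats can be wrapped in two small helpers. The function names are my own, and the formats are simply copied from the prompts used in this notebook.

```python
def translation_prompt(french_text: str) -> str:
    """Format a French sentence the way the translation test prompt does."""
    return f"{french_text} ###>"


def chat_prompt(user_message: str) -> str:
    """Format a user message the way the chat test prompt does."""
    return f"### Human: {user_message}### Assistant:"


print(translation_prompt("Tu es le seul client du magasin."))
print(chat_prompt("Hello!"))
```

The returned strings can be passed directly to generate().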
