## Fine-tuning and Inference Using Low-Rank Adaptation (LoRA)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/LoRA_Finetuning&Inference.ipynb)

## Introduction

In this notebook we demonstrate how to perform LoRA fine-tuning and inference using the Together AI API!

LoRA is a widely used fine-tuning technique. Here is how it works:

Instead of updating all model parameters (the blue parameters in the figure below) during fine-tuning, which is computationally expensive, LoRA adds a small number of trainable parameters (the orange matrices A and B) alongside the original model weights.

These smaller matrices get updated during the fine-tuning phase and get added to the main weights. This dramatically reduces the time it takes to fine-tune the model and the compute resources required while maintaining good performance.
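To make the idea concrete, here is a minimal NumPy sketch of a single LoRA-adapted layer. The layer shape, the rank, and the random "training" step are assumptions chosen for illustration; the notebook does not state which values Together AI uses, and the scaling factor that LoRA implementations usually apply to the update is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer shape and LoRA rank, chosen only for illustration.
d_out, d_in, rank = 512, 512, 8

W = rng.normal(size=(d_out, d_in))         # frozen base weights (never updated)
A = rng.normal(size=(rank, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, rank))                # trainable, zero init so the update starts at 0

# Stand-in for training: pretend gradient steps have moved B away from zero.
B += rng.normal(size=B.shape) * 0.01

x = rng.normal(size=d_in)

# Adapter kept separate: base output plus the low-rank correction.
y_adapter = W @ x + B @ (A @ x)

# Adapter merged: add the low-rank product to the base weights once.
W_merged = W + B @ A
y_merged = W_merged @ x

full_params = W.size
lora_params = A.size + B.size
print(f"full fine-tuning: {full_params:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora_params:,} trainable parameters")
print("outputs match:", np.allclose(y_adapter, y_merged))
```

With these assumed sizes, the adapter trains 8,192 parameters instead of the 262,144 in the full weight matrix, and both ways of applying it give the same output.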

When paired with fast LoRA inference, you can swap between multiple LoRA adapters and run inference with different fine-tunes, all while using the same base model!

In this notebook we demonstrate:
1. How to perform LoRA fine-tuning on Together AI
2. How to perform LoRA inference on the trained model
3. How to swap and perform inference using various LoRA fine-tunes!


<img src="images/lora.png" width="450">

## Install Library

In [1]:
!pip install -qU together

In [2]:
from together import Together
import os

TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")
WANDB_API_KEY = os.getenv("WANDB_API_KEY")

client = Together(api_key = TOGETHER_API_KEY)

## Perform LoRA Fine-tune

Below we upload a file that can be used to fine-tune Llama 3.2 1B Instruct.

In [3]:
# Upload dataset to Together AI

train_file_resp = client.files.upload("datasets/small_coqa_10.jsonl", check=True)
print(train_file_resp.id)

Uploading file small_coqa_10.jsonl: 100%|██████████| 33.4k/33.4k [00:01<00:00, 30.6kB/s]


file-a3c8206d-91f9-4c88-9b17-82647a99455d
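The contents of `small_coqa_10.jsonl` are not shown in this notebook. As a rough sketch of what a JSONL training file can look like, the snippet below writes two made-up conversational records, one JSON object per line, and reads them back. The `messages` layout, the file name, and the example text are assumptions for illustration; check the Together AI fine-tuning docs for the formats that are actually accepted.

```python
import json
import os
import tempfile

# Hypothetical training examples; NOT taken from small_coqa_10.jsonl.
examples = [
    {"messages": [
        {"role": "user", "content": "What color is the sky on a clear day?"},
        {"role": "assistant", "content": "Blue."},
    ]},
    {"messages": [
        {"role": "user", "content": "How many legs does a spider have?"},
        {"role": "assistant", "content": "Eight."},
    ]},
]

# Write one JSON object per line (the JSONL convention).
path = os.path.join(tempfile.mkdtemp(), "hypothetical_train.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Read the file back to confirm every line parses as JSON on its own.
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} records written to {path}")
```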


In [None]:
ft_resp = client.fine_tuning.create(
    training_file = train_file_resp.id,
    model = 'meta-llama/Llama-3.2-1B-Instruct', # changed to 1B model
    train_on_inputs= "auto",
    n_epochs = 3,
    n_checkpoints = 1,
    wandb_api_key = WANDB_API_KEY,
    lora = True,
    warmup_ratio=0,
    learning_rate = 1e-5,
    suffix = 'FT-webinar-demo-1b',
)

print(ft_resp.id)

ft-8bc4cb28-44c6-4ce7-b47f-055992a7d3c3


In [13]:
# The output model name
ft_resp.output_name

'zainhas/Llama-3.2-1B-Instruct-FT-webinar-demo-1b-6521872f'

## LoRA Inference

Once the fine-tuning job finishes, you can run inference on the resulting model directly.

To check the status of the fine-tuning job, visit the `Jobs` page: https://api.together.ai/jobs
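If you would rather wait for the job from code than watch the `Jobs` page, a small polling loop is one option. The sketch below is a hypothetical helper, not part of the Together SDK: `wait_for_job`, `TERMINAL_STATUSES`, and the status strings are assumptions. It takes any zero-argument callable that returns the current status, so it can be tried without touching the API.

```python
import time

# Assumed names for "the job is done" states; adjust to what the API returns.
TERMINAL_STATUSES = {"completed", "error", "cancelled"}

def wait_for_job(get_status, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Call get_status() until it reports a terminal status, then return it.

    Raises TimeoutError if no terminal status is seen within max_polls calls.
    """
    status = None
    for _ in range(max_polls):
        raw = get_status()
        # Accept plain strings as well as enum-like objects with a .value.
        status = str(getattr(raw, "value", raw)).lower()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"job still '{status}' after {max_polls} polls")

# Try it with a fake status source instead of a real job.
fake_states = iter(["pending", "running", "completed"])
final_status = wait_for_job(lambda: next(fake_states), poll_seconds=0)
print(final_status)

# With the client from this notebook, the callable might look like this
# (assumes the SDK exposes fine_tuning.retrieve and a .status field):
# wait_for_job(lambda: client.fine_tuning.retrieve(ft_resp.id).status)
```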

In [18]:
model_name = ft_resp.output_name
user_prompt = "What is the capital of France?"

response = client.chat.completions.create(
    model = model_name + '-adapter',
    messages=[
        {
            "role": "user",
            "content": user_prompt,
        }
    ],
    max_tokens=124,
    temperature=0.7,
)

print(response.choices[0].message.content)

The capital of France is Paris.


## Swap between different LoRA adapters on the go!

If you have trained multiple LoRA adapters, you can loop through them and query each one. This is useful for evaluating multiple fine-tunes together.

The first time you run inference with a LoRA adapter, it might take some time. Subsequent calls to the same adapter will be a lot faster!
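One way to observe the difference between the first call and later calls is to time them. The sketch below uses a hypothetical `time_calls` helper and a fake stand-in for the adapter call, with a simulated delay on first use, so it runs without an API key. The delay value is made up and says nothing about real latencies.

```python
import time

def time_calls(fn, n_calls=3):
    """Run fn() n_calls times and return the wall-clock seconds of each call."""
    latencies = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies

# Fake stand-in for an adapter call: slow the first time, fast afterwards.
_loaded_adapters = set()

def fake_adapter_call(adapter="hypothetical/adapter"):
    if adapter not in _loaded_adapters:
        _loaded_adapters.add(adapter)
        time.sleep(0.05)  # simulated one-time loading delay
    return "ok"

latencies = time_calls(fake_adapter_call)
print([f"{t:.3f}s" for t in latencies])
```

To time a real adapter, pass a function that wraps `client.chat.completions.create(...)` with the adapter's model name instead of `fake_adapter_call`.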

In [20]:
# List of LoRA fine-tunes

LoRA_adapters = ["zainhas/Meta-Llama-3.1-8B-Instruct-Reference-my-demo-finetune-4224205a",
                 "zainhas/Meta-Llama-3.1-8B-Instruct-Reference-30b975fd",
                 "zainhas/Meta-Llama-3.1-8B-Instruct-Reference-f9ef93c8"]

In [None]:
# Loop over different LoRA fine-tunes and call with same query

for adapter in LoRA_adapters:
    response = client.chat.completions.create(
        model=adapter,
        messages=[
            {
                "role": "user",
                "content": "Write a short haiku about elephants.",
            }
        ],
        max_tokens=124,
        temperature=0.7,
    )

    print(f"Response from {adapter}:\n")

    print(response.choices[0].message.content)

    print('\n' + 20 * '######' + '\n')

Response from zainhas/Meta-Llama-3.1-8B-Instruct-Reference-my-demo-finetune-4224205a:

Here is a short haiku about elephants:

Gray giants roam free
Trunk entwined in ancient dance
Wisdom's gentle soul

########################################################################################################################

Response from zainhas/Meta-Llama-3.1-8B-Instruct-Reference-30b975fd:

Here is a short haiku about elephants:

Gray giants roam free
Tusks lift spirits high above
Nature's gentle king

########################################################################################################################

Response from zainhas/Meta-Llama-3.1-8B-Instruct-Reference-f9ef93c8:

Tusks gently unfold
Memories in wrinkled grey
Wisdom's ancient steps

########################################################################################################################



## Learn more about LoRA Inference

You can also bring your own adapters or source LoRA adapters from Hugging Face. To learn more, refer to the [docs here](https://docs.together.ai/docs/lora-inference)!