<p> <center> <a href="../../Start-NIM-RAG.ipynb">Home Page</a> </center> </p>

<div>
    <span style="float: left; width: 33%; text-align: left;"><a href="locally_deployed_nim.ipynb">Previous Notebook</a></span>
    <span style="float: left; width: 34%; text-align: center;">
        <a href="rag_nim_endpoints.ipynb">1</a>
        <a href="locally_deployed_nim.ipynb">2</a>
        <a>3</a>
        <!-- <a href="challenge.ipynb">4</a> -->
    </span>
    <!-- <span style="float: left; width: 33%; text-align: right;"><a href="challenge.ipynb">Next Notebook</a></span> -->
</div>

# Running NIM With LoRA Adapters
---

**Overview:** This lab introduces the concepts of Parameter-Efficient Fine-Tuning (PEFT), with a focus on LoRA adapters and how to serve them with NIM.

## Low Rank Adapters

Full fine-tuning (that is, updating all parameters of the model) can be difficult for the largest LLMs because of the computational infrastructure required to train the whole model. Infrastructure costs also increase at deployment time, where users must either host multiple large models in memory or tolerate increased latency as entire models are swapped in and out. The goal, therefore, is to adapt a model while changing, reading, and writing as few weights as possible.

A popular method is to use a state-of-the-art [Parameter-Efficient Fine-Tuning (PEFT)](https://github.com/huggingface/peft/tree/main) approach. [PEFT](https://arxiv.org/abs/2305.16742) fine-tunes a small number of (extra) model parameters instead of all the model's parameters, which significantly decreases the computational and storage costs. One way to implement PEFT is the Low-Rank Adaptation (LoRA) technique. LoRA makes fine-tuning more efficient by greatly reducing the number of trainable parameters for downstream tasks. It does this by freezing the pre-trained model weights and injecting trainable rank decomposition matrices into each layer of the Transformer architecture. According to the [authors of LoRA](https://arxiv.org/abs/2106.09685), compared with full fine-tuning of GPT-3 175B, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times, with higher training throughput and no additional inference latency. For quick background on LoRA, please follow this [link](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora).

<center><img src="imgs/lora-arch.png" height="500px" width="900px"  /></center>
<center>  LoRA reparametrization and Weight merging. <a href="https://huggingface.co/docs/peft/main/en/conceptual_guides/lora"> View source</a> </center>
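
To make the reparametrization concrete, here is a minimal pure-Python sketch of the LoRA update `W + (alpha / r) * B @ A` and of the trainable-parameter count for a single weight matrix. The helper names (`matmul`, `lora_merge`, `trainable_params`) and the toy values are assumptions made for illustration; they are not part of NIM or of the PEFT library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A, i.e. the merged weight."""
    scale = alpha / r
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, r):
    """Compare full fine-tuning with LoRA for one d_out x d_in weight matrix."""
    return {"full": d_out * d_in, "lora": r * (d_out + d_in)}


# Toy example: frozen 2x3 weight W with a rank-1 update (illustrative values)
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [[1.0], [2.0]]       # d_out x r, trainable
a = [[0.5, 0.5, 0.5]]    # r x d_in, trainable
merged = lora_merge(w, a, b, alpha=2, r=1)
print(merged)

# A 4096 x 4096 projection with rank 8 (illustrative sizes)
counts = trainable_params(4096, 4096, 8)
print(counts)
```

Because the merged matrix has the same shape as `W`, the adapter can be folded into the base weights, which is why a merged LoRA adds no extra inference latency.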

## NIMs with LoRA support

To take advantage of LoRA, NIM can load multiple LoRA adapters on top of a single base model. This allows responses to be tailored to several different user applications while sharing the same backbone, as illustrated below:


<center><img src="imgs/nvidia-nim-dynamic-lora-architecture.png" height="500px" width="900px"  /></center>
<center>  NIM serving multiple LoRA adapters on top of a shared base model. </center>


## Model Profiles

NIMs are designed to extract the best performance from the available compute resources. They do this through optimized model profiles, each tuned for specific hardware (that is, the GPU architecture and the number of GPUs) and requirements. A complete list can be found [here](https://docs.nvidia.com/nim/large-language-models/latest/support-matrix.html#supported-lora-formats).

Let's list the model profiles available for the `Llama3-8b` NIM we downloaded previously. Before that, we pick a free port for the container.


In [None]:
import random
import socket
import os
def find_available_port(start=11000, end=11999):
    while True:
        # Randomly select a port between start and end range
        port = random.randint(start, end)
        
        # Try to create a socket and bind to the port
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("localhost", port))
                # If binding is successful, the port is free
                return port
            except OSError:
                # If binding fails, the port is in use, continue to the next iteration
                continue

# Find an available port, export it for later cells, and print it
os.environ['CONTAINER_PORT'] = str(find_available_port())
print(f"You have been allotted the available port: {os.environ['CONTAINER_PORT']}")

In [None]:
! docker run -it --rm --gpus all --shm-size=16GB   -u $(id -u) -p $CONTAINER_PORT:8000  nvcr.io/nim/meta/llama3-8b-instruct:1.0.0 list-model-profiles

**Likely Output:**

```text
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3. 
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

SYSTEM INFO
- Free GPUs:
  -  [20b2:10de] (0) NVIDIA A100-SXM4-80GB (A100 80GB) [current utilization: 1%]
MODEL PROFILES
- Compatible with system and runnable:
  - 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-a100-fp16-tp1-throughput)
  - 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
  - With LoRA support:
    - cce57ae50c3af15625c1668d5ac4ccbe82f40fa2e8379cc7b842cc6c976fd334 (tensorrt_llm-a100-fp16-tp1-throughput-lora)
    - 8d3824f766182a754159e88ad5a0bd465b1b4cf69ecf80bd6d6833753e945740 (vllm-fp16-tp1-lora)
- Incompatible with system:
  - dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency)
  - f59d52b0715ee1ecf01e6759dea23655b93ed26b12e57126d9ec43b397ea2b87 (tensorrt_llm-l40s-fp8-tp2-latency)
 ...
  
  - 3bdf6456ff21c19d5c7cc37010790448a4be613a1fd12916655dfab5a0dd9b8e (tensorrt_llm-h100-fp16-tp1-throughput-lora)
  - 388140213ee9615e643bda09d85082a21f51622c07bde3d0811d7c6998873a0b (tensorrt_llm-l40s-fp16-tp1-throughput-lora)
  - c5ffce8f82de1ce607df62a4b983e29347908fb9274a0b7a24537d6ff8390eb9 (vllm-fp16-tp2-lora)

```
<br/>
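
If you capture the listing as text, the runnable LoRA profiles can be picked out programmatically. The sketch below assumes the output layout shown above (section headers followed by `- <id> (<description>)` lines); `runnable_lora_profiles` is a hypothetical helper, not a NIM API, and the layout may differ between NIM versions.

```python
import re

# Matches lines such as "    - <hex id> (<profile description>)"
PROFILE_LINE = re.compile(r"^\s*-\s+([0-9a-f]+)\s+\(([^)]+)\)\s*$")


def runnable_lora_profiles(listing):
    """Map description -> profile id for runnable profiles whose name ends in '-lora'."""
    profiles = {}
    runnable = False
    for line in listing.splitlines():
        stripped = line.strip()
        # Track which top-level section of the listing we are in
        if stripped.startswith("- Compatible with system and runnable"):
            runnable = True
        elif stripped.startswith("- Incompatible with system"):
            runnable = False
        match = PROFILE_LINE.match(line)
        if runnable and match and match.group(2).endswith("-lora"):
            profiles[match.group(2)] = match.group(1)
    return profiles


# Excerpt of the listing shown above
sample = """\
- Compatible with system and runnable:
  - With LoRA support:
    - cce57ae50c3af15625c1668d5ac4ccbe82f40fa2e8379cc7b842cc6c976fd334 (tensorrt_llm-a100-fp16-tp1-throughput-lora)
    - 8d3824f766182a754159e88ad5a0bd465b1b4cf69ecf80bd6d6833753e945740 (vllm-fp16-tp1-lora)
- Incompatible with system:
  - 3bdf6456ff21c19d5c7cc37010790448a4be613a1fd12916655dfab5a0dd9b8e (tensorrt_llm-h100-fp16-tp1-throughput-lora)
"""
for name, profile_id in runnable_lora_profiles(sample).items():
    print(name, profile_id)
```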

Let's take up a sample profile: `cce57ae50c3af15625c1668d5ac4ccbe82f40fa2e8379cc7b842cc6c976fd334 (tensorrt_llm-a100-fp16-tp1-throughput-lora)`

The description tells us the details of the NIM model profile. In our example, the profile is optimized:
- using `tensorrt_llm` engines
- for the NVIDIA `A100` GPU
- with `fp16` precision
- for a tensor parallelism of 1 (`tp1`), that is, a single GPU
- with a focus on `throughput`
- with support for `lora`
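
The breakdown above can be sketched as a small parser. The naming convention used here is inferred only from the sample descriptions in this notebook, so treat `parse_profile_name` as a hypothetical helper rather than a documented format.

```python
import re


def parse_profile_name(name):
    """Split a profile description such as 'vllm-fp16-tp1-lora' into its parts."""
    tokens = name.split("-")
    info = {
        "backend": tokens[0],      # e.g. tensorrt_llm or vllm
        "gpu": None,               # e.g. a100, h100, l40s (absent for vllm profiles)
        "precision": None,         # e.g. fp16, fp8
        "tensor_parallel": None,   # number of GPUs the model is split across
        "target": None,            # throughput or latency
        "lora": False,
    }
    for tok in tokens[1:]:
        if tok == "lora":
            info["lora"] = True
        elif tok in ("throughput", "latency"):
            info["target"] = tok
        elif re.fullmatch(r"tp\d+", tok):
            info["tensor_parallel"] = int(tok[2:])
        elif re.fullmatch(r"(fp|bf|int)\d+", tok):
            info["precision"] = tok
        else:
            # Assumption: any remaining token names the GPU
            info["gpu"] = tok
    return info


print(parse_profile_name("tensorrt_llm-a100-fp16-tp1-throughput-lora"))
```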

### Running NIM with a model profile

In [None]:
import os
import getpass

if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key
    os.environ["NGC_API_KEY"] = nvapi_key

In [None]:
from os.path import expanduser
home = expanduser("~")

In [None]:
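import os

# Create the NIM cache directory (including parents) if it does not exist yet
cache_dir = "/local/.cache/nim"
if os.path.isdir(cache_dir):
    print("Cache Exists..")
else:
    os.makedirs(cache_dir, exist_ok=True)
    print("Cache Created")
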

In [None]:
os.environ['LOCAL_NIM_CACHE'] = "/local/.cache/nim"

Let's run the NIM with one of the listed profiles. Here we pick the LoRA-enabled `vllm-fp16-tp1-lora` profile.

In [None]:
! docker run -it --rm --gpus all --shm-size=16GB \
-e NIM_MODEL_PROFILE=8d3824f766182a754159e88ad5a0bd465b1b4cf69ecf80bd6d6833753e945740 \
-e NGC_API_KEY \
-v $LOCAL_NIM_CACHE:/opt/nim/.cache \
-u $(id -u) -p $CONTAINER_PORT:8000  nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

**Likely Output:**

```

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3. 
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

2024-08-14 07:27:01,272 [INFO] PyTorch version 2.2.2 available.
2024-08-14 07:27:02,170 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-08-14 07:27:02,171 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-08-14 07:27:02,214 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 08-14 07:27:03.173 api_server.py:489] NIM LLM API version 1.0.0
INFO 08-14 07:27:03.175 ngc_profile.py:217] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 08-14 07:27:03.175 ngc_profile.py:219] Detected 0 compatible profile(s).
INFO 08-14 07:27:03.175 ngc_profile.py:221] Detected additional 2 compatible profile(s) that are currently not runnable due to low free GPU memory.
ERROR 08-14 07:27:03.176 utils.py:21] You are running NIM without LoRA, but the selected profile '8d3824f766182a754159e88ad5a0bd465b1b4cf69ecf80bd6d6833753e945740' has LoRA enabled. Please select a profile that does not have LoRA enabled, or alternatively, run NIM with LoRA enabled.
```

**The NIM log indicates that we chose a LoRA-enabled model profile but did not provide any LoRA adapter configuration.**
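
A mismatch like this can be caught before starting the container. The sketch below is a hypothetical pre-flight check; it assumes that LoRA profile descriptions end in `-lora` (as in the listing above) and that setting `NIM_PEFT_SOURCE`, which later cells in this notebook pass to the container, is what enables LoRA mode.

```python
import os


def lora_config_problems(profile_description, env):
    """Return a list of problems; an empty list means no mismatch was detected."""
    wants_lora = profile_description.endswith("-lora")
    has_source = bool(env.get("NIM_PEFT_SOURCE"))
    problems = []
    if wants_lora and not has_source:
        problems.append(
            "LoRA profile selected but NIM_PEFT_SOURCE is not set"
        )
    return problems


# Check the profile used above against the current environment
for problem in lora_config_problems("vllm-fp16-tp1-lora", os.environ):
    print("WARNING:", problem)
```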

### LoRA Setup Overview

You can extend a NIM to serve LoRA models using configuration defined in environment variables. One condition must be met: the underlying NIM must be compatible with the base model of the LoRAs. The required configuration is described below.

#### LoRA Adapters

- Download LoRA adapters from `NGC` or `Hugging Face`, or use your `custom LoRA adapters`. 
- Each LoRA adapter must be stored in its own directory, and one or more of these LoRA directories must be placed inside the `LOCAL_PEFT_DIRECTORY` directory.
- The names of the loaded LoRA adapters must match the name of the adapters’ directories. 

NIM for LLMs supports the NeMo and Hugging Face Transformers compatible formats.

- A NeMo-formatted LoRA directory must contain one file with the `.nemo` extension. The name of the `.nemo` file does not need to match the name of its parent directory. The supported target modules are `["gate_proj", "o_proj", "up_proj", "down_proj", "k_proj", "q_proj", "v_proj", "attention_qkv"]`.

- LoRA adapters trained with Hugging Face Transformers are supported. The LoRA directory must contain an `adapter_config.json` file and one of the `adapter_model.safetensors` or `adapter_model.bin` files. The supported target modules for NIM are `["gate_proj", "o_proj", "up_proj", "down_proj", "k_proj", "q_proj", "v_proj"]`.
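
For a Hugging Face adapter, the target-module requirement can be checked from `adapter_config.json` before deployment. This is a minimal sketch: `unsupported_target_modules` is a hypothetical helper, and it assumes the config lists its modules under the `target_modules` key as a list of names (a single string is treated as one name).

```python
import json

# Supported target modules for Hugging Face adapters, as listed above
NIM_HF_TARGET_MODULES = {
    "gate_proj", "o_proj", "up_proj", "down_proj", "k_proj", "q_proj", "v_proj",
}


def unsupported_target_modules(adapter_config_text):
    """Return the target modules in an adapter config that are not in the supported set."""
    config = json.loads(adapter_config_text)
    modules = config.get("target_modules") or []
    if isinstance(modules, str):
        modules = [modules]
    return sorted(set(modules) - NIM_HF_TARGET_MODULES)


# Illustrative config contents (not taken from a real adapter)
example = '{"r": 8, "target_modules": ["q_proj", "v_proj", "lm_head"]}'
print(unsupported_target_modules(example))
```

To check a real adapter, pass the contents of its `adapter_config.json`; an empty result means every listed module is in the supported set.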

#### LoRA Model Directory Structure

For example, the directory used for storing one or more LoRAs (`LOCAL_PEFT_DIRECTORY`) should be organized as shown in the tree that follows:
- `loras` is the name of the directory you pass into the docker container as the value of `LOCAL_PEFT_DIRECTORY`.
- The LoRAs that get loaded are then named `llama3-8b-math`, `llama3-8b-math-hf`, `llama3-8b-squad`, and `llama3-8b-squad-hf`.

```text

loras
├── llama3-8b-math
│   └── llama3_8b_math.nemo
├── llama3-8b-math-hf
│   ├── adapter_config.json
│   └── adapter_model.bin
├── llama3-8b-squad
│   └── squad.nemo
└── llama3-8b-squad-hf
    ├── adapter_config.json
    └── adapter_model.safetensors

```
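
The layout rules above can be sketched as a quick validation pass over the adapter directory. The helper names `classify_adapter_dir` and `scan_peft_directory` are hypothetical; the checks only encode the file requirements stated in this section and do not inspect the adapter contents.

```python
from pathlib import Path

HF_WEIGHT_FILES = {"adapter_model.safetensors", "adapter_model.bin"}


def classify_adapter_dir(path):
    """Return 'nemo', 'huggingface', or 'invalid' for one adapter directory."""
    files = {p.name for p in Path(path).iterdir() if p.is_file()}
    nemo_files = [name for name in files if name.endswith(".nemo")]
    if len(nemo_files) == 1:
        return "nemo"
    if "adapter_config.json" in files and files & HF_WEIGHT_FILES:
        return "huggingface"
    return "invalid"


def scan_peft_directory(root):
    """Map each sub-directory name (the adapter name) to its detected format."""
    return {
        p.name: classify_adapter_dir(p)
        for p in sorted(Path(root).iterdir())
        if p.is_dir()
    }


# Example usage, assuming a ./loras directory exists:
# for name, fmt in scan_peft_directory("loras").items():
#     print(f"{name}: {fmt}")
```

Any directory reported as `invalid` (for example a stray `.ipynb_checkpoints` folder) is worth removing or fixing before starting the container.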

### Downloading LoRA Adapters

As mentioned above, you can download pre-trained adapters from the NVIDIA model registry NGC or from Hugging Face. Note that LoRA model weights are tied to a particular base model. You must only deploy LoRA models that were tuned from the same base model as the one being served by NIM. An example of how to download adapters from Hugging Face is given below.

*Please note that you must first log in with `huggingface-cli login`.*

```text
#export and make directory
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY

#download a LoRA from Hugging Face Hub
mkdir $LOCAL_PEFT_DIRECTORY/llama3-lora
huggingface-cli download <Hugging Face LoRA name> adapter_config.json adapter_model.safetensors --local-dir $LOCAL_PEFT_DIRECTORY/llama3-lora

#grant read, write, and execute permissions to all users
chmod -R 777 $LOCAL_PEFT_DIRECTORY

```


Next, we demonstrate how to download LoRA adapters for `llama3-8b-instruct` from NGC.

### Login to NVCR (NVIDIA Container Registry)

To access a NIM docker image, you must log in with `docker login nvcr.io`. The username is the literal string `$oauthtoken` (passed as `--username '$oauthtoken'`), and `--password-stdin` reads the password, which is the value of `$NGC_API_KEY`.

In [None]:
! echo -e "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

In [None]:
# create loras directory
!mkdir -p loras

In [None]:
# downloading vLLM-format loras
# NOTE: Downloading a LoRA adapter from NGC requires the NGC CLI config to be set up; for this bootcamp, the LoRA is pre-downloaded.
# !ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:hf-math-v1" --dest "./loras/" 

**Likely Output:** (if downloading from NGC)

```text
Downloaded 37.14 MB in 2s, Download speed: 18.57 MB/s               
--------------------------------------------------------------------------------
   Transfer id: llama3-8b-instruct-lora_vhf-math-v1
   Download status: Completed
   Downloaded local path: /home/<USER>/NIMs_Bootcamp/loras/llama3-8b-instruct-lora_vhf-math-v1-1
   Total files downloaded: 4
   Total downloaded size: 37.14 MB
   Started at: 2024-08-15 12:37:20.924505
   Completed at: 2024-08-15 12:37:22.928174
   Duration taken: 2s
   
...

```

### Adding LoRA Adapter to NIM

Grant full permissions on the LoRA files so that they are accessible from inside the docker container.

In [None]:
! chmod -R 777 loras/
os.environ['NIM_PEFT_SOURCE'] = f'{os.getcwd()}/loras'
os.environ['LOCAL_PEFT_DIRECTORY'] = f'{os.getcwd()}/loras'

In [None]:
print(os.environ['NIM_PEFT_SOURCE'])

As discussed above, LoRA loading is sensitive to the file and folder structure, so to ensure a correct deployment we remove any `.ipynb_checkpoints` directories that might have crept in.

In [None]:
! find . -type d -name ".ipynb_checkpoints" -exec rm -rf {} +
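
The same cleanup can be sketched in Python, which also reports what was deleted. `remove_checkpoint_dirs` is a hypothetical helper; the example call limits the search to the `loras` directory, which is an assumption (the shell command above searches from the current directory).

```python
import shutil
from pathlib import Path


def remove_checkpoint_dirs(root):
    """Delete every '.ipynb_checkpoints' directory under root and return their paths."""
    removed = []
    # Materialize the matches first so deleting does not disturb the directory walk
    for path in sorted(Path(root).rglob(".ipynb_checkpoints")):
        if path.is_dir():  # skips matches already removed along with a parent
            shutil.rmtree(path)
            removed.append(str(path))
    return removed


# Example usage, assuming a ./loras directory exists:
# print(remove_checkpoint_dirs("loras"))
```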


Run the NIM docker container.

In [None]:
os.environ['LORA_NIM_DOCKER']="nim_llm_with_lora"
! docker run -it -d --rm --gpus all --shm-size=16GB --name=$LORA_NIM_DOCKER   -e NGC_API_KEY  -e NIM_PEFT_SOURCE -v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE -v $LOCAL_NIM_CACHE:/opt/nim/.cache  -u $(id -u) -p $CONTAINER_PORT:8000  nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

# To give the local NIM container time to load fully instead of remaining in a pending state, we wait for a fixed interval.
! sleep 60

Confirm that the NIM is running by checking the docker logs.

In [None]:
! docker logs --tail 45 $LORA_NIM_DOCKER

**Likely Output**:  

```
INFO 09-10 12:17:48.149 api_server.py:456] Serving endpoints:
  0.0.0.0:8000/openapi.json
  0.0.0.0:8000/docs
  0.0.0.0:8000/docs/oauth2-redirect
  0.0.0.0:8000/metrics
  0.0.0.0:8000/v1/health/ready
  0.0.0.0:8000/v1/health/live
  0.0.0.0:8000/v1/models
  0.0.0.0:8000/v1/version
  0.0.0.0:8000/v1/chat/completions
  0.0.0.0:8000/v1/completions
INFO 09-10 12:17:48.149 api_server.py:460] An example cURL request:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [
      {
        "role":"user",
        "content":"Hello! How are you?"
      },
      {
        "role":"assistant",
        "content":"Hi! I am quite well, how can I help you today?"
      },
      {
        "role":"user",
        "content":"Can you write me a song?"
      }
    ],
    "top_p": 1,
    "n": 1,
    "max_tokens": 15,
    "stream": true,
    "frequency_penalty": 1.0,
    "stop": ["hello"]
  }'

INFO 09-10 12:17:48.196 server.py:82] Started server process [33]
INFO 09-10 12:17:48.197 on.py:48] Waiting for application startup.
INFO 09-10 12:17:48.224 on.py:62] Application startup complete.
INFO 09-10 12:17:48.225 server.py:214] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
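
The endpoint list in the log includes `/v1/health/ready`, so instead of relying on a fixed `sleep`, readiness can be polled. The following is a minimal sketch using only the standard library; `http_ok` and `wait_until_ready` are hypothetical helpers, and it assumes the endpoint answers with HTTP 200 once the model is loaded.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout_s=5.0):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error codes all mean "not ready yet"
        return False


def wait_until_ready(url, timeout_s=600.0, interval_s=5.0, probe=http_ok):
    """Poll url until probe(url) succeeds; return False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example usage, with CONTAINER_PORT set as in the earlier cells:
# import os
# ready = wait_until_ready(f"http://0.0.0.0:{os.environ['CONTAINER_PORT']}/v1/health/ready")
# print("NIM ready:", ready)
```

The `probe` parameter exists only so the loop can be exercised without a running server.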

### Running Inference

We can quickly check the base model and the hosted LoRA adapters through the `/v1/models` endpoint exposed by the NIM.

In [None]:
! curl -s http://0.0.0.0:${CONTAINER_PORT}/v1/models | jq

**Likely Output:**

```json

{
  "object": "list",
  "data": [
    {
      "id": "meta/llama3-8b-instruct",
      "object": "model",
      "created": 1723751020,
      "owned_by": "system",
      "root": "meta/llama3-8b-instruct",
      "parent": null,
      "permission": [
        {
          "id": "modelperm-0f38baa00db14455846b31342c8b4367",
          "object": "model_permission",
          "created": 1723751020,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    },
    {
      "id": "llama3-8b-instruct-lora_vhf-math-v1",
      "object": "model",
      "created": 1723751020,
      "owned_by": "system",
      "root": "meta/llama3-8b-instruct",
      "parent": null,
      "permission": [     
    ...   
  ]
}

```
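
The response can also be processed in Python to separate the base model from the adapters. This sketch relies on a pattern visible in the sample output above, where the base model's `id` equals its `root` while an adapter's `root` points at the base model; `split_models` is a hypothetical helper and that pattern is an assumption, not a documented guarantee.

```python
def split_models(models_payload):
    """Return (base_models, adapters) from a parsed /v1/models response."""
    base, adapters = [], []
    for entry in models_payload.get("data", []):
        if entry["id"] == entry.get("root"):
            base.append(entry["id"])
        else:
            adapters.append(entry["id"])
    return base, adapters


# Trimmed version of the response shown above
payload = {
    "object": "list",
    "data": [
        {"id": "meta/llama3-8b-instruct", "root": "meta/llama3-8b-instruct"},
        {"id": "llama3-8b-instruct-lora_vhf-math-v1", "root": "meta/llama3-8b-instruct"},
    ],
}
base_models, lora_adapters = split_models(payload)
print("base:", base_models)
print("adapters:", lora_adapters)
```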

Running a simple inference request.

In [None]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

model = "llama3-8b-instruct-lora_vhf-math-v1" # "meta/llama-3.1-70b-instruct"
llm = ChatNVIDIA(base_url="http://0.0.0.0:{}/v1".format(os.environ['CONTAINER_PORT']),model=model)

In [None]:
out = llm.invoke("What is 7+9?")

print(out)

**Likely Output:**
```python
AIMessage(content='7 + 9 = 16', response_metadata={'role': 'assistant', 'content': '7 + 9 = 16', 'token_usage': {'prompt_tokens': 17, 'total_tokens': 25, 'completion_tokens': 8}, 'finish_reason': 'stop', 'model_name': 'llama3-8b-instruct-lora_vhf-math-v1'}, id='run-c52c2e00-80d4-409b-88c1-8b977b6e9041-0', role='assistant')
```

In [None]:
out = llm.invoke("Find the coordinates of the point halfway between the points $(3,7)$ and $(5,1)$.")

print(out.content)

Stop the running NIM docker container.

In [None]:
!docker container stop $LORA_NIM_DOCKER 

### Creating Custom LoRA Adapter 

To use a custom LoRA adapter, you must first build one using the <a href="custom_lora.ipynb"> custom_lora notebook </a>. After completing that process, deploy the custom LoRA adapter with NIM by running the cells below. The first step is to copy the adapter to the `loras` directory.

In [None]:
!cp  -r model/Llama-3-8b-instruct-hf-finetune loras 

Grant permissions on the copied adapter files.

In [None]:
!chmod -R 777 loras/

Run the NIM docker container.

In [None]:
! docker run -it --rm -d --gpus all --shm-size=16GB   -e NGC_API_KEY  -e NIM_PEFT_SOURCE -v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE -v $LOCAL_NIM_CACHE:/opt/nim/.cache  -u $(id -u) -p $CONTAINER_PORT:8000  nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
! sleep 60 # wait for container to load fully

In [None]:
! docker logs --tail 50 $(docker ps -q -l) # view the last logs from the container

### Running Inference with Custom LoRA Adapter

We can quickly check the base model and the hosted LoRA adapters through the `/v1/models` endpoint exposed by the NIM.

In [None]:
!curl -s http://0.0.0.0:${CONTAINER_PORT}/v1/models | jq

Test to confirm the model is reachable and running

In [None]:
!curl -X 'POST' \
    "http://0.0.0.0:${CONTAINER_PORT}/v1/completions" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{"model": "Llama-3-8b-instruct-hf-finetune", \
         "prompt": "Once upon a time",\
         "max_tokens": 64        }'
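
The same request can be built in Python with the standard library. This is a sketch of an equivalent call rather than the notebook's own client: `build_completion_request` is a hypothetical helper, and the payload fields simply mirror the cURL command above.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens):
    """Build a POST request for the /completions endpoint under base_url."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


# Example usage, with CONTAINER_PORT set as in the earlier cells:
# import os
# req = build_completion_request(
#     f"http://0.0.0.0:{os.environ['CONTAINER_PORT']}/v1",
#     "Llama-3-8b-instruct-hf-finetune",
#     "Once upon a time",
#     64,
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```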


Running a simple inference request.


In [None]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

model = "Llama-3-8b-instruct-hf-finetune" # using custom LoRA adapter 

llm = ChatNVIDIA(base_url="http://0.0.0.0:{}/v1".format(os.environ['CONTAINER_PORT']),model=model, temperature=0.1, max_tokens=200, top_p=1.0 )

In [None]:
output = llm.invoke("explain what is astrophotography?")
print(output.content)

---

## References

- https://docs.nvidia.com/nim/large-language-models/latest/peft.html

## Licensing

Copyright © 2024 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
