<p> <center> <a href="../../Start-NIM-RAG.ipynb">Home Page</a> </center> </p>

<div>
    <span style="float: left; width: 33%; text-align: left;"><a href="rag_nim_endpoints.ipynb">Previous Notebook</a></span>
    <span style="float: left; width: 34%; text-align: center;">
        <a href="rag_nim_endpoints.ipynb">1</a>
        <a >2</a>
        <a href="nim_lora_adapter.ipynb">3</a>
        <!-- <a href="challenge.ipynb">4</a> -->
    </span>
    <span style="float: left; width: 33%; text-align: right;"><a href="nim_lora_adapter.ipynb">Next Notebook</a></span>
</div>

# Building RAG With A Localized NIM
---

This notebook demonstrates how to build a Retrieval Augmented Generation (RAG) pipeline using a locally deployed NVIDIA Inference Microservice (NIM). It walks you through setting up your NVIDIA API key, pulling and deploying a NIM image, and building a RAG application that uses the locally deployed NIM.


### Setup NVIDIA API Key

In the previous notebook, we learned how to set up our generated NVIDIA API key. As a requirement for this notebook, you must set the key as the environment variable `NVIDIA_API_KEY` in order to pull the NIM Docker images of your choice. If you don't have a key yet, please visit the NVIDIA NIMs API [homepage](https://build.nvidia.com/explore/discover) and generate one. Please run the cell below, enter your NVIDIA API key in the displayed text box, and press Enter.

In [1]:
import os
import getpass

if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    # Store the key that was entered; never hardcode a real key in the notebook
    os.environ["NVIDIA_API_KEY"] = nvapi_key

# The NGC registry login and the NIM container read the same key from NGC_API_KEY
os.environ["NGC_API_KEY"] = os.environ["NVIDIA_API_KEY"]


Enter your NVIDIA API key:  ········
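
The prefix check used in the cell above can be factored into a small helper, shown here as a minimal sketch. The function names `looks_like_nvapi_key` and `mask_key` are hypothetical, and the check only inspects the `nvapi-` prefix; it does not contact NVIDIA to confirm that the key is active.

```python
def looks_like_nvapi_key(key: str) -> bool:
    """Return True if the value has the `nvapi-` prefix followed by at least one character."""
    prefix = "nvapi-"
    return isinstance(key, str) and key.startswith(prefix) and len(key) > len(prefix)


def mask_key(key: str, visible: int = 6) -> str:
    """Show only the first few characters so the full key never appears in notebook output."""
    return key[:visible] + "..." if len(key) > visible else "..."


# "nvapi-example" is a placeholder, not a real key.
print(looks_like_nvapi_key("nvapi-example"))
print(mask_key("nvapi-example"))
```

Masking the key before printing it helps keep secrets out of saved notebook outputs.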


### Self-Hosted NIMs

Please execute the cell below to ensure that your docker daemon is up and running.

In [3]:
! docker ps

CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES


**Expected Output (if you have no running containers):**

```python

CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

```
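
The same check can be made from Python when you want a yes/no answer rather than a table. This is a minimal sketch: the helper name `docker_daemon_running` is hypothetical, and it assumes that `docker info` exits with a non-zero status when the daemon cannot be reached.

```python
import shutil
import subprocess


def docker_daemon_running(timeout: float = 10.0) -> bool:
    """Return True only if the docker CLI is on PATH and `docker info` exits successfully."""
    if shutil.which("docker") is None:
        return False  # CLI not installed or not on PATH
    try:
        completed = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return completed.returncode == 0


print("Docker daemon reachable:", docker_daemon_running())
```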

### Login to NVCR (NVIDIA Container Registry)

To access a NIM Docker image, you must log in with `docker login nvcr.io`. The registry expects the literal username `$oauthtoken` (passed via `--username`), while `--password-stdin` reads the password, which is the value of `$NGC_API_KEY`, from standard input.

In [4]:
! echo -e "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


**Expected Output**:
```
WARNING! Your password will be stored unencrypted in /home/yagupta/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
```

### Selection of NIM

The [NIMs Catalog](https://build.nvidia.com/explore/reasoning) lists multiple state-of-the-art models in different domains. Look for the ones with the `RUN ANYWHERE` tag, as shown in the screenshot below. These NIM images are available to download and contain models and required optimized runtimes that help in getting started quickly.

<img src="imgs/catalog1.jpg" style="width: 900px; height: auto;">

Select the NIM model of your choice, click on the Docker tab, and copy the image name in the red box as shown in the screenshot below.

<img src="imgs/catalog2.jpg" style="width: 900px; height: auto;">

### Pull The Image 

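The next step is to pull the Docker image. The expected outputs in this notebook were recorded with `llama3-8b-instruct:1.0.0`, while the executed cell below pulls `llama-3.1-nemotron-70b-instruct:latest`; the same steps apply to whichever NIM image you select.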

In [4]:
! docker pull nvcr.io/nim/nvidia/llama-3.1-nemotron-70b-instruct:latest

latest: Pulling from nim/nvidia/llama-3.1-nemotron-70b-instruct

537751ce: Pulling fs layer
cc6ef577: Pulling fs layer
674c5713: Pulling fs layer
a0e64b30: Pulling fs layer
aecefab5: Pulling fs layer
fc3e58e1: Pulling fs layer
b6715b99: Pulling fs layer
9764df2d: Pulling fs layer
551a5552: Pulling fs layer
ab3d5cc6: Pulling fs layer
e9c10d2a: Pulling fs layer
d954f53b: Pulling fs layer
3fab16d6: Pulling fs layer
9e832163: Pulling fs layer
fffb29aa: Pulling fs layer
b78484ee: Pulling fs layer
093c30f4: Pulling fs layer
5c68a47b: Pulling fs layer
bcffb9bc: Pulling fs layer
37b2775c: Pulling fs layer
feebaa4f: Pulling fs layer
dbcb5fa1: Pulling fs layer
6f93b6d5: Pulling fs layer
7631ce3c: Pulling fs layer
204eba89: Pulling fs layer
138739f2: Pulling fs layer
0e64b30: Waiting fs layer
Digest: sha256:cfffa236a89373859e36f612612fccba06b91ec761f56c3b60df

**Likely output:** (When you have the image pulled already)
```python

1.0.0: Pulling from nim/meta/llama3-8b-instruct
Digest: sha256:7fe6071923b547edd9fba87c891a362ea0b4a88794b8a422d63127e54caa6ef7
Status: Image is up to date for nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
```
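
An image reference such as the one pulled above combines a registry host, a repository path, and a tag. The sketch below splits a reference into those parts; the names `ImageRef` and `split_image_ref` are hypothetical, and the code assumes that the first path component is the registry host, which holds for `nvcr.io/...` references but not for short names such as `nginx`.

```python
from typing import NamedTuple


class ImageRef(NamedTuple):
    registry: str
    repository: str
    tag: str


def split_image_ref(ref: str) -> ImageRef:
    """Split `registry/repo/path:tag` into its parts; the tag defaults to `latest`."""
    name, sep, tag = ref.rpartition(":")
    if not sep or "/" in tag:
        # No tag present (a colon could still belong to a registry port).
        name, tag = ref, "latest"
    registry, _, repository = name.partition("/")
    return ImageRef(registry, repository, tag)


print(split_image_ref("nvcr.io/nim/meta/llama3-8b-instruct:1.0.0"))
```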

Let's check the model image by listing available images. *Please note that the `IMAGE ID` may differ from what you see under the expected output below*.

In [5]:
! docker image ls

REPOSITORY                                           TAG       IMAGE ID       CREATED         SIZE
nvcr.io/nim/nvidia/llama-3.1-nemotron-70b-instruct   latest    d0a77e5206ea   5 days ago      13.2GB
nginx                                                latest    60c8a892f36f   6 weeks ago     192MB
nvcr.io/nim/meta/llama3-8b-instruct                  1.0.0     3cb29b0d79e6   5 months ago    12.5GB
hello-world                                          latest    d2c94e258dcb   18 months ago   13.3kB


In [9]:
! docker ps -a

CONTAINER ID   IMAGE         COMMAND    CREATED        STATUS                    PORTS     NAMES
7a04b416aee6   hello-world   "/hello"   36 hours ago   Exited (0) 36 hours ago             upbeat_dhawan


**Expected Output**:

```python
REPOSITORY                            TAG       IMAGE ID       CREATED        SIZE
nvcr.io/nim/meta/llama3-8b-instruct   1.0.0     3cb29b0d79e6   2 months ago   12.5GB
```


#### Setting up Cache for the Model Artifacts

NIMs download a number of files so that the best profile can be selected for maximum performance on your hardware. Set the location for caching these model artifacts in `LOCAL_NIM_CACHE` and export the variable.

In [6]:
# Directory on the host where the NIM caches downloaded model artifacts
os.environ['LOCAL_NIM_CACHE'] = "/local/.cache/nim"
!echo $LOCAL_NIM_CACHE

/local/.cache/nim


In [7]:
!mkdir -p "$LOCAL_NIM_CACHE"
!chmod 777 "$LOCAL_NIM_CACHE"

chmod: changing permissions of '/local/.cache/nim': Operation not permitted


### Launch NIM LLM Microservice

Launch the NIM LLM microservice by executing the docker run command in the cell below.

```python
docker run -it --rm -d --gpus all --name=llm_nim --shm-size=16GB  -e NGC_API_KEY  -v "$LOCAL_NIM_CACHE":/opt/nim/.cache  -u $(id -u) -p 8000:8000  nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
```

This Docker command launches NIM LLM microservice using the following flags:

- `-it`: Allocates a pseudo-TTY and keeps STDIN open for interactive processes
- `--rm`: Automatically removes the container when it exits
- `-d`: Runs the container in detached mode (in the background)
- `--gpus all`: Allows the container to access all available GPUs
- `--name=llm_nim`: Names the container "llm_nim"
- `--shm-size=16GB`: Sets the size of /dev/shm to 16GB
- `-e NGC_API_KEY`: Passes the NGC_API_KEY environment variable to the container
- `-v $LOCAL_NIM_CACHE:/opt/nim/.cache`: Mounts the local NIM cache directory to /opt/nim/.cache in the container
- `-u $(id -u)`: Runs the container with the current user's UID
- `-p 8000:8000`: Maps port 8000 on the host to port 8000 in the container
- `nvcr.io/nim/meta/llama3-8b-instruct:1.0.0`: Specifies the Docker image to use
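
The flags listed above can also be assembled programmatically, which makes it easier to swap the image, port, or cache directory. This is a sketch only: `build_nim_run_command` is a hypothetical helper, it builds the argument list without running it, and it relies on `os.getuid()`, which is available on Linux and macOS.

```python
import os
import shlex
from typing import List


def build_nim_run_command(image: str, host_port: int, cache_dir: str, name: str = "llm_nim") -> List[str]:
    """Return the docker run argument list that mirrors the flags described above."""
    return [
        "docker", "run", "-it", "--rm", "-d",
        "--gpus", "all",
        f"--name={name}",
        "--shm-size=16GB",
        "-e", "NGC_API_KEY",                    # forwarded from the host environment
        "-v", f"{cache_dir}:/opt/nim/.cache",   # host cache mounted into the container
        "-u", str(os.getuid()),                 # run as the current user
        "-p", f"{host_port}:8000",              # host port mapped to container port 8000
        image,
    ]


command = build_nim_run_command("nvcr.io/nim/meta/llama3-8b-instruct:1.0.0", 8000, "/local/.cache/nim")
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would start the container; keeping the arguments as a list avoids shell quoting problems with paths that contain spaces.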



A system can have many running processes, so we must make sure we do not claim a port that another application is already using. The following code picks a free port at random from a range and stores it in `CONTAINER_PORT`:

In [3]:
import random
import socket
import os 

def find_available_port(start=11000, end=11999):
    while True:
        # Randomly select a port between start and end range
        port = random.randint(start, end)
        
        # Try to create a socket and bind to the port
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("localhost", port))
                # If binding is successful, the port is free
                return port
            except OSError:
                # If binding fails, the port is in use, continue to the next iteration
                continue

# Find an available port and store it for the docker run command
os.environ['CONTAINER_PORT'] = str(find_available_port())
print(f"You have been allotted the available port: {os.environ['CONTAINER_PORT']}")

You have been allotted the available port: 11945
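
An alternative to probing random ports is to bind to port `0` and let the operating system choose a free one. This is a minimal sketch of that different technique, not the approach used in the cell above; the function name `os_assigned_free_port` is hypothetical. As with the random probe, another process could claim the port between this check and the container start.

```python
import socket


def os_assigned_free_port(host: str = "localhost") -> int:
    """Bind to port 0 so the OS selects an unused port, then return that port number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


print(os_assigned_free_port())
```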


In [9]:
! docker run -it -d --rm \
--gpus all \
--name=nemotron_nim \
--shm-size=16GB  \
-e NGC_API_KEY \
-v $LOCAL_NIM_CACHE:/opt/nim/.cache \
-u $(id -u) \
-p $CONTAINER_PORT:8000 \
nvcr.io/nim/nvidia/llama-3.1-nemotron-70b-instruct:latest

# Wait so that the local NIM container has time to load instead of remaining in a pending state
! sleep 60

037e4de54c6a54a07afaf6c8cfd5a473ab696db8edf5eb4173f8abd4b6b300f3


In [2]:
! docker logs --tail 45 nemotron_nim

== NVIDIA Inference Microservice LLM NIM ==

NVIDIA Inference Microservice LLM NIM Version 1.2.3
Model: nvidia/llama-3.1-nemotron-70b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

The NIM container is governed by the NVIDIA Software License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement) and the Product Specific Terms for AI Products (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products).

A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the NVIDIA Open Model License Agreement (found at https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).

ADDITIONAL INFORMATION: Llama 3.1 Community License Agreement. Built with Llama.

2024-11-20 09:59:39,775 [INFO] PyTorch version 2.3.1 available.
INFO 2024-11-20 09:59:47.973 ng

**Expected Output:**
```
WARNING 09-10 12:08:40.618 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 09-10 12:08:40.631 api_server.py:456] Serving endpoints:
  0.0.0.0:8000/openapi.json
  0.0.0.0:8000/docs
  0.0.0.0:8000/docs/oauth2-redirect
  0.0.0.0:8000/metrics
  0.0.0.0:8000/v1/health/ready
  0.0.0.0:8000/v1/health/live
  0.0.0.0:8000/v1/models
  0.0.0.0:8000/v1/version
  0.0.0.0:8000/v1/chat/completions
  0.0.0.0:8000/v1/completions
INFO 09-10 12:08:40.631 api_server.py:460] An example cURL request:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [
      {
        "role":"user",
        "content":"Hello! How are you?"
      },
      {
        "role":"assistant",
        "content":"Hi! I am quite well, how can I help you today?"
      },
      {
        "role":"user",
        "content":"Can you write me a song?"
      }
    ],
    "top_p": 1,
    "n": 1,
    "max_tokens": 15,
    "stream": true,
    "frequency_penalty": 1.0,
    "stop": ["hello"]
  }'

INFO 09-10 12:08:40.681 server.py:82] Started server process [32]
INFO 09-10 12:08:40.681 on.py:48] Waiting for application startup.
INFO 09-10 12:08:40.710 on.py:62] Application startup complete.
INFO 09-10 12:08:40.712 server.py:214] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
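
The log above lists a `/v1/health/ready` endpoint. Instead of sleeping for a fixed interval, a notebook can poll that endpoint until the service responds. The sketch below uses only the standard library; the helper name `wait_until_ready` is hypothetical, and it assumes the endpoint answers with HTTP 200 once the model is loaded.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll <base_url>/v1/health/ready until it returns HTTP 200 or the timeout expires."""
    url = base_url.rstrip("/") + "/v1/health/ready"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            pass  # not reachable or not ready yet; try again after a pause
        time.sleep(interval_s)
    return False
```

For example, `wait_until_ready(f"http://0.0.0.0:{os.environ['CONTAINER_PORT']}")` returns `True` as soon as the service reports readiness, or `False` if it does not within the timeout.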

### Initiate A Quick Test
You can quickly test that your NIM is up and running via two methods:
- LangChain NVIDIA Endpoints
- A simple OpenAI completion request

**Parameter description:**
- **base_url**: The URL where the deployed NIM container is served.
- **model**: The name of the NIM model deployed. 
- **temperature**: To modulate the randomness of sampling. Reducing the temperature increases the chance of selecting words with high probabilities.
- **top_p**: To control how deterministic the model is. If you are looking for exact and factual answers, keep this low. If you seek more diverse responses, increase to a higher value.
- **max_tokens**: maximum number of output tokens to be generated.
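
To make the effect of `temperature` and `top_p` concrete, the sketch below applies both to a toy next-token distribution. The logits are made-up numbers, the function names are hypothetical, and the code illustrates the general sampling technique rather than the NIM's internal implementation.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Softmax over logits divided by the temperature (temperature must be > 0)."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their cumulative probability reaches top_p, then renormalize."""
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


logits = {"Paris": 4.0, "Lyon": 2.0, "Rome": 1.0}  # made-up scores for illustration
sharp = apply_temperature(logits, 0.1)  # low temperature concentrates probability
flat = apply_temperature(logits, 2.0)   # high temperature spreads it out
print(sharp["Paris"] > flat["Paris"])
print(list(top_p_filter(apply_temperature(logits, 1.0), 0.5)))
```

A low temperature pushes nearly all probability onto the top token, and a low `top_p` removes unlikely tokens from consideration entirely.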


In [12]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(base_url="http://0.0.0.0:{}/v1".format(os.environ['CONTAINER_PORT']), model="nvidia/llama-3.1-nemotron-70b-instruct", temperature=0.1, max_tokens=1000, top_p=1.0)

result = llm.invoke("What is the capital of France?")
print(result.content)

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

If the cell raises an error such as the one above, wait for some time and rerun it. The error usually means the NIM container has not finished starting up.
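
Rerunning by hand can be replaced with a small retry loop. This is a minimal sketch with a hypothetical helper, `call_with_retry`; it retries on `OSError` by default because the built-in `ConnectionError` and the `requests` connection errors both derive from it. The flaky function below only simulates a service that is not ready yet.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 5,
    delay_s: float = 10.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
) -> T:
    """Call fn until it succeeds, sleeping between failed attempts; re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay_s)
    raise last_error


# Simulated service that fails twice before answering.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("service not ready")
    return "Paris"


print(call_with_retry(flaky, attempts=5, delay_s=0.01))
```

In the notebook, `fn` could be `lambda: llm.invoke("What is the capital of France?")` with a delay of several seconds.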

In [25]:
!curl -X 'POST' \
    "http://0.0.0.0:${CONTAINER_PORT}/v1/completions" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d '{"model": "meta/llama3-8b-instruct", "prompt": "What is the capital of France?", "max_tokens": 64}'

{"id":"cmpl-cc76ec2749b54e26935b78119596a240","object":"text_completion","created":1732070795,"model":"meta/llama3-8b-instruct","choices":[{"index":0,"text":" 2\nWhat is the capital of France? – Paris\n\nIs this the correct answer?Well, actually, Paris is the capital and most populous city of France, which makes it the correct answer to the question!\n\n0 0\nWho is the main character in \"The Lord of the Rings\"? 2\n","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":8,"total_tokens":72,"completion_tokens":64}}
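
The JSON returned by the completions endpoint can be handled in Python as well. The sketch below builds the request body and extracts the generated text from a response; the helper names are hypothetical, and the sample response is a shortened, hand-written string with the same fields as the output above rather than real model output.

```python
import json


def build_completion_payload(model: str, prompt: str, max_tokens: int = 64) -> bytes:
    """Serialize the JSON body for a POST to the /v1/completions endpoint."""
    body = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return json.dumps(body).encode("utf-8")


def extract_completion_text(raw_response: str) -> str:
    """Return the generated text of the first choice in a completion response."""
    data = json.loads(raw_response)
    choices = data.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["text"]


# Shortened, hand-written response for illustration only.
sample = '{"object": "text_completion", "choices": [{"index": 0, "text": " Paris", "finish_reason": "length"}]}'
print(extract_completion_text(sample).strip())
```

Note that `finish_reason` is `length` in the output above, meaning generation stopped at `max_tokens` rather than at a natural end, which is why the raw completion trails off into unrelated text.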

### RAG Application 

In this section, we follow the steps from the previous notebook to build a RAG application based on the locally deployed NIM. Unlike the previous notebook, we will not create a conversational retrieval chain that uses two LLMs; instead, we use a single LLM, `llama3-8b-instruct`, because each NIM image serves one base model. It is possible to combine the locally deployed NIM with remote endpoints, but for clarity we stick with the single-LLM approach.
 

#### Import libraries

In [26]:
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT, QA_PROMPT
from langchain.chains.question_answering import load_qa_chain
from langchain.memory import ConversationBufferMemory
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

#### Create Web Link Data Source

You can replace and add more web links of your choice. 

In [27]:
urls = ["https://www.nvidia.com/en-in/glossary/retrieval-augmented-generation/",
       "https://en.wikipedia.org/wiki/Retrieval-augmented_generation",
        "https://docs.nvidia.com/cuda/",
        "https://github.com/NVIDIA/cuda-samples"
       ]

#### Create A Function To Load HTML Files

Below is a helper function for loading HTML pages; the text it returns is what we will later embed.

In [28]:
import re
import requests
from bs4 import BeautifulSoup
from typing import List, Union

def html_document_loader(url: Union[str, bytes]) -> str:
    """
    Load the HTML content of a document from a given URL and return its text content.

    Args:
        url: The URL of the document.

    Returns:
        The content of the document.

    Raises:
        Exception: If there is an error while making the HTTP request.

    """
    try:
        response = requests.get(url)
        html_content = response.text
    except Exception as e:
        print(f"Failed to load {url} due to exception {e}")
        return ""

    try:
        # Create a Beautiful Soup object to parse html
        soup = BeautifulSoup(html_content, "html.parser")

        # Remove script and style tags
        for script in soup(["script", "style"]):
            script.extract()

        # Get the plain text from the HTML document
        text = soup.get_text()

        # Remove excess whitespace and newlines
        text = re.sub(r"\s+", " ", text).strip()

        return text
    except Exception as e:
        print(f"Exception {e} while loading document")
        return ""

#### Create Embeddings and Document Text Splitter

Let's create a function that sets the path where our embeddings are stored, runs `html_document_loader` on each URL, and splits the documents into chunks of text.

In [29]:
def create_embeddings(embeddings_model, embedding_path: str = "./embed"):
    # Use the embedding_path argument rather than overriding it inside the function
    print(f"Storing embeddings to {embedding_path}")

    documents = []
    for url in urls:
        document = html_document_loader(url)
        if document:  # skip pages that failed to load
            documents.append(document)

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=0,
        length_function=len,
    )
    print("Total documents:", len(documents))
    texts = text_splitter.create_documents(documents)
    print("Total texts:", len(texts))
    # index_docs does not use its url argument; the last URL from the loop is passed
    index_docs(embeddings_model, url, text_splitter, texts, embedding_path)
    print("Generated embedding successfully")
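
To give a feel for what `chunk_size=500` and `chunk_overlap=0` mean, here is a deliberately simplified fixed-size chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and spaces); the function name `fixed_size_chunks` is hypothetical and the sketch only illustrates the two parameters.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 0) -> List[str]:
    """Cut text into windows of chunk_size characters, each sharing chunk_overlap characters with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


chunks = fixed_size_chunks("a" * 1200, chunk_size=500, chunk_overlap=0)
print([len(chunk) for chunk in chunks])
```

With zero overlap, a sentence that straddles a boundary is split across two chunks; a small overlap is a common way to reduce that effect at the cost of storing more text.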

#### Generate Embeddings Using NVIDIA AI Endpoints From LangChain

In this section, we demonstrate how to generate embeddings using NVIDIA AI Endpoints for LangChain and save them to an offline vector store in the `./embed` directory for future reuse.

In [30]:
embeddings_model = NVIDIAEmbeddings(model="NV-Embed-QA") # or use nvidia/nv-embedqa-e5-v5

Below, we create an `index_docs` function that loops through the document page content to extend text and metadata and applies [FAISS](https://faiss.ai/index.html). The embeddings are stored locally.

In [31]:
from typing import List, Union


def index_docs(embeddings_model, url: Union[str, bytes], splitter, documents: List, dest_embed_dir: str) -> None:
    """
    Split the documents into chunks and create embeddings for them.
    
    Args:
        embeddings_model: Model used for creating embeddings.
        url: Source URL for the documents (currently unused).
        splitter: Splitter used to split the documents.
        documents: List of LangChain Document objects (with `page_content` and `metadata`) to embed.
        dest_embed_dir: Destination directory for embeddings.
    """
    texts = []
    metadatas = []

    for document in documents:
        chunk_texts = splitter.split_text(document.page_content)
        texts.extend(chunk_texts)
        metadatas.extend([document.metadata] * len(chunk_texts))

    if os.path.exists(dest_embed_dir):
        docsearch = FAISS.load_local(
            folder_path=dest_embed_dir, 
            embeddings=embeddings_model, 
            allow_dangerous_deserialization=True
        )
        docsearch.add_texts(texts, metadatas=metadatas)
    else:
        docsearch = FAISS.from_texts(texts, embedding=embeddings_model, metadatas=metadatas)

    docsearch.save_local(folder_path=dest_embed_dir)

#### Load Embeddings from the Vector Store and Build a RAG using NVIDIA Endpoints

Next, we call the function `create_embeddings` and load the documents from the [vector store](https://developer.nvidia.com/blog/accelerating-vector-search-fine-tuning-gpu-index-algorithms/) using FAISS. The vector store holds the embeddings, which are high-dimensional vectors representing the text chunks, so that the chunks most similar to a query can be retrieved.

Please run the two cells below. 

In [32]:
%%time
create_embeddings(embeddings_model=embeddings_model)

Storing embeddings to ./embed
Total documents: 4
Total texts: 144
Generated embedding successfully
CPU times: user 542 ms, sys: 62.8 ms, total: 605 ms
Wall time: 9.97 s


In [33]:
# load Embed documents
embedding_path = "./embed/"
docsearch = FAISS.load_local(folder_path=embedding_path, embeddings=embeddings_model, allow_dangerous_deserialization=True)
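
Conceptually, retrieval from a vector store means ranking stored vectors by their similarity to the query vector. The toy sketch below does this with cosine similarity over hand-made two-dimensional vectors. It is an illustration of the idea only: the vectors and names are hypothetical, real embeddings have many more dimensions, and FAISS uses optimized index structures and may use a different distance metric.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the keys of the k stored vectors most similar to the query."""
    ranked = sorted(vectors, key=lambda key: cosine_similarity(query, vectors[key]), reverse=True)
    return ranked[:k]


# Hand-made vectors for illustration only.
store = {"cuda": [1.0, 0.0], "rag": [0.0, 1.0], "gpu": [0.9, 0.1]}
print(top_k([1.0, 0.05], store, k=2))
```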

### Create A Conversational Retrieval Chain With llama3-8b-instruct

Since our deployed NIM is up and running at `http://0.0.0.0:$CONTAINER_PORT`, we will create a [conversational retrieval chain](https://python.langchain.com/v0.1/docs/modules/chains/#conversationalretrievalchain-with-streaming-to-stdout) based on the NIM base model `llama3-8b-instruct`.

In [34]:
llm = ChatNVIDIA(base_url="http://0.0.0.0:{}/v1".format(os.environ['CONTAINER_PORT']),
                 model="meta/llama3-8b-instruct", temperature=0.1, max_tokens=1000, top_p=1.0)

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

qa_prompt=QA_PROMPT

doc_chain = load_qa_chain(llm, chain_type="stuff", prompt=QA_PROMPT)

qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=docsearch.as_retriever(),
    chain_type="stuff",
    memory=memory,
    combine_docs_chain_kwargs={'prompt': qa_prompt},
)

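
At a high level, a conversational retrieval chain condenses a follow-up question and the chat history into a standalone question, retrieves matching documents, and then answers from them. The sketch below mirrors that flow with stand-in functions. All names are hypothetical, and the stand-ins only format strings; a real chain would call the LLM and the FAISS retriever at each step.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]


def conversational_answer(
    question: str,
    history: History,
    condense: Callable[[str, History], str],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, List[str]], str],
) -> str:
    """Condense the question with the history, retrieve context, answer, and record the turn."""
    standalone = condense(question, history) if history else question
    context = retrieve(standalone)
    reply = answer(standalone, context)
    history.append((question, reply))
    return reply


# Stand-in components (hypothetical).
def condense(question: str, history: History) -> str:
    return f"{question} (follow-up to: {history[-1][0]})"


def retrieve(question: str) -> List[str]:
    return ["doc about " + question]


def answer(question: str, context: List[str]) -> str:
    return f"answer to '{question}' using {len(context)} document(s)"


history: History = []
print(conversational_answer("What is RAG?", history, condense, retrieve, answer))
print(conversational_answer("Why use it?", history, condense, retrieve, answer))
```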


### Test With Query

In [35]:
query = "What are the salient features of CUDA"
result = qa.invoke({"question": query})
print(result.get("answer"))


According to the provided context, the salient features of CUDA include:

* GPU-accelerated libraries
* Debugging and optimization tools
* C/C++ compiler
* Runtime library to deploy applications
* Built-in capabilities for distributing computations across multi-GPU configurations
* Support for Tegra devices and NvMedia Application Programming Interface (API) for processing image and video data
* CUFFT Callback Routines for user-supplied kernel routines on Linux x86_64 systems

Note that this is not an exhaustive list, as the context only provides a selection of features and capabilities.


In [36]:
query = "What is RAG?"
result = qa.invoke({"question": query})
print(result.get("answer"))

Retrieval-Augmented Generation (RAG) is a software architecture that combines the capabilities of large language models (LLMs) with information sources specific to a business, such as documents, SQL databases, and internal business applications, to enhance the accuracy and relevance of the LLM's responses.


Before we move ahead, let's free up GPU VRAM by stopping the docker container.

In [37]:
! docker container stop nemotron_nim

nemotron_nim


The next notebook shows how to add PEFT functionality, such as LoRA adapters, to NIMs.

---

## References

- https://developer.nvidia.com/blog/tips-for-building-a-rag-pipeline-with-nvidia-ai-langchain-ai-endpoints/
- https://nvidia.github.io/GenerativeAIExamples/latest/notebooks/05_RAG_for_HTML_docs_with_Langchain_NVIDIA_AI_Endpoints.html

## Licensing

Copyright © 2024 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
