<img src="./images/DLI_Header.png" width=400/>

# Build a RAG chain for the NVIDIA Triton documentation website

In this notebook we demonstrate how to build a retrieval-augmented generation (RAG) chain using [NVIDIA AI Endpoints for LangChain](https://python.langchain.com/docs/integrations/text_embedding/nvidia_ai_endpoints). We create a vector store by downloading web pages, generating their embeddings with an NVIDIA embedding model, and storing them in a FAISS index. We then showcase two different chat chains for querying the vector store. For this example, we use the NVIDIA Triton documentation website, though the code can be easily modified to use any other source.

### First stage: load the NVIDIA Triton documentation from the web, split it into chunks, and generate embeddings stored in a FAISS index

To run this notebook, you need to complete the [setup](https://python.langchain.com/docs/integrations/text_embedding/nvidia_ai_endpoints#setup) and generate an NVIDIA API key. To obtain an API key, log in to the [API Catalog](https://build.nvidia.com/), find the model you want to use, and click “Get API Key.” This key is used to authenticate with the Docker registry when pulling the NIM container (whether in the Brev environment or your own cloud/local environment) and to make API calls to the model endpoint hosted on the API catalog.

The NVIDIA API catalog is a trial experience of NVIDIA NIM limited to 5000 free API credits. Upon sign-up, users are granted 1000 API credits. To obtain more, click on your profile from within the [API catalog](https://build.nvidia.com/) → ‘Request More’. If you signed up to use the API catalog with a personal email address, you will be asked to provide a business email to activate a free 90-day NVIDIA AI Enterprise license and unlock an additional 4000 credits. See the [NVIDIA NIM FAQ](https://forums.developer.nvidia.com/t/nvidia-nim-faq/300317) for more information regarding API credits.

In [1]:
import os
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT, QA_PROMPT
from langchain.chains.question_answering import load_qa_chain

from langchain.memory import ConversationBufferMemory
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

Provide the API key by running the cell below.

In [2]:
import getpass

if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

Enter your NVIDIA API key:  ········


The helper function below loads an HTML page and extracts its plain text. We'll use it later to load the relevant HTML documents from the Triton documentation website before converting them into a vector store.

In [3]:
import re
from typing import List, Union

import requests
from bs4 import BeautifulSoup

def html_document_loader(url: Union[str, bytes]) -> str:
    """
    Load the HTML content of a document from a given URL and return its plain text.

    Args:
        url: The URL of the document.

    Returns:
        The text content of the document, or an empty string if the page
        could not be fetched or parsed.
    """
    try:
        # A timeout keeps the notebook from hanging on an unresponsive server
        response = requests.get(url, timeout=30)
        # Treat HTTP error statuses (4xx/5xx) as failures instead of indexing error pages
        response.raise_for_status()
        html_content = response.text
    except Exception as e:
        print(f"Failed to load {url} due to exception {e}")
        return ""

    try:
        # Create a Beautiful Soup object to parse html
        soup = BeautifulSoup(html_content, "html.parser")

        # Remove script and style tags
        for script in soup(["script", "style"]):
            script.extract()

        # Get the plain text from the HTML document
        text = soup.get_text()

        # Remove excess whitespace and newlines
        text = re.sub(r"\s+", " ", text).strip()

        return text
    except Exception as e:
        print(f"Exception {e} while loading document")
        return ""
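
To see what the cleaning steps do without making a network request, the sketch below applies the same idea (drop script and style content, then collapse whitespace) to an inline HTML string. It uses Python's built-in `html.parser` rather than BeautifulSoup so that it runs with no extra dependencies. The class `TextExtractor`, the function `html_to_text`, and the sample HTML are illustrative names and data, not part of the notebook, and the exact spacing may differ slightly from what `soup.get_text()` produces.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Collapse runs of whitespace and newlines into single spaces
        return re.sub(r"\s+", " ", " ".join(self._parts)).strip()


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample_html = """
<html><head><style>body { color: red; }</style></head>
<body><h1>Triton</h1>
<script>console.log("ignored");</script>
<p>Serves   models
over HTTP.</p></body></html>
"""
print(html_to_text(sample_html))
```

The style rule and the script body are dropped, and the irregular whitespace inside the paragraph is reduced to single spaces, which is the form of text we want to hand to the splitter.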

Read the HTML pages and split the text in preparation for embedding generation.
Note that the chunk_size value must suit the specific model used for embedding generation.

Pay attention to the chunk_size parameter in TextSplitter. Setting the right chunk size is critical for RAG performance, as much of a RAG’s success depends on the retrieval step finding the right context for generation. The entire prompt (retrieved chunks + user query) must fit within the LLM’s context window, so avoid chunk sizes that are too large and balance them against the estimated query size. For example, while OpenAI LLMs have a context window of 8k-32k tokens, Llama3 is limited to 8k tokens. Note that with `length_function=len`, as used in the cell below, chunk_size is measured in characters rather than tokens. Experiment with different chunk sizes, but typical values should be 100-600, depending on the LLM.
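
As a rough sanity check on the budget described above, the sketch below estimates how many tokens a prompt would use for a given chunk size. Everything here is an assumption for illustration: the helper names are hypothetical, the ratio of about 4 characters per token is only a common rule of thumb for English text, 4 retrieved chunks is an assumed retriever setting, and 8192 stands in for an "8k" context window. The 1000 reserved output tokens mirror the `max_tokens=1000` used later in this notebook.

```python
import math


def estimate_prompt_tokens(chunk_size_chars, num_chunks, query_chars,
                           template_chars, chars_per_token=4.0):
    """Rough token estimate for a RAG prompt built from character counts."""
    total_chars = chunk_size_chars * num_chunks + query_chars + template_chars
    return math.ceil(total_chars / chars_per_token)


def fits_context(prompt_tokens, max_output_tokens, context_window):
    """True if the prompt plus the reserved output tokens fit in the window."""
    return prompt_tokens + max_output_tokens <= context_window


# Assumed values: 4 retrieved chunks, an 8192-token window, ~4 characters per token
small = estimate_prompt_tokens(1000, 4, query_chars=200, template_chars=300)
large = estimate_prompt_tokens(8000, 4, query_chars=200, template_chars=300)

print(small, fits_context(small, max_output_tokens=1000, context_window=8192))
print(large, fits_context(large, max_output_tokens=1000, context_window=8192))
```

With these assumptions, 1000-character chunks leave plenty of headroom, while 8000-character chunks would overflow the window once the output tokens are reserved. For a real deployment, count tokens with the tokenizer of the model you actually use.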

In [4]:
def create_embeddings(embedding_path: str = "./data/nv_embedding"):
    """Download the Triton docs, split them into chunks, and index them under embedding_path."""
    print(f"Storing embeddings to {embedding_path}")

    # List of web pages containing NVIDIA Triton technical documentation
    urls = [
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html",
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html",
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_repository.html",
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_analyzer.html",
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/architecture.html",
    ]

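    # Download each page, skipping any that failed to load (the loader returns "")
    documents = []
    for url in urls:
        document = html_document_loader(url)
        if document:
            documents.append(document)

    # chunk_size and chunk_overlap are measured in characters because length_function=len
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=0,
        length_function=len,
    )
    texts = text_splitter.create_documents(documents)
    # NOTE: index_docs does not use its url argument; the last URL is passed as a placeholder
    index_docs(url, text_splitter, texts, embedding_path)
    print("Generated embedding successfully")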

Generate embeddings using NVIDIA AI Endpoints for LangChain and save them to an offline vector store in the ./data/nv_embedding directory for future reuse.

In [5]:
def index_docs(url: Union[str, bytes], splitter, documents: List, dest_embed_dir) -> None:
    """
    Split the documents into chunks, embed them, and add them to the vector store.

    Args:
        url: Source url for the documents (currently unused).
        splitter: Splitter used to split the documents.
        documents: List of LangChain Document objects whose embeddings need to be created.
        dest_embed_dir: Destination directory for the embeddings.

    Returns:
        None
    """
    embeddings = NVIDIAEmbeddings(model="NV-Embed-QA", truncate="END")

    for document in documents:
        texts = splitter.split_text(document.page_content)

        # metadata to attach to each chunk (one metadata dict per text)
        metadatas = [document.metadata] * len(texts)

        # create embeddings and add to vector store
        if os.path.exists(dest_embed_dir):
            update = FAISS.load_local(folder_path=dest_embed_dir, embeddings=embeddings, allow_dangerous_deserialization=True)
            update.add_texts(texts, metadatas=metadatas)
            update.save_local(folder_path=dest_embed_dir)
        else:
            docsearch = FAISS.from_texts(texts, embedding=embeddings, metadatas=metadatas)
            docsearch.save_local(folder_path=dest_embed_dir)

In [8]:
create_embeddings()

Storing embeddings to ./data/nv_embedding


Exception: [429] Too Many Requests
{'status': 429, 'title': 'Too Many Requests'}

### Second stage: load the embeddings from the vector store and build a RAG chain using NVIDIAEmbeddings

Create the embedding model using the NVIDIA Retrieval QA Embedding endpoint. This model represents words, phrases, or other entities as vectors of numbers and captures the relationships between them. See here for reference: https://build.nvidia.com/nvidia/embed-qa-4

In [9]:
# allow_dangerous_deserialization is an argument of FAISS.load_local, not of the embedding model
embedding_model = NVIDIAEmbeddings(model="NV-Embed-QA", truncate="END")

Load the saved vector store from disk using FAISS and create a retriever

In [10]:
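# Load the saved FAISS index and expose it as a retriever.
# allow_dangerous_deserialization=True is required because part of the index is stored
# as a pickle file; only load index files that you created yourself or otherwise trust.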
embedding_path = "./data/nv_embedding"
docsearch = FAISS.load_local(folder_path=embedding_path, embeddings=embedding_model, allow_dangerous_deserialization=True)
retriever = docsearch.as_retriever()
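
Conceptually, a retriever embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below illustrates that idea with a toy bag-of-words "embedding" and cosine similarity in plain Python. It is not how FAISS or NV-Embed-QA work internally; `toy_embed`, `cosine_similarity`, `retrieve`, and the sample chunks are hypothetical names and data chosen only for illustration.

```python
import math
from collections import Counter


def toy_embed(text):
    """Toy embedding: a word-count vector (real models return dense float vectors)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, toy_embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


chunks = [
    "triton supports http and grpc protocols",
    "the model repository stores model files",
    "model analyzer profiles model performance",
]
top = retrieve("which protocols does triton support", chunks, k=1)
print(top)
```

Only the first chunk shares words with the query, so it ranks highest. A learned embedding model can also match chunks that use different words for the same concept, which this word-overlap toy cannot.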

In [11]:
# This should return documents related to the test query
retriever.invoke("Deploy TensorRT-LLM Engine on Triton Inference Server")

[Document(metadata={}, page_content='NVIDIA Triton Inference Server — NVIDIA Triton Inference Server Skip to main content Back to top Ctrl+K NVIDIA Triton Inference Server GitHub NVIDIA Triton Inference Server GitHub Table of Contents Home Release notes Compatibility matrix Getting Started Quick Deployment Guide by backend TRT-LLM vLLM Python with HuggingFace PyTorch ONNX TensorFlow Openvino LLM With TRT-LLM Multimodal model Stable diffusion Scaling guide Multi-Node (AWS) Multi-Instance LLM Features Constrained Decoding Function Calling Speculative Decoding TRT-LLM vLLM Client API Reference OpenAI API KServe API HTTP/REST and GRPC Protocol Extensions Binary tensor data extension Classification extension Schedule policy extension Sequence extension Shared-memory extension Model configuration extension Model repository extension Statistics extension Trace extension Logging extension Parameters extension In-Process Triton Server API C/C++ Python Kafka I/O Rayserve Java Client Libraries Py

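Create a conversational retrieval chain, modeled on LangChain's ConversationalRetrievalChain. In this chain we demonstrate the use of 2 LLMs: one for summarization (condensing the chat history and the follow-up question into a standalone question) and another for chat (answering that question from the retrieved context). This improves the overall result in more complicated scenarios. The chain is intended for two different models, for example Llama3 70B for the first LLM and Mixtral for the chat element; note that the cell below instantiates Mixtral 8x7B Instruct for both `llm` and `chat`. The first step acts as a question generator that produces a relevant query prompt. See here for reference: https://python.langchain.com/docs/modules/chains/popular/chat_vector_db#conversationalretrievalchain-with-streaming-to-stdout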

In [12]:
print(f"{CONDENSE_QUESTION_PROMPT = }")
print(f"{QA_PROMPT = }")

CONDENSE_QUESTION_PROMPT = PromptTemplate(input_variables=['chat_history', 'question'], input_types={}, partial_variables={}, template='Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.\n\nChat History:\n{chat_history}\nFollow Up Input: {question}\nStandalone question:')
QA_PROMPT = PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:")


In [13]:
llm = ChatNVIDIA(model='mistralai/mixtral-8x7b-instruct-v0.1')
chat = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1", temperature=0.1, max_tokens=1000, top_p=1.0)

retriever = docsearch.as_retriever()

## Requires question and chat_history
qa_chain = (RunnablePassthrough()
    ## {question, chat_history} -> str
    | CONDENSE_QUESTION_PROMPT | llm | StrOutputParser()
    # | RunnablePassthrough(print)
    ## str -> {question, context}
    | {"question": lambda x: x, "context": retriever}
    # | RunnablePassthrough(print)
    ## {question, context} -> str
    | QA_PROMPT | chat | StrOutputParser()
)
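
To make the data flow of this two-step chain easier to follow, here is a plain-Python sketch with stub components in place of the LLMs and the retriever. The two templates are copied from the prompts printed above; the function names (`condense`, `answer`, `two_stage_chain`) and all stub inputs and outputs are hypothetical. Joining the retrieved chunks with blank lines is a simplification of how the real chain renders the retrieved documents.

```python
CONDENSE_TEMPLATE = (
    "Given the following conversation and a follow up question, rephrase the "
    "follow up question to be a standalone question, in its original language."
    "\n\nChat History:\n{chat_history}\nFollow Up Input: {question}\nStandalone question:"
)
QA_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer."
    "\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"
)


def condense(inputs, llm):
    """Step 1: {question, chat_history} -> standalone question (str)."""
    return llm(CONDENSE_TEMPLATE.format(**inputs))


def answer(question, retriever, llm):
    """Step 2: standalone question -> retrieved context -> final answer (str)."""
    context = "\n\n".join(retriever(question))
    return llm(QA_TEMPLATE.format(context=context, question=question))


def two_stage_chain(inputs, condense_llm, chat_llm, retriever):
    return answer(condense(inputs, condense_llm), retriever, chat_llm)


# Stubs that record the prompts they receive instead of calling a real model
prompts_seen = []


def stub_condense_llm(prompt):
    prompts_seen.append(prompt)
    return "Why does Triton support HTTP/REST, gRPC, and the C API?"


def stub_chat_llm(prompt):
    prompts_seen.append(prompt)
    return "stub answer"


def stub_retriever(question):
    return ["Triton supports HTTP/REST, gRPC, and a C API."]


result = two_stage_chain(
    {"question": "But why?", "chat_history": ["Triton supports three interfaces."]},
    stub_condense_llm,
    stub_chat_llm,
    stub_retriever,
)
print(prompts_seen[1])
```

The printed prompt shows that the second model never sees the vague follow-up "But why?"; it only sees the rewritten standalone question together with the retrieved context.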

Ask any question about Triton

In [14]:
chat_history = []

query = "What is Triton?"
chat_history += [qa_chain.invoke({"question": query, "chat_history": chat_history})]
chat_history

["Based on the provided documents, Triton refers to the Triton Inference Server. It's a server that makes machine learning models available for inferencing. It supports multiple deep learning frameworks and scheduling and batching algorithms. It also provides a backend C API for extending its functionality. The models served by Triton can be queried and controlled by a dedicated model management API, and it supports HTTP/REST, gRPC, and C API for inference protocols."]

Ask another question about Triton

In [15]:
query = "What interfaces does Triton support?"
chat_history += [""]
for token in qa_chain.stream({"question": query, "chat_history": chat_history[:-1]}):
    print(token, end="")
    chat_history[-1] += token

Based on the provided documents, the Triton Inference Server supports the following interfaces for receiving inference requests:

1. HTTP/REST
2. gRPC
3. C API

These interfaces allow clients to send inference requests to the server for processing.
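
The streaming cell above uses a small bookkeeping pattern: it appends an empty placeholder to `chat_history`, passes `chat_history[:-1]` to the chain so the placeholder is not sent as history, and then grows the placeholder token by token. The sketch below isolates that pattern with a stand-in generator; `fake_stream` and its tokens are hypothetical and replace the real `qa_chain.stream(...)` call.

```python
def fake_stream(tokens):
    """Stand-in for qa_chain.stream(...): yields the answer piece by piece."""
    yield from tokens


chat_history = ["Triton is an inference server."]

# Reserve a slot for the answer that is about to be streamed
chat_history += [""]

# The slice is taken before the placeholder is filled, so the history seen by
# the chain contains only completed answers
history_for_chain = chat_history[:-1]

for token in fake_stream(["HTTP/REST", ", ", "gRPC", ", ", "C API"]):
    print(token, end="")
    chat_history[-1] += token
print()
```

After the loop, the last entry of `chat_history` holds the full streamed answer, so the next question can refer back to it.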

Finally, showcase the chat capabilities by asking a follow-up question about the previous query

In [16]:
query = "But why?"
for token in qa_chain.stream({"question": query, "chat_history": chat_history}):
    print(token, end="")

The Triton Inference Server supports multiple interfaces for receiving inference requests, including HTTP/REST, gRPC, and C API, to provide flexibility and convenience for different use cases and user preferences. 

HTTP/REST and gRPC are widely used, well-established protocols for communication between services. HTTP/REST is simple and easy to use, while gRPC offers faster communication and additional features like bidirectional streaming and flow control.

The C API allows Triton to be integrated directly into applications for edge and other in-process use cases, providing more control and efficiency when embedding the inference server.

By supporting these interfaces, Triton Inference Server can cater to a broader range of use cases and user requirements, making it a versatile and adaptable solution for various projects and systems.

Now we demonstrate a simpler chain that uses only a single chat LLM

In [17]:
chat = ChatNVIDIA(
    model='mistralai/mixtral-8x7b-instruct-v0.1', 
    temperature=0.1, 
    max_tokens=1000, 
    top_p=1.0
)

qa_prompt = ChatPromptTemplate.from_messages([
    ("user", 
        "Use the following pieces of context to answer the question at the end."
        " If you don't know the answer, just say that you don't know, don't try to make up an answer."
        "\n\nHISTORY: {history}\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"
    )
])

## Requires question and history
qa_chain = (
    RunnablePassthrough.assign(context = (lambda state: state.get("question")) | retriever)
    # | RunnablePassthrough(print)
    | qa_prompt | chat | StrOutputParser()
)
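
In this chain, `RunnablePassthrough.assign` keeps the incoming dictionary intact and adds a `context` key computed from it, so the prompt receives `question`, `history`, and `context` together. A plain-Python sketch of that behavior follows; the `assign` helper and `stub_retriever` are hypothetical stand-ins for illustration, not LangChain code.

```python
def assign(state, **computed):
    """Return a copy of state with extra keys, each computed from the original state."""
    new_state = dict(state)
    for key, fn in computed.items():
        new_state[key] = fn(state)
    return new_state


def stub_retriever(question):
    # Stand-in for the FAISS retriever: returns a fixed chunk
    return ["Triton is an open-source inference serving software."]


state = {"question": "What is Triton?", "history": []}
enriched = assign(state, context=lambda s: stub_retriever(s["question"]))
print(sorted(enriched))
```

The original keys pass through unchanged, which is why the prompt template can still reference `{history}` and `{question}` after the retrieval step.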

Now try asking a question about Triton with the simpler chain. Compare the answer with the result from the previous, more complex chain.

In [18]:
chat_history = []

query = "What is Triton?"
chat_history += [qa_chain.invoke({"question": query, "history": chat_history})]
chat_history

['Based on the provided context, Triton refers to the NVIDIA Triton Inference Server. It is an open-source inference serving software that streamlines AI inferencing. Triton enables teams to deploy any AI model from multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. It delivers optimized performance for various query types and supports inference across different environments like cloud, data center, edge, and embedded devices on NVIDIA GPUs, x86, and ARM CPU, or AWS Inferentia. Triton Inference Server is part of NVIDIA AI Enterprise, a software platform that accelerates the data science pipeline and streamlines the development and deployment of production AI.']

Ask another question about Triton

In [None]:
query = "Does Triton support ONNX?"
chat_history += [""]
for token in qa_chain.stream({"question": query, "history": chat_history[:-1]}):
    print(token, end="")
    chat_history[-1] += token

Finally, showcase the chat capabilities by asking a follow-up question about the previous query

In [None]:
query = "How come?"
for token in qa_chain.stream({"question": query, "history": chat_history}):
    print(token, end="")
