# Text Embedding NIM LangChain Playbook

In LLM and retrieval-augmented generation (RAG) workflows, embeddings transform text into vectors that capture semantic meaning. This enables efficient search for contextually relevant documents based on a user's query. These documents are then provided as additional context to the LLM, enhancing its ability to generate accurate responses. 

This playbook shows how to use the NeMo Retriever Text Embedding NIM (Text Embedding NIM) with LangChain in a RAG workflow through the `NVIDIAEmbeddings` class. First, it generates embeddings from a user query. Then, it applies the same approach to embed a document, stores the embeddings in a vector store, and finally uses them in a LangChain Expression Language (LCEL) chain to help the LLM answer a question about the NVIDIA H200.

## Use NVIDIA NIM for LLMs 

First, initialize the LLM for this playbook, which uses NVIDIA NIM for LLMs. You can access the chat models using the `ChatNVIDIA` class from the `langchain-nvidia-ai-endpoints` package, which contains LangChain integrations for building applications with models on NVIDIA NIM for large language models (LLMs). For more information, see the [ChatNVIDIA](https://python.langchain.com/v0.2/docs/integrations/chat/nvidia_ai_endpoints/) documentation.

Once the Llama 3.1 8B Instruct NIM (`meta/llama-3.1-8b-instruct`) has been deployed on your infrastructure, you can access it using the `ChatNVIDIA` class, as shown in the following example.

In [2]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Connect to an LLM NIM running at localhost:8000 and specify the model to use
llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama-3.1-8b-instruct")

After the LLM is ready, you can use it with LangChain's `ChatPromptTemplate`, which is a class for structuring multi-turn conversations and formatting inputs for the language model.

In [3]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

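prompt = ChatPromptTemplate.from_messages([
    # Adjacent string literals are concatenated, so each sentence ends with
    # a trailing space to keep the sentences from running together.
    ("system", (
        "You are a helpful and friendly AI! "
        "Your responses should be concise and no longer than two sentences. "
        "Say you don't know if you don't have this information."
    )),
    ("user", "{question}")
])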

chain = prompt | llm | StrOutputParser()
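
The `|` operator in the chain above passes the output of each component to the next one. The following is a simplified illustration of that idea in plain Python, not LangChain's actual implementation. The `Step` class and the `format_prompt`, `fake_llm`, and `parse_output` names are hypothetical stand-ins for the prompt, the model, and the output parser.

```python
class Step:
    """Hypothetical stand-in for a chain component that supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Compose two steps: the output of `self` becomes the input of `other`.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Toy stand-ins (assumptions for illustration, not real LangChain objects).
format_prompt = Step(lambda inputs: f"Question: {inputs['question']}")
fake_llm = Step(lambda text: {"content": text.upper()})
parse_output = Step(lambda message: message["content"])

toy_chain = format_prompt | fake_llm | parse_output
result = toy_chain.invoke({"question": "What is a GPU?"})
print(result)
```

Calling `invoke` on the composed object runs the steps from left to right, which mirrors how the dictionary of inputs flows through the prompt, the LLM, and the parser.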

To interact with the LLM in the LangChain Expression Language (LCEL) chain, use the `invoke` method, as shown in the following example.

In [5]:
print(chain.invoke({"question": "What's the difference between a GPU and a CPU?"}))

A GPU (Graphics Processing Unit) is a specialized computer chip that handles graphics and compute tasks, providing faster rendering and increased performance for gaming and graphics-intensive applications. In contrast, a CPU (Central Processing Unit) is the "brain" of the computer, handling general-purpose computing tasks such as instructions, calculations, and memory management.


In [7]:
print(chain.invoke({"question": "What does the A in the NVIDIA A100 stand for?"}))

I'm not sure what the "A" in the NVIDIA A100 stands for. It's named after Andy Gredley, a lawyer at Stanford University who had a circle in his name that resembles an "A", with "100" referring to the 1000 millimeters in diameter form factor.


Next, ask a question about the NVIDIA H200 GPU. Since the knowledge cutoff for many LLMs is late 2022 or early 2023, the model might not have access to any information after that timeframe. 

In [8]:
print(chain.invoke({"question": "How much memory does the NVIDIA H200 have?"}))

I don't have specific information on the NVIDIA H200's memory specifications. Can you please provide more context or details about the NVIDIA H200 you're referring to?


## Generate Embeddings with Text Embedding NIM

To answer the previous question, build a simple [retrieval-augmented generation (RAG) pipeline](https://developer.nvidia.com/blog/build-enterprise-retrieval-augmented-generation-apps-with-nvidia-retrieval-qa-embedding-model/).

The following example demonstrates how to use LangChain to interact with Text Embedding NIM using the `NVIDIAEmbeddings` Python class from the same `langchain-nvidia-ai-endpoints` package as the first example. Be sure that Text Embedding NIM is running. This example uses the `nvidia/nv-embedqa-e5-v5` Text Embedding NIM; if you are using a different Text Embedding NIM, update `model` accordingly.

Generate embeddings from a user query with the following code:

In [9]:
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

# Initialize and connect to a NeMo Retriever Text Embedding NIM (nvidia/nv-embedqa-e5-v5) running at localhost:8001
embedding_model = NVIDIAEmbeddings(model="nvidia/nv-embedqa-e5-v5",
                                   base_url="http://localhost:8001/v1")

# Create vector embeddings of the query
embedding_model.embed_query("How much memory does the NVIDIA H200 have?")[:10]

[-0.0251007080078125,
 -0.038055419921875,
 0.035980224609375,
 -0.061309814453125,
 0.056396484375,
 -0.001224517822265625,
 0.01220703125,
 -0.04010009765625,
 -0.0258941650390625,
 -0.029815673828125]
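
The numbers above are only meaningful in comparison with other embeddings. A common way to compare two embedding vectors is cosine similarity, sketched below in plain Python. The three-dimensional vectors are made up for illustration; real embeddings returned by the NIM have many more dimensions, and the metric used by a given vector store may differ.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors (assumption): stand-ins for a query and two text chunks.
query_vec = [0.9, 0.1, 0.0]
related_vec = [0.8, 0.2, 0.1]
unrelated_vec = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(query_vec, related_vec)
sim_unrelated = cosine_similarity(query_vec, unrelated_vec)

# The vector pointing in a similar direction scores higher.
print(sim_related > sim_unrelated)
```

Texts with similar meaning are expected to produce vectors that point in similar directions, so a higher score suggests a more relevant match for the query.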

Next, load a PDF of the [NVIDIA H200 Datasheet](https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200-datasheet-3002446). This document becomes the knowledge base that the LLM uses to retrieve relevant information to answer questions.

LangChain provides a variety of [document loaders](https://python.langchain.com/docs/integrations/document_loaders) that load various types of documents (HTML, PDF, code) from many different sources and locations (private S3 buckets, public websites). This example uses the LangChain [`PyPDFLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html) to load the datasheet about the NVIDIA H200 Tensor Core GPU.

In [10]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("https://nvdam.widen.net/content/udc6mzrk7a/original/hpc-datasheet-sc23-h200-datasheet-3002446.pdf")

document = loader.load()
document[0]

Document(metadata={'source': 'https://nvdam.widen.net/content/udc6mzrk7a/original/hpc-datasheet-sc23-h200-datasheet-3002446.pdf', 'page': 0}, page_content='NVIDIA H200 Tensor Core GPU\u2002|\u2002Datasheet\u2002|\u20021\nNVIDIA H200 Tensor Core GPU\nSupercharging AI and HPC workloads.\nHigher Performance With Larger, Faster Memory\nThe NVIDIA H200 Tensor Core GPU supercharges generative AI and high-\nperformance computing (HPC) workloads with game-changing performance \nand memory capabilities. \nBased on the NVIDIA Hopper™ architecture, the NVIDIA H200 is the first GPU to \noffer 141 gigabytes (GB) of HBM3e memory at 4.8 terabytes per second (TB/s)—\nthat’s nearly double the capacity of the NVIDIA H100 Tensor Core GPU with \n1.4X more memory bandwidth. The H200’s larger and faster memory accelerates \ngenerative AI and large language models, while advancing scientific computing for \nHPC workloads with better energy efficiency and lower total cost of ownership. \nUnlock Insights With 

Once documents have been loaded, they are often transformed. One method of transformation is known as **chunking**, which breaks down large pieces of text, such as the text from a long document, into smaller segments. This technique is valuable because it helps [optimize the relevance of the content returned from the vector database](https://www.pinecone.io/learn/chunking-strategies/). 

LangChain provides a [variety of document transformers](https://python.langchain.com/docs/integrations/document_transformers/), such as text splitters. This example uses a [``RecursiveCharacterTextSplitter``](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html), which divides a large body of text into smaller chunks based on a specified chunk size. It splits recursively, using an ordered list of separators, such as "\n\n", "\n", " ", and "", to determine where splits occur. The process begins by attempting to split the text on the first separator in the list. If the resulting chunks are still larger than the desired chunk size, it moves to the next separator and splits again. This continues until all chunks fit within the specified maximum chunk size.

Text splitting involves a trade-off: chunks must stay within the size limit, but semantically related text should, ideally, be kept together in the same chunk.

In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", ";", ",", " ", ""],
)

document_chunks = text_splitter.split_documents(document)
print("Number of chunks from the document:", len(document_chunks))

Number of chunks from the document: 17
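
To make the recursive strategy more concrete, here is a simplified sketch in plain Python. It is not the LangChain implementation: the `recursive_split` function is a hypothetical helper, it ignores `chunk_overlap`, and it handles separators more crudely than the real splitter. It does show the core idea of trying the coarsest separator first and falling back to finer ones only for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators):
    """Simplified sketch: split text until every chunk fits in chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No usable separator left: fall back to fixed-width cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        # Greedily merge neighboring pieces while the result still fits.
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph about memory.\n\n"
    "Second paragraph, which is a little longer than the first one."
)
chunks = recursive_split(sample, chunk_size=40, separators=["\n\n", " ", ""])
for chunk in chunks:
    print(repr(chunk))
```

In this sample, the first paragraph fits in one chunk after the split on "\n\n", while the second paragraph is too long and is split again on spaces.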


The following code snippet demonstrates how to create vector embeddings for the document chunks. This step is not necessary for the RAG pipeline, but it is included here for demonstration purposes. The example uses the embedding model to convert the text chunks into vectors. It displays only the first 10 elements of the vector for the first document chunk to give a glimpse of what these embeddings look like.

In [12]:
# Extract text (page content) from the document chunks
page_contents = [doc.page_content for doc in document_chunks]

# Create vector embeddings from the document
embedding_model.embed_documents(page_contents)[0][:10]

[-0.04156494140625,
 -0.035552978515625,
 0.06524658203125,
 -0.050384521484375,
 0.0848388671875,
 -0.02410888671875,
 0.0245208740234375,
 -0.02685546875,
 -0.0136566162109375,
 -0.003902435302734375]

## Store Document Embeddings in the Vector Store

Once the document embeddings are generated, they are stored in a vector store. When a user query is received, you can:

<ol>
<li>Embed the query</li>
<li>Perform a similarity search in the vector store to retrieve the most relevant document embeddings</li>
<li>Use the retrieved documents to generate a response to the user's query</li>
</ol>

A vector store takes care of storing the embedded data and performing a vector search. LangChain provides support for a [variety of vector stores](https://python.langchain.com/docs/integrations/vectorstores/). This example uses FAISS.

In [14]:
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(document_chunks, embedding=embedding_model)
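
Conceptually, a vector store keeps each chunk next to its embedding and ranks the stored chunks against the query embedding. The brute-force sketch below illustrates that idea in plain Python. The `ToyVectorStore` class, the made-up three-dimensional vectors, and the choice of squared Euclidean distance are assumptions for illustration; libraries such as FAISS use optimized index structures instead of a linear scan.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyVectorStore:
    """Hypothetical in-memory store that keeps (vector, text) pairs in a list."""

    def __init__(self):
        self.entries = []

    def add(self, vector, text):
        self.entries.append((vector, text))

    def search(self, query_vector, k=1):
        # Brute force: compare the query with every entry, closest first.
        ranked = sorted(
            self.entries,
            key=lambda entry: squared_distance(query_vector, entry[0]),
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add([0.9, 0.1, 0.0], "chunk about memory capacity")
store.add([0.1, 0.9, 0.0], "chunk about power consumption")
store.add([0.0, 0.1, 0.9], "chunk about form factor")

# Made-up query vector (assumption) that lies close to the first entry.
top_matches = store.search([0.8, 0.2, 0.0], k=2)
print(top_matches)
```

The chunks returned by the search, not the raw vectors, are what the RAG pipeline passes to the LLM as context.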

## Use Text Embedding NIM with LCEL

The next example integrates the vector store with the LLM. A [LangChain Expression Language (LCEL)](https://python.langchain.com/docs/modules/chains/) chain combines these components. It fills the prompt placeholders (`context` and `question`) and pipes the prompt to the LLM to answer the original question from the first example (`How much memory does the NVIDIA H200 have?`) using content retrieved from the NVIDIA H200 datasheet.

In [15]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    # Adjacent string literals are concatenated, so each sentence ends with
    # a trailing space to keep the sentences from running together.
    ("system", 
        "You are a helpful and friendly AI! "
        "Your responses should be concise and no longer than two sentences. "
        "Do not hallucinate. Say you don't know if you don't have this information."
        # "Answer the question using only the context"
        "\n\nQuestion: {question}\n\nContext: {context}"
    ),
    ("user", "{question}")
])

chain = (
    {
        "context": vector_store.as_retriever(),
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)

In [16]:
print(chain.invoke("How much memory does the NVIDIA H200 have?"))

The NVIDIA H200 has 141 gigabytes (GB) of HBM3e memory.


In [17]:
print(chain.invoke("What does the 'H' in the NVIDIA H200 stand for?"))

The document does not explicitly state what the 'H' in NVIDIA H200 stands for.
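
In the chain above, the retriever returns a list of document objects that is placed into the `{context}` placeholder as is, so the prompt receives their string representation, including metadata. One common refinement is to join only the chunk text before it reaches the prompt. The sketch below shows that step in plain Python; the `Doc` class is a hypothetical stand-in for a retrieved document, and `format_docs` is an illustrative helper name, not part of the original notebook.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document chunk."""
    page_content: str


def format_docs(docs):
    """Join the text of each retrieved chunk into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)


# Example chunks paraphrasing facts from the datasheet excerpt shown earlier.
retrieved = [
    Doc("The NVIDIA H200 offers 141 GB of HBM3e memory."),
    Doc("Memory bandwidth is 4.8 TB/s."),
]
context = format_docs(retrieved)
print(context)
```

In LCEL, a plain function like this can typically be piped after the retriever (for example, `vector_store.as_retriever() | format_docs`), although the chain in this playbook works without it.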
