# Text Reranking NIM LangChain Playbook

Reranking improves the accuracy and efficiency of retrieval pipelines. It matters most when a pipeline draws citations from several datastores, each of which may use its own similarity scoring algorithm, so scores from different datastores are not directly comparable. Reranking serves two primary purposes:

<ol>
    <li>Improving accuracy for individual citations within each datastore.</li>
    <li>Integrating results from multiple datastores to provide a cohesive and relevant set of citations.</li>
</ol>
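
To make the second purpose concrete, the following minimal sketch pools candidates from two datastores whose native scores are on different scales, then orders them with one common scoring function. The `Candidate` class, the `merge_and_rerank` helper, and the sample passages are hypothetical. The `toy_relevance` function is only a word-overlap stand-in for a reranking model; it is not how Text Reranking NIM scores passages.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    source: str          # which datastore returned the passage
    native_score: float  # score on that datastore's own scale


def toy_relevance(query: str, text: str) -> float:
    """Hypothetical stand-in for a reranker: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)


def merge_and_rerank(query: str, *result_lists: list[Candidate]) -> list[Candidate]:
    """Pool candidates from every datastore and sort them by one common score."""
    pooled = [c for results in result_lists for c in results]
    return sorted(pooled, key=lambda c: toy_relevance(query, c.text), reverse=True)


# Store A reports a similarity between 0 and 1; store B reports an unbounded score.
store_a = [Candidate("the h200 is based on the hopper architecture", "A", 0.71)]
store_b = [Candidate("quarterly revenue grew in the data center segment", "B", 12.4)]

ranked = merge_and_rerank("h200 architecture", store_a, store_b)
print([c.source for c in ranked])
```

Sorting by `native_score` alone would place the passage from store B first, even though it shares no words with the query. Scoring every candidate with the same function avoids comparing values from incompatible scales.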

This playbook shows how to use the NeMo Retriever Text Reranking NIM (Text Reranking NIM) with LangChain for document compression and retrieval via the `NVIDIARerank` class.

## Use NVIDIA NIM for LLMs 

First, initialize the LLM for this playbook, which uses NVIDIA NIM for large language models (LLMs). You can access the chat models using the `ChatNVIDIA` class from the `langchain-nvidia-ai-endpoints` package, which contains LangChain integrations for building applications with models on NVIDIA NIM for LLMs. For more information, see the [ChatNVIDIA](https://python.langchain.com/v0.2/docs/integrations/chat/nvidia_ai_endpoints/) documentation.

Once the `meta/llama-3.1-8b-instruct` NIM has been deployed on your infrastructure, you can access it using the `ChatNVIDIA` class, as shown in the following example.


In [1]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Connect to an LLM NIM running at localhost:8000 and select a specific model
llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama-3.1-8b-instruct")

After the LLM is ready, use LangChain's `ChatPromptTemplate` class to structure multi-turn conversations and format inputs for the language model, as shown in the following example.

In [2]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    # Adjacent string literals are concatenated, so each sentence ends with a space.
    ("system", (
        "You are a helpful and friendly AI! "
        "Your responses should be concise and no longer than two sentences. "
        "Say you don't know if you don't have this information."
    )),
    ("user", "{question}")
])

chain = prompt | llm | StrOutputParser()

To interact with the LLM in the LangChain Expression Language (LCEL) chain, use the `invoke` method, as shown in the following example.

In [3]:
print(chain.invoke({"question": "What's the difference between a GPU and a CPU?"}))

A GPU (Graphics Processing Unit) is a specialized computer chip that handles graphics and compute tasks, providing faster rendering and increased performance for gaming and graphics-intensive applications. In contrast, a CPU (Central Processing Unit) is the "brain" of the computer, handling general-purpose computing tasks such as instructions, calculations, and memory management.


Next, ask the following question about the NVIDIA H200 GPU. Because the knowledge cutoff for many LLMs is late 2022 or early 2023, the model might not have access to information published after that time.

In [4]:
print(chain.invoke({"question": "What does the H in the NVIDIA H200 stand for?"}))

The "H" in NVIDIA H200 stands for HGX H100, which is a data center GPU solution, with "H" in HGX H100 indicating it is designed for High-Performance Computing.


>  I'm sorry, at the moment I don't have information on what the 'H' in the NVIDIA H200 stands for. It could possibly be a model-specific identifier or code. You might want to check NVIDIA's official documentation or contact them directly for clarification.

## Reranking with Text Reranking NIM

To answer the previous question, build a simple retrieval and reranking pipeline to find the most relevant piece of information to the query.

Load the [NVIDIA H200 Datasheet](https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200-datasheet-3002446) to use in the retrieval pipeline. LangChain provides a variety of [document loaders](https://python.langchain.com/docs/integrations/document_loaders) for various types of documents, such as HTML, PDF, and code, from sources and locations such as private S3 buckets and public websites. The following example uses a LangChain [`PyPDFLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html) to load a datasheet about the NVIDIA H200 Tensor Core GPU.

In [5]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("https://nvdam.widen.net/content/udc6mzrk7a/original/hpc-datasheet-sc23-h200-datasheet-3002446.pdf")

document = loader.load()
document[0]

Document(metadata={'source': 'https://nvdam.widen.net/content/udc6mzrk7a/original/hpc-datasheet-sc23-h200-datasheet-3002446.pdf', 'page': 0}, page_content='NVIDIA H200 Tensor Core GPU\u2002|\u2002Datasheet\u2002|\u20021\nNVIDIA H200 Tensor Core GPU\nSupercharging AI and HPC workloads.\nHigher Performance With Larger, Faster Memory\nThe NVIDIA H200 Tensor Core GPU supercharges generative AI and high-\nperformance computing (HPC) workloads with game-changing performance \nand memory capabilities. \nBased on the NVIDIA Hopper™ architecture, the NVIDIA H200 is the first GPU to \noffer 141 gigabytes (GB) of HBM3e memory at 4.8 terabytes per second (TB/s)—\nthat’s nearly double the capacity of the NVIDIA H100 Tensor Core GPU with \n1.4X more memory bandwidth. The H200’s larger and faster memory accelerates \ngenerative AI and large language models, while advancing scientific computing for \nHPC workloads with better energy efficiency and lower total cost of ownership. \nUnlock Insights With 

Once documents have been loaded, they are often transformed. One method of transformation is known as **chunking**, which breaks down large pieces of text, such as a long document, into smaller segments. This technique is valuable because it helps [optimize the relevance of the content returned from the vector database](https://www.pinecone.io/learn/chunking-strategies/).

LangChain provides a [variety of document transformers](https://python.langchain.com/docs/integrations/document_transformers/), such as text splitters. The following example uses a [``RecursiveCharacterTextSplitter``](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html). The ``RecursiveCharacterTextSplitter`` divides a large body of text into smaller chunks based on a specified chunk size. It splits recursively, using an ordered list of separators, such as "\n\n", "\n", " ", and "", to determine where splits should occur. The process begins by attempting to split the text on the first separator in the list. If the resulting chunks are still larger than the desired chunk size, it proceeds to the next separator in the list and attempts to split again. This process continues until all chunks adhere to the specified maximum chunk size.
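
The recursion described above can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: the hypothetical `recursive_split` function drops the separators, does not merge small neighboring pieces back together, and ignores chunk overlap.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text so that no returned piece is longer than chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separators left (or the empty-string separator): cut at fixed width.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    first, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(first):
        # Pieces that are still too long are split again with the next separator.
        chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks


sample = "First paragraph.\n\nSecond paragraph that is a bit longer than the first."
chunks = recursive_split(sample, 30, ["\n\n", "\n", " ", ""])
print(chunks)
```

In this sketch the first paragraph fits within the limit and stays whole, while the second is too long and falls through to the space separator, which breaks it into single words. A production splitter would recombine those words into chunks close to the size limit.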

There are some nuanced complexities to text splitting since, in theory, semantically related text should be kept together.

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", ";", ",", " ", ""],
)

document_chunks = text_splitter.split_documents(document)
print("Number of chunks from the document:", len(document_chunks))

Number of chunks from the document: 17


The following example shows how to use LangChain to interact with Text Reranking NIM using the `NVIDIARerank` class from the same `langchain-nvidia-ai-endpoints` package as the first example. Make sure that the NeMo Retriever Text Reranking NIM is running before this step. The example uses `nvidia/nv-rerankqa-mistral-4b-v3`; update `model` accordingly if you use a different Text Reranking NIM.

In [7]:
from langchain_nvidia_ai_endpoints import NVIDIARerank

query = "What does the H in the NVIDIA H200 stand for?"

# Initialize and connect to a NeMo Retriever Text Reranking NIM running at localhost:8002
reranker = NVIDIARerank(model="nvidia/nv-rerankqa-mistral-4b-v3",
                        base_url="http://localhost:8002/v1")

reranked_chunks = reranker.compress_documents(query=query,
                                              documents=document_chunks)

The following example prints the results of using Text Reranking NIM to rerank the document chunks, based on the relevance score of each chunk with respect to the query.

In [8]:
for chunks in reranked_chunks:

    # Access the metadata of the document
    metadata = chunks.metadata

    # Get the page content
    page_content = chunks.page_content
    
    # Print the relevance score if it exists in the metadata, followed by page content
    if 'relevance_score' in metadata:
        print(f"Relevance Score: {metadata['relevance_score']}, Page Content: {page_content}...")
    print(f"{'-' * 100}")

Relevance Score: 16.3125, Page Content: NVIDIA H200 Tensor Core GPU | Datasheet | 1
NVIDIA H200 Tensor Core GPU
Supercharging AI and HPC workloads.
Higher Performance With Larger, Faster Memory
The NVIDIA H200 Tensor Core GPU supercharges generative AI and high-
performance computing (HPC) workloads with game-changing performance 
and memory capabilities. 
Based on the NVIDIA Hopper™ architecture, the NVIDIA H200 is the first GPU to 
offer 141 gigabytes (GB) of HBM3e memory at 4.8 terabytes per second (TB/s)—...
----------------------------------------------------------------------------------------------------
Relevance Score: 10.1953125, Page Content: NVIDIA H200 Tensor Core GPU | Datasheet | 3
AI Acceleration for Mainstream Enterprise Servers With  
H200 NVL
NVIDIA H200 NVL is ideal for lower-power, air-cooled enterprise rack designs that 
require flexible configurations, delivering acceleration for every AI and HPC workload 
regardless of size. With up to four GPUs connected by NVI
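
Downstream code often keeps only the highest-scoring chunks. The following minimal sketch shows one way to do that with the `relevance_score` metadata key printed above. The `Chunk` class is a hypothetical stand-in for a LangChain `Document` so that the snippet runs without any services, and the threshold of `12.0` is an arbitrary value chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for a LangChain Document: text plus a metadata dictionary."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def score_of(chunk: Chunk) -> float:
    # Chunks without a score sort last and fail any finite threshold.
    return chunk.metadata.get("relevance_score", float("-inf"))


def select_top(chunks: list[Chunk], top_n: int, min_score: float) -> list[Chunk]:
    """Keep at most top_n chunks whose relevance score is at least min_score."""
    kept = [c for c in chunks if score_of(c) >= min_score]
    kept.sort(key=score_of, reverse=True)
    return kept[:top_n]


example = [
    Chunk("page 3 text", {"relevance_score": 10.1953125}),
    Chunk("page 1 text", {"relevance_score": 16.3125}),
    Chunk("unscored text"),
]
selected = select_top(example, top_n=1, min_score=12.0)
print([c.page_content for c in selected])
```

The scores shown in the output above are not bounded to a fixed range, so a useful threshold depends on the model and the data; treat any fixed cutoff as something to tune.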

## Use Text Reranking NIM with LCEL

One challenge with retrieval is that usually you don't know the specific queries your document storage system will face when you ingest data into the system. This means that the information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

[Contextual compression](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/contextual_compression/) is a technique to improve retrieval systems by:
<ol>
    <li> Addressing the challenge of handling unknown future queries when ingesting data.</li>
    <li> Reducing irrelevant text in retrieved documents to improve LLM response quality and efficiency.</li>
    <li> Compressing individual documents and filtering out irrelevant ones based on the query context.</li>
</ol>

The Contextual Compression Retriever requires:

* A base retriever
* A document compressor

It works by:
<ol>
    <li> Passing queries to the base retriever</li>
    <li> Sending retrieved documents through the Document Compressor</li>
    <li> Shortening the list of documents by reducing content or removing irrelevant documents entirely</li>
</ol>
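
The three steps above can be sketched in plain Python. This is a conceptual illustration under simplifying assumptions, not the LangChain `ContextualCompressionRetriever` implementation: `compression_retrieve`, `toy_retriever`, and `toy_compressor` are hypothetical names, and the compressor filters by word overlap instead of calling a reranking model.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]
Compressor = Callable[[str, list[str]], list[str]]


def compression_retrieve(query: str, base_retriever: Retriever, compressor: Compressor) -> list[str]:
    """Fetch candidates with the base retriever, then let the compressor shorten the list."""
    candidates = base_retriever(query)
    return compressor(query, candidates)


def toy_retriever(query: str) -> list[str]:
    # Hypothetical base retriever that always returns the same three passages.
    return [
        "The H200 offers 141 GB of HBM3e memory.",
        "The cafeteria opens at nine.",
        "The H200 is based on the Hopper architecture.",
    ]


def toy_compressor(query: str, documents: list[str]) -> list[str]:
    # Hypothetical compressor: drop passages that share no word with the query.
    query_words = set(query.lower().split())
    return [d for d in documents if query_words & set(d.lower().split())]


kept = compression_retrieve("h200 memory", toy_retriever, toy_compressor)
print(kept)
```

Because the retriever and the compressor are separate callables, either one can be swapped without touching the other, which is the same separation that lets a reranker act as the compressor.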

The following example demonstrates how to use the Text Reranking NIM as a document compressor with LangChain.

First, initialize an embedding model to embed the query and document chunks. This example uses the Text Embedding NIM that is already deployed at the beginning of the LaunchPad lab. You can access this model using the `NVIDIAEmbeddings` class, as shown in the following example.

In [15]:
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

embedding_model = NVIDIAEmbeddings(model="nvidia/nv-embedqa-e5-v5",
                                   base_url="http://localhost:8001/v1")

Next, initialize a simple vector store retriever and store the document chunks of the NVIDIA H200 datasheet. LangChain supports a [wide selection of vector stores](https://python.langchain.com/docs/integrations/vectorstores/); this example uses FAISS.

In [16]:
from langchain_community.vectorstores import FAISS

retriever = FAISS.from_documents(document_chunks, embedding=embedding_model).as_retriever(search_kwargs={"k": 10})
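
The `search_kwargs={"k": 10}` argument asks the retriever for the 10 chunks whose embeddings are closest to the query embedding. The following minimal sketch illustrates that nearest-neighbor idea in plain Python, assuming Euclidean distance as the metric; the metric your FAISS index uses may differ. The `top_k` helper and the two-dimensional vectors are hypothetical, and real embedding vectors have many more dimensions.

```python
import math


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k vectors closest to query_vec by Euclidean distance."""
    ranked = sorted(doc_vecs, key=lambda doc_id: math.dist(query_vec, doc_vecs[doc_id]))
    return ranked[:k]


# Hypothetical 2-dimensional embeddings, used only to keep the example readable.
doc_vecs = {"chunk-0": [0.9, 0.1], "chunk-1": [0.1, 0.9], "chunk-2": [0.7, 0.3]}
nearest = top_k([1.0, 0.0], doc_vecs, k=2)
print(nearest)
```

A larger `k` gives the reranker more candidates to choose from, at the cost of more passages to score per query.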

Wrap the base retriever with a `ContextualCompressionRetriever` class, using `NVIDIARerank` as the document compressor, as shown in the following example. As previously mentioned, this step uses `nvidia/nv-rerankqa-mistral-4b-v3`; update `model` accordingly if you use a different Text Reranking NIM.

In [17]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain_nvidia_ai_endpoints import NVIDIARerank

# Re-initialize and connect to a NeMo Retriever Text Reranking NIM running at localhost:8002
compressor = NVIDIARerank(model="nvidia/nv-rerankqa-mistral-4b-v3",
                          base_url="http://localhost:8002/v1")

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

Next, ask the LLM the same question about the "H" in NVIDIA H200, this time through the retrieval and reranking pipeline.

In [18]:
from langchain.chains import RetrievalQA

query = "What does the H in the NVIDIA H200 stand for?"

chain = RetrievalQA.from_chain_type(llm=llm, retriever=compression_retriever)
chain.invoke(query)

{'query': 'What does the H in the NVIDIA H200 stand for?',
 'result': 'I don\'t know. The text doesn\'t explicitly mention what "H" stands for.'}