#### Reranking Methods in RAG Systems

Reranking is a crucial step in Retrieval-Augmented Generation (RAG) systems that aims to improve the relevance and quality of retrieved documents. It involves reassessing and reordering initially retrieved documents to ensure that the most pertinent information is prioritized for subsequent processing or presentation.

#### Method Details
The reranking process generally follows these steps:

1. Initial Retrieval: Fetch an initial set of potentially relevant documents.
2. Pair Creation: Form query-document pairs for each retrieved document.
3. Scoring:
    - LLM Method: Prompt an LLM to rate each document's relevance to the query.
    - Cross-Encoder Method: Feed each query-document pair directly into a cross-encoder model, which outputs a relevance score.
4. Score Interpretation: Parse and normalize the relevance scores.
5. Reordering: Sort documents based on their new relevance scores.
6. Selection: Choose the top K documents from the reordered list.
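
The steps above can be sketched in plain Python. This is a minimal illustration rather than the notebook's implementation: `parse_score`, `normalize`, `rerank`, and `toy_score_fn` are hypothetical names, and `toy_score_fn` is a word-overlap stand-in for a real LLM or cross-encoder call. The parser assumes the score is the first number in the model's reply.

```python
import re
from typing import Callable, List, Tuple


def parse_score(raw: str, default: float = 0.0) -> float:
    """Pull the first number out of a reply such as 'Relevance Score: 8'."""
    match = re.search(r"-?\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else default


def normalize(scores: List[float], low: float = 1.0, high: float = 10.0) -> List[float]:
    """Clamp scores to the 1-10 rating scale and map them onto [0, 1]."""
    return [(min(max(s, low), high) - low) / (high - low) for s in scores]


def rerank(
    query: str,
    docs: List[str],
    score_fn: Callable[[str, str], str],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    pairs = [(query, doc) for doc in docs]              # 2. pair creation
    raw = [score_fn(q, d) for q, d in pairs]            # 3. scoring
    scores = normalize([parse_score(r) for r in raw])   # 4. score interpretation
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)  # 5. reordering
    return ranked[:top_k]                               # 6. selection


def toy_score_fn(query: str, doc: str) -> str:
    """Hypothetical stand-in for a model call: rates by shared words, replies as text."""
    shared = len(set(query.lower().split()) & set(doc.lower().split()))
    return f"Relevance Score: {min(10, 1 + 3 * shared)}"
```

Because the scorer is passed in as `score_fn`, the same skeleton could wrap either scoring method from step 3; only the function that produces the raw score changes.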

In [1]:
import os
import sys
from typing import List
from concurrent.futures import ThreadPoolExecutor, as_completed

from dotenv import load_dotenv
load_dotenv()  # load API keys from .env before the LangChain imports below

import faiss
from tqdm import tqdm
from langchain_groq import ChatGroq
from langchain_openai import ChatOpenAI
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.document_loaders import WebBaseLoader, PyPDFLoader, TextLoader
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import Chroma, FAISS
from langchain_core.documents import Document
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from utility import encode_pdf, show_context, retrieve_context_per_question, replace_t_with_space


In [2]:
file_path="data/Understanding_Climate_Change.pdf"
vector_store = encode_pdf(file_path)


#### 1. LLM-based method to rerank the retrieved documents

In [4]:
from pydantic import BaseModel, Field

In [5]:
# Schema for the structured relevance score returned by the LLM
class RatingScore(BaseModel):
    relevance_score: float = Field(description="The relevance score of a document to a query")

def rerank_documents(query: str, docs: List[Document], top_n: int = 3) -> List[Document]:
    """Score each document against the query with an LLM and return the top_n highest-scoring ones."""
    prompt = PromptTemplate(
        input_variables=["query", "doc"],
        template="""On a scale of 1-10, rate the relevance of the following document to the query.
        Consider the specific context and intent of the query, not just keyword matches.
        Query: {query}
        Document: {doc}
        Relevance Score:"""
    )

    groq_api_key = os.getenv("GROQ_API_KEY")
    llm = ChatGroq(groq_api_key=groq_api_key, model_name="llama-3.1-8b-instant")

    # Structured output makes the model return a RatingScore instead of free text
    llm_chain = prompt | llm.with_structured_output(RatingScore)

    scored_docs = []
    for doc in docs:
        try:
            # Keep the call inside the try block: parsing can fail or yield None
            result = llm_chain.invoke({'query': query, 'doc': doc.page_content})
            score = float(result.relevance_score)
        except (AttributeError, TypeError, ValueError):
            # No usable score for this document, so rank it last
            score = 0.0
        scored_docs.append((doc, score))

    # Highest score first, then keep only the top_n documents
    reranked_docs = sorted(scored_docs, key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in reranked_docs[:top_n]]
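
The loop in `rerank_documents` makes one LLM call per document, one after another. Since each call is independent, the scoring step could be run concurrently with the `ThreadPoolExecutor` already imported in the first cell. The sketch below is an assumption about how that might look, not part of the original notebook: `score_in_parallel` and its `score_fn` parameter are hypothetical names, and `score_fn` stands in for whatever produces a numeric score for a query-document pair.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def score_in_parallel(
    query: str,
    docs: List[str],
    score_fn: Callable[[str, str], float],
    max_workers: int = 4,
    default: float = 0.0,
) -> List[Tuple[str, float]]:
    """Score every document concurrently and return (doc, score) pairs in input order."""

    def safe_score(doc: str) -> float:
        try:
            return float(score_fn(query, doc))
        except Exception:
            # A failed call should not abort the batch; fall back to the default score
            return default

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as docs
        scores = list(pool.map(safe_score, docs))
    return list(zip(docs, scores))
```

Threads suit this step because the work is dominated by waiting on network responses. A hosted model may enforce rate limits, so `max_workers` would likely need to stay small.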



In [None]:
retriever = vector_store.as_retriever(search_kwargs={'k':10})

In [6]:
query="What are the impacts of climate change on biodiversity?"
initial_docs = vector_store.similarity_search(query,k=10)
reranked_docs = rerank_documents(query,initial_docs)

# print first 3 initial documents
print("Top initial documents:")
for i, doc in enumerate(initial_docs[:3]):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document


# Print results
print(f"Query: {query}\n")
print("Top reranked documents:")
for i, doc in enumerate(reranked_docs):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document

Top initial documents:

Document 1:
Climate change is altering terrestrial ecosystems by shifting habitat ranges, changing species 
distributions, and impacting ecosystem functions. Forests, grasslands, and deserts are 
experiencing shi...

Document 2:
protection, and habitat creation. 
Climate-Resilient Conservation 
Conservation strategies must account for climate change impacts to be effective. This 
includes identifying climate refugia, areas le...

Document 3:
The economic costs of climate change include damage to infrastructure, reduced agricultural 
productivity, health care costs, and lost labor productivity. Extreme weather events, such as 
hurricanes a...
Query: What are the impacts of climate change on biodiversity?

Top reranked documents:

Document 1:
Climate change is altering terrestrial ecosystems by shifting habitat ranges, changing species 
distributions, and impacting ecosystem functions. Forests, grasslands, and deserts are 
experiencing shi...

Document 2:
Coral re

In [10]:
#Implement custom retriever to build the chain
from langchain_core.retrievers import BaseRetriever
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from typing import Any 

class CustomRetriever(BaseRetriever,BaseModel):
    """
    A custom retriever that returns the top-k most relevant documents after reranking.
    """
    vectorstore: Any = Field(description="Vector store used for initial retrieval")

    class Config:
        arbitrary_types_allowed = True

    def _get_relevant_documents(self, query: str, *, num_docs = 2,run_manager: CallbackManagerForRetrieverRun) -> list[Document]:
        initial_docs = self.vectorstore.similarity_search(query,k=10)
        return rerank_documents(query,initial_docs,top_n=num_docs)

custom_retriever = CustomRetriever(vectorstore=vector_store)

In [11]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
groq_api_key=os.getenv("GROQ_API_KEY")
llm=ChatGroq(groq_api_key=groq_api_key,model_name="llama-3.1-8b-instant")
system_prompt = """
    Use the given context to answer the question.
    If you don't know the answer, say you don't know.
    Use three sentences maximum and keep the answer concise.
    Context: {context}
"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("user", "{input}")
    ]
)

qa_chain = create_stuff_documents_chain(llm,prompt)
retriever_chain = create_retrieval_chain(custom_retriever,qa_chain)

In [12]:
query="What are the impacts of climate change on biodiversity?"
result = retriever_chain.invoke({'input':query})

In [13]:
result

{'input': 'What are the impacts of climate change on biodiversity?',
 'context': [Document(id='614539c3-c47f-4c85-9634-9947d9f75f7f', metadata={'producer': 'Microsoft® Word 2021', 'creator': 'Microsoft® Word 2021', 'creationdate': '2024-07-13T20:17:34+03:00', 'author': 'Nir', 'moddate': '2024-07-13T20:17:34+03:00', 'source': 'data/Understanding_Climate_Change.pdf', 'total_pages': 33, 'page': 4, 'page_label': '5'}, page_content='Coral reefs are highly sensitive to changes in temperature and acidity. Ocean acidification \nand warming waters contribute to coral bleaching and mortality, threatening biodiversity and \nfisheries. Protecting and restoring coral reefs is essential for marine conservation. \nMarine Ecosystems \nAcidification affects the health and survival of various marine species, disrupting food webs \nand ecosystems. This has implications for commercial fisheries and the livelihoods of people \nwho depend on the ocean. Efforts to reduce CO2 emissions and enhance marine prot