### Fusion Retrieval in Document Search

This code implements a Fusion Retrieval system that combines vector-based similarity search with keyword-based BM25 retrieval. The approach aims to leverage the strengths of both methods to improve the overall quality and relevance of document retrieval.

Traditional retrieval methods often rely on either semantic understanding (vector-based) or keyword matching (BM25). Each approach has its strengths and weaknesses. Fusion retrieval aims to combine these methods to create a more robust and accurate retrieval system that can handle a wider range of queries effectively.

In [1]:
import os
import sys
from dotenv import load_dotenv
load_dotenv()
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from utility import encode_pdf, show_context, retrieve_context_per_question
from langchain_core.output_parsers import StrOutputParser
from typing import List
from concurrent.futures import ThreadPoolExecutor, as_completed
from langchain_community.docstore.in_memory import InMemoryDocstore
from tqdm import tqdm
from langchain.vectorstores import Chroma, FAISS
import faiss
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from utility import replace_t_with_space
from langchain_experimental.text_splitter import SemanticChunker

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [2]:
file_path = "data/Understanding_Climate_Change.pdf"
def encode_pdf_split_documents(path,chunk_size=1000,chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using HuggingFace embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """
    #Load the Pdf file 
    loader=PyPDFLoader(path)
    docs=loader.load()

    #Split the documents into chunks
    splitter=RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )

    texts=splitter.split_documents(docs)
    cleaned_texts=replace_t_with_space(texts)

    #Embeddings 
    embeddings=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

    #Create vector store
    vectorstore=FAISS.from_documents(cleaned_texts,embeddings)

    return vectorstore,cleaned_texts

In [3]:
vectorstore, cleaned_texts = encode_pdf_split_documents(file_path)

  from .autonotebook import tqdm as notebook_tqdm


In [4]:
#Create BM25 index for retriving documents by keywords
from rank_bm25 import BM25Okapi

def create_bm25_index(documents :List[Document]) -> BM25Okapi:
    """
    Create a BM25 index from the given documents.

    BM25 (Best Matching 25) is a ranking function used in information retrieval.
    It's based on the probabilistic retrieval framework and is an improvement over TF-IDF.

    Args:
    documents (List[Document]): List of documents to index.

    Returns:
    BM25Okapi: An index that can be used for BM25 scoring.
    """
    tokenize_doc = [doc.page_content.split() for doc in documents]
    return BM25Okapi(tokenize_doc)

In [8]:
bm25 = create_bm25_index(cleaned_texts)

##### Define Function that retrieved both semantically and by keyword and normalize the scores and get the top k documents

In [5]:
import numpy as np

In [9]:
def fusion_retrieval(vectorstore,bm25,query:str,k:int = 5,alpha:float = 0.5) -> List[Document]:
    """
    Perform fusion retrieval combining keyword-based (BM25) and vector-based search.

    Args:
    vectorstore (VectorStore): The vectorstore containing the documents.
    bm25 (BM25Okapi): Pre-computed BM25 index.
    query (str): The query string.
    k (int): The number of documents to retrieve.
    alpha (float): The weight for vector search scores (1-alpha will be the weight for BM25 scores).

    Returns:
    List[Document]: The top k documents based on the combined scores.
    """

    epsilon = 1e-8
    #Step:1 Get all the documents from vectorstore
    all_docs = vectorstore.similarity_search("",k=vectorstore.index.ntotal)

    #Step:2 Perform BM25 search
    bm25_scores = bm25.get_scores(query.split())

    #Step:3 Perform vector search
    vector_results = vectorstore.similarity_search_with_score(query,k=len(all_docs))    

    #Step:4 Normalize scores
    vector_scores = np.array([score for _,score in vector_results])
    vector_scores = 1 - (vector_scores - np.min(vector_scores)) / (np.max(vector_scores) - np.min(vector_scores) + epsilon)

    bm25_scores = (bm25_scores - np.min(bm25_scores)) / (np.max(bm25_scores) -  np.min(bm25_scores) + epsilon)

    # Step 5: Combine scores
    combined_scores = alpha * vector_scores + (1 - alpha) * bm25_scores

    # Step 6: Rank documents
    sorted_indices = np.argsort(combined_scores)[::-1]
    
    # Step 7: Return top k documents
    return [all_docs[i] for i in sorted_indices[:k]]

In [10]:
## Test fusion retrival
query = "What are the impacts of climate change on the environment?"

# Perform fusion retrieval
top_docs = fusion_retrieval(vectorstore, bm25, query, k=5, alpha=0.5)
docs_content = [doc.page_content for doc in top_docs]
show_context(docs_content)

Context 1:
Journalists and media organizations play a key role in informing the public about climate 
change. Investigative reporting, in-depth analysis, and human-interest stories can highlight 
the urgency and impacts of climate change. Media coverage can also hold policymakers and 
businesses accountable. 
Public Engagement 
Public engagement initiatives, such as citizen science projects, forums, and dialogues, 
encourage active participation in climate action. These initiatives provide platforms for 
sharing knowledge, experiences, and ideas. Engaging the public fosters a sense of ownership 
and responsibility. 
Chapter 12: The Path Forward


Context 2:
Carbon Pricing 
Carbon pricing mechanisms, such as carbon taxes and cap-and-trade systems, incentivize 
emission reductions by assigning a cost to carbon emissions. These policies encourage 
businesses and individuals to reduce their carbon footprints and invest in cleaner 
technologies. 
Renewable Energy Targets 
Many countries hav