In [1]:
!pip install llama_hub
!pip install llama_index
!pip install packaging==23.2
!pip install torch sentence-transformers
!pip install trafilatura
!pip install llama-index-readers-web
!pip install llama-index-embeddings-huggingface
!pip install langchain
!pip install faiss-gpu
!pip install langchain_openai
!pip install wikipedia
!pip install llama-index-llms-langchain



In [2]:
# Helper function for printing docs

def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

In [3]:
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

import warnings
warnings.filterwarnings("ignore")

documents = PyPDFLoader("/content/RAG for Large Language Models.pdf").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
retriever = FAISS.from_documents(texts, HuggingFaceEmbeddings()).as_retriever()

docs = retriever.get_relevant_documents(
    "What is the difference between naive RAG and advanced RAG"
)

pretty_print_docs(docs)

Document 1:

4
Fig. 3. Comparison between the three paradigms of RAG. (Left) Naive RAG mainly consists of three parts: indexing, retrieval and generation. (Middle)
Advanced RAG proposes multiple optimization strategies around pre-retrieval and post-retrieval, with a process similar to the Naive RAG, still following a
chain-like structure. (Right) Modular RAG inherits and develops from the previous paradigm, showcasing greater flexibility overall. This is evident in the
introduction of multiple specific functional modules and the replacement of existing modules. The overall process is not limited to sequential retrieval and
generation; it includes methods such as iterative and adaptive retrieval.
Pre-retrieval process . In this stage, the primary focus is
on optimizing the indexing structure and the original query.
The goal of optimizing indexing is to enhance the quality of
the content being indexed. This involves strategies: enhancing
data granularity, optimizing index structures, add

In [4]:
import os

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import OpenAI

# Read the API key from the environment rather than hardcoding it in the notebook.
llm = OpenAI(temperature=0, api_key=os.environ["OPENAI_API_KEY"])
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.get_relevant_documents(
    "What is the difference between naive RAG and advanced RAG"
)
pretty_print_docs(compressed_docs)

Document 1:

Advanced RAG proposes multiple optimization strategies around pre-retrieval and post-retrieval, with a process similar to the Naive RAG, still following a chain-like structure.
----------------------------------------------------------------------------------------------------
Document 2:

- RAG methods
- Difference between naive RAG and advanced RAG
----------------------------------------------------------------------------------------------------
Document 3:

The specific metrics for each evaluation aspect are sum-
marized in Table III. It is essential to recognize that these
metrics, derived from related work, are traditional measures
and do not yet represent a mature or standardized approach for
quantifying RAG evaluation aspects. Custom metrics tailored
to the nuances of RAG models, though not included here, have
also been developed in some evaluation studies.
D. Evaluation Benchmarks and Tools
A series of benchmark tests and tools have been proposed
to facilitate th

In [5]:
original_contexts_len = len("\n\n".join(d.page_content for d in docs))
compressed_contexts_len = len("\n\n".join(d.page_content for d in compressed_docs))

print("Original context length:", original_contexts_len)
print("Compressed context length:", compressed_contexts_len)
print("Compressed Ratio:", f"{original_contexts_len/(compressed_contexts_len + 1e-5):.2f}x")

Original context length: 20497
Compressed context length: 1361
Compressed Ratio: 15.06x
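
The character-length comparison above is repeated after each compressor in this notebook. A minimal sketch of a reusable helper follows; `compression_ratio` is a hypothetical name, not part of LangChain, and it takes plain strings (for example the `page_content` values) so it can be tried without any retriever.

```python
def compression_ratio(original_texts, compressed_texts, sep="\n\n"):
    """Return (original_len, compressed_len, ratio), measured in characters."""
    original_len = len(sep.join(original_texts))
    compressed_len = len(sep.join(compressed_texts))
    # Guard against an empty compressed context instead of adding a small epsilon.
    ratio = original_len / compressed_len if compressed_len else float("inf")
    return original_len, compressed_len, ratio


# Toy strings standing in for d.page_content values (illustrative data only).
orig_len, comp_len, ratio = compression_ratio(["a" * 60, "b" * 38], ["a" * 20])
print(f"{orig_len} -> {comp_len} characters ({ratio:.2f}x)")
```

With real results it could be called as `compression_ratio([d.page_content for d in docs], [d.page_content for d in compressed_docs])`.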


In [6]:
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

embeddings = HuggingFaceBgeEmbeddings() # could be any embedding of your choice
embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter, base_retriever=retriever
)

compressed_docs = compression_retriever.get_relevant_documents(
    "What is the difference between naive RAG and advanced RAG"
)
pretty_print_docs(compressed_docs)

Document 1:

4
Fig. 3. Comparison between the three paradigms of RAG. (Left) Naive RAG mainly consists of three parts: indexing, retrieval and generation. (Middle)
Advanced RAG proposes multiple optimization strategies around pre-retrieval and post-retrieval, with a process similar to the Naive RAG, still following a
chain-like structure. (Right) Modular RAG inherits and develops from the previous paradigm, showcasing greater flexibility overall. This is evident in the
introduction of multiple specific functional modules and the replacement of existing modules. The overall process is not limited to sequential retrieval and
generation; it includes methods such as iterative and adaptive retrieval.
Pre-retrieval process . In this stage, the primary focus is
on optimizing the indexing structure and the original query.
The goal of optimizing indexing is to enhance the quality of
the content being indexed. This involves strategies: enhancing
data granularity, optimizing index structures, add

In [7]:
original_contexts_len = len("\n\n".join(d.page_content for d in docs))
compressed_contexts_len = len("\n\n".join(d.page_content for d in compressed_docs))

print("Original context length:", original_contexts_len)
print("Compressed context length:", compressed_contexts_len)
print("Compressed Ratio:", f"{original_contexts_len/(compressed_contexts_len + 1e-5):.2f}x")

Original context length: 20497
Compressed context length: 20497
Compressed Ratio: 1.00x
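
The 1.00x ratio above is consistent with every retrieved chunk scoring at or above the 0.76 similarity threshold, so the filter dropped nothing. The sketch below illustrates the general idea of threshold filtering on cosine similarity with toy 2-D vectors. It is an assumption-laden illustration, not LangChain's `EmbeddingsFilter` implementation; `cosine_similarity` and `filter_by_similarity` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filter_by_similarity(query_vec, doc_vecs, threshold):
    """Return indices of documents whose similarity to the query meets the threshold."""
    return [
        i for i, vec in enumerate(doc_vecs)
        if cosine_similarity(query_vec, vec) >= threshold
    ]


# Toy embeddings (illustrative only): two documents close to the query, one orthogonal.
query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.1], [0.8, 0.6], [0.0, 1.0]]

kept_loose = filter_by_similarity(query_vec, doc_vecs, threshold=0.76)
kept_strict = filter_by_similarity(query_vec, doc_vecs, threshold=0.9)
print(kept_loose, kept_strict)
```

Raising the threshold is one way to make such a filter actually discard chunks; the right value depends on the embedding model.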


In [8]:
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain_community.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

embeddings = HuggingFaceBgeEmbeddings()
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0, separator=". ")
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
relevant_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
pipeline_compressor = DocumentCompressorPipeline(
    transformers=[splitter, redundant_filter, relevant_filter]
)

from langchain.retrievers import ContextualCompressionRetriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline_compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.get_relevant_documents(
    "What is the difference between naive RAG and advanced RAG"
)
pretty_print_docs(compressed_docs)



Document 1:

(Middle)
Advanced RAG proposes multiple optimization strategies around pre-retrieval and post-retrieval, with a process similar to the Naive RAG, still following a
chain-like structure. (Right) Modular RAG inherits and develops from the previous paradigm, showcasing greater flexibility overall
----------------------------------------------------------------------------------------------------
Document 2:

4
Fig. 3. Comparison between the three paradigms of RAG. (Left) Naive RAG mainly consists of three parts: indexing, retrieval and generation
----------------------------------------------------------------------------------------------------
Document 3:

Modular RAG
The modular RAG architecture advances beyond the for-
mer two RAG paradigms, offering enhanced adaptability and
versatility
----------------------------------------------------------------------------------------------------
Document 4:

Despite
its distinctiveness, Modular RAG builds upon the foundational
pri

In [9]:
original_contexts_len = len("\n\n".join(d.page_content for d in docs))
compressed_contexts_len = len("\n\n".join(d.page_content for d in compressed_docs))

print("Original context length:", original_contexts_len)
print("Compressed context length:", compressed_contexts_len)
print("Compressed Ratio:", f"{original_contexts_len/(compressed_contexts_len + 1e-5):.2f}x")

Original context length: 20497
Compressed context length: 8770
Compressed Ratio: 2.34x
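
The pipeline above splits chunks, removes near-duplicates, and then filters by relevance. As a rough illustration of the middle step, here is a greedy redundancy filter over toy vectors. This is a simplified variant written for this notebook and not LangChain's `EmbeddingsRedundantFilter` code; `drop_redundant` and the 0.95 threshold are assumptions chosen for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def drop_redundant(vectors, threshold=0.95):
    """Keep a vector only if it is not too similar to any vector already kept."""
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine_similarity(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings (illustrative only): the second vector nearly duplicates the first.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
unique_indices = drop_redundant(vectors)
print(unique_indices)
```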


In [10]:
! pip install llama-index-embeddings-langchain



In [11]:
from llama_index.core import (
    VectorStoreIndex,
    download_loader,
    load_index_from_storage,
    StorageContext,
)
WikipediaReader = download_loader("WikipediaReader")
loader = WikipediaReader()
documents = loader.load_data(pages=['Mexican–American_War'])


from llama_index.core import ServiceContext
from langchain_community.embeddings import HuggingFaceEmbeddings

service_context = ServiceContext.from_defaults(
    embed_model=HuggingFaceEmbeddings(),
    llm=llm
)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

retriever = index.as_retriever(similarity_top_k=3)

question = "What were the main outcomes of the war"
contexts = retriever.retrieve(question)

context_list = [n.get_content() for n in contexts]
context_list

['=== Effect on the United States ===\nIn much of the United States, victory and the acquisition of new land brought a surge of patriotism. Victory seemed to fulfill Democrats\' belief in their country\'s Manifest Destiny. Although the Whigs had opposed the war, they made Zachary Taylor their presidential candidate in the election of 1848, praising his military performance while muting their criticism of the war.\n\nHas the Mexican War terminated yet, and how? Are we beaten? Do you know of any nation about to besiege South Hadley [Massachusetts]? If so, do inform me of it, for I would be glad of a chance to escape, if we are to be stormed. I suppose [our teacher] Miss [Mary] Lyon [founder of Mount Holyoke College] would furnish us all with daggers and order us to fight for our lives ...\nA month before the end of the war, Polk was criticized in a United States House of Representatives amendment to a bill praising Taylor for "a war unnecessarily and unconstitutionally begun by the Presi

In [12]:
! pip install llama-index-postprocessor-longllmlingua



In [13]:
from llama_index.postprocessor.longllmlingua import LongLLMLinguaPostprocessor

In [14]:
! pip install llmlingua
! pip install accelerate



In [72]:
import os

import openai
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Read the API key from the environment rather than hardcoding it in the notebook.
openai.api_key = os.environ["OPENAI_API_KEY"]

reader = SimpleDirectoryReader(input_files=['/content/RAG for Large Language Models.pdf'])

# Load the PDF once, file by file. Note that each page's text is upper-cased before indexing.
docs = []
for file_docs in reader.iter_data():
    for d in file_docs:
        d.text = d.text.upper()
        docs.append(d)

index = VectorStoreIndex.from_documents(docs)
retriever = index.as_retriever(similarity_top_k=3)

In [73]:
question = "What is the difference between naive RAG and advanced RAG"
retrieved_nodes = retriever.retrieve(question)
original_contexts = "\n\n".join([n.get_content() for n in retrieved_nodes])

In [74]:
# Setup LLMLingua
from llama_index.postprocessor.longllmlingua import LongLLMLinguaPostprocessor

node_postprocessor = LongLLMLinguaPostprocessor(
    device_map='cpu',
    instruction_str="Given the context, please answer the final question",
    target_token=300,
    rank_method="longllmlingua",
    additional_compress_kwargs={
        "condition_compare": True,
        "condition_in_question": "after",
        "context_budget": "+100",
        "reorder_context": "sort",  # enable document reorder,
        "dynamic_context_compression_ratio": 0.3,
    },
)


In [75]:
from llama_index.core.indices.query.schema import QueryBundle

new_retrieved_nodes = node_postprocessor.postprocess_nodes(
    retrieved_nodes, query_bundle=QueryBundle(query_str=question)
)
compressed_contexts = "\n\n".join([n.get_content() for n in new_retrieved_nodes])


In [76]:
import os

from langchain.prompts import ChatPromptTemplate
from langchain.schema import StrOutputParser
from langchain_openai import ChatOpenAI

def get_response(context_str, query_str, model="gpt-4-1106-preview"):
    # gpt-4-1106-preview is a chat model, so use the chat wrapper. The API key is
    # read from the environment instead of being hardcoded.
    llm = ChatOpenAI(model=model, api_key=os.environ["OPENAI_API_KEY"])

    template = (
        "Given the provided context information below: \n"
        "---------------------\n"
        "{context_str}"
        "\n---------------------\n"
        "please answer the question: {query_str}\n"
    )
    qa_template = ChatPromptTemplate.from_template(template)

    # Pipe the prompt into the model and parse the reply into a plain string.
    # The input dict supplies both template variables directly.
    chain = qa_template | llm | StrOutputParser()
    return chain.invoke({"context_str": context_str, "query_str": query_str})
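
To check what text the model receives, the template used in `get_response` can be rendered locally without calling the API. The sketch below uses plain `str.format` rather than LangChain's prompt classes; `render_prompt` and the sample strings are hypothetical and only for illustration.

```python
# Same template text as in get_response, rendered with plain string formatting.
TEMPLATE = (
    "Given the provided context information below: \n"
    "---------------------\n"
    "{context_str}"
    "\n---------------------\n"
    "please answer the question: {query_str}\n"
)


def render_prompt(context_str, query_str):
    """Fill both placeholders and return the final prompt text."""
    return TEMPLATE.format(context_str=context_str, query_str=query_str)


# Illustrative inputs, not taken from the indexed PDF.
prompt_text = render_prompt(
    "Naive RAG: indexing, retrieval, generation.",
    "What is naive RAG?",
)
print(prompt_text)
```

Rendering both the original and the compressed contexts this way makes it easy to compare the two prompts side by side before spending tokens.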

In [77]:
print('Naive RAG Response: ')
response1 = get_response(context_str=original_contexts, query_str = question)
print(response1)
print('='*1000)
print('\n')
print('Compressed RAG Response: ')
response2 = get_response(context_str=compressed_contexts, query_str = question)
print(response2)

Naive RAG Response: 
---------------------
The difference between naive RAG and advanced RAG lies in their approach to optimizing the retrieval process. Naive RAG consists of three main parts: indexing, retrieval, and generation. It follows a chain-like structure without any specific optimization strategies. Advanced RAG, on the other hand, proposes multiple optimization strategies around pre-retrieval and post-retrieval processes. For example, in the pre-retrieval process, advanced RAG focuses on optimizing the indexing structure and the original query to enhance the quality of the content being indexed and to make the user's original question clearer and more suitable for retrieval. This includes enhancing data granularity, optimizing index structures, adding metadata, alignment optimization, and mixed retrieval. In the post-retrieval process, it includes methods such as reranking chunks and context compressing to integrate the retrieved context effectively with the query. Advanced R

In [78]:
original_tokens = node_postprocessor._llm_lingua.get_token_length(original_contexts)
compressed_tokens = node_postprocessor._llm_lingua.get_token_length(compressed_contexts)

print("Original Tokens:", original_tokens)
print("Compressed Tokens:", compressed_tokens)
print("Compressed Ratio:", f"{original_tokens/(compressed_tokens + 1e-5):.2f}x")

Original Tokens: 2354
Compressed Tokens: 326
Compressed Ratio: 7.22x
