# **Build a RAG System on “Leave No Context Behind” Paper**

In [1]:
# ! pip install pypdf
# ! pip install langchain_google_genai

In [2]:
from langchain_google_genai import ChatGoogleGenerativeAI

# Set up the API key (strip the trailing newline that text files usually end with)
with open('gemini_api_key.txt') as f:
    GOOGLE_API_KEY = f.read().strip()

chat_model = ChatGoogleGenerativeAI(google_api_key=GOOGLE_API_KEY, model="gemini-1.5-pro-latest")
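
Keeping the key in a plain-text file works for a local experiment, but reading it from an environment variable first is a common alternative. The following is a minimal sketch, not part of the original notebook: the helper name `load_api_key` and the variable name `GOOGLE_API_KEY` are assumptions chosen for illustration.

```python
import os
from pathlib import Path


def load_api_key(env_var="GOOGLE_API_KEY", fallback_file="gemini_api_key.txt"):
    """Return the API key from the environment, else from a local file.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    path = Path(fallback_file)
    if path.is_file():
        # strip() removes the trailing newline most editors add
        return path.read_text().strip()
    raise RuntimeError(f"No API key found in ${env_var} or {fallback_file}")
```

The returned string could then be passed as `google_api_key=` in place of the value read directly from the file.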

# Loading the Document

In [4]:
# Load a document

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("Leave_No_Context_Behind.pdf")

data = loader.load_and_split()

data[:5]

[Document(page_content='Preprint. Under review.\nLeave No Context Behind:\nEfficient Infinite Context Transformers with Infini-attention\nTsendsuren Munkhdalai, Manaal Faruqui and Siddharth Gopal\nGoogle\ntsendsuren@google.com\nAbstract\nThis work introduces an efficient method to scale Transformer-based Large\nLanguage Models (LLMs) to infinitely long inputs with bounded memory\nand computation. A key component in our proposed approach is a new at-\ntention technique dubbed Infini-attention. The Infini-attention incorporates\na compressive memory into the vanilla attention mechanism and builds\nin both masked local attention and long-term linear attention mechanisms\nin a single Transformer block. We demonstrate the effectiveness of our\napproach on long-context language modeling benchmarks, 1M sequence\nlength passkey context block retrieval and 500K length book summarization\ntasks with 1B and 8B LLMs. Our approach introduces minimal bounded\nmemory parameters and enables fast strea

# Splitting the document into chunks

In [6]:
# Split the document into sentence-based chunks
from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=500, chunk_overlap=100)

chunks = text_splitter.split_documents(data)

print(len(chunks))

Created a chunk of size 568, which is longer than the specified 500
Created a chunk of size 506, which is longer than the specified 500
Created a chunk of size 633, which is longer than the specified 500


110
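
The warnings above are consistent with a sentence-based splitter: sentences are kept whole and packed into chunks, so a single sentence longer than `chunk_size` ends up as an oversized chunk. The sketch below illustrates that behaviour under simplifying assumptions: a regular expression stands in for NLTK's sentence tokenizer, chunk overlap is ignored, and `merge_sentences` is a hypothetical name rather than the library's implementation.

```python
import re


def merge_sentences(text, chunk_size=500):
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    Simplified illustration: a regex stands in for NLTK's sentence tokenizer,
    and chunk overlap is ignored.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= chunk_size or not current:
            # Either it fits, or the chunk is empty and the sentence must be
            # kept whole even though it is longer than chunk_size.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


sample = "Short one. " + "x" * 40 + ". Another short one."
for chunk in merge_sentences(sample, chunk_size=30):
    print(len(chunk), repr(chunk[:20]))
```

With `chunk_size=30`, the 41-character middle sentence cannot be split and is emitted as its own oversized chunk, which mirrors the "longer than the specified 500" messages.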


# Creating Embeddings for the Chunks

In [8]:
# Create the embedding model
# We load GoogleGenerativeAIEmbeddings here; the chunks are embedded when they are added to the vector store

from langchain_google_genai import GoogleGenerativeAIEmbeddings

embedding_model = GoogleGenerativeAIEmbeddings(google_api_key=GOOGLE_API_KEY, model="models/embedding-001")

# vectors = embedding_model.embed_documents([chunk.page_content for chunk in chunks])

# Storing the chunks in vector store

In [9]:
# Store the chunks in vector store
from langchain_community.vectorstores import Chroma

# Embed each chunk and load it into the vector store
db = Chroma.from_documents(chunks, embedding_model, persist_directory="./chroma_db_rag")

# Persist the database to disk (recent Chroma versions do this automatically when persist_directory is set)
db.persist()

In [10]:
# Open a connection to the persisted Chroma database
connection = Chroma(persist_directory="./chroma_db_rag", embedding_function=embedding_model)

# Setting up the Vector Store as a Retriever

In [11]:
# Convert the Chroma connection into a retriever that returns the 5 most similar chunks
retriever = connection.as_retriever(search_kwargs={"k": 5})

print(type(retriever))

<class 'langchain_core.vectorstores.VectorStoreRetriever'>
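
To make the `k=5` setting concrete, the following sketch shows the general idea of top-k retrieval: embed the query, score it against every stored vector, and keep the best `k`. It is an illustration only. The real vector store uses its own index and distance metric, and the function names and toy vectors here are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D "embeddings"; real embeddings have hundreds of dimensions.
toy_docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
print(top_k([1.0, 0.1], toy_docs, k=2))
```

In this toy example the query points almost along the first axis, so the first two document vectors rank highest.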


Now let’s write the actual application logic. We want to create a simple application that takes a user question, searches for documents relevant to that question, passes the retrieved documents and initial question to a model, and returns an answer.

# Retrieving the context based on the user's query

### Query - 1

In [12]:
user_query = "What is Infini-attention?"

In [14]:
retrieved_docs = retriever.invoke(user_query)

In [15]:
len(retrieved_docs)

5

In [16]:
print(retrieved_docs[0].page_content)

2.1 Infini-attention
As shown Figure 1, our Infini-attention computes both local and global context states and
combine them for its output.

Similar to multi-head attention (MHA), it maintains Hnumber
2


### Query - 2

In [17]:
user_query = "Tell me about LLMs?"

In [18]:
retrieved_docs = retriever.invoke(user_query)

In [19]:
len(retrieved_docs)

5

In [20]:
print(retrieved_docs[0].page_content)

However, the LLMs in their current state
have yet to see an effective, practical compres-
sive memory technique that balances simplicity along with quality.

1arXiv:2404.07143v1  [cs.CL]  10 Apr 2024


# Passing the context and question to the LLM

In [21]:
# Import only the message and prompt classes used below
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate

chat_template = ChatPromptTemplate.from_messages([
    # System Message Prompt Template
    SystemMessage(content="""You are a Helpful AI Bot. 
    You take the context and question from user. Your answer should be based on the specific context."""),
    # Human Message Prompt Template
    HumanMessagePromptTemplate.from_template("""Answer the question based on the given context.
    Context:
    {context}
    Question: 
    {question}
    
    Answer: """)
])
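
Conceptually, the template substitutes the retrieved text and the user's question into the `{context}` and `{question}` placeholders before the messages are sent to the model. A plain-Python sketch of that substitution follows, using `str.format` as a stand-in for the LangChain template; the names `HUMAN_TEMPLATE` and `fill_prompt` are hypothetical.

```python
HUMAN_TEMPLATE = """Answer the question based on the given context.
Context:
{context}
Question:
{question}

Answer: """


def fill_prompt(context, question):
    """Substitute the two placeholders, as the chat template does at run time."""
    return HUMAN_TEMPLATE.format(context=context, question=question)


print(fill_prompt("Infini-attention adds a compressive memory.",
                  "What is Infini-attention?"))
```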

In [22]:
from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()

In [23]:
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | chat_template
    | chat_model
    | output_parser
)
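
The pipe syntax above can be read as ordinary function calls applied in sequence: build the input dictionary, fill the prompt, call the model, return text. The sketch below mimics that flow with stand-in functions so it runs without an API key. Everything in it (`fake_retriever`, `fake_model`, `run_chain`, the fixed passages) is an assumption for illustration and not how LangChain is implemented internally.

```python
def fake_retriever(question):
    # Stand-in for the Chroma retriever: returns a fixed list of passages.
    return ["Passage A about attention.", "Passage B about memory."]


def format_passages(passages):
    # Same joining rule as format_docs, applied to plain strings.
    return "\n\n".join(passages)


def fake_model(prompt):
    # Stand-in for the chat model: reports the prompt length instead of answering.
    return f"(model saw {len(prompt)} characters)"


def run_chain(question):
    # Step 1: build the input dict, like the {"context": ..., "question": ...} map.
    inputs = {
        "context": format_passages(fake_retriever(question)),
        "question": question,
    }
    # Step 2: fill the prompt. Step 3: call the model. Step 4: return plain text.
    prompt = "Context:\n{context}\nQuestion:\n{question}\nAnswer: ".format(**inputs)
    return fake_model(prompt)


print(run_chain("What is Infini-attention?"))
```

Note that the question is used twice: once as the search query for the retriever and once, unchanged, as the `question` field of the prompt, which is the role `RunnablePassthrough()` plays in the chain.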

## Query - 1

In [29]:
from IPython.display import Markdown as markdown
response = rag_chain.invoke("What is LLMs?")

markdown(response)

## LLMs Explained

Based on the context you provided, **LLMs stands for Large Language Models**. These are complex AI systems trained on massive amounts of text data, enabling them to understand and generate human-like text in response to a wide range of prompts and questions. 

The context specifically discusses the challenge of incorporating efficient memory techniques into LLMs. While LLMs excel at processing and generating text, their ability to retain and utilize past information (memory) efficiently is an ongoing area of research. 


## Query - 2

In [30]:
response = rag_chain.invoke("Explain about LLM Pre-training?")

markdown(response)

## LLM Pre-training for Long-Context Adaptation:

The provided text describes a method for adapting existing Large Language Models (LLMs) to handle longer context lengths through a process called **continual pre-training**. This is necessary because LLMs often struggle with processing and understanding information that spans long sequences of text. 

Here's a breakdown of the key points:

**Challenges with Long Contexts:**

* Standard LLMs like transformers have limitations in handling long sequences due to the computational complexity of attention mechanisms.
* Efficiently processing and remembering information from earlier parts of a long text is crucial for maintaining context and understanding.

**Solutions Explored:**

* **Extending Attention Mechanisms:** Some approaches modify the attention layers in LLMs to better handle long-range dependencies within text sequences (Xiong et al., 2023; Fu et al., 2024).
* **Compressed Input Representations:** Techniques are developed to summarize past segments of text, creating a compressed representation that the LLM can easily access and utilize (Rae et al., 2019; Chevalier et al., 2023).
* **Transformer-based Compression:** Recent methods employ another transformer model to compress the input sequence, enabling efficient processing of long contexts (Bulatov et al., 2022; Chevalier et al., 2023; Ge et al., 2023; Mu et al., 2024). 

**LLM Continual Pre-training:**

* This approach involves further training an existing LLM on a dataset specifically designed for long sequences. 
* In the context provided, the pre-training data includes PG19, Arxiv-math corpus, and C4 text, all containing sequences longer than 4K tokens.
* The LLM is trained with these long sequences, allowing it to adapt and improve its ability to handle long-range dependencies and context.
* One specific example mentioned is replacing the standard attention mechanism in a 1B parameter LLM with "Infini-attention" and then pre-training on 4K token long inputs.

**Current State and Future Needs:**

While progress has been made, the text highlights the need for more effective and practical compressive memory techniques for LLMs. These techniques should balance simplicity with the ability to maintain the quality of understanding and context retention. 
