# **Build a RAG System on the “Leave No Context Behind” Paper**

**Steps**
1. Load the document
2. Split the document into chunks
3. Create embeddings for the chunks
4. Store the chunks in a vector store
5. Set up the vector store as a retriever
6. Retrieve the context based on the user's query
7. Pass the context and question to the LLM

In [None]:
# ! pip install pypdf

In [1]:
from langchain_google_genai import ChatGoogleGenerativeAI

# Set up the API key (close the file after reading and strip any trailing newline)
with open('keys/.gemini_API_key.txt') as f:
    GOOGLE_API_KEY = f.read().strip()

chat_model = ChatGoogleGenerativeAI(google_api_key=GOOGLE_API_KEY, model="gemini-1.5-pro-latest")

In [2]:
# Load a document

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/Leave_No_Context_Behind_Paper.pdf")

data = loader.load_and_split()

data

[Document(page_content='Preprint. Under review.\nLeave No Context Behind:\nEfficient Infinite Context Transformers with Infini-attention\nTsendsuren Munkhdalai, Manaal Faruqui and Siddharth Gopal\nGoogle\ntsendsuren@google.com\nAbstract\nThis work introduces an efficient method to scale Transformer-based Large\nLanguage Models (LLMs) to infinitely long inputs with bounded memory\nand computation. A key component in our proposed approach is a new at-\ntention technique dubbed Infini-attention. The Infini-attention incorporates\na compressive memory into the vanilla attention mechanism and builds\nin both masked local attention and long-term linear attention mechanisms\nin a single Transformer block. We demonstrate the effectiveness of our\napproach on long-context language modeling benchmarks, 1M sequence\nlength passkey context block retrieval and 500K length book summarization\ntasks with 1B and 8B LLMs. Our approach introduces minimal bounded\nmemory parameters and enables fast strea

In [3]:
# Split the document into chunks
from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=500, chunk_overlap=100)

chunks = text_splitter.split_documents(data)

print(len(chunks))

print(type(chunks[0]))

Created a chunk of size 568, which is longer than the specified 500
Created a chunk of size 506, which is longer than the specified 500
Created a chunk of size 633, which is longer than the specified 500


110
<class 'langchain_core.documents.base.Document'>
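
The "Created a chunk of size ..." warnings most likely appear because a sentence-based splitter never cuts inside a sentence: when a single sentence is longer than `chunk_size`, it becomes an oversized chunk on its own. The sketch below illustrates that behaviour in plain Python. It is not the `NLTKTextSplitter` implementation; the regex sentence splitter and the `pack_sentences` helper are assumptions made for illustration, and overlap handling is left out.

```python
import re


def split_sentences(text):
    # Naive sentence splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def pack_sentences(sentences, chunk_size, separator=" "):
    """Greedily pack whole sentences into chunks of at most chunk_size characters."""
    chunks, current = [], []
    for sentence in sentences:
        candidate = separator.join(current + [sentence])
        if current and len(candidate) > chunk_size:
            # The next sentence does not fit: close the current chunk and start a new one.
            chunks.append(separator.join(current))
            current = [sentence]
        else:
            current.append(sentence)
    if current:
        chunks.append(separator.join(current))
    return chunks


text = "Short one. " + "A very long sentence " * 5 + "ends here. Tail."
chunks = pack_sentences(split_sentences(text), chunk_size=40)

# A sentence longer than chunk_size cannot be split, so it exceeds the limit.
oversized = [c for c in chunks if len(c) > 40]
print(len(chunks), [len(c) for c in chunks])
```

If strict size limits matter, one option is to pass oversized chunks through a character-based splitter afterwards.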


In [4]:
# Create the chunk embeddings
# This only instantiates the Google Generative AI embedding model; nothing is embedded yet

from langchain_google_genai import GoogleGenerativeAIEmbeddings

embedding_model = GoogleGenerativeAIEmbeddings(google_api_key=GOOGLE_API_KEY, model="models/embedding-001")

# vectors = embedding_model.embed_documents([chunk.page_content for chunk in chunks])
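
To build intuition for what an embedding model provides, here is a toy stand-in that maps text to a fixed-length vector and compares vectors with cosine similarity. It is a bag-of-words sketch, not how `models/embedding-001` works; the helpers `tokenize`, `build_vocab`, `toy_embed`, and `cosine` are hypothetical names introduced for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(texts):
    """Assign every distinct token a position in the vector."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def toy_embed(text, vocab):
    """Count tokens per vocabulary slot and scale the vector to unit length."""
    vec = [0.0] * len(vocab)
    for token, count in Counter(tokenize(text)).items():
        if token in vocab:
            vec[vocab[token]] = float(count)
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a, b):
    # Both vectors are unit length, so the dot product equals the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


texts = [
    "compressive memory attention",
    "Infini-attention adds a compressive memory to attention",
    "The weather was sunny yesterday",
]
vocab = build_vocab(texts)
query, related, unrelated = (toy_embed(t, vocab) for t in texts)

sim_related = cosine(query, related)
sim_unrelated = cosine(query, unrelated)
print(round(sim_related, 3), sim_unrelated)
```

A learned embedding model goes further than this word-overlap toy: it can place texts close together even when they share no words.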

In [5]:
# Store the chunks in vector store
from langchain_community.vectorstores import Chroma

# Embed each chunk and load it into the vector store
db = Chroma.from_documents(chunks, embedding_model, persist_directory="./chroma_db_")

# Persist the database to disk (recent Chroma versions may do this automatically, making the call redundant)
db.persist()

In [6]:
# Reconnect to the persisted Chroma database
db_connection = Chroma(persist_directory="./chroma_db_", embedding_function=embedding_model)

In [7]:
# Convert the Chroma connection into a retriever that returns the 5 most similar chunks
retriever = db_connection.as_retriever(search_kwargs={"k": 5})

print(type(retriever))

<class 'langchain_core.vectorstores.VectorStoreRetriever'>
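
The `k` in `search_kwargs={"k": 5}` is the number of chunks returned per query. Conceptually, the retriever scores every stored vector against the query vector and keeps the best `k`. The brute-force sketch below shows that idea with made-up two-dimensional vectors; `top_k` is a hypothetical helper, and a real vector store such as Chroma may use an approximate index and a different distance metric.

```python
import heapq


def top_k(query_vec, doc_vecs, k):
    """Return (score, index) pairs for the k vectors most similar to query_vec."""
    scored = (
        (sum(q * d for q, d in zip(query_vec, vec)), index)
        for index, vec in enumerate(doc_vecs)
    )
    # nlargest keeps only the k best pairs, ordered from highest to lowest score.
    return heapq.nlargest(k, scored)


# Made-up unit vectors standing in for embedded chunks.
doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.8, 0.6]]

hits = top_k([1.0, 0.0], doc_vecs, k=2)
indices = [index for _, index in hits]
print(indices)
```

A larger `k` gives the model more context to draw on, at the cost of a longer prompt and a higher chance of including irrelevant chunks.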


Now let’s write the actual application logic. We want to create a simple application that takes a user question, searches for documents relevant to that question, passes the retrieved documents and initial question to a model, and returns an answer.
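
Before wiring up the real components, the flow just described can be sketched with stand-in functions. Everything here is hypothetical: `answer_question`, `fake_retrieve`, and `fake_generate` are illustration-only names, and the stubs replace the retriever and the LLM so the sketch runs without a vector store or an API key.

```python
def answer_question(question, retrieve, generate):
    """Retrieve context for a question, build a prompt, and ask the model."""
    docs = retrieve(question)        # 1. search for relevant text
    context = "\n\n".join(docs)      # 2. merge the hits into one context string
    prompt = f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
    return generate(prompt)          # 3. let the model answer


# Hypothetical stand-ins for the retriever and the chat model.
def fake_retrieve(question):
    return [
        "Infini-attention adds a compressive memory.",
        "It combines local and long-term attention.",
    ]


def fake_generate(prompt):
    # Echo the first context line instead of calling a real LLM.
    return "STUB: " + prompt.splitlines()[1]


answer = answer_question("What is Infini-attention?", fake_retrieve, fake_generate)
print(answer)
```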

In [8]:
user_input = "Can you tell me about the Infini-attention?"

In [9]:
retrieved_docs = retriever.invoke(user_input)

In [10]:
len(retrieved_docs)

5

In [11]:
print(retrieved_docs[0].page_content)

2.1 Infini-attention
As shown Figure 1, our Infini-attention computes both local and global context states and
combine them for its output.

Similar to multi-head attention (MHA), it maintains Hnumber
2


In [12]:
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate

chat_template = ChatPromptTemplate.from_messages([
    # System Message Prompt Template
    SystemMessage(content="""You are a helpful AI bot.
    You receive a context and a question from the user. Base your answer on the given context."""),
    # Human Message Prompt Template
    HumanMessagePromptTemplate.from_template("""Answer the question based on the given context.
    Context:
    {context}
    
    Question: 
    {question}
    
    Answer: """)
])

In [13]:
from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()

In [14]:
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | chat_template
    | chat_model
    | output_parser
)
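
The `|` syntax can look opaque at first: each step's output becomes the next step's input, and the leading dict sends the same question to every branch. The toy classes below imitate that behaviour in plain Python. They are not LangChain's implementation; `Step` and `fan_out` are hypothetical names, and the explicit `fan_out` call stands in for the dict that LangChain coerces automatically.

```python
class Step:
    """Wrap a callable so steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Accept either another Step or a plain callable as the next stage.
        nxt = other if isinstance(other, Step) else Step(other)
        return Step(lambda value: nxt.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


def fan_out(**branches):
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda value: {name: branch(value) for name, branch in branches.items()})


toy_chain = (
    fan_out(context=lambda q: f"docs about {q}", question=lambda q: q)
    | (lambda d: f"Context: {d['context']} | Question: {d['question']}")
    | str.upper
)

result = toy_chain.invoke("memory")
print(result)
```

In the notebook's chain the branches are `retriever | format_docs` and `RunnablePassthrough()`, which play the same roles as the two lambdas passed to `fan_out` here.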

In [15]:
response = rag_chain.invoke("Can you tell me about the Infini-attention?")

response

'## Infini-attention: A Powerful Attention Mechanism\n\nInfini-attention is a novel attention mechanism designed to efficiently capture both **long- and short-range contextual dependencies** within sequences. It achieves this by incorporating two key components:\n\n* **Compressive Memory:** Unlike standard attention mechanisms that discard past key-value (KV) states, Infini-attention stores them in a compressive memory. This allows the model to access and utilize information from much earlier in the sequence, effectively capturing long-term dependencies.\n* **Local Causal Attention:**  Alongside the long-term memory, Infini-attention also employs a masked local attention mechanism. This focuses on the relationships between nearby elements in the sequence, ensuring that short-range dependencies are also captured.\n\n### Advantages of Infini-attention:\n\n* **Enhanced Contextual Understanding:** By combining long- and short-range context, Infini-attention provides a more comprehensive un

In [16]:
from IPython.display import Markdown as md

md(response)

## Infini-attention: A Powerful Attention Mechanism

Infini-attention is a novel attention mechanism designed to efficiently capture both **long- and short-range contextual dependencies** within sequences. It achieves this by incorporating two key components:

* **Compressive Memory:** Unlike standard attention mechanisms that discard past key-value (KV) states, Infini-attention stores them in a compressive memory. This allows the model to access and utilize information from much earlier in the sequence, effectively capturing long-term dependencies.
* **Local Causal Attention:**  Alongside the long-term memory, Infini-attention also employs a masked local attention mechanism. This focuses on the relationships between nearby elements in the sequence, ensuring that short-range dependencies are also captured.

### Advantages of Infini-attention:

* **Enhanced Contextual Understanding:** By combining long- and short-range context, Infini-attention provides a more comprehensive understanding of the sequence, leading to improved performance in various tasks.
* **Minimal Modification:**  Infini-attention introduces minimal changes to the standard scaled dot-product attention, making it easy to integrate into existing Transformer models.
* **Plug-and-Play Continual Learning:** The design of Infini-attention inherently supports continual pre-training and long-context adaptation, enabling the model to continuously learn and adapt to new information. 

### Overall, Infini-attention presents a promising approach for improving the effectiveness of attention mechanisms in capturing long-range dependencies, leading to better performance and adaptability in various sequence modeling tasks. 


In [17]:
response = rag_chain.invoke("What is LLM continual pre-training?")

md(response)

## LLM Continual Pre-training Explained:

Based on the context you provided, **LLM Continual Pre-training** refers to a method of further training existing Large Language Models (LLMs) to handle long sequences of text (long-context) more effectively. This is achieved by extending or modifying the attention mechanisms within the LLM.

Here's a breakdown of the key points:

* **Goal:** Adapt existing LLMs to process and understand long sequences of text (beyond the typical limit of LLMs).
* **Method:**
    * **Lightweight pre-training:** Existing LLMs are further trained on long-context data (e.g., PG19, Arxiv-math, C4 text) without significant architectural changes. 
    * **Modified attention mechanisms:** Techniques like **Infini-attention** are used. Infini-attention incorporates a compressive memory and combines local and long-term attention mechanisms, allowing the model to handle longer sequences efficiently.
* **Benefits:**
    * Enables LLMs to work with longer text sequences, improving their understanding of context and complex relationships within the text.
    * Offers a more efficient way to adapt existing LLMs compared to training new models from scratch. 

**Examples from the context:**

* The researchers used Infini-attention to modify a 1B parameter LLM and pre-trained it on 4K token-long inputs, demonstrating the effectiveness of continual pre-training for long-context adaptation.
* The use of compressed input representations is mentioned as another approach to efficient long-context modeling, where the LLM itself summarizes past segments to manage longer sequences. 
