# **Contextual RAG**

Contextual Retrieval-Augmented Generation (RAG) is an advanced RAG technique that improves response relevance and efficiency by applying contextual compression during retrieval. Traditional RAG sends the retrieved documents to the generation model in full, including any irrelevant passages, which raises cost and can make responses less accurate.

In Contextual RAG, the retrieved documents are processed through a Document Compressor before being passed to the language model. This compressor extracts and retains only the most relevant information for the query, or even discards entire irrelevant documents. This approach reduces the noise in the retrieved context, resulting in more precise, concise, and cost-effective responses from the generation model.
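
To make the idea concrete, here is a minimal sketch of a query-aware compressor. It is only a stand-in: it scores sentences by keyword overlap, whereas the notebook uses an LLM-based extractor. The names `compress_document`, `compress_documents`, and `STOPWORDS` are hypothetical and not part of LangChain.

```python
import re
from typing import List

# Assumption: a tiny hand-picked stopword list, just enough for this example.
STOPWORDS = {"what", "are", "on", "a", "the", "is", "of", "in", "to"}


def _terms(text: str) -> set:
    """Lowercase alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def compress_document(text: str, query: str, min_overlap: int = 1) -> str:
    """Keep only the sentences that share at least `min_overlap` query terms."""
    query_terms = _terms(query) - STOPWORDS
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences if len(query_terms & _terms(s)) >= min_overlap]
    return " ".join(kept)


def compress_documents(docs: List[str], query: str) -> List[str]:
    """Compress every document and discard those with nothing relevant left."""
    compressed = (compress_document(d, query) for d in docs)
    return [c for c in compressed if c]


query = "what are points on a mortgage"
doc1 = (
    "Mortgage points are a form of pre-paid interest. "
    "Each of the points costs one percent of the loan amount. "
    "The office is closed on public holidays."
)
doc2 = "Chroma stores vectors on disk."

# The off-topic sentence in doc1 is removed, and doc2 is discarded entirely.
print(compress_documents([doc1, doc2], query))
```

Keyword overlap misses paraphrases and synonyms, which is one reason an LLM-based or embedding-based compressor tends to be preferred in practice.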

Reference: [Contextual RAG](https://python.langchain.com/docs/how_to/contextual_compression/)

# **📚 Notebook Summary & Step-by-Step Guide**

## **🎯 What This Notebook Does**
This notebook implements **Contextual RAG** - an advanced technique that adds **document compression** to traditional RAG. Instead of passing entire retrieved documents to the LLM, it intelligently compresses and filters content to include only the most relevant information for each query.

## **🔧 Key Libraries & Their Roles**

### **Core RAG Libraries**
- **`langchain`** - Main orchestration framework with compression capabilities
- **`langchain-openai`** - OpenAI integration for embeddings and chat
- **`chromadb`** - Lightweight vector database for document storage
- **`athina`** - RAG evaluation and performance monitoring

### **Compression Components**
- **`ContextualCompressionRetriever`** - Main compression orchestrator
- **`LLMChainExtractor`** - Uses LLM to extract relevant content
- **`BaseDocumentCompressor`** - Base interface that document compressors implement

### **Vector Storage**
- **`Chroma`** - Vector database for initial document retrieval
- **`OpenAIEmbeddings`** - Document vectorization for similarity search

## **📋 Step-by-Step Process**

### **Step 1: Standard RAG Setup**
- Install required packages: `athina`, `chromadb`
- Configure OpenAI API keys via Google Colab userdata
- Set up basic document loading and vectorization

### **Step 2: Document Processing**
- Load CSV data using `CSVLoader`
- Split documents into 500-character chunks
- Create Chroma vectorstore with OpenAI embeddings
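
As a rough illustration of what a 500-character chunk means, the sketch below cuts text at fixed offsets. This is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph breaks, newlines, and spaces, so its chunks are at most 500 characters rather than exactly 500. The function name `split_fixed` is hypothetical.

```python
from typing import List


def split_fixed(text: str, chunk_size: int = 500, chunk_overlap: int = 0) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "x" * 1200
print([len(c) for c in split_fixed(sample)])                     # no overlap
print([len(c) for c in split_fixed(sample, chunk_overlap=100)])  # 100 shared chars
```

With `chunk_overlap=0`, as in this notebook, a sentence that straddles a chunk boundary is split across two chunks; a small overlap is a common way to soften that.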

### **Step 3: Compression Layer Implementation**
- Set up base retriever from vectorstore
- Configure `LLMChainExtractor` for intelligent compression
- Create `ContextualCompressionRetriever` wrapper

### **Step 4: Query-Aware Filtering**
- Implement context extraction based on query relevance
- Filter out irrelevant document sections automatically
- Preserve only information that directly addresses the query

### **Step 5: Enhanced RAG Pipeline**
- Integrate compressed retriever into RAG chain
- Configure LLM for response generation with compressed context
- Test with various query types to demonstrate compression benefits

### **Step 6: Performance Evaluation**
- Compare compressed vs. uncompressed retrieval results
- Measure cost reduction through token savings
- Evaluate response quality with contextual compression

## **🚀 Key Advantages of Contextual RAG**

### **Cost Optimization**
- **Reduced Tokens**: Fewer tokens sent to LLM = lower API costs
- **Faster Processing**: Less content in the final prompt = quicker generation, although the LLM-based extractor adds its own calls during retrieval
- **Improved Focus**: Only relevant content = better accuracy

### **Quality Improvements**
- **Noise Reduction**: Eliminates irrelevant information that confuses the LLM
- **Precision Enhancement**: Focuses LLM attention on query-relevant content
- **Context Clarity**: Cleaner, more targeted context for generation

### **Practical Benefits**
- **Scalability**: Handles larger document sets without token explosion
- **Flexibility**: Adjustable compression levels based on needs
- **Transparency**: Can inspect what content was filtered out

## **⚙️ Compression Workflow**

```
Query → Initial Retrieval → Document Compression → Filtered Context → LLM Response
```

1. **Initial Retrieval**: Standard vector similarity search
2. **Compression Analysis**: LLM evaluates relevance of each document section
3. **Content Filtering**: Removes irrelevant paragraphs/sentences
4. **Context Assembly**: Combines only relevant excerpts
5. **Response Generation**: LLM processes compressed, focused context
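
The five steps above can be sketched as a single function that takes the retrieval, compression, and generation stages as plain callables. This is an illustrative outline, not LangChain's implementation; `contextual_rag`, `toy_retrieve`, and `toy_compress` are hypothetical names, and the toy stages stand in for the vector store, the LLM extractor, and the chat model.

```python
import re
from typing import Callable, List


def contextual_rag(
    query: str,
    retrieve: Callable[[str], List[str]],
    compress: Callable[[str, str], str],
    generate: Callable[[str], str],
) -> str:
    candidates = retrieve(query)                              # 1. initial retrieval
    excerpts = [compress(doc, query) for doc in candidates]   # 2-3. analyse and filter
    context = "\n\n".join(e for e in excerpts if e)           # 4. assemble, dropping empty results
    prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
    return generate(prompt)                                   # 5. response generation


def toy_retrieve(query: str) -> List[str]:
    # Stand-in for vector similarity search: always returns the same two documents.
    return [
        "Points are pre-paid interest. Rates change daily.",
        "Chroma is a vector store.",
    ]


def toy_compress(doc: str, query: str) -> str:
    # Stand-in for the LLM extractor: keep sentences containing a longer query word.
    keywords = [w for w in query.lower().split() if len(w) > 4]
    sentences = re.split(r"(?<=[.!?])\s+", doc)
    return " ".join(s for s in sentences if any(k in s.lower() for k in keywords))


# Echo the prompt instead of calling a model, to see exactly what the LLM would receive.
prompt_seen = contextual_rag(
    "what are points on a mortgage", toy_retrieve, toy_compress, generate=lambda p: p
)
print(prompt_seen)
```

Passing the stages in as callables keeps the orchestration separate from any particular retriever or model, which roughly mirrors how a compression retriever wraps a base retriever and a compressor.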

## **💰 Cost-Benefit Analysis**

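| Metric | Traditional RAG | Contextual RAG | Improvement |
|--------|----------------|----------------|-------------|
| **Token Usage** | High (full chunks) | Lower (compressed excerpts) | Fewer prompt tokens; the amount depends on the data and query (not measured in this notebook) |
| **Response Speed** | Slower generation | Faster generation | Shorter prompt, partly offset by the extractor's extra LLM calls |
| **Accuracy** | Variable | Often higher | Better focus |
| **Cost** | Higher | Often lower | Depends on extractor cost vs. prompt tokens saved |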
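
The token-usage row can be checked on your own data rather than taken on trust. The sketch below estimates the reduction from the retrieved texts before and after compression. It assumes whitespace-separated words as a crude proxy for tokens (a real tokenizer will give different counts), and it does not include the tokens spent by the extractor's own LLM calls. The function names are hypothetical.

```python
from typing import List


def approx_tokens(text: str) -> int:
    """Crude token estimate: number of whitespace-separated words."""
    return len(text.split())


def token_reduction(original_docs: List[str], compressed_docs: List[str]) -> float:
    """Fraction of (approximate) context tokens removed by compression."""
    before = sum(approx_tokens(d) for d in original_docs)
    after = sum(approx_tokens(d) for d in compressed_docs)
    if before == 0:
        return 0.0
    return 1 - after / before


# Hypothetical example: 10 words retrieved, 3 words kept after compression.
original = ["a b c d e f g h", "i j"]
compressed = ["a b c"]
print(f"{token_reduction(original, compressed):.0%} fewer context tokens")
```

In this notebook, the two inputs could be the `page_content` of the documents returned by `retriever.invoke(query)` and by `compression_retriever.invoke(query)`.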

## **💡 Learning Outcomes**
Students will understand:
- How document compression improves RAG efficiency
- Query-aware content filtering techniques
- Cost optimization strategies for production RAG systems
- Trade-offs between compression and information preservation
- Advanced RAG architecture patterns for enterprise use
- LLM-based content extraction and filtering methods

## **Initial Setup**

In [None]:
!pip install -q athina chromadb

In [None]:
import os
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
os.environ['ATHINA_API_KEY'] = userdata.get('ATHINA_API_KEY')

## **Indexing**

In [None]:
# load embedding model
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [None]:
# load data: CSVLoader creates one Document per CSV row
from langchain_community.document_loaders import CSVLoader
loader = CSVLoader("./context.csv")
documents = loader.load()

In [None]:
# split documents into chunks of at most 500 characters, with no overlap
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
documents = text_splitter.split_documents(documents)

In [None]:
# create vectorstore: embed each chunk and index it in Chroma
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents, embeddings)

## **Retriever**

In [None]:
# create retriever
retriever = vectorstore.as_retriever()

## **Contextual Retriever**

In [None]:
# create llm
from langchain_openai import ChatOpenAI
llm = ChatOpenAI()

In [None]:
# create compression retriever
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

In [None]:
# checking compressed doc
compressed_docs = compression_retriever.invoke("what are points on a mortgage")
compressed_docs

[Document(page_content='Discount points, also called mortgage points or simply points, are a form of pre-paid interest available in the United States when arranging a mortgage. One point equals one percent of the loan amount. By charging a borrower points, a lender effectively increases the yield on the loan above the amount of the stated interest rate. Borrowers can offer to pay a lender points as a method to reduce the interest rate on the loan, thus obtaining a lower monthly payment in exchange for this', metadata={'row': 1, 'source': './context.csv'}),
 Document(page_content="points is the concept of the 'no closing cost loan', in which the consumer accepts a higher interest rate in return for the lender paying the loan's closing costs up front. In some cases a purchaser can negotiate with the seller to get them to pay seller's points which can be used to pay mortgage points.", metadata={'row': 1, 'source': './context.csv'}),
 Document(page_content='Points may also be purchased to 

## **RAG Chain**

In [None]:
# create document chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

template = """
You are a helpful assistant that answers questions based on the following context.
If you don't find the answer in the context, just say that you don't know.
Context: {context}

Question: {input}

Answer:

"""
prompt = ChatPromptTemplate.from_template(template)

# Setup RAG pipeline
rag_chain = (
    {"context": compression_retriever,  "input": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
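
In the chain above, the `{context}` slot receives whatever the retriever returns, which is a list of `Document` objects. As far as I can tell, these are then rendered through their string representation, metadata included. A common pattern is to join only the `page_content` fields into plain text first. The sketch below shows such a helper; `Doc` is a hypothetical stand-in for LangChain's `Document` so that the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (assumption, for illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs: Iterable[Doc]) -> str:
    """Join the text of each document, separated by blank lines, dropping metadata."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    Doc("One point equals one percent of the loan amount.", {"row": 1}),
    Doc("Points may lower the interest rate."),
]
print(format_docs(docs))
```

In the notebook's chain, this would typically be wired in as `"context": compression_retriever | format_docs`, so the prompt sees clean text rather than object representations.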

In [None]:
# response
response = rag_chain.invoke("what are points on a mortgage")
response

'Points on a mortgage, also known as discount points or mortgage points, are a form of pre-paid interest that borrowers can pay to lenders when arranging a mortgage in the United States. One point equals one percent of the loan amount. By paying points, borrowers can effectively reduce the interest rate on the loan, resulting in a lower monthly payment. Points can also be used to qualify for a loan or to have the lender pay the closing costs upfront. Points are different from origination fees, mortgage arrangement fees, or broker fees. The loan rate is typically reduced by a certain percentage when points are paid.'

## **Preparing Data for Evaluation**

In [None]:
# create dataset
questions = ["what are points on a mortgage"]
response = []
contexts = []

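# Inference: generate an answer and record the compressed context for each question
for query in questions:
    response.append(rag_chain.invoke(query))
    # Note: this calls the retriever a second time; LLM-based extraction may not return identical text
    contexts.append([doc.page_content for doc in compression_retriever.invoke(query)])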

# To dict
data = {
    "query": questions,
    "response": response,
    "context": contexts,
}

In [None]:
# create dataset
from datasets import Dataset
dataset = Dataset.from_dict(data)

In [None]:
# create dataframe
import pandas as pd
df = pd.DataFrame(dataset)

In [None]:
df

| | query | response | context |
|---|-------|----------|---------|
| 0 | what are points on a mortgage | Points on a mortgage are a form of pre-paid in... | [Discount points, also called mortgage points ... |


In [None]:
# Convert to dictionary
df_dict = df.to_dict(orient='records')

# Convert context to list
for record in df_dict:
    if not isinstance(record.get('context'), list):
        if record.get('context') is None:
            record['context'] = []
        else:
            record['context'] = [record['context']]

## **Evaluation in Athina AI**

We will use the **Context Relevancy** eval here. It measures how relevant the retrieved context is to the query, and is computed from both the query and the contexts. For further details, refer to our [documentation](https://docs.athina.ai/api-reference/evals/preset-evals/overview).

In [None]:
# set api keys for Athina evals
from athina.keys import AthinaApiKey, OpenAiApiKey
OpenAiApiKey.set_key(os.getenv('OPENAI_API_KEY'))
AthinaApiKey.set_key(os.getenv('ATHINA_API_KEY'))

In [None]:
# load dataset
from athina.loaders import Loader
dataset = Loader().load_dict(df_dict)

In [None]:
# evaluate
from athina.evals import RagasContextRelevancy
RagasContextRelevancy(model="gpt-4o").run_batch(data=dataset).to_df()

evaluating with [context_relevancy]


You can view your dataset at: https://app.athina.ai/develop/76c73e9b-7e13-4e2e-9cde-37565deefa56


| | query | context | response | expected_response | display_name | failed | grade_reason | runtime | model | ragas_context_relevancy |
|---|-------|---------|----------|-------------------|--------------|--------|--------------|---------|-------|-------------------------|
| 0 | what are points on a mortgage | [Discount points, also called mortgage points or simply points, are a form of pre-paid interest available in the United States when arranging a mortgage. One point equals one percent of the loan amount. Borrowers can offer to pay a lender points as a method to reduce the interest rate on the loan., points is the concept of the 'no closing cost loan', in which the consumer accepts a higher interest rate in return for the lender paying the loan's closing costs up front. In some cases a purchas... | Points on a mortgage are a form of pre-paid interest that a borrower can offer to pay a lender in order to reduce the interest rate on the loan. One point equals one percent of the loan amount. By paying points, a borrower can obtain a lower monthly payment in exchange for this. Additionally, points can also be used to reduce the monthly payment to qualify for a loan. It is important to note that discount points may be different from origination fee, mortgage arrangement fee, or broker fee. | | Ragas Context Relevancy | | This metric is calculated by dividing the number of sentences in context that are relevant for answering the given query by the total number of sentences in the retrieved context | 1679 | gpt-4o | 0.454545 |
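
The grade reason describes the score as a ratio: relevant sentences divided by total sentences in the retrieved context. The sketch below reproduces that arithmetic. The relevance flags are supplied by hand here, whereas the evaluator decides them with an LLM. The actual sentence counts behind this run are not shown in the output; 5 relevant sentences out of 11 is simply one combination consistent with the reported 0.454545.

```python
from typing import List


def context_relevancy(relevance_flags: List[bool]) -> float:
    """Share of context sentences judged relevant to the query."""
    if not relevance_flags:
        return 0.0
    return sum(relevance_flags) / len(relevance_flags)


# Hypothetical judgement: 5 relevant sentences, 6 irrelevant ones.
flags = [True] * 5 + [False] * 6
score = context_relevancy(flags)
print(round(score, 6))
```

A score below 0.5 suggests that, even after compression, more than half of the context sentences were judged not directly useful for this query.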
