A query-aware summarization approach that reduces token usage by 60% while retaining 95% of baseline accuracy in Retrieval-Augmented Generation pipelines.
- Overview
- Motivation
- Proposed Method
- Installation
- Usage
- Experiment Results
- Cost-Benefit Analysis
- Performance by Question Type
- Advanced Metrics
- Recommendations
- Project Structure
- Future Work
Standard RAG pipelines retrieve documents and pass them in full to the LLM for answer generation. Most of that retrieved content is irrelevant to the query — yet it still contributes to the total token count, inflating API costs and latency.
This project introduces a two-stage context compression step inserted between retrieval and generation:
- Keyword-based sentence filtering — select the top-k sentences most relevant to the query via word overlap
- LM-based query-aware summarization — compress those sentences using a local T5 model (Small / Base / Large)
The final prompt sent to the LLM is: query + compressed summary instead of query + full retrieved documents.
The approach is model-agnostic and designed for HPC batch pipelines with high query volumes.
| Problem | Impact |
|---|---|
| Token bloat | Baseline RAG uses 604K tokens for 918 questions |
| API cost at scale | $0.30 per 918 questions; compounds to thousands at HPC scale |
| No context filtering | Same context sent regardless of query specificity |
PDF Docs
↓
Text Cleaning (remove HTML)
↓
Chunking (SentenceSplitter, chunk_size=200, overlap=50)
↓
Embedding + Vector Index (Weaviate)
↓
Hybrid Search + Cross-Encoder Reranking (top-5)
↓
★ [NEW] Keyword-based Sentence Filtering
↓
★ [NEW] T5 Query-Aware Summarization
↓
GPT-3.5-Turbo Answer Generation
Keyword-based Sentence Filtering
- Extract sentences from retrieved documents
- Score by word overlap with the query
- Select top-k; fall back to first-k if no overlap exists
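The filtering steps above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the helper names (`split_sentences`, `tokenize`, `filter_sentences`), the regex-based sentence splitter, and the default `k=5` are all assumptions. It also performs no stopword removal, so common words like "is" count toward overlap.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(text):
    # Lowercased alphanumeric word set; no stopword removal in this sketch.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def filter_sentences(query, documents, k=5):
    """Return up to k sentences with the highest word overlap with the query."""
    sentences = [s for doc in documents for s in split_sentences(doc)]
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(s)), i, s) for i, s in enumerate(sentences)]
    # Fallback: if no sentence shares a word with the query, keep the first k.
    if all(score == 0 for score, _, _ in scored):
        return sentences[:k]
    # Highest overlap first; ties broken by original position.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:k]
    # Restore document order so the selected context reads naturally.
    return [s for _, _, s in sorted(top, key=lambda t: t[1])]

docs = ["T5 is a text-to-text model. The sky is blue. mT5 is a multilingual T5 variant."]
print(filter_sentences("What is mT5?", docs, k=2))
```

Returning the selected sentences in their original order is a design choice of this sketch; ranking order would work equally well as input to the summarizer.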
LM Summarization
- Prompt: "Summarize for the question: {query}\n\nContext: {filtered_sentences}"
- Runs locally, with no additional API cost
- Three model sizes tested: T5-Small, T5-Base, T5-Large
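A sketch of how this step might look with Hugging Face `transformers`. The prompt template is the one quoted above; everything else is an assumption for illustration, including the function names, the `t5-small` checkpoint id, the 512-token input limit, and the 64-token output budget. The project's `summarizer.py` may differ.

```python
def build_summary_prompt(query, filtered_sentences):
    # Template taken from the README; sentences are joined with spaces (assumption).
    context = " ".join(filtered_sentences)
    return f"Summarize for the question: {query}\n\nContext: {context}"

def summarize(query, filtered_sentences, model_name="t5-small", max_new_tokens=64):
    """Run a local T5 checkpoint on the query-aware prompt (hypothetical helper)."""
    # Imported lazily so the prompt builder works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(
        build_summary_prompt(query, filtered_sentences),
        return_tensors="pt",
        truncation=True,
        max_length=512,  # assumed input limit for this sketch
    )
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(build_summary_prompt("What is mT5?", ["mT5 is a multilingual T5 variant."]))
```

Swapping `model_name` for a larger checkpoint is the only change needed to compare model sizes under these assumptions.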
Final Prompt
- query + document_summary (instead of full retrieved chunks)
- Sent to GPT-3.5-Turbo for answer generation
⚠️ Do not run on Windows — Embedded Weaviate DB is not supported. See issue #3315.
docker-compose up --build -d
# App available at http://0.0.0.0:3000/
Linux:
pip3 install -r requirements.txt
Mac:
python3 -m pip install --upgrade pip
pip3 install -r requirements_mac.txt
Create a .env file in the root directory:
OPENAI_API_KEY=your_openai_api_key_here
Visit http://0.0.0.0:3000/ after starting the app. ▶ Video walkthrough
Upload a PDF:
python upload.py --pdf_file=your_document.pdf
Ask a question:
python query.py --question="What is mt5?"
Export the vector DB data:
self.save_data_from_index_to_file(client)
# Output: index_data.json
Dataset: rag-mini-wikipedia · 918 QA pairs
Generator: GPT-3.5-Turbo
Retrieval: FAISS + Cross-Encoder reranking (top-5)
| Model | Exact Match | F1 Score | BERTScore | Tokens Used | Est. Cost |
|---|---|---|---|---|---|
| Baseline RAG | 58.28% ★ | 0.704 ★ | 0.9474 ★ | 604,223 | $0.30 |
| T5-Small Summary ⭐ | 55.34% | 0.640 | 0.9389 | 240,580 (-60%) | $0.12 |
| T5-Base Summary | 53.70% | 0.633 | 0.9397 | 231,525 (-62%) | $0.12 |
| T5-Large Summary | 52.83% | 0.614 | 0.9360 | 241,491 (-60%) | $0.12 |
★ Best in category · ⭐ Recommended · (-%) Token reduction vs Baseline
Key finding: T5-Small achieves 95% of baseline accuracy at 40% of the cost.
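The headline figures follow directly from the results table. The short check below recomputes them; the dictionary layout and function names are illustrative only, and the numbers are copied from the table.

```python
# Token totals, exact-match scores, and estimated costs copied from the results table.
BASELINE = {"tokens": 604_223, "exact_match": 58.28, "cost_usd": 0.30}
T5_SMALL = {"tokens": 240_580, "exact_match": 55.34, "cost_usd": 0.12}

def token_reduction_pct(baseline_tokens, compressed_tokens):
    # Share of baseline tokens removed by compression, in percent.
    return 100.0 * (1 - compressed_tokens / baseline_tokens)

def retained_pct(baseline_value, value):
    # How much of the baseline value the compressed pipeline keeps, in percent.
    return 100.0 * value / baseline_value

reduction = token_reduction_pct(BASELINE["tokens"], T5_SMALL["tokens"])
accuracy_kept = retained_pct(BASELINE["exact_match"], T5_SMALL["exact_match"])
cost_share = retained_pct(BASELINE["cost_usd"], T5_SMALL["cost_usd"])

print(f"Token reduction: {reduction:.1f}%")
print(f"Exact match retained: {accuracy_kept:.1f}%")
print(f"Cost relative to baseline: {cost_share:.0f}%")
```

Note that "95% of baseline accuracy" is a relative figure; in absolute terms the exact-match gap is about 2.9 percentage points.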
| Model | Tokens vs Baseline | Cost Savings | Accuracy Trade-off | Efficiency Score* |
|---|---|---|---|---|
| T5-Small | -60.2% (363K saved) | $0.18 saved | -2.9% | 4.61 ⭐ |
| T5-Base | -61.7% (373K saved) | $0.19 saved | -4.6% | 4.48 |
| T5-Large | -60.0% (363K saved) | $0.18 saved | -5.5% | 4.39 |
| Baseline | — | — | — | 1.94 |
Efficiency Score = Exact Match (as a fraction) / Est. Cost (in USD); higher is better
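One reading of the Efficiency Score that reproduces the table is exact match expressed as a fraction, divided by the estimated cost in dollars. The sketch below assumes that reading and uses the rounded costs from the results table, so the last digit can differ slightly from the published scores.

```python
# (exact match %, estimated cost in USD), copied from the results table.
RESULTS = {
    "Baseline RAG": (58.28, 0.30),
    "T5-Small": (55.34, 0.12),
    "T5-Base": (53.70, 0.12),
    "T5-Large": (52.83, 0.12),
}

def efficiency_score(exact_match_pct, cost_usd):
    # Assumed definition: accuracy as a fraction per dollar of estimated cost.
    return (exact_match_pct / 100.0) / cost_usd

# Rank configurations from most to least efficient.
ranked = sorted(RESULTS, key=lambda name: efficiency_score(*RESULTS[name]), reverse=True)
for name in ranked:
    print(f"{name}: {efficiency_score(*RESULTS[name]):.2f}")
```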
| Use Case | Recommended Model | Why |
|---|---|---|
| Accuracy-critical (research / benchmark) | Baseline RAG | Best exact match (58.28%), highest BERTScore |
| Cost-sensitive / high-volume | T5-Base Summary | Lowest token count (231K), cheapest at scale |
| Balanced production workload | T5-Small Summary ⭐ | Best efficiency ratio (4.61), 95% of accuracy |
| ❌ Avoid | T5-Large Summary | Lowest accuracy, no cost benefit over T5-Small |
T5 Model Size Paradox: Larger T5 models do not improve summarization quality in this setting — T5-Small outperforms T5-Large. Bigger models likely over-compress and discard key facts.
| Question Type | Count | Baseline | T5-Small | T5-Base | T5-Large | Average |
|---|---|---|---|---|---|---|
| Yes/No | 420 | 89.8% ✅ | 85.2% | 83.6% | 85.7% | 86.1% |
| When | 41 | 53.7% | 53.7% | 41.5% | 31.7% | 45.2% |
| Which | 11 | 45.5% | 36.4% | 45.5% | 36.4% | 40.9% |
| Other | 51 | 41.2% | 47.1% | 41.2% | 39.2% | 42.2% |
| Who | 54 | 40.7% | 29.6% | 31.5% | 29.6% | 32.9% |
| Where | 32 | 34.4% | 28.1% | 18.8% | 25.0% | 26.6% |
| What | 221 | 28.1% | 26.7% | 29.0% | 24.0% | 26.9% |
| How | 63 | 22.2% | 25.4% | 17.5% | 15.9% | 20.2% |
| Why | 25 | 4.0% 🔴 | 0.0% 🔴 | 4.0% 🔴 | 4.0% 🔴 | 3.0% |
Insights:
- ✅ All models excel at Yes/No questions (>83%) — binary retrieval works well
- 🔴 All models struggle with Why questions (<5%) — open-ended reasoning is a critical gap
- 📊 What questions make up 24% of the dataset but average only ~27% accuracy — high-impact improvement area
- Roughly a 29× performance gap between the best (Yes/No, 86.1% average) and worst (Why, 3.0% average) question types
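As a consistency check, weighting each question type's accuracy by its count should recover the overall exact-match scores reported earlier. The sketch below does this for two of the four configurations; the data structure is illustrative, and small differences come from the per-type percentages being rounded to one decimal.

```python
# (question type, count, Baseline %, T5-Small %), copied from the table above.
ROWS = [
    ("Yes/No", 420, 89.8, 85.2),
    ("When", 41, 53.7, 53.7),
    ("Which", 11, 45.5, 36.4),
    ("Other", 51, 41.2, 47.1),
    ("Who", 54, 40.7, 29.6),
    ("Where", 32, 34.4, 28.1),
    ("What", 221, 28.1, 26.7),
    ("How", 63, 22.2, 25.4),
    ("Why", 25, 4.0, 0.0),
]

total = sum(count for _, count, _, _ in ROWS)

# Count-weighted average accuracy per configuration.
weighted = {
    "Baseline": sum(count * base for _, count, base, _ in ROWS) / total,
    "T5-Small": sum(count * small for _, count, _, small in ROWS) / total,
}

print(total)
for name, value in weighted.items():
    print(f"{name}: {value:.2f}%")
```

Because Yes/No questions make up 420 of the 918 items, they dominate the overall score, which is why the low accuracy on Why and How questions is easy to miss in the aggregate numbers.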
| Model | BERTScore Precision | BERTScore Recall | BERTScore F1 |
|---|---|---|---|
| Baseline | 0.9465 | 0.9490 | 0.9474 |
| T5-Small | 0.9405 | 0.9381 | 0.9389 |
| T5-Base | 0.9419 | 0.9383 | 0.9397 |
| T5-Large | 0.9393 | 0.9337 | 0.9360 |
All models score >0.93 BERTScore — answers are semantically correct even when they don't exactly match the ground truth string. The exact match gap between Baseline and T5 variants is partly a surface-level phrasing artifact, not a deep semantic failure.
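To see why exact match can penalize answers that are essentially correct, it helps to look at how string-based metrics are typically computed. The README does not show its scoring code, so the sketch below assumes the common SQuAD-style convention (lowercasing, stripping punctuation and articles, then comparing strings or token overlap); the function names are illustrative.

```python
import re
import string
from collections import Counter

def normalize(text):
    # Lowercase, drop punctuation and English articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    return normalize(prediction) == normalize(reference)

def token_f1(prediction, reference):
    # Harmonic mean of token-level precision and recall after normalization.
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A correct but wordier answer fails exact match yet keeps partial F1 credit.
print(exact_match("in Paris, France", "Paris"))
print(token_f1("in Paris, France", "Paris"))
```

Under this convention, a summary-driven answer that adds or drops a few words loses exact-match credit entirely, while embedding-based metrics such as BERTScore barely move.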
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| Baseline | 0.724 | 0.166 | 0.719 |
| T5-Small | 0.653 | 0.127 | 0.651 |
| T5-Base | 0.649 | 0.133 | 0.646 |
| T5-Large | 0.626 | 0.117 | 0.623 |
| Model | Mean | Std Dev |
|---|---|---|
| Baseline | 0.796 | 0.288 |
| T5-Small | 0.748 | 0.329 |
| T5-Base | 0.749 | 0.328 |
| T5-Large | 0.712 | 0.356 |
| Model | Correct / Total | Error Rate | Token Efficiency |
|---|---|---|---|
| Baseline | 535 / 918 | 41.7% | 1.00× |
| T5-Small | 508 / 918 | 44.7% | 2.51× |
| T5-Base | 493 / 918 | 46.3% | 2.61× |
| T5-Large | 485 / 918 | 47.2% | 2.50× |
Use Baseline RAG when:
- Accuracy is critical (58.28% vs 53–55% for T5)
- Cost is not a concern
- You need the highest semantic quality (0.947 BERTScore)
Use T5-Small when (recommended default):
- You need the best balance of performance and cost
- 55% accuracy is acceptable (about 3 percentage points below baseline, a 5% relative drop)
- You want 60% token reduction at 40% of the cost
- Deploying at HPC scale with high query volume
Use T5-Base when:
- Minimizing tokens/cost is the top priority (62% reduction)
- Volume is very high and every cent matters
Avoid T5-Large:
- Lowest accuracy (52.83%) with no cost advantage
- Smaller T5 models outperform it, likely because over-compression discards key facts
.
├── upload.py # PDF ingestion script
├── query.py # CLI query interface
├── requirements.txt # Linux dependencies
├── requirements_mac.txt # Mac dependencies
├── docker-compose.yml # Docker setup
├── .env # API keys (not committed)
├── index_data.json # Exported vector DB data
└── src/
    ├── text_cleaner.py # HTML/text cleaning
    ├── indexing.py # Chunking + embedding
    ├── retriever.py # Hybrid search + reranking
    ├── summarizer.py # T5 query-aware summarization
    └── query_engine.py # End-to-end pipeline
Key components in code:
# Imports assumed for the excerpts below (llama_index >= 0.10 layout).
# TextCleaner, indexing, and Retriever are this project's own modules under src/.
import tiktoken
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.postprocessor import SentenceTransformerRerank

# Text cleaning
clean_text = TextCleaner(doc.text).clean()
# Chunking
Settings.text_splitter = SentenceSplitter(
    separator=" ", chunk_size=200, chunk_overlap=50,
    paragraph_separator="\n\n\n",
    secondary_chunking_regex="[^,.;。]+[,.;。]?",
    tokenizer=tiktoken.encoding_for_model(self.model_name).encode
)
# Embedding + indexing
index, nodes = indexing.get_index()
# Reranking
self.rerank = SentenceTransformerRerank(top_n=5, model=self.model_reranker)
# Hybrid retrieval
query_engine = self.index.as_query_engine(
    similarity_top_k=5,
    vector_store_query_mode="hybrid",
    alpha=0.5,
    node_postprocessors=[self.postproc, self.rerank],
)
# Answer generation
response = Retriever(index, nodes).get_response("What is t5?")
High Priority
- Chain-of-thought prompting for Why/How questions (Why is currently <5% accuracy; How averages about 20%)
- Hybrid routing: use Baseline for complex questions, T5-Small for simple ones
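A hybrid router could be as simple as the sketch below. Everything here is hypothetical: the README does not say how question types were assigned, so classifying by the first word is an assumption, as are the function names, the auxiliary-verb list for Yes/No detection, and the choice to treat only Why and How as "complex".

```python
WH_WORDS = {"who", "what", "when", "where", "which", "why", "how"}
AUXILIARIES = {"is", "are", "was", "were", "do", "does", "did", "can", "has", "have"}
COMPLEX_TYPES = {"why", "how"}  # assumed definition of "complex" questions

def question_type(question):
    # Classify by the first word only; a deliberately crude heuristic.
    words = question.strip().lower().split()
    first = words[0] if words else ""
    if first in WH_WORDS:
        return first
    if first in AUXILIARIES:
        return "yes/no"
    return "other"

def choose_pipeline(question):
    # Send reasoning-heavy questions to full-context RAG, the rest to T5-Small.
    return "baseline" if question_type(question) in COMPLEX_TYPES else "t5-small"

for q in ["Why did the empire fall?", "Is Paris in France?", "What is mT5?"]:
    print(q, "->", choose_pipeline(q))
```

The per-type table suggests the routing set should be tuned empirically rather than fixed in advance, since the compressed pipeline is not uniformly worse on every question type.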
Medium Priority
- Named-entity-aware sentence filtering to improve Who/Where accuracy
- Fine-tune T5 on HPC domain documents for better in-domain summarization
Long-term
- Replace word-overlap sentence selection with semantic similarity (e.g., cosine over sentence embeddings)
- Explore extractive + abstractive hybrid compression
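The semantic-similarity idea could replace the word-overlap score with cosine similarity between a query embedding and each sentence embedding. The sketch below is a pure-Python illustration with hypothetical function names; it assumes the vectors come from some sentence-embedding model, which is not shown here.

```python
import math

def cosine(u, v):
    # Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_by_similarity(query_vec, sentence_vecs, k=5):
    """Return indices of the k sentences most similar to the query, in document order."""
    scores = [(cosine(query_vec, vec), i) for i, vec in enumerate(sentence_vecs)]
    best = sorted(scores, key=lambda t: (-t[0], t[1]))[:k]
    return sorted(i for _, i in best)

# Toy 2-D "embeddings" standing in for real model output.
query = [1.0, 0.0]
sentences = [[0.0, 1.0], [1.0, 0.1], [0.9, 0.0]]
print(top_k_by_similarity(query, sentences, k=2))
```

Unlike word overlap, this scoring can match paraphrases that share no surface tokens with the query, at the price of an extra embedding pass per sentence.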
If you use this work, please cite:
@misc{rag-token-optimization-2026,
title = {Efficient Token Optimization in Retrieval-Augmented Generation Pipelines for HPC},
author = {Rasel and Samia},
year = {2026},
school = {Iowa State University},
note = {CS 6250 Course Project}
}