Efficient Token Optimization in RAG Pipelines for HPC

A query-aware summarization approach that reduces token usage by 60% while retaining 95% of baseline accuracy in Retrieval-Augmented Generation pipelines.

Overview

Standard RAG pipelines retrieve documents and pass them in full to the LLM for answer generation. Most of that retrieved content is irrelevant to the query — yet it still contributes to the total token count, inflating API costs and latency.

This project introduces a two-stage context compression step inserted between retrieval and generation:

Keyword-based sentence filtering — select the top-k sentences most relevant to the query via word overlap
LM-based query-aware summarization — compress those sentences using a local T5 model (Small / Base / Large)

The final prompt sent to the LLM is: query + compressed summary instead of query + full retrieved documents.

The approach is model-agnostic and designed for HPC batch pipelines with high query volumes.

Motivation

Problem	Impact
Token bloat	Baseline RAG uses 604K tokens for 918 questions
API cost at scale	$0.30 per 918 questions; compounds to thousands at HPC scale
No context filtering	Same context sent regardless of query specificity

Proposed Method

Pipeline

PDF Docs
   ↓
Text Cleaning (remove HTML)
   ↓
Chunking (SentenceSplitter, chunk_size=200, overlap=50)
   ↓
Embedding + Vector Index (Weaviate)
   ↓
Hybrid Search + Cross-Encoder Reranking (top-5)
   ↓
★ [NEW] Keyword-based Sentence Filtering
   ↓
★ [NEW] T5 Query-Aware Summarization
   ↓
GPT-3.5-Turbo Answer Generation

Step Details

Keyword-based Sentence Filtering

Extract sentences from retrieved documents
Score by word overlap with the query
Select top-k; fall back to first-k if no overlap exists

LM Summarization

Prompt: "Summarize for the question: {query}\n\nContext: {filtered_sentences}"
Runs locally — no additional API cost
Three model sizes tested: T5-Small, T5-Base, T5-Large

Final Prompt

query + document_summary (instead of full retrieved chunks)
Sent to GPT-3.5-Turbo for answer generation

Installation

⚠️ Do not run on Windows — Embedded Weaviate DB is not supported. See issue #3315.

Using Docker (recommended)

docker-compose up --build -d
# App available at http://0.0.0.0:3000/

Local — Linux

pip3 install -r requirements.txt

Local — Mac

python3 -m pip install --upgrade pip
pip3 install -r requirements_mac.txt

Environment Setup

Create a .env file in the root directory:

OPENAI_API_KEY=your_openai_api_key_here

Usage

GUI

Visit http://0.0.0.0:3000/ after starting the app. ▶ Video walkthrough

Command Line

Upload a PDF:

python upload.py --pdf_file=your_document.pdf

▶ Video

Ask a question:

python query.py --question="What is mt5?"

▶ Video

Export Vector DB Data

self.save_data_from_index_to_file(client)
# Output: index_data.json

Experiment Results

Dataset: rag-mini-wikipedia · 918 QA pairs
Generator: GPT-3.5-Turbo
Retrieval: FAISS + Cross-Encoder reranking (top-5)

Overall Performance

Model	Exact Match	F1 Score	BERTScore	Tokens Used	Est. Cost
Baseline RAG	58.28% ★	0.704 ★	0.9474 ★	604,223	$0.30
T5-Small Summary ⭐	55.34%	0.640	0.9389	240,580 (-60%)	$0.12
T5-Base Summary	53.70%	0.633	0.9397	231,525 (-62%)	$0.12
T5-Large Summary	52.83%	0.614	0.9360	241,491 (-60%)	$0.12

★ Best in category · ⭐ Recommended · (-%) Token reduction vs Baseline

Key finding: T5-Small achieves 95% of baseline accuracy at 40% of the cost.

Cost-Benefit Analysis

Model	Tokens vs Baseline	Cost Savings	Accuracy Trade-off	Efficiency Score*
T5-Small	-60.2% (363K saved)	$0.18 saved	-2.9%	4.61 ⭐
T5-Base	-61.7% (373K saved)	$0.19 saved	-4.6%	4.48
T5-Large	-60.0% (363K saved)	$0.18 saved	-5.5%	4.39
Baseline	—	—	—	1.94

Efficiency Score = Accuracy / Cost (higher is better)

Decision Matrix

Use Case	Recommended Model	Why
Accuracy-critical (research / benchmark)	Baseline RAG	Best exact match (58.28%), highest BERTScore
Cost-sensitive / high-volume	T5-Base Summary	Lowest token count (231K), cheapest at scale
Balanced production workload	T5-Small Summary ⭐	Best efficiency ratio (4.61), 95% of accuracy
❌ Avoid	T5-Large Summary	Lowest accuracy, no cost benefit over T5-Small

T5 Model Size Paradox: Larger T5 models do not improve summarization quality in this setting — T5-Small outperforms T5-Large. Bigger models likely over-compress and discard key facts.

Performance by Question Type

Question Type	Count	Baseline	T5-Small	T5-Base	T5-Large	Average
Yes/No	420	89.8% ✅	85.2%	83.6%	85.7%	86.1%
When	41	53.7%	53.7%	41.5%	31.7%	45.2%
Which	11	45.5%	36.4%	45.5%	36.4%	40.9%
Other	51	41.2%	47.1%	41.2%	39.2%	42.2%
Who	54	40.7%	29.6%	31.5%	29.6%	32.9%
Where	32	34.4%	28.1%	18.8%	25.0%	26.6%
What	221	28.1%	26.7%	29.0%	24.0%	26.9%
How	63	22.2%	25.4%	17.5%	15.9%	20.2%
Why	25	4.0% 🔴	0.0% 🔴	4.0% 🔴	4.0% 🔴	3.0%

Insights:

✅ All models excel at Yes/No questions (>83%) — binary retrieval works well
🔴 All models struggle with Why questions (<5%) — open-ended reasoning is a critical gap
📊 What questions make up 24% of the dataset but average only ~27% accuracy — high-impact improvement area
9× performance gap between the best (Yes/No) and worst (Why) question types

Advanced Metrics

BERTScore (Semantic Similarity)

Model	Precision	Recall	F1
Baseline	0.9465	0.9490	0.9474
T5-Small	0.9405	0.9381	0.9389
T5-Base	0.9419	0.9383	0.9397
T5-Large	0.9393	0.9337	0.9360

All models score >0.93 BERTScore — answers are semantically correct even when they don't exactly match the ground truth string. The exact match gap between Baseline and T5 variants is partly a surface-level phrasing artifact, not a deep semantic failure.

ROUGE Scores

Model	ROUGE-1	ROUGE-2	ROUGE-L
Baseline	0.724	0.166	0.719
T5-Small	0.653	0.127	0.651
T5-Base	0.649	0.133	0.646
T5-Large	0.626	0.117	0.623

Semantic Similarity (Cosine)

Model	Mean	Std Dev
Baseline	0.796	0.288
T5-Small	0.748	0.329
T5-Base	0.749	0.328
T5-Large	0.712	0.356

Error Summary

Model	Correct / Total	Error Rate	Token Efficiency
Baseline	535 / 918	41.7%	1.00×
T5-Small	508 / 918	44.7%	2.51×
T5-Base	493 / 918	46.3%	2.61×
T5-Large	485 / 918	47.2%	2.50×

Recommendations

Use Baseline RAG when:

Accuracy is critical (58.28% vs 53–55% for T5)
Cost is not a concern
You need the highest semantic quality (0.947 BERTScore)

Use T5-Small when (recommended default):

You need the best balance of performance and cost
55% accuracy is acceptable (only 5% drop)
You want 60% token reduction at 40% of the cost
Deploying at HPC scale with high query volume

Use T5-Base when:

Minimizing tokens/cost is the top priority (62% reduction)
Volume is very high and every cent matters

Avoid T5-Large:

Lowest accuracy (52.83%) with no cost advantage
Smaller T5 models outperform it — over-compression loses key facts

Project Structure

.
├── upload.py                  # PDF ingestion script
├── query.py                   # CLI query interface
├── requirements.txt           # Linux dependencies
├── requirements_mac.txt       # Mac dependencies
├── docker-compose.yml         # Docker setup
├── .env                       # API keys (not committed)
├── index_data.json            # Exported vector DB data
└── src/
    ├── text_cleaner.py        # HTML/text cleaning
    ├── indexing.py            # Chunking + embedding
    ├── retriever.py           # Hybrid search + reranking
    ├── summarizer.py          # T5 query-aware summarization
    └── query_engine.py        # End-to-end pipeline

Key components in code:

# Text cleaning
clean_text = TextCleaner(doc.text).clean()

# Chunking
Settings.text_splitter = SentenceSplitter(
    separator=" ", chunk_size=200, chunk_overlap=50,
    paragraph_separator="\n\n\n",
    secondary_chunking_regex="[^,.;。]+[,.;。]?",
    tokenizer=tiktoken.encoding_for_model(self.model_name).encode
)

# Embedding + indexing
index, nodes = indexing.get_index()

# Reranking
self.rerank = SentenceTransformerRerank(top_n=5, model=self.model_reranker)

# Hybrid retrieval
query_engine = self.index.as_query_engine(
    similarity_top_k=5,
    vector_store_query_mode="hybrid",
    alpha=0.5,
    node_postprocessors=[self.postproc, self.rerank],
)

# Answer generation
response = Retriever(index, nodes).get_response("What is t5?")

Future Work

High Priority

Chain-of-thought prompting for Why/How questions (currently <5% accuracy)
Hybrid routing: use Baseline for complex questions, T5-Small for simple ones

Medium Priority

Named-entity-aware sentence filtering to improve Who/Where accuracy
Fine-tune T5 on HPC domain documents for better in-domain summarization

Long-term

Replace word-overlap sentence selection with semantic similarity (e.g., cosine over sentence embeddings)
Explore extractive + abstractive hybrid compression

Citation

If you use this work, please cite:

@misc{rag-token-optimization-2026,
  title   = {Efficient Token Optimization in Retrieval-Augmented Generation Pipelines for HPC},
  author  = {Rasel & Samia},
  year    = {2026},
  school  = {Iowa State University},
  note    = {CS 6250 Course Project}
}

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
.vscode		.vscode
rag		rag
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
app.py		app.py
comprehensive_rag_eval.py		comprehensive_rag_eval.py
config.py		config.py
debug_port_forward		debug_port_forward
debug_run		debug_run
docker-compose.yml		docker-compose.yml
get_comparative_results.py		get_comparative_results.py
get_dataset.py		get_dataset.py
kanagawa.json		kanagawa.json
main.py		main.py
query.py		query.py
requirements		requirements
requirements.txt		requirements.txt
requirements_mac.txt		requirements_mac.txt
stop_client		stop_client
upload.py		upload.py
utils.py		utils.py

Folders and files

Latest commit

History

Repository files navigation

Efficient Token Optimization in RAG Pipelines for HPC

Table of Contents

Overview

Motivation

Proposed Method

Pipeline

Step Details

Installation

Using Docker (recommended)

Local — Linux

Local — Mac

Environment Setup

Usage

GUI

Command Line

Export Vector DB Data

Experiment Results

Overall Performance

Cost-Benefit Analysis

Decision Matrix

Performance by Question Type

Advanced Metrics

BERTScore (Semantic Similarity)

ROUGE Scores

Semantic Similarity (Cosine)

Error Summary

Recommendations

Project Structure

Future Work

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages