# Advanced RAG Systems

## RAG over Structured Data (SQL + Graph Database)

Retrieval-Augmented Generation (RAG) over structured data enhances LLM responses by retrieving relevant information from structured sources like SQL databases or graph databases. The system translates user queries into executable SQL or graph queries (e.g., Cypher), retrieves precise data, and feeds it to the LLM for grounded and context-aware responses. This reduces hallucinations and improves factual accuracy in complex, data-driven tasks.


**Use Case**
“What drugs are most frequently prescribed for Type 2 diabetes, and what side effects are linked to each?”


### 1. Query Understanding

* **Purpose:** Normalize terminology (e.g. “Type 2 diabetes” → internal disease code).
* **Tools:** GPT‑4 or Claude for light question rewriting, ensuring consistency with your schema.
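
A minimal sketch of the normalization step, assuming a hand-maintained synonym table. The `DISEASE_SYNONYMS` entries and the `normalize_terms` helper are hypothetical. `E11` is the ICD-10 code for type 2 diabetes mellitus and is used here only as a stand-in for whatever internal code your schema expects. One option is to let the LLM do the light rewriting first and then apply a deterministic pass like this to pin the final terminology.

```python
import re

# Hypothetical synonym table: surface forms -> internal disease code.
DISEASE_SYNONYMS = {
    "type 2 diabetes": "E11",
    "type ii diabetes": "E11",
    "t2d": "E11",
    "adult-onset diabetes": "E11",
}


def normalize_terms(question: str, synonyms: dict[str, str]) -> str:
    """Replace known disease mentions with their internal code (case-insensitive)."""
    # Try longer phrases first so a long match is not pre-empted by a shorter one.
    for phrase in sorted(synonyms, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        question = pattern.sub(synonyms[phrase], question)
    return question


print(normalize_terms("Which drugs treat Type 2 diabetes?", DISEASE_SYNONYMS))
# Which drugs treat E11?
```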

### 2. Dual Structured Retrieval

The retriever first translates the user’s question into both SQL statements **and** corresponding graph-traversal patterns. Simple lookups map cleanly, but complex needs (nested joins, aggregations, or multi-hop graph paths) risk misinterpretation. To reduce hallucinated queries, break the task into logical steps (“Which tables or node types hold the data?”, “How do they join or connect?”, “What filters or traversal depths apply?”). Guiding the LLM through each sub-query with a Chain-of-Thought prompt makes it more likely to produce faithful, executable SQL and graph queries.
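
The step-by-step decomposition above can be turned into a reusable prompt template. This is a sketch only: the `build_decomposition_prompt` function and its exact wording are hypothetical, and the schema string is whatever description of tables and node types you pass in.

```python
# Sub-questions taken from the decomposition described in the text.
DECOMPOSITION_STEPS = [
    "Which tables or node types hold the data?",
    "How do they join or connect?",
    "What filters or traversal depths apply?",
]


def build_decomposition_prompt(question: str, schema: str) -> str:
    """Ask the LLM to answer each sub-question before writing SQL and Cypher."""
    steps = "\n".join(
        f"Step {i}: {step}" for i, step in enumerate(DECOMPOSITION_STEPS, start=1)
    )
    return (
        f"Schema:\n{schema}\n\n"
        f"Question: {question}\n\n"
        "Think through each step and answer it briefly before writing any query.\n"
        f"{steps}\n\n"
        "Final SQL:\n"
        "Final Cypher:"
    )


prompt = build_decomposition_prompt(
    "Which drugs treat E11 and what side effects do they have?",
    "drugs(drug_id, drug_name, disease_treated); side_effects(drug_id, effect)",
)
print(prompt)
```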


* **SQL Retriever**

  * Fetch rows from tables such as `drugs(drug_id, drug_name, disease_treated)` and `side_effects(drug_id, effect)`, joined on `drug_id`.
  * Tools: Snowflake, PostgreSQL, BigQuery.
* **Graph Retriever**

  * Traverse a medical graph (Neo4j, TigerGraph) for patterns like `(Drug)-[:TREATS]->(Disease)` → `(Drug)-[:CAUSES]->(Effect)`.
  * Tools: Neo4j Aura, Amazon Neptune.
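
A runnable sketch of both retrievers on toy data, using Python's built-in `sqlite3` as a stand-in for Snowflake, PostgreSQL, or BigQuery, and a plain list of `(source, relation, target)` edges as a stand-in for a graph database. The `drug_id` key on `drugs` is an assumption (the join needs a shared key), and the rows simply mirror the Metformin/Glipizide example used in the prompt-assembly step of this section. Against a real graph store the same traversal would be a Cypher pattern along the lines of `MATCH (e:Effect)<-[:CAUSES]-(d:Drug)-[:TREATS]->(:Disease {code: $code}) RETURN d.name, e.name`, where the labels and property names are hypothetical.

```python
import sqlite3
from collections import defaultdict


def sql_retrieve(conn: sqlite3.Connection, disease_code: str) -> list[tuple[str, str]]:
    """Return (drug_name, effect) rows for drugs that treat the given disease."""
    query = """
        SELECT d.drug_name, s.effect
        FROM drugs AS d
        JOIN side_effects AS s ON s.drug_id = d.drug_id
        WHERE d.disease_treated = ?
        ORDER BY d.drug_name, s.effect
    """
    # Bind values as parameters instead of formatting them into the SQL string.
    return conn.execute(query, (disease_code,)).fetchall()


def graph_retrieve(edges, disease_code: str) -> list[tuple[str, str]]:
    """Traverse (Drug)-[:TREATS]->(Disease), then (Drug)-[:CAUSES]->(Effect)."""
    treats = defaultdict(set)  # drug -> diseases it treats
    causes = defaultdict(set)  # drug -> effects it causes
    for source, relation, target in edges:
        if relation == "TREATS":
            treats[source].add(target)
        elif relation == "CAUSES":
            causes[source].add(target)
    pairs = []
    for drug, diseases in treats.items():
        if disease_code in diseases:
            pairs.extend((drug, effect) for effect in causes[drug])
    return sorted(pairs)


# Toy data mirroring the example in this section.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE drugs (drug_id INTEGER PRIMARY KEY, drug_name TEXT, disease_treated TEXT);
    CREATE TABLE side_effects (drug_id INTEGER, effect TEXT);
    INSERT INTO drugs VALUES (1, 'Metformin', 'E11');
    INSERT INTO side_effects VALUES (1, 'nausea'), (1, 'diarrhea');
""")
edges = [
    ("Glipizide", "TREATS", "E11"),
    ("Glipizide", "CAUSES", "hypoglycemia"),
    ("Glipizide", "CAUSES", "weight gain"),
]

print(sql_retrieve(conn, "E11"))     # [('Metformin', 'diarrhea'), ('Metformin', 'nausea')]
print(graph_retrieve(edges, "E11"))  # [('Glipizide', 'hypoglycemia'), ('Glipizide', 'weight gain')]
```

In production the SQL text itself may be LLM-generated; keeping user-derived values as bound parameters still limits what a malformed query can do.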

### 3. Evidence Aggregation & Reranking

* **Combine** SQL rows and graph paths into one ranked list of “drug → effect” pairs.
* **Rerank** with a cross-encoder (e.g., a BGE reranker) or a late-interaction model such as ColBERT to surface the top 5–10 entries.
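
A sketch of aggregation and reranking, assuming both retrievers return `(drug, effect)` pairs. The `overlap_score` function is a deliberately crude word-overlap stand-in so the example runs without a model; in a real pipeline `score` would wrap a cross-encoder (for instance, the `predict` method of a sentence-transformers `CrossEncoder`). Pairs found by both retrievers keep both source tags, which is useful later for citation.

```python
from typing import Callable

Pair = tuple[str, str]


def aggregate_evidence(sql_rows: list[Pair], graph_paths: list[Pair]):
    """Merge pairs from both retrievers, recording which source(s) produced each one."""
    sources: dict[Pair, set[str]] = {}
    for pair in sql_rows:
        sources.setdefault(pair, set()).add("SQL")
    for pair in graph_paths:
        sources.setdefault(pair, set()).add("Graph")
    return [(drug, effect, sorted(tags)) for (drug, effect), tags in sources.items()]


def overlap_score(question: str, text: str) -> float:
    """Toy relevance score: number of shared lowercase words (stand-in for a cross-encoder)."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def rerank(question: str, items, score: Callable[[str, str], float], top_n: int = 10):
    """Score each evidence item against the question and keep the top_n highest."""
    scored = [(score(question, f"{drug} causes {effect}"), (drug, effect, tags))
              for drug, effect, tags in items]
    scored.sort(key=lambda entry: entry[0], reverse=True)
    return [item for _, item in scored[:top_n]]


items = aggregate_evidence(
    [("Metformin", "nausea")],
    [("Metformin", "nausea"), ("Glipizide", "weight gain")],
)
print(rerank("which drug causes weight gain", items, overlap_score, top_n=1))
# [('Glipizide', 'weight gain', ['Graph'])]
```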

### 4. Prompt Assembly

* **Structure:**

  ```
  Question: [normalized question]  
  SQL insights: – Metformin → nausea, diarrhea  
  Graph facts: – Glipizide → hypoglycemia, weight gain  
  ```
* **Instruction:** “Answer using only these bullets, citing each as (SQL) or (Graph).”
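
A small helper that renders the prompt structure above from retrieved evidence. The function names and the `(drug, [effects])` input shape are assumptions made for illustration.

```python
def bullets(facts: list[tuple[str, list[str]]]) -> str:
    """Render (drug, effects) facts as '- Drug → effect1, effect2' lines."""
    return "\n".join(f"- {drug} → {', '.join(effects)}" for drug, effects in facts)


def assemble_prompt(question: str, sql_facts, graph_facts) -> str:
    """Build the grounded prompt: question, evidence grouped by source, then the instruction."""
    return (
        f"Question: {question}\n"
        f"SQL insights:\n{bullets(sql_facts)}\n"
        f"Graph facts:\n{bullets(graph_facts)}\n\n"
        "Answer using only these bullets, citing each as (SQL) or (Graph)."
    )


print(assemble_prompt(
    "Which drugs are most frequently prescribed for E11, and what side effects are linked to each?",
    sql_facts=[("Metformin", ["nausea", "diarrhea"])],
    graph_facts=[("Glipizide", ["hypoglycemia", "weight gain"])],
))
```

Grouping facts by source keeps the citation labels unambiguous for the model.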

### 5. LLM Generation & Attribution

* **Model:** GPT‑4, Mixtral, Claude.
* **Output style:**

  > “Metformin is common for Type 2 diabetes and may cause nausea or diarrhea (SQL). Glipizide often leads to low blood sugar and weight gain (Graph).”

### 6. Faithfulness Check

* Tools such as RAGAS (its faithfulness metric) or FActScore can re-verify each claim against the original SQL rows and graph relations, flagging inconsistencies.
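
The core idea behind such checks can be sketched without any library: every claim in the answer must match retrieved evidence from the source it cites. This is not the RAGAS or FActScore API; it assumes the claims have already been extracted from the answer (for example, by an LLM) into `(drug, effect, cited_source)` triples, and `find_unsupported` is a hypothetical helper.

```python
Triple = tuple[str, str, str]


def find_unsupported(claims: list[Triple], evidence: list[Triple]) -> list[Triple]:
    """Return claims whose (drug, effect) pair is absent from the source they cite."""
    supported = {(drug.lower(), effect.lower(), source) for drug, effect, source in evidence}
    return [
        claim for claim in claims
        if (claim[0].lower(), claim[1].lower(), claim[2]) not in supported
    ]


evidence = [
    ("Metformin", "nausea", "SQL"),
    ("Metformin", "diarrhea", "SQL"),
    ("Glipizide", "hypoglycemia", "Graph"),
]
claims = [
    ("Metformin", "nausea", "SQL"),
    ("Glipizide", "weight gain", "Graph"),  # not in the evidence list, so it is flagged
]
print(find_unsupported(claims, evidence))  # [('Glipizide', 'weight gain', 'Graph')]
```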


### Examples of Advanced RAG Systems

1. Multi-Hop RAG System with Structured + Unstructured Sources for Medical Diagnosis and QA

2. Financial RAG with Real-Time API Retrieval and Historical Document Indexing for Market Trend Analysis

3. Graph-Augmented RAG for Legal Contract Analysis and Clause Comparison

## Key Pointers for Optimizing RAG

Retrieval-Augmented Generation can be powerful, but its performance depends heavily on how you design, tune, and evaluate each stage of the pipeline.

### 1. Enhancing Data Quality and Granularity

#### Data Cleaning

Ensuring high-quality data is fundamental to the success of a RAG system: garbage in, garbage out. Effective data cleaning removes irrelevant or noisy elements such as special characters, HTML tags, and other markup that do not contribute to semantic understanding. Filtering out stop words (common, low-information words like “the” and “a”) can reduce noise for keyword-based indexes, but it can be counterproductive for dense embedding models, which rely on full sentences for context. It is equally important to eliminate unnecessary metadata and irrelevant documents altogether to prevent the retrieval of misleading or unrelated information.
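
A minimal cleaning pass using only the standard library, as a sketch of the steps above. The regular expressions and the tiny `STOP_WORDS` set are illustrative assumptions; production pipelines typically use a proper HTML parser and a curated stop-word list. Stop-word removal is off by default here, since it mainly benefits keyword indexes.

```python
import html
import re

STOP_WORDS = {"the", "a", "an", "of", "and"}  # illustrative only


def clean_text(raw: str, drop_stop_words: bool = False) -> str:
    """Remove HTML markup, entities, and stray symbols; optionally drop stop words."""
    text = re.sub(r"<[^>]+>", " ", raw)            # strip HTML tags
    text = html.unescape(text)                      # decode entities such as &amp; and &nbsp;
    text = re.sub(r"[^\w\s.,;:!?'-]", " ", text)    # replace remaining special characters
    words = text.split()                            # split() also collapses repeated whitespace
    if drop_stop_words:
        words = [w for w in words if w.lower() not in STOP_WORDS]
    return " ".join(words)


raw = "<p>The drug&nbsp;works &amp; helps!</p>"
print(clean_text(raw))                        # The drug works helps!
print(clean_text(raw, drop_stop_words=True))  # drug works helps!
```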

#### Error Correction

Correcting spelling mistakes, typos, and grammatical errors is essential to improve the quality of embeddings and ensure better semantic matching between queries and documents. Clean, error-free text allows the model to generate more accurate vector representations, which directly enhances retrieval performance.

#### Pronoun Replacement

In documents split into smaller chunks, replacing pronouns with explicit entity names helps maintain semantic clarity and context continuity. This step reduces ambiguity during retrieval, allowing the system to more accurately identify relevant content and improve the overall effectiveness of the RAG pipeline.

---

### 2. Adding Metadata

#### Purpose and Benefits

Adding metadata to chunks introduces structured, filterable attributes that significantly enhance retrieval precision. Metadata can include dates, document sections, conceptual tags, or any contextual information relevant to the content.

![RAG Adding Metadata](https://i.postimg.cc/tJQPZFjR/Metadata.webp)

#### Common Metadata Types and Use Cases

* **Dates and Recency:** Useful when sorting results by time, such as news articles or time-sensitive documents.
* **Document Sections:** Filtering by sections like “Experiment” or “Conclusion” in scientific papers focuses the retrieval on the most relevant parts.
* **Conceptual Tags:** Labels that represent topics or levels of information further refine search results.

By incorporating metadata, retrieval systems gain an additional structured search layer that complements vector similarity search, improving efficiency and accuracy.
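
A sketch of metadata pre-filtering followed by vector ranking, using toy two-dimensional vectors in place of real embeddings. The chunk layout (`id`, `vec`, `meta`) and the `filtered_search` helper are hypothetical; many vector databases expose equivalent filter expressions natively.

```python
import math
from datetime import date


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def filtered_search(query_vec, chunks, section=None, since=None, k=3):
    """Apply metadata filters first, then rank the remaining chunks by cosine similarity."""
    candidates = [
        c for c in chunks
        if (section is None or c["meta"]["section"] == section)
        and (since is None or c["meta"]["date"] >= since)
    ]
    candidates.sort(key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["id"] for c in candidates[:k]]


chunks = [
    {"id": "a", "vec": [1.0, 0.0], "meta": {"section": "Conclusion", "date": date(2024, 1, 5)}},
    {"id": "b", "vec": [0.9, 0.1], "meta": {"section": "Experiment", "date": date(2024, 3, 1)}},
    {"id": "c", "vec": [0.7, 0.7], "meta": {"section": "Conclusion", "date": date(2023, 6, 1)}},
]
print(filtered_search([1.0, 0.0], chunks, section="Conclusion"))    # ['a', 'c']
print(filtered_search([1.0, 0.0], chunks, since=date(2024, 1, 1)))  # ['a', 'b']
```

Filtering before ranking means a highly similar chunk from the wrong section or period can never crowd out a valid one.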

---

### 3. Optimizing Index Structures

#### Graph-Based Indexing

Utilizing Knowledge Graphs or Graph Neural Networks captures relationships between entities and contextual dependencies that vector indexes alone might miss. This relational information enriches the retrieval process by providing a deeper semantic understanding of document interconnections.

#### Vector Indexing and Chunking

Choosing the right chunk size is crucial for balancing retrieval relevance and computational efficiency:

* **Small Chunks (\~128 tokens):** Provide fine-grained, detailed retrieval but may risk missing critical context if top-k retrieval is limited.
* **Larger Chunks (\~512 tokens):** Offer more comprehensive context, reducing retrieval misses but may introduce irrelevant information and increase processing time.

Chunking should be adapted based on downstream tasks—larger chunks for high-level tasks like summarization and smaller chunks for detailed tasks like coding.

#### Chunk Overlap

Overlapping chunks ensure continuity of context across chunk boundaries, helping prevent loss of important semantic information during splitting, albeit at the cost of increased index size.
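
A fixed-size chunker with overlap, as a sketch. Whitespace splitting stands in for a real tokenizer here (an assumption; counts from an LLM tokenizer will differ), and `chunk_tokens` is a hypothetical helper name. With `size=128` and `overlap=16`, each chunk would start 112 tokens after the previous one.

```python
def chunk_tokens(tokens: list, size: int = 128, overlap: int = 16) -> list[list]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # this window already reaches the end
            break
    return chunks


tokens = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(tokens, size=4, overlap=1):
    print(chunk)
# ['one', 'two', 'three', 'four']
# ['four', 'five', 'six', 'seven']
# ['seven', 'eight', 'nine', 'ten']
```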

---

### 4. Advanced Chunking Techniques

#### Small2Big / Parent Document Retrieval

This technique splits documents into fine-grained child chunks, embeds them for retrieval, and keeps a link from each child to the larger parent chunk or document it came from. Initial retrieval targets the small child chunks to locate relevant information precisely. Once a child matches, its parent is retrieved to provide broader context, improving the quality of responses generated by the LLM.
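
A minimal Small2Big sketch under simplifying assumptions: parents are lists of sentences, each child chunk stores its parent id, and a word-overlap score stands in for embedding similarity. All names here (`words`, `overlap`, `build_child_index`, `small2big`) are hypothetical.

```python
import re


def words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def overlap(query: str, text: str) -> int:
    """Toy relevance: number of shared lowercase words (stand-in for vector similarity)."""
    return len(words(query) & words(text))


def build_child_index(parents: dict[str, list[str]], child_size: int = 1):
    """Split each parent into small child chunks that remember their parent id."""
    children = []
    for parent_id, sentences in parents.items():
        for i in range(0, len(sentences), child_size):
            children.append({"text": " ".join(sentences[i:i + child_size]), "parent": parent_id})
    return children


def small2big(query, children, parents, k=2):
    """Match against small children, then return the distinct parents of the top-k hits."""
    ranked = sorted(children, key=lambda c: overlap(query, c["text"]), reverse=True)
    parent_ids = []
    for child in ranked[:k]:
        if child["parent"] not in parent_ids:
            parent_ids.append(child["parent"])
    return [" ".join(parents[pid]) for pid in parent_ids]


parents = {
    "doc1": ["Metformin is a first-line drug.", "It can cause nausea.", "Doses vary by patient."],
    "doc2": ["Glipizide is a sulfonylurea.", "It can cause hypoglycemia."],
}
children = build_child_index(parents, child_size=1)
print(small2big("hypoglycemia", children, parents, k=1))
# ['Glipizide is a sulfonylurea. It can cause hypoglycemia.']
```

The match happens on the single sentence "It can cause hypoglycemia.", but the LLM receives the whole parent, including the sentence that names the drug.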

#### Sentence Window Retrieval

This method embeds individual sentences for initial retrieval and separately stores the surrounding “window context” (the neighboring sentences). After identifying the top sentences, their windows are re-integrated before being passed to the LLM. This balances the precision of targeted retrieval with the richness of contextual information.
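
A sketch of sentence window retrieval, assuming the document is already split into sentences and using a toy word-overlap score in place of sentence embeddings. `window=1` returns each matched sentence plus one neighbor on either side; the function names are hypothetical.

```python
import re


def words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def sentence_window_retrieve(query: str, sentences: list[str], window: int = 1, k: int = 1):
    """Match single sentences, then expand each hit with `window` neighbors on both sides."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: len(words(query) & words(sentences[i])), reverse=True)
    results = []
    for i in ranked[:k]:
        start, end = max(0, i - window), min(len(sentences), i + window + 1)
        results.append(" ".join(sentences[start:end]))
    return results


sentences = [
    "Section 1 introduces metformin.",
    "Metformin may cause nausea.",
    "Section 3 covers dosing.",
    "Section 4 covers glipizide.",
]
print(sentence_window_retrieve("nausea", sentences))
# ['Section 1 introduces metformin. Metformin may cause nausea. Section 3 covers dosing.']
```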

---

### 5. Retrieval Optimization

#### Query Rewriting

Rewriting or rephrasing queries using LLMs enhances the semantic alignment between the user's input and document embeddings. Since queries that seem similar to humans may have distant representations in embedding space, rewriting improves retrieval relevance by capturing implicit intent.

#### MultiQuery Retrieval

![Sub-Querying](https://i.postimg.cc/zfTqDh0F/subquery.png)

Generating multiple query variants from a single user query allows retrieval from different perspectives or subtopics. Combining results from these multiple queries increases the diversity and recall of relevant documents, especially useful for complex questions with multiple facets.
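
A sketch of the merge step in MultiQuery retrieval. The query variants would normally come from an LLM; here they are hard-coded, and `toy_retrieve` is a hypothetical stand-in for any retriever that returns ranked document ids.

```python
from typing import Callable


def multi_query_retrieve(queries: list[str], retrieve: Callable[[str, int], list[str]], k: int = 3):
    """Run every query variant and union the results, keeping first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for query in queries:
        for doc_id in retrieve(query, k):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Toy retriever backed by a lookup table (hypothetical document ids).
index = {
    "metformin side effects": ["d1", "d2"],
    "metformin gastrointestinal problems": ["d2", "d3"],
    "long-term risks of metformin": ["d4"],
}


def toy_retrieve(query: str, k: int) -> list[str]:
    return index.get(query, [])[:k]


print(multi_query_retrieve(list(index), toy_retrieve))  # ['d1', 'd2', 'd3', 'd4']
```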

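#### HyDE / Query2Doc

These methods expand short or ambiguous queries by having an LLM generate a hypothetical answer or pseudo-document. HyDE (Hypothetical Document Embeddings) embeds the generated passage in place of the raw query, while Query2Doc appends it to the original query. Both leverage LLM knowledge to bridge gaps in query specificity and improve retrieval accuracy.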
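
A sketch of HyDE with a stubbed LLM and a toy bag-of-words embedding, so it runs without any model. `fake_llm` and `embed` are hypothetical placeholders for your LLM call and embedding model. The point to notice is that the search vector comes from the generated passage, not from the raw query, which here shares almost no words with the relevant document.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (placeholder for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_search(query: str, docs: list[str], generate, k: int = 1) -> list[str]:
    """Embed an LLM-written hypothetical answer instead of the raw query, then rank docs."""
    hypothetical = generate(f"Write a short passage that answers: {query}")
    query_vec = embed(hypothetical)
    return sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)[:k]


def fake_llm(prompt: str) -> str:
    # Stand-in for an LLM call; a real model would write this passage from the prompt.
    return "Metformin commonly causes nausea and diarrhea."


docs = [
    "Glipizide can lower blood sugar.",
    "Gastrointestinal effects such as nausea and diarrhea are common with metformin.",
]
print(hyde_search("Does metformin upset the stomach?", docs, fake_llm))
```

For a Query2Doc-style variant, one would instead search with the query and the generated passage concatenated.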

#### StepBack-Prompting

StepBack-prompting encourages the LLM to first consider broader, abstract concepts before addressing specific queries. This layered approach improves reasoning and enhances responses to complex or conceptual questions.

#### Fine-tuning Embeddings

Customizing embedding models with domain-specific, synthetic datasets—generated via LLMs—improves retrieval accuracy in specialized fields by capturing nuanced terminology and context that general models might miss.

#### Hybrid Search

Combining keyword-based sparse retrievers (e.g., BM25) with dense semantic retrievers (embedding similarity) leverages the strengths of both approaches. Sparse search excels in exact keyword matches, while dense search captures semantic relevance, ensuring comprehensive and accurate retrieval across query types.
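
A self-contained sketch of hybrid search: a small Okapi BM25 scorer for the sparse side (using the Lucene-style non-negative IDF), min-max normalization, and a weighted blend with dense scores. The dense scores are toy numbers standing in for embedding similarities, and `alpha` is an assumed tuning knob (1.0 means dense only, 0.0 means BM25 only).

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of every document for the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter(term for toks in tokenized for term in set(toks))  # document frequency
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def min_max(xs: list[float]) -> list[float]:
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]


def hybrid_rank(query: str, docs: list[str], dense_scores: list[float], alpha: float = 0.5) -> list[int]:
    """Return doc indices sorted by alpha * dense + (1 - alpha) * sparse, both min-max normalized."""
    sparse = min_max(bm25_scores(query, docs))
    dense = min_max(dense_scores)
    combined = [alpha * d + (1 - alpha) * s for d, s in zip(dense, sparse)]
    return sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)


docs = [
    "error code E11 in pump firmware",
    "type 2 diabetes treatment options",
    "E11 diabetes coding guide",
]
dense = [0.1, 0.9, 0.6]  # toy similarities from a hypothetical embedding model
print(hybrid_rank("E11", docs, dense))             # [2, 1, 0]
print(hybrid_rank("E11", docs, dense, alpha=1.0))  # [1, 2, 0]
```

The exact code `E11` is invisible to the dense scores but decisive for BM25; the blend surfaces the document that does well on both.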

---

### 6. Post-Retrieval Optimization

#### Re-Ranking

Initial retrieval based on vector similarity scores may not always prioritize the most relevant documents. Applying a re-ranking step using dedicated models refines the ordering of results, boosting relevance and reducing noise. This also helps to manage context window size by filtering down the set to the top documents before input to the LLM.

#### Prompt Compression / Contextual Compression

To reduce the cost and improve the quality of LLM responses, retrieved documents are compressed by filtering out irrelevant information and highlighting pivotal paragraphs. Small language models can estimate the importance of different document parts relative to the user query, allowing selective inclusion of critical context only.
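
A crude extractive version of contextual compression, as a sketch: split retrieved text into sentences and keep only those that share words with the query. Word overlap is an assumption standing in for the small language model described above; `compress` and `min_overlap` are hypothetical names.

```python
import re


def words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def compress(query: str, document: str, min_overlap: int = 1) -> str:
    """Keep only sentences sharing at least `min_overlap` words with the query."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    query_words = words(query)
    kept = [s for s in sentences if len(query_words & words(s)) >= min_overlap]
    return " ".join(kept)


doc = ("Metformin is widely prescribed. Nausea is a reported side effect. "
       "The study ran for two years.")
print(compress("metformin nausea", doc))
# Metformin is widely prescribed. Nausea is a reported side effect.
```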

#### Modular RAG Systems

Modular RAG architectures integrate specialized modules for different pipeline components, such as search, fine-tuned retrieval, re-ranking, and compression. This approach enhances maintainability and flexibility, enabling targeted improvements.

#### RAG Fusion

RAG Fusion combines multi-query retrieval and re-ranking strategies. Multiple query perspectives retrieve a diverse document set; the resulting ranked lists are merged, commonly with Reciprocal Rank Fusion (RRF), and filtered to best match user intent, including implicit or less obvious information needs. This results in a more robust and accurate retrieval-augmented generation process.
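
A sketch of the fusion step using Reciprocal Rank Fusion, where each document's score is the sum of 1 / (k + rank) over every ranked list it appears in. `k = 60` is the constant from the original RRF paper and a common default; the input lists are toy results for three hypothetical query variants.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(d) = sum of 1 / (k + rank), with ranks starting at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Ranked doc ids returned for three LLM-generated variants of one question (toy data).
rankings = [
    ["d1", "d2", "d3"],
    ["d2", "d3", "d1"],
    ["d2", "d1"],
]
print(reciprocal_rank_fusion(rankings))  # ['d2', 'd1', 'd3']
```

Because RRF uses only ranks, it needs no score normalization across retrievers or queries, which is why it pairs well with multi-query retrieval.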

---

# Summary

Optimizing a RAG pipeline requires a holistic approach across data preparation, retrieval, and post-processing:

* Pre-retrieval focuses on clean, granular, and well-indexed data with enriched metadata.
* Retrieval employs query enhancement, multiple retrieval strategies, and domain-tuned embeddings.
* Post-retrieval sharpens results through re-ranking, compression, and modular pipeline design.

Together, these techniques ensure that the RAG system delivers relevant, precise, and contextually rich responses efficiently.



# Challenges and Limitations of RAG Systems

**1. Poor Retrieval Quality:**
RAG systems depend heavily on the retriever to find relevant documents. If the retrieval step brings back unrelated or outdated content, the generator can produce inaccurate answers. For example, asking about side effects of a new COVID-19 drug might result in misleading output if the system retrieves old or irrelevant studies.


**2. Latency Issues:**
Because RAG involves both retrieval and generation, it introduces more delay than using a standalone LLM. This two-step process—first retrieving documents, then generating an answer—makes RAG less suitable for real-time applications like live chat or mobile interfaces requiring fast responses.



**3. Scalability Concerns:**
Handling large-scale knowledge bases requires efficient indexing and significant memory. As document collections grow, maintaining fast and accurate retrieval becomes more complex. For instance, news or legal databases need constant updates and re-indexing to ensure system performance.



**4. Input Context Limitations:**
LLMs have limited context windows, so only a subset of retrieved documents can be processed at once. If important details fall outside this limit, they may be ignored. For example, a legal query requiring multiple case references might lose key info if some texts are truncated.


**5. Risk of Hallucination:**
Even with relevant documents, the model may still generate false or unsupported information. This happens when the context is vague or incomplete. For example, the model might fabricate statistics or facts not present in the source, posing risks in sensitive fields like healthcare or finance.


### Other Limitations & Considerations

* **Lack of Ground Truth** – Hard to score when multiple “right” answers exist.
* **Data Quality Issues** – Garbage in, garbage out: outdated or biased sources hurt results.
* **Metric Blind Spots** – BLEU, ROUGE, and even embedding-based similarity metrics may miss factual errors.
* **Attribution Problems** – Unclear if info comes from retrieval, model memory, or hallucination.
* **Human Evaluation Costs** – Accurate but slow and expensive.
* **Sensitivity to Setup** – Small tweaks to top-k, chunk size, or prompts can change results a lot.
* **Truthfulness Gaps** – Automated fact-checking still struggles with niche or technical queries.
* **Trust & Safety Needs** – Add PII detection, toxicity checks, and moderation for risky domains.
* **Cost Management** – Use disk-based indexes, serverless scaling, and vector compression for efficiency.

