# Self-Reflective Multi-Agent RAG System

This notebook goes through each part of the system step by step.

The basic idea is this: instead of doing the usual retrieve-and-generate in one shot,
we have separate agents that each handle one piece of the puzzle:

1. **Planner** — breaks down complex questions into smaller parts
2. **Retriever** — finds relevant chunks from the paper
3. **Answer agent** — generates an answer from the context
4. **Critic** — checks if the answer is actually good
5. **Revision agent** — fixes the answer if the critic isn't happy
6. **Memory** — remembers past interactions

There's also a query complexity detector that decides whether the planner
is even needed (no point decomposing simple questions).

Everything runs locally — no API keys required.

---
## 1. Setup

Make sure you've installed the requirements:
```
pip install -r requirements.txt
```

In [None]:
import sys
import os

# need the project root on the path for imports to work
root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if root not in sys.path:
    sys.path.insert(0, root)

print(f"Project root: {root}")

---
## 2. Loading the PDF

The first step is getting the text out of the PDF. We use pypdf for this.
Put your PDF at `data/research_paper.pdf` or change the path below.

In [None]:
from src.pdf_loader import load_pdf, show_pdf_info

PDF_PATH = os.path.join(root, 'data', 'research_paper.pdf')

if not os.path.exists(PDF_PATH):
    print(f"No PDF found at {PDF_PATH}")
    print("Put a research paper there or change PDF_PATH")
else:
    pdf_info = load_pdf(PDF_PATH)
    show_pdf_info(pdf_info)

---
## 3. Chunking

We split the text into smaller pieces. Each chunk becomes one vector in our search index.

The overlap (100 chars) is important — without it you lose context at the edges of chunks.
I tested with and without overlap and retrieval quality was noticeably worse without it.

Chunk size of 600 chars is a compromise. Too small and each chunk lacks context.
Too big and the embedding becomes too vague.
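
To make the overlap concrete, here is a minimal sketch of a sliding-window chunker. It is not the code in `src/chunking.py`: `sliding_chunks` is a made-up name, and the only thing borrowed from the notebook is the `text`/`start`/`end` shape of each chunk.

```python
def sliding_chunks(text, chunk_size=600, overlap=100):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        chunks.append({"text": text[start:end], "start": start, "end": end})
        if end == len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "x" * 1500
for c in sliding_chunks(sample):
    print(c["start"], c["end"])
# prints:
# 0 600
# 500 1100
# 1000 1500
```

With the defaults the window moves 500 characters at a time, so the last 100 characters of one chunk are repeated at the start of the next.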

In [None]:
from src.chunking import chunk_text, show_chunk_stats

if 'pdf_info' in dir():
    chunks = chunk_text(pdf_info['text'], chunk_size=600, overlap=100)
    show_chunk_stats(chunks)
    
    # show one chunk so we can see what they look like
    if chunks:
        c = chunks[0]
        print(f"\nChunk 0 (start={c['start']}, end={c['end']}):")
        print(c['text'][:300] + '...')
else:
    # no PDF? use some dummy text
    dummy = (
        "Machine learning is a subset of AI that builds systems that learn from data. "
        "The methodology involves training on labeled datasets. Key limitations include "
        "the need for lots of labeled data and overfitting risk. "
    ) * 10
    chunks = chunk_text(dummy, chunk_size=600, overlap=100)
    show_chunk_stats(chunks)
    print('\n(using dummy text since no PDF was loaded)')

---
## 4. Embeddings

Each chunk gets turned into a 384-dimensional vector using the all-MiniLM-L6-v2 model.
The idea is that chunks about similar topics will have vectors that are close together.
So when we embed a question and search for nearest neighbors, we get relevant chunks.

The model is about 80MB and downloads on the first run. After that it's cached.

In [None]:
from src.embeddings import get_model, make_embeddings, build_index

emb_model = get_model()
vecs = make_embeddings(chunks, emb_model)

print(f"\nShape: {vecs.shape}")
print(f"Each chunk is now a {vecs.shape[1]}-dim vector")

---
## 5. FAISS Index

We store the vectors in a FAISS IndexFlatL2 for searching. It's brute-force (it checks every
vector on each query), which is O(n*d) per search. For a single paper with maybe 100-200
chunks that's basically instant. You'd need approximate methods for larger collections.
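
Here is roughly what a flat index does on each query, written as a pure-Python sketch. `flat_l2_search` is a hypothetical helper, not part of FAISS or of this project. `IndexFlatL2` reports squared L2 distances, so the sketch skips the square root as well.

```python
def flat_l2_search(query, vectors, top_k=3):
    """Return (squared_distance, position) pairs for the top_k nearest vectors."""
    scored = []
    for pos, vec in enumerate(vectors):  # n vectors ...
        # ... times d dimensions each, hence O(n*d) per query
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, pos))
    scored.sort()  # smallest distance first
    return scored[:top_k]


# Toy 2-dimensional vectors, made up for illustration
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
print(flat_l2_search([1.0, 0.5], vectors, top_k=2))
# prints [(0.25, 1), (1.25, 0)]
```

Every query touches every stored value once, which is why this is fine for a few hundred chunks and slow for millions.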

In [None]:
index = build_index(vecs)
print(f"Index has {index.ntotal} vectors")
# float32 vectors: 4 bytes per value
print(f"Memory: ~{index.ntotal * vecs.shape[1] * 4 / 1024:.1f} KB")

---
## 6. Retrieval

Now we can search. Given a question, we embed it and find the closest chunks.
Lower L2 distance = more similar. If the distance is too high we flag it.
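
A minimal sketch of the flagging step, assuming results come back as `(distance, text)` pairs. The `flag_weak_matches` name and the `max_distance=1.2` cutoff are placeholders for illustration only; a usable threshold has to be tuned on your own paper.

```python
def flag_weak_matches(results, max_distance=1.2):
    """Mark each (distance, text) result as weak when it is beyond the cutoff."""
    flagged = []
    for distance, text in results:
        flagged.append({
            "text": text,
            "distance": distance,
            "weak": distance > max_distance,  # hypothetical cutoff, not a tuned value
        })
    return flagged


# Made-up distances for two retrieved chunks
for item in flag_weak_matches([(0.4, "methods section"), (1.9, "acknowledgements")]):
    label = "WEAK" if item["weak"] else "ok"
    print(f"{item['distance']:.1f} {label}: {item['text']}")
```

If every result comes back flagged, that is a hint the paper probably doesn't cover the question at all.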

In [None]:
from src.retrieval import find_top_chunks, build_context, show_results

test_q = "What is the methodology used in this paper?"
found = find_top_chunks(test_q, index, chunks, emb_model, top_k=3)
show_results(found)

ctx = build_context(found)
print(f"\nContext length: {len(ctx)} chars")

---
## 7. Query Complexity Detection

Before we even plan anything, we check if the query actually needs planning.

Simple queries like "what dataset was used?" don't need to be decomposed.
Complex ones like "compare methodology and limitations" do.

The detector looks for keywords: "compare", "difference", "and" (connecting topics),
"advantages and disadvantages", "limitations", etc.
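
As a rough illustration of keyword-based detection, here is a small sketch. It is not the actual `check_complexity` from `src/planner_agent.py`: the `looks_complex` name and the pattern list are assumptions built from the keywords mentioned above, and only the `is_complex`/`reason` keys mirror what the notebook uses.

```python
import re

# Hypothetical pattern list based on the keywords named in the text
COMPLEX_PATTERNS = [
    r"\bcompare\w*\b",
    r"\bdifferences?\b",
    r"\badvantages and disadvantages\b",
    r"\blimitations?\b",
    r"\band\b",
]


def looks_complex(query):
    """Return a dict saying whether the query matches any multi-part keyword."""
    text = query.lower()
    for pattern in COMPLEX_PATTERNS:
        match = re.search(pattern, text)
        if match:
            return {"is_complex": True, "reason": f"found '{match.group(0)}'"}
    return {"is_complex": False, "reason": "no multi-part keywords found"}


print(looks_complex("What dataset was used?")["is_complex"])          # False
print(looks_complex("Compare results with baseline.")["is_complex"])  # True
```

Treating a bare "and" as a signal is crude: it also fires on phrases like "training and validation loss" that are really one topic.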

In [None]:
from src.planner_agent import check_complexity

test_queries = [
    "What dataset was used?",
    "Explain the methodology and limitations.",
    "What are the advantages and disadvantages?",
    "Compare results with baseline.",
]

for q in test_queries:
    c = check_complexity(q)
    tag = 'COMPLEX' if c['is_complex'] else 'SIMPLE'
    print(f"[{tag}] {q}")
    print(f"  reason: {c['reason']}\n")

---
## 8. Planner Agent

For complex queries, the planner splits them into focused subtasks.

This really matters for retrieval quality. If you search for "methodology and limitations"
as one query, the embedding lands somewhere in between the two topics. By splitting, each
subtask gets its own focused search.
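
A minimal sketch of the splitting idea, assuming the simplest possible rule: cut the query at "and" or "versus". `split_into_subtasks` is a hypothetical helper and much cruder than whatever `plan_query` does.

```python
import re


def split_into_subtasks(query):
    """Split a compound query into focused search strings."""
    core = query.strip().rstrip("?.!")  # drop trailing punctuation
    parts = re.split(r"\s+(?:and|versus)\s+", core, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]


print(split_into_subtasks("Explain the methodology and limitations."))
# prints ['Explain the methodology', 'limitations']
```

Each string in the list is then embedded and searched separately, so "limitations" pulls in limitation chunks without being diluted by the methodology wording.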

In [None]:
from src.planner_agent import plan_query, show_plan

test = [
    "What is the problem statement?",
    "Explain the methodology and limitations.",
    "What are the contributions compared to existing work?",
]

for q in test:
    plan = plan_query(q)
    show_plan(plan)
    print()

---
## 9. Answer Generation

We use flan-t5-base to generate answers. It's a 250M parameter model from Google
that runs on CPU. Not amazing, but free and it handles Q&A decently.

The prompt tells it to only use the provided context. This helps against hallucination
but doesn't eliminate it, which is why we have the critic.
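
For illustration, a grounding prompt could be assembled like this. The wording, the `make_grounded_prompt` name and the `max_context_chars` limit are all assumptions; the prompt actually used by `build_answer` may look different.

```python
def make_grounded_prompt(question, context, max_context_chars=1500):
    """Build a prompt that asks the model to answer from the context only."""
    trimmed = context[:max_context_chars]  # crude guard against overlong input
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{trimmed}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(make_grounded_prompt("What dataset was used?", "We evaluate on a public benchmark."))
```

Giving the model an explicit way out ("say so") is the part that discourages made-up answers when retrieval misses.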

In [None]:
from src.answer_agent import build_answer, show_answer

ans = build_answer(test_q, ctx)
show_answer(ans)

---
## 10. Critic Agent

The critic checks the answer on a few things:
- Is it long enough?
- Is it grounded in the context (not hallucinated)?
- Does it actually answer the question?

I use heuristic checks as the main scoring signal because honestly, asking flan-t5-base
to evaluate its own output is not very reliable. The heuristics catch the obvious stuff
and the LLM feedback adds some qualitative notes.
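
Here is a minimal sketch of what heuristic scoring along these lines might look like. The `heuristic_score` name, the word-overlap measures and the weights are assumptions for illustration, not the logic in `src/critic_agent.py`; only the `score`/`feedback`/`needs_revision` keys follow the notebook.

```python
import re


def _words(text):
    """Lowercased set of alphanumeric tokens in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def heuristic_score(answer, context, question, min_words=15):
    """Score an answer from 0 to 10 using length, grounding and relevance."""
    answer_words = _words(answer)
    if not answer_words:
        return {"score": 0.0, "feedback": ["answer is empty"], "needs_revision": True}

    length_ok = len(answer.split()) >= min_words
    # Share of answer tokens that also appear in the retrieved context
    grounding = len(answer_words & _words(context)) / len(answer_words)
    # Share of question tokens that the answer picks up
    question_words = _words(question)
    relevance = len(answer_words & question_words) / max(len(question_words), 1)

    # Hypothetical weights: grounding counts most, length least
    score = round(10 * (0.2 * length_ok + 0.5 * grounding + 0.3 * relevance), 1)

    feedback = []
    if not length_ok:
        feedback.append(f"answer has fewer than {min_words} words")
    if grounding < 0.5:
        feedback.append("many answer words do not appear in the context")
    if relevance < 0.3:
        feedback.append("answer shares few terms with the question")
    return {"score": score, "feedback": feedback, "needs_revision": score < 7}
```

Word overlap is a blunt instrument (a paraphrased but correct answer scores low), which fits the "catch the obvious stuff" role described above.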

In [None]:
from src.critic_agent import evaluate, show_eval

ev = evaluate(ans['answer'], ctx, test_q)
show_eval(ev)

---
## 11. Revision Agent

If the score is below 7, we try to improve the answer.
It takes the original answer + the critic's feedback and generates a new version.
Max 2 rounds to avoid wasting time.

With flan-t5-base the improvements are sometimes small. A bigger model would
make this loop way more effective.
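
The control flow of the loop is simple enough to sketch. This version takes the evaluator and the reviser as plain callables so it runs without any model; `revise_until_good` and its parameters are hypothetical, and only the threshold of 7, the cap of 2 rounds and the `final_answer`/`rounds` keys come from the notebook.

```python
def revise_until_good(answer, evaluate_fn, revise_fn, threshold=7, max_rounds=2):
    """Revise the answer until the score reaches the threshold or rounds run out."""
    current = answer
    rounds = 0
    result = evaluate_fn(current)
    while result["score"] < threshold and rounds < max_rounds:
        current = revise_fn(current, result["feedback"])
        rounds += 1
        result = evaluate_fn(current)  # re-score the revised answer
    return {"final_answer": current, "rounds": rounds, "score": result["score"]}


# Stand-ins for the critic and the revision model: the score is just the word count
def fake_evaluate(text):
    return {"score": len(text.split()), "feedback": "add detail"}


def fake_revise(text, feedback):
    return text + " more detail here"


print(revise_until_good("short answer", fake_evaluate, fake_revise))
# prints {'final_answer': 'short answer more detail here more detail here', 'rounds': 2, 'score': 8}
```

Note that this sketch returns the last revision even if an earlier one scored higher; keeping the best-scoring version would be a reasonable variation.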

In [None]:
from src.revision_agent import run_revision_loop, show_revision

if ev['needs_revision']:
    rev = run_revision_loop(ans['answer'], ctx, test_q, evaluate)
    show_revision(rev)
else:
    print('Score is fine, no revision needed.')
    rev = None

---
## 12. Memory

The memory module just keeps a log of everything that happened.
Useful for follow-up questions and debugging.
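
A log like this needs very little code. The sketch below is an assumption about the general shape, not the real `Memory` class: `InteractionLog` is a made-up name, and it only expects each record to carry `query` and `final_answer` keys.

```python
class InteractionLog:
    """Append-only list of past interactions."""

    def __init__(self):
        self.entries = []

    def save(self, **record):
        """Store one interaction as a dict of whatever fields were passed."""
        self.entries.append(record)

    def get_recent(self, n=3):
        """Format the last n interactions as text for a follow-up prompt."""
        recent = self.entries[-n:] if n > 0 else []
        return "\n".join(f"Q: {e['query']}\nA: {e['final_answer']}" for e in recent)


log = InteractionLog()
log.save(query="What dataset was used?", final_answer="A public benchmark.", critic_score=8)
print(log.get_recent())
```

The formatted text can be prepended to the next prompt so a follow-up like "and how large is it?" has something to refer back to.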

In [None]:
from src.memory import Memory

mem = Memory()

final = rev['final_answer'] if rev else ans['answer']
mem.save(
    query=test_q,
    answer=ans['answer'],
    final_answer=final,
    critic_score=ev['score'],
    feedback=ev['feedback'],
    revisions=rev['rounds'] if rev else 0,
)

mem.show()
print(f"\nContext for follow-ups:\n{mem.get_recent()}")

---
## 13. Full Pipeline

Now let's run everything together using the orchestrator from main.py.
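
For orientation, this is one way the agents could be wired together. It is a sketch under assumptions, not the code behind `answer_query`: `run_pipeline` and all of its callable parameters are hypothetical, and the order simply follows the agent list at the top of the notebook.

```python
def run_pipeline(query, is_complex, plan, retrieve, answer, critique, revise, log):
    """Plan if needed, retrieve per subtask, answer, critique, revise, then log."""
    subtasks = plan(query) if is_complex(query) else [query]

    # Collect chunks for every subtask, skipping duplicates
    context_parts = []
    for task in subtasks:
        for chunk in retrieve(task):
            if chunk not in context_parts:
                context_parts.append(chunk)
    context = "\n\n".join(context_parts)

    draft = answer(query, context)
    verdict = critique(draft, context, query)
    final = revise(draft, context, query) if verdict["needs_revision"] else draft

    log.append({"query": query, "answer": draft,
                "final_answer": final, "score": verdict["score"]})
    return final
```

Passing the agents in as functions keeps the control flow testable with stubs before any model is loaded.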

In [None]:
from src.main import answer_query

# reuse what we already loaded
if 'pdf_info' in dir():
    pipe = {
        'chunks': chunks,
        'index': index,
        'model': emb_model,
        'memory': Memory(),
        'pdf_info': pdf_info,
    }
    print('Using previously loaded components.')
else:
    print('No PDF loaded. Go back to step 2.')

In [None]:
# test query 1 - problem statement
if 'pipe' in dir():
    r1 = answer_query("What is the problem statement of this paper?", pipe)

In [None]:
# test query 2 - compound query, tests the planner
if 'pipe' in dir():
    r2 = answer_query("Explain the methodology and limitations.", pipe)

In [None]:
# test query 3 - contribution comparison
if 'pipe' in dir():
    r3 = answer_query("What are the key contributions compared to existing work?", pipe)

In [None]:
# check what's in memory
if 'pipe' in dir():
    pipe['memory'].show()

---
## 14. Results Summary

| Query | Type | Subtasks | Chunks | Critic Score | Revisions |
|-------|------|----------|--------|-------------|----------|
| Problem statement | Simple | 1 | 3 | (fill in) | (fill in) |
| Methodology & limitations | Complex | 3 | 6+ | (fill in) | (fill in) |
| Contributions vs existing | Complex | 2+ | 3+ | (fill in) | (fill in) |

Fill in the scores after running with your PDF.

### What I noticed

- The planner correctly picks up compound queries and splits them
- Retrieval quality depends a lot on how well the PDF text maps to common terms
- The critic is decent at catching short or ungrounded answers
- Revision helps sometimes, but with a 250M param model the improvements are limited
- The complexity detector saves time on simple queries by skipping unnecessary planning

---
## 15. Limitations

- flan-t5-base is small. Answers are often short and lacking detail.
- Context window is ~512 tokens so we have to truncate retrieved text.
- The planner uses keyword matching which misses some edge cases.
- FAISS brute force search is O(n*d) — fine here, not scalable.
- pypdf can't handle scanned PDFs or complex layouts well.
- The critic's heuristics are approximate at best.
- Hallucination is still possible despite the grounding prompts.