# Retrieval-Augmented QA (RAG)

A compact RAG system you can run locally:

- **Corpus → chunks → embeddings → index** (FAISS or TF-IDF fallback)
- **Retriever**: top-k passages for a query
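- **Answerer**: extractive by default; set `ANSWER_BACKEND = "llm"` for a small greedy-decoding LLM, which falls back to extractive if no model loads or the output looks like gibberish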
- **Citations**: bracket refs like `[0]`, `[1]` from retrieved passages

### Modes
- `BUILD_INDEX` — build & save a local index from the provided corpus  
- `ASK_ONCE` — answer one example question (or your own)  
- `CHAT_CLI` — small console chat loop (type `exit` to quit)

This notebook is **offline-tolerant** and runs on CPU/Windows.


## 🚀 Quick Start

1. Run **Install** once.
2. Pick a **MODE** in Config.
3. Run all cells top → bottom.

If model downloads are blocked, we automatically switch to **TF-IDF** retrieval and an **extractive** answerer.


In [1]:
%pip install -q sentence-transformers faiss-cpu transformers scikit-learn numpy


Note: you may need to restart the kernel to use updated packages.





In [2]:
import os, json, numpy as np
import math
import faiss
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import joblib
import re
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

In [3]:
MODE = "ASK_ONCE"             # "BUILD_INDEX" | "ASK_ONCE" | "CHAT_CLI"
INDEX_DIR = "./rag_index"
TOP_K = 5
MAX_NEW_TOKENS = 220

# Choose how answers are produced
#   "extractive"  -> deterministic: quote top passages + cite them
#   "llm"         -> greedy LLM; auto-falls back to extractive if gibberish
ANSWER_BACKEND = "extractive"   # ← set to "llm" if you really want generation

print("MODE:", MODE, "| INDEX_DIR:", INDEX_DIR, "| ANSWER_BACKEND:", ANSWER_BACKEND)


MODE: ASK_ONCE | INDEX_DIR: ./rag_index | ANSWER_BACKEND: extractive


In [4]:
# Small, curated corpus (offline). Edit or extend as you like.
DOCS = [
    {"id": "econ-1", "title": "Eurozone Q2 Growth", "text": "The eurozone economy grew by 0.3% last quarter, driven by exports and services. Analysts expect moderate momentum into next quarter."},
    {"id": "econ-2", "title": "Central Bank Update", "text": "The central bank maintained interest rates but signaled potential cuts if inflation cools further. Markets reacted cautiously."},
    {"id": "tech-1", "title": "AI Startup Funding", "text": "A startup announced record funding to scale its AI research team and expand globally. Investors cited strong product-market fit."},
    {"id": "space-1", "title": "New Exoplanet", "text": "Scientists identified a potentially habitable exoplanet in a nearby system. Follow-up studies will probe atmosphere composition."},
    {"id": "sports-1", "title": "Championship Win", "text": "The national team won the championship after a dramatic final, with a late goal in extra time sealing the victory."},
    {"id": "health-1", "title": "Sleep & Cognition", "text": "Multiple studies correlate consistent sleep schedules with improved memory consolidation and attention across age groups."},
    {"id": "env-1", "title": "Urban Trees", "text": "Expanding urban tree canopy can reduce summer peak temperatures and improve air quality at the neighborhood level."},
    {"id": "econ-3", "title": "Labor Market", "text": "Job vacancies eased slightly while participation remained steady, suggesting a gradual rebalancing of labor demand and supply."},
    {"id": "tech-2", "title": "Edge Computing", "text": "Edge computing reduces latency by processing data near the source, improving reliability for real-time applications."},
    {"id": "health-2", "title": "Hydration", "text": "Adequate hydration supports physical performance and cognitive function; mild dehydration can impair mood and alertness."},
]
len(DOCS)


10

In [5]:

def chunk_text(text, chunk_size=80, overlap=20):
    # simple word-based chunker
    words = text.split()
    chunks, i = [], 0
    while i < len(words):
        chunk = " ".join(words[i:i+chunk_size])
        chunks.append(chunk)
        i += max(1, chunk_size - overlap)
    return chunks
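
A quick way to see how the sliding window behaves is to run it on numbered tokens. The sketch below repeats `chunk_text` as defined above so it runs on its own, and adds `chunk_text_no_tail`, a hypothetical variant (not part of the notebook) that stops as soon as a window reaches the end of the text, so it never emits a final chunk made up only of words already present in the previous one.

```python
def chunk_text(text, chunk_size=80, overlap=20):
    # Same word-based chunker as in the notebook cell above.
    words = text.split()
    chunks, i = [], 0
    while i < len(words):
        chunks.append(" ".join(words[i:i + chunk_size]))
        i += max(1, chunk_size - overlap)
    return chunks


def chunk_text_no_tail(text, chunk_size=80, overlap=20):
    # Hypothetical variant: stop once a window has covered the last word.
    words = text.split()
    step = max(1, chunk_size - overlap)
    chunks, i = [], 0
    while i < len(words):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break
        i += step
    return chunks


sample = " ".join(f"w{n}" for n in range(10))  # "w0 w1 ... w9"
original_chunks = chunk_text(sample, chunk_size=4, overlap=1)
trimmed_chunks = chunk_text_no_tail(sample, chunk_size=4, overlap=1)
print(original_chunks)
print(trimmed_chunks)
```

With the short texts in `DOCS`, every document fits in a single chunk, so both versions produce the same passages; the difference only shows up once documents are longer than `chunk_size` words.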


In [6]:
def build_passages(docs, chunk_size=80, overlap=20):
    passages = []
    for di, d in enumerate(docs):
        for ci, ch in enumerate(chunk_text(d["text"], chunk_size, overlap)):
            passages.append({
                "doc_id": d["id"],
                "title": d["title"],
                "chunk_id": f"{d['id']}::chunk{ci}",
                "text": ch
            })
    return passages

In [7]:
PASSAGES = build_passages(DOCS, chunk_size=80, overlap=20)
len(PASSAGES), PASSAGES[0]

(10,
 {'doc_id': 'econ-1',
  'title': 'Eurozone Q2 Growth',
  'chunk_id': 'econ-1::chunk0',
  'text': 'The eurozone economy grew by 0.3% last quarter, driven by exports and services. Analysts expect moderate momentum into next quarter.'})

In [8]:

EMBED_BACKEND = None           # "sbert" or "tfidf"
_sbert_model = None
_faiss_index = None
_tfidf = None
_tfidf_matrix = None


In [9]:
def try_load_sbert():
    global _sbert_model, EMBED_BACKEND
    if _sbert_model is not None: return True
    try:
        _sbert_model = SentenceTransformer("all-MiniLM-L6-v2")
        EMBED_BACKEND = "sbert"
        print("[embed] using SentenceTransformers all-MiniLM-L6-v2")
        return True
    except Exception as e:
        print("[embed] SBERT unavailable:", e)
        return False

In [10]:
def fit_index(passages):
    """
    Build an index from passages.
    If SBERT is available -> FAISS IP index with L2-normalized embeddings.
    Else -> TF-IDF + cosine similarity matrix.
    """
    global _faiss_index, _tfidf, _tfidf_matrix, EMBED_BACKEND
    texts = [p["text"] for p in passages]
    if try_load_sbert():
        X = _sbert_model.encode(texts, convert_to_numpy=True)
        faiss.normalize_L2(X)
        _faiss_index = faiss.IndexFlatIP(X.shape[1])
        _faiss_index.add(X.astype(np.float32))
        EMBED_BACKEND = "sbert"
        print(f"[index] FAISS built with {len(texts)} chunks.")
        return {"backend": "sbert", "vectors": X}
    else:
        _tfidf = TfidfVectorizer(min_df=1, max_df=1.0, ngram_range=(1,2))
        _tfidf_matrix = _tfidf.fit_transform(texts)  # (n_chunks, n_features)
        EMBED_BACKEND = "tfidf"
        print(f"[index] TF-IDF built with {len(texts)} chunks.")
        return {"backend": "tfidf"}
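
The docstring's choice of an inner-product index over L2-normalized embeddings rests on a simple identity: once two vectors have unit length, their dot product equals their cosine similarity. A minimal standard-library sketch of that identity follows; the helper names and the example vectors are illustrative and not part of the notebook or of FAISS.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length; leave an all-zero vector unchanged.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Textbook cosine similarity on the raw (unnormalized) vectors.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [3.0, 4.0, 0.0]
passage = [1.0, 2.0, 2.0]

ip_after_normalizing = dot(l2_normalize(query), l2_normalize(passage))
print(round(ip_after_normalizing, 4), round(cosine(query, passage), 4))
```

This is why the scores returned by the SBERT/FAISS path can be read as cosine similarities, on the same scale as the TF-IDF fallback's `cosine_similarity` scores.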


In [11]:
def search(query, k=TOP_K):
    """
    Return top-k (indices, scores) for the current backend.
    """
    texts = [p["text"] for p in PASSAGES]
    if EMBED_BACKEND == "sbert" and _faiss_index is not None:
        qv = _sbert_model.encode([query], convert_to_numpy=True)
        faiss.normalize_L2(qv)
        D, I = _faiss_index.search(qv.astype(np.float32), min(k, len(texts)))
        return I[0].tolist(), D[0].tolist()
    elif EMBED_BACKEND == "tfidf" and _tfidf is not None:
        
        qv = _tfidf.transform([query])          # (1, n_features)
        sims = cosine_similarity(qv, _tfidf_matrix)[0]
        idx = np.argsort(-sims)[:k]
        return idx.tolist(), sims[idx].tolist()
    else:
        raise RuntimeError("Index not built. Run BUILD_INDEX mode first.")

In [12]:
def save_index(index_dir, meta):
    os.makedirs(index_dir, exist_ok=True)
    with open(os.path.join(index_dir, "passages.jsonl"), "w", encoding="utf-8") as f:
        for p in PASSAGES: f.write(json.dumps(p)+"\n")
    with open(os.path.join(index_dir, "meta.json"), "w", encoding="utf-8") as f:
        json.dump({"backend": EMBED_BACKEND}, f)
    if EMBED_BACKEND == "sbert":
        # fit_index already built the FAISS index and returned the normalized vectors,
        # so both can be written directly without re-embedding the corpus.
        faiss.write_index(_faiss_index, os.path.join(index_dir, "faiss.index"))
        np.save(os.path.join(index_dir, "vectors.npy"), meta["vectors"].astype(np.float32))
    else:
        joblib.dump(_tfidf, os.path.join(index_dir, "tfidf.joblib"))
        # matrix can be recomputed, keeping it simple
    print("[index] saved to", index_dir)

In [13]:
def load_index(index_dir):
    global PASSAGES, EMBED_BACKEND, _faiss_index, _sbert_model, _tfidf, _tfidf_matrix
    with open(os.path.join(index_dir, "passages.jsonl"), "r", encoding="utf-8") as f:
        PASSAGES = [json.loads(l) for l in f]
    with open(os.path.join(index_dir, "meta.json"), "r", encoding="utf-8") as f:
        meta = json.load(f)
    EMBED_BACKEND = meta.get("backend", "tfidf")
    texts = [p["text"] for p in PASSAGES]
    if EMBED_BACKEND == "sbert":
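        if not try_load_sbert():
            # Without the embedding model, queries cannot be encoded for the saved FAISS index.
            raise RuntimeError("Saved index uses the SBERT backend, but the model could not be loaded.")
        _faiss_index = faiss.read_index(os.path.join(index_dir, "faiss.index"))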
    else:
        _tfidf = joblib.load(os.path.join(index_dir, "tfidf.joblib"))
        _tfidf_matrix = _tfidf.transform(texts)
    print("[index] loaded backend:", EMBED_BACKEND, "| chunks:", len(PASSAGES))

In [14]:

LLM_OK, PIPE, LLM_ERR = True, None, None

def try_load_llm():
    global LLM_OK, PIPE, LLM_ERR
    if PIPE is not None or LLM_OK is False: return
    if ANSWER_BACKEND != "llm":
        LLM_OK = False
        return
    try:
        for name in ["gpt2", "sshleifer/tiny-gpt2"]:
            try:
                tok = AutoTokenizer.from_pretrained(name)
                if tok.pad_token is None: tok.pad_token = tok.eos_token
                model = AutoModelForCausalLM.from_pretrained(name)
                PIPE = pipeline("text-generation", model=model, tokenizer=tok,
                                do_sample=False, temperature=None, top_k=None, top_p=None)
                print(f"[llm] using {name} for answer synthesis")
                return
            except Exception as e:
                LLM_ERR = f"{name}: {e}"; continue
        LLM_OK = False; print("[llm] no model; using extractive fallback.")
        if LLM_ERR: print("[llm] last error:", LLM_ERR)
    except Exception as e:
        LLM_OK, LLM_ERR = False, str(e)
        print("[llm] transformers unavailable; using extractive fallback.")
        print("[llm] error:", LLM_ERR)



Extractive answerer (deterministic & cited)

In [15]:
_SENT_SPLIT = re.compile(r"(?<=[.!?])\s+")
def split_sentences(t: str):
    parts = [s.strip() for s in _SENT_SPLIT.split(t) if s.strip()]
    return parts if parts else [t.strip()]

In [16]:
def extractive_answer(question: str, passages: list, max_sents: int = 2) -> str:
    # Pick 1–2 key sentences from the top passages and cite them.
    picked = []
    for i, p in enumerate(passages[:3]):   # look at top-3 passages
        sents = split_sentences(p)
        if sents:
            picked.append(f"{sents[0]} [{i}]")
        if len(picked) >= max_sents: break
    if not picked and passages:
        picked = [passages[0][:200] + " […] [0]"]
    pref = "According to the retrieved documents, "
    return pref + " ".join(picked)

Gibberish detector for LLM outputs

In [17]:
def looks_like_gibberish(text: str) -> bool:
    # Heuristic: too many non-alnum chunks or weird unicode => gibberish
    ascii_text = text.encode("ascii", "ignore").decode("ascii")
    if not ascii_text.strip():  # everything got stripped
        return True
    tokens = re.findall(r"[A-Za-z0-9]+", ascii_text)
    ratio = (sum(len(t) for t in tokens) / max(1, len(ascii_text)))
    # Low alphanumeric density → likely junk. Repetition alone is not treated as gibberish.
    # (The earlier repeated-word check used \1 without a capturing group, which makes
    # re.search raise re.error, and both of its branches returned False anyway.)
    if ratio < 0.55:
        return True
    return False
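
To get a feel for the 0.55 threshold, it helps to compute the alphanumeric density for a few sample strings. The sketch below reimplements just the density part of the heuristic in a standalone helper; `alnum_density` is a hypothetical name, and the sample strings are made up for illustration.

```python
import re


def alnum_density(text: str) -> float:
    # Share of the ASCII-only text that belongs to alphanumeric runs.
    ascii_text = text.encode("ascii", "ignore").decode("ascii")
    if not ascii_text.strip():
        return 0.0  # nothing left after dropping non-ASCII characters
    tokens = re.findall(r"[A-Za-z0-9]+", ascii_text)
    return sum(len(t) for t in tokens) / len(ascii_text)


samples = {
    "plain sentence": "The economy grew by 0.3% last quarter.",
    "symbol noise": "!!! ??? ### $$$ %%%",
    "non-ASCII only": "日本語のテキスト",
}
densities = {name: alnum_density(s) for name, s in samples.items()}
for name, d in densities.items():
    print(f"{name}: density={d:.2f} flagged={d < 0.55}")
```

Text written entirely in a non-Latin script is reduced to an empty string by the ASCII step and is therefore always flagged, so the heuristic as written only suits English-like output.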


Prompt builder + Synthesizer

In [18]:
def build_prompt(question, passages):
    ctx = "\n\n".join([f"[{i}] {p}" for i, p in enumerate(passages)])
    return f"You are a helpful assistant. Use ONLY the context to answer and cite like [0], [1].\n\nContext:\n{ctx}\n\nQuestion: {question}\n\nAnswer:\n"


In [19]:
def synthesize_answer(question, passages):
    # 1) If forced extractive or no LLM, do extractive
    try_load_llm()
    if ANSWER_BACKEND != "llm" or (not LLM_OK or PIPE is None):
        return extractive_answer(question, passages)

    # 2) Try LLM; if it looks off, fall back to extractive
    prompt = build_prompt(question, passages)
    out = PIPE(prompt, max_new_tokens=MAX_NEW_TOKENS)[0]["generated_text"]
    ans = out.split("Answer:", 1)[-1].strip()
    if looks_like_gibberish(ans):
        return extractive_answer(question, passages)
    return ans
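
Since a small model may emit bracket markers that do not correspond to any retrieved passage, it can be worth checking citations before showing an answer. The following is a minimal sketch, assuming answers cite with the `[n]` convention from the prompt; `cited_indices` and `has_valid_citations` are hypothetical helpers, not part of the notebook.

```python
import re

_CITE = re.compile(r"\[(\d+)\]")


def cited_indices(answer: str) -> list:
    # Collect the distinct passage indices cited as [n], in order of first appearance.
    seen = []
    for m in _CITE.finditer(answer):
        idx = int(m.group(1))
        if idx not in seen:
            seen.append(idx)
    return seen


def has_valid_citations(answer: str, n_passages: int) -> bool:
    # True when the answer cites at least one passage and every index is in range.
    idxs = cited_indices(answer)
    return bool(idxs) and all(0 <= i < n_passages for i in idxs)


good = "Growth was 0.3% [0]. Rates were held [1], as noted in [0]."
bad = "Growth was 0.3% [7]."
print(cited_indices(good), has_valid_citations(good, n_passages=5))
print(cited_indices(bad), has_valid_citations(bad, n_passages=5))
```

One option is to fall back to `extractive_answer` whenever this check fails, in the same way the notebook already does for gibberish output.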

In [20]:
if MODE == "BUILD_INDEX":
    meta = fit_index(PASSAGES)
    save_index(INDEX_DIR, meta)
    print("Index built & saved.")


In [21]:
def retrieve_passages(query, k=TOP_K):
    idxs, scores = search(query, k)
    results = [(i, PASSAGES[i]["text"], PASSAGES[i]["title"], PASSAGES[i]["doc_id"], scores[j]) for j, i in enumerate(idxs)]
    return results


In [22]:
def answer(query, k=TOP_K, show=True):
    items = retrieve_passages(query, k)
    passages = [x[1] for x in items]
    ans = synthesize_answer(query, passages)
    if show:
        print("Q:", query, "\n")
        print("Answer:", ans, "\n")
        print("Sources:")
        for j, (i, text, title, doc_id, score) in enumerate(items):
            print(f"[{j}] {title} (doc={doc_id}) — score={round(float(score),3)}")
    return ans, items

In [23]:
if MODE == "ASK_ONCE":
    # build (if not already) then answer an example
    if not os.path.exists(INDEX_DIR):
        meta = fit_index(PASSAGES); save_index(INDEX_DIR, meta)
    else:
        load_index(INDEX_DIR)

    _ = answer("What happened to eurozone growth last quarter?", k=TOP_K, show=True)

elif MODE == "CHAT_CLI":
    # tiny CLI loop; type 'exit' to quit
    if not os.path.exists(INDEX_DIR):
        meta = fit_index(PASSAGES); save_index(INDEX_DIR, meta)
    else:
        load_index(INDEX_DIR)
    print("RAG chat — type 'exit' to quit.")
    while True:
        try:
            q = input("> ").strip()
        except EOFError:
            break
        if q.lower() in {"exit","quit"}: break
        _ = answer(q, k=TOP_K, show=True)


[embed] using SentenceTransformers all-MiniLM-L6-v2
[index] loaded backend: sbert | chunks: 10
Q: What happened to eurozone growth last quarter? 

Answer: According to the retrieved documents, The eurozone economy grew by 0.3% last quarter, driven by exports and services. [0] The central bank maintained interest rates but signaled potential cuts if inflation cools further. [1] 

Sources:
[0] Eurozone Q2 Growth (doc=econ-1) — score=0.687
[1] Central Bank Update (doc=econ-2) — score=0.308
[2] Labor Market (doc=econ-3) — score=0.229
[3] Championship Win (doc=sports-1) — score=0.18
[4] AI Startup Funding (doc=tech-1) — score=0.175


## 📌 Notes & Next Steps

- **Backends**: The notebook prefers `SentenceTransformers + FAISS`. If the model can’t download, it falls back to **TF-IDF**.  
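- **Answering**: By default (`ANSWER_BACKEND = "extractive"`), answers are a simple **extractive** synthesis from the top passages, with citations. Setting `ANSWER_BACKEND = "llm"` uses a small LLM with greedy decoding (`gpt2`, then `sshleifer/tiny-gpt2`) when one can be loaded, and reverts to the extractive path otherwise.  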
- **Extending**:
  - Replace `DOCS` with your own documents (longer strings are fine).  
  - Increase `chunk_size` or reduce `overlap` in the chunker to get fewer, larger chunks from long documents.  
  - Add metadata (dates/authors) to improve results display.  
- **Evaluation**:
  - Create a small list of (question, expected snippet) pairs and check that top-k retrieved passages contain the expected snippet.  
  - Optionally measure retrieval precision@k on your own labeled dataset.
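
The snippet-based check described under **Evaluation** can be sketched as a hit rate at k. The code below is a minimal, dependency-free sketch: `hit_rate_at_k`, `toy_retrieve`, `CORPUS`, and `EVAL_PAIRS` are hypothetical names, and the word-overlap retriever is only a stand-in so the example runs on its own.

```python
def hit_rate_at_k(eval_pairs, retrieve, k=5):
    # Fraction of questions whose top-k passages contain the expected snippet.
    hits = 0
    for question, expected in eval_pairs:
        passages = retrieve(question, k)
        if any(expected.lower() in p.lower() for p in passages):
            hits += 1
    return hits / len(eval_pairs) if eval_pairs else 0.0


# Stand-in corpus and retriever (word overlap), for illustration only.
CORPUS = [
    "The eurozone economy grew by 0.3% last quarter, driven by exports and services.",
    "Edge computing reduces latency by processing data near the source.",
    "Adequate hydration supports physical performance and cognitive function.",
]


def toy_retrieve(query, k):
    q = set(query.lower().split())
    ranked = sorted(CORPUS, key=lambda p: len(q & set(p.lower().split())), reverse=True)
    return ranked[:k]


EVAL_PAIRS = [
    ("How much did the eurozone economy grow?", "grew by 0.3%"),
    ("What does edge computing reduce?", "reduces latency"),
]
score = hit_rate_at_k(EVAL_PAIRS, toy_retrieve, k=1)
print(f"hit rate@1 = {score:.2f}")
```

To evaluate the notebook's own retriever, one could pass `lambda q, k: [item[1] for item in retrieve_passages(q, k)]` as `retrieve`, since each tuple returned by `retrieve_passages` carries the passage text in position 1.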

