# RAG Demo – Airline Policy Knowledge (Clean Ordered Version)

This notebook implements a minimal, student‑friendly Retrieval‑Augmented Generation (RAG) pipeline over a tiny airline policy / ticketing corpus.

Ordered Sections:
0. Requirements
1. Imports
2. Manual Gemini Key (optional) / LLM Mode Toggle
3. Ingestion (read PDFs + CSV): uses embedded policy texts and synthetic tickets if `dummy_files/` is not found
4. Preprocessing
5. Embeddings (TF‑IDF fallback; SentenceTransformer optional if installed)
6. Build Simple Vector Index (in‑memory matrix or FAISS if available)
7. Retrieval Function
8. LLM Client (Mock or Gemini)
9. RAG Assembly (prompt build + generation)
10. End-to-End Example
11. Sanity Checks
12. Notes / Next Steps

Run top → bottom. Everything works offline by default (mock LLM + TF‑IDF). Enable Gemini only if you manually paste an API key in Section 2.


In [61]:
# 11. Sanity Checks

def sanity(r: Dict[str, Any]):
    assert r['docs'], 'No retrieval'
    assert isinstance(r['answer'], str) and r['answer'].strip(), 'Empty answer'
    for d in r['docs']:
        assert d['id'] in r['prompt'], 'Doc id missing in prompt'
    assert len(r['prompt']) > 40, 'Prompt too short'
    return True

print('Sanity passed:', sanity(result))

Sanity passed: True


## 11. Sanity Checks
Lightweight asserts confirm the pipeline produces non-empty, coherent artifacts.

In [32]:
# 10. End-to-End Example
question = "How are cancellations handled and what about baggage limits?"
result = rag(question, k=4)
print("QUESTION:\n", result['question'])
print("\nRETRIEVED:")
for d in result['docs']:
    print(f"- {d['id']} (score={d['score']:.3f})")
print("\nANSWER:\n", result['answer'][:400])

QUESTION:
 How are cancellations handled and what about baggage limits?

RETRIEVED:
- ticketing_info_row0 (score=0.000)
- ticketing_info_row1 (score=0.000)
- ticketing_info_row2 (score=0.000)
- ticketing_info_row3 (score=0.000)

ANSWER:
 (Mock) Answer ONLY from context; if unknown say 'unsure'. QUESTION: How are cancellations handled and what about baggage limits? ANSWER:


## 10. End-to-End Example
Ask a composite question and view retrieved documents + answer.

In [33]:
# 9. RAG Assembly (prompt build + generation)
PROMPT_TEMPLATE = (
    "You are a concise assistant. Use only the provided CONTEXT. If the answer is unknown, say you are unsure.\n"
    "CONTEXT:\n{context}\n\nQUESTION: {question}\nANSWER:"
)

def build_prompt(question: str, docs: List[Dict[str, Any]]) -> str:
    ctx_blocks = [f"[Doc {d['id']}]\n{d['content']}" for d in docs]
    context = "\n\n".join(ctx_blocks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

def rag(question: str, k: int = 3) -> Dict[str, Any]:
    docs = retrieve(question, k=k)
    prompt = build_prompt(question, docs)
    answer = llm.generate(prompt)
    return {"question": question, "answer": answer, "docs": docs, "prompt": prompt}

example = rag("What is the baggage allowance?", k=3)
print(example['answer'][:160])
assert example['docs']

(Mock) You are a concise assistant. Use only the provided CONTEXT. If the answer is unknown, say you are unsure. QUESTION: What is the baggage allowance? ANSWER


## 9. RAG Assembly
Combine retrieval + prompt building + LLM generation into a reusable function.
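To test the assembly step in isolation, one option is to pass the retriever and generator in as arguments and substitute stubs. The sketch below assumes that approach; `rag_with`, `fake_retrieve`, and `fake_generate` are hypothetical names and are not part of the notebook. It returns the same four keys as `rag`.

```python
from typing import Any, Callable, Dict, List

TEMPLATE = "CONTEXT:\n{context}\n\nQUESTION: {question}\nANSWER:"

def build_prompt(question: str, docs: List[Dict[str, Any]]) -> str:
    # One "[Doc <id>]" header per retrieved document, blank line between blocks.
    context = "\n\n".join(f"[Doc {d['id']}]\n{d['content']}" for d in docs)
    return TEMPLATE.format(context=context, question=question)

def rag_with(question: str,
             retrieve_fn: Callable[[str, int], List[Dict[str, Any]]],
             generate_fn: Callable[[str], str],
             k: int = 3) -> Dict[str, Any]:
    # Same flow as rag(): retrieve -> build prompt -> generate.
    docs = retrieve_fn(question, k)
    prompt = build_prompt(question, docs)
    return {"question": question, "answer": generate_fn(prompt),
            "docs": docs, "prompt": prompt}

# Hypothetical stubs standing in for the real retriever and LLM client.
def fake_retrieve(query: str, k: int) -> List[Dict[str, Any]]:
    corpus = [{"id": "policy_baggage", "score": 1.0, "content": "Cabin bag up to 7 kg."}]
    return corpus[:k]

def fake_generate(prompt: str) -> str:
    return f"(stub) saw {prompt.count('[Doc ')} doc(s)"

out = rag_with("What is the cabin bag limit?", fake_retrieve, fake_generate, k=2)
print(out["answer"])
```

With stubs like these, the prompt format can be checked without building embeddings or an index.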

In [34]:
# 8. LLM Client (Mock or Gemini)
class LLMClient:
    def __init__(self, use_gemini: bool = False):
        self.use_gemini = use_gemini
        if self.use_gemini and (not HAS_GEMINI_LIB or not manual_gemini_key.strip()):
            raise RuntimeError("Gemini requested but library or key missing.")

    def generate(self, prompt: str) -> str:
        if not self.use_gemini:
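            # Heuristic: take words longer than 4 chars from the text after the last 'QUESTION:'
            # (for RAG prompts this tail also holds the 'ANSWER:' label, so 'answer' can become
            # a keyword), then keep up to 4 prompt lines that contain any of those keywords.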
            question_part = prompt.split('QUESTION:')[-1].lower()
            keywords = [w for w in re.split(r"[^a-z0-9]+", question_part) if len(w) > 4][:6]
            lines = [l.strip() for l in prompt.split('\n') if l.strip()]
            chosen = [l for l in lines if any(k in l.lower() for k in keywords)][:4]
            synthesis = ' '.join(chosen) or lines[-1]
            return f"(Mock Answer) {synthesis[:400]}".strip()
        model = genai.GenerativeModel("gemini-pro")
        resp = model.generate_content(prompt)
        return getattr(resp, 'text', str(resp))

llm = LLMClient(use_gemini=use_gemini)
print(llm.generate("CONTEXT: baggage allowance varies. QUESTION: What varies?"))

(Mock Answer) CONTEXT: baggage allowance varies. QUESTION: What varies?


## 8. LLM Client (Mock or Gemini)
A tiny wrapper: mock mode extracts salient lines; Gemini mode calls the API if enabled.
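In mock mode, the keyword step reads everything after the last `QUESTION:`, and for RAG prompts that includes the trailing `ANSWER:` label. This makes `answer` a keyword, which is why mock answers tend to start with the instruction line. A possible variant that cuts the tail at `ANSWER:` is sketched below; `extract_keywords` is a hypothetical helper, not part of the notebook.

```python
import re
from typing import List

def extract_keywords(prompt: str, min_len: int = 5, limit: int = 6) -> List[str]:
    # Keep only the text between the last "QUESTION:" and the "ANSWER:" label.
    tail = prompt.split("QUESTION:")[-1]
    question = tail.split("ANSWER:")[0].lower()
    words = [w for w in re.split(r"[^a-z0-9]+", question) if len(w) >= min_len]
    return words[:limit]

prompt = (
    "Answer ONLY from context; if unknown say 'unsure'.\n"
    "CONTEXT:\nBaggage allowance is 15 kg.\n\n"
    "QUESTION: What is the baggage allowance?\nANSWER:"
)
print(extract_keywords(prompt))  # ['baggage', 'allowance']
```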

In [35]:
# 7. Retrieval Function

def embed_query(q: str) -> np.ndarray:
    return l2_normalize(embed([clean(q)])).astype("float32")

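def retrieve(query: str, k: int = 3):
    # Clamp k: FAISS pads missing hits with index -1, which would silently select documents[-1].
    k = min(k, len(documents))
    qv = embed_query(query)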
    if HAS_FAISS and index is not None:
        scores, idxs = index.search(qv, k)
        results = []
        for score, idx in zip(scores[0], idxs[0]):
            doc = documents[idx]
            results.append({"id": doc['id'], "score": float(score), "content": doc['content']})
        return results
    # brute force
    sims = norm_embeddings @ qv.T
    sims = sims.ravel()
    order = np.argsort(-sims)[:k]
    return [{"id": documents[i]['id'], "score": float(sims[i]), "content": documents[i]['content']} for i in order]

_test = retrieve("baggage allowance", k=2)
print("Sample retrieval:", [(r['id'], round(r['score'],3)) for r in _test])
assert _test

Sample retrieval: [('ticketing_info_row0', 0.0), ('ticketing_info_row1', 0.0)]


## 7. Retrieval Function
Embed the query, search top-k by cosine similarity.
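The top-k selection can be tried on its own with a plain list of similarity scores. This is a minimal sketch that uses `heapq.nlargest` from the standard library in place of the notebook's `np.argsort`; `top_k` is a hypothetical helper name.

```python
import heapq
from typing import List, Tuple

def top_k(scores: List[float], k: int) -> List[Tuple[int, float]]:
    # Clamp k to the number of available scores, then return
    # (index, score) pairs sorted from highest to lowest score.
    k = max(0, min(k, len(scores)))
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

sims = [0.12, 0.80, 0.05, 0.43]
print(top_k(sims, 2))        # [(1, 0.8), (3, 0.43)]
print(len(top_k(sims, 10)))  # 4
```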

In [55]:
# 6. Build Simple Vector Index (normalize + optional FAISS)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

norm_embeddings = l2_normalize(embeddings)
if HAS_FAISS:
    index = faiss.IndexFlatIP(norm_embeddings.shape[1])
    index.add(norm_embeddings)
    print("FAISS index built (ntotal=", index.ntotal, ")")
else:
    index = None
    print("FAISS not available; using matrix multiply for similarity.")
assert norm_embeddings.shape == embeddings.shape

FAISS not available; using matrix multiply for similarity.


## 6. Vector Index
We normalize vectors and build either a FAISS index (if available) or keep the matrix for brute-force similarity.
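Normalization matters because the inner product of two L2-normalized vectors equals their cosine similarity, so a plain inner-product search ranks by cosine. The following pure-Python check (no NumPy, illustrative values only) shows the equality on a small example.

```python
import math
from typing import List

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v: List[float]) -> List[float]:
    # Same idea as the notebook's l2_normalize, for a single vector.
    norm = math.sqrt(dot(v, v))
    return [x / (norm + 1e-12) for x in v]

def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]   # length 5
b = [1.0, 2.0, 2.0]   # length 3
ip = dot(l2_normalize(a), l2_normalize(b))
cos = cosine(a, b)    # 11 / 15
print(round(ip, 6), round(cos, 6))
```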

In [54]:
# 5. Embeddings (SentenceTransformer optional, TF-IDF fallback)
class TFIDFEmbedder:
    def __init__(self, texts: List[str]):
        from collections import Counter
        self.tokenizer = lambda s: [w for w in re.split(r"[^a-z0-9]+", s.lower()) if w]
        token_lists = [self.tokenizer(t) for t in texts]
        df = Counter()
        for toks in token_lists:
            for t in set(toks):
                df[t] += 1
        self.vocab = {w: i for i, (w, _) in enumerate(df.items())}
        n = len(texts)
        self.idf = {w: math.log((1 + n) / (1 + c)) + 1 for w, c in df.items()}

    def encode(self, texts: List[str]) -> np.ndarray:
        m = np.zeros((len(texts), len(self.vocab)), dtype="float32")
        for row, text in enumerate(texts):
            toks = self.tokenizer(text)
            if not toks:
                continue
            tf = {}
            for t in toks:
                tf[t] = tf.get(t, 0) + 1
            for t, c in tf.items():
                if t in self.vocab:
                    col = self.vocab[t]
                    m[row, col] = (c / len(toks)) * self.idf[t]
            norm = np.linalg.norm(m[row]) + 1e-12
            m[row] /= norm
        return m

EMBED_MODEL_NAME = "all-MiniLM-L6-v2"
if HAS_ST:
    st_model = SentenceTransformer(EMBED_MODEL_NAME)
    def embed(texts: List[str]) -> np.ndarray:
        v = st_model.encode(texts, show_progress_bar=False, convert_to_numpy=True, normalize_embeddings=False)
        return v.astype("float32")
else:
    tfidf_model = TFIDFEmbedder([d['cleaned'] for d in documents])
    def embed(texts: List[str]) -> np.ndarray:  # type: ignore
        return tfidf_model.encode(texts)

corpus_texts = [d['cleaned'] for d in documents]
embeddings = embed(corpus_texts)
print("Embeddings shape:", embeddings.shape)
assert embeddings.shape[0] == len(documents)
assert embeddings.shape[1] > 0

Embeddings shape: (83, 513)


## 5. Embeddings
Use SentenceTransformer if present; otherwise a lightweight TF-IDF style fallback embedder (deterministic, zero external calls).
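The fallback weights each token by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the token. The toy corpus below is illustrative, and `encode` is a simplified stand-in for `TFIDFEmbedder.encode` without the final normalization. It also shows that a query with no in-vocabulary tokens encodes to all zeros, so every similarity for that query is 0.

```python
import math
from collections import Counter
from typing import List

corpus = ["baggage allowance policy", "cancellation policy", "baggage fees"]
token_lists = [text.split() for text in corpus]

# Document frequency: in how many documents each token appears.
df = Counter(tok for toks in token_lists for tok in set(toks))
n = len(corpus)

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {tok: math.log((1 + n) / (1 + count)) + 1 for tok, count in df.items()}
vocab = {tok: i for i, tok in enumerate(idf)}

def encode(text: str) -> List[float]:
    """Return an unnormalized TF-IDF vector; unknown tokens are ignored."""
    toks = text.split()
    vec = [0.0] * len(vocab)
    for tok, count in Counter(toks).items():
        if tok in vocab:
            vec[vocab[tok]] = (count / len(toks)) * idf[tok]
    return vec

print(round(idf["policy"], 4), round(idf["fees"], 4))  # 1.2877 1.6931
print(encode("refund timeline"))  # all zeros: no token is in the vocabulary
```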

In [53]:
# 4. Preprocessing

def clean(text: str) -> str:
    t = text.lower()
    t = re.sub(r"\s+", " ", t).strip()
    return t
for d in documents:
    d["cleaned"] = clean(d["content"])
print("Preview:", documents[0]['cleaned'][:90], '...')
assert all('cleaned' in d for d in documents)

Preview: airline baggage policy (effective 2025 q3) cabin (carry-on): each passenger may bring one  ...


## 4. Preprocessing
Minimal cleaning: lowercase, collapse whitespace. (For larger corpora you might remove boilerplate, HTML, etc.)
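For sources that carry markup, a slightly heavier cleaner could strip tags and decode entities before the lowercase and whitespace steps. This is a rough sketch under that assumption; `clean_html` is a hypothetical name, and regex tag stripping is crude compared with a real HTML parser.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")

def clean_html(text: str) -> str:
    t = TAG_RE.sub(" ", text)      # drop tags, leaving a space in their place
    t = html.unescape(t)           # decode entities such as &amp;
    return re.sub(r"\s+", " ", t.lower()).strip()

print(clean_html("<p>Cabin  Bag &amp; Personal Item</p>\n<br/>Limit: 7 KG"))
# cabin bag & personal item limit: 7 kg
```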

In [52]:
# 3. Ingestion (PDF + CSV + Generated Policies & Tickets with fallback)
"""Build a richer mixed corpus:
1. Try to read real PDFs & CSVs from dummy_files/.
2. Inject full-text policy documents (multi-paragraph) for baggage, cancellations, terms.
3. Generate a set of synthetic natural-language ticket records (more query-friendly than raw CSV rows).
Steps 2 and 3 always run, so the corpus is non-empty even when dummy_files/ is absent.
"""
from datetime import datetime, timedelta
import random

RANDOM_SEED = 42
random.seed(RANDOM_SEED)

DATA_DIR = Path("dummy_files")
policy_docs = []
raw_ticket_rows = []

# 3.1 Read PDFs if possible
if DATA_DIR.exists():
    if HAS_PDF:
        for pdf_path in sorted(DATA_DIR.glob("*.pdf")):
            try:
                reader = PdfReader(str(pdf_path))
                pages = [p.extract_text() or "" for p in reader.pages]
                text = "\n".join(pages)
                policy_docs.append({"id": f"pdf_{pdf_path.stem}", "content": text.strip()})
            except Exception as e:
                print(f"PDF read fail {pdf_path.name}: {e}")
    else:
        # Provide placeholder rich text if we cannot parse PDFs
        print("PyPDF2 not installed; using embedded policy placeholders instead of PDF extraction.")
else:
    print("dummy_files/ not found; using synthetic fallback only.")

# 3.2 Add curated realistic policy texts (always add so we have semantic material)
policy_texts = {
    "policy_baggage": """
Airline Baggage Policy (Effective 2025 Q3)

Cabin (Carry-On): Each passenger may bring one cabin bag up to 7 kg and a personal item (laptop bag or small handbag). Oversized cabin items must be checked at the gate and may incur an oversized handling fee.

Checked Allowance: Standard economy fares include 15 kg total checked baggage on domestic routes and 23 kg on international routes. FLEX and BUSINESS fares include an additional 10 kg allowance. Any single bag over 32 kg will not be accepted due to safety restrictions.

Excess & Oversize: Excess weight beyond the included allowance is charged per kg at the prevailing airport rate. Oversize sports equipment (e.g., surfboards) must be pre-declared and may attract a fixed handling fee.

Prohibited & Restricted Items: Lithium batteries, power banks, and electronic cigarettes must be carried in cabin baggage only. Sharp objects, flammable materials, and loose ammunition are strictly prohibited. Fragile items should be cushioned; the airline is not liable for minor cosmetic damage to suitcases.

Delayed / Damaged Baggage: Report issues at the baggage service desk before exiting the arrival hall. Interim purchase reimbursements require original receipts and are capped at a daily limit.
""".strip(),
    "policy_cancellation": """
Ticket Change & Cancellation Policy (Updated July 2025)

24-Hour Cooling-Off: Bookings made directly through our website can be fully refunded if canceled within 24 hours, provided departure is at least 7 days away.

Fare Classes:
- BASIC: Non-refundable after cooling-off. Changes not permitted or require full reissue at current fare.
- STANDARD: Changes allowed with a change fee plus any fare difference. Partial refund (taxes only) on cancellation.
- FLEX: Unlimited date changes (fare difference applies). Cancellation allowed with moderate fee up to 2 hours before departure.
- BUSINESS: Fully refundable and changeable until departure; no change fees (fare difference may apply).

No-Show: Failure to check in before cut-off voids remaining flight coupons; only unused refundable taxes are returned (if applicable).

Irregular Operations: If the airline cancels or significantly delays a flight (over 3 hours), passengers may request rebooking, voucher credit, or full refund regardless of fare class.
""".strip(),
    "policy_terms": """
General Conditions of Carriage (Excerpt)

Check-In Deadlines: Domestic flights close 45 minutes prior; international flights close 60 minutes prior. Boarding gates typically close 20 minutes before departure.

Travel Documents: Passengers are responsible for valid passports, visas, and health certificates. Denied boarding due to documentation issues does not entitle the traveler to a refund beyond fare rules.

Safety & Conduct: Crew instructions must be followed at all times. Abusive or disruptive behavior can result in denied boarding or diversion costs being charged to the passenger.

Liability Limits: Compensation for baggage loss or damage is limited per kilogram unless a declared value was purchased. The airline is not liable for indirect or consequential losses.

Data & Privacy: Booking data may be processed in jurisdictions with differing data protection standards. See full privacy policy on our website.
""".strip(),
}

for pid, text in policy_texts.items():
    policy_docs.append({"id": pid, "content": text})

# 3.3 Read raw CSV ticket rows (optional; kept in raw_ticket_rows for inspection, not added to `documents`)
if DATA_DIR.exists():
    for csv_path in sorted(DATA_DIR.glob("*.csv")):
        try:
            df = pd.read_csv(csv_path)
            raw_ticket_rows.append(df)
        except Exception as e:
            print(f"CSV read fail {csv_path.name}: {e}")

# 3.4 Generate enriched natural-language ticket records
FIRST_NAMES = ["Aarav", "Diya", "Rohan", "Maya", "Liam", "Aisha", "Noah", "Ishita", "Karan", "Eva", "Sofia", "Omar", "Priya", "Zara", "Leo"]
LAST_NAMES = ["Patel", "Sharma", "Das", "Khan", "Verma", "Iyer", "Nair", "Fernandes", "Roy", "Singh", "Mehta", "Ghosh"]
ORIGINS = ["BLR", "DEL", "BOM", "HYD", "MAA", "CCU", "IXC", "PNQ", "GOI"]
DESTS = ["DXB", "SIN", "DOH", "BKK", "LHR", "JFK", "MAA", "BLR", "BOM"]
FARES = ["BASIC", "STANDARD", "FLEX", "BUSINESS"]
STATUSES = ["Booked", "Cancelled", "Completed", "Delayed"]

BAGGAGE_BASE = {"BASIC": 15, "STANDARD": 15, "FLEX": 23, "BUSINESS": 32}
CHANGE_POLICY = {
    "BASIC": "no changes after booking except within 24h cooling-off",
    "STANDARD": "changes allowed with fee",
    "FLEX": "unlimited date changes (fare difference applies)",
    "BUSINESS": "fully flexible, no change fees",
}
CANCEL_POLICY = {
    "BASIC": "non-refundable beyond taxes",
    "STANDARD": "partial refund (taxes) plus fee",
    "FLEX": "refund with moderate fee",
    "BUSINESS": "fully refundable",
}

NUM_TICKETS = 80  # adjustable
start_date = datetime(2025, 6, 1)

def make_ticket(i: int) -> dict:
    fn = random.choice(FIRST_NAMES)
    ln = random.choice(LAST_NAMES)
    origin = random.choice(ORIGINS)
    dest = random.choice([d for d in DESTS if d != origin])
    depart = start_date + timedelta(days=random.randint(0, 220), hours=random.randint(0, 23), minutes=random.choice([0,15,30,45]))
    fare = random.choices(FARES, weights=[0.35, 0.30, 0.25, 0.10])[0]
    status = random.choices(STATUSES, weights=[0.65, 0.1, 0.2, 0.05])[0]
    baggage_allow = BAGGAGE_BASE[fare]
    ticket_id = 2000 + i
    pnr = ''.join(random.choices('ABCDEFGHJKMNPQRSTUVWXYZ23456789', k=6))
    base_price = {"BASIC": 3200, "STANDARD": 5200, "FLEX": 8600, "BUSINESS": 14500}[fare]
    price = base_price + random.randint(-400, 1200)
    narrative = (
        f"Ticket {ticket_id} (PNR {pnr}) for passenger {fn} {ln} travels from {origin} to {dest} on {depart.strftime('%Y-%m-%d %H:%M')} "
        f"in {fare} class. Included checked baggage allowance {baggage_allow} kg. Change policy: {CHANGE_POLICY[fare]}. "
        f"Cancellation: {CANCEL_POLICY[fare]}. Current status: {status}."
    )
    return {"id": f"ticket_{ticket_id}", "content": narrative}

synthetic_tickets = [make_ticket(i) for i in range(NUM_TICKETS)]

# 3.5 Aggregate all documents
if not policy_docs and not synthetic_tickets:
    # fallback minimal set
    documents = [
        {"id": "fallback_baggage", "content": "Baggage allowance is 15 kg for economy and 23 kg for flex fares."},
        {"id": "fallback_cancellation", "content": "Cancellations within 24 hours are refunded if departure is at least 7 days away."},
    ]
else:
    documents = policy_docs + synthetic_tickets

print(f"Policies: {len(policy_docs)}, Tickets: {len(synthetic_tickets)}, Total docs: {len(documents)}")
print("Sample doc IDs:", [d['id'] for d in documents[:5]])
# Quick diversity check
unique_fares = {fare for d in documents if ' class.' in d['content'] for fare in FARES if fare.lower() in d['content'].lower()}
print("Detected fare classes in tickets:", sorted(unique_fares))
assert documents, "No documents ingested."

PyPDF2 not installed; using embedded policy placeholders instead of PDF extraction.
Policies: 3, Tickets: 80, Total docs: 83
Sample doc IDs: ['policy_baggage', 'policy_cancellation', 'policy_terms', 'ticket_2000', 'ticket_2001']
Detected fare classes in tickets: ['BASIC', 'BUSINESS', 'FLEX', 'STANDARD']


## 3. Ingestion (read PDFs + CSV)
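Reads PDFs and CSVs from `dummy_files/` when the directory exists, then always adds three embedded policy texts and `NUM_TICKETS` synthetic ticket narratives, so the corpus is never empty. The two-document fallback is reached only if both of those lists are empty. CSV rows are loaded into `raw_ticket_rows` but are not added to `documents`.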
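If CSV rows should be searchable, each row can be flattened into a `key=value; ...` string with a per-row id. The sketch below uses only the standard `csv` module. `rows_to_docs` is a hypothetical helper, and the id pattern `<stem>_row<i>` is an assumption modelled on ids that appear in earlier outputs of this notebook.

```python
import csv
import io
from typing import Dict, List

def rows_to_docs(csv_text: str, stem: str) -> List[Dict[str, str]]:
    # One document per CSV row: "Col1=val1; Col2=val2; ..."
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for i, row in enumerate(reader):
        content = "; ".join(f"{key}={value}" for key, value in row.items())
        docs.append({"id": f"{stem}_row{i}", "content": content})
    return docs

sample = "TicketID,PNR,Status\n1000,Y6DPBH,Booked\n1001,K51FPK,Cancelled\n"
csv_docs = rows_to_docs(sample, "ticketing_info")
print(csv_docs[0])
```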

In [40]:
# 2. Manual Gemini Key (optional) / LLM Mode Toggle
use_gemini = False  # change to True to use Gemini
manual_gemini_key = ""  # Paste key here if use_gemini=True (environment ignored for clarity)
if use_gemini:
    if not manual_gemini_key.strip():
        raise EnvironmentError("Gemini mode enabled but no manual key provided.")
    if not HAS_GEMINI_LIB:
        raise ImportError("google-generativeai not installed; can't use Gemini.")
    genai.configure(api_key=manual_gemini_key.strip())
print(f"Gemini mode active: {use_gemini} (manual key provided={bool(manual_gemini_key.strip())})")

Gemini mode active: False (manual key provided=False)


## 2. Manual Gemini Key (optional) / LLM Mode Toggle
Set `use_gemini = True` only if you paste a key below. Otherwise a deterministic mock LLM is used so the pipeline always runs offline.

In [41]:
# 1. Imports
import os, re, math, random
from pathlib import Path
from typing import List, Dict, Any
import numpy as np
import pandas as pd

try:
    from sentence_transformers import SentenceTransformer  # type: ignore
    HAS_ST = True
except Exception:
    SentenceTransformer = None  # type: ignore
    HAS_ST = False

try:
    import faiss  # type: ignore
    HAS_FAISS = True
except Exception:
    faiss = None  # type: ignore
    HAS_FAISS = False

try:
    import google.generativeai as genai  # type: ignore
    HAS_GEMINI_LIB = True
except Exception:
    genai = None
    HAS_GEMINI_LIB = False

try:
    from PyPDF2 import PdfReader  # type: ignore
    HAS_PDF = True
except Exception:
    PdfReader = None  # type: ignore
    HAS_PDF = False

np.random.seed(123)
print(f"SentenceTransformer: {HAS_ST}, FAISS: {HAS_FAISS}, Gemini lib: {HAS_GEMINI_LIB}, PDF: {HAS_PDF}")

SentenceTransformer: False, FAISS: False, Gemini lib: False, PDF: False


In [42]:
# 0. Requirements (write requirements.txt)
requirements = """numpy
pandas
sentence-transformers>=2.2 # optional, will fallback if missing
faiss-cpu>=1.7 # optional
google-generativeai>=0.3.0 # only if you manually supply a key
pypdf2
"""
with open("requirements.txt", "w", encoding="utf-8") as f:
    f.write(requirements)
print("requirements.txt written. Install with: pip install -r requirements.txt")

requirements.txt written. Install with: pip install -r requirements.txt



### Simple Preprocessing
Lowercase + collapse whitespace (minimal example).

In [43]:
# 5. Preprocessing
import re

def clean(text: str) -> str:
    t = text.lower()
    t = re.sub(r"\s+", " ", t)
    return t.strip()
for d in documents:
    d['cleaned'] = clean(d['content'])
print('Cleaned sample:', documents[0]['cleaned'][:120], '...')

Cleaned sample: ticketid=1000; pnr=y6dpbh; airline=indigo; flightno=6e125; departuredatetime=2025-12-09 07:15; origin=blr; destination=d ...


## Embeddings
Use a local SentenceTransformer model for document embeddings if it is installed; otherwise fall back to the TF-IDF embedder.

In [44]:
# 6. Embeddings (SentenceTransformer or TF-IDF fallback)
class TFIDFEmbedder:
    def __init__(self, texts: List[str]):
        from collections import Counter
        self.tok = lambda s: [w for w in re.split(r"[^a-z0-9]+", s) if w]
        token_lists = [self.tok(t) for t in texts]
        df = Counter()
        for toks in token_lists:
            for t in set(toks):
                df[t] += 1
        self.vocab = {w: i for i, (w, _) in enumerate(df.items())}
        n = len(texts)
        self.idf = {w: math.log((1 + n) / (1 + c)) + 1 for w, c in df.items()}
    def encode(self, texts: List[str]) -> np.ndarray:
        m = np.zeros((len(texts), len(self.vocab)), dtype='float32')
        for i, txt in enumerate(texts):
            toks = self.tok(txt)
            if not toks: continue
            tf = {}
            for t in toks: tf[t] = tf.get(t, 0) + 1
            for t, c in tf.items():
                if t in self.vocab:
                    col = self.vocab[t]
                    m[i, col] = (c/len(toks))*self.idf[t]
            nrm = np.linalg.norm(m[i]) + 1e-9
            m[i] /= nrm
        return m

if HAS_ST:
    st_model = SentenceTransformer('all-MiniLM-L6-v2')
    def embed(texts: List[str]) -> np.ndarray:
        v = st_model.encode(texts, show_progress_bar=False, convert_to_numpy=True, normalize_embeddings=False)
        return v.astype('float32')
else:
    tfidf_model = TFIDFEmbedder([d['cleaned'] for d in documents])
    def embed(texts: List[str]) -> np.ndarray:  # type: ignore
        return tfidf_model.encode(texts)

corpus_texts = [d['cleaned'] for d in documents]
embeddings = embed(corpus_texts)
print('Embeddings shape:', embeddings.shape)
assert embeddings.shape[0] == len(documents)

Embeddings shape: (20, 159)


## Vector Index (FAISS)
Normalize vectors so that the inner product equals cosine similarity. FAISS is used if available; otherwise similarity is computed with a NumPy matrix multiply.

In [45]:
# 7. Vector Index (normalize + optional FAISS)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-9)

norm_embeddings = l2_normalize(embeddings)
if HAS_FAISS:
    index = faiss.IndexFlatIP(norm_embeddings.shape[1])
    index.add(norm_embeddings)
    print('FAISS index built:', index.ntotal)
else:
    index = None
    print('Using brute-force numpy similarity.')

Using brute-force numpy similarity.


### Retrieval Function
Embed query, search top-k, return scored docs.

In [56]:
# 8. Retrieval Test

def embed_query(q: str) -> np.ndarray:
    return l2_normalize(embed([clean(q)])).astype('float32')

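def retrieve(query: str, k: int = 3):
    # Clamp k: FAISS pads missing hits with index -1, which would silently select documents[-1].
    k = min(k, len(documents))
    qv = embed_query(query)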
    if HAS_FAISS and index is not None:
        scores, idxs = index.search(qv, k)
        return [{"id": documents[i]['id'], "score": float(scores[0][j]), "content": documents[i]['content']} for j,i in enumerate(idxs[0])]
    sims = (norm_embeddings @ qv.T).ravel()
    order = np.argsort(-sims)[:k]
    return [{"id": documents[i]['id'], "score": float(sims[i]), "content": documents[i]['content']} for i in order]

print('Test retrieval:', [(r['id'], round(r['score'],3)) for r in retrieve('baggage allowance', k=2)])

Test retrieval: [('ticket_2051', 0.112), ('ticket_2044', 0.109)]


## LLM Client Wrapper
Mock mode (default) + optional Gemini mode.

In [57]:
# 9. LLM Wrapper (mock or Gemini)
class LLMClient:
    def __init__(self, use_gemini: bool=False):
        self.use_gemini = use_gemini
        if self.use_gemini and (not HAS_GEMINI_LIB or not manual_gemini_key.strip()):
            raise RuntimeError('Gemini requested but library or manual key missing.')
    def generate(self, prompt: str) -> str:
        if not self.use_gemini:
            question_part = prompt.split('QUESTION:')[-1].lower()
            keywords = [w for w in re.split(r"[^a-z0-9]+", question_part) if len(w)>4][:6]
            lines = [l.strip() for l in prompt.split('\n') if l.strip()]
            chosen = [l for l in lines if any(k in l.lower() for k in keywords)][:5]
            synthesis = ' '.join(chosen) or lines[-1]
            return f"(Mock) {synthesis[:400]}"
        model = genai.GenerativeModel('gemini-pro')
        resp = model.generate_content(prompt)
        return getattr(resp, 'text', str(resp))

llm = LLMClient(use_gemini=use_gemini)
print('LLM test:', llm.generate('CONTEXT: baggage fees vary. QUESTION: What varies?'))

LLM test: (Mock) CONTEXT: baggage fees vary. QUESTION: What varies?


## RAG Prompt Assembly & Answer Function

In [58]:
# 10. RAG Assembly
PROMPT_TEMPLATE = (
    "Answer ONLY from context; if unknown say 'unsure'.\n"
    "CONTEXT:\n{context}\n\nQUESTION: {question}\nANSWER:"
)

def build_prompt(question: str, docs: List[Dict[str, Any]]) -> str:
    blocks = [f"[Doc {d['id']}]\n{d['content']}" for d in docs]
    return PROMPT_TEMPLATE.format(context='\n\n'.join(blocks), question=question)

def rag(question: str, k: int=3) -> Dict[str, Any]:
    docs = retrieve(question, k=k)
    prompt = build_prompt(question, docs)
    answer = llm.generate(prompt)
    return {"question": question, "answer": answer, "docs": docs, "prompt": prompt}

print(rag('What is baggage allowance policy?', k=2)['answer'][:160])

(Mock) Answer ONLY from context; if unknown say 'unsure'. [Doc policy_terms] Liability Limits: Compensation for baggage loss or damage is limited per kilogram u


## End-to-End Example

In [59]:
# 11. End-to-End Example
question = 'How are cancellations handled and what about baggage limits?'
result = rag(question, k=4)
print('QUESTION:', result['question'])
print('\nRETRIEVED:')
for d in result['docs']:
    print(f"- {d['id']} score={d['score']:.3f}")
print('\nANSWER:\n', result['answer'][:500])

QUESTION: How are cancellations handled and what about baggage limits?

RETRIEVED:
- policy_baggage score=0.276
- policy_terms score=0.126
- policy_cancellation score=0.079
- ticket_2051 score=0.011

ANSWER:
 (Mock) Answer ONLY from context; if unknown say 'unsure'. [Doc policy_baggage] Airline Baggage Policy (Effective 2025 Q3) Checked Allowance: Standard economy fares include 15 kg total checked baggage on domestic routes and 23 kg on international routes. FLEX and BUSINESS fares include an additional 10 kg allowance. Any single bag over 32 kg will not be accepted due to safety restrictions. Prohibited & Re


## Additional Sample Questions
Below are several general user-style questions to exercise the RAG pipeline across baggage, fare rules, cancellations, delays, and safety policies.

## Evaluation / Sanity Checks
Basic asserts to verify pipeline integrity.

In [50]:
# 11.5 (Optional) Inspect Prompt
print(result['prompt'][:600])

Answer ONLY from context; if unknown say 'unsure'.
CONTEXT:
[Doc ticketing_info_row0]
TicketID=1000; PNR=Y6DPBH; Airline=IndiGo; FlightNo=6E125; DepartureDateTime=2025-12-09 07:15; Origin=BLR; Destination=DXB; PassengerName=Emily Davis; Price(INR)=9000; Status=Booked

[Doc ticketing_info_row1]
TicketID=1001; PNR=K51FPK; Airline=IndiGo; FlightNo=6E833; DepartureDateTime=2025-11-23 17:45; Origin=HYD; Destination=AMD; PassengerName=Harper Patel; Price(INR)=4500; Status=Cancelled

[Doc ticketing_info_row2]
TicketID=1002; PNR=0T9NT3; Airline=IndiGo; FlightNo=6E194; DepartureDateTime=2025-07-04 11:3


In [60]:
# Additional RAG queries (diverse coverage)
sample_questions = [
    "What is the checked baggage allowance for flex and business fares?",
    "How do cancellation rules differ between basic and flex tickets?",
    "If my flight is delayed more than 3 hours what compensation or options do I have?",
    "Can I change a standard fare ticket and what fees might apply?",
    "What safety or conduct rules could get a passenger denied boarding?",
]

multi_results = []
for q in sample_questions:
    r = rag(q, k=4)
    multi_results.append(r)
    top_ids = ", ".join([d['id'] for d in r['docs']])
    print(f"Q: {q}\nTop Docs: {top_ids}\nAnswer: {r['answer'][:200]}\n---")

# Basic assertion: each query returned at least one doc
assert all(mr['docs'] for mr in multi_results), "One of the sample questions returned no documents."

Q: What is the checked baggage allowance for flex and business fares?
Top Docs: policy_baggage, policy_terms, policy_cancellation, ticket_2025
Answer: (Mock) Answer ONLY from context; if unknown say 'unsure'. [Doc policy_baggage] Airline Baggage Policy (Effective 2025 Q3) Cabin (Carry-On): Each passenger may bring one cabin bag up to 7 kg and a pers
---
Q: How do cancellation rules differ between basic and flex tickets?
Top Docs: policy_baggage, policy_terms, policy_cancellation, ticket_2016
Answer: (Mock) Travel Documents: Passengers are responsible for valid passports, visas, and health certificates. Denied boarding due to documentation issues does not entitle the traveler to a refund beyond fa
---
Q: If my flight is delayed more than 3 hours what compensation or options do I have?
Top Docs: policy_cancellation, policy_terms, ticket_2076, ticket_2007
Answer: (Mock) Answer ONLY from context; if unknown say 'unsure'. 24-Hour Cooling-Off: Bookings made directly through our website can b

## 12. Notes / Next Steps
- Add document chunking for long PDFs.
- Persist vectors (FAISS index save or use a vector DB like Chroma / pgvector).
- Add metadata filters (doc type: pdf, csv).
- Introduce reranking or hybrid search.
- Log queries & answers for evaluation.
- Add citations highlighting which doc lines support each answer.
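
As a starting point for the first item, a word-window chunker with overlap might look like the sketch below. `chunk_words` and its default sizes are hypothetical choices, not something the notebook defines. A real pipeline might split on sentences or tokens instead.

```python
from typing import List

def chunk_words(text: str, size: int = 50, overlap: int = 10) -> List[str]:
    # Slide a window of `size` words forward by (size - overlap) words each step.
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

demo = " ".join(f"w{i}" for i in range(12))
pieces = chunk_words(demo, size=5, overlap=2)
print(pieces)
```

Each chunk would then become its own document (for example with ids like `pdf_<stem>_chunk<i>`, again an assumed naming) before embedding.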

Done — you've built a minimal RAG pipeline over airline policy docs.