# Week 3: LLM Integration and Prompts

**Pipeline continuity:**
- **Week 1 output:** semantic-quality winner was `MPNet` + cosine metric.
- **Week 2 output:** production-stable retriever in this environment uses `hashing-768-stable` + FAISS `IndexFlatIP`.
- **Week 3 focus:** keep Week 2 runtime retriever unchanged and add generation (prompting + local LLM).

**Goal:** end-to-end RAG flow:
`query -> retrieve context (Week 2 runtime) -> build prompt -> LLM answer`.

**Important:** Week 3 is aligned with Week 2 execution constraints (Python 3.13 Jupyter stability), so retrieval here uses the same stable fallback embedder.


In [1]:
import json
import time
import hashlib
import os
from pathlib import Path
import numpy as np
import pandas as pd
import faiss
from sklearn.feature_extraction.text import HashingVectorizer
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaLLM
from langchain_core.prompts import PromptTemplate

os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
faiss.omp_set_num_threads(1)


In [2]:
# Same paths as Week 2 (notebook is in notebooks/)
PROJECT_ROOT = Path("..").resolve()
DATA_DIR = PROJECT_ROOT / "data"
ARTIFACTS_DIR = PROJECT_ROOT / "artifacts"
ARTIFACTS_DIR.mkdir(parents=True, exist_ok=True)

W3_EXPERIMENTS_CSV = ARTIFACTS_DIR / "week3_prompt_experiments.csv"
W3_GENERATIONS_JSONL = ARTIFACTS_DIR / "week3_generations.jsonl"

print(f"Project root: {PROJECT_ROOT}")
print(f"Data dir: {DATA_DIR} (exists: {DATA_DIR.exists()})")
print(f"Artifacts: {ARTIFACTS_DIR}")


Project root: /Users/tkhamidulin/Desktop/First Project - RAG
Data dir: /Users/tkhamidulin/Desktop/First Project - RAG/data (exists: True)
Artifacts: /Users/tkhamidulin/Desktop/First Project - RAG/artifacts


---
## From Week 1 and Week 2: Retrieval Pipeline

**From Week 1:** MPNet + cosine was the quality winner during embedding-model evaluation.

**From Week 2 (runtime decision):** due to native kernel crashes with `sentence-transformers/torch` in this local environment, retrieval switched to a stable fallback embedder (`hashing-768-stable`) while keeping FAISS and cosine-equivalent ranking.

Week 3 reuses that **exact runtime retriever** so experiments with prompts/LLM stay reproducible.
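
The "cosine-equivalent ranking" claim can be checked with a small standard-library sketch. It assumes a simplified hashing embedder: `hashed_bow` is a hypothetical helper, and `md5` stands in for the MurmurHash3 function that scikit-learn's `HashingVectorizer` uses. The point is only that once vectors are L2-normalized, an inner-product index ranks exactly as cosine similarity would.

```python
import hashlib
import math
import re

def hashed_bow(text: str, n_features: int = 16) -> list[float]:
    """Hash each lowercase token into one of n_features buckets, then L2-normalize."""
    vec = [0.0] * n_features
    for token in re.findall(r"\w\w+", text.lower()):
        # md5 is a stand-in here; scikit-learn's HashingVectorizer uses MurmurHash3.
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % n_features
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

def dot(a, b):
    """Plain inner product, the score an inner-product index returns."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from scratch, without assuming unit length."""
    na = math.sqrt(dot(a, a))
    nb = math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

doc = hashed_bow("FAISS builds a flat inner product index")
query = hashed_bow("inner product index in FAISS")

# On unit-length vectors the inner product and cosine similarity coincide.
print(round(dot(doc, query), 6), round(cosine(doc, query), 6))
```

Because `norm="l2"` is set on the vectorizer in the cell below, the same reasoning applies to the real embedder.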


In [3]:
# --- From Week 2: load PDFs and chunking ---
def load_pdfs(data_dir: Path):
    docs = []
    for topic_dir in sorted(data_dir.iterdir()):
        if not topic_dir.is_dir():
            continue
        topic = topic_dir.name
        for pdf_path in sorted(topic_dir.glob("*.pdf")):  # sorted for a reproducible chunk order
            try:
                loader = PyPDFLoader(str(pdf_path))
                for doc in loader.load():
                    doc.metadata["topic"] = topic
                    doc.metadata["source"] = pdf_path.name
                    docs.append(doc)
            except Exception as e:
                print(f"  ERROR: {pdf_path.name} - {e}")
    return docs

def chunk_documents(docs, chunk_size: int, chunk_overlap: int, separators=None):
    if separators is None:
        separators = ["\n\n", "\n", ". ", " ", ""]
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, separators=separators)
    return splitter.split_documents(docs)

# --- From Week 2: stable embedder + VectorStore + RAG ---
EMBEDDING_MODEL_ID = "hashing-768-stable"

class StableEmbedder:
    def __init__(self, n_features: int = 768):
        self.vectorizer = HashingVectorizer(
            n_features=n_features,
            norm="l2",
            alternate_sign=False,
            lowercase=True,
        )

    def encode(self, texts):
        if isinstance(texts, str):
            texts = [texts]
        arr = self.vectorizer.transform(texts).toarray()
        return np.ascontiguousarray(arr, dtype=np.float32)

class VectorStore:
    def __init__(self, model_id: str = EMBEDDING_MODEL_ID, batch_size: int = 64):
        self.model_id = model_id
        self.batch_size = batch_size
        self.model = StableEmbedder(n_features=768)
        self.chunks = []
        self._index = None

    def build_index(self, chunks):
        self.chunks = list(chunks)
        if not self.chunks:
            raise ValueError("Cannot index: chunks list is empty")
        texts = [c.page_content for c in self.chunks]
        self._index = None
        for i in range(0, len(texts), self.batch_size):
            batch = texts[i:i + self.batch_size]
            emb = self.model.encode(batch)
            if self._index is None:
                self._index = faiss.IndexFlatIP(emb.shape[1])
            self._index.add(emb)

    def retrieve(self, query: str, top_k: int = 3):
        if self._index is None or self._index.ntotal == 0:
            raise ValueError("Index is empty; call build_index(chunks) first")
        q = self.model.encode([query])
        k = min(top_k, self._index.ntotal)
        scores, indices = self._index.search(q, k)
        return scores[0], indices[0]

class RAG:
    def __init__(self, vector_store: VectorStore):
        self.vector_store = vector_store

    def retrieve(self, query: str, top_k: int = 3):
        scores, indices = self.vector_store.retrieve(query, top_k=top_k)
        return [(float(s), self.vector_store.chunks[i]) for s, i in zip(scores, indices)]

print("Loaded: load_pdfs, chunk_documents, VectorStore, RAG. Embedding model:", EMBEDDING_MODEL_ID)


Loaded: load_pdfs, chunk_documents, VectorStore, RAG. Embedding model: hashing-768-stable


In [4]:
# Run Week 2 runtime pipeline: load PDFs -> chunk -> build index -> RAG
raw_docs = load_pdfs(DATA_DIR)
CHUNK_CONFIG = {"chunk_size": 300, "chunk_overlap": 50}
chunks = chunk_documents(raw_docs, **CHUNK_CONFIG)

store = VectorStore()
store.build_index(chunks)
rag = RAG(store)

print(f"Documents: {len(raw_docs)} pages -> {len(chunks)} chunks ({CHUNK_CONFIG})")
print(f"Retriever ready. Embedding model: {EMBEDDING_MODEL_ID}")


incorrect startxref pointer(1)
parsing for Object Streams
Error -3 while decompressing data: incorrect header check
found 0 objects within Object(775,0) whereas 200 expected
Error -3 while decompressing data: incorrect header check
found 0 objects within Object(776,0) whereas 20 expected
Cannot find "/Root" key in trailer
Searching object with "/Catalog" key
Ignoring wrong pointing object 11 0 (offset 0)


  ERROR: A-Complete-Guide-to-the-Google-Cloud-Platform.pdf - Cannot find Root object in pdf
Documents: 101 pages -> 833 chunks ({'chunk_size': 300, 'chunk_overlap': 50})
Retriever ready. Embedding model: hashing-768-stable


## Local LLM (Ollama) via LangChain

This notebook uses a local model through Ollama to keep experiments reproducible and offline-friendly.

Example setup (outside notebook):
- `ollama pull gemma3:4b` (the model set in `MODEL_NAME`; another model works if `MODEL_NAME` is changed to match)
- Ensure Ollama is running (default: `http://localhost:11434`)
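
Before building the LLM it can help to confirm that the server is reachable and the model has been pulled. The sketch below queries Ollama's `GET /api/tags` endpoint, which lists local models. The helper names `parse_model_names` and `list_local_models` are hypothetical, and the sample payload is hand-written for illustration rather than captured from a real server.

```python
import json
import urllib.request

def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from an Ollama /api/tags response body."""
    return [m["name"] for m in payload.get("models", []) if "name" in m]

def list_local_models(base_url: str = "http://localhost:11434", timeout: float = 2.0):
    """Return locally available model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (OSError, ValueError):
        # URLError and timeouts are OSError subclasses; malformed JSON raises ValueError.
        return None

# Offline illustration with a hand-written payload (hypothetical, not real server output).
sample = {"models": [{"name": "gemma3:4b"}, {"name": "llama3.1:latest"}]}
print("gemma3:4b" in parse_model_names(sample))
```

In the notebook, `list_local_models()` returning `None` would suggest Ollama is not running, while a list without the configured model would suggest a missing `ollama pull`.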

In [5]:
MODEL_NAME = "gemma3:4b"   # exactly as in `ollama list`
OLLAMA_BASE_URL = "http://localhost:11434"
TEMPERATURE_DEFAULT = 0.2

def build_llm(temperature: float) -> OllamaLLM:
    """
    Build a LangChain-compatible Ollama LLM.
    Using a factory function avoids duplication across experiments.
    """
    return OllamaLLM(
        model=MODEL_NAME,
        temperature=temperature,
        base_url=OLLAMA_BASE_URL,
        validate_model_on_init=True,  # fail fast if model is missing
    )

# Smoke test
build_llm(TEMPERATURE_DEFAULT).invoke("Reply: OK")


'OK\n'

## Retrieval: Week 2 Runtime Pipeline Inside Week 3

Context for the LLM is retrieved locally via `rag.retrieve(query, top_k=3)`.
This is the same runtime pipeline that was fixed in Week 2 (stable hashing embedder + FAISS).


In [6]:
def get_retrieval_for_query(query: str, top_k: int = 3) -> dict:
    """Use Week 2 pipeline: rag.retrieve() returns (score, chunk). Build row for prompt."""
    retrieved = rag.retrieve(query, top_k=top_k)
    chunks = [c.page_content for _, c in retrieved]
    scores = [s for s, _ in retrieved]
    return {"query": query, "chunks": chunks, "scores": scores, "meta": {}}

# Example: one query to verify retrieval works
example = get_retrieval_for_query("What is RAG?")
print(f"Example retrieval: query={example['query'][:40]}...")
print(f"  Top-k chunks: {len(example['chunks'])}, scores: {[round(s, 4) for s in example['scores']]}")
print(f"  First chunk preview: {example['chunks'][0][:80]}...")


Example retrieval: query=What is RAG?...
  Top-k chunks: 3, scores: [0.4629, 0.3825, 0.379]
  First chunk preview: RAG makes LLMs better and
equal
Amnon, Roy, Ilai, Nathan, Amir
Products
Vector D...


In [7]:
def format_context(chunks: list[str], scores: list[float] | None = None, max_chars: int = 7000) -> str:
    parts = []
    for i, ch in enumerate(chunks):
        score_str = f" (score={scores[i]:.4f})" if scores is not None else ""
        parts.append(f"[Chunk {i+1}{score_str}]\n{ch}")
    ctx = "\n\n".join(parts)
    return ctx[:max_chars]
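
`format_context` enforces `max_chars` by slicing the joined string, which can cut the last chunk mid-sentence. A possible alternative, sketched here under the assumption that dropping a whole chunk is preferable to truncating it, stops before the first chunk that would exceed the budget. The function name `format_context_whole_chunks` is hypothetical.

```python
def format_context_whole_chunks(chunks, scores=None, max_chars=7000):
    """Like format_context, but stops before a chunk that would exceed max_chars."""
    parts, used = [], 0
    for i, ch in enumerate(chunks):
        score_str = f" (score={scores[i]:.4f})" if scores is not None else ""
        block = f"[Chunk {i+1}{score_str}]\n{ch}"
        sep = 2 if parts else 0  # length of the "\n\n" joiner
        if used + sep + len(block) > max_chars:
            break
        parts.append(block)
        used += sep + len(block)
    return "\n\n".join(parts)

# Three 30-character chunks with a 90-character budget: only two fit whole.
demo = format_context_whole_chunks(["a" * 30, "b" * 30, "c" * 30], max_chars=90)
print(demo.count("[Chunk"))
```

One trade-off: if the first chunk alone is longer than `max_chars`, this variant returns an empty context, so the budget has to be at least one chunk wide.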


## Prompt Templates

I define three task-specific prompt templates:

1) **Strict Q&A**  
   - Answer using ONLY context  
   - If missing: exact refusal message  
   - Require chunk citations

2) **Structured Summary**  
   - Extract key points, definitions, practical notes, and gaps

3) **Grounded Reasoning**  
   - Provide short reasoning steps tied to evidence  
   - Avoid outside knowledge


In [8]:
QA_STRICT_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "You are a RAG assistant.\n"
        "You MUST answer using ONLY the provided CONTEXT.\n"
        "If the answer is not in the context, reply exactly:\n"
        "\"I don't know based on the provided context.\"\n\n"
        "CONTEXT:\n{context}\n\n"
        "QUESTION:\n{question}\n\n"
        "RULES:\n"
        "- No outside knowledge.\n"
        "- Be concise.\n"
        "- Cite chunks like [Chunk 2].\n\n"
        "ANSWER:\n"
    )
)

SUMMARY_PROMPT = PromptTemplate(
    input_variables=["context", "topic"],
    template=(
        "Summarize the provided context about: {topic}\n\n"
        "CONTEXT:\n{context}\n\n"
        "OUTPUT FORMAT:\n"
        "- Key points (3–7 bullets)\n"
        "- Definitions (if any)\n"
        "- Practical notes (if any)\n"
        "- Missing information / open questions (if any)\n\n"
        "SUMMARY:\n"
    )
)

REASONING_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "You are an analyst. Use ONLY the provided context.\n"
        "Do not add outside knowledge.\n\n"
        "CONTEXT:\n{context}\n\n"
        "QUESTION:\n{question}\n\n"
        "OUTPUT FORMAT:\n"
        "1) Answer (1–3 sentences)\n"
        "2) Evidence: cite chunks like [Chunk 1]\n"
        "3) Reasoning: 3–6 short bullet steps tied to evidence\n\n"
        "RESPONSE:\n"
    )
)
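
Each template declares its `input_variables` by hand, so a typo in a placeholder would only surface at format time. As an optional pre-flight check, the sketch below compares declared names with the `{fields}` actually present in a format-style string. It uses only the standard library, the helper names are hypothetical, and it runs on a shortened demo template rather than the templates above.

```python
from string import Formatter

def placeholders(template: str) -> set[str]:
    """Return the set of named {fields} used in a format-style template string."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}

def check_template(template: str, declared: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means declared and used fields match."""
    used, want = placeholders(template), set(declared)
    problems = [f"declared but unused: {v}" for v in sorted(want - used)]
    problems += [f"used but undeclared: {v}" for v in sorted(used - want)]
    return problems

demo_template = "CONTEXT:\n{context}\n\nQUESTION:\n{question}\n\nANSWER:\n"
print(check_template(demo_template, ["context", "question"]))
print(check_template(demo_template, ["context", "topic"]))
```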


In [9]:
# Retrieval is done by get_retrieval_for_query(question) — no find_retrieval_row needed.


In [10]:
def stable_id(*parts) -> str:
    s = "||".join(map(str, parts))
    return hashlib.sha256(s.encode("utf-8")).hexdigest()[:16]

def run_generation(
    question: str,
    retrieval_row: dict,
    prompt: PromptTemplate,
    template_name: str,
    temperature: float,
    topic: str = "RAG notes"
) -> dict:
    llm = build_llm(temperature)

    context = format_context(retrieval_row["chunks"], retrieval_row["scores"])
    kwargs = {"context": context}

    if "question" in prompt.input_variables:
        kwargs["question"] = question
    if "topic" in prompt.input_variables:
        kwargs["topic"] = topic

    prompt_text = prompt.format(**kwargs)

    t0 = time.perf_counter()  # monotonic clock, better suited to measuring intervals
    answer = llm.invoke(prompt_text)
    latency = round(time.perf_counter() - t0, 3)

    answer = (answer or "").strip()

    return {
        "run_id": stable_id(question, template_name, MODEL_NAME, temperature),
        "question": question,
        "template": template_name,
        "model": MODEL_NAME,
        "temperature": temperature,
        "latency_s": latency,
        "prompt_chars": len(prompt_text),
        "context_chars": len(context),
        "n_chunks": len(retrieval_row["chunks"]),
        "retrieval_query_used": retrieval_row["query"],
        "answer": answer,
        "idk_flag": ("i don't know based on the provided context" in answer.lower()),
        "has_citation_flag": ("[chunk" in answer.lower()),  # lightweight heuristic
    }
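
The two flags above are plain substring checks. The recorded answers contain curly apostrophes (for example "Here’s"), so a refusal written as "I don’t know..." would not match the straight-apostrophe string. The sketch below is one way to make both checks more tolerant. The names `normalize`, `is_refusal`, and `cited_chunks` are hypothetical, and the regular expression assumes citations start with `[Chunk N`.

```python
import re

REFUSAL = "i don't know based on the provided context"
CITATION_RE = re.compile(r"\[chunk\s*(\d+)", re.IGNORECASE)

def normalize(text: str) -> str:
    """Lowercase, map curly apostrophes to straight ones, collapse whitespace."""
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    return re.sub(r"\s+", " ", text).strip().lower()

def is_refusal(answer: str) -> bool:
    """True if the normalized answer contains the exact refusal sentence."""
    return REFUSAL in normalize(answer)

def cited_chunks(answer: str) -> list[int]:
    """Return the sorted, de-duplicated chunk numbers that open a [Chunk N citation."""
    return sorted({int(n) for n in CITATION_RE.findall(answer)})

print(is_refusal("I don\u2019t know based on the provided context."))
print(cited_chunks("RAG retrieves documents [Chunk 2] before generating [chunk 1]."))
```

Returning chunk numbers rather than a boolean also makes it possible to count how many distinct chunks an answer relies on. A combined citation such as `[Chunk 1, 2]` would only yield the first number with this pattern.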


## Test Set

A small, diverse set of questions is used to compare prompt behavior:
- definitions (low risk)
- “why” questions (higher hallucination risk)
- comparison questions
- pipeline-level questions

This is a controlled prompt experiment, not a full benchmark yet.


In [11]:
test_questions = [
    "What is Retrieval-Augmented Generation (RAG)?",
    "Why do we use chunk overlap in retrieval systems?",
    "Explain cosine similarity vs dot product similarity for embeddings.",
    "What is the role of FAISS in a RAG pipeline?",
    "When would you increase chunk size and why?",
]


In [12]:
templates = [
    ("qa_strict", QA_STRICT_PROMPT),
    ("summary", SUMMARY_PROMPT),
    ("reasoning", REASONING_PROMPT),
]

temperature_grid = [0.0, 0.2]  # compare deterministic vs slightly creative

records = []
for q in test_questions:
    r = get_retrieval_for_query(q)
    for (name, tpl) in templates:
        for temp in temperature_grid:
            rec = run_generation(
                question=q,
                retrieval_row=r,
                prompt=tpl,
                template_name=name,
                temperature=temp,
                topic="RAG / retrieval fundamentals"
            )
            records.append(rec)

df = pd.DataFrame(records)
df.head()


| | run_id | question | template | model | temperature | latency_s | prompt_chars | context_chars | n_chunks | retrieval_query_used | answer | idk_flag | has_citation_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 773148cfcd0d205f | What is Retrieval-Augmented Generation (RAG)? | qa_strict | gemma3:4b | 0.0 | 1.931 | 1160 | 835 | 3 | What is Retrieval-Augmented Generation (RAG)? | Retrieval-augmented generation, or RAG, is a t... | False | True |
| 1 | ed2bea5911a9d13d | What is Retrieval-Augmented Generation (RAG)? | qa_strict | gemma3:4b | 0.2 | 1.386 | 1160 | 835 | 3 | What is Retrieval-Augmented Generation (RAG)? | Retrieval-augmented generation, or RAG, is a t... | False | True |
| 2 | 223dbe53cb99f7e7 | What is Retrieval-Augmented Generation (RAG)? | summary | gemma3:4b | 0.0 | 8.65 | 1064 | 835 | 3 | What is Retrieval-Augmented Generation (RAG)? | Here’s a summary of the provided context about... | False | False |
| 3 | ec596e6f77b05533 | What is Retrieval-Augmented Generation (RAG)? | summary | gemma3:4b | 0.2 | 9.511 | 1064 | 835 | 3 | What is Retrieval-Augmented Generation (RAG)? | Here’s a summary of the provided context about... | False | False |
| 4 | 83b5662850e2bfeb | What is Retrieval-Augmented Generation (RAG)? | reasoning | gemma3:4b | 0.0 | 5.752 | 1131 | 835 | 3 | What is Retrieval-Augmented Generation (RAG)? | 1) Answer: Retrieval-augmented generation (RAG... | False | True |


In [13]:
df.to_csv(W3_EXPERIMENTS_CSV, index=False)

with open(W3_GENERATIONS_JSONL, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

print("Saved:", W3_EXPERIMENTS_CSV)
print("Saved:", W3_GENERATIONS_JSONL)


Saved: /Users/tkhamidulin/Desktop/First Project - RAG/artifacts/week3_prompt_experiments.csv
Saved: /Users/tkhamidulin/Desktop/First Project - RAG/artifacts/week3_generations.jsonl
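
The JSONL file holds one generation record per line, which makes it easy to reload later without pandas. The sketch below shows a minimal reader and a round trip through a temporary file. `read_jsonl` is a hypothetical helper and the demo records are made up, not real experiment output.

```python
import json
import tempfile
from pathlib import Path

def read_jsonl(path: Path) -> list[dict]:
    """Load one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Round-trip demo on a temporary file with made-up records (not real experiment output).
demo_records = [
    {"run_id": "demo-1", "template": "qa_strict", "answer": "Here\u2019s a demo."},
    {"run_id": "demo-2", "template": "summary", "answer": "Second line."},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    with open(demo_path, "w", encoding="utf-8") as f:
        for rec in demo_records:
            # ensure_ascii=False keeps non-ASCII characters readable, as in the cell above.
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    loaded = read_jsonl(demo_path)

print(loaded == demo_records)
```

In the notebook, `read_jsonl(W3_GENERATIONS_JSONL)` would be expected to return the same dictionaries that were collected in `records`.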


In [14]:
df["answer_len"] = df["answer"].fillna("").str.len()

summary = (
    df.groupby(["template", "temperature"])
      .agg(
          n=("run_id", "count"),
          avg_latency_s=("latency_s", "mean"),
          avg_answer_len=("answer_len", "mean"),
          idk_rate=("idk_flag", "mean"),
          citation_rate=("has_citation_flag", "mean"),
      )
      .reset_index()
      .sort_values(["template", "temperature"])
)

summary


| | template | temperature | n | avg_latency_s | avg_answer_len | idk_rate | citation_rate |
|---|---|---|---|---|---|---|---|
| 0 | qa_strict | 0.0 | 5 | 1.4064 | 131.6 | 0.6 | 0.4 |
| 1 | qa_strict | 0.2 | 5 | 0.8578 | 131.6 | 0.6 | 0.4 |
| 2 | reasoning | 0.0 | 5 | 5.2148 | 921.8 | 0.0 | 1.0 |
| 3 | reasoning | 0.2 | 5 | 4.4916 | 937.0 | 0.0 | 1.0 |
| 4 | summary | 0.0 | 5 | 9.5356 | 1836.2 | 0.0 | 0.0 |
| 5 | summary | 0.2 | 5 | 9.417 | 1880.0 | 0.0 | 0.0 |


In [15]:
q = test_questions[0]
df[df["question"] == q][["template", "temperature", "answer"]]


| | template | temperature | answer |
|---|---|---|---|
| 0 | qa_strict | 0.0 | Retrieval-augmented generation, or RAG, is a t... |
| 1 | qa_strict | 0.2 | Retrieval-augmented generation, or RAG, is a t... |
| 2 | summary | 0.0 | Here’s a summary of the provided context about... |
| 3 | summary | 0.2 | Here’s a summary of the provided context about... |
| 4 | reasoning | 0.0 | 1) Answer: Retrieval-augmented generation (RAG... |
| 5 | reasoning | 0.2 | 1) Answer: Retrieval-augmented generation (RAG... |


## Conclusions (Week 3)

### Prompt behavior findings
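- **Strict Q&A**: most grounded and consistent with the refusal rule. It returned the exact refusal message on 60% of runs and a chunk citation on 40%, with the shortest and fastest answers (about 130 characters, roughly 1 s).
- **Summary style**: most readable, but may drop details needed for precise Q&A. It never cited chunks (the template does not ask for citations), never refused, and produced the longest and slowest answers (about 1,850 characters, roughly 9.5 s).
- **Reasoning style**: better explanations with a chunk citation in every answer (about 930 characters, roughly 5 s), but more sensitive to weak context: it never refused, even on the questions where Strict Q&A did.
- **Temperature**: moving from 0.0 to 0.2 left refusal and citation rates unchanged for all three templates in this small test set.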

### Retrieval-to-generation dependency
Answer quality is directly limited by retrieval quality from Week 2: weak context leads to weaker grounded answers regardless of prompt style.

### Handoff to Week 4
Week 4 adds guardrails around this Week 3 pipeline: pre-checks before LLM calls and post-checks on outputs to reduce hallucinations, leakage, and unsafe responses.
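
As a rough illustration of what such checks might look like, the sketch below gates the LLM call on the best retrieval score and then verifies that cited chunk numbers actually exist in the prompt. This is an assumption about the shape of the Week 4 guardrails, not their implementation: `precheck`, `postcheck`, and the `MIN_TOP_SCORE` value are hypothetical and would need tuning on real queries.

```python
import re

MIN_TOP_SCORE = 0.30  # hypothetical threshold; would need tuning on real queries

def precheck(scores: list[float], min_top_score: float = MIN_TOP_SCORE) -> bool:
    """Allow the LLM call only if the best retrieval score clears the threshold."""
    return bool(scores) and max(scores) >= min_top_score

def postcheck(answer: str, n_chunks: int) -> list[str]:
    """Flag citations that point outside the chunks actually supplied in the prompt."""
    cited = {int(n) for n in re.findall(r"\[chunk\s*(\d+)", answer, flags=re.IGNORECASE)}
    return [f"cites missing chunk {n}" for n in sorted(cited) if not 1 <= n <= n_chunks]

print(precheck([0.4629, 0.3825, 0.379]))  # scores from the "What is RAG?" example retrieval
print(precheck([0.12, 0.08]))             # made-up low scores
print(postcheck("RAG combines retrieval and generation [Chunk 5].", n_chunks=3))
```

A failed pre-check could short-circuit to the refusal message without calling the model, while a non-empty post-check result could mark the answer for review.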
