# Week 3 — LLM Integration & Prompt Templates (LangChain + Ollama)

**Objective:** Add the generation layer on top of Week 2 retrieval by integrating a local LLM (Ollama) with LangChain.  
**Focus:** Prompt engineering experiments (Q&A vs Summary vs Grounded Reasoning) and how prompt structure changes outputs.

**Outputs (artifacts):**
- `artifacts/week3_prompt_experiments.csv`
- `artifacts/week3_generations.jsonl`
- `artifacts/week3_summary.md`

In [1]:
import json
import time
import hashlib
from pathlib import Path
import pandas as pd

from langchain_ollama import OllamaLLM
from langchain_core.prompts import PromptTemplate



In [2]:
PROJECT_ROOT = Path("..").resolve()
ARTIFACTS_DIR = PROJECT_ROOT / "artifacts"
ARTIFACTS_DIR.mkdir(parents=True, exist_ok=True)

# Week 2 retrieval log to reuse (keeps Week 3 focused on prompting)
W2_RETRIEVAL_JSONL = ARTIFACTS_DIR / "week2_retrieval_debug.jsonl"

# Week 3 outputs
W3_EXPERIMENTS_CSV = ARTIFACTS_DIR / "week3_prompt_experiments.csv"
W3_GENERATIONS_JSONL = ARTIFACTS_DIR / "week3_generations.jsonl"
W3_SUMMARY_MD = ARTIFACTS_DIR / "week3_summary.md"

W2_RETRIEVAL_JSONL, ARTIFACTS_DIR


(PosixPath('/Users/tkhamidulin/Desktop/First Project - RAG/artifacts/week2_retrieval_debug.jsonl'),
 PosixPath('/Users/tkhamidulin/Desktop/First Project - RAG/artifacts'))

## Local LLM (Ollama) via LangChain

This notebook uses a local model through Ollama to keep experiments reproducible and offline-friendly.

Example setup (outside notebook):
- `ollama pull gemma3:4b` (the model used below; any other model listed by `ollama list` also works if `MODEL_NAME` is changed to match)
- Ensure Ollama is running (default: `http://localhost:11434`)
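
Before building the LLM, it can help to confirm that the server is reachable and the model has been pulled. The sketch below assumes Ollama's `GET /api/tags` endpoint, which lists locally available models; the helper names `installed_models` and `fetch_installed_models` are hypothetical and not part of this notebook.

```python
import json
from urllib.request import urlopen


def installed_models(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [m["name"] for m in tags_payload.get("models", []) if "name" in m]


def fetch_installed_models(base_url: str = "http://localhost:11434", timeout: float = 5.0) -> list[str]:
    """Ask a running Ollama server which models are available locally."""
    with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return installed_models(json.load(resp))
```

With the server running, `"gemma3:4b" in fetch_installed_models()` should be `True` once the model has been pulled.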

In [3]:
MODEL_NAME = "gemma3:4b"   # exactly as in `ollama list`
OLLAMA_BASE_URL = "http://localhost:11434"
TEMPERATURE_DEFAULT = 0.2

def build_llm(temperature: float) -> OllamaLLM:
    """
    Build a LangChain-compatible Ollama LLM.
    Using a factory function avoids duplication across experiments.
    """
    return OllamaLLM(
        model=MODEL_NAME,
        temperature=temperature,
        base_url=OLLAMA_BASE_URL,
        validate_model_on_init=True,  # fail fast if model is missing
    )

# Smoke test
build_llm(TEMPERATURE_DEFAULT).invoke("Reply: OK")


'OK\n'

## Loading Retrieval Context from Week 2

Week 2 already produced retrieval logs (queries → top chunks + scores).  
Week 3 reuses those logs to isolate the effect of prompt structure on generation.

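The Week 2 log schema may differ between runs, so `normalize_retrieval_row` accepts several alternative field names (`chunks`, `chunk_texts`, `texts`, `retrieved_chunks`) and maps them onto one stable schema.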


In [4]:
def load_jsonl(path: Path) -> list[dict]:
    """
    Load a JSONL file into a list of dicts.
    Fails fast with a clear error if a line cannot be parsed.
    """
    rows: list[dict] = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError as e:
                raise ValueError(f"Invalid JSON on line {i} in {path}: {e}") from e
    return rows


def normalize_retrieval_row(r: dict) -> dict:
    """
    Normalize Week 2 retrieval log rows into a stable schema:
    {
      "query": str,
      "chunks": list[str],
      "scores": list[float] | None,
      "meta": dict
    }
    """

    query = r.get("query") or r.get("question") or r.get("q") or ""

    chunks = (
        r.get("chunks")
        or r.get("chunk_texts")
        or r.get("texts")
        or r.get("retrieved_chunks")
        or []
    )

    scores = r.get("scores") or r.get("similarities") or r.get("distances")

    # defensive cleanup
    if chunks is None:
        chunks = []
    if not isinstance(chunks, list):
        chunks = [str(chunks)]

    if scores is not None and (not isinstance(scores, list) or len(scores) != len(chunks)):
        scores = None

    meta = {k: v for k, v in r.items() if k not in [
        "query", "question", "q",
        "chunks", "chunk_texts", "texts", "retrieved_chunks",
        "scores", "similarities", "distances"
    ]}

    return {"query": str(query), "chunks": [str(c) for c in chunks], "scores": scores, "meta": meta}


raw_rows = load_jsonl(W2_RETRIEVAL_JSONL)
retrieval_rows = [normalize_retrieval_row(r) for r in raw_rows]

if not retrieval_rows:
    raise ValueError(f"No rows loaded from {W2_RETRIEVAL_JSONL}. File may be empty.")

print("Loaded retrieval rows:", len(retrieval_rows))
print("Keys:", list(retrieval_rows[0].keys()))
print("Chunks in first row:", len(retrieval_rows[0]["chunks"]))


Loaded retrieval rows: 246
Keys: ['query', 'chunks', 'scores', 'meta']
Chunks in first row: 0
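
The first normalized row has zero chunks, which may mean the Week 2 log stores chunk text under a key that `normalize_retrieval_row` does not check. A minimal diagnostic sketch, using hypothetical helpers `key_frequencies` and `empty_chunk_share`, could show which keys the raw rows actually carry and how many normalized rows ended up without chunks:

```python
from collections import Counter


def key_frequencies(rows: list[dict]) -> Counter:
    """Count how often each top-level key appears across raw JSONL rows."""
    counts: Counter = Counter()
    for row in rows:
        counts.update(row.keys())
    return counts


def empty_chunk_share(rows: list[dict]) -> float:
    """Fraction of normalized rows whose "chunks" list is empty or missing."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if not r.get("chunks")) / len(rows)
```

Calling `key_frequencies(raw_rows).most_common()` and `empty_chunk_share(retrieval_rows)` right after loading would indicate whether the empty context is a property of one row or of the whole log.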


In [5]:
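def format_context(chunks: list[str], scores: list[float] | None = None, max_chars: int = 7000) -> str:
    """
    Join chunks into one context string, labelling each as [Chunk N] (1-based)
    so the prompts can ask for citations. Scores are appended when available.
    """
    parts = []
    for i, ch in enumerate(chunks):
        score_str = f" (score={scores[i]:.4f})" if scores is not None else ""
        parts.append(f"[Chunk {i+1}{score_str}]\n{ch}")
    ctx = "\n\n".join(parts)
    # Hard character cap: this may cut the last chunk mid-sentence.
    return ctx[:max_chars]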
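
Slicing the joined string at `max_chars` can cut a chunk in the middle. One possible alternative, sketched here under the assumption that dropping whole trailing chunks is acceptable, keeps only chunks that fit entirely within the budget. The function name `format_context_whole_chunks` is hypothetical, and scores are left out for brevity.

```python
def format_context_whole_chunks(chunks: list[str], max_chars: int = 7000) -> str:
    """Join labelled chunks, dropping trailing chunks that would exceed max_chars."""
    parts: list[str] = []
    used = 0
    for i, ch in enumerate(chunks, start=1):
        block = f"[Chunk {i}]\n{ch}"
        sep = 2 if parts else 0  # length of the "\n\n" separator between blocks
        if used + sep + len(block) > max_chars:
            break  # stop at the first chunk that does not fit
        parts.append(block)
        used += sep + len(block)
    return "\n\n".join(parts)
```

The trade-off is that a single oversized first chunk yields an empty context, so a caller may still want a fallback to hard truncation in that case.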


## Prompt Templates

I define three task-specific prompt templates:

1) **Strict Q&A**  
   - Answer using ONLY context  
   - If missing: exact refusal message  
   - Require chunk citations

2) **Structured Summary**  
   - Extract key points, definitions, practical notes, and gaps

3) **Grounded Reasoning**  
   - Provide short reasoning steps tied to evidence  
   - Avoid outside knowledge


In [6]:
QA_STRICT_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "You are a RAG assistant.\n"
        "You MUST answer using ONLY the provided CONTEXT.\n"
        "If the answer is not in the context, reply exactly:\n"
        "\"I don't know based on the provided context.\"\n\n"
        "CONTEXT:\n{context}\n\n"
        "QUESTION:\n{question}\n\n"
        "RULES:\n"
        "- No outside knowledge.\n"
        "- Be concise.\n"
        "- Cite chunks like [Chunk 2].\n\n"
        "ANSWER:\n"
    )
)

SUMMARY_PROMPT = PromptTemplate(
    input_variables=["context", "topic"],
    template=(
        "Summarize the provided context about: {topic}\n\n"
        "CONTEXT:\n{context}\n\n"
        "OUTPUT FORMAT:\n"
        "- Key points (3–7 bullets)\n"
        "- Definitions (if any)\n"
        "- Practical notes (if any)\n"
        "- Missing information / open questions (if any)\n\n"
        "SUMMARY:\n"
    )
)

REASONING_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "You are an analyst. Use ONLY the provided context.\n"
        "Do not add outside knowledge.\n\n"
        "CONTEXT:\n{context}\n\n"
        "QUESTION:\n{question}\n\n"
        "OUTPUT FORMAT:\n"
        "1) Answer (1–3 sentences)\n"
        "2) Evidence: cite chunks like [Chunk 1]\n"
        "3) Reasoning: 3–6 short bullet steps tied to evidence\n\n"
        "RESPONSE:\n"
    )
)
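
Each template declares its `input_variables` by hand, so a typo in a placeholder would only surface at format time. A small standard-library check, sketched below with a hypothetical helper `template_placeholders`, lists the `{named}` fields in a template string so they can be compared against the declared variables. It assumes the templates use `str.format`-style placeholders, as the ones above do.

```python
from string import Formatter


def template_placeholders(template: str) -> set[str]:
    """Return the set of {named} placeholders in a str.format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}
```

For example, `template_placeholders(QA_STRICT_PROMPT.template)` would be expected to equal `{"context", "question"}`.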


In [7]:
def find_retrieval_row(question: str, rows: list[dict]) -> dict:
    # exact match first
    for r in rows:
        if r["query"].strip() == question.strip():
            return r
    # fallback: no exact match, so reuse the first row; its context may be
    # unrelated to the question (check retrieval_query_used in the results)
    return rows[0]


In [8]:
def stable_id(*parts) -> str:
    s = "||".join(map(str, parts))
    return hashlib.sha256(s.encode("utf-8")).hexdigest()[:16]

def run_generation(
    question: str,
    retrieval_row: dict,
    prompt: PromptTemplate,
    template_name: str,
    temperature: float,
    topic: str = "RAG notes"
) -> dict:
    llm = build_llm(temperature)

    context = format_context(retrieval_row["chunks"], retrieval_row["scores"])
    kwargs = {"context": context}

    if "question" in prompt.input_variables:
        kwargs["question"] = question
    if "topic" in prompt.input_variables:
        kwargs["topic"] = topic

    prompt_text = prompt.format(**kwargs)

    t0 = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    answer = llm.invoke(prompt_text)
    latency = round(time.perf_counter() - t0, 3)

    answer = (answer or "").strip()

    return {
        "run_id": stable_id(question, template_name, MODEL_NAME, temperature),
        "question": question,
        "template": template_name,
        "model": MODEL_NAME,
        "temperature": temperature,
        "latency_s": latency,
        "prompt_chars": len(prompt_text),
        "context_chars": len(context),
        "n_chunks": len(retrieval_row["chunks"]),
        "retrieval_query_used": retrieval_row["query"],
        "answer": answer,
        "idk_flag": ("i don't know based on the provided context" in answer.lower()),
        "has_citation_flag": ("[chunk" in answer.lower()),  # lightweight heuristic
    }
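
The `has_citation_flag` heuristic only checks that the text `[chunk` appears, so it also counts citations to chunks that were never supplied. A stricter check, sketched below with hypothetical helpers, parses the cited numbers and requires each one to fall within `1..n_chunks`. It assumes the model cites in the `[Chunk N]` form requested by the prompts.

```python
import re

# Matches citations such as "[Chunk 2]" or "[chunk 10]".
CITATION_RE = re.compile(r"\[chunk\s+(\d+)\]", re.IGNORECASE)


def cited_chunk_ids(answer: str) -> set[int]:
    """Collect the distinct chunk numbers cited in an answer."""
    return {int(n) for n in CITATION_RE.findall(answer)}


def has_valid_citation(answer: str, n_chunks: int) -> bool:
    """True only if the answer cites at least one chunk and every cited number is in 1..n_chunks."""
    ids = cited_chunk_ids(answer)
    return bool(ids) and all(1 <= i <= n_chunks for i in ids)
```

With `n_chunks = 0`, any cited chunk is out of range, so an answer that cites `[Chunk 1]` against an empty context is reported as invalid.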


## Test Set

A small, diverse set of questions is used to compare prompt behavior:
- definitions (low risk)
- “why” questions (higher hallucination risk)
- comparison questions
- pipeline-level questions

This is a controlled prompt experiment, not a full benchmark yet.


In [9]:
test_questions = [
    "What is Retrieval-Augmented Generation (RAG)?",
    "Why do we use chunk overlap in retrieval systems?",
    "Explain cosine similarity vs dot product similarity for embeddings.",
    "What is the role of FAISS in a RAG pipeline?",
    "When would you increase chunk size and why?",
]


In [10]:
templates = [
    ("qa_strict", QA_STRICT_PROMPT),
    ("summary", SUMMARY_PROMPT),
    ("reasoning", REASONING_PROMPT),
]

temperature_grid = [0.0, 0.2]  # near-deterministic (0.0) vs. slightly more varied sampling (0.2)

records = []
for q in test_questions:
    r = find_retrieval_row(q, retrieval_rows)
    for (name, tpl) in templates:
        for temp in temperature_grid:
            rec = run_generation(
                question=q,
                retrieval_row=r,
                prompt=tpl,
                template_name=name,
                temperature=temp,
                topic="RAG / retrieval fundamentals"
            )
            records.append(rec)

df = pd.DataFrame(records)
df.head()


| | run_id | question | template | model | temperature | latency_s | prompt_chars | context_chars | n_chunks | retrieval_query_used | answer | idk_flag | has_citation_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 773148cfcd0d205f | What is Retrieval-Augmented Generation (RAG)? | qa_strict | gemma3:4b | 0.0 | 1.294 | 325 | 0 | 0 | Explain retrieval augmented generation in simp... | I don't know based on the provided context. | True | False |
| 1 | ed2bea5911a9d13d | What is Retrieval-Augmented Generation (RAG)? | qa_strict | gemma3:4b | 0.2 | 0.496 | 325 | 0 | 0 | Explain retrieval augmented generation in simp... | I don't know based on the provided context. | True | False |
| 2 | 223dbe53cb99f7e7 | What is Retrieval-Augmented Generation (RAG)? | summary | gemma3:4b | 0.0 | 1.585 | 229 | 0 | 0 | Explain retrieval augmented generation in simp... | Please provide the context about RAG/retrieval... | False | False |
| 3 | ec596e6f77b05533 | What is Retrieval-Augmented Generation (RAG)? | summary | gemma3:4b | 0.2 | 1.44 | 229 | 0 | 0 | Explain retrieval augmented generation in simp... | Please provide the context about RAG/retrieval... | False | False |
| 4 | 83b5662850e2bfeb | What is Retrieval-Augmented Generation (RAG)? | reasoning | gemma3:4b | 0.0 | 7.53 | 296 | 0 | 0 | Explain retrieval augmented generation in simp... | 1) Retrieval-Augmented Generation (RAG) is a t... | False | True |
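
The nested loops visit every combination of question, template, and temperature. As a quick sanity check on the expected number of runs, the same grid can be enumerated with `itertools.product`; the question labels below are stand-ins for the five test questions.

```python
from itertools import product

questions = [f"q{i}" for i in range(1, 6)]  # stand-ins for the 5 test questions
template_names = ["qa_strict", "summary", "reasoning"]
temperatures = [0.0, 0.2]

# One tuple per planned run: (question, template, temperature)
grid = list(product(questions, template_names, temperatures))
print(len(grid))  # 5 * 3 * 2 = 30
```

Comparing `len(grid)` with `len(records)` after the loop would confirm that no combination was skipped.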


In [28]:
df.to_csv(W3_EXPERIMENTS_CSV, index=False)

with open(W3_GENERATIONS_JSONL, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

print("Saved:", W3_EXPERIMENTS_CSV)
print("Saved:", W3_GENERATIONS_JSONL)


Saved: /Users/tkhamidulin/Desktop/First Project - RAG/artifacts/week3_prompt_experiments.csv
Saved: /Users/tkhamidulin/Desktop/First Project - RAG/artifacts/week3_generations.jsonl


In [29]:
df["answer_len"] = df["answer"].fillna("").str.len()

summary = (
    df.groupby(["template", "temperature"])
      .agg(
          n=("run_id", "count"),
          avg_latency_s=("latency_s", "mean"),
          avg_answer_len=("answer_len", "mean"),
          idk_rate=("idk_flag", "mean"),
          citation_rate=("has_citation_flag", "mean"),
      )
      .reset_index()
      .sort_values(["template", "temperature"])
)

summary


| | template | temperature | n | avg_latency_s | avg_answer_len | idk_rate | citation_rate |
|---|---|---|---|---|---|---|---|
| 0 | qa_strict | 0.0 | 5 | 0.6082 | 43.0 | 1.0 | 0.0 |
| 1 | qa_strict | 0.2 | 5 | 0.4452 | 43.0 | 1.0 | 0.0 |
| 2 | reasoning | 0.0 | 5 | 5.9418 | 1210.2 | 0.0 | 0.8 |
| 3 | reasoning | 0.2 | 5 | 5.5042 | 1152.6 | 0.0 | 0.8 |
| 4 | summary | 0.0 | 5 | 1.5362 | 267.0 | 0.0 | 0.0 |
| 5 | summary | 0.2 | 5 | 1.4084 | 267.0 | 0.0 | 0.0 |
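
Because `idk_rate` and `citation_rate` do not account for whether any context was supplied, it may be worth adding a proxy for answers produced without evidence. The sketch below works on the plain `records` list; the helper name `ungrounded_answer_rate` is hypothetical.

```python
def ungrounded_answer_rate(records: list[dict]) -> float:
    """Share of runs that had no retrieved chunks yet did not return the refusal string."""
    empty = [r for r in records if r["n_chunks"] == 0]
    if not empty:
        return 0.0
    return sum(1 for r in empty if not r["idk_flag"]) / len(empty)
```

Computing this per template, for example by filtering `records` on `r["template"]` first, would separate templates that refuse on empty context from those that answer anyway.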


In [30]:
q = test_questions[0]
df[df["question"] == q][["template", "temperature", "answer"]]


| | template | temperature | answer |
|---|---|---|---|
| 0 | qa_strict | 0.0 | I don't know based on the provided context. |
| 1 | qa_strict | 0.2 | I don't know based on the provided context. |
| 2 | summary | 0.0 | Please provide the context about RAG/retrieval... |
| 3 | summary | 0.2 | Please provide the context about RAG/retrieval... |
| 4 | reasoning | 0.0 | 1) Retrieval-Augmented Generation (RAG) is a t... |
| 5 | reasoning | 0.2 | 1) Retrieval-Augmented Generation (RAG) is a t... |


## Conclusions (Week 3)

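**What changed when I changed the prompt:**

The rows displayed above ran with an empty context (`n_chunks = 0`), so these results mostly show how each template behaves when retrieval supplies nothing.

- **Strict Q&A**: the most grounded; it followed the refusal rule in every run (`idk_rate = 1.0`, answer length exactly 43 characters) at both temperatures.
- **Summary**: never used the refusal string or citations; for the first question it asked for the missing context instead of summarizing. Its structure and readability, and whether it omits details needed for precise Q&A, still need to be judged on runs with non-empty context.
- **Grounded reasoning**: gave the longest explanations (about 1,150 to 1,210 characters on average) but was the most sensitive to weak context: it never refused and cited chunks in 80% of runs, including the displayed run where no chunks were provided. This is the drift to expect when constraints are not strict enough.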

**Key takeaway:** prompt structure strongly controls:
- refusal behavior (“I don’t know…”)
- citation usage
- verbosity vs conciseness
- risk of hallucination (especially in reasoning-style prompts)

**Next (Week 4):** implement guardrails and fallback logic when retrieval context is weak or irrelevant.


In [31]:
summary_text = f"""# Week 3 — Prompt Engineering Summary

## Setup
- LLM: Ollama via LangChain
- Model: {MODEL_NAME}
- Templates: qa_strict, summary, reasoning
- Temperatures tested: {temperature_grid}

## Quick metrics (proxy)
{summary.to_string(index=False)}

## Notes (fill after manual review)
- qa_strict:
- summary:
- reasoning:

## Observed failure modes
- 

## Week 4 plan
- add guardrails for weak retrieval
- enforce minimum-evidence rules before answering
- improve refusal behavior and citation enforcement
"""

W3_SUMMARY_MD.write_text(summary_text, encoding="utf-8")
print("Saved:", W3_SUMMARY_MD)


Saved: /Users/tkhamidulin/Desktop/First Project - RAG/artifacts/week3_summary.md
