
# Assignment 4: Embedding Models, Dense Retrieval, and RAG

**Student names**: Vegard Aa Albretsen <br>
**Group number**: 62 <br>
**Date**: _We will see_

## Important notes
Please carefully read the following notes and consider them for the assignment delivery. Submissions that do not fulfill these requirements will not be assessed and should be submitted again.
1. You may work in groups of at most 2 students.
2. The assignment must be delivered in ipynb format.
3. The assignment must be typed. Handwritten assignments are not accepted.

**Due date**: 26.10.2025 23:59

In this assignment, you will:
- Build a vector search index over a blog corpus using sentence embeddings
- Implement dense retrieval (cosine similarity)
- Use the vector index as the foundation for a simple Retrieval-Augmented Generation (RAG) chat system with evaluation on three queries



---
## Dataset

You will use the blog files provided in the folder:
- `blogs-sample` (in the same directory as this notebook)

Use only the blog files in that folder. Each file contains multiple `<post>` elements. Treat **each `<post>` as a separate document**.

**The code to parse files is not provided. Implement the loading yourself in 4.1.**



## 4.1 – Load and parse blog documents

Load all XML files from `blogs-sample`, extract the text of each `<post>`, and store one string per document. Keep the raw text per post as the document text.

You may experience some trouble parsing all lines in the files, but this is okay.
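
When a file is too malformed for an XML parser, one possible fallback is to pull out the `<post>` bodies with a regular expression. The sketch below is only an illustration, not part of the assignment: `extract_posts` is a hypothetical helper, the sample bytes are invented, and it assumes the tags are written as plain `<post>...</post>` without attributes or namespace prefixes.

```python
import re
from html import unescape

# Non-greedy match of everything between <post> and </post>, across newlines.
POST_RE = re.compile(r"<post>(.*?)</post>", re.DOTALL | re.IGNORECASE)


def extract_posts(raw: bytes) -> list[str]:
    """Return the stripped text of every <post> element in one file's bytes."""
    # Assumption: undecodable bytes are replaced instead of raising an error.
    text = raw.decode("utf-8", errors="replace")
    posts = (unescape(body).strip() for body in POST_RE.findall(text))
    # Drop posts that are empty after stripping whitespace.
    return [p for p in posts if p]


# Invented sample that mimics the <Blog>/<date>/<post> layout.
sample = (
    b"<Blog><date>01,May,2004</date><post> First &amp; post </post>"
    b"<date>02,May,2004</date><post>\nSecond post\n</post></Blog>"
)
print(extract_posts(sample))  # ['First & post', 'Second post']
```

Unlike a real parser, this does not handle nested markup inside a post, so it is best kept as a last resort for files that fail to parse.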



In [13]:
from pathlib import Path
from bs4 import BeautifulSoup

folder = Path("blogs-sample").resolve()
assert folder.is_dir(), f"Folder not found: {folder}"

documents = {}
files = sorted(folder.rglob("*.xml"))  # sorted so document order (and indices) are reproducible

print(f"Found {len(files)} XML files under {folder}")

for f in files:
    data = f.read_bytes()
    soup = BeautifulSoup(data, "xml")

    # strip namespace prefixes like ns:post -> post
    for t in soup.find_all(True):
        if ":" in t.name:
            t.name = t.name.split(":", 1)[1]

    # find every <post> element at any depth (Blog/date/post/date/post...)
    posts = soup.find_all("post")
    print(f"{f.name}: {len(posts)} <post> tags")

    for i, p in enumerate(posts, 1):
        documents[f"{f.name}#{i}"] = p.get_text(strip=True)

print(f"Total posts: {len(documents)}")
# show a few
for k, v in list(documents.items())[:5]:
    print(k, "=>", v[:120])


Found 25 XML files under C:\Users\vegar\OneDrive - NTNU\Skule\Fag\Informasjonsgjenfinning\TDT4117Assignment4\blogs-sample
11253.male.26.Technology.Aquarius.xml: 7 <post> tags
11762.female.25.Student.Aries.xml: 20 <post> tags
15365.female.34.indUnk.Cancer.xml: 844 <post> tags
17944.female.39.indUnk.Sagittarius.xml: 128 <post> tags
21828.male.40.Internet.Cancer.xml: 69 <post> tags
23166.female.25.indUnk.Virgo.xml: 1 <post> tags
23191.female.23.Advertising.Taurus.xml: 5 <post> tags
23676.male.33.Technology.Scorpio.xml: 12 <post> tags
24336.male.24.Technology.Leo.xml: 849 <post> tags
26357.male.27.indUnk.Leo.xml: 41 <post> tags
27603.male.24.Advertising.Sagittarius.xml: 52 <post> tags
28417.female.24.Arts.Capricorn.xml: 73 <post> tags
28451.male.27.Internet.Aquarius.xml: 13 <post> tags
40964.female.23.RealEstate.Leo.xml: 5 <post> tags
46465.male.25.Internet.Virgo.xml: 19 <post> tags
47519.male.23.Communications-Media.Sagittarius.xml: 179 <post> tags
48428.female.34.indUnk.Aquarius.xml: 5 <


## 4.2 – Embedding Models

Select and load a sentence embedding model (e.g., `sentence-transformers/all-MiniLM-L6-v2`) and compute embeddings for all documents.

- Store document embeddings in a variable named `doc_embeddings`.
- Ensure that the same model will be used for query encoding later.

**Report**:
- The embedding matrix shape 


In [None]:

# TODO: Load a sentence embedding model and encode all documents into `doc_embeddings`.
# You may use `sentence-transformers`. Report the embedding matrix shape.
from sentence_transformers import SentenceTransformer
import numpy as np

# documents: dict[str, str]
ids, texts = zip(*documents.items())  # stable order

model_name = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)

# 384-dim embeddings; normalize for cosine
doc_embeddings = model.encode(
    list(texts),
    batch_size=128,
    convert_to_numpy=True,
    normalize_embeddings=True,
).astype("float32")

print("Embedding matrix shape:", doc_embeddings.shape)   


Embedding matrix shape: (5245, 384)



## 4.3 – Dense Retrieval

Implement a cosine similarity search over `doc_embeddings` for a given query.

- Write a function `dense_search(query: str, k: int = 5) -> list[int]` that returns the indices of the top-k documents.
- Use the same embedding model to encode the query.
- Use cosine similarity for ranking.

**Report**:
- Results for the provided query showing the indices of the top results.
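
As a small illustration of the ranking step, the sketch below computes cosine similarity on made-up 2-d vectors and checks that, once both sides are L2-normalized, a plain dot product gives the same scores. The name `cosine_scores` and the toy data are assumptions for this example only.

```python
import numpy as np


def cosine_scores(doc_matrix: np.ndarray, query_vec: np.ndarray) -> np.ndarray:
    """Cosine similarity between each row of doc_matrix and query_vec."""
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    query_norm = np.linalg.norm(query_vec)
    # Small epsilon guards against division by zero for all-zero vectors.
    return (doc_matrix @ query_vec) / (doc_norms * query_norm + 1e-12)


# Toy 2-d "embeddings" (made up for illustration).
toy_docs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0], [-1.0, 0.0]])
toy_query = np.array([2.0, 0.0])

scores = cosine_scores(toy_docs, toy_query)
ranking = np.argsort(-scores)  # best match first

# After L2-normalizing both sides, a plain dot product gives the same scores.
unit_docs = toy_docs / np.linalg.norm(toy_docs, axis=1, keepdims=True)
unit_query = toy_query / np.linalg.norm(toy_query)
assert np.allclose(unit_docs @ unit_query, scores)

print(ranking.tolist())  # [0, 1, 2, 3]
```

This is why encoding with `normalize_embeddings=True` lets the search reduce to a single matrix-vector product.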


In [21]:

# TODO: Implement dense retrieval using cosine similarity.
# Function signature to implement:
# def dense_search(query: str, k: int = 5) -> list[int]:
import numpy as np
def dense_search(query:str, k: int = 5) -> list[int]:
    q = model.encode([query], convert_to_numpy=True, normalize_embeddings=True).astype("float32")
    qv = q[0]
    scores = doc_embeddings @ qv
    k = min(k, len(scores))
    topk_idx = np.argpartition(-scores, k-1)[:k]
    return topk_idx[np.argsort(-scores[topk_idx])].tolist()


indices = dense_search("How do people feel about their jobs?", k=5)
doc_ids = [ids[i] for i in indices]
docs = [documents[doc_id] for doc_id in doc_ids]
for d in docs:
    print(d)


urlLink Workplace : "For INFPs the job search can be an opportunity to use their creativity, flexibility and their skills in self-expression. They can generate a variety of job possibilities, consider them for their ability to fulfill their values, and pursue them using their skills in communicating with others, either in writing or in person. Their idealism, commitment, flexibility and people skills will usually be communicated to others in the job search. Potential drawbacks for INFPs in the job search include unrealistic expectations for a job, feelings of inadequacy or lack of confidence, and inattention to details of the jobs or of the job search. Under stress, INFPs may become quite critical of others and themselves, and they may hold themselves back because they feel incompetent as they engage in this process. They can benefit from allowing their intuition to give them a new perspective on the possibilities available in the situation. They may also find it helpful to truly ackno


## 4.4 – Build a Vector Search Index

Build a lightweight vector index structure to enable repeated querying efficiently.

- You may reuse `doc_embeddings` directly or create an index structure. Ensure the index can return top-k document indices given a query vector.
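
One way to package this is a small exact (brute-force) index that normalizes the vectors once and answers each query with a matrix-vector product. This is a minimal sketch under stated assumptions: `BruteForceIndex` is a hypothetical class, not a library API, and the vectors here are toy data rather than real embeddings.

```python
import numpy as np


class BruteForceIndex:
    """Exact top-k cosine search over a fixed set of vectors (hypothetical helper)."""

    def __init__(self, vectors: np.ndarray):
        vectors = np.asarray(vectors, dtype="float32")
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        # Normalize once at build time so each query is one matrix-vector product.
        self.vectors = vectors / (norms + 1e-12)

    def search(self, query_vec: np.ndarray, k: int = 5):
        """Return (indices, scores) of the k most similar vectors, best first."""
        q = np.asarray(query_vec, dtype="float32")
        q = q / (np.linalg.norm(q) + 1e-12)
        scores = self.vectors @ q
        k = min(k, scores.size)
        if k <= 0:
            return np.array([], dtype=int), np.array([], dtype="float32")
        # argpartition selects the k best without a full sort; only those k are sorted.
        top = np.argpartition(-scores, k - 1)[:k]
        top = top[np.argsort(-scores[top])]
        return top, scores[top]


# Toy vectors (made up for illustration).
index = BruteForceIndex(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]))
idx, sc = index.search(np.array([1.0, 0.2]), k=2)
print(idx.tolist())  # [0, 2]
```

For a corpus of a few thousand documents an exact search like this is usually fast enough; an approximate nearest-neighbour library would only become interesting at much larger scale.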


In [22]:

# TODO: Initialize a vector index over `doc_embeddings`
# Keep code minimal. The goal is to enable fast top-k retrieval for repeated queries.
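import numpy as np

# astype() returns a copy, so the in-place division below leaves doc_embeddings untouched.
# The rows are already unit-length if they were encoded with normalize_embeddings=True;
# re-normalizing here is a cheap safeguard.
X = doc_embeddings.astype("float32")
X /= (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
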
def topk_search(query: str, k: int = 5):
    q = model.encode([query], convert_to_numpy=True, normalize_embeddings=True).astype("float32")[0]
    scores = X @ q
    k = min(k, scores.size)
    idx = np.argpartition(-scores, k-1)[:k]
    idx = idx[np.argsort(-scores[idx])]
    return idx, scores[idx]   # indices into your ids/texts, and their scores




## 4.5 – RAG (Retrieval-Augmented Generation)

Implement a simple RAG pipeline that:
1) Retrieves the top-k documents for a user query using your vector index.
2) Builds a prompt that includes the query and the retrieved document snippets.
3) Uses a text generation model (your choice) to produce an answer grounded in the retrieved snippets.

- Implement a function `rag_answer(query: str, k: int = 5) -> str`.
- Keep the prompt simple and state clearly that the model should rely on the provided context.


In [None]:

# TODO: Implement a minimal RAG pipeline.
import numpy as np
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# --- retrieval over pre-normalized embeddings X ---
def retrieve_topk(query: str, k: int = 5):
    """Return (score, doc_id, text) triples for the top-k documents, best first."""
    idx, scores = topk_search(query, k)  # reuse the index from 4.4
    return [(float(s), ids[i], documents[ids[i]]) for i, s in zip(idx, scores)]

# --- prompt builder ---
def build_prompt(query: str, docs, char_limit=2000):
    ctx_parts, used = [], 0
    for j, (_, doc_id, text) in enumerate(docs, 1):
        chunk = text[: min(len(text), max(0, char_limit - used))]
        ctx_parts.append(f"[Doc {j} | id={doc_id}]\n{chunk}")
        used += len(chunk)
        if used >= char_limit: break
    context = "\n\n".join(ctx_parts)
    return (
        "You are a helpful assistant. Answer ONLY using the provided context.\n\n"
        f"Question:\n{query}\n\nContext:\n{context}\n\nAnswer:"
    )

# --- generation model (free, small) ---
gen_name = "google/flan-t5-base"
tok = AutoTokenizer.from_pretrained(gen_name)
gen = AutoModelForSeq2SeqLM.from_pretrained(gen_name)
t5 = pipeline("text2text-generation", model=gen, tokenizer=tok, device=-1)

# --- RAG end-to-end ---
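def rag_answer(query: str, k: int = 5, max_new_tokens: int = 256) -> tuple[str, list]:
    # Returns (answer, retrieved docs) so the evidence behind the answer can be inspected.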
    docs = retrieve_topk(query, k=k)
    prompt = build_prompt(query, docs)
    out = t5(prompt, max_new_tokens=max_new_tokens, do_sample=False, truncation=True)[0]["generated_text"]
    return out, docs

# Example:
answer, evidence = rag_answer("How do people feel about their jobs?", k=5)
print(answer)


Device set to use cpu


unrealistic expectations for a job, feelings of inadequacy or lack of confidence, and inattention to details of the jobs or of the job search


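_**Comment from me:** The answer is copied almost verbatim from a single sentence of the first retrieved document. A model as small as `google/flan-t5-base` appears to extract a span rather than summarize across the retrieved documents, which seriously limits the quality of the response._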

## 4.6 – Evaluation

Use the following queries for your evaluation. For each query:

- Run `dense_search(query, k=5)` to retrieve relevant documents.
- Use `rag_answer(query, k=5)` to generate an answer using the top-5 retrieved documents.

**Queries:**
1. How do people deal with breakups?
2. What do bloggers write about their daily routines?
3. How do people feel about their jobs?


In [27]:
# Do not change this code
queries = [
    "How do people deal with breakups?",
    "What do bloggers write about their daily routines?",
    "How do people feel about their jobs?"
]

In [31]:
# TODO: Run and report your evaluation as described above.

def run_batch_evaluation(queries, k=5):
    for i, query in enumerate(queries, 1):
        print("=" * 100)
        print(f"Q{i}: {query}")
        print("-" * 100)

        top_k = dense_search(query, k=k)
        print(f"Top-{k} retrieved indices:", top_k)
        print("\nTop retrieved snippets:")
        for idx in top_k:
            doc_id = ids[idx]  # avoid shadowing the built-in id()
            snippet = documents[doc_id]
            print(f"[{idx}] {snippet[:200]}...\n")

        print("RAG answer:\n")
        answer, _ = rag_answer(query, k=k)  # rag_answer returns (answer, docs)
        print(answer)
        print("\n")

# Run the evaluation
run_batch_evaluation(queries, k=5)

Q1: How do people deal with breakups?
----------------------------------------------------------------------------------------------------
Top-5 retrieved indices: [2575, 4830, 2691, 3674, 4802]

Top retrieved snippets:
[2575] Why do my ex-boyfriends always feel the need to keep in touch? Does this happen to anyone else? Particularly perplexing are the calls and emails from the ones who dumped me...didn't they want out in t...

[4830] i've got what i want... she wont call or contact me anymore... so the games finally over and its time to rebuild, recoup, find happiness alone, and then venture off into the world again as a stronger ...

[2691] Well, it looks as though I'm a single man again. Joe got back from Spain...and rather than me launching into the "we need to talk" speech...he did. Which, is really quite surprising given his continua...

[3674] Find the new.  To no one in particular: Sometimes it's too easy to get stuck in the past, things that were done wrong to you. Shit happen