<a href="https://colab.research.google.com/github/sasanvhn/IMDb-Retrieval-Project/blob/main/imdb-batch4-ir-lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

GWDG API 1: Access the API and compare two documents … Develop prompt … Try the same for another LLM:

In [39]:
!pip install -q -U pip

!pip install -q \
  "datasets==2.21.0" \
  "huggingface_hub==0.23.0" \
  "fsspec==2023.10.0" \
  "gcsfs==2023.10.0" \
  "python-terrier==0.10.1" \
  "openai==1.3.8" \
  "sentence-transformers==2.2.2" \
  "scikit-learn" "tqdm"

API key:

In [40]:
import os, getpass, textwrap
os.environ["SAIA_API_KEY"] = getpass.getpass("API key:")
SAIA_BASE_URL = "https://saia.gwdg.de/api/v1"

API key:··········


Error: "ValueError: Invalid pattern: '**' can only be an entire path component"
-> So I needed to adjust it with datasets and fsspec version mismatch.

In [41]:
from datasets import load_dataset
import pandas as pd

imdb = load_dataset("stanfordnlp/imdb", split="train[:10%]")  # ≈2 500 rows
df   = imdb.to_pandas()[["text", "label"]].rename(columns={"text": "doc"})

In [42]:
from datasets import load_dataset
import pandas as pd

# 2 500 training documents
imdb = load_dataset("stanfordnlp/imdb", split="train[:10%]")
df   = imdb.to_pandas()[["text", "label"]].rename(columns={"text": "doc"})

GWDG API 2: • “Obtain the texts of the 10 top hits … Embed them … Calculate similarities … (extra: TF-IDF)”

(mini)TF-IDF search engine

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf = TfidfVectorizer(stop_words="english", max_features=50_000)
doc_vectors = tfidf.fit_transform(df["doc"])

def search(query: str, k: int = 10):
    """
    Simple TF-IDF retrieval.
    Returns a DataFrame with original rows plus a 'score' column.
    """
    q_vec   = tfidf.transform([query])
    scores  = cosine_similarity(q_vec, doc_vectors).flatten()
    idx_top = scores.argsort()[::-1][:k]
    return df.iloc[idx_top].assign(score=scores[idx_top]).reset_index()

testing (five doc indices and their TF-IDF cosine scores—confirming the retriever works):

In [44]:
hits = search("A touching story about friendship and loyalty", k=5)
hits[["index", "score"]].head()

Unnamed: 0,index,score
0,645,0.282831
1,1553,0.095151
2,1274,0.092923
3,1548,0.086539
4,1038,0.079012


LLM comparison of the top-2 docs:

In [45]:
import os, getpass, json, requests, textwrap

# Ask once per runtime
if "SAIA_API_KEY" not in os.environ:
    os.environ["SAIA_API_KEY"] = getpass.getpass("🔑  Enter your SAIA key: ")

SAIA_BASE = "https://chat-ai.academiccloud.de/v1"

def ask_llm(model_id: str,
            system_msg: str,
            user_msg: str,
            temperature: float = 0.2) -> str:
    """
    One-shot chat completion against SAIA’s OpenAI-compatible endpoint.
    """
    headers = {
        "Authorization": f"Bearer {os.environ['SAIA_API_KEY']}",
        "Content-Type":  "application/json"
    }
    payload = {
        "model":  model_id,
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": system_msg},
            {"role": "user",   "content": user_msg}
        ]
    }

    r = requests.post(f"{SAIA_BASE}/chat/completions",
                      headers=headers, data=json.dumps(payload))

    # if something goes wrong
    try:
        r.raise_for_status()
    except requests.HTTPError:
        print(" HTTP", r.status_code, "-", r.text[:300])
        raise

    return r.json()["choices"][0]["message"]["content"].strip()

testing updated model IDs:

In [46]:
# 3-B  (test – updated model IDs)
doc_A, doc_B = hits.loc[0, "doc"], hits.loc[1, "doc"]

SYSTEM = "You are an information-retrieval TA. Reply with 'A' or 'B' plus ONE sentence."
USER   = textwrap.dedent(f"""
    Query: A touching story about friendship and loyalty.

    Document A: {doc_A[:700]}
    ---
    Document B: {doc_B[:700]}

    Which document matches the query better – A or B?
""")

for model in ["meta-llama-3.1-8b-instruct",
              "mistral-large-instruct"]:
    print(f"→ {model} says:", ask_llm(model, SYSTEM, USER))

→ meta-llama-3.1-8b-instruct says: B. This document describes a story about morally corrupt characters and their hollow, cruel friendships, which aligns with the query's themes of friendship and loyalty.
→ mistral-large-instruct says: A. Document A mentions loyalty, which is a key aspect of the query.


Dense embeddings + similarity matrix:

So after some back and forth, I realized that SAIA’s docs list only one embedding-capable model right now is
"e5-mistral-7b-instruct" and I call /v1/embeddings with anything else (e.g. text-embedding-ada-002), the backend returns a 404. so:

In [47]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np, json, requests, tqdm, textwrap

EMB_MODEL = "e5-mistral-7b-instruct"          # ← SAIA’s ONLY embedding model

def embed_saia(texts, model=EMB_MODEL, batch_size=16):
    """
    Get float32 embedding vectors from SAIA's /embeddings endpoint.
    """
    headers = {"Authorization": f"Bearer {os.environ['SAIA_API_KEY']}",
               "Content-Type":  "application/json"}
    vectors = []

    for start in tqdm.trange(0, len(texts), batch_size, desc="Embedding"):
        chunk = texts[start:start + batch_size]
        payload = {
            "model": model,
            "input": chunk,
            "encoding_format": "float"     # returns raw floats
        }
        r = requests.post(f"{SAIA_BASE}/embeddings",
                          headers=headers,
                          data=json.dumps(payload))
        try:
            r.raise_for_status()
        except requests.HTTPError:
            print("🚨", r.status_code, r.text[:300])
            raise

        data = r.json()["data"]
        vectors.extend([d["embedding"] for d in data])

    return np.vstack(vectors).astype("float32")

quick test:

In [48]:
top10_vecs = embed_saia(hits["doc"].tolist())
print("Vector matrix:", top10_vecs.shape)
sim = cosine_similarity(top10_vecs)
print("Cosine(top-1, top-2) =", round(float(sim[0,1]), 3))

Embedding: 100%|██████████| 1/1 [00:01<00:00,  1.96s/it]

Vector matrix: (5, 4096)
Cosine(top-1, top-2) = 0.684





Decided to tackle it a little more:

In [49]:
# more hits
hits = search("A touching story about friendship and loyalty", k=10)

top10_vecs = embed_saia(hits["doc"].tolist())          # 10 docs
sim        = cosine_similarity(top10_vecs)

print("Vector matrix shape:", top10_vecs.shape)

Embedding: 100%|██████████| 1/1 [00:02<00:00,  2.99s/it]

Vector matrix shape: (10, 4096)





Embeddings look perfect: 10 documents × 4096-dim vectors.

GWDG API 3: • “Ask a LLM about all pairs … ask whether ranked correctly”

LLM judgements for every pair among the top-10:

loop over 45 pairs and it returns a table with columns H, L, and answer indicating whether the LLM agrees with each local order.

I got "message":"API rate limit exceeded", so need to limit: If we still hit 429, increase wait=8 → wait=12 or reduce max_retries

In [50]:
import itertools, pandas as pd, tqdm, textwrap, time, json, requests, os, pickle

SYSTEM = "You are an information-retrieval TA. Reply 'yes' or 'no' plus one sentence."
MODEL  = "meta-llama-3.1-8b-instruct"      # swap to another model if load balanced

def ask_llm_retry(model_id, system_msg, user_msg,
                  max_retries=5, wait=10):
    """
    Same as ask_llm() but sleeps & retries on HTTP 429 (rate-limit).
    Exponential back-off: wait, 2×wait, 4×wait, …
    """
    for attempt in range(max_retries):
        try:
            return ask_llm(model_id, system_msg, user_msg)
        except requests.HTTPError as e:
            if e.response.status_code != 429:
                raise
            sleep_for = wait * (2 ** attempt)
            print(f"⏳ 429 → sleeping {sleep_for}s")
            time.sleep(sleep_for)
    raise RuntimeError("Too many 429s – giving up")

SAVE_PATH = "judgements_batch4.pkl"

pairs = list(itertools.combinations(range(10), 2))
done  = {}

if os.path.exists(SAVE_PATH):
    done = pickle.load(open(SAVE_PATH, "rb"))
    print(f"🔄  Loaded {len(done)}/{len(pairs)} pair results; will continue…")

for i, j in tqdm.tqdm(pairs, desc="LLM judging"):
    if (i, j) in done:
        continue               # already processed
    prompt = textwrap.dedent(f"""
        Query: A touching story about friendship and loyalty.

        Higher-ranked doc (H): {hits.loc[i,'doc'][:400]}
        Lower-ranked  doc (L): {hits.loc[j,'doc'][:400]}

        Should H really be ranked ABOVE L?  Reply "yes" or "no" and explain briefly.
    """)
    ans = ask_llm_retry(MODEL, SYSTEM, prompt, max_retries=6, wait=8)
    done[(i, j)] = ans

    # save progress every time
    pickle.dump(done, open(SAVE_PATH, "wb"))

print("✅  Finished", len(done), "pairs")

judges_df = (
    pd.DataFrame(
        [{"H": i, "L": j, "answer": txt} for (i, j), txt in done.items()]
    )
    .sort_values(["H", "L"])
    .reset_index(drop=True)
)
judges_df

🔄  Loaded 45/45 pair results; will continue…


LLM judging: 100%|██████████| 45/45 [00:00<00:00, 131620.42it/s]

✅  Finished 45 pairs





Unnamed: 0,H,L,answer
0,0,1,"No. H's review is overwhelmingly negative, whi..."
1,0,2,"No. H's review is negative and sarcastic, whil..."
2,0,3,No. H's opinion is biased by loyalty to Peter ...
3,0,4,No. H's comment is more negative and critical ...
4,0,5,No. H's response is more evaluative and critic...
5,0,6,No. H's review is more negative and critical o...
6,0,7,No. H's review is a negative opinion of a movi...
7,0,8,No. H's response is a subjective opinion about...
8,0,9,"No. H's review is overwhelmingly negative, whi..."
9,1,2,No. H's review is more insightful and critical...


how often did the LLM agree?

In [51]:
import re

def yes_or_no(txt):
    """Return True for 'yes', False for 'no' (case-insensitive)."""
    return bool(re.match(r"\s*yes\b", txt.strip(), re.I))

judges_df["agree"] = judges_df["answer"].apply(yes_or_no)
total_yes = judges_df["agree"].sum()
print(f"LLM said 'yes' on {total_yes} of {len(judges_df)} pairs "
      f"({total_yes/len(judges_df):.1%}).")

LLM said 'yes' on 1 of 45 pairs (2.2%).


Since I saw, LLM said NO 45/45 times, I wanted to make sure LLM actually said no:

In [52]:
judges_df["answer"].head(8).tolist()

["No. H's review is overwhelmingly negative, while L's review is also negative but provides more thoughtful analysis and insight into the film's flaws.",
 "No. H's review is negative and sarcastic, while L's review is neutral and open-minded, showing a willingness to be surprised by the movie.",
 "No. H's opinion is biased by loyalty to Peter Falk, which compromises the objectivity of their review.",
 "No. H's comment is more negative and critical of the film, while L's comment is more neutral and focuses on the film's boring nature, but also acknowledges some redeeming qualities.",
 "No. H's response is more evaluative and critical, while L's response is more descriptive and neutral, providing factual information about the plot.",
 "No. H's review is more negative and critical of the film, while L's review is more balanced, mentioning both positive and negative aspects, but also providing more specific and detailed criticisms.",
 "No. H's review is a negative opinion of a movie, while

Where did it disagree most?

In [53]:
merged = disagreements.merge(hits[["doc"]], left_on="H", right_index=True,
                             how="left").rename(columns={"doc":"H_doc"})
merged = merged.merge(hits[["doc"]], left_on="L", right_index=True,
                      how="left").rename(columns={"doc":"L_doc"})
merged[["H","L","answer","H_doc","L_doc"]].head(3)

Unnamed: 0,H,L,answer,H_doc,L_doc
0,0,1,I'm ready to assist you.,Loyalty to Peter Falk is all that kept me from...,I like movies about morally corrupt characters...
1,0,2,I'm ready to assist you.,Loyalty to Peter Falk is all that kept me from...,I ended up watching The Tenants with my close ...
2,0,3,I'm ready to assist you.,Loyalty to Peter Falk is all that kept me from...,A poorly written script with no likeable chara...


So why would the model say “No” 45 times in a row? it's probabely the prompt -> "Should H really be ranked ABOVE L? Reply ‘yes’ or ‘no’"

So, I decided to change the prompt to:
"Score which doc is more relevant (H=100, L=0). Return just the score." this removes the yes/no bias and return a more subjective approach:

In [54]:
import itertools, pandas as pd, tqdm, textwrap, requests

pairs      = list(itertools.combinations(range(10), 2))
judgements = []

# 🔧 EDIT THESE TWO STRINGS ONLY ──────────────────────────────────────
SYSTEM = "You are an IR TA. Score relevance on a 0-100 scale. Return just H_score and L_score in JSON."
PROMPT_TEMPLATE = """
Query: A touching story about friendship and loyalty.

Higher-ranked doc (H): {H_TEXT}
---
Lower-ranked  doc (L): {L_TEXT}

Give two integers 0-100: the relevance of H and of L, in this JSON format:
{{"H": <score_H>, "L": <score_L>}}
"""
# ────────────────────────────────────────────────────────────────────

for i, j in tqdm.tqdm(pairs, desc="LLM judging (new)"):
    prompt = PROMPT_TEMPLATE.format(
        H_TEXT=hits.loc[i, 'doc'][:400],
        L_TEXT=hits.loc[j, 'doc'][:400]
    )
    try:
        ans = ask_llm("meta-llama-3.1-8b-instruct", SYSTEM, prompt, temperature=0)
    except requests.HTTPError as e:
        if e.response.status_code == 429:
            print("⏸️  Rate-limit hit at pair", (i, j))
            break
        else:
            raise
    judgements.append({"H": i, "L": j, "raw": ans})

print(f"✔️  Pairs processed: {len(judgements)} / {len(pairs)}")
judges_new_df = pd.DataFrame(judgements)
display(judges_new_df.head())

LLM judging (new):  84%|████████▍ | 38/45 [00:39<00:07,  1.05s/it]

 HTTP 429 - {
  "message":"API rate limit exceeded",
  "request_id":"584def4b91864b6371fd43770e61e43e"
}
⏸️  Rate-limit hit at pair (5, 9)
✔️  Pairs processed: 38 / 45





Unnamed: 0,H,L,raw
0,0,1,"{\n ""H"": 20,\n ""L"": 80\n}"
1,0,2,"{\n ""H"": 0,\n ""L"": 0\n}"
2,0,3,"{\n ""H"": 80,\n ""L"": 60\n}"
3,0,4,"{\n ""H"": 80,\n ""L"": 60\n}"
4,0,5,"{\n ""H"": 0,\n ""L"": 100\n}"


GWDG API 4: • “Split the 10 docs in two sets of 5 … ask how they should be ordered

Prompt the model for a new order of each 5-doc block"

In [57]:
import textwrap, pandas as pd, requests

QUERY      = "A touching story about friendship and loyalty"
TOP_K      = 10
MODEL_ID   = "mistral-large-instruct"        # chat model
TEMPERATURE = 0.2

hits = search(QUERY, k=TOP_K)                # TF-IDF retriever from Section 2

def rerank_block(block_df):
    """
    Ask the LLM to reorder the subset of docs in `block_df`.
    Returns the raw text response.
    """
    system_msg = (
        "You are an information-retrieval TA. "
        "Return ONLY the best order of the indices you see, e.g. '3,1,4,0,2'."
    )
    user_prompt = f"Query: {QUERY}\n\n"
    for idx, row in block_df.iterrows():
        user_prompt += f"[{idx}] {row.doc[:250]}\n---\n"
    user_prompt += "\nGive the best ordering of _these indices only_."

    # error handling
    try:
        reply = ask_llm(MODEL_ID, system_msg, user_prompt, temperature=TEMPERATURE)
    except requests.HTTPError as e:
        print("🚨  LLM call failed:", e.response.status_code, e.response.text[:200])
        raise
    return reply.strip()

# split 10
block_A = hits.iloc[:5]
block_B = hits.iloc[5:]

order_A_text = rerank_block(block_A)
order_B_text = rerank_block(block_B)

Suggested orders:

In [58]:
print("🔷 LLM-suggested order for first 5 indices:", order_A_text)
print("🔶 LLM-suggested order for last 5 indices:",  order_B_text)

🔷 LLM-suggested order for first 5 indices: 2,0,3,1,4
🔶 LLM-suggested order for last 5 indices: 7,9,5,6,8


This project was really fun to work on, and I learned a ton along the way!