# Ungraded Lab - Retrieval Metrics
---

In this lab, you will compute and analyze retrieval metrics for a RAG (retrieval-augmented generation) system. RAG systems improve the quality of generated responses by retrieving relevant documents from a knowledge base. Your goal is to evaluate the retrieval component by calculating Precision@K and Recall@K.

In this lab, you will learn:
- How to compute precision and recall metrics
- How to apply these metrics in information retrieval
- How to work with a concrete dataset to test the retrieval capabilities of semantic-based searches

You will be using the `sentence-transformers` library to convert text to embeddings, allowing efficient similarity computations. To compute retrieval metrics, you need a labeled dataset.

---
<h4 style="color:black; font-weight:bold;">USING THE TABLE OF CONTENTS</h4>

JupyterLab provides an easy way for you to navigate through your assignment. It's located under the Table of Contents tab, found in the left panel, as shown in the picture below.

![TOC Location](images/toc.png)


# Table of Contents
- [ 1 - The dataset](#1)
  - [ 1.1 Preprocessing and Vectorizing Data](#1-1)
  - [ 1.2 Basic functions for retrieval](#1-2)
- [ 2 - Retrieving metrics](#2)
  - [ 2.1 Precision@K](#2-1)
  - [ 2.2 Recall@K](#2-2)
  - [ 2.3 Computing metrics over some queries](#2-3)


## Introduction
---

Retrieval metrics are fundamental in RAG systems, as they provide a way to measure performance. To effectively gauge performance, you need a labeled dataset—one where the answers to specific queries are known—allowing you to compare these results with those generated by your RAG system. In this lab, you will use a pre-labeled dataset and focus on Precision and Recall metrics.

<div style="text-align: center;">
  <img src="images/precision_recall.png" alt="Description" style="width: 70%;">
</div>

In [1]:
import pandas as pd
from sentence_transformers import SentenceTransformer
import numpy as np
import matplotlib.pyplot as plt
import joblib
import os

<a id='1'></a>
## 1 - The dataset

The [20 Newsgroups dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) is a classic collection of newsgroup posts spanning 20 topics, with each post labeled by its category. Let's use the `sklearn.datasets` module to load this dataset.

In [2]:
from sklearn.datasets import fetch_20newsgroups

# Load the 20 Newsgroups dataset
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42, data_home='./dataset')

# Convert the dataset to a DataFrame for easier handling
df = pd.DataFrame({
    'text': newsgroups_train.data,
    'category': newsgroups_train.target
})

# Display some basic information about the dataset
print(df.head())
print("\nDataset Size:", df.shape)
print("\nNumber of Categories:", len(newsgroups_train.target_names))
print("\nCategories:", newsgroups_train.target_names)

                                                text  category
0  From: lerxst@wam.umd.edu (where's my thing)\nS...         7
1  From: guykuo@carson.u.washington.edu (Guy Kuo)...         4
2  From: twillis@ec.ecn.purdue.edu (Thomas E Will...         4
3  From: jgreen@amber (Joe Green)\nSubject: Re: W...         1
4  From: jcm@head-cfa.harvard.edu (Jonathan McDow...        14

Dataset Size: (11314, 2)

Number of Categories: 20

Categories: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
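
Recall@K, introduced later in this lab, divides by the number of documents in a category, so it can help to look at category sizes up front. The following is a minimal sketch using only the standard library. The `target_names` and `targets` lists are made-up stand-ins for `newsgroups_train.target_names` and `newsgroups_train.target`, not the real data.

```python
from collections import Counter

# Hypothetical stand-ins for newsgroups_train.target_names and newsgroups_train.target
target_names = ["comp.graphics", "rec.autos", "sci.space"]
targets = [2, 0, 2, 1, 2, 0]

# Count how many documents carry each integer label
label_counts = Counter(targets)

# Map integer labels back to readable category names
category_sizes = {target_names[label]: count for label, count in label_counts.items()}

for name, count in sorted(category_sizes.items()):
    print(f"{name}: {count}")
```

On the real dataset, the same idea should apply by passing `newsgroups_train.target` and `newsgroups_train.target_names` instead of the toy lists.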


In [3]:
print(f"TEXT:\n\t{df['text'][0]}\nCATEGORY:\n\t{newsgroups_train.target_names[df['category'][0]]}")

TEXT:
	From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





CATEGORY:
	rec.autos


<a id='1-1'></a>
### 1.1 Preprocessing and Vectorizing Data

In this section, you'll load a pre-trained model from the `sentence-transformers` library together with the document embeddings. The model `BAAI/bge-base-en-v1.5` encodes text into vectors. To save time, the dataset has been embedded ahead of time for you, so the model will be used only to vectorize the queries.

In [4]:
# Load the pre-trained sentence transformer model
model_name =  "BAAI/bge-base-en-v1.5"
model = SentenceTransformer(os.path.join(os.environ['MODEL_PATH'],model_name))

embedding_vectors = joblib.load('embeddings.joblib')

In [5]:
len(embedding_vectors)

11314

<a id='1-2'></a>
### 1.2 Basic functions for retrieval

Now let's implement the retrieval step of a RAG system by performing a similarity search over our precomputed embeddings. This code uses cosine similarity to find the most relevant documents for a given query. Let's first define our basic functions.
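
As a quick reminder of what cosine similarity measures, here is a small pure-Python sketch on two-dimensional toy vectors. The `cosine` helper is a hypothetical illustration and not the lab's `cosine_similarity` function. It shows that the score depends on direction only, not on vector length.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length sequences; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # longer vector pointing the same way
orthogonal = [0.0, 2.0]      # unrelated direction
opposite = [-1.0, 0.0]       # pointing the opposite way

print(cosine(query, same_direction))  # highest possible score
print(cosine(query, orthogonal))      # no similarity
print(cosine(query, opposite))        # lowest possible score
```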


In [6]:
def preprocess_text(text):
    """
    Preprocess the text data by removing leading and trailing whitespace.

    Parameters:
    text (str): The input text to preprocess.

    Returns:
    str: The preprocessed text, with leading and trailing whitespace removed.
    """
    # Example preprocessing: remove leading/trailing whitespace
    text = text.strip()
    return text


def cosine_similarity(v1, array_of_vectors):
    """
    Cosine similarity between a vector and either a single vector (1D) or an array of vectors (2D).
    Returns a float for 1D input, or a list of floats for 2D input.
    Safely handles PyTorch tensors (moves to CPU) and NumPy arrays.
    """
    # Handle torch tensors for v1
    if hasattr(v1, "detach"):  # torch tensor
        v1 = v1.detach().cpu().numpy()
    v1 = np.asarray(v1, dtype=np.float32).ravel()

    # Handle torch tensors for array_of_vectors
    if hasattr(array_of_vectors, "detach"):  # torch tensor
        array_of_vectors = array_of_vectors.detach().cpu().numpy()
    A = np.asarray(array_of_vectors, dtype=np.float32)

    if A.ndim == 1:
        A = A.ravel()
        denom = np.linalg.norm(v1) * np.linalg.norm(A)
        return float(0.0 if denom == 0 else np.dot(v1, A) / denom)

    # 2D case: compute similarities for each row in A
    A = np.atleast_2d(A)
    v1_norm = np.linalg.norm(v1)
    A_norms = np.linalg.norm(A, axis=1)
    denom = v1_norm * A_norms
    with np.errstate(divide='ignore', invalid='ignore'):
        sims = (A @ v1) / np.where(denom == 0, 1.0, denom)
    sims[denom == 0] = 0.0
    return sims.tolist()


def top_k_greatest_indices(lst, k):
    """
    Get the indices of the top k greatest items in a list.

    Parameters:
    lst (list): The list of elements to evaluate.
    k (int): The number of top elements to retrieve by index.

    Returns:
    list: A list of indices corresponding to the top k greatest elements in lst.
    """
    # Enumerate the list to keep track of indices
    indexed_list = list(enumerate(lst))
    # Sort by element values in descending order
    sorted_by_value = sorted(indexed_list, key=lambda x: x[1], reverse=True)
    # Extract the top k indices
    top_k_indices = [index for index, value in sorted_by_value[:k]]
    return top_k_indices
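
Sorting the full list of scores works, but when K is much smaller than the corpus a partial selection can be enough. The sketch below is one possible alternative using `heapq.nlargest` from the standard library. The name `top_k_indices_heap` is hypothetical and is not part of the lab code.

```python
import heapq

def top_k_indices_heap(scores, k):
    """Indices of the k largest scores, highest first (hypothetical helper)."""
    # nlargest keeps only k candidates at a time instead of sorting everything
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

scores = [0.12, 0.87, 0.45, 0.91, 0.30]
print(top_k_indices_heap(scores, 3))
```

For small lists the difference is negligible; the point is only that a full sort is not strictly required to find the top K.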

Now let's define the retriever function.

In [7]:
def retrieve_documents(query, embeddings, model, top_k=5):
    """
    Retrieve top-k most similar documents to a query using cosine similarity.
    Assumes:
      - preprocess_text, cosine_similarity, top_k_greatest_indices, df, and newsgroups_train are defined elsewhere.
      - embeddings is an iterable of document embeddings (NumPy arrays or torch tensors).
      - model.encode supports convert_to_tensor parameter (e.g., sentence-transformers).
    """
    
    query_clean = preprocess_text(query)
    query_embedding = model.encode(query_clean, convert_to_tensor=False).astype(np.float32)

    cosine_scores = []
    for x in embeddings:
        # Ensure each embedding is a NumPy array
        if hasattr(x, "detach"):  # torch tensor
            x = x.detach().cpu().numpy()
        x = np.asarray(x, dtype=np.float32)

        score = cosine_similarity(query_embedding, x)  # returns a float for 1D x
        cosine_scores.append(float(score))

    top_results = top_k_greatest_indices(cosine_scores, k=top_k)

    print(f"Query: {query}")
    for idx in top_results:
        print(f"Document: {df.iloc[idx]['text'][:200]}...")
        print(f"Category: {newsgroups_train.target_names[df.iloc[idx]['category']]}...")
        print("\n\n")

        
# Example query
example_query = "space exploration"
retrieve_documents(example_query, embedding_vectors, model, top_k = 2)

Query: space exploration
Document: From: u1452@penelope.sdsc.edu (Jeff Bytof - SIO)
Subject: End of the Space Age?
Organization: San Diego Supercomputer Center @ UCSD
Lines: 16
Distribution: world
NNTP-Posting-Host: penelope.sdsc.edu

...
Category: sci.space...



Document: From: dennisn@ecs.comm.mot.com (Dennis Newkirk)
Subject: Space class for teachers near Chicago
Organization: Motorola
Distribution: usa
Nntp-Posting-Host: 145.1.146.43
Lines: 59

I am posting this for...
Category: sci.space...





<a id='2'></a>
## 2 - Retrieving metrics

---

Let's briefly explore the two most common metrics for retrieval systems: Precision@K and Recall@K.

<a id='2-1'></a>
### 2.1 Precision@K

Precision@K provides an evaluation of the relevancy of the top K retrieved documents. It's calculated as the ratio of relevant documents in the top K results to K (the total number of documents retrieved).

$$\text{Precision@K} = \frac{\text{Number of Relevant Documents in Top K}}{\text{K}}$$

where K is the number of documents retrieved.
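
To make the formula concrete, here is a small worked example on a toy ranking. It assumes each retrieved document is represented by its category label, ordered best match first. The helper `precision_from_labels` and the labels are illustrative only.

```python
def precision_from_labels(ranked_labels, desired, k):
    """Precision@K given the category label of each retrieved document, best first."""
    if k <= 0:
        return 0.0
    # Count how many of the first k labels match the desired category
    relevant = sum(1 for label in ranked_labels[:k] if label == desired)
    return relevant / k

# Toy ranking: 3 of the 5 retrieved documents belong to the desired category
ranked = ["sci.space", "sci.space", "sci.med", "sci.space", "rec.autos"]

print(precision_from_labels(ranked, "sci.space", k=5))  # 3 relevant out of 5
print(precision_from_labels(ranked, "sci.space", k=2))  # 2 relevant out of 2
```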

In [8]:
def precision_at_k(relevant_count, k):
    """
    Calculate the Precision@K for a retrieval system.

    Precision@K is the ratio of relevant documents in the top K retrieved documents
    to the total number K of documents retrieved.

    Args:
        relevant_count (int): Number of relevant documents in the top K results.
        k (int): Total number of documents retrieved (K).

    Returns:
        float: The Precision@K value, or 0.0 if k is zero.
    
    Raises:
        ValueError: If any input is negative.
    """
    if relevant_count < 0 or k < 0:
        raise ValueError("All input values must be non-negative.")
    
    if k == 0:
        return 0.0

    return relevant_count / k

<a id='2-2'></a>
### 2.2 Recall@K

Recall@K evaluates the retrieval system's ability to find all relevant documents from the dataset within the top K results. It's calculated as the ratio of relevant documents in the top K results to the total number of relevant documents in the entire corpus.

$$\text{Recall@K} = \frac{\text{Number of Relevant Documents in Top K}}{\text{Total Number of Relevant Documents in Corpus}}$$
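
The same toy ranking can illustrate recall. The sketch below assumes the corpus contains 4 documents of the desired category; that number, like the helper name `recall_from_labels`, is made up for illustration.

```python
def recall_from_labels(ranked_labels, desired, k, total_relevant):
    """Recall@K given retrieved labels (best first) and the corpus-wide relevant count."""
    if k <= 0 or total_relevant <= 0:
        return 0.0
    # Same numerator as Precision@K; only the denominator changes
    relevant = sum(1 for label in ranked_labels[:k] if label == desired)
    return relevant / total_relevant

ranked = ["sci.space", "sci.space", "sci.med", "sci.space", "rec.autos"]
total_relevant = 4  # assumed number of sci.space documents in the toy corpus

print(recall_from_labels(ranked, "sci.space", k=5, total_relevant=total_relevant))  # 3 of 4 found
print(recall_from_labels(ranked, "sci.space", k=2, total_relevant=total_relevant))  # 2 of 4 found
```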

In [9]:
def recall_at_k(relevant_count, total_relevant):
    """
    Calculate the Recall@K for a retrieval system.

    Recall@K is the ratio of relevant documents in the top K retrieved documents
    to the total number of relevant documents in the entire corpus.

    Args:
        relevant_count (int): Number of relevant documents in the top K results.
        total_relevant (int): Total number of relevant documents in the corpus.

    Returns:
        float: The Recall@K value, or 0.0 if total_relevant is zero.
    
    Raises:
        ValueError: If any input is negative.
    """
    if relevant_count < 0 or total_relevant < 0:
        raise ValueError("All input values must be non-negative.")

    if total_relevant == 0:
        return 0.0

    return relevant_count / total_relevant

<a id='2-3'></a>
### 2.3 Computing metrics over some queries

Now let's compute these metrics on some pre-defined queries.

In [10]:
# Define more complex test queries with their corresponding desired categories
test_queries = [
    {"query": "advancements in space exploration technology", "desired_category": "sci.space"},
    {"query": "real-time rendering techniques in computer graphics", "desired_category": "comp.graphics"},
    {"query": "latest findings in cardiovascular medical research", "desired_category": "sci.med"},
    {"query": "NHL playoffs and team performance statistics", "desired_category": "rec.sport.hockey"},
    {"query": "impacts of cryptography in online security", "desired_category": "sci.crypt"},
    {"query": "the role of electronics in modern computing devices", "desired_category": "sci.electronics"},
    {"query": "motorcycles maintenance tips for enthusiasts", "desired_category": "rec.motorcycles"},
    {"query": "high-performance baseball tactics for championships", "desired_category": "rec.sport.baseball"},
    {"query": "historical influence of politics on society", "desired_category": "talk.politics.misc"},
    {"query": "latest technology trends in the Windows operating system", "desired_category": "comp.os.ms-windows.misc"}
    
]


In [11]:
def compute_metrics(queries, embeddings, model, top_k=5):
    """
    Compute Precision@K and Recall@K for a list of queries against a dataset of document embeddings.
    Assumes:
      - preprocess_text, cosine_similarity, top_k_greatest_indices, precision_at_k, recall_at_k, df, newsgroups_train are defined elsewhere.
      - embeddings is a list/iterable of document embeddings (NumPy arrays or torch tensors).
      - model.encode supports convert_to_tensor parameter (e.g., sentence-transformers).
    """

    results = []

    # Normalize all embeddings to NumPy once
    np_embeddings = []
    for x in embeddings:
        if hasattr(x, "detach"):  # torch tensor
            x = x.detach().cpu().numpy()
        np_embeddings.append(np.asarray(x, dtype=np.float32).ravel())
    E = np.vstack(np_embeddings)  # shape: (N, D)

    for item in queries:
        query = item["query"]
        desired_category = item["desired_category"]

        # Get NumPy, not torch, to avoid GPU->NumPy conversion errors
        q_clean = preprocess_text(query)
        q_emb = model.encode(q_clean, convert_to_tensor=False)
        q_emb = np.asarray(q_emb, dtype=np.float32).ravel()

        # Compute similarities vectorized
        cosine_scores = cosine_similarity(q_emb, E)  # list of floats length N

        # Top-K indices
        top_results = top_k_greatest_indices(cosine_scores, k=top_k)

        # Retrieved categories
        retrieved_categories = [
            newsgroups_train.target_names[df.iloc[idx]["category"]] for idx in top_results
        ]

        # Metrics
        relevant_in_top_k = sum(1 for cat in retrieved_categories if cat == desired_category)
        # Count corpus documents in the desired category with a vectorized comparison
        desired_label = newsgroups_train.target_names.index(desired_category)
        total_relevant_in_corpus = int((df["category"] == desired_label).sum())

        p = precision_at_k(relevant_in_top_k, top_k)
        r = recall_at_k(relevant_in_top_k, total_relevant_in_corpus)

        results.append({
            "query": query,
            "precision@k": p,
            "recall@k": r,
        })

    return results
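
Per-query numbers are often summarized as a single average per metric. The following sketch assumes a list of dictionaries shaped like the output of `compute_metrics`; the queries and values here are toy data, and `summarize` is a hypothetical helper.

```python
from statistics import mean

# Toy results in the same shape that compute_metrics returns
results = [
    {"query": "toy query A", "precision@k": 1.0, "recall@k": 0.01},
    {"query": "toy query B", "precision@k": 0.4, "recall@k": 0.00},
    {"query": "toy query C", "precision@k": 0.8, "recall@k": 0.01},
]

def summarize(results):
    """Average each metric over queries, so every query counts equally."""
    return {
        "mean_precision@k": mean(r["precision@k"] for r in results),
        "mean_recall@k": mean(r["recall@k"] for r in results),
    }

summary = summarize(results)
print(f"Mean precision: {summary['mean_precision@k']:.3f}")
print(f"Mean recall:    {summary['mean_recall@k']:.4f}")
```

Note that `statistics.mean` raises an error on an empty list, so this sketch assumes at least one query.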

In [12]:
# Run the queries and compute metrics with different K values
k_values = [5, 20, 50]

for k in k_values:
    print(f"\n{'='*80}")
    print(f"Results with K={k}:")
    print('='*80)
    results = compute_metrics(test_queries, embedding_vectors, model, top_k=k)
    
    # Display the results
    for result in results:
        print(f"Query: {result['query']}")
        print(f"  Precision@{k}: {result['precision@k']:.2f}, Recall@{k}: {result['recall@k']:.2f}")
        print()


Results with K=5:
Query: advancements in space exploration technology
  Precision@5: 1.00, Recall@5: 0.01

Query: real-time rendering techniques in computer graphics
  Precision@5: 1.00, Recall@5: 0.01

Query: latest findings in cardiovascular medical research
  Precision@5: 1.00, Recall@5: 0.01

Query: NHL playoffs and team performance statistics
  Precision@5: 1.00, Recall@5: 0.01

Query: impacts of cryptography in online security
  Precision@5: 1.00, Recall@5: 0.01

Query: the role of electronics in modern computing devices
  Precision@5: 1.00, Recall@5: 0.01

Query: motorcycles maintenance tips for enthusiasts
  Precision@5: 1.00, Recall@5: 0.01

Query: high-performance baseball tactics for championships
  Precision@5: 1.00, Recall@5: 0.01

Query: historical influence of politics on society
  Precision@5: 0.40, Recall@5: 0.00

Query: latest technology trends in the Windows operating system
  Precision@5: 0.80, Recall@5: 0.01


Results with K=20:
Query: advancements in space explor... (output truncated)

**Understanding the Results:**

The results above clearly demonstrate the **precision-recall tradeoff** in retrieval systems as we vary K from 5 to 20 to 50:

**Precision@K Trends (generally decreases as K increases):**

- **At K=5**: Most queries achieve very high precision: 8 out of 10 reach perfect precision (1.00) and one reaches 0.80, so nearly all retrieved documents are relevant. The exception is "historical influence of politics on society" at 0.40.
  
- **At K=20**: Precision starts to decline for some queries:
  - "electronics in computing devices" drops to 0.80 (from 1.00)
  - "Windows operating system" drops to 0.65 (from 0.80)
  - "motorcycles maintenance" drops to 0.95 (from 1.00)
  
- **At K=50**: Precision decreases further as we retrieve more documents:
  - "computer graphics" drops to 0.88 (from 1.00)
  - "electronics in computing devices" drops to 0.66 (from 0.80)
  - "Windows operating system" drops to 0.60 (from 0.65)
  - "historical influence of politics" remains around 0.50-0.52, the lowest precision of any query at every K (it was 0.40 at K=5)

**Recall@K Trends (increases as K increases):**

- **At K=5**: Recall is very low (~0.01 or 1%), meaning only about 1% of all relevant documents are retrieved
  
- **At K=20**: Recall triples to ~0.03 (3%), capturing more relevant documents
  
- **At K=50**: Recall increases to 0.05-0.08 (5-8%), roughly five to eight times the K=5 value
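
The two trends above can be reproduced on a toy ranking. The sketch below assumes a list of 0/1 relevance flags ordered best match first and a made-up total of 8 relevant documents in the corpus; none of these numbers come from the lab's dataset.

```python
# 1 = relevant, 0 = not relevant, ordered best match first (toy data)
relevance = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
total_relevant = 8  # assumed number of relevant documents in the toy corpus

def precision_recall_at(relevance, total_relevant, k):
    """Return (Precision@K, Recall@K) for a ranked list of 0/1 relevance flags."""
    hits = sum(relevance[:k])
    return hits / k, hits / total_relevant

# Evaluate the same ranking at increasing cutoffs
curve = {k: precision_recall_at(relevance, total_relevant, k) for k in (2, 5, 10)}

for k, (p, r) in curve.items():
    print(f"K={k}: precision={p:.2f}, recall={r:.2f}")
```

In this toy ranking, precision falls and recall rises as K grows, mirroring the pattern in the lab results.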

**Key Observations:**

1. **The tradeoff is clear**: As K increases, we retrieve more of the total relevant documents (higher recall), but at the cost of including some irrelevant documents (lower precision).

2. **Some queries are harder than others**: The query "historical influence of politics on society" consistently shows the lowest precision (0.40-0.52), suggesting that this query is semantically ambiguous or the category "talk.politics.misc" is harder to distinguish from related categories.

3. **For RAG systems, K=5 to K=20 is often a reasonable choice**: In these results, such values provide high precision (most retrieved documents are relevant) while keeping the context size manageable for the LLM. Even though recall is low, the goal is to find the *most relevant* documents, not *all* relevant documents.

4. **Recall remains relatively low even at K=50**: This is expected since each category contains hundreds of documents (500-600), so even if all 50 retrieved documents were relevant, they would cover only ~8-10% of the relevant documents. The observed 0.05-0.08 sits at or below that ceiling. To achieve high recall, we would need K values in the hundreds, which would severely impact precision and be impractical for RAG applications.
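
The ceiling mentioned in the last observation follows directly from the recall formula: at best, all K retrieved documents are relevant. A short sketch of that arithmetic is shown below, using the 500 and 600 category sizes quoted above as round illustrative values rather than exact counts. The helper `max_recall_at_k` is hypothetical.

```python
def max_recall_at_k(k, total_relevant):
    """Best possible Recall@K: every one of the K retrieved documents is relevant."""
    if total_relevant <= 0:
        return 0.0
    # You can never find more relevant documents than exist in the corpus
    return min(k, total_relevant) / total_relevant

for category_size in (500, 600):
    for k in (5, 20, 50):
        ceiling = max_recall_at_k(k, category_size)
        print(f"category size={category_size}, K={k}: max recall={ceiling:.3f}")
```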

Congratulations on finishing this Ungraded Lab! Keep it up!