# Source Extraction

<img src="https://live.staticflickr.com/65535/54443002259_4a8e1249dd_b.jpg" alt="Embedded Photo" width="500">

*Image generated using ChatGPT.*

## Introduction

Language models are prone to stating untruths or half-truths and to fabricating facts without citing sources. A common remedy is a system that, instead of answering a question directly, first searches a database (e.g., a collection of documents) and then generates an answer from the best-matching documents. Such an answer is more likely to be grounded in reality and can be verified by a human, provided the correct sources were found.

The number of sources can be huge, so search methods must be efficient: passing every document through a language model together with the query is out of the question. In this task, you will focus on finding the best sources for a given sentence using **embeddings** (vector representations).

Imagine that you are an AI engineer at a company developing a tool for scientific fact verification. Your task is to build a module that can quickly and effectively find reliable scientific publications that confirm or refute specific claims. Thanks to your solution, scientists, journalists, and policymakers will be able to verify information based on solid scientific foundations, which is especially important in the age of disinformation.

## Task

Your task is to develop a system that generates high-quality vector representations (embeddings) for both queries and source documents, enabling precise matching of relevant sources to queries.

Given **queries** (the set of statements for which we seek sources) and a **corpus** (the database of candidate source documents), you must implement functions that map each query and each document to a real-valued vector of dimension $768$. The provided evaluation function then uses these vectors to retrieve, for each query, its $k=10$ nearest neighbors (k-Nearest Neighbors) in the document set.

You may use the provided model based on the GPT2 architecture, which has been specially fine-tuned to produce high-quality embeddings.

While working on your solution, you will be able to test its effectiveness on a validation set, which will allow you to evaluate the quality of the generated embeddings in the context of finding appropriate source documents.

### Data

The available data in this task includes:

- A set of queries, for which appropriate sources must be found
- A corpus of documents, containing scientific publications that can be sources for the queries
- Information on query-to-document matches in the validation set

Your solution will be evaluated on the *SciFact* benchmark. It is used to assess search and fact verification systems in a scientific context. It consists of a set of statements (queries) based on real scientific publications, and the document base (corpus) consists of publications in the natural and medical sciences. For each statement, there is at least one publication that supports or refutes it. We provide code to load the data, so the data is described here for informational purposes only.

**The `corpus.jsonl` file** contains the unique identifier, title, and abstract of each scientific paper.

Example of a single document:
```
{
    "text_id": 13734012,
    "title": "Prevalent abnormal prion protein in human appendixes after bovine spongiform encephalopathy epizootic: large scale survey",
    "text": "OBJECTIVES To carry out a further survey (...) CONCLUSIONS This study corroborates previous studies and suggests a high prevalence of infection with abnormal PrP, indicating vCJD carrier status in the population compared with the 177 vCJD cases to date. These findings have important implications for the management of blood and blood products and for the handling of surgical instruments."
}
```

**The `queries_val.jsonl` file** contains the content of the statements and the identifier of the matching source text. The test set, on which your final solution will be evaluated, **will not contain** matching text identifiers.

Example of a single query:
```
{
    "query": "1 in 5 million in UK have abnormal PrP positivity.",
    "matching_text_id": 13734012
}
```

### Evaluation Criteria
The methods (functions) you implement, `Embedder.encode_queries` and `Embedder.encode_corpus`, will be used to process the queries $q \in Q$ and documents $d \in C$ into vectors. We will interchangeably use $q$ and $d$ to refer to both texts and their embeddings.

Let us assume that query $q\in Q$ corresponds to the gold document $d\in C$.
The evaluation code sorts all documents by decreasing similarity to $q$, resulting in documents $K_1, K_2, \dots, K_n$, such that $K_1$ is the closest. We denote by $I$ the index of the gold document $d$ in this sequence. This means that $I - 1$ is the number of documents ranked closer to $q$ than $d$ is.

Closeness is measured using cosine similarity (a higher value means closer), which for vectors $v, w \in \mathbb{R}^n$ is defined as $\frac{v^Tw}{||v|| \cdot ||w||}$, where $||v||$ is the Euclidean length of vector $v$.
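
As a small illustration of the cosine similarity defined above, the following sketch computes it in plain Python for a few toy vectors. The function name `cosine_similarity` and the example vectors are made up for this illustration; the actual evaluation uses the `cos_sim` function from the starter code.

```python
import math


def cosine_similarity(v: list[float], w: list[float]) -> float:
    """Cosine of the angle between v and w: (v . w) / (|v| * |w|)."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)


query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # longer vector pointing the same way
orthogonal = [0.0, 2.0]
opposite = [-1.0, 0.0]

sim_same = cosine_similarity(query, same_direction)  # 1.0
sim_orth = cosine_similarity(query, orthogonal)      # 0.0
sim_opp = cosine_similarity(query, opposite)         # -1.0
```

Only the direction of a vector matters here, not its length, so rescaling an embedding does not change its ranking.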

The result for query $q$ is defined as  

$$\text{nDCG@10}(q) = \begin{cases}
\frac{1}{\log_2(I + 1)} & \text{if $I \leq 10$} \\
0 & \text{otherwise.}
\end{cases}$$

So, the higher the gold document ranks among the documents closest to the query, the higher the score. If 10 or more "wrong" documents are closer to the query than the gold document, the score for that example is 0.

Your final solution will be scored based on the **nDCG@10** metric, calculated as the average value of this metric over all queries $(q \in Q )$.
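
To make the formula concrete, the following minimal sketch scores a few hypothetical gold-document ranks and averages them. The function name `ndcg_at_10` and the ranks are made up for illustration and are not part of the starter code.

```python
from math import log2


def ndcg_at_10(rank: int) -> float:
    """Score for one query whose gold document sits at 1-based rank `rank`."""
    return 1 / log2(rank + 1) if rank <= 10 else 0.0


# Hypothetical ranks I of the gold document for four queries.
gold_ranks = [1, 3, 10, 11]
per_query = [ndcg_at_10(r) for r in gold_ranks]  # roughly 1.0, 0.5, 0.289, 0.0
mean_ndcg = sum(per_query) / len(per_query)      # roughly 0.447
```

The score drops quickly with rank: moving the gold document from rank 1 to rank 3 already halves its contribution.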

- If the **nDCG@10** score is **0.2 or lower**, you will receive **0 points**.
- If the score is **0.5 or higher**, you will receive the **maximum score**, which is **100**.

Scoring for values between these thresholds will be calculated proportionally.
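
As a worked example, assuming linear interpolation between the two thresholds (which is what `compute_score` in the starter code does), a hypothetical mean nDCG@10 of $0.35$ would give:

$$\text{points} = 100 \cdot \frac{0.35 - 0.2}{0.5 - 0.2} = 100 \cdot \frac{0.15}{0.3} = 50.$$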

## Constraints

- Your solution will be tested on the Competition Platform without internet access and in a GPU environment.
- The final evaluation of your solution on the Competition Platform must not exceed 10 minutes using a GPU.
- The embedding of each query and document must have a dimension of 768.
- Allowed libraries: `torch`, `pandas`, `numpy`, `nltk`, `transformers`.
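
Given the time limit, one common approach is to encode texts in fixed-size batches rather than one at a time. The helper below is a minimal sketch of that idea in plain Python; the name `iter_batches` and the batch size are assumptions for illustration, and a suitable batch size depends on the available GPU memory.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"document {i}" for i in range(10)]
batches = list(iter_batches(texts, batch_size=4))
batch_sizes = [len(b) for b in batches]  # [4, 4, 2]
```

Each batch could then be tokenized and passed through the model, with the resulting embeddings concatenated in the original order.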

## Submission Files

You must submit only this notebook filled in with your solution (see the `Embedder` class).

## Tips

- GPT2 is a decoder-style language model. For a given sequence of tokens (e.g., a prefix of a sentence) $t_1, t_2, \dots, t_n$, such a model computes a hidden vector $h_{n+1} \in \mathbb{R}^d$ and then transforms it with one of its weight matrices (followed by a softmax) into $p_{n+1} \in \mathbb{R}^m$, a probability distribution over the vocabulary tokens.
- The corpus is large relative to the available execution time, so encoding must be efficient.
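
One way to turn per-token hidden vectors into a single embedding is to pool them. The sketch below shows position-weighted mean pooling, in which later tokens receive larger weights because, in a decoder, they have attended to more of the sequence. That this particular pooling suits the provided checkpoint is an assumption based on its name, not something the task statement confirms; the function name `weighted_mean_pool` and the toy tensors are made up for illustration.

```python
import torch


def weighted_mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """
    Position-weighted mean over token hidden states.

    hidden:         (batch, seq_len, dim) hidden states
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    returns:        (batch, dim) pooled embeddings
    """
    seq_len = hidden.shape[1]
    # Weights 1, 2, ..., seq_len for positions 1..seq_len.
    positions = torch.arange(1, seq_len + 1, dtype=hidden.dtype, device=hidden.device)
    # Zero out the weights of padding tokens.
    weights = positions.unsqueeze(0) * attention_mask.to(hidden.dtype)
    # Normalise so that the weights of each sequence sum to 1.
    weights = weights / weights.sum(dim=1, keepdim=True).clamp(min=1e-9)
    return (hidden * weights.unsqueeze(-1)).sum(dim=1)


# Toy input: 2 sequences, 3 positions, hidden size 2.
# The second sequence has one right-padded position.
hidden = torch.tensor([[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                       [[2.0, 2.0], [4.0, 4.0], [9.0, 9.0]]])
mask = torch.tensor([[1, 1, 1], [1, 1, 0]])
pooled = weighted_mean_pool(hidden, mask)  # shape (2, 2)
```

With the starter code, `hidden` would typically be `self.model(**batch).last_hidden_state` and the mask `batch["attention_mask"]`, where `batch` is the output of the provided `Tokenizer`.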

## Evaluation

During final evaluation, the flag `FINAL_EVALUATION_MODE` will be set to `True`.

You can earn between 0 and 100 points for this task. The number of points you receive will be calculated on a (secret) test set on the Competition Platform based on the above formula, rounded to the nearest integer. If your solution does not meet the above criteria or does not execute correctly, you will receive 0 points for the task.

# Starter Code

In this section, we initialize the environment by importing the necessary libraries and functions. The prepared tokenizer, data loading, and evaluation code will help you operate on the data and solve the task.

In [1]:
######################### DO NOT MODIFY THIS CELL ##########################

FINAL_EVALUATION_MODE = False  # During the evaluation of your solution, we will set this value to True

In [None]:
######################### DO NOT MODIFY THIS CELL ##########################

import json
import os
from math import log2

import torch
from tqdm import tqdm
from transformers import AutoModel, AutoTokenizer


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class Tokenizer:
    def __init__(self, tokenizer_path, length=150):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.tokenizer.padding_side = "right"
        self.length = length

    def __call__(self, batch_text):
        batch_tensor = self.tokenizer(
            batch_text,
            max_length=self.length,
            truncation=True,
            padding=True,
            return_tensors="pt"
        )
        return batch_tensor.to(device)

## Loading Data  
In this part of the task, we load the corpus and the validation queries.

In [None]:
######################### DO NOT MODIFY THIS CELL ##########################

def load_corpus(file):
    corpus = {}
    with open(file, encoding="utf8") as f_in:
        for line in f_in:
            line = json.loads(line)
            corpus[line.get("text_id")] = {
                "text": line.get("text"),
                "title": line.get("title"),
            }
    return corpus

def load_queries(file):
    queries = {}
    matching_texts = {}
    with open(file, encoding="utf8") as f_in:
        for query_num, line in enumerate(f_in):
            line = json.loads(line)

            queries[query_num] = line.get("query")
            matching_texts[query_num] = line.get("matching_text_id")
    return queries, matching_texts

corpus = load_corpus("corpus.jsonl")
queries, matching_texts = load_queries("queries_val.jsonl")

print(f"Loaded {len(corpus)} texts and {len(queries)} queries.")

## Evaluation Criterion Code

Code similar to the one below will be used to evaluate the solution on the test set.

In [None]:
######################### DO NOT MODIFY THIS CELL ##########################

def evaluate_retrieval_ndcg(
    golden_matches: dict[int, int],
    results: dict[int, dict[int, float]],
) -> float:
    """
    Computes the nDCG metric value for the given search results.

    This function calculates the score of your solution based on the results for the top_k best documents 
    according to your embedder.

    :param golden_matches: A dictionary with the gold standard matches, where the key is the query ID 
                           and the value is the ID of the correct document.
    :param results: A dictionary with the search results, where the key is the query ID and the value 
                    is a dictionary of document IDs and their similarity scores to the query.
    :return: The value of the nDCG metric.
    """

    for query_id, v in results.items():
        results[query_id] = {k: v for k, v in sorted(v.items(), key=lambda item: -item[1])}

    ndcg_sum = 0
    for query_id, v in results.items():
        golden_document = golden_matches[query_id]
        for i, document_id in enumerate(v.keys()):
            if golden_document == document_id:
                ndcg_sum += 1 / log2(i + 2)

    ndcg = round(ndcg_sum / len(results), 5)
    return ndcg


def compute_score(ndcg: float) -> float:
    """
    Computes the final score based on the value of the nDCG metric.
    """
    lower_bound = 0.2
    upper_bound = 0.5

    if ndcg <= lower_bound:
        return 0
    elif lower_bound < ndcg < upper_bound:
        return int(round(100 * (ndcg - lower_bound) / (upper_bound - lower_bound)))
    else:
        return 100

### Retrieval  
Below is the code used to select the top_k best documents from the corpus for a given query.

In [None]:
######################### DO NOT MODIFY THIS CELL ##########################

def cos_sim(a: torch.Tensor, b: torch.Tensor):
    """
    Computes the cosine similarity cos_sim(a[i], b[j]) for all i and j.
    :return: Matrix with res[i][j]  = cos_sim(a[i], b[j])
    """
    a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
    b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
    return torch.mm(a_norm, b_norm.transpose(0, 1))

def search_topk_texts(
    embedder,
    corpus: dict[str, dict[str, str]],
    queries: dict[str, str],
    top_k: int = 10,
) -> dict[str, dict[str, float]]:
    results = {}

    # Create embeddings for all queries using model.encode_queries()
    # Runs semantic search against the corpus embeddings
    # Returns a ranked list with the corpus ids
    query_ids = list(queries.keys())
    results = {qid: {} for qid in query_ids}
    queries = [queries[qid] for qid in queries]
    query_embeddings = embedder.encode_queries(queries)

    corpus_ids = sorted(
        corpus,
        key=lambda k: len(corpus[k].get("title", "") + corpus[k].get("text", "")),
        reverse=True,
    )
    corpus = [corpus[cid] for cid in corpus_ids]

    # Encode chunk of corpus
    corpus_embeddings = embedder.encode_corpus(corpus)

    # Compute similarities using cosine similarity
    cos_scores = cos_sim(query_embeddings, corpus_embeddings)
    cos_scores[torch.isnan(cos_scores)] = -1

    # Get top-k values
    cos_scores_top_k_values, cos_scores_top_k_idx = torch.topk(
        cos_scores,
        min(top_k + 1, cos_scores.shape[1]),  # number of corpus documents; also works for a single query
        dim=1,
        largest=True,
        sorted=False,
    )
    cos_scores_top_k_values = cos_scores_top_k_values.cpu().tolist()
    cos_scores_top_k_idx = cos_scores_top_k_idx.cpu().tolist()

    for query_itr in range(len(query_embeddings)):
        query_id = query_ids[query_itr]
        for score, corpus_id in zip(cos_scores_top_k_values[query_itr], cos_scores_top_k_idx[query_itr]):
            results[query_id][corpus_ids[corpus_id]] = score

    return results

# Your Solution  
Place your solution in this section. Make changes only here!

In [None]:
class Embedder:
    # Do not change the constructor signature
    def __init__(self):
        # TODO: you can modify this method,
        # but do not change its signature! (i.e., do not change the arguments)
        self.model = AutoModel.from_pretrained("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit")
        self.tokenizer = Tokenizer("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit")

    def encode_queries(self, queries: list[str]):
        """
        Function for encoding queries.
        :param queries: A list of queries to be encoded.
        :return: Query embeddings - a tensor of shape (n, 768), where n = len(queries) is the number of queries.
        """

        # TODO: implement this method – encode the queries
        # Do not change the signature of this method! (i.e., do not change the arguments)
        # Remember, you can use the HuggingFace gpt-2 model...
        # You may use the Tokenizer implemented at the top of the notebook
        # Hint: Evaluation will be faster if the returned tensor is on the GPU.
        ...
        return torch.ones(len(queries), 768).to(device)

    def encode_corpus(self, texts: list[dict]):
        """
        Function for encoding source texts.
        :param texts: A list of texts to be encoded. Each text is represented as a dictionary:
            {
                "title": "..."
                "text": "...",
            }
        :return: Text embeddings - a tensor of shape (m, 768), where m = len(texts) is the number of texts.
        """

        # TODO: implement this method – encode the source texts
        # Do not change the signature of this method! (i.e., do not change the arguments)
        ...
        return torch.ones(len(texts), 768).to(device)
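
One design question in `encode_corpus` is which fields of a document to embed. The helper below is a hypothetical sketch (the name `format_document` is not part of the starter code) that joins the title and abstract into one string. Since the provided `Tokenizer` truncates inputs to `length=150` tokens by default, placing the title first keeps it from being cut off.

```python
def format_document(doc: dict) -> str:
    """Join title and abstract into one string; missing fields count as empty."""
    title = (doc.get("title") or "").strip()
    text = (doc.get("text") or "").strip()
    return f"{title} {text}".strip()


example = {"title": "A short title", "text": "OBJECTIVES To test something."}
formatted = format_document(example)
title_only = format_document({"title": "Only a title"})
```

Whether title plus abstract works better than either field alone is worth checking on the validation set.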

# Evaluation

Running the cell below will allow you to check how many points your solution would score on the validation data.  
Before submitting, make sure the entire notebook runs from start to finish without errors and without requiring any user intervention after selecting "Run All".

During evaluation, the model will be saved as `your_model.pkl` and evaluated on the test set.

In [None]:
######################### DO NOT MODIFY THIS CELL ##########################

if not FINAL_EVALUATION_MODE:
    embedder = Embedder()

    with torch.no_grad():
        results = search_topk_texts(embedder, corpus, queries, top_k=10)

    # Compute nDCG
    ndcg = evaluate_retrieval_ndcg(matching_texts, results)

    # Compute final score based on nDCG
    points = compute_score(ndcg)

    print(f"\nNumber of queries: {len(queries)}")
    print(f"Number of texts: {len(corpus)}")
    print(f"nDCG: {ndcg:.3f}")
    print(f"Score: {points}")