# BM25 benchmark


The BM25 algorithm was first introduced in the paper https://www.researchgate.net/publication/221037764_Okapi_at_TREC-3 where presented a new ranking approach, that improved the existing versions  and the TF-IDF. 

The BM25 model addresses two key limitations of the TF-IDF model, which was widely used at the time. These limitations are:

1. **Term Saturation**: In the TF-IDF model, term frequency (TF) has diminishing returns, meaning that additional occurrences of a term in a document yield diminishing benefits. The BM25 model introduces a saturation function to address this issue.

2. **Document Length Normalization**: The TF-IDF model does not account for variations in document lengths. BM25 incorporates document length normalization, which helps in ranking shorter and longer documents more fairly.


BM25(D, Q) = ∑ (IDF(q) * (f(q, D) * (k1 + 1)) / (f(q, D) + k1 * (1 - b + b * |D| / avgdl)))


# How to evaluate recommender systems

Evaluating recommender systems can involve different methods and metrics depending on the problem. Bellow a summary of the metrics that can be used. 

1. **Accuracy Metrics**: These measure how accurately the recommender system predicts user preferences.
   - **Mean Absolute Error (MAE)** and **Root Mean Square Error (RMSE)**: These are statistical measures that calculate the average magnitude of errors in a set of predictions.
   - **Precision and Recall**: Precision measures the proportion of recommended items that are relevant, while recall measures the proportion of relevant items that are recommended.
   - **Top-N Accuracy**: This measures how often the top N recommendations include items that users actually interact with.

2. **Diversity and Novelty Metrics**: These measure how varied and new the recommendations are.
   - **Diversity**: Assesses how different the recommended items are from each other.
   - **Novelty**: Measures how many new or unknown items are recommended to users.

3. **Coverage Metrics**: These assess the breadth of the recommender system.
   - **Catalog Coverage**: Measures the percentage of items in the catalog that are ever recommended.
   - **User Coverage**: Measures the percentage of users for whom the system can generate recommendations.

4. **Utility-Based Metrics**: These consider the usefulness of recommendations from a user's perspective.
   - **Click-Through Rate (CTR)**: Measures how often users click on the recommended items.
   - **Conversion Rate**: Measures how often recommendations lead to a sale or another desired action.

5. **User Satisfaction Metrics**: Evaluates user experience and satisfaction with the recommendations.
   - **User Surveys**: Direct feedback from users about their satisfaction with the recommendations.
   - **Session Duration**: Longer sessions might indicate greater engagement with the recommendations.

6. **Serendipity and Surprise Metrics**: These assess the unexpectedness of the recommendations.
   - **Serendipity**: Measures how the system can recommend items that a user might not find on their own but ends up liking.
   - **Surprise**: Measures how unexpected the recommendations are to the user.

7. **Fairness and Bias Metrics**: These ensure that the recommendations are fair and unbiased.
   - **Group Fairness**: Ensures that recommendations are equally effective for different user groups.
   - **Item Exposure**: Ensures a fair distribution of exposure among items.

8. **A/B Testing**: This involves comparing two or more versions of the recommender system to see which performs better according to specific metrics.

9. **Online and Offline Evaluation**: 
   - **Offline Evaluation**: Uses historical data to evaluate the recommender system without impacting real users.
   - **Online Evaluation**: Involves real-time testing with actual users, often through A/B testing.

# Existing benchmarks

In order to better compare internal models with existing ones the usage of public benchmarks is a common approach. Two of the most popular are: Beir and mteb. Bellow a summary of the two.

### BEIR (Benchmarking IR)

BEIR, or Benchmarking IR, is a benchmark suite designed for evaluating Information Retrieval (IR) models, particularly in the context of neural retrieval. It's significant because it provides a comprehensive, diverse, and challenging set of datasets for testing the generalizability of these models. Key aspects of BEIR include:

1. **Diverse Domains**: BEIR covers a wide range of domains, such as scientific articles, news articles, and general queries. This diversity ensures that models are tested in various scenarios, reflecting real-world applications.

2. **Heterogeneous Tasks**: The benchmark includes different types of IR tasks like fact-checking, question answering, and citation prediction. This helps in understanding how well models perform across different IR challenges.

3. **Zero-Shot Setting**: BEIR focuses on evaluating models in a zero-shot setting, where models are tested on datasets they were not trained on. This tests the model's ability to generalize to new data.

4. **Comprehensive Evaluation Metrics**: BEIR employs standard IR metrics like Normalized Discounted Cumulative Gain (nDCG), Mean Reciprocal Rank (MRR), and Precision@k. These metrics give a well-rounded assessment of model performance.

5. **Open and Extensible**: Researchers can contribute new datasets to BEIR, making it a growing and evolving benchmark.

### MTEB (Multi-Task Evaluation Benchmark)

MTEB, or the Multi-Task Evaluation Benchmark, is designed for evaluating large language models across a wide array of natural language processing (NLP) tasks. It's crucial for understanding how well these models can handle different types of language-related challenges. Key aspects of MTEB include:

1. **Broad Task Coverage**: MTEB covers a wide range of NLP tasks, such as text classification, question answering, and text generation. This helps in assessing the versatility of language models.

2. **Multi-Task Focus**: Unlike benchmarks that focus on a single task, MTEB evaluates models across multiple tasks simultaneously. This reflects real-world scenarios where models often need to handle various types of language processing tasks.

3. **Benchmark for Large Models**: MTEB is particularly suited for evaluating large-scale language models, like GPT and BERT variants, providing insights into their strengths and weaknesses across tasks.

4. **Quantitative Evaluation**: The benchmark uses a range of metrics specific to each task, providing a quantitative measure of performance.

5. **Comparison Across Models**: MTEB allows for direct comparison between different language models, facilitating an understanding of which models perform best on specific types of tasks.

# Let's experiment with BEIR

In [1]:
# Load ms_marco using the hugging face lib
from datasets import load_dataset
dataset = load_dataset("ms_marco", 'v1.1')

In [2]:
dataset

DatasetDict({
    validation: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 10047
    })
    train: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 82326
    })
    test: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 9650
    })
})

In [3]:
# organize datasets
import pandas as pd

train_df = pd.DataFrame(dataset["train"])
test_df = pd.DataFrame(dataset["test"])
validation_df = pd.DataFrame(dataset["validation"])


In [4]:
train_df

Unnamed: 0,answers,passages,query,query_id,query_type,wellFormedAnswers
0,[Results-Based Accountability is a disciplined...,"{'is_selected': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]...",what is rba,19699,description,[]
1,[Yes],"{'is_selected': [0, 1, 0, 0, 0, 0, 0], 'passag...",was ronald reagan a democrat,19700,description,[]
2,[20-25 minutes],"{'is_selected': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]...",how long do you need for sydney and surroundin...,19701,numeric,[]
3,[$11 to $22 per square foot],"{'is_selected': [0, 0, 0, 0, 0, 0, 0, 0, 1], '...",price to install tile in shower,19702,numeric,[]
4,[Due to symptoms in the body],"{'is_selected': [0, 0, 1, 0, 0, 0, 0, 0], 'pas...",why conversion observed in body,19703,description,[]
...,...,...,...,...,...,...
82321,[The act or action of propagating as a increas...,"{'is_selected': [1, 0, 0], 'passage_text': ['d...",meaning of propagation,102124,description,[]
82322,[Yes],"{'is_selected': [0, 0, 1, 0, 0, 0, 0, 0, 0], '...",do you have to do a phd to be a clinical psych...,102125,description,[]
82323,[Chablis],"{'is_selected': [0, 1, 0, 0, 0, 0], 'passage_t...",what wine goes with oysters,102126,entity,[]
82324,[1 Lithium carbonate 150 mg capsules. Lithium ...,"{'is_selected': [0, 0, 0, 1, 0, 0, 0, 0, 0], '...",what strengths does lithium come in,102127,description,[]


In [5]:
# Create column to store all the passages text
train_df["clean_text"] = train_df.loc[:,"passages"].apply(lambda x: x['passage_text'])

In [6]:
# Create a columns to store the passage indexes, in order to later be able to access the indexes.
val= 0
all_indexes = []
for line in train_df.clean_text:
    indexes = []
    for element in line: 
        val +=1
        indexes.append(val)
    
    all_indexes.append(indexes) 
train_df["indexes"] = all_indexes

In [7]:
train_df.head()

Unnamed: 0,answers,passages,query,query_id,query_type,wellFormedAnswers,clean_text,indexes
0,[Results-Based Accountability is a disciplined...,"{'is_selected': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]...",what is rba,19699,description,[],"[Since 2007, the RBA's outstanding reputation ...","[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]"
1,[Yes],"{'is_selected': [0, 1, 0, 0, 0, 0, 0], 'passag...",was ronald reagan a democrat,19700,description,[],"[In his younger years, Ronald Reagan was a mem...","[11, 12, 13, 14, 15, 16, 17]"
2,[20-25 minutes],"{'is_selected': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]...",how long do you need for sydney and surroundin...,19701,numeric,[],"[Sydney, New South Wales, Australia is located...","[18, 19, 20, 21, 22, 23, 24, 25, 26, 27]"
3,[$11 to $22 per square foot],"{'is_selected': [0, 0, 0, 0, 0, 0, 0, 0, 1], '...",price to install tile in shower,19702,numeric,[],"[In regards to tile installation costs, consum...","[28, 29, 30, 31, 32, 33, 34, 35, 36]"
4,[Due to symptoms in the body],"{'is_selected': [0, 0, 1, 0, 0, 0, 0, 0], 'pas...",why conversion observed in body,19703,description,[],"[Conclusions: In adult body CT, dose to an org...","[37, 38, 39, 40, 41, 42, 43, 44]"


# Let's compute baseline using available lib

In [8]:
from rank_bm25 import BM25Okapi
from nltk.tokenize import word_tokenize

In [9]:
corpus = []
for line in train_df.clean_text:
    document = []
    for i in line:
        document.append(word_tokenize(i.lower()))
    corpus += document


In [10]:
# Let's build the index 
bm25_title = BM25Okapi(corpus)

In [11]:
train_df["word_tokenize"] = train_df["query"].apply(word_tokenize)

In [12]:
def return_top_n_indices(scores, n=10):
    """ Return top n results given a list of scores """
    # Create a list of (index, value) pairs
    indexed_list = list(enumerate(scores))
    
    # Sort the indexed list by value in descending order
    indexed_list.sort(key=lambda x: x[1], reverse=True)
    
    # Get the indices of the top 10 values
    top_10_indices = [index for index, _ in indexed_list[:n]]
    return top_10_indices

In [13]:
result = []
import tqdm
for row in tqdm.tqdm(train_df["word_tokenize"][:100]):  # run just for 100 queries due to slow predict time
    value = bm25_title.get_scores(row)
    #value = bm25_title.get_top_n(row, corpus, n=10)
    result.append(return_top_n_indices(value))

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [02:09<00:00,  1.29s/it]


In [14]:
# Ensure we have the same length
train_df = train_df[:len(result)]

In [15]:
# Saves result indexes
train_df["retriever"] = result

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_df["retriever"] = result


In [16]:
train_df

Unnamed: 0,answers,passages,query,query_id,query_type,wellFormedAnswers,clean_text,indexes,word_tokenize,retriever
0,[Results-Based Accountability is a disciplined...,"{'is_selected': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]...",what is rba,19699,description,[],"[Since 2007, the RBA's outstanding reputation ...","[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]","[what, is, rba]","[333792, 333789, 8, 3, 4, 5, 6, 0, 7, 137248]"
1,[Yes],"{'is_selected': [0, 1, 0, 0, 0, 0, 0], 'passag...",was ronald reagan a democrat,19700,description,[],"[In his younger years, Ronald Reagan was a mem...","[11, 12, 13, 14, 15, 16, 17]","[was, ronald, reagan, a, democrat]","[14, 11, 391341, 624933, 624926, 344774, 24594..."
2,[20-25 minutes],"{'is_selected': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]...",how long do you need for sydney and surroundin...,19701,numeric,[],"[Sydney, New South Wales, Australia is located...","[18, 19, 20, 21, 22, 23, 24, 25, 26, 27]","[how, long, do, you, need, for, sydney, and, s...","[25, 19, 63938, 559080, 63937, 101552, 556007,..."
3,[$11 to $22 per square foot],"{'is_selected': [0, 0, 0, 0, 0, 0, 0, 0, 1], '...",price to install tile in shower,19702,numeric,[],"[In regards to tile installation costs, consum...","[28, 29, 30, 31, 32, 33, 34, 35, 36]","[price, to, install, tile, in, shower]","[341394, 491520, 462958, 596050, 259278, 51876..."
4,[Due to symptoms in the body],"{'is_selected': [0, 0, 1, 0, 0, 0, 0, 0], 'pas...",why conversion observed in body,19703,description,[],"[Conclusions: In adult body CT, dose to an org...","[37, 38, 39, 40, 41, 42, 43, 44]","[why, conversion, observed, in, body]","[479697, 652251, 529330, 18139, 594514, 508156..."
...,...,...,...,...,...,...,...,...,...,...
95,[WatchDog.sys is a vital system file used by t...,"{'is_selected': [0, 0, 0, 0, 1, 0, 0, 0, 0], '...",watchdog.sys what is,19794,description,[],[WatchDog.sys was originally stored in the sys...,"[775, 776, 777, 778, 779, 780, 781, 782, 783]","[watchdog.sys, what, is]","[775, 779, 782, 777, 778, 776, 774, 781, 62692..."
96,"[In computing, .bak is a filename extension co...","{'is_selected': [0, 0, 0, 0, 0, 0, 0, 1], 'pas...",what is a bak file,19795,description,[],[The easiest way to open a BAK file is to doub...,"[784, 785, 786, 787, 788, 789, 790, 791]","[what, is, a, bak, file]","[785, 783, 789, 787, 784, 597850, 554742, 1999..."
97,"[Public, four-year colleges cost $7,000 for in...","{'is_selected': [0, 0, 0, 0, 0, 0, 1, 0, 0], '...",How much will it cost to go to college to beco...,19796,numeric,[],[A: The degree that you need to be a detective...,"[792, 793, 794, 795, 796, 797, 798, 799, 800]","[How, much, will, it, cost, to, go, to, colleg...","[442170, 791, 92803, 577458, 15476, 261638, 79..."
98,[A document used to change one or more minor p...,"{'is_selected': [0, 0, 0, 1, 0, 1, 0, 0, 0], '...",trust amendment term,19797,description,[],[Trust Restatement Law & Legal Definition. A r...,"[801, 802, 803, 804, 805, 806, 807, 808, 809]","[trust, amendment, term]","[805, 807, 803, 802, 537989, 800, 804, 441656,..."


In [17]:
def precisioK(ground_truth, retrieved_items):
    # Calculate the number of common elements between the two lists
    common_elements = set(ground_truth) & set(retrieved_items)
    
    # Compute precision
    precision = len(common_elements) / len(ground_truth)
    return precision

In [18]:
result = train_df.apply(lambda row: precisioK(row['indexes'], row['retriever']), axis=1)


In [19]:
result.mean()

0.30794444444444447

# Run beir benchmark


In [20]:
!pip install beir

Looking in indexes: https://pypi.org/simple, https://__token__:****@git.naspersclassifieds.com/api/v4/projects/35468/packages/pypi/simple

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [21]:
from beir import util, LoggingHandler

import logging
import pathlib, os

#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])

In [22]:
import pathlib, os
from beir import util

dataset = "msmarco"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = os.path.join(os.getcwd(), "datasets")
data_path = util.download_and_unzip(url, out_dir)
print("Dataset downloaded here: {}".format(data_path))

Dataset downloaded here: /Users/tiago.cabo/Documents/github-repos/moviellens-ai-playground/notebooks/datasets/msmarco


Folder Structure of any BEIR dataset:
- scifact/
    -    corpus.jsonl
    -    queries.jsonl
    -    qrels/
    -    train.tsv
    -    dev.tsv
    -    test.tsv

In [23]:
# **Data Loading**

from beir.datasets.data_loader import GenericDataLoader

data_path = "datasets/msmarco"
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test") # or split = "train" or "dev"

2023-12-02 18:36:20 - Loading Corpus...


  0%|          | 0/8841823 [00:00<?, ?it/s]

2023-12-02 18:36:43 - Loaded 8841823 TEST Documents.
2023-12-02 18:36:43 - Doc Example: {'text': 'The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.', 'title': ''}
2023-12-02 18:36:43 - Loading Queries...
2023-12-02 18:36:44 - Loaded 43 TEST Queries.
2023-12-02 18:36:44 - Query Example: anthropological definition of environment


In [24]:
corpus

{'0': {'text': 'The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.',
  'title': ''},
 '1': {'text': 'The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.',
  'title': ''},
 '2': {'text': 'Essay on The Manhattan Project - The Manhattan Project The Manhattan Project was to see if making an atomic bomb possible. The success of this project would forever change the world forever making it known that something this powerful can be manmade.',
  'title': ''},
 '3': {'text': 'The Manhattan Project was the name for a project conducted during World War II, to develop the first atomic bomb. It refers speci

In [25]:
qrels

{'19335': {'1017759': 0,
  '1082489': 0,
  '109063': 0,
  '1160863': 0,
  '1160871': 0,
  '1189088': 0,
  '1203500': 0,
  '1231806': 0,
  '1231807': 0,
  '1274615': 0,
  '1274620': 0,
  '1324075': 0,
  '1509459': 0,
  '1555317': 0,
  '1568085': 0,
  '161603': 0,
  '1705525': 0,
  '1720387': 0,
  '1720388': 0,
  '1720389': 1,
  '1720393': 0,
  '1720395': 1,
  '1722': 0,
  '1725697': 0,
  '1726': 0,
  '1729': 2,
  '1730': 0,
  '1796642': 0,
  '1796647': 0,
  '1825416': 0,
  '1825418': 0,
  '1837110': 0,
  '1871222': 0,
  '1908804': 0,
  '1956669': 0,
  '1958100': 0,
  '1958102': 0,
  '1958103': 0,
  '1959553': 0,
  '2004186': 0,
  '2046505': 1,
  '2071723': 0,
  '2130187': 0,
  '2186129': 0,
  '2304004': 0,
  '2304005': 0,
  '2324839': 0,
  '2325143': 0,
  '2382766': 0,
  '2394677': 0,
  '256744': 0,
  '256746': 0,
  '256750': 0,
  '2594897': 0,
  '2604487': 0,
  '2725017': 0,
  '2874503': 0,
  '2943092': 0,
  '2978577': 0,
  '3045565': 1,
  '3045567': 1,
  '3137952': 0,
  '3175481': 3,


In [None]:
import tqdm
# Process corpus for BM25Okapi
tokenized_corpus = [doc['text'].split(" ") for doc in tqdm.tqdm(corpus.values())]
bm25 = BM25Okapi(tokenized_corpus)


 89%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎              | 7840689/8841823 [09:36<00:07, 133966.04it/s]

In [None]:
# Retrieval with BM25
results = {}
for query_id, query in queries.items():
    scores = bm25.get_scores(query['text'].split(" "))
    results[query_id] = {doc_id: score for doc_id, score in zip(corpus.keys(), scores)}


In [None]:
results

In [None]:
from beir.retrieval.evaluation import EvaluateRetrieval

# Evaluate
evaluator = EvaluateRetrieval()
ndcg, _map, recall, precision = evaluator.evaluate(qrels, results)