# **BEIR: A Heterogenous benchmark for Zero-shot Evaluation of Information Retrieval models**

This notebook contains simple, easy-to-follow examples for evaluating retrieval models on our new benchmark.

## Introduction
The BEIR benchmark covers 9 diverse retrieval tasks spread across 17 datasets, and we evaluate 9 state-of-the-art retrieval models on it in a zero-shot setup. In this Colab notebook, we first show how to download and load the 14 openly available datasets with just three lines of code. We then load state-of-the-art dense retrievers (bi-encoders) such as SBERT, ANCE, and DPR, use them for retrieval, and evaluate them in a zero-shot setup.

Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.

Developed by Nandan Thakur, Researcher @ UKP Lab, TU Darmstadt

(https://nthakur.xyz) (nandant@gmail.com)

In [1]:
!nvidia-smi

Tue May 28 19:28:16 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.56                 Driver Version: 532.10       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 4050 L...    On | 00000000:01:00.0  On |                  N/A |
| N/A   47C    P8                2W /  N/A|    499MiB /  6141MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [3]:
# Install the beir PyPI package
# !pip install beir

In [2]:
import beir

ModuleNotFoundError: No module named 'beir'

In [1]:
!pip -V

pip 23.2.1 from /home/ab/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pip (python 3.11)


In [2]:
from beir import util, LoggingHandler

import logging
import pathlib, os

#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout



# **BEIR Datasets**

BEIR contains 17 diverse datasets overall. You can view all the datasets (14 downloadable) with the link below:

[``https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/``](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/)

Please refer to the GitHub page to evaluate on the 3 remaining datasets that cannot be downloaded directly.


We include the following datasets in BEIR:

| Dataset   | Website| BEIR-Name | Domain     | Relevancy| Queries  | Documents | Avg. Docs/Q | Download |
| -------- | -----| ---------| ----------- | ---------| ---------| --------- | ------| ------------|
| MSMARCO    | [``Homepage``](https://microsoft.github.io/msmarco/)| ``msmarco`` | Misc.       |  Binary  |  6,980   |  8.84M     |    1.1 | Yes |  
| TREC-COVID |  [``Homepage``](https://ir.nist.gov/covidSubmit/index.html)| ``trec-covid``| Bio-Medical |  3-level|50|  171K| 493.5 | Yes |
| NFCorpus   | [``Homepage``](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) | ``nfcorpus``  | Bio-Medical |  3-level |  323     |  3.6K     |  38.2 | Yes |
| BioASQ     | [``Homepage``](http://bioasq.org) | ``bioasq``| Bio-Medical |  Binary  |   500    |  14.91M    |  8.05 | No |
| NQ         | [``Homepage``](https://ai.google.com/research/NaturalQuestions) | ``nq``| Wikipedia   |  Binary  |  3,452   |  2.68M  |  1.2 | Yes |
| HotpotQA   | [``Homepage``](https://hotpotqa.github.io) | ``hotpotqa``| Wikipedia   |  Binary  |  7,405   |  5.23M  |  2.0 | Yes |
| FiQA-2018  | [``Homepage``](https://sites.google.com/view/fiqa/) | ``fiqa``    | Finance     |  Binary  |  648     |  57K    |  2.6 | Yes |
| Signal-1M (RT) | [``Homepage``](https://research.signal-ai.com/datasets/signal1m-tweetir.html)| ``signal1m`` | Twitter     |  3-level  |   97   |  2.86M  |  19.6 | No |
| TREC-NEWS  | [``Homepage``](https://trec.nist.gov/data/news2019.html) | ``trec-news``    | News     |  5-level  |   57    |  595K    |  19.6 | No |
| ArguAna    | [``Homepage``](http://argumentation.bplaced.net/arguana/data) | ``arguana`` | Misc.       |  Binary  |  1,406     |  8.67K    |  1.0 | Yes |
| Touche-2020| [``Homepage``](https://webis.de/events/touche-20/shared-task-1.html) | ``webis-touche2020``| Misc.       |  6-level  |  49     |  382K    |  49.2 |  Yes |
| CQADupstack| [``Homepage``](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) | ``cqadupstack``| StackEx.      |  Binary  |  13,145 |  457K  |  1.4 |  Yes |
| Quora| [``Homepage``](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) | ``quora``| Quora  | Binary  |  10,000     |  523K    |  1.6 |  Yes |
| DBPedia | [``Homepage``](https://github.com/iai-group/DBpedia-Entity/) | ``dbpedia-entity``| Wikipedia |  3-level  |  400    |  4.63M    |  38.2 |  Yes |
| SCIDOCS| [``Homepage``](https://allenai.org/data/scidocs) | ``scidocs``| Scientific |  Binary  |  1,000     |  25K    |  4.9 |  Yes |
| FEVER| [``Homepage``](http://fever.ai) | ``fever``| Wikipedia     |  Binary  |  6,666     |  5.42M    |  1.2|  Yes |
| Climate-FEVER| [``Homepage``](http://climatefever.ai) | ``climate-fever``| Wikipedia |  Binary  |  1,535     |  5.42M |  3.0 |  Yes |
| SciFact| [``Homepage``](https://github.com/allenai/scifact) | ``scifact``| Scientific |  Binary  |  300     |  5K    |  1.1 |  Yes |


For simplicity, we use one of the smallest datasets, ``SciFact``, in the examples below.

You can evaluate any dataset you wish by looking at the table above.

In [3]:
import pathlib, os
from beir import util

dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = os.path.join(os.getcwd(), "datasets")
data_path = util.download_and_unzip(url, out_dir)
print("Dataset downloaded here: {}".format(data_path))

2024-05-28 19:46:46 - Downloading scifact.zip ...


/mnt/c/D_drive/UCSD/Quarters/Q3/DSC253-Adv_txt_mining/Project/slm4search/test/datasets/scifact.zip: 100%|██████████| 2.69M/2.69M [00:01<00:00, 2.26MiB/s]


2024-05-28 19:46:49 - Unzipping scifact.zip ...
Dataset downloaded here: /mnt/c/D_drive/UCSD/Quarters/Q3/DSC253-Adv_txt_mining/Project/slm4search/test/datasets/scifact


# **Folder Structure of any BEIR dataset**

* scifact/
    * corpus.jsonl
    * queries.jsonl
    * qrels/
        * train.tsv
        * dev.tsv
        * test.tsv
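
To make the layout concrete, here is a minimal sketch that writes a tiny dataset in this folder structure and reads it back using only the standard library. The field names (`_id`, `title`, `text`) and the TSV header (`query-id`, `corpus-id`, `score`) reflect our understanding of the BEIR file format and should be treated as assumptions; the records themselves are made up for illustration, and the helper names are hypothetical.

```python
import csv
import json
import pathlib
import tempfile


def write_toy_dataset(root: pathlib.Path) -> None:
    """Write a tiny, made-up dataset in the folder layout listed above."""
    (root / "qrels").mkdir(parents=True, exist_ok=True)
    corpus = [
        {"_id": "d1", "title": "Doc one", "text": "Cells divide."},
        {"_id": "d2", "title": "Doc two", "text": "Stars burn."},
    ]
    queries = [{"_id": "q1", "text": "How do cells divide?"}]
    with open(root / "corpus.jsonl", "w", encoding="utf-8") as f:
        for doc in corpus:
            f.write(json.dumps(doc) + "\n")
    with open(root / "queries.jsonl", "w", encoding="utf-8") as f:
        for query in queries:
            f.write(json.dumps(query) + "\n")
    with open(root / "qrels" / "test.tsv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["query-id", "corpus-id", "score"])  # assumed header
        writer.writerow(["q1", "d1", 1])


def read_jsonl(path: pathlib.Path) -> dict:
    """Map each record's `_id` to the remaining fields of that record."""
    records = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            records[record.pop("_id")] = record
    return records


def read_qrels(path: pathlib.Path) -> dict:
    """Return {query_id: {doc_id: relevance}} from a tab-separated file."""
    qrels = {}
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels


with tempfile.TemporaryDirectory() as tmp:
    toy_root = pathlib.Path(tmp) / "toy"
    write_toy_dataset(toy_root)
    toy_corpus = read_jsonl(toy_root / "corpus.jsonl")
    toy_queries = read_jsonl(toy_root / "queries.jsonl")
    toy_qrels = read_qrels(toy_root / "qrels" / "test.tsv")

print(toy_corpus["d1"]["title"], "|", toy_queries["q1"]["text"], "|", toy_qrels)
```

In practice the `GenericDataLoader` does this parsing for you; the sketch is only meant to show what the files contain.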

In [4]:
!ls datasets/scifact/

corpus.jsonl  qrels  queries.jsonl


# **Data Loading**

In [5]:
from beir.datasets.data_loader import GenericDataLoader

data_path = "datasets/scifact"
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test") # or split = "train" or "dev"

2024-05-28 19:46:53 - Loading Corpus...


100%|██████████| 5183/5183 [00:00<00:00, 13189.24it/s]


2024-05-28 19:46:54 - Loaded 5183 TEST Documents.
2024-05-28 19:46:54 - Doc Example: {'text': 'Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and result in functional disabilities. A line scan diffusion-weighted magnetic resonance imaging (MRI) sequence with diffusion tensor analysis was applied to measure the apparent diffusion coefficient, to calculate relative anisotropy, and to delineate three-dimensional fiber architecture in cerebral white matter in preterm (n = 17) and full-term infants (n = 7). To assess effects of prematurity on cerebral white matter development, early gestation preterm infants (n = 10) were studied a second time at term. In the central white matter the mean apparent diffusion coefficient at 28 wk was high, 1.8 microm2/ms, and decreased toward term to 1.2 microm2/ms. In the posterior limb of the internal capsule, the mean apparent diffusion coefficients at both times were similar (1.2 vers
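
The three objects returned by the loader are plain dictionaries. The sketch below uses made-up toy data with the shapes we rely on in the rest of this notebook (documents as dictionaries with `title` and `text`, queries as strings, qrels as nested dictionaries) and shows two quick sanity checks. The function names are hypothetical helpers, not part of BEIR, and reading the table's "Avg. Docs/Q" column as "relevant documents per query" is our assumption.

```python
from typing import Dict, List

# Toy stand-ins for the loader's return values (all values are made up).
toy_corpus: Dict[str, Dict[str, str]] = {
    "d1": {"title": "Doc one", "text": "Cells divide."},
    "d2": {"title": "Doc two", "text": "Stars burn."},
    "d3": {"title": "Doc three", "text": "Cells grow."},
}
toy_queries: Dict[str, str] = {"q1": "cell division", "q2": "stellar fusion"}
toy_qrels: Dict[str, Dict[str, int]] = {"q1": {"d1": 1, "d3": 1}, "q2": {"d2": 1}}


def avg_relevant_docs_per_query(qrels: Dict[str, Dict[str, int]]) -> float:
    """Average number of judged-relevant documents (score > 0) per query."""
    counts = [sum(1 for score in docs.values() if score > 0) for docs in qrels.values()]
    return sum(counts) / len(counts)


def dangling_ids(corpus, queries, qrels) -> List[str]:
    """List ids referenced in qrels that are missing from corpus or queries."""
    missing = []
    for query_id, docs in qrels.items():
        if query_id not in queries:
            missing.append(query_id)
        missing.extend(doc_id for doc_id in docs if doc_id not in corpus)
    return missing


avg_docs = avg_relevant_docs_per_query(toy_qrels)
missing = dangling_ids(toy_corpus, toy_queries, toy_qrels)
print(avg_docs, missing)
```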

# **Dense Retrieval using Exact Search**

## **Sentence-BERT**
We use the [``msmarco-distilbert-base-v3``](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html) SBERT model in this example.

In [6]:
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval import models
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

#### Dense Retrieval using SBERT (Sentence-BERT) ####
#### Provide any pretrained sentence-transformers model
#### The model was fine-tuned using cosine-similarity.
#### Complete list - https://www.sbert.net/docs/pretrained_models.html

model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

#### Retrieve dense results (format of results is identical to qrels)
results = retriever.retrieve(corpus, queries)

2024-05-28 19:46:59 - PyTorch version 2.3.0 available.
2024-05-28 19:46:59 - JAX version 0.4.23 available.
2024-05-28 19:47:00 - Loading faiss with AVX2 support.
2024-05-28 19:47:00 - Successfully loaded faiss with AVX2 support.
2024-05-28 19:47:04 - Use pytorch device_name: cuda
2024-05-28 19:47:04 - Load pretrained SentenceTransformer: msmarco-distilbert-base-v3




2024-05-28 19:48:26 - Encoding Queries...


Batches: 100%|██████████| 3/3 [00:22<00:00,  7.52s/it]


2024-05-28 19:48:49 - Sorting Corpus by document length (Longest first)...
2024-05-28 19:48:49 - Scoring Function: Cosine Similarity (cos_sim)
2024-05-28 19:48:49 - Encoding Batch 1/1...


Batches: 100%|██████████| 41/41 [00:47<00:00,  1.16s/it]
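
Conceptually, exact search scores every query against every document and keeps the best matches. The following is a minimal pure-Python sketch of that idea with cosine similarity. The 2-dimensional vectors are made up and stand in for real model embeddings, and the function names are hypothetical; this is not how `DenseRetrievalExactSearch` is implemented internally, which works on batched tensors.

```python
import math
from typing import Dict, List


def cos_sim(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_search(
    query_vecs: Dict[str, List[float]],
    doc_vecs: Dict[str, List[float]],
    top_k: int,
) -> Dict[str, Dict[str, float]]:
    """Brute-force search: score all documents, keep the top_k per query."""
    results = {}
    for query_id, query_vec in query_vecs.items():
        scored = {doc_id: cos_sim(query_vec, vec) for doc_id, vec in doc_vecs.items()}
        best = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        results[query_id] = dict(best)  # same nested-dict shape as qrels
    return results


# Made-up embeddings for illustration only.
toy_doc_vecs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [1.0, 1.0]}
toy_query_vecs = {"q1": [2.0, 0.1]}

toy_results = exact_search(toy_query_vecs, toy_doc_vecs, top_k=2)
print(toy_results)
```

The returned structure, `{query_id: {doc_id: score}}`, mirrors the shape of `results` above, which is what allows it to be compared directly against `qrels`.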


In [7]:
#### Evaluate your retrieval using NDCG@k, MAP@K ...

logging.info("Retriever evaluation for k in: {}".format(retriever.k_values))
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)

2024-05-28 19:50:41 - Retriever evaluation for k in: [1, 3, 5, 10, 100, 1000]
2024-05-28 19:50:41 - For evaluation, we ignore identical query and document ids (default), please explicitly set ``ignore_identical_ids=False`` to ignore this.
2024-05-28 19:50:41 - 

2024-05-28 19:50:41 - NDCG@1: 0.4233
2024-05-28 19:50:41 - NDCG@3: 0.4842
2024-05-28 19:50:41 - NDCG@5: 0.5104
2024-05-28 19:50:41 - NDCG@10: 0.5379
2024-05-28 19:50:41 - NDCG@100: 0.5759
2024-05-28 19:50:41 - NDCG@1000: 0.5913
2024-05-28 19:50:41 - 

2024-05-28 19:50:41 - MAP@1: 0.3994
2024-05-28 19:50:41 - MAP@3: 0.4593
2024-05-28 19:50:41 - MAP@5: 0.4768
2024-05-28 19:50:41 - MAP@10: 0.4889
2024-05-28 19:50:41 - MAP@100: 0.4974
2024-05-28 19:50:41 - MAP@1000: 0.4980
2024-05-28 19:50:41 - 

2024-05-28 19:50:41 - Recall@1: 0.3994
2024-05-28 19:50:41 - Recall@3: 0.5256
2024-05-28 19:50:41 - Recall@5: 0.5887
2024-05-28 19:50:41 - Recall@10: 0.6723
2024-05-28 19:50:41 - Recall@100: 0.8460
2024-05-28 19:50:41 - Recall@1000: 0.9683
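
To make the reported numbers easier to interpret, here is a minimal sketch of NDCG@k and Recall@k for a single query. It assumes linear gain and a log2 rank discount, which is a common formulation; BEIR delegates the actual computation to an evaluation library, so small differences in conventions (for example tie-breaking) are possible. The function names and toy data are hypothetical.

```python
import math
from typing import Dict, List


def top_k_ids(scores: Dict[str, float], k: int) -> List[str]:
    """Document ids of the k highest-scoring entries, best first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def ndcg_at_k(relevant: Dict[str, int], scores: Dict[str, float], k: int) -> float:
    """NDCG@k with linear gain and log2(rank + 1) discount (an assumed convention)."""
    ranked = top_k_ids(scores, k)
    dcg = sum(relevant.get(doc_id, 0) / math.log2(i + 2) for i, doc_id in enumerate(ranked))
    ideal_gains = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(i + 2) for i, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(relevant: Dict[str, int], scores: Dict[str, float], k: int) -> float:
    """Fraction of relevant documents (score > 0) found in the top k."""
    positives = {doc_id for doc_id, rel in relevant.items() if rel > 0}
    if not positives:
        return 0.0
    hits = positives.intersection(top_k_ids(scores, k))
    return len(hits) / len(positives)


# Made-up judgments and retrieval scores for one query.
toy_relevant = {"d1": 1, "d3": 1}
toy_scores = {"d1": 0.9, "d2": 0.8, "d3": 0.7}

for k in (1, 3):
    print(k, round(ndcg_at_k(toy_relevant, toy_scores, k), 4),
          recall_at_k(toy_relevant, toy_scores, k))
```

The benchmark-level numbers are averages of such per-query values over all test queries.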

In [8]:
import random

#### Print top-k documents retrieved ####
top_k = 10

query_id, ranking_scores = random.choice(list(results.items()))
scores_sorted = sorted(ranking_scores.items(), key=lambda item: item[1], reverse=True)
logging.info("Query : %s\n" % queries[query_id])

for rank in range(top_k):
    doc_id = scores_sorted[rank][0]
    # Format: Rank x: ID [Title] Body
    logging.info("Rank %d: %s [%s] - %s\n" % (rank+1, doc_id, corpus[doc_id].get("title"), corpus[doc_id].get("text")))

2024-05-28 19:50:47 - Query : FoxO3a activation in neuronal death is mediated by reactive oxygen species (ROS).

2024-05-28 19:50:47 - Rank 1: 4418070 [Novel Foxo1-dependent transcriptional programs control Treg cell function] - Regulatory T (Treg) cells, characterized by expression of the transcription factor forkhead box P3 (Foxp3), maintain immune homeostasis by suppressing self-destructive immune responses. Foxp3 operates as a late-acting differentiation factor controlling Treg cell homeostasis and function, whereas the early Treg-cell-lineage commitment is regulated by the Akt kinase and the forkhead box O (Foxo) family of transcription factors. However, whether Foxo proteins act beyond the Treg-cell-commitment stage to control Treg cell homeostasis and function remains largely unexplored. Here we show that Foxo1 is a pivotal regulator of Treg cell function. Treg cells express high amounts of Foxo1 and display reduced T-cell-receptor-induced Akt activation, Foxo1 phosphorylation a

## **ANCE**

We use the [``msmarco-roberta-base-ance-fristp``](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html) ANCE model, which was fine-tuned on the MSMARCO dataset for 600K steps.

In [9]:
# #### Dense Retrieval using ANCE ####
# # https://www.sbert.net/docs/pretrained-models/msmarco-v3.html
# # MSMARCO Dev Passage Retrieval ANCE(FirstP) 600K model from ANCE.
# # The ANCE model was fine-tuned using dot-product (dot) function.

# model = DRES(models.SentenceBERT("msmarco-roberta-base-ance-fristp"))
# retriever = EvaluateRetrieval(model, score_function="dot")

# #### Retrieve dense results (format of results is identical to qrels)
# results = retriever.retrieve(corpus, queries)

In [10]:
# #### Evaluate your retrieval using NDCG@k, MAP@K ...

# logging.info("Retriever evaluation for k in: {}".format(retriever.k_values))
# ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
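
The commented-out ANCE cell passes `score_function="dot"`, whereas the SBERT cell used `"cos_sim"`. The toy sketch below, with made-up 2-dimensional vectors, illustrates why this choice matters: dot product rewards vector length, cosine similarity does not, so the two can rank the same documents differently. The vectors and helper names are hypothetical.

```python
import math
from typing import Callable, Dict, List


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product; sensitive to vector magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cos_sim(a: List[float], b: List[float]) -> float:
    """Dot product of the length-normalized vectors."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def rank(query: List[float], docs: Dict[str, List[float]],
         score: Callable[[List[float], List[float]], float]) -> List[str]:
    """Document ids ordered from best to worst under the given score."""
    return sorted(docs, key=lambda doc_id: score(query, docs[doc_id]), reverse=True)


# Made-up vectors: "aligned" points in the query direction, "large" is just long.
toy_query = [1.0, 1.0]
toy_docs = {"aligned": [0.9, 0.9], "large": [3.0, 0.5]}

dot_ranking = rank(toy_query, toy_docs, dot)
cos_ranking = rank(toy_query, toy_docs, cos_sim)
print("dot:", dot_ranking)
print("cos:", cos_ranking)
```

This is the reason the scoring function passed to `EvaluateRetrieval` should match the one the model was fine-tuned with.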

# **Lexical Retrieval using BM25 (Elasticsearch)**

## 1. Download and setup the Elasticsearch instance
Reference: https://colab.research.google.com/github/tensorflow/io/blob/master/docs/tutorials/elasticsearch.ipynb

For demo purposes, the open-source version of the elasticsearch package is used.

In [11]:
# %%bash

# wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
# wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
# tar -xzf elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
# sudo chown -R daemon:daemon elasticsearch-7.9.2/
# shasum -a 512 -c elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512

Run the instance as a daemon process


In [19]:
# %%bash --bg

# sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch

SyntaxError: invalid syntax (2349466733.py, line 3)

In [None]:
# import time

# # Sleep for a few seconds to let the instance start.
# time.sleep(20)

Once the instance has been started, grep for ``elasticsearch`` in the processes list to confirm the availability.

In [None]:
%%bash

ps -ef | grep elasticsearch
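
As an alternative to a fixed sleep followed by a process check, one could poll the HTTP port until it accepts connections. The helper below is a hypothetical sketch using only the standard library; it is not part of BEIR or Elasticsearch. To stay self-contained it is demonstrated against a throwaway local listener rather than a real instance.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Demonstration against a temporary listener on an OS-assigned free port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]

reachable = wait_for_port("127.0.0.1", demo_port, timeout=2.0)
listener.close()
print(reachable)
```

For the instance used in this notebook, the call would be `wait_for_port("localhost", 9200)`. Note that an open port only shows that the process is listening; it does not guarantee that the cluster is ready to index documents.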

In [None]:
# 127.0.0.1:9300

In [36]:
%%bash

curl -sX GET "localhost:9200/"


{
  "name" : "ab",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "JS2mz8c3RdOol8iLxCkFHA",
  "version" : {
    "number" : "7.9.2",
    "build_flavor" : "oss",
    "build_type" : "tar",
    "build_hash" : "d34da0ea4a966c4e49417f2da2f244e3e97b4e6e",
    "build_date" : "2020-09-23T00:45:33.626720Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}


In [37]:
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval

#### Provide parameters for Elasticsearch
hostname = "localhost"
index_name = "scifact"
initialize = True # If True, deletes any existing index with the same name and reindexes all documents

model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)
retriever = EvaluateRetrieval(model)

#### Retrieve BM25 (lexical) results (format of results is identical to qrels)
results = retriever.retrieve(corpus, queries)


2024-05-28 20:22:29 - Activating Elasticsearch....
2024-05-28 20:22:29 - Elastic Search Credentials: {'hostname': 'localhost', 'index_name': 'scifact', 'keys': {'title': 'title', 'body': 'txt'}, 'timeout': 100, 'retry_on_timeout': True, 'maxsize': 24, 'number_of_shards': 'default', 'language': 'english'}
2024-05-28 20:22:29 - Deleting previous Elasticsearch-Index named - scifact



2024-05-28 20:22:31 - Creating fresh Elasticsearch-Index named - scifact


  0%|          | 0/5183 [00:00<?, ?docs/s]
que: 100%|██████████| 3/3 [00:02<00:00,  1.07it/s]
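
For intuition about what the lexical retriever computes, here is a minimal pure-Python sketch of BM25 scoring. It uses a naive lowercase whitespace tokenizer and the commonly used parameter values `k1=1.2` and `b=0.75`; Elasticsearch applies its own language analyzer and its own variant of the formula, so absolute scores will not match. The function names and documents are made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Naive tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query: str, docs: Dict[str, str], k1: float = 1.2, b: float = 0.75) -> Dict[str, float]:
    """Score every document against the query with a textbook BM25 formula."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores[doc_id] = score
    return scores


toy_docs = {
    "d1": "cells divide by mitosis",
    "d2": "stars burn hydrogen",
    "d3": "cells and stars",
}
toy_bm25 = bm25_scores("cells mitosis", toy_docs)
toy_ranking = sorted(toy_bm25, key=toy_bm25.get, reverse=True)
print(toy_ranking)
```

Documents that share no term with the query receive a score of zero, which is the main structural difference from the dense retrievers above.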


In [38]:
#### Evaluate your retrieval using NDCG@k, MAP@K ...
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)


2024-05-28 20:22:47 - For evaluation, we ignore identical query and document ids (default), please explicitly set ``ignore_identical_ids=False`` to ignore this.
2024-05-28 20:22:47 - 

2024-05-28 20:22:47 - NDCG@1: 0.5767
2024-05-28 20:22:47 - NDCG@3: 0.6366
2024-05-28 20:22:47 - NDCG@5: 0.6652
2024-05-28 20:22:47 - NDCG@10: 0.6906
2024-05-28 20:22:47 - NDCG@100: 0.7134
2024-05-28 20:22:47 - NDCG@1000: 0.7212
2024-05-28 20:22:47 - 

2024-05-28 20:22:47 - MAP@1: 0.5559
2024-05-28 20:22:47 - MAP@3: 0.6143
2024-05-28 20:22:47 - MAP@5: 0.6312
2024-05-28 20:22:47 - MAP@10: 0.6437
2024-05-28 20:22:47 - MAP@100: 0.6492
2024-05-28 20:22:47 - MAP@1000: 0.6495
2024-05-28 20:22:47 - 

2024-05-28 20:22:47 - Recall@1: 0.5559
2024-05-28 20:22:47 - Recall@3: 0.6793
2024-05-28 20:22:47 - Recall@5: 0.7479
2024-05-28 20:22:47 - Recall@10: 0.8198
2024-05-28 20:22:47 - Recall@100: 0.9192
2024-05-28 20:22:47 - Recall@1000: 0.9800
2024-05-28 20:22:47 - 

2024-05-28 20:22:47 - P@1: 0.5767
2024-05-28 20:22:47


# **Reranking BM25 using Cross-Encoder**

In this example, we rerank the top-20 documents retrieved by BM25 using the [cross-encoder/ms-marco-electra-base](https://www.sbert.net/docs/pretrained-models/ce-msmarco.html) SBERT cross-encoder model.

In [None]:
from beir.reranking.models import CrossEncoder
from beir.reranking import Rerank

#### Reranking using Cross-Encoder models
cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-electra-base')
reranker = Rerank(cross_encoder_model, batch_size=128)

# Rerank the top-20 first-stage results using the reranker provided
rerank_results = reranker.rerank(corpus, queries, results, top_k=20)

2021-04-20 16:14:39 - Use pytorch device: cuda
2021-04-20 16:14:39 - Starting To Rerank Top-20....
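
The two-stage pattern itself is simple: keep the `top_k` candidates from the first stage, then rescore each query-document pair with a more expensive model. The sketch below shows this with a hypothetical token-overlap function standing in for the cross-encoder; `rerank_top_k` and `overlap_score` are illustrative names, not BEIR functions, and the toy data is made up.

```python
from typing import Callable, Dict


def rerank_top_k(
    queries: Dict[str, str],
    corpus: Dict[str, Dict[str, str]],
    first_stage: Dict[str, Dict[str, float]],
    score_pair: Callable[[str, str], float],
    top_k: int,
) -> Dict[str, Dict[str, float]]:
    """Rescore only the top_k first-stage candidates of each query."""
    reranked = {}
    for query_id, doc_scores in first_stage.items():
        candidates = sorted(doc_scores, key=doc_scores.get, reverse=True)[:top_k]
        reranked[query_id] = {
            doc_id: score_pair(queries[query_id], corpus[doc_id]["text"])
            for doc_id in candidates
        }
    return reranked


def overlap_score(query: str, passage: str) -> float:
    """Hypothetical stand-in for a cross-encoder: share of query tokens in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    return len(query_tokens & passage_tokens) / len(query_tokens) if query_tokens else 0.0


toy_corpus = {
    "d1": {"text": "cells divide by mitosis"},
    "d2": {"text": "stars burn hydrogen"},
    "d3": {"text": "mitosis in cells explained"},
}
toy_queries = {"q1": "mitosis in cells"}
toy_first_stage = {"q1": {"d2": 3.0, "d1": 2.0, "d3": 1.0}}

toy_reranked = rerank_top_k(toy_queries, toy_corpus, toy_first_stage, overlap_score, top_k=2)
print(toy_reranked)
```

Note that `d3` never reaches the reranker because it falls outside the first-stage top 2, even though it would have scored highest. The reranker can only reorder what the first stage returns, so recall is capped by the candidate list; this would be consistent with Recall@100 and Recall@1000 being identical in the reranked evaluation.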






In [None]:
#### Evaluate your retrieval using NDCG@k, MAP@K ...
ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, rerank_results, retriever.k_values)

2021-04-20 16:19:08 - 

2021-04-20 16:19:08 - NDCG@1: 0.5733
2021-04-20 16:19:08 - NDCG@3: 0.6314
2021-04-20 16:19:08 - NDCG@5: 0.6520
2021-04-20 16:19:08 - NDCG@10: 0.6720
2021-04-20 16:19:08 - NDCG@100: 0.6780
2021-04-20 16:19:08 - NDCG@1000: 0.6780
2021-04-20 16:19:08 - 

2021-04-20 16:19:08 - MAP@1: 0.5451
2021-04-20 16:19:08 - MAP@3: 0.6074
2021-04-20 16:19:08 - MAP@5: 0.6216
2021-04-20 16:19:08 - MAP@10: 0.6307
2021-04-20 16:19:08 - MAP@100: 0.6324
2021-04-20 16:19:08 - MAP@1000: 0.6324
2021-04-20 16:19:08 - 

2021-04-20 16:19:08 - Recall@1: 0.5451
2021-04-20 16:19:08 - Recall@3: 0.6758
2021-04-20 16:19:08 - Recall@5: 0.7260
2021-04-20 16:19:08 - Recall@10: 0.7844
2021-04-20 16:19:08 - Recall@100: 0.8078
2021-04-20 16:19:08 - Recall@1000: 0.8078
2021-04-20 16:19:08 - 

2021-04-20 16:19:08 - P@1: 0.5733
2021-04-20 16:19:08 - P@3: 0.2444
2021-04-20 16:19:08 - P@5: 0.1613
2021-04-20 16:19:08 - P@10: 0.0880
2021-04-20 16:19:08 - P@100: 0.0090
2021-04-20 16:19:08 - P@1000: 0.0009
