<a href="https://colab.research.google.com/github/shuvanyu/Document-Retrieval-and-Ranking/blob/main/bm25_nfcorpus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prerequisites

In [1]:
!pip install ir_datasets > /dev/null
!pip install beir > /dev/null
!pip install tensorflow_text > /dev/null

# Testing beir

In [2]:
from beir import util, LoggingHandler
from beir.retrieval import models
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

import logging
import pathlib, os

#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout

#### Download nfcorpus.zip and unzip the dataset
dataset = "nfcorpus"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = os.path.join('drive', 'MyDrive',  "nlp_datashare")


In [3]:

data_path = util.download_and_unzip(url, out_dir)
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test") # or split = "train" or "dev"


drive/MyDrive/nlp_datashare/nfcorpus.zip:   0%|          | 0.00/2.34M [00:00<?, ?iB/s]

  0%|          | 0/3633 [00:00<?, ?it/s]

In [4]:
%%bash

wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
sudo chown -R daemon:daemon elasticsearch-7.9.2/
shasum -a 512 -c elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512 

elasticsearch-oss-7.9.2-linux-x86_64.tar.gz: OK


In [5]:
%%bash --bg

sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch

In [6]:
import time

# Sleep for a few seconds to let the Elasticsearch instance start.
time.sleep(20)
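
A fixed 20-second sleep can be too short on a slow runtime and is wasted time on a fast one. As an alternative, here is a minimal sketch that polls the node until it answers. The helper names `wait_until` and `elasticsearch_is_up` are hypothetical (they are not part of BEIR or Elasticsearch), and the URL assumes the default `localhost:9200` endpoint queried later in this notebook.

```python
import time
import urllib.error
import urllib.request


def wait_until(probe, timeout=60.0, interval=1.0):
    """Call probe() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def elasticsearch_is_up(url="http://localhost:9200/"):
    """Return True if the node answers an HTTP GET with status 200.

    The default URL is an assumption based on the curl check in this notebook.
    """
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: the node is not ready yet.
        return False


# Usage in the notebook (requires the node started in the previous cell):
# wait_until(elasticsearch_is_up, timeout=120)
```

`wait_until` returns `False` on timeout instead of raising, so the calling cell can decide whether to retry or stop.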

In [7]:

%%bash

ps -ef | grep elasticsearch

root        1117    1115  0 00:39 ?        00:00:00 sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch
daemon      1118    1117 99 00:39 ?        00:00:20 /content/elasticsearch-7.9.2/jdk/bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT -Xms1g -Xmx1g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/elasticsearch-14238863348155212484 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:fileco

In [8]:

%%bash

curl -sX GET "localhost:9200/"

{
  "name" : "e02b8af3c1e2",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "cf72HulCTN6_fT6heBgxGg",
  "version" : {
    "number" : "7.9.2",
    "build_flavor" : "oss",
    "build_type" : "tar",
    "build_hash" : "d34da0ea4a966c4e49417f2da2f244e3e97b4e6e",
    "build_date" : "2020-09-23T00:45:33.626720Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}


In [9]:

from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval

#### Provide parameters for elastic-search
hostname = "localhost" 
index_name = "nfcorpus"  # named after the dataset loaded above
initialize = True  # If True, delete any existing index with the same name and reindex all documents

model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)
retriever = EvaluateRetrieval(model)

#### Retrieve BM25 (lexical) results (format of results is identical to qrels)
results = retriever.retrieve(corpus, queries)

  0%|          | 0/3633 [00:00<?, ?docs/s]
que: 100%|██████████| 3/3 [00:12<00:00,  4.09s/it]
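
The `BM25Search` model above delegates scoring to Elasticsearch. To give a rough idea of what BM25 ranking computes, here is a minimal pure-Python sketch. It makes several assumptions: whitespace tokenization, the commonly cited defaults `k1=1.2` and `b=0.75`, and a Lucene-style smoothed IDF. The function `bm25_scores` and the toy corpus are hypothetical, and the sketch does not reproduce Elasticsearch's analyzers or its exact scores.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` (dict of id -> text) against `query`."""
    # Assumption: lowercase + whitespace split stands in for a real analyzer.
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score  # same shape as one entry of `results`
    return scores


toy_docs = {
    "d1": "veal calves and dairy cattle",
    "d2": "broiler chickens on farms",
    "d3": "veal veal production",
}
toy_scores = bm25_scores("veal", toy_docs)
print(sorted(toy_scores.items(), key=lambda item: item[1], reverse=True))
```

Documents that share no term with the query are left out of the returned dictionary, so a query can come back with fewer hits than the requested depth.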


In [10]:
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)

In [11]:
ndcg

{'NDCG@1': 0.44968,
 'NDCG@3': 0.40253,
 'NDCG@5': 0.37705,
 'NDCG@10': 0.34281,
 'NDCG@100': 0.28939,
 'NDCG@1000': 0.32065}

In [12]:
_map

{'MAP@1': 0.05936,
 'MAP@3': 0.10053,
 'MAP@5': 0.11329,
 'MAP@10': 0.12969,
 'MAP@100': 0.15417,
 'MAP@1000': 0.16002}

In [13]:
recall

{'Recall@1': 0.05936,
 'Recall@3': 0.11319,
 'Recall@5': 0.13313,
 'Recall@10': 0.16603,
 'Recall@100': 0.26023,
 'Recall@1000': 0.38996}

In [14]:
precision

{'P@1': 0.46753,
 'P@3': 0.37771,
 'P@5': 0.32273,
 'P@10': 0.24708,
 'P@100': 0.06828,
 'P@1000': 0.01177}
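
To make the numbers above easier to interpret, here is a simplified sketch of how P@k and Recall@k can be computed from dictionaries shaped like `qrels` and `results` (query ID to document ID to score or relevance label). This is not BEIR's evaluation code. The function `precision_recall_at_k` and the toy data are hypothetical, and the sketch assumes that any judgment greater than 0 counts as relevant.

```python
def precision_recall_at_k(qrels, results, k):
    """Average P@k and Recall@k over the queries that have relevant documents."""
    precisions, recalls = [], []
    for query_id, judged in qrels.items():
        relevant = {doc_id for doc_id, rel in judged.items() if rel > 0}
        if not relevant:
            continue
        ranked = sorted(results.get(query_id, {}).items(),
                        key=lambda item: item[1], reverse=True)
        top_k = [doc_id for doc_id, _ in ranked[:k]]
        hits = sum(1 for doc_id in top_k if doc_id in relevant)
        # P@k divides by k even when fewer than k documents were retrieved.
        precisions.append(hits / k)
        recalls.append(hits / len(relevant))
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n


toy_qrels = {"q1": {"d1": 1, "d2": 1, "d4": 1}, "q2": {"d3": 2}}
toy_results = {
    "q1": {"d1": 3.2, "d3": 2.1, "d2": 0.7},
    "q2": {"d1": 1.5, "d3": 1.1},
}
p_at_2, r_at_2 = precision_recall_at_k(toy_qrels, toy_results, k=2)
print(p_at_2, r_at_2)
```

Because the denominator of P@k is always k, precision shrinks as k grows past the number of relevant documents, while recall can only stay flat or increase.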

In [15]:
import random
random.seed(250)

#### Print top-k documents retrieved ####
top_k = 10

query_id, ranking_scores = random.choice(list(results.items()))


In [16]:

query_id

'PLAIN-2311'

In [17]:
len(ranking_scores)

3

In [18]:
scores_sorted = sorted(ranking_scores.items(), key=lambda item: item[1], reverse=True)
print("Query : %s\n" % queries[query_id])

for rank in range(min(top_k, len(ranking_scores))):
    doc_id = scores_sorted[rank][0]
    # Format: Rank x: ID [Title] - Body
    print("Rank %d: %s [%s] - %s\n" % (rank+1, doc_id, corpus[doc_id].get("title"), corpus[doc_id].get("text")))

Query : veal

Rank 1: MED-4801 [Methicillin-resistant Staphylococcus aureus (MRSA) in food production animals.] - Until recently, reports on methicillin-resistant Staphylococcus aureus (MRSA) in food production animals were mainly limited to occasional detections in dairy cattle mastitis. However, since 2005 a MRSA clone, CC398, has been reported colonizing pigs, veal calves and broiler chickens and infecting dairy cows. Many aspects of its prevalence in pigs remain unclear. In other livestock, colonizing capacity and reservoir status still require elucidation. MRSA CC398 has also been detected in meat, but, as for other MRSA, the risk this poses is somewhat unclear. Currently, the most worrying aspect of MRSA CC398 appears to be its capacity to spread to humans. This might complicate MRSA control measures in human healthcare, urging research into risk factors and transmission routes. Although infections with MRSA CC398 are much less reported than carriage, more investigation into its 