<a href="https://colab.research.google.com/github/shuvanyu/Document-Retrieval-and-Ranking/blob/main/condenser_trec_covid.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prerequisites

In [1]:
!pip install sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
!pip install beir


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
!pip install tensorflow_text

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Testing beir

In [4]:
from beir import util, LoggingHandler
from beir.retrieval import models
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

import logging
import pathlib, os

#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout

#### Download the trec-covid.zip dataset and unzip it
dataset = "trec-covid"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = os.path.join('drive', 'MyDrive',  "nlp_datashare")



In [5]:

data_path = util.download_and_unzip(url, out_dir)
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test") # or split = "train" or "dev"


  0%|          | 0/171332 [00:00<?, ?it/s]

In [6]:
sorted(list(corpus.keys()))[3]

'000bb2uc'

# coCondenser

In [7]:
len(corpus), len(queries)

(171332, 50)

In [8]:
type(corpus), type(queries)

(dict, dict)
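
The loader returns plain dictionaries, so their layout is easy to mirror with toy data. The sketch below assumes the usual BEIR layout (corpus: document ID to a `title`/`text` dict, queries: query ID to a string, qrels: query ID to a dict of graded judgments). All IDs and texts in it are made up, and `relevant_docs` is a hypothetical helper, not part of BEIR.

```python
# Toy stand-ins for the three objects returned by GenericDataLoader.load().
# Every ID and text below is invented for illustration.
toy_corpus = {
    "doc1": {"title": "Title one", "text": "Body of the first document."},
    "doc2": {"title": "Title two", "text": "Body of the second document."},
}
toy_queries = {"q1": "an example query"}
toy_qrels = {"q1": {"doc1": 2, "doc2": 0}}  # query ID -> {doc ID: graded relevance}


def relevant_docs(qrels, query_id):
    """Return the IDs of documents judged relevant (score > 0) for a query."""
    return sorted(d for d, score in qrels[query_id].items() if score > 0)


print(relevant_docs(toy_qrels, "q1"))
```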

In [9]:
model_path = "sentence-transformers/msmarco-bert-co-condensor"
model = models.SentenceBERT(model_path=model_path, device = 'cuda')
normalize = True


In [10]:
model_dres = DRES(model, batch_size=16)
retriever = EvaluateRetrieval(model_dres, score_function="dot") # or "cos_sim" for cosine similarity
results = retriever.retrieve(corpus, queries)

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

Batches:   0%|          | 0/3125 [00:00<?, ?it/s]

Batches:   0%|          | 0/3125 [00:00<?, ?it/s]

Batches:   0%|          | 0/3125 [00:00<?, ?it/s]

Batches:   0%|          | 0/1334 [00:00<?, ?it/s]
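
Exact (brute-force) dense retrieval scores every query embedding against every document embedding and keeps the top results. The NumPy sketch below illustrates that idea and the difference between the `"dot"` and `"cos_sim"` score functions. It is not BEIR's implementation, and `exact_search` and the toy vectors are assumptions made for the example.

```python
import numpy as np


def exact_search(query_embs, doc_embs, doc_ids, top_k=3, score_function="dot"):
    """Brute-force top-k search: score every query against every document."""
    q = np.asarray(query_embs, dtype=np.float64)
    d = np.asarray(doc_embs, dtype=np.float64)
    if score_function == "cos_sim":
        # Cosine similarity is the dot product of unit-length vectors.
        q = q / np.linalg.norm(q, axis=1, keepdims=True)
        d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = q @ d.T  # shape: (num_queries, num_docs)
    k = min(top_k, d.shape[0])
    order = np.argsort(-scores, axis=1)[:, :k]  # highest score first
    return [
        [(doc_ids[j], float(scores[i, j])) for j in order[i]]
        for i in range(q.shape[0])
    ]


# Toy 2-dimensional embeddings (made up for illustration).
doc_ids = ["a", "b", "c"]
doc_embs = [[3.0, 1.0], [0.5, 0.5], [0.0, 2.0]]
query_embs = [[1.0, 1.0]]

dot_hits = exact_search(query_embs, doc_embs, doc_ids, score_function="dot")
cos_hits = exact_search(query_embs, doc_embs, doc_ids, score_function="cos_sim")
print([doc_id for doc_id, _ in dot_hits[0]])
print([doc_id for doc_id, _ in cos_hits[0]])
```

With unnormalized vectors the two score functions can rank documents differently, because the dot product also rewards longer vectors.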

In [11]:
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)

In [12]:
ndcg

{'NDCG@1': 0.79,
 'NDCG@3': 0.7852,
 'NDCG@5': 0.7721,
 'NDCG@10': 0.72653,
 'NDCG@100': 0.51905,
 'NDCG@1000': 0.46158}

In [13]:
_map

{'MAP@1': 0.00219,
 'MAP@3': 0.00654,
 'MAP@5': 0.01027,
 'MAP@10': 0.01775,
 'MAP@100': 0.09038,
 'MAP@1000': 0.21767}

In [14]:
recall

{'Recall@1': 0.00219,
 'Recall@3': 0.0069,
 'Recall@5': 0.01097,
 'Recall@10': 0.01956,
 'Recall@100': 0.1247,
 'Recall@1000': 0.43223}

In [15]:
precision

{'P@1': 0.82,
 'P@3': 0.84,
 'P@5': 0.824,
 'P@10': 0.762,
 'P@100': 0.529,
 'P@1000': 0.20654}
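
High precision alongside very low recall at small cutoffs follows from how the two metrics are normalized: P@k divides the hits by k, while Recall@k divides them by the number of relevant documents for the query. The figures above suggest a few hundred relevant documents per query (roughly 0.207 × 1000 hits at k = 1000, covering about 43% of them). The sketch below is a simplified illustration, not the evaluation code used by `retriever.evaluate`. The helper name and the toy judgments are assumptions.

```python
def precision_recall_at_k(qrels, results, k):
    """Mean P@k and Recall@k over queries, treating any judgment > 0 as relevant."""
    precisions, recalls = [], []
    for query_id, doc_scores in results.items():
        relevant = {d for d, rel in qrels.get(query_id, {}).items() if rel > 0}
        # Rank document IDs by descending score and keep the first k.
        ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]
        hits = sum(1 for d in ranked if d in relevant)
        precisions.append(hits / k)
        recalls.append(hits / len(relevant) if relevant else 0.0)
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n


# Hypothetical query with 400 relevant documents.
toy_qrels = {"q1": {f"rel{i}": 1 for i in range(400)}}
toy_results = {"q1": {"rel0": 0.9, "rel1": 0.8, "other": 0.7}}

p_at_1, r_at_1 = precision_recall_at_k(toy_qrels, toy_results, k=1)
print(p_at_1, r_at_1)
```

In this toy case a perfect first hit gives P@1 = 1.0 but Recall@1 of only 1/400, the same pattern as in the results above.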

In [16]:
import random
random.seed(250)

#### Print top-k documents retrieved ####
top_k = 10

query_id, ranking_scores = random.choice(list(results.items()))


In [17]:

query_id

'25'

In [18]:
len(ranking_scores)

4004

In [19]:
scores_sorted = sorted(ranking_scores.items(), key=lambda item: item[1], reverse=True)
print("Query : %s\n" % queries[query_id])

for rank in range(min(top_k, len(ranking_scores))):
    doc_id = scores_sorted[rank][0]
    # Format: Rank x: ID [Title] Body
    print("Rank %d: %s [%s] - %s\n" % (rank+1, doc_id, corpus[doc_id].get("title"), corpus[doc_id].get("text")))

Query : which biomarkers predict the severe clinical course of 2019-nCOV infection?

Rank 1: 0deyspy2 [[Analysis of clinical features of 29 patients with 2019 novel coronavirus pneumonia].] - Objective: To analyze the clinical characteristics of 2019 novel coronavirus (2019-nCoV) pneumonia and to investigate the correlation between serum inflammatory cytokines and severity of the disease. Methods: 29 patients with 2019-ncov admitted to the isolation ward of Tongji hospital affiliated to Tongji medical college of Huazhong University of Science and Technology in January 2020 were selected as the study subjects. Clinical data were collected and the general information, clinical symptoms, blood test and CT imaging characteristics were analyzed. According to the relevant diagnostic criteria, the patients were divided into three groups: mild (15 cases), severe (9 cases) and critical (5 cases). The expression levels of inflammatory cytokines and other markers in the serum of each group were d

# Random Query

In [None]:

import torch

In [None]:
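# NOTE: `reduced_corpus` is not defined anywhere in this notebook. It is assumed
# to be a list of {"title": ..., "text": ...} dicts drawn from `corpus`.
if normalize:
    corpus_embs = model.encode_corpus(reduced_corpus, batch_size=128, convert_to_tensor=True, normalize_embeddings=True)
else:
    corpus_embs = model.encode_corpus(reduced_corpus, batch_size=128, convert_to_tensor=True)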

In [None]:
query = ''  # placeholder: set this to a free-text query before running the next cell

In [None]:
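import datetime
from beir.retrieval.search.dense import util as utils

time_taken_all = {}  # query ID -> latency in milliseconds

start = datetime.datetime.now()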
if normalize:
    query_emb = model.encode_queries([query], batch_size=1, convert_to_tensor=True, normalize_embeddings=True, show_progress_bar=False)
else:
    query_emb = model.encode_queries([query], batch_size=1, convert_to_tensor=True, show_progress_bar=False)

#### Dot product for normalized embeddings is equal to cosine similarity
sim_scores = utils.dot_score(query_emb, corpus_embs)
sim_scores_top_k_values, sim_scores_top_k_idx = torch.topk(sim_scores, 10, dim=1, largest=True, sorted=True)
end = datetime.datetime.now()

#### Measuring time taken in ms (milliseconds)
time_taken = (end - start)
time_taken = time_taken.total_seconds() * 1000
time_taken_all[query_id] = time_taken
print("{}: {} {:.2f}ms".format(query_id, query, time_taken))
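
The comment in the cell above relies on the fact that the dot product of unit-length vectors equals their cosine similarity. A minimal NumPy check of that identity, using random vectors as stand-ins for real embeddings (the shapes and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
query = rng.normal(size=8)      # stand-in for one query embedding
docs = rng.normal(size=(5, 8))  # stand-ins for five document embeddings

# Cosine similarity computed from the raw vectors.
cosine = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Dot product computed after scaling every vector to unit length.
q_unit = query / np.linalg.norm(query)
d_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
dot_on_unit = d_unit @ q_unit

print(np.allclose(cosine, dot_on_unit))
```

This is why `normalize_embeddings=True` combined with a dot-product score behaves like cosine similarity.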