<a href="https://colab.research.google.com/github/vinay-jose/rag-nbs/blob/main/BM25_%2B_Reranking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -Uq lancedb ir_datasets tantivy

In [2]:
import ir_datasets
import pandas as pd
import lancedb
from lancedb.rerankers import ColbertReranker

In [3]:
dataset = ir_datasets.load('cord19/trec-covid')  # TREC-COVID: CORD-19 documents with queries and relevance judgments (qrels)

## Looking at data

In [4]:
pd.DataFrame(dataset.docs_iter()).head()  # check documents

[INFO] [starting] building docstore
[INFO] If you have a local copy of https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-07-16/metadata.csv, you can symlink it here to avoid downloading it again: /root/.ir_datasets/downloads/80d664e496b8b7e50a39c6f6bb92e0ef
[INFO] [starting] https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-07-16/metadata.csv

| | doc_id | title | doi | date | abstract |
|---|---|---|---|---|---|
| 0 | ug7v899j | Clinical features of culture-proven Mycoplasma... | 10.1186/1471-2334-1-6 | 2001-07-04 | OBJECTIVE: This retrospective chart review des... |
| 1 | 02tnwd4m | Nitric oxide: a pro-inflammatory mediator in l... | 10.1186/rr14 | 2000-08-15 | Inflammatory diseases of the respiratory tract... |
| 2 | ejv2xln0 | Surfactant protein-D and pulmonary host defense | 10.1186/rr19 | 2000-08-25 | Surfactant protein-D (SP-D) participates in th... |
| 3 | 2b73a28n | Role of endothelin-1 in lung disease | 10.1186/rr44 | 2001-02-22 | Endothelin-1 (ET-1) is a 21 amino acid peptide... |
| 4 | 9785vg6d | Gene expression in epithelial cells in respons... | 10.1186/rr61 | 2001-05-11 | Respiratory syncytial virus (RSV) and pneumoni... |


In [5]:
pd.DataFrame(dataset.queries_iter()).head() # check queries

[INFO] [starting] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml: [00:00] [18.7kB] [29.6MB/s]


| | query_id | title | description | narrative |
|---|---|---|---|---|
| 0 | 1 | coronavirus origin | what is the origin of COVID-19 | seeking range of information about the SARS-Co... |
| 1 | 2 | coronavirus response to weather changes | how does the coronavirus respond to changes in... | seeking range of information about the SARS-Co... |
| 2 | 3 | coronavirus immunity | will SARS-CoV2 infected people develop immunit... | seeking studies of immunity developed due to i... |
| 3 | 4 | how do people die from the coronavirus | what causes death from Covid-19? | Studies looking at mechanisms of death from Co... |
| 4 | 5 | animal models of COVID-19 | what drugs have been active against SARS-CoV o... | Papers that describe the results of testing d... |


In [6]:
pd.DataFrame(dataset.qrels_iter()).head() # check qrels

[INFO] [starting] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt: [00:00] [1.14MB] [6.94MB/s]


| | query_id | doc_id | relevance | iteration |
|---|---|---|---|---|
| 0 | 1 | 005b2j4b | 2 | 4.5 |
| 1 | 1 | 00fmeepz | 1 | 4.0 |
| 2 | 1 | 010vptx3 | 2 | 0.5 |
| 3 | 1 | 0194oljo | 1 | 2.5 |
| 4 | 1 | 021q9884 | 1 | 4.0 |


In [7]:
dataset.qrels_defs() # check qrels definitions

{2: 'Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.',
 1: 'Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.',
 0: 'Not Relevant: everything else.'}
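The graded judgments above (0, 1, 2) are what a ranking would be scored against. As a minimal sketch of that idea, the cell below computes precision@k and nDCG@k from a small hand-written `qrels` dictionary. The dictionary layout, the string query IDs, the helper names, and the example ranking are all assumptions made for illustration; they are not part of `ir_datasets`.

```python
import math

# Hypothetical judgments shaped like the qrels rows above:
# (query_id, doc_id) -> relevance grade (0, 1, or 2).
qrels = {
    ("1", "005b2j4b"): 2,
    ("1", "00fmeepz"): 1,
    ("1", "010vptx3"): 2,
}

def precision_at_k(query_id, ranked_doc_ids, qrels, k=10, min_grade=1):
    """Fraction of the top-k documents judged at least `min_grade`."""
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for d in top_k if qrels.get((query_id, d), 0) >= min_grade)
    return hits / k

def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k grades."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg_at_k(query_id, ranked_doc_ids, qrels, k=10):
    """DCG of the ranking divided by the DCG of the best possible ranking."""
    # Unjudged documents are treated as not relevant (grade 0).
    gains = [qrels.get((query_id, d), 0) for d in ranked_doc_ids]
    ideal = sorted((g for (q, _), g in qrels.items() if q == query_id), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A made-up ranking for query "1"; "unjudged_doc" has no judgment.
ranking = ["010vptx3", "unjudged_doc", "00fmeepz"]
p = precision_at_k("1", ranking, qrels, k=3)
n = ndcg_at_k("1", ranking, qrels, k=3)
print(p, n)
```

Treating unjudged documents as grade 0 is a common convention, but it can understate quality when the judgments are sparse.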

## Set up LanceDB and build the full-text search (FTS) index

In [8]:
uri = "data/sample-lancedb"
db = lancedb.connect(uri) # connect to database

In [9]:
table = db.create_table("table_from_df", data=pd.DataFrame(dataset.docs_iter()), mode="overwrite") # create table from dataframe; overwrite so the cell can be re-run

In [10]:
table.head() # check table

pyarrow.Table
doc_id: string
title: string
doi: string
date: string
abstract: string
----
doc_id: [["ug7v899j","02tnwd4m","ejv2xln0","2b73a28n","9785vg6d"]]
title: [["Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia","Nitric oxide: a pro-inflammatory mediator in lung disease?","Surfactant protein-D and pulmonary host defense","Role of endothelin-1 in lung disease","Gene expression in epithelial cells in response to pneumovirus infection"]]
doi: [["10.1186/1471-2334-1-6","10.1186/rr14","10.1186/rr19","10.1186/rr44","10.1186/rr61"]]
date: [["2001-07-04","2000-08-15","2000-08-25","2001-02-22","2001-05-11"]]
abstract: [["OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory sp

In [11]:
table.create_fts_index("abstract", replace=True) # create fts index
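This notebook's filename refers to BM25, the scoring function that Tantivy-backed full-text indexes typically rank by. The cell below is a rough, standalone sketch of Okapi BM25 over whitespace tokens, included only to show what such a score rewards (rare query terms, repeated matches, shorter documents). The function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, and the three toy documents are assumptions; the index's actual tokenization and scoring details may differ.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25 (whitespace tokens)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by length via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy documents written for this example only.
docs = [
    "climate may modulate the transmission of the virus",
    "surfactant protein and pulmonary host defense",
    "weather and climate effects on coronavirus transmission",
]
scores = bm25_scores("coronavirus weather", docs)
print(scores)
```

Only the third toy document contains either query term, so it is the only one with a non-zero score. Purely lexical matching like this is the main reason a reranking step is useful.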

In [12]:
q = pd.DataFrame(dataset.queries_iter()).description[1] # description of the second query (query_id 2)
q, type(q)

('how does the coronavirus respond to changes in the weather', str)

In [13]:
reranker = ColbertReranker(column="abstract") # create reranker
results = table.search(q, query_type="fts").limit(10).rerank(reranker=reranker).to_list() # search and rerank
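ColBERT-style rerankers are usually described in terms of "late interaction": each query token embedding is matched to its most similar document token embedding, and those maxima are summed. The cell below sketches that MaxSim scoring on hand-written 2-d vectors. The vectors, the document names, and the helper functions are hypothetical; this is not the code path of `ColbertReranker`, which computes embeddings with a pretrained model.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Hypothetical 2-d token embeddings; a real model would produce these from text.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "doc_a": [[0.9, 0.1], [0.1, 0.8]],
    "doc_b": [[0.5, 0.5], [0.4, 0.4]],
}

# Reorder the candidate documents by descending MaxSim score.
reranked = sorted(
    candidates,
    key=lambda doc_id: maxsim_score(query_vecs, candidates[doc_id]),
    reverse=True,
)
print(reranked)
```

Because every query token is scored against every document token, this is more expensive than a lexical lookup, which is why it is applied only to a short candidate list such as the 10 hits requested here.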


In [14]:
len(results), results[0] # check results

(10,
 {'doc_id': '9svrz0vj',
  'title': 'Susceptible supply limits the role of climate in the early SARS-CoV-2 pandemic',
  'doi': '10.1126/science.abc2535',
  'date': '2020-07-10',
  'abstract': 'Preliminary evidence suggests that climate may modulate the transmission of SARS-CoV-2. Yet it remains unclear whether seasonal and geographic variations in climate can substantially alter the pandemic trajectory, given high susceptibility is a core driver. Here, we use a climate-dependent epidemic model to simulate the SARS-CoV-2 pandemic probing different scenarios based on known coronavirus biology. We find that while variations in weather may be important for endemic infections, during the pandemic stage of an emerging pathogen the climate drives only modest changes to pandemic size. A preliminary analysis of non-pharmaceutical control measures indicates that they may moderate the pandemic-climate interaction via susceptible depletion. Our findings suggest, without effective control measu