
# Scientific Claim Source Retrieval

## Description

Given an implicit reference to a scientific paper, i.e., a social media post (tweet) that mentions a research publication without a URL, this method retrieves the mentioned paper from a pool of candidate papers. It was initially developed for [CORD19](https://github.com/allenai/cord19), a corpus of academic papers about COVID-19 and related coronavirus research, but it can be used with any corpus of publications that provides titles and abstracts.

The method takes an input claim or sentence from the user, computes its similarity with the publication titles and abstracts in the corpus, and returns a ranked list of matching publications. The similarity between the input claim and the publications is calculated using [BM25](https://en.wikipedia.org/wiki/Okapi_BM25).
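
To make the ranking step concrete, here is a minimal sketch of textbook Okapi BM25 scoring in plain Python. It is not the implementation used by this method (the notebook below relies on the `rank_bm25` library, whose IDF handling may differ in detail). The function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, and the toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook Okapi BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: number of documents in which each term occurs
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Non-negative IDF variant (assumption; libraries differ here)
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Length normalisation: longer-than-average documents are penalised
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy documents standing in for "title + abstract" strings (illustrative only)
toy_docs = [
    "ivermectin inhibits replication of sars-cov-2 in vitro".split(" "),
    "effectiveness of covid-19 vaccines against the delta variant".split(" "),
    "face masks reduce exposure to respiratory infections".split(" "),
]
toy_query = "single dose of ivermectin stops the coronavirus in cell culture".split(" ")

scores = bm25_scores(toy_query, toy_docs)
best = max(range(len(scores)), key=scores.__getitem__)
print(f"best document index: {best}")
```

The first toy document shares the rare term `ivermectin` with the query, so it ranks first. The third shares no terms and scores zero.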

## Use Cases
1. To find which publication is possibly mentioned in a claim/statement.
2. To find publications that are topically similar to a claim/statement.

## Input Data

The [input data](data/query_posts.tsv) consists of social media posts having the following fields:

- post_id : unique post ID in the collection
- tweet_text : text of the post (tweet)


Example Input:

- post_id: 12345678901
- tweet_text: Published in the journal Antiviral Research, the study from Monash University showed that a single dose of Ivermectin could stop the coronavirus growing in cell culture, effectively eradicating all genetic material of the virus within two days.

## Output Data
The [output](output/df_query_train_with_bm25_topk.tsv) shows the publications matched to each input post. It has the following fields:

- post_id : unique post ID in the collection
- tweet_text : text of the post (tweet)
- cord_uid: identifier of the matching publication
- bm25_topk: top-k matching publications based on BM25 similarity score
- in_topx: float giving the reciprocal rank (1/rank) of the matching publication in the top-k list, or 0 if it is not in the list

Example Output:

- post_id: 12345678901
- tweet_text: Published in the journal Antiviral Research, the study from Monash University showed that a single dose of Ivermectin could stop the coronavirus growing in cell culture, effectively eradicating all genetic material of the virus within two days.
- cord_uid: htlvpvz5 (Effectiveness of Covid-19 Vaccines against the B.1.617.2 (Delta) Variant)
- bm25_topk: ['htlvpvz5', 'h7hj64q5', 'rwgqkow3', 'dbgtslc8', 'am11yqbf']
- in_topx: 1.0


## Hardware Requirements
The method runs on a small virtual machine from a cloud computing provider (2 x86 CPU cores, 4 GB RAM, 40 GB HDD).

## Environment Setup

The method is implemented in Python. The required libraries are listed in `requirements.txt` and can be installed via `pip`:


In [34]:
!pip install -r requirements.txt
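
After installation, a quick check along the following lines can confirm that the packages are importable. This is a sketch, not part of the original notebook: the helper `missing_packages` is hypothetical, and the package list is inferred from the imports used in the cells below (`requirements.txt` may list more).

```python
import importlib.util

# Packages imported by the notebook cells in this document (assumed list)
REQUIRED_PACKAGES = ["numpy", "pandas", "rank_bm25"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED_PACKAGES)
if missing:
    print(f"Missing packages: {missing}. Run the install cell first.")
else:
    print("All required packages are available.")
```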





## How to Use

Please follow the instructions in the [notebook](notebooks/getting_started_claim_source.ipynb).

## Technical Details
The two posts below are examples of implicit references, i.e., posts that describe a study without linking to it:

Published in the journal Antiviral Research, the study from Monash University showed that a single dose of Ivermectin could stop the coronavirus growing in cell culture -- effectively eradicating all genetic material of the virus within two days. 

Peer-reviewed in the New England Journal of Medicine regarding Delta (B.1.617.2):  
- Pfizer is ~90% effective  
- AstraZeneca is ~70% effective.  
This falls in line with vaccine efficacy of other variants. Yes, the vaccines ARE indeed effective against Delta.
    
## Contact Details

For questions or feedback, contact Yavuz Selim Kartal via [YavuzSelim.Kartal@gesis.org](mailto:YavuzSelim.Kartal@gesis.org).


# 1) Importing data

In [18]:
import numpy as np
import pandas as pd

## 1.a) Import the collection set
The collection set contains metadata of CORD-19 academic papers.

Download the file first, then upload it to the Google Colab session using the following steps.


In [4]:
# 1) Download the collection set from the Gitlab repository
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_COLLECTION_DATA = 'data/collection_data.pkl' #MODIFY PATH

In [5]:
df_collection = pd.read_pickle(PATH_COLLECTION_DATA)

In [6]:
df_collection.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7718 entries, 162 to 1056448
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   cord_uid          7718 non-null   object        
 1   source_x          7718 non-null   object        
 2   title             7718 non-null   object        
 3   doi               7677 non-null   object        
 4   pmcid             4959 non-null   object        
 5   pubmed_id         6233 non-null   object        
 6   license           7718 non-null   object        
 7   abstract          7718 non-null   object        
 8   publish_time      7715 non-null   object        
 9   authors           7674 non-null   object        
 10  journal           6668 non-null   object        
 11  mag_id            0 non-null      float64       
 12  who_covidence_id  528 non-null    object        
 13  arxiv_id          20 non-null     object        
 14  label             7718 n

In [7]:
df_collection.head()

| | cord_uid | source_x | title | doi | pmcid | pubmed_id | license | abstract | publish_time | authors | journal | mag_id | who_covidence_id | arxiv_id | label | time | timet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 162 | umvrwgaw | PMC | Professional and Home-Made Face Masks Reduce E... | 10.1371/journal.pone.0002618 | PMC2440799 | 18612429 | cc-by | BACKGROUND: Governments are preparing for a po... | 2008-07-09 | van der Sande, Marianne; Teunis, Peter; Sabel,... | PLoS One | | | | umvrwgaw | 2008-07-09 | 1215561600 |
| 611 | spiud6ok | PMC | The Failure of R (0) | 10.1155/2011/527610 | PMC3157160 | 21860658 | cc-by | The basic reproductive ratio, R (0), is one of... | 2011-08-16 | Li, Jing; Blakeley, Daniel; Smith?, Robert J. | Comput Math Methods Med | | | | spiud6ok | 2011-08-16 | 1313452800 |
| 918 | aclzp3iy | PMC | Pulmonary sequelae in a patient recovered from... | 10.4103/0970-2113.99118 | PMC3424870 | 22919170 | cc-by-nc-sa | The pandemic of swine flu (H1N1) influenza spr... | 2012 | Singh, Virendra; Sharma, Bharat Bhushan; Patel... | Lung India | | | | aclzp3iy | 2012-01-01 | 1325376000 |
| 993 | ycxyn2a2 | PMC | What was the primary mode of smallpox transmis... | 10.3389/fcimb.2012.00150 | PMC3509329 | 23226686 | cc-by | The mode of infection transmission has profoun... | 2012-11-29 | Milton, Donald K. | Front Cell Infect Microbiol | | | | ycxyn2a2 | 2012-11-29 | 1354147200 |
| 1053 | zxe95qy9 | PMC | Lessons from the History of Quarantine, from P... | 10.3201/eid1902.120312 | PMC3559034 | 23343512 | no-cc | In the new millennium, the centuries-old strat... | 2013-02-03 | Tognotti, Eugenia | Emerg Infect Dis | | | | zxe95qy9 | 2013-02-03 | 1359849600 |


## 1.b) Import the query set

The query set contains tweets with implicit references to academic papers from the collection set.

In [9]:
# 1) Download the query tweets from the Gitlab repository
PATH_QUERY_TRAIN_DATA = 'data/query_posts.tsv' #MODIFY PATH

In [10]:
df_query_train = pd.read_csv(PATH_QUERY_TRAIN_DATA, sep = '\t')

In [11]:
df_query_train.head()

| | post_id | tweet_text | cord_uid |
|---|---|---|---|
| 0 | 0 | Oral care in rehabilitation medicine: oral vul... | htlvpvz5 |
| 1 | 1 | this study isn't receiving sufficient attentio... | 4kfl29ul |
| 2 | 2 | thanks, xi jinping. a reminder that this study... | jtwb17u8 |
| 3 | 3 | Taiwan - a population of 23 million has had ju... | 0w9k8iy1 |
| 4 | 4 | Obtaining a diagnosis of autism in lower incom... | tiqksd69 |


In [12]:
df_query_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12853 entries, 0 to 12852
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   post_id     12853 non-null  int64 
 1   tweet_text  12853 non-null  object
 2   cord_uid    12853 non-null  object
dtypes: int64(1), object(2)
memory usage: 301.4+ KB


# 2) Running the baseline
The following code runs a BM25 baseline.


In [35]:
from rank_bm25 import BM25Okapi

# Create the BM25 corpus
corpus = df_collection[:][['title', 'abstract']].apply(lambda x: f"{x['title']} {x['abstract']}", axis=1).tolist()
cord_uids = df_collection[:]['cord_uid'].tolist()
tokenized_corpus = [doc.split(' ') for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
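
The cell above splits text on single spaces, so `Ivermectin`, `ivermectin` and `ivermectin,` count as three different tokens. One possible variation, sketched below, lowercases the text and strips punctuation before matching. The function `simple_tokenize` and its regular expression are illustrative assumptions, not part of the baseline, and their effect on the reported scores has not been measured here.

```python
import re

# Runs of letters/digits, optionally joined by inner hyphens (e.g. "sars-cov-2")
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def simple_tokenize(text):
    """Lowercase the text and return alphanumeric tokens without punctuation."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = simple_tokenize("Published in Antiviral Research: Ivermectin, SARS-CoV-2!")
print(tokens)
```

If a tokenizer like this is used, it has to be applied to both the corpus and the queries. Otherwise the two sides no longer share a vocabulary.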

In [15]:
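# Cache of query text -> top-5 cord_uids. It lives outside the function so that
# repeated queries are looked up instead of being scored again.
text2bm25top = {}

def get_top_cord_uids(query):
    """
    Given a query, return the top 5 CORD UIDs based on BM25 scores.
    Args:
        query (str): The search query.
    Returns:
        list: A list of the top 5 CORD UIDs, best match first.
    """
    if query in text2bm25top:
        return text2bm25top[query]

    # Tokenize the query the same way as the corpus (split on single spaces)
    tokenized_query = query.split(' ')
    doc_scores = bm25.get_scores(tokenized_query)
    # Indices of the five highest-scoring documents, in descending score order
    indices = np.argsort(-doc_scores)[:5]
    bm25_topk = [cord_uids[x] for x in indices]

    text2bm25top[query] = bm25_topk
    return bm25_topk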
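
The function above always returns five candidates. As a sketch of how the cut-off could be made configurable, the hypothetical helper below selects the `k` best indices from a plain list of scores using only the standard library. The identifiers in `toy_uids` are made up, and tie ordering may differ from `np.argsort`.

```python
def top_k_indices(scores, k=5):
    """Return the indices of the k highest scores, best first."""
    # sorted() is stable, so documents with equal scores keep their original order
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

toy_scores = [0.2, 3.1, 0.0, 1.7]
toy_uids = ["doc_a", "doc_b", "doc_c", "doc_d"]  # hypothetical identifiers

top = [toy_uids[i] for i in top_k_indices(toy_scores, k=2)]
print(top)  # → ['doc_b', 'doc_d']
```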


In [20]:
# Retrieve the top-k candidates for every tweet using the BM25 model
# This will take a while, so be patient.
df_query_train['bm25_topk'] = df_query_train['tweet_text'].apply(get_top_cord_uids)

# 3) Inspect the results
The following cell shows the top predicted publications, identified by their cord_uids, for each tweet.

In [24]:
df_query_train[['tweet_text', 'bm25_topk']]

| | tweet_text | bm25_topk |
|---|---|---|
| 0 | Oral care in rehabilitation medicine: oral vul... | [htlvpvz5, h7hj64q5, rwgqkow3, dbgtslc8, am11y... |
| 1 | this study isn't receiving sufficient attentio... | [2cwvga0k, 33znyrn8, st3fyb64, 74dw6emg, b68c8... |
| 2 | thanks, xi jinping. a reminder that this study... | [8hkxbxz9, w98847ai, jtwb17u8, iy1enazk, pg0l9... |
| 3 | Taiwan - a population of 23 million has had ju... | [32ua8wb6, zxe95qy9, 30pl5tx3, l4y7v729, b97ac... |
| 4 | Obtaining a diagnosis of autism in lower incom... | [tiqksd69, k7smwz6w, b0dzhsrh, yc7cvbii, iiafu... |
| ... | ... | ... |
| 12848 | "evidence on covid-19 reveals a growing body o... | [l9lni5d3, qfrenkb6, wnlse824, jgq968f6, l69kw... |
| 12849 | Outdoor lighting has detrimental impacts on lo... | [s2bpha8l, 8a3fp7ym, eie9cozf, 5nt99kyu, o3pvs... |
| 12850 | 26/ and influenza virus (and other pathogens, ... | [2d2y5gmg, ps5crd29, ogx40z8c, 9rxv6fy9, lavcs... |
| 12851 | does it?'sars-cov-2-naïve vaccinees had a 13.0... | [t4y1ylb3, 7a543f7v, b5eve7re, 02p4et0u, u89jd... |


# 4) Evaluating the baseline
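The following code evaluates the BM25 retrieval baseline on the query set using Mean Reciprocal Rank (MRR@k) for k = 1, 5, and 10. Because only the top five candidates are retrieved per post, MRR@10 is identical to MRR@5.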

In [25]:
# Evaluate retrieved candidates using MRR@k
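def get_performance_mrr(data, col_gold, col_pred, list_k=(1, 5, 10)):
    """Return {k: MRR@k}; also writes the per-row reciprocal rank to data['in_topx']."""
    d_performance = {}
    for k in list_k:
        # Reciprocal rank of the gold cord_uid within the top-k predictions, or 0 if absent.
        # The column is overwritten for each k, so it ends up holding the values for the last k.
        data["in_topx"] = data.apply(
            lambda x: 1 / (list(x[col_pred][:k]).index(x[col_gold]) + 1)
            if x[col_gold] in x[col_pred][:k] else 0,
            axis=1,
        )
        d_performance[k] = data["in_topx"].mean()
    return d_performance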


In [31]:
results_train = get_performance_mrr(df_query_train, 'cord_uid', 'bm25_topk')
# Printed MRR@k results in the following format: {k: MRR@k}
print(f"Results on the train set: {results_train}")

Results on the train set: {1: np.float64(0.5079747918773827), 5: np.float64(0.5508999196037242), 10: np.float64(0.5508999196037242)}
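
To show how these numbers arise, the sketch below recomputes the metric by hand for the first three posts shown earlier. It is an illustration in plain Python, not a replacement for `get_performance_mrr`. The helper names are hypothetical, and the prediction lists are cut to the four identifiers that are fully visible in the truncated table output.

```python
def reciprocal_rank(gold, predictions, k):
    """Return 1/rank of gold within the first k predictions, or 0.0 if absent."""
    top_k = list(predictions[:k])
    return 1.0 / (top_k.index(gold) + 1) if gold in top_k else 0.0

def mean_reciprocal_rank(golds, all_predictions, k):
    """Average the reciprocal ranks over all posts."""
    ranks = [reciprocal_rank(g, p, k) for g, p in zip(golds, all_predictions)]
    return sum(ranks) / len(ranks)

# Gold cord_uids and (truncated) predictions for posts 0, 1 and 2
golds = ["htlvpvz5", "4kfl29ul", "jtwb17u8"]
preds = [
    ["htlvpvz5", "h7hj64q5", "rwgqkow3", "dbgtslc8"],  # gold at rank 1 -> 1.0
    ["2cwvga0k", "33znyrn8", "st3fyb64", "74dw6emg"],  # gold missing  -> 0.0
    ["8hkxbxz9", "w98847ai", "jtwb17u8", "iy1enazk"],  # gold at rank 3 -> 1/3
]

mrr_at_1 = mean_reciprocal_rank(golds, preds, k=1)
mrr_at_5 = mean_reciprocal_rank(golds, preds, k=5)
print(f"MRR@1 = {mrr_at_1:.3f}, MRR@5 = {mrr_at_5:.3f}")
```

MRR@1 only rewards a correct first hit, so it is a lower bound on MRR@5. This matches the pattern in the results above.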


In [33]:
# Create the output directory if it does not exist yet
import os
os.makedirs('output', exist_ok=True)

# Save df_query_train, including the new 'bm25_topk' and 'in_topx' columns, to the output folder
OUTPUT_PATH = 'output/df_query_train_with_bm25_topk.tsv'
df_query_train.to_csv(OUTPUT_PATH, sep='\t', index=False)

# Columns added to df_query_train by the steps above:
# - 'bm25_topk': the top-k candidate cord_uids retrieved with the BM25 model for each tweet_text.
# - 'in_topx': the reciprocal rank of the correct cord_uid in the top-k predictions, or 0 if it is not present
#   (values for the last k evaluated, i.e. k=10).
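
When the saved file is read back, the `bm25_topk` column arrives as text such as `['htlvpvz5', 'h7hj64q5', ...]` rather than as a Python list. The sketch below shows one way to restore it with the standard library. The sample row is an abbreviated stand-in for the real file, with a shortened `tweet_text`.

```python
import ast
import csv
import io

# Abbreviated stand-in for one line of output/df_query_train_with_bm25_topk.tsv
sample_tsv = (
    "post_id\ttweet_text\tcord_uid\tbm25_topk\tin_topx\n"
    "0\tOral care in rehabilitation medicine\thtlvpvz5\t"
    "['htlvpvz5', 'h7hj64q5', 'rwgqkow3', 'dbgtslc8', 'am11yqbf']\t1.0\n"
)

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
for row in rows:
    # literal_eval safely parses the list literal without executing code
    row["bm25_topk"] = ast.literal_eval(row["bm25_topk"])
    row["in_topx"] = float(row["in_topx"])

print(rows[0]["bm25_topk"][0], rows[0]["in_topx"])
```

With pandas, the same parsing can be done at load time by passing `converters={'bm25_topk': ast.literal_eval}` to `pd.read_csv`.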