## Table of Contents
 - [1. Introduction](#Introduction)
 - [2. Overview of the SciReader](#Overview-of-the-SciReader)
  * [Word and Named Entity Embedding](#Word-and-Named-Entity-Embedding)
  * [Semantic Keyword Search](#Semantic-Keyword-Search)
  * [Semantic Sentence Search](#Semantic-Sentence-Search)
 - [3. Use Case: Learn About a COVID-19 Research Task](#Use-Case:-Learn-About-a-COVID-19-Research-Task)
 - [4. Conclusion and Discussion](#Conclusion-and-Discussion)


## Introduction

Amid the COVID-19 pandemic, staying informed about the most relevant, up-to-date research and clinical outcomes is critical to saving lives and managing the crisis. However, navigating an already vast body of literature that keeps growing at head-spinning speed is extremely difficult, which hinders timely information sharing and decision making.
To tackle this challenge, we employ a semantic text mining approach to efficiently locate and digest the most relevant literature for a given research task.

## Overview of the SciReader
SciReader achieves semantic text mining via a word embedding approach. We use the [sciSpacy](https://allenai.github.io/scispacy/) NLP model, which is trained on a biomedical literature corpus, to perform tokenization and named entity recognition and to produce embedding vectors for the words and entities. Quantifying the similarity between the embedding vectors of named entities powers the [semantic keyword search](#Semantic-Keyword-Search). The [semantic search at the sentence level](#Semantic-Sentence-Search) is based on calculating the word mover's distance between groups of tokens using the [wmd](https://github.com/src-d/wmd-relax) package.

#### To start:
1. Install required packages

In [None]:
!pip install scispacy  --quiet
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz  --quiet
!pip install wmd  --quiet

2. Install [SciReader](https://github.com/yubailibra/scireader)

In [None]:
!pip install scireader==0.0.4 --quiet 

3. Prepare the environment and load the prerequisites and SciReader

In [None]:
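import spacy
import scispacy
import glob
from scireader import *

# Explicit imports for the names used below, in case the star import does not expose them
import en_core_sci_lg
from scispacy.abbreviation import AbbreviationDetector
from wmd import WMD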

root_path = '/kaggle/input/CORD-19-research-challenge/'
jsonfiles = glob.glob(f'{root_path}/**/pdf_json/*.json', recursive=True)
print('Load '+str(len(jsonfiles))+" papers for this study.")

nlp = en_core_sci_lg.load()
abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)
nlp.add_pipe(WMD.SpacySimilarityHook(nlp), last=True)

### Word and Named Entity Embedding

A PaperBank object is the workhorse for analyzing the literature corpus. It organizes and cleans the text, and extracts tokens and named entities together with their embedding vectors.

In [None]:
bank=PaperBank(nlp)
bank.read(jsonfiles)
bank.parse('abstract')
print('Done building the PaperBank object')

### Semantic Keyword Search

Below we showcase how the semantic keyword search works. In this example, besides matching the literal keyword 'spike protein', we also fetch related terms such as 'spike gene', 'spike s domain', and 'spike subunit'. The query also returns the papers that contain the keyword and/or its synonyms; these papers are used in the subsequent analysis.

In [None]:
bank.query('spike protein',similarity=0.8,verbose=True)
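
To illustrate the idea behind a similarity threshold such as `similarity=0.8`, here is a minimal sketch of threshold-based synonym lookup using cosine similarity. It is not SciReader's implementation: the helper names `cosine_similarity` and `find_synonyms` and the toy 3-dimensional vectors are assumptions made up for illustration, and whether `bank.query` uses cosine similarity specifically is also an assumption.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def find_synonyms(query, embeddings, threshold=0.8):
    """Return (term, score) pairs whose similarity to `query` meets the threshold, best first."""
    q = embeddings[query]
    scored = [(term, cosine_similarity(q, vec))
              for term, vec in embeddings.items() if term != query]
    hits = [(term, score) for term, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical 3-dimensional vectors, made up for illustration only;
# real sciSpacy vectors have many more dimensions.
toy_embeddings = {
    'spike protein': np.array([1.0, 0.0, 0.0]),
    'spike gene': np.array([0.9, 0.1, 0.0]),
    'spike subunit': np.array([0.8, 0.3, 0.1]),
    'hand hygiene': np.array([0.0, 0.2, 1.0]),
}

synonyms = find_synonyms('spike protein', toy_embeddings, threshold=0.8)
for term, score in synonyms:
    print(term, round(score, 3))
```

Raising the threshold keeps only near-identical terms, while lowering it widens the net at the cost of more loosely related hits.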

### Semantic Sentence Search

SciReader is able to compare sentences semantically. In the validation example shown below, the first sentence is much closer in meaning to the second sentence than to the third. Consistent with this, our model returns a much smaller score for the first pair than for the first and third sentences; because the comparison is based on the word mover's distance, a smaller score indicates closer meaning.

In [None]:
sentence1='underlying disease may be a risk factor for the ICU patients'
sentence2='hypertension strongly predictive severe disease admission'
sentence3='these epitopes may potentially offer protection against this novel virus'

similarity12=sentSimilarity(bank,sentence1,sentence2)
similarity13=sentSimilarity(bank,sentence1,sentence3)

print('The similarity score between the sentence 1 and sentence 2 is: '+str(similarity12))
print('The similarity score between the sentence 1 and sentence 3 is: '+str(similarity13))
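
For intuition about distance-based sentence comparison, the sketch below computes a relaxed word mover's distance: each word sends all of its weight to the nearest word in the other sentence, and the larger of the two directional costs is taken as a lower bound on the full distance. This is not the code used by the wmd package or by `sentSimilarity`; the function name `relaxed_wmd`, the uniform word weights, and the toy 2-dimensional vectors are assumptions for illustration.

```python
import numpy as np

def relaxed_wmd(vecs_a, vecs_b):
    """Lower bound on the word mover's distance, assuming uniform word weights."""
    # Pairwise Euclidean distances between every word in A and every word in B
    dists = np.linalg.norm(vecs_a[:, None, :] - vecs_b[None, :, :], axis=2)
    # Each word moves all of its weight to its nearest counterpart
    a_to_b = dists.min(axis=1).mean()
    b_to_a = dists.min(axis=0).mean()
    # The tighter (larger) of the two relaxations is the bound
    return float(max(a_to_b, b_to_a))

# Hypothetical 2-dimensional word vectors, one row per word
sent1 = np.array([[1.0, 0.0], [0.9, 0.2]])
sent2 = np.array([[1.0, 0.1], [0.8, 0.2]])   # words close to those of sent1
sent3 = np.array([[0.0, 1.0], [0.1, 0.9]])   # words far from those of sent1

d12 = relaxed_wmd(sent1, sent2)
d13 = relaxed_wmd(sent1, sent3)
print('distance 1-2:', d12)
print('distance 1-3:', d13)
```

A smaller value means the two sets of word vectors lie closer together, which is why the related pair scores lower than the unrelated pair.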

## Use Case: Learn About a COVID-19 Research Task

Now we apply SciReader to locate the most relevant literature for one of the tasks in the challenge: the effectiveness of drugs being developed and tried to treat COVID-19 patients.

**First**, we use the semantic keyword search to narrow down to a collection of papers that are related to both the COVID-19 disease and the candidate therapeutics. We obtained the keywords from the invaluable medical dictionary that [@savannareid](https://www.kaggle.com/savannareid) has put together. 

In [None]:
kw_covid19=['severe acute respiratory syndrome coronavirus.2','cov.19','covid 19','2019 corona virus','corona virus 2019','sars cov 2','2019 cov','cov.2','coronavirus.2','wuhan cov','wuhan corona virus','pandemic corona virus','corona virus pandemic']

kw_therapy=['naproxen','arbidol hydrochloride','oseltamivir','angiotensin converting inhibitor','ace2 inhibitor','ace.2 inhibitor','arbidol','asc09','ritonavir','atazanavir','aviptadil','oseltamivir','azithromycin','baricitinib','bevacizumab','bromhexine hydrochloride','camostat mesilate','carrimycin','cd24fc','chloroquine diphosphate','chloroquine','hydroxychloroquine','chloroquine phosphate','colchicine','alfa.*interferon','interferon alfa','lopinavir','ritonavir','darunavir','cobicistat','das181','dexamethasone','eculizumab','escin','favipiravir','tocilizumab','fingolimod','hydrocortisone','ceftriaxone','moxifloxacin','levofloxacin','piperacillin tazobactam','piperacillin','tazobactam','ceftaroline','amoxicillin','clavulanate', 'amoxicillin clavulanate','macrolide','oseltamivir','interferon β1a','interferon beta','anakinra','ganovo','danoprevir','huaier granule','hyperbaric oxygen','losartan','meplazumab','methylprednisolone','acetylcysteine','nitric oxide','pd.1 antibody','thymosin','thalidomide','plaquenil','pul.042','pul042', 'rhACE2','recombinant angiotensin converting enzyme','recombinant human interferon alpha','thymosin alpha','recombinant human interferon','remdesivir','roactemra','kevzara','sargramostim','sarilumab','sildenafil citrate','tetrandrine','tocilizumab','clarithromycin','minocyclin','randomize control trial', 'odds ratio', 'observation case series', 'randomized trial', 'case.control', 'interrupt time.series', 'hazard ratio', 'odds ratio', 'treatment effect','rate adverse event', "reduction disease symptom","reduction symptom"]

hits_covid19=scanPapersByKW(bank,kw_covid19,similarity_outcome=0.9)

hits_therapy=scanPapersByKW(bank,kw_therapy,similarity_outcome=0.99)

candidates_therapy=list(set(hits_covid19).intersection(set(hits_therapy)))

print('Found '+str(len(hits_covid19))+' papers by semantic search of keywords related to COVID-19')
print('Found '+str(len(hits_therapy))+' papers by semantic search of keywords related to candidate therapeutics')
print('Jointly we found '+str(len(candidates_therapy))+' papers that are related to both COVID-19 and therapeutics')


**Next**, we apply the semantic sentence search on the candidate papers to select the most relevant ones.

In [None]:
inquiry='evaluation of covid-19 treatment'
answer=scanPapersBySent(bank,candidates_therapy,[inquiry],distance=3)

print('Top 20 papers (id) and their WMD scores for the inquiry (\''+inquiry+'\'):')
answer[inquiry][0:20]

By examining the abstracts of the top-hit papers, we are convinced that the hits indeed provide critical, sought-after information to address the inquiry.

In [None]:
from scireader.utils import cleanText
for hit in answer[inquiry][0:20]:
    print('paper id='+hit[0],'score='+str(hit[1]))
    print(cleanText(bank.text.loc[hit[0],'abstract']))
    print('\n')

## Conclusion and Discussion

In this work, we identified the most informative literature related to a research topic of interest via a word-embedding-based semantic search. For demonstration purposes, we investigated only the specific task about COVID-19 therapeutics. Nevertheless, the same approach is readily applicable to the other tasks in this challenge, which we are interested in exploring in the future. In addition, we aim to further mine the retrieved information, for example through content summarization, credibility evaluation, and data scraping and analysis.



## Acknowledgement

Contributors:

Kaggle ID: [ytisserant](https://www.kaggle.com/ytisserant), email: ytisserant@gmail.com

Kaggle ID: [yunchenyang](https://www.kaggle.com/yunchenyang), email: yunchenyang@hotmail.com