## Task
Create summary tables that address therapeutics, interventions, and clinical studies related to COVID-19. Specifically, the submission should focus on what the literature reports about:
* What is the best method to combat the hypercoagulable state seen in COVID-19?
* What is the efficacy of novel therapeutics being tested currently?

We describe our efforts to answer these two questions. For each question, we submit one csv file.

## Approach Summary
Our phase-1 submission for this task described CoNetz, a tool for text and network mining for COVID-19 research. The tool allows easy exploration and visualization of the network derived from the COVID-19 corpus in order to generate leads. We now describe our phase-2 submission for this task, which uses a set of new techniques to specifically address the above two questions.

Our fully automated phase-2 pipeline broadly consists of the following steps:
1. **Corpus preprocessing (corpus augmentation, named entity annotation and cleansing)**:  
  The CORD-19 corpus was augmented with additional COVID-19 related MEDLINE abstracts. Foreign (non-English) language articles were then removed. A lexicon (dictionary)-based NER was performed to annotate named entities in the articles. In particular, entities belonging to "PHENOTYPE and SYMPTOM" terms, "CHEMICAL, DRUG and INTERVENTION" terms, "HYPERCOAGULATION" terms and COVID-19 DISEASE terms were tagged. 
2. **Identification of relevant articles**:  
 Relevant COVID-19 articles are selected: those related to clinical intervention studies in the context of either COVID-19 therapeutics or management of the hypercoagulable state.
3. **Information extraction/prediction for populating the target fields**:  
 Applying automated information extraction and prediction on the final corpus based on various DL and ML models including the BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) model and SVM models for populating the target fields in the csv output. 
4. **Post processing**:  
 Process the csv intermediate output from the previous step for handling duplicates, missing fields, noisy entries etc., and produce the final csv output.

## Approach Details

In the following, we describe the above steps in detail. Implementation details, including the organization of supporting data/code and statistics of intermediate outputs, are discussed at the end (Additional Details section). Most of the computational pipeline is identical for both questions. Wherever applicable, we discuss the custom modifications applied to answer a specific question. The pipeline is run on the latest CORD-19 corpus (uploaded on 9th June 2020).

### 1. **Corpus preprocessing (corpus augmentation, named entity annotation and cleansing)**: 
* Using a JSON parser, the provided CORD-19 corpus was converted to a TSV file that consists of relevant fields.
* The provided CORD-19 corpus is an excellent source of COVID-19 related articles. However, to improve coverage, we also included additional COVID-19 related MEDLINE abstracts that are not present in the provided CORD-19 corpus. To do this, we searched the MEDLINE corpus for COVID-19 or any of its synonyms and then compared the PubMed IDs of the hits with the PubMed IDs present in the CORD-19 corpus.
* Foreign language (non-English) articles were removed from this augmented corpus. Prediction based on the "FastText" model was used to filter out non-English articles. Please see the Additional Details section for code/data. The final filtered set of articles is saved in the cord19_arts.tsv file.
* We performed Named Entity Recognition (NER) on the filtered corpus to tag the entities of interest. Specifically, we tagged entities belonging to "PHENOTYPE and SYMPTOM" terms, "CHEMICAL, DRUG and INTERVENTION" terms, "HYPERCOAGULATION" terms and COVID-19 DISEASE terms. These entities are crucial for accurate extraction of relevant information.
* For performing annotations, we used the TPX NER tool [1] which is an in-house lexicon (dictionary)-based NER module. The customized dictionaries consisted of a comprehensive set of terms belonging to the four entity categories. The NER module performs approximate string matching for tagging. It can also handle local abbreviations.
* The input to the NER module is cord19_arts.tsv. The NER output consists of a) article level annotation capturing the tagged entities (annotations.inp) and b) corpus level aggregate data that captures pairwise co-occurrence and pairwise Pearson correlation between the tagged entities in the input corpus (pc_associations_master.txt). The output files are made available in the data folder. 
* The corpus preprocessing steps above are not included in our Jupyter notebook. However, all the relevant code and intermediate outputs are available in the data/code folder (see the Additional Details section), with the exception of the NER module: the TPX NER code is not included in our submission because of licensing constraints. To support future runs of our pipeline on newer articles, we plan to provide a web-based facility for remote execution of the NER module, which takes a cord19_arts.tsv file as input and produces the two corresponding output files that feed into the next step.
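
As an illustration of the corpus-level aggregate described above, the sketch below computes pairwise co-occurrence counts and the Pearson correlation between binary presence indicators of tagged entities across articles. The TPX NER module is not part of this submission, so the input structure (`annotations`, a mapping from article ID to the set of tagged entities) and the function name are assumptions, and the exact statistic that TPX computes may differ.

```python
from itertools import combinations
from math import sqrt


def cooccurrence_and_pearson(annotations):
    """annotations: dict article_id -> set of tagged entities (hypothetical format).

    Returns dict (entity_a, entity_b) -> (co-occurrence count, Pearson r).
    """
    n = len(annotations)
    doc_freq = {}   # entity -> number of articles mentioning it
    pair_freq = {}  # (entity_a, entity_b) -> number of articles mentioning both
    for entities in annotations.values():
        for e in entities:
            doc_freq[e] = doc_freq.get(e, 0) + 1
        for a, b in combinations(sorted(entities), 2):
            pair_freq[(a, b)] = pair_freq.get((a, b), 0) + 1

    results = {}
    for (a, b), n_ab in pair_freq.items():
        n_a, n_b = doc_freq[a], doc_freq[b]
        # Pearson correlation of two 0/1 indicator vectors (phi coefficient).
        denom = sqrt(n_a * (n - n_a) * n_b * (n - n_b))
        r = (n * n_ab - n_a * n_b) / denom if denom else 0.0
        results[(a, b)] = (n_ab, r)
    return results


# Toy corpus, invented for illustration only.
toy = {
    "a1": {"COVID-19", "heparin", "thrombosis"},
    "a2": {"COVID-19", "heparin"},
    "a3": {"COVID-19", "remdesivir"},
    "a4": {"influenza"},
}
assoc = cooccurrence_and_pearson(toy)
```

Pairs that never co-occur are omitted from the result, which keeps the output proportional to the number of observed pairs rather than to the square of the vocabulary size.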

### 2. **Identification of relevant articles**: 
* In this step, the output files from the previous step are used to select the relevant articles for the subsequent information extraction steps. We apply multiple filters for relevant article identification.
* The first filter uses the entity annotations of the articles. An article passes this filter if the entity annotations in its title/abstract contain
    * A COVID-19 (or its synonym) term 
    * at least one "CHEMICAL, DRUG and INTERVENTION" term
    * at least one "HYPERCOAGULATION" term (only in the case of addressing the first question on hypercoagulable state).
* The second filter is applied to the articles that pass the first filter in order to identify those related to clinical intervention studies, because for both questions in this task the information has to be extracted from such articles. The second filter combines pattern matching with SVM-based classification. An article classified as positive by either the pattern matching or the SVM classifier is included for further processing.
    * In pattern matching, articles whose title/abstract has an occurrence of any of ["patients" or "volunteers" or "participants" or "cases" or "COVID-19 case" or "cohort" or [0-9]+ year old] are tagged as positive.
    * A one-class SVM classifier (outlier detection) was trained on a positive training set of PubMed articles whose publication type metadata is one of: Adaptive Clinical Trial; Case Reports; Clinical Conference; Clinical Study; Clinical Trial; Clinical Trial, Phase I; Clinical Trial, Phase II; Clinical Trial, Phase III; Clinical Trial, Phase IV; Clinical Trial Protocol; Controlled Clinical Trial; Pragmatic Clinical Trial; Randomized Controlled Trial. Adding a negative training set and training a two-class SVM classifier could improve quality further; this submission uses the one-class classifier only.
* The output is a tsv file "novel_th_ab.tsv" ("hgs_ab.tsv" for the hypercoagulation question). These files are created in the Jupyter working folder.
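
The two filters above can be sketched as follows. This is a minimal illustration, not the submitted Java implementation: the function names, the category label strings, the case-insensitive matching and the `svm_predict` callable (a stand-in for the trained one-class SVM) are all assumptions.

```python
import re

# Cohort-style cues listed in the text; case-insensitive matching is an assumption.
COHORT_PATTERN = re.compile(
    r"\b(patients|volunteers|participants|cases|COVID-19 case|cohort)\b"
    r"|[0-9]+ year old",
    re.IGNORECASE,
)


def passes_entity_filter(entity_types, require_hypercoagulation=False):
    """entity_types: set of entity categories tagged in the title/abstract."""
    required = {"COVID-19 DISEASE", "CHEMICAL, DRUG and INTERVENTION"}
    if require_hypercoagulation:  # only for the hypercoagulable-state question
        required.add("HYPERCOAGULATION")
    return required <= entity_types


def is_clinical_study(text, svm_predict=None):
    """Positive if either the pattern match or the optional classifier fires."""
    if COHORT_PATTERN.search(text):
        return True
    return bool(svm_predict and svm_predict(text))


def is_relevant(text, entity_types, require_hypercoagulation=False, svm_predict=None):
    return (passes_entity_filter(entity_types, require_hypercoagulation)
            and is_clinical_study(text, svm_predict))


# Invented example inputs.
case_report = "A 62 year old man with COVID-19 received heparin."
modelling = "Structural modelling of the spike protein and candidate inhibitors."
tags = {"COVID-19 DISEASE", "CHEMICAL, DRUG and INTERVENTION"}
```

With these inputs, `is_relevant(case_report, tags)` is true, while the same call with `require_hypercoagulation=True` is false because no HYPERCOAGULATION tag is present.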

### 3. **Information extraction/prediction for populating the target fields**:
* In this step, we apply information extraction and prediction on each article present in the output of the previous step. For this, we apply various DL and ML models including the BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) model [2] and SVM models.
* The target fields in the output are: Date | Study | Study Link | Journal | Study Type | Therapeutic method(s) utilized/assessed | Sample size| Severity of Symptoms | General Outcome/Conclusion Excerpt | Primary Endpoint(s) of Study | Clinical Improvement (Y/N)
* The entity annotations of the articles are also used in conjunction with ML and DL models for populating the different fields.
* **"Date", "Study", "Study Link" and "Journal" fields**: These are directly populated from the CORD-19 article metadata.
* **"Study Type" field**:
    * We use an SVM based classifier to populate this field. We train a multi-class SVM classifier using training data constructed from PubMed. For each class, we create approximately 2000 training articles by using related PubMed searches and related PubMed metadata. The training corpus, training code and the final trained SVM model are available in the data folder. 
* **"Therapeutic method(s) utilized/assessed" field**:
    * We use the pre-trained BioBERT model, trained on a biomedical corpus, to extract this field.
    * In particular, we use the Question/Answer (Q&A) model of BioBERT. We frame multiple questions around therapeutics, such as "What were the therapeutics used for treatment of covid-19?", and scan the text snippets that BioBERT returns from the article title/abstract. We then identify the snippet to which BioBERT assigns the maximum prediction score.
    * We then use the article annotation data to search for tagged therapeutics terms in the identified text snippet.
    * If multiple tagged therapeutics are detected then we create separate rows in the output csv for each of these and extract the remaining fields for them separately.
    * To address the question on handling the hypercoagulable state, we use a set of modified questions such as "What were the therapeutics used for treatment of Y ?", where "Y" is the tagged HYPERCOAGULATION term present in the article abstract.
* **"Sample size" field**:
    * We use the pre-trained BioBERT Q&A model with multiple related questions, followed by pattern matching, to extract the sample size.
* **"Severity of Symptoms" field**:
    * Here again, we use the pre-trained BioBERT Q&A model with multiple related questions, followed by pattern matching, to populate this field.
* **"General Outcome/Conclusion Excerpt" field**: 
    * We use the article conclusion section in the case of full text articles to populate this field. For articles without full text, conclusions present in the article abstract are extracted using pattern matching. In the absence of a defined conclusion in the article abstract, we populate this field with the text snippet containing the tagged therapeutic term and COVID-19 mention.
* **"Primary Endpoint(s) of Study" field**:
    * Full-text articles are searched for patterns related to primary endpoints (such as "primary composite endpoints"). After the relevant text portions are identified, the BioBERT Q&A model is run on them with related questions to identify high-confidence text snippets, which are then used to populate this field. For articles without full text, the BioBERT Q&A model is run on the whole title/abstract section.
* **"Clinical Improvement (Y/N)" field**:
    * We use a combination of standard questions and a set of adaptive questions together with BioBERT to extract relevant text snippets with high prediction scores. For adaptive questions, we use the tagged therapeutics and frame tailored questions such as "How effective was Y?", where "Y" stands for a tagged therapeutic entity. The extracted text snippet is then fed to the VADER sentiment analyzer to classify it as either Y or N.
* In summary, to populate the above fields, we rely heavily on the BioBERT Q&A model to identify high-confidence text snippets by asking multiple related questions. To improve precision further, we use the tagged entities in the article both to process these snippets and to construct adaptive queries for BioBERT that are tailored to the specific article. For some of the fields, we use additional trained SVM models that take these snippets as input and predict the field values. We believe that combining the BioBERT Q&A framework with tailored questions, tagged annotations and additional SVM classifiers is a promising approach for achieving both good precision and good recall.
* The output of this step is the file novel_th_ab_wbert.tsv (hgs_ab_wbert.tsv for the hypercoagulation question), which contains all the target fields and forms the input to the next step.
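
The question framing and snippet selection described for the "Therapeutic method(s) utilized/assessed" field can be sketched as below. The BioBERT call itself is not shown; the sketch assumes its answers are already available as `(snippet, score)` pairs, and all function names and the row layout are hypothetical.

```python
def build_questions(hypercoag_terms=()):
    """Standard question plus one adapted question per tagged HYPERCOAGULATION term."""
    questions = ["What were the therapeutics used for treatment of covid-19?"]
    questions += [
        f"What were the therapeutics used for treatment of {term}?"
        for term in hypercoag_terms
    ]
    return questions


def best_snippet(predictions):
    """predictions: list of (snippet, score) pairs, one per question asked."""
    return max(predictions, key=lambda p: p[1])


def therapeutic_rows(article_id, snippet, tagged_therapeutics):
    """One output row per tagged therapeutic term that occurs in the snippet."""
    lowered = snippet.lower()
    return [
        {"article": article_id, "therapeutic": term, "snippet": snippet}
        for term in tagged_therapeutics
        if term.lower() in lowered
    ]


# Invented predictions and annotations for a single article.
preds = [
    ("treated with hydroxychloroquine and azithromycin", 0.91),
    ("supportive care", 0.40),
]
snippet, score = best_snippet(preds)
rows = therapeutic_rows(
    "article-0", snippet, ["hydroxychloroquine", "azithromycin", "remdesivir"]
)
```

Here `rows` holds two entries, one per therapeutic found in the selected snippet; the remaining fields would then be extracted separately for each row.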

### 4. **Post processing**: 

* We perform a set of post processing steps to clean the csv file generated by the previous step.
* Add missing journal names, if found in MEDLINE. All journal names available in the corpus were mapped to their PubMed journal name abbreviations. Some matched directly; many others first had to be expanded to their full names using rules derived from the patterns observed.
* Add missing DOI, if found in MEDLINE.
* Add missing/partial publication date, if found in MEDLINE.
* Transform dates to M/D/YYYY format. If the date is given as just "2020", it is left as-is.
* Remove the word(s) "Abstract" or "Background", "Full-length title", "News at a glance", "Letter to the Editor" or "commentary" if the title begins with these in the Study field.
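
Two of the clean-up rules above, date normalization and title-prefix removal, can be sketched as follows. The submitted post-processing is implemented in Java; this Python sketch only illustrates the rules, and the accepted input date formats, the function names and the handling of punctuation after the prefix are assumptions.

```python
import re
from datetime import datetime

# Leading words that should be stripped from the Study field.
TITLE_PREFIX = re.compile(
    r"^\s*(Abstract|Background|Full-length title|News at a glance"
    r"|Letter to the Editor|commentary)\b[\s:.\-]*",
    re.IGNORECASE,
)

# Input formats tried in order (assumed; the real metadata may use others).
DATE_FORMATS = ("%Y-%m-%d", "%Y %b %d", "%d %b %Y")


def normalize_date(raw):
    """Return M/D/YYYY for a full date; return partial dates such as '2020' unchanged."""
    raw = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        return f"{parsed.month}/{parsed.day}/{parsed.year}"
    return raw


def clean_title(title):
    """Drop one leading boilerplate word or phrase from a study title."""
    return TITLE_PREFIX.sub("", title, count=1).strip()
```

For example, `normalize_date("2020-06-09")` gives `6/9/2020`, `normalize_date("2020")` stays `2020`, and `clean_title("Abstract: Remdesivir for COVID-19")` gives `Remdesivir for COVID-19`.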


### **The final output files**: 
* novel_th.csv file for the novel therapeutics question
* hgs.csv for the hypercoagulation question

## Additional Details
* The latest CORD-19 input corpus consisted of 138,794 articles.
* A total of 2553 non-English articles were removed in the preprocessing step.
* A total of 2420 filtered articles resulted after performing step 2 (identifying relevant articles).
* The /kaggle/input/data folder contains:
    * The BioBERT Q&A SQuAD model 
    * Combined corpus of CORD-19 articles and filtered MEDLINE
    * Corpus annotations and Pearson correlation values
    * SVM models for study type classification
    * one-class SVM model for clinical study article classification
* The /kaggle/input/code/python folder contains:
    * CORD-19 JSON parser
    * FastText foreign language classifier.
    * BioBERT Q&A code
    * SVM classifier (training and prediction) for study type classification in the utils sub folder
    * one-class SVM classifier (training and prediction) for clinical study article classification in the utils sub folder
* The /kaggle/input/code/java folder contains:
    * Java source code for the novel therapeutics pipeline, BioBERT input/output processing and the post processing step.

## References
1. Joseph T, Saipradeep VG, Raghavan GS, Srinivasan R, Rao A, Kotte S, Sivadasan N. TPX: Biomedical literature search made easy. Bioinformation 8(12): 578-80 (2012).
2. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4): 1234-1240 (2020).

In [None]:
!ls -lrt /kaggle/input
!java -version
!javac -d /kaggle/working/  /kaggle/input/code/java/CORD19/src/DTBean.java
!javac -d /kaggle/working/  /kaggle/input/code/java/CORD19/src/Segment.java
!javac -d /kaggle/working/  /kaggle/input/code/java/CORD19/src/Text.java
!javac -d /kaggle/working/  /kaggle/input/code/java/CORD19/src/Word2Num.java
!javac -d /kaggle/working/ -cp /kaggle/working/ /kaggle/input/code/java/CORD19/src/SentenceSplitter.java
!javac -d /kaggle/working/ -cp /kaggle/working/ /kaggle/input/code/java/CORD19/src/NovelTherapeuticsPipeline.java
!javac -d /kaggle/working/ -cp /kaggle/working/ /kaggle/input/code/java/CORD19/src/BERTInputPreprocess.java
!javac -d /kaggle/working/ -cp /kaggle/working/ /kaggle/input/code/java/CORD19/src/BERTPostProcessing.java
!javac -d /kaggle/working/ -cp /kaggle/working/:/kaggle/input/code/java/CORD19/lib/* /kaggle/input/code/java/CORD19/src/BERTOutputProcessor.java

!java -cp /kaggle/working/ NovelTherapeuticsPipeline
!java -cp /kaggle/working/ BERTInputPreprocess
!ls -lrt /kaggle/input


### Code for running BioBERT for novel therapeutics

!pip install tensorflow-gpu==1.14.0
!pip install bert-tensorflow

! python /kaggle/input/code/python/run_factoid.py      --do_train=False      --do_predict=True      --vocab_file=/kaggle/input/data/data/BERT-pubmed-1000000-SQuAD/vocab.txt      --bert_config_file=/kaggle/input/data/data/BERT-pubmed-1000000-SQuAD/bert_config.json      --init_checkpoint=/kaggle/input/data/data/BERT-pubmed-1000000-SQuAD/model.ckpt-14599      --max_seq_length=384      --train_batch_size=12      --learning_rate=5e-6      --doc_stride=128      --num_train_epochs=5.0      --do_lower_case=False      --predict_file=/kaggle/working/novel_th_ab_bert.json      --output_dir=/kaggle/working/

In [None]:
!java -cp /kaggle/working/:/kaggle/input/code/java/CORD19/lib/* BERTOutputProcessor
!java -cp /kaggle/working/:/kaggle/input/code/java/CORD19/lib/* BERTPostProcessing

In [None]:
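import re

import nltk
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# VADER needs its lexicon; download it once per session.
nltk.download('vader_lexicon')

# Phrases such as "reduced ... mortality" read as negative to VADER although they
# describe an improvement. This pattern flags them so they can be overridden to 'Y'.
regex = r"\b(low|reduce|stop)(.*)(infection|fatal|mortal|risk|cytokine storm|concentration|death|adverse)+"
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


def checkNegFP(text):
    """Return True if a negatively scored snippet reports a reduction in a bad outcome."""
    return re.search(regex, text) is not None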



In [None]:
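input_file = '/kaggle/working/novel_th_ab_wbert_processed.tsv'
tdf = pd.read_csv(input_file, sep='\t', converters={"Clinical Improvement (Y/N)": str})
# Create the analyzer once rather than once per row.
analyzer = SentimentIntensityAnalyzer()
# Write through .iat: rows yielded by iterrows() can be copies, so edits to them may never reach tdf.
for i in range(len(tdf)):
    tdf.iat[i, 1] = tdf.iat[i, 1].title()  # column 1: title-case the text
    snippet = tdf.iat[i, 9]                # column 9: snippet to be replaced by Y / N / -
    score_dict = analyzer.polarity_scores(snippet)
    if score_dict['neg'] > score_dict['pos']:
        # Negative wording may still describe a benefit (e.g. reduced mortality).
        tdf.iat[i, 9] = 'Y' if checkNegFP(snippet) else 'N'
    elif score_dict['neg'] < score_dict['pos']:
        tdf.iat[i, 9] = 'Y'
    else:
        tdf.iat[i, 9] = '-'
tdf.to_csv("/kaggle/working/novel_th.csv")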
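
The decision rule applied in the loop above can be isolated as a pure function, which makes its three outcomes easier to inspect. This is a sketch only: `decide` is a hypothetical helper name, and the scores passed in the examples are made-up values, not actual VADER output.

```python
import re

# Same override pattern as in the notebook cell that defines checkNegFP.
NEG_FP = re.compile(
    r"\b(low|reduce|stop)(.*)"
    r"(infection|fatal|mortal|risk|cytokine storm|concentration|death|adverse)+"
)


def decide(pos, neg, snippet):
    """Map 'pos'/'neg' sentiment scores and the snippet text to 'Y', 'N' or '-'."""
    if neg > pos:
        # Negative wording that reports a reduced bad outcome counts as improvement.
        return "Y" if NEG_FP.search(snippet) else "N"
    if pos > neg:
        return "Y"
    return "-"  # tie: no decision


examples = [
    decide(0.0, 0.4, "treatment reduced the risk of death"),  # override applies
    decide(0.0, 0.4, "patients deteriorated"),                # plain negative
    decide(0.3, 0.1, "symptoms improved"),                    # plain positive
    decide(0.0, 0.0, "patients were enrolled"),               # tie
]
```

Note that the pattern is case-sensitive as written, so a snippet beginning with "Reduced" would not trigger the override.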

### Final output summary table for novel therapeutics

In [None]:
tdf

In [None]:
!java -cp /kaggle/working/ NovelTherapeuticsPipeline q1
!java -cp /kaggle/working/ BERTInputPreprocess q1

### Code for running BioBERT for hypercoagulable state

! python /kaggle/input/code/python/run_factoid.py      --do_train=False      --do_predict=True      --vocab_file=/kaggle/input/data/data/BERT-pubmed-1000000-SQuAD/vocab.txt      --bert_config_file=/kaggle/input/data/data/BERT-pubmed-1000000-SQuAD/bert_config.json      --init_checkpoint=/kaggle/input/data/data/BERT-pubmed-1000000-SQuAD/model.ckpt-14599      --max_seq_length=384      --train_batch_size=12      --learning_rate=5e-6      --doc_stride=128      --num_train_epochs=5.0      --do_lower_case=False      --predict_file=/kaggle/working/hgs_ab_bert.json      --output_dir=/kaggle/working/

In [None]:
!java -cp /kaggle/working/:/kaggle/input/code/java/CORD19/lib/* BERTOutputProcessor q1
!java -cp /kaggle/working/:/kaggle/input/code/java/CORD19/lib/* BERTPostProcessing q1

In [None]:
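input_file = '/kaggle/working/hgs_ab_wbert_processed.tsv'
df = pd.read_csv(input_file, sep='\t', converters={"Clinical Improvement (Y/N)": str})
# Create the analyzer once rather than once per row.
analyzer = SentimentIntensityAnalyzer()
# Write through .iat: rows yielded by iterrows() can be copies, so edits to them may never reach df.
for i in range(len(df)):
    df.iat[i, 1] = df.iat[i, 1].title()  # column 1: title-case the text
    snippet = df.iat[i, 9]               # column 9: snippet to be replaced by Y / N / -
    score_dict = analyzer.polarity_scores(snippet)
    if score_dict['neg'] > score_dict['pos']:
        # Negative wording may still describe a benefit (e.g. reduced mortality).
        df.iat[i, 9] = 'Y' if checkNegFP(snippet) else 'N'
    elif score_dict['neg'] < score_dict['pos']:
        df.iat[i, 9] = 'Y'
    else:
        df.iat[i, 9] = '-'
df.to_csv("/kaggle/working/hgs.csv")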

### Final output summary table for hypercoagulable state

In [None]:
df