# Highlighting search results for COVID-19


# Introduction:
The abstract is one of the most important sections of any publication. It summarizes the paper's findings and results, and it is the first thing researchers read to decide whether to go through the whole publication or move on to another one. Because the abstract outlines the paper's contents, its intended purpose, and its importance, we exploit it to highlight relevant answers to user queries.
The objective of the solution is to ease the search for relevant topics asked about COVID-19. This is done by highlighting related answers extracted from publication abstracts. The following section describes the solution flow.


# Methodology:
**Data Preparation**

The data set used for searching is provided by the Allen Institute for AI. The Anserini team has already provided an index for the data set that covers the title and abstract.
Reference: https://github.com/castorini/anserini/blob/master/docs/experiments-covid.md

We built a customized stop word list by compiling all the paper abstracts and computing the term frequency. From the 150 most frequent terms, we manually selected the terms that do not qualify as keywords. For example, words like “patients” or “disease” do not add much information, since we know beforehand that the data set covers the medical domain.
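The term-frequency step described above can be sketched as follows. This is only an illustration: the sample abstracts are made up, and the regular-expression tokenizer and the `top_terms` helper are assumptions, not the code that produced the actual stop word list.

```python
from collections import Counter
import re

def top_terms(abstracts, n=150):
    """Return the n most frequent lowercase word tokens across all abstracts."""
    counts = Counter()
    for text in abstracts:
        # Simple regex tokenizer (an assumption; hyphenated words stay together)
        counts.update(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()))
    return counts.most_common(n)

# Toy abstracts, made up for illustration only
sample_abstracts = [
    "Patients with the disease showed mild symptoms.",
    "The disease spread quickly among patients in the study.",
    "A vaccine candidate was tested in patients.",
]

# Print the five most frequent terms with their counts
for term, freq in top_terms(sample_abstracts, n=5):
    print(term, freq)
```

Terms such as “patients” rise to the top of the list, which is where a manual review would pick the candidates for the custom stop word list.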

We built a customized synonyms file for the words that we want to expand. This was done by compiling all the queries and sub-queries published in the Kaggle competition and computing the term frequency. After discarding traditional English stop words such as “and” and “the”, we manually selected the interesting terms and added synonyms for them. For example, words like “animal”, “monkey”, “mice” and “mouse” will probably be used to refer to clinical experiments on animals, so we clustered them together as synonyms to be used in query expansion.
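As a rough sketch of how such clusters can drive query expansion, the example below maps every term to its cluster once and then expands a query term by term. The second cluster, the `SYNONYM_CLUSTERS` and `TERM_TO_CLUSTER` names, and the `expand_terms` helper are hypothetical; the contents of the real synonyms file are not reproduced here.

```python
# Hypothetical synonym clusters; only the first one comes from the text above
SYNONYM_CLUSTERS = [
    ["animal", "monkey", "mice", "mouse"],
    ["vaccine", "vaccination", "immunization"],
]

# Map every term to its cluster once, so each lookup is a dictionary access
TERM_TO_CLUSTER = {term: cluster for cluster in SYNONYM_CLUSTERS for term in cluster}

def expand_terms(query):
    """Replace each query term that has a cluster with the whole cluster, keeping order."""
    expanded = []
    for term in query.lower().split():
        expanded.extend(TERM_TO_CLUSTER.get(term, [term]))
    # dict.fromkeys removes duplicates while preserving first-seen order
    return " ".join(dict.fromkeys(expanded))

print(expand_terms("vaccine trials in mice"))
```

Terms without a cluster pass through unchanged, so the expansion only widens the parts of the query that the synonyms file covers.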


**The solution works as depicted in figure 1:**

Step 1: A user asks a query in natural language, for example: “what are available vaccine for Covid 19”

Step 2: A keyword extraction module processes the query. The module expands the query with synonyms, normalizes the text, and removes stop words.

Step 3: The keywords are then used to search the indexed data set, and we retrieve the top ten hits sorted in descending order of search score.

Step 4: Using the abstract and the user query, we apply a BERT question-answering model to highlight the answer in the abstract.

![Solution Flow Diagram](../input/diagram/flow2.jpg)
![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F687442%2F1be903a421119ac5add0beebff9846c1%2Fflow2.jpg?generation=1587076320936373&alt=media) 
# Discussion 
**Solution Pros:**
•	A simple and straightforward solution with good results.
•	It exploits Anserini, a toolkit built on top of the core Lucene libraries, which makes it easy to retrieve related documents. [1][2]
•	It exploits BERT, a pre-trained model, for the question answering task [3]. BERT helped boost our results: it focuses the search outcome by highlighting the answer to the user query within the abstracts of the retrieved papers, making it simple and clear for the user to reach the desired information.

**Solution Cons:**
•	BERT is unable to extract the answer from publication abstracts that are longer than 512 tokens. When this occurs, we highlight the extracted keywords in the abstract instead.
•	We need to explore other, domain-specific models such as SciBERT, BioBERT, and Bio_ClinicalBERT.
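
One common way around the 512-token limit is to split a long abstract into overlapping windows and run the question-answering model on each window. The sketch below shows only the windowing arithmetic on a plain token list; it is not part of this notebook, and the `split_into_windows` name and the `stride` default are assumptions.

```python
def split_into_windows(context_tokens, question_len, max_len=512, stride=128):
    """Split context tokens into overlapping windows that fit the model's input limit.

    Each model input is laid out as [CLS] question [SEP] context [SEP], so three
    positions are reserved for the special tokens.
    """
    window_size = max_len - question_len - 3
    if window_size <= 0:
        raise ValueError("question is too long for the given max_len")
    if not 0 <= stride < window_size:
        raise ValueError("stride must be non-negative and smaller than the window size")
    step = window_size - stride  # consecutive windows share `stride` tokens
    windows = []
    start = 0
    while True:
        windows.append(context_tokens[start:start + window_size])
        if start + window_size >= len(context_tokens):
            break
        start += step
    return windows

# A made-up abstract of 1000 tokens and a question of 12 tokens
tokens = [f"tok{i}" for i in range(1000)]
windows = split_into_windows(tokens, question_len=12)
print(len(windows), [len(w) for w in windows])
```

The overlap keeps an answer that straddles a window boundary intact in at least one window; the answer with the highest score across windows would then be the one to highlight.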


# Acknowledgments:
I would like to thank the Anserini team for providing demo notebooks and indexed data sets from the Allen Institute for AI ([github](https://github.com/castorini/anserini/blob/master/docs/experiments-covid.md)).

I would like to thank Chris McCormick for his BERT demos and articles, and for his notebook Question Answering with a Fine-Tuned BERT ([here](https://colab.research.google.com/drive/1uSlWtJdZmLrI3FCNIlUHFxwAJiSu2J0-#scrollTo=W-1zl5XdYInf)).


# References
In this notebook, we perform data mining on the titles and abstracts of COVID-19 publications.

[1] Yang, Peilin, Hui Fang, and Jimmy Lin. "Anserini: Enabling the use of Lucene for information retrieval research." Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2017.
[2] https://github.com/castorini/anserini
[3] Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).



First, install Python dependencies

In [None]:
import os.path
from pathlib import Path


In [None]:
import subprocess
version = subprocess.check_output(['java', '-version'], stderr=subprocess.STDOUT)
print(version)

In [None]:
%%capture
!pip install pyserini==0.8.1.0
!pip install transformers
!pip install nltk
import json

Perform the imports and downloads for prerequisites

In [None]:
if(not('11.0.2' in str(version))):
    print('jdk upgrade required')
    !curl -O https://download.java.net/java/GA/jdk11/9/GPL/openjdk-11.0.2_linux-x64_bin.tar.gz

    !mv openjdk-11.0.2_linux-x64_bin.tar.gz /usr/lib/jvm/; cd /usr/lib/jvm/; tar -zxvf openjdk-11.0.2_linux-x64_bin.tar.gz
    !update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk-11.0.2/bin/java 1
    !update-alternatives --set java /usr/lib/jvm/jdk-11.0.2/bin/java
else:
    print('jdk level is Ok ')

In [None]:
import json
import os
#os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.system("ls /usr/lib/jvm")
os.environ["JAVA_HOME"] = "/usr/lib/jvm/jdk-11.0.2"
!ls '/usr/lib/jvm'

In [None]:
from IPython.display import display, HTML
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
nltk.download('stopwords')
import pandas as pd
import numpy as np
import string
import torch
import numpy
from tqdm import tqdm
#%tensorflow_version 1.x
!pip install tensorflow==1.15.2
import tensorflow
print(tensorflow.__version__)

Download the pre-built index and the synonyms file. The synonyms file is a preliminary version that was built manually to help expand the search query.

In [None]:
%%capture

!wget https://www.dropbox.com/s/j55t617yhvmegy8/lucene-index-covid-2020-04-10.tar.gz
!tar xvfz lucene-index-covid-2020-04-10.tar.gz
!wget https://www.dropbox.com/s/szakwmvco88hp3m/synonyms.csv?dl=0
!mv synonyms.csv?dl=0 synonyms.csv

Sanity check of index size (should be 1.3G):

In [None]:
!du -h lucene-index-covid-2020-04-10

Load BERT from HuggingFace Transformers

In [None]:
from transformers import *
#let us try different BERT models, so far BERT model had better performance

#dtokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_cased')
#dmodel = AutoModelForQuestionAnswering.from_pretrained('allenai/scibert_scivocab_cased')
#dtokenizer = AutoTokenizer.from_pretrained('monologg/biobert_v1.0_pubmed_pmc', do_lower_case=False)
#dmodel = AutoModelForQuestionAnswering.from_pretrained('monologg/biobert_v1.0_pubmed_pmc')
#dtokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
#dmodel = AutoModelForQuestionAnswering.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

dtokenizer= BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
dmodel=BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
scoredic={}

			

Helper function getsynonym extracts the synonyms of a term from synonyms.csv

In [None]:
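# return the synonyms of a term (an empty list if the term is not in the file)
def getsynonym(term):
  # each row of the CSV file holds one cluster of synonyms
  df = pd.read_csv('../input/synonymscsv/synonyms.csv')

  mylist=[]
  for col in df.columns:
    for index, rows in df.iterrows():
      if rows[col]==term:
        # keep the non-empty cells of the matching row
        mylist = rows.dropna().values.tolist()
        break
  return mylist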

Helper function expandquery is used to expand the query

In [None]:
# return a string composed of the query terms plus their synonyms
def expandquery(query):
  searchquery=""
  querylist =query.split(" ")
  listofwords=[]
  for term in querylist:
    synonymlist = getsynonym(term)
    if synonymlist:
      # the matching synonym row already contains the term itself
      listofwords=listofwords+synonymlist
    else:
      searchquery=searchquery+" "+term
  # remove duplicate synonyms before appending them to the query
  myset = set(listofwords)
  mylist =list(myset)
  searchquery2=" ".join(str(item) for item in mylist)
  searchquery = searchquery+" "+searchquery2

  return searchquery

Helper function normalize_caseless uses unicodedata to normalize the text.

In [None]:
import unicodedata

def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())

def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)
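
A short usage example may help show what this normalization buys. The two helpers are repeated so the snippet runs on its own; the sample strings are made up for illustration.

```python
import unicodedata

def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())

def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)

# Plain upper/lower-case differences are removed
print(caseless_equal("COVID-19", "covid-19"))
# casefold() is more aggressive than lower(): the German sharp s becomes "ss"
print(caseless_equal("Straße", "STRASSE"))
# NFKD maps compatibility characters, such as full-width letters, to their plain forms
print(caseless_equal("ＣＯＶＩＤ", "covid"))
# NFKD also makes precomposed and decomposed accents compare equal
print(caseless_equal("caf\u00e9", "cafe\u0301"))
```

All four comparisons print `True`, which is why the query and the abstracts are passed through the same function before they are compared.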

The removeCovidStopwords function is used to remove stop words.
In addition to the default English stop words from NLTK, we collected some stop words that are specific to the COVID-19 data set. The customized stop words were selected by computing the term frequency over all paper abstracts and manually selecting some of the words that are repeated in almost all of the papers.


In [None]:
#return a string composed of the query after removing stop words
def removeCovidStopwords(query):
  stop_wordsCovid =set(['what','how',"which","where","virus","viral","viruses","infection","disease","patients","study",",","?"])
  stop_words=set(stopwords.words("english"))
  searchquery=""
  word_tokens = word_tokenize(query)
  # keep only the tokens that appear in neither stop word list
  filtered_sentence = [w for w in word_tokens if w not in stop_words and w not in stop_wordsCovid]
  searchquery=" ".join(str(item) for item in filtered_sentence)
  return searchquery

Function extractquerysearch extracts the keywords that we use to run the search query. The resulting search query is only used for information retrieval, not with the BERT model. In other words, BERT receives the original query as is.

In [None]:
# return keywords to be used with pyserini
def extractquerysearch(query):
  searchquery=""
  searchquery = normalize_caseless(query)
  searchquery = removeCovidStopwords(searchquery)
  searchquery=expandquery(searchquery)

  return searchquery

Clean some extra text in the retrieved paper abstract for a better display of results.

In [None]:
# Clean some extra text in paper abstract for a better presentation of results
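def cleantext(paragraph):
  # a hit may have no abstract; return an empty string so later steps can skip it
  if not paragraph:
    return ""
  # drop a leading "abstract" label (8 characters)
  if paragraph.startswith(('abstract', 'Abstract', 'ABSTRACT')):
    paragraph =paragraph[8:]

  return paragraph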

Using the user's original query, let us extract more keywords.

In [None]:
query='What is known about covid-19 transmission, incubation, and environmental stability?'
searchquery=extractquerysearch(query)
print("keywords extracted are:",searchquery)

Using the extracted keywords (i.e. searchquery), let us use pyserini to search for related publications. We will display the top 10 documents and their scores.

In [None]:
from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('lucene-index-covid-2020-04-10/')
hits = searcher.search(searchquery)

display(HTML('<div style="font-family: Times New Roman; font-size: 20px; padding-bottom:12px"><b>Query</b>: '+query+'</div>'))


# Print the top 10 hits (or fewer, if the search returned fewer)
for i in range(min(10, len(hits))):
  score=hits[i].score
  scoredic.update({hits[i].lucene_document.get("title") :score })
  display(HTML('<div style="font-family: Times New Roman; font-size: 18px; padding-bottom:10px">' + 
               F'{i+1} {hits[i].docid} ({hits[i].score:1.2f}) -- ' +
               F'{hits[i].lucene_document.get("authors")} et al. --' + 
               F'<a href="https://doi.org/{hits[i].lucene_document.get("doi")}">{hits[i].lucene_document.get("doi")}</a>.'+
               '<br>' +'<b> Paper Title: </b> '+
               F'{hits[i].lucene_document.get("title")}. '
               
               + '</div>'))

Visualize the scores of relevance for each paper retrieved.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.rcdefaults()
fig, ax = plt.subplots()

titles = list(scoredic.keys())
y_pos = np.arange(len(titles))
scores = list(scoredic.values())

# One horizontal bar per paper, showing its retrieval score
ax.barh(y_pos, scores, align='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(titles)
ax.invert_yaxis()  
ax.set_xlabel('Scores')
ax.set_title(query)

plt.show()


Using BERT, the answer_question function extracts the answer from the abstract given the query. If no answer is found, we highlight the keywords instead.

In [None]:
def answer_question(question, answer_text,dtokenizer,dmodel):
    
    answer = "No highlight detected"
    if not question or not answer_text:
      print("Empty question or Empty abstract")
      return answer
    # ======== Tokenize ========
    # Apply the tokenizer to the input text, treating them as a text-pair.
    input_ids = dtokenizer.encode(question, answer_text,max_length=512)
    # Report how long the input sequence is.
    #print('Query has {:,} tokens.\n'.format(len(input_ids)))

    # ======== Set Segment IDs ========
    # Search the input_ids for the first instance of the `[SEP]` token.
    sep_index = input_ids.index(dtokenizer.sep_token_id)

    # The number of segment A tokens includes the [SEP] token itself.
    num_seg_a = sep_index + 1
    
    # The remainder are segment B.
    num_seg_b = len(input_ids) - num_seg_a
    
    # Construct the list of 0s and 1s.
    segment_ids = [0]*num_seg_a + [1]*num_seg_b

    # There should be a segment_id for every input token.
    assert len(segment_ids) == len(input_ids)

    # ======== Evaluate ========
    # Run our example question through the model.
    
    # Index the output instead of unpacking it, so that both plain tuples and
    # the output objects returned by newer transformers releases are handled.
    outputs = dmodel(torch.tensor([input_ids]), # The tokens representing our input text.
                     token_type_ids=torch.tensor([segment_ids])) # The segment IDs to differentiate question from answer_text
    start_scores, end_scores = outputs[0], outputs[1]

    # ======== Reconstruct Answer ========
    # Find the tokens with the highest `start` and `end` scores.
    answer_start = int(torch.argmax(start_scores))
    answer_end = int(torch.argmax(end_scores))

    # Get the string versions of the input tokens.
    tokens = dtokenizer.convert_ids_to_tokens(input_ids)
    
    # Start with the first token.
    answer = tokens[answer_start]
    # if BERT points at [CLS] it found no answer, so return and let the caller highlight the keywords instead
    if answer==dtokenizer.cls_token:
      answer = "No highlight detected"
      return answer
    # if the first token is [sep] then skip and move forward  
    if answer==dtokenizer.sep_token:
      answer=""

    # Select the remaining answer tokens and join them with whitespace.
    for i in range(answer_start + 1, answer_end + 1):
        # If it's a subword token, then recombine it with the previous token.
        if tokens[i][0:2] == '##':
            answer += tokens[i][2:]
        
        # Otherwise, add a space then the token.
        else:
            if tokens[i-1]=='(' or tokens[i-1]  == '-':
              answer += tokens[i]
            elif tokens[i] == ')' or tokens[i]  == '-':
              answer += tokens[i]
            else:
              answer += ' ' + tokens[i]

    if answer==dtokenizer.sep_token:
      answer='No highlight detected'
    return answer
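
Because answer_question takes the argmax of the start and end scores independently, the predicted end can fall before the predicted start, in which case only the first token is returned. A possible refinement, sketched here on plain Python lists and not used in this notebook, is to search for the best-scoring pair with the end at or after the start. The `best_span` name and the `max_answer_len` cap are assumptions.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) maximizing start_scores[start] + end_scores[end],
    subject to start <= end < start + max_answer_len. Returns None for empty input."""
    best = None
    best_score = float("-inf")
    for start, s_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best

# Made-up scores: the independent argmax picks start=3 and end=1, an invalid span
start_scores = [0.1, 0.2, 0.3, 5.0, 0.1]
end_scores = [0.1, 4.0, 0.2, 1.0, 2.0]
print(best_span(start_scores, end_scores))
```

With tensors, the scores would first be converted with something like `start_scores[0].tolist()`; the quadratic search is cheap at 512 tokens with a capped answer length.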

Function highlightanswer locates a string (answer) in a paragraph and splits the paragraph around it so that the match can be highlighted.

In [None]:
def highlightanswer(answer,paragraph):
  str_start=""
  str_end=""
  flag='none'
  paragraph=normalize_caseless(paragraph)
  answer=normalize_caseless(answer)
  # find returns -1 when the answer does not occur in the paragraph
  indx = paragraph.find(answer)
  if indx==-1:
    return str_start, answer, str_end,flag
  # split the paragraph into the text before and after the match
  str_start=paragraph[0:indx]
  str_end=paragraph[indx+len(answer):]
  flag='done'
  return str_start, answer, str_end, flag

The highlight_keywords function highlights the keywords found in the abstract.

In [None]:
def highlight_keywords(answer_text):

  abstractwords= word_tokenize(answer_text)
  searchquery_tokenized=word_tokenize(searchquery)
  abstractpara=""

  for wrd in abstractwords:
    if wrd in searchquery_tokenized:
      abstractpara = abstractpara+" "+"<font color='red'>"+wrd+"</font>"
    else:
      abstractpara = abstractpara+" "+wrd
  
  return abstractpara
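
The function above matches tokens exactly and inserts the abstract into HTML as is. As a hedged alternative, not used in this notebook, the sketch below matches keywords case-insensitively with a regular expression and escapes the text so that characters such as `<` in an abstract cannot break the markup. The `highlight_terms` name is hypothetical.

```python
import html
import re

def highlight_terms(text, keywords):
    """Wrap whole-word, case-insensitive keyword matches in a red <font> tag."""
    # Longest keywords first, so "sars-cov-2" wins over a shorter keyword like "sars"
    terms = sorted((re.escape(k) for k in keywords if k), key=len, reverse=True)
    if not terms:
        return html.escape(text)
    pattern = re.compile(r"\b(?:" + "|".join(terms) + r")\b", re.IGNORECASE)
    pieces = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the plain text between matches and the match itself separately
        pieces.append(html.escape(text[last:match.start()]))
        pieces.append("<font color='red'>" + html.escape(match.group(0)) + "</font>")
        last = match.end()
    pieces.append(html.escape(text[last:]))
    return "".join(pieces)

print(highlight_terms("Transmission of SARS-CoV-2 <in vitro>", ["transmission", "sars-cov-2"]))
```

Escaping each piece after matching, rather than escaping the whole text first, keeps a keyword from accidentally matching inside an HTML entity.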

The display_marker_result function loops over the search hits, sorted by document score. It highlights the answer and any keywords that may interest the user.

In [None]:
def display_marker_result():
  display(HTML('<div style="font-family: Times New Roman; font-size: 20px; padding-bottom:12px; background:#e3e3e3"><b>Query</b>: '+query+'</div>'))
  # Print the top 10 hits (or fewer, if the search returned fewer)
  for i in range(min(10, len(hits))):
    abstract=cleantext(hits[i].lucene_document.get("abstract"))
    answer =answer_question(query,abstract,dtokenizer,dmodel)
    strstart, highlighted, strend, myflag= highlightanswer(answer,abstract)
    if answer=='No highlight detected':
      display(HTML('<div style="font-family: Times New Roman; font-size: 18px; padding-bottom:10px">' + '<b>'+
               F'{i+1}'+') Score: </b>'+ F'{hits[i].score:1.2f}' +'-- <b>Authors: </b>'+
               F'{hits[i].lucene_document.get("authors")} et al. ' +'-- <b>DOI: </b>'+
               F'<a href="https://doi.org/{hits[i].lucene_document.get("doi")}">{hits[i].lucene_document.get("doi")}</a>.'+
               '<br> <b>Paper Title: </b>'+ F'{hits[i].lucene_document.get("title")}. ' +
               '<br> <b>Abstract: </b><br>'+
               F'{highlight_keywords(abstract)}'
               +'<font color="red">'+
               '<br><br><b>High Lights: </b> highlighting detected keywords </font><br>'+
                '</div> --------------------------------------------------------------------------------------------------------------------------------------' ))
    elif myflag=='none':
      display(HTML('<div style="font-family: Times New Roman; font-size: 18px; padding-bottom:10px">' + '<b>'+
               F'{i+1}'+') Score: </b>'+ F'{hits[i].score:1.2f}' +'-- <b>Authors: </b>'+
               F'{hits[i].lucene_document.get("authors")} et al. ' +'-- <b>DOI: </b>'+
               F'<a href="https://doi.org/{hits[i].lucene_document.get("doi")}">{hits[i].lucene_document.get("doi")}</a>.'+
               '<br> <b>Paper Title: </b>'+ F'{hits[i].lucene_document.get("title")}. ' +
               '<br> <b>Abstract: </b><br>'+
               F'{abstract}'
               +'<font color="red">'+
               '<br><br><b>High Lights: </b>'+F'{highlighted} '+'</font><br>'+
                '</div> --------------------------------------------------------------------------------------------------------------------------------------' ))
    else:
        display(HTML('<div style="font-family: Times New Roman; font-size: 18px; padding-bottom:10px">' + '<b>'+
               F'{i+1}'+') Score: </b>'+ F'{hits[i].score:1.2f}' +'-- <b>Authors: </b>'+
               F'{hits[i].lucene_document.get("authors")} et al. ' +'-- <b>DOI: </b>'+
               F'<a href="https://doi.org/{hits[i].lucene_document.get("doi")}">{hits[i].lucene_document.get("doi")}</a>.'+
               '<br> <b>Paper Title: </b>'+ F'{hits[i].lucene_document.get("title")}. ' +
               '<br> <b>Abstract: </b><br>'+
               F'{strstart} ' +'<font color="red">'+F'{highlighted} '+'</font>'+F'{strend}'
              
               +'<font color="red">'+
               '<br><br><b>High Lights: </b>'+F'{highlighted} '+'</font><br>'+
               '</div> ---------------------------------------------------------------------------------------------------------------------------------------' ))


Let us display the result.

In [None]:
display_marker_result()

Let us perform a new search now and see the results

In [None]:
query ='what are the effectiveness of drugs being developed and tried to treat COVID-19 patients?'
searchquery=extractquerysearch(query)
print("keywords extracted are: ",searchquery)
hits = searcher.search(searchquery)
display_marker_result()

Again, let us perform a new search and see the results

In [None]:
query="What do we know about COVID-19 risk factors?"
searchquery=extractquerysearch(query)
print("keywords extracted are: ",searchquery)
hits = searcher.search(searchquery)
display_marker_result()