# COVID-19 Data Mining Tool


**Overview**

We present a question-answering tool for the growing COVID-19 literature. Researchers around the world can use it to get relevant answers to their questions, along with the supporting papers, from the COVID-19 Open Research Dataset (CORD-19) provided by the Allen Institute for AI.

Our goal is to produce a smart literature review: we use question answering to find the relevant answers in the document set, together with the evidence for each answer. We use the data from the COVID-19 Open Research Dataset Challenge and, as the QnA model, a pretrained BioBERT model fine-tuned on SQuAD 2.0. We believe a good literature review should combine automated machine learning with domain knowledge. Our final approach is expected to add automatic summarisation of the relevant papers and speech recognition for easier use.

**Methodology**

Dataset used: CORD-19 (Kaggle competition data)
* Filter the CORD-19 documents down to those associated with COVID-19 (by keyword) and save their metadata in covid19.csv.
* Build a retrieval system that uses the TF-IDF similarity measure to identify the top N documents in the corpus, based on their similarity score with the question being asked.
* Finally, use the QnA model (BioBERT fine-tuned on the SQuAD 2.0 dataset) to extract the relevant answer from each candidate document and highlight it in that document.
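
The first step above can be sketched as follows. This is only a minimal illustration, not the preprocessing actually used to build covid19.csv: the keyword list, the `title` and `abstract` column names, and the in-memory sample rows are all assumptions made for the example.

```python
import csv
import io

# Hypothetical keyword list; the real list used to build covid19.csv is not shown here.
KEYWORDS = ("covid-19", "covid19", "sars-cov-2", "2019-ncov", "novel coronavirus")

def is_covid_related(row, keywords=KEYWORDS):
    """Return True if any keyword occurs in the title or abstract (case-insensitive)."""
    text = f"{row.get('title', '')} {row.get('abstract', '')}".lower()
    return any(k in text for k in keywords)

def filter_metadata(rows):
    """Keep only the metadata rows that mention a COVID-19 keyword."""
    return [row for row in rows if is_covid_related(row)]

# Tiny stand-in for the CORD-19 metadata file (assumed columns, made-up rows).
sample_csv = io.StringIO(
    "doi,title,abstract\n"
    "10.1/a,Clinical features of COVID-19,We describe a patient cohort.\n"
    "10.1/b,MERS-CoV in camels,A serological survey.\n"
    "10.1/c,Viral entry,SARS-CoV-2 binds the ACE2 receptor.\n"
)
covid_rows = filter_metadata(csv.DictReader(sample_csv))
print([row["doi"] for row in covid_rows])
```

Matching on lowercase substrings keeps the filter simple; a real keyword list would need to cover more spelling variants.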

**Q&A Model Background**

BERT (Devlin et al. 2018) is a language representation model trained with a masked language modeling objective on a bidirectional Transformer (Vaswani et al. 2017). After pretraining on general natural language corpora, BERT can be easily fine-tuned on many downstream NLP tasks, achieving state-of-the-art performance using relatively small amounts of data. Recent work has shown significant improvements on various tasks using similar BERT or Transformer models, and BERT variants have been adapted to specific domains by pretraining on a domain-specific corpus - e.g. BioBERT (Lee et al. 2019) further pretrained BERT on a biomedical literature corpus from PubMed.
A common NLP benchmark task is extractive question answering: given a question and a related passage, the model reads out the span of text in the passage that best answers the question. The standard benchmark dataset for this task is SQuAD. Notably, in the latest version (SQuAD 2.0, Rajpurkar et al. 2018), approximately ⅓ of the questions in the dataset are intentionally unanswerable, so that top models must learn to abstain from answering when given insufficient evidence. (We found that fine-tuning on this dataset improved the quality of our answers compared to SQuAD 1.1.)
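
To make the "abstain when the evidence is insufficient" idea concrete, here is a minimal sketch of span selection with a no-answer option. It follows the common SQuAD 2.0 convention of treating the score at the `[CLS]` position as the "no answer" score; the function name, the `null_threshold` parameter and the toy scores are assumptions for illustration, and the notebook's own code instead takes the independent argmax of the start and end scores and discards answers that contain `[CLS]`.

```python
def best_span(start_logits, end_logits, max_answer_len=30, null_threshold=0.0):
    """Pick the highest-scoring answer span, or None when 'no answer' wins.

    Position 0 is assumed to be the [CLS] token, whose start + end score
    represents the 'no answer' option.
    """
    null_score = start_logits[0] + end_logits[0]
    best = None  # (score, start, end)
    for i in range(1, len(start_logits)):
        # Only consider spans that end at or after their start and are not too long.
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = start_logits[i] + end_logits[j]
            if best is None or score > best[0]:
                best = (score, i, j)
    if best is None or null_score - best[0] > null_threshold:
        return None
    return best[1], best[2]

# Toy tokens and scores, made up for the example.
tokens = ["[CLS]", "the", "answer", "span", "here", "[SEP]"]
answerable = best_span([0.1, 4.0, 0.2, 0.3, 0.1, 0.0], [0.2, 0.5, 0.1, 3.5, 0.4, 0.0])
unanswerable = best_span([5.0, 0.3, 0.2, 0.1, 0.1, 0.0], [5.0, 0.2, 0.1, 0.3, 0.2, 0.0])
print(tokens[answerable[0]:answerable[1] + 1])
print(unanswerable)
```

Raising `null_threshold` makes the sketch answer more often; lowering it makes it abstain more readily.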

**Implementation**

Given the biomedical content of the CORD-19 corpus, and the sparse, uncertain nature of question answering with this dataset (i.e. most documents do not contain good answers to most questions), our Q&A system is thus powered by a pretrained BioBERT model fine-tuned for extractive question answering using SQuAD 2.0.

* First, extract the metadata of all articles in the CORD-19 open research dataset that match COVID-19 search keywords (Covid19, SARS COV2 etc.) into a csv file.
* Compute the TF-IDF similarity score of the question with each document's title, abstract and text, and select the top N candidate articles based on that score.
* Use the fine-tuned model (BioBERT with transfer learning on SQuAD 2.0) to extract the answer to the question from these top N relevant articles.
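
The retrieval step can be illustrated without any dependencies. The notebook itself uses scikit-learn's `TfidfVectorizer` and `cosine_similarity`; the sketch below is a simplified stand-in that uses a smoothed IDF and L2-normalised vectors, with hypothetical function names and made-up document titles.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_idf(docs_tokens):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs_tokens)
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(tokens, idf):
    """L2-normalised TF-IDF vector as a dict; terms unseen in the corpus are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def top_n_documents(question, docs, top_n=2):
    """Rank docs by cosine similarity to the question; return (index, score) pairs."""
    docs_tokens = [tokenize(d) for d in docs]
    idf = build_idf(docs_tokens)
    q_vec = tfidf_vector(tokenize(question), idf)
    scores = []
    for i, tokens in enumerate(docs_tokens):
        d_vec = tfidf_vector(tokens, idf)
        # Both vectors are unit length, so the dot product is the cosine similarity.
        scores.append((i, sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())))
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_n]

# Made-up titles standing in for the corpus.
docs = [
    "Cytokine levels in hospitalised patients",
    "Camel vaccination trial design",
    "Antiviral drugs tested against coronavirus",
]
ranked = top_n_documents("Which antiviral drugs are effective?", docs, top_n=2)
print(ranked)
```

Question words that never occur in the corpus ("which", "are", "effective") simply drop out, so the ranking is driven by the shared terms.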


We use the Huggingface Transformers library to fine-tune our Q&A model. This [link](https://github.com/huggingface/transformers/tree/master/examples/question-answering) shows how to fine-tune a BERT model on the SQuAD dataset. We used the BioBERT model instead of the BERT model, because BioBERT was found to work better than BERT for this biomedical corpus. The model checkpoint is included with the submission and can be loaded below. To reproduce this checkpoint, use the run_squad.py script included in the Huggingface Transformers examples with the following command (takes ~8 hours on a GTX 1080):
```
python run_squad.py \
  --model_type bert \
  --model_name_or_path monologg/biobert_v1.1_pubmed \
  --do_train \
  --do_eval \
  --train_file SQUAD_DIR/train-v2.0.json \
  --predict_file SQUAD_DIR/dev-v2.0.json \
  --per_gpu_train_batch_size 8 \
  --learning_rate 3e-5 \
  --num_train_epochs 4 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/biobert_squad2/ \
  --version_2_with_negative
 ```
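
The `--max_seq_length 384` and `--doc_stride 128` flags control how passages longer than the model input are split into overlapping windows. The sketch below mirrors the windowing logic of the original BERT `run_squad.py` as we understand it; the Huggingface script may differ in detail, and the function name and token counts are assumptions for illustration.

```python
def split_into_windows(num_doc_tokens, num_query_tokens, max_seq_length=384, doc_stride=128):
    """Return (start, length) windows over the document tokens.

    Each window leaves room for the question and three special tokens
    ([CLS] and two [SEP]); consecutive windows start doc_stride tokens apart.
    """
    max_tokens_for_doc = max_seq_length - num_query_tokens - 3
    if max_tokens_for_doc <= 0:
        raise ValueError("question is too long for max_seq_length")
    windows = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        windows.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the document
        start += min(length, doc_stride)
    return windows

# A hypothetical 600-token passage with a 10-token question.
windows = split_into_windows(num_doc_tokens=600, num_query_tokens=10)
print(windows)
```

Because the windows overlap, an answer that straddles one window boundary is usually fully contained in a neighbouring window.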
To improve accuracy and relevance of the results, we pre-filter the corpus using a list of keywords related to COVID-19, as shown in this kernel. At query time, the top candidate documents are automatically retrieved based on TF-IDF cosine similarity with the query, with the option to manually filter using only keywords if desired. By default, we extract answers from the abstract, discussion, and conclusion sections. Answers are ranked based on the model's start score of the answer span. 
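
The ranking described above can be sketched in a few lines. The helper below sorts answers by start score, keeps the best answer per article and drops answers with a negative score, which is roughly what the `format_answers` function in this notebook does with a DataFrame; the function name, the `min_score` parameter and the sample DOIs and scores are made up for the example.

```python
def rank_answers(answers, min_score=0.0):
    """Sort by start score (descending), keep the best answer per DOI,
    and drop answers whose start score is below min_score."""
    ranked = sorted(answers, key=lambda a: a["start_score"], reverse=True)
    seen = set()
    kept = []
    for a in ranked:
        if a["start_score"] < min_score or a["doi"] in seen:
            continue
        seen.add(a["doi"])
        kept.append(a)
    return kept

# Made-up answers; two of them come from the same article.
answers = [
    {"doi": "10.1000/a", "answer": "span one", "start_score": 6.2},
    {"doi": "10.1000/b", "answer": "span two", "start_score": -1.3},
    {"doi": "10.1000/a", "answer": "span three", "start_score": 4.8},
    {"doi": "10.1000/c", "answer": "span four", "start_score": 7.5},
]
top = rank_answers(answers)
print([(a["doi"], a["start_score"]) for a in top])
```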


**Discussion**

A literature review, while time-consuming, is probably the most important step in understanding a particular research topic or question. With this pandemic, the volume of literature being produced is enormous, and digesting all of it is almost impossible.
By using NLP methods, we can automatically filter out much of the noise and precisely locate relevant information. Using a streamlined process, our Q&A model can easily be updated to include new publications as they appear, giving researchers easy access to the most up-to-date results. In addition, the questions we present here were written with the needs of clinical users in mind, and are broad enough to be reused in future disease outbreaks.
In the future, we would like to improve our system by adding summarization capabilities to aggregate results across articles and by extracting additional context for each article, such as its credibility or level of evidence. These are open, challenging problems in NLP. The results presented should be carefully reviewed by experts, and consensus should be reached between health care workers and the public.

**Acknowledgements**
This notebook is inspired by, and reproduces parts of, the work of the Google Health medical records research team (https://github.com/Google-Health/records-research) and of other Kagglers.


**Future Work**

1. Integrate this search tool with speech recognition APIs so that researchers can interact with it more quickly.
1. Design a WhatsApp chatbot that makes the tool accessible to everyone directly from their mobile phones.




In [None]:
import tensorflow as tf
import pandas as pd
import numpy as np
import sys
import time
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
import textwrap
import re
import attr
import abc
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from IPython.display import HTML, display
from os import listdir
from os.path import isfile, join

import warnings  
warnings.filterwarnings('ignore')
MAX_ARTICLES = 1000
base_dir = '/kaggle/input'
data_dir = base_dir + '/covid-19-articles'
data_path = data_dir + '/covid19.csv'
model_path = base_dir + '/biobert-qa/biobert_squad2_cased'
df=pd.read_csv(data_path)
class ResearchQA(object):
    def __init__(self, data_path, model_path):
        print('Loading data from', data_path)
        self.df = pd.read_csv(data_path)
        print('Initializing model from', model_path)
        self.model = TFAutoModelForQuestionAnswering.from_pretrained(model_path, from_pt=True)
        tf.saved_model.save(self.model, '/kaggle/output')
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.retrievers = {}
        self.build_retrievers()
        self.main_question_dict = dict()
        
    
    def build_retrievers(self):
        df = self.df
        abstracts = df[df.abstract.notna()].abstract
        self.retrievers['abstract'] = TFIDFRetrieval(abstracts)
        body_text = df[df.body_text.notna()].body_text
        self.retrievers['body_text'] = TFIDFRetrieval(body_text)

    def retrieve_candidates(self, section_path, question, top_n):
        candidates = self.retrievers[section_path[0]].retrieve(question, top_n)
        return self.df.loc[candidates.index]
          
        
    def get_answers(self, question, section='abstract', keyword=None, max_articles=1000, batch_size=4):
        df = self.df
        answers = []
        section_path = section.split('/')

        if keyword:
            candidates = df[df[section_path[0]].str.contains(keyword, na=False, case=False)]
        else:
            candidates = self.retrieve_candidates(section_path, question, top_n=max_articles) #get top N candidate articles based on similarity score
        if max_articles:
            candidates = candidates.head(max_articles)

        text_list = []
        indices = []
        for idx, row in candidates.iterrows():
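            if section_path[0] == 'body_text' and len(section_path) > 1:
                # 'body_text/<name>' selects a single named section of the body.
                text = self.get_body_section(row.body_text, section_path[1])
            else:
                # A plain column name (e.g. 'abstract' or 'body_text') uses the whole field.
                text = row[section_path[0]]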
            if text and isinstance(text, str):
                text_list.append(text)
                indices.append(idx)

        # Run the QA model over the candidate texts in batches (the last batch may be smaller).
        all_answers = []
        for i in range(0, len(text_list), batch_size):
            batch = text_list[i:i + batch_size]
            all_answers.extend(self.get_answers_from_text_list(question, batch))

        columns = ['doi', 'authors', 'journal', 'publish_time', 'title', 'cohort_size']
        processed_answers = []
        for i, a in enumerate(all_answers):
            if a:
                row = candidates.loc[indices[i]]
                new_row = [a.text, a.start_score, a.end_score, a.input_text]
                new_row.extend(row[columns].values)
                processed_answers.append(new_row)
        answer_df = pd.DataFrame(processed_answers, columns=(['answer', 'start_score',
                                                 'end_score', 'context'] + columns))
        return answer_df.sort_values(['start_score', 'end_score'], ascending=False)

    def get_body_section(self, body_text, section_name):
      sections = body_text.split('<SECTION>\n')
      for section in sections:
        lines = section.split('\n')
        if len(lines) > 1:
          if section_name.lower() in lines[0].lower():
            return section

    def get_answers_from_text_list(self, question, text_list, max_tokens=512):
      tokenizer = self.tokenizer
      model = self.model
      inputs = tokenizer.batch_encode_plus(
          [(question, text) for text in text_list], add_special_tokens=True, return_tensors='tf',
          max_length=max_tokens, truncation_strategy='only_second', pad_to_max_length=True)
      input_ids = inputs['input_ids'].numpy()
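      # Index the outputs rather than unpacking them, so this works whether
      # the model returns a plain tuple or a ModelOutput object.
      outputs = model(inputs)
      answer_start_scores, answer_end_scores = outputs[0], outputs[1]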
      answer_start = tf.argmax(
          answer_start_scores, axis=1
      ).numpy()  # Get the most likely beginning of each answer with the argmax of the score
      answer_end = (
          tf.argmax(answer_end_scores, axis=1) + 1
      ).numpy()  # Get the most likely end of each answer with the argmax of the score

      answers = []
      for i, text in enumerate(text_list):
        input_text = tokenizer.decode(input_ids[i, :], clean_up_tokenization_spaces=True)
        input_text = input_text.split('[SEP] ', 2)[1]
        answer = tokenizer.decode(
            input_ids[i, answer_start[i]:answer_end[i]], clean_up_tokenization_spaces=True)
        score_start = answer_start_scores.numpy()[i][answer_start[i]]
        score_end = answer_end_scores.numpy()[i][answer_end[i]-1]
        if answer and '[CLS]' not in answer:
          answers.append(Answer(answer, score_start, score_end, input_text))
        else:
          answers.append(None)
      return answers
    

class Retrieval(abc.ABC):
  """Base class for retrieval methods."""

  def __init__(self, docs, keys=None):
    """
    Args:
      docs: a pd.Series of strings. The text to retrieve.
      keys: a pd.Series. Keys (e.g. ID, title) associated with each document.
    """
    self._docs = docs.copy()
    if keys is not None:
      self._docs.index = keys
    self._model = None
    self._doc_vecs = None

  def _top_documents(self, q_vec, top_n=10):
    # ravel() keeps a 1-D score array even when there is only one document.
    similarity = cosine_similarity(self._doc_vecs, q_vec).ravel()
    rankings = np.argsort(similarity)[::-1][:top_n]
    # Positional lookup avoids duplicated rows when index labels repeat.
    return self._docs.iloc[rankings]

  @abc.abstractmethod
  def retrieve(self, query, top_n=10):
    pass

class TFIDFRetrieval(Retrieval):
  """Retrieve documents based on cosine similarity of TF-IDF vectors with query."""

  def __init__(self, docs, keys=None):
    """
    Args:
      docs: a list or pd.Series of strings. The text to retrieve.
      keys: a list or pd.Series. Keys (e.g. ID, title) associated with each document.
    """
    super(TFIDFRetrieval, self).__init__(docs, keys)
    self._model = TfidfVectorizer()
    self._doc_vecs = self._model.fit_transform(docs)

  def retrieve(self, query, top_n=10):
    q_vec = self._model.transform([query])
    return self._top_documents(q_vec, top_n)

@attr.s
class Answer(object):
    text = attr.ib()
    start_score = attr.ib()
    end_score = attr.ib()
    input_text = attr.ib()
    
style = '''
<style>
.hilight {
  background-color:#cceeff;
}
a {
  color: #000 !important;
  text-decoration: underline;
}
.question {
  font-size: 20px;
  font-style: italic;
  margin: 10px 0;
}
.info {
  padding: 10px 0;
}
table.dataframe {
  max-height: 450px;
  text-align: left;
}
.meta {
  margin-top: 10px;
}
.journal {
  color: green;
}
.footer {
  position: absolute;
  bottom: 20px;
  left: 20px;
}
</style>
'''

def format_context(row):
  text = row.context
  answer = row.answer
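  highlight_start = text.find(answer)
  if highlight_start < 0:
    # The answer was not found verbatim in the context: show it without surrounding text.
    return f'<span class=hilight>{answer}</span><br><br>score: {row.start_score:.2f}'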

  def find_context_start(text):
    idx = len(text) - 1
    while idx >= 2:
      if text[idx].isupper() and re.match(r'\W ', text[idx - 2:idx]):
        return idx
      idx -= 1
    return 0 
  context_start = find_context_start(text[:highlight_start])
  highlight_end = highlight_start + len(answer)
  context_html = (text[context_start:highlight_start] + '<span class=hilight>' + 
                  text[highlight_start:highlight_end] + '</span>' + 
                  text[highlight_end:highlight_end + 1 + text[highlight_end:].find('. ')])
  context_html += f'<br><br>score: {row.start_score:.2f}'
  return context_html


def format_author(authors):
  if not authors or not isinstance(authors, str):
    return 'Unknown Authors'
  name = authors.split(';')[0]
  name = name.split(',')[0]
  return name + ' et al'

def format_info(row):
  meta = []
  authors = format_author(row.authors) 
  if authors:
    meta.append(authors)
  meta.append(row.publish_time)
  meta = ', '.join(meta)
 
  html = f'''\
  <a class="title" target=_blank href="http://doi.org/{row.doi}">{row.title}</a>\
  <div class="meta">{meta}</div>\
  '''

  journal = row.journal
  if journal and isinstance(journal, str):
    html += f'<div class="journal">{journal}</div>'

  return html

def render_results(main_question, answers):
  id = main_question[:20].replace(' ', '_')
  html = f'<h1 id="{id}" style="font-size:20px;">{main_question}</h1>'
  for q, a in answers.items():
    # TODO: skipping empty answers. Maybe we should show
    # top retrieved docs.
    if a.empty:
      continue
    # clean up question
    if '?' in q:
        q = q.split('?')[0] + '?'
    html += f'<div class=question>{q}</div>' + format_answers(a)
  display(HTML(style + html))

def format_answers(a):
    a = a.sort_values('start_score', ascending=False)
    a.drop_duplicates('doi', inplace=True)
    out = []
    for i, row in a.iterrows():
        if row.start_score < 0:
            continue
        info = format_info(row)
        context = format_context(row)

        cohort = ''
        if not np.isnan(row.cohort_size):
            cohort = int(row.cohort_size)
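        # Summarize the article body. str() on a Series yields a truncated repr, so use the text itself.
        # The 3000-character cap is an assumed limit to keep the input within the summarizer's context window.
        body = df.loc[df['doi'] == row.doi, 'body_text'].dropna()
        summ = ''
        if not body.empty:
            summ = summarizer(body.iloc[0][:3000], max_length=1000, min_length=30)[0]['summary_text']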
        out.append([context, info,summ])
    out = pd.DataFrame(out, columns=['answer', 'article','summ'])
    return out.to_html(escape=False, index=False)

def render_answers(a):
    display(HTML(style + format_answers(a)))

In [None]:
from transformers import pipeline
summarizer = pipeline('summarization')

In [None]:
model1 = TFAutoModelForQuestionAnswering.from_pretrained(model_path, from_pt=True)
tf.saved_model.save(model1, '/kaggle/working/')

In [None]:
qa = ResearchQA(data_path, model_path)

In [None]:
answers = qa.get_answers('What drugs are effective?',max_articles=5)
render_answers(answers)

In [None]:
answers = qa.get_answers('What kind of cytokines play a major role in host response?',max_articles=5)
render_answers(answers)