# Some results from fine-tuned BioBERT used for Question Answering

_Mark Griffiths and Mehmed Sariyildiz, AbbVie RAIDERS_

Here we report some results from using a BioBERT-based question answering system to extract:

    1) DRUG TARGETS for the SARS-CoV-2 virus
    2) DRUGS used to treat COVID-19 disease
    3) RISK FACTORS for COVID-19 disease (outcomes)

We developed an app, which we call ASK COVID ES (Ensemble Sources), that offers a simple interface for posing questions to the 
fine-tuned BioBERT model, which then runs inference against COVID-19-relevant texts.

The app is based on the idea of using an ensemble of sources for the use case: the question posed is the 
starting point for creating a corpus on the fly from multiple sources, and the model is then presented 
with the best results. 
The model is presented with the top 500 pertinent articles from the COVID-19 corpus, plus the top 100 PubMed 
articles and the top 50 results from Google. You can therefore ask it essentially any question, with the caveat 
of a potential IP leak (since Google and PubMed are also being interrogated).

The main idea is that if you pose a question that can't be addressed well by the COVID-19 corpus,
the other sources (Google and PubMed) can provide the missing context.
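
The ensemble step described above can be sketched roughly as follows. This is only an illustration of the idea, not the app's actual code: `build_corpus`, `fake_search`, the source names, and the de-duplication step are all assumptions introduced here, and the stand-in search function simply fabricates placeholder passages so that the sketch runs offline. Only the per-source limits (500, 100, 50) come from the description above.

```python
from typing import Callable, Dict, List, Tuple

# A search function takes (question, limit) and returns ranked text passages.
SearchFn = Callable[[str, int], List[str]]


def build_corpus(question: str,
                 sources: Dict[str, Tuple[SearchFn, int]]) -> List[Tuple[str, str]]:
    """Collect the top hits from every source into one de-duplicated corpus."""
    corpus: List[Tuple[str, str]] = []
    seen = set()
    for name, (search, limit) in sources.items():
        # Trim defensively in case a source returns more than requested
        for text in search(question, limit)[:limit]:
            key = text.strip().lower()
            if key and key not in seen:  # skip empty and repeated passages
                seen.add(key)
                corpus.append((name, text))
    return corpus


def fake_search(prefix: str) -> SearchFn:
    """Hypothetical stand-in for a real index or API lookup."""
    return lambda question, limit: [
        f"{prefix} passage {i} about {question}" for i in range(limit)
    ]


# Limits taken from the text: 500 corpus articles, 100 PubMed, 50 Google
sources = {
    "covid19_corpus": (fake_search("corpus"), 500),
    "pubmed": (fake_search("pubmed"), 100),
    "google": (fake_search("web"), 50),
}

corpus = build_corpus("what are drug targets for coronavirus?", sources)
print(len(corpus), "passages would be passed to the QA model")
```

Keeping the source name next to each passage makes it possible to see later whether an answer came from the COVID-19 corpus or from one of the external sources.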

Response time is at most 1-2 minutes per question posed.





<img src="https://i.imgur.com/2nGiwkq.png" width="800">


## Question posed to model: "what are drug targets for coronavirus?"


The model saved its results to the file 388472.txt.

In [None]:
import pandas as pd
from IPython.display import display, HTML

input_path = '../input/covid-drug-target-qa-inputs/COVID Drug Target QA Inputs/'

# Tab-separated model output without a header row
covid = pd.read_csv(input_path + '388472.txt', encoding='utf-8', sep='\t', header=None)

# Column 4 holds the model score; make sure it is numeric before filtering
covid[4] = pd.to_numeric(covid[4])

##############################
# pick a Score Threshold
##############################

covid2 = covid[covid[4] > 0.76]

# Give the positional columns readable names
targets = covid2.rename(columns={0: "link", 1: "title", 2: "Abstract", 3: "Target", 4: "Score"})

file_path = './first_pass_drug_targets.xlsx'

# `encoding` is omitted: to_excel does not accept it in pandas 2.0 and later
targets.to_excel(file_path, index=False, float_format='%g')

display(HTML(targets.to_html(render_links=True)))
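
The score threshold in the cell above is picked by hand. One way to inform that choice is to count how many rows survive at a few candidate cut-offs before settling on one. The sketch below is an assumption about how that could be done, not part of the original workflow: `rows_kept` is a hypothetical helper and the scores are made-up values for illustration only.

```python
import pandas as pd


def rows_kept(scores: pd.Series, thresholds) -> dict:
    """Return, for each threshold, the number of scores strictly above it."""
    return {t: int((scores > t).sum()) for t in thresholds}


# Made-up scores purely for illustration; in the notebook this would be covid[4]
demo_scores = pd.Series([0.95, 0.81, 0.77, 0.76, 0.40, 0.05])

summary = rows_kept(demo_scores, [0.1, 0.76, 0.8])
for threshold, count in summary.items():
    print(f"score > {threshold}: {count} rows")
```

The comparison is strict (`>`), matching the filter used in the notebook, so a row scoring exactly at the threshold is dropped.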

## Question posed to model: "what drugs are used to treat COVID-19?"


The model saved its results to the file 669533.txt.

In [None]:
# Tab-separated model output without a header row
covid = pd.read_csv(input_path + '669533.txt', encoding='utf-8', sep='\t', header=None)

# Column 4 holds the model score; make sure it is numeric before filtering
covid[4] = pd.to_numeric(covid[4])

##############################
# pick a Score Threshold
##############################

covid2 = covid[covid[4] > 0.8]

# Give the positional columns readable names
drugs = covid2.rename(columns={0: "link", 1: "title", 2: "Abstract", 3: "Drug", 4: "Score"})

file_path = './first_pass_drugs_used_to_treat_COVID19.xlsx'

# `encoding` is omitted: to_excel does not accept it in pandas 2.0 and later
drugs.to_excel(file_path, index=False, float_format='%g')

display(HTML(drugs.to_html(render_links=True)))

## Question posed to model: "what are risks for COVID-19?"

The model saved its results to the file 645033.txt.

Risk is a very broad concept.

The results improve if they are run through an NER tool such as TERMite for aggregation and counts.
Naturally, some information is lost with that approach, but you gain aggregation of, for example, indications.
The data can be browsed productively with a variety of tools, both freeware and commercial; we found Spotfire, for example, very helpful.

It proved possible to convert the strongest signals into a graph of interactions with COVID-19 in, for example, neo4j.
A simple sort, either by a keyword such as "age" or alphabetically, reveals clusters of similar concepts.
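
The sorting and counting mentioned above can be approximated with plain pandas string operations, as in the sketch below. It is not a substitute for an NER tool such as TERMite, and the rows in `demo` are made up for illustration; only the column names mirror the ones used in this notebook. The `normalised` column is an assumption added here so that differently cased answers group together.

```python
import pandas as pd

# Made-up rows for illustration only; real rows come from the model output file
demo = pd.DataFrame({
    "Risk Factor(s)": ["older age", "Diabetes", "age over 65", "hypertension", "diabetes"],
    "Score": [0.61, 0.48, 0.55, 0.42, 0.30],
})

# Normalise whitespace and case so "Diabetes" and "diabetes" aggregate together
demo["normalised"] = demo["Risk Factor(s)"].str.strip().str.lower()

# Alphabetical sort places similar phrases next to each other
alphabetical = demo.sort_values("normalised")

# Keyword filter: every answer mentioning "age" as a whole word
age_related = demo[demo["normalised"].str.contains(r"\bage\b", regex=True)]

# Simple counts per normalised answer
counts = demo["normalised"].value_counts()

print(alphabetical[["normalised", "Score"]])
print(age_related["Risk Factor(s)"].tolist())
print(counts)
```

The word-boundary pattern keeps the keyword filter from matching longer words that merely contain "age", such as "dosage".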

In [None]:
# Tab-separated model output without a header row
covid = pd.read_csv(input_path + '645033.txt', encoding='utf-8', sep='\t', header=None)

# Column 4 holds the model score; make sure it is numeric before filtering
covid[4] = pd.to_numeric(covid[4])

##############################
# pick a Score Threshold
##############################

covid2 = covid[covid[4] > 0.1]

# Give the positional columns readable names
risks = covid2.rename(columns={0: "link", 1: "title", 2: "Abstract", 3: "Risk Factor(s)", 4: "Score"})

file_path = './first_pass_risks_factors_for_COVID19.xlsx'

# `encoding` is omitted: to_excel does not accept it in pandas 2.0 and later
risks.to_excel(file_path, index=False, float_format='%g')

display(HTML(risks.to_html(render_links=True)))

## Question posed to model: "what drugs are used to treat COVID-19?"

It proved possible to convert the strongest signals into a graph of interactions with COVID-19 in, for example, neo4j.
The figure below shows the drugs used to treat COVID-19, as extracted from the results stream with the commercial product TERMite.

<img src="https://i.imgur.com/QRFOzDB.png" width="800">
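
How such a graph might be loaded is sketched below by generating Cypher `MERGE` statements from (drug, score) pairs. This is an assumption about one possible approach, not the procedure used to produce the figure: the node labels `Drug` and `Disease`, the relationship type `MENTIONED_AS_TREATMENT`, the helper `cypher_for_drug`, and the placeholder drug name are all hypothetical.

```python
def cypher_for_drug(drug: str, score: float) -> str:
    """Build one Cypher statement linking a drug node to a COVID-19 node."""
    # Escape backslashes and single quotes so the name is a valid Cypher string literal
    safe = drug.replace("\\", "\\\\").replace("'", "\\'")
    parts = [
        f"MERGE (d:Drug {{name: '{safe}'}})",
        "MERGE (c:Disease {name: 'COVID-19'})",
        "MERGE (d)-[r:MENTIONED_AS_TREATMENT]->(c)",
        f"SET r.score = {score:g}",
    ]
    return " ".join(parts)


# Placeholder name and score for illustration; real values would come from
# the "Drug" and "Score" columns of the filtered results
statement = cypher_for_drug("example drug", 0.9)
print(statement)
```

`MERGE` is used instead of `CREATE` so that running the statements repeatedly does not duplicate nodes or relationships. When sending statements through a database driver, passing the name and score as query parameters is generally preferable to building string literals.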