# *Questions on risks of COVID-19 and answers from a research text dataset*

**We will use a pretrained sentence encoder to embed text from the COVID-19 research dataset and use those embeddings to answer queries. The approach is to encode text from the dataset into a list of embeddings, encode the question the same way, and return the top matching sentences as answers.**
![COVID-19](https://upload.wikimedia.org/wikipedia/commons/0/09/Covid-19-4855688_640.png)

Based on Google's Universal Sentence Encoder: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46808.pdf

Dataset available at: https://pages.semanticscholar.org/coronavirus-research

By Dattaraj J Rao (Persistent Systems) - https://www.linkedin.com/in/dattarajrao

(Image courtesy: WikiMedia - Čeština: Grafický odkaz na stránku koronavirus.mzcr.cz)

## We will use the Sentence Transformers library
The approach is to encode text from the COVID-19 dataset (here, the abstract paragraphs of each paper) into a list of embeddings, then compare the question's embedding against that list to find the top matching sentences as answers.
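The matching step can be illustrated without any model. The sketch below is only an illustration: the three-dimensional vectors and the text labels are made up for this example (a real encoder produces vectors with hundreds of dimensions), and `cosine_similarity` is a hypothetical helper, not part of the Sentence Transformers library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" standing in for encoder output.
toy_corpus = {
    "abstract about smoking and lung disease": [0.9, 0.1, 0.0],
    "abstract about mask policies": [0.1, 0.9, 0.1],
    "abstract about patient age": [0.6, 0.2, 0.7],
}
# Made-up embedding of a question about smoking.
toy_query = [0.8, 0.2, 0.1]

# Rank corpus entries from most to least similar to the query.
ranked = sorted(
    toy_corpus,
    key=lambda text: cosine_similarity(toy_query, toy_corpus[text]),
    reverse=True,
)
best_match = ranked[0]
```

The entry whose vector points in nearly the same direction as the query vector ranks first, which is the same idea the notebook applies to real embeddings.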

In [None]:
!pip install -U sentence-transformers > silent.txt

In [None]:
import os
import json
import warnings
warnings.simplefilter('ignore')

JSON_PATH = '/kaggle/input/CORD-19-research-challenge/comm_use_subset/comm_use_subset/'
NLP_MODEL = 'bert-base-nli-mean-tokens'

#json_files = [pos_json for pos_json in os.listdir(JSON_PATH) if pos_json.endswith('.json')]
# take all json files available

json_files = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        if filename.endswith('.json'):
            json_files.append(os.path.join(dirname, filename))

corpus = []

# loop through the files
for jfile in json_files:
    # jfile is already a full path from os.walk, so open it directly
    with open(jfile) as json_file:
        covid_json = json.load(json_file)
        # read abstract paragraphs; files without an 'abstract' key are skipped
        for item in covid_json.get('abstract', []):
            corpus.append(item['text'])
        # read body text
        #for item in covid_json['body_text']:
        #    corpus.append(item['text'])
            
print("Corpus size = %d"%(len(corpus)))

from sentence_transformers import SentenceTransformer
import scipy.spatial.distance  # import the submodule explicitly; a bare "import scipy" may not load it

embedder = SentenceTransformer(NLP_MODEL)
corpus_embeddings = embedder.encode(corpus)

## Let's define the ask_question method
It takes the question string as input and displays a list of the top-matching answers.
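For cases where the matches should be returned rather than displayed, the selection step could be sketched as below. This is an assumption-laden variant, not the notebook's method: `top_matches` is a hypothetical helper, it expects cosine distances that were already computed, and it uses `heapq.nsmallest` to pick the closest entries instead of sorting the full list. The score is `1 - distance`, the same conversion used for the displayed score.

```python
import heapq

def top_matches(distances, texts, n=5):
    """Return (text, score) pairs for the n smallest cosine distances.

    distances[i] is the cosine distance between the query and texts[i];
    score = 1 - distance, so higher scores mean closer matches.
    """
    # Select the indices of the n smallest distances, closest first.
    closest = heapq.nsmallest(n, range(len(distances)), key=lambda i: distances[i])
    return [(texts[i], 1.0 - distances[i]) for i in closest]

# Made-up distances and placeholder texts, for illustration only.
example = top_matches([0.4, 0.1, 0.9, 0.25], ["a", "b", "c", "d"], n=2)
```

Returning pairs keeps the ranking logic separate from the notebook display code, which makes it easier to check.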

In [None]:
from IPython.display import display, Markdown

# takes a text query and displays the top 5 matching answers
def ask_question(query):
    queries = [query]
    query_embeddings = embedder.encode(queries)

    # Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
    closest_n = 5
    for query, query_embedding in zip(queries, query_embeddings):
        distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

        results = zip(range(len(distances)), distances)
        results = sorted(results, key=lambda x: x[1])
        display(Markdown('## Question -> %s'%query))
        display(Markdown('**Top 5 answers compiled below by running AI algorithm on research text.**<hr>')) 
        
        # get the closest answers
        for idx, distance in results[0:closest_n]:
            display(Markdown('- ### ' + corpus[idx].strip() + " (Score: %.4f)" % (1-distance)))
        display(Markdown('<hr>'))

In [None]:
ask_question('Does smoking or pre-existing pulmonary disease increase risk of COVID-19?')

In [None]:
ask_question('Are neonates and pregnant women at greater risk of COVID-19?')

In [None]:
ask_question('Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups')

In [None]:
ask_question('Are there socio-economic and behavioral factors that help understand economic impact of the virus COVID-19 and whether there were differences?')

In [None]:
ask_question('What is the severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups?')

In [None]:
ask_question('Does rise in pollution increase risk of COVID-19?')

In [None]:
ask_question('Are there public health mitigation measures that could be effective for control of COVID-19?')

In [None]:
ask_question('What do we know about COVID-19 risk factors? What have we learned from epidemiological studies?')