# Dataset Creation
### This is a pilot model. Due to computational limitations, we trained the model on a subset of the entire dataset. Details of the subset are given below.
### This model is a two step experiment.
Initially, our knowledge graph model was trained on only **12 selected documents**. These documents were chosen based on the BERT results for the first two tasks on the entire dataset. The titles of those documents are:
1. What are the risks of COVID-19 infection in pregnant women?
2. What Should Gastroenterologists and Patients Know About COVID-19?
3. Are high-performing health systems resilient against the COVID-19 epidemic?
4. Dengue Virus Glycosylation: What Do We Know?
5. Can Bats Serve as Reservoirs for Arboviruses?
6. What can we predict about viral evolution and emergence?
7. When are pathogen genome sequences informative of transmission events?
8. Does genetic diversity limit disease spread in natural host populations?
9. What is the time-scale of hantavirus evolution?
10. The Impact of Host-Based Early Warning on Disease Outbreaks.
11. Does atopy affect the course of viral pneumonia?
12. Does reduced MHC diversity decrease viability of vertebrate populations?

As a second step, we trained our model on **1000+ documents from the biorxiv_medrxiv dataset**. We did not use the entire dataset of 45,000 titles in our experiment due to computational limitations.

The list of queries for which we derived results from our model are:
1. What is known about transmission, incubation, and environmental stability?
2. What do we know about COVID-19 risk factors? 
3. What do we know about virus genetics, origin, and evolution?
4. What do we know about vaccines and therapeutics?
5. What has been published about medical care?
6. What do we know about non-pharmaceutical interventions?
7. What do we know about diagnostics and surveillance?
8. What has been published about ethical and social science considerations?
9. What has been published about information sharing and inter-sectoral collaboration?
10. Are there geographic variations in the mortality rate of COVID-19?

**The first two results use the initial approach we took with 12 documents, and the later ones include the results using the latest work with the biorxiv_medrxiv dataset.**

We additionally took two more questions and extracted results for them using BERT only, as explained in Section 2 (BERT Training and Document Extraction).

In the first step, to select the documents, we computed BERT embeddings for all 45k titles. The results below, for a few of the tasks, served only to check the effectiveness of BERT.

![](https://imgur.com/IN1NVZr.png)

# Abstract:

A large amount of data is available, and it is difficult for researchers to go through every paper each time a query comes up. This method aims to make that task easier for researchers and scientists. Each time a query comes up, the model returns a list of related papers together with the paragraphs where relevant answers can be found. The model combines **knowledge graphs with the MeSH ontology and BERT embeddings**. The graph-enabled search lets the user search on very specific medical terminology. It can also be extended to infer how different documents are related to each other (based on author, body, title, etc.).


# High Level Flow Diagram:
![](https://imgur.com/q1KgnNG.png)

# Problem Statement:
1.	Given a ***natural language query***, return relevant documents from the corpus that are **semantically** similar to it.
2.	Given a ***natural language query***, return relevant paragraphs that contain **key phrases** related to the query.


# Methodology:

### 1 a.	Entity extraction from JSON documents:

Entities are extracted from the title, author details, abstract, body, and bibliographical references, for every JSON document. These entities are then used to make triples for knowledge graph creation. 5 kinds of relationships are included in the triples. The entities are mapped to their respective paper ids with these relations:


    * Has_author: The author's first and last names are extracted directly.
    * Has_title: Nouns and proper nouns are extracted from the title using spaCy; entities from the DBpedia and scispaCy models are added as well.
    * Has_abstract: Nouns and proper nouns are extracted from the abstract. spaCy's built-in named-entity recognizer is also applied, and the union of the two sets is used together with the entities from the DBpedia and scispaCy models.
    * Has_body: Key phrases are extracted using the TopicRank algorithm. For texts without enough candidates for key-phrase extraction, entities are extracted the same way as for the abstract, together with the DBpedia and scispaCy models.
    * Has_reference: Entities are extracted the same way as for the abstract, together with the DBpedia and scispaCy models.
    

All extracted entities are further filtered against a corpus of the most common words, which removes frequently occurring words and keeps only rare ones. This makes our model more specific to the medical domain.
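
The filtering step and the triple layout described above can be sketched as follows. This is a minimal illustration, not the notebook's pipeline: `filter_common` and `make_triples` are hypothetical helper names, the small `common_words` set stands in for the `words.txt` list, and the case-insensitive comparison is an assumption.

```python
def filter_common(entities, common_words, min_len=3):
    """Drop frequent words and very short tokens, keeping first-seen order."""
    seen = set()
    kept = []
    for ent in entities:
        key = ent.lower()
        # Skip short tokens, common words, and case-insensitive repeats
        if len(ent) < min_len or key in common_words or key in seen:
            continue
        seen.add(key)
        kept.append(ent)
    return kept


def make_triples(entities, paper_id, relation):
    """Build [entity, paper_id, relation] rows, the layout used for the graph."""
    return [[ent, paper_id, relation] for ent in entities]


common_words = {"study", "results", "time"}  # stand-in for words.txt
title_entities = ["Dengue", "study", "Glycosylation", "of", "dengue"]
triples = make_triples(
    filter_common(title_entities, common_words), "paper-001", "has_title"
)
print(triples)
```

Only the rare, domain-specific terms survive the filter, so each paper contributes fewer but more distinctive edges to the graph.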

![](https://imgur.com/KnkBpxv.png)


### 1 b.	Enhanced Entity Extraction using MeSH Ontology:

For every entity in a triple, its top 5 most similar entities are extracted from the **MeSH ontology**. These entities are then also included in the triples, with the same relationships. This increases the reach and power of our **knowledge graph**, so that it covers not only the terms present in the documents but also related entities that are absent from them yet can be helpful.
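
As a rough sketch of this expansion, the function below appends related labels to each triple while reusing its paper id and relation. The lookup is injected as a parameter so the example runs offline; `expand_with_related`, `toy_lookup`, and the labels in `toy_index` are hypothetical placeholders, not actual MeSH output.

```python
def expand_with_related(triples, lookup, limit=5):
    """Append up to `limit` related labels per entity, keeping paper id and relation."""
    expanded = []
    for entity, paper_id, relation in triples:
        expanded.append([entity, paper_id, relation])
        for label in lookup(entity)[:limit]:
            row = [label, paper_id, relation]
            if row not in expanded:  # avoid duplicate rows
                expanded.append(row)
    return expanded


# Placeholder index standing in for an ontology term search (illustrative labels only)
toy_index = {"culex": ["Culex", "Culicidae"]}


def toy_lookup(entity):
    return toy_index.get(entity.lower(), [])


rows = expand_with_related([["culex", "paper-007", "has_body"]], toy_lookup)
print(rows)
```

In the notebook the lookup role is played by an HTTP request to the MeSH lookup service; passing it in as a function keeps the expansion logic separate from network access and rate limiting.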


![](https://i.imgur.com/kFB6JLu.png)

In [None]:
filelist=[]

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        filelist.append(os.path.join(dirname, filename))

import json
import pke
import numpy as np
import time
import pandas as pd
import spacy
import re
import string
import en_core_web_sm
import requests
import en_ner_bionlp13cg_md
import scispacy
import time

nlpp = en_ner_bionlp13cg_md.load()
max_dbp_size = 70000

def splitting(text):
    words = text.split()
    subs = []
    n = 50
    for i in range(0, len(words), n):
        subs.append(" ".join(words[i:i+n]))
    return subs


def get_spacy_entities(document):
    """
    Extract named entities from a document with the ScispaCy model loaded as `nlpp`.

    Parameters:
         document(str): Document to be processed

    Returns: List of entity strings found in the document
     """

    doc = nlpp(document)
    entity = [X.text for X in doc.ents]
    return entity


class APIError(Exception):
    def __init__(self, status):
        self.status = status
    def __str__(self):
        return "APIError: status={}".format(self.status)


def annotate_text(text):
    # base_url = "http://api.dbpedia-spotlight.org/en/annotate"
    base_url = "Add your DBPedia Server URL"
    params = {"text": """{}""".format(text),
              "confidence": 0.35}
    headers = {'accept': 'application/json'}
    res = requests.get(base_url, params=params, headers=headers)
    if res.status_code != 200:
        raise APIError(res.status_code)
    data = json.loads(res.text)
    return data


def dbp(text):
    data = annotate_text(text)
    entities=[]
    if ("Resources" in data):
        resources = data["Resources"]
        entities = list()
        for each in resources:
            entity = each["@surfaceForm"]
            if entity not in entities:
                entities.append(entity)
    return entities


triples=[]
nlp = en_core_web_sm.load()
# Load the common-word list used to filter out frequent, non-domain terms
with open("words.txt", "r") as file:
    word = [line.rstrip("\n") for line in file]


for item1 in filelist:
        with open(item1,"r") as object:
                data = object.read()
        obj = json.loads(data)
        paperid = obj['paper_id']
        title = obj['metadata']['title']
        authors = obj['metadata']['authors']
        abstract = obj['abstract']
        bib = obj['bib_entries']
        body = obj['body_text']
        triples1=[]
        titledb = dbp(title)
        for item in titledb:
            l=[item, paperid, 'has_title', 'DBPedia']
            triples.append(l)
        titlesp = get_spacy_entities(title)
        for item in titlesp:
            l=[item, paperid, 'has_title', 'SciSpacy']
            triples.append(l)
        title = nlp(title)
        for token in title:
                if((token.pos_ == "PROPN") or (token.pos_ == "NOUN")):
                        if(len(token.text) > 2):
                                triples1.append(token.text)
        triples1 = list(set(triples1) - set(word))
        for item in triples1:
                l=[item, paperid, 'has_title']
                triples.append(l)
        for item in authors:
                name = item['first'] + " " +  item['last']
                l = [name, paperid, 'has_author']
                triples.append(l)
        abss=""
        for item in abstract:
                # a1 holds the text of the current abstract paragraph
                a1 = item['text']
                abss = abss + a1 + " "
                if len(a1) >= max_dbp_size:
                    x_list = splitting(a1)
                else:
                    x_list = [a1]
                for x_small in x_list:
                    abssd = dbp(x_small)
                    for item in abssd:
                        l=[item, paperid, 'has_abstract', 'DBPedia']
                        triples.append(l)
        abspacy = get_spacy_entities(abss)
        for item in abspacy:
            l=[item, paperid, 'has_abstract', 'SciSpacy']
            triples.append(l)
        abstract = re.sub(r" \d+", "", abss)
        abstract = re.sub(r"[^A-Za-z0-9 -]+", "",abstract)
        abstract = nlp(abstract)
        l1=[]
        abs1=abstract.ents
        for token in abstract:
                if((token.pos_ == "PROPN") or (token.pos_ == "NOUN")):
                        l1.append(token.text)
        l2=[]
        for token in abs1:
                l2.append(token.text)
        l3 = list(set(l1) | set(l2))
        l3 = list(set(l3) - set(word))
        for item in l3:
                if(len(item) > 2):
                        l=[item, paperid, 'has_abstract']
                        triples.append(l)
        btext=""
        bod=[]
        for item in body:
                btext =item['text']
                if len(btext) >= max_dbp_size:
                    x_list = splitting(btext)
                else:
                    x_list = [btext]
                for x_small in x_list:
                    btextdb = dbp(x_small)
                    for item in btextdb:
                        l=[item, paperid, 'has_body', 'DBPedia']
                        triples.append(l)
                btextsp = get_spacy_entities(btext)
                for item in btextsp:
                    l=[item, paperid, 'has_body', 'SciSpacy']
                    triples.append(l)
                keyphrases=[]
                try:
                        extractor = pke.unsupervised.TopicRank()
                        extractor.load_document(input=btext, language='en')
                        extractor.candidate_selection()
                        extractor.candidate_weighting()
                        keyphrases = extractor.get_n_best(n=5)
                except ValueError:
                        keyphrases=[]
                        btext = re.sub(r" \d+", "", btext)
                        btext = re.sub(r"[^A-Za-z0-9 -]+", "",btext)
                        btext=nlp(btext)
                        l1=[]
                        b1 = btext.ents
                        for token in btext:
                                if((token.pos_ == "PROPN") or (token.pos_ == "NOUN")):
                                        l1.append(token.text)
                        l2=[]
                        for token in b1:
                                l2.append(token.text)
                        l3 = list(set(l1) | set(l2))
                        l3 = list(set(l3) - set(word))
                        for item2 in l3:
                                x=[item2]
                                keyphrases.append(x)
                for item in keyphrases:
                        print(item[0])
                        bod.append(item[0])
        bod = list(set(bod) - set(word))
        for item in bod:
                l=[item, paperid, 'has_body']
                triples.append(l)
        s=""
        for item in bib.keys():
                sa = bib[item]['title']
                s = s + bib[item]['title'] + " "
                if len(sa) >= max_dbp_size:
                    x_list = splitting(sa)
                else:
                    x_list = [sa]
                for x_small in x_list:
                    sdb = dbp(x_small)
                    for item in sdb:
                        l=[item, paperid, 'has_reference', 'DBPedia']
                        triples.append(l)
        sspacy = get_spacy_entities(s)
        for item in sspacy:
            l=[item, paperid, 'has_reference', 'SciSpacy']
            triples.append(l)
        s = re.sub(r" \d+", "", s)
        s = re.sub("[^A-Za-z0-9 -]+", "",s)
        s = nlp(s)
        ref = s.ents
        l1=[]
        l2=[]
        for token in s:
                if((token.pos_ == "PROPN") or (token.pos_ == "NOUN")):
                        l1.append(token.text)
        for token in ref:
                l2.append(token.text)
        l3 = list(set(l1) | set(l2))
        l3 = list(set(l3) - set(word))
        for item in l3:
                if(len(item) > 2):
                        l=[item, paperid, 'has_reference']
                        triples.append(l)


seen = set()
tripless=[]

count = 0
for item in triples:
        t = tuple(item)
        if t not in seen:
                tripless.append(item)
                seen.add(t)
triples=[]
# Drop common words; build a new list rather than removing items while iterating
tripless = [item for item in tripless if item[0].lower() not in word]
print(len(tripless))
for item in tripless:
        triples.append(item)
        entity = item[0]
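        # Look up at most 5 related MeSH labels; passing params URL-encodes the entity
        req = requests.get('https://id.nlm.nih.gov/mesh/lookup/term',
                           params={"label": entity.strip(), "match": "contains", "limit": 5},
                           headers={"content-type": "application/json"}, verify=False)
        count=count+1
        print(count)
        # Pause periodically to avoid overloading the lookup service
        if(count == 400):
                count = 0
                time.sleep(70)
        # Parse the JSON response instead of calling eval() on remote text
        req1 = req.json()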
        for x in req1:
                rel = x['label']
                l = [rel, item[1], item[2]]
                triples.append(l)

# Keep the first three fields (entity, paper id, relation) so every row has the same width
dy = pd.DataFrame([t[:3] for t in triples])
dy.to_csv("sampleresult1.csv", index=False)


### 2.  BERT Training and Document Extraction:
    1. Extract the titles from all JSON documents.
    2. Compute the embeddings of all titles using BERT.
    3. Compute the embedding of the input query using BERT.
    4. Use cosine similarity to find the title embeddings closest to the query embedding. This generates the list of titles that are similar to the input query.
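
The ranking in step 4 can be illustrated with a small dependency-free sketch. The notebook itself uses `cosine_similarity` from scikit-learn on BERT vectors; the two-dimensional toy vectors and the helper names `cosine` and `rank_titles` below are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_titles(query_vec, title_vecs, titles, top_k=3):
    """Return the top_k (title, score) pairs, most similar first."""
    scored = [(t, cosine(query_vec, v)) for t, v in zip(titles, title_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


titles = ["title A", "title B", "title C"]
title_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embeddings
ranked = rank_titles([1.0, 0.0], title_vecs, titles, top_k=2)
print(ranked)
```

Because cosine similarity ignores vector length, a title is ranked by the direction of its embedding relative to the query rather than by its magnitude.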
    
![](https://i.imgur.com/s2XOxqO.png)

**P.S.:** We also tried **BERT-only** document extraction over all the document titles and extracted results for some other questions. Going forward, we will add the MeSH-enabled knowledge graph for these questions as well.

The BERT-only results are shown below:

![](https://imgur.com/wK4nW8A.png)

In [None]:
import json
import logging
from configparser import ConfigParser
from elasticsearch import Elasticsearch
from bert_serving.client import BertClient
import pandas as pd

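# Load settings; "config.ini" is an assumed file name, adjust it to your setup
config = ConfigParser()
config.read("config.ini")
BERT_SERVICE_HOST = config.get("DOCUMENT_SEARCH", "BERT_SERVICE_HOST")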


bert_client = BertClient(ip=BERT_SERVICE_HOST, output_fmt='list',check_length=False)
df = pd.read_csv("data_covid.csv")
df['title'] = df['title'].astype(str)
# Encode every title with BERT and store the vectors next to the titles
# (DataFrame.set_value was removed in pandas 1.0, so the column is assigned at once)
vectors = []
for _, row in df.iterrows():
    title = row['title']
    print(title)
    vectors.append(bert_client.encode([title])[0])
df['vector'] = vectors
df.to_csv('bert_embeddings.csv', sep='\t')



### 3 and 4.	Result Extraction using BERT and Knowledge Graph:
    1. Entities are extracted from the input query using nouns, proper nouns, and spaCy's built-in entity recognizer.
    2. These entities are fuzzy-matched against the triples (used to create the knowledge graph in step 1) to get the list of documents with their hit counts.
    3. The top 3 document titles with the most hits are taken from the **knowledge graph** (we took only the top 3 because the dataset was limited to 12 documents). The top 3 document titles from the **BERT** step with a cosine similarity score of at least 0.9 for that query are taken as well. Together these give a list of selected documents related to the input query based on both the knowledge graph and BERT.
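
The combination in step 3 can be sketched as below. This is a simplified illustration under stated assumptions: `merge_results` is a hypothetical helper, and the document ids, hit counts, and scores are toy values, not results from the dataset.

```python
def merge_results(kg_hits, bert_scores, top_k=3, min_score=0.9):
    """Combine the top knowledge-graph documents with the top BERT documents."""
    # kg_hits: {doc_id: hit_count}; bert_scores: {doc_id: cosine_similarity}
    kg_top = sorted(kg_hits, key=kg_hits.get, reverse=True)[:top_k]
    merged = [(doc_id, "knowledge graph") for doc_id in kg_top]
    bert_top = sorted(bert_scores, key=bert_scores.get, reverse=True)[:top_k]
    for doc_id in bert_top:
        # Keep BERT documents above the threshold that the graph did not already return
        if bert_scores[doc_id] >= min_score and doc_id not in kg_top:
            merged.append((doc_id, "bert"))
    return merged


kg_hits = {"doc1": 9, "doc2": 7, "doc3": 5, "doc4": 1}
bert_scores = {"doc2": 0.95, "doc5": 0.93, "doc6": 0.91, "doc7": 0.80}
merged = merge_results(kg_hits, bert_scores)
print(merged)
```

A document found by both methods is listed once, under the knowledge graph, which mirrors the duplicate check in the notebook cell further down.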

![](https://i.imgur.com/ydeYwlp.png)

In [None]:
#Import packages
import json
import logging
from configparser import ConfigParser
from elasticsearch import Elasticsearch
from bert_serving.client import BertClient
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from ast import literal_eval

# Load settings; "config.ini" is an assumed file name, adjust it to your setup
config = ConfigParser()
config.read("config.ini")
#SEARCH_SIZE = config.getint("DOCUMENT_SEARCH", "SEARCH_SIZE")
BERT_SERVICE_HOST = config.get("DOCUMENT_SEARCH", "BERT_SERVICE_HOST")

#Appending the embedding with its document name
da = pd.read_csv("bert_embeddings.csv",sep='\t')
l1=[]
l2=[]
name=[]
document_number =[]
#print(da)
da.vector = da.vector.apply(literal_eval)
for index, row in da.iterrows():
        name.append(row['title'])
        document_number.append(row['Document #'])
        # Convert the stored vector to a list of floats (avoids shadowing the built-in `list`)
        l1.append([float(i) for i in row['vector']])


bert_client = BertClient(ip=BERT_SERVICE_HOST, output_fmt='list')
#Generating embeddings for the input query
xx="What is known about transmission, incubation, and environmental stability?"
check = bert_client.encode([xx])[0]
check = np.array(check)
l1 = np.array(l1)
check = check.reshape(-1,1024)

#Applying cosine similarity on the embeddings for input query and documents to get the relevant documents in order
value = cosine_similarity(l1, check)
listy=[]
leng = len(name)
for i in range(0, leng):
        listx=[document_number[i],name[i], value[i][0]]
        listy.append(listx)

p = pd.DataFrame(listy)
p.columns = ['document_number','name','score']
dy = p.sort_values(by=['score'], ascending=False)
print(dy.head(10))
dy.to_csv("results1.csv", index=False)

In [None]:
import pandas as pd
import numpy as np
import spacy
import en_core_web_sm
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pke
import en_ner_bionlp13cg_md
import scispacy
import requests

def get_spacy_entities(document):
    """
    Extract named entities from a document with the ScispaCy model loaded as `nlpp`.

    Parameters:
         document(str): Document to be processed

    Returns: List of entity strings found in the document
     """

    doc = nlpp(document)
    entity = [X.text for X in doc.ents]
    return entity


class APIError(Exception):
    def __init__(self, status):
        self.status = status
    def __str__(self):
        return "APIError: status={}".format(self.status)


def annotate_text(text):
    # base_url = "http://api.dbpedia-spotlight.org/en/annotate"
    base_url = "Enter the URL of your DBPedia server"
    params = {"text": """{}""".format(text),
              "confidence": 0.35}
    headers = {'accept': 'application/json'}
    res = requests.get(base_url, params=params, headers=headers)
    if res.status_code != 200:
        raise APIError(res.status_code)
    data = json.loads(res.text)
    return data


def dbp(text):
    data = annotate_text(text)
    entities=[]
    if ("Resources" in data):
        resources = data["Resources"]
        entities = list()
        for each in resources:
            entity = each["@surfaceForm"]
            if entity not in entities:
                entities.append(entity)
    return entities


dy  =pd.read_csv("sampleresult1.csv")
dy = dy.iloc[1:]
dy.columns=["Source", "Target", "Relation"]
file = open("words.txt","r")
f = file.readlines()
word = []
for w in f:
        word.append(w.replace("\n", ""))

choices=[]
for index,row in dy.iterrows():
        choices.append(row[1])
choices = list(set(choices))
counter={}
for item in choices:
        counter[item]=0

nlp = en_core_web_sm.load()

query="What do we know about COVID-19 risk factors?"
k=[]
k = k + dbp(query)
k = k + get_spacy_entities(query)
keyphrases=[]
try:
        extractor = pke.unsupervised.TopicRank()
        extractor.load_document(input=query, language='en')
        extractor.candidate_selection()
        extractor.candidate_weighting()
        keyphrases = extractor.get_n_best(n=5)
except ValueError:
        keyphrases=[]
        query = re.sub(r" \d+", "", query)
        query = re.sub(r"[^A-Za-z0-9 -]+", "",query)
        query  =nlp(query)
        l1=[]
        abs1=query.ents
        for token in query:
                if((token.pos_ == "PROPN") or (token.pos_ == "NOUN")):
                        l1.append(token.text)
        l2=[]
        for token in abs1:
                l2.append(token.text)
        l3 = list(set(l1) | set(l2))
        l3 = list(set(l3) - set(word))
        for item2 in l3:
                x=[item2]
                keyphrases.append(x)
l3=[]
for item in keyphrases:
        l3.append(item[0])
l3 = l3 + k
entities=[]
for item in l3:
        for index,row in dy.iterrows():
                if(fuzz.WRatio(item,row[0]) >=60):
                        counter[row[1]] = counter[row[1]]+1
print(counter)
counter =  sorted(counter.items(), key=lambda item: item[1], reverse=True)
print(counter)
docid = []
for k in counter:
        if(k[1] >=5):
                docid.append(k[0])
docid = np.array(docid)
df = pd.DataFrame(docid)
df.to_csv("Docid_list.csv", index=False)

In [None]:
#Import packages
import numpy as np
import pandas as pd
import check_similar
import matching

da = pd.read_csv("results1.csv")
db = pd.read_csv("Docid_list.csv")

#Retrieving Documents from Knowledge Graph
l=[]
c=0
check=[]
for index, row in db.iterrows():
        if(c<3):
                z = [row[0], "knowledge graph"]
                l.append(z)
                check.append(row[0])
                c=c+1
        else:
                break

#Retrieving documents from BERT, skipping those already retrieved from the knowledge graph to avoid duplication
c=0
print (check)
for index, row in da.iterrows():
        print("BERT:",row["document_number"])
        if((row["score"] >= 0.9) and (c<3)):
                if(row["document_number"] in check):
                        print("Repetition: ",row["document_number"])
                        c=c+1
                        continue
                else:
                        z = [row["document_number"], "bert"]
                        l.append(z)
                        c=c+1
        else:
                break

#Storing list into a .csv file
ff=[]
print("Final BERT Results :", l)
dc = pd.read_csv("docmapping.csv",header=None)
li = dc.values.tolist()
print(li)
for item in l:
        for item1 in li:
                #print("Final_Result: ",item[0],item1[0])
                if (item[0] == item1[0]):
                        x=[item[0], item1[1], item[1]]
                        ff.append(x)
                        break
#print("Final Results: ", ff)

df = pd.DataFrame(ff)
df.columns=["Document_ID","Document_Name","Extraction type"]
df.to_csv("final_docid.csv", index=False)

### 5.	Paragraph Extraction for related entities in extracted documents

    1. The key phrases of the input query and of the document titles are fuzzy-matched against the entities of each paragraph in the selected documents to find the paragraphs that are relevant to the search query.
    2. Using a term-frequency voting mechanism, the top 3 paragraphs are selected for every selected title (document).
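
The voting mechanism can be sketched as follows. Note the substitution: the notebook uses `fuzz.WRatio` from fuzzywuzzy with a threshold of 60, whereas this sketch uses `difflib.SequenceMatcher` from the standard library with a threshold of 0.6 so that it runs without extra packages. The helper names and the toy paragraph entities are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def similar(a, b, threshold=0.6):
    """Rough fuzzy match; difflib is a stand-in here for fuzzywuzzy's WRatio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def top_paragraphs(paragraph_entities, key_phrases, top_k=3):
    """Vote once per (entity, phrase) match and return the top-voted paragraph numbers."""
    votes = Counter()
    for number, entities in paragraph_entities.items():
        for entity in entities:
            for phrase in key_phrases:
                if similar(entity, phrase):
                    votes[number] += 1
    return [number for number, _ in votes.most_common(top_k)]


# Toy data: paragraph number -> entities extracted from that paragraph
paragraph_entities = {
    0: ["virus", "genetics"],
    1: ["hospital", "staff"],
    2: ["evolution"],
}
best = top_paragraphs(paragraph_entities, ["virus", "genetics", "evolution"])
print(best)
```

Paragraphs with no matching entity receive no votes and are never returned, so a document can yield fewer than three paragraphs.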


![](https://i.imgur.com/Yvn1KRs.png)

In [None]:
#Import packages
import json
from os import listdir
from os.path import isfile, join
import pke
import numpy as np
import pandas as pd
import spacy
import re
import string
import en_core_web_sm
from ast import literal_eval
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import operator
import bertkgmatch


#print("Start of Code")
file = open("words.txt","r")
f = file.readlines()
word = []
for w in f:
    word.append(w.replace("\n", ""))


paragraphresult = pd.read_csv('paragraphentityextracted.csv', converters={"Entity": literal_eval})
nlp = en_core_web_sm.load()


#Generation of entities for the input query
ques ="What do we know about virus genetics, origin, and evolution?"
print(ques, type(ques))
quesentity = []
btext = ques
keyphrases=[]
try:
        extractor = pke.unsupervised.TopicRank()
        extractor.load_document(input=btext, language='en')
        extractor.candidate_selection()
        extractor.candidate_weighting()
        keyphrases = extractor.get_n_best(n=5)
except ValueError:
        keyphrases=[]
        btext = re.sub(r" \d+", "", btext)
        btext = re.sub(r"[^A-Za-z0-9 -]+", "",btext)
        btext=nlp(btext)
        l1=[]
        b1 = btext.ents
        for token in btext:
                if((token.pos_ == "PROPN") or (token.pos_ == "NOUN")):
                        l1.append(token.text)
        # These steps run once, after the token loop, not once per token
        l2=[]
        for token in b1:
                l2.append(token.text)
        l3 = list(set(l1) | set(l2))
        l3 = list(set(l3) - set(word))
        for item2 in l3:
                x=[item2]
                keyphrases.append(x)

for item in keyphrases:
        quesentity.append(item[0])
quesentity = list(set(quesentity) - set(word))
print("quesentity: ",quesentity)

#Dictionary to maintain the number of hits a paragraph received from the entities
#Voting mechanism to generate the paragraphs with maximum hits
count_dict = {}
def inc(keyy):
    if keyy in count_dict:
        curr = count_dict[keyy]
        count_dict[keyy] = curr + 1
    else:
        count_dict[keyy] = 1

entity_paragraph_list = []


mesh_result = pd.read_csv('final_docid.csv')
mesh_doc_list = mesh_result["Document_ID"]
mesh_doc_list = mesh_doc_list.values.tolist()
mesh_entity = []

#Generating entities for the document titles received from BERT and the knowledge graph
for i in range(len(mesh_result)):
        s = mesh_result.iloc[i]['Document_Name']
        bod = []
        btext = s
        keyphrases=[]
        try:
                extractor = pke.unsupervised.TopicRank()
                extractor.load_document(input=btext, language='en')
                extractor.candidate_selection()
                extractor.candidate_weighting()
                keyphrases = extractor.get_n_best(n=5)
        except ValueError:
                keyphrases=[]
                btext = re.sub(r" \d+", "", btext)
                btext = re.sub(r"[^A-Za-z0-9 -]+", "",btext)
                btext=nlp(btext)
                l1=[]
                b1 = btext.ents
                for token in btext:
                        if((token.pos_ == "PROPN") or (token.pos_ == "NOUN")):
                                l1.append(token.text)
                l2=[]
                for token in b1:
                        l2.append(token.text)
                l3 = list(set(l1) | set(l2))
                l3 = list(set(l3) - set(word))
                for item2 in l3:
                        x=[item2]
                        keyphrases.append(x)

        for item in keyphrases:
                bod.append(item[0])
        bod = list(set(bod) - set(word))
        mesh_entity.append(bod)

mesh_result['entity'] = mesh_entity

mesh_paragraph_list = []
entity_hit_dict = {}
entity_hit_list = []
trial = []

#Selection of paragraphs which are relevant to the query
for i in range(len(mesh_doc_list)):
        count_dict.clear()
        entity_hit_dict.clear()
        entity_hit_list = []

        curr_doc_id = mesh_result.iloc[i]["Document_ID"]
        for j in range(len(paragraphresult)):

                if(paragraphresult.iloc[j]["Document_ID"] == curr_doc_id):
                        for k in range(len(paragraphresult.iloc[j]["Entity"])):
                                for l in quesentity:
                                        if(fuzz.WRatio(paragraphresult.iloc[j]["Entity"][k],l) >=60):
                                                inc(paragraphresult.iloc[j]["Paragraph_Number"])
                                                entity_hit_dict[paragraphresult.iloc[j]["Paragraph_Number"]] = [k,l]
                                                entity_hit_list.append([ paragraphresult.iloc[j]["Paragraph_Number"], l])

                                for m in mesh_entity[i]:
                                        if(fuzz.WRatio(paragraphresult.iloc[j]["Entity"][k],m) >=60):
                                                inc(paragraphresult.iloc[j]["Paragraph_Number"])
                                                entity_hit_dict[paragraphresult.iloc[j]["Paragraph_Number"]] = [k,m]
                                                entity_hit_list.append([ paragraphresult.iloc[j]["Paragraph_Number"], m])
        sorted_count_dict = dict(sorted(count_dict.items(), key = operator.itemgetter(1), reverse = True))
        temp = []
        count = 0
        entity_temp = []
        for key in sorted_count_dict:
                for para in range(len(paragraphresult)):
                        if(paragraphresult.iloc[para]["Document_ID"] == curr_doc_id and paragraphresult.iloc[para]["Paragraph_Number"] == key):

                                for x in entity_hit_list:
                                    if(x[0] == key):
                                        entity_temp.append(x[1])
                                entity_temp = list(dict.fromkeys(entity_temp))
                                temp.append([mesh_result.iloc[i]["Document_Name"], paragraphresult.iloc[para]["Paragraph"], key, entity_temp])
                                count += 1
                        if(count ==3):
                                break
                if(count ==3):
                        break
        trial.append(mesh_result.iloc[i]["Document_Name"])
        mesh_paragraph_list.append(temp)


#Display of result
print("\n\n\nMESH OUTPUT\n\n")
for i in range(len(mesh_paragraph_list)):
        try:
                print("\n\nTitle: ", mesh_paragraph_list[i][0][0], "\n")
                print("Entity extraction method: ",mesh_result.iloc[i]["Extraction type"])
                print()
                for j in range(len(mesh_paragraph_list[i])):
                        test =  mesh_paragraph_list[i][j][3]
                        test = list(dict.fromkeys(test))
                        print("Entities Hit: ", test)
                        print("Paragraph Number: ", mesh_paragraph_list[i][j][2], ":", mesh_paragraph_list[i][j][1])
                        print()

        except IndexError:
                # No paragraph matched for this document, so print the title only
                print("\n\nTitle: ", trial[i], "\n")

# RESULTS:

The results were good on the given dataset. In some cases, both **Knowledge graph as well as BERT** were giving the same results, which essentially means that the title of the document is representative of the entire document. But in some cases, we were having one or two extra results from either of the two methods which were not overlapping. Hence, using a combination of both **BERT and Knowledge Graph** covers the aspects of **semantic search** as well as some **keyword-specific search**.


We also tested a very specific medical term, the **CYP6F1** protein, which did not appear in any of the selected documents. Using the MeSH ontology, we found that this protein is related to **culex** (an entity in one of the documents), so our search model was able to retrieve that document. Only a sample of the output is shown below.
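The idea of reaching a document through a related ontology term can be sketched as a small query-expansion step. Everything here is an assumption made for illustration: `TOY_ONTOLOGY` is a hand-written stand-in rather than the real MeSH data, only the CYP6F1 to culex link comes from the text above, and the document names and entity lists are placeholders.

```python
from collections import deque

# Hypothetical toy relation map standing in for MeSH.
# Only the "cyp6f1" -> "culex" link is taken from the text; the rest is made up.
TOY_ONTOLOGY = {
    "cyp6f1": ["culex"],
    "culex": ["mosquito"],
}


def expand_query_terms(term, ontology, max_hops=1):
    """Return the term plus every term reachable within max_hops relations (breadth-first)."""
    start = term.lower()
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        current, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for related in ontology.get(current, []):
            if related not in seen:
                seen.add(related)
                queue.append((related, hops + 1))
    return seen


def search(term, doc_entities, ontology, max_hops=1):
    """Return documents whose extracted entities overlap the expanded query terms."""
    terms = expand_query_terms(term, ontology, max_hops)
    return [
        name
        for name, entities in doc_entities.items()
        if terms & {entity.lower() for entity in entities}
    ]


# Placeholder documents with assumed entity lists
docs = {
    "Document A": ["Culex", "arboviruses"],
    "Document B": ["glycosylation"],
}
print(search("CYP6F1", docs, TOY_ONTOLOGY))
```

The query term itself matches nothing, but its one-hop neighbour does, which is the behaviour described for the CYP6F1 example.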

**The first two results shown here use our initial approach with 12 documents; the later ones use our latest work with the biorxiv_medrxiv dataset.**

# Query 1: What do we know about COVID-19 risk factors?

### **Title 1:** What Should Gastroenterologists and Patients Know About COVID-19? 

**Extracted Method:** Knowledge Graph/BERT

**Paragraph Number 1:**  COVID-19 is a respiratory illness caused by a novel coronavirus that was first identified in Wuhan, the capital city of China's Hubei Province, in December 2019. 1 Initially referred to as the 2019 novel coronavirus (2019-nCoV), COVID-19 is caused by **severe acute respiratory syndrome** coronavirus 2 (SARS-CoV-2). 2 It was identified by researchers at the Wuhan Institute of Virology through metagenomic analysis of a bronchoalveolar lavage sample from a patient in the initial cluster of pneumonia cases in that city. 3 Coronaviruses are a large family of RNA viruses that are known to cause illnesses ranging from the common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and **Severe Acute Respiratory Syndrome** (SARS). The SARS-CoV-2 virus shares 79.5% of the genetic sequence of SARS and has 96.2% homology to bat coronavirus. 4 The intermediate animal vector between bats and humans for SARS-CoV-2 is currently unknown but has been linked epidemiologically to the Huanan Seafood Wholesale Market. 5 Although initially a zoonotic virus, SARS-CoV-2 is now spread human-to-human with higher infectivity than MERS and SARS but a lower fatality rate. 3 The clinical presentation of COVID-19 can range from mild non-specific respiratory symptoms to severe organ dysfunction such as acute respiratory distress syndrome (ARDS) that can lead to death. 1,6,7 Most cases of **COVID-19** appear to be mild with the most common symptoms being fever (83-98%), cough (46-82%), myalgia/fatigue (11-44%), and shortness of breath (31%). 7 **Risk factors** for more severe illness requiring hospitalization appear to be older age and having underlying chronic medical conditions such as diabetes, lung disease and cardiovascular disease. 7 Early reports suggest that for more severe cases the median time from first symptom onset to the development of shortness of breath and/or need for hospitalization ranged from 5 to 8 days. 
6-8 Among hospitalized **COVID-19** patients, it is reported that 5 to 26.1% have required admission to the intensive care unit. 6, 8 The reported fatality rate for hospitalized COVID-19 patients has ranged from 1.4 to 15%. [6] [7] [8] The incubation period for SARS-CoV-2 appears to average 5.2 days but may range from 2 to 14 days and potential asymptomatic infection has been reported. 7, 9 Of note for **gastroenterologists, patients** may complain of gastrointestinal symptoms such as nausea or diarrhea. 7 In the prior SARS coronavirus outbreak, diarrhea was reported in up to 25% of **patients**. 11 Interestingly, the cell entry receptor ACE2 appears to mediate entry of SARS-CoV-2 (similar to SARS) and has been demonstrated to be highly expressed in small intestinal enterocytes. 11 ACE2 is important in controlling intestinal inflammation and its disruption may lead to diarrhea. 11 The reported frequency of diarrhea among COVID-19 patients has varied from 2 to 33% and was one of the prominent symptoms reported by the first case in the United States. 11,12 SARS-CoV-2 has been detected in the stool of **COVID-19** patients. 12, 13 So while **COVID-19** appears to primarily spread through respiratory droplets and secretions, the **gastrointestinal** tract may be another potential route of infection, highlighting importance of personal protective equipment during endoscopy. Further, some of the more common laboratory findings described in COVID-19 patients include liver function test abnormalities. In addition to leukopenia (reported in 9-25% of cases) or leukocytosis (24-30%), elevated alanine aminotransferase and aspartate aminotransferase have been seen in up to 37% of cases. [7] [8] [9] More recent descriptions of patients in China also noted around 10% of patients were also noted to have elevated total bilirubin levels. 8 Gastroenterologists should be aware of these potential gastrointestinal manifestations of COVID-19.

**Paragraph Number 2:**  necessary precautions to take. **COVID-19** has been of particular interest to our patients on immunosuppressive agents (immunomodulators or biologics) such as those with inflammatory bowel disease (IBD). **COVID-19** has now been reported throughout the world, with more reported cases on a daily basis. We therefore aim to provide a brief overview of COVID-19 for the **gastroenterology** community based on currently available information to help assist with addressing our **patients**' questions and concerns

**Paragraph Number 3:**  •	Measures to control the spread of **COVID-19** are similar to the general advice for preventing any respiratory viral illness. At the national and international level, travel restrictions have been implemented for regions with the highest **COVID-19** incidence currently, but these recommendations and polices are likely to rapidly change and warrant close monitoring. **Patients** with potential symptoms of COVID-19 with recent travel to areas of higher incidence (China, South Korea, Japan, Iran, and Italy based on the most recent CDC data) should be asked to wear a standard surgical mask as soon as they are identified and be placed in a private room with the door closed, ideally an airborne isolation room (negative-pressure room). 3, 7 It is important to note that travel history criteria for testing for COVID-19 will change as community-based spread is emerging within the United States and patients living in or with recent travel to areas where COVID-19 has been confirmed should be considered for testing as well. Any concern for possible **COVID-19** should immediately prompt notification of institutional infection prevention and control as well as local or state health departments. Healthcare personnel entering the room should use standard precautions, contact precautions (gown and gloves), airborne precautions (with **N95 respirator**), and eye protection (goggles or a face shield). Testing has been performed at the Centers for Disease Control and Prevention (CDC) but is now more widely available in state and city laboratories and is becoming available commercially. It is important that more common respiratory illnesses (influenza, etc.) are also ruled out. What should we tell our patients based on our current knowledge? First, this is a rapidly evolving area with new information emerging on a daily basis. 
Therefore periodically checking on recommendations from leading national and international health organizations, such as the CDC or the World Health Organization (WHO), is the most important way for both patients and physicians to stay informed with accurate information. Second, it is important to realize that the majority of cases (80+%) have been mild, the fatality rate for COVID-19 is lower than prior coronavirus outbreaks, and the proportion of severe/fatal cases may be an overestimate as milder or asymptomatic cases are likely under-reported. 15 Third, there are currently no specific recommendations for people on immunosuppression, such as IBD patients. Prior IBD research has found that viral infections are more likely among patients on immunomodulators (such as 6-mercaptopurine and azathioprine) than those on biologics but it is unclear if this can be extended to COVID-19. 16 There are no data currently about the impact of immunosuppressive agents, although one of the largest case series from China did note that 2 patients with immunodeficiency (not further specified) had non-severe disease. 8 17 Last, the best measures to decrease the **risk** of contracting SARS-CoV-2 are the same as standard practices against any viral illness. These include good hand hygiene with alcohol-based hand sanitizers or soap and water, covering your mouth and nose with a tissue or your sleeve (not your hands) when coughing or sneezing, limiting touching your face, avoiding close contact with anyone with influenza-like and/or upper respiratory symptoms, and staying home if you are sick.


### **Title 2**: What are the risks of COVID-19 infection in pregnant women?

**Extracted Method:** BERT

**Paragraph Number 1:**  Stephanie colleagues did not find any evidence of the presence of SARSCoV2 viral particles in the products of conception or in neonates, in accordance with the findings of a previous study on SARSCoV1 done by Wong and colleagues. 9 Two neonatal cases of **COVID19** infection have been confirmed so far, 10 with one case confirmed at 17 days after birth and having a close contact history with two confirmed cases (the baby's mother and maternity matron) and the other case confirmed at 36 h after birth and for whom the possibility of close contact history cannot be excluded. However, no reliable evidence is as yet available to support the possibility of vertical transmission of **COVID19** infection from the mother to the baby. Previous studies have shown that SARS during pregnancy is associated with a high incidence of adverse maternal and neonatal complications, such as spontaneous miscarriage, preterm delivery, intrauterine growth restriction, application of endotracheal intu bation, admission to the intensive care unit, renal failure, and disseminated intravascular coagulopathy. 9, 11 However, **pregnant women** with COVID19 infection in the present study had fewer adverse maternal and neonatal complications and outcomes than would be anticipated for those with SARSCoV1 infection. Although a small number of cases was analysed and the findings should be interpreted with caution, the findings are mostly consistent with the clinical analysis done by Zhu and colleagues 12 of ten neonates born to mothers with COVID19 pneumonia. The clinical characteristics reported in pregnant women with confirmed COVID19 infection are similar to those reported for nonpregnant adults with confirmed COVID19 infection in the general population and are indicative of a relatively optimistic clinical course and outcomes for COVID19 infection compared with SARSCoV1 infection. 
13, 14 Nonetheless, because of the small number of cases analysed and the short duration of the study period, more followup studies should be done to further evaluate the safety and health of pregnant women and newborn babies who develop COVID19 infection. As discussed in the study, pregnant women are susceptible to respiratory pathogens and to development of severe pneumonia, which possibly makes them more susceptible to COVID19 infection than the general population, especially if they have chronic diseases or maternal complications. Therefore, **pregnant women** and newborn babies should be considered key **atrisk** populations in strategies focusing on prevention and management of COVID19 infection. Based on evidence from the latest studies and expert recommendations, as well as previous experiences from the prevention and control of SARS, the National Health Commission of China launched a new notice on Feb 8, 2020, 15 which proposed strengthening health counselling, screening, and followups for pregnant women, reinforcing visit time and procedures in obstetric clinics and units with specialised infection control preparations and protective clothing, and emphasised that neonates of **pregnant women** with suspected or confirmed **COVID19** infection should be isolated in a designated unit for at least 14 days after birth and should not be breastfed, to avoid close contact with the mother while she has suspected or confirmed COVID19 infection.

**Paragraph Number 2:**  Since December, 2019, the outbreak of the 2019 **novel coronavirus** disease (COVID19) infection has become a major epidemic threat in China. As of Feb 11, 2020 , the cumulative number of confirmed cases in mainland China has reached 38 800, with 4740 (12·2%) cured cases and 1113 (2·9%) deaths; additionally, there have been 16 067 suspected cases so far. 1 All 31 provinces in mainland China have now adopted the firstlevel response to major public health emergencies. The National Health Commission of China has published a series of guidelines on the prevention, diagnosis, and treatment of **COVID19** pneumonia, based on growing evidence of the pathogens responsible for COVID19 infection, as well as the epidemiological characteristics, clinical features, and the most effective treatments. [2] [3] [4] The central government and some provincial govern ments have provided food and medical supplies and dispatched expert groups and medical teams to manage and control the outbreak response in the hardesthit areas (Wuhan and neighbouring cities in Hubei province).

**Paragraph Number 3:**  As the COVID19 outbreak unfolds, prevention and control of COVID19 infection among **pregnant women** and the **potential risk** of vertical transmission have become a major concern. More evidence is needed to develop effective preventive and clinical strategies. The latest research by Huijun Chen and colleagues 5 reported in The Lancet provides some insight into the clinical characteristics, **pregnancy outcomes**, and vertical transmission potential of COVID19 infection in **pregnant women**. Although the study analysed only a small number of cases (nine women with confirmed COVID19 pneumonia), under such emergent circum stances these findings are valuable for preventive and clinical practice in China and elsewhere. Although neonatal nasopharyngeal swab samples have been collected in some hospitals across China, this study also collected and tested amniotic fluid, cord blood, and breastmilk samples for the presence of severe acute respiratory syndrome coronavirus 2 (SARSCoV2), thus allowing a more detailed assessment of the vertical transmission potential of COVID19 infection.



### **Title 3**: Can Bats Serve as Reservoirs for Arboviruses?

**Extracted Method:** Knowledge Graph

**Paragraph Number 1:**  DENV nucleic acid and anti-DENV antibodies have been detected in Mexican bats on the Gulf and Pacific coast, and nucleic acid has been detected in the liver and/or sera of wild-caught bats in French Guiana [62, 63] . Anti-DENV antibodies have been detected in multiple bat species in Uganda [29] . However, a survey in Central and Southern Mexico analyzing 240 individuals representing 19 bat species by RT-PCR resulted in no detection of viral nucleic acid [64] . A 2017 study by Vicente-Santos and colleagues examined 12 bat species from Costa Rica and found a cumulative seroprevalence of 21.2% (51/241) by PRNT and a prevalence of 8.8% (28/318) in organs tested by RT-PCR [65] . No infectious virus was isolated and viral loads were considered too low for the bats to function as amplifying hosts. Rather, Vicente-Santos and colleagues surmised a spillover event from humans to bats, with bats functioning as a dead-end host [65] . The serum of Jamaican fruit bats (Artibeus jamaicensis) and Great fruit-eating bats (A. literatus) from Grenada (n = 50) were also tested for antibodies against DENV 1, 2, 3, and 4, and none were seropositive [66] . While field evidence supports the exposure of bats to DENV in multiple geographic areas, experimental infections conducted to date are consistent in that bats are not likely to support DENV replication and circulation to levels high enough to infect blood-feeding mosquitoes.

**Paragraph Number 2:**  Bats and the viruses they harbor have been of interest to the scientific community due to the unique association with some high consequence human pathogens in the absence of overt pathology. Virologic and serologic reports in the literature demonstrate the exposure of bats worldwide to arboviruses (arthropod-borne viruses) of medical and veterinary importance [1] . However, the epidemiological significance of these observations is unclear as to whether or not bats are contributing to the circulation of **arboviruses**

**Paragraph Number 3:**  Historically, a zoonotic virus reservoir has been considered a vertebrate species which develops a persistent infection in the absence of pathology or loss of function, while maintaining the ability to shed the virus (e.g., urine, feces, saliva) [2] [3] [4] . Haydon et al. extended this definition of a reservoir to include epidemiologically-connected populations or **environments** in which the pathogen can be permanently maintained and from which infection is transmitted to the defined target population. The significance of the relative pathogenicity of the infectious agent to the purported reservoir host has been debated [5] . In the case of bats as a reservoir species, rigorous field and experimental evidence now exist to solidify the role of the Egyptian rousette bat (Rousettus aegyptiacus) as the reservoir for Marburg virus [6] [7] [8] . Considering arboviruses, additional criteria must be met in order to consider a particular vertebrate species a reservoir. Reviewed by Kuno et al., these criteria include the periodic isolation of the infectious agent from the vertebrate species in the absence of seasonal vector activity, and the coincidence of **transmission** with vector activity [9] . Further, the vertebrate reservoir must also develop viremia sufficient to allow the hematophagous arthropod to acquire an infectious bloodmeal [10] in order for vector-borne **transmission** to occur. Bats have long been suspected as reservoirs for arboviruses [11] , but experimental data that would support a role of bats as reservoir hosts for certain arboviruses remain difficult to collect. Here we synthesize what information is currently known regarding the exposure history and permissiveness of bats to arbovirus infections, and identify knowledge gaps regarding their designation as arbovirus reservoirs.

### **Title 4:** Are high-performing health systems resilient against the COVID-19 epidemic?

**Extracted Method:** Knowledge Graph

**Paragraph Number 1:**  Are high-performing health systems resilient against the COVID-19 epidemic?

**Paragraph Number 2:**  As of March 5, 2020, there has been sustained local transmission of coronavirus disease 2019 in Hong Kong, Singapore, and Japan. 1 Containment strategies seem to have prevented smaller transmission chains from amplifying into widespread community transmission. The health systems in these locations have generally been able to adapt, 2,3 but their resilience could be affected if the COVID-19 epidemic continues for many more months and increasing numbers of people require services. We outline some of the core dimensions of these resilient health systems 4 and their responses to the COVID-19 epidemic.

**Paragraph Number 3:**  Fifth, in all locations, critical care treatment and medicines have been available for patients with COVID-19, but adequate supplies of personal protective equipment in hospitals and face masks in the community are a key concern. In Japan and Hong Kong, hospital supplies are running low but have not yet impacted clinical management. In all locations, pressure on critical care treatment is likely if there is a sustained increase in cases of COVID-19.


# Query 2: What is known about transmission, incubation, and environmental stability?

### **Title 1:**  Can Bats Serve as Reservoirs for Arboviruses?  

**Extracted Method:** Knowledge Graph/BERT

**Paragraph Number 1:**  Kading and Schountz (2016) reviewed instances in the literature where mosquito blood meals have been identified as originating from bats [95] . Information on primary mosquito vectors feeding on bats is very limited. Tiawsirisup et al. 2012 collected mosquitoes from five genera inside a bat cave in Thailand to investigate sylvatic circulation of JBEV. While these collections included arbovirus vectors Cx. quinquefasciatus and Cx. tritaenhiorhynchus, the only blood-fed mosquitoes collected from the cave were Cx. quinquefasciatus, at least 20 of which had fed on Leschnault's rousette (Rousettus leschenaulti) bats [176] . **Culiseta morsitans Theobald mosquitoes (vector of Eastern equine encephalitis virus)** were found to have fed on Eastern pipistrelle bats (Pipistrellus subflavus), but these blood meals comprised only 1% of the total blood meals identified from this mosquito species [177] . No information was found on any blood meals from bats being detected in Aedes (Stegomyia) species, but it is unclear how much investigation has been done in this area. Sixteen of 20 field-collected Ae. funereus mosquitoes (vector of RRV) had fed on Pteropus alecto bats [149] . In Africa, mosquito species in which bat blood meals have been identified and are known to be associated with a number of medically-important arboviruses include: Coquillettidia (Cq.) fuscopennata (Theobald) (YFV, Sindbis, chikungunya viruses), Culex (Cx.) perfuscus Edwards (WNV, Oropouche, Sindbis, Wesselsbron, Usutu, Babanki viruses), Cx. (Cx.) neavei Theobald (WNV, Babanki, Spondweni, Sindbis, Koutango viruses), and Cx. (Cx.) decens group (WNV, chikungunya, Babanki viruses [148, 178] . While mosquitoes in the subgenus Cx. (Cx.) are recognized as primary vectors of WNV, Sindbis, Babanki, and Usutu viruses, only for Babanki virus has additional field data been collected so far that support a potential role for bats in **virus circulation** (discussed above) [29] .

**Paragraph Number 2:**  To truly elucidate the role of bats as **reservoirs** for arboviruses, field surveillance studies documenting natural infection and **transmission** dynamics among vector and vertebrate species must be supplemented with experimental infections to characterize viremia profiles and infectiousness to vectors, virus-induced pathology, and immune kinetics following infection. With bats, these tasks are not trivial, and carry significant challenges in both the field and the lab. These challenges are evidenced by the few arboviruses for which there are substantial field and laboratory data involving the infection of bats. While many studies have presented serologic data indicative of past exposure of free-ranging **bats** to arboviruses of medical and veterinary importance, these studies should be followed up with laboratory assessments of reservoir host competence to shed light on the true epidemiological significance of the field data. Further, the detection of viral nucleic acid in free-ranging bats does not necessarily implicate the species as an arboviral reservoir. Rather, recovery and isolation of live virus at biologically relevant titers, and demonstrating the persistence of the pathogen in nature among connected populations of a potential reservoir species is more definitive. Unfortunately, few established bat colonies exist for use in vivo viral pathogenesis studies, limiting the achievability of these studies.

**Paragraph Number 3:**  Bats and the viruses they harbor have been of interest to the scientific community due to the unique association with some high consequence human pathogens in the absence of overt pathology. Virologic and serologic reports in the literature demonstrate the exposure of bats worldwide to arboviruses (arthropod-borne viruses) of medical and veterinary importance [1] . However, the epidemiological significance of these observations is unclear as to whether or not bats are contributing to the circulation of arboviruses.

### **Title 2:** When are pathogen genome sequences informative of transmission events?

**Extracted Method:** Knowledge Graph/BERT

**Paragraph Number 1:**  The impact of limited genetic diversity on the reconstruction of disease outbreaks remains to be investigated. While this impact undoubtedly varies across different methods, the intrinsic informativeness of genetic data with respect to the underlying transmission tree can be evaluated. The genetic diversity accumulating along transmission chains depends on various genomic and epidemiological factors. To quantify this diversity, we introduce the concept of **'transmission divergence'**, which we define as the number of mutations accumulating between pathogen WGS sampled from transmission pairs. Transmission divergence can be estimated empirically from a transmission tree by determining the number of mutations separating pathogen samples of known **transmission pairs**. However, accurately reconstructed transmission trees with corresponding genetic sequence data are generally not available for most pathogens. We present here a simulation based approach for estimating the transmission divergence of different pathogens using parameters available in the literature, namely the length of the pathogen genome (L), its overall mutation rate (M), its generation time distribution (W) (i.e. the distribution of delays between primary and secondary infections [24] ) and its basic reproduction number R 0 (i.e. the average number of secondary infections caused by an index case in a fully susceptible population [25]). Specifically, we model transmission trees alongside sequence evolution and extract the number of mutations separating individual transmission pairs. Intuitively, greater transmission divergence should enable better reconstruction of these transmission trees, although the nature of this relationship remains to be described.

**Paragraph Number 2:**  Understanding transmission dynamics in the early stages of an infectious disease outbreak is essential for informing effective control policy. Valuable insights can be gained by the reconstruction of the transmission tree, which describes the history of infectious events at the resolution of individual cases [1] [2] [3] [4] . Recent years have seen significant progress in the development of statistical and computational tools for inferring such trees [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] , with a major emphasis placed on the analysis of whole genome sequence (WGS) data, now routinely collected in many outbreak scenarios [16] .

**Paragraph Number 3:**  Two approaches to the inference of transmission trees from WGS have emerged. One begins with an underlying transmission model, attaching to this a model of sequence evolution that relates observed genetic relationships between pathogens to unobserved epidemiological relationships (i.e. transmission pairs) between infected individuals. A simple implementation involves ruling out direct transmission events between individuals separated by more than a fixed threshold of substitutions [17] [18] [19] . More sophisticated methods have specified models of sequence evolution as components in a joint likelihood, formalising expected genetic relationships in a probabilistic manner [5] [6] [7] [8] [9] 20 ]. The other approach considers outbreak reconstruction from a phylogenetic perspective, inferring unobserved historical relationships between pathogen samples to capture more complex evolutionary dynamics. WGS data is used to reconstruct phylogenetic trees which are either treated as data upon which transmission histories are overlaid [10] [11] [12] , or jointly inferred alongside the transmission tree itself [13] [14] [15] . Given the unprecedented level of detail of WGS data and the epidemiological insights it has provided in real-life scenarios [21-23], genetic analysis is clearly an indispensable tool for outbreak reconstruction.

### **Title 3:** Dengue Virus Glycosylation: What Do We Know?

**Extracted Method:** Knowledge Graph

**Paragraph Number 1:**  Mannose binding lectin (MBL) in the lectin pathway triggers antibody-independent activation of complement (Thielens et al., 2002) . The proposed MBL-mediated virus elimination mechanisms include (1) direct virus neutralization, (2) C3/C4 deposition on virus surface and (3) interference of host cell lectin receptor binding . MBL differentiates self-from non-self-antigens based on a sugar density-dependent recognition mechanism (Dam and Brewer, 2010) , and the micro pattern of the oligosaccharides structure in addition to the spatial geometry of the macro sugar pattern (Takahashi and Ezekowitz, 2005) . It was proposed that the additional N67-glycan in DENV (which is absent in other flaviviruses) could promote a more efficient recognition and binding by MBL . This hypothesis is supported by improved MBL binding and in vivo virus clearance of a genetically engineered WNV with additional N67-glycosylation site . MBL is reactive to high mannose oligosaccharides and thus can efficiently recognize insect cell-derived DENV with high mannose glycans present on its E proteins. Hence, the change of N-glycan profile of E protein after one round of replication in mammalian host cells may provide an opportunity to the virus to escape from effective MBL recognition (Fuchs et al., 2011) . However, mammalian cell-derived DENV was found to be effectively inhibited and neutralized by mouse MBL . A separate study instead reported preferential binding of recombinant human MBL to insect cell-derived DENV2, whereas virions produced in monocyte-derived DC were not neutralized by human MBL . It therefore remains unclear whether MBL-mediated virus clearance is optimally engaged during DENV infection in humans.

**Paragraph Number 2:**  The role of glycosylation and glycan structures in DENV virulence has yet to be reported with evidence of attenuated phenotypes in symptomatic mouse models. Given the impact of glycosylation in virus entry and virus fitness in mammalian cells, it is highly likely that deglycosylated DENV mutants will display reduced virulence in vivo. The ability of Celgosivir treatment, a bicyclic iminosugar that inhibits glycosylation through negatively binding to ER [alpha]-glucosidase II, to protect mice from a lethal DENV challenge indirectly demonstrates the importance of glycosylation in DENV virulence (Perry et al., 2013; Sayce et al., 2016; Warfield et al., 2016) .

**Paragraph Number 3:**  DENV belongs to the family Flaviviridae of which the members are well known as human pathogens, including WNV, Zika virus, Yellow fever virus, Tick-borne encephalitis virus, JEV, and Hepatitis C virus (HVC). They are enveloped viruses with positive sense, single stranded RNA and many of them are arthropod-borne viruses. Among all flaviviruses, DENV has the highest impact on global disease burden. The virus particle is about 50 nm in size and the RNA genome (∼10.7 kb) is encapsulated by a protein shell which consists of three structural proteins, namely capsid (C), envelope (E), and (pre)membrane protein (prM/M) (Kuhn et al., 2002) .

### **Title 4:** What can we predict about viral evolution and emergence? 

**Extracted Method:** BERT

**Paragraph Number 1:**  There is a general, and justifiable, nervousness about predictions in evolutionary biology. Mutations can be fixed in populations by genetic drift or natural selection. Prediction in the case of genetic drift is necessarily hindered by a large sampling error. Although natural selection is a deterministic process, which ought to engender it with some inherent predictability, this will only be possible if the fitness of each relevant mutation is known in each relevant environment. For example, early attempts to predict the evolution of influenza virus as a guide for vaccine strain selection using the differential strength of positive selection among viral lineages gained little traction [9]. Although predictions are more robust when selection is extremely strong (it is trivial to predict that resistance, and often the causative mutations, will arise to commonly used antivirals), predictions of the evolution of more complex multi-factorial traits such as host adaptation and emergence are a very different matter [10, 11]. Even for traits where single mutations might confer a major fitness advantage, such as antiviral resistance, genome-scale interactions, including epistasis [...]
Fortunately, there are some evolutionary generalities that enable very broad-scale predictions about viral emergence, although none possess meaningful precision. One of the best established is that viruses are more likely to jump between closely related species [15, 16]. Although informative, numerous exceptions arise because the probability of exposure does not match phylogenetic relatedness. For example, although humans are more closely related to other primates than to rodents, we probably have greater exposure to the latter. It is equally well known that RNA viruses jump species boundaries more often than DNA viruses and that this likely reflects their differing rates of evolutionary change [8, 17].
RNA viruses have mutation rates of between 0.1 and 1.0 mutations per genome replication, several logs higher than those of double-strand DNA viruses, and which will generate abundant genetic variation when coupled with huge population sizes [17]. Although there is some variation in evolutionary rate among RNA viruses, this does not appear to be related to the propensity to jump hosts. Interestingly, single-stranded DNA viruses exhibit rates of evolutionary change similar to those of RNA viruses [17] and may frequently cross species boundaries [18]. Similarly, although rates of recombination (or reassortment) vary greatly among viruses, these appear to be of little predictive power: as a case in point, paramyxoviruses, such as measles and the emerging henipaviruses, experience extremely low rates of recombination (if they recombine at all) yet frequently jump species [19].

**Paragraph Number 2:**  The challenge of predicting viral evolution and emergence is reflected in the on-going debate over whether highly pathogenic avian H5N1 influenza virus will emerge in humans.

**Paragraph Number 3:**  Finally, more attention should be devoted to revealing the common evolutionary and epidemiological patterns exhibited by those viruses that have successfully jumped species boundaries. For example, a comprehensive survey of the phylodynamic patterns [56] exhibited by currently circulating viruses will do much to help us understand how a new virus will evolve and spread once it has emerged. Specifically, it should be possible to compile a cross virus data base of the parameters that correlate most with successful emergence, such as how rapidly each virus evolves, its mode of transmission, its major host or vector species, its cell receptors of choice, key aspects of phenotype such as virulence and antigenicity, its population growth rate, its phylogeography, and whether it has jumped species boundaries in the past. Although such data will not enable us to predict future emergence with any certainty, it may allow broad-scale conclusions as to which groups of viruses are most likely to emerge in humans, which animal species in which geographical locations need to be surveyed most intensively, and how evolution will proceed following a host jump.

# Query 3: What do we know about vaccines and therapeutics?

### **Title:**  Financing Vaccines for Global Health Security

**Extraction method:**  Knowledge Graph

**Paragraph Number 1 :** In spite of these substantial difficulties, or perhaps because of them, new global initiatives have drawn attention to the need for new approaches to encourage the development of **vaccines** against EIDs (16, 17). International collaborations like CEPI have drawn extensive public, private, NGO, and academic attention to the perils of global epidemic unpreparedness (18).

**Paragraph Number 2 :** **Vaccines** only sell for a fraction of their economic value, in some cases for only a few dollars. They provide myriad benefits, like enabling would-be patients to live longer, healthier lives (33, 34), and bearing yet-undervalued gains in productivity and positive externalities to society at large (35-37). Although the low price of **vaccines** is meant to benefit individuals and regions with lower incomes, in the long run, it has had the opposite effect, causing them to be medically underserved due to a lack of **vaccine** investment. Pharmaceutical companies and investors are directing their resources to projects in which the estimated return on investment is more predictable and lucrative. **Vaccine** prices are currently set far below the prices of drugs that treat other serious conditions, such as cancer, despite the enormous societal value of **vaccines** in general, and those to ensure **global health security** in particular.

**Paragraph Number 3 :** This crisis-driven expanded interest in **vaccines** to address epidemic threats is encouraging, but there is still much work to be done. There needs to be a viable, sustainable business model that will align the financial incentives of stakeholders to encourage the necessary investment in **vaccine** development (19, 20). While governments and international agencies have striven to create incentives to attract additional private sector investment in **vaccine** development, these efforts have so far failed in attracting sufficient capital to enhance preparedness against the world's most deadly emerging pathogens (21).

### **Title:**  The Essential Facts of Wuhan Novel Coronavirus Outbreak in China and Epitope-based Vaccine Designing against COVID-19

**Extraction method:**  Knowledge Graph

**Paragraph Number 1 :** Reverse vaccinology refers to the process of developing **vaccines** where the novel antigens of a virus or microorganism or organism are detected by analyzing the genomic and genetic information of that particular virus or organism. In reverse vaccinology, the tools of bioinformatics are used for identifying and analyzing these novel antigens. These tools are used to dissect the genome and genetic makeup of a pathogen for developing a potential vaccine. The reverse vaccinology approach to vaccine development also allows scientists to easily understand the antigenic segments of a virus or pathogen that should be given more emphasis during vaccine development. This method is a quick, cheap, efficient, easy and cost-effective way to design a vaccine. Reverse vaccinology has successfully been used for developing vaccines to fight against many viruses, e.g., the Zika virus and Chikungunya virus.

**Paragraph Number 2 :** The current experiment was conducted to develop potential **vaccines** against the Wuhan novel coronavirus (strain SARS-CoV-2) (Wuhan seafood market pneumonia virus) exploiting the strategies of reverse vaccinology (Figure 01).

**Paragraph Number 3 :** The 3D structures of the selected **epitopes** were generated using PEP-FOLD3 (http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD3/) online tool. Only the best selected **epitopes** from previous steps (the epitopes that followed the selection criteria of high antigenicity, non-allergenicity and non-toxicity in the previous steps were considered best) were taken for 3D structure prediction [119]-[121].

### **Title:**  Guidelines for preventing respiratory illness in older adults aged 60 years and above living in long-term care: A rapid review of clinical practice guidelines

**Extraction method:**  Knowledge Graph

**Paragraph Number 1 :** 1. What are the infection prevention and control practices/measures for preventing or reducing respiratory viruses (including coronavirus and influenza) in **older adults** aged 60 years and above living in long-**term care**? 2. How do infection prevention and control practices for adults aged 60 years and above living in long-**term care** with respiratory illness and severe comorbidities or frailty differ from those for adults without such severe comorbidities/frailty? 3. How do infection prevention and control practices for adults aged 60 years and above living in long-**term care** with respiratory illness in low- and middle-income economy countries (LMIC) differ from those for adults living in high-income economy countries, and do differences exist across different cultural contexts?

**Paragraph Number 2 :** The rapid review conduct was guided by the Cochrane Handbook for Systematic Reviews of Interventions 1 along with the **Rapid Review** Guide for Health Policy and Systems Research 2 . The team used an integrated knowledge translation approach, with consultation from the knowledge users from the WHO at the following stages: question development, interpretation of results, and draft report. We used the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) Statement 3 to guide the reporting of our **rapid review** results; a PRISMA extension for rapid reviews is currently under development. This **rapid review** of guidelines was completed in conjunction with a rapid overview of systematic reviews published in a separate report titled: "Preventing respiratory illness in **older adults** aged 60 years and above living in long-**term care**: A rapid overview of reviews".

**Paragraph Number 3 :** There are several limitations to the review methods employed here, single screening and abstraction for example, however they were selected to thoughtfully tailor our methods according to our knowledge user needs and the urgent nature of the request to provide timely results. There is also a chance that our literature search missed guidance documents from various state and provincial authorities. However, we were unable to perform an exhaustive grey literature search of websites, due to the timelines imposed on this review.

### **Title:**  What does simple power law kinetics tell about our response to coronavirus pandemic?

**Extraction method:**  BERT

**Paragraph Number 1 :** To the best of my knowledge of the scientific understanding of coronavirus at the point of writing this article,

**Paragraph Number 2 :** • The effect of weather on the spread of pandemic has not been established.

**Paragraph Number 3 :** • Travel restrictions between countries are very useful in the initial phases of the pandemic but have little effect in the later phases.


### **Title:**  How lethal is the novel coronavirus, and how many undetected cases there are? The importance of being tested

**Extraction method:**  BERT

**Paragraph Number 1 :** The main result of this work is that, due to the reduced number of performed tests, the vast majority of the coronavirus infections went undetected in most countries. These undetected infections had the potential to spread freely in the population, giving rise to the rapid exponential phase of the epidemics before the lockout measures.

**Paragraph Number 2 :** The pandemic spread of the **novel coronavirus** SARS-CoV-2 is causing thousands of fatalities, creating a tremendous threat to global health [1]. In this situation, society is strongly concerned by the lethality and the true extent of the pandemic, see for instance [2, 3]. In the media, but also in some non-specialist scientific circles, lethality is frequently estimated as the cumulative number of deaths divided by the cumulative number of confirmed cases, data that are easily accessible to everyone on the internet. This quantity changes rapidly and systematically both in time and in space, generating doubts and concerns regarding its interpretation. Of course, epidemiologists can estimate the lethality rate in a less biased way with additional data on the dates on which the infections of the persons who die or recover were detected, and better sampling of the population. However, these data are not easily accessible for all countries. Therefore, here I set out to obtain an estimate of the lethality rate and the detection rate based only on the data reported for all countries in the Johns Hopkins University database [4]. This extrapolation of the data shows that, when the time course of the disease is controlled for, the lethality estimated for all countries for which reliable data are available depends only on the intensity of the performed tests, i.e. the number of tests divided by the number of positive cases. Extrapolating to an infinite number of tests, I estimate a lethality rate of 0.012 ± 0.012, which is very noisy but consistent with the estimate of 0.01 that is frequently mentioned in the media.
Inverting the relationship between apparent death rate and number of tests, it is possible to estimate that in all countries except South Korea and perhaps Germany, at least at the beginning of the spread, the vast majority of positive cases went undetected, with more than 90 percent undetected cases in some countries such as Italy. These cases that went under the radar and were not isolated are likely to have contributed strongly to the rapid spread of the virus. Finally, I propose to adopt the ratio between the cumulative number of recovered and the cumulative number of deceased persons as a potential indicator that can anticipate whether the spread of the epidemic is halting.

**Paragraph Number 3 :** This result is consistent with recent reports from China [8, 9], and with the high rates of asymptomatic but positive cases observed in Vo Euganeo (Italy) [10], in the cruise ship Diamond Princess [11] and among children [12] (see also [13]). The only countries that were able to keep the fraction of undetected infections low appear to be South Korea and Germany, which explains their better ability to control the [...]

### **Title:**  When will the Covid-19 epidemic fade out?

**Extraction method:**  BERT

**Paragraph Number 1 :** Humanity has always been afflicted by the onset of epidemics. Owing to the absence of vaccines, the slow connections between people and isolation between infectious and susceptible were the only remedies to their devastating effects. Over the last two decades, there have been three major epidemics due to human-transmissible viruses, namely Avian, Ebola and SARS, but fortunately the advanced ability of the scientific world has been able to contain their effects.

**Paragraph Number 2 :** A dangerous impact of infectious diseases on populations can arise from emergence and spread of novel pathogens in a population, or a sudden change in the epidemiology of an existing pathogen. Today, due to the absence of a vaccine and to a highly globalized society, the Covid-19 epidemic is frightening the world, raising a series of important questions. Among these, the most common among people is: when will the epidemic die down? During spreading, this is a difficult question to answer: in addition to understanding the early transmission dynamics of the infection, control measures should also be accounted for, which may significantly affect the trend of infection.


# Query 4: What has been published about ethical and social science considerations?

### **Title:**  Rapid community-driven development of a SARS-CoV-2 tissue simulator

**Extraction method:**  Knowledge Graph

**Paragraph Number 1 :** (1) Open source and GitHub: All simulation source code is shared as open source on GitHub, with well-defined, versioned and documented releases, and Zenodo-generated DOIs and archives. (2) Interactive cloud-hosted models: Every prototype version is rapidly transformed into a cloud-hosted, interactive model to permit faster scientific communication across communities, particularly with virologists and others who have essential insights but ordinarily would not directly run the simulation models. (3) Social media and virtual feedback: We enlist **community participation** (feedback, modeling contributions, software contributions, and data contributions) through social media, virtual seminars, web-based forms, and a dedicated Slack workspace. We are particularly encouraging feedback and data contributions by domain experts in virology, epidemiology, and mathematical biology (with a focus [...]). (4) Frequent preprint updates: Each model iteration is accompanied by a cloud-hosted, interactive app (see #2) and an updated preprint on bioRxiv. (5) Integration of feedback: All **community feedback** is evaluated to plan the next set of model refinements and recorded in an updated bioRxiv preprint.

**Paragraph Number 2 :** • We will gather community consensus and pool efforts into a "standardized" model that captures key **SARS**-CoV-2 dynamics. We will supply this model to the community for use in parallel studies by multiple labs.

**Paragraph Number 3 :** This **rapid prototyping** effort brings together specialists from a broad variety of domains: virology and infectious diseases, mathematical biology, computer science, high performance computing, data science, and other disciplines. Therefore, it is critical that all members of the project have access to a clear description of underlying biology. In this section we outline key aspects of viral replication and host response in functional terms needed for development of agent-based, multi-scale and multi-physics models.

### **Title:**  The need to connect: Acute social isolation causes neural craving responses similar to hunger

**Extraction method:**  Knowledge Graph

**Paragraph Number 1 :** In sum, these results suggest that across all participants, SN/VTA shows an increased response to social cues after objective **social isolation**, in a pattern that is similar to the response to food cues when hungry; the magnitude of this response was variable across participants, and larger in those who reported more **social craving after the acute isolation period**. We predicted that individual variability in response to objective isolation might reflect pre-existing differences in participants' social network size and/or chronic loneliness. Consistent with this prediction, participants with higher levels of chronic loneliness during the initial screening reported less **craving for social contact** after 10 hours of isolation in response to the social cues in the cue-induced craving task (Social_Craving_CIC: r(38)=-0.36; p=0.020; two-tailed) and somewhat less craving on the online questionnaire (Social_Craving_Q: r(38)=-0.30; p=0.059; two-tailed). People with higher pre-existing chronic loneliness also showed a muted response in SN/VTA to social cues after acute isolation (r(38)=-0.35; p=0.027; two-tailed). Individual differences in social network size did not predict either self-reported or neural responses to acute social isolation (all p-values >= 0.18). We explored whether pre-existing loneliness was also associated with different responses to food cues after fasting and find that while loneliness did not affect self-reported food craving (p-values p>0.307), higher loneliness was associated with lower post-fasting SN/VTA responses to food cues (r(38)=-0.32; p=0.042; two-tailed).

**Paragraph Number 2 :** The first challenge for this research was to induce a subjective state of **social craving** using an experimentally induced period of objective isolation. Despite the fact that isolation lasted only ten hours, and the participants knew exactly when it would end, participants reported more loneliness and social craving at the end of the day than they did at the beginning. For our participants, who are all highly socially connected, a day of social isolation was a very large deviation from typical rates of social interaction. Although being alone is not inherently aversive for people (i.e., when chosen intentionally, solitude can be restful and rejuvenating (80, 81)), our isolation paradigm was subjectively aversive: participants felt more uncomfortable and less happy at the end of the day. In the current paradigm, isolation was likely made more salient by the externally mandated (not personally chosen) restrictions on participants' behaviour, including bans on email, social media, fiction, films, music, and other forms of virtual and fictional connectedness (82) (though participants did have access to games, puzzles, and reading material). The cessation of virtual social interaction was a substantial change: typical young adults currently use social media for social interactions more than two hours per day (83). In all, this experiment demonstrates that **acute objective social isolation** can induce subjective social craving in human participants. This paradigm could be useful for future research investigating, for example, how objective isolation is experienced by different human groups (e.g., adolescents, older adults) and how this experience is modulated by social media usage.

**Paragraph Number 3 :** How are people affected by a period of forced social isolation? **Chronic social isolation and loneliness** are associated with lower physical (1-5) and mental (5-9) health, but little is known about the consequences of **acute mandatory isolation**. Positive social interactions in and of themselves may be basic human needs, analogous to other basic needs like food consumption or sleep (10-12). If so, the absence of positive social interaction may create a want, or "craving", that motivates behavior to repair what is lacking (10). Cues associated with positive social interaction (e.g., smiling faces) activate neural reward systems (13, for review). However, research on the neural representation of unmet human social needs is lacking (14).

### **Title:**  Fibrinolytic niche is requested for alveolar type 2 cell-mediated alveologenesis and injury repair

**Extraction method:**  Knowledge Graph

**Paragraph Number 1 :** The epithelial lining of regeneratively quiescent lungs is composed of alveolar type 2 (AT2) progenitor and differentiated alveolar type 1 (AT1) cells. To replace aged AT1 cells, AT2 cells undergo self-renewal to maintain alveolar epithelial homeostasis [1]. The regenerative potential of AT2 cells could be activated for timely recovery from lung epithelial injury [2-4], including lobectomy and infections [5, 6]. A marked suppression in fibrinolytic activity in local respiratory illnesses (e.g., inhaled smoke and aspirated gastric juice) and pulmonary complications of systemic diseases (e.g., sepsis) has been reported clinically and in animal models [7, 8]. Migration and differentiation of mesenchymal stem cells (MSCs) in inflamed tissues are regulated by dynamic fibrinolytic niche [9-12]. The proteolysis of extracellular matrix substrates by urokinase plasminogen activator (uPA) could be involved in the benefit of the fibrinolytic niche to the regeneration of skeletal muscles and fractured cartilage [13-17]. For example, uPA and plasmin, two critical components of the fibrinolytic niche, cleave epithelial sodium channels (ENaC) [18, 19]. In addition, functionally multifaceted uPA regulates alveolar epithelial function [20] and possesses an A6 motif with a high affinity to CD44 receptors [21, 22]. CD44+ AT2 cells show a higher proliferative capacity in fibrotic lungs [23]. The fibrinolytic niche in alveolar epithelial homeostasis and regeneration mediated by AT2 cells, however, has not been studied systematically. Our results have tested the potential novel contribution of uPA-PAI1-A6-CD44-ENaC cascade to the fibrinolytic niche in regulating the fate of AT2 cells.

**Paragraph Number 2 :** In summary, our study uncovers a novel role of the fibrinolytic niche in alveolarization via the uPA/A6/CD44 and the uPA/ENaC signaling pathways. This study provides new evidence for the possible prognostic relevance of the fibrinolytic niche for ARDS patients. Targeting the abnormality of the fibrinolytic niche could be a promising pharmaceutic strategy to accelerate reparative processes of injured alveolar epithelium in clinical disorders such as ARDS.

**Paragraph Number 3 :** Plau gene regulates polarization and bioelectric features of AT2 monolayers. The divergent bioelectric features in polarized AT2 monolayers, including transepithelial resistance (RT) and short-circuit currents (ISC) were measured. Plau-/- monolayers showed a lower ISC level compared to wt monolayer and diminished significantly by replacing Na+-free bath solution to inhibit Na+ ion transport (Fig. 3A-B), consistent with our previous studies in the airway epithelial cells [27]. Moreover, Plau-/- AT2 cells showed a reduced amiloride-sensitive ISC level compared to wt cells (Fig. 3C-D). However, the amiloride affinity was not altered significantly (Fig. 3D). In addition, a greater RTE value on day 5 was measured in wt monolayers compared to Plau-/- monolayer cultures (Fig. 3E). Thus, the ENaC activity seems to be regulated by the Plau gene.

### **Title:**  What does simple power law kinetics tell about our response to coronavirus pandemic?

**Extraction method:**  BERT

**Paragraph Number 1 :** To the best of my knowledge of the scientific understanding of coronavirus at the point of writing this article,

**Paragraph Number 2 :** • The effect of weather on the spread of pandemic has not been established.

**Paragraph Number 3 :** • Travel restrictions between countries are very useful in the initial phases of the pandemic but have little effect in the later phases.

### **Title:**  China's fight against COVID-19: What we have done and what we should do next?

**Extraction method:**  BERT

**Paragraph Number 1 :** To the best of our knowledge, the main strategy of "Four early": "early identification, early report, early isolation and early treatment" was put forward and concentrating patients, medical experts, and medical resources into special centers to improve the [...]


### **Title:**  Reconciling early-outbreak estimates of the basic reproductive number and its uncertainty: framework and applications to the novel coronavirus (SARS-CoV-2) outbreak

**Extraction method:**  BERT

**Paragraph Number 1 :** Since December 2019, a novel coronavirus (SARS-CoV-2) has been spreading in China and other parts of the world (World Health Organization, 2020d). Although the virus is believed to have originated from animal reservoirs (Centers for Disease Control and Prevention, 2020), the ability of SARS-CoV-2 to directly transmit between humans has posed a greater threat for its spread (World Health Organization, 2020c). As of February 27, 2020, the World Health Organization (WHO) has confirmed 82,294 cases of the coronavirus disease, including 3,664 confirmed cases in 46 different countries outside China (World Health Organization, 2020a).

**Paragraph Number 2 :** As the disease continues to spread, many researchers have already published their analyses of the outbreak as pre-prints (e.g., Bedford et al. (2020); Liu et al. (2020); Majumder and Mandl (2020); Read et al. (2020a)) and in peer-reviewed journals (e.g., Riou and Althaus (2020b); Wu et al. (2020)), focusing in particular on estimates of the basic reproductive number R0 (i.e., the average number of secondary cases generated by a primary case in a fully susceptible population (Anderson and May, 1991; Diekmann et al., 1990)). Estimates of the basic reproductive number are of interest during an outbreak because they provide information about the level of intervention required to interrupt transmission (Anderson and May, 1991), and about the potential final size of the outbreak (Anderson and May, 1991; Ma and Earn, 2006). We commend these researchers for their timely contribution and those who made the data publicly available. However, it can be difficult to compare a disparate set of estimates of R0 from different research groups (as well as the associated degrees of uncertainty) when the estimation methods and their underlying assumptions vary widely.
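
The two uses of the basic reproductive number mentioned above can be sketched with standard textbook relations. Under a simple homogeneous-mixing SIR model (an assumption made here, not the estimation framework of the paper), transmission is interrupted once a fraction 1 - 1/R0 of transmission is blocked, and the final outbreak size z (the fraction of the population eventually infected) solves z = 1 - exp(-R0 z). The function names and the example R0 value are hypothetical.

```python
import math

# Illustrative sketch under a simple homogeneous-mixing SIR assumption.
# Function names and example values are hypothetical.

def critical_control_fraction(r0: float) -> float:
    """Fraction of transmission that must be blocked to bring R below 1."""
    if r0 <= 0:
        raise ValueError("r0 must be positive")
    return max(0.0, 1.0 - 1.0 / r0)


def final_size(r0: float, tol: float = 1e-12, max_iter: int = 10_000) -> float:
    """Solve z = 1 - exp(-r0 * z) for the positive root by fixed-point iteration."""
    if r0 <= 0:
        raise ValueError("r0 must be positive")
    if r0 <= 1.0:
        return 0.0  # no large outbreak in the deterministic limit
    z = 0.5  # any starting value in (0, 1] converges to the positive root
    for _ in range(max_iter):
        z_next = 1.0 - math.exp(-r0 * z)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z


if __name__ == "__main__":
    r0 = 2.5  # hypothetical value for illustration
    print(f"critical control fraction: {critical_control_fraction(r0):.2f}")
    print(f"final outbreak size: {final_size(r0):.3f}")
```

Because both quantities rise steeply as R0 moves above 1, differences between published R0 estimates translate into noticeably different implied intervention levels, which is one reason comparing estimates across methods matters.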


# Query 5: What has been published about medical care?

### **Title:**  Guidelines for preventing respiratory illness in older adults aged 60 years and above living in long-term care: A rapid review of clinical practice guidelines

**Extraction method:**  Knowledge Graph

**Paragraph Number 1 :** Standard precautions refer to the work practices required to achieve a basic level of infection prevention and control. They apply to all residents, regardless of suspected or confirmed infection status, in all health care and long-term residential care settings. [...] and assessed as being suitable roommates.

**Paragraph Number 2 :**  Victoria State Government, 2018; Country: Australia, Sponsor: Victoria State Government Scope: guidelines apply to all residential care facilities (RCFs) in Victoria. This refers to any public or private aged care, disability services or other congruent accommodation setting in Victoria where residents are provided with personal care or health care by facility staff. Prevention and preparedness Facilities must ensure they are prepared for respiratory outbreaks prior to the start of the influenza season (March / April).

**Paragraph Number 3 :** The Infection Prevention & Control of the World Health Organization (WHO) Health Emergency Programme presented a query on preventing and managing COVID-19 in older adults aged 60 years and above living in long-term care facilities including privately paid for and publicly paid for settings with a 5-business day timeline. According to the WHO, "long-term care covers those activities undertaken by others to ensure that people with, or at risk of, a significant ongoing loss of intrinsic capacity can maintain a level of functional ability consistent with their basic rights, fundamental freedoms and human dignity" (https://www.who.int/ageing/long-term-care/WHO-LTC-series-subsaharan-africa.pdf?ua=1). Examples of long-term care include nursing homes, charitable homes, municipal homes, long-term care hospitals, long-term care facilities, skilled nursing facilities, convalescent homes, and assisted living facilities (https://www.canada.ca/en/health-canada/services/home-continuing-care/long-term-facilitiesbased-care.html).

### **Title:**  IMPACT OF VIRAL EPIDEMIC OUTBREAKS ON MENTAL HEALTH OF HEALTHCARE WORKERS: A RAPID SYSTEMATIC REVIEW

**Extraction method:**  Knowledge Graph

**Paragraph Number 1 :** The aim of this rapid systematic literature review is twofold: i) to examine the impact of health emergencies caused by a viral pandemic or epidemic on HCWs mental health; and ii) to assess the effectiveness of interventions to reduce such impact.

**Paragraph Number 2 :** Chen 2005 (27) Taiwan SARS Wards or emergency units Nurses N=128 Depression (SCR), anxiety (SCR), PTSD (IES), intrusion and avoidance (IES), somatization, interpersonal sensitivity, hostility, psychoticism (SCR)

**Paragraph Number 3 :** In this timely systematic rapid review we synthesized evidence from 61 studies examining the impact on mental health of providing frontline healthcare during infectious disease outbreaks.

### **Title:**  Transgenic mice expressing tunable levels of DUX4 develop characteristic facioscapulohumeral muscular dystrophy-like pathophysiology ranging in severity Title: FSHD-like transgenic mouse models of varying severity

**Extraction method:**  Knowledge Graph

**Paragraph Number 1 :** DUX4-FL protein was not detectable in heart or skeletal muscles from the non-recombined [...] In order to quantitate the changes in gene expression for each severity model, qRT-PCR was used to measure overall DUX4-fl mRNA levels (Figure 3A). However, we have previously shown that this assay is a poor measure of DUX4-fl transgene expression using FLExDUX4 mouse models [41], and DUX4-fl mRNA is even difficult to detect in muscle biopsies from FSHD affected subjects [16]. Since DUX4-FL functions as a transcriptional activator in both human and mouse cells [30, 68], expression of DUX4-FL direct target genes has proven to be a more accurate indicator of DUX4-FL expression levels in both species [24, 31, 41]. Therefore, in addition to DUX4-fl mRNA, the mRNA levels of two mouse homologs of DUX4-FL direct target genes, Wfdc3 and Trim36 [41, 68], were also assayed (Figure 3B and C). Detectable DUX4-fl mRNA levels were extremely low in gastrocnemius muscles from all models (Figure 3A), consistent with previous studies [41]. Interestingly, there were no significant changes detected in DUX4-fl mRNA levels between the mild, moderate, and severe models 9 days after TMX treatments, a timepoint with prominent differences in DUX4-FL protein expression (Figure 2). In contrast, both DUX4-FL target genes assayed showed significant induction in all bitransgenic animals compared with the FLExD/+ mice, indicating the presence of DUX4-FL protein. Wfdc3 and Trim36 mRNA levels are each significantly increased in muscles from the moderate and severe models compared with the mild model, and Trim36 mRNA levels are significantly increased in the severe model compared to the moderate model (Figure 3B and C). (bioRxiv preprint, not peer-reviewed, made available under a CC-BY-NC 4.0 International license: https://doi.org/10.1101/471094)

**Paragraph Number 2 :** Overall, these dose-dependent DUX4-fl FSHD-like phenotypic mouse models strongly support the DUX4 misexpression model for mediating FSHD pathogenesis [15, 18] , and provide a useful and highly flexible tool for performing FSHD preclinical testing of therapeutic approaches targeting DUX4-fl mRNA and protein. Importantly for future analyses, we have shown sex-specific differences, anatomical muscle-specific differences, and model-specific differences that must be taken into account when using these FSHD-like mice. Within a single mouse, one can assess differentially affected muscles. Studying both sexes from a cross provides more fine-tuning of effects as well, with females being slightly but significantly more affected than the males. This provides even greater flexibility and utility for the model as a tool for studying FSHD and testing potential therapeutic approaches.

**Paragraph Number 3 :** FSHD is caused by mosaic expression of DUX4-fl mRNA and its encoded protein from the normally silent DUX4 gene in a small fraction of differentiated adult skeletal myocytes [8, 19]. Previously we showed that mosaic expression of DUX4-fl in skeletal muscle of FLExDUX4 transgenic mice (or FLExD) can produce a very severe myopathy with FSHD-like pathology [41]. However, preclinical testing for different candidate FSHD therapeutics targeting DUX4-fl mRNA and protein expression in these mice will likely require different criteria, such as degree and progression of pathophysiology, dependent upon the approach. To address this issue, we generated and characterized a highly reproducible series of phenotypic FSHD-like transgenic mouse models varying in severity and pathogenic progression based on differing levels of mosaic expression of the pathogenic DUX4-fl mRNA isoform of human DUX4 in adult murine skeletal muscle. We identified the tamoxifen (TMX) inducible and skeletal muscle-specific Cre expressing transgenic mice, ACTA1-MerCreMer (or ACTA1-MCM) [48] as a strong candidate for the generation of the desired phenotypes. To test if these could be used to generate mosaic expression in skeletal muscles and to optimize TMX dosing, the ACTA1-MCM mice were crossed with R26 NZG Cre reporter mice [49] that produce readily detectable nuclear β-galactosidase (nLacZ) expression in all cells where Cre is functional in the nucleus (Figure S1).

### **Title:**  COVID-19 and maternal mental health: Are we getting the balance right? (Commentary)

**Extraction method:**  BERT

**Paragraph Number 1 :** Even though there is plenty of scientific evidences and experience-based knowledge in several fields from the previous two coronaviruses, the research on MERS-CoV and SARS-CoV and pregnancy/childbirth is still limited, especially in terms of maternal mental health. There is a critically important gap in our knowledge about how pandemics affect mothers and their babies, and how pregnant women, mothers and their families can be better supported. In February 2020, several reports were published in The Lancet stating that mental health care should be included in the national public health emergency systems and that further understanding was needed to better respond to future unexpected disease outbreaks [26] [27] [28] [29] [30] . Prenatal and postnatal mental health should be prioritised due to its pervasive short and long-term impacts on maternal, familial and fetus, infant and child biopsychosocial development [31] .

**Paragraph Number 2 :** Being pregnant and/or having a baby is, ideally, an event that is associated with joy, delight, and fulfilment, following a safe and positive pregnancy, birth, and early parenthood. However, some women and their partners can experience a range of negative emotions during this period, including anxiety and depression. Globally, the extent and adverse impacts of maternal mental health problems are increasingly recognised. As the World Health Organisation (WHO) states "virtually all women can develop mental disorders during pregnancy and in the first year after delivery". Conditions such as extreme stress, emergency and conflict situations and natural disasters can increase risks for specific mental health disorders [1] . Maternal and parental mental health problems are associated with longer term risks for the mother and partner, and for their children [2] [3] [4] [5] . This raises the question: how are we safeguarding the short-and longer-term mental health of pregnant women and their partners in the age of coronavirus? Apart from the overall population level pandemic-related stress, there is still a limited formal evidence-base about the nature and clinical consequences of the various versions of coronavirus (COVID-19 or SARS-CoV-2 or HCoV-19) for a pregnant woman. There is even less information about the mental health impacts consequent to self-isolation, living in a household with an affected person, limited access to goods/services and to routine or emergency health and social care. For women (and especially for those living in poverty, in poor or cramped housing), these issues may be exacerbated if they are expected to be the main carers for elderly or infirm relatives, or for young children, while living with multiple family members in confined spaces.

**Paragraph Number 3 :**  What is currently known about COVID-19 [15]. The key findings from these studies have been captured in a number of rapid COVID-19 guidelines produced by professional and global authorities. In summary, they are as follows:

### **Title:**  What does simple power law kinetics tell about our response to coronavirus pandemic?

**Extraction method:**  BERT

**Paragraph Number 1 :**  [...] citizens to avoid physical contacts, (2) limiting social gathering, (3) closing academic institutions, (4) strict lockdown with only essential services open. Finally, many countries have adopted random testing in populations to identify affected people. (medRxiv preprint, not peer-reviewed: https://doi.org/10.1101/2020.04.03.20051797)

**Paragraph Number 2 :** To the best of my knowledge of the scientific understanding of coronavirus at the point of writing this article

**Paragraph Number 3 :**  • The effect of weather on the spread of pandemic has not been established.


### **Title:** The early scientific literature response to the novel Coronavirus outbreak: who published what?

**Extraction method:**  BERT

**Paragraph Number 1 :** Recent events highlight how emerging and re-emerging pathogens are actually becoming global challenges for public health. [1] Coronaviruses are enveloped RNA viruses which are broadly distributed either in humans or in a vast majority of other mammals and birds. These viruses have the possibility to cause respiratory, enteric, hepatic, and neurologic diseases. [2, 3] Coronaviruses are highly prevalent in many species. They have a large genetic diversity. Given their RNA, they are susceptible to frequent recombination and mutations of their genomes. In a context in which there are increasing human-animal interactions, novel coronaviruses are very likely to emerge periodically. If a newly created cross-species pathogen acquires the ability to infect humans or to be transmitted human to human it can lead to occasional spillover events and epidemics. [4, 5, 6] In the past years two other strains of Coronavirus, severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV), have emerged as potential public health worldwide threats. [5] SARS-CoV was the causal agent of the severe acute respiratory syndrome outbreaks in 2002 and 2003 in Guangdong Province, China; [7, 8, 9] MERS-CoV was the pathogen responsible for severe respiratory disease outbreaks in 2012 in the Middle East. [10] On 31 December 2019, the World Health Organization (WHO) China Country Office was informed of cases of pneumonia of unknown etiology detected in Wuhan City, located in the Hubei Province, China, associated with exposures in a seafood and wet wholesale market in the same city. A new type of coronavirus was isolated on 7 January 2020. [11] On 30 January 2020 WHO declared the outbreak to be a Public Health Emergency of International Concern (PHEIC).
[12] This novel coronavirus suddenly turned out to be a global health concern for a disease, called Coronavirus disease 2019 (COVID-19), [13] which was characterized as a pandemic by WHO on 11 March 2020. [14] Starting from the second-half of January 2020, scientific literature has been particularly focused on the description of this new viral outbreak; the main topics addressed were the epidemiological, clinical and virological aspects as well as the possible public health choices necessary to contain the spread of the disease. Nevertheless, several aspects are still unclear and have not been thoroughly explored, leaving grey areas in our knowledge of the disease and of the outbreak. The aim of this paper is to perform a bibliometric analysis on the first papers published in the early stages of the SARS-CoV-2 outbreak, in order to give a glimpse to the researchers of "who published what" at the very beginning of this Public Health Emergency of International Concern.

**Paragraph Number 2 :**  The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. medRxiv preprint doi: https://doi.org/10.1101/2020.03.25.20043315


# Query 6: What do we know about virus genetics, origin, and evolution?

### **Title:**  When Darkness Becomes a Ray of Light in the Dark Times: Understanding the COVID-19 via the Comparative Analysis of the Dark Proteomes of SARS-CoV-2, Human SARS and Bat SARS-Like Coronaviruses

**Entity extraction method:**  Knowledge Graph

**Paragraph Number 1:**  Non-structural protein 2 (Nsp2). This protein functions by disrupting the host survival pathway via interaction with the host proteins Prohibitin-1 and Prohibitin-2 [128]. Reverse genetic deletion in the coding sequence of Nsp2 of the SARS virus attenuated little viral growth and replication and allowed the recovery of mutant virulent viruses. This indicates the dispensable nature of the Nsp2 protein for SARS viruses [129]. The sequence identity of the Nsp2 protein from SARS-CoV-2 with Nsp2s of Human SARS CoV and Bat CoV amounts to 68.34% and 68.97%, respectively (Supplementary Figure S2A). We have estimated the mean PPIDs of Nsp2s of SARS-CoV-2, Human SARS CoV, and Bat CoV to be 5.17%, 2.04%, and 2.03% respectively (see Table 3). The per-residues predisposition for the intrinsic disorder of Nsp2s from SARS-CoV-2, Human SARS CoV, and Bat CoV are depicted in Figures 20A, 20B, and 20C. According to this analysis, the following regions in Nsp2 proteins are predicted to be disordered, residues 570-595 (SARS-CoV-2), residues 110-115 (Human SARS), and residues 112-116 (Bat CoV). As listed in Table 2, and Supplementary Tables 7 and 8, Human SARS CoV does not contain MoRF while SARS-CoV-2 and Bat CoV have an N-terminally located MoRF region predicted by MoRFchibi_web. (bioRxiv preprint, not peer-reviewed: https://doi.org/10.1101/2020.03.13.990598)

**Paragraph Number 2:**  The emergence of new viruses and associated deaths around the globe represent one of the major concerns of modern times. Despite its pandemic nature, there is very little information available in the public domain regarding the structures and functions of SARS-CoV-2 proteins. Based on its similarity with Human SARS CoV and Bat CoV, the published reports have suggested the functions of SARS-CoV-2 proteins. In this study, we utilized information available on SARS-CoV-2 genome and translated proteome from GenBank, and carried out a comprehensive computational analysis of the prevalence of the intrinsic disorder in SARS-CoV-2 proteins. Additionally, a comparison was also made with proteins from close relatives of SARS-CoV-2 from the same group of beta coronaviruses, Human SARS CoV and Bat CoV. Our analysis revealed that in these three CoVs, the N proteins are highly disordered, possessing the PPID values of more than 60%. These viruses also have several moderately disordered proteins, such as Nsp8, Orf6, and Orf9b. Although other proteins have shown lower disorder content, almost all of them contain at least some IDPRs, and all CoV proteins [...]

**Paragraph Number 3:**  Supplementary Figures S1. Multiple sequence alignment of structural proteins of all three studied coronaviruses are generated using Clustal Omega. The aligned images are created using Esprit 3.0. Figure S1A . MSA of SARS-CoV-2, Human SARS, and Bat CoV spike glycoproteins. Figure S1B . MSA of SARS-CoV-2, Human SARS, and Bat CoV Nucleoproteins.
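
The sequence identity figures quoted in Paragraph Number 1 above (68.34% and 68.97%) are percentages of matching positions between aligned sequences. A minimal sketch of that arithmetic follows. It is not the authors' pipeline (the quoted text names Clustal Omega for the alignment); the function name `percent_identity`, the choice of alignment length as the denominator, and the toy sequences are all assumptions made for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of alignment columns that hold the same residue in both rows.

    Assumption: the denominator is the full alignment length (gap columns
    included). Other tools divide by the shorter sequence or by the number
    of non-gap columns, which gives slightly different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as a match only if both residues agree and are not gaps.
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # Hypothetical toy alignment, not real Nsp2 sequences.
    print(round(percent_identity("ACDE-G", "ACDQFG"), 2))
```

With the toy alignment, four of six columns match, so the function returns roughly 66.67.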

### **Title:**  Mutation of Ebola virus VP35 Ser129 uncouples interferon antagonist and replication functions

**Entity extraction method:**  Knowledge Graph

**Paragraph Number 1:**  promoter. HEK 293T cells were co-transfected with pGL-IFN-β-luc, pRL-TK, and various dilutions of a vector control, VP35-WT, or VP35-S129A. Twenty-four h post-transfection, cells

**Paragraph Number 2:**  profile of S129A suggests that this mutation introduces some local conformational change

**Paragraph Number 3:**  We further investigated the effect of S129A on VP35 oligomerization domain conformation by computational modeling. Starting from the available WT crystal structure, a Ser-to-Ala substitution was made and the oligomerization domain was subjected to energy minimization.

### **Title:**  What does simple power law kinetics tell about our response to coronavirus pandemic?

**Entity extraction method:**  BERT

**Paragraph Number 1:**  To the best of my knowledge of the scientific understanding of coronavirus at the point of writing this article,

**Paragraph Number 2:**  • The effect of weather on the spread of pandemic has not been established.

**Paragraph Number 3:**  • Travel restrictions between countries are very useful in the initial phases of the pandemic but have little effect in the later phases.

### **Title:**  How lethal is the novel coronavirus, and how many undetected cases there are? The importance of being tested

**Entity extraction method:**  BERT

**Paragraph Number 1:**  The main result of this work is that, due to the reduced number of performed tests, the vast majority of the coronavirus infections went undetected in most countries. These undetected infections had the potential to spread freely in the population, giving rise to the rapid exponential phase of the epidemics before the lockout measures.

**Paragraph Number 2:**  The pandemic spread of the novel coronavirus SARS-CoV2 is causing thousands of fatalities, creating a tremendous threat to global health [1]. In this situation, the society is strongly concerned by the lethality and the true extension of the pandemics, see for instance [2, 3]. In the media, but also in some non-specialist scientific circles, it is frequent to find that lethality is estimated as the cumulative number of deaths divided by the cumulative number of confirmed cases, data that are easily accessible to everyone on the internet. This quantity changes rapidly and systematically both in time and in space, generating doubts and concerns regarding its interpretation. Of course, epidemiologists can estimate the lethality rate in a less biased way with additional data on the dates in which the infections of the persons that die or recover were detected, and better sampling of the population. However, these data are not easily accessible for all countries. Therefore, here I set up to obtain an estimate of the lethality rate and the detection rate only based on the data reported for all countries in the Johns Hopkins University database [4]. This extrapolation of the data shows that, when the time course of the disease is controlled for, the lethality estimated for all countries for which reliable data are available depends only on the intensity of the performed tests, i.e. the number of tests divided by the number of positive cases. Extrapolating to infinite number of tests, I estimate a lethality rate of 0.012 ± 0.012, which is very noisy but consistent with the estimate of 0.01 that is frequently mentioned in the media.
Inverting the relationship between apparent death rate and number of tests, it is possible to estimate that in all countries except South Korea and perhaps Germany, at least at the beginning of the spread, the vast majority of positive cases went undetected, with more than 90 percent undetected cases in some countries such as Italy. These cases that went out of the radar and were not isolated are likely to have contributed strongly to the rapid spread of the virus. Finally, I propose to adopt the ratio between the cumulative number of recovered and the cumulative number of deceased persons, as a potential indicator that can anticipate whether the spread of the epidemics is halting.

**Paragraph Number 3:**  This result is consistent with recent reports from China [8, 9], and with the high rates of asymptomatic but positive cases observed in Vo Euganeo (Italy) [10], in the cruise ship Diamond Princess [11] and among children [12] (see also [13]). The only countries that were able to keep the fraction of undetected infections low appear to be South Korea and Germany, which explains their better ability to control the [...] (medRxiv preprint, made available under a CC-BY-ND 4.0 International license.)

### **Title:**  How does the outbreak of 2019-nCoV spread in mainland China? A retrospective analysis of the dynamic transmission routes 

**Entity extraction method:**  BERT

**Paragraph Number 1:**  Although the exact origin is still debatable, the current shock, namely 2019-nCoV, has taken place in Wuhan, the capital city of Hubei province in mainland China. As the fourth large-scale outbreak of coronaviruses, 2019-nCoV is spreading quickly to all provinces in China and has recently become a world-wide epidemic. As a significant complement to existing research, this study employs a tvSVAR model and retrospectively investigates and [...] (medRxiv preprint, made available under a CC-BY-NC-ND 4.0 International license.)

**Paragraph Number 2:**  Coronaviruses are single-stranded, enveloped and positive-sense RNA viruses, which are spherical in shape and have petal-like spines [1] . Firstly discovered and identified in 1965 [2] , coronaviruses have not caused large-scale outbreaks until the 2003 SARS epidemic in China, followed by 2012 MERS in Saudi Arabia and 2015 MERS in South Korea [3] . Although the exact origin remains debatable [4] , the fourth outbreak has taken place in Hubei province of China in December 2019 and rapidly spread out nationally [5] [6] [7] [8] [9] [10] . On January 10, 2020, the WHO officially named this new coronavirus as the 2019 novel coronavirus, or 2019-nCoV, and released a comprehensive interim guidance on dealing with this new virus for all countries [11] . As of February 19, there are 75,101 confirmed cases (including 2,121 death report) in China, among which over 80% are from Hubei and over 50% are from Wuhan, the capital city of Hubei [12] .

**Paragraph Number 3:**  During this anti-epidemic war, statistic and mathematical modeling plays a non-negligible role. Among the emerging large volume of studies, the classical susceptible exposed infectious recovered (SEIR) model with its various extensions is the most popular method [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38]. SEIR family models are effective in exploring the epidemic characteristics of the outbreak, forecasting the inflection point and ending time, and deciding the measures to curb the spreading. Despite this, they are less appropriate in identifying transmission routes of the 2019-nCoV outbreak, which is also not thoroughly investigated in existing literature.

# Query 7: What do we know about non-pharmaceutical interventions?
**Here the Knowledge Graph did not give any results**

### **Title:**  What does simple power law kinetics tell about our response to coronavirus pandemic?

**Entity extraction method:**  BERT

**Paragraph Number 1:**  To the best of my knowledge of the scientific understanding of coronavirus at the point of writing this article,

**Paragraph Number 2:**  • The effect of weather on the spread of pandemic has not been established.

**Paragraph Number 3:**  • Travel restrictions between countries are very useful in the initial phases of the pandemic but have little effect in the later phases.

### **Title:**  If long-term suppression is not possible, how do we minimize mortality for COVID-19 and other emerging infectious disease outbreaks?

**Entity extraction method:**  BERT

**Paragraph Number 1:**  Arguably, even more important than minimizing cases is minimizing deaths. For COVID-19, evidence so far suggests that mortality among children is low, while the elderly are at a much higher risk. This suggests that to reduce overall mortality, interventions should [...] (medRxiv preprint, made available under a CC-BY-NC 4.0 International license.)

**Paragraph Number 2:**  This illustrates the importance of factoring in mortality among specific groups when deciding on the type of interventions to implement. Of course, it is worth emphasizing that there is still large uncertainty about the combined impact of our available interventions on the spread of COVID-19. The speed with which COVID-19 initially overwhelmed health systems in many countries, as well as the ability of several countries, to drive cases down successfully, suggests that the current best approach is to implement all feasible interventions against COVID-19 to suppress the disease. As the situation is monitored, it will become clear if we can accomplish elimination. If this is unlikely, and we are in a situation where interventions cannot be sustained long enough to prevent a resurgent epidemic, some interventions should be relaxed. Mitigation should then focus on allowing 'controlled spread' toward herd immunity in such a way that the most vulnerable populations are protected most, and the health care system remains functional. This might mean re-opening schools before re-starting other activities mainly frequented by adults or the elderly.

### **Title:**  How lethal is the novel coronavirus, and how many undetected cases there are? The importance of being tested

**Entity extraction method:**  BERT

**Paragraph Number 1:**  The main result of this work is that, due to the reduced number of performed tests, the vast majority of the coronavirus infections went undetected in most countries. These undetected infections had the potential to spread freely in the population, giving rise to the rapid exponential phase of the epidemics before the lockout measures.

**Paragraph Number 2:**  The pandemic spread of the novel coronavirus SARS-CoV2 is causing thousands of fatalities, creating a tremendous threat to global health [1]. In this situation, the society is strongly concerned by the lethality and the true extension of the pandemics, see for instance [2, 3]. In the media, but also in some non-specialist scientific circles, it is frequent to find that lethality is estimated as the cumulative number of deaths divided by the cumulative number of confirmed cases, data that are easily accessible to everyone on the internet. This quantity changes rapidly and systematically both in time and in space, generating doubts and concerns regarding its interpretation. Of course, epidemiologists can estimate the lethality rate in a less biased way with additional data on the dates in which the infections of the persons that die or recover were detected, and better sampling of the population. However, these data are not easily accessible for all countries. Therefore, here I set up to obtain an estimate of the lethality rate and the detection rate only based on the data reported for all countries in the Johns Hopkins University database [4]. This extrapolation of the data shows that, when the time course of the disease is controlled for, the lethality estimated for all countries for which reliable data are available depends only on the intensity of the performed tests, i.e. the number of tests divided by the number of positive cases. Extrapolating to infinite number of tests, I estimate a lethality rate of 0.012 ± 0.012, which is very noisy but consistent with the estimate of 0.01 that is frequently mentioned in the media.
Inverting the relationship between apparent death rate and number of tests, it is possible to estimate that in all countries except South Korea and perhaps Germany, at least at the beginning of the spread, the vast majority of positive cases went undetected, with more than 90 percent undetected cases in some countries such as Italy. These cases that went out of the radar and were not isolated are likely to have contributed strongly to the rapid spread of the virus. Finally, I propose to adopt the ratio between the cumulative number of recovered and the cumulative number of deceased persons, as a potential indicator that can anticipate whether the spread of the epidemics is halting.

**Paragraph Number 3:**  This result is consistent with recent reports from China [8, 9], and with the high rates of asymptomatic but positive cases observed in Vo Euganeo (Italy) [10], in the cruise ship Diamond Princess [11] and among children [12] (see also [13]). The only countries that were able to keep the fraction of undetected infections low appear to be South Korea and Germany, which explains their better ability to control the [...]

# Query 8: What do we know about diagnostics and surveillance?

### **Title:**  Alpha-ketoamides as broad-spectrum inhibitors of coronavirus and enterovirus replication Structure-based design, synthesis, and activity assessment

**Entity extraction method:**  Knowledge Graph

**Paragraph Number 1:**  A number of capped dipeptidyl α-ketoamides have been described as inhibitors of the norovirus 3C-like protease [41]. These were optimized with respect to their P1' substituent, whereas P2 was isobutyl in most cases and occasionally benzyl. The former displayed IC50 values one order of magnitude lower than the latter, indicating that the S2 pocket of the norovirus 3CL protease is fairly small. Although we did not include the norovirus 3CLpro in our study, expanding the target range of our inhibitors to norovirus is probably a realistic undertaking. While our study was underway, Zeng et al. [42] published a series of α-ketoamides as inhibitors of the EV-A71 3Cpro. These authors mainly studied the structure-activity relationships of the P1' residue and found small alkyl substituents to be superior to larger ones. Interestingly, they also reported that a six-membered δ-lactam in the P1 position led to 2-3 times higher activities, compared to the five-membered γ-lactam. At the same time, Kim et al. [43] described a series of five α-ketoamides with P1' = cyclopropyl that showed submicromolar activity against EV-D68 and two HRV strains.

**Paragraph Number 2:**  The most prominent α-ketoamide drugs are probably telaprevir and boceprevir, peptidomimetic inhibitors of the HCV NS3/4A protease [39, 40], which have helped revolutionize the treatment of chronic HCV infections. For viral cysteine proteases, α-ketoamides have only occasionally been described as inhibitors and few systematic studies have been carried out.

**Paragraph Number 3:**  Occasionally, individual α-ketoamides have been reported in the literature as inhibitors of both the enterovirus 3C protease and the coronavirus main protease. A single capped dipeptidyl α-ketoamide, Cbz-Leu-GlnLactam-CO-CO-NH-iPr, was described that inhibited the recombinant transmissible gastroenteritis virus (TGEV) and SARS-CoV Mpros as well as human rhinovirus and poliovirus 3Cpros in the one-digit micromolar range [44]. Coded GC-375, this compound showed poor activity in cell culture against EV-A71 though (EC50 = 15.2 µM), probably because P2 was isobutyl. As we have shown here, an isobutyl side-chain in the P2 position of the inhibitors is too small to completely fill the S2 pocket of the EV-A71 3Cpro and the CVB3 3Cpro.

### **Title:**  Automated collection of pathogen-specific diagnostic data for real-time syndromic epidemiological studies

**Entity extraction method:**  Knowledge Graph

**Paragraph Number 1:**  It is not always possible to accurately diagnose the causative agents of most infectious diseases from symptoms alone due to overlapping clinical presentation. Thus, to achieve maximal utility, infectious disease surveillance systems should move beyond syndrome-based reporting and be pathogen-specific and comprehensive, reporting on as many of the common pathogens for a particular syndrome as possible. Sensitive and specific automated molecular diagnostic systems that detect up to four different pathogens in a single sample have been available from in vitro diagnostic (IVD) manufacturers for some time [31] [32] [33]. However adoption of IVD platforms with broad multiplexing capability has become widespread only in the last few years. Commercially available systems that can detect most of the known etiological agents for respiratory, gastrointestinal and other multi-pathogen syndromes [34] [35] [36] include the BioFire (Salt Lake City, UT) FilmArray® System [37]; the GenMark (Carlsbad, CA) eSensor XT-8® [38] and ePlex® [39]; and the Luminex (Austin, TX) xTAG® [40], nxTag® [41] and Verigene® systems [42]. Multi-analyte diagnostic tests provide the raw data needed for real-time pathogen-specific surveillance but there remain a number of obstacles to sharing these results (reviewed in [43]).

**Paragraph Number 2:**  The obstacles largely center on information privacy and network security. A real-time surveillance system using diagnostic test results requires safeguards for protected health information (PHI). Medical records and devices have become attractive targets for cyber attackers in recent years [44], which has made hospitals and clinics reluctant to connect their Local Area Networks (LANs) to the Internet. Releasing patient test results requires the removal of PHI or authorization from the patient. Studies have shown that de-identification of patient data is not as simple as removing all specific identifiers because, in the age of big data, under the right circumstances it is possible to re-associate patients and their data using publicly available information [45] [46] [47] [48].

**Paragraph Number 3:**  Because the Expert Determination study established that no PHI will be disclosed by the Trend data export algorithm, Data Use Agreements (DUA), rather than Business Associates Agreements (BAA, see Methods for the difference between a DUA and a BAA) were executed with each of the collaborating institutions (listed in Table 3). The DUAs define for the clinical laboratory how BioFire will manage and make use of the Trend data. The Trend client software residing on the FilmArray computer queries the FilmArray test result local database and exports the results to an Amazon Web Services (AWS) database (see Methods). The Trend client software performs de-identification on the FilmArray computer prior to export as detailed in Table 2. Health care providers are granted access to their institutions' Trend data by the Laboratory Director. Since web access to view the data is restricted to the local site, de-identification of geographic indicators is not required. However, in implementation of the public Trend website, which presents the national syndromic test surveillance (Figure 1), we have further aggregated the data with respect to geographic origin and obfuscated the date of the test (see Methods). Since only de-identified data are exported from the clinical institutions, no PHI is sent to or stored on the cloud server.

### **Title:**  Capturing diverse microbial sequence with comprehensive and scalable probe design

**Entity extraction method:**  Knowledge Graph

**Paragraph Number 1:**  To enhance capture of diverse targets, we instead need rigorous methods, implemented in publicly available software, that can be systematically applied to create and rapidly update optimally designed probe sets. These methods ought to comprehensively cover known sequence diversity, ideally with theoretical guarantees, especially given the exceptional variability of viral genomes. Moreover, as the diversity of known taxa expands and novel species continue to be identified 26, 27 , probe sets designed by such methods must also be dynamic and scalable to keep pace with these changes. These methods should be applicable to any taxa, including all microbes. Several existing approaches to probe design for non-microbial targets [28] [29] [30] strive to meet some of these goals but are not designed to be applied against the extensive diversity seen within and across microbial taxa.

**Paragraph Number 2:**  Despite the potential of capture, there are challenges and practical considerations that are present with the use of any probe set. Notably, as capture requires additional cycles of amplification, computational analyses should properly account for duplicate reads due to amplification; the inclusion of unique molecular identifiers 54, 55 could improve determination of unique fragments. Also, quantifying the sensitivity and specificity of capture with comprehensive probe sets is challenging (as it is for metagenomic sequencing more broadly) because doing so would necessitate obtaining viral genomes for the hundreds of targeted species, and false positives are likely to be due to components of sequencing and classification that are unrelated to capture (e.g., contamination in sample processing or read misclassifications). For sequencing some ultra low input samples, targeted amplicon approaches may be faster and more sensitive 38, but genome size, sequence heterogeneity, and the need for prior knowledge of the target species can limit the feasibility and sensitivity of these approaches 1, 56, 57. Similarly, for molecular diagnostics of particular pathogens, many commonly used assays such as qRT-PCR and rapid antigen tests are likely to be faster and less expensive than metagenomic sequencing. Capture does increase the preparation cost and time per sample compared to unbiased metagenomic sequencing, but this is offset by reduced sequencing costs through increased sample pooling and/or lower-depth sequencing 1 (Supplementary Table 9).

**Paragraph Number 3:**  Consider a large set of input sequences that encompass a diverse set of taxa (e.g., hundreds of viral species). We could run CATCH, as described above, on a single choice of parameters θ_d such that the number of probes in s(d, θ_d) is feasible for synthesis. However, this can lead to a poor representation of taxa in the diverse probe set; it can become dominated by probes covering taxa that have more genetic diversity (e.g., HIV-1). Furthermore, it can force probes to be designed with relaxed assumptions about hybridization across all taxa. To alleviate these issues, we allow different choices of parameters governing hybridization for different subsets of input sequences, so that some can have probes designed with more relaxed assumptions than others.

### **Title:**  What does simple power law kinetics tell about our response to coronavirus pandemic?

**Entity extraction method:**  BERT

**Paragraph Number 1:**  citizens to avoid physical contacts, (2) limiting social gathering, (3) closing academic institutions, (4) strict lockdown with only essential services open. Finally, many countries have adopted random testing in populations to identify affected people. (medRxiv preprint, not peer-reviewed: https://doi.org/10.1101/2020.04.03.20051797)

**Paragraph Number 2:**  To the best of my knowledge of the scientific understanding of coronavirus at the point of writing this article,

**Paragraph Number 3:**  • The effect of weather on the spread of pandemic has not been established.

### **Title:**  How lethal is the novel coronavirus, and how many undetected cases there are? The importance of being tested

**Entity extraction method:**  BERT

**Paragraph Number 1:**  The main result of this work is that, due to the reduced number of performed tests, the vast majority of the coronavirus infections went undetected in most countries. These undetected infections had the potential to spread freely in the population, giving rise to the rapid exponential phase of the epidemic before the lockdown measures.

**Paragraph Number 2:**  The pandemic spread of the novel coronavirus SARS-CoV-2 is causing thousands of fatalities, creating a tremendous threat to global health [1]. In this situation, society is strongly concerned by the lethality and the true extent of the pandemic, see for instance [2, 3]. In the media, but also in some non-specialist scientific circles, it is frequent to find that lethality is estimated as the cumulative number of deaths divided by the cumulative number of confirmed cases, data that are easily accessible to everyone on the internet. This quantity changes rapidly and systematically both in time and in space, generating doubts and concerns regarding its interpretation. Of course, epidemiologists can estimate the lethality rate in a less biased way with additional data on the dates on which the infections of the persons that die or recover were detected, and better sampling of the population. However, these data are not easily accessible for all countries. Therefore, here I set out to obtain an estimate of the lethality rate and the detection rate based only on the data reported for all countries in the Johns Hopkins University database [4]. This extrapolation of the data shows that, when the time course of the disease is controlled for, the lethality estimated for all countries for which reliable data are available depends only on the intensity of the performed tests, i.e. the number of tests divided by the number of positive cases. Extrapolating to an infinite number of tests, I estimate a lethality rate of 0.012 ± 0.012, which is very noisy but consistent with the estimate of 0.01 that is frequently mentioned in the media.
Inverting the relationship between apparent death rate and number of tests, it is possible to estimate that in all countries except South Korea and perhaps Germany, at least at the beginning of the spread, the vast majority of positive cases went undetected, with more than 90 percent undetected cases in some countries such as Italy. These cases that went under the radar and were not isolated are likely to have contributed strongly to the rapid spread of the virus. Finally, I propose to adopt the ratio between the cumulative number of recovered and the cumulative number of deceased persons as a potential indicator that can anticipate whether the spread of the epidemic is halting.

**Paragraph Number 3:**  This result is consistent with recent reports from China [8, 9], and with the high rates of asymptomatic but positive cases observed in Vo Euganeo (Italy) [10], in the cruise ship Diamond Princess [11] and among children [12] (see also [13]). The only countries that were able to keep the fraction of undetected infections low appear to be South Korea and Germany, which explains their better ability to control the (medRxiv preprint, made available under a CC-BY-ND 4.0 International license.)

### **Title:**  When will the Covid-19 epidemic fade out?

**Entity extraction method:**  BERT

**Paragraph Number 1:**  Humanity has always been afflicted by the onset of epidemics. Owing to the absence of vaccines, the slow connections between people and the isolation of the infectious from the susceptible were the only remedies against their devastating effects. Over the last two decades, there have been three major epidemics due to human-transmissible viruses, namely avian influenza, Ebola and SARS, but fortunately the advanced ability of the scientific world has been able to contain their effects.

**Paragraph Number 2:**  A dangerous impact of infectious diseases on populations can arise from the emergence and spread of novel pathogens in a population, or from a sudden change in the epidemiology of an existing pathogen. Today, due to the absence of a vaccine and to a highly globalized society, the Covid-19 epidemic is frightening the world, raising a series of important questions. Among these, the most common is: when will the epidemic die down? While the epidemic is spreading, this is a difficult question to answer: in addition to understanding the early transmission dynamics of the infection, control measures should also be accounted for, which may significantly affect the trend of infection.

**Paragraph Number 3:**  Using this model and the available statistical data, we attempt predictions on the Covid-19 epidemic trend.

# Query 9: What has been published about information sharing and inter-sectoral collaboration?

### **Title:**  Evolved sequence features within the intrinsically disordered tail influence FtsZ assembly and bacterial cell division

**Entity extraction method:**  Knowledge Graph

**Paragraph Number 1:**  Previous studies showed that deletion of the CTL leads to aberrant FtsZ assembly that compromises Z-ring formation and cell division (Buske and Levin, 2013). Replacing the wildtype (WT) CTL with a scrambled sequence variant of the linker preserves Z-ring formation and cell division. In contrast, replacing the CTL with a rigid alpha-helical domain from human beta-catenin yielded diffuse FtsZ puncta that compromise Z-ring formation and cell division (Buske and Levin, 2013). This work underscored the importance of the disordered CTL in FtsZ assembly and Z-ring formation. It also raised several questions regarding the contribution of the intrinsically disordered CTL to the assortment of functions coordinated by FtsZ. Here, we pursue answers to some of these questions by leveraging insights that have emerged through systematic studies of sequence-encoded conformational preferences of intrinsically disordered proteins / regions (Das et al., 2015).

**Paragraph Number 2:**  lowered sequence complexity. Sequence complexity is also lowered by using a simplified amino acid composition as we have done with the CTT sequences that are based on a reduced amino acid library. In eukaryotic systems, ubiquitinated substrates engage productively with the proteasome if and only if they have low sequence complexity tags at their termini (Kraut et al., 2012; Schrader et al., 2011). Accordingly, we quantified the sequence complexities of each of the designed CTTs based on the WT and reduced amino acid alphabets. Here, we use the Lempel-Ziv (LZ) measure of complexity (Lempel and Ziv, 1976) that has been used in previous analysis of IDPs / IDRs (Romero et al., 2000). Interestingly, we find that the variants that have low cellular levels in B. subtilis (Figure 3B & 6B) also have the lowest LZ sequence complexity in their CTT sequences (Figure 6E). The sequence complexity of CTT sequences is lowered either by increasing or decreasing κ within the CTT, for the WT amino acid composition or by using a reduced alphabet for the amino acid composition. (bioRxiv preprint, not peer-reviewed: https://doi.org/10.1101/301622)
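
To make the Lempel-Ziv measure mentioned above concrete, here is a minimal sketch of the 1976 parsing: the sequence is read left to right and cut into the shortest phrases that cannot be copied from the text seen so far, and the complexity is the number of phrases. The function name and the two example peptides are hypothetical, and the paper may normalize or implement the measure differently.

```python
def lz76_complexity(seq: str) -> int:
    """Count the phrases in the Lempel-Ziv (1976) exhaustive parsing of seq."""
    n = len(seq)
    i = 0          # start index of the phrase currently being built
    phrases = 0
    while i < n:
        length = 1
        # Extend the phrase while it can still be copied from earlier text.
        # The search window ends one character before the phrase end, so the
        # copy may overlap the phrase itself, as in the original definition.
        while i + length <= n and seq[i:i + length] in seq[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


# Hypothetical sequences: a two-letter repeat and a more varied string.
repetitive = "EKEKEKEKEKEK"
varied = "EKDRSTGQNAEK"
print(lz76_complexity(repetitive), lz76_complexity(varied))
```

A repetitive sequence parses into few phrases because most of it can be copied from its own prefix, so a reduced amino acid alphabet tends to lower the count.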

**Paragraph Number 3:**  Calorimetric measurements suggest that the formation of FtsZ assemblies requires the crossing of a critical concentration threshold (Caplan and Erickson, 2003; Chen et al., 2005). This apparent cooperativity is at odds with the observation that FtsZ forms stable single-stranded filaments, which is the defining characteristic of isodesmic assembly (Oosawa and Kasai, 1962). Numerous models have been developed to describe how FtsZ assembles cooperatively. These models suggest that FtsZ subunits might undergo conformational changes to facilitate the population of a high affinity state that favors polymerization (Miraldi et al., 2008).

### **Title:**  A SARS-CoV-2-Human Protein-Protein Interaction Map Reveals Drug Targets and Potential Drug-Repurposing

**Abbreviations:**  HC-PPIs: high-confidence protein-protein interactions; PPIs: protein-protein interactions; AP-MS: affinity purification-mass spectrometry

**Entity extraction method:**  Knowledge Graph

**Paragraph Number 1:**  Mycophenolic acid Sanglifehrin A

**Paragraph Number 2:**  There are several mechanistically interesting, and potentially disease-relevant, drug-target interactions revealed in the chemoinformatic network (Fig. 5a). Among them, the well-known chemical probe, Bafilomycin A1, is a potent inhibitor of the V1-ATPase, subunits of which interact with Nsp6 and M. Bafilomycin's inhibition of this cotransporter acts to prevent the acidification of the lysosome, inhibiting autophagy and endosome trafficking pathways, which may impact the viral life-cycle. Similarly, drugs exist to target several well-known epigenetic regulators prominent among the human interactors, including HDAC2, BRD2 and BRD4, which interact with viral proteins nsp5 and E, respectively (Figs. 3 and 5a). The approved drug Valproic acid (an anticonvulsant) and the pre-clinical candidate Apicidin inhibit HDAC2 with affinities of 62 μM and 120 nM, respectively. Clinical compounds ABBV-744 and CPI-0610 act on BRD2/4, with an affinity of 2 nM or 39 nM, respectively; several preclinical compounds also target bromodomain-containing proteins (Table 1a,b). As a final example, we were intrigued to observe that the SARS-CoV-2 Nsp6 protein interacts with the Sigma receptor, which is thought to regulate ER stress response 71. Similarly, the Sigma2 receptor interacted with the viral protein orf9. Both Sigma1 and Sigma2 are promiscuous receptors that interact with many non-polar, cationic drugs. We prioritized several of these drugs based on potency or potential disease relevance, including the antipsychotic Haloperidol, which binds in the low nM range to both receptors 72, and Chloroquine, which is currently in clinical trials for COVID-19 and has mid-nM activity vs the Sigma1 receptor, and low μM activity against the Sigma2 receptor.
Because many patients are already treated with drugs that have off-target impact on Sigma receptors, associating clinical outcomes accompanying treatment with these drugs may merit investigation, a point to which we return. Finally, in addition to the druggable host factors, a few of which we have highlighted here, the SARS-CoV-2-human interactome reveals many traditionally "undruggable" targets. Among these, for instance, are components of the centriole such as CEP250, which interacts with the viral Nsp13. Intriguingly, a very recent patent disclosure revealed a natural product, WDB002, that directly and specifically targets CEP250. As a natural product, WDB002 would likely be harder to source than the molecules on which we have focused here, but may well merit investigation. Similarly, other "undruggable" targets may be revealed to have compounds that could usefully perturb the viral-human interaction network, and act as leads to therapeutics.

**Paragraph Number 3:**  Polymerase (RdRP) inhibitor remdesivir [9] [10] [11] , and recent data suggests a new nucleotide analog may be effective against SARS-CoV-2 infection in laboratory animals 12 . Clinical trials on several vaccine candidates are also underway 13 , as well as trials of repurposed host-directed compounds inhibiting the human protease TMPRSS2 14 . We believe there is great potential in systematically exploring the host dependencies of the SARS-CoV-2 virus to identify other host proteins already targeted with existing drugs. Therapies targeting the host-virus interface, where mutational resistance is arguably less likely, could potentially present durable, broad-spectrum treatment modalities 15 . Unfortunately, our minimal knowledge of the molecular details of SARS-CoV-2 infection precludes a comprehensive evaluation of small molecule candidates for host-directed therapies. We sought to address this knowledge gap by systematically mapping the interaction landscape between SARS-CoV-2 proteins and human proteins.

### **Title:**  Global profiling of SARS-CoV-2 specific IgG/IgM responses of convalescents using a proteome microarray

**Entity extraction method:**  Knowledge Graph

**Paragraph Number 1:**  The SARS-CoV-2 proteome microarray not only enables the global profiling of virus-specific antibody responses but also provides semi-quantitative information. By adopting the dual color strategy of the microarray, we can measure IgG and IgM simultaneously. For the convalescent COVID-19 patients tested in this study, at a median of 22 days after onset, we found that the overall IgG response is significantly higher than that of IgM, indicating that for these patients the SARS-CoV-2 specific IgG responses are dominant at the convalescent phase, although the IgM level might reach its peak at a similar time point to that of IgG, according to some studies of SARS-

**Paragraph Number 2:**  The detailed layout of the SARS-CoV-2 proteome microarray was indicated as well (Fig. 2A). High antibody responses were usually observed for COVID-19 patients but not in control sera (Fig. 2B). Since the Fc tag could be recognized by fluorescence-labeled anti-human IgG antibody, the ACE2-Fc generated high signals for all the tests, which could serve as a control for the anti-human IgG antibody, though the initial reason to include ACE2 on the microarray is for applications other than serum profiling. To test the experimental reproducibility of the serum profiling using the microarray, two COVID-19 convalescent sera were randomly selected. Three independent analyses for each of these two sera were repeated on the microarray. Pearson correlation coefficients between two repeats were 0.988 and 0.981 for IgG and IgM, respectively, and the overall fluorescence intensity ranges of the repeated experiments were fairly close, demonstrating high reproducibility of the microarray-based serum profiling both for IgG and IgM (Fig. 2C-E).


### **Title:**  Defining high-value information for COVID-19 decision-making (COVID-19 Statistics, Policy modeling and Epidemiology Collective, C-SPEC)

**Entity extraction method:**  BERT

**Paragraph Number 1:**  In this study we demonstrate a model-based approach to determining priorities for acquiring new information based on the potential value of this information in reducing critical uncertainties relevant to COVID-19 decision-making over the short and medium term. We fit an age-structured compartmental model to reported cases in the San Francisco Bay Area, forecast a range of possible epidemic trajectories consistent with available information, and then demonstrate how additional information about the impact of NPIs or the fraction of all infections that are confirmed positive would impact case predictions.

**Paragraph Number 2:**  We assumed that 33% of ascertained cases required hospitalization, 12 and back-calculated the proportion of all cases requiring hospitalizations for a given set of parameter values (Supplemental Information). We assumed that the average duration of hospitalization for COVID-19 was 10 days. 6 We estimated a hospital capacity constraint using AHA estimates and assumed that 50% of beds could be available for COVID-19 patients. 12,13 To assess the potential decision value of new information, we focused on the projected time until the hospital supply constraint was exceeded by COVID-19 demand as a proxy for major decision-relevant outcomes. First, we calculated the uncertainty around this projected time when the model was only calibrated to the confirmed case counts through March 15. Second, we evaluated the reduction in this uncertainty that would result from obtaining more precise information about 1) the effectiveness of NPIs or 2) the fraction of all cases ascertained. For illustration, we considered narrowing uncertainty around NPI effectiveness to either a 10-30%
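
The hospital-capacity proxy described in this paragraph can be illustrated with a small calculation. This is a sketch under stated assumptions, not the authors' method: the function names, the case series, and the bed count are hypothetical, and the back-calculation assumes that only ascertained cases are hospitalized, which the source does not state (its details are in the Supplemental Information).

```python
from typing import Optional, Sequence


def hospitalized_fraction_of_all_cases(ascertained_fraction: float,
                                       hosp_among_ascertained: float = 0.33) -> float:
    """Assumption: unascertained cases are never hospitalized."""
    return hosp_among_ascertained * ascertained_fraction


def first_day_over_capacity(daily_new_cases: Sequence[float],
                            hosp_fraction: float,
                            stay_days: int,
                            available_beds: float) -> Optional[int]:
    """Return the first day index on which occupancy exceeds available beds."""
    admissions = [cases * hosp_fraction for cases in daily_new_cases]
    for day in range(len(admissions)):
        # Patients admitted within the last `stay_days` days are still in hospital.
        start = max(0, day - stay_days + 1)
        occupancy = sum(admissions[start:day + 1])
        if occupancy > available_beds:
            return day
    return None  # capacity is never exceeded within the projection


# Hypothetical projection: 100 new cases per day for 30 days, 10-day stays.
day = first_day_over_capacity([100] * 30, hosp_fraction=0.1,
                              stay_days=10, available_beds=45)
print(day)
```

Running the same function over many projected trajectories would give a distribution of crossing days, which is the kind of uncertainty the paragraph proposes to narrow with new information.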

**Paragraph Number 3:**  We developed an age-structured dynamic compartmental model of COVID-19 transmission, similar to previously published models. 10 Model compartments stratify the population into susceptible, exposed, symptomatic infected, asymptomatic infected, and recovered individuals. We derived initial ranges for model parameters including length of incubation and infectious periods, and the basic reproductive number (R0), by drawing from prior COVID-19 modeling studies (Table 1) , and from prior studies on age-specific contact patterns (sources in Supplemental Information). Initial parameter ranges were specified to envelop at least the high and low point estimates used in other modeling studies, and to encompass uncertainty ranges from those studies when provided (sources in Supplemental Information).
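
The compartment structure named above (susceptible, exposed, symptomatic infected, asymptomatic infected, recovered) can be sketched with a simple forward-Euler simulation. This is a deliberately reduced illustration: it uses a single age group rather than the age structure of the paper, and the function name and all parameter values are hypothetical.

```python
from typing import List, Tuple

State = Tuple[float, float, float, float, float]  # (S, E, I_sym, I_asym, R)


def simulate_seiar(initial: State, beta: float, sigma: float, gamma: float,
                   p_symptomatic: float, days: int,
                   steps_per_day: int = 10) -> List[State]:
    """Forward-Euler integration of a single-group S-E-I/A-R model."""
    s, e, i, a, r = (float(x) for x in initial)
    n = s + e + i + a + r          # total population, constant over time
    dt = 1.0 / steps_per_day
    history = [(s, e, i, a, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * (i + a) / n * dt   # S -> E
            new_infectious = sigma * e * dt             # E -> I or A
            recovered_sym = gamma * i * dt              # I -> R
            recovered_asym = gamma * a * dt             # A -> R
            s -= new_exposed
            e += new_exposed - new_infectious
            i += p_symptomatic * new_infectious - recovered_sym
            a += (1.0 - p_symptomatic) * new_infectious - recovered_asym
            r += recovered_sym + recovered_asym
        history.append((s, e, i, a, r))   # one record per simulated day
    return history


# Hypothetical parameters: 5-day incubation, 10-day infectious period.
history = simulate_seiar(initial=(9990.0, 0.0, 10.0, 0.0, 0.0), beta=0.5,
                         sigma=0.2, gamma=0.1, p_symptomatic=0.6, days=60)
print(history[-1])
```

In this parameterization the basic reproductive number is `beta / gamma`; an age-structured version would replace the single `beta` with a contact matrix between age groups.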

### **Title:**  Lessons learnt from 288 COVID-19 international cases: importations over time, effect of interventions, underdetection of imported cases

**Entity extraction method:**  BERT

**Paragraph Number 1:**  Twenty-six countries worldwide have declared cases of the novel coronavirus, COVID-19, as of February 20, 2020 1. Only China has so far registered a widespread epidemic 2, and authorities have implemented massive intervention measures to curtail it 3. Outside China, affected countries are facing importations of cases and clusters of local transmission 1, 4, 5. Border controls have been reinforced in many countries, and active surveillance has been intensified to rapidly detect and isolate importations, trace contacts and isolate suspect cases 6, 7.

**Paragraph Number 2:**  The effectiveness of such measures, however, critically depends on COVID-19 epidemiology and natural history 8, 9 , as well as the volume of importations 6 . The presence of an incubation period, during which infected individuals carry on their usual activities (including travel), is a major challenge for screening controls at airports 8 . Moreover, mild non-specific symptoms and transmission before the onset of clinical symptoms 2,10 may compromise infection control measures for importations and onward transmissions 9 . There is concern that imported cases may have gone undetected and contribute unknowingly to the global spread of the disease 11-15 .

**Paragraph Number 3:**  Here we systematically collected and analyzed data on 288 COVID-19 confirmed cases outside China. We analyzed importations that were successfully isolated and those leading to onward transmission, characterizing their case timeline. We developed a statistical model to nowcast trends in importations and quantify the proportion of undetected imported cases.

### **Title:**  Global trends in air travel: implications for connectivity and resilience to infectious disease threats

**Entity extraction method:**  BERT

**Paragraph Number 1:**  The global population recently surpassed 7.6 billion people and is expected to reach 9.8 billion by 2050. 1 In tandem, transportation networks have expanded and evolved to satisfy the growing social and economic demands for global connectivity. 2 Today, commercial air travel is the conduit for approximately 3.5 billion trips annually, of which over 40% are international. 3 We live in an increasingly connected world, with more people traveling further distances than prior generations and decreased times required to travel these increased distances. 4 While this interconnectedness is a defining feature of globalization and has produced tremendous benefits for humankind, it has also facilitated the geographic dispersion of infectious diseases. 5 Although travel has always been associated with the introduction of pathogens to new environments and populations, the speed with which these new introductions occur has been enhanced as population mobility increases. In the 1800s, the slow march of the second cholera pandemic could be observed to follow trade and military campaign routes out of India to Central Asia, the Middle East, Europe, and eventually North America, over the course of years. 5 Today, by contrast, infectious diseases can traverse the globe in less than a day. [6] [7] [8] With improvements in the availability of and access to various modes of transportation over the past several decades and the resultant decreases in time required to reach destinations, an outbreak in an isolated location can pose an international threat. 5 It is generally assumed that increased numbers of international travelers will increase global vulnerability to infectious diseases, by enhancing the potential for geographic spread. However, increased travel volume alone does not capture another (medRxiv preprint, made available under a CC-BY-NC 4.0 International license.)

**Paragraph Number 2:**  important feature of travel trends: connectivity between countries with differential capacities to detect and respond to infectious disease threats. Increased travel between two countries with strong health care and public health systems will likely have very different implications for global health security than increased travel between countries with less developed infrastructure or countries with disparities in their capacities to respond to public health threats. For instance, an increase in travelers to more vulnerable countries may increase the likelihood of exportation of cases to other countries, thereby increasing the rapidity of global transmission. However, if the countries to which the disease is exported have suitable health capacity, the risk of establishment and ongoing transmission may be less of a concern, 9 albeit nonnegligible. 10 Increased connectivity also increases the likelihood of introduction of pathogens via infected tourists to less resilient countries, where disease establishment is a risk. 11 To better understand the impact of globalization on our ability to mitigate infectious disease threats, we sought to describe trends in air passenger numbers over a 10-year period (2007-2016) and investigate if connectivity between countries with different levels of resilience has changed over this time period. (medRxiv preprint, not peer-reviewed: https://doi.org/10.1101/2020.03.29.20046904)

**Paragraph Number 3:**  We examined trends in FSI score, trends in worldwide air travel, and the association between a state's FSI score and air travel. Travel between countries included in the FSI rankings represented 95% (1.28 billion out of 1.35 billion) of all international trips in 2016. We excluded South Sudan from these analyses, as FSI data were only available from 2012 onwards. South Sudan is classified as an Alert country and travel to and from this country accounted for less than 0.03% of annual passenger flows. Analyses were conducted using either inbound or outbound passenger numbers, but are presented for outbound passenger numbers only, as the inbound and outbound passenger volumes did not differ significantly (paired t-test, p=0.62).

# Query 10: Are there geographic variations in the mortality rate of COVID-19?

### **Title:**  When Darkness Becomes a Ray of Light in the Dark Times: Understanding the COVID-19 via the Comparative Analysis of the Dark Proteomes of SARS-CoV-2, Human SARS and Bat SARS-Like Coronaviruses


**Entity extraction method:**  Knowledge Graph

**Paragraph Number 1:**  Supplementary Figure S1. Multiple sequence alignments of structural proteins of all three studied coronaviruses are generated using Clustal Omega. The aligned images are created using Esprit 3.0. Figure S1A. MSA of SARS-CoV-2, Human SARS, and Bat CoV spike glycoproteins. Figure S1B. MSA of SARS-CoV-2, Human SARS, and Bat CoV Nucleoproteins.

**Paragraph Number 2:**  Supplementary Figure S2. Multiple sequence alignments of non-structural proteins of all three studied coronaviruses are generated using Clustal Omega. The aligned images are created using Esprit 3.0. Figure S2A. MSA of SARS-CoV-2, Human SARS, and Bat CoV Nsp2 proteins. Figure S2B. MSA of SARS-CoV-2, Human SARS, and Bat CoV Nsp3 proteins. Figure S2C. MSA of SARS-CoV-2, Human SARS, and Bat CoV Nsp4 proteins. Figure S2D. MSA of SARS-CoV-2, Human SARS, and Bat CoV Nsp5 proteins. Figure S2E. MSA of SARS-CoV-2, Human SARS, and Bat CoV Nsp12 proteins. Figure S2F. MSA of SARS-CoV-2, Human SARS, and Bat CoV Nsp13 proteins. Figure S2G. MSA of SARS-CoV-2, Human SARS, and Bat CoV Nsp14 proteins. Figure S2H. MSA of SARS-CoV-2, Human SARS, and Bat CoV Nsp15 proteins. Figure S2I. MSA of SARS-CoV-2, Human SARS, and Bat CoV Nsp16 proteins.

**Paragraph Number 3:**  On performing MSA, results of which are shown in Figure 7D, we found that the ORF3a protein from SARS-CoV-2 is slightly evolutionarily closer to the ORF3a of Bat CoV (73.36%) than to the ORF3a of Human SARS CoV (72.99%). Graphs in Figures 7A, 7B, and 7C depict the propensity for disorder in ORF3a proteins of novel SARS-CoV-2, Human SARS CoV, and Bat CoV (SARS-like), respectively. Mean PPIDs in these ORF3a proteins are 9.1% (SARS-CoV-2), 8.8% (Human SARS), and 6.2% (Bat CoV (SARS-like)). ORF3a of SARS-CoV-2 shows protein-binding regions at its N-terminus (by MoRFchibi_web (residues 1-6), MoRFPred (residues 7-12), and DISOPRED3 (residues 1-19)) and at its C-terminus (by MoRFchibi_web (residues 261-268) and MoRFPred (residues 259-263)) (Table 2). Similarly, ORF3a of Human SARS and Bat CoV also shows MoRFs at the N- and C-termini with the help of MoRFchibi_web and MoRFPred (Supplementary Tables 7 and 8). These protein-binding regions in ORF3a may have a role in its co-localization with E, M, and S viral proteins. Apart from MoRFs, it also displays several nucleotide-binding residues in all three viruses (see Supplementary Tables 9, 10, and 11). In fact, this represents the maximum number of RNA and DNA binding residues as compared with all other accessory proteins. These results indicate that the IDPs/IDPRs of this protein could be utilized in molecular recognition (protein-protein, protein-RNA, and protein-DNA interaction). (bioRxiv preprint, not peer-reviewed: https://doi.org/10.1101/2020.03.13.990598)

### Title:  A SARS-CoV-2-Human Protein-Protein Interaction Map Reveals Drug Targets and Potential Drug-Repurposing

**Abbreviations:**  HC-PPIs: high-confidence protein-protein interactions; PPIs: protein-protein interactions; AP-MS: affinity purification-mass spectrometry

**Entity extraction method:**  Knowledge Graph

**Paragraph Number 1:**  Mycophenolic acid; Sanglifehrin A

**Paragraph Number 2:**  Systematic validation using genetic-based approaches [79, 80] will be key to determining the functional relevance of these interactions and whether the human proteins are being used by the virus or are fighting off infection, information that would inform future pharmacological studies. It is important to note that pharmacological intervention with the agents we identified in this study could be either detrimental or beneficial for infection. For instance, the HDAC2 inhibitors may compound the potential action of the Nsp5 protease to hydrolyze this human protein. Future work will involve generation of protein-protein interaction maps in different human cell types, as well as bat cells, and the study of related coronaviruses including SARS-CoV, MERS-CoV and the less virulent OC43 [5], data that will allow for valuable cross-species and viral evolution studies. Targeted biochemical and structural studies will also be crucial for a deeper understanding of the viral-host complexes, which will inform more targeted drug design.

**Paragraph Number 3:**  Sequence analysis of SARS-CoV-2 isolates suggests that the 30 kb genome encodes as many as 14 open reading frames (Orfs). The 5' Orf1a/Orf1ab encodes polyproteins, which are auto-proteolytically processed into 16 non-structural proteins (Nsp1-16) that form the replicase/transcriptase complex (RTC). The RTC consists of multiple enzymes, including the papain-like protease (Nsp3), the main protease (Nsp5), the Nsp7-Nsp8 primase complex, the primary RNA-dependent RNA polymerase (Nsp12), a helicase/triphosphatase (Nsp13), an exoribonuclease (Nsp14), an endonuclease (Nsp15), and N7- and 2'-O-methyltransferases (Nsp10/Nsp16) [1, 16, 17]. At the 3' end of the viral genome, as many as 13 Orfs are expressed from nine predicted sub-genomic RNAs. These include four structural proteins: Spike (S), Envelope (E), Membrane (M) and Nucleocapsid (N) [17], and nine putative accessory factors (Fig. 1a) [1, 16]. In genetic composition, the SARS-CoV-2 genome is very similar to SARS-CoV: each has an Orf1ab encoding 16 predicted Nsps and each has the four typical coronavirus structural proteins. However, they differ in their complement of 3' open reading frames: SARS-CoV-2 possesses an Orf3b and Orf10 with limited detectable protein homology to SARS-CoV [16], and its Orf8 is intact while SARS-CoV encodes Orf8a and Orf8b (Fig. 1b) [1, 16, 18].
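
The ORF counts in this paragraph can be cross-checked with a small tally. The sketch below encodes only what the paragraph states; the dictionary layout and variable names are illustrative assumptions, and counting Orf1a/Orf1ab as a single 5' reading frame is our reading of how the paragraph arrives at 14.

```python
# Replicase/transcriptase complex (RTC) enzymes named in the paragraph.
RTC_ENZYMES = {
    "Nsp3": "papain-like protease",
    "Nsp5": "main protease",
    "Nsp7-Nsp8": "primase complex",
    "Nsp12": "primary RNA-dependent RNA polymerase",
    "Nsp13": "helicase/triphosphatase",
    "Nsp14": "exoribonuclease",
    "Nsp15": "endonuclease",
    "Nsp10/Nsp16": "N7- and 2'-O-methyltransferases",
}

STRUCTURAL_PROTEINS = ["Spike (S)", "Envelope (E)", "Membrane (M)", "Nucleocapsid (N)"]
N_ACCESSORY_FACTORS = 9  # "nine putative accessory factors"
N_FIVE_PRIME_ORFS = 1    # assumption: Orf1a/Orf1ab counted once

# Orfs expressed from the 3' end: structural proteins plus accessory factors.
three_prime_orfs = len(STRUCTURAL_PROTEINS) + N_ACCESSORY_FACTORS
total_orfs = N_FIVE_PRIME_ORFS + three_prime_orfs

print(three_prime_orfs)  # 13
print(total_orfs)        # 14
```

Under that counting assumption, the totals agree with the "as many as 13" 3' Orfs and "as many as 14" Orfs overall given in the text.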

### Title:  Alpha-ketoamides as broad-spectrum inhibitors of coronavirus and enterovirus replication: Structure-based design, synthesis, and activity assessment

**Entity extraction method:**  Knowledge Graph

**Paragraph Number 1:**  A number of capped dipeptidyl α-ketoamides have been described as inhibitors of the norovirus 3C-like protease [41]. These were optimized with respect to their P1' substituent, whereas P2 was isobutyl in most cases and occasionally benzyl. The former displayed IC50 values one order of magnitude lower than the latter, indicating that the S2 pocket of the norovirus 3CL protease is fairly small. Although we did not include the norovirus 3CLpro in our study, expanding the target range of our inhibitors to norovirus is probably a realistic undertaking. While our study was underway, Zeng et al. [42] published a series of α-ketoamides as inhibitors of the EV-A71 3Cpro. These authors mainly studied the structure-activity relationships of the P1' residue and found small alkyl substituents to be superior to larger ones. Interestingly, they also reported that a six-membered δ-lactam in the P1 position led to 2-3 times higher activities, compared to the five-membered γ-lactam. At the same time, Kim et al. [43] described a series of five α-ketoamides with P1' = cyclopropyl that showed submicromolar activity against EV-D68 and two HRV strains.

**Paragraph Number 2:**  Seventeen years have passed since the outbreak of severe acute respiratory syndrome (SARS) in 2003, but there is still no approved treatment for infections with the SARS coronavirus (SARS-CoV) [1]. One of the reasons is that, despite the devastating consequences of SARS for the affected patients, the development of an antiviral drug against this virus would not be commercially viable, given that the virus was rapidly contained and has not reappeared since 2004. As a result, we were empty-handed when the Middle-East respiratory syndrome coronavirus (MERS-CoV), a close relative of SARS-CoV, emerged in 2012 [2]. MERS is characterized by severe respiratory disease, quite similar to SARS, but in addition frequently causes renal failure [3]. Although the number of registered MERS cases is low (2494 as of November 30, 2019; www.who.int), the threat MERS-CoV poses to global public health may be even more serious than that presented by SARS-CoV. This is related to the high case-fatality rate (about 35%, compared to 10% for SARS), and to the fact that MERS cases are still accumulating seven years after the discovery of the virus, whereas the SARS outbreak was essentially contained within 6 months. The potential for human-to-human transmission of MERS-CoV has been impressively demonstrated by the 2015 outbreak in South Korea, where 186 cases could be traced back to a single infected traveller returning from the Middle East [4]. SARS-like coronaviruses are still circulating in bats in China [5-8], from where they may spill over into the human population; this is probably what caused the current outbreak of atypical pneumonia in Wuhan, which is linked to a seafood and animal market. The RNA genome (GenBank accession code: MN908947.2; http://virological.org/t/initialgenome-release-of-novel-coronavirus/319, last accessed on January 11, 2020) of the new betacoronavirus features around 82% identity to that of SARS-CoV.

**Paragraph Number 3:**  In spite of the considerable threat posed by SARS-CoV and related viruses, as well as by MERS-CoV, it is obvious that the number of cases so far does not warrant the commercial development of an antiviral drug targeting MERS- and SARS-CoV, even if a projected steady growth of the number of MERS cases is taken into account. A possible solution to the problem could be the development of broad-spectrum antiviral drugs that are directed against the major viral protease, a target that is shared by all coronavirus genera as well as, in a related form, by members of the large genus Enterovirus in the picornavirus family. Among the members of the genus Alphacoronavirus are the human coronaviruses (HCoV) NL63 [9] and 229E [10], which usually cause only mild respiratory symptoms in otherwise healthy individuals, but are much more widespread than SARS-CoV or MERS-CoV. Therapeutic intervention against alphacoronaviruses is indicated in cases of accompanying disease such as cystic fibrosis [11] or leukemia [12], or certain other underlying medical conditions [13]. The enteroviruses include pathogens such as EV-D68, the causative agent of the 2014 outbreak of the "summer flu" in the US [14]; EV-A71 and Coxsackievirus A16 (CVA16), the etiological agents of Hand, Foot, and Mouth Disease (HFMD) [15]; Coxsackievirus B3 (CVB3), which can cause myocardic inflammation [16]; and human rhinoviruses (HRV), notoriously known to lead to the common cold but also capable of causing exacerbations of asthma and COPD [17]. Infection with some of these viruses can lead to serious outcomes; thus, EV-D68 can cause polio-like disease [18], and EV-A71 infection can proceed to aseptic meningitis, encephalitis, pulmonary edema, viral myocarditis, and acute flaccid paralysis [15, 19, 20]. Enteroviruses cause clinical disease much more frequently than coronaviruses, so that an antiviral drug targeting both virus families should be commercially viable.

### Title:  The distress of Iranian adults during the Covid-19 pandemic - More distressed than the Chinese and with different predictors

**Entity extraction method:**  BERT

**Paragraph Number 1:**  Covid-19 disrupts lives and work and causes psychological distress to the general public [1-4]. The Covid-19 outbreak first triggered public panic and mental health distress in China [2]. Early research found that working adults in more affected areas in China had worse health conditions, more distress and lower life satisfaction [3]. Researchers at Shanghai Mental Health Center developed a Covid-19 Peritraumatic Distress Index (CPDI) to assess the distress level specific to Covid-19 [5]. Such research on mental health during the Covid-19 pandemic is critical to identify and screen people psychiatrically based on their distress levels to prioritize assistance [2, 6]. The identification of those who are more likely to suffer mentally enables more targeted assistance from caregivers and policymakers, especially given the limited resources relative to the scale of the pandemic [1, 3].

**Paragraph Number 2:**  One of the countries most affected by Covid-19 is Iran [7]. When we started our survey on March 28, 2020, Iran had one of the highest national counts of Covid-19 confirmed cases (35,408) and deaths (2717), and a mortality rate of 7.6%, as reported by the Iranian government. However, the figures reported by BBC Persian for Iran were much higher. The Covid-19 outbreak in Iran has been compounded by the ongoing decade-long US-led sanctions on Iran. A Lancet correspondence noted that "all aspects of prevention, diagnosis, and treatment are directly and indirectly hampered, and the country (Iran) is falling short in combating the crisis. Lack of medical, pharmaceutical, and laboratory equipment such as protective gowns and necessary medication has been scaling up the burden of the epidemic and the number of casualties" [7]. We provide the first empirical evidence on Iranian adults' level of distress and identify several predictors of their distress under the Covid-19 pandemic.

**Paragraph Number 3:**  In sum, this paper provides the first empirical evidence of the level of distress of Iranian adults during the Covid-19 pandemic. The results suggest that adults in Iran are experiencing more distress than adults in China, with the level of distress predicted by different factors. Future research therefore needs to examine mental health and its predictors in individual countries to effectively identify and screen those who are more susceptible to mental health issues during the Covid-19 pandemic.

### Title:  Preliminary evidence that higher temperatures are associated with lower incidence of COVID-19, for cases reported globally up to 29th February 2020

**Entity extraction method:**  BERT

**Paragraph Number 1:**  Daily gridded temperature data at 0.5-degree spatial resolution were obtained from the Climate Prediction Centre (NOAA/OAR/ESRL PSD, Boulder, Colorado, USA, https://www.esrl.noaa.gov/psd/, accessed March 4th 2020). The average temperature at the ADM1 centroid was calculated by taking the average of the maximum and minimum temperatures over the observation period at the centroid coordinates, using packages {ncdf4} [9] and {rgdal} [10]. All the analyses were implemented in the R environment [11].
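
The authors worked in R with {ncdf4} and {rgdal}; the following is not their code but a minimal Python sketch of the averaging step as we read it: take the daily maximum and minimum temperatures at a centroid over the observation period, form the daily midpoint, and average the midpoints. The function name `mean_temperature`, the input layout (two equal-length lists of daily values in degrees Celsius), and the sample numbers are assumptions for illustration. Reading the sentence as the midpoint of the period-wide extremes instead would give a different value.

```python
from statistics import fmean


def mean_temperature(daily_tmax, daily_tmin):
    """Average of daily max and min temperatures over an observation period."""
    if len(daily_tmax) != len(daily_tmin):
        raise ValueError("need one tmax and one tmin value per day")
    if not daily_tmax:
        raise ValueError("observation period is empty")
    # Midpoint of each day's maximum and minimum, then the mean over all days.
    daily_mid = [(hi + lo) / 2 for hi, lo in zip(daily_tmax, daily_tmin)]
    return fmean(daily_mid)


# Made-up values for a three-day observation period at one centroid.
tmax = [12.0, 14.0, 10.0]
tmin = [4.0, 6.0, 2.0]
avg = mean_temperature(tmax, tmin)
print(avg)  # 8.0
```

Averaging the daily midpoints gives the same result as averaging the mean maximum and the mean minimum, so the sketch covers both of those readings.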

**Paragraph Number 2:**  An open-source line list of confirmed COVID-19 cases was downloaded on March 2nd 2020 [8]. The line list included data on confirmed cases up to February 29th 2020 for all countries, including China. Cases were aggregated to the first-level administrative division (ADM1) in which they occurred, as defined by the Global Administrative Areas dataset (https://gadm.org/, accessed March 4th 2020). This corresponds to the first-level administrative unit within each country, usually described as a state or province. The reported coordinates of the case (variably a point location, city centroid, or different subnational administrative levels) were used to determine the ADM1 in which the case occurred. For each ADM1, an observation period was defined as the period from the date of onset of symptoms of the first reported case for that ADM1 to February 29th 2020. When information regarding onset of symptoms of the first reported case was missing, the confirmation date of the first reported case was used instead.
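
The aggregation rule in this paragraph can be sketched in a few lines. This is not the authors' code; it is a minimal Python illustration in which the `Case` record, the `summarise_by_adm1` helper, and the sample regions and dates are all hypothetical. It assumes each case has already been assigned to an ADM1 unit (skipping the coordinate lookup against the GADM boundaries), and it takes "first reported case" to mean the case with the earliest confirmation date, which the paragraph does not spell out.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

STUDY_END = date(2020, 2, 29)  # last date covered by the line list


@dataclass
class Case:
    adm1: str              # first-level administrative division
    onset: Optional[date]  # date of symptom onset, if reported
    confirmed: date        # date of confirmation


def summarise_by_adm1(cases):
    """Return {adm1: (case_count, period_start, period_end)}."""
    grouped = {}
    for case in cases:
        grouped.setdefault(case.adm1, []).append(case)

    summary = {}
    for adm1, members in grouped.items():
        # Assumption: the first reported case is the earliest by confirmation date.
        first = min(members, key=lambda c: c.confirmed)
        # Fall back to the confirmation date when the onset date is missing.
        start = first.onset if first.onset is not None else first.confirmed
        summary[adm1] = (len(members), start, STUDY_END)
    return summary


# Invented records for two hypothetical ADM1 units.
cases = [
    Case("Region A", date(2020, 1, 18), date(2020, 1, 22)),
    Case("Region A", None, date(2020, 2, 3)),
    Case("Region B", None, date(2020, 2, 21)),
]
summary = summarise_by_adm1(cases)
for adm1, (count, start, end) in summary.items():
    print(adm1, count, start.isoformat(), end.isoformat())
```

Region A starts its observation period at the onset date of its first case, while Region B falls back to the confirmation date because no onset date was recorded.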

**Paragraph Number 3:**  Many LMICs had not detected a COVID-19 case as of 29th February 2020 and therefore were not included in this analysis. Caution is warranted in extrapolating the association between local COVID-19 case counts and temperature to LMICs in tropical regions. COVID-19 outbreaks in LMICs, even if at somewhat lower incidence due to higher temperatures, are still likely to have a substantial impact on health services that are already significantly resource constrained.

### Title:  Estimates of the severity of COVID-19 disease

**Entity extraction method:**  BERT

**Paragraph Number 1:**  (medRxiv preprint, https://doi.org/10.1101/2020.03.09.20033357.) Estimates of the probability of requiring hospitalisation assume that only severe cases clinically require hospitalisation. This is clearly different from the pattern of hospitalisation that occurred in China, where hospitalisation was also used to ensure case isolation. Mortality can also be expected to vary with the underlying health of specific populations, given that the risks associated with COVID-19 will be heavily influenced by the presence of underlying co-morbidities.

**Paragraph Number 3:**  Table 3: Estimates of the proportion of all infections that would be hospitalised obtained from a subset of cases reported in mainland China [22]. We assume that within a UK-context, those that are defined as "severe" would be hospitalised. The rates are adjusted for under-ascertainment and corrected for demography.