***
# CoNetz
***
## HIGHLIGHTS

This submission uses network-based analysis for rapid identification of current and potential new non-pharmaceutical interventions (NPIs). Text and network mining techniques are employed to build a comprehensive network of relations between a wide variety of biological entities. The tools allow easy exploration and visualization of the network for generating leads. The submission uses both the CORD-19 corpus and relevant MEDLINE abstracts.

Some of the immediate insights derived in the context of COVID-19:

* While school closures were seen as an important non-pharmaceutical intervention (NPI), many articles pointed out social difficulties that would need to be factored in when planning such NPIs.
* The fact that airport screening would not be sufficient to detect COVID-19 positive persons is also brought out.

Some use cases for exploring leads are illustrated. We encourage the community to leverage this network and its easy-to-use Python interface, for this task and beyond.

## Introduction
Since the outbreak of the COVID-19 pandemic, there has been a massive pursuit by the research community to find drugs to treat this disease as well as discover vaccines against the disease. A large number of research papers have been published to this end, peer-reviewed as well as those posted in preprint repositories such as bioRxiv (www.bioRxiv.org) and medRxiv (www.medRxiv.org). In addition, a large number of peer-reviewed papers on earlier coronavirus-related diseases such as SARS and MERS are also available.

The COVID-19 Open Research Dataset (CORD-19 corpus) consists of abstracts and full-text articles on COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community via this Kaggle challenge to apply recent advances in natural language processing (NLP) and other related techniques to generate insights in support of the ongoing fight against this infectious disease.

At a specific level, it means we have to help uncover **"*unknown known*"** entities such as drugs and vaccines that may be unknown to the wider research community but are mentioned in specific scientific articles in the CORD-19 dataset. Our goal is to help the medical research community uncover these "unknown known" entities through a combination of text-mining and network analyses.

## Approach
### Association Network Creation
We had earlier built a framework for NLP called TPX, a web-based text-mining tool that supports real-time entity assisted search and navigation of the MEDLINE repository whilst continuing to use PubMed as the underlying search engine (1). TPX is a modular and versatile biomedical text-mining framework. For instance, we recently built PRIORI-T (2), a pipeline for phenotype-driven rare disease gene prioritization, by re-purposing specific modules of TPX. The modules include:
1. Dictionary Curation module
2. Annotator for entity annotations
3. MEDLINE Processor
4. Network Creation module, to build a network of the correlations extracted by the Correlation Extraction module

We re-purposed TPX for the COVID-19 Open Research Dataset Challenge (CORD-19) as follows:
1.  We took the provided CORD-19 corpus (corpus date: 2020-04-10, consisting of 51k full-text articles) and processed it with the Full-text Article Processor module of TPX.
2.  The Annotator module of TPX performed the annotation based on the following dictionaries: HUMAN_GENE, GENE_SARS, GENE_MERS, GENE_COVID, PHENOTYPE, CHEMICALS, DRUGS, DISEASE, SYMPTOM, GOPROC, GOFUNC, GOLOC, CELLTYPE, TISSUE, ANATOMY, ORGANISM, COUNTRIES, ETHICS TERMS (general terms related to human ethics), NON-PHARMA INTERVENTION (terms related to non-pharmaceutical intervention), SURVEILLANCE TERMS (general terms related to disease surveillance), VACCINE TERMS, VIROLOGY TERMS (general terms used in virology studies) and EARTH SCIENCE TERMS.
3.  We then used the Correlation Extraction module to extract correlations amongst these entity types.
4.  The Network Creation module then used these correlations to build a network called TCS\_COVID\_NETWORK (cord19\_pc\_assocs\_v1.tsv). This network can be queried to obtain information and pointers from the corpus.

Thus, TCS\_COVID\_NETWORK serves as a knowledge base that could help the COVID-19 research community obtain pointers through a combination of text-mining and network analyses. It is available on Kaggle for anyone to use in tackling the text-mining questions posed in this Kaggle challenge.
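
The pipeline above ends in a table of pairwise associations between annotated entities. As a rough illustration only, the sketch below counts how often two entities co-occur in the same article and emits one row per pair. The input shape, the function name `cooccurrence_edges`, and the use of raw co-occurrence counts as the score are assumptions made for this example; the scoring actually used by the TPX Correlation Extraction module is not described here.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(annotated_articles):
    """Count entity pairs that co-occur in the same article.

    `annotated_articles` maps an article id to the set of (entity, type)
    annotations found in it. This input shape is hypothetical.
    """
    pair_counts = Counter()
    for entities in annotated_articles.values():
        # Sort so that each unordered pair is always counted under one key.
        for a, b in combinations(sorted(entities), 2):
            pair_counts[(a, b)] += 1
    # Most frequent pairs first; ties are broken by the pair itself.
    return sorted(pair_counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy annotations, made up for illustration.
articles = {
    "A1": {("COVID-19", "DISEASE"),
           ("SCHOOL CLOSURE", "NON-PHARMA INTERVENTION")},
    "A2": {("COVID-19", "DISEASE"),
           ("SCHOOL CLOSURE", "NON-PHARMA INTERVENTION"),
           ("SOUTH KOREA", "COUNTRIES")},
}
edges = cooccurrence_edges(articles)
for (src, tgt), count in edges:
    # One tab-separated row per association: source, type, target, type, count.
    print(src[0], src[1], tgt[0], tgt[1], count, sep="\t")
```

In this toy input the COVID-19 / school closure pair appears in both articles and therefore ranks first.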

### PyVis Visualization
We then used PyVis, a Python library for building interactive network visualizations, to support intuitive, interactive exploration of TCS\_COVID\_NETWORK. The Jupyter notebook provides the details. For illustration, we have included a set of use cases for exploring the network using the PyVis and NetworkX libraries. These cases are by no means exhaustive; the PyVis and NetworkX functionality can easily be used to build richer exploration features.

## Non-Pharmaceutical Interventions (NPIs)
We now describe how this network can be used to answer some of the questions posed in the **non-pharmaceutical interventions (NPIs) task**. The task focuses on *"What do we know about the effectiveness of non-pharmaceutical interventions (NPIs)? Specifically, we want to know what the literature reports about rapid assessment of the likely efficacy of school closures, travel bans, bans on mass gatherings of various sizes, and other social distancing approaches."*.

### Our initial analysis showed that the provided corpus lacked a few articles that discuss NPIs. Hence, for this task, we augmented the provided corpus with MEDLINE articles.

To this end, we included 30.7 million articles from the MEDLINE corpus up to April 5, 2020, together with the CORD-19 articles (with titles and abstracts) that were missing from the MEDLINE corpus. We ran named entity recognition (NER) on these articles and kept those that had at least one entity corresponding to:
> 1. Diseases - "Coronavirus infections", "SARS", "MERS", "COVID-19", or
> 2. Taxonomies - "Severe acute respiratory syndrome-related coronavirus", "Middle East respiratory syndrome-related coronavirus", "Severe acute respiratory syndrome coronavirus 2".

We computed pair-wise associations for various biomedical entities from this filtered corpus of 14,139 articles. The article corpus (cord19_medline_arts.tsv) and associations network (cord19_medline_pc_assocs_v1.tsv) were included as part of the dataset. We will be updating the association network after running on the full-text articles in the subsequent versions. 
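
The filtering step described above can be sketched as a set intersection between each article's annotations and the seed entities. This is only an illustration under stated assumptions: the `ner_results` structure and the function name `filter_articles` are hypothetical, and the seed names are the ones quoted above, upper-cased in the same way the notebook normalizes entity names.

```python
SEED_ENTITIES = {
    "CORONAVIRUS INFECTIONS", "SARS", "MERS", "COVID-19",
    "SEVERE ACUTE RESPIRATORY SYNDROME-RELATED CORONAVIRUS",
    "MIDDLE EAST RESPIRATORY SYNDROME-RELATED CORONAVIRUS",
    "SEVERE ACUTE RESPIRATORY SYNDROME CORONAVIRUS 2",
}

def filter_articles(ner_results, seeds=SEED_ENTITIES):
    """Keep articles whose annotations include at least one seed entity."""
    kept = {}
    for article_id, entities in ner_results.items():
        normalized = {e.upper() for e in entities}
        if normalized & seeds:  # non-empty intersection with the seed set
            kept[article_id] = normalized
    return kept

# Toy NER output, made up for illustration.
ner_results = {
    "PMID:1": {"COVID-19", "school closure"},
    "PMID:2": {"influenza", "vaccination"},
    "PMID:3": {"MERS", "camel"},
}
filtered = filter_articles(ner_results)
print(sorted(filtered))  # ['PMID:1', 'PMID:3']
```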

Install needed packages

In [None]:
!pip install pyvis

Import all the needed python libraries

In [None]:
from pyvis.network import Network
import networkx as nx
import pandas as pd
from IPython.display import IFrame

Define global variables

In [None]:
DATA_DIR='/kaggle/input/cord19/submissions/'
NETWORK_FILE = DATA_DIR+'cord19_medline_assocs_v2.3.tsv'
ENTITY_NAME_FILE = DATA_DIR+'dicts/entity_name.tsv'
ENT_METADATA_FILE = DATA_DIR+'dicts/entity_metadata.csv'
ent_name=None
ent_id_name=None
ent_name_id={}
enttype_map=None
ent_cmap=None
ent_srcmap=None
connected_nodes=False
notebook_mode=True
net_options = {
  "nodes": {
    "scaling": {
      "min": 46
    }
  },
  "edges": {
    "color": {
      "inherit": True
    },
    "shadow": {
      "enabled": True
    },
    "smooth": True
  },
  "interaction": {
    "hover": True,
    "navigationButtons": True
  },
  "physics": {
    "enabled": True,
    "forceAtlas2Based": {
      "gravitationalConstant": -150,
      "springLength": 100
    },
    "minVelocity": 0.05,
    "timestep":0.1,
    "solver": "forceAtlas2Based"
  }
}

Helper Functions

In [None]:
def getEntityMaps():
    ent_meta_map = pd.read_csv(ENT_METADATA_FILE, sep=',')
    enttype_map = ent_meta_map[['entid','enttype','entsource']]
    
    ent_name_df = pd.read_csv(ENTITY_NAME_FILE, sep='\t', converters={'TypeId':str})
    ent_name_df.TypeId=ent_name_df.TypeId.str.upper()
    ent_name_df.Synonym=ent_name_df.Synonym.str.upper()
    ent_name_df.DictId = ent_name_df.DictId.map(enttype_map.set_index('entid')['enttype'])
    
    ent_name = ent_name_df.set_index('Synonym').to_dict()
    ent_id_name = ent_name_df.set_index(['TypeId','DictId']).to_dict()
    
    ent_color = ent_meta_map[['enttype','entcolor']]
    ent_cmap = ent_color.set_index('enttype').to_dict(orient='index')
    
    ent_source = ent_meta_map[['enttype','entsource']]
    ent_srcmap = ent_source.set_index('enttype').to_dict(orient='index')
    return enttype_map, ent_name, ent_id_name, ent_cmap, ent_srcmap

def getNetwork():
    nw = pd.read_csv(NETWORK_FILE,sep='\t',converters={'src_ent':str, 'target_ent':str})
    nw.src_type = nw.src_type.map(enttype_map.set_index('entid')['enttype'])
    nw.target_type = nw.target_type.map(enttype_map.set_index('entid')['enttype'])
    nw.src_ent=nw.src_ent.str.upper()
    nw.target_ent=nw.target_ent.str.upper()
    return nw

def buildQueryCriteria(src_ents, source_ent_types=None, target_ents=None, target_ent_types=None, 
                       queryByEntityName=True, topk=50, topkByType=None, connected_nodes=False, indirect_links=None):
    # Normalize it upper-case
    src_ents = [i.upper() for i in src_ents]
    
    criteria = {'src_ents' : src_ents,
                'src_ent_types': source_ent_types,
                'target_ents': target_ents,
                'target_ent_types': target_ent_types,
                'query_entname' : queryByEntityName,
                'topk' : topk,
                'topkByType' : topkByType,
                'connected_nodes' : connected_nodes,
                'indirect_links' : indirect_links
                }
    return criteria

def queryByEntityTypes(nw, src_ent_types, target_ent_types):
    # Apply the source-type and target-type filters cumulatively,
    # so that the target filter does not discard the source filter.
    qnw = nw
    if(src_ent_types is not None):
        qnw = qnw[qnw.src_type.isin(src_ent_types)]
    if(target_ent_types is not None):
        qnw = qnw[qnw.target_type.isin(target_ent_types)]
    return qnw

def queryByEntityID(nw, src_ents, target_ents=None):
    # Keep edges whose source is in src_ents, then narrow by target if given
    qnw = nw[nw.src_ent.isin(src_ents)]
    if(target_ents is not None):
        target_ents = [i.upper() for i in target_ents]
        qnw = qnw[qnw.target_ent.isin(target_ents)]
    return qnw

def getEntityIds(ents):
    print(' Querying by Entity Name ..')
    #Get the entity triplet for the entities
    typeids = [ent_name['TypeId'][i] for i in ents]
    dictids = [ent_name['DictId'][i] for i in ents]
    #Update Entity Name to Entity/Node ID for reference
    for i in range(len(ents)):
        ent_name_id[ents[i]]=getNodeID(typeids[i], dictids[i])
    return typeids, dictids


def queryByEntityName(nw, src_ents, target_ents=None):
    typeids, dictids = getEntityIds(src_ents)
    qnw = nw[nw.src_ent.isin(typeids) & (nw.src_type.isin(dictids))]
    if(target_ents is not None):
        target_ents = [i.upper() for i in target_ents]
        typeids, dictids = getEntityIds(target_ents)
        # Narrow the source-filtered edges to the requested target entities
        qnw = qnw[qnw.target_ent.isin(typeids) & (qnw.target_type.isin(dictids))]
    return qnw

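def queryTopk(nw, topk, topkByType):
    # topkByType, when given, takes precedence over topk. head() keeps the
    # first k rows of each (source entity, target type) group, so the
    # ranking relies on the order of the rows in the network file.
    k = topkByType if topkByType is not None else topk
    return nw.groupby(['src_ent','target_type']).head(k)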

def queryNetwork(nw, criteria):
    # Use the criteria to query the network by entity name
    qnw=None
    
    if(criteria['query_entname']==True):
        qnw = queryByEntityName(nw, criteria['src_ents'], target_ents=criteria['target_ents'])
    else:
        qnw = queryByEntityID(nw, criteria['src_ents'], target_ents=criteria['target_ents'])
    
    # Query by entity types
    qnw = queryByEntityTypes(qnw, criteria['src_ent_types'], criteria['target_ent_types'])
    
    # Display only Top-k entites
    qnw = queryTopk(qnw, criteria['topk'], criteria['topkByType'])
    
    return qnw

def getEntityNames(src, target):
    src_name = ent_id_name['Synonym'][src]
    target_name = ent_id_name['Synonym'][target]
    return src_name, target_name

def getNodeID(typeid, dictid):
    return typeid+'-'+dictid[:2]

def buildNodeAttributes(e):
    # Build Node attributes - node_id, node_label, node_title, node_color 
    src_label, target_label = getEntityNames((e[0],e[1]), (e[2],e[3]))
    
    # Build src node
    src_id = getNodeID(e[0], e[1])
    src_title="<b>"+src_label+"</b><br><i>"+e[1]+"<br>"+e[0]+"</i><br>"+ent_srcmap[e[1]]['entsource']
    src_color=ent_cmap[e[1]]['entcolor']
    
    # Build target node
    target_id = getNodeID(e[2], e[3])
    target_title="<b>"+target_label+"</b><br><i>"+e[3]+"<br>"+e[2]+"</i><br>"+ent_srcmap[e[3]]['entsource']
    target_color=ent_cmap[e[3]]['entcolor']
    
    return (src_id, src_label, src_title, src_color), (target_id, target_label, target_title, target_color)

def edgeAttributes(ent1, ent2, edge_props):
    #Build edge attributes
    edge_prop_arr = edge_props.split(sep=',')
    num_arts = int(edge_prop_arr[0])-3
    edge_title = '<b>'+ent1+' --- '+ent2+'</b><br>Article Evidence(s) :<br>'
    art_type=''
    for i in range(3, len(edge_prop_arr)):
        art=edge_prop_arr[i].replace("[","")
        art=art.replace("]","")
        if("FT_" in art):
            art=art.replace("FT_","")
            art_type='CORD_UID :'
        else:
            art_type='PUBMED_ID :'
        edge_title+=art_type+'<i>'+art+'</i><br>'
    if(num_arts>5):
        edge_title+='and <i><b>'+str(num_arts)+'</b> more articles ...</i>'
    return edge_title

def buildGraph(G, filters=False):
    #Define Network layout
    net = Network(height="750px", width="100%", bgcolor="white", font_color="black", notebook=notebook_mode)
    net.options=net_options
    
    #Convert networkx G to pyvis network
    edges = G.edges(data=True)
    nodes = G.nodes(data=True)
    if len(edges) > 0:
        for e in edges:
            snode_attr=nodes[e[0]]
            tnode_attr=nodes[e[1]]            
            net.add_node(e[0], snode_attr['label'], title=snode_attr['title'], color=snode_attr['color'])
            net.add_node(e[1], tnode_attr['label'], title=tnode_attr['title'], color=tnode_attr['color'])
            net.add_edge(e[0], e[1], value=e[2]['value'], title=e[2]['title'])
    return net    

def applyGraphFilters(G, criteria):
    
    fnodes=set()
    # Filter1 - Connected nodes
    if(criteria['connected_nodes']):    
        bic = nx.biconnected_components(G)
        for i in bic:
            if(len(i)>2):
                fnodes=i.union(fnodes)
    
        # Get the sub-graph after applying the filter(s)
        G=G.subgraph(fnodes)
    
    # Filter2 - 'indirect_links'
    il_dicts = criteria['indirect_links']
    if(il_dicts is not None):    
        snode = il_dicts['source_node'] if ('source_node' in il_dicts) else criteria['src_ents'][0]
        snode = ent_name_id[snode.upper()]
        #Depth=Hops+1
        depth=(il_dicts['hops']+1) if('hops' in il_dicts) else 2
        
        if('target_nodes' in il_dicts):
            tnodes = il_dicts['target_nodes']
        elif(criteria['target_ents'] is not None):
            tnodes = criteria['target_ents']
        else:
            tnodes=criteria['src_ents']        
        tnodes = [ ent_name_id[i.upper()] for i in  tnodes]
    
        # Traverse k-hops from source to target nodes.            
        paths_between_generator = nx.all_simple_paths(G, source=snode, target=tnodes, cutoff=depth)
        #indirect_paths = [tuple(e) for e in paths_between_generator]
        
        indirect_paths=[]
        for path in paths_between_generator:
            # Collect the consecutive (node, next_node) edges of each path
            indirect_paths.extend(zip(path, path[1:]))
        G=G.edge_subgraph(indirect_paths)
    return G

def run(criteria):
    # Load the entire network
    nw_df = getNetwork()

    # Query the network with the defined search criteria
    qnw = queryNetwork(nw_df, criteria)
    print(' Number of Associations in the Final Network -->'+str(len(qnw)))

    # Build association network using the query result
    sources = qnw['src_ent']
    source_types=qnw['src_type']
    targets = qnw['target_ent']
    target_types=qnw['target_type']
    weights = qnw['score']
    stats = qnw['debug']
    edge_data = zip(sources, source_types, targets, target_types, weights, stats)

    G=nx.Graph()
    for e in edge_data:
        snode, tnode = buildNodeAttributes(e)
        G.add_node(snode[0], label=snode[1], title=snode[2], color=snode[3])
        G.add_node(tnode[0], label=tnode[1], title=tnode[2], color=tnode[3])
        G.add_edge(snode[0], tnode[0], value=e[4], title=edgeAttributes(snode[1],tnode[1], e[5]))

    applyFilter = (criteria['connected_nodes'] or criteria['indirect_links'])
    if(applyFilter):
        G=applyGraphFilters(G, criteria)

    net = buildGraph(G, applyFilter)
    return net

Define your Queries - Query by Keyword (text term)

In [None]:
# Prepare Entity Maps
enttype_map, ent_name, ent_id_name, ent_cmap, ent_srcmap = getEntityMaps()
#Display Entity Types
enttype_map

Top-15 Non-pharma interventions for Covid19
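
The `topk` and `topkByType` arguments used in these queries are applied by `queryTopk` through `groupby(...).head(k)`, which keeps the first k rows of each group in their existing order. The sketch below shows that behaviour on a toy frame with the same column names as the network file. Sorting by `score` first is an assumption about how one might want to rank the associations; the notebook itself relies on the row order of the file.

```python
import pandas as pd

# Toy association table; the values are made up for illustration.
nw = pd.DataFrame({
    "src_ent":     ["D1", "D1", "D1", "D1"],
    "target_type": ["NON-PHARMA INTERVENTION", "COUNTRIES",
                    "NON-PHARMA INTERVENTION", "NON-PHARMA INTERVENTION"],
    "target_ent":  ["N1", "C1", "N2", "N3"],
    "score":       [0.2, 0.9, 0.7, 0.5],
})

# Sort by score so that head(k) returns the k strongest rows per group.
ranked = nw.sort_values("score", ascending=False)
top2 = ranked.groupby(["src_ent", "target_type"]).head(2)
print(top2[["target_ent", "target_type", "score"]].to_string(index=False))
```

Here the lowest-scoring intervention (`N1`) is dropped, while the single country row is kept because its group has fewer than two rows.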

In [None]:
QueryTerms=['covid-19']
criteria = buildQueryCriteria(QueryTerms, topk=15, target_ent_types=['NON-PHARMA INTERVENTION'])
criteria
net = run(criteria)
net.show("cord19_npi.html")

Non-pharma interventions by countries for Covid19

In [None]:
QueryTerms=['covid-19']
criteria = buildQueryCriteria(QueryTerms,target_ent_types=['NON-PHARMA INTERVENTION','COUNTRIES'], topkByType=15)
criteria
net = run(criteria)
net.show("cord19_npi_countries.html")

#### Connecting Non-pharma interventions
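
With `connected_nodes=True`, `applyGraphFilters` keeps only nodes that belong to a biconnected component of more than two nodes, which removes entities attached to the rest of the graph by a single edge. A minimal sketch of that filter on a toy graph (the node names are made up):

```python
import networkx as nx

G = nx.Graph()
# A triangle of mutually linked nodes plus one node hanging off it.
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")])

kept = set()
for component in nx.biconnected_components(G):
    # A lone edge forms a two-node component; the filter drops those.
    if len(component) > 2:
        kept |= component

sub = G.subgraph(kept)
print(sorted(sub.nodes()))  # ['A', 'B', 'C']
```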

In [None]:
QueryTerms=['Entry screening', 'Exit screening', 'school closure', 'contact tracing', 'isolation', 'movement restriction', 'personal protective measures']
criteria = buildQueryCriteria(QueryTerms, topkByType=15, connected_nodes=True, target_ent_types=['NON-PHARMA INTERVENTION'])
criteria
net = run(criteria)
net.show("cord19_connected_npis.html")

Connecting Non-pharma interventions by countries

In [None]:
QueryTerms=['Entry screening', 'Exit screening', 'school closure', 'contact tracing', 'isolation', 'movement restriction', 'personal protective measures']
criteria = buildQueryCriteria(QueryTerms, topkByType=15, connected_nodes=True, target_ent_types=['NON-PHARMA INTERVENTION', 'COUNTRIES'])
criteria
net = run(criteria)
net.show("cord19_connected_npis_countries.html")

#### Covid-19 and school closure
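
The `indirect_links` option restricts the graph to the simple paths that connect a source node to the target nodes within a given number of hops; `applyGraphFilters` passes `hops + 1` as the cutoff to `nx.all_simple_paths`. A minimal sketch on a toy graph, in which the edges are made up and do not come from the network file:

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("covid-19", "isolation"),
    ("isolation", "school closure"),
    ("covid-19", "contact tracing"),
    ("contact tracing", "quarantine"),
    ("quarantine", "school closure"),
])

hops = 1
cutoff = hops + 1  # maximum number of edges allowed in a path

path_edges = []
for path in nx.all_simple_paths(G, source="covid-19",
                                target="school closure", cutoff=cutoff):
    # Turn each node path into its consecutive edges.
    path_edges.extend(zip(path, path[1:]))

sub = G.edge_subgraph(path_edges)
print(sorted(sub.nodes()))  # ['covid-19', 'isolation', 'school closure']
```

The three-edge route through contact tracing and quarantine exceeds the cutoff, so those nodes do not appear in the result.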

In [None]:
QueryTerms=['Entry screening', 'Exit screening', 'school closure', 'contact tracing', 'isolation', 'covid-19']
criteria = buildQueryCriteria(QueryTerms, topkByType=15, target_ent_types=['NON-PHARMA INTERVENTION', 'COUNTRIES'],
                              indirect_links={'source_node':'covid-19', 'target_nodes':['school closure'], 'hops':2})
criteria
net = run(criteria)
net.show("cord19_school_closure.html")

## Key Articles and Sentences
### School closure
Many of the articles that discuss school closures appear to argue against this NPI because of its social impact. For instance,
* PMID:32213332 says *"However, quarantine and workplace distancing should be prioritised over school closure because at this early stage, symptomatic children have higher withdrawal rates from school than do symptomatic adults from work."*
* PMID:32222161 full-text says *"School closures during the 2014–16 Ebola epidemic increased dropouts, child labour, violence against children, teen pregnancies, and persisting socioeconomic and gender disparities. Access to distance learning through digital technologies is highly unequal, and subsidised meal programmes, vaccination clinics, and school nurses are essential to child health care, especially for marginalised communities"*.
* A contrasting view comes from PMID:32242349, which says *"The simulation results show that the government could reduce at least 200 cases"*, making a case for school closure in South Korea.

### Entry/Travel Restrictions vs Airport screening NPIs
* PMID:32093043 says *"We found that in countries with low connectivity to China but with relatively high R loc, the most beneficial control measure to reduce the risk of outbreaks is a further reduction in their importation number either by entry screening or travel restrictions"*
* Moreover, PMID:32046816 suggests *"Airport screening is unlikely to detect a sufficient proportion of 2019-nCoV infected travellers to avoid entry of infected travellers."*

### Isolation of child patients
* PMID:32243729 says *"The continuous positive real-time reverse transcription-polymerase chain reaction assay for SARS-CoV-2 in the child's throat swab sample indicated the isolation period for suspected child cases should be longer than 14 days."*

### Countries (including the cruise ship Diamond Princess)
* PMID:32190785 looked at the *"Transmission potential of the novel coronavirus (COVID-19) onboard the diamond Princess Cruises Ship, 2020."* The authors state *"Our findings suggest that Rt decreased substantially compared to values during the early phase after the Japanese government implemented an enhanced quarantine control."*
* PMID:32183930 looks at *"Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship, Yokohama, Japan, 2020."* They mention that *"Most infections occurred before the quarantine start."*

Of course, these are very limited data points, but based on them it seems that:
1. **Entry/travel restrictions should be prioritized over airport screening**
2. **School closures may not be so effective, and also come with adverse social costs**
3. **Isolation period for suspected child cases should be > 14 days**
4. **Enforced quarantine of all passengers onboard the Diamond Princess might have led to increased cases**

As with many other NPIs, it is difficult to recommend a single NPI that fits all scenarios.

## References
1. Joseph T, Saipradeep VG, Raghavan GS, Srinivasan R, Rao A, Kotte S, Sivadasan N. TPX: Biomedical literature search made easy. Bioinformation 8(12): 578-80 (2012).
2. Rao A, Joseph T, Saipradeep VG, Kotte S, Sivadasan N, Srinivasan R. PRIORI-T: a tool for rare disease gene prioritization using MEDLINE. PLOS One (In Press).