# Deeper Insights COVID-19 Knowledge Graph

https://deeperinsights.com/

Edward Brown, Marcia Oliveira, Eduardo Piairo, Brett Drury

## Overview

Our submission to the challenge involves a Knowledge Graph that encodes biomedical concepts and the relations between them.

We should say up front that no one on our team has a formal medical background. For our Round 1 submission, we have therefore focused most of our effort on creating a tool that lets medical researchers answer the sorts of questions in the Tasks, rather than on answering them all directly ourselves.

The Graph represents a backend that, in the next few weeks, we will open up via a web interface at [covid.deeperinsights.com](http://covid.deeperinsights.com/), with a much simplified search method (e.g. closer to a traditional search box than our code-based queries here), Topic Graph visualisations, Graph navigation, document quality metrics, and so on. This fully-developed tool will be for open use and knowledge discovery, but we also intend to use it as the basis of our Round 2 submission.

Even so, we do provide here sets of queries and outputs (the results of performing “low-level” queries against our Graph directly) for a subset of the Round 1 questions, which we believe show very promising results.

Since the code behind the Graph creation process, its deployment, etc., is much too involved for a Notebook context, we will instead outline our methodology descriptively. Query code, however, is provided later in this Notebook.

## Document Processing

We'll say a little here about the two phases of creating our Graph: first, document processing, consisting of a custom spaCy/SciSpacy pipeline, and second, a bulk insert into our Amazon Neptune instance.

### SciSpacy and UMLS

Initial parsing of the Kaggle Challenge source data was done using [SciSpacy](https://allenai.github.io/scispacy/) (a Python package containing spaCy models for processing biomedical, scientific or clinical text). We also used the SciSpacy UMLS Named Entity Linker to link the entities found by its NER to a Knowledge Base consisting of UMLS ([Unified Medical Language System](https://www.nlm.nih.gov/research/umls)) Concepts.

Besides SciSpacy, we customised our pipeline to enrich documents with some additional information.

### Custom COVID-19 Entity Recognition

Since COVID-19 is a new disease, and lacks an entry in the UMLS, it isn't linked by the SciSpacy Linker. Hence we created an additional NER dedicated to recognising mentions of COVID-19. This was added to the pipeline *prior to* the SciSpacy NER, to prevent the latter incorrectly identifying COVID-19 as an existing UMLS concept. We then linked these mentions to a Concept ID from the [Supplementary Concept Record for Coronavirus Disease 2019](http://www.nlm.nih.gov/pubs/techbull/jf20/brief/jf20_mesh_novel_coronavirus_disease.html) newly issued by MeSH, allowing them to be queried much as any other UMLS Concept. The result of the Document Processing phase is a large [Neptune CSV format](https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format-gremlin.html) file, ready for bulk insert into Amazon Neptune.
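
The idea of a dedicated recogniser that runs first and takes precedence can be sketched in plain Python. This is only an illustration using regular expressions, not the spaCy component from the actual pipeline (which is not shown in this Notebook). The surface forms in `COVID_PATTERN`, the placeholder `COVID_CONCEPT_ID`, and the helper names are all assumptions.

```python
import re

# Hypothetical placeholder: stands in for the MeSH Supplementary Concept
# Record ID, which is not reproduced here.
COVID_CONCEPT_ID = "COVID19_PLACEHOLDER_ID"

# Assumed surface forms; the real recogniser's coverage is not known.
COVID_PATTERN = re.compile(
    r"\b(?:covid[-\s]?19|coronavirus\s+disease\s+2019)\b",
    re.IGNORECASE,
)


def tag_covid_mentions(text):
    """Return (start, end, surface_text, concept_id) for each COVID-19 mention."""
    return [
        (m.start(), m.end(), m.group(0), COVID_CONCEPT_ID)
        for m in COVID_PATTERN.finditer(text)
    ]


def drop_overlapping(priority_spans, later_spans):
    """Keep only the later spans that do not overlap any priority span.

    This mimics running the COVID-19 recogniser before a general NER step,
    so that the general step cannot relabel the same text.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    return [
        span for span in later_spans
        if not any(overlaps(span, p) for p in priority_spans)
    ]


text = "COVID-19 is a disease. Coronavirus disease 2019 spread quickly."
covid_spans = tag_covid_mentions(text)

# Invented output of a later, general-purpose NER step.
general_spans = [(0, 5, "COVID", "C_OTHER"), (14, 21, "disease", "C_DISEASE")]
kept = drop_overlapping(covid_spans, general_spans)
print([s[2] for s in covid_spans])
print([s[2] for s in kept])
```

The span that overlaps the COVID-19 mention is discarded, while the unrelated "disease" span survives.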

## Amazon Neptune and Gremlin

Once loaded, the dataset is initially represented in the Graph as entities-within-sentences-within-sections, roughly as follows:

- Document
  - Title
    - Sentence<sub>1</sub>
      - Entity<sub>1</sub>
      - ...
      - Entity<sub>n</sub>
    - ...
    - Sentence<sub>n</sub>
  - Abstract
    - (As Title)
  - Full Text
    - (As Title) 
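
To make the hierarchy above concrete, here is a minimal sketch of what vertex and edge files in the Neptune Gremlin CSV load format could look like for a single sentence. The `~id`, `~label`, `~from` and `~to` columns are Neptune's system headers, and the edge labels (`section`, `sentence`, `entity`) and properties (`name`, `text`, `sid`) are the ones used by the queries later in this Notebook. The IDs, the `Section` and `Sentence` vertex labels, and the row contents are assumptions for illustration, not the actual load files.

```python
import csv
import io


def neptune_csv(header, rows):
    """Serialise a header and rows to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Vertex file: system columns first, then typed property columns.
# Cells are left blank where a property does not apply to that vertex.
vertices = neptune_csv(
    ["~id", "~label", "sid:String", "name:String", "text:String"],
    [
        ["doc1", "Document", "doc1", "", ""],
        ["doc1_title", "Section", "", "Title", ""],
        ["doc1_title_s1", "Sentence", "", "", "COVID-19 is a disease."],
        ["ent_covid19", "Entity", "", "COVID-19", ""],
    ],
)

# Edge file: each row links a parent vertex to a child vertex.
edges = neptune_csv(
    ["~id", "~from", "~to", "~label"],
    [
        ["e1", "doc1", "doc1_title", "section"],
        ["e2", "doc1_title", "doc1_title_s1", "sentence"],
        ["e3", "doc1_title_s1", "ent_covid19", "entity"],
    ],
)

print(vertices)
print(edges)
```

With this layout, a traversal such as `out('section').out('sentence').out('entity')` walks from a Document down to its Entities.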
  

Since our Graph instance is under Neptune, it uses the Gremlin query language. Hence all code examples below are written in Gremlin.

### The UMLS Semantic Network

Once hosted, we further enriched our Graph with the structure (in addition to the Concepts, that is) behind the [Unified Medical Language System](https://www.nlm.nih.gov/research/umls/index.html) (UMLS).

Each Entity found during the NLP phase is linked, via an `instance_of` edge, to a) one `EntityType` and b) one or more `SemanticType` Vertices, meaning every `Entity` is given its place within the structure of the [UMLS Taxonomy](https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html). From there, we linked each `SemanticType` Vertex with any corresponding relationships mentioned in the [UMLS Semantic Network](https://semanticnetwork.nlm.nih.gov/). This allows us to query our documents according to that network too. For example, to find `Entities` in our dataset that, say, are the *types of things* (Bodily Organs, for example) that produce a Body Substance, we can issue this query:

```python
g.V().out('entity').where(out('instance_of').hasLabel('SemanticType').out('produces').has('name','Body Substance'))
```

To find ones (e.g. "child", "Adults") that represent Age Groups:

```python
g.V().out('entity').where(out('instance_of').hasLabel('SemanticType').in('isa').has('name','Age Group'))
```
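
For readers unfamiliar with Gremlin, the logic of the first query (Entity, then `instance_of` to a Semantic Type, then `produces` to Body Substance) can be mimicked with plain dictionaries. The sketch below is a toy: the entity names, type assignments and the single `produces` relation are invented for illustration and are not taken from the Graph or checked against the UMLS.

```python
# Toy stand-in for `instance_of` edges: Entity name -> Semantic Type names.
instance_of = {
    "Salivary Glands": ["Body Part, Organ, or Organ Component"],
    "Saliva": ["Body Substance"],
    "Child": ["Age Group"],
}

# Toy stand-in for `produces` edges between Semantic Types.
produces = {
    "Body Part, Organ, or Organ Component": ["Body Substance"],
}


def entities_producing(target):
    """Entities with at least one Semantic Type that `produces` the target type."""
    return sorted(
        entity
        for entity, types in instance_of.items()
        if any(target in produces.get(t, []) for t in types)
    )


print(entities_producing("Body Substance"))  # ['Salivary Glands']
```

Note that "Saliva" is not returned: it *is* a Body Substance in this toy data, but nothing says its type *produces* one.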

## Pros and Cons

### Pros

- Our solution is extremely powerful, allowing queries to exploit the rich structure of the UMLS network.
- Queries can output arbitrary numbers of columns and normalised data types, and can return raw UMLS codes instead of text. Hence our approach can turn free-text into structured data directly.
  
### Cons

- The power mentioned in the Pros section has the downside of requiring users to familiarise themselves with the Gremlin query language and the structure of the UMLS Semantic Network. This makes query development much slower and more iterative than with more user-friendly search-box solutions.
- We are currently using an early version of the Kaggle dataset with 13K documents only.
- While not a Con of the approach itself, our Graph currently lacks co-reference resolution, which likely harms recall significantly. Take the sentence "**COVID-19** is a disease. It is caused by the virus SARS-CoV-2". Since "It" is not recognised as a UMLS Concept, the valuable information in the second sentence would be missed by queries. So, for our Round 2 submission, we plan to add co-reference resolution, along with the abovementioned improvements, and copy any Concept information from "COVID-19" to "It".
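
To illustrate what copying Concept information to a pronoun would achieve, here is a deliberately naive sketch. It is not a co-reference resolver and not the planned implementation: it simply assumes that a sentence opening with a pronoun refers to the concepts of the previous sentence. The pronoun list, the function name and the concept lists are assumptions for illustration.

```python
# Assumed, deliberately tiny pronoun list.
PRONOUNS = {"it", "this", "they"}


def propagate_concepts(sentences):
    """Copy concepts forward to sentences that open with a pronoun.

    `sentences` is a list of (text, [concept names]) pairs in document order.
    """
    result = []
    previous = []
    for text, concepts in sentences:
        words = text.split()
        first_word = words[0].lower() if words else ""
        merged = list(concepts)
        if first_word in PRONOUNS:
            merged += [c for c in previous if c not in merged]
        result.append((text, merged))
        previous = merged
    return result


doc = [
    ("COVID-19 is a disease.", ["COVID-19"]),
    ("It is caused by the virus SARS-CoV-2.", ["SARS-CoV-2"]),
]
for text, concepts in propagate_concepts(doc):
    print(concepts, "<-", text)
```

After propagation, the second sentence carries the COVID-19 concept too, so a query anchored on COVID-19 would now reach it.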
  

The remainder of the Notebook contains our queries and outputs for selected Task questions.

In [None]:
# Utilities to display results.

from IPython.display import display, HTML
from pathlib import Path
import pandas as pd
import ast

# None disables column-width truncation (the older value -1 is deprecated in recent pandas).
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 100)

def format_cell(value):
    """Flatten nested lists and dicts into a single display string."""
    if isinstance(value, list):
        # Format each element recursively, so nested values are flattened too.
        return '; '.join(format_cell(child) for child in value)
    if isinstance(value, dict):
        return '; '.join(f'{format_cell(k)}={format_cell(v)}' for k, v in value.items())
    return str(value)


def show_results(filename):
    """Load a saved query output (one Python dict literal per line) into a DataFrame."""
    path = Path(f'/kaggle/input/queryoutputs/{filename}.txt')
    lines = path.read_text(encoding='utf-8').splitlines()
    # Skip blank lines, which ast.literal_eval cannot parse.
    line_dicts = [ast.literal_eval(line.strip()) for line in lines if line.strip()]
    df = pd.DataFrame(line_dicts)
    for column in df.columns:
        df[column] = df[column].apply(format_cell)
    return df
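
The utilities above expect each line of a saved output file to be a Python dict literal, one row per line, whose values may be nested (a `valueMap` step, for instance, returns a dict of lists). The sketch below shows the parsing and flattening on a single made-up line. The sample content and the local `flatten` helper are assumptions for illustration and are not real query output.

```python
import ast

# Made-up line for illustration; real files hold actual query results.
sample_line = (
    "{'Concept': {'name': ['Stainless Steel']}, "
    "'Sentence': 'Example sentence text.', 'DocID': 'doc-0001'}"
)


def flatten(value):
    """Render nested lists and dicts as one display string."""
    if isinstance(value, list):
        return "; ".join(flatten(v) for v in value)
    if isinstance(value, dict):
        return "; ".join(f"{flatten(k)}={flatten(v)}" for k, v in value.items())
    return str(value)


row = ast.literal_eval(sample_line)  # safe parsing of a Python literal
flat = {column: flatten(cell) for column, cell in row.items()}
print(flat["Concept"])  # name=Stainless Steel
```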
    


## Prevalence of asymptomatic shedding and transmission (e.g., particularly children).

Query finds the UMLS concepts pertaining to Viral Shedding and Asymptomatic Infection in the same context as those for COVID-19. It also optionally finds the concept for the Age Group for Children, and outputs a Boolean `true` if the Sentence refers to children.

```python
%%gremlin

g.V().hasLabel('Document'). # Find Document Vertices
    as('DocID').
    out('section').out('sentence').as('Sentence').out('entity').has('name','COVID-19').select('Sentence'). # Find Sections with Sentences that mention COVID-19.
    out('entity').
    has(
        'name', 
        within( # In the same context, find relevant concepts.            
            'viral release from host cell',
            'Viral Shedding',
            'Asymptomatic',
            'Asymptomatic Infections',
            'Asymptomatic Diseases'
        )       
    ).as('Concept')   
    .coalesce( # Optionally also find concepts pertaining to Children, and cast them to Boolean.
        select('Sentence').out('entity').has('name',within('Child')).constant(true),
        constant(false)
        ).as('In Children').
    limit(75).
    order().by(decr). # Order by the Child cases first.
    select('Concept','In Children','Sentence','DocID'). # Output the specified Vertices to columns.        
    by(values('name')).
    by().
    by(values('text')).
    by(values('sid')).
    dedup()
```

The output below shows the `Concept` found (i.e. whether the text is referring to Asymptomatic Infections or Viral Shedding) along with the Boolean `In Children` column, which is ordered to show cases involving children first. The output also includes the source Sentence text and the source Document ID.
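
The `coalesce` step in the query acts as an if/else: if the Sentence also has a `Child` entity it yields `true`, otherwise it falls through to `false`. A rough Python equivalent of the whole pattern (filter on COVID-19, match target concepts, flag children, sort flagged rows first) is sketched below. Plain sets stand in for the Graph, and the sentence IDs and entity sets are invented for illustration.

```python
TARGETS = {"Viral Shedding", "Asymptomatic", "Asymptomatic Infections"}

# Invented sentences: (sentence id, entity names found in it).
sentences = [
    ("s1", {"COVID-19", "Viral Shedding"}),
    ("s2", {"COVID-19", "Asymptomatic", "Child"}),
    ("s3", {"Influenza", "Viral Shedding", "Child"}),
]

rows = []
for sid, entities in sentences:
    if "COVID-19" not in entities:
        continue  # like has('name','COVID-19'): keep COVID-19 contexts only
    for concept in sorted(entities & TARGETS):
        rows.append({
            "Concept": concept,
            # like coalesce(...constant(true), constant(false))
            "In Children": "Child" in entities,
            "Sentence": sid,
        })

# Rows that mention children come first.
rows.sort(key=lambda r: r["In Children"], reverse=True)
for r in rows:
    print(r)
```

Sentence `s3` is dropped because it does not mention COVID-19, even though it matches a target concept and mentions children.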

In [None]:
show_results('asymp_child')

## Seasonality of transmission

This query enumerates the UMLS concepts pertaining to seasons and seasonality (the Semantic Types pertaining to periods were too broad to be useful here) and extracts them when in the same context as COVID-19.

```python
%%gremlin

g.V().hasLabel('Document').
    as('DocID').
    out('section').out('sentence').
    as('Sentence').
    out('entity').has('name',
            within( # Concepts for seasonality.
                'summer',
                'winter',
                'spring (season)',
                'Autumn',
                'Holidays',
                'Seasons',
                'Seasonal course',
                'seasonality',
                'Seasonal Variation'
            )
    ).as('Concept').select('Sentence').
    out('entity').has('name','COVID-19').as('COVID').
    dedup().    
    select('Concept',"Sentence","DocID").        
    by(values('name')).
    by(values("text")).    
    by(values("sid")).    
    limit(15).
    dedup()
```
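
One detail worth noting in the query above is that `limit(15)` runs before the final `dedup()`, so fewer than 15 distinct rows may come back if some of the first 15 are duplicates. The effect of the ordering can be seen in this small Python sketch, where a list of letters stands in for result rows (the data is invented):

```python
from itertools import islice


def dedup(items):
    """Yield each item once, preserving first-seen order."""
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item


rows = ["a", "a", "b", "a", "c", "d"]  # stand-in for query result rows

limit_then_dedup = list(dedup(islice(rows, 3)))  # like limit(3).dedup()
dedup_then_limit = list(islice(dedup(rows), 3))  # like dedup().limit(3)

print(limit_then_dedup)  # ['a', 'b']
print(dedup_then_limit)  # ['a', 'b', 'c']
```

Swapping the two steps in a query is a simple way to guarantee a fixed number of distinct rows, where enough exist.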

In [None]:
show_results('season')

## Persistence and stability on a multitude of substrates and sources (e.g., nasal discharge, sputum, urine, fecal matter, blood).


Happily, the UMLS contains a broad Semantic Type, `Body Substance`, which saved us the gruesome task of enumerating the individual concepts. There were, however, numerous concepts pertaining to Transmission that needed including.

```python
%%gremlin

g.V().hasLabel('Document').
    as('DocID').
    out('section').out('sentence').
    as('Sentence').out('entity').has('name','COVID-19').select('Sentence').
    out('entity').where(out('instance_of').hasLabel('SemanticType').has('name','Body Substance')). # Semantic Type for Body Substance
    not(has('name','Matrix substance')).as('Substance'). # Excluded this individual concept as it led to false positives
    select('Sentence').out('entity').
    has(
        'name', 
        within( # Multiple concepts seemed to apply.
            'transmission process',
            'disease transmission qualifier',
            'Droplet Transmission',
            'viral transmission',
            'disease transmission',
            'Horizontal Transmission',
            'Transmitted by',
            'Mode of transmission',
            'Coefficient [transmission coef]',
            'Pathway (interactions)',
            'Direct Transmission',
            'Protein Dynamics',
            'Transmitter Device Component',
            'Contact Transmission',
            'Fecal-oral transmission',
            'Vector-borne transmission',
            'Sexual transmission',
            'Vertical Disease Transmission',
            'Vector-transmitted infectious disease',
            'Animal to human transmission'
        )
    ).
    as('Transmission').
    select('Sentence').
    select('Substance','Transmission','Sentence','DocID').        
    by(values('name')).
    by(values('name')).
    by(values('text')).
    by(values('sid')).
    limit(10)
```

The output contains two key columns: `Substance`, showing which Body Substance was involved, and `Transmission`, showing the Transmission type.

In [None]:
show_results('persist_body')

## Persistence of virus on surfaces of different materials (e.g., copper, stainless steel, plastic).

```python
%%gremlin

g.V().hasLabel('Document').
    as('DocID').
    out('section').out('sentence').
    as('Sentence').out('entity').has('name','COVID-19').
    select('Sentence').out('entity').
    has(
        'name', 
        within(
                'Stainless Steel',
                'Wood material',
                'Plastics',
                'plastic bags',
                'Box',
                'Metals',
                'Metalic',
                'Paper Dosing Unit',
                'Acrylic dental material'
        )        
    ).        
    as('Concept').
    select('Concept','Sentence','DocID').        
    by(valueMap('name')).
    by(values('text')).
    by(values('sid')).
    limit(1)
```

This query did not do well. This is likely because, being a medical Ontology, UMLS has poor coverage for material concepts. Adding a materials-specific NER to our document processing pipeline would likely improve results.

In [None]:
show_results('materials')

## Disease models, including animal models for infection, disease and transmission

The UMLS also contains a broad Semantic Type, `Experimental Model of Disease`, so we can simply query for that.

```python
%%gremlin

g.V().hasLabel('Document').
    as('DocID').
    out('section').out('sentence').
    as('Sentence').out('entity').has('name','COVID-19').
    select('Sentence').out('entity').
    where(out('instance_of').hasLabel('SemanticType').has('name','Experimental Model of Disease')). # Semantic Type for Disease Models
    as('Concept').
    select('Concept','Sentence','DocID').        
    by(values('name')).
    by(values('text')).
    by(values('sid')).
    limit(3).
    dedup()
```

The output shows only a couple of results, however, and these are speculative.

In [None]:
show_results('models')

## Tools and studies to monitor phenotypic change and potential adaptation of the virus

```python
%%gremlin

g.V().hasLabel('Document').
    as('DocID').
    out('section').out('sentence').
    as('Sentence').out('entity').has('name','COVID-19').
    select('Sentence').out('entity').
    has(
        'name', 
        within( # Concepts pertaining to Phenotypic change/adaptation.
            'Acclimatization',
            'Phenotyping (qualifier value)',
            'Adaptation',
            'Phenotypic variability',
            'Mutation'
        )        
    ).        
    as('Concept').
    select('Concept','Sentence','DocID').        
    by(values('name')).
    by(values('text')).
    by(values('sid')).
    limit(25)
```

In [None]:
show_results('pheno')

## Immune response and immunity

Simple query to find the two concepts mentioned in the Task.

```python
%%gremlin

g.V().hasLabel('Document').
    as('DocID').
    out('section').out('sentence').
    as('Sentence').out('entity').has('name','COVID-19').
    select('Sentence').out('entity').
    has(
        'name', 
        within(
            'Immune response',
            'Immunity'
        )        
    ).        
    as('Concept').
    select('Concept','Sentence','DocID').        
    by(values('name')).
    by(values('text')).
    by(values('sid')).
    limit(25)
```

In [None]:
show_results('immune')

## Effectiveness of personal protective equipment (PPE) and its usefulness to reduce risk of transmission in health care and community settings

```python
%%gremlin

g.V().hasLabel('Document').
    as('DocID').
    out('section').out('sentence').
    as('Sentence').
    out('entity').has('name',
            within( # Concepts pertaining to types of PPE
                'Protective gloves',       
                'Masks',                   
                'Orthodontic Facemask',    
                'Gown',                    
                'Scrubs',
                'Face shield' ,
                'Pulmonary Hygiene',
                'Personal protective equipment'
            )
    ).as('Concept').select('Sentence').
    out('entity').has('name','COVID-19').as('COVID').    
    coalesce(
        select('Sentence').out('entity').
        has(
            'name',
            within( # Extract concepts for Medical Staff, and cast to Boolean.
                'Health Personnel',
                'Medical Staff'
            )
        ).constant(true),
        constant(false)
        ).as('Medical Staff').  
    select('Concept','Medical Staff',"Sentence","DocID").        
    by(values('name')).
    by().
    by(values("text")).    
    by(values("sid")).
    limit(35).
    dedup()    
```

Output shows the type of PPE in question, plus a Boolean `Medical Staff` column, denoting if the text refers to Medical Staff or not (i.e. the population).

In [None]:
show_results('ppe')