In [1]:
# Check that we're using the 
# right python from the venv
import sys
print(sys.executable)

/home/ubuntu/code/kr2-graph/notebook_profile/env/bin/python


In [2]:
import os
from dotenv import load_dotenv

# neo4j is the official python driver
from neo4j import GraphDatabase

# py2neo is a community driver which is supposed 
# to have nice features for notebooks
from py2neo import Graph

# pull env vars for auth 
load_dotenv()

NEO4J_AUTH = (os.getenv("NEO4J_USER"), os.getenv("NEO4J_PASS"))
NEO4J_URI = os.getenv("NEO4J_URI")

# Official driver
NEO4J_DRIVER = GraphDatabase.driver(NEO4J_URI, auth=NEO4J_AUTH)
SESSION = NEO4J_DRIVER.session()

# Community driver
GRAPH = Graph(NEO4J_URI, auth=NEO4J_AUTH)


# Testing database connection

Just grabbing the WHO Regions, since it's an extremely small query which will return a limited set of nodes

In [3]:
subgraph = GRAPH.run('MATCH (n:Region) return n')
print(subgraph)

 n                                                           
-------------------------------------------------------------
 (_435:Region {name: 'Region of the Americas (PAHO)'})       
 (_436:Region {name: 'Eastern Mediterranean Region (EMRO)'}) 
 (_437:Region {name: 'Western Pacific Region (WPRO)'})       



Running the same query with the official Neo4j python driver:

In [4]:
subgraph = SESSION.run('MATCH (n:Region) RETURN n')
for node in iter(subgraph):
  print(node.data())

{'n': {'name': 'Region of the Americas (PAHO)'}}
{'n': {'name': 'Eastern Mediterranean Region (EMRO)'}}
{'n': {'name': 'Western Pacific Region (WPRO)'}}
{'n': {'name': 'European Region (EURO)'}}
{'n': {'name': 'African Region (AFRO)'}}
{'n': {'name': 'South-East Asia Region (SEARO)'}}


The Official driver certainly does better here; instead of summarizing the output too much it gives us the disconnected sub-graphs of all Regions. This is as expected because each Region node shouldn't be connected to the others. 

# Simple Taxonomic Paths

## Stepping through one edge

Super simple query: find the H1N1 Serotype, and find the label and name of its parent.

Here we're using the starting node, edge direction, and edge label (CONTAINS) so we should be able to access the direct parent node, according to the taxonomy, without specifying anything about it.

In [5]:
subgraph = SESSION.run('MATCH (H1N1:Serotype {name:"H10N1 subtype"})<-[:CONTAINS]-(parent) return H1N1, parent')

for node in subgraph.single().items():
  print(node, '\n')

('H1N1', <Node id=3574 labels=frozenset({'Serotype'}) properties={'name': 'H10N1 subtype'}>) 

('parent', <Node id=3573 labels=frozenset({'Species'}) properties={'name': 'Influenza A virus'}>) 



In [6]:
subgraph = GRAPH.run('MATCH (H1N1:Serotype {name:"H10N1 subtype"})<-[:CONTAINS]-(parent) return H1N1, parent')
print(subgraph)

 H1N1                                     | parent                                      
------------------------------------------|---------------------------------------------
 (_3574:Serotype {name: 'H10N1 subtype'}) | (_3573:Species {name: 'Influenza A virus'}) 



Again, the community driver has an easier output to look at in the notebook but the official driver offers better clarity into exactly what kind of response we're getting. 

Specifically, we see that this query is returning the two nodes we expect, and that the parent is a node of "Species" type named "Influenza A virus," which is as expected.

## Path analysis along taxonomy

This query uses a shortest path analysis to try to identify the family which contains H1N1. The query specifies only the start point (H1N1), the ending node type ":Family," and the edge direction, because following edges backwards from any serotype to a ":Family" node should return exactly one parent, the family which contains that serotype, without needing to know anything about the edge labels or node labels in between.

In [7]:
subgraph = SESSION.run(
  'MATCH (H1N1:Serotype {name:"H10N1 subtype"}), '
  'path = ((H1N1)<-[*]-(:Family)) '
  'RETURN path'
)

result = subgraph.single()

# get path from query result
path = result['path']

print('Path summary: ')
print(path, '\n')

print(f'Path length: {len(path)}\n')

print('Path Relationships (like triples):')
for step in iter(path):
  print(step, '\n')

print(f'Family name: {path.end_node["name"]} ')



Path summary: 
<Path start=<Node id=3574 labels=frozenset({'Serotype'}) properties={'name': 'H10N1 subtype'}> end=<Node id=3571 labels=frozenset({'Family'}) properties={'name': 'Orthomyxoviridae'}> size=3> 

Path length: 3

Path Relationships (like triples):
<Relationship id=4068 nodes=(<Node id=3573 labels=frozenset({'Species'}) properties={'name': 'Influenza A virus'}>, <Node id=3574 labels=frozenset({'Serotype'}) properties={'name': 'H10N1 subtype'}>) type='CONTAINS' properties={}> 

<Relationship id=4066 nodes=(<Node id=3572 labels=frozenset({'Genus'}) properties={'name': 'Alphainfluenzavirus'}>, <Node id=3573 labels=frozenset({'Species'}) properties={'name': 'Influenza A virus'}>) type='CONTAINS' properties={}> 

<Relationship id=4065 nodes=(<Node id=3571 labels=frozenset({'Family'}) properties={'name': 'Orthomyxoviridae'}>, <Node id=3572 labels=frozenset({'Genus'}) properties={'name': 'Alphainfluenzavirus'}>) type='CONTAINS' properties={}> 

Family name: Orthomyxoviridae 


This result shows us that the relationships are correct and are as expected. The query returned a three-step path to the family "Orthomyxoviridae," which can be represented as relationship triples as shown above. 

We can also run this query through the community driver, which returns a Cypher representation of the path:

In [8]:
subgraph = GRAPH.run(
  'MATCH (H1N1:Serotype {name:"H10N1 subtype"}), '
  'path = ((H1N1)<-[*]-(:Family)) '
  'RETURN path'
)

print(subgraph)

 path                                                                                                                         
------------------------------------------------------------------------------------------------------------------------------
 (H10N1 subtype)<-[:CONTAINS {}]-(Influenza A virus)<-[:CONTAINS {}]-(Alphainfluenzavirus)<-[:CONTAINS {}]-(Orthomyxoviridae) 



# Diseases in the database

In prior work on this project, a set of diseases had been identified by Talus for prototyping; these have been ingested into the graph database. 

In addition, a few syndromic categories were prototyped, which were also ingested and linked. 

These are developed for the purposes of analyzing the approach overall, and are not intended to represent the full set of diseases and categories used in the proof of concept deliverable.

## Diseases



In [9]:
subgraph = SESSION.run('MATCH (n:Disease) RETURN n LIMIT 10')
for node in iter(subgraph):
  print(node.data())

{'n': {'name': 'Enterovirus'}}
{'n': {'name': 'Gnathostoma'}}
{'n': {'name': 'Rickettsia'}}
{'n': {'name': 'Gnathostomiasis'}}
{'n': {'name': 'Carbapenem-resistant enterobacteriaceae (CRE)'}}
{'n': {'name': 'Respiratory Illness'}}
{'n': {'name': 'Vibrio, noncholera'}}
{'n': {'name': 'Scrub Typhus'}}
{'n': {'name': 'Nipah/Hendra Virus'}}
{'n': {'name': 'Enterobacter cloacae'}}


## Syndromic Categories

In [10]:
subgraph = SESSION.run('MATCH (n:SyndromicCategory) RETURN n LIMIT 10')
for node in iter(subgraph):
  print(node.data())

{'n': {'name': 'Hospital Acquired Infections'}}
{'n': {'name': 'Hemorrhagic'}}
{'n': {'name': 'Gastrointestinal'}}
{'n': {'name': 'Vectorborne'}}
{'n': {'name': 'Fever/Febrile'}}
{'n': {'name': 'Respiratory'}}


## Syndromic Relationships

A query for a specific disease, searching all "CONTAINS" edges bidirectionally to find other Disease nodes, should use the syndromic category to determine syndromically related diseases: 

In [13]:
subgraph = SESSION.run(
  'MATCH (n:Disease {name: "Enterovirus"}), '
  '(n)-[:CONTAINS*1..2]-(related:Disease) '
  ' RETURN related'
)

for node in iter(subgraph):
  print(node.data())

{'related': {'name': 'Relapsing Fever'}}
{'related': {'name': 'Fever'}}
{'related': {'name': 'Toxoplasmosis'}}
{'related': {'name': 'Toxocariasis'}}
{'related': {'name': 'Strep'}}
{'related': {'name': 'Scarlet Fever'}}
{'related': {'name': 'Typhoid'}}
{'related': {'name': 'Typhus'}}
{'related': {'name': 'Q Fever'}}
{'related': {'name': 'Parotitis'}}
{'related': {'name': 'Otitis media'}}
{'related': {'name': 'Mumps'}}
{'related': {'name': 'Mononucleosis'}}
{'related': {'name': 'Leptospirosis'}}
{'related': {'name': 'Cytomegalovirus'}}
{'related': {'name': 'Cat Scratch Fever'}}


## Symptoms and symptom relationships

For a few of the diseases, a sets of symptoms with mock frequencies were produced, to populate that node category and relationship type: 

In [28]:
subgraph = SESSION.run(
  'MATCH (d:Disease {name: "Ebola"}), '
  '(n)-[r:CAUSES]->(s:Symptom) '
  ' RETURN d,r,s LIMIT 2'
)

for path in iter(subgraph):
  print('\nSymptom Path: ')
  for step in iter(path):
    print(step, '\n')


Symptom Path: 
<Node id=22 labels=frozenset({'Disease'}) properties={'name': 'Ebola'}> 

<Relationship id=180 nodes=(<Node id=53 labels=frozenset() properties={}>, <Node id=160 labels=frozenset({'Symptom'}) properties={'name': 'Sneezing'}>) type='CAUSES' properties={'frequency': '3.0'}> 

<Node id=160 labels=frozenset({'Symptom'}) properties={'name': 'Sneezing'}> 


Symptom Path: 
<Node id=22 labels=frozenset({'Disease'}) properties={'name': 'Ebola'}> 

<Relationship id=188 nodes=(<Node id=22 labels=frozenset({'Disease'}) properties={'name': 'Ebola'}>, <Node id=161 labels=frozenset({'Symptom'}) properties={'name': 'Sore throat'}>) type='CAUSES' properties={'frequency': '2.0'}> 

<Node id=161 labels=frozenset({'Symptom'}) properties={'name': 'Sore throat'}> 



The inclusion of frequency as a property on the edge shows how the LPG can represent rich information more intuitively than a plain triplestore.

# Exploring outbreak reports

The Georgetown Center for Global Health Science and Security publishes a [database of coded WHO Disease Outbreak Reports](https://github.com/cghss/dons), which has been ingested to this prorotype graph database. 

First a matching table was created (by hand, for the sake of prototyping) of all diseases where in the 