<a href="https://colab.research.google.com/github/radxrad/radx-kg/blob/main/notebooks/visualization/RADx-rad_Explorer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AMIA 2024 Notebook
This notebook explores a subset of the RADx-rad Knowledge Graph (radx-kg) for the [AMIA 2024 Annual Symposium](https://amia.org/education-events/amia-2024-annual-symposium) paper.

It runs queries on the radx-kg to explore the content of, and relationships among, RADx-rad digital assets. Query results are displayed as tables, bar plots, choropleth maps, and subgraph visualizations.

To run this notebook, select ```Runtime -> Run all``` from the menu.

The radx-kg consists of nodes and their relationships that can be queried with the [Cypher graph query language](https://neo4j.com/docs/getting-started/cypher-intro/#_getting_started_with_cypher).

<p align="center">
<img src="https://github.com/radxrad/radx-kg/blob/main/docs/Sub-schema.png?raw=1" width="100%">
</p>

## Setup
This notebook installs the Neo4j Graph Database and imports the RADx-KG data and metadata. The software setup may take 2-3 minutes.

In [None]:
#@title Check if Notebook is running in Google Colab
in_colab = False
try:
    import google.colab
    in_colab = True
except ImportError:
    pass

In [None]:
#@title Install software
if in_colab:
    # enable third party widgets in Colab
    from google.colab import output
    output.enable_custom_widget_manager()
    output.no_vertical_scroll()

    # copy required files (temporary solution)
    !wget -q https://raw.githubusercontent.com/pwrose/neo4j-ipycytoscape/master/notebooks/neo4j_utils.py
    !wget -q https://raw.githubusercontent.com/sbl-sdsc/kg-import/master/notebooks/neo4j_bulk_importer.py
    !wget -q https://raw.githubusercontent.com/sbl-sdsc/kg-import/master/notebooks/utils.py
    !wget -q https://raw.githubusercontent.com/sbl-sdsc/kg-import/master/notebooks/PrepareNeo4jBulkImport.ipynb
    !wget -q https://raw.githubusercontent.com/radxrad/radx-kg/main/notebooks/visualization/embed.py

    !git clone --quiet https://github.com/radxrad/radx-kg.git

    # install software
    !apt -qq install openjdk-17-jre-headless 2>/dev/null > /dev/null
    %pip install -q papermill > /dev/null
    %pip install -q py2neo > /dev/null
    %pip install -q ipycytoscape > /dev/null
    %pip install -q python-dotenv > /dev/null
    %pip install -q plotly > /dev/null

    # set environment variables
    from dotenv import load_dotenv
    load_dotenv("/content/radx-kg/.env.colab")
else:
    # copy required files (temporary solution)
    !curl -s -O https://raw.githubusercontent.com/pwrose/neo4j-ipycytoscape/master/notebooks/neo4j_utils.py
    !curl -s -O https://raw.githubusercontent.com/sbl-sdsc/kg-import/master/notebooks/neo4j_bulk_importer.py
    !curl -s -O https://raw.githubusercontent.com/sbl-sdsc/kg-import/master/notebooks/utils.py
    !curl -s -O https://raw.githubusercontent.com/sbl-sdsc/kg-import/master/notebooks/PrepareNeo4jBulkImport.ipynb
    from dotenv import load_dotenv
    load_dotenv("../../.env", override=True)

In [None]:
#@title Imports
import os
import time
import pandas as pd
import seaborn as sns
import plotly.express as px
from py2neo import Graph
import neo4j_utils
import neo4j_bulk_importer

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

### Download and install Neo4j Graph Database
Install the Neo4j Community Edition.

In [None]:
neo4j_utils.download_neo4j_community()

### Import the RADx-rad Knowledge Graph
CSV data and metadata files from the [kg](https://github.com/radxrad/radx-kg/tree/main/kg) directory are loaded into the Neo4j Graph database using the [kg-import](https://github.com/sbl-sdsc/kg-import) bulk upload scripts. For a description of the data organization and the metadata specification, see the [kg-import README](https://github.com/sbl-sdsc/kg-import/blob/main/README.md).

In [None]:
neo4j_bulk_importer.import_from_csv_to_neo4j_community();

### Connect to the local Neo4j Graph database

In [None]:
database = os.environ.get("NEO4J_DATABASE")
username = os.environ.get("NEO4J_USERNAME")
password = os.environ.get("NEO4J_PASSWORD")
stylesheet = os.environ.get("NEO4J_STYLESHEET")

graph = Graph("bolt://localhost:7687", name=database, user=username, password=password)
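The connection settings above come from environment variables, so an unset variable silently becomes `None`. A minimal sketch of a check that could run before connecting is shown here; the helper name `missing_settings` and the example values are hypothetical, and only the four variable names are taken from the cell above.

```python
import os

# Variable names are the ones read in the cell above.
REQUIRED = ("NEO4J_DATABASE", "NEO4J_USERNAME", "NEO4J_PASSWORD", "NEO4J_STYLESHEET")


def missing_settings(environ, required=REQUIRED):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not environ.get(name)]


# Example with a hypothetical, partially filled environment.
example_env = {"NEO4J_DATABASE": "example-db", "NEO4J_USERNAME": "neo4j"}
print(missing_settings(example_env))

# In the notebook, the real environment would be checked instead.
missing_now = missing_settings(os.environ)
if missing_now:
    print(f"Unset variables: {', '.join(missing_now)}")
```

If the list is not empty, the `.env` file was probably not found by `load_dotenv` in the install cell.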

## Table of Contents
* [Metadata](#Metadata)
* [Subgraph](#Subgraph)
* [Publications](#Publications)
* [Presentations](#Presentations)
* [Events](#Events)
* [Organizations](#Organizations)
* [Funding Opportunities](#FundingOpportunities)
* [Grants](#Grants)
* [Fulltext Search](#FulltextSearch)
* [Semantic Search](#SemanticSearch)

## Metadata <a class="anchor" id="Metadata"></a>

### Node metadata
radx-kg is a self-describing KG. The MetaNodes and MetaRelationships define the structure of the KG and the properties of nodes and relationships. The query below lists the nodes in radx-kg and their properties.

In [None]:
query = """
MATCH (n:MetaNode) RETURN n;
"""
df = graph.run(query).to_data_frame()
metadata = df["n"].tolist()
metadata = pd.DataFrame(metadata)
metadata.fillna("", inplace=True)
metadata
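Each MetaNode carries its own set of properties, so the rows of the table do not all share the same columns. The cell above relies on pandas to align them and on `fillna("")` to blank out the gaps. A rough plain-Python sketch of that alignment step follows; the property names and values are hypothetical, not taken from radx-kg.

```python
# Hypothetical MetaNode property maps; the real names come from the KG.
records = [
    {"nodeName": "Grant", "id": "identifier", "title": "grant title"},
    {"nodeName": "Event", "id": "identifier", "startDate": "start date"},
]


def to_table(rows, fill=""):
    """Align dictionaries on the union of their keys, in first-seen order."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    # Missing properties are replaced by the fill value.
    return columns, [[row.get(column, fill) for column in columns] for row in rows]


columns, table = to_table(records)
print(columns)
for line in table:
    print(line)
```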

### Metagraph <a class="anchor" id="Metagraph"></a>
The metagraph shows the node labels and relationship types of the KG. Click on a node to display the node metadata.

In [None]:
query = """
MATCH p=(:MetaNode)-->(:MetaNode) RETURN p
"""
subgraph1 = graph.run(query).to_subgraph()

In [None]:
widget1 = neo4j_utils.draw_graph(subgraph1, stylesheet)
widget1.layout.height = "1024px"
widget1.set_layout(name="cola", padding=40, nodeSpacing=65, nodeDimensionsIncludeLabels=True, unconstrIter=15000)
widget1
# Click in the left cell margin and select "View output fullscreen" for a fullscreen view (Google Colab only)
# Click in the left notebook margin and select "Take Screenshot" to save a screenshot. (Google Colab on Firefox only)
# To improve the initial rendering of the graph, rerun this cell.

### Number of Nodes

In [None]:
query = """
MATCH (n) RETURN COUNT(n);
"""
print(f"Total number of nodes: {graph.evaluate(query)}")

### Number of nodes (digital assets) by node label (asset type)

In [None]:
query = """
MATCH (n) RETURN labels(n)[0] AS Node, COUNT(n) AS Count
ORDER BY Count DESC
"""
graph.run(query).to_data_frame()

### Number of relationships by relationship type

In [None]:
query = """
MATCH ()-[r]->() RETURN TYPE(r) AS Relationship, COUNT(r) AS Count
ORDER BY Count DESC
"""
graph.run(query).to_data_frame()
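When counting relationships, pattern direction matters. A directed pattern `()-[r]->()` matches each relationship once, while an undirected pattern `()-[r]-()` matches a relationship between two different nodes once from each end. The sketch below imitates both behaviours on a hypothetical edge list without self-loops; it illustrates the counting logic only and does not use the database.

```python
from collections import Counter

# Hypothetical directed relationships: (start node, type, end node).
relationships = [
    ("r1", "AUTHORED", "p1"),
    ("r2", "AUTHORED", "p1"),
    ("r1", "IS_INVESTIGATOR_OF", "g1"),
]

# Directed pattern: one match per relationship.
directed = Counter(rel_type for _, rel_type, _ in relationships)

# Undirected pattern: every relationship is matched from both of its ends.
undirected = Counter()
for start, rel_type, end in relationships:
    undirected[rel_type] += 1  # matched as (start)-[r]-(end)
    undirected[rel_type] += 1  # matched again as (end)-[r]-(start)

print(dict(directed))
print(dict(undirected))
```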

## Subgraph for an investigator <a class="anchor" id="Subgraph"></a>
Get the digital assets for an investigator.

#### First neighbors in KG

In [None]:
researcher = "Cirrito"
query = """
MATCH p=(r:Researcher)--() WHERE r.lastName = $researcher RETURN p
"""
subgraph2 = graph.run(query, researcher=researcher).to_subgraph()

In [None]:
widget2 = neo4j_utils.draw_graph(subgraph2, stylesheet)
widget2.layout.height = "1024px"
widget2.set_layout(name="cola", padding=0, nodeSpacing=40, nodeDimensionsIncludeLabels=True, unconstrIter=15000)
widget2
# Click in the left cell margin and select "View output fullscreen" for a fullscreen view (Google Colab only)
# Click in the left notebook margin and select "Take Screenshot" to save a screenshot. (Google Colab on Firefox only)
# To improve the initial rendering of the graph, rerun this cell.

#### First and second neighbors in KG

In [None]:
researcher = "Cirrito"
query = """
MATCH p=(r:Researcher)--()--() WHERE r.lastName = $researcher RETURN p
"""
subgraph3 = graph.run(query, researcher=researcher).to_subgraph()

In [None]:
widget3 = neo4j_utils.draw_graph(subgraph3, stylesheet)
widget3.layout.height = "1024px"
widget3.set_layout(name="cola", padding=0, nodeSpacing=40, nodeDimensionsIncludeLabels=True, unconstrIter=15000)
widget3
# Click in the left cell margin and select "View output fullscreen" for a fullscreen view (Google Colab only)
# Click in the left notebook margin and select "Take Screenshot" to save a screenshot. (Google Colab on Firefox only)
# To improve the initial rendering of the graph, rerun this cell.
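The two queries above differ only in path length: `(r)--()` reaches nodes one hop away, and `(r)--()--()` follows one more relationship from each of those nodes. A small sketch of the same idea on a hypothetical toy graph (the node names are made up):

```python
# Hypothetical toy graph as undirected edges between node ids.
edges = [
    ("researcher1", "grant1"),
    ("researcher1", "pub1"),
    ("pub1", "researcher2"),
    ("grant1", "funding1"),
    ("researcher2", "pub2"),
]


def neighbors(node, edge_list):
    """Return all nodes that share an edge with the given node."""
    found = set()
    for a, b in edge_list:
        if a == node:
            found.add(b)
        elif b == node:
            found.add(a)
    return found


start = "researcher1"
first = neighbors(start, edges)

# Second neighbors: one more hop from every first neighbor,
# excluding the start node and the first neighbors themselves.
second = set()
for node in first:
    second |= neighbors(node, edges)
second -= first | {start}

print(sorted(first))
print(sorted(second))
```

`pub2` is three hops from the start node, so it appears in neither set. Each extra hop can enlarge the subgraph considerably, which is why the second widget may render more slowly.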

## Publications <a class="anchor" id="Publications"></a>

### Number of primary publications

In [None]:
query = """
MATCH (p:Publication) WHERE p.type = "primary" RETURN COUNT(p)
"""
print(f"Number of primary publications: {graph.evaluate(query)}")

### Number of publications by RADx-rad subproject

In [None]:
# create a color palette for projects
query = """
MATCH (g:Grant) RETURN DISTINCT g.subProject AS Project
"""
projects = graph.run(query).to_data_frame()["Project"].tolist()
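# request one color per project; seaborn cycles the default palette if there are more projects than colors
colors = sns.color_palette(n_colors=len(projects))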
palette = {project: color for project, color in zip(projects, colors)}
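The palette is a dictionary from subproject name to color, so every bar plot below draws a given subproject in the same color. A minimal sketch of building such a mapping so that no project is left without a color follows; the project names, color names, and the helper `build_palette` are hypothetical.

```python
from itertools import cycle


def build_palette(keys, colors):
    """Map each key to a color, reusing colors when keys outnumber them."""
    return {key: color for key, color in zip(keys, cycle(colors))}


# Hypothetical inputs: three projects, but only two available colors.
projects_example = ["project A", "project B", "project C"]
colors_example = ["tab:blue", "tab:orange"]

palette_example = build_palette(projects_example, colors_example)
print(palette_example)
```

A project without a palette entry would make `sns.barplot` fail when `hue` and `palette` are combined, so covering every key matters more than keeping every color unique.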

In [None]:
query = """
MATCH (p:Publication)<-[:AUTHORED]-(r:Researcher)-[:IS_INVESTIGATOR_OF]->(g:Grant) RETURN COUNT(DISTINCT p) AS Publications, g.subProject AS Project
ORDER BY Publications DESC
"""
publications = graph.run(query).to_data_frame()
sns.barplot(publications, x="Publications", y="Project", hue="Project", palette=palette);
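The query uses `COUNT(DISTINCT p)` because the pattern produces one row per author-grant combination, so a publication with several authors on the same subproject would otherwise be counted several times. A plain-Python sketch of the same grouping, using hypothetical rows:

```python
from collections import defaultdict

# Hypothetical matched rows: (publication, researcher, subproject).
rows = [
    ("pub1", "r1", "project A"),
    ("pub1", "r2", "project A"),  # same publication, second author on project A
    ("pub2", "r1", "project A"),
    ("pub1", "r3", "project B"),  # co-author funded by another subproject
]

# Collect publications per project in a set so duplicates collapse.
publications_by_project = defaultdict(set)
for publication, _researcher, project in rows:
    publications_by_project[project].add(publication)

distinct_counts = {p: len(pubs) for p, pubs in publications_by_project.items()}

# Counting rows instead would overstate project A.
row_counts = defaultdict(int)
for _publication, _researcher, project in rows:
    row_counts[project] += 1

print(distinct_counts)
print(dict(row_counts))
```

A publication shared between subprojects still counts once for each of them, so the bars can sum to more than the number of primary publications.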

### Number of secondary publications (citations)

In [None]:
query = """
MATCH (p:Publication) WHERE p.type = "secondary" RETURN COUNT(p)
"""
print(f"Number of citations: {graph.evaluate(query)}")

### Number of citations by subprojects

In [None]:
query = """
MATCH (p:Publication)-[:CITES]->(:Publication)<-[:AUTHORED]-(:Researcher)-[:IS_INVESTIGATOR_OF]->(g:Grant) RETURN COUNT(DISTINCT p) AS Citations, g.subProject AS Project
ORDER BY Citations DESC
"""
citations = graph.run(query).to_data_frame()
sns.barplot(citations, x="Citations", y="Project", hue="Project", palette=palette);

### Publications authored by researchers from multiple RADx-rad grants

In [None]:
query = """
MATCH p=(g1:Grant)<-[:IS_INVESTIGATOR_OF]-(:Researcher)-[:AUTHORED]->(:Publication)<-[:AUTHORED]-(:Researcher)-[:IS_INVESTIGATOR_OF]->(g2:Grant) WHERE g1 <> g2 RETURN p
"""
subgraph4 = graph.run(query).to_subgraph()

In [None]:
widget4 = neo4j_utils.draw_graph(subgraph4, stylesheet)
widget4.layout.height = "1024px"
widget4.set_layout(name="cola", padding=0, nodeSpacing=40, nodeDimensionsIncludeLabels=True, unconstrIter=15000)
widget4
# Click in the left cell margin and select "View output fullscreen" for a fullscreen view (Google Colab only)
# Click in the left notebook margin and select "Take Screenshot" to save a screenshot. (Google Colab on Firefox only)
# To improve the initial rendering of the graph, rerun this cell.
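The pattern above links two `AUTHORED` relationships through one publication and keeps only the paths whose two grants differ. A sketch of the same selection in plain Python, looking at pairs of different authors, is shown below; all identifiers are hypothetical.

```python
from itertools import combinations

# Hypothetical data: authors per publication and grants per researcher.
authored = {
    "pub1": ["r1", "r2"],
    "pub2": ["r1"],
    "pub3": ["r2", "r3"],
}
investigator_of = {"r1": {"grant1"}, "r2": {"grant2"}, "r3": {"grant2"}}


def cross_grant_publications(authored, investigator_of):
    """Return publications whose author pairs are funded by different grants."""
    result = {}
    for publication, authors in authored.items():
        grants = set()
        for first, second in combinations(authors, 2):
            for g1 in investigator_of.get(first, set()):
                for g2 in investigator_of.get(second, set()):
                    if g1 != g2:
                        grants.update((g1, g2))
        if grants:
            result[publication] = sorted(grants)
    return result


shared = cross_grant_publications(authored, investigator_of)
print(shared)
```

Only `pub1` qualifies: `pub2` has a single author, and both authors of `pub3` are investigators of the same grant.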

## Presentations <a class="anchor" id="Presentations"></a>
Presentations include poster presentations.

In [None]:
query = """
MATCH (p:Presentation) RETURN COUNT(p)
"""
print(f"Number of presentations: {graph.evaluate(query)}")

### Number of presentations by RADx-rad subprojects

In [None]:
query = """
MATCH (p:Presentation)<-[:PRESENTED]-(:Researcher)-[:IS_INVESTIGATOR_OF]->(g:Grant) RETURN COUNT(DISTINCT p) AS Presentations, g.subProject AS Project
ORDER BY Presentations DESC
"""
presentations = graph.run(query).to_data_frame()
sns.barplot(presentations, x="Presentations", y="Project", hue="Project", palette=palette);

## Events <a class="anchor" id="Events"></a>

### Number of presentations per event

In [None]:
query = """
MATCH (e:Event)<-[:PRESENTED_AT]-(p:Presentation)<-[:PRESENTED]-(r:Researcher) RETURN e.name + " (" + e.startDate + ")" AS Event, COUNT(DISTINCT p) AS Presentations
ORDER BY Presentations DESC
"""
events = graph.run(query).to_data_frame()
sns.barplot(events, x="Presentations", y="Event", color="green");

### Table of Events

In [None]:
query = """
MATCH (e:Event) RETURN e.name AS Name, e.eventType AS Type, e.startDate AS Start_Date, e.endDate AS End_Date, e.city AS City, e.state AS State, e.country AS Country
ORDER BY Start_Date
"""
graph.run(query).to_data_frame()

### Table of Presentations

In [None]:
query = """
MATCH (e:Event)<-[:PRESENTED_AT]-(p:Presentation) RETURN e.name AS Name, p.name AS Title, p.presenters AS Presenters
ORDER BY e.startDate
"""
graph.run(query).to_data_frame()

## Organizations <a class="anchor" id="Organizations"></a>

### Grant PIs and their organizations

In [None]:
query = """
MATCH p=(o:Organization)<-[i:EMPLOYED_AT]-(r:Researcher)-[:IS_INVESTIGATOR_OF]->(g:Grant) RETURN p
"""
subgraph6 = graph.run(query).to_subgraph()

In [None]:
widget6 = neo4j_utils.draw_graph(subgraph6, stylesheet)
widget6.layout.height = "1024px"
widget6.set_layout(name="cola", padding=0, nodeSpacing=40, nodeDimensionsIncludeLabels=True, unconstrIter=15000)
widget6
# Click in the left cell margin and select "View output fullscreen" for a fullscreen view.
# Click in the left notebook margin and select "Take Screenshot" to save a screenshot.
# To improve the initial rendering of the graph, rerun this cell.

## Funding Opportunities <a class="anchor" id="FundingOpportunities"></a>

In [None]:
query = """
MATCH p=(:FundingOpportunity)-[:PROVIDES]->(:Grant) RETURN p
"""
subgraph7 = graph.run(query).to_subgraph()

In [None]:
widget7 = neo4j_utils.draw_graph(subgraph7, stylesheet)
widget7.layout.height = "1024px"
widget7.set_layout(name="cola", padding=0, nodeSpacing=40, nodeDimensionsIncludeLabels=True, unconstrIter=15000)
widget7
# Click in the left cell margin and select "View output fullscreen" for a fullscreen view.
# Click in the left notebook margin and select "Take Screenshot" to save a screenshot.
# To improve the initial rendering of the graph, rerun this cell.

## Grants <a class="anchor" id="Grants"></a>

### Type of Grants

In [None]:
query = """
MATCH (g:Grant) RETURN g.awardCode AS `Award Code`, COUNT(DISTINCT g) AS Grants
ORDER BY Grants DESC, `Award Code`
"""
grants = graph.run(query).to_data_frame()
sns.barplot(grants, x="Grants", y="Award Code", color="darkorange");

### Number of grants per state
If a grant has PIs in multiple states, each site is counted as a grant location.

In [None]:
query = """
MATCH (g:Grant)<-[:IS_INVESTIGATOR_OF]-(:Researcher)-[:EMPLOYED_AT]->(o:Organization) RETURN COUNT(DISTINCT g) as grants, o.state AS locations
"""
df = graph.run(query).to_data_frame()
px.choropleth(df, locations="locations", color="grants", locationmode="USA-states", scope="usa", color_continuous_scale="Reds", labels={"grants": "Grants"})
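The counting rule stated above means that a grant contributes once to every state in which one of its PIs is employed. A small sketch with hypothetical grant and state pairs shows the effect:

```python
from collections import defaultdict

# Hypothetical (grant, employer state) pairs, one per PI employment.
pi_sites = [
    ("grant1", "CA"),
    ("grant1", "CA"),  # second PI of grant1 in the same state
    ("grant1", "MO"),  # third PI of grant1 in another state
    ("grant2", "MO"),
]

# Distinct grants per state, mirroring COUNT(DISTINCT g) grouped by o.state.
grants_by_state = defaultdict(set)
for grant, state in pi_sites:
    grants_by_state[state].add(grant)

state_counts = {state: len(grants) for state, grants in grants_by_state.items()}
total_grants = len({grant for grant, _ in pi_sites})

print(state_counts)
print(total_grants)
```

Here the per-state counts add up to 3 although there are only 2 grants, so the map shows grant locations rather than a partition of the grants.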

## Fulltext Search <a class="anchor" id="FulltextSearch"></a>

### Full text query by keyword or phrase
A full text query returns all nodes that match the text query ([Query Syntax](https://lucene.apache.org/core/5_5_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Overview)). For exact matches, enclose the phrase in double quotes, e.g., ```'"aptamer"'```.

[Learn more about full-text searches.](https://graphaware.com/neo4j/2019/01/11/neo4j-full-text-search-deep-dive.html)

In [None]:
phrase = '"aptamer"'

In [None]:
query = """
CALL db.index.fulltext.queryNodes("fulltext", $phrase) YIELD node, score
RETURN node.id AS id, LABELS(node)[0] AS type, node.name AS title, score
ORDER BY type
"""
graph.run(query, phrase=phrase).to_data_frame()

### Full text query using boolean operators
The full text query supports a variety of query types, including fuzzy, proximity, and range queries, as well as boolean operators ([Query Syntax](https://lucene.apache.org/core/5_5_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Overview)). The following example uses a query with an ```AND``` operator.

In [None]:
phrase = 'MIS-C AND biomarker'

In [None]:
query = """
CALL db.index.fulltext.queryNodes("fulltext", $phrase) YIELD node, score
RETURN node.id AS id, LABELS(node)[0] AS type, node.name AS title, score
ORDER BY type
"""
graph.run(query, phrase=phrase).to_data_frame()
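The two search strings used above can also be assembled programmatically. The helpers below are a hypothetical sketch, not part of `neo4j_utils`: `exact_phrase` wraps text in double quotes for an exact match and backslash-escapes embedded quotes, and `all_of` joins terms with the `AND` operator. The result would be passed as the `$phrase` parameter.

```python
def exact_phrase(text):
    """Wrap text in double quotes, escaping backslashes and embedded quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def all_of(*terms):
    """Require every term by joining them with the AND operator."""
    return " AND ".join(terms)


phrase_exact = exact_phrase("aptamer")
phrase_and = all_of("MIS-C", "biomarker")

print(phrase_exact)
print(phrase_and)
```

Passing the string as a query parameter, as the cells above do, keeps user input out of the Cypher text itself.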

### Shut down Neo4j before closing this notebook
If you run this notebook locally, uncomment the line below and run the cell to call `neo4j_utils.stop()` and stop the database. Otherwise, the database server keeps running.

In [None]:
#neo4j_utils.stop()