# Setup

In [1]:
import json
import requests
import copy
from datetime import datetime as dt
from collections import defaultdict
import pandas as pd

In [2]:
#https://pypi.org/project/gamma-viewer/
from gamma_viewer import GammaViewer
from IPython.display import display, Markdown

In [3]:
def printjson(j):
    print(json.dumps(j,indent=4))
def print_json(j):
    printjson(j)

In [4]:
def post(name,url,message,params=None):
    """Post a JSON message to a URL and return the JSON response ({} on a non-200 status)."""
    # requests treats params=None the same as omitting the argument
    response = requests.post(url,json=message,params=params)
    if response.status_code != 200:
        print(name, 'error:',response.status_code)
        # The error body is not guaranteed to be JSON, so print the raw text
        print(response.text)
        return {}
    return response.json()

## ARAGORN

ARAGORN is a Translator ARA that responds to TRAPI queries.  Here is a simple example of such a query:

In [5]:
trapi_input={
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["NCBIGene:6611"],
                    "categories": ["biolink:Gene"]
                },
                "n1": {
                    "categories": ["biolink:ChemicalSubstance"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

This query looks for chemical substances that are directly connected to a given gene (NCBIGene:6611).
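
Queries of this shape differ only in the pinned identifier and the two categories, so they can be generated by a small helper. The following is a minimal sketch; `one_hop_query` is a hypothetical name introduced here for illustration and is not part of ARAGORN or TRAPI. It only assembles a dictionary with the same layout as `trapi_input`.

```python
def one_hop_query(curie, subject_category, object_category):
    """Build a one-hop TRAPI message from a pinned node (n0) to an unpinned node (n1).

    Hypothetical helper for illustration only; it assembles a plain dict.
    """
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": [curie], "categories": [subject_category]},
                    "n1": {"categories": [object_category]},
                },
                "edges": {"e01": {"subject": "n0", "object": "n1"}},
            }
        }
    }


# Rebuilds the same query as the trapi_input cell
example_query = one_hop_query(
    "NCBIGene:6611", "biolink:Gene", "biolink:ChemicalSubstance"
)
```

Keeping the node keys `n0` and `n1` fixed means the later cells, which index results by `n1`, work unchanged for any query built this way.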

In [6]:
#ARAGORN Endpoint: This is the url to post TRAPI queries
aragorn_trapi_url = 'https://aragorn.renci.org/1.1/query'

Now we will POST the TRAPI message to ARAGORN's TRAPI interface using the Python `requests` package:

In [7]:
start = dt.now()
#Call ARAGORN:
result = requests.post(aragorn_trapi_url, json=trapi_input)
end = dt.now()

print( result.status_code )

if result.status_code == 200:
    aragorn_result = result.json()
    print(f"ARAGORN produced {len(aragorn_result['message']['results'])} results in {end-start}.")

200
ARAGORN produced 158 results in 0:00:28.079535.


ARAGORN performs a series of operations:
1. *Strider* makes distributed calls to KPs to respond to the posted TRAPI query.
2. *Answer Coalescer* looks for commonalities in the answers provided by Strider.
3. *Omnicorp* adds literature co-occurrence edges between any nodes in a result.
4. *Answer Weighting* calculates numerical edge weights based on the literature support.
5. *Scoring* combines the answer weights into a final score for each answer.

Because ARAGORN has merged some answers, these results don't all look the same.  First, note that ARAGORN has updated the query graph:

In [8]:
printjson(aragorn_result['message']['query_graph'])

{
    "nodes": {
        "n0": {
            "ids": [
                "NCBIGene:6611"
            ],
            "categories": [
                "biolink:Gene"
            ],
            "is_set": false,
            "constraints": null
        },
        "n1": {
            "ids": null,
            "categories": [
                "biolink:ChemicalSubstance"
            ],
            "is_set": true,
            "constraints": null
        },
        "extra_qn_0": {
            "ids": null,
            "categories": [
                "biolink:NamedThing",
                "biolink:BiologicalEntity",
                "biolink:MolecularEntity",
                "biolink:Gene",
                "biolink:GeneOrGeneProduct",
                "biolink:MacromolecularMachine",
                "biolink:GenomicEntity"
            ],
            "is_set": false,
            "constraints": null
        }
    },
    "edges": {
        "e01": {
            "subject": "n0",
            "object": "n1",
    

A new query node and a new query edge connecting that node to the existing node `n1` have been added.  Now look at each result and see 1) how many result nodes come back and 2) how many nodes are bound to `n1` in each answer.
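
To confirm which parts of the query graph were added, the keys of the posted and returned query graphs can be compared. The sketch below uses toy query graphs shaped like the ones above; `added_keys` is a hypothetical helper name, and the toy edge key `extra_qe_0` is taken from the edge bindings used later in this notebook.

```python
def added_keys(original_qg, returned_qg):
    """Return the node and edge keys that appear only in the returned query graph."""
    return {
        part: sorted(set(returned_qg.get(part, {})) - set(original_qg.get(part, {})))
        for part in ("nodes", "edges")
    }


# Toy query graphs with the same keys as the posted and returned graphs
original = {"nodes": {"n0": {}, "n1": {}}, "edges": {"e01": {}}}
returned = {
    "nodes": {"n0": {}, "n1": {}, "extra_qn_0": {}},
    "edges": {"e01": {}, "extra_qe_0": {}},
}
diff = added_keys(original, returned)
print(diff)  # {'nodes': ['extra_qn_0'], 'edges': ['extra_qe_0']}
```

In this notebook the same call would be `added_keys(trapi_input['message']['query_graph'], aragorn_result['message']['query_graph'])`.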

In [9]:
scores = []
answer_node_count = []
merged_count = []
method = []
for result in aragorn_result['message']['results']:
    scores.append(result['score'])
    # Number of query nodes that have bindings in this result
    answer_node_count.append(len(result['node_bindings']))
    # Number of knowledge graph nodes bound to n1
    merged_count.append(len(result['node_bindings']['n1']))
    # Uncoalesced answers carry no 'coalescence_method' on their n1 binding
    try:
        method.append(result['node_bindings']['n1'][0]['coalescence_method'])
    except (KeyError, IndexError):
        method.append('Original')
df = pd.DataFrame({'N_Answer_Nodes':answer_node_count, 'N_Merged_Nodes':merged_count, 'Method':method, 'Score':scores})

In [10]:
with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
    display(df)

| | N_Answer_Nodes | N_Merged_Nodes | Method | Score |
|---|---|---|---|---|
| 0 | 3 | 7 | graph_enrichment | 4.379434 |
| 1 | 3 | 7 | graph_enrichment | 4.177219 |
| 2 | 3 | 6 | graph_enrichment | 3.985845 |
| 3 | 3 | 6 | graph_enrichment | 3.500000 |
| 4 | 3 | 5 | graph_enrichment | 3.500000 |
| ... | ... | ... | ... | ... |
| 153 | 3 | 2 | graph_enrichment | 0.000000 |
| 154 | 3 | 2 | graph_enrichment | 0.000000 |
| 155 | 3 | 8 | graph_enrichment | 0.000000 |
| 156 | 3 | 3 | graph_enrichment | 0.000000 |
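
Because the display is truncated, a tally of the `Method` values is a quicker way to see how the results split up. The following is a minimal sketch on toy results that mimic the binding layout; `method_of` is a hypothetical helper, and the `'Original'` label follows the convention used in the cell above.

```python
from collections import Counter


def method_of(result, qnode="n1"):
    """Return the coalescence method of a result, or 'Original' if it has none."""
    bindings = result["node_bindings"].get(qnode, [])
    if not bindings:
        return "Original"
    return bindings[0].get("coalescence_method", "Original")


# Toy results: one uncoalesced, two graph-coalesced, one property-coalesced
toy_results = [
    {"node_bindings": {"n1": [{"id": "PUBCHEM.COMPOUND:60838"}]}},
    {"node_bindings": {"n1": [
        {"id": "PUBCHEM.COMPOUND:60838", "coalescence_method": "graph_enrichment"},
        {"id": "PUBCHEM.COMPOUND:68911", "coalescence_method": "graph_enrichment"},
    ]}},
    {"node_bindings": {"n1": [
        {"id": "PUBCHEM.COMPOUND:36462", "coalescence_method": "property_enrichment"},
    ]}},
    {"node_bindings": {"n1": [
        {"id": "PUBCHEM.COMPOUND:36462", "coalescence_method": "graph_enrichment"},
    ]}},
]

counts = Counter(method_of(r) for r in toy_results)
print(counts)
```

With the DataFrame built above, `df['Method'].value_counts()` gives the same kind of tally.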


### Uncoalesced answers

The table above contains three kinds of answers: original answers that have not undergone any coalescence, graph-coalesced answers, and property-coalesced answers.  An original answer contains 2 nodes, one for each of the 2 original query nodes, with a single knowledge graph node bound to the query graph node `n1`.  Each such answer is completely independent of the others; together they are simply the scored output of Strider.  We can construct an answer set containing only these answers like this:

In [11]:
simple_result = copy.deepcopy(aragorn_result)
simple_result['message']['results'] = list(
    filter( lambda x: 'coalescence_method' not in x['node_bindings']['n1'][0], 
           aragorn_result['message']['results'])
)
print(len(simple_result['message']['results']))

64


In [12]:
#Print the names of the answers
for result in simple_result['message']['results']:
    #Each answer has an identifier:
    n1_id = result['node_bindings']['n1'][0]['id']
    #The information for that identifier is in the KG:
    node = simple_result['message']['knowledge_graph']['nodes'][n1_id]
    #Each node has a name
    print(node['name'])

Spermine
indicator
Spermine
5'-Deoxy-5'-methylthioadenosine
Spermidine
Spermine
Spermine
5'-Deoxy-5'-methylthioadenosine
Spermidine
nucleobase
Spermine
5'-Deoxy-5'-methylthioadenosine
Spermidine
Spermine
Mexiletine
Procysteine
5,7-Dihydroxy-2-(4-hydroxyphenyl)chroman-4-one
Dactolisib
Ricinine
Dosulepin
(3-Bromo-2,5-dimethoxy-7-bicyclo[4.2.0]octa-1(6),2,4-trienyl)methanamine
Taxifolin
Irinotecan
(2S,3R,5R)-3,4-Dihydroxy-5-[6-[(3-iodophenyl)methylamino]-9-purinyl]-N-methyl-2-oxolanecarboxamide
Mocetinostat
Saclofen
4,7-Dichloro-3-hydroxy-3-[2-(4-methoxyphenyl)-2-oxoethyl]-1H-indol-2-one
Parecoxib
Troxipide
N-[3-[[5-Chloro-2-[4-(4-methyl-1-piperazinyl)anilino]-4-pyrimidinyl]oxy]phenyl]-2-propenamide
Tetradecylthioacetic acid
(8Ar,9R)-5-[[(2R,4aR,6R,7R,8R)-7,8-dihydroxy-2-thiophen-2-yl-4,4a,6,7,8,8a-hexahydropyrano[3,2-d][1,3]dioxin-6-yl]oxy]-9-(4-hydroxy-3,5-dimethoxyphenyl)-5a,6,8a,9-tetrahydro-5H-isobenzofuro[6,5-f][1,3]benzodioxol-8-one
Fenbufen
N-[4-[1-(1,4-Dioxaspiro[4.5]dec-8-yl)-4-

### Graph Coalesced Answers

ARAGORN now examines this set of results and tries to find commonalities among them.  In this case, two kinds of groupings are found.  First we will look at cases where several of these entities are all connected to the same other node.  These are found via an enrichment calculation: we look for new nodes (the extra nodes of the modified query graph) that are connected to these `n1` results more often than would happen by chance.
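
To make "more often than would happen by chance" concrete, here is a worked sketch using a hypergeometric tail probability, a common choice for this kind of enrichment. This is an assumption for illustration: the notebook does not say which statistic Answer Coalescer actually uses, and the counts below are made-up toy numbers, not ARAGORN data.

```python
from math import comb


def hypergeom_tail(population, connected, drawn, hits):
    """P(X >= hits) for a hypergeometric draw.

    population: total number of candidate nodes
    connected:  how many of them link to the candidate new node
    drawn:      size of the answer set being tested
    hits:       how many answers in that set link to the new node
    """
    total = comb(population, drawn)
    upper = min(connected, drawn)
    favourable = sum(
        comb(connected, k) * comb(population - connected, drawn - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


# Toy numbers: 1000 chemicals, 20 linked to some gene, and 7 of our 10 answers are among them
p = hypergeom_tail(population=1000, connected=20, drawn=10, hits=7)
print(p)
```

A very small value means that seeing this many answers share the new node would be unlikely if the answers were drawn at random, which is the sense in which the node is enriched.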

In [13]:
#The results that have been coalesced:
coalesced_results = list(
    filter( lambda x: 'coalescence_method'  in x['node_bindings']['n1'][0], 
           aragorn_result['message']['results'])
)

#Those that have been coalesced via a new node (graph coalescence)
graph_coalesced_results = list(
    filter( lambda x: x['node_bindings']['n1'][0]['coalescence_method'] == 'graph_enrichment', coalesced_results)
)
print(len(graph_coalesced_results))

44


Each of these new results is a combination of results from the list of uncoalesced results above.  For instance, the first coalesced result looks like this:

In [14]:
print_json(graph_coalesced_results[0])

{
    "node_bindings": {
        "n0": [
            {
                "id": "NCBIGene:6611"
            }
        ],
        "n1": [
            {
                "id": "PUBCHEM.COMPOUND:60838",
                "p_value": 3.904388366202427e-11,
                "enriched_nodes": [
                    "NCBIGene:1576"
                ],
                "coalescence_method": "graph_enrichment"
            },
            {
                "id": "PUBCHEM.COMPOUND:68911",
                "p_value": 3.904388366202427e-11,
                "enriched_nodes": [
                    "NCBIGene:1576"
                ],
                "coalescence_method": "graph_enrichment"
            },
            {
                "id": "PUBCHEM.COMPOUND:36462",
                "p_value": 3.904388366202427e-11,
                "enriched_nodes": [
                    "NCBIGene:1576"
                ],
                "coalescence_method": "graph_enrichment"
            },
            {
                "id": "PUBC

Note that while `n1` was a normal node in the original query graph, it has been made into a set (`is_set` is true) in the modified query graph.  For this answer, `n1` is now mapped to 7 of the compounds, because all 7 have an edge of the same type linking them to `extra_qn_0` (NCBIGene:1576).  We can see this by looking at the associated KG edges:

In [15]:
kg = aragorn_result['message']['knowledge_graph']
def print_gc_result(gc_result):
    """Print the enrichment p-value and the KG edges bound to the added query edge."""
    print('p_value:', gc_result['node_bindings']['n1'][0]['p_value'])
    for eb in gc_result['edge_bindings']['extra_qe_0']:
        # Look up the bound edge in the knowledge graph
        kge = kg['edges'][eb['id']]
        subject_node = kge['subject']
        object_node = kge['object']
        pred = kge['predicate']
        print( f"{kg['nodes'][subject_node]['name']} -[{pred}]-> {kg['nodes'][object_node]['name']}")
print_gc_result(graph_coalesced_results[0])

p_value: 3.904388366202427e-11
CYP3A4 -[biolink:increases_degradation_of]-> Palbociclib
CYP3A4 -[biolink:increases_degradation_of]-> Mexiletine
CYP3A4 -[biolink:increases_degradation_of]-> Trimebutine
CYP3A4 -[biolink:increases_degradation_of]-> Parecoxib
CYP3A4 -[biolink:increases_degradation_of]-> Irinotecan
CYP3A4 -[biolink:increases_degradation_of]-> Etoposide
CYP3A4 -[biolink:increases_degradation_of]-> Artemether


This set of original results has been linked together because they are all degraded by CYP3A4.  The p-value of this enrichment is 3.9e-11.  The other similarly coalesced results are shown here:

In [16]:
for i,result in enumerate(graph_coalesced_results):
    print('Result',i)
    print_gc_result(result)

Result 0
p_value: 3.904388366202427e-11
CYP3A4 -[biolink:increases_degradation_of]-> Palbociclib
CYP3A4 -[biolink:increases_degradation_of]-> Mexiletine
CYP3A4 -[biolink:increases_degradation_of]-> Trimebutine
CYP3A4 -[biolink:increases_degradation_of]-> Parecoxib
CYP3A4 -[biolink:increases_degradation_of]-> Irinotecan
CYP3A4 -[biolink:increases_degradation_of]-> Etoposide
CYP3A4 -[biolink:increases_degradation_of]-> Artemether
Result 1
p_value: 3.9436239074062695e-28
Bis(3-aminopropyl)amine -[biolink:interacts_with]-> AMD1
N1-Acetylspermine -[biolink:interacts_with]-> AMD1
1,3-Diaminopropane -[biolink:interacts_with]-> AMD1
Spermine -[biolink:interacts_with]-> AMD1
1,4-Diaminobutane -[biolink:interacts_with]-> AMD1
Spermidine -[biolink:interacts_with]-> AMD1
S-Adenosylmethioninamine -[biolink:interacts_with]-> AMD1
Result 2
p_value: 2.0139774271674997e-24
N1-Acetylspermine -[biolink:interacts_with]-> ODC1
Spermidine -[biolink:interacts_with]-> ODC1
Bis(3-aminopropyl)amine -[biolink:in

So we can see that a subset of the original results shares treatment, adverse event, and metabolic profiles.
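
Scanning the printed edges for every result gets long, so a compact summary ordered by p-value can help. The sketch below reads only the `p_value` and `enriched_nodes` fields shown in the node bindings above; `summarize_graph_coalesced` is a hypothetical helper, and the toy results reuse the identifiers and the two p-values from the outputs above with arbitrary gene identifiers for the second entry.

```python
def summarize_graph_coalesced(results, qnode="n1"):
    """Return one row per graph-coalesced result, most significant first."""
    rows = []
    for result in results:
        bindings = result["node_bindings"][qnode]
        first = bindings[0]  # every binding in a result carries the same enrichment info
        rows.append({
            "enriched_nodes": first["enriched_nodes"],
            "p_value": first["p_value"],
            "n_merged": len(bindings),
        })
    return sorted(rows, key=lambda row: row["p_value"])


# Toy results shaped like the graph-coalesced results above
toy = [
    {"node_bindings": {"n1": [
        {"id": "PUBCHEM.COMPOUND:60838", "p_value": 3.9e-11,
         "enriched_nodes": ["NCBIGene:1576"], "coalescence_method": "graph_enrichment"},
        {"id": "PUBCHEM.COMPOUND:68911", "p_value": 3.9e-11,
         "enriched_nodes": ["NCBIGene:1576"], "coalescence_method": "graph_enrichment"},
    ]}},
    {"node_bindings": {"n1": [
        {"id": "PUBCHEM.COMPOUND:36462", "p_value": 3.9e-28,
         "enriched_nodes": ["NCBIGene:0000"], "coalescence_method": "graph_enrichment"},
    ]}},
]

summary = summarize_graph_coalesced(toy)
for row in summary:
    print(row)
```

In this notebook, `pd.DataFrame(summarize_graph_coalesced(graph_coalesced_results))` would put the same rows into a table.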

### Property Coalesced Answers

The other form of enrichment looks for clusters of nodes that share property values rather than connections to other nodes.  At the moment these property values exist only for a few node types, specifically chemicals and diseases, but we anticipate that the set of available properties will grow over time.

In [17]:
#Those that have been coalesced via a shared property (property coalescence)
property_coalesced_results = list(
    filter( lambda x: x['node_bindings']['n1'][0]['coalescence_method'] == 'property_enrichment', coalesced_results)
)
print(len(property_coalesced_results))

50


Here is the structure of a property coalesced result.  The p_values and properties are in the node bindings of the result.

In [18]:
printjson(property_coalesced_results[0])

{
    "node_bindings": {
        "n0": [
            {
                "id": "NCBIGene:6611"
            }
        ],
        "n1": [
            {
                "id": "PUBCHEM.COMPOUND:60838",
                "properties": [
                    "drug",
                    "pharmaceutical",
                    "application"
                ],
                "coalescence_method": "property_enrichment",
                "p_values": [
                    2.0204155691683598e-15,
                    2.421493278176092e-15,
                    3.0888730542428343e-13
                ]
            },
            {
                "id": "PUBCHEM.COMPOUND:36462",
                "properties": [
                    "drug",
                    "pharmaceutical",
                    "application"
                ],
                "coalescence_method": "property_enrichment",
                "p_values": [
                    2.0204155691683598e-15,
                    2.421493278176092e-15,
        

This function prints out the combined nodes, as well as the properties that they share and the p-values of the enrichment.

In [19]:
def print_pc_result(pc_result):
    print('p_value:', pc_result['node_bindings']['n1'][0]['p_values'])
    print('properties:', pc_result['node_bindings']['n1'][0]['properties'])
    for node in pc_result['node_bindings']['n1']:
        kgn = aragorn_result['message']['knowledge_graph']['nodes'][node['id']]
        print( f"  {kgn['name']}")
print_pc_result(property_coalesced_results[0])

p_value: [2.0204155691683598e-15, 2.421493278176092e-15, 3.0888730542428343e-13]
properties: ['drug', 'pharmaceutical', 'application']
  Irinotecan
  Etoposide
  Dosulepin
  Mexiletine
  Parecoxib
  Diphenylcyclopropenone
  Otenzepad
  Fenbufen
  Mesalamine
  Cytochalasin B
  Palbociclib
  Artemether
  Dactolisib


Here is the full set of property-enrichment results:

In [20]:
for pc_result in property_coalesced_results:
    print_pc_result(pc_result)

p_value: [2.0204155691683598e-15, 2.421493278176092e-15, 3.0888730542428343e-13]
properties: ['drug', 'pharmaceutical', 'application']
  Irinotecan
  Etoposide
  Dosulepin
  Mexiletine
  Parecoxib
  Diphenylcyclopropenone
  Otenzepad
  Fenbufen
  Mesalamine
  Cytochalasin B
  Palbociclib
  Artemether
  Dactolisib
p_value: [5.767032680186795e-15]
properties: ['drugbank.approved']
  Irinotecan
  Etoposide
  Dosulepin
  Mexiletine
  Trimebutine
  Parecoxib
  Fenbufen
  Mesalamine
  Palbociclib
  Artemether
p_value: [2.6996958599386434e-14]
properties: ['Cytochrome P-450 Substrates']
  Irinotecan
  Artemether
  Etoposide
  Dosulepin
  Mexiletine
  Palbociclib
  Trimebutine
  Parecoxib
p_value: [8.794931414496024e-14]
properties: ['therapeutic_flag']
  Irinotecan
  Etoposide
  Dosulepin
  Mexiletine
  Parecoxib
  Fenbufen
  Mesalamine
  Palbociclib
  Artemether
  Dactolisib
p_value: [3.5279060649873544e-13, 4.597521834623776e-13]
properties: ['Cytochrome P-450 CYP3A4 Substrates', 'Cytochrom

Consistent with the graph coalescence, we see properties indicating P450 substrates and antineoplastic drugs.
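
When a result lists several properties, it is easier to read each property next to its own p-value. The sketch below pairs the two lists by position. That pairing is an assumption: the outputs above show `properties` and `p_values` with equal lengths, but the notebook does not state that they are ordered to match. `property_p_values` is a hypothetical helper name.

```python
def property_p_values(pc_result, qnode="n1"):
    """Map each shared property to a p-value, assuming the two lists are parallel."""
    first = pc_result["node_bindings"][qnode][0]
    return dict(zip(first["properties"], first["p_values"]))


# Toy result using the values from the first property-coalesced result above
toy_pc_result = {
    "node_bindings": {
        "n1": [
            {
                "id": "PUBCHEM.COMPOUND:60838",
                "properties": ["drug", "pharmaceutical", "application"],
                "coalescence_method": "property_enrichment",
                "p_values": [
                    2.0204155691683598e-15,
                    2.421493278176092e-15,
                    3.0888730542428343e-13,
                ],
            }
        ]
    }
}

paired = property_p_values(toy_pc_result)
for prop, p in paired.items():
    print(f"{prop}: {p:.2e}")
```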