## Notebook implementing path searches in a graph with:
1. Networkx algorithms
2. Discovery patterns
3. SPARQL queries

* Author: Sanya B Taneja
* Created: 2021-09-24
* Last edited: 2022-04-06

We combine the PheKnowLator graph and the machine-reading graphs (currently one per natural product, NP) using NetworkX, then search for paths in the combined graph.

In [1]:
import os
import os.path
import networkx as nx
import json
import urllib
import traceback
from itertools import islice
from rdflib import Graph, URIRef, BNode, Namespace, Literal
from rdflib.namespace import RDF, OWL
from tqdm import tqdm

In [2]:
import hashlib

In [3]:
import pickle
import pandas as pd
import numpy as np

In [4]:
#import pheknowlator kg_utils 
import sys
sys.path.append('../')
from pkt_kg.utils import *

In [8]:
KG_PATH = '/home/sanya/PheKnowLatorv2/resources/knowledge_graphs/'
MR_PATH = '/home/sanya/PheKnowLatorv2/machine_read/output_graphs/'
KG_NAME = 'PheKnowLator_v3.0.0_full_instance_inverseRelations_OWLNETS_NetworkxMultiDiGraph.gpickle'
MR_GRAPH_NAME_GT = 'machineread_greentea_version2.gpickle'
MR_GRAPH_NAME_KT = 'machineread_kratom_version2.gpickle'
MR_GRAPH_NAME_SEM = 'machineread_semrep_version3.gpickle'
NodeLabelsFilePL = 'nodeLabels_20220329.pickle'
NodeLabelsFileMR_gt = 'machineread_greentea_version2_NodeLabels.pickle'
NodeLabelsFileMR_kt = 'machineread_kratom_version2_NodeLabels.pickle'
NodeLabelsFileMR_semrep = 'machineread_semrep_version3_NodeLabels.pickle'

MR_GRAPH_INF = '/home/sanya/PheKnowLatorv2/machine_read/closure_output/machineread_inferred_symmetric_transitive.gpickle'

In [9]:

with open(KG_PATH+NodeLabelsFilePL, 'rb') as filep:
    nodeLabels = pickle.load(filep)

In [10]:
with open(MR_PATH+NodeLabelsFileMR_gt, 'rb') as filep1:
    nodemr1 = pickle.load(filep1)
with open(MR_PATH+NodeLabelsFileMR_kt, 'rb') as filep2:
    nodemr2 = pickle.load(filep2)
with open(MR_PATH+NodeLabelsFileMR_semrep, 'rb') as filep3:
    nodemr3 = pickle.load(filep3)
    

In [11]:
print(len(nodeLabels), len(nodemr1), len(nodemr2), len(nodemr3))

753219 2023 278 2726


In [12]:
for key in nodemr1:
    if key in nodeLabels:
        continue
    nodeLabels[key] = nodemr1[key]
for key in nodemr2:
    if key in nodeLabels:
        continue
    nodeLabels[key] = nodemr2[key]
for key in nodemr3:
    if key in nodeLabels:
        continue
    nodeLabels[key] = nodemr3[key]
len(nodeLabels)

753366
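The three loops above add a machine-reading label only when the key is not already present, so the PheKnowLator label wins on any overlap. A minimal sketch of the same rule using `dict.setdefault`; the toy URIs, labels and the helper name `merge_keep_existing` are hypothetical and not part of the notebook.

```python
# Toy label tables; the URIs and labels are hypothetical placeholders.
base_labels = {"uri:a": "label from base", "uri:b": "B"}
extra_labels = [
    {"uri:a": "label from extra 1", "uri:c": "C"},
    {"uri:c": "label from extra 2", "uri:d": "D"},
]


def merge_keep_existing(base, extras):
    """Return a copy of base extended with keys from extras; the first label seen wins."""
    merged = dict(base)
    for extra in extras:
        for key, label in extra.items():
            # setdefault only writes when the key is not already present
            merged.setdefault(key, label)
    return merged


merged = merge_keep_existing(base_labels, extra_labels)
print(len(merged), merged["uri:a"], merged["uri:c"])
```

Unlike the loops in the cell above, this sketch returns a new dictionary instead of modifying `nodeLabels` in place.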

In [13]:
for key in nodeLabels:
    print(key, nodeLabels[key])
    break

https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000558233 IPO4-205


In [14]:
#save to pickle file
with open(KG_PATH+'merged_graphs/nodeLabels_merged_20220406.pickle', 'wb') as filep:
    pickle.dump(nodeLabels, filep)

In [15]:
nodeLabels['http://napdi.org/napdi_srs_imports:mitragyna_speciosa']

'Mitragyna_speciosa'

In [16]:
def get_graph_stats(kg):
    """Print node, edge and self-loop counts, edges per node, top-degree nodes and density."""
    nodes = nx.number_of_nodes(kg)
    edges = nx.number_of_edges(kg)
    self_loops = nx.number_of_selfloops(kg)

    print('There are {} nodes, {} edges, and {} self-loop(s)'.format(nodes, edges, self_loops))
    # "average degree" here is edges per node, not the mean of kg.degree
    avg_degree = float(edges) / nodes
    print('The Average Degree is {}'.format(avg_degree))

    print('Nodes with highest degree:')
    n_deg = sorted([(str(x[0]), x[1]) for x in kg.degree], key=lambda x: x[1], reverse=True)[:6]

    for x in n_deg:
        # fall back to the URI when no label is available
        print('Label: {}'.format(nodeLabels.get(x[0], x[0])))
        print('{} (degree={})'.format(x[0], x[1]))
    # get network density
    density = nx.density(kg)

    print('The density of the graph is: {}'.format(density))
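The two summary numbers printed by `get_graph_stats` can be checked by hand. The value labelled "Average Degree" is the number of edges divided by the number of nodes. For a directed graph, the density is the edge count divided by n(n - 1). The sketch below recomputes both for the kratom graph reported later in this notebook (272 nodes, 362 edges). The function names are hypothetical, and the density formula is the standard directed-graph definition, assumed here to match what `nx.density` computes for these graphs.

```python
def edges_per_node(n_nodes, n_edges):
    """Edges divided by nodes (the quantity printed as 'Average Degree')."""
    return n_edges / n_nodes


def directed_density(n_nodes, n_edges):
    """Directed-graph density: m / (n * (n - 1)); 0.0 for graphs with fewer than 2 nodes."""
    if n_nodes < 2:
        return 0.0
    return n_edges / (n_nodes * (n_nodes - 1))


# Node and edge counts for the kratom machine-reading graph, taken from this notebook.
print(edges_per_node(272, 362))
print(directed_density(272, 362))
```

Because every directed edge adds one to the degree of each endpoint, the mean of the values in `kg.degree` would be twice the edges-per-node figure.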

In [17]:
##There are 745250 nodes, 7224186 edges, and 408 self-loop(s)
##The Average Degree is 9.6936410..
##The density of the graph is: 1.3007251348270127e-05
# get the number of nodes, edges, and self-loops
# note: nx.read_gpickle was removed in NetworkX 3.0, so this cell assumes NetworkX 2.x
pl_kg = nx.read_gpickle(KG_PATH+KG_NAME)
get_graph_stats(pl_kg)

There are 745250 nodes, 7224186 edges, and 408 self-loop(s)
The Average Degree is 9.693641060046964
Nodes with highest degree:
Label: transcript
http://purl.obolibrary.org/obo/SO_0000673 (degree=190850)
Label: SNV
http://purl.obolibrary.org/obo/SO_0001483 (degree=121020)
Label: Homo sapiens
http://purl.obolibrary.org/obo/NCBITaxon_9606 (degree=116478)
Label: protein_coding_gene
http://purl.obolibrary.org/obo/SO_0001217 (degree=105046)
Label: testis
http://purl.obolibrary.org/obo/UBERON_0000473 (degree=43795)
Label: lncRNA_with_retained_intron
http://purl.obolibrary.org/obo/SO_0002113 (degree=29340)
The density of the graph is: 1.3007251348270127e-05


In [18]:
mr_kg = nx.read_gpickle(MR_PATH+MR_GRAPH_NAME_GT)
mr_kg2 = nx.read_gpickle(MR_PATH+MR_GRAPH_NAME_KT)
mr_kg3 = nx.read_gpickle(MR_PATH+MR_GRAPH_NAME_SEM)
mr_kginf = nx.read_gpickle(MR_GRAPH_INF)

In [19]:
# get the number of nodes, edges, and self-loops
print('Green Tea Machine Read REACH: ')
get_graph_stats(mr_kg)


Green Tea Machine Read REACH: 
There are 2009 nodes, 7367 edges, and 103 self-loop(s)
The Average Degree is 3.6669985067197612
Nodes with highest degree:
Label: (-)-epigallocatechin 3-gallate
http://purl.obolibrary.org/obo/CHEBI_4806 (degree=941)
Label: Tea
http://napdi.org/napdi-srs-imports:camellia_sinensis_leaf (degree=369)
Label: catechin
http://purl.obolibrary.org/obo/CHEBI_23053 (degree=206)
Label: apoptotic process
http://purl.obolibrary.org/obo/GO_0006915 (degree=161)
Label: glucose
http://purl.obolibrary.org/obo/CHEBI_17234 (degree=153)
Label: Mus <genus>
http://purl.obolibrary.org/obo/NCBITaxon_10088 (degree=144)
The density of the graph is: 0.0018261944754580483


In [20]:

print('Kratom Machine Read REACH: ')
get_graph_stats(mr_kg2)


Kratom Machine Read REACH: 
There are 272 nodes, 362 edges, and 5 self-loop(s)
The Average Degree is 1.3308823529411764
Nodes with highest degree:
Label: Mitragynine
http://purl.obolibrary.org/obo/CHEBI_6956 (degree=52)
Label: carbon monoxide
http://purl.obolibrary.org/obo/CHEBI_17245 (degree=19)
Label: high-density lipoprotein particle
http://purl.obolibrary.org/obo/GO_0034364 (degree=16)
Label: potassium voltage-gated channel subfamily H member 2 (human)
http://purl.obolibrary.org/obo/PR_Q12809 (degree=13)
Label: phosphatidic acid
http://purl.obolibrary.org/obo/CHEBI_16337 (degree=13)
Label: NPC1-like intracellular cholesterol transporter 1 (human)
http://purl.obolibrary.org/obo/PR_Q9UHC9 (degree=12)
The density of the graph is: 0.004911004992402865


In [19]:

print('SemRep machine read: ')
get_graph_stats(mr_kg3)



SemRep machine read: 
There are 2710 nodes, 9232 edges, and 10 self-loop(s)
The Average Degree is 3.4066420664206642
Nodes with highest degree:
Label: (-)-epigallocatechin gallate
http://napdi.org/napdi_srs_imports:epigallocatechin_gallate (degree=1141)
Label: catechin
http://purl.obolibrary.org/obo/CHEBI_23053 (degree=344)
Label: polyphenol
http://purl.obolibrary.org/obo/CHEBI_26195 (degree=263)
Label: flavonoids
http://purl.obolibrary.org/obo/CHEBI_72544 (degree=257)
Label: Rattus norvegicus
http://purl.obolibrary.org/obo/NCBITaxon_10116 (degree=165)
Label: import into cell
http://purl.obolibrary.org/obo/GO_0098657 (degree=161)
The density of the graph is: 0.0012575275254413673


In [20]:
print('Inferred graph stats: ')
get_graph_stats(mr_kginf)

Inferred graph stats: 
There are 1599 nodes, 13658 edges, and 43 self-loop(s)
The Average Degree is 8.541588492808005
Nodes with highest degree:
Label: (-)-epigallocatechin gallate
http://napdi.org/napdi_srs_imports:epigallocatechin_gallate (degree=665)
Label: catechin
http://purl.obolibrary.org/obo/CHEBI_23053 (degree=378)
Label: flavonoids
http://purl.obolibrary.org/obo/CHEBI_72544 (degree=354)
Label: rosoxacin
http://purl.obolibrary.org/obo/CHEBI_131715 (degree=297)
Label: calcium ion
http://purl.obolibrary.org/obo/CHEBI_39124 (degree=286)
Label: polyphenol
http://purl.obolibrary.org/obo/CHEBI_26195 (degree=285)
The density of the graph is: 0.0053451742758498155


In [None]:
#combine graphs - all MR (inferred also?)
mr_graph = nx.compose_all([mr_kg, mr_kg2, mr_kg3, mr_kginf])
print(type(mr_graph))
get_graph_stats(mr_graph)

<class 'networkx.classes.multidigraph.MultiDiGraph'>
There are 4157 nodes, 27784 edges, and 154 self-loop(s)
The Average Degree is 6.683666105364446
Nodes with highest degree:
Label: (-)-epigallocatechin gallate
http://napdi.org/napdi_srs_imports:epigallocatechin_gallate (degree=1366)
Label: (-)-epigallocatechin 3-gallate
http://purl.obolibrary.org/obo/CHEBI_4806 (degree=941)
Label: catechin
http://purl.obolibrary.org/obo/CHEBI_23053 (degree=781)
Label: flavonoids
http://purl.obolibrary.org/obo/CHEBI_72544 (degree=566)
Label: polyphenol
http://purl.obolibrary.org/obo/CHEBI_26195 (degree=505)
Label: rosoxacin
http://purl.obolibrary.org/obo/CHEBI_131715 (degree=498)
The density of the graph is: 0.0016081968492214738
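The composed machine-reading graph has 4157 nodes and 27784 edges, fewer than the sums over the four inputs (6590 nodes and 30619 edges). That is what a union should produce when the inputs share nodes and triples: identical URIs and identical (subject, object, predicate key) edges are merged rather than duplicated. The sketch below imitates that union with plain dictionaries. It is a standard-library illustration under the assumption that edge keys are predicates, not NetworkX's implementation, and the toy triples and the name `compose_edge_tables` are hypothetical.

```python
def compose_edge_tables(tables):
    """Union edge tables keyed by (subject, object, predicate).

    Attributes of a repeated edge are merged; later tables override earlier ones.
    """
    merged = {}
    for table in tables:
        for edge, attrs in table.items():
            merged.setdefault(edge, {}).update(attrs)
    return merged


# Hypothetical toy graphs: the first triple appears in both.
graph_a = {("egcg", "cyp3a4", "interacts_with"): {"belief": 0.65}}
graph_b = {
    ("egcg", "cyp3a4", "interacts_with"): {"pmid": "0000"},
    ("egcg", "cyp3a4", "inhibits"): {},
}

merged = compose_edge_tables([graph_a, graph_b])
print(len(graph_a) + len(graph_b), "input edges ->", len(merged), "merged edges")
```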


In [22]:
nx_graph = nx.compose_all([mr_graph, pl_kg])
print(type(nx_graph))
get_graph_stats(nx_graph)

<class 'networkx.classes.multidigraph.MultiDiGraph'>


In [23]:
get_graph_stats(nx_graph)

There are 745763 nodes, 7251937 edges, and 562 self-loop(s)
The Average Degree is 9.72418449292872
Nodes with highest degree:
Label: transcript
http://purl.obolibrary.org/obo/SO_0000673 (degree=190858)
Label: SNV
http://purl.obolibrary.org/obo/SO_0001483 (degree=121020)
Label: Homo sapiens
http://purl.obolibrary.org/obo/NCBITaxon_9606 (degree=116704)
Label: protein_coding_gene
http://purl.obolibrary.org/obo/SO_0001217 (degree=105046)
Label: testis
http://purl.obolibrary.org/obo/UBERON_0000473 (degree=43818)
Label: lncRNA_with_retained_intron
http://purl.obolibrary.org/obo/SO_0002113 (degree=29340)
The density of the graph is: 1.3039259834811533e-05


In [25]:
##save graph in merged_graphs
nx.write_gpickle(nx_graph, KG_PATH+'merged_graphs/PheKnowLator_machine_read_merged_instance_based_OWLNETS_20220406.gpickle')

In [24]:
#nodes and edges examples
nodes = list(nx_graph.nodes(data=True))
for x in nodes:
    print(x)
    break

(rdflib.term.URIRef('http://purl.obolibrary.org/obo/PR_Q5F4B1'), {'key': '<http://purl.obolibrary.org/obo/PR_Q5F4B1>'})


In [26]:
#combine graphs - all MR (inferred also?) and PL
nx_graph = nx.compose_all([pl_kg, mr_kg, mr_kg2, mr_kg3, mr_kginf])
print(type(nx_graph))
get_graph_stats(nx_graph)

<class 'networkx.classes.multidigraph.MultiDiGraph'>
There are 745691 nodes, 7248571 edges, and 557 self-loop(s)
The Average Degree is 9.720609474970196
Nodes with highest degree:
Label: transcript
http://purl.obolibrary.org/obo/SO_0000673 (degree=190858)
Label: SNV
http://purl.obolibrary.org/obo/SO_0001483 (degree=121020)
Label: Homo sapiens
http://purl.obolibrary.org/obo/NCBITaxon_9606 (degree=116672)
Label: protein_coding_gene
http://purl.obolibrary.org/obo/SO_0001217 (degree=105046)
Label: testis
http://purl.obolibrary.org/obo/UBERON_0000473 (degree=43817)
Label: lncRNA_with_retained_intron
http://purl.obolibrary.org/obo/SO_0002113 (degree=29340)
The density of the graph is: 1.3035724597312818e-05


In [28]:
nodeLabels['http://purl.obolibrary.org/obo/CHEBI_4806']

'(-)-epigallocatechin 3-gallate'

In [29]:
nodeLabels['http://napdi.org/napdi_srs_imports:epigallocatechin_gallate']

'(-)-epigallocatechin gallate'

## Path Searches
1. Single source shortest path (saved)
2. k-simple paths (saved for cyp3a4, midazolam)
3. Bidirectional shortest paths (in this notebook)
4. Shortest paths (to do)
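For item 1, the search used later in this notebook returns a dictionary that maps each reachable target to the list of nodes on a shortest path from the source, with the source mapped to a path containing only itself. A breadth-first search produces that shape. The sketch below is a standard-library stand-in for illustration, not NetworkX's code, and the toy adjacency list and the name `bfs_shortest_paths` are hypothetical.

```python
from collections import deque


def bfs_shortest_paths(adjacency, source):
    """Map every node reachable from source to one shortest node path (unweighted BFS)."""
    paths = {source: [source]}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in paths:  # first visit is along a shortest path
                paths[neighbor] = paths[node] + [neighbor]
                queue.append(neighbor)
    return paths


# Hypothetical toy graph as an adjacency list of directed edges.
toy = {
    "tea": ["egcg", "catechin"],
    "egcg": ["cyp3a4"],
    "catechin": ["cyp3a4"],
    "cyp3a4": [],
}
paths = bfs_shortest_paths(toy, "tea")
for target, node_list in paths.items():
    print(target, node_list)
```

When several shortest paths exist, as for `cyp3a4` here, only one of them is kept, depending on the order in which neighbors are visited.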

In [30]:
DIR_OUT = '/home/sanya/PheKnowLatorv2/output_files/'

In [31]:
obo = Namespace('http://purl.obolibrary.org/obo/')
napdi = Namespace('http://napdi.org/napdi_srs_imports:')

Helper functions to:
1. Get path narrative given path or list of paths
2. Get path URIs given path or list of paths
3. Get path with machine reading output from 2017 and prior
4. Save path with labels to file


In [67]:
def get_path_labels(path):
    """Return [subject label, predicate label, object label, source_graph] for each edge on a node path."""
    path_labels = []
    # a path needs at least two nodes to contain an edge
    if len(path) < 2:
        print('Path has fewer than 2 nodes, skipping')
        return path_labels
    for edge in zip(path, path[1:]):
        data = nx_graph.get_edge_data(*edge)
        # only the first parallel edge between the two nodes is reported
        pred, attrs = next(iter(data.items()))
        # fall back to the URI when no label is available
        node1_lab = nodeLabels.get(str(edge[0]), str(edge[0]))
        node2_lab = nodeLabels.get(str(edge[1]), str(edge[1]))
        pred_lab = nodeLabels.get(str(pred), str(pred))
        # machine-reading edges carry a 'source_graph' attribute
        source_graph = 'machine_read' if attrs and 'source_graph' in attrs else ''
        labels = [node1_lab, pred_lab, node2_lab, source_graph]
        path_labels.append(labels)
    return path_labels

In [68]:
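def get_path_uri(path):
    """Return [subject URI, predicate, object URI, edge attributes] for each edge on a node path."""
    path_uri = []
    # a path needs at least two nodes to contain an edge
    if len(path) < 2:
        print('Path has fewer than 2 nodes, skipping')
        return path_uri
    for edge in zip(path, path[1:]):
        data = nx_graph.get_edge_data(*edge)
        # only the first parallel edge's predicate is reported
        pred = list(data.keys())[0]
        attribute = list(data.values())
        uri = [str(edge[0]), pred, str(edge[1]), attribute]
        path_uri.append(uri)
    return path_uri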
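Both helpers turn a node path into edges by pairing each node with its successor via `zip(path, path[1:])`, then look up the edge data for each pair and report the first predicate found. The sketch below shows that pattern on a toy edge table. The node names, predicates, attributes and the function name `path_to_triples` are hypothetical, and a plain dictionary stands in for `nx_graph.get_edge_data`.

```python
# Toy multigraph edge table: (subject, object) -> {predicate_key: attributes}.
# Node names, predicates and attributes are hypothetical.
toy_edges = {
    ("tea", "catechin"): {"has_component": {}},
    ("catechin", "CYP3A4"): {"interacts_with": {"source_graph": "machine_read"}},
}


def path_to_triples(path, edges):
    """Convert a node path into (subject, predicate, object, source) tuples."""
    triples = []
    for subj, obj in zip(path, path[1:]):  # consecutive node pairs
        data = edges[(subj, obj)]
        pred, attrs = next(iter(data.items()))  # first parallel edge only
        source = "machine_read" if "source_graph" in attrs else ""
        triples.append((subj, pred, obj, source))
    return triples


triples = path_to_triples(["tea", "catechin", "CYP3A4"], toy_edges)
for triple in triples:
    print(triple)
```

A path of n nodes yields n - 1 pairs, so a single-node path produces an empty list.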

In [69]:
#get shortest path from green tea leaf
greentea_path = nx.single_source_shortest_path(nx_graph, napdi.camellia_sinensis_leaf)

In [70]:
type(greentea_path)

dict

In [71]:
save1 = 'greentea_single_source_shortest_path_50.txt'

In [72]:
#get 20 paths from green tea single source shortest path
#if returned paths are dictionary
count = 0
for target, node_list in greentea_path.items():
    count += 1
    if target != napdi.camellia_sinensis_leaf:
        if str(target) not in nodeLabels:
            target_label = str(target).split('/')[-1]
        else:
            target_label = nodeLabels[str(target)]
        print('\n{} - {} Path:'.format(str(napdi.camellia_sinensis_leaf).split('/')[-1], target_label))
        path_labels = get_path_labels(node_list)
        print(path_labels)
    if count == 20:
        break


napdi_srs_imports:camellia_sinensis_leaf - (-)-epicatechin Path:
[['Camellia_sinensis_leaf', 'has component', '(-)-epicatechin', '']]

napdi_srs_imports:camellia_sinensis_leaf - gallocatechin Path:
[['Camellia_sinensis_leaf', 'has component', 'gallocatechin', '']]

napdi_srs_imports:camellia_sinensis_leaf - Camellia_sinensis_whole Path:
[['Camellia_sinensis_leaf', 'part of', 'Camellia_sinensis_whole', '']]

napdi_srs_imports:camellia_sinensis_leaf - plant anatomical entity Path:
[['Camellia_sinensis_leaf', 'subClassOf', 'plant anatomical entity', '']]

napdi_srs_imports:camellia_sinensis_leaf - (-)-epigallocatechin gallate Path:
[['Camellia_sinensis_leaf', 'has component', '(-)-epigallocatechin gallate', '']]

napdi_srs_imports:camellia_sinensis_leaf - (-)-epigallocatechin Path:
[['Camellia_sinensis_leaf', 'has component', '(-)-epigallocatechin', '']]

napdi_srs_imports:camellia_sinensis_leaf - (-)-epicatechin-3-O-gallate Path:
[['Camellia_sinensis_leaf', 'has component', '(-)-epicate

In [73]:
#save 100 paths from green tea single source shortest path to file
#if returned paths are dictionary
count = 0
# the context manager closes the file even if a lookup fails part-way
with open(DIR_OUT+save1, 'w') as file_save:
    for target, node_list in greentea_path.items():
        count += 1
        if target != napdi.camellia_sinensis_leaf:
            if str(target) not in nodeLabels:
                target_label = str(target).split('/')[-1]
            else:
                target_label = nodeLabels[str(target)]
            file_save.write('\n{} - {} Path:\n'.format(str(napdi.camellia_sinensis_leaf).split('/')[-1], target_label))
            path_labels = get_path_labels(node_list)
            for triples in path_labels:
                for item in triples:
                    file_save.write(str(item)+' ')
                file_save.write('\n')
        if count == 100:
            break

In [None]:
#obo.CHEBI_83161 - St. Johns Wort extract (to test graph)

In [74]:
#green tea and warfarin
pathx = nx.bidirectional_shortest_path(nx_graph, napdi.camellia_sinensis_leaf, obo.CHEBI_10033)

In [75]:
pathx

[rdflib.term.URIRef('http://napdi.org/napdi_srs_imports:camellia_sinensis_leaf'),
 rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_68330'),
 rdflib.term.URIRef('http://napdi.org/napdi-srs-imports:camellia_sinensis_leaf'),
 rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_10033')]

In [35]:
#scratch try
for edge in zip(pathx,pathx[1:]):                                                 
    data = nx_graph.get_edge_data(*edge)    
    print('Edge info: ')
    print(data.values())
    print('source_graph' in list(data.values())[0])

Edge info: 
dict_values([{'predicate_key': '4bbadcc28097247fd55f77cbeb77ab74', 'weight': 0.0}])
False
Edge info: 
dict_values([{'predicate_key': 'cf627db0a97bc9798bf6f089a06581c9', 'weight': 0.0, 'pmid': '30286210', 'timestamp': '2018', 'source_graph': 'machine_read', 'belief': 0.65}])
True
Edge info: 
dict_values([{'predicate_key': '6dae54e8b8bbe9cdb6c615e404a5d7ce', 'weight': 0.0, 'pmid': '30286210', 'timestamp': '2018', 'source_graph': 'machine_read', 'belief': 0.65}])
True


In [21]:
path_labels

[['Camellia_sinensis_leaf',
  'has component',
  '(-)-epicatechin-3-O-gallate',
  ''],
 ['(-)-epicatechin-3-O-gallate',
  'molecularly interacts with',
  'UDP-glucuronosyltransferase 1A1 (human)',
  'machine_read']]

In [76]:
epicatechin = obo.CHEBI_90
catechin = obo.CHEBI_23053
egcg = obo.CHEBI_4806
greentea = napdi.camellia_sinensis_leaf

In [60]:
edge_path_test = nx.all_simple_edge_paths(nx_graph, catechin, obo.PR_P08684, 10)

In [61]:
i = 0
for x in edge_path_test:
    print(x)
    i += 1
    if i==3:
        break

[(rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_23053'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/GO_0018130'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002436')), (rdflib.term.URIRef('http://purl.obolibrary.org/obo/GO_0018130'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/GO_0044249'), rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf')), (rdflib.term.URIRef('http://purl.obolibrary.org/obo/GO_0044249'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/GO_0009058'), rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf')), (rdflib.term.URIRef('http://purl.obolibrary.org/obo/GO_0009058'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/PR_P17735'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0000057')), (rdflib.term.URIRef('http://purl.obolibrary.org/obo/PR_P17735'), rdflib.term.URIRef('https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000355962'), rdflib.term.URIRef('http://purl.ob

In [77]:
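#returns up to k simple edge paths as (subject, object, predicate) tuples, plus the same paths as label triples
#note: the search cutoff is shortestLen+20 edges, and only paths longer than shortestLen are kept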
def k_simple_edge_paths(G, source, target, k, shortestLen):
    paths = nx.all_simple_edge_paths(G, source, target, cutoff=shortestLen+20)
    path_l = []
    path_n = []
    i = 0
    while i<k:
        try:
            print('[info] applying next operator to search for a simple path of max length {}'.format(shortestLen+20))
            path = next(paths)
        except StopIteration:
            break
        print('[info] Simple path found of length {}'.format(len(path))) 
        if len(path) > shortestLen:
            print('[info] Simple path length greater than shortest path length ({}) so adding to results'.format(shortestLen))
            path_l.append(path)
        i += 1
    for path in path_l:
        triple_list = []
        for triple in path:
            subj_lab = ''
            pred_lab = ''
            obj_lab = ''
            subj = str(triple[0])
            pred = str(triple[2])
            obj = str(triple[1])
            if subj in nodeLabels:
                subj_lab = nodeLabels[subj]
            if obj in nodeLabels:
                obj_lab = nodeLabels[obj]
            if pred in nodeLabels:
                pred_lab = nodeLabels[pred]
            triple_labels = (subj_lab, pred_lab, obj_lab)
            triple_list.append(triple_labels)
        path_n.append(triple_list)
    return path_l, path_n
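`k_simple_edge_paths` draws at most `k` paths from a lazy generator by calling `next` in a loop. `itertools.islice`, imported in the first cell, expresses the same "first k items" idea without materialising every path. The sketch below pairs it with a small depth-first generator of simple paths bounded by a cutoff. It is a standard-library illustration, not NetworkX's algorithm, and the toy graph and the function name `simple_paths` are hypothetical.

```python
from itertools import islice


def simple_paths(adjacency, source, target, cutoff):
    """Yield simple node paths from source to target with at most `cutoff` edges (iterative DFS)."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            yield path
            continue
        if len(path) - 1 >= cutoff:  # path already uses `cutoff` edges
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in path:  # never revisit a node: keeps the path simple
                stack.append((neighbor, path + [neighbor]))


# Hypothetical toy graph as an adjacency list of directed edges.
toy = {"a": ["b", "c"], "b": ["d", "c"], "c": ["d"], "d": []}

# Take only the first two paths; the generator is not run to exhaustion.
first_two = list(islice(simple_paths(toy, "a", "d", cutoff=3), 2))
print(first_two)
```

The order in which paths are produced depends on traversal order, so the first `k` paths are not guaranteed to be the shortest ones.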

In [79]:
cyp3a4_edge_paths, cyp3a4_edge_path_labs = k_simple_edge_paths(nx_graph, greentea, obo.PR_P08684, 10, 0)

[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] app

In [80]:
source = str(greentea)
target = str(obo.PR_P08684)
save2 = 'greentea_cyp3a4_simple_paths_20.txt'
file_save = open(DIR_OUT+save2, 'w')
# fall back to the URI so the labels are always defined
source_label = nodeLabels.get(source, source)
target_label = nodeLabels.get(target, target)
file_save.write('\n{} - {} Simple Path (cutoff=20):\n'.format(source_label, target_label))
i = 0
for path_list in cyp3a4_edge_path_labs:
    file_save.write('\nPATH: '+str(i)+'\n')
    for triples in path_list:
        for item in triples:
            file_save.write(str(item)+' ')
        file_save.write('\n')
    i += 1
file_save.close()
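The write loop in this cell recurs in the save cells that follow, with only the source, target and file name changing. One option is to move it into a helper that accepts any writable handle. The sketch below is a suggestion rather than the notebook's code: the name `write_labelled_paths` is hypothetical, the example triples are toy data, and `io.StringIO` stands in for a real file so the output can be inspected.

```python
import io


def write_labelled_paths(handle, header, labelled_paths):
    """Write a header, then each path as a numbered block with one triple per line."""
    handle.write("\n{}\n".format(header))
    for index, path in enumerate(labelled_paths):
        handle.write("\nPATH: {}\n".format(index))
        for triple in path:
            handle.write(" ".join(str(item) for item in triple) + "\n")


# Toy labelled paths; in the notebook these would come from k_simple_edge_paths.
example_paths = [
    [("catechin", "molecularly interacts with", "protein binding")],
    [("catechin", "subClassOf", "flavans"), ("flavans", "subClassOf", "1-benzopyran")],
]

buffer = io.StringIO()
write_labelled_paths(buffer, "catechin - CYP3A4 Simple Path:", example_paths)
print(buffer.getvalue())
```

With a real file, the call would sit inside `with open(path, 'w') as handle:` so the file is closed even if writing fails. Unlike the loop above, this sketch does not leave a trailing space after the last item of each triple.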

In [81]:
pathx = nx.bidirectional_shortest_path(nx_graph, obo.CHEBI_23053, obo.PR_P08684)
path_labels = get_path_labels(pathx)
for triples in path_labels:
    print(triples)

['catechin', 'molecularly interacts with', 'protein binding', '']
['protein binding', 'function of', 'cytochrome P450 3A4 (human)', '']


In [82]:
source = str(obo.CHEBI_23053)
target = str(obo.PR_P08684)
save2 = 'catechin_cyp3a4_simple_paths_20.txt'
file_save = open(DIR_OUT+save2, 'w')
cyp3a4_edge_paths, cyp3a4_edge_path_labs = k_simple_edge_paths(nx_graph, obo.CHEBI_23053, obo.PR_P08684, 20, 2)
# fall back to the URI so the labels are always defined
source_label = nodeLabels.get(source, source)
target_label = nodeLabels.get(target, target)
file_save.write('\n{} - {} Simple Path (cutoff=20):\n'.format(source_label, target_label))
i = 0
for path_list in cyp3a4_edge_path_labs:
    file_save.write('\nPATH: '+str(i)+'\n')
    for triples in path_list:
        for item in triples:
            file_save.write(str(item)+' ')
        file_save.write('\n')
    i += 1
file_save.close()

[info] applying next operator to search for a simple path of max length 22
[info] Simple path found of length 21
[info] Simple path length greater than shortest path length (2) so adding to results
[info] applying next operator to search for a simple path of max length 22
[info] Simple path found of length 22
[info] Simple path length greater than shortest path length (2) so adding to results
[info] applying next operator to search for a simple path of max length 22
[info] Simple path found of length 22
[info] Simple path length greater than shortest path length (2) so adding to results
[info] applying next operator to search for a simple path of max length 22
[info] Simple path found of length 22
[info] Simple path length greater than shortest path length (2) so adding to results
[info] applying next operator to search for a simple path of max length 22
[info] Simple path found of length 22
[info] Simple path length greater than shortest path length (2) so adding to results
[info] app

In [83]:
source = str(obo.CHEBI_4806)
target = str(obo.CHEBI_41879)
save2 = 'EGCG_dexamethasone_simple_paths_20.txt'
file_save = open(DIR_OUT+save2, 'w')
cyp3a4_edge_paths, cyp3a4_edge_path_labs = k_simple_edge_paths(nx_graph, obo.CHEBI_4806, obo.CHEBI_41879, 20, 0)
# fall back to the URI so the labels are always defined
source_label = nodeLabels.get(source, source)
target_label = nodeLabels.get(target, target)
file_save.write('\n{} - {} Simple Path (cutoff=20):\n'.format(source_label, target_label))
i = 0
for path_list in cyp3a4_edge_path_labs:
    file_save.write('\nPATH: '+str(i)+'\n')
    for triples in path_list:
        for item in triples:
            file_save.write(str(item)+' ')
        file_save.write('\n')
    i += 1
file_save.close()

[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] app

In [84]:
source = str(obo.CHEBI_4806)
target = str(obo.UBERON_0000468)
save2 = 'EGCG_bodyweight_simple_paths_20.txt'
file_save = open(DIR_OUT+save2, 'w')
cyp3a4_edge_paths, cyp3a4_edge_path_labs = k_simple_edge_paths(nx_graph, obo.CHEBI_4806, obo.UBERON_0000468, 20, 0)
# fall back to the URI so the labels are always defined
source_label = nodeLabels.get(source, source)
target_label = nodeLabels.get(target, target)
file_save.write('\n{} - {} Simple Path (cutoff=20):\n'.format(source_label, target_label))
i = 0
for path_list in cyp3a4_edge_path_labs:
    file_save.write('\nPATH: '+str(i)+'\n')
    for triples in path_list:
        for item in triples:
            file_save.write(str(item)+' ')
        file_save.write('\n')
    i += 1
file_save.close()

[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] app

In [85]:
for item in zip(cyp3a4_edge_path_labs[0], cyp3a4_edge_paths[0]):
    print(item)

(('(-)-epigallocatechin 3-gallate', 'subClassOf', 'flavans'), (rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_4806'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_38672'), rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf')))
(('flavans', 'subClassOf', '1-benzopyran'), (rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_38672'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_38443'), rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf')))
(('1-benzopyran', 'subClassOf', 'benzopyran'), (rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_38443'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_22727'), rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf')))
(('benzopyran', 'molecularly interacts with', 'secretory granule lumen'), (rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_22727'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/GO_0034774'), rdflib.term.URIRef('http://

In [87]:
source = str(obo.CHEBI_23053)
target = str(obo.PR_P08684)
save2 = 'catechin_cyp3a4_simple_paths_20.txt'
file_save = open(DIR_OUT+save2, 'w')
cyp3a4_edge_paths, cyp3a4_edge_path_labs = k_simple_edge_paths(nx_graph, obo.CHEBI_23053, obo.PR_P08684, 20, 0)
# fall back to the URI so the labels are always defined
source_label = nodeLabels.get(source, source)
target_label = nodeLabels.get(target, target)
file_save.write('\n{} - {} Simple Path (cutoff=20):\n'.format(source_label, target_label))
i = 0
for path_list in cyp3a4_edge_path_labs:
    file_save.write('\nPATH: '+str(i)+'\n')
    for triples in path_list:
        for item in triples:
            file_save.write(str(item)+' ')
        file_save.write('\n')
    i += 1
file_save.close()

[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] app

In [88]:
source = str(obo.CHEBI_23053)
target = str(obo.HP_0003074)
save2 = 'catechin_hyperglycemia_simple_paths_20.txt'
file_save = open(DIR_OUT+save2, 'w')
cyp3a4_edge_paths, cyp3a4_edge_path_labs = k_simple_edge_paths(nx_graph, obo.CHEBI_23053, obo.HP_0003074, 20, 0)
# fall back to the URI so the labels are always defined
source_label = nodeLabels.get(source, source)
target_label = nodeLabels.get(target, target)
file_save.write('\n{} - {} Simple Path (cutoff=20):\n'.format(source_label, target_label))
i = 0
for path_list in cyp3a4_edge_path_labs:
    file_save.write('\nPATH: '+str(i)+'\n')
    for triples in path_list:
        for item in triples:
            file_save.write(str(item)+' ')
        file_save.write('\n')
    i += 1
file_save.close()

[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] app

In [89]:
source = str(napdi.camellia_sinensis_leaf)
target = str(obo.PR_P08684)
save2 = 'greentea_cyp3a4_simple_paths_20.txt'
cyp3a4_edge_paths, cyp3a4_edge_path_labs = k_simple_edge_paths(nx_graph, napdi.camellia_sinensis_leaf, obo.PR_P08684, 20, 0)
#fall back to the URI when a node has no label
source_label = source
target_label = target
if source in nodeLabels:
    source_label = nodeLabels[source]
if target in nodeLabels:
    target_label = nodeLabels[target]
#the context manager closes the file even if a write fails
with open(DIR_OUT+save2, 'w') as file_save:
    file_save.write('\n{} - {} Simple Path (cutoff=20):\n'.format(source_label, target_label))
    for i, path_list in enumerate(cyp3a4_edge_path_labs):
        file_save.write('\nPATH: '+str(i)+'\n')
        for triples in path_list:
            for item in triples:
                file_save.write(str(item)+' ')
            file_save.write('\n')

[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] app

In [90]:
cyp3a4_edge_path_labs[0]

[('Camellia_sinensis_leaf', 'has component', '(-)-epicatechin'),
 ('(-)-epicatechin', 'is enantiomer of', '(+)-epicatechin'),
 ('(+)-epicatechin', 'subClassOf', 'polyphenol'),
 ('polyphenol',
  'molecularly interacts with',
  'regulation of immune system process'),
 ('regulation of immune system process',
  '',
  'alpha-1-acid glycoprotein 1 (human)'),
 ('alpha-1-acid glycoprotein 1 (human)',
  'participates_in',
  'Innate Immune System'),
 ('Innate Immune System', '', 'CD3G (human)'),
 ('CD3G (human)', 'participates_in', 'TCR signaling'),
 ('TCR signaling', '', 'proteasome subunit beta type-8 (human)'),
 ('proteasome subunit beta type-8 (human)',
  'participates_in',
  'Beta-catenin independent WNT signaling'),
 ('Beta-catenin independent WNT signaling', '', 'GNGT2 (human)'),
 ('GNGT2 (human)', 'participates_in', 'Ca2+ pathway'),
 ('Ca2+ pathway', '', 'inositol 1,4,5-trisphosphate receptor type 2 (human)'),
 ('inositol 1,4,5-trisphosphate receptor type 2 (human)',
  'molecularly inter

In [41]:
cyp3a4_edge_paths[0]

[(rdflib.term.URIRef('http://napdi.org/napdi_srs_imports:camellia_sinensis_leaf'),
  rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_90'),
  rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002180')),
 (rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_90'),
  rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_76125'),
  rdflib.term.URIRef('http://purl.obolibrary.org/obo/chebi#is_enantiomer_of')),
 (rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_76125'),
  rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_26195'),
  rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf')),
 (rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_26195'),
  rdflib.term.URIRef('http://purl.obolibrary.org/obo/GO_0002682'),
  rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002436')),
 (rdflib.term.URIRef('http://purl.obolibrary.org/obo/GO_0002682'),
  rdflib.term.URIRef('http://purl.obolibrary.org/obo/PR_P02763'),
  rdflib.term.URIRef('htt

In [91]:
def k_shortest_paths(G, source, target, k, weight='weight'):
    return list(islice(nx.all_shortest_paths(G, source, target, weight=weight), k))

In [92]:
#only returns node lists; use get_path_labels to generate edges and get labels
def k_simple_paths(G, source, target, k, shortestLen):
    #cutoff bounds the number of edges, so a returned node list has at most shortestLen+21 nodes
    paths = nx.all_simple_paths(G, source, target, cutoff=shortestLen+20)
    path_l = []
    i = 0
    #examine at most k candidate paths from the generator
    while i < k:
        try:
            print('[info] applying next operator to search for a simple path of max length {}'.format(shortestLen+20))
            path = next(paths)
        except StopIteration:
            break
        #len(path) counts nodes, i.e. number of edges + 1
        print('[info] Simple path found of length {}'.format(len(path)))
        if len(path) > shortestLen:
            print('[info] Simple path length greater than shortest path length ({}) so adding to results'.format(shortestLen))
            path_l.append(path)
        i += 1
    return path_l
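
The `islice` pattern used in `k_shortest_paths` also works for simple paths. Below is a minimal sketch on a toy graph; the graph, the node names, and the helper name `first_k_simple_paths` are illustrative assumptions and not part of the notebook's data. Note that `cutoff` in `nx.all_simple_paths` bounds the number of edges, while `len(path)` counts nodes, so a path with `cutoff` edges has `cutoff + 1` nodes.

```python
from itertools import islice
import networkx as nx

def first_k_simple_paths(G, source, target, k, cutoff):
    """Return up to k simple paths (node lists) with at most `cutoff` edges."""
    return list(islice(nx.all_simple_paths(G, source, target, cutoff=cutoff), k))

# Toy directed graph (hypothetical nodes, for illustration only)
G = nx.DiGraph()
G.add_edges_from([('a', 'b'), ('b', 'd'), ('a', 'c'), ('c', 'd'), ('a', 'd')])

paths = first_k_simple_paths(G, 'a', 'd', k=2, cutoff=2)
for p in paths:
    # len(p) counts nodes; the number of edges is len(p) - 1
    print(p, 'edges:', len(p) - 1)
```

Because the generator is lazy, only the first `k` paths are ever computed, which matters on a graph of this size.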

In [25]:
cyp3a4_paths = k_simple_paths(nx_graph, napdi.camellia_sinensis_leaf, obo.PR_P08684, 10, 4)

[info] applying next operator to search for a simple path of max length 24
[info] Simple path found of length 25
[info] Simple path length greater than shortest path length (4) so adding to results
[info] applying next operator to search for a simple path of max length 24
[info] Simple path found of length 25
[info] Simple path length greater than shortest path length (4) so adding to results
[info] applying next operator to search for a simple path of max length 24
[info] Simple path found of length 25
[info] Simple path length greater than shortest path length (4) so adding to results
[info] applying next operator to search for a simple path of max length 24
[info] Simple path found of length 25
[info] Simple path length greater than shortest path length (4) so adding to results
[info] applying next operator to search for a simple path of max length 24
[info] Simple path found of length 25
[info] Simple path length greater than shortest path length (4) so adding to results
[info] app

In [179]:
str(obo.PR_P08684).split('/')[-1]

'PR_P08684'

In [43]:
#if returned paths are node lists
#simple paths with at most 24 edges (25 nodes)
save2 = 'greentea_cyp3a4_simple_paths_10.txt'
file_save = open(DIR_OUT+save2, 'w')
source = str(napdi.camellia_sinensis_leaf)
target = str(obo.PR_P08684)
source_label = source
target_label = target
if source in nodeLabels:
    source_label = nodeLabels[source]
if target in nodeLabels:
    target_label = nodeLabels[target]
file_save.write('\n{} - {} Simple Path (cutoff=24):\n'.format(source_label, target_label))
i = 0
for node_list in cyp3a4_paths:
    file_save.write('\nPATH: '+str(i)+'\n')
    path_labels = get_path_labels(node_list)
    for triples in path_labels:
        for item in triples:
            file_save.write(str(item)+' ')
        file_save.write('\n')
    i += 1
file_save.close()

In [75]:
source = str(napdi.camellia_sinensis_leaf)
target = str(obo.CHEBI_6931)
save2 = 'greentea_midazolam_simple_paths_20.txt'
cyp3a4_edge_paths, cyp3a4_edge_path_labs = k_simple_edge_paths(nx_graph, napdi.camellia_sinensis_leaf, obo.CHEBI_6931, 20, 0)
#fall back to the URI when a node has no label
source_label = source
target_label = target
if source in nodeLabels:
    source_label = nodeLabels[source]
if target in nodeLabels:
    target_label = nodeLabels[target]
#the context manager closes the file even if a write fails
with open(DIR_OUT+save2, 'w') as file_save:
    file_save.write('\n{} - {} Simple Path (cutoff=20):\n'.format(source_label, target_label))
    for i, path_list in enumerate(cyp3a4_edge_path_labs):
        file_save.write('\nPATH: '+str(i)+'\n')
        for triples in path_list:
            for item in triples:
                file_save.write(str(item)+' ')
            file_save.write('\n')

[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 19
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (0) so adding to results
[info] app

In [93]:
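#if returned paths are node lists
#simple paths with max length 25
#NOTE: midazolam_paths must be assigned before this cell runs, otherwise it fails with the NameError shown below;
#it could be computed by analogy with cyp3a4_paths, e.g. k_simple_paths(nx_graph, napdi.camellia_sinensis_leaf, obo.CHEBI_6931, 10, 4)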
save3 = 'greentea_midazolam_simple_paths_10.txt'
file_save = open(DIR_OUT+save3, 'w')
source = str(napdi.camellia_sinensis_leaf)
target = str(obo.CHEBI_6931)
source_label = source
target_label = target
if source in nodeLabels:
    source_label = nodeLabels[source]
if target in nodeLabels:
    target_label = nodeLabels[target]
file_save.write('\n{} - {} Simple Path (cutoff=20):\n'.format(source_label, target_label))
i = 0
for node_list in midazolam_paths:
    file_save.write('\nPATH: '+str(i)+'\n')
    path_labels = get_path_labels(node_list)
    for triples in path_labels:
        for item in triples:
            file_save.write(str(item)+' ')
        file_save.write('\n')
    i += 1
file_save.close()

NameError: name 'midazolam_paths' is not defined

In [94]:
##Bidirectional shortest paths
pathx = nx.bidirectional_shortest_path(nx_graph, napdi.camellia_sinensis_leaf, obo.CHEBI_10033)
path_labels = get_path_labels(pathx)
for triples in path_labels:
    print(triples)

['Camellia_sinensis_leaf', 'has component', 'gallocatechin', '']
['gallocatechin', 'directly negatively regulates activity of', 'Tea', 'machine_read']
['Tea', 'directly negatively regulates activity of', 'warfarin', 'machine_read']
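
`get_path_labels` is defined earlier in the notebook and is not shown here. As a rough sketch of the idea, assuming that predicates are stored as the edge keys of a `MultiDiGraph` and that machine-read edges carry a `source_graph` attribute, a node list can be turned into labelled rows as follows. The function name `label_path`, the toy graph, and the label dictionary are all hypothetical.

```python
import networkx as nx

def label_path(G, path, labels):
    """Convert a node list into [subject, predicate, object, source] rows.

    Assumes predicates are the edge keys of a MultiDiGraph. If several
    parallel edges exist between two nodes, one row is emitted per edge.
    """
    rows = []
    for u, v in zip(path, path[1:]):
        for pred, data in G[u][v].items():
            rows.append([labels.get(u, u), labels.get(pred, pred),
                         labels.get(v, v), data.get('source_graph', '')])
    return rows

# Toy graph with hypothetical identifiers
G = nx.MultiDiGraph()
G.add_edge('ex:tea', 'ex:catechin', key='ex:has_component', weight=0.0)
G.add_edge('ex:catechin', 'ex:drug', key='ex:interacts_with',
           weight=0.0, source_graph='machine_read')
labels = {'ex:tea': 'tea', 'ex:catechin': 'catechin',
          'ex:has_component': 'has component'}

path = nx.bidirectional_shortest_path(G, 'ex:tea', 'ex:drug')
rows = label_path(G, path, labels)
for row in rows:
    print(row)
```

Identifiers without an entry in the label dictionary fall back to the raw identifier, so a missing label never hides an edge.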


In [95]:
pathx = nx.bidirectional_shortest_path(nx_graph, napdi.camellia_sinensis_leaf, obo.PR_P08684)
path_labels = get_path_labels(pathx)
for triples in path_labels:
    print(triples)

['Camellia_sinensis_leaf', 'has component', '(-)-epicatechin', '']
['(-)-epicatechin', 'negatively regulates', 'etoposide', 'machine_read']
['etoposide', 'interacts with', 'cytochrome P450 3A4 (human)', '']


In [96]:
pathx = nx.bidirectional_shortest_path(nx_graph, napdi.camellia_sinensis_leaf, obo.CHEBI_6931)
path_labels = get_path_labels(pathx)
for triples in path_labels:
    print(triples)

['Camellia_sinensis_leaf', 'has component', '(-)-epicatechin', '']
['(-)-epicatechin', 'IncreaseAmount', 'taurochenodeoxycholate 6alpha-hydroxylase activity', 'machine_read']
['taurochenodeoxycholate 6alpha-hydroxylase activity', 'molecularly interacts with', 'midazolam', 'machine_read']


In [97]:
pathx = nx.bidirectional_shortest_path(nx_graph, napdi.camellia_sinensis_leaf, obo.HP_0003418)
path_labels = get_path_labels(pathx)
for triples in path_labels:
    print(triples)

['Camellia_sinensis_leaf', 'has component', '(-)-epicatechin', '']
['(-)-epicatechin', 'subClassOf', 'catechin', '']
['catechin', 'molecularly interacts with', 'daunorubicin', 'machine_read']
['daunorubicin', 'is substance that treats', 'Back pain', '']


In [98]:
pathx = nx.bidirectional_shortest_path(nx_graph, napdi.camellia_sinensis_leaf, obo.CHEBI_9150)
path_labels = get_path_labels(pathx)
for triples in path_labels:
    print(triples)

['Camellia_sinensis_leaf', 'has component', '(-)-epicatechin', '']
['(-)-epicatechin', 'subClassOf', 'catechin', '']
['catechin', 'molecularly interacts with', 'simvastatin', 'machine_read']


In [99]:
pathx = nx.bidirectional_shortest_path(nx_graph, napdi.camellia_sinensis_leaf, obo.CHEBI_7444)
path_labels = get_path_labels(pathx)
for triples in path_labels:
    print(triples)

['Camellia_sinensis_leaf', 'has component', 'gallocatechin', '']
['gallocatechin', 'directly negatively regulates activity of', 'Tea', 'machine_read']
['Tea', 'directly negatively regulates activity of', 'nadolol', 'machine_read']


### Kratom

In [100]:
kratom = napdi.mitragyna_speciosa
mitragynine = obo.CHEBI_6956
hydroxy_mitragynine = napdi['7_hydroxy_mitragynine']
hydroxy_mitragynine

rdflib.term.URIRef('http://napdi.org/napdi_srs_imports:7_hydroxy_mitragynine')

In [101]:
pathx = nx.bidirectional_shortest_path(nx_graph, kratom, obo.PR_P08684)
path_labels = get_path_labels(pathx)
for triples in path_labels:
    print(triples)

['Mitragyna_speciosa', 'has component', 'Mitragynine', '']
['Mitragynine', 'interacts with', 'cytochrome P450 3A4 (human)', '']


In [22]:
for key in nodeLabels:
    print(key)
    print(nodeLabels[key])
    break

https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000558233
IPO4-205


### Path searches with MR nodes as end points - predications with highest belief scores
1. Get MR predications with belief scores > 0.8
2. Use the subject and object nodes as start and end points for simple path searches (the shortest path would just be the direct link between the two nodes)
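
The two steps above can be sketched with pandas. The tiny DataFrame below is made up for illustration, with column names mirroring the machine-reading TSV; the helper name `high_belief_pairs` and the threshold passed to it are assumptions.

```python
import pandas as pd

def high_belief_pairs(df, threshold):
    """Return unique (subject_obo, object_obo) pairs with belief > threshold,
    ordered from highest to lowest belief."""
    kept = df.loc[df['belief'] > threshold]
    # mergesort is stable, so ties keep their original row order
    kept = kept.sort_values(by='belief', ascending=False, kind='mergesort')
    kept = kept.drop_duplicates(subset=['subject_obo', 'object_obo'])
    return list(zip(kept['subject_obo'], kept['object_obo']))

# Made-up rows for illustration only
toy = pd.DataFrame({
    'subject_obo': ['ex:A', 'ex:A', 'ex:B', 'ex:C'],
    'object_obo':  ['ex:X', 'ex:X', 'ex:Y', 'ex:Z'],
    'belief':      [0.95,   0.86,   0.65,   0.90],
})

pairs = high_belief_pairs(toy, threshold=0.8)
print(pairs)
```

Each resulting pair can then serve as the source and target of a simple path search.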

In [67]:
df = pd.read_csv('../machine_read/greentea_pmid_all_predicates_umls_processed.tsv', sep='\t')
df.head()

Unnamed: 0,subject_cui,subject_name,subject_source,predicate,object_source,object_cui,object_name,subj_reach_grounding,obj_reach_grounding,pmid,pub_year,belief,predicate_obo,subject_obo,object_obo
0,C0017337,Genes,gene encoding SERCA2a,Acetylation,Histone_H3,C0019652,Histones,"(None, None)","('FPLX', 'Histone_H3')",30286210,2018,0.65,http://purl.obolibrary.org/obo/GO_0006473,http://purl.obolibrary.org/obo/SO_0000704,http://purl.obolibrary.org/obo/PR_000041244
1,C1418880,PRDX2_gene,trichostatin A,Acetylation,Tubulin,C0041348,Tubulin,"('CHEBI', 'CHEBI:46024')","('FPLX', 'Tubulin')",25680958,2015 Apr,0.65,http://purl.obolibrary.org/obo/GO_0006473,http://purl.obolibrary.org/obo/CHEBI_46024,http://purl.obolibrary.org/obo/PR_000028799
2,C0059438,epigallocatechin_gallate,--epigallocatechin 3-gallate,Acetylation,Histone,C0019652,Histones,"('CHEBI', 'CHEBI:4806')","('FPLX', 'Histone')",23210776,2013,0.65,http://purl.obolibrary.org/obo/GO_0006473,http://purl.obolibrary.org/obo/CHEBI_4806,http://purl.obolibrary.org/obo/PR_000041244
3,C0073591,rosoxacin,ROS1,Acetylation,TMPRSS11D,C0444765,Hat_-_Headwear,"('HGNC', '10261')","('HGNC', '24059')",25847253,2015,0.65,http://purl.obolibrary.org/obo/GO_0006473,http://purl.obolibrary.org/obo/CHEBI_131715,http://purl.obolibrary.org/obo/PR_Q9M2N5
4,C3539643,EIF4E_wt_Allele,CREBBP,Acetylation,RELA,C1453853,"WNK1_protein,_human","('HGNC', '2348')","('HGNC', '9955')",25847253,2015,0.86,http://purl.obolibrary.org/obo/GO_0006473,http://purl.obolibrary.org/obo/PR_P63074,http://purl.obolibrary.org/obo/PR_000017431


In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7677 entries, 0 to 7676
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   subject_cui           7677 non-null   object 
 1   subject_name          7677 non-null   object 
 2   subject_source        7677 non-null   object 
 3   predicate             7677 non-null   object 
 4   object_source         7677 non-null   object 
 5   object_cui            7677 non-null   object 
 6   object_name           7677 non-null   object 
 7   subj_reach_grounding  7677 non-null   object 
 8   obj_reach_grounding   7677 non-null   object 
 9   pmid                  7677 non-null   int64  
 10  pub_year              7677 non-null   object 
 11  belief                7677 non-null   float64
 12  predicate_obo         7677 non-null   object 
 13  subject_obo           7677 non-null   object 
 14  object_obo            7677 non-null   object 
dtypes: float64(1), int64(

In [70]:
df = df.loc[df['belief'] > 0.8]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2012 entries, 4 to 7675
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   subject_cui           2012 non-null   object 
 1   subject_name          2012 non-null   object 
 2   subject_source        2012 non-null   object 
 3   predicate             2012 non-null   object 
 4   object_source         2012 non-null   object 
 5   object_cui            2012 non-null   object 
 6   object_name           2012 non-null   object 
 7   subj_reach_grounding  2012 non-null   object 
 8   obj_reach_grounding   2012 non-null   object 
 9   pmid                  2012 non-null   int64  
 10  pub_year              2012 non-null   object 
 11  belief                2012 non-null   float64
 12  predicate_obo         2012 non-null   object 
 13  subject_obo           2012 non-null   object 
 14  object_obo            2012 non-null   object 
dtypes: float64(1), int64(

In [71]:
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,subject_cui,subject_name,subject_source,predicate,object_source,object_cui,object_name,subj_reach_grounding,obj_reach_grounding,pmid,pub_year,belief,predicate_obo,subject_obo,object_obo
0,C3539643,EIF4E_wt_Allele,CREBBP,Acetylation,RELA,C1453853,"WNK1_protein,_human","('HGNC', '2348')","('HGNC', '9955')",25847253,2015,0.86,http://purl.obolibrary.org/obo/GO_0006473,http://purl.obolibrary.org/obo/PR_P63074,http://purl.obolibrary.org/obo/PR_000017431
1,C1530358,"EP300_protein,_human",EP300,Acetylation,RELA,C1453853,"WNK1_protein,_human","('HGNC', '3373')","('HGNC', '9955')",25847253,2015,0.86,http://purl.obolibrary.org/obo/GO_0006473,http://purl.obolibrary.org/obo/PR_000007102,http://purl.obolibrary.org/obo/PR_000017431
2,C1843013,"Alzheimer_disease,_familial,_type_3",AD,Activation,long-term synaptic potentiation,C0206249,Long-Term_Potentiation,"(None, None)","('GO', 'GO:0060291')",29944861,2018 Aug,0.923,http://purl.obolibrary.org/obo/RO_0002436,http://purl.obolibrary.org/obo/MONDO_0100087,http://purl.obolibrary.org/obo/GO_0060291
3,C0025918,"Mice,_Inbred_AKR",AKR,Activation,secondary alcohol,C0001962,Ethanol,"(None, None)","('CHEBI', 'CHEBI:35681')",28283780,2017 Jun,0.86,http://purl.obolibrary.org/obo/RO_0002436,http://purl.obolibrary.org/obo/NCBITaxon_10088,http://purl.obolibrary.org/obo/CHEBI_16236
4,C3814396,CYP3A_Gene_Locus,CYP3A,Activation,metabolic process,C0025520,metabolic_aspects,"(None, None)","('GO', 'GO:0008152')",29368187,2018 May,0.949271,http://purl.obolibrary.org/obo/RO_0002436,http://purl.obolibrary.org/obo/CHEBI_38559,http://purl.obolibrary.org/obo/GO_0008152


In [72]:
df = df.sort_values(by=['belief'], ascending=False)
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,subject_cui,subject_name,subject_source,predicate,object_source,object_cui,object_name,subj_reach_grounding,obj_reach_grounding,pmid,pub_year,belief,predicate_obo,subject_obo,object_obo
0,C0596577,Flavonoids,flavonoids,Activation,apoptotic process,C0162638,Apoptosis,"('CHEBI', 'CHEBI:72544')","('GO', 'GO:0006915')",22830339,2012,0.95,http://purl.obolibrary.org/obo/RO_0002436,http://purl.obolibrary.org/obo/CHEBI_72544,http://purl.obolibrary.org/obo/GO_0006915
1,C0059438,epigallocatechin_gallate,--epigallocatechin 3-gallate,Inhibition,Neoplasms,C0027651,Neoplasms,"('CHEBI', 'CHEBI:4806')","('MESH', 'D009369')",29137307,2017 Oct 10,0.95,http://purl.obolibrary.org/obo/RO_0002449,http://purl.obolibrary.org/obo/CHEBI_4806,http://purl.obolibrary.org/obo/HP_0002664
2,C0016979,Gallic_acid,gallic acid,Activation,Mice,C0026809,Mus,"('CHEBI', 'CHEBI:30778')","('MESH', 'D051379')",24722818,2014 Jul,0.95,http://purl.obolibrary.org/obo/RO_0002436,http://purl.obolibrary.org/obo/CHEBI_30778,http://purl.obolibrary.org/obo/NCBITaxon_10088
3,C0016993,Gambia,gallic acid,Activation,apoptotic process,C0162638,Apoptosis,"('CHEBI', 'CHEBI:30778')","('GO', 'GO:0006915')",26251571,2015,0.95,http://purl.obolibrary.org/obo/RO_0002436,http://purl.obolibrary.org/obo/CHEBI_30778,http://purl.obolibrary.org/obo/GO_0006915
4,C0071649,polyphenols,polyphenol,Inhibition,glucose transmembrane transport,C0178666,glucose_transport,"('CHEBI', 'CHEBI:26195')","('GO', 'GO:1904659')",30667442,2019 Feb 20,0.95,http://purl.obolibrary.org/obo/RO_0002449,http://purl.obolibrary.org/obo/CHEBI_26195,http://purl.obolibrary.org/obo/GO_1904659


In [58]:
df.to_csv('../machine_read/MR_triples_searchpath.tsv', sep='\t', index=False)

In [None]:
#Subject-Object pairs (testing paths) for triples from machine reading with belief scores > 0.8
'''
1. catechin (CHEBI_90) -> ABCB1 (), biosynthetic process (), transport (), apoptotic process (), coronary disease (),
cholesterol (), myocardial ischemia, cisplatin, heart disease, glucose, glucose import, glucose metabolic process,
hyperglycemia, intestinal absorption
2. epigallocatechin gallate (CHEBI_4806) -> quinone, paracetamol, Endoplasmic Reticulum Stress, ATP, ATPase, autophagy, bile acid,
transport, cell death, cholesterol, cisplatin, dexamethasone, diclofenac, digoxin, dopamine, drug metabolic process, 
erythromycin, glutathione, heart failure, hemolysis, angiotensin-2, cortisol, insulin secretion, insulin resistance,
liver failure, nadolol, obesity, quercetin, tamoxifen, verapamil
3. greentea -> atorvastatin, rosuvastatin, benzo[a]pyrene, cardiovascular disease, stroke, cholesterol,
Myocardial Ischemia, Coronary Disease, Diabetes Mellitus, diclofenac, digoxin, doxorubicin, hypertension, liver disease,
nadolol, obesity, warfarin, glucose import, glutathione
EXTENDED LISTS BELOW
'''

In [95]:
catechin_list = ['ABCB1_gene', 'Anabolism', 'Biological_Transport', 'Apoptosis', 'Cell_Proliferation', 
                 'Coronary_Arteriosclerosis', 'Cholesterol', 'Cytochrome_P-450_CYP1A1', 'Cytochrome_P-450_CYP1A2',
                 'Cytochrome_P-450_CYP3A4', 'Insulin_Secretion', 'Cisplatin', 'Heart_Diseases', 'Glucose', 
                 'glucose_uptake', 'glucose_transport', 'Hyperglycemia', 'Obesity', 'P-Glycoprotein',
                'UGT1A1_gene', 'Weight_decreased']
egcg_list = ['1,4-benzoquinone', 'ABCA1_gene', 'Acetaminophen', 'Adenosine_Triphosphatases', 'Autophagy',
            'Bile_Acids', 'Bilirubin', 'Biological_Transport', 'Body_Weight', 'BRCA1_protein,_human',
             'Cell_Death', 'Cell_Proliferation', 'Cholesterol', 'Cisplatin', 'Collagen', 'Coronary_Arteriosclerosis',
             'Cytochrome_P-450_CYP1A1', 'Cytochrome_P-450_CYP1A2', 'Cytochrome_P-450_CYP3A4', 'Cytochrome_P-450_CYP2D6',
             'Cytochrome_P-450_CYP2C19', 'drug_metabolism', 'Dexamethasone', 'Diclofenac', 'Digoxin', 'Dopamine', 
             'GA-Binding_Protein_Transcription_Factor', 'Gluconeogenesis', 'Glucose_Transporter', 
             'glucose_transport', 'glucose_uptake', 'Glutathione', 'Glycogen', 'Erythromycin',  'Heart_failure', 
             'Hemolysis_(disorder)', 'Inflammation', 'Hydrocortisone',
             'Interleukin-1', 'Interleukin-6', 'Intestinal_Absorption', 'rosoxacin', 'UGT1A1_gene',
              'Insulin_Secretion', 'Insulin_Resistance', 'Liver_Failure',
             'Nadolol', 'Obesity', 'Quercetin', 'Tamoxifen', 'Verapamil']
greentea_list = ['ABCB1_gene', 'ABCG2_gene', 'Acetaminophen', 'Biological_Transport', 'Cardiovascular_Diseases',
            'Cerebrovascular_accident', 'Coronary_Arteriosclerosis', 'atorvastatin',  'Benzopyrenes', 'Cholesterol',
            'Cytochrome_P450', 'Cytochromes', 'Cytochrome_P-450_CYP1A1', 'Cytochrome_P-450_CYP1A2',
            'Cytochrome_P-450_CYP3A4', 'Diabetes_Mellitus', 'Diclofenac', 'Digoxin', 'Doxorubicin', 'glucose_transport',
            'Hypertensive_disease', 'Hay_fever', 'Interleukin-10', 'Lipid_Metabolism', 'Liver_diseases', 
            'Low-Density_Lipoproteins', 'Nadolol', 'Obesity', 'glucose_uptake', 'Glutathione',
            'SLC2A1_protein,_human', 'SLC5A1_gene', 'SLCO1A2_gene', 'SLCO2B1_gene', 'Warfarin',
            'rosuvastatin', 'rosoxacin', 'TNFSF11_protein,_human', 'TRPA1_gene', 'TRPV1_gene']


In [96]:
#get OBO identifiers from dataframe
node_dict = {}
for item in catechin_list + egcg_list + greentea_list:
    if item not in node_dict:
        print(item)
        #first matching row wins; raises IndexError if the name is absent from df
        obo_id = df.loc[df['object_name'] == item]['object_obo'].values[0]
        node_dict[item] = obo_id.split('/')[-1]
len(node_dict)


ABCB1_gene
Anabolism
Biological_Transport
Apoptosis
Cell_Proliferation
Coronary_Arteriosclerosis
Cholesterol
Cytochrome_P-450_CYP1A1
Cytochrome_P-450_CYP1A2
Cytochrome_P-450_CYP3A4
Insulin_Secretion
Cisplatin
Heart_Diseases
Glucose
glucose_uptake
glucose_transport
Hyperglycemia
Obesity
P-Glycoprotein
UGT1A1_gene
Weight_decreased
1,4-benzoquinone
ABCA1_gene
Acetaminophen
Adenosine_Triphosphatases
Autophagy
Bile_Acids
Bilirubin
Body_Weight
BRCA1_protein,_human
Cell_Death
Collagen
Cytochrome_P-450_CYP2D6
Cytochrome_P-450_CYP2C19
drug_metabolism
Dexamethasone
Diclofenac
Digoxin
Dopamine
GA-Binding_Protein_Transcription_Factor
Gluconeogenesis
Glucose_Transporter
Glutathione
Glycogen
Erythromycin
Heart_failure
Hemolysis_(disorder)
Inflammation
Hydrocortisone
Interleukin-1
Interleukin-6
Intestinal_Absorption
rosoxacin
Insulin_Resistance
Liver_Failure
Nadolol
Quercetin
Tamoxifen
Verapamil
ABCG2_gene
Cardiovascular_Diseases
Cerebrovascular_accident
atorvastatin
Benzopyrenes
Cytochrome_P450
Cyto

83
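
Indexing with `.values[0]` raises an `IndexError` when a name has no matching row. A slightly more defensive sketch builds the lookup once and reports missing names instead of failing. The toy DataFrame (including the `example.org` URIs) and the helper name `build_node_dict` are assumptions for illustration.

```python
import pandas as pd

def build_node_dict(df, names):
    """Map each object_name to the last segment of its first object_obo URI.

    Returns (node_dict, missing), where `missing` lists names not found in df.
    """
    first = df.drop_duplicates(subset='object_name').set_index('object_name')['object_obo']
    node_dict, missing = {}, []
    for name in names:
        if name in node_dict:
            continue
        if name in first.index:
            node_dict[name] = first[name].split('/')[-1]
        elif name not in missing:
            missing.append(name)
    return node_dict, missing

toy = pd.DataFrame({
    'object_name': ['Hyperglycemia', 'Obesity', 'Obesity'],
    'object_obo': ['http://purl.obolibrary.org/obo/HP_0003074',
                   'http://example.org/obo/EX_0000001',
                   'http://example.org/obo/EX_0000002'],
})
node_dict, missing = build_node_dict(toy, ['Hyperglycemia', 'Obesity', 'Glucose', 'Obesity'])
print(node_dict, missing)
```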

In [122]:
node_dict['Hyperglycemia']

'HP_0003074'

In [35]:
x = zip(cyp3a4_edge_paths, cyp3a4_edge_path_labs)

In [38]:
for item in x:
    print(item)
    print(type(item))
    break

([(rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_23053'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/GO_0018130'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002436')), (rdflib.term.URIRef('http://purl.obolibrary.org/obo/GO_0018130'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/GO_0044249'), rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf')), (rdflib.term.URIRef('http://purl.obolibrary.org/obo/GO_0044249'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/GO_0009058'), rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf')), (rdflib.term.URIRef('http://purl.obolibrary.org/obo/GO_0009058'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/PR_P17735'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0000057')), (rdflib.term.URIRef('http://purl.obolibrary.org/obo/PR_P17735'), rdflib.term.URIRef('https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000355962'), rdflib.term.URIRef('http://purl.o

In [61]:
from typing import Dict, List, Optional, Set, Tuple, Union
from rdflib.plugins.serializers.nt import _quoteLiteral

def n3(node: Union[URIRef, BNode, Literal]) -> str:
    """Method takes an RDFLib node of type BNode, URIRef, or Literal and serializes it to meet the RDF 1.1 NTriples
    format.
    Src: https://github.com/RDFLib/rdflib/blob/c11f7b503b50b7c3cdeec0f36261fa09b0615380/rdflib/plugins/serializers/nt.py
    Args:
        node: An RDFLib node (BNode, URIRef, or Literal).
    Returns:
        serialized_node: A string containing the serialized node.
    """
    if isinstance(node, Literal):
        serialized_node = "%s" % _quoteLiteral(node)
    else:
        serialized_node = "%s" % node.n3()
    return serialized_node

In [59]:
s = URIRef('http://napdi.org/napdi_srs_imports:camellia_sinensis_leaf')
p = URIRef('http://purl.obolibrary.org/obo/RO_0002180')
o = URIRef('http://purl.obolibrary.org/obo/CHEBI_90')

In [62]:
pred_key = hashlib.md5('{}{}{}'.format(n3(s), n3(p), n3(o)).encode()).hexdigest()

In [63]:
pred_key

'65161e94646ef7334785bf7ac25257be'

In [71]:
nx_graph[s][o][p]

{'predicate_key': '65161e94646ef7334785bf7ac25257be', 'weight': 0.0}
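
Each edge stores a `predicate_key`, computed above as the MD5 hex digest of the concatenated N-Triples forms of subject, predicate, and object. Below is a dependency-free sketch of the same computation, assuming all three terms are plain URIs, so that their N3 form is the URI wrapped in angle brackets. `uri_n3` and `predicate_key` are hypothetical helper names.

```python
import hashlib

def uri_n3(uri):
    """N-Triples form of a URI: the URI wrapped in angle brackets."""
    return '<{}>'.format(uri)

def predicate_key(s, p, o):
    """MD5 hex digest of the concatenated N3 forms of subject, predicate, object."""
    joined = '{}{}{}'.format(uri_n3(s), uri_n3(p), uri_n3(o))
    return hashlib.md5(joined.encode()).hexdigest()

s = 'http://napdi.org/napdi_srs_imports:camellia_sinensis_leaf'
p = 'http://purl.obolibrary.org/obo/RO_0002180'
o = 'http://purl.obolibrary.org/obo/CHEBI_90'

key = predicate_key(s, p, o)
print(key)  # 32-character hexadecimal string
```

The key depends on the order of the three terms, so swapping subject and object gives a different value.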

### Fix nodeLabels in file from build 3.0.0

In [1]:
import json
import pickle

In [2]:
import re

In [50]:
len(nodeLabels)

753217

In [4]:
with open('/home/sanya/PheKnowLatorv2/resources/knowledge_graphs/' + 'nodeLabels_20211021.pickle', 'rb') as file1:
    nods = pickle.load(file1)

In [5]:
nods['http://purl.obolibrary.org/obo/PR_Q9H9S0']

'homeobox protein NANOG (human)'

In [12]:
nods['http://purl.obolibrary.org/obo/DIDEO_00000041'] = 'is substrate of'

In [9]:
nods['http://purl.obolibrary.org/obo/RO_0002204'] = 'gene product of'

In [None]:
#check inhibits, RO_0002204, DIDEO_

In [14]:
fileo = open('/home/sanya/PheKnowLatorv2/resources/knowledge_graphs/' + 'nodeLabels_20220329.pickle', 'wb')


In [15]:
len(nods)

753219

In [52]:
for key in correctLabels:
    print(key)
    print(correctLabels[key])
    break

<http://purl.obolibrary.org/obo/VO_0002752>
{'label': 'E2 from Western equine encephalomyelitis virus', 'description/definition': 'N/A'}


In [53]:
for key in correctLabels:
    node = key.strip('<')
    node = node.strip('>')
    newLabel = correctLabels[key]['label']
    if node in nods:
        if newLabel != 'N/A':
            nods[node] = newLabel
len(nods)

753217
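
The correction loop above can be wrapped in a small function that also reports how many labels changed. This is a sketch on made-up dictionaries; `apply_label_corrections` is a hypothetical name. Because only existing keys are overwritten, the size of the label dictionary does not change.

```python
def apply_label_corrections(labels, corrections):
    """Overwrite labels in place for nodes that already exist in `labels`.

    `corrections` maps '<uri>' to {'label': ...}; entries labelled 'N/A' are skipped.
    Returns the number of labels that were overwritten.
    """
    updated = 0
    for key, info in corrections.items():
        node = key.strip('<>')  # drop the surrounding angle brackets
        new_label = info['label']
        if node in labels and new_label != 'N/A':
            labels[node] = new_label
            updated += 1
    return updated

labels = {'http://example.org/A': 'old label', 'http://example.org/B': 'keep me'}
corrections = {
    '<http://example.org/A>': {'label': 'new label'},
    '<http://example.org/B>': {'label': 'N/A'},
    '<http://example.org/C>': {'label': 'not in labels'},
}
n_updated = apply_label_corrections(labels, corrections)
print(n_updated, labels)
```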

In [16]:
nods['http://purl.obolibrary.org/obo/SO_0000704']

'gene'

In [20]:
##add NCBI labels for kratom and green tea. Add others when including more NPs
nods['http://purl.obolibrary.org/obo/NCBITaxon_4442'] = 'Camellia sinensis'
nods['http://purl.obolibrary.org/obo/NCBITaxon_170351'] = 'Mitragyna speciosa'


In [17]:
nods['http://purl.obolibrary.org/obo/PR_Q9H9S0']

'homeobox protein NANOG (human)'

In [63]:
nas[:20]

['<http://purl.obolibrary.org/obo/PR_Q6V1P9>',
 '<http://purl.obolibrary.org/obo/PR_Q9NZV6-1>',
 '<http://purl.obolibrary.org/obo/PR_Q9GZL7>',
 '<http://purl.obolibrary.org/obo/PR_O76083-5>',
 '<http://purl.obolibrary.org/obo/PR_A0A087WT02>',
 '<http://purl.obolibrary.org/obo/PR_A0AVF1-1>',
 '<http://purl.obolibrary.org/obo/PR_Q12968-4>',
 '<http://purl.obolibrary.org/obo/PR_Q9HBI1-2>',
 '<http://purl.obolibrary.org/obo/PR_Q86SZ2>',
 '<http://purl.obolibrary.org/obo/PR_Q5HYK9>',
 '<http://purl.obolibrary.org/obo/PR_Q9H2P9-4>',
 '<http://purl.obolibrary.org/obo/PR_O60381-1>',
 '<http://purl.obolibrary.org/obo/PR_Q16667>',
 '<http://purl.obolibrary.org/obo/PR_A2NJV5>',
 '<http://purl.obolibrary.org/obo/PR_Q6P461-3>',
 '<http://purl.obolibrary.org/obo/PR_Q8TDC0>',
 '<http://purl.obolibrary.org/obo/PR_000054919>',
 '<http://purl.obolibrary.org/obo/PR_P53667-4>',
 '<http://purl.obolibrary.org/obo/PR_O14647-1>',
 '<http://purl.obolibrary.org/obo/PR_Q6ZSI9>']

In [58]:
nas = []
for key in correctLabels:
    label = correctLabels[key]['label']
    if label == 'N/A':
        nas.append(key)
len(nas)

54903

In [62]:
#start at zero so the total is not off by one
count = 0
for item in nas:
    if 'PR' in item:
        count += 1
count

54843
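
The substring test `'PR' in item` matches any key that contains the letters PR anywhere in the URI. A stricter sketch tallies keys by the ontology prefix of the final URI segment. The first two keys are taken from the `nas` sample shown earlier; the third is made up, and `ontology_prefix` is a hypothetical helper.

```python
from collections import Counter

def ontology_prefix(key):
    """Return the ontology prefix of a '<.../PREFIX_ID>' key, e.g. 'PR'."""
    local_id = key.strip('<>').rsplit('/', 1)[-1]
    return local_id.split('_', 1)[0]

nas_sample = [
    '<http://purl.obolibrary.org/obo/PR_Q6V1P9>',
    '<http://purl.obolibrary.org/obo/PR_000054919>',
    '<http://example.org/obo/EX_0000001>',  # made-up non-PR entry
]
counts = Counter(ontology_prefix(k) for k in nas_sample)
print(counts['PR'])
```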

In [21]:
pickle.dump(nods, fileo)
#close the handle so the pickle is flushed to disk
fileo.close()

In [22]:
i = 0
for key in nods:
    print(key)
    print(nods[key])
    i+=1
    if i == 10:
        break

https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000558233
IPO4-205
https://www.ncbi.nlm.nih.gov/snp/rs777826971
NM_002397.5(MEF2C):c.860C>T (p.Ser287Leu)
https://www.ncbi.nlm.nih.gov/snp/rs781825074
NM_000489.5(ATRX):c.3218G>C (p.Ser1073Thr)
http://purl.obolibrary.org/obo/CHEBI_136208
phosphatidylethanolamine (P-18:0/18:3)
http://purl.obolibrary.org/obo/PR_O15079-1
syntaphilin isoform h1 (human)
http://purl.obolibrary.org/obo/PR_O43688-1
phospholipid phosphatase 2 isoform h1 (human)
https://www.ncbi.nlm.nih.gov/snp/rs5404
NM_000340.2(SLC2A2):c.594G>A (p.Thr198=)
http://purl.obolibrary.org/obo/CHEBI_29828
sulfonothioyl group
http://purl.obolibrary.org/obo/CHEBI_17410
pteridine-2,4,6,7-tetrol
http://www.ncbi.nlm.nih.gov/gene/731157
LOC731157


In [None]:
##see merge_machine_read for ntriples of nodelabels

## Kratom - seizures

In [221]:
nodeLabels['http://purl.obolibrary.org/obo/MONDO_0016227'] = 'seizure'

In [229]:
nx.neighbors(nx_graph, obo.HP_0001250)

<dict_keyiterator at 0x7f8999b47868>

In [227]:
seizure = obo.HP_0001250

In [233]:
KG_path_searches.get_bidirectional_shortest_path(nx_graph, obodict['kratom'][0], seizure, nodeLabels)

Searching for path from http://napdi.org/napdi_srs_imports:mitragyna_speciosa - http://purl.obolibrary.org/obo/HP_0001250
(['Mitragyna_speciosa', 'has component', 'Mitragynine', ''], ['http://napdi.org/napdi_srs_imports:mitragyna_speciosa', 'http://purl.obolibrary.org/obo/RO_0002180', 'http://purl.obolibrary.org/obo/CHEBI_6956', [{'predicate_key': '92919f4d7a39d7db676c369dd45a1c1c', 'weight': 0.0}]])
(['Mitragynine', {'entity_type': 'RELATIONS', 'label': 'Activation'}, 'digoxin', 'machine_read'], ['http://purl.obolibrary.org/obo/CHEBI_6956', 'http://purl.obolibrary.org/obo/RO_0002448', 'http://purl.obolibrary.org/obo/CHEBI_4551', [{'predicate_key': '923ec5bcc14b673e70e2b2df5375b60a', 'weight': 0.0, 'pmid': '30604191', 'timestamp': '2019 Apr', 'source_graph': 'machine_read', 'belief': 0.65}, {'predicate_key': '9fdebc8bbe18c72c594081e9ae378bfa', 'weight': 0.0, 'pmid': '30604191', 'timestamp': '2019 Apr', 'source_graph': 'machine_read', 'belief': 0.65}]])
(['digoxin', 'is substance that t

([['Mitragyna_speciosa', 'has component', 'Mitragynine', ''],
  ['Mitragynine',
   {'entity_type': 'RELATIONS', 'label': 'Activation'},
   'digoxin',
   'machine_read'],
  ['digoxin', 'is substance that treats', 'Seizure', '']],
 [['http://napdi.org/napdi_srs_imports:mitragyna_speciosa',
   'http://purl.obolibrary.org/obo/RO_0002180',
   'http://purl.obolibrary.org/obo/CHEBI_6956',
   [{'predicate_key': '92919f4d7a39d7db676c369dd45a1c1c', 'weight': 0.0}]],
  ['http://purl.obolibrary.org/obo/CHEBI_6956',
   'http://purl.obolibrary.org/obo/RO_0002448',
   'http://purl.obolibrary.org/obo/CHEBI_4551',
   [{'predicate_key': '923ec5bcc14b673e70e2b2df5375b60a',
     'weight': 0.0,
     'pmid': '30604191',
     'timestamp': '2019 Apr',
     'source_graph': 'machine_read',
     'belief': 0.65},
    {'predicate_key': '9fdebc8bbe18c72c594081e9ae378bfa',
     'weight': 0.0,
     'pmid': '30604191',
     'timestamp': '2019 Apr',
     'source_graph': 'machine_read',
     'belief': 0.65}]],
  ['http:
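The internals of `KG_path_searches.get_bidirectional_shortest_path` are not shown in this notebook. As a rough sketch of the general idea only, the block below runs `nx.bidirectional_shortest_path` on a toy `MultiDiGraph` and then expands each hop into every parallel edge with its attributes. The function name `shortest_path_triples`, the node names, and the use of the predicate as the edge key are assumptions for illustration.

```python
import networkx as nx

# Toy graph; names and attributes are made up to mirror the shape of the output above
toy = nx.MultiDiGraph()
toy.add_edge('kratom', 'mitragynine', key='has component', weight=0.0)
toy.add_edge('mitragynine', 'digoxin', key='activation',
             source_graph='machine_read', belief=0.65)
toy.add_edge('digoxin', 'seizure', key='treats', weight=0.0)
toy.add_edge('kratom', 'plant', key='subClassOf', weight=0.0)

def shortest_path_triples(graph, source, target):
    """Return (subject, edge key, object, attributes) for each edge on a shortest path."""
    nodes = nx.bidirectional_shortest_path(graph, source, target)
    triples = []
    for u, v in zip(nodes, nodes[1:]):
        # graph[u][v] maps each parallel edge key to its attribute dict
        for key, attrs in graph[u][v].items():
            triples.append((u, key, v, dict(attrs)))
    return triples

triples = shortest_path_triples(toy, 'kratom', 'seizure')
for triple in triples:
    print(triple)
```

Keeping the attribute dict alongside each hop makes it easy to see which edges come from machine reading and what belief score they carry.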

In [241]:
kpaths = KG_path_searches.get_k_simple_paths(nx_graph, obodict['kratom'][0], seizure, 5, 10, 3)

Searching for paths from http://napdi.org/napdi_srs_imports:mitragyna_speciosa - http://purl.obolibrary.org/obo/HP_0001250
[info] applying next operator to search for a simple path of max length 10
[info] Simple path found of length 10
[info] Simple path length greater than shortest path length (3) so adding to results
[info] applying next operator to search for a simple path of max length 10
[info] Simple path found of length 10
[info] Simple path length greater than shortest path length (3) so adding to results
[info] applying next operator to search for a simple path of max length 10
[info] Simple path found of length 10
[info] Simple path length greater than shortest path length (3) so adding to results
[info] applying next operator to search for a simple path of max length 10
[info] Simple path found of length 10
[info] Simple path length greater than shortest path length (3) so adding to results
[info] applying next operator to search for a simple path of max length 10
[info] Sim
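Judging from the log, the positional arguments to `get_k_simple_paths` appear to be the number of paths (5), the maximum path length (10), and the shortest path length (3); that reading is an assumption. A minimal sketch of that behaviour with `nx.all_simple_paths`, on a toy graph with made-up nodes (the helper `k_simple_paths` is hypothetical, not the notebook's implementation):

```python
import networkx as nx

# Toy directed graph for illustration only
toy = nx.DiGraph()
toy.add_edges_from([
    ('kratom', 'mitragynine'), ('mitragynine', 'digoxin'), ('digoxin', 'seizure'),
    ('kratom', 'plant'), ('plant', 'gene'), ('gene', 'digoxin'),
])

def k_simple_paths(graph, source, target, k, cutoff, shortest_len):
    """Collect up to k simple paths with more edges than the shortest path."""
    found = []
    for path in nx.all_simple_paths(graph, source, target, cutoff=cutoff):
        if len(path) - 1 > shortest_len:  # path length counted in edges
            found.append(path)
        if len(found) == k:
            break
    return found

paths = k_simple_paths(toy, 'kratom', 'seizure', k=5, cutoff=10, shortest_len=3)
print(paths)
```

`nx.all_simple_paths` is a depth-first search, so on a large graph the first paths it yields tend to sit at or near the cutoff, which is consistent with every path in the log having length 10.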

In [242]:
for path in kpaths:
    print('\n'.join([str(x) for x in path]))
    print('\n\n\n')

[('Mitragyna_speciosa', 'subClassOf', 'plant anatomical entity'), ('plant anatomical entity', 'only in taxon', 'Viridiplantae'), ('Viridiplantae', 'subClassOf', 'Eukaryota'), ('Eukaryota', 'has_part', 'genome of Eukaryota'), ('genome of Eukaryota', 'has_part', 'gene of Eukaryota'), ('gene of Eukaryota', 'subClassOf', 'gene'), ('gene', 'subClassOf', 'gene'), ('gene', 'directly negatively regulates activity of', 'rosoxacin'), ('rosoxacin', '', 'lipopolysaccharide'), ('lipopolysaccharide', 'is substance that treats', 'Seizure')]
[('Mitragyna_speciosa', 'subClassOf', 'plant anatomical entity'), ('plant anatomical entity', 'only in taxon', 'Viridiplantae'), ('Viridiplantae', 'subClassOf', 'Eukaryota'), ('Eukaryota', 'has_part', 'genome of Eukaryota'), ('genome of Eukaryota', 'has_part', 'gene of Eukaryota'), ('gene of Eukaryota', 'subClassOf', 'gene'), ('gene', 'subClassOf', 'gene'), ('gene', 'directly negatively regulates activity of', 'rosoxacin'), ('rosoxacin', 'positively regulates', 'l

In [239]:
kpaths = KG_path_searches.get_k_simple_paths(nx_graph, obodict['kratom'][0], seizure, 5, 20, 3)

Searching for paths from http://napdi.org/napdi_srs_imports:mitragyna_speciosa - http://purl.obolibrary.org/obo/HP_0001250
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (3) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (3) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (3) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Simple path found of length 20
[info] Simple path length greater than shortest path length (3) so adding to results
[info] applying next operator to search for a simple path of max length 20
[info] Sim

In [240]:
for path in kpaths:
    print('\n'.join([str(x) for x in path]))
    print('\n\n\n')

[('Mitragyna_speciosa', 'subClassOf', 'plant anatomical entity'), ('plant anatomical entity', 'only in taxon', 'Viridiplantae'), ('Viridiplantae', 'subClassOf', 'Eukaryota'), ('Eukaryota', 'has_part', 'genome of Eukaryota'), ('genome of Eukaryota', 'has_part', 'gene of Eukaryota'), ('gene of Eukaryota', 'subClassOf', 'gene'), ('gene', 'subClassOf', 'gene'), ('gene', 'directly negatively regulates activity of', 'apoptotic process'), ('apoptotic process', 'directly negatively regulates activity of', 'BCL2_gene'), ('BCL2_gene', 'directly negatively regulates activity of', 'apoptosis regulator BAX (human)'), ('apoptosis regulator BAX (human)', '', 'Cytochromes_c'), ('Cytochromes_c', '', 'Cytochromes_c'), ('Cytochromes_c', 'directly negatively regulates activity of', '(-)-epigallocatechin 3-gallate'), ('(-)-epigallocatechin 3-gallate', '', 'NOS3_protein,_human'), ('NOS3_protein,_human', 'Phosphorylation', 'calmodulin'), ('calmodulin', 'Phosphorylation', 'photon'), ('photon', '', 'calcium io

## Reweighting the KG

1. Fix subClassOf chemical entity
2. Possibly downweight subClassOf edges
3. Use the belief scores of the machine reading (MR) edges as weights?
4. Use centrality measures, e.g. node degree centrality
5. Fix the mapping of TEA in REACH/SemRep (it currently maps to triethylamine, CHEBI_35026)
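Items 2 and 3 could be prototyped along the following lines. This is a sketch under stated assumptions, not the notebook's method: the function `reweight`, the penalty values, the toy node names, and the use of the predicate URI as the edge key are all hypothetical. The attribute names `source_graph`, `belief`, and `weight` follow the edge metadata shown in the outputs above.

```python
import networkx as nx

SUBCLASS_OF = 'http://www.w3.org/2000/01/rdf-schema#subClassOf'

def reweight(graph, subclass_penalty=5.0, default_weight=1.0):
    """Set edge 'weight' in place: subClassOf and low-belief edges cost more."""
    for u, v, key, attrs in graph.edges(keys=True, data=True):
        if key == SUBCLASS_OF:
            attrs['weight'] = subclass_penalty
        elif attrs.get('source_graph') == 'machine_read':
            # Higher belief gives a lower cost; missing belief is treated as 0
            attrs['weight'] = default_weight + (1.0 - attrs.get('belief', 0.0))
        else:
            attrs['weight'] = default_weight

# Toy graph with made-up nodes: a 2-hop route through a subClassOf hub
# and a 3-hop route through a machine-read edge
toy = nx.MultiDiGraph()
toy.add_edge('np', 'hub', key=SUBCLASS_OF)
toy.add_edge('hub', 'outcome', key='related_to')
toy.add_edge('np', 'constituent', key='has component')
toy.add_edge('constituent', 'drug', key='activation',
             source_graph='machine_read', belief=0.65)
toy.add_edge('drug', 'outcome', key='treats')

unweighted = nx.shortest_path(toy, 'np', 'outcome')
reweight(toy)
weighted = nx.dijkstra_path(toy, 'np', 'outcome', weight='weight')
print(unweighted, weighted)
```

With these toy weights, the unweighted search takes the short route through the subClassOf hub, while the weighted search prefers the longer route through the machine-read edge. The actual penalty values would need tuning against the real graph.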

In [None]:
# The INDRA pathfinding module also searches in an nx MultiDiGraph:
# https://indra.readthedocs.io/en/latest/modules/explanation/pathfinding.html
# It uses the belief in the edge metadata (I think)