The libraries needed to import the graph and work with it

In [1]:
!pip install -q networkx

In [2]:
import os
import os.path
import networkx as nx
import json
import urllib
import traceback
import sys
from itertools import islice
from rdflib import Graph, URIRef, BNode, Namespace, Literal
from rdflib.namespace import RDF, OWL
import smart_open
from node2vec import Node2Vec

The next two cells are the graph loading step. The graph is the output of the PheKnowLator process, including filtering out OWL semantics. The sources used in this version of the knowledge graph are listed on [this Wiki page](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources). The graph takes more than 15 GB of RAM once loaded.

In [5]:
GRAPHPATH = "PheKnowLator_full_InverseRelations_NotClosed_OWLNETS_Networkx_MultiDiGraph_REWEIGHT_NO_DISJOINT.gpickle"

In [6]:
# Reload the graph produced by the PheKnowLator workflow.
# Note: nx.read_gpickle was removed in NetworkX 3.0, so this cell assumes an older NetworkX release.
nx_mdg = nx.read_gpickle(GRAPHPATH)

The next cell defines several functions that will help to explain the results of path searches.

### Part III - how to embed nodes in the knowledge graph to try and fill in gaps

We're going to tackle link prediction as a supervised learning problem on top of node representations/embeddings. The embeddings are computed with the unsupervised node2vec algorithm. After obtaining embeddings, a binary classifier can be used to predict a link, or not, between any two nodes in the graph. Various hyperparameters could be relevant in obtaining the best link classifier.

There are four steps:

1. Obtain embeddings for each node
2. For each set of hyperparameters, train a classifier
3. Select the classifier that performs the best
4. Evaluate the selected classifier on unseen data to validate its ability to generalise

This part of the notebook has been adapted from a demo of the StellarGraph library (https://stellargraph.readthedocs.io/en/stable/index.html) for link prediction using the node2vec algorithm.

<a name="refs"></a>
**References:** 

[1] Node2Vec: Scalable Feature Learning for Networks. A. Grover, J. Leskovec. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016.

First, we derive a simple undirected graph that has only predicates (edges) indicative of interaction:

http://purl.obolibrary.org/obo/RO_0002434 - interacts with

http://purl.obolibrary.org/obo/RO_0002436 - molecularly interacts with

http://purl.obolibrary.org/obo/RO_0002435 - genetically interacts with

http://purl.obolibrary.org/obo/RO_0002437 - biotically interacts with

http://purl.obolibrary.org/obo/RO_0000056 - participates in

Note: We limit the graph size to a few tens of thousands of nodes (the full graph has >200K nodes) so that embedding can happen in a relatively short amount of time.

We also need to be choosy about the nodes we select because there are some more general nodes that have many thousands of interaction relationships.
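
The selection idea described above can be sketched on a toy edge list. This is a simplified illustration rather than the notebook's actual cell: the toy triples, the cap value, and the helper name `filter_interactions` are hypothetical, and plain strings stand in for `rdflib` URIs.

```python
from collections import Counter

# Predicates treated as "interaction" edges (taken from the list above).
INTERACTION_PREDICATES = {
    "http://purl.obolibrary.org/obo/RO_0002434",  # interacts with
    "http://purl.obolibrary.org/obo/RO_0002436",  # molecularly interacts with
    "http://purl.obolibrary.org/obo/RO_0002435",  # genetically interacts with
    "http://purl.obolibrary.org/obo/RO_0002437",  # biotically interacts with
    "http://purl.obolibrary.org/obo/RO_0000056",  # participates in
}


def filter_interactions(edges, max_per_subject=2):
    """Keep interaction edges, capping how many are taken per subject node.

    `edges` is an iterable of (subject, object, predicate) string triples.
    The function name and the cap are hypothetical, chosen for illustration.
    """
    taken = Counter()
    kept = []
    for s, o, p in edges:
        if p not in INTERACTION_PREDICATES:
            continue  # skip non-interaction predicates
        if taken[s] >= max_per_subject:
            continue  # this subject already contributed enough edges
        taken[s] += 1
        kept.append((s, o))
    return kept


toy_edges = [
    ("geneA", "chem1", "http://purl.obolibrary.org/obo/RO_0002434"),
    ("geneA", "chem2", "http://purl.obolibrary.org/obo/RO_0002436"),
    ("geneA", "chem3", "http://purl.obolibrary.org/obo/RO_0002434"),  # over the cap
    ("geneA", "GO_x", "http://www.w3.org/2000/01/rdf-schema#subClassOf"),  # not an interaction
    ("geneB", "chem1", "http://purl.obolibrary.org/obo/RO_0000056"),
]
kept_edges = filter_interactions(toy_edges)
print(kept_edges)
```

Capping the edges taken per subject keeps very general, highly connected nodes from dominating the sub-graph.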

In [26]:
nx_mdg_interacts = nx.Graph()

object_nodeD = {} # tracking object nodes so we can add only those that are the object of 2 or more interactions
maxNodes = 6000 # the max count of unique subject nodes
edge_key = -1 # an int that we will increment to uniquely identify edges
s_node_visitedD = {} # tracks unique subject nodes 
s_count = 0 # some entities have a huge number of interactions, so we consider only roughly the first 50 edges of any entity
s_current = None
for s, o, data in nx_mdg.edges(data=True):
    if not s_node_visitedD.get(s):
        s_node_visitedD[s] = 1
        if len(s_node_visitedD) == maxNodes:
            break
    
    if s == s_current:
        s_count += 1
        if s_count > 50:
            continue
    else:
        s_current = s
        s_count = 0 
        
    p = data['predicate']
    
    edge_key += 1
    
    if not (p.toPython() == 'http://purl.obolibrary.org/obo/RO_0002434' # interacts with
            or p.toPython() == 'http://purl.obolibrary.org/obo/RO_0002436' # molecularly interacts with
            or p.toPython() == 'http://purl.obolibrary.org/obo/RO_0002435' # genetically interacts with
            or p.toPython() == 'http://purl.obolibrary.org/obo/RO_0002437' # biotically interacts with 
            or p.toPython() == 'http://purl.obolibrary.org/obo/RO_0000056' # participates in
           ):
        continue    
    
    if object_nodeD.get(o):
        object_nodeD[o] += 1
        
        # add the edge to the graph giving it a unique key 
        nx_mdg_interacts.add_edge(s, o, **{'predicate': p,'key': edge_key})
        
    else:
        object_nodeD[o] = 1

In [27]:
nx.write_gpickle(nx_mdg_interacts,'PheKnowLator_full_InverseRelations_NotClosed_OWLNETS_INTERACTS.gpickle')

In [28]:
## Uncomment this to reload the 'interacts only' graph from file if needed
# nx_mdg_interacts = nx.read_gpickle('PheKnowLator_full_InverseRelations_NotClosed_OWLNETS_INTERACTS.gpickle')

The next few cells give some insight into the content of our interaction sub-graph

In [29]:
print(nx.info(nx_mdg_interacts))

Name: 
Type: Graph
Number of nodes: 30568
Number of edges: 150562
Average degree:   9.8510


In [30]:
i = 0 
for d in nx_mdg_interacts.degree():
    if d[1] > 2:        
        print(d)
        i += 1
        if i == 100:
            break

(rdflib.term.URIRef('https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000425941'), 3)
(rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_46024'), 427)
(rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_39867'), 1061)
(rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_23965'), 519)
(rdflib.term.URIRef('http://purl.obolibrary.org/obo/PR_P41002'), 13)
(rdflib.term.URIRef('http://purl.obolibrary.org/obo/PR_O95376'), 8)
(rdflib.term.URIRef('http://purl.obolibrary.org/obo/PR_000009875'), 5)
(rdflib.term.URIRef('http://purl.obolibrary.org/obo/PR_O95352'), 10)
(rdflib.term.URIRef('http://purl.obolibrary.org/obo/PR_Q9H765'), 13)
(rdflib.term.URIRef('http://purl.obolibrary.org/obo/GO_0098685'), 3)
(rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_3766'), 52)
(rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_33566'), 121)
(rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_31206'), 71)
(rdflib.term.URIRef('https://uswest.ensembl.org/Hom

In [31]:
# A sample of what this new graph looks like
i = 0
for s, o, data in nx_mdg_interacts.edges(data=True):
    if i == 200:
        break
    
    print('{}\t{}\t{}'.format(s,o,data))
    i += 1

https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000425941	http://purl.obolibrary.org/obo/CHEBI_46024	{'predicate': rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002434'), 'key': 218}
https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000425941	http://purl.obolibrary.org/obo/CHEBI_39867	{'predicate': rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002434'), 'key': 235}
https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000425941	http://purl.obolibrary.org/obo/CHEBI_23965	{'predicate': rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002434'), 'key': 236}
http://purl.obolibrary.org/obo/CHEBI_46024	https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000441709	{'predicate': rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002434'), 'key': 1096}
http://purl.obolibrary.org/obo/CHEBI_46024	https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000396475	{'predicate': rdflib.term.URIRef('h

#### Node2Vec

We use Node2Vec [[1]](#refs) to calculate node embeddings. These embeddings are learned in such a way as to ensure that nodes that are close in the graph remain close in the embedding space. Node2Vec first runs random walks on the graph to obtain context pairs, and then uses these pairs to train a Word2Vec model.
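
To make "context pairs" concrete, here is a minimal sketch of how (center, context) pairs could be read off a single walk with a fixed window. The helper name `context_pairs` and the toy walk are hypothetical, and real Word2Vec implementations add details (such as random window shrinking) that are omitted here.

```python
def context_pairs(walk, window_size):
    """Return (center, context) pairs for one walk, as used in skip-gram training."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window_size)
        hi = min(len(walk), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a node is not its own context
                pairs.append((center, walk[j]))
    return pairs


walk = ["a", "b", "c", "d"]  # a toy random walk over node ids
pairs = context_pairs(walk, window_size=1)
print(pairs)
```

A larger window makes nodes that are further apart on a walk count as context for each other.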

These are the parameters we can use:

* `p` - Random walk parameter "p" (the return parameter)
* `q` - Random walk parameter "q" (the in-out parameter)
* `dimensions` - Dimensionality of node2vec embeddings
* `num_walks` - Number of walks from each node
* `walk_length` - Length of each random walk
* `window_size` - Context window size for Word2Vec
* `num_iter` - Number of SGD iterations (epochs)
* `workers` - Number of workers for Word2Vec
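
The role of `p` and `q` can be illustrated with a small second-order random walk in plain Python. This is a sketch of the biased walk described in [[1]](#refs), not the `node2vec` package's implementation; the function name `biased_walk`, the toy adjacency dictionary, and the fixed seed are assumptions made for the example.

```python
import random


def biased_walk(adj, start, walk_length, p=1.0, q=1.0, rng=None):
    """Generate one node2vec-style walk over `adj` (dict: node -> set of neighbours)."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is repeatable
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbors = sorted(adj[cur])
        if not neighbors:
            break
        if len(walk) == 1:
            # first step: no previous node yet, so choose uniformly
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
walk = biased_walk(adj, "a", walk_length=6, p=1.0, q=2.0)
print(walk)
```

With `q > 1` the walk tends to stay near its starting neighbourhood, while `q < 1` pushes it outward; a small `p` makes immediate backtracking more likely.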

There are a few steps involved in using the model to perform link prediction:
1. We calculate link/edge embeddings for the positive and negative edge samples by applying a binary operator on the embeddings of the source and target nodes of each sampled edge.
2. Given the embeddings of the positive and negative examples, we train a logistic regression classifier to predict the probability indicating whether an edge between two nodes should exist or not.
3. To use the embeddings in a traditional machine learning classifier, we use binary operators on the embeddings such as average, Hadamard, L1, L2. Here we use the 'average operator' with node embeddings in the logistic regression classifier; the code below actually sums the two node embeddings, which differs from the average only by a constant factor of 2.
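
The binary operators mentioned above can be written out element-wise. The sketch below uses plain Python lists and toy two-dimensional vectors; the function names are hypothetical, and the definitions follow the operators described in [[1]](#refs).

```python
def average(u, v):
    """Element-wise mean of the two node embeddings."""
    return [(a + b) / 2 for a, b in zip(u, v)]


def hadamard(u, v):
    """Element-wise product of the two node embeddings."""
    return [a * b for a, b in zip(u, v)]


def weighted_l1(u, v):
    """Element-wise absolute difference."""
    return [abs(a - b) for a, b in zip(u, v)]


def weighted_l2(u, v):
    """Element-wise squared difference."""
    return [(a - b) ** 2 for a, b in zip(u, v)]


# Toy embeddings for the two end nodes of a candidate edge.
u = [1.0, 2.0]
v = [3.0, -2.0]

edge_features = {
    "average": average(u, v),
    "hadamard": hadamard(u, v),
    "l1": weighted_l1(u, v),
    "l2": weighted_l2(u, v),
}
print(edge_features)
```

All four operators are symmetric in `u` and `v`, which suits an undirected graph: the edge feature does not depend on which node is listed first.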


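The next cells apply node2vec to construct an embedding of the interaction sub-graph using 5 random walks per node with up to 10 node steps each. The constructor is given `dimensions=30`, but the later `fit` call passes `vector_size=200`, which appears to take precedence (the vectors printed below have more than 30 components). These hyperparameters are not tuned.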

In [32]:
node2vec = Node2Vec(nx_mdg_interacts, dimensions=30, walk_length=10, num_walks=5, workers=1)

Generating walks (CPU: 1): 100%|██████████| 5/5 [00:07<00:00,  1.46s/it]


In [33]:
model = node2vec.fit(window=10, min_count=1, batch_words=4, vector_size=200)

In [34]:
model.save('PheKnowLator_full_InverseRelations_NotClosed_OWLNETS_INTERACTS.model')

In [35]:
## Uncomment this and run it to reload the model from a file if you have already created it
# from gensim.models import Word2Vec
# model = Word2Vec.load('PheKnowLator_full_InverseRelations_NotClosed_OWLNETS_INTERACTS.model')

We can examine the embedding vectors for any node in the graph.

In [36]:
# the drug entinostat
model.wv.get_vector('http://purl.obolibrary.org/obo/CHEBI_132082')

array([-5.93670527e-04, -5.17316349e-03,  1.10559724e-03, -4.34594724e-04,
        3.10785999e-03,  2.64846301e-03, -2.34318012e-03, -2.80945818e-03,
       -3.97979235e-03, -1.83007820e-03, -3.55141994e-04, -2.85191782e-05,
       -2.26112944e-03, -3.46464943e-03,  3.44630890e-03,  1.93310273e-03,
       -2.52015167e-03, -8.02397495e-04, -2.38945824e-03,  2.79004732e-03,
       -1.87318027e-03,  7.45047932e-04, -2.90185725e-03,  2.36411448e-04,
       -3.76802962e-03,  2.61286553e-03, -2.85741151e-03,  1.00265106e-03,
       -5.24299452e-03, -2.61541409e-03,  3.05629661e-03,  8.82533088e-04,
        2.04425189e-03, -3.32299829e-03, -8.91144795e-04, -8.20668647e-04,
        5.66369528e-03, -5.66688941e-05, -1.31841877e-03,  3.84059455e-03,
        2.41078739e-03, -5.86752663e-04, -2.77923793e-03, -1.28730072e-03,
        4.38222801e-03, -3.47604160e-03,  4.44684643e-03,  3.64490552e-03,
        3.61964875e-03,  3.83566995e-03,  3.26912547e-03, -4.83042514e-03,
       -1.92926941e-03, -

In [37]:
model.wv.most_similar('http://purl.obolibrary.org/obo/CHEBI_132082')

[('http://purl.obolibrary.org/obo/PR_Q15436', 0.2880791425704956),
 ('http://purl.obolibrary.org/obo/CHEBI_39867', 0.2801530063152313),
 ('http://purl.obolibrary.org/obo/GO_0007275', 0.27965760231018066),
 ('http://purl.obolibrary.org/obo/PR_P63167', 0.2755066752433777),
 ('https://www.ncbi.nlm.nih.gov/gene/9734', 0.27121084928512573),
 ('https://www.ncbi.nlm.nih.gov/gene/28984', 0.2696928083896637),
 ('https://www.ncbi.nlm.nih.gov/gene/4726', 0.2615227699279785),
 ('http://purl.obolibrary.org/obo/CHEBI_45716', 0.2602190673351288),
 ('https://www.ncbi.nlm.nih.gov/gene/9500', 0.2584768533706665),
 ('https://www.ncbi.nlm.nih.gov/gene/9081', 0.2555481791496277)]

In [38]:
# positive regulation of nervous system development
model.wv.get_vector('http://purl.obolibrary.org/obo/GO_0051962')

array([-2.4946450e-04, -2.2245706e-03,  3.8163955e-03, -4.4157868e-03,
       -2.6619053e-03, -4.1387943e-03,  3.5243883e-04, -4.0848114e-04,
       -3.0815841e-03,  3.6395199e-03, -2.0938814e-03, -5.0422843e-03,
        1.1586120e-03, -5.0233668e-03,  4.5230807e-04,  2.3617669e-05,
       -4.0696347e-03,  4.4757947e-03, -3.2621042e-03, -5.0370586e-03,
        3.9223293e-03,  2.7220657e-03,  8.6980406e-04,  7.2830204e-05,
       -1.9545886e-03,  1.4149619e-04, -9.5717754e-04, -1.6380666e-03,
        4.6243938e-03,  3.1584823e-03, -4.0717321e-03,  3.4332299e-03,
        1.4038777e-03, -1.7748680e-04,  6.7058060e-04, -2.0902080e-03,
        4.2528696e-03, -6.6892640e-04, -1.4804092e-03, -3.8500098e-03,
       -1.5967677e-03, -4.4460180e-03,  4.8620580e-04,  5.0436164e-04,
       -4.2818976e-03,  3.2651313e-03,  4.1220891e-03,  4.8834383e-03,
       -1.4177533e-03,  3.8324569e-03, -3.8837799e-04,  3.4165748e-03,
       -1.0933253e-03, -1.8316426e-03, -1.0094808e-03, -1.2022591e-03,
      

We can also look up the nodes most similar to a given node in the embedding space.

In [39]:
model.wv.most_similar('http://purl.obolibrary.org/obo/GO_0051962')

[('https://www.ncbi.nlm.nih.gov/gene/7436', 0.2810508608818054),
 ('http://purl.obolibrary.org/obo/PR_000009232', 0.27281567454338074),
 ('http://purl.obolibrary.org/obo/PR_000000123', 0.2670929431915283),
 ('http://purl.obolibrary.org/obo/GO_0048856', 0.25589990615844727),
 ('http://purl.obolibrary.org/obo/PR_000007253', 0.25027763843536377),
 ('https://www.ncbi.nlm.nih.gov/gene/29126', 0.24317747354507446),
 ('http://purl.obolibrary.org/obo/GO_0006767', 0.23713308572769165),
 ('http://purl.obolibrary.org/obo/PR_P29323', 0.23678505420684814),
 ('http://purl.obolibrary.org/obo/PR_Q99880', 0.2366356998682022),
 ('https://www.ncbi.nlm.nih.gov/gene/10769', 0.23162728548049927)]
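
The scores returned by `most_similar` are cosine similarities between embedding vectors. As a minimal sketch of that ranking, assuming toy two-dimensional vectors and hypothetical helper names (`cosine_similarity`, `rank_similar`):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_similar(query, vectors, topn=3):
    """Rank all other nodes by cosine similarity to `query`, highest first."""
    scores = [
        (name, cosine_similarity(vectors[query], vec))
        for name, vec in vectors.items()
        if name != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


# Toy embeddings keyed by hypothetical node ids.
toy_vectors = {
    "n1": [1.0, 0.0],
    "n2": [2.0, 0.0],   # same direction as n1
    "n3": [0.0, 1.0],   # orthogonal to n1
    "n4": [-1.0, 0.0],  # opposite direction to n1
}
ranked = rank_similar("n1", toy_vectors)
print(ranked)
```

Cosine similarity ignores vector length, so it ranges from -1 to 1; the scores around 0.25 to 0.29 reported above indicate only weak alignment between the nearest neighbours.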

In [40]:
# positive=purine ribonucleoside binding and negative=positive regulation of nervous system development
model.wv.most_similar(positive=['http://purl.obolibrary.org/obo/GO_0032550'],negative=['http://purl.obolibrary.org/obo/GO_0051962'])

[('http://purl.obolibrary.org/obo/PR_P47992', 0.277768611907959),
 ('https://www.ncbi.nlm.nih.gov/gene/6352', 0.27562808990478516),
 ('http://purl.obolibrary.org/obo/PR_000008875', 0.27455294132232666),
 ('http://purl.obolibrary.org/obo/PR_P55081', 0.2608744502067566),
 ('http://purl.obolibrary.org/obo/PR_O14638', 0.2570715844631195),
 ('http://purl.obolibrary.org/obo/GO_1901215', 0.24670900404453278),
 ('http://purl.obolibrary.org/obo/PR_P13569', 0.24514862895011902),
 ('http://purl.obolibrary.org/obo/GO_0097242', 0.24114495515823364),
 ('http://purl.obolibrary.org/obo/PR_Q8IV08', 0.24086478352546692),
 ('http://purl.obolibrary.org/obo/PR_000007694', 0.23945611715316772)]

The next cells train a model for link prediction using the embeddings. We start by building an adjacency matrix and then traversing it to find the positions of the zeros which represent node pairs that are not connected in the graph.  

The steps shown here follow the same procedure used in this [nice tutorial by Prateek Joshi](https://www.analyticsvidhya.com/blog/2020/01/link-prediction-how-to-predict-your-future-connections-on-facebook/) but applied to the interaction sub-graph of the PheKnowLator graph.  
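
The split into connected and unconnected node pairs can also be sketched without a dense adjacency matrix. The example below is an alternative illustration on a toy graph, not the cell that follows; `split_pairs` and the toy node and edge lists are hypothetical.

```python
from itertools import combinations


def split_pairs(nodes, edges):
    """Split all unordered node pairs into connected and unconnected lists."""
    # frozenset makes (u, v) and (v, u) compare equal for an undirected graph
    edge_set = {frozenset(e) for e in edges}
    connected, unconnected = [], []
    for u, v in combinations(nodes, 2):  # each unordered pair exactly once
        if frozenset((u, v)) in edge_set:
            connected.append((u, v))
        else:
            unconnected.append((u, v))
    return connected, unconnected


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("c", "b")]
connected, unconnected = split_pairs(nodes, edges)
print(connected)
print(unconnected)
```

For `n` nodes there are `n * (n - 1) / 2` unordered pairs, so on a sparse graph with tens of thousands of nodes the unconnected list is by far the larger of the two.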

In [41]:
nodelist = list(nx_mdg_interacts.nodes())
initial_node_count = nx.number_of_nodes(nx_mdg_interacts)

In [42]:
# build adjacency matrix
adj_G = nx.to_numpy_matrix(nx_mdg_interacts,nodelist)

In [43]:
adj_G.shape

(30568, 30568)

In [44]:
# get connected and unconnected node pairs
# (the unconnected list grows to roughly n*(n-1)/2 entries, which is very large for ~30K nodes)
all_unconnected_pairs = []
all_connected_pairs = []

# traverse the upper triangle of the adjacency matrix so each undirected pair is visited once
n_nodes = adj_G.shape[0]
for i in range(n_nodes):
  for j in range(i + 1, n_nodes):
    if adj_G[i,j] == 0:
      all_unconnected_pairs.append((nodelist[i],nodelist[j]))
    else:
      all_connected_pairs.append((nodelist[i],nodelist[j]))

In [None]:
import pandas as pd

# drop any empty (None) slots before building the dataframes
all_connected_pairs_clean = [x for x in all_connected_pairs if x is not None]
node_1_linked = [i[0] for i in all_connected_pairs_clean]
node_2_linked = [i[1] for i in all_connected_pairs_clean]
original_g_df = pd.DataFrame({'node_1':node_1_linked, 
                              'node_2':node_2_linked})

all_unconnected_pairs_clean = [x for x in all_unconnected_pairs if x is not None]
node_1_unlinked = [i[0] for i in all_unconnected_pairs_clean]
node_2_unlinked = [i[1] for i in all_unconnected_pairs_clean]
data = pd.DataFrame({'node_1':node_1_unlinked, 
                     'node_2':node_2_unlinked})

# add target variable 'link'
data['link'] = 0

In [None]:
import pickle
with open('original_g_df.pickle','wb') as f:
    pickle.dump(original_g_df,f)

with open('data.pickle','wb') as f:
    pickle.dump(data,f)

In [None]:
## Use this cell to reload if needed
#import pickle
#f = open('original_g_df.pickle','rb')
#original_g_df = pickle.load(f)
#f.close()

#f = open('data.pickle','rb')
#data = pickle.load(f)
#f.close()

In [None]:
G_orig = nx.from_pandas_edgelist(original_g_df, "node_1", "node_2", create_using=nx.Graph())
initial_node_count = nx.number_of_nodes(G_orig)
print('initial_node_count: {}'.format(initial_node_count))

In [None]:
original_g_df_temp = original_g_df.copy()
G_temp = nx.from_pandas_edgelist(original_g_df_temp.drop(2), "node_1", "node_2", create_using=nx.Graph())
temp_node_count = nx.number_of_nodes(G_temp)
print('temp_node_count: {}'.format(temp_node_count))

In [None]:
original_g_df_temp = original_g_df.copy()

# empty list to store removable links
omissible_links_index = []

ctr = 0
ctr2 = 0
for i in original_g_df.index.values:
    # remove a node pair and build a new graph
    G_temp = nx.from_pandas_edgelist(original_g_df_temp.drop(i), "node_1", "node_2", create_using=nx.Graph())
    
    #print('len(G_temp.nodes): {}'.format(nx.number_of_nodes(G_temp)))
    
    ctr2 += 1
    if ctr2 % 1000 == 0:
        print(str(ctr2))
    
    # treat the link as omissible only if removing it leaves the node count unchanged
    if (nx.number_of_nodes(G_temp) == initial_node_count):                                                            
        omissible_links_index.append(i)
        original_g_df_temp = original_g_df_temp.drop(index = i)
        ctr += 1
        if ctr % 500 == 0:
            print('[info] Count of omissible links: {}'.format(ctr))
        if ctr == 2000:
            break

In [None]:
len(omissible_links_index)

In [None]:
with open('omissible_links_index.pickle','wb') as f:
    pickle.dump(omissible_links_index,f)

In [None]:
# create dataframe of removable edges (copy so we do not modify a view of original_g_df)
original_g_df_ghost = original_g_df.loc[omissible_links_index].copy()

# add the target variable 'link'
original_g_df_ghost['link'] = 1

# Reduce the dataset of non-connected node pairs to 4K so that there is a 1:2 ratio of connected to non-connected pairs
# The reduced dataset is a random sample without replacement
data_reduced = data.sample(4000)

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
data_reduced = pd.concat([data_reduced, original_g_df_ghost[['node_1', 'node_2', 'link']]], ignore_index=True)

In [None]:
model.wv.get_vector('http://purl.obolibrary.org/obo/CHEBI_2430')

In [None]:
# edge feature = element-wise sum of the embeddings of the two nodes in each pair
x = [model.wv.get_vector(i.toPython()) + model.wv.get_vector(j.toPython()) for i,j in zip(data_reduced['node_1'], data_reduced['node_2'])]

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
import numpy as np
xtrain, xtest, ytrain, ytest = train_test_split(np.array(x), 
                                                data_reduced['link'], 
                                                test_size = 0.3, 
                                                random_state = 35)

Two baseline classifiers against which to compare the logistic regression (LR) that uses embedding features:

In [None]:
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(xtrain, ytrain)
predictions = dummy_clf.predict_proba(xtest)
roc_auc_score(ytest, predictions[:,1])

In [None]:
dummy_clf = DummyClassifier(strategy="stratified")
dummy_clf.fit(xtrain, ytrain)
predictions = dummy_clf.predict_proba(xtest)
roc_auc_score(ytest, predictions[:,1])

In [None]:
yhat = dummy_clf.predict(xtest)
d_f1 = f1_score(ytest, yhat)
print('F1: {}'.format(d_f1))
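
As a reference point for reading these baselines: a classifier that gives every example the same score, as the `most_frequent` strategy does, has a ROC AUC of 0.5. The sketch below computes ROC AUC directly from its pairwise definition on toy data; the helper name `roc_auc`, the labels, and the scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1, 0, 1]

# A constant scorer cannot rank any positive above any negative.
constant_scores = [0.0] * len(labels)
auc_constant = roc_auc(labels, constant_scores)

# A scorer that mostly ranks positives above negatives.
informative_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
auc_informative = roc_auc(labels, informative_scores)

print(auc_constant, auc_informative)
```

An embedding-based classifier is only informative to the extent that its AUC clearly exceeds this 0.5 level.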

Logistic regression (LR) using the node2vec embeddings:

In [None]:
lr = LogisticRegression(class_weight="balanced")
lr.fit(xtrain, ytrain)

In [None]:
predictions = lr.predict_proba(xtest)

In [None]:
roc_auc_score(ytest, predictions[:,1])

In [None]:
yhat = lr.predict(xtest)
lr_f1 = f1_score(ytest, yhat)
print('F1: {}'.format(lr_f1))

In [None]:
%matplotlib inline

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

lr_precision, lr_recall, _ = precision_recall_curve(ytest, predictions[:,1])

plt.plot(lr_recall, lr_precision, marker='.', label='Logistic')

# axis labels
plt.xlabel('Recall')
plt.ylabel('Precision')
# show the legend
plt.legend()
# show the plot
plt.show()


### Question 10: 

List 3 applications of network embeddings in biomedicine. For each embedding application, name an ontology that you believe would be useful in constructing the knowledge graph that would be used to create the embeddings.

### Answer: 
1. How genes and disease pathways are related - GO (Gene Ontology)
2. Transcriptional regulation and drug targets for different pathways - BioPAX
3. How an organism's gene functions are related to other organisms' gene functions - GO (Gene Ontology)


### Question 11: 

Node embedding methods (such as node2vec and DeepWalk) use node features to embed graph information into vector space. Explain ‘node features’ with a few examples.

### Answer: 
If we are trying to find protein-protein interactions, the nodes are proteins, and node features include properties of the proteins such as surface accessibility, sequence conservation, and residue properties like hydrophobicity and charge.
The same can be applied to TF-pathway relationships, where node features are TF affinity or gene expression.

### Question 12: 

List a few data sources used to create the kg-covid-19 knowledge graph and find examples of triples that are included in the graph. 

### Answer: 
Databases: DrugCentral, the Pharmacogenomics Knowledgebase (PharmGKB), the Therapeutic Target Database (TTD), and ChEMBL;
functional annotations and synonyms for coronavirus genes and proteins from the Gene Ontology (GO);
and protein interaction data from STRING and the IntAct Molecular Interaction Database.
kg-covid-19 also has data from COVID-19 scientific publications on concepts such as Gene Ontology (GO) terms,
UniProt Knowledgebase (UniProtKB) proteins,
National Center for Biotechnology Information (NCBI) and HUGO Gene Nomenclature Committee (HGNC) genes,
and ChEMBL IDs via SciBite annotations.

Triples: 
chemical-subject-drug <br />
protein-relation-molecule

### Question 13: 

Current graph representation learning (GRL) methods are not able to embed temporal graphs. Find an application in biomedicine where temporal graphs would be useful for representing knowledge.

### Answer: 
Applications related to ventricular and arterial pulse or heartbeat need temporal graphs to represent knowledge, as these signals vary with time.


### Question 14: 

In this vignette, did we do link prediction on a homogeneous or heterogeneous graph?

### Answer: 
We have to do link prediction on heterogeneous graphs, as it involves multiple features.

### Question 15: 

Given the classifier's performance, would you trust it to predict novel links that are currently not in the graph? Please explain your answer.

### Answer: 
If its performance or accuracy is high, I would partially trust the predicted links that are not in the graph. These new links can be potential connections or relationships between the nodes. They would then be subjected to clinical or wet lab experiments to confirm that such a relationship exists. In this way, link prediction can lead to new discoveries in the biomedical sector.

### Question 16: 

 We did not tune any hyperparameters for the embedding. What options would you be interested in trying out?

### Answer: 
walk_length  - walk length 
num_walks - number of random walks

### Question 17: 

Some researchers have experimented with embedding knowledge graph nodes in hyperbolic geometry. Do some searching about this topic and write a sentence or two on what you find to be the main motivation of researchers for trying this out. Include citations.

### Answer: 
In particular for hierarchical graphs, hyperbolic embeddings can retain graph distances and complicated interactions in only a few dimensions.

### Question 18: 

Write any comments or questions that you have about the embedding exercise in this notebook.

### Answer: 
I recently came across the GraphSAGE algorithm. What advantages do GNNs and GraphSAGE have over node2vec or other node embeddings, and over basic network analysis using centrality measures?

**ALL DONE**