# Knowledge graph completion with PyKeen and Neo4j

## Integrate PyKeen library with Neo4j for multi-class link prediction using knowledge graph embedding models

A couple of weeks ago, I met Francois Vanderseypen, a Graph Data Science consultant. We decided to join forces and start a Graph Machine Learning blog series. This blog post will present how to perform knowledge graph completion, which is simply multi-class link prediction. Instead of just predicting whether a link exists, we are also trying to predict its type.

For knowledge graph completion, the underlying graph should contain multiple types of relationships. Otherwise, if you are dealing with only a single kind of relationship, you can use the standard link prediction techniques that do not consider the relationship type. The example visualization has only a single node type, but in practice, your input graph can consist of multiple node types as well.

We have to use the knowledge graph embedding models for a multi-class link prediction pipeline instead of plain node embedding models.
What's the difference, you may ask.
While node embedding models embed only nodes, the knowledge graph embedding models embed both nodes and relationships.

The standard syntax to describe the pattern is that the starting node is called head (h), the end or target node is referred to as tail (t), and the relationship is r.
The intuition behind the knowledge graph embedding model such as TransE is that the embedding of the head plus the relationship is close to the embedding of the tail if the relationship is present.

The predictions are then quite simple. For example, if you want to predict new relationships for a specific node, you just sum the node plus the relationship embedding and evaluate if any of the nodes are near the embedding sum.
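
To make the "head plus relationship is close to tail" idea concrete, here is a minimal sketch in plain Python. The two-dimensional embeddings and the names `compound_a`, `disease_x`, and `disease_y` are made up for illustration and are not trained values; the score is the negative Euclidean distance, which is one common way to write the TransE score, so a higher score means a closer match.

```python
import math

# Toy 2-dimensional embeddings, hand-picked for illustration (not trained).
entity_emb = {
    "compound_a": [1.0, 0.0],
    "disease_x": [1.0, 1.0],
    "disease_y": [3.0, 2.0],
}
relation_emb = {"treats": [0.0, 1.0]}


def transe_score(head, relation, tail):
    """Return the negative Euclidean distance between (h + r) and t."""
    h, r, t = entity_emb[head], relation_emb[relation], entity_emb[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


# Score every other entity as a candidate tail for (compound_a, treats, ?).
scores = {
    tail: transe_score("compound_a", "treats", tail)
    for tail in entity_emb
    if tail != "compound_a"
}
best_tail = max(scores, key=scores.get)
```

In this toy setup, `compound_a` plus `treats` lands exactly on `disease_x`, so that entity gets the highest score and would be the predicted tail.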

# Prepare the data in Neo4j Desktop

To follow along with this tutorial, I recommend you download the Neo4j Desktop application.

Once you have installed Neo4j Desktop, you can download the database dump and use it to restore a database instance: https://drive.google.com/file/d/1u34cFBYvBtdBsqOUPdmbcIyIt88IiZYe/view?usp=sharing

Our subset of the Hetionet graph contains genes, compounds, and diseases. There are many relationships between them, and you would probably need to be in the biomedical domain to understand them, so I won't go into details.
In our case, the most important relationship is the treats relationship between compounds and diseases. This blog post will use the knowledge graph embedding models to predict new treats relationships. You could think of this scenario as a drug repurposing task.

# PyKeen

PyKeen is an incredible, simple-to-use library that can be used for knowledge graph completion tasks.
Currently, it features 35 knowledge graph embedding models and even supports out-of-the-box hyper-parameter optimizations.
I like it due to its high-level interface, making it very easy to construct a PyKeen graph and train an embedding model.

# Transform a Neo4j graph to a PyKeen graph

Now we will move on to the practical part of this post.
First, we will transform the Neo4j graph to the PyKeen graph and split the train-test data. To begin, we have to define the connection to the Neo4j database.


In [1]:
# Define Neo4j connections
import pandas as pd
from neo4j import GraphDatabase

host = "bolt://localhost:7687"
user = "neo4j"
password = "letmein"
driver = GraphDatabase.driver(host, auth=(user, password))


def run_query(query, params=None):
    # Use None instead of a mutable default argument for the parameters
    with driver.session() as session:
        result = session.run(query, params or {})
        return pd.DataFrame([r.values() for r in result], columns=result.keys())

The `run_query` function executes a Cypher query and returns the output in the form of a Pandas dataframe. The PyKeen library has a `from_labeled_triples` method that takes a list of triples as an input and constructs a graph from it.


In [2]:
data = run_query(
    """
    MATCH (s)-[r]->(t)
    RETURN toString(id(s)) as source, toString(id(t)) AS target, type(r) as type
    """
)

In [3]:
data.head()

Unnamed: 0,source,target,type
0,0,12590,interacts
1,0,8752,interacts
2,0,7915,interacts
3,0,21711,interacts
4,0,6447,interacts


In [4]:
from pykeen.triples import TriplesFactory

tf = TriplesFactory.from_labeled_triples(
    data[["source", "type", "target"]].values,
    create_inverse_triples=False,
    entity_to_id=None,
    relation_to_id=None,
    compact_id=False,
    filter_out_candidate_inverse_relations=True,
    metadata=None,
)

This example has a generic Cypher query that can be used to fetch any Neo4j dataset and construct a PyKeen graph from it. Notice that we use the internal Neo4j ids of nodes to build the triples data frame. The `from_labeled_triples` method expects the triple elements to be strings (labels), so we simply cast the internal ids to string.
Now that we have our PyKeen graph, we can use the `split` method to perform the train-test data split.


In [5]:
training, testing, validation = tf.split([0.8, 0.1, 0.1])

using automatically assigned random_state=3324760580


It couldn't get any easier than this. I must congratulate the PyKeen authors for developing such a straightforward interface.
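
Notice the log line about an automatically assigned `random_state`: unless you fix the seed, a rerun can produce a different split. The following is only a rough sketch of a seeded ratio split in plain Python, not PyKeen's actual logic (which I am not reproducing here). The function name `split_triples` and the toy triples are my own assumptions for illustration.

```python
import random


def split_triples(triples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle triples with a fixed seed and cut them into consecutive parts."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for i, ratio in enumerate(ratios):
        # The last part takes whatever is left, so no triple is dropped.
        if i == len(ratios) - 1:
            end = len(shuffled)
        else:
            end = start + round(ratio * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    return parts


# Hypothetical toy triples in (head, relation, tail) order.
toy_triples = [(str(i), "interacts", str(i + 1)) for i in range(10)]
train, test, valid = split_triples(toy_triples)
```

Because the shuffle uses a fixed seed, calling `split_triples(toy_triples)` again returns the same three parts, which makes later experiments easier to compare.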

# Train a knowledge graph embedding model

Now that we have the train-test data available, we can go ahead and train a knowledge graph embedding model. We will use the RotatE model in this example. I am not that familiar with all the variations of the embedding models, but if you want to learn more, I would suggest the lecture by Jure Leskovec I linked above.
We won't perform any hyper-parameter optimization to keep the tutorial simple. I've chosen to use 20 epochs and defined the dimension size to be 512.


In [17]:
from pykeen.pipeline import pipeline

result = pipeline(
    training=training,
    testing=testing,
    validation=validation,
    model="RotatE",
    stopper="early",
    epochs=20,
    dimensions=512,
    random_seed=420,
)



Training epochs on cpu:   0%|          | 0/20 [00:00<?, ?epoch/s]

Training batches on cpu:   0%|          | 0/1756 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Currently automatic memory optimization only supports GPUs, but you're using a CPU. Therefore, the batch_size will be set to the default value.
INFO:pykeen.evaluation.evaluator:No evaluation batch_size provided. Setting batch_size to '32'.
INFO:pykeen.evaluation.evaluator:Evaluation took 2050.90s seconds
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 10.


INFO:pykeen.evaluation.evaluator:Currently automatic memory optimization only supports GPUs, but you're using a CPU. Therefore, the batch_size will be set to the default value.
INFO:pykeen.evaluation.evaluator:No evaluation batch_size provided. Setting batch_size to '32'.
INFO:pykeen.evaluation.evaluator:Evaluation took 2317.22s seconds
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 20.
INFO:pykeen.training.training_loop:=> loading checkpoint '/tmp/tmpx04jauw1'
INFO:pykeen.training.training_loop:=> loaded checkpoint '/tmp/tmpx04jauw1' stopped after having finished epoch 20
INFO:pykeen.evaluation.evaluator:Currently automatic memory optimization only supports GPUs, but you're using a CPU. Therefore, the batch_size will be set to the default value.
INFO:pykeen.evaluation.evaluator:No evaluation batch_size provided. Setting batch_size to '32'.


Evaluating on cpu:   0%|          | 0.00/56.2k [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 2556.08s seconds


# Multi-class link prediction

The PyKeen library supports multiple methods for multi-class link prediction.
You can find the top K predictions across the whole network, or you can be more specific: fix a particular head node and relationship type and check whether any new connections are predicted.

In this example, you will predict new treats relationships for the L-Asparagine compound. Because we used the internal node ids for mapping, we first have to retrieve the node id of L-Asparagine from Neo4j and input it into the prediction method.


In [18]:
from pykeen.models.predict import get_tail_prediction_df

compound_id = run_query(
    """
    MATCH (s:Compound)
    WHERE s.name = "L-Asparagine"
    RETURN toString(id(s)) as id
    """
)["id"][0]


df = get_tail_prediction_df(
    result.model, compound_id, "treats", triples_factory=result.training
)
print(df.head(5))

       tail_id tail_label     score  in_training
13279    13279        322 -7.856274        False
5671      5671      15821 -7.987956        False
3561      3561      13667 -8.157429        False
11437    11437      21700 -8.158304        False
17674    17674       7714 -8.204805        False
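
The scores are negative, and the table is sorted so that the highest (closest to zero) score comes first. The `in_training` column tells you whether a predicted triple was already part of the training data, which matters because only unseen triples count as new predictions. Here is a small sketch of that "rank, then drop known triples" step in plain Python. The first three scores are copied from the output above, while the id `99999`, its score, and the `known_tails` set are made up to show the filtering.

```python
# Scores copied from the output above; the last row is a made-up example of a
# tail that already appears in the training triples.
scored_tails = [
    ("15821", -7.987956),
    ("322", -7.856274),
    ("13667", -8.157429),
    ("99999", -7.100000),  # hypothetical id, not from the dataset
]
known_tails = {"99999"}  # assumption: already linked to the head in training


def top_new_predictions(scored, known, k):
    """Sort by score (higher is better) and keep the k best unseen tails."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in ranked if label not in known][:k]


top_two = top_new_predictions(scored_tails, known_tails, k=2)
```

Even though the hypothetical `99999` has the best score, it is skipped because it is already known, so `top_two` holds the labels `322` and `15821`.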


# Store predictions to Neo4j

For easier evaluation of the results, we will store the top five predictions back to Neo4j.


In [19]:
# Keep only predictions that are not already present in the training triples
candidate_nodes = df[~df["in_training"]].head(5)["tail_label"].to_list()

run_query(
    """
    MATCH (n)
    WHERE id(n) = toInteger($compound_id)
    UNWIND $candidates as ca
    MATCH (c)
    WHERE id(c) = toInteger(ca)
    MERGE (n)-[:PREDICTED_TREATS]->(c)
    """,
    {"compound_id": compound_id, "candidates": candidate_nodes},
)

# Inspect results


In [20]:
run_query(
    """
    MATCH (c:Compound)-[:PREDICTED_TREATS]->(d:Disease)
    RETURN c.name as compound, d.name as disease
    """
)

Unnamed: 0,compound,disease
0,L-Asparagine,Crohn's disease
1,L-Asparagine,hematologic cancer
2,L-Asparagine,colon cancer
3,L-Asparagine,stomach cancer
4,L-Asparagine,chronic obstructive pulmonary disease


# Explaining predictions

As far as I know, the knowledge graph embedding model is not that useful for explaining predictions. On the other hand, you could use the existing connections in the graph to present the information to a medical doctor and let them decide whether the predictions make sense.
For example, you could investigate direct and indirect paths between L-Asparagine and colon cancer with the following Cypher query.


In [22]:
run_query(
    """
    MATCH (c:Compound {name: "L-Asparagine"}),(d:Disease {name:"colon cancer"})
    WITH c,d
    MATCH p=AllShortestPaths((c)-[r:binds|regulates|interacts|upregulates|downregulates|associates*1..4]-(d))
    RETURN [n in nodes(p) | n.name] LIMIT 25
    """
)

Unnamed: 0,[n in nodes(p) | n.name]
0,"[L-Asparagine, ASRGL1, SSBP2, colon cancer]"
1,"[L-Asparagine, SLC38A3, PLXNA1, colon cancer]"
2,"[L-Asparagine, ASRGL1, NME1, colon cancer]"
3,"[L-Asparagine, SLC1A5, VEGFA, colon cancer]"
4,"[L-Asparagine, ASRGL1, GDF15, colon cancer]"
5,"[L-Asparagine, SLC1A5, FZD5, colon cancer]"
6,"[L-Asparagine, ASNS, CCNB1, colon cancer]"
7,"[L-Asparagine, ASNS, HSF1, colon cancer]"
8,"[L-Asparagine, SLC38A3, VEGFA, colon cancer]"
9,"[L-Asparagine, SLC1A5, OXCT1, colon cancer]"
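
If you are curious what an "all shortest paths" search does conceptually, here is a minimal breadth-first sketch in plain Python. It is not how Neo4j implements `AllShortestPaths`, and it ignores the relationship types and the hop limit used in the Cypher query. The edges in the toy graph are read off two rows of the result above; the helper name `all_shortest_paths` is my own.

```python
from collections import deque


def all_shortest_paths(adjacency, source, target):
    """Return every shortest path from source to target in an undirected graph."""
    # Breadth-first search that records, for each node, all predecessors
    # lying on some shortest path from the source.
    distance = {source: 0}
    predecessors = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                predecessors[neighbor] = [node]
                queue.append(neighbor)
            elif distance[neighbor] == distance[node] + 1:
                predecessors[neighbor].append(node)
    if target not in distance:
        return []

    # Walk the predecessor lists backwards to enumerate the paths.
    def build(node):
        if node == source:
            return [[source]]
        return [path + [node] for prev in predecessors[node] for path in build(prev)]

    return build(target)


# Toy undirected graph built from two of the paths returned above.
edges = [
    ("L-Asparagine", "ASRGL1"), ("ASRGL1", "SSBP2"), ("SSBP2", "colon cancer"),
    ("L-Asparagine", "ASNS"), ("ASNS", "CCNB1"), ("CCNB1", "colon cancer"),
]
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, []).append(b)
    adjacency.setdefault(b, []).append(a)

paths = all_shortest_paths(adjacency, "L-Asparagine", "colon cancer")
```

Both three-hop paths are returned, since they tie for the shortest length. In the real graph, the intermediate genes on such paths are what you would show to a domain expert.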
