# Knowledge Graph Completion

Knowledge Graph Completion (KGC) is the process of enhancing or enriching a knowledge graph by predicting and filling in missing information (links or nodes). A knowledge graph represents relationships between entities in the form of a graph, where entities are nodes and relationships are edges. However, real-world knowledge graphs are often incomplete, with many potential links missing. KGC aims to infer these missing links or facts by leveraging existing data and patterns in the graph.

### Key Concepts:

-   Entities: Represent real-world objects (e.g., people, places, organizations).
-   Relations: The connections between entities (e.g., "works_for", "located_in").
-   Triples: The fundamental unit of a knowledge graph, typically in the form of $(h, r, t)$, where:
    -   $ h $ (head) is the starting entity.
    -   $ r $ (relation) describes the type of connection.
    -   $ t $ (tail) is the target entity.
    -   Example: $(\text{Einstein}, \text{developed}, \text{Theory of Relativity})$.

### Example of Completion:

Given a query triple whose tail entity is missing:

-   $(\text{Paris}, \text{capital\_of}, ?)$

KGC would predict that the missing entity is "France."
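
As a minimal sketch, a knowledge graph can be held in memory as a set of (head, relation, tail) tuples, and a completion query is a triple with one slot left open. The entity names and the `match` helper below are illustrative assumptions, not part of any library. A lookup like this only finds facts that are already stored; a KGC model is what scores candidates when the lookup comes back empty.

```python
# Toy knowledge graph: each fact is a (head, relation, tail) triple.
triples = {
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("France", "located_in", "Europe"),
}


def match(triples, head=None, relation=None, tail=None):
    """Return stored triples that agree with every slot that is not None."""
    return sorted(
        (h, r, t)
        for h, r, t in triples
        if head in (None, h) and relation in (None, r) and tail in (None, t)
    )


# (Paris, capital_of, ?) is answered directly because the fact is stored.
known = match(triples, head="Paris", relation="capital_of")
# (Germany, located_in, ?) is not stored, so a completion model would have to infer it.
missing = match(triples, head="Germany", relation="located_in")
```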

### Techniques for Knowledge Graph Completion:

1. Embedding-Based Methods:
    - Embed entities and relations as vectors in a continuous space and score candidate triples with them.
    - Models: TransE, TransH, DistMult, ComplEx.
2. Rule-Based Approaches:
    - Learn logical rules from the graph to infer missing links.
    - Example: If $A \rightarrow B$ and $B \rightarrow C$, then $A \rightarrow C$.
3. Graph Neural Networks (GNNs):
    - Leverage neural networks to learn from the graph structure.
4. Probabilistic Models:
    - Predict links based on statistical and probabilistic relationships.
5. Neuro-symbolic Approaches:
    - Combine symbolic reasoning with neural networks for better interpretability.
6. Large Language Models (LLMs):
    - Use LLMs to extract triples from text or predict missing relations directly.
    - Techniques: Fine-tuning on graph data, prompt-based triple prediction, embedding extraction.
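
To make the rule-based item above concrete, here is a minimal sketch of applying the transitivity rule ($A \rightarrow B$ and $B \rightarrow C$ imply $A \rightarrow C$) to a toy graph. The rule is written by hand here rather than learned from the graph, and the function name and example facts are assumptions made for illustration.

```python
def apply_transitive_rule(triples, relation):
    """Add (a, relation, c) whenever (a, relation, b) and (b, relation, c) hold.

    The rule is applied repeatedly until no new fact appears, and only the
    newly inferred triples are returned.
    """
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        pairs = [(h, t) for h, r, t in facts if r == relation]
        for a, b in pairs:
            for b2, c in pairs:
                if b == b2 and a != c and (a, relation, c) not in facts:
                    facts.add((a, relation, c))
                    changed = True
    return facts - set(triples)


toy_graph = {
    ("Montmartre", "located_in", "Paris"),
    ("Paris", "located_in", "France"),
    ("France", "located_in", "Europe"),
}
inferred = apply_transitive_rule(toy_graph, "located_in")
```

Every inferred triple can be traced back to the rule and the facts that triggered it, which is the main appeal of rule-based completion compared with embedding models.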

### Applications of KGC:

-   Recommendation Systems: Predict user preferences by completing user-item interaction graphs.
-   Healthcare: Fill missing links in biomedical knowledge graphs for drug discovery.
-   Search Engines: Enhance search capabilities by improving knowledge graph coverage.
-   Finance: Predict links between entities for fraud detection or market analysis.


# PyKEEN and Neo4j for multi-class link prediction using knowledge graph embedding models

This notebook demonstrates how to perform knowledge graph completion, focusing on multi-class link prediction. Unlike standard link prediction, which predicts the existence of a link, multi-class link prediction also classifies the type of relationship between entities.

To apply this method, the knowledge graph must contain multiple relationship types. If the graph has only one type of relationship, standard [link prediction techniques](https://towardsdatascience.com/a-deep-dive-into-neo4j-link-prediction-pipeline-and-fastrp-embedding-algorithm-bf244aeed50d) or alternative approaches that do not require relationship classification may be more suitable.

For multi-class link prediction, we employ knowledge graph embedding models rather than traditional node embedding models. The key distinction lies in their scope:

-   Node embedding models generate embeddings solely for nodes.
-   Knowledge graph embedding models create embeddings for both nodes and relationships.

In knowledge graph embedding, the conventional notation is:

-   $h$ – head (starting node)
-   $r$ – relationship (edge)
-   $t$ – tail (target node)

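In translational models such as TransE, the core idea is that if a relationship exists between two nodes, the embedding of the head node ($h$) plus the embedding of the relationship ($r$) should approximate the embedding of the tail node ($t$): $h + r \approx t$. Other models, including the RotatE model trained later in this notebook, replace the addition with a different operation (a rotation in complex space for RotatE), but the scoring principle is the same.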

Prediction follows intuitively from this principle. To infer new relationships for a node, sum the node’s embedding with the embedding of a candidate relationship, then evaluate which nodes are closest to the result in the embedding space.
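
The $h + r \approx t$ idea and the ranking step can be sketched in a few lines of plain Python. The two-dimensional embeddings below are hand-picked assumptions for illustration, not the output of a trained model, and `rank_tails` is a hypothetical helper name.

```python
import math

# Hypothetical 2-dimensional embeddings, chosen by hand for illustration.
entity_emb = {
    "Paris": (1.0, 0.0),
    "France": (1.0, 1.0),
    "Berlin": (3.0, 0.0),
    "Germany": (3.0, 1.1),
}
relation_emb = {"capital_of": (0.0, 1.0)}


def rank_tails(head, relation):
    """Rank every other entity by the distance ||h + r - t||, closest first."""
    h, r = entity_emb[head], relation_emb[relation]
    target = tuple(hi + ri for hi, ri in zip(h, r))
    distances = {
        name: math.dist(target, t) for name, t in entity_emb.items() if name != head
    }
    return sorted(distances, key=distances.get)


# Candidate tails for (Berlin, capital_of, ?), best candidate first.
ranking = rank_tails("Berlin", "capital_of")
```

In a trained model the embeddings are learned so that true triples end up with small distances, and the same ranking step produces the predictions.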


# Neo4j Desktop

To follow along with this notebook, ensure that the Neo4j Desktop application is installed.

Neo4j is a leading graph database platform designed to efficiently store, manage, and query data that is highly interconnected. Unlike traditional relational databases, Neo4j uses a property graph model where data is represented as nodes (entities), edges (relationships), and properties (attributes). This structure makes Neo4j particularly well-suited for applications involving complex relationships, such as knowledge graphs, social networks, fraud detection, and recommendation systems.

Key Features of Neo4j:

-   Native Graph Storage and Processing – Optimized for handling graph data directly.
-   Cypher Query Language – A powerful, declarative language designed for graph traversal and pattern matching.
-   Scalability and Performance – Handles large-scale data while maintaining high performance for complex queries.
-   Visualization Tools – Enables intuitive visualization of graph structures and relationships.
-   Extensibility – Supports integration with various tools, plugins, and programming languages.

After installing Neo4j Desktop, download a database dump and restore it to create a database instance. For this demonstration, you can use a [subset of Hetionet](https://drive.google.com/file/d/1u34cFBYvBtdBsqOUPdmbcIyIt88IiZYe/view?usp=sharing), a knowledge graph containing information on genes, compounds, and diseases. The dataset includes many relationship types between entities, primarily within the biomedical domain. While a deep understanding of these relationships may require domain expertise, we will focus on the `treats` relationship, which links compounds to diseases.

In this notebook, we will apply knowledge graph embedding models to predict new treats relationships, simulating a drug repurposing task – identifying potential new uses for existing drugs based on inferred connections within the graph.

# PyKEEN

[PyKEEN](https://pykeen.readthedocs.io/en/stable/index.html) (Python Knowledge Embedding Engine) is an open-source Python library for training and evaluating knowledge graph embedding (KGE) models. It simplifies the application of machine learning to knowledge graphs by providing pre-implemented models, training pipelines, and evaluation tools.  

## Key Features:  

- Model Support: Includes models like TransE, DistMult, ComplEx, and RotatE.  
- Modular and Customizable: Easily mix, match, and extend components.  
- Automation: Handles hyperparameter tuning, training, and evaluation.  
- Benchmarking: Compares model performance across datasets.  
- Visualization: Offers tools to analyze embeddings and results.  

## Use Cases:  

- Knowledge Graph Completion: Predict missing links and relationship types.  
- Link Prediction: Identify potential connections between entities.  
- Node Classification: Infer properties or classes for nodes.  
- Recommender Systems: Suggest related items by predicting links.  
- Biomedical and NLP Applications: Predict interactions, enhance search, and retrieve information.  

PyKEEN is widely used in fields like healthcare, social networks, and recommendation systems for knowledge graph analysis and enrichment.

# Imports

In [1]:
import pandas as pd
from neo4j import GraphDatabase
from pykeen import predict
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

# Neo4j connection

In [None]:
# Define the Neo4j connection; replace these credentials with those of your own database
host = "bolt://localhost:7687"
user = "neo4j"
password = "123456798"
driver = GraphDatabase.driver(host, auth=(user, password))

# Auxiliary functions

In [1]:
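def run_query(query, params=None):
    """Run a Cypher query and return the result as a pandas DataFrame."""
    with driver.session() as session:
        # Avoid a mutable default argument; fall back to an empty parameter map
        result = session.run(query, params or {})
        return pd.DataFrame([r.values() for r in result], columns=result.keys())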

# Transform a Neo4j graph to a PyKEEN graph

Now we will move on to the practical part of this notebook.
First, we will transform the Neo4j graph to a PyKEEN graph and split it into training, testing, and validation sets. The connection to the Neo4j database is already defined above.

The `run_query` function executes a Cypher query and returns the output in the form of a Pandas dataframe. The PyKEEN library has a `TriplesFactory.from_labeled_triples` method that takes an array of triples as input and constructs a graph from it.


In [2]:
data = run_query(
    """
    MATCH (s)-[r]->(t)
    RETURN toString(id(s)) AS source, toString(id(t)) AS target, type(r) AS type
    """
)

In [None]:
data.head()

In [4]:
tf = TriplesFactory.from_labeled_triples(
    data[["source", "type", "target"]].values,
    create_inverse_triples=False,
    entity_to_id=None,
    relation_to_id=None,
    compact_id=False,
    filter_out_candidate_inverse_relations=True,
    metadata=None,
)

This example uses a generic Cypher query that can fetch any Neo4j dataset and construct a PyKEEN graph from it. Notice that we use the internal Neo4j ids of nodes to build the triples data frame. PyKEEN expects the triple elements to be strings, since it treats them as labels, so we simply cast the internal ids to strings. Now that we have our PyKEEN graph, we can use the `split` method to perform the train-test-validation split.
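
To see why string labels are convenient, it helps to picture what a triples factory does with them: each distinct label is mapped to an integer index, and the model only ever sees the indices. The sketch below illustrates that general idea in plain Python. `build_id_maps` is a hypothetical helper and is not PyKEEN's implementation, although the resulting dictionaries play the same role as the factory's `entity_to_id` and `relation_to_id` mappings.

```python
def build_id_maps(labeled_triples):
    """Assign consecutive integer ids to entity and relation labels in sorted order."""
    entities = sorted({x for h, _, t in labeled_triples for x in (h, t)})
    relations = sorted({r for _, r, _ in labeled_triples})
    entity_to_id = {label: i for i, label in enumerate(entities)}
    relation_to_id = {label: i for i, label in enumerate(relations)}
    return entity_to_id, relation_to_id


# Stringified Neo4j node ids act as entity labels (values are made up).
labeled = [("17", "treats", "42"), ("17", "binds", "8"), ("8", "associates", "42")]
entity_to_id, relation_to_id = build_id_maps(labeled)

# The integer-indexed triples are what an embedding model is trained on.
mapped = [(entity_to_id[h], relation_to_id[r], entity_to_id[t]) for h, r, t in labeled]
```

Because the labels are the stringified Neo4j ids, a predicted label can later be cast back to an integer and matched against `id(n)` in Cypher.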


In [None]:
training, testing, validation = tf.split([0.8, 0.1, 0.1])
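
For intuition, a ratio-based split amounts to shuffling the triples and cutting the list at the requested proportions. The sketch below is a naive stand-in written under that assumption. `naive_split` is a hypothetical helper, and a library split may do more, such as making sure every entity and relation still appears in the training set, which this version does not attempt.

```python
import random


def naive_split(triples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle triples and cut them into three consecutive parts sized by the ratios."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    first_cut = round(ratios[0] * n)
    second_cut = first_cut + round(ratios[1] * n)
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]


# Twenty made-up triples, split 80/10/10.
toy = [(str(i), "treats", str(i + 1)) for i in range(20)]
train, test, valid = naive_split(toy)
```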

# Train a knowledge graph embedding model

Now that we have the train-test data available, we can train a knowledge graph embedding model. We will use the RotatE model in this example. I am not that familiar with all the variations of the embedding models, but if you want to learn more, I would suggest Jure Leskovec's lecture on knowledge graph embeddings.
To keep the tutorial simple, we won't perform any hyperparameter optimization. I've chosen to train for 20 epochs with an embedding dimension of 512.


In [None]:
result = pipeline(
    training=training,
    testing=testing,
    validation=validation,
    model="RotatE",
    # Set the embedding size through the model's own keyword arguments
    model_kwargs=dict(embedding_dim=512),
    stopper="early",
    epochs=20,
    random_seed=420,
)
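
For intuition about what the RotatE model learns: it represents entities as complex vectors and each relation as a rotation, so a true triple should satisfy $t \approx h \circ r$, where every component of $r$ has modulus 1. The sketch below computes the corresponding distance in plain Python. The embeddings are hand-picked assumptions, and `rotate_distance` is a hypothetical helper rather than PyKEEN's implementation.

```python
import cmath


def rotate_distance(head, relation_phases, tail):
    """Sum of |h_i * e^{i * theta_i} - t_i| over all complex dimensions."""
    return sum(
        abs(h * cmath.exp(1j * theta) - t)
        for h, theta, t in zip(head, relation_phases, tail)
    )


# Hypothetical 2-dimensional complex embeddings, chosen by hand.
head = [1 + 0j, 0 + 1j]
quarter_turn = [cmath.pi / 2, cmath.pi / 2]  # rotate each dimension by 90 degrees
good_tail = [0 + 1j, -1 + 0j]  # exactly the rotated head
bad_tail = [1 + 0j, 0 + 1j]  # the unrotated head

good = rotate_distance(head, quarter_turn, good_tail)  # close to 0
bad = rotate_distance(head, quarter_turn, bad_tail)  # clearly larger
```

A smaller distance means a more plausible triple, so candidate tails are ranked by this distance in the same way as in the translational case.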

# Multi-class link prediction

The PyKEEN library supports multiple methods for multi-class link prediction.
You can find the top K predictions in the whole network, or you can be more specific: fix a particular head node and relationship type and check whether any new connections are predicted.

In this example, you will predict new `treats` relationships for the L-Asparagine compound. Because we used the internal node ids for mapping, we first have to retrieve the node id of L-Asparagine from Neo4j and input it into the prediction method.


In [None]:
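# Look up the internal Neo4j id of the compound (as a string, matching the triples)
compound_id = run_query(
    """
    MATCH (s:Compound)
    WHERE s.name = "L-Asparagine"
    RETURN toString(id(s)) AS id
    """
)["id"][0]

# Score every candidate tail for (L-Asparagine, treats, ?)
pred = predict.predict_target(
    model=result.model,
    head=compound_id,
    relation="treats",
    triples_factory=result.training,
)
# Add the in_training column so that known triples can be filtered out later
df = pred.add_membership_columns(training=result.training).df
print(df.head(5))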

# Store predictions to Neo4j

For easier evaluation of the results, we will store the top five predictions back to Neo4j.


In [None]:
# Keep the five best-scoring tails that are not already known from the training triples
candidate_nodes = df[~df["in_training"]].head(5)["tail_label"].to_list()

run_query(
    """
    MATCH (n)
    WHERE id(n) = toInteger($compound_id)
    UNWIND $candidates as ca
    MATCH (c)
    WHERE id(c) = toInteger(ca)
    MERGE (n)-[:PREDICTED_TREATS]->(c)
    """,
    {"compound_id": compound_id, "candidates": candidate_nodes},
)

# Inspect results


In [None]:
run_query(
    """
    MATCH (c:Compound)-[:PREDICTED_TREATS]->(d:Disease)
    RETURN c.name as compound, d.name as disease
    """
)

# Explaining predictions

As far as I know, knowledge graph embedding models are not that useful for explaining predictions. On the other hand, you could use the existing connections in the graph to present the information to a medical doctor and let them decide whether the predictions make sense.
For example, you could investigate direct and indirect paths between L-Asparagine and colon cancer with the following Cypher query.


In [None]:
run_query(
    """
    MATCH (c:Compound {name: "L-Asparagine"}),(d:Disease {name:"colon cancer"})
    WITH c,d
    MATCH p=allShortestPaths((c)-[r:binds|regulates|interacts|upregulates|downregulates|associates*1..4]-(d))
    RETURN [n IN nodes(p) | n.name] AS path LIMIT 25
    """
)
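
To clarify what the query asks for, the sketch below finds every shortest undirected path between two nodes of a toy graph with a breadth-first search, capped at four hops like the `*1..4` bound in the Cypher pattern. It illustrates the idea and is not how Neo4j implements it. The function name and the "gene A"/"gene B" nodes are hypothetical and not taken from Hetionet.

```python
from collections import deque


def all_shortest_paths(edges, source, target, max_hops=4):
    """Breadth-first search returning every shortest undirected path within max_hops."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    queue = deque([[source]])
    found, best = [], None
    while queue:
        path = queue.popleft()
        hops = len(path) - 1
        if best is not None and hops > best:
            break  # every remaining path is longer than the shortest ones found
        if path[-1] == target:
            found.append(path)
            best = hops
            continue
        if hops == max_hops:
            continue
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in path:  # never revisit a node within one path
                queue.append(path + [nxt])
    return found


# Made-up connections between the compound, two genes, and the disease.
edges = [
    ("L-Asparagine", "gene A"),
    ("L-Asparagine", "gene B"),
    ("gene A", "colon cancer"),
    ("gene B", "colon cancer"),
    ("gene A", "gene B"),
]
paths = all_shortest_paths(edges, "L-Asparagine", "colon cancer")
```

Each returned path is a list of node names, which mirrors the list that the Cypher query builds with `nodes(p)`.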