# Using SEPAL to embed Mini YAGO3

In [9]:
import pykeen
import torch
import scipy.sparse
import numpy as np
from time import time
from pathlib import Path

import sys
sys.path.append('../sepal')
from knowledge_graph import KnowledgeGraph
from settings import set_control_params
from dataloader import DataLoader
from downstream_evaluation import prediction_scores, RW_PATH
from sepal import run_sepal

## Loading the data

The Mini YAGO3 dataset is constructed from YAGO3 [1] by first filtering for entities with a degree of 9 or greater, and then selecting the largest connected component from the resulting subgraph.
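The two construction steps described above can be sketched on a toy graph. This is only an illustration, not the script used to build Mini YAGO3: the helper name `filter_and_largest_component` is hypothetical, degree is assumed to mean the number of distinct neighbors in the undirected graph, and the degree filter is assumed to be applied once (nodes whose degree drops below the threshold after filtering are not removed again).

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components


def filter_and_largest_component(adjacency, min_degree=9):
    """Return the node indices kept after degree filtering and largest-CC selection.

    Hypothetical helper: degree = number of distinct undirected neighbors.
    """
    adjacency = sp.csr_matrix(adjacency)
    # Symmetrize so that incoming and outgoing edges both count towards the degree.
    undirected = ((adjacency + adjacency.T) > 0).astype(np.int64)
    degrees = np.asarray(undirected.sum(axis=1)).ravel()
    # Step 1: keep nodes whose degree reaches the threshold.
    keep = np.flatnonzero(degrees >= min_degree)
    sub = undirected[keep][:, keep]
    # Step 2: keep the largest connected component of the filtered subgraph.
    _, labels = connected_components(sub, directed=False)
    largest = np.argmax(np.bincount(labels))
    return keep[labels == largest]


# Toy graph: a 4-clique (nodes 0-3), a pendant node 4, and a separate triangle (5-7).
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4), (5, 6), (6, 7), (5, 7)]
rows, cols = zip(*edges)
toy = sp.coo_matrix((np.ones(len(edges)), (rows, cols)), shape=(8, 8))

kept = filter_and_largest_component(toy, min_degree=3)
print(kept)
```

With a threshold of 3, only the clique survives the filter: the pendant node and the triangle both fall below it.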

In [10]:
triples_dir = Path("../data/knowledge_graphs/mini_yago3")
data_loader = DataLoader(triples_dir)
triples_factory = data_loader.get_triples_factory()
graph = KnowledgeGraph(triples_factory, name='mini_yago3')

Now that the graph is loaded, we can print some basic statistics.

In [11]:
print(f"Number of entities: {graph.num_entities}")
print(f"Number of relations: {graph.num_relations}")
print(f"Number of triples: {graph.num_triples}")
print(f"Highest degree: {graph.degrees.max():.0f}")
print(f"Average degree: {graph.degrees.mean():.1f}")

Number of entities: 129493
Number of relations: 74
Number of triples: 1132010
Highest degree: 65711
Average degree: 12.6


We can also check whether the graph is connected.

In [12]:
n_components, labels = scipy.sparse.csgraph.connected_components(
    graph.adjacency, directed=False, return_labels=True
)
largest_cc = np.where(labels == np.argmax(np.bincount(labels)))[0]
print(f"Number of connected components: {n_components}")
print(f"Largest connected component contains {len(largest_cc) / graph.num_entities:.2%} of all the nodes.")

Number of connected components: 1
Largest connected component contains 100.00% of all the nodes.


## Running SEPAL

First, we specify the hyperparameters. Here, we use SEPAL with DistMult [2] as the base model.

Since the input graph is connected, we can use `handle_disconnected=False` (this will skip some connectedness checks and make the method slightly faster).
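As a reminder of what the base model computes, DistMult [2] scores a triple (head, relation, tail) with a trilinear product of the three embedding vectors. The sketch below is a plain NumPy illustration of that scoring function, not the code that SEPAL or PyKEEN runs internally; the function name `distmult_score` and the random vectors are made up for the example, and the dimension 100 simply mirrors `embed_dim` used in this notebook.

```python
import numpy as np


def distmult_score(head, relation, tail):
    """DistMult score: sum over dimensions of head * relation * tail."""
    return np.sum(head * relation * tail, axis=-1)


# Small hand-checkable example: 1*3*5 + 2*4*6 = 63.
small = distmult_score(np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0]))
print(small)

# Random 100-dimensional vectors (illustrative values only).
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 100))

# The score is symmetric in head and tail, a known property of DistMult.
forward = distmult_score(h, r, t)
backward = distmult_score(t, r, h)
print(np.isclose(forward, backward))
```

Because the product is element-wise, swapping head and tail leaves the score unchanged, so DistMult cannot distinguish the direction of a relation.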

In [13]:
device = "cuda:0" # Change to the device you want to use, e.g., "cpu" or "cuda:1"
print(f"Using device: {device}")
params = {
    "embed_dim": 100,
    "subgraph_max_size": 4e4,
    "embed_method": "distmult",
    "core_node_proportions": 0.05,  # proportion of nodes to select as core nodes
    "core_selection": "degree",  # "degree" or "hybrid" or "pagerank"
    "diffusion_stop": 0.8,
    "propagation_lr": 1,
    "n_propagation_steps": 10,
    "num_epochs": 50,
    "batch_size": 512,
    "num_negs_per_pos": 1,
    "handle_disconnected": False, # since our graph is connected
    "seed": 0,
}
ctrl = set_control_params(device=device, **params)

Using device: cuda:0


Then we can train the model in less than a minute.

In [14]:
embeddings, relation_embeddings, sepal_time = run_sepal(ctrl, graph)

Central core extraction:
Core subgraph contains 4498 entities (3.5% of total graph)
Assigning super-spreaders' neighbors...
59.4% assigned
Diffusion on the graph...
80.0% assigned
Merging subgraphs...
27 subgraphs before merging.
6 subgraphs after merging.
Subgraph dilation...
0 remaining         
Merging small subgraphs...
Splitting large subgraphs...
5 subgraphs before merging.
5 subgraphs after merging.
Subgraph sizes: min: 27975, max: 38427


Training epochs on cuda:0: 100%|██████████| 50/50 [00:47<00:00,  1.06epoch/s, loss=0.235, prev_loss=0.239]
Propagating through subgraphs: 100%|██████████| 5/5 [00:01<00:00,  2.72it/s]

Total time: 50.41 seconds





We then save the entity and relation embeddings, along with the training time.

In [15]:
# Create the output directory if it doesn't exist
output_dir = Path("../embeddings")
output_dir.mkdir(parents=True, exist_ok=True)

np.savez_compressed(
    output_dir / "mini_yago3_sepal_embeddings.npz",
    entity_embeddings=embeddings.cpu().numpy(),
    relation_embeddings=relation_embeddings.cpu().numpy(),
    time=sepal_time,
)
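To reuse the embeddings later, the `.npz` archive can be read back with `np.load`. The sketch below writes a small placeholder archive with the same three keys to a temporary directory and reloads it; the file name, array shapes, and timing value are dummy values chosen for the example, not the real embeddings.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_embeddings.npz"  # placeholder file name

    # Write an archive with the same keys as above, using dummy arrays.
    np.savez_compressed(
        path,
        entity_embeddings=np.zeros((5, 100), dtype=np.float32),
        relation_embeddings=np.ones((2, 100), dtype=np.float32),
        time=1.5,
    )

    # Arrays are decompressed when accessed by key; the scalar comes back as a 0-d array.
    with np.load(path) as archive:
        entity = archive["entity_embeddings"]
        relation = archive["relation_embeddings"]
        elapsed = float(archive["time"])

print(entity.shape, relation.shape, elapsed)
```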

## Comparison with DistMult

As a baseline, we now train a DistMult model on the full graph.

In [16]:
# Initialize DistMult model
model = pykeen.models.DistMult(
    triples_factory=graph.triples_factory,
    random_seed=0,
    embedding_dim=100,
    loss="CrossEntropyLoss",
).to(device)

# Set up training loop
training_loop = pykeen.training.SLCWATrainingLoop(
    model=model,
    triples_factory=graph.triples_factory,
    optimizer=torch.optim.Adam(params=model.get_grad_params(), lr=1e-3),
    negative_sampler_kwargs={
        "num_negs_per_pos": 1,
    },
)

# Train model
start = time()
losses = training_loop.train(
    triples_factory=graph.triples_factory,
    num_epochs=50,
    batch_size=512,
)
end = time()
distmult_time = end - start
print(f"DistMult training time: {distmult_time:.2f} seconds")

# Get embeddings for entities and relations
distmult_embeddings = model.entity_representations[0]().detach().cpu().numpy()
distmult_rel_embed = model.relation_representations[0]().detach().cpu().numpy()

# Save the embeddings
np.savez_compressed(
    output_dir / "mini_yago3_distmult_embeddings.npz",
    entity_embeddings=distmult_embeddings,
    relation_embeddings=distmult_rel_embed,
    time=distmult_time,
)

Training epochs on cuda:0: 100%|██████████| 50/50 [23:25<00:00, 28.12s/epoch, loss=0.518, prev_loss=0.523]


DistMult training time: 1406.22 seconds


We can see that SEPAL is much faster:

In [27]:
speedup_ratio = distmult_time / sepal_time

print(f"SEPAL runs {speedup_ratio:.0f}x faster than DistMult. SEPAL time: {sepal_time:.0f} seconds, DistMult time: {distmult_time/60:.0f} minutes.")

SEPAL runs 28x faster than DistMult. SEPAL time: 50 seconds, DistMult time: 23 minutes.


## Evaluation on downstream tasks

Now we can evaluate how useful the embeddings are for downstream tasks. In the paper, we provide 46 downstream tables: 26 for classification and 20 for regression. For this example, we use the Housing Prices dataset, a regression task on real-world data. The task is to predict the average housing price in US cities from the embeddings of those cities.
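The general idea of such an evaluation is to treat each entity's embedding as a feature vector and fit a regressor on the target, scoring it with cross-validated R2. The sketch below shows this with a NumPy-only ridge regression on synthetic data. It is not the implementation of `prediction_scores`, whose estimator and tuning procedure are not shown here; the helper `ridge_r2`, the regularization strength, and the synthetic features and target are all assumptions made for the illustration.

```python
import numpy as np


def ridge_r2(features, target, alpha=1.0, n_folds=5, seed=0):
    """Cross-validated R2 of a closed-form ridge regression (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(target)), n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        x_train, y_train = features[train_idx], target[train_idx]
        x_test, y_test = features[test_idx], target[test_idx]
        # Center with training statistics so no intercept column is needed.
        x_mean, y_mean = x_train.mean(axis=0), y_train.mean()
        xc = x_train - x_mean
        gram = xc.T @ xc + alpha * np.eye(xc.shape[1])
        weights = np.linalg.solve(gram, xc.T @ (y_train - y_mean))
        pred = (x_test - x_mean) @ weights + y_mean
        # R2 = 1 - residual sum of squares / total sum of squares.
        ss_res = np.sum((y_test - pred) ** 2)
        ss_tot = np.sum((y_test - y_test.mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)


# Synthetic stand-ins: 200 "entities" with 10-dimensional "embeddings"
# and a target that depends linearly on them, plus a little noise.
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(200, 10))
toy_target = toy_features @ rng.normal(size=10) + 0.1 * rng.normal(size=200)

scores = ridge_r2(toy_features, toy_target)
print(f"Mean R2 over {len(scores)} folds: {scores.mean():.3f}")
```

On real data, the rows of the feature matrix would be looked up from the embedding matrix through an entity-to-index mapping such as `data_loader.entity_to_idx`.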

In [18]:
results = prediction_scores(
    embeddings,
    "sepal-mini_yago3",
    "mini_yago3",
    data_loader.entity_to_idx,
    RW_PATH / "housing_prices.parquet",
    "regression",
    n_repeats=5,
    tune_hyperparameters=True,
)

In [19]:
mean_score = results["scores"].mean()
std_err = np.std(results["scores"]) / np.sqrt(len(results["scores"]))
print(f"SEPAL R2 score: {mean_score:.3f} ± {std_err:.3f}")

SEPAL R2 score: 0.139 ± 0.002


In [None]:
distmult_results = prediction_scores(
    distmult_embeddings,
    "distmult-mini_yago3",
    "mini_yago3",
    data_loader.entity_to_idx,
    RW_PATH / "housing_prices.parquet",
    "regression",
    n_repeats=5,
    tune_hyperparameters=True,
)

In [21]:
mean_score = distmult_results["scores"].mean()
std_err = np.std(distmult_results["scores"]) / np.sqrt(len(distmult_results["scores"]))
print(f"DistMult R2 score: {mean_score:.3f} ± {std_err:.3f}")

DistMult R2 score: 0.109 ± 0.002


The R2 scores reported above (higher is better) show that SEPAL outperforms DistMult on this task: 0.139 versus 0.109.

## Remarks

- SEPAL works best on *large* knowledge graphs (> 1M entities).
- The partitioning method, BLOCS, is designed for *scale-free* graphs. It may be slower on graphs with other degree distributions.

## References

[1] Farzaneh Mahdisoltani, Joanna Biega, and Fabian Suchanek. ["YAGO3: A knowledge base from multilingual Wikipedias"](https://imt.hal.science/hal-01699874/document). In *Proceedings of the 7th Biennial Conference on Innovative Data Systems Research (CIDR)*, 2014.

[2] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. ["Embedding entities and relations for learning and inference in knowledge bases"](https://arxiv.org/pdf/1412.6575). In *Proceedings of the 3rd International Conference on Learning Representations (ICLR)*, 2015.