# Self study 12

## Libraries
For this task you will need the PyKEEN library, which implements several knowledge graph embedding methods. You can find its documentation [here](https://pykeen.readthedocs.io).

In [None]:
from pykeen.pipeline import pipeline, plot_losses, plot_er
from pykeen.models import TransE, DistMult, RESCAL
from pykeen.datasets import Nations, get_dataset
from pykeen.predict import predict_triples, predict_target
from pykeen.evaluation import RankBasedEvaluator

## Embeddings computation
In this session, we will learn how to compute knowledge graph embeddings using the PyKEEN library.

This self-study module is based on the [PyKEEN tutorial](https://pykeen.readthedocs.io/en/latest/tutorial/first_steps.html). We will use TransE to generate the embeddings, but other methods are possible, such as DistMult and RESCAL, which we studied during the lecture. They are already imported, so it is enough to change the model in the next code snippet to use a different method.
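As a reminder of how these three methods differ, the sketch below implements their scoring functions in plain Python on hand-picked toy vectors. It is an illustration under assumptions, not PyKEEN's implementation: the function names are hypothetical, the vectors are made up, and TransE is shown with the L2 norm, although the L1 norm is also commonly used.

```python
import math


def transe_score(h, r, t):
    """TransE: negative L2 distance between h + r and t (higher is better)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def distmult_score(h, r, t):
    """DistMult: trilinear product, i.e. the sum over i of h_i * r_i * t_i."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def rescal_score(h, M, t):
    """RESCAL: bilinear form h^T M t, where M is the relation matrix."""
    return sum(h[i] * M[i][j] * t[j] for i in range(len(h)) for j in range(len(t)))


# Toy 2-dimensional embeddings, chosen by hand for illustration only.
head = [1.0, 0.0]
relation = [0.0, 1.0]
close_tail = [1.0, 1.0]   # equal to head + relation
far_tail = [-1.0, 0.0]    # far away from head + relation

# TransE prefers the tail that is close to head + relation.
print(transe_score(head, relation, close_tail))
print(transe_score(head, relation, far_tail))

# With a diagonal relation matrix, RESCAL reduces to DistMult.
print(distmult_score([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]))
print(rescal_score([1.0, 2.0], [[3.0, 0.0], [0.0, 4.0]], [5.0, 6.0]))
```

In all three cases a higher score means the model considers the statement more plausible.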

As the dataset, we will use Nations. This knowledge graph is very small: it contains 14 nodes, 55 predicates and 1992 triples. Its small size, however, makes it possible to run PyKEEN on a PC that is not equipped for heavy machine learning training (for example, one without a GPU). If your machine is set up to run PyTorch on a GPU, you can replace Nations with another dataset, such as [FB15k237](https://pykeen.readthedocs.io/en/stable/api/pykeen.datasets.FB15k237.html).

The easiest way to use PyKEEN is through the pipeline function. This function takes the KGE method and the dataset as arguments and trains the embeddings.

In [None]:
result = pipeline(
    dataset=Nations,
    model=TransE,
)

The pipeline function accepts several additional parameters, which affect the training of the model. For example, one can set the learning rate, or change the negative sampling algorithm. The [documentation](https://pykeen.readthedocs.io/en/latest/api/pykeen.pipeline.pipeline.html#pykeen.pipeline.pipeline) of the function lists all the options.
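As a rough sketch of what such a configuration might look like, the snippet below collects a few pipeline keyword arguments in a dictionary. The values are illustrative assumptions and are not tuned for Nations, and the train helper is a hypothetical wrapper introduced here only to keep the configuration separate from the training call.

```python
# Illustrative settings only; the values are assumptions, not tuned for Nations.
pipeline_kwargs = dict(
    dataset="Nations",
    model="TransE",
    model_kwargs=dict(embedding_dim=50),               # size of the embedding vectors
    optimizer="Adam",
    optimizer_kwargs=dict(lr=0.01),                    # learning rate
    negative_sampler="bernoulli",                      # Bernoulli negative sampling
    negative_sampler_kwargs=dict(num_negs_per_pos=5),  # negatives per positive triple
    training_kwargs=dict(num_epochs=100),              # number of training epochs
    random_seed=42,                                    # for reproducibility
)


def train(config):
    """Hypothetical helper: run the PyKEEN pipeline with the given keyword arguments."""
    # Imported here so the configuration can be inspected without loading PyKEEN.
    from pykeen.pipeline import pipeline

    return pipeline(**config)
```

Calling `result = train(pipeline_kwargs)` would then train the model with these settings; changing a single entry of the dictionary is enough to try a different learning rate or sampler.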

## Visualisation
We can now investigate the results of the training process. First, we can use the plot_losses function to visualise how the loss function changes over the different epochs.

In [None]:
plot_losses(result)

We can also visualise the embeddings. The plot_er function visualises the embedding vectors in a 2D space. The following plot shows the embeddings of the entities.

In [None]:
plot_er(result, plot_relations=False)

## Triple scores
We can now use the model for prediction tasks. First, we use the predict_triples function to compute the scores of a set of statements. Below, we score the validation set that PyKEEN provides for the Nations dataset.

The resulting object, pack, contains the statements paired with their predicted scores. We then enrich it with the node and edge labels and export it to a Pandas dataframe.

In [None]:
dataset = get_dataset(dataset="nations")
pack = predict_triples(model=result.model, triples=dataset.validation)
df = pack.process(factory=result.training).df

This command retrieves the five statements that obtained the highest scores.

In [None]:
df.nlargest(n=5, columns="score")

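We can also group the statements by their head node and compute, for each node, the mean, standard deviation and count of the scores of the statements in which it appears as the head. The nodes are sorted by decreasing mean score.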

In [None]:
grouped = df.groupby(by=["head_id", "head_label"]).agg({"score": ["mean", "std", "count"]})
grouped.sort_values(by=("score", "mean"), ascending=False)

## Link prediction
Instead of computing triple scores, one can perform link prediction, i.e., predict the object (a.k.a. tail) given the subject (a.k.a. head) and the predicate (a.k.a. relation). The function for link prediction is predict_target. The following code executes the query (uk, conferences, ?).

In [None]:
pred = predict_target(
    model=result.model,
    head="uk",
    relation="conferences",
    triples_factory=result.training,
)

To process the predictions, we first perform a filtering step, i.e., we remove the statements that were already included in the training set. The model learned from them, so there is very little value in predicting something that we already know to be true.

Next, we enrich the predictions with information about the presence of the statements in the validation or testing datasets.

In [None]:
pred_filtered = pred.filter_triples(dataset.training)
pred_annotated = pred_filtered.add_membership_columns(validation=dataset.validation, testing=dataset.testing)
pred_annotated.df.sort_values(by="score", ascending=False)
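To make the two steps above concrete, here is a minimal plain-Python sketch of the same idea on toy data. It is not how PyKEEN implements filter_triples or add_membership_columns: the function name, the scores and the three splits are all made up for illustration.

```python
# Hypothetical candidate tails for the query (uk, conferences, ?) with made-up scores.
candidates = {"usa": -4.1, "china": -5.3, "india": -6.0, "cuba": -7.2}

# Made-up splits, stored as sets of (head, relation, tail) tuples.
training = {("uk", "conferences", "usa")}
validation = {("uk", "conferences", "india")}
testing = {("uk", "conferences", "china")}


def filter_and_annotate(head, relation, scores, training, validation, testing):
    """Drop candidates known from training and flag validation/testing membership."""
    rows = []
    for tail, score in scores.items():
        triple = (head, relation, tail)
        if triple in training:
            continue  # filtering step: the statement was already seen during training
        rows.append(
            {
                "tail": tail,
                "score": score,
                "in_validation": triple in validation,
                "in_testing": triple in testing,
            }
        )
    # Highest score first, as in the sorted dataframe above.
    return sorted(rows, key=lambda row: row["score"], reverse=True)


rows = filter_and_annotate("uk", "conferences", candidates, training, validation, testing)
for row in rows:
    print(row)
```

Candidates flagged as belonging to the validation or testing set are known to be true, so seeing them near the top of the list is a sign that the model ranks correct answers highly.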

## Evaluation
Evaluation is an essential step to assess the quality of a model, as well as to compare different KGE methods. A typical way to evaluate KGE methods is rank-based evaluation, which treats the KGE model as a recommender system. Given a query, a model returns a sequence of potential answers, ordered by the score of the corresponding statement. Therefore, given a test set of positive statements, one can submit link prediction queries to the model and study how highly the expected answer is ranked in the results.

The following code evaluates the model. The evaluate method takes as inputs the trained model, the test set, and the list of statements used during training and validation, so that they can be filtered out from the query answers.

Three metrics are typically studied when evaluating KGE methods: the mean rank (MR), the mean reciprocal rank (MRR), and Hits@k. Their definitions are available, e.g., on [Wikipedia](https://en.wikipedia.org/wiki/Knowledge_graph_embedding#Performance_indicators).

In [None]:
evaluator = RankBasedEvaluator()
metrics = evaluator.evaluate(
    result.model,
    dataset.testing.mapped_triples,
    additional_filter_triples=[dataset.training.mapped_triples, dataset.validation.mapped_triples],
)
print(f"Mean rank: {metrics.get_metric('mean_rank'):.2f}")
print(f"Mean reciprocal rank: {metrics.get_metric('mean_reciprocal_rank'):.2f}")
print(f"Hits@10: {metrics.get_metric('hits@10'):.2%}")
print(f"Hits@1: {metrics.get_metric('hits@1'):.2%}")
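To see what these numbers mean, the following plain-Python sketch computes a filtered rank and the three metrics by hand on toy data. It is a simplified illustration, not PyKEEN's evaluator: the function names, scores and ranks are made up, and ties between scores are ignored.

```python
def filtered_rank(scores, true_answer, known_answers):
    """Rank of true_answer among the candidates, skipping other known-true answers."""
    true_score = scores[true_answer]
    better = sum(
        1
        for candidate, score in scores.items()
        if candidate != true_answer
        and candidate not in known_answers  # filtered out: already known to be true
        and score > true_score
    )
    return better + 1  # rank 1 is the best possible rank


def mean_rank(ranks):
    """MR: arithmetic mean of the ranks (lower is better)."""
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """MRR: mean of 1 / rank (higher is better, at most 1)."""
    return sum(1 / rank for rank in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Hits@k: fraction of queries whose expected answer is ranked in the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Made-up scores for one query; "usa" is assumed to be a known training answer.
scores = {"usa": 0.9, "china": 0.8, "india": 0.5, "cuba": 0.1}
print(filtered_rank(scores, "india", known_answers=set()))     # unfiltered
print(filtered_rank(scores, "india", known_answers={"usa"}))   # filtered

# Made-up ranks of the expected answers for four test queries.
ranks = [1, 3, 2, 10]
print(f"Mean rank: {mean_rank(ranks):.2f}")
print(f"Mean reciprocal rank: {mean_reciprocal_rank(ranks):.2f}")
print(f"Hits@3: {hits_at_k(ranks, 3):.2%}")
```

Filtering improves the rank of the expected answer here because a candidate that is already known to be true no longer counts against it. Note also that a single badly ranked query affects the mean rank much more than the mean reciprocal rank.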