
Possible issue with model evaluation when using datasets with inverse triples #1346

martytom opened this issue Nov 17, 2023 · 1 comment
Labels
bug Something isn't working

Describe the bug

Using the built-in dataset Nations, we show that there might be a problem with the model evaluation process when using datasets with inverse triples, i.e., when dataset = Nations(create_inverse_triples=True). The evaluation process in that case seems to differ from the process when using a dataset without inverse triples, i.e., when dataset = Nations(create_inverse_triples=False). In the following we provide a minimal example to show the difference between the evaluation processes for these two scenarios:

  • scenario 1: no inverse triples are present, i.e., dataset = Nations(create_inverse_triples=False)
  • scenario 2: inverse triples are present, i.e., dataset = Nations(create_inverse_triples=True)

and report metric results for identical setups, apart from the fact that scenario 1 does not use inverse triples and scenario 2 does.

How to reproduce

1. The setup

Required imports.

from pykeen.nn.modules import Interaction
from pykeen.pipeline import pipeline
from pykeen.datasets import Nations
from pykeen.models import make_model_cls

Use the standard DistMult interaction function.

class DistMultInteraction(Interaction):
    def forward(self, h, r, t):
        # DistMult: sum over the element-wise product of head, relation, and tail
        return (h * r * t).sum(dim=-1)

Take the DistMult model with embedding dimension d=4.

model = make_model_cls(
    interaction=DistMultInteraction,
    dimensions={'d': 4},
)

We create a regular pipeline object with a basic setup and set num_epochs=0 so that the model is only evaluated on the initialized embeddings and no training takes place. We set random_seed=126 to guarantee that the initial embeddings in both scenarios are identical. In each run we use the same parameters so that we can properly compare the model evaluation for scenario 1 and scenario 2. Since inverse triples are not used in the evaluation process, and since num_epochs=0 and the random seed is fixed, the resulting scores for all metrics should be identical in both scenarios.

result = pipeline(
    random_seed=126,
    dataset=dataset,
    model=model,
    loss="BCEWithLogitsLoss",
    training_loop="sLCWA",
    training_kwargs=dict(
        num_epochs=0,
        batch_size=32,
    ),
    negative_sampler="basic",
    negative_sampler_kwargs=dict(
        num_negs_per_pos=1,
    ),
)
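
To compare the two scenarios, this pipeline is run once per dataset. The sketch below shows one way to do that; the get_metric accessor on result.metric_results is an assumption about the PyKEEN API and may need adjusting for your version.

# run once per scenario; the pipeline call is the same as above
for create_inverse in (False, True):
    dataset = Nations(create_inverse_triples=create_inverse)
    result = pipeline(
        random_seed=126,
        dataset=dataset,
        model=model,
        loss="BCEWithLogitsLoss",
        training_loop="sLCWA",
        training_kwargs=dict(num_epochs=0, batch_size=32),
        negative_sampler="basic",
        negative_sampler_kwargs=dict(num_negs_per_pos=1),
    )
    # note: get_metric is assumed here; the exact accessor may differ by version
    print(f"create_inverse_triples={create_inverse}:",
          result.metric_results.get_metric("hits_at_3"))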

2. Model results for both scenarios

From this, we get the following results for the hits_at_3 metric.

Scenario 1: no inverse triples present, i.e., dataset = Nations(create_inverse_triples=False)

Side   Rank type     Metric     Value
head   optimistic    hits_at_3  0.437811
tail   optimistic    hits_at_3  0.398010
both   optimistic    hits_at_3  0.417910
head   realistic     hits_at_3  0.437811
tail   realistic     hits_at_3  0.398010
both   realistic     hits_at_3  0.417910
head   pessimistic   hits_at_3  0.437811
tail   pessimistic   hits_at_3  0.398010
both   pessimistic   hits_at_3  0.417910

Scenario 2: inverse triples present, i.e., dataset = Nations(create_inverse_triples=True)

Side   Rank type     Metric     Value
head   optimistic    hits_at_3  0.507463
tail   optimistic    hits_at_3  0.417910
both   optimistic    hits_at_3  0.462687
head   realistic     hits_at_3  0.507463
tail   realistic     hits_at_3  0.417910
both   realistic     hits_at_3  0.462687
head   pessimistic   hits_at_3  0.507463
tail   pessimistic   hits_at_3  0.417910
both   pessimistic   hits_at_3  0.462687
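
(The per-side tables above can be exported from a run's result object, e.g. as a dataframe; the to_df() helper and the column name "Metric" used below are assumptions about the PyKEEN API and may differ between versions.)

df = result.metric_results.to_df()
# keep only the hits_at_3 rows; the column name "Metric" is assumed
print(df[df["Metric"] == "hits_at_3"])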

As can be seen, the results are different despite the fact that the embeddings are identical. This suggests that model evaluation differs when using a dataset with create_inverse_triples=False compared to the same dataset with create_inverse_triples=True, even though inverse triples are not used in model evaluation (so the evaluation in scenario 1 and scenario 2 should be identical and thus give the same scores).

3. Details about the Nations dataset

The dataset has num_entities=14 and num_triples=1992. The size of the test set is dataset.testing.mapped_triples.shape = torch.Size([201, 3]), so there are 201 (mapped) triples in the test set. The first 10 triples of the test set are:

dataset.testing.mapped_triples[:10] = 
tensor([[ 0,  7,  5],
        [ 0, 14, 11],
        [ 0, 18, 12],
        [ 0, 20,  3],
        [ 0, 20, 10],
        [ 0, 21, 12],
        [ 1,  3, 12],
        [ 1,  7,  0],
        [ 1, 10, 11],
        [ 1, 18,  5]])

We checked that the test triples are identical in both scenarios, i.e., scenario 1 (Nations without inverse triples) and scenario 2 (Nations with inverse triples).

4. The evaluation process in detail

As far as I understand, the model evaluation on the test set happens in two steps.

  • Step 1: Corrupt heads: For every triple in the test set, fix the tail and replace the head by any entity that is present in the graph.
  • Step 2: Corrupt tails: For every triple in the test set, fix the head and replace the tail by any entity that is present in the graph.

In the following we go through this 2-step evaluation process for both scenarios (scenario 1: Nations without inverse triples, scenario 2: Nations with inverse triples).

We print the heads h, relations r, and tails t that are passed to the interaction function during the evaluation loop by printing their shapes within the interaction function:

class DistMultInteraction(Interaction):
    def forward(self, h, r, t):
        print("Interaction Function: head, tail and relations")
        print("h", h.shape)
        print("t", t.shape)
        print("r", r.shape)
        return (h * r * t).sum(dim=-1)

In the following we report the corresponding shapes for both steps (step 1: corrupt heads, step 2: corrupt tails) and the two described scenarios (scenario 1: without inverse relations, scenario 2: with inverse relations).

Scenario 1: Nations without inverse relations

Step 1: head corruption:

Dimensions of head, tail and relation tensors during head corruption evaluation

h torch.Size([1, 14, 4])
t torch.Size([201, 1, 4])
r torch.Size([201, 1, 4])

The tails correspond to the size of the test set, t.shape = torch.Size([201, 1, 4]), and the head dimension is 14, h.shape = torch.Size([1, 14, 4]), which corresponds to the number of entities in Nations. So the model is evaluated by replacing, for each of the 201 test triples, the head while the tail is fixed. This generates a [201, 14]-dimensional scores tensor that contains, for each of the 201 test triples, the scores of all 14 candidate triples obtained by corrupting the head (i.e., fixing the tail and relation and replacing the head with every entity in the graph; since there are 14 entities, the second dimension of the head tensor is 14).
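
The broadcasting that produces this [201, 14] score matrix can be illustrated with plain PyTorch; the sketch below only reproduces the shapes reported above and is not PyKEEN's actual evaluation code.

import torch

num_test, num_entities, dim = 201, 14, 4
h = torch.randn(1, num_entities, dim)   # one candidate head per entity
r = torch.randn(num_test, 1, dim)       # relation of each test triple
t = torch.randn(num_test, 1, dim)       # tail of each test triple

# DistMult with broadcasting: [1, 14, dim] * [201, 1, dim] -> [201, 14, dim]
scores = (h * r * t).sum(dim=-1)
print(scores.shape)  # torch.Size([201, 14])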

Step 2: Tail corruption:

Dimensions of head, tail and relation tensors during tail corruption evaluation

h torch.Size([201, 1, 4])
t torch.Size([1, 14, 4])
r torch.Size([201, 1, 4])

The heads correspond to the size of the test set, h.shape = torch.Size([201, 1, 4]), and the tail dimension is 14, t.shape = torch.Size([1, 14, 4]), which corresponds to the number of entities in Nations. So the model is evaluated by replacing, for each of the 201 test triples, the tail while the head is fixed. Again, this generates a [201, 14]-dimensional scores tensor, only this time with fixed heads and corrupted tails for each of the 201 test triples.

5. The possible issue with inverse triples

If we repeat this procedure on Nations with inverse triples we get the following:

Scenario 2: Nations with inverse relations

Step 1: head corruption:

Dimensions of head, tail and relation tensors during head corruption evaluation

h torch.Size([201, 1, 4])
t torch.Size([1, 14, 4])
r torch.Size([201, 1, 4])

Even though we are in head corruption mode, we are presented with a head tensor that has the size of the test set. Looking at the actual embeddings in the head tensor, it turns out that they correspond to the tails of the test set. So what the interaction function receives as heads is actually the tails of the test set. We checked this by using print(h) in the interaction function, and what we get is the following:

The 10 first head embeddings:

print(h[:10]) = 
tensor([[[ 0.4079, -1.2848,  0.5473, -1.4430]],
        [[-0.6264, -1.1625,  0.6276, -0.2435]],
        [[-0.4292,  2.2524,  0.1795,  1.8248]],
        [[-0.0718,  1.0695, -0.0087, -0.6665]],
        [[ 1.2585, -1.7765, -1.7085,  0.8416]],
        [[-0.4292,  2.2524,  0.1795,  1.8248]],
        [[-0.4292,  2.2524,  0.1795,  1.8248]],
        [[-0.2661, -0.5869, -1.0888,  1.5446]],
        [[-0.6264, -1.1625,  0.6276, -0.2435]],
        [[ 0.4079, -1.2848,  0.5473, -1.4430]]])

We see that the embeddings in positions 2, 5, and 6 (counting rows from 0) are identical. The same is true for positions 0 and 9 and for positions 1 and 8. If we again look at the first 10 test triples and focus on the tail indices

dataset.testing.mapped_triples[:10]=
tensor([[ 0,  7,  5],
        [ 0, 14, 11],
        [ 0, 18, 12],
        [ 0, 20,  3],
        [ 0, 20, 10],
        [ 0, 21, 12],
        [ 1,  3, 12],
        [ 1,  7,  0],
        [ 1, 10, 11],
        [ 1, 18,  5]])

we see that positions 2, 5, and 6 contain the same entity (with index 12). The same is true for positions 0 and 9 (entity 5) and for positions 1 and 8 (entity 11). So this means that we are presented with 201 head entities that are actually the tails of the test set. This is the first observation.
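
For completeness, this is roughly the check we performed. Here h_printed is a placeholder name for the [201, 1, 4] tensor captured inside the interaction function during head corruption, and the entity-representation lookup is an assumption about the PyKEEN model API that may differ between versions.

import torch

# tail indices of the test triples
tails = dataset.testing.mapped_triples[:, 2]
# look up their embeddings (assumed API: callable entity representation with indices)
tail_emb = result.model.entity_representations[0](indices=tails)
# h_printed: the tensor the interaction function received as "heads" (placeholder)
print(torch.allclose(h_printed.squeeze(1), tail_emb))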

The second observation is that we are given 201 heads and 14 tails even though we are in the head corruption step. Instead of corrupting the head position for each of the 201 test triples, we are given 201 "heads" that actually correspond to the tails of the test set, and for each of them 14 triples are created in which the tail is corrupted. So it seems that the interaction function is not receiving the right tensors for head and tail.

Let's now have a look at step 2, the tail corruption part of the evaluation process.

Step 2: tail corruption:

Dimensions of head, tail and relation tensors during tail corruption evaluation

h torch.Size([201, 1, 4])
t torch.Size([1, 14, 4])
r torch.Size([201, 1, 4])

As can be seen, the reported dimensions are identical to those of step 1 (the head corruption mode). We are now presented with the expected dimensions for head and tail: for each of the 201 test triples, we want to replace the tail with every entity while the head is fixed.

Looking at the first 10 head embeddings gives us:

print(h[:10]) = 
tensor([[[-0.2661, -0.5869, -1.0888,  1.5446]],
        [[-0.2661, -0.5869, -1.0888,  1.5446]],
        [[-0.2661, -0.5869, -1.0888,  1.5446]],
        [[-0.2661, -0.5869, -1.0888,  1.5446]],
        [[-0.2661, -0.5869, -1.0888,  1.5446]],
        [[-0.2661, -0.5869, -1.0888,  1.5446]],
        [[ 2.1995,  0.4334,  1.2047,  0.2162]],
        [[ 2.1995,  0.4334,  1.2047,  0.2162]],
        [[ 2.1995,  0.4334,  1.2047,  0.2162]],
        [[ 2.1995,  0.4334,  1.2047,  0.2162]]])

We see that the first six embeddings (rows 0 to 5) are identical. From row 6 to 9 we see another embedding. Looking again at the test set and focusing this time on the head position

dataset.testing.mapped_triples[:10]=
tensor([[ 0,  7,  5],
        [ 0, 14, 11],
        [ 0, 18, 12],
        [ 0, 20,  3],
        [ 0, 20, 10],
        [ 0, 21, 12],
        [ 1,  3, 12],
        [ 1,  7,  0],
        [ 1, 10, 11],
        [ 1, 18,  5]])

we see that the first 6 heads are identical. This is in agreement with the first 10 head embeddings. So in step 2 (the tail corruption mode) we are given the correct heads, and for each of the 201 test triples we generate 14 versions in which the tail is replaced by every entity.

So, in summary: for datasets with inverse triples, i.e., create_inverse_triples=True, we find that in both steps of the evaluation process (step 1: head corruption, step 2: tail corruption) the dimensions of the head and tail tensors are identical (whereas they differ when using datasets with create_inverse_triples=False). In fact, in step 1 the heads correspond to the tails of the test set, and only in step 2 are the heads the actual heads.

6. Is this a problem that might skew metrics?

Is it possible that what we have just reported modifies the evaluation process when datasets with inverse triples are used (as is, for example, required when NodePiece is used)?

Environment

Key              Value
OS               posix
Platform         Darwin
Release          22.6.0
Time             Fri Nov 17 17:58:24 2023
Python           3.9.5
PyKEEN           1.10.2-dev
PyKEEN Hash      c94213c
PyKEEN Branch    master
PyTorch          2.1.0
CUDA Available?  false
CUDA Version     N/A
cuDNN Version    N/A

Additional information

No response

Issue Template Checks

  • This is not a feature request (use a different issue template if it is)
  • This is not a question (use the discussions forum instead)
  • I've read the text explaining why including environment information is important and understand if I omit this information that my issue will be dismissed
martytom added the bug (Something isn't working) label Nov 17, 2023
mberr (Member) commented Nov 18, 2023

Hi @martytom ,

if you train a model on a dataset that uses inverse relations, this has an effect on the model:

  • the model will have twice as many relation representations
  • the model will convert any head prediction $(?, r, t)$ to inverse tail prediction, i.e., $(t, r^{-1}, ?)$

Thus, the evaluation results are expected to be different since you are evaluating a different model on the same evaluation dataset.

You can verify that the two models are different by, e.g., looking at their string representations.
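
For example, something along the following lines (a sketch; attribute names such as result.training may differ slightly across versions):

for create_inverse in (False, True):
    result = pipeline(
        random_seed=126,
        dataset=Nations(create_inverse_triples=create_inverse),
        model=model,
        training_kwargs=dict(num_epochs=0),
    )
    print(result.model)                   # the string representation differs
    print(result.training.num_relations)  # doubled when inverse triples are used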
