Possible issue with model evaluation when using datasets with inverse triples #1346
Describe the bug
Using the built-in dataset `Nations`, we show that there might be a problem with the model evaluation process when using datasets with inverse triples, i.e., when `dataset = Nations(create_inverse_triples=True)`. The evaluation process in that case seems to differ from the process when using a dataset without inverse triples, i.e., when `dataset = Nations(create_inverse_triples=False)`. In the following we provide a minimal example to show the difference between the evaluation process for these two scenarios:

Scenario 1: `dataset = Nations(create_inverse_triples=False)`

Scenario 2: `dataset = Nations(create_inverse_triples=True)`

and show results for metrics using identical setups, apart from the fact that scenario 1 does not use inverse triples while scenario 2 does.
How to reproduce
1. The setup
Required imports.
Use the standard DistMult interaction function.
Take the DistMult model with embedding dimension `d=4`.
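For reference, the DistMult interaction computes a trilinear dot product of the head, relation, and tail embeddings. A minimal pure-Python sketch (illustrative only, independent of the actual PyKEEN implementation; the example vectors are made up):

```python
# Sketch of the DistMult interaction: score(h, r, t) = sum_i h_i * r_i * t_i.
# Plain Python lists stand in for the d=4 embedding vectors used in this issue.

def distmult_score(h, r, t):
    """Trilinear dot product of head, relation, and tail embeddings."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

h = [0.1, -0.2, 0.3, 0.4]
r = [1.0, 0.5, -1.0, 2.0]
t = [0.2, 0.1, 0.0, -0.3]

score = distmult_score(h, r, t)  # ≈ -0.23
```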
We create a regular pipeline object with a basic setup and set `num_epochs=0` to only evaluate the model on the initialized embeddings, such that no training happens. We set `random_seed=126` to guarantee that the initial embeddings in both scenarios are identical. In each run we use the same parameters so that we can properly compare the model evaluation for scenario 1 and scenario 2. Since inverse triples are not being used in the evaluation process, the resulting scores for all metrics should be identical in both scenarios, because we set `num_epochs=0` and we use the same seed, `random_seed=126`.
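The argument above relies on seeded initialization being deterministic: with `num_epochs=0`, the evaluated embeddings are exactly the initial ones, so the same seed must yield the same embeddings. A stdlib sketch of that reasoning (using Python's `random` as a stand-in; this is not PyKEEN's actual initializer):

```python
import random

def init_embeddings(seed, num_entities=14, dim=4):
    """Deterministically initialize a toy embedding matrix from a seed."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_entities)]

# Two "runs" with the same seed, as in scenario 1 and scenario 2.
run1 = init_embeddings(126)
run2 = init_embeddings(126)

identical = run1 == run2  # same seed, no training => same embeddings
```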
2. Model results for both scenarios
From this, we get the following results for the `hits_at_3` metric:

Scenario 1: `dataset = Nations(create_inverse_triples=False)`

Scenario 2: `dataset = Nations(create_inverse_triples=True)`
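For context, `hits_at_3` is the fraction of test triples for which the true entity is ranked among the top 3 candidates. A minimal sketch of the metric (simplified; it ignores the filtering and tie-handling details of the real evaluator, and the ranks are made-up values):

```python
def hits_at_k(ranks, k=3):
    """Fraction of test triples whose true entity has rank <= k (ranks are 1-based)."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)

# Hypothetical ranks of the true entity for five test triples.
ranks = [1, 4, 2, 7, 3]
value = hits_at_k(ranks, k=3)  # 3 of 5 ranks are <= 3 -> 0.6
```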
As can be seen, the results are different despite the fact that the embeddings are identical. This suggests that model evaluation differs when using a dataset with `create_inverse_triples=False` compared to the same dataset with `create_inverse_triples=True`, even though inverse triples are not being used in model evaluation (such that the evaluation part of the model in scenario 1 and scenario 2 should be identical and thus give the same scores).

3. Details about the `Nations` dataset

The dataset has
`num_entities=14` and `num_triples=1992`. The size of the test set is `dataset.testing.mapped_triples.shape = torch.Size([201, 3])`, so there are 201 (mapped) triples in the test set. The first 10 triples of the test set are:

We checked that the test triples are identical for the scenarios with and without inverse triples, i.e., scenario 1 (`Nations` without inverse triples) and scenario 2 (`Nations` with inverse triples).

4. The evaluation process in detail
As far as I understand, the model evaluation on the test set happens in two steps.
In the following we will go through the 2-step evaluation process for both described scenarios (scenario 1 is `Nations` without inverse triples, scenario 2 is `Nations` with inverse triples). We print the heads `h`, tails `t`, and relations `r` that are passed to the interaction function during the evaluation loop by printing their shapes within the interaction function. Below we report the corresponding shapes for both steps (step 1: corrupt heads, step 2: corrupt tails) and the two described scenarios (scenario 1: without inverse relations, scenario 2: with inverse relations).
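To make the expected shapes concrete, here is a pure-Python sketch of the two corruption steps on a toy graph (4 entities, 2 relations, 3 test triples; the real `Nations` case would give a 201 × 14 score matrix per step). This illustrates the intended behaviour, not PyKEEN's actual evaluation loop, and all values are made up:

```python
def distmult(h, r, t):
    return sum(a * b * c for a, b, c in zip(h, r, t))

# Toy embeddings: 4 entities, 2 relations, dimension 4.
ent = [[0.1 * (i + j) for j in range(4)] for i in range(4)]
rel = [[0.2 * (i + j) for j in range(4)] for i in range(2)]

# Toy test triples (h, r, t).
test = [(0, 0, 1), (2, 1, 3), (1, 0, 2)]

# Step 1: head corruption -- for each test triple, score every entity as head.
head_scores = [[distmult(ent[e], rel[r], ent[t]) for e in range(len(ent))]
               for (_, r, t) in test]

# Step 2: tail corruption -- for each test triple, score every entity as tail.
tail_scores = [[distmult(ent[h], rel[r], ent[e]) for e in range(len(ent))]
               for (h, r, _) in test]

# Resulting score matrix shape: (num_test_triples, num_entities) = (3, 4).
shape = (len(head_scores), len(head_scores[0]))
```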
Scenario 1: `Nations` without inverse relations

Step 1: head corruption:
Dimensions of head, tail and relation tensors during head corruption evaluation
Tails correspond to the size of the test set, `t.shape = torch.Size([201, 1, 4])`, and head has size 14, `h.shape = torch.Size([1, 14, 4])`, which corresponds to the number of entities in `Nations`. So the model is evaluated by replacing, for each of the 201 triples, the head while the tail is fixed. This generates a `[201, 14]`-dimensional scores tensor that contains, for each of the 201 test triples, the scores of all 14 possible triples that can be generated by corrupting the head (i.e., fixing the tail and replacing the head with any entity in the graph; since there are 14 entities, the second dimension of the head tensor is 14).

Step 2: tail corruption:
Dimensions of head, tail and relation tensors during tail corruption evaluation
Heads correspond to the size of the test set, `h.shape = torch.Size([201, 1, 4])`, and tail has size 14, `t.shape = torch.Size([1, 14, 4])`, which corresponds to the number of entities in `Nations`. So the model is evaluated by replacing, for each of the 201 triples, the tail while the head is fixed. Again, this generates a `[201, 14]`-dimensional scores tensor, only this time with fixed heads and corrupted tails for each of the 201 test triples.

5. The possible issue with inverse triples
If we repeat this procedure on `Nations` with inverse triples, we get the following:

Scenario 2: `Nations` with inverse relations

Step 1: head corruption:
Dimensions of head, tail and relation tensors during head corruption evaluation
Even though we are in head corruption mode, we are presented with a head tensor that has the size of the test set. By looking at the actual embeddings in the head tensor, it turns out that they correspond to the tails of the test set. So what the interaction function is given as heads is actually the tails of the test set. We checked this by using `print(h)` in the interaction function, and what we get is the following:

The first 10 head embeddings:
We see that the embedding in position 2 (if we start counting rows at 0) and those in positions 5 and 6 are identical. The same is true for positions 0 and 9 and for positions 1 and 8. If we again have a look at the first 10 test triples and focus on the tail indices, we see that in positions 2, 6, and 7 we have the same entity (with index 12). The same is true for positions 0 and 9 (entity 5) and positions 1 and 8 (entity 11). So this means that we are presented with 201 head entities that are actually the tails of the test set. This is the first observation.
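The first observation rests on the fact that equal entity indices must produce equal embedding rows. A small sketch of that lookup logic (the embedding values are toy numbers, and the index list is a hypothetical stand-in echoing the repeated positions described above):

```python
# A toy embedding table: row i is the embedding of entity i (values are made up).
emb = [[float(i), float(i) * 2] for i in range(14)]

# Hypothetical tail indices with the repetition pattern described above:
# entity 12 at positions 2/6/7, entity 5 at 0/9, entity 11 at 1/8.
tails = [5, 11, 12, 3, 4, 6, 12, 12, 11, 5]

rows = [emb[i] for i in tails]

# Equal indices => equal embedding rows.
same_0_9 = rows[0] == rows[9]
same_1_8 = rows[1] == rows[8]
same_2_6_7 = rows[2] == rows[6] == rows[7]
```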
The second observation is the fact that we are given 201 heads and 14 tails even though we are in the head corruption step. Instead of corrupting the head position for each of the 201 test triples, we are given 201 heads that actually correspond to the tails of the test set, and for each of them we create 14 triples in which the tail is corrupted. So it seems that the interaction function is not receiving the right tensors for head and tail.
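One plausible explanation for this pattern: with inverse triples, head prediction for (h, r, t) can be implemented as tail prediction for (t, r_inv, h), which would produce exactly "heads that are actually the tails". A sketch of that rewriting, assuming inverse relations are indexed as `r + num_relations` (the actual PyKEEN indexing convention may differ, so treat this as illustrative):

```python
def rewrite_head_prediction(triple, num_relations):
    """Rewrite a head-prediction query (?, r, t) as a tail-prediction
    query (t, r_inv, ?) over the inverse relation."""
    h, r, t = triple
    r_inv = r + num_relations  # assumed inverse-relation indexing
    return (t, r_inv, None)    # None marks the position to corrupt

# Example: head prediction for a hypothetical test triple (3, 2, 7),
# with e.g. 55 base relations.
query = rewrite_head_prediction((3, 2, 7), num_relations=55)  # -> (7, 57, None)
```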
Let's now have a look at step 2, the tail corruption part of the evaluation process.
Step 2: tail corruption:
Dimensions of head, tail and relation tensors during tail corruption evaluation
As can be seen, the presented dimensions are identical to step 1 (the head corruption mode). We are now presented with the expected dimensions for head and tail: we want to corrupt the tail with any entity for each of the 201 test triples, and we fix the head.
Looking at the first 10 head embeddings gives us:
We see that the first six embeddings (rows 0 to 5) are identical. In rows 6 to 9 we see another embedding. Having again a look at the test set and focusing this time on the head position, we see that the first 6 heads are identical. This is in agreement with the first 10 embeddings of the heads. So in step 2 (the tail corruption mode) we are now given the correct heads, and we generate, for each of the 201 test triples, 14 versions in which the tail is replaced by any entity.
So, in summary: for datasets where inverse triples are present, i.e., `create_inverse_triples=True`, we find that in both steps of the evaluation process (step 1: head corruption, step 2: tail corruption) the dimensions of the head and tail tensors are identical (whereas they are different when using datasets with `create_inverse_triples=False`). In fact, in step 1 the heads correspond to the tails, and in step 2 the heads are the actual heads.

6. Is this a problem that might skew metrics?
Is it possible that what we have just reported distorts the evaluation process when datasets with inverse triples are being used (as is, for example, required when NodePiece is used)?