Incorrect masking of reverse relations in evaluation procedure #18

TimDettmers · 2018-05-07T08:24:16Z

This is to document a bug brought to my attention by Victoria Lin from Salesforce Research. She noted the following:

The problem is caused by the design of the dictionary keys. For both directions, the relation part of the key is the same. This causes some false positives to be mixed into the ground truth sets.
Consider a relation of the construct:
(A, father of, B)*
(B, father of, C)*
The statement d_egraph[(e2, rel)].add(e1) added A as a correct answer for (B, father of, ?). As a result, A does not trigger a rank penalty in evaluation while it should. A model that predicts an entity ranking list [A, C, ...] receives a measure of rank 1 (while the correct measure should be 2).

* example altered for clarity.
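To make the rank arithmetic in the example concrete, here is a minimal sketch of a filtered rank computation. The function name `filtered_rank` and the extra entity `D` are illustrative assumptions, not code from the repository.

```python
def filtered_rank(ranking, target, known_answers):
    """Rank of `target` after dropping other known-correct answers.

    `ranking` is the model's ordered prediction list, best first.
    Entities in `known_answers` (other than `target`) are masked out,
    so they cannot push the target down.
    """
    filtered = [e for e in ranking if e == target or e not in known_answers]
    return filtered.index(target) + 1


# Hypothetical model ranking for the query (B, father of, ?); true answer is C.
ranking = ["A", "C", "D"]

# Buggy ground-truth set: A was wrongly added as a correct answer,
# so it is masked and C moves up to rank 1.
print(filtered_rank(ranking, "C", {"A", "C"}))  # 1

# Correct ground-truth set: A is a wrong prediction and keeps its place,
# so C is at rank 2.
print(filtered_rank(ranking, "C", {"C"}))  # 2
```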

In other words, for the test triples:

(Mike, father of, John)
(John, father of, Tom)

At test time, the masks of existing triples (as computed in wrangle_KG.py) would be:

(John, fatherOf, ?) -> mask = {Mike, Tom}
(?, fatherOf, John) -> mask = {Mike, Tom}

while the correct masks should be:

(John, fatherOf, ?) -> mask = {Tom}
(?, fatherOf, John) -> mask = {Mike}
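The two sets of masks above can be reproduced with a small sketch. This is not the code from wrangle_KG.py; the dictionary names `buggy`, `tails`, and `heads` are assumptions chosen for illustration, and only the shared-key pattern is taken from the description above.

```python
from collections import defaultdict

triples = [("Mike", "fatherOf", "John"), ("John", "fatherOf", "Tom")]

# Buggy keying: both directions use the same (entity, rel) key,
# so heads and tails end up in one set.
buggy = defaultdict(set)
for e1, rel, e2 in triples:
    buggy[(e1, rel)].add(e2)
    buggy[(e2, rel)].add(e1)

# Direction-aware keying: one dictionary per query direction.
tails = defaultdict(set)  # answers for (e1, rel, ?)
heads = defaultdict(set)  # answers for (?, rel, e2)
for e1, rel, e2 in triples:
    tails[(e1, rel)].add(e2)
    heads[(e2, rel)].add(e1)

print(sorted(buggy[("John", "fatherOf")]))  # ['Mike', 'Tom']
print(sorted(tails[("John", "fatherOf")]))  # ['Tom']
print(sorted(heads[("John", "fatherOf")]))  # ['Mike']
```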

Fixing the issue was not simple since ConvE is, unlike other link predictors, directional due to 1-N scoring. If we want to score (E, rel, e2) in ConvE, where E is the set of all entities, then we can only do this by computing (e2, rel, E). One could simply ignore the fact that the model is directional and provide different masks for correctness, but this lowers the scores for ConvE, since it would predict the same values for (e1, rel, E) and (e2, rel, E) although the labels are different.

The solution that I opted for was to introduce a "reverse relation" to indicate the direction of evaluation. If ConvE is evaluated from right to left, that is, (E, rel, e2) then we would compute the ConvE score with (e2, rel_reverse, E); for evaluations from left to right, the scoring remains the same (e1, rel, E).

This bugfix was implemented in d830ddf.
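The reverse-relation idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the commit: the `_reverse` suffix, the direction labels, and the helper names `build_masks` and `query_key` are all hypothetical.

```python
from collections import defaultdict

REVERSE_SUFFIX = "_reverse"  # hypothetical naming convention


def build_masks(triples):
    """Map each (entity, relation) query to its set of correct answers.

    Right-to-left queries are stored under a separate reverse relation,
    so the two directions no longer share a key.
    """
    masks = defaultdict(set)
    for e1, rel, e2 in triples:
        masks[(e1, rel)].add(e2)                   # (e1, rel, ?)
        masks[(e2, rel + REVERSE_SUFFIX)].add(e1)  # (?, rel, e2)
    return masks


def query_key(known_entity, rel, direction):
    """Key used for both scoring and masking in the given direction."""
    if direction == "left_to_right":
        return (known_entity, rel)
    if direction == "right_to_left":
        return (known_entity, rel + REVERSE_SUFFIX)
    raise ValueError(f"unknown direction: {direction}")


triples = [("Mike", "fatherOf", "John"), ("John", "fatherOf", "Tom")]
masks = build_masks(triples)

print(masks[query_key("John", "fatherOf", "left_to_right")])  # {'Tom'}
print(masks[query_key("John", "fatherOf", "right_to_left")])  # {'Mike'}
```

Because the reverse relation is a distinct relation from the model's point of view, a 1-N model given (e2, rel_reverse, E) can produce different scores than it does for (e2, rel, E), which matches the different labels in each direction.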

New Results

Currently, I do not have the compute resources to run a grid search for new values, but I found the following differences in scores. Here + means an increase in score (good for Hits and MRR) and - means a decrease in score (good for MR).

Better

UMLS:
- MR: -1, MRR: +0.13, Hits@10: +0.01, Hits@3: +0.06, Hits@1: +0.20
WN18RR:
- MR: -1090, MRR: 0.0, Hits@10: +0.03, Hits@3: +0.01, Hits@1: -0.010
FB15k-237:
- MR: -2, MRR: +0.009, Hits@10: +0.010, Hits@3: +0.006, Hits@1: -0.002

Almost no change

Kinship:
- MR: 0, MRR: -0.01, Hits@10: +0.01, Hits@3: 0.00, Hits@1: -0.02
WN18:
- MR: -530, MRR: +0.001, Hits@10: +0.001, Hits@3: -0.001, Hits@1: 0.000

Worse

FB15k?
YAGO3-10?

There seems to be something wrong with the FB15k scores, and I have to investigate what exactly that is. I am still computing the YAGO3-10 scores.

I will update the paper once I have all the scores.

TimDettmers mentioned this issue May 7, 2018

Can not reproduce results in the paper for WN18RR dataset #15

Closed

TimDettmers closed this as completed May 9, 2018

TimDettmers mentioned this issue Jul 4, 2018

Not able to replicate FB15k scores #26

Closed

TimDettmers mentioned this issue Apr 28, 2019

Reopening issue #43 on data augmentation with reversed triples #45

Closed

TimDettmers mentioned this issue Jun 6, 2019

How many epochs? #47

Closed
