Incorrect masking of reverse relations in evaluation procedure #18

Closed
TimDettmers opened this issue May 7, 2018 · 0 comments

@TimDettmers (Owner)

This is to document a bug brought to my attention by Victoria Lin from Salesforce Research. She noted the following:

The problem is caused by the design of the dictionary keys: the relation part of the key is the same for both directions. This mixes some false positives into the ground-truth sets.
Consider the following triples:
(A, father of, B)*
(B, father of, C)*
The statement d_egraph[(e2, rel)].add(e1) adds A as a correct answer for (B, father of, ?). As a result, A does not incur a rank penalty during evaluation, although it should. A model that predicts the entity ranking [A, C, ...] receives rank 1, while the correct rank is 2.

* example altered for clarity.
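
To make the missing rank penalty concrete, here is a minimal sketch of the usual filtered-rank computation, assuming that every known correct answer other than the target is removed from the ranking before the target's position is counted. The function name `filtered_rank` and the toy scores are hypothetical and not taken from the ConvE code.

```python
def filtered_rank(scores, target, known_answers):
    """Return the 1-based rank of `target` after filtering out other known answers.

    scores: dict mapping entity -> model score (higher is better).
    known_answers: set of entities that are correct for this query; all of them
    except `target` are excluded before ranking (the "filtered" setting).
    """
    target_score = scores[target]
    # Count entities that outrank the target and are not filtered out.
    better = sum(
        1
        for entity, score in scores.items()
        if entity != target and entity not in known_answers and score > target_score
    )
    return better + 1


# Query (B, father of, ?) with target C; the model ranks A above C.
scores = {"A": 0.9, "C": 0.8, "D": 0.1}

# Buggy mask: A was wrongly added as a known answer, so it is filtered out.
print(filtered_rank(scores, "C", {"A", "C"}))  # 1
# Correct mask: only C is a known answer, so A counts against the target.
print(filtered_rank(scores, "C", {"C"}))  # 2
```

Ties are resolved in the target's favour here (strict `>`); evaluation code may break ties differently.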

In other words, for the test triples:

(Mike, fatherOf, John)
(John, fatherOf, Tom)

at test time, the masks of known triples (as computed in wrangle_KG.py) would be:

(John, fatherOf, ?) -> mask = {Mike, Tom}
(?, fatherOf, John) -> mask = {Mike, Tom}

while the correct masks should be:

(John, fatherOf, ?) -> mask = {Tom}
(?, fatherOf, John) -> mask = {Mike}
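
The sketch below reproduces both sets of masks for this example. It assumes a dictionary keyed by (entity, relation), as in the quoted `d_egraph` statement; the function names and the `_reverse` suffix used to separate the right-to-left answers are illustrative assumptions in the spirit of the reverse-relation fix, not the exact code in wrangle_KG.py.

```python
from collections import defaultdict

triples = [("Mike", "fatherOf", "John"), ("John", "fatherOf", "Tom")]


def buggy_masks(triples):
    # Both directions share the same (entity, relation) key, so head and
    # tail answers are merged into one set.
    d_egraph = defaultdict(set)
    for e1, rel, e2 in triples:
        d_egraph[(e1, rel)].add(e2)
        d_egraph[(e2, rel)].add(e1)
    return d_egraph


def fixed_masks(triples):
    # Right-to-left answers are stored under a separate (hypothetical)
    # reverse-relation key, so the two directions no longer collide.
    d_egraph = defaultdict(set)
    for e1, rel, e2 in triples:
        d_egraph[(e1, rel)].add(e2)
        d_egraph[(e2, rel + "_reverse")].add(e1)
    return d_egraph


buggy = buggy_masks(triples)
fixed = fixed_masks(triples)
print(sorted(buggy[("John", "fatherOf")]))          # ['Mike', 'Tom'], used for both directions
print(sorted(fixed[("John", "fatherOf")]))          # ['Tom'], for (John, fatherOf, ?)
print(sorted(fixed[("John", "fatherOf_reverse")]))  # ['Mike'], for (?, fatherOf, John)
```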

Fixing the issue was not simple, since ConvE, unlike other link predictors, is directional due to 1-N scoring. If we want to score (E, rel, e2) in ConvE, where E denotes all entities, we can only do so by computing (e2, rel, E). One could simply ignore the directionality of the model and provide different masks for the two directions, but this lowers ConvE's scores: both (e2, rel, ?) and (?, rel, e2) would be computed as (e2, rel, E) and would therefore receive the same predictions, although their correct answers differ.
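
A toy illustration of this limitation follows. It assumes a DistMult-style element-wise product as a stand-in for ConvE's convolutional scorer, with made-up two-dimensional embeddings; only the shape of the computation matters: one call scores a query entity and relation against all entities at once.

```python
# Hypothetical toy embeddings; ConvE itself uses a convolutional scorer.
ENTITIES = ["Mike", "John", "Tom"]
EMB = {"Mike": [1.0, 0.0], "John": [0.0, 1.0], "Tom": [0.5, 0.5]}
REL = {"fatherOf": [0.3, 0.7]}


def score_1_to_n(entity, rel):
    """Return one score per entity for the query (entity, rel, ?)."""
    q = [a * b for a, b in zip(EMB[entity], REL[rel])]
    return {x: sum(a * b for a, b in zip(q, EMB[x])) for x in ENTITIES}


def score_query(e1, rel, e2):
    """Score a query where exactly one of e1/e2 is None (the unknown slot)."""
    if e2 is None:  # left to right: (e1, rel, E)
        return score_1_to_n(e1, rel)
    # right to left: (E, rel, e2) can only be computed as (e2, rel, E)
    return score_1_to_n(e2, rel)


forward = score_query("John", "fatherOf", None)   # correct answer: Tom
backward = score_query(None, "fatherOf", "John")  # correct answer: Mike
print(forward == backward)  # True: identical scores, although the answers differ
```

Because both directions produce the same score vector, no parameter setting can rank Tom first for one query and Mike first for the other.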

The solution I opted for was to introduce a "reverse relation" that indicates the direction of evaluation. If ConvE is evaluated from right to left, that is, (E, rel, e2), we compute the ConvE score as (e2, rel_reverse, E); for evaluation from left to right, the scoring remains (e1, rel, E).
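
A minimal sketch of the reverse-relation idea, again with a hypothetical DistMult-style toy scorer standing in for ConvE. The `_reverse` naming, the embedding values, and the `add_reverse_triples` helper (which assumes each training triple is mirrored so that rel_reverse gets learned from right-to-left examples) are illustrative assumptions, not the code from d830ddf.

```python
# Hypothetical toy embeddings; each relation has its own reverse twin.
ENTITIES = ["Mike", "John", "Tom"]
EMB = {"Mike": [1.0, 0.0], "John": [0.0, 1.0], "Tom": [0.5, 0.5]}
REL = {"fatherOf": [0.3, 0.7], "fatherOf_reverse": [0.9, 0.1]}


def score_1_to_n(entity, rel):
    """Return one score per entity for the query (entity, rel, ?)."""
    q = [a * b for a, b in zip(EMB[entity], REL[rel])]
    return {x: sum(a * b for a, b in zip(q, EMB[x])) for x in ENTITIES}


def score_query(e1, rel, e2):
    """Score a query where exactly one of e1/e2 is None (the unknown slot)."""
    if e2 is None:  # left to right: (e1, rel, E), unchanged
        return score_1_to_n(e1, rel)
    # right to left: (E, rel, e2) is computed as (e2, rel_reverse, E)
    return score_1_to_n(e2, rel + "_reverse")


def add_reverse_triples(triples):
    # Assumed training-side counterpart: mirror every (e1, rel, e2) as
    # (e2, rel_reverse, e1) so the reverse relation has its own examples.
    return triples + [(e2, rel + "_reverse", e1) for e1, rel, e2 in triples]


forward = score_query("John", "fatherOf", None)
backward = score_query(None, "fatherOf", "John")
print(forward != backward)  # True: the two directions now get separate scores
```

The design choice is that direction becomes part of the relation itself, so the 1-N scorer never has to be run "backwards"; the cost is one extra relation embedding per relation.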

This bugfix was implemented in d830ddf.

New Results

I currently do not have the compute resources to run a new grid search, but I found the following differences in scores. Here, + denotes an increase in score (good for Hits and MRR) and - denotes a decrease in score (good for MR).
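
For reference, here is a minimal sketch of how these metrics are conventionally computed from 1-based filtered ranks, which shows why a decrease is desirable for MR but an increase for MRR and Hits@k. The function name `link_prediction_metrics` is hypothetical and not from the ConvE code base.

```python
def link_prediction_metrics(ranks, ks=(1, 3, 10)):
    """Compute MR, MRR and Hits@k from a list of 1-based filtered ranks."""
    n = len(ranks)
    metrics = {
        "MR": sum(ranks) / n,                    # mean rank: lower is better
        "MRR": sum(1.0 / r for r in ranks) / n,  # mean reciprocal rank: higher is better
    }
    for k in ks:
        # Fraction of queries whose target lands in the top k: higher is better.
        metrics[f"Hits@{k}"] = sum(r <= k for r in ranks) / n
    return metrics


print(link_prediction_metrics([1, 2, 4, 20]))
```

In the father-of example, a single query whose rank moves from 1 to 2 drops its reciprocal rank from 1.0 to 0.5, so a mask that wrongly filters entities inflates MRR and Hits@1.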

Better

  • UMLS:
    • MR: -1, MRR: +0.13, Hits@10: +0.01, Hits@3: +0.06, Hits@1: +0.20
  • WN18RR:
    • MR: -1090, MRR: 0.0, Hits@10: +0.03, Hits@3: +0.01, Hits@1: -0.010
  • FB15k-237:
    • MR: -2, MRR: +0.009, Hits@10: +0.010, Hits@3: +0.006, Hits@1: -0.002

Almost no change

  • Kinship:
    • MR: 0, MRR: -0.01, Hits@10: +0.01, Hits@3: 0.00, Hits@1: -0.02
  • WN18:
    • MR: -530, MRR: +0.001, Hits@10: +0.001, Hits@3: -0.001, Hits@1: 0.000

Worse

  • FB15k?
  • YAGO3-10?

Something seems to be wrong with the FB15k scores, and I still have to investigate what exactly it is. I am also still computing the YAGO3-10 scores.

I will update the paper once I have all the scores.
