Add side-specific evaluation #44

Merged: 21 commits merged into master on Jul 21, 2020
Conversation

@mberr (Member) commented Jul 10, 2020

This PR adds additional variants of the already computed rank-based metrics, differentiating between head and tail prediction, as well as reporting the "old" average scores. Reporting scores for individual sides may help to reveal model deficiencies (e.g. a model that can only successfully predict tails).

There are some open questions on the implementation side, e.g. how to address specific sides when running from the CLI.

@cthoyt (Member) commented Jul 10, 2020

Can you report these new results in addition to the old style ones? Or is this a good time to completely stop reporting combined results?

@mberr (Member, Author) commented Jul 10, 2020

In the current implementation I report "head" and "tail", which are the one-sided evaluations, as well as "both", which uses both sides and is equivalent to the "old" version.
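
For illustration, a minimal sketch (hypothetical names, not the PyKEEN API) of how the three variants relate, given the per-triple ranks for each side:

```python
# Hypothetical sketch: side-specific rank-based metrics plus the combined
# ("both") variant, computed from raw 1-based rank arrays.
import numpy as np

def rank_metrics(ranks: np.ndarray) -> dict:
    """Compute mean rank, MRR, and Hits@10 from an array of 1-based ranks."""
    return {
        "mean_rank": float(ranks.mean()),
        "mean_reciprocal_rank": float((1.0 / ranks).mean()),
        "hits_at_10": float((ranks <= 10).mean()),
    }

# rank of the true head/tail entity for each test triple (toy numbers)
head_ranks = np.array([1, 4, 120, 2])
tail_ranks = np.array([3, 1, 17, 5])

results = {
    "head": rank_metrics(head_ranks),
    "tail": rank_metrics(tail_ranks),
    # "both" pools the ranks from both sides and reproduces the "old" scores
    "both": rank_metrics(np.concatenate([head_ranks, tail_ranks])),
}
```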

@cthoyt (Member) commented Jul 10, 2020

Can we come up with a simple metric to aggregate head/tail for each? Like for each metric, report the percentage difference between head and tail? So if it's small, it means there are no directionality problems, and if it's big, then there are.

For example, for mean rank we could also calculate the standard deviation/variance of the ranks. Then we have two means and two standard deviations, so we could do a statistical test to ask whether they're the same, like the independent two-sample t-test (equal sample sizes and variance, or equal/unequal sample sizes with similar variances; see https://en.wikipedia.org/wiki/Student%27s_t-test).
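
A minimal sketch of this suggestion, assuming the per-triple ranks for each side are available as NumPy arrays (the toy numbers and variable names are illustrative):

```python
# Hedged sketch: compare head vs. tail ranks with a two-sample t-test.
# Welch's variant (equal_var=False) does not assume equal variances.
import numpy as np
from scipy import stats

head_ranks = np.array([1, 4, 120, 2, 33, 8])   # toy data
tail_ranks = np.array([3, 1, 17, 5, 150, 2])

t_stat, p_value = stats.ttest_ind(head_ranks, tail_ranks, equal_var=False)

# The simple aggregate suggested above: relative difference of the mean ranks.
rel_diff = abs(head_ranks.mean() - tail_ranks.mean()) / tail_ranks.mean()
print(f"t={t_stat:.3f}, p={p_value:.3f}, relative difference={rel_diff:.2%}")
```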

@mberr (Member, Author) commented Jul 12, 2020

> Can we come up with a simple metric to aggregate head/tail for each? Like for each metric, report the percentage difference between head and tail? So if it's small, it means there are no directionality problems, and if it's big, then there are.

I am not sure whether the numbers are directly comparable, since they evaluate different problems (e.g. "predict where Max was born" vs. "predict a person born in Germany"). When using filtered evaluation, we thus also end up having a different number of choices remaining after the filter step. The latter could be solved by looking at the adjusted version of mean rank (sketched below).

> For example, for mean rank we could also calculate the standard deviation/variance of the ranks. Then we have two means and two standard deviations, so we could do a statistical test to ask whether they're the same, like the independent two-sample t-test (equal sample sizes and variance, or equal/unequal sample sizes with similar variances; see https://en.wikipedia.org/wiki/Student%27s_t-test).

Again, I am not sure about the validity of such a test, since it compares the results of two different tasks that happen to be measured with the same metric.

But we can at least report those values and leave the interpretation up to the user. We should provide the interpretation hints discussed here to help users interpret the results.
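
A hedged sketch of the adjusted mean rank idea referenced above (the exact definition in the code base may differ): divide the observed mean rank by the mean rank expected under random scoring, which accounts for differently sized filtered candidate sets on the two sides.

```python
# Hypothetical helper, not the PyKEEN API.
import numpy as np

def adjusted_mean_rank(ranks: np.ndarray, num_candidates: np.ndarray) -> float:
    """ranks[i]: 1-based rank of the true entity for test triple i;
    num_candidates[i]: number of candidates remaining after filtering."""
    # expected rank of a uniformly random guess among N candidates is (N + 1) / 2
    expected_ranks = (num_candidates + 1) / 2.0
    return float(ranks.mean() / expected_ranks.mean())  # ~1.0 means random, lower is better
```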

@cthoyt (Member) commented Jul 14, 2020

@mberr are there any other goals for this PR? Otherwise, you should have @lvermue check it over.

@mberr mberr marked this pull request as ready for review July 14, 2020 13:17
@mberr mberr requested a review from lvermue July 14, 2020 13:17
@lvermue (Member) left a comment

@mberr I have added the adjusted_mean_rank case detection as an example. Please feel free to adjust or revert 😄 Aside from that, everything looks fine to me 👍

@mberr mberr requested a review from cthoyt July 15, 2020 07:23
@mberr (Member, Author) commented Jul 15, 2020

@cthoyt Ready from my side for Squash & Merge.

@cthoyt (Member) left a comment

@mberr from the code side it's fine, but I added a tutorial page that I'd like you to fill out. It has 4 questions that users will need answered; feel free to elaborate on anything else a user should know to understand what's coming out of this new code.

This part of the tutorial aims to help you understand the rank-based evaluation metrics reported
in :class:`pykeen.evaluation.RankBasedMetricResults`.

Side-Specific Metrics
Member commented:
@mberr fill out here, please

Member Author replied:
Since there is no general discussion of the evaluation, it might be better to start there 😉 I can take care of it, but it will take me a bit 🙂

Member Author replied:
Since this PR is not a fix, but a new feature, I think it should be okay to have some delay here to improve the documentation 🙂

Member Author replied:
@cthoyt I started adding a bit about the general evaluation 🙂

:math:`score(h,r,t)=\sum_{i=1}^d \mathbf{h}_i \cdot \mathbf{r}_i \cdot \mathbf{t}_i`. Here, we can score all
entities as candidate heads for a given tail and relation by first computing the element-wise product of tail and
relation, and then performing a matrix multiplication with the matrix of all entity embeddings.
# TODO: Link to section explaining this concept.
Member Author commented:
@lvermue (@cthoyt) I think we should describe this somewhere in a dedicated section of the documentation where we talk about performance optimizations.

Member replied:
agreed! I made a placeholder for this in #55
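
For reference, a minimal sketch of the batched head scoring described in the quoted passage, assuming a DistMult-style interaction and PyTorch tensors (all variable names are illustrative):

```python
# score(h, r, t) = sum_i h_i * r_i * t_i, evaluated for every candidate head
# in one shot: element-wise product of relation and tail, then one matmul
# against the full entity embedding matrix.
import torch

num_entities, dim = 1000, 64
entity_emb = torch.rand(num_entities, dim)  # embeddings of all entities
relation = torch.rand(dim)                  # embedding of the query relation
tail = torch.rand(dim)                      # embedding of the query tail

rt = relation * tail              # shape: (dim,)
head_scores = entity_emb @ rt     # shape: (num_entities,), one score per candidate head
```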

@mberr mberr requested a review from cthoyt July 21, 2020 08:39
@mberr (Member, Author) commented Jul 21, 2020

@cthoyt Sidedness is something that could also be done for, e.g., the sklearn-based metrics 🙂

@cthoyt (Member) commented Jul 21, 2020

@mberr thanks for adding the write-up. Will merge when Travis passes!

@codecov-commenter commented
Codecov Report

Merging #44 into master will decrease coverage by 1.14%.
The diff coverage is 64.44%.


@@            Coverage Diff             @@
##           master      #44      +/-   ##
==========================================
- Coverage   70.01%   68.87%   -1.15%     
==========================================
  Files          90       91       +1     
  Lines        5166     5368     +202     
  Branches      601      639      +38     
==========================================
+ Hits         3617     3697      +80     
- Misses       1368     1483     +115     
- Partials      181      188       +7     
Impacted Files Coverage Δ
src/pykeen/__init__.py 100.00% <ø> (ø)
src/pykeen/ablation/ablation.py 0.00% <0.00%> (ø)
src/pykeen/cli.py 0.00% <0.00%> (ø)
src/pykeen/datasets/generate.py 0.00% <0.00%> (ø)
src/pykeen/experiments/cli.py 55.91% <ø> (ø)
src/pykeen/experiments/validate.py 63.91% <0.00%> (ø)
src/pykeen/hpo/__init__.py 100.00% <ø> (ø)
src/pykeen/hpo/pruners.py 100.00% <ø> (ø)
src/pykeen/hpo/samplers.py 100.00% <ø> (ø)
src/pykeen/models/multimodal/__init__.py 100.00% <ø> (ø)
... and 65 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0503e62...3ea749d.

@cthoyt cthoyt merged commit 1c56f7a into master Jul 21, 2020
@cthoyt cthoyt deleted the head-tail-specific-scores branch July 21, 2020 11:56