Add side-specific evaluation #44

Merged: 21 commits merged into master on Jul 21, 2020
Conversation

@mberr (Member) commented Jul 10, 2020

This PR adds additional variants of the already computed rank-based metrics, differentiating between head and tail prediction, as well as reporting the "old" average scores. Reporting scores for individual sides may help to reveal model deficiencies (e.g. a model that can only successfully predict tails).

There are some open questions on the implementation side, e.g. how to address specific sides when running from the CLI.

@cthoyt (Member) commented Jul 10, 2020

Can you report these new results in addition to the old style ones? Or is this a good time to completely stop reporting combined results?

@mberr (Member, Author) commented Jul 10, 2020

In the current implementation I report "head" and "tail", which are the one-sided evaluations, as well as "both", which uses both sides and is equivalent to the "old" version.
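
For illustration, a minimal sketch (hypothetical names, not the PyKEEN API) of how the three variants relate, given the per-triple ranks for each side:

```python
# Hypothetical sketch: side-specific rank-based metrics plus the combined
# ("both") variant, computed from raw 1-based rank arrays.
import numpy as np

def rank_metrics(ranks: np.ndarray) -> dict:
    """Compute mean rank, MRR, and Hits@10 from an array of 1-based ranks."""
    return {
        "mean_rank": float(ranks.mean()),
        "mean_reciprocal_rank": float((1.0 / ranks).mean()),
        "hits_at_10": float((ranks <= 10).mean()),
    }

# rank of the true head/tail entity for each test triple (toy numbers)
head_ranks = np.array([1, 4, 120, 2])
tail_ranks = np.array([3, 1, 17, 5])

results = {
    "head": rank_metrics(head_ranks),
    "tail": rank_metrics(tail_ranks),
    # "both" pools the ranks from both sides and reproduces the "old" scores
    "both": rank_metrics(np.concatenate([head_ranks, tail_ranks])),
}
```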

@cthoyt (Member) commented Jul 10, 2020

Can we come up with a simple metric to aggregate head/tail for each? Like for each metric, report the percentage difference between head and tail? So if it's small, it means there are no directionality problems, and if it's big, then there are.

For example, for mean rank we could also calculate the standard deviation/variance of the ranks. Then we have two means and two standard deviations, so we could do a statistical test to ask whether they're the same, like the independent two-sample t-test (equal sample sizes and variance, or equal/unequal sample sizes with similar variances; see https://en.wikipedia.org/wiki/Student%27s_t-test).
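
A minimal sketch of this suggestion, assuming the per-triple ranks for each side are available as NumPy arrays (the toy numbers and variable names are illustrative):

```python
# Hedged sketch: compare head vs. tail ranks with a two-sample t-test.
# Welch's variant (equal_var=False) does not assume equal variances.
import numpy as np
from scipy import stats

head_ranks = np.array([1, 4, 120, 2, 33, 8])   # toy data
tail_ranks = np.array([3, 1, 17, 5, 150, 2])

t_stat, p_value = stats.ttest_ind(head_ranks, tail_ranks, equal_var=False)

# The simple aggregate suggested above: relative difference of the mean ranks.
rel_diff = abs(head_ranks.mean() - tail_ranks.mean()) / tail_ranks.mean()
print(f"t={t_stat:.3f}, p={p_value:.3f}, relative difference={rel_diff:.2%}")
```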

@mberr (Member, Author) commented Jul 12, 2020

> Can we come up with a simple metric to aggregate head/tail for each? Like for each metric, report the percentage difference between head and tail? So if it's small, it means there are no directionality problems, and if it's big, then there are.

I am not sure whether the numbers are directly comparable, since they evaluate different problems (e.g. "predict where Max was born" vs. "predict a person born in Germany"). When using filtered evaluation, we thus also end up having a different number of choices remaining after the filter step. The latter could be solved by looking at the adjusted version of mean rank (sketched below).

> For example, for mean rank we could also calculate the standard deviation/variance of the ranks. Then we have two means and two standard deviations, so we could do a statistical test to ask whether they're the same, like the independent two-sample t-test (equal sample sizes and variance, or equal/unequal sample sizes with similar variances; see https://en.wikipedia.org/wiki/Student%27s_t-test).

Again, I am not sure about the validity of such a test, since it compares the results of two different tasks that happen to be measured with the same metric.

But we can at least report those values and leave the interpretation up to the user. We should provide the interpretation hints discussed here to help users interpret the results.
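
A hedged sketch of the adjusted mean rank idea referenced above (the exact definition in the code base may differ): divide the observed mean rank by the mean rank expected under random scoring, which accounts for differently sized filtered candidate sets on the two sides.

```python
# Hypothetical helper, not the PyKEEN API.
import numpy as np

def adjusted_mean_rank(ranks: np.ndarray, num_candidates: np.ndarray) -> float:
    """ranks[i]: 1-based rank of the true entity for test triple i;
    num_candidates[i]: number of candidates remaining after filtering."""
    # expected rank of a uniformly random guess among N candidates is (N + 1) / 2
    expected_ranks = (num_candidates + 1) / 2.0
    return float(ranks.mean() / expected_ranks.mean())  # ~1.0 means random, lower is better
```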

@cthoyt (Member) commented Jul 14, 2020

@mberr are there any other goals for this PR? Otherwise, you should have @lvermue check it over.

@mberr mberr marked this pull request as ready for review July 14, 2020 13:17
@mberr mberr requested a review from lvermue July 14, 2020 13:17
@lvermue (Member) left a comment

@mberr I have added the adjusted_mean_rank case detection as an example. Please feel free to adjust or revert 😄 Aside from that, everything looks fine to me 👍

@mberr mberr requested a review from cthoyt July 15, 2020 07:23
@mberr (Member, Author) commented Jul 15, 2020

@cthoyt Ready from my side for Squash & Merge.

@cthoyt (Member) left a comment

@mberr from the code side it's fine, but I added a tutorial page that I'd like you to fill out. It has 4 questions that users will need answered; feel free to elaborate on anything else a user should know to understand what's coming out of this new code.

This part of the tutorial aims to help you understand the rank-based evaluation metrics reported
in :class:`pykeen.evaluation.RankBasedMetricResults`.

Side-Specific Metrics
Member commented:
@mberr fill out here, please

Member Author replied:
Since there is no general discussion of the evaluation, it might be better to start there 😉 I can take care of it, but it will take me a bit 🙂

Member Author replied:
Since this PR is not a fix, but a new feature, I think it should be okay to have some delay here to improve the documentation 🙂

Member Author replied:
@cthoyt I started adding a bit about the general evaluation 🙂

:math:`score(h,r,t)=\sum_{i=1}^d \mathbf{h}_i \cdot \mathbf{r}_i \cdot \mathbf{t}_i`. Here, we can score all
entities as candidate heads for a given tail and relation by first computing the element-wise product of tail and
relation, and then performing a matrix multiplication with the matrix of all entity embeddings.
# TODO: Link to section explaining this concept.
Member Author commented:
@lvermue (@cthoyt) I think we should describe this somewhere in a dedicated section of the documentation where we talk about performance optimizations.

Member replied:
agreed! I made a placeholder for this in #55
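
For reference, a minimal sketch of the batched head scoring described in the quoted passage, assuming a DistMult-style interaction and PyTorch tensors (all variable names are illustrative):

```python
# score(h, r, t) = sum_i h_i * r_i * t_i, evaluated for every candidate head
# in one shot: element-wise product of relation and tail, then one matmul
# against the full entity embedding matrix.
import torch

num_entities, dim = 1000, 64
entity_emb = torch.rand(num_entities, dim)  # embeddings of all entities
relation = torch.rand(dim)                  # embedding of the query relation
tail = torch.rand(dim)                      # embedding of the query tail

rt = relation * tail              # shape: (dim,)
head_scores = entity_emb @ rt     # shape: (num_entities,), one score per candidate head
```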

@mberr mberr requested a review from cthoyt July 21, 2020 08:39
@mberr (Member, Author) commented Jul 21, 2020

@cthoyt Sidedness is something that could also be done for, e.g., the sklearn-based metrics 🙂

@cthoyt (Member) commented Jul 21, 2020

@mberr thanks for adding the write-up. Will merge when Travis passes!

@codecov-commenter commented
Codecov Report

Merging #44 into master will decrease coverage by 1.14%.
The diff coverage is 64.44%.


@@            Coverage Diff             @@
##           master      #44      +/-   ##
==========================================
- Coverage   70.01%   68.87%   -1.15%     
==========================================
  Files          90       91       +1     
  Lines        5166     5368     +202     
  Branches      601      639      +38     
==========================================
+ Hits         3617     3697      +80     
- Misses       1368     1483     +115     
- Partials      181      188       +7     
Impacted Files Coverage Δ
src/pykeen/__init__.py 100.00% <ø> (ø)
src/pykeen/ablation/ablation.py 0.00% <0.00%> (ø)
src/pykeen/cli.py 0.00% <0.00%> (ø)
src/pykeen/datasets/generate.py 0.00% <0.00%> (ø)
src/pykeen/experiments/cli.py 55.91% <ø> (ø)
src/pykeen/experiments/validate.py 63.91% <0.00%> (ø)
src/pykeen/hpo/__init__.py 100.00% <ø> (ø)
src/pykeen/hpo/pruners.py 100.00% <ø> (ø)
src/pykeen/hpo/samplers.py 100.00% <ø> (ø)
src/pykeen/models/multimodal/__init__.py 100.00% <ø> (ø)
... and 65 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0503e62...3ea749d.

@cthoyt cthoyt merged commit 1c56f7a into master Jul 21, 2020
@cthoyt cthoyt deleted the head-tail-specific-scores branch July 21, 2020 11:56