Add side-specific evaluation #44
Conversation
Can you report these new results in addition to the old style ones? Or is this a good time to completely stop reporting combined results?
In the current implementation I report "head" and "tail", which are for the one-sided evaluations, as well as "both", which uses both and is equivalent to the "old" version.
Can we come up with a simple metric to aggregate head/tail for each? For example, for each metric, report the percentage difference between head and tail: if it's small, there are no directionality problems; if it's big, then there are. For mean rank, we could also calculate the standard deviation/variance of the ranks. Then we have two means and two standard deviations, so we could run a statistical test to ask whether they're the same, e.g. the independent two-sample t-test (equal sample sizes and variances, or unequal sample sizes with similar variances; see https://en.wikipedia.org/wiki/Student%27s_t-test).
I am not sure whether the numbers are directly comparable, since they evaluate different problems (e.g. "predict where Max was born" vs. "predict a person born in Germany"). When using filtered evaluation, we thus also end up with a different number of choices remaining after the filter step. The latter could be addressed by looking at the adjusted version of mean rank.
Again, I am not sure about the validity of such a test, since it compares the results of two different tasks that happen to be measured with the same metric. But we can at least report those values and leave the interpretation up to the user. We should provide the interpretation hints discussed here to help with interpreting the results.
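A minimal sketch of the aggregation and test discussed above, in case it helps: the function and the synthetic rank arrays are illustrative only and not part of this PR, and Welch's t-test is used so equal variances are not assumed. The validity concerns raised above still apply.

```python
import numpy as np
from scipy import stats

def compare_sides(head_ranks: np.ndarray, tail_ranks: np.ndarray):
    """Compare the per-triple ranks of the head and tail prediction tasks."""
    # Relative difference of the mean ranks as a quick directionality indicator.
    mean_head, mean_tail = head_ranks.mean(), tail_ranks.mean()
    rel_diff = abs(mean_head - mean_tail) / ((mean_head + mean_tail) / 2)
    # Welch's two-sample t-test; a small p-value hints at a systematic asymmetry.
    t_stat, p_value = stats.ttest_ind(head_ranks, tail_ranks, equal_var=False)
    return rel_diff, t_stat, p_value

# Toy usage with synthetic ranks; in practice these would come from the evaluator.
rng = np.random.default_rng(seed=0)
head_ranks = rng.integers(1, 100, size=500).astype(float)
tail_ranks = rng.integers(1, 50, size=500).astype(float)
print(compare_sides(head_ranks, tail_ranks))
```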
@mberr I have added the adjusted_mean_rank case detection as an example. Please feel free to adjust or revert 😄 Aside from that, everything looks fine to me 👍
@cthoyt Ready from my side for Squash & Merge.
@mberr from the code side it's fine, but I added a tutorial page that I'd like you to fill out. It has 4 questions that users will need answered; feel free to elaborate on anything else a user should know to understand what's coming out of this new code.
This part of the tutorial aims to help you understand the rank-based evaluation metrics reported
in :class:`pykeen.evaluation.RankBasedMetricResults`.

Side-Specific Metrics
@mberr fill out here, please
Since there is no general discussion of the evaluation, it might be better to start there 😉 I can take care of it, but it will take me a bit 🙂
Since this PR is not a fix, but a new feature, I think it should be okay to have some delay here to improve the documentation 🙂
@cthoyt I started adding a bit about the general evaluation 🙂
:math:`score(h,r,t)=\sum_{i=1}^d \mathbf{h}_i \cdot \mathbf{r}_i \cdot \mathbf{t}_i`. Here, we can score all
entities as candidate heads for a given tail and relation by first computing the element-wise product of tail and
relation, and then performing a matrix multiplication with the matrix of all entity embeddings.
# TODO: Link to section explaining this concept.
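As a rough, self-contained illustration of the scoring trick described above (a sketch with made-up tensor names, not the PyKEEN implementation): for a DistMult-style interaction, all candidate heads for a given relation and tail can be scored with one element-wise product followed by a single matrix multiplication.

```python
import torch

num_entities, dim = 1000, 64
entity_embeddings = torch.randn(num_entities, dim)  # all entity embeddings, one row per entity
r = torch.randn(dim)                                 # embedding of the given relation
t = entity_embeddings[42]                            # embedding of the given tail

# score(h, r, t) = sum_i h_i * r_i * t_i, evaluated for every candidate head h:
# element-wise product of relation and tail, then one matrix-vector product.
rt = r * t                                           # shape: (dim,)
scores = entity_embeddings @ rt                      # shape: (num_entities,)

# The rank of the true head (counting ties pessimistically) is the number of
# candidates scoring at least as high.
true_head = 17
rank = int((scores >= scores[true_head]).sum())
print(rank)
```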
agreed! I made a placeholder for this in #55
@cthoyt Sidedness is something which could also be done for e.g. the sklearn-based metrics 🙂
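To sketch what that could look like (purely illustrative data and names, not an existing PyKEEN interface): the sklearn-based metrics could likewise be computed once per side and once on the pooled predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(seed=0)
results = {}
pooled_true, pooled_score = [], []
for side in ("head", "tail"):
    # Hypothetical per-side data: binary labels (true vs. corrupted triple) and model scores.
    y_true = rng.integers(0, 2, size=1000)
    y_score = rng.normal(size=1000) + y_true  # scores that weakly separate the two classes
    results[side] = roc_auc_score(y_true, y_score)
    pooled_true.append(y_true)
    pooled_score.append(y_score)

# A combined "both" value, analogous to the previous behaviour, computed on the pooled data.
results["both"] = roc_auc_score(np.concatenate(pooled_true), np.concatenate(pooled_score))
print(results)
```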
@mberr thanks for adding the write-up. Will merge when Travis passes!
Codecov Report
| | master | #44 | +/- |
|---|---|---|---|
| Coverage | 70.01% | 68.87% | -1.15% |
| Files | 90 | 91 | +1 |
| Lines | 5166 | 5368 | +202 |
| Branches | 601 | 639 | +38 |
| Hits | 3617 | 3697 | +80 |
| Misses | 1368 | 1483 | +115 |
| Partials | 181 | 188 | +7 |
Continue to review full report at Codecov.
This PR adds additional variants of the already computed rank-based metrics, differentiating between head and tail prediction, while still reporting the "old" averaged scores. Reporting scores for individual sides may help to reveal model deficiencies (e.g., a model that can only successfully predict tails).
There are some open questions from the implementation side, e.g. how to address specific sides when running from the CLI.
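For illustration only (made-up rank arrays, not the actual PyKEEN code): pooling the per-triple ranks of both one-sided tasks and averaging reproduces the previously reported combined mean rank, while the per-side means expose any asymmetry.

```python
import numpy as np

# Hypothetical per-triple ranks from the two one-sided prediction tasks.
head_ranks = np.array([1, 3, 7, 2], dtype=float)
tail_ranks = np.array([10, 25, 4, 40], dtype=float)

mean_rank_head = head_ranks.mean()                                 # 3.25   (side-specific)
mean_rank_tail = tail_ranks.mean()                                 # 19.75  (side-specific)
mean_rank_both = np.concatenate([head_ranks, tail_ranks]).mean()   # 11.5   (the "old" combined value)
```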