1.11 F1 score #15
I tried evaluating a single sentence against itself and I got a Smatch score greater than one (!!). Any idea why?
Thank you.
Details below:
$ python smatchnew/smatch/smatch.py -f q3.txt q3.txt
F-score: 1.11
Comments
@miguelballesteros I think it's because smatch assumes that the same triple can only occur once. In your example you have :ARG0 (p / person :ARG1-of (s / settle-03 :ARG1 p, which results in two identical triples <ARG1, settle, person>. I am not sure whether this duplication is a mistake or intended behavior, but we could add something to handle the case where the same triple occurs more than once. If you remove :ARG1 p from your example, the score will be 1.0.
I see, that makes sense. I understand that this needs to occur in both the gold graph and the predicted graph; if it only happens in the predicted graph it won't have that effect, is this right?
Currently smatch treats the gold graph and the predicted graph equally, so if the duplication happens in the predicted graph it will also cause some overcounting. Until a fix is applied, a workaround is to check whether there are duplicate triples in your graphs.
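For example, a quick check along these lines (just a sketch; `triples` stands for whatever list of (source, role, target) tuples you extract from your graph, not a particular smatch API):

```python
from collections import Counter

def find_duplicate_triples(triples):
    """Return the (source, role, target) tuples that occur more than once."""
    counts = Counter(triples)
    return {t: n for t, n in counts.items() if n > 1}

# Hypothetical triples in the spirit of the settle-03 example above:
triples = [
    ("s", "ARG1", "p"),  # from the inverse role :ARG1-of
    ("s", "ARG1", "p"),  # from the explicit :ARG1 p -- the duplicate
    ("p", "instance", "person"),
]
print(find_duplicate_triples(triples))  # {('s', 'ARG1', 'p'): 2}
```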
The duplicate issue in #28 highlights that this is still causing headaches. I think the problem here is that the duplicated triple is counted in the numerator but not in the denominator. There are three other arrangements we could consider:
1. deduplicate triples in both graphs before matching, so duplicates are simply ignored;
2. count duplicated triples in the denominator but not the numerator, so they are penalized;
3. count duplicated triples in both the numerator and the denominator.
I think 3 leaves open the door to gaming the metric: users could pad their AMRs with edges they are confident about. Both 1 and 2 are OK, but I have a preference for 2, as these duplicated edges are bad AMRs (see amrisi/amr-guidelines#93 and amrisi/amr-guidelines#121) and the badness should be reflected in the score.
I agree that 2 makes the most sense.
SMATCH computes F1 scores, so there should be no uncertainty about what the correct definition is here. there could be ‘duplicate’ tuples in either the gold or the system graph, or both. all should be counted equally, i.e. some will be correct (present in both graphs), some maybe only in one or the other. i believe the right solution will be more in the spirit of your option 3 rather than 2. the potential for ‘gaming’ scores, to me, seems to presuppose that one can change the gold-standard target graph? assuming a fixed ‘gold’ graph, say it contains two ‘duplicate’ triples …
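to make that arithmetic concrete, a small invented example (the counts are hypothetical, not taken from any graph in this thread):

```python
# hypothetical case: the gold graph contains two tokens of the same triple,
# the system graph produces only one; all other triples are ignored here.
gold_tokens = 2                          # tokens of the triple in gold
sys_tokens = 1                           # tokens in the system graph
matched = min(gold_tokens, sys_tokens)   # each token matches at most once
precision = matched / sys_tokens         # 1 / 1 = 1.0
recall = matched / gold_tokens           # 1 / 2 = 0.5
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 3))   # 1.0 0.5 0.667
```

counting this way, duplicates on either side can lower the score but never push it above 1.0.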
one more comment on the legitimacy of ‘duplicate’ triples. the original issue was about multiple edges with the same label (roles), and the discussions among AMR developers that you dug up, @goodmami, seem to lean towards outlawing those. but i think it is a new observation by @ramon-astudillo (in #28) that the same over-counting problem also applies to constant-valued node properties (attributes). there are some legitimate instances of multiple occurrences of the same attribute in the latest AMR release, pointed out to me by @timjogorman last year (look for :li).
@oepen those :li attributes seem to have different values ... not the same value duplicated.
In any case, after reading @oepen's comment, I agree that it may be best to count repetitions in the gold AMR as such and penalize having either a higher or lower count in the predicted AMR (closer to option 3). Even if such repetitions won't happen in AMR right now, this is closer to a pure F1. It would also support repetitions if needed (but only when they are present in the gold AMR).
2 seems simpler ... and harmless if multiple triples are not allowed according to the AMR guidelines ... but @oepen is suggesting a sophisticated version of 3 that counts in the numerator only as many duplicated triples as are present in the gold, and not the rest ... that would mean more changes to the code, but it would not rely on gold graphs always sticking to 'no duplicate triples'.
yes, so only an example of (arguably) motivated repetition of attributes. i struggle to suggest a linguistically plausible graph where the same attribute would repeatedly have the same value. but once multiple occurrences of an attribute are legitimate, and their values are arbitrary constants ... there is no way to prevent a parser (or possibly an annotator) from creating wholly ‘duplicate’ triples. my general view is that these are not technically duplicates, just multiple tokens of the same tuple type.
@oepen you make a fair point about keeping the metric a correct implementation of F1. It seems like you may have been misinterpreting what is meant by a duplicate triple: in this case we're not talking merely about the same source node and role, but about the full triple, so in the original issue it is the ARG1(settle-03, person) triple that is duplicated. Both kinds of duplication reproduce the over-counting:
$ cat i15.gold  # e.g., "Ethiopian coffee is very good."
(g / good
   :ARG1 (c / coffee
      :source (c2 / country
         :name (n / name :op1 "Ethiopia")))
   :degree (v / very))
$ cat i15.test-a  # duplicated degree(g, v)
(g / good
   :ARG1 (c / coffee
      :source (c2 / country
         :name (n / name :op1 "Ethiopia")))
   :degree (v / very)
   :degree v)
$ cat i15.test-b  # duplicated op1(n, "Ethiopia")
(g / good
   :ARG1 (c / coffee
      :source (c2 / country
         :name (n / name :op1 "Ethiopia" :op1 "Ethiopia")))
   :degree (v / very))
$ python3 smatch.py -f i15.test-a i15.gold
F-score: 1.04
$ python3 smatch.py -f i15.test-b i15.gold
F-score: 1.04
I think @ramon-astudillo and @Tahira123's suggestions are good. We need more sophisticated counting/matching of the triples, so that matching triples are paired off and removed from consideration for further pairings, or, alternatively, so that we look for the same counts of matching triples. I've come around to liking this solution better than (2) above, since it leaves open the possibility of legitimate duplicates in the gold graph. Thanks for the discussion, everyone.
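For illustration, a rough sketch of that pairing-off as multiset counting, assuming the variable mapping between the two graphs has already been fixed (real smatch additionally searches over variable alignments, which this ignores):

```python
from collections import Counter

def multiset_f1(gold_triples, sys_triples):
    """Pair off matching triples: each gold token can match at most
    one system token of the same (source, role, target) type."""
    gold_counts = Counter(gold_triples)
    sys_counts = Counter(sys_triples)
    matched = sum((gold_counts & sys_counts).values())  # min count per type
    total_gold = sum(gold_counts.values())
    total_sys = sum(sys_counts.values())
    if matched == 0 or not total_gold or not total_sys:
        return 0.0
    precision = matched / total_sys
    recall = matched / total_gold
    return 2 * precision * recall / (precision + recall)

# With the duplicated degree(g, v) from i15.test-a (other triples omitted),
# the score now stays at or below 1.0:
gold = [("g", "degree", "v")]
test = [("g", "degree", "v"), ("g", "degree", "v")]
print(multiset_f1(gold, test))  # 0.666...
```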
Thanks for the nice discussion. About the duplicate triples, I agree that although legitimate repetition of triples is arguable at this moment, it would be nice to support repetitions instead of ruling them out. I propose making code changes along the lines discussed above (more sophisticated counting of duplicated triples). How does this sound?
@snowblink14 Sounds great. Especially 2 seems good for many use cases. Any progress on this?
Hi all, just adding a comment on this issue; maybe it helps someone. Imo this issue is quite a problem, since in combination with standard micro scoring it becomes a real vulnerability (see my blog article). However, the good news is that there are straightforward solutions to this problem, as implemented in Smatch++. If you want to have this in this Smatch library as well, I think the implementation should be rather simple; for now, the feature is available in Smatch++.