1.11 F1 score #15

miguelballesteros · 2018-09-27T18:19:18Z

I tried evaluating a single sentence against itself and I got a Smatch score greater than one (!!), any idea why?
Thank you.

Details below:

python smatchnew/smatch/smatch.py -f q3.txt q3.txt
F-score: 1.11

cat q3.txt
# ::snt How many white settlers were living in Kenya in the 1950's ?
(l / live-01
      :ARG0 (p / person
            :ARG1-of (s / settle-03
                  :ARG1 p
                  :ARG4 c)
            :ARG1-of (w / white-02)
            :quant (a / amr-unknown))
      :location (c / country :name "Kenya")
      :time (d / date-entity :decade 1950))

The text was updated successfully, but these errors were encountered:

snowblink14 · 2018-09-30T04:08:15Z

@miguelballesteros I think it's because smatch has the assumption that the same triple can only occur once. In your example, you have:ARG0 (p / person :ARG1-of (s / settle-03 :ARG1 p, which results in two same triples <ARG1, settle, person>.

I am not sure if this duplication is a mistake or intended behavior, but we could add something to fix if there are more than one same triples.

If you remove :ARG1 p from your example, the score will be 1.0.

miguelballesteros · 2018-09-30T04:51:51Z

I see, makes sense. I understand that this needs to occur both in the gold graph and in the predicted graph; if it only happens in the predicted graph it wont have that effect, is this right?

snowblink14 · 2018-10-01T05:20:12Z

Currently smatch treats gold graph and predicted graph equally, so if the duplication happens in the predicted graph it will also cause some overcounting. Before a fix is applied, a workaround is to check if there are duplicate triples in your graphs.

goodmami · 2020-04-02T00:18:45Z

The duplicate issue in #28 highlights that this is still causing headaches. I think the problem here is that the duplicated triple is counted in the numerator and not the denominator. There are 3 other arrangements we could consider:

~~Count duplicated edges in the numerator but not the denominator (leading to a score > 1.0 if everything else is correct)~~ (current situation)
Ignore duplicated edges entirely (leading to a score of 1.0 if everything else is correct)
Count duplicated edges in the denominator but not the numerator (leading to a score < 1.0)
Count duplicated edges in both (leading to a score < 1.0, but approaching 1.0 the more duplicated edges there are)

I think 3 leaves open the door for gaming the metric; users can pad their AMRs with edges they are confident about. Both 1 and 2 are ok, but I have a preference for 2 as these duplicated edges are bad AMRs (see amrisi/amr-guidelines#93 and amrisi/amr-guidelines#121) and the badness should be reflected in the score.

ramon-astudillo · 2020-04-02T14:59:35Z

I agree that 2 makes the most sense.

oepen · 2020-04-02T16:11:41Z

SMATCH computes F1 scores, so there should be no uncertainty about what is the correct definition here.

there could be ‘duplicate’ tuples in either the gold or the system graph, or both. all should be counted equally, i.e. some will be correct (in both graphs), some maybe only in one or the other. i believe the right solution will be more in the spirit of your option 3. rather than 2. the potential for ‘gaming’ scores, to me, seems to presuppose that one can change the gold-standard target graph?

assuming a fixed ‘gold’ graph, say it contains two ‘duplicate’ triples (h : mode interrogative). a parser output with two such triples will yield perfect precision and recall; missing for example one of them will reduce recall, whereas padding with extra :mode triples would penalize precision.

oepen · 2020-04-02T17:32:40Z

one more comment on the legitimacy of ‘duplicate’ triples. the original issue was about multiple edges with the same label (roles), and the discussions among AMR developers that you dug up, @goodmami, seem to lean towards outlawing those.

but i think it is a new observation by @ramon-astudillo (in #28) that the same over-counting problem also applies to constant-valued node properties (attributes). there are some legitimate instances of multiple occurrences of the same attribute in the latest AMR release, pointed out to me by @timjogorman last year (look for :li):

"d)Finaly, which shop/website do you recommend and some buying advise would be realy good plz!!"
(a3 / and :polite + :li "d" :li "-1"
    :op1 (r / recommend-01
          :ARG0 (y / you)
          :ARG4 (a / amr-unknown
                :domain (s / slash
                      :op1 (s2 / shop)
                      :op2 (w / website))))
    :op2 (g / good-02
          :ARG1 (t / thing
                :ARG2-of (a2 / advise-01)
                :purpose (b / buy-01)
                :quant (s3 / some))
          :ARG1-of (r2 / real-04)))

tahira · 2020-04-02T17:36:42Z

@oepen those :li attributes seem to have different values ... not the same value duplicated

ramon-astudillo · 2020-04-02T17:43:20Z

In any case, after readings @oepen comment, I agree that it may be best to consider repetitions in the gold AMR as such and penalize having either a higher or lower count on the predicted AMR (closer to option 3).

Even if right now such repetitions in AMR wont happen, this is closer to a pure F1. It would also support repetitions if needed (but only if they are present in the gold AMR).

tahira · 2020-04-02T17:43:33Z

2 seems simpler ... and harmless if multiple triples are not allowed according to the AMR guidelines .... but @oepen 's is suggesting a sophisticated version of 3 that counts only as many duplicated triples in the numerator as are present in the gold, but not the rest ... that should mean more change to the code but would not be relying on gold graphs always sticking to 'no-duplicate-triples'

oepen · 2020-04-02T17:47:12Z

yes, so only an example of (arguably) motivated repetition of attributes. i struggle to suggest a linguistically plausible graph where the same attribute would repeatedly have the same value. but once multiple occurrences of an attribute are legitimate, and their values are arbitrary constants ... there is no way to prevent a parser (or possibly annotator) from creating wholly ‘duplicate’ triples. my general view is that these are not technically duplicates, just multiple tokens of the same tuple type.

goodmami · 2020-04-03T03:58:38Z

@oepen you make a fair point about keeping the metric a correct implementation of F1. It seems like you may have been misinterpreting about what is meant by duplicate triple. In this case we're not talking merely about source node and role, but the full triple, so in the original issue the triple ARG1(s, p) appears twice. Also it doesn't matter if it's attribute triples or node-to-node triples. Here are examples of both:

$ cat i15.gold  # e.g., "Ethiopian coffee is very good."
(g / good
   :ARG1 (c / coffee
      :source (c2 / country
         :name (n / name :op1 "Ethiopia")))
   :degree (v / very))
$ cat i15.test-a  # duplicated degree(g, v)
(g / good
   :ARG1 (c / coffee
      :source (c2 / country
         :name (n / name :op1 "Ethiopia")))
   :degree (v / very)
   :degree v)
$ cat i15.test-b  # duplicated op1(n, "Ethiopia")
(g / good
   :ARG1 (c / coffee
      :source (c2 / country
         :name (n / name :op1 "Ethiopia" :op1 "Ethiopia")))
   :degree (v / very))
$ python3 smatch.py -f i15.test-a i15.gold 
F-score: 1.04
$ python3 smatch.py -f i15.test-b i15.gold 
F-score: 1.04

I think @ramon-astudillo and @Tahira123's suggestions are good. We need more sophisticated counting/matching of the triples so that matching triples are paired off and removed from consideration from further pairings, or, alternatively, that we look for the same counts of matching triples.

I've come around to liking this solution better than (2) above, since it leaves open the possibility for legitimate duplicates in the gold graph. Thanks for the discussion, everyone.

snowblink14 · 2020-04-04T22:19:12Z

Thanks for the nice discussions. About the duplicate triples, I agree that although legitimate repetition of triples are arguable at this moment, it will be nice to support them instead of ruling them out.

I propose the following code changes:

Check and output warning messages if either predicted/gold AMR has duplicate triples.
Add an option to customize how to treat duplicate triples. By default we treat the duplicate triples in gold as legitimate, and predicted graph can only get 1.0 score if matching the gold amr entirely (including the number of duplicate triples) but user can specify options to ignore the duplicates, .etc.

How does this sound?

BramVanroy · 2023-08-08T08:40:33Z

@snowblink14 Sounds great. Especially 2 seems good for many use-cases. Any progress in this regard?

flipz357 · 2023-10-17T19:22:56Z

Hi all,

just adding a comment on this issue, maybe this helps someone. Imo this issue here is quite a problem since in combination with standard micro scoring it becomes a real vulnerability, see my blog article.

However, the good news is that there are straightforward solutions to this problem, as implemented in Smatch++:

default: AMR graph is standardized and duplicates edges are removed (since they don't really add information). I see that this solution was also proposed somewhere in the thread above.
optional: keep duplicate edges but use proper scoring. Full credits go to my colleague Julius Steen who found that the solution to this issue is rather simple by using a counter / count dict in creating the weight dict and if two edges match we save minimum count of a duplicate edge of two graphs (since that's the proper max matching count) before alignment. This way duplicates are allowed but the final score is proper.

If you want to have this also in this Smatch library, I think implementation should be rather simple. For now the feature is available in Smatch++.

miguelballesteros mentioned this issue Jan 16, 2019

Issue about the 'TOP' attribute relation #18

Open

oepen mentioned this issue Jun 16, 2019

F1 scores above 1.0 from SMATCH cfmrp/mtool#23

Closed

ramon-astudillo mentioned this issue Apr 1, 2020

Smatch (still) yields scores above 100% #28

Closed

jheinecke mentioned this issue Feb 1, 2023

AMR graph which scores > 100% #42

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

1.11 F1 score #15

1.11 F1 score #15

miguelballesteros commented Sep 27, 2018 •

edited

Loading

snowblink14 commented Sep 30, 2018

miguelballesteros commented Sep 30, 2018

snowblink14 commented Oct 1, 2018

goodmami commented Apr 2, 2020

ramon-astudillo commented Apr 2, 2020

oepen commented Apr 2, 2020 •

edited

Loading

oepen commented Apr 2, 2020

tahira commented Apr 2, 2020

ramon-astudillo commented Apr 2, 2020

tahira commented Apr 2, 2020

oepen commented Apr 2, 2020 •

edited

Loading

goodmami commented Apr 3, 2020

snowblink14 commented Apr 4, 2020

BramVanroy commented Aug 8, 2023

flipz357 commented Oct 17, 2023 •

edited

Loading

1.11 F1 score #15

1.11 F1 score #15

Comments

miguelballesteros commented Sep 27, 2018 • edited Loading

snowblink14 commented Sep 30, 2018

miguelballesteros commented Sep 30, 2018

snowblink14 commented Oct 1, 2018

goodmami commented Apr 2, 2020

ramon-astudillo commented Apr 2, 2020

oepen commented Apr 2, 2020 • edited Loading

oepen commented Apr 2, 2020

tahira commented Apr 2, 2020

ramon-astudillo commented Apr 2, 2020

tahira commented Apr 2, 2020

oepen commented Apr 2, 2020 • edited Loading

goodmami commented Apr 3, 2020

snowblink14 commented Apr 4, 2020

BramVanroy commented Aug 8, 2023

flipz357 commented Oct 17, 2023 • edited Loading

miguelballesteros commented Sep 27, 2018 •

edited

Loading

oepen commented Apr 2, 2020 •

edited

Loading

oepen commented Apr 2, 2020 •

edited

Loading

flipz357 commented Oct 17, 2023 •

edited

Loading