
1.11 F1 score #15

Open · miguelballesteros opened this issue Sep 27, 2018 · 13 comments
@miguelballesteros commented Sep 27, 2018

I tried evaluating a single sentence against itself and I got a Smatch score greater than one (!!), any idea why?
Thank you.

Details below:

python smatchnew/smatch/smatch.py -f q3.txt q3.txt
F-score: 1.11

cat q3.txt
# ::snt How many white settlers were living in Kenya in the 1950's ?
(l / live-01
      :ARG0 (p / person
            :ARG1-of (s / settle-03
                  :ARG1 p
                  :ARG4 c)
            :ARG1-of (w / white-02)
            :quant (a / amr-unknown))
      :location (c / country :name "Kenya")
      :time (d / date-entity :decade 1950))
@snowblink14 (Owner) commented Sep 30, 2018

@miguelballesteros I think it's because smatch has the assumption that the same triple can only occur once. In your example, you have :ARG0 (p / person :ARG1-of (s / settle-03 :ARG1 p, which results in two identical triples <ARG1, settle, person>.

I am not sure whether this duplication is a mistake or intended behavior, but we could add a fix for cases where the same triple occurs more than once.

If you remove :ARG1 p from your example, the score will be 1.0.
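For intuition, here is one plausible accounting of how a duplicated triple can push the score above 1.0 (a sketch only, not the actual smatch code; the triple count of 18 and the 2 × 2 = 4 matches for the duplicated pair are assumptions chosen to reproduce the reported 1.11):

```python
# Hypothetical sketch: if a triple occurs twice on both sides, a naive matcher
# can pair every test copy with every gold copy (2 x 2 = 4 matches), while each
# file's denominator only counts the 2 copies it actually contains.
def f_score(matched, n_test, n_gold):
    p, r = matched / n_test, matched / n_gold
    return 2 * p * r / (p + r)

# Assumed counts: q3.txt expands to 18 triples including both copies of
# <ARG1, settle, person>; 16 ordinary matches plus 4 from the duplicated pair.
print(round(f_score(16 + 4, 18, 18), 2))  # -> 1.11, the reported score
```

Under these (unverified) assumptions, precision and recall both come out as 20/18, so the F-score itself is 20/18 ≈ 1.11.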


@miguelballesteros (Author) commented Sep 30, 2018

I see, that makes sense. I understand that this needs to occur both in the gold graph and in the predicted graph; if it only happens in the predicted graph it won't have that effect, is that right?


@snowblink14 (Owner) commented Oct 1, 2018

Currently smatch treats the gold graph and the predicted graph equally, so if the duplication happens in the predicted graph it will also cause some overcounting. Until a fix is applied, a workaround is to check whether there are duplicate triples in your graphs.


@goodmami (Contributor) commented Apr 2, 2020

The duplicate issue in #28 highlights that this is still causing headaches. I think the problem here is that the duplicated triple is counted in the numerator but not the denominator. Including the current behavior, there are four arrangements we could consider:

  1. Count duplicated edges in the numerator but not the denominator (leading to a score > 1.0 if everything else is correct) (current situation)
  2. Ignore duplicated edges entirely (leading to a score of 1.0 if everything else is correct)
  3. Count duplicated edges in the denominator but not the numerator (leading to a score < 1.0)
  4. Count duplicated edges in both (leading to a score < 1.0, but approaching 1.0 the more duplicated edges there are)

I think 3 leaves open the door for gaming the metric; users can pad their AMRs with edges they are confident about. Both 1 and 2 are ok, but I have a preference for 2 as these duplicated edges are bad AMRs (see amrisi/amr-guidelines#93 and amrisi/amr-guidelines#121) and the badness should be reflected in the score.
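The first three options above can be compared numerically. The sketch below uses illustrative counts (not taken from smatch): a graph with 11 unique triples, all correct, where the predicted graph repeats one edge. Option 4 is omitted because its effect depends on whether the gold graph also carries the duplicate.

```python
# Illustrative comparison of options 1-3 for a predicted graph with one
# duplicated edge; n_test/n_gold are triple totals, matched is the numerator.
def f_score(matched, n_test, n_gold):
    p, r = matched / n_test, matched / n_gold
    return 2 * p * r / (p + r)

print(round(f_score(12, 12, 11), 2))  # option 1 (current): dup matched against gold twice -> 1.04
print(f_score(11, 11, 11))            # option 2: duplicate ignored entirely -> 1.0
print(round(f_score(11, 12, 11), 2))  # option 3: dup counted only in the denominator -> 0.96
```

The option 1 value of 1.04 happens to match the i15.test-a / i15.test-b scores shown later in this thread, but the counts here were chosen for illustration rather than extracted from those files.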


@ramon-astudillo commented Apr 2, 2020

I agree that 2 makes the most sense.


@oepen commented Apr 2, 2020

SMATCH computes F1 scores, so there should be no uncertainty about what is the correct definition here.

there could be ‘duplicate’ tuples in either the gold or the system graph, or both. all should be counted equally, i.e. some will be correct (in both graphs), some maybe only in one or the other. i believe the right solution is more in the spirit of your option 3 than 2. the potential for ‘gaming’ scores, to me, seems to presuppose that one can change the gold-standard target graph?

assuming a fixed ‘gold’ graph, say it contains two ‘duplicate’ triples (h : mode interrogative). a parser output with two such triples will yield perfect precision and recall; missing for example one of them will reduce recall, whereas padding with extra :mode triples would penalize precision.
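The multiset ("pure F1") treatment described here can be sketched with a counter over triples. This ignores smatch's variable-alignment search (it assumes the variables are already mapped), and the example triples are hypothetical stand-ins for the (h :mode interrogative) case above:

```python
from collections import Counter

# Multiset F1 over triples: matched = sum of per-triple minimum counts, so a
# repeated gold triple must be produced the same number of times for a perfect
# score; missing a copy lowers recall, padding extra copies lowers precision.
def multiset_f1(test_triples, gold_triples):
    test, gold = Counter(test_triples), Counter(gold_triples)
    matched = sum((test & gold).values())  # multiset intersection
    p = matched / sum(test.values())
    r = matched / sum(gold.values())
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical gold graph containing the same :mode triple twice.
gold = [('h', ':mode', 'interrogative')] * 2 + [('h', ':instance', 'have-03')]
print(multiset_f1(gold, gold))                        # both copies produced -> 1.0
print(round(multiset_f1(gold[1:], gold), 2))          # one copy missing -> recall drops
print(round(multiset_f1(gold + gold[:1], gold), 2))   # extra copy padded -> precision drops
```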


@oepen commented Apr 2, 2020

one more comment on the legitimacy of ‘duplicate’ triples. the original issue was about multiple edges with the same label (roles), and the discussions among AMR developers that you dug up, @goodmami, seem to lean towards outlawing those.

but i think it is a new observation by @ramon-astudillo (in #28) that the same over-counting problem also applies to constant-valued node properties (attributes). there are some legitimate instances of multiple occurrences of the same attribute in the latest AMR release, pointed out to me by @timjogorman last year (look for :li):

"d)Finaly, which shop/website do you recommend and some buying advise would be realy good plz!!"
(a3 / and :polite + :li "d" :li "-1"
    :op1 (r / recommend-01
          :ARG0 (y / you)
          :ARG4 (a / amr-unknown
                :domain (s / slash
                      :op1 (s2 / shop)
                      :op2 (w / website))))
    :op2 (g / good-02
          :ARG1 (t / thing
                :ARG2-of (a2 / advise-01)
                :purpose (b / buy-01)
                :quant (s3 / some))
          :ARG1-of (r2 / real-04)))


@tahira123 commented Apr 2, 2020

@oepen those :li attributes seem to have different values ... not the same value duplicated


@ramon-astudillo commented Apr 2, 2020

In any case, after reading @oepen's comment, I agree that it may be best to treat repetitions in the gold AMR as legitimate and penalize having either a higher or a lower count in the predicted AMR (closer to option 3).

Even if such repetitions won't happen in AMR right now, this is closer to a pure F1. It would also support repetitions if needed (but only if they are present in the gold AMR).


@tahira123 commented Apr 2, 2020

2 seems simpler ... and harmless if multiple triples are not allowed according to the AMR guidelines ... but @oepen is suggesting a sophisticated version of 3 that counts in the numerator only as many duplicated triples as are present in the gold, and not the rest ... that should mean more changes to the code, but it would not rely on gold graphs always sticking to 'no duplicate triples'


@oepen commented Apr 2, 2020

yes, so only an example of (arguably) motivated repetition of attributes. i struggle to suggest a linguistically plausible graph where the same attribute would repeatedly have the same value. but once multiple occurrences of an attribute are legitimate, and their values are arbitrary constants ... there is no way to prevent a parser (or possibly annotator) from creating wholly ‘duplicate’ triples. my general view is that these are not technically duplicates, just multiple tokens of the same tuple type.


@goodmami (Contributor) commented Apr 3, 2020

@oepen you make a fair point about keeping the metric a correct implementation of F1. It seems like you may have been misinterpreting what is meant by a duplicate triple. In this case we're not talking merely about the source node and role, but the full triple, so in the original issue the triple ARG1(s, p) appears twice. Also, it doesn't matter whether they are attribute triples or node-to-node triples. Here are examples of both:

$ cat i15.gold  # e.g., "Ethiopian coffee is very good."
(g / good
   :ARG1 (c / coffee
      :source (c2 / country
         :name (n / name :op1 "Ethiopia")))
   :degree (v / very))
$ cat i15.test-a  # duplicated degree(g, v)
(g / good
   :ARG1 (c / coffee
      :source (c2 / country
         :name (n / name :op1 "Ethiopia")))
   :degree (v / very)
   :degree v)
$ cat i15.test-b  # duplicated op1(n, "Ethiopia")
(g / good
   :ARG1 (c / coffee
      :source (c2 / country
         :name (n / name :op1 "Ethiopia" :op1 "Ethiopia")))
   :degree (v / very))
$ python3 smatch.py -f i15.test-a i15.gold 
F-score: 1.04
$ python3 smatch.py -f i15.test-b i15.gold 
F-score: 1.04

I think @ramon-astudillo and @tahira123's suggestions are good. We need more sophisticated counting/matching of the triples, so that matched triples are paired off and removed from consideration for further pairings, or, alternatively, so that we look for the same counts of matching triples.

I've come around to liking this solution better than (2) above, since it leaves open the possibility for legitimate duplicates in the gold graph. Thanks for the discussion, everyone.


@snowblink14 (Owner) commented Apr 4, 2020

Thanks for the nice discussion. About the duplicate triples, I agree that although legitimate repetition of triples is arguable at the moment, it would be nice to support them instead of ruling them out.

I propose the following code changes:

  1. Check and output warning messages if either the predicted or the gold AMR has duplicate triples.
  2. Add an option to customize how duplicate triples are treated. By default we treat the duplicate triples in gold as legitimate, and the predicted graph can only get a 1.0 score if it matches the gold AMR entirely (including the number of duplicate triples), but users can specify options to ignore the duplicates, etc.

How does this sound?
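Proposal 1 (the warning) is straightforward to sketch. The helper below is hypothetical, not part of smatch; it just counts parsed triples and reports any that occur more than once:

```python
from collections import Counter
import sys

# Hypothetical helper for proposal 1: warn (on stderr) when a graph's triple
# list contains the same triple more than once, and return the offenders.
def warn_duplicates(triples, label):
    dups = {t: n for t, n in Counter(triples).items() if n > 1}
    for triple, n in dups.items():
        print(f"Warning: {label} AMR contains triple {triple} {n} times",
              file=sys.stderr)
    return dups

# The duplicated <ARG1, settle, person> edge from the original q3 example.
triples = [('s', ':ARG1', 'p'), ('s', ':ARG1', 'p'), ('l', ':ARG0', 'p')]
dups = warn_duplicates(triples, "predicted")
print(len(dups))  # -> 1 distinct duplicated triple
```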

