
Overestimated performance for NER in the 0.2 version #206

Closed
qiuwei opened this issue Nov 14, 2018 · 6 comments
Labels
wontfix This will not be worked on

Comments


qiuwei commented Nov 14, 2018

I found a bug in the evaluation script for NER in the 0.2 version.
The perl script was originally written to handle BIO-encoded input. When it is run on BIOES-encoded input, it gives no warning but produces an overestimated result (in my experience, typically 0.3 to 0.4 points higher than the actual F1).

So if you used the 0.2 version to produce the numbers reported in the COLING paper, I suspect they may be overestimated.
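For anyone who wants to keep using the perl script in the meantime, one workaround is to map BIOES output back to BIO before scoring. A minimal sketch (a hypothetical helper, not part of Flair):

```python
def bioes_to_bio(tags):
    """Map BIOES tags to BIO so the original conlleval perl script
    counts spans correctly: S- becomes B-, E- becomes I-."""
    converted = []
    for tag in tags:
        if tag.startswith("S-"):
            converted.append("B-" + tag[2:])
        elif tag.startswith("E-"):
            converted.append("I-" + tag[2:])
        else:
            converted.append(tag)  # B-, I- and O pass through unchanged
    return converted

# The same two entities in both schemes:
print(bioes_to_bio(["B-PER", "E-PER", "O", "S-LOC"]))
# ['B-PER', 'I-PER', 'O', 'B-LOC']
```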

@qiuwei changed the title from "Overestimated performance in the 0.2 version" to "Overestimated performance for NER in the 0.2 version" on Nov 14, 2018
@alanakbik (Collaborator) commented:

Hello @qiuwei, thanks for pointing this out. I think the best thing to do is to run another parameter sweep and post updated numbers. We are gearing up for another round of bigger experiments in the context of a new paper and the 0.4 release of Flair, so we'll probably end up doing it all at the same time.

For now, I've run some initial experiments on 0.3.2 over CoNLL-03 with our current default training settings (locked dropout of 0.5, patience of 3, annealing rate of 0.5, mini-batch size of 32, and learning rate of 0.1) to compare BIOES and BIO under our new evaluation method (we no longer use the conll03 script). With 5 runs per setting, this gives 92.842 ± 0.11 for BIOES and 92.984 ± 0.13 for BIO. These are preliminary numbers meant as a rough indication; with parameter selection they might still change a little. I hope we can complete a full evaluation and publish the numbers here soon!
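For reference, a rough sketch of how such a run can be set up. This assumes a later Flair API than 0.3.2 (the dataset loader and dictionary helpers were renamed across releases), the GloVe-plus-Flair embedding stack from the paper, and placeholder paths; the CoNLL-03 data must be available locally since it is not redistributable:

```python
from flair.datasets import CONLL_03
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# CoNLL-03 is not bundled with Flair; the loader expects local copies.
corpus = CONLL_03(base_path="resources/tasks")

embeddings = StackedEmbeddings([
    WordEmbeddings("glove"),
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
])

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=corpus.make_label_dictionary(label_type="ner"),
    tag_type="ner",
    use_crf=True,
    locked_dropout=0.5,    # the default mentioned above
)

trainer = ModelTrainer(tagger, corpus)
trainer.train(
    "resources/taggers/conll03-ner",
    learning_rate=0.1,     # initial LR, annealed on plateau
    mini_batch_size=32,
    anneal_factor=0.5,     # the "annealing rate" above
    patience=3,
    max_epochs=150,
)
```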

@stefan-it (Member) commented:

@alanakbik That sounds very interesting. Would you share a preprint of the upcoming paper with us?

@alanakbik (Collaborator) commented:

@stefan-it As soon as there is one, absolutely :)

@roshammar commented:

Hello @alanakbik, good timing on your comment. I am thinking of training an NER model for Swedish (as in issue #3) and will probably want to benchmark my results.

Is there a way I could take a look at how you ran your BIOES vs. BIO tests and how you computed the results with the new evaluation method?

@alanakbik (Collaborator) commented:

Awesome, please do; we'd be very interested to hear how the approach performs for Swedish!

I would generally stick to the tutorial on how to train models and use the suggested values there. The current code uses the new evaluation script and defaults to BIOES, which is generally a good option. We are running more experiments on BIO vs. BIOES, but the difference between the two seems small; depending on your luck with random seeds, one or the other works better.
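To illustrate what the span-level scoring looks like in principle (a sketch only, not Flair's actual evaluation code), entity-level F1 counts a prediction as correct only when both boundaries and type match, which handles BIO and BIOES uniformly and avoids the overestimation discussed above:

```python
def tags_to_spans(tags):
    """Decode a BIO or BIOES tag sequence into (start, end, type) spans.
    Assumes well-formed sequences; illustrative sketch only."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):      # sentinel flushes the last span
        prefix, _, ent = tag.partition("-")
        inside = prefix in ("I", "E") and ent == label
        if not inside and label is not None:    # current span ends here
            spans.append((start, i - 1, label))
            start, label = None, None
        if prefix in ("B", "S"):                # both schemes start spans on B-/S-
            start, label = i, ent
    return set(spans)

def span_f1(gold, pred):
    """Entity-level F1: a span counts only if boundaries and type both match."""
    g, p = tags_to_spans(gold), tags_to_spans(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

gold = ["B-PER", "E-PER", "O", "S-LOC"]   # BIOES
pred = ["B-PER", "I-PER", "O", "B-LOC"]   # BIO, same entities
print(span_f1(gold, pred))                 # 1.0
```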

stale bot commented Apr 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Apr 30, 2020
stale bot closed this as completed on May 7, 2020