
Overestimated performance for NER in the 0.2 version #206

Closed
qiuwei opened this issue Nov 14, 2018 · 6 comments
Labels
wontfix This will not be worked on

Comments


qiuwei commented Nov 14, 2018

I found a bug in the evaluation script for NER in the 0.2 version.
The perl script was originally written to handle BIO-encoded input. When it is run on BIOES-encoded input, it gives no warning but produces an overestimated result (in my experience, typically 0.3 to 0.4 points higher than the actual F1).

So if you used the 0.2 version to produce the numbers reported in the COLING paper, I suspect they may be overestimated.
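For anyone who wants to keep using the perl script in the meantime, one workaround is to map BIOES output back to BIO before scoring. A minimal sketch (a hypothetical helper, not part of Flair):

```python
def bioes_to_bio(tags):
    """Map BIOES tags to BIO so the original conlleval perl script
    counts spans correctly: S- becomes B-, E- becomes I-."""
    converted = []
    for tag in tags:
        if tag.startswith("S-"):
            converted.append("B-" + tag[2:])
        elif tag.startswith("E-"):
            converted.append("I-" + tag[2:])
        else:
            converted.append(tag)  # B-, I- and O pass through unchanged
    return converted

# The same two entities in both schemes:
print(bioes_to_bio(["B-PER", "E-PER", "O", "S-LOC"]))
# ['B-PER', 'I-PER', 'O', 'B-LOC']
```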

@qiuwei changed the title from "Overestimated performance in the 0.2 version" to "Overestimated performance for NER in the 0.2 version" on Nov 14, 2018
@alanakbik (Collaborator) commented:

Hello @qiuwei, thanks for pointing this out. I think the best thing to do is to run another parameter sweep and post updated numbers. We are gearing up for another round of bigger experiments in the context of a new paper and the 0.4 release of Flair, so we'll probably end up doing it all at the same time.

For now, I've run some initial experiments on 0.3.2 over CoNLL-03 with our current default training settings (locked dropout of 0.5, patience of 3, annealing rate of 0.5, mini-batch size of 32, and learning rate of 0.1) to compare BIOES and BIO under our new evaluation method (we no longer use the conll03 script). With 5 runs per setting, this gives 92.842 ± 0.11 for BIOES and 92.984 ± 0.13 for BIO. These are preliminary numbers meant as a rough indication; with parameter selection they might still change a little. I hope we can complete a full evaluation and publish the numbers here soon!
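For reference, a rough sketch of how such a run can be set up. This assumes a later Flair API than 0.3.2 (the dataset loader and dictionary helpers were renamed across releases), the GloVe-plus-Flair embedding stack from the paper, and placeholder paths; the CoNLL-03 data must be available locally since it is not redistributable:

```python
from flair.datasets import CONLL_03
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# CoNLL-03 is not bundled with Flair; the loader expects local copies.
corpus = CONLL_03(base_path="resources/tasks")

embeddings = StackedEmbeddings([
    WordEmbeddings("glove"),
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
])

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=corpus.make_label_dictionary(label_type="ner"),
    tag_type="ner",
    use_crf=True,
    locked_dropout=0.5,    # the default mentioned above
)

trainer = ModelTrainer(tagger, corpus)
trainer.train(
    "resources/taggers/conll03-ner",
    learning_rate=0.1,     # initial LR, annealed on plateau
    mini_batch_size=32,
    anneal_factor=0.5,     # the "annealing rate" above
    patience=3,
    max_epochs=150,
)
```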

@stefan-it (Member) commented:

@alanakbik That sounds very interesting. Would you share a preprint of the upcoming paper with us?

@alanakbik (Collaborator) commented:

@stefan-it As soon as there is one, absolutely :)

@roshammar commented:

Hello @alanakbik, good timing on your comment. I am thinking of training an NER model for Swedish (as in issue #3) and will probably want to benchmark my results.

Is there a way I could take a look at how you ran your BIOES vs. BIO tests and how you computed the results with the new evaluation method?

@alanakbik (Collaborator) commented:

Awesome, please do; we'd be very interested to hear how the approach performs for Swedish!

I would generally stick to the tutorial on how to train models and use the suggested values there. The current code uses the new evaluation script and defaults to BIOES, which is generally a good option. We are running more experiments on BIO vs. BIOES, but the difference between the two seems small; depending on your luck with random seeds, one or the other works better.
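To illustrate what the span-level scoring looks like in principle (a sketch only, not Flair's actual evaluation code), entity-level F1 counts a prediction as correct only when both boundaries and type match, which handles BIO and BIOES uniformly and avoids the overestimation discussed above:

```python
def tags_to_spans(tags):
    """Decode a BIO or BIOES tag sequence into (start, end, type) spans.
    Assumes well-formed sequences; illustrative sketch only."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):      # sentinel flushes the last span
        prefix, _, ent = tag.partition("-")
        inside = prefix in ("I", "E") and ent == label
        if not inside and label is not None:    # current span ends here
            spans.append((start, i - 1, label))
            start, label = None, None
        if prefix in ("B", "S"):                # both schemes start spans on B-/S-
            start, label = i, ent
    return set(spans)

def span_f1(gold, pred):
    """Entity-level F1: a span counts only if boundaries and type both match."""
    g, p = tags_to_spans(gold), tags_to_spans(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

gold = ["B-PER", "E-PER", "O", "S-LOC"]   # BIOES
pred = ["B-PER", "I-PER", "O", "B-LOC"]   # BIO, same entities
print(span_f1(gold, pred))                 # 1.0
```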

stale bot commented Apr 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Apr 30, 2020
stale bot closed this as completed on May 7, 2020