NER for Polish #1070
Comments
Neat! We would be happy to include this. The BERT model (or any other transformer) will be saved as a parameter in the model file, so that should work even without integrating. However, is there any obstacle to integrating the code that converts the text input? It would make rebuilding the model much easier in the future, in case there are architectural improvements or the file format changes. We'd be happy to give you attribution, of course.

In terms of using the Wikipedia Subcorpus for the charlm, have you tried adding the "free corpus" as well? It's also pretty straightforward to get a recent Wikipedia dump and remove most of the non-text using wikiextractor. I've been using the "pages meta current" files when extracting recent Wikipedia.
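As a side note on the extraction step: wikiextractor wraps each article in `<doc ...> ... </doc>` lines, which you generally want to drop before charlm training. A minimal post-processing sketch (an illustration, not Stanza's actual prep script):

```python
# wikiextractor emits articles wrapped in <doc id=... title=...> ... </doc>;
# for charlm training we typically keep only the plain text lines.
def strip_doc_tags(lines):
    """Drop wikiextractor's <doc> wrapper lines and blank lines."""
    return [ln for ln in lines
            if ln.strip()
            and not ln.startswith("<doc")
            and ln.strip() != "</doc>"]

raw = ['<doc id="1" title="Warszawa">',
       "Warszawa",
       "Stolica Polski.",
       "</doc>",
       ""]
print(strip_doc_tags(raw))  # ['Warszawa', 'Stolica Polski.']
```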
It's a little weird that the score by token and the score by entity are so wildly different, isn't it? That doesn't usually happen.
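For intuition (a toy illustration, not the scores from this thread or Stanza's actual scorer): the two metrics can diverge because a single boundary mistake leaves most tokens correct but voids the entire entity.

```python
def spans(tags):
    """Extract (start, end, type) entity spans from a BIO tag sequence."""
    out, start, etype = [], None, None
    for i, t in enumerate(tags):
        if t.startswith("B-") or (t.startswith("I-") and etype != t[2:]):
            if start is not None:
                out.append((start, i, etype))
            start, etype = i, t[2:]
        elif t == "O":
            if start is not None:
                out.append((start, i, etype))
            start, etype = None, None
    if start is not None:
        out.append((start, len(tags), etype))
    return set(out)

gold = ["B-PER", "I-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O",     "O", "B-LOC"]  # one boundary error

# Token-level: 4 of 5 tags match.
token_acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Entity-level: the truncated PER span counts as fully wrong.
gold_spans, pred_spans = spans(gold), spans(pred)
correct = len(gold_spans & pred_spans)
prec, rec = correct / len(pred_spans), correct / len(gold_spans)
f1 = 2 * prec * rec / (prec + rec)

print(token_acc)  # 0.8
print(f1)         # 0.5
```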
Would it be better to prepare a script that converts to BIOES files, or to .json files directly?
I suppose there are some issues with labeling the start and end of entities. I need to look into that.
Do you have any dataset split guideline? And to be sure: POS tags are not needed in the training files? There are only tokens and NER annotations in the example.
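On the split question: when a dataset has no official split, a common fallback (an assumption here, not a stated Stanza requirement) is a seeded shuffle followed by an 80/10/10 train/dev/test division, so the split is reproducible:

```python
import random

# Placeholder sentences standing in for the real annotated data.
sentences = [f"sent_{i}" for i in range(100)]

# Seeded shuffle keeps the split reproducible across runs.
random.Random(1234).shuffle(sentences)

n = len(sentences)
train = sentences[: int(0.8 * n)]
dev = sentences[int(0.8 * n): int(0.9 * n)]
test = sentences[int(0.9 * n):]

print(len(train), len(dev), len(test))  # 80 10 10
```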
For sure I would recommend BIO or BIOES. The conversion script to .json already does the formatting, and for that matter, we may eventually change that format if we ever move to Nested NER.
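A minimal sketch of the BIO-to-BIOES conversion, plus one sentence in a token-dict .json layout (the exact schema here is an assumption for illustration, not necessarily what Stanza's conversion script emits):

```python
import json

def bio_to_bioes(tags):
    """Convert well-formed BIO tags to BIOES:
    single-token entities -> S-, entity-final tokens -> E-."""
    out = []
    for i, t in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if t.startswith("B-"):
            out.append(("B-" if nxt.startswith("I-") else "S-") + t[2:])
        elif t.startswith("I-"):
            out.append(("I-" if nxt.startswith("I-") else "E-") + t[2:])
        else:
            out.append("O")
    return out

tokens = ["Maria", "mieszka", "w", "Warszawie"]
tags = ["B-PER", "O", "O", "B-LOC"]
bioes = bio_to_bioes(tags)
print(bioes)  # ['S-PER', 'O', 'O', 'S-LOC']

# One sentence as a list of token dicts (hypothetical layout):
doc = [[{"text": w, "ner": t} for w, t in zip(tokens, bioes)]]
print(json.dumps(doc, ensure_ascii=False))
```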
That could be it!
You mean for the charlm? It will randomly shuffle the data first.
POS tags are not needed
I wrote this up in case it helps:
* Add NER dataset for Polish

Co-authored-by: ryszardtuora <ryszardtuora@gmail.com>
Co-authored-by: Karol Saputa <ksaputa@gputrain.dariah.ipipan.waw.pl>

This PR adds Polish NER dataset #1070
Thank you for the assistance! It is excellent when people add new models to our tool for us. After reordering the input files with
The other uses the transformer you recommended. It gets:
Without the charlm, scores drop to

The charlm model is now the default NER model for PL if you build a Polish pipeline with 1.4.1 (try it!), and you can get the transformer model instead with the
I'd like to add an NER model for Polish. For now, I wonder what else is needed.
Datasets
Baseline models
Results
For char-lm model:
I could definitely improve these models further and share an update in the coming weeks.
I'd like to ask whether there is anything more I need to prepare to include these in the next Stanza release.
In particular, I'm not sure about