NER model trained for Ukrainian (model and code provided) #319
Comments
Thank you for your contribution! We'll think about integrating this into our model zoo after thorough testing.

Updated 05/25: added it to the dev branch:

```python
nlp = stanza.Pipeline('uk', processors={'tokenize': 'default', 'ner': 'languk'}, package=None, ner_forward_charlm_path="", ner_backward_charlm_path="")
```
Thank you! We have added this to our models directory as of 1.1.1.
Thank you guys for all your great work on Stanza.
Thank you @gawy! It's cool that Stanza now provides named entity recognition for Ukrainian!
I'm one of the authors of lang-uk, and I'm happy to see how the products we've created are being used. We are planning to update the corpus and the NER models as well. @gawy, do you want to help?
Another possibility is that we can incorporate the conversion script into stanza, if you don't mind, @gawy. We started putting together a conversion tool which handles multiple datasets: https://github.com/stanfordnlp/stanza/blob/dev/stanza/utils/datasets/ner/prepare_ner_dataset.py Just let us know when it's updated, @dchaplinsky, and whether you mind if we include your work, @gawy!
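For context, NER dataset conversion scripts like the one above typically boil down to mapping annotated spans onto per-token BIO tags. A hedged sketch of that core step (the token indices, tag names, and the `spans_to_bio` helper here are illustrative, not the actual lang-uk annotation layout or the real script's interface):

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token-index spans to BIO tags.

    Illustrative only: the real lang-uk data and prepare_ner_dataset.py
    use their own file formats; this shows the general idea.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # continuation tokens
    return tags

tokens = ["Тарас", "Шевченко", "народився", "в", "Моринцях"]
spans = [(0, 2, "PERS"), (4, 5, "LOC")]
print(list(zip(tokens, spans_to_bio(tokens, spans))))
```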
Sure thing, happy to help in both cases. @dchaplinsky, let me know what kind of help you need with the update of the dataset. I'd assume the structure remains the same and it's just the amount and quality of data that's changing. In that case, with the script included in the Stanza pipeline, the process should be rather straightforward; the script will just need to pull the dataset from some URL. Anyway, let me know how I can help here.
If you want to make a pull request which puts your conversion file into stanza/utils/datasets/ner and the test file into tests, that would be great, but we can also do that ourselves if you prefer.
@gawy rather than ask you to do any more work, I just went ahead and did it myself. Thank you again for providing this and agreeing to have us integrate it with our tools. You wrote a more thorough unit test than any of us have written recently! Did you use pretrained word vectors when building the model? If so, were they the same ones as in our UK pipeline, or something else?
@gawy we are planning to release a small update (+20%) to the corpus, and later update it to cover 100% of the underlying brown-uk corpus. After that we are planning to add more texts from news websites. What I really want to try is using an ensemble of existing models (MITIE, Stanza, BERT, polyglot, and heuristics) to pre-tag the text and then validate it through an editor. I actually paid real money out of my own pocket to tag the initial corpus; now I'd like to get more texts tagged for that money :) Can we connect somehow outside of GitHub? We have a Telegram group, and I'm available on any messenger except WeChat.
@AngledLuffa it's hard for me to say with 100% certainty, but based on the file name in the archive I used for training, these were the vectors that Stanza uses by default.
Describe the solution you'd like
For my own project I have trained a Stanza NER model using the lang-uk community dataset, and I thought it would be great to include it as part of the official package.
Describe alternatives you've considered
The lang-uk community itself has trained a NER model for https://github.com/mit-nlp/MITIE/
Additional context
The data conversion code and trained model can be found here:
https://github.com/gawy/stanza-lang-uk/releases/tag/v0.9
https://github.com/gawy/stanza-lang-uk
Let me know if that is up to your standards, or what kind of help might be required.