
NER model trained for Ukrainian (model and code provided) #319

Closed
gawy opened this issue May 19, 2020 · 11 comments

@gawy
Contributor

gawy commented May 19, 2020

Describe the solution you'd like
For my own project I trained a Stanza NER model using the lang-uk community dataset, and I thought it would be great to include it as part of the official package.

Describe alternatives you've considered
The lang-uk community itself has trained a NER model for https://github.com/mit-nlp/MITIE/

Additional context
Data conversion code and the trained model can be found here:
https://github.com/gawy/stanza-lang-uk/releases/tag/v0.9
https://github.com/gawy/stanza-lang-uk

Let me know if that's up to your standards, or what kind of help might be required.

@yuhui-zh15
Member

yuhui-zh15 commented May 20, 2020

Thank you for your contribution! We'll think about integrating this into our model zoo after thorough testing.

Updated 05/25: added it to dev-branch resources.json. Can be accessed by:

nlp = stanza.Pipeline('uk', processors={'tokenize':'default','ner':'languk'}, package=None, ner_forward_charlm_path="", ner_backward_charlm_path="")
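
For context, Stanza's NER component tags each token with a BIOES-scheme label (exposed as `token.ner`, with `doc.ents` collecting the spans). A minimal pure-Python sketch of that collection step (an illustration, not Stanza's actual implementation):

```python
# Sketch: collapse token-level BIOES NER tags into (text, type) entity
# spans, roughly what stanza exposes as doc.ents. Illustrative only.

def bioes_to_entities(tagged_tokens):
    """tagged_tokens: list of (token, tag) pairs, tags in the BIOES scheme."""
    entities = []
    current = []  # tokens of the entity span currently being built
    for token, tag in tagged_tokens:
        if tag == "O":
            current = []
            continue
        prefix, _, label = tag.partition("-")
        if prefix in ("B", "S"):   # B/S open a new span
            current = [token]
        else:                      # I/E continue the open span
            current.append(token)
        if prefix in ("S", "E"):   # S/E close the span
            entities.append((" ".join(current), label))
            current = []
    return entities

print(bioes_to_entities([
    ("Тарас", "B-PERS"), ("Шевченко", "E-PERS"),
    ("жив", "O"), ("у", "O"), ("Києві", "S-LOC"),
]))
# → [('Тарас Шевченко', 'PERS'), ('Києві', 'LOC')]
```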

@AngledLuffa
Collaborator

Thank you! We have added this to our models directory as of 1.1.1.

@gawy
Contributor Author

gawy commented Aug 14, 2020

Thank you guys for all your great work on Stanza

@andrkrav

Thank you @gawy ! It's cool that Stanza now provides named entity recognition for Ukrainian!

@dchaplinsky

I'm one of the authors of lang-uk, and I'm happy to see how the products we've created are being used.

We are planning to update the corpus and models for NER as well. @gawy do you want to help?

@AngledLuffa
Collaborator

Another possibility is that we could incorporate the conversion script into stanza, if you don't mind, @gawy.

We started putting together a conversion tool which handles multiple datasets:

https://github.com/stanfordnlp/stanza/blob/dev/stanza/utils/datasets/ner/prepare_ner_dataset.py
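
For reference, the core of such a conversion is projecting character-offset entity spans (as in lang-uk's brat-style .ann annotations) onto tokens as BIO tags. A simplified, hypothetical sketch assuming plain whitespace tokenization (the function name and details are illustrative, not the actual prepare_ner_dataset.py code):

```python
# Hypothetical sketch: turn character-offset entity spans into token-level
# BIO tags. Whitespace tokenization is a simplification; a real corpus
# needs a proper tokenizer.

def spans_to_bio(text, spans):
    """spans: list of (start, end, label) character offsets into text."""
    tokens = []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)  # locate the token's character offset
        end = start + len(tok)
        pos = end
        tokens.append((tok, start, end))
    tagged = []
    for tok, t_start, t_end in tokens:
        tag = "O"
        for s_start, s_end, label in spans:
            if t_start >= s_start and t_end <= s_end:
                # first token of a span gets B-, the rest get I-
                tag = ("B-" if t_start == s_start else "I-") + label
                break
        tagged.append((tok, tag))
    return tagged

print(spans_to_bio("Тарас Шевченко народився у Моринцях",
                   [(0, 14, "PERS"), (27, 35, "LOC")]))
# → [('Тарас', 'B-PERS'), ('Шевченко', 'I-PERS'), ('народився', 'O'),
#    ('у', 'O'), ('Моринцях', 'B-LOC')]
```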

Just let us know when it's updated, @dchaplinsky, and whether you mind us including your work, @gawy!

@gawy
Contributor Author

gawy commented Apr 28, 2021

Sure thing, happy to help in both cases.
I'm OK with the conversion script being incorporated into stanza, @AngledLuffa. Do you need any assistance here?

@dchaplinsky let me know what kind of help you need with the dataset update. I'd assume the structure stays the same and only the amount and quality of the data change; in that case, with the script included in the Stanza pipeline, the process should be rather straightforward. The script will just need to pull the dataset from some URL. Anyway, let me know how I can help here.

@AngledLuffa
Collaborator

AngledLuffa commented Apr 28, 2021 via email

@AngledLuffa
Collaborator

@gawy rather than ask you to do any more work I just went ahead and did it myself:

#683

Thank you again for providing this and agreeing to have us integrate it with our tools. You wrote a more thorough unit test than any of us have written recently! Did you use pretrained word vectors when building the model? If so, was it the same ones as with our UK pipeline, or something else?

@dchaplinsky

@gawy we are planning to release a small update (+20%) to the corpus, and later to update it to cover 100% of the underlying brown-uk corpus. After that, we plan to add more texts from news websites.

What I really want to try is using an ensemble of existing models (MITIE, Stanza, BERT, Polyglot, and heuristics) to pre-tag the text and then validate it through an editor. To tag the initial corpus I actually paid real money out of my own pocket; now I'd like to get more texts tagged for that money :)

Can we connect somehow outside of GitHub? We have a Telegram group, and I'm available on any messenger except WeChat.

@gawy
Contributor Author

gawy commented Apr 30, 2021

@AngledLuffa it's hard for me to say with 100% certainty, but based on the file name in the archive I used for training, these were the vectors Stanza uses by default.
I experimented with several others, but they did not produce better results, so I think I kept the defaults.
