
NER model trained for Ukrainian (model and code provided) #319

Closed
gawy opened this issue May 19, 2020 · 11 comments

@gawy
Contributor

gawy commented May 19, 2020

Describe the solution you'd like
For my own project I trained a Stanza NER model using the lang-uk community dataset, and I thought it would be great to include it as part of the official package.

Describe alternatives you've considered
The lang-uk community itself has trained a NER model for https://github.com/mit-nlp/MITIE/

Additional context
Data conversion code and the trained model can be found here:
https://github.com/gawy/stanza-lang-uk/releases/tag/v0.9
https://github.com/gawy/stanza-lang-uk

Let me know if that's up to your standards, or what kind of help might be required.

@yuhui-zh15
Member

yuhui-zh15 commented May 20, 2020

Thank you for your contribution! We'll think about integrating this into our model zoo after thorough testing.

Updated 05/25: added it to dev-branch resources.json. Can be accessed by:

nlp = stanza.Pipeline('uk', processors={'tokenize':'default','ner':'languk'}, package=None, ner_forward_charlm_path="", ner_backward_charlm_path="")
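
For context, Stanza's NER component tags each token with a BIOES-scheme label (exposed as `token.ner`, with `doc.ents` collecting the spans). A minimal pure-Python sketch of that collection step (an illustration, not Stanza's actual implementation):

```python
# Sketch: collapse token-level BIOES NER tags into (text, type) entity
# spans, roughly what stanza exposes as doc.ents. Illustrative only.

def bioes_to_entities(tagged_tokens):
    """tagged_tokens: list of (token, tag) pairs, tags in the BIOES scheme."""
    entities = []
    current = []  # tokens of the entity span currently being built
    for token, tag in tagged_tokens:
        if tag == "O":
            current = []
            continue
        prefix, _, label = tag.partition("-")
        if prefix in ("B", "S"):   # B/S open a new span
            current = [token]
        else:                      # I/E continue the open span
            current.append(token)
        if prefix in ("S", "E"):   # S/E close the span
            entities.append((" ".join(current), label))
            current = []
    return entities

print(bioes_to_entities([
    ("Тарас", "B-PERS"), ("Шевченко", "E-PERS"),
    ("жив", "O"), ("у", "O"), ("Києві", "S-LOC"),
]))
# → [('Тарас Шевченко', 'PERS'), ('Києві', 'LOC')]
```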

@AngledLuffa
Collaborator

Thank you! We have added this to our models directory as of 1.1.1.

@gawy
Contributor Author

gawy commented Aug 14, 2020

Thank you guys for all your great work on Stanza

@andrkrav

Thank you @gawy ! It's cool that Stanza now provides named entity recognition for Ukrainian!

@dchaplinsky

I'm one of the authors of lang-uk, and I'm happy to see how the products we've created are being used.

We are planning to update the corpus and models for NER as well. @gawy do you want to help?

@AngledLuffa
Collaborator

Another possibility is that we could incorporate the conversion script into stanza, if you don't mind, @gawy.

We started putting together a conversion tool which handles multiple datasets:

https://github.com/stanfordnlp/stanza/blob/dev/stanza/utils/datasets/ner/prepare_ner_dataset.py
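
For reference, the core of such a conversion is projecting character-offset entity spans (as in lang-uk's brat-style .ann annotations) onto tokens as BIO tags. A simplified, hypothetical sketch assuming plain whitespace tokenization (the function name and details are illustrative, not the actual prepare_ner_dataset.py code):

```python
# Hypothetical sketch: turn character-offset entity spans into token-level
# BIO tags. Whitespace tokenization is a simplification; a real corpus
# needs a proper tokenizer.

def spans_to_bio(text, spans):
    """spans: list of (start, end, label) character offsets into text."""
    tokens = []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)  # locate the token's character offset
        end = start + len(tok)
        pos = end
        tokens.append((tok, start, end))
    tagged = []
    for tok, t_start, t_end in tokens:
        tag = "O"
        for s_start, s_end, label in spans:
            if t_start >= s_start and t_end <= s_end:
                # first token of a span gets B-, the rest get I-
                tag = ("B-" if t_start == s_start else "I-") + label
                break
        tagged.append((tok, tag))
    return tagged

print(spans_to_bio("Тарас Шевченко народився у Моринцях",
                   [(0, 14, "PERS"), (27, 35, "LOC")]))
# → [('Тарас', 'B-PERS'), ('Шевченко', 'I-PERS'), ('народився', 'O'),
#    ('у', 'O'), ('Моринцях', 'B-LOC')]
```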

Just let us know when it's updated, @dchaplinsky, and whether you mind us including your work, @gawy!

@gawy
Contributor Author

gawy commented Apr 28, 2021

Sure thing, happy to help in both cases.
I'm OK with the conversion script being incorporated into stanza, @AngledLuffa. Do you need any assistance here?

@dchaplinsky let me know what kind of help you need with the dataset update. I'd assume the structure stays the same and only the amount and quality of the data change; in that case, with the script included in the Stanza pipeline, the process should be rather straightforward. The script will just need to pull the dataset from some URL. Anyway, let me know how I can help here.

@AngledLuffa
Collaborator

AngledLuffa commented Apr 28, 2021 via email

@AngledLuffa
Collaborator

@gawy rather than ask you to do any more work I just went ahead and did it myself:

#683

Thank you again for providing this and agreeing to have us integrate it with our tools. You wrote a more thorough unit test than any of us have written recently! Did you use pretrained word vectors when building the model? If so, was it the same ones as with our UK pipeline, or something else?

@dchaplinsky

@gawy we are planning to release a small update (+20%) to the corpus, and later to update it to cover 100% of the underlying brown-uk corpus. After that, we plan to add more texts from news websites.

What I really want to try is using an ensemble of existing models (MITIE, Stanza, BERT, Polyglot, and heuristics) to pre-tag the text and then validate it through an editor. To tag the initial corpus I actually paid real money out of my own pocket; now I'd like to get more texts tagged for that money :)

Can we connect somehow outside of GitHub? We have a Telegram group, and I'm available on any messenger except WeChat.

@gawy
Contributor Author

gawy commented Apr 30, 2021

@AngledLuffa it's hard for me to say with 100% certainty, but based on the file name in the archive I used for training, these were the vectors Stanza uses by default.
I experimented with several others, but they did not produce better results, so I think I kept the defaults.
