NER for Polish #1070

Closed
k-sap opened this issue Jul 4, 2022 · 6 comments

@k-sap
Contributor

k-sap commented Jul 4, 2022

I'd like to add an NER model for Polish. For now, I wonder what else is needed.

Datasets

Baseline models

Results
For the char-lm model:

2022-06-28 13:39:24 INFO: Running NER tagger in predict mode	
2022-06-28 13:39:25 INFO: Loading data with batch size 32...	
2022-06-28 13:39:26 DEBUG: 38 batches created.	
2022-06-28 13:39:26 INFO: Start evaluation...	
2022-06-28 13:39:37 INFO: Score by entity:	
Prec.   Rec.    F1	
85.55   87.69   86.61	
2022-06-28 13:39:37 INFO: Score by token:	
Prec.   Rec.    F1	
68.59   68.98   68.78	
2022-06-28 13:39:37 INFO: NER tagger score:	
2022-06-28 13:39:37 INFO: pl_nkjp 86.61	

I could definitely improve these models further and share an update in the coming weeks.
I'd like to ask whether there is anything more I need to prepare to include these in the next Stanza release.

In particular, I'm not sure about:

  • BERT integration; for now I have only added the training parameter in my version
  • to what extent sharing the converted NER data & conversion code is needed
@AngledLuffa
Collaborator

Neat! We would be happy to include this.

The bert model (or any other transformer) will be saved as a parameter in the model file, so that should work even without integrating.
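
If it's useful, a quick check along these lines should show which transformer a saved model carries; the file name is a placeholder, and I'm assuming the checkpoint keeps its training arguments under a "config" key:

```python
import torch

# Placeholder file name for a trained Polish NER model; the assumption here is
# that the checkpoint dict stores its training arguments under a "config" key.
checkpoint = torch.load("pl_nkjp_nertagger.pt", map_location="cpu")
print(checkpoint["config"].get("bert_model"))
```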

However, is there anything preventing you from integrating the code that converts the text input? It would make rebuilding the model much easier in the future, in case there are architectural improvements or the file format changes. We'd be happy to give you attribution, of course.

In terms of using the Wikipedia Subcorpus for the charlm, have you tried adding the "free corpus" as well? It's also pretty straightforward to get a recent Wikipedia dump and remove most of the non-text using wikiextractor; I've been using the "pages-meta-current" files when extracting recent Wikipedia dumps.
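
As a rough sketch of that extraction step (wikiextractor is installed separately, and the dump file name and output directory below are placeholders):

```python
import subprocess

# "pages-meta-current" dump for Polish Wikipedia, downloaded from dumps.wikimedia.org
dump = "plwiki-latest-pages-meta-current.xml.bz2"

# wikiextractor strips templates and markup, leaving mostly plain text
# that can then be fed to the charlm training scripts
subprocess.run(
    ["python", "-m", "wikiextractor.WikiExtractor", dump, "-o", "plwiki_text"],
    check=True,
)
```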

@AngledLuffa
Collaborator

it's a little weird that the score by token and score by entity are so wildly different, isn't it? that doesn't usually happen

@k-sap
Contributor Author

k-sap commented Jul 18, 2022

However, is there anything preventing you from integrating the code that converts the text input? It would make rebuilding the model much easier in the future, in case there are architectural improvements or the file format changes. We'd be happy to give you attribution, of course.

Would it be better to prepare a script that converts to BIOES files, or to .json files directly?

it's a little weird that the score by token and score by entity are so wildly different, isn't it? that doesn't usually happen

I suppose there are some issues with labeling start and end of entities. I need to look into that.

In terms of using the Wikipedia Subcorpus for the charlm, have you tried adding the "free corpus" as well?

Not yet. I hope to manage it at the beginning of next month and then summarize everything clearly.

Do you have any dataset split guidelines?

Just to be sure, POS tags are not needed in the training files? There are only tokens and NER annotations in the example.

@AngledLuffa
Collaborator

Would it be better to prepare a script that converts to BIOES files, or to .json files directly?

For sure I would recommend BIO or BIOES. The conversion script to .json already does the formatting, and for that matter, we may eventually change that format if we ever move to Nested NER.
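
For illustration, a BIO file is just one token and its tag per line, with a blank line between sentences; something like this writes one made-up sentence (the tokens, tag names, and file name are purely illustrative):

```python
# One (token, tag) pair per line, tab-separated; a blank line ends the sentence.
sentence = [
    ("Jan", "B-persName"), ("Kowalski", "I-persName"),
    ("mieszka", "O"), ("w", "O"), ("Warszawie", "B-placeName"), (".", "O"),
]
with open("pl_nkjp.train.bio", "w", encoding="utf-8") as fout:
    for token, tag in sentence:
        fout.write(f"{token}\t{tag}\n")
    fout.write("\n")  # sentence separator
```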

I suppose there are some issues with labeling start and end of entities. I need to look into that.

That could be it!

Do you have any dataset split guidelines?

You mean for the charlm? It will randomly shuffle the data first.

Just to be sure, POS tags are not needed in the training files? There are only tokens and NER annotations in the example.

POS tags are not needed
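
For reference, a minimal sketch of a training sentence in the .json format, carrying only "text" and "ner" per token (the sentence and file name are invented, and the tags here are BIOES):

```python
import json

# Each sentence is a list of tokens with only "text" and "ner" -- no POS tags.
sentences = [
    [
        {"text": "Jan", "ner": "B-persName"},
        {"text": "Kowalski", "ner": "E-persName"},
        {"text": "mieszka", "ner": "O"},
        {"text": "w", "ner": "O"},
        {"text": "Warszawie", "ner": "S-placeName"},
        {"text": ".", "ner": "O"},
    ],
]
with open("pl_nkjp.train.json", "w", encoding="utf-8") as fout:
    json.dump(sentences, fout, ensure_ascii=False, indent=2)
```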

@AngledLuffa
Collaborator

I wrote this up in case it helps:

https://stanfordnlp.github.io/stanza/new_language_ner.html
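
The two main steps in that guide boil down to roughly the following, assuming the pl_nkjp dataset shorthand and the data directories set up as described there:

```python
import subprocess

# 1) Convert the raw corpus into the files the tagger trains on
subprocess.run(
    ["python", "-m", "stanza.utils.datasets.ner.prepare_ner_dataset", "pl_nkjp"],
    check=True,
)

# 2) Train and score the NER tagger on the prepared dataset
subprocess.run(
    ["python", "-m", "stanza.utils.training.run_ner", "pl_nkjp"],
    check=True,
)
```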

k-sap mentioned this issue Aug 30, 2022
AngledLuffa pushed a commit that referenced this issue Sep 4, 2022
* Add NER dataset for Polish

Co-authored-by: ryszardtuora <ryszardtuora@gmail.com>
Co-authored-by: Karol Saputa <ksaputa@gputrain.dariah.ipipan.waw.pl>

This PR adds Polish NER dataset

#1070
@AngledLuffa
Collaborator

Thank you for the assistance! It is excellent when people add new models to our tool for us.

After reordering the input files with sorted() as we discussed on the PR, I retrained two models. One uses Stanza's built-in charlm and gets the following scores:

--- DEV ---
2022-09-13 11:48:56 INFO: Score by entity:
Prec.   Rec.    F1
88.22   87.66   87.94
2022-09-13 11:48:56 INFO: Score by token:
Prec.   Rec.    F1
89.12   88.26   88.69

--- TEST ---
2022-09-13 11:49:05 INFO: Score by entity:
Prec.   Rec.    F1
88.99   88.48   88.73
2022-09-13 11:49:05 INFO: Score by token:
Prec.   Rec.    F1
88.84   89.67   89.25

The other uses the transformer you recommended. It gets:

--- DEV ---
2022-09-13 10:40:58 INFO: Score by entity:
Prec.   Rec.    F1
91.18   91.42   91.30
2022-09-13 10:40:58 INFO: Score by token:
Prec.   Rec.    F1
91.21   91.88   91.54

--- TEST ---
2022-09-13 10:41:11 INFO: Score by entity:
Prec.   Rec.    F1
91.04   91.07   91.05
2022-09-13 10:41:11 INFO: Score by token:
Prec.   Rec.    F1
90.09   91.98   91.02

Without the charlm, scores drop to 85.96 on dev and 87.37 on test.

The charlm model is now the default NER model for PL if you build a Polish pipeline with 1.4.1 (try it!), and you can get the transformer model instead with the nkjp_bert package.
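
A quick sketch for trying it out (the example sentence is made up, and the exact syntax for selecting a per-processor package is worth double-checking against the Pipeline docs):

```python
import stanza

stanza.download("pl")  # fetches the default Polish models, including the charlm NER model

nlp = stanza.Pipeline("pl", processors="tokenize,ner")
doc = nlp("Jan Kowalski mieszka w Warszawie.")
print([(ent.text, ent.type) for ent in doc.ents])

# For the transformer variant, select the nkjp_bert NER package when building the
# pipeline, e.g. something along the lines of processors={"ner": "nkjp_bert"}.
```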
