NER for Polish #1070

Closed
k-sap opened this issue Jul 4, 2022 · 6 comments

@k-sap
Contributor

k-sap commented Jul 4, 2022

I'd like to add an NER model for Polish. For now, I wonder what else is needed.

Datasets

Baseline models

Results
For the char-lm model:

2022-06-28 13:39:24 INFO: Running NER tagger in predict mode	
2022-06-28 13:39:25 INFO: Loading data with batch size 32...	
2022-06-28 13:39:26 DEBUG: 38 batches created.	
2022-06-28 13:39:26 INFO: Start evaluation...	
2022-06-28 13:39:37 INFO: Score by entity:	
Prec.   Rec.    F1	
85.55   87.69   86.61	
2022-06-28 13:39:37 INFO: Score by token:	
Prec.   Rec.    F1	
68.59   68.98   68.78	
2022-06-28 13:39:37 INFO: NER tagger score:	
2022-06-28 13:39:37 INFO: pl_nkjp 86.61	

I could definitely improve these models further and share an update in the coming weeks.
I'd like to ask whether there is anything more I need to prepare to include these in the next Stanza release.

In particular, I'm not sure about:

  • BERT integration; for now I have only added the training parameter in my version
  • to what extent sharing the converted NER data & conversion code is needed
@AngledLuffa
Collaborator

Neat! We would be happy to include this.

The bert model (or any other transformer) will be saved as a parameter in the model file, so that should work even without integrating.
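
If it's useful, a quick check along these lines should show which transformer a saved model carries; the file name is a placeholder, and I'm assuming the checkpoint keeps its training arguments under a "config" key:

```python
import torch

# Placeholder file name for a trained Polish NER model; the assumption here is
# that the checkpoint dict stores its training arguments under a "config" key.
checkpoint = torch.load("pl_nkjp_nertagger.pt", map_location="cpu")
print(checkpoint["config"].get("bert_model"))
```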

However, is there anything preventing you from integrating the code that converts the text input? It would make rebuilding the model much easier in the future, in case there are architectural improvements or the file format changes. We'd be happy to give you attribution, of course.

In terms of using the Wikipedia Subcorpus for the charlm, have you tried adding the "free corpus" as well? It's also pretty straightforward to get a recent Wikipedia dump and remove most of the non-text using wikiextractor; I've been using the "pages-meta-current" files when extracting recent Wikipedia dumps.
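
As a rough sketch of that extraction step (wikiextractor is installed separately, and the dump file name and output directory below are placeholders):

```python
import subprocess

# "pages-meta-current" dump for Polish Wikipedia, downloaded from dumps.wikimedia.org
dump = "plwiki-latest-pages-meta-current.xml.bz2"

# wikiextractor strips templates and markup, leaving mostly plain text
# that can then be fed to the charlm training scripts
subprocess.run(
    ["python", "-m", "wikiextractor.WikiExtractor", dump, "-o", "plwiki_text"],
    check=True,
)
```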

@AngledLuffa
Collaborator

it's a little weird that the score by token and score by entity are so wildly different, isn't it? that doesn't usually happen

@k-sap
Contributor Author

k-sap commented Jul 18, 2022

However, is there anything preventing you from integrating the code that converts the text input? It would make rebuilding the model much easier in the future, in case there are architectural improvements or the file format changes. We'd be happy to give you attribution, of course.

Would it be better to prepare a script that converts to BIOES files, or to .json files directly?

it's a little weird that the score by token and score by entity are so wildly different, isn't it? that doesn't usually happen

I suppose there are some issues with labeling start and end of entities. I need to look into that.

In terms of using the Wikipedia Subcorpus for the charlm, have you tried adding the "free corpus" as well?

Not yet. I hope to manage it at the beginning of next month and then summarize everything clearly.

Do you have any dataset split guidelines?

Just to be sure, POS tags are not needed in the training files? There are only tokens and NER annotations in the example.

@AngledLuffa
Collaborator

Would it be better to prepare a script that converts to BIOES files, or to .json files directly?

For sure I would recommend BIO or BIOES. The conversion script to .json already does the formatting, and for that matter, we may eventually change that format if we ever move to Nested NER.
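
For illustration, a BIO file is just one token and its tag per line, with a blank line between sentences; something like this writes one made-up sentence (the tokens, tag names, and file name are purely illustrative):

```python
# One (token, tag) pair per line, tab-separated; a blank line ends the sentence.
sentence = [
    ("Jan", "B-persName"), ("Kowalski", "I-persName"),
    ("mieszka", "O"), ("w", "O"), ("Warszawie", "B-placeName"), (".", "O"),
]
with open("pl_nkjp.train.bio", "w", encoding="utf-8") as fout:
    for token, tag in sentence:
        fout.write(f"{token}\t{tag}\n")
    fout.write("\n")  # sentence separator
```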

I suppose there are some issues with labeling start and end of entities. I need to look into that.

That could be it!

Do you have any dataset split guidelines?

You mean for the charlm? It will randomly shuffle the data first.

Just to be sure, POS tags are not needed in the training files? There are only tokens and NER annotations in the example.

POS tags are not needed
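
For reference, a minimal sketch of a training sentence in the .json format, carrying only "text" and "ner" per token (the sentence and file name are invented, and the tags here are BIOES):

```python
import json

# Each sentence is a list of tokens with only "text" and "ner" -- no POS tags.
sentences = [
    [
        {"text": "Jan", "ner": "B-persName"},
        {"text": "Kowalski", "ner": "E-persName"},
        {"text": "mieszka", "ner": "O"},
        {"text": "w", "ner": "O"},
        {"text": "Warszawie", "ner": "S-placeName"},
        {"text": ".", "ner": "O"},
    ],
]
with open("pl_nkjp.train.json", "w", encoding="utf-8") as fout:
    json.dump(sentences, fout, ensure_ascii=False, indent=2)
```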

@AngledLuffa
Collaborator

I wrote this up in case it helps:

https://stanfordnlp.github.io/stanza/new_language_ner.html
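
The two main steps in that guide boil down to roughly the following, assuming the pl_nkjp dataset shorthand and the data directories set up as described there:

```python
import subprocess

# 1) Convert the raw corpus into the files the tagger trains on
subprocess.run(
    ["python", "-m", "stanza.utils.datasets.ner.prepare_ner_dataset", "pl_nkjp"],
    check=True,
)

# 2) Train and score the NER tagger on the prepared dataset
subprocess.run(
    ["python", "-m", "stanza.utils.training.run_ner", "pl_nkjp"],
    check=True,
)
```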

k-sap mentioned this issue Aug 30, 2022
AngledLuffa pushed a commit that referenced this issue Sep 4, 2022
* Add NER dataset for Polish

Co-authored-by: ryszardtuora <ryszardtuora@gmail.com>
Co-authored-by: Karol Saputa <ksaputa@gputrain.dariah.ipipan.waw.pl>

This PR adds Polish NER dataset

#1070
@AngledLuffa
Collaborator

Thank you for the assistance! It is excellent when people add new models to our tool for us.

After reordering the input files with sorted() as we discussed on the PR, I retrained two models. One uses Stanza's built-in charlm and gets the following scores:

--- DEV ---
2022-09-13 11:48:56 INFO: Score by entity:
Prec.   Rec.    F1
88.22   87.66   87.94
2022-09-13 11:48:56 INFO: Score by token:
Prec.   Rec.    F1
89.12   88.26   88.69

--- TEST ---
2022-09-13 11:49:05 INFO: Score by entity:
Prec.   Rec.    F1
88.99   88.48   88.73
2022-09-13 11:49:05 INFO: Score by token:
Prec.   Rec.    F1
88.84   89.67   89.25

The other uses the transformer you recommended. It gets:

--- DEV ---
2022-09-13 10:40:58 INFO: Score by entity:
Prec.   Rec.    F1
91.18   91.42   91.30
2022-09-13 10:40:58 INFO: Score by token:
Prec.   Rec.    F1
91.21   91.88   91.54

--- TEST ---
2022-09-13 10:41:11 INFO: Score by entity:
Prec.   Rec.    F1
91.04   91.07   91.05
2022-09-13 10:41:11 INFO: Score by token:
Prec.   Rec.    F1
90.09   91.98   91.02

Without the charlm, scores drop to 85.96 on dev and 87.37 on test.

The charlm model is now the default NER model for PL if you build a Polish pipeline with 1.4.1 (try it!), and you can get the transformer model instead with the nkjp_bert package.
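
A quick sketch for trying it out (the example sentence is made up, and the exact syntax for selecting a per-processor package is worth double-checking against the Pipeline docs):

```python
import stanza

stanza.download("pl")  # fetches the default Polish models, including the charlm NER model

nlp = stanza.Pipeline("pl", processors="tokenize,ner")
doc = nlp("Jan Kowalski mieszka w Warszawie.")
print([(ent.text, ent.type) for ent in doc.ents])

# For the transformer variant, select the nkjp_bert NER package when building the
# pipeline, e.g. something along the lines of processors={"ner": "nkjp_bert"}.
```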
