NER model for Armenian #1206

Closed
ShakeHakobyan opened this issue Mar 7, 2023 · 5 comments

@ShakeHakobyan
Contributor

Hello! I have trained a NER model for the Armenian language using the ArmTDP dataset and the xlm-roberta-base model.

After that, I attempted to test the model using stanza.Pipeline:

import stanza

# Pipeline configuration: tokenize, then tag with the custom NER model.
config = {
    'processors': 'tokenize,ner',
    'lang': 'hy',
    'ner_model_path': '/Lab/Projects/ner/models/hy_armtdp_nertagger_bert_18.pt',
}

nlp = stanza.Pipeline(**config)

nlp("some text in Armenian")

With the same input data, the outputs were different each time I loaded the model.
There was no such problem when testing with the internal commands, though. Whenever I run the following command, I get the same output:

python3 -m stanza.utils.training.run_ner hy_armtdp --score_test

What could be the cause of this problem?

Additionally, I have added data conversion and BERT code for Armenian in this pull request (trained model can be downloaded from this drive).

If it is feasible, it would be great to integrate a NER model for Armenian into the main package.

Thanks!

@AngledLuffa
Collaborator

Thank you for doing this! Although I should point out that the pull request is currently against your own fork, not our dev branch. If you'll fix that, I can check this out tomorrow and try to diagnose the problem you're seeing.

@AngledLuffa
Collaborator

Actually, if you would give an example of a sentence which causes the inconsistent labels, that would help a lot.
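
One way to hunt for such a sentence is sketched below. This is only an illustration, not part of Stanza: `find_unstable_sentences` and `make_tagger` are hypothetical names. It assumes `make_tagger` builds a fresh tagger on every call (mirroring "load the model again") and that the tagger returns a list of hashable items such as `(text, label)` pairs.

```python
def find_unstable_sentences(make_tagger, sentences, loads=3):
    """Return the sentences whose tags differ between fresh model loads.

    make_tagger: hypothetical zero-argument factory; each call should
                 reload the model and return a function text -> list of tags.
    sentences:   list of input strings to check.
    loads:       how many times to reload the model.
    """
    seen = {sentence: set() for sentence in sentences}
    for _ in range(loads):
        tagger = make_tagger()  # fresh load on every round
        for sentence in sentences:
            # Store the output as a tuple so distinct results can be counted.
            seen[sentence].add(tuple(tagger(sentence)))
    # A sentence is unstable if more than one distinct output was observed.
    return [s for s in sentences if len(seen[s]) > 1]
```

With Stanza, `make_tagger` could presumably wrap the pipeline construction from the original report and return something like `lambda text: [(ent.text, ent.type) for ent in nlp(text).ents]`. Any sentence the helper returns would be a good candidate to post here.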

@AngledLuffa
Collaborator

It was pretty easy for me to get the changes you made, so I replicated your pull request locally (with you as the author, of course):

https://github.com/stanfordnlp/stanza/pull/1212/commits

Let me know if that looks good to you.

I like having a non-BERT model as the default, so the pipeline is less expensive unless people know they want the BERT model. I will check that everything works by retraining the model. If you can find an example that triggers the non-deterministic behavior, though, I can try to debug that.

Thanks for sending this!

@ShakeHakobyan
Contributor Author

Hi there,
Thank you for your response. The pull request replication option looks good. I have included a link to my trained model without bert here. As for the issue with the pipeline that I mentioned earlier, it appears to have been resolved.

@AngledLuffa
Collaborator

It's been merged, and the new models (retrained locally) are included in 1.5.0. Thanks for the help!
