List of NER categories per language #904
We have attempted to document it here:
https://stanfordnlp.github.io/stanza/available_models.html#available-ner-models
and here:
https://stanfordnlp.github.io/stanza/performance.html#system-performance-on-ner-corpora
If there is an oversight, please let us know.
On Mon, Dec 20, 2021 at 4:11 AM Paolo Magnani wrote:

Hi! I've been using Stanza NER models for a long time. After using them in different languages, I've noticed that some models have slightly different tag names. For example:
- Ukrainian NER has PERS instead of the common PER
- Vietnamese NER has ORGANIZATION instead of ORG.
Until a few months ago there were only two possibilities: 4-tag models (with PER, ORG, LOC and MISC) and 18-tag models (as for English and Chinese).
It would be helpful to have a mapping between the most common labels and the language-specific labels, and also a clear list of labels for each language.
Yes, I saw it, but I found some differences (tell me if I'm wrong). For example, using the Italian model (1.3.0) I found a different set of tags, so I thought the documentation had to be updated. It would be very helpful to have a list of tags for every language in a JSON file (so that one does not have to check the documentation every time for changes, and can be sure of being up to date). Or, even better, a mapping between the tags used by each language and the classic tags.
That is a fair point. I will update the documentation for IT and VI. VI's tags are LOCATION, MISCELLANEOUS, ORGANIZATION and PERSON, and mentally I had considered that the same as the standard 4-tag set, so I did not mention it in the docs. Note that although you didn't list LOCATION, you do in fact get LOCATION in appropriate sentences.
Remapping tags to better fit the existing categories is not in our current plans. AFAIK, there is no Universal NER initiative equivalent to UD...
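Until such a mapping exists upstream, anyone who needs a uniform tag set can remap on the client side. A minimal sketch, assembled from the aliases reported in this thread (`TAG_ALIASES` and `normalize_tag` are hypothetical names, not part of Stanza):

```python
# Client-side normalization of language-specific NER tag names onto the
# "classic" 4-tag names. The alias table comes from the tag inventories
# discussed in this thread; it is NOT an official Stanza mapping.
TAG_ALIASES = {
    "PERS": "PER",            # Ukrainian, Afrikaans
    "PERSON": "PER",          # Vietnamese (also used by 18-tag models)
    "LOCATION": "LOC",        # Vietnamese
    "ORGANIZATION": "ORG",    # Vietnamese
    "MISCELLANEOUS": "MISC",  # Vietnamese
}

def normalize_tag(tag):
    """Return the classic tag name, leaving unrecognized tags unchanged."""
    return TAG_ALIASES.get(tag, tag)
```

Tags from the 18-tag models (GPE, DATE, ...) deliberately pass through unchanged, since they have no counterpart in the 4-tag set.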
Thank you very much for the help. I understand that at the beginning there were principally two variants of tags (PER, LOC, ORG, MISC, plus the 18 tags of English and Chinese). Now many small differences have come to light, so a list of tags would probably be helpful.

I tried to load all the NER models and append their tags to a JSON file, by changing the _set_up_model function in ner_processor.py in this way (after applying the update in d4d6b6a):

```python
def _set_up_model(self, config, use_gpu):
    # set up trainer
    args = {'charlm_forward_file': config.get('forward_charlm_path', None),
            'charlm_backward_file': config.get('backward_charlm_path', None)}
    self._trainer = Trainer(args=args, model_file=config['model_path'], use_cuda=use_gpu)
    # collect this model's tags and merge them into tags.json,
    # creating the file on the first run
    tags = self.get_known_tags()
    try:
        with open("tags.json") as json_file:
            existing_tags = json.load(json_file)
    except (FileNotFoundError, json.JSONDecodeError):
        existing_tags = {}
    existing_tags[config["lang"]] = tags
    with open("tags.json", "w") as json_file:
        json.dump(existing_tags, json_file)
```

The result is the following, in case it is of interest to you:

```json
{
  "fr": ["LOC", "MISC", "ORG", "PER"],
  "en": ["CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC", "MONEY", "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT", "QUANTITY", "TIME", "WORK_OF_ART"],
  "zh-hans": ["CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC", "MONEY", "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT", "QUANTITY", "TIME", "WORK_OF_ART"],
  "ru": ["LOC", "MISC", "ORG", "PER"],
  "uk": ["LOC", "MISC", "ORG", "PERS"],
  "ar": ["LOC", "MISC", "ORG", "PER"],
  "hu": ["LOC", "MISC", "ORG", "PER"],
  "af": ["LOC", "MISC", "ORG", "PERS"],
  "bg": ["EVT", "LOC", "ORG", "PER", "PRO"],
  "fi": ["DATE", "EVENT", "LOC", "ORG", "PER", "PRO"],
  "my": ["LOC", "NE", "NUM", "ORG", "PNAME", "RACE", "TIME"],
  "it": ["LOC", "ORG", "PER"],
  "de": ["LOC", "MISC", "ORG", "PER"],
  "nl": ["LOC", "MISC", "ORG", "PER"],
  "vi": ["LOCATION", "MISCELLANEOUS", "ORGANIZATION", "PERSON"],
  "es": ["LOC", "MISC", "ORG", "PER"]
}
```
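As a side note, a listing like the one above makes it easy to see which languages share a tag inventory. A small sketch that groups languages by tag set (`group_languages_by_tagset` is a hypothetical helper operating on the dict loaded from tags.json):

```python
from collections import defaultdict

def group_languages_by_tagset(tags_by_lang):
    """Map each distinct (sorted) tag inventory to the languages using it."""
    groups = defaultdict(list)
    for lang, tags in sorted(tags_by_lang.items()):
        groups[tuple(sorted(tags))].append(lang)
    return dict(groups)

# Small excerpt of the listing above; with the full tags.json this shows
# e.g. fr, ru, ar, hu, de, nl and es all sharing the classic 4-tag set.
sample = {
    "fr": ["LOC", "MISC", "ORG", "PER"],
    "uk": ["LOC", "MISC", "ORG", "PERS"],
    "it": ["LOC", "ORG", "PER"],
}
groups = group_languages_by_tagset(sample)
```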
Thanks! It looks like those values all agree with what we've posted. I was thinking about making a script which would go through all the models and produce the markdown necessary to rebuild that webpage, just to save myself a bit of time.
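Such a script could, for example, read the collected tags.json and emit a markdown table for the docs page. A rough sketch (the function name and table layout are illustrative, not the actual format of the Stanza documentation):

```python
import json

def tags_json_to_markdown(tags_by_lang):
    """Render a {lang: [tags]} dict as a markdown table, one row per language."""
    lines = ["| Language | NER tags |", "| --- | --- |"]
    for lang in sorted(tags_by_lang):
        lines.append("| {} | {} |".format(lang, ", ".join(sorted(tags_by_lang[lang]))))
    return "\n".join(lines)

# Typical use against the file produced above:
#   print(tags_json_to_markdown(json.load(open("tags.json"))))
```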
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity.