List of NER categories per language #904

paulthemagno · 2021-12-20T12:11:34Z

Hi! I have been using Stanza NER models for a long time. After using them in several languages, I have noticed that some models use slightly different tag names. For example:

Ukrainian NER has PERS instead of the common PER.
Vietnamese NER has ORGANIZATION instead of ORG.

Until a few months ago there were only two possibilities: 4-tag models (PER, ORG, LOC and MISC) and 18-tag models (as for English and Chinese).

It would be helpful to have a mapping between the most common labels and the language-specific labels, as well as a clear list of labels for each language.


AngledLuffa · 2021-12-20T19:44:09Z

We have attempted to document it here: https://stanfordnlp.github.io/stanza/available_models.html#available-ner-models and here: https://stanfordnlp.github.io/stanza/performance.html#system-performance-on-ner-corpora. If there is an oversight, please let us know.


paulthemagno · 2021-12-21T09:16:57Z

Yes, I saw it, but I found some differences (tell me if I'm wrong).

For example, with the Italian model (1.3.0) I get the tags PER, LOC and ORG, while the documentation says: "The Italian FBK dataset uses LOCATION, ORGANIZATION, PERSON". For Vietnamese I found tags such as ORGANIZATION, PERSON and MISCELLANEOUS, which do not seem to be mentioned on that page.

So I thought the documentation needed updating. It would be very helpful to have the list of tags for every language in a JSON file, so that users do not have to recheck the documentation after every change and can be sure they are up to date. Even better would be a mapping between the tags used by each language and the classic tags (PER, LOC, ORG, MISC).
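A minimal sketch of what such a mapping could look like on the user side. It is not part of Stanza: the `CANONICAL` table and the `normalize_tag` helper are hypothetical names, and the table only covers the spelling variants mentioned in this thread (PERS, PERSON, LOCATION, ORGANIZATION, MISCELLANEOUS). Tags with no obvious 4-tag equivalent, such as GPE or DATE, are passed through unchanged. The helper also assumes that token-level tags may carry a B-/I-/E-/S- prefix.

```python
# Hypothetical user-side mapping; NOT part of the Stanza API.
# Only the spelling variants seen in this thread are covered.
CANONICAL = {
    "PERS": "PER",
    "PERSON": "PER",
    "LOCATION": "LOC",
    "ORGANIZATION": "ORG",
    "MISCELLANEOUS": "MISC",
}

# Assumed prefixes for token-level tags (BIOES-style scheme).
_PREFIXES = {"B", "I", "E", "S"}


def normalize_tag(tag: str) -> str:
    """Map a language-specific tag to PER/LOC/ORG/MISC where a mapping is known.

    Unknown tags (for example GPE or WORK_OF_ART) are returned unchanged.
    """
    prefix, sep, label = tag.partition("-")
    if sep and prefix in _PREFIXES:
        # Token-level tag such as "S-ORGANIZATION": keep the prefix, map the label.
        return f"{prefix}-{CANONICAL.get(label, label)}"
    return CANONICAL.get(tag, tag)


if __name__ == "__main__":
    for raw in ["PERS", "ORGANIZATION", "S-PERSON", "GPE", "O"]:
        print(raw, "->", normalize_tag(raw))
```

Whether PERSON in the 18-tag models should really collapse to PER is a judgment call; the table is easy to restrict per language if that is not wanted.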

AngledLuffa · 2021-12-21T18:18:21Z

That is a fair point. I will update the documentation for IT and VI. VI's tags are

['LOCATION', 'MISCELLANEOUS', 'ORGANIZATION', 'PERSON']

and mentally I had considered that the same as the standard 4 tag set, so I did not mention that in the docs. Note that although you didn't list LOCATION, you do in fact get LOCATION in a sentence such as

Tôi đang làm việc với một sinh viên từ Việt Nam

Remapping tags to better fit the existing categories is not in our current plans. AFAIK, there is no Universal NER initiative equivalent to UD...

AngledLuffa · 2021-12-21T18:20:55Z

d4d6b6a

AngledLuffa · 2021-12-21T18:24:28Z

c7dd0e1

paulthemagno · 2021-12-22T10:51:04Z

Thanks very much for the help. I understand that at the beginning there were mainly two tag variants (PER, LOC, ORG, MISC, plus the 18 tags of English and Chinese). Now many small differences have appeared, so a list of tags is probably helpful.

I loaded all the NER models and wrote their tags to a JSON file by changing the _set_up_model function in ner_processor.py as follows, after applying the update in d4d6b6a.

    def _set_up_model(self, config, use_gpu):
        # set up trainer
        args = {'charlm_forward_file': config.get('forward_charlm_path', None),
                'charlm_backward_file': config.get('backward_charlm_path', None)}
        self._trainer = Trainer(args=args, model_file=config['model_path'], use_cuda=use_gpu)
        # Added for this experiment: record this model's tags in tags.json, keyed by language.
        # Assumes "import json" is present at the top of ner_processor.py.
        tags = self.get_known_tags()
        try:
            with open("tags.json") as json_file:
                existing_tags = json.load(json_file)
        except (FileNotFoundError, json.JSONDecodeError):
            # first run, or unreadable file: start from an empty mapping
            existing_tags = {}
        existing_tags[config["lang"]] = tags
        with open("tags.json", "w") as json_file:
            json.dump(existing_tags, json_file)

Here is the result, in case it is useful to you:

{
    "fr": [
        "LOC",
        "MISC",
        "ORG",
        "PER"
    ],
    "en": [
        "CARDINAL",
        "DATE",
        "EVENT",
        "FAC",
        "GPE",
        "LANGUAGE",
        "LAW",
        "LOC",
        "MONEY",
        "NORP",
        "ORDINAL",
        "ORG",
        "PERCENT",
        "PERSON",
        "PRODUCT",
        "QUANTITY",
        "TIME",
        "WORK_OF_ART"
    ],
    "zh-hans": [
        "CARDINAL",
        "DATE",
        "EVENT",
        "FAC",
        "GPE",
        "LANGUAGE",
        "LAW",
        "LOC",
        "MONEY",
        "NORP",
        "ORDINAL",
        "ORG",
        "PERCENT",
        "PERSON",
        "PRODUCT",
        "QUANTITY",
        "TIME",
        "WORK_OF_ART"
    ],
    "ru": [
        "LOC",
        "MISC",
        "ORG",
        "PER"
    ],
    "uk": [
        "LOC",
        "MISC",
        "ORG",
        "PERS"
    ],
    "ar": [
        "LOC",
        "MISC",
        "ORG",
        "PER"
    ],
    "hu": [
        "LOC",
        "MISC",
        "ORG",
        "PER"
    ],
    "af": [
        "LOC",
        "MISC",
        "ORG",
        "PERS"
    ],
    "bg": [
        "EVT",
        "LOC",
        "ORG",
        "PER",
        "PRO"
    ],
    "fi": [
        "DATE",
        "EVENT",
        "LOC",
        "ORG",
        "PER",
        "PRO"
    ],
    "my": [
        "LOC",
        "NE",
        "NUM",
        "ORG",
        "PNAME",
        "RACE",
        "TIME"
    ],
    "it": [
        "LOC",
        "ORG",
        "PER"
    ],
    "de": [
        "LOC",
        "MISC",
        "ORG",
        "PER"
    ],
    "nl": [
        "LOC",
        "MISC",
        "ORG",
        "PER"
    ],
    "vi": [
        "LOCATION",
        "MISCELLANEOUS",
        "ORGANIZATION",
        "PERSON"
    ],
    "es": [
        "LOC",
        "MISC",
        "ORG",
        "PER"
    ]
}
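To see at a glance which languages share a tag set, the dictionary above can be grouped by its values. The sketch below is only an illustration: `group_by_tagset` is a hypothetical helper, and the sample data is a subset of the JSON above copied inline so the snippet runs on its own (in practice one would load tags.json instead).

```python
from collections import defaultdict

# Subset of the tags.json content listed above, copied inline for the example.
tags_by_lang = {
    "fr": ["LOC", "MISC", "ORG", "PER"],
    "de": ["LOC", "MISC", "ORG", "PER"],
    "uk": ["LOC", "MISC", "ORG", "PERS"],
    "af": ["LOC", "MISC", "ORG", "PERS"],
    "it": ["LOC", "ORG", "PER"],
    "vi": ["LOCATION", "MISCELLANEOUS", "ORGANIZATION", "PERSON"],
}


def group_by_tagset(tags_by_lang):
    """Return {sorted tuple of tags: [languages using exactly that tag set]}."""
    groups = defaultdict(list)
    for lang, tags in tags_by_lang.items():
        # Sort so that tag order in the input does not matter.
        groups[tuple(sorted(tags))].append(lang)
    return dict(groups)


if __name__ == "__main__":
    for tagset, langs in group_by_tagset(tags_by_lang).items():
        print(", ".join(langs), ":", " ".join(tagset))
```

Grouping on the exact tag set keeps PER and PERS models apart, which is the distinction this thread is about.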

AngledLuffa · 2021-12-22T19:14:40Z

Thanks! It looks like those values all agree with what we've posted. I was thinking about making a script which would go through all the models and produce the markdown necessary to rebuild that webpage, just to save myself a bit of time.
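As a rough illustration of the markdown step only, a per-language tag dictionary like the one posted above could be turned into a table as follows. This is a sketch under assumptions: `tags_to_markdown` is a hypothetical helper, the two-column layout is made up for the example and is not the layout of the actual documentation page, and loading the models themselves is left out.

```python
def tags_to_markdown(tags_by_lang):
    """Render {language code: [tags]} as a two-column markdown table."""
    lines = ["| Language | Tags |", "| --- | --- |"]
    for lang in sorted(tags_by_lang):
        tags = ", ".join(sorted(tags_by_lang[lang]))
        lines.append(f"| {lang} | {tags} |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Two entries taken from the JSON posted earlier in the thread.
    sample = {
        "it": ["LOC", "ORG", "PER"],
        "bg": ["EVT", "LOC", "ORG", "PER", "PRO"],
    }
    print(tags_to_markdown(sample))
```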


stale · 2022-02-21T13:36:53Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2022-03-03T21:58:04Z

This issue has been automatically closed due to inactivity.

paulthemagno added the question label Dec 20, 2021

stale bot added the stale label Feb 21, 2022

stale bot closed this as completed Mar 3, 2022
