List of NER categories per language #904

Closed
paulthemagno opened this issue Dec 20, 2021 · 9 comments

Comments

@paulthemagno

Hi! I've been using Stanza NER models for a long time. After using them in different languages, I've noticed that some models use slightly different tag names. For example:

  • Ukrainian NER has PERS instead of the common PER
  • Vietnamese NER model has ORGANIZATION instead of ORG.

Until a few months ago there were only two possibilities: 4-tag models (PER, ORG, LOC and MISC) and 18-tag models (as for English and Chinese).

It would be helpful to have some kind of mapping between the most common labels and the language-specific labels, and also a clear list of the labels for each language.
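
To make the request concrete, here is a minimal sketch of what such a mapping could look like on the user side. `TAG_ALIASES` and `normalize_tag` are hypothetical names, not part of Stanza, and the table only covers variant spellings that come up in this thread. Tags without an obvious 4-tag equivalent (for example GPE or NORP in the 18-tag models) are deliberately passed through unchanged.

```python
# Hypothetical alias table (an assumption for illustration, not a Stanza API):
# variant labels mentioned in this thread -> the common 4-tag names.
TAG_ALIASES = {
    "PERS": "PER",
    "PERSON": "PER",
    "ORGANIZATION": "ORG",
    "LOCATION": "LOC",
    "MISCELLANEOUS": "MISC",
}


def normalize_tag(tag: str) -> str:
    """Return the common label for a known variant; other tags pass through unchanged."""
    return TAG_ALIASES.get(tag, tag)


if __name__ == "__main__":
    for tag in ["PERS", "ORGANIZATION", "LOC", "GPE"]:
        print(tag, "->", normalize_tag(tag))
```

Note that PERSON is also the person label in the 18-tag models, so applying this table to English or Chinese output would rename that one tag as well.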

@AngledLuffa
Collaborator

AngledLuffa commented Dec 20, 2021 via email

@paulthemagno
Author

Yes, I saw it, but I found some differences (tell me if I'm wrong).

For example, I'm trying the Italian model (1.3.0) and I get the tags PER, LOC and ORG, while the documentation says: "The Italian FBK dataset uses LOCATION, ORGANIZATION, PERSON". And for Vietnamese I found tags like ORGANIZATION, PERSON and MISCELLANEOUS, which don't seem to be mentioned on that web page.

So I thought it needed to be updated. It would be very helpful to have a list of the tags for every language in a JSON file (so that we don't have to check the documentation every time something changes, and can be sure we are up to date). Or even better, a mapping between the tags used for each language and the classic tags (PER, LOC, ORG, MISC).

@AngledLuffa
Collaborator

That is a fair point. I will update the documentation for IT and VI. VI's tags are

['LOCATION', 'MISCELLANEOUS', 'ORGANIZATION', 'PERSON']

and mentally I had considered that the same as the standard 4 tag set, so I did not mention that in the docs. Note that although you didn't list LOCATION, you do in fact get LOCATION in a sentence such as

Tôi đang làm việc với một sinh viên từ Việt Nam ("I am working with a student from Vietnam")

Remapping tags to better fit the existing categories is not in our current plans. AFAIK, there is no Universal NER initiative equivalent to UD...

@AngledLuffa
Collaborator

d4d6b6a

@AngledLuffa
Collaborator

c7dd0e1

@paulthemagno
Author

Thanks very much for the help. As I understand it, at the beginning there were basically two tag sets (PER, LOC, ORG, MISC, plus the 18 tags of English and Chinese). Now many small differences have appeared, so a list of tags would probably be helpful.

I tried loading all the NER models and writing their tags to a JSON file by changing the _set_up_model function in ner_processor.py as follows, after applying the update in d4d6b6a:

    # Requires `import json` at the top of ner_processor.py.
    def _set_up_model(self, config, use_gpu):
        # set up trainer
        args = {'charlm_forward_file': config.get('forward_charlm_path', None),
                'charlm_backward_file': config.get('backward_charlm_path', None)}
        self._trainer = Trainer(args=args, model_file=config['model_path'], use_cuda=use_gpu)
        # Record this model's tags in tags.json, keyed by language code.
        tags = self.get_known_tags()
        try:
            with open("tags.json") as json_file:
                existing_tags = json.load(json_file)
        except (FileNotFoundError, json.JSONDecodeError):
            # First run, or an unreadable file: start from an empty mapping.
            existing_tags = {}
        existing_tags[config["lang"]] = tags
        with open("tags.json", "w") as json_file:
            json.dump(existing_tags, json_file)

Here is the resulting tags.json, in case it is of interest:

{
    "fr": [
        "LOC",
        "MISC",
        "ORG",
        "PER"
    ],
    "en": [
        "CARDINAL",
        "DATE",
        "EVENT",
        "FAC",
        "GPE",
        "LANGUAGE",
        "LAW",
        "LOC",
        "MONEY",
        "NORP",
        "ORDINAL",
        "ORG",
        "PERCENT",
        "PERSON",
        "PRODUCT",
        "QUANTITY",
        "TIME",
        "WORK_OF_ART"
    ],
    "zh-hans": [
        "CARDINAL",
        "DATE",
        "EVENT",
        "FAC",
        "GPE",
        "LANGUAGE",
        "LAW",
        "LOC",
        "MONEY",
        "NORP",
        "ORDINAL",
        "ORG",
        "PERCENT",
        "PERSON",
        "PRODUCT",
        "QUANTITY",
        "TIME",
        "WORK_OF_ART"
    ],
    "ru": [
        "LOC",
        "MISC",
        "ORG",
        "PER"
    ],
    "uk": [
        "LOC",
        "MISC",
        "ORG",
        "PERS"
    ],
    "ar": [
        "LOC",
        "MISC",
        "ORG",
        "PER"
    ],
    "hu": [
        "LOC",
        "MISC",
        "ORG",
        "PER"
    ],
    "af": [
        "LOC",
        "MISC",
        "ORG",
        "PERS"
    ],
    "bg": [
        "EVT",
        "LOC",
        "ORG",
        "PER",
        "PRO"
    ],
    "fi": [
        "DATE",
        "EVENT",
        "LOC",
        "ORG",
        "PER",
        "PRO"
    ],
    "my": [
        "LOC",
        "NE",
        "NUM",
        "ORG",
        "PNAME",
        "RACE",
        "TIME"
    ],
    "it": [
        "LOC",
        "ORG",
        "PER"
    ],
    "de": [
        "LOC",
        "MISC",
        "ORG",
        "PER"
    ],
    "nl": [
        "LOC",
        "MISC",
        "ORG",
        "PER"
    ],
    "vi": [
        "LOCATION",
        "MISCELLANEOUS",
        "ORGANIZATION",
        "PERSON"
    ],
    "es": [
        "LOC",
        "MISC",
        "ORG",
        "PER"
    ]
}
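
As a rough sketch of how a listing like this could be used, the snippet below groups languages by their tag inventory, which makes the deviations from the common 4-tag set easy to spot. It is only an illustration: `group_by_tagset` is a hypothetical helper, and the dictionary holds a handful of entries copied from the listing above rather than the whole file (in practice it would be loaded from tags.json with `json.load`).

```python
from collections import defaultdict

# A few entries copied from the listing above, kept short for the example.
tags_by_lang = {
    "fr": ["LOC", "MISC", "ORG", "PER"],
    "uk": ["LOC", "MISC", "ORG", "PERS"],
    "af": ["LOC", "MISC", "ORG", "PERS"],
    "it": ["LOC", "ORG", "PER"],
    "vi": ["LOCATION", "MISCELLANEOUS", "ORGANIZATION", "PERSON"],
    "es": ["LOC", "MISC", "ORG", "PER"],
}


def group_by_tagset(tags_by_lang):
    """Map each distinct tag inventory (as a sorted tuple) to the languages using it."""
    groups = defaultdict(list)
    for lang, tags in tags_by_lang.items():
        groups[tuple(sorted(tags))].append(lang)
    return dict(groups)


if __name__ == "__main__":
    for tagset, langs in group_by_tagset(tags_by_lang).items():
        print(", ".join(sorted(langs)), "->", list(tagset))
```

Using a sorted tuple as the key means two languages land in the same group only when their inventories match exactly, so PER and PERS models stay separate until the labels are normalized.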

@AngledLuffa
Collaborator

AngledLuffa commented Dec 22, 2021 via email

@stale

stale bot commented Feb 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 21, 2022
@stale

stale bot commented Mar 3, 2022

This issue has been automatically closed due to inactivity.

@stale stale bot closed this as completed Mar 3, 2022