GUM format writer does not check that cluster ids do not contain hyphens #97

@dan-zeman

I ran this:

udapy -s read.OldCorefUD corefud.FixInterleaved < en_pcedt-ud-train.conllu > newtrain.conllu
udapy -s read.OldCorefUD corefud.FixInterleaved < en_pcedt-ud-dev.conllu > newdev.conllu
udapy -s read.OldCorefUD corefud.FixInterleaved < en_pcedt-ud-test.conllu > newtest.conllu
validate.py --lang en --level 2 --coref new*.conllu

And got this:

[(in newdev.conllu) Line 6 Sent wsj-0001-s1]: [L6 Coref too-many-entity-attributes] Entity 'wsj-0001-c1--2' has 5 attributes while only 4 attributes are globally declared.
[(in newdev.conllu) Line 6 Sent wsj-0001-s1]: [L6 Coref spurious-entity-type] Spurious entity type '0001'.
[(in newdev.conllu) Line 6 Sent wsj-0001-s1]: [L6 Coref spurious-mention-head] Entity head index 'c1' must be a non-zero-starting integer.
...

The problem is that the cluster ids in the input format contain hyphens (e.g., "wsj-0001-c1"). The hyphen is also the attribute separator in the GUM format, so the validator splits such an id into several fields. I see three options. The first is to replace hyphens with empty strings, underscores, or something else; but then one should check that the new id is not already used elsewhere in the corpus. The second, perhaps better, is to forcefully normalize cluster ids after reading via OldCorefUD, using the code that is already available in Udapi. The third is to simply throw a fatal error when trying to save entities with such ids in the GUM format; then it would be up to the caller to take care of them.
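
To make the first and third options concrete, here is a minimal standalone sketch in plain Python. It does not use or reproduce the Udapi API; the function names (`split_fields`, `normalize_ids`, `require_hyphen_free`) are hypothetical, and the underscore replacement and numeric collision suffix are assumptions chosen only for illustration. It first shows why the validator counts five attributes for 'wsj-0001-c1--2', then renames hyphenated ids while avoiding collisions with ids that already exist, and finally shows the fail-fast variant.

```python
def split_fields(entity):
    """Split a serialized entity on hyphens, as a hyphen-delimited format would."""
    return entity.split("-")


def normalize_ids(ids, replacement="_"):
    """Return a mapping old_id -> hyphen-free id, avoiding collisions.

    `replacement` is an assumption; any hyphen-free string would work.
    """
    # Ids that are already hyphen-free keep their names and are reserved.
    used = {i for i in ids if "-" not in i}
    mapping = {}
    for old in dict.fromkeys(ids):  # de-duplicate, keep first-seen order
        if "-" not in old:
            mapping[old] = old
            continue
        base = old.replace("-", replacement)
        candidate = base
        n = 1
        # If the rewritten id is already taken, append a numeric suffix.
        while candidate in used:
            n += 1
            candidate = f"{base}{replacement}{n}"
        used.add(candidate)
        mapping[old] = candidate
    return mapping


def require_hyphen_free(ids):
    """Fail-fast variant: raise if any id contains a hyphen."""
    bad = sorted({i for i in ids if "-" in i})
    if bad:
        raise ValueError("cluster ids contain hyphens: " + ", ".join(bad))


if __name__ == "__main__":
    # The id 'wsj-0001-c1' alone already yields three hyphen-separated fields.
    print(split_fields("wsj-0001-c1--2"))
    print(normalize_ids(["wsj-0001-c1", "wsj_0001_c1", "e5"]))
```

The collision check is the point of the sketch: a plain `replace("-", "_")` would silently merge "wsj-0001-c1" with an existing "wsj_0001_c1", which is the risk mentioned above.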
