Closed
Description
I ran this:
udapy -s read.OldCorefUD corefud.FixInterleaved < en_pcedt-ud-train.conllu > newtrain.conllu
udapy -s read.OldCorefUD corefud.FixInterleaved < en_pcedt-ud-dev.conllu > newdev.conllu
udapy -s read.OldCorefUD corefud.FixInterleaved < en_pcedt-ud-test.conllu > newtest.conllu
validate.py --lang en --level 2 --coref new*.conllu
And got this:
[(in newdev.conllu) Line 6 Sent wsj-0001-s1]: [L6 Coref too-many-entity-attributes] Entity 'wsj-0001-c1--2' has 5 attributes while only 4 attributes are globally declared.
[(in newdev.conllu) Line 6 Sent wsj-0001-s1]: [L6 Coref spurious-entity-type] Spurious entity type '0001'.
[(in newdev.conllu) Line 6 Sent wsj-0001-s1]: [L6 Coref spurious-mention-head] Entity head index 'c1' must be a non-zero-starting integer.
...
The problem is that the cluster ids in the input format contain hyphens (e.g., "wsj-0001-c1"). Judging by the errors above, the hyphen is also what separates the entity attributes in the output, so the validator reads the parts of the id as separate attributes. I see three options. First, replace hyphens with an empty string, an underscore, or some other character; but then one has to check that the new id is not already used elsewhere in the corpus. Second, and perhaps better, forcibly normalize cluster ids after reading via OldCorefUD, using the code that is already available in Udapi. Third, simply throw a fatal error when trying to save entities with such ids in the GUM format; then it would be up to the caller to take care of them.
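To make the first option concrete, here is a minimal sketch of hyphen removal with the collision check mentioned above. It is plain Python and does not use any Udapi API; the function name `sanitize_entity_ids`, the `replacement` parameter, and the `x<N>` suffix scheme are all hypothetical choices made for illustration, not what Udapi does.

```python
def sanitize_entity_ids(ids, replacement=""):
    """Return a dict mapping each original id to a hyphen-free, unique id.

    Hypothetical helper for illustration only; not part of Udapi.
    """
    if "-" in replacement:
        raise ValueError("replacement must not contain a hyphen")

    unique_ids = list(dict.fromkeys(ids))  # deduplicate, keep first-seen order
    mapping = {}
    used = set()

    # Ids that are already hyphen-free keep their name and reserve it.
    for old in unique_ids:
        if "-" not in old:
            mapping[old] = old
            used.add(old)

    # Ids with hyphens get a new name; on a clash, append a numeric suffix.
    for old in unique_ids:
        if old in mapping:
            continue
        base = old.replace("-", replacement)
        new = base
        suffix = 1
        while new in used:
            suffix += 1
            new = f"{base}x{suffix}"
        mapping[old] = new
        used.add(new)

    return mapping


if __name__ == "__main__":
    # The id from the error message splits into five hyphen-separated fields.
    print("wsj-0001-c1--2".split("-"))
    print(sanitize_entity_ids(["wsj-0001-c1", "wsj0001c1", "e2"]))
```

The first pass reserves ids that are already valid, so an existing "wsj0001c1" is never overwritten by the sanitized form of "wsj-0001-c1". Passing `replacement="_"` would give the underscore variant instead.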