GUM format writer does not check that cluster ids do not contain hyphens

I ran this:

```
udapy -s read.OldCorefUD corefud.FixInterleaved < en_pcedt-ud-train.conllu > newtrain.conllu
udapy -s read.OldCorefUD corefud.FixInterleaved < en_pcedt-ud-dev.conllu > newdev.conllu
udapy -s read.OldCorefUD corefud.FixInterleaved < en_pcedt-ud-test.conllu > newtest.conllu
validate.py --lang en --level 2 --coref new*.conllu
```

And got this:

```
[(in newdev.conllu) Line 6 Sent wsj-0001-s1]: [L6 Coref too-many-entity-attributes] Entity 'wsj-0001-c1--2' has 5 attributes while only 4 attributes are globally declared.
[(in newdev.conllu) Line 6 Sent wsj-0001-s1]: [L6 Coref spurious-entity-type] Spurious entity type '0001'.
[(in newdev.conllu) Line 6 Sent wsj-0001-s1]: [L6 Coref spurious-mention-head] Entity head index 'c1' must be a non-zero-starting integer.
...
```

The problem is that the cluster ids in the input format contain hyphens (e.g., "wsj-0001-c1"), while the GUM format uses the hyphen to separate the attributes of an entity. As the messages above show, the validator splits "wsj-0001-c1--2" into five fields and reads "0001" as the entity type and "c1" as the head index. One possibility would be to replace the hyphens with empty strings, underscores or something else; but then one would have to check that the new id is not already used elsewhere in the corpus. So perhaps it would be better to forcefully normalize cluster ids after reading via OldCorefUD, using the code that is already available in Udapi. A third option would be to simply throw a fatal error when trying to save entities with such ids in the GUM format; then it would be up to the caller to take care of them.
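To make the first and the third option concrete, here is a minimal sketch in plain Python. It does not use the Udapi API at all: `sanitize_cluster_ids` and `assert_no_hyphens` are hypothetical helper names, and the sketch assumes the cluster ids are available as plain strings. It only illustrates the collision check and the fatal-error check; it is not how Udapi actually normalizes ids.

```python
def sanitize_cluster_ids(ids, replacement="_"):
    """Return a dict mapping each id to a hyphen-free id.

    Ids without hyphens are kept unchanged. A rewritten id gets a numeric
    suffix if it would collide with an id that is already in use.
    (Hypothetical helper, not part of Udapi.)
    """
    if "-" in replacement:
        raise ValueError("the replacement string must not contain a hyphen")
    ids = list(ids)
    # Ids that need no change are reserved first, so they are never shadowed.
    used = {i for i in ids if "-" not in i}
    mapping = {}
    for old in ids:
        if old in mapping:
            continue
        if "-" not in old:
            mapping[old] = old
            continue
        base = old.replace("-", replacement)
        candidate, n = base, 1
        while candidate in used:
            n += 1
            candidate = f"{base}{replacement}{n}"
        used.add(candidate)
        mapping[old] = candidate
    return mapping


def assert_no_hyphens(ids):
    """Raise ValueError if any id contains a hyphen (the 'fatal error' option)."""
    bad = sorted({i for i in ids if "-" in i})
    if bad:
        raise ValueError("cluster ids must not contain hyphens: " + ", ".join(bad))
```

With this sketch, an id such as "wsj-0001-c1" becomes "wsj_0001_c1", unless that id already exists in the corpus, in which case a suffix is appended. The writer could call something like `assert_no_hyphens` just before saving, so that the problem is reported by Udapi rather than by the validator.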

GUM format writer does not check that cluster ids do not contain hyphens #97
