GUM format support #96

martinpopel · 2022-02-02T17:51:21Z

Udapi now supports the bracketing "GUM-style" format of coreference annotations by default.
The CorefUD 0.1 (and 0.2) style format is supported using a specialized reader (read.OldCorefUD) and writer (write.OldCorefUD).

The API has been changed just slightly (more changes and renamings are planned):

* CorefMention.__init__ has words as the second parameter (after self) because mention.words cannot be empty anymore.
* mention.other (DualDict) instead of mention.misc (str)
* ordering of mentions was re-defined (CorefMention.__lt__) so that longer mentions go first (if starting on the same node)
* BridgingLinks serialization follows the new format (and link.relation and link.target are now mutable).
* document.meta['global.Entity'] reflects the global.Entity header and can be used for reading and writing (so that the writer can use different positional attributes of entity annotations).
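
As an illustration of the last point, here is a minimal sketch of how a `global.Entity` header could drive the parsing of positional attributes. The helper names `parse_global_entity` and `parse_mention_attrs` are hypothetical and not part of Udapi; the header value `eid-etype-head-other` is the CorefUD 1.0 one, and the mention string is made up for the example.

```python
def parse_global_entity(comment_line):
    """Return the positional attribute names declared in the header, or None."""
    prefix = "# global.Entity = "
    if not comment_line.startswith(prefix):
        return None
    return comment_line[len(prefix):].strip().split("-")


def parse_mention_attrs(value, fields):
    """Map the hyphen-separated values of one opening bracket to field names."""
    # Split at most len(fields) - 1 times so the last field keeps any hyphens.
    parts = value.split("-", len(fields) - 1)
    # In this sketch, missing trailing values are reported as None.
    parts += [None] * (len(fields) - len(parts))
    return dict(zip(fields, parts))


fields = parse_global_entity("# global.Entity = eid-etype-head-other")
attrs = parse_mention_attrs("e11-person-2-Foo=bar", fields)
print(fields)
print(attrs)
```

A writer could use the same field list in the other direction, joining the values in the declared order.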

If two mentions start at the same word, the longer must be saved first, in the new format. However, we cannot cycle through `reversed(doc.coref_mentions)` because that would break the ordering of closing brackets. The easiest solution seems to be to redefine `CorefMention.__lt__`, so that it follows the order in which mentions must be stored in the new format.
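
The ordering described above can be sketched as follows. `Span` is a hypothetical stand-in for a mention (it only stores an entity ID and the first and last word index); the real `CorefMention.__lt__` may differ in details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a mention: entity ID plus first/last word index."""
    eid: str
    start: int
    end: int

    def __lt__(self, other):
        # Earlier start goes first; on a tie, the longer span (larger end) goes first.
        return (self.start, -self.end) < (other.start, -other.end)


spans = [Span("e2", 3, 3), Span("e1", 3, 7), Span("e3", 1, 2)]
ordered = sorted(spans)
print([s.eid for s in ordered])
```

With this ordering, plain `sorted()` yields the mentions in the order in which their opening brackets have to be written.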

but we can have multiple src mentions in a single `Bridge=` annotation, e.g. `Entity=(e5(e6|Bridge=e1<e5,e2<e5,e3<e6`, so instead of `bl=BridgingLinks(src_mention, string, clusters)`, we need to use a factory method which returns a **list** of BridgingLinks: `(bl5, bl6) = BridgingLinks.from_string("e1<e5,e2<e5,e3<e6", clusters)`. Cataphora SplitAnte and Bridge are allowed again.
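
A minimal sketch of the grouping step such a factory method has to perform, assuming each item has the shape `target<source` with an optional `:relation` suffix. `group_bridge_links` is a hypothetical helper working on plain strings; it is not the Udapi `BridgingLinks.from_string` implementation and does not resolve IDs to clusters.

```python
from collections import defaultdict


def group_bridge_links(value):
    """Group 'target<source[:relation]' items by their source entity ID."""
    grouped = defaultdict(list)
    for item in value.split(","):
        link, _, relation = item.partition(":")   # relation is "" if absent
        target, _, source = link.partition("<")
        grouped[source].append((target, relation))
    return dict(grouped)


links = group_bridge_links("e1<e5,e2<e5,e3<e6")
print(links)
```

Each key of the result corresponds to one source mention, so the example string yields two groups, one for e5 and one for e6.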

`mention.words += span_to_nodes(mention.head.root, f'{mention.words[-1].ord + 1}-{node.ord}')` This is slow, but also buggy because if `ord == "20.1"` then the following ord is not `ord + 1`.
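
A small sketch of the `ord + 1` problem, with nodes reduced to plain ord numbers (an assumption made for the example). Selecting by comparison against the existing ords avoids computing a successor ord that does not exist; `nodes_between` is a hypothetical helper.

```python
def nodes_between(ords, last_ord, target_ord):
    """Return all ords after last_ord, up to and including target_ord."""
    return [o for o in ords if last_ord < o <= target_ord]


# A sentence with an empty node 20.1 between words 20 and 21.
ords = [19, 20, 20.1, 21, 22]

naive_next = 20.1 + 1          # 21.1, which is not a node in this sentence
added = nodes_between(ords, 20.1, 22)
print(naive_next in ords, added)
```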

if they end with a single-word part

by default, mark the **head** in the sequence of forms

see #25

If no etype is present, it will be None in the API and an empty string in the serialization. When loading a file where a CorefCluster is first created as part of `Bridge=`, it is created with cluster_type None, but when a line with `Entity=` is loaded, its correct cluster_type is set.

so that we can do e.g. `bridge.relation = bridge.relation.lower()`

We need to break them into (continuous) subspans, treat each as a fake CorefMention and sort all mentions (real and fake), so that they are stored in the correct order.
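
The splitting step can be sketched as below. The sketch assumes that words are identified by their integer position in the sentence's node list, so that "continuous" simply means consecutive positions; `split_into_subspans` is a hypothetical helper, not the Udapi code.

```python
def split_into_subspans(mention_positions):
    """Split word positions into maximal runs of consecutive positions."""
    subspans = []
    for pos in sorted(mention_positions):
        if subspans and pos == subspans[-1][-1] + 1:
            subspans[-1].append(pos)   # extends the current continuous subspan
        else:
            subspans.append([pos])     # a gap starts a new subspan
    return subspans


subspans = split_into_subspans([2, 3, 4, 8, 9, 12])
print(subspans)
```

Each resulting subspan can then be sorted together with the ordinary mentions by its start and end position.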

corefud.MarkSameSubSpan detects such mentions; see ufal/corefUD#26 for details.

see ufal/corefUD#28

We need to
* `discontinuous_mentions[eid].pop()` when closing the last subspan
* find the correct mention from `discontinuous_mentions[eid]` when opening non-first subspan. It does not need to be the top of the stack. It is the first opened (i.e. unfinished) mention in `discontinuous_mentions[eid]`, i.e.
  ```python
  opened = [pair[0] for pair in unfinished_mentions[eid]]
  mention = next(m for m in discontinuous_mentions[eid] if m not in opened)
  ```
* We forbid discontinuous crossing same-cluster spans, so we can throw an exception when closing the last subspan of a mention which is not at the top of the stack (`_error(f"Closing mention {mention} at {node}, but it has unfinished nested mentions ({m})", 1)`). See ufal/corefUD#27

fixes #97

as suggested in #29. This will also prevent `Bridge=c1<c2:|Entity=...`, i.e. a colon not followed by a label, which is forbidden by validate.py.
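
A minimal sketch of a serializer that follows this rule: the colon is written only when there is a non-empty relation label. `serialize_bridge` is a hypothetical helper working on plain strings, not the Udapi implementation.

```python
def serialize_bridge(target, source, relation=""):
    """Serialize one link as 'target<source' or 'target<source:relation'."""
    item = f"{target}<{source}"
    if relation:                 # empty string (or None) means no label
        item += f":{relation}"
    return item


print(serialize_bridge("c1", "c2"))
print(serialize_bridge("c1", "c2", "subset"))
```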

i.e. fixing various bugs other than corefud.FixInterleaved and the very format conversion

in ufal/corefUD#29 (comment)

as agreed in ufal/corefUD#13 and ufal/corefUD#30.

fixes #98

so that even with non-deterministic heads on the input, the output will always be the same. Fixes #100

This block is needed for the conversion of the Czech data, together with the new stuff in gum_format.

So we need to skip the test that all SplitAnte links have the same source. Adding an example: `Entity=(e11-person(e12-person)|SplitAnte=e3<e11,e4<e11,e6<e12,e7<e12` which means that both e11 and e12 have split antecedents (e11=e3+e4, e12=e6+e7).

michnov

Sorry, I haven't finished it, yet. But I don't want to delay it any longer so feel free to merge it.

michnov · 2022-02-08T13:37:54Z

udapi/core/basereader.py

@@ -28,6 +28,7 @@ def __init__(self, files='-', filehandle=None, zone='keep', bundles_per_doc=0, e
            logging.debug('Using sent_id_filter=%s', sent_id_filter)
        self.split_docs = split_docs
        self.ignore_sent_id = ignore_sent_id
+        self._global_entity = None


Shouldn't processing of the global.Entity "pragma" be limited only to CoNLL-U reader/writer?

martinpopel · 2022-02-10

Unfortunately, it must be here. read.Conllu uses a fast loading interface, read_trees(), which reads all the trees in a file at once. It has no access to the document instance; it just returns a sequence of trees (which may be split into multiple documents if bundles_per_doc is set). So read.Conllu cannot store the global.Entity in document.meta['global.Entity'], where it belongs. Instead, it is handled here in the basereader and temporarily stored in self._global_entity.
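
A simplified sketch of this design, under heavy assumptions: `SketchReader` is a hypothetical class, trees are plain lists of lines and the document is a dict. It only illustrates why the header is first stashed on the reader and copied into the document metadata afterwards; it does not reproduce the Udapi reader classes.

```python
class SketchReader:
    """Hypothetical reader: trees are lists of lines, documents are dicts."""

    def __init__(self):
        self._global_entity = None  # temporary storage for the header value

    def read_trees(self, lines):
        """Return all trees at once; no document instance is available here."""
        trees, tree = [], []
        for line in lines:
            if line.startswith("# global.Entity = "):
                # The value cannot go into document.meta yet, so stash it.
                self._global_entity = line.split(" = ", 1)[1]
            elif line == "":
                if tree:
                    trees.append(tree)
                tree = []
            else:
                tree.append(line)
        if tree:
            trees.append(tree)
        return trees

    def read_document(self, lines):
        """Build the document, then move the stashed header into its meta."""
        doc = {"trees": self.read_trees(lines), "meta": {}}
        if self._global_entity is not None:
            doc["meta"]["global.Entity"] = self._global_entity
            self._global_entity = None
        return doc


reader = SketchReader()
doc = reader.read_document([
    "# global.Entity = eid-etype-head-other",
    "# sent_id = 1",
    "1\tHello\t_",
    "",
])
print(doc["meta"])
```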

udapi/core/coref.py

@ondfa

* Add .circleci/config.yml
* CircleCI debugging
* CircleCI debug
* regexes need \r""
* allow len(document)
  Users may expect this to work, when document[i] works.
* reader.read_documents()
* add a comment explaining the hack from #96
* add a first test for coreference API
* fix the bug revealed in test_coref.py thanks to @ondfa
* switch from TravisCI to CircleCI

martinpopel added 30 commits January 16, 2022 00:44

global.Entity support

f46fde7

shortcuts: doc.coref_mentions and tree.document

872325a

reading and writing new (CorefUD 1.0) format of coreference

93eb57f

oops, bug in detecting discontinuous mentions

97ccad8

fix ordering of brackets in serialization for crossing mention spans

dc00773

parsing discontinuous mentions

e548310

bugfix in discontinuous mention parsing

9bfb859

`mention.words += span_to_nodes(mention.head.root, f'{mention.words[-1].ord + 1}-{node.ord}')` This is slow, but also buggy because if `ord == "20.1"` then the following ord is not `ord + 1`.

bugfix in head assignment in discontinuous mentions

154ee0f

if they end with a single-word part

corefud.PrintClusters mark_head=1

29e0acd

by default, mark the **head** in the sequence of forms

CorefMention.__init__ now requires words to be specified

6414473

corefud.MarkInterleaved

7d9c71d

see #25

fix bridging serialization

b7903a8

corefud.FixInterleaved

caf77df

BridgingLink should be mutable

8c2279a

so that we can do e.g. `bridge.relation = bridge.relation.lower()`

fix serialization of discontinuous mentions

79faa81

We need to break them into (continuous) subspans, treat each as a fake CorefMention and sort all mentions (real and fake), so that they are stored in the correct order.

corefud.FixInterleaved now fixes also "same-cluster same-subspan"

96c313e

corefud.MarkSameSubSpan detects such mentions; see ufal/corefUD#26 for details.

don't repeat Bridge for each subspan

5da0584

keep node._mentions updated

dad357d

fix serialization ordering of single-word mentions

f35f3dd

report whole block name in logging.info(f"Executing block {bname}")

ee58966

don't allow hyphens and other forbidden chars in cluster IDs

748ee8b

fixes #97

more elegant implementation of chars forbidden in ID

9f4f829

escape also round brackets in MentionMisc values

e6c2177

use empty string for missing/unknown bridging relations

22084da

as suggested in #29. This will also prevent `Bridge=c1<c2:|Entity=...`, i.e. a colon not followed by a label, which is forbidden by validate.py.

block corefud.FixCorefUD02 for converting CorefUD 0.2 to 1.0

5b96933

i.e. fixing various bugs other than corefud.FixInterleaved and the very format conversion

rename bridging relations as suggested by @dan-zeman

c30fd50

in ufal/corefUD#29 (comment)

martinpopel and others added 16 commits February 8, 2022 15:15

harmonize etype and gstype

57b55a8

as agreed in ufal/corefUD#13 and ufal/corefUD#30.

This new block could help solve #98.

4b63606

Merge branch 'gum-format' of github.com:udapi/udapi-python into gum-f…

545ecea

…ormat

We probably do not have to report every instance.

efdf6f1

allow infstat-minspan-identity as extra positional attributes in GUM

08404b7

global.Entity won't be written without newdoc

8171865

don't introduce same-span mentions

1552d72

fixes #98

allow corefud.MoveHead keep_head_if_possible=0

ce31a30

so that even with non-deterministic heads on the input, the output will always be the same. Fixes #100

Selectively propagating changes in one file from master to gum_format.

7979989

This block is needed for the conversion of the Czech data, together with the new stuff in gum_format.

Fixing enhanced dependency labels.

abaf958

fix corefud.FixInterleaved so it does not merge an already deleted me…

2184071

…ntion

DualDict supports only str values

885ab34

Bug fix.

79414c0

Merge branch 'gum-format' of github.com:udapi/udapi-python into gum-f…

722ca88

…ormat

Another edeprel fix rule.

3910a75

michnov reviewed Feb 9, 2022

thanks @michnov for the review

a3e99c0

martinpopel merged commit c691804 into master Feb 10, 2022

martinpopel deleted the gum-format branch February 10, 2022 09:58

martinpopel added a commit that referenced this pull request Feb 22, 2022

add a comment explaining the hack from #96

a4ae10f
