Review core #29

martinpopel · 2016-12-23T09:36:29Z

WIP: more commits will follow.
Unfortunately, I don't have time to pack the refactoring into neat commented commits
because there are so many bugs and things to change.

This is just a first phase of a biger refactoring.

`bundle._document = bundle` this line was obviously wrong

as suggested in #29 This will also prevent `Bridge=c1<c2:|Entity=...`, i.e. a colon not followed by a label, which is forbidden by validate.py.

@dan-zeman

* global.Entity support * shortcuts: doc.coref_mentions and tree.document * reading and writing new (CorefUD 1.0) format of coreference * oops, bug in detecting discontinuous mentions * fix ordering of brackets in serialization for crossing mention spans * change CorefMention.__lt__, document rules for serialization If two mentions start at the same word, the longer must be saved first, in the new format. However, we cannot cycle through `reversed(doc.coref_mentions)` because that would break the ordering of closing brackets. The easiest solution seems to be to redefine `CorefMention.__lt__`, so that it follows the order in which mentions must be stored in the new format. * BridgingLinks string represenation now follows the new format but we can have multiple src mentions in a single `Bridge=` annotation, e.g. `Entity=(e5(e6|Bridge=e1<e5,e2<e5,e3<e6`, so instead of `bl=BridgingLinks(src_mention, string, clusters)`, we need to use a factory method which returns a **list** of BridgingLinks: `(bl5, bl6) = BridgingLinks.from_string("e1<e5,e2<e5,e3<e6", clusters)`. Cataphora SplitAnte and Brige are allowed again. * parsing discontinuous mentions * bugfix in discontinuous mention parsing `mention.words += span_to_nodes(mention.head.root, f'{mention.words[-1].ord + 1}-{node.ord}')` This is slow, but also buggy because if `ord == "20.1"` then the following ord is not `ord + 1`. * bugfix in head assignment in discontinuous mentions if they end with a single-word part * corefud.PrintClusters mark_head=1 by default, mark the **head** in the sequence of forms * CorefMention.__init__ now requires words to be specified * corefud.MarkInterleaved see #25 * fix bridging serialization * corefud.FixInterleaved * use empty string as etype instead of "unk" If no etype is present, it will be None in the API and an empty string in the serialization When loading a file where a CorefCluster is created first when being part of `Bridge=`, it is crated with cluster_type None, but when loading a line with `Entity=`, its correct cluster_type is loaded. * BridgingLink should be mutable so that we can do e.g. `bridge.relation = bridge.relation.lower()` * fix serialization of discontinuous mentions We need to break them into (continuous) subspans, treat each as a fake CorefMention and sort all mentions (real and fake), so that they are stored in the correct order. * corefud.FixInterleaved now fixes also "same-cluster same-subspan" corefud.MarkSameSubSpan detects such mentions see ufal/corefUD#26 for details * don't repeat Bridge for each subspan * fix Discontinuous nested mention see ufal/corefUD#28 We need to * `discontinuous_mentions[eid].pop()` when closing the last subspan * find the correct mention from `discontinuous_mentions[eid]` when opening non-first subspan. It does not need to be the top of the stack. It is the first opened (i.e. unfinished) mention in `discontinuous_mentions[eid]`, i.e. ```python opened = [pair[0] for pair in unfinished_mentions[eid]] mention = next(m for m in discontinuous_mentions[eid] if m not in opened) ``` * We forbid discontinuous crossing same-cluster spans, so we can throw an exception when closing the last subspan of a mention which is not at the top of the stack (`_error(f"Closing mention {mention} at {node}, but it has unfinished nested mentions ({m})", 1)`). See ufal/corefUD#27 * keep node._mentions updated * fix serialization ordering of single-word mentions * report whole block name in `logging.info(f"Executing block {bname}")` * don't allow hyphens and other forbidden chars in cluster IDs fixes #97 * more elegant implementation of chars forbidden in ID * escape also round brackets in MentionMisc values * use empty string for missing/unknown bridging relations as suggested in #29 This will also prevent `Bridge=c1<c2:|Entity=...`, i.e. a colon not followed by a label, which is forbidden by validate.py. * block corefud.FixCorefUD02 for converting CorefUD 0.2 to 1.0 i.e. fixing various bugs other than corefud.FixInterleaved and the very format conversion * rename bridging relations as suggested by @dan-zeman in ufal/corefUD#29 (comment) * harmonize etype and gstype as agreed in ufal/corefUD#13 and ufal/corefUD#30. * This new block could help solve #98. * We probably do not have to report every instance. * allow infstat-minspan-identity as extra positional attributes in GUM * global.Entity won't be written without newdoc * don't introduce same-span mentions fixes #98 * allow corefud.MoveHead keep_head_if_possible=0 so that even with non-deterministic heads on the input, the output will be always the same Fixes #100 * Selectively propagating changes in one file from master to gum_format. This block is needed for the conversion of the Czech data, together with the new stuff in gum_format. * Fixing enhanced dependency labels. * fix corefud.FixInterleaved so it does not merge an already deleted mention * If two mentions start in the same node, both can have SplitAnte. So we need to skipt the test that all SplitAnte links have the same source. Adding an example: `Entity=(e11-person(e12-person)|SplitAnte=e3<e11,e4<e11,e6<e12,e7<e12` which means that both e11 and e12 have split antecedents (e11=e3+e4, e12=e6+e7). * DualDict supports only str values * Bug fix. * Another edeprel fix rule. * thanks @michnov for the review Co-authored-by: Dan Zeman <zeman@ufal.mff.cuni.cz>

fixing bugs discovered by pylint and flake8

278dbd8

This is just a first phase of a biger refactoring.

martinpopel force-pushed the review-core branch from b09b57d to 278dbd8 Compare January 4, 2017 16:59

martinpopel added 2 commits January 4, 2017 18:22

bug fix

f689f5b

`bundle._document = bundle` this line was obviously wrong

code cleanup

1858aaa

martinpopel merged commit 923576e into master Jan 5, 2017

martinpopel added a commit that referenced this pull request Feb 7, 2022

use empty string for missing/unknown bridging relations

22084da

as suggested in #29 This will also prevent `Bridge=c1<c2:|Entity=...`, i.e. a colon not followed by a label, which is forbidden by validate.py.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Review core #29

Review core #29

Uh oh!

martinpopel commented Dec 23, 2016

Uh oh!

Uh oh!

Review core #29

Review core #29

Uh oh!

Conversation

martinpopel commented Dec 23, 2016

Uh oh!

Uh oh!