-
Notifications
You must be signed in to change notification settings - Fork 33
Review core #29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Review core #29
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This is just a first phase of a biger refactoring.
b09b57d
to
278dbd8
Compare
martinpopel
added a commit
that referenced
this pull request
Feb 7, 2022
as suggested in #29 This will also prevent `Bridge=c1<c2:|Entity=...`, i.e. a colon not followed by a label, which is forbidden by validate.py.
martinpopel
added a commit
that referenced
this pull request
Feb 10, 2022
* global.Entity support * shortcuts: doc.coref_mentions and tree.document * reading and writing new (CorefUD 1.0) format of coreference * oops, bug in detecting discontinuous mentions * fix ordering of brackets in serialization for crossing mention spans * change CorefMention.__lt__, document rules for serialization If two mentions start at the same word, the longer must be saved first, in the new format. However, we cannot cycle through `reversed(doc.coref_mentions)` because that would break the ordering of closing brackets. The easiest solution seems to be to redefine `CorefMention.__lt__`, so that it follows the order in which mentions must be stored in the new format. * BridgingLinks string represenation now follows the new format but we can have multiple src mentions in a single `Bridge=` annotation, e.g. `Entity=(e5(e6|Bridge=e1<e5,e2<e5,e3<e6`, so instead of `bl=BridgingLinks(src_mention, string, clusters)`, we need to use a factory method which returns a **list** of BridgingLinks: `(bl5, bl6) = BridgingLinks.from_string("e1<e5,e2<e5,e3<e6", clusters)`. Cataphora SplitAnte and Brige are allowed again. * parsing discontinuous mentions * bugfix in discontinuous mention parsing `mention.words += span_to_nodes(mention.head.root, f'{mention.words[-1].ord + 1}-{node.ord}')` This is slow, but also buggy because if `ord == "20.1"` then the following ord is not `ord + 1`. * bugfix in head assignment in discontinuous mentions if they end with a single-word part * corefud.PrintClusters mark_head=1 by default, mark the **head** in the sequence of forms * CorefMention.__init__ now requires words to be specified * corefud.MarkInterleaved see #25 * fix bridging serialization * corefud.FixInterleaved * use empty string as etype instead of "unk" If no etype is present, it will be None in the API and an empty string in the serialization When loading a file where a CorefCluster is created first when being part of `Bridge=`, it is crated with cluster_type None, but when loading a line with `Entity=`, its correct cluster_type is loaded. * BridgingLink should be mutable so that we can do e.g. `bridge.relation = bridge.relation.lower()` * fix serialization of discontinuous mentions We need to break them into (continuous) subspans, treat each as a fake CorefMention and sort all mentions (real and fake), so that they are stored in the correct order. * corefud.FixInterleaved now fixes also "same-cluster same-subspan" corefud.MarkSameSubSpan detects such mentions see ufal/corefUD#26 for details * don't repeat Bridge for each subspan * fix Discontinuous nested mention see ufal/corefUD#28 We need to * `discontinuous_mentions[eid].pop()` when closing the last subspan * find the correct mention from `discontinuous_mentions[eid]` when opening non-first subspan. It does not need to be the top of the stack. It is the first opened (i.e. unfinished) mention in `discontinuous_mentions[eid]`, i.e. ```python opened = [pair[0] for pair in unfinished_mentions[eid]] mention = next(m for m in discontinuous_mentions[eid] if m not in opened) ``` * We forbid discontinuous crossing same-cluster spans, so we can throw an exception when closing the last subspan of a mention which is not at the top of the stack (`_error(f"Closing mention {mention} at {node}, but it has unfinished nested mentions ({m})", 1)`). See ufal/corefUD#27 * keep node._mentions updated * fix serialization ordering of single-word mentions * report whole block name in `logging.info(f"Executing block {bname}")` * don't allow hyphens and other forbidden chars in cluster IDs fixes #97 * more elegant implementation of chars forbidden in ID * escape also round brackets in MentionMisc values * use empty string for missing/unknown bridging relations as suggested in #29 This will also prevent `Bridge=c1<c2:|Entity=...`, i.e. a colon not followed by a label, which is forbidden by validate.py. * block corefud.FixCorefUD02 for converting CorefUD 0.2 to 1.0 i.e. fixing various bugs other than corefud.FixInterleaved and the very format conversion * rename bridging relations as suggested by @dan-zeman in ufal/corefUD#29 (comment) * harmonize etype and gstype as agreed in ufal/corefUD#13 and ufal/corefUD#30. * This new block could help solve #98. * We probably do not have to report every instance. * allow infstat-minspan-identity as extra positional attributes in GUM * global.Entity won't be written without newdoc * don't introduce same-span mentions fixes #98 * allow corefud.MoveHead keep_head_if_possible=0 so that even with non-deterministic heads on the input, the output will be always the same Fixes #100 * Selectively propagating changes in one file from master to gum_format. This block is needed for the conversion of the Czech data, together with the new stuff in gum_format. * Fixing enhanced dependency labels. * fix corefud.FixInterleaved so it does not merge an already deleted mention * If two mentions start in the same node, both can have SplitAnte. So we need to skipt the test that all SplitAnte links have the same source. Adding an example: `Entity=(e11-person(e12-person)|SplitAnte=e3<e11,e4<e11,e6<e12,e7<e12` which means that both e11 and e12 have split antecedents (e11=e3+e4, e12=e6+e7). * DualDict supports only str values * Bug fix. * Another edeprel fix rule. * thanks @michnov for the review Co-authored-by: Dan Zeman <zeman@ufal.mff.cuni.cz>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
WIP: more commits will follow.
Unfortunately, I don't have time to pack the refactoring into neat commented commits
because there are so many bugs and things to change.