GUM format support #96
If two mentions start at the same word, the longer must be saved first, in the new format. However, we cannot cycle through `reversed(doc.coref_mentions)` because that would break the ordering of closing brackets. The easiest solution seems to be to redefine `CorefMention.__lt__`, so that it follows the order in which mentions must be stored in the new format.
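A minimal sketch of the ordering described above, assuming a simplified stand-in class (`ToyMention`, with integer `start` and `end` word positions) rather than the real `CorefMention`: mentions sort by their first word, and among mentions starting at the same word the longer one comes first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyMention:
    """Hypothetical stand-in for CorefMention; not the Udapi class."""
    eid: str
    start: int  # position of the first word of the mention
    end: int    # position of the last word of the mention

    def __lt__(self, other):
        # Earlier start goes first; on a tie, the longer mention
        # (larger end) goes first, hence the negated end.
        return (self.start, -self.end) < (other.start, -other.end)


mentions = [ToyMention("e6", 3, 3), ToyMention("e5", 3, 5), ToyMention("e4", 1, 2)]
ordered_eids = [m.eid for m in sorted(mentions)]
print(ordered_eids)
```

With this ordering, a plain `sorted()` over the mentions yields opening brackets in the order required by the new format, without iterating in reverse.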
but we can have multiple src mentions in a single `Bridge=` annotation, e.g. `Entity=(e5(e6|Bridge=e1<e5,e2<e5,e3<e6`, so instead of `bl=BridgingLinks(src_mention, string, clusters)`, we need to use a factory method which returns a **list** of BridgingLinks: `(bl5, bl6) = BridgingLinks.from_string("e1<e5,e2<e5,e3<e6", clusters)`. Cataphora SplitAnte and Bridge are allowed again.
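The grouping that such a factory method has to perform can be sketched as follows. This is only an illustration with a hypothetical helper (`group_bridging_by_source`) working on plain strings; it is not the actual `BridgingLinks.from_string` implementation and it ignores the `clusters` lookup.

```python
def group_bridging_by_source(string):
    """Group 'target<source[:relation]' items by their source entity ID.

    Hypothetical helper for illustration; returns a dict mapping each
    source ID to a list of (target ID, relation or None) pairs.
    """
    groups = {}
    for item in string.split(','):
        link, _, relation = item.partition(':')
        target, sep, source = link.partition('<')
        if not sep or not target or not source:
            raise ValueError(f"Malformed bridging item: {item!r}")
        groups.setdefault(source, []).append((target, relation or None))
    return groups


grouped = group_bridging_by_source("e1<e5,e2<e5,e3<e6")
print(grouped)
```

Each key of the resulting dict corresponds to one source mention, so one `BridgingLinks` object per key would be created (two in this example, for `e5` and `e6`).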
`mention.words += span_to_nodes(mention.head.root, f'{mention.words[-1].ord + 1}-{node.ord}')` This is slow, but also buggy because if `ord == "20.1"` then the following ord is not `ord + 1`.
if they end with a single-word part
by default, mark the **head** in the sequence of forms
If no etype is present, it will be None in the API and an empty string in the serialization. When loading a file where a CorefCluster is first created as part of `Bridge=`, it is created with cluster_type None, but when loading a line with `Entity=`, its correct cluster_type is loaded.
so that we can do e.g. `bridge.relation = bridge.relation.lower()`
We need to break them into (continuous) subspans, treat each as a fake CorefMention and sort all mentions (real and fake), so that they are stored in the correct order.
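Breaking a discontinuous mention into continuous subspans can be sketched like this. The helper name `continuous_subspans` is hypothetical, and the sketch assumes plain integer word positions (so it does not handle empty nodes such as `20.1`).

```python
def continuous_subspans(ords):
    """Split word positions into maximal runs of consecutive integers.

    Hypothetical helper for illustration; assumes integer positions only.
    """
    spans = []
    for position in sorted(ords):
        if spans and position == spans[-1][-1] + 1:
            spans[-1].append(position)  # extends the current run
        else:
            spans.append([position])    # gap found: start a new subspan
    return spans


subspans = continuous_subspans([3, 4, 7, 8, 9, 12])
print(subspans)
```

Each resulting subspan can then be treated as a separate (fake) mention for sorting purposes, alongside the real continuous mentions.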
corefud.MarkSameSubSpan detects such mentions; see ufal/corefUD#26 for details.
see ufal/corefUD#28
We need to
* `discontinuous_mentions[eid].pop()` when closing the last subspan
* find the correct mention from `discontinuous_mentions[eid]` when opening non-first subspan. It does not need to be the top of the stack. It is the first opened (i.e. unfinished) mention in `discontinuous_mentions[eid]`, i.e.
```python
opened = [pair[0] for pair in unfinished_mentions[eid]]
mention = next(m for m in discontinuous_mentions[eid] if m not in opened)
```
* We forbid discontinuous crossing same-cluster spans, so we can throw an exception when closing the last subspan of a mention which is not at the top of the stack (`_error(f"Closing mention {mention} at {node}, but it has unfinished nested mentions ({m})", 1)`). See ufal/corefUD#27
as suggested in #29. This will also prevent `Bridge=c1<c2:|Entity=...`, i.e. a colon not followed by a label, which is forbidden by validate.py.
i.e. fixing various bugs other than corefud.FixInterleaved and the very format conversion
as agreed in ufal/corefUD#13 and ufal/corefUD#30.
so that even with non-deterministic heads on the input, the output will always be the same. Fixes #100
This block is needed for the conversion of the Czech data, together with the new stuff in gum_format.
So we need to skip the test that all SplitAnte links have the same source. Adding an example: `Entity=(e11-person(e12-person)|SplitAnte=e3<e11,e4<e11,e6<e12,e7<e12` which means that both e11 and e12 have split antecedents (e11=e3+e4, e12=e6+e7).
Sorry, I haven't finished it, yet. But I don't want to delay it any longer so feel free to merge it.
@@ -28,6 +28,7 @@ def __init__(self, files='-', filehandle=None, zone='keep', bundles_per_doc=0, e
logging.debug('Using sent_id_filter=%s', sent_id_filter)
self.split_docs = split_docs
self.ignore_sent_id = ignore_sent_id
self._global_entity = None
Shouldn't processing of the `global.Entity` "pragma" be limited only to CoNLL-U reader/writer?
Unfortunately, it must be here. `read.Conllu` uses a fast loading interface with `read_trees()`, which reads all the trees in a file at once, but it does not have access to the document instance, it just returns a sequence of trees (which may be split into multiple documents if `bundles_per_doc` is set). So `read.Conllu` cannot store the `global.Entity` in `document.meta['global.Entity']` where it belongs. So it must be done here in the basereader and temporarily stored in `self._global_entity`.
* Add .circleci/config.yml
* CircleCI debugging
* CircleCI debug
* regexes need \r""
* allow len(document) Users may expect this to work, when document[i] works.
* reader.read_documents()
* add a comment explaining the hack from #96
* add a first test for coreference API
* fix the bug revealed in test_coref.py thanks to @ondfa
* switch from TravisCI to CircleCI
Udapi now supports the bracketing "GUM-style" format of coreference annotations by default.
The CorefUD 0.1 (and 0.2) style format is supported using a specialized reader (`read.OldCorefUD`) and writer (`write.OldCorefUD`).
The API has been changed just slightly (more changes and renamings are planned):
* `CorefMention.__init__` has `words` as the second parameter (after `self`) because `mention.words` cannot be empty anymore.
* `mention.other` (DualDict) instead of `mention.misc` (str)
* `CorefMention.__lt__` has been redefined so that longer mentions go first (if starting on the same node)
* `BridgingLinks` serialization follows the new format (and `link.relation` and `link.target` are now mutable).
* `document.meta['global.Entity']` reflects the `global.Entity` header and can be used for reading and writing (so that the writer can use different positional attributes of entities annotations).
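To illustrate how positional attributes could be interpreted using the `global.Entity` header, here is a rough sketch. The helper `parse_entity_attributes` is hypothetical, and the header value `eid-etype-head-other` is an assumption used for the example; this is not the Udapi reader code.

```python
def parse_entity_attributes(global_entity, entity_string):
    """Map hyphen-separated values to the field names given by the header.

    Hypothetical helper for illustration. The last field absorbs any
    remaining hyphens; fields missing from the value are simply omitted.
    """
    fields = global_entity.split('-')
    values = entity_string.split('-', len(fields) - 1)
    return dict(zip(fields, values))


# 'eid-etype-head-other' is an assumed header value for this example.
header = "eid-etype-head-other"
attrs = parse_entity_attributes(header, "e11-person")
print(attrs)
```

A writer could use the same header in the opposite direction, joining the attribute values with hyphens in the order the header prescribes.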