GUM format support #96

Merged
merged 47 commits into from
Feb 10, 2022

martinpopel
Contributor

Udapi now supports the bracketing "GUM-style" format of coreference annotations by default.
The CorefUD 0.1 (and 0.2) style format is supported using a specialized reader (read.OldCorefUD) and writer (write.OldCorefUD).

The API has been changed just slightly (more changes and renamings are planned):

  • CorefMention.__init__ has words as the second parameter (after self) because mention.words cannot be empty anymore.
  • mention.other (DualDict) instead of mention.misc (str)
  • ordering of mentions was re-defined (CorefMention.__lt__) so that longer mentions go first (if starting on the same node)
  • BridgingLinks serialization follows the new format (and link.relation and link.target are now mutable).
  • document.meta['global.Entity'] reflects the global.Entity header and can be used for reading and writing (so that the writer can use different positional attributes of entity annotations).

If two mentions start at the same word,
the longer must be saved first, in the new format.
However, we cannot cycle through `reversed(doc.coref_mentions)`
because that would break the ordering of closing brackets.
The easiest solution seems to be to redefine `CorefMention.__lt__`,
so that it follows the order in which mentions must be stored
in the new format.
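
The ordering described above can be illustrated with a minimal sketch. `ToyMention` is a hypothetical stand-in, not Udapi's `CorefMention`, and it assumes a mention can be reduced to the ords of its first and last word:

```python
from dataclasses import dataclass


@dataclass
class ToyMention:
    """Hypothetical stand-in for a mention: entity id plus a word span."""
    eid: str
    start: int  # ord of the first word
    end: int    # ord of the last word (inclusive)

    def __lt__(self, other):
        # Earlier start goes first; on a tie, the longer span goes first,
        # so that its opening bracket is written before the nested one.
        return (self.start, -self.end) < (other.start, -other.end)


mentions = [ToyMention("e6", 3, 3), ToyMention("e5", 3, 7), ToyMention("e1", 1, 2)]
ordered = sorted(mentions)  # sorted() only needs __lt__
print([m.eid for m in ordered])  # ['e1', 'e5', 'e6']
```

With this ordering, the two mentions starting at word 3 come out as e5 then e6, matching an opening sequence such as `(e5(e6`.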
but we can have multiple src mentions in a single `Bridge=` annotation,
e.g. `Entity=(e5(e6|Bridge=e1<e5,e2<e5,e3<e6`,
so instead of `bl=BridgingLinks(src_mention, string, clusters)`,
we need to use a factory method which returns a **list** of BridgingLinks:
`(bl5, bl6) = BridgingLinks.from_string("e1<e5,e2<e5,e3<e6", clusters)`.
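
As a rough sketch of why a factory returning several objects is needed, the following groups the links of one `Bridge=` value by source entity. `group_bridging` is a hypothetical helper, not the actual `BridgingLinks.from_string`; it assumes the source entity is the id to the right of `<` and that an optional relation label may follow a colon:

```python
from collections import defaultdict


def group_bridging(value):
    """Group a Bridge= value by source entity (hypothetical helper).

    Assumes each item looks like 'target<source' or 'target<source:relation'.
    """
    by_source = defaultdict(list)
    for item in value.split(","):
        target, _, rest = item.partition("<")
        source, _, relation = rest.partition(":")
        by_source[source].append((target, relation or None))
    return dict(by_source)


groups = group_bridging("e1<e5,e2<e5,e3<e6")
print(groups)
# {'e5': [('e1', None), ('e2', None)], 'e6': [('e3', None)]}
```

One string therefore yields one group per source mention (here e5 and e6), which is what a single-object constructor could not express.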

Cataphora SplitAnte and Bridge are allowed again.
`mention.words += span_to_nodes(mention.head.root, f'{mention.words[-1].ord + 1}-{node.ord}')`
This is slow, but also buggy because if `ord == "20.1"`
then the following ord is not `ord + 1`.
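
A small sketch of the problem, assuming (for illustration only) that ords are modelled as plain numbers and that an empty node gets a decimal ord such as 20.1. Looking up the next node by position in the sentence avoids the arithmetic on ords:

```python
# Ords of the nodes in a toy sentence; 20.1 stands for an empty node.
ords = [19, 20, 20.1, 21, 22]

last = 20.1   # ord of the last word already in the mention
end = 22      # ord of the node the mention should be extended to

# Naive approach: assume the following node has ord `last + 1`.
naive_next = last + 1
print(naive_next in ords)  # False: there is no such node

# Position-based approach: take the nodes after `last`, up to `end`.
following = ords[ords.index(last) + 1 : ords.index(end) + 1]
print(following)  # [21, 22]
```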
if they end with a single-word part
by default, mark the **head** in the sequence of forms
If no etype is present, it will be None in the API and an empty string in the serialization

When loading a file where a CorefCluster is first created as part of `Bridge=`,
it is created with cluster_type None, but when a line with `Entity=` is loaded,
its correct cluster_type is set.
so that we can do e.g. `bridge.relation = bridge.relation.lower()`
We need to break them into (continuous) subspans,
treat each as a fake CorefMention
and sort all mentions (real and fake),
so that they are stored in the correct order.
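
The splitting and sorting steps above can be sketched as follows. This is a simplified illustration with hypothetical names, assuming integer ords only (no empty nodes) and representing each fake mention as a `(start, end, eid)` tuple:

```python
def continuous_subspans(ords):
    """Split sorted integer word positions into runs of consecutive numbers."""
    spans = []
    for o in ords:
        if spans and o == spans[-1][-1] + 1:
            spans[-1].append(o)   # extends the current run
        else:
            spans.append([o])     # starts a new run
    return spans


# Toy input: e1 is discontinuous (words 3-4 and 7-8), e2 is the single word 7.
mention_words = {"e1": [3, 4, 7, 8], "e2": [7]}

pieces = []
for eid, words in mention_words.items():
    for span in continuous_subspans(words):
        pieces.append((span[0], span[-1], eid))

# Sort real and fake mentions together: earlier start first, longer first on ties.
pieces.sort(key=lambda p: (p[0], -p[1]))
print(pieces)  # [(3, 4, 'e1'), (7, 8, 'e1'), (7, 7, 'e2')]
```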
corefud.MarkSameSubSpan detects such mentions
see ufal/corefUD#26 for details
see ufal/corefUD#28
We need to
* `discontinuous_mentions[eid].pop()` when closing the last subspan
* find the correct mention from `discontinuous_mentions[eid]`
  when opening non-first subspan.
  It does not need to be the top of the stack.
  It is the first opened (i.e. unfinished) mention in `discontinuous_mentions[eid]`, i.e.
```python
opened = [pair[0] for pair in unfinished_mentions[eid]]
mention = next(m for m in discontinuous_mentions[eid] if m not in opened)
```
* We forbid discontinuous crossing same-cluster spans, so we can throw an exception
  when closing the last subspan of a mention which is not at the top of the stack
  (`_error(f"Closing mention {mention} at {node}, but it has unfinished nested mentions ({m})", 1)`).
See ufal/corefUD#27
as suggested in #29

This will also prevent `Bridge=c1<c2:|Entity=...`,
i.e. a colon not followed by a label,
which is forbidden by validate.py.
i.e. fixing various bugs other than corefud.FixInterleaved
and the format conversion itself
martinpopel and others added 16 commits February 8, 2022 15:15
so that even with non-deterministic heads on the input,
the output will always be the same
Fixes #100
This block is needed for the conversion of the Czech data, together with the new stuff in gum_format.
So we need to skip the test that all SplitAnte links have the same source.
Adding an example:

 `Entity=(e11-person(e12-person)|SplitAnte=e3<e11,e4<e11,e6<e12,e7<e12`
 which means that both e11 and e12 have split antecedents (e11=e3+e4, e12=e6+e7).
Contributor

@michnov michnov left a comment

Sorry, I haven't finished it, yet. But I don't want to delay it any longer so feel free to merge it.

@@ -28,6 +28,7 @@ def __init__(self, files='-', filehandle=None, zone='keep', bundles_per_doc=0, e
logging.debug('Using sent_id_filter=%s', sent_id_filter)
self.split_docs = split_docs
self.ignore_sent_id = ignore_sent_id
self._global_entity = None
Contributor

Shouldn't processing of the global.Entity "pragma" be limited only to CoNLL-U reader/writer?

Contributor Author

Unfortunately, it must be here. read.Conllu uses a fast loading interface, read_trees(), which reads all the trees in a file at once. It has no access to the document instance; it just returns a sequence of trees (which may be split into multiple documents if bundles_per_doc is set). read.Conllu therefore cannot store the global.Entity in document.meta['global.Entity'], where it belongs, so this must be done here in the basereader, with the value temporarily stored in self._global_entity.
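
The constraint described in this reply can be pictured with a toy reader. Everything below is a hypothetical illustration, not Udapi's actual reader code: `ToyReader`, its methods, and the dict used as a document are assumptions made for the sketch.

```python
class ToyReader:
    """Hypothetical reader: tree loading cannot see the document."""

    def __init__(self, lines):
        self.lines = lines
        self._global_entity = None  # temporary storage for the header

    def read_trees(self):
        """Return sentence blocks only; no document object is available here."""
        trees, current = [], []
        for line in self.lines:
            if line.startswith("# global.Entity = "):
                # Nowhere to put this yet, so stash it on the reader.
                self._global_entity = line.split(" = ", 1)[1]
            elif line == "":
                if current:
                    trees.append(current)
                    current = []
            else:
                current.append(line)
        if current:
            trees.append(current)
        return trees

    def read_document(self):
        """Build a toy document and move the stashed header into its meta."""
        trees = self.read_trees()
        meta = {}
        if self._global_entity is not None:
            meta["global.Entity"] = self._global_entity
            self._global_entity = None
        return {"meta": meta, "trees": trees}


reader = ToyReader(["# global.Entity = eid-etype-head-other", "# sent_id = 1", "1\tHello", ""])
doc = reader.read_document()
print(doc["meta"])  # {'global.Entity': 'eid-etype-head-other'}
```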

@martinpopel martinpopel merged commit c691804 into master Feb 10, 2022
@martinpopel martinpopel deleted the gum-format branch February 10, 2022 09:58
martinpopel added a commit that referenced this pull request Feb 22, 2022
martinpopel added a commit that referenced this pull request Feb 22, 2022
* Add .circleci/config.yml

* CircleCI debugging

* CircleCI debug

* regexes need r""

* allow len(document)

Users may expect this to work, when document[i] works.

* reader.read_documents()

* add a comment explaining the hack from #96

* add a first test for coreference API

* fix the bug revealed in test_coref.py thanks to @ondfa

* switch from TravisCI to CircleCI