Skip to content

Review core #29

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jan 5, 2017
Merged

Review core #29

merged 3 commits into from
Jan 5, 2017

Conversation

martinpopel
Copy link
Contributor

WIP: more commits will follow.
Unfortunately, I don't have time to pack the refactoring into neat commented commits
because there are so many bugs and things to change.

This is just a first phase of a biger refactoring.
`bundle._document = bundle` this line was obviously wrong
@martinpopel martinpopel merged commit 923576e into master Jan 5, 2017
martinpopel added a commit that referenced this pull request Feb 7, 2022
as suggested in #29

This will also prevent `Bridge=c1<c2:|Entity=...`,
i.e. a colon not followed by a label,
which is forbidden by validate.py.
martinpopel added a commit that referenced this pull request Feb 10, 2022
* global.Entity support

* shortcuts: doc.coref_mentions and tree.document

* reading and writing new (CorefUD 1.0) format of coreference

* oops, bug in detecting discontinuous mentions

* fix ordering of brackets in serialization for crossing mention spans

* change CorefMention.__lt__, document rules for serialization

If two mentions start at the same word,
the longer must be saved first, in the new format.
However, we cannot cycle through `reversed(doc.coref_mentions)`
because that would break the ordering of closing brackets.
The easiest solution seems to be to redefine `CorefMention.__lt__`,
so that it follows the order in which mentions must be stored
in the new format.

* BridgingLinks string represenation now follows the new format

but we can have multiple src mentions in a single `Bridge=` annotation,
e.g. `Entity=(e5(e6|Bridge=e1<e5,e2<e5,e3<e6`,
so instead of `bl=BridgingLinks(src_mention, string, clusters)`,
we need to use a factory method which returns a **list** of BridgingLinks:
`(bl5, bl6) = BridgingLinks.from_string("e1<e5,e2<e5,e3<e6", clusters)`.

Cataphora SplitAnte and Brige are allowed again.

* parsing discontinuous mentions

* bugfix in discontinuous mention parsing

`mention.words += span_to_nodes(mention.head.root, f'{mention.words[-1].ord + 1}-{node.ord}')`
This is slow, but also buggy because if `ord == "20.1"`
then the following ord is not `ord + 1`.

* bugfix in head assignment in discontinuous mentions

if they end with a single-word part

* corefud.PrintClusters mark_head=1

by default, mark the **head** in the sequence of forms

* CorefMention.__init__ now requires words to be specified

* corefud.MarkInterleaved

see #25

* fix bridging serialization

* corefud.FixInterleaved

* use empty string as etype instead of "unk"

If no etype is present, it will be None in the API and an empty string in the serialization

When loading a file where a CorefCluster is created first when being part of `Bridge=`,
it is crated with cluster_type None, but when loading a line with `Entity=`,
its correct cluster_type is loaded.

* BridgingLink should be mutable

so that we can do e.g. `bridge.relation = bridge.relation.lower()`

* fix serialization of discontinuous mentions

We need to break them into (continuous) subspans,
treat each as a fake CorefMention
and sort all mentions (real and fake),
so that they are stored in the correct order.

* corefud.FixInterleaved now fixes also "same-cluster same-subspan"

corefud.MarkSameSubSpan detects such mentions
see ufal/corefUD#26 for details

* don't repeat Bridge for each subspan

* fix Discontinuous nested mention

see ufal/corefUD#28
We need to
* `discontinuous_mentions[eid].pop()` when closing the last subspan
* find the correct mention from `discontinuous_mentions[eid]`
  when opening non-first subspan.
  It does not need to be the top of the stack.
  It is the first opened (i.e. unfinished) mention in `discontinuous_mentions[eid]`, i.e.
```python
opened = [pair[0] for pair in unfinished_mentions[eid]]
mention = next(m for m in discontinuous_mentions[eid] if m not in opened)
```
* We forbid discontinuous crossing same-cluster spans, so we can throw an exception
when closing the last subspan of a mention which is not at the top of the stack
(`_error(f"Closing mention {mention} at {node}, but it has unfinished nested mentions ({m})", 1)`).
See ufal/corefUD#27

* keep node._mentions updated

* fix serialization ordering of single-word mentions

* report whole block name in `logging.info(f"Executing block {bname}")`

* don't allow hyphens and other forbidden chars in cluster IDs

fixes #97

* more elegant implementation of chars forbidden in ID

* escape also round brackets in MentionMisc values

* use empty string for missing/unknown bridging relations

as suggested in #29

This will also prevent `Bridge=c1<c2:|Entity=...`,
i.e. a colon not followed by a label,
which is forbidden by validate.py.

* block corefud.FixCorefUD02 for converting CorefUD 0.2 to 1.0

i.e. fixing various bugs other than corefud.FixInterleaved
and the very format conversion

* rename bridging relations as suggested by @dan-zeman

in ufal/corefUD#29 (comment)

* harmonize etype and gstype

as agreed in
ufal/corefUD#13 and
ufal/corefUD#30.

* This new block could help solve #98.

* We probably do not have to report every instance.

* allow infstat-minspan-identity as extra positional attributes in GUM

* global.Entity won't be written without newdoc

* don't introduce same-span mentions

fixes #98

* allow corefud.MoveHead keep_head_if_possible=0

so that even with non-deterministic heads on the input,
the output will be always the same
Fixes #100

* Selectively propagating changes in one file from master to gum_format.

This block is needed for the conversion of the Czech data, together with the new stuff in gum_format.

* Fixing enhanced dependency labels.

* fix corefud.FixInterleaved so it does not merge an already deleted mention

* If two mentions start in the same node, both can have SplitAnte.

So we need to skipt the test that all SplitAnte links have the same source.
Adding an example:

 `Entity=(e11-person(e12-person)|SplitAnte=e3<e11,e4<e11,e6<e12,e7<e12`
 which means that both e11 and e12 have split antecedents (e11=e3+e4, e12=e6+e7).

* DualDict supports only str values

* Bug fix.

* Another edeprel fix rule.

* thanks @michnov for the review

Co-authored-by: Dan Zeman <zeman@ufal.mff.cuni.cz>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant