
shall I use convert_to_heads when using CoNLL-U? #6

Closed
brgsk opened this issue Dec 1, 2021 · 7 comments
@brgsk

brgsk commented Dec 1, 2021

Hi, thanks so much for your work!
I have a question regarding the convert_to_heads.py script. I'm trying to make RoBERTa learn coreference resolution, but my data is in .conllu format. I'm having quite a hard time trying to preprocess the data/modify some of your code to make it work. Can you share some insights/thoughts on that? I would be very much obliged.

Cheers

@vdobrovolskii
Owner

Hi!
Is there any coreference data in your files? I think the least painful way would be to convert your data to the .conll format as described here in the *_conll File Format section.
Not all the columns are necessary, for instance, you can omit lemmas, frameset id, word sense and named entities.
Then you use convert_to_jsonlines.py and convert_to_heads.py on the result.

The alternative is to drop my convert_to_jsonlines.py script completely and preprocess your data yourself. You need to output files with the following structure:
one json per line, each with the following fields:

document_id:    str                    # document name
cased_words:    List[str]              # words
sent_id:        List[int]              # word id to sent id mapping
part_id:        int                    # document part id
speaker:        List[str]              # word id to speaker mapping
pos:            List[str]              # word id to POS mapping
deprel:         List[str]              # word id to dependency relation mapping
head:           List[int]              # word id to head word id mapping, None for root
clusters:       List[List[List[int]]]  # list of clusters,
                                       #     each cluster is a list of spans
                                       #         each span is a list of two ints (start and end word ids)

The resulting file should be passed to convert_to_heads.py.
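As a concrete illustration of the schema above, here is a minimal sketch that builds one document and writes it as a single jsonlines line. All field values (document name, words, clusters) are made up for the example:

```python
import json

# A hypothetical document following the schema above.
doc = {
    "document_id": "doc_001",
    "cased_words": ["The", "boy", "went", "upstairs", "."],
    "sent_id":     [0, 0, 0, 0, 0],          # all words are in sentence 0
    "part_id":     0,
    "speaker":     ["narrator"] * 5,
    "pos":         ["DET", "NOUN", "VERB", "ADV", "PUNCT"],
    "deprel":      ["det", "nsubj", "root", "advmod", "punct"],
    "head":        [1, 2, None, 2, 2],       # None marks the root word
    "clusters":    [[[0, 2]]],               # one cluster with one span, "The boy"
}

# One JSON object per line ("jsonlines" format).
with open("data.jsonlines", "w") as f:
    f.write(json.dumps(doc) + "\n")
```

A real corpus would simply write one such line per document.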

You can go even further and drop the convert_to_heads.py script. Then you need to output the following jsonlines file:

document_id:    str                    # document name
cased_words:    List[str]              # words
sent_id:        List[int]              # word id to sent id mapping
part_id:        int                    # document part id
speaker:        List[str]              # word id to speaker mapping

span_clusters:  List[List[List[int]]]  # list of clusters,
                                       #     each cluster is a list of spans
                                       #         each span is a list of two ints (start and end word ids)

head_clusters:  List[List[int]]        # list of clusters,
                                       #     each cluster is a list of span heads


head2span:      List[List[int]]        # list of training examples
                                       #     each example is a list of three ints
                                       #     head, span start, span end
                                       # this is used to train the model to predict spans from span heads 
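To make the relation between the three cluster fields concrete, here is a hypothetical single-document example with a consistency check. The document content is invented; the check just verifies that each head2span triple pairs a head cluster entry with its span cluster entry:

```python
# Hypothetical document in the "heads" format described above.
doc = {
    "document_id":   "doc_001",
    "cased_words":   ["The", "boy", "went", "upstairs", "."],
    "sent_id":       [0, 0, 0, 0, 0],
    "part_id":       0,
    "speaker":       ["narrator"] * 5,
    "span_clusters": [[[0, 2]]],      # the span "The boy" (end exclusive)
    "head_clusters": [[1]],           # head word of that span is "boy"
    "head2span":     [[1, 0, 2]],     # head, span start, span end
}

# Sanity check: span_clusters and head_clusters are parallel, and every
# (head, start, end) triple should appear in head2span.
for spans, heads in zip(doc["span_clusters"], doc["head_clusters"]):
    for span, head in zip(spans, heads):
        assert [head, span[0], span[1]] in doc["head2span"]
```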

Let me know if I can help with anything else.

@brgsk
Author

brgsk commented Dec 8, 2021

@vdobrovolskii thanks for your reply! Looking at convert_to_jsonlines.py right now, and if I understand it and your comment above correctly, then one json per line means one json-formatted sentence?

@vdobrovolskii
Owner

"One json per line" is rather "one document per line".
You can read more about the format here.
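In other words, each line of the file holds one complete JSON-encoded document, however many sentences it contains. A quick round-trip sketch (the file name and document contents are made up):

```python
import json

# Two hypothetical documents, serialized one per line.
docs = [
    {"document_id": "doc_001", "cased_words": ["Hello", "world"]},
    {"document_id": "doc_002", "cased_words": ["Goodbye"]},
]
with open("corpus.jsonlines", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# Reading it back: one json.loads call per line.
with open("corpus.jsonlines") as f:
    loaded = [json.loads(line) for line in f]
assert loaded == docs
```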

@brgsk
Author

brgsk commented Dec 8, 2021

Yeah I'm familiar with jsonls, I was wondering about your data's structure.
Thanks again, closing this one 👍

@brgsk brgsk closed this as completed Dec 8, 2021
@brgsk
Author

brgsk commented Dec 8, 2021

One more question!
In CorefSpansHolder._add_one(), when appending a span to self.spans, why is the appended list [word_id, word_id + 1]? Shouldn't it be [word_id, word_id]?

@brgsk brgsk reopened this Dec 8, 2021
@vdobrovolskii
Owner

Span indices don't include the upper bound.
For instance, in the following example the span [0, 2] means "words with indices starting with 0 and going up to, but not including, 2", i.e. indices 0 and 1 ("The", "boy").

The boy went upstairs.

By that logic, [word_id, word_id] will be a span of length 0.
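This matches Python's own half-open slicing convention, which can be checked directly on the example sentence:

```python
words = ["The", "boy", "went", "upstairs", "."]

# Span [0, 2]: start inclusive, end exclusive -> "The boy".
start, end = 0, 2
assert words[start:end] == ["The", "boy"]

# [word_id, word_id + 1] is a one-word span...
word_id = 1
assert words[word_id:word_id + 1] == ["boy"]

# ...while [word_id, word_id] would be an empty span of length 0.
assert words[word_id:word_id] == []
```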

@brgsk
Author

brgsk commented Dec 8, 2021

Sure, thanks a lot 😃
