
shall I use convert_to_heads when using CoNLL-U? #6

Closed
brgsk opened this issue Dec 1, 2021 · 7 comments
@brgsk

brgsk commented Dec 1, 2021

Hi, thanks so much for your work!
I have a question regarding the convert_to_heads.py script. I'm trying to make RoBERTa learn coreference resolution, but my data is in .conllu format. I'm having quite a hard time trying to preprocess the data/modify some of your code to make it work. Can you share some insights/thoughts on that? I would be very much obliged.

Cheers

@vdobrovolskii
Owner

Hi!
Is there any coreference data in your files? I think the least painful way would be to convert your data to the .conll format as described here in the *_conll File Format section.
Not all the columns are necessary, for instance, you can omit lemmas, frameset id, word sense and named entities.
Then you use convert_to_jsonlines.py and convert_to_heads.py on the result.

The alternative is to drop my convert_to_jsonlines.py script completely and preprocess your data yourself. You need to output files with the following structure:
one json per line, each with the following fields:

document_id:    str                    # document name
cased_words:    List[str]              # words
sent_id:        List[int]              # word id to sent id mapping
part_id:        int                    # document part id
speaker:        List[str]              # word id to speaker mapping
pos:            List[str]              # word id to POS mapping
deprel:         List[str]              # word id to dependency relation mapping
head:           List[int]              # word id to head word id mapping, None for root
clusters:       List[List[List[int]]]  # list of clusters,
                                       #     each cluster is a list of spans
                                       #         each span is a list of two ints (start and end word ids)

The resulting file should be passed to convert_to_heads.py.
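As a concrete illustration of the schema above, here is a minimal sketch that builds one document and writes it as a single jsonlines line. All field values (document name, words, clusters) are made up for the example:

```python
import json

# A hypothetical document following the schema above.
doc = {
    "document_id": "doc_001",
    "cased_words": ["The", "boy", "went", "upstairs", "."],
    "sent_id":     [0, 0, 0, 0, 0],          # all words are in sentence 0
    "part_id":     0,
    "speaker":     ["narrator"] * 5,
    "pos":         ["DET", "NOUN", "VERB", "ADV", "PUNCT"],
    "deprel":      ["det", "nsubj", "root", "advmod", "punct"],
    "head":        [1, 2, None, 2, 2],       # None marks the root word
    "clusters":    [[[0, 2]]],               # one cluster with one span, "The boy"
}

# One JSON object per line ("jsonlines" format).
with open("data.jsonlines", "w") as f:
    f.write(json.dumps(doc) + "\n")
```

A real corpus would simply write one such line per document.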

You can go even further and drop the convert_to_heads.py script. Then you need to output the following jsonlines file:

document_id:    str                    # document name
cased_words:    List[str]              # words
sent_id:        List[int]              # word id to sent id mapping
part_id:        int                    # document part id
speaker:        List[str]              # word id to speaker mapping

span_clusters:  List[List[List[int]]]  # list of clusters,
                                       #     each cluster is a list of spans
                                       #         each span is a list of two ints (start and end word ids)

head_clusters:  List[List[int]]        # list of clusters,
                                       #     each cluster is a list of span heads


head2span:      List[List[int]]        # list of training examples
                                       #     each example is a list of three ints
                                       #     head, span start, span end
                                       # this is used to train the model to predict spans from span heads 
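To make the relation between the three cluster fields concrete, here is a hypothetical single-document example with a consistency check. The document content is invented; the check just verifies that each head2span triple pairs a head cluster entry with its span cluster entry:

```python
# Hypothetical document in the "heads" format described above.
doc = {
    "document_id":   "doc_001",
    "cased_words":   ["The", "boy", "went", "upstairs", "."],
    "sent_id":       [0, 0, 0, 0, 0],
    "part_id":       0,
    "speaker":       ["narrator"] * 5,
    "span_clusters": [[[0, 2]]],      # the span "The boy" (end exclusive)
    "head_clusters": [[1]],           # head word of that span is "boy"
    "head2span":     [[1, 0, 2]],     # head, span start, span end
}

# Sanity check: span_clusters and head_clusters are parallel, and every
# (head, start, end) triple should appear in head2span.
for spans, heads in zip(doc["span_clusters"], doc["head_clusters"]):
    for span, head in zip(spans, heads):
        assert [head, span[0], span[1]] in doc["head2span"]
```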

Let me know if I can help with anything else.

@brgsk
Author

brgsk commented Dec 8, 2021

@vdobrovolskii thanks for your reply! Looking at convert_to_jsonlines.py right now, and if I understand it and your comment above correctly, then one json per line means one json-formatted sentence?

@vdobrovolskii
Owner

"One json per line" is rather "one document per line".
You can read more about the format here.
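In other words, each line of the file holds one complete JSON-encoded document, however many sentences it contains. A quick round-trip sketch (the file name and document contents are made up):

```python
import json

# Two hypothetical documents, serialized one per line.
docs = [
    {"document_id": "doc_001", "cased_words": ["Hello", "world"]},
    {"document_id": "doc_002", "cased_words": ["Goodbye"]},
]
with open("corpus.jsonlines", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# Reading it back: one json.loads call per line.
with open("corpus.jsonlines") as f:
    loaded = [json.loads(line) for line in f]
assert loaded == docs
```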

@brgsk
Author

brgsk commented Dec 8, 2021

Yeah I'm familiar with jsonls, I was wondering about your data's structure.
Thanks again, closing this one 👍

@brgsk brgsk closed this as completed Dec 8, 2021
@brgsk
Author

brgsk commented Dec 8, 2021

One more question!
In CorefSpansHolder._add_one(), when appending a span to self.spans, why is the appended list [word_id, word_id + 1]? Shouldn't it be [word_id, word_id]?

@brgsk brgsk reopened this Dec 8, 2021
@vdobrovolskii
Owner

Span indices don't include the upper bound.
For instance, in the following example the span [0, 2] means "words with indices starting with 0 and going up to, but not including, 2", i.e. indices 0 and 1 ("The", "boy").

The boy went upstairs.

By that logic, [word_id, word_id] will be a span of length 0.
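This matches Python's own half-open slicing convention, which can be checked directly on the example sentence:

```python
words = ["The", "boy", "went", "upstairs", "."]

# Span [0, 2]: start inclusive, end exclusive -> "The boy".
start, end = 0, 2
assert words[start:end] == ["The", "boy"]

# [word_id, word_id + 1] is a one-word span...
word_id = 1
assert words[word_id:word_id + 1] == ["boy"]

# ...while [word_id, word_id] would be an empty span of length 0.
assert words[word_id:word_id] == []
```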

@brgsk
Author

brgsk commented Dec 8, 2021

Sure, thanks a lot 😃
