about the training data format #23

Closed
leileilin opened this issue Jul 12, 2022 · 5 comments

Comments

@leileilin

Hello, I'd like to ask about the .jsonlines file produced by convert_to_jsonlines.py. Can training still succeed if some attributes in the jsonlines file are discarded?
For example, speaker or pos.

@vdobrovolskii
Owner

vdobrovolskii commented Jul 12, 2022

Do you mean, can you train without some of the attributes?
You can totally replace every element in the "speaker" array by any string and the model will be able to learn.
As for other keys, here are the ones that must be there for training (all the others are either legacy or needed only during preprocessing):

document_id:    str                    # document name
cased_words:    List[str]              # words
sent_id:        List[int]              # word id to sent id mapping
part_id:        int                    # document part id
speaker:        List[str]              # word id to speaker mapping

span_clusters:  List[List[List[int]]]  # list of clusters,
                                       #     each cluster is a list of spans
                                       #         each span is a list of two ints (start and end word ids)

head_clusters:  List[List[int]]        # list of clusters,
                                       #     each cluster is a list of span heads


head2span:      List[List[int]]        # list of training examples
                                       #     each example is a list of three ints
                                       #     head, span start, span end
                                       # this is used to train the model to predict spans from span heads 

See this issue.
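
To make the key list above concrete, here is a minimal sketch of one toy document and a sanity check for it. The helper name check_document, the toy sentence, and the "-" speaker value are made up for illustration and are not part of the repository. The span ends are written as exclusive, which is an assumption; the comment above only says "start and end word ids".

```python
import json

# Keys listed above as required for training, with the top-level
# Python type each one is expected to have after json.loads().
REQUIRED_KEYS = {
    "document_id": str,
    "cased_words": list,
    "sent_id": list,
    "part_id": int,
    "speaker": list,
    "span_clusters": list,
    "head_clusters": list,
    "head2span": list,
}


def check_document(doc):
    """Hypothetical helper: return a list of problems (empty if none)."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in doc:
            problems.append(f"missing key: {key}")
        elif not isinstance(doc[key], expected_type):
            problems.append(f"wrong type for {key}")

    # sent_id and speaker are per-word mappings, so they should have
    # exactly one entry per word.
    words = doc.get("cased_words")
    if isinstance(words, list):
        for key in ("sent_id", "speaker"):
            value = doc.get(key)
            if isinstance(value, list) and len(value) != len(words):
                problems.append(f"{key} length does not match cased_words")
    return problems


# Toy document; all values are invented for illustration.
example = {
    "document_id": "toy_doc",
    "cased_words": ["Alice", "said", "she", "left", "."],
    "sent_id": [0, 0, 0, 0, 0],
    "part_id": 0,
    "speaker": ["-"] * 5,                   # placeholder speaker per word
    "span_clusters": [[[0, 1], [2, 3]]],    # assumption: end is exclusive
    "head_clusters": [[0, 2]],              # head word of each span
    "head2span": [[0, 0, 1], [2, 2, 3]],    # head, span start, span end
}

# One document per line is what a .jsonlines file contains.
line = json.dumps(example)
print(check_document(json.loads(line)))
```

The check only looks at top-level types and per-word lengths; it does not verify that the cluster indices are consistent with each other.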

@leileilin
Author

Thanks. So you mean that the speaker attribute cannot be discarded, right?

@vdobrovolskii
Owner

it cannot be discarded, but it can be replaced with a placeholder value
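
A minimal sketch of what that replacement could look like, assuming the file has one JSON document per line. The function names, the file paths, and the "-" placeholder are hypothetical and do not come from the repository:

```python
import json


def with_placeholder_speaker(doc, placeholder="-"):
    """Return a copy of doc whose speaker list holds the same string
    for every word. The key is kept; only its values are replaced."""
    new_doc = dict(doc)
    new_doc["speaker"] = [placeholder] * len(doc["cased_words"])
    return new_doc


def rewrite_jsonlines(src_path, dst_path, placeholder="-"):
    """Read a .jsonlines file and write a copy with placeholder speakers."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            doc = json.loads(line)
            dst.write(json.dumps(with_placeholder_speaker(doc, placeholder)))
            dst.write("\n")


# Example with an in-memory document (values invented for illustration).
doc = {"cased_words": ["Hello", "there", "."], "speaker": ["A", "A", "A"]}
print(with_placeholder_speaker(doc)["speaker"])
```

The list keeps one entry per word, so it stays aligned with cased_words as the "word id to speaker mapping" description requires.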

@leileilin
Author

Thanks, I got it.

@leileilin
Author

I have another question: what is the split_jsonlines function in convert_to_jsonlines.py used for?
We could just use the mv command to move the .jsonlines file from the temp dir to the data dir.
