Questions_dataset-representation #10
Comments
Hi! "pos", "deprel" and "head" are only used during data preparation, to convert the span-based dataset into a word-based one; training and evaluation themselves do not need them. Inference needs even less: just cased_words, sent_id and, optionally, speaker.
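For concreteness, a minimal inference input could look roughly like this (a sketch based only on the field names mentioned in this thread; the exact shapes are my assumption):

```python
# Hypothetical minimal inference document: one entry per word.
doc = {
    "cased_words": ["Mary", "saw", "her", "friend", "."],
    "sent_id": [0, 0, 0, 0, 0],   # sentence index of each word
    "speaker": ["narrator"] * 5,  # optional
}
```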
Thanks for the explanation. For head2span, how do we get the 'head' data? I'm still confused by this line of code. What does avg_spans actually do, and why is it required?
Are span start and end the indices where the span starts and ends in the whole text, or the start and end of the span within a sentence?
Heads are calculated in this function here. avg_spans from the line of code you quoted calculates the average number of coreferent spans per document. It is used to weight the loss function here. Span start and end are word indices into the text, not the sentence.
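To make that concrete, here is a rough sketch (not the repository's actual code) of how the average number of coreferent spans per document could be computed and then used to weight a binary cross-entropy loss:

```python
import torch

# Assumption: each doc carries "span_clusters", a list of clusters, where each
# cluster is a list of (start, end) word indices into the whole document.
def avg_span_count(docs):
    """Average number of coreferent spans per document."""
    totals = [sum(len(cluster) for cluster in doc["span_clusters"]) for doc in docs]
    return sum(totals) / len(docs)

docs = [
    {"span_clusters": [[(0, 2), (5, 6)], [(8, 10)]]},  # 3 spans
    {"span_clusters": [[(1, 3), (4, 5), (7, 9)]]},     # 3 spans
]
avg_spans = avg_span_count(docs)  # -> 3.0

# One plausible way to use such a statistic: up-weight the rare positive class
# ("this word belongs to a coreferent span") in a BCE objective.
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor(avg_spans))
```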
How do we get doc['head']? Is it provided by the OntoNotes 5.0 dataset?
More or less. OntoNotes contains constituency syntax data, which is converted to dependency syntax data; this is where head/deprel/pos come from. The reason for the conversion is that it was easier for me to deal with dependency graphs than with constituency structures. But either could be used, although one would need to rewrite the convert_to_heads bit to make it work with constituents.
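As a rough illustration of what a convert_to_heads-style step has to do (a sketch only, not the actual implementation): given per-word dependency heads, the head of a span can be taken as the word whose parent lies outside the span.

```python
def span_head(span, heads):
    """Pick the syntactic head word of a span (toy version).

    span  -- (start, end) word indices into the document, end exclusive
    heads -- heads[i] is the index of word i's dependency parent,
             or None if word i is a sentence root
    """
    start, end = span
    for i in range(start, end):
        parent = heads[i]
        if parent is None or not (start <= parent < end):
            return i          # parent is outside the span -> span head
    return end - 1            # fallback for degenerate trees

# "her friend" in "Mary saw her friend .": Mary->saw, saw->ROOT,
# her->friend, friend->saw, .->saw
heads = [1, None, 3, 1, 1]
print(span_head((2, 4), heads))   # -> 3, i.e. "friend"
```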
Thank you for your excellent support. I will close this issue.
Based on my observations of this code base, training uses the following features:
cased_words, sent_id, speaker, pos, deprel, head, clusters,
which are then converted into:
cased_words, sent_id, speaker, pos, deprel, head, head2span, word_clusters, span_clusters,
while the inference examples only use cased_words, sent_id and, optionally, speaker information (a toy sketch of the converted fields is at the end of this post).
My question is: which of these features are actually required for training/evaluation, and which for inference?
Thank you
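To make the feature lists above concrete, here is a toy sketch of how the converted fields might relate to each other (field names are from this thread; the values and exact shapes are my assumption, not output of the preparation scripts):

```python
# Toy document after conversion; word indices are document-wide, end exclusive.
converted_doc = {
    "cased_words": ["Mary", "saw", "her", "friend", "."],
    "sent_id": [0, 0, 0, 0, 0],
    "speaker": ["narrator"] * 5,
    # one entry per mention: (head word index, span start, span end)
    "head2span": [(0, 0, 1), (2, 2, 3)],
    # clusters over head words only (word-level coreference): Mary <- her
    "word_clusters": [[0, 2]],
    # the original span-level clusters
    "span_clusters": [[(0, 1), (2, 3)]],
}
```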