
Questions_dataset-representation #10

Closed

fajarmuslim opened this issue Dec 22, 2021 · 7 comments

Comments
@fajarmuslim

Based on my observation of this code base, training uses the following features:
cased_words, sent_id, speaker, pos, deprel, head, clusters.

These are then converted into:
cased_words, sent_id, speaker, pos, deprel, head, head2span, word_clusters, span_clusters.

Meanwhile, in the inference data example, the only features used are cased_words, sent_id, and, optionally, speaker information.

My questions are:

  1. How do we get the pos, deprel, head, and clusters data in inference mode? Are they derived from cased_words or not?
  2. In training mode, are the speaker, pos, deprel, head, and clusters data used as well?

Thank you

@vdobrovolskii
Owner

Hi!

Training and evaluation themselves require the following:
"cased_words": tokenized words of the text
"sent_id": sentence index for each word of the text
"speaker": speaker name for each word of the text
"head2span": triples of [span head, span start, span end] to train the span predictor
"word_clusters": lists of coreference clusters, where each cluster is a list of word indices
"span_clusters": lists of coreference clusters, where each cluster is a list of spans

"pos", "deprel", "head" are only used during data preparation to be able to convert a span-based dataset to a word-based one.

Inference only needs the following:
"cased_words": tokenized words of the text
"sent_id": sentence index for each word of the text
"speaker": speaker name for each word of the text (optional)

  1. We don't get the pos, deprel and head data during inference, because we don't use them. Cluster data is the actual output of the model.
  2. See above.

@fajarmuslim
Author

Thanks for the explanation....

For head2span, how do we get the 'head' data?

I'm still confused by this line of code. What does avg_spans actually do, and why is it required?
avg_spans = sum(len(doc["head2span"]) for doc in docs) / len(docs)

@fajarmuslim
Author

Are span start and end the indices where the span starts and ends in the text, or are they the start and end of the span within a sentence?

@vdobrovolskii
Owner

vdobrovolskii commented Dec 22, 2021

Heads are calculated in this function here.
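
For readers who land here later: as I understand it, the head of a span is the single word inside the span whose dependency head lies outside the span. A simplified sketch of that rule (not the repo's exact code):

```python
def get_head(span_start, span_end, head):
    """Return the index of the single word in [span_start, span_end)
    whose dependency head lies outside the span, or None if there is
    not exactly one such word. head[i] is the index of word i's
    dependency head (None for the syntactic root)."""
    external = [i for i in range(span_start, span_end)
                if head[i] is None or not span_start <= head[i] < span_end]
    return external[0] if len(external) == 1 else None

# "her dog" in "Mary saw her dog .": per-word dependency heads are
# Mary->saw, saw->root, her->dog, dog->saw, .->saw
print(get_head(2, 4, [1, None, 3, 1, 1]))  # -> 3 ("dog")
```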

avg_spans, computed in the line of code you quoted, is the average number of coreferent spans per document. It is used to weight the loss function here.
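
To unpack that a bit: since the span loss is summed over gold spans, dividing by the average span count keeps its scale comparable across documents. A simplified, runnable sketch (names like scores and targets are invented, not the repo's):

```python
import torch

# avg_spans: mean number of head2span triples per training document.
docs = [{"head2span": [[0, 0, 1], [3, 2, 4]]}, {"head2span": [[5, 5, 6]]}]
avg_spans = sum(len(doc["head2span"]) for doc in docs) / len(docs)  # 1.5

# Dividing a summed loss by avg_spans stops documents with many gold
# spans from dominating the span-prediction term.
criterion = torch.nn.CrossEntropyLoss(reduction="sum")
scores = torch.randn(4, 10)           # hypothetical per-span scores
targets = torch.tensor([2, 2, 5, 5])  # hypothetical gold boundaries
span_loss = criterion(scores, targets) / avg_spans
```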

Span start and end are word indices into the text, not into the sentence.
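
A quick indexing illustration (assuming end-exclusive spans, matching Python slicing):

```python
cased_words = ["Mary", "saw", "her", "dog", ".", "It", "barked", "."]
# Word indices run over the whole document, not per sentence, so a span
# in the second sentence still uses document-level positions:
span = (5, 6)                        # "It" (its sent_id would be 1)
print(cased_words[span[0]:span[1]])  # -> ['It']
```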

@fajarmuslim
Author

fajarmuslim commented Dec 22, 2021

How do we get doc['head']? Is it given by the OntoNotes 5.0 dataset?

@vdobrovolskii
Owner

More or less. OntoNotes provides constituency syntax data, which is converted to dependency syntax data; this is where head/deprel/pos come from.

The reason for the conversion was that it was easier for me to deal with dependency graphs than with constituency structures. But both can be used, although one would need to rewrite the convert_to_heads bit to make it work with constituents.

@fajarmuslim
Author

Thank you for your excellent support....

I will close this issue.
