
Questions_dataset-representation #10

Closed

fajarmuslim opened this issue Dec 22, 2021 · 7 comments

Comments
@fajarmuslim

Based on my observation of this code base, training uses the following features:
cased_words, sent_id, speaker, pos, deprel, head, clusters.

These are then converted into:
cased_words, sent_id, speaker, pos, deprel, head, head2span, word_clusters, span_clusters.

Meanwhile, in the inference data example, the only features used are cased_words, sent_id, and, optionally, speaker information.

My questions are:

  1. How do we get the pos, deprel, head, and clusters data in inference mode? Are they derived from cased_words or not?
  2. In training mode, are the speaker, pos, deprel, head, and clusters data used as well?

Thank you

@vdobrovolskii
Owner

Hi!

Training and evaluation themselves require the following:
"cased_words": tokenized words of the text
"sent_id": sentence index for each word of the text
"speaker": speaker name for each word of the text
"head2span": triples of [span head, span start, span end] to train the span predictor
"word_clusters": lists of coreference clusters, where each cluster is a list of word indices
"span_clusters": lists of coreference clusters, where each cluster is a list of spans

"pos", "deprel", "head" are only used during data preparation to be able to convert a span-based dataset to a word-based one.

Inference only needs the following:
"cased_words": tokenized words of the text
"sent_id": sentence index for each word of the text
"speaker": speaker name for each word of the text (optional)

  1. We don't get the pos, deprel and head data during inference, because we don't use them. Cluster data is the actual output of the model.
  2. See above.

@fajarmuslim
Author

Thanks for the explanation....

For head2span, how do we get the 'head' data?

I'm still confused by this line of code. What does avg_spans actually do, and why is it required?
avg_spans = sum(len(doc["head2span"]) for doc in docs) / len(docs)

@fajarmuslim
Author

Are span start and end the indices where the span starts and ends in the text, or are they the start and end of the span within a sentence?

@vdobrovolskii
Owner

vdobrovolskii commented Dec 22, 2021

Heads are calculated in this function here.
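
For readers who land here later: as I understand it, the head of a span is the single word inside the span whose dependency head lies outside the span. A simplified sketch of that rule (not the repo's exact code):

```python
def get_head(span_start, span_end, head):
    """Return the index of the single word in [span_start, span_end)
    whose dependency head lies outside the span, or None if there is
    not exactly one such word. head[i] is the index of word i's
    dependency head (None for the syntactic root)."""
    external = [i for i in range(span_start, span_end)
                if head[i] is None or not span_start <= head[i] < span_end]
    return external[0] if len(external) == 1 else None

# "her dog" in "Mary saw her dog .": per-word dependency heads are
# Mary->saw, saw->root, her->dog, dog->saw, .->saw
print(get_head(2, 4, [1, None, 3, 1, 1]))  # -> 3 ("dog")
```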

avg_spans, computed in the line of code you quoted, is the average number of coreferent spans per document. It is used to weight the loss function here.
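
To unpack that a bit: since the span loss is summed over gold spans, dividing by the average span count keeps its scale comparable across documents. A simplified, runnable sketch (names like scores and targets are invented, not the repo's):

```python
import torch

# avg_spans: mean number of head2span triples per training document.
docs = [{"head2span": [[0, 0, 1], [3, 2, 4]]}, {"head2span": [[5, 5, 6]]}]
avg_spans = sum(len(doc["head2span"]) for doc in docs) / len(docs)  # 1.5

# Dividing a summed loss by avg_spans stops documents with many gold
# spans from dominating the span-prediction term.
criterion = torch.nn.CrossEntropyLoss(reduction="sum")
scores = torch.randn(4, 10)           # hypothetical per-span scores
targets = torch.tensor([2, 2, 5, 5])  # hypothetical gold boundaries
span_loss = criterion(scores, targets) / avg_spans
```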

Span start and end are word indices into the text, not into the sentence.
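
A quick indexing illustration (assuming end-exclusive spans, matching Python slicing):

```python
cased_words = ["Mary", "saw", "her", "dog", ".", "It", "barked", "."]
# Word indices run over the whole document, not per sentence, so a span
# in the second sentence still uses document-level positions:
span = (5, 6)                        # "It" (its sent_id would be 1)
print(cased_words[span[0]:span[1]])  # -> ['It']
```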

@fajarmuslim
Author

fajarmuslim commented Dec 22, 2021

How do we get doc['head']? Is it given by the OntoNotes 5.0 dataset?

@vdobrovolskii
Owner

More or less. OntoNotes provides constituency syntax data, which is converted to dependency syntax data; this is where head/deprel/pos come from.

The reason for the conversion was that it was easier for me to deal with dependency graphs than with constituency structures. But both can be used, although one would need to rewrite the convert_to_heads bit to make it work with constituents.

@fajarmuslim
Author

Thank you for your excellent support....

I will close this issue.
