
How to slice <eos> token with different sentence length #23

rshaojimmy opened this issue Feb 1, 2022 · 7 comments

@rshaojimmy commented Feb 1, 2022

Since I want the model to predict the <eos> token but not have it as an input, I simply slice the <eos> token off the end of the sequence. Thus:

trg = [sos, x_1, x_2, x_3, eos]
trg[:-1] = [sos, x_1, x_2, x_3]

This is also the same as your implementation.

But in practice many datasets contain sentences of different lengths, and thus the last elements of a sentence are <pad> tokens, such as:

trg = [sos, x_1, x_2, x_3, eos, pad, pad, pad]
trg[:-1] = [sos, x_1, x_2, x_3, eos, pad, pad]

In such a case, I can't slice off the <eos> token this way. May I ask how I can solve this issue?
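For illustration, a minimal Python snippet (my own example, not code from this repo) of why the plain trg[:-1] slice stops removing <eos> once the sequence is padded:

sos, eos, pad = '<sos>', '<eos>', '<pad>'

trg = [sos, 'x_1', 'x_2', 'x_3', eos]
print(trg[:-1])          # ['<sos>', 'x_1', 'x_2', 'x_3'] -> <eos> removed as intended

trg_padded = [sos, 'x_1', 'x_2', 'x_3', eos, pad, pad, pad]
print(trg_padded[:-1])   # <eos> is still present; only the last <pad> is dropped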

@saahiluppal (Owner)

while <pad> in array:
    remove <pad> from array

remove <eos> from array
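
A minimal Python sketch of the suggestion above, assuming the caption is a plain list of token strings (array, PAD and EOS are illustrative names, not identifiers from this repo):

PAD, EOS = '<pad>', '<eos>'
array = ['<sos>', 'x_1', 'x_2', 'x_3', EOS, PAD, PAD, PAD]

# strip all the padding first...
while PAD in array:
    array.remove(PAD)

# ...then drop <eos> so it is not fed back into the model as an input
array.remove(EOS)

print(array)             # ['<sos>', 'x_1', 'x_2', 'x_3']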

@rshaojimmy (Author) commented Feb 2, 2022

Thanks for your quick reply!

But if I remove <eos> from the array, how can the model learn to stop generating the sentence if it never encounters the <eos> token?

@saahiluppal (Owner) commented Feb 2, 2022 via email

@rshaojimmy (Author) commented Feb 3, 2022

But the target we compute the loss against (trg[1:]) should still contain the eos token, right?
Like this:
trg[1:] = [x_1, x_2, x_3, eos, pad, pad, pad]
or
trg[1:] = [x_1, x_2, x_3, eos]
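
For reference, a small PyTorch sketch (my own illustration; it assumes the criterion is a CrossEntropyLoss with ignore_index set to the pad id, which may not match the repo's exact setup) showing that trailing pads in the target contribute nothing to the loss, so the two forms above give the same value:

import torch
import torch.nn as nn

PAD_ID = 0                                    # assumed pad token id, for illustration only
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)

vocab = 10
logits = torch.randn(1, 6, vocab)             # stand-in model outputs for 6 positions

tgt_padded = torch.tensor([[4, 5, 6, 2, PAD_ID, PAD_ID]])   # [x_1, x_2, x_3, eos, pad, pad]
tgt_short = tgt_padded[:, :4]                                # [x_1, x_2, x_3, eos]

loss_padded = criterion(logits.permute(0, 2, 1), tgt_padded)
loss_short = criterion(logits[:, :4].permute(0, 2, 1), tgt_short)
print(torch.allclose(loss_padded, loss_short))               # True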

@saahiluppal (Owner) commented Feb 3, 2022 via email

@rshaojimmy (Author) commented Feb 3, 2022

Thanks.

To sum up, I just want to create a dataset with sequences of different lengths. In this dataset, I insert sos and eos tokens at the beginning and end of each sequence as the ground truth, like this:

caps = [sos, x_1, x_2, x_3, eos]

In such a case,

caps[:, :-1] = [sos, x_1, x_2, x_3]
caps[:, 1:] = [x_1, x_2, x_3, eos]

This is what we want for the loss calculation.

outputs = model(samples, caps[:, :-1], cap_masks[:, :-1])
loss = criterion(outputs.permute(0, 2, 1), caps[:, 1:])

However, given the different lengths, I have to further insert pad tokens to make all sequences the same length, such as:

caps = [sos, x_1, x_2, x_3, eos, pad, pad, pad]

In such a case,

caps[:, :-1] = [sos, x_1, x_2, x_3, eos, pad, pad]
caps[:, 1:] = [x_1, x_2, x_3, eos, pad, pad, pad]

The input to the model (caps[:, :-1]) will then contain the eos token, which we want to remove.

Considering this, I just replace the eos token in the input with a pad token, since pad tokens are not counted in the loss, like this:

caps[:, :-1] = [sos, x_1, x_2, x_3, pad, pad, pad]

And I keep caps[:, 1:] as

caps[:, 1:] = [x_1, x_2, x_3, eos, pad, pad, pad].

May I ask whether this makes sense?
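
If it helps, a small end-to-end sketch of the scheme above (the token ids, the fake logits and the criterion are my own illustrative assumptions, not the repo's exact code): the eos in the decoder input is replaced by pad, the shifted target keeps eos, and pad positions are excluded from the loss via ignore_index.

import torch
import torch.nn as nn

PAD, SOS, EOS = 0, 1, 2                        # illustrative token ids

# one padded ground-truth caption: [sos, x_1, x_2, x_3, eos, pad, pad, pad]
caps = torch.tensor([[SOS, 5, 6, 7, EOS, PAD, PAD, PAD]])

# decoder input: drop the last position, then overwrite eos with pad
# so the model never receives eos as an input token
inp = caps[:, :-1].clone()
inp[inp == EOS] = PAD                          # [sos, x_1, x_2, x_3, pad, pad, pad]

# target: shift left by one; eos stays so the model learns to emit it
tgt = caps[:, 1:]                              # [x_1, x_2, x_3, eos, pad, pad, pad]

# pad positions are excluded from the loss
criterion = nn.CrossEntropyLoss(ignore_index=PAD)

vocab = 10
logits = torch.randn(1, inp.size(1), vocab)    # stand-in for model(samples, inp, cap_mask)
loss = criterion(logits.permute(0, 2, 1), tgt)
print(inp.tolist(), tgt.tolist(), loss.item())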

@saahiluppal (Owner) commented Feb 6, 2022 via email
