
How to slice <eos> token with different sentence length #23

rshaojimmy opened this issue Feb 1, 2022 · 7 comments

@rshaojimmy commented Feb 1, 2022

Since I want the model to predict the <eos> token but not have it as an input, I simply slice the <eos> token off the end of the sequence. Thus:

trg = [sos, x_1, x_2, x_3, eos]
trg[:-1] = [sos, x_1, x_2, x_3]

This is also the same as your implementation.

But in practice many datasets contain sentences of different lengths, and thus the last elements of a sentence are <pad> tokens, such as:

trg = [sos, x_1, x_2, x_3, eos, pad, pad, pad]
trg[:-1] = [sos, x_1, x_2, x_3, eos, pad, pad]

In such a case, I can't slice off the <eos> token this way. May I ask how I can solve this issue?
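For illustration, a minimal Python snippet (my own example, not code from this repo) of why the plain trg[:-1] slice stops removing <eos> once the sequence is padded:

sos, eos, pad = '<sos>', '<eos>', '<pad>'

trg = [sos, 'x_1', 'x_2', 'x_3', eos]
print(trg[:-1])          # ['<sos>', 'x_1', 'x_2', 'x_3'] -> <eos> removed as intended

trg_padded = [sos, 'x_1', 'x_2', 'x_3', eos, pad, pad, pad]
print(trg_padded[:-1])   # <eos> is still present; only the last <pad> is dropped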

@saahiluppal (Owner)

while <pad> in array:
    remove <pad> from array

remove <eos> from array
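
A minimal Python sketch of the suggestion above, assuming the caption is a plain list of token strings (array, PAD and EOS are illustrative names, not identifiers from this repo):

PAD, EOS = '<pad>', '<eos>'
array = ['<sos>', 'x_1', 'x_2', 'x_3', EOS, PAD, PAD, PAD]

# strip all the padding first...
while PAD in array:
    array.remove(PAD)

# ...then drop <eos> so it is not fed back into the model as an input
array.remove(EOS)

print(array)             # ['<sos>', 'x_1', 'x_2', 'x_3']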

@rshaojimmy (Author) commented Feb 2, 2022

Thanks for your quick reply!

But if I remove <eos> from the array, how can the model learn to stop generating the sentence if it never encounters the <eos> token?

@saahiluppal (Owner) commented Feb 2, 2022 via email

@rshaojimmy (Author) commented Feb 3, 2022

But the target we compute the loss against (trg[1:]) should still contain the eos token, right?
Like this:
trg[1:] = [x_1, x_2, x_3, eos, pad, pad, pad]
or
trg[1:] = [x_1, x_2, x_3, eos]
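
For reference, a small PyTorch sketch (my own illustration; it assumes the criterion is a CrossEntropyLoss with ignore_index set to the pad id, which may not match the repo's exact setup) showing that trailing pads in the target contribute nothing to the loss, so the two forms above give the same value:

import torch
import torch.nn as nn

PAD_ID = 0                                    # assumed pad token id, for illustration only
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)

vocab = 10
logits = torch.randn(1, 6, vocab)             # stand-in model outputs for 6 positions

tgt_padded = torch.tensor([[4, 5, 6, 2, PAD_ID, PAD_ID]])   # [x_1, x_2, x_3, eos, pad, pad]
tgt_short = tgt_padded[:, :4]                                # [x_1, x_2, x_3, eos]

loss_padded = criterion(logits.permute(0, 2, 1), tgt_padded)
loss_short = criterion(logits[:, :4].permute(0, 2, 1), tgt_short)
print(torch.allclose(loss_padded, loss_short))               # True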

@saahiluppal (Owner) commented Feb 3, 2022 via email

@rshaojimmy (Author) commented Feb 3, 2022

Thanks.

To sum up, I just want to create a dataset with sequences of different lengths. In this dataset, I insert sos and eos tokens at the beginning and end of each sequence as the ground truth, like this:

caps = [sos, x_1, x_2, x_3, eos]

In such a case,

caps[:, :-1] = [sos, x_1, x_2, x_3]
caps[:, 1:] = [x_1, x_2, x_3, eos]

This is what we want for the loss calculation.

outputs = model(samples, caps[:, :-1], cap_masks[:, :-1])
loss = criterion(outputs.permute(0, 2, 1), caps[:, 1:])

However, given the different lengths, I have to further insert pad tokens to make all sequences the same length, such as:

caps = [sos, x_1, x_2, x_3, eos, pad, pad, pad]

In such a case,

caps[:, :-1] = [sos, x_1, x_2, x_3, eos, pad, pad]
caps[:, 1:] = [x_1, x_2, x_3, eos, pad, pad, pad]

The input to the model (caps[:, :-1]) will then contain the eos token, which we want to remove.

Considering this, I just replace the eos token in the input with a pad token, since pad tokens are not counted in the loss, like this:

caps[:, :-1] = [sos, x_1, x_2, x_3, pad, pad, pad]

And I keep caps[:, 1:] as

caps[:, 1:] = [x_1, x_2, x_3, eos, pad, pad, pad].

May I ask whether this makes sense?
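
If it helps, a small end-to-end sketch of the scheme above (the token ids, the fake logits and the criterion are my own illustrative assumptions, not the repo's exact code): the eos in the decoder input is replaced by pad, the shifted target keeps eos, and pad positions are excluded from the loss via ignore_index.

import torch
import torch.nn as nn

PAD, SOS, EOS = 0, 1, 2                        # illustrative token ids

# one padded ground-truth caption: [sos, x_1, x_2, x_3, eos, pad, pad, pad]
caps = torch.tensor([[SOS, 5, 6, 7, EOS, PAD, PAD, PAD]])

# decoder input: drop the last position, then overwrite eos with pad
# so the model never receives eos as an input token
inp = caps[:, :-1].clone()
inp[inp == EOS] = PAD                          # [sos, x_1, x_2, x_3, pad, pad, pad]

# target: shift left by one; eos stays so the model learns to emit it
tgt = caps[:, 1:]                              # [x_1, x_2, x_3, eos, pad, pad, pad]

# pad positions are excluded from the loss
criterion = nn.CrossEntropyLoss(ignore_index=PAD)

vocab = 10
logits = torch.randn(1, inp.size(1), vocab)    # stand-in for model(samples, inp, cap_mask)
loss = criterion(logits.permute(0, 2, 1), tgt)
print(inp.tolist(), tgt.tolist(), loss.item())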

@saahiluppal (Owner) commented Feb 6, 2022 via email
