
Why only last token used in datasets for training? #37

Closed
VityaVitalich opened this issue Mar 4, 2024 · 1 comment

Comments

@VityaVitalich

Dear maintainers,

Thank you for the awesome paper and open-source project. I recently ran into a detail of the dataset pre-processing that I cannot properly understand.

In the dataset processing you ignore all labels except the last one (via `tar[:, :-1] = -100`). This seems a bit odd: transformers already perform causal masking inside `forward`, and when training a language model we want to propagate the loss through all of the tokens, not just the last one.
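For concreteness, here is a minimal sketch of what that masking does (the shapes and the `logits` tensor are illustrative, not taken from the repository). Since PyTorch's cross-entropy treats -100 as its default `ignore_index`, only the final position would contribute to the loss:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only, not from the repository.
batch, seq_len, vocab = 2, 8, 50257
logits = torch.randn(batch, seq_len, vocab)
tar = torch.randint(0, vocab, (batch, seq_len))

tar[:, :-1] = -100  # the line in question: keep only the last label

# Positions labelled -100 are skipped by cross-entropy (its default
# ignore_index), so this loss comes from the final token of each row alone.
loss = F.cross_entropy(logits.view(-1, vocab), tar.view(-1), ignore_index=-100)
```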

Could you please explain this detail?

@Godofnothing (Collaborator) commented Mar 4, 2024

@VityaVitalich actually this is an artifact of the GPTQ code: `tar` is never used in our code (nor in the original repository).
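In other words, the calibration pass only needs forward activations. A minimal sketch of the usual GPTQ-style pattern (the function and names here are hypothetical, not the actual repository code):

```python
import torch

# Hypothetical sketch of a GPTQ-style calibration pass: the calibration
# dataloader yields (inp, tar) pairs, but collecting activation statistics
# only requires forward passes, so `tar` is carried along and discarded.
@torch.no_grad()
def calibrate(model, dataloader, device="cpu"):
    model.eval()
    for inp, tar in dataloader:  # `tar` is never read past this point
        model(inp.to(device))    # forward only; no loss, no backward
```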
