
Why only last token used in datasets for training? #37

Closed
VityaVitalich opened this issue Mar 4, 2024 · 1 comment

Comments

@VityaVitalich

Dear maintainers,

Thank you for the awesome paper and open-source project. I recently ran into a detail of the dataset pre-processing that I cannot properly understand.

In the dataset processing you ignore all labels except the last one (via `tar[:, :-1] = -100`). This seems a bit odd: transformers already perform causal masking inside `forward`, and when training a language model we want to propagate the loss through all of the tokens, not just the last one.
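For concreteness, here is a minimal sketch of what that masking does (the shapes and the `logits` tensor are illustrative, not taken from the repository). Since PyTorch's cross-entropy treats -100 as its default `ignore_index`, only the final position would contribute to the loss:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only, not from the repository.
batch, seq_len, vocab = 2, 8, 50257
logits = torch.randn(batch, seq_len, vocab)
tar = torch.randint(0, vocab, (batch, seq_len))

tar[:, :-1] = -100  # the line in question: keep only the last label

# Positions labelled -100 are skipped by cross-entropy (its default
# ignore_index), so this loss comes from the final token of each row alone.
loss = F.cross_entropy(logits.view(-1, vocab), tar.view(-1), ignore_index=-100)
```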

Could you please explain this detail?

@Godofnothing (Collaborator) commented Mar 4, 2024

@VityaVitalich actually this is an artifact of the GPTQ code: `tar` is never used in our code (nor in the original repository).
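In other words, the calibration pass only needs forward activations. A minimal sketch of the usual GPTQ-style pattern (the function and names here are hypothetical, not the actual repository code):

```python
import torch

# Hypothetical sketch of a GPTQ-style calibration pass: the calibration
# dataloader yields (inp, tar) pairs, but collecting activation statistics
# only requires forward passes, so `tar` is carried along and discarded.
@torch.no_grad()
def calibrate(model, dataloader, device="cpu"):
    model.eval()
    for inp, tar in dataloader:  # `tar` is never read past this point
        model(inp.to(device))    # forward only; no loss, no backward
```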
