[tokenization] preprocessing inputs and labels #51

Closed
ArashAhmadian opened this issue Jun 25, 2023 · 2 comments

ArashAhmadian commented Jun 25, 2023

Hello,

Firstly, thanks for open-sourcing all components of alpaca_farm!

I'm looking into data_preprocessor.py and am wondering where (or if) the labels are set to the input_ids shifted by one, something like inputs = input_ids[..., :-1] and labels = input_ids[..., 1:], i.e. classic next-token prediction.

However, it seems like they're set to input_ids without any shifting? I'm not sure what I'm missing but any clarification would be great :)

@ArashAhmadian (Author)

NVM: for baseline SFT, since the default Hugging Face Trainer is used, the shifting happens inside the model's default forward implementation :))

@lxuechen (Collaborator)

Yeah, you’re right. Shifting is handled internally in the trainer.
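
For anyone landing here later, the point above can be sketched in a few lines. This is a minimal pure-Python illustration, not the alpaca_farm or Hugging Face source: the helper names (log_softmax, causal_lm_loss) and the toy logits are made up for the example. It assumes the usual causal LM convention, where labels are passed unshifted, the prediction at position t is scored against the token at position t + 1, and labels equal to -100 are skipped.

```python
import math

IGNORE_INDEX = -100  # label value conventionally skipped by the loss


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def causal_lm_loss(logits, labels):
    """Mean next-token cross-entropy for a single sequence (hypothetical helper).

    logits: per-position scores, shape [seq_len][vocab_size]
    labels: token ids, shape [seq_len], passed in UNSHIFTED
    """
    # The shift happens here, inside the loss, not in the data preprocessing:
    # drop the last prediction and the first label so position t meets token t + 1.
    shift_logits = logits[:-1]
    shift_labels = labels[1:]

    total, count = 0.0, 0
    for row, target in zip(shift_logits, shift_labels):
        if target == IGNORE_INDEX:
            continue  # e.g. prompt tokens masked out of the loss
        total -= log_softmax(row)[target]
        count += 1
    # Simplification: return 0.0 when every label is ignored.
    return total / count if count else 0.0


input_ids = [2, 0, 1]
labels = list(input_ids)  # labels are just a copy of input_ids, no manual shift

# Toy logits that put almost all mass on the NEXT token at each position.
# The last row is never scored because no token follows it.
logits = [
    [10.0, 0.0, 0.0],  # position 0 predicts token 0
    [0.0, 10.0, 0.0],  # position 1 predicts token 1
    [0.0, 0.0, 10.0],  # position 2: dropped by the shift
]

loss = causal_lm_loss(logits, labels)
print(f"loss with unshifted labels: {loss:.6f}")
```

The loss comes out close to zero even though labels equals input_ids, because the alignment is done inside the loss. Shifting the labels in the preprocessor as well would apply the offset twice.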
