[tokenization] preprocessing inputs and labels #51

ArashAhmadian · 2023-06-25T22:03:44Z

Hello,

Firstly, thanks for open-sourcing all components of alpaca_farm!

I'm looking into data_preprocessor.py and am wondering where (or whether) the labels are set to the input_ids shifted by one position, i.e. something like inputs = input_ids[..., :-1] and labels = input_ids[..., 1:] for classic next-token prediction.

However, it seems like they're set to input_ids without any shifting? I'm not sure what I'm missing but any clarification would be great :)

ArashAhmadian · 2023-06-25T22:11:31Z

Never mind. For baseline SFT, a default Hugging Face trainer is used, so the shifting happens inside the model's default forward implementation when it computes the loss :))

lxuechen · 2023-06-25T23:23:05Z

Yeah, you’re right. Shifting is handled internally, in the model's forward pass that the trainer calls, so the preprocessor can leave labels aligned with input_ids.
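
For anyone landing here later, the following is a minimal pure-Python sketch of the shift described above. It is not the alpaca_farm or transformers code: `causal_lm_loss` and the toy token ids are hypothetical names chosen for illustration, and the only behavior assumed from Hugging Face is that label positions set to -100 are skipped by the loss. It shows why labels can be passed unshifted: the loss scores the logits at position t against the label at position t + 1.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss (Hugging Face convention)


def causal_lm_loss(logits, labels):
    """Mean cross-entropy where logits[t] is scored against labels[t + 1].

    logits: one row of vocabulary scores per position.
    labels: token ids aligned with the inputs (NOT shifted by the caller).
    """
    shift_logits = logits[:-1]  # the last position has no next token to predict
    shift_labels = labels[1:]   # the first token is never a prediction target
    losses = []
    for row, target in zip(shift_logits, shift_labels):
        if target == IGNORE_INDEX:
            continue  # e.g. prompt tokens masked out during preprocessing
        log_z = math.log(sum(math.exp(x) for x in row))
        losses.append(log_z - row[target])
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: labels are a plain copy of input_ids, with no shifting.
input_ids = [5, 2, 7, 3]
labels = list(input_ids)
vocab_size = 8

# Logits that put almost all mass on the *next* token at each position.
peaked_logits = [[0.0] * vocab_size for _ in input_ids]
for t in range(len(input_ids) - 1):
    peaked_logits[t][input_ids[t + 1]] = 50.0

print(causal_lm_loss(peaked_logits, labels))
```

If the labels had been shifted during preprocessing as well, the internal shift would be applied on top of it and each position would be trained to predict the token two steps ahead, which is why the preprocessor leaves them aligned.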

ArashAhmadian closed this as completed Jun 25, 2023
