Add other flagship instruction dataset builders #541
Conversation
wow, I don't have much to comment on except for some minor fixes, this looks super clean. Thanks again for adding two additional datasets for two new tasks!
torchtune/datasets/_samsum.py (Outdated)

def samsum_dataset(
    tokenizer: Tokenizer,
    train_on_input: bool = True,
I think by default we want to leave train_on_input as False, since most datasets I've seen mask out the prompt when computing loss (except alpaca).
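The masking behavior discussed here can be sketched generically. This is an illustrative example of the convention, not torchtune's actual implementation; `build_labels` and `IGNORE_INDEX` are hypothetical names, though `-100` is the common ignore index used by PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # tokens with this label are typically excluded from the loss


def build_labels(prompt_tokens, response_tokens, train_on_input=False):
    """Construct labels for a tokenized (prompt, response) pair.

    When train_on_input is False, prompt tokens are replaced with IGNORE_INDEX
    so the loss is computed only on the response; when True, the model is also
    trained to predict the prompt tokens themselves.
    """
    if train_on_input:
        return prompt_tokens + response_tokens
    return [IGNORE_INDEX] * len(prompt_tokens) + response_tokens


# With the default train_on_input=False, only the response contributes to loss:
labels = build_labels([1, 2, 3], [4, 5])
print(labels)  # [-100, -100, -100, 4, 5]
```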
torchtune/datasets/_grammar.py (Outdated)

def grammar_dataset(
    tokenizer: Tokenizer,
    train_on_input: bool = True,
same comment as above
Context
Based on the RFC #493 and the alpaca dataset refactor PR #520, added 2 other flagship instruction datasets (grammar, samsum) we'd like to support in torchtune.
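The pattern described here, where each flagship builder binds a task-specific prompt template to a shared instruction-dataset loader, can be sketched generically. All names and signatures below are illustrative assumptions, not torchtune's actual API:

```python
# Hypothetical sketch of the "flagship builder" pattern: a shared loader
# applies a per-task template, and each builder just binds its template.


def _instruction_dataset(samples, template_fn, train_on_input=False):
    """Shared loader: applies a task template to each raw sample."""
    return [
        {
            "prompt": template_fn(s),
            "response": s["output"],
            "train_on_input": train_on_input,
        }
        for s in samples
    ]


def grammar_dataset(samples, train_on_input=False):
    """Flagship builder for grammar correction; prompt masked from loss by default."""

    def template(s):
        return f"Correct this to standard English: {s['input']}\nCorrected: "

    return _instruction_dataset(samples, template, train_on_input)
```

A usage sketch: `grammar_dataset([{"input": "he go home", "output": "He goes home."}])` yields one formatted sample whose prompt embeds the raw input and whose `train_on_input` flag defaults to False, matching the review feedback above.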
Changelog
Test plan
Added unit tests
E2E test that kicks off finetune training with the newly added datasets
Discussion item