
Add other flagship instruction dataset builders #541

Merged
merged 3 commits into main from add_other_instruction_dataset_builders on Mar 22, 2024

Conversation

@SLR722 (Contributor) commented Mar 21, 2024

Context

Based on RFC #493 and the alpaca dataset refactor PR #520, this adds two more flagship instruction dataset types (grammar, samsum) that we'd like to support in torchtune.

Changelog

  • add grammar dataset builder and related unit test
  • add samsum dataset builder and related unit test
  • harden the InstructionDataset class against a partially completed column_map (see the sketch after this list)
  • add more detailed docstrings for the column_map setup
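
To make the column_map idea concrete, here is a minimal, self-contained sketch; the column names and the resolve_column helper are illustrative assumptions, not torchtune's actual implementation:

    # Hypothetical sketch of the column_map idea, not torchtune's actual code.
    # column_map tells the instruct dataset how the prompt template's placeholder
    # names map onto the column names of the underlying Hugging Face dataset.
    column_map = {"input": "dialogue"}  # partially completed: no entry for "output"

    def resolve_column(name: str) -> str:
        # A hardened builder should fall back to the template's own name (or raise
        # a clear error) when a key is missing from a partially completed
        # column_map, rather than failing later with an opaque KeyError.
        return column_map.get(name, name)

    sample = {"dialogue": "A: hi\nB: hello", "output": "A greeting."}
    prompt_text = sample[resolve_column("input")]     # -> sample["dialogue"]
    response_text = sample[resolve_column("output")]  # -> sample["output"]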

Test plan

Added unit tests:

  • pytest tests/torchtune/datasets/test_grammar_dataset.py
  • pytest tests/torchtune/datasets/test_samsum_dataset.py

E2E tests that kick off finetune training with the newly added datasets:

  • grammar dataset: [screenshot of the finetune training run, 2024-03-20]
  • samsum dataset: [screenshot of the finetune training run, 2024-03-20]

Discussion item

  • The first time the grammar dataset is used to kick off training, it takes a long time (> 1 h) to generate the training split (a pre-caching sketch follows below). [screenshot of the slow split generation]
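
One way to take that slow first run off the critical path is to warm the Hugging Face cache before launching finetuning. The dataset name below is an assumption about the source the grammar builder points at; substitute the builder's actual source if it differs:

    # Warm the Hugging Face cache ahead of time so the one-off split generation
    # does not happen inside the finetuning job. "liweili/c4_200m" is an assumed
    # source for the grammar dataset; replace it with the builder's actual source.
    from datasets import load_dataset

    ds = load_dataset("liweili/c4_200m", split="train")
    print(len(ds))  # later runs reuse the cached, already-generated split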


pytorch-bot bot commented Mar 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/541

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9f033a8 with merge base 34aeb98:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 21, 2024

netlify bot commented Mar 21, 2024

Deploy Preview for torchtune-preview ready!

🔨 Latest commit: 9f033a8
🔍 Latest deploy log: https://app.netlify.com/sites/torchtune-preview/deploys/65fd32d19b6b090008af3637
😎 Deploy Preview: https://deploy-preview-541--torchtune-preview.netlify.app

@RdoubleA (Contributor) left a comment


wow I don't have much to comment except for some minor fixes, this looks super clean. thanks again for adding two additional datasets for two new tasks!


    def samsum_dataset(
        tokenizer: Tokenizer,
        train_on_input: bool = True,

I think by default we want to leave train_on_input as False, as I've seen most datasets mask out the prompt when computing loss (except alpaca).
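
For context, here is a minimal sketch of what train_on_input=False typically means for an instruct dataset: prompt positions are masked out of the loss with an ignore index. The helper and variable names are illustrative, not torchtune's internals:

    # Illustrative only: -100 matches torch.nn.CrossEntropyLoss's default
    # ignore_index, so masked positions contribute nothing to the loss.
    CROSS_ENTROPY_IGNORE_IDX = -100

    def build_labels(prompt_tokens, response_tokens, train_on_input):
        tokens = prompt_tokens + response_tokens
        if train_on_input:
            labels = list(tokens)  # learn on both prompt and response
        else:
            labels = [CROSS_ENTROPY_IGNORE_IDX] * len(prompt_tokens) + list(response_tokens)
        return tokens, labels

    tokens, labels = build_labels([1, 5, 9], [12, 2], train_on_input=False)
    assert labels == [-100, -100, -100, 12, 2]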


    def grammar_dataset(
        tokenizer: Tokenizer,
        train_on_input: bool = True,

same comment as above: default train_on_input to False

@SLR722 merged commit dc8f090 into main on Mar 22, 2024
22 checks passed
@joecummings deleted the add_other_instruction_dataset_builders branch on April 11, 2024