
Add other flagship instruction dataset builders #541

Merged
merged 3 commits into main from add_other_instruction_dataset_builders on Mar 22, 2024

Conversation

@SLR722 (Contributor) commented Mar 21, 2024

Context

Based on RFC #493 and the alpaca dataset refactor PR #520, this adds two more flagship instruction dataset types (grammar, samsum) that we'd like to support in torchtune.

Changelog

  • add grammar dataset builder and related unit test
  • add samsum dataset builder and related unit test
  • harden the InstructionDataset class against a partially completed column_map (see the sketch after this list)
  • add more detailed docstrings for the column_map setup
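
To make the column_map idea concrete, here is a minimal, self-contained sketch; the column names and the resolve_column helper are illustrative assumptions, not torchtune's actual implementation:

    # Hypothetical sketch of the column_map idea, not torchtune's actual code.
    # column_map tells the instruct dataset how the prompt template's placeholder
    # names map onto the column names of the underlying Hugging Face dataset.
    column_map = {"input": "dialogue"}  # partially completed: no entry for "output"

    def resolve_column(name: str) -> str:
        # A hardened builder should fall back to the template's own name (or raise
        # a clear error) when a key is missing from a partially completed
        # column_map, rather than failing later with an opaque KeyError.
        return column_map.get(name, name)

    sample = {"dialogue": "A: hi\nB: hello", "output": "A greeting."}
    prompt_text = sample[resolve_column("input")]     # -> sample["dialogue"]
    response_text = sample[resolve_column("output")]  # -> sample["output"]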

Test plan

Added unit tests:

  • pytest tests/torchtune/datasets/test_grammar_dataset.py
  • pytest tests/torchtune/datasets/test_samsum_dataset.py

E2E tests that kick off finetune training with the newly added datasets:

  • grammar dataset: [screenshot of the finetune training run, 2024-03-20]
  • samsum dataset: [screenshot of the finetune training run, 2024-03-20]

Discussion item

  • The first time the grammar dataset is used to kick off training, it takes a long time (> 1 h) to generate the training split (a pre-caching sketch follows below). [screenshot of the slow split generation]
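
One way to take that slow first run off the critical path is to warm the Hugging Face cache before launching finetuning. The dataset name below is an assumption about the source the grammar builder points at; substitute the builder's actual source if it differs:

    # Warm the Hugging Face cache ahead of time so the one-off split generation
    # does not happen inside the finetuning job. "liweili/c4_200m" is an assumed
    # source for the grammar dataset; replace it with the builder's actual source.
    from datasets import load_dataset

    ds = load_dataset("liweili/c4_200m", split="train")
    print(len(ds))  # later runs reuse the cached, already-generated split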


pytorch-bot bot commented Mar 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/541

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9f033a8 with merge base 34aeb98:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 21, 2024

netlify bot commented Mar 21, 2024

Deploy Preview for torchtune-preview ready!

🔨 Latest commit: 9f033a8
🔍 Latest deploy log: https://app.netlify.com/sites/torchtune-preview/deploys/65fd32d19b6b090008af3637
😎 Deploy Preview: https://deploy-preview-541--torchtune-preview.netlify.app

@RdoubleA (Contributor) left a comment


wow I don't have much to comment except for some minor fixes, this looks super clean. thanks again for adding two additional datasets for two new tasks!


    def samsum_dataset(
        tokenizer: Tokenizer,
        train_on_input: bool = True,

I think by default we want to leave train_on_input as False, as I've seen most datasets mask out the prompt when computing loss (except alpaca).
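
For context, here is a minimal sketch of what train_on_input=False typically means for an instruct dataset: prompt positions are masked out of the loss with an ignore index. The helper and variable names are illustrative, not torchtune's internals:

    # Illustrative only: -100 matches torch.nn.CrossEntropyLoss's default
    # ignore_index, so masked positions contribute nothing to the loss.
    CROSS_ENTROPY_IGNORE_IDX = -100

    def build_labels(prompt_tokens, response_tokens, train_on_input):
        tokens = prompt_tokens + response_tokens
        if train_on_input:
            labels = list(tokens)  # learn on both prompt and response
        else:
            labels = [CROSS_ENTROPY_IGNORE_IDX] * len(prompt_tokens) + list(response_tokens)
        return tokens, labels

    tokens, labels = build_labels([1, 5, 9], [12, 2], train_on_input=False)
    assert labels == [-100, -100, -100, 12, 2]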


    def grammar_dataset(
        tokenizer: Tokenizer,
        train_on_input: bool = True,

same comment as above: default train_on_input to False

@SLR722 merged commit dc8f090 into main on Mar 22, 2024
22 checks passed
@joecummings deleted the add_other_instruction_dataset_builders branch on April 11, 2024