
Adding SlimOrca Dataset to the datasets collection #116

Merged — 20 commits merged into main on Jan 23, 2024
Conversation

@gokulavasan (Contributor) commented Dec 20, 2023

Changelog

  • Added the slimorca-dedup dataset: https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup/
  • Added logic to truncate the input token list to max_token_length, since the longest sample exceeds the 4K max sequence length of the llama2 default in torchtune (more details on truncation in Add truncate option in Tokenizer #213; the comment on that diff was to move the truncate logic into the slimorca dataset). See the sketch after this list.
  • Set the default max token length to 1K. The llama2 default in torchtune accepts a 4K token list, but I noticed OOM within 8 iterations with seed=10, so the 1K default lets training progress through more batches.
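
For illustration, here is a minimal sketch of the truncation idea, assuming the tokenizer appends an EOS id that should survive the cut; the helper name and signature are hypothetical, not the PR's actual API:

```python
from typing import List

def truncate_tokens(tokens: List[int], max_token_length: int, eos_id: int) -> List[int]:
    # Hypothetical helper: keep at most max_token_length tokens and,
    # if the sample had to be cut, keep the EOS id as the final token.
    if len(tokens) <= max_token_length:
        return tokens
    return tokens[: max_token_length - 1] + [eos_id]
```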

Test plan

  • Ran the finetune_llm code with the slimorca dataset option and it ran a few steps with seed=10:
    1|61|Loss: 1.058661937713623: 0%| | 60/181746 [02:35<146:53:52, 2.91s/it]

netlify bot commented Dec 20, 2023

Deploy Preview for torchtune-preview ready!

🔨 Latest commit: ee79dcf
🔍 Latest deploy log: https://app.netlify.com/sites/torchtune-preview/deploys/65af9af83676560008214c3b
😎 Deploy Preview: https://deploy-preview-116--torchtune-preview.netlify.app

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Dec 20, 2023
def transform(
    self, prompt: str, instruction: str, response: str
) -> Tuple[List[int], List[int]]:
    # Add instruction and response tags to construct the input string
@gokulavasan (Contributor, Author) commented on this hunk, Dec 20, 2023:

Looking to get feedback about these tags. I just followed the Alpaca dataset approach but I am unsure if that is correct. Cc @rohan-varma @joecummings

A Contributor replied:

Not sure if there's a canonical reference implementation, but here's one that may be worth checking out: https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/finetuning/data_utils.py#L285
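
For context, the Alpaca-style tagging referenced above looks roughly like this; the template strings follow the original Alpaca prompt, but the helper itself is a made-up illustration, not torchtune code:

```python
def alpaca_style_prompt(instruction: str, user_input: str) -> str:
    # Original Alpaca template (the "with input" variant).
    return (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{user_input}\n\n"
        "### Response:\n"
    )
```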

[Resolved review thread on torchtune/datasets/slimorca.py]

@rohan-varma (Member) left a comment:

LG, will stamp once there's logs of training w/loss decreasing. Also, general q, how do we want to think about unittesting for this dataset and datasets in general?

[Review thread on torchtune/datasets/slimorca.py — outdated, resolved]
instructions_and_inputs = self._tokenizer.encode(
    prompt + instruction_tag + instruction + response_tag
)
labels = self._tokenizer.encode(response)
return instructions_and_inputs, labels
A Member commented on this hunk:

LG, although I'm curious what other libraries such as HF / lit-gpt do here. Might be useful to check lit-gpt and see which (label, target) pairs they return.

@gokulavasan (Contributor, Author) commented Dec 30, 2023

@rohan-varma @joecummings @ebsmothers @pbontrager

Would like your opinion on this:

So far, from reading a bit about HF chat templates (https://huggingface.co/docs/transformers/main/en/chat_templating), it looks like the template for formatting the training sample is usually tied to how a model is pre-trained. Llama recommends a particular format - https://fburl.com/83srjcjj (though there are other formats that have also worked well - https://www.pinecone.io/learn/llama-2/).

So in this case, I plan to change the formatting to the one prescribed by llama. I might have to copy/paste sections of code from the llama generation.py linked above to massage the SlimOrca data into the format llama suggests. But this also means that we need to switch the template if we use a different pre-trained model.

Basically my question is, in cases like Alpaca dataset, is the instruction, prompt, response formatting selected by the dataset or the pre-trained model? That is, will the current alpaca dataset checked-in work out of the box for a non-llama pre-trained model?

@gokulavasan (Contributor, Author) commented:

Minor side note, @joecummings: I noticed that in lit-gpt's prepare-alpaca, the label contains the full prompt+instruction+input+response (https://fburl.com/eko93dyj) instead of just the response. Is there any reason to go one way vs. another? (Also, for the lit-gpt approach, is the model expected to output the entire prompt+instruction+input+response?)
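
To make the two schemes under discussion concrete, here is a small illustration; the token ids are dummies standing in for tokenizer output, and -100 is the default ignore_index of torch.nn.CrossEntropyLoss:

```python
IGNORE_IDX = -100  # positions with this label are skipped by the loss

prompt_tokens = [1, 15, 27, 42]  # stand-in for encoded prompt + instruction + input
response_tokens = [88, 91, 2]    # stand-in for encoded response (+ EOS)

input_ids = prompt_tokens + response_tokens
# lit-gpt style: supervise the full sequence.
labels_full = list(input_ids)
# response-only style: mask the prompt so loss is computed only on the response.
labels_masked = [IGNORE_IDX] * len(prompt_tokens) + response_tokens
```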

@joecummings (Contributor) replied:

> [quoting @gokulavasan's question above about whether prompt formatting comes from the dataset or the pre-trained model]

Great catch! The instruction/prompt/response formatting is selected by the dataset; HOWEVER, the chat template is selected by the pre-trained model. This includes things like [INST] tags and EOS tags. The current checked-in alpaca dataset will "work" for something like Mistral, but is not optimal. We should address this issue.

One possible solution is to add an API to our Tokenizer class called apply_chat_template, like HuggingFace's. Then, in examples, we can show how a user could utilize a HuggingFace tokenizer in our workflow.
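
To sketch what that could look like for the Llama 2 format discussed above (the [INST] and <<SYS>> tags follow the published Llama 2 chat convention; the helper name is hypothetical):

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def format_llama2_turn(system_prompt: str, user_message: str) -> str:
    # Wrap an optional system prompt and the user message in Llama 2 chat
    # tags; the assistant's response is appended after E_INST at train time.
    sys_block = f"{B_SYS}{system_prompt}{E_SYS}" if system_prompt else ""
    return f"{B_INST} {sys_block}{user_message} {E_INST} "
```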

@gokulavasan (Contributor, Author) commented Jan 12, 2024

@rohan-varma Please take a look when you get a chance. Also, do recommend whether I should add a new recipe for slimorca. Here are the loss values over a bunch of iterations - P1011481434.

One thing I want to call out is that training doesn't complete; it fails midway, either due to GPU OOM or because the sequence length of an input exceeds the max seq length. There are a few options in this case:

i) Pre-tokenize the dataset in the init method and drop samples beyond max_seq_length (this will lead to a long time to first batch)
ii) Truncate any sample beyond the max seq length (a reasonable choice if such samples are uncommon)
iii) Replace the sample with another sample (we have to be careful to ensure this is reproducible; the functionality needs to be implemented)

We can start with (ii) and implement (iii) later. Or, if we want to add the pre-prep option as we discussed, we can use this as motivation to add option (i); a sketch of (i) appears below.
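
A sketch of option (i), assuming a list of raw samples with a "text" field and an encode function (both assumptions, not the PR's actual schema):

```python
from typing import Any, Callable, Dict, List

def filter_long_samples(
    data: List[Dict[str, Any]],
    encode: Callable[[str], List[int]],
    max_seq_len: int,
) -> List[Dict[str, Any]]:
    # Option (i): tokenize every sample once up front and drop any whose
    # token count exceeds max_seq_len. Costs one full pass at init time,
    # which is why time-to-first-batch grows.
    return [s for s in data if len(encode(s["text"])) <= max_seq_len]
```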

@rohan-varma (Member) replied:

@gokulavasan

> Also do recommend if I should add a new recipe for slimorca

Let's discuss this w/ @pbontrager, but for now I think we're mostly unblocked on validating correctness within the same recipe. Ideally we should eventually have a separate recipe, yes.

Either (i) or (ii) sounds good to me re: exceeding sequence length. Pretokenization is interesting; I wonder if it would prohibitively increase TTFB, and/or whether pretokenized results could be cached on disk to help with that.

@rohan-varma (Member) left a comment:

Code LG, but let's also add unit tests (i.e., that the output data format is as expected) and docs/doc rendering? Thanks!

[Review threads on torchtune/datasets/slimorca.py — resolved; two outdated]
@gokulavasan (Contributor, Author) commented Jan 14, 2024

720 samples out of 363,000 have input length longer than 4096 (~0.2%).

@gokulavasan (Contributor, Author) commented:

@rohan-varma @kartikayk Can you take another look? I moved the truncation logic into this PR as suggested in #213 (comment), added unit tests, and added loss progress to the description.

@rohan-varma (Member) left a comment:

Awesome! I have some questions/comments. Also, in general, I notice some APIs like prompt_with_system are implemented but not used or tested, AFAICT. Please ensure all APIs we're exposing are unit-tested.

Docstrings and doc rendering will need to be part of this PR as well.

[Ten review threads on tests/torchtune/datasets/test_slimorca_dataset.py and torchtune/datasets/slimorca.py — all resolved; several outdated]
@gokulavasan (Contributor, Author) commented:

Thanks for the review @rohan-varma! Addressed all comments, please take a look again.

[Review threads on tests/torchtune/datasets/test_slimorca_dataset.py and torchtune/datasets/slimorca.py — resolved]
from torchtune.modules import Tokenizer


class Llama2ChatFormatConstants:
A Member commented on this hunk:

Is this meant to be public? If it is, it should be documented.

[Review threads on tests/torchtune/datasets/test_slimorca_dataset.py, torchtune/datasets/alpaca.py, and torchtune/datasets/slimorca.py — resolved; outdated]
@rohan-varma (Member) left a comment:

LG overall. Will stamp once we have the relevant documentation, thanks!

@gokulavasan (Contributor, Author) commented:

@rohan-varma I have ensured that the torchtune.datasets API reference docs (generated from docstrings) render properly as part of this PR (similar to torchtune.models/modules). Are you referring to different documentation? If yes, is there an example of docs you can share that I can follow for this PR?

@gokulavasan (Contributor, Author) commented Jan 23, 2024

Attaching doc screenshots:

[two screenshots of the rendered torchtune.datasets documentation]

@NicolasHug (Member) commented:

I can confirm the docs look good - thanks Gokul. I'm still wondering about #116 (comment) though; the comment was resolved, but I don't see this class being either private or documented.

@gokulavasan @rohan-varma @kartikayk, on a separate note: the tests seem to take quite a long time to run, about 20 seconds on my laptop (times on CI seem to be in the same ballpark). 20 seconds just for testing a single dataset is quite long, and that time will add up very fast. Comparatively, testing a dataset in torchvision takes ~1s.
There are other unrelated tests with really long execution times, so perhaps this is a discussion to be had more globally, but I think torchtune should be mindful of the duration of the test suite right from the beginning. A long test suite is an expensive one (you'll be pressured to take shortcuts) and an annoying one to run, so you end up waiting too long for CI to be green.

@gokulavasan (Contributor, Author) replied:

> the comment was resolved but I don't see this class being either private or documented

I added a docstring, but let me convert it to a private class, as we don't want to expose/generalize it just yet. Cc @rohan-varma

@pbontrager (Contributor) commented:

This looks really good overall. In the future I would like all datasets to match a protocol design and use the same method names, but we can leave that to future work on dataset generalization. Also, to Nicolas's note on testing: I think we should think hard about how we can test these without downloading the datasets. Rethinking the unit tests can be deferred until we have more than two datasets here, though.
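
One way to avoid the download in tests, sketched under the assumption that the dataset class calls datasets.load_dataset internally (the fixture's schema is also an assumption):

```python
import datasets
from unittest import mock

# Tiny in-memory fixture shaped like a SlimOrca row.
FAKE_ROWS = [{"conversations": [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "Say hi."},
    {"from": "gpt", "value": "Hi!"},
]}]

def test_without_download():
    # Patch the hub call so the test never touches the network; in a real
    # test, the dataset class under test would invoke load_dataset itself.
    with mock.patch.object(datasets, "load_dataset", return_value=FAKE_ROWS):
        rows = datasets.load_dataset("Open-Orca/SlimOrca-Dedup")
        assert rows == FAKE_ROWS
```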

@gokulavasan merged commit e983194 into main on Jan 23, 2024
15 checks passed
@gokulavasan (Contributor, Author) commented:

Thanks for the pointer @pbontrager. For this particular dataset's unit test, I created GitHub issue #238 as a follow-up.
