
Adding SlimOrca Dataset to the datasets collection #116

Merged — 20 commits merged into main on Jan 23, 2024
Conversation

@gokulavasan (Contributor) commented Dec 20, 2023

Changelog

  • Added the slimorca-dedup dataset: https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup/
  • Added logic to truncate the input token list to max_token_length, since the longest sample exceeds the 4K max sequence length of the llama2 default in torchtune (more details on truncation in Add truncate option in Tokenizer #213; the comment on that diff was to move the truncate logic into the slimorca dataset). See the sketch after this list.
  • Set the default max token length to 1K. The llama2 default in torchtune accepts a 4K token list, but I noticed OOM within 8 iterations with seed=10, so the 1K default lets training progress through more batches.
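
For illustration, here is a minimal sketch of the truncation idea, assuming the tokenizer appends an EOS id that should survive the cut; the helper name and signature are hypothetical, not the PR's actual API:

```python
from typing import List

def truncate_tokens(tokens: List[int], max_token_length: int, eos_id: int) -> List[int]:
    # Hypothetical helper: keep at most max_token_length tokens and,
    # if the sample had to be cut, keep the EOS id as the final token.
    if len(tokens) <= max_token_length:
        return tokens
    return tokens[: max_token_length - 1] + [eos_id]
```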

Test plan

  • Ran the finetune_llm code with the slimorca dataset option and it ran a few steps with seed=10:
    1|61|Loss: 1.058661937713623: 0%| | 60/181746 [02:35<146:53:52, 2.91s/it]

netlify bot commented Dec 20, 2023

Deploy Preview for torchtune-preview ready!

🔨 Latest commit: ee79dcf
🔍 Latest deploy log: https://app.netlify.com/sites/torchtune-preview/deploys/65af9af83676560008214c3b
😎 Deploy Preview: https://deploy-preview-116--torchtune-preview.netlify.app

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Dec 20, 2023
def transform(
    self, prompt: str, instruction: str, response: str
) -> Tuple[List[int], List[int]]:
    # Add instruction and response tags to construct the input string
@gokulavasan (Contributor, Author) commented on this hunk, Dec 20, 2023:

Looking to get feedback about these tags. I just followed the Alpaca dataset approach but I am unsure if that is correct. Cc @rohan-varma @joecummings

A Contributor replied:

Not sure if there's a canonical reference implementation, but here's one that may be worth checking out: https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/finetuning/data_utils.py#L285
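
For context, the Alpaca-style tagging referenced above looks roughly like this; the template strings follow the original Alpaca prompt, but the helper itself is a made-up illustration, not torchtune code:

```python
def alpaca_style_prompt(instruction: str, user_input: str) -> str:
    # Original Alpaca template (the "with input" variant).
    return (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{user_input}\n\n"
        "### Response:\n"
    )
```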

[Resolved review thread on torchtune/datasets/slimorca.py]

@rohan-varma (Member) left a comment:

LG, will stamp once there's logs of training w/loss decreasing. Also, general q, how do we want to think about unittesting for this dataset and datasets in general?

[Review thread on torchtune/datasets/slimorca.py — outdated, resolved]
instructions_and_inputs = self._tokenizer.encode(
    prompt + instruction_tag + instruction + response_tag
)
labels = self._tokenizer.encode(response)
return instructions_and_inputs, labels
A Member commented on this hunk:

LG, although I'm curious what other libraries such as HF / lit-gpt do here. Might be useful to check lit-gpt and see which (label, target) pairs they return.

@gokulavasan (Contributor, Author) commented Dec 30, 2023

@rohan-varma @joecummings @ebsmothers @pbontrager

Would like your opinion on this:

So far, from reading a bit about HF chat templates (https://huggingface.co/docs/transformers/main/en/chat_templating), it looks like the template for formatting the training sample is usually tied to how a model is pre-trained. Llama recommends a particular format - https://fburl.com/83srjcjj (though there are other formats that have also worked well - https://www.pinecone.io/learn/llama-2/).

So in this case, I plan to change the formatting to the one prescribed by llama. I might have to copy/paste sections of code from the llama generation.py linked above to massage the SlimOrca data into the format llama suggests. But this also means that we need to switch the template if we use a different pre-trained model.

Basically my question is, in cases like Alpaca dataset, is the instruction, prompt, response formatting selected by the dataset or the pre-trained model? That is, will the current alpaca dataset checked-in work out of the box for a non-llama pre-trained model?

@gokulavasan (Contributor, Author) commented:

Minor side note, @joecummings: I noticed that in lit-gpt's prepare-alpaca, the label contains the full prompt+instruction+input+response (https://fburl.com/eko93dyj) instead of just the response. Is there any reason to go one way vs. another? (Also, for the lit-gpt approach, is the model expected to output the entire prompt+instruction+input+response?)
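
To make the two schemes under discussion concrete, here is a small illustration; the token ids are dummies standing in for tokenizer output, and -100 is the default ignore_index of torch.nn.CrossEntropyLoss:

```python
IGNORE_IDX = -100  # positions with this label are skipped by the loss

prompt_tokens = [1, 15, 27, 42]  # stand-in for encoded prompt + instruction + input
response_tokens = [88, 91, 2]    # stand-in for encoded response (+ EOS)

input_ids = prompt_tokens + response_tokens
# lit-gpt style: supervise the full sequence.
labels_full = list(input_ids)
# response-only style: mask the prompt so loss is computed only on the response.
labels_masked = [IGNORE_IDX] * len(prompt_tokens) + response_tokens
```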

@joecummings (Contributor) replied:

> [quoting @gokulavasan's question above about whether prompt formatting comes from the dataset or the pre-trained model]

Great catch! The instruction/prompt/response formatting is selected by the dataset; HOWEVER, the chat template is selected by the pre-trained model. This includes things like [INST] tags and EOS tags. The current checked-in alpaca dataset will "work" for something like Mistral, but is not optimal. We should address this issue.

One possible solution is to add an API to our Tokenizer class called apply_chat_template, like HuggingFace's. Then, in examples, we can show how a user could utilize a HuggingFace tokenizer in our workflow.
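
To sketch what that could look like for the Llama 2 format discussed above (the [INST] and <<SYS>> tags follow the published Llama 2 chat convention; the helper name is hypothetical):

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def format_llama2_turn(system_prompt: str, user_message: str) -> str:
    # Wrap an optional system prompt and the user message in Llama 2 chat
    # tags; the assistant's response is appended after E_INST at train time.
    sys_block = f"{B_SYS}{system_prompt}{E_SYS}" if system_prompt else ""
    return f"{B_INST} {sys_block}{user_message} {E_INST} "
```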

@gokulavasan (Contributor, Author) commented Jan 12, 2024

@rohan-varma Please take a look when you get a chance. Also, do recommend whether I should add a new recipe for slimorca. Here are the loss values over a bunch of iterations - P1011481434.

One thing I want to call out is that training doesn't complete; it fails midway, either due to GPU OOM or because the sequence length of an input exceeds the max seq length. There are a few options in this case:

i) Pre-tokenize the dataset in the init method and drop samples beyond max_seq_length (this will lead to a long time to first batch)
ii) Truncate any sample beyond the max seq length (a reasonable choice if such samples are uncommon)
iii) Replace the sample with another sample (we have to be careful to ensure this is reproducible; the functionality needs to be implemented)

We can start with (ii) and implement (iii) later. Or, if we want to add the pre-prep option as we discussed, we can use this as motivation to add option (i); a sketch of (i) appears below.
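
A sketch of option (i), assuming a list of raw samples with a "text" field and an encode function (both assumptions, not the PR's actual schema):

```python
from typing import Any, Callable, Dict, List

def filter_long_samples(
    data: List[Dict[str, Any]],
    encode: Callable[[str], List[int]],
    max_seq_len: int,
) -> List[Dict[str, Any]]:
    # Option (i): tokenize every sample once up front and drop any whose
    # token count exceeds max_seq_len. Costs one full pass at init time,
    # which is why time-to-first-batch grows.
    return [s for s in data if len(encode(s["text"])) <= max_seq_len]
```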

@rohan-varma (Member) replied:

@gokulavasan

> Also do recommend if I should add a new recipe for slimorca

Let's discuss this w/ @pbontrager, but for now I think we're mostly unblocked on validating correctness within the same recipe. Ideally we should eventually have a separate recipe, yes.

Either (i) or (ii) sounds good to me re: exceeding sequence length. Pretokenization is interesting; I wonder if it would prohibitively increase TTFB, and/or whether pretokenized results could be cached on disk to help with that.

@rohan-varma (Member) left a comment:

Code LG, but let's also add unit tests (i.e., that the output data format is as expected) and docs/doc rendering? Thanks!

[Review threads on torchtune/datasets/slimorca.py — resolved; two outdated]
@gokulavasan (Contributor, Author) commented Jan 14, 2024

720 samples out of 363,000 have input length longer than 4096 (~0.2%).

@gokulavasan (Contributor, Author) commented:

@rohan-varma @kartikayk Can you take another look? I moved the truncation logic into this PR as suggested in #213 (comment), added unit tests, and added loss progress to the description.

@rohan-varma (Member) left a comment:

Awesome! I have some questions/comments. Also, in general, I notice some APIs like prompt_with_system are implemented but not used or tested, AFAICT. Please ensure all APIs we're exposing are unit-tested.

Docstrings and doc rendering will need to be part of this PR as well.

[Ten review threads on tests/torchtune/datasets/test_slimorca_dataset.py and torchtune/datasets/slimorca.py — all resolved; several outdated]
@gokulavasan (Contributor, Author) commented:

Thanks for the review @rohan-varma! Addressed all comments, please take a look again.

[Review threads on tests/torchtune/datasets/test_slimorca_dataset.py and torchtune/datasets/slimorca.py — resolved]
from torchtune.modules import Tokenizer


class Llama2ChatFormatConstants:
A Member commented on this hunk:

Is this meant to be public? If it is, it should be documented.

[Review threads on tests/torchtune/datasets/test_slimorca_dataset.py, torchtune/datasets/alpaca.py, and torchtune/datasets/slimorca.py — resolved; outdated]
@rohan-varma (Member) left a comment:

LG overall. Will stamp once we have the relevant documentation, thanks!

@gokulavasan (Contributor, Author) commented:

@rohan-varma I have ensured that the torchtune.datasets API reference docs (generated from docstrings) render properly as part of this PR (similar to torchtune.models/modules). Are you referring to different documentation? If yes, is there an example of docs you can share that I can follow for this PR?

@gokulavasan (Contributor, Author) commented Jan 23, 2024

Attaching doc screenshots:

[two screenshots of the rendered torchtune.datasets documentation]

@NicolasHug (Member) commented:

I can confirm the docs look good - thanks Gokul. I'm still wondering about #116 (comment) though; the comment was resolved, but I don't see this class being either private or documented.

@gokulavasan @rohan-varma @kartikayk, on a separate note: the tests seem to take quite a long time to run, about 20 seconds on my laptop (times on CI seem to be in the same ballpark). 20 seconds just for testing a single dataset is quite long, and that time will add up very fast. Comparatively, testing a dataset in torchvision takes ~1s.
There are other unrelated tests with really long execution times, so perhaps this is a discussion to be had more globally, but I think torchtune should be mindful of the duration of the test suite right from the beginning. A long test suite is an expensive one (you'll be pressured to take shortcuts) and an annoying one to run, so you end up waiting too long for CI to be green.

@gokulavasan (Contributor, Author) replied:

> the comment was resolved but I don't see this class being either private or documented

I added a docstring, but let me convert it to a private class, as we don't want to expose/generalize it just yet. Cc @rohan-varma

@pbontrager (Contributor) commented:

This looks really good overall. In the future I would like all datasets to match a protocol design and use the same method names, but we can leave that to future work on dataset generalization. Also, to Nicolas's note on testing: I think we should think hard about how we can test these without downloading the datasets. Rethinking the unit tests can be deferred until we have more than two datasets here, though.
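
One way to avoid the download in tests, sketched under the assumption that the dataset class calls datasets.load_dataset internally (the fixture's schema is also an assumption):

```python
import datasets
from unittest import mock

# Tiny in-memory fixture shaped like a SlimOrca row.
FAKE_ROWS = [{"conversations": [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "Say hi."},
    {"from": "gpt", "value": "Hi!"},
]}]

def test_without_download():
    # Patch the hub call so the test never touches the network; in a real
    # test, the dataset class under test would invoke load_dataset itself.
    with mock.patch.object(datasets, "load_dataset", return_value=FAKE_ROWS):
        rows = datasets.load_dataset("Open-Orca/SlimOrca-Dedup")
        assert rows == FAKE_ROWS
```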

@gokulavasan merged commit e983194 into main on Jan 23, 2024
15 checks passed
@gokulavasan (Contributor, Author) commented:

Thanks for the pointer @pbontrager. For this particular dataset's unit test, I created GitHub issue #238 as a follow-up.
