Add truncate option in Tokenizer #213

Closed · wants to merge 6 commits

Conversation

@gokulavasan (Contributor) commented Jan 18, 2024

Context

  • Some training samples, when encoded, can produce a token list longer than what the model can accept (the model's max_sequence_length). Training then fails because the model cannot accept that input.
  • Add an option to truncate the token id list returned by the tokenizer.
  • Although the alpaca dataset's input ids all fit within 4096 tokens, slimorca (Adding SlimOrca Dataset to the datasets collection #116) has samples longer than 4096.

Reference implementation - HF BertTokenizer - https://colab.research.google.com/drive/1BBq5BPf1zjlPs0A5ky0mP-gNu_zFi0r5#scrollTo=iqmKgNj647FN. If a different tokenizer is recommended, please suggest it. Note that lit-gpt truncates on the right, including the EOS, while HF BertTokenizer does not drop the EOS.
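
For readers unfamiliar with the HF behavior referenced above, a minimal sketch (the checkpoint name, input text, and max_length are illustrative assumptions, not values from the linked notebook):

# Sketch only: HF BertTokenizer truncation keeps the trailing [SEP] (its EOS-equivalent),
# whereas lit-gpt truncates on the right and can drop the EOS.
from transformers import BertTokenizer

hf_tok = BertTokenizer.from_pretrained("bert-base-uncased")
ids = hf_tok.encode("a deliberately long input " * 50, truncation=True, max_length=16)
assert len(ids) == 16
assert ids[-1] == hf_tok.sep_token_id  # [SEP] survives truncation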

Changelog

  • Add max_len to the tokenizer constructor; it is used during the encode operation when truncation is set to True. The tokenizer's max_len is set at initialization to the model's max_seq_length; in this case, llama_tokenizer is initialized with the llama2 model's max_seq_len.
  • Add a truncate option to encode which, when set, uses the max_len param set on the tokenizer (a hedged sketch of this behavior follows below).
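
A minimal sketch of the behavior described in the changelog, not the actual torchtune implementation; it assumes a SentencePiece-backed tokenizer and borrows the max_len/truncate names from this PR:

# Hedged sketch of the proposed max_len / truncate behavior; not torchtune's real code.
from typing import List, Optional
from sentencepiece import SentencePieceProcessor

class Tokenizer:
    def __init__(self, spm_path: str, max_len: Optional[int] = None):
        self.spm_model = SentencePieceProcessor(model_file=spm_path)
        self.eos_id = self.spm_model.eos_id()
        # max_len is typically the model's max_seq_length (e.g. 4096 for llama2).
        self.max_len = max_len

    def encode(
        self,
        text: str,
        add_bos: bool = True,
        add_eos: bool = True,
        truncate: bool = False,
    ) -> List[int]:
        tokens = self.spm_model.encode(
            text, add_bos=add_bos, add_eos=add_eos, out_type=int
        )
        if truncate and self.max_len is not None:
            tokens = tokens[: self.max_len]
            # Keep EOS as the final token if it was requested (HF BertTokenizer-style);
            # dropping it instead would match lit-gpt's behavior.
            if add_eos and tokens and tokens[-1] != self.eos_id:
                tokens[-1] = self.eos_id
        return tokens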

Test plan

  • Added unit tests that verify this behavior (a hedged test sketch follows below).
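
The actual tests are in the diff; below is only a sketch of what such a test could look like, assuming the Tokenizer sketch above and a placeholder model path:

# Hypothetical test sketch; the model path and expected lengths are placeholders.
def test_encode_respects_max_len_when_truncate_is_set():
    tokenizer = Tokenizer("/tmp/spm.model", max_len=5)
    tokens = tokenizer.encode("a fairly long example sentence", truncate=True)
    assert len(tokens) <= 5
    assert tokens[-1] == tokenizer.eos_id  # EOS survives truncation

def test_encode_is_unchanged_without_truncate():
    tokenizer = Tokenizer("/tmp/spm.model", max_len=5)
    tokens = tokenizer.encode("a fairly long example sentence", truncate=False)
    assert len(tokens) > 5  # assumes the sentence tokenizes to more than 5 ids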

@facebook-github-bot added the CLA Signed label on Jan 18, 2024 (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed).
netlify bot commented Jan 18, 2024

Deploy Preview for torchtune-preview ready!

🔨 Latest commit: c8cb6ec
🔍 Latest deploy log: https://app.netlify.com/sites/torchtune-preview/deploys/65aa49ab4c57c20007357fb1
😎 Deploy Preview: https://deploy-preview-213--torchtune-preview.netlify.app

Review thread on torchtune/modules/tokenizer.py (outdated), excerpt:
text,
add_bos=add_bos,
add_eos=add_eos,
out_type=int,
)
if truncate and self.max_len is not None:
Contributor commented:

If the user sets truncate to True but forgets to set max length, perhaps we should raise a warning that the output will not be truncated.

Member commented:

yeah, or an error.

Contributor Author (@gokulavasan) commented:

@RdoubleA @rohan-varma Is there a way to log a warning/error only once (or, say, once every 10 seconds)? If truncate is called and max len isn't set, it might flood the logs.
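
(Not torchtune-specific, but two generic ways to avoid flooding the logs: Python's warnings module deduplicates repeated warnings from the same call site under its default filter, or a small time-based rate limiter can be written by hand. A sketch:)

# Generic sketch, not torchtune's logging setup.
import time
import warnings

# Option 1: warnings.warn is shown only once per call site under the default filter.
warnings.warn("max_len is not set; truncation will be a no-op")

# Option 2: a simple time-based rate limiter around a logger call.
_last_emit = 0.0

def warn_rate_limited(logger, msg: str, interval_s: float = 10.0) -> None:
    global _last_emit
    now = time.monotonic()
    if now - _last_emit >= interval_s:
        logger.warning(msg)
        _last_emit = now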

Contributor commented:

Shouldn't the warning be logged from the constructor? If so, why would it flood the logs?

Contributor commented:

In that case let's just go with an error if truncate is True but max length is None.
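
(A sketch of the check being suggested, using the parameter names from this PR; the placement, constructor vs. encode, was still under discussion:)

# Hedged sketch only; not the final implementation.
from typing import Optional

def check_truncation_args(truncate: bool, max_len: Optional[int]) -> None:
    if truncate and max_len is None:
        raise ValueError(
            "truncate=True requires the tokenizer to be constructed with max_len set"
        )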

Contributor Author (@gokulavasan) commented:

@kartikayk The max token len option is generally derived from the model's max seq length, so it is part of the tokenizer constructor. Whether the dataset wants to perform truncation or not is controlled at the encode method call. What I can probably do is add a warning at tokenizer initialization to call out that max token len has not been set and thus no truncation will be performed.

Contributor commented:

@gokulavasan thanks for calling this out, I was wondering about this while reviewing this PR but decided against asking the question.

"Whether the dataset wants to perform truncation or not is controlled at the encode method call"

The fact that the dataset decides whether truncation needs to be performed makes a LOT of sense to me. If that is true, then shouldn't the truncation happen where encode is called instead of within the encode function itself? What's the value in having encode (and the tokenizer) be aware of this param? It's not like it saves you anything, since you still have to tokenize the entire input AND then truncate.

Contributor Author (@gokulavasan) commented Jan 19, 2024:

@kartikayk Is your suggestion that the truncation happen in the dataset code, right after tokenizer.encode is called? I would imagine both the alpaca and slimorca datasets using the truncation feature (based on the max sequence length of the model), so I thought having it in the tokenizer would provide that capability in one place.

In the HF tokenizer, it is available in the encode method - https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L2553-L2554.

In the lit-gpt tokenizer, it is also available in the encode method - https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/tokenizer.py#L88.

If you think we don't have to move this to the tokenizer just yet, and can instead do the truncation in the slimorca dataset, I can do that and revisit this PR later.

Contributor commented:

@gokulavasan yeah, exactly. Thinking through this conceptually, I'm trying to figure out what the value of adding this to the tokenizer's encode is, since it doesn't really impact the tokenization functionality. max_seq_len depends on the (dataset, model) tuple. We could just configure it as a param in the dataset and, when getting a sample, tokenize, truncate, and return. The warning can be added to the dataset init, and tests can be coupled with the dataset itself to make sure samples are truncated appropriately. WDYT?
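
(A hedged sketch of the dataset-side alternative described above; the class and parameter names are illustrative, not torchtune's actual dataset API:)

# Illustrative sketch: tokenize and truncate inside the dataset rather than the tokenizer.
import logging
from typing import List, Optional
from torch.utils.data import Dataset

logger = logging.getLogger(__name__)

class TruncatingTextDataset(Dataset):
    def __init__(self, samples: List[str], tokenizer, max_seq_len: Optional[int] = None):
        self._samples = samples
        self._tokenizer = tokenizer
        self._max_seq_len = max_seq_len
        if max_seq_len is None:
            # Warn once at init instead of on every sample.
            logger.warning("max_seq_len is not set; samples will not be truncated")

    def __len__(self) -> int:
        return len(self._samples)

    def __getitem__(self, idx: int) -> List[int]:
        tokens = self._tokenizer.encode(self._samples[idx], add_bos=True, add_eos=True)
        if self._max_seq_len is not None:
            tokens = tokens[: self._max_seq_len]  # right-truncation
        return tokens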

torchtune/modules/tokenizer.py (review thread, outdated, resolved)
Member @rohan-varma left a comment:

Same question as @RdoubleA regarding the eos token. Please also take some time to build and render the documentation and add docs as needed, which might as well be done while this file is being touched.

tests/torchtune/modules/test_tokenizer.py (review thread, outdated, resolved)
tests/torchtune/modules/test_tokenizer.py (review thread, outdated, resolved)
torchtune/modules/tokenizer.py (review thread, outdated, resolved)
torchtune/modules/tokenizer.py (review thread, outdated, resolved)

@gokulavasan (Contributor Author) commented:

Updated the PR @rohan-varma @RdoubleA. Do you have any suggestions on limiting logger output (e.g. to once every N seconds)? Or I can just log at the warning level.

@kartikayk (Contributor) commented:

@gokulavasan thanks for adding this. Have we validated this with any reference code? If so, do you mind adding that information in the context of the PR?

Contributor @RdoubleA left a comment:

Accepting to unblock, just one comment and the truncate error thing.

torchtune/modules/tokenizer.py (review thread, resolved)
@gokulavasan (Contributor Author) commented:

@kartikayk Added reference implementation in the description (HF BertTokenizer). Note that lit-gpt has a different behavior where the EOS is truncated as well.

@kartikayk (Contributor) commented:

Thanks @gokulavasan, I figured as much. So do we need to let the calling function decide whether it wants to truncate the EOS or not, and make this a flag in that function that we can turn on and off?
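
(One way to expose that choice, sketched as a hypothetical helper; keep_eos is an illustrative name, not an agreed API:)

# Hypothetical helper sketch: the caller decides whether the EOS survives truncation.
from typing import List

def truncate(tokens: List[int], max_len: int, eos_id: int, keep_eos: bool = True) -> List[int]:
    # Right-truncate to max_len tokens.
    out = tokens[:max_len]
    if keep_eos and tokens and tokens[-1] == eos_id and out and out[-1] != eos_id:
        # HF BertTokenizer-style: force EOS back as the final token.
        out[-1] = eos_id
    # keep_eos=False leaves the plain right-truncation, matching lit-gpt's behavior.
    return out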

Contributor @kartikayk left a comment:

Changing this to "Request Changes" since we have a couple of open discussions. Happy to revert back to "Approved" if those don't make sense to address here.

@gokulavasan (Contributor Author) commented:

Closing this PR as I moved the truncation logic from this PR to the SlimOrca Dataset PR #116.

@joecummings deleted the tokenizer-truncate-option branch April 12, 2024 23:13