Add a fast path that doesn't include normalized chunks in tokenize #11017
- Tests added
- User visible changes (including notable bug fixes) are documented in `whats-new.rst`

The idea in this PR is to add a fast path for `open_dataset` that just uses the token passed into `_maybe_chunk` and doesn't worry about including the normalized chunks within the token (see the sketch after the screenshots below).

Before:
[profiling screenshot not reproduced in this extract]

After:

[profiling screenshot not reproduced in this extract]
This PR shaves ~30 s off the previous runtime for the dataset from the original issue. I was still seeing fairly heavy memory consumption (17.14 GB) for this `open_dataset` call, though. That isn't new with this PR; I just wanted to flag it.
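For anyone who wants to check the memory side independently, here is one possible sketch using the standard-library `tracemalloc`. The file path and chunking arguments are placeholders, this is not necessarily how the figure above was measured, and note that `tracemalloc` only sees Python-level allocations:

```python
import tracemalloc

import xarray as xr

tracemalloc.start()
# Placeholder path; chunks={} asks xarray to return dask-backed variables
# using the engine's preferred chunk sizes.
ds = xr.open_dataset("data.nc", chunks={})
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak traced memory during open_dataset: {peak / 1e9:.2f} GB")
```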