Faster tokenization by converting byte arrays to strings #6898
Overview of the enhancement
The tokenizer currently stores the SQL input in a byte array, even though the input arrives as a string, so the input has to be copied before it can be tokenized. To eliminate this unnecessary copy, the tokenizer is changed to operate on the string directly.
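To illustrate where the copy comes from, here is a minimal sketch in Go. It is not the tokenizer from this pull request: `scanWordsBytes` and `scanWordsString` are hypothetical helpers that only split on spaces. The byte-array version generally copies the input once when converting it to `[]byte` and again for each token converted back to a string, while the string version returns substrings that share the input's memory.

```go
package main

import "fmt"

// scanWordsBytes is a hypothetical stand-in for a byte-array tokenizer.
// It converts the SQL string to a []byte first, then converts every
// token back to a string.
func scanWordsBytes(sql string) []string {
	buf := []byte(sql) // generally copies the whole input
	var tokens []string
	start := -1
	for i := 0; i <= len(buf); i++ {
		if i < len(buf) && buf[i] != ' ' {
			if start < 0 {
				start = i
			}
			continue
		}
		if start >= 0 {
			tokens = append(tokens, string(buf[start:i])) // copies the token
			start = -1
		}
	}
	return tokens
}

// scanWordsString is the same hypothetical scanner working on the string
// directly. Slicing a string does not copy its bytes, so each token is
// just a view into the original input.
func scanWordsString(sql string) []string {
	var tokens []string
	start := -1
	for i := 0; i <= len(sql); i++ {
		if i < len(sql) && sql[i] != ' ' {
			if start < 0 {
				start = i
			}
			continue
		}
		if start >= 0 {
			tokens = append(tokens, sql[start:i]) // no copy
			start = -1
		}
	}
	return tokens
}

func main() {
	sql := "select id from users"
	fmt.Println(scanWordsBytes(sql))
	fmt.Println(scanWordsString(sql))
}
```

Both functions return the same tokens; only the amount of copying differs. Because substrings keep the whole input alive for as long as any token is referenced, this approach trades fewer copies for potentially longer retention of the input.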
Initial Byte Array Implementation
BenchmarkParse1-16 13350778 10810 ns/op 2431 B/op 76 allocs/op
BenchmarkParse2-16 4257296 33834 ns/op 8672 B/op 266 allocs/op
BenchmarkParse2Parallel-16 25955780 6434 ns/op 5904 B/op 175 allocs/op
BenchmarkParse3-16 47076 2968535 ns/op 6337672 B/op 359 allocs/op
BenchmarkParseBigQuery-16 8679 16439766 ns/op 2541376 B/op 133468 allocs/op
BenchmarkWithNormalizer-16 40436 7523121 ns/op 7450373 B/op 638 allocs/op
Strings Implementation
BenchmarkParse1-16 13014916 11534 ns/op 2375 B/op 91 allocs/op
BenchmarkParse2-16 4180118 34032 ns/op 8511 B/op 315 allocs/op
BenchmarkParse2Parallel-16 25814864 6391 ns/op 5785 B/op 224 allocs/op
BenchmarkParse3-16 40479 3474247 ns/op 6293153 B/op 405 allocs/op
BenchmarkParseBigQuery-16 8089 17375202 ns/op 2663017 B/op 213834 allocs/op
BenchmarkWithNormalizer-16 33951 5906691 ns/op 8474301 B/op 695 allocs/op
Strings With Copy On Write Implementation
BenchmarkParse1-16 13779513 10704 ns/op 2167 B/op 56 allocs/op
BenchmarkParse2-16 4595913 31356 ns/op 7557 B/op 160 allocs/op
BenchmarkParse2Parallel-16 29976123 5207 ns/op 4834 B/op 69 allocs/op
BenchmarkParse3-16 39510 3700988 ns/op 6128685 B/op 314 allocs/op
BenchmarkParseBigQuery-16 10000 14835516 ns/op 1614678 B/op 52400 allocs/op
BenchmarkWithNormalizer-16 29342 5871095 ns/op 8309850 B/op 604 allocs/op
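The description does not spell out how the copy-on-write variant works. One plausible reading, sketched below under that assumption, is that a token is returned as a substring of the input for as long as its text matches the input byte for byte, and a buffer is allocated only when the token's value has to differ from the input, for example when a quoted literal contains an escaped quote. `scanQuoted` is a hypothetical helper written for this illustration, not code from the pull request.

```go
package main

import (
	"fmt"
	"strings"
)

// scanQuoted is a hypothetical copy-on-write scanner for a single-quoted
// literal that starts at sql[0]. It returns the literal's value, the number
// of bytes consumed, and whether a closing quote was found.
func scanQuoted(sql string) (value string, consumed int, ok bool) {
	if len(sql) == 0 || sql[0] != '\'' {
		return "", 0, false
	}
	var buf strings.Builder // only used once an escape is seen
	copied := false
	const start = 1
	for i := start; i < len(sql); i++ {
		if sql[i] != '\'' {
			if copied {
				buf.WriteByte(sql[i])
			}
			continue
		}
		// A doubled quote ('') is treated as an escaped quote.
		if i+1 < len(sql) && sql[i+1] == '\'' {
			if !copied {
				// First divergence from the input: copy what was scanned so far.
				buf.WriteString(sql[start:i])
				copied = true
			}
			buf.WriteByte('\'')
			i++ // skip the second quote of the pair
			continue
		}
		// Closing quote.
		if copied {
			return buf.String(), i + 1, true
		}
		return sql[start:i], i + 1, true // fast path: no allocation
	}
	return "", 0, false // unterminated literal
}

func main() {
	v, n, ok := scanQuoted("'abc' rest")
	fmt.Println(v, n, ok)
	v, n, ok = scanQuoted("'it''s'")
	fmt.Println(v, n, ok)
}
```

In this sketch the common case (no escapes) allocates nothing, which is consistent with the lower allocs/op figures reported for the copy-on-write run, although the actual implementation may achieve this differently.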
CPU and memory profiles for these benchmarks are available at https://drive.google.com/drive/folders/1jTFyySRiVyTSYdFV3U_H8HGDUHF3443Y?usp=sharing