Faster tokenization by converting byte arrays to strings #6898

Closed

GuptaManan100 commented on Oct 19, 2020

Overview of the enhancement

The tokenizer currently stores the SQL input in a byte array, even though the input arrives as a string, which forces an unnecessary copy of the input. This change makes the tokenizer operate on the string directly so that the copy is eliminated.
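To illustrate where the copy comes from, here is a minimal sketch in Go. It is not the tokenizer code from this PR: `isLetter`, `firstWordBytes`, and `firstWordString` are hypothetical helpers that only pull the first ASCII word out of a query.

```go
package main

// isLetter reports whether c can be part of a simple ASCII identifier.
// Hypothetical helper, much simpler than a real SQL tokenizer.
func isLetter(c byte) bool {
	return c == '_' || ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
}

// firstWordBytes mimics the byte-array approach: the whole input is
// converted to []byte (one copy), and the token is converted back to
// a string when it is returned (a second copy).
func firstWordBytes(sql string) string {
	buf := []byte(sql)
	end := 0
	for end < len(buf) && isLetter(buf[end]) {
		end++
	}
	return string(buf[:end])
}

// firstWordString mimics the string approach: it indexes the input
// directly and returns a substring that shares memory with the input,
// so no bytes are copied.
func firstWordString(sql string) string {
	end := 0
	for end < len(sql) && isLetter(sql[end]) {
		end++
	}
	return sql[:end]
}
```

Both functions return the same token for the same input. The difference is only in how many times the bytes are copied along the way, which is the cost the benchmarks below try to measure.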

Initial Byte Array Implementation
BenchmarkParse1-16          13350778     10810 ns/op     2431 B/op      76 allocs/op
BenchmarkParse2-16           4257296     33834 ns/op     8672 B/op     266 allocs/op
BenchmarkParse2Parallel-16  25955780      6434 ns/op     5904 B/op     175 allocs/op
BenchmarkParse3-16             47076   2968535 ns/op  6337672 B/op     359 allocs/op
BenchmarkParseBigQuery-16       8679  16439766 ns/op  2541376 B/op  133468 allocs/op
BenchmarkWithNormalizer-16     40436   7523121 ns/op  7450373 B/op     638 allocs/op

Strings Implementation
BenchmarkParse1-16          13014916     11534 ns/op     2375 B/op      91 allocs/op
BenchmarkParse2-16           4180118     34032 ns/op     8511 B/op     315 allocs/op
BenchmarkParse2Parallel-16  25814864      6391 ns/op     5785 B/op     224 allocs/op
BenchmarkParse3-16             40479   3474247 ns/op  6293153 B/op     405 allocs/op
BenchmarkParseBigQuery-16       8089  17375202 ns/op  2663017 B/op  213834 allocs/op
BenchmarkWithNormalizer-16     33951   5906691 ns/op  8474301 B/op     695 allocs/op

Strings With Copy On Write Implementation
BenchmarkParse1-16          13779513     10704 ns/op     2167 B/op      56 allocs/op
BenchmarkParse2-16           4595913     31356 ns/op     7557 B/op     160 allocs/op
BenchmarkParse2Parallel-16  29976123      5207 ns/op     4834 B/op      69 allocs/op
BenchmarkParse3-16             39510   3700988 ns/op  6128685 B/op     314 allocs/op
BenchmarkParseBigQuery-16      10000  14835516 ns/op  1614678 B/op   52400 allocs/op
BenchmarkWithNormalizer-16     29342   5871095 ns/op  8309850 B/op     604 allocs/op

CPU and memory profiles for these benchmarks are available at https://drive.google.com/drive/folders/1jTFyySRiVyTSYdFV3U_H8HGDUHF3443Y?usp=sharing
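
For readers unfamiliar with the result format: each line shows the benchmark name, the number of iterations run, and the average time (ns/op), bytes allocated (B/op), and number of allocations (allocs/op) per iteration. The sketch below shows one way to collect the same metrics programmatically with the standard `testing` package. `firstWord`, `sink`, and `measure` are hypothetical stand-ins and not part of this PR.

```go
package main

import "testing"

// firstWord is a hypothetical stand-in for the code under test.
// It returns everything up to the first space as a substring.
func firstWord(sql string) string {
	end := 0
	for end < len(sql) && sql[end] != ' ' {
		end++
	}
	return sql[:end]
}

// sink is a package-level variable that receives results so the
// compiler cannot optimize the measured call away.
var sink string

// measure runs fn under the standard benchmark harness and returns
// the collected result, including allocation statistics.
func measure(fn func()) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			fn()
		}
	})
}
```

Calling `measure(func() { sink = firstWord("select 1") })` returns a `testing.BenchmarkResult` whose `N` field and `NsPerOp()`, `AllocedBytesPerOp()`, and `AllocsPerOp()` methods correspond to the four columns in the listings above.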

GuptaManan100 added the labels Type: Enhancement and Component: Parser on Oct 19, 2020
Signed-off-by: GuptaManan100 <manan@planetscale.com>

sougou commented Nov 21, 2020

Closing this. But let's keep the branch alive in case we want to revisit this.

sougou closed this on Nov 21, 2020
vmg mentioned this pull request on Mar 5, 2021
Labels: Component: Query Serving, Status: Won't Fix, Type: Enhancement