Replace Ragel tokenizer with String Scanner #4369

tenderlove · 2023-03-03T21:30:42Z

This commit replaces the Ragel tokenizer with a "hand written" string scanner based tokenizer. This tokenizer is about ~2x faster than the previous tokenizer.

To test, I used this benchmark:

https://gist.github.com/tenderlove/fdce1610a3eecd8426f5845391a6c373

On this branch:

$ ruby -I.:lib:spec bench.rb
Warming up --------------------------------------
             graphql   506.000  i/100ms
Calculating -------------------------------------
             graphql      5.090k (± 0.4%) i/s -     25.806k in   5.070096s

On master:

$ ruby -I.:lib:spec bench.rb
Warming up --------------------------------------
             graphql   284.000  i/100ms
Calculating -------------------------------------
             graphql      2.880k (± 0.4%) i/s -     14.484k in   5.029007s

To ensure compatibility I ran this tokenizer in parallel with the existing tokenizer and compared the results. There is one difference that I think is acceptable and that is when the tokenizer encounters unknown input.

For example:

require "graphql/language/lexer"
require "graphql/language/token"

str = "😘"
p GraphQL::Language::Lexer.tokenize(str)

Output on master is this:

$ ruby -I.:lib:spec test.rb
[(UNKNOWN_CHAR "\xF0" [1:1]), (UNKNOWN_CHAR "\x9F" [1:2]), (UNKNOWN_CHAR "\x98" [1:3]), (UNKNOWN_CHAR "\x98" [1:4])]

Output on my branch is this:

$ ruby -I.:lib:spec test.rb
[(UNKNOWN_CHAR "😘" [1:1])]

This is because Ragel only understands byte streams where Ruby knows we have a UTF-8 string.

I didn't make many changes to the helper functions that were originally included in the Ragel parser. Now that we're not working with byte arrays, I think could probably improve the code in those helper functions too.

tenderlove · 2023-03-03T23:25:19Z

By the way, I made the block quote string work the way the tests want it to work, but the way GraphQL-Ruby tokenizes block quotes doesn't match the way the reference implementation for JS tokenizes them.

I was specifically checking the 10 quote test. In that case the JS tokenizer sees one triple quote block (the first 6 quotes), then an unterminated triple quote block (the following 3, then 1 quote, for a total of 4 quotes).

I also looked at the """""e""""" test. The reference implementation parses this as a block quote with the contents ""e, followed by an empty regular string. This is different than the test in this repo that asserts it's a single block quoted string with the contents ""e"".

tenderlove · 2023-03-04T00:26:07Z

Sorry, I didn't realize there were benchmarks in the repo.

Here is master:

$ bundle exec rake bench:parse
Warming up --------------------------------------
scan - introspection   130.000  i/100ms
parse - introspection
                       102.000  i/100ms
    scan - fragments   229.000  i/100ms
   parse - fragments   176.000  i/100ms
    scan - big query    22.000  i/100ms
   parse - big query    17.000  i/100ms
Calculating -------------------------------------
scan - introspection      1.306k (± 0.3%) i/s -      6.630k in   5.077063s
parse - introspection
                          1.024k (± 0.7%) i/s -      5.202k in   5.078804s
    scan - fragments      2.300k (± 0.4%) i/s -     11.679k in   5.077442s
   parse - fragments      1.756k (± 0.7%) i/s -      8.800k in   5.011090s
    scan - big query    226.848  (± 0.4%) i/s -      1.144k in   5.043134s
   parse - big query    176.933  (± 0.6%) i/s -    901.000  in   5.092648s

This branch:

$ bundle exec rake bench:parse
Warming up --------------------------------------
scan - introspection   306.000  i/100ms
parse - introspection
                       187.000  i/100ms
    scan - fragments   444.000  i/100ms
   parse - fragments   280.000  i/100ms
    scan - big query    50.000  i/100ms
   parse - big query    31.000  i/100ms
Calculating -------------------------------------
scan - introspection      3.052k (± 0.4%) i/s -     15.300k in   5.012466s
parse - introspection
                          1.872k (± 0.9%) i/s -      9.537k in   5.095884s
    scan - fragments      4.418k (± 1.1%) i/s -     22.200k in   5.026021s
   parse - fragments      2.809k (± 1.0%) i/s -     14.280k in   5.084464s
    scan - big query    505.981  (± 0.2%) i/s -      2.550k in   5.039741s
   parse - big query    312.802  (± 0.6%) i/s -      1.581k in   5.054577s

Just to make comparison easier, I made a bar graph that compares them. The graph is IPS on this branch / IPS on master so we can see the % speedup compared to master for each test:

tenderlove · 2023-03-04T00:35:06Z

Just for fun I decided to try with YJIT too. Here are the results with YJIT:

$ RUBYOPT='-v --yjit' bundle exec rake bench:parse
ruby 3.2.0 (2022-12-25 revision a528908271) +YJIT [arm64-darwin22]
Warming up --------------------------------------
scan - introspection   475.000  i/100ms
parse - introspection
                       261.000  i/100ms
    scan - fragments   684.000  i/100ms
   parse - fragments   392.000  i/100ms
    scan - big query    77.000  i/100ms
   parse - big query    43.000  i/100ms
Calculating -------------------------------------
scan - introspection      4.728k (± 0.7%) i/s -     23.750k in   5.023954s
parse - introspection
                          2.608k (± 1.0%) i/s -     13.050k in   5.004062s
    scan - fragments      6.840k (± 0.8%) i/s -     34.200k in   5.000240s
   parse - fragments      3.927k (± 0.9%) i/s -     19.992k in   5.090927s
    scan - big query    760.453  (± 2.8%) i/s -      3.850k in   5.066784s
   parse - big query    436.250  (± 0.9%) i/s -      2.193k in   5.027286s

This commit replaces the Ragel tokenizer with a "hand written" string scanner based tokenizer. This tokenizer is about ~2x faster than the previous tokenizer. To test, I used this benchmark: https://gist.github.com/tenderlove/fdce1610a3eecd8426f5845391a6c373 On this branch: ``` $ ruby -I.:lib:spec bench.rb Warming up -------------------------------------- graphql 506.000 i/100ms Calculating ------------------------------------- graphql 5.090k (± 0.4%) i/s - 25.806k in 5.070096s ``` On master: ``` $ ruby -I.:lib:spec bench.rb Warming up -------------------------------------- graphql 284.000 i/100ms Calculating ------------------------------------- graphql 2.880k (± 0.4%) i/s - 14.484k in 5.029007s ``` To ensure compatibility I ran this tokenizer in parallel with the existing tokenizer and compared the results. There is one difference that I think is acceptable and that is when the tokenizer encounters unknown input. For example: ```ruby require "graphql/language/lexer" require "graphql/language/token" str = "😘" p GraphQL::Language::Lexer.tokenize(str) ``` Output on master is this: ``` $ ruby -I.:lib:spec test.rb [(UNKNOWN_CHAR "\xF0" [1:1]), (UNKNOWN_CHAR "\x9F" [1:2]), (UNKNOWN_CHAR "\x98" [1:3]), (UNKNOWN_CHAR "\x98" [1:4])] ``` Output on my branch is this: ``` $ ruby -I.:lib:spec test.rb [(UNKNOWN_CHAR "😘" [1:1])] ``` This is because Ragel only understands byte streams where Ruby knows we have a UTF-8 string. I didn't make many changes to the helper functions that were originally included in the Ragel parser. Now that we're not working with byte arrays, I think could probably improve the code in those helper functions too.

rmosolgo · 2023-03-06T15:06:47Z

Thanks for sharing this work! Simpler and faster, sounds great to me.

That's interesting about the divergence between this project and the reference implementation. Too bad 😖 If it becomes a problem, I'm definitely open to making this project match graphql-js.

I plan to continue with #4366 since it's still an order of magnitude faster, but I see an advantage to maintaining both a plain-Ruby and a C implementation.

tenderlove · 2023-03-06T18:04:19Z

I plan to continue with #4366 since it's still an order of magnitude faster, but I see an advantage to maintaining both a plain-Ruby and a C implementation.

Makes sense. I have a vested interest in a good Ruby implementation because YJIT can speed up Ruby code. I ran the benchmarks on the the C branch too, but I'm only seeing 4x increase as compared to master?

I could be running the benchmarks wrong though.

In general, I think it will be possible for YJIT to beat a C extension someday, so I'd be glad if a pure Ruby version sticks around.

byroot · 2023-03-06T18:23:14Z

I'd also suggest making the c ext purely optional. As someone who maintain a bunch of C exts, you commonly get error report from users with weird compilation issues of all sorts. So having the possibility to just tell them to not use the c-ext is gold as a maintainer.

It can be in the same repo, but shipped as a separate gem (e.g. like redis-client and hiredis-client)

rmosolgo · 2023-03-06T20:12:26Z

4x increase

Ah yeah -- I should have clarified, it's just the scan-only benchmarks that show such a big speedup. But I added those in #4366. For the whole parse step, yes, I saw ~3x faster as well. (I'm investigating C for parsing, too: #4368)

making the c ext purely optional

Thanks for the suggestion, @byroot, I'll take a look at those Redis clients and try to set up something similar 👍

rmosolgo · 2023-03-08T18:49:19Z

Thanks for this improvement!

The rewritten lexer in #4369 removed UTF8 forced encoding. It's possible for applications to force encoding document strings themselves before passing it to the GraphQL gem, but it should be done at this layer to ensure that invalid encoding is properly handled.

tenderlove force-pushed the remove-ragel branch 2 times, most recently from ad3f08f to 773cc72 Compare March 3, 2023 23:10

tenderlove force-pushed the remove-ragel branch from 773cc72 to 03d8e37 Compare March 4, 2023 17:43

rmosolgo merged commit bbd8637 into rmosolgo:master Mar 8, 2023

rmosolgo added this to the 2.0.18 milestone Mar 8, 2023

rmosolgo mentioned this pull request Mar 9, 2023

Port token scanning to C, use arrays instead of tokens #4366

Merged

7 tasks

swalkinshaw mentioned this pull request May 9, 2023

Force UTF-8 encoding in lexer #4467

Merged

rmosolgo mentioned this pull request Jul 8, 2023

Warning: mismatched indentations #4545

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace Ragel tokenizer with String Scanner #4369

Replace Ragel tokenizer with String Scanner #4369

tenderlove commented Mar 3, 2023

tenderlove commented Mar 3, 2023

tenderlove commented Mar 4, 2023

tenderlove commented Mar 4, 2023 •

edited

rmosolgo commented Mar 6, 2023 •

edited

tenderlove commented Mar 6, 2023

byroot commented Mar 6, 2023

rmosolgo commented Mar 6, 2023

rmosolgo commented Mar 8, 2023

Replace Ragel tokenizer with String Scanner #4369

Replace Ragel tokenizer with String Scanner #4369

Conversation

tenderlove commented Mar 3, 2023

tenderlove commented Mar 3, 2023

tenderlove commented Mar 4, 2023

tenderlove commented Mar 4, 2023 • edited

rmosolgo commented Mar 6, 2023 • edited

tenderlove commented Mar 6, 2023

byroot commented Mar 6, 2023

rmosolgo commented Mar 6, 2023

rmosolgo commented Mar 8, 2023

tenderlove commented Mar 4, 2023 •

edited

rmosolgo commented Mar 6, 2023 •

edited