Replace Ragel tokenizer with String Scanner #4369

Merged 1 commit into rmosolgo:master on Mar 8, 2023

Conversation

tenderlove (Contributor)

This commit replaces the Ragel tokenizer with a "hand written" string scanner-based tokenizer. This tokenizer is about 2x faster than the previous tokenizer.
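
For readers unfamiliar with the technique, here is a minimal sketch of a StringScanner-based tokenizer. The token names and patterns are simplified stand-ins to show the approach, not the lexer from this PR:

```ruby
require "strscan"

# Minimal sketch of the StringScanner approach (not this PR's lexer):
# try each pattern at the current position and emit a token for the match.
def tokenize(doc)
  scanner = StringScanner.new(doc)
  tokens = []
  until scanner.eos?
    if scanner.skip(/[\s,]+/)                          # ignored characters
      next
    elsif (s = scanner.scan(/[_A-Za-z][_0-9A-Za-z]*/)) # names / keywords
      tokens << [:NAME, s]
    elsif (s = scanner.scan(/-?\d+/))                  # integers
      tokens << [:INT, s]
    elsif (s = scanner.scan(/[{}()\[\]:]|\.\.\./))     # punctuation
      tokens << [:PUNCT, s]
    else
      tokens << [:UNKNOWN_CHAR, scanner.getch]         # one character, not one byte
    end
  end
  tokens
end

p tokenize("query { user(id: 1) { name } }")
```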

To test, I used this benchmark:

https://gist.github.com/tenderlove/fdce1610a3eecd8426f5845391a6c373

On this branch:

```
$ ruby -I.:lib:spec bench.rb
Warming up --------------------------------------
             graphql   506.000  i/100ms
Calculating -------------------------------------
             graphql      5.090k (± 0.4%) i/s -     25.806k in   5.070096s
```

On master:

```
$ ruby -I.:lib:spec bench.rb
Warming up --------------------------------------
             graphql   284.000  i/100ms
Calculating -------------------------------------
             graphql      2.880k (± 0.4%) i/s -     14.484k in   5.029007s
```

To ensure compatibility, I ran this tokenizer in parallel with the existing tokenizer and compared the results (a sketch of that kind of check is below). There is one difference that I think is acceptable: how the tokenizer handles unknown input.
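
Something along these lines can run both lexers over a corpus of documents and flag mismatches. This is a hypothetical sketch: `OldLexer` and `NewLexer` are stand-ins, and tokens are assumed to respond to `name` and `value`:

```ruby
# Hypothetical parallel check; OldLexer / NewLexer are stand-ins for the
# Ragel-based and StringScanner-based tokenizers.
def assert_same_tokens(doc)
  old_tokens = OldLexer.tokenize(doc).map { |t| [t.name, t.value] }
  new_tokens = NewLexer.tokenize(doc).map { |t| [t.name, t.value] }
  return if old_tokens == new_tokens

  raise "lexer mismatch for #{doc.inspect}:\n  old: #{old_tokens.inspect}\n  new: #{new_tokens.inspect}"
end
```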

For example, here is that difference with unknown input:

```ruby
require "graphql/language/lexer"
require "graphql/language/token"

str = "😘"
p GraphQL::Language::Lexer.tokenize(str)
```

Output on master is this:

```
$ ruby -I.:lib:spec test.rb
[(UNKNOWN_CHAR "\xF0" [1:1]), (UNKNOWN_CHAR "\x9F" [1:2]), (UNKNOWN_CHAR "\x98" [1:3]), (UNKNOWN_CHAR "\x98" [1:4])]
```

Output on my branch is this:

```
$ ruby -I.:lib:spec test.rb
[(UNKNOWN_CHAR "😘" [1:1])]
```

This is because Ragel only understands byte streams, whereas Ruby knows we have a UTF-8 string.
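
A quick way to see the difference is to iterate the same string by byte versus by character:

```ruby
# Splitting by byte breaks the emoji into four pieces (hence four
# UNKNOWN_CHAR tokens on master), while splitting by character keeps it whole.
str = "😘"
p str.bytes.map { |b| b.to_s(16) } # => ["f0", "9f", "98", "98"]
p str.chars                        # => ["😘"]
```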

I didn't make many changes to the helper functions that were originally included in the Ragel parser. Now that we're not working with byte arrays, the code in those helper functions could probably be improved too.

@tenderlove force-pushed the remove-ragel branch 2 times, most recently from ad3f08f to 773cc72 on March 3, 2023 at 23:10
@tenderlove (Contributor, Author)

By the way, I made block quote strings work the way the tests want them to work, but the way GraphQL-Ruby tokenizes block quotes doesn't match the way the JS reference implementation tokenizes them.

I was specifically checking the 10-quote test. In that case the JS tokenizer sees one triple-quote block (the first 6 quotes), then an unterminated triple-quote block (the following 3 quotes plus 1 more, for a total of 4).

I also looked at the """""e""""" test. The reference implementation parses this as a block quote with the contents ""e, followed by an empty regular string. This is different than the test in this repo that asserts it's a single block quoted string with the contents ""e"".
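
To make the second case concrete, here is how the two implementations split that input, paraphrased from the description above (not actual tokenizer output):

```ruby
input = '"""""e"""""' # 5 quotes, "e", 5 quotes
# graphql-js:          """ + ""e + """   => block string with contents ""e
#                      ""                => followed by an empty regular string
# graphql-ruby tests:  """ + ""e"" + """ => one block string with contents ""e""
```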

@tenderlove (Contributor, Author)

Sorry, I didn't realize there were benchmarks in the repo.

Here is master:

```
$ bundle exec rake bench:parse
Warming up --------------------------------------
scan - introspection   130.000  i/100ms
parse - introspection
                       102.000  i/100ms
    scan - fragments   229.000  i/100ms
   parse - fragments   176.000  i/100ms
    scan - big query    22.000  i/100ms
   parse - big query    17.000  i/100ms
Calculating -------------------------------------
scan - introspection      1.306k (± 0.3%) i/s -      6.630k in   5.077063s
parse - introspection
                          1.024k (± 0.7%) i/s -      5.202k in   5.078804s
    scan - fragments      2.300k (± 0.4%) i/s -     11.679k in   5.077442s
   parse - fragments      1.756k (± 0.7%) i/s -      8.800k in   5.011090s
    scan - big query    226.848  (± 0.4%) i/s -      1.144k in   5.043134s
   parse - big query    176.933  (± 0.6%) i/s -    901.000  in   5.092648s
```

This branch:

```
$ bundle exec rake bench:parse
Warming up --------------------------------------
scan - introspection   306.000  i/100ms
parse - introspection
                       187.000  i/100ms
    scan - fragments   444.000  i/100ms
   parse - fragments   280.000  i/100ms
    scan - big query    50.000  i/100ms
   parse - big query    31.000  i/100ms
Calculating -------------------------------------
scan - introspection      3.052k (± 0.4%) i/s -     15.300k in   5.012466s
parse - introspection
                          1.872k (± 0.9%) i/s -      9.537k in   5.095884s
    scan - fragments      4.418k (± 1.1%) i/s -     22.200k in   5.026021s
   parse - fragments      2.809k (± 1.0%) i/s -     14.280k in   5.084464s
    scan - big query    505.981  (± 0.2%) i/s -      2.550k in   5.039741s
   parse - big query    312.802  (± 0.6%) i/s -      1.581k in   5.054577s
```

Just to make comparison easier, I made a bar graph that compares them. The graph shows IPS on this branch divided by IPS on master, so we can see the speedup relative to master for each test:

[Bar graph: IPS on this branch / IPS on master for each benchmark]
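
For reference, the same ratios can be computed directly from the numbers in the two runs above:

```ruby
# Branch IPS divided by master IPS, using the figures from the two runs above.
{
  "scan - introspection"  => 3.052e3 / 1.306e3,  # ~2.3x
  "parse - introspection" => 1.872e3 / 1.024e3,  # ~1.8x
  "scan - fragments"      => 4.418e3 / 2.300e3,  # ~1.9x
  "parse - fragments"     => 2.809e3 / 1.756e3,  # ~1.6x
  "scan - big query"      => 505.981 / 226.848,  # ~2.2x
  "parse - big query"     => 312.802 / 176.933,  # ~1.8x
}
```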

@tenderlove (Contributor, Author) commented Mar 4, 2023

Just for fun I decided to try with YJIT too. Here are the results with YJIT:

```
$ RUBYOPT='-v --yjit' bundle exec rake bench:parse
ruby 3.2.0 (2022-12-25 revision a528908271) +YJIT [arm64-darwin22]
Warming up --------------------------------------
scan - introspection   475.000  i/100ms
parse - introspection
                       261.000  i/100ms
    scan - fragments   684.000  i/100ms
   parse - fragments   392.000  i/100ms
    scan - big query    77.000  i/100ms
   parse - big query    43.000  i/100ms
Calculating -------------------------------------
scan - introspection      4.728k (± 0.7%) i/s -     23.750k in   5.023954s
parse - introspection
                          2.608k (± 1.0%) i/s -     13.050k in   5.004062s
    scan - fragments      6.840k (± 0.8%) i/s -     34.200k in   5.000240s
   parse - fragments      3.927k (± 0.9%) i/s -     19.992k in   5.090927s
    scan - big query    760.453  (± 2.8%) i/s -      3.850k in   5.066784s
   parse - big query    436.250  (± 0.9%) i/s -      2.193k in   5.027286s
```

[Bar graph of the YJIT benchmark results]

@rmosolgo (Owner) commented Mar 6, 2023

Thanks for sharing this work! Simpler and faster, sounds great to me.

That's interesting about the divergence between this project and the reference implementation. Too bad 😖 If it becomes a problem, I'm definitely open to making this project match graphql-js.

I plan to continue with #4366 since it's still an order of magnitude faster, but I see an advantage to maintaining both a plain-Ruby and a C implementation.

@tenderlove (Contributor, Author)

> I plan to continue with #4366 since it's still an order of magnitude faster, but I see an advantage to maintaining both a plain-Ruby and a C implementation.

Makes sense. I have a vested interest in a good Ruby implementation because YJIT can speed up Ruby code. I ran the benchmarks on the C branch too, but I'm only seeing a 4x increase compared to master?

[Bar graph comparing the C branch benchmarks to master]

I could be running the benchmarks wrong though.

In general, I think it will be possible for YJIT to beat a C extension someday, so I'd be glad if a pure Ruby version sticks around.

@byroot (Contributor) commented Mar 6, 2023

I'd also suggest making the C ext purely optional. As someone who maintains a bunch of C exts, you commonly get error reports from users with all sorts of weird compilation issues. So having the option to just tell them not to use the C ext is gold as a maintainer.

It can be in the same repo, but shipped as a separate gem (like redis-client and hiredis-client).
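
A minimal sketch of what that optional loading could look like; the gem name and `GraphQL::CLexer` constant here are hypothetical, not graphql-ruby's actual layout:

```ruby
# Hypothetical optional-extension pattern: prefer a separately shipped C lexer
# when it is installed, otherwise fall back to the pure-Ruby lexer from this PR.
DefaultLexer =
  begin
    require "graphql/c_lexer"  # hypothetical optional gem
    GraphQL::CLexer
  rescue LoadError
    GraphQL::Language::Lexer   # pure-Ruby StringScanner lexer, always available
  end
```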

@rmosolgo (Owner) commented Mar 6, 2023

> 4x increase

Ah yeah -- I should have clarified, it's just the scan-only benchmarks that show such a big speedup. But I added those in #4366. For the whole parse step, yes, I saw ~3x faster as well. (I'm investigating C for parsing, too: #4368)

> making the c ext purely optional

Thanks for the suggestion, @byroot, I'll take a look at those Redis clients and try to set up something similar 👍

@rmosolgo (Owner) commented Mar 8, 2023

Thanks for this improvement!

@rmosolgo rmosolgo merged commit bbd8637 into rmosolgo:master Mar 8, 2023
@rmosolgo rmosolgo added this to the 2.0.18 milestone Mar 8, 2023
swalkinshaw added a commit that referenced this pull request May 9, 2023
The rewritten lexer in #4369 removed the forced UTF-8 encoding.

It's possible for applications to force the encoding of document strings themselves before passing them to the GraphQL gem, but it should be done at this layer to ensure that invalid encoding is handled properly.
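
A rough sketch of the kind of normalization the commit describes; the method name and error handling are illustrative, not the actual change:

```ruby
# Illustrative only: normalize the incoming document to UTF-8 inside the gem
# instead of relying on callers to do it, and reject invalid byte sequences.
def normalize_document(query_string)
  str = query_string.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "document contains invalid UTF-8" unless str.valid_encoding?
  str
end
```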
swalkinshaw added a commit that referenced this pull request May 11, 2023