Set UTF-8 encoding on other parsed tokens #516

Merged
merged 1 commit into rmosolgo:master from josh:lexer-encodings Feb 3, 2017


@josh
Contributor
josh commented Feb 2, 2017

So I was trying to track down why most AST Language::Nodes had UTF-8-tagged strings while a few others were left untagged, i.e. binary (ASCII-8BIT).

Stuff like Language::Node#name gets surfaced in different ways by introspection queries. This leads to some binary-tagged strings leaking into the execution result.

I figured fixing the encoding at the source would be the best way to solve this. It looks like some of these strings are already being retagged after they are repacked. https://github.com/rmosolgo/graphql-ruby/blob/571ceab99f3076f269a3d92cc133d559151e3bfd/lib/graphql/language/lexer.rl#L190
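For context, a minimal sketch of the Ruby behavior involved (example data is mine, not from the lexer): Array#pack("c*") always returns an ASCII-8BIT ("binary") string, and String#force_encoding just retags those bytes as UTF-8 without copying or transcoding them.

```ruby
# Repacking bytes yields a binary-tagged string; force_encoding retags it.
bytes = "résumé".unpack("c*")        # signed byte values of a UTF-8 string
raw   = bytes.pack("c*")
raw.encoding                         # => #<Encoding:ASCII-8BIT> ("binary")
utf8  = raw.force_encoding("UTF-8")  # retags in place; no bytes are changed
utf8.encoding                        # => #<Encoding:UTF-8>
utf8 == "résumé"                     # => true
```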

Does the parser even support parsing input strings in anything else besides UTF-8?

@rmosolgo
Owner
rmosolgo commented Feb 3, 2017

anything else besides UTF-8

🎉 It's undefined behavior 😬 Do you think it's worth specifying in the test suite, or rejecting non-UTF-8 strings?

@josh
Contributor
josh commented Feb 3, 2017

It looks like the parser always operates on the string assuming UTF-8 bytes, regardless of how the string is tagged. That's probably fine. Most of these query strings will likely be coming from IO somehow, and people usually screw up tagging those correctly. The strict thing to do would be to raise unless it's given a UTF-8-tagged string, since that's what the spec requires. But being liberal with the inputs is probably acceptable here. As long as the parsed AST results in UTF-8 strings, it's sane for an application to deal with.
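For illustration, here's a hypothetical sketch of the stricter option (not what this PR does; check_query_encoding is a made-up helper name):

```ruby
# Hypothetical strict variant: reject query strings that aren't already
# tagged UTF-8, since the spec defines GraphQL documents as Unicode text.
def check_query_encoding(query_string)
  unless query_string.encoding == Encoding::UTF_8
    raise ArgumentError,
      "GraphQL query strings must be UTF-8, got #{query_string.encoding.name}"
  end
  query_string
end
```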

@@ -1004,7 +1004,7 @@ def self.record_comment(ts, te, meta)
def self.emit(token_name, ts, te, meta)
meta[:tokens] << token = GraphQL::Language::Token.new(
name: token_name,
- value: meta[:data][ts...te].pack("c*"),
+ value: meta[:data][ts...te].pack("c*").force_encoding("UTF-8"),
@josh
josh Feb 3, 2017 Contributor

Actually, it might be better to just change this to repack as UTF-8 with .pack("U*"). That results in the string being tagged correctly.
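Roughly, the difference between the two directives (illustrative only; whether "U*" is appropriate depends on whether meta[:data] holds raw bytes or codepoints, which I'm assuming rather than checking here):

```ruby
# "c*" packs integers as raw signed bytes, tagged ASCII-8BIT;
# "U*" treats each integer as a Unicode codepoint and emits UTF-8.
[233].pack("c*")   # => "\xE9" (binary-tagged single byte)
[233].pack("U*")   # => "é"    (UTF-8-tagged)

# But if the array holds signed bytes of a UTF-8 string rather than
# codepoints, "U*" is the wrong directive for multibyte characters:
"é".unpack("c*")   # => [-61, -87], which pack("U*") can't interpret as codepoints
```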

@josh
josh Feb 3, 2017 Contributor

Never mind, that screws up some Unicode BOM parsing. I tried it and got some failing tests.

@rmosolgo rmosolgo merged commit 0549068 into rmosolgo:master Feb 3, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
@josh josh deleted the josh:lexer-encodings branch Feb 3, 2017
@rmosolgo
Owner
rmosolgo commented Feb 3, 2017

Added a test for UTF-8 comments in 11b8c74

@rmosolgo
Owner
rmosolgo commented Feb 3, 2017

Thanks for tracking this down!

@rmosolgo rmosolgo modified the milestone: 1.4.3 Feb 8, 2017
@rmosolgo
Owner
rmosolgo commented Feb 8, 2017

🚢 !

@josh
Contributor
josh commented Feb 8, 2017

Thanks!

This is going to be a fun upgrade to try to get running on github.com. I'm sure there are a number of spots where we're screwing this up. Git data + encodings = fun times.

@josh
Contributor
josh commented Feb 8, 2017

Oh, wrong thread. I thought I was replying to #517.
