Skip to content

Force UTF-8 encoding in lexer#4467

Merged
rmosolgo merged 1 commit intomasterfrom
lexer-force-utf-8-encoding
May 12, 2023
Merged

Force UTF-8 encoding in lexer#4467
rmosolgo merged 1 commit intomasterfrom
lexer-force-utf-8-encoding

Conversation

@swalkinshaw
Copy link
Copy Markdown
Collaborator

The rewritten lexer in #4369 removed UTF8 forced encoding.

It's possible for applications to force encode document strings themselves before passing it to the GraphQL gem, but it should be done at this layer to ensure that invalid encoding is properly handled.

Two questions:

  1. is there any reason not to restore this behaviour?
  2. is this the proper place? Previously it was done in 3 separate places, but with the new simpler lexer I believe this should cover it.

cc @tenderlove

@rmosolgo
Copy link
Copy Markdown
Owner

👍 Looks like a good solution to me, especially since it used to work this way. It'd be nice to get a test for it though, so we don't regress again. How did this surface in practice?

@swalkinshaw
Copy link
Copy Markdown
Collaborator Author

Oh yeah I'll add a test.

This surfaced for requests with the application/graphql content type. Since that uses the POST body directly without any JSON parsing (which encodes to UTF8). The raw POST body was being tagged as ASCII-8BIT in some cases causing downstream failures (either at the DB level, or within GraphQL::Types::String which requires UTF8).

The rewritten lexer in
#4369 removed UTF8 forced
encoding.

It's possible for applications to force encoding document strings
themselves before passing it to the GraphQL gem, but it should be done
at this layer to ensure that invalid encoding is properly handled.
@swalkinshaw swalkinshaw force-pushed the lexer-force-utf-8-encoding branch from 3dc76a4 to 4220dc3 Compare May 11, 2023 13:27
@swalkinshaw
Copy link
Copy Markdown
Collaborator Author

Added a test. Without force encoding, the first token would be IDENTIFIER, now its BAD_UNICODE_ESCAPE

@rmosolgo rmosolgo added this to the 2.0.22 milestone May 12, 2023
@rmosolgo
Copy link
Copy Markdown
Owner

Thanks for taking care of this!

@rmosolgo rmosolgo merged commit 0d8ea69 into master May 12, 2023
@swalkinshaw swalkinshaw deleted the lexer-force-utf-8-encoding branch June 19, 2023 19:02
@tenderlove
Copy link
Copy Markdown
Contributor

tenderlove commented Aug 8, 2023

Sorry, I didn't see this issue. I don't think this is the right thing to do. According to the spec, GraphQL queries "are expressed as a sequence of Unicode characters". When someone sends a query to the GraphQL parser, it's up to them to ensure that the input string is valid UTF-8 encoding.

Imagine a situation where someone makes an HTTP Post with a Content-Type header that indicates the incoming POST body is encoded as Shift-JIS. The encoding is correct, there is no "BAD UNICODE ESCAPE". But now GraphQL will insist that there is. The correct way to fix problems like this is to look at the source of the string and whatever out of band information there is that indicates the encoding. In the case of HTTP, that would be the Content-Type header. Using that information, we can correctly tag the ASCII-8BIT string as Shift-JIS, and know to transform it to UTF-8 before handing it to GraphQL.

Using force_encoding without external information is rarely the correct solution. I understand doing the legwork to find where these ASCII-8BIT strings are coming from is a lot of work, but it's the correct way to handle this. My opinion is that GraphQL should simply reject strings that are not tagged as US-ASCII or UTF-8 (or have an invalid_encoding?), and should never call force_encoding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants