ArgumentError: invalid byte sequence in UTF-8 #673
👍 |
+1 |
It is a web server's responsibility to translate IO to valid binary representations for the application layer. This isn't the whole picture though: in this case, the webserver has done that - the webserver does not know the encoding of the URI... It is the responsibility of the IETF to define the validity of URI data in various encodings (not done), and so it is not entirely valid for web servers to make no assumptions for this field for the above...
Rack itself uses a binary regular expression here, which expects binary input strings. This is our response to the above subtleties. In normal operation (say, Webrick + Rack), this error is not raised...
The reason that this error is raised in your application is: you have middleware in your stack that is forcing this string to UTF-8, even when it is not valid UTF-8. The code that is doing this is buggy. Observe:
s = "a=\xff"
# => "a=\xFF"
s.force_encoding("binary")
# => "a=\xFF"
s.valid_encoding?
# => true
Rack::Utils.parse_nested_query(s)
# => {"a"=>"\xFF"}
s.force_encoding("utf-8")
# => "a=\xFF"
s.valid_encoding?
# => false
Rack::Utils.parse_nested_query(s)
ArgumentError: invalid byte sequence in UTF-8
from /usr/local/google/home/raggi/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/rack-1.5.2/lib/rack/utils.rb:93:in `split'
from /usr/local/google/home/raggi/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/rack-1.5.2/lib/rack/utils.rb:93:in `parse_nested_query'
from (irb):21
from /usr/local/google/home/raggi/.rbenv/versions/2.0.0-p247/bin/irb:12:in `<main>'
This is a Rails bug. Calls to force_encoding should always assert that their output is valid. |
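For anyone wondering what "assert that their output is valid" could look like in practice, here is a minimal sketch. The helper names `force_utf8_checked` and `force_utf8_scrubbed` are hypothetical (they are not part of Rails or Rack); only `String#force_encoding`, `String#valid_encoding?` and `String#scrub` are real Ruby methods.

```ruby
# Hypothetical helper: re-tag raw bytes as UTF-8, but refuse to hand back
# a string whose bytes are not actually valid UTF-8.
def force_utf8_checked(bytes)
  str = bytes.dup.force_encoding(Encoding::UTF_8)
  raise EncodingError, "input is not valid UTF-8" unless str.valid_encoding?
  str
end

# Hypothetical lenient variant: replace invalid bytes instead of raising.
# String#scrub is available from Ruby 2.1 onwards.
def force_utf8_scrubbed(bytes, replacement = "?")
  bytes.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
end
```

Whether to raise or to scrub is an application decision; the point is only that the string handed to later code is never tagged UTF-8 while containing invalid bytes.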
@raggi Thx :-) sorry for that - cheers! |
This part does default to UTF-8, though. And thanks to that, trying to do a
|
Just leaving this in case of anyone is having trouble with Rails: https://github.com/whitequark/rack-utf8_sanitizer/ ❤️ |
Doing so provides us with no advantages, and has the side effect of triggering [this bug](rack/rack#673) in Rack when said form is submitted in browsers that do not support UTF-8. Note that that enforcement is intended as a protection against users submitting Latin-1 values when we expect Unicode; since this form supports no user-provided values and doesn't persist any data, that's not an issue we need to worry about. Note also that the only browsers for which this error is likely to be generated are browsers that we do not officially support, so an argument could be made for simply ignoring this error. But, I figure, I already got the fix right here.
I get this occasional exception in production, what is the correct solution for this?
|
@rgaufman in your question you ask about what is the correct solution, but I'm not clear on the problem. You reference an exception that occurs due to malformed input. What is the problem you want to solve? |
I suppose this does what I want: https://github.com/whitequark/rack-utf8_sanitizer/ - trying it now to see if these exceptions go away. I'm guessing these exceptions are coming from random hacking attempts? What is the right way to handle these bad requests? Should I use fail2ban or something similar? Are there examples of what to do about these? |
The main issue here is that (with many common setups) Rack will raise this exception when anyone in the world (bots, people testing, search engines) drives by your Rack-powered site and submits a URL with invalid params. This is super annoying if you use an exception collation service. I'm not sure if the issue stems from other parts of the Ruby application sending Rack invalid text, but I've mostly seen it occur with both Puma + Rails and Unicorn + Rails. One or several of those gem authors may not know that they are responsible for sending Rack cleaned-up input or being the guardian process, like you specified in this comment #673 (comment). This error also takes other forms, such as exception messages containing:
|
Another way to state the problem would be: Given that there are usually three "layers" to a Ruby web app:
Which interface is officially responsible for handling invalid UTF-8 or incoming malformed text in URLs or HTTP headers is not agreed on by the different gem authors, which leads to encoding problems sometimes blowing up in the Rack layer (and people creating middleware to just catch/sanitize/discard/422 the requests "manually"). |
IMO, there are really only two layers, webserver and application. The rack layer I think refers to middleware, but middleware are just rack applications that (usually) delegate to other rack applications. Middleware can be written using the same frameworks used to write applications, though I don't think Rails supports operating as middleware. So IMO, there is no difference between the rack and application layers. Also IMO, the application layer is the proper layer for such exception handling. Most applications try to show nice error pages even for invalid input, and in general that is only possible at the application layer. Some applications might only want to deal with UTF-8, and reject input in other encodings (or transcode to UTF-8), while there are other applications (less frequent these days) that deal with non-UTF-8 data, and may want to reject UTF-8 or convert it to another encoding (if possible). I think classes the rack library exposes (e.g. |
In terms of the call stack in @rgaufman's recent first comment, the multipart middleware could catch this exception and handle it gracefully, returning a 400 Bad Request instead. A patch would need to be introduced here: https://github.com/rack/rack/blob/master/lib/rack/multipart.rb#L45. As @jeremyevans rightly notes, a more specific error would be helpful - in the past there's been discussion about requesting that the stdlib itself produce a more specific error, too.
In the more general case, the issue with this exception is that it can come from any use of query parsing in any middleware that uses either the Rack helper libraries that delegate to the stdlib parser, or from use of the stdlib parser itself. The helper libraries cannot return a 400 themselves - the middleware / app callers must do so, so they must catch the exception.
@csuhta your response implies a lot of blame, suggesting that the issue is as simple as a lack of agreement. Many of the code bases involved are themselves encoding agnostic at large; it's once you start making use of encoding-specific features that you run into more interesting challenges. When using a framework like Rails, being sure to use its helpers to generate forms and so on will add explicit declarations to improve the situation. Using, say, part of Rails, but then writing raw form tags yourself or with an entirely separate template system may lead to these situations, as does serving an API and having customers using older HTTP client libraries, or say, defaults on a lot of more enterprise stacks on Windows.
"We" the authors have actually discussed encodings at length; I believe Yehuda even wrote quite a bit about those discussions, back in the day. We've also done some relatively extensive browser behavior research such as https://github.com/rack/multifail.
A more general solution to the "please don't raise exceptions" problem: if you know your application is going to be using these helpers, perform this parse first (hit Request#params) handling the exception. Rack::Request#params caches the parse results, so this should not add significant overhead to your application. Such a middleware would look like so:
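A minimal sketch in that spirit, under stated assumptions. To stay dependency-free it does not call `Rack::Request#params`; instead it checks `QUERY_STRING` directly with the standard library's `URI.decode_www_form_component` and `String#valid_encoding?`, so it covers query strings only, not form or multipart bodies. The class name `RejectInvalidQueryEncoding` is hypothetical, and UTF-8 is assumed to be the only encoding the application accepts.

```ruby
require "uri"

# Hypothetical middleware (not part of Rack): answers 400 Bad Request when
# the query string does not decode to valid UTF-8, so that later parsing
# never sees the malformed input.
class RejectInvalidQueryEncoding
  def initialize(app)
    @app = app
  end

  def call(env)
    return bad_request unless valid_query?(env["QUERY_STRING"].to_s)
    @app.call(env)
  end

  private

  def valid_query?(query)
    # Work on a binary copy so that splitting cannot itself raise
    # "invalid byte sequence in UTF-8".
    query.b.split(/[&;]/).all? do |pair|
      pair.split("=", 2).all? do |part|
        URI.decode_www_form_component(part, Encoding::UTF_8).valid_encoding?
      end
    end
  rescue ArgumentError
    # decode_www_form_component raises ArgumentError on malformed
    # percent-escapes such as a trailing "%".
    false
  end

  def bad_request
    [400, { "content-type" => "text/plain" }, ["Bad Request\n"]]
  end
end
```

It would sit near the top of the stack, e.g. `use RejectInvalidQueryEncoding` in a `config.ru`. An application that wants a nicer error page could render one in place of the plain-text body.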
|
My teams currently use a middleware very much like that, and I agree that a much more specific exception being raised would be really helpful (rescuing all I didn't mean to imply that anyone in particular is to blame or is dropping the ball, just that the solution in this scenario is often unclear to the application author or the responsibilities are not being explicitly set. For example, if you are starting out with Rails development and you choose Puma + Rails (and Rack implicitly), and your web app starts getting even moderate traffic, you will run into these kinds of encoding exceptions just from drive-by visitors, and from there it makes it seem like the error is coming from Rack and the proper pattern to solve it is really only discussed in GitHub threads like these. I agree it should probably be handled by the application layer or middleware, so that you can serve a nice error page or you can attempt to sanitize the string if you really need to be kind to certain incoming requests. A good way to approach it in Rack 3 (at least for my team's use cases) would be a suite of more specific exceptions being raised, and also documentation that the application is responsible for deciding how they deal with the exceptions/strings |
I would accept a PR for a more specific exception. |
The exception is generated from Ruby; it would be great to start there to get a more specific error.
|
This was fixed by c1e5fbb |
Hey guys. Rack goes crazy when you pass it an invalid UTF-8 string.
You can try it by putting this into the app URL:
It will break the app with this error: ArgumentError: invalid byte sequence in UTF-8
More details also here: http://dev.mensfeld.pl/2014/03/rack-argument-error-invalid-byte-sequence-in-utf-8/