handle incorrect encoding names #45

Closed
himynameisjonas opened this Issue Nov 10, 2011 · 3 comments

@himynameisjonas

I found a site that returns an incorrect encoding name in the response header:

Content-Type: text/html; charset=iso-88591

But the response body specifies a correct encoding:

<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />

Is it possible for Patron to handle this incorrect encoding name in the header and fall back on the charset specified in the response body?

As of now I get the following error:

ArgumentError: unknown encoding name - iso-88591
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:69:in `force_encoding'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:69:in `convert_to_default_encoding!'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:42:in `block in initialize'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:41:in `each'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:41:in `initialize'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/session.rb:222:in `handle_request'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/session.rb:222:in `request'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/session.rb:125:in `get'
@toland toland was assigned Nov 21, 2011
@toland
Owner
toland commented Nov 21, 2011

Patron currently doesn't parse the content for anything and I would like to keep it that way. What I will do is make the charset coercion optional and allow users to specify which Content-Types should be coerced and a fallback type.
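
For illustration, header-only coercion with a fallback could look roughly like the sketch below. This is not Patron's code: `charset_from_content_type` and `coerce_body` are hypothetical helper names, and using `Encoding::BINARY` as the default fallback is an assumption. It relies only on the fact that `Encoding.find` raises `ArgumentError` for an unknown name such as `iso-88591`.

```ruby
# Hypothetical helpers, not part of Patron.

# Return the Encoding named in a Content-Type header value, or the
# fallback when no charset is given or the name is unknown to Ruby.
def charset_from_content_type(content_type, fallback = Encoding::BINARY)
  name = content_type.to_s[/charset=["']?([^;"'\s]+)/i, 1]
  return fallback if name.nil?

  Encoding.find(name)
rescue ArgumentError
  # Raised for names Ruby does not know, e.g. "iso-88591".
  fallback
end

# Tag a copy of the body with the chosen encoding without touching
# the bytes themselves.
def coerce_body(body, content_type, fallback = Encoding::BINARY)
  body.dup.force_encoding(charset_from_content_type(content_type, fallback))
end
```

With this approach the bad header from the report no longer raises; the body is simply tagged with whatever fallback the caller chose.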

@sesam
sesam commented Apr 12, 2013

Is it optional now? (Docs?)
Edit: my particular issue was solved by setting session.default_response_charset.

@julik
Collaborator
julik commented Apr 15, 2016

This needs to be tackled a bit more broadly IMO. The problem you saw is not unique: in Russia, for example, there was a long-standing custom of forcing charset headers onto pages that had an entirely different charset specified in the HTML. Parsing HTML is out of scope for Patron (I agree with @toland on this), but I think there could be an extra method on the Response, called something like binary_body, for cases where encoding detection failed or didn't work for some reason. The user would then be able to fall back to handling the encoding manually, covering all possible scenarios such as a corpus-based charset guesser, an HTML parser, or whatevs.
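
As a rough sketch of what that manual handling might look like on the user's side, assuming the raw bytes are available as a binary string (for example from the proposed binary_body, which is only a suggestion in this thread): `guess_encoding` and `meta_charset` are hypothetical names, and the regex sniffing is a deliberately crude stand-in for a real HTML parser or charset guesser.

```ruby
# Hypothetical user-side helpers, not part of Patron.

# Try candidate encoding names in order and return the first one that
# Ruby knows and that yields a valid string for the given bytes.
def guess_encoding(raw, candidates)
  candidates.compact.each do |name|
    begin
      enc = Encoding.find(name)
    rescue ArgumentError
      next # unknown name such as "iso-88591"
    end
    return enc if raw.dup.force_encoding(enc).valid_encoding?
  end
  Encoding::BINARY
end

# Crude sniff of a charset declaration in the first 1024 bytes of the body.
def meta_charset(raw)
  head = raw.byteslice(0, 1024).force_encoding(Encoding::BINARY)
  head[/charset=["']?([\w-]+)/i, 1]
end

raw = "<meta charset=iso-8859-1>caf\xE9".b
encoding = guess_encoding(raw, ["iso-88591", meta_charset(raw), "UTF-8"])
text = raw.dup.force_encoding(encoding)
```

The caller controls the order of candidates, so when the header is known to be unreliable the charset sniffed from the HTML can be tried first.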

@julik julik closed this Apr 15, 2016