handle incorrect encoding names #45

himynameisjonas · 2011-11-10T10:34:54Z

I found a site that returns an incorrect encoding name in the response header:

Content-Type: text/html; charset=iso-88591

But the response body specifies a correct encoding:

<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />

Is it possible that Patron could handle this incorrect encoding name in the header and fall back on the charset specified in the response body?

As of now I get the following error:

ArgumentError: unknown encoding name - iso-88591
  from /home/useraname/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:69:in `force_encoding'
  from /home/useraname/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:69:in `convert_to_default_encoding!'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:42:in `block in initialize'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:41:in `each'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:41:in `initialize'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/session.rb:222:in `handle_request'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/session.rb:222:in `request'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/session.rb:125:in `get'


toland · 2011-11-21T23:46:48Z

Patron currently doesn't parse the content for anything and I would like to keep it that way. What I will do is make the charset coercion optional and allow users to specify which Content-Types should be coerced and a fallback type.
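
For illustration, a minimal sketch of what header-only charset coercion with a fallback could look like. The helpers `resolve_charset` and `coerce_body` are hypothetical names, not Patron's API; the sketch only assumes that `Encoding.find` raises `ArgumentError` for an unknown name, which is the error in the trace above.

```ruby
# Hypothetical helpers, not part of Patron's API.

# Extracts the charset from a Content-Type header value and resolves it to a
# Ruby Encoding. Returns the fallback when the charset is missing or unknown.
def resolve_charset(content_type, fallback = Encoding::ISO_8859_1)
  name = content_type.to_s[/charset=["']?([^\s;"']+)/i, 1]
  return fallback if name.nil?

  Encoding.find(name)
rescue ArgumentError
  # Encoding.find raises ArgumentError for names such as "iso-88591".
  fallback
end

# Tags a copy of the body with the resolved encoding without touching its bytes.
def coerce_body(body, content_type, fallback = Encoding::ISO_8859_1)
  body.dup.force_encoding(resolve_charset(content_type, fallback))
end
```

With this approach the body is never parsed; a bad header name simply results in the caller-supplied fallback being used.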

sesam · 2013-04-12T09:30:39Z

Is it optional now? (Docs?)
Edit: my particular issue was solved by setting session.default_response_charset.

julik · 2016-04-15T10:52:34Z

This needs to be tackled a bit more broadly IMO. The problem you saw is not unique: in Russia, for example, there was a long-standing custom of forcing charset headers onto pages that had an entirely different charset specified in the HTML. Parsing HTML is out of scope for Patron (I agree with @toland on this), but I think there might be an extra method on the Response called something like binary_body for the cases when the encoding detection failed or didn't work for some reason. Then the user would be able to revert to handling the encoding manually, including all possible scenarios such as a corpus-based charset guesser, or an HTML parser, or whatever else.
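
As a rough sketch of the manual handling described here, user code could take the raw bytes and look for a charset declaration near the start of the document. Everything below is an assumption for illustration: `decode_with_meta_charset` is a hypothetical helper, and `raw` stands in for whatever a method like the proposed binary_body would return. A regex scan is also much cruder than a real HTML parser or a charset guesser.

```ruby
# Hypothetical helper; not an existing Patron API.
# `raw` is the undecoded response body as a String of bytes.
def decode_with_meta_charset(raw, fallback = Encoding::UTF_8)
  bytes = raw.dup.force_encoding(Encoding::BINARY)

  # Only inspect the first 1024 bytes for a charset declaration.
  name = bytes[0, 1024][/charset=["']?([A-Za-z0-9._:-]+)/i, 1]

  encoding = begin
    name ? Encoding.find(name) : fallback
  rescue ArgumentError
    # The declared name is unknown to Ruby, so use the fallback.
    fallback
  end

  # Tag the bytes with the chosen encoding; no transcoding happens here.
  bytes.force_encoding(encoding)
end
```

The caller can then transcode explicitly, for example with `encode("UTF-8")`, and decide how to treat invalid byte sequences.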

ghost assigned toland Nov 21, 2011

julik closed this as completed Apr 15, 2016
