I found a site that returns an incorrect encoding name in the response header:
Content-Type: text/html; charset=iso-88591
But the response body specifies the correct encoding:
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />
Could Patron handle an incorrect encoding name in the header and fall back on the charset specified in the response body?
As of now I get the following error:
ArgumentError: unknown encoding name - iso-88591
from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:69:in `force_encoding'
from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:69:in `convert_to_default_encoding!'
from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:42:in `block in initialize'
from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:41:in `each'
from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:41:in `initialize'
from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/session.rb:222:in `handle_request'
from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/session.rb:222:in `request'
from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/session.rb:125:in `get'
Patron currently doesn't parse the content for anything and I would like to keep it that way. What I will do is make the charset coercion optional and allow users to specify which Content-Types should be coerced and a fallback type.
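To make the proposal concrete, here is a rough sketch of what such an opt-in policy might look like in plain Ruby. `CoercionPolicy` and its fields (`enabled`, `content_types`, `fallback`) are hypothetical names invented for illustration, not Patron's API. The sketch only looks at the Content-Type header and never at the body.

```ruby
# Hypothetical policy object; none of these names exist in Patron.
CoercionPolicy = Struct.new(:enabled, :content_types, :fallback) do
  # Returns the Encoding to coerce the body to, or nil to leave it untouched.
  def encoding_for(content_type_header)
    return nil unless enabled

    header = content_type_header.to_s
    media_type = header.split(";").first.to_s.strip.downcase
    return nil unless content_types.include?(media_type)

    # Pull the charset label out of the header, if there is one.
    label = header[/charset=["']?([^\s;"']+)/i, 1]
    return fallback if label.nil?

    begin
      Encoding.find(label)
    rescue ArgumentError
      # Unknown name such as "iso-88591": use the configured fallback.
      fallback
    end
  end
end

policy = CoercionPolicy.new(true, ["text/html"], Encoding::ISO_8859_1)
policy.encoding_for("text/html; charset=iso-88591")
```

With a policy like this, a misspelled charset no longer raises, and non-listed Content-Types are left as they arrived.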
Is it optional now? (Docs?)
Edit: my particular issue was solved by setting session.default_response_charset.
This needs to be tackled a bit more broadly IMO. The problem you saw is not unique: in Russia, for example, it was long customary to force charset headers onto pages that had an entirely different charset specified in the HTML. Parsing HTML is out of scope for Patron (I agree with @toland on this), but I think there could be an extra method on the Response, called something like binary_body, for cases where the encoding detection failed or didn't work for some reason. The user could then fall back to handling the encoding manually, using whatever approach fits: a corpus-based charset guesser, an HTML parser, or something else.
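As an illustration of the manual route, here is a minimal sketch of what a caller might do with the raw bytes if something like binary_body existed. It is plain Ruby and does not touch Patron. `decode_html` is a hypothetical helper, and the `<meta>` regex is a crude stand-in for a real HTML parser or charset guesser.

```ruby
# Hypothetical caller-side helper: try the header charset first, then a
# charset sniffed from a <meta> tag, and give up gracefully otherwise.
def decode_html(raw, header_charset)
  bytes = raw.dup.force_encoding(Encoding::BINARY)

  # Crude sniffing; a real HTML parser would be more robust.
  meta_charset = bytes[/<meta[^>]+charset=["']?([A-Za-z0-9._:-]+)/i, 1]

  [header_charset, meta_charset].compact.each do |label|
    begin
      candidate = bytes.dup.force_encoding(Encoding.find(label))
      return candidate.encode(Encoding::UTF_8) if candidate.valid_encoding?
    rescue ArgumentError, EncodingError
      # Unknown name or failed conversion: try the next candidate.
      next
    end
  end

  bytes # nothing worked, hand back the untouched binary string
end

raw = "<meta http-equiv=\"content-type\" content=\"text/html; charset=iso-8859-1\" />caf\xE9"
text = decode_html(raw, "iso-88591")
```

Here the bogus header value is skipped and the charset from the body is used instead, which is the fallback the original report asked for, done outside the HTTP client.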