handle incorrect encoding names #45

Closed
himynameisjonas opened this issue Nov 10, 2011 · 3 comments
@himynameisjonas

I found a site that returns an incorrect encoding name in the response header:

Content-Type: text/html; charset=iso-88591

But the response body has a correct encoding specified

<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />

Is it possible that Patron could handle this incorrect encoding name in header and fall back on the charset specified in the response body?

As of now I get the following error:

ArgumentError: unknown encoding name - iso-88591
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:69:in `force_encoding'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:69:in `convert_to_default_encoding!'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:42:in `block in initialize'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:41:in `each'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/response.rb:41:in `initialize'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/session.rb:222:in `handle_request'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/session.rb:222:in `request'
  from /home/username/shared/bundle/ruby/1.9.1/gems/patron-0.4.16/lib/patron/session.rb:125:in `get'
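
For reference, the error can be reproduced without Patron, since it comes from Ruby's own String#force_encoding. The snippet below is a minimal sketch: `body` is a stand-in for the response bytes and `encoding_known?` is a hypothetical helper, not something Patron provides.

```ruby
# Stand-in for the raw response bytes ("café" encoded as ISO-8859-1).
body = "caf\xE9".b

# Hypothetical helper: check a charset name against the names and aliases
# Ruby knows about, without raising.
def encoding_known?(name)
  Encoding.name_list.any? { |known| known.casecmp?(name) }
end

# The misspelled name from the header raises ArgumentError.
begin
  body.dup.force_encoding("iso-88591")
rescue ArgumentError => e
  error_message = e.message
end

# The correctly spelled name from the <meta> tag works as expected.
decoded = body.dup.force_encoding("iso-8859-1").encode("UTF-8")
```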
@ghost ghost assigned toland Nov 21, 2011
@toland
Owner

toland commented Nov 21, 2011

Patron currently doesn't parse the content for anything and I would like to keep it that way. What I will do is make the charset coercion optional and allow users to specify which Content-Types should be coerced and a fallback type.
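
A rough sketch of what such optional coercion could look like, for illustration only. The names `COERCED_TYPES`, `FALLBACK_CHARSET` and `charset_for` are assumptions made up for this example and are not Patron's API; the idea is simply that only listed Content-Types are coerced, and an unknown or missing charset falls back to a configured one.

```ruby
# Hypothetical configuration; these constants are not part of Patron.
COERCED_TYPES = ["text/html", "text/plain"].freeze
FALLBACK_CHARSET = "ISO-8859-1"

# Returns the Encoding to apply to a body, or nil if the Content-Type
# is not in the list of types that should be coerced.
def charset_for(content_type, coerced_types: COERCED_TYPES, fallback: FALLBACK_CHARSET)
  return nil if content_type.nil?

  media_type, *params = content_type.split(";").map(&:strip)
  return nil unless coerced_types.include?(media_type.to_s.downcase)

  # Pick the first charset=... parameter, if any, and strip quotes.
  declared = params.map { |param| param[/\Acharset=(.+)\z/i, 1] }.compact.first
  declared = declared.delete('"') if declared

  begin
    Encoding.find(declared || fallback)
  rescue ArgumentError
    # Unknown name such as "iso-88591": use the fallback instead of raising.
    Encoding.find(fallback)
  end
end
```

With these assumptions, `charset_for("text/html; charset=iso-88591")` yields the fallback encoding rather than an ArgumentError, and a type outside the list (for example `image/png`) is left alone.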

@sesam

sesam commented Apr 12, 2013

Is it optional now? (Docs?)
Edit: My particular issue was solved by setting session.default_response_charset.

@julik
Collaborator

julik commented Apr 15, 2016

This needs to be tackled a bit more broadly IMO. The problem you saw is not unique: in Russia, for example, there was long a custom of forcing charset headers onto pages that had an entirely different charset specified in the HTML. Parsing HTML is out of scope for Patron (I agree with @toland on this), but I think there might be an extra method on the Response called something like binary_body for the cases when the encoding detection failed for some reason. Then the user would be able to revert to handling the encoding manually, including all possible scenarios such as a corpus-based charset guesser, or an HTML parser, or whatever.
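
To illustrate the manual handling described above, here is a minimal sketch of what a caller might do with the raw bytes. It assumes the bytes are already in hand (`raw` stands in for whatever a binary_body-style accessor would return); `known_encoding`, `decode_body` and the `META_CHARSET` pattern are hypothetical, and the regex is a crude stand-in for a real HTML parser or charset guesser.

```ruby
# Very rough <meta ... charset=...> sniffing; an assumption for this sketch.
META_CHARSET = /<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)/i

# Look up an encoding by name, returning nil for nil or unknown names.
def known_encoding(name)
  name && Encoding.find(name)
rescue ArgumentError
  nil
end

# Try the header charset, then the charset declared in the markup, then a
# fallback. Returns UTF-8 text, or the untouched bytes if nothing fits.
def decode_body(raw, header_charset, fallback: Encoding::UTF_8)
  bytes = raw.b
  candidates = [
    known_encoding(header_charset),
    known_encoding(bytes[META_CHARSET, 1]),
    fallback
  ].compact

  candidates.each do |encoding|
    begin
      text = bytes.dup.force_encoding(encoding)
      return text.encode(Encoding::UTF_8) if text.valid_encoding?
    rescue EncodingError
      next # this candidate cannot be converted; try the next one
    end
  end
  bytes
end
```

For the site from the original report, the misspelled header name would be skipped and the charset from the meta tag would be used instead.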

@julik julik closed this as completed Apr 15, 2016