Force answer content to be always UTF-8 encoded #11

Open
weppos opened this Issue · 7 comments

4 participants

@weppos
Owner

Internally, WHOIS should always prefer UTF-8 encoding, regardless of the server encoding.

@axic

You might want to check this: http://github.com/axic/whois/commit/955d5157c3b92679e62cca57d469713dedcbe5d1

It implements this feature.

@weppos
Owner

I checked the commit, but it doesn't really close this issue. It only provides a limited solution for two specific TLDs.
I know there are many other TLDs that would benefit from this feature, so I would prefer to apply a single patch/commit rather than a separate list of per-TLD fixes.

Also, this feature definitely needs an extensive test suite.

@weppos
Owner

If you want to work on this feature, I suggest you move to a dedicated branch.
Let me know if you have any updates. I'll be more than happy to integrate your changes into the mainstream repository.

Thanks for your contribution.

@semaperepelitsa

I've stopped getting encoding errors in my app after I passed the whois answer through ActiveSupport::Multibyte::Unicode.tidy_bytes. (It replaces all ISO-8859-1 or CP1252 characters with their UTF-8 equivalents, resulting in a valid UTF-8 string.)
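For readers who want to see the idea without pulling in ActiveSupport, here is a minimal stdlib-only sketch of a similar repair. It is not the tidy_bytes implementation: `repair_to_utf8` is a hypothetical helper name, and it assumes that any byte sequence that is not valid UTF-8 was meant as Windows-1252 (CP1252). It relies on `String#scrub`, which needs Ruby 2.1 or later.

```ruby
# Hypothetical helper, not part of the whois gem or ActiveSupport.
# Keeps byte sequences that are already valid UTF-8 and reinterprets
# everything else as Windows-1252 (an assumption, not a detected fact).
def repair_to_utf8(raw)
  text = raw.dup.force_encoding("UTF-8")
  text.scrub do |bad_bytes|
    bad_bytes.dup
             .force_encoding("Windows-1252")
             .encode("UTF-8", invalid: :replace, undef: :replace)
  end
end

# The first "é" arrives as a single Latin-1 byte, the second as valid UTF-8.
mixed = "caf\xE9 \xC3\xA9".b
puts repair_to_utf8(mixed)  # prints "café é"
```

As the next comment points out, this only helps when the non-UTF-8 bytes really are Latin-1 or CP1252; bytes from other encodings would be converted to the wrong characters without any error.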

@weppos
Owner

Very interesting method. Unfortunately, it's not that simple. There's a very wide range of possible encodings (so far, I counted more than 10) and there are cases where a multipart whois record is returned with several different encodings.

I would love to find a solution that doesn't rely on third-party gems. It will probably be compatible with Ruby 1.9 only, because Ruby 1.8 doesn't have encoding support.
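To illustrate the kind of built-in encoding support referred to here, the sketch below transcodes raw bytes to UTF-8 using only `String#force_encoding` and `String#encode`, both available since Ruby 1.9. The helper name `to_utf8`, the sample record text, and the list of parts are made up for illustration. It assumes the source encoding of each part is already known, which is exactly the hard part discussed in this thread. (`String#b`, used to build the sample bytes, needs Ruby 2.0 or later.)

```ruby
# Hypothetical helper: label the raw bytes with their (known) source
# encoding, then transcode to UTF-8. Bytes that cannot be converted are
# replaced with U+FFFD instead of raising.
def to_utf8(raw, source_encoding)
  raw.dup
     .force_encoding(source_encoding)
     .encode("UTF-8", invalid: :replace, undef: :replace)
end

latin1 = "Inhaber: M\xFCller".b
puts to_utf8(latin1, "ISO-8859-1")  # prints "Inhaber: Müller"

# A multipart record with a different encoding per part can be converted
# part by part and joined afterwards (sample data, not a real response).
parts = [
  ["Registrar: Example\n".b, "US-ASCII"],
  ["Inhaber: M\xFCller\n".b, "ISO-8859-1"]
]
record = parts.map { |bytes, encoding| to_utf8(bytes, encoding) }.join
puts record.encoding  # prints "UTF-8"
```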

@woodrow

Hi @weppos. I was hoping to revive this thread and ask what strategy you think is reasonable for properly handling the various character encodings received from Whois servers around the Internet. In particular I was wondering about providing a list of hints for character encodings returned by well-known Whois servers (i.e. those in lib/definitions), or if you had other thoughts?

@weppos
Owner

> In particular I was wondering about providing a list of hints for character encodings returned by well-known Whois servers (i.e. those in lib/definitions), or if you had other thoughts?

I tried this approach in the past. Unfortunately, it's not very effective. Maintaining that list is such a pain in the ***, because it can be very long and registries may change encodings without notice, so the list quickly goes stale. Also, a few registries are able to return a response in more than one encoding (this is insane, I know).

The only solution is to guess the encoding at runtime.
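One very rough way to picture runtime guessing is a trial loop: label the bytes with each candidate encoding in turn and keep the first one for which they are valid. This is only a sketch under stated assumptions, not how the whois gem does or will do it. `guess_and_convert` and the default candidate list are hypothetical, and the heuristic is weak: single-byte encodings such as ISO-8859-1 accept any byte sequence, so they only make sense as the last candidate.

```ruby
# Hypothetical helper: try each candidate encoding in order and transcode
# with the first one under which the bytes form a valid string.
def guess_and_convert(raw, candidates = %w[UTF-8 ISO-8859-1])
  candidates.each do |name|
    attempt = raw.dup.force_encoding(name)
    next unless attempt.valid_encoding?

    begin
      return attempt.encode("UTF-8")
    rescue EncodingError
      next  # valid bytes, but not convertible to UTF-8; try the next one
    end
  end

  # Nothing matched: keep ASCII and replace every other byte with U+FFFD.
  raw.dup.force_encoding("BINARY")
     .encode("UTF-8", invalid: :replace, undef: :replace)
end

puts guess_and_convert("caf\xC3\xA9".b)  # valid UTF-8, prints "café"
puts guess_and_convert("caf\xE9".b)      # falls back to ISO-8859-1, prints "café"
```

A real detector would need statistical or language-based checks on top of this, because validity alone cannot distinguish between most legacy encodings.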
