Force answer content to be always UTF-8 encoded #11

Closed
weppos opened this Issue Feb 6, 2010 · 8 comments

4 participants

@weppos
Owner

Internally, WHOIS should always prefer UTF-8 encoding, regardless of the server's encoding.

@axic

You might want to check this: http://github.com/axic/whois/commit/955d5157c3b92679e62cca57d469713dedcbe5d1

It implements this feature.

@weppos
Owner

I checked the commit, but it doesn't really close this issue: it only provides a limited solution for two specific TLDs.
I know there are many other TLDs that would benefit from this feature, so I would prefer a single patch/commit over a list of separate per-TLD fixes.

Also, this feature definitely needs an extensive test suite.

@weppos
Owner

If you want to work on this feature, I suggest you move to a dedicated branch.
Let me know if you have any updates; I'll be more than happy to integrate your changes into the main repository.

Thanks for your contribution.

@semaperepelitsa

I've stopped getting encoding errors in my app after I passed the whois answer through ActiveSupport::Multibyte::Unicode.tidy_bytes. (It replaces all ISO-8859-1 or CP1252 characters with their UTF-8 equivalents, resulting in a valid UTF-8 string.)
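
For anyone who wants to try a similar idea without ActiveSupport, here is a rough sketch using only Ruby's standard library (1.9 or later). It is not how tidy_bytes is implemented: tidy_bytes repairs individual bytes, while this version treats the whole answer as either valid UTF-8 or Windows-1252. The `to_utf8` helper name and the Windows-1252 fallback are assumptions made for illustration.

```ruby
# Hypothetical helper, not part of the whois gem or ActiveSupport.
def to_utf8(raw)
  # Keep the answer as-is when its bytes already form valid UTF-8.
  text = raw.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  # Assumption: anything that is not valid UTF-8 is Windows-1252 (CP1252).
  # Bytes that CP1252 leaves undefined are replaced with "?".
  raw.dup.force_encoding(Encoding::Windows_1252)
     .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
end

to_utf8("caf\xE9".b) # => "café"
```

This only helps when the server really answers in UTF-8 or a Latin-1 style encoding; other encodings would be converted into the wrong characters rather than raising an error.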

@weppos
Owner

Very interesting method. Unfortunately, it's not that simple. There's a very wide range of possible encodings (so far, I counted more than 10) and there are cases where a multipart whois record is returned with several different encodings.

I would love to find a solution that doesn't rely on third-party gems. It will probably be compatible with Ruby 1.9 only, because Ruby 1.8 doesn't have encoding support.

@woodrow

Hi @weppos. I was hoping to revive this thread and ask what strategy you think is reasonable for properly handling the various character encodings received from Whois servers around the Internet. In particular I was wondering about providing a list of hints for character encodings returned by well-known Whois servers (i.e. those in lib/definitions), or if you had other thoughts?

@weppos
Owner

> In particular I was wondering about providing a list of hints for character encodings returned by well-known Whois servers (i.e. those in lib/definitions), or if you had other thoughts?

I tried this approach in the past. Unfortunately, it's not very effective. Maintaining that list is such a pain in the ***, because it can be very long and changes on the server side might not be reflected in it immediately. Also, a few registries are able to return a response in more than one encoding (this is insane, I know).

The only solution is to guess the encoding at runtime.
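
To make "guess at runtime" concrete, here is a minimal sketch using only Ruby's standard library (`String#scrub` needs Ruby 2.1 or later). It is not code from the whois gem. The `guess_and_convert` name, the candidate list, and its order are all assumptions for illustration, and validity checks like this can pick the wrong encoding when several candidates accept the same bytes.

```ruby
# Hypothetical candidate list; the order is an assumption, not measured data.
# Single-byte encodings accept almost any byte sequence, so they go last.
CANDIDATES = [
  Encoding::UTF_8,
  Encoding::Shift_JIS,
  Encoding::EUC_KR,
  Encoding::ISO_8859_1
].freeze

# Returns [utf8_string, guessed_encoding]; guessed_encoding is nil on failure.
def guess_and_convert(raw, candidates = CANDIDATES)
  candidates.each do |enc|
    # Reinterpret the raw bytes under the candidate encoding.
    text = raw.dup.force_encoding(enc)
    next unless text.valid_encoding?

    begin
      return [text.encode(Encoding::UTF_8), enc]
    rescue EncodingError
      next # valid bytes, but no UTF-8 mapping: try the next candidate
    end
  end

  # Nothing matched: keep the text readable by replacing undecodable bytes.
  [raw.dup.force_encoding(Encoding::UTF_8).scrub("?"), nil]
end
```

For multipart records that mix encodings, the same function could be called once per part instead of once per record.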

@weppos
Owner

Closing as this is a very old topic and there is no viable solution the client can implement at this time.

@weppos weppos closed this Jun 9, 2016