IDN <-> ACE Domain Names #3

Closed
GoogleCodeExporter opened this issue May 25, 2015 · 1 comment

Comments

@GoogleCodeExporter

Hi,

this module is incredibly good, but it cannot handle domain names with
German umlauts (Ä, Ö, Ü, ...). Any ideas how to deal with this problem?

Thanks,
Felix.


Original issue reported on code.google.com by feliz...@gmx.de on 21 Jan 2010 at 2:32

@GoogleCodeExporter

Hi Felix,

if I understand correctly, you called DefaultExtractor#getText(URL) with a URL 
like "http://www.äöü.xyz/". 
This seems to be unsupported by Java 6 (see 
http://java.sun.com/docs/books/tutorial/i18n/network/iri.html). 
In particular, what you passed was an IRI, not a URL.

A workaround for now could be to create the URL like this:
URL u = new URL("http://"+IDN.toASCII("www.äöü.xyz")+"/");
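To make the workaround a bit more reusable, here is a minimal sketch that converts only the host part to its ACE form with java.net.IDN and then builds the URL from its components. The class name IdnUrlExample and the helper toAsciiUrl are made up for illustration and are not part of boilerpipe; the host is written with Unicode escapes only so the snippet does not depend on the source file encoding.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

public class IdnUrlExample {

    // Hypothetical helper (not part of boilerpipe): converts a Unicode host
    // name to its ASCII-compatible encoding and builds an http URL from it.
    static URL toAsciiUrl(String unicodeHost, String path) throws MalformedURLException {
        // Only the host is converted; the path is passed through unchanged.
        String aceHost = IDN.toASCII(unicodeHost);
        return new URL("http", aceHost, path);
    }

    public static void main(String[] args) throws Exception {
        // a-umlaut, o-umlaut, u-umlaut as Unicode escapes
        URL u = toAsciiUrl("www.\u00e4\u00f6\u00fc.xyz", "/");

        // Prints the URL with an "xn--" host label, usable with getText(URL)
        System.out.println(u);

        // IDN.toUnicode reverses the conversion for display purposes
        System.out.println(IDN.toUnicode(u.getHost()));
    }
}
```

The resulting URL can then be passed to getText(URL) as before. Note that IDN.toASCII only handles the host name; non-ASCII characters in the path or query would still need percent-encoding.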

However, since the getText(URL) method is explicitly marked as "show case 
only", you might also consider 
using a dedicated HTTP client library, such as HttpClient 
(http://hc.apache.org/), and calling getText(InputSource) 
instead.
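As a rough sketch of that second route: once the HTTP client of your choice has fetched the page, the response body can be wrapped in an org.xml.sax.InputSource and handed to the extractor. The class InputSourceExample and the helper toInputSource are hypothetical names, the in-memory byte array merely stands in for a real HTTP response, and the boilerpipe call is left as a comment because it assumes boilerpipe is on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;

public class InputSourceExample {

    // Hypothetical helper: wraps an already fetched response body so that it
    // can be passed to getText(InputSource). The encoding would normally come
    // from the HTTP response headers.
    static InputSource toInputSource(InputStream body, String encoding) {
        InputSource source = new InputSource(body);
        source.setEncoding(encoding);
        return source;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a response body fetched by an HTTP client library.
        byte[] html = "<html><body><p>Hello</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);

        InputSource source = toInputSource(new ByteArrayInputStream(html), "UTF-8");

        // With boilerpipe on the classpath, the next step would be
        // (assumption, not compiled here):
        // String text = de.l3s.boilerpipe.extractors.DefaultExtractor.INSTANCE.getText(source);

        System.out.println(source.getEncoding());
    }
}
```

This keeps everything network-related (IDN hosts, redirects, proxies, timeouts) in the HTTP client, so the extractor only ever sees a stream of HTML.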

I would not recommend using getText(URL) in a production setup. Sooner or 
later you will run into problems 
that are out of scope for boilerpipe (robots.txt, broken servers, proxies, ...).

Marking as WontFix.

Best,
Christian

Original comment by ckkohl79 on 24 Jan 2010 at 4:09

  • Changed state: WontFix
  • Added labels: OpSys-All, Priority-Low
  • Removed labels: Priority-Medium
