Hi,
this module is incredibly good, but it cannot handle domain names with
(German) "Umlaute" (Ä, Ö, Ü, ...). Any ideas how to deal with this problem?
Thanks,
Felix.
Original issue reported on code.google.com by feliz...@gmx.de on 21 Jan 2010 at 2:32
Hi Felix,
if I understand correctly, you called DefaultExtractor#getText(URL) with a URL
like "http://www.äöü.xyz/".
This seems to be unsupported by Java 6 (see
http://java.sun.com/docs/books/tutorial/i18n/network/iri.html
). In particular, what you passed was an IRI, not a URL.
A workaround for now could be to create the URL like this:
URL u = new URL("http://"+IDN.toASCII("www.äöü.xyz")+"/");
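As a slightly fuller sketch of that workaround, the conversion can be wrapped in a small helper. The class name IdnUrls and the method toAsciiUrl are made up for illustration and are not part of boilerpipe; only java.net.IDN and java.net.URI from the JDK are used. The umlauts are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper, not part of boilerpipe.
public class IdnUrls {

    // Converts a Unicode host name to its ASCII (Punycode) form
    // and builds a URL from the scheme, the converted host and the path.
    public static URL toAsciiUrl(String scheme, String unicodeHost, String path)
            throws URISyntaxException, MalformedURLException {
        String asciiHost = IDN.toASCII(unicodeHost);
        return new URI(scheme, asciiHost, path, null).toURL();
    }

    public static void main(String[] args) throws Exception {
        // "www.\u00e4\u00f6\u00fc.xyz" is "www.äöü.xyz"
        URL u = toAsciiUrl("http", "www.\u00e4\u00f6\u00fc.xyz", "/");
        System.out.println(u);
    }
}
```

The resulting URL can then be passed to getText(URL) as before; only the host part is converted, the path is left untouched.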
However, since the getText(URL) method is explicitly marked as "showcase
only", you might also consider
using a dedicated HTTP client library, such as HttpClient
(http://hc.apache.org/), and calling getText(InputSource)
instead.
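A rough sketch of the hand-off, assuming the response body has already been fetched by whatever HTTP client you use: the class name FetchedPage and the method toInputSource are hypothetical, and the fetch itself is left out. Only org.xml.sax.InputSource from the JDK is used here.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of boilerpipe or HttpClient.
public class FetchedPage {

    // Wraps an already fetched response body in an InputSource and
    // records the character encoding reported by the server.
    public static InputSource toInputSource(InputStream body, String charsetName) {
        InputSource source = new InputSource(body);
        source.setEncoding(charsetName);
        return source;
    }

    public static void main(String[] args) {
        // Stand-in for a response body obtained from an HTTP client.
        byte[] html = "<html><body><p>Hello</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        InputSource source = toInputSource(new ByteArrayInputStream(html), "UTF-8");
        // With boilerpipe on the classpath, 'source' would be passed to
        // the extractor's getText(InputSource) at this point.
        System.out.println(source.getEncoding());
    }
}
```

This keeps redirects, timeouts, proxies and error handling in the HTTP client, where they can be configured properly.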
I would not recommend using getText(URL) in a production setup. You will sooner
or later run into problems
that are out of scope for boilerpipe (robots.txt, broken servers, proxies, ...).
Marking as WontFix.
Best,
Christian
Original comment by ckkohl79 on 24 Jan 2010 at 4:09