Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use java.net.URLDecoder to fix UTF-8 related issues #108

Merged
merged 1 commit into from Feb 23, 2017

Conversation

sinkuu
Copy link
Contributor

@sinkuu sinkuu commented Feb 11, 2017

Hi, I found MultiProtocolURL.unescape has a bug that eliminates some characters from non-ASCII input. i.e. new MultiProtocolURL("URL containing non-ASCII") sometimes breaks URL.

Example: new MultiProtocolURL("http://www.example.com/q=東方儚月抄").toString() => "http://www.example.com/q=東儚月"

This PR changes MultiProtocolURL to use the URL decoding functionality from the standard library.

@luccioman
Copy link
Member

Hi @sinkuu, thanks for your report and also for detailing a bit more your example.

Given the core nature of this code, we should be careful with any eventual side-effects. Would you have also some higher level failing scenarios examples to provide?

@sinkuu
Copy link
Contributor Author

sinkuu commented Feb 14, 2017

URLDecoder.decode seems to be equivalent to unescape (except this bug), but escape does more than URLEncoder.encode (detecting &, and leaving some more punctuations as-is).

I've reverted escape change.

Would you have also some higher level failing scenarios examples to provide?

I found this bug via OpenSearch integration - when I search with a keyword containing non-ASCII, broken query is sent to OpenSearch provider.

@sinkuu sinkuu changed the title Use java.net.URL{Decoder,Encoder} to fix UTF-8 related issues Use java.net.URLDecoder to fix UTF-8 related issues Feb 14, 2017
@luccioman
Copy link
Member

Ok thanks, I also effectively reproduce this bug with the OpenSearch connector.
I also found that remote Solr requests to other peers do not suffer from this as the Solrj client internally uses URLDecoder and URLEncoder.

@luccioman luccioman merged commit a46b232 into yacy:master Feb 23, 2017
luccioman added a commit that referenced this pull request Feb 23, 2017
After @sinkuu Pull Request #108 added JUnit tests, updated some JavaDoc
and also improved URL tokenization to support non ASCII characters.
@luccioman
Copy link
Member

Hi @sinkuu , I found no regression after running hopefully enough high-level tests. By the way I added some unit tests and modifications to your proposal : if you have some time to perform some tests yourself, your feedback will be welcome.
Best regards

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants