Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URI malformed when trying to get URL with URL-encoded chars #420

Closed
k03mad opened this issue Nov 17, 2017 · 7 comments
Closed

URI malformed when trying to get URL with URL-encoded chars #420

k03mad opened this issue Nov 17, 2017 · 7 comments
Labels
enhancement This change will extend Got features ✭ help wanted ✭

Comments

@k03mad
Copy link

k03mad commented Nov 17, 2017

Hello,

Catched URI malformed error in new "got", when I trying to send request to URL with URL-encoded chars.
URL examples:
https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/
https://www.kinopoisk.ru/news/keyword/%C7%E2%E5%E7%E4%ED%FB%E5+%E2%EE%E9%ED%FB/

nodejs: 9.2.0
got: 8.0.0

Failed code:

const got = require('got');

(async () => {
	try {
		const response = await got('https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/');
		console.log(response);
	} catch (error) {
		console.log(error);
	}
})();
URIError: URI malformed
    at decodeURI (<anonymous>)
    at module.exports (/Users/kirill-m/git/test/node_modules/normalize-url/index.js:87:21)
    at /Users/kirill-m/git/test/node_modules/cacheable-request/src/index.js:43:16
    at get (/Users/kirill-m/git/test/node_modules/got/index.js:98:20)
    at Promise.resolve.then.size (/Users/kirill-m/git/test/node_modules/got/index.js:274:5)
    at <anonymous>

Broken at this commit: 3c79205

Any ideas?

@brandon93s
Copy link
Contributor

brandon93s commented Nov 18, 2017

Explanation of failure:

This website is windows-1252 encoded which is unsupported by js-native decode utilities which operate on and assume UTF-8 input. The encoded portion of the provided URLs contain sequences that are invalid for UTF-8 encoding, and as a result cannot be decoded properly. This error can be reproduced in any browser console or repl properly implementing the spec (e.g. repl.it) which is expecting UTF-8.

Take %D2%E0%EB%EB%E8%ED as an example which represents Òàëëèí in windows-1252 encoding. The equivalent in UTF-8 would be %C3%92%C3%A0%C3%AB%C3%AB%C3%A8%C3%AD giving a URL of https://www.kinopoisk.ru/community/city/%C3%92%C3%A0%C3%AB%C3%AB%C3%A8%C3%AD/. Unfortunately, this URL won't work due to the encoding of the website.


Why this fails now, but not before:

The introduction of caching via lukechilds/cacheable-request introduced the package sindresorhus/normalize-url which uses decodeURI internally. This module could perform a best-effort decoding - falling back to the encoded value - when the string is not UTF-8 encoded. This would allow URLs that happen to be encoded unexpectedly to process successfully.

I don't think a fix, if any, would be applied here directly in Got.

@brandon93s
Copy link
Contributor

According to RFC 3986, UTF-8 encoding of URLs is spec and a requirement. The widely used Express will also throw on non UTF-8 encoded URLs. Got throwing is now enforcing URLs to be spec compliant before making a request, which isn't necessarily a bad thing.

@sindresorhus
Copy link
Owner

Thanks for elaborating @brandon93s. I think the correct fix here is to detect the case early in Got and throw a user-friendly error about the URL having an invalid encoding.

@sindresorhus sindresorhus added enhancement This change will extend Got features ✭ help wanted ✭ labels Nov 19, 2017
@brandon93s
Copy link
Contributor

@sindresorhus Are we okay with a brute force try...catch around a decodeURI early-ish on in normalizeArguments to catch any potential errors with a user-friendly message? Any decodeURI failure will present an issue, so we might as well check upfront and inform the consumer!

Glad to implement...

@sindresorhus
Copy link
Owner

Yes

@reschke
Copy link

reschke commented Jan 25, 2023

FWIW, the assumption that any valid http(s) URI must be a UTF-8 octet sequence after percent-decoding is incorrect.

(Speaking as co-author of the HTTP specification(s))

@duerst
Copy link

duerst commented Jan 30, 2023

For a generic HTTP library, not enforcing http/https URLs to be UTF-8 is the right decision. But such a library should make it easy to use UTF-8 for URIs, And wherever possible, servers should use UTF-8 for their URIs if they contain non-ASCII characters, and should use a suitable baseXX encoding for binary data such as digital signatures and the like.

Btw, contrary to what @Brandon93 says at the start of this thread, https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/ is not in Windows-1252 (Western Europe), but in Windows-1251 (Russia). This of course makes sense because the site has a Russian domain name. The city is Таллин, in Latin letters this is Tallin. You can easily check this by using the URL in a browser. Using Windows-1252 makes no sense because there is no language that contains words like "Òàëëèí" (accented vowels only).

This shows the advantage of using UTF-8. It avoids the mess of regional encodings, and because of its internal structure cannot easily be mistaken for some other encoding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement This change will extend Got features ✭ help wanted ✭
Projects
None yet
Development

No branches or pull requests

5 participants