URI malformed when trying to get URL with URL-encoded chars #420

k03mad · 2017-11-17T15:53:12Z

Hello,

Catched URI malformed error in new "got", when I trying to send request to URL with URL-encoded chars.
URL examples:
https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/
https://www.kinopoisk.ru/news/keyword/%C7%E2%E5%E7%E4%ED%FB%E5+%E2%EE%E9%ED%FB/

nodejs: 9.2.0
got: 8.0.0

Failed code:

const got = require('got');

(async () => {
	try {
		const response = await got('https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/');
		console.log(response);
	} catch (error) {
		console.log(error);
	}
})();

URIError: URI malformed
    at decodeURI (<anonymous>)
    at module.exports (/Users/kirill-m/git/test/node_modules/normalize-url/index.js:87:21)
    at /Users/kirill-m/git/test/node_modules/cacheable-request/src/index.js:43:16
    at get (/Users/kirill-m/git/test/node_modules/got/index.js:98:20)
    at Promise.resolve.then.size (/Users/kirill-m/git/test/node_modules/got/index.js:274:5)
    at <anonymous>

Broken at this commit: 3c79205

Any ideas?

The text was updated successfully, but these errors were encountered:

brandon93s · 2017-11-18T03:31:04Z

Explanation of failure:

This website is windows-1252 encoded which is unsupported by js-native decode utilities which operate on and assume UTF-8 input. The encoded portion of the provided URLs contain sequences that are invalid for UTF-8 encoding, and as a result cannot be decoded properly. This error can be reproduced in any browser console or repl properly implementing the spec (e.g. repl.it) which is expecting UTF-8.

Take %D2%E0%EB%EB%E8%ED as an example which represents Òàëëèí in windows-1252 encoding. The equivalent in UTF-8 would be %C3%92%C3%A0%C3%AB%C3%AB%C3%A8%C3%AD giving a URL of https://www.kinopoisk.ru/community/city/%C3%92%C3%A0%C3%AB%C3%AB%C3%A8%C3%AD/. Unfortunately, this URL won't work due to the encoding of the website.

Why this fails now, but not before:

The introduction of caching via lukechilds/cacheable-request introduced the package sindresorhus/normalize-url which uses decodeURI internally. This module could perform a best-effort decoding - falling back to the encoded value - when the string is not UTF-8 encoded. This would allow URLs that happen to be encoded unexpectedly to process successfully.

I don't think a fix, if any, would be applied here directly in Got.

brandon93s · 2017-11-18T04:06:40Z

According to RFC 3986, UTF-8 encoding of URLs is spec and a requirement. The widely used Express will also throw on non UTF-8 encoded URLs. Got throwing is now enforcing URLs to be spec compliant before making a request, which isn't necessarily a bad thing.

sindresorhus · 2017-11-19T16:15:12Z

Thanks for elaborating @brandon93s. I think the correct fix here is to detect the case early in Got and throw a user-friendly error about the URL having an invalid encoding.

brandon93s · 2017-12-09T05:35:36Z

@sindresorhus Are we okay with a brute force try...catch around a decodeURI early-ish on in normalizeArguments to catch any potential errors with a user-friendly message? Any decodeURI failure will present an issue, so we might as well check upfront and inform the consumer!

Glad to implement...

sindresorhus · 2017-12-09T10:21:18Z

Yes

reschke · 2023-01-25T10:39:09Z

FWIW, the assumption that any valid http(s) URI must be a UTF-8 octet sequence after percent-decoding is incorrect.

(Speaking as co-author of the HTTP specification(s))

duerst · 2023-01-30T04:02:05Z

For a generic HTTP library, not enforcing http/https URLs to be UTF-8 is the right decision. But such a library should make it easy to use UTF-8 for URIs, And wherever possible, servers should use UTF-8 for their URIs if they contain non-ASCII characters, and should use a suitable baseXX encoding for binary data such as digital signatures and the like.

Btw, contrary to what @Brandon93 says at the start of this thread, https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/ is not in Windows-1252 (Western Europe), but in Windows-1251 (Russia). This of course makes sense because the site has a Russian domain name. The city is Таллин, in Latin letters this is Tallin. You can easily check this by using the URL in a browser. Using Windows-1252 makes no sense because there is no language that contains words like "Òàëëèí" (accented vowels only).

This shows the advantage of using UTF-8. It avoids the mess of regional encodings, and because of its internal structure cannot easily be mistaken for some other encoding.

sindresorhus added enhancement This change will extend Got features ✭ help wanted ✭ labels Nov 19, 2017

brandon93s mentioned this issue Dec 18, 2017

Validate that url is UTF-8 encoded #441

Merged

sindresorhus closed this as completed in #441 Dec 23, 2017

brandon93s mentioned this issue Jan 5, 2018

Resolve Dependency correctly if the target asset is URI-encoded parcel-bundler/parcel#401

Merged

ckcr4lyf mentioned this issue Jan 25, 2023

seed some seeders in redis ckcr4lyf/kiryuu-gauge#5

Merged

ckcr4lyf mentioned this issue Jan 25, 2023

Do not enforce URLs to be UTF-8 #2208

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

URI malformed when trying to get URL with URL-encoded chars #420

URI malformed when trying to get URL with URL-encoded chars #420

k03mad commented Nov 17, 2017 •

edited

brandon93s commented Nov 18, 2017 •

edited

brandon93s commented Nov 18, 2017

sindresorhus commented Nov 19, 2017

brandon93s commented Dec 9, 2017

sindresorhus commented Dec 9, 2017

reschke commented Jan 25, 2023

duerst commented Jan 30, 2023

URI malformed when trying to get URL with URL-encoded chars #420

URI malformed when trying to get URL with URL-encoded chars #420

Comments

k03mad commented Nov 17, 2017 • edited

brandon93s commented Nov 18, 2017 • edited

brandon93s commented Nov 18, 2017

sindresorhus commented Nov 19, 2017

brandon93s commented Dec 9, 2017

sindresorhus commented Dec 9, 2017

reschke commented Jan 25, 2023

duerst commented Jan 30, 2023

k03mad commented Nov 17, 2017 •

edited

brandon93s commented Nov 18, 2017 •

edited