Don't force query string normalization #1234

Giotino · 2020-05-07T17:15:52Z

What problem are you trying to solve?

In an url like http://example.org/random?param=SOMETHING~SOMETHING the special character ~ is percent-encoded before the request, resulting in http://example.org/random?param=SOMETHING%7ESOMETHING which is not supported (decoded) by some HTTP servers.

As described by RFC 3986 in section 2.3
"
URI comparison implementations do not always perform normalization prior to comparison. For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.
"

Also in RFC 3986, section 6.2.2.2
"
The percent-encoding mechanism is a frequent source of variance among otherwise identical URIs. In addition to the case normalization issue noted above, some URI producers percent-encode octets that do not require percent-encoding, resulting in URIs that are equivalent to their non-encoded counterparts. These URIs should be normalized by decoding any percent-encoded octet that corresponds to an unreserved character, as described in Section 2.3.
"

The percent-encoding of ~ by got happen because NodeJS follows the "WHATWG URL API" (https://nodejs.org/api/url.html#url_the_whatwg_url_api) which misses ~ from the unreserved characters (https://url.spec.whatwg.org/#interface-urlsearchparams, the Note below the example, and https://url.spec.whatwg.org/#urlencoded-serializing) and, by the way, includes *.

Describe the feature

My proposal is to add a flag to the options to prevent the normalization by skipping the append and delete of "_GOT_INTERNAL_TRIGGER_NORMALIZATION"

Checklist

I have read the documentation and made sure this feature doesn't already exist.

The text was updated successfully, but these errors were encountered:

szmarczak · 2020-05-07T19:56:50Z

Duplicate of #1180 #1202 #1113

szmarczak · 2020-05-07T20:01:39Z

Unfortunately we cannot do anything about this :(

Giotino · 2020-05-08T08:52:05Z

Sorry, I hadn't seen the previous issues, but i think that something can be done.

#1180 and #1202 described that behavior as a bug, i don't think of it as bug, NodeJS decided to go with the WHATWG URL specs, and it follows it.

What can be done is giving the user a choice about the standard (or whether the normalization should take place), maybe with a flag into the options.
example:

got('http://example.org/random?param=SOMETHING~SOMETHING', { rfc3986: true })
or
got('http://example.org/random?param=SOMETHING~SOMETHING', { preventSearchNomalization: true })

Another thing that can be done is to only reimplement the URLSearchParams class to generate the URL as described by the RFC3986 (without giving the user a choice).

The last thing that came to my mind is to remove this code (core/index.ts:713)

if (options.url.search) {
  const triggerSearchParameters = "_GOT_INTERNAL_TRIGGER_NORMALIZATION";

  options.url.searchParams.append(triggerSearchParameters, "");
  options.url.searchParams.delete(triggerSearchParameters);
}

That would be a little hack to the Javascript url parser. It seems that when a url is parsed (new URL('SOME_URL')) the query string is not re-generated (normalized), it stays as in SOME_URL, in fact got has the code above to normalize it.
I don't think this last one is a valid solution and I don't think it will be reliable for the future.

Actually the thing is to have a little discussion about what has to be done: leaving things the way they are blaming the NodeJS specs, or giving the user tools to decide by himself.

All that said, I'm willing to help with the implementation of any solution.

szmarczak · 2020-05-08T09:10:55Z

Unfortunately the WHATWG URL is used in the Node.js http module. Got just enforces the behavior to be always the same. If we didn't do that, the sometimes you'd end up with ~ and sometimes with %7E.

Giotino · 2020-05-08T09:13:34Z

Ok, but what about leaving the query string as it is when it's inside the URL as the http module do?

got('http://example.org/random?param=SOMETHING~SOMETHING') => /random?param=SOMETHING~SOMETHING

got('http://example.org/random', { searchParams: { param: 'SOMETHING~SOMETHING' } }) => `/random?param=SOMETHING%7ESOMETHING`

szmarczak · 2020-05-08T09:25:08Z

That's the behavior we want to avoid. The proper "fix" would be to pass an object instead of a URL instance to the http module.

Giotino · 2020-05-08T09:53:08Z

That's the behavior we want to avoid.

So with got I can't send a request like this:

GET /random?=^~ HTTP/1.1
...

Because /random?=^~ gets normalized to /random?=%5E%7E

Event if it respects the HTTP/1.1 (RFC 2616) and URI (RFC 3986) RFCs?

RFC 3986 section 2.2
"
URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm. If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.
"

RFC 2616 section 6.2.2 describes an HTTP URL as
http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]

szmarczak · 2020-05-08T09:57:57Z

WHATWG URL specs, and it follows it.

The spec is buggy. If you want to get this fixed faster, feel free to send a temporary fix to Node.js. On Got side we just enforce normalization.

Giotino · 2020-05-08T10:47:31Z

My previous example was a little misleading.
What I'm trying to say is that whenever Got find something as URL.search (when parsing the URL) it treats that string as URLSearchParams.

But that's wrong.
If I make a request to an URL like http://example.org/?aaa, Got normalize it to an equivalent of http://example.org/?aaa= and so the server receives /?aaa= as url.

With this last example i think this is going to be a "bug report" instead of a "feature request"

szmarczak · 2020-05-08T11:20:27Z

According to this RFC it's still valid because the value is empty:

However, as query components
are often used to carry identifying information in the form of
"key=value" pairs and one frequently used value is a reference to
another URI, it is sometimes better for usability to avoid percent-
encoding those characters.

Giotino · 2020-05-08T11:30:17Z

The RFC states that query components are often (not always) in the form of "key=value".

szmarczak · 2020-05-08T12:02:24Z

That's right. But ?key= means the same as ?key. Both have no value.

Giotino · 2020-05-08T12:30:22Z

The problem is that I can't find a single standard that states that the query string must be interpreted as key-value and be in the form of key1=value1&key2=value2 (application/x-www-form-urlencoded)

A lot of servers parse it like that (and also offer access to the raw string) but to me it seems not to be mandatory.
I think most servers do that because it's the preferred (most used) way.

Event the WHATWG URL spec does not states that the query string must be application/x-www-form-urlencoded, but it propose a standard to apply to it.

Giotino · 2020-05-09T11:33:55Z

I found two replies on stackoverflow and stackexchange that pretty much sum up what I've said about the "query component" in an URI.

https://stackoverflow.com/questions/39266970/what-is-the-difference-between-url-parameters-and-query-strings#39294675
https://softwareengineering.stackexchange.com/questions/260119/different-operators-in-query-string-like-var-value#260205

szmarczak · 2020-05-09T11:38:46Z

The spec is buggy. If you want to get this fixed faster, feel free to send a temporary fix to Node.js. On Got side we just enforce normalization.

#1234 (comment)

Giotino · 2020-05-09T11:42:22Z

The spec is fine, but Got tries to enforce application/x-www-form-urlencoded onto the query even if it's not.

On Got side we just enforce normalization.

You are trying to normalize things that should not be normalized (because there's no standard).

szmarczak · 2020-05-09T12:15:00Z

The spec is fine

It's not. Even one of their maintainers says it's broken: https://twitter.com/domenic/status/1257377082704900096

You are trying to normalize things that should not be normalized (because there's no standard).

@sindresorhus One of the workarounds would be to consider the query in the input and the searchParams option mutually exclusive. But that would be a breaking change.

Giotino · 2020-05-09T12:26:03Z

What i really meant with

The spec is fine

was that that's not the problem.

I know this "Feature request" started as "I've a problem with ~, damn WHATWG spec", but after this comment #1234 (comment) , it became a "Bug report" on the incorrect handling of the query string.

I wanted to discuss exhaustively about this issue because, as you said, it would be a breaking change. In particular I think it is going to break the merging of the URL params.

Giotino · 2020-05-11T11:43:12Z

Any news about this issue?

Should this issue be reopened?

Giotino · 2020-05-11T14:29:24Z

May I ask how decodeURI is going to solve this issue?

I think the only correct solution is to leave the URL intact (if given as a string).

BTW, this issue isn't breaking some random small HTTP server, it's breaking compatibility with things like Akamai's "Adaptive Media Delivery"
Example URL: http://domain.akamaihd.net/my/uri/file.mp4?token=exp=1462383360~data=my_signature_data~hmac=2d5a69e9a3d2b6d304958a05afc2c139045adb71f65115861736be5cba98359d
If the = (and ~) inside the token "value" are percent encoded the URL stops working.

szmarczak · 2020-05-11T15:03:38Z

Sorry, I meant decodeURIComponent.

Giotino · 2020-05-11T15:17:27Z

Sorry, I meant decodeURIComponent.

My doubts remain the same as before.

This time you have %## disappearing from the URL. Also this won't fix the appended =.

First example:
URL: http://example.org/?key=some%legit%value, if you call decodeURIComponent('some%legit%value') the result is an URIError.

Second example:
URL: http://example.org/?some_legit_query_string is going to became http://example.org/?some_legit_query_string= which is different.

szmarczak · 2020-05-11T15:32:46Z

First example:

Invalid URL. The % character should be escaped.

Second example:

Will be fixed.

Giotino · 2020-05-11T15:51:21Z

Invalid URL. The % character should be escaped.

It's not invalid, your statement would be true if the format applied to the query string was application/x-www-form-urlencoded, which requires that % character should be escaped.

What about a custom format that don't use percent escapes?
It's still into the specs, as described before HTTP do not have specs about the query string content (except for the # character, as it's the terminator of the query string).

You can argue that application/x-www-form-urlencoded is a spec and Got follows it, but in the homepage I read

Human-friendly and powerful HTTP request library for Node.js

and not

Human-friendly and powerful HTTP (with query string as application/x-www-form-urlencoded) request library for Node.js

szmarczak · 2020-05-11T16:29:30Z

The WHATWG URL standard (query string) specifies that only these characters may be unescaped: https://url.spec.whatwg.org/#url-code-points

decodeURIComponent('some%legit%value') throws an error. The thing that is incorrect in the WHATWG spec is that searchParams assumes that the query is x-www-form-urlencoded.

Giotino · 2020-05-11T16:56:37Z

the WHATWG spec is that searchParams assumes that the query is x-www-form-urlencoded

I don't think that statement is true. (they also make an example against it)

There's why:
There's also search which is the raw string.
Reading the WHATWG spec seems like searchParams is an attempt to read search as x-www-form-urlencoded but I can't find anything about its mandatoriness of that format (meaning that if search it's not x-www-form-urlencoded searchParams has no real meaning)

The WHATWG URL standard (query string) specifies that only these characters may be unescaped

Sorry, I missed it... (it's only written in the implementation of the parser)

szmarczak · 2020-05-11T17:02:09Z

I don't think that statement is true. (they also make an example against it)

Wrong example. They're talking about query string, not search params.

Sorry, I missed it... (it's only written in the implementation of the parser)

No problem, I just figured out that too. Was wondering why decodeURIComponent('some%legit%value') was throwing...

Giotino · 2020-05-11T17:07:26Z

Wrong example. They're talking about query string, not search params.

My bad, I phrased it wrong.
I meant that there's still the search value (which represents the query string) and there's nothing about how to parse it (except the escapes), hence WHATWG is not trying to force a spec on the query string, it's only offering a format/tool (URLSearchParams) that can be used or not.

szmarczak · 2020-05-11T17:09:01Z

WHATWG is not trying to force a spec on the query string

But there are inconsistencies in the normalization...

Giotino · 2020-05-11T17:16:26Z

But there are inconsistencies in the normalization...

I'm sorry, I can't understand what you are referring to, can you explain

stevenvachon · 2020-05-11T20:15:54Z

This occurs with basic auth as well:

const url = new URL('http://host');
url.username = '=user';
url.password = '=pass';
got(new URL(url));
// sends "%3Duser" and "%3Dpass" to `http.request`

szmarczak · 2020-05-11T21:48:45Z

I'm sorry, I can't understand what you are referring to, can you explain

You can pass a perfectly valid urlencoded query in new URL(string) and if you call .append('_', '').delete('_') it will be improperly escaped. Hence those issues.

szmarczak · 2020-05-11T21:49:34Z

@stevenvachon You need to compute the authorization header on your own for now...

Giotino · 2020-05-12T07:39:42Z

You can pass a perfectly valid urlencoded query in new URL(string) and if you call .append('_', '').delete('_') it will be improperly escaped. Hence those issues.

The URL Class of WHATWG that states that searchParams is readonly, but I can't understand if it means that it can't be swapped (readonly reference) or it can't be "touched" (readonly object).

In any case if you are using searchParams (with append and delete) it means that you treat the query string as x-www-form-urlencoded, hence it's correct to apply that normalization.

If you're using x-www-form-urlencoded there's no need to "keep the URL as it is" because that spec define how to encode/decode that string, so it will be fine.

Giotino changed the title ~~Better percent encoding~~ Better serach normalization May 7, 2020

Giotino changed the title ~~Better serach normalization~~ Better search normalization May 7, 2020

szmarczak closed this as completed May 7, 2020

Giotino changed the title ~~Better search normalization~~ Wrong URL query string handling May 9, 2020

szmarczak reopened this May 11, 2020

szmarczak changed the title ~~Wrong URL query string handling~~ decodeURI(url) before sending a request May 11, 2020

szmarczak added the enhancement This change will extend Got features label May 11, 2020

szmarczak changed the title ~~decodeURI(url) before sending a request~~ decodeURIComponent(url) before sending a request May 11, 2020

szmarczak changed the title ~~decodeURIComponent(url) before sending a request~~ decodeURIComponent(query) before sending a request May 11, 2020

szmarczak mentioned this issue May 11, 2020

Don't force query string normalization #1246

Merged

4 tasks

stevenvachon mentioned this issue May 11, 2020

decodeURIComponent basic auth #1249

Closed

2 tasks

Giotino mentioned this issue May 12, 2020

Get an empty body or zero length rawBody in some case #1251

Closed

2 tasks

Giotino changed the title ~~decodeURIComponent(query) before sending a request~~ Don't force query string normalization May 12, 2020

szmarczak closed this as completed in #1246 May 12, 2020

Giotino mentioned this issue Dec 1, 2020

How is the URL build in got? Part 2 #1546

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't force query string normalization #1234

Don't force query string normalization #1234

Giotino commented May 7, 2020 •

edited

Loading

szmarczak commented May 7, 2020

szmarczak commented May 7, 2020

Giotino commented May 8, 2020 •

edited

Loading

szmarczak commented May 8, 2020

Giotino commented May 8, 2020 •

edited

Loading

szmarczak commented May 8, 2020

Giotino commented May 8, 2020 •

edited

Loading

szmarczak commented May 8, 2020

Giotino commented May 8, 2020 •

edited

Loading

szmarczak commented May 8, 2020

Giotino commented May 8, 2020

szmarczak commented May 8, 2020

Giotino commented May 8, 2020 •

edited

Loading

Giotino commented May 9, 2020

szmarczak commented May 9, 2020

Giotino commented May 9, 2020 •

edited

Loading

szmarczak commented May 9, 2020 •

edited

Loading

Giotino commented May 9, 2020 •

edited

Loading

Giotino commented May 11, 2020

Giotino commented May 11, 2020 •

edited

Loading

szmarczak commented May 11, 2020

Giotino commented May 11, 2020 •

edited

Loading

szmarczak commented May 11, 2020

Giotino commented May 11, 2020 •

edited

Loading

szmarczak commented May 11, 2020

Giotino commented May 11, 2020 •

edited

Loading

szmarczak commented May 11, 2020

Giotino commented May 11, 2020

szmarczak commented May 11, 2020 •

edited

Loading

Giotino commented May 11, 2020

stevenvachon commented May 11, 2020 •

edited

Loading

szmarczak commented May 11, 2020

szmarczak commented May 11, 2020

Giotino commented May 12, 2020 •

edited

Loading

Don't force query string normalization #1234

Don't force query string normalization #1234

Comments

Giotino commented May 7, 2020 • edited Loading

What problem are you trying to solve?

Describe the feature

Checklist

szmarczak commented May 7, 2020

szmarczak commented May 7, 2020

Giotino commented May 8, 2020 • edited Loading

szmarczak commented May 8, 2020

Giotino commented May 8, 2020 • edited Loading

szmarczak commented May 8, 2020

Giotino commented May 8, 2020 • edited Loading

szmarczak commented May 8, 2020

Giotino commented May 8, 2020 • edited Loading

szmarczak commented May 8, 2020

Giotino commented May 8, 2020

szmarczak commented May 8, 2020

Giotino commented May 8, 2020 • edited Loading

Giotino commented May 9, 2020

szmarczak commented May 9, 2020

Giotino commented May 9, 2020 • edited Loading

szmarczak commented May 9, 2020 • edited Loading

Giotino commented May 9, 2020 • edited Loading

Giotino commented May 11, 2020

Giotino commented May 11, 2020 • edited Loading

szmarczak commented May 11, 2020

Giotino commented May 11, 2020 • edited Loading

szmarczak commented May 11, 2020

Giotino commented May 11, 2020 • edited Loading

szmarczak commented May 11, 2020

Giotino commented May 11, 2020 • edited Loading

szmarczak commented May 11, 2020

Giotino commented May 11, 2020

szmarczak commented May 11, 2020 • edited Loading

Giotino commented May 11, 2020

stevenvachon commented May 11, 2020 • edited Loading

szmarczak commented May 11, 2020

szmarczak commented May 11, 2020

Giotino commented May 12, 2020 • edited Loading

Giotino commented May 7, 2020 •

edited

Loading

Giotino commented May 8, 2020 •

edited

Loading

Giotino commented May 8, 2020 •

edited

Loading

Giotino commented May 8, 2020 •

edited

Loading

Giotino commented May 8, 2020 •

edited

Loading

Giotino commented May 8, 2020 •

edited

Loading

Giotino commented May 9, 2020 •

edited

Loading

szmarczak commented May 9, 2020 •

edited

Loading

Giotino commented May 9, 2020 •

edited

Loading

Giotino commented May 11, 2020 •

edited

Loading

Giotino commented May 11, 2020 •

edited

Loading

Giotino commented May 11, 2020 •

edited

Loading

Giotino commented May 11, 2020 •

edited

Loading

szmarczak commented May 11, 2020 •

edited

Loading

stevenvachon commented May 11, 2020 •

edited

Loading

Giotino commented May 12, 2020 •

edited

Loading