Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML-encode the ampersand in the URL #32

Merged
merged 1 commit into from Oct 28, 2019

Conversation

@petdance
Copy link
Contributor

petdance commented Oct 26, 2019

No description provided.

@@ -1,3 +1,3 @@
<link rel="dns-prefetch" href="https://fonts.gstatic.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css?family=Bitter:400,400i,700&display=swap" rel="stylesheet">
<link href="https://fonts.googleapis.com/css?family=Bitter:400,400i,700&amp;display=swap" rel="stylesheet">

This comment has been minimized.

Copy link
@yous

yous Oct 27, 2019

Owner

This line is from https://fonts.google.com, and & is for distinguishing family and display key, with the value Bitter:400,400i,700 and swap. If we encode & to &amp; then the only family key will be parsed with the value Bitter:400,400i,700&display=swap.

This comment has been minimized.

Copy link
@yous

yous Oct 27, 2019

Owner

Is there any problem with this when & remains &?

This comment has been minimized.

Copy link
@yous

yous Oct 27, 2019

Owner

Wait... I think I'm confused with &amp; and %26.

@petdance petdance changed the title URL-encode the ampersand in the URL HTML-encode the ampersand in the URL Oct 27, 2019
@petdance

This comment has been minimized.

Copy link
Contributor Author

petdance commented Oct 27, 2019

My mistake, it needs to be HTML-encoded, not URL-encoded.

Yes, there is a problem with the unencoded ampersand. Consider:

$ cat foo.html
<a href="http://example.com/?this&that">
$ tidy -qe foo.html
line 1 column 34 - Warning: unescaped & or unknown entity "&that"

The & in the URL has to be encoded as &amp; just like any other ampersand in the HTML document.

Note that this change does not change the URL. It's simply encoding it correctly.

@yous

This comment has been minimized.

Copy link
Owner

yous commented Oct 27, 2019

There is an ambiguous ampersand section in HTML5 spec.

An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is followed by one or more alphanumeric ASCII characters, followed by a ";" (U+003B) character, where these characters do not match any of the names given in the named character references section.

But there is no matching named character &d;, &di;, ..., &display, etc. HTML 4.01 is also similar with HTML5: https://mathiasbynens.be/notes/ambiguous-ampersands.

I'll try tidy-html5 against this repository. As it generates some warnings, I'll consider replacing single & to &amp; to be explicit.

@yous

This comment has been minimized.

Copy link
Owner

yous commented Oct 27, 2019

This repository uses html5validator which is based on The Nu Html Checker (v.Nu), it generates errors with following HTML:

<link href="https://fonts.googleapis.com/css?family=Bitter:400,400i,700&copy" rel="stylesheet">
$ html5validator _site/2017/01/02/my-example-post/index.html
ERROR:html5validator.validator:"file:/Users/yous/src/whiteglass/_site/2017/01/02/my-example-post/index.html":40.1-40.72: error: The string following "&" was interpreted as a character reference. ("&" probably should have been escaped as "&amp;".)

As a named character reference &copy; exists. But with the original URL, html5validator doesn't generate errors.

@petdance

This comment has been minimized.

Copy link
Contributor Author

petdance commented Oct 28, 2019

Character entities have to have a semicolon at the end.

If the &copy in the URL is supposed to be the copyright symbol entity, then it should have the semicolon at the end, as &copy;. And if it's not supposed to be the copyright symbol entity, then it should be &amp;copy.

@yous

This comment has been minimized.

Copy link
Owner

yous commented Oct 28, 2019

Yes, right. I meant, if there was an ambiguous ampersand in URL, then html5validator would give some errors. But the actual URL doesn't contain ambiguous ampersand as there is no &dis;, &disp;, etc.

I tried some snippets:

  1. With <a title="&copy=foo">link</a>, html5validator doesn't give errors, the hover text is &copy=foo.
  2. With <a title="&copy;=foo">link</a>, html5validator doesn't give errors, the hover text is ©=foo.
  3. With <a title="&copy">link</a>, html5validator gives an error, the hover text is ©.
  4. With <a title="&copyfoo">link</a>, html5validator doesn't give errors, the hover text is &copyfoo.

So when the entity is not explicitly a character entity but will be parsed as a character entity, html5validator will give errors.

I think using html5validator is enough, are there something more to consider or am I missing something?

@petdance

This comment has been minimized.

Copy link
Contributor Author

petdance commented Oct 28, 2019

I understand that &display=swap is not ambiguous, and that browsers may handle it OK. Still, the correct way to do it is with &amp;display=swap. I'm not seeing a reason not to.

This instance in fonts.html is the only instance of this problem that I've seen.

@yous

This comment has been minimized.

Copy link
Owner

yous commented Oct 28, 2019

Okay. There are not so many ampersands, merging now.

@yous yous merged commit 8ec0c03 into yous:master Oct 28, 2019
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.