Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom category names seem to not be escaped in HTML #945

Closed
kirb opened this issue Mar 13, 2018 · 10 comments
Closed

Custom category names seem to not be escaped in HTML #945

kirb opened this issue Mar 13, 2018 · 10 comments
Labels

Comments

@kirb
Copy link

kirb commented Mar 13, 2018

Using for instance:

custom_categories:
  - name: "Florp: Constants"
    children:
      - ThingConstant
      - OtherThingConstant

generates the following HTML:

<li class="nav-group-task">
  <a class="nav-group-task-link" href="Florp: Constants.html#/c:@ThingConstant">ThingConstant</a>
</li>

which will cause the browser to attempt to load Florp://%20Constants.html rather than http://url/to/docs/Florp%3A%20Constants.html. Reproduced the same with ?, #, %00, and the space above isn’t encoded as %20, so it seems no characters are being escaped at all. Tested with the fullwidth theme.

@SDGGiesbrecht
Copy link
Contributor

There are two distinct issues here. One is the filename, the other is the link.

Filename

Way back when --use-safe-filenames was created for the filenames derived from the Swift source (e.g. type names), the consensus was that the option should leave user defined ones (e.g. categories) alone. The idea was that the user has direct control over them and can choose which characters to use, so those URLs could remain human‐readable instead of becoming an illegible mess. Because what is and is not valid varies depending on the server, being completely safe—which was necessary wherever the user had no control—meant mangling far more than many people like.

However, it has been a while since then. Does that decision need to be rethought? What do others think?

Link

The href attributes may as well be percent encoded no matter what. No one sees them anyway and doing so would not break existing links. (Percent encoding will have no effect on the already safe mangled names.)

@kirb
Copy link
Author

kirb commented Mar 13, 2018

I’d say the filename should avoid characters the browser will display encoded in the address, and ones that are definitely illegal as they have special meanings to the browser or filesystem, although that also means spaces would have to be substituted which some might feel is pretty ugly. Markdown docs pages seem do this already as Quick Brown Fox.md becomes quick-brown-fox.html. Personally I’d rather category names use the same substitution. Interested to see anyone’s counterarguments for that though.

Worth noting Firefox seems to display literal spaces in the address bar while Chrome and Safari leave it as %20. Chrome in fact shows literal spaces in the bottom-left status bar. Don’t think the display of an address is standardised at all so nothing official we can go by.

Hrefs definitely should be encoded no matter what, you’d expect anything you throw at it to just work. It’s impossible to use a character needing escaping as it is now, and I can’t use %3A as then the filename would literally contain %3A and that’d have to be encoded as %253A.

@SDGGiesbrecht
Copy link
Contributor

I’d say the filename should avoid characters the browser will display encoded in the address.

Percent encoding is intended to be analogous to network byte order. (The same goes for Punycode for the domain part.) All HTTP traffic is supposed to use it to make sure it can pass through any ancient gateway it encounters along the way, but the intention is that local operations should all use the native format: Unicode. Users are supposed to only ever see the characters themselves, and data is supposed to remain in that way right up until the request is sent, and be decoded immediately when it is received. This is particularly important, because percent encoding breaks when moved from a string in one text encoding to one in another, such as by copy‐and‐paste.

Every browser is responsible to display the address in its decoded form. Not doing so is a dangerous bug (see the end of the last paragraph). Once they all follow the rules, there will be no characters displaying as gibberish ever. In reality though, browsers aren’t necessarily caught up to the standard, so in practice 99.99991% of scalars are vulnerable to appearing in encoded form.

Even the href tag is intended to be in the decoded form. (See here and scroll up a couple centimeters.) However, it is unusual in that the specification states that encoding here is only applied to scalars above U+007F, which effectively makes it so spaces and other ACSII scalars still need to be manually encoded, and for the rest it does not actually matter either way. Since % is in ASCII, there is no risk of doubly encoding anything—probably the very reason for this exception.

Note that when I just tested it with Safari, href="אבג.html" works and displays just like you would expect. href="α β γ.html" fails andhref="а%20б%20в" works, but displays the same, partly garbled way.

Markdown docs pages seem do this already as Quick Brown Fox.md becomes quick-brown-fox.html.

I had not noticed that before, but looking closer at it now, this is in and of itself a severe problem. Instead of mangling the names in a way that preserves uniqueness like we did with --use-safe-filenames, that algorithm is extremely many‐to‐one. Whichever Jazzy writes second overwrites the first. Here you can see an example where under “Guides” 🇬🇷ΕΛ Με διαβάστε.md and 🇬🇷ΕΛ Συγγενικά έργα.md both received the same filename, --.html, so both links point at the second file and the first file does not even exist.


So yes, we may as well percent encode every last href attribute. That will solve the links.

But as far as the filenames themselves, I still do not think we should strip colons or spaces from the path element, as they are not invalid unless defined as such by the particular server, and because browsers are supposed to be able to display them. However, we could:

A) provide the ability to specify a filename for the category, and/or

B) extend --use-safe-filenames to apply to user‐defined files as well, since it is opt‐in anyway.

In addition, the guide filename algorithm either needs fixing or to be modifiable by A or B.

I was hoping other seasoned contributors might have thoughts of their own: @johnfairh, @jpsim, @segiddins, @pcantrell.

@segiddins
Copy link
Collaborator

I think the current scheme exists to try and make as many file names as possible not require percent encoding?

@johnfairh
Copy link
Collaborator

Yeah I think so too. We should slug the category pages like we do for guide pages: until recently the data model didn't let these pages have different filenames to their display names but this is possible now. For compatibility, we probably have to do this only when needed vs. always.

I don't think slug collision is a problem in practice but for sure, would be possible to spot and do something about.

@johnfairh johnfairh added the bug label Mar 14, 2018
@SDGGiesbrecht
Copy link
Contributor

@johnfairh,

Until recently the data model didn’t let these pages have different filenames to their display names but this is possible now.

Can the filenames already be specified from the interface? (As a configuration or command line option?) That would be a viable workaround if it is.


I don’t think slug collision is a problem in practice.

Right now anything written in a language that does not use the latin alphabet gets completely obliterated so almost everything collides with almost everything. The example already given is a live (albeit superseded) project:

🇬🇷ΕΛ Με διαβάστε.md and 🇬🇷ΕΛ Συγγενικά έργα.md both received the same filename, --.html.

That is equivalent to having Read Me.md and Related Projects.md both be reduced to -.html. The two are in no way similar.

@johnfairh
Copy link
Collaborator

johnfairh commented Mar 15, 2018

Nope, filenames can't be set; by data model I meant the internal ruby/mustache data.

@SDGGiesbrecht
Copy link
Contributor

SDGGiesbrecht commented Mar 15, 2018

So it looks like fixing the filenames would be more involved and require more design thought.

But we do all agree that any and all href attributes can be percent encoded, and that is an easy, beginner level fix. Just find the right spot and use CGI.escape(...). Such a pull request would therefore be welcome from anyone. (And it would be enough to solve your original problem, @kirb.)

@johnfairh
Copy link
Collaborator

#1091 improves the guide filename mapping.

#1094 fixes the 'category name contains odd characters' issue from OP by URL-escaping hrefs.

I think slugging the category filenames would be too backwards incompatible, and not doing so doesn't actually break anything, so leaving it alone.

@johnfairh
Copy link
Collaborator

Fixed in master via #1094.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants