OCF Document: RFC3986 or RFC3987? #808

Closed
iherman opened this issue Aug 16, 2016 · 23 comments · Fixed by #1670
Labels
Spec-EPUB3 The issue affects the core EPUB 3.3 Recommendation Topic-OCF The issue affects the OCF section of the core EPUB 3 specification

Comments

@iherman
Member

iherman commented Aug 16, 2016

(I know this is a huge can of worms, and I regularly get it wrong; maybe it is the case this time, too...)

The issue is what the allowed characters are for file names. More specifically, what the allowed characters are for the value of full-path in the container.xml file.

  • In Section 3.3, RFC3986 (URI) and RFC3987 (IRI) are both listed as references. That sounds about right; it means a file name of the form /téléphone is all right, although this would not be acceptable via RFC3986. (Although I wonder whether the reference to RFC3986 is necessary in the first place; in this respect RFC3987 supersedes RFC3986, doesn't it?)
  • In 3.5.3.1, towards the end, only RFC3986 is referenced in the definition of full-path. This seems to contradict what is in Section 3.3.
  • Section 3.4 seems to give a specific definition for file names; it is not clear to me whether this is necessary. Isn't it enough to refer to the same RFC3987 (well, the path portion thereof) to avoid a mix-up and, if necessary, list the possible restrictions on file names? (I have not checked which of the characters listed in the 4th bullet point are excluded from the path segment of an IRI anyway.) (Yes, of course, we have to refer to the last portion of an IRI path as the file name.)
  • And here is the rub: I think there is a discrepancy at this moment. If I correctly interpret either RFC3986 or RFC3987, the space (i.e., U+0020) character is excluded from the path, whereas Section 3.4 does not exclude it. This means that the examples are also wrong, because full-path="EPUB/Great Expectation.opf" is not a valid value for @full-path, although it is indeed a valid file name per Section 3.4 :-(

As I said, it is a can of worms, and one of you guys may prove me wrong in my interpretation... But if I am right, my proposal would be:

  1. Use RFC3987 only in the references (with a possible note relating it to RFC3986)
  2. Reduce section 3.4 by saying that a file name should conform to the path of RFC3987, minus a number of characters (if necessary, i.e., check whether all those entries are necessary)
  3. Change the examples...
@iherman iherman added the Topic-OCF The issue affects the OCF section of the core EPUB 3 specification label Aug 16, 2016
@iherman iherman added this to the EPUB 3.1 milestone Aug 16, 2016
@mattgarrish
Member

I don't believe 3987 supersedes 3986; more that it builds on it. The base URI is defined in 3986, for example. 3987 only makes reference to using the algorithms in 3986 in the relative IRI references section. ODF does similar, which I expect is what OCF drew on (see, for example, http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part3.html#__RefHeading__752821_826425813).

The space question is interesting. Most of this prose goes back to EPUB 2.0, and I can't recall it being questioned except for some additions to the list in 3.4.

I just tested percent encoding a space, and epubcheck and Readium couldn't make sense of the rootfile. They do handle unencoded spaces, though. I wonder whether the example has influenced implementations to the point where enforcing what is supposed to be done would cause problems for existing implementations and content.

But this isn't my comfort zone, either.

@iherman
Member Author

iherman commented Aug 17, 2016

Hm. I am even more lost at this point, probably because I never looked at the OASIS document before. But the reference you give seems to say that ODF relies solely on 3987, which actually reinforces my first point: there is no real reason why our OCF document should refer to 3986 at all!

@iherman
Member Author

iherman commented Aug 17, 2016

To be more precise: 3987 does refer to 3986 for the various BNF constructions et al. But that is internal to 3987; our own starting point should be 3987.

@iherman
Member Author

iherman commented Aug 17, 2016

On the issue of space characters, I have done some extra research, for reference.

RFC 3987 says at Relative IRI references:

Processing of relative IRI references against a base is handled straightforwardly; the algorithms of [RFC3986] can be applied directly, treating the characters additionally allowed in IRI references in the same way that unreserved characters are in URI references.

Earlier in the RFC3987 it says:

IRIs are defined similarly to URIs in [RFC3986], but the class of unreserved characters is extended by adding the characters of the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject to the limitations given in the syntax rules below and in section 6.1.

The limitations it refers to are in the ABNF rules. The important entry is the reference to ipchar, which allows percent-encoded characters, sub-delimiters, ":" and "@", and unreserved characters. The next important point is the last of these, which is

iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

The last entry is a set of extra Unicode characters, all beyond U+007F and therefore irrelevant here. The next point to look at is the definition of ALPHA. The strange thing is that ALPHA is not formally defined in 3987 as Unicode ranges; instead it refers to 3986, which says (again without Unicode ranges?):

This specification uses the Augmented Backus-Naur Form (ABNF) notation of [RFC2234], including the following core ABNF syntax rules defined by that specification: ALPHA (letters), CR (carriage return), DIGIT (decimal digits), DQUOTE (double quote), HEXDIG (hexadecimal digits), LF (line feed), and SP (space).

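I.e., ALPHA does not include the space character (SP is a separate core rule), and none of the other alternatives of ipchar includes it either. What this means is that an unencoded space character is definitely not allowed in a relative reference in RFC3987 (or RFC3986).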
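The reading of the ABNF above can be checked mechanically. The following is a minimal Python sketch, not taken from any EPUB tool: the function names are hypothetical, and the ucschar ranges are transcribed from the RFC 3987 grammar as an assumption that should be verified against the RFC itself.

```python
import re

# ucschar ranges, transcribed from the RFC 3987 ABNF (assumption: verify against the RFC).
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 to 13: %x10000-1FFFD ... %xD0000-DFFFD
UCSCHAR_RANGES += [(p << 16, (p << 16) + 0xFFFD) for p in range(1, 14)]
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))

# ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED_ASCII = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
SUB_DELIMS = set("!$&'()*+,;=")
PCT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")


def is_ipchar_literal(ch):
    """True if a single unencoded character is allowed by ipchar."""
    if ch in UNRESERVED_ASCII or ch in SUB_DELIMS or ch in ":@":
        return True
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


def is_isegment_nz(segment):
    """True if `segment` matches isegment-nz = 1*ipchar."""
    if not segment:
        return False
    i = 0
    while i < len(segment):
        if segment[i] == "%":
            # A percent sign must start a pct-encoded triplet.
            if not PCT_ENCODED.match(segment, i):
                return False
            i += 3
        elif is_ipchar_literal(segment[i]):
            i += 1
        else:
            return False
    return True
```

Under these assumptions, `is_isegment_nz("Great Expectation.opf")` is false because of the space, while `"Great_Expectation.opf"`, `"Great%20Expectation.opf"` and `"téléphone"` are all accepted.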

More pragmatically, I believe this makes sense in practice. While it is true that a file name on my Mac may include a space character, I am not sure this is true on all Linux systems or on Windows (maybe it is fine in Windows 10; I never checked). I know that whenever I push a file from my machine to my Web server, I am careful to replace the space character with, say, a _ character. Using this restriction for EPUB sounds like a safe solution anyway.

@murata2makoto
Contributor

I wonder what I have been doing....

First, I think that we should explicitly say that a file or path name matches isegment-nz in RFC 3987. Or, we might want to even use ipath-rootless (but have to disallow a trailing empty segment.)

Second, W3C has Legacy Extended IRIs (LEIRIs). It shows a list of characters allowed by legacy variants of IRIs but disallowed by RFC 3987.

Some characters in this list are explicitly disallowed by OCF 3.1 but they do not have to be. They are shown below:

  • QUOTATION MARK: " (U+0022)
  • LESS-THAN SIGN: < (U+003C)
  • GREATER-THAN SIGN: > (U+003E)
  • REVERSE SOLIDUS: \ (U+005C)
  • C0 range (U+0000 … U+001F)
  • C1 range (U+0080 … U+009F)
  • Private Use Area (U+E000 … U+F8FF)
  • Noncharacters in Arabic Presentation Forms-A (U+FDD0 … U+FDEF)
  • Tags and Variation Selectors Supplement (U+E0000 … U+E0FFF)
  • Supplementary Private Use Area-A (U+F0000 … U+FFFFF)
  • Supplementary Private Use Area-B (U+100000 … U+10FFFF)

Third, I believe that the characters in the next itemized list are already disallowed by RFC 3987 and do not have to be mentioned in OCF 3.1.

  • COLON: : (U+003A), which is a gen-delim
  • QUESTION MARK: ? (U+003F), which is a gen-delim
  • SOLIDUS: / (U+002F)
  • DEL (U+007F)
  • Specials (U+FFF0 … U+FFFF)

Fourth, we disallow some characters, although they are allowed by RFC 3987. I do not think that we should lift this limitation.

  • ASTERISK: * (U+002A), which is a sub-delim
  • FULL STOP as the last character: . (U+002E), which is an iunreserved

Fifth, we should explicitly allow the use of the space character, although we use RFC 3987 as a basis.
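To make the fourth and fifth points concrete, here is a small Python sketch of the OCF-specific checks that would remain on top of RFC 3987 under this proposal. The helper name is hypothetical and the rules are only those two points as stated above; it is not how epubcheck or any other tool implements them.

```python
def ocf_extra_problems(name):
    """Hypothetical helper: report restrictions OCF keeps beyond RFC 3987.

    Only the two restrictions from the fourth point are checked. A space is
    deliberately not reported, following the fifth point.
    """
    problems = []
    if "*" in name:
        problems.append("contains ASTERISK (U+002A)")
    if name.endswith("."):
        problems.append("ends with FULL STOP (U+002E)")
    return problems
```

With this sketch, `ocf_extra_problems("Great Expectation.opf")` returns an empty list, whereas `"draft*.opf"` and `"notes."` each yield one problem.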

@murata2makoto
Contributor

As far as I know, the only reason for referencing RFC 3986 is to borrow some terms which are not even mentioned in RFC 3987.

@iherman
Member Author

iherman commented Aug 17, 2016

Just one comment…

Fifth, we should explicitly allow the use of the space character, although we use RFC 3987 as a basis.

I actually believe it is not a good idea to allow the space, and there are good reasons why RFC 3986/7 have never allowed it. As I said in my comment, there are operating systems where space characters are disallowed in file names; if an EPUB 3.1 publication is unzipped on such a system, a mess may occur. I know that the EPUB 3.1 spec does not talk about that alternative explicitly, because we delegated the BFF work into an auxiliary activity, but I believe we should not allow something that may backfire on us later.

@iherman
Member Author

iherman commented Aug 17, 2016

Let me roll back a little bit: I have just checked and it seems that space is allowed in Linux; my bad. However, if one uses a URL to access a file that is in the 'exploded' version of an EPUB instance, that URL MUST use %20 instead of a space per the existing RFCs. I.e., although my argument on Linux is wrong, I still keep to my conclusion: it may backfire on us later if we allow spaces in file names within a publication...
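The mapping between a file name containing a space and the URL form with %20 can be illustrated with Python's standard library. This is only a sketch of the percent-encoding step; the two function names are hypothetical, and the example path is the one used in the OCF examples discussed in this issue.

```python
from urllib.parse import quote, unquote


def file_name_to_url_path(path):
    """Percent-encode a container-relative path, keeping "/" as the separator."""
    # Non-ASCII characters are encoded as UTF-8 bytes, a space becomes %20.
    return quote(path, safe="/")


def url_path_to_file_name(url_path):
    """Decode a percent-encoded path back to the file name inside the container."""
    return unquote(url_path)
```

For example, `file_name_to_url_path("EPUB/Great Expectation.opf")` gives `"EPUB/Great%20Expectation.opf"`, and decoding that value returns the original name.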

@laudrain

I have to mention that spaces in file names inside the EPUB package are a cause for rejection of EPUB files by distributors today.

@iherman iherman added the Status-Proposed Solution A proposed solution has been included in the issue for working group review label Aug 17, 2016
@iherman
Member Author

iherman commented Aug 17, 2016

At this point, I believe that a radical re-write of the OCF document in terms of references may be too much to do; let us leave this (and flag this for clean-up!) for a later version. But (also in view of @laudrain's comment) I believe the issue of the space character is real and a bug we should not perpetuate.

To solve the issue for EPUB 3.1, I would propose the following: change all the examples in the document (there are quite a few) by removing the space character, or replacing it with the _ character. Leave everything else unchanged, except that we may add a note to the document somewhere, as a warning, saying something like:

Examples in the previous release of EPUB [EPUB 3.0.1] erroneously used a space character in the file names. The space character (U+0020) is, however, excluded from the list of valid characters in a file name, see [RFC3987].
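As an illustration of the proposed change to the examples, the following Python sketch rewrites the full-path attribute of rootfile elements in a container.xml by replacing spaces with underscores. The sample document and the function name are assumptions made for this sketch; in a real publication the file inside the ZIP container would have to be renamed as well.

```python
import xml.etree.ElementTree as ET

# Namespace of the OCF container.xml vocabulary.
OCF_NS = "urn:oasis:names:tc:opendocument:xmlns:container"

# Sample input modeled on the example discussed in this issue.
SAMPLE_CONTAINER = """<container version="1.0"
    xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="EPUB/Great Expectation.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""


def rewrite_rootfile_paths(container_xml):
    """Return (old, new) pairs for every full-path whose spaces were replaced."""
    root = ET.fromstring(container_xml)
    changes = []
    for rootfile in root.iter("{%s}rootfile" % OCF_NS):
        old = rootfile.get("full-path", "")
        new = old.replace(" ", "_")
        if new != old:
            rootfile.set("full-path", new)
            changes.append((old, new))
    return changes
```

Calling `rewrite_rootfile_paths(SAMPLE_CONTAINER)` reports a single change, from `EPUB/Great Expectation.opf` to `EPUB/Great_Expectation.opf`.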

@GarthConboy

+1 to Ivan's above. I don't think we should wade into spec changes beyond the examples.

@laudrain

+1 for me, just change the examples.

mattgarrish added a commit that referenced this issue Sep 1, 2016
#755 - change alt-script to alt-rep and clarify language
#761 - make image cmts required when there is a viewport
#773 - update roadmap and add diagram
#778 - clarify package conformance
#780 - generalize backwards compatibility statement
#800 - clarify svg handling for fxl documents
#808 - replace spaces with underscores in rootfile examples
#822 - fix obsolete feature labels/descriptions
#823 - add note about incomplete RS requirements for scrolled-continuous
#824 - add clearer content model for nav elements
#826 - note toc nav is required in intro
#828 - clarify ordering requirements for toc nav references
#829 - note optional use of pagebreak with page-list

adds a link to the informative a11y faq;
patches errata not applied to doi examples;
probably some other minor stuff, too
@mattgarrish mattgarrish removed the Status-Proposed Solution A proposed solution has been included in the issue for working group review label Sep 1, 2016
@murata2makoto
Contributor

Should we do something in EPUB 3.2?

@mattgarrish mattgarrish added the EPUB32 Issues from 3.0.1 resolved in the EPUB 3.2 specification label Aug 14, 2018
@mattgarrish
Member

mattgarrish commented Jan 19, 2019

This issue just came up in an epubcheck discussion about the reporting of spaces in file names and URIs.

Epubcheck emits a warning if spaces are included in file names, but that's being done without any specific resolution to this issue. (Technically, the only invalid aspect right now is if URIs that reference the files are not percent encoded.)

Do we want to revisit this and perhaps note in the section on file naming restrictions that the use of spaces, while maybe not forbidden, is not recommended so that we can pair the warning up with a proper statement in the specification?

@mattgarrish mattgarrish reopened this Jan 19, 2019
@murata2makoto
Contributor

Should we reference the WHATWG URL specification? Should we consider WHATWG URL API in Node.js as a reference implementation?

@murata2makoto
Contributor

But I am also aware of MY URL ISN’T YOUR URL.

@iherman
Member Author

iherman commented Jan 19, 2019

Should we reference the WHATWG URL specification?

This is certainly what newer W3C documents refer to (similarly to the references to the WHATWG HTML spec). The W3C and the WHATWG are hammering out an agreement on working together, and the URL spec is definitely part of that agreement.

Should we consider WHATWG URL API in Node.js as a reference implementation?

I am not sure why we would look at a reference implementation here. I would think what really counts is the implementations in browsers rather than Node.js and, in this sense, what counts is the relevant URL test suite...

@llemeurfr

I'm curious: most OSes (Linux, Windows, macOS) accept spaces in file names, and URIs or IRIs referencing those files percent-encode the spaces. Nothing special here.
Therefore I don't see why section 3.4 should refuse spaces in file names, and blocking this at the level of EPUBCheck does not seem right to me.

@mattgarrish
Member

most OSes (Linux, Windows, MacOS) accept spaces in file names

Ya, they just have a way of tripping up command line tools, piping operations, etc.

Making them illegal is probably a bit much, but I tested back to the epubcheck 3.0.1 release from 2013 and it has been emitting a warning about their use since at least then, so adding a warning to the specification isn't really changing reality in any way.

But if we decide they should be allowed, then conversely epubcheck needs to be modified.

@mattgarrish mattgarrish removed the EPUB32 Issues from 3.0.1 resolved in the EPUB 3.2 specification label Aug 26, 2020
@mattgarrish mattgarrish removed this from the EPUB 3.1 milestone Aug 26, 2020
@dauwhe
Contributor

dauwhe commented Sep 30, 2020

Should we reference the WHATWG URL specification?

Yes, to retain compatibility with HTML! For example, here's how HTML defines the src attribute for img:

The src attribute must be present, and must contain a valid non-empty URL potentially surrounded by spaces referencing a non-interactive, optionally animated, image resource that is neither paged nor scripted.

@dauwhe dauwhe added the Agenda+ Issues that should be discussed during the next working group call. label May 4, 2021
@dauwhe
Contributor

dauwhe commented May 4, 2021

I think we need to talk about the space character, and whether we can move to using the WHATWG URL spec.

@mattgarrish
Member

and whether we can move to using the WHATWG URL spec.

I don't think this is an issue anymore.

There was a time when the URL specification only defined parsing of URLs (as complained about in the article mentioned above), but it now includes a section that defines the syntax for valid URLs. Looks like that was added sometime around 2017.

Without that syntax, we'd have lost validation, and that would have made for an interoperability mess.

@iherman
Member Author

iherman commented May 7, 2021

The issue was discussed in a meeting on 2021-05-07

List of resolutions:


5. OCF Document: RFC3986 or RFC3987?

See github issue #808.

Dave Cramer: basically the question is generally how we define URLs in our spec
… essentially how URLs work on the web is determined by the WHAT-WG

Ivan Herman: +1

Dave Cramer: if we change our spec to refer to that instead of these RFCs then we are better off

George Kerscher: Agree to getting closer to the current web

Proposed resolution: Use the WHATWG URL specification instead of the RFCs (Wendy Reid)

Ivan Herman: +1

Wendy Reid: +1

Matthew Chan: +1

Toshiaki Koike: +1

Bill Kasdorf: +1

Masakazu Kitahara: +1

George Kerscher: +1

Ben Schroeter: +1

Dave Cramer: +1

Tzviya Siegman: +1

Dan Lazin: +1

Gregorio Pellegrino: +1

Resolution #3: Use the WHATWG URL specification instead of the RFCs

@iherman iherman removed the Agenda+ Issues that should be discussed during the next working group call. label May 7, 2021
@iherman iherman added the Spec-EPUB3 The issue affects the core EPUB 3.3 Recommendation label May 7, 2021
7 participants