Define what RS should do when the manifest has duplicate entries #1686

Closed
rdeltour opened this issue May 27, 2021 · 8 comments · Fixed by #1889
Labels
EPUB33 Issues addressed in the EPUB 3.3 revision Spec-ReadingSystems The issue affects the EPUB Reading Systems 3.3 Recommendation Topic-PackageDoc The issue affects package documents

Comments

@rdeltour
Member

The EPUB core spec forbids two item elements to identify the same resource. But the RS spec could (and should IMO) define how an RS must handle such non-conforming EPUBs.

@rdeltour
Member Author

Copying what @mattgarrish said in #1374

We tend to tolerate mistakes like this. The caveat to authors is if you don't follow these kinds of rules then bad things will happen. In this case, the reading system may get conflicting information about the resource. It's possible it might break the spine, too, as if you navigate to the resource I imagine it will complicate reading systems looking up which spine item you're in if there's more than one entry that matches the resource.

I guess we could recommend that reading systems ignore duplicate entries, though that wouldn't solve the problem of inaccuracies between the listings.

@mattgarrish mattgarrish added Spec-ReadingSystems The issue affects the EPUB Reading Systems 3.3 Recommendation Topic-PackageDoc The issue affects package documents labels May 28, 2021
@mattgarrish
Member

I guess we could recommend that reading systems ignore duplicate entries

Although, on consideration, this is probably more complicated than simply ignoring the entries as the spine could have separate references to each entry.

It's more of a lookup consideration - the reading system should use the first manifest item in document order that matches the resource in the case of duplicate entries that resolve to the same resource.

It would be helpful to know what, if anything, existing reading systems do with duplicate references to a resource, though. @wareid @danielweck @bduga ?
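A minimal sketch of the first-match lookup suggested above, assuming the reading system resolves every manifest `href` against the URL of the package document before comparing. The base URL, the function name `first_item_by_resource`, and the sample ids are hypothetical and not taken from any reading system or from the spec.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}
# Hypothetical URL assigned to the package document by the reading system.
PACKAGE_URL = "https://rs.example/EPUB/package.opf"


def first_item_by_resource(opf_xml: str) -> dict:
    """Map each resolved resource URL to the id of the first manifest
    item (in document order) that points at it."""
    package = ET.fromstring(opf_xml)
    lookup = {}
    for item in package.findall("opf:manifest/opf:item", OPF_NS):
        resource = urljoin(PACKAGE_URL, item.get("href"))
        # setdefault keeps the first entry and ignores later duplicates.
        lookup.setdefault(resource, item.get("id"))
    return lookup
```

Later duplicates are not deleted from the manifest here; they are only skipped when looking up which item a resource belongs to, so the spine can still reference them.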

@dauwhe dauwhe added the Agenda+ Issues that should be discussed during the next working group call. label Jun 9, 2021
@iherman
Member

iherman commented Jun 11, 2021

The issue was discussed in a meeting on 2021-06-10

List of resolutions:

  • Resolution No. 2: Absolute URLs for manifest items should have a special scheme that is not file:, close issue 1688

2. URLs and the package document

See github issue #1681, #1374, #1688, #1686.

Dave Cramer: this is a bunch of issues that revolve around how you interpret URLs in the package document, especially if they're absolute URLs
… came from an issue in epubcheck
… and there's also an older issue about what the IRI of the package document is
… or what if there are file scheme URLs in the manifest
… and what happens if two URLs resolve to the same item in the manifest?

Matt Garrish: in epubcheck there was a root-relative URL that caused an error, and that spawned all of this
… e.g. "/something/thing"
… so what is the root of the epub?
… to me it doesn't make sense that we even allow these root-relative URLs
… the root differs based on the RS
… and Romain mentioned that we require that all resources resolve to something inside container, but depending on what RS does, there is even ambiguity about what that even is

Dave Cramer: in issue 1688 Romain suggests that manifest items should have one of the special schemes (except file:)

Matt Garrish: there are edge cases where file scheme items make sense, but not generally for epub

Dave Cramer: it goes against epub as a portable format, and the file scheme ties the epub to a specific file system
… how much out there does have file URLs on purpose, not by accident?

Matt Garrish: never heard of one
… and they'd end up being remote resources

Dave Cramer: okay, so what if we just say no file URLs in epub?
… what is the risk that we break something?
… maybe this is something where we try to enforce it and see if anyone complains

Matt Garrish: most RS probably won't do anything with file URL
… probably security concern

Wendy Reid: depending on platform you might not even be able to access parts of the file system (e.g. iOS apps)

Dave Cramer: can we start by resolving on this point from 1688?

Proposed resolution: Absolute URLs for manifest items should have a special scheme that is not file:, close issue 1688 (Wendy Reid)

Dave Cramer: +1

Matthew Chan: +1

Matt Garrish: +1

Wendy Reid: +1

Toshiaki Koike: +1

Shinya Takami (高見真也): +1

Ben Schroeter: +1

Resolution #2: Absolute URLs for manifest items should have a special scheme that is not file:, close issue 1688
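
As an illustration of the resolution, here is a rough sketch of how a checker might test a manifest `href`. It assumes "special scheme" means the special schemes listed in the WHATWG URL Standard (ftp, file, http, https, ws, wss); the function name is hypothetical and this is not how EPUBCheck implements the rule.

```python
from urllib.parse import urlsplit

# Special schemes from the WHATWG URL Standard, with file: left out.
ALLOWED_SPECIAL_SCHEMES = {"ftp", "http", "https", "ws", "wss"}


def is_acceptable_manifest_href(href: str) -> bool:
    """Relative references pass; absolute URLs pass only if their
    scheme is a special scheme other than file:."""
    scheme = urlsplit(href).scheme.lower()
    if not scheme:
        return True  # relative reference, resolved later
    return scheme in ALLOWED_SPECIAL_SCHEMES
```

Under this sketch `chapter1.xhtml` and `https://example.org/font.woff` are accepted, while `file:///tmp/doc.xhtml` is rejected.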

Dan Lazin: is there a use case for some of these other schemes? Why would you have an FTP in your epub?

Matt Garrish: if we go too far, do we prevent future stuff? will we have to come back and re-add this in the future?
… FTP kind of fits within the web framework
… maybe we just leave it to authors to stick with HTTP, HTTPS, etc.

Ben Schroeter: is the idea that if we disallow file scheme, then we also disallow "slash URLs"?

Matt Garrish: not sure those are the same
… i think 1681 is contingent on us forcing RS to unpack epub in a certain way
… otherwise we can't say there is a single consistent root that can be referenced
… and we don't tell RS how/where to unpack right now
… this kind of came up 5 years ago with multiple rendition, but we left it buried in the discussions we had

Dave Cramer: what would be the consequences of forbidding root-relative paths?

Matt Garrish: not sure there are any, because epubcheck had forbidden these until a recent update
… we're reasonably safe from backwards compatibility point of view

Dave Cramer: and this is just for href on manifest?

Matt Garrish: no, this would be anywhere, e.g. in content docs too
… all the "../" stuff would still be okay
… i proposed somewhere that we say all content must be below the package document
… if we could enforce an authoring requirement that made a root, then we could enforce these relative paths
… but maybe it's cleaner to just disallow them

Dan Lazin: do we support the base tag?
… and does that have implications for the handling of these issues?

Dave Cramer: we've been phasing out xml:base, it's been forbidden from package file for example

Dan Lazin: the base tag allows you to define what the relative path is relative to
… so if we're allowing or disallowing certain types of URLs, maybe we should take a stance on base too
… not sure what stance though

Matt Garrish: base would force you to have all external resources, right? It exists, but I don't imagine anyone really going there

Dan Lazin: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base

Marisa DeMeglio: there was a resolution a few weeks ago about dumping xml:base from the spec

Dave Cramer: and that's separate from the HTML base element
… i think i just want to say no root-relative URLs

Dan Lazin: if you set base to some website, and then use root-relative URLs, your URLs would appear to be relative, when they are actually absolute
… but maybe that's too far of a stretch

Dave Cramer: but can we really say anything about base because it's part of HTML?

Matt Garrish: so you must not use root-relative URLs unless you use a base?
… but it also applies to SVGs, to the package document...

Dan Lazin: what was the harm in not banning root-relative?

Matt Garrish: because the RS might treat zip root as the root, but they could also treat location of package doc as root
… so no consistent root
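
The ambiguity described here can be sketched as follows, assuming a package document at the hypothetical path `EPUB/package.opf` and two reading systems that pick different roots. The function and parameter names are illustrative only.

```python
import posixpath


def resolve_root_relative(href: str, package_path: str, root: str) -> str:
    """Map a root-relative href to a path inside the container.

    root="zip": "/" is taken to be the top of the zip container.
    root="package": "/" is taken to be the package document's folder.
    """
    if not href.startswith("/"):
        raise ValueError("not a root-relative reference")
    if root == "zip":
        return href.lstrip("/")
    return posixpath.join(posixpath.dirname(package_path), href.lstrip("/"))
```

With `href="/images/cover.jpg"`, the "zip" interpretation yields `images/cover.jpg` and the "package" interpretation yields `EPUB/images/cover.jpg`, so the same reference points at two different files.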

Dan Lazin: maybe permit it, but use SHOULD NOT?
… is it acceptable for an author to write an epub for a specific RS?
… and where it has undefined behavior for other RS
… probably acceptable, right?

Dave Cramer: yes, e.g. with books that only work with iBooks because of scripting support

Matt Garrish: maybe just a note that root-relative could cause issues if authors use it?

Dave Cramer: so does that mean that there are epubs that could be built to work in some RS, but expose an interop issue if opened in another RS?

Matt Garrish: right
… usually this happens in epubs that try to go from one folder to a sibling folder
… but when all content is below the package document it's fine
… but we don't specify that right now, only that content must be below the root

Dave Cramer: not sure what the right course of action is, but maybe we can continue this another time with Romain present

Wendy Reid: we need RS people here on next call that know exactly what RSes are doing right now

Marisa DeMeglio: one of the github threads has a sample, but I wasn't able to download it
… maybe if we wrote to the mailing list Romain could provide samples
… would also love to have a list of epubs that must absolutely continue to work

Dan Lazin: I have filed #1699

Matt Garrish: also, there's not much hand authoring, and most tools will put all the content into one folder
… we only ran into an issue with this with multiple renditions, and that hasn't really gone anywhere
… so is this maybe more of a theoretical issue

@iherman
Member

iherman commented Jun 18, 2021

The issue was discussed in a meeting on 2021-06-18

  • no resolutions were taken

2. What is the relationship between URLs and the package doc (what is home?)

See github issue #1681, #1374, #1687, #1686.

Wendy Reid: we started this discussion last week. Core question is: Where is home (given we allow both relative and absolute URLs) in the epub context

Romain Deltour: we have to keep in mind: 1) what things have to be put in epub core spec, and 2) what are the rules for epub RS spec
… the latter is more important because we can say whatever we want in core, but authors may deviate, and then it is up to RS to decide how to react
… also, i think we should look into question of what is home first, and that will inform what to do with root-relative URLs

Wendy Reid: okay, so what is the IRI of the package document then?

Ivan Herman: we can't really answer what the IRI of the package is, and i'm not sure we should try
… rather, what do we expect RS to do conceptually?
… the whole epub structure relies on the idea that epub is kind of a frozen website
… i think we say this is the conceptual model within which epub exists, and we should not say exactly how RS can do that
… just as long as the observable behavior is identical
… so as long as after epub is unpacked there is a root that we can refer to, it is fine
… and whether this root is the same IRI of the package or not is none of our business

Matt Garrish: we have 2 issues, 1) are these resources within the container and how do we determine that? 2) what happens when you unpack, and where do these resources go?
… so I don't think there can be a consistent root unless we start to enforce these things
… inside epub resources can be within the container, but that might not be true once the epub is unpacked
… e.g. do you have to unpack everything in the zip? Or just whatever is in the epub under the package?

Brady Duga: so absolute URIs are not allowed, and how a relative IRI is interpreted depends on the language in question (e.g. HTML, or CSS, depending on what type of document it is)
… so why do we have to define what root is if we don't allow absolute URIs?

Matt Garrish: i think the issue is root-relative is still a relative path, so do we have to say "all relative is allowed, except root relative"

Romain Deltour: even with regular relative URLs, the spec is silent on what happens if the relative URL tries to go below the container root?
… and is it possible to look at RSes today and test what they do?

Ivan Herman: i was surprised to find that some RS don't automatically unpack the whole zip
… i thought this was obvious
… but then what if there is a relative URL that is not on manifest, but also happens to be in zip?

Matt Garrish: we have requirement in OCF that all relative resources must resolve to something in container
… i don't think that was the issue

Gregorio Pellegrino: i know that Colibrio streams files out of zip without unzipping

Wendy Reid: yes, there are more examples of RS doing that beyond that

Ivan Herman: but conceptually an RS unpacks the whole zip file onto a domain (as if it were a file system). If we do that then all these concepts become clear
… but i'm not sure if a streaming based solution meets that conceptual model

Hadrien Gardeur: streaming from zip is what Readium does by default
… unzipping is a problem for DRM. Some expectation that you keep the epub zipped. And we've done some optimizations with this in mind

Romain Deltour: i'm surprised that resources that are not in the same directory tree as the OPF would not be accessible in the epub
… going back to the point about defining what should happen conceptually, the spec could say that we define a URL that must be used as the base when resolving relative URLs (e.g., https://ocf.example.org)

Ivan Herman: +1 to romain

Romain Deltour: this defines unambiguously how relative URLs are to be resolved
… and we can say this URL is the root of the OCF
… this makes it so that relative URLs cannot go outside of the container
… and then RSes know what relative URLs point to
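
A small sketch of this idea, using the example URL from the discussion as the container root and Python's RFC 3986 resolution in `urllib.parse.urljoin`. The package path and the function name are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlsplit

# Example root URL mentioned in the discussion.
CONTAINER_ROOT = "https://ocf.example.org/"


def resolve_in_container(package_path: str, href: str):
    """Resolve href against the package document's URL and return the
    resulting path inside the container, or None for another origin."""
    base = urljoin(CONTAINER_ROOT, package_path)
    resolved = urljoin(base, href)
    if not resolved.startswith(CONTAINER_ROOT):
        return None  # remote resource, not part of the container
    return urlsplit(resolved).path.lstrip("/")
```

Because excess `..` segments are dropped at the root during resolution, `resolve_in_container("EPUB/package.opf", "../../../doc.xhtml")` gives `doc.xhtml`: the reference cannot climb out of the container.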

Wendy Reid: going back to romain's point about testing, there are a variety of ways that RSes handle these URLs
… we are especially unsure what happens when files are outside the container
… so this is good reason to do some testing

Ivan Herman: would some sort of conceptual model clash with how things are implemented?

Hadrien Gardeur: we treat OPF as base, and that seems to work in most cases. Seems to make more sense to us than treating zip as base
… but these two are most common implementations

Matt Garrish: this originally came up in multiple renditions when we had issues referencing across sibling directories
… not sure if this is still an obstacle, worth testing

Romain Deltour: drawback of conceptual solution is that sometimes adding this layer of abstraction makes spec harder to use
… so we want to respect people who are actually having to implement it

Wendy Reid: is the best way forward at this point for us to do some sort of testing? (e.g. OPF as base, zip as base, examples of files living outside when OPF is base)

Ivan Herman: i think we should also test environment where multiple renditions is implemented
… if we end up with something that makes multiple renditions impossible, then we should just remove the multiple rendition note

Wendy Reid: do we know of a functioning implementation of multiple renditions?

Hadrien Gardeur: barnes and noble were using multiple renditions for newspapers and magazines
… not sure if they still use it

Wendy Reid: okay, so maybe we test on Nook app
… okay, so for now we test. Will have to ask Dan and the rest of the testing folk to help
… for now we don't have consensus on any sort of language, right?

@iherman
Member

iherman commented Jun 25, 2021

The issue was discussed in a meeting on 2021-06-24

List of resolutions:

  • Resolution No. 1: Provide a note in the core spec that this is a known issue, include non-normative advice about what to do, close issue 1687
  • Resolution No. 2: Declare root relative paths not recommended (should not be used), close 1681

1. Refine the requirements on how RS must process the container structure

See github issue #1687, #1681, #1686.

Wendy Reid: per discussion last week, mgarrish made us a test epub for this
… we've put it through various RS, Apple, Thorium, Colibrio, Kobo Desktop, Kobo iOS, ADE, more...
… aside from Apple and ADE, the test epub has worked
… it seems like most RS are flexible in their sourcing, but with our two fail cases, there is some variability in implementation

Brady Duga: and most of this was done via sideloading, and publisher pipelines are often different
… if publisher sent apple a book, we might have gotten a different result

Matt Garrish: we still have the problem that the spec doesn't say anything about this. There is no authoring requirement for where to put your content (other than below the root). And for RS there is no requirement for how to unpack, etc.
… it seems like it should be common sense. But beyond what we've already said, not sure what we should do. Maybe note it as a potential issue?

Wendy Reid: it probably doesn't hurt to refine language, but at this point creating a firm requirement would impact some existing RS implementations
… and it might make authors uncomfortable
… do we note that there is some confusion as to implementation, but clarify that we aren't going to enforce anything?

Matt Garrish: easiest solution is probably an authoring requirement. Esp. because most authors have probably never tried to do anything like the test epub
… so say authors should put their content under the package document

Brady Duga: this has been an issue forever, and the only time we noticed was with multiple renditions, which hasn't been implemented really. So is a 3rd solution to just leave it?
… if some publisher creates an epub that just doesn't work on Apple, maybe that can just be between that specific publisher and Apple...

Matt Garrish: this whole thing really only came up because of that root-relative thing, so on that issue maybe we just say not to use those

Wendy Reid: right, so we advise not to use root-relative, and we can't say specifically how RS will behave if you do it

Matt Garrish: can we resolve just to use something similar to the note we were going to have for multiple renditions?

Proposed resolution: Provide a note in the core spec that this is a known issue, include non-normative advice about what to do, close issue 1687 (Wendy Reid)

Brady Duga: +1

Wendy Reid: +1

Matthew Chan: +1

Matt Garrish: +1

Masakazu Kitahara: +1

Ben Schroeter: +1

Toshiaki Koike: +1

Shinya Takami (高見真也): +1

Resolution #1: Provide a note in the core spec that this is a known issue, include non-normative advice about what to do, close issue 1687

Wendy Reid: on to the other two related issues. First: are root relative paths valid? Or is this now moot?

Matt Garrish: i think we are on safer ground to just disallow those, especially because in the past epubcheck has had those come up as an error
… it may work on some RS, but that's fine

Proposed resolution: Declare root relative paths not recommended (should not be used), close 1681 (Wendy Reid)

Wendy Reid: +1

Matthew Chan: +1

Matt Garrish: +1

Toshiaki Koike: +1

Masakazu Kitahara: +1

Ben Schroeter: +1

Brady Duga: +1

Resolution #2: Declare root relative paths not recommended (should not be used), close 1681

Wendy Reid: the second one: what should RS do when manifest item has duplicate entries?
… this is worth testing (and should be easy enough to test)

Matt Garrish: i think the issue with this was that if there were multiple copies of the same item in manifest, then RS might not know which manifest item to go to when one copy is referenced

@dauwhe dauwhe added Agenda+ F2F Possible agenda item for F2F and removed Agenda+ Issues that should be discussed during the next working group call. labels Sep 10, 2021
@bduga
Collaborator

bduga commented Oct 29, 2021

How about something like:

When presented with a single manifest item that is repeated multiple times in the linear flow of the spine, reading systems should do their best to display that content in the correct location of that linear flow. The reading system should treat these as distinct pages for UI purposes (for example, each occurrence could be independently bookmarked or annotated), but when following an internal link to that item the reading system should move to the position of the first occurrence of the document in the linear flow.
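
A rough sketch of the behaviour proposed above, assuming the spine is modelled as a plain list of `idref` values. The function names and the sample spine are hypothetical.

```python
def link_target_position(spine_idrefs, target_idref):
    """Spine index an internal link should navigate to: the first
    occurrence of the item, or None if it is not in the spine."""
    try:
        return spine_idrefs.index(target_idref)
    except ValueError:
        return None


def occurrences(spine_idrefs, target_idref):
    """All spine positions of an item; each one stays a distinct page
    for bookmarks and annotations."""
    return [i for i, idref in enumerate(spine_idrefs) if idref == target_idref]
```

For a spine of `["cover", "ch1", "notes", "ch2", "notes"]`, a link to `notes` lands on position 2, while bookmarks can still distinguish positions 2 and 4.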

@iherman
Member

iherman commented Oct 30, 2021

The issue was discussed in a meeting on 2021-10-29

  • no resolutions were taken

2.1. Define what RS should do when the manifest has duplicate entries (issue epub-specs#1686)

See github issue epub-specs#1686.

Dave Cramer: Question - how should a reading system handle this situation?

Romain Deltour: I think that we need to hear from Reading Systems.

Hadrien Gardeur: For reading systems, duplicated resources are not really an issue when moving forward/backward in the spine. It becomes an issue when you need to "jump" to a resource (link or ToC), since we don't know where to jump to.
… The idea of getting rid of the second reference in the spine should work for reading systems.

Dave Cramer: we have a proposal that, if there are duplicate items, the RS just ignores all but the first one.
… does it work for other RSs?

Brady Duga: I think we don't have EPUBs like that.
… because they don't pass EPUBcheck.

Romain Deltour: yes, now it is picked up by EPUBcheck.
… but it's one of the reasons why we have RS specifications.

Rick Johnson: we also filter EPUBs via EPUBcheck.
… it could be an issue for sideloaded EPUBs.

Brady Duga: it may be an issue for linking.
… where will the link go?
… how many times will the document be displayed?

Dan Lazin: since we don't have a definition in the spec, but we are blocking it via EPUBcheck... I think it is not really important to specify.

Matt Garrish: I think there should be consistent handling of non-conformant EPUBs.

Dave Cramer: my question is: is there any interoperability problem to solve?

Hadrien Gardeur: I think we don't know enough about how RSs handle this problem.
… maybe for the moment a note in the spec is enough for RS implementers.
… we may add it in the next EPUB version.

Laurence Zaysser: I think it's an authoring problem.
… having the same document multiple times in the spine creates problems for the page list.

Ivan Herman: as an editor I think the core spec document has to say that the elements must not be repeated.
… the RS spec should say that RSs SHOULD ignore any duplicate elements in the spine.

Matt Garrish: It's difficult to answer... I think that adding a guideline in the RS spec would be good.

Brady Duga: from an RS point of view I have multiple questions: links, display, annotations, bookmarks, etc.
… what happens if the same document is in the spine twice (or more)?
… I think we need guidance on how to manage it.

Laurence Zaysser: I think this case may happen in text-books.

Ivan Herman: maybe the solution is to remove the duplicate content.

Dave Cramer: I think we should not remove content.

Matt Garrish: I don't like the idea of hiding the content, but I don't want to allow them on the authoring side.
… we can make a note explaining how RSs should manage these cases.

Ivan Herman: I officially withdraw my proposal :-)

Rick Johnson: I agree with comments from Brady and Matt, that showing it benefits the user (they see what the author wanted), the only issue is clicking on a link will take them to an unexpected place in the reading order (which is the bug/issue we can point to).

Brady Duga: I may write something as a proposal for the note.

Ivan Herman: For the record, Brady put in a proposal for the duplicate items before the end of the call: #1686 (comment).

Dave Cramer: ok, we ask Brady to propose a text and then we can have a resolution in another meeting.

@rdeltour
Member Author

rdeltour commented Nov 1, 2021

My understanding is that PR #1889 only tackles one part of the issue: the same item referenced multiple times in the spine.

But there is another aspect (in fact, the primary issue presented in the OP), which is two different manifest item elements pointing to the same resource.

For instance the OPF has:

<package>
    <manifest>
        <item id="ch1" href="../doc.xhtml" />
        <item id="ch42" href="smokescreen/../../doc.xhtml" />
    </manifest>
</package>

This is disallowed in EPUB too, but this trick could be used to circumvent logic based only on spine-level references. From the spine point of view, you refer to two different item elements, but after parsing they represent the same resource.
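
One way to catch this case is to compare manifest entries only after their `href` values have been resolved. The sketch below assumes the package document lives at the hypothetical URL `https://rs.example/EPUB/package.opf` and takes the manifest as a list of `(id, href)` pairs; the function name is illustrative.

```python
from collections import defaultdict
from urllib.parse import urljoin


def find_duplicate_resources(items, package_url):
    """Group manifest item ids by resolved resource URL and return only
    the groups with more than one id, in document order."""
    groups = defaultdict(list)
    for item_id, href in items:
        # urljoin removes "." and ".." segments while resolving.
        groups[urljoin(package_url, href)].append(item_id)
    return {url: ids for url, ids in groups.items() if len(ids) > 1}
```

For the two items in the example above, both `href` values resolve to `https://rs.example/doc.xhtml`, so the function reports `ch1` and `ch42` as the same resource even though the raw strings differ.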

@mattgarrish mattgarrish added EPUB33 Issues addressed in the EPUB 3.3 revision and removed Agenda+ F2F Possible agenda item for F2F labels Dec 17, 2021