Does any RS check whether each resource is listed in the Package Document? #810

Closed
murata2makoto opened this issue Aug 17, 2016 · 25 comments · Fixed by #1567
Labels
EPUB32 Issues from 3.0.1 resolved in the EPUB 3.2 specification EPUB33 Issues addressed in the EPUB 3.3 revision Spec-EPUB3 The issue affects the core EPUB 3.3 Recommendation Topic-PackageDoc The issue affects package documents
Comments

@murata2makoto
Contributor

If an HTML content document references an image file that is not listed in the current package document, does any RS refuse to handle it?

2.2 Reading System Conformance

It must not use any resources not listed in the Package Document in the processing of the Package (e.g., META-INF files [OCF 3.1] and resources specific to other Renditions of the EPUB Publication).

@mattgarrish
Member

This requirement was previously buried in OCF. See issue #626.

I'm not terribly interested in what happens to invalid epubs. We allow resources to be transported in the container that aren't for use in any publication, so technically the instruction would be to not unpack the image so it's not available at all, just like the other "stuff". Whether this requirement is enforceable depends on whether there is active processing of the content.

I'm hesitant to remove it, because without it you could build out functionality by ignoring the manifest.

@mattgarrish mattgarrish added the Topic-PackageDoc The issue affects package documents label Aug 17, 2016
@mattgarrish mattgarrish added this to the EPUB 3.1 milestone Aug 17, 2016
@mattgarrish
Member

@murata0204 is there a change you want to propose here?

@murata2makoto
Contributor Author

I suppose that this requirement is enforceable by epubcheck, but I don't believe that RSs bother to check it. If few RSs check it, future conformance testing will be hampered. I would like to replace MUST with SHOULD.
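
For illustration only, a check of this kind could be sketched in Python as follows. This is not how epubcheck works internally; it is a minimal sketch that assumes a single rendition, looks only at `img` elements in XHTML content documents, and ignores percent-encoded hrefs. The names `unlisted_images` and `build_sample_epub` are hypothetical, and the sample publication is a stripped-down fixture rather than a valid EPUB.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}
XHTML_IMG = "{http://www.w3.org/1999/xhtml}img"


def unlisted_images(epub_bytes):
    """Return container paths of images that XHTML content documents
    reference but that the package document's manifest does not list."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        # Locate the package document via META-INF/container.xml.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).get("full-path")
        opf_dir = posixpath.dirname(opf_path)
        package = ET.fromstring(zf.read(opf_path))

        # Manifest hrefs are relative to the package document.
        listed = {}
        for item in package.findall("opf:manifest/opf:item", NS):
            path = posixpath.normpath(posixpath.join(opf_dir, item.get("href")))
            listed[path] = item.get("media-type")

        missing = set()
        for path, media_type in listed.items():
            if media_type != "application/xhtml+xml":
                continue
            doc = ET.fromstring(zf.read(path))
            for img in doc.iter(XHTML_IMG):
                src = img.get("src", "")
                if not src or urlsplit(src).scheme:
                    continue  # skip empty, remote, and data: references
                target = posixpath.normpath(
                    posixpath.join(posixpath.dirname(path), src))
                if target not in listed:
                    missing.add(target)
        return sorted(missing)


def build_sample_epub():
    """Hypothetical fixture: one chapter that references a listed image
    and an unlisted one. Not a complete, valid EPUB."""
    container = (
        '<container version="1.0" '
        'xmlns="urn:oasis:names:tc:opendocument:xmlns:container"><rootfiles>'
        '<rootfile full-path="EPUB/package.opf" '
        'media-type="application/oebps-package+xml"/></rootfiles></container>')
    package = (
        '<package xmlns="http://www.idpf.org/2007/opf" version="3.0"><manifest>'
        '<item id="c1" href="chapter.xhtml" media-type="application/xhtml+xml"/>'
        '<item id="i1" href="images/listed.png" media-type="image/png"/>'
        '</manifest></package>')
    chapter = (
        '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        '<img src="images/listed.png" alt=""/>'
        '<img src="images/unlisted.png" alt=""/></body></html>')
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/container.xml", container)
        zf.writestr("EPUB/package.opf", package)
        zf.writestr("EPUB/chapter.xhtml", chapter)
        zf.writestr("EPUB/images/listed.png", b"")
        zf.writestr("EPUB/images/unlisted.png", b"")
    return buf.getvalue()
```

Calling `unlisted_images(build_sample_epub())` reports the one image that the chapter uses without a manifest entry. A reading system that wanted to enforce the requirement would need an equivalent lookup at rendering time, which is the cost being questioned here.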

@mattgarrish
Member

It would seem to invalidate certain reading systems, like iBooks, that use their own configuration files, too.

I'm still not sure if the rendition mapping document is used in the processing of a rendition or fits in some magical space above.

And while it would be ideal for a reading system to unpack only the resources listed in the manifest, even that is not enough. The check would also have to happen at rendering, as it's easy to dream up a scenario where a resource is legitimately available but not valid (a two-rendition publication where both renditions reference the same image but one is missing the manifest entry).

I had suggested we change to "It must not depend on any resources not listed in..." but then removed the comment. Maybe it is worth considering, since the intent seems to be more that the RS must not fail to process a package that lacks a proprietary file of some sort. (I don't like that it validates proprietary implementations, but that bridge has already been crossed.)

Otherwise, I guess a "should" at least makes it more realistic.

@iherman
Member

iherman commented Aug 28, 2016

"should" sounds good to me...

@mattgarrish
Member

Closing this issue as we agreed to no change. My memory fails me if it was only a deferral until after the draft, though, so reopen if I have this wrong.

@mattgarrish mattgarrish added the EPUB32 Issues from 3.0.1 resolved in the EPUB 3.2 specification label Aug 14, 2018
@dauwhe
Contributor

dauwhe commented Jan 8, 2021

Reopening because I wrote a test for Makoto's scenario, and it is not interoperable. Thorium does not display an image that is not in the manifest, but iBooks, ADE, and Calibre do. I expect most commercial reading systems would show the image, for fear of breaking content.

The language is a challenge:

It MUST NOT use any resources not listed in the Package Document in the processing of the Package.

What do we mean by "processing" or by "use any resources"? The old com.apple.ibook.display-options.xml file certainly can influence the rendering of an EPUB. But is it a resource? Is the activity of reading that file (not mentioned in the package file) part of processing of the Package?

@dauwhe dauwhe reopened this Jan 8, 2021
@OriIdan

OriIdan commented Jan 8, 2021 via email

@iherman
Member

iherman commented Jan 9, 2021

To be very pragmatic: this means that, under the current spec, iBooks or Calibre would not be conformant implementations, and I do not think it would be in anybody's interest to get there.

A not-too-distant analogy is HTML content vs. browser behaviour. We know that browsers do accept invalid content, and they do something reasonable with it. Similarly, while a conformant EPUB 3.3 publication MUST list all of its resources, I think it would be perfectly fine to say that an RS SHOULD NOT use any resources that are not listed.

I think this type of approach should be valid for the RS-s in general, i.e., we may want to go through the RS requirements and see where similar situations occur. With the current separation of content vs. RS into two documents this may become much easier.

@mattgarrish
Member

There are clearly files not listed in the package document that are needed for processing (the container file), so the restriction is confusing (it also only appeared in 3.1 when we introduced packages). I believe we already give the package document logic precedence elsewhere, so it's not really a loophole to fork the standard.

Is there a security reason why we need to state anything about resources not listed in the package document, though? If not, what is the end goal of this restriction, whether required or recommended? I'm not sure why unlisted resources are any less secure than listed ones: if you can sneak a malicious file into the container and modify a content document to reference it, I can't imagine it's that hard to modify the package document, too. Accidental omission seems like the more likely cause of unlisted resources, plus the possibility of multiple renditions (but again, I'm not sure why this matters for processing a specific package document).

I'm not even sure this belongs in the package document section, since resources not listed in the manifest by definition are external to any processing of the package document. It also intersects with OCF processing in a weird way. Plus there are publication resources that could be outside the container, so it can get confusing with any web-linked content.

We probably need to go back to basics on this one and figure out exactly what it is we're trying to prevent and why, in other words.

@llemeurfr

Is there a security reason why we need to state anything about resources not listed in the package document ...

It's not about security imo, it is about interoperability. RS developers should find in the spec guidance about the standard behaviors expected by the community.

Therefore, if an XHTML page contains an image which is not referenced in the package (certainly an omission), all RSes should behave the same.

Note: It would also be good to allow files in the zip that are not listed in the package document, because zip archives may then contain mixed formats (for instance, an additional JSON manifest that should remain unknown to the EPUB machinery).

@mattgarrish
Member

It's not about security imo, it is about interoperability.

But I don't see that we'll ever get interoperability if we only recommend practices. It sounds like this change justifies what everyone is doing, not moving all reading systems to provide the same experience. (Everyone passes a recommendation, in other words.)

If there aren't security issues, and we want the same experience regardless of RS, it seems like the requirement should be to not ignore resources needed to render the publication regardless of whether they are listed in the manifest.

But even that seems complicated if some reading systems won't unpack resources not listed in the manifest. We need a lot more depth in terms of processing logic (especially since unzipping is not required).

@llemeurfr

If there aren't security issues, and we want the same experience regardless of RS, it seems like the requirement should be to not ignore resources needed to render the publication regardless of whether they are listed in the manifest.

I must disagree with, and even oppose, such a requirement (for reasons related to the architecture of Readium software, which must internally list the resources that are served to the internal browser engine). If the EPUB spec forbids using resources in content that are not listed in the package document, the spec authors cannot also force RSes to handle such ghost resources.

I don't see that we'll ever get interoperability if we only recommend practices.

One must be pragmatic: EPUB defines a file format and RSes have been developed without a precisely defined processing model so far. It is too late to create fully constraining processing models now. In most cases, a proper set of best practices is the best we can do today.

@mattgarrish
Member

RSes have been developed without a precisely defined processing model so far

That's the big problem we face in any efforts at interoperability.

And to be clear, I don't think interoperability is a realistic outcome exactly because it's far too late in the life of EPUB 3. If that's our goal, then we can't have recommendations. I'm not proposing it's the solution here.

I'm still more interested in finding out what the current requirement is hoping to achieve before we try to rewrite it:

  • If it's not about security, then what does changing it to "should not" achieve? Why are we suggesting it's a bad thing? Is the opposite what we want?
  • If it's not specifically about HTML/SVG rendering, what files are we trying to stop reading systems from using? They need to use some files not listed in a package document to process a publication (META-INF, RS configuration files, etc.).

@dauwhe
Contributor

dauwhe commented Jan 11, 2021

Perhaps there's a way to resolve this. I believe it's entirely reasonable for a reading system to display, for example, an image linked from a content document that's not listed in the manifest. I also think it's entirely reasonable for the core spec to require all resources to be listed in the manifest. How about we keep this as an authoring requirement in the core (and in EPUBCheck) and remove this restriction in the RS spec?

@shiestyle

+1 to Dave.

@iherman
Member

iherman commented Jan 12, 2021

How about we keep this as an authoring requirement in the core (and in EPUBCheck) and remove this restriction in the RS spec?

This sounds reasonable to me, but an attentive content creator may very well ask: "If a Reading System does not check or care about this, why am I obligated to add a complete list of resources to my package file?" We should have a generic description of why we have these requirements and what the reason is; the content creator should really follow the spec (and not only for fear of epubcheck...)

(I must admit I do not have a clear answer either.)

@danielweck
Member

Hello, from the standpoint of Readium implementations, there is an expectation that publication resources are properly declared in the manifest, because each individual asset can be associated with additional properties defined at authoring time (i.e. beyond the mere fact of being present in the directory of the zip container, or on the filesystem in the case of exploded / unzipped publications).

Most notably, publication resources can be obfuscated or encrypted. This kind of "meta" information is not intrinsic to the assets themselves, this requires additional authored data. For this reason, a typical publication server in Readium implementations makes no attempt to fetch "local" resources that cannot be found / are not declared in the publication manifest (I used the term "local" in contrast with "remote" HTTP resources that completely bypass the locally-instantiated publication server).

Technically-speaking, it would be possible to refactor existing Readium implementations to include a fallback to the zip directory of the publication container (or the filesystem in the case of exploded / unzipped publications), whenever a referenced resource cannot be found in the publication manifest.
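
As a rough illustration of the two behaviors described above, the sketch below shows a resolver that serves only manifest-declared resources by default and can optionally fall back to the container's file list. It is not the API of any actual Readium component: `ManifestEntry`, `PublicationServer`, and `fallback_to_container` are hypothetical names. Note that in the fallback path the resolver has no authored metadata, so it has to guess the media type and cannot know whether the resource is encrypted.

```python
import mimetypes
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class ManifestEntry:
    media_type: str
    # Authored data, e.g. derived from META-INF/encryption.xml (assumption).
    encrypted: bool = False


class PublicationServer:
    """Hypothetical resolver for "local" publication resources."""

    def __init__(self, manifest: Dict[str, ManifestEntry],
                 container_files: Set[str],
                 fallback_to_container: bool = False):
        self.manifest = manifest
        self.container_files = container_files
        self.fallback_to_container = fallback_to_container

    def resolve(self, path: str) -> Optional[ManifestEntry]:
        entry = self.manifest.get(path)
        if entry is not None:
            return entry  # declared resource: authored properties are known
        if self.fallback_to_container and path in self.container_files:
            # Undeclared but physically present: guess what we can.
            guessed, _ = mimetypes.guess_type(path)
            return ManifestEntry(media_type=guessed or "application/octet-stream")
        return None  # undeclared resources are not served


manifest = {"EPUB/chapter.xhtml": ManifestEntry("application/xhtml+xml")}
files = {"EPUB/chapter.xhtml", "EPUB/images/unlisted.png"}

strict = PublicationServer(manifest, files)
lenient = PublicationServer(manifest, files, fallback_to_container=True)
```

Here `strict.resolve("EPUB/images/unlisted.png")` returns `None`, while the `lenient` server returns an entry built from guesses, which is exactly the information gap the manifest is meant to close.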

However, personally I like to think of a "publication" as a well-defined / bounded set of resources, even if this requires more effort at authoring stage to produce the exhaustive list of referenced assets (i.e. publication manifest).

@larscwallin

larscwallin commented Jan 12, 2021

First of all Colibrio supports loading resources that are not listed in the OPF manifest.

We use EPUB OCF's (the zip file's) central directory as our "canonical manifest" as we can always trust that this is complete (which is almost never the case for the OPF). The OPF manifest we treat more like additional resource metadata for the EPUB Publication context.

So I think this may be a helpful way to think about the future role of the OPF manifest. We can redefine its usage to be a collection of extra metadata for publication resources, and use the OCF as the "real", complete manifest, which is something that we get out of the box anyway.
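
The idea can be sketched as follows: take the zip's file list as the complete set of resources and overlay whatever the OPF manifest says about each one. This is only an illustration of the approach, not Colibrio's implementation; `resource_table` and the `listed` flag are hypothetical, and the OPF manifest is assumed to have been parsed already into a dict keyed by container path.

```python
import io
import zipfile


def resource_table(zf, opf_path, opf_manifest):
    """Map every publication file in the zip to its OPF metadata, if any.

    zf           -- an open zipfile.ZipFile (its central directory is the
                    "canonical manifest" in this sketch)
    opf_path     -- container path of the package document itself
    opf_manifest -- dict of container path -> metadata dict (assumed input)
    """
    table = {}
    for name in zf.namelist():
        # Skip directories, OCF housekeeping files, and the package document.
        if (name.endswith("/") or name == "mimetype"
                or name.startswith("META-INF/") or name == opf_path):
            continue
        metadata = dict(opf_manifest.get(name, {}))
        metadata["listed"] = name in opf_manifest
        table[name] = metadata
    return table


# Build a tiny in-memory container to exercise the function.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name in ("mimetype", "META-INF/container.xml", "EPUB/package.opf",
                 "EPUB/chapter.xhtml", "EPUB/images/unlisted.png"):
        zf.writestr(name, b"")

with zipfile.ZipFile(buf) as zf:
    table = resource_table(
        zf, "EPUB/package.opf",
        {"EPUB/chapter.xhtml": {"media-type": "application/xhtml+xml"}})
```

In the resulting table, the chapter carries its OPF metadata plus `listed: True`, while the image appears with `listed: False` and nothing else, so a reading system can still load it but knows it lacks authored metadata.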

For now though, until we decide otherwise, we should REQUIRE all publication resources to be listed in the manifest, and in the Reading System document we should tell implementors to handle exceptions by loading the resources anyway, or to degrade gracefully.

PS. I am really a sucker for the manifest and am very much in favor of keeping it complete.

@OriIdan

OriIdan commented Jan 12, 2021 via email

@larscwallin

larscwallin commented Jan 12, 2021

We can however rephrase it for RS requirements to SHOULD NOT use resources that are not in the manifest.

"SHOULD NOT" would be too strict in the RS document I think. This will break many existing publications.

@OriIdan

OriIdan commented Jan 12, 2021 via email

@mattgarrish
Member

mattgarrish commented Jan 12, 2021

Further to the good points that @danielweck has made, I'd add that another thing unlisted resources do is provide a means of circumventing core media type rules and fallbacks, as well as requirements on where resources are hosted.

Authoring requirements are a good start, but they're also brittle as all an author has to do is ignore them depending on how they distribute the content. I wonder, similar to what @iherman says, if additional context would help here on both sides.

For example, the RS requirement might become:

It SHOULD NOT use non-publication resources in the rendering of an EPUB Publication due to the inherent limitations and risks involved (e.g., lack of information about the resource and how to process it, security risks from remotely-hosted sources, lack of fallbacks, etc.).

Similarly, on the authoring side we can note that these are the reasons why authors need to make sure the manifest is complete (i.e., to ensure complete rendering).

@teytag

teytag commented Jan 12, 2021

@iherman : .... the content creator should really follow the spec (and not only for fear of epubcheck...)

I'm creating content that meets the specification. And then I verify my EPUB3 file with EPUBCheck. The approval I received after this control process means that I am a content creator who has made a package according to specification and has technical reliability.

To ensure "interoperability" understanding of my content with RS:

  • The RS must handle my content according to the "manifest" values. If the RS behaves otherwise, I don't know what my readers will get. This is not a nice thing.

  • For example, the number of XHTML pages in the package I created is fixed. However, others can add watermarks by inserting a new XHTML page into my original content package without my permission. That behavior interferes with the content.

Because this or similar behavior can occur, I have a suggestion for "interoperability" (between the creator and the RS).

Permission to Modify Content: The creator can declare in the manifest (OPF) which RS behavior is allowed. With this new attribute, the creator knows that the RS will take that permission into account.

This desire of the content creator will also not be perceived as the stringency of the EPUB3 specifications (MUST).

@larscwallin

larscwallin commented Jan 12, 2021

What we could do is to suggest that Reading Systems show a user facing warning, or a confirmation when a content document requests a resource that is not listed in the manifest. This would allow the RS to "fix the quirk", but only if granted explicit permission from the user.
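
A minimal sketch of such a confirmation flow, under the assumption that the user is asked once per publication and the answer is remembered, might look like this. `UnlistedResourcePolicy` and the `ask_user` callback are hypothetical names; a real reading system would wire the callback to its own dialog.

```python
class UnlistedResourcePolicy:
    """Ask the user once whether resources missing from the manifest may load."""

    def __init__(self, manifest_paths, ask_user):
        self.manifest_paths = set(manifest_paths)
        self.ask_user = ask_user   # callable(path) -> bool, supplied by the UI
        self.decision = None       # None until the user has answered

    def may_load(self, path):
        if path in self.manifest_paths:
            return True            # listed resources never trigger a prompt
        if self.decision is None:
            self.decision = bool(self.ask_user(path))
        return self.decision


# Stand-in for a UI dialog: records each prompt and always grants permission.
prompts = []


def fake_prompt(path):
    prompts.append(path)
    return True


policy = UnlistedResourcePolicy({"EPUB/chapter.xhtml"}, fake_prompt)
results = [policy.may_load(p)
           for p in ("EPUB/chapter.xhtml", "EPUB/a.png", "EPUB/b.png")]
```

With this stand-in, only the first unlisted resource produces a prompt; the second reuses the stored answer. Whether to remember the answer per publication, per resource, or not at all is a design choice the thread leaves open.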

I have suggested a similar thing for unlisted remote resources before.

@mattgarrish mattgarrish added the EPUB33 Issues addressed in the EPUB 3.3 revision label Mar 17, 2021
@mattgarrish mattgarrish added the Spec-EPUB3 The issue affects the core EPUB 3.3 Recommendation label Sep 14, 2022

10 participants