MUST the manifest include information about secondary resources or not? #23

Closed
iherman opened this issue Aug 9, 2017 · 44 comments

Comments

@iherman
Member

iherman commented Aug 9, 2017

This has been discussed in several other issues, and it is probably better to split it out as an explicit question to be tracked.

@iherman
Member Author

iherman commented Aug 9, 2017

See #22 (comment) in a separate issue that is relevant here

@mattgarrish
Member

Do we mean secondary resources as currently defined -- needed for the rendering of a primary resource -- or as modified by issue #16 -- essential for the rendering of a publication?

That's why my tongue was getting tied in #22. Maybe we should just think of primary resource as any top-level resource for the sake of this issue, and avoid correlating them as required in the default reading order.

As I said previously, I'd love to share the pain of manifests and have the author only responsible for listing all the primary resources (as defined in the previous paragraph). To me, that's the minimum we need to establish the bounds of the publication.

The user agent can compile (if perhaps imperfectly in some known cases) the dependencies of those resources if it has a need for that level of information (caching, building a pwp, etc.).

I'm not against listing secondary resources, and depending on how your WP is created it may not even be an issue to list them, but for simplicity of authoring I suspect it would help to bend a little on an EPUB-like hard requirement to list everything. As others have suggested, maybe not every author wants their content taken offline, or turned into a pwp, or used in a way that a full manifest would help simplify. We still need the bounds, but not every dependency.

But I'm repeating myself, so I'll leave my comments at that.

@lrosenthol

@iherman wrote:

I also believe that the manifest in the abstract sense must contain information on the secondary resources. As @mattgarrish put it in another comment, the boundaries of a WP must be set, otherwise a WP might fold the whole of the Web.

And I (and others) disagree. I am fine with this as a should - where an author/publisher can do that if they see fit, but I see absolutely no reason to mandate it. In fact, were we to connect this particular requirement with the offline one (as some have done) - then I even more strongly oppose the mandating. Why? Because the ability to not include a secondary resource would be the WP equivalent of "Cache-Control: no-cache" (aka do not cache/offline).

Put it another way, it MUST ensure that the UA is in position to discover the boundaries of the WP, and to decide whether a particular resource is within or outside a Web Publication.

Why? I don't understand this requirement

@iherman
Member Author

iherman commented Aug 9, 2017

@lrosenthol, my problem is: how would you ensure that a WP (as a collection of resources) would not, via for example a reference to a Wikipedia page, essentially pull in the whole Web? I.e., be infinite, if not mathematically then practically? What would be the practical use of such a WP?

But I see your point of the "no-cache", so the issue is how we reconcile these two things, and I am not sure exactly how. At the moment, secondary resources are those that are necessary to render the publication, wherever those resources are. Maybe we should restrict that definition to those resources that are to be used offline, or something like that; after all, the resources that are not to be "cached" are "just" Web resources that are not specific to the WP.

My experience comes from the program I wrote to turn W3C specs into EPUB. I had to collect all the resources (primary or secondary) and, of course, the secondary resources were the problematic ones. I had to use a heuristic (which works for that specific application) to list only those resources that are in the same directory tree as the document itself, plus some common W3C CSS files. That way I could keep things finite. Otherwise I simply do not know how I could have created an EPUB3 file, with some specs referring (indirectly) to zillions of other pages. What I am looking for is a framework for that type of heuristic which, I believe, is necessary.

@lrosenthol

lrosenthol commented Aug 9, 2017 via email

@iherman
Member Author

iherman commented Aug 9, 2017

@mattgarrish

The user agent can compile (if perhaps imperfectly in some known cases) the dependencies of those resources if it has a need for that level of information (caching, building a pwp, etc.).

I do not really believe in the feasibility of such compilation without some sort of a hint...

I'm not against listing secondary resources, and depending on how your WP is created it may not even be an issue to list them, but for simplicity of authoring I suspect it would help to bend a little on an EPUB-like hard requirement to list everything. As others have suggested, maybe not every author wants their content taken offline, or turned into a pwp, or used in a way that a full manifest would help simplify. We still need the bounds, but not every dependency.

You have a point insofar as not every WP is meant to be offline (although I do not have a good example off the top of my head). So I am happy to mellow the requirement:-)

But then we need some sort of a clear definition what the UA has to do if there is no information.

What I could imagine:

  1. The Manifest MUST include info on all primary resources
  2. The Manifest MAY (or SHOULD?) include information on secondary resources (not necessarily all). Those should be taken into account by the UA when going offline and for any other purpose (e.g., as subjects of metadata, if applicable). That set of information is not necessarily exhaustive vis-à-vis secondary resources.
  3. Any secondary resource not referred to explicitly is ignored for offline or other usages.

(In practice the "information" for secondary resources may either be an explicit list or some sort of a scoping mechanism through scope URL-s or URL patterns, or something more clever.)
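
A minimal sketch of how a UA might apply the three rules above when deciding what to take offline. The function name, the argument shapes, and the `example.org` URLs are all assumptions made for illustration; nothing here is defined by the draft.

```python
from urllib.parse import urljoin


def classify(base, primary, secondary, url):
    """Classify a URL as 'primary', 'secondary', or 'ignored' for offline use.

    base, primary, and secondary stand in for manifest data; their shape
    is an assumption made for this sketch.
    """
    absolute = urljoin(base, url)
    if absolute in {urljoin(base, u) for u in primary}:
        return "primary"    # rule 1: every primary resource is listed
    if absolute in {urljoin(base, u) for u in secondary}:
        return "secondary"  # rule 2: listed, so used when going offline
    return "ignored"        # rule 3: unlisted, so left out of offline use


BASE = "https://example.org/book/"  # hypothetical publication URL
PRIMARY = ["ch1.html", "ch2.html"]
SECONDARY = ["css/book.css"]

for candidate in ["ch1.html", "css/book.css", "https://fonts.example.com/big.woff2"]:
    print(candidate, "->", classify(BASE, PRIMARY, SECONDARY, candidate))
```

Under these assumptions, a font hosted elsewhere and left out of the manifest is simply skipped when the publication goes offline, which is the author-controlled behaviour the proposal aims for.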

Would that direction (possibly refined) work?

(this is relevant to issue #22 :-)

@iherman
Member Author

iherman commented Aug 9, 2017

Sorry @lrosenthol

I don't understand the context where one would "pull in" anything? Are you talking about the case where a WP is going offline? Or when it wants to make a PWP? Or something else? It would help to understand the use case(s) where this need is arising.

Yes, I meant going offline and/or packaged.

@lrosenthol

lrosenthol commented Aug 9, 2017 via email

@BigBlueHat
Member

The user agent can compile (if perhaps imperfectly in some known cases) the dependencies of those resources if it has a need for that level of information (caching, building a pwp, etc.).

I do not really believe in the feasibility of such compilation without some sort of a hint...

This is where the browser's existing prerender and prefetch approaches could provide some value. Given a list of primary resources they and their dependencies would be prerendered/cached/offline-d in a similar fashion to how prerender works (and is being improved to work).

You have a point insofar as not every WP is meant to be offline (although I do not have a good example off the top of my head). So I am happy to mellow the requirement:-)

Also, I'd be careful to not open the door to a WP that can't be taken offline. If that's the case, then it's a broken WP...or "just" a web page/site.

So. Riffing off @iherman's list:

  1. The Manifest MUST reference all primary resources.
  2. All Primary Resources MUST descriptively (i.e. not via JS code tucked inside scripts) reference their dependent Secondary Resources.

The result being that any resource not descriptively referenced someplace (the manifest or a primary resource) will not be available offline.
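
To illustrate what "descriptively referenced" could mean in practice, here is a rough sketch that collects only the URLs visible in markup attributes and deliberately ignores anything a script might fetch at run time. The set of tag/attribute pairs and the `declared_dependencies` helper are assumptions for this example, not a proposed definition.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DeclaredReferences(HTMLParser):
    """Collect URLs that appear declaratively in markup attributes."""

    # Illustrative and incomplete list of (tag, attribute) pairs.
    ATTRS = {
        ("link", "href"), ("script", "src"), ("img", "src"),
        ("audio", "src"), ("video", "src"), ("source", "src"),
    }

    def __init__(self, base):
        super().__init__()
        self.base = base
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        for name, value in attrs:
            if value and (tag, name) in self.ATTRS:
                self.urls.append(urljoin(self.base, value))


def declared_dependencies(html, base):
    """Return the absolute URLs a primary resource references in markup."""
    parser = DeclaredReferences(base)
    parser.feed(html)
    parser.close()
    return parser.urls


page = (
    '<link rel="stylesheet" href="css/book.css">'
    '<img src="img/fig.png">'
    '<script>fetch("data/quiz.json")</script>'
)
print(declared_dependencies(page, "https://example.org/book/"))
```

In this sketch `data/quiz.json` is never found, because it is only referenced from inside script code; under the rule above it would not be available offline.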

@HadrienGardeur

I prefer @iherman's take on this to what @BigBlueHat just proposed.

@lrosenthol

lrosenthol commented Aug 9, 2017 via email

@BigBlueHat
Member

@HadrienGardeur because?

@lrosenthol because? If they don't want those WP features, then they should build a Web Page, Web Site, or Web App (sans-offline-ability).

A Web Publication must be available and functional while the user is offline.

http://w3c.github.io/dpub-pwp/#whatisawebpublication

@HadrienGardeur

My take on this is that it's entirely up to the author to decide what's listed as primary and secondary resources, that's how they define the boundary of a publication.

We shouldn't force authors to list resources used by a primary resource in a secondary resource list. Only the author can know what should or shouldn't be listed and no one else.

@iherman
Member Author

iherman commented Aug 9, 2017

@BigBlueHat, this example has been used a lot in the DPUB IG and we might all be sick of it:-), but let us suppose a WP online uses some sophisticated font which cannot or shouldn't be put offline (either because it is way too large or because there are some legal restrictions). The WP uses that font online, i.e., it is, strictly speaking, a secondary resource, but it is not declared in the manifest to avoid being downloaded.

@pkra
Member

pkra commented Aug 9, 2017

This may be a silly question if I simply missed another discussion:

What is the list of secondary resources for?
Why would an author list something as a secondary resource?

Some people seem to expect that the list of primary and secondary resources feeds into some kind of serviceworker cache (though it sounds more like appcache manifests).

Was there agreement on that? If not, what else is the list for?

@HadrienGardeur

@pkra in the context of Readium-2 we've used the list of secondary resources to do the following things so far:

  • download them using the HTTP Link header and the prefetch relation (this way we have them in the browser's network cache), this is limited to CSS/JS/fonts in our current implementation
  • cache them for offline viewing using a Service Worker (cache first policy) and code that parses a manifest and uses CacheStorage to store them (the resources end up in a different cache storage than when we prefetch the resources)
  • generate an AppCache manifest as a fallback for Safari (since Service Workers are not available yet)

You're correct that how these resources will be cached is more similar to AppCache Manifest than a Service Worker + CacheStorage.
AppCache Manifest has a declarative approach, while CacheStorage is a scripted approach to caching.

That said, the list of primary and secondary resources is purely a hint for a UA. How the content is actually cached is entirely up to the UA to decide.
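
As a rough illustration of two of the uses listed above, the sketch below turns a list of secondary resources into a `Link` header value with the `prefetch` relation and into an AppCache manifest. It is not Readium-2 code: the function names, the extension filter, and the version comment are assumptions for this example.

```python
# Assumed filter mirroring the "limited to CSS/JS/fonts" remark above.
PREFETCH_EXTENSIONS = (".css", ".js", ".woff", ".woff2")


def prefetch_link_header(urls):
    """Build a Link header value asking the browser to prefetch some resources."""
    wanted = [u for u in urls if u.endswith(PREFETCH_EXTENSIONS)]
    return ", ".join(f"<{u}>; rel=prefetch" for u in wanted)


def appcache_manifest(urls, version):
    """Build the text of an AppCache manifest listing the given resources."""
    lines = ["CACHE MANIFEST", f"# version {version}", "", "CACHE:"]
    lines += urls
    lines += ["", "NETWORK:", "*"]  # everything unlisted goes to the network
    return "\n".join(lines) + "\n"


resources = ["/css/a.css", "/img/b.png", "/fonts/c.woff2"]
print("Link:", prefetch_link_header(resources))
print(appcache_manifest(resources, "1"))
```

Both outputs are derived from the same declarative list, which fits the point that the list is a hint and the caching mechanism is the UA's choice.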

@mattgarrish
Member

Is not listing resources really a viable approach to no-cache?

The user agent can determine cache-ability from the headers if it attempts to offline the publication, and should respect those. It shouldn't also have to inspect the manifest.

And what stops someone with a manifest from turning your web publication into a pwp whether you want them to or not? All they need is a program that will use the manifest to find and inspect the listed resources and they'll pull it down whether you like it or not, and regardless of what is listed. A directive not to pwp-ize a publication for conforming user agents might be a start, but it's not going to stop someone who is determined to steal your publication.

@HadrienGardeur

@mattgarrish stealing is a strong word, but you're right that there are multiple approaches to generate a PWP from a WP.

The "simple" approach is to simply package resources listed in the manifest, but you could also take the approach that Chrome is taking to handle prerender by rendering the resources, figuring out which resources are in use and packaging all of them.

@pkra
Member

pkra commented Aug 9, 2017

@HadrienGardeur thanks for describing Readium2. I've finally caught up on #22 completely so this makes a bit more sense (not sure how I got here first -- apologies for adding noise). In particular @baldurbjarnason's comment at #22 (comment) covers a lot of relevant issues.

@mattgarrish
Member

stealing is a strong word

Fair enough. I'm just thinking of the intellectual property theft concern behind prohibiting a pwp. Telling publishers that they can simply not list a resource is not the security they'll want.

But, yes, someone might, and can, still try to pwp-ize a publication the author doesn't want made into a pwp without it being theft.

@lrosenthol

lrosenthol commented Aug 9, 2017 via email

@baldurbjarnason
Contributor

Since nobody has presented a pre-existing real world publishing industry use case for leaving resources that are clearly a part of the publication out of the manifest:

One of the frequent issues I encountered back when I was still churning out ePubs for a living was the sheer, mind-boggling size that the 'include everything' requirement induced. This had serious consequences for a lot of product decisions with real-life economic impact because the size could literally prevent customers from fetching, loading, or storing the ePub. Not to mention the fact that for the files we were distributing via KDP, Amazon was still charging for downloads at an exorbitant rate (they might still be doing this, thankfully I don't care any more).

So we were faced with questions like:

  1. Do we decrease the resolution on all of the images to reduce the file to a manageable size?
  2. Do we make the videos external (even if that's badly supported and buggy)?
  3. Do we link to the videos from an image (and risk having the file rejected by vendors)?

In the end the decision was always: no videos, reduce the images even if it makes them look bad in some contexts, don't risk rejection.

Web Publications are going to have the same economic constraints when it comes to offline as device space is not increasing that fast and large parts of the world still have slow networks.

If we relax the 'must include everything' restriction then we could solve 1 by only listing the resources linked to from the default src attribute in images and leave the higher resolution versions external and linked to via srcset. It would solve 2 by making it clear from the start that UAs need to support this. And, hopefully, being on the open web platform will go a long way to solve 3. 😊
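
A small sketch of the `src` versus `srcset` split suggested for question 1: only the default `src` goes into the manifest, and the other `srcset` candidates stay external. The helper name and file names are hypothetical, and the `srcset` parsing is simplified (it assumes no commas inside URLs).

```python
def split_image_sources(src, srcset):
    """Return (urls_to_list_in_manifest, urls_left_external) for one image."""
    # Each srcset candidate is "URL descriptor"; keep only the URL part.
    candidates = [c.strip().split()[0] for c in srcset.split(",") if c.strip()]
    external = [u for u in candidates if u != src]
    return [src], external


listed, external = split_image_sources(
    "img/map-800.jpg",
    "img/map-800.jpg 800w, https://cdn.example.com/map-3200.jpg 3200w",
)
print("manifest:", listed)
print("external:", external)
```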

@BigBlueHat
Member

@iherman for the font license scenario you mentioned, I'd put the onus of enforcing usage of the font on the font's licensor not on the author/licensee. Meaning, Content Security Policy (CSP), CORS, etc, would be the technological system(s) to prevent misuse of the font...not its reference by the author/creator of a WP.

@HadrienGardeur it's true that the CacheStorage system is scripted, but in the case of Readium-2 it's being populated from a descriptive document. Everything mentioned so far is focused on descriptive definition of the "binding" for a publication. The question (back to the issue title) is more about where the dependencies (secondary resources) are declared.

@pkra ...

What is the list of secondary resources for?

They're dependencies of primary resources...some of which MAY be content related (i.e. quiz answer keys, pop-up something or others, etc).

Why would an author list something as a secondary resource?

To make sure it's part of the publication as considered by the reading system. The question is less "would an author list something" than "where would an author list something" as a secondary resource.

Some people seem to expect that the list of primary and secondary resources feeds into some kind of serviceworker cache (though it sounds more like appcache manifests). Was there agreement on that? If not, what else is the list for?

The frequent mention of ServiceWorkers is mostly about shimming/poly-filling these ideas today. In the future (one would hope) such a shim/poly-fill would go away.

We should certainly look to where AppCache went wrong--which seems mostly centered around whether it or the referencing documents were "in charge"...and all the related issues caused when that was unclear.

If there's a singular document that defines the "binding" of the publication, then my hope is that we'd avoid the split-brain situation of AppCache. Hopefully. 😃

@BigBlueHat
Member

@lrosenthol still curious as to why "readable offline" would be the domain of the author/publisher to decide and not the reader.

@dauwhe
Contributor

dauwhe commented Aug 9, 2017

One of the frequent issues I encountered back when I was still churning out ePubs for a living was the sheer, mind-boggling size that the 'include everything' requirement induced.

And there was the sheer, mind-boggling labour of documenting & listing 'everything', and coming up with an ID for everything, where the ID wasn't used for anything, and I could have produced the list with unzip -1.

If I make a website, I don't have to list all the resources anywhere. If I write a book, I don't have to make a separate list of all the words I used. I dream that WP authors won't have to make such a list, hence my interest in other ways of obtaining such admittedly useful information.

@baldurbjarnason
Contributor

@BigBlueHat

The frequent mention of ServiceWorkers is mostly about shimming/poly-filling these ideas today. In the future (one would hope) such a shim/poly-fill would go away.

I can't speak for others but when I mention ServiceWorkers it's because many browser vendors prefer that the functionality of new features be defined in terms of how lower level features work. E.g. features involving network requests are now generally being specified in terms of the fetch API even if the feature would be implemented entirely in C++. Defining a higher level feature in terms of how a lower level feature works adds clarity and aids implementation even if that implementation does not use the lower level feature. E.g. discussing the specifics of how a publication would be made available offline using the manifest and a service worker clarifies things like how you'd handle security updates and cache strategies, even if you don't use a service worker to implement it.

And we need a clear answer on the security update question. That can't be up to the implementation. If one popular implementation gets it wrong, the results could be wide-ranging and devastating.

@baldurbjarnason
Contributor

@BigBlueHat

We should certainly look to where AppCache went wrong--which seems mostly centered around whether it or the referencing documents were "in charge"...and all the related issues caused when that was unclear.

AppCache was a security nightmare. Its updating mechanism resulted in stale and insecure versions of code often being persisted forever. One-time exploits could be made permanent and irreversible by a hacker by updating the appcache to cache itself (no more updates, current versions of assets are what you're stuck with). And given that appcache worked on non-secure origins, some websites could be hacked just by virtue of accessing them via Café wifi. There's very little about AppCache that wasn't awful in some way.

It also highlights why implicit offline storage of resources via scope is a potential security nightmare. It resulted in a lot of resources being unintentionally stored and then never updated.

@lrosenthol

lrosenthol commented Aug 9, 2017 via email

@lrosenthol

lrosenthol commented Aug 9, 2017 via email

@iherman
Member Author

iherman commented Aug 10, 2017 via email

@iherman
Member Author

iherman commented Aug 10, 2017

@BigBlueHat

@iherman for the font license scenario you mentioned, I'd put the onus of enforcing usage of the font on the font's licensor not on the author/licensee. Meaning, Content Security Policy (CSP), CORS, etc, would be the technological system(s) to prevent misuse of the font...not its reference by the author/creator of a WP.

Let us not get bogged down in the specific example. The reason why an author would want to exclude a font may also be that the font is huge, and unnecessarily complex for offlining on a slow device. Similar considerations may come from huge images, or complex scripts that should not be used offline (or cannot be used because the author knows that there are extensive external references in the script).

What my proposal in #23 (comment) ensures is that the author/publisher has control, if needed.

@iherman
Member Author

iherman commented Aug 10, 2017

@mattgarrish

Fair enough. I'm just thinking of the intellectual property theft concern behind prohibiting a pwp. Telling publishers that they can simply not list a resource is not the security they'll want.

Correct. And we should be very clear in our prose when writing the spec that any restriction on the secondary resource listings should not be viewed as an efficient security measure. It is certainly not.

@HadrienGardeur

@iherman

I agree. Hence my proposal that when we get down to the way the abstract manifest is expressed we allow for, say, URL patterns. Ie, I can simply say, via the manifest, that "all resources in this directory are secondary resources to be listed in the manifest".

I'm not sure how this could work on the Web, there's no way you can list all resources in a "directory". Are you talking specifically about PWP?

If we want UAs to use secondary resources for something, we should IMO work with URLs not templates/patterns.

@iherman
Member Author

iherman commented Aug 10, 2017

@HadrienGardeur

I agree. Hence my proposal that when we get down to the way the abstract manifest is expressed we allow for, say, URL patterns. Ie, I can simply say, via the manifest, that "all resources in this directory are secondary resources to be listed in the manifest".

I'm not sure how this could work on the Web, there's no way you can list all resources in a "directory". Are you talking specifically about PWP?

If we want UAs to use secondary resources for something, we should IMO work with URLs not templates/patterns.

I am sorry, I did not use the right terms. What I meant was to use URL templates[1], which is an existing IETF Proposed Standard (which already has JavaScript implementations[2]). We will have to examine whether it really works for us but, as far as I could see, it is a regexp-like formalism tailored to the URL syntax.

  1. https://tools.ietf.org/html/rfc6570
  2. https://github.com/bramstein/url-template

@HadrienGardeur

@iherman

We use URI templates in Readium-2 to discover and interact with various APIs (search, locator resolver and media overlay).

But URI templates are not suited for listing resources. To do that you'd need to have some sort of convention in how the URIs are built (I don't think that's reasonable).

When a UA decides to download a resource (for prefetching, caching or packaging) it needs to know exactly what the URL will be.

@GarthConboy
Contributor

As I've been out, I'm gonna briefly riff back to later in this thread -- specifically Ivan's proposal which has seemed to gain some consensus. I too think that's good progress, and can certainly accept it for now.

I think we can defer to post-FPWD the argument on whether the #2 is a MAY, SHOULD, or MUST.

Though, we need to keep in mind that a WP has a requirement (in our charter) of "A Web Publication must be available and functional while the user is offline", and PWP is described (in our charter) as: "This specification defines a way to combine the resources of a Web Publication into a distributable file using a packaging format."

To move away from MUST we'll need to consider changing our definitions (which could be done via consensus) from "all WPs must be offline-able and must be transformable to PWPs" to "WPs may/should be constructed such that they can be made available offline and be transformable to PWPs". Again, for where we are now, I think we can defer a decision here.

@iherman
Member Author

iherman commented Aug 10, 2017

But URI templates are not suited for listing resources. To do that you'd need to have some sort of convention in how the URIs are built (I don't think that's reasonable).

Hm. Why?

My model was that the user/publisher would use a URL template as a regex-like specification for which secondary resources are "listed" in the manifest (eg, for offlining). The UA can then take any specific URL it meets and can check whether it resolves to (one of the) pattern(s).

But we are getting down to the weeds. It is too early.

@HadrienGardeur

@iherman

The UA can then take any specific URL it meets and can check whether it resolves to (one of the) pattern(s).

So you're expecting the UA to prerender the primary resources or parse them to:

  1. find URLs
  2. then check such URLs against a number of URI templates that would pretty much act as a white list?

I don't find this easier to author (at least not much) and it clearly makes the job much more difficult for the UA.
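
The two steps described above can be sketched as follows, using shell-style glob patterns from Python's `fnmatch` module as a stand-in for the pattern language. This is not an RFC 6570 implementation (that RFC is mainly about expanding templates, not matching URLs against them), and the pattern, URLs, and function name are assumptions for illustration.

```python
from fnmatch import fnmatchcase


def allowed_urls(found, patterns):
    """Keep only the URLs that match at least one whitelist pattern."""
    return [u for u in found if any(fnmatchcase(u, p) for p in patterns)]


# Step 1 (finding URLs in the primary resources) is assumed to have
# produced this list already.
found = [
    "https://example.org/book/img/fig1.png",
    "https://en.wikipedia.org/wiki/EPUB",
]
# Step 2: check each URL against the author-supplied patterns.
patterns = ["https://example.org/book/img/*"]
print(allowed_urls(found, patterns))
```

The matching itself is short; the cost pointed out above lies in step 1, which requires the UA to parse or prerender every primary resource before it knows any concrete URL.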

@iherman
Member Author

iherman commented Aug 10, 2017 via email

@HadrienGardeur

HadrienGardeur commented Aug 10, 2017

I wouldn't call a regular-expression-like syntax a "no brainer" for anyone (authors or implementers).

With such a syntax, there's also no way to provide important information about a resource (dimension of an image, duration of an audio track).

Saying "it's up to the implementers to figure it out" is IMO not a good strategy, you have to consider everyone when designing such documents.

Putting my Readium hat on, I can tell you that if we have to support prerendering to handle caching resources, it will take years to figure out (the Chrome team has been working on this problem for >5 years now).

@iherman
Member Author

iherman commented Aug 10, 2017

@HadrienGardeur, I would propose to postpone this discussion for a later time. This issue is whether we should/may/must have a list of secondary resources in some way or other in the manifest (abstract and concrete). What syntax we would use, whether we would use other tools like URL templates, etc, is a detail. Let us concentrate on the original issue for now.

I shouldn't have brought up these details for now. My mistake.

@stain

stain commented Sep 26, 2017

Sorry for jumping in ..

The manifest will help determine what is inside or outside the WP - which of course is also useful for attribution purposes. Without the manifest I am not sure what is the point as you could just have index.html as the starting point.

I think preservation should also be taken into consideration - not just being able to go offline as a client. A web resource you depend on can also disappear - for example when using interactive figures.

A secondary entry in the manifest means it is made explicit (that is - detectable by a simple Python script) that somewhere in your WP you have used sloppy and fragile Cloud hosting like <script src="https://dl.dropboxusercontent.com/u/13540018/d3po.init.js"></script> and that your WP is now broken.

A UA can warn "This page requires 'external content'" if an attempt is made to load a resource outside the manifest, and probably should be allowed to disregard it.
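
A sketch of the kind of simple Python script meant here, assuming the secondary entries are available as a plain list of URLs and that "fragile" just means "hosted on a different host than the publication". The function name and the `example.org` publication URL are hypothetical.

```python
from urllib.parse import urlsplit


def third_party_entries(publication_url, secondary):
    """Return the secondary entries hosted on another host than the publication."""
    home = urlsplit(publication_url).netloc.lower()
    flagged = []
    for url in secondary:
        host = urlsplit(url).netloc.lower()
        if host and host != home:  # an empty host means a relative URL
            flagged.append(url)
    return flagged


entries = [
    "css/paper.css",
    "https://dl.dropboxusercontent.com/u/13540018/d3po.init.js",
]
for url in third_party_entries("https://example.org/paper/", entries):
    print("depends on external hosting:", url)
```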

@iherman
Member Author

iherman commented Mar 2, 2018

Propose closing: the infoset in the draft now has a number of entries, and this issue itself has become extremely long and has lost focus a bit. We may be better off closing it and, if necessary, opening new, more focused issues when the time comes.

@iherman
Member Author

iherman commented Mar 13, 2018

@iherman iherman closed this as completed Mar 13, 2018