This repository has been archived by the owner on Jun 27, 2023. It is now read-only.

How do we identify a web publication and its components? #10

Closed
dauwhe opened this issue Jun 26, 2017 · 13 comments

@dauwhe

dauwhe commented Jun 26, 2017

A Web Publication (WP) is a collection of one or more constituent resources, organized together in a uniquely identifiable grouping that may be presented using standard Open Web Platform technologies. A Web Publication is not just a collection of links— the act of publishing involves obtaining resources and organizing them into a publication, which must be “manifested” (in the FRBR sense) by having the resources available on a Web server. Thus the publisher provides an origin for the WP, and a URL that can uniquely identify that manifestation.

Perhaps the simplest possible answer to these questions is just a URL: https://www.example.com/MobyDick/ would both identify the publication and mean that everything whose URL starts with this is part of the publication.
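
A minimal sketch of this "URL prefix" rule, assuming a hypothetical helper named `is_part_of_publication` (nothing in this thread defines such a function). It treats the publication URL as a directory and checks that a resource shares its scheme, host, and path prefix:

```python
from urllib.parse import urlsplit


def is_part_of_publication(publication_url: str, resource_url: str) -> bool:
    """Return True if resource_url falls under the publication's URL prefix."""
    pub = urlsplit(publication_url)
    res = urlsplit(resource_url)
    # Scheme and network location (host and port) must match.
    if (pub.scheme, pub.netloc) != (res.scheme, res.netloc):
        return False
    # Treat the publication URL as a directory, so "/MobyDick" and
    # "/MobyDick/" behave the same and "/MobyDickens/" does not match.
    prefix = pub.path if pub.path.endswith("/") else pub.path + "/"
    return res.path.startswith(prefix)


print(is_part_of_publication(
    "https://www.example.com/MobyDick/",
    "https://www.example.com/MobyDick/chapter-001.html"))  # True
print(is_part_of_publication(
    "https://www.example.com/MobyDick/",
    "https://www.foo.com/MasterAndMargarita/chapter-002.html"))  # False
```

Under this rule, anything hosted elsewhere (a shared stylesheet, a script on another domain) is simply not part of the publication.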

So I guess that I’m looking for reasons to make this more complicated :)

@GarthConboy

I think we'll need to point to the "manifest" -- we'll need to be able to download or package the entire publication and its constituent resources (given Brady's correct observation that, with scripting, scanning the markup can't reliably determine what's really referenced). We also need to know which markup file should be displayed initially and how to progress from there.

If your URL is a "directory root," one could say everything under it is inherently part of the publication (which could resolve the scanning the markup issue [maybe]), but one will still need to find the manifest to know where to start rendering and the reading order thereafter.

@dauwhe
Author

dauwhe commented Jun 26, 2017

but one will still need to find the manifest to know where to start rendering and the reading order thereafter.

This reminds me of a concern about progressive enhancement. Say I point my browser at https://www.example.com/MobyDick/ but JS is disabled or the user agent doesn't yet support web publications. What should happen?

One option would be to give the first document you want displayed a special name, say, index.html. This file could also point to the manifest, or include it directly.

@mattgarrish
Member

The case usually given for complexity is open textbooks and course packs, where content is aggregated from different locations without having to actually amass the resources under a single domain/directory.

Does "everything" here only refer to html pages? How realistic is it that all the resources are going to be neatly stored together? What if my css is two levels higher up from the publication under a common folder? What if I'm pulling in css or scripts from another domain?

I'm all for simplification, don't get me wrong, but I'm not optimistic about a model that requires the user agent to traverse and parse all the documents to figure out what is in scope and needed, if that's where this is leading.

but one will still need to find the manifest to know where to start rendering and the reading order thereafter

Isn't this where we've considered using link/rel to establish the "belonging"? (And another case of why cross-domain publications get complicated quickly, since their parentage can only be established by starting at an author-controlled location, which then has to be maintained despite what the linked resources might indicate.)

@dauwhe
Author

dauwhe commented Jun 26, 2017

The case usually given for complexity is open textbooks and course packs, where content is aggregated from different locations without having to actually amass the resources under a single domain/directory.

Do we need to design something that will support content documents ("spine items" in EPUB-speak) hosted on multiple origins?

https://www.example.com/MobyDick/chapter-001.html
https://www.foo.com/MasterAndMargarita/chapter-002.html

@dauwhe
Author

dauwhe commented Jun 27, 2017

I think we'll need to point to the "manifest"

So the URL of the WP would point to the “manifest” rather than a directory? This would then imply (I believe) that the manifest be discoverable from some sort of file. So what sort of file? I would argue that pointing to HTML would be better than the alternatives, given all user agents know what to do with HTML files. But that leaves open the question of whether this HTML file contains the manifest, or just points to the manifest.

@mattgarrish
Member

Do we need to design something that will support content documents ("spine items" in EPUB-speak) hosted on multiple origins?

We need to consider it, at least. Intertwined with what I mentioned above is the problem of iframes and bringing in entire chunks of content below the level of the spine. We need to be open to how the web works and not just publications as we're used to making them.

The problem doesn't seem confined to content documents but affects their constituent resources, as well, so we need some solution.

Taking a publication offline is less of a problem than what happens to references in a packaged web pub. So while we can ignore the problem at this level, we probably do so at our own peril later. Or maybe we add rules farther down the chain that limit what a packaged web pub can reference? (That's kind of a nasty gotcha I'd hate to discover, though.)

@GarthConboy

"I would argue that pointing to HTML would be better than the alternatives, given all user agents know what to do with HTML files. But that leaves open the question of whether this HTML file contains the manifest, or just points to the manifest." -- interesting. As long as the manifest was discoverable in a known location, I guess that would be okay -- I think a browser might be interested in a first HTML page, whereas a Reading System would want to start with the manifest.

"Do we need to design something that will support content documents ("spine items" in EPUB-speak) hosted on multiple origins?" -- I would think "no".

@iherman
Member

iherman commented Jun 27, 2017

@dauwhe

One option would be to give the first document you want displayed a special name, say, index.html. This file could also point to the manifest, or include it directly.

This is already how the Web works. We routinely use URLs that point to a directory, and it is up to the server setup to decide what this means in practice. It can return the index.html file in that directory, if available; in Apache one can actually set up a whole priority list of alternatives. Although it is not frequent, the server can also return index.svg as its first choice (that may be useful for some documents).
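
As a rough illustration of such a priority list, here is a small sketch in the spirit of Apache's `DirectoryIndex` directive. The function name `resolve_directory_index` and the default list are assumptions made for this example, not any server's actual implementation:

```python
from typing import Iterable, Optional

# Hypothetical priority list; a real server reads this from its configuration.
DEFAULT_INDEX_NAMES = ("index.html", "index.svg")


def resolve_directory_index(
    files_in_directory: Iterable[str],
    priority: Iterable[str] = DEFAULT_INDEX_NAMES,
) -> Optional[str]:
    """Return the first file name from the priority list that exists."""
    available = set(files_in_directory)
    for name in priority:
        if name in available:
            return name
    # No candidate found; a real server might answer 404 or list the directory.
    return None


print(resolve_directory_index(["chapter-001.html", "index.svg"]))  # index.svg
```

Reordering the priority list is all it takes to make index.svg the first choice for a given publication.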

Bottom line: I believe your first statement, whereby https://www.example.com/MobyDick/ is the identifier of a particular document, is perfectly fine.

@iherman
Member

iherman commented Jun 27, 2017

I think the scope notion of the Web App Manifest is interesting here. If we want to include content documents from different "origins", then we may use a scope listing several documents. Although, for different reasons, I may be tempted to say that all directories listed in the scope should be on the same domain.
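
A sketch of what a multi-entry scope could look like, assuming (hypothetically) that the scope is a plain list of directory URLs, each ending in "/". Note that the Web App Manifest itself defines a single scope URL, so the list form and both function names below are illustrative only:

```python
from urllib.parse import urlsplit


def scopes_share_domain(scopes):
    """True if every scope URL uses the same scheme and host."""
    origins = {(urlsplit(s).scheme, urlsplit(s).netloc) for s in scopes}
    return len(origins) <= 1


def in_any_scope(scopes, url):
    """True if url starts with at least one scope URL (simple prefix match)."""
    return any(url.startswith(scope) for scope in scopes)


scopes = [
    "https://www.example.com/MobyDick/",
    "https://www.example.com/shared/css/",
]
print(scopes_share_domain(scopes))  # True
print(in_any_scope(scopes, "https://www.example.com/shared/css/book.css"))  # True
```

The same-domain check is the restriction suggested above; dropping it would let the scope span several origins.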

@baldurbjarnason

The scope notion would play nicely with the proposed packaging spec which IIRC relies on it quite a bit.

Outlining how identification for web publications would work if it followed the expectations set by the rest of the web stack (e.g. web app manifests, atom/rss feeds, etc.):

  1. Each HTML page that is a part of a publication links (with a specified link relation) to a manifest document of some sort to indicate the publication it is a part of.
  2. That manifest then somehow indicates which resources are under its purview. This could be done using scope (like web app manifests) or an explicit listing of some sort, or both.
  3. The manifest lists an authoritative url that identifies the publication it is describing. This might be done indirectly as simply the root URL for the scope it covers or directly as an explicit property. I think most web developers would prefer an explicit URL property but that's just a hunch not backed by data. That would also make the manifest more complementary to the packaging spec if that spec becomes a reality.
  4. That identifying root URL has to return an HTML file that is within the manifest's scope and that file has to link back to the manifest as well.
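
The four steps above can be sketched as a consistency check. Everything named here is an assumption for illustration: the link relation `publication`, the manifest keys `scope` and `url`, and the helper names. None of them come from a specification:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

MANIFEST_REL = "publication"  # hypothetical link relation name


class ManifestLinkFinder(HTMLParser):
    """Records the href of the first <link> carrying the manifest relation."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").split()
        if tag == "link" and MANIFEST_REL in rels and self.href is None:
            self.href = attributes.get("href")


def find_manifest_url(page_url, html):
    """Return the absolute manifest URL a page links to, or None."""
    finder = ManifestLinkFinder()
    finder.feed(html)
    return urljoin(page_url, finder.href) if finder.href else None


def check_identification(manifest_url, manifest, pages):
    """pages maps page URL -> HTML source. Returns a list of problems found."""
    problems = []
    scope, root_url = manifest["scope"], manifest["url"]
    for page_url, html in pages.items():
        # Step 1: every page links to the manifest.
        if find_manifest_url(page_url, html) != manifest_url:
            problems.append(f"{page_url} does not link to the manifest")
        # Step 2: every page falls under the manifest's scope.
        if not page_url.startswith(scope):
            problems.append(f"{page_url} is outside the manifest scope")
    # Step 3: the identifying URL is itself within scope.
    if not root_url.startswith(scope):
        problems.append(f"{root_url} is outside the manifest scope")
    # Step 4: the identifying URL is one of the HTML pages; its link back
    # to the manifest was already verified in the loop above.
    if root_url not in pages:
        problems.append(f"{root_url} is not a known HTML page")
    return problems
```

An empty result means the pages, the manifest, and the identifying URL all point at each other as described.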

This is the basic pattern used by feeds, web app manifests, service workers, etc: component files link to a central document with metadata, indication of scope, link to self, and an identifying URL. Even AMP uses a variation of this theme. And as I mentioned above sometimes the identifying URL and scope definitions are interrelated. E.g. atom feeds link to the URL whose updates they list (explicit id, implicit scope).

This pattern gives us discovery (direct links to chapters let you discover the publication ID, its metadata, and all related assets) as well as a single source of truth for the publication ID, publication-level metadata, and publication assets (the manifest). And this guarantees that the publication id is itself a URL to a human-readable HTML resource that in turn lets you discover the manifest.

Of course, this is just going from what you'd expect if you were coming at this from the web development community. I realise that they aren't the only constituency at play here.

And this does not necessarily dictate anything about the format of the manifest. Although, if we're going by the principle of least surprise, most web developers would at least expect a JSON file.


On service workers

Service workers achieve this process programmatically, but the pattern is very similar overall. Although a lot of service worker behaviour by necessity violates common developer expectations.

  • A service worker's scope defines the pages whose network access it controls. This (counter-intuitively for some) means that it can control cross-origin requests for the pages it controls.
  • But the requests the service worker makes itself are no-cors by default (IIRC).
  • You also have foreign fetch service workers, which control the network requests for pages outside of their scope, as their scope is defined by the resources being fetched, not the pages doing the fetching.

Basically, even though service workers are awesome, they do also have a deserved reputation for being confusing (this is only scratching the surface) so anything we can do to avoid that complexity is a win. That means not letting the publication manifest claim scope over cross-domain resources and not letting it control requests in any way.


(Apologies for the brain dump. I didn't have time to edit this down to a concise note 😊)

@HadrienGardeur

What you're describing is almost exactly what we do in Readium-2 @baldurbjarnason, there are only minor differences or observations that I need to add.

Each HTML page that is a part of a publication links (with a specified link relation) to a manifest document of some sort to indicate the publication it is a part of.

Ideally yes, but what if a resource is included in multiple Web Publications? What if you can't change the HTML or HTTP headers for that resource? IMO, such a link to a publication is an important part of how discovery is handled, but it's not an absolute requirement.

That manifest then somehow indicates which resources are under its purview. This could be done using scope (like web app manifests) or an explicit listing of some sort, or both.

In Readium-2 we list all resources under two separate collections: spine for the core resources that are listed in reading order and resources for other resources.

This has some clear benefits over a simple scope:

  • since we know all the resources (URIs) necessary to render a resource, we can easily cache them however we want (Service Worker, App Cache Manifest, proxy, local cache storage for native apps)
  • we also optimize the UX by preloading specific resources (fonts, JS, CSS) and prerendering some of them (using multiple webviews in our mobile apps)
  • since our manifest is using JSON-LD + schema.org, a client that understands schema.org can index these resources as being part of the publication
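
To illustrate the first benefit, here is a sketch that derives a cache list from an explicit listing. It assumes a manifest shaped loosely like the one described above, with `spine` and `resources` collections of link objects that each carry an `href`; the function name `urls_to_cache` is hypothetical:

```python
from urllib.parse import urljoin


def urls_to_cache(manifest_url, manifest):
    """List absolute URLs for every spine item and resource, spine first."""
    urls = []
    for collection in ("spine", "resources"):
        for link in manifest.get(collection, []):
            # Relative hrefs are resolved against the manifest's own URL.
            absolute = urljoin(manifest_url, link["href"])
            if absolute not in urls:
                urls.append(absolute)
    return urls


example_manifest = {
    "spine": [{"href": "chapter-001.html"}, {"href": "chapter-002.html"}],
    "resources": [{"href": "css/book.css"}],
}
for url in urls_to_cache(
        "https://www.example.com/MobyDick/manifest.json", example_manifest):
    print(url)
```

With a scope alone, a client would have to crawl the content to build the same list.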

The manifest lists an authoritative url that identifies the publication it is describing. This might be done indirectly as simply the root URL for the scope it covers or directly as an explicit property. I think most web developers would prefer an explicit URL property but that's just a hunch not backed by data. That would also make the manifest more complementary to the packaging spec if that spec becomes a reality.

That's one of our only requirements. In Readium-2 we always provide a link that points back to the manifest.

The other two requirements are:

  • at least a title in the publication's metadata
  • at least one resource in the spine
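
These three requirements can be sketched as a small validator. The key names (`links`, `rel`, `metadata`, `title`, `spine`) are assumptions about the manifest layout made for this example, and `validate_minimal_manifest` is a hypothetical name:

```python
def _rels(link):
    """Return a link's relations as a list, whether stored as str or list."""
    rel = link.get("rel", [])
    return [rel] if isinstance(rel, str) else list(rel)


def validate_minimal_manifest(manifest):
    """Return a list of the minimal requirements the manifest fails to meet."""
    errors = []
    links = manifest.get("links", [])
    # Requirement 1: a link that points back to the manifest itself.
    if not any("self" in _rels(link) and link.get("href") for link in links):
        errors.append("missing self link")
    # Requirement 2: at least a title in the publication's metadata.
    if not manifest.get("metadata", {}).get("title"):
        errors.append("missing title")
    # Requirement 3: at least one resource in the spine.
    if not manifest.get("spine"):
        errors.append("empty spine")
    return errors


print(validate_minimal_manifest({
    "metadata": {"title": "Moby-Dick"},
    "links": [{"rel": "self",
               "href": "https://www.example.com/MobyDick/manifest.json"}],
    "spine": [{"href": "chapter-001.html"}],
}))  # []
```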

That identifying root URL has to return an HTML file that is within the manifest's scope and that file has to link back to the manifest as well.

That's pretty much the only difference between what you're describing and Readium-2/Readium Web Publication Manifest. The "root URL" (a link with self as its relation) points to the JSON manifest, not to the first (or any other) document from the spine.

One reason for that is tied to the fact that we'd like anyone to create a Web Publication by remixing content already available on the Web.

On Service Workers

I really don't think that Service Workers should in any way influence our design for Web Publications. There are many different ways that content can be cached, and Service Workers are only one method among others.

Let's keep our options open and let people use all the possibilities offered.

@llemeurfr

So, to come back to the initial question, Readium-2 folks propose:

  • an IRI as a globally unique identifier for the publication, included in the manifest as one of the few mandatory metadata items.
  • a URL linking the manifest back to its origin.

@dauwhe
Author

dauwhe commented Jul 5, 2017

This issue was moved to w3c/wpub#5

@dauwhe dauwhe closed this as completed Jul 5, 2017