Role of the HTML document returned by the WPUB URL #103

Closed
HadrienGardeur opened this issue Nov 15, 2017 · 73 comments

Comments

@HadrienGardeur

HadrienGardeur commented Nov 15, 2017

We've had long discussions in issue #94 about what the WPUB URL resolves to and there's a consensus that it should return an HTML document.

What's not well defined though is the exact role of that document.

Several proposals have been made so far:

  • discover the Web Publication (through a link to its manifest)
  • provide information about the Web Publication
  • enable the acquisition of a Web Publication (if the publication requires authentication or is behind some sort of paywall)
  • provide a Web App or a polyfill to handle various WP specific features (suggested by @BigBlueHat in this thread)

There's also an on-going discussion whether this document MUST belong to the Web Publication itself or remains external to the publication (I'll let @mattgarrish and @GarthConboy repost some of their relevant comments here).
Everyone agrees that the document MAY belong to the publication.

If the document belongs to the Web Publication, it MAY be:

  • the table of contents
  • the cover (not a concept that's been clearly discussed or defined at this point in time)
  • the recommended resource to start reading the publication (could be different from the first resource in the reading order)
@dauwhe
Contributor

dauwhe commented Nov 15, 2017

Hadrien, could you describe how it would work if the HTML document returned by the WP URL is not part of the WP, and doesn't contain a link to the WP's manifest?

@baldurbjarnason
Contributor

Hadrien, could you describe how it would work if the HTML document returned by the WP URL is not part of the WP, and doesn't contain a link to the WP's manifest?

Is there any reason why it wouldn't contain a link to the manifest? E.g. for RSS, sub-pages (like 'about') are generally not a part of what appears in the feed but still contain the link. Was there a specific reason cited in that other thread of why that can't be the case?

I have to confess that issue #94 totally lost me about halfway through so I may well have missed this point if it was made later on.

@HadrienGardeur
Author

@dauwhe that's a little more specific than the question that I just raised in this issue.

First of all, I think that the document will always return information about the publication, so out of the three things that I've listed, this one is always true.

I'll use concrete examples:

Info about the Web Publication, no link to the manifest, not part of the WP

This could apply to a book behind a paywall, for example Hachette Book Group could start selling Web Publications directly from their website.

If the user is not logged in and/or hasn't bought that specific publication, he/she gets redirected to a product page with info about the book (use case 1) and a way to buy it (use case 3).

Info about the Web Publication, with a link to the manifest, not part of the WP

This could apply to a publisher distributing strictly Open Access content.

The catalog for that publisher has a page per publication to provide a broad presentation (metadata, cover, summary, tags) which would fit use case 1. That page would also include a link to start reading that WP (which would point to a resource that's part of the WP and in reading order) along with a link to the manifest (use case 2) that could trigger specific behaviours in a UA (similar to Chrome Install banner for PWA, we could imagine a similar banner that shows the cover, metadata and interactions such as "read" or "add to your shelf").

@HadrienGardeur
Author

E.g. for RSS, sub-pages (like 'about') are generally not a part of what appears in the feed but still contain the link. Was there a specific reason cited in that other thread of why that can't be the case?

@baldurbjarnason I've made that point several times already but a few people (including @mattgarrish) find that behaviour "weird".

@mattgarrish
Member

My question was what are the implications of a resource initiating the web publication when that resource is not itself a part of the web publication, and done via the linking mechanism defined in the spec?

If a page with a link is not part of the publication, but some sort of launcher out to a publication, how does the user agent determine which is which? As I said in #94, it seems like a security hole when there is no specific scope to a web publication and no sure way to verify what is or is not a publication resource, since the resource list is not required to be complete.

@HadrienGardeur
Author

@mattgarrish unlike a Web Application, a Web Publication must list all of the resources that are part of it. If a resource is not listed, it's by definition outside the scope of the Web Publication.

This effectively replaces the scope in WAM and the UA could trigger different behaviours for resources inside the WP vs resources outside.
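
A minimal sketch of the inside/outside test described above. It assumes the manifest's resource list is exhaustive, which is exactly the point under debate in this thread. The `spine` key follows the Readium syntax; the `resources` key and the helper name `is_publication_resource` are assumptions made for illustration only.

```python
from urllib.parse import urldefrag


def is_publication_resource(url: str, manifest: dict) -> bool:
    """Return True if `url` is listed in the manifest (fragments are ignored)."""
    target, _fragment = urldefrag(url)
    # Hypothetical layout: reading order in "spine", other resources in "resources".
    listed = manifest.get("spine", []) + manifest.get("resources", [])
    return any(urldefrag(item["href"])[0] == target for item in listed)


manifest = {
    "spine": [
        {"href": "http://example.org/publication/c001.html"},
        {"href": "http://example.org/publication/c002.html"},
    ]
}

# A UA could branch on this result to pick "inside the WP" vs "outside the WP" behaviour.
inside = is_publication_resource("http://example.org/publication/c001.html#p3", manifest)
outside = is_publication_resource("http://example.org/catalog/moby-dick.html", manifest)
```

If the list is not required to be complete, a negative result only means "not listed", not "outside the publication".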

@mattgarrish
Member

No, there must be a list that must include the resources in the default reading order, but otherwise it doesn't have to be exhaustive. That's as far as we got agreement.

That's why I'm not a fan of that section and concerned about how those implications spill outward. We've said the bounds are important, but we aren't strong in defining them.

I don't have a real position in favour of MUST/SHOULD/MAY here, only that everything fit together.

@HadrienGardeur
Author

Also, we're leaning heavily into the realm of implementation.

Each UA will effectively implement support for WP differently; simply displaying a link in a banner or an icon in the address bar, for example, is IMO not such a big deal.

If this is not a security issue for RSS/Atom, I don't see why it would be a security issue for WP.

@HadrienGardeur
Author

No, there must be a list that must include the resources in the default reading order, but otherwise it doesn't have to be exhaustive. That's as far as we got agreement.

Yeah and I don't like that part at all. We can't say that bounds are important and be as weak as we are in defining them.

@mattgarrish
Member

simply displaying a link in a banner or an icon in the address bar for example is IMO not such a big deal

For me, it's more what happens after you click that button or icon in this scenario. Typically, I'd expect that page to be initiated (paginated, table of contents link appears, etc.). But it potentially puts any page into the publication with our weak boundedness.

Assuming we tighten up our resource list requirements to solve that, do we also need to consider start_url for these cases?

I realize some user agents will just make a publication out of the manifest and drop the user at the first page (or whatever), but some additional direction seems useful if the address is not a publication resource. With those bits in place, I'd be fine dropping to either should/may.

@HadrienGardeur
Author

Here's a good example that illustrates a few use cases that we'll have to deal with: http://books.openedition.org/

They have three different access models: Simple Open Access, Open Access Freemium, and Exclusive Access.

They currently provide multiple ways to access content:

  • Read online (the TOC is accessible from the publication's page)
  • Read using a special viewer
  • EPUB
  • PDF

With Simple Open Access, you can do all four for free.
With Open Access Freemium, you can read online for free, but for the other three options you must go through a paywall first.
With Exclusive Access, everything is behind a paywall.

Let's imagine that they decide to replace their special viewer with a Web Publication instead:

  • Simple Open Access: TOC on the page, link to the manifest
  • Open Access Freemium: TOC on the page, link to acquire the Web Publication
  • Exclusive Access: TOC on the page (but only points to excerpts), link to acquire the Web Publication

@iherman
Member

iherman commented Nov 15, 2017

Info about the Web Publication, no link to the manifest, not part of the WP

This could apply to a book behind a paywall, for example Hachette Book Group could start selling Web Publications directly from their website.

If the user is not logged in and/or hasn't bought that specific publication, he/she gets redirected to a product page with info about the book (use case 1) and a way to buy it (use case 3).

This is similar to the 'Le Monde' example. The identifier of the Hachette WPUB is behind the paywall, i.e., the surrounding system will force you, as a reader, to go through some HTTP redirections or whatever, before the HTTP response is indeed the content of the entry page you refer to. In other words, the content of that page at that URL can, without any problems, contain both the link to the manifest and be part of the publication.

Hachette may of course decide to provide a separate page where customers would go to buy/access the book, but the URL of that page is, for me, not the address/identifier of that book.

Info about the Web Publication, with a link to the manifest, not part of the WP

This could apply to a publisher distributing strictly Open Access content.

The catalog for that publisher has a page per publication to provide a broad presentation (metadata, cover, summary, tags) which would fit use case 1. That page would also include a link to start reading that WP (which would point to a resource that's part of the WP and in reading order) along with a link to the manifest (use case 2) that could trigger specific behaviours in a UA (similar to Chrome Install banner for PWA, we could imagine a similar banner that shows the cover, metadata and interactions such as "read" or "add to your shelf").

I believe we are mixing the address/identifier of the book, and any kind of other page on the vendor's/publisher's site that provides other services and information. In some cases the entry page can play both roles, and in some cases the two roles are played by two different pages.

@HadrienGardeur
Author

The identifier of the Hachette WPUB is behind the paywall, ie, the surrounding system will force you, as a reader, to go through some HTTP redirections or whatever, before the HTTP response is indeed the content of the entry page you refer to. In other words, the content of that page at that URL can, without any problems, contain both the link to the manifest and be part of the publication.

Hachette may of course decide to provide a separate page where customers would go to buy/access the book, but the URL of that page is, for me, not the address/identifier of that book.

@iherman OK so let me turn that into a bullet list to make sure that we're on the same page:

  • The WPUB URL must serve an HTML document that contains info about the publication + a link to the manifest
  • ... but it's perfectly fine to redirect the user instead to a separate URL that does not contain a link to the manifest or any content from the WP itself

Is that an accurate summary?

This doesn't cover whether the document returned by the WP URL is within the boundaries of the WP BTW, that's a different issue.

@iherman
Member

iherman commented Nov 15, 2017

@HadrienGardeur,

Let's imagine that they decide to replace their special viewer with a Web Publication instead:

  • Simple Open Access: TOC on the page, link to the manifest
  • Open Access Freemium: TOC on the page, link to acquire the Web Publication
  • Exclusive Access: TOC on the page (but only points to excerpts), link to acquire the Web Publication

I believe we are conflating the WPUB itself, and the various accesses to the WPUB. As far as I am concerned, these are different, and it is up to the vendor to ensure access to the same WPUB. How it happens (via the open access or not) is not the subject of this specification.

@GarthConboy
Contributor

Agree with "The WPUB URL must serve an HTML document that contains info about the publication + a link to the manifest"

Not quite sure where you're ( @HadrienGardeur ) going with "... but it's perfectly fine to redirect the user instead to a separate URL that does not contain a link to the manifest or any content from the WP itself". I do think any purchase/paywall stuff should be on the "outside/before" of the WP, and not within our scope of definition.

Re "doesn't cover whether the document returned by the WP URL is within the boundaries of the WP BTW" -- I used to be in the "doesn't matter" camp, but if majority is at "must be", I'm okay with that too.

I think we're somewhat going around in circles... this issue is largely a re-open of #94 -- perhaps it's better to resolve on the Monday call... I predict it will be the major agenda item. :-)

@HadrienGardeur
Author

HadrienGardeur commented Nov 15, 2017

Not quite sure where you're ( @HadrienGardeur ) going with "... but it's perfectly fine to redirect the user instead to a separate URL that does not contain a link to the manifest or any content from the WP itself". I do think any purchase/paywall stuff should be on the "outside/before" of the WP, and not within our scope of definition.

I'm simply rephrasing what @iherman is suggesting.

Also, if the WPUB URL itself is behind a paywall, this is definitely relevant here. It means that just knowing the URL won't be enough to discover/read the publication.

@GarthConboy
Contributor

Also, if the WPUB URL itself is behind a paywall, this is definitely relevant here. It means that just knowing the URL won't be enough to discover/read the publication.

I think any commerce is outside our scope and must happen before one gets to the WPUB URL. We should not be dealing with merchandising, commerce, or shelving at the WP level.

@HadrienGardeur
Author

From an HTTP perspective, this is purely status codes not merchandising/commerce/shelving/whatever.

This essentially means that we can't expect a 200 HTTP status code and an HTML document with all the things we've discussed so far.
No need to redefine anything about HTTP handling, but this indicates a requirement for a failure mode in our spec.
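
A minimal sketch of what such a failure mode could look like on the client side, purely as an illustration. The function name `classify_wpub_response` and the four outcome labels are hypothetical; nothing here is defined by the draft.

```python
def classify_wpub_response(status: int, content_type: str = "") -> str:
    """Classify the response obtained when dereferencing a WPUB URL.

    The outcome labels are hypothetical and only illustrate the cases
    discussed in this thread.
    """
    media_type = content_type.split(";")[0].strip().lower()
    if 300 <= status < 400:
        # Paywall or login flows typically send the user elsewhere.
        return "redirected"
    if status == 200 and media_type == "text/html":
        # The only case where the document can be inspected for a manifest link.
        return "entry-document"
    if status >= 400:
        # 4xx/5xx without any redirect: the publication is not accessible.
        return "unavailable"
    return "unexpected"
```

For example, `classify_wpub_response(200, "text/html; charset=utf-8")` yields `"entry-document"`, while a 403 from a paywall yields `"unavailable"`.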

@GarthConboy
Contributor

If you don't get a 200 from the WPUB URL, it's not one you have access to. I don't quite know if we agree or not.

@HadrienGardeur
Author

HadrienGardeur commented Nov 15, 2017

Sure, just want to make it very clear that this is not a magical URL that will always do what's been suggested so far. Many times it'll result in being redirected elsewhere or just getting an error (4xx/5xx without any redirect).

But the main point that I wanted to raise in this issue is the role of this document, and for now it's very very vague at best.

While it's clear that the URL itself identifies the WP, it seems that the only consensus we might have so far about the document that is returned is discovery of the manifest (and I insist on the term discovery, the presence of a link to a manifest is not meant to say that the current document belongs to the WP).

mattgarrish added a commit that referenced this issue Nov 15, 2017
@BigBlueHat
Member

Until a URL results in an entry page that identifies that URL as the address of the Web Publication, that URL is not a Web Publication URL.

Consequently, the scenarios you provided earlier are not about "a Web Publication URL":

Info about the Web Publication, no link to the manifest, not part of the WP
Info about the Web Publication, with a link to the manifest, not part of the WP

Neither of those result in an "entry page," so neither of those URLs are (at that point) the address of the Web Publication.

What's needed next is more clarity around how the "entry page" declares itself to be a Web Publication--it can't simply be a link to a manifest document as we've also defined that as a means for discovering Web Publications.

This entry page needs to declare to the UA that it authoritatively provides the infoset (through yet to be determined means) for a specific Web Publication.

@HadrienGardeur
Author

Neither of those result in an "entry page," so neither of those URLs are (at that point) the address of the Web Publication.

We don't even know what an "entry page" is, we haven't even agreed on the terminology, we're simply using this term because "landing page" was confusing for everyone involved in these discussions.

What's needed next is more clarity around how the "entry page" declares itself to be a Web Publication--it can't simply be a link to a manifest document as we've also defined that as a means for discovering Web Publications.

What's needed is actually quite different:

  • we need a clear idea of what this document does, along with its role and relationship to the rest of the WP
  • we also need to agree on a term, "entry page" is barely different from "landing page" and it sounds incredibly similar to start_url in the Web Application Manifest (maybe that's the point, because they're the same thing?)

I'm not convinced at all that we need any additional mechanism for a WP to "declare itself".
We already have discovery (link to the manifest) and we can rely on the manifest to at least know what's in the reading order (the boundaries of a WP are still very fuzzy and this requires a separate issue IMO).

Why would we require something else?

This entry page needs to declare to the UA that it authoritatively provides the infoset (through yet to be determined means) for a specific Web Publication.

That's not something we ever agreed on.

I know that you have your own agenda (one HTML document to rule them all, and in the <nav> element bind them), but there's absolutely zero consensus on the "entry page" being the place that provides the infoset.

@BigBlueHat
Member

If the manifest is in JSON and can be discovered from anywhere on the Web, what is responsible for how the authority within that manifest is used?

Is the expectation that "something else" will read this manifest and create the reading experience?

If that's the case, that falls well outside the extensible web manifesto. Where does one provide a shim/polyfill to use that manifest? In any of the resources that reference the manifest?

Given a Web Publication URL, we have to return something extensible that can "boot up" the publication and work as an authority of the "bound" resources.

The spec does currently say "Linking to a Manifest" and then present rel="publication" as used with a Web Publication URL. That likely needs re-consideration in light of providing an extensible response format (HTML) from that URL.

That "extensible response format" may indeed use a JSON document (in the end) to provide some or all of the infoset, but the browser can't be extended from a JSON file, so the initial response document becomes the authority over the other bits (JSON, JS, CSS, more HTML, etc).
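
A minimal sketch of the discovery step mentioned above: scanning the HTML response for `<link rel="publication">` elements and resolving their `href` against the document URL. The `rel="publication"` value comes from the draft as cited in this comment; the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class ManifestLinkFinder(HTMLParser):
    """Collect absolute URLs of <link rel="publication" href="..."> elements."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.manifests: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel is a space-separated list of tokens, compared case-insensitively.
        rels = (attr.get("rel") or "").lower().split()
        if "publication" in rels and attr.get("href"):
            self.manifests.append(urljoin(self.base_url, attr["href"]))


def find_manifests(html: str, base_url: str) -> List[str]:
    finder = ManifestLinkFinder(base_url)
    finder.feed(html)
    return finder.manifests
```

Note that this only covers discovery. It says nothing about whether the document carrying the link is itself the authority for the publication, which is the open question here.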

Is that any clearer?

@HadrienGardeur
Author

Well, actually it isn't.

You're basically saying that the "start URL" will provide a Web App for handling the Web Publication.

That's an option but certainly not a requirement. If my Web Publication already contains sufficient navigation to read the whole publication, why would we require a polyfill as well?
If your Web Publication works fine on the Web as-is, it doesn't need something to "boot it up".

To polyfill what exactly? Offline access? Packaging a WP into a PWP? Accessibility features?
It's good to have something that any browser can use without knowing about a WP, but I don't think that requiring a polyfill for every single feature listed in our spec is a good idea.

@BigBlueHat
Member

That's an option but certainly not a requirement. If my Web Publication already contains sufficient navigation to read the whole publication, why would we require a polyfill as well?

Well, conversely, why would one make it a Web Publication? Which is what folks at browser vendors keep asking--"why don't y'all just write PWAs?"

If your Web Publication works fine on the Web as-is, it doesn't need something to "boot it up".

Again. We are (and will be) asked what a Web Publication provides to the Web and its existing UAs.

To polyfill what exactly? Offline access? Packaging a WP into a PWP? Accessibility features?

We'll need to prove to browser vendors that there are things we need them to add/change about browser behavior--at least if this is a spec for teaching browsers about publications (which has been my assumption).

If we're simply defining a JSON format for publishers to build Web Apps around, then we're doing this all wrong. 😄

@HadrienGardeur
Author

I think we can both agree that this group has yet to prove to the rest of the W3C what makes Web Publications unique and requires a separate format and dedicated features in a browser.

But I don't think that requiring an "entry page"/"start URL" that basically behaves like a Web App is the best argument for that.
On the contrary, it'll simply prove that they're completely right to say that a Web Publication is just a PWA with its own internal JSON format.

Anyway, this feels like a different discussion... but my take is that in addition to what I've listed above in this issue, you're also proposing the "entry page" to serve a "WP viewer".

@TzviyaSiegman
Contributor

... but my take is that in addition to what I've listed above in this issue, you're also proposing the "entry page" to serve a "WP viewer".

I don't think anyone is suggesting that the entry page serves as a viewer. We are simply suggesting that it is the way to ENTER the publication. Methods of reading the publication are not up to the UA. We can talk about conformance or preferred behaviors later. Neither the entry page nor manifest should create a reading experience.
I am not sure how this issue differs much from issue #94, unless you are asking if that HTML doc MAY hold more than discussed there.

@BigBlueHat
Member

Pretty sure everyone in this group (and on this issue) believes in a future thing called a "Web Publication" that is built entirely from descriptive components--be they HTML, CSS, JSON, etc.

You are correct that it's my expectation that the "entry page" would serve as a "WP viewer." The hope being that in the future providing that viewer becomes optional, and that browsers' "reader modes" would use the same definitions to take that experience farther, or make it more consistent and accessible, etc.

The important part of the whole "entry point" discussion is that given a Web Publication URL, we need to feed the browsers something extensible. That document (and that URL) will ultimately define the authority space of the publication (via ServiceWorker scope, CORS, CSP, SOP, etc).

Conversely (word of the day! 😉), if a JSON document is the "authority" of the Web Publication, then we must re-define how all those things (CORS, CSP, SOP, etc) act when rendering the stuff inside that JSON and inside what browsing context that happens, etc. There's a much higher hill to climb here.

So. The role of this HTML page becomes the defining authority space for whatever comes after.
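
A rough sketch of how an "authority space" could be derived from the entry page URL: same origin, plus a path scope taken from the directory of the entry page. This is loosely modelled on how a service worker's default scope is the directory of its script; applying that rule to an entry page is an assumption made here for illustration, and both function names are hypothetical.

```python
from urllib.parse import urlsplit


def origin(url: str):
    """Return the (scheme, host, port) triple used for same-origin comparison."""
    parts = urlsplit(url)
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    return (parts.scheme, parts.hostname, parts.port or default_port)


def in_authority_space(resource_url: str, entry_url: str) -> bool:
    """True if the resource is same-origin and under the entry page's directory."""
    if origin(resource_url) != origin(entry_url):
        return False
    # Hypothetical scope rule: everything up to the last "/" of the entry page path.
    scope = urlsplit(entry_url).path.rsplit("/", 1)[0] + "/"
    return urlsplit(resource_url).path.startswith(scope)
```

With an entry page at `https://example.org/moby-dick/index.html`, a chapter at `https://example.org/moby-dick/c001.html` falls inside the space, while the same path on another host does not.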

@BigBlueHat
Member

I don't think anyone is suggesting that the entry page serves as a viewer.

There's no requirement for that, no, but it'd be my expectation that this HTML provides the opportunity to "provide a reading experience"--and that publishers would do that (or already do), even if it's just next/prev/contents links.

We are simply suggesting that it is the way to ENTER the publication.

...the way to enter the publication when one only has the publication URL (and not coming in from some sub-portion of the publication).

Methods of reading the publication are not up to the UA.

They will be in the future, though, correct? That may need to be its own issue. 😄

We can talk about conformance or preferred behaviors later. Neither the entry page nor manifest should create a reading experience.

The entry page MAY do that. The manifest can't...at least not on its own.

The idea is that for now (who knows how long...) the "entry page" will provide some sort of "reading experience" (however subjective...). The hope of any specification we write here is to set those expectations (by defining requirements). As we do that, we can shim those expectations from an "entry page," but we'd have to provide a separate thing (i.e. a "reading system") to interpret a JSON-only definition of a Web Publication.

Which is likely where @HadrienGardeur and I find ourselves on opposite ends of the same rope. 😃 He has a reading system, and I have publications. 😄

The core to all this though is what are we "teaching" browsers to do on behalf of the good people of the Web. If we require an "entry point" response as HTML (and presumably as part of the publication), then we have a way to show them that.

@BigBlueHat
Member

We can't use rel="manifest" unless we want everything that comes with that spec and specifically what's defined in the manifest life-cycle section.

Using rel="manifest" means the browser will (potentially) give the user an install banner which when acted upon will result in a standalone browser hosting just that single Web App. We can't overload or ignore that spec just to keep using the link relationship. Also, media types can't change the relationship stated between the links, just the format of the thing in the relationship. Link relationships define usage. Media types define format.

mattgarrish added a commit that referenced this issue Nov 18, 2017
updates per comments in issue #103
@HadrienGardeur
Author

A rel doesn't need to have the same behaviour for every media type, it's the combination of rel and media-type that gives you the full context.

There are additional conditions too for the browser to display an install banner (we'll need to use the same syntax as the WAM, use SSL and also include a Service Worker).
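
A small sketch of the position argued in this comment, i.e. that a UA could dispatch on the combination of `rel` and media type. This is not agreed behaviour (the previous comment argues the opposite). `application/manifest+json` is the media type defined by the Web App Manifest spec; `application/webpub+json` is the type used in the Readium syntax; the function name and return labels are hypothetical.

```python
WAM_TYPE = "application/manifest+json"   # media type of a Web App Manifest
WPUB_TYPE = "application/webpub+json"    # type used by the Readium syntax (assumed here)


def interpret_link(rel: str, media_type: str = "") -> str:
    """Decide how to treat a link based on both its rel tokens and its type."""
    rels = rel.lower().split()
    mt = media_type.split(";")[0].strip().lower()
    if "manifest" in rels and mt == WPUB_TYPE:
        return "web-publication"
    if "manifest" in rels:
        # Without a publication-specific type, fall back to Web App Manifest handling.
        return "web-app"
    return "other"
```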

@tcole3
Contributor

tcole3 commented Nov 20, 2017

Based on today's call, we seem to have consensus that what is required of an 'Entry Page' should be relatively minimal - i.e., a link to the Way Bill. My question goes a bit the other direction.

Is there any impediment to allowing (optionally, of course) an 'Entry Page' to also be a Start Page for the WP, or possibly even an entire WP (excepting of course the Way Bill, which we have decided must be a separately addressable JSON file)?

In other words, is our minimal requirement that all WPs consist of at least 2 resources: a Web Manifest (JSON) and an Entry Page (HTML)?

mattgarrish added a commit that referenced this issue Nov 20, 2017
@mattgarrish
Member

As proposed by Ivan in #103 (comment), the proposal for the fpwd is that the entry page:

  • MUST be an HTML document;
  • MUST include a link to the manifest;
  • SHOULD be a publication resource.

This issue will remain open past fpwd to capture input from the broader community.
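
The three requirements above could be checked along these lines. This is only a sketch: it treats `text/html` as the sole HTML media type, and the function name, parameters, and messages are hypothetical. MUST violations are reported as errors, the SHOULD as a warning.

```python
def check_entry_page(content_type: str, manifest_links: list, is_publication_resource: bool):
    """Return (errors, warnings) for a candidate entry page."""
    errors, warnings = [], []
    if content_type.split(";")[0].strip().lower() != "text/html":
        errors.append("entry page MUST be an HTML document")
    if not manifest_links:
        errors.append("entry page MUST include a link to the manifest")
    if not is_publication_resource:
        # SHOULD, not MUST: an external entry page is allowed but discouraged.
        warnings.append("entry page SHOULD be a publication resource")
    return errors, warnings
```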

@WSchindler

The minimal Entry Page is only required to have a link to the JSON waybill of the WP, because it's not feasible to prescribe for all kinds of use cases or types of publications what it should contain - TOC, cover, abstract, etc. Suppose that such an Entry Page would just contain that link and nothing else. A WP-aware user agent would access the waybill and find all the necessary information for giving the user access to the contents of the WP. But a non-WP-aware user agent that can't process the waybill properly might just display a blank page. Shouldn't we just add a further fallback option for traditional UAs, especially browsers, that the Entry Page should offer a way to access the contents of a WP and list some possible approaches?

@mattgarrish
Member

I'm trying to puzzle out if we can find better consensus by breaking apart what we want to achieve.

What started all of this was, I believe, this basic assertion:

  • A Web Publication MUST contain at least one HTML document which MUST provide an HTML link to the manifest.

This page ensures that we have at least one resource that is compatible with the WAM linking model, and for compatibility with vanilla user agents, search engines, etc. We probably don't need to call it anything special.

Where things have gone awry is in also making this the required address. There's no particular need for it to be, as far as I can tell, as vanilla anything isn't going to find the address or a way to this document from the manifest, since they don't understand the manifest. The same is true of any other document you reference. All the webby ways of finding documents will lead people into the publication.

Isn't all that we need an optional start url which tells wp-aware user agents what resource to load first? Even that is not required, as loading the first document in the default reading order seems like a natural enough choice.

What breaks or is not possible to achieve if the web publication does not have an address in the manifest? I can't answer that, which makes me think we might well be better off without it and the confusion it causes.

All I find myself wanting to clarify further is the non-Web Pub resource with a link to the manifest, but maybe all that needs is the following (only quasi-spec prose):

  • The manifest MAY include a start url that identifies a preferred resource to load when the Web Publication is initiated.

  • User agents MUST NOT initiate any external resources as though they belong to a Web Publication, even if they include a link to the manifest. In the absence of a start url, user agents SHOULD use the first resource in the default reading order to initiate the Web Publication in such scenarios.
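
The fallback in the second bullet can be sketched as follows. The `start_url` key is a hypothetical name borrowed from the Web App Manifest, and `spine` is the Readium name for the default reading order; neither is defined by the draft.

```python
from typing import Optional


def pick_start_resource(manifest: dict) -> Optional[str]:
    """Return the resource a WP-aware user agent should load first."""
    start = manifest.get("start_url")  # hypothetical optional member
    if start:
        return start
    # No start url: fall back to the first resource in the default reading order.
    spine = manifest.get("spine", [])
    return spine[0]["href"] if spine else None
```

The page that carried the link to the manifest never appears in this logic, so an external "launcher" page is not initiated as part of the publication.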

@iherman
Member

iherman commented Nov 21, 2017

@mattgarrish

You say:

What started all of this was, I believe, this basic assertion:

  • A Web Publication MUST contain at least one HTML document which MUST provide an HTML link to the manifest.

and I do not think that is correct. I believe we started with an assertion and a question:

  1. Assertion: there is a REQUIRED "address" entry in the information set, which is a URL
  2. Question: what happens when the "address" is dereferenced by, say, a Web browser or a search engine

And that is the question to which we gave an answer on the call and in the PR. In this respect, the "address" seems to play the same role as the "start_url" in the WAM, and I am fine with that. I have the impression that your proposed text would just complicate things further. Unless you want to reopen the discussion which led to the assertion above but, personally, I would not want to do so...

@mattgarrish
Member

The new assertion made is that there must be at least one html page somewhere on the web that any browser can render and by which the manifest can be found and (should be) a publication resource.

I find this problematic: what happens when someone creates a publication without any html documents, which is perfectly valid right now? Are they forced to put one up just for the sake of justifying an address in the manifest?

All I'm asking is whether we can find agreement on at least one HTML document as a publication resource without bringing "address" or "entry page" into the discussion, as the rationale is clearer and easier to understand.

For fpwd, we can leave the address in with the requirement for it to dereference to an html document that must have a link to the manifest, and leave it at that. It can be a pointer to the required html document in the absence of anything else.

That's all I'm really proposing for now.
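
To make the FPWD requirement above concrete (the address dereferences to an HTML document that links to the manifest), here is a minimal sketch of how a tool might look for that link. The thread does not define a link relation for the manifest, so the `rel` value `publication` used here is purely a placeholder assumption, as are the names `ManifestLinkFinder` and `find_manifest_links`.

```python
from html.parser import HTMLParser


class ManifestLinkFinder(HTMLParser):
    """Collect href values of <link> elements whose rel contains a given token."""

    def __init__(self, rel_token):
        super().__init__()
        self.rel_token = rel_token
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel is a space-separated list of tokens, compared case-insensitively.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if self.rel_token in rel_tokens and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def find_manifest_links(html, rel_token="publication"):
    """Return every manifest href found; rel_token is a hypothetical value."""
    finder = ManifestLinkFinder(rel_token)
    finder.feed(html)
    return finder.hrefs


page = (
    '<html><head>'
    '<link rel="publication" href="http://example.org/manifest.json">'
    '</head><body><h1>Moby-Dick</h1></body></html>'
)
print(find_manifest_links(page))
```

A page that returns an empty list would fail the "must have a link to the manifest" requirement under this assumed relation.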

@iherman
Member

iherman commented Nov 21, 2017

(Admin) @mattgarrish, shouldn't that be a separate issue? Or do we want to change the title of the issue? We are getting somewhere else, so to say...

@mattgarrish
Member

Yes, it probably would be useful. I think we have two questions swirling around this discussion: 1) what is the address and is it needed, which we can leave here; and 2) do we require an html document as a resource, which is an offshoot. Having separate answers to these questions would be useful, as they aren't bound to each other. I'll open a separate issue for the html question.

@HadrienGardeur
Author

HadrienGardeur commented Nov 21, 2017

As I mentioned during our last call, I also believe that we're mixing up two different concepts:

  • WPUB URL, which is essentially an identifier (along with a document returned by this identifier)
  • and start URL, which is where a WP-aware user agent should start displaying the publication

I see that @mattgarrish seems to also agree.

Since it's a little hard to provide examples given our lack of serialization in WP, I'll use the Readium syntax to illustrate.

Example 1: identifier, no start URL

{
  "@context": "http://readium.org/webpub/default.jsonld",
  
  "metadata": {
    "@type": "http://schema.org/Book",
    "title": "Moby-Dick",
    "author": "Herman Melville",
    "identifier": "urn:isbn:978031600000X",
    "language": "en",
    "modified": "2015-09-29T17:00:00Z"
  },

  "links": [
    {"rel": "self", "href": "http://example.org/manifest.json", "type": "application/webpub+json"}
  ],
  
  "spine": [
    {"href": "http://example.org/publication/c001.html", "type": "text/html", "title": "Chapter 1"}, 
    {"href": "http://example.org/publication/c002.html", "type": "text/html", "title": "Chapter 2"}
  ]
}

Example 2: identifier, start URL outside of the reading order

{
  "@context": "http://readium.org/webpub/default.jsonld",
  
  "metadata": {
    "@type": "http://schema.org/Book",
    "title": "Moby-Dick",
    "author": "Herman Melville",
    "identifier": "urn:isbn:978031600000X",
    "language": "en",
    "modified": "2015-09-29T17:00:00Z"
  },

  "links": [
    {"rel": "self", "href": "http://example.org/manifest.json", "type": "application/webpub+json"},
    {"rel": "start", "href": "http://example.org/publication/start", "type": "text/html"}
  ],
  
  "spine": [
    {"href": "http://example.org/publication/c001.html", "type": "text/html", "title": "Chapter 1"}, 
    {"href": "http://example.org/publication/c002.html", "type": "text/html", "title": "Chapter 2"}
  ]
}

Example 3: identifier, start URL part of the reading order

{
  "@context": "http://readium.org/webpub/default.jsonld",
  
  "metadata": {
    "@type": "http://schema.org/Book",
    "title": "Moby-Dick",
    "author": "Herman Melville",
    "identifier": "urn:isbn:978031600000X",
    "language": "en",
    "modified": "2015-09-29T17:00:00Z"
  },

  "links": [
    {"rel": "self", "href": "http://example.org/manifest.json", "type": "application/webpub+json"}
  ],
  
  "spine": [
    {"rel": "start", "href": "http://example.org/publication/c001.html", "type": "text/html", "title": "Chapter 1"}, 
    {"href": "http://example.org/publication/c002.html", "type": "text/html", "title": "Chapter 2"}
  ]
}

Just a few notes to provide additional clarity:

  • in a Readium Web Publication Manifest, the identifier has to be a URI, but it doesn't have to be a URL (that's why we have a URN in all those examples with an ISBN, which also works fine for PWP)
  • since we have a concept of links that's separate from resources in reading order (spine) and other resources from the publication (resources), it's very easy to reference a start URL that's not included in the publication itself
  • and if the start URL is part of the publication, all we need to add is "rel": "start"
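
The three examples above differ only in where the start resource comes from. Here is a minimal sketch of how a user agent might resolve it, assuming the Readium-style keys used in the examples (`links`, `spine`, `rel`, `href`) and assuming `rel` is a single string. The function name `resolve_start` and the lookup order are illustrative assumptions, not part of any specification.

```python
import json


def resolve_start(manifest):
    """Return the href a user agent would open first (illustrative only)."""
    # 1. A link with rel "start" that sits outside the reading order (Example 2).
    for link in manifest.get("links", []):
        if link.get("rel") == "start":
            return link["href"]
    # 2. A reading-order item flagged with rel "start" (Example 3).
    spine = manifest.get("spine", [])
    for item in spine:
        if item.get("rel") == "start":
            return item["href"]
    # 3. Otherwise fall back to the first item in the reading order (Example 1).
    return spine[0]["href"] if spine else None


# Trimmed-down version of Example 2; metadata omitted for brevity.
example_2 = json.loads("""
{
  "links": [
    {"rel": "self", "href": "http://example.org/manifest.json"},
    {"rel": "start", "href": "http://example.org/publication/start"}
  ],
  "spine": [
    {"href": "http://example.org/publication/c001.html"},
    {"href": "http://example.org/publication/c002.html"}
  ]
}
""")

print(resolve_start(example_2))
```

With the `start` link removed, the same function falls back to the first spine item, which is the behaviour Example 1 relies on.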

I believe we start by an assertion and a question:

  • Assertion: there is a REQUIRED "address" entry in the information set, which is a URL
  • Question: what happens when the "address" is dereferenced by, say, a Web browser or a search engine

I completely agree that this is where we start from.
But based on the current consensus (it MUST return an HTML document and MUST include a link to the manifest), I don't think that the document has to be a start URL.

  • The manifest MAY include a start url that identifies a preferred resource to load when the Web Publication is initiated.

  • User agents MUST NOT initiate any external resources as though they belong to a Web Publication, even if they include a link to the manifest. In the absence of a start url, user agents SHOULD use the first resource in the default reading order to initiate the Web Publication in such scenarios.

👍 for that proposal.

@iherman
Member

iherman commented Nov 21, 2017

@HadrienGardeur, just for my understanding. In all examples, the identifier of your publications is urn:isbn:978031600000X. In the WP definitions, that URN must have a URL equivalent; I do not remember what it is for ISBN (I am not sure there is one...) but let us say it is something like https://www.isbn.com/978031600000X. What will be returned if I dereference this URL?

Actually, if I follow your reasoning, there is no standard answer to this, but I am looking at what the reasonable setups are. In examples (2) and (3) it could be the start page (i.e., the one you identify as such), and I expect your answer for alternative (1) is 'whatever the publisher decides it to be'.

Which may be a technically reasonable answer. But if we go down this open-ended line, I am really worried that we are building a major source of bugs into the spec.

@HadrienGardeur
Author

HadrienGardeur commented Nov 21, 2017

@iherman in the Readium Web Publication Manifest, we treat this as purely an identifier, which will be used for example internally by a user agent.

There's absolutely no expectation that this will be shared or dereferenced, so we really don't care what's returned.

This is also meant to work for PWP/EPUB4, where we know that some publications won't have such a URL. In this situation, they can always use a URN for an ISBN or a UUID.
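
As a rough illustration of the distinction drawn here, a user agent could tell a dereferenceable identifier from a purely nominal one by its URI scheme. This is only a sketch: the helper name `is_dereferenceable` and the choice to treat only `http` and `https` as dereferenceable are assumptions made for the example.

```python
from urllib.parse import urlparse


def is_dereferenceable(identifier):
    """Return True if the identifier is an http(s) URL that a browser could fetch."""
    # Assumption: only http and https count; URNs (ISBN, UUID) are names only.
    return urlparse(identifier).scheme in ("http", "https")


for identifier in (
    "urn:isbn:978031600000X",
    "http://example.org/publication/",
):
    print(identifier, is_dereferenceable(identifier))
```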

@iherman
Member

iherman commented Nov 21, 2017

@HadrienGardeur forget about that for a moment. Say that the identifier is a URL. What happens then?

@HadrienGardeur
Author

@iherman well it doesn't change much.

In our case (Readium) we don't give any particular role to that identifier aside from identifying. It can return whatever the publisher wants, we don't care.

Since this is JSON-LD, it might also be ingested by a JSON-LD-aware crawler; this is where the identifier may matter a lot more.

Try the examples that I provided in the JSON-LD playground: https://json-ld.org/playground/

@iherman
Member

iherman commented Nov 21, 2017

@HadrienGardeur, at this moment I am not really interested in what Readium does, sorry about that. What I am asking is: how would you translate those three alternatives to a WP case? Is your answer that "It can return whatever the publisher wants, we don't care."?

If so, that is where I feel that this will be a source of errors in practice. The authors of the publications are not the same as the publishers, and if each can, sort of, pass the buck to the other, then we have a possible problem.

@HadrienGardeur
Author

How would you translate those three alternatives to a WP case?

I would say that these three examples already work fine for a WP as long as you use a URL instead of a URN.

I'm not sure what you mean by a source of errors in practice. The friction between author/publisher (+ third party content producer) will always exist, I don't see how this makes things worse.

@iherman
Member

iherman commented Nov 21, 2017

I'm not sure what you mean by a source of errors in practice. The friction between author/publisher (+ third party content producer) will always exist, I don't see how this makes things worse.

If we require that 'entry page' (whatever the name is) to be part of the publication (whether start page or not), then the responsibility to produce one lies with the publisher. A valid WP must have this. Otherwise... who knows?

@HadrienGardeur
Author

@iherman that's currently a SHOULD, not a MUST.

What we're discussing here is:

  • whether start URL = WPUB URL
  • what's the role of the start URL

@TzviyaSiegman
Contributor

I think this issue has become unnecessarily complicated. This is not the place to discuss identifiers, addressability, serialization, or whether we are using the start_url from WAM.

As @mattgarrish asked, we are dealing with only 2 points here:

  1. what is the address and is it needed, which we can leave here; and
  2. do we require an html document as a resource.

I think Matt has addressed that in his latest PR.

@HadrienGardeur
Author

Sorry @TzviyaSiegman but constantly saying that we should avoid discussions and/or close issues is not exactly helpful.

To address your comment:

  • the WP URL is an identifier, so yes this is within scope
  • we're not discussing serialization, but the complete lack of serialization doesn't really help when we want to illustrate things, which is why I used the one that we've designed for Readium
  • we're not discussing whether we want to use start_url from WAM but if we want to use the same term and/or concept, which is quite different

  1. what is the address and is it needed, which we can leave here
  2. do we require an html document as a resource.

We've already addressed both in #94, this is not what the current issue is about.

By saying that a WP URL (identifier for the WP) returns an HTML document we've opened Pandora's box, from which all these questions are popping up:

  • what's the role of that page?
  • does it belong inside the publication?
  • do we have any requirement for the content included in that page?
  • what do we call it?

BTW, we could just say that we don't care about any of those, and let the publisher decide. That option is on the table, but completely ignoring all these issues is not helpful.

@llemeurfr
Contributor

Proposal for resolution of the initial question:

  • the WP URL is called the WP Address (capital A, even if we know that each WP resource that contains a link to the manifest is a sort of WP address)
  • the role of the HTML page returned when dereferencing the WP Address is to provide an entry point to the WP for vanilla-browsers.
  • this page belongs to the publication, i.e. it is referenced in the manifest.
  • this page MUST include a link to the manifest

and I would add:

  • this page belongs to only one WP, not many.

I think it answers all the sub-questions raised by Hadrien and may be the best consensus we'll get.
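
The constraints in this proposal could be checked mechanically. The sketch below assumes a Readium-style manifest with `spine` and `resources` arrays, and assumes the caller has already extracted the manifest links found on the entry page. The function name `check_entry_page` and its parameters are hypothetical, not part of any draft.

```python
def check_entry_page(manifest, address, manifest_url, page_manifest_links):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    # "this page belongs to the publication, i.e. it is referenced in the manifest"
    listed = {
        item["href"]
        for key in ("spine", "resources")
        for item in manifest.get(key, [])
    }
    if address not in listed:
        problems.append("entry page is not referenced in the manifest")
    # "this page MUST include a link to the manifest"
    if manifest_url not in page_manifest_links:
        problems.append("entry page does not link to the manifest")
    # "this page belongs to only one WP, not many"
    if len(set(page_manifest_links)) > 1:
        problems.append("entry page links to more than one manifest")
    return problems


manifest = {
    "spine": [
        {"href": "http://example.org/publication/c001.html"},
        {"href": "http://example.org/publication/c002.html"},
    ]
}
manifest_url = "http://example.org/manifest.json"

print(check_entry_page(
    manifest,
    "http://example.org/publication/c001.html",
    manifest_url,
    [manifest_url],
))
```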

@HadrienGardeur
Author

@llemeurfr this works for me as long as we keep this completely separate from the concept of a start URL.

I think that the spec language proposed by @mattgarrish in #103 (comment) is also useful and complementary.

@HadrienGardeur
Author

Closing this issue since the changes have been integrated in the draft a while ago.
