"the infoset should not include specialised metadata" ??? #74

Closed
eshellman opened this issue Sep 29, 2017 · 23 comments

Comments

@eshellman

I'm puzzled by this:

The Web Publication Infoset should not include complex, specialised, and industry-specific metadata and authors should limit the metadata included in the manifest to the items above.

If the information set does not cover the publication's needs, authors should link to external metadata files in whichever formats and schema are most commonly accepted as the authoritative metadata format for their intended audiences. They can include multiple metadata links in multiple formats, one for each intended audience, if needed.

Am I to understand from this that wpub should not be used for "specialized" or "complex" uses?

For example, consider open textbooks and the specialized and complex attribution and copyright metadata that might be needed to support customization and re-use. Is wpub saying that such metadata SHOULD NOT be part of a web publication's infoset? Or is it just saying that people making wpubs shouldn't defecate in the kitchen?

Perhaps @baldurbjarnason and the "we" of PR #64 could provide some "why" to motivate these should-nots. I'm guessing that it's because it's assumed that wpub reader applications will try to assemble the entire infoset. Is that correct? If so, might that assumption be premature? In any case, a bit more explanation will make the section much easier to interpret; most people consider everyone else's metadata to be complex and specialized. Even "Publisher" is becoming a pretty esoteric metadata term these days!

@GarthConboy
Contributor

Love your analogy! :-)

I take those two quoted sentences to mean don't put all the interesting, specialized, complex, or industry-specific metadata directly in the manifest, instead point to it from the manifest. Which I further take to mean "don't invent stuff from whole cloth, point to standard incarnations." So, not "don't do it" just "do it in an appropriately standard way."

@baldurbjarnason
Contributor

baldurbjarnason commented Sep 30, 2017

Yes, what @GarthConboy said.

I agree that my phrasing wasn't clear.

Those sentences were intended to tell authors which pieces of metadata can be included in the JSON serialisation of the manifest. They shouldn't be taken to be about what can or cannot be included in the web publication as a whole.

The idea is that if your metadata needs aren't covered by what the infoset provides, rather than, say, pull in schema.org wholesale into the JSON manifest file, you should put that metadata in a file and link to it from the manifest. That still makes the metadata file a part of the publication and supporting User Agents can fetch and use it as they would any of the publication's resources.
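The linking approach described above can be sketched in Python as follows. This is only an illustration: the thread does not define a manifest serialisation, so the manifest shape, the `links` property, the use of the `describedby` link relation, and the file paths are all assumptions.

```python
import json

# Hypothetical manifest: the property names and paths are illustrative
# assumptions, not taken from any Web Publications draft.
MANIFEST_JSON = """
{
  "title": "Open Textbook",
  "language": "en",
  "resources": ["index.html", "chapter1.html"],
  "links": [
    {"rel": "describedby", "href": "metadata/schema-org.jsonld",
     "type": "application/ld+json"},
    {"rel": "describedby", "href": "metadata/onix.xml",
     "type": "application/xml"}
  ]
}
"""


def linked_metadata(manifest: dict) -> list[dict]:
    """Return the manifest's links that point at external metadata records."""
    return [
        link
        for link in manifest.get("links", [])
        if link.get("rel") == "describedby"
    ]


manifest = json.loads(MANIFEST_JSON)
records = linked_metadata(manifest)
for link in records:
    # A user agent could fetch these like any other publication resource.
    print(link["href"], link["type"])
```

The manifest itself stays small, and each audience gets its metadata in the format it already uses.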

Currently the infoset proposes to provide title, language, identifier, address, resources, reading order, ToC, creators, publication and modification dates, accessibility, subtitle, licensing, audience, and subject. All of that is subject to change, of course, depending on how things go, but hopefully you won't need to include a separate metadata file if your needs are covered by these items.

Licensing, attribution (i.e. creators and their roles), accessibility, and audience—the use cases you mentioned for open textbooks—are already specifically mentioned as items that the manifest should or may include so it's likely we'll at least try to figure out standard ways of letting people embed that information in the manifest, most likely by reusing a subset of a pre-existing metadata schema.

@eshellman
Author

We're not talking about a portable web publication, which would require differentiation between the components being ported and those not ported, correct? So how are things linked from the manifest different from things included in the manifest? I ask because the definitions elsewhere say the manifest is "a list of all resources" and the infoset "is primarily compiled from a Web Publication's manifest". So if the links to metadata are in the manifest, are they part of the infoset or not? Or maybe the infoset includes the links to the "specialized" metadata?

So I'm guessing that what is meant here is that only the manifest will include embedded metadata (like meta tags in html, as opposed to link tags); but that most metadata should be linked, not embedded. But there's no motivation for why (for example) "creator" metadata demands to be privileged for embedding, rather than linking, in the manifest. Because even something as simple as that gets complicated fast. "Creators" of what? What role? How do we identify a creator?

I hope my questions, as someone coming fresh to the discussion but not the field, help to show how others might be confused by the abstraction of the text.

@mattgarrish
Member

I feel like we're on the epub metadata rollercoaster of madness -- if we don't define it, user agents won't use it; but once we define it there's nothing to make UAs use it.

Dare I suggest that we go the registry route? As imperfect as it might be, why not let the community decide what other properties are needed and how to express them? It would also give a path to future standardization once we know what gets traction.

We could change "additional metadata" to more of a short extensibility section where we also state that linked records are another way of providing properties in a standardized format that UAs might parse. We could leave it to UAs to determine what they parse out of such linked records and how they represent the information and/or such processing could be outlined in separate documents if there's enough interest.

A thought anyway.

@iherman
Member

iherman commented Oct 5, 2017

@mattgarrish, do you mean going through the rel=XXX route of the <meta> element, or defining our own registry mechanism? The former is way too complicated and controversial. The latter may work… I believe at the end of our process. For the time being, the "community decide" mechanism is there when we begin to publish official working drafts and we would get, hopefully, public comments.

As for the separation of the information set items and the additional metadata (for lack of a better word for now), I think that, actually, @eshellman's comparison with the <meta>/<link> elements' duality in HTML is a good one. HTML defines a number of values in the header (whether via explicit elements like title or through registered rel values for <meta>) that are considered to be the top priority ones for HTML, and leaves the door open for additional metadata via <link>. My mental model is that all metadata that are, in the abstract sense, in the namespace of Web Publications are to be among the information set items, and we would link out for metadata vocabularies that we do not define. I guess the two mental models are fairly similar.

@mattgarrish
Member

do you mean going through the rel=XXX route of the <meta> element

No, I'm not talking about rel links but the actual properties to use, so we'd define our own. Something like this informative guide we did for epub: https://idpf.github.io/epub-guides/package-metadata/

As soon as we start trying to define "metadata" for a publication, we end up being incomplete and confusing. I'd rather the properties we define be as tightly focused on functionality as we can make them and leave everything else out of the specification.

@iherman
Member

iherman commented Oct 5, 2017

Right. So the question is whether what we have now in the document corresponds to the 'minimal set'. I am all for trimming the list, but we have already had quite a few discussions on each and every item in it, and I am not sure there would be consensus on cutting out any of those...

@dauwhe
Contributor

dauwhe commented Oct 5, 2017

I'd rather the properties we define be as tightly focused on functionality as we can make them and leave everything else out of the specification.

Agreed. I sometimes find it useful to make a distinction between functional/structural metadata (like default reading order) and content metadata (like creator or subtitle). The former is essential for user agents to present the publication.
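The distinction can be sketched as a simple partition of the infoset, shown here in Python. Which items belong in the functional set is this sketch's assumption (only default reading order is named above), and the key names are hypothetical.

```python
# Assumed set of functional/structural items; the thread only names
# default reading order explicitly, the rest is illustrative.
FUNCTIONAL_KEYS = {"resources", "reading_order", "toc"}


def split_infoset(infoset: dict) -> tuple[dict, dict]:
    """Partition an infoset into (functional, content) metadata."""
    functional = {k: v for k, v in infoset.items() if k in FUNCTIONAL_KEYS}
    content = {k: v for k, v in infoset.items() if k not in FUNCTIONAL_KEYS}
    return functional, content


infoset = {
    "resources": ["index.html", "chapter1.html"],
    "reading_order": ["index.html", "chapter1.html"],
    "creators": ["A. Author"],
    "subtitle": "A Worked Example",
}
functional, content = split_infoset(infoset)
print(sorted(functional))
print(sorted(content))
```

Under this split, a user agent needs only the first dictionary to present the publication.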

@mattgarrish
Member

mattgarrish commented Oct 5, 2017

I am not sure there would be consensus on cutting out any of those

I'm not saying the metadata becomes inexpressible, only that we should be stricter in evaluating what is essential to define in a specification. The original set of properties we had was that, in my opinion. We can make some or all of the content metadata the first additions to a registry.

The infoset remains extensible for additional properties, which includes any and all content metadata anyone wants to include. We shouldn't even prefer linking or embedding, but simply provide the mechanisms for both. Let the ecosystem work these details out.

The problem, as we found with epub, is that no one agrees on what they want to express. We're not moving beyond that problem by trying to cherry pick certain content properties. The next question is always "what about this property that is critical to me".

But if we can't get consensus on shifting them all out, I still think it would be preferable to at least remove the MAY metadata.

@BillKasdorf

Re Matt's "Dare I suggest that we go the registry route? As imperfect as it might be, why not let the community decide what other properties are needed and how to express them?":
We need to remember that there are many communities and some of them already have registries. Scholarly publishing has Crossref, the DOI RA (Registration Agency) for scholarly journal articles, books, chapters, proceedings, etc., and yes indeed, Crossref specifies the metadata that must be associated with all the items in that registry (in fact in a real sense the database of that metadata is the registry). But that registry would have varying overlaps with others: there have been discussions with magazines making use of the Crossref infrastructure, but magazines would require mostly different metadata, and even when there are overlaps (titles, contributors, etc.) they express that metadata in a different way. There are also registries for contributors (e.g., ORCID), public identities (ISNI), organizations (ISNI, Ringgold), etc. etc. We need to accommodate these things but not try to duplicate or replace them. Plus once all of the publications from all these different sectors and spheres become Web Publications, a master registry of all of them would be an incredibly daunting task, though I suppose theoretically possible.

@iherman
Member

iherman commented Oct 5, 2017

@mattgarrish

But if we can't get consensus on shifting them all out, I still think it would be preferable to at least remove the MAY metadata.

That is certainly a good start.

@mattgarrish
Member

a master registry of all of them would be an incredibly daunting task

And this is where flexible extensibility comes in. A registry handles the "how do I" problem for some subset of users, but we're never going to solve the metadata problem for all the different publishing industries, nor should we try.

The problem is that once we (in the specification) open the door wide to overlapping metadata, it looks like we are trying to replace them. At the very least, it leaves people scratching their heads trying to understand the purpose of what is allowed and how they're supposed to handle the things not described.

I've been asked numerous times over the years how to express X in epub when X cannot be expressed by DC or the meta element. Telling people to link a record has just gotten me called various names of stupid because it's not what is needed or wanted. Which is not to say that I think linking is a bad idea; I actually find it an elegant solution. There are just people who want metadata for reasons that have nothing to do with the user agent or elegance.

Anyway, these are old issues and arguments to you, Bill.

I'd just concede that maybe "registry" is too laden a term. "Guide" might be more appropriate, like we did in epub, especially if all we're doing is logging metadata from existing schemes that can be embedded in the manifest file.

@baldurbjarnason
Contributor

@mattgarrish

I feel like we're on the epub metadata rollercoaster of madness -- if we don't define it, user agents won't use it; but once we define it there's nothing to make UAs use it.

And when they use it, they use it wrong. The epub ecosystem is messed up. A lot of the problems in the epub world stem from the fact that implementing the basic metadata structure (even before you get to the terms themselves) is hard. The refines and chaining mechanism is a huge pain in the rear. (Daniel Glazman outlined some, but not all, of the issues in a series of blog posts IIRC.) This is a dysfunction that is both particular to ePub3 and had huge knock-on effects for the entire ecosystem. It threw out the pre-existing mechanism and made everybody start from scratch. It was more complicated than many of the other mechanisms people were using. And even though it reused terms it was structured in such a way that you realistically couldn't reuse code related to those terms.

So, it didn't really matter that much what metadata terms you defined or included for ePub3 because people were (and in many ways are still) having a hard time properly supporting the minimum metadata required for basic functionality. A complex mechanism leads to brittle code which means user agents are less willing to touch it once they've got it working.

This is the reason why I strongly support delegating all non-functional metadata to external, pre-existing formats. These formats already have parsers, UI widgets, database schemas, and search engine support (some by general web search engines, some by specialised search engines).

The less we stand in the way of code reuse, the more likely we are to get wide support. Epub is notorious for making format decisions that prevented code reuse from the web and other ecosystems.

I'd just concede that maybe "registry" is too laden a term. "Guide" might be more appropriate, like we did in epub, especially if all we're doing is logging metadata from existing schemes that can be embedded in the manifest file.

Your suggestion worries me a lot. You are proposing a substantial increase in complexity for the manifest, reading systems, and authors. If you go with a less formal guide it'll be a chaotic mess that is hard to understand and author. If you go with a proper registry you end up with a bureaucratic mess that is complex to implement on the reading side and will always lag actual practice. And either route gives us no additional features over just supporting linked metadata formats that somebody else is specifying. It's a lose-lose-lose proposition. It just makes things complicated.

I'd rather we strip out all metadata that isn't directly functional, like you and @dauwhe have also suggested, and just 'bless' a metadata scheme you link to as the recommended lowest common denominator default that all reading agents should try to support (preferably schema.org in a linked JSON-LD file using the schema.org-provided context file). If we get into the metadata registry and schema authorship game we make the entire ecosystem more complicated as there are a bunch of pre-existing players in that field and now there's a new one everybody has to accommodate.

If and only if we end up using JSON-LD for the manifest then we can talk about the possibility of letting authors embed that schema.org file in the manifest but only if the embedded data uses the schema.org context as that shifts the responsibility of managing the union of the various schemas over to schema.org and lets authors reuse the metadata toolchains they're using for the web.

A lot of publishers are managing schema.org metadata in their CMSes for SEO purposes. But based on what I'm told very few of them are using a proper JSON-LD toolchain. (And rumour has it that at least one of the search engines isn't either.) So relying on the schema.org schema without their context file is effectively a completely new metadata format for many pre-existing users.

(It's that code reuse thing again. Schema.org without the schema.org context file prevents code reuse for a lot of existing players.)

I personally think we should use JSON-LD as the format if we aren't using the web app manifest, as that gives us Linked Data Signatures for free, which covers a lot of cryptographic signing and authentication use cases, but that's an entirely separate issue.


I suspect that part of the reason @mattgarrish's 'just link to it' recommendation got a negative response from epub authors was because ebook reading systems don't support it in any real way and all authoring guides focus on the embedded metadata. If somebody had made that suggestion to me when I was in ebook dev, then yeah, I probably would have given a harsh reply.

The problem is never about what epub developers want personally but always about what the systems support. None of the big systems support linking metadata and they have never given any indication that they intend to support it. Ebook devs want solutions that work and suggesting a theoretical solution that has no real world application will obviously get you a bad response.

Baking linked metadata into web publications at the start is an entirely different situation, especially if we don't give in to temptation and start our own metadata schema (or curate a schema, which is just as bad). Make it unambiguous from the start (a hard requirement for reading agents); specify it in such a way that they can reuse pre-existing code; use terms and relationships they already have DB schemas for; and you're unlikely to get the same dysfunction as epub.

@dauwhe
Contributor

dauwhe commented Oct 5, 2017

Adding to what @baldurbjarnason said, metadata in EPUB was both messed up and unimportant. Speaking only about the world of large traditional publishing, most of the metadata issues were solved with ONIX, and happened entirely outside EPUB. Many of us, having heard the priests of metadata sermonize on its importance, tried to embed metadata in EPUB because it was The Right Thing To Do™. I remember lots of discussion about what should go in dc:rights; some people just put the string "all rights reserved." 😧 But no reading system would use such metadata, partly because things were messed up, and partly because there was already a better solution.

For web publications that end up living outside of the existing supply chain (which I fervently hope they will), ONIX won't save them. Publishers will have to figure out how metadata on the web works. I'm sure Baldur has thoughts on how well that will go :)

@mattgarrish
Member

Your suggestion worries me a lot.

Oh, it worries me, too. :)

It was more a case of throwing a bone: if we have to have this stuff, let's at least try to hide our shame somewhere out of sight. I'd much rather we simply remained quiet and just identified that the infoset is extensible through additional properties and linked records, but user agents are not required to process such information. End of story.

@iherman
Member

iherman commented Oct 6, 2017

Let me try to summarize a long thread...

  1. We have to go through the items listed as information items on a SHOULD or MUST level with a very critical eye to see if they are indeed all universally needed for Web Publications. Goal is to possibly reduce the list.
  2. Those that are filtered out in step 1, plus those that are listed with a MAY should be marked (somewhere in the document) as examples for metadata that should be expressed via "external", i.e., existing vocabularies like schema.org, Dublin Core, ONIX, etc.
  3. There should be a clearer statement somewhere, probably labelled as "extension points", making clear that there is a mechanism whereby possibly several metadata references can be linked.

Item 1 is not necessarily urgent right now; I would propose to do that when the first dust settles around the FPWD. Also, there are details around item 3 to clarify (e.g., does a link towards an external vocabulary include some hints as to the vocabulary used?).
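One way the "hint" question for item 3 could play out is sketched below in Python. The use of a media type in a `type` field as the vocabulary hint is an assumption of this sketch, not something the group has decided, and the link entries are hypothetical.

```python
from typing import Optional

# Media types this hypothetical user agent knows how to parse.
SUPPORTED_TYPES = ("application/ld+json",)


def pick_record(links: list[dict],
                supported: tuple = SUPPORTED_TYPES) -> Optional[dict]:
    """Return the first linked metadata record whose type hint is supported.

    A user agent that finds no supported record simply ignores the links.
    """
    for link in links:
        if link.get("type") in supported:
            return link
    return None


links = [
    {"href": "metadata/onix.xml", "type": "application/xml"},
    {"href": "metadata/schema-org.jsonld", "type": "application/ld+json"},
]
chosen = pick_record(links)
print(chosen["href"] if chosen else "no supported metadata record")
```

Without some hint of this kind, a user agent would have to fetch every linked record just to find out whether it can read it.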

Is this a fair summary?

@eshellman
Author

This is the reason why I strongly support delegating all non-functional metadata to external, pre-existing formats. These formats already have parsers, UI widgets, database schemas, and search engine support (some by general web search engines, some by specialised search engines).

I strongly agree with this.

@mattgarrish
Member

If anyone has comments on pull request #83, please add them soon, as I'd like to merge it today, if possible.

There are still many details of manifest + infoset we'll need to work out post-FPWD, of course.

@HadrienGardeur

I'm surprised that this issue is still open, but I'd like to clarify one specific point.

Several members of the group have indicated their preference for external linking, for example to a JSON-LD document. But what if the WP manifest itself is JSON-LD? Is it still preferable to have an external document in that case?

@iherman
Member

iherman commented Nov 15, 2017

If the WP manifest is in JSON-LD, then linking to external metadata is probably a "should". I see several reasons for this (off the top of my head):

  • some of these metadata (like ONIX) may be huge, and managing them separately is therefore more efficient (and more efficient for user agents that are not interested in the ONIX data)
  • some of these metadata vocabularies may have their own @context files, optimized to this vocabulary; pulling all of them together into one @context may have unwanted consequences
  • various metadata may be authored by different groups/people

But, again if the WP manifest itself is in JSON-LD, it is certainly not a MUST.

@baldurbjarnason
Contributor

What @iherman said. I'd also add that one of the things some formats that use JSON-LD have discovered is that using a fully-fledged JSON-LD processor to parse JSON-LD isn't that popular.

E.g. IIRC with Activity Streams, Mastodon just uses a JSON parser for the most part and just assumes that the JSON-LD namespaces it supports are mapped in a way they expect. Or, at least it did a few weeks ago when I last had a browse through its code. May have changed now given the pace of its development.

And schema.org metadata is often both generated and consumed by regular JSON parsers that treat it as its own bespoke format (e.g. a 'schema.org' JSON format, not as a schema). Google processes it properly, tho.

The implication here is that, while JSON-LD is really useful in that it comes with a built-in extension mechanism, a lot of people use that extension mechanism by just assuming a standard context with no conversion or compacting.

That works decently if you are just adding a small set of properties to an already specified format but it doesn't scale particularly well.

You could easily run into situations where bringing schema.org metadata into the manifest effectively hides it from many schema.org consumers.
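A small Python sketch of that failure mode follows. The consumer is a made-up stand-in for tools that read schema.org data with a plain JSON parser, and the manifest context URL and prefixed layout are assumptions chosen to illustrate the point.

```python
from typing import Optional


def naive_schema_org_author(doc: dict) -> Optional[str]:
    """Mimic a consumer that uses plain JSON access, not a JSON-LD processor.

    It only recognises documents whose @context is exactly schema.org and
    whose properties use the unprefixed schema.org names.
    """
    if doc.get("@context") not in ("https://schema.org", "http://schema.org"):
        return None
    author = doc.get("author")
    if isinstance(author, dict):
        return author.get("name")
    return author


# Stand-alone linked record using the schema.org context directly.
standalone = {
    "@context": "https://schema.org",
    "@type": "Book",
    "name": "Open Textbook",
    "author": {"@type": "Person", "name": "A. Author"},
}

# The same statement embedded in a hypothetical JSON-LD manifest that has
# its own context and maps schema.org terms under a prefix.
embedded = {
    "@context": [
        "https://example.org/wpub-context.jsonld",
        {"schema": "https://schema.org/"},
    ],
    "title": "Open Textbook",
    "schema:author": {"@type": "schema:Person", "schema:name": "A. Author"},
}

print(naive_schema_org_author(standalone))
print(naive_schema_org_author(embedded))
```

Both documents describe the same author to a full JSON-LD processor, but the naive consumer only finds it in the stand-alone record.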

But all of the above is just a caveat. Sticking to external linking even when you're using JSON-LD is certainly not a MUST, like Ivan said. Integrating the data in those cases isn't a clear-cut win but I think people should be allowed to make that call themselves under those circumstances.

@iherman
Member

iherman commented Mar 2, 2018

Propose closing: the current draft makes it clear that it must be possible to link to other metadata. I think that is the overall answer, which seems to reflect consensus.

@iherman
Member

iherman commented Mar 13, 2018
