machine readable metainformation for manifest items? #1675

Closed
Doktorchen opened this issue May 14, 2021 · 8 comments
Labels
Status-Declined The issue has been reviewed and not accepted by the working group for inclusion Topic-PackageDoc The issue affects package documents

Comments

@Doktorchen

Due to new regulatory efforts, for example in the EU, platforms for digital books may introduce upload filters: AI programs that detect possible copyright issues in EPUBs.

It could therefore be helpful to attach the corresponding metainformation to each manifest item directly inside the OPF document, to reduce the probability of false positives, which would cause a lot of annoyance for authors.
This can be relevant for fonts, images, graphics (SVG documents and fragments inside XHTML documents), articles, quotes, etc.

Is there currently an option to do this in a normative, unambiguous way?
If it is not yet possible to provide such information within the OPF file, what needs to be added to EPUB 3.3 to allow authors to provide such additional information about manifest items, or even about fragments of manifest items?

An alternative could be to use meta elements with a refines attribute within the metadata element to associate metadata with documents (and document fragments) referenced in the manifest.
However, the metadata element currently has a poor structure; it would be much better to use RDF within it to provide such metadata.
In this case it would help both authors and the programmers of such upload filters to have some (normative?) advice on how to correlate metainformation about licences, authors, external sources, etc. of items with the corresponding parts of the book.
Maybe similar to
https://www.w3.org/TR/epub-33/#example-4
But there might still be a problem if only a fragment of a manifest item is the target of the information (quote, SVG fragment, article).
An explicit example of how to correlate the metadata would be helpful in the 3.3 draft.
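
To make the meta/refines idea concrete, here is a minimal sketch of how a filter program might collect per-item metadata from a package document. Everything in the sample package is an assumption made for illustration: the item ids, file names and example.org URLs are invented, and using `dcterms:license` and `dcterms:source` as the refining properties is one possible choice, not something EPUB 3.3 prescribes for this purpose.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical package document: ids, hrefs and metadata values are invented.
PACKAGE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="uid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="uid">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>Example</dc:title>
    <dc:language>en</dc:language>
    <meta refines="#cover-img" property="dcterms:license">https://creativecommons.org/licenses/by-sa/4.0/</meta>
    <meta refines="#cover-img" property="dcterms:source">https://example.org/source/cover</meta>
    <meta refines="#body-font" property="dcterms:license">https://example.org/licences/font</meta>
  </metadata>
  <manifest>
    <item id="cover-img" href="images/cover.svg" media-type="image/svg+xml"/>
    <item id="body-font" href="fonts/body.woff2" media-type="font/woff2"/>
    <item id="ch1" href="ch1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
</package>"""


def metadata_by_item(package_xml):
    """Group refining <meta> values by the href of the manifest item they refine."""
    root = ET.fromstring(package_xml)
    # Map manifest item ids to their hrefs.
    hrefs = {item.get("id"): item.get("href")
             for item in root.iterfind("opf:manifest/opf:item", NS)}
    result = defaultdict(dict)
    for meta in root.iterfind("opf:metadata/opf:meta", NS):
        refines = meta.get("refines", "")
        # Only "#id" references to manifest items are handled in this sketch.
        target = refines[1:] if refines.startswith("#") else None
        if target in hrefs:
            result[hrefs[target]][meta.get("property")] = (meta.text or "").strip()
    return dict(result)


records = metadata_by_item(PACKAGE)
for href, props in records.items():
    print(href, props)
```

Items without refining metadata (here `ch1.xhtml`) simply do not appear in the result, so a consumer can tell "no statement made" apart from "statement present".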

Currently, RDF can already be used for such information within SVG:metadata.
HTML5 also suggests RDF as an option for such information within the XHTML:head element:
https://www.w3.org/TR/html52/dom.html#metadata-content-2
(I think epubcheck 4.2 currently still does not accept this, although according to HTML5 it is valid for the XML serialisation; within SVG:metadata it is accepted even by epubcheck 4.2.)

With regard to this possible upload-filter problem, being able to provide information about such metadata within the OPF file might increase the probability that it is recognised by AI programs.

@gregoriopellegrino
Contributor

I ping @llemeurfr who is leading Text and Data Mining Reservation Protocol Community Group

@mattgarrish
Member

Isn't this just a case of resource-specific linked records?

<link rel="record" refines="#some-resource" href="record.rdf" media-type="application/rdf+xml"/>
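
As a rough illustration of how a consumer could pick up such resource-specific linked records, the sketch below separates `link` elements with `rel="record"` into publication-level records and records that refine a manifest item. The package fragment is hypothetical: the manifest item, its href and the second, unrefined link are invented for the example; only the first `link` element is taken from the comment above.

```python
import xml.etree.ElementTree as ET

NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical package fragment: the manifest item and the second link are invented.
PACKAGE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata>
    <link rel="record" refines="#some-resource" href="record.rdf" media-type="application/rdf+xml"/>
    <link rel="record" href="publication-record.xml" media-type="application/marc"/>
  </metadata>
  <manifest>
    <item id="some-resource" href="images/figure1.svg" media-type="image/svg+xml"/>
  </manifest>
</package>"""


def linked_records(package_xml):
    """Return (publication_records, records_by_item_href) for rel="record" links."""
    root = ET.fromstring(package_xml)
    hrefs = {item.get("id"): item.get("href")
             for item in root.iterfind("opf:manifest/opf:item", NS)}
    publication, per_item = [], {}
    for link in root.iterfind("opf:metadata/opf:link", NS):
        # rel is a space-separated list of values.
        if "record" not in (link.get("rel") or "").split():
            continue
        entry = (link.get("href"), link.get("media-type"))
        refines = link.get("refines")
        if refines is None:
            # No refines attribute: the record describes the whole publication.
            publication.append(entry)
        elif refines.startswith("#") and refines[1:] in hrefs:
            per_item.setdefault(hrefs[refines[1:]], []).append(entry)
    return publication, per_item


publication, per_item = linked_records(PACKAGE)
print(publication)
print(per_item)
```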

@mattgarrish mattgarrish added the Topic-PackageDoc The issue affects package documents label May 14, 2021
@Doktorchen
Author

Currently, rdf+xml is not listed as a core media type
https://www.w3.org/TR/epub-33/#sec-core-media-types
but for the OPF link element it is noted:
"Linked resources are not Publication Resources and MUST NOT be listed in the manifest. A linked resource MAY be embedded in a Publication Resource that is listed in the manifest, however, in which case it MUST be a Core Media Type Resource (e.g., an EPUB Content Document could contain a metadata record serialized as [RDFA-CORE] or [JSON-LD])."
Therefore, in practice (also given the problem with epubcheck 4.2), one has to reference a metadata element inside an SVG document that contains the RDF, but then the media-type would be image/svg+xml.
https://www.w3.org/TR/epub-33/#sec-link-elem

Example 20 is surprising as well:
https://www.w3.org/TR/epub-33/#example-20
application/marc is not a core media type, but the href points to a local resource.
The same applies to the second link in example 23.

However, unless such a use case is mentioned with an extended example, I think the chance is low that authors will use it and that programmers of upload filters will implement it.

Open question: How to refine a fragment of a resource?

Another surprise is that a linked resource must not be listed in the manifest.
Using a dedicated XHTML file with RDFa would be possible and might be of some use to the human audience as well, so why disallow it? If authors and audience can see a presentation of such content, authors have more motivation to add it at all, and in a proper way.

Wouldn't it be better to say that linked files that are not core media types must not be listed in the manifest without a fallback, and are not expected to be of interest to the human audience?
Only programs with an interest in metadata such as licence information would recognise or interpret such files?
Others would not be required to care about them?

@mattgarrish
Member

Currently, rdf+xml is not listed as a core media type

Doesn't matter. Link elements are not publication resources so are also not subject to fallback rules. You can link to whatever works if this becomes a reality and the information isn't better stored in each resource.

How to refine a fragment of a resource?

Do it inline if the format allows. Otherwise, use a relative path with fragment in the refines attribute if it has to be done from the package document.
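
A minimal sketch of how the two forms of a refines value could be told apart by a consumer, assuming the manifest has already been read into an id-to-href mapping. The manifest contents and the `resolve_refines` helper are hypothetical; the sketch only compares paths literally and does not normalise relative URLs.

```python
from urllib.parse import urldefrag

# Hypothetical manifest: item id -> href relative to the package document.
MANIFEST = {"ch1": "text/ch1.xhtml", "fig1": "images/figure1.svg"}


def resolve_refines(refines, manifest):
    """Return (manifest_href, fragment) for a refines value, or None if unresolved.

    "#fig1"                  refers to an element id in the package document
    "text/ch1.xhtml#quote-3" is a relative path plus a fragment inside that resource
    """
    path, fragment = urldefrag(refines)
    if path == "":
        # Pure fragment: look up the manifest item with that id.
        href = manifest.get(fragment)
        return (href, None) if href else None
    if path in manifest.values():
        # Relative path to a manifest resource, optionally with a fragment.
        return (path, fragment or None)
    return None


print(resolve_refines("#fig1", MANIFEST))
print(resolve_refines("text/ch1.xhtml#quote-3", MANIFEST))
```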

@iherman
Member

iherman commented May 14, 2021

Do it inline if the format allows. Otherwise, use a relative path with fragment in the refines attribute if it has to be done from the package document.

All RDF serializations (RDF/XML, Turtle, or JSON-LD) define fragment identifiers, so that should not be a problem.

@llemeurfr

llemeurfr commented May 18, 2021

The discussion moved very quickly from a use case (the rise of upload filters, via the EU DSM Directive Article 17) to tentative techniques (e.g. links to RDF metadata or embedding RDF in HTML). But the link between both ends is not clear to me.

The primary use-case seems to be: a person or organization wants to upload an EPUB 3 publication to some platform, which has an EPUB 3 upload filter. The rights holder of the publication has notified the platform that this publication, or fragments of this publication, should not be allowed on the platform. Therefore the filter blocks it.

This means that the publication must be unambiguously identified; that the notion of "fragment" must be clearly defined and each fragment properly identified; that the way rights holders notify platforms must be clear and easy; and that an efficient mechanism must be put in place to prevent bad actors from claiming false rights on content (which is exactly what happens in the music industry today). This is a large project, and metadata in EPUB files is only a small part of it.

Robust content identification (publications and fragments) can be achieved via ISCC, and the most interesting use I've seen of this technology is the Content Blockchain Project, presented at a previous Digital Publishing Summit. What is interesting in ISCC is that there is no requirement to embed the ID into the publication: it can be computed easily.

I ping @llemeurfr who is leading Text and Data Mining Reservation Protocol Community Group

The TDM Reservation Protocol will be used in a totally different use case: rights holders who freely give access to content on the Web will notify TDM actors whether or not they accept being "mined", using simple means like HTTP headers, HTML metadata or a JSON-LD file on the origin server.

@dauwhe dauwhe added the Agenda+ Issues that should be discussed during the next working group call. label May 19, 2021
@Doktorchen
Author

Yes, this ISCC https://content-blockchain.org/ sounds interesting.
One needs something like NFTs, or in general some free blockchain tokens, as identifiers for:
a) media
b) author pseudonyms, or EPUB creators or contributors respectively
c) individual licences for a medium, tied to the pseudonym or EPUB it is to be used with, if the licence is more restrictive than general licences such as CC.

If a platform that provides media for reuse publishes a list of identifiers of the licences it has granted, EPUB creators can link from their use case (the book, or their creator/contributor identifier) to such a list of accepted usage licences.
Alternatively, the blockchain token of a licence is extended with each new use case identifier so that it can be checked.

To get this done automatically after publication, EPUB still needs some normative structure as a reliable interface for providing such information at creation time, so that creators can indicate, in an automatically checkable way, that such a licence exists for a specific book fragment.

A first step could be a normative method to provide at least such an indication for content with a CC licence or a similar licence.
Is it for example sufficient to point to the related media page on Wikimedia Commons, if this is the source? Or to a related page at Wikisource for free content, if taken from there?

@dauwhe dauwhe removed the Agenda+ Issues that should be discussed during the next working group call. label May 21, 2021
@iherman
Member

iherman commented May 21, 2021

The issue was discussed in a meeting on 2021-05-21

List of resolutions:

View the transcript

4. Machine readable meta information for manifest items?

See github issue #1675.

Dave Cramer: Proposal to put metadata in the package file, particularly about copyright on specific manifest items
… Use case is to allow scanning of documents to check copyright status
… this concerns me
… There are various mechanisms to do that now (eg RDF)

Brady Duga: I don't understand the use case

Matt Garrish: This seems half baked
… Might want more detail before we consider it
… It does seem like this can be done now, and the recommendation doesn't really cover it all

Tzviya Siegman: Seems like this might be about the EU copyright directive

Tzviya Siegman: https://scholarlykitchen.sspnet.org/2021/05/17/stm-article-sharing-framework/

Tzviya Siegman: Might be about the scholarly world where you can detect if something is shareable
… Linked info above is still being discussed internationally

Hadrien Gardeur: I think in general when we get these requests, we should point out the current extensibility
… If some group comes up with a useful extension, then we can consider it
… we shouldn't tackle these things ourselves

Proposed resolution: Close issue 1675 without further steps (Ivan Herman)

Ben Schroeter: +1

Dave Cramer: We should close the issue, nothing to see here

Dave Cramer: +1

Bill Kasdorf: +1

Brady Duga: +1

Ivan Herman: +1

Tzviya Siegman: +1

Masakazu Kitahara: +1

Matt Garrish: +1

Resolution #4: Close issue 1675 without further steps

Dave Cramer: That was all our issues

@iherman iherman closed this as completed May 21, 2021
@mattgarrish mattgarrish added the Status-Declined The issue has been reviewed and not accepted by the working group for inclusion label Jun 17, 2021

6 participants