Validate content of <dc:language> #702

Closed
jon-moreira opened this issue Aug 9, 2016 · 9 comments
Labels
status: in discussion, type: feature

Comments

@jon-moreira

jon-moreira commented Aug 9, 2016

epubcheck doesn't check dc:language value!

According to the specification:

Every metadata section must include at least one language element with a value conforming to [RFC5646].

The following example shows a Publication is in U.S. English.

<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    …
    <dc:language>en-US</dc:language>
    …
</metadata>

content.opf of my ePUB after export from Adobe InDesign

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<package xmlns="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" unique-identifier="bookid" version="2.0">
    <metadata>
        <meta name="generator" content="Adobe InDesign"/>
        <meta name="cover" content="xxx-cover.jpg"/>
        <dc:title>xxxx</dc:title>
        <dc:creator>xxx</dc:creator>
        <dc:subject></dc:subject>
        <dc:description>xxx</dc:description>
        <dc:publisher>Editorial Presença</dc:publisher>
        <dc:date>2016-02-11</dc:date>
        <dc:source></dc:source>
        <dc:relation></dc:relation>
        <dc:coverage></dc:coverage>
        <dc:rights></dc:rights>
        <dc:language>en-US-POSIX</dc:language>
        <dc:language>pt-BR</dc:language>
        <dc:identifier id="bookid">xxx</dc:identifier>
    </metadata>

<dc:language>en-US-POSIX</dc:language> is not a valid value, yet epubcheck does not report it.

epubcheck output:

java -jar epubcheck.jar xxx.epub 
Validating using EPUB version 2.0.1 rules.
No errors or warnings detected.
epubcheck completed
tofi86 added the type: feature label Oct 4, 2016
tofi86 added this to the Next milestone Oct 4, 2016
@tofi86
Collaborator

tofi86 commented Dec 22, 2016

At first glance this looks easy to implement, but it gets harder once you look at the RFC5646 spec itself and not only at the EPUB example: https://tools.ietf.org/html/rfc5646#appendix-A

Possibly allowed language tags:

  • de
    • (German)
  • en-US
    • (English as used in the United States)
  • zh-Hans
    • (Chinese written using the Simplified Chinese script)
  • zh-cmn-Hans-CN
    • (Chinese, Mandarin, Simplified script, as used in China)
  • sl-rozaj
    • (Resian dialect of Slovenian)
  • de-CH-1901
    • (German as used in Switzerland using the 1901 variant [orthography])
  • hy-Latn-IT-arevela
    • (Eastern Armenian written in Latin script, as used in Italy)
  • az-Arab-x-AZE-derbend
    • (private use subtags)

To be honest: that's a validation nightmare! And I don't see a quick way to build a validation engine for that...

In fact, it could also be that your example en-US-POSIX is a valid RFC5646 language tag, although it doesn't make sense to us now...

Removing this from the "Next" milestone for the moment...


note to myself: IANA Language Subtag Registry: http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

tofi86 removed this from the Next milestone Dec 22, 2016
@murata2makoto
Contributor

Does the simple type xsd:language address this problem?

@tofi86
Collaborator

tofi86 commented Jul 8, 2017

Looking at the examples at http://www.datypic.com/sc/xsd/t-xsd_language.html, this does indeed seem a good way to go! I had only looked at this from a Java perspective, not from a schema validation point of view...

However, looking at the specs, EPUB->OPF->DublinCore requires RFC5646, which obsoletes the RFC that XML Schema refers to, right? So the DublinCore metadata may allow more valid language codes than XML Schema can validate, although I don't have an example of that.

However, if @mattgarrish as our spec-guru agrees, I would give this a go and change the schema datatype to xsd:language.

@mattgarrish
Member

The schemas already enforce xsd:language constraints:

opf.dc.language = element dc:language { opf.id.attr? & datatype.languagecode }

datatype.languagecode = datatype.BCP47
datatype.BCP47 = xsd:language { pattern = "[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*" }

But that just enforces the lexical constraint without trying to verify the validity of the segments. The request, as I understand it, is to go further and validate the segments.

It would be great if that were done, but it seems like no small task and a perpetual moving target.
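
For illustration, the lexical constraint quoted above can be reproduced outside the schema. This is only a sketch: the class and method names are hypothetical, and the pattern is copied from the datatype.BCP47 definition, not from any EPUBCheck Java code.

```java
import java.util.regex.Pattern;

class LanguageTagLexicalCheck {
    // Same pattern as the datatype.BCP47 facet quoted from the schema.
    private static final Pattern BCP47_LEXICAL =
            Pattern.compile("[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*");

    // True if the whole string matches the lexical pattern.
    // Says nothing about whether the subtags are registered.
    static boolean matchesSchemaPattern(String tag) {
        return tag != null && BCP47_LEXICAL.matcher(tag).matches();
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"en-US", "en-US-POSIX", "en_US", ""}) {
            System.out.println("'" + tag + "' -> " + matchesSchemaPattern(tag));
        }
    }
}
```

Note that en-US-POSIX passes this check, which is why the file in the original report validates cleanly: the pattern only looks at the shape of the tag.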

@murata2makoto
Contributor

It would be nice if meaningless tags such as en-US-POSIX were detected. But if some programming (as opposed to schema hacking) is required, I am not sure this is important enough.

tofi86 added the status: ready for implem label Sep 28, 2017
tofi86 added the status: has PR label Nov 21, 2017
@tofi86
Collaborator

tofi86 commented Nov 21, 2017

Update: @kalaspuffar started working on this in PR #807. Review of the PR is welcome.

tofi86 removed the status: ready for implem label Nov 27, 2017
@rdeltour
Member

Unless we check the IANA registry, I don't think there's much more we can do here than the lexical check performed by the schema?

@xfq
Member

xfq commented Aug 18, 2020

Yes, checking if language tags are valid requires access to or a copy of the registry.

I didn't check EPUB 3.2, but the EPUB 3.0 spec text in the first comment doesn't say whether it requires the language tag to be well-formed or valid. The LTLI document from the W3C i18n WG contains some guidance on this.
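
To make the registry dependency concrete, here is a rough sketch of what a validity lookup might involve. Everything in it is an assumption for illustration: the class and method names are hypothetical, and the excerpt is a tiny hand-abbreviated sample in the registry's record format (records separated by "%%"), not the real file. It ignores details of the real registry such as folded lines and subtag ranges.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class RegistrySubtagLookup {
    // Abbreviated sample records; a real check would load the full IANA file.
    static final String EXCERPT = String.join("\n",
            "Type: language", "Subtag: en", "Description: English", "%%",
            "Type: language", "Subtag: pt", "Description: Portuguese", "%%",
            "Type: region", "Subtag: US", "Description: United States", "%%",
            "Type: region", "Subtag: BR", "Description: Brazil", "%%",
            "Type: variant", "Subtag: 1901", "Description: Traditional German orthography");

    // Groups the registered subtags (lowercased) by their record type.
    static Map<String, Set<String>> parse(String registry) {
        Map<String, Set<String>> byType = new HashMap<>();
        for (String record : registry.split("%%")) {
            String type = null;
            String subtag = null;
            for (String line : record.split("\n")) {
                if (line.startsWith("Type: ")) {
                    type = line.substring("Type: ".length()).trim();
                } else if (line.startsWith("Subtag: ")) {
                    subtag = line.substring("Subtag: ".length()).trim();
                }
            }
            // Records without both fields (e.g. a file header) are skipped.
            if (type != null && subtag != null) {
                byType.computeIfAbsent(type, k -> new HashSet<>())
                      .add(subtag.toLowerCase(Locale.ROOT));
            }
        }
        return byType;
    }

    // Case-insensitive membership test for one subtag of a given type.
    static boolean isRegistered(Map<String, Set<String>> byType, String type, String subtag) {
        return byType.getOrDefault(type, Collections.<String>emptySet())
                     .contains(subtag.toLowerCase(Locale.ROOT));
    }
}
```

Against this sample, a lookup of the variant POSIX finds nothing, while the region US is found. A real implementation would also need to keep its copy of the registry up to date, which is the "moving target" concern raised earlier in the thread.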

@mattgarrish
Member

We had a long discussion about well-formed vs. valid for web publications, and the resulting consensus was that there is little value in enforcing validity. Reading systems will react or not based on whether they recognize the language, so ensuring the general pattern is followed is all that is necessary. This really should be clarified in the EPUB spec.

rdeltour added the status: in discussion label and removed the status: has PR label Nov 13, 2021
rdeltour added a commit that referenced this issue Jul 8, 2022
In Package Document, the language tags appearing in the elements or attributes below
MUST be well-formed according to BCP47:
 - `xml:lang` attribute
 - `hreflang` attribute
 - `dc:language` element

For these values:
- the schema now only does a basic datatype check (string, non-empty value when relevant)
- the well-formedness is checked with Java’s Locale.Builder#setLanguageTag() API
- a new check (OPF-092) is reported when an ill-formed value is found

See https://docs.oracle.com/javase/8/docs/api/java/util/Locale.Builder.html#setLanguageTag-java.lang.String-

Fix #1221
Close #702
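
The well-formedness check described in the commit message can be sketched roughly as follows. This is not the EPUBCheck source: the class and method names are hypothetical, and only Locale.Builder#setLanguageTag and IllformedLocaleException are real JDK APIs. The explicit guard for null and empty strings is an assumption added here, because setLanguageTag resets the builder for those inputs instead of throwing.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

class LanguageTagWellFormedness {
    // True if the JDK accepts the tag as a well-formed BCP 47 language tag.
    // Does not consult the IANA registry, so unregistered subtags still pass.
    static boolean isWellFormed(String tag) {
        if (tag == null || tag.isEmpty()) {
            return false; // setLanguageTag would silently reset the builder
        }
        try {
            new Locale.Builder().setLanguageTag(tag);
            return true;
        } catch (IllformedLocaleException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"pt-BR", "en-US-POSIX", "en_US"}) {
            System.out.println(tag + " -> " + isWellFormed(tag));
        }
    }
}
```

Under this kind of check, en-US-POSIX from the original report is still accepted, since it follows the general pattern of a language tag; only ill-formed values such as en_US are rejected.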