Validate content of <dc:language> #702

jon-moreira · 2016-08-09T08:47:22Z

epubcheck doesn't check dc:language value!

Every metadata section must include at least one language element with a value conforming to [RFC5646].

The following example shows a Publication is in U.S. English.
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    …
    <dc:language>en-US</dc:language>
    …
</metadata>

content.opf of my ePUB after export from Adobe InDesign

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<package xmlns="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" unique-identifier="bookid" version="2.0">
    <metadata>
        <meta name="generator" content="Adobe InDesign"/>
        <meta name="cover" content="xxx-cover.jpg"/>
        <dc:title>xxxx</dc:title>
        <dc:creator>xxx</dc:creator>
        <dc:subject></dc:subject>
        <dc:description>xxx</dc:description>
        <dc:publisher>Editorial Presença</dc:publisher>
        <dc:date>2016-02-11</dc:date>
        <dc:source></dc:source>
        <dc:relation></dc:relation>
        <dc:coverage></dc:coverage>
        <dc:rights></dc:rights>
        **<dc:language>en-US-POSIX</dc:language>**
        <dc:language>pt-BR</dc:language>
        <dc:identifier id="bookid">xxx</dc:identifier>
    </metadata>

<dc:language>en-US-POSIX</dc:language> doesn't have a valid value and epubcheck ignores that.

epubcheck output:

java -jar epubcheck.jar xxx.epub 
Validating using EPUB version 2.0.1 rules.
No errors or warnings detected.
epubcheck completed

tofi86 · 2016-12-22T09:33:57Z

While at first glance this looks easy to implement, it gets harder when you look at the RFC5646 spec itself and not only at the EPUB example: https://tools.ietf.org/html/rfc5646#appendix-A

Possibly allowed language tags:

- de (German)
- en-US (English as used in the United States)
- zh-Hans (Chinese written using the Simplified Chinese script)
- zh-cmn-Hans-CN (Chinese, Mandarin, Simplified script, as used in China)
- sl-rozaj (Resian dialect of Slovenian)
- de-CH-1901 (German as used in Switzerland using the 1901 variant [orthography])
- hy-Latn-IT-arevela (Eastern Armenian written in Latin script, as used in Italy)
- az-Arab-x-AZE-derbend (private use subtags)

To be honest: that's a validation nightmare! And I don't see a quick way to build a validation engine for that...

In fact, your example en-US-POSIX could even be a valid RFC5646 language tag, although it doesn't make sense to us now...

Removing this from the "Next" milestone for the moment...

note to myself: IANA Language Subtag Registry: http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

murata2makoto · 2017-07-08T01:12:41Z

Does the simple type xsd:language address this problem?

tofi86 · 2017-07-08T12:24:20Z

Looking at the examples at http://www.datypic.com/sc/xsd/t-xsd_language.html this seems indeed a good way to go! I only looked at this from a Java perspective, but not from the schema validation point of view...

However, when looking at the specs, EPUB->OPF->DublinCore requires RFC5646, which obsoletes the RFC that the XML Schema datatype is based on, right? So the DublinCore metadata may allow more valid language codes than XML Schema can validate, although I don't have an example for that.

However, if @mattgarrish as our spec-guru agrees, I would give this a go and change the schema datatype to xsd:language.

mattgarrish · 2017-07-08T12:49:15Z

The schemas already enforce xsd:language constraints:

opf.dc.language = element dc:language { opf.id.attr? & datatype.languagecode }

datatype.languagecode = datatype.BCP47
datatype.BCP47 = xsd:language { pattern = "[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*" }

But that just enforces the lexical constraint without trying to verify the validity of the segments. The request, as I understand it, is to go further and validate the segments.

It would be great if that were done, but it seems like no small task and a perpetual moving target.
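
To illustrate why the lexical constraint alone lets `en-US-POSIX` through, here is a minimal sketch that applies only the `pattern` facet quoted above using `java.util.regex`. The class and method names are hypothetical, and this approximates the schema check rather than reproducing how epubcheck runs its RELAX NG validation.

```java
import java.util.regex.Pattern;

// Hypothetical helper: applies only the pattern facet from datatype.BCP47.
final class LexicalLanguageCheck {

    // Same expression as the schema's pattern facet.
    static final Pattern BCP47_LEXICAL =
            Pattern.compile("[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*");

    // True when the whole value has the general shape of a language tag.
    static boolean matchesSchemaPattern(String value) {
        return BCP47_LEXICAL.matcher(value).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"en-US", "pt-BR", "en-US-POSIX", "en_US"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + matchesSchemaPattern(sample));
        }
    }
}
```

The pattern only looks at the shape of each hyphen-separated segment, so `en-US-POSIX` is accepted just like `en-US`; nothing here consults the subtag registry.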

murata2makoto · 2017-07-08T13:47:17Z

It would be nice if meaningless tags such as en-US-POSIX were detected. But if some programming (as opposed to schema hacking) is required, I am not sure if this is important enough.

tofi86 · 2017-11-21T21:49:53Z

Update: @kalaspuffar started working on this in PR #807. Review of the PR is welcome.

rdeltour · 2017-11-27T23:08:33Z

Unless we check the IANA registry, I don't think there's much more we can do here than the lexical check performed by the schema?

xfq · 2020-08-18T08:17:40Z

Yes, checking if language tags are valid requires access to or a copy of the registry.

I didn't check EPUB 3.2, but the EPUB 3.0 spec text in the first comment didn't say if it requires the language tag to be well-formed or valid. The LTLI document from W3C i18n WG contains some guidance on this.
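
For a sense of what a registry-backed validity check would involve, here is a rough sketch under stated assumptions. It parses a tiny hand-copied excerpt in the registry's `%%`-separated record layout and looks up each subtag by type. The class, the method names, and the three-record excerpt are all hypothetical, and the subtag classification is deliberately simplified (it ignores extlang, extension, and private-use subtags).

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: validity lookup against a local registry excerpt.
final class RegistrySubtagSketch {

    // Tiny illustrative excerpt; a real check would need the full registry.
    static final String EXCERPT =
            "%%\nType: language\nSubtag: en\nDescription: English\n"
          + "%%\nType: region\nSubtag: US\nDescription: United States\n"
          + "%%\nType: variant\nSubtag: 1901\nDescription: Traditional German orthography\n";

    // Collects "type:subtag" keys (subtag lower-cased) from the records.
    static Set<String> parse(String registry) {
        Set<String> known = new HashSet<>();
        for (String record : registry.split("%%")) {
            String type = null;
            String subtag = null;
            for (String line : record.split("\n")) {
                if (line.startsWith("Type: ")) type = line.substring(6).trim();
                if (line.startsWith("Subtag: ")) subtag = line.substring(8).trim();
            }
            if (type != null && subtag != null) {
                known.add(type + ":" + subtag.toLowerCase(Locale.ROOT));
            }
        }
        return known;
    }

    // Simplified guess at a subtag's type from its position and shape.
    static String classify(String subtag, int index) {
        if (index == 0) return "language";
        if (subtag.matches("[A-Za-z]{4}")) return "script";
        if (subtag.matches("[A-Za-z]{2}|[0-9]{3}")) return "region";
        return "variant";
    }

    // True only if every subtag is present in the supplied set.
    static boolean allSubtagsKnown(String tag, Set<String> known) {
        String[] parts = tag.split("-");
        for (int i = 0; i < parts.length; i++) {
            String key = classify(parts[i], i) + ":" + parts[i].toLowerCase(Locale.ROOT);
            if (!known.contains(key)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> known = parse(EXCERPT);
        System.out.println("en-US -> " + allSubtagsKnown("en-US", known));
        System.out.println("en-US-POSIX -> " + allSubtagsKnown("en-US-POSIX", known));
    }
}
```

Even this toy version shows the maintenance cost raised in the thread: the answer is only as current as the bundled copy of the registry.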

mattgarrish · 2020-08-18T11:06:19Z

We had a long discussion about well-formed v. valid for web publications and the resulting consensus was that there is little value in enforcing validity. Reading systems will react or not based on whether they recognize the language, so ensuring the general pattern is followed is all that is necessary. This really should be clarified in the epub spec.

In Package Document, the language tags appearing in the elements or attributes below MUST be well-formed according to BCP47:

- `xml:lang` attribute
- `hreflang` attribute
- `dc:language` element

For these values:

- the schemas now only do a basic datatype check (string, non-empty value when relevant)
- well-formedness is checked with Java's `Locale.Builder#setLanguageTag()` API
- a new check (OPF-092) is reported when an ill-formed value is found

See https://docs.oracle.com/javase/8/docs/api/java/util/Locale.Builder.html#setLanguageTag-java.lang.String-

Fix #1221
Close #702
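
The well-formedness check described in this commit message can be sketched roughly as follows. This is a minimal illustration of using `Locale.Builder#setLanguageTag()`, not the actual epubcheck code; the class and method names are hypothetical, and treating an empty value as ill-formed is an assumption made here because `setLanguageTag` resets the builder on empty input instead of throwing.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

// Hypothetical helper illustrating a BCP47 well-formedness check.
final class WellFormedLanguageTag {

    static boolean isWellFormed(String tag) {
        // Assumption: null or empty values count as ill-formed here, since
        // setLanguageTag would silently reset the builder for them.
        if (tag == null || tag.isEmpty()) {
            return false;
        }
        try {
            // Throws IllformedLocaleException when the tag is ill-formed.
            new Locale.Builder().setLanguageTag(tag);
            return true;
        } catch (IllformedLocaleException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"en-US", "pt-BR", "en-US-POSIX", "en_US"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + isWellFormed(sample));
        }
    }
}
```

Note that `en-US-POSIX` from the original report is well-formed (`POSIX` has the shape of a variant subtag), so a check of this kind accepts it; only a registry lookup could decide whether it is valid.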

tofi86 added the type: feature The issue describes a new feature request label Oct 4, 2016

tofi86 added this to the Next milestone Oct 4, 2016

tofi86 removed this from the Next milestone Dec 22, 2016

tofi86 mentioned this issue Dec 30, 2016

language validation #495

Closed

tofi86 added the status: ready for implem The issue is ready to be implemented label Sep 28, 2017

kalaspuffar mentioned this issue Nov 15, 2017

[WIP] [fix #702] Validate content of <dc:language> #807

Closed

tofi86 added the status: has PR The issue is being processed in a pull request label Nov 21, 2017

tofi86 removed the status: ready for implem The issue is ready to be implemented label Nov 27, 2017

mattgarrish mentioned this issue Aug 18, 2020

Clarify language tag values w3c/epub-specs#1325

Closed

rdeltour added status: in discussion The issue is being discussed by the development team and removed status: has PR The issue is being processed in a pull request labels Nov 13, 2021

rdeltour mentioned this issue Dec 1, 2022

feat: new check (OPF-092) for language tags well-formedness #1363

Merged

rdeltour closed this as completed in 9b2c203 Dec 7, 2022
