Feedback while working on a JSON Schema for WP #287

Closed
HadrienGardeur opened this issue Jul 31, 2018 · 17 comments
@HadrienGardeur

HadrienGardeur commented Jul 31, 2018

Since we won't have any calls for the next two weeks, I've started working on a JSON Schema for WP.

A first draft is available at https://github.com/w3c/wpub/blob/master/schema/publication.schema.json

While working on this schema, I've had to go through the specification again and it highlighted several issues with the current draft.

1. Creators

The JSON serialization for creators feels very much underdefined: there's a good list of roles that I can use for my JSON Schema but everything else feels underdefined.

  • the required value is either "One or more Person" or "One or more Person or Organization"
  • does this mean that we MUST use an object? My audiobook example used strings as well, since it's much easier to author it that way when you only have a name to indicate.
  • do we require any terms in that object (name, @type) or have a strong preference for specific terms?
  • do we allow arrays as well? Arrays of strings? Arrays of objects?

2. Item-specific language

I find that part of the spec every bit as confusing as creators:

  • where exactly do we allow the use of @value + @language instead of a string? An explicit list would help a lot.
  • having examples for every term where this is allowed would also help, ideally in each respective section rather than under its own section
  • do we allow multiple localizations of each string as well (an array of objects)?

3. String/Array/Object

Overall, it's fairly difficult to know right now where we allow strings, arrays or objects.

For example:

  • do we allow all three for name (string for the simple case, object for a single localized value, array of objects for multiple localizations)?
  • what about the different accessibility related terms? On the schema.org website, I see that some of them are strings, while others are arrays of strings. It would be useful to indicate this in our specification.

4. Accessibility

There are many different values for the various accessibility terms that we support, but they're not listed in our specification.
Should we validate these values at all?

For languages (both inLanguage and @language), readingProgression and inDirection, I've already handled the validation of their values.

@mattgarrish
Member

does this mean that we MUST use an object? My audiobook example used strings as well, since it's much easier to author it that way when you only have a name to indicate.

Ya, I asked this initially in #228. I probably should have left it open, since the table restructuring didn't solve it except in the sense it requires an object.

@iherman
Member

iherman commented Aug 1, 2018

Since we won't have any calls for the next two weeks, I've started working on a JSON Schema for WP.

A first draft is available at https://github.com/w3c/wpub/blob/master/schema/publication.schema.json

Thanks for doing that.

I'll try to answer some of these issues; we will have to work out the exact editorial work together.

While working on this schema, I've had to go through the specification again and it highlighted several issues with the current draft.

1. Creators

The JSON serialization for creators feels very much underdefined: there's a good list of roles that I can use for my JSON Schema but everything else feels underdefined.

  • the required value is either "One or more Person" or "One or more Person or Organization"
  • does this mean that we MUST use an object? My audiobook example used strings as well, since it's much easier to author it that way when you only have a name to indicate.

I think that for this, as well as for other issues, what we have to check is whether the structured data tool for schema.org accepts a structure or not.

For this specific issue,

{
  "@context": ["https://schema.org"],
  "@type": "CreativeWork",
  "url": "http://www.w3.org/TR/2015/REC-tabular-data-model-20151217/",
  "creator": "Jenni Tennison"
}

is accepted (pleasant surprise, I expected the opposite:-). Which means that we should say somewhere in the spec that a simple string is acceptable in those positions (and must be converted into Persons when converting into WebIDL).
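
A minimal sketch of that conversion step, assuming a bare string maps to a Person whose name is that string. The helper name `toPerson` is hypothetical and not defined anywhere in the spec:

```javascript
// Hypothetical helper, not part of the specification.
// Assumption: a bare string creator becomes a Person whose name is that string.
function toPerson(creator) {
  if (typeof creator === "string") {
    return { "@type": "Person", name: creator };
  }
  // Objects are passed through unchanged.
  return creator;
}

const manifest = {
  "@context": ["https://schema.org"],
  "@type": "CreativeWork",
  url: "http://www.w3.org/TR/2015/REC-tabular-data-model-20151217/",
  creator: "Jenni Tennison",
};

const creator = toPerson(manifest.creator);
```

Whether an Organization could also be expressed as a bare string is left open here; the sketch only covers the Person case.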

  • do we require any terms in that object (name, @type) or have a strong preference for specific terms?

I think these two terms should indeed be required.

  • do we allow arrays as well? Arrays of strings? Arrays of objects?

I am not sure I understand. We do say "one or more" which, for me, means that an array can be accepted as values. Is this what you mean?

2. Item-specific language

I find that part of the spec every bit as confusing as creators:

  • where exactly do we allow the use of @value + @language instead of a string? An explicit list would help a lot.

In general, every time we expect a string (names is the typical example) the usage of @value+@language instead is allowed. This is, probably, a general statement that must be put into the document.

  • having examples for every term where this is allowed would also help, ideally in each respective section rather than under its own section

Correct, we will have to have a proper multilingual example that we can use.

  • do we allow multiple localizations of each string as well (an array of objects)?

You mean every time we expect a string we can accept an array of strings (or localized strings)? Yes.

3. String/Array/Object

Overall, it's fairly difficult to know right now where we allow strings, arrays or objects.

For example:

  • do we allow all three for name (string for the simple case, object for a single localized value, array of objects for multiple localizations)?

Yes.

  • what about the different accessibility related terms? On the schema.org website, I see that some of them are strings, while others are arrays of strings. It would be useful to indicate this in our specification.

I am a bit wary of repeating the specification of, say, the accessibility terms in our spec; we have a reference to those.

4. Accessibility

There are many different values for the various accessibility terms that we support, but they're not listed in our specification.
Should we validate these values at all?

I think yes.

Cc @mattgarrish

@HadrienGardeur
Author

HadrienGardeur commented Aug 1, 2018

I'll update the JSON Schema based on my own personal take on what should be required/allowed. Once that's done, I'll post an update here to document some of the choices I made in order to discuss them together.

Update: Done.

@HadrienGardeur
Author

HadrienGardeur commented Aug 1, 2018

The updated JSON Schema should now cover every term in our manifest: https://github.com/w3c/wpub/blob/master/schema/publication.schema.json

To finalize this schema, I had to take a few decisions regarding what's allowed or not.

1. Localizable Strings

All localizable strings can be expressed using:

  • a string
  • an object where both @value and @language are required
  • an array where every item is an object with the same requirements as above

This schema is applied to: name, accessibilitySummary and description.

To illustrate, here are three examples using name:

As a string

"name": "The Three Musketeers"

As an object

"name": {
  "@value": "The Three Musketeers",
  "@language": "en"
}

As an array of objects

"name": [
  {
    "@value": "The Three Musketeers",
    "@language": "en"
  },
  {
    "@value": "Les Trois Mousquetaires",
    "@language": "fr"
  }
]

The following example is rejected by the JSON Schema:

"name" : ["War and Peace", "Guerre et Paix"]

A mixed array containing both strings and objects would be rejected as well.

All objects in the array MUST be unique.
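
The rules above can be mirrored in a small standalone check. This is a sketch of the stated rules only, not a reproduction of the published schema; the function names are illustrative, and uniqueness is simplified to comparing the @value/@language pair:

```javascript
// Illustrative names; this mirrors the rules described above, not the schema file itself.
function isLocalizedObject(value) {
  return (
    typeof value === "object" &&
    value !== null &&
    !Array.isArray(value) &&
    typeof value["@value"] === "string" &&
    typeof value["@language"] === "string"
  );
}

function isValidLocalizable(value) {
  if (typeof value === "string") return true;
  if (isLocalizedObject(value)) return true;
  if (!Array.isArray(value)) return false;
  // Arrays may only contain objects: plain strings and mixed arrays are rejected.
  if (!value.every(isLocalizedObject)) return false;
  // Simplified uniqueness: no two items share the same @value and @language.
  // Note: an empty array passes here; the source does not say whether it should.
  const keys = value.map((v) => JSON.stringify([v["@value"], v["@language"]]));
  return new Set(keys).size === keys.length;
}
```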

2. Creators

All creators allow:

  • a string
  • an object
  • an array of strings and/or objects

Whenever an object is used, name is required.

To illustrate, here are four examples using author:

As a string

"author": "Herman Melville"

As an object

"author": {
  "name": "Herman Melville",
  "@type": "Person"
}

As an array of strings

"author": ["Herman Melville", "Marcel Proust"]

As an array of objects

"author": [
  {
    "name": "Herman Melville",
    "@type": "Person"
  },
  {
    "name": "Marcel Proust",
    "@type": "Person"
  }
]

Mixed arrays containing both strings and objects are also allowed.

All objects in the array MUST be unique but the JSON Schema would validate the same creator expressed using both a string and an object.
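
A sketch of these creator rules as a standalone check, including the uniqueness quirk just described. The function names are hypothetical, and uniqueness is approximated by comparing serialized items, which is sensitive to key order in a way JSON Schema's own comparison is not:

```javascript
// Illustrative names; this mirrors the rules described above, not the schema file itself.
function isCreatorObject(value) {
  return (
    typeof value === "object" &&
    value !== null &&
    !Array.isArray(value) &&
    value.name !== undefined // name is required whenever an object is used
  );
}

function isCreatorItem(value) {
  return typeof value === "string" || isCreatorObject(value);
}

function isValidCreator(value) {
  if (isCreatorItem(value)) return true;
  if (!Array.isArray(value)) return false;
  if (!value.every(isCreatorItem)) return false;
  // Approximate uniqueness: a string and an object never serialize identically,
  // so the same creator written both ways is not detected as a duplicate.
  const keys = value.map((item) => JSON.stringify(item));
  return new Set(keys).size === keys.length;
}
```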

3. Accessibility

I've applied strict validation on the various accessibility terms by forcing them to be either a string or an array and using an enum with a list of values.
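
As an illustration of this string-or-array enum check, here is a sketch using a handful of accessibilityHazard values from the schema.org vocabulary. The list is an illustrative subset, not the full set of values in the schema, and `isValidEnumTerm` is a hypothetical name:

```javascript
// Illustrative subset of accessibilityHazard values; not the complete list.
const HAZARDS = new Set([
  "flashing",
  "noFlashingHazard",
  "motionSimulation",
  "noMotionSimulationHazard",
  "sound",
  "noSoundHazard",
]);

// Accepts either a single string or an array of strings, all drawn from `allowed`.
function isValidEnumTerm(value, allowed) {
  const items = Array.isArray(value) ? value : [value];
  return items.every((item) => typeof item === "string" && allowed.has(item));
}
```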

4. Relative URIs

In JSON Schema, it's possible to validate that a string is a URI using "format": "uri".

But since we allow relative URIs, I had to drop this validation to allow any string. I wonder if this might also be a problem for other situations as well.
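
For what it's worth, JSON Schema draft-06 and later define a "uri-reference" format intended to accept relative references, though validator support would need checking. Outside the schema, a relative value can still be checked by resolving it against the manifest's address. A minimal sketch, where the base URL is a placeholder and `resolveReference` is a hypothetical helper:

```javascript
// Hypothetical helper: resolves a possibly relative reference against a base URL.
// Returns the absolute URL as a string, or null if it cannot be resolved.
function resolveReference(reference, base) {
  try {
    return new URL(reference, base).href;
  } catch (error) {
    return null;
  }
}

// Placeholder address, assumed for illustration only.
const base = "https://example.org/publication/manifest.json";
const resolved = resolveReference("chapters/chapter1.html", base);
```

Without a base, the same relative string fails to parse, which is the behaviour that makes "format": "uri" too strict here.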

@iherman
Member

iherman commented Aug 1, 2018

The updated JSON Schema should now cover every term in our manifest: https://github.com/w3c/wpub/blob/master/schema/publication.schema.json

Brilliant!

Before commenting below: just as for the json-ld context, it would be good to find a final place for the schema, so that tools could use it. I realize that means rewriting the schemas, because you used the cross-referencing possibility.

I just wonder whether it becomes too unwieldy if all schema files are folded into one. That may avoid forcing schema validation libraries to make HTTP requests all the time (tools could store a copy locally).

But that is something we can handle later.

To finalize this schema, I had to take a few decisions regarding what's allowed or not.

1. Localizable Strings

All localizable strings can be expressed using:

  • a string
  • an object where both @value and @language are required
  • an array where every item is an object with the same requirements as above

What is the rationale for not allowing an array of strings or an array with a mixture of strings and objects? I understand it is more complex, but it sounds a bit contrived for an author. If the default language for the document is English, i.e., inLanguage is set to en, then the example below:

"name": [
  {
    "@value": "The Three Musketeers",
    "@language": "en"
  },
  {
    "@value": "Les Trois Mousquetaires",
    "@language": "fr"
  }
]

seems redundant, and this looks more natural:

"name": [
  "The Three Musketeers",
  {
    "@value": "Les Trois Mousquetaires",
    "@language": "fr"
  }
]

I do not see the problem on the implementation level either.

2. Creators

All creators allow:

  • a string
  • an object
  • an array of strings and/or objects

Whenever an object is used, name is required.

Mixed arrays containing both strings and objects are also allowed.

Right, I agree with this. All the more reason to allow the same flexibility for localizable strings.

4. Relative URIs

In JSON Schema, it's possible to validate that a string is a URI using "format": "uri".

But since we allow relative URIs, I had to drop this validation to allow any string. I wonder if this might also be a problem for other situations as well.

Yes, this is unfortunate, but we cannot help this.

@HadrienGardeur
Author

What is the rationale for not allowing an array of strings or an array with a mixture of strings and objects?

If you have more than one string, you get in a weird situation where it's impossible to know which language you should apply:

"name": [
  "The Three Musketeers",
  "I tre moschettieri",
  {
    "@value": "Les Trois Mousquetaires",
    "@language": "fr"
  }
]

I think it's safer to force all arrays in localizable strings to use objects instead. It's also much easier to test uniqueness when an array is all strings or all objects.

Right, I agree with this. All the more reason to allow the same flexibility for localizable strings.

Creators are quite different from name, description and accessibilitySummary because you're allowed to have multiple values, not just localized strings.
This means that you don't have the same issues when mixing up strings with objects, they're simply multiple values.

I just wonder whether it becomes too unwieldy if all schema files are folded into one. That may avoid forcing schema validation libraries to make HTTP requests all the time (tools could store a copy locally).

Yikes. I'm always in favour of DRY (don't repeat yourself) and references helped a lot while writing this schema.

@HadrienGardeur
Author

HadrienGardeur commented Aug 1, 2018

FYI, the JSON Schema for a publication is already available at https://w3c.github.io/wpub/schema/publication.schema.json

I've tested examples in various validation tools and they could all support the various references that are already used in the current version.

@iherman
Member

iherman commented Aug 1, 2018

If you have more than one string, you get in a weird situation where it's impossible to know which language you should apply:

"name": [
  "The Three Musketeers",
  "I tre moschettieri",
  {
    "@value": "Les Trois Mousquetaires",
    "@language": "fr"
  }
]

I do not see the problem. This is obviously bad authoring, but the rules are clear (and the same as in HTML): if the language is set (in inLanguage), then it applies to a string unless the string sets the language for itself. The second string will be tagged as English in this case (i.e., if inLanguage is set). Authoring error. So?

I think it's safer to force all arrays in localizable strings to use objects instead. It's also much easier to test uniqueness when an array is all strings or all objects.

If this is the only reason then, I am sorry, but I do not agree.

I just wonder whether it becomes too unwieldy if all schema files are folded into one. That may avoid forcing schema validation libraries to make HTTP requests all the time (tools could store a copy locally).

Yikes. I'm always in favour of DRY (don't repeat yourself) and references helped a lot while writing this schema.

Yikes indeed, I understand. The issue I see, however, is that schema validation becomes impossible offline, although I think using this validation and building it into the processing would be extremely useful.

Is it necessary to use absolute urls? If the 'main entry point' to the schema used only relative URLs, one could make a copy of the whole collection. (Apologies, I am not very familiar with all the details of JSON schemas.)

@HadrienGardeur
Author

HadrienGardeur commented Aug 1, 2018

I do not see the problem. This is obviously bad authoring, but the rules are clear (and the same as in HTML): if the language is set (in inLanguage), then it applies to a string unless the string sets the language for itself. The second string will be tagged as English in this case (i.e., if inLanguage is set). Authoring error. So?

This goes beyond authoring, it will affect the processing of the manifest as well.

When processing our WebIDL, we can't allow two different values for the same language on a localizable string.
I know that even with objects, this might be the case (two objects, both with the same language but different values), but it's even more likely to happen if we allow the use of multiple strings.
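
The conflict in question can be detected with a short check. This sketch assumes the values have already been normalized to @value/@language objects; `findDuplicateLanguages` is a hypothetical name:

```javascript
// Returns the language tags that appear more than once in a list of
// { "@value", "@language" } objects. Tags are compared case-insensitively.
function findDuplicateLanguages(items) {
  const counts = new Map();
  for (const item of items) {
    const language = item["@language"].toLowerCase();
    counts.set(language, (counts.get(language) || 0) + 1);
  }
  return [...counts].filter(([, count]) => count > 1).map(([language]) => language);
}

// Two plain strings that both inherited "en" from inLanguage, plus one French value.
const duplicates = findDuplicateLanguages([
  { "@value": "The Three Musketeers", "@language": "en" },
  { "@value": "I tre moschettieri", "@language": "en" },
  { "@value": "Les Trois Mousquetaires", "@language": "fr" },
]);
```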

@iherman
Member

iherman commented Aug 2, 2018

@HadrienGardeur,

I must admit I really do not understand the problem.

  1. Allowing a mixture of localizable strings and vanilla strings as an array value for, e.g., name allows avoiding unnecessary redundancy for the author (not being forced to explicitly repeat the default language tag into the value in case of an array). Redundancy, if possible, should be avoided.
  2. Any implementation that implements the LocalizableString WebIDL interface must be prepared to initialize it either with a string or with a @value+@language object. This is true because we agreed that name may have a simple string value.
  3. The WebIDL for name is an array (sequence, in WebIDL parlance) of localizable strings. This means that by the time an application uses an object implementing WebPublicationManifest, all those dualities should have disappeared, of course.
  4. Based on (2) above, using a mixed array when initializing an object or class implementing WebPublicationManifest is a trivial step, something like:
set name(names) {
    const nameArray = Array.isArray(names) ? names : [names];
    this._name = nameArray.map((name) => new LocalizableString(name, this._lang));
}

I do not see any downsides that would justify imposing an unnecessary restriction.
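
The setter in step 4 can be expanded into a self-contained sketch. The class bodies below are assumptions made for illustration: the constructor signature of LocalizableString and the `value`/`language` property names are not taken from the WebIDL.

```javascript
// Hypothetical class bodies; only the setter logic comes from the snippet above.
class LocalizableString {
  constructor(source, defaultLanguage) {
    if (typeof source === "string") {
      // A plain string inherits the manifest's default language.
      this.value = source;
      this.language = defaultLanguage;
    } else {
      // An object carries its own @value and, normally, its own @language.
      this.value = source["@value"];
      this.language = source["@language"] || defaultLanguage;
    }
  }
}

class WebPublicationManifest {
  constructor(inLanguage) {
    this._lang = inLanguage;
    this._name = [];
  }

  set name(names) {
    const nameArray = Array.isArray(names) ? names : [names];
    this._name = nameArray.map((name) => new LocalizableString(name, this._lang));
  }

  get name() {
    return this._name;
  }
}

const manifest = new WebPublicationManifest("en");
manifest.name = [
  "The Three Musketeers",
  { "@value": "Les Trois Mousquetaires", "@language": "fr" },
];
```

After the assignment, every entry of `manifest.name` has the same shape, whichever form the author used.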

When processing our WebIDL, we can't allow two different values for the same language on a localizable string.

I am not 100% sure of the validity of this statement, but even if it is true, it is completely orthogonal. An implementation would have to make such a check on the result, regardless of whether the original manifest author used objects or strings.

@HadrienGardeur
Author

@iherman the flexibility you're talking about comes at a very high cost:

  • difficulty to validate
  • parsing is much more complex
  • processing is also much more complex

This is not limited to name, we have this issue all over the place right now (allowing strings in links for example makes no sense at all).

I am not 100% sure of the validity of this statement, but even if it is true, it is completely orthogonal. An implementation would have to make such a check on the result, regardless of whether the original manifest author used objects or strings.

As a UA, what should be displayed as the title of the publication if I have three strings that are all tagged as being in English?

@iherman
Member

iherman commented Aug 2, 2018

@HadrienGardeur I believe the term "much more complex" is... overstating it.

I guess we will have to agree that we disagree on that one, and let the rest of the group make the decision.

@HadrienGardeur
Author

HadrienGardeur commented Aug 3, 2018

This discussion was somehow continued at #288 (comment).

I'm going to create an alternate schema for localizable strings based on language maps that I won't reference from the main JSON Schema for now.

Update: https://github.com/w3c/wpub/blob/master/schema/localizable-map.schema.json
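
To show the general idea of a language map, here is a sketch that converts an array of @value/@language objects into an object keyed by language tag, in the style of JSON-LD language maps. It does not reproduce the linked schema, and `toLanguageMap` is a hypothetical name:

```javascript
// Converts [{ "@value", "@language" }, ...] into { <language>: <value>, ... }.
// A map holds at most one value per language; a later entry overwrites an earlier one.
function toLanguageMap(items) {
  const map = {};
  for (const item of items) {
    map[item["@language"]] = item["@value"];
  }
  return map;
}

const nameMap = toLanguageMap([
  { "@value": "The Three Musketeers", "@language": "en" },
  { "@value": "Les Trois Mousquetaires", "@language": "fr" },
]);
```

Because keys are unique, this shape rules out two different values for the same language by construction.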

@iherman
Member

iherman commented Aug 10, 2018

@HadrienGardeur administrative comment: I think we should have a separate issue on whether or not to use language maps, instead of what we have now. It would be clearer to handle it as an issue of its own. Other points were spread over other issues, too (e.g., #290), and are in the process of being closed. Are there other detailed points in this one that should be kept open separately? We could then close this one...

@HadrienGardeur
Author

I think we need to explicitly state it in the spec and include examples for every format whenever a term allows a combination of string/array/object.

@iherman
Member

iherman commented Aug 10, 2018

@HadrienGardeur I would leave that to @mattgarrish; maybe a general statement making it clear that a string can be used instead of an array of strings, or in place of an object, etc., would work better than making the text even more difficult to read.

I agree that the examples should reflect these more clearly.

@HadrienGardeur
Author

I've updated the JSON Schema to use relative references. It should now be possible to embed it in other projects like your proof of concept @iherman.
