Serialization of natural language in data formats such as JSON [I18N] #178

Open
aphillips opened this Issue May 18, 2017 · 46 comments

Comments

@aphillips

aphillips commented May 18, 2017

Hello TAG!

I'm requesting a TAG review of:

Further details (optional):

  • Relevant time constraints or deadlines: None specifically, but this has been a developing problem for us.

You should also know that...

The Internationalization WG has been commenting on data format specifications with increasing frequency over the past couple years in which we have noted the lack of natural language string types in formats such as JSON. We are concerned that there are internationalization gaps or, in an attempt to address our comments, non-interoperable and divergent implementation choices being made.

This issue is the result of an I18N WG action.

We would like the TAG's opinion on the problem and mooted solutions. The I18N WG chair (@aphillips) and Team contact (@r12a) can be available for consultation as needed.

We'd prefer the TAG provide feedback as (please select one):

  • open issues in our Github repo for each point of feedback
  • open a single issue in our Github repo for the entire review
  • leave review feedback as a comment in this issue and @-notify [@aphillips][@r12a]
@aphillips

aphillips May 18, 2017

Please add the i18n-discuss label so that our tracking mechanism picks this up.

@domenic

domenic May 18, 2017

Member

This seems related to the discussion currently happening in heycam/webidl#358, where we're attempting to add a shared primitive to Web IDL that all specs can use, and getting stuck. The point of contention is basically whether the pattern should be

someAPI({
  lang: "...",
  dir: "...",
  label: "a string governed by the lang/dir",
  name: "another string, governed by the same lang/dir"
});

(the "Localizable base dictionary" solution)

or

someAPI({
  label: {
    lang: "...",
    dir: "...",
    value: "a string governed by the lang/dir"
  },
  name: "another string, using the default lang/dir"
});

(the "LocalizableString union typedef" solution).

The former makes it easier to say that all strings have the same lang/dir. The latter allows more granular decision making, at the cost of verbosity.

I suppose you could even have both.
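
A minimal sketch of what "having both" could look like on the consuming side. The helper name `resolveLocalizable` and its fallback rule are assumptions made for illustration; nothing here comes from the Web IDL discussion itself.

```javascript
// Hypothetical helper: resolve one field to { value, lang, dir }.
// A field may be a plain string (it inherits the dictionary-level lang/dir)
// or an object carrying its own lang/dir (the "LocalizableString" shape).
function resolveLocalizable(dict, key) {
  const field = dict[key];
  if (typeof field === "string") {
    return { value: field, lang: dict.lang ?? null, dir: dict.dir ?? null };
  }
  return {
    value: field.value,
    lang: field.lang ?? dict.lang ?? null,
    dir: field.dir ?? dict.dir ?? null,
  };
}

const options = {
  lang: "en",
  dir: "ltr",
  label: { lang: "he", dir: "rtl", value: "שלום עולם!" },
  name: "Hello World!",
};

console.log(resolveLocalizable(options, "label")); // per-string metadata wins
console.log(resolveLocalizable(options, "name"));  // falls back to the shared lang/dir
```

The cost is that every consumer has to handle two shapes for the same field.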

@r12a

r12a May 18, 2017

@aphillips it seems that this repo is not under the w3c/ domain, so i'm unable to set up the normal notifications and labels, and we won't get notifications to our list.

@torgo torgo added this to the tag-telcon-2017-06-06 milestone Jun 6, 2017

@dbaron

dbaron Jun 6, 2017

Member

One question we're thinking about is to what extent this can be solved by using only dir="auto" (plus LRM and RLM or similar) and language tags. The ergonomics of both options aren't great. I also wonder whether there's an alternative that could be plain text much of the time but could also be more markup-like when needed.

@torgo

torgo Jun 6, 2017

Member

Discussed on the 06-06 call. Suggestion: the I18N group should work with the WebIDL group.

@cynthia

cynthia Jun 6, 2017

Member

Solution 1: New Data Type
Create a new data type whose serialization optionally includes language and direction. Examples:

myLocalizedString: "Hello World!"@en^ltr
myLocalizedString_fr: "Bonjour monde !"@fr
myLocalizedString_ar: "مرحبا بالعالم!"@ar-EG^rtl
myLocalizedString_und: "שלום עולם!"^rtl
myLanguageNeutralString: "978-0-123-4567-X" // no language or direction for this non-natural-language string

There are too many parser implementations already out in the wild for this approach to be feasible. Since parsers that do not support this feature will fail on data with these tags present, this does not seem like a way forward.

We did briefly touch on http://unicode.org/faq/languagetagging.html during the call, in case that would be an option.

@aphillips

aphillips Jun 6, 2017

dir="auto" is not a panacea. The first strong characters in a string may be left-to-right and fool the algorithm.

My concern here is that this requires the addition of LRM/RLM markers to data---data that may not be owned by the process assembling the wire format or that may have a field length restriction expressed in characters, code units, or bytes, etc. Adopting auto semantics and requiring the markers introduces (possibly cascading) data change. It also requires, in some cases, developers to introduce more markers into text, as when assembling messages.
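
To make the failure mode concrete, here is a rough sketch of first-strong detection. It is an approximation written for this example: only Hebrew and Arabic script letters are treated as right-to-left, whereas real implementations use the Unicode bidi character classes. The function name and the sample title are invented.

```javascript
// Approximate first-strong detection (illustration only).
// Letters in Hebrew or Arabic script, plus RLM, count as right-to-left;
// any other letter, plus LRM, counts as left-to-right.
const RTL_LETTER = /[\p{Script=Hebrew}\p{Script=Arabic}]/u;
const ANY_LETTER = /\p{L}/u;

function firstStrongDirection(text) {
  for (const ch of text) {
    if (ch === "\u200F") return "rtl"; // RIGHT-TO-LEFT MARK
    if (ch === "\u200E") return "ltr"; // LEFT-TO-RIGHT MARK
    if (ANY_LETTER.test(ch)) return RTL_LETTER.test(ch) ? "rtl" : "ltr";
  }
  return null; // no strong character found
}

// A Hebrew title that happens to start with a Latin acronym.
const title = "HTML למתחילים";

console.log(firstStrongDirection(title));            // "ltr": the heuristic is fooled
console.log(firstStrongDirection("\u200F" + title)); // "rtl", but the stored data had to change
console.log(title.length, ("\u200F" + title).length); // the marker also lengthens the field
```

The last line shows the side effect described above: the fix adds a code unit to a value that may be length-restricted or owned by someone else.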

@aphillips

aphillips Jun 6, 2017

@cynthia The Unicode language tagging characters are deprecated and it is a Bad Idea to use them. We knew coming in that a new data type was more-or-less a non-starter unless/until our request for the keys to the time machine comes through.

@torgo Thanks for the update. We'll reach out to WebIDL

@r12a

r12a Jun 13, 2017

All, wrt using Unicode formatting characters to establish direction, please read the other docs that Addison links – in particular http://w3c.github.io/i18n-discuss/notes/string-base-direction.html – where we try to enumerate the pros and cons of various approaches. (@aphillips we should probably make it a bit clearer that folks should read those docs to get a better basis for discussion)

@domenic it's useful to be able to apply the same lang/dir metadata to multiple strings without repeating the metadata, if that's possible; however, it's certainly easy to imagine situations where different assignments are needed for particular strings (eg. in the case of a set of alternative translations for an error message, where one string is in english, and another in hebrew).

hth

@cynthia

cynthia Jun 20, 2017

Member

> The Unicode language tagging characters are deprecated and it is a Bad Idea to use them. We knew coming in that a new data type was more-or-less a non-starter unless/until our request for the keys to the time machine comes through.

Understood. It was just a note that the topic came up during the call.

@dbaron

dbaron Jun 26, 2017

Member

So let me present what I'm concerned about here in a little more detail. My basic concern is that there really don't seem to be any good options:

  • the new data type seems to be a non-starter in terms of compatibility (e.g., parsing, etc.)
  • the use of dictionaries makes things harder for developers using the API, for implementors of the API, and for specification authors (increasing both the amount of work and the risk of errors), and:
    • if the use of a dictionary rather than a string is optional, the handling of dictionaries is frequently going to be wrong in both specs and implementations.
    • if the use of a dictionary is not optional, it adds a good bit of extra overhead for developers in the common case where they're not going to be interested in adding language or direction information.

One of the pieces of advice from i18n in the past was that text that should be presented to users should be markup rather than attribute values, so that when needed it could allow elements within it (for things like language and direction, ruby, etc.). I also wonder whether this sort of advice could be extended here, i.e., whether we should be encouraging the use of HTML rather than text.

@dbaron

dbaron Jul 7, 2017

Member

(And if we wanted to encourage HTML, would it be a subset of HTML, or arbitrary HTML?)

@aphillips

aphillips Jul 7, 2017

Thanks @dbaron. While, in general, markup is a Good Thing for this, at the same time the point of using JSON and other data languages is the transmission of "unrendered" data. Let me give a concrete use case.

Suppose that in my day job I am building a Web page to show a customer's library of e-books. The e-books exist in a catalog of data and consist of the usual data values. It might look something like:

{
    "id": "978-0-1234-5678-X",
    "title": "Moby Dick",
    "authors": [ "Herman Melville" ],
    "language": "en-US",
    "pubDate": "1851-10-18",
    "publisher": "Mark Twain Press",
    "coverImage": "https://example.com/images/mobidick_cover.jpg",
    // etc.
},

Each of the above is a data field in a database somewhere. Now, because I know I need it, I have language and direction information for each of the textual fields also in my database. I even have stuff like a pronunciation field for title and author (for sorting Chinese and Japanese). Those are just data fields. Do I really want to serialize them as HTML:

   "title": "<span lang='en-US' dir='ltr'>Moby Dick</span>"

After all, I may not end up displaying the title field in an HTML context! My JSON might very well be used to populate, say, the device's local data store, which uses native controls to show the title.

I'd also argue that:

> a good bit of extra overhead for developers in the common case where they're not going to be interested in adding language or direction information

... is probably wrong. The common case where you don't want language or direction information is for non-language-bearing fields (isbn). Omitting the information for language-bearing fields is basically an I18N bug (yeah, being a pedant here)
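
A sketch of the alternative being argued for: keep language and direction as data next to each language-bearing field, and let each consumer decide how to present it. The record shape and both helper names are hypothetical, chosen only to illustrate the point.

```javascript
// Hypothetical catalog record: language-bearing fields carry their own
// lang/dir; the ISBN-like id stays a plain string.
const book = {
  id: "978-0-1234-5678-X",
  title: { value: "Moby Dick", lang: "en-US", dir: "ltr" },
  authors: [{ value: "Herman Melville", lang: "en-US", dir: "ltr" }],
};

// Consumer 1: an HTML view turns the metadata into attributes.
// (lang and dir are assumed to be trusted values here; only the text is escaped.)
function toHtml(field) {
  const escaped = field.value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  return `<span lang="${field.lang}" dir="${field.dir}">${escaped}</span>`;
}

// Consumer 2: a non-HTML consumer (for example a local data store feeding
// native controls) keeps the same fields as plain data.
function toNativeLabel(field) {
  return { text: field.value, locale: field.lang, rightToLeft: field.dir === "rtl" };
}

console.log(toHtml(book.title));
console.log(toNativeLabel(book.title));
```

The same JSON serves both consumers, which is harder when the markup is baked into the string.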

@dwsinger

dwsinger Jul 7, 2017

Not to be a pedant, but if you separate the language tag from the other fields, don't you introduce ambiguity or risk of error?

{
"id": "978-0-1234-5678-X",
"title": "Quo Vadis",
"authors": [ "Henryk Sienkiewicz" ],
"language": "la",
"pubDate": "1895-10-18",
// etc.
},

The title is indeed in Latin. The book was originally written in Polish. But maybe this edition is in some other language. This becomes particularly problematic if two fields need different tagging (here, the author's name might be tagged as "pl" -- Polish).

@aphillips

aphillips Jul 7, 2017

@dwsinger Exactly so. The book language(s) (the language(s) of the intended audience) might be (often are) different from the language of the title or the author. The language field really is wrongly ambiguous, given that each field (title, author, publisher name) needs language and direction metadata.
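
One way to picture the distinction, as a sketch only: the edition's language lives in its own field, and each string carries its own tag. The field name `editionLanguages`, its value, and the helper are assumptions made for the example.

```javascript
// Hypothetical record: the edition's language(s) are kept apart from the
// language of each individual string.
const record = {
  id: "978-0-1234-5678-X",
  editionLanguages: ["en"], // assumed for illustration: an English translation
  title: { value: "Quo Vadis", lang: "la", dir: "ltr" },
  authors: [{ value: "Henryk Sienkiewicz", lang: "pl", dir: "ltr" }],
};

// The language of a string comes only from that string's own metadata,
// never from the edition-level field.
function stringLanguage(field) {
  return field.lang ?? "und"; // "und" is the BCP 47 tag for "undetermined"
}

console.log(stringLanguage(record.title));      // "la"
console.log(stringLanguage(record.authors[0])); // "pl"
```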

@hsivonen

hsivonen Jul 10, 2017

Solution 1, which would require changes to JSON itself, isn't practical, because changing all JSON parsers would be too much of an ocean-boiling effort.

I think Solution 2, potentially with bidi control characters within string values, is workable.

> This seems related to the discussion currently happening in heycam/webidl#358, where we're attempting to add a shared primitive to Web IDL that all specs can use, and getting stuck. The point of contention is basically whether the pattern should be
>
> someAPI({
>  lang: "...",
>  dir: "...",
>  label: "a string governed by the lang/dir",
>  name: "another string, governed by the same lang/dir"
> });
>
> (the "Localizable base dictionary" solution)
>
> or
>
> someAPI({
>  label: {
>    lang: "...",
>    dir: "...",
>    value: "a string governed by the lang/dir"
>  },
>  name: "another string, using the default lang/dir"
> });
>
> (the "LocalizableString union typedef" solution).
>
> The former makes it easier to say that all strings have the same lang/dir. The latter allows more granular decision making, at the cost of verbosity.

I would expect the former to face less resistance, because it just adds some key-value pairs without forcing a reorganization of a given JSON-based format compared to its lang/dir-unaware version. Moreover, considering JSON from the perspective of developers trying to escape XML, the added nesting/complexity of the latter would probably not be well received. Therefore, I think pushing the latter as the only option wouldn't be productive.

A third option would be:

someAPI({
  label_lang: "...",
  label_dir: "...",
  label: "a string governed by label_lang/label_dir",
  name_lang: "...",
  name_dir: "...",
  name: "another string, governed by name_lang/name_dir"
});
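
A minimal sketch of how a consumer might read this third shape. The `_lang` and `_dir` suffixes follow the example above; the helper name `readWithSiblings` and the sample payload are made up for illustration.

```javascript
// Hypothetical reader for the "<key>_lang" / "<key>_dir" sibling convention.
function readWithSiblings(obj, key) {
  return {
    value: obj[key],
    lang: obj[`${key}_lang`] ?? null,
    dir: obj[`${key}_dir`] ?? null,
  };
}

const payload = JSON.parse(`{
  "label_lang": "ar-EG",
  "label_dir": "rtl",
  "label": "مرحبا بالعالم!",
  "name": "978-0-123-4567-X"
}`);

console.log(readWithSiblings(payload, "label")); // carries its own lang/dir
console.log(readWithSiblings(payload, "name"));  // lang and dir are null
console.log(typeof payload.label);               // still a plain string for unaware consumers
```

A consumer that knows nothing about the convention keeps working, because `label` is still a plain string; the trade-off is that the association between a string and its metadata lives only in the key names.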

> whether we should be encouraging the use of HTML rather than text

I think using HTML in JSON makes sense for strings that carry multi-paragraph text with inline formatting (i.e. something that would make sense inside HTML <body>), but I think it wouldn't be good to recommend markup inside JSON for strings that are closer to HTML <title>, email subject line, name of a person, a GUI label, invoice/inventory line item, etc.

Even though HTML parsers are now widely available, a plain-text string is a significantly simpler thing for the consumer's data model to deal with than a tree rooted at DOM DocumentFragment or equivalent in a non-DOM markup tree API.

People use JSON instead of XML to avoid various complexities of XML and to use a format that maps nicely to and from basic programming language data structures. Making shortish plainish strings (not just ones representing multi-paragraph text with inline formatting) in JSON potentially carry markup would defeat both avoiding XML mixed content complexity and having a format that maps nicely to and from basic programming language data structures.

When a JSON-based format wouldn't use markup in strings for non-bidi reasons, to the extent a base direction taken from an adjacent key-value pair isn't enough, I think finer-grained bidi control should use the bidi control characters instead of importing the full data model complexity of markup for every (human-readable) string.
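
As an illustration of finer-grained control without markup, the sketch below wraps an embedded run in Unicode directional isolates. The code points are the standard isolate controls; the helper name `isolate` and the sample message are assumptions for this example.

```javascript
// Unicode directional isolate controls.
const LRI = "\u2066"; // LEFT-TO-RIGHT ISOLATE
const RLI = "\u2067"; // RIGHT-TO-LEFT ISOLATE
const FSI = "\u2068"; // FIRST STRONG ISOLATE
const PDI = "\u2069"; // POP DIRECTIONAL ISOLATE

// Hypothetical helper: wrap an inserted run so it cannot disturb the
// ordering of the surrounding text. With no known direction, fall back to FSI.
function isolate(text, dir) {
  const opener = dir === "rtl" ? RLI : dir === "ltr" ? LRI : FSI;
  return opener + text + PDI;
}

const message = {
  dir: "ltr", // base direction taken from an adjacent key-value pair
  text: `Title: ${isolate("מדריך HTML", "rtl")} (3 copies)`,
};

console.log(JSON.stringify(message));
```

The string stays plain text for the consumer's data model, at the cost of two extra control characters per isolated run.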

(Whereas bidi is intrinsic to whole scripts, ruby is a sometimes-used (relatively rarely-used even) typographical device for the scripts with which it is used, so I think in cases where bidi doesn't justify the complexity of markup, ruby doesn't, either.)

@travisleithead

travisleithead Jul 11, 2017

Contributor

A lot of great points have been made in this thread. I'm personally not convinced that there is any single "right" solution.

It seems that if we want to create a compact data representation for strings, associating lang, direction, and other meta-data about the string, then this encoding has to be as portable across systems as possible, which leads me to think it must be some new representation of a string literal. That, of course, is asking for a huge change across all programming environments and applications--not likely to happen, but neat to dream about--or even start some activity there, perhaps in Unicode.

For a serialization memory layout that associates lang, direction, etc., metadata about strings, I don't offer a strong opinion, though I have a weak opinion: keep it simple, or it will likely be too much of a burden to get much traction. For example, I think a simple dictionary with fields at parallel depths would work fine for most applications, e.g., { lang: .., dir: .., stringvalue: ... }.

@aphillips

aphillips Jul 11, 2017

@travisleithead I tend to agree. It would have been nice to address this in the past, but we're here now.

However, building recommended patterns and best practices would allow specs to be consistent and interoperate well.

@r12a

r12a Jul 17, 2017

> The common case where you don't want language or direction information is for non-language-bearing fields (isbn).

Watch out though, apparently harmless data may not actually be so. The isbn field will only display correctly if it is isolated when displayed in a RTL target context and treated as LTR inside that isolated area. For example, if i just drop the text into a field on a RTL page without any precautions, i get:

[Image: screen shot 2017-07-17 at 18 11 08]

rather than what i really want, which is:

[Image: screen shot 2017-07-17 at 18 11 19]

However, if the value were a range, such as 100-300, the first arrangement would actually be what's wanted in Arabic (though not Hebrew, and i'm not sure about N'Ko), otherwise the range would appear to be decreasing instead of increasing.

So it may actually be useful to have some direction information for isbn numbers, MAC addresses, telephone numbers, etc.
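
For the ISBN case, one possible HTML-side precaution is sketched below: isolate the value and force left-to-right order inside it. The function name `renderCodeValue` is hypothetical, and this only covers the "treat as LTR" case, not the numeric-range case mentioned above.

```javascript
// Hypothetical renderer: show a code-like value (ISBN, MAC address, phone
// number) inside a right-to-left page. <bdi> isolates the value from its
// surroundings and dir="ltr" fixes its internal order.
function renderCodeValue(value) {
  const escaped = value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  return `<bdi dir="ltr">${escaped}</bdi>`;
}

console.log(renderCodeValue("978-0-1234-5678-X"));
// → <bdi dir="ltr">978-0-1234-5678-X</bdi>
```

Whether a given field should be rendered this way is exactly the kind of knowledge that has to come from somewhere: either metadata in the data or per-field rules in the application.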

@dbaron

dbaron Jul 17, 2017

Member

Though in those cases there's a tradeoff between having the direction data in the text data, versus having the application have knowledge of the correct way to present the particular field, since there is a correct and simple per-field algorithm (although it's not particularly simple to have tens or hundreds of them). One is easier for the producer of the data and the other is easier for the consumer.

This is different from cases where you basically have to have the direction data stored in the text because you can't trivially derive it from the text.
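
To make the tradeoff concrete, here is a rough Python sketch of a consumer that supports both approaches. The table `FIELD_DIRECTION`, its field names, and the function `display_direction` are all hypothetical; nothing in this thread defines them.

```python
# Consumer-side knowledge: one entry per field the application understands.
# Hypothetical field names; a real application might need tens or hundreds.
FIELD_DIRECTION = {"isbn": "ltr", "mac_address": "ltr", "telephone": "ltr"}


def display_direction(field, value):
    """Return "ltr", "rtl", or None (unknown) for a field's value."""
    # Producer-side approach: the direction travels with the data.
    if isinstance(value, dict) and "dir" in value:
        return value["dir"]
    # Consumer-side approach: fall back to what the application knows
    # about this particular field.
    return FIELD_DIRECTION.get(field)
```

A bare natural-language string in a field the table does not know returns None, which is the case where the direction cannot be derived without metadata in the data itself.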

@torgo torgo modified the milestones: tag-f2f-london-2017-07-25, tag-telcon-2017-07-11 Jul 25, 2017

@torgo torgo added the extra time label Jul 25, 2017

dbaron Sep 27, 2017

Member

Discussed in today's TAG meeting a bit; we'll take it to an extra time breakout discussion today or tomorrow.

dbaron Sep 27, 2017

Member

@cynthia and I just had a breakout discussion of this.

I think one conclusion that we came to is that the option of having a lang that applies to multiple strings is a little weird, e.g., in:

{
  "title": "Nice",
  "description": "Une ville française",
  "lang": "fr",
  "postcode": "06000"
}

it feels weird that the "lang" applies to both the "title" and the "description" but not the "postcode". That just doesn't feel to me like the way JSON works. In other words, I think we lean towards the solutions that have the language information of the "title" to be inside the value of "title".

So then given the options that we've talked about so far, I think that only leaves two options, one of which is to say that we'd require what I'll call (A):

{
  "title": { "text": "Nice", "lang": "fr" },
  "description":  { "text": "Une ville française", "lang": "fr" },
  "postcode": "06000"
}

where only the "text" property would be mandatory, i.e., where you could write this (B):

{
  "title": { "text": "Nice" },
  "description":  { "text": "Une ville française" },
  "postcode": "06000"
}

if you didn't have language information, but that would forbid (C):

{
  "title": "Nice",
  "description": "Une ville française",
  "postcode": "06000"
}

The alternative remaining option would instead allow (C) as an equivalent to (B).
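
For illustration, a consumer following that second option might normalize both shapes to the object form before doing anything else. This is only a sketch in Python; the function name `to_localizable` is hypothetical, and it assumes the consumer already knows which fields hold natural language (here "title" and "description", but not "postcode").

```python
import json


def to_localizable(value):
    """Normalize a bare string (C) or an object (A/B) to the object form."""
    if isinstance(value, str):
        return {"text": value}  # (C) treated as equivalent to (B)
    if isinstance(value, dict) and isinstance(value.get("text"), str):
        result = {"text": value["text"]}
        if "lang" in value:  # "lang" stays optional, as in (B)
            result["lang"] = value["lang"]
        return result
    raise ValueError("expected a string or an object with a 'text' property")


record = json.loads(
    '{"title": "Nice",'
    ' "description": {"text": "Une ville française", "lang": "fr"},'
    ' "postcode": "06000"}'
)
title = to_localizable(record["title"])
description = to_localizable(record["description"])
```

Under the stricter option, the first branch would raise an error instead of accepting the bare string.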


@cynthia also brought up the issue of how this extends to when you have text in multiple languages. For example:

{
    "name": [ {"lang": "fr", "text": "Nice"}, {"lang": "ja", "text": "ニース"} ]
}

or perhaps more JSON-ish:

{
    "name": { "fr": {"text": "Nice"}, "ja": { "text": "ニース"} }
}

although the latter doesn't allow an individual string to be passed around without creating a new object (the language lives in the key rather than in the value), while the former requires a linear search to find a given language.

Here you'd now have to test whether the value of "name" is an object or an array (are those the right JSON terms?). In theory the alternative could be to force a list, but that feels like we're going down a very different use case.
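
A small Python sketch of that type test, accepting either multilingual shape and building the same index from both. The function name `texts_by_lang` is hypothetical, and the lookup is an exact match on the tag; real language matching is more involved than a dictionary lookup.

```python
import json


def texts_by_lang(name):
    """Index either multilingual shape by language tag."""
    if isinstance(name, list):
        # Array form: every entry is self-contained, but finding one
        # language means scanning the entries.
        return {entry["lang"]: entry["text"] for entry in name}
    if isinstance(name, dict):
        # Dictionary form: the language tag is the key.
        return {lang: entry["text"] for lang, entry in name.items()}
    raise TypeError("expected an array or an object")


as_array = json.loads(
    '{"name": [{"lang": "fr", "text": "Nice"}, {"lang": "ja", "text": "ニース"}]}'
)
as_object = json.loads(
    '{"name": {"fr": {"text": "Nice"}, "ja": {"text": "ニース"}}}'
)
```

Both inputs produce the same index, which is why a format that allows both shapes pushes this branching onto every consumer.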


So another question for the folks who raised this issue: it would help us to have a better idea of the range of use cases where this comes up. Are there a few representative examples we can look at to make this a little more concrete?

@dbaron dbaron modified the milestones: tag-telcon-2017-08-22, tag-telcon-2017-10-31 Sep 27, 2017

annevk Sep 28, 2017

Member

FWIW, Notifications API has something like this: https://notifications.spec.whatwg.org/#api. There lang and dir span both title and body though. This is similar to how in HTML the lang attribute also applies to the title or alt attribute, along with the contents of the element.

cynthia Sep 28, 2017

Member

Right, so that design is probably close to (but is a lot more complete than) one of the proposals above - but in a multilingual payload context we still have the lookup problem described above. I don't have any particularly good ideas on how to tackle this without a better manifest of the use cases though...

tobie Sep 28, 2017

The alternative remaining option would instead allow (C) as an equivalent to (B).

This is conceptually very similar to the LocalizableString proposal we discussed in heycam/webidl#358.

r12a Sep 28, 2017

Having the lang/dir values span both title and description is problematic, not because it doesn't cover postcode, but because the title and description may be in different languages. This is why we have also long wished there was an alternative to having one lang attribute in HTML to cover both the element content and the attribute values (and why, more generally, we advise avoiding natural language text in attribute values when designing markup languages).

Having said that, markup like HTML can be useful in that the content author doesn't need to specify the language for every element – the language is inherited from higher up the hierarchy. This is not so easy for JSON-like formats, but would be useful if it's possible. For example, it may be possible to declare the language for all the strings in A in the way you did, but this would only work if it were possible to override that for a particular string when needed.
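
As a sketch of what such an override could look like, the Python below resolves the language of one string from a document-level default plus an optional per-string value. The top-level "lang" key, the function name `effective_lang`, and the English description are assumptions made for the example, not an agreed syntax.

```python
def effective_lang(value, inherited=None):
    """Language of one string: its own "lang" wins, else the inherited one."""
    if isinstance(value, dict) and "lang" in value:
        return value["lang"]
    return inherited


doc = {
    "lang": "fr",  # assumed document-level default
    "title": "Nice",  # inherits "fr"
    "description": {"text": "A city in France", "lang": "en"},  # override
}

title_lang = effective_lang(doc["title"], doc["lang"])
description_lang = effective_lang(doc["description"], doc["lang"])
```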

tobie commented Sep 28, 2017

dbaron Oct 31, 2017

Member

We discussed this briefly today in the TAG's teleconference, and were wondering if we could find a way (and time) to make progress on this at TPAC. @cynthia is sending email about this, but noting this in github as well. /cc @r12a @aphillips

aphillips Oct 31, 2017

@dbaron Happy to coordinate a time.

Member

dbaron commented Nov 7, 2017

dbaron Nov 21, 2017

Member

And, for the record here, that discussion was reasonably productive and I think (if I'm recalling correctly) focused on two things:

  • an object that can be used where a string would be used, but also carries language and direction information, such as {text: "Moby Dick", lang: "en", dir: "ltr" } (could use text or value or other things there)
  • the ability to set defaults within a JSON object, perhaps with lang and dir or perhaps with something like default_lang and default_dir (or should that be camelCase?), though this has both advantages (compactness) and disadvantages (need to augment the objects to re-add the defaults in order to pass them around)

There was then further discussion about how to use the former in contexts where multiple languages needed to be provided (i.e., ability to use array or dictionary, where the dictionary has redundancy but may allow faster lookup assuming appropriate use of BCP 47 language tags). The redundancy is, however, preferable to having objects that can't be passed around because the language from the dictionary key needs to be added back on.
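
The "augment the objects" step mentioned in the second bullet could look roughly like the Python below. The key names default_lang and default_dir are the ones floated above and are not settled; the function name `expand_defaults`, the alt_title field, and its value are hypothetical.

```python
def expand_defaults(doc, text_fields):
    """Copy default_lang/default_dir into each listed field's object."""
    defaults = {}
    if "default_lang" in doc:
        defaults["lang"] = doc["default_lang"]
    if "default_dir" in doc:
        defaults["dir"] = doc["default_dir"]

    expanded = {}
    for field in text_fields:
        value = doc[field]
        if isinstance(value, str):
            value = {"text": value}
        # Values set on the string itself override the defaults.
        expanded[field] = {**defaults, **value}
    return expanded


doc = {
    "default_lang": "en",
    "default_dir": "ltr",
    "title": "Moby Dick",
    "alt_title": {"text": "Moby-Dick", "lang": "fr"},
}
standalone = expand_defaults(doc, ["title", "alt_title"])
```

Each object in the result carries its own lang and dir, so it can be handed to other code without the enclosing document. That is the same property the redundant dictionary entries preserve in the multilingual case.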

There was also a bit of discussion about JSON vs. WebIDL that I've forgotten.

The next step was that @aphillips was going to revise the string-meta document along these lines and then post it for review from those of us present (and others, if interested). So leaving the pending-external-feedback label.

aphillips Nov 21, 2017

Thanks @dbaron. The edits in question are in progress and I have an action item to notify this issue when complete. If you look at the String-Meta document, you can see the edits in progress (including recent commits today), although I have not yet reached closure enough to notify.

cynthia Jan 10, 2018

Member

@aphillips is there anything for us to follow-up on related to this?

@torgo torgo modified the milestones: tag-telcon-2018-01-02, tag-f2f-london-2018-01-31 Jan 16, 2018

dbaron Jan 23, 2018

Member

@aphillips We (the TAG) have a face-to-face meeting next week; curious if we'll have something to review by then.

aphillips Jan 23, 2018

@dbaron @cynthia We're working on updating our String-Meta doc. Most of the changes we discussed have been implemented in a not-quite clean way (syntax in examples is not right yet). Super open to comments. Our next teleconference is Thursday, wherein our WG will discuss.
