Requirements - MF wishlist #3

Closed
romulocintra opened this issue Nov 27, 2019 · 66 comments
Labels
requirements Issues related with MF requirements list resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. Stale

Comments

@romulocintra
Collaborator

romulocintra commented Nov 27, 2019

List of requirements to consider for MF

@romulocintra romulocintra changed the title Requirements List Requirements Nov 27, 2019
@romulocintra romulocintra changed the title Requirements Requirements - MF wishlist Jan 6, 2020
@romulocintra
Collaborator Author

romulocintra commented Jan 6, 2020

I'm listing the requirements from the first meeting's slides:

List of possible requirements

  • Easier to use ICU “Select”
  • Fluent could be considered as a starting point for the future of message format
  • Have pluggable “formatters”(Date/Time/Number ...)
  • HTML Markup
  • Cross-platform / Universal Format
  • Messages should have more context “description” or ”metadata”
  • MessageFormat - More Readable
  • Escaping(“ or ‘ ) and Interpolations (html tags)
  • Rule Modifiers - Send Message or Send SMS  -> similar to select ICU feature
  • Improve Translators / Developers  UX/DX
  • I need to somehow be able to cache my translations
  • Use Yaml or JSON as file format
  • Message reference - from another Message

@zbraniecki
Member

Proposal for an additional requirement:

  • Provides a translation of an XML/HTML element.

@jamuhl

jamuhl commented Jan 6, 2020

Sorry, I wasn't there in the first meetings so I'm not sure what is meant with "HTML Markup"?

But:

  • fully agree on custom pluggable "formatters"

And add:

  • extended plurals, like:
{ count , plural ,
   =0 {No candy left}
  one {Got # candy left}
  <10 {Got a few candies left}
  10-20 {Got a handful candies left}
other {Got # candies left} }

edit:
in i18next we use a postProcessing plugin to achieve that: https://github.com/i18next/i18next-intervalPlural-postProcessor#usage-sample

@zbraniecki
Member

HTML Markup

Ability to interpolate localization with HTML. Example:

<span>You have <b>6</b> unread messages from <img/> Mary.</span>

Fluent provides DOM Overlays which are heavily used in Firefox l10n - https://github.com/projectfluent/fluent.js/wiki/DOM-Overlays

@jamuhl

jamuhl commented Jan 6, 2020

@zbraniecki thank you for explaining...so basically take the innerhtml element(s) and extend it with the attributes and content contained in the translation...looks similar to the Trans component we have in react-i18next -> https://react.i18next.com/latest/trans-component (just we have no html elements but react components)

edit:
guess we could mimic DOM-Overlays by extending our Trans component...just not sure if this is part of the syntax or an extension that is provided by the i18n library?

@romulocintra
Collaborator Author

@mihnita should I reference your entire document here, or can we break it into features to add here?

@zbraniecki
Member

In our experience innerHTML in particular is a no-go for security reasons (l10n resources are treated as a third-party). I expect the requirements from the W3C to be similar here.

Instead, we whitelist allowed textual elements (<sup/>, <sub/>, <span/> etc.) and for everything else we require the developer to provide the elements in the source with a name, and then the localizer can position them using the same name:

<p data-l10n-id="key1">
  <a href="https://www.mozilla.org" data-l10n-name="link"/>
  <img src="./pics/img1.png" data-l10n-name="logo"/>
</p>
key1 =
    Welcome to <a data-l10n-name="link">Mozilla</a>!
    Please, click on <img data-l10n-name="logo"/> to proceed.

That's significantly more involved than innerHTML, but the end result is quite similar with a lot of linting, security, and sanity checks.
We're also discussing further extensions - https://github.com/zbraniecki/fluent-domoverlays-js/wiki/New-Features-(rev-3)
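
To make the name-matching idea above concrete, here is a minimal string-based sketch. It is not Fluent's DOM Overlays code: `overlay`, `SAFE_TAGS`, the `Map` of source attributes and the regular expression are all assumptions made for illustration, and a real implementation would work on parsed DOM nodes and handle nested elements.

```javascript
// Hypothetical allowlist of purely textual elements a localizer may introduce.
const SAFE_TAGS = new Set(["sup", "sub", "span", "b", "i", "em", "strong"]);

// Matches <tag ...>text</tag> or <tag .../> (no nesting), capturing the attributes.
const ELEMENT = /<(\w+)((?:\s+[\w-]+="[^"]*")*)\s*(?:\/>|>([^<]*)<\/\1>)/g;
const L10N_NAME = /data-l10n-name="([^"]*)"/;

// sourceElements maps a developer-chosen name to the attributes the developer
// wrote in the source markup, e.g. "link" -> 'href="https://www.mozilla.org"'.
function overlay(translation, sourceElements) {
  return translation.replace(ELEMENT, (match, tag, attrs, text = "") => {
    const name = (L10N_NAME.exec(attrs) || [])[1];
    const selfClosing = match.endsWith("/>");
    if (name !== undefined && sourceElements.has(name)) {
      // Named element: attributes come from the source, never from the translation.
      const open = `<${tag} ${sourceElements.get(name)}`;
      return selfClosing ? `${open}/>` : `${open}>${text}</${tag}>`;
    }
    if (name === undefined && SAFE_TAGS.has(tag)) {
      // Allowlisted textual element: keep the tag, drop every attribute.
      return selfClosing ? `<${tag}/>` : `<${tag}>${text}</${tag}>`;
    }
    // Anything else is reduced to its text content.
    return text;
  });
}

const source = new Map([
  ["link", 'href="https://www.mozilla.org"'],
  ["logo", 'src="./pics/img1.png"'],
]);
console.log(overlay('Welcome to <a data-l10n-name="link">Mozilla</a>!', source));
```

The point of the design is that the translation only positions elements; anything security-relevant (URLs, event handlers) is taken from the developer's markup.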

@jamuhl

jamuhl commented Jan 6, 2020

innerHTML was more referring to the content than to the implementation detail...same reason we do not just append translations into a react element by using dangerouslySetInnerHTML ;)

@mihnita
Collaborator

mihnita commented Jan 7, 2020 via email

@romulocintra romulocintra added the requirements Issues related with MF requirements list label Jan 7, 2020
@romulocintra
Collaborator Author

romulocintra commented Jan 7, 2020

@mihnita

  • If you can break it into features, great. The link is important too; both are important.
  • I completely agree that some features won't fit in one line and will need more detail. Those, IMHO, deserve their own issue or thread.

My Proposal :

  • If you can break it into features, that would be perfect (I agree that the link is important too).
  • Some features won't fit in a one-line description and will need more detail. Those, IMHO, deserve their own issue or thread. I suggest we create a new issue tagged "requirements" for each, holding all the detail and discussion, and keep a reference with a short description here so the list stays in one place.

I expect the short-description items will also grow into their own issues/tasks, but we can figure that out later, after we groom and filter the list of requirements.

@longlho

longlho commented Jan 7, 2020

My proposal for the process @romulocintra is to set a deadline, then de-dupe the list, then prioritize into mvp, v1, v2... so we can move this along.

@romulocintra
Collaborator Author

romulocintra commented Jan 7, 2020

My proposal for the process @romulocintra is to set a deadline, then de-dupe the list, then prioritize into mvp, v1, v2... so we can move this along.

@longlho I believe this (process, MVP, roadmap, goals) must be addressed in #4, where we can define all the organizational and process matters as a team.

Regarding this task and how we organize the list, I think the previous proposal fits our current needs. I did not propose a deadline for this task, but I see the next meeting as a good candidate to prioritize/filter/de-dupe the items raised in this thread. Finally, we can review #4 to close all the organizational issues, deadlines and goals.

Meanwhile, I'm referencing your comments in #4

PS: I just added these topics to the next meeting agenda.

@MickMonaghan

Right now, in ICU4J, if you do:
"You owe {someNumber, number, currency}." - then the actual currency is inferred from the current locale - which is just nasty.

You can do this:
"You owe {someNumber, number, :: currency/JPY}." - but this means that you know in advance that you're dealing with a specific currency - JPY - in this case.
One should be able to declare the actual currency at run time.
Perhaps Fluent already supports this?
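
For comparison, a small sketch of what "declare the currency at run time" looks like with the ECMA-402 `Intl.NumberFormat` API. This is JavaScript rather than ICU4J, and `formatOwed` is a hypothetical helper, not part of any message format library: the currency comes from the data, while the locale only controls how the amount is laid out.

```javascript
// Format an amount in a currency chosen at run time.
// The currency code is data; the locale only decides digits, separators and symbol placement.
function formatOwed(locale, amount, currency) {
  const money = new Intl.NumberFormat(locale, {
    style: "currency",
    currency,
  }).format(amount);
  return `You owe ${money}.`;
}

console.log(formatOwed("en-US", 1234, "JPY"));
console.log(formatOwed("en-US", 1234, "EUR"));
```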

@nbouvrette
Collaborator

nbouvrette commented Jan 12, 2020

Sorry for joining the conversation late and having to leave the last session early but here is my take:

  • Make the syntax cross-language/cross-platform. Maybe having an RFC and/or improved (non-technical) documentation of the syntax would help?
  • See if we can make the syntax easier to read (not just for developers, but presuming "raw" syntax could also be translatable by linguists)
  • Provide free tools with the syntax for authoring and translation (our own online CAT tool?)
  • Extend selectors (I like @jamuhl's example and will have others to present in the next session)
  • File format-agnostic - not every TMS does a good job supporting file formats. If the syntax is independent of the file format, it is more flexible to adopt
  • Keep the syntax markup-agnostic (e.g. HTML) - the syntax should be able to accept HTML or any other markup, but the TMS and/or library can implement manipulation however it finds best for its use case
  • Find better ways to escape the syntax (' is way too common and the current escape patterns could be possibly standardized/simplified)
  • Add more features:
    • Predefined Linguistic selectors (will be presenting this idea in the next meeting)
    • Improved list support
    • Better currency support
    • More flexible formats (extendable inline?)
    • Numbers to "written numbers" converter?
    • Inflections (genders, articles, declensions, etc.)

@MickMonaghan

Can we keep the language used to retrieve the UI strings separate from the language/locale used to format variables/placeholders within a string?
This would be consistent with how some OSs and some string formatting libs already separate UI language from locale formats.

@zbraniecki
Member

Perhaps Fluent already supports this?

Fluent does support it, it's called "partially formatted variables" and currency was the particular example that drove that feature.

The way it works in Fluent is this:

ctx.format('product-cost', {
  amount: FluentNumber(342, {
    currency: "JPY",
  })
});
// Translation can just use "default" formatting options
product-cost = This product costs { $amount }

// Or a translation can specify its own list of options (based on ECMA402 NumberFormat)

product-cost = This product costs { NUMBER($amount, minimumFractionDigits: 3) }

An important bit is that the selector (NUMBER) limits which options can be provided by the translator - in case of number, currency is not available for the localizer to specify.
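
A rough sketch of that restriction, written against plain `Intl.NumberFormat` rather than the Fluent runtime. `PartialNumber`, `formatNumber` and the contents of `TRANSLATOR_OPTIONS` are assumptions for illustration; the actual list of options Fluent exposes to localizers may differ.

```javascript
// Options a translator may set from inside a message (assumed list, for illustration).
const TRANSLATOR_OPTIONS = new Set([
  "minimumFractionDigits",
  "maximumFractionDigits",
  "useGrouping",
]);

// A value that carries developer-chosen formatting options with it.
class PartialNumber {
  constructor(value, options = {}) {
    this.value = value;
    this.options = options;
  }
}

// Merge translator options over developer options, ignoring any translator
// option that is not on the allowlist (for example `currency`).
function formatNumber(locale, arg, translatorOptions = {}) {
  const allowed = Object.fromEntries(
    Object.entries(translatorOptions).filter(([key]) => TRANSLATOR_OPTIONS.has(key))
  );
  return new Intl.NumberFormat(locale, { ...arg.options, ...allowed }).format(arg.value);
}

const amount = new PartialNumber(342, { style: "currency", currency: "JPY" });
console.log(formatNumber("en-US", amount)); // developer options only
// The translator's `currency` is dropped; `minimumFractionDigits` is applied.
console.log(formatNumber("en-US", amount, { currency: "USD", minimumFractionDigits: 3 }));
```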

@zbraniecki
Member

zbraniecki commented Jan 13, 2020

Provide free tools with the syntax for authoring and translation (our own online CAT tool?)

Fluent comes with a CAT tool - https://github.com/mozilla/pontoon / https://pontoon.mozilla.org/
A lot of effort in Pontoon at the moment goes into better WYSIWYG for Fluent selectors.

Leave the syntax markup (e.g. HTML) agnostic - the syntax should be able to accept HTML or any other markup but the TMS and/or library can implement manipulation however it finds best for its use case

I'm not sure I agree. Features like compound messages are important only when you look at the problem in the context of UI widgets. The drive to be agnostic may lead to a syntax that is not really optimized for anything.
While I agree that we should ensure the syntax and data model are useful for a wide range of software use cases (and not, say, just for Web/React), having some "P1" targets would help us deliver something actually useful, imho.
In particular, from my angle, understanding that Software UI is not created by a bunch of imperative calls from JS/C/Java, but is usually defined in some declarative markup is fundamental to how you design features.
If we reject this hypothesis, it will have deep implications on what we end up with.

@grhoten
Member

grhoten commented Jan 13, 2020

I previously gave a presentation called Let's Come To An Agreement About Our Words. The presentation covers an older format that we used in Siri, and we're migrating to a newer simplified format. Here are some highlights of what it can do and what we found desirable.

  • It's generally an XML format. The original would use something like ECMAScript/Java beans/UEL for referencing variables and its properties. The UEL syntax was too complicated and was changed to favor more XML with a nicer editor, much like your favorite word processor stores its data in XML without the end user really knowing that low level detail. It's also easier to interchange it with XLIFF when it's XML.
  • Support for SSML is very desirable for screen readers or virtual assistants.
  • The messages are by default both printable and speakable, but you can exclusively print or speak a phrase. If you ever need to explicitly speak a number within a given context, this is critical.
  • Word inflection and grammeme detection (values of grammatical categories) are fundamental parts of the syntax. It's critical functionality with user provided vocabulary. Generally, you need to know the grammatical number, grammatical case, the grammatical gender of the words and the pronunciation of the word (generally just if the word starts or ends with a vowel).
  • Word inflection can include adding prepositions, articles, pronouns or grammatical states of a given word. For complicated examples, check out Russian, Korean or Arabic.
  • Number pronunciation is provided by CLDR's RBNF.
  • Getting a number and noun into grammatical agreement is critical. The grammatical gender of the number comes from the noun. The grammatical number of the noun is generally affected by the value of the number (e.g. 1 or 2). The grammatical case is defined by the translator given the context of the sentence. The translator does not provide the exact inflections by default.
  • List handling involves inflecting each word. This might mean making each item the definite form.
  • The "and" (AKA conjunction) list, and the "or" (AKA disjunction) list are able to handle the context correctly for Italian, Spanish and Korean.
  • There is also the adjective list, which is probably the hardest to get correct for English. For Chinese and Korean, it's a lot easier.
  • There is a calendar concept based mostly on CLDR's translations. Some functionality is provided to add preposition or postpositions as needed. The grammatical case can be modified as needed. CLDR doesn't handle grammatical case modification that well by default.
  • There is a measurement concept that is separate from CLDR's implementation to provide precise translations of units of measure, like kilometers and miles. CLDR is more focused on the printable form instead of the speakable form, which is why CLDR is generally ignored when the speakable form is also needed.
  • It has a highly customized currency concept. CLDR only partially covers support for this functionality. Pronunciation of a currency for its units and subunits in native and foreign contexts is important.
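
As a baseline for the "and"/"or" list handling mentioned above, this is what the standard ECMA-402 `Intl.ListFormat` API offers. It is only a sketch of the list part: it does not inflect the items, and it is not the Siri format described in this comment. `formatList` is a hypothetical wrapper.

```javascript
// Join items into a locale-aware list; `type` is "conjunction" (and) or "disjunction" (or).
function formatList(locale, items, type) {
  return new Intl.ListFormat(locale, { style: "long", type }).format(items);
}

const fruit = ["apples", "pears", "plums"];
console.log(formatList("en", fruit, "conjunction"));
console.log(formatList("en", fruit, "disjunction"));
// Spanish is a case where the conjunction itself can depend on the following word.
console.log(formatList("es", ["padres", "hijos"], "conjunction"));
```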

This functionality works or is shipped on Linux, macOS, iOS, tvOS and watchOS. The watchOS support is probably the important thing to highlight because it is the most resource restrictive environment to support. I'm just stating that this functionality can live in resource constrained environments where grammatical correctness of a message is important.

@zbraniecki
Member

zbraniecki commented Jan 13, 2020

Can we keep the language used to retrieve the UI strings separate from the language/locale used to format variables/placeholders within a string?
This would be consistent with how some OSs and some string formatting libs already separate UI language from locale formats.

While we definitely have a very vocal community of Firefox users who want a translation language different from their locale formats, this has also been a trap for regular users because date/time formats often contain translations.

For example, Chinese 2020年1月13日 星期一 下午12:03:10 or 星期一 下午12時 (for { weekday: "long", hour: "numeric" }) would be very confusing if placed in a sentence in a different locale.

There are even extreme cases. If the user had a German translation with a date formatted in en-US, there's a chance of flipping the MM/DD and DD/MM order. If the sentence is in German, the user has every right to interpret "05/08" using the German "DD/MM" pattern, and to be very surprised to learn later that it was actually en-US "MM/DD" taken from their OS locale formatting preferences.
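
A quick way to see the ambiguity is to format the same date with both locales using `Intl.DateTimeFormat`. The snippet is only an illustration of the point above; the date and option choices are arbitrary.

```javascript
// 8 May 2020, pinned to UTC so the output does not depend on the host time zone.
const sampleDate = new Date(Date.UTC(2020, 4, 8));
const dayMonth = { month: "2-digit", day: "2-digit", timeZone: "UTC" };

const inEnUS = new Intl.DateTimeFormat("en-US", dayMonth).format(sampleDate);
const inDe = new Intl.DateTimeFormat("de", dayMonth).format(sampleDate);

// Same day, but month-first in en-US and day-first in German.
console.log(inEnUS, "vs", inDe);
```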

My initial position is that, by default, we should format placeables (numbers, dates, etc.) using the same locale as the translation, and allow the developer to provide an alternative language negotiation for formatters to handle exceptions like the one you mentioned.

This is also important once we start talking about the error handling UX. Fluent has been designed to fall back using a locale chain, so if there's an error or a missing string in the primary language, we fall back to the second-best choice rather than display an error and break the app.
It's an important resilience measure for us.
What's interesting is that that means that the locale chain used for formatters is per-bundle so that in the locale context ["fr-CA", "fr", "en"] we first try to localize a message in fr-CA using fr-CA formatters, but if there are errors and we end up localizing the message using en resources, we'll format the date/times using en locale.
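
A simplified sketch of that per-bundle behaviour, assuming a hypothetical `formatMessage` helper and a toy bundle shape (this is not the Fluent API). Each bundle formats numbers with its own locale unless the developer passes an explicit `formatterLocale`, which stands in for the "alternative language negotiation" mentioned earlier.

```javascript
// Each bundle pairs one locale with its messages. A message is a function that
// receives the arguments and a number formatter bound to a locale.
const bundles = [
  { locale: "fr-CA", messages: new Map() },
  { locale: "fr", messages: new Map([
    ["total", (args, nf) => `Total : ${nf.format(args.amount)}`],
  ]) },
  { locale: "en", messages: new Map([
    ["total", (args, nf) => `Total: ${nf.format(args.amount)}`],
    ["hello", () => "Hello"],
  ]) },
];

// Walk the chain; the first bundle that has the message also supplies the
// formatter locale, unless the caller overrides it.
function formatMessage(chain, id, args = {}, formatterLocale) {
  for (const bundle of chain) {
    const message = bundle.messages.get(id);
    if (!message) continue;
    const nf = new Intl.NumberFormat(formatterLocale ?? bundle.locale);
    return message(args, nf);
  }
  return id; // last resort: show the identifier instead of breaking the app
}

console.log(formatMessage(bundles, "total", { amount: 1234.5 })); // French text, French number
console.log(formatMessage(bundles, "hello"));                      // falls back to en
```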

@zbraniecki
Member

@grhoten - this is awesome! Thank you for sharing!

We have some experience with TTS in form of Common Voice project which uses Fluent.

While I don't see it in the translation resources they use now, I remember that in some variant of the project they used fluent's compound messages to represent the spoken/written difference:

time-is =
    .written = { $time }
    .spoken = The time is { $time }

It was an unexpected use of the compound messages, but brought up the idea that having message variants that are recognized as a single unit (with comments, invalidation rules, fallbacking together etc.) is important.
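
One way to picture "variants recognized as a single unit" in a data model, as a sketch only: the object shape and `formatVariant` are hypothetical and not how Fluent stores compound messages.

```javascript
// One message entry holds all of its variants plus shared metadata,
// so tooling can comment on, invalidate, or fall back on them together.
const catalog = {
  "time-is": {
    comment: "Shown on screen and read aloud.",
    variants: {
      written: ({ time }) => `${time}`,
      spoken: ({ time }) => `The time is ${time}`,
    },
  },
};

function formatVariant(entries, id, variant, args) {
  const entry = entries[id];
  if (!entry || !entry.variants[variant]) {
    throw new Error(`Unknown message or variant: ${id}.${variant}`);
  }
  return entry.variants[variant](args);
}

console.log(formatVariant(catalog, "time-is", "spoken", { time: "12:03" }));
```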

@mihnita
Collaborator

mihnita commented Jan 13, 2020 via email

@mihnita
Collaborator

mihnita commented Jan 13, 2020

About extended plurals, like:

{ count , plural ,
   =0 {No candy left}
  one {Got # candy left}
  <10 {Got a few candies left}
  10-20 {Got a handful candies left}
other {Got # candies left} }

I am quite reluctant about it.
There is something similar in Java (ChoiceFormat)
Example: "-1#is negative| 0#is zero or fraction | 1#is one |1.0<is 1+ |2#is two |2<is more than 2."

And it was a huge problem for proper localization.
It was banned in most places I've been.

@grhoten
Member

grhoten commented Jan 13, 2020

"You owe {someNumber, number, currency}." - then the actual currency is inferred from the current locale - which is just nasty.

@MickMonaghan I agree. Actually, currency formatting that I've been involved with disallows this scenario. Currency formatting is a measured unit and not a number. The unit has to be explicitly defined outside of the current message.

@zbraniecki
Member

I am quite reluctant about it.

I agree with @mihnita. The Mozilla L10n Drivers reject such translations. Our reasoning is that this is not a plural-based variant of one string but a set of separate strings, and the choice between them should depend on a selector in the application logic rather than on a localizer building a selection like the one in the example.
We documented that recommendation in https://github.com/projectfluent/fluent/wiki/Good-Practices-for-Developers#prefer-separate-messages-over-variants-for-ui-logic
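
Applied to the candy example, that recommendation could look roughly like this. The message ids, the thresholds and `candyMessageId` are made up for illustration; the only plural logic left inside a message is the real linguistic one, handled by `Intl.PluralRules`.

```javascript
const enPlural = new Intl.PluralRules("en");

// Separate messages; only "candy-count" has true plural variants.
const candyMessages = {
  "candy-none": () => "No candy left",
  "candy-few": () => "Got a few candies left",
  "candy-handful": () => "Got a handful of candies left",
  "candy-count": ({ count }) =>
    enPlural.select(count) === "one"
      ? `Got ${count} candy left`
      : `Got ${count} candies left`,
};

// UI logic, owned by the developer, decides which message applies.
function candyMessageId(count) {
  if (count === 0) return "candy-none";
  if (count > 1 && count < 10) return "candy-few";
  if (count >= 10 && count <= 20) return "candy-handful";
  return "candy-count";
}

function candyText(count) {
  return candyMessages[candyMessageId(count)]({ count });
}

console.log([0, 1, 5, 15, 42].map(candyText));
```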

@echeran
Collaborator

echeran commented Jan 23, 2020

@nbouvrette - is there a place where we could continue this thread without taking over the requirements one?

Please open new issue. This new thread can work as knowledge share about MF and related topics.

I opened up issue #15 accordingly to discuss HTML.

(In response to the question about whether Google Docs would be easier, I think the answer is the same as to the question of whether we should have a chat group -- we discussed a few months ago to keep discussions in Github so that they're public and searchable and no extra logins, based on people's past experiences.)

@nbouvrette
Collaborator

@mihnita

if the XLIFF 1.2 (final spec in Feb 2008) is not properly supported, how long will it take for CAT tools to support it?

It's been 12 years already... I think it's safe to say it will never be fully supported? :) And it's not just CAT tools, it's also TMSes. There are dozens of both these products on the market and some of the top players are not known to move very quickly.

And this is also an argument for developing the format while considering at all times how that will interact with existing CAT tools. That includes not only how things are presented to the translators, but how leveraging works (or not).

Getting broad support for a new format can be quite challenging (XLIFF is a good example). I still believe it would be a lot simpler if we can find a way to remain format-agnostic.

Another advantage of staying format-agnostic is that most TMSes support multi-level filters when parsing strings, which means you could have an HTML document with MessageFormat strings inside and both could be parsed and presented correctly to linguists. This could also work the other way around.

Most CAT / TMs assume a 1:1 model "you give me a source message, I give you back a translated message". When the input is 2 messages (singular / plural) and the output is 4 messages (for example because Russian has 4 plural forms), then we run into problems.

Exactly, this is the biggest challenge - most linguistic tools expect symmetric keys in both the input and output and one input can have multiple outputs in multiple languages that have different rules. This is also why MessageFormat works well, regardless of the file format.
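
The asymmetry is easy to inspect with `Intl.PluralRules`, which exposes the CLDR plural categories per locale. The `expandPlural` helper and its `translate` callback are hypothetical; they only show that the number of target entries is driven by the target locale, not by the source.

```javascript
// Plural categories a locale needs, per the CLDR data behind Intl.PluralRules.
function pluralCategories(locale) {
  return new Intl.PluralRules(locale).resolvedOptions().pluralCategories;
}

// Build the set of target entries a tool must collect for one plural message.
// `translate` is a hypothetical callback returning the text for one category.
function expandPlural(locale, translate) {
  return Object.fromEntries(pluralCategories(locale).map((c) => [c, translate(c)]));
}

console.log(pluralCategories("en")); // two categories: one, other
console.log(pluralCategories("ru")); // four categories: one, few, many, other
console.log(expandPlural("ru", (category) => `<translation for ${category}>`));
```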

Do you think this would be easier in a document? (like Google docs?) It would keep the different "threads" together.

I tried Google docs to have conversations in the past and so far Git seems better - I would still love to propose having our own Slack at some point if we start having more active conversations but Git is also good at keeping everything documented. I just tagged you in this new thread when you have a chance!

The current MessageFormat syntax is a bit of a weird one. It ignores whitespaces in the syntax outside the messages, but it preserves the whitespaces in the messages.

Is there a reason for this? I wrote a parser that preserves both whitespaces. I used this both for syntax highlighting and also auto-completion/validation & error detection. It's a lot easier to be able to refer to a character position without changing the input for example.
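
For illustration, a tokenizer along those lines can be quite small. This is a sketch and not the parser mentioned above: `tokenize` is a hypothetical name, and apostrophe escaping is deliberately ignored. Whitespace becomes its own token, so offsets always refer to the untouched input.

```javascript
// Split a MessageFormat-like string into tokens that keep their offsets.
// Whitespace is kept as its own token, so joining the token texts
// reproduces the input exactly. Apostrophe escaping is not handled here.
function tokenize(input) {
  const tokens = [];
  const pattern = /\s+|[{},#]|[^\s{},#]+/g;
  for (const match of input.matchAll(pattern)) {
    const text = match[0];
    let type = "text";
    if (/^\s/.test(text)) type = "whitespace";
    else if (text.length === 1 && "{},#".includes(text)) type = "punctuation";
    tokens.push({ type, text, start: match.index, end: match.index + text.length });
  }
  return tokens;
}

const sample = "{ count , plural ,\n  one {Got # candy left} }";
const sampleTokens = tokenize(sample);
console.log(sampleTokens.slice(0, 4));
```

Because every character belongs to exactly one token, an editor can map an error or a completion request back to a character position without re-normalizing the message.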

Inline comments / extendable inline

I have had mixed results with this. Some translators translate the comments too, especially for first-timers, and they don't realize that the final message recipient won't see them, which wastes translation time. There are other times when there is information that is best conveyed inline. Sometimes the comments get in the way of readability. I can see the pros and cons of such functionality.

@grhoten

+1 on your comment - there are other ways to provide comments (typically called context) to linguists, which most TMSes handle correctly today. If we need inline context, something in the syntax may be too complex.

This was referenced Jan 30, 2020
@romulocintra romulocintra removed the requirements Issues related with MF requirements list label Feb 18, 2020
@mihnita mihnita added requirements Issues related with MF requirements list and removed Stale labels Sep 24, 2020
@aphillips aphillips added Stale resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. labels Jun 28, 2023
@aphillips
Member

Closing resolve-candidates per discussion in 2023-07-24 call
