Add support for language variations like zh-hk #499

Open
arjan opened this Issue Jan 13, 2013 · 11 comments

4 participants

@arjan
Zotonic member
arjan commented Jan 13, 2013

It would be good to have support for more specific locales like zh-hk but also en-us, en-gb, etc.

Let's use this ticket to discuss how to do that and what it means for the .po files, the fallback mechanism (?), etc.

@mworrell mworrell was assigned Jan 13, 2013
@mworrell mworrell added the enhancement label Feb 7, 2014
@ArthurClemens

I have the need as well. For instance, I need to differentiate between en-GB and en-IE, or fr-BE and fr-FR. Note that the codes are not always 4 chars; Traditional Chinese spoken in Taiwan is zh-Hant-TW, while Simplified Chinese in China is zh-Hans.

These codes are intended for use when authoring HTML; see http://www.w3.org/International/articles/language-tags/

But, quoting:

The golden rule when creating language tags is to keep the tag as short as possible. Avoid region, script or other subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless there is a particular reason that you need to say that this is Japanese as spoken in Japan, rather than elsewhere.

So four characters is not a rule: 2-character and 4-character codes will coexist, along with 3-character subtags such as ar-afb.

Ideally the author can choose the language tag and subtags best suited to the context.

Issue at hand: using the dollar syntax for translation fields only works for 2-char language names, not for custom codes (as above) in code like textarea name="tl$en_GB" or textarea name="tl$master" (yes).

Gettext should be able to handle this. From http://www.gnu.org/software/libc/manual/html_node/Using-gettextized-software.html:

what happens if [...] the required value is de_DE.ISO-8859-1? We already mentioned above that a situation like this is not infrequent. E.g., a person might prefer reading a dialect and if this is not available fall back on the standard language.
The gettext functions know about situations like this and can handle them gracefully. The functions recognize the format of the value of the environment variable. It can split the value is different pieces and by leaving out the only or the other part it can construct new values. This happens of course in a predictable way. To understand this one must know the format of the environment variable value. There is one more or less standardized form, originally from the X/Open specification:
language[_territory[.codeset]][@modifier]
Less specific locale names will be stripped of in the order of the following list:
codeset
normalized codeset
territory
modifier

[...]
Even this extended functionality still does not help to solve the problem that completely different names can be used to denote the same locale (e.g., de and german). To be of help in this situation the locale implementation and also the gettext functions know about aliases.
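
As a rough illustration of the X/Open format quoted above, here is a minimal Erlang sketch that splits a locale name into its parts. The module and function names (locale_name, parse/1) are hypothetical and not part of Zotonic or gettext; the sketch only splits the name and does not reproduce gettext's own stripping or alias logic.

```erlang
-module(locale_name).
-export([parse/1]).

%% Split language[_territory[.codeset]][@modifier] into its parts.
%% Missing parts are returned as undefined.
-spec parse(binary()) -> map().
parse(Locale) ->
    {Rest1, Modifier} = split_at(Locale, <<"@">>),
    {Rest2, Codeset} = split_at(Rest1, <<".">>),
    {Language, Territory} = split_at(Rest2, <<"_">>),
    #{language => Language,
      territory => Territory,
      codeset => Codeset,
      modifier => Modifier}.

%% Split on the first occurrence of Sep only.
split_at(Bin, Sep) ->
    case binary:split(Bin, Sep) of
        [Head, Tail] -> {Head, Tail};
        [Head] -> {Head, undefined}
    end.
```

With the parts separated, a caller could try the full name first and then drop parts one at a time until a translation is found.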

@mworrell
Zotonic member

The reason why I didn't add this yet is:

Determining alternative languages if a certain language is not available

This must be done two ways. For example if the current language is 'zh-HK' then selecting 'zh' is a good second best.

Conversely, if the language is 'de', then selecting 'de-CH' is still better than selecting 'en', but not as good as selecting 'de-DE'.

As you can see, this can get quite complicated quickly.

We need to:

  • find a good mechanism for mapping these arbitrary languages to configured languages
  • decide what to do with unknown languages in resources, how they should map
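
To make the two-way matching above concrete, here is a minimal sketch, assuming language tags are binaries and that the order of the configured list expresses preference (so de-DE is listed before de-CH). The module lang_match and its functions are hypothetical, not existing Zotonic code, and only the primary subtag is compared, so zh-Hant-TW would not be matched to zh-Hant.

```erlang
-module(lang_match).
-export([best_match/2]).

%% Pick the configured language that best matches the requested tag:
%% 1. exact match, 2. the parent language (zh-hk -> zh),
%% 3. the first configured variation of the same language (de -> de-DE).
-spec best_match(binary(), [binary()]) -> binary() | undefined.
best_match(Requested, Available) ->
    Req = normalize(Requested),
    Norm = [{normalize(A), A} || A <- Available],
    case lists:keyfind(Req, 1, Norm) of
        {_, Exact} ->
            Exact;
        false ->
            Primary = primary(Req),
            case lists:keyfind(Primary, 1, Norm) of
                {_, Parent} ->
                    Parent;
                false ->
                    case [A || {N, A} <- Norm, primary(N) =:= Primary] of
                        [Variation | _] -> Variation;
                        [] -> undefined
                    end
            end
    end.

%% Treat en_GB and en-gb as the same tag.
normalize(Tag) ->
    string:lowercase(binary:replace(Tag, <<"_">>, <<"-">>, [global])).

%% The primary language subtag is everything before the first hyphen.
primary(Tag) ->
    hd(binary:split(Tag, <<"-">>)).
```

Returning undefined leaves the final choice (for example the system default) to the caller. A production version would probably precompute the normalized list once instead of on every lookup.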

Runtime efficiency

And then decide how to make this efficient (think of how to organize lookup tables, mapping between atoms and binaries, limited atom space, etc.)

To make this doable we might need to change all {trans, […]} records to start using a binary for the language selection. We need to determine the impact of this.

An idea might be to change the translations to 3-tuples: {en_GB, en, <<"Hello">>}

User preferred fallbacks

The selected language in the #context{} could still be a single atom, but with an extra list of alternatives, in the order provided by the user-agent.

Of course, adapting dynamically to the user-agent preferences like this wreaks havoc on any caching scheme….

@ArthurClemens

In which scenario do you need to find the fallback language? I can think of reducing the number of translatable strings (for instance, en_GB would otherwise need to copy all en strings). But otherwise it is up to the user to provide those strings in the sub-language.

If the fallback language is important, a possibility would be to offer the languages as a tree like the categories, and deduce from the position in the tree ("has parent") where to fetch missing strings.
In my app I use a predicate "Inherits" on Language to map this relationship. But speed has not been my main concern.

@mworrell
Zotonic member

We need it when showing texts with translations. They might be from the .po files or from resources. This is done a lot, so it needs to be efficient.

To cause the least surprise it is important to show sensible texts. For example, if a resource has an en_GB title and a nl title, and the selected language is en, then any user would expect the en_GB title to be shown, regardless of whether nl is the default system language.

Don't forget that most modules, resources etc are not translated in all the available languages. Mostly only a smaller subset is available.

@ArthurClemens

Let's take a different example than English, let's say Portuguese:

  • No inheritance: there is no relationship between pt_BR, pt_PT or pt. This will potentially lead to more fallbacks to English - assuming that en is always the final fallback. And more work on the translator's behalf.
  • With inheritance: pt_BR and pt_PT both inherit from pt, or pt_BR inherits from pt_PT. Not necessarily the other way around (showing pt_BR if pt is missing will lead to more surprises). But perhaps this can even be configured by the user.

Googling this shows that the concept of fallback language is quite common for CMS-es (Django, Drupal, EpiServer).

http://www.unicode.org/reports/tr35/#Likely_Subtags describes a mechanism to deal with this ("subtags"), but the approach is rather theoretical and not very practical. If we leave out locale matching using pattern matching, and limit lookup to the locale-parent chain only, we can reduce the complexity a lot.

And in fact a mechanism like this already exists.

z_trans:lookup_fallback finds the translation string, and when nothing is found, calls default_language to find the config value i18n.language. This mechanism could be extended/changed with a fallback lookup list. Adding Lang as parameter to default_language/2 in line 132:

case default_language(Lang, Context) of

And then as non-optimized code (because default_language could be replaced with fallback_language):

%% One-argument form delegates to the two-argument form with en.
default_language(Context) ->
    default_language(en, Context).

%% The fallback now depends on the language that was looked up.
default_language(Lang, _Context) ->
    fallback_language(Lang).

%% Example fallback table; anything not listed falls back to en.
fallback_language(de_AT) -> de;
fallback_language(at) -> de;
fallback_language(be) -> nl;
fallback_language(_Lang) -> en.

The lookup can either be erlang code or a config. The lookup using Erlang code as above is similar in speed to the current code.
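
For comparison, the config variant could look roughly like the sketch below. It assumes the fallbacks are available as a map; how that map would be stored or loaded in Zotonic is not decided in this thread, and the module name lang_fallback is made up for the example.

```erlang
-module(lang_fallback).
-export([fallback_language/2]).

%% Fallbacks is a map such as #{de_AT => de}, assumed to come from config.
%% Languages without an entry fall back to en, as in the code above.
-spec fallback_language(atom(), #{atom() => atom()}) -> atom().
fallback_language(Lang, Fallbacks) ->
    maps:get(Lang, Fallbacks, en).
```

Usage: lang_fallback:fallback_language(de_AT, #{de_AT => de, be => nl}) returns de, and any language missing from the map returns en.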

@ArthurClemens ArthurClemens added a commit that referenced this issue Aug 30, 2015
@ArthurClemens ArthurClemens Fallback language 7873589
@ArthurClemens

This branch ("ac-fallback-language") contains a fairly efficient lookup of a language fallback. It is called when a translation cannot be found for the current language. Zotonic currently only considers en, as a hardcoded fallback. The code changes make it possible to define a specific fallback per language, which is one step towards region-specific language settings like "fr-be" and "zh-tw".

The major code change is that #context.language now contains a list of language code atoms, instead of a single language code. The first list item is the selected language. To keep other code working with as few changes as possible, z_context:language/1 still returns the selected language.
Other code that read Context#context.language directly now calls z_context:language/1 instead.
z_context:set_language/2 accepts an atom and puts it in the list, but it will also take a list. That list will be set by mod_translation when a language has defined a fallback language.

Translation lookup takes one more lookup in case no translation is found. Fallback lookup is not recursive: only the fallback of the current language is explored. This also removes the need to program against infinite loops.
Currently only 1 fallback is considered, but it is built with a list of fallbacks in mind.
The ultimate fallback is en and that functions as it does today.
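
The lookup order described above can be sketched as follows. This is not the code from the branch: the module trans_lookup is hypothetical, and the {trans, [{Lang, Text}]} shape is assumed from the {trans, […]} records mentioned earlier in this thread.

```erlang
-module(trans_lookup).
-export([lookup/2]).

%% Languages is [Selected | Fallbacks]; en is always tried last.
%% The list is walked once, so fallbacks of fallbacks are not followed.
-spec lookup({trans, [{atom(), binary()}]}, [atom()]) -> binary() | undefined.
lookup({trans, Tr}, Languages) ->
    first_found(Languages ++ [en], Tr).

first_found([], _Tr) ->
    undefined;
first_found([Lang | Rest], Tr) ->
    case lists:keyfind(Lang, 1, Tr) of
        {Lang, Text} -> Text;
        false -> first_found(Rest, Tr)
    end.
```

Because the list is flat and finite, there is no way to loop, which matches the remark about not having to guard against infinite recursion.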

It looks like there is an obsolete function z_trans.erl:lookup_fallback_language that is called by filter_language.

@mworrell
Zotonic member

Nice work! And we shouldn't be poking directly into the Context#context.language anyway.

👍 for merging

@ArthurClemens ArthurClemens added a commit that referenced this issue Aug 31, 2015
@ArthurClemens ArthurClemens Fallback language 08b2745
@ArthurClemens ArthurClemens added a commit that referenced this issue Aug 31, 2015
@ArthurClemens ArthurClemens Fallback language 8df82fb
@ArthurClemens

Things left to do on this issue:

  1. Add more languages to i18n/iso639.erl, possibly use https://github.com/bdswiss/country-language/blob/master/data.json or http://www.localeplanet.com/icu/iso639.html
  2. Change i18n/z_trans:is_language to relax the check on 2-letter codes.
  3. It seems that the url can only have a 2-letter language code.
  4. Check .po generation
  5. Check filters/scomps
@mworrell
Zotonic member
mworrell commented Sep 1, 2015

Ad 3: that is handled by mod_i18n, it can be easily changed to also accept xy-... prefixes, where xy should be a valid ISO code and we add a max length to the language variation.
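
A minimal sketch of such a prefix check, under stated assumptions: the maximum length of 12 is an arbitrary choice, the module lang_prefix is hypothetical rather than existing mod_i18n code, and the check that xy is a known ISO code is left out because it needs the iso639 table.

```erlang
-module(lang_prefix).
-export([parse/1]).

-define(MAX_VARIATION, 12). % assumed limit, not taken from mod_i18n

%% Accept "xy" or "xy-variation"; return the two parts or error.
-spec parse(binary()) -> {ok, binary(), binary()} | error.
parse(<<A, B>>) when A >= $a, A =< $z, B >= $a, B =< $z ->
    {ok, <<A, B>>, <<>>};
parse(<<A, B, $-, Var/binary>>)
  when A >= $a, A =< $z, B >= $a, B =< $z,
       byte_size(Var) >= 1, byte_size(Var) =< ?MAX_VARIATION ->
    case is_variation(Var) of
        true -> {ok, <<A, B>>, Var};
        false -> error
    end;
parse(_) ->
    error.

%% Only ASCII letters, digits and hyphens are allowed in the variation.
is_variation(Var) ->
    lists:all(fun(C) ->
                      (C >= $a andalso C =< $z)
                          orelse (C >= $A andalso C =< $Z)
                          orelse (C >= $0 andalso C =< $9)
                          orelse C =:= $-
              end,
              binary_to_list(Var)).
```

Note that this sketch rejects one-letter prefixes, so a tag like x-klingon would need an extra clause.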

Do we also want to be able to handle languages like x-klingon ?

@ArthurClemens

Description of language tags (Wikipedia).

From the specification, RFC 5646: "Examples of Language Tags (Informative)"

Simple language subtag:

  de (German)

  fr (French)

  ja (Japanese)

  i-enochian (example of a grandfathered tag)

Language subtag plus Script subtag:

  zh-Hant (Chinese written using the Traditional Chinese script)

  zh-Hans (Chinese written using the Simplified Chinese script)

  sr-Cyrl (Serbian written using the Cyrillic script)

  sr-Latn (Serbian written using the Latin script)

Extended language subtags and their primary language subtag
counterparts:

  zh-cmn-Hans-CN (Chinese, Mandarin, Simplified script, as used in
  China)

  cmn-Hans-CN (Mandarin Chinese, Simplified script, as used in
  China)

  zh-yue-HK (Chinese, Cantonese, as used in Hong Kong SAR)

  yue-HK (Cantonese Chinese, as used in Hong Kong SAR)

Language-Script-Region:

  zh-Hans-CN (Chinese written using the Simplified script as used in
  mainland China)

  sr-Latn-RS (Serbian written using the Latin script as used in
  Serbia)

The authoritative list is http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

@ddeboer ddeboer modified the milestone: Release 1.0, Roadmap Jan 12, 2016
@ddeboer ddeboer referenced this issue Jun 22, 2016
Open

Zotonic 1.0 Hackday #1319

0 of 7 tasks complete