This document is a compilation of internationalization research I've done with the goal to localize a (mostly) client-side Javascript app.
i18n (internationalization) - The process by which software is made language and locale neutral. i18n stands for 18 characters in the word internationalization.
l10n (localization) - The required processes to localize software into a specific locale (includes translations, rules about numbers, currencies, dates, and more).
Besides language translation, there's other localization requirements you might need to consider.
- Dates — Date formats change across cultures. For example, 10/4/15 means October 4th in the US, and April 10th in the UK.
- Times — Different locales require either a 24-hour clock or 12-hour clock. Also, some locales use different notations, like 5h10 in French.
- Formatting of numbers — Different locales use different digits to represent numbers. So, 3,025.23 in English would be 3.025,23 in Greman, and 3 025,23 in French.
- Images — If you have images with text, you need to make sure to provide versions for each locale.
- UI Spacing — You need to provide enough space in the UI to handle expanded lengths of words. IBM has provided design guidelines that specify an additional 200% space for short words; The W3 provides an example of a translation for Flickr requiring an additional 300% in Italian.
- Text Sorting — Text sorting can vary by language. For instance, German has two types of sort order, phonebook and dictionary, which determine whether to sort by sounds (umlauted vowels become to character pairs: ä → ae, ö → oe, ü → ue.) or by character order.
- Punctuation — Different languages use different punctuation. For instance, double quotes in English (" ") are represented as guillemets in French (« »).
- Keyboard shortcuts — If you have hotkeys that map to English words, these should be updated with a mapping for each locale.
- Outbound links — External links to documentation will need to take language into account.
- Accessibility — If the software offers accessibility options, those will need to take the locales into consideration.
There's a few more that I don't need to handle for this particular project:
- Currencies
- Addresses — including zipcodes
- Phone numbers
- Validation — Luckily, we have no data input fields that would require locale-specific validation (like number inputs or date/time inputs).
When translating strings, there's no consensus on how to specify keys used for the strings. Three main strategies are used: English strings, descriptive keys, and object keys.
Using a string as a key might look like this:
// returns "Welcome" in English,
// and "Willkommen" in German
_("Welcome")
Which would return the given string in English, and the translation in another language.
This is the format that both gettext
(Linux's i18n implementation) and genstrings
(Apple's i18n tool) use. Using English strings have a number of clear benefits:
- It directly shows the meaning of the text.
- If a translation is unavailable, it's possible to fall back to the given string.
- Given that this strategy is the closest thing to an i18n key standard, if browsers were to implement an i18n translation standard in the future, this will probably be the strategy.
However, there are drawbacks:
- Using English as the default can lead to conflicting keys in other languages. For instance, an English word "Email" might require two different texts in French, "E-Mail" or "Envoyer un e-mail".
- If an English translation changes, every translation file's keys needs to be updated (though, presumbly, if an English translation changes, every other language will have to change as well).
- For longer texts, specifying English strings keys could become verbose, particularly for paragraph-length text.
Using descriptive keys for each translation is another oft-used solution that would look like this:
// returns "Welcome" in English,
// and "Willkommen" in German
_("WELCOME_MESSAGE")
This method solves many of the drawbacks of using strings (handling homonyms, changing English translations without updating all language files, and less verbose).
However, using abstract keys comes with its own set of disadvantages:
- It's not immediately clear what a given string means, so metadata or comments dictating the purpose and placement of strings would be necessary.
- No fallback is possible if a translation doesn't exist.
- Collisions in key names could happen much more frequently.
Finally, we could use objects which is effectively a more advanced key structures:
// returns "Welcome" in English,
// and "Willkommen" in German
_("messages.welcome")
The main benefit here would be to keep things better organized by namespacing various texts. This is the strategy that Rails uses for their i18n tool. It would also avoid the issue of collisions highlighted above with a flat JSON object.
In many languages, word order can change. Therefore, it's important that translations maintain the ability to change word order in sentences. This means string concatenations should be avoided:
// Bad!
_("File moved to ") + folder_name + _("a few minutes ago")
// Good!
_("File moved to % a few minutes ago", %s)
Most libraries accept arguments in order or as named parameters which allows for flexible input. This allows the developer to decide which method of input to use, dependinding on whether verbosity or clarity is desired.
// Passing arguments in order
_('The first letters in the alphabet: %s %s %s', 'a', 'b', 'c')
// Passing named arguments
_('Welcome to %(version), %(user_name)', { version: 'Awesome Software', user_name: 'admin' })
Another problem between different locales concerns plurals.
Different languages have different rules for plurals. Polish, for instance, has four plural forms:
A plural rule defines a plural form using a formula that includes a counter. A counter is the number of items you’re trying to pluralize. Say we’re working with “2 rabbits.” The number before the word “rabbits” is the counter. In this case, it has the value 2. Now, if we take the English language as an example, it has two plural forms: singular and plural. Therefore, our rules look like this:
- If the counter has the integer value of 1, use the singular: “rabbit.”
- If the counter has a value that is not equal to 1, use the plural: “rabbits.”
However, the same isn’t true in Polish, where the same word—“rabbit,” or “królik”—can take more than two forms:
- If the counter has the integer value of 1, use “królik.”
- If the counter has a value that ends in 2–4, excluding 12–14, use “królika.”
- If the counter is not 1 and has a value that ends in either 0 or 1, or the counter ends in 5–9, or the counter ends in 12–14, use “królików.”
- If the counter has any other value than the above, use “króliki.”
To solve this, ICU's MessageFormat provides a standard for formatting plurals in strings across languages. (It also specifies how to handle genders in different languages.)
A message with plurals might look like this:
'There {files, plural, one{is # file} other{are # files}}';
The message above shows plurals inlined in the message. Another strategy is to define each plural message separately:
{
FILE_MESSAGE: {
one: "There is # file",
other: "There are # files"
}
}
Defining plurals using the latter strategy can become very verbose, especially for longer text strings. Inlining plurals as shown in the first example is much more concise.
Many libraries organize translation files by language with an optional capitilized country code:
- de.json
- en-US.json
- fr-CA.json
- fr.json
If we decide to split up translation files by application section, languages would be the logical folder structure:
- de/
- constants.json
- files.json
- users.json
- en-US/
- constants.json
- files.json
- users.json
This would be useful if we wanted to abstract out constants for each set of translations in order to share common words or phrases.
The most obvious file format is JSON.
However, gettext
, Linux's standard translation library, uses a format called PO files which is also worth considering.
The format of a PO file is:
white-space
# translator-comments
#. extracted-comments
#: reference…
#, flag…
#| msgid previous-untranslated-string
msgid untranslated-string
msgstr translated-string
There is wide support across languages for reading PO files, including Node, Python, and Java.
A major argument in favor of using PO files is that it would allow us to take advantage of the ecosystem that's been built up around the PO format. For instance, POEdit is a popular translating tool; numerous online web services offer PO support; and any experienced translator will certainly be familiar with the PO format.
Additionally, if you're sharing translation files with a particular non-Javascript backend, PO might be a more appropriate file format than JSON.
When it comes to debugging translation integration, a number of recommendations are worth highlighting.
A common recommendation is to set up an "pseudo language" for testing. This will highlight all translated strings in the application and allow developers to quickly locate any untranslated strings.
One solution is to simply pad English strings, making it easy to see missing strings:
"里îßEnter your name:里îß"
Another option is to replace English with a repeating character:
“XXXXX”
My favorite is replacing English characters with similar, yet distinctly foreign, replacements:
Ŝęľęčŧ äʼn äččőūʼnŧ þęľőŵ ŧő vįęŵ őř đőŵʼnľőäđ yőūř äväįľäþľę őʼnľįʼnę şŧäŧęmęʼnŧş.
HTML elements containing translations should be tagged with the particular key of the translation in development mode. This will make it easy to identify a particular translation if it acts up.
We should provide an easy way to see what translations are missing across all translation files.
I evaluated a number of i18n libraries. Here are my thoughts on each.
###FormatJS License: Yahoo BSD | Code Sample | Live Example
FormatJS is a collection of i18n tools, including integrations for templating engines such as Handlebars.
Handlebars helpers automatically translate based on the locale:
var template = Handlebars.compile(
'{{formatNumber price style="currency" currency="USD"}}'
);
template({ price: 1000 }, {
data: {intl: { locales: 'en-US' }}
});
> $1,000
// switching to French
var result = template({ price: 1000 }, {
data: {intl: {locales: 'fr' }}
});
> 1 000 $US
Number and Date/Time localizations are included as well.
FormatJS also handles plurals according to the ICU MessageFormat standard:
var intlData = {
"locales": "en-US",
"messages": {
"photos": "{name} took {numPhotos, plural,\n =0 {no photos}\n =1 {one photo}\n other {# photos}\n} on {takenDate, date, long}.\n"
}
};
var template = Handlebars.compile(
'{{formatMessage (intlGet "messages.photos")
name=name
numPhotos=numPhotos
takenDate=takenDate}}'
);
var context = {
name: 'Annie',
numPhotos: 1000,
takenDate: Date.now()
};
template(context, {
data: {intl: intlData}
});
> Annie took 1,000 photos on July 23, 2015.
Additionally, the set of APIs appears to be modular, so we can pick and choose what we want to include.
License: MIT | Code Sample | Live Example
i18next handles string translations. It doesn't seem to work with dates or numbers. Supports the browser and Node.
locales/en/translation.json
{
"app": {
"name": "Hello"
}
}
locales/fr/translation.json
{
"app": {
"hello": "Bonjour"
}
}
script.js
t("app.name");
> Hello
i18n.setLng('fr', function(err, t) {
t("app.name");
});
> Bonjour
i18next has nice sprintf and interpolation support.
locales/en/translation.json { interpolation_ordered: 'The first 4 letters of the english alphabet are: %s, %s, %s and %s', "interpolation_named": "Hello name" }
script.js
t('interpolation_named', {name: 'Franklin'});
> Hello Franklin
t('interpolation_ordered', 'a', 'b', 'c', 'd');
> The first 4 letters of the english alphabet are: a, b, c, and d
It also handles plurals, along with the ability to define your own plural rules.
i18n.t("key", { count: 0 }); // -> zero
i18n.t("key", { count: 1 }); // -> singular
i18n.t("key", { count: 2 }); // -> two
i18n.t("key", { count: 11 }); // -> many
// sample for arabic
i18n.pluralExtensions.addRule("ar", {
"name": "Arabic",
"numbers": [ 0, 1, 2, 3, 11, 100 ],
"plurals": function(n) {
return Number(n === 0 ?
0 :
n == 1 ?
1 :
n == 2 ?
2 :
n % 100 >= 3 && n % 100 <= 10 ?
3 :
n % 100 >= 11 ?
4 :
5);
}
});
A nicely built library, very customizable, with dynamic requiring of translation files.
License: MIT | Code Sample
L10ns is spoken of highly, and purports to offer a number of nice features including robust plurals handling.
Unfortunately, the getting started documentation is missing steps, and the generated code has a syntax error so I wasn't able to fully evaluate the library.
// Provided sample code
var requireLocalizations = require('path/to/output/all.js');
var l = requireLocalizations('en-US');
l('SIGN_UP->FIRSTNAME');
> Franklin
l('SIGN_UP->LASTNAME');
> Pierce
Overall, the neat features include:
- Automatic source code traversal to find translation strings and automatically generating the translation files via running
> l10ns compile
- An online web editor for translators to use to input translations
However, the library requires some very non-standard translation file formats, and would require additional compilation steps in our build process.
License: Custom | Code Sample | Live Example
Polyglot is an i18n library built and used by Airbnb.
It supports both Node and the browser. It's agnostic to translation file format.
var phrases = {
"nav": {
"hello_name": "Hello, %{name}",
}
}
var polyglot = new Polyglot({phrases: phrases});
polyglot.t("nav.hello_name", { name: "John"});
> Hello, John
Some drawbacks:
- Does not support English strings for translation keys; i.e., provides no fallback mechanism
- Naive implementation of plurals (does not seem to handle more than two plural forms)
- Does not handle ordered parameters, must be named objects.
License: Similar to MIT | Code Sample | Live Example
This is an implementation of the ICU MessageFormat in Javascript.
This library is focused solely on handling pluralization and generalization.
{
"GENDER" : "male",
"NUM_RESULTS" : 1,
"NUM_CATEGORIES" : 2
}
> "He found 1 result in 2 categories."
{
"GENDER" : "female",
"NUM_RESULTS" : 1,
"NUM_CATEGORIES" : 2
}
> "She found 1 result in 2 categories."
{
"GENDER" : "male",
"NUM_RESULTS" : 2,
"NUM_CATEGORIES" : 1
}
> "He found 2 results in 1 category."
{
"NUM_RESULTS" : 2,
"NUM_CATEGORIES" : 2
}
> "They found 2 results in 2 categories."
This looks like a solid library.
License: Similar to MIT | Code Sample | Live Example
If we want to roll our own translation library, this is a solid implementation of sprintf.
// Ordered parameters
sprintf("The first 3 letters are: %s, %s and %s", ["a", "b", "c"]);
> The first 3 letters are: a, b and c
// Argument swapping
sprintf("%2$s %3$s a %1$s", "cracker", "Polly", "wants");
> Polly wants a cracker
// Named arguments
sprintf("Hello %(name)s", { name: 'Dolly' } );
// Hello Dolly
Polyfill License: MIT | Code Sample | Live Example
Intl is a built-in API for handling a subset of localization needs, including number and date formatting and string sorting. Two other great articles about Intl are here and here.
// Numbers
var nf = new Intl.NumberFormat('en');
nf.format(1000.23);
> 1,000.23
nf = new Intl.NumberFormat('fr');
nf.format(1000.23);
> 1 000,23
// Dates
var date = new Date(Date.UTC(2012, 11, 20, 3, 0, 0));
var df = new Intl.DateTimeFormat('en');
df.format(date);
> 12/19/2012
df = new Intl.DateTimeFormat('de');
df.format(date);
> 19.12.2012
Intl is currently supported on Chrome, Firefox, and IE11+. It is unsupported on Safari, but a polyfill exists.
License: MIT | Code Sample | Live Example
Moment.js has i18n support, including date and time localization.
moment().format("dddd, MMMM Do YYYY, h:mm:ss a");
> "Sunday, February 14th 2010, 3:25:50 pm"
moment.locale('fr');
moment().format("dddd, MMMM Do YYYY, h:mm:ss a");
> "jeudi, juillet 23 2015, 1:11:00 pm"
A nifty solution for handling punctuation in CSS is proposed by Maksim Chemerisuk. Using the lang
element on the <html>
element, in our LESS files we could do something like:
.quote {
&:lang(en) {
&&::before, &&::after {
content: '"';
}
}
&:lang(fr) {
&&::before {
content: "«";
}
&&::after {
content: '»';
}
}
}
- CLDR A standard set of locale data. Included are translations for currency, country names, patterns for formatting numbers, translations for dates, etc.