Language hint #2

Open
sindresorhus opened this issue Feb 17, 2020 · 38 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@sindresorhus
Owner

Some languages have overlapping characters. To provide the most accurate result, we could accept a language hint and prefer that language when there's a conflict. You would still be able to use multiple languages in a string, but the provided one gets priority. For example, sv-SE to prioritize the Swedish replacement.
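A minimal sketch of how such a hint could take priority, assuming a default table plus per-language overrides. The function name `transliterateWithHint`, the `locale` option, and the table contents are hypothetical and are not this module's actual API or data.

```javascript
// Hypothetical default table; the values are illustrative only.
const defaultReplacements = new Map([
  ['ä', 'ae'],
  ['ö', 'oe'],
  ['å', 'a'],
]);

// Hypothetical per-language overrides, keyed by primary language subtag.
const localeOverrides = new Map([
  ['sv', new Map([['ä', 'a'], ['ö', 'o']])],
]);

function transliterateWithHint(string, {locale} = {}) {
  // Only the primary language subtag is used, so 'sv-SE' selects 'sv'.
  const language = locale ? locale.toLowerCase().split('-')[0] : undefined;
  const overrides = localeOverrides.get(language) ?? new Map();

  return [...string.normalize('NFC')]
    .map(character =>
      // The hinted language wins; everything else falls back to the default table.
      overrides.get(character) ?? defaultReplacements.get(character) ?? character)
    .join('');
}

transliterateWithHint('Sju sjösjuka sjömän'); // 'Sju sjoesjuka sjoemaen'
transliterateWithHint('Sju sjösjuka sjömän', {locale: 'sv-SE'}); // 'Sju sjosjuka sjoman'
```

Characters without an override still go through the default table, so a string that mixes languages keeps working.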

@sindresorhus sindresorhus added enhancement New feature or request help wanted Extra attention is needed labels Feb 17, 2020
@Mottie

Mottie commented Feb 17, 2020

We started building something similar to this a few years ago - https://github.com/diacritics - but we never finished it.

We had the transliteration default to general usage, but also allowed the user to set the language and variant. For example:

const transliterate = require("diacritics-transliterator").transliterate;

const german = transliterate("¿abcñ-ß123?", "decompose", "de");
/* german => "¿abcñ-ss123?"; only the German s-sharp is replaced */

const spanish = transliterate("¿abcñ-ß123?", "base", "es");
/* spanish => "?abcn-ß123?"; only the Spanish ñ is modified */

const generic = transliterate("¿abcñ-ß123?");
/* generic => "?abcn-ß123?"; the base of ß is still ß */

const unchanged = transliterate("¿abcñ-ß123?", "base", "test");
/* unchanged => "¿abcñ-ß123?"; if no diacritics match the set language variant,
  the original string is returned */

But that might have been too aggressive, because the code owner opted to rewrite the transliterator to be much more basic.

@thorn0

thorn0 commented Feb 17, 2020

Did you consider using https://github.com/wooorm/franc?

@sindresorhus
Owner Author

It's a bit heavy to depend on here, but I should definitely recommend it for automatic detection when this issue is resolved.

@thorn0

thorn0 commented Feb 17, 2020

A custom build of franc limited only to the languages you're interested in is probably an option too.

@thorn0

thorn0 commented Feb 17, 2020

I should definitely recommend it for automatic detection when this issue is resolved

Then the solution for this issue should use the same language codes as Franc, namely ISO 639-3 (e.g. swe, not sv).

@sindresorhus
Owner Author

Yeah, or both, since the two letter version is more commonly used.
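Accepting both code lengths could be as small as a lookup that folds ISO 639-3 codes onto their ISO 639-1 equivalents. This is only a sketch: `normalizeLanguageCode` and the choice of entries are assumptions, and only languages with conflicting replacements would need to be listed.

```javascript
// Hypothetical subset of ISO 639-3 -> ISO 639-1 mappings.
const iso6393To6391 = new Map([
  ['swe', 'sv'],
  ['deu', 'de'],
  ['dan', 'da'],
  ['nor', 'no'],
]);

function normalizeLanguageCode(code) {
  const lowerCased = code.toLowerCase();
  // Three-letter codes are mapped; anything else is returned unchanged.
  return iso6393To6391.get(lowerCased) ?? lowerCased;
}

normalizeLanguageCode('swe'); // 'sv'
normalizeLanguageCode('sv'); // 'sv'
```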

@sindresorhus
Owner Author

I just realized we only need language detection for the few languages that overlap. Not all. So maybe we can make it fully automatic without any options by bundling a slimmed down version of franc.

@dotnetCarpenter

I think transliterate should be compatible with Intl.DisplayNames, since it is the nearest web standard neighbour. It uses ISO-3166 2-letter country codes.

Info about Intl.DisplayNames:

Intl.DisplayNames is a new API intended to expose translations for basic units used in other formatters.

While the scope of the proposal has been reduced for the initial revision, the second revision should bring us much awaited date and time related terms:

// Display names in English
symbolNames = new Intl.DisplayNames(['en'], {type: 'dateSymbol'});
symbolNames.of('saturday'); // => "Saturday" 
symbolNames.of('september');  // => "September"
symbolNames.of('q1'); // => "1st quarter"
symbolNames.of('pm'); // => "PM"

This API should end up being useful not just for language selectors, but also eventually for date/time pickers etc.

from https://diary.braniecki.net/2020/02/14/js-intl-in-2020/

It is, however, a little confusing that the constructor takes an array of language codes, but only one is ever used in the examples. The original proposal only uses one language code.

Original:

var obj = Intl.DisplayNames('ar', {});
obj.format('Europe/Warsaw'); // 'وارسو'

var obj = Intl.DisplayNames('pl', {});
obj.format('Europe/Warsaw'); // 'Warszawa'

Standard:

const languageNames = new Intl.DisplayNames(['zh-Hant'], { type: 'language' });
languageNames.of('fr');
// → '法文'
languageNames.of('zh');
// → '中文'
languageNames.of('de');
// → '德文'

A language hint would probably be easier to implement if it is only one language instead of a priority list. I also expect a list will quickly be useless since you will lose control of the overlaps. The API might use an array to enable a priority list down the line, without having to change the API.

@dotnetCarpenter

dotnetCarpenter commented Mar 7, 2020

After digging a little deeper I now see that the language codes can be in various formats. Reading Unicode Language Identifier is not helpful. The allowed formats are simply too wide to be worth the extra complexity. By complexity, I mean not only the implementation in transliterate but also its usage. I prefer usage of transliterate to be as familiar as possible when I take over someone else's code.

The following list of supported language codes is taken from the V8 blog post:

Intl.DisplayNames.prototype.of( code ) expects the following formats depending on the type of how the instance is constructed.

  • When type is "region", code must be either an ISO-3166 2-letter country code or a UN M49 3-digit region code.
  • When type is "language", code must conform to Unicode's language identifier grammar.
  • When type is "currency", code must be an ISO-4217 3-letter currency code.
  • When type is "script", code must be an ISO-15924 4-letter script code.

@sindresorhus I propose strictly using ISO 3166-1 alpha-2 and only expand if deemed necessary by users of transliterate, that is having a real use-case in a project.

@thorn0

thorn0 commented Mar 7, 2020

I propose strictly using ISO 3166-1 alpha-2 and only

Those are not language codes.

@dotnetCarpenter

transliterate('Sju sjösjuka sjömän') // -> Sju sjoesjuka sjoemaen
transliterate('Sju sjösjuka sjömän', ['se']) // -> Sju sjosjuka sjoman

@dotnetCarpenter

dotnetCarpenter commented Mar 7, 2020

I propose strictly using ISO 3166-1 alpha-2 and only

Those are not language codes.

What languages are missing, @thorn0? I see that SJ, which is Svalbard, is not a language code, since the 2 islands belong to Norway (though Russia has a disputed claim).

@thorn0

thorn0 commented Mar 7, 2020

Missing? These codes are country codes, they're not language codes. Do I really need to explain the difference?

@dotnetCarpenter

dotnetCarpenter commented Mar 7, 2020

@thorn0 no I get it. The purpose is to not have en-GB, en-US, en-AU, en-CA, en-NZ, en-IE, en-ZA, en-JM, en-CB, en-BZ, en-TT, en-ZW, en-PH, en-ID, en-HK, en-IN, en-MY and en-SG, if they all transliterate to the exact same letters.

The above English language codes are taken from https://github.com/libyal/libfwnt/wiki/Language-Code-identifiers

@thorn0

thorn0 commented Mar 7, 2020

en-GB, ... en-SG, if they all transliterate to the exact same letters.

That's why it's enough to check only the part before the hyphen, isn't it?
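As a sketch of that check, the standard `Intl.Locale` class exposes the primary language subtag directly. The helper name `primaryLanguage` is made up for illustration.

```javascript
// Returns the part of a BCP 47 tag before the first hyphen,
// parsed and canonicalized by the built-in Intl.Locale.
function primaryLanguage(tag) {
  return new Intl.Locale(tag).language;
}

primaryLanguage('en-GB'); // 'en'
primaryLanguage('en-SG'); // 'en'
primaryLanguage('zh-Hant-TW'); // 'zh'
```

`Intl.Locale` throws a `RangeError` for structurally invalid tags, so a caller may want to catch that and ignore the hint.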

@thorn0

thorn0 commented Mar 7, 2020

Intl.DisplayNames, since it is the nearest web standard neighbour. It uses ISO-3166 2-letter country codes.
symbolNames = new Intl.DisplayNames(['en'], {type: 'dateSymbol'});

en is not ISO 3166. It's an ISO 639-1 language code.

@dotnetCarpenter

@thorn0

en-GB, ... en-SG, if they all transliterate to the exact same letters.

That's why it's enough to check only the part before the hyphen, isn't it?

Well, checking only the part before the hyphen is an ISO-3166 2-letter country code.

I reiterate that the current web standard accepts an ISO-3166 2-letter country code.

When type is "region", code must be either an ISO-3166 2-letter country code or a UN M49 3-digit region code.

I also contest that UN M49 3-digit region code is useful for transliterate.

Since ISO-3166 2-letter country code is the same as ccTLD which all web developers use daily, it should be immediately familiar.

PL | Poland | 1974 | .pl | ISO 3166-2:PL |

@thorn0

thorn0 commented Mar 7, 2020

"en" is not a country code. The discussion gets circular. Let's delete our messages to not derail this issue completely.

@dotnetCarpenter

If Wikipedia is to be believed, then EN is unassigned, which is problematic.

ISO 639-3 contains 7,546 human languages, which seems like overkill for transliterate.
But even ISO 639-1 contains no, nn and nb (Norwegian + dialects) but has no transliterate overlap.

"en" is not a country code. The discussion gets circular. Let's delete our messages to not derail this issue completely.

I think the discussion is to implement a language hint. My proposal is to follow the web standard of Intl.DisplayNames and use ISO-3166 2-letter country code for the hint. If ISO-3166 2-letter country code is ruled out, I think this discussion is important in order to understand why. I also suspect that many others will propose following Intl.DisplayNames once it has more widespread support.

So far, the only argument (while a good argument) is that ISO-3166 2-letter country codes are not language codes.

@thorn0

thorn0 commented Mar 7, 2020

follow the web standard of Intl.DisplayNames and use ISO-3166 2-letter country code for the hint

Where exactly did you read that it uses country codes to specify languages?

@dotnetCarpenter

Please re-read #2 (comment) and #2 (comment).

TL;DR https://v8.dev/features/intl-displaynames

@thorn0

thorn0 commented Mar 7, 2020

It uses standard language tags from https://tools.ietf.org/html/rfc5646. See the specification.

@dotnetCarpenter

As I said in both comments, I think that is overkill for transliterate. ISO-3166 2-letter country code fits the bill. Unless you see a missing language that transliterate should support?

When type is "region", code must be either an ISO-3166 2-letter country code or a UN M49 3-digit region code.

@thorn0

thorn0 commented Mar 7, 2020

Right, region codes are used for specifying regions. What does it have to do with specifying languages?

@thorn0

thorn0 commented Mar 7, 2020

Are you sure you understand what Intl.DisplayNames.prototype.of is supposed to do?

@dotnetCarpenter

dotnetCarpenter commented Mar 7, 2020

Right, region codes are used for specifying regions. What does it have to do with specifying languages?

I thought I made that point abundantly clear!?

The ISO-3166 2-letter country code specifies every language (+ Chinese if the Hanyu Pinyin system is used, and allowing TW to use a separate system) that transliterate currently supports.

But I'm not married to ISO-3166 2-letter country codes. Every other system I have seen, however, is way too big. Depending on a library as huge as https://github.com/wooorm/franc would mean I will not use transliterate. But I am neither the one to make the call nor in a position to block any direction this library will take.

I feel I have said enough. And I understand that you want to use https://github.com/wooorm/franc.
We disagree - that's all. 💐 🕊️

@dotnetCarpenter

Are you sure you understand what Intl.DisplayNames.prototype.of is supposed to do?

Yes, but I do not understand your point?

@thorn0

thorn0 commented Mar 7, 2020

The point is that only language codes should be used for specifying languages.

You're saying: let's use country codes to represent languages, just like Intl.DisplayNames does. So my other point is: no, it doesn't do this. It uses language codes for languages (the constructor) and region codes for regions (the of method).
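The distinction can be seen in a short example, assuming a runtime that ships `Intl.DisplayNames` with English locale data: the constructor argument is a language tag that selects the output language, while the code passed to `of` is a region or language code depending on `type`.

```javascript
// 'en' is a language tag: it selects the language of the returned names.
const regionNames = new Intl.DisplayNames(['en'], {type: 'region'});
const languageNames = new Intl.DisplayNames(['en'], {type: 'language'});

// 'SE' is an ISO 3166 region code, valid only because type is 'region'.
regionNames.of('SE'); // 'Sweden'

// 'sv' is a language code, valid only because type is 'language'.
languageNames.of('sv'); // 'Swedish'
```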

@dotnetCarpenter

@thorn0 perhaps you have a similar feeling towards me, but I feel you are not listening at all to what I am saying.

After digging a little deeper I now see that the language codes can be in various formats. Reading Unicode Language Identifier is not helpful. The allowed formats are simply too wide to be worth the extra complexity.

Source: #2

I stand by using ISO-3166 2-letter country codes:

Since ISO-3166 2-letter country code is the same as ccTLD which all web developers use daily, it should be immediately familiar.

Source: #2 (comment)

It uses standard language tags from https://tools.ietf.org/html/rfc5646. See the specification.

Intl.DisplayNames is not in the specification. It is a stage 3 proposal.

I do regret that I didn't specify that I meant type is "region" in #2 (comment), which might have confused you.

I think the discussion is to implement a language hint. My proposal is to follow the web standard of Intl.DisplayNames and use ISO-3166 2-letter country code for the hint. If ISO-3166 2-letter country code is ruled out, I think this discussion is important in order to understand why. I also suspect that many others will propose following Intl.DisplayNames once it has more widespread support.

All this time I have been arguing for ISO-3166 2-letter country code for the hint. I did not see you arguing for ISO 639-3 at all during the discussion. After this lengthy discussion, I agree with you that "It uses language codes for languages (the constructor) and region codes for regions (the of method)". I did that before your first comment! #2 (comment) I still think that supporting ISO 639-3 with its 7,546 human languages is too heavy.

@thorn0 Please, next time you want to debate something, post complete arguments rather than building them up piece by piece. #2 (comment) was a good post from you. And please refrain from being condescending #2 (comment)

@thorn0

thorn0 commented Mar 7, 2020

Your reference to Intl.DisplayNames is irrelevant because it doesn't use country codes for languages. I'm not arguing for anything but my main point: "only language codes should be used for specifying languages". That's it.

Not sure what your "too heavy" refers to. Nobody is going to bundle the entire list of languages into this module. What for?

And another thing I don't understand, why are you thinking only about languages this module supports now? What about languages it might support in the future?

@dotnetCarpenter

dotnetCarpenter commented Mar 7, 2020

Not sure what your "too heavy" refers to. Nobody is going to bundle the entire list of languages into this module. What for?

Do you propose a subset of ISO 639-3?

The supported languages will have to be tagged somehow. I see an issue with having to tag no, nn and nb as the same because they will transliterate the same. So a user could specify nn but you will need to map nn to no etc. I am sure this goes for many of the ISO 639-3 languages.

Instead of doing that, it might be preferable to only have a subset of languages (by grouping similar languages within countries). The goal is to transliterate correctly, not to specify every language correctly.

In order to ease maintenance I propose limiting the scope of inputs. Way, way below ISO 639-3.
Preferably with a standard list, but in the case of ISO-3166 2-letter country code, en is not in the standard. So the list might have to be arbitrary or an arbitrary subset of ISO 639-3.
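One way to keep such a list small while still accepting standard codes is to let a single rule set declare every code that shares it. The sketch below is hypothetical: the `ruleSets` shape, the `replacementsFor` helper, the grouping of `da` with the Norwegian codes, and the replacement values are all assumptions for illustration.

```javascript
// Hypothetical rule sets: each one lists every language code that shares its rules.
const ruleSets = [
  {
    languages: ['no', 'nb', 'nn', 'da'],
    replacements: new Map([['æ', 'ae'], ['ø', 'oe'], ['å', 'aa']]),
  },
];

// Index the rule sets by language code so that nn and nb resolve to the same rules as no.
const ruleSetByLanguage = new Map();
for (const ruleSet of ruleSets) {
  for (const language of ruleSet.languages) {
    ruleSetByLanguage.set(language, ruleSet);
  }
}

function replacementsFor(languageCode) {
  // Returns undefined when no rule set is tagged with the given code.
  return ruleSetByLanguage.get(languageCode.toLowerCase())?.replacements;
}

replacementsFor('nn') === replacementsFor('nb'); // true
```

With this shape the user can pass whichever standard code they have, and the maintenance cost is one array entry per alias rather than a duplicated table.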

@thorn0

thorn0 commented Mar 7, 2020

Do you propose a subset of ISO 639-3?

Or ISO 639-2, or both. There is a standard and it should be used. Besides, confusing countries and languages is really inappropriate. There is a great site on this topic: http://www.flagsarenotlanguages.com/

I see an issue with having to tag no, nn and nb as the same because when transliterating will be same.

But why would this module need to check if the text is Norwegian in the first place? Does Norwegian have any ambiguous letters whose transliteration rules conflict with other languages?

@dotnetCarpenter

dotnetCarpenter commented Mar 8, 2020

But why would this module need to check if the text is Norwegian in the first place? Does Norwegian have any ambiguous letters whose transliteration rules conflict with other languages?

In short, yes. And no, this module does not need to check anything. However, this module needs to tag its replacements (we want to tag a conversion with a language code to be used with hinting). For Norwegian you have:

// Norwegian
['æ', 'ae'],
['ø', 'oe'],
['å', 'aa'],
['ô', 'o'],
['ô', 'o'],
['è', 'ee'],
['ê', 'ee'], // <- need to check this with my nynorsk speaking friend
['ó', 'o'],
['ò', 'o'],
['â', 'u'], // <- need to check this with my nynorsk speaking friend
['ô', 'edh'], // <- need to check this with my nynorsk speaking friend

Akin to Swedish, exemplified in #2 (comment)

transliterate('Sju sjösjuka sjömän') // -> Sju sjoesjuka sjoemaen
transliterate('Sju sjösjuka sjömän', ['se']) // -> Sju sjosjuka sjoman

In the latin group you have:

['è', 'e'],
['é', 'e'],
['ê', 'e'],
['ë', 'e'],
['ò', 'o'],
['ó', 'o'],
['ô', 'o'],
['õ', 'o'],
['ö', 'oe'],
['ő', 'o'],
['ø', 'o'],

But I have no idea in which language ø becomes o. I only know ø from Danish and Norwegian, and in both languages ø becomes oe. I also do not know any language where ö becomes oe. It appears to be an error. ö sounds like ø in Swedish, but no Swedish person understands that oe means ö - that only makes sense in Danish and Norwegian. In Swedish e sounds like y (in English) and therefore oe would be akin to "oui" in French.
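Conflicts like this can be listed mechanically by comparing a language table against the general one. The following is a sketch only: `findConflicts` is a hypothetical helper, and the tables contain just a few entries taken from the two lists above.

```javascript
// Returns every character that both tables replace, but with different output.
function findConflicts(baseTable, languageTable) {
  const base = new Map(baseTable);
  return languageTable
    .filter(([character, replacement]) =>
      base.has(character) && base.get(character) !== replacement)
    .map(([character, replacement]) => ({
      character,
      base: base.get(character),
      language: replacement,
    }));
}

const latin = [['ö', 'oe'], ['ø', 'o'], ['ô', 'o']];
const norwegian = [['æ', 'ae'], ['ø', 'oe'], ['å', 'aa'], ['ô', 'o']];

findConflicts(latin, norwegian);
// [{character: 'ø', base: 'o', language: 'oe'}]
```

Only the characters reported here would ever need a language hint; everything else can stay in the general table.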

Like I said before, both ISO 639-2 and ISO 639-1 have the same issue: they define no, nn and nb without any difference in transliteration. For sure no, nb and dk have no conflict and follow the exact same rules.

So far, ISO-3166 2-letter country codes, ISO 639-1, ISO 639-2 and ISO 639-3 have all failed to deliver a system that is sufficiently small for inclusion, unambiguous, and contains all of the language codes needed. ISO 639-* creates needless ambiguity, and the 2-letter version of ISO-3166 is both missing languages (most visibly en) and has nonsensical countries like SJ (Svalbard and Jan Mayen, which are both Norwegian - no).

@dotnetCarpenter

dotnetCarpenter commented Mar 8, 2020

So Wikipedia has a list of romanizations which seems to be what we want. https://en.wikipedia.org/wiki/List_of_ISO_romanizations

List of ISO standards for transliterations and transcriptions (or romanizations):

ISO 9 — Cyrillic (Russian, Bulgarian, Belarusian, Ukrainian, Serbian and Macedonian)
ISO 233 — Arabic
ISO 259 — Hebrew
ISO 843 — Greek
ISO 3602 — Japanese (1989, last reviewed 2013)
ISO 7098 — Chinese
ISO 9984 — Georgian
ISO 9985 — Armenian
ISO 11940 — Thai
ISO 11940-2 — Thai (simplified)
ISO 11941 — Korean (different systems for North and South Korea – withdrawn in 2013)
ISO 15919 — Indic scripts

There does not appear to be an ISO standard for Scandinavian, but I can provide that, provided I get some time to sort out New Norwegian (Nynorsk).

So far I think this is the best list of standards to pursue in order to correctly transliterate.

Obviously we would need help from language experts - being people who actually speak the languages. But that's what github is for, right?

@gengel

gengel commented Mar 10, 2020

I would love to see a language hint option.

@wooorm
Sponsor

wooorm commented Mar 10, 2020

Hey folks, just to add: there are different ISOs that do different things:

  • ISO-3166: (specifically 3166-1 alpha 2, probably) country codes. If you want to specify Switzerland, whether it's French, German, or Italian, use this
  • ISO-639: (specifically 639-1 alpha 2 if possible or 639-3 if not) language codes. If you want to specify German, whether in Germany, Switzerland, or Austria, use this
  • BCP-47: uses the best possible combination. Such as de if you mean German as used in Germany, or de-CH to mean German as spoken in Switzerland.
    BCP-47 uses both specs (and more), see bcp-47 and related projects for more info.

BCP-47 is used in most places currently, from Accept-language in HTTP, to lang in HTML, to i18n features in ECMAScript’s locale support.
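If hints were BCP-47 tags, a lookup could try the most specific tag first and then drop subtags from the right until a rule set matches. This is a simplified sketch in the spirit of the RFC 4647 lookup scheme, not a full implementation of it; `lookupRuleSet` and the `available` set are hypothetical, and the keys in `available` are assumed to use canonical casing such as de-CH.

```javascript
function lookupRuleSet(tag, available) {
  // baseName canonicalizes casing and drops extensions such as -u-co-phonebk.
  const subtags = new Intl.Locale(tag).baseName.split('-');

  while (subtags.length > 0) {
    const candidate = subtags.join('-');
    if (available.has(candidate)) {
      return candidate;
    }

    // Fall back to a less specific tag, for example de-AT to de.
    subtags.pop();
  }

  return undefined;
}

const available = new Set(['de', 'de-CH', 'sv']);

lookupRuleSet('de-CH', available); // 'de-CH'
lookupRuleSet('de-AT', available); // 'de'
lookupRuleSet('fr-FR', available); // undefined
```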

@dotnetCarpenter

I think it makes sense to look at previous work. Lingua::Translit is a Perl module that transliterates Cyrillic, Greek, Arabic, Latin and Sanskrit using transliteration standards.

One might use it like this:

use Lingua::Translit;
 
my $tr = new Lingua::Translit("ISO 9");
 
my $text_tr = $tr->translit("цхарацтер ориентед стринг"); # <- character oriented string
 
if ($tr->can_reverse()) {
  $text_tr = $tr->translit_reverse("character oriented string"); # <- цхарацтер ориентед стринг
}

You can install the CLI version on Ubuntu based distros via sudo apt install liblingua-translit-perl.

$ echo цхарацтер ориентед стринг | translit -t "ISO 9"
character oriented string

I would prefer to use the keywords like cyrillic, greek, arabic, latin or sanskrit instead of Common ARA, Devanagari IAST, GOST 7.79 UKR or ISO 9 at the expense of correctness.

I'm beginning to think that language codes (and country codes) are not useful at all, although there are possible conflicts within language groups. We have already debated the ø = o and ø = oe conflict in the Latin group.

@dotnetCarpenter

dotnetCarpenter commented Mar 13, 2020

BCP-47 is used in most places currently, from Accept-language in HTTP, to lang in HTML, to i18n features in ECMAScript’s locale support.

I could get behind BCP-47 for the sole reason that it is familiar. It is also more granular than a language code.
