Language hint #2
Comments
We started building something similar to this a few years ago - https://github.com/diacritics - but we never finished it. We had the transliteration default to general usage, but also allowed the user to set the language and variant. For example:
const transliterate = require("diacritics-transliterator").transliterate;
const german = transliterate("¿abcñ-ß123?", "decompose", "de");
/* german => "¿abcñ-ss123?"; only the German s-sharp is replaced */
const spanish = transliterate("¿abcñ-ß123?", "base", "es");
/* spanish => "?abcn-ß123?"; only the Spanish ñ is modified */
const generic = transliterate("¿abcñ-ß123?");
/* generic => "?abcn-ß123?"; the base of ß is still ß */
const unchanged = transliterate("¿abcñ-ß123?", "base", "test");
/* unchanged => "¿abcñ-ß123?"; if no diacritics match the set language variant,
the original string is returned */
But that might have been too aggressive, because the code owner opted to rewrite the transliterator to be much more basic. |
Did you consider using https://github.com/wooorm/franc? |
It's a bit heavy to depend on here, but I should definitely recommend it for automatic detection when this issue is resolved. |
A custom build of franc limited only to the languages you're interested in is probably an option too. |
Then the solution for this issue should use the same language codes as Franc, namely ISO 639-3 (e.g. |
Yeah, or both, since the two letter version is more commonly used. |
I just realized we only need language detection for the few languages that overlap. Not all. So maybe we can make it fully automatic without any options by bundling a slimmed down version of |
I think transliterate should be compatible with Info about
from https://diary.braniecki.net/2020/02/14/js-intl-in-2020/ It is however a little confusing that the constructor takes an array of language codes, but only one is ever used in the examples. The original proposal only used one language code.
Original:
var obj = Intl.DisplayNames('ar', {});
obj.format('Europe/Warsaw'); // 'وارسو'
var obj = Intl.DisplayNames('pl', {});
obj.format('Europe/Warsaw'); // 'Warszawa'
Standard:
const languageNames = new Intl.DisplayNames(['zh-Hant'], { type: 'language' });
languageNames.of('fr');
// → '法文'
languageNames.of('zh');
// → '中文'
languageNames.of('de');
// → '德文'
A language hint would probably be easier to implement if it is only one language instead of a priority list. I also expect a list will quickly become useless, since you will lose control of the overlaps. The API might use an array to enable a priority list down the line, without having to change the API. |
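As a rough illustration of the "array now, priority list later" idea, here is a minimal sketch. The names `normalizeHints` and `pickLanguage` are hypothetical and not part of transliterate; the sketch only assumes the hint may arrive as a single string or as an array.

```javascript
// Hypothetical helper: accept 'sv' or ['sv', 'de'] and always return an array,
// so a single hint today and a priority list later share one code path.
function normalizeHints(hint) {
  if (hint == null) return [];
  const list = Array.isArray(hint) ? hint : [hint];
  return list.map((h) => String(h).toLowerCase());
}

// Hypothetical helper: return the first hinted language that has replacements,
// or undefined when none of the hints is supported.
function pickLanguage(hint, supported) {
  return normalizeHints(hint).find((h) => supported.includes(h));
}

console.log(pickLanguage('sv', ['de', 'sv']));         // sv
console.log(pickLanguage(['xx', 'de'], ['de', 'sv'])); // de
```

With only one hint in use, `pickLanguage` behaves like a plain lookup; the array form costs nothing until someone actually needs a priority order.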
After digging a little deeper I now see that the language codes can be in various formats. Reading Unicode Language Identifier is not helpful. The allowed formats are simply too wide to be worth the extra complexity. With complexity, I do not only mean the implementation in transliterate but also in usage. I prefer to see usage of transliterate to be as familiar as possible, when I take over someone else's code. The following list of supported language codes is taken from the V8 blog post:
@sindresorhus I propose strictly using ISO 3166-1 alpha-2 and only expand if deemed necessary by users of transliterate, that is having a real use-case in a project. |
Those are not language codes. |
transliterate('Sju sjösjuka sjömän') // -> Sju sjoesjuka sjoemaen
transliterate('Sju sjösjuka sjömän', ['se']) // -> Sju sjosjuka sjoman |
What languages are missing @thorn0? I see that |
Missing? These codes are country codes, they're not language codes. Do I really need to explain the difference? |
@thorn0 no I get it. The purpose is to not have en-GB, en-US, en-AU, en-CA, en-NZ, en-IE, en-ZA, en-JM, en-CB, en-BZ, en-TT, en-ZW, en-PH, en-ID, en-HK, en-IN, en-MY and en-SG, if they all transliterate to the exact same letters. The above English language codes are taken from https://github.com/libyal/libfwnt/wiki/Language-Code-identifiers |
That's why it's enough to check only the part before the hyphen, isn't it? |
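For what "checking only the part before the hyphen" could look like in practice, a minimal sketch follows. The helper name `primaryLanguage` is made up for illustration, and the sketch assumes the hint is a BCP 47-style tag such as `en-GB`.

```javascript
// Hypothetical helper: reduce a BCP 47-style tag to its primary language subtag.
// "en-GB" -> "en", "sv" -> "sv". Underscores are tolerated because some systems emit "en_GB".
function primaryLanguage(tag) {
  return String(tag).trim().split(/[-_]/)[0].toLowerCase();
}

// All regional variants of English collapse to a single key.
const tags = ['en-GB', 'en-US', 'en-AU', 'sv-SE', 'de'];
const unique = [...new Set(tags.map(primaryLanguage))];
console.log(unique); // [ 'en', 'sv', 'de' ]
```

Note that the part before the hyphen is the language subtag ("en", "sv"); the region ("GB", "SE") is the part that gets dropped.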
|
Well, checking only the part before the hyphen is an ISO-3166 2-letter country code. I reiterate that the current web standard accepts an ISO-3166 2-letter country code.
I also contest that UN M49 3-digit region code is useful for transliterate. Since ISO-3166 2-letter country code is the same as ccTLD which all web developers use daily, it should be immediately familiar.
|
"en" is not a country code. The discussion gets circular. Let's delete our messages to not derail this issue completely. |
If Wikipedia is to be believed, then ISO 639-3 contains 7,546 human languages, which seems like overkill for transliterate.
I think the discussion is to implement a language hint. My proposal is to follow the web standard of So far, the only argument (while a good argument) is that ISO-3166 2-letter country codes are not language codes. |
Where exactly did you read that it uses country codes to specify languages? |
Please re-read #2 (comment) and #2 (comment). |
It uses standard language tags from https://tools.ietf.org/html/rfc5646. See the specification. |
As I said in both comments, I think that is overkill for transliterate. ISO-3166 2-letter country code fits the bill. Unless you see a missing language that transliterate should support?
|
Right, region codes are used for specifying regions. What does it have to do with specifying languages? |
Are you sure you understand what |
I thought I made that point abundantly clear!? The ISO-3166 2-letter country code specifies every language (+ Chinese if the Hanyu Pinyin system is used, and allowing TW to use a separate system) that transliterate currently supports. But I'm not married to ISO-3166 2-letter country code. But every other system I have seen is way too big. Depending on a huge library such as https://github.com/wooorm/franc would mean I will not use transliterate. But I am neither the one to make the call nor in a position to block any direction this library will take. I feel I have said enough. And I understand that you want to use https://github.com/wooorm/franc. |
Yes, but I do not understand your point? |
The point is that only language codes should be used for specifying languages. You're saying: let's use country codes to represent languages, just like |
@thorn0 perhaps you have a similar feeling towards me, but I feel you are not listening at all to what I am saying.
Source: #2 I stand by that using ISO-3166 2-letter country code:
Source: #2 (comment)
I do regret that I didn't specify that I meant
All this time I have been arguing for ISO-3166 2-letter country code for the hint. I did not see you arguing for ISO 639-3 at all during the discussion. After this lengthy discussion, I agree with you that It uses language codes for languages (the constructor) and region codes for regions (the of method). I did that before your first comment! #2 (comment) I still think that supporting ISO 639-3 with its 7,546 human languages is too heavy. @thorn0 Please, next time you want to debate something, post full-meaning and not built-up arguments. #2 (comment) was a good post from you. And please refrain from being condescending #2 (comment) |
Your reference to Not sure what your "too heavy" refers to. Nobody is going to bundle the entire list of languages into this module. What for? And another thing I don't understand, why are you thinking only about languages this module supports now? What about languages it might support in the future? |
Do you propose a subset of ISO 639-3? The supported languages will have to be tagged somehow. I see an issue with having to tag no, nn and nb as the same because they transliterate the same. So a user could specify nn, but you will need to map nn to no, etc. I am sure this goes for many of the ISO 639-3 languages. Instead of doing that, it might be preferable to only have a subset of languages (by grouping similar languages within countries). The goal is to transliterate correctly, not to specify every language correctly. In order to ease maintenance, I propose limiting the scope of inputs. Way, way below ISO 639-3. |
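The nn-to-no mapping described above could be handled with a small alias table, sketched below. The table contents and the name `canonicalLanguage` are hypothetical; whether nb and nn really share every replacement is the commenter's claim, not something verified here.

```javascript
// Hypothetical alias table: codes that share one replacement set point at a canonical key.
// The nb/nn -> no entries follow the claim made in the comment above.
const LANGUAGE_ALIASES = { nb: 'no', nn: 'no' };

// Resolve a user-supplied code to the key under which replacements would be stored.
function canonicalLanguage(code) {
  const key = String(code).toLowerCase();
  return Object.prototype.hasOwnProperty.call(LANGUAGE_ALIASES, key)
    ? LANGUAGE_ALIASES[key]
    : key;
}

console.log(canonicalLanguage('nn')); // no
console.log(canonicalLanguage('sv')); // sv
```

This keeps the replacement data grouped (one entry for Norwegian) while still letting users pass whichever standard code they already have.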
Or ISO 639-2, or both. There is a standard and it should be used. Besides, confusing countries and languages is really inappropriate. There is a great site on this topic: http://www.flagsarenotlanguages.com/
But why would this module need to check if the text is Norwegian in the first place? Does Norwegian have any ambiguous letters whose transliteration rules conflict with other languages? |
In short, yes. And no, this module does not need to check anything. However, this module needs to tag its replacements (we want to tag a conversion with a language code to be used with hinting). For Norwegian you have:
Akin to Swedish, exemplified in #2 (comment)
In the latin group you have:
But I have no idea in which language where Like I said before both ISO 639-2 and ISO 639-1 have the same issue. Having defined no, nn and nb without any difference in transliteration. For sure So far, both ISO-3166 2-letter country code and ISO 639-1, ISO 639-2 and ISO 639-3 have failed to deliver a system that is sufficiently small for inclusion, not ambiguous and contains all of the language codes needed. ISO 639-* creates needless ambiguity and the 2-letter version of ISO-3166 is both missing languages (most visible |
So Wikipedia has a list of romanizations which seems to be what we want. https://en.wikipedia.org/wiki/List_of_ISO_romanizations List of ISO standards for transliterations and transcriptions (or romanizations):
There does not appear to be an ISO standard for Scandinavian, but I can provide that. Given that I get some time to sort out the New Norwegian (Ny Norsk). So far I think this is the best list of standards to pursue in order to correctly transliterate. Obviously we would need help from language experts - being people who actually speak the languages. But that's what github is for, right? |
I would love to see a language hint option. |
Hey folks, just to add: there are different ISOs that do different things:
BCP-47 is used in most places currently, from |
I think it makes sense to look at previous work. Lingua::Translit is a Perl module that transliterates Cyrillic, Greek, Arabic, Latin and Sanskrit using transliteration standards. One might use it like this:
use Lingua::Translit;
my $tr = new Lingua::Translit("ISO 9");
my $text_tr = $tr->translit("цхарацтер ориентед стринг"); # <- character oriented string
if ($tr->can_reverse()) {
$text_tr = $tr->translit_reverse("character oriented string"); # <- цхарацтер ориентед стринг
}
You can install the CLI version on Ubuntu based distros via
I would prefer to use the keywords like I'm beginning to think that language codes (and country codes) are not useful at all. Although there are possible conflicts within language groups. We have already debated the |
I could get behind BCP-47 for the sole reason that it is familiar. But also more granular than a language code. |
Some languages have overlapping characters. To provide the most accurate result, we could accept a language hint and prefer that language when there's a conflict. You would still be able to use multiple languages in a string, but the provided one gets priority. For example,
sv-SE
to prioritize the Swedish replacement.
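A minimal sketch of the proposed behaviour, under stated assumptions: the replacement tables below are tiny illustrative samples and `transliterateWithHint` is a hypothetical name, so this is not the module's actual data or API. It only shows the idea that a hinted language's replacements win when they conflict with the defaults.

```javascript
// Illustrative tables only; the real module's data and API may differ.
const DEFAULT_REPLACEMENTS = { 'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'å': 'a' };
const LANGUAGE_REPLACEMENTS = {
  sv: { 'ä': 'a', 'ö': 'o' }, // Swedish overrides for the conflicting letters
};

const hasOwn = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Hypothetical function: apply default replacements, letting the hinted
// language's replacements take priority where both tables define a character.
function transliterateWithHint(text, hint) {
  // Only the primary language subtag is used, so 'sv-SE' is treated as 'sv'.
  const language = hint ? String(hint).split('-')[0].toLowerCase() : '';
  const overrides = hasOwn(LANGUAGE_REPLACEMENTS, language)
    ? LANGUAGE_REPLACEMENTS[language]
    : {};
  const table = { ...DEFAULT_REPLACEMENTS, ...overrides };
  // Normalize to NFC so letters like 'ö' are single code points before lookup.
  return Array.from(text.normalize('NFC'), (ch) =>
    hasOwn(table, ch) ? table[ch] : ch
  ).join('');
}

console.log(transliterateWithHint('Sju sjösjuka sjömän'));          // Sju sjoesjuka sjoemaen
console.log(transliterateWithHint('Sju sjösjuka sjömän', 'sv-SE')); // Sju sjosjuka sjoman
```

Characters without a language-specific override (here `ü`) still fall back to the defaults, so mixed-language strings keep working as described above.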