Specification for i18n #38

Closed
martinheidegger opened this issue Oct 18, 2016 · 20 comments

Comments

@martinheidegger

Right now, READMEs on GitHub are not internationalised, even though they could be. For example, README.ja.md could show the README in Japanese. I think it would be a step in the right direction to specify this format as well.

@sotayamashita
Contributor

@martinheidegger I agree, so we need to decide which language code to use.

@martinheidegger
Author

martinheidegger commented Nov 18, 2016

I tend towards ISO 639-1 (even though ISO 639-3 would be more complete) because a README spec should try to balance effort against usefulness. I think any person who can read "Cantonese" can read "Chinese" (at least to my knowledge).

@sotayamashita
Contributor

@RichardLitt Please add the discussion label.

@RichardLitt
Owner

Done. Thanks @sotayamashita.

I agree about this issue; thank you so much for making it!

Can you point me to some repos which already do translations? I would love to see how others have done it, already, before deciding on a new standard.

Why wouldn't you lean towards ISO 639-3? What would be the downsides?

@martinheidegger
Author

martinheidegger commented Nov 18, 2016

Downsides to ISO 639-3:

Uncommon in usage

People usually know ISO 639-1 (en, ja, etc.) but have not heard of the more extended forms. Mistakes and irritation are to be expected.

Usefulness questionable

It makes sense to translate content into more than one language because the major languages each have millions of speakers. Several of the ISO 639-3 languages are dialects that people speak in addition to one of the ISO 639-1 languages. Those languages provide little or no value.

Growth of Maintenance cost

Maintaining translations is a pain in the ass. A restriction to 184 languages at least does a little to keep the translations from growing overboard. (tbh. I wonder if it wouldn't make sense to restrict the list to languages spoken by 50 million people or more: https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers - 25 languages would qualify)

@RichardLitt
Owner

Those are fair points. I think limiting to 184 languages is fine.

Would it be possible to specify both language codes? Is there a self-describing language code - as in, can it easily be made clear that we are using ISO 639-1 as opposed to ISO 639-3?

@martinheidegger
Author

ISO 639-1 codes are two letters long; ISO 639-3 codes are three letters long. Usually, specifying "two-letter code" implies that you use ISO 639-1.

I am not quite sure if this is what you asked but: it is possible to support either language code:

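// Two letters: ISO 639-1. Three letters: ISO 639-3.
// iso639_1Check and iso639_3Check stand in for the actual lookups.
if (characterCodePart.length === 2) {
  iso639_1Check(characterCodePart)
} else if (characterCodePart.length === 3) {
  iso639_3Check(characterCodePart)
} else {
  throw new Error('Unrecognized language code: ' + characterCodePart)
}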
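
For anyone who wants to run the idea, here is a minimal self-contained sketch of the same length-based check. The two sets hold only a handful of sample codes for illustration (a real check would need the full ISO 639-1 and ISO 639-3 tables), and `classifyLanguageCode` is a hypothetical name, not part of any existing tool.

```javascript
// Sample subsets only; the real ISO 639-1 / 639-3 tables are much larger.
const ISO_639_1_SAMPLE = new Set(['en', 'ja', 'de', 'zh'])
const ISO_639_3_SAMPLE = new Set(['eng', 'jpn', 'deu', 'yue'])

// Hypothetical helper: reports which standard a code belongs to,
// or throws if the code is not in the sample tables.
function classifyLanguageCode (code) {
  if (code.length === 2 && ISO_639_1_SAMPLE.has(code)) return 'ISO 639-1'
  if (code.length === 3 && ISO_639_3_SAMPLE.has(code)) return 'ISO 639-3'
  throw new RangeError('Unknown language code: ' + code)
}

console.log(classifyLanguageCode('ja'))  // ISO 639-1
console.log(classifyLanguageCode('yue')) // ISO 639-3
```

The length alone decides which table is consulted, so no extra marker in the filename is needed to tell the two standards apart.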

@RichardLitt
Owner

Ah. So, it's very easy to tell the difference, then. In that case, why not add something saying: "Use ISO 639-1 if you can. If you can't, using ISO 639-3 is also valid." Is there a need to eliminate one for the sake of the other?

@martinheidegger
Author

See my reasoning above. I still somehow think the top 25 would be enough.

@RichardLitt
Owner

Maybe I am not being clear about where I am confused.

Even if they could be. For example, README.ja.md could show the README in Japanese. I think it would be a step in the right direction to specify this format as well.

What I am seeing is that we add this to the format:

If you have i18n for your READMEs, the standard is to name your files accordingly: README.ja.md, where ja is the ISO 639-1 code. All 639-1 codes are valid; if your language falls outside of ISO 639-1, then you may use ISO 639-3, which has three letters. For instance, README.ask.md for Askunu. However, if your language has both an ISO 639-1 and an ISO 639-3 code, default to the ISO 639-1 code. So, for instance, README.en.md instead of README.eng.md.

This is what I am considering adding to the spec. For the linter, we can check, if the code is two letters, whether it is an ISO 639-1 language. If it is three, then it is checked against ISO 639-3 - as your code suggests. I don't see a feasible way to limit the languages to 25, or why we would even want to.

Does this sound alright? What am I missing from your understanding of intentionally excluding ISO 639-3 languages?
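
To make the proposed wording concrete, here is a rough sketch of what such a linter rule might look like. Everything in it is an assumption for illustration: `lintReadmeName` is a hypothetical function, and the tables contain only a few sample codes rather than the full ISO 639-1 and ISO 639-3 lists.

```javascript
// Sample tables only; a real linter would load the complete code lists.
const TWO_LETTER_SAMPLE = new Set(['en', 'ja', 'de'])
// ISO 639-3 codes that also have a two-letter ISO 639-1 form.
const SHORT_FORM = new Map([['eng', 'en'], ['jpn', 'ja'], ['deu', 'de']])
// ISO 639-3 codes with no two-letter equivalent.
const THREE_LETTER_ONLY_SAMPLE = new Set(['ask', 'yue'])

// Hypothetical lint rule for names of the form README.<code>.md.
function lintReadmeName (filename) {
  const match = /^README\.([a-z]+)\.md$/.exec(filename)
  if (!match) return { ok: false, reason: 'not a translated README name' }
  const code = match[1]
  if (TWO_LETTER_SAMPLE.has(code)) return { ok: true, code }
  if (SHORT_FORM.has(code)) {
    // The language has a two-letter code, so the shorter form is preferred.
    return { ok: false, reason: 'use README.' + SHORT_FORM.get(code) + '.md instead' }
  }
  if (THREE_LETTER_ONLY_SAMPLE.has(code)) return { ok: true, code }
  return { ok: false, reason: 'unknown language code: ' + code }
}

console.log(lintReadmeName('README.ja.md'))  // accepted
console.log(lintReadmeName('README.eng.md')) // rejected, suggests README.en.md
console.log(lintReadmeName('README.ask.md')) // accepted
```

Note that nothing in this rule limits the number of languages; it only checks that the code is known and that the shortest available form is used.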

@wooorm
Collaborator

wooorm commented Nov 19, 2016

Maybe this can be more permissive by allowing BCP-47 tags? E.g., de, en-GB, nl-BE, and the like. That would open up regions as well, and it allows both 639-1 and 639-3 (preferring the shortest) too.
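
As a rough illustration of how a tool could accept BCP 47 tags in README filenames, the sketch below uses the standard `Intl.getCanonicalLocales` function available in modern JavaScript runtimes. The `README.<tag>.md` pattern and the helper names `canonicalTag` and `readmeTag` are assumptions made for this example, not something the spec defines.

```javascript
// Returns the canonical form of a BCP 47 tag, or null if it is malformed.
// Intl.getCanonicalLocales checks structure only; it does not verify that
// the subtags are registered languages or regions.
function canonicalTag (tag) {
  try {
    return Intl.getCanonicalLocales(tag)[0]
  } catch (err) {
    if (err instanceof RangeError) return null
    throw err
  }
}

// Hypothetical naming rule: README.<tag>.md with a well-formed tag.
function readmeTag (filename) {
  const match = /^README\.([A-Za-z0-9-]+)\.md$/.exec(filename)
  return match ? canonicalTag(match[1]) : null
}

console.log(readmeTag('README.de.md'))    // de
console.log(readmeTag('README.en-gb.md')) // en-GB
console.log(readmeTag('README.en_GB.md')) // null (underscore is not allowed here)
```

Because the check is structural, both two-letter and three-letter language subtags pass without the tool having to ship its own ISO tables.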

@RichardLitt
Owner

I'm finding the spec a bit hard to parse. I'm also not sure I want to live in a world where README.en.md and README.en-GB.md are two READMEs I need to keep updated. But it does seem to have the best ratification elsewhere - it's used by a lot of other computing standards [See wiki]. So, I am for that.

"Use the appropriate IETF tag" seems to be a pretty fine thing to say; it allows us to not worry about conforming to one ISO variant over another, and it puts the burden on the translator to know what tag they should use in their README version (which they should already know, anyway).

@wooorm
Collaborator

wooorm commented Nov 19, 2016

Having BCP47 also allows for different currencies; comma or full-stop as number separators; multiple scripts (Chinese, some Slavic languages, etc)!
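
A small sketch of what that extra information looks like in practice, using the standard `Intl.Locale` and `Intl.NumberFormat` APIs. The helper name `describeTag` is made up for this example, and the formatted number strings depend on the locale data shipped with the runtime, so no exact output is claimed for them.

```javascript
// Split a BCP 47 tag into the parts that matter for a translation.
function describeTag (tag) {
  const locale = new Intl.Locale(tag)
  return { language: locale.language, script: locale.script, region: locale.region }
}

console.log(describeTag('zh-Hant-TW')) // language zh, script Hant, region TW
console.log(describeTag('sr-Latn'))    // language sr, script Latn, no region

// Number separators follow the tag; exact strings depend on the runtime's
// locale data, so they are only printed here.
console.log(new Intl.NumberFormat('en-US').format(1234.5))
console.log(new Intl.NumberFormat('de-DE').format(1234.5))
```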

@RichardLitt
Owner

Well, that's sold me. Others?

@martinheidegger
Author

I wrote this above:

I wonder if it wouldn't make sense to restrict the list to languages spoken by 50 million people or more

Using BCP47 would go straight against this.

Of the following two options:

A

Are you coming to bed? - I can't. This is important. - What? - Somebody's wrong on the internet

or

B

I am too old for this shit

I am going with B) ー Godspeed.

@RichardLitt
Owner

@martinheidegger I'm really sorry man; I'm still a bit confused why you feel that we need to restrict languages. That's why we keep talking past each other.

What would restricting languages functionally mean? Is this not just about naming the README.md files?

@martinheidegger
Author

martinheidegger commented Nov 21, 2016

Okay, not trying to win the argument here. Just trying to convey my point: At the core there is one question: "Why does GitHub not support internationalization?" And my guess is: because multiple languages would split the community and reduce the effectiveness of open source code.

With this in mind, the counter question becomes "Why would you even try to support different languages in open source?". The only answer I can come up with is: "incompetency" (sort of): Not every person in the world speaks/writes/reads English. (See EPI)

By providing different translations we accept that and try to accommodate people who don't speak English, but that is an effort and comes at a cost. To justify that cost, to make it worth it, translators should focus on the largest number of people who cannot deal with English. More translations mean more effort and less work on the open-source code itself, which I don't think is good to encourage.

allows for different currencies; comma or full-stop

To me this just means that we would put more effort into a place where the effort doesn't help much, if at all.

@RichardLitt
Owner

I understand all of that; thank you so much for laying it out clearly.

I agree about the cost, about why GitHub doesn't support i18n, and about focusing on the largest number of people.

I am curious about this possible point: If I speak a language that is uncommon, what is to stop me naming my README using my language code and doing my own work of translating the README? Is there a high cost for people who are not the translators? Because if there isn't, then I don't think it's a bad thing to support i18n for all possible languages - buy-in would be the responsibility of the translators and language communities, not the project. Limiting languages to the top 25 most spoken would be detrimental to their efforts, I think. Do you understand what I am getting at? I may be misinformed! Please let me know if so. I agree that, as a translator, if I spoke three languages, I should focus on the more common one, but I don't see a problem with also translating the other one if I wanted.

Regarding currencies, commas, and the like; those may be important, and I don't see a problem with using a standard that does away with any possible bike-shedding. That's the point of standard-readme in the end, too.

@martinheidegger
Author

Is there a high cost for people who are not the translators? Because if there isn't, then I don't think it's a bad thing to support i18n for all possible languages.

The problem here, I think, is that you can't really separate the translators' spec from the writers' spec. Also, the translators need to work from some base. So: if you open the spec to all languages, all languages will likely be used.

buy-in would be the responsibility of the translators and language communities, not the project.

That would suggest translators are not essential members of the project... heresy? 😛

... rd that does away with any possible bike-shedding.

Just for a complete picture: It is possible to limit the "25" languages to "25 BCP 47 codes". In other words: "en" could automatically stand for "en-GB".


There is an argument for and against limitation. BCP 47 will likely be a little too loose to be effective but is generally more inviting. Limiting the languages would be a "bolder choice" that might not be favored by people but could result in a nicer infrastructure.
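
For comparison, the limitation option could be sketched as an allow-list keyed on the primary language subtag. This is only an illustration under stated assumptions: the three entries stand in for the "25 languages" idea, the default tags they map to are invented for the example, and `resolveAllowedTag` is a hypothetical helper.

```javascript
// Hypothetical allow-list: each permitted language maps to an assumed
// default tag. A real list would be agreed on by the spec.
const ALLOWED = new Map([
  ['en', 'en-GB'],
  ['ja', 'ja-JP'],
  ['zh', 'zh-Hans-CN']
])

// Resolves any tag to the default tag of its language,
// or null if the tag is malformed or the language is not on the list.
function resolveAllowedTag (tag) {
  let language
  try {
    language = new Intl.Locale(tag).language
  } catch (err) {
    return null
  }
  return ALLOWED.get(language) || null
}

console.log(resolveAllowedTag('en'))    // en-GB
console.log(resolveAllowedTag('en-US')) // en-GB (regional variants collapse)
console.log(resolveAllowedTag('yue'))   // null (not on the list)
```

The trade-off described above shows up directly: the list keeps the set of translations small, but any language outside it is rejected no matter who is willing to maintain the translation.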

@RichardLitt
Owner

so: if you open the spec to all languages, all languages will likely be used.

I don't think this is true. It's up to each project to decide which languages should be used. If I write a spec saying all languages are possible for standard-readme, I won't wake up tomorrow to 7000 translations from all human languages.

That would suggest translators are not essential members of the project... heresy?

Sure they are. But only if the project has multilingual users or support.

BCP 47 will likely be a little too loose to be effective but is generally more inviting. Limiting the languages would be a "bolder choice" that might not be favored by people but could result in a nicer infrastructure.

I think that sums it up for me. I am going to go with BCP 47. We can revisit this later if we need to.


4 participants