i10n #1292

Closed
wants to merge 6 commits into from

@GuillaumeGomez
Member

GuillaumeGomez commented Sep 24, 2015

@Manishearth


```Shell
rustc --install-lang fr # downloads an official language pack from the server
rustc --install-lang fr=pack.zip # a custom pack can be installed this way
```

@killercup

killercup Sep 24, 2015

Member

I'm not sure I like the idea of adding a subcommand that downloads and extracts files to rustc. I'd rather see a separate utility to install these language packs (could also be part of multirust).

@Manishearth

Manishearth Sep 24, 2015

Member

Or a part of install.sh/rustup.sh (or whatever it's called these days).

@GuillaumeGomez

GuillaumeGomez Sep 24, 2015

Author Member

I don't see the issue with adding such a thing to Rust. Since localization will be handled by the compiler directly, why not the language packs too?

@Manishearth

Manishearth Sep 24, 2015

Member

The issue is that rustc will be talking over the network, we don't want that.

So the compiler will have support for language packs, just not for downloading/installing them.

@GuillaumeGomez

GuillaumeGomez Sep 24, 2015

Author Member

Why not talk over the network? I don't really see the issue here.

@GuillaumeGomez

GuillaumeGomez Sep 24, 2015

Author Member

Oh my bad. Didn't think of that. Thanks for the info !

@nagisa

nagisa Sep 24, 2015

Contributor

Rustc also won’t be able to save these packs most of the time anyways since you need administrative privileges for that in current versions of all major systems.

@GuillaumeGomez

GuillaumeGomez Sep 24, 2015

Author Member

Just like rust_install.sh. I don't think this is a real issue here. They can just launch the command with sudo if needed.

@nagisa

nagisa Sep 24, 2015

Contributor

You don’t simply launch compilers as root. If a compiler needs administrative privileges, then something went wrong somewhere.

@GuillaumeGomez

GuillaumeGomez Sep 24, 2015

Author Member

It's a side thing, so I don't find it abnormal.


## Storage of localizations

The localization files should be stored in a folder called i10n, which is part of the Rust installation folder. By default, only the English files will be there, but once the French files are placed there, the option `--lang fr` will work. The folder will look like this:

@killercup

killercup Sep 24, 2015

Member

Are there any plans on a file format? Or is this just an implementation detail at this point?

@GuillaumeGomez

GuillaumeGomez Sep 24, 2015

Author Member

What do you mean by file format? Is it about how the localization file is written or how it is compressed? The first is explained a bit further up; for the second, there are no real plans yet.

@Manishearth

Manishearth Sep 24, 2015

Member

Just a text file with key-value pairs as Rust strings.

@killercup

killercup Sep 24, 2015

Member

Is it about how the localization file is written

Yes, I thought the text format shown above was just as an example.

Just a text file with key-value pairs as Rust strings.

With lots of keys and sub-keys you might want to consider an indexed structure, for example. This could also be added in the future, though, if there was a clear way of versioning these files. (Also, I'm sure there are a lot of existing formats for this kind of thing.)

@Manishearth

Manishearth Sep 24, 2015

Member

With lots of keys and sub-keys you might want to consider an indexed structure

I don't think we'll need that.

This particular format is basically a copy of Firefox's .properties files for JS l10n (they use DTDs for HTML/XUL l10n).

Firefox has tons of these, but each JS file only loads the necessary ones for performance. They don't have nested keys though.

We don't need to do what Firefox does since these are for errors and warnings -- pretty cheap to do file I/O to fetch this information.
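
A minimal sketch of how such a flat key-value file could be loaded, assuming one `key=value` pair per line and `#` for comments. The exact syntax is not pinned down in this thread, and the names `parse_catalog` and `lookup` are hypothetical:

```rust
use std::collections::HashMap;

/// Parses a `.properties`-style message catalog: one `key=value` pair per
/// line; blank lines and lines starting with `#` are ignored.
/// (Assumed format, not the one fixed by the RFC.)
fn parse_catalog(text: &str) -> HashMap<String, String> {
    let mut catalog = HashMap::new();
    for line in text.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // Split on the first `=` only, so values may themselves contain `=`.
        if let Some((key, value)) = line.split_once('=') {
            catalog.insert(key.trim().to_string(), value.trim().to_string());
        }
    }
    catalog
}

/// Looks a key up in the selected language, falling back to English
/// when the translation is missing.
fn lookup<'a>(
    lang: &'a HashMap<String, String>,
    english: &'a HashMap<String, String>,
    key: &str,
) -> Option<&'a str> {
    lang.get(key).or_else(|| english.get(key)).map(String::as_str)
}
```

The English fallback in `lookup` is one way to soften the "many untranslated errors" concern: a key that a language pack lacks still produces a message.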

@Kimundi

Member

Kimundi commented Sep 24, 2015

Theoretically you could use macros/syntax extension to directly allow i18n!("expected one of {} or an operator, found {}", expected, found) macros that use the string as a lookup key, and have a way to list all uses of the i18n! macro in the compiler so that it is comprehensive.
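
A rough sketch of the idea, using a plain `macro_rules!` macro rather than a syntax extension. The macro body, the `fill_positional` helper, and the catalog type (`HashMap<String, String>`) are all assumptions made for illustration; listing every use of the macro across the compiler is not covered here:

```rust
use std::collections::HashMap;

/// Hypothetical `i18n!` macro: the English string doubles as the lookup key.
/// `$catalog` is expected to be a `HashMap<String, String>` of translations.
/// When no translation is found, the English text itself is used.
macro_rules! i18n {
    ($catalog:expr, $msg:expr $(, $arg:expr)* $(,)?) => {{
        let template: &str = $catalog.get($msg).map(|s| s.as_str()).unwrap_or($msg);
        let args: Vec<String> = vec![$($arg.to_string()),*];
        fill_positional(template, &args)
    }};
}

/// Fills `{}` placeholders left to right at runtime.
/// Surplus placeholders are left as-is; surplus arguments are ignored.
fn fill_positional(template: &str, args: &[String]) -> String {
    let mut out = String::new();
    let mut rest = template;
    let mut next = args.iter();
    while let Some(pos) = rest.find("{}") {
        out.push_str(&rest[..pos]);
        match next.next() {
            Some(arg) => out.push_str(arg),
            None => out.push_str("{}"),
        }
        rest = &rest[pos + 2..];
    }
    out.push_str(rest);
    out
}
```

Because the placeholders are positional, a translation that needs the arguments in a different order cannot express that with this sketch.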

@Manishearth

Member

Manishearth commented Sep 24, 2015

@Kimundi I like the spirit of that idea, but it would make language packs much less portable

@GuillaumeGomez

Member Author

GuillaumeGomez commented Sep 24, 2015

@Kimundi: And if you want to change the key-string, you'll have to change it in all other language files. I don't think that it would be very convenient...

@steveklabnik

Member

steveklabnik commented Sep 24, 2015

I would really like to see a comparison to other i18n libraries, efforts, and such. This is an incredibly complex topic, and this is a very, very brief RFC.

i18n is really important, and we should gain support for it. But it's really easy to do poorly.

@jonas-schievink

Member

jonas-schievink commented Sep 24, 2015

I feel like it's still too early to implement these "luxury" features. I think translating now would just lead to a worse experience (= many untranslated errors in the output) as diagnostics are improved and new ones get added.

I also don't get how we're supposed to change the structure of existing messages (i.e. add a {} or swap their positions). Should we just create a new one and delete all old translations?

That said, even if I'll never use this (I'm much more used to English jargon than German), this does seem like a good feature to have in general.

@Manishearth

Member

Manishearth commented Sep 24, 2015

I also don't get how we're supposed to change the structure of existing messages

we should use named arguments as much as possible.

@Manishearth

Member

Manishearth commented Sep 24, 2015

It occurs to me that we forgot about pluralization. That's nontrivial to handle. I think Firefox handles it by asking for two versions of the string.

@nagisa

Contributor

nagisa commented Sep 24, 2015

Do not re-invent a wheel. There’s gettext and lots of infrastructure around it. IMHO this proposal is strongly inferior to implementing gettext library (that works equally well on all supported platforms, as opposed to python’s gettext implementation) in rust and pulling that into rustc.

If gettext is not satisfactory in some way, then at least port something that is known to already work; the rust project really doesn’t need to solve the already-mostly-solved l10n problem all over again.

P.S. this RFC is proposing infrastructure for l10n (localisation), not i18n (internationalisation). i18n is much more involved and I don’t see how rust needs it at all.

@apasel422

Member

apasel422 commented Sep 24, 2015

@Manishearth Pluralization is more complicated than that, as some languages have more intricate rules than just 1 -> singular, !1 -> plural.

I agree with @steveklabnik that this needs far more investigation. l20n should certainly be brought up here.
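
To illustrate why two strings per message are not enough, here is a small sketch of plural-category selection for English and Russian. The Russian rule follows the Unicode CLDR plural rules for integer counts; the type and function names are hypothetical:

```rust
/// Plural categories (a subset of those defined by Unicode CLDR).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Plural {
    One,
    Few,
    Many,
    Other,
}

/// English: one form for exactly 1, another for everything else.
fn plural_en(n: u64) -> Plural {
    if n == 1 { Plural::One } else { Plural::Other }
}

/// Russian (integer counts only): three forms, chosen by the last digits.
fn plural_ru(n: u64) -> Plural {
    let (last_digit, last_two) = (n % 10, n % 100);
    if last_digit == 1 && last_two != 11 {
        Plural::One
    } else if (2..=4).contains(&last_digit) && !(12..=14).contains(&last_two) {
        Plural::Few
    } else {
        Plural::Many
    }
}
```

With these rules, 21 selects the `One` form in Russian but the `Other` form in English, so a "singular/plural" pair keyed on `n == 1` picks the wrong string.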

@GuillaumeGomez

Member Author

GuillaumeGomez commented Sep 24, 2015

@nagisa: You're absolutely right. I got confused but it's i10n.

@GuillaumeGomez GuillaumeGomez changed the title i18n i10n Sep 24, 2015

@Manishearth

Member

Manishearth commented Sep 24, 2015

P.S. this RFC is proposing infrastructure for l10n (localisation), not i18n (internationalisation).

infrastructure for l10n is i18n. i18n is making a piece of software localizable, l10n is creating the translations.

@Manishearth

Member

Manishearth commented Sep 24, 2015

Pluralization is more complicated than that,

Agreed. I think they have some handling for that, but I haven't looked into it.

@olivren


olivren commented Sep 25, 2015

I agree with the intent of this RFC, but not on the proposed solution.

In my experience, translations based on simple key/value formats are a real pain to work with. Finding consistent or meaningful key names is impossible. Developers now have to follow an indirection to know what the content of the string is. The most important part of translating software lies in the tools that make it easy for the translators to keep the translations up to date, and to do so at their own pace. gettext is very good for that, and it is not something to overlook.

So I think this RFC should just be "internationalize rustc using gettext". Unfortunately, I could not find any attempt to support gettext for Rust, and this is obviously a prerequisite (note that this needs support both on the gettext side and in the form of a Rust binding).

@seanmonstar

Contributor

seanmonstar commented Sep 26, 2015

I highly recommend looking at l20n.

I don't recommend looking too deeply at this implementation, as it was my first rust code ever, but I feel some absurd urge to include it with my comment about l20n.

@withoutboats

Contributor

withoutboats commented Sep 26, 2015

I don't know a lot about localization (is it just me or are these acronyms ironic given that this is an accessibility issue?), but wouldn't it make sense for this to be built on top of semantic error values that could also have a machine readable (e.g. JSON) output form, as well as a localized human readable String, similar to this RFC?
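
One possible shape for such a semantic error value, sketched under the assumption that a diagnostic is just a stable code plus named arguments. `Diagnostic`, `to_json`, and `escape` are hypothetical names, and the JSON is hand-rolled only to keep the example dependency-free:

```rust
use std::collections::BTreeMap;

/// A hypothetical semantic diagnostic: a stable error code plus named
/// arguments, with no human-readable text baked in.
struct Diagnostic {
    code: &'static str,
    args: BTreeMap<&'static str, String>,
}

impl Diagnostic {
    /// Machine-readable form. Only `"` and `\` are escaped, which is
    /// enough for this sketch but not for arbitrary input.
    fn to_json(&self) -> String {
        let fields: Vec<String> = self
            .args
            .iter()
            .map(|(name, value)| format!("\"{}\":\"{}\"", name, escape(value)))
            .collect();
        format!(
            "{{\"code\":\"{}\",\"args\":{{{}}}}}",
            self.code,
            fields.join(",")
        )
    }
}

/// Escapes backslashes and double quotes for embedding in a JSON string.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}
```

A localized, human-readable string could then be produced separately by looking up a template for `code` in a language pack and filling in `args`, so tools and translators would consume the same value.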

@nrc nrc added the T-compiler label Sep 26, 2015

@fbstj


fbstj commented Sep 29, 2015

I agree with @withoutboats that machine readable debugging is probably much more easily translatable than embedding the i10n/etc inside everywhere.

@Nashenas88


Nashenas88 commented Oct 1, 2015

Let's not forget that pluralization is not the only thing that varies in translated strings. A huge class of languages does noun declension, where the spelling and pronunciation of a noun changes depending on its usage in a sentence, which can also be mixed with genders.

It's not just the messages we have now that would need to change, but the code around how some of those messages are generated too. There are some cases where we programmatically build up parts of the string (I'm not talking about the user's own code, but the messages themselves). This can lead to cases where figuring out how many strings need to be translated will be very difficult to do.

I'd also propose that when we do these translation files, the original English translation include an additional column for the context. This is usually a piece of text that gives more information about the text and the words used, to help a translator understand the context in which the translation is used. Not everyone coming up with translations is going to completely understand the code it's used in. The context can also describe the sections of the string that are replaced with user content, for example whether the value that appears in "{}" is an identifier or a constant value, etc. This can change the surrounding words depending on the language. I can't think of an exact example, but a similar one is translating the word "to go" into Russian. Russian doesn't have a general word for "to go", only expressions like "to drive", "to fly", "to walk", etc. Having a context description makes coming up with a readable, sensible translation much easier.

@apasel422

Member

apasel422 commented Oct 1, 2015

@Nashenas88 That's a good point. l20n makes solving these kinds of grammatical issues relatively simple.

@nikomatsakis nikomatsakis self-assigned this Oct 1, 2015

@Nashenas88


Nashenas88 commented Oct 2, 2015

@apasel422, I just looked up l20n, and I'm impressed by what it offers. I haven't had a chance to look at the code yet though; I hope it's something we could take advantage of easily.

@durka

Contributor

durka commented Oct 2, 2015

Maybe I missed something in the RFC, but how would this actually work? If the translations are regular format strings (with {}), then they have to be passed to format! at compile-time using some kind of include! shenanigans. But then rustc either has to be compiled to change locales, or it contains all the format strings in the binary and has big switch statements everywhere when printing. Alternatively, format strings could be interpreted at runtime with less safety. Which is proposed?

@Manishearth

Member

Manishearth commented Oct 2, 2015

There is machinery to iterate through format strings at runtime, it's easy to use that.
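
For illustration, runtime interpretation of a format string could look roughly like the sketch below. It does not use the compiler's own machinery; `render` is a hypothetical helper that handles only `{name}` placeholders and `{{`/`}}` escapes. Named placeholders also let a translation reorder its arguments:

```rust
use std::collections::HashMap;

/// Substitutes `{name}` placeholders at runtime. `{{` and `}}` produce
/// literal braces, mirroring `format!`. Since a template loaded at runtime
/// cannot be checked at compile time, problems are reported as `Err`.
fn render(template: &str, args: &HashMap<&str, String>) -> Result<String, String> {
    let mut out = String::new();
    let mut chars = template.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '{' if chars.peek() == Some(&'{') => {
                chars.next();
                out.push('{');
            }
            '}' if chars.peek() == Some(&'}') => {
                chars.next();
                out.push('}');
            }
            '{' => {
                // Collect the placeholder name up to the closing brace.
                let mut name = String::new();
                loop {
                    match chars.next() {
                        Some('}') => break,
                        Some(ch) => name.push(ch),
                        None => return Err(format!("unclosed placeholder `{{{}`", name)),
                    }
                }
                match args.get(name.as_str()) {
                    Some(value) => out.push_str(value),
                    None => return Err(format!("unknown argument `{}`", name)),
                }
            }
            '}' => return Err("unmatched `}`".to_string()),
            other => out.push(other),
        }
    }
    Ok(out)
}
```

The trade-off is the one raised in the question above: a bad translation file surfaces as a runtime error (or a fallback to English) instead of a compile error.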

@nikomatsakis

Contributor

nikomatsakis commented Oct 7, 2015

So there is a change I have been contemplating doing that is related to this RFC. It frequently happens in Rust that we have "multipart" error messages, like an error with several explanatory notes, and maybe a help suggestion as well. Currently, each of these is a distinct message, and each has its own span, and it's kind of a big mess.

Furthermore, many of the messages -- such as those produced by the borrow checker -- involve "program flow". We currently display this as a multipart message highlighting each point in the code, but this must be "cross-referenced" against the original source somehow.

On a related note, our messages often include a lot of terminology that users may not know. Research shows that even simple terms like "function body" can be confusing to new users and so on, to say nothing of things like "object type" or "lvalue". It'd be great if we could find a way to make these terms clearer to people.

I was hoping to address all of these points by allowing us to construct richer errors. The idea would be to have:

  1. Multipart errors, first of all.
  2. Markup and not just plain strings, where you can associate arbitrary terms with spans, and not just the entire message. The formatter would try to coalesce these spans into a single snippet, with the relevant text color-coded, so that e.g. if we say "the lvalue", it might be colored red, and the lvalue we are referring to in the source also colored red. Or something. There would probably have to be some kind of priorities too, so we can degrade gracefully when color is not available, or if we can't get all the spans into a single snippet.

Anyway, I bring this up here because while these goals are not directly I10N goals, they are obviously served by some of the measures in this RFC. Furthermore, for I10N purposes, I imagine these "multipart" messages ought to be a unit, since the right breakdown will probably be translated differently. I know that sometimes I have to really torture the phrasing to make it make grammatical sense in English, and I assume it would be near impossible to port that across languages.
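
As a sketch of what "a multipart message as a unit" might mean as a data structure, under heavy assumptions: every type, field, and message key below is hypothetical and far simpler than anything the compiler would need.

```rust
/// Byte range in the source file (a simplified, hypothetical span).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    lo: usize,
    hi: usize,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Level {
    Error,
    Note,
    Help,
}

/// One part of a message, optionally tied to a span in the source.
struct Part {
    level: Level,
    message_key: &'static str,
    span: Option<Span>,
}

/// A multipart diagnostic kept together as one unit, so that a translator
/// sees the error, its notes, and its help text side by side.
struct MultipartError {
    code: &'static str,
    parts: Vec<Part>,
}

impl MultipartError {
    /// Smallest span covering every part that has one: a candidate for
    /// the single snippet the formatter would try to display.
    fn covering_span(&self) -> Option<Span> {
        self.parts
            .iter()
            .filter_map(|part| part.span)
            .fold(None, |acc: Option<Span>, s| match acc {
                None => Some(s),
                Some(a) => Some(Span {
                    lo: a.lo.min(s.lo),
                    hi: a.hi.max(s.hi),
                }),
            })
    }
}
```

Keeping the parts in one value is what would let a language pack translate them together rather than as independent strings.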

@GuillaumeGomez

Member Author

GuillaumeGomez commented Oct 7, 2015

Markup and not just plain strings, where you can associate arbitrary terms with spans, and not just the entire message. The formatter would try to coalesce these spans into a single snippet, with the relevant text color-coded, so that e.g. if we say "the lvalue", it might be colored red, and the lvalue we are referring to in the source also colored red. Or something. There would probably have to be some kind of priorities too, so we can degrade gracefully when color is not available, or if we can't get all the spans into a single snippet.

I was working on something similar but only for rustc --explain E0000 (and I didn't finish it because rustdoc must be modified before).

Giving newcomers (and even more experienced Rust users!) better access to information deserves a little more consideration.

@seanmonstar

Contributor

seanmonstar commented Oct 18, 2015

I've updated l20n so that it now compiles on stable rust. The parsing and resolving works, but locale negotiation is non-existent at the moment. (repo: https://github.com/seanmonstar/l20n.rs)

With some more work, it could be possible to use syntax extensions (or codegen like serde) to compile the l20n templates into rust code at compile time, instead of runtime. (Runtime should stay though, since it's also a possible strategy for an application to download updated language resources and need to compile them at runtime.)

@graydon


graydon commented Nov 12, 2015

I will play community memory here and point out that the current format string syntax in rust was chosen explicitly to be compatible with ICU MessageFormat syntax (itself derived from Java's). This is the standard (and has been worked over very thoroughly to accommodate variations in plural, gender and similar dimensions).

http://icu-project.org/apiref/icu4j/com/ibm/icu/text/MessageFormat.html

@graydon


graydon commented Nov 12, 2015

(The most thorough conversation we had about this was in 2013, starting with https://mail.mozilla.org/pipermail/rust-dev/2013-May/003999.html ... There are lots of arguments and informative links in that thread.)

@nikomatsakis

Contributor

nikomatsakis commented Jan 8, 2016

We discussed this RFC in the @rust-lang/compiler meeting yesterday. The rough consensus was that it is too early to "internationalize" the compiler, even though we would like to do so eventually. Even when just considering English, it is difficult to maintain the quality of error messages, and adding other languages into the mix would be a significant burden. It's also hard for us to judge the quality of those error messages. That said, we are doing some work on overhauling the error reporting infrastructure for IDE integration and better usability, and I expect that this should make internationalization easier longer term (though at the moment we have not been focusing on extracting the text of the messages themselves outside of the compiler). Therefore, I'm inclined to close this RFC for the time being (and open a corresponding issue), but I'd like to hear feedback on that first.

@GuillaumeGomez

Member Author

GuillaumeGomez commented Jan 8, 2016

What @Manishearth and I had in mind was more to provide the structure that lets users add localizations themselves (so the Rust team would not have to internationalize anything). However, I approve of this way of doing it; for now, other steps need to be done before going further into this.

@nikomatsakis

Contributor

nikomatsakis commented Jan 8, 2016

@GuillaumeGomez OK. I will close then for now, but thanks for the interesting discussion.
