Add multilingual support #5

Open
azerupi opened this Issue Jul 29, 2015 · 28 comments

10 participants
@azerupi
Collaborator

azerupi commented Jul 29, 2015

Add support for multiple languages.

@azerupi azerupi added the Enhancement label Jul 29, 2015

@FuGangqiang

Contributor

FuGangqiang commented Aug 12, 2015

multiple languages for document?

@azerupi

Collaborator

azerupi commented Aug 12, 2015

Yes, I think Gitbook does support something like that.

Instead of having the markdown files directly in the source folder you would have some sub folders like this:

src/
├── de
├── en
└── fr

And there would be an easy way to change the language in the rendered book.

It's definitely something I would like to add, but it's not the highest priority at the moment
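A minimal sketch of how a tool could discover the language folders in a layout like the one above. The name `find_languages` and the rule that every sub-folder of `src/` counts as a language are assumptions for illustration, not existing mdBook behaviour.

```python
from pathlib import Path
import tempfile


def find_languages(src: Path) -> list[str]:
    """Return the names of the sub-folders of `src`, sorted alphabetically."""
    return sorted(p.name for p in src.iterdir() if p.is_dir())


# Build the example layout in a temporary directory and list it.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    for lang in ("de", "en", "fr"):
        (src / lang).mkdir(parents=True)
    languages = find_languages(src)

print(languages)
```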

@azerupi

Collaborator

azerupi commented Jan 12, 2016

Multiple designs possible:

  • One SUMMARY.md to rule them all

    pros:

    • Changes in structure are reflected in all languages immediately
    • 1 to 1 mapping from pages from one language to another, would allow changing the language of the page directly from a menu button

    cons:

    • If one language is lagging behind it's going to get ugly
  • One SUMMARY.md for every language

    pros:

    • Every language can have its own pace

    cons:

    • Does not push all languages to be up to date / coherent
    • No 1 to 1 mapping guarantee and thus not possible to toggle the language from a page without having the risk that the page does not exist in the other language
@mkpankov

mkpankov commented Jan 13, 2016

I don't think one SUMMARY.md for everything is a good idea. I consider consistency within translated version more important than consistency with original. Otherwise, we can easily start having broken links because upstream renamed some chapter and translation didn't, yet. I believe a book that has no broken links is the minimum standard.

Also, I don't support the idea of "pushing" to be up-to-date. AFAIK, translations (not only ours) are done by enthusiasts and it's not always possible to keep up at all times.

Moreover, 1 to 1 mapping of pages doesn't look straightforward to me, even in case there's single SUMMARY. Words have different length in different languages, and in Russian translation we consistently have sentences that are noticeably longer than original. But I'd love to have it so that one click can show the same point in text in original language.

I think this can be handled by tracking 1-to-1 mapping of paragraphs - sections aka markdown files are too big. Paragraphs also seem a good candidate because sentences get paraphrased and reordered sometimes, but the paragraphs stay in same order and have same gist.
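To make the paragraph idea concrete, here is a rough sketch that pairs paragraphs by position. It assumes paragraphs are separated by blank lines and that a translation keeps them in the same order; the function names are hypothetical.

```python
import re


def split_paragraphs(markdown: str) -> list[str]:
    """Split Markdown text into paragraphs separated by blank lines."""
    blocks = re.split(r"\n\s*\n", markdown.strip())
    return [b.strip() for b in blocks if b.strip()]


def pair_paragraphs(original: str, translation: str) -> list[tuple[str, str]]:
    """Pair the n-th paragraph of the original with the n-th of the translation.

    zip() stops at the shorter text, so trailing untranslated paragraphs
    are simply left unpaired.
    """
    return list(zip(split_paragraphs(original), split_paragraphs(translation)))


english = "Ownership comes first.\n\nBorrowing comes next."
french = "La possession vient en premier.\n\nL'emprunt vient ensuite."
pairs = pair_paragraphs(english, french)
```

Positional pairing breaks as soon as a translator merges or splits a paragraph, so a real implementation would probably need explicit anchors.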

@azerupi

Collaborator

azerupi commented Jan 13, 2016

Thanks for the input! I really appreciate the feedback :)

Otherwise, we can easily start having broken links because upstream renamed some chapter and translation didn't, yet. I believe a book that has no broken links is the minimum standard.

Moreover, 1 to 1 mapping of pages doesn't look straightforward to me

When I am talking about 1 to 1 mapping I am talking about page to page mapping, not sentence to sentence (that would be insane 😉).

Let's take a hypothetical situation with the Rust book. Let's say I am reading a blog post and it references some chapter in the Rust book, for example the chapter about ownership. But English is not my main language and it would be a lot easier to understand the chapter in my native language. If we have 1 to 1 mapping on page / chapter level the user could then select his language (if it is supported) from a dropdown menu and he would land on the exact same page in his chosen language.

However for this to work correctly we need a guarantee that every page in one language has an equivalent page in the other language. If you allow a different SUMMARY.md per language there is no way to know what pages are equivalent if any equivalent page even exists at all.

Also, I don't support the idea of "pushing" to be up-to-date. AFAIK, translations (not only ours) are done by enthusiasts and it's not always possible to keep up at all times.

Of course, I totally agree with you. But the SUMMARY.md is only about structure, so what order the chapters come in, not the content.

If there is one SUMMARY.md for all languages I think it will only cause trouble if:

  1. New chapters get added, as the equivalent chapters in other languages will just be blank until they are translated
  2. The markdown files get renamed. This should not happen often, and when it does it is not difficult to rename the files accordingly for every language
  3. A reordering of the chapters where the continuity of the content is broken. This too should not happen often, but it's more challenging to fix as it requires the translators to translate the text that changed

To be honest, once a book has its definitive structure the SUMMARY.md is not likely to change often unless there is a major rewrite being done.

I think both designs have advantages and drawbacks, we need to figure out which one we want / need the most.


Idea for Rust book workflow when translations are in tree

When / if translations are moved into the official repository we could create a more elaborate pull request process. This is only an idea, it may be flawed 😉

When a pull request is made that contains changes that need translation (i.e. not just typo fixes) we could wait to merge the pull request until translations have been made for all officially supported languages.

The pull request could track what translations have been made using a check list like this:

  • Russian
  • French
  • German

Once all the translations are ready the pull request is merged in.
Officially supported languages could be languages with a minimum number of "official" maintainers.

This would add a little / lot of overhead for the English version but it would solve the two big issues with translations.

  1. Translations would always be up to date!
  2. This is probably the easiest way to track changes

There may be organizational problems I haven't considered though. @steveklabnik

@steveklabnik

Member

steveklabnik commented Jan 14, 2016

The biggest problem with blocking English changes to non-English changes is that I am paid for my work, but others are not. This places a big burden on them; I'm gonna want to land changes ASAP, and that's not fair to people who can't do this as a day job.

@azerupi

Collaborator

azerupi commented Jan 14, 2016

That's true, didn't think of that.
It could still be applied without blocking the English changes? Just for tracking. Not sure if it's worth the overhead though.

Anyways, do you have a preference for any of the two design choices (one vs. multiple SUMMARY.md)?

@steveklabnik

Member

steveklabnik commented Jan 14, 2016

I think I prefer a single for the reasons you've stated, but since I'm not doing the translations themselves, I don't think my opinions matter much :)

And yeah, tracking might be different/better than actually blocking on them landing.

@mkpankov

mkpankov commented Jan 14, 2016

When I am talking about 1 to 1 mapping I am talking about page to page mapping, not sentence to sentence (that would be insane 😉).

Ok, I think what I was trying to say but couldn't get across is this: page-to-page mapping isn't enough for printed versions, as same pages will have different content. And if by page you meant a web page, that is not enough either. Some sections (pages) are tens of screens long, and to provide smooth transition from one version to another we should track smaller units than entire files (web pages).

I originally thought you were talking about printed pages and wrote the following, but I'm not sure now. For printed versions, depending on the length of the section and the sentence-length difference with the original, this can vary from "I see not the beginning of the paragraph that talks about Foo feature, but the end" to "I don't see the paragraph that talks about Foo feature on screen at all", when linked to "page 83 of PDF".

So let's clarify the terms before continuing as apparently I misunderstood something 😄

@azerupi

Collaborator

azerupi commented Jan 14, 2016

Ok yes, I will try to do my best to explain what I envision:

So in this issue I am not at all talking about tracking any changes for translations, only about how to support multiple languages in the same folder / book.

Before I continue, let's explain what the SUMMARY.md does exactly.

When you render the book (mdbook build) it is going to search for the SUMMARY.md and parse it. The SUMMARY dictates

  • The Order of the chapters.
  • The names of the chapters.
  • The markdown file corresponding to each chapter.

That is the "only" information we get from the SUMMARY.md
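As a rough illustration of those three pieces of information, the sketch below pulls them out of a flat summary. It is a simplification written for this discussion, not mdBook's actual parser: it assumes one `- [name](path)` item per line and ignores nesting and everything else.

```python
import re

# Matches list items such as "- [hello world](hello-world.md)".
ITEM = re.compile(r"^\s*[-*]\s*\[(?P<name>[^\]]*)\]\((?P<path>[^)]*)\)\s*$")


def parse_summary(text: str) -> list[tuple[str, str]]:
    """Return (chapter name, markdown path) pairs in the order they appear."""
    chapters = []
    for line in text.splitlines():
        match = ITEM.match(line)
        if match:
            chapters.append((match["name"], match["path"]))
    return chapters


summary = """# Summary

- [hello world](hello-world.md)
- [second chapter](second-chapter.md)
"""
chapters = parse_summary(summary)
```

The list order gives the chapter order, and each tuple carries the chapter name and its markdown file.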

If we want to support multiple languages for one book, there are two possible designs (that I thought of):

  • One SUMMARY.md at the root of the source directory that will be used for all languages.
  • SUMMARY.md for every language

Let's see both in more detail.

One SUMMARY.md for all languages

Consider this SUMMARY.md for a book:

# Summary

- [hello world](hello-world.md)
- [second chapter](second-chapter.md)

and this directory structure:

├── book
└── src
    ├── en
    │   ├── hello-world.md
    │   └── second-chapter.md
    ├── fr
    │   ├── hello-world.md
    │   └── second-chapter.md
    ├── ru
    │   ├── hello-world.md
    │   └── second-chapter.md
    └── SUMMARY.md

As you can see here, every language has the same markdown files defined in the global SUMMARY.md. This means that the "hello world" chapter has a corresponding page in every language! (1 to 1 mapping)
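With a single global SUMMARY.md, the 1 to 1 guarantee can be checked mechanically. The sketch below is only an assumption about how such a check might look: `missing_chapters` is a made-up name, and the file sets stand in for what a directory scan would return.

```python
def missing_chapters(summary_paths: list[str],
                     files_by_language: dict[str, set[str]]) -> dict[str, list[str]]:
    """For each language, list the files named in the summary that do not exist."""
    return {
        lang: [path for path in summary_paths if path not in files]
        for lang, files in files_by_language.items()
    }


summary_paths = ["hello-world.md", "second-chapter.md"]
files_by_language = {
    "en": {"hello-world.md", "second-chapter.md"},
    "fr": {"hello-world.md"},  # second chapter not translated yet
    "ru": {"hello-world.md", "second-chapter.md"},
}
report = missing_chapters(summary_paths, files_by_language)
```

An empty list for every language means the 1 to 1 mapping holds; anything else is a chapter that would render blank.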

Advantages

Having a guarantee that every chapter in one language has a corresponding chapter in another language gives us the possibility to change the language from any chapter and land on that same chapter in the other language.

Example: I am reading the "borrowing" chapter of the Rust book. I want to see that same chapter in French. I just select "French" from the dropdown button in the menu-bar and I will land on the French version of the chapter.

Drawbacks

When the SUMMARY.md is modified it can cause some consistency problems in the translations because changes in the SUMMARY.md will be reflected immediately in all languages. However, changes in the SUMMARY.md should be relatively rare once the book has found its "final" structure.

Problems that could occur:

  • Chapter is moved: When a chapter is moved (the order of the chapters is rearranged) it could cause problems with text flow.
  • Markdown file is renamed: When a markdown file is renamed it should be renamed in all languages and in all the references to it. This should not be too big of a problem.
  • New chapter is added: When a new chapter is added it will appear blank in the other languages until it's translated.

Content is not modified by the SUMMARY.md, so neither design will cause trouble with the content when the SUMMARY.md is modified.

Another drawback is that I am not sure yet how translations will give a translation for the chapter titles in the sidebar (SUMMARY.md). Maybe just take the first heading from the corresponding markdown file?
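A small sketch of the "first heading" idea, assuming ATX-style headings (`# Title`) and a fallback to the name from the SUMMARY.md when a file has none. The function name is hypothetical, and the sketch does not skip headings that happen to sit inside fenced code blocks.

```python
import re

# An ATX heading: one to six '#', whitespace, the title, optional closing '#'.
HEADING = re.compile(r"^#{1,6}\s+(.*?)\s*#*\s*$")


def first_heading(markdown: str, fallback: str) -> str:
    """Return the text of the first heading, or `fallback` if there is none."""
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            return match.group(1)
    return fallback


title = first_heading("# Bonjour le monde\n\nDu texte.", "hello world")
untitled = first_heading("Du texte sans titre.", "hello world")
```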

One SUMMARY.md for EVERY language

Let's consider this directory structure:

├── book
└── src
    ├── en
    │   ├── hello-world.md
    │   ├── second-chapter.md
    │   └── SUMMARY.md
    ├── fr
    │   ├── hello-world.md
    │   └── SUMMARY.md
    └── ru
        ├── hello-world.md
        ├── second-chapter.md
        └── SUMMARY.md

As you can see here, every language has its own SUMMARY.md and thus can define the order of their chapters and the markdown files as they wish.

There is absolutely no more guarantee that the French version contains the same chapters as the English version. No 1 to 1 mapping. Essentially every language is its own separate book, they could have exactly the same structure or they could have totally different chapters. There is no way for the program to know that.

It is thus impossible to change the language from a chapter. You would have to navigate to the French version manually and search for the chapter you were reading, if it exists in the French version at all!

Advantages

Translations have a lot more freedom, but this can also be seen as a drawback. Translations do not need to have the same structure, so when the SUMMARY.md is changed in the English version, absolutely nothing is going to change in the other languages. Every change in the translations has to be done manually.

Drawbacks

There is no guarantee that a chapter in one language has an equivalent in another language (no 1 to 1 mapping). The program cannot know what chapters are equivalent in the different languages and it would thus be impossible to change the language from a chapter to land on the same chapter in the other language.


I hope this made it more clear, if there is still something you don't understand I can elaborate more on some specific area. 😉

EDIT: A little quote from a response I made on Rust's internals forum:

And to be honest, if you have different TOCs you essentially have different books. There is little gain to support that, other than being able to group all the translations in one directory and build them in one go.

You can already group the multiple translations in one directory as different books, each with its own SUMMARY.md and book.json, and if you configure the source and destination directories correctly there should be minimum trouble to integrate with automatic deployment scripts etc.

@defuz

defuz commented Jan 14, 2016

There is no guarantee that a chapter in one language has an equivalent in another language.

Regarding the Rust Book translation process, this is not a disadvantage of a particular solution, but simply a fact. I think that the other projects that will use mdBook with multiple languages will have the same problem.

The program can not know what chapters are equivalent in the different languages and it would thus be impossible to change the language from a chapter to land on the same chapter in the other language.

Can we make it simple and assume that the files with the same name in different languages are the same chapter? Then we can give the opportunity to switch to another language. I think this approach will satisfy both cases:

  1. When there is complete consistency between all languages.
  2. When consistency between languages is not complete.
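
The same-name rule could look roughly like this. It is a sketch under the assumption that a chapter is identified by its path relative to the language folder; `available_languages` is a hypothetical helper, and the file sets mirror the per-language example earlier in the thread.

```python
def available_languages(page: str, files_by_language: dict[str, set[str]]) -> list[str]:
    """Return the languages that contain a file with the same relative path."""
    return sorted(lang for lang, files in files_by_language.items() if page in files)


files_by_language = {
    "en": {"hello-world.md", "second-chapter.md"},
    "fr": {"hello-world.md"},
    "ru": {"hello-world.md", "second-chapter.md"},
}
everywhere = available_languages("hello-world.md", files_by_language)
partial = available_languages("second-chapter.md", files_by_language)
```

The language menu would then offer only the languages in the returned list, which covers both the fully consistent and the partially consistent case.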
@defuz

defuz commented Jan 14, 2016

Also, I don't like the idea that when I read the book in Russian, I'll see TOC in English. I think we should not assume that the reader is familiar enough with the language of original to understand the chapter titles.

@azerupi

Collaborator

azerupi commented Jan 14, 2016

When consistency between languages is not complete.

How would you handle that? On some pages you can change the language and on others not? That would be really confusing for users I think.

Also, I don't like the idea that when I read the book in Russian, I'll see TOC in English.

Of course that was not the plan, I just hadn't found a good solution for it yet so I didn't discuss it too much

@defuz

defuz commented Jan 14, 2016

How would you handle that? On some pages you can change the language and on others not? That would be really confusing for users I think.

Why not? We can clearly indicate that the translation for this chapter is not available yet. Another possible situation is that translation for some languages is available, but for other languages it's not.

@defuz

defuz commented Jan 14, 2016

Another example that I care about.

Let's compare the structure of the section "Getting started" in the nightly and stable books. As you can see, Steve joined 4 chapters into one. Imagine that not all the language versions supported this change yet. If we have common TOC, this means that there is no possibility to open "Installing Rust", "Hello World" and "Hello Cargo" chapters in non-English version of book, because they do not exist in the original TOC anymore.

@azerupi

Collaborator

azerupi commented Jan 14, 2016

Yes I totally agree with you! This would be a big problem. However I am not sure I want to settle with the solution Gitbook proposes either. Maybe we can come up with something better that combines all the advantages and none of the drawbacks? (even if it's a little more complex)

Gitbook uses the "one SUMMARY.md per language" method and to be honest I don't think it is real multilingual support. They essentially have one book per language, with no cross-linking between the different languages except on a landing page...

I think you could already achieve something very similar with mdBook with multiple books and configuring the source and output directories according to what you want. The only difference is that Gitbook makes it just a little bit easier to setup.

@defuz

defuz commented Jan 14, 2016

My suggestion is to have "one SUMMARY.md per language", but support page-to-page cross-linking between the different languages. The easiest way to do this is to consider that the files with the same name are the same chapters. In 99% this should work. A more complex way to do this is to add some kind of identifier to each file (something like UUID). If the identifiers of the files are identical, we can cross-link them.
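For the identifier variant, one possible shape is sketched below. Everything about the format is an assumption made up for illustration: the `<!-- id: ... -->` comment, the helper names, and the in-memory `books` mapping standing in for files on disk.

```python
import re

# Hypothetical marker placed somewhere in each chapter file.
ID_COMMENT = re.compile(r"<!--\s*id:\s*([\w-]+)\s*-->")


def chapter_id(markdown: str):
    """Return the identifier declared in the file, or None if there is none."""
    match = ID_COMMENT.search(markdown)
    return match.group(1) if match else None


def cross_links(books: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Map each identifier to the file that carries it in every language."""
    links: dict[str, dict[str, str]] = {}
    for lang, files in books.items():
        for path, content in files.items():
            cid = chapter_id(content)
            if cid is not None:
                links.setdefault(cid, {})[lang] = path
    return links


books = {
    "en": {"ownership.md": "<!-- id: ch-ownership -->\n# Ownership"},
    "fr": {"possession.md": "<!-- id: ch-ownership -->\n# Possession"},
}
links = cross_links(books)
```

Unlike the same-name rule, this keeps working when a translation renames its files, at the cost of an extra annotation in every chapter.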

@azerupi

Collaborator

azerupi commented Jan 14, 2016

Hmm yes that might be a good compromise. At least if the translations don't diverge too much from the original. I will try to think about this a little more and see if I can come up with other ideas.

Thanks for the valuable input! :)

@mdinger

Contributor

mdinger commented Jan 1, 2017

FWIW, there are tools to handle translations which I didn't see mentioned here yet. For example, Crowdin is used (or was when I was involved) over at FreeCAD for document translation of their wiki. It was noteworthy that when an update was made to an English file, the plugin would notify you that the other translations need to be updated for that specific section or they would be out of date. The page linked above actually lists how complete each language translation is and maintains that information.

It is possible a tool like crowdin could just be added to the build process as a plugin which has been notified of which files require translating. Then it will maintain the database itself somewhere and you could tell mdbook where the translated files are located.

A solution like this seems worth the time exploring before spending effort creating a new ground up approach to solve the same problem.


EDIT: Also note they offer free support to open source projects

@tyoc213

tyoc213 commented Jun 24, 2017

For your information, what about a single file for the source?

like

[es]
Esto es un ejemplo
[en]
This is an example
[fr]
Ceci est un exemple

[es]
Esto no
[fr]
Ce n'est pas

Well, just saying :) (I mean, for example, for making a book/tutorial with code examples it would be better to have only one copy of the source code but the explanation in different languages.)

And sure, switching between languages could be possible, and if there is no paragraph, show the default language of the document.
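A rough sketch of how such a single-file source could be read. It assumes that blank lines separate paragraphs, that a line consisting only of a two-letter tag like `[es]` starts a language variant, and that a missing variant falls back to the document's default language; the function names are made up.

```python
import re

# A line holding only a two-letter language tag, e.g. "[es]".
TAG = re.compile(r"^\[([a-z]{2})\]$")


def parse_blocks(source: str) -> list[dict[str, str]]:
    """Turn the source into one {language: text} dict per paragraph."""
    blocks = []
    for chunk in re.split(r"\n\s*\n", source.strip()):
        variants: dict[str, list[str]] = {}
        lang = None
        for line in chunk.splitlines():
            match = TAG.match(line.strip())
            if match:
                lang = match.group(1)
                variants[lang] = []
            elif lang is not None:
                variants[lang].append(line)
        blocks.append({code: "\n".join(lines) for code, lines in variants.items()})
    return blocks


def render(blocks: list[dict[str, str]], lang: str, default: str) -> list[str]:
    """Pick each paragraph in `lang`, falling back to `default` when missing."""
    return [block.get(lang, block.get(default, "")) for block in blocks]


source = """[es]
Esto es un ejemplo
[en]
This is an example
[fr]
Ceci est un exemple

[es]
Esto no
[fr]
Ce n'est pas
"""
blocks = parse_blocks(source)
english = render(blocks, "en", default="es")
```

Here the second paragraph has no English variant, so the Spanish default is shown in its place.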

@sebras

sebras commented Aug 6, 2017

How about a src/SUMMARY.md specifying the default chapter structure expected for all languages that are up to date and forcing specialized src/*/SUMMARY.md for the languages that have not yet made similar changes? This puts the penalty on the translations who have to keep a separate SUMMARY.md around for some time and do work to be up to date. The con is that the person updating the English translation does a minor amount of work when, in essence, causing the translation to fork.

So the rule would be: src/*/SUMMARY.md has higher precedence than src/SUMMARY.md

├── book
└── src
    ├── SUMMARY.md
    ├── en
    │   ├── hello-world.md
    │   └── second-and-third-chapter-combined.md
    ├── fr
    │   ├── SUMMARY.md
    │   ├── hello-world.md
    │   ├── second-chapter.md
    │   └── third-chapter.md
    └── ru
         ├── hello-world.md
         └── second-and-third-chapter-combined.md

Consider e.g. the case you mentioned above where the original English book combined several chapters into one (or conversely split one into many). In this case the English translation would need to update src/SUMMARY.md, at this point the English author copies src/SUMMARY.md into each translation not yet updated. Hopefully these src/*/SUMMARY.md only stay around for a short period of time until the translations are updated accordingly.

In the example above before the English original text combined its chapters, src/SUMMARY.md is copied into src/fr/SUMMARY.md and src/ru/SUMMARY.md, next the English original text combines src/en/second-chapter.md and src/en/third-chapter.md into src/en/second-and-third-chapter-combined.md and updates src/SUMMARY.md to refer to the new second-and-third-chapter-combined.md (which at this point only exists in en). Some time later perhaps src/ru/second-and-third-chapter-combined.md is created at which point src/ru/SUMMARY.md may be deleted. src/fr might not yet have been updated so its src/fr/SUMMARY.md stays around a bit longer. Once all languages are updated their specialized src/*/SUMMARY.md can all be deleted and all languages can again rely on the default src/SUMMARY.md.
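
The precedence rule itself is small. The sketch below assumes the set of existing files is already known; `summary_for` is a hypothetical name, and the paths follow the tree above.

```python
def summary_for(lang: str, files: set[str]):
    """Pick the SUMMARY.md for `lang`: the language-specific one wins if present."""
    specific = f"src/{lang}/SUMMARY.md"
    if specific in files:
        return specific
    if "src/SUMMARY.md" in files:
        return "src/SUMMARY.md"
    return None  # no summary at all


files = {
    "src/SUMMARY.md",
    "src/en/hello-world.md",
    "src/en/second-and-third-chapter-combined.md",
    "src/fr/SUMMARY.md",
    "src/fr/hello-world.md",
    "src/fr/second-chapter.md",
    "src/fr/third-chapter.md",
    "src/ru/hello-world.md",
    "src/ru/second-and-third-chapter-combined.md",
}
chosen = {lang: summary_for(lang, files) for lang in ("en", "fr", "ru")}
```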

Do you think an approach like this is feasible and desirable?

I'm eager to do a translation of the Rust book, so I'd like for mdbook to resolve this bug and support translations, hence I'm trying to help you make progress. :)

@azerupi

Collaborator

azerupi commented Aug 6, 2017

Thank you for your input!

Do you think an approach like this is feasible and desirable?

Unfortunately, I don't think this will work well in practice because there is a lot of overhead for the author of the original text. Every time the original texts diverge, the burden is on the author to copy over the old summary to the translations before making a change. If he forgets, things will break; this seems very error prone.

I am more in favour of having one summary per language, cross-link files with the same name. This approach is, in my opinion, simpler to understand and doesn't require any extra work when the original text and the translations diverge.

I hope to make progress on this issue in the "near" future, we are slowly reworking parts of the internals to make it possible.

@sebras

sebras commented Aug 8, 2017

I am more in favour of having one summary per language, cross-link files with the same name.

If there is one SUMMARY.md per language, what forces the files containing chapters to be named the same way in every language? I do agree about this design being less work for the original author of course. :)

@azerupi

Collaborator

azerupi commented Aug 8, 2017

If there is one SUMMARY.md per language, what forces the files containing chapters to be named the same way in every language?

Nothing, it would be a convention. A translation would keep the same file structure and just modify the content of the files. If the translations diverge, you lose cross-linking but everything still works.

I am open to alternative ideas, but I think we should go with something that has minimal friction. :)

@sebras

sebras commented Aug 8, 2017

If the translations diverge, you lose cross-linking but everything still works.

That's a good point. Maybe mdBook can warn if this is the case?
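Such a warning could be as simple as a set difference against the default language. This is only a sketch: the message wording, the function name, and the idea of comparing relative file names are assumptions, not something mdBook does today.

```python
def divergence_warnings(default_lang: str,
                        files_by_language: dict[str, set[str]]) -> list[str]:
    """Report files that exist on only one side of default vs. translation."""
    base = files_by_language[default_lang]
    warnings = []
    for lang in sorted(files_by_language):
        if lang == default_lang:
            continue
        files = files_by_language[lang]
        for path in sorted(base - files):
            warnings.append(f"warning: {lang}/{path} is missing (present in {default_lang})")
        for path in sorted(files - base):
            warnings.append(f"warning: {lang}/{path} has no counterpart in {default_lang}")
    return warnings


files_by_language = {
    "en": {"hello-world.md", "second-and-third-chapter-combined.md"},
    "fr": {"hello-world.md", "second-chapter.md", "third-chapter.md"},
}
warnings = divergence_warnings("en", files_by_language)
```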

I am open to alternative ideas, but I think we should go with something that has minimal friction. :)

Yes, absolutely. I was worried there was no progress because of a lack of design discussion, hence my suggestion to try to help you decide. I don't know the mdBook code base (or Rust) yet. :)

@sebasmagri

sebasmagri commented Aug 19, 2017

Hi!

I'm probably going to reiterate on some already discussed topics but I'd still like to describe this case hoping it's useful to define the best mechanism for book translations in mdbook.

So I've been trying to define a process we could recommend for a localisation team to tackle tasks such as The Rust Programming Language book translation.

One of the things is how to integrate translated contents with the build output. For this specific case, and after having asked the docs team for feedback, it should be easier to handle all of the book contents independently in its own directory, including SUMMARY.md. This would allow the book translators to work in a completely independent way by forking the book repository and probably integrating it back as git submodules in the original repo. There would not be any kind of enforcement on the document's internal structure nor on the phrase-level content of translations.

Another thing is how to link translated content in the output. It could be linked on a per document fashion by mapping translations using the exact file name, in which case we'd have folder structure enforcement, or it could be linked only on the front page, in which case translation would have complete freedom on the folder structure, and even the Tree/Table of Contents. In the latter case, the contents tree guidelines could be defined by maintainers but not enforced at all by the tooling.

These two features or mechanisms, however, might not work for people wanting to use tools such as crowdin, transifex or weblate to manage their translations, which is probably more adequate for software translation than for book translations. To support this case mdbook might need to generate a paragraph level mapping of translations and probably support output to any standard internationalization format such as gettext's PO files or L20N.
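
To give an idea of what a paragraph-level export might involve, here is a minimal sketch that emits one gettext-style entry per paragraph. It assumes blank-line paragraph separation, uses the paragraph number (not a line number) in the `#:` reference comment, and leaves out the header entry a real PO file would normally carry. The function names are hypothetical.

```python
import re


def po_escape(text: str) -> str:
    """Escape backslashes, double quotes and newlines for a PO string."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_po(markdown: str, source_name: str) -> str:
    """Emit one msgid/msgstr pair per paragraph, with an empty translation."""
    entries = []
    paragraphs = re.split(r"\n\s*\n", markdown.strip())
    for index, paragraph in enumerate(paragraphs, start=1):
        entries.append(
            f"#: {source_name}:{index}\n"
            f'msgid "{po_escape(paragraph.strip())}"\n'
            f'msgstr ""'
        )
    return "\n\n".join(entries) + "\n"


po_text = to_po('# Title\n\nHello "world".', "hello-world.md")
print(po_text)
```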

I'm absolutely willing to dedicate some time to this feature since this could be one of the primary goals of the localisation team. So of course I'm completely open to any kind of feedback and collaboration so we can lay out a plan to implement this.

Regards,

@azerupi

Collaborator

azerupi commented Aug 20, 2017

Hi @sebasmagri

Thank you for the input! I would love to work together with the concerned parties to end up with a strong design that is both useful for simple and more complex requirements.

Currently, the design we are considering is the following:

To make a book multi-lingual, you would have to add some information to the configuration file:

[languages]
en = { name = "English", default = true }
fr = { name = "Français" }
# OR alternatively
# [languages.en]
# name = "English"
# default = true
#
# [languages.fr]
# name = "Français"

For the example above, we would expect to have sub-folders in the src directory, matching the keycodes en and fr used in the config, containing the source files for each language.

We could imagine having an optional source = "path" key in the language tables for more flexibility. This would then allow the submodule scenario you described.
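
A sketch of how the proposed tables could be interpreted once parsed. The dict below mirrors what a TOML parser would return for the config above, with a hypothetical `source` key added to `fr`; the helper names, the `src/<code>` default, and the "exactly one default" rule are assumptions about a design that is still under discussion.

```python
def source_dirs(languages: dict[str, dict]) -> dict[str, str]:
    """Resolve each language's source folder, defaulting to src/<code>."""
    return {code: table.get("source", f"src/{code}") for code, table in languages.items()}


def default_language(languages: dict[str, dict]) -> str:
    """Return the single language marked default = true."""
    defaults = [code for code, table in languages.items() if table.get("default", False)]
    if len(defaults) != 1:
        raise ValueError(f"expected exactly one default language, found {len(defaults)}")
    return defaults[0]


# Parsed form of the [languages] table, plus a hypothetical `source` override.
languages = {
    "en": {"name": "English", "default": True},
    "fr": {"name": "Français", "source": "translations/fr"},
}
dirs = source_dirs(languages)
default = default_language(languages)
```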

We also think it is better to have a SUMMARY.md file for each translation. This allows translations to diverge without breaking the build.

For the HTML output, we consider cross-linking chapters from different languages based on the file structure. An English chapter called src/en/chapter_2/lifetimes-in-a-nutshell.md would be mapped to all the same chapters in different languages src/*/chapter_2/lifetimes-in-a-nutshell.md. This has the advantage of being simple and degrading gracefully when translations diverge. So if authors want cross-language linking they would have to keep the same structure, but if they don't or the structure diverges, the books will still build fine with non-matching chapters pointing to the index when changing languages.
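
The graceful fallback could be sketched like this. The output URL layout (`<lang>/<path>.html` and `<lang>/index.html`) and the function name are assumptions for illustration only.

```python
from pathlib import PurePosixPath


def language_link(page: str, target_lang: str,
                  files_by_language: dict[str, set[str]]) -> str:
    """Link to the same chapter in `target_lang`, or to its index if absent."""
    if page in files_by_language.get(target_lang, set()):
        return f"{target_lang}/{PurePosixPath(page).with_suffix('.html')}"
    return f"{target_lang}/index.html"


files_by_language = {
    "en": {"chapter_2/lifetimes-in-a-nutshell.md"},
    "fr": {"chapter_2/lifetimes-in-a-nutshell.md"},
    "de": {"chapter_2/lebenszeiten.md"},  # diverged file name
}
page = "chapter_2/lifetimes-in-a-nutshell.md"
to_french = language_link(page, "fr", files_by_language)
to_german = language_link(page, "de", files_by_language)
```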

To support this case mdbook might need to generate a paragraph level mapping of translations and probably support output to any standard internationalization format such as gettext's PO files or L20N.

This seems very complex? I am not very familiar with this issue but it seems to me that it would either require a lot of manual annotations for correct paragraph mapping or some heuristics. I would think this is (currently) out of scope for mdBook. Let's first focus on having basic but strong multi-lingual facilities and eventually expand from there. :)

Does that correspond to the requirements of the localisation team? If there is anything I missed or there are additional requirements that haven't been considered, please feel free to post 😉

I'm absolutely willing to dedicate some time to this feature since this could be one of the primary goals of the localisation team.

That would be wonderful, I am particularly interested in the perspective of the Rust project on this issue because I think they will be the ones using this feature the most.

@cauebs

cauebs commented May 6, 2018

Just to resurface what @mattico said at #687

This should be fairly straightforward:

  1. Add a config option to set the default language.
  2. Determine & document the folder structure used for the translations.
  3. Change index generation to ignore translations.
  4. Set the lang template parameter in hbs_renderer based on the page path + default config.
  5. Add a menu to the html template + stylus.
  6. In hbs_renderer, look for different versions of the current page and add them to the template parameter.
  7. Set the language used to generate the search index.
  8. Add a cargo feature to disable this functionality, since rustc can't have the search language support due to licensing issues.
    Edit:
    Links might also need to be adjusted so they point at the page for the current language. This might not be necessary if the correct relative links are used, I'd have to check.
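
As an illustration of step 4 only, the sketch below derives a page's language from its path. It assumes the first path component names the language folder and is written in Python for brevity rather than as part of `hbs_renderer`; the function name is made up.

```python
def page_language(page_path: str, languages: set[str], default: str) -> str:
    """Use the first path component as the language if it is a known one."""
    first = page_path.split("/", 1)[0]
    return first if first in languages else default


languages = {"en", "fr"}
translated = page_language("fr/hello-world.md", languages, default="en")
untagged = page_language("hello-world.md", languages, default="en")
```

The returned code would then be passed to the template as the `lang` parameter.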