Add localization support with po4a. #2793

Closed
wants to merge 5 commits into from

urbalazs
Contributor

Hello @sbrl
I created the pull request as you asked. I also generated an initial POT file for translators.
More information in #2339 (comment).

@CLAassistant

CLAassistant commented Feb 18, 2019

CLA assistant check
All committers have signed the CLA.

@mebeim
Member

mebeim commented Feb 18, 2019

Hi @urbalazs, I've read #2339, but I find it hard to understand how this could help. Could you elaborate on why this should be introduced and which advantages it brings? The current translation workflow seems really straightforward and manageable to me.

@urbalazs
Contributor Author

Hi @mebeim, can you explain to me what the current translation workflow is? If any of the English strings are changed, how are translators informed about it?
My solution is based on PO files, which po4a can update and then use to generate the localized pages.

@mebeim
Member

mebeim commented Feb 19, 2019

@urbalazs that's what I was not understanding, thanks for clarifying. I thought you were just proposing a different translation mechanism for newly translated pages. Sorry for the misunderstanding.

The issue you describe is currently unhandled, you're right. What you're proposing is interesting, but I'm not familiar with the tool and would like to have some real examples.

Take for example the following scenario (points in chronological order):

  1. Someone(1) creates a page under /pages/common/somecmd.md.
  2. Someone(2) translates said page into /pages.it/common/somecmd.md.
  3. Someone(3) edits an example description in /pages.it/common/somecmd.md.

What should be done before or after any of these points, using the tool you mentioned?

@urbalazs
Contributor Author

@mebeim First of all, po4a has to be installed. Check the po4a.conf file, which I added with this PR.

Scenario 1: someone creates a new page.
Add the new page to po4a.conf (a command chain is included to do this recursively), then execute po4a -v -k 0 --no-translations po4a.conf. This updates the POT and PO files.

Scenario 2: someone translates a page.
Execute po4a -v -k 0 po4a.conf and the localized pages will be regenerated for every language listed in po4a.conf (check po4a.conf for details).
(There is a small problem with lines longer than 80 characters, because this command breaks lines at column 80. We have to write a small beautifier script for this.)

Scenario 3: someone edits a translated page.
Editing the translated page directly is no longer needed; only the PO file needs to be edited. Translated pages are then generated as described in scenario 2.

@waldyrious
Member

@urbalazs shouldn't the .pot file be updated whenever there's a change to the original English pages? Is there a way we could set this to happen automatically or at least in regular intervals? Say, via Travis as we're currently doing to build the archives at tldr-pages/tldr-pages.github.io?

@mebeim
Member

mebeim commented Feb 19, 2019

@urbalazs so if a page is already translated and I want to edit it I should only touch the PO file... I see. Thanks for the thorough explanation though, looks cool, I will give it a try on my fork when I have some time.

@waldyrious the pot file can be updated automatically by Travis for sure.

@urbalazs
Contributor Author

urbalazs commented Feb 19, 2019

@mebeim @waldyrious I added a bash script to this PR. Now you only have to add your language to the LANGS array (separated with spaces, e.g.: de hu it pt-BR ta zh ...) and execute the script: bash translation-update.sh

It does everything we need:

  • Generate or update POT file from pages/SUBDIR/*.md
  • Generate or update PO files
  • Generate or update translated TLDR pages to pages.XX/SUBDIR/*.md (where XX is the language code)

I really like it!

Please note that if you add languages that already have translations (it, pt-BR, ta, zh) to LANGS in the bash script, the script will overwrite the existing translations! Translations now come exclusively from PO files! Existing translations need to be migrated to PO files first.

@agnivade
Member

Shouldn't the PO files be checked into the repo as well?

This seems like the way to go, but it needs some effort to integrate it with our flow. @mebeim, do you want to own this?

@mebeim
Member

mebeim commented Feb 20, 2019

@agnivade I would like to, but I'm currently very busy so I've only managed to comment/review here and there lately. As I said earlier, I'd like to check this out on my fork to see how it works since I'm not familiar with the tool. Looks simple enough, but it indeed would require some work to be built into the current flow. I'll probably be able to take a look at this in two weeks.

@agnivade
Member

Great! Please take your time, there is absolutely no rush.

@waldyrious waldyrious added tooling Helper tools, scripts and automated processes. translation Translate pages from one language to another. labels Feb 20, 2019
@sbrl
Member

sbrl commented Feb 20, 2019

Cool, thanks @urbalazs! Looks like there's still some work to do, but great work so far!

Does it support a wildcard, like this?

[type: asciidoc] pages/(.*)/(.*).md $lang:pages.$lang/$1/$2.md

Also, our pages aren't actually asciidoc - they're markdown. Not sure if that affects things at all.

Looks like there are some commits here with an email address different to the one that you've got attached to your GitHub account, by the way. I suspect that's why @CLAassistant is complaining.

@urbalazs
Contributor Author

@agnivade PO files will be stored in the repo. I haven't sent any PO files yet, but my bash script will generate them in the i18n folder.

@sbrl

Does it support a wildcard, like this?

No, as far as I know it doesn't. But this is not important for now, as my bash script generates the po4a.conf file based on the current tldr pages.
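For illustration only, here is a rough Python sketch of how a po4a.conf could be generated from a list of pages instead of relying on wildcards. It is not the bash script from this PR. The function name `build_po4a_conf` and the POT path `i18n/tldr.pot` are assumptions; the `[type: asciidoc]` lines follow the shape quoted above, and the i18n folder and the tldr.$lang.po naming are taken from this thread.

```python
from typing import Iterable, List

PAGES_PREFIX = "pages/"


def build_po4a_conf(pages: Iterable[str], langs: List[str], fmt: str = "asciidoc") -> str:
    """Return po4a.conf text with one explicit entry per English page.

    `pages` holds paths such as "pages/common/7z.md".
    The POT/PO locations below are hypothetical choices for this sketch.
    """
    lines = [
        "[po4a_langs] " + " ".join(langs),
        "[po4a_paths] i18n/tldr.pot $lang:i18n/tldr.$lang.po",
    ]
    for page in sorted(pages):
        if not page.startswith(PAGES_PREFIX):
            raise ValueError(f"not an English source page: {page}")
        relative = page[len(PAGES_PREFIX):]  # e.g. "common/7z.md"
        lines.append(f"[type: {fmt}] {page} $lang:pages.$lang/{relative}")
    return "\n".join(lines) + "\n"
```

The page list could come from something like `Path("pages").glob("*/*.md")`, converted to POSIX-style strings.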

Also, our pages aren't actually asciidoc - they're markdown.

I know, but po4a has no Markdown support. I tried all supported formats, and asciidoc works best. It is not perfect, because asciidoc and Markdown are different. That's why I added a beautifier section to the bash script: I have to fix the generated output with sed. But the results look fine!

email address different to the one that you've got attached to your GitHub account

Yes, sorry for this. I forgot to change it in my local git config, and the first commit was made with my Gmail address. But I don't want to use Gmail anymore, because I don't want to be the "product".

@sbrl
Member

sbrl commented Feb 22, 2019

Cool, thanks for the clarification!

Yes, sorry for this. I forgot to change it in my local git config, and the first commit was made with my Gmail address. But I don't want to use Gmail anymore, because I don't want to be the "product".

Ah, I see. I've got a similar reason for moving away from gmail myself. You may find an interactive rebase helpful in fixing the commit author, IIRC.

@mebeim
Member

mebeim commented Mar 19, 2019

Hey @urbalazs I was looking at po4a, and I noticed there's a "text" format. Would that have any advantage over asciidoc? For example longer lines?

@urbalazs
Contributor Author

Using the "text" format results in these entries in the POT and PO files:

> A file archiver with high compression ratio.

> Homepage: <https://www.7-zip.org/>.

- Archive a file or directory:

The same with "asciidoc":

A file archiver with high compression ratio.

Homepage: <https://www.7-zip.org/>.

Archive a file or directory:

This is better for us, because the leading > and - characters are not part of the translatable text, so translators cannot break the structure accidentally.
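To make the idea concrete, here is a minimal Python sketch of separating the structural marker from the translatable text of a tldr page line. It only illustrates the principle; it is not how po4a parses anything, and `split_marker` is a hypothetical helper.

```python
import re
from typing import Tuple

# A leading "> " (description) or "- " (example) is structure, not text.
_LINE = re.compile(r"^(?P<marker>[>-] )?(?P<text>.*)$")


def split_marker(line: str) -> Tuple[str, str]:
    """Split a tldr page line into (structural marker, translatable text)."""
    match = _LINE.match(line)
    assert match is not None  # the pattern matches any single line
    return (match.group("marker") or "", match.group("text"))
```

If only the second element is handed to translators and the marker is re-attached afterwards, a translator has no way to delete or alter the marker.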

@mebeim
Member

mebeim commented Mar 20, 2019

@urbalazs oh, I see. So the line width limit is still 80 chars even with "text"?

@urbalazs
Contributor Author

Yes, lines are wrapped when po4a generates the localized output. I tried to unwrap all lines before the operation, but without success. It seems this is a built-in "feature" of po4a.
My script removes the unnecessary newlines from the generated pages.
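As a rough illustration of what such an unwrapping step might look like (this is not the sed-based code from the PR), the Python sketch below rejoins wrapped lines. It assumes that a continuation line never starts with one of the tldr markers (#, >, - or a backtick) and that blank lines separate the page's blocks; real po4a output may need more care. `unwrap_page` is a hypothetical name.

```python
MARKERS = ("#", ">", "-", "`")


def unwrap_page(text: str) -> str:
    """Join lines that were broken by wrapping back onto the previous line."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        is_continuation = (
            bool(out)
            and out[-1] != ""          # previous line is not a blank separator
            and stripped != ""         # current line carries text
            and not stripped.startswith(MARKERS)
        )
        if is_continuation:
            out[-1] = out[-1] + " " + stripped
        else:
            out.append(line.rstrip())
    return "\n".join(out) + "\n"
```

A wrapped description that happens to continue with a word starting with "-" (for example an option name) would be misread as a new list item, which is one reason to treat this only as a sketch.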

@mebeim
Member

mebeim commented Mar 28, 2019

Hi again @urbalazs, sorry if it took a long time, but I finally had the chance to take a look and test the proposed approach.

Here's what I think: in short, this method looks much cooler, more maintainable, and easier, but there are some problems that I think cannot be overlooked. Here's a list:

  1. Translations will not be managed atomically for each page, but all together using a HUGE file (namely one .po file for each language) which contains all the messages that have to be translated. This means that translating a page will definitely not be as easy as it is right now. Since messages are unique, and each one is identified and listed in the .po file, to translate a page a user would have to go and find all the messages that were generated by that page. This is not a trivial job, since messages from a page can be spread across the whole .po file.

    Take for example common/asar.md, which has (among others) these two messages listed in your sample tldr.hu.po file (notice the line numbers):

       84: #: pages/common/asar.md:18
       85: msgid "List the contents of an archive file:"
       86: msgstr ""
       ...
     1321: #: pages/common/asar.md:4
     1322: msgid "A file archiver for the Electron platform."
     1323: msgstr ""
    

    So if someone wants to translate the asar command, they'd have to manually find all messages related to asar.md, which can be spread "randomly" through the .po file, which is huge. This makes it much harder for people who are not power users to create a translation, and also for us to review it. Once translated, they would have to run the script to update translations, and then see if the result is fine.

  2. This brings us to the second point. Since messages are unique by design (to not waste time translating the same phrase twice), this creates another problem: if two people want to translate two different pages which happen to share a common message, then a merge conflict would be generated for the .po file they are working on. This really does not help reviewing and integrating translations either.

  3. All pages are generated regardless of the fact that messages have been translated or not. For example, using your sample tldr.hu.po without translating any message, all the pages are generated in the pages.hu folder, in English: this is not what we would want. I am not sure if this would be possible to avoid (maybe it just needs a simple flag to be added to the po4a command), but it is nonetheless annoying.

  4. This should not actually be a real problem, but I see that messages sometimes include Markdown tokens and sometimes don't. Take for example (again from tldr.hu.po):

     20: #. type: Plain text
     21: #: pages/common/7za.md:2
     22: msgid "# 7za"
                ^ includes '#'
    
     25: #. type: Plain text
     26: #: pages/common/7za.md:4 pages/common/7z.md:4 pages/common/7zr.md:4
     27: msgid "A file archiver with high compression ratio."
                ^ does not include '>'
    
     46: #. type: Plain text
     47: #: pages/common/7za.md:10
     48: msgid "`7za a {{archived.7z}} {{path/to/file_or_directory}}`"
                ^ includes '`'
    

    I think I remember you already mentioned this, although I cannot find the comment where you talked about it. It should be rather simple to fix and, as I already said, it is not the real problem.
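Regarding point 1, the lookup could in principle be scripted. Below is a minimal Python sketch that lists the messages of one page by scanning the "#:" reference comments of a .po file. `messages_for_page` is a hypothetical helper, and the sketch assumes single-line msgid entries like the ones shown above; multi-line msgids are not handled.

```python
from typing import List, Tuple


def messages_for_page(po_text: str, page: str) -> List[Tuple[int, str]]:
    """Return (line number, msgid) pairs whose '#:' references mention `page`."""
    found = []
    refs = []
    for number, line in enumerate(po_text.splitlines(), start=1):
        if line.startswith("#:"):
            # One comment line may hold several "path:line" references.
            refs.extend(ref.rsplit(":", 1)[0] for ref in line[2:].split())
        elif line.startswith("msgid "):
            text = line[len("msgid "):].strip()
            if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
                text = text[1:-1]
            if page in refs:
                found.append((number, text))
            refs = []  # references only apply to the entry that follows them
    return found
```

Calling it with "pages/common/asar.md" on the sample file would return both asar entries with their positions, however far apart they are in the file.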


So, given the above points, most importantly number 1 and 2, while I think that an upgrade of the current translation workflow would be nice, to me it doesn't look like po4a would be a valid solution/improvement. At least from what I can see taking a look at your branch and testing with the commands you provided.

Let me know what you guys think about it, and of course do let me know if I got anything wrong above, since I am not familiar with the tool. The more opinions, the better!

Cc: @sbrl @agnivade

@agnivade
Member

Thanks for working on this, @mebeim!

  1. Can this be solved using some tooling? Let's say we have one file for each command, and some tool generates the final .po file which gets translated.

  2. That is indeed an issue. I guess it's unavoidable with how the po4a architecture is set up. But it's more of a corner case.

I have more of a general question. If the po4a workflow is adopted, would we have 2 sets of files then? One raw set, which is to be edited or added to for changes, and another generated set, which will be consumed by the clients? That sounds like a complicated workflow for a newcomer to contribute to. Right now, the md files are what is consumed. So anyone can see and contribute a PR, it's simple. But with po4a, folks would have to go through the raw set, edit the correct translation message and send a PR.

@urbalazs
Contributor Author

Hello @mebeim
My answers to your questions:

  1. Translations should be maintained on translation service platforms like Transifex or Weblate. Both have GitHub integration. Pull requests should be sent only for source pages. This is a huge project; translation is not a one-person job but a community effort.

  2. Both Transifex and Weblate can handle this.

  3. I modified the script so that it now deletes completely untranslated pages.

  4. I also added a modification to remove the command name from the POT file, so this kind of string (e.g. "# 7za") is no longer present in the POT file. I have no solution for removing the backtick characters from the commands, so we have to live with this.
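For point 3, one way to decide which pages are completely untranslated is to count translated messages per page in the PO file. The Python sketch below shows the idea; it is not the logic of the bash script in this PR. `page_stats` and `untranslated_pages` are hypothetical names, and the sketch assumes single-line msgstr entries and ignores fuzzy flags.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def page_stats(po_text: str) -> Dict[str, Tuple[int, int]]:
    """Map each referenced page to (translated, total) message counts."""
    counts = defaultdict(lambda: [0, 0])
    refs = set()
    for line in po_text.splitlines():
        if line.startswith("#:"):
            refs.update(ref.rsplit(":", 1)[0] for ref in line[2:].split())
        elif line.startswith("msgstr "):
            # An empty msgstr ("") means the message is not translated yet.
            translated = line[len("msgstr "):].strip() != '""'
            for page in refs:
                counts[page][1] += 1
                counts[page][0] += int(translated)
            refs = set()
    return {page: (done, total) for page, (done, total) in counts.items()}


def untranslated_pages(po_text: str) -> List[str]:
    """Pages for which not a single message has a translation."""
    return sorted(page for page, (done, _) in page_stats(po_text).items() if done == 0)
```

The PO header entry has no "#:" references, so it does not affect any page's counts.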

@agnivade

So anyone can see and contribute a PR, it's simple.

No, this is not true. Why do you think that all translators have git knowledge? We should make translation easy! Forking, editing, then creating and sending a PR is not a convenient way.

If the po4a workflow is adopted, would we have 2 sets of files then?

No. The expected workflow would be as follows (let's assume you have already integrated Transifex or Weblate; let's call them the "translation service"):

  1. Someone adds new pages or modifies existing pages via a pull request.
  2. Existing translations are pulled from the "translation service".
  3. Run my script, which updates the POT from the sources and generates translated pages from the translations (the PO files).
  4. The "translation service" gets the updated POT, which is ready for translators.

Then you repeat steps 1-4 on a regular basis. No more pull requests are needed for translations.

To all:
Please note that gettext (i.e. working with PO files) is the de facto standard for translating anything. A project as large as this cannot be handled manually anymore. Translators are not informed about source changes, and creating a pull request is not easy for everyone.

Let's see this commit: 948147d
If the translators don't follow all changes, the translated pages will stay wrong. My proposal is a working solution for this problem.

Please let me know, if something is not clear or if you need help in anything!

@urbalazs
Contributor Author

urbalazs commented Mar 28, 2019

Example for Transifex integration and Weblate integration

@mquinson

mquinson commented Jun 26, 2019 via email

@mebeim
Member

mebeim commented Jun 26, 2019

Thank you again @mquinson for the explanation.

About the third point: I was talking about the following, quoting @urbalazs's comment [1] (emphasis mine):

Execute po4a -v -k 0 po4a.conf and the localized pages will be regenerated for every language listed in po4a.conf (check po4a.conf for details).
(There is a small problem with lines longer than 80 characters, because this command breaks lines at column 80. We have to write a small beautifier script for this.)

and a followup on that [2] [3]:

mebeim: @urbalazs oh, I see. So the line width limit is still 80 chars even with "text"?

urbalazs: Yes, lines are wrapped when po4a generates the localized output. I tried to unwrap all lines before the operation, but without success. It seems this is a built-in "feature" of po4a.
My script removes the unnecessary newlines from the generated pages.

Is this 80-column limit unavoidable or compulsory? If not, how could it be disabled?

@mquinson

mquinson commented Jun 26, 2019 via email

@mebeim
Member

mebeim commented Jun 26, 2019

@mquinson got it, thank you again! I will start experimenting with this when I have time, looks promising so far. Keep up the good work with po4a.

@mquinson

mquinson commented Jun 26, 2019 via email

@waldyrious
Member

Is this 80-column limit unavoidable or compulsory? If not, how could it be disabled?

I actually think this could be a great opportunity to introduce a line length limit in tldr-pages. Long lines typically mean we're trying to explain complex concepts, and the line limit could help identify and tackle those cases.

I see it as similar to our limit to the number of examples per page, and perfectly in line with our mission of TL;DR-ing manpages and complex documentation. But if anyone disagrees, let's drop the idea for now, so as to not derail this discussion :)

@agnivade
Member

I need to mull on it for some time. In any case, let us discuss this in a separate issue.

@waldyrious
Member

In any case, let us discuss this in a separate issue.

Makes sense. We can start with neverwrap and remove it later if we agree on a limit.

Opened #3145.

@stale

stale bot commented Sep 10, 2019

Hi all! This thread has not had any recent activity. Are there any updates? Thanks!

@stale stale bot added the waiting Issues/PRs with Pending response by the author. label Sep 10, 2019
@waldyrious
Member

I second stale-bot's words 😄

Is anyone able to point out the state of this PR, and what's blocking progress?

@stale stale bot removed the waiting Issues/PRs with Pending response by the author. label Nov 11, 2019
@mebeim
Member

mebeim commented Nov 12, 2019

@waldyrious to sum it up: this seems like it could definitely be done, but it would take a lot of work and some big changes to the build process and translation contribution workflow, plus integration with Weblate, which would require either asking for a free host or self-hosting (I also don't know how the two options would differ in practice). A substantial part of the work would also consist in adapting existing translations. It does look promising, and nothing is really blocking the progress, it's just an overwhelming amount of work for anybody to just step in and say "ok, I'll handle this".

@sbrl
Member

sbrl commented Nov 12, 2019

Hmm. Sounds like it might be worth breaking it down a bit into multiple sub-components to make it easier to manage. It's a complicated problem though, so I'm not sure how to best do this.

@mquinson

I don't think I agree with @mebeim here. I think that we are quite close to what's needed. I was still waiting for a follow-up from you guys after you started playing and experimenting with po4a. This is where we are:

Someone confident in tldr should try to apply this PR locally and play with the po4a command. Add the -k0 parameter in po4a.conf and request a translation into a language that is not provided yet. The "translated" file will only contain English text, so you can check whether the result is OK for you (without -k0, such a page would not be generated, as po4a wants at least 80% of a page to be translated before generating it). If the output does not look nice, don't fiddle with the translation script and all its nasty sed commands. Ask me and the po4a dudes to fix the handling of Markdown instead. It's much more productive to fix bugs at their roots than to mask their effects with sed.

Someone in the TLDR admin staff should contact the Weblate administrator to ask for hosting. Since TLDR is free software, it may be easy. If he (or his server) is too busy, you should ask Crowdin, which will certainly accept. You may also consider http://zanata.org/ for free hosting of your translations. Once you have the PO files generated with po4a and a working hosting solution, setting up everything is easy.

Dealing with the existing translation is somewhat more difficult, but po4a comes with a tool to convert a master file and its translation into a po file that can be integrated in the regular po4a workflow. This tool must be used by someone understanding the translation (to adapt the translated file if the structure was somewhat modified), but that's not very difficult. It's straightforward if the structure of the master and translated files match exactly, and it's a bit long if not (but that's still not complex). See https://po4a.org/man/man1/po4a-gettextize.1.php
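As a hedged illustration of that migration step, the Python sketch below only builds the po4a-gettextize command line for one existing translated page; it does not run anything. The -f, -m, -l and -p options are the format, master, localized and PO-file options of po4a-gettextize. The helper name `gettextize_command`, the asciidoc default (taken from this PR) and the output location under i18n/migrated/ are assumptions made for the example.

```python
from pathlib import PurePosixPath
from typing import List


def gettextize_command(translated_page: str, fmt: str = "asciidoc") -> List[str]:
    """Build the po4a-gettextize call for a page like 'pages.it/common/7z.md'."""
    parts = PurePosixPath(translated_page).parts   # ("pages.it", "common", "7z.md")
    lang = parts[0].split(".", 1)[1]               # "it"
    master = PurePosixPath("pages", *parts[1:])    # English original
    # Hypothetical per-page output file; it still has to be merged later.
    po_file = PurePosixPath("i18n", "migrated", lang, *parts[1:]).with_suffix(".po")
    return [
        "po4a-gettextize",
        "-f", fmt,
        "-m", str(master),
        "-l", translated_page,
        "-p", str(po_file),
    ]
```

The resulting list could be passed to `subprocess.run`. The per-page PO files would still need to be merged into the per-language PO file, for example with gettext's msgcat, and reviewed by someone who understands the language.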

The difficult part is to adapt the current translation workflow, as humans are involved. If the teams are already working and happy with their workflow, it may not be a good idea to force them to adopt a new methodology. You can still set up the po4a thing, and tell the other teams (and the ones which don't exist yet) to go for that, as it's easier in the long term. Note that the longer you wait, the harder that part of the conversion will be.

Note also that if you prefer not to use po4a, I'm perfectly fine with it. I don't sell po4a, but I'm a potential user of the French version of TLDR. As long as the material gets translated and the translations are maintained whenever the original text changes, that's cool. If your translation workflow takes this into account, please forget about po4a!

@sbrl
Member

sbrl commented Nov 12, 2019

Yeah, the current issue with the existing translation workflow is twofold:

  1. Lots of PRs are generated for translations, which need reviewing manually (and this could be automated).
  2. There's no way to tell or ask people to update translations of a page if the English master is updated.

I could reach out to weblate etc. on behalf of tldr-pages, if that's helpful.

The structure of translations and the original master should be identical..... unless the master has been updated and the translation hasn't.

@mebeim
Member

mebeim commented Nov 13, 2019

Hi, @mquinson. There's a problem that I didn't think of previously. Assume the following:

  1. A threshold of 100% is set to only generate pages that are fully translated.
  2. An it.po file is already present and up to date, providing strings for Italian pages.
  3. An English page that is fully translated into Italian gets an additional example added (i.e. additional strings to translate).
  4. This causes the translation percentage to go below the threshold, and the corresponding Italian page to not be generated.

Do you know how something like this could be handled? Ideally, we would like to keep the "old" version of the Italian page until the new English strings are also translated and we have a "new" fully translated page again. I hope I was clear enough.
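The arithmetic behind this scenario can be sketched in a few lines of Python. This assumes that the percentage is rounded down and that a page is written when the percentage is at least the keep threshold; the exact rounding inside po4a may differ. The function names and the 8-string page are made up for the example.

```python
def translated_percent(translated: int, total: int) -> int:
    """Whole-number percentage of translated strings (rounded down)."""
    return 100 if total == 0 else translated * 100 // total


def is_generated(translated: int, total: int, keep: int = 80) -> bool:
    """True if the page would be written for the given keep threshold."""
    return translated_percent(translated, total) >= keep


# A page with 8 strings, all translated, gets one new English example,
# i.e. 2 new untranslated strings (description and command).
before = is_generated(8, 8, keep=100)
after_strict = is_generated(8, 10, keep=100)
after_default = is_generated(8, 10, keep=80)
```

Here `before` is true, `after_strict` is false (the Italian page would disappear with a 100% threshold) and `after_default` is true (8 of 10 is exactly 80%).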

@mquinson

  1. I'm not sure I understand why you guys want to put the threshold at 100%. In other projects, 80% is seen as a fair compromise between providing really fresh improvements to those who can read some English and providing a fair translation of the content that didn't change recently. That's your project, I respect your choice, but I'm just wondering.

  2. What you're calling for is easy to script if it's really what you want. Ask po4a to generate the translations into a new/ directory, and move the ones that are actually generated to their real location.

But again, consider carefully whether it's really better to keep an old complete translation rather than providing a recent version with one or two new sentences in English in the middle of the translation. The good news is that each team can choose a different policy, if you want, as both approaches are really easy to get with po4a.
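A minimal Python sketch of the "generate into new/, then move" idea from point 2 follows. It assumes po4a has already written the pages that passed the threshold into a staging directory; `promote_generated` and the directory names in the usage note are hypothetical.

```python
from pathlib import Path
from typing import List


def promote_generated(staging: Path, target: Path) -> List[str]:
    """Move every generated page from `staging` over its copy in `target`.

    Pages that were not generated (below the threshold) are simply absent
    from `staging`, so their older versions in `target` stay untouched.
    """
    moved = []
    for src in sorted(staging.rglob("*.md")):
        relative = src.relative_to(staging)
        dest = target / relative
        dest.parent.mkdir(parents=True, exist_ok=True)
        src.replace(dest)  # overwrites an existing file at the destination
        moved.append(relative.as_posix())
    return moved
```

For example, `promote_generated(Path("new/pages.it"), Path("pages.it"))` would refresh only the Italian pages that po4a actually produced. `Path.replace` expects both directories to be on the same filesystem.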

@mquinson

I would like to stress again that the support for markdown is somewhat experimental in po4a. If/when you see bugs, we'll fix them.

@mebeim
Member

mebeim commented Nov 13, 2019

@mquinson thank you for the quick response, I guess generating files into a new folder would indeed work. The scenario still applies even with an 80% threshold. Since translations are basically all manual as of now, it's like working with a 100% threshold, and I'm assuming this threshold just to make it simpler, but of course once the whole thing is set up we will discuss what's best to set as the threshold.

As for the Markdown support, I'll let you know for sure if there's something that looks like a bug. From what I have seen so far, it seems to work fine.

@stale

stale bot commented Dec 9, 2019

Hi all! This thread has not had any recent activity. Are there any updates? Thanks!

@stale stale bot added the waiting Issues/PRs with Pending response by the author. label Dec 9, 2019
@stale

stale bot commented Jan 8, 2020

Hi everyone. This thread is being closed as there was no response to the previous prompt. However, please leave a comment whenever you're ready to resume, so the thread can be reopened. Thanks again!

Labels
tooling Helper tools, scripts and automated processes. translation Translate pages from one language to another. waiting Issues/PRs with Pending response by the author.

7 participants