Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enforce the use of a single string formatting mechanism for translation source strings #9377

Conversation

loicteixeira
Copy link
Member

@loicteixeira loicteixeira commented Oct 16, 2022

Fix #7111. As discussed there, the chosen style is the old style formatting.

25a83e0 setups and configures Semgrep in order to lint the source strings and prevent regressions.
It might not be perfect just yet, maybe there's false negative, but at least I didn't see any false positives (which we could always temporarily ignore with # nosemgrep: id-of-the-error until the rules are updated).

11e86ea and 4c7106b (the second one deals with strings in JS, so it was separated to another commit for ease of review.) update the source strings to the desired format.

See a compiled list of source string changes.

5db259e is unrelated, but since flake8 was running on the file, it would complain about it.

Now that's about 150 strings which translators would have to update as I don't think the auto matching whatever will be able to pick this up. In order to reduce the burden as much as possible, I've made the following script which will update the PO files with the new placeholders when possible.

See the update script.

If all goes well, it would only leave the translators 4 translations to update (see the AMBIGUOUS_REPLACEMENTS constant in the script).

However I'm not 100% sure how to proceed, so here is what I'm thinking so far:

  1. Merge this PR OR do it on this branch until everything is confirmed to work?
  2. Fetch translations
  3. Commit & push to main to create a "restore" point, just in case
  4. Run the script cd wagtail_root; python ./scripts/issue_7111_update_i18n_placeholders.py
  5. Commit
  6. Rebuild sources and push to Transifex
  7. Push the translations with tx push -t

Unfortunately I can't really test it without actually doing it...

The main idea is that we can't push the updated source strings first because the translations correspondance would be lost, so we fetch first (step 2), update locally (step 4), push the updated source string (step 6), at this point the translations correspondance would be lost, but pushing the updated translations (step 7) should restore it 🤞

The other question that I have is when pulling translations with the fetch-translations.sh script, we would normally do some cleanup and maybe we don't want to do that (since we're going to push those files later), so we should only run the first 2 statements of the script (i.e. remove old translation files then pull).


Please check the following:

  • Do the tests still pass?1
  • Does the code comply with the style guide?
    • Run make lint from the Wagtail root.
  • For Python changes: Have you added tests to Existing tests cover the new/fixed behaviour?
  • For front-end changes: Did you test on all of Wagtail’s supported environments?2
    • Please list the exact browser and operating system versions you tested:
    • Please list which assistive technologies 3 you tested:
  • For new features: Has the documentation been updated accordingly?

Please describe additional details for testing this change.

Footnotes

  1. Development Testing

  2. Browser and device support

  3. Accessibility Target

@squash-labs
Copy link

squash-labs bot commented Oct 16, 2022

Manage this branch in Squash

Test this branch here: https://loicteixeira7111-enforce-trans-ay1u4.squash.io

@loicteixeira loicteixeira added the component:i18n i18n for content created in Wagtail, not the admin UI itself label Oct 16, 2022
@loicteixeira loicteixeira marked this pull request as ready for review October 16, 2022 10:07
Copy link
Member

@lb- lb- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@loicteixeira this is incredible - the code looks solid to me, have not run locally yet though.

A few things

Do not use str.format style formatting for translations.
Use printf style formatting with named placeholders instead.
For example, do `_("Hello %(name)s") % {"name": "Wagtail"}`
instead of `_("Hello {name}").format(name="Wagtail")`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should really point the developer to a docs page here, hopefully I can get #9431 in and we can have this end docs file as a source of truth here.

The nuances of the various types of string formatting was a bit for me to understand, and there is the temptation to use the 'newer' thing. Having the docs as a place where can explain this a bit better would be useful.

Also, this means adding a docs section also I think as part of this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, now that #9431 is merged, I'll rebase my branch tomorrow Friday and piggyback on it 👍

@loicteixeira
Copy link
Member Author

Adding semgrep as another dependency is probably required, but just wondering if this can be somehow written into a tool we already use like a flake8 plugin (note: I have done zero research into this, so feel free to shut that idea down). flake8.pycqa.org/en/latest/plugin-development/index.html

Yes, it should be possible to do the same thing with a flake8 plugin, but it's a bit more involved since it works with the AST tree (where semgrep is some pseudo-regex) and would need to be published to pypi. I actually started exploring this, but finally switched to semgrep so I could get this out quickly and finally provide some much needed help to the translators.

I'd understand though if we don't want to introduce yet another dev dependency, but I'd be in favour of going with it for now, and I can try in a second time to replace semgrep with a flake8 plugin.

As a side note, merging the string changes without the linter is also possible, but obviously we would have to do this all over again later 😒

@loicteixeira
Copy link
Member Author

Also, we probably need to wait for 4.1 to be released, otherwise we will mess up the translations for that release.

@lb-
Copy link
Member

lb- commented Oct 26, 2022

I think that reasoning makes sense - regarding the added dependency. I also think this should be merged as one, the fixes and the enforcing of the new approach.

Hopefully though there will be some other thoughts on this from others in the core team soon.

Thanks Loic.

@thibaudcolas thibaudcolas self-requested a review October 26, 2022 16:05
@lb-
Copy link
Member

lb- commented Nov 1, 2022

Discussed at the core team last week, no major objections but a few refinements to do before this goes in. But now that 4.1 is out this can be revisited.

  • @thibaudcolas to review
  • Revisit how we can minimise the amount of changes needed in Transifex
  • Did not seem like any major objections to the new dev dependency
  • Might be good to get some feedback if possible from translators

Hopefully that summarises it ok.

Thanks for the epic progress on this @loicteixeira

@loicteixeira loicteixeira force-pushed the 7111-enforce-translation-string-formatting branch from 4c7106b to 462b4ba Compare November 3, 2022 06:57
@loicteixeira
Copy link
Member Author

loicteixeira commented Nov 3, 2022

Rebased on main after the release of v4.1. The list of diff and the script in the PR description have been updated accordingly.

Edit: I removed the need design decision since the core team had no major objections.

@loicteixeira
Copy link
Member Author

Btw, if approved, I'm obviously happy to merge this myself and run the scripts since this isn't your usual merge & push.

@th3hamm0r
Copy link
Contributor

@loicteixeira epic work, this really improves the quality of wagtail's i18n!

I've focused on the changed translation calls in python and the resulting strings (original and "de"), I've also tested your script locally.
I've found one issue with the script, there is at least one replacement, which is incomplete, therefore it does not work:

# => Are you sure you want to delete this %(object)s? If other things in your
    "Are you sure you want to delete this %s? If other things in your": (
        ("%s", "%(object)s"),
    ),

Somehow you've only used parts of the original string. This finding was lucky, because the string hasn't been changed, but the (newer?) polib-version reformatted the string, so I've found the remaining "%s".
I haven't checked all other replacements yet, as there are some of them 😁

As you've pointed out above, not everything can be tested, but from my point of view it looks good 🙏

@th3hamm0r
Copy link
Contributor

After finding the missed replacement above, I've searched through all resulting po files. I did not find any other (missed) "%s" strings, but after searching for "{", I've found the following, where I don't know exactly why it has been missed with semgrep?

hide_include_subtree = False
self.fields["include_subtree"].label = ngettext(
"Include subtree ({} page)",
"Include subtree ({} pages)",
descendant_count,
).format(descendant_count)

Also there are some uses of {...} in the admin's JS, see djangojs.po.
Looking at django's docs those usages could be adapted to the python format by using .interpolate() instead of javascript's .replace(), but I have to admit that I haven't used that by myself yet.

Copy link
Contributor

@th3hamm0r th3hamm0r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sry for spamming you with comments, my process was a bit iterative, I should have waited and used the review comment once 😬

@loicteixeira
Copy link
Member Author

No worries. Thank you for the thorough review 👍

@loicteixeira loicteixeira force-pushed the 7111-enforce-translation-string-formatting branch from 3037ce0 to 171fd2b Compare November 24, 2022 07:30
@loicteixeira
Copy link
Member Author

loicteixeira commented Nov 24, 2022

Rebased on main.

Truncated "Are you sure you want to delete this %s?"

Nice catch. I've updated the script accordingly!

I don't know exactly why it has been missed with semgrep

The semgrep was looking for a function called _ or gettext or gettext_lazy or ngettext, but with only a single argument.

  • d9be05d updates the pattern, so it also looks for more arguments (it still seems to work when there's only one)
  • 2224fc8 adds ngettext_lazy to the list of functions to match

In the case of ngettext, it's still not perfect as it will only be looking at the first of the two strings, but maybe that's good enough for a first iteration 🤔

Also there are some uses of {...} in the admin's JS

The rule only looks for Python files. I'll have a look at enabling it for js/jsx/ts/tsx files as well, but I'll need to make sure it doesn't break anything (regarding the replace/interpolate you mentioned).

@th3hamm0r
Copy link
Contributor

Awesome!

The rule only looks for Python files. I'll have a look at enabling it for js/jsx/ts/tsx files as well, but I'll need to make sure it doesn't break anything (regarding the replace/interpolate you mentioned).

Yes, I assumed that. Since I haven't used them yet, I cannot really give advice here, so for me it is more a nice to have.

Copy link
Collaborator

@gasman gasman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work on this @loicteixeira!

wagtail/images/fields.py Outdated Show resolved Hide resolved
@th3hamm0r
Copy link
Contributor

Should we add semgrep --config .semgrep.yml --error also to the Makefile's lint-server section?

The PR's template guides everyone to execute make lint, so this would then also cover semgrep. I also use it quite often 😁

@loicteixeira loicteixeira force-pushed the 7111-enforce-translation-string-formatting branch from 16d001b to 65b996b Compare November 25, 2022 11:45
@loicteixeira
Copy link
Member Author

  1. Missing replacements
  2. The replacement for the this plural message does not work properly
  3. %(object)s should be %(image_title)s
  4. "objects" vs. "object"

Script updated, see the changes.

  1. Atm we have 3 replacements of the string "Edit this %s", all with different placeholders. This leads to invalid strings, eg. for "de" the result is:

All updated to use object. I also updated related strings (add/delete this ...) so it use the same placeholder everywhere.

  1. I think we cannot automate the replacement of the following, but I'm not completely sure

Indeed, this won't work. Ignoring it then and it will be up to the translators to retranslate.

Copy link
Contributor

@th3hamm0r th3hamm0r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!
I've rerun the scripts again and there seems to be only one issue with the new plural handling in your script. It produced the following diff:

-msgid "Include subtree ({} page)"
+msgid "Include subtree (%(descendant_count)s page)"
 msgid_plural "Include subtree ({} pages)"
-msgstr[0] "Teilbaum ({} Seite) einbeziehen"
-msgstr[1] "Teilbaum ({} Seiten) einbeziehen"
+msgstr[0] "Teilbaum (%(descendant_count)s Seite) einbeziehen"
+msgstr[1] "Teilbaum (%(descendant_count)s Seiten) einbeziehen"

The "msgid_plural" isn't replaced, so I think this has to be done in this block: https://gist.github.com/loicteixeira/bcc5e678e3d06af07bdb3afe9d149575#file-source_strings_udpate-py-L452-L454

Copy link
Member

@laymonage laymonage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this PR, thank you very much for the great work!

I'm working on some refactoring between snippets and generic views in #9709, and I'd like to rename some variables in translatable strings so that both views can share the same strings. I was hoping to piggyback some strings in the snippets views, if you don't mind? Thanks (and sorry for adding more work)!

wagtail/admin/views/generic/models.py Show resolved Hide resolved
@th3hamm0r th3hamm0r self-requested a review November 26, 2022 09:04
Copy link
Contributor

@th3hamm0r th3hamm0r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reran the scripts, looks good! Again, amazing work @loicteixeira and all contributors!

I think we should keep your script (or better you as the "polib-pro") in mind, because changing placeholder names during refactorings will probably happen again in the future, and this little script avoids creating unnecessary (maybe release-blocking) tasks for the translators 👍

@loicteixeira
Copy link
Member Author

loicteixeira commented Nov 26, 2022

I merged locally, but the re-upload to Transifex didn't really go as planned.
I'm trying to fix it and will update here with the exact steps taken.
Edit: Old state re-uploaded to Transifex.

loicteixeira added a commit that referenced this pull request Nov 26, 2022
@loicteixeira
Copy link
Member Author

loicteixeira commented Nov 26, 2022

Steps taken to merge (this for posterity, and also because I'll have to do it all over again for wagtail-localize which is in the same Transifex project).

BRANCH_NAME="7111-enforce-translation-string-formatting"

# Merge
git switch main
git pull --ff-only upstream main
git switch "${BRANCH_NAME}"
git reset $(git merge-base main "${BRANCH_NAME}")
git add .
git commit -m "Enforce the use of a single string formatting mechanism for translation source strings" -m "Close #9377"
git rebase main
git merge --ff-only "${BRANCH_NAME}"

# Pull & commit existing translations
cd scripts
./fetch-translations.sh > ../../logs/translation.out 2>../../logs/translation.err
python ./check-translation-strings.py
cd ..
git add .
git commit -m "Fetch new translations from Transifex"

# Update the strings and sanity check with French translation (I can't possibly check all languages...)
python ./scripts/source_strings_update.py
find wagtail -name *.po -type f -path **/fr/** -exec git diff --staged --word-diff {} \;
git add .

# Rebuild English translations to check if there's diff from what the script has done
cd scripts
./rebuild-translation-sources.sh
cd ..
git add .

git commit -m "Update translations placeholder names" -m "See 364de3d4dc and #9377"

# Last linting check
semgrep --config .semgrep.yml --error .

Now the tricky part:

# Upload English source strings to Transifex
tx push --source

Languages now show 160 untranslated strings as expected.

# Upload French translations to Transifex for sanity check on Transifex
tx push --translations --language=fr

Sadly it did not work, files were being "skipped", and we still had 160 strings to translate.

Trying to fix it:

git checkout HEAD~1  # Come back to the commit before the placeholder update
tx push --source  # Re-push the original English sources

At this point, Transifex is back to what it was before I started.

git switch main

# Delete translations
find wagtail -iname *.po ! -iwholename */en/* -delete

# Re-fetch, but without using `scripts/fetch-translations.sh` to keep strings location information.
tx pull --all --minimum-perc=1

# Re-update placeholders
python ./scripts/source_strings_update.py

# Re-upload English updated source strings to Transifex
tx push --source

# Re-upload French translations
tx push --translations --language=fr

Unfortunately, files are still being "skipped".
While the location information "might" be required anyway, the files seem to be skipped because of their modification timestamp.
Trying to force the upload.

# Re-upload French translations
tx push --translations --language=fr --force --no-interactive

This work, but it means that the time between fetch and upload has to be as short as possible, or we will overwrite work from translators which might be updating strings at the same time we do this work.

# Upload other translations
tx push --translations --force --no-interactive

So at this point I'm not sure if files were skipped because of the missing location information or simply the timestamps.
It feels like it would be the latter (location information shouldn't be needed as long as you have the string "id")
but I don't want to fiddle with it any more for today.

In the end, there are: 14 strings which weren't updated (that's 10 more than the 5 impossible/ambiguous auto-update we had identified).

So 160 strings down to 14. Not perfect but not bad either.

# Cleanup & merge
git restore .
git push --dry-run upstream main
git push upstream main

So this might need some refinement before we start using it on a "regular" basis.

a cartoon dog sitting at a table in the middle of a room in fire saying "this is fine" and looking into its coffee cup

Anyway, thank you @lb-, @th3hamm0r, @gasman & @laymonage for your help on this issue.

@loicteixeira loicteixeira deleted the 7111-enforce-translation-string-formatting branch November 26, 2022 11:54
@lb-
Copy link
Member

lb- commented Nov 26, 2022

Wow @loicteixeira - thanks for documenting this process also.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component:i18n i18n for content created in Wagtail, not the admin UI itself
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enforce the use of a single string formatting mechanism for translation source strings
5 participants