Enforce the use of a single string formatting mechanism for translation source strings #9377

loicteixeira · 2022-10-16T09:16:14Z

Fix #7111. As discussed there, the chosen style is the old style formatting.

25a83e0 setups and configures Semgrep in order to lint the source strings and prevent regressions.
It might not be perfect just yet, maybe there's false negative, but at least I didn't see any false positives (which we could always temporarily ignore with # nosemgrep: id-of-the-error until the rules are updated).

11e86ea and 4c7106b (the second one deals with strings in JS, so it was separated to another commit for ease of review.) update the source strings to the desired format.

See a compiled list of source string changes.

5db259e is unrelated, but since flake8 was running on the file, it would complain about it.

Now that's about 150 strings which translators would have to update as I don't think the auto matching whatever will be able to pick this up. In order to reduce the burden as much as possible, I've made the following script which will update the PO files with the new placeholders when possible.

See the update script.

If all goes well, it would only leave the translators 4 translations to update (see the AMBIGUOUS_REPLACEMENTS constant in the script).

However I'm not 100% sure how to proceed, so here is what I'm thinking so far:

Merge this PR OR do it on this branch until everything is confirmed to work?
Fetch translations
Commit & push to main to create a "restore" point, just in case
Run the script cd wagtail_root; python ./scripts/issue_7111_update_i18n_placeholders.py
Commit
Rebuild sources and push to Transifex
Push the translations with tx push -t

Unfortunately I can't really test it without actually doing it...

The main idea is that we can't push the updated source strings first because the translations correspondance would be lost, so we fetch first (step 2), update locally (step 4), push the updated source string (step 6), at this point the translations correspondance would be lost, but pushing the updated translations (step 7) should restore it 🤞

The other question that I have is when pulling translations with the fetch-translations.sh script, we would normally do some cleanup and maybe we don't want to do that (since we're going to push those files later), so we should only run the first 2 statements of the script (i.e. remove old translation files then pull).

Please check the following:

Do the tests still pass?¹
Does the code comply with the style guide?
- Run make lint from the Wagtail root.
For Python changes: ~~Have you added tests to~~ Existing tests cover the new/fixed behaviour?
~~For front-end changes: Did you test on all of Wagtail’s supported environments?~~²
- Please list the exact browser and operating system versions you tested:
- Please list which assistive technologies ³ you tested:
~~For new features: Has the documentation been updated accordingly?~~

Please describe additional details for testing this change.

squash-labs · 2022-10-16T09:16:20Z

Manage this branch in Squash

Test this branch here: https://loicteixeira7111-enforce-trans-ay1u4.squash.io

lb-

@loicteixeira this is incredible - the code looks solid to me, have not run locally yet though.

A few things

We probably need eyes on this from a few more core team to get feedback (I will add the needs decision label and flag this for tomorrow's meeting)
Adding semgrep as another dependency is probably required, but just wondering if this can be somehow written into a tool we already use like a flake8 plugin (note: I have done zero research into this, so feel free to shut that idea down). https://flake8.pycqa.org/en/latest/plugin-development/index.html
Semgrep's licence looks like it's compatible with Wagtail - https://github.com/returntocorp/semgrep/blob/develop/LICENSE

lb- · 2022-10-25T21:47:21Z

.semgrep.yml

+      Do not use str.format style formatting for translations.
+      Use printf style formatting with named placeholders instead.
+      For example, do `_("Hello %(name)s") % {"name": "Wagtail"}`
+      instead of `_("Hello {name}").format(name="Wagtail")`.


I think we should really point the developer to a docs page here, hopefully I can get #9431 in and we can have this end docs file as a source of truth here.

The nuances of the various types of string formatting was a bit for me to understand, and there is the temptation to use the 'newer' thing. Having the docs as a place where can explain this a bit better would be useful.

Also, this means adding a docs section also I think as part of this PR.

Yes, now that #9431 is merged, I'll rebase my branch ~~tomorrow~~ Friday and piggyback on it 👍

loicteixeira · 2022-10-26T05:41:02Z

Adding semgrep as another dependency is probably required, but just wondering if this can be somehow written into a tool we already use like a flake8 plugin (note: I have done zero research into this, so feel free to shut that idea down). flake8.pycqa.org/en/latest/plugin-development/index.html

Yes, it should be possible to do the same thing with a flake8 plugin, but it's a bit more involved since it works with the AST tree (where semgrep is some pseudo-regex) and would need to be published to pypi. I actually started exploring this, but finally switched to semgrep so I could get this out quickly and finally provide some much needed help to the translators.

I'd understand though if we don't want to introduce yet another dev dependency, but I'd be in favour of going with it for now, and I can try in a second time to replace semgrep with a flake8 plugin.

As a side note, merging the string changes without the linter is also possible, but obviously we would have to do this all over again later 😒

loicteixeira · 2022-10-26T06:29:16Z

Also, we probably need to wait for 4.1 to be released, otherwise we will mess up the translations for that release.

lb- · 2022-10-26T07:44:01Z

I think that reasoning makes sense - regarding the added dependency. I also think this should be merged as one, the fixes and the enforcing of the new approach.

Hopefully though there will be some other thoughts on this from others in the core team soon.

Thanks Loic.

lb- · 2022-11-01T22:07:45Z

Discussed at the core team last week, no major objections but a few refinements to do before this goes in. But now that 4.1 is out this can be revisited.

@thibaudcolas to review
Revisit how we can minimise the amount of changes needed in Transifex
Did not seem like any major objections to the new dev dependency
Might be good to get some feedback if possible from translators

Hopefully that summarises it ok.

Thanks for the epic progress on this @loicteixeira

loicteixeira · 2022-11-03T06:59:39Z

Rebased on main after the release of v4.1. The list of diff and the script in the PR description have been updated accordingly.

Edit: I removed the need design decision since the core team had no major objections.

loicteixeira · 2022-11-03T07:01:34Z

Btw, if approved, I'm obviously happy to merge this myself and run the scripts since this isn't your usual merge & push.

th3hamm0r · 2022-11-23T08:08:13Z

@loicteixeira epic work, this really improves the quality of wagtail's i18n!

I've focused on the changed translation calls in python and the resulting strings (original and "de"), I've also tested your script locally.
I've found one issue with the script, there is at least one replacement, which is incomplete, therefore it does not work:

# => Are you sure you want to delete this %(object)s? If other things in your
    "Are you sure you want to delete this %s? If other things in your": (
        ("%s", "%(object)s"),
    ),

Somehow you've only used parts of the original string. This finding was lucky, because the string hasn't been changed, but the (newer?) polib-version reformatted the string, so I've found the remaining "%s".
I haven't checked all other replacements yet, as there are some of them 😁

As you've pointed out above, not everything can be tested, but from my point of view it looks good 🙏

th3hamm0r · 2022-11-23T09:48:33Z

After finding the missed replacement above, I've searched through all resulting po files. I did not find any other (missed) "%s" strings, but after searching for "{", I've found the following, where I don't know exactly why it has been missed with semgrep?

wagtail/wagtail/contrib/simple_translation/forms.py

Lines 43 to 48 in 3037ce0

    
           hide_include_subtree = False 
        
           self.fields["include_subtree"].label = ngettext( 
        
               "Include subtree ({} page)", 
        
               "Include subtree ({} pages)", 
        
               descendant_count, 
        
           ).format(descendant_count)

Also there are some uses of {...} in the admin's JS, see djangojs.po.
Looking at django's docs those usages could be adapted to the python format by using .interpolate() instead of javascript's .replace(), but I have to admit that I haven't used that by myself yet.

th3hamm0r

Sry for spamming you with comments, my process was a bit iterative, I should have waited and used the review comment once 😬

loicteixeira · 2022-11-24T06:57:29Z

No worries. Thank you for the thorough review 👍

loicteixeira · 2022-11-24T07:40:12Z

Rebased on main.

Truncated "Are you sure you want to delete this %s?"

Nice catch. I've updated the script accordingly!

I don't know exactly why it has been missed with semgrep

The semgrep was looking for a function called _ or gettext or gettext_lazy or ngettext, but with only a single argument.

d9be05d updates the pattern, so it also looks for more arguments (it still seems to work when there's only one)
2224fc8 adds ngettext_lazy to the list of functions to match

In the case of ngettext, it's still not perfect as it will only be looking at the first of the two strings, but maybe that's good enough for a first iteration 🤔

Also there are some uses of {...} in the admin's JS

The rule only looks for Python files. I'll have a look at enabling it for js/jsx/ts/tsx files as well, but I'll need to make sure it doesn't break anything (regarding the replace/interpolate you mentioned).

th3hamm0r · 2022-11-24T08:17:16Z

Awesome!

The rule only looks for Python files. I'll have a look at enabling it for js/jsx/ts/tsx files as well, but I'll need to make sure it doesn't break anything (regarding the replace/interpolate you mentioned).

Yes, I assumed that. Since I haven't used them yet, I cannot really give advice here, so for me it is more a nice to have.

gasman

Great work on this @loicteixeira!

wagtail/images/fields.py

th3hamm0r · 2022-11-24T10:36:07Z

Should we add semgrep --config .semgrep.yml --error also to the Makefile's lint-server section?

The PR's template guides everyone to execute make lint, so this would then also cover semgrep. I also use it quite often 😁

Now all occurences use the same replacement name. And while not necessary, the name was updated for related strings for homogeneity.

loicteixeira · 2022-11-25T11:52:12Z

Missing replacements

The replacement for the this plural message does not work properly

%(object)s should be %(image_title)s

"objects" vs. "object"

Script updated, see the changes.

Atm we have 3 replacements of the string "Edit this %s", all with different placeholders. This leads to invalid strings, eg. for "de" the result is:

All updated to use object. I also updated related strings (add/delete this ...) so it use the same placeholder everywhere.

I think we cannot automate the replacement of the following, but I'm not completely sure

Indeed, this won't work. Ignoring it then and it will be up to the translators to retranslate.

th3hamm0r

Nice!
I've rerun the scripts again and there seems to be only one issue with the new plural handling in your script. It produced the following diff:

-msgid "Include subtree ({} page)"
+msgid "Include subtree (%(descendant_count)s page)"
 msgid_plural "Include subtree ({} pages)"
-msgstr[0] "Teilbaum ({} Seite) einbeziehen"
-msgstr[1] "Teilbaum ({} Seiten) einbeziehen"
+msgstr[0] "Teilbaum (%(descendant_count)s Seite) einbeziehen"
+msgstr[1] "Teilbaum (%(descendant_count)s Seiten) einbeziehen"

The "msgid_plural" isn't replaced, so I think this has to be done in this block: https://gist.github.com/loicteixeira/bcc5e678e3d06af07bdb3afe9d149575#file-source_strings_udpate-py-L452-L454

laymonage

I love this PR, thank you very much for the great work!

I'm working on some refactoring between snippets and generic views in #9709, and I'd like to rename some variables in translatable strings so that both views can share the same strings. I was hoping to piggyback some strings in the snippets views, if you don't mind? Thanks (and sorry for adding more work)!

wagtail/admin/views/generic/models.py

…ore generalization work

th3hamm0r

I've reran the scripts, looks good! Again, amazing work @loicteixeira and all contributors!

I think we should keep your script (or better you as the "polib-pro") in mind, because changing placeholder names during refactorings will probably happen again in the future, and this little script avoids creating unnecessary (maybe release-blocking) tasks for the translators 👍

loicteixeira · 2022-11-26T10:47:46Z

I merged locally, but the re-upload to Transifex didn't really go as planned.
I'm trying to fix it and will update here with the exact steps taken.
Edit: Old state re-uploaded to Transifex.

See 364de3d4dc and #9377

loicteixeira · 2022-11-26T11:52:54Z

Steps taken to merge (this for posterity, and also because I'll have to do it all over again for wagtail-localize which is in the same Transifex project).

BRANCH_NAME="7111-enforce-translation-string-formatting"

# Merge
git switch main
git pull --ff-only upstream main
git switch "${BRANCH_NAME}"
git reset $(git merge-base main "${BRANCH_NAME}")
git add .
git commit -m "Enforce the use of a single string formatting mechanism for translation source strings" -m "Close #9377"
git rebase main
git merge --ff-only "${BRANCH_NAME}"

# Pull & commit existing translations
cd scripts
./fetch-translations.sh > ../../logs/translation.out 2>../../logs/translation.err
python ./check-translation-strings.py
cd ..
git add .
git commit -m "Fetch new translations from Transifex"

# Update the strings and sanity check with French translation (I can't possibly check all languages...)
python ./scripts/source_strings_update.py
find wagtail -name *.po -type f -path **/fr/** -exec git diff --staged --word-diff {} \;
git add .

# Rebuild English translations to check if there's diff from what the script has done
cd scripts
./rebuild-translation-sources.sh
cd ..
git add .

git commit -m "Update translations placeholder names" -m "See 364de3d4dc and #9377"

# Last linting check
semgrep --config .semgrep.yml --error .

Now the tricky part:

# Upload English source strings to Transifex
tx push --source

Languages now show 160 untranslated strings as expected.

# Upload French translations to Transifex for sanity check on Transifex
tx push --translations --language=fr

Sadly it did not work, files were being "skipped", and we still had 160 strings to translate.

Trying to fix it:

git checkout HEAD~1  # Come back to the commit before the placeholder update
tx push --source  # Re-push the original English sources

At this point, Transifex is back to what it was before I started.

git switch main

# Delete translations
find wagtail -iname *.po ! -iwholename */en/* -delete

# Re-fetch, but without using `scripts/fetch-translations.sh` to keep strings location information.
tx pull --all --minimum-perc=1

# Re-update placeholders
python ./scripts/source_strings_update.py

# Re-upload English updated source strings to Transifex
tx push --source

# Re-upload French translations
tx push --translations --language=fr

Unfortunately, files are still being "skipped".
While the location information "might" be required anyway, the files seem to be skipped because of their modification timestamp.
Trying to force the upload.

# Re-upload French translations
tx push --translations --language=fr --force --no-interactive

This work, but it means that the time between fetch and upload has to be as short as possible, or we will overwrite work from translators which might be updating strings at the same time we do this work.

# Upload other translations
tx push --translations --force --no-interactive

So at this point I'm not sure if files were skipped because of the missing location information or simply the timestamps.
It feels like it would be the latter (location information shouldn't be needed as long as you have the string "id")
but I don't want to fiddle with it any more for today.

In the end, there are: 14 strings which weren't updated (that's 10 more than the 5 impossible/ambiguous auto-update we had identified).

So 160 strings down to 14. Not perfect but not bad either.

# Cleanup & merge
git restore .
git push --dry-run upstream main
git push upstream main

So this might need some refinement before we start using it on a "regular" basis.

Anyway, thank you @lb-, @th3hamm0r, @gasman & @laymonage for your help on this issue.

lb- · 2022-11-26T22:48:25Z

Wow @loicteixeira - thanks for documenting this process also.

loicteixeira added the component:i18n i18n for content created in Wagtail, not the admin UI itself label Oct 16, 2022

loicteixeira marked this pull request as ready for review October 16, 2022 10:07

loicteixeira requested review from gasman and lb- October 16, 2022 10:07

loicteixeira added the status:Needs Review label Oct 16, 2022

zerolab mentioned this pull request Oct 19, 2022

feat: use Willow instead of Pillow for images. #8974

Closed

8 tasks

lb- reviewed Oct 25, 2022

View reviewed changes

lb- added the status:Needs Design Decision label Oct 25, 2022

loicteixeira mentioned this pull request Oct 26, 2022

Provide better context for translations #7116

Open

thibaudcolas self-requested a review October 26, 2022 16:05

loicteixeira force-pushed the 7111-enforce-translation-string-formatting branch from 4c7106b to 462b4ba Compare November 3, 2022 06:57

loicteixeira removed the status:Needs Design Decision label Nov 3, 2022

loicteixeira requested a review from allcaps November 15, 2022 17:41

th3hamm0r self-requested a review November 23, 2022 07:55

th3hamm0r requested changes Nov 23, 2022

View reviewed changes

loicteixeira force-pushed the 7111-enforce-translation-string-formatting branch from 3037ce0 to 171fd2b Compare November 24, 2022 07:30

gasman reviewed Nov 24, 2022

View reviewed changes

wagtail/images/fields.py Outdated Show resolved Hide resolved

loicteixeira added 8 commits November 25, 2022 12:10

Upgrade pattern to work with ngettext which accepts multiple arguments

8c9fb47

Update missed ngettext string

e901445

Add ngettext_lazy to list of functions to catch

9315a0c

Run semgrep rules in CI and make lint

623f43d

Avoid escaping placeholders

2903c82

Run applicables rules on JavaScript files

ef71da8

Update javascript files as well

cf1fa5c

Fix conflict between multiple 'Edit this %s' strings

65b996b

Now all occurences use the same replacement name. And while not necessary, the name was updated for related strings for homogeneity.

loicteixeira force-pushed the 7111-enforce-translation-string-formatting branch from 16d001b to 65b996b Compare November 25, 2022 11:45

loicteixeira requested review from lb- and th3hamm0r November 25, 2022 11:52

th3hamm0r requested changes Nov 25, 2022

View reviewed changes

gasman approved these changes Nov 25, 2022

View reviewed changes

laymonage reviewed Nov 25, 2022

View reviewed changes

wagtail/admin/views/generic/models.py Show resolved Hide resolved

loicteixeira added 2 commits November 26, 2022 09:14

Use more generic placeholder names for snippets in anticipation for m…

6d9476d

…ore generalization work

Normalize placeholder name for instance/object

868c239

th3hamm0r self-requested a review November 26, 2022 09:04

th3hamm0r approved these changes Nov 26, 2022

View reviewed changes

loicteixeira closed this in 5c1c2c8 Nov 26, 2022

loicteixeira added a commit that referenced this pull request Nov 26, 2022

Update translations placeholder names

56d0fe8

See 364de3d4dc and #9377

loicteixeira deleted the 7111-enforce-translation-string-formatting branch November 26, 2022 11:54

loicteixeira removed the status:Needs Review label Nov 26, 2022

lb- mentioned this pull request Nov 26, 2022

Enforce the use of a single string formatting mechanism for translation source strings #7111

Closed

lb- mentioned this pull request Jul 13, 2023

Drop Python 3.7 support #10676

Closed

lb- removed their request for review January 20, 2024 12:24

gasman mentioned this pull request May 3, 2024

Speed up semgrep on pre-commit #11927

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enforce the use of a single string formatting mechanism for translation source strings #9377

Enforce the use of a single string formatting mechanism for translation source strings #9377

loicteixeira commented Oct 16, 2022 •

edited

squash-labs bot commented Oct 16, 2022

lb- left a comment

lb- Oct 25, 2022

loicteixeira Oct 26, 2022 •

edited

loicteixeira commented Oct 26, 2022

loicteixeira commented Oct 26, 2022

lb- commented Oct 26, 2022

lb- commented Nov 1, 2022

loicteixeira commented Nov 3, 2022 •

edited

loicteixeira commented Nov 3, 2022

th3hamm0r commented Nov 23, 2022

th3hamm0r commented Nov 23, 2022

th3hamm0r left a comment

loicteixeira commented Nov 24, 2022

loicteixeira commented Nov 24, 2022 •

edited

th3hamm0r commented Nov 24, 2022

gasman left a comment

th3hamm0r commented Nov 24, 2022

loicteixeira commented Nov 25, 2022

th3hamm0r left a comment

laymonage left a comment

th3hamm0r left a comment

loicteixeira commented Nov 26, 2022 •

edited

loicteixeira commented Nov 26, 2022 •

edited

lb- commented Nov 26, 2022

Enforce the use of a single string formatting mechanism for translation source strings #9377

Enforce the use of a single string formatting mechanism for translation source strings #9377

Conversation

loicteixeira commented Oct 16, 2022 • edited

Footnotes

squash-labs bot commented Oct 16, 2022

Manage this branch in Squash

lb- left a comment

Choose a reason for hiding this comment

lb- Oct 25, 2022

Choose a reason for hiding this comment

loicteixeira Oct 26, 2022 • edited

Choose a reason for hiding this comment

loicteixeira commented Oct 26, 2022

loicteixeira commented Oct 26, 2022

lb- commented Oct 26, 2022

lb- commented Nov 1, 2022

loicteixeira commented Nov 3, 2022 • edited

loicteixeira commented Nov 3, 2022

th3hamm0r commented Nov 23, 2022

th3hamm0r commented Nov 23, 2022

th3hamm0r left a comment

Choose a reason for hiding this comment

loicteixeira commented Nov 24, 2022

loicteixeira commented Nov 24, 2022 • edited

th3hamm0r commented Nov 24, 2022

gasman left a comment

Choose a reason for hiding this comment

th3hamm0r commented Nov 24, 2022

loicteixeira commented Nov 25, 2022

th3hamm0r left a comment

Choose a reason for hiding this comment

laymonage left a comment

Choose a reason for hiding this comment

th3hamm0r left a comment

Choose a reason for hiding this comment

loicteixeira commented Nov 26, 2022 • edited

loicteixeira commented Nov 26, 2022 • edited

lb- commented Nov 26, 2022

loicteixeira commented Oct 16, 2022 •

edited

loicteixeira Oct 26, 2022 •

edited

loicteixeira commented Nov 3, 2022 •

edited

loicteixeira commented Nov 24, 2022 •

edited

loicteixeira commented Nov 26, 2022 •

edited

loicteixeira commented Nov 26, 2022 •

edited