New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enforce the use of a single string formatting mechanism for translation source strings #9377
Enforce the use of a single string formatting mechanism for translation source strings #9377
Conversation
Manage this branch in SquashTest this branch here: https://loicteixeira7111-enforce-trans-ay1u4.squash.io |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@loicteixeira this is incredible - the code looks solid to me, have not run locally yet though.
A few things
- We probably need eyes on this from a few more core team to get feedback (I will add the needs decision label and flag this for tomorrow's meeting)
- Adding
semgrep
as another dependency is probably required, but just wondering if this can be somehow written into a tool we already use like a flake8 plugin (note: I have done zero research into this, so feel free to shut that idea down). https://flake8.pycqa.org/en/latest/plugin-development/index.html - Semgrep's licence looks like it's compatible with Wagtail - https://github.com/returntocorp/semgrep/blob/develop/LICENSE
Do not use str.format style formatting for translations. | ||
Use printf style formatting with named placeholders instead. | ||
For example, do `_("Hello %(name)s") % {"name": "Wagtail"}` | ||
instead of `_("Hello {name}").format(name="Wagtail")`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should really point the developer to a docs page here, hopefully I can get #9431 in and we can have this end docs file as a source of truth here.
The nuances of the various types of string formatting was a bit for me to understand, and there is the temptation to use the 'newer' thing. Having the docs as a place where can explain this a bit better would be useful.
Also, this means adding a docs section also I think as part of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, now that #9431 is merged, I'll rebase my branch tomorrow Friday and piggyback on it 👍
Yes, it should be possible to do the same thing with a flake8 plugin, but it's a bit more involved since it works with the AST tree (where semgrep is some pseudo-regex) and would need to be published to pypi. I actually started exploring this, but finally switched to semgrep so I could get this out quickly and finally provide some much needed help to the translators. I'd understand though if we don't want to introduce yet another dev dependency, but I'd be in favour of going with it for now, and I can try in a second time to replace semgrep with a flake8 plugin. As a side note, merging the string changes without the linter is also possible, but obviously we would have to do this all over again later 😒 |
Also, we probably need to wait for 4.1 to be released, otherwise we will mess up the translations for that release. |
I think that reasoning makes sense - regarding the added dependency. I also think this should be merged as one, the fixes and the enforcing of the new approach. Hopefully though there will be some other thoughts on this from others in the core team soon. Thanks Loic. |
Discussed at the core team last week, no major objections but a few refinements to do before this goes in. But now that 4.1 is out this can be revisited.
Hopefully that summarises it ok. Thanks for the epic progress on this @loicteixeira |
4c7106b
to
462b4ba
Compare
Rebased on Edit: I removed the |
Btw, if approved, I'm obviously happy to merge this myself and run the scripts since this isn't your usual merge & push. |
@loicteixeira epic work, this really improves the quality of wagtail's i18n! I've focused on the changed translation calls in python and the resulting strings (original and "de"), I've also tested your script locally. # => Are you sure you want to delete this %(object)s? If other things in your
"Are you sure you want to delete this %s? If other things in your": (
("%s", "%(object)s"),
), Somehow you've only used parts of the original string. This finding was lucky, because the string hasn't been changed, but the (newer?) polib-version reformatted the string, so I've found the remaining "%s". As you've pointed out above, not everything can be tested, but from my point of view it looks good 🙏 |
After finding the missed replacement above, I've searched through all resulting po files. I did not find any other (missed) "%s" strings, but after searching for "{", I've found the following, where I don't know exactly why it has been missed with semgrep? wagtail/wagtail/contrib/simple_translation/forms.py Lines 43 to 48 in 3037ce0
Also there are some uses of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sry for spamming you with comments, my process was a bit iterative, I should have waited and used the review comment once 😬
No worries. Thank you for the thorough review 👍 |
3037ce0
to
171fd2b
Compare
Rebased on main.
Nice catch. I've updated the script accordingly!
The semgrep was looking for a function called
In the case of ngettext, it's still not perfect as it will only be looking at the first of the two strings, but maybe that's good enough for a first iteration 🤔
The rule only looks for Python files. I'll have a look at enabling it for js/jsx/ts/tsx files as well, but I'll need to make sure it doesn't break anything (regarding the replace/interpolate you mentioned). |
Awesome!
Yes, I assumed that. Since I haven't used them yet, I cannot really give advice here, so for me it is more a nice to have. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work on this @loicteixeira!
Should we add The PR's template guides everyone to execute |
Now all occurences use the same replacement name. And while not necessary, the name was updated for related strings for homogeneity.
16d001b
to
65b996b
Compare
Script updated, see the changes.
All updated to use
Indeed, this won't work. Ignoring it then and it will be up to the translators to retranslate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
I've rerun the scripts again and there seems to be only one issue with the new plural handling in your script. It produced the following diff:
-msgid "Include subtree ({} page)"
+msgid "Include subtree (%(descendant_count)s page)"
msgid_plural "Include subtree ({} pages)"
-msgstr[0] "Teilbaum ({} Seite) einbeziehen"
-msgstr[1] "Teilbaum ({} Seiten) einbeziehen"
+msgstr[0] "Teilbaum (%(descendant_count)s Seite) einbeziehen"
+msgstr[1] "Teilbaum (%(descendant_count)s Seiten) einbeziehen"
The "msgid_plural" isn't replaced, so I think this has to be done in this block: https://gist.github.com/loicteixeira/bcc5e678e3d06af07bdb3afe9d149575#file-source_strings_udpate-py-L452-L454
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love this PR, thank you very much for the great work!
I'm working on some refactoring between snippets and generic views in #9709, and I'd like to rename some variables in translatable strings so that both views can share the same strings. I was hoping to piggyback some strings in the snippets views, if you don't mind? Thanks (and sorry for adding more work)!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reran the scripts, looks good! Again, amazing work @loicteixeira and all contributors!
I think we should keep your script (or better you as the "polib-pro") in mind, because changing placeholder names during refactorings will probably happen again in the future, and this little script avoids creating unnecessary (maybe release-blocking) tasks for the translators 👍
I merged locally, but the re-upload to Transifex didn't really go as planned. |
Steps taken to merge (this for posterity, and also because I'll have to do it all over again for wagtail-localize which is in the same Transifex project). BRANCH_NAME="7111-enforce-translation-string-formatting"
# Merge
git switch main
git pull --ff-only upstream main
git switch "${BRANCH_NAME}"
git reset $(git merge-base main "${BRANCH_NAME}")
git add .
git commit -m "Enforce the use of a single string formatting mechanism for translation source strings" -m "Close #9377"
git rebase main
git merge --ff-only "${BRANCH_NAME}"
# Pull & commit existing translations
cd scripts
./fetch-translations.sh > ../../logs/translation.out 2>../../logs/translation.err
python ./check-translation-strings.py
cd ..
git add .
git commit -m "Fetch new translations from Transifex"
# Update the strings and sanity check with French translation (I can't possibly check all languages...)
python ./scripts/source_strings_update.py
find wagtail -name *.po -type f -path **/fr/** -exec git diff --staged --word-diff {} \;
git add .
# Rebuild English translations to check if there's diff from what the script has done
cd scripts
./rebuild-translation-sources.sh
cd ..
git add .
git commit -m "Update translations placeholder names" -m "See 364de3d4dc and #9377"
# Last linting check
semgrep --config .semgrep.yml --error . Now the tricky part:
Languages now show 160 untranslated strings as expected.
Sadly it did not work, files were being "skipped", and we still had 160 strings to translate. Trying to fix it: git checkout HEAD~1 # Come back to the commit before the placeholder update
tx push --source # Re-push the original English sources At this point, Transifex is back to what it was before I started. git switch main
# Delete translations
find wagtail -iname *.po ! -iwholename */en/* -delete
# Re-fetch, but without using `scripts/fetch-translations.sh` to keep strings location information.
tx pull --all --minimum-perc=1
# Re-update placeholders
python ./scripts/source_strings_update.py
# Re-upload English updated source strings to Transifex
tx push --source
# Re-upload French translations
tx push --translations --language=fr Unfortunately, files are still being "skipped". # Re-upload French translations
tx push --translations --language=fr --force --no-interactive This work, but it means that the time between fetch and upload has to be as short as possible, or we will overwrite work from translators which might be updating strings at the same time we do this work. # Upload other translations
tx push --translations --force --no-interactive So at this point I'm not sure if files were skipped because of the missing location information or simply the timestamps. In the end, there are: 14 strings which weren't updated (that's 10 more than the 5 impossible/ambiguous auto-update we had identified). So 160 strings down to 14. Not perfect but not bad either. # Cleanup & merge
git restore .
git push --dry-run upstream main
git push upstream main So this might need some refinement before we start using it on a "regular" basis. Anyway, thank you @lb-, @th3hamm0r, @gasman & @laymonage for your help on this issue. |
Wow @loicteixeira - thanks for documenting this process also. |
Fix #7111. As discussed there, the chosen style is the old style formatting.
25a83e0 setups and configures Semgrep in order to lint the source strings and prevent regressions.
It might not be perfect just yet, maybe there's false negative, but at least I didn't see any false positives (which we could always temporarily ignore with
# nosemgrep: id-of-the-error
until the rules are updated).11e86ea and 4c7106b (the second one deals with strings in JS, so it was separated to another commit for ease of review.) update the source strings to the desired format.
See a compiled list of source string changes.
5db259e is unrelated, but since flake8 was running on the file, it would complain about it.
Now that's about 150 strings which translators would have to update as I don't think the auto matching whatever will be able to pick this up. In order to reduce the burden as much as possible, I've made the following script which will update the PO files with the new placeholders when possible.
See the update script.
If all goes well, it would only leave the translators 4 translations to update (see the
AMBIGUOUS_REPLACEMENTS
constant in the script).However I'm not 100% sure how to proceed, so here is what I'm thinking so far:
main
to create a "restore" point, just in casecd wagtail_root; python ./scripts/issue_7111_update_i18n_placeholders.py
tx push -t
Unfortunately I can't really test it without actually doing it...
The main idea is that we can't push the updated source strings first because the translations correspondance would be lost, so we fetch first (step 2), update locally (step 4), push the updated source string (step 6), at this point the translations correspondance would be lost, but pushing the updated translations (step 7) should restore it 🤞
The other question that I have is when pulling translations with the
fetch-translations.sh
script, we would normally do some cleanup and maybe we don't want to do that (since we're going to push those files later), so we should only run the first 2 statements of the script (i.e. remove old translation files then pull).Please check the following:
make lint
from the Wagtail root.Have you added tests toExisting tests cover the new/fixed behaviour?For front-end changes: Did you test on all of Wagtail’s supported environments?2Please list the exact browser and operating system versions you tested:Please list which assistive technologies 3 you tested:For new features: Has the documentation been updated accordingly?Please describe additional details for testing this change.
Footnotes
Development Testing ↩
Browser and device support ↩
Accessibility Target ↩