CI: validate common issues with translatable text #7967
Comments
Straight single quotes are tricky because often mark-up includes single quotes, and typographic quotes really shouldn't be used there. Examples:
Line 216 in 0945c30
Arguably the typography changes aren't that important, but I've seen others maintain the importance of using typographic quotes during reviews, so I've tried to follow that. Tools such as poedit highlight non-standard spacing after full-stops as an anomaly. Dashes aren't critical either, but I do think dashes of at least 'en-dash' length should be favoured over a plain hyphen when the symbol is clearly not being used as a hyphen. Perhaps the only thing missing is the ellipsis character, but the style guide currently says to stick with three full-stops over the Unicode symbol, so it doesn't need to be considered. Aside from the mentioned PR, 5294bbf was the most recent bunch of corrections I made. I think it's reasonable to fix errors as we find them (especially as they seem to have been missed on initial review and PR merge)... but pre-emptively finding issues and fixing them as early as possible will help reduce the re-translation overhead.
Assuming we want to maintain consistency with the style guide - and I recognise that's not a given - I started drafting a quick little Python script to look for translation strings and find non-conforming punctuation in them (and potentially replace it). Strings spanning multiple lines are fine, but I got stuck on working out how to handle multiple separate translation strings on the same line (I'm not sure if there's already a way to parse this in Lines 125 to 134 in 13353f8
(There are also examples where the second translation string terminates on that same line too. That aside, I thought I would share what I've found so far in
I'm not intending on making any further changes in the strings right now, but just pointing out what I've found so far. At least the processing seems fast. For all the |
I would assume (hope?) that the wmlparser3 library |
wmllint does not use wmlparser3. Do not look to wmllint for how to parse WML.
At a quick glance, this should either be done by properly parsing (wesnoth-preprocessed) WML with wmlparser3, or by working on the pot-files. The latter would require a prior pot-update to get new strings in there (which we already do in CI anyway).
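As a rough illustration of the pot-file route, here is a minimal sketch that pulls `msgid` strings out of pot/po text. The function name `iter_msgids` and the sample data are hypothetical, and it assumes the PO escapes in use (`\n`, `\"`, `\\`) are compatible with Python string literals; a dedicated PO library would likely be more robust.

```python
import ast

def iter_msgids(pot_text):
    """Yield each non-empty msgid from pot/po text, joining continuation lines."""
    pieces = None  # quoted fragments of the msgid being read, or None
    # The extra "" flushes a msgid that ends exactly at the end of the text.
    for line in pot_text.splitlines() + [""]:
        line = line.strip()
        if line.startswith("msgid "):
            pieces = [line[len("msgid "):]]
        elif pieces is not None and line.startswith('"'):
            pieces.append(line)  # continuation line of a multi-line msgid
        elif pieces is not None:
            # Assumption: each fragment parses as a Python string literal.
            text = "".join(ast.literal_eval(p) for p in pieces)
            pieces = None
            if text:  # skip the empty header msgid
                yield text

# Hypothetical sample in PO syntax (raw string keeps the backslashes).
SAMPLE = r'''
msgid ""
msgstr ""

#: data/example.cfg:1
msgid "Hello - world"
msgstr ""

msgid ""
"First line\n"
"second \"part\""
msgstr ""
'''

for msgid in iter_msgids(SAMPLE):
    print(repr(msgid))
```

Working on msgids sidesteps the question of multiple translatable strings on one WML line, since each string is already a separate entry in the pot-file.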
I worked out how to grab the translation strings (including multi-line and multiple end/starts on the same line) from the The main struggle now is how to appropriately filter straight single-quotes and hyphen-minus characters. It may not be possible to cover every case programmatically, but I suppose it's a starting point. I'm not great at regular expressions, so maybe that's the limiting factor. I skip mark-up such as hexadecimals (for colour values) but where the mark-up is all alphabetical I seem to be getting unintended matches. For example:
However, this experiment does show me that there are still a few places I've missed such as in WoF - I thought I already covered all the campaign cases:
Given all the ways dashes can be used, I'm not really sure where to go with hyphen-minus other than the most obvious case of Perhaps I should wait until we've actually made a decision about how tightly we want to enforce the style guide before investing more time in this. It was really meant to be an experiment I started on a whim... |
Sounds like normal XML. It's not a line based format. Parsing XML (https://stackoverflow.com/q/1732348) or WML with regex is generally not a good approach. |
For my work context, at least, it is definitely non-standard XML. Multiple XML trees, including DOCTYPE declarations, with inconsistent new-line formatting. Running a standard XML parser wasn't going to work. I'm not really familiar with |
You might be better using a library to read the |
The usual rule for dashes is either to use or to use with no spaces around it. When a hyphen is used as a dash, it's generally surrounded by spaces, so I would think that substituting dashes is as simple as
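A minimal sketch of that kind of substitution, assuming the target is a hyphen-minus with a space on each side and that the replacement is a spaced en dash (the exact replacement is not shown above, so the `dash` parameter and the name `replace_spaced_hyphens` are illustrative only):

```python
import re

# A hyphen-minus with exactly one space on each side, between non-space characters.
SPACED_HYPHEN = re.compile(r"(?<=\S) - (?=\S)")

def replace_spaced_hyphens(text, dash="\u2013"):
    """Replace ' - ' with a spaced dash; true hyphens such as 'well-known' are untouched."""
    return SPACED_HYPHEN.sub(f" {dash} ", text)

print(replace_spaced_hyphens("Wait - what about well-known words?"))
```

This would also rewrite arithmetic such as "3 - 2", so the output probably still needs a human look before anything is committed.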
This isn't even a quote character. It's a grave accent. So it's definitely wrong to use it in prose. Which you've said we're not doing, so that's good.
Not really relevant because we want to use curly quotes instead of straight quotes, but… WML syntax does allow for the use of double quotes in a string. You'd write it something like this:
The single quote is actually quite tricky, and I don't think it can be handled with regular expressions. The problem is in words like For markup, you can ignore anything enclosed in
Translation: any run of text between an opening and closing tag that contains an equals sign.
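One way to read that rule, as a sketch only: treat any `<...>` span containing an equals sign as markup, blank it out, and then look for straight single quotes in what remains. The angle-bracket delimiters and the helper name are assumptions, since the original pattern is not shown here.

```python
import re

# Assumption: markup is a <...> span that contains '=', e.g. <span color='red'>.
MARKUP = re.compile(r"<[^<>]*=[^<>]*>")

def straight_quotes_outside_markup(text):
    """Return the offsets of straight single quotes that are not inside markup."""
    # Mask markup with spaces of equal length so offsets match the original text.
    masked = MARKUP.sub(lambda m: " " * len(m.group()), text)
    return [i for i, ch in enumerate(masked) if ch == "'"]

sample = "<span color='red'>Don't</span>"
print(straight_quotes_outside_markup(sample))
```

The quotes around `red` are ignored while the apostrophe in "Don't" is reported. Deciding whether a reported quote should become an apostrophe or an opening quote is a separate problem this does not attempt.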
The ones used as textual bullet-points should probably be replaced with |
I think there are some occasions where hyphen may be used as a dash, but without spaces. But because of confusion with actual hyphens, I haven't verified this. Current issues that I did find are in #8000.
I was just looking at characters that the style guide said to avoid. And I've noticed people use the character as a quote, perhaps more from pre-Internet days, though.
That's what I was trying to work out, the escape character for a straight double-quote. Updated the list in #8000 with these. There were a couple for manual tests that I haven't recorded, one of which was marked for translation... I wonder if it's even worth having translated strings for tests? |
I don't think I've ever seen this. Maybe using a double hyphen as an em dash is precedented though? I'm not sure.
Yes, I've seen it used that way, and it looks absolutely atrocious. It's a remnant of a time when all you had was ASCII. I'm not sure, it may have looked fine in contemporary fixed-width fonts, but it does not look even remotely okay in variable-width fonts.
It is done to simplify the schema validation, so that we don't have to somehow specify an "exception" in the tests for keys that are "required" to be translatable. The strings are marked for translation, but placed in a textdomain that is not collected by the pot-update. If you find such a string that's not in the untranslated wesnoth-test textdomain, then that's an error. (Although, as a side note, I think there could be a debate over whether some of the test scenarios deserve to be translated. At least the AI demo scenarios are quite friendly and kind of intended for addon developers to view as examples. They were at one time translated; now they aren't. I don't think it's a clear-cut situation; there are good arguments for both sides.)
Ah, good point. I did a quick review on that and I couldn't find any instances in anything that's viewed by a normal player, but I did see some instances in WML commentary (which is fine). |
Judging by the number of failures in #8180, that looks to be a long way off still... |
Something that has happened more than once is that a bunch of text is added, and then later someone (usually Wedge009) reviews it and needs to submit a bunch of corrections (e.g. #7964 and #7965). So it would be good to be able to do some sort of validation for the common issues that tend to be found. These include (but please mention any others as well):
' instead of ’
- instead of —
For the first three, I imagine a grep or similar over the po files would be able to catch these. For spelling I'm not as sure, though IIRC wmllint has a spellcheck, so maybe that would help with at least some of these if/when that can be fixed up enough to run on mainline as part of the CI.
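A grep-like check could look roughly like the sketch below. It only covers the two character cases that are listed, applies them to plain lines of text, and uses hypothetical names (`CHECKS`, `find_issues`); it makes no attempt to skip markup or po-file syntax.

```python
import re

# Patterns for the listed issues; more entries can be added as needed.
CHECKS = {
    "straight single quote": re.compile(r"'"),
    "hyphen used as dash": re.compile(r"(?<=\S) - (?=\S)"),
}

def find_issues(lines):
    """Return (line_number, issue_name, line) for every check that matches a line."""
    issues = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in CHECKS.items():
            if pattern.search(line):
                issues.append((number, name, line))
    return issues

for number, name, line in find_issues(["It's here", "Fine \u2013 really", "Wait - what"]):
    print(f"{number}: {name}: {line}")
```

In a CI job, a non-empty result could be turned into a failing exit code, though false positives (for example quotes inside markup) would need filtering first.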