Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: validate common issues with translatable text #7967

Open
Pentarctagon opened this issue Oct 20, 2023 · 14 comments
Open

CI: validate common issues with translatable text #7967

Pentarctagon opened this issue Oct 20, 2023 · 14 comments
Labels
CI Issues involving the CI platform. Enhancement Issues that are requests for new features or changes to existing ones.

Comments

@Pentarctagon
Copy link
Member

Something that has happened more than once is that a bunch of text is added, and then later someone (usually Wedge009) reviews it and needs to submit a bunch of corrections (ie: #7964 and #7965). So it would be good to be able to do some sort of validation for common issues that tend to be found. These include (but please mention any others as well):

  • Using ' instead of
  • Using - instead of
  • Using two spaces after a period instead of a single space
  • Spelling errors

For the first three, I imagine a grep or similar over the po files would be able to catch these. For spelling I'm not as sure, though IIRC wmllint has a spellcheck, so maybe that would help with at least some of these if/when that can be fixed up enough to run on mainline as part of the CI.

@Pentarctagon Pentarctagon added Enhancement Issues that are requests for new features or changes to existing ones. CI Issues involving the CI platform. labels Oct 20, 2023
@Wedge009
Copy link
Member

Wedge009 commented Oct 20, 2023

Straight single quotes are tricky because often mark-up includes single quotes, and typographic quotes really shouldn't be used there. Examples:

label= _ "<span font='16'>I'll just watch the animals.</span>"

text= _ "<span color='gold'>+$chest_gold gold</span>"

text= _ "In Wesnoth, it is not enough simply to recruit units and fight. You must watch your gold as well, especially in campaigns, where you can carry extra gold over from one scenario to the next. There are two aspects to this; <italic>text='income'</italic> and <italic>text='upkeep'</italic>." + _ "

Arguably the typography changes aren't that important, but I've seen others maintain the importance of using typographic quotes during reviews, so I've tried to follow that. Tools such as poedit highlight non-standard spacing after full-stops as an anomaly. Dashes aren't critical either, but I do think at least dashes at least of the 'en-dash' length should be favoured over a plain hyphen when the symbol is clearly not being used as a hyphen.

Perhaps the only thing missing is the ellipsis character, but the style guide currently says to stick with three full-stops over the Unicode symbol, so it doesn't need to be considered.

Aside from the mentioned PR, 5294bbf was the most recent bunch of corrections I made. I think it's reasonable to fix errors as we find them (especially as they seemed to get missed on initial review and PR merge)... but pre-emptively finding issues and fixing them as early as possible will help reduce the re-translation overhead.

@Wedge009
Copy link
Member

Wedge009 commented Oct 24, 2023

Assuming we want to maintain consistency with the style guide - and I recognise that's not a given - I started drafting a quick little Python script to look for translation strings and find non-conforming punctuation in them (and potentially replace them). Strings on multiple lines is fine, but I got stuck on working out how to handle multiple separate translation strings on the same line (I'm not sure if there's already a way to parse this in wmllint, for example, I was writing this all from scratch). An example is:

wesnoth/data/core/help.cfg

Lines 125 to 134 in 13353f8

[topic]
id=..eras_section
title= _ "Eras"
text= _ "A faction is a collection of units and leaders. Factions are assigned to sides in multiplayer games.
" + _ "An era is a collection of factions, intended to be played against one another. Besides the mainline eras that come with the game, many user-made factions are available from add-ons.
"
generator="contents:generated"
[/topic]

(There are also examples where the second translation string terminates on that same line too.

That aside, I thought I would share what I've found so far in master:

  • No back-quotes
  • No straight double-quotes (presumably that would be a syntax error anyway)
  • A curious single case of an actual ellipsis character being used over three regular full-stops (I wasn't expecting any at all):
    message= _ "Oh no, more of them? And there I thought I could catch my breath…"
  • Loads of straight single-quotes - some are genuine inconsistencies, will need to filter out those that lie within mark-up (presumably can be handled with regular expressions).
  • Loads of hyphens/minus-signs, may be difficult to programmatically filter out ones that really ought to be dashes. Some instances are even used as textual bullet-points.

I'm not intending on making any further changes in the strings right now, but just pointing out what I've found so far. At least the processing seems fast. For all the .cfg files in Wesnoth's data directory, it only took a second or so to process.

@Pentarctagon
Copy link
Member Author

Pentarctagon commented Oct 24, 2023

I would assume (hope?) that the wmlparser3 library that wmlint et al use has a way of handling multiple concatenated strings, since that's done in a fair number of places. I don't actually know though.

@soliton-
Copy link
Member

wmllint does not use wmlparser3. Do not look at wmllint on how to parse WML.

@soliton-
Copy link
Member

At a quick glance this should either be done by properly parsing (wesnoth preprocessed) WML with wmlparser3 or work on the pot-files. The latter would require a pot-update prior to get new strings in there. (which we already do in CI anyway)

@Wedge009
Copy link
Member

Wedge009 commented Oct 24, 2023

I worked out how to grab the translation strings (including multi-line and multiple end/starts on the same line) from the cfg files based on similar experience in processing oddly-formatted XMLs for my work (multiple start/end tags on the same line). It might be a naïve approach because I started from scratch, but it seems to work well enough.

The main struggle now is how to appropriately filter straight single-quotes and hyphen-minus characters. It may not be possible to cover every case programmatically, but I suppose it's a starting point. I'm not great at regular expressions, so maybe that's the limiting factor. I skip mark-up such as hexadecimals (for colour values) but where the mark-up is all alphabetical I seem to be getting unintended matches. For example:

.../wesnoth/data/multiplayer/factions/northerners-aoh.cfg:16
The <bold>text='Northerners'</bold> are a faction of <ref>text='Orcs' dst='..race_orc'</ref> and their allies who live in the north of the Great Continent, thus their name. Northerners consist of the warrior orcs race, the enslaved <ref>text='goblins' dst='..race_goblin'</ref>, <ref>text='trolls' dst='..race_troll'</ref> who are tricked into combat by the orcs, and the serpentine <ref>text='naga' dst='..race_naga'</ref>. The Northerners play best by taking advantage of having many low-cost and high HP soldiers.

Not sure why I'm getting a match on the equals sign, for example, with [a-zI\s(]\'[a-z\s)]. Edit: Never mind, I realised it's matching on the end of the mark-up, such as for trolls' . Don't think it's possible to automatically distinguish that from normal prose unless doing something much more complex like parsing the mark-up itself.

However, this experiment does show me that there are still a few places I've missed such as in WoF - I thought I already covered all the campaign cases:

message= _ "Behold, mighty Shek'kahan!"

Given all the ways dashes can be used, I'm not really sure where to go with hyphen-minus other than the most obvious case of -. But I think catching the straight quotes is more important than accounting for dashes.

Perhaps I should wait until we've actually made a decision about how tightly we want to enforce the style guide before investing more time in this. It was really meant to be an experiment I started on a whim...

@soliton-
Copy link
Member

I worked out how to grab the translation strings (including multi-line and multiple end/starts on the same line) from the cfg files based on similar experience in processing oddly-formatted XMLs for my work (multiple start/end tags on the same line). It might be a naïve approach because I started from scratch, but it seems to work well enough.

Sounds like normal XML. It's not a line based format. Parsing XML (https://stackoverflow.com/q/1732348) or WML with regex is generally not a good approach.

@Wedge009
Copy link
Member

For my work context, at least, it is definitely non-standard XML. Multiple XML trees, including DOCTYPE declarations, with inconsistent new-line formatting. Running a standard XML parser wasn't going to work.

I'm not really familiar with wmlparser3.py. I ran against a single cfg data file and it just seemed to give me back the the same WML. I really should stop looking at this, though.

@stevecotton
Copy link
Contributor

You might be better using a library to read the pot files, instead of writing a new cfg parser.

@CelticMinstrel
Copy link
Member

Dashes aren't critical either, but I do think at least dashes at least of the 'en-dash' length should be favoured over a plain hyphen when the symbol is clearly not being used as a hyphen.

The usual rule for dashes is either to use or to use with no spaces around it. When a hyphen is used as a dash, it's generally surrounded by spaces, so I would think that substituting dashes is as simple as s/ - / – /.

No back-quotes

This isn't even a quote character. It's a grave accent. So it's definitely wrong to use it in prose. Which you've said we're not doing, so that's good.

No straight double-quotes (presumably that would be a syntax error anyway)

Not really relevant because we want to use curly quotes instead of straight quotes, but… WML syntax does allow for the use of double quotes in a string. You'd write it something like this: _"This is a ""silly"" example"

Loads of straight single-quotes - some are genuine inconsistencies, will need to filter out those that lie within mark-up (presumably can be handled with regular expressions).

The single quote is actually quite tricky, and I don't think it can be handled with regular expressions. The problem is in words like ’tis. If you use normal heuristics, that would become ‘tis which is incorrect. The rule is that an apostrophe is always a "closing" quote, even when it appear at the beginning of a word. Probably the best we can manage is automate it with a word processor's "smart quotes" algorithm and then review the changes to correct any start-of-word exceptions. If you want to fully automate it, you probably need to make a list of words that start with an apostrophe.

For markup, you can ignore anything enclosed in <angle brackets>, ie in regex <.*?>. That covers Pango markup, but there's also help markup to consider. Perhaps you can match that with a regular expression such as the following:

<.*?>.*?=.*?</.*?>

Translation: any run of text between and open and closing tag that contains an equals sign.

Loads of hyphens/minus-signs, may be difficult to programmatically filter out ones that really ought to be dashes. Some instances are even used as textual bullet-points.

The ones used as textual bullet-points should probably be replaced with . Other than that, as I mentioned earlier, it should be relatively easy to distinguish hyphen from dash based on whether it has spaces around it. It's fair to want to review the list just to make sure there are no exceptions, though.

@Wedge009
Copy link
Member

Wedge009 commented Oct 29, 2023

The usual rule for dashes is either to use or to use with no spaces around it. When a hyphen is used as a dash, it's generally surrounded by spaces, so I would think that substituting dashes is as simple as s/ - / – /.

I think there are some occasions where hyphen may be used as a dash, but without spaces. But because of confusion with actual hyphens, I haven't verified this. Current issues that I did find are in #8000.

This isn't even a quote character. It's a grave accent. So it's definitely wrong to use it in prose. Which you've said we're not doing, so that's good.

I was just looking at characters that the style guide said to avoid. And I've noticed people use the character as a quote, perhaps more from pre-Internet days, though.

WML syntax does allow for the use of double quotes in a string. You'd write it something like this: _"This is a ""silly"" example"

That's what I was trying to work out, the escape character for a straight double-quote. Updated the list in #8000 with these. There were a couple for manual tests that I haven't recorded, one of which was marked for translation... I wonder if it's even worth having translated strings for tests?

@CelticMinstrel
Copy link
Member

CelticMinstrel commented Oct 29, 2023

I think there are some occasions where hyphen may be used as a dash, but without spaces. But because of confusion with actual hyphens, I haven't verified this.

I don't think I've ever seen this. Maybe using a double hyphen as an em dash is precedented though? I'm not sure.

And I've noticed people use the character as a quote, perhaps more from pre-Internet days, though.

Yes, I've seen it used that way, and it looks absolutely atrocious. It's a remnant of a time when all you had was ASCII. I'm not sure, it may have looked fine in contemporary fixed-width fonts, but it does not look even remotely okay in variable-width fonts.

I wonder if it's even worth having translated strings for tests?

It is done to simplify the schema validation, so that we don't have to somehow specify an "exception" in the tests for keys that are "required" to be translatable.. The strings are marked for translation, but placed in a textdomain that is not collected by the pot-update.

If you find such a string that's not in the untranslated wesnoth-test textdomain, then that's an error.

(Although, as a side note, I think there could be a debate over whether some of the test scenarios deserve to be translated. At least the AI demo scenarios are quite friendly and kind of intended for addon developers to view as examples. They were at one time translated; now they aren't. I don't think it's a clear-cut situation, there are good arguments for both sides.)

@Wedge009
Copy link
Member

Wedge009 commented Oct 29, 2023

I don't think I've ever seen this. Maybe using a double hyphen as an em dash is precedented though? I'm not sure.

Ah, good point. I did a quick review on that and I couldn't find any instances in anything that's viewed by a normal player, but I did see some instances in WML commentary (which is fine).

@cooljeanius
Copy link
Contributor

cooljeanius commented Jan 5, 2024

For spelling I'm not as sure, though IIRC wmllint has a spellcheck, so maybe that would help with at least some of these if/when that can be fixed up enough to run on mainline as part of the CI.

Judging by the number of failures in #8180, that looks to be a long way off still...
Edit: cross-referencing against: https://forums.wesnoth.org/viewtopic.php?t=57291

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Issues involving the CI platform. Enhancement Issues that are requests for new features or changes to existing ones.
Projects
None yet
Development

No branches or pull requests

6 participants