Translate strings via Japanese→$language dictionaries? #38

Open
nmlgc opened this issue Oct 29, 2014 · 4 comments

nmlgc commented Oct 29, 2014

With the reorganization of the project and the progress we've made in the last year, I think it's worth raising this question again. The new project administration will need to take a stance on this before covering the next game, and it'll be important to know the background behind my decision against such a system.

It has always been my design goal for thcrap's translation functionality not to rely on any original Japanese strings. This decision stems from two observations:

  • There are static translation patches for earlier games into a variety of languages.
  • With the older games essentially being abandonware [citation needed] and, as of this writing, only one game being distributed as an official download release, piracy remains the only distribution channel that really matters. And it has been shown time and time again that most site owners would rather provide complete, easy-to-use packages aimed at speakers of their own language than clean, original copies. This is further justified by some inadvertent "region locking" that e.g. causes 東方紅魔郷.exe and every game's custom.exe to simply not work on non-Japanese locales.

Combine the two and you have rampant piracy of unofficially translated games. We can't rely on pirated copies coming with the original text anymore. That's just how it is.

This means, however, that we have needed all sorts of alternative indexing systems to correctly assign translations:

  • Dialogue and endings use a rather involved time code system that additionally gives translators the freedom to edit dialogue on a text box level and allows them to use 2 lines where the original only uses one (and vice versa). This greatly goes against the line-centric design of ZUN's original formats, and is one source of all the complexity in the dialogue patcher (with the other one being the hard line system used in th06, th07 and parts of th08).

  • Spell cards use their (zero-based) index number in the game's Result screen, which is pulled out of the game's memory using a bunch of breakpoints. This works fine for games that have a result screen. All games that don't (TH095, Uwabami Breakers, TH125, TH143) coincidentally also happen to have no reliable index numbers, and therefore require even more build-specific binary hacks and breakpoints to somehow derive them from the game progression - or, in the case of Uwabami Breakers, a full copy of all .ECL danmaku scripts with corrected IDs. :(

  • Hardcoded string translation assigns IDs to virtual memory addresses (stringlocs.js), then looks up translations for these IDs in a separate table (stringdefs.js). This is reliable, needs zero game- and build-specific hacks, and I have a script to locate the addresses, but they still need to be committed for every build of every game.

  • Music Room translation uses a combination of hardcoded string translation (for the "No. X ??????" strings displayed for locked tracks), a separate file containing song title translations (themes.js) and a separate file for comments (musiccmt.js).

    The themes.js system was designed before thcrap to serve as a song title source for (hypothetical) third-party applications to cope with frequent translation changes (which in turn was the main motivation for thcrap in the first place).

    musiccmt.js basically uses the same format as the generic plaintext translation support that would later be developed for th143, but with a special syntax that replaces a single @ character in a line with a customizable format string, printing the title of the currently selected track.

    Not to mention that pulling the theme number out of the game also requires its fair share of build-specific breakpoints.

  • Dialog resource translation makes use of the fact that the widgets (and thus, their strings) internally appear in a set order, builds a JSON array of hardcoded string IDs (dialog_*.js) in this order, then pulls the translations out of stringdefs.js. Other than that, no build-specific hacks necessary.
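The hardcoded string and dialog resource schemes above both boil down to a two-stage lookup. A minimal Python sketch of that idea follows. Every address, ID and string in it is invented for illustration, and it does not reproduce the real contents of stringlocs.js, stringdefs.js or dialog_*.js, nor thcrap's actual lookup code.

```python
# Hypothetical data: all addresses, IDs and strings are invented.
# stringlocs.js: virtual memory address -> string ID (committed per build).
stringlocs = {
    "0x4a1b20": "th14 Pause Quit",
    "0x4a1b48": "th14 Pause Resume",
}
# stringdefs.js: string ID -> translation (independent of the build).
stringdefs = {
    "th14 Pause Quit": "Quit",
    "th14 Pause Resume": "Resume",
}
# dialog_*.js: string IDs listed in the widgets' internal order.
dialog_ids = ["th14 Pause Resume", "th14 Untranslated Widget"]


def translate_at(address, original):
    """Return the translation for the string at `address`, else the original."""
    string_id = stringlocs.get(address)
    return stringdefs.get(string_id, original)


def translate_dialog(originals):
    """Translate widget strings by position, keeping untranslated ones as-is."""
    return [stringdefs.get(i, o) for i, o in zip(dialog_ids, originals)]
```

Note that the original Japanese text is never used as a key here, which is what keeps this approach compatible with statically patched copies.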

Using one single dictionary-based solution instead of these systems would greatly reduce the amount of effort required to support new games at the expense of both compatibility to static patches and more bloat in the translation files.

nmlgc added the question label Oct 29, 2014

nmlgc commented Oct 29, 2014

Reposting this from our (now apparently dead) Trello page. I've taken a look at Nutzer's Touhou 8.3 patch and noticed that it frequently uses multiple spell card declarations within a single scene. These would pretty much be impossible to translate without a dictionary system.

Currently, base_tsa merely has an extremely ugly solution to handle this case for the one single scene in the original TH14.3, 3-7, which actually has a second spell card name. It involved finding a certain location in the ECL parsing code that only seems to be called in that specific case, then resetting thcrap's internal spell card ID and assigning the translation for "「リザレクション」" to ID № 1.
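To illustrate why several declarations in one scene are a problem for index-based lookup, here is a minimal Python sketch. Only the scene number 3-7 and 「リザレクション」 come from the text above. The first spell name and both translations are placeholders, and none of this is thcrap's actual code.

```python
# Index-based: one translation slot per scene.
by_scene = {"3-7": "First spell (placeholder)"}

# Dictionary-based: keyed by the original name the game is about to display.
by_name = {
    "「最初のスペル」": "First spell (placeholder)",  # invented original name
    "「リザレクション」": '"Resurrection"',            # placeholder translation
}


def translate_by_scene(scene, original):
    """Look up by scene only; every declaration in the scene gets the same text."""
    return by_scene.get(scene, original)


def translate_by_name(original):
    """Look up by original text; unknown names are passed through unchanged."""
    return by_name.get(original, original)


# Two declarations within the same scene.
declared = ["「最初のスペル」", "「リザレクション」"]
scene_results = [translate_by_scene("3-7", name) for name in declared]
name_results = [translate_by_name(name) for name in declared]
```

With the scene-keyed table, both declarations resolve to the same entry, which is why the current workaround has to reset the spell card ID by hand.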

(Just in case anyone was still thinking that 14.3 is just a generic Touhou game that shouldn't have posed any difficulty in automatic patching. It is not.)


nmlgc commented Jul 16, 2016

Turns out the best implementation is as follows:

  • Keep the current overall translation file layout, with a separate dictionary for every original file. This is necessary because according to UnKnwn, there are many cases of identical sentences in different contexts which translators might want to translate differently.

  • Still, there should be an additional global fallback table per game, and not only to save work for translators who don't want to translate those differently.

  • Add a layer of indirection, so that we go "Japanese": some ID (in filename.ids.json in base_tsa) and then some ID: "translation" (in filename.table.json in the translation patches). This will be necessary for supporting static patches by simply adding "statically translated text": some ID to filename.ids.json.

    Note that this means that we technically don't need separate filename.table.json files and could keep all translations in one single big table per game, but I think it would still be better for saving traffic when doing HTTP updates, and for clarity in editing.

  • By making the ID step optional, we eliminate the need for those separate ID tables where it really isn't all too necessary, as in…

  • … client-side character name translation, which we can do in the context of an ending like this:

    base_tsa/th14/e01.msg.ids.json:

    {
        "霊夢 「やっと、お祓い棒が大人しくなってきたわ」": "e01_02"
    }

    lang_en/th14/e01.msg.table.json:

    {
        "e01_02": "<r$<d$霊夢>  >\"Looks like my purification rod's finally calming down.\""
    }

    script_latin/global.table.json:

    {
        "霊夢": "Reimu"
    }

    The <d$> command would then perform dictionary lookup of the given text in the global table. As you see, we don't need an additional ID table for this case, as the Kanji name lookup could have only been initiated from our translation, which always has the correct source text.

  • The only instance where we do need a global ID table is TH08 spell card owner translation on top of static patches, where we simply do the reverse for every language we have static patches for:

    base_tsa/global.ids.json:

    {
        "Reimu": "霊夢",
        "Рейму": "霊夢",
        "灵梦": "霊夢"
    }

  • Dialog boxes, Music Room comments, and other multi-line strings are looked up and replaced as a single, concatenated string. For better readability in plaintext editors, we should probably support JSON arrays for the values in the table.json files and \n-concatenate those automatically.
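The lookup chain proposed in this comment could be sketched in Python roughly as follows. This is an illustration under assumptions, not an implementation: the three JSON files are modeled as plain dicts, the <r$> command is left out because its semantics are not described here, the table value is written as an array only to show the \n-concatenation, and the function names are hypothetical.

```python
import re

# e01.msg.ids.json in base_tsa: original (or statically translated) text -> ID.
ids = {"霊夢 「やっと、お祓い棒が大人しくなってきたわ」": "e01_02"}
# e01.msg.table.json in a translation patch: ID -> translation.
table = {
    "e01_02": ["<d$霊夢>", "\"Looks like my purification rod's finally calming down.\""],
}
# global.table.json: global fallback, also used by the <d$> command.
global_table = {"霊夢": "Reimu"}

D_COMMAND = re.compile(r"<d\$([^<>]*)>")


def expand_d(text):
    """Replace each <d$text> with its global-table entry, if there is one."""
    return D_COMMAND.sub(lambda m: global_table.get(m.group(1), m.group(1)), text)


def translate(source):
    """source -> optional ID -> per-file table -> global fallback -> source."""
    key = ids.get(source, source)  # the ID step is optional
    for candidates in (table, global_table):
        if key in candidates:
            value = candidates[key]
            break
    else:
        return source  # nothing known: show the text as-is
    if isinstance(value, list):  # JSON arrays are joined with \n
        value = "\n".join(value)
    return expand_d(value)
```

Supporting a static patch would then only mean adding its text as another key in `ids` that points at the same ID.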

nmlgc added a commit to thpatch/thcrap-tsa that referenced this issue Aug 30, 2018
… screen in all supported versions.

And even in some versions that aren't supported yet. Some of those
trials for older games don't even have the safe sprintf() hacks yet,
which are necessary for translated versions to show up in the first
place, heh.

Oh, and while I'm at it:
• Don't cover the "th?? JP" string unless there is a good reason. This
  string is typically only used for the human-readable section of replay
  files, which we shouldn't translate in order to not introduce
  incompatibilities.
• "th08 Music Room spoiler 5" just consists of a single U+3000
  IDEOGRAPHIC SPACE, and isn't meaningfully used in later games.

…  yeah, we *really* need thpatch/thcrap#38.

32th-System commented May 14, 2023

For spell cards, dictionary-based translation is now technically possible. The new spell_id breakpoint can take multiple parameters from multiple places and combine them into one string to be used as the spell ID. These parameters can even be strings, so it would be possible to have a spell_name breakpoint like this:

{
    "spell_id": [
        {
            "type": "s",
            "param": "ecx"
        }
    ],
    "spell_name": "ecx"
}

or this

{
    "spell_id": "ecx",
    "spell_id_type": "s",
    "spell_name": "ecx"
}

However, because as many parameters as you want can be combined, there is also no need for dictionary-based translation, even in ISC or ISC mods that use multiple spell declarations in the same scene. On the other hand, I think that spell names in content mods should be entirely up to the mod itself, and that just putting the name in the ecl file and leaving it at that is therefore perfectly OK.
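The idea of building one spell ID from several parameters can be sketched in Python as follows. This is a conceptual illustration only: the register contents, the "i" type tag, the "_" separator and the resulting ID format are assumptions of the sketch, not thcrap's actual breakpoint implementation.

```python
def build_spell_id(params, registers):
    """Join typed parameters read from (simulated) registers into one ID string."""
    parts = []
    for p in params:
        value = registers[p["param"]]
        # "s" marks a string parameter, as in the JSON above. Anything else
        # is treated as an integer here (an assumption of this sketch).
        parts.append(value if p["type"] == "s" else str(int(value)))
    return "_".join(parts)


# Made-up register contents at the time the breakpoint is hit.
registers = {"eax": 7, "ecx": "「リザレクション」"}
params = [
    {"type": "i", "param": "eax"},  # hypothetical integer parameter
    {"type": "s", "param": "ecx"},
]

spell_id = build_spell_id(params, registers)
# The combined ID then serves as the key into the translation table.
translations = {spell_id: '"Resurrection"'}  # placeholder translation
```

If a string parameter is the only component, the ID is simply the original spell name, which is what makes the scheme equivalent to a dictionary lookup.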

@32th-System

In th19, certain server status messages are pulled from the internet. This means that:

  • They are impossible to patch statically.
  • We only want to translate known messages; if someone encounters a new server message, it should be shown as is.

They are therefore only translatable with a dictionary-based system. Because of that, I have added a new dict_translate breakpoint. It's placed right before the draw_ctext call that's responsible for drawing those status messages to the screen.
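The behavior described here, translating known messages and passing unknown ones through, can be sketched in a few lines of Python. The function below only borrows the breakpoint's name for readability. It is not the breakpoint's implementation, and the example message is invented rather than taken from the th19 server.

```python
# Invented example entry; the real server messages are not listed in this issue.
known_messages = {
    "サーバーに接続中": "Connecting to server",
}


def dict_translate(message):
    """Translate a known message; show anything unknown exactly as received."""
    return known_messages.get(message, message)
```

In this model, the result of `dict_translate` is what would be handed to the drawing call in place of the original string.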

Since we were basically forced to add this functionality, and it has been 9 years since this issue was opened, using a dictionary-based system for other things might warrant another, lengthier discussion.
