Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XML2GFF: Serialize color codes #55

Open
DrMcCoy opened this issue May 29, 2020 · 7 comments
Open

XML2GFF: Serialize color codes #55

DrMcCoy opened this issue May 29, 2020 · 7 comments
Assignees

Comments

@DrMcCoy
Copy link
Member

DrMcCoy commented May 29, 2020

The Aurora games has a weird way to do color codes: <cRGB>Foo</c> where R, G and B are raw byte values (!).

When reading in LocStrings, we demunge those into <cRRGGBBAA>Foo</c> where RR, GG, BB and AA are textual hexadecimal values.

When converting XML files back to GFF, we need to reverse this as well.

@drake127
Copy link
Contributor

drake127 commented Dec 2, 2020

Hi, can I vouch for this ticket?

I happen to need to parse thousands of character (BIC) files, process them in XML a save them back. When I fixed up encodings (cp1250->utf-8->cp1250) I encountered color codes in one of the files.

Specifically <c\x01\x01\x01> ie. black which does convert to UTF-8 (noop, remains <c\x01\x01\x01>) but does not convert back to cp-1250 in xml2gff.

The actual error was:

ERROR: XML document failed to parse
    Because: xmlvault/Jiraiya/gorky.xml:287: parser error : PCDATA invalid Char value 1
    <exostring label="FamiliarName">&lt;c&gt;&lt;/c&gt;</exostring>

Thanks.

@drake127
Copy link
Contributor

drake127 commented Dec 2, 2020

To clarify: gff2xml does NOT transform them to <CRRGGBBAA>, they stays in raw values. See attachment.

char.zip

@DrMcCoy
Copy link
Member Author

DrMcCoy commented Dec 3, 2020

Ah, yes, seems like we're only doing that for LocStrings and TLK strings, not for ExoStrings. Since we're not reconstructing them anyway, do consider gff2xml and xml2gff broken for GFF files with color codes for now.

What do you mean with wanting to "vouch" for this ticket? Just to add your vote that you consider this an important issue, or that you'd want to work on a fix yourself?

I can offer pointers for the latter case, but I can't guarantee when I'll have time to work on this myself. To be fair, I do have a few weeks of vacations coming up, though, so chances are good I'll find time to work on this before the end of the year, but I wouldn't say no to "converting" you to a contributors to the xoreos project. ;)

@drake127
Copy link
Contributor

drake127 commented Dec 3, 2020

😄 I intended to just add a vote but if it annoys me too much, I'd try to fix it somehow.

My issue is not only with color codes but with any invalid characters. I would be quite happy if it was just encoded like <c&#01;&#01;&#01;> but libxml2 will not parse it anyway (I tried). This is unfortunate as it's not possible to store arbitrary payload to XML...

Maybe it would be more reasonable to base64 encode whole string (when it contains !libxml2::IS_CHAR(c) character) instead of converting just color codes with arbitrary format?

@DrMcCoy
Copy link
Member Author

DrMcCoy commented Dec 3, 2020

Yeah, gff2xml already base64 encodes string with invalid byte values (which it detects by checking if iconv throws an error), but again not for ExoStrings, that's one thing I overlooked. Also apparently not for all LocStrings? I need to recheck that.

You can see this happening for your example file, because it shows a warning on stderr, and some strings consist of "[!?!]" (a dummy value) instead. That's certainly a bug, in that case gff2xml should base64-encode the string instead, so that we're at least no losing any data.

In the case of the color codes, I would still like to convert them to something reasonable and human-readable, like the <cRRGGBB> format.

@drake127
Copy link
Contributor

drake127 commented Dec 3, 2020

I haven't look into code yet but I don't think that iconv would fail (in some instances) as all characters < 0x80 are valid in UTF-8 but not in XML. That's what causes the error.

I use cp1250 and so far didn't see an issue with dummy characters. Entire cp1250 (0-255) should map to UTF-8 and back to cp1250 correctly.

@DrMcCoy
Copy link
Member Author

DrMcCoy commented Dec 3, 2020

No, there are undefined values in CP1250, for example 0x90.

Compare

echo -ne "\x66\x6f\x6f\x0a" | iconv -f cp1250 -t utf8

vs

echo -ne "\x66\x6f\x6f\x90\x0a" | iconv -f cp1250 -t utf8

For gorky.bic, look at line 593ff of the xml produced by gff2xml:

        <locstring label="DescIdentified" strref="4294967295">
          <string language="0">[!?!]</string>
        </locstring>

Still, for color codes, certain values might not throw an error. Which is why they still need to be preparsed first.

See line 287, for example:

<exostring label="FamiliarName">&lt;c^A^A^A&gt;&lt;/c&gt;</exostring>

For readability, it might an idea to wrap color-coded strings in a CDATA section, to increase readiblity with the angle brackets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants