
COMMON: Standardize conversion between U32String and String #2452

Draft · wants to merge 1 commit into base: master
Conversation

@SupSuper
Contributor

SupSuper commented Sep 8, 2020

Currently the U32String(const String &str) constructor converts a string using a byte-copy, while the String(const U32String &str) constructor converts a string using UTF-8. These are asymmetrical operations, and lead to corrupt data when converting Unicode back and forth between the two.

As an example, here's what happens if you add Unicode characters to a game description via the GUI:
[screenshot: GUI bug]
This isn't a display bug; the description keeps getting more corrupt each time you go into the Edit Game dialog.
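
To make the asymmetry concrete, here is a minimal sketch of such a round trip under the current constructors (the variable names and the literal are made up for illustration):

Common::String desc = "t\xC3\xAAte";   // "tête" stored as UTF-8 bytes
Common::U32String wide(desc);          // byte-copy: each UTF-8 byte becomes its own code point
Common::String back(wide);             // UTF-8 encode: those stray code points get re-encoded
// 'back' is now "tÃªte" rather than "tête", and every further round trip
// (e.g. GUI -> ConfMan -> GUI) grows the corruption.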

While this can be solved via encode() and decode(), I don't think having error-prone APIs is a good idea, so my proposals are either:

  • Default to UTF8 when converting between the two. This guarantees no data loss, is compatible with ASCII strings, and cleans up most uses of encode() and decode(). The only downside is developers have to be aware this is the default behavior, and to explicitly use encode() and decode() when they require specific encodings for specific purposes.
  • Never convert between the two outside of the explicit encode() and decode() functions. This makes it explicit to the developers what is happening, at the cost of making the code more verbose. This might also require C++11 literals for Unicode strings (see the sketch after this list).
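
For illustration, here's roughly what the second option looks like at call sites, using the existing encode()/decode() API (the variable names are placeholders):

// With the implicit conversions gone, every conversion names its encoding:
Common::U32String wide = description.decode(Common::kUtf8);   // String -> U32String
Common::String narrow = wide.encode(Common::kUtf8);           // U32String -> String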

Do not merge. The code is for discussion and not a final solution.

This ensures there's no data loss when converting between the two.
@criezy
Member

criezy commented Sep 8, 2020

Indeed the U32String constructor that takes a String only works with ASCII characters. I think that might have been done on purpose, but I have been wondering if it should instead use decode(). This would still work with ASCII strings, although it might be a bit slower.

Something else I have been thinking about is that we might want to add a code page parameter, as we have in encode() and decode():

class String {
public:
	/** Construct a new string with the given encoding from the given u32 string. */
	String(const U32String &str, CodePage page = kUtf8);
	// ...
};

class U32String {
public:
	/** Construct a new u32 string from the given string, assuming it uses the given encoding. */
	U32String(const String &str, CodePage page = kUtf8);
	// ...
};
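
For example, the constructors above could then be used like this (the variable name is a placeholder; kUtf8 is the only code page taken from the proposal itself):

Common::U32String wide(iniValue, Common::kUtf8);   // decode a UTF-8 encoded String
Common::String back(wide, Common::kUtf8);          // encode it back to UTF-8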

If we decide instead that neither of these constructors does any conversion, we could add a check that all the characters are in the ASCII range and error out if this is not the case.
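
A minimal sketch of such a check, assuming a hypothetical helper rather than existing ScummVM code:

// Returns true if every character fits in the 7-bit ASCII range.
static bool isPureAscii(const Common::String &str) {
	for (uint i = 0; i < str.size(); ++i)
		if ((byte)str[i] > 0x7F)
			return false;
	return true;
}

// The conversion constructors could then error out on non-ASCII input, e.g.:
// if (!isPureAscii(str)) error("U32String: non-ASCII input, use decode() instead");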

here's what happens if you add Unicode characters to a game description via the GUI

Note that unicode input in the GUI is not currently supported and doesn't work properly (unless you use paste from clipboard, which should handle unicode properly). So in this test there might be more to it than just a mismatch in the String and U32String conversions.

@SupSuper
Contributor Author

SupSuper commented Sep 8, 2020

Note that unicode input in the GUI is not currently supported and doesn't work properly (unless you use paste from clipboard, which should handle unicode properly). So in this test there might be more to it than just a mismatch in the String and U32String conversions.

Seems to work fine on Windows, at least in this case. I checked the code and it was getting mangled when getting sent to ConfMan and back.

Should probably tag @aryanrawlani28 on this as well.

@aryanrawlani28
Contributor

aryanrawlani28 commented Sep 9, 2020

Thanks for the tag SupSuper!

I think the solutions you've laid out are pretty nice. The easiest way to solve this particular issue would be using encode() and decode(), but point 1:

Default to UTF8 when converting between the two. This guarantees no data loss, is compatible with ASCII strings, and cleans up most uses of encode() and decode(). The only downside is developers have to be aware this is the default behavior, and to explicitly use encode() and decode() when they require specific encodings for specific purposes.

seems suitable to me as the proper long-term solution.

However, Unicode input is still a ways off for ScummVM, because sev pointed out that each engine would also have to implement support for Unicode savegame names, while at the same time providing backward compatibility for older saves. Then there's also the issue of ConfMan writing to and reading from the .ini file, as you pointed out.

Once we have Unicode everywhere, the solutions you've described should work well! And I'm glad to hear input is working fine on Windows! Criezy has already tested pasting Unicode from the clipboard into editables, and that was working well too. It will probably need testing on other platforms, though.

@lephilousophe
Member

lephilousophe commented Sep 9, 2020

I think that having to explicitly use encode() and decode() to do actual conversions is the proper way.
We could also wire a two-argument constructor (without any default encoding) to the decode()/encode() call.
We can think of String like a Python bytes object and U32String like str: converting from one to the other shouldn't happen by accident.
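
In other words, a sketch of that suggestion (not actual code): the constructor takes the code page with no default, so the encoding is always spelled out at the call site:

class U32String {
public:
	U32String(const String &str, CodePage page);   // no default code page: the caller must choose
	// ...
};

Common::U32String wide(str, Common::kUtf8);        // conversion and encoding both visible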

@criezy
Member

criezy commented Sep 9, 2020

Passing from one to the other shouldn't be done by accident.

We could mark the constructors as explicit to avoid that. But then we might as well use encode() and decode().
Either way, it might indeed be a good idea to remove those constructors, or mark them as explicit, and this might uncover some current issues in our code.
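
For reference, a sketch of what marking the constructor explicit would look like, reusing the signature from the earlier comment (someString is a placeholder):

class U32String {
public:
	explicit U32String(const String &str, CodePage page = kUtf8);
	// ...
};

U32String a = someString;          // would no longer compile: no implicit conversion
U32String b(someString, kUtf8);    // still allowed, and the conversion is visible at the call site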

@SupSuper How well unicode input works currently probably depends on the platform and keyboard layout.

Here is what I get on macOS currently if I try to type tête à claque in the search field with a US keyboard (notice the additional characters before both the ê and the à):
[screenshot: search field typed with a US keyboard]

And here is what I get with a French keyboard (notice the missing accent on the 'ê', but not on the 'à'):
[screenshot: search field typed with a French keyboard]

Working on unicode text input was not part of @aryanrawlani28's GSoC task, so it's a bonus that it almost works. But there is still some work to be done.

@sev-
Member

sev- commented Sep 10, 2020

Well, ideally we'd better kill ustr.cpp and turn str.cpp into a template. However, I would wait until next week, when I announce that we allow C++11. That would let us make the conversion more elegant.
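
For illustration only, a very rough sketch of that idea, with a shared template instantiated for char and uint32 (the names and layout here are made up, not ScummVM's eventual implementation):

template<class T>
class BaseString {
public:
	uint32 size() const { return _size; }
	const T *c_str() const { return _str; }
	// ...storage management and operations shared by both string types...
protected:
	T      *_str;
	uint32  _size;
};

class String : public BaseString<char> { /* char / UTF-8 specific helpers */ };
class U32String : public BaseString<uint32> { /* u32-specific helpers */ };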
