Support non-ASCII character encodings #10

Closed
solemnwarning opened this issue Nov 23, 2018 · 2 comments
Labels
feature A new feature

@solemnwarning
Owner

For other 8-bit codepages, simply substitute the values going in/out.
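A minimal sketch of that substitution in Python (not the project's own code; `text_column` is a hypothetical helper and the codec names are Python's): each byte is decoded on its own through the chosen codepage, and anything unassigned or unprintable is drawn as `.`.

```python
def text_column(data: bytes, codepage: str = "iso-8859-1") -> str:
    """Render one character per byte for a single-byte codepage."""
    cells = []
    for value in data:
        try:
            ch = bytes([value]).decode(codepage)
        except UnicodeDecodeError:
            ch = "."  # byte is unassigned in this codepage
        # Control characters (C0/C1) and other unprintables become '.'
        cells.append(ch if ch.isprintable() else ".")
    return "".join(cells)


print(text_column(b"caf\xe9\x00"))           # Latin-1 view
print(text_column(b"\xe9", "iso-8859-7"))    # same byte, Greek codepage
```

The output always has exactly one character per input byte, so the text column stays aligned with the hex column.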

For other sizes of fixed-length encoding, also narrow the width reserved for the text view.

For variable-length encodings like UTF-8, render multibyte characters at the position of their first byte, with spaces or some other indicator showing their length. Be wary of control characters (e.g. right-to-left) and also fullwidth characters (thanks Japan!).
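One possible way to lay this out, sketched in Python under my own assumptions (the names `utf8_cells` and `display_width` and the `·` filler are made up for illustration): produce one cell per byte, put the decoded character in the cell of its first byte, and fill the continuation-byte cells with a marker. Anything in a Unicode "C" category (controls, bidi overrides and other format characters, unassigned code points) is replaced with `.` so it can't disturb the layout.

```python
import unicodedata


def utf8_cells(data: bytes, filler: str = "\u00b7") -> list[str]:
    """Return one display cell per input byte."""
    cells = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length = 1
        elif 0xC2 <= lead <= 0xDF:
            length = 2
        elif 0xE0 <= lead <= 0xEF:
            length = 3
        elif 0xF0 <= lead <= 0xF4:
            length = 4
        else:
            length = 0  # stray continuation byte or invalid lead byte

        ch = None
        if length and i + length <= len(data):
            try:
                ch = data[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                ch = None  # bad continuation, overlong form, surrogate...

        if ch is None:
            cells.append(".")  # show the undecodable byte on its own
            i += 1
            continue

        if unicodedata.category(ch).startswith("C"):
            ch = "."  # control/format characters such as U+202E
        cells.append(ch)
        cells.extend([filler] * (length - 1))
        i += length
    return cells


def display_width(ch: str) -> int:
    """2 for East Asian wide/fullwidth characters, otherwise 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
```

A renderer could use `display_width` to decide whether a character spills into the neighbouring filler cell; since every wide character in UTF-8 occupies at least three bytes, there is always room for it.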

@solemnwarning
Owner Author

So, on the encoding branch, I've added a "Text encoding" option to the context menu alongside the "Data type" one. These aren't mutually exclusive - so you can set a range of bytes as (e.g.) "x86 machine code", and also independently set the text encoding used for decoding the text column on the right.

This is probably not very helpful. Mixing code and data is apparently common on some architectures/eras, but even there an unbroken block of mixed data/code wouldn't disassemble properly past the first data blob (unless the architecture happens to have fixed-length instructions and the data blob preserves alignment, or the data is followed by a big wall of nops or something).

I'm leaning towards taking the "Text encoding" option back out and making encoding a "Data type" instead (e.g. "UTF-16 encoded text"), so that in disassemblies with mixed code/data the text ranges can be marked as such.

I'm not yet sure how best to integrate it with the "Strings" tab. If the Strings tab relies on the encodings defined in the file, it's arguably a bit useless, since you must already know where the text is to have marked it as such. OTOH, if it had its own encoding selection, it wouldn't know how to handle mixed-encoding files and wouldn't make use of the encoding annotations already added.

Finally, I'm not sure how the "Strings" tab should detect what actually is a string in this scary new international world - right now it just looks for sequences of "printable" ASCII characters, but what does "printable" mean in Unicode? Printable ASCII + any 8-bit character? I'm pretty sure Unicode has control characters in it too.
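One candidate definition, sketched in Python as a starting point for discussion rather than a decision: treat a character as printable unless its Unicode general category starts with "C" (controls, format characters such as bidi overrides, surrogates, private use, unassigned) or it is a line/paragraph separator. The helper names and the 4-character minimum below are assumptions for illustration, and only UTF-8 is handled.

```python
import unicodedata


def is_printable(ch: str) -> bool:
    cat = unicodedata.category(ch)
    return not (cat.startswith("C") or cat in ("Zl", "Zp"))


def decode_one(data: bytes, pos: int):
    """Decode a single UTF-8 character at pos; return (char, size) or (None, 1)."""
    for size in (1, 2, 3, 4):
        chunk = data[pos:pos + size]
        if len(chunk) < size:
            break  # ran off the end of the buffer
        try:
            return chunk.decode("utf-8"), size
        except UnicodeDecodeError:
            continue  # incomplete so far (or invalid); try a longer chunk
    return None, 1


def find_strings(data: bytes, min_chars: int = 4):
    """Return (offset, text) for each run of at least min_chars printable characters."""
    results = []
    run, start = [], 0
    pos = 0
    while pos < len(data):
        ch, size = decode_one(data, pos)
        if ch is not None and is_printable(ch):
            if not run:
                start = pos
            run.append(ch)
        else:
            if len(run) >= min_chars:
                results.append((start, "".join(run)))
            run = []
        pos += size
    if len(run) >= min_chars:
        results.append((start, "".join(run)))
    return results
```

Note that the threshold counts characters rather than bytes, so a run of four CJK characters (12 bytes) qualifies just like four ASCII letters.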

If anyone has opinions on the above, now's the time to voice them!

@solemnwarning
Owner Author

Almost 3 years since I opened this ticket... and here we are. Changes merged to master in de6eefc.

Selections can be marked as a specific text encoding under the Set Data Type > Text context menu (UTF-8/16/32 and ISO-8859-X are currently supported).

Once marked as a text encoding, text in the "ASCII" column will be decoded and the full range of characters that can be drawn will be rendered. High bit characters can be typed in and will be encoded correctly, as will text copied/pasted from the clipboard. Any characters that cannot be represented in the destination encoding will be skipped, with an accompanying beep.
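For illustration only, the skip-on-failure behaviour can be sketched in Python like this (this is not the code from the commit; `encode_typed` is a hypothetical name and the codec names are Python's):

```python
def encode_typed(text: str, encoding: str) -> tuple[bytes, int]:
    """Encode text one character at a time, skipping what the encoding can't represent.

    Returns the encoded bytes and the number of skipped characters
    (each skip is where an editor would beep).
    Use BOM-less codec names such as "utf-16-le"; plain "utf-16" would
    emit a byte order mark for every character here.
    """
    out = bytearray()
    skipped = 0
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            skipped += 1
    return bytes(out), skipped


print(encode_typed("na\u00efve \u20ac5", "iso-8859-1"))   # euro sign is skipped
print(encode_typed("na\u00efve \u20ac5", "iso-8859-15"))  # euro sign fits
```

Encoding per character rather than per string means one unrepresentable character doesn't cause the whole paste to be rejected.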
