Support non-ASCII character encodings #10

solemnwarning · 2018-11-23T20:36:22Z

For other 8-bit codepages, simply substitute the values going in/out.
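As a rough illustration of "substituting the values going out", the sketch below maps one byte to the character shown in the text column for a chosen 8-bit codepage. It is written in Python for brevity and is not taken from the project; `render_byte` and the `.` fallback for non-printable or undefined bytes are assumptions made for the example.

```python
def render_byte(value, codepage="iso-8859-1"):
    """Return the character to draw for a single byte in an 8-bit codepage."""
    try:
        ch = bytes([value]).decode(codepage)
    except UnicodeDecodeError:
        # Some codepages leave certain byte values undefined.
        return "."
    # Control characters (C0 and C1) are not drawable, so show a dot instead.
    return ch if ch.isprintable() else "."


# The same byte renders differently depending on the codepage.
for cp in ("iso-8859-1", "iso-8859-5"):
    print(cp, render_byte(0xE9, cp))
```

Typing into the text column would be the reverse lookup: encode the typed character in the same codepage and write the resulting byte.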

For other sizes of fixed-length encoding, also narrow the width reserved for the text view.

For variable-length encodings like UTF-8, render multibyte characters at the position of their first byte, with spaces or some other indicator showing their length. Be wary of control characters (e.g. right-to-left) and also fullwidth characters (thanks Japan!).
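A minimal sketch of that layout, assuming one display cell per byte: the decoded character goes in the cell of its first byte and every continuation byte gets a placeholder. The name `text_cells` and the `·`, `?` and `.` markers are hypothetical choices for the example, and it does not attempt to handle fullwidth or bidirectional characters.

```python
def text_cells(data, placeholder="·", invalid="?"):
    """Return one display cell per input byte for UTF-8 encoded data."""
    cells = []
    i = 0
    while i < len(data):
        # A UTF-8 sequence is 1 to 4 bytes long; take the first length that decodes.
        for length in (1, 2, 3, 4):
            try:
                ch = data[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                continue
            # Draw the character at its first byte, or a dot if it is not printable.
            cells.append(ch if ch.isprintable() else ".")
            # Pad the remaining bytes so the column stays aligned with the hex view.
            cells.extend([placeholder] * (length - 1))
            i += length
            break
        else:
            # No valid sequence starts here: mark this byte and resynchronise.
            cells.append(invalid)
            i += 1
    return cells


print("".join(text_cells("a\u00e9\u65e5".encode("utf-8"))))
```

Keeping one cell per byte means a click in the text column still maps directly to a byte offset, at the cost of visible padding after each multibyte character.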

solemnwarning · 2021-09-10T12:08:47Z

So, on the encoding branch, I've added a "Text encoding" option to the context menu alongside the "Data type" one. These aren't mutually exclusive - so you can set a range of bytes as (e.g.) "x86 machine code", and also independently set the text encoding used for decoding the text column on the right.

This is probably not very helpful - there are some architectures/eras where mixing of code and data is apparently common, but even on such things an unbroken block of data/code wouldn't disassemble properly past the first data blob (unless the architecture happens to have fixed-length instructions and the data blob preserves alignment, or the data is followed by a big wall of nops or something).

I'm leaning towards taking the "Text encoding" option back out, and making each encoding a "Data type" instead (e.g. "UTF-16 encoded text"), so that text ranges in disassemblies with mixed code/data can be marked as such.

I'm not yet sure how best to integrate it with the "Strings" tab. If the Strings tab relies on the encodings defined in the file, it's arguably a bit useless, since you must already know where the text is to have marked it as such. OTOH, if it had its own encoding selection, it wouldn't know how to handle mixed-encoding files, and wouldn't make use of the encoding annotations already added.

Finally, I'm not sure how the "Strings" tab should detect what actually is a string in this scary new international world - right now it just looks for sequences of "printable" ASCII characters, but what does "printable" mean in Unicode? Printable ASCII + any 8-bit character? I'm pretty sure Unicode has control characters in it too.
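One possible answer, sketched here as an assumption rather than as what the Strings tab does, is to classify characters by Unicode general category and treat control, format (which covers the bidi overrides), surrogate, private-use and unassigned code points as non-printable. `is_printable`, `printable_runs` and the exact category set are hypothetical choices for the example.

```python
import unicodedata

# General categories treated as non-printable in this sketch (an assumption):
# Cc control, Cf format, Cs surrogate, Co private use, Cn unassigned,
# Zl/Zp line and paragraph separators.
NON_PRINTABLE = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp"}


def is_printable(ch):
    """Return True if the character would count towards a string run."""
    return unicodedata.category(ch) not in NON_PRINTABLE


def printable_runs(text, min_chars=4):
    """Return runs of at least min_chars consecutive printable characters."""
    runs, current = [], []
    for ch in text:
        if is_printable(ch):
            current.append(ch)
            continue
        if len(current) >= min_chars:
            runs.append("".join(current))
        current = []
    if len(current) >= min_chars:
        runs.append("".join(current))
    return runs


print(printable_runs("abc\x00h\u00e9llo w\u00f6rld\u202etext\x01xy"))
```

A category test like this accepts far more than "printable ASCII + any 8-bit character", so a real implementation would likely need a higher minimum length or extra heuristics to keep false positives down.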

If anyone has opinions on the above, now's the time to voice them!

solemnwarning · 2021-10-27T09:47:09Z

Almost 3 years since I opened this ticket... and here we are. Changes merged to master in de6eefc.

Selections can be marked as a specific text encoding under the Set Data Type > Text context menu (UTF-8/16/32 and ISO-8859-X are currently supported).

Once marked as a text encoding, text in the "ASCII" column will be decoded and the full range of characters that can be drawn will be rendered. High bit characters can be typed in and will be encoded correctly, as will text copied/pasted from the clipboard. Any characters that cannot be represented in the destination encoding will be skipped, with an accompanying beep.
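The skip-on-failure behaviour for typed or pasted text could look roughly like the following. This is a sketch of the idea only, not the code merged in de6eefc; `encode_skipping` is a hypothetical helper, and the skipped count stands in for the beep.

```python
def encode_skipping(text, encoding):
    """Encode text, dropping characters the target encoding cannot represent.

    Returns the encoded bytes and the number of characters skipped.
    For UTF-16/32, pass an explicit byte order (e.g. "utf-16-le") so that
    no byte order mark is emitted for each character.
    """
    out = bytearray()
    skipped = 0
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            skipped += 1  # a caller could beep once per skipped character
    return bytes(out), skipped


data, skipped = encode_skipping("caf\u00e9 \u65e5", "iso-8859-1")
print(data, skipped)
```

Encoding character by character, rather than calling `text.encode(encoding, errors="ignore")`, is what lets the caller know that something was dropped.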

solemnwarning added the "feature" label Nov 24, 2018

tompazourek mentioned this issue Nov 16, 2020

Support switching different encodings in the Strings tab #106

Closed

solemnwarning added this to the NEXT + 1 milestone Aug 23, 2021

solemnwarning closed this as completed Oct 27, 2021
