Add support to interpret/display packet content as UTF-8 #1190

Open
cofleury opened this issue May 27, 2024 · 4 comments

@cofleury

It would be great to support displaying the content of a packet as UTF-8 in addition to ASCII.

@cofleury changed the title from "Add support to interpret/display packet content as UTF-8 in addition to ASCII" to "Add support to interpret/display packet content as UTF-8" on May 27, 2024
@guyharris
Member

For ASCII (and other single-byte character encodings), there can be a one-to-one correspondence between offsets into the packet and positions in the display.

For multi-byte character encodings, a decision has to be made as to how to display a character that's split between rows in the text display. The best thing to do is probably to display it at the location of its first byte. The next row then does not start with a new character, so its first character could be preceded by filler characters, one for each byte in that row that belongs to the character begun in the previous row.

For variable-length multi-byte character encodings, such as UTF-8, there's not likely to be a correspondence between offsets in the packet and positions in the display. At best, what could be done is to display characters adjacent to one another, display characters that are split across rows at the location of the first byte, and show the aforementioned filler characters.
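A minimal sketch of that layout rule, not tcpdump code. It assumes the input has already been validated as UTF-8 and had non-printable characters substituted; the names `lead_len` and `utf8_text_rows`, the row width, and the filler character are all hypothetical choices for illustration. Each character is emitted in the row that holds its first byte, and continuation bytes that spill into the next row become filler at the start of that row.

```c
#include <stddef.h>

/* Length implied by a lead byte. Assumes the input was already validated. */
static size_t lead_len(unsigned char b)
{
    if (b >= 0xF0) return 4;
    if (b >= 0xE0) return 3;
    if (b >= 0xC0) return 2;
    return 1;
}

/*
 * Build the text column for a dump with `width` packet bytes per row.
 * Rows are separated by '\n'. Returns the number of bytes written
 * (excluding the terminating NUL), or 0 if `out` is too small.
 */
size_t utf8_text_rows(const unsigned char *p, size_t n, size_t width,
                      char filler, char *out, size_t outsz)
{
    size_t o = 0;        /* write position in out */
    size_t owed = 0;     /* continuation bytes left in the current character */
    size_t lead_row = 0; /* row holding the current character's first byte */

    /* Worst case: every byte is copied once and replaced by filler once. */
    if (width == 0 || outsz < 2 * n + n / width + 1)
        return 0;

    for (size_t k = 0; k < n; k++) {
        if (k > 0 && k % width == 0)
            out[o++] = '\n';            /* packet offset k starts a new row */
        if (owed > 0) {
            if (k / width != lead_row)
                out[o++] = filler;      /* byte spilled over from previous row */
            owed--;
            continue;
        }
        size_t len = lead_len(p[k]);
        if (len > n - k)
            len = n - k;                /* stay inside the buffer */
        for (size_t j = 0; j < len; j++)
            out[o++] = (char)p[k + j];  /* whole character at its first byte */
        owed = len - 1;
        lead_row = k / width;
    }
    out[o] = '\0';
    return o;
}
```

With a row width of 4 and `'_'` as filler, the seven bytes `a b E2 82 AC c d` come out as `ab€` on the first row and `_cd` on the second: the euro sign starts at offset 2, and its last byte falls on the second row.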

@guyharris
Member

Sequences of bytes that are valid UTF-8 characters but that are not printable characters should be displayed as ".", just as bytes that are not printable ASCII characters are displayed in the ASCII display.

Any sequence of bytes that are not part of a valid UTF-8 character should probably also be displayed as a sequence of "."s.
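A rough sketch of these two rules, under stated assumptions rather than as a proposed patch. The function names are hypothetical, and "printable" is approximated as "not a C0 control, DEL, or C1 control"; a real implementation would more likely use `iswprint()` or Unicode property tables. A valid but non-printable character becomes a single ".", and each byte that is not part of a well-formed sequence becomes its own ".".

```c
#include <stddef.h>

/*
 * Decode one code point from p[0..n). Returns the number of bytes consumed
 * (1-4) and stores the code point in *cp, or returns 0 if the bytes do not
 * start a well-formed UTF-8 sequence.
 */
static size_t utf8_decode(const unsigned char *p, size_t n, unsigned long *cp)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const unsigned long min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    unsigned long c;

    if (n == 0)
        return 0;
    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { len = 2; c = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; c = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; c = p[0] & 0x07; }
    else return 0;                          /* stray continuation or bad lead */

    if (len > n)
        return 0;                           /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;                       /* not a continuation byte */
        c = (c << 6) | (p[i] & 0x3F);
    }
    if (c < min_cp[len]) return 0;              /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;   /* UTF-16 surrogate */
    if (c > 0x10FFFF) return 0;                 /* beyond Unicode range */
    *cp = c;
    return len;
}

/* Assumption: only C0 controls, DEL, and C1 controls are non-printable. */
static int cp_printable(unsigned long cp)
{
    return !(cp < 0x20 || (cp >= 0x7F && cp <= 0x9F));
}

/*
 * Copy printable characters to out and substitute '.' for everything else.
 * Returns the number of bytes written, or 0 if out is too small.
 */
size_t utf8_display(const unsigned char *p, size_t n, char *out, size_t outsz)
{
    size_t i = 0, o = 0;

    if (outsz < n + 1)
        return 0;                   /* output is never longer than input */
    while (i < n) {
        unsigned long cp;
        size_t len = utf8_decode(p + i, n - i, &cp);

        if (len == 0) {
            out[o++] = '.';         /* invalid byte: one dot, resync at next byte */
            i++;
        } else if (!cp_printable(cp)) {
            out[o++] = '.';         /* valid but non-printable character */
            i += len;
        } else {
            for (size_t j = 0; j < len; j++)
                out[o++] = (char)p[i++];
        }
    }
    out[o] = '\0';
    return o;
}
```

Advancing one byte after a failed decode keeps the display in step with the packet: a valid character that follows garbage is still found.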

@infrastation
Member

What would be the way to know where UTF-8 strings start and end in the packet data? UTF-8 bytes, whether perfectly valid or not, could be preceded or followed by raw binary bytes that would interfere with UTF-8 decoding. As far as I understand, the only way to do it reliably would be to know the packet structure when doing a hex dump.

@fenner
Contributor

fenner commented Jul 20, 2024

There's a straightforward way to identify whether or not a sequence of bytes is valid UTF-8; https://www.cl.cam.ac.uk/~mgk25/ucs/utf8_check.c is an example.
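For readers who want the idea inline, here is a small sketch of such a check. It was written for illustration and is not the code at the link above; `utf8_seq_len` and `utf8_valid` are hypothetical names. It follows the well-formed byte ranges given in the UTF-8 grammar of RFC 3629, which exclude overlong forms, surrogates, and code points above U+10FFFF by restricting the second byte.

```c
#include <stddef.h>

/*
 * Returns the length (1-4) of the well-formed UTF-8 sequence at s,
 * or 0 if the first bytes of s[0..n) do not form one.
 */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for the second byte */
    size_t len;

    if (n == 0)
        return 0;
    if (s[0] <= 0x7F)
        return 1;                        /* ASCII */
    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        len = 2;                         /* 0xC0 and 0xC1 would be overlong */
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;     /* reject overlong forms */
        if (s[0] == 0xED) hi = 0x9F;     /* reject surrogates U+D800..U+DFFF */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;     /* reject overlong forms */
        if (s[0] == 0xF4) hi = 0x8F;     /* reject code points > U+10FFFF */
    } else {
        return 0;                        /* continuation byte or invalid lead */
    }

    if (n < len)
        return 0;                        /* truncated sequence */
    if (s[1] < lo || s[1] > hi)
        return 0;
    for (size_t i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF)
            return 0;
    return len;
}

/* Returns 1 if all n bytes form well-formed UTF-8, else 0. */
int utf8_valid(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        size_t len = utf8_seq_len(s + i, n - i);
        if (len == 0)
            return 0;
        i += len;
    }
    return 1;
}
```

Because the check works one sequence at a time, it can be applied at any offset in the packet. It cannot tell text apart from binary data that happens to be well-formed UTF-8, which is the concern raised earlier in the thread.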
