Closed
Description
Non-printable characters in a results cell are not escaped. For example, \b is displayed as what appears to be an empty string.
Two possible solutions that I can think of right now:
- add quotes around all strings that either start or end with a non-printable character
- preprocess all result strings and convert a subset of non-printable chars to their Unicode escape (i.e., \uXXXX). We probably only want to do this for a fixed set of chars, probably the ASCII chars that are known to be control characters.
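A minimal TypeScript sketch of the second option, assuming the fixed set is the ASCII control characters (U+0000 to U+001F, plus U+007F). The function name `escapeControlChars` is hypothetical and not taken from the extension's code base:

```typescript
// Hypothetical helper, not part of the extension's code base.
// Assumption: only ASCII control characters (U+0000-U+001F and U+007F)
// are escaped; everything else is passed through unchanged.
function escapeControlChars(value: string): string {
  let result = "";
  for (let i = 0; i < value.length; i++) {
    const code = value.charCodeAt(i);
    if (code <= 0x1f || code === 0x7f) {
      // Format the code unit as four upper-case hex digits, e.g. \u0008.
      const hex = ("0000" + code.toString(16).toUpperCase()).slice(-4);
      result += "\\u" + hex;
    } else {
      result += value.charAt(i);
    }
  }
  return result;
}
```

With this sketch, a cell containing only \b would render as the six visible characters \u0008 instead of appearing empty.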
Activity
mattt commented on Nov 12, 2020
Another option might involve mapping non-printing characters to their counterparts in the Control Pictures Unicode block. For example, in the case of \b (BACKSPACE U+0008) you could show ␈ (SYMBOL FOR BACKSPACE U+2408). That only covers the C0 control codes, but that may be sufficient depending on your use case.
aeisenberg commented on Nov 13, 2020
Thanks for your comment. This is a good possibility. Before deciding on an approach, I will contact some of our users who have hit this issue and determine what would make the most sense for them. This is a rare scenario so far and usually comes up when a user creates queries that are specifically looking for hidden control codes in code files, which could indicate something malicious.
justinjao commented on May 5, 2021
Hello @aeisenberg, I'd like to take a crack at working on this, if the issue is still open!
marcnjaramillo commented on Oct 4, 2021
I think either approach would work. I'm wondering, though: if we add quotes to strings that contain pre-determined sequences of non-printing characters, will those sequences actually be visible, or would the results just have some strings in quotes and some that aren't? I haven't had to do anything like this, so I'm not sure what the end result would look like. I like the idea of converting the characters to their Unicode numbers, but would that be enough to draw attention to them?
aeisenberg commented on Oct 4, 2021
Many of the non-printable characters will still be invisible if we just surround them in quotes, though at least users would be able to see that something exists in the cell, even if it's unclear exactly what the value is.
So, I'd say it is better to do something slightly more sophisticated. Dealing with Unicode can get extremely complicated extremely quickly. So, at least at first, it's best to keep it simple. We would want to display all non-printable characters as something, so displaying Unicode values would be sufficient. Something like U+0008 for \b would be simple.
If you wanted to try to display control pictures for Unicode values that have them, this would be fine, too (but not strictly necessary).
marcnjaramillo commented on Oct 4, 2021
Sounds good. I think just displaying the Unicode values will be a good start.
marcnjaramillo commented on Oct 5, 2021
I've been thinking about the approach for this issue, and hope to get some thoughts. One way I can think of dealing with this is to create some kind of table that maps each non-printing character we want to account for to its unicode counterpart. Then, we would have to iterate over each string and if any substring is equivalent to a key in the table we replace it with the associated unicode value. This modified string would then be used in the response.
I'm also wondering if maybe we could use a regex to detect the non-printing characters and replace them that way. I'm not sure if we could easily replace the expression with its unicode equivalent, but maybe we could just replace them all with some message to alert the user that a non-printing character was detected.
I'm also thinking of the space/time complexity of each approach and how they could be optimized (if possible), and whether either approach is worth the cost given the problem being looked at is a rare use case anyway.
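For illustration, the table idea could be sketched roughly as follows in TypeScript. The table contents and the names `REPLACEMENTS` and `replaceNonPrinting` are assumptions made up for this example. The string is scanned once, and each lookup is a constant-time property access, so the cost is linear in the length of the string with a fixed-size table:

```typescript
// Hypothetical lookup table: each non-printing character we care about
// maps to the text shown in its place. The entries are only examples.
const REPLACEMENTS: { [ch: string]: string } = {
  "\u0000": "U+0000",
  "\b": "U+0008",
  "\u001b": "U+001B",
  "\u007f": "U+007F",
};

// Hypothetical helper: one pass over the string, replacing every
// character that has an entry in the table.
function replaceNonPrinting(value: string): string {
  let result = "";
  for (let i = 0; i < value.length; i++) {
    const ch = value.charAt(i);
    result += Object.prototype.hasOwnProperty.call(REPLACEMENTS, ch)
      ? REPLACEMENTS[ch]
      : ch;
  }
  return result;
}
```

Because the replacement text is looked up rather than computed, the same structure would work for any alert message, not just the U+XXXX form.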
mattt commented on Oct 5, 2021
@marcelolynch You should be able to iterate through each code point in the string in linear time. I'd recommend against using a regular expression for this kind of string processing. Each C0 control character (U+0000 – U+001F) has a corresponding control picture that you can get by adding the offset 0x2400. For any other characters, a switch statement would probably be your best bet.
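A rough TypeScript sketch of that suggestion, under a few assumptions: the name `toControlPictures` is hypothetical, and the only non-C0 character handled in the switch is DELETE (U+007F), which has its own control picture at U+2421. The loop walks UTF-16 code units rather than full code points, which is enough here because no half of a surrogate pair falls in the control range:

```typescript
// Offset from a C0 control code to its picture in the Control Pictures block.
const CONTROL_PICTURES_OFFSET = 0x2400;

// Hypothetical helper, not part of the extension's code base.
function toControlPictures(value: string): string {
  let result = "";
  for (let i = 0; i < value.length; i++) {
    const code = value.charCodeAt(i);
    if (code <= 0x1f) {
      // C0 control codes U+0000..U+001F map onto U+2400..U+241F.
      result += String.fromCharCode(code + CONTROL_PICTURES_OFFSET);
      continue;
    }
    switch (code) {
      case 0x7f:
        // DELETE is outside C0; its picture is SYMBOL FOR DELETE (U+2421).
        result += "\u2421";
        break;
      default:
        // Everything else, including surrogate halves, passes through.
        result += value.charAt(i);
    }
  }
  return result;
}
```

As written, this also turns tabs and newlines into ␉ and ␊. Whether those should be left alone is a design choice the thread does not settle.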