
Problems handling non-printable chars #584

Closed
@aeisenberg

Description

@aeisenberg
Contributor

Non-printable characters in a results cell are not escaped. For example, \b is displayed as what appears to be an empty string.

Two possible solutions that I can think of right now:

  1. Add quotes around all strings that either start or end with a non-printable character.
  2. Preprocess all result strings and convert a subset of non-printable chars to their Unicode escape (i.e., \uXXXX). We probably only want to do this for a fixed set of chars, most likely the ASCII chars that are known to be control characters.
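
As a rough illustration of the two options, here is a minimal TypeScript sketch. The helper names (`isAsciiControl`, `quoteIfEdgeNonPrintable`, `escapeControlChars`) are hypothetical and do not come from the extension's codebase, and the sketch assumes "non-printable" means the ASCII control characters (U+0000 to U+001F) plus DEL (U+007F).

```typescript
// Hypothetical helpers for illustration only.
// Assumption: "non-printable" = ASCII control characters (U+0000..U+001F) and DEL (U+007F).
function isAsciiControl(ch: string): boolean {
  const code = ch.charCodeAt(0);
  return code <= 0x1f || code === 0x7f;
}

// Option 1: wrap the string in quotes when it starts or ends with a control character.
function quoteIfEdgeNonPrintable(value: string): string {
  if (value.length === 0) {
    return value;
  }
  const first = value.charAt(0);
  const last = value.charAt(value.length - 1);
  return isAsciiControl(first) || isAsciiControl(last) ? `"${value}"` : value;
}

// Option 2: replace each control character with a \uXXXX escape.
function escapeControlChars(value: string): string {
  let result = "";
  for (const ch of value) {
    if (isAsciiControl(ch)) {
      const hex = ch.charCodeAt(0).toString(16);
      result += "\\u" + ("0000" + hex).slice(-4);
    } else {
      result += ch;
    }
  }
  return result;
}
```

With option 1 the control character itself is still emitted unchanged, so only the quotes hint that something is there. Option 2 makes the character itself visible.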

Activity

mattt

mattt commented on Nov 12, 2020

@mattt

Another option might involve mapping non-printing characters to their counterparts in the Control Pictures Unicode block. For example, in the case of \b (BACKSPACE U+0008) you could show ␈ (SYMBOL FOR BACKSPACE U+2408). That block only covers the C0 control codes, but that may be sufficient depending on your use case.
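
A minimal sketch of that mapping for a single character, in TypeScript. The function name `toControlPicture` is hypothetical. It relies on the Control Pictures block listing the C0 symbols in the same order as the control codes, so U+0000 to U+001F correspond to U+2400 to U+241F.

```typescript
// Hypothetical helper: map one C0 control character to its control picture.
// U+0000..U+001F correspond to U+2400..U+241F, so the offset is 0x2400.
function toControlPicture(ch: string): string {
  const code = ch.charCodeAt(0);
  // Anything outside the C0 range is returned unchanged.
  return code <= 0x1f ? String.fromCharCode(code + 0x2400) : ch;
}
```

For instance, `toControlPicture("\b")` yields U+2408 (␈), while ordinary letters pass through untouched.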

aeisenberg

aeisenberg commented on Nov 13, 2020

@aeisenberg
Contributor, Author

Thanks for your comment. This is a good possibility. Before deciding on an approach, I will contact some of our users who have hit this issue and determine what would make the most sense for them. This is a rare scenario so far and usually comes up when a user creates queries that are specifically looking for hidden control codes in code files, which could indicate something malicious.

justinjao

justinjao commented on May 5, 2021

@justinjao

Hello @aeisenberg, I'd like to take a crack at working on this, if the issue is still open!

marcnjaramillo

marcnjaramillo commented on Oct 4, 2021

@marcnjaramillo
Contributor

I think either approach would work. I'm wondering, though: if we add quotes to strings that contain predetermined sequences of non-printing characters, will those sequences actually be visible, or would the results just have some strings in quotes and some that aren't? I haven't had to do anything like this, so I'm not sure what the end result would look like. I like the idea of converting the characters to their Unicode numbers, but would that be enough to draw attention to them?

aeisenberg

aeisenberg commented on Oct 4, 2021

@aeisenberg
Contributor, Author

Many of the non-printable characters will still be invisible if we just surround them in quotes, though at least users would be able to see that something exists in the cell, even if it's unclear exactly what the value is.

So, I'd say it is better to do something slightly more sophisticated. Dealing with Unicode can get extremely complicated extremely quickly, so, at least at first, it's best to keep it simple. We would want to display all non-printable characters as something, so displaying Unicode values would be sufficient. Something like U+0008 for \b would be simple.

If you wanted to try to display control pictures for unicode values that have them, this would be fine, too (but not strictly necessary).

marcnjaramillo

marcnjaramillo commented on Oct 4, 2021

@marcnjaramillo
Contributor

Sounds good. I think just displaying the unicode values will be a good start.

marcnjaramillo

marcnjaramillo commented on Oct 5, 2021

@marcnjaramillo
Contributor

I've been thinking about the approach for this issue, and hope to get some thoughts. One way I can think of dealing with this is to create some kind of table that maps each non-printing character we want to account for to its unicode counterpart. Then, we would have to iterate over each string and if any substring is equivalent to a key in the table we replace it with the associated unicode value. This modified string would then be used in the response.
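
The table idea could look roughly like the TypeScript sketch below. The names `NON_PRINTING_LABELS` and `replaceFromTable` are hypothetical, and the table entries are illustrative only, not an agreed-upon list of characters.

```typescript
// Hypothetical lookup table: keys are characters to flag, values are display labels.
// The entries are examples only; the real set of characters is still undecided.
const NON_PRINTING_LABELS = new Map<string, string>([
  ["\u0000", "U+0000"], // NULL
  ["\b", "U+0008"], // BACKSPACE
  ["\u001b", "U+001B"], // ESCAPE
  ["\u007f", "U+007F"], // DELETE
  ["\u200b", "U+200B"], // ZERO WIDTH SPACE
]);

function replaceFromTable(value: string): string {
  let result = "";
  // One pass over the string; each character costs a single map lookup.
  for (const ch of value) {
    result += NON_PRINTING_LABELS.get(ch) ?? ch;
  }
  return result;
}
```

Because the keys are single characters, there is no need to search for substrings: the work is one map lookup per character, so it grows linearly with the length of the string.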

I'm also wondering if maybe we could use a regex to detect the non-printing characters and replace them that way. I'm not sure if we could easily replace the expression with its unicode equivalent, but maybe we could just replace them all with some message to alert the user that a non-printing character was detected.
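
On whether a regex could substitute the Unicode equivalent: `String.prototype.replace` accepts a callback that receives each match, so the replacement can be computed from the matched character. A minimal TypeScript sketch follows. The names `CONTROL_CHARS` and `replaceWithRegex` are hypothetical, and the character class is an assumption covering the C0 controls, DEL, and the C1 controls.

```typescript
// Assumption: target the C0 controls (U+0000..U+001F), DEL (U+007F),
// and the C1 controls (U+0080..U+009F). Both names are hypothetical.
const CONTROL_CHARS = /[\u0000-\u001f\u007f-\u009f]/g;

function replaceWithRegex(value: string): string {
  // The callback receives each matched character and returns its U+XXXX label.
  return value.replace(CONTROL_CHARS, (ch) => {
    const hex = ch.charCodeAt(0).toString(16).toUpperCase();
    return "U+" + ("0000" + hex).slice(-4);
  });
}
```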

I'm also thinking of the space/time complexity of each approach and how they could be optimized (if possible), and whether either approach is worth the cost given the problem being looked at is a rare use case anyway.

mattt

mattt commented on Oct 5, 2021

@mattt

@marcelolynch You should be able to iterate through each code point in the string in linear time. I'd recommend against using a regular expression for this kind of string processing. Each C0 control character (U+0000 – U+001F) has a corresponding control picture that you can get by adding the offset 0x2400. For any other characters, a switch statement would probably be your best bet.
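
A minimal TypeScript sketch of this suggestion, iterating by code point, applying the 0x2400 offset to C0 characters, and using a switch for everything else. The function name `showControlCharacters` is hypothetical, and the two switch cases are illustrative assumptions about which other characters might be worth flagging.

```typescript
// Hypothetical helper; the name and the switch cases are illustrative.
function showControlCharacters(value: string): string {
  let result = "";
  // for...of walks the string by code point, so each character is visited once.
  for (const ch of value) {
    const codePoint = ch.codePointAt(0);
    if (codePoint === undefined) {
      continue; // cannot happen for a non-empty ch; keeps the type checker satisfied
    }
    if (codePoint <= 0x1f) {
      // C0 control character: shift into the Control Pictures block.
      result += String.fromCodePoint(codePoint + 0x2400);
      continue;
    }
    switch (codePoint) {
      case 0x7f: // DELETE -> SYMBOL FOR DELETE (U+2421)
        result += "\u2421";
        break;
      case 0x200b: // ZERO WIDTH SPACE has no control picture, so show its code point
        result += "U+200B";
        break;
      default:
        result += ch;
    }
  }
  return result;
}
```

Characters outside the C0 range that are not listed in the switch, including those outside the Basic Multilingual Plane, are passed through unchanged.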


      Problems handling non-printable chars · Issue #584 · github/vscode-codeql