Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Illegal control character in JSON output #165

Closed
Slange-Mhath opened this issue Jun 30, 2021 · 6 comments
Closed

Illegal control character in JSON output #165

Slange-Mhath opened this issue Jun 30, 2021 · 6 comments
Assignees
Labels

Comments

@Slange-Mhath
Copy link

Hey @richardlehane,

Is it possible that control characters in the JSON output are not escaped correctly?
Running Siegfried with a file that has for instance the name "/tmp/test_folder/^Test_file 1"
(linux output might look like "/tmp/test_folder/?Test.log") seems to output an invalid JSON.

A way to replicate that error would be to create a new text file with a control character inside and run a scan with sf -json.
The JSON output will contain the control character, which might result in an invalid JSON format.
Not sure if this might affect other output formats (like YAML) as well.

@richardlehane
Copy link
Owner

When I read this, I thought ... of course this can't be possible, I must be using golang's json package to write safe json output, and surely that library is thoroughly battle-tested. Well... I just checked the code and, embarrassingly, the json output is hand-crafted & indeed this is very definitely possible! There is some "sanitisation" of file names in the output (https://github.com/richardlehane/siegfried/blob/master/pkg/writer/writer.go#L227) but evidently not enough. I'll aim to fix this for the next release

@Slange-Mhath
Copy link
Author

Ha! Classic (and not embarrassing at all!)
Thank you so much for this. Really appreciate it!

@ross-spencer
Copy link
Collaborator

I can't recreate this with the folder structure below and command sf -json test_data | jsonlint or sf -json test_data | python -m json.tool:

ross-spencer:/tmp$ ls -la test_folder
total 92
drwxrwxr-x  2 ross-spencer ross-spencer  4096 Jun 30 17:09  .
drwxrwxrwt 26 root         root         73728 Jun 30 17:10  ..
-rw-rw-r--  1 ross-spencer ross-spencer     0 Jun 30 17:02 '\afile'
-rw-rw-r--  1 ross-spencer ross-spencer     0 Jun 30 17:05 '\atest'
-rw-rw-r--  1 ross-spencer ross-spencer     5 Jun 30 13:17 '?file'
-rw-rw-r--  1 ross-spencer ross-spencer     5 Jun 30 13:17 '^file'
-rw-rw-r--  1 ross-spencer ross-spencer     0 Jun 30 17:09 '^M'
-rw-rw-r--  1 ross-spencer ross-spencer     0 Jun 30 13:18  @test
-rw-rw-r--  1 ross-spencer ross-spencer     5 Jun 30 13:17 '^?test_file'
-rw-rw-r--  1 ross-spencer ross-spencer     0 Jun 30 17:07 '^Test_file'
-rw-rw-r--  1 ross-spencer ross-spencer     0 Jun 30 17:08 '?Test.log'

Siegfried then outputs the filenames:

    "filename": "test_folder/?Test.log",
    "filename": "test_folder/?file",
    "filename": "test_folder/@test",
    "filename": "test_folder/\\afile",
    "filename": "test_folder/\\atest",
    "filename": "test_folder/^?test_file",
    "filename": "test_folder/^M",
    "filename": "test_folder/^Test_file",
    "filename": "test_folder/^file",

That aside though I wanted to point y'all at the test here I wrote recently which might be useful to copy of modify to simulate the error during the fix/refactor:

// Implement a basic Writer test for some of the data coming out of
// the Wikidata identifier. CSV and YAML will need a little more
// thought.
var w writer.Writer
buf := new(bytes.Buffer)
w = writer.JSON(buf)
w.Head("path/to/file", time.Now(), time.Now(), [3]int{0, 0, 0}, wdSiegfried.Identifiers(), wdSiegfried.Fields(), "md5")
w.File("testName", 10, "testMod", []byte("d41d8c"), nil, res)
w.Tail()
if !json.Valid([]byte(buf.String())) {
t.Errorf("Output from JSON writer is invalid: %s", buf.String())
}
}

@jamesmooneybodleian
Copy link

jamesmooneybodleian commented Jun 30, 2021

I suspect how control characters are shown on the filesystem differs with operating system.

On RedHat 7, when creating test files with control characters using Control-v then Control-a (Control-m, Control-o) they appear like this:

-rw-r--r-- 1 root root 0 Jun 30 18:06 ?testing_control_character_a
-rw-r--r-- 1 root root 0 Jun 30 18:09 ?testing_control_character_m
-rw-r--r-- 1 root root 0 Jun 30 18:05 ?testing_control_character_o

siegfried_json_output.txt

Attached json output from siegfried which shows the included control characters.

However on Ubuntu, a control character named file looked like this:
-rw-rw-r-- 1 root root 0 Jun 30 18:08 ''$'\017''testing'

siegfried_json_output_ubuntu.txt
Attached ubuntu json output, also includes control character.

@richardlehane
Copy link
Owner

for reference, here is how the standard library escapes json strings: https://golang.org/src/encoding/json/encode.go?s=8400:8459#L1028
Key thing missing is escaping of bytes < 0x20

@ross-spencer
Copy link
Collaborator

With some time today I used those tests mentioned above to (I believe) create a test that confirms the issue generically (i.e. not contingent on OS differences) that can be used to develop from here: #167 (it tests and passes for true-negatives and passes, and true-positives and passes, so it will fail when the issue is fixed - I figured a "pass" would generate less noise although I appreciate it might be less intuitive).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants