
JSON: the parser overflows stack on deeply nested files (modified by @masatake) #1682

Closed
codebrainz opened this issue Feb 11, 2018 · 8 comments

@codebrainz
Contributor

Not sure if this is considered a bug, but the JSON parser uses recursion while parsing (as I'm sure other parsers do), so if a sufficiently large, deeply nested file is processed, the parser runs out of stack and crashes.

It would be great if it used iteration instead of recursion, or at least limited the recursion depth so that it doesn't crash. This came up as a Geany bug.
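
Just to illustrate the depth-limit idea, here is a minimal sketch in C -- not the parser's actual code; parseValue and readToken mirror names from the backtrace below, while the cap, the skipValue helper, and the control flow are invented:

#define MAX_RECURSION_DEPTH 128            /* hypothetical cap */

static void parseValue (tokenInfo *const token, unsigned int depth)
{
    if (depth > MAX_RECURSION_DEPTH)
    {
        skipValue (token);                 /* hypothetical: discard the too-deep subtree */
        return;
    }

    readToken (token);
    if (token->type == TOKEN_OPEN_SQUARE || token->type == TOKEN_OPEN_CURLY)
        parseValue (token, depth + 1);     /* recurse with an explicit depth */
    /* ... handling of the other token types elided ... */
}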

Backtrace:

#0  0x00007ffff756a938 in _IO_vfprintf_internal (s=s@entry=0x7fffff7ff570, format=format@entry=0x7ffff7d49353 "%u", ap=ap@entry=0x7fffff7ff6f0)
    at vfprintf.c:1320
#1  0x00007ffff763bc79 in ___vsnprintf_chk (s=0x7fffff7ff7e0 "", maxlen=<optimized out>, flags=1, slen=<optimized out>, format=0x7ffff7d49353 "%u", args=args@entry=0x7fffff7ff6f0) at vsnprintf_chk.c:63
#2  0x00007ffff763bba5 in ___snprintf_chk (s=s@entry=0x7fffff7ff7e0 "", maxlen=maxlen@entry=32, flags=flags@entry=1, slen=slen@entry=32, format=format@entry=0x7ffff7d49353 "%u") at snprintf_chk.c:34
#3  0x00007ffff7d334f9 in snprintf (__fmt=0x7ffff7d49353 "%u", __n=32, __s=0x7fffff7ff7e0 "") at /usr/include/x86_64-linux-gnu/bits/stdio2.h:64
#4  0x00007ffff7d334f9 in parseValue (token=token@entry=0x555555b638e0) at ../../ctags/parsers/json.c:330
#5  0x00007ffff7d33535 in parseValue (token=token@entry=0x555555b638e0) at ../../ctags/parsers/json.c:335
#6  0x00007ffff7d33535 in parseValue (token=token@entry=0x555555b638e0) at ../../ctags/parsers/json.c:335
[snip many more identical lines]

I generated a test file to reproduce the crash by piping the output of the following script to a file and opening it in Geany:

#!/usr/bin/env python3
import sys
limit = 75000
for i in range(0, limit):
    sys.stdout.write('[')
sys.stdout.write('\n"foo"\n')
for i in range(0, limit):
    sys.stdout.write(']')
@masatake
Member

Interesting. I will take a look when I can find time.

@b4n
Member

b4n commented Feb 14, 2018

This is fairly tricky to fix completely, because there's a lot of recursion involved in this parser (recursion is usually the simplest way of implementing this kind of parsing).

The case shown here is one among others, and it's probably the easiest one to fix or work around. A harder one is a slight modification of the same idea:

#!/usr/bin/env python3
import sys
limit = 750000
sys.stdout.write('{"a":')
for i in range(0, limit):
    sys.stdout.write('[')
sys.stdout.write('\n"foo"\n')
for i in range(0, limit):
    sys.stdout.write(']')
sys.stdout.write('}\n')

This one will use a lot of heap memory (because tokens are allocated recursively) and will likely blow up the stack just the same. -- Never mind, I thought yours reproduced the issue in Geany's report, but it's actually showing another one, the same one I was trying to show here.

I agree that it would be nice -- if not very important -- to fix this kind of problem, but I'm afraid it's not so trivial. Basically, you'd want no unbounded recursion at all, and an overall resource limit enforced in a way that stops only the parsers and not ctags's core. I don't know how easy that would be. Maybe most OSes have soft resource limits that could be applied to only part of the program, but I'm not very knowledgeable about that.
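
One portable-ish sketch of such a parser-only limit (purely illustrative; every name and number here is an assumption, including the downward-growing stack): record an address near the top of the stack when a parser run starts, and let recursive functions compare against it before descending:

#include <stdbool.h>
#include <stddef.h>

#define PARSER_STACK_BUDGET (256 * 1024)   /* hypothetical: 256 KiB per run */

static const char *parserStackBase;        /* recorded at parser entry */

static void recordStackBase (void)
{
    char marker;
    parserStackBase = &marker;
}

/* Heuristic only: assumes a downward-growing stack, which holds on
 * common platforms, and tolerates the formally undefined pointer math. */
static bool stackBudgetExceeded (void)
{
    char marker;
    return (size_t)(parserStackBase - &marker) > PARSER_STACK_BUDGET;
}

A recursive function could then bail out early whenever stackBudgetExceeded () returns true, instead of relying on the OS to stop it.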

@b4n
Member

b4n commented Feb 14, 2018

For the part highlighted in the Geany report (handling of large malformed input), this could be a fix. It handles invalid input differently, which probably doesn't matter: the input is invalid anyway, and there's no way to know what the "best" way to handle it is.

diff --git a/parsers/json.c b/parsers/json.c
index 5098daa1..7646dd8a 100644
--- a/parsers/json.c
+++ b/parsers/json.c
@@ -242,21 +242,28 @@ static void skipToOneOf3 (tokenInfo *const token,
                           const tokenType type2,
                           const tokenType type3)
 {
+    unsigned int depth = 0;
+
     while (token->type != TOKEN_EOF &&
-           token->type != type1 &&
-           token->type != type2 &&
-           token->type != type3)
+           (depth > 0 ||
+            (token->type != type1 &&
+             token->type != type2 &&
+             token->type != type3)))
     {
         readToken (token);
-        if (token->type == TOKEN_OPEN_CURLY)
+        switch (token->type)
         {
-            skipTo (token, TOKEN_CLOSE_CURLY);
-            readToken (token);
-        }
-        else if (token->type == TOKEN_OPEN_SQUARE)
-        {
-            skipTo (token, TOKEN_CLOSE_SQUARE);
-            readToken (token);
+            case TOKEN_OPEN_CURLY:
+            case TOKEN_OPEN_SQUARE:
+                depth++;
+                break;
+            case TOKEN_CLOSE_CURLY:
+            case TOKEN_CLOSE_SQUARE:
+                if (depth > 0)
+                    depth--;
+                break;
+            default:
+                break;
         }
     }
 }

However, it doesn't fix the issue at hand, which happens when parsing valid but enormous input.

@codebrainz
Contributor Author

codebrainz commented Feb 15, 2018

Just brainstorming, but one thing that comes to mind is to use GCC's instrumentation hooks to keep track of recursion depth in a static global variable without having to modify every function. At the entry point of the parser, it could setjmp, and then within the entry hook, if the recursion depth limit is exceeded, it could longjmp back to the entry point, clean up the accumulated tags, and return an error code.

It would be really hacky, but it would require minimal changes to the parsing code and could easily be guarded out for unsupported compilers or for highly optimized builds. Just a thought; not sure how practical it really is.
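
A minimal sketch of that idea, assuming the parser objects are built with GCC's -finstrument-functions; the hook signatures and the no_instrument_function attribute are GCC's, but the limit and the jump target are hypothetical (and, per the follow-up below, the longjmp part turns out not to work in practice):

#include <setjmp.h>

#define MAX_CALL_DEPTH 1000                /* hypothetical recursion limit */

static unsigned int callDepth;
static jmp_buf parserEntry;                /* set with setjmp () at the parser's entry point */

__attribute__((no_instrument_function))
void __cyg_profile_func_enter (void *this_fn, void *call_site)
{
    if (++callDepth > MAX_CALL_DEPTH)
        longjmp (parserEntry, 1);          /* unwind back to the entry point */
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit (void *this_fn, void *call_site)
{
    if (callDepth > 0)
        callDepth--;
}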

@codebrainz
Contributor Author

I just noticed the -fstack-limit-symbol=sym option on the same page, which may be more or less hacky.

@codebrainz
Contributor Author

I did a quick test: the enter/leave instrumentation functions won't work with longjmp, ostensibly because they aren't part of the call stack proper (longjmp just segfaults). A similar technique could still be used without setjmp/longjmp, but AFAICT it would require changing every single function. Oh well.

@masatake masatake changed the title JSON parser overflows stack on deeply nested files JSON: the parser overflows stack on deeply nested files (modified by @masatake) Jan 24, 2019
@masatake masatake mentioned this issue May 30, 2019
masatake added a commit to masatake/ctags that referenced this issue Jun 11, 2019
Close universal-ctags#2107.
Related to universal-ctags#1682.

To avoid stack overflow, this change provides a way to terminate
the parsing.

The recursion is closely tied to square brackets and curly brackets.
Instead of limiting the function recursion, this change tracks the depth
of brackets. Here I assume that deeply nested brackets in the input
stream cause deep function-call recursion.

readTokenFull is changed to return EOF when it detects too deeply
nested (> 512) brackets in the current input stream. The EOF terminates
the parsing.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
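
Read against the commit message, the guard could look roughly like this; readTokenFull, tokenInfo, and the token type names mirror the message and the earlier diff, while readTokenRaw and the exact structure are assumptions:

#define MAX_BRACKET_DEPTH 512

static unsigned int bracketDepth;          /* reset to 0 for each input stream */

static void readTokenFull (tokenInfo *const token)
{
    readTokenRaw (token);                  /* hypothetical low-level reader */

    switch (token->type)
    {
        case TOKEN_OPEN_CURLY:
        case TOKEN_OPEN_SQUARE:
            if (++bracketDepth > MAX_BRACKET_DEPTH)
                token->type = TOKEN_EOF;   /* too deep: end the parse gently */
            break;
        case TOKEN_CLOSE_CURLY:
        case TOKEN_CLOSE_SQUARE:
            if (bracketDepth > 0)
                bracketDepth--;
            break;
        default:
            break;
    }
}
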
@masatake
Member

Could you look at #2111?

masatake added a commit to masatake/ctags that referenced this issue Jun 12, 2019
…airs

Close universal-ctags#2107.
Related to universal-ctags#1682.

To avoid stack overflow, this change provides a way to terminate
the parsing if the recursion CAN be too deep.

The recursion is closely tied to square brackets and curly brackets
in the input stream.
Instead of limiting the depth of the function recursion itself, this
change tracks the depth of brackets. Here I assume that deeply nested
brackets increase the recursion depth of function calls.

readTokenFull is changed to return an EOF token when it detects too
deeply nested (>= 512) brackets in the current input stream. The
EOF token may terminate the parsing GENTLY.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
masatake added a commit to masatake/ctags that referenced this issue Jul 14, 2019
…airs

Close universal-ctags#2107 reported by @hanwen.
Related to universal-ctags#1682.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
@masatake
Member

For JSON specifically, the issue itself is fixed via #2111.
What I did is very similar to b4n's example in the Feb 14, 2018 comment above.
My fix rejects too deeply nested brackets even if the input is valid.
