
JSON: the parser overflows stack on deeply nested files (modified by @masatake) #1682

Closed
codebrainz opened this issue Feb 11, 2018 · 8 comments

@codebrainz
Contributor

Not sure if this is considered a bug, but the JSON parser uses recursion while parsing (as I'm sure other parsers do), so if a sufficiently large, deeply nested file is processed, the parser runs out of stack and crashes.

It would be great if it used iteration instead of recursion, or at least limited the recursion depth so that it doesn't crash. This came up as a Geany bug.
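
Just to illustrate the depth-limit idea, here is a minimal sketch in C -- not the parser's actual code; parseValue and readToken mirror names from the backtrace below, while the cap, the skipValue helper, and the control flow are invented:

#define MAX_RECURSION_DEPTH 128            /* hypothetical cap */

static void parseValue (tokenInfo *const token, unsigned int depth)
{
    if (depth > MAX_RECURSION_DEPTH)
    {
        skipValue (token);                 /* hypothetical: discard the too-deep subtree */
        return;
    }

    readToken (token);
    if (token->type == TOKEN_OPEN_SQUARE || token->type == TOKEN_OPEN_CURLY)
        parseValue (token, depth + 1);     /* recurse with an explicit depth */
    /* ... handling of the other token types elided ... */
}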

Backtrace:

#0  0x00007ffff756a938 in _IO_vfprintf_internal (s=s@entry=0x7fffff7ff570, format=format@entry=0x7ffff7d49353 "%u", ap=ap@entry=0x7fffff7ff6f0)
    at vfprintf.c:1320
#1  0x00007ffff763bc79 in ___vsnprintf_chk (s=0x7fffff7ff7e0 "", maxlen=<optimized out>, flags=1, slen=<optimized out>, format=0x7ffff7d49353 "%u", args=args@entry=0x7fffff7ff6f0) at vsnprintf_chk.c:63
#2  0x00007ffff763bba5 in ___snprintf_chk (s=s@entry=0x7fffff7ff7e0 "", maxlen=maxlen@entry=32, flags=flags@entry=1, slen=slen@entry=32, format=format@entry=0x7ffff7d49353 "%u") at snprintf_chk.c:34
#3  0x00007ffff7d334f9 in snprintf (__fmt=0x7ffff7d49353 "%u", __n=32, __s=0x7fffff7ff7e0 "") at /usr/include/x86_64-linux-gnu/bits/stdio2.h:64
#4  0x00007ffff7d334f9 in parseValue (token=token@entry=0x555555b638e0) at ../../ctags/parsers/json.c:330
#5  0x00007ffff7d33535 in parseValue (token=token@entry=0x555555b638e0) at ../../ctags/parsers/json.c:335
#6  0x00007ffff7d33535 in parseValue (token=token@entry=0x555555b638e0) at ../../ctags/parsers/json.c:335
[snip many more identical lines]

I generated a test file to reproduce the crash by piping the output of the following script to a file and opening it in Geany:

#!/usr/bin/env python3
import sys
limit = 75000
for i in range(0, limit):
    sys.stdout.write('[')
sys.stdout.write('\n"foo"\n')
for i in range(0, limit):
    sys.stdout.write(']')
@masatake
Member

Interesting. I will take a look when I can find time.

@b4n
Member

b4n commented Feb 14, 2018

This is fairly tricky to fix completely, because there's a lot of recursion involved in this parser (recursion is usually the simplest way of implementing this kind of parsing).

The case shown here is one among others, and it's probably the easiest one to fix or work around. A harder one is a slight modification of the same idea:

#!/usr/bin/env python3
import sys
limit = 750000
sys.stdout.write('{"a":')
for i in range(0, limit):
    sys.stdout.write('[')
sys.stdout.write('\n"foo"\n')
for i in range(0, limit):
    sys.stdout.write(']')
sys.stdout.write('}\n')

This one will use a lot of heap memory (because tokens are allocated recursively) and will likely blow up the stack just the same. -- Never mind, I thought yours reproduced the issue in Geany's report, but it's actually showing another one, the same one I was trying to show here.

I agree that it would be nice -- if not very important -- to fix this kind of problem, but I'm afraid it's not so trivial. Basically, you'd want no unbounded recursion at all, and an overall resource limit enforced in a way that stops only the parsers and not ctags's core. I don't know how easy that would be. Maybe most OSes have soft resource limits that could be applied to only part of the program, but I'm not very knowledgeable about that.
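
One portable-ish sketch of such a parser-only limit (purely illustrative; every name and number here is an assumption, including the downward-growing stack): record an address near the top of the stack when a parser run starts, and let recursive functions compare against it before descending:

#include <stdbool.h>
#include <stddef.h>

#define PARSER_STACK_BUDGET (256 * 1024)   /* hypothetical: 256 KiB per run */

static const char *parserStackBase;        /* recorded at parser entry */

static void recordStackBase (void)
{
    char marker;
    parserStackBase = &marker;
}

/* Heuristic only: assumes a downward-growing stack, which holds on
 * common platforms, and tolerates the formally undefined pointer math. */
static bool stackBudgetExceeded (void)
{
    char marker;
    return (size_t)(parserStackBase - &marker) > PARSER_STACK_BUDGET;
}

A recursive function could then bail out early whenever stackBudgetExceeded () returns true, instead of relying on the OS to stop it.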

@b4n
Member

b4n commented Feb 14, 2018

For the part highlighted in the Geany report (handling of large malformed input), this could be a fix. It handles invalid input differently, which probably doesn't matter: the input is invalid anyway, and there's no way to know what the "best" way to handle it is.

diff --git a/parsers/json.c b/parsers/json.c
index 5098daa1..7646dd8a 100644
--- a/parsers/json.c
+++ b/parsers/json.c
@@ -242,21 +242,28 @@ static void skipToOneOf3 (tokenInfo *const token,
                           const tokenType type2,
                           const tokenType type3)
 {
+    unsigned int depth = 0;
+
     while (token->type != TOKEN_EOF &&
-           token->type != type1 &&
-           token->type != type2 &&
-           token->type != type3)
+           (depth > 0 ||
+            (token->type != type1 &&
+             token->type != type2 &&
+             token->type != type3)))
     {
         readToken (token);
-        if (token->type == TOKEN_OPEN_CURLY)
+        switch (token->type)
         {
-            skipTo (token, TOKEN_CLOSE_CURLY);
-            readToken (token);
-        }
-        else if (token->type == TOKEN_OPEN_SQUARE)
-        {
-            skipTo (token, TOKEN_CLOSE_SQUARE);
-            readToken (token);
+            case TOKEN_OPEN_CURLY:
+            case TOKEN_OPEN_SQUARE:
+                depth++;
+                break;
+            case TOKEN_CLOSE_CURLY:
+            case TOKEN_CLOSE_SQUARE:
+                if (depth > 0)
+                    depth--;
+                break;
+            default:
+                break;
         }
     }
 }

However, it doesn't fix the issue at hand, which happens when parsing valid but enormous input.

@codebrainz
Contributor Author

codebrainz commented Feb 15, 2018

Just brainstorming, but one thing that comes to mind is to use GCC's instrumentation hooks to keep track of recursion depth in a static global variable without having to modify every function. At the entry point of the parser, it could setjmp, and then within the entry hook, if the recursion depth limit is exceeded, it could longjmp back to the entry point, clean up the accumulated tags, and return an error code.

It would be really hacky, but it would require minimal changes to the parsing code and could easily be guarded out for unsupported compilers or for highly optimized builds. Just a thought; not sure how practical it really is.
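
A minimal sketch of that idea, assuming the parser objects are built with GCC's -finstrument-functions; the hook signatures and the no_instrument_function attribute are GCC's, but the limit and the jump target are hypothetical (and, per the follow-up below, the longjmp part turns out not to work in practice):

#include <setjmp.h>

#define MAX_CALL_DEPTH 1000                /* hypothetical recursion limit */

static unsigned int callDepth;
static jmp_buf parserEntry;                /* set with setjmp () at the parser's entry point */

__attribute__((no_instrument_function))
void __cyg_profile_func_enter (void *this_fn, void *call_site)
{
    if (++callDepth > MAX_CALL_DEPTH)
        longjmp (parserEntry, 1);          /* unwind back to the entry point */
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit (void *this_fn, void *call_site)
{
    if (callDepth > 0)
        callDepth--;
}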

@codebrainz
Contributor Author

I just noticed the -fstack-limit-symbol=sym option on the same page, which may be more or less hacky.

@codebrainz
Contributor Author

I did a quick test: the enter/leave instrumentation functions won't work with longjmp, ostensibly because they aren't part of the call stack proper (longjmp just segfaults). A similar technique could still be used without setjmp/longjmp, but AFAICT it would require changing every single function. Oh well.

@masatake masatake changed the title JSON parser overflows stack on deeply nested files JSON: the parser overflows stack on deeply nested files (modified by @masatake) Jan 24, 2019
@masatake masatake mentioned this issue May 30, 2019
masatake added a commit to masatake/ctags that referenced this issue Jun 11, 2019
Close universal-ctags#2107.
Related to universal-ctags#1682.

To avoid stack overflow, this change provides a way to terminate
the parsing.

The recursion is closely tied to square brackets and curly brackets.
Instead of limiting the function recursion, this change tracks the depth
of brackets. Here I assume that deeply nested brackets in the input
stream cause deep function-call recursion.

readTokenFull is changed to return EOF when it detects too deeply
nested (> 512) brackets in the current input stream. The EOF terminates
the parsing.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
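
Read against the commit message, the guard could look roughly like this; readTokenFull, tokenInfo, and the token type names mirror the message and the earlier diff, while readTokenRaw and the exact structure are assumptions:

#define MAX_BRACKET_DEPTH 512

static unsigned int bracketDepth;          /* reset to 0 for each input stream */

static void readTokenFull (tokenInfo *const token)
{
    readTokenRaw (token);                  /* hypothetical low-level reader */

    switch (token->type)
    {
        case TOKEN_OPEN_CURLY:
        case TOKEN_OPEN_SQUARE:
            if (++bracketDepth > MAX_BRACKET_DEPTH)
                token->type = TOKEN_EOF;   /* too deep: end the parse gently */
            break;
        case TOKEN_CLOSE_CURLY:
        case TOKEN_CLOSE_SQUARE:
            if (bracketDepth > 0)
                bracketDepth--;
            break;
        default:
            break;
    }
}
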
@masatake
Member

Could you look at #2111?

masatake added a commit to masatake/ctags that referenced this issue Jun 12, 2019
…airs

Close universal-ctags#2107.
Related to universal-ctags#1682.

To avoid stack overflow, this change provides a way to terminate
the parsing if the recursion CAN be too deep.

The recursion is closely tied to square brackets and curly brackets
in the input stream.
Instead of limiting the depth of the function recursion itself, this
change tracks the depth of brackets. Here I assume that deeply nested
brackets increase the recursion depth of function calls.

readTokenFull is changed to return an EOF token when it detects too
deeply nested (>= 512) brackets in the current input stream. The
EOF token may terminate the parsing GENTLY.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
masatake added a commit to masatake/ctags that referenced this issue Jul 14, 2019
…airs

Close universal-ctags#2107 reported by @hanwen.
Related to universal-ctags#1682.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
@masatake
Member

For JSON specifically, the issue itself is fixed via #2111.
What I did is very similar to b4n's example in the Feb 14, 2018 comment above.
My fix rejects too deeply nested brackets even if the input is valid.
