alternator: limit expression length and recursion depth

DynamoDB limits of all expressions (ConditionExpression, UpdateExpression, ProjectionExpression, FilterExpression, KeyConditionExpression) to just 4096 bytes. Until now, Alternator did not enforce this limit, and we had an xfailing test showing this. But it turns out that not enforcing this limit can be dangerous: The user can pass arbitrarily-long and arbitrarily nested expressions, such as: a<b and (a<b and (a<b and (a<b and (a<b and (a<b and (...)))))) or ((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((( and those can cause recursive algorithms in Alternator's parser and later when applying expressions to recurse very deeply, overflow the stack, and crash. This patch includes new tests that demonstrate how Scylla crashes during parsing before enforcing the 4096-byte length limit on expressions. The patch then enforces this length limit, and these tests stop crashing. We also verify that deeply-nested expressions shorter than the 4096-byte limit are apparently short enough for our recursion ability, and work as expected. Unforuntately, running these tests many times showed that the 4096-byte limit is not low enough to avoid all crashes so this patch needs to do more: The parsers created by ANTLR are recursive, and there is no way to limit the depth of their recursion (i.e., nothing like YACC's YYMAXDEPTH). Very deep recursion can overflow the stack and crash Scylla. After we limited the length of expression strings to 4096 bytes this was *almost* enough to prevent stack overflows. But unfortunetely the tests revealed that even limited to 4096 bytes, the expression can sometimes recurse too deeply: Consider the expression "((((((....((((" with 4000 parentheses. To realize this is a syntax error, the parser needs to do a recursive call 4000 times. Or worse - because of other Antlr limitations (see rants in comments in expressions.g) it's actually 12000 recursive calls, and each of these calls have a pretty large frame. In some cases, this overflows the stack. The solution used in this patch is not pretty, but works. We add to rules in alternator/expressions.g that recurse (there are two of those - "value" and "boolean_expression") an integer "depth" parameter, which we increase when the rule recurses. Moreover, we add a so-called predicate "{depth<MAX_DEPTH}?" that stops the parsing when this limit is reached. When the parsing is stopped, the user will see a special kind of parse error, saying "expression nested too deeply". With this last modification to expressions.g, the tests for deeply-nested but still-below-4096-bytes expressions (test_limits.py::test_deeply_nested_expression_*) would not fail sporadically as they did without it. While adding the "expression nested too deeply" case, I also made the general syntax-error reporting in Alternator nicer: It no longer prints the internal "expression_syntax_error" type name (an exception type will only be printed if some sort of unexpected exception happens), and it prints the character position where the syntax error (or too deep nested expression) was recognized. Fixes #14473 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #14477
scylladb · Jul 31, 2023 · 04e5082 · 04e5082
1 parent a637ddd
commit 04e5082
Show file tree

Hide file tree

Showing 6 changed files with 355 additions and 47 deletions.
diff --git a/alternator/executor.cc b/alternator/executor.cc
@@ -1648,7 +1648,7 @@ static parsed::condition_expression get_parsed_condition_expression(rjson::value
         throw api_error::validation("ConditionExpression must not be empty");
     }
     try {
-        return parse_condition_expression(rjson::to_string_view(*condition_expression));
+        return parse_condition_expression(rjson::to_string_view(*condition_expression), "ConditionExpression");
     } catch(expressions_syntax_error& e) {
         throw api_error::validation(e.what());
     }
@@ -3450,7 +3450,7 @@ filter::filter(const rjson::value& request, request_type rt,
             throw api_error::validation("Cannot use both old-style and new-style parameters in same request: FilterExpression and AttributesToGet");
         }
         try {
-            auto parsed = parse_condition_expression(rjson::to_string_view(*expression));
+            auto parsed = parse_condition_expression(rjson::to_string_view(*expression), "FilterExpression");
             const rjson::value* expression_attribute_names = rjson::find(request, "ExpressionAttributeNames");
             const rjson::value* expression_attribute_values = rjson::find(request, "ExpressionAttributeValues");
             resolve_condition_expression(parsed,
@@ -4078,7 +4078,7 @@ calculate_bounds_condition_expression(schema_ptr schema,
     // sort-key range.
     parsed::condition_expression p;
     try {
-        p = parse_condition_expression(rjson::to_string_view(expression));
+        p = parse_condition_expression(rjson::to_string_view(expression), "KeyConditionExpression");
     } catch(expressions_syntax_error& e) {
         throw api_error::validation(e.what());
     }

diff --git a/alternator/expressions.cc b/alternator/expressions.cc
@@ -29,7 +29,7 @@
 namespace alternator {
 
 template <typename Func, typename Result = std::result_of_t<Func(expressionsParser&)>>
-Result do_with_parser(std::string_view input, Func&& f) {
+static Result do_with_parser(std::string_view input, Func&& f) {
     expressionsLexer::InputStreamType input_stream{
         reinterpret_cast<const ANTLR_UINT8*>(input.data()),
         ANTLR_ENC_UTF8,
@@ -43,31 +43,41 @@ Result do_with_parser(std::string_view input, Func&& f) {
     return result;
 }
 
-parsed::update_expression
-parse_update_expression(std::string_view query) {
+template <typename Func, typename Result = std::result_of_t<Func(expressionsParser&)>>
+static Result parse(const char* input_name, std::string_view input, Func&& f) {
+    if (input.length() > 4096) {
+        throw expressions_syntax_error(format("{} expression size {} exceeds allowed maximum 4096.",
+            input_name, input.length()));
+    }
     try {
-        return do_with_parser(query,  std::mem_fn(&expressionsParser::update_expression));
+        return do_with_parser(input, f);
+    } catch (expressions_syntax_error& e) {
+        // If already an expressions_syntax_error, don't print the type's
+        // name (it's just ugly), just the message.
+        // TODO: displayRecognitionError could set a position inside the
+        // expressions_syntax_error in throws, and we could use it here to
+        // mark the broken position in 'input'.
+        throw expressions_syntax_error(format("Failed parsing {} '{}': {}",
+            input_name, input, e.what()));
     } catch (...) {
-        throw expressions_syntax_error(format("Failed parsing UpdateExpression '{}': {}", query, std::current_exception()));
+        throw expressions_syntax_error(format("Failed parsing {} '{}': {}",
+            input_name, input, std::current_exception()));
     }
 }
 
+parsed::update_expression
+parse_update_expression(std::string_view query) {
+    return parse("UpdateExpression", query,  std::mem_fn(&expressionsParser::update_expression));
+}
+
 std::vector<parsed::path>
 parse_projection_expression(std::string_view query) {
-    try {
-        return do_with_parser(query,  std::mem_fn(&expressionsParser::projection_expression));
-    } catch (...) {
-        throw expressions_syntax_error(format("Failed parsing ProjectionExpression '{}': {}", query, std::current_exception()));
-    }
+    return parse ("ProjectionExpression", query,  std::mem_fn(&expressionsParser::projection_expression));
 }
 
 parsed::condition_expression
-parse_condition_expression(std::string_view query) {
-    try {
-        return do_with_parser(query,  std::mem_fn(&expressionsParser::condition_expression));
-    } catch (...) {
-        throw expressions_syntax_error(format("Failed parsing ConditionExpression '{}': {}", query, std::current_exception()));
-    }
+parse_condition_expression(std::string_view query, const char* caller) {
+    return parse(caller, query,  std::mem_fn(&expressionsParser::condition_expression));
 }
 
 namespace parsed {

diff --git a/alternator/expressions.g b/alternator/expressions.g
@@ -74,7 +74,22 @@ options {
  */
 @parser::context {
     void displayRecognitionError(ANTLR_UINT8** token_names, ExceptionBaseType* ex) {
-        throw expressions_syntax_error("syntax error");
+        const char* err;
+        switch (ex->getType()) {
+        case antlr3::ExceptionType::FAILED_PREDICATE_EXCEPTION:
+            err = "expression nested too deeply";
+            break;
+        default:
+            err = "syntax error";
+            break;
+        }
+        // Alternator expressions are always single line so ex->get_line()
+        // is always 1, no sense to print it.
+        // TODO: return the position as part of the exception, so the
+        // caller in expressions.cc that knows the expression string can
+        // mark the error position in the final error message.
+        throw expressions_syntax_error(format("{} at char {}", err,
+            ex->get_charPositionInLine()));
     }
 }
 @lexer::context {
@@ -83,6 +98,23 @@ options {
     }
 }
 
+/* Unfortunately, ANTLR uses recursion - not the heap - to parse recursive
+ * expressions. To make things even worse, ANTLR has no way to limit the
+ * depth of this recursion (unlike Yacc which has YYMAXDEPTH). So deeply-
+ * nested expression like "(((((((((((((..." can easily crash Scylla on a
+ * stack overflow (see issue #14477).
+ *
+ * We are lucky that in the grammar for DynamoDB expressions (below),
+ * only a few specific rules can recurse, so it was fairly easy to add a
+ * "depth" counter to a few specific rules, and then use a predicate
+ * "{depth<MAX_DEPTH}?" to avoid parsing if the depth exceeds this limit,
+ * and throw a FAILED_PREDICATE_EXCEPTION in that case, which we will
+ * report to the user as a "expression nested too deeply" error.
+ */
+@parser::members {
+    static constexpr int MAX_DEPTH = 400;
+}
+
 /*
  * Lexical analysis phase, i.e., splitting the input up to tokens.
  * Lexical analyzer rules have names starting in capital letters.
@@ -155,19 +187,20 @@ path returns [parsed::path p]:
       | '[' INTEGER ']'           { $p.add_index(std::stoi($INTEGER.text)); }
     )*;
 
-value returns [parsed::value v]:
+/* See comment above why the "depth" counter was needed here */
+value[int depth] returns [parsed::value v]:
       VALREF       { $v.set_valref($VALREF.text); }
     | path         { $v.set_path($path.p); }
-    | NAME         { $v.set_func_name($NAME.text); }
-     '(' x=value   { $v.add_func_parameter($x.v); }
-     (',' x=value  { $v.add_func_parameter($x.v); })*
+    | {depth<MAX_DEPTH}? NAME { $v.set_func_name($NAME.text); }
+     '(' x=value[depth+1]    { $v.add_func_parameter($x.v); }
+     (',' x=value[depth+1]   { $v.add_func_parameter($x.v); })*
      ')'
     ;
 
 update_expression_set_rhs returns [parsed::set_rhs rhs]:
-    v=value  { $rhs.set_value(std::move($v.v)); }
-    (   '+' v=value  { $rhs.set_plus(std::move($v.v)); }
-      | '-' v=value  { $rhs.set_minus(std::move($v.v)); }
+    v=value[0]  { $rhs.set_value(std::move($v.v)); }
+    (   '+' v=value[0]  { $rhs.set_plus(std::move($v.v)); }
+      | '-' v=value[0]  { $rhs.set_minus(std::move($v.v)); }
     )?
     ;
 
@@ -205,7 +238,7 @@ projection_expression returns [std::vector<parsed::path> v]:
 
 
 primitive_condition returns [parsed::primitive_condition c]:
-      v=value         { $c.add_value(std::move($v.v));
+      v=value[0]      { $c.add_value(std::move($v.v));
                         $c.set_operator(parsed::primitive_condition::type::VALUE); }
       (  (  '='       { $c.set_operator(parsed::primitive_condition::type::EQ); }
           | '<' '>'   { $c.set_operator(parsed::primitive_condition::type::NE); }
@@ -214,14 +247,14 @@ primitive_condition returns [parsed::primitive_condition c]:
           | '>'       { $c.set_operator(parsed::primitive_condition::type::GT); }
           | '>' '='   { $c.set_operator(parsed::primitive_condition::type::GE); }
          )
-         v=value      { $c.add_value(std::move($v.v)); }
+         v=value[0]   { $c.add_value(std::move($v.v)); }
        | BETWEEN      { $c.set_operator(parsed::primitive_condition::type::BETWEEN); }
-         v=value      { $c.add_value(std::move($v.v)); }
+         v=value[0]   { $c.add_value(std::move($v.v)); }
          AND
-         v=value      { $c.add_value(std::move($v.v)); }
+         v=value[0]   { $c.add_value(std::move($v.v)); }
        | IN '('       { $c.set_operator(parsed::primitive_condition::type::IN); }
-         v=value      { $c.add_value(std::move($v.v)); }
-         (',' v=value { $c.add_value(std::move($v.v)); })*
+         v=value[0]   { $c.add_value(std::move($v.v)); }
+         (',' v=value[0] { $c.add_value(std::move($v.v)); })*
          ')'
       )?
     ;
@@ -231,19 +264,20 @@ primitive_condition returns [parsed::primitive_condition c]:
 // common rule prefixes, and (lack of) support for operator precedence.
 // These rules could have been written more clearly using a more powerful
 // parser generator - such as Yacc.
-boolean_expression returns [parsed::condition_expression e]:
-	  b=boolean_expression_1       { $e.append(std::move($b.e), '|'); }
-	  (OR b=boolean_expression_1   { $e.append(std::move($b.e), '|'); } )*
+// See comment above why the "depth" counter was needed here.
+boolean_expression[int depth] returns [parsed::condition_expression e]:
+	  b=boolean_expression_1[depth]       { $e.append(std::move($b.e), '|'); }
+	  (OR b=boolean_expression_1[depth]   { $e.append(std::move($b.e), '|'); } )*
 	;
-boolean_expression_1 returns [parsed::condition_expression e]:
-	  b=boolean_expression_2       { $e.append(std::move($b.e), '&'); }
-	  (AND b=boolean_expression_2  { $e.append(std::move($b.e), '&'); } )*
+boolean_expression_1[int depth] returns [parsed::condition_expression e]:
+	  b=boolean_expression_2[depth]       { $e.append(std::move($b.e), '&'); }
+	  (AND b=boolean_expression_2[depth]  { $e.append(std::move($b.e), '&'); } )*
 	;
-boolean_expression_2 returns [parsed::condition_expression e]:
+boolean_expression_2[int depth] returns [parsed::condition_expression e]:
 	  p=primitive_condition        { $e.set_primitive(std::move($p.c)); }
-	| NOT b=boolean_expression_2   { $e = std::move($b.e); $e.apply_not(); }
-	| '(' b=boolean_expression ')' { $e = std::move($b.e); }
+	| {depth<MAX_DEPTH}? NOT b=boolean_expression_2[depth+1]   { $e = std::move($b.e); $e.apply_not(); }
+	| {depth<MAX_DEPTH}? '(' b=boolean_expression[depth+1] ')' { $e = std::move($b.e); }
     ;
 
 condition_expression returns [parsed::condition_expression e]:
-    boolean_expression { e=std::move($boolean_expression.e); } EOF;
+    boolean_expression[0] { e=std::move($boolean_expression.e); } EOF;
diff --git a/alternator/expressions.hh b/alternator/expressions.hh
@@ -28,7 +28,7 @@ public:
 
 parsed::update_expression parse_update_expression(std::string_view query);
 std::vector<parsed::path> parse_projection_expression(std::string_view query);
-parsed::condition_expression parse_condition_expression(std::string_view query);
+parsed::condition_expression parse_condition_expression(std::string_view query, const char* caller);
 
 void resolve_update_expression(parsed::update_expression& ue,
         const rjson::value* expression_attribute_names,

diff --git a/test/alternator/test_condition_expression.py b/test/alternator/test_condition_expression.py
@@ -1888,3 +1888,39 @@ def test_update_item_condition_key_attribute_not_exists(test_table_s):
     with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
         test_table_s.update_item(Key={'p': p},
             ConditionExpression='attribute_not_exists(p)')
+
+# DynamoDB considers duplicate parentheses in expressions, which it calls
+# "redundant parentheses", to be illegal. Outlawing them is also useful to
+# avoid very deep recursion in the parser (see test_limits.py).
+# Let's test here what is considered redendant parentheses, and what isn't.
+@pytest.mark.xfail(reason="Alternator doesn't forbid redundant parentheses")
+def test_redundant_parentheses(test_table_s):
+    # Putting one set of unnecessary parentheses is fine - e.g., "(p<>p)"
+    # works just as well as "p<>p" - it isn't considered "redundant
+    p = random_string()
+    test_table_s.update_item(Key={'p': p},
+        ConditionExpression='p <> :p',
+        ExpressionAttributeValues={':p': p})
+    assert 'Item' in test_table_s.get_item(Key={'p': p}, ConsistentRead=True)
+    p = random_string()
+    test_table_s.update_item(Key={'p': p},
+        ConditionExpression='(p <> :p)',
+        ExpressionAttributeValues={':p': p})
+    assert 'Item' in test_table_s.get_item(Key={'p': p}, ConsistentRead=True)
+    # But putting two sets of parentheses, "((p<>p))", is considered redundant.
+    # DynamoDB prints: "Invalid ConditionExpression: The expression has
+    # redundant parentheses".
+    p = random_string()
+    with pytest.raises(ClientError, match='ValidationException.*redundant parentheses'):
+        test_table_s.update_item(Key={'p': p},
+            ConditionExpression='((p <> :p))',
+            ExpressionAttributeValues={':p': p})
+    # The expression "((p<>p) and p<>p)" isn't considered to have redundant
+    # parentheses - it's just like one unnecessary parentheses which we showed
+    # above is allowed. So the parser can't just claim "redundant parentheses"
+    # when it sees two successive parentheses beginning the expression.
+    p = random_string()
+    test_table_s.update_item(Key={'p': p},
+        ConditionExpression='((p <> :p) and p <> :p)',
+        ExpressionAttributeValues={':p': p})
+    assert 'Item' in test_table_s.get_item(Key={'p': p}, ConsistentRead=True)