Skip to content

Commit

Permalink
alternator: limit expression length and recursion depth
Browse files Browse the repository at this point in the history
DynamoDB limits of all expressions (ConditionExpression, UpdateExpression,
ProjectionExpression, FilterExpression, KeyConditionExpression) to just
4096 bytes. Until now, Alternator did not enforce this limit, and we had
an xfailing test showing this.

But it turns out that not enforcing this limit can be dangerous: The user
can pass arbitrarily-long and arbitrarily nested expressions, such as:

    a<b and (a<b and (a<b and (a<b and (a<b and (a<b and (...))))))

or
    (((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((

and those can cause recursive algorithms in Alternator's parser and
later when applying expressions to recurse very deeply, overflow the
stack, and crash.

This patch includes new tests that demonstrate how Scylla crashes during
parsing before enforcing the 4096-byte length limit on expressions.
The patch then enforces this length limit, and these tests stop crashing.
We also verify that deeply-nested expressions shorter than the 4096-byte
limit are apparently short enough for our recursion ability, and work
as expected.

Unforuntately, running these tests many times showed that the 4096-byte
limit is not low enough to avoid all crashes so this patch needs to do
more:

The parsers created by ANTLR are recursive, and there is no way to limit
the depth of their recursion (i.e., nothing like YACC's YYMAXDEPTH).
Very deep recursion can overflow the stack and crash Scylla. After we
limited the length of expression strings to 4096 bytes this was *almost*
enough to prevent stack overflows. But unfortunetely the tests revealed
that even limited to 4096 bytes, the expression can sometimes recurse
too deeply: Consider the expression "((((((....((((" with 4000 parentheses.
To realize this is a syntax error, the parser needs to do a recursive
call 4000 times. Or worse - because of other Antlr limitations (see rants
in comments in expressions.g) it's actually 12000 recursive calls, and
each of these calls have a pretty large frame. In some cases, this
overflows the stack.

The solution used in this patch is not pretty, but works. We add to rules
in alternator/expressions.g that recurse (there are two of those - "value"
and "boolean_expression") an integer "depth" parameter, which we increase
when the rule recurses. Moreover, we add a so-called predicate
"{depth<MAX_DEPTH}?" that stops the parsing when this limit is reached.
When the parsing is stopped, the user will see a special kind of parse
error, saying "expression nested too deeply".

With this last modification to expressions.g, the tests for deeply-nested but
still-below-4096-bytes expressions
(test_limits.py::test_deeply_nested_expression_*) would not fail sporadically
as they did without it.

While adding the "expression nested too deeply" case, I also made the
general syntax-error reporting in Alternator nicer: It no longer prints
the internal "expression_syntax_error" type name (an exception type will
only be printed if some sort of unexpected exception happens), and it
prints the character position where the syntax error (or too deep
nested expression) was recognized.

Fixes #14473

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14477
  • Loading branch information
nyh authored and denesb committed Jul 31, 2023
1 parent a637ddd commit 04e5082
Show file tree
Hide file tree
Showing 6 changed files with 355 additions and 47 deletions.
6 changes: 3 additions & 3 deletions alternator/executor.cc
Expand Up @@ -1648,7 +1648,7 @@ static parsed::condition_expression get_parsed_condition_expression(rjson::value
throw api_error::validation("ConditionExpression must not be empty");
}
try {
return parse_condition_expression(rjson::to_string_view(*condition_expression));
return parse_condition_expression(rjson::to_string_view(*condition_expression), "ConditionExpression");
} catch(expressions_syntax_error& e) {
throw api_error::validation(e.what());
}
Expand Down Expand Up @@ -3450,7 +3450,7 @@ filter::filter(const rjson::value& request, request_type rt,
throw api_error::validation("Cannot use both old-style and new-style parameters in same request: FilterExpression and AttributesToGet");
}
try {
auto parsed = parse_condition_expression(rjson::to_string_view(*expression));
auto parsed = parse_condition_expression(rjson::to_string_view(*expression), "FilterExpression");
const rjson::value* expression_attribute_names = rjson::find(request, "ExpressionAttributeNames");
const rjson::value* expression_attribute_values = rjson::find(request, "ExpressionAttributeValues");
resolve_condition_expression(parsed,
Expand Down Expand Up @@ -4078,7 +4078,7 @@ calculate_bounds_condition_expression(schema_ptr schema,
// sort-key range.
parsed::condition_expression p;
try {
p = parse_condition_expression(rjson::to_string_view(expression));
p = parse_condition_expression(rjson::to_string_view(expression), "KeyConditionExpression");
} catch(expressions_syntax_error& e) {
throw api_error::validation(e.what());
}
Expand Down
42 changes: 26 additions & 16 deletions alternator/expressions.cc
Expand Up @@ -29,7 +29,7 @@
namespace alternator {

template <typename Func, typename Result = std::result_of_t<Func(expressionsParser&)>>
Result do_with_parser(std::string_view input, Func&& f) {
static Result do_with_parser(std::string_view input, Func&& f) {
expressionsLexer::InputStreamType input_stream{
reinterpret_cast<const ANTLR_UINT8*>(input.data()),
ANTLR_ENC_UTF8,
Expand All @@ -43,31 +43,41 @@ Result do_with_parser(std::string_view input, Func&& f) {
return result;
}

parsed::update_expression
parse_update_expression(std::string_view query) {
template <typename Func, typename Result = std::result_of_t<Func(expressionsParser&)>>
static Result parse(const char* input_name, std::string_view input, Func&& f) {
if (input.length() > 4096) {
throw expressions_syntax_error(format("{} expression size {} exceeds allowed maximum 4096.",
input_name, input.length()));
}
try {
return do_with_parser(query, std::mem_fn(&expressionsParser::update_expression));
return do_with_parser(input, f);
} catch (expressions_syntax_error& e) {
// If already an expressions_syntax_error, don't print the type's
// name (it's just ugly), just the message.
// TODO: displayRecognitionError could set a position inside the
// expressions_syntax_error in throws, and we could use it here to
// mark the broken position in 'input'.
throw expressions_syntax_error(format("Failed parsing {} '{}': {}",
input_name, input, e.what()));
} catch (...) {
throw expressions_syntax_error(format("Failed parsing UpdateExpression '{}': {}", query, std::current_exception()));
throw expressions_syntax_error(format("Failed parsing {} '{}': {}",
input_name, input, std::current_exception()));
}
}

parsed::update_expression
parse_update_expression(std::string_view query) {
return parse("UpdateExpression", query, std::mem_fn(&expressionsParser::update_expression));
}

std::vector<parsed::path>
parse_projection_expression(std::string_view query) {
try {
return do_with_parser(query, std::mem_fn(&expressionsParser::projection_expression));
} catch (...) {
throw expressions_syntax_error(format("Failed parsing ProjectionExpression '{}': {}", query, std::current_exception()));
}
return parse ("ProjectionExpression", query, std::mem_fn(&expressionsParser::projection_expression));
}

parsed::condition_expression
parse_condition_expression(std::string_view query) {
try {
return do_with_parser(query, std::mem_fn(&expressionsParser::condition_expression));
} catch (...) {
throw expressions_syntax_error(format("Failed parsing ConditionExpression '{}': {}", query, std::current_exception()));
}
parse_condition_expression(std::string_view query, const char* caller) {
return parse(caller, query, std::mem_fn(&expressionsParser::condition_expression));
}

namespace parsed {
Expand Down
82 changes: 58 additions & 24 deletions alternator/expressions.g
Expand Up @@ -74,7 +74,22 @@ options {
*/
@parser::context {
void displayRecognitionError(ANTLR_UINT8** token_names, ExceptionBaseType* ex) {
throw expressions_syntax_error("syntax error");
const char* err;
switch (ex->getType()) {
case antlr3::ExceptionType::FAILED_PREDICATE_EXCEPTION:
err = "expression nested too deeply";
break;
default:
err = "syntax error";
break;
}
// Alternator expressions are always single line so ex->get_line()
// is always 1, no sense to print it.
// TODO: return the position as part of the exception, so the
// caller in expressions.cc that knows the expression string can
// mark the error position in the final error message.
throw expressions_syntax_error(format("{} at char {}", err,
ex->get_charPositionInLine()));
}
}
@lexer::context {
Expand All @@ -83,6 +98,23 @@ options {
}
}
/* Unfortunately, ANTLR uses recursion - not the heap - to parse recursive
* expressions. To make things even worse, ANTLR has no way to limit the
* depth of this recursion (unlike Yacc which has YYMAXDEPTH). So deeply-
* nested expression like "(((((((((((((..." can easily crash Scylla on a
* stack overflow (see issue #14477).
*
* We are lucky that in the grammar for DynamoDB expressions (below),
* only a few specific rules can recurse, so it was fairly easy to add a
* "depth" counter to a few specific rules, and then use a predicate
* "{depth<MAX_DEPTH}?" to avoid parsing if the depth exceeds this limit,
* and throw a FAILED_PREDICATE_EXCEPTION in that case, which we will
* report to the user as a "expression nested too deeply" error.
*/
@parser::members {
static constexpr int MAX_DEPTH = 400;
}
/*
* Lexical analysis phase, i.e., splitting the input up to tokens.
* Lexical analyzer rules have names starting in capital letters.
Expand Down Expand Up @@ -155,19 +187,20 @@ path returns [parsed::path p]:
| '[' INTEGER ']' { $p.add_index(std::stoi($INTEGER.text)); }
)*;

value returns [parsed::value v]:
/* See comment above why the "depth" counter was needed here */
value[int depth] returns [parsed::value v]:
VALREF { $v.set_valref($VALREF.text); }
| path { $v.set_path($path.p); }
| NAME { $v.set_func_name($NAME.text); }
'(' x=value { $v.add_func_parameter($x.v); }
(',' x=value { $v.add_func_parameter($x.v); })*
| {depth<MAX_DEPTH}? NAME { $v.set_func_name($NAME.text); }
'(' x=value[depth+1] { $v.add_func_parameter($x.v); }
(',' x=value[depth+1] { $v.add_func_parameter($x.v); })*
')'
;

update_expression_set_rhs returns [parsed::set_rhs rhs]:
v=value { $rhs.set_value(std::move($v.v)); }
( '+' v=value { $rhs.set_plus(std::move($v.v)); }
| '-' v=value { $rhs.set_minus(std::move($v.v)); }
v=value[0] { $rhs.set_value(std::move($v.v)); }
( '+' v=value[0] { $rhs.set_plus(std::move($v.v)); }
| '-' v=value[0] { $rhs.set_minus(std::move($v.v)); }
)?
;

Expand Down Expand Up @@ -205,7 +238,7 @@ projection_expression returns [std::vector<parsed::path> v]:


primitive_condition returns [parsed::primitive_condition c]:
v=value { $c.add_value(std::move($v.v));
v=value[0] { $c.add_value(std::move($v.v));
$c.set_operator(parsed::primitive_condition::type::VALUE); }
( ( '=' { $c.set_operator(parsed::primitive_condition::type::EQ); }
| '<' '>' { $c.set_operator(parsed::primitive_condition::type::NE); }
Expand All @@ -214,14 +247,14 @@ primitive_condition returns [parsed::primitive_condition c]:
| '>' { $c.set_operator(parsed::primitive_condition::type::GT); }
| '>' '=' { $c.set_operator(parsed::primitive_condition::type::GE); }
)
v=value { $c.add_value(std::move($v.v)); }
v=value[0] { $c.add_value(std::move($v.v)); }
| BETWEEN { $c.set_operator(parsed::primitive_condition::type::BETWEEN); }
v=value { $c.add_value(std::move($v.v)); }
v=value[0] { $c.add_value(std::move($v.v)); }
AND
v=value { $c.add_value(std::move($v.v)); }
v=value[0] { $c.add_value(std::move($v.v)); }
| IN '(' { $c.set_operator(parsed::primitive_condition::type::IN); }
v=value { $c.add_value(std::move($v.v)); }
(',' v=value { $c.add_value(std::move($v.v)); })*
v=value[0] { $c.add_value(std::move($v.v)); }
(',' v=value[0] { $c.add_value(std::move($v.v)); })*
')'
)?
;
Expand All @@ -231,19 +264,20 @@ primitive_condition returns [parsed::primitive_condition c]:
// common rule prefixes, and (lack of) support for operator precedence.
// These rules could have been written more clearly using a more powerful
// parser generator - such as Yacc.
boolean_expression returns [parsed::condition_expression e]:
b=boolean_expression_1 { $e.append(std::move($b.e), '|'); }
(OR b=boolean_expression_1 { $e.append(std::move($b.e), '|'); } )*
// See comment above why the "depth" counter was needed here.
boolean_expression[int depth] returns [parsed::condition_expression e]:
b=boolean_expression_1[depth] { $e.append(std::move($b.e), '|'); }
(OR b=boolean_expression_1[depth] { $e.append(std::move($b.e), '|'); } )*
;
boolean_expression_1 returns [parsed::condition_expression e]:
b=boolean_expression_2 { $e.append(std::move($b.e), '&'); }
(AND b=boolean_expression_2 { $e.append(std::move($b.e), '&'); } )*
boolean_expression_1[int depth] returns [parsed::condition_expression e]:
b=boolean_expression_2[depth] { $e.append(std::move($b.e), '&'); }
(AND b=boolean_expression_2[depth] { $e.append(std::move($b.e), '&'); } )*
;
boolean_expression_2 returns [parsed::condition_expression e]:
boolean_expression_2[int depth] returns [parsed::condition_expression e]:
p=primitive_condition { $e.set_primitive(std::move($p.c)); }
| NOT b=boolean_expression_2 { $e = std::move($b.e); $e.apply_not(); }
| '(' b=boolean_expression ')' { $e = std::move($b.e); }
| {depth<MAX_DEPTH}? NOT b=boolean_expression_2[depth+1] { $e = std::move($b.e); $e.apply_not(); }
| {depth<MAX_DEPTH}? '(' b=boolean_expression[depth+1] ')' { $e = std::move($b.e); }
;
condition_expression returns [parsed::condition_expression e]:
boolean_expression { e=std::move($boolean_expression.e); } EOF;
boolean_expression[0] { e=std::move($boolean_expression.e); } EOF;
2 changes: 1 addition & 1 deletion alternator/expressions.hh
Expand Up @@ -28,7 +28,7 @@ public:

parsed::update_expression parse_update_expression(std::string_view query);
std::vector<parsed::path> parse_projection_expression(std::string_view query);
parsed::condition_expression parse_condition_expression(std::string_view query);
parsed::condition_expression parse_condition_expression(std::string_view query, const char* caller);

void resolve_update_expression(parsed::update_expression& ue,
const rjson::value* expression_attribute_names,
Expand Down
36 changes: 36 additions & 0 deletions test/alternator/test_condition_expression.py
Expand Up @@ -1888,3 +1888,39 @@ def test_update_item_condition_key_attribute_not_exists(test_table_s):
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
ConditionExpression='attribute_not_exists(p)')

# DynamoDB considers duplicate parentheses in expressions, which it calls
# "redundant parentheses", to be illegal. Outlawing them is also useful to
# avoid very deep recursion in the parser (see test_limits.py).
# Let's test here what is considered redendant parentheses, and what isn't.
@pytest.mark.xfail(reason="Alternator doesn't forbid redundant parentheses")
def test_redundant_parentheses(test_table_s):
# Putting one set of unnecessary parentheses is fine - e.g., "(p<>p)"
# works just as well as "p<>p" - it isn't considered "redundant
p = random_string()
test_table_s.update_item(Key={'p': p},
ConditionExpression='p <> :p',
ExpressionAttributeValues={':p': p})
assert 'Item' in test_table_s.get_item(Key={'p': p}, ConsistentRead=True)
p = random_string()
test_table_s.update_item(Key={'p': p},
ConditionExpression='(p <> :p)',
ExpressionAttributeValues={':p': p})
assert 'Item' in test_table_s.get_item(Key={'p': p}, ConsistentRead=True)
# But putting two sets of parentheses, "((p<>p))", is considered redundant.
# DynamoDB prints: "Invalid ConditionExpression: The expression has
# redundant parentheses".
p = random_string()
with pytest.raises(ClientError, match='ValidationException.*redundant parentheses'):
test_table_s.update_item(Key={'p': p},
ConditionExpression='((p <> :p))',
ExpressionAttributeValues={':p': p})
# The expression "((p<>p) and p<>p)" isn't considered to have redundant
# parentheses - it's just like one unnecessary parentheses which we showed
# above is allowed. So the parser can't just claim "redundant parentheses"
# when it sees two successive parentheses beginning the expression.
p = random_string()
test_table_s.update_item(Key={'p': p},
ConditionExpression='((p <> :p) and p <> :p)',
ExpressionAttributeValues={':p': p})
assert 'Item' in test_table_s.get_item(Key={'p': p}, ConsistentRead=True)

0 comments on commit 04e5082

Please sign in to comment.