Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternator: evaluate the need to validate table name on each request #12538

Closed
nuivall opened this issue Jan 16, 2023 · 12 comments
Closed

Alternator: evaluate the need to validate table name on each request #12538

nuivall opened this issue Jan 16, 2023 · 12 comments
Labels
area/alternator Alternator related Issues feature/enhancement good first issue symptom/performance Issues causing performance problems
Milestone

Comments

@nuivall
Copy link
Member

nuivall commented Jan 16, 2023

Currently we call static void validate_table_name(const std::string& name) on each request, in theory we should only check if table name is correct when creating it and after that don't need to validate it in executor::find_table calls.

This issue was brough up in https://github.com/scylladb/scylladb/pull/12445/files/0d89b21a1582e2da58083a8eccb2c25a37637ed2#r1062322506

@nuivall nuivall added the area/alternator Alternator related Issues label Jan 16, 2023
@mykaul mykaul added feature/enhancement symptom/performance Issues causing performance problems labels Jan 18, 2023
@nyh
Copy link
Contributor

nyh commented Jan 19, 2023

The link above doesn't work for me, so here is one that does seem to work (I don't know if it's the same thing you wanted to link to): #12445 (comment)

@nuivall
Copy link
Member Author

nuivall commented Jan 19, 2023

Yes, looks like links to comments in diff view don't persist well

@nyh
Copy link
Contributor

nyh commented Jan 22, 2023

Unfortunately, it turns out that DynamoDB does check the validity the given table name in all requests, not just CreateTable. For example, consider the following test for PutItem:

def test_item_operations_improper_named_table(dynamodb):
    with pytest.raises(ClientError, match='ValidationException'):
        dynamodb.meta.client.put_item(TableName='non!existent!table',
            Item={'a':{'S':'b'}})
    with pytest.raises(ClientError, match='ValidationException'):
        dynamodb.meta.client.put_item(TableName='xx',
            Item={'a':{'S':'b'}})

This test attempts a PutItem operation with the table names non!existent!table and xx, both forbidden by the naming rules. Both DynamoDB and Alternator respond with a ValidationException - that the name is invalid, and not ResourceNotFoundException as happens when the table "just" doesn't exist.

It is arguable whether Alternator needs to be 100% compatible with the error handling of DynamoDB in esoteric error cases (this has no effect on successful requests). So if this extra check causes a significant slowdown of requests, by all means we should consider dropping the extra check. But if the slowdown caused by the extra check is tiny, maybe it's better not to break even this minor error-case compatibility with Cassandra, and leave the extra checking as is?

@nuivall do you have an estimate how much of the request time in the worst case (e.g., small read request from cache) can be attributed to this extra check?

nyh added a commit to nyh/scylla that referenced this issue Jan 22, 2023
Issue scylladb#12538 suggested that maybe Alternator shouldn't bother reporting an
invalid table name in item operations like PutItem, and that it's enough
to report that the table doesn't exist. But the test added in this patch
shows that DynamoDB, like Alternator, reports the invalid table name in
this case - not just that the table doesn't exist.

That should make us think twice before acting on issue scylladb#12538. If we do
what this issue recommended, this test will need to be fixed (e.g., to
accept as correct both types of errors).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
@nyh
Copy link
Contributor

nyh commented Jan 23, 2023

In #12608 (comment) @nuivall had a good idea: we can do the table-name validation only after (and if) we realize it doesn't exist, so in the usual case where the table exists, it isn't done. If we do that, we remain error-compatible with DynamoDB, but don't lose any performance (not even a tiny bit) in the usual successful case.

Basically, we can remove the validate_table_name() call from find_table_name(), and move it into its callers - create_table_on_shard0() will call validate_table_name() after get_table_name(), and other callers use find_table() or get_table() and those should call validate_table_name() (we need to check we covered all the calls, though...).

@nuivall
Copy link
Member Author

nuivall commented Jan 23, 2023

Currently validate_table_name disappears from the profile, meaning probably that it's ~100 instructions or less (so ~0% of req time). Whole alternator::get_table_or_view is 5% of alternator::executor::query (and mostly due to needed work it seems and couple less needed allocations).

This is nothing compared to calculate_bounds_condition_expression which takes 27% of alternator::executor::query time. And to my current understanding especially antlr3 part is "just" parsing dozen of characters and needs 7k instructions to do that.

@avikivity
Copy link
Member

Let's tackle the big ones first, then move back to this.

I'm hopeful caching the query parse result will yield large improvements.

@avikivity
Copy link
Member

btw, I see alternator grew its own expression language (primitive_condition/condition_expression).

We should consider migrating to cql3 expressions (perhaps moved out to a top-level module), so that any improvements there (say, JIT compilation) are carried over.

@nyh
Copy link
Contributor

nyh commented Jan 24, 2023

Let's tackle the big ones first, then move back to this.

Yes, this is always a good approach in optimization, but sometimes if you already know how to easily shave off something small, there is no reason not to do it too. Of course it shouldn't have high priority.

I'm hopeful caching the query parse result will yield large improvements.

Indeed. This is is #5023.
It will definitely help with "new-style" DynamoDB requests which use one or more expressions in every request, and normally most of them will reuse the same expression.
Separately it will be interesting to understand why the expression parsing is slow. It shouldn't have been - the whole idea of parser generators is (or at least used to be...) to generate optimal code. Maybe it is something silly like we had in the http parser, where somehow the way we used ANTLR (or ANTLR itself) causes us to do a lot of work per byte or something.

@nyh
Copy link
Contributor

nyh commented Jan 24, 2023

btw, I see alternator grew its own expression language (primitive_condition/condition_expression).

Right. We had this before CQL had its ;-)

We should consider migrating to cql3 expressions (perhaps moved out to a top-level module), so that any improvements there (say, JIT compilation) are carried over.

This is easier said than done. You already saw how many changes you needed to do just to change the CQL expressions to support two variants of "= NULL". The DynamoDB expressions are different in almost every respect from CQL expressions. The terminals (constants, columns and nested pieces of these columns) and their types are different, the error handling is different, the "bound variable" support is different. See alternator/expression_types.hh for a commented introduction to the parsed expression tree that Alternator uses.

Maybe it's something we can and should consider in the future, but I don't think it's urgent. One downside of Alternator's different expressions is that today we can only use these expressions (e.g., filtering) on the coordinator - we can't push these expressions to replicas, and so on. Today it's not a big problem because CQL doesn't do this either, but if one day it would - it would be good to have just one expression class, not two.

denesb pushed a commit that referenced this issue Jan 24, 2023
Issue #12538 suggested that maybe Alternator shouldn't bother reporting an
invalid table name in item operations like PutItem, and that it's enough
to report that the table doesn't exist. But the test added in this patch
shows that DynamoDB, like Alternator, reports the invalid table name in
this case - not just that the table doesn't exist.

That should make us think twice before acting on issue #12538. If we do
what this issue recommended, this test will need to be fixed (e.g., to
accept as correct both types of errors).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12608
@DoronArazii DoronArazii added this to the 5.x milestone Jan 24, 2023
@mykaul mykaul modified the milestones: 5.x, 5.3 Feb 12, 2023
@harsh020
Copy link
Contributor

harsh020 commented Apr 30, 2023

Hi @nuivall @nyh @avikivity, I am a new contributor to scylladb and would love to work on this.

The summary that I gathered from all the above comments is:

  1. Dynamodb validates table name for all request, so does alternator
  2. To optimise this we would want to only validate table name on
    - create table request
    - other request iff the table does not exists (so as to keep compatibility with the error that we raise is same as that of Dynamodb)

So, the changes need to be:

  1. validate table before creating
  2. in other request intercept any ResourceNotFoundException or anything that suggests table does not exists and validate the table name. If the table name is invalid, raise ValidationError, else re-raise the catch exception.

That's an high level implementation that I understood. Please let me know if I am missing something. Meanwhile, I will dig into code for more details!

@nuivall
Copy link
Member Author

nuivall commented May 15, 2023

@harsh020 yes, your summary seems correct. Exception is in most (all?) places called no_such_column_family.

@harsh020
Copy link
Contributor

@nuivall, Thank you for the info! Please let me have a dig at the code and get back to you!

harsh020 pushed a commit to harsh020/scylladb that referenced this issue May 19, 2023
Prior to this `table_name` was validated for every request in `find_table_name` leading to unnecessary overhead (although small, but unnecessary). Now, the `table_name` is only validated while creation reqeust and in other requests iff the table does not exist (to keep compatibility with DynamoDB's exception).

Fixes: scylladb#12538
harsh020 added a commit to harsh020/scylladb that referenced this issue May 23, 2023
Prior to this `table_name` was validated for every request in `find_table_name` leading to unnecessary overhead (although small, but unnecessary). Now, the `table_name` is only validated while creation reqeust and in other requests iff the table does not exist (to keep compatibility with DynamoDB's exception).

Fixes: scylladb#12538
harsh020 added a commit to harsh020/scylladb that referenced this issue May 23, 2023
Prior to this `table_name` was validated for every request in `find_table_name` leading to unnecessary overhead (although small, but unnecessary). Now, the `table_name` is only validated while creation reqeust and in other requests iff the table does not exist (to keep compatibility with DynamoDB's exception).

Fixes: scylladb#12538
@DoronArazii DoronArazii modified the milestones: 5.3, 5.4 May 24, 2023
harsh020 added a commit to harsh020/scylladb that referenced this issue Jun 4, 2023
Prior to this `table_name` was validated for every request in `find_table_name` leading to unnecessary overhead (although small, but unnecessary). Now, the `table_name` is only validated while creation reqeust and in other requests iff the table does not exist (to keep compatibility with DynamoDB's exception).

Fixes: scylladb#12538
harsh020 added a commit to harsh020/scylladb that referenced this issue Jun 9, 2023
Prior to this `table_name` was validated for every request in `find_table_name` leading to unnecessary overhead (although small, but unnecessary). Now, the `table_name` is only validated while creation reqeust and in other requests iff the table does not exist (to keep compatibility with DynamoDB's exception).

Fixes: scylladb#12538
harsh020 added a commit to harsh020/scylladb that referenced this issue Jun 9, 2023
Prior to this `table_name` was validated for every request in `find_table_name` leading to unnecessary overhead (although small, but unnecessary). Now, the `table_name` is only validated while creation reqeust and in other requests iff the table does not exist (to keep compatibility with DynamoDB's exception).

Fixes: scylladb#12538
MeepoYang pushed a commit to storage-scsg/scylladb that referenced this issue Apr 28, 2024
Issue scylladb#12538 suggested that maybe Alternator shouldn't bother reporting an
invalid table name in item operations like PutItem, and that it's enough
to report that the table doesn't exist. But the test added in this patch
shows that DynamoDB, like Alternator, reports the invalid table name in
this case - not just that the table doesn't exist.

That should make us think twice before acting on issue scylladb#12538. If we do
what this issue recommended, this test will need to be fixed (e.g., to
accept as correct both types of errors).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb#12608
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/alternator Alternator related Issues feature/enhancement good first issue symptom/performance Issues causing performance problems
Projects
None yet
Development

No branches or pull requests

6 participants