Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pk_indexes aren't generated for queries with duplicate named bind variables #15374

Closed
cvybhu opened this issue Sep 12, 2023 · 4 comments · Fixed by #15526
Closed

pk_indexes aren't generated for queries with duplicate named bind variables #15374

cvybhu opened this issue Sep 12, 2023 · 4 comments · Fixed by #15526

Comments

@cvybhu
Copy link
Contributor

cvybhu commented Sep 12, 2023

When the driver asks to prepare a statement, Scylla responds with a message that contains metadata about this prepared statement. pk_indexes is a field in this metadata which lets the driver know which bind variables it should use to calculate the partition token.

For example:

CREATE TABLE tks.tab (p1 int, p2 int, c int, PRIMARY KEY ((p1, p2), c));
SELECT * FROM tks.tab WHERE p1 = ? AND p2 = ? AND c = ?;

When the driver prepares this SELECT statement, Scylla responds with the following pk_indexes:

pk_indexes: [
        PartitionKeyIndex {
            index: 0,
            sequence: 0,
        },
        PartitionKeyIndex {
            index: 1,
            sequence: 1,
        },
    ],

The meaning is: "To calculate partition key token, the driver should take the bound values with indexes 0 and 1, concatenate them and then generate the hash"
This allows the driver to correctly compute the token, and send the queries to the nodes that own those token ranges.

The problem is that for some statements that use the same named variable twice, Scylla doesn't produce any pk_indexes, even though it could.

In this example:

CREATE TABLE tks.tab (p int, c int, PRIMARY KEY (p, c));
SELECT p FROM tks.tab WHERE p = :x AND c = :x;

PreparedMetadata looks like this:

PreparedMetadata {
    flags: 1,
    col_count: 1,
    pk_indexes: [],
    col_specs: [
        ColumnSpec {
            table_spec: TableSpec {
                ks_name: "tks",
                table_name: "tab",
            },
            name: "x",
            typ: Int,
        },
    ],
}

The pk_indexes field is empty, even though the driver would be able to calculate the token value using the first bind variable.

Because of this the queries will be send to a random node instead of the one that owns this token range. This isn't the end of the world, as the random node will forward the query to the correct one, but it would be more efficient to send it to the right node immediately.

Here's the relevant part in the CQL binary protocol: https://github.com/apache/cassandra/blob/1959502d8b16212479eecb076c89945c3f0f180c/doc/native_protocol_v4.spec#L699

@cvybhu
Copy link
Contributor Author

cvybhu commented Sep 12, 2023

This is most likely a regression caused by scylladb/scylladb@19a6e69

On 2021.1.16 PreparedMetadata looks like this (pk_indexes are correct):

PreparedMetadata {
    flags: 1,
    col_count: 2,
    pk_indexes: [
        PartitionKeyIndex {
            index: 0,
            sequence: 0,
        },
    ],
    col_specs: [
        ColumnSpec {
            table_spec: TableSpec {
                ks_name: "tks",
                table_name: "tab",
            },
            name: "x",
            typ: Int,
        },
        ColumnSpec {
            table_spec: TableSpec {
                ks_name: "tks",
                table_name: "tab",
            },
            name: "x",
            typ: Int,
        },
    ],
}

@cvybhu
Copy link
Contributor Author

cvybhu commented Sep 12, 2023

Here's the code I used to print the metadata: https://gist.github.com/cvybhu/23e5964a8daef57f8a5bd70651e6793d

@cvybhu cvybhu self-assigned this Sep 12, 2023
@mykaul mykaul added this to the 5.4 milestone Sep 12, 2023
cvybhu added a commit to cvybhu/scylladb that referenced this issue Sep 22, 2023
…ind variables

When presented with queries that use the same named bind variables twice,
like this one:
```cql
SELECT p FROM table WHERE p = :x AND c = :x
```

Scylla generated empty partition_key_bind_indexes (pk_indexes).
pk_indexes tell the driver which bind variables it should use to calculate the partition
token for a query. Without it, the driver is unable to determine the token and it will
send the query to a random node.

Scylla should generate pk_indexes which tell the driver that it can use bind variable
with bind_index = 0 to calculate the partition token for a query.

The problem was that _target_columns keep only a single target_column for each bind variable.
In the example above :x is compared with both p and c, but _target_columns would contain
only one of them, and Scylla wasn't able to tell that this bind variable is compared with
a partition key column.

To fix it, let's replace _target_columns with _targets. _targets keeps all comparisons
between bind variables and other expressions, so none of them will be forgotten/overwritten.

Fixes: scylladb#15374

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
cvybhu added a commit to cvybhu/scylladb that referenced this issue Sep 22, 2023
…variable name

Add a cql-pytest which reproduces issue scylladb#15374.
When presented with the following query:
```cql
SELECT p FROM table WHERE p = :x AND c = :x
```
Scylla generated empty pk_indexes, and driver wasn't able to determine the right
token for request routing.

Scylla should generate pk_indexes which tell the driver that it can use bind variable
with bind_index = 0 to calculate the partition token for a query.

I couldn't find a good place for this test, so I created a new file - test_prepare.py.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
cvybhu added a commit to cvybhu/scylladb that referenced this issue Sep 25, 2023
Add some tests that test whether `pk indexes` are generated correctly.
When a driver asks to prepare a statement, Scylla's response includes
the metadata for this prepared statement.
In this metadata there's `pk indexes`, which tells the driver which
bind variable values it should use to calculate the partition token.

For a query like:
SELECT * FROM t WHERE p2 = ? AND p1 = ? AND p3 = ?

The correct pk_indexes would be [1, 0, 2], which means
"To calculate the token calculate Hash(bind_vars[1] | bind_vars[0] | bind_vars[2])".

More information is available in the specification:
https://github.com/apache/cassandra/blob/1959502d8b16212479eecb076c89945c3f0f180c/doc/native_protocol_v4.spec#L699-L707

Two tests are marked as xfail because of scylladb#15374 - Scylla doesn't correctly handle using the same
named variable in multiple places. This will be fixed soon.

I couldn't find a good place for these tests, so I created a new file - test_prepare.py.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
cvybhu added a commit to cvybhu/scylladb that referenced this issue Sep 25, 2023
…te_named_variables

Issue scylladb#15374 has been fixed, so these tests can be enabled.
Duplicate bind variable names are now handled correctly.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
avikivity added a commit that referenced this issue Sep 26, 2023
… named bind variables' from Jan Ciołek

When presented with queries that use the same named bind variables twice, like this one:
```cql
SELECT p FROM table WHERE p = :x AND c = :x
```

Scylla generated empty `partition_key_bind_indexes` (`pk_indexes`).
`pk_indexes` tell the driver which bind variables it should use to calculate the partition token for a query. Without it, the driver is unable to determine the token and it will send the query to a random node.

Scylla should generate pk_indexes which tell the driver that it can use bind variable with `bind_index = 0` to calculate the partition token for this query.

The problem was that `_target_columns` keep only a single target_column for each bind variable.
In the example above `:x` is compared with both `p` and `c`, but `_target_columns` would contain only one of them, and Scylla wasn't able to tell that this bind variable is compared with a partition key column.

To fix it, let's replace `_target_columns` with `_targets`. `_targets` keeps all comparisons
between bind variables and other expressions, so none of them will be forgotten/overwritten.

A `cql-pytest` reproducer is added.

I also added some comments in `prepare_context.hh/cc` to make it easier to read.

Fixes: #15374

Closes #15526

* github.com:scylladb/scylladb:
  cql-pytest/test-prepare: remove xfail marker from *pk_indexes_duplicate_named_variables
  cql3/prepare_context: fix generating pk_indexes for duplicate named bind variables
  cql3: improve readability of prepare_context
  cql-pytest: test generation of pk indexes during PREPARE
@denesb
Copy link
Contributor

denesb commented Dec 18, 2023

Doesn't apply cleanly to 5.2, @cvybhu please prepare a backport PR.

@cvybhu
Copy link
Contributor Author

cvybhu commented Dec 18, 2023

Doesn't apply cleanly to 5.2, @cvybhu please prepare a backport PR.

@denesb sorry, but I don't work at Scylla anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment