[YSQL] Unexpected SELECT DISTINCT Behaviour with mixed primary key, non-primary key clauses or aggregate targets. #17648
Comments
Found another example that fails to return the correct results. With the following setup,
the query
returns no rows, whereas the query
returns 1 row corresponding to the second tuple in the original table. This query differs from the one in the issue description in that the filtering happens on the Postgres side, and not all filter clauses are pushed down to DocDB.
This behavior has a notable implication: avoiding pushing DISTINCT down past any filters (both on the DocDB side and on the Postgres side) is non-trivial. Instead, we address the bug by being very conservative about when to use the prefix logic. This is not optimal, hence #17741 was created to track improvements that increase the applicability of prefix-based skipping.
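To make the failure mode concrete, here is a minimal Python sketch (not YugabyteDB code). It assumes a hypothetical two-row table `(r1, r2, v)` with primary key `(r1, r2)`, models the prefix-based skip scan as "keep the first row per key prefix", and models a filter evaluated after the scan as a plain Python predicate. The names `skip_scan`, `distinct_pushed_past_filter` and `distinct_after_filter` are made up for illustration.

```python
from itertools import groupby

# Hypothetical table (r1, r2, v) with primary key (r1, r2).
ROWS = [(1, 1, 1), (1, 2, 2)]


def skip_scan(rows, prefix_len):
    """Return the first row for each distinct key prefix (simplified skip scan)."""
    ordered = sorted(rows)
    return [next(group) for _, group in groupby(ordered, key=lambda r: r[:prefix_len])]


def distinct_pushed_past_filter(rows, prefix_len, predicate):
    """Problematic order: de-duplicate inside the scan, then apply the filter."""
    survivors = [row for row in skip_scan(rows, prefix_len) if predicate(row)]
    return sorted({row[:prefix_len] for row in survivors})


def distinct_after_filter(rows, prefix_len, predicate):
    """Correct order: apply the filter to every row, then de-duplicate."""
    return sorted({row[:prefix_len] for row in rows if predicate(row)})


def v_equals_2(row):
    """Stand-in for a filter that is evaluated after the scan."""
    return row[2] == 2


buggy = distinct_pushed_past_filter(ROWS, 1, v_equals_2)
correct = distinct_after_filter(ROWS, 1, v_equals_2)
print(buggy)    # []
print(correct)  # [(1,)]
```

The skip scan only ever surfaces `(1, 1, 1)` for the prefix `r1 = 1`, so the row that would satisfy the filter is never seen by it.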
Found yet another interesting query that behaves unexpectedly.
Setup
Query
Returned result
Postgres result
DocDB log
Observe the length of the prefix. The prefix must encompass both columns of the row because the query involves an aggregation.
Additional Caveat
This issue manifests when we request a system column rather than a regular column, which might sound surprising. The reason is that the Postgres optimizer requests the complete tuple in situations where projecting the tuple is more expensive for the executor. This does not happen for system columns. More importantly, the optimization of returning the whole tuple may not be appropriate in our case, since we transfer tuples over the network. Hence, the optimization may be disabled in the future.
Conclusion
Ensuring that pushdown works with all aspects of SELECT DISTINCT is non-trivial because of the complexity of the Postgres grammar. Created #17801 to provide a way to disable this feature in emergency scenarios.
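As a rough illustration of why the prefix length matters once an aggregate is involved, the Python sketch below counts the rows a simplified skip scan would hand to an aggregate. It is not the query from this comment: the two-row table `(r1, r2)`, the use of a row count as the aggregate, and the helper names are all assumptions made for the example.

```python
# Hypothetical table (r1, r2) where both columns form the primary key.
ROWS = [(1, 1), (1, 2)]


def skip_scan(rows, prefix_len):
    """Keep only the first row seen for each distinct key prefix."""
    seen = set()
    result = []
    for row in sorted(rows):
        prefix = row[:prefix_len]
        if prefix not in seen:
            seen.add(prefix)
            result.append(row)
    return result


def count_rows(rows, prefix_len):
    """Stand-in for an aggregate such as a row count over the scan output."""
    return len(skip_scan(rows, prefix_len))


print(count_rows(ROWS, 1))  # 1: rows sharing r1 were skipped before aggregation
print(count_rows(ROWS, 2))  # 2: the prefix covers the full key, nothing is skipped
```

With a prefix shorter than the full key, the aggregate sees fewer rows than the table holds, which is why the prefix has to cover both columns here.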
…nditions

Summary:
We fail to retrieve certain rows with SELECT DISTINCT because we **push** the **uniqueness** constraint past some **filters** or some **aggregation** targets.

**Context**

The change https://phorge.dev.yugabyte.com/D20742 introduces pushdown logic for the SELECT DISTINCT clause. With pushdown, we ask the DocDB layer not to return duplicate values. The DocDB layer relies on the uniqueness of the primary key to avoid any explicit de-duplication. Moreover, whenever the requested columns are a subset of a prefix of the primary key, an additional optimization applies: the scan skips over rows as long as they do not change that prefix.

//Example Table://

| r1 | r2 | v |
| 1 | 1 | 1 |
| 1 | 2 | 2 |

Here, `r1` and `r2` together constitute the primary key; `v` is not part of the primary key. If we request only the first column in a SELECT DISTINCT clause, the DocDB layer does not seek to the second row, since it is a duplicate as far as the key `r1` is concerned.

**Issue**

We cannot apply the uniqueness constraint (DISTINCTness) until all the filters have been applied. In the above example, returning only the first row leads to an incorrect result when there is an additional filter on the third column `v`, such as `v = 2`: the scan logic in DocDB returns only the first row, while it is in fact the second row that satisfies the condition. Additionally, predicates may be of the form `r1 < now()`, which we are unable to push down to the DocDB layer, or of the form `r1 * r2 >= 0`, which cannot be pushed down as an index scan predicate.

**Fix**

Do not set a prefix whenever the query references non-primary-key columns. Also do not set a prefix whenever there are aggregate functions as targets, since an incorrect prefix there might produce inaccurate results.

**Applicability**

Prefix pushdown optimization is in effect when we have
- An Index Scan
- No additional filters on top of those required by the Index Scan
- No aggregate functions as targets

Jira: DB-6769

Test Plan: Jenkins
```
ybd --java-test org.yb.pgsql.TestPgSelect#testDistinctRemoteFilter
ybd --java-test org.yb.pgsql.TestPgSelect#testDistinctLocalFilter
ybd --java-test org.yb.pgsql.TestPgSelect#testDistinctAgg
```

Reviewers: tnayak, mihnea

Reviewed By: tnayak

Differential Revision: https://phorge.dev.yugabyte.com/D25980
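The applicability rules in this commit message can be sketched as a small Python predicate. This is only an illustration of the stated conditions, not the actual planner code; the function names, the boolean flags, and the way columns are passed in are all hypothetical.

```python
def distinct_prefix_length(key_columns, requested_columns):
    """Shortest primary-key prefix covering every requested column.

    Returns None if any requested column lies outside the primary key,
    or if no columns are requested.
    """
    positions = []
    for column in requested_columns:
        if column not in key_columns:
            return None  # reference to a non-primary-key column: no prefix
        positions.append(key_columns.index(column))
    return max(positions) + 1 if positions else None


def usable_prefix(key_columns, requested_columns, is_index_scan,
                  has_extra_filters, has_aggregate_targets):
    """Apply the three applicability conditions before choosing a prefix."""
    if not is_index_scan or has_extra_filters or has_aggregate_targets:
        return None
    return distinct_prefix_length(key_columns, requested_columns)


KEY = ["r1", "r2"]
print(usable_prefix(KEY, ["r1"], True, False, False))  # 1
print(usable_prefix(KEY, ["r1"], True, True, False))   # None
print(usable_prefix(KEY, ["r1"], True, False, True))   # None
```

Returning `None` here stands for "do not set a prefix", in which case de-duplication is left to the upper layer.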
…th non index conditions

Summary:
Original commit: db3d7f4 / D25980

We fail to retrieve certain rows with SELECT DISTINCT because we **push** the **uniqueness** constraint past some **filters** or some **aggregation** targets.

**Context**

The change https://phorge.dev.yugabyte.com/D20742 introduces pushdown logic for the SELECT DISTINCT clause. With pushdown, we ask the DocDB layer not to return duplicate values. The DocDB layer relies on the uniqueness of the primary key to avoid any explicit de-duplication. Moreover, whenever the requested columns are a subset of a prefix of the primary key, an additional optimization applies: the scan skips over rows as long as they do not change that prefix.

//Example Table://

| r1 | r2 | v |
| 1 | 1 | 1 |
| 1 | 2 | 2 |

Here, `r1` and `r2` together constitute the primary key; `v` is not part of the primary key. If we request only the first column in a SELECT DISTINCT clause, the DocDB layer does not seek to the second row, since it is a duplicate as far as the key `r1` is concerned.

**Issue**

We cannot apply the uniqueness constraint (DISTINCTness) until all the filters have been applied. In the above example, returning only the first row leads to an incorrect result when there is an additional filter on the third column `v`, such as `v = 2`: the scan logic in DocDB returns only the first row, while it is in fact the second row that satisfies the condition. Additionally, predicates may be of the form `r1 < now()`, which we are unable to push down to the DocDB layer, or of the form `r1 * r2 >= 0`, which cannot be pushed down as an index scan predicate.

**Fix**

Do not set a prefix whenever the query references non-primary-key columns. Also do not set a prefix whenever there are aggregate functions as targets, since an incorrect prefix there might produce inaccurate results.

**Applicability**

Prefix pushdown optimization is in effect when we have
- An Index Scan
- No additional filters on top of those required by the Index Scan
- No aggregate functions as targets

Jira: DB-6769

Test Plan: Jenkins
```
ybd --java-test org.yb.pgsql.TestPgSelect#testDistinctRemoteFilter
ybd --java-test org.yb.pgsql.TestPgSelect#testDistinctLocalFilter
ybd --java-test org.yb.pgsql.TestPgSelect#testDistinctAgg
```

Reviewers: tnayak, mihnea

Reviewed By: tnayak

Differential Revision: https://phorge.dev.yugabyte.com/D26351
Summary:
Add a GUC option named `yb_enable_distinct_pushdown`. This option controls whether or not the pushdown feature for SELECT DISTINCT is in effect.

**Context**

The issue #17648 documents several problems found with the current implementation of SELECT DISTINCT. The source of all of them can be traced back to pushing the DISTINCT operator down too deep, past operations such as filters and aggregations. This leads to correctness issues, so we want to give users the flexibility to toggle the pushdown feature, both to mitigate issues with the current implementation and in anticipation of issues arising from future changes to the pushdown feature.

Jira: DB-6890

Test Plan:
```
./yb_build.sh --java-test TestPgRegressDistinctPushdown
```

Reviewers: tnayak

Reviewed By: tnayak

Subscribers: jason

Differential Revision: https://phorge.dev.yugabyte.com/D26268
Summary:
Original commit: 5104be4 / D26268

Add a GUC option named `yb_enable_distinct_pushdown`. This option controls whether or not the pushdown feature for SELECT DISTINCT is in effect.

**Context**

The issue #17648 documents several problems found with the current implementation of SELECT DISTINCT. The source of all of them can be traced back to pushing the DISTINCT operator down too deep, past operations such as filters and aggregations. This leads to correctness issues, so we want to give users the flexibility to toggle the pushdown feature, both to mitigate issues with the current implementation and in anticipation of issues arising from future changes to the pushdown feature.

Jira: DB-6890

Test Plan:
```
./yb_build.sh --java-test TestPgRegressDistinctPushdown
```

Reviewers: tnayak

Reviewed By: tnayak

Subscribers: jason

Differential Revision: https://phorge.dev.yugabyte.com/D26589
Jira Link: DB-6769
Description
We fail to retrieve certain rows with SELECT DISTINCT because we push the uniqueness constraint to lower layers without also pushing the required filters or aggregation functions past the de-duplication operator. Below we present one such example. See the comments for more such examples.
Example
Setup
Distinct Query
The expected behavior is to return one row with the value 1, but the current behavior does not yield any rows.
To elaborate on the issue, let's consider two more scenarios.
No clause on r1
This query returns the expected row since we do not use scan choices in this scenario.
No clause on s
This query returns the expected row since the scan choices module is aware of all the bounds.
However, scan choices is not aware of WHERE clauses on non-primary-key columns. Hence, when both clauses are combined, the query behaves unexpectedly.
We just discussed an example involving a filter pushed down to the DocDB layer. There are more examples in comments where further processing on the postgres side makes it inaccurate to push down the uniqueness constraint to DocDB without also pushing down the surrounding constraints/functions.