-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YSQL] flaky test org.yb.pgsql.TestPgRegressMisc#testPgRegressMiscSerial3 #21663
Comments
This test failure shows a deadlock as described below: In native PG:
In ysql
When creating gin_like_test_idx, because gin_test has two indexes, we also
As a result of the error, we did not execute the step: As a result, the docdb has already created the table and the index. Then at verification phase, we
The dead lock is here: These two operations are done asynchronously. So the dead lock can happen when This is an existing deadlock because of the work flow of rolling back index and the table are two separate operations. I think the commit below changed how the verification tasks are scheduled so the timing aspect has changed, causing the deadlock easier to show up:
|
Simpler repro:
|
…master Summary: The deadlock can be shown via the following test fragment extracted from test `org.yb.pgsql.TestPgRegressMisc#testPgRegressMiscSerial3`. This test became flaky in debug build after commit 588e705 (but the deadlock issue existed before that). ``` CREATE TEMP TABLE gin_test (a int[]); CREATE INDEX ON gin_test USING GIN (a); CREATE TABLE gin_like_test (LIKE gin_test INCLUDING ALL); CREATE INDEX ON gin_test (a); ``` After the above SQL statements, we have a **temp** table gin_test as: ``` yugabyte=# \d gin_test Table "pg_temp_1.gin_test" Column | Type | Collation | Nullable | Default --------+-----------+-----------+----------+--------- a | integer[] | | | Indexes: "gin_test_a_idx" gin (a) "gin_test_a_idx1" btree (a ASC) ``` This table has two indexes: `gin_test_a_idx` and `gin_test_a_idx1` (the index names are automatically chosen by PG when not explicitly given by the user). Then we execute the next SQL statement to create a **regular** YSQL table in the same session as the temp table: ``` yugabyte=# CREATE TABLE gin_like_test_idx (LIKE gin_test INCLUDING ALL); NOTICE: index method "btree" was replaced with "lsm" in YugabyteDB ERROR: INDEX on column of type 'INT4ARRAY' not yet supported ``` We get one PG error message on the surface. However internally PG has a multi-step work flow as described below: 1. create the table gin_like_test_idx * PG creates the heap for gin_like_test_idx * YB creates the docdb table yugabyte.public.gin_like_test_idx ``` [m-1] I0322 22:17:22.206276 32153 catalog_manager.cc:3774] CreateTable from 127.125.78.0:32888: [m-1] name: "gin_like_test_idx" ``` 2. create the first index gin_like_test_idx_a_idx * PG creates the heap for gin_like_test_idx_a_idx * YB creates the docdb table yugabyte.public.gin_like_test_idx_a_idx ``` [m-1] I0322 22:17:22.516096 32189 catalog_manager.cc:3774] CreateTable from 127.125.78.0:45259: [m-1] name: "gin_like_test_idx_a_idx" ``` 3. create the second index gin_like_test_idx_a_idx1 * PG creates the heap for gin_like_test_idx_a_idx1 * It is at this step, before we execute the matching YB part to create docdb table yugabyte.public.gin_like_test_idx_a_idx1, an YB specific error is reported: ``` if (!YbDataTypeIsValidForKey(att->atttypid)) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("INDEX on column of type '%s' not yet supported", YBPgTypeOidToStr(att->atttypid)))); ``` As a result of step 3 error, the entire DDL transaction is rolled back at PG side. At yb-master we have created two background tasks (`TableSchemaVerificationTask`): * task 1: for the table `yugabyte.public.gin_like_test_idx` * task 2: for the index `yugabyte.public.gin_like_test_idx_a_idx` Both background tasks run asynchronously by polling the status of their matching PG counterparts (via `TableSchemaVerificationTask::CheckTableExists`). When seeing the PG counterparts are rolled back, they also will rollback the docdb entities. Task 1 needs to delete table yugabyte.public.gin_like_test_idx (**T**). This is a table that also has an index yugabyte.public.gin_like_test_idx_a_idx (**I**). In order to delete the table, task 1 needs to lock the table **T**, in order to delete the index associated with **T**, task 1 also needs to lock the index **I**. Task 2 needs to delete the index **I**. Index **I** belongs to table **T**. In order to delete **I**, task 2 needs to lock index **I**, in order to mark that the index **I** is deleted and therefore no longer associated with **T**, it also needs to lock **T**. In summary: Task 1 locks **T**, then locks **I** Task 2 locks **I**, then locks **T** When task 1 has locked **T** and task 2 has locked **I**, there is a deadlock. Neither can move forward to lock their second entity. The deadlock eventually manifested as a timeout error and caused the test to fail. But the test can be flaky because if task 1 or task 2 moved in a way that does not interleave as above, then there is no deadlock and the test can succeed. For example, in release build, we do not see the deadlock because in release build the test run has a different timing from debug build. To fix the dead lock, we can let both tasks to acquire locks in the same order by sorting the table_id using std::map. Some API changes are made in this change. To avoid more code changes, I limited this change to the none multistep situation in the current code. For multistep situation I don't have a deadlock case that need to be addressed. Jira: DB-10556 Test Plan: ./yb_build.sh --java-test org.yb.pgsql.TestPgRegressMisc#testPgRegressMiscSerial3 -n 10 --tp 1 Reviewers: hsunder, tverona Reviewed By: hsunder Subscribers: ybase, yql, bogdan Differential Revision: https://phorge.dev.yugabyte.com/D33528
… and index in yb-master Summary: The deadlock can be shown via the following test fragment extracted from test `org.yb.pgsql.TestPgRegressMisc#testPgRegressMiscSerial3`. This test became flaky in debug build after commit 588e705 (but the deadlock issue existed before that). ``` CREATE TEMP TABLE gin_test (a int[]); CREATE INDEX ON gin_test USING GIN (a); CREATE TABLE gin_like_test (LIKE gin_test INCLUDING ALL); CREATE INDEX ON gin_test (a); ``` After the above SQL statements, we have a **temp** table gin_test as: ``` yugabyte=# \d gin_test Table "pg_temp_1.gin_test" Column | Type | Collation | Nullable | Default --------+-----------+-----------+----------+--------- a | integer[] | | | Indexes: "gin_test_a_idx" gin (a) "gin_test_a_idx1" btree (a ASC) ``` This table has two indexes: `gin_test_a_idx` and `gin_test_a_idx1` (the index names are automatically chosen by PG when not explicitly given by the user). Then we execute the next SQL statement to create a **regular** YSQL table in the same session as the temp table: ``` yugabyte=# CREATE TABLE gin_like_test_idx (LIKE gin_test INCLUDING ALL); NOTICE: index method "btree" was replaced with "lsm" in YugabyteDB ERROR: INDEX on column of type 'INT4ARRAY' not yet supported ``` We get one PG error message on the surface. However internally PG has a multi-step work flow as described below: 1. create the table gin_like_test_idx * PG creates the heap for gin_like_test_idx * YB creates the docdb table yugabyte.public.gin_like_test_idx ``` [m-1] I0322 22:17:22.206276 32153 catalog_manager.cc:3774] CreateTable from 127.125.78.0:32888: [m-1] name: "gin_like_test_idx" ``` 2. create the first index gin_like_test_idx_a_idx * PG creates the heap for gin_like_test_idx_a_idx * YB creates the docdb table yugabyte.public.gin_like_test_idx_a_idx ``` [m-1] I0322 22:17:22.516096 32189 catalog_manager.cc:3774] CreateTable from 127.125.78.0:45259: [m-1] name: "gin_like_test_idx_a_idx" ``` 3. create the second index gin_like_test_idx_a_idx1 * PG creates the heap for gin_like_test_idx_a_idx1 * It is at this step, before we execute the matching YB part to create docdb table yugabyte.public.gin_like_test_idx_a_idx1, an YB specific error is reported: ``` if (!YbDataTypeIsValidForKey(att->atttypid)) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("INDEX on column of type '%s' not yet supported", YBPgTypeOidToStr(att->atttypid)))); ``` As a result of step 3 error, the entire DDL transaction is rolled back at PG side. At yb-master we have created two background tasks (`TableSchemaVerificationTask`): * task 1: for the table `yugabyte.public.gin_like_test_idx` * task 2: for the index `yugabyte.public.gin_like_test_idx_a_idx` Both background tasks run asynchronously by polling the status of their matching PG counterparts (via `TableSchemaVerificationTask::CheckTableExists`). When seeing the PG counterparts are rolled back, they also will rollback the docdb entities. Task 1 needs to delete table yugabyte.public.gin_like_test_idx (**T**). This is a table that also has an index yugabyte.public.gin_like_test_idx_a_idx (**I**). In order to delete the table, task 1 needs to lock the table **T**, in order to delete the index associated with **T**, task 1 also needs to lock the index **I**. Task 2 needs to delete the index **I**. Index **I** belongs to table **T**. In order to delete **I**, task 2 needs to lock index **I**, in order to mark that the index **I** is deleted and therefore no longer associated with **T**, it also needs to lock **T**. In summary: Task 1 locks **T**, then locks **I** Task 2 locks **I**, then locks **T** When task 1 has locked **T** and task 2 has locked **I**, there is a deadlock. Neither can move forward to lock their second entity. The deadlock eventually manifested as a timeout error and caused the test to fail. But the test can be flaky because if task 1 or task 2 moved in a way that does not interleave as above, then there is no deadlock and the test can succeed. For example, in release build, we do not see the deadlock because in release build the test run has a different timing from debug build. To fix the dead lock, we can let both tasks to acquire locks in the same order by sorting the table_id using std::map. Some API changes are made in this change. To avoid more code changes, I limited this change to the none multistep situation in the current code. For multistep situation I don't have a deadlock case that need to be addressed. Jira: DB-10556 Original commit: 4214860 / D33528 Test Plan: ./yb_build.sh --java-test org.yb.pgsql.TestPgRegressMisc#testPgRegressMiscSerial3 -n 10 --tp 1 Reviewers: hsunder, tverona Reviewed By: hsunder Subscribers: bogdan, yql, ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D33882
Jira Link: DB-10556
Description
https://detective-gcp.dev.yugabyte.com/stability/test?analyze_trends=true&branch=master&class=org.yb.pgsql.TestPgRegressMisc&name=testPgRegressMiscSerial3
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information
The text was updated successfully, but these errors were encountered: