Skip to content

2.25.0.0-b12

@myang2021 myang2021 tagged this 19 Sep 16:54
Summary:
The unit test Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1
in asan build fails a lot due to a newly added assertion by commit
a6981f20f910d2ad61051e3190c5cee768dacd1b like:

```
F20240916 20:59:08 ../../src/yb/master/catalog_entity_info.cc:531] Check failed: !l->is_being_created_by_ysql_ddl_txn()
```

After debugging, I found that the assertion failure is related to a DDL statement
that does alter table add a unique constraint. It creates an index for
the base table. The DDL transaction involves two tables: the base table
and the index. If we compare the base table's schema to decide whether
the PG side of the DDL transaction has committed or aborted, we will
find that the base table's schema does not change. So it is not possible
to use base table's schema to make a decision of the transaction's
commit/abort state. In this case we must use the index for schema
comparison because it is newly created.

The bug is that the base table's schema is used for schema comparison
and we returned std::nullopt to signify that we do not know whether the
DDL transaction is committed or aborted. However, the std::nullopt is
not only used by the base table, but also used by the index to call
`TableInfo::IsBeingDroppedDueToDdlTxn` because there are two tables in
the DDL transaction and we just loop through them using the same
std::nullopt. For the index, it is being created so the first DCHECK
below fails:

```
  if (!txn_success.has_value()) {
    // This means the DDL txn only increments the schema version and does not change
    // the DocDB schema at all. It cannot be one of the following 2 cases.
    DCHECK(!l->is_being_created_by_ysql_ddl_txn());
    DCHECK(!l->is_being_deleted_by_ysql_ddl_txn());
    return false;
  }
```

I fixed the DCHECK failure by changing the API of `TableInfo::IsBeingDroppedDueToDdlTxn` to
not use std::optional<bool> for its argument `txn_success`. The function used to
take a simple `bool txn_success` as argument. I get it back and only call the
function when the argument isn't `std::nullopt`. In other words, I now only call
the function when the caller has already figured out the DDL transaction's
commit/abort status.

Added a new unit test that would fail the assertion before the fix.

NOTE: this diff only fixes the DCHECK failure so that we are not worse than before.
However we are not better than before either. There is still a flaw with the current
DDL atomicity work flow based upon schema comparison when the DDL transaction
involves multiple tables. Only the first table's schema is used for comparison
on the grounds that we only need to compare one table's schema against PG catalog
to determine the DDL transaction's committed/aborted status. While this assumption
is generally true, it fails when the first table's schema does not change and only involves
a schema version increment. Currently it is not straightforward to revise the work flow
to pick the right table which can be used to tell the DDL transaction's status based upon
schema comparison. Therefore if the PG side has aborted the DDL transaction, schema
comparison based method will conclude that there is no schema change of the base table
and therefore it simply clears the DDL verification state for **both the base table and the index**,
which is equivalent to a successful commit. This isn't right and the DocDB index will be left
as garbage that isn't cleaned up. This issue is tracked separately by https://github.com/yugabyte/yugabyte-db/issues/23988
Jira: DB-12825

Test Plan:
./yb_build.sh debug --cxx-test pg_ddl_atomicity-test --gtest_filter PgDdlAtomicityTest.TestAlterTableAddUniqueConstraint -n 10

./yb_build.sh asan --cxx-test tools_yb-admin-snapshot-schedule-test --gtest_filter Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1 --clang17 -n 10 --tp 1

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D38107
Assets 2
Loading