[DocDB] Disable packed row for colocated tables as Colocated table + Packed Rows: DML+DDL Workload and compaction cannot find schema packing and fail. #21218

Closed
1 task done
shishir2001-yb opened this issue Feb 27, 2024 · 4 comments
Labels: area/docdb (YugabyteDB core features), kind/bug (This issue is a bug), priority/highest (Highest priority issue), qa_stress (Bugs identified via Stress automation), QA (QA filed bugs)


shishir2001-yb commented Feb 27, 2024

Jira Link: DB-10146

Description

Tried on version: 2.21.1.0-b158

Encountering the following fatal error again on 2.21.1.0-b158 (this build has Jonathan’s fix for https://github.com/yugabyte/yugabyte-db/issues/20638) in a new cross-DB DDL test. Note: I believe this time it’s occurring on a table named ‘tb_0_temp_old’, which is likely created internally during the execution of certain DDLs.

F20240227 13:30:07 ../../src/yb/tablet/tablet.cc:1522] T f9ecba67fd0d44c78ef87b691e6647fc P d79ad410277a49fa85364466087c2ea2: Failed to write a batch with 0 operations into RocksDB: Corruption (yb/tablet/tablet_metadata.cc:354): Cannot find packing for table: 00004022000030008000000000004223, schema version: 0
    @     0x55debc004907  google::LogMessage::SendToLog()
    @     0x55debc00583d  google::LogMessage::Flush()
    @     0x55debc005e89  google::LogMessageFatal::~LogMessageFatal()
    @     0x55debd3b9b73  yb::tablet::Tablet::WriteToRocksDB()
    @     0x55debd3b5b45  yb::tablet::Tablet::ApplyIntents()
    @     0x55debd3b6712  yb::tablet::Tablet::ApplyIntents()
    @     0x55debd46e416  yb::tablet::TransactionParticipant::Impl::ProcessReplicated()
    @     0x55debd38fffc  yb::tablet::UpdateTxnOperation::DoReplicated()
    @     0x55debd3835fe  yb::tablet::Operation::Replicated()
    @     0x55debd385a7f  yb::tablet::OperationDriver::ReplicationFinished()
    @     0x55debc4a7e2b  yb::consensus::ConsensusRound::NotifyReplicationFinished()
    @     0x55debc4f68ff  yb::consensus::ReplicaState::ApplyPendingOperationsUnlocked()
    @     0x55debc4f5c69  yb::consensus::ReplicaState::AdvanceCommittedOpIdUnlocked()
    @     0x55debc4ddae0  yb::consensus::RaftConsensus::UpdateReplica()
    @     0x55debc4bf373  yb::consensus::RaftConsensus::Update()
    @     0x55debd6bf323  yb::tserver::ConsensusServiceImpl::UpdateConsensus()
    @     0x55debc54d5fe  std::__1::__function::__func<>::operator()()
    @     0x55debc54e22f  yb::consensus::ConsensusServiceIf::Handle()
    @     0x55debd2db56f  yb::rpc::ServicePoolImpl::Handle()
    @     0x55debd1fbbbf  yb::rpc::InboundCall::InboundCallTask::Run()
    @     0x55debd2eb3f3  yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0x55debdae4913  yb::Thread::SuperviseThread()
    @     0x7fda53c551ca  start_thread
    @     0x7fda53ea6e73  __GI___clone

Test Details:

This occurred in the 2nd iteration of step 3, so by then a backup restore had already been executed on a database (postgres_20).

1. Start the cross-DB DDL workload, which executes DDLs and DMLs across databases concurrently (20 colocated databases and 20 non-colocated databases), and run it for 20-30 minutes.
2. Create a PITR schedule on 10 random databases.
3. Start a while loop that executes the following:
  a. Note down the time for PITR(0).
  b. Create a backup of 1 random database.
  c. Start the cross-DB DDL workload and stop it after 10 minutes.
  d. Note down the time for PITR(1).
  e. Start the cross-DB DDL workload and keep it running.
  f. Execute PITR on all 10 databases at random times (between 1 and 9 seconds ago) while the workload is running.
  g. Wait for the workload to stop.
  h. Restore to PITR(1).
  i. Validate data.
  j. Restore to PITR(0) with a probability of 0.6 and validate data.
  k. Delete the PITR schedule for the backup DB (in our case it was postgres_20).
  l. Drop the database.
  m. Restore the backup.
  n. Create the snapshot schedule for this new DB.
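
The ordering of step 3 can be sketched roughly as below. The `ops` object and every method name on it are hypothetical stand-ins for whatever the stress framework actually calls; only the ordering and the constants (10 minutes, 10 databases, 1-9 seconds, probability 0.6) come from the steps above. Which databases steps h and j restore is not stated, so the sketch leaves that to the stand-in.

```python
import random


class RecordingOps:
    """Hypothetical stand-in for the stress framework: it only records calls."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any unknown attribute becomes a method that records its invocation.
        def record(*args):
            self.calls.append((name, args))
            return f"{name}-result"
        return record


def run_iteration(ops, pitr_dbs, backup_db, rng):
    """One pass of step 3 (a-n); every ops.* name is an assumption."""
    t0 = ops.now()                                  # a. PITR(0)
    backup = ops.create_backup(backup_db)           # b.
    ops.run_workload(600)                           # c. 10 minutes, then stop
    t1 = ops.now()                                  # d. PITR(1)
    ops.start_workload()                            # e. keep it running
    for db in pitr_dbs:                             # f. restore to 1-9 s ago
        ops.restore_seconds_ago(db, rng.randint(1, 9))
    ops.wait_for_workload()                         # g.
    ops.restore_to(t1)                              # h.
    ops.validate()                                  # i.
    if rng.random() < 0.6:                          # j.
        ops.restore_to(t0)
        ops.validate()
    ops.delete_schedule(backup_db)                  # k.
    ops.drop_database(backup_db)                    # l.
    ops.restore_backup(backup)                      # m.
    ops.create_schedule(backup_db)                  # n.
```

Passing a seeded `random.Random` as `rng` makes the restore offsets and the step j coin flip reproducible between runs.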
  

List of DDLs in sample app

private static List<List<String>> ddlList = List.of(
            List.of("CREATE INDEX idx1 ON ? (k)", "DROP INDEX idx1"),
            List.of("CREATE TABLE tempTable1 AS SELECT * FROM ? limit 1000000", "ALTER TABLE tempTable1 RENAME TO tempTable1_new", "DROP TABLE tempTable1_new"),
            List.of("CREATE MATERIALIZED VIEW mv1 as SELECT k from ? limit 10000", "REFRESH MATERIALIZED VIEW mv1", "DROP MATERIALIZED VIEW mv1"),
            List.of("ALTER TABLE ? ADD newColumn1 TEXT DEFAULT 'dummyString'", "ALTER TABLE ? DROP newColumn1"),
            List.of("ALTER TABLE ? ADD newColumn2 TEXT NULL", "ALTER TABLE ? DROP newColumn2"),
            List.of("CREATE VIEW view1_? AS SELECT k from ?", "DROP VIEW view1_?"),
            List.of("ALTER TABLE ? ADD newColumn3 TEXT DEFAULT 'dummyString'", "ALTER TABLE ? ALTER newColumn3 TYPE VARCHAR(1000)", "ALTER TABLE ? DROP newColumn3"),
            List.of("CREATE TABLE tempTable2 AS SELECT * FROM ? limit 1000000", "CREATE INDEX idx2 ON tempTable2(k)", "ALTER TABLE ? ADD newColumn4 TEXT DEFAULT 'dummyString'", "ALTER TABLE tempTable2 ADD newColumn2 TEXT DEFAULT 'dummyString'", "TRUNCATE table ? cascade", "ALTER TABLE ? DROP newColumn4", "ALTER TABLE tempTable2 DROP newColumn2", "DROP INDEX idx2", "DROP TABLE tempTable2"),
            List.of("CREATE VIEW view2_? AS SELECT k from ?", "CREATE MATERIALIZED VIEW mv2 as SELECT k from ? limit 10000", "REFRESH MATERIALIZED VIEW mv2", "DROP MATERIALIZED VIEW mv2", "DROP VIEW view2_?")
 );

Logs: http://stress.dev.yugabyte.com/stress_test/e56e3d49-de37-4a93-91d8-fe7493f6c4d1 (Attachments -> Universe logs)

G-flags

tserver_gflags={
    "ysql_enable_packed_row": "true",
    "ysql_enable_packed_row_for_colocated_table": "true",
    "enable_automatic_tablet_splitting": "true",
    "ysql_max_connections": "500",
    "client_read_write_timeout_ms": str(30 * 60 * 1000),
    "yb_client_admin_operation_timeout_sec": str(30 * 60),
    "consistent_restore": "true",
    "ysql_enable_db_catalog_version_mode": "true",
    "allowed_preview_flags_csv": "ysql_enable_db_catalog_version_mode",
    "tablet_replicas_per_gib_limit": 0
},
master_gflags={
    "ysql_enable_packed_row": "true",
    "ysql_enable_packed_row_for_colocated_table": "true",
    "enable_automatic_tablet_splitting": "true",
    "tablet_split_high_phase_shard_count_per_node": 20000,
    "tablet_split_high_phase_size_threshold_bytes": 2097152,  # 2 MB
    "tablet_split_low_phase_size_threshold_bytes": 102400,  # 100 KB
    "tablet_split_low_phase_shard_count_per_node": 10000,
    "consistent_restore": "true",
    "ysql_enable_db_catalog_version_mode": "true",
    "allowed_preview_flags_csv": "ysql_enable_db_catalog_version_mode",
    "tablet_replicas_per_gib_limit": 0
}
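
The two packed-row flags in this configuration are the ones relevant to the failure. As a minimal sketch, the snippet below renders a gflags dict as `--name=value` arguments with the colocated-table flag turned off, which matches the mitigation named in the issue title. The helper name `to_flag_args` and the override are illustrative assumptions; the thread does not say whether flipping the flag on an already affected cluster is sufficient.

```python
def to_flag_args(gflags):
    """Render a gflags dict as --name=value command-line arguments."""
    return [f"--{name}={value}" for name, value in gflags.items()]


# Only the flag names come from the report; the override itself is illustrative.
reported = {
    "ysql_enable_packed_row": "true",
    "ysql_enable_packed_row_for_colocated_table": "true",
}
mitigated = {**reported, "ysql_enable_packed_row_for_colocated_table": "false"}

print(to_flag_args(mitigated))
```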

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
shishir2001-yb added the area/docdb, QA, status/awaiting-triage, and qa_stress labels Feb 27, 2024
yugabyte-ci added the kind/bug and priority/medium labels Feb 27, 2024
@shishir2001-yb
Author

Similar issue
#20638

yugabyte-ci added the priority/highest label and removed the priority/medium and status/awaiting-triage labels Feb 27, 2024
Huqicheng added a commit that referenced this issue Feb 28, 2024
Summary:
Disable packed row feature for colocated tables to avoid running into #21218, while we debug the underlying issue in #21218.

Jira: DB-10016

Test Plan:
PgPackedRowTest.ColocatedCompactionPackRowDisabled
PgPackedRowTest.ColocatedPackRowDisabled

Reviewers: rthallam

Reviewed By: rthallam

Subscribers: qhu, ybase, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D32695
@Huqicheng
Contributor

Huqicheng commented Feb 28, 2024

yb-tserver.ip-172-151-18-254.us-west-2.compute.internal.yugabyte.log.INFO.20240227-133202.273248.gz:I0227 13:34:04.453986 273608 doc_read_context.cc:64] TBL 0000401b000030008000000000004226 T 00c52216ed3341aab818c29f44ec7f4c P d79ad410277a49fa85364466087c2ea2: DocReadContext, copy and filter: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33] => [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33], min_schema_version: 21
...
yb-tserver.ip-172-151-18-254.us-west-2.compute.internal.yugabyte.log.INFO.20240227-133202.273248.gz:I0227 13:34:08.325945 277681 doc_read_context.cc:78] TBL 0000401b000030008000000000004226 T 00c52216ed3341aab818c29f44ec7f4c P d79ad410277a49fa85364466087c2ea2: LogAfterMerge: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33], overwrite: 0
yb-tserver.ip-172-151-18-254.us-west-2.compute.internal.yugabyte.log.INFO.20240227-133202.273248.gz:W0227 13:34:08.771888 273608 db_impl.cc:3817] T 00c52216ed3341aab818c29f44ec7f4c P d79ad410277a49fa85364466087c2ea2 [R]: Compaction error: Corruption (yb/tablet/tablet_metadata.cc:354): Cannot find packing for table: 0000401b000030008000000000004226, schema version: 16

Schema version 16 exists right after the restore.

@Huqicheng
Contributor

Huqicheng commented Feb 28, 2024

In the tablet metadata, schema version 16 does not exist:

yb-tserver.ip-172-151-18-254.us-west-2.compute.internal.yugabyte.log.INFO.20240227-133502.283758.gz:I0227 13:35:02.566885 283861 doc_read_context.cc:74] TBL 0000401b000030008000000000004226 T 00c52216ed3341aab818c29f44ec7f4c P d79ad410277a49fa85364466087c2ea2: LogAfterLoad: [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]
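
To compare the two version lists without reading them by eye, a small parser along these lines could be used. It is only a sketch: the regex and the function name `schema_versions` are mine, and the sample strings are shortened copies of the `LogAfterMerge` and `LogAfterLoad` lines quoted above.

```python
import re

VERSIONS = re.compile(r"(LogAfterMerge|LogAfterLoad): \[([\d, ]*)\]")


def schema_versions(line):
    """Return (event, set of schema versions) for a doc_read_context line, or None."""
    m = VERSIONS.search(line)
    if m is None:
        return None
    return m.group(1), {int(v) for v in m.group(2).split(",") if v.strip()}


# Shortened copies of the two log lines quoted in the comments above.
merge = ("doc_read_context.cc:78] ... LogAfterMerge: ["
         + ", ".join(map(str, range(34))) + "], overwrite: 0")
load = ("doc_read_context.cc:74] ... LogAfterLoad: ["
        + ", ".join(map(str, range(21, 34))) + "]")

_, after_merge = schema_versions(merge)
_, after_load = schema_versions(load)
missing = sorted(after_merge - after_load)
print(16 in after_merge, 16 in after_load)  # True False
print(missing[0], missing[-1])              # 0 20
```

Versions 0 through 20 are present after the merge but absent after the load, which is consistent with the compaction error asking for version 16.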

@rthallamko3
Contributor

Using this issue to track the disabling of Packed Row + Colocation. #21244 will track the real fix.

rthallamko3 changed the title from "[DocDB] Colocated table + Packed Rows: DML Workload and compaction cannot find schema packing and fail." to "[DocDB] Colocated table + Packed Rows: DML+DDL Workload and compaction cannot find schema packing and fail." Mar 5, 2024
Huqicheng added a commit that referenced this issue Mar 5, 2024
Summary:
Disable packed row feature for colocated tables to avoid running into #21218, while we debug the underlying issue in #21218.
Jira: DB-10146

Test Plan:
PgPackedRowTest.ColocatedCompactionPackRowDisabled
PgPackedRowTest.ColocatedPackRowDisabled

Reviewers: rthallam

Reviewed By: rthallam

Subscribers: qhu, ybase, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D32693
Huqicheng added a commit that referenced this issue Mar 5, 2024
Summary:
Disable packed row feature for colocated tables to avoid running into #21218, while we debug the underlying issue in #21218.
Jira: DB-10146

Test Plan:
PgPackedRowTest.ColocatedCompactionPackRowDisabled
PgPackedRowTest.ColocatedPackRowDisabled

Reviewers: rthallam

Reviewed By: rthallam

Subscribers: yql, ybase, qhu

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D32694
Huqicheng added a commit that referenced this issue Mar 6, 2024
Summary:
Disable packed row feature for colocated tables to avoid running into #21218, while we debug the underlying issue in #21218.
Jira: DB-10146

Test Plan:
PgPackedRowTest.ColocatedCompactionPackRowDisabled
PgPackedRowTest.ColocatedPackRowDisabled

Reviewers: rthallam, sergei

Reviewed By: rthallam

Subscribers: qhu, ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D32872
rthallamko3 changed the title from "[DocDB] Colocated table + Packed Rows: DML+DDL Workload and compaction cannot find schema packing and fail." to "[DocDB] Disable packed row for colocated tables as Colocated table + Packed Rows: DML+DDL Workload and compaction cannot find schema packing and fail." Mar 6, 2024
Huqicheng added a commit that referenced this issue Mar 14, 2024
…KvStoreInfo::colocation_to_table for the same table

Summary:
## Customer Impact
Colocated table + Packed Rows: DML+DDL workload fails with an error related to missing schema packing. The issue affects 2.20.0+ builds, because the packed row feature is enabled by default for the YSQL API starting in 2.20 (for new clusters only; clusters that are upgraded from lower releases such as 2.14/2.6/2.18 to 2.20 are not impacted). If the user creates colocated tables, performs many DDLs on them, and compaction on the colocated tablet triggers OldSchemaGC, then after a PITR restore there is a chance that the workload runs into the errors mentioned in #21218. This results in a tserver crash loop on the impacted node.

Note that it can impact customers on 2.18 builds, if they explicitly enable ysql packed row for colocated tables, using the gflag ysql_enable_packed_row_for_colocated_table.

## Details
During OldSchemaGC and OnBackfillDoneUnlocked, only the TableInfoPtr in KvStoreInfo::tables is updated to new_value, while KvStoreInfo::colocation_to_table is left unchanged. This can lead to corrupt tablet metadata if a PITR snapshot is restored after the OldSchemaGC/OnBackfillDoneUnlocked call. FindMatchingTable always gets the matching table from KvStoreInfo::colocation_to_table, so TableInfo::MergeSchemaPackings merges the snapshot schema packings with the old TableInfo's schema packings, which runs into issues.

This fix addresses the code bug and ensures that KvStoreInfo::colocation_to_table is updated whenever the TableInfoPtr in KvStoreInfo::tables is updated.

Test Plan:
./yb_build.sh release --cxx-test pg_packed_row-test --gtest_filter "PackingVersion/PgPackedRowTest.RestorePITRSnapshotAfterOldSchemaGC/*"
./yb_build.sh release --cxx-test yb-backup-cross-feature-test --gtest_filter "PackedRows/YBBackupTestWithPackedRowsAndColocation.RestoreBackupAfterOldSchemaGC/1"
./yb_build.sh release --cxx-test tools_yb-admin-snapshot-schedule-test --gtest_filter Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlCreateIndex/*

Reviewers: sergei, rthallam

Reviewed By: sergei, rthallam

Subscribers: rthallam, bogdan, ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D32922
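
The stale-entry problem described in the Details section of the commit above can be illustrated with a toy model. This is a Python sketch, not the C++ code: the names `TableInfo`, `KvStoreInfo`, `tables`, and `colocation_to_table` come from the commit message, while the methods, the `colocation_id` parameter, and the version numbers are hypothetical. It shows only that two indexes can drift apart when one is updated alone, not how the later merge fails.

```python
class TableInfo:
    def __init__(self, table_id, packings):
        self.table_id = table_id
        self.packings = set(packings)  # schema versions that have a packing


class KvStoreInfo:
    """Toy model: two indexes that should reference the same TableInfo."""

    def __init__(self):
        self.tables = {}                # table_id -> TableInfo
        self.colocation_to_table = {}   # colocation_id -> TableInfo

    def add(self, colocation_id, info):
        self.tables[info.table_id] = info
        self.colocation_to_table[colocation_id] = info

    def old_schema_gc(self, table_id, min_version, colocation_id=None):
        """Replace the TableInfo with a copy that drops old packings.

        With colocation_id=None only `tables` is updated (the reported bug);
        passing it also updates `colocation_to_table` (the described fix).
        """
        old = self.tables[table_id]
        new = TableInfo(table_id, {v for v in old.packings if v >= min_version})
        self.tables[table_id] = new
        if colocation_id is not None:
            self.colocation_to_table[colocation_id] = new

    def in_sync(self, table_id, colocation_id):
        return self.tables[table_id] is self.colocation_to_table[colocation_id]


def demo(fixed):
    kv = KvStoreInfo()
    kv.add(7, TableInfo("t", range(34)))
    kv.old_schema_gc("t", 21, colocation_id=7 if fixed else None)
    return kv.in_sync("t", 7), sorted(kv.colocation_to_table[7].packings)[0]


print(demo(fixed=False))  # (False, 0): colocation index still holds the stale entry
print(demo(fixed=True))   # (True, 21)
```

In the unfixed run, a lookup through `colocation_to_table` still sees packings 0 through 33 while `tables` sees only 21 through 33, which is the kind of divergence the fix removes.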
asrinivasanyb pushed a commit to asrinivasanyb/yugabyte-db that referenced this issue Mar 18, 2024
…les and KvStoreInfo::colocation_to_table for the same table

Huqicheng added a commit that referenced this issue Mar 19, 2024
…fo::tables and KvStoreInfo::colocation_to_table for the same table

Summary:
Original commit: 6b0aa84 / D32922
Colocated table + Packed Rows: DML+DDL workload fails with an error related to missing schema packing. The issue affects 2.20.0+ builds, because the packed row feature is enabled by default for the YSQL API starting in 2.20 (for new clusters only; clusters that are upgraded from lower releases such as 2.14/2.6/2.18 to 2.20 are not impacted). If the user creates colocated tables, performs many DDLs on them, and compaction on the colocated tablet triggers OldSchemaGC, then after a PITR restore there is a chance that the workload runs into the errors mentioned in #21218. This results in a tserver crash loop on the impacted node.

Note that it can impact customers on 2.18 builds, if they explicitly enable ysql packed row for colocated tables, using the gflag ysql_enable_packed_row_for_colocated_table.

During OldSchemaGC and OnBackfillDoneUnlocked, only the TableInfoPtr in KvStoreInfo::tables is updated to new_value, while KvStoreInfo::colocation_to_table is left unchanged. This can lead to corrupt tablet metadata if a PITR snapshot is restored after the OldSchemaGC/OnBackfillDoneUnlocked call. FindMatchingTable always gets the matching table from KvStoreInfo::colocation_to_table, so TableInfo::MergeSchemaPackings merges the snapshot schema packings with the old TableInfo's schema packings, which runs into issues.

This fix addresses the code bug and ensures that KvStoreInfo::colocation_to_table is updated whenever the TableInfoPtr in KvStoreInfo::tables is updated.

Test Plan:
./yb_build.sh release --cxx-test pg_packed_row-test --gtest_filter "PackingVersion/PgPackedRowTest.RestorePITRSnapshotAfterOldSchemaGC/*"
./yb_build.sh release --cxx-test yb-backup-cross-feature-test --gtest_filter "PackedRows/YBBackupTestWithPackedRowsAndColocation.RestoreBackupAfterOldSchemaGC/1"
./yb_build.sh release --cxx-test tools_yb-admin-snapshot-schedule-test --gtest_filter Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlCreateIndex/*

Reviewers: sergei, rthallam

Reviewed By: rthallam

Subscribers: yql, ybase, bogdan, rthallam

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D33291
Huqicheng added a commit that referenced this issue Mar 20, 2024
…fo::tables and KvStoreInfo::colocation_to_table for the same table

Summary:
Original commit: 6b0aa84 / D32922
Colocated table + Packed Rows: DML+DDL workload fails with an error related to missing schema packing. The issue affects 2.20.0+ builds, because the packed row feature is enabled by default for the YSQL API starting in 2.20 (for new clusters only; clusters that are upgraded from lower releases such as 2.14/2.6/2.18 to 2.20 are not impacted). If the user creates colocated tables, performs many DDLs on them, and compaction on the colocated tablet triggers OldSchemaGC, then after a PITR restore there is a chance that the workload runs into the errors mentioned in #21218. This results in a tserver crash loop on the impacted node.

Note that it can impact customers on 2.18 builds, if they explicitly enable ysql packed row for colocated tables, using the gflag ysql_enable_packed_row_for_colocated_table.

During OldSchemaGC and OnBackfillDoneUnlocked, only the TableInfoPtr in KvStoreInfo::tables is updated to new_value, while KvStoreInfo::colocation_to_table is left unchanged. This can lead to corrupt tablet metadata if a PITR snapshot is restored after the OldSchemaGC/OnBackfillDoneUnlocked call. FindMatchingTable always gets the matching table from KvStoreInfo::colocation_to_table, so TableInfo::MergeSchemaPackings merges the snapshot schema packings with the old TableInfo's schema packings, which runs into issues.

This fix addresses the code bug and ensures that KvStoreInfo::colocation_to_table is updated whenever the TableInfoPtr in KvStoreInfo::tables is updated.

Test Plan:
./yb_build.sh release --cxx-test pg_packed_row-test --gtest_filter "PackingVersion/PgPackedRowTest.RestorePITRSnapshotAfterOldSchemaGC/*"
./yb_build.sh release --cxx-test yb-backup-cross-feature-test --gtest_filter "PackedRows/YBBackupTestWithPackedRowsAndColocation.RestoreBackupAfterOldSchemaGC/1"
./yb_build.sh release --cxx-test tools_yb-admin-snapshot-schedule-test --gtest_filter Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlCreateIndex/*

Reviewers: sergei, rthallam

Reviewed By: rthallam

Subscribers: yql, ybase, bogdan, rthallam

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D33292