[DocDB] Master crashes for failed indexes, when backfill responses arrive in interleaved order. #20510

Closed
1 task done
amitanandaiyer opened this issue Jan 9, 2024 · 0 comments

Comments

@amitanandaiyer
Contributor

amitanandaiyer commented Jan 9, 2024

Jira Link: DB-9516

Description

Master crashes for failed indexes, when backfill responses arrive in interleaved order

If the master is slow, or if network delays cause a backfill response to be processed while a different backfill operation is running on the same table, the master may crash with:

 [m-1] [libprotobuf FATAL /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20231120041920-e307bff3a7-macos-arm64/installed/uninstrumented/include/google/protobuf/map.h:1064] CHECK failed: it != end(): key not found: ebd5a1637b4746f9bf4d6ab7218ffeee
 [m-1] libc++abi: terminating due to uncaught exception of type google::protobuf::FatalException: CHECK failed: it != end(): key not found: ebd5a1637b4746f9bf4d6ab7218ffeee
 [m-1] *** Aborted at 1704791869 (unix time) try "date -d @1704791869" if you are using GNU date ***
 [m-1] PC: @                0x0 yb::kFirstColumnId

Such scenarios should be handled gracefully.

Here is the analysis:

E1130 18:50:13.078498  1169 backfill_index.cc:915] Backfill Index Table(s) { ........... } failed to backfill the index: [2660a7f255f34d60a76e76dac6d2b978] due to Illegal state (yb/tablet/tablet.cc:2521): Table metadata cache is not present for index updat
Failed_indexes are [2660a7f255f34d60a76e76dac6d2b978]
I1130 18:50:13.091640  1169 backfill_index.cc:1056] Done backfill on xxxxx [id=37c1b1af6fba4e9ab2f73f6ed8bf63d0] setting permissions to [{2660a7f255f34d60a76e76dac6d2b978, 8}]
I1130 18:50:13.162415  1170 catalog_manager.cc:5283] MarkIndexInfoFromTableForDeletion table 37c1b1af6fba4e9ab2f73f6ed8bf63d0 index 2660a7f255f34d60a76e76dac6d2b978 multi_stage=0
I1130 18:50:13.162456  1170 catalog_manager.cc:5317] DeleteIndexInfoFromTable table 37c1b1af6fba4e9ab2f73f6ed8bf63d0 index 2660a7f255f34d60a76e76dac6d2b978
I1130 18:50:13.164863  1170 catalog_manager_ent.cc:3228] Deleting CDC streams for tables: 2660a7f255f34d60a76e76dac6d2b978
I1130 18:50:13.169730  1170 catalog_manager_ent.cc:3256] Deleting CDC streams metadata for tables: 2660a7f255f34d60a76e76dac6d2b978
I1130 18:50:13.170596  1170 catalog_manager.cc:5624] Successfully initiated deletion of index with table_id: "2660a7f255f34d60a76e76dac6d2b978"

The backfill for Index-1 = 2660a7f255f34d60a76e76dac6d2b978/[xxxxxx] fails (the reason is not important here). The next backfill job, for Index-2 = xxxxx, then appears to start almost immediately.


I1130 18:50:13.187362  1172 backfill_index.cc:390] StartBackfillingData starting backfill on xxxxx [id=37c1b1af6fba4e9ab2f73f6ed8bf63d0] for [table_id: "412983eff5584d7c807308cc4e15b309" is_local: false columns { column_id: 0 indexed_column_id: 8 column_name: "C
...
I1130 18:50:13.204161  1169 catalog_manager.cc:5821] Marking table as DELETED: xxxxx [id=2660a7f255f34d60a76e76dac6d2b978]

However, some async tasks sent to the tablets for the first index backfill have not been processed yet.


W1130 18:50:13.207756  1170 backfill_index.cc:1489] TS 44b3510e16ef4273beb52f0bcda0a9bd: backfill failed for tablet 41a89647275b44a799f13a7ae8e07e8d (table xxxxx [id=37c1b1af6fba4e9ab2f73f6ed8bf63d0]) no further retry: Invalid argument (yb/tserver/tablet_service
I1130 18:50:13.207855  1170 backfill_index.cc:1241] Failed to backfill the tablet 0x000000000cb24500 -> 41a89647275b44a799f13a7ae8e07e8d (table xxxxx [id=37c1b1af6fba4e9ab2f73f6ed8bf63d0]): Invalid argument (yb/tserver/tablet_service.cc:714): Tablet has a differ
Failed_indexes are [2660a7f255f34d60a76e76dac6d2b978]
E1130 18:50:13.207862  1170 backfill_index.cc:915] Backfill Index Table(s) { xxxxx } failed to backfill the index: [2660a7f255f34d60a76e76dac6d2b978] due to Invalid argument (yb/tserver/tablet_service.cc:714): Tablet has a different schema 155 vs 148

When these delayed responses arrive and try to mark Index-1 = 2660a7f255f34d60a76e76dac6d2b978/xxxxx as failed, the index is no longer found in the map. backfill_jobs() was updated to track Index-2 = 412983eff5584d7c807308cc4e15b309/xxxxx when Index-2 started backfilling, so Index-1 = 2660a7f255f34d60a76e76dac6d2b978/xxxxx is not in the map, which seems to cause the crash.


    auto* backfill_state_pb = indexed_table_pb.mutable_backfill_jobs(0)->mutable_backfill_state();
    for (const auto& idx_id : index_ids_set) {
      // Map::at() CHECK-fails ("key not found") when idx_id is not in the current job's map.
      backfill_state_pb->at(idx_id) = state;
      VLOG(2) << "Marking index " << idx_id << " as " << BackfillJobPB_State_Name(state);
    }
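
The interleaving described above can be replayed in isolation. The following is only a minimal sketch under stated assumptions: `std::map` stands in for the protobuf map held by the backfill job, and `State`, `BackfillState`, `MarkFailedUnguarded`, and `ReplayInterleaving` are hypothetical names that do not exist in the YugabyteDB code. `std::map::at()` throws `std::out_of_range` for a missing key, which plays the role of the protobuf `CHECK failed: it != end()` abort seen in the log.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the per-job backfill state: index id -> state.
enum class State { kInProgress, kFailed };
using BackfillState = std::map<std::string, State>;

// Mirrors the unguarded update: at() requires the key to be present.
// Returns false when the update hits a key the job does not track.
bool MarkFailedUnguarded(BackfillState& job, const std::string& idx_id) {
  try {
    job.at(idx_id) = State::kFailed;
    return true;
  } catch (const std::out_of_range&) {
    // In the real master the equivalent protobuf CHECK aborts the process.
    return false;
  }
}

// Replays the sequence: job 1 (Index-1) fails, job 2 (Index-2) replaces it,
// then a delayed job-1 response tries to mark Index-1 as failed.
bool ReplayInterleaving() {
  BackfillState current_job{{"index-1", State::kInProgress}};
  current_job = BackfillState{{"index-2", State::kInProgress}};
  return MarkFailedUnguarded(current_job, "index-1");
}
```

`ReplayInterleaving()` returns `false` because the delayed response names an index that the current job no longer tracks.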

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@amitanandaiyer amitanandaiyer added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Jan 9, 2024
@amitanandaiyer amitanandaiyer self-assigned this Jan 9, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue and removed status/awaiting-triage Issue awaiting triage labels Jan 9, 2024
amitanandaiyer added a commit that referenced this issue Jan 10, 2024
…different backfill operations

Summary:
If the master is slow, or if network delays cause a backfill response to be processed when a different backfill operation is running on the same table, we may run into

```
[m-1] [libprotobuf FATAL /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20231120041920-e307bff3a7-macos-arm64/installed/uninstrumented/include/google/protobuf/map.h:1064] CHECK failed: it != end(): key not found: ebd5a1637b4746f9bf4d6ab7218ffeee
```

It may affect both YSQL and YCQL backfills. However, it is less likely to be seen on YSQL because the create index call in YSQL is synchronous.

This change checks whether the response being handled and the current backfill job are for the same indexes; if not, it bails out gracefully.
Jira: DB-9516

Test Plan: ./yb_build.sh --cxx-test integration-tests_cassandra_cpp_driver-test --gtest_filter CppCassandraDriverTest.ConcurrentBackfillIndexFailures

Reviewers: jason, hsunder

Reviewed By: jason

Subscribers: ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D31571
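
The check described in the summary above could look roughly like the sketch below. It is not the code from the actual change: `std::map` stands in for the protobuf map, all names (`JobState`, `JobStateMap`, `ResponseMatchesJob`, `UpdateIfCurrent`) are hypothetical, and "for the same indexes" is interpreted here as "every index in the response is tracked by the current job", which is an assumption.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical stand-ins; not the real YugabyteDB types.
enum class JobState { kInProgress, kSuccess, kFailed };
using JobStateMap = std::map<std::string, JobState>;

// True only if every index named in the response is tracked by the current job.
bool ResponseMatchesJob(const JobStateMap& current_job,
                        const std::set<std::string>& response_indexes) {
  for (const auto& idx_id : response_indexes) {
    if (current_job.find(idx_id) == current_job.end()) {
      return false;
    }
  }
  return true;
}

// Applies the state update only for a response that belongs to the current job.
// A stale response leaves the map untouched and is reported via the return value.
bool UpdateIfCurrent(JobStateMap& current_job,
                     const std::set<std::string>& response_indexes,
                     JobState state) {
  if (!ResponseMatchesJob(current_job, response_indexes)) {
    return false;  // Bail out gracefully instead of crashing.
  }
  for (const auto& idx_id : response_indexes) {
    current_job[idx_id] = state;
  }
  return true;
}
```

Validating the whole response before mutating anything keeps the update all-or-nothing, so a stale response cannot partially overwrite the state of the job that is currently running.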
@rthallamko3 rthallamko3 changed the title [DocDB] Handle backfill responses getting interleaved [DocDB] Master crashes for failed indexes, when backfill responses arrive in interleaved order. Jan 10, 2024
amitanandaiyer added a commit that referenced this issue Jan 23, 2024
…rleaved across different backfill operations

Summary:
Original commit: 0fe1bba / D31571
If the master is slow, or if network delays cause a backfill response to be processed when a different backfill operation is running on the same table, we may run into

```
[m-1] [libprotobuf FATAL /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20231120041920-e307bff3a7-macos-arm64/installed/uninstrumented/include/google/protobuf/map.h:1064] CHECK failed: it != end(): key not found: ebd5a1637b4746f9bf4d6ab7218ffeee
```

It may affect both YSQL and YCQL backfills. However, it is less likely to be seen on YSQL because the create index call in YSQL is synchronous.

This change checks whether the response being handled and the current backfill job are for the same indexes; if not, it bails out gracefully.
Jira: DB-9516

Test Plan: ./yb_build.sh --cxx-test integration-tests_cassandra_cpp_driver-test --gtest_filter CppCassandraDriverTest.ConcurrentBackfillIndexFailures

Reviewers: jason, hsunder

Reviewed By: jason

Subscribers: bogdan, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31598
amitanandaiyer added a commit that referenced this issue Jan 24, 2024
…rleaved across different backfill operations

Summary:
Original commit: 0fe1bba / D31571
If the master is slow, or if network delays cause a backfill response to be processed when a different backfill operation is running on the same table, we may run into

```
[m-1] [libprotobuf FATAL /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20231120041920-e307bff3a7-macos-arm64/installed/uninstrumented/include/google/protobuf/map.h:1064] CHECK failed: it != end(): key not found: ebd5a1637b4746f9bf4d6ab7218ffeee
```

It may affect both YSQL and YCQL backfills. However, it is less likely to be seen on YSQL because the create index call in YSQL is synchronous.

This change checks whether the response being handled and the current backfill job are for the same indexes; if not, it bails out gracefully.
Jira: DB-9516

Test Plan: ./yb_build.sh --cxx-test integration-tests_cassandra_cpp_driver-test --gtest_filter CppCassandraDriverTest.ConcurrentBackfillIndexFailures

Reviewers: jason, hsunder

Reviewed By: jason

Subscribers: ybase, bogdan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31599
amitanandaiyer added a commit that referenced this issue Jan 24, 2024
…rleaved across different backfill operations

Summary:
Original commit: 0fe1bba / D31571
If the master is slow, or if network delays cause a backfill response to be processed when a different backfill operation is running on the same table, we may run into

```
[m-1] [libprotobuf FATAL /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20231120041920-e307bff3a7-macos-arm64/installed/uninstrumented/include/google/protobuf/map.h:1064] CHECK failed: it != end(): key not found: ebd5a1637b4746f9bf4d6ab7218ffeee
```

It may affect both YSQL and YCQL backfills. However, it is less likely to be seen on YSQL because the create index call in YSQL is synchronous.

This change checks whether the response being handled and the current backfill job are for the same indexes; if not, it bails out gracefully.
Jira: DB-9516

Test Plan: ./yb_build.sh --cxx-test integration-tests_cassandra_cpp_driver-test --gtest_filter CppCassandraDriverTest.ConcurrentBackfillIndexFailures

Reviewers: jason, hsunder

Reviewed By: jason

Subscribers: ybase, bogdan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31601
@yugabyte-ci yugabyte-ci added priority/high High Priority and removed priority/medium Medium priority issue labels Apr 9, 2024