[DocDB] Taking backup of all database got stuck after restore failure #13773

Open
sridhar-yb opened this issue Aug 25, 2022 · 0 comments
Labels: area/docdb (YugabyteDB core features), kind/bug (This issue is a bug), priority/medium (Medium priority issue), QA (QA filed bugs)

sridhar-yb commented Aug 25, 2022

Jira Link: DB-3297

Description

[yugabyte@ip-10-9-79-113 ~]$ /home/yugabyte/tserver/bin/yb-admin --master_addresses 10.9.79.113:7100 --certs_dir_name /home/yugabyte/yugabyte-tls-config list_snapshots show_details
Snapshot UUID                    	State 	 	Creation Time
2baeb272-8ca0-46d0-972f-d942f0aa6599 	DELETING 	2022-08-25 14:15:39.572204
 	{"type":"NAMESPACE","id":"0000400b000030008000000000000000","data":{"name":"postgres_non_ybc_no_ear","database_type":"YQL_DATABASE_PGSQL","next_pg_oid":13328,"colocated":false,"state":"RUNNING"}}
 	{"type":"TABLE","id":"0000400b00003000800000000000400c","data":{"name":"postgresqlkeyvalue","version":0,"state":"RUNNING","next_column_id":2,"table_type":"PGSQL_TABLE_TYPE","namespace_id":"0000400b000030008000000000000000","namespace_name":"postgres_non_ybc_no_ear"}}
Restoration UUID                 	State
6fd6af6d-9097-49d6-a670-64c20828f493 	FAILED
[yugabyte@ip-10-9-79-113 ~]$
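To check for this state from a script, the listing above can be parsed into UUID/state pairs. The following is a minimal sketch that assumes the column layout matches the sample shown in this issue; `parse_list_snapshots` is a hypothetical helper, not part of yb-admin.

```python
import re

# Header and UUID rows copied from the listing above (detail rows omitted).
SAMPLE = """\
Snapshot UUID                    	State 	 	Creation Time
2baeb272-8ca0-46d0-972f-d942f0aa6599 	DELETING 	2022-08-25 14:15:39.572204
Restoration UUID                 	State
6fd6af6d-9097-49d6-a670-64c20828f493 	FAILED
"""

# A data row starts with a UUID followed by whitespace and an upper-case state.
UUID_ROW = re.compile(r"^([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+([A-Z_]+)")


def parse_list_snapshots(text):
    """Return (snapshots, restorations), each a dict mapping UUID -> state."""
    snapshots, restorations = {}, {}
    current = None
    for line in text.splitlines():
        if line.startswith("Snapshot UUID"):
            current = snapshots
        elif line.startswith("Restoration UUID"):
            current = restorations
        elif current is not None:
            match = UUID_ROW.match(line)
            if match:  # JSON detail rows and prompts do not match and are skipped
                current[match.group(1)] = match.group(2)
    return snapshots, restorations


snapshots, restorations = parse_list_snapshots(SAMPLE)
print(snapshots)
print(restorations)
```

A snapshot that stays in DELETING alongside a FAILED restoration, as here, is the condition this issue describes.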
  • Next, take a backup of all databases (this includes the database from the failed restore operation above)
  • Observed that the backup process got stuck
  • The error logs are below:

Restore failure error in Platform:

Failed to execute task {"nodeExporterUser":"prometheus","universeUUID":"64cdd943-1508-4579-a60e-d1fa69ea5c9d","enableYbc":false,"installYbc":false,"ybcInstalled":false,"encryptionAtRestConfig":{"encryptionAtRestEnabled":false,"opType":"UNDEFINED","type":"DATA_KEY"},"communicationPorts":{"masterHttpPort":7000,"masterRpcPort":7100,"tserverHttpPort":9000,"tserverRpcPort":9100,"ybControllerHttpPort":14000,"ybControllerrRpcPort":18018,"redisServerHttpPort":11000,"redisServerRpcPort":6379,"yqlServerHttpPort":12000,"yqlSe..., hit error:

java.lang.RuntimeException: {"error": "Backup exception: Snapshot id 6fd6af6d-9097-49d6-a670-64c20828f493, restoring failed!"}.
  • Observed that the backup stayed stuck for a long time

Tserver Error Log:

Log file created at: 2022/08/25 14:16:01
Running on machine: ip-10-9-79-113.us-west-2.compute.internal
Application fingerprint: version 2.15.3.0 build 89 revision 24558166aabde408742c7c9200f89236944c0553 build_type RELEASE built at 24 Aug 2022 03:34:53 UTC
Node information: { hostname: 'ip-10-9-79-113.us-west-2.compute.internal', rpc_ip: '10.9.79.113', webserver_ip: '10.9.79.113', uuid: 'f3d5b9840f384e6e9567dfe0e4350eea' }
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0825 14:16:01.669158 23631 tablet.cc:731] T f9eab7eb62164cfa8953e97ca883983b P f3d5b9840f384e6e9567dfe0e4350eea: Failed to open a RocksDB database in directory /mnt/d0/yb-data/tserver/data/rocksdb/table-0000400b00003000800000000000400c/tablet-f9eab7eb62164cfa8953e97ca883983b: Invalid argument (yb/encryption/universe_key_manager.cc:75): Key with version number dmF1bHQ6djE6em9ZZlZQZ1kwcXpKTGt1ejNBUVJTNEFtWlZ0aTlVUCtGaEtPeno5WTFUalNhZ0JXRDUyY1g4UUsxS3VrWVB1OUZzRUxIUHBSNTZtODh2bjQ= does not exist
E0825 14:16:01.669664 23631 ts_tablet_manager.cc:1459] T f9eab7eb62164cfa8953e97ca883983b P f3d5b9840f384e6e9567dfe0e4350eea: Tablet failed to bootstrap: Illegal state (yb/tablet/tablet.cc:736): Invalid argument (yb/encryption/universe_key_manager.cc:75): Key with version number dmF1bHQ6djE6em9ZZlZQZ1kwcXpKTGt1ejNBUVJTNEFtWlZ0aTlVUCtGaEtPeno5WTFUalNhZ0JXRDUyY1g4UUsxS3VrWVB1OUZzRUxIUHBSNTZtODh2bjQ= does not exist
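The "version number" in these errors is a Base64 string, and decoding it may help identify which key the tserver is looking for. A small sketch, using only the value copied from the log line above:

```python
import base64

# Key version string copied verbatim from the tserver error above.
KEY_VERSION = (
    "dmF1bHQ6djE6em9ZZlZQZ1kwcXpKTGt1ejNBUVJTNEFtWlZ0aTlVUCtGaEtPeno5"
    "WTFUalNhZ0JXRDUyY1g4UUsxS3VrWVB1OUZzRUxIUHBSNTZtODh2bjQ="
)

decoded = base64.b64decode(KEY_VERSION)
print(decoded[:9])  # b'vault:v1:'
```

The `vault:v1:` prefix resembles the ciphertext format of HashiCorp Vault's transit secrets engine. That is an inference from the decoded value only; the logs do not state which KMS produced the key.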

Tserver FATAL.log:

Log file created at: 2022/08/25 14:15:43
Running on machine: ip-10-9-79-113.us-west-2.compute.internal
Application fingerprint: version 2.15.3.0 build 89 revision 24558166aabde408742c7c9200f89236944c0553 build_type RELEASE built at 24 Aug 2022 03:34:53 UTC
Node information: { hostname: 'ip-10-9-79-113.us-west-2.compute.internal', rpc_ip: '10.9.79.113', webserver_ip: '10.9.79.113', uuid: 'f3d5b9840f384e6e9567dfe0e4350eea' }
Running duration (h:mm:ss): 0:15:39
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0825 14:15:43.071085 18085 operation_driver.cc:396] T f9eab7eb62164cfa8953e97ca883983b P f3d5b9840f384e6e9567dfe0e4350eea S RD-P Ts { physical: 1661436943066997 } kSnapshot (0x00000000060217a0): Apply failed: Invalid argument (yb/encryption/universe_key_manager.cc:75): Key with version number dmF1bHQ6djE6em9ZZlZQZ1kwcXpKTGt1ejNBUVJTNEFtWlZ0aTlVUCtGaEtPeno5WTFUalNhZ0JXRDUyY1g4UUsxS3VrWVB1OUZzRUxIUHBSNTZtODh2bjQ= does not exist, request: dest_uuid: "f3d5b9840f384e6e9567dfe0e4350eea" operation: RESTORE_ON_TABLET snapshot_id: "+\256\262r\214\240F\320\227/\331B\360\252e\231" tablet_id: "f9eab7eb62164cfa8953e97ca883983b" propagated_hybrid_time: 6805245718801002496 restoration_id: "o\326\257m\220\227I\326\246pd\302\010(\364\223"
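The `snapshot_id` and `restoration_id` fields in the request above are printed as raw bytes with octal escapes, which makes them hard to match against the UUIDs from `list_snapshots`. A short sketch for converting them, assuming each field holds the 16 raw bytes of a UUID (`escaped_bytes_to_uuid` is a hypothetical helper):

```python
import uuid


def escaped_bytes_to_uuid(escaped):
    """Convert a 16-byte value printed with octal escapes into a UUID."""
    # unicode_escape turns "\256" into chr(0o256); latin-1 maps each char back to one byte.
    raw = escaped.encode("latin-1").decode("unicode_escape").encode("latin-1")
    return uuid.UUID(bytes=raw)


# Values copied from the FATAL log line above (raw strings keep the backslashes).
snapshot_id = escaped_bytes_to_uuid(r"+\256\262r\214\240F\320\227/\331B\360\252e\231")
restoration_id = escaped_bytes_to_uuid(r"o\326\257m\220\227I\326\246pd\302\010(\364\223")

print(snapshot_id)     # 2baeb272-8ca0-46d0-972f-d942f0aa6599
print(restoration_id)  # 6fd6af6d-9097-49d6-a670-64c20828f493
```

Both values match the snapshot and restoration UUIDs in the `list_snapshots` output, so the crashed operation is the same restore that is reported as FAILED.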

Master Logs:

[yugabyte@ip-10-9-79-113 logs]$ grep '2baeb272-8ca0-46d0-972f-d942f0aa6599' * | more
yb-master.INFO:W0825 14:15:51.441174 23525 async_rpc_tasks.cc:292] f9eab7eb62164cfa8953e97ca883983b (table postgresqlkeyvalue [id=0000400b00003000800000000000400c]) Tablet Snapshot Operation RESTORE_ON_TABLET RPC 2baeb272-8ca0-46d0-972f-d942f0aa6599 (task=0x000000000468b1a0, state=kRunning): TS f3d5b9840f384e6e9567dfe0e4350eea: Tablet Snapshot Operation RPC failed for tablet f9eab7eb62164cfa8953e97ca883983b: Network error (yb/util/net/socket.cc:540): recvmsg error: Connection refused (system error 111)
yb-master.INFO:I0825 14:15:51.441238 23525 async_rpc_tasks.cc:373] f9eab7eb62164cfa8953e97ca883983b (table postgresqlkeyvalue [id=0000400b00003000800000000000400c]) Tablet Snapshot Operation RESTORE_ON_TABLET RPC 2baeb272-8ca0-46d0-972f-d942f0aa6599 (task=0x000000000468b1a0, state=kRunning): Scheduling retry with a delay of 8205ms (attempt = 10 / 20)...
yb-master.INFO:W0825 14:15:59.647578 23566 async_rpc_tasks.cc:292] f9eab7eb62164cfa8953e97ca883983b (table postgresqlkeyvalue [id=0000400b00003000800000000000400c]) Tablet Snapshot Operation RESTORE_ON_TABLET RPC 2baeb272-8ca0-46d0-972f-d942f0aa6599 (task=0x000000000468b1a0, state=kRunning): TS f3d5b9840f384e6e9567dfe0e4350eea: Tablet Snapshot Operation RPC failed for tablet f9eab7eb62164cfa8953e97ca883983b: Network error (yb/util/net/socket.cc:540): recvmsg error: Connection refused (system error 111)
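To see how far the master has progressed through its retries, the "Scheduling retry" messages can be extracted from the master log. A rough sketch, assuming the message wording stays as in the excerpt above (`parse_retry` is a hypothetical helper):

```python
import re

# Pattern based on the message text in the master log excerpt above.
RETRY = re.compile(r"Scheduling retry with a delay of (\d+)ms \(attempt = (\d+) / (\d+)\)")


def parse_retry(line):
    """Return delay and attempt counters from a retry log line, or None if absent."""
    match = RETRY.search(line)
    if match is None:
        return None
    delay_ms, attempt, max_attempts = map(int, match.groups())
    return {"delay_ms": delay_ms, "attempt": attempt, "max_attempts": max_attempts}


# Fragment copied from the master log above.
LINE = "(task=0x000000000468b1a0, state=kRunning): Scheduling retry with a delay of 8205ms (attempt = 10 / 20)..."
retry = parse_retry(LINE)
print(retry)
```

In the excerpt the task is at attempt 10 of 20, and each attempt fails with "Connection refused" against the same tserver.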
sridhar-yb added the area/docdb, QA, and status/awaiting-triage labels on Aug 25, 2022
yugabyte-ci added the kind/bug and priority/medium labels on Aug 25, 2022
yugabyte-ci removed the status/awaiting-triage label on Aug 30, 2022