[dr-autosync] switch to backup dc during sync_recover mode, flashback panic due to no leader for one region #6715

Open
mayjiang0203 opened this issue Jun 29, 2023 · 2 comments
Labels: type/feature-request

@mayjiang0203

Bug Report

What did you do?

Switched to the backup DC during sync_recover mode, performed online (unsafe) recovery, and then ran flashback in the new backup cluster.
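
For context, the flow is roughly the following; this is a minimal sketch, assuming the pd-ctl online unsafe recovery commands available since v6.1 and the same tikv-ctl flashback invocation reported below (PD endpoint, store IDs, and timestamp are placeholders):

# 1. In the backup DC, remove the stores lost together with the primary DC (hypothetical store IDs)
tiup ctl:v6.5.3 pd -u http://<backup-pd>:2379 unsafe remove-failed-stores <store-id>,<store-id>
# 2. Wait until the generated recovery plan finishes
tiup ctl:v6.5.3 pd -u http://<backup-pd>:2379 unsafe remove-failed-stores show
# 3. Flash the cluster back to a consistent timestamp (same form as the command shown below)
tiup ctl:v6.5.3 tikv --pd <backup-pd>:2379 flashback -v <resolved-ts>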

What did you expect to see?

What did you see instead?

[root@tiup-0 ~]# tiup ctl:v6.5.3 tikv --pd pd3-peer.e2e-dr-auto-sync-5r-func-tps-1744499-1-8:2379 flashback -v 442490139984003200
Starting component ctl: /root/.tiup/components/ctl/v6.5.3/ctl /root/.tiup/components/ctl/v6.5.3/ctl tikv --pd pd3-peer.e2e-dr-auto-sync-5r-func-tps-1744499-1-8:2379 flashback -v 442490139984003200
thread 'flashback' panicked at 'called Result::unwrap() on an Err value: Other("[components/pd_client/src/util.rs:878]: invalid store ID 0, not found")', cmd/tikv-ctl/src/main.rs:820:62
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
thread 'main' panicked at 'called Result::unwrap() on an Err value: JoinError::Panic(Id(9), ...)', cmd/tikv-ctl/src/main.rs:945:6
Error: exit status 101

The above error is caused by the following region.
[screenshot: the affected region, which has no leader]
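
The region's state can also be checked directly from PD; a hedged example, reusing the PD endpoint from the flashback command above and the region ID 3583 allocated in the PD log below (for a region hit by this issue the leader field is expected to be missing):

tiup ctl:v6.5.3 pd -u http://pd3-peer.e2e-dr-auto-sync-5r-func-tps-1744499-1-8:2379 region 3583

That would also be consistent with the panic message: with no leader elected, the leader's store_id defaults to 0, so tikv-ctl's store lookup in pd_client fails with "invalid store ID 0".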

What version of PD are you using (pd-server -V)?

sh-4.2# ./pd-server -V
Release Version: v6.5.3-hotfix0627
Edition: Community
Git Commit Hash: 22e04ce
Git Branch: heads/refs/tags/v6.5.3-hotfix0627
UTC Build Time: 2023-06-27 03:13:08

This seems to be caused by the cluster switching to async after PD had allocated IDs for a region split but before the split actually started in TiKV.

[2023/06/28 23:25:30.964 +08:00] [INFO] [cluster_worker.go:145] ["alloc ids for region split"] [region-id=3583] [peer-ids="[3584,3585,3586,3587,3588,3589,3590]"]
[2023/06/28 23:25:33.126 +08:00] [WARN] [stream.go:436] ["lost TCP streaming connection with remote peer"] [stream-reader-type="stream MsgApp v2"] [local-member-id=28e83f861b3ec367] [remote-peer-id=8e7df1128a41d7ab] [error="read tcp 10.200.57.142:59160->10.200.35.239:2380: i/o timeout"]
[2023/06/28 23:25:33.127 +08:00] [WARN] [peer_status.go:68] ["peer became inactive (message send to peer failed)"] [peer-id=8e7df1128a41d7ab] [error="failed to read 8e7df1128a41d7ab on stream MsgApp v2 (read tcp 10.200.57.142:59160->10.200.35.239:2380: i/o timeout)"]
[2023/06/28 23:25:33.235 +08:00] [WARN] [stream.go:436] ["lost TCP streaming connection with remote peer"] [stream-reader-type="stream Message"] [local-member-id=28e83f861b3ec367] [remote-peer-id=8e7df1128a41d7ab] [error="read tcp 10.200.57.142:59158->10.200.35.239:2380: i/o timeout"]
[2023/06/28 23:25:37.487 +08:00] [INFO] [audit.go:126] ["Audit Log"] [service-info="{ServiceLabel:SavePersistFile, Method:HTTP/1.1/POST:/pd/api/v1/admin/persist-file/DR_STATE, Component:anonymous, IP:10.200.57.142, StartTime:2023-06-28 23:25:37 +0800 CST, URLParam:{}, BodyParam:{\"state\":\"async_wait\",\"state_id\":3591,\"available_stores\":[2,9,10]}}"]


[2023/06/28 23:25:43.993 +08:00] [INFO] [audit.go:126] ["Audit Log"] [service-info="{ServiceLabel:SavePersistFile, Method:HTTP/1.1/POST:/pd/api/v1/admin/persist-file/DR_STATE, Component:anonymous, IP:10.200.57.142, StartTime:2023-06-28 23:25:43 +0800 CST, URLParam:{}, BodyParam:{\"state\":\"async\",\"state_id\":3592,\"available_stores\":[2,9,10]}}"]
[2023/06/28 23:25:47.645 +08:00] [INFO] [apply.rs:2470] ["split region"] [keys="key 7480000000000000FF535F728000000000FF09867E0000000000FA"] [region="id: 1223 start_key: 7480000000000000FF535F728000000000FF08A7580000000000FA end_key: 7480000000000000FF535F728000000000FF09FA3D0000000000FA region_epoch { conf_ver: 19 version: 69 } peers { id: 1224 store_id: 9 } peers { id: 1225 store_id: 11 } peers { id: 1226 store_id: 10 } peers { id: 1227 store_id: 8 role: Learner } peers { id: 1228 store_id: 2 } peers { id: 1229 store_id: 1 } peers { id: 3222 store_id: 1385 role: Learner }"] [peer_id=1224] [region_id=1223]
@mayjiang0203 added the type/bug label on Jun 29, 2023
@mayjiang0203 (Author)

/assign @disksing
/severity critical

@nolouch (Contributor) commented on Jul 13, 2023

It's a requirement.

ti-chi-bot[bot] added a commit that referenced this issue on Aug 7, 2023
…covery (#6869)

ref #6715

In the DR auto-sync scenario, it is possible that an empty range hole is left in the secondary data center when the primary data center is totally unavailable and the replication mode is "sync_recover" (super rare). In this case, unsafe recovery + flashback is not able to recover the cluster; the fallback is BR + PiTR. To help people identify faster whether the fallback is needed, we add empty-region info to the output of unsafe recovery.

If there are empty regions created, use BR+PiTR. If there are no empty regions, use flashback.
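
With that change in place, the check could look roughly like this; a sketch only, since the exact shape of the unsafe recovery report is not shown here:

# Inspect the unsafe recovery report; if it lists newly created empty regions, fall back to BR + PiTR,
# otherwise flashback is safe to use
tiup ctl:v6.5.3 pd -u http://<backup-pd>:2379 unsafe remove-failed-stores show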

Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>