[dr-autosync] switch to backup dc during sync_recover mode, flashback panic due to no leader for one region #6715

Open
mayjiang0203 opened this issue Jun 29, 2023 · 2 comments
Labels: type/feature-request

@mayjiang0203

Bug Report

What did you do?

Switched to the backup DC during sync_recover mode, performed online (unsafe) recovery, and then ran flashback in the new backup cluster.
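
For context, the flow is roughly the following; this is a minimal sketch, assuming the pd-ctl online unsafe recovery commands available since v6.1 and the same tikv-ctl flashback invocation reported below (PD endpoint, store IDs, and timestamp are placeholders):

# 1. In the backup DC, remove the stores lost together with the primary DC (hypothetical store IDs)
tiup ctl:v6.5.3 pd -u http://<backup-pd>:2379 unsafe remove-failed-stores <store-id>,<store-id>
# 2. Wait until the generated recovery plan finishes
tiup ctl:v6.5.3 pd -u http://<backup-pd>:2379 unsafe remove-failed-stores show
# 3. Flash the cluster back to a consistent timestamp (same form as the command shown below)
tiup ctl:v6.5.3 tikv --pd <backup-pd>:2379 flashback -v <resolved-ts>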

What did you expect to see?

What did you see instead?

[root@tiup-0 ~]# tiup ctl:v6.5.3 tikv --pd pd3-peer.e2e-dr-auto-sync-5r-func-tps-1744499-1-8:2379 flashback -v 442490139984003200
Starting component ctl: /root/.tiup/components/ctl/v6.5.3/ctl /root/.tiup/components/ctl/v6.5.3/ctl tikv --pd pd3-peer.e2e-dr-auto-sync-5r-func-tps-1744499-1-8:2379 flashback -v 442490139984003200
thread 'flashback' panicked at 'called Result::unwrap() on an Err value: Other("[components/pd_client/src/util.rs:878]: invalid store ID 0, not found")', cmd/tikv-ctl/src/main.rs:820:62
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
thread 'main' panicked at 'called Result::unwrap() on an Err value: JoinError::Panic(Id(9), ...)', cmd/tikv-ctl/src/main.rs:945:6
Error: exit status 101

The above error is caused by the following region.
[screenshot: the affected region, which has no leader]
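
The region's state can also be checked directly from PD; a hedged example, reusing the PD endpoint from the flashback command above and the region ID 3583 allocated in the PD log below (for a region hit by this issue the leader field is expected to be missing):

tiup ctl:v6.5.3 pd -u http://pd3-peer.e2e-dr-auto-sync-5r-func-tps-1744499-1-8:2379 region 3583

That would also be consistent with the panic message: with no leader elected, the leader's store_id defaults to 0, so tikv-ctl's store lookup in pd_client fails with "invalid store ID 0".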

What version of PD are you using (pd-server -V)?

sh-4.2# ./pd-server -V
Release Version: v6.5.3-hotfix0627
Edition: Community
Git Commit Hash: 22e04ce
Git Branch: heads/refs/tags/v6.5.3-hotfix0627
UTC Build Time: 2023-06-27 03:13:08

This seems to be caused by the cluster switching to async after PD had allocated IDs for a region split but before the split actually started in TiKV.

[2023/06/28 23:25:30.964 +08:00] [INFO] [cluster_worker.go:145] ["alloc ids for region split"] [region-id=3583] [peer-ids="[3584,3585,3586,3587,3588,3589,3590]"]
[2023/06/28 23:25:33.126 +08:00] [WARN] [stream.go:436] ["lost TCP streaming connection with remote peer"] [stream-reader-type="stream MsgApp v2"] [local-member-id=28e83f861b3ec367] [remote-peer-id=8e7df1128a41d7ab] [error="read tcp 10.200.57.142:59160->10.200.35.239:2380: i/o timeout"]
[2023/06/28 23:25:33.127 +08:00] [WARN] [peer_status.go:68] ["peer became inactive (message send to peer failed)"] [peer-id=8e7df1128a41d7ab] [error="failed to read 8e7df1128a41d7ab on stream MsgApp v2 (read tcp 10.200.57.142:59160->10.200.35.239:2380: i/o timeout)"]
[2023/06/28 23:25:33.235 +08:00] [WARN] [stream.go:436] ["lost TCP streaming connection with remote peer"] [stream-reader-type="stream Message"] [local-member-id=28e83f861b3ec367] [remote-peer-id=8e7df1128a41d7ab] [error="read tcp 10.200.57.142:59158->10.200.35.239:2380: i/o timeout"]
[2023/06/28 23:25:37.487 +08:00] [INFO] [audit.go:126] ["Audit Log"] [service-info="{ServiceLabel:SavePersistFile, Method:HTTP/1.1/POST:/pd/api/v1/admin/persist-file/DR_STATE, Component:anonymous, IP:10.200.57.142, StartTime:2023-06-28 23:25:37 +0800 CST, URLParam:{}, BodyParam:{\"state\":\"async_wait\",\"state_id\":3591,\"available_stores\":[2,9,10]}}"]


[2023/06/28 23:25:43.993 +08:00] [INFO] [audit.go:126] ["Audit Log"] [service-info="{ServiceLabel:SavePersistFile, Method:HTTP/1.1/POST:/pd/api/v1/admin/persist-file/DR_STATE, Component:anonymous, IP:10.200.57.142, StartTime:2023-06-28 23:25:43 +0800 CST, URLParam:{}, BodyParam:{\"state\":\"async\",\"state_id\":3592,\"available_stores\":[2,9,10]}}"]
[2023/06/28 23:25:47.645 +08:00] [INFO] [apply.rs:2470] ["split region"] [keys="key 7480000000000000FF535F728000000000FF09867E0000000000FA"] [region="id: 1223 start_key: 7480000000000000FF535F728000000000FF08A7580000000000FA end_key: 7480000000000000FF535F728000000000FF09FA3D0000000000FA region_epoch { conf_ver: 19 version: 69 } peers { id: 1224 store_id: 9 } peers { id: 1225 store_id: 11 } peers { id: 1226 store_id: 10 } peers { id: 1227 store_id: 8 role: Learner } peers { id: 1228 store_id: 2 } peers { id: 1229 store_id: 1 } peers { id: 3222 store_id: 1385 role: Learner }"] [peer_id=1224] [region_id=1223]
@mayjiang0203 added the type/bug label on Jun 29, 2023
@mayjiang0203 (Author)

/assign @disksing
/severity critical

@nolouch (Contributor) commented on Jul 13, 2023

It's a requirement.

ti-chi-bot[bot] added a commit that referenced this issue on Aug 7, 2023
…covery (#6869)

ref #6715

In the DR auto-sync scenario, it is possible that an empty range hole is left in the secondary data center when the primary data center is totally unavailable and the replication mode is "sync_recover" (super rare). In this case, unsafe recovery + flashback is not able to recover the cluster; the fallback is BR + PiTR. To help people identify faster whether the fallback is needed, we add empty-region info to the output of unsafe recovery.

If there are empty regions created, use BR+PiTR. If there are no empty regions, use flashback.
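
With that change in place, the check could look roughly like this; a sketch only, since the exact shape of the unsafe recovery report is not shown here:

# Inspect the unsafe recovery report; if it lists newly created empty regions, fall back to BR + PiTR,
# otherwise flashback is safe to use
tiup ctl:v6.5.3 pd -u http://<backup-pd>:2379 unsafe remove-failed-stores show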

Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>