test: simulate for worker node heartbeat timeout #7640

Merged Apr 6, 2023 (27 commits)
Changes from 14 commits

Commits
1f54c42
simulate worker node heartbeat timeout
wangrunji0408 Feb 1, 2023
76e32b7
patch madsim
wangrunji0408 Feb 1, 2023
9dc237f
don't kill all meta services
wangrunji0408 Feb 1, 2023
9e57b87
Merge remote-tracking branch 'origin/main' into wrj/long-delay-after-…
wangrunji0408 Feb 7, 2023
6a9909c
switch madsim to crates.io version
wangrunji0408 Feb 7, 2023
529414f
Merge remote-tracking branch 'origin/main' into wrj/long-delay-after-…
wangrunji0408 Feb 27, 2023
2a96450
tune timeout interval
wangrunji0408 Feb 27, 2023
2618218
Merge remote-tracking branch 'origin/main' into wrj/long-delay-after-…
wangrunji0408 Feb 27, 2023
ca2bc37
Merge branch 'main' into wrj/long-delay-after-kill
yezizp2012 Mar 2, 2023
2c17361
fix etcd election simulation
wangrunji0408 Mar 6, 2023
8e45117
Merge branch 'main' into wrj/long-delay-after-kill
wangrunji0408 Mar 8, 2023
ac5d6ae
bump madsim version
wangrunji0408 Mar 8, 2023
e5ee8d4
Merge branch 'main' into wrj/long-delay-after-kill
shanicky Mar 24, 2023
8d2d05f
Merge branch 'main' into wrj/long-delay-after-kill
shanicky Mar 27, 2023
00721a0
Merge branch 'main' into wrj/long-delay-after-kill
shanicky Mar 28, 2023
6f3c51c
Merge branch 'main' into wrj/long-delay-after-kill
shanicky Mar 28, 2023
54eb2b1
Increase meta_nodes in Configuration struct to 3.
shanicky Mar 30, 2023
8e720d9
Merge branch 'main' into wrj/long-delay-after-kill
shanicky Mar 30, 2023
d9e8fa0
Merge branch 'main' into wrj/long-delay-after-kill
shanicky Apr 3, 2023
a7b8303
Merge branch 'main' into wrj/long-delay-after-kill
shanicky Apr 4, 2023
919744e
Define new struct and retry constants for MetaMemberManagement, modif…
shanicky Mar 31, 2023
887abd8
Update pull-request.yml workflow: timeout increased to 25 min, misc c…
shanicky Apr 4, 2023
0d8c283
Simplify async subscription retry in `meta_client.rs`
shanicky Apr 4, 2023
c7acc59
Merge branch 'main' into wrj/long-delay-after-kill
shanicky Apr 6, 2023
2461daa
Merge branch 'peng/fix-timeout-retry' into wrj/long-delay-after-kill
shanicky Apr 6, 2023
f77570b
Downgraded "lru" to 0.7.6.
shanicky Apr 6, 2023
3ca1880
Merge branch 'main' into wrj/long-delay-after-kill
shanicky Apr 6, 2023
2 changes: 1 addition & 1 deletion src/prost/Cargo.toml
@@ -13,7 +13,7 @@ pbjson = "0.5"
 prost = "0.11"
 prost-helpers = { path = "helpers" }
 serde = { version = "1", features = ["derive"] }
-tonic = { version = "0.2.14", package = "madsim-tonic" }
+tonic = { version = "0.2.18", package = "madsim-tonic" }

 [target.'cfg(not(madsim))'.dependencies]
 workspace-hack = { path = "../workspace-hack" }
4 changes: 2 additions & 2 deletions src/tests/simulation/Cargo.toml
@@ -16,12 +16,12 @@ async-trait = "0.1"
 aws-sdk-s3 = { version = "0.2.17", package = "madsim-aws-sdk-s3" }
 clap = { version = "4", features = ["derive"] }
 console = "0.15"
-etcd-client = { version = "0.2.17", package = "madsim-etcd-client" }
+etcd-client = { version = "0.2.18", package = "madsim-etcd-client" }
 futures = { version = "0.3", default-features = false, features = ["alloc"] }
 glob = "0.3"
 itertools = "0.10"
 lru = { git = "https://github.com/risingwavelabs/lru-rs.git", branch = "evict_by_timestamp" }
-madsim = "0.2.17"
+madsim = "0.2.18"
 paste = "1"
 pin-project = "1.0"
 pretty_assertions = "1"
13 changes: 12 additions & 1 deletion src/tests/simulation/src/cluster.rs
@@ -411,6 +411,10 @@ impl Cluster {
                     }
                     nodes.push(format!("meta-{}", i));
                 }
+                // don't kill all meta services
+                if nodes.len() == self.config.meta_nodes {
+                    nodes.truncate(1);
+                }
             }
             if opts.kill_frontend {
                 let rand = rand::thread_rng().gen_range(0..3);
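
The guard added in this hunk can be sketched in isolation. This is a minimal illustration, not code from the PR: the function name `cap_meta_kills` and its signature are hypothetical, and it assumes that `nodes` holds only the meta nodes selected for killing at the point where the guard runs.

```rust
/// Hypothetical helper (not part of the PR): given the meta nodes that were
/// selected for killing, make sure the whole meta group is never killed at once.
fn cap_meta_kills(mut selected: Vec<String>, meta_nodes: usize) -> Vec<String> {
    // If every meta node was selected, keep only the first entry in the kill list.
    if selected.len() == meta_nodes {
        selected.truncate(1);
    }
    selected
}
```

With three meta nodes and all three selected, only `meta-1` stays in the kill list; a partial selection is returned unchanged. With a single meta node the guard has no effect, because `truncate(1)` keeps that one node.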
@@ -454,7 +458,14 @@ impl Cluster {
         tracing::info!("kill {name}");
         madsim::runtime::Handle::current().kill(name);

-        let t = rand::thread_rng().gen_range(Duration::from_secs(0)..Duration::from_secs(1));
+        let mut t =
+            rand::thread_rng().gen_range(Duration::from_secs(0)..Duration::from_secs(1));
+        // has a small chance to restart after a long time
+        // so that the node is expired and removed from the cluster

Review comment from jon-chuang (Contributor), Mar 8, 2023:
I guess the node will be assigned a new worker id when rejoining?

Reply from the PR author:
Yes

+        if rand::thread_rng().gen_bool(0.1) {
+            // max_heartbeat_interval_secs = 60
+            t += Duration::from_secs(20);
+        }
         tokio::time::sleep(t).await;
         tracing::info!("restart {name}");
         madsim::runtime::Handle::current().restart(name);
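
The restart delay computed in this hunk can be sketched without the random number generator. This is an illustration only: `restart_delay` and `LONG_RESTART_EXTRA` are hypothetical names, the `jitter` argument stands in for the `gen_range` result in [0s, 1s), and the `long_restart` flag stands in for `gen_bool(0.1)`. Note that the inline comment in the diff mentions `max_heartbeat_interval_secs = 60`, while the simulation config changed in this PR sets it to 10.

```rust
use std::time::Duration;

/// Extra delay added on the rare long-restart path (20 seconds in the diff).
const LONG_RESTART_EXTRA: Duration = Duration::from_secs(20);

/// Hypothetical helper: combine the base jitter with the optional long delay.
/// `long_restart` stands in for the 10% random branch in the diff.
fn restart_delay(jitter: Duration, long_restart: bool) -> Duration {
    if long_restart {
        jitter + LONG_RESTART_EXTRA
    } else {
        jitter
    }
}
```

On the long path the delay is always at least 20 seconds, which exceeds a 10-second heartbeat interval but not a 60-second one; on the normal path it stays under one second.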
4 changes: 4 additions & 0 deletions src/tests/simulation/src/risingwave.toml
@@ -2,6 +2,10 @@
 #
 # Note: this file is embedded in the binary and cannot be changed without recompiling.

+[meta]
+# a relatively small number to make it easier to timeout
+max_heartbeat_interval_secs = 10
+
 [system]
 barrier_interval_ms = 250
 checkpoint_frequency = 4

shanicky marked the conversation on `max_heartbeat_interval_secs` as resolved.
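
To show why a small `max_heartbeat_interval_secs` makes expiry easier to trigger, here is a rough sketch of a heartbeat expiry check. It is an assumption about how such a check could work, not RisingWave's actual meta service code: the `Worker` struct, the `expired_workers` function, and the simulated clock represented as a `Duration` since start are all hypothetical.

```rust
use std::time::Duration;

/// Value from the `[meta]` section added to the simulation config.
const MAX_HEARTBEAT_INTERVAL: Duration = Duration::from_secs(10);

/// Hypothetical record of a worker's most recent heartbeat, measured on a
/// simulated clock that starts at zero.
struct Worker {
    id: u32,
    last_heartbeat: Duration,
}

/// Return the ids of workers that have been silent for longer than `max_interval`.
fn expired_workers(workers: &[Worker], now: Duration, max_interval: Duration) -> Vec<u32> {
    workers
        .iter()
        // saturating_sub avoids a panic if a heartbeat is stamped after `now`
        .filter(|w| now.saturating_sub(w.last_heartbeat) > max_interval)
        .map(|w| w.id)
        .collect()
}
```

Under this model, a worker that was killed and stays down for 20 seconds misses the 10-second window and is reported as expired, while a worker that restarts within a second is not.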
2 changes: 1 addition & 1 deletion src/tests/simulation/src/slt.rs
@@ -158,7 +158,7 @@ pub async fn run_slt_task(cluster: Arc<Cluster>, glob: &str, opts: &KillOpts) {
             continue;
         }

-        let should_kill = thread_rng().gen_ratio((opts.kill_rate * 1000.0) as u32, 1000);
+        let should_kill = thread_rng().gen_bool(opts.kill_rate as f64);
         // spawn a background task to kill nodes
         let handle = if should_kill {
             let cluster = cluster.clone();
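
One visible difference between the two forms is how they treat small rates. The sketch below isolates the arithmetic only and does not call `rand`; the helper names are hypothetical, and `kill_rate` is assumed to be an `f32` (the diff casts it to `f64`). Whether this difference motivated the change is not stated in the PR.

```rust
/// Numerator the old code passed to `gen_ratio(_, 1000)`.
fn ratio_numerator(kill_rate: f32) -> u32 {
    // `as u32` truncates toward zero, so rates below 0.001 become 0.
    (kill_rate * 1000.0) as u32
}

/// Probability the new code passes to `gen_bool`.
fn bool_probability(kill_rate: f32) -> f64 {
    // Widening f32 to f64 keeps the value, so small rates stay non-zero.
    kill_rate as f64
}
```

A rate of 0.0005 gives a numerator of 0 in the old form, so the kill branch could never fire, whereas the new form keeps a small positive probability. Both `gen_ratio` and `gen_bool` panic on out-of-range input, so `kill_rate` still needs to stay within [0, 1].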