Track UJ node progress #2076

d-helios · 2023-10-10T10:57:02Z

Motivation

Monitoring should be able to identify nodes that are too slow to join the cluster, or perhaps simply stuck.

Possible implementation

Once a new node is created and Scylla is started, the monitoring stack should start scraping metrics from Scylla and the operating system.
If the new node is in the UJ (Up/Joining) state:

verify network traffic - scylla_node_network_receive_bytes_total
verify disk space utilisation - scylla_node_filesystem_total_avail_bytes

If the diff for the last X minutes stays lower than Y over Z minutes, trigger an alert.
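
A minimal sketch of the stall check proposed above, assuming the metric samples have already been fetched as `(timestamp, value)` pairs. The function name `is_stalled` and its parameters are hypothetical, not part of the monitoring stack; `window_s` plays the role of X and `min_delta` the role of Y. Sustaining the condition over Z minutes is left to the caller.

```python
from typing import Sequence, Tuple

# (unix timestamp in seconds, metric value), oldest first.
Sample = Tuple[float, float]


def is_stalled(samples: Sequence[Sample], window_s: float, min_delta: float) -> bool:
    """Return True if the metric changed by less than min_delta over the last window_s seconds.

    Hypothetical helper: it works for a growing counter (received bytes)
    and for a shrinking gauge (available disk space), because it compares
    the absolute change.
    """
    if len(samples) < 2:
        return False  # not enough data to compute a diff

    latest_ts, latest_val = samples[-1]

    # Do not report a stall before a full window of history exists.
    if latest_ts - samples[0][0] < window_s:
        return False

    # Keep only the samples that fall inside the window.
    in_window = [(ts, val) for ts, val in samples if ts >= latest_ts - window_s]
    if len(in_window) < 2:
        return False

    first_val = in_window[0][1]
    return abs(latest_val - first_val) < min_delta
```

For example, with a 120 second window and a 500 byte threshold, a node whose receive counter moves from 1000 to 1010 would be reported as stalled, while one that moves from 1000 to 3000 would not.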

amnonh · 2023-10-10T17:11:56Z

@d-helios luckily, I've added a metric for the node state, scylla_node_operation_mode, so we can check whether a node has been in joining mode for more than X minutes
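
A rough sketch of that check, assuming samples of scylla_node_operation_mode are available as `(timestamp, mode)` pairs. The function `joining_too_long` is hypothetical, and the numeric value that represents joining mode is passed in as a parameter because this thread does not state it.

```python
from typing import Sequence, Tuple

# (unix timestamp in seconds, operation mode value), oldest first.
ModeSample = Tuple[float, int]


def joining_too_long(samples: Sequence[ModeSample], joining_mode: int, max_joining_s: float) -> bool:
    """Return True if the node has been continuously in joining_mode for more than max_joining_s.

    Hypothetical helper: joining_mode is whatever value the metric uses
    for the joining state; it is an input here, not a known constant.
    """
    if not samples:
        return False

    latest_ts, latest_mode = samples[-1]
    if latest_mode != joining_mode:
        return False  # the node is not joining right now

    # Walk backwards to find where the current joining streak started.
    start_ts = latest_ts
    for ts, mode in reversed(samples):
        if mode != joining_mode:
            break
        start_ts = ts

    return latest_ts - start_ts > max_joining_s
```

In an alerting rule the same idea would usually be expressed as a condition on the metric that has to hold for a given duration; the sketch only illustrates the logic.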

amnonh · 2023-10-11T09:46:36Z

@d-helios note that this issue covers monitoring only; if it's cloud related, there should be a cloud issue

d-helios added the enhancement label Oct 10, 2023

amnonh mentioned this issue Oct 12, 2023

prometheus.rules.yml warn when a node is in joining mode for a long time #2079

Merged

amnonh closed this as completed in #2079 Oct 12, 2023

amnonh added this to the Monitoring 4.5 milestone Oct 12, 2023
