diff --git a/content/operate/rs/clusters/logging/alerts-events.md b/content/operate/rs/clusters/logging/alerts-events.md index fc30d6a221..62f8282b14 100644 --- a/content/operate/rs/clusters/logging/alerts-events.md +++ b/content/operate/rs/clusters/logging/alerts-events.md @@ -12,66 +12,69 @@ weight: 50 The following alerts and events can appear in `syslog` and the Cluster Manager UI logs. -| Alert/Event | UI message | Severity | Notes | -|-----------------------------------|----------------------------------------------------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------| -| aof_slow_disk_io | Redis performance is degraded as a result of disk I/O limits | True: error, False: info | node alert | -| authentication_err | | error | bdb event; Replica of - error authenticating with the source database | -| backup_delayed | Periodic backup has been delayed for longer than `` minutes | True: warning, False: info | bdb alert; Has threshold parameter in the data: section of the log entry. 
| -| backup_failed | | error | bdb event | -| backup_started | | info | bdb event | -| backup_succeeded | | info | bdb event | -| bdb_created | | info | bdb event | -| bdb_deleted | | info | bdb event | -| bdb_updated | | info | bdb event; Indicates that a bdb configuration has been updated | -| checks_error | | error | node event; Indicates that one or more node checks have failed | -| cluster_updated | | info | cluster event; Indicates that cluster settings have been updated | -| compression_unsup_err | | error | bdb event; Replica of - Compression not supported by sync destination | -| crossslot_err | | error | bdb event; Replica of - sharded destination does not support operation executed on source | -| cpu_utilization | CPU utilization has reached ``% | True: warning, False: info | node alert; Has global_threshold parameter in the key/value section of the log entry. | -| even_node_count | True high availability requires an odd number of nodes | True: warning, False: info | cluster alert | -| ephemeral_storage | Ephemeral storage has reached ``% of its capacity | True: warning, False: info | node alert; Has global_threshold parameter in the key/value section of the log entry. | -| export_failed | | error | bdb event | -| export_started | | info | bdb event | -| export_succeeded | | info | bdb event | -| failed | Node failed | critical | node alert | -| free_flash | Flash storage has reached ``% of its capacity | True: warning, False: info | node alert; Has global_threshold parameter in the key/value section of the log entry. | -| high_latency | Latency is higher than `` milliseconds | True: warning, False: info | bdb alert; Has threshold parameter in the key/value section of the log entry. | -| high_syncer_lag | Replica of - sync lag is higher than `` seconds | True: warning, False: info | bdb alert; Has threshold parameter in the key/value section of the log entry. 
| -| high_throughput | Throughput is higher than `` RPS (requests per second) | True: warning, False: info | bdb alert; Has threshold parameter in the key/value section of the log entry. | -| import_failed | | error | bdb event | -| import_started | | info | bdb event | -| import_succeeded | | info | bdb event | -| inconsistent_redis_sw | Not all databases are running the same open source version | True: warning, False: info | cluster alert | -| inconsistent_rl_sw | Not all nodes in the cluster are running the same Redis Labs Enterprise Cluster version | True: warning, False: info | cluster alert | -| insufficient_disk_aofrw | Node has insufficient disk space for AOF rewrite | True: error, False: info | node alert | -| internal_bdb | Issues with internal cluster databases | True: warning, False: info | cluster alert | -| license_added | | info | cluster event | -| license_deleted | | info | cluster event | -| license_updated | | info | cluster event | -| low_throughput | Throughput is lower than `` RPS (requests per second) | True: warning, False: info | bdb alert; Has threshold parameter in the key/value section of the log entry. | -| memory | Node memory has reached ``% of its capacity | True: warning, False: info | node alert; Has global_threshold parameter in the key/value section of the log entry. | -| multiple_nodes_down | Multiple cluster nodes are down - this might cause data loss | True: warning, False: info | cluster alert | -| net_throughput | Network throughput has reached ``MB/s | True: warning, False: info | node alert; Has global_threshold parameter in the key/value section of the log entry. | -| node_abort_remove_request | | info | node event | -| node_joined | Node joined | info | cluster event | -| node_operation_failed | Node operation failed | error | cluster event | -| node_remove_abort_completed | Node removed | info | cluster event; The remove node is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail. 
| -| node_remove_abort_failed | Node removed | error | cluster event; The remove node is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail. | -| node_remove_completed | Node removed | info | cluster event; The remove node is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail. | -| node_remove_failed | Node removed | error | cluster event; The remove node is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail. | -| node_remove_request | | info | node event | -| ocsp_query_failed | Failed querying OCSP server | True: error, False: info | cluster alert | -| ocsp_status_revoked | OCSP status revoked | True: error, False: info | cluster alert | -| oom_err | | error | bdb event; Replica of - Replication source/target out of memory | -| persistent_storage | Persistent storage has reached ``% of its capacity | True: warning, False: info | node alert; Has global_threshold parameter in the key/value section of the log entry. | -| ram_dataset_overhead | RAM Dataset overhead in a shard has reached ``% of its RAM limit | True: warning, False: info | bdb alert; Has threshold parameter in the key/value section of the log entry. | -| ram_overcommit | Cluster capacity is less than total memory allocated to its databases | True: error, False: info | cluster alert | -| ram_values | Percent of values in a shard's RAM is lower than ``% of its key count | True: warning, False: info | bdb alert; Has threshold parameter in the key/value section of the log entry. | -| shard_num_ram_values | Number of values in a shard's RAM is lower than `` values | True: warning, False: info | bdb alert; Has threshold parameter in the key/value section of the log entry. | -| size | Dataset size has reached ``% of the memory limit | True: warning, False: info | bdb alert; Has threshold parameter in the key/value section of the log entry. 
| -| syncer_connection_error | | error | bdb alert | -| syncer_general_error | | error | bdb alert | -| too_few_nodes_for_replication | Database replication requires at least two nodes in cluster | True: warning, False: info | cluster alert | -| user_created | | info | user event | -| user_deleted | | info | user event | -| user_updated | | info | user event; Indicates that a user configuration has been updated | +## Alerts + +| Alert | UI message | Severity | Notes | +|-------|------------|----------|-------| +| aof_slow_disk_io | Redis performance is degraded as a result of disk I/O limits | True: error, False: info | Node alert | +| authentication_err | Error authenticating with the source database | error | BDB event | +| backup_delayed | Periodic backup has been delayed for longer than `` minutes | True: warning, False: info | BDB alert; Has threshold parameter in the data section of the log entry | +| cpu_utilization | CPU utilization has reached ``% | True: warning, False: info | Node alert; Has global_threshold parameter in the key/value section of the log entry | +| even_node_count | True high availability requires an odd number of nodes | True: warning, False: info | Cluster alert | +| ephemeral_storage | Ephemeral storage has reached ``% of its capacity | True: warning, False: info | Node alert; Has global_threshold parameter in the key/value section of the log entry | +| free_flash | Flash storage has reached ``% of its capacity | True: warning, False: info | Node alert; Has global_threshold parameter in the key/value section of the log entry | +| high_latency | Latency is higher than `` milliseconds | True: warning, False: info | BDB alert; Has threshold parameter in the key/value section of the log entry | +| high_syncer_lag | Sync lag is higher than `` seconds | True: warning, False: info | BDB alert; Has threshold parameter in the key/value section of the log entry | +| high_throughput | Throughput is higher than `` RPS (requests per second) | True: 
warning, False: info | BDB alert; Has threshold parameter in the key/value section of the log entry | +| inconsistent_redis_sw | Not all databases are running the same open source version | True: warning, False: info | Cluster alert | +| inconsistent_rl_sw | Not all nodes in the cluster are running the same Redis Labs Enterprise Cluster version | True: warning, False: info | Cluster alert | +| insufficient_disk_aofrw | Node has insufficient disk space for AOF rewrite | True: error, False: info | Node alert | +| memory | Node memory has reached ``% of its capacity | True: warning, False: info | Node alert; Has global_threshold parameter in the key/value section of the log entry | +| multiple_nodes_down | Multiple cluster nodes are down - this might cause data loss | True: warning, False: info | Cluster alert | +| net_throughput | Network throughput has reached `` MB/s | True: warning, False: info | Node alert; Has global_threshold parameter in the key/value section of the log entry | +| ocsp_query_failed | Failed querying OCSP server | True: error, False: info | Cluster alert | +| ocsp_status_revoked | OCSP status revoked | True: error, False: info | Cluster alert | +| persistent_storage | Persistent storage has reached ``% of its capacity | True: warning, False: info | Node alert; Has global_threshold parameter in the key/value section of the log entry | +| ram_dataset_overhead | RAM Dataset overhead in a shard has reached ``% of its RAM limit | True: warning, False: info | BDB alert; Has threshold parameter in the key/value section of the log entry | +| ram_overcommit | Cluster capacity is less than total memory allocated to its databases | True: error, False: info | Cluster alert | +| ram_values | Percent of values in a shard's RAM is lower than ``% of its key count | True: warning, False: info | BDB alert; Has threshold parameter in the key/value section of the log entry | +| shard_num_ram_values | Number of values in a shard's RAM is lower than `` values | 
True: warning, False: info | BDB alert; Has threshold parameter in the key/value section of the log entry | +| size | Dataset size has reached ``% of the memory limit | True: warning, False: info | BDB alert; Has threshold parameter in the key/value section of the log entry | +| syncer_connection_error | Syncer connection error | error | BDB alert | +| syncer_general_error | Syncer general error | error | BDB alert | +| too_few_nodes_for_replication | Database replication requires at least two nodes in cluster | True: warning, False: info | Cluster alert | + +## Events + +| Event | UI message | Severity | Notes | +|-------|------------|----------|-------| +| backup_failed | Backup failed | error | BDB event | +| backup_started | Backup started | info | BDB event | +| backup_succeeded | Backup succeeded | info | BDB event | +| bdb_created | Database created | info | BDB event | +| bdb_deleted | Database deleted | info | BDB event | +| bdb_updated | Database updated | info | BDB event; Indicates that a BDB configuration has been updated | +| checks_error | Node checks error | error | Node event; Indicates that one or more node checks have failed | +| cluster_updated | Cluster settings updated | info | Cluster event; Indicates that cluster settings have been updated | +| compression_unsup_err | Compression not supported by sync destination | error | BDB event | +| crossslot_err | Sharded destination does not support operation executed on source | error | BDB event | +| export_failed | Export failed | error | BDB event | +| export_started | Export started | info | BDB event | +| export_succeeded | Export succeeded | info | BDB event | +| import_failed | Import failed | error | BDB event | +| import_started | Import started | info | BDB event | +| import_succeeded | Import succeeded | info | BDB event | +| license_added | License added | info | Cluster event | +| license_deleted | License deleted | info | Cluster event | +| license_updated | License updated | info | 
Cluster event | +| node_abort_remove_request | Node abort remove request | info | Node event | +| node_joined | Node joined | info | Cluster event | +| node_operation_failed | Node operation failed | error | Cluster event | +| node_remove_abort_completed | Node remove abort completed | info | Cluster event; Node remove is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail | +| node_remove_abort_failed | Node remove abort failed | error | Cluster event; Node remove is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail | +| node_remove_completed | Node removed | info | Cluster event; Node remove is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail | +| node_remove_failed | Node removed | error | Cluster event; Node remove is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail | +| node_remove_request | Node remove request | info | Node event | +| user_created | User created | info | User event | +| user_deleted | User deleted | info | User event | +| user_updated | User updated | info | User event; Indicates that a user configuration has been updated | diff --git a/content/operate/rs/monitoring/v1_monitoring.md b/content/operate/rs/monitoring/v1_monitoring.md index fcd3cc8ce2..c556053468 100644 --- a/content/operate/rs/monitoring/v1_monitoring.md +++ b/content/operate/rs/monitoring/v1_monitoring.md @@ -70,6 +70,10 @@ We recommend migrating to the metrics stream engine for enhanced accuracy, scala If you are already using the existing scraping endpoint for integration, follow [this guide]({{}}) to transition and try the new engine. It is possible to scrape both existing and new endpoints simultaneously, allowing advanced dashboard preparation and a smooth transition. 
+### Transition cluster manager alerts + +As part of Redis Enterprise Software's transition to the [new metrics stream engine]({{}}), some internal cluster manager alerts were deprecated in favor of external monitoring solutions. See the [alerts transition plan]({{}}) for guidance. + ## Cluster manager metrics You can see the metrics of the cluster in: @@ -98,7 +102,7 @@ In **Cluster > Alert Settings**, you can enable alerts for node or cluster eve Configured alerts are shown: -- As a notification on the status icon ( {{< image filename="/images/rs/icons/icon_warning.png#no-click" alt="Warning" width="18px" class="inline" >}} ) for the node and cluster +- As a notification on the status icon ( {{< inline-icon filename="/images/rs/icons/icon_warning.png#no-click" alt="Warning" width="18px" >}} ) for the node and cluster - In the **log** - In email notifications, if you configure [email alerts](#send-alerts-by-email) @@ -118,7 +122,7 @@ For each database, you can enable alerts for database events, such as high memor Configured alerts are shown: -- As a notification on the status icon ( {{< image filename="/images/rs/icons/icon_warning.png#no-click" alt="Warning" width="18px" class="inline" >}} ) for the database +- As a notification on the status icon ( {{< inline-icon filename="/images/rs/icons/icon_warning.png#no-click" alt="Warning" width="18px" >}} ) for the database - In the **log** - In emails, if you configure [email alerts](#send-alerts-by-email) diff --git a/content/operate/rs/references/alerts/_index.md b/content/operate/rs/references/alerts/_index.md new file mode 100644 index 0000000000..0e803dc672 --- /dev/null +++ b/content/operate/rs/references/alerts/_index.md @@ -0,0 +1,62 @@ +--- +Title: Alerts +alwaysopen: false +categories: +- docs +- operate +- rs +- rc +description: Documents the alerts that are tracked with Redis Enterprise Software. 
+hideListLinks: true +linkTitle: Alerts +weight: $weight +--- + +Cluster alerts are triggered based on thresholds applied to stored metrics. + +## Cluster alerts + +In **Cluster > Alert Settings**, you can enable alerts for node or cluster events, such as high memory usage or throughput. + +Configured alerts are shown: + +- As a notification on the status icon ( {{< inline-icon filename="/images/rs/icons/icon_warning.png#no-click" alt="Warning" width="18px" >}} ) for the node and cluster +- In the **log** +- In email notifications, if you configure [email alerts](#send-alerts-by-email) + +{{< note >}} +If you enable alerts for "Node joined" or "Node removed" actions, +you must also enable "Receive email alerts" so that the notifications are sent. +{{< /note >}} + +To enable alerts for a cluster: + +1. In **Cluster > Alert Settings**, click **Edit**. +1. Select the alerts that you want to show for the cluster and click **Save**. + +## Database alerts + +For each database, you can enable alerts for database events, such as high memory usage or throughput. + +Configured alerts are shown: + +- As a notification on the status icon ( {{< inline-icon filename="/images/rs/icons/icon_warning.png#no-click" alt="Warning" width="18px" >}} ) for the database +- In the **log** +- In emails, if you configure [email alerts](#send-alerts-by-email) + +To enable alerts for a database: + +1. In **Configuration** for the database, click **Edit**. +1. Select the **Alerts** section to open it. +1. Select the alerts that you want to show for the database and click **Save**. + +## Send alerts by email + +To send cluster and database alerts by email: + +1. In **Cluster > Alert Settings**, click **Edit**. +1. Select **Set an email** to configure the [email server settings]({{< relref "/operate/rs/clusters/configure/cluster-settings#configure-email-server-settings" >}}). +1. In **Configuration** for the database, click **Edit**. +1. Select the **Alerts** section to open it. +1. 
Select **Receive email alerts** and click **Save**. +1. In **Access Control**, select the [database and cluster alerts]({{< relref "/operate/rs/security/access-control/manage-users" >}}) that you want each user to receive. diff --git a/content/operate/rs/references/alerts/alerts-v1-to-v2.md b/content/operate/rs/references/alerts/alerts-v1-to-v2.md new file mode 100644 index 0000000000..adba9f1404 --- /dev/null +++ b/content/operate/rs/references/alerts/alerts-v1-to-v2.md @@ -0,0 +1,23 @@ +--- +Title: Transition cluster manager alerts to Prometheus alerts +alwaysopen: false +categories: +- docs +- operate +- rs +description: Transition from internal cluster manager alerts to external monitoring alerts using Prometheus. +linkTitle: Transition cluster manager alerts to Prometheus +weight: 50 +--- + +As Redis Enterprise Software transitions from the [deprecated monitoring system]({{}}) to the [new metrics stream engine]({{}}), some internal cluster manager alerts were deprecated in favor of external monitoring solutions. 
+ +You can use the following table to transition from the deprecated alerts and set up equivalent alerts in Prometheus with [PromQL (Prometheus Query Language)](https://prometheus.io/docs/prometheus/latest/querying/basics/): + +| Cluster manager alert | Equivalent PromQL | Description | +|-----------------------|-------------------|-------------| +| BdbSizeAlert | `sum by(db, cluster) (redis_server_used_memory) / sum by(db, cluster) (redis_server_maxmemory) > 0.8` | Redis server memory usage exceeds 80% | +| NodeMemoryAlert | `(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.7` | Node memory usage exceeds 70% | +| NodeFreeFlashAlert | `(node_available_flash_bytes - node_bigstore_free_bytes) / node_available_flash_bytes > 0.7` | Node flash storage usage exceeds 70% | +| NodeEphemeralStorageAlert | `(node_ephemeral_storage_avail_bytes - node_ephemeral_storage_free_bytes) / node_ephemeral_storage_avail_bytes > 0.7` | Node ephemeral storage usage exceeds 70% | +| NodePersistentStorageAlert | `(node_persistent_storage_avail_bytes - node_persistent_storage_free_bytes) / node_persistent_storage_avail_bytes > 0.7` | Node persistent storage usage exceeds 70% | diff --git a/content/operate/rs/release-notes/rs-8-0-releases/rs-8-0-tba.md b/content/operate/rs/release-notes/rs-8-0-releases/rs-8-0-tba.md index 42437ce820..715e55b892 100644 --- a/content/operate/rs/release-notes/rs-8-0-releases/rs-8-0-tba.md +++ b/content/operate/rs/release-notes/rs-8-0-releases/rs-8-0-tba.md @@ -136,6 +136,8 @@ The [metrics stream engine]({{}}), or use new preconfigured dashboards when they become available. +- As part of the transition to the metrics stream engine, some internal cluster manager alerts were deprecated in favor of external monitoring solutions. See the [alerts transition plan]({{}}) for guidance. 
+ ### Enhancements - Module management enhancements: diff --git a/layouts/shortcodes/inline-icon.html b/layouts/shortcodes/inline-icon.html new file mode 100644 index 0000000000..17c8f966ab --- /dev/null +++ b/layouts/shortcodes/inline-icon.html @@ -0,0 +1,13 @@ +{{ $fname := .Get "filename" | strings.TrimLeft "/" }} +{{ $cleanFname := $fname | strings.TrimSuffix "#no-click" }} +{{ $noClick := strings.HasSuffix $fname "#no-click" }} +{{ if not (hasPrefix $cleanFname "/") }} + {{ .Scratch.Set "file" (printf "/%s" $cleanFname) }} +{{ else }} + {{ .Scratch.Set "file" $cleanFname }} +{{ end }} +{{.}}
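Note on the PromQL table added in `alerts-v1-to-v2.md`: the expressions are given on their own, without showing how they might be loaded into Prometheus. The sketch below is one possible way to wrap them in a standard Prometheus alerting rules file. The alert names, expressions, and descriptions are copied from the table; the group name, the `for` duration, and the `severity` label are illustrative assumptions and are not part of this change.

```python
import json

# (alert name, PromQL expression, description) copied from the transition table.
ALERTS = [
    ("BdbSizeAlert",
     "sum by(db, cluster) (redis_server_used_memory) / "
     "sum by(db, cluster) (redis_server_maxmemory) > 0.8",
     "Redis server memory usage exceeds 80%"),
    ("NodeMemoryAlert",
     "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / "
     "node_memory_MemTotal_bytes > 0.7",
     "Node memory usage exceeds 70%"),
    ("NodeFreeFlashAlert",
     "(node_available_flash_bytes - node_bigstore_free_bytes) / "
     "node_available_flash_bytes > 0.7",
     "Node flash storage usage exceeds 70%"),
    ("NodeEphemeralStorageAlert",
     "(node_ephemeral_storage_avail_bytes - node_ephemeral_storage_free_bytes) / "
     "node_ephemeral_storage_avail_bytes > 0.7",
     "Node ephemeral storage usage exceeds 70%"),
    ("NodePersistentStorageAlert",
     "(node_persistent_storage_avail_bytes - node_persistent_storage_free_bytes) / "
     "node_persistent_storage_avail_bytes > 0.7",
     "Node persistent storage usage exceeds 70%"),
]


def render_rules(alerts, group="redis-enterprise-transition",
                 hold="5m", severity="warning"):
    """Render alerts as the text of a Prometheus alerting rules file.

    group, hold, and severity are hypothetical defaults chosen for this sketch.
    """
    lines = ["groups:", f"  - name: {group}", "    rules:"]
    for name, expr, summary in alerts:
        lines += [
            f"      - alert: {name}",
            # json.dumps yields a double-quoted string, which is also a valid
            # YAML scalar for these ASCII-only expressions.
            f"        expr: {json.dumps(expr)}",
            f"        for: {hold}",
            "        labels:",
            f"          severity: {severity}",
            "        annotations:",
            f"          summary: {json.dumps(summary)}",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_rules(ALERTS), end="")
```

The output can be saved to a file and referenced from `rule_files` in the Prometheus configuration. The thresholds (0.8 and 0.7) come from the table and would likely need tuning per deployment.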