Feature/k8s mongodb health #103
Merged: jon-funk merged 9 commits into runwhen-contrib:main from stewartshea:feature/k8s-mongodb-health (Mar 30, 2023)
Commits:
- 12b338c fixup sli (stewartshea)
- 3881309 add connection score (stewartshea)
- ed5439d Add member state check (stewartshea)
- 04b78e4 Merge branch 'main' into feature/k8s-mongodb-health (stewartshea)
- 56f67d1 add max/min to json metric util (stewartshea)
- 3ad5a26 update task for repl lag (stewartshea)
- 0705f56 numerious tweaks to the calculation and cleanup (stewartshea)
- e791949 tweaks and readme docs (stewartshea)
- f50b444 fix md typo (stewartshea)
@@ -0,0 +1,137 @@
# MongoDB Health Google Managed Prometheus (promql)
This codebundle provides an opinionated health check for MongoDB instances. It requires that the MongoDB Prometheus exporter (by Percona) is configured appropriately and that its metrics are being sent to Google Managed Prometheus.

## Service Level Indicator
The SLI codebundle provides a composite health check that produces a score between 0 (unhealthy) and 1 (healthy). A value below 1 indicates that at least one of the following health checks produced a score of 0. The score is derived by adding up the score of each check and dividing by the total number of checks.

Evaluations performed in this health check:

- Instance Status: Is the expected number of members running for each instance?
- Connection Utilization Rate: Is the current connection utilization (current/max) above the desired threshold for any instance?
- Member Health: Are any of the members reporting an unhealthy state?
- Replication Lag: Is the largest replication lag for any cluster above the desired threshold?
- Queue Size: Is the size of the queue (reads or writes) above the desired threshold?
- Assertion Rate: Is the rate of assertions over the last 5m above the desired threshold for any instance?

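The averaging described above can be sketched in a few lines of Python. This is a minimal illustration only, not the codebundle's implementation: the function name `composite_score`, the dictionary keys, and the sample results are all hypothetical.

```python
from typing import Dict


def composite_score(check_scores: Dict[str, int]) -> float:
    """Average binary check scores (1 = pass, 0 = fail), rounded to 2 decimals."""
    if not check_scores:
        raise ValueError("at least one check is required")
    return round(sum(check_scores.values()) / len(check_scores), 2)


# Hypothetical results: every check passes except replication lag.
example = {
    "instance_status": 1,
    "connection_utilization": 1,
    "member_health": 1,
    "replication_lag": 0,
    "queue_size": 1,
    "assertion_rate": 1,
}

score = composite_score(example)
print(score)  # 5 of 6 checks pass -> 0.83
```

Because every check carries the same weight, a single failing check out of six lowers the score to 0.83 regardless of which check failed.
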
This SLI supports measuring health across multiple instances and often reports the maximum value obtained across instances. The PROMQL_FILTER variable can be used to add specific label matchers for query filtering as necessary.

> For those not looking for composite scores, the [gcp-opssuite-promql](https://docs.runwhen.com/public/v/codebundles/gcp-opssuite-promql) codebundle can be used to create SLIs for any specific metric.

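To illustrate how a PROMQL_FILTER value ends up inside a query, here is a small Python sketch that substitutes a label-matcher string into the instance-status query used by the task file in this PR. The helper name `build_up_query` is hypothetical; the real substitution is done by Robot Framework variable expansion.

```python
def build_up_query(promql_filter: str) -> str:
    """Insert a label-matcher string into the instance-status query template."""
    # Doubled braces produce literal '{' and '}' around the matcher string.
    selector = f"mongodb_up{{{promql_filter}}}"
    return f"sum({selector})/(count(count by (instance) ({selector})))"


query = build_up_query('namespace="mongodb-test"')
print(query)
```

The filter is pasted verbatim between the braces of each selector, so it must be a valid comma-separated list of PromQL label matchers.
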
## Use Cases
### Use Case: SLI: MongoDB Community Edition Health for All Instances in a Kubernetes Namespace
The following use case provides an example configuration in which the SLI provides a composite score across multiple MongoDB clusters in the same namespace.

> For a full walkthrough of setting up an environment with MongoDB Community Edition, the Percona MongoDB Prometheus Exporter, and Google Managed Prometheus, please view [the complete docs located here](https://docs.runwhen.com/public/use-cases/kubernetes-environments/measuring-mongodb-health-with-promql).

- Example MongoDB Community edition object:
```
apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
  name: sandbox-mongodb
  namespace: mongodb-test
spec:
  members: 3
  type: ReplicaSet
  version: "4.4.0"
  security:
    authentication:
      modes: ["SCRAM"]
  users:
    - name: my-user
      db: admin
      passwordSecretRef: # a reference to the secret that will be used to generate the user's password
        name: my-user-password
      roles:
        - name: clusterAdmin
          db: admin
        - name: userAdminAnyDatabase
          db: admin
      scramCredentialsSecretName: my-scram
  additionalMongodConfig:
    storage.wiredTiger.engineConfig.journalCompressor: zlib
    net.maxIncomingConnections: 1000
```

- Example Percona MongoDB Prometheus Exporter:
```
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: mongodb-exporter
  namespace: mongodb-test
spec:
  releaseName: mongodb-test-exporter
  chart:
    spec:
      chart: prometheus-mongodb-exporter
      # https://github.com/prometheus-community/helm-charts/blob/main/charts/prometheus-mongodb-exporter/values.yaml
      version: 3.1.2
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: flux-system
  interval: 5m
  values:
    image:
      pullPolicy: IfNotPresent
      repository: percona/mongodb_exporter
      tag: "0.37.0"
    mongodb:
      uri: "mongodb://my-user:SuperSecretPassword@sandbox-mongodb-0.sandbox-mongodb-svc.mongodb-test.svc.cluster.local:27017"
```

- Example codebundle configuration:
```
configProvided:
  - name: PROMQL_FILTER
    value: namespace="mongodb-test"
  - name: CONNECTION_UTILIZATION_THRESHOLD
    value: '80'
  - name: MAX_LAG
    value: '60'
  - name: MAX_ASSERTION_RATE
    value: '1'
  - name: PROJECT_ID
    value: [gcp-project-id]
  - name: MAX_QUEUE_SIZE
    value: '0'
secretsProvided:
  - name: ops-suite-sa
    workspaceKey: [secret-name]
servicesProvided:
  - name: curl
    locationServiceName: curl-service.shared
```
With the example above, a score of less than 1 would be produced if any of the following conditions is true:
- Any members are not running
- Any instance member is returning an unhealthy state
- The ratio of active connections to the maximum is 80% or greater
- Any instance has a replication lag greater than 60s
- Any instance is generating assertions at a rate greater than 1/s
- Any instance has any read or write requests waiting in the queue

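As a worked example of the thresholds in the configuration above, the following Python sketch applies the same comparison operators that the task file in this PR uses (`<` for connection utilization, `<=` for lag, queue size, and assertion rate). The `Thresholds` class, the `evaluate` function, and the measured values are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    connection_utilization: float = 80.0  # CONNECTION_UTILIZATION_THRESHOLD (percent)
    max_lag: float = 60.0                 # MAX_LAG (seconds)
    max_assertion_rate: float = 1.0       # MAX_ASSERTION_RATE (per second)
    max_queue_size: float = 0.0           # MAX_QUEUE_SIZE (queued operations)


def evaluate(measured: dict, limits: Thresholds) -> dict:
    """Return 1 (pass) or 0 (fail) per check, mirroring the task file's comparisons."""
    return {
        "instances_up": int(measured["instances_up"] >= 1),
        "members_not_healthy": int(measured["members_not_healthy"] == 0),
        "connection_utilization": int(
            measured["connection_utilization"] < limits.connection_utilization
        ),
        "replication_lag": int(measured["replication_lag"] <= limits.max_lag),
        "queue_size": int(measured["queue_size"] <= limits.max_queue_size),
        "assertion_rate": int(measured["assertion_rate"] <= limits.max_assertion_rate),
    }


# Hypothetical measurements: only connection utilization breaches its threshold.
measured = {
    "instances_up": 1.0,
    "members_not_healthy": 0,
    "connection_utilization": 85.0,
    "replication_lag": 60.0,  # the task file uses <=, so exactly 60 s still passes
    "queue_size": 0,
    "assertion_rate": 0.2,
}

checks = evaluate(measured, Thresholds())
print(checks)
```
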
## Requirements
### Version Details
This codebundle was tested with the MongoDB Community Edition Kubernetes Operator and the following MongoDB versions:
- 4.4.0
- 6.0.5

It was also tested with the Percona MongoDB Prometheus Exporter chart version 3.1.2 and image version v0.37.0.

### Service Account Requirements
This codebundle requires a service account and an accompanying JSON key uploaded as a secret to the workspace.

The service account should have the following roles:
- Logs Viewer - `roles/logging.viewer`
- Monitoring Viewer - `roles/monitoring.viewer`

> Note: It's likely that only the Monitoring Viewer role is required for promql queries, but both roles are helpful when using other gcp-opssuite* codebundles.

Please see the [documentation for creating service accounts](https://cloud.google.com/iam/docs/creating-managing-service-accounts).

## Helpful Resources
- https://www.mongodb.com/docs/v4.2/reference/replica-states/
- https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-mongodb-exporter
- https://github.com/mongodb/mongodb-kubernetes-operator/blob/master/README.md
@@ -0,0 +1,193 @@
*** Settings ***
Metadata    Author    Shea Stewart
Documentation    Uses promql on the Ops Suite API to determine the health of a MongoDB database instance
...    and pushes the result as an SLI metric. Produces a score between 0 (unhealthy) and 1 (healthy).
Force Tags    GCP    OpsSuite    PromQL    MongoDB
Library    RW.GCP.OpsSuite
Library    RW.Core
Library    RW.Utils
Library    RW.Prometheus
Library    String
Library    Collections
Suite Setup    Suite Initialization

*** Tasks ***
Get Access Token
    ${access_token_header_secret}=    RW.GCP.OpsSuite.Get Access Token Header    gcp_credentials=${ops-suite-sa}
    Set Global Variable    ${access_token_header_secret}

Get Instance Status
    [Documentation]    Get the sum of mongodb_up values divided by the number of expected instances
    ${up_rsp}=    RW.Prometheus.Query Instant
    ...    api_url=https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1
    ...    query=sum(mongodb_up{${PROMQL_FILTER}})/(count(count by (instance) (mongodb_up{${PROMQL_FILTER}})))
    ...    optional_headers=${access_token_header_secret}
    ...    target_service=${CURL_SERVICE}
    ${up_value}=    RW.Utils.Json To Metric
    ...    data=${up_rsp}
    ...    search_filter=data.result[]
    ...    calculation_field=value[1].to_number(@)
    ...    calculation=Sum
    Log    mongodb_up returned a total of ${up_value}
    ${up_score}=    Evaluate    1 if ${up_value} >= 1 else 0
    Set Global Variable    ${up_value}
    Append To List    ${SCORES}    ${up_score}

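Each task in this file reduces a Prometheus instant-query response to a single number using `search_filter=data.result[]` and `calculation_field=value[1].to_number(@)`. The Python sketch below shows roughly what that reduction does, assuming the standard Prometheus HTTP API vector response shape. The function `reduce_samples` and the sample payload are hypothetical; this is not the actual `RW.Utils.Json To Metric` implementation, and the handling of empty results is an assumption.

```python
def reduce_samples(response: dict, calculation: str) -> float:
    """Reduce the sample values of an instant-query vector to one number."""
    # Each result entry carries value = [timestamp, "<number as string>"].
    values = [float(sample["value"][1]) for sample in response["data"]["result"]]
    if calculation == "Count":
        return float(len(values))
    if calculation == "Sum":
        return float(sum(values))
    if calculation == "Max":
        # Assumption: an empty result is treated as 0 rather than an error.
        return max(values, default=0.0)
    raise ValueError(f"unsupported calculation: {calculation}")


# Hypothetical response with one sample per instance.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "mongodb-a"}, "value": [1680200000, "12.5"]},
            {"metric": {"instance": "mongodb-b"}, "value": [1680200000, "40"]},
        ],
    },
}

print(reduce_samples(sample_response, "Sum"))    # 52.5
print(reduce_samples(sample_response, "Max"))    # 40.0
print(reduce_samples(sample_response, "Count"))  # 2.0
```
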
Get Connection Utilization Rate
    [Documentation]    Get the connection utilization (current/max) for all instances and score against threshold (1 = below threshold, 0 = above)
    ${connection_utilization_rsp}=    RW.Prometheus.Query Instant
    ...    api_url=https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1
    ...    query=sum(mongodb_ss_connections{conn_type="current",rs_state="1",${PROMQL_FILTER}}) by (instance)/sum(mongodb_ss_connections{conn_type=~"current|available",rs_state="1",${PROMQL_FILTER}}) by (instance) *100
    ...    optional_headers=${access_token_header_secret}
    ...    target_service=${CURL_SERVICE}
    ${max_connection_utilization_value}=    RW.Utils.Json To Metric
    ...    data=${connection_utilization_rsp}
    ...    search_filter=data.result[]
    ...    calculation_field=value[1].to_number(@)
    ...    calculation=Max
    Log    The max connection utilization (current / available) is ${max_connection_utilization_value}
    ${connection_score}=    Evaluate    1 if ${max_connection_utilization_value} < ${CONNECTION_UTILIZATION_THRESHOLD} else 0
    Set Global Variable    ${max_connection_utilization_value}
    Append To List    ${SCORES}    ${connection_score}

Get MongoDB Member State Health
    [Documentation]    Fetch the replication state of each member and ensure they are within acceptable parameters. https://www.mongodb.com/docs/manual/reference/replica-states/
    ${acceptable_member_states}=    Set Variable    PRIMARY|SECONDARY|ARBITER
    ${member_state_rsp}=    RW.Prometheus.Query Instant
    ...    api_url=https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1
    ...    query=mongodb_members_id{member_state!~"${acceptable_member_states}",${PROMQL_FILTER}}
    ...    optional_headers=${access_token_header_secret}
    ...    target_service=${CURL_SERVICE}
    ${member_state_value}=    RW.Utils.Json To Metric
    ...    data=${member_state_rsp}
    ...    search_filter=data.result[]
    ...    calculation_field=value[1].to_number(@)
    ...    calculation=Count
    Log    The count of members that are NOT ${acceptable_member_states} is: ${member_state_value}
    ${member_state_score}=    Evaluate    1 if ${member_state_value} == 0 else 0
    Set Global Variable    ${member_state_value}
    Append To List    ${SCORES}    ${member_state_score}

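The member-state check relies on the negative regex matcher `!~`, and PromQL regex matchers are fully anchored, so only series whose `member_state` label is none of the three acceptable states are returned. A minimal Python sketch of the same selection, using `re.fullmatch` to imitate the anchoring; the function name and the sample states are hypothetical:

```python
import re

ACCEPTABLE_STATES = "PRIMARY|SECONDARY|ARBITER"


def unhealthy_members(states):
    """Return the states that do NOT fully match the acceptable pattern."""
    pattern = re.compile(ACCEPTABLE_STATES)
    return [state for state in states if pattern.fullmatch(state) is None]


# Hypothetical replica set with one member still recovering.
states = ["PRIMARY", "SECONDARY", "RECOVERING", "SECONDARY"]

bad = unhealthy_members(states)
score = 1 if len(bad) == 0 else 0
print(bad, score)
```

When every member is in an acceptable state the query returns no series, the count is 0, and the check scores 1.
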
Get MongoDB Replication Lag
    [Documentation]    Fetch the replication lag (in seconds) of all instances and determine if they are within acceptable parameters.
    ${replication_lag_rsp}=    RW.Prometheus.Query Instant
    ...    api_url=https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1
    ...    query=(max by (instance) (mongodb_rs_members_optimeDate{member_state="PRIMARY",${PROMQL_FILTER}}) - min by (instance) (mongodb_rs_members_optimeDate{member_state="SECONDARY",${PROMQL_FILTER}})) / 1000
    ...    optional_headers=${access_token_header_secret}
    ...    target_service=${CURL_SERVICE}
    ${replication_lag_value}=    RW.Utils.Json To Metric
    ...    data=${replication_lag_rsp}
    ...    search_filter=data.result[]
    ...    calculation_field=value[1].to_number(@)
    ...    calculation=Max
    Log    Max lag of any instance is ${replication_lag_value} seconds.
    ${replication_lag_score}=    Evaluate    1 if ${replication_lag_value} <= ${MAX_LAG} else 0
    Set Global Variable    ${replication_lag_value}
    Append To List    ${SCORES}    ${replication_lag_score}

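The lag query subtracts the oldest secondary optime from the primary optime and divides by 1000. The arithmetic can be sketched in Python as follows, assuming `mongodb_rs_members_optimeDate` is reported in milliseconds (the division by 1000 suggests this, but it is an assumption). The function name and the timestamps are hypothetical.

```python
def replication_lag_seconds(primary_optime_ms, secondary_optimes_ms):
    """Lag between the primary and the furthest-behind secondary, in seconds."""
    return (primary_optime_ms - min(secondary_optimes_ms)) / 1000


# Hypothetical optimes: one secondary is 1 s behind, the other 45 s behind.
primary = 1_680_200_060_000
secondaries = [1_680_200_059_000, 1_680_200_015_000]

MAX_LAG = 60  # seconds, the default in this codebundle

lag = replication_lag_seconds(primary, secondaries)
score = 1 if lag <= MAX_LAG else 0
print(lag, score)  # 45.0 1
```

Using the minimum secondary optime means the check reflects the worst-lagging member rather than an average.
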
Get MongoDB Queue Size
    [Documentation]    Fetch the total size of the globalLock current queue for all instances.
    ${queue_size_rsp}=    RW.Prometheus.Query Instant
    ...    api_url=https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1
    ...    query=sum by (instance) (mongodb_ss_globalLock_currentQueue{count_type="total",${PROMQL_FILTER}})
    ...    optional_headers=${access_token_header_secret}
    ...    target_service=${CURL_SERVICE}
    ${queue_size_value}=    RW.Utils.Json To Metric
    ...    data=${queue_size_rsp}
    ...    search_filter=data.result[]
    ...    calculation_field=value[1].to_number(@)
    ...    calculation=Max
    Log    Max total queue of any instance ${queue_size_value}.
    ${queue_size_score}=    Evaluate    1 if ${queue_size_value} <= ${MAX_QUEUE_SIZE} else 0
    Set Global Variable    ${queue_size_value}
    Append To List    ${SCORES}    ${queue_size_score}

Get Assertion Rate
    [Documentation]    Fetch the assertion rate (over the last 5m) of all instances and determine if they are within acceptable parameters.
    ${assertion_rate_rsp}=    RW.Prometheus.Query Instant
    ...    api_url=https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1
    ...    query=sum by (instance) (rate(mongodb_ss_asserts{${PROMQL_FILTER}}[5m]))
    ...    optional_headers=${access_token_header_secret}
    ...    target_service=${CURL_SERVICE}
    ${assertion_rate_value}=    RW.Utils.Json To Metric
    ...    data=${assertion_rate_rsp}
    ...    search_filter=data.result[]
    ...    calculation_field=value[1].to_number(@)
    ...    calculation=Max
    Log    The maximum assertion rate across all instances is ${assertion_rate_value}.
    ${assertion_rate_score}=    Evaluate    1 if ${assertion_rate_value} <= ${MAX_ASSERTION_RATE} else 0
    Set Global Variable    ${assertion_rate_value}
    Append To List    ${SCORES}    ${assertion_rate_score}

Generate MongoDB Score
    ${total_tests}=    Get length    ${SCORES}
    ${total_score}=    Evaluate    sum(${SCORES}) / ${total_tests}
    ${health_score}=    Convert to Number    ${total_score}    2
    RW.Core.Push Metric    ${health_score}
    RW.Core.Push Metric    ${up_value}    sub_name=instances_up
    RW.Core.Push Metric    ${member_state_value}    sub_name=members_not_healthy
    RW.Core.Push Metric    ${max_connection_utilization_value}    sub_name=connection_utilization
    RW.Core.Push Metric    ${replication_lag_value}    sub_name=replication_lag
    RW.Core.Push Metric    ${queue_size_value}    sub_name=queue_size
    RW.Core.Push Metric    ${assertion_rate_value}    sub_name=assertion_rate

*** Variables ***
@{SCORES}

*** Keywords ***
Suite Initialization
    ${CURL_SERVICE}=    RW.Core.Import Service    curl
    ...    type=string
    ...    description=The selected RunWhen Service to use for accessing services within a network.
    ...    pattern=\w*
    ...    example=curl-service.shared
    ...    default=curl-service.shared
    RW.Core.Import Secret    ops-suite-sa
    ...    type=string
    ...    description=GCP service account json used to authenticate with GCP APIs.
    ...    pattern=\w*
    ...    example={"type": "service_account","project_id":"myproject-ID", ... super secret stuff ...}
    RW.Core.Import User Variable    PROJECT_ID
    ...    type=string
    ...    description=The GCP Project ID to scope the API to.
    ...    pattern=\w*
    ...    example=myproject-ID
    RW.Core.Import User Variable    PROMQL_FILTER
    ...    type=string
    ...    description=The prometheus labels used to filter results.
    ...    pattern=\w*
    ...    default=instance=~".+"
    ...    example=namespace="mongodb-test"
    RW.Core.Import User Variable    CONNECTION_UTILIZATION_THRESHOLD
    ...    type=string
    ...    description=The percentage of used vs available connections which is deemed acceptable. Utilization above this number will negatively affect the service health score.
    ...    pattern=\d*
    ...    default=80
    ...    example=80
    RW.Core.Import User Variable    MAX_LAG
    ...    type=string
    ...    description=The maximum lag (in seconds) between members that is deemed acceptable. Lag above this number will negatively affect the service health score.
    ...    pattern=\d*
    ...    default=60
    ...    example=60
    RW.Core.Import User Variable    MAX_ASSERTION_RATE
    ...    type=string
    ...    description=The maximum assertions per second (over the last 5 minutes) that is deemed acceptable. Assertion rates above this number will negatively affect the service health score.
    ...    pattern=\d*
    ...    default=1
    ...    example=1
    RW.Core.Import User Variable    MAX_QUEUE_SIZE
    ...    type=string
    ...    description=The maximum amount of queued operations (read or write) that is deemed acceptable. Queued operations above this number will negatively affect the service health score.
    ...    pattern=\d*
    ...    default=0
    ...    example=0
    Set Suite Variable    ${CURL_SERVICE}    ${CURL_SERVICE}
    Set Suite Variable    ${PROJECT_ID}    ${PROJECT_ID}
Interesting that you made this its own task instead of in the initialization. Why's that?