Feature/k8s mongodb health #103

Merged
merged 9 commits into from
Mar 30, 2023
2 changes: 1 addition & 1 deletion codebundles/kong-ingress-health-gcp-promql/README.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# Kong Ingress Health Google Managed Prometheus (promql)
This codebundle provides an opinionated healthcheck on ingress objects that are managed by the Kong ingress controller. It requires that the Prometheus plugin is configured appropriatel such that metrics are being sent to Google Managed Prometheus.
This codebundle provides an opinionated healthcheck on ingress objects that are managed by the Kong ingress controller. It requires that the Prometheus plugin is configured appropriately and that metrics are being sent to Google Managed Prometheus.


## Service Level Indicator
2 changes: 1 addition & 1 deletion codebundles/kong-ingress-health-gcp-promql/sli.robot
@@ -2,7 +2,7 @@
Metadata Author Shea Stewart
Documentation Uses promql on the Ops Suite API to determine the health of a Kong managed ingress resource
... and pushes the result as an SLI metric. Produces a 1 for a healthy resource, or 0 for an unhealthy resource.
Force Tags GCP OpsSuite PromQL Prometheus Kubernetes
Force Tags GCP OpsSuite PromQL Kubernetes Kong Ingress
Library RW.GCP.OpsSuite
Library RW.Core
Library RW.Utils
137 changes: 137 additions & 0 deletions codebundles/mongodb-health-gcp-promql/README.md
@@ -0,0 +1,137 @@
# MongoDB Health Google Managed Prometheus (promql)
This codebundle provides an opinionated healthcheck on MongoDB instances. It requires that the MongoDB Prometheus exporter (by Percona) is configured appropriately and that metrics are being sent to Google Managed Prometheus.


## Service Level Indicator
The SLI codebundle provides a composite health check that produces a score between 0 (unhealthy) and 1 (healthy). A value strictly between 0 and 1 indicates that at least one of the following health checks produced a score of 0. The score is derived by adding up the score of each check and dividing by the total number of checks.
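
The arithmetic can be sketched in a few lines of Python. This is only an illustration of the averaging described above, not the codebundle's implementation (which is written in Robot Framework); the function name `composite_score` and the sample values are assumptions.

```python
def composite_score(scores):
    """Average a list of 0/1 check results, rounded to two decimals."""
    if not scores:
        raise ValueError("at least one check result is required")
    return round(sum(scores) / len(scores), 2)


# Six checks, one of which failed (hypothetical values).
print(composite_score([1, 1, 0, 1, 1, 1]))  # 0.83
```

With six checks, each failing check lowers the score by roughly 0.17.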

Evaluations performed in this healthcheck:

- Instance Status: Is the expected number of members running for each instance?
- Connection Utilization Rate: Is the current connection utilization (current/max) above the desired threshold for any instance?
- Member Health: Are any of the members reporting an unhealthy state?
- Replication Lag: Is the largest replication lag for any cluster above the desired threshold?
- Queue Size: Is the size of the queue (reads or writes) above the desired threshold?
- Assertion Rate: Is the rate of assertions over the last 5m above the desired threshold for any instance?

This SLI supports measuring health across multiple instances and, for most checks, reports the maximum value obtained across instances. The PROMQL_FILTER variable can be used to add specific labels for query filtering as necessary.
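
As a rough sketch of how the filter ends up in a query: the PROMQL_FILTER value is placed inside the label selector of each PromQL expression. The template below is the instance-status query from this codebundle's `sli.robot`; the helper `build_query` is a hypothetical name used only for illustration, since in the codebundle the substitution is done by Robot Framework variable expansion.

```python
def build_query(template: str, promql_filter: str) -> str:
    """Substitute the label filter into a PromQL template."""
    return template.replace("${PROMQL_FILTER}", promql_filter)


# Instance-status query, with the filter left as a placeholder.
UP_QUERY = (
    "sum(mongodb_up{${PROMQL_FILTER}})"
    "/(count(count by (instance) (mongodb_up{${PROMQL_FILTER}})))"
)

print(build_query(UP_QUERY, 'namespace="mongodb-test"'))
```

Because the value is inserted verbatim, several label matchers can be combined by separating them with commas, as in any PromQL selector.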

> For those not looking for composite scores, the [gcp-opssuite-promql](https://docs.runwhen.com/public/v/codebundles/gcp-opssuite-promql) codebundle can be used to create specific SLIs for any specific metric.

## Use Cases
### Use Case: SLI: MongoDB Community Edition Health for All Instances in a Kubernetes Namespace
The following use case provides an example configuration in which the SLI can be used to provide a composite score across multiple mongodb clusters in the same namespace.

> For a full walkthrough on the setup of an environment with MongoDB Community Edition, Percona MongoDB Prometheus Exporter, and Google Managed Prometheus, please view [the complete docs located here](https://docs.runwhen.com/public/use-cases/kubernetes-environments/measuring-mongodb-health-with-promql).

- Example MongoDB Community edition object:
```
apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
  name: sandbox-mongodb
  namespace: mongodb-test
spec:
  members: 3
  type: ReplicaSet
  version: "4.4.0"
  security:
    authentication:
      modes: ["SCRAM"]
  users:
    - name: my-user
      db: admin
      passwordSecretRef: # a reference to the secret that will be used to generate the user's password
        name: my-user-password
      roles:
        - name: clusterAdmin
          db: admin
        - name: userAdminAnyDatabase
          db: admin
      scramCredentialsSecretName: my-scram
  additionalMongodConfig:
    storage.wiredTiger.engineConfig.journalCompressor: zlib
    net.maxIncomingConnections: 1000
```

- Example Percona MongoDB Prometheus Exporter:
```
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: mongodb-exporter
  namespace: mongodb-test
spec:
  releaseName: mongodb-test-exporter
  chart:
    spec:
      chart: prometheus-mongodb-exporter
      # https://github.com/prometheus-community/helm-charts/blob/main/charts/prometheus-mongodb-exporter/values.yaml
      version: 3.1.2
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: flux-system
  interval: 5m
  values:
    image:
      pullPolicy: IfNotPresent
      repository: percona/mongodb_exporter
      tag: "0.37.0"
    mongodb:
      uri: "mongodb://my-user:SuperSecretPassword@sandbox-mongodb-0.sandbox-mongodb-svc.mongodb-test.svc.cluster.local:27017"
```

- Example codebundle configuration:
```
configProvided:
  - name: PROMQL_FILTER
    value: namespace="mongodb-test"
  - name: CONNECTION_UTILIZATION_THRESHOLD
    value: '80'
  - name: MAX_LAG
    value: '60'
  - name: MAX_ASSERTION_RATE
    value: '1'
  - name: PROJECT_ID
    value: [gcp-project-id]
  - name: MAX_QUEUE_SIZE
    value: '0'
secretsProvided:
  - name: ops-suite-sa
    workspaceKey: [secret-name]
servicesProvided:
  - name: curl
    locationServiceName: curl-service.shared
```
With the example above, a score of less than 1 would be produced if any of the following conditions are true:
- Any members are not running
- Any instance member is returning an unhealthy state
- The ratio of active connections to the maximum is 80% or greater
- Any instance has a replication lag of more than 60s
- Any instance is generating assertions at a rate of more than 1/s
- Any instance has any read or write requests waiting in the queue
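
The conditions above can be sketched as plain comparisons. The following Python is an illustrative sketch only: the comparison operators mirror the `Evaluate` expressions in this codebundle's `sli.robot`, while the `Thresholds` class, the `score_checks` function, the dictionary keys, and the sample measurements are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    connection_utilization: float = 80.0  # percent
    max_lag: float = 60.0                 # seconds
    max_assertion_rate: float = 1.0       # assertions per second
    max_queue_size: float = 0.0           # queued operations


def score_checks(m: dict, t: Thresholds) -> dict:
    """Return a 0/1 score per check from a dict of measurements."""
    return {
        "instances_up": 1 if m["up_ratio"] >= 1 else 0,
        "members_healthy": 1 if m["unhealthy_members"] == 0 else 0,
        "connections": 1 if m["connection_utilization"] < t.connection_utilization else 0,
        "replication_lag": 1 if m["replication_lag"] <= t.max_lag else 0,
        "queue_size": 1 if m["queue_size"] <= t.max_queue_size else 0,
        "assertion_rate": 1 if m["assertion_rate"] <= t.max_assertion_rate else 0,
    }


# Hypothetical measurements: everything healthy except connection utilization.
measurements = {
    "up_ratio": 1.0,
    "unhealthy_members": 0,
    "connection_utilization": 85.0,
    "replication_lag": 2.0,
    "queue_size": 0,
    "assertion_rate": 0.0,
}
scores = score_checks(measurements, Thresholds())
print(scores)
print(round(sum(scores.values()) / len(scores), 2))  # 0.83
```

Note the boundary behavior: the connection check uses a strict `<`, so a utilization of exactly 80 fails, whereas lag, queue size, and assertion rate use `<=`, so a value exactly at the threshold still passes.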

## Requirements
### Version Details
This codebundle was tested with MongoDB Community Edition Kubernetes Operator, with MongoDB versions:
- 4.4.0
- 6.0.5

It was also tested with the Percona MongoDB Prometheus Exporter chart version 3.1.2 and image version v0.37.0.

### Service Account Requirements
This codebundle requires a service account and an accompanying JSON key uploaded as a secret to the workspace.

The service account should have the following roles:
- Logs Viewer - `roles/logging.viewer`
- Monitoring Viewer - `roles/monitoring.viewer`

> Note: It's likely that only the Monitoring Viewer role is required for promql queries, but both roles are helpful when using other gcp-opssuite* codebundles.

Please see the [documentation for creating service accounts](https://cloud.google.com/iam/docs/creating-managing-service-accounts).

## Helpful Resources
- https://www.mongodb.com/docs/v4.2/reference/replica-states/
- https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-mongodb-exporter
- https://github.com/mongodb/mongodb-kubernetes-operator/blob/master/README.md
193 changes: 193 additions & 0 deletions codebundles/mongodb-health-gcp-promql/sli.robot
@@ -0,0 +1,193 @@
*** Settings ***
Metadata Author Shea Stewart
Documentation Uses promql on the Ops Suite API to determine the health of a MongoDB database instance
... and pushes the result as an SLI metric. Produces a score between 0 (unhealthy) and 1 (healthy).
Force Tags GCP OpsSuite PromQL MongoDB
Library RW.GCP.OpsSuite
Library RW.Core
Library RW.Utils
Library RW.Prometheus
Library String
Library Collections
Suite Setup Suite Initialization

*** Tasks ***
Get Access Token
Review comment (Contributor): Interesting that you made this its own task instead of in the initialization. Why's that?

${access_token_header_secret}= RW.GCP.OpsSuite.Get Access Token Header gcp_credentials=${ops-suite-sa}
Set Global Variable ${access_token_header_secret}

Get Instance Status
[Documentation] Get the count of mongodb_up returning 1 divided by the number of expected instances
${up_rsp}= RW.Prometheus.Query Instant
... api_url=https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1
... query=sum(mongodb_up{${PROMQL_FILTER}})/(count(count by (instance) (mongodb_up{${PROMQL_FILTER}})))
... optional_headers=${access_token_header_secret}
... target_service=${CURL_SERVICE}
${up_value}= RW.Utils.Json To Metric
... data=${up_rsp}
... search_filter=data.result[]
... calculation_field=value[1].to_number(@)
... calculation=Sum
Log mongodb_up returned a total of ${up_value}
${up_score}= Evaluate 1 if ${up_value} >= 1 else 0
Set Global Variable ${up_value}
Append To List ${SCORES} ${up_score}

Get Connection Utilization Rate
[Documentation] Get the connection utilization (current/max) for all instances and score against threshold (1 = below threshold, 0 = above)
${connection_utilization_rsp}= RW.Prometheus.Query Instant
... api_url=https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1
... query=sum(mongodb_ss_connections{conn_type="current",rs_state="1",${PROMQL_FILTER}}) by (instance)/sum(mongodb_ss_connections{conn_type=~"current|available",rs_state="1",${PROMQL_FILTER}}) by (instance) *100
... optional_headers=${access_token_header_secret}
... target_service=${CURL_SERVICE}
${max_connection_utilization_value}= RW.Utils.Json To Metric
... data=${connection_utilization_rsp}
... search_filter=data.result[]
... calculation_field=value[1].to_number(@)
... calculation=Max
Log The max connection utilization (current / (current + available), as a percentage) is ${max_connection_utilization_value}
${connection_score}= Evaluate 1 if ${max_connection_utilization_value} < ${CONNECTION_UTILIZATION_THRESHOLD} else 0
Set Global Variable ${max_connection_utilization_value}
Append To List ${SCORES} ${connection_score}


Get MongoDB Member State Health
[Documentation] Fetch the replication state of each member and ensure they are within acceptable parameters. https://www.mongodb.com/docs/manual/reference/replica-states/
${acceptable_member_states}= Set Variable PRIMARY|SECONDARY|ARBITER
${member_state_rsp}= RW.Prometheus.Query Instant
... api_url=https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1
... query=mongodb_members_id{member_state!~"${acceptable_member_states}",${PROMQL_FILTER}}
... optional_headers=${access_token_header_secret}
... target_service=${CURL_SERVICE}
${member_state_value}= RW.Utils.Json To Metric
... data=${member_state_rsp}
... search_filter=data.result[]
... calculation_field=value[1].to_number(@)
... calculation=Count
Log The count of members that are NOT ${acceptable_member_states} is: ${member_state_value}
${member_state_score}= Evaluate 1 if ${member_state_value} == 0 else 0
Set Global Variable ${member_state_value}
Append To List ${SCORES} ${member_state_score}

Get MongoDB Replication Lag
[Documentation] Fetch the replication lag (in seconds) of all instances and determine if they are within acceptable parameters.
${replication_lag_rsp}= RW.Prometheus.Query Instant
... api_url=https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1
... query=(max by (instance) (mongodb_rs_members_optimeDate{member_state="PRIMARY",${PROMQL_FILTER}}) - min by (instance) (mongodb_rs_members_optimeDate{member_state="SECONDARY",${PROMQL_FILTER}})) / 1000
... optional_headers=${access_token_header_secret}
... target_service=${CURL_SERVICE}
${replication_lag_value}= RW.Utils.Json To Metric
... data=${replication_lag_rsp}
... search_filter=data.result[]
... calculation_field=value[1].to_number(@)
... calculation=Max
Log Max lag of any instance is ${replication_lag_value} seconds.
${replication_lag_score}= Evaluate 1 if ${replication_lag_value} <= ${MAX_LAG} else 0
Set Global Variable ${replication_lag_value}
Append To List ${SCORES} ${replication_lag_score}


Get MongoDB Queue Size
[Documentation] Fetch the total size of the globalLock current queue for all instances.
${queue_size_rsp}= RW.Prometheus.Query Instant
... api_url=https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1
... query=sum by (instance) (mongodb_ss_globalLock_currentQueue{count_type="total",${PROMQL_FILTER}})
... optional_headers=${access_token_header_secret}
... target_service=${CURL_SERVICE}
${queue_size_value}= RW.Utils.Json To Metric
... data=${queue_size_rsp}
... search_filter=data.result[]
... calculation_field=value[1].to_number(@)
... calculation=Max
Log Max total queue size of any instance is ${queue_size_value}.
${queue_size_score}= Evaluate 1 if ${queue_size_value} <= ${MAX_QUEUE_SIZE} else 0
Set Global Variable ${queue_size_value}
Append To List ${SCORES} ${queue_size_score}


Get Assertion Rate
[Documentation] Fetch the assertion rate (over the last 5m) of all instances and determine if they are within acceptable parameters.
${assertion_rate_rsp}= RW.Prometheus.Query Instant
... api_url=https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1
... query=sum by (instance) (rate(mongodb_ss_asserts{${PROMQL_FILTER}}[5m]))
... optional_headers=${access_token_header_secret}
... target_service=${CURL_SERVICE}
${assertion_rate_value}= RW.Utils.Json To Metric
... data=${assertion_rate_rsp}
... search_filter=data.result[]
... calculation_field=value[1].to_number(@)
... calculation=Max
Log The maximum assertion rate across all instances is ${assertion_rate_value}.
${assertion_rate_score}= Evaluate 1 if ${assertion_rate_value} <= ${MAX_ASSERTION_RATE} else 0
Set Global Variable ${assertion_rate_value}
Append To List ${SCORES} ${assertion_rate_score}



Generate MongoDB Score
${total_tests}= Get length ${SCORES}
${total_score}= Evaluate sum(${SCORES}) / ${total_tests}
${health_score}= Convert to Number ${total_score} 2
RW.Core.Push Metric ${health_score}
RW.Core.Push Metric ${up_value} sub_name=instances_up
RW.Core.Push Metric ${member_state_value} sub_name=members_not_healthy
RW.Core.Push Metric ${max_connection_utilization_value} sub_name=connection_utilization
RW.Core.Push Metric ${replication_lag_value} sub_name=replication_lag
RW.Core.Push Metric ${queue_size_value} sub_name=queue_size
RW.Core.Push Metric ${assertion_rate_value} sub_name=assertion_rate



*** Variables ***
@{SCORES}

*** Keywords ***
Suite Initialization
${CURL_SERVICE}= RW.Core.Import Service curl
... type=string
... description=The selected RunWhen Service to use for accessing services within a network.
... pattern=\w*
... example=curl-service.shared
... default=curl-service.shared
RW.Core.Import Secret ops-suite-sa
... type=string
... description=GCP service account json used to authenticate with GCP APIs.
... pattern=\w*
... example={"type": "service_account","project_id":"myproject-ID", ... super secret stuff ...}
RW.Core.Import User Variable PROJECT_ID
... type=string
... description=The GCP Project ID to scope the API to.
... pattern=\w*
... example=myproject-ID
RW.Core.Import User Variable PROMQL_FILTER
... type=string
... description=The prometheus labels used to filter results.
... pattern=\w*
... default=instance=~".+"
... example=namespace="mongodb-test"
RW.Core.Import User Variable CONNECTION_UTILIZATION_THRESHOLD
... type=string
... description=The percentage of used vs available connections which is deemed acceptable. Utilization above this number will negatively affect the service health score.
... pattern=\d*
... default=80
... example=80
RW.Core.Import User Variable MAX_LAG
... type=string
... description=The maximum lag (in seconds) between members that is deemed acceptable. Lag above this number will negatively affect the service health score.
... pattern=\d*
... default=60
... example=60
RW.Core.Import User Variable MAX_ASSERTION_RATE
... type=string
... description=The maximum assertions per second (over the last 5 minutes) that is deemed acceptable. Assertion rates above this number will negatively affect the service health score.
... pattern=\d*
... default=1
... example=1
RW.Core.Import User Variable MAX_QUEUE_SIZE
... type=string
... description=The maximum amount of queued operations (read or write) that is deemed acceptable. Queued operations above this number will negatively affect the service health score.
... pattern=\d*
... default=0
... example=0
Set Suite Variable ${CURL_SERVICE} ${CURL_SERVICE}
Set Suite Variable ${PROJECT_ID} ${PROJECT_ID}
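
For readers unfamiliar with the endpoint used by the tasks above, here is a minimal Python sketch of how an instant-query URL could be assembled. It assumes the `api_url` passed to `RW.Prometheus.Query Instant` is used as the base of the standard Prometheus HTTP API, where instant queries go to `/query`; how the `RW.Prometheus` keyword actually builds and sends the request is not shown in this PR. The function name `instant_query_url` is hypothetical, and the access token header that a real request needs is omitted.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def instant_query_url(project_id: str, query: str) -> str:
    """Build a Prometheus instant-query URL for Google Managed Prometheus."""
    base = (
        "https://monitoring.googleapis.com/v1/projects/"
        f"{project_id}/location/global/prometheus/api/v1"
    )
    # urlencode percent-encodes braces, quotes, and other PromQL punctuation.
    return f"{base}/query?{urlencode({'query': query})}"


url = instant_query_url("myproject-ID", 'sum(mongodb_up{namespace="mongodb-test"})')
print(url)

# Round-trip the query string to confirm the PromQL survives URL encoding.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```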
8 changes: 7 additions & 1 deletion libraries/RW/Utils/utils.py
@@ -124,7 +124,7 @@ def json_to_metric(
:data str: JSON data to search through.
:search_filter str: A jmespath filter used to help filter search results. See https://jmespath.org/? to test search strings.
:calculation_field str: The field from the json output that calculation should be performed on/with.
:calculation_type str: The type of calculation to perform. count, sum, avg.
:calculation_type str: The type of calculation to perform. count, sum, avg, max, min.
:return: A float that represents the single calculated metric.
"""
# Fix up single quoted json if necessary
@@ -150,6 +150,12 @@ def json_to_metric(
if calculation == "Avg":
metric = utils.search_json(data=payload, pattern="avg(" + search_pattern_prefix + "." + calculation_field + ")")
return float(metric)
if calculation == "Max":
metric = search_json(data=payload, pattern="max("+search_pattern_prefix+"."+calculation_field+")")
return float(metric)
if calculation == "Min":
metric = search_json(data=payload, pattern="min("+search_pattern_prefix+"."+calculation_field+")")
return float(metric)


def from_yaml(yaml_str) -> object:
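
To illustrate what the new `Max` and `Min` calculations compute for the SLI tasks, the sketch below reduces a Prometheus instant-query response to a single number. It is a pure-Python stand-in for a JMESPath expression such as `max(data.result[].value[1].to_number(@))`, not the library's implementation; the function name `reduce_instant_vector`, the `REDUCERS` table, and the sample instances and values are assumptions made for the example.

```python
import json

# Hypothetical instant-query response with two series.
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"instance": "mongodb-0"}, "value": [1680000000, "12.5"]},
      {"metric": {"instance": "mongodb-1"}, "value": [1680000000, "40"]}
    ]
  }
}
""")

REDUCERS = {
    "Count": len,
    "Sum": sum,
    "Avg": lambda values: sum(values) / len(values),
    "Max": max,
    "Min": min,
}


def reduce_instant_vector(payload: dict, calculation: str) -> float:
    """Apply the named calculation to the sample values of an instant vector."""
    # Prometheus returns each sample as [timestamp, "value-as-string"].
    values = [float(sample["value"][1]) for sample in payload["data"]["result"]]
    return float(REDUCERS[calculation](values))


print(reduce_instant_vector(SAMPLE_RESPONSE, "Max"))  # 40.0
print(reduce_instant_vector(SAMPLE_RESPONSE, "Min"))  # 12.5
```

In this sketch an empty result set raises an error for `Max`, `Min`, and `Avg`; how the JMESPath-based version behaves on an empty result is not shown in this diff.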