runwhen-contrib · jon-funk · Mar 30, 2023 · Mar 20, 2023 · Mar 21, 2023 · Mar 21, 2023
@@ -1,5 +1,5 @@
 # Kong Ingress Health Google Managed Prometheus (promql)
-This codebundle provides an opinionated healthcheck on ingress objects that are managed by the Kong ingress controller. It requires that the Prometheus plugin is configured appropriatel such that metrics are being sent to Google Managed Prometheus. 
+This codebundle provides an opinionated healthcheck on ingress objects that are managed by the Kong ingress controller. It requires that the Prometheus plugin is configured appropriately and that metrics are being sent to Google Managed Prometheus. 
 
 
 ## Service Level Indicator

@@ -2,7 +2,7 @@
 Metadata          Author    Shea Stewart            
 Documentation     Uses promql on the Ops Suite API to determine the health of a Kong managed ingress resource
 ...               and pushes the result as an SLI metric. Produces a 1 for a healthy resource, or 0 for an unhealthy resource. 
-Force Tags        GCP    OpsSuite    PromQL    Prometheus  Kubernetes
+Force Tags        GCP    OpsSuite    PromQL  Kubernetes    Kong    Ingress
 Library           RW.GCP.OpsSuite
 Library           RW.Core
 Library           RW.Utils

@@ -0,0 +1,137 @@
+# MongoDB Health Google Managed Prometheus (promql)
+This codebundle provides an opinionated healthcheck on MongoDB instances. It requires that the MongoDB Prometheus exporter (by Percona) is configured appropriately and that metrics are being sent to Google Managed Prometheus. 
+
+
+## Service Level Indicator
+The SLI codebundle provides a composite health check that produces a score between 0 (unhealthy) and 1 (healthy). Any value below 1 indicates that at least one of the following health checks produced a score of 0. The score is derived by adding up the value of each check and dividing by the total number of checks. 
+
+Evaluations performed in this healthcheck: 
+
+- Instance Status: Is the expected number of members running for each instance?
+- Connection Utilization Rate: Is the current connection utilization (current/max) above the desired threshold for any instance?
+- Member Health: Are any of the members reporting an unhealthy state?
+- Replication Lag: Is the largest replication lag for any cluster above the desired threshold?
+- Queue Size: Is the size of the queue (reads or writes) above the desired threshold?
+- Assertion Rate: Is the rate of assertions over the last 5m above the desired threshold for any instance?
+
+This SLI supports measuring health across multiple instances and often reports the maximum value observed across instances. The PROMQL_FILTER variable can be used to add specific labels for query filtering as necessary. 
+
+> For those not looking for composite scores, the [gcp-opssuite-promql](https://docs.runwhen.com/public/v/codebundles/gcp-opssuite-promql) codebundle can be used to create specific SLIs for any specific metric. 
+
+## Use Cases
+### Use Case: SLI: MongoDB Community Edition Health for All Instances in a Kubernetes Namespace
+The following use case provides an example configuration in which the SLI can be used to provide a composite score across multiple MongoDB clusters in the same namespace. 
+
+> For a full walkthrough on the setup of an environment with MongoDB Community Edition, Percona MongoDB Prometheus Exporter, and Google Managed Prometheus, please view [the complete docs located here](https://docs.runwhen.com/public/use-cases/kubernetes-environments/measuring-mongodb-health-with-promql). 
+
+- Example MongoDB Community edition object: 
+```
+apiVersion: mongodbcommunity.mongodb.com/v1
+kind: MongoDBCommunity
+metadata:
+  name: sandbox-mongodb
+  namespace: mongodb-test
+spec:
+  members: 3
+  type: ReplicaSet
+  version: "4.4.0"
+  security:
+    authentication:
+      modes: ["SCRAM"]
+  users:
+    - name: my-user
+      db: admin
+      passwordSecretRef: # a reference to the secret that will be used to generate the user's password
+        name: my-user-password
+      roles:
+        - name: clusterAdmin
+          db: admin
+        - name: userAdminAnyDatabase
+          db: admin
+      scramCredentialsSecretName: my-scram
+  additionalMongodConfig:
+    storage.wiredTiger.engineConfig.journalCompressor: zlib
+    net.maxIncomingConnections: 1000
+```
+
+- Example Percona MongoDB Prometheus Exporter:
+```
+apiVersion: helm.toolkit.fluxcd.io/v2beta1
+kind: HelmRelease
+metadata:
+  name: mongodb-exporter
+  namespace: mongodb-test
+spec:
+  releaseName: mongodb-test-exporter
+  chart:
+    spec:
+      chart: prometheus-mongodb-exporter
+      # https://github.com/prometheus-community/helm-charts/blob/main/charts/prometheus-mongodb-exporter/values.yaml
+      version: 3.1.2
+      sourceRef:
+        kind: HelmRepository
+        name: prometheus-community
+        namespace: flux-system
+  interval: 5m
+  values:
+    image:
+      pullPolicy: IfNotPresent
+      repository: percona/mongodb_exporter
+      tag: "0.37.0"
+    mongodb:
+      uri: "mongodb://my-user:SuperSecretPassword@sandbox-mongodb-0.sandbox-mongodb-svc.mongodb-test.svc.cluster.local:27017"
+```
+
+- Example codebundle configuration: 
+```
+configProvided:
+  - name: PROMQL_FILTER
+    value: namespace="mongodb-test"
+  - name: CONNECTION_UTILIZATION_THRESHOLD
+    value: '80'
+  - name: MAX_LAG
+    value: '60'
+  - name: MAX_ASSERTION_RATE
+    value: '1'
+  - name: PROJECT_ID
+    value: [gcp-project-id]
+  - name: MAX_QUEUE_SIZE
+    value: '0'
+secretsProvided:
+  - name: ops-suite-sa
+    workspaceKey: [secret-name]
+servicesProvided:
+  - name: curl
+    locationServiceName: curl-service.shared
+```
+With the example above, a score of less than 1 is produced if any of the following conditions is true: 
+- Any members are not running
+- Any instance member is reporting an unhealthy state
+- Connection utilization (current vs. max) is 80% or greater for any instance
+- Any instance has a replication lag greater than 60s
+- Any instance is generating assertions at a rate greater than 1/s
+- Any instance has any read or write requests waiting in the queue
+
+## Requirements
+### Version Details
+This codebundle was tested with MongoDB Community Edition Kubernetes Operator, with MongoDB versions: 
+- 4.4.0
+- 6.0.5
+
+These were tested along with the Percona MongoDB Prometheus Exporter chart version 3.1.2 and image version v0.37.0.
+
+### Service Account Requirements  
+This codebundle requires a service account and an accompanying JSON key uploaded as a secret to the workspace.
+
+The service account should have the following roles: 
+- Logs Viewer - `roles/logging.viewer`
+- Monitoring Viewer - `roles/monitoring.viewer`
+
+> Note: It's likely that only the Monitoring Viewer role is required for promql queries, but both roles are helpful when using other gcp-opssuite* codebundles. 
+
+Please see the [documentation for creating service accounts](https://cloud.google.com/iam/docs/creating-managing-service-accounts).
+
+## Helpful Resources
+- https://www.mongodb.com/docs/v4.2/reference/replica-states/
+- https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-mongodb-exporter
+- https://github.com/mongodb/mongodb-kubernetes-operator/blob/master/README.md
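
The composite score described in the README above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (sum of per-check scores divided by the number of checks, rounded to two decimals); the function name `composite_score` and the sample scores are hypothetical and are not part of the codebundle.

```python
def composite_score(check_scores):
    """Return the mean of per-check scores (each 0 or 1), rounded to 2 places.

    Hypothetical helper for illustration; it mirrors the README's description
    of "sum of check values divided by the number of checks".
    """
    if not check_scores:
        raise ValueError("at least one check score is required")
    return round(sum(check_scores) / len(check_scores), 2)


# Six checks, one of which failed (for example, replication lag over threshold).
scores = [1, 1, 1, 0, 1, 1]
print(composite_score(scores))  # 0.83
```

With six checks, each failing check lowers the score by roughly 0.17, so the value indicates how many checks failed but not which ones; the per-check sub-metrics are needed for that.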
@@ -0,0 +1,193 @@
+*** Settings ***
+Metadata          Author    Shea Stewart            
+Documentation     Uses promql on the Ops Suite API to determine the health of a MongoDB database instance
+...               and pushes the result as an SLI metric. Produces a composite score between 0 (unhealthy) and 1 (healthy). 
+Force Tags        GCP    OpsSuite    PromQL    MongoDB 
+Library           RW.GCP.OpsSuite
+Library           RW.Core
+Library           RW.Utils
+Library           RW.Prometheus
+Library           String
+Library           Collections
+Suite Setup       Suite Initialization
+
+*** Tasks ***
+Get Access Token
+    ${access_token_header_secret}=  RW.GCP.OpsSuite.Get Access Token Header  gcp_credentials=${ops-suite-sa}
+    Set Global Variable    ${access_token_header_secret}
+
+Get Instance Status
+    [Documentation]    Get the count of mongodb_up returning 1 divided by the number of expected instances
+    ${up_rsp}=      RW.Prometheus.Query Instant
+    ...    api_url=https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1
+    ...    query=sum(mongodb_up{${PROMQL_FILTER}})/(count(count by (instance) (mongodb_up{${PROMQL_FILTER}})))
+    ...    optional_headers=${access_token_header_secret}
+    ...    target_service=${CURL_SERVICE}
+    ${up_value}=        RW.Utils.Json To Metric
+    ...    data=${up_rsp}
+    ...    search_filter=data.result[]
+    ...    calculation_field=value[1].to_number(@)
+    ...    calculation=Sum
+    Log    mongodb_up returned a total of ${up_value}
+    ${up_score}=    Evaluate    1 if ${up_value} >= 1 else 0
+    Set Global Variable    ${up_value}
+    Append To List     ${SCORES}       ${up_score} 
+
+Get Connection Utilization Rate
+    [Documentation]    Get the connection utilization (current/max) for all instances and score against threshold (1 = below threshold, 0 = above)
+    ${connection_utilization_rsp}=      RW.Prometheus.Query Instant
+    ...    api_url=https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1
+    ...    query=sum(mongodb_ss_connections{conn_type="current",rs_state="1",${PROMQL_FILTER}}) by (instance)/sum(mongodb_ss_connections{conn_type=~"current|available",rs_state="1",${PROMQL_FILTER}}) by (instance) *100
+    ...    optional_headers=${access_token_header_secret}
+    ...    target_service=${CURL_SERVICE}
+    ${max_connection_utilization_value}=        RW.Utils.Json To Metric
+    ...    data=${connection_utilization_rsp}
+    ...    search_filter=data.result[]
+    ...    calculation_field=value[1].to_number(@)
+    ...    calculation=Max
+    Log    The max connection utilization (current / (current + available), as a percentage) is ${max_connection_utilization_value}
+    ${connection_score}=    Evaluate    1 if ${max_connection_utilization_value} < ${CONNECTION_UTILIZATION_THRESHOLD} else 0
+    Set Global Variable    ${max_connection_utilization_value}
+    Append To List     ${SCORES}       ${connection_score} 
+
+
+Get MongoDB Member State Health
+    [Documentation]    Fetch the replication state of each member and ensure they are within acceptable parameters. https://www.mongodb.com/docs/manual/reference/replica-states/
+    ${acceptable_member_states}=  Set Variable  PRIMARY|SECONDARY|ARBITER
+    ${member_state_rsp}=      RW.Prometheus.Query Instant
+    ...    api_url=https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1
+    ...    query=mongodb_members_id{member_state!~"${acceptable_member_states}",${PROMQL_FILTER}}
+    ...    optional_headers=${access_token_header_secret}
+    ...    target_service=${CURL_SERVICE}
+    ${member_state_value}=        RW.Utils.Json To Metric
+    ...    data=${member_state_rsp}
+    ...    search_filter=data.result[]
+    ...    calculation_field=value[1].to_number(@)
+    ...    calculation=Count
+    Log    The count of members that are NOT ${acceptable_member_states} is: ${member_state_value}
+    ${member_state_score}=    Evaluate    1 if ${member_state_value} == 0 else 0
+    Set Global Variable    ${member_state_value}
+    Append To List     ${SCORES}       ${member_state_score} 
+
+Get MongoDB Replication Lag
+    [Documentation]    Fetch the replication lag (in seconds) of all instances and determine if they are within acceptable parameters.
+    ${replication_lag_rsp}=      RW.Prometheus.Query Instant
+    ...    api_url=https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1
+    ...    query=(max by (instance) (mongodb_rs_members_optimeDate{member_state="PRIMARY",${PROMQL_FILTER}}) - min by (instance) (mongodb_rs_members_optimeDate{member_state="SECONDARY",${PROMQL_FILTER}})) / 1000
+    ...    optional_headers=${access_token_header_secret}
+    ...    target_service=${CURL_SERVICE}
+    ${replication_lag_value}=        RW.Utils.Json To Metric
+    ...    data=${replication_lag_rsp}
+    ...    search_filter=data.result[]
+    ...    calculation_field=value[1].to_number(@)
+    ...    calculation=Max
+    Log    Max lag of any instance is ${replication_lag_value} seconds. 
+    ${replication_lag_score}=    Evaluate    1 if ${replication_lag_value} <= ${MAX_LAG} else 0
+    Set Global Variable    ${replication_lag_value}
+    Append To List     ${SCORES}       ${replication_lag_score} 
+
+
+Get MongoDB Queue Size
+    [Documentation]    Fetch the total size of the globalLock current queue for all instances. 
+    ${queue_size_rsp}=      RW.Prometheus.Query Instant
+    ...    api_url=https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1
+    ...    query=sum by (instance) (mongodb_ss_globalLock_currentQueue{count_type="total",${PROMQL_FILTER}})
+    ...    optional_headers=${access_token_header_secret}
+    ...    target_service=${CURL_SERVICE}
+    ${queue_size_value}=        RW.Utils.Json To Metric
+    ...    data=${queue_size_rsp}
+    ...    search_filter=data.result[]
+    ...    calculation_field=value[1].to_number(@)
+    ...    calculation=Max
+    Log    Max total queue of any instance is ${queue_size_value}. 
+    ${queue_size_score}=    Evaluate    1 if ${queue_size_value} <= ${MAX_QUEUE_SIZE} else 0
+    Set Global Variable    ${queue_size_value}
+    Append To List     ${SCORES}       ${queue_size_score} 
+
+
+Get Assertion Rate
+    [Documentation]    Fetch the assertion rate (over the last 5m) of all instances and determine if they are within acceptable parameters.
+    ${assertion_rate_rsp}=      RW.Prometheus.Query Instant
+    ...    api_url=https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1
+    ...    query=sum by (instance) (rate(mongodb_ss_asserts{${PROMQL_FILTER}}[5m]))
+    ...    optional_headers=${access_token_header_secret}
+    ...    target_service=${CURL_SERVICE}
+    ${assertion_rate_value}=        RW.Utils.Json To Metric
+    ...    data=${assertion_rate_rsp}
+    ...    search_filter=data.result[]
+    ...    calculation_field=value[1].to_number(@)
+    ...    calculation=Max
+    Log    The maximum assertion rate across all instances is ${assertion_rate_value}. 
+    ${assertion_rate_score}=    Evaluate    1 if ${assertion_rate_value} <= ${MAX_ASSERTION_RATE} else 0
+    Set Global Variable    ${assertion_rate_value}
+    Append To List     ${SCORES}       ${assertion_rate_score} 
+
+
+
+Generate MongoDB Score
+    ${total_tests}=     Get length    ${SCORES}
+    ${total_score}=     Evaluate    sum(${SCORES}) / ${total_tests}
+    ${health_score}=      Convert to Number    ${total_score}  2
+    RW.Core.Push Metric    ${health_score}    
+    RW.Core.Push Metric    ${up_value}    sub_name=instances_up
+    RW.Core.Push Metric    ${member_state_value}    sub_name=members_not_healthy
+    RW.Core.Push Metric    ${max_connection_utilization_value}    sub_name=connection_utilization
+    RW.Core.Push Metric    ${replication_lag_value}    sub_name=replication_lag
+    RW.Core.Push Metric    ${queue_size_value}    sub_name=queue_size
+    RW.Core.Push Metric    ${assertion_rate_value}    sub_name=assertion_rate
+
+
+
+*** Variables ***
+@{SCORES}
+
+*** Keywords ***
+Suite Initialization
+    ${CURL_SERVICE}=    RW.Core.Import Service    curl
+    ...    type=string
+    ...    description=The selected RunWhen Service to use for accessing services within a network.
+    ...    pattern=\w*
+    ...    example=curl-service.shared
+    ...    default=curl-service.shared
+    RW.Core.Import Secret    ops-suite-sa
+    ...    type=string
+    ...    description=GCP service account json used to authenticate with GCP APIs.
+    ...    pattern=\w*
+    ...    example={"type": "service_account","project_id":"myproject-ID", ... super secret stuff ...}
+    RW.Core.Import User Variable    PROJECT_ID
+    ...    type=string
+    ...    description=The GCP Project ID to scope the API to.
+    ...    pattern=\w*
+    ...    example=myproject-ID
+    RW.Core.Import User Variable    PROMQL_FILTER
+    ...    type=string
+    ...    description=The prometheus labels used to filter results. 
+    ...    pattern=\w*
+    ...    default=instance=~".+"
+    ...    example=namespace="mongodb-test"
+    RW.Core.Import User Variable    CONNECTION_UTILIZATION_THRESHOLD
+    ...    type=string
+    ...    description=The percentage of used vs available connections which is deemed acceptable. Utilization above this number will negatively affect the service health score. 
+    ...    pattern=\d*
+    ...    default=80
+    ...    example=80
+    RW.Core.Import User Variable    MAX_LAG
+    ...    type=string
+    ...    description=The maximum lag (in seconds) between members that is deemed acceptable. Lag above this number will negatively affect the service health score. 
+    ...    pattern=\d*
+    ...    default=60
+    ...    example=60
+    RW.Core.Import User Variable    MAX_ASSERTION_RATE
+    ...    type=string
+    ...    description=The maximum assertions per second (over the last 5 minutes) that is deemed acceptable. Assertion rates above this number will negatively affect the service health score.  
+    ...    pattern=\d*
+    ...    default=1
+    ...    example=1
+    RW.Core.Import User Variable    MAX_QUEUE_SIZE
+    ...    type=string
+    ...    description=The maximum amount of queued operations (read or write) that is deemed acceptable. Queued operations above this number will negatively affect the service health score. 
+    ...    pattern=\d*
+    ...    default=0
+    ...    example=0
+    Set Suite Variable    ${CURL_SERVICE}    ${CURL_SERVICE}
+    Set Suite Variable    ${PROJECT_ID}    ${PROJECT_ID}
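
Most tasks in the suite above follow the same pattern: run an instant query, reduce the per-instance results to a single number with `calculation=Max`, then compare that number against a threshold. The Python sketch below illustrates that pattern with the standard library only. It does not use the jmespath-based `Json To Metric` keyword; `SAMPLE_RESPONSE`, `max_sample_value`, and `connection_score` are hypothetical names, and the sample payload is made-up data in the Prometheus HTTP API instant-query response shape.

```python
import json

# Hypothetical, trimmed instant-query response (Prometheus HTTP API shape).
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "mongo-0"}, "value": [1679400000, "42.5"]},
            {"metric": {"instance": "mongo-1"}, "value": [1679400000, "81.0"]},
        ],
    },
})


def max_sample_value(response_text):
    """Return the largest sample value across all series in the response."""
    payload = json.loads(response_text)
    # Each series carries value = [timestamp, "<number as string>"].
    values = [float(series["value"][1]) for series in payload["data"]["result"]]
    if not values:
        raise ValueError("query returned no series")
    return max(values)


def connection_score(max_utilization, threshold):
    """Score 1 only when utilization is strictly below the threshold."""
    return 1 if max_utilization < threshold else 0


worst = max_sample_value(SAMPLE_RESPONSE)
print(worst, connection_score(worst, 80))
```

Note that the connection check in the suite uses a strict `<` comparison, while the lag, queue, and assertion checks use `<=`, so a value exactly equal to the threshold fails only the connection check.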
@@ -124,7 +124,7 @@ def json_to_metric(
         :data str: JSON data to search through.
         :search_filter str: A jmespah filter used to help filter search results. See https://jmespath.org/? to test search strings.
         :calculation_field str: The field from the json output that calculation should be performed on/with.
-        :calculation_type str:  The type of calculation to perform. count, sum, avg.
+        :calculation_type str:  The type of calculation to perform. count, sum, avg, max, min.
         :return: A float that represents the single calculated metric.
     """
     # Fix up single quoted json if necessary
@@ -150,6 +150,12 @@ def json_to_metric(
     if calculation == "Avg":
         metric = utils.search_json(data=payload, pattern="avg(" + search_pattern_prefix + "." + calculation_field + ")")
         return float(metric)
+    if calculation == "Max":
+        metric = utils.search_json(data=payload, pattern="max(" + search_pattern_prefix + "." + calculation_field + ")")
+        return float(metric)
+    if calculation == "Min":
+        metric = utils.search_json(data=payload, pattern="min(" + search_pattern_prefix + "." + calculation_field + ")")
+        return float(metric)
 
 
 def from_yaml(yaml_str) -> object: