
K3S emitting duplicated metrics on all endpoints (API server, kubelet, kube-proxy, kube-scheduler, etc.) #67

Closed
ricsanfre opened this issue Aug 23, 2022 · 8 comments · Fixed by #74
Labels: bug

ricsanfre commented Aug 23, 2022

Bug Description

Kubernetes Documentation - System Metrics details which Kubernetes components expose metrics in Prometheus format:

These components are:

  • kube-controller-manager (exposing /metrics endpoint at TCP 10257)
  • kube-proxy (exposing /metrics endpoint at TCP 10249)
  • kube-apiserver (exposing /metrics at Kubernetes API port)
  • kube-scheduler (exposing /metrics endpoint at TCP 10259)
  • kubelet (exposing /metrics, /metrics/cadvisor, /metrics/resource and /metrics/probes endpoints at TCP 10250)

The K3S distribution behaves differently because on each node only one process is deployed (k3s-server on master nodes, k3s-agent on worker nodes), with all k8s components running within that single process and sharing the same memory.

K3S emits the same metrics, coming from all k8s components deployed on the node, at every '/metrics' endpoint available: api-server, kubelet (TCP 10250), kube-proxy (TCP 10249), kube-scheduler (TCP 10259) and kube-controller-manager (TCP 10257). Thus, collecting from all ports produces duplicated metrics.

The additional kubelet endpoints (/metrics/cadvisor, /metrics/resource and /metrics/probes) are only available at TCP 10250.

Enabling the scraping of all the different metrics TCP ports (one per Kubernetes component) therefore causes the ingestion of duplicated metrics. These duplicates need to be removed from Prometheus in order to reduce memory and CPU consumption.
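
If Prometheus is already scraping these endpoints, the duplication can be spotted from its HTTP API. A minimal sketch, assuming Prometheus is reachable at localhost:9090 and using apiserver_request_total only as an example metric name:

# List how many series of the metric exist under each scrape job;
# the same metric showing up under several jobs means it is ingested more than once.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count by (job) (apiserver_request_total)' | jq .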

Context Information

As stated in issue #22, there was a known issue in K3S, k3s-io/k3s#2262, where duplicated metrics are emitted by three of the components (kube-proxy, kube-scheduler and kube-controller-manager).
The solution proposed by Rancher Monitoring (k3s-io/k3s#2262) was to avoid scraping the duplicated metrics by activating the service monitoring of only one of those components (i.e. kube-proxy).
That solution was implemented (see #22 (comment)) and it solved the main issue (out-of-memory).

The endpoints currently being scraped by Prometheus are:

  • api-server (TCP 6443)
  • kubelet (TCP 10250)
  • kube-proxy (TCP 10249)

Duplicated metrics

After a deeper analysis of the metrics scraped by Prometheus, it is clear that K3S is emitting duplicated metrics on all endpoints.

Example 1: API-server metrics emitted by kube-proxy, kubelet and api-server endpoints running on master server

(screenshot)

Example 2: kubelet metrics emitted by kube-proxy, kubelet and api-server

(screenshot)

Example 3: kube-proxy metrics: kubeproxy_sync_proxy_rules_duration_seconds_bucket{le="0.001"}

(screenshot)


ricsanfre commented Aug 23, 2022

Procedure for obtaining raw metrics exposed by K3S.

The procedure described in SUSE/doc-caasp#166 (comment) can be used to manually query the HTTPS metrics endpoints.
Recent versions of Kubernetes are moving all metrics endpoints to HTTPS.

For example, the TCP port numbers exposed by kube-scheduler and kube-controller-manager changed in Kubernetes release 1.22 (from 10251/10252 to 10259/10257) and now require an authenticated HTTPS connection using an authorized Kubernetes service account.
Only the kube-proxy endpoint remains open over HTTP; the rest of the ports now use HTTPS.

The procedure specified above creates a service account that does not have enough privileges to query the kubelet metrics endpoints directly.
The following ServiceAccount, Secret, ClusterRole and ClusterRoleBinding resources need to be created:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: monitoring
  namespace: kube-system
secrets:
- name: monitoring-secret-token
---
apiVersion: v1
kind: Secret
metadata:
  name: monitoring-secret-token
  namespace: kube-system
  annotations:
    kubernetes.io/service-account.name: monitoring
type: kubernetes.io/service-account-token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-clusterrole
  namespace: kube-system
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/metrics
  - pods
  verbs: ["get", "list"]
- nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: monitoring-clusterrole-binding
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: monitoring-clusterrole
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: monitoring
  namespace: kube-system
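
Once these resources are applied, the service account token can be read back from the secret and used as a bearer token. A minimal usage sketch (monitoring-rbac.yaml is just an assumed file name for the manifest above):

kubectl apply -f monitoring-rbac.yaml
kubectl -n kube-system get secrets monitoring-secret-token -ojsonpath='{.data.token}' | base64 -d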

The following script can be used to automatically extract the metrics from the kubelet, kube-proxy, apiserver, kube-scheduler and kube-controller-manager endpoints so that the results can be compared:

#!/bin/bash


# Get token
TOKEN=$(kubectl -n kube-system get secrets monitoring-secret-token -ojsonpath='{.data.token}' | base64 -d)


APISERVER=$(kubectl config view | grep server | cut -f 2- -d ":" | tr -d " ")

# Get apiserver
curl -ks $APISERVER/metrics  --header "Authorization: Bearer $TOKEN" | grep -v "# " > apiserver.txt

# Get list of nodes of k3s cluster from api server and iterate over it
for i in `kubectl get nodes -o json | jq -r '.items[].status.addresses[0].address'`; do
  echo "Getting metrics from node $i"
  curl -ks https://$i:10250/metrics --header "Authorization: Bearer $TOKEN" | grep -v "# " > kubelet_$i.txt
  curl -ks https://$i:10250/metrics/cadvisor --header "Authorization: Bearer $TOKEN" | grep -v "# " > kubelet_cadvisor_$i.txt
  curl -ks http://$i:10249/metrics | grep -v "# " > kubeproxy_$i.txt
done

# Get kube-controller and kube-scheduler

for i in `kubectl get nodes -o json | jq -r '.items[] | select(.metadata.labels."node-role.kubernetes.io/master" != null) | .status.addresses[0].address'`; do
  echo "Getting metrics from master node $i"
  curl -ks https://$i:10259/metrics --header "Authorization: Bearer $TOKEN" | grep -v "# " > kube-scheduler_$i.txt
  curl -ks https://$i:10257/metrics --header "Authorization: Bearer $TOKEN" | grep -v "# " > kube-controller_$i.txt
done

Analyzing the results

After executing the previous script, the following files contain the metrics extracted from each of the exposed ports on each node of the cluster:

apiserver.txt
kube-controller_node1.txt
kubelet_cadvisor_node1.txt
kubelet_cadvisor_node2.txt
kubelet_cadvisor_node3.txt
kubelet_cadvisor_node4.txt
kubelet_node1.txt
kubelet_node2.txt
kubelet_node3.txt
kubelet_node4.txt
kubeproxy_node1.txt
kubeproxy_node2.txt
kubeproxy_node3.txt
kubeproxy_node4.txt
kube-scheduler_node1.txt

  • Checking the metrics extracted from the node1 (master) endpoints, all ports expose the same number of metrics:

    ~$ wc -l kubelet_node1.txt 
    40666 kubelet_node1.txt
    ~$ wc -l kubeproxy_node1.txt 
    40666 kubeproxy_node1.txt
    ~$ wc -l kube-controller_node1.txt 
    40666 kube-controller_node1.txt
    ~$ wc -l kube-scheduler_node1.txt 
    40666 kube-scheduler_node1.txt
    ~$ wc -l apiserver.txt 
    40666 apiserver.txt

    The metrics in the files are the same: when applying the diff command, the only differences shown are the values of some of the metrics (counters/seconds). This is because the different ports are polled at different times, so counter/seconds type metrics show slightly different values (see the comparison sketch after this list).

  • Checking the metrics extracted from the node2 (worker) endpoints, all ports expose the same number of metrics:

    ~$ wc -l kubelet_node2.txt 
    1723 kubelet_node2.txt
    ~$ wc -l kubeproxy_node2.txt 
    1723 kubeproxy_node2.txt

    and again the differences are only in the values of counter (seconds) type metrics.
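
To confirm that only the sampled values differ, the metric identifiers themselves can be compared. A minimal sketch, assuming the files produced by the script above (node1 used as an example):

# Compare only the metric identifiers (name + labels), ignoring the sampled values;
# an empty diff means both ports expose exactly the same metric series.
diff <(awk '{print $1}' kubelet_node1.txt | sort -u) \
     <(awk '{print $1}' kubeproxy_node1.txt | sort -u)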

Conclusion

To get all K3S metrics it is enough to collect metrics from the kubelet endpoints (/metrics, /metrics/cadvisor and /metrics/probes) on all nodes.


ricsanfre commented Aug 23, 2022

Possible solution

Enable only the monitoring of the kubelet endpoints /metrics, /metrics/cadvisor and /metrics/probes, available on TCP port 10250, so that all metrics are still collected. This is the same solution the Rancher monitoring chart appears to use (rancher/rancher#29445).

Changes to be implemented:

  1. Disable in the kube-prometheus-stack chart the creation of the objects used to monitor all the Kubernetes components (including apiserver and kubelet):

    prometheusOperator:
      kubeletService:
        enabled: false
    kubelet:
      enabled: false
    kubeApiServer:
      enabled: false
    kubeControllerManager:
      enabled: false
    kubeScheduler:
      enabled: false
    kubeProxy:
      enabled: false
    kubeEtcd:
      enabled: false
  2. Create a headless service pointing to TCP port 10250 of all K3S nodes.

    ---
    # Headless service for K3S metrics. No selector
    apiVersion: v1
    kind: Service
    metadata:
      name: k3s-metrics-service
      labels:
        app.kubernetes.io/name: k3s
      namespace: kube-system
    spec:
      clusterIP: None
      ports:
      - name: https-metrics
        port: 10250
        protocol: TCP
        targetPort: 10250
      type: ClusterIP
    ---
    # Endpoint for the headless service without selector
    apiVersion: v1
    kind: Endpoints
    metadata:
      name: k3s-metrics-service
      namespace: kube-system
    subsets:
    - addresses:
      - ip: 10.0.0.11
      - ip: 10.0.0.12
      - ip: 10.0.0.13
      - ip: 10.0.0.14
      ports:
      - name: https-metrics
        port: 10250
        protocol: TCP
  3. Create a single ServiceMonitor resource to collect all the k8s components' metrics from the single TCP port 10250. This ServiceMonitor should include all the relabeling rules that the ServiceMonitor resources created by default by the kube-prometheus-stack chart define for each individual k8s component.

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      labels:
        release: kube-prometheus-stack
      name: k3s-monitoring
      namespace: k3s-monitoring
    spec:
      endpoints:
      # /metrics endpoint
      - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
        honorLabels: true
        metricRelabelings:
        # apiserver
        - action: drop
          regex: apiserver_request_duration_seconds_bucket;(0.15|0.2|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2|3|3.5|4|4.5|6|7|8|9|15|25|40|50)
          sourceLabels:
          - __name__
          - le
        port: https-metrics
        relabelings:
        - action: replace
          sourceLabels:
          - __metrics_path__
          targetLabel: metrics_path
        scheme: https
        tlsConfig:
          caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecureSkipVerify: true
      # /metrics/cadvisor
      - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
        honorLabels: true
        metricRelabelings:
        - action: drop
          regex: container_cpu_(cfs_throttled_seconds_total|load_average_10s|system_seconds_total|user_seconds_total)
          sourceLabels:
          - __name__
        - action: drop
          regex: container_fs_(io_current|io_time_seconds_total|io_time_weighted_seconds_total|reads_merged_total|sector_reads_total|sector_writes_total|writes_merged_total)
          sourceLabels:
          - __name__
        - action: drop
          regex: container_memory_(mapped_file|swap)
          sourceLabels:
          - __name__
        - action: drop
          regex: container_(file_descriptors|tasks_state|threads_max)
          sourceLabels:
          - __name__
        - action: drop
          regex: container_spec.*
          sourceLabels:
          - __name__
        path: /metrics/cadvisor
        port: https-metrics
        relabelings:
        - action: replace
          sourceLabels:
          - __metrics_path__
          targetLabel: metrics_path
        scheme: https
        tlsConfig:
          caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecureSkipVerify: true
        # /metrics/probes
      - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
        honorLabels: true
        path: /metrics/probes
        port: https-metrics
        relabelings:
        - action: replace
          sourceLabels:
          - __metrics_path__
          targetLabel: metrics_path
        scheme: https
        tlsConfig:
          caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecureSkipVerify: true
      jobLabel: app.kubernetes.io/name
      namespaceSelector:
        matchNames:
        - kube-system
      selector:
        matchLabels:
          app.kubernetes.io/name: k3s
  4. Manually add the Grafana dashboards corresponding to the K8s components (api-server, kubelet, proxy, etc.). They are not installed when the monitoring of those components is disabled in the kube-prometheus-stack chart installation (see the dashboard ConfigMap sketch after the NOTE below).

  5. Manually add the PrometheusRules of the disabled components. The chart also does not install them when their monitoring is disabled.

    kube-prometheus-stack creates several PrometheusRules resources, but all of them are included in a single manifest file in the source repository (https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/kubernetesControlPlane-prometheusRule.yaml).

NOTE: Both the PrometheusRules and the Grafana dashboards might need modifications. They include metrics filtered by job label (kubelet, apiserver, etc.), and with the proposed solution only the job label "k3s" would be used.
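
Regarding step 4, a hedged sketch of how a dashboard could be added manually: the Grafana sidecar deployed by kube-prometheus-stack loads dashboards from ConfigMaps carrying the grafana_dashboard label (the exact label value depends on the chart values). The ConfigMap name, namespace and dashboard JSON below are placeholders:

# Hypothetical dashboard ConfigMap picked up by the Grafana dashboards sidecar
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-apiserver      # placeholder name
  namespace: k3s-monitoring              # namespace watched by the sidecar
  labels:
    grafana_dashboard: "1"               # sidecar label (value depends on chart values)
data:
  apiserver.json: |
    { "title": "Kubernetes / API server", "panels": [] }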

@ricsanfre

Final solution: set the job label to "kubelet" for all the metrics scraped from the K3S components through the kubelet port (see the relabeling sketch after the dashboard list below).
This way only a few dashboards need to be changed (kube-proxy, kube-controller-manager and apiserver).

Selecting a different name such as "k3s" (the initially proposed solution) would force updating all the default kube-prometheus-stack dashboards that use kubelet (container) metrics. For example, the following dashboards use job="kubelet" when filtering the metrics:
Kubernetes - Compute Resources /Cluster
Kubernetes - Compute Resources / Namespace (Pods)
Kubernetes - Compute Resources / Namespace (Workloads)
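
A minimal sketch of that approach (not necessarily the exact manifest used in this repo): forcing the job label inside each ServiceMonitor endpoint so that every series scraped through the kubelet port ends up with job="kubelet":

# Fragment of a ServiceMonitor endpoint definition (placement is an assumption)
relabelings:
- action: replace
  targetLabel: job
  replacement: kubelet   # force job="kubelet" for everything scraped via port 10250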


sherif-fanous commented Feb 23, 2024

@ricsanfre First of all, this repo and the accompanying website are awesome. Thanks for your efforts.

Regarding this issue, I want to let you know that I've solved it in a slightly different manner that ensures the kube-prometheus-stack chart still creates the rules and Grafana dashboards, thus eliminating the need to handle this step manually.

So instead of disabling all the components in the Helm chart, I actually keep them enabled but instruct every ServiceMonitor except the kubelet one to drop all the metrics it scrapes.

e.g. this is how I defined the kubeApiServer section in my values.yaml file:

kubeApiServer:
  serviceMonitor:
    metricRelabelings:
      - action: drop
        regex: .*
        sourceLabels:
          - __name__

I have a similar snippet for kubeControllerManager, kubeProxy, and kubeScheduler.

With this, the chart still creates the rules and dashboards without ingesting duplicate metrics; only the metrics from the kubelet are kept.

Now the rules and dashboards created by the chart refer to a job that needs to be replaced with kubelet, so I make use of a very simple Argo CD Config Management Plugin.

In the init command I use helm template to render the templates, and then in the generate command I run a couple of sed commands that replace the job values with kubelet.

The end result is:

  1. All rules and dashboards are automatically created by the chart with the correct job values
  2. Only one copy of the metrics is ingested (the ones from the kubelet endpoint)

The only drawback is that, although Prometheus doesn't ingest duplicate metrics, it still ends up scraping multiple endpoints and dropping their metrics, which of course means relatively higher CPU and memory usage.

@sherif-fanous

One idea that just occurred to me to address the drawback is to set the interval of the ServiceMonitor to a very high value, thus effectively preventing Prometheus from even scraping those endpoints.
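
For illustration, a sketch of that idea in the chart's values.yaml (1d is just an example value; the full configuration appears in a later comment):

kubeApiServer:
  serviceMonitor:
    interval: 1d   # scrape (and drop) these metrics only once a day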

@mrclrchtr

@sherif-fanous, thank you so much for sharing your ideas.

Would it be possible to share your values.yaml and, especially, a small example of how to run the sed commands with the Config Management Plugin?


sherif-fanous commented Apr 22, 2024

Here are the relevant sections of my values.yaml. Keep in mind this is a k3s single-node cluster running on TrueNAS SCALE, so you might have a slightly different setup than mine, especially regarding etcd and kube-proxy.

kubeApiServer:
  serviceMonitor:
    interval: 1d
    metricRelabelings:
      - action: drop
        regex: .*
        sourceLabels:
          - __name__

kubeControllerManager:
  endpoints:
    - 192.168.4.59
  serviceMonitor:
    https: true
    insecureSkipVerify: true
    interval: 1d
    metricRelabelings:
      - action: drop
        regex: .*
        sourceLabels:
          - __name__

kubeEtcd:
  enabled: false

kubelet:
  serviceMonitor:
    metricRelabelings:
      - action: drop
        regex: apiserver_request_duration_seconds_bucket;(0.15|0.2|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2|3|3.5|4|4.5|6|7|8|9|15|25|40|50)
        sourceLabels:
          - __name__
          - le

kubeProxy:
  enabled: false

kubeScheduler:
  endpoints:
    - 192.168.4.59
  serviceMonitor:
    https: true
    insecureSkipVerify: true
    interval: 1d
    metricRelabelings:
      - action: drop
        regex: .*
        sourceLabels:
          - __name__

The sed commands are in the Argo CD Application manifest. Here's what it looks like:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  annotations:
    argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
    argocd.argoproj.io/sync-wave: '32'
  finalizers:
    - resources-finalizer.argocd.argoproj.io
  name: kube-prometheus-stack
  namespace: argo-cd
spec:
  destination:
    namespace: kube-prometheus-stack
    server: https://kubernetes.default.svc
  project: default
  source:
    chart: kube-prometheus-stack
    repoURL: https://prometheus-community.github.io/helm-charts
    targetRevision: 58.2.1
  sources:
    - chart: kube-prometheus-stack
      plugin:
        name: config-management-plugin-template
        parameters:
          - name: generate-command
            string: >-
              sed -E -i 's/job="(apiserver|kube-scheduler|kube-controller-manager)"/job="kubelet"/g' ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/template/truenas-mini-x-plus/all.yaml && sed -E -i 's/job=\\"(apiserver|kube-scheduler|kube-controller-manager)\\"/job=\\"kubelet\\"/g' ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/template/truenas-mini-x-plus/all.yaml && sed -E -i 's/sum\(up\{cluster=\\"\$cluster\\", job=\\"kubelet\\"\}\)/sum\(up\{cluster=\\"\$cluster\\",job=\\"kubelet\\", metrics_path=\\"\/metrics\\"\}\)/g' ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/template/truenas-mini-x-plus/all.yaml && cat ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/template/truenas-mini-x-plus/all.yaml
          - name: init-command
            string: >-
              mkdir -p ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/template/truenas-mini-x-plus/ && helm template . --create-namespace --namespace prometheus-stack --values ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/helm/values/base/helm-kube-prometheus-stack-values.yaml --values ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/helm/values/overlays/truenas-mini-x-plus/helm-kube-prometheus-stack-values.yaml >
              ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/template/truenas-mini-x-plus/all.yaml
      repoURL: https://prometheus-community.github.io/helm-charts
      targetRevision: 58.2.1
    - path: kubernetes/apps/kube-prometheus-stack/kustomize/overlays/truenas-mini-x-plus
      repoURL: git@github.com:ifanous/home-lab.git
      targetRevision: HEAD
    - ref: root
      repoURL: git@github.com:ifanous/home-lab.git
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    retry:
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
      limit: 5
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true

P.S. My repo is private so you won't be able to access it, but everything you need is in this thread; just replace every reference to my repo with yours.


You also need to set up Argo CD to use a CMP plugin. At a high level, here's what I'm doing in my Argo CD values.yaml:

configs:
  cmp:
    create: true
    plugins:
      config-management-plugin-template:
        generate:
          args:
            - |
              echo "Starting generate phase for application $ARGOCD_APP_NAME" 1>&2;
              echo "Executing $PARAM_GENERATE_COMMAND" 1>&2;
              eval $PARAM_GENERATE_COMMAND;
              echo "Successfully completed generate phase for application $ARGOCD_APP_NAME" 1>&2;
          command: [/bin/sh, -c]
        init:
          args:
            - |
              echo "Starting init phase for application $ARGOCD_APP_NAME" 1>&2;
              echo "Starting a partial treeless clone of repo ifanous/home-lab.git" 1>&2; mkdir ifanous 1>&2; cd ifanous 1>&2; git clone -n --depth=1 --filter=tree:0 https://$IFANOUS_HOME_LAB_HTTPS_USERNAME:$IFANOUS_HOME_LAB_HTTPS_PASSWORD@github.com/ifanous/home-lab.git 1>&2; cd home-lab/ 1>&2; git sparse-checkout set --no-cone $ARGOCD_APP_NAME 1>&2; git checkout 1>&2;
              echo "Successfully completed a partial treeless clone of repo ifanous/home-lab.git" 1>&2;
              echo "Executing $PARAM_INIT_COMMAND" 1>&2;
              cd ../../ 1>&2; eval $PARAM_INIT_COMMAND;
              echo "Successfully completed init phase for application $ARGOCD_APP_NAME" 1>&2;
          command: ["/bin/sh", "-c"]

repoServer:
  extraContainers:
    - args:
        - '--logformat=json'
        - '--loglevel=debug'
      command:
        - /var/run/argocd/argocd-cmp-server
      env:
        - name: IFANOUS_HOME_LAB_HTTPS_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: argocd-repo-creds-ifanous-home-lab-https
        - name: IFANOUS_HOME_LAB_HTTPS_USERNAME
          valueFrom:
            secretKeyRef:
              key: username
              name: argocd-repo-creds-ifanous-home-lab-https
      image: alpine/k8s:1.29.2
      name: config-management-plugin-template
      resources:
        limits:
          memory: 512Mi
        requests:
          memory: 64Mi
      securityContext:
        runAsNonRoot: true
        runAsUser: 999
      volumeMounts:
        - mountPath: /var/run/argocd
          name: var-files
        - mountPath: /home/argocd/cmp-server/plugins
          name: plugins
        - mountPath: /home/argocd/cmp-server/config/plugin.yaml
          name: argocd-cmp-cm
          subPath: config-management-plugin-template.yaml
        - mountPath: /tmp
          name: cmp-tmp

@mrclrchtr

Thank you very much!
