
[monitoring v1] custom metrics endpoint is not working after upgrade from 2.6.2 -> 2.6.3 #35790

Closed
slickwarren opened this issue Dec 7, 2021 · 6 comments

Rancher Server Setup

  • Rancher version: 2.6.3-rc4
  • Installation option (Docker install/Helm Chart): Helm
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):
  • Proxy/Cert Details: self-signed

Information about the Cluster

  • Kubernetes version: v1.20.12 (rke1)
  • Cluster Type (Local/Downstream): downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): custom

Describe the bug

After a user upgrades to 2.6.3, existing monitoring v1 custom metrics endpoints stop working. If the user deploys a new custom metrics endpoint, that one works, but the existing one does not come back up.

To Reproduce

  • deploy monitoring v1 (v0.3.1) to a cluster on Rancher v2.6.2
  • (I did the following through the Ember UI)
  • deploy monitoring to the default project
  • deploy a workload using a custom metrics endpoint
    • verify that the endpoint is working correctly by checking that it is active in Prometheus -> targets/graphs (a kubectl spot check is sketched after this list)
  • upgrade rancher to 2.6.3-rc4
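
Outside the Ember UI, one rough way to do that verification from the cluster side (a suggested spot check, not part of the original steps; it assumes cluster-level Monitoring V1 running in the cattle-prometheus namespace, as in the example Service later in this thread) is to confirm that a ServiceMonitor and a metrics Service exist for the workload:

kubectl -n cattle-prometheus get servicemonitors
kubectl -n cattle-prometheus get services -l cattle.io/metrics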

Result

  • the existing target no longer appears in Prometheus' target list
  • new custom metrics endpoints work

Expected Result

the target/endpoint should persist through the upgrade

Additional context
This is possibly related to #35559 given the other linked issues, but I doubt it, since new custom metrics endpoints do work.

slickwarren added the kind/bug-qa and area/monitoring labels on Dec 7, 2021
slickwarren added this to the v2.6.3 milestone on Dec 7, 2021
slickwarren self-assigned this on Dec 7, 2021
aiyengar2 self-assigned this on Dec 9, 2021

aiyengar2 commented Dec 10, 2021

Good catch @slickwarren! I'm able to consistently reproduce this error and have figured out the root cause.

Although I'm still investigating why a Rancher upgrade causes this to happen, it appears that on upgrade the custom metrics Pod's labels are wiped out, including the label that the Service selector matches on, which is based on WorkloadIDLabelPrefix (e.g. workloadID_WORKLOADNAME: "true").

As a result, the Service defined by the Rancher controllers (example pasted below) ends up with a selector that no longer selects the Pod:

apiVersion: v1
kind: Service
metadata:
  annotations:
    field.cattle.io/targetWorkloadIds: '["deployment:cattle-prometheus:system-example-app"]'
  creationTimestamp: "2021-12-10T05:01:59Z"
  labels:
    cattle.io/creator: norman
    cattle.io/metrics: 45979f8255e707a8d4617fd3cc8cc684
  name: system-example-app-metrics
  namespace: cattle-prometheus
  ownerReferences:
  - apiVersion: monitoring.coreos.com/v1
    controller: true
    kind: ServiceMonitor
    name: system-example-app
    uid: 4fac52c7-6260-4063-af62-c13a1385f94b
  resourceVersion: "12246"
  uid: 70983c92-bba8-4eea-83fa-b431e2ddb72c
spec:
  clusterIP: None
  clusterIPs:
  - None
  ports:
  - name: metrics8080
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    workload.user.cattle.io/workloadselector: deployment-cattle-prometheus-system-example-app
    workloadID_system-example-app-metrics: "true"
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
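
A quick way to confirm that the selector above no longer matches anything (a suggested spot check, not from the original report) is to look at the Service's Endpoints object; if the workloadID_ label was wiped from the Pod, the ENDPOINTS column comes back empty:

kubectl -n cattle-prometheus get endpoints system-example-app-metrics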

Workaround

If you simply delete the Service that is targeted by the ServiceMonitor and allow it to be recreated, the service sync will trigger a reconcile of the Pods that need to be targeted:

targetWorkloadIDs, err := c.reconcilePods(key, obj, workloadIDs)

As a result, the pod will be reloaded with the label and scraping will occur as expected.
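
For the example Service above, that would amount to something like the following (using the name and namespace from the YAML; substitute your own affected Service):

kubectl -n cattle-prometheus delete service system-example-app-metrics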


aiyengar2 commented Dec 10, 2021

One thing to note: the deployment / workload itself does not contain this label. This label is added directly to Pods by this Rancher controller on enqueuing a new or updated ServiceMonitor, which triggers the creation/update of a Service, which triggers ensuring the label is added to the Pod.
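
One way to see this for yourself (a suggested check, reusing the example names from my previous comment) is to compare the Service's selector with the labels actually present on the Pod; on an affected cluster, the workloadID_ key appears in the selector but not on the Pod:

kubectl -n cattle-prometheus get service system-example-app-metrics -o jsonpath='{.spec.selector}'
kubectl -n cattle-prometheus get pods -l workload.user.cattle.io/workloadselector=deployment-cattle-prometheus-system-example-app --show-labels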


aiyengar2 commented Dec 10, 2021

I also tested whether we experience this issue between 2.6.1 and 2.6.2: the workload label does not get wiped out.

On checking between 2.6.2 and 2.6.3-rc1, the workload label does get wiped out.

Therefore, whatever is causing this condition was likely introduced between those two tags:

v2.6.2...v2.6.3-rc1

Edit: see below.


aiyengar2 commented Dec 11, 2021

I also tested whether we experience this issue between 2.6.1 and 2.6.2: the workload label does not get wiped out.

Actually, it seems I was incorrect on this front; I just tried checking it twice, and an upgrade from 2.6.1 to 2.6.2 also introduces this error. This means that discovering the root cause of this issue might be more difficult than previously envisioned.

@sowmyav27 @slickwarren @Jono-SUSE-Rancher I can continue looking into this on Monday, but the simplest workaround for this issue as mentioned above is to just delete the services that are being affected to allow a resync to occur. Doing this across all services in your cluster that are impacted by this would be as simple as running the following one-line command:

kubectl delete services -A -l cattle.io/metrics

e.g. "delete all services across all namespaces with the label cattle.io/metrics", where that label is defined in:

metricsServiceLabel = "cattle.io/metrics"

and is only expected to exist on services maintained by Monitoring V1 controllers (since they will be synced by our controllers), as shown in:

if _, ok := svc.Annotations[metricsServiceLabel]; !ok {
	return svc, nil
}

serviceLabels := map[string]string{
	metricsServiceLabel: base64Key,
}
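
To preview which Services the workaround would touch before deleting anything, the same label selector can be used with a plain get:

kubectl get services -A -l cattle.io/metrics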

My call here would be to just release note this as a known 2.6 bug and put it on our backlog, considering Monitoring V1 is already deprecated in favor of V2 and the workaround is fairly simple (which we can test). Moving this to 2.6.4 also works, but I don't think this needs to be a release blocker.

Thoughts?

aiyengar2 added the [zube]: To Test and release-note labels and removed the [zube]: Working label on Dec 11, 2021
@aiyengar2

As discussed offline, let's test 1) whether this impacts 2.6.1 -> 2.6.2 upgrades and 2) that the workaround works as expected for fixing workloads, before adding a release note for it.

For testing the workaround, we should ensure that both Project and Cluster Monitoring are fixed for monitoring custom metrics endpoints simply by running the one kubectl command provided in #35790 (comment).

kubectl delete services -A -l cattle.io/metrics

@slickwarren

I was not able to reproduce this on a v2.6.1 -> v2.6.2 upgrade.
I can confirm that the workaround is valid for 2.6.2 -> 2.6-head (3dbea4c): run kubectl delete services -A -l cattle.io/metrics if custom metrics endpoints are down after the upgrade.
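
An additional sanity check after running the delete (a suggestion, not something verified as part of the above) is to confirm that the controllers re-applied the workloadID_ label to the Pods, e.g. for the example workload from earlier in this thread:

kubectl -n cattle-prometheus get pods -l workloadID_system-example-app-metrics=true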

Labels
area/monitoring, feature/charts-monitoring-v1, kind/bug-qa (Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement), release-note (Note this issue in the milestone's release notes), status/release-blocker, team/area3