
[monitoring v1] custom metrics endpoint is not working after upgrade from 2.6.2 -> 2.6.3 #35790

Closed
slickwarren opened this issue Dec 7, 2021 · 6 comments

Rancher Server Setup

  • Rancher version: 2.6.3-rc4
  • Installation option (Docker install/Helm Chart): Helm
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):
  • Proxy/Cert Details: self-signed

Information about the Cluster

  • Kubernetes version: v1.20.12 (rke1)
  • Cluster Type (Local/Downstream): downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): custom

Describe the bug

After a user upgrades to 2.6.3, existing monitoring v1 custom metrics endpoints stop working. If the user deploys a new custom metrics endpoint, that one works, but the existing one does not come back up.

To Reproduce

  • deploy monitoring v1 (v0.3.1) to a cluster on Rancher v2.6.2
  • (I did the following through the Ember UI)
  • deploy monitoring to the default project
  • deploy a workload using a custom metrics endpoint
    • verify that the endpoint is working correctly by checking that it is active in Prometheus -> targets/graphs (a kubectl spot check is sketched after this list)
  • upgrade rancher to 2.6.3-rc4
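
Outside the Ember UI, one rough way to do that verification from the cluster side (a suggested spot check, not part of the original steps; it assumes cluster-level Monitoring V1 running in the cattle-prometheus namespace, as in the example Service later in this thread) is to confirm that a ServiceMonitor and a metrics Service exist for the workload:

kubectl -n cattle-prometheus get servicemonitors
kubectl -n cattle-prometheus get services -l cattle.io/metrics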

Result

  • the existing target no longer appears in Prometheus' target list
  • new custom metrics endpoints work

Expected Result

the target/endpoint should persist through the upgrade

Additional context
This is possibly related to #35559 given the other linked issues, but I doubt it, since new custom metrics endpoints do work.

slickwarren added the kind/bug-qa and area/monitoring labels on Dec 7, 2021
slickwarren added this to the v2.6.3 milestone on Dec 7, 2021
slickwarren self-assigned this on Dec 7, 2021
aiyengar2 self-assigned this on Dec 9, 2021

aiyengar2 commented Dec 10, 2021

Good catch @slickwarren! I'm able to consistently reproduce this error and have figured out the root cause.

Although I'm still investigating why a Rancher upgrade causes this to happen, it appears that on upgrade the custom metrics Pod's labels are wiped out, including the label that the Service selector matches on, which is based on WorkloadIDLabelPrefix (e.g. workloadID_WORKLOADNAME: "true").

As a result, the Service defined by the Rancher controllers (example pasted below) ends up with a selector that no longer selects the Pod:

apiVersion: v1
kind: Service
metadata:
  annotations:
    field.cattle.io/targetWorkloadIds: '["deployment:cattle-prometheus:system-example-app"]'
  creationTimestamp: "2021-12-10T05:01:59Z"
  labels:
    cattle.io/creator: norman
    cattle.io/metrics: 45979f8255e707a8d4617fd3cc8cc684
  name: system-example-app-metrics
  namespace: cattle-prometheus
  ownerReferences:
  - apiVersion: monitoring.coreos.com/v1
    controller: true
    kind: ServiceMonitor
    name: system-example-app
    uid: 4fac52c7-6260-4063-af62-c13a1385f94b
  resourceVersion: "12246"
  uid: 70983c92-bba8-4eea-83fa-b431e2ddb72c
spec:
  clusterIP: None
  clusterIPs:
  - None
  ports:
  - name: metrics8080
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    workload.user.cattle.io/workloadselector: deployment-cattle-prometheus-system-example-app
    workloadID_system-example-app-metrics: "true"
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
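
A quick way to confirm that the selector above no longer matches anything (a suggested spot check, not from the original report) is to look at the Service's Endpoints object; if the workloadID_ label was wiped from the Pod, the ENDPOINTS column comes back empty:

kubectl -n cattle-prometheus get endpoints system-example-app-metrics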

Workaround

If you simply delete the Service that is targeted by the ServiceMonitor and allow it to be recreated, the service sync will trigger a reconcile of the Pods that need to be targeted:

targetWorkloadIDs, err := c.reconcilePods(key, obj, workloadIDs)

As a result, the pod will be reloaded with the label and scraping will occur as expected.
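
For the example Service above, that would amount to something like the following (using the name and namespace from the YAML; substitute your own affected Service):

kubectl -n cattle-prometheus delete service system-example-app-metrics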


aiyengar2 commented Dec 10, 2021

One thing to note: the deployment / workload itself does not contain this label. This label is added directly to Pods by this Rancher controller on enqueuing a new or updated ServiceMonitor, which triggers the creation/update of a Service, which triggers ensuring the label is added to the Pod.
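
One way to see this for yourself (a suggested check, reusing the example names from my previous comment) is to compare the Service's selector with the labels actually present on the Pod; on an affected cluster, the workloadID_ key appears in the selector but not on the Pod:

kubectl -n cattle-prometheus get service system-example-app-metrics -o jsonpath='{.spec.selector}'
kubectl -n cattle-prometheus get pods -l workload.user.cattle.io/workloadselector=deployment-cattle-prometheus-system-example-app --show-labels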


aiyengar2 commented Dec 10, 2021

I also tested whether we experience this issue between 2.6.1 and 2.6.2: the workload label does not get wiped out.

On checking between 2.6.2 and 2.6.3-rc1, the workload label does get wiped out.

Therefore, whatever is causing this condition was likely introduced between those two tags:

v2.6.2...v2.6.3-rc1

Edit: see below.


aiyengar2 commented Dec 11, 2021

I also tested whether we experience this issue between 2.6.1 and 2.6.2: the workload label does not get wiped out.

Actually, it seems I was incorrect on this front; I just tried checking it twice, and an upgrade from 2.6.1 to 2.6.2 also introduces this error. This means that discovering the root cause of this issue might be more difficult than previously envisioned.

@sowmyav27 @slickwarren @Jono-SUSE-Rancher I can continue looking into this on Monday, but the simplest workaround for this issue as mentioned above is to just delete the services that are being affected to allow a resync to occur. Doing this across all services in your cluster that are impacted by this would be as simple as running the following one-line command:

kubectl delete services -A -l cattle.io/metrics

e.g. "delete all services across all namespaces with the label cattle.io/metrics", where that label is defined in:

metricsServiceLabel = "cattle.io/metrics"

and is only expected to exist on services maintained by Monitoring V1 controllers (since they will be synced by our controllers), as shown in:

if _, ok := svc.Annotations[metricsServiceLabel]; !ok {
	return svc, nil
}

serviceLabels := map[string]string{
	metricsServiceLabel: base64Key,
}
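
To preview which Services the workaround would touch before deleting anything, the same label selector can be used with a plain get:

kubectl get services -A -l cattle.io/metrics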

My call here would be to just release note this as a known 2.6 bug and put it on our backlog, considering Monitoring V1 is already deprecated in favor of V2 and the workaround is fairly simple (which we can test). Moving this to 2.6.4 also works, but I don't think this needs to be a release blocker.

Thoughts?

aiyengar2 added the [zube]: To Test and release-note labels and removed the [zube]: Working label on Dec 11, 2021
@aiyengar2

As discussed offline, let's test 1) whether this impacts 2.6.1 -> 2.6.2 upgrades and 2) that the workaround works as expected for fixing workloads, before adding a release note for it.

For testing the workaround, we should ensure that both Project and Cluster Monitoring are fixed for monitoring custom metrics endpoints simply by running the one kubectl command provided in #35790 (comment).

kubectl delete services -A -l cattle.io/metrics

@slickwarren

I was not able to reproduce this on a v2.6.1 -> v2.6.2 upgrade.
I can confirm that the workaround is valid for 2.6.2 -> 2.6-head (3dbea4c): run kubectl delete services -A -l cattle.io/metrics if custom metrics endpoints are down after the upgrade.
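
An additional sanity check after running the delete (a suggestion, not something verified as part of the above) is to confirm that the controllers re-applied the workloadID_ label to the Pods, e.g. for the example workload from earlier in this thread:

kubectl -n cattle-prometheus get pods -l workloadID_system-example-app-metrics=true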

Labels
area/monitoring, feature/charts-monitoring-v1, kind/bug-qa (Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement), release-note (Note this issue in the milestone's release notes), status/release-blocker, team/area3