[monitoring v1] custom metrics endpoint is not working after upgrade from 2.6.2 -> 2.6.3 #35790
Good catch @slickwarren! I'm able to consistently reproduce this error and have figured out the root cause. Although I'm still trying to investigate why a Rancher upgrade causes this to happen, it seems that on upgrade the custom metrics Pod's labels are wiped out, including the label the Service selector is based on (the `workloadID_*` label in the example below). As a result, the Service defined by the Rancher controllers (example pasted below) ends up having a selector that no longer selects the Pod:

```yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    field.cattle.io/targetWorkloadIds: '["deployment:cattle-prometheus:system-example-app"]'
  creationTimestamp: "2021-12-10T05:01:59Z"
  labels:
    cattle.io/creator: norman
    cattle.io/metrics: 45979f8255e707a8d4617fd3cc8cc684
  name: system-example-app-metrics
  namespace: cattle-prometheus
  ownerReferences:
  - apiVersion: monitoring.coreos.com/v1
    controller: true
    kind: ServiceMonitor
    name: system-example-app
    uid: 4fac52c7-6260-4063-af62-c13a1385f94b
  resourceVersion: "12246"
  uid: 70983c92-bba8-4eea-83fa-b431e2ddb72c
spec:
  clusterIP: None
  clusterIPs:
  - None
  ports:
  - name: metrics8080
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    workload.user.cattle.io/workloadselector: deployment-cattle-prometheus-system-example-app
    workloadID_system-example-app-metrics: "true"
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
```

Workaround

If you simply delete the Service that is being targeted by the ServiceMonitor and allow it to be recreated, it will trigger reconciling the Pods that need to be targeted on a service sync (see rancher/pkg/controllers/managementagent/targetworkloadservice/target_workload_service.go, line 86 at commit 0f464e2).
As a result, the Pod will be reloaded with the label and scraping will occur as expected.
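For the example Service above, that amounts to the following (a minimal sketch; the name and namespace are taken from the YAML pasted earlier):

```sh
# Delete the metrics Service; the Rancher controller recreates it, and the
# service sync re-adds the workloadID_* label to the targeted Pods.
kubectl delete service system-example-app-metrics -n cattle-prometheus
```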
One thing to note: the Deployment / workload itself does not contain this label. This label is added directly to Pods by this Rancher controller: enqueuing a new or updated ServiceMonitor triggers the creation/update of a Service, which in turn triggers ensuring the label is added to the Pod.
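To see the mismatch directly, one can compare the Service's selector with the labels actually present on the Pods (a sketch, using the example names above):

```sh
# Print the selector the Service uses to find its endpoints...
kubectl get service system-example-app-metrics -n cattle-prometheus \
  -o jsonpath='{.spec.selector}'
# ...and the labels on the Pods; after the upgrade, the workloadID_* label
# is missing from the Pods, so the Service selects nothing.
kubectl get pods -n cattle-prometheus --show-labels
```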
On checking between 2.6.2 and 2.6.3-rc1, the workload label does get wiped out.
Edit: see below.
Actually, it seems I was incorrect on this front; I just tried checking it twice, and an upgrade from 2.6.1 to 2.6.2 also introduces this error. This means that discovering the root cause of this issue might be more difficult than previously envisioned. @sowmyav27 @slickwarren @Jono-SUSE-Rancher I can continue looking into this on Monday, but the simplest workaround for this issue, as mentioned above, is to delete the affected services to allow a resync to occur. Doing this across all impacted services in your cluster would be as simple as running a one-line command:
i.e., "delete all services across all namespaces with the label". The label is only expected to exist on services maintained by Monitoring V1 controllers (since they will be synced by our controllers), as shown in rancher/pkg/controllers/managementagent/servicemonitor/ensure.go (lines 27 to 29 at commit cb7de4e) and rancher/pkg/controllers/managementagent/servicemonitor/workloadmetricsservice.go (lines 83 to 85 at commit cb7de4e).
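A sketch of such a one-liner, assuming (based on the example Service above, not confirmed in this thread) that the label in question is `cattle.io/metrics`:

```sh
# Assumption: Monitoring V1 services carry the cattle.io/metrics label,
# as on the example Service above. List the candidates first:
kubectl get services --all-namespaces -l cattle.io/metrics
# Then delete them; the Rancher controllers recreate them and resync
# the workload labels on the targeted Pods:
kubectl delete services --all-namespaces -l cattle.io/metrics
```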
My call here would be to release note this as a known 2.6 bug and put it on our backlog, considering Monitoring V1 is already deprecated in favor of V2 and the workaround is fairly simple (which we can test). Moving this to 2.6.4 also works, but I don't think this needs to be a release blocker. Thoughts?
Let's test (1) that this impacts 2.6.1 -> 2.6.2 upgrades and (2) that the workaround fixes affected workloads as expected, before adding a release note for it, as discussed offline. For testing the workaround, we should ensure that both Project and Cluster Monitoring are fixed for monitoring custom metrics endpoints simply by running the one kubectl command provided in #35790 (comment).
I was not able to reproduce this on a v2.6.1 -> v2.6.2 upgrade.
Rancher Server Setup
Information about the Cluster
Describe the bug
After the user upgrades to 2.6.3, Monitoring V1 custom metrics endpoints stop working. If the user deploys a new custom metrics endpoint, that one will work, but the existing one does not come back up.
To Reproduce
Result
- the target no longer appears in Prometheus' list of targets
- new custom metrics endpoints work
Expected Result
- the target/endpoint should persist through the upgrade
Additional context
This is possibly related to #35559 given the other linked issues, but I doubt it, since new custom metrics endpoints do work.