Skip to content

fix: nodemetric version comparison#969

Merged
yulken merged 1 commit intorancher:mainfrom
yulken:fix/list-metrics-cache-issues
Feb 19, 2026
Merged

fix: nodemetric version comparison#969
yulken merged 1 commit intorancher:mainfrom
yulken:fix/list-metrics-cache-issues

Conversation

@yulken
Copy link
Contributor

@yulken yulken commented Jan 7, 2026

This PR addresses the following issues:

Analyzing the case, I noticed that setting the flag features=ui-sql-cache=false made the metrics collection work as expected.
In summary, the thing is that NodeMetrics is not natively compatible with the watch verb, meaning its cache will never be updated after Steve starts.

root@cp:/home/rancher# kc api-resources -o wide | grep NodeMetrics
nodes                                                              metrics.k8s.io/v1beta1                    false        NodeMetrics                                 get,list   

Similarly, the "Restarts" data for the pod will also not be updated once it is cached, unless there is a modification to the pod itself.

@yulken yulken requested a review from a team as a code owner January 7, 2026 13:17
@tomleb
Copy link
Contributor

tomleb commented Jan 7, 2026

Hi @yulken thanks for looking into that. There should be a "synthetic watcher" somewhere in steve that essentially polls the data for non-watchable APIs. I vaguely remember it not working correctly but have forgotten about it. That would fix the NodeMetrics issue.

As for Pod restart column, I remember it being mentioned somewhere, not sure if it's your issue or some other issue 🤔, I'll have to check. To fix this, we're likely going to want to extract column data that we receive from the api server and store it differently. Basically, same thing as we do with the other duration columns.

We shouldn't bypass the cache since the new filters / sorting functionality depends on it.

@tomleb
Copy link
Contributor

tomleb commented Jan 7, 2026

There we go, found it: rancher/rancher#52217

@yulken
Copy link
Contributor Author

yulken commented Jan 7, 2026

Thank you @tomleb. I'll investigate this synthetic watcher.

@yulken yulken force-pushed the fix/list-metrics-cache-issues branch from db5b321 to 1a82b5a Compare January 15, 2026 14:52
@yulken
Copy link
Contributor Author

yulken commented Jan 15, 2026

@tomleb I found the root cause for this issue:
We use resource version as comparison parameter to determine if we need to update previous data¹.

The thing is that NodeMetrics version is always an empty string, so we should use a different attribute to compare this.

I tried comparing timestamps specifically for NodeMetrics and it seems to work in my local environment.
I think we can reach a better solution, but I'm not sure how.
Do you have any advice for me?

  1. https://github.com/rancher/steve/blob/main/pkg/sqlcache/informer/synthetic_watcher.go#L86C1

Ps.: I also added .vscode to .gitignore.

@yulken yulken marked this pull request as draft January 15, 2026 15:08
Copy link
Contributor

@tomleb tomleb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The solution to fix it for NodeMetrics makes sense.

Can you add at least a test?

I'd like to check with UI (@richard-cox) on how we should tackle this to support HA (aka when Rancher has multiple replicas). Without a resourceVersion, the logic where we avoid "stale" data via revision=X doesn't really work.

Are we using mode=resources.changes with NodeMetrics?

@richard-cox
Copy link
Member

richard-cox commented Jan 26, 2026

@tomleb which specific resource are we talking about?

  • metrics.k8s.io podmetrics
    • not used in the UI
  • metrics.k8s.io nodemetrics
    • we use the schema for this
    • The schema does not report WATCH though, so we don't watch. Instead we poll (every 30 seconds).
    • The schema reports the resource to LIST and GET as a different resource metrics.k8s.io nodes
  • metrics.k8s.io nodes
    • We use this type for GET nodemetrics requests given above (in node lists and detail pages)
    • The GET request does not supply a revision (because it doesn't come from a watch with one)
    • Given that i think replica cache syncs shouldn't be longer than 30s in theory worst case we show 59 second old data (ui polls for data, a change happens after a second, ui polls for data from stale cache not containing change after 29 seconds, ui polls for data after 30 seconds - at which point the change should be in all replicas)?

@tomleb
Copy link
Contributor

tomleb commented Jan 27, 2026

The schema reports the resource to LIST and GET as a different resource metrics.k8s.io nodes

@richard-cox

TIL, didn't know this was a thing. Why do we do that? I see the schema also reports nodes instead of nodemetrics.

You're right that in theory the data wouldn't be stale for too long, might always be 1 "state" behind. Are we okay with this? What about the general case?

@richard-cox
Copy link
Member

Why do we do that?

Not quite sure. the resources themselves look exactly the same as well

Are we okay with this? What about the general case?

Until it becomes an issue i say we stick with what we have. In terms of the general case how do you mean?

@tomleb
Copy link
Contributor

tomleb commented Jan 27, 2026

In terms of the general case how do you mean?

There might be other resources coming from Extensions API server that will also be list-able but not watch-able. Maybe we'd want to handle the general case now. Though I'm tempted to go with "Until it becomes an issue i say we stick with what we have." for that as well. So maybe let's just "wait-and-see" in that case.

@yulken yulken force-pushed the fix/list-metrics-cache-issues branch from 1a82b5a to 1ebfeaf Compare January 28, 2026 14:33
@yulken yulken marked this pull request as ready for review January 28, 2026 15:50
@yulken yulken requested a review from tomleb January 28, 2026 15:51
@yulken yulken changed the title fix: bypass cache queries for nodemetrics/podmetrics fix: nodemetric version comparison Jan 28, 2026
@yulken yulken self-assigned this Jan 28, 2026
Copy link
Contributor

@tomleb tomleb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yulken yulken merged commit 89138d2 into rancher:main Feb 19, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants