🌱 Sync with upstream open-cluster-management-io/cluster-proxy 20260427#568
Conversation
…ent-io#282) * Add TokenReview result caching to reduce API server load The service-proxy performs TokenReview API calls for every incoming request without any caching. Under high concurrency (e.g., ArgoCD push model traffic routed through cluster-proxy), this creates excessive load on both the managed cluster and hub API servers, leading to significant latency degradation. This change introduces a TTL-based in-memory cache for TokenReview results: - Caches both authenticated and unauthenticated results - Default TTL of 5 minutes, configurable via --token-review-cache-ttl flag - Set TTL to 0 to disable caching entirely - Token hashed (SHA-256) before use as cache key for security - Thread-safe with automatic expired entry eviction Signed-off-by: Zhao Xue <zxue@redhat.com> Signed-off-by: xuezhaojun <zxue@redhat.com> * Upgrade TokenReview cache from pure TTL to LRU+TTL Add LRU eviction to cap memory usage under high cardinality token scenarios. When the cache exceeds maxSize (default 1000), the least recently used entry is evicted. TTL expiration is preserved as before. Signed-off-by: Zhao Xue <zxue@redhat.com> Signed-off-by: xuezhaojun <zxue@redhat.com> * Fix code review issues: context lifecycle, cache pollution, deep copy Address issues found during code review: 1. Goroutine leak: eviction goroutine now accepts context.Context and stops when context is cancelled, preventing leaks on shutdown or cache re-creation. 2. Cache pollution: no longer cache unauthenticated (negative) results. Previously an attacker could flood the cache with invalid tokens to evict legitimate entries via LRU. Now only authenticated results are cached, so the cache cannot be polluted with arbitrary tokens. 3. Deep copy on Get: return user.DeepCopy() from Get to prevent callers from accidentally mutating cached state. 4. Fix misleading comment in eviction loop: LRU order != TTL order. Signed-off-by: Zhao Xue <zxue@redhat.com> Signed-off-by: xuezhaojun <zxue@redhat.com> * Fix misspell lint: cancelled -> canceled Signed-off-by: Zhao Xue <zxue@redhat.com> Signed-off-by: xuezhaojun <zxue@redhat.com> * Increase default kube client QPS/burst for service-proxy The default client-go rate limit (QPS=5, burst=10) causes client-side throttling under high-concurrency workloads, resulting in request delays of 1min+ as TokenReview calls queue up internally. Raise the defaults to QPS=50 and burst=100, and expose --kube-api-qps / --kube-api-burst flags for tuning. Signed-off-by: Zhao Xue <zxue@redhat.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: xuezhaojun <zxue@redhat.com> * Replace custom TokenReview cache with official k8s.io/apiserver token cache Switch from our custom LRU+TTL cache to the battle-tested k8s.io/apiserver/pkg/authentication/token/cache package used by kube-apiserver itself. Key improvements: - singleflight: concurrent requests for the same token share one API call (vs our cache which only helped after the first response) - Striped cache with HMAC-SHA256 key derivation - Reduce default TTL from 5min to 10s (singleflight handles burst dedup) - Log managed cluster authenticated identity for debugging parity Signed-off-by: Zhao Xue <zxue@redhat.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: xuezhaojun <zxue@redhat.com> * Cache TokenReview failure results to eliminate singleflight cold-start spikes Set failureTTL = successTTL, matching kube-apiserver best practice (see k8s.io/apiserver/pkg/authentication/authenticatorfactory/delegating.go). Without failure caching, hub tokens that always fail managed cluster auth trigger a new singleflight group after each completion, causing 8s+ latency spikes at c=100. With failure caching, the "not authenticated" result is cached and subsequent requests skip the API call entirely. Signed-off-by: Zhao Xue <zxue@redhat.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: xuezhaojun <zxue@redhat.com> * Address CodeRabbit review: handle TokenReview Status.Error and make token reader injectable - Return error when TokenReview Status.Error is set instead of treating it as unauthenticated, so managed cluster backend failures are not silently swallowed and don't trigger incorrect hub auth fallback. - Make getImpersonateToken injectable (getImpersonateTokenFunc) so tests don't depend on the host's service-account token file mount. - Add tests for Status.Error propagation and token reader error path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: xuezhaojun <zxue@redhat.com> --------- Signed-off-by: Zhao Xue <zxue@redhat.com> Signed-off-by: xuezhaojun <zxue@redhat.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…io#285) * feat: add TLS profile support to user-server - update sdk-go in go.mod to include pkg/tls library - add TLS profile ConfigMap watcher to user-server - add POD_NAMESPACE env var to user-server deployment Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tesshu Flower <tflower@redhat.com> * health probes pass tlsconfig for consistency - health probes are using HTTP however the setup of the server sets a default TLSConfig anyway - to reduce confusion, pass through our custom TLSConfig rather than hardcoding a default. Signed-off-by: Tesshu Flower <tflower@redhat.com> * feat: add TLS profile support to service-proxy - Add TLS ConfigMap watcher to service-proxy using sdk-go pkg/tls - Add POD_NAMESPACE env var to service-proxy container via downward API - Add configmap get/list/watch permissions to addon-agent role Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tesshu Flower <tflower@redhat.com> * feat: add TLS profile support to addon-manager (proxy-server) - Watch the ocm-tls-profile ConfigMap via StartTLSConfigMapWatcher and pass cipher suites to the ANP proxy-server deployment. The addon-manager restarts on ConfigMap changes, consistent with user-server and service-proxy. - TLS min version support for anp-server is prepared but commented out pending upstream ANP support for --tls-min-version. - unit test for args to anp-server - including new tlsconfig args (will also need to be updated once ANP has support for tls-min-version) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tesshu Flower <tflower@redhat.com> * Update ensure to propagate TLS config changes to proxy-server deployment The ensure function only updated resources when the ManagedProxyConfiguration CR generation bumped, so TLS config changes from the ocm-tls-profile ConfigMap were never applied to the proxy-server deployment. Add a TLS config hash annotation to the deployment and trigger updates when the hash differs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tesshu Flower <tflower@redhat.com> * address some nitpick comments Signed-off-by: Tesshu Flower <tflower@redhat.com> --------- Signed-off-by: Tesshu Flower <tflower@redhat.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
|
/override sonar-pre-submit |
|
@tesshuflower: /override requires failed status contexts, check run or a prowjob name to operate on.
Only the following failed contexts/checkruns were expected:
If you are trying to override a checkrun that has a space in it, you must put a double quote on the context. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@mikeshng: Overrode contexts on behalf of mikeshng: ci/prow/sonar-pre-submit DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mikeshng, tesshuflower The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |




Summary
Merge upstream
open-cluster-management-io/cluster-proxymain to bring in 2 commits:Conflicts in go.mod, go.sum, and vendor resolved by taking upstream versions.
Test plan