CA DRA: integrate DeltaSnapshotStore with dynamicresources.Snapshot #7681

@towca

Description

Which component are you using?:

/area cluster-autoscaler
/area core-autoscaler
/wg device-management

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

There are 2 ClusterSnapshotStore implementations with very different performance characteristics:

  • BasicSnapshotStore is a very simple reference implementation that clones the whole state during Fork(). It's easy to understand and can be used e.g. in tests, but the complexity of its operations is not optimized for the typical usage patterns of a Cluster Autoscaler loop, so it's not really intended for production use.
  • DeltaSnapshotStore is a much more complex implementation that branches and keeps deltas separately for every Fork(). The complexity of its operations is optimized for typical Cluster Autoscaler usage patterns. This is the de facto production implementation.
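To make the difference between the two strategies concrete, here is a minimal Go sketch of both Fork() approaches. All names (nodeInfo, basicStore, deltaStore) are hypothetical simplifications, not the real cluster-autoscaler types: a NodeInfo is reduced to a pod count and node deletions are omitted. The basic variant copies the whole map on every Fork(), while the delta variant pushes an empty layer and lets lookups fall back to older layers.

```go
package main

import "fmt"

// nodeInfo is a hypothetical stand-in for the real NodeInfo type.
type nodeInfo struct{ pods int }

// basicStore copies the whole current state on every Fork(): O(number of nodes).
type basicStore struct {
	states []map[string]nodeInfo // the last element is the current state
}

func newBasicStore() *basicStore { return &basicStore{states: []map[string]nodeInfo{{}}} }

func (s *basicStore) current() map[string]nodeInfo { return s.states[len(s.states)-1] }

func (s *basicStore) Get(name string) (nodeInfo, bool) {
	ni, ok := s.current()[name]
	return ni, ok
}

func (s *basicStore) Set(name string, ni nodeInfo) { s.current()[name] = ni }

// Fork clones every entry of the current map.
func (s *basicStore) Fork() {
	clone := make(map[string]nodeInfo, len(s.current()))
	for k, v := range s.current() {
		clone[k] = v
	}
	s.states = append(s.states, clone)
}

func (s *basicStore) Revert() {
	if len(s.states) > 1 {
		s.states = s.states[:len(s.states)-1]
	}
}

// Commit makes the forked state the new parent state.
func (s *basicStore) Commit() {
	if n := len(s.states); n > 1 {
		s.states[n-2] = s.states[n-1]
		s.states = s.states[:n-1]
	}
}

// deltaStore pushes an empty layer on Fork(): O(1). Reads walk the layers newest-first.
type deltaStore struct {
	layers []map[string]nodeInfo // layers[0] is the base state, later layers hold deltas
}

func newDeltaStore() *deltaStore { return &deltaStore{layers: []map[string]nodeInfo{{}}} }

func (s *deltaStore) Get(name string) (nodeInfo, bool) {
	for i := len(s.layers) - 1; i >= 0; i-- {
		if ni, ok := s.layers[i][name]; ok {
			return ni, true
		}
	}
	return nodeInfo{}, false
}

func (s *deltaStore) Set(name string, ni nodeInfo) { s.layers[len(s.layers)-1][name] = ni }

func (s *deltaStore) Fork() { s.layers = append(s.layers, map[string]nodeInfo{}) }

func (s *deltaStore) Revert() {
	if len(s.layers) > 1 {
		s.layers = s.layers[:len(s.layers)-1]
	}
}

// Commit merges the top delta into the layer below it: O(size of the delta).
func (s *deltaStore) Commit() {
	if n := len(s.layers); n > 1 {
		for k, v := range s.layers[n-1] {
			s.layers[n-2][k] = v
		}
		s.layers = s.layers[:n-1]
	}
}

func main() {
	b, d := newBasicStore(), newDeltaStore()
	b.Set("n1", nodeInfo{pods: 1})
	d.Set("n1", nodeInfo{pods: 1})
	b.Fork()
	d.Fork()
	b.Set("n1", nodeInfo{pods: 2})
	d.Set("n1", nodeInfo{pods: 2})
	b.Revert()
	d.Revert()
	bi, _ := b.Get("n1")
	di, _ := d.Get("n1")
	fmt.Println(bi.pods, di.pods) // 1 1
}
```

The trade-off shows up in the method shapes: basicStore pays O(n) per Fork() and O(1) per lookup, while deltaStore pays O(1) per Fork() and O(depth) per lookup, which pays off when forks are frequent and each fork only touches a few nodes.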

In order for DRA autoscaling to work, a ClusterSnapshotStore implementation has to integrate with dynamicresources.Snapshot. This means correctly handling the DRA snapshot during Fork()/Commit()/Revert().

For DRA autoscaling MVP, only BasicSnapshotStore was integrated with dynamicresources.Snapshot. It's pretty trivial in this case - we just need dynamicresources.Snapshot.Clone(). This was enough to test the MVP, but for production use we need to integrate DeltaSnapshotStore as well.
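For the clone-on-Fork case, integrating the DRA snapshot roughly amounts to cloning it alongside the node state. The sketch below assumes a hypothetical draSnapshot type that only records claim allocations and exposes a Clone() method; the real dynamicresources.Snapshot tracks much more, and basicStore/snapshotState are illustrative names, not the actual BasicSnapshotStore code.

```go
package main

import "fmt"

// draSnapshot is a hypothetical, heavily simplified stand-in for dynamicresources.Snapshot:
// it only records which node each ResourceClaim is allocated on.
type draSnapshot struct {
	claimAllocations map[string]string // "namespace/name" -> node name
}

func newDraSnapshot() *draSnapshot {
	return &draSnapshot{claimAllocations: map[string]string{}}
}

// Clone returns a deep copy, so changes made to the copy never leak into the original.
func (s *draSnapshot) Clone() *draSnapshot {
	c := newDraSnapshot()
	for k, v := range s.claimAllocations {
		c.claimAllocations[k] = v
	}
	return c
}

// snapshotState is everything a clone-on-Fork store has to copy.
type snapshotState struct {
	nodes map[string]int // node name -> pod count, a stand-in for NodeInfo
	dra   *draSnapshot
}

func (st snapshotState) clone() snapshotState {
	nodes := make(map[string]int, len(st.nodes))
	for k, v := range st.nodes {
		nodes[k] = v
	}
	return snapshotState{nodes: nodes, dra: st.dra.Clone()}
}

// basicStore keeps one full snapshotState per Fork() level.
type basicStore struct{ stack []snapshotState }

func newBasicStore() *basicStore {
	return &basicStore{stack: []snapshotState{{nodes: map[string]int{}, dra: newDraSnapshot()}}}
}

func (s *basicStore) current() *snapshotState { return &s.stack[len(s.stack)-1] }

// Fork clones the node state and the DRA snapshot together.
func (s *basicStore) Fork() { s.stack = append(s.stack, s.current().clone()) }

// Revert drops the fork, rolling back nodes and DRA state consistently.
func (s *basicStore) Revert() {
	if len(s.stack) > 1 {
		s.stack = s.stack[:len(s.stack)-1]
	}
}

// Commit makes the forked state the new parent state.
func (s *basicStore) Commit() {
	if n := len(s.stack); n > 1 {
		s.stack[n-2] = s.stack[n-1]
		s.stack = s.stack[:n-1]
	}
}

func main() {
	s := newBasicStore()
	s.current().nodes["n1"] = 0
	s.Fork()
	// Inside the fork, pretend a pod using claim-a was scheduled on n1.
	s.current().nodes["n1"]++
	s.current().dra.claimAllocations["default/claim-a"] = "n1"
	s.Revert()
	_, allocated := s.current().dra.claimAllocations["default/claim-a"]
	fmt.Println(s.current().nodes["n1"], allocated) // 0 false
}
```

Because both pieces of state are cloned and dropped together, a Revert() cannot leave a claim allocated for a pod whose placement was rolled back. The cost is that every Fork() copies the full DRA state, which is exactly what the delta approach is meant to avoid.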

Describe the solution you'd like.:

  • Add the ability to chain multiple dynamicresources.Snapshot objects, where each object represents a delta from the previous one.
  • The chain can be queried/modified in the same way as a single dynamicresources.Snapshot. Queries fall back through the chain and are expensive; modifications are applied to the top object in the chain and are cheap.
  • The chain can be consolidated into a single dynamicresources.Snapshot.
  • DeltaSnapshotStore keeps a chain of dynamicresources.Snapshot objects, and modifies the chain in the same way as the NodeInfo storage chain during Fork()/Commit()/Revert().
  • The chain is probably easiest to implement by turning dynamicresources.Snapshot into an interface and introducing another "DeltaChain" implementation that uses the current one internally.
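The proposed chain could look roughly like the Go sketch below. Everything in it is hypothetical: claimSnapshot stands in for dynamicresources.Snapshot turned into an interface, basicSnapshot for the current implementation, and deltaChain for the proposed "DeltaChain"; the method names (GetClaim, SetClaim, DeleteClaim, applyTo, Consolidate) are invented for illustration. Each layer reuses basicSnapshot for its changed entries and adds tombstones, so a deletion in an upper layer can hide an entry that still exists in a lower one.

```go
package main

import "fmt"

// claimSnapshot is a hypothetical stand-in for dynamicresources.Snapshot turned into an interface.
type claimSnapshot interface {
	GetClaim(key string) (node string, ok bool)
	SetClaim(key, node string)
	DeleteClaim(key string)
}

// basicSnapshot plays the role of the existing single-object implementation.
type basicSnapshot struct{ claims map[string]string }

func newBasicSnapshot() *basicSnapshot { return &basicSnapshot{claims: map[string]string{}} }
func (s *basicSnapshot) SetClaim(k, node string) { s.claims[k] = node }
func (s *basicSnapshot) DeleteClaim(k string)    { delete(s.claims, k) }
func (s *basicSnapshot) GetClaim(k string) (string, bool) {
	node, ok := s.claims[k]
	return node, ok
}

// layer is one delta: changed entries (a basicSnapshot) plus tombstones for deletions.
type layer struct {
	changes *basicSnapshot
	deleted map[string]bool
}

func newLayer(changes *basicSnapshot) *layer { return &layer{changes, map[string]bool{}} }

// applyTo replays this layer's deletions and changes onto dst.
func (l *layer) applyTo(dst claimSnapshot) {
	for k := range l.deleted {
		dst.DeleteClaim(k)
	}
	for k, node := range l.changes.claims {
		dst.SetClaim(k, node)
	}
}

// deltaChain implements claimSnapshot on a stack of layers (oldest first).
// Reads fall back through the layers newest-first; writes only touch the top layer.
type deltaChain struct{ layers []*layer }

func newDeltaChain(base *basicSnapshot) *deltaChain { return &deltaChain{layers: []*layer{newLayer(base)}} }
func (c *deltaChain) top() *layer                  { return c.layers[len(c.layers)-1] }

func (c *deltaChain) GetClaim(k string) (string, bool) {
	for i := len(c.layers) - 1; i >= 0; i-- {
		if l := c.layers[i]; l.deleted[k] {
			return "", false
		} else if node, ok := l.changes.GetClaim(k); ok {
			return node, true
		}
	}
	return "", false
}

func (c *deltaChain) SetClaim(k, node string) {
	delete(c.top().deleted, k)
	c.top().changes.SetClaim(k, node)
}

func (c *deltaChain) DeleteClaim(k string) {
	c.top().changes.DeleteClaim(k)
	c.top().deleted[k] = true
}

// Fork pushes an empty layer; Revert drops the top one but never the base layer.
func (c *deltaChain) Fork() { c.layers = append(c.layers, newLayer(newBasicSnapshot())) }
func (c *deltaChain) Revert() {
	if len(c.layers) > 1 {
		c.layers = c.layers[:len(c.layers)-1]
	}
}

// Commit pops the top layer and replays it onto the layer below.
func (c *deltaChain) Commit() {
	if len(c.layers) > 1 {
		top := c.top()
		c.layers = c.layers[:len(c.layers)-1]
		top.applyTo(c)
	}
}

// Consolidate flattens the whole chain into one basicSnapshot, oldest layer first.
func (c *deltaChain) Consolidate() *basicSnapshot {
	out := newBasicSnapshot()
	for _, l := range c.layers {
		l.applyTo(out)
	}
	return out
}

func main() {
	c := newDeltaChain(newBasicSnapshot())
	c.SetClaim("default/claim-a", "n1")
	c.Fork()
	c.DeleteClaim("default/claim-a") // tombstone in the top layer only
	_, visible := c.GetClaim("default/claim-a")
	c.Revert()
	node, _ := c.GetClaim("default/claim-a")
	fmt.Println(visible, node) // false n1
}
```

Tombstones are the main detail such a chain has to get right: without them, deleting an object inside a fork would make the older copy from a lower layer visible again. Note also that Commit() and Consolidate() reuse the same claimSnapshot methods, so the chain logic never duplicates the basic implementation's bookkeeping.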

Describe any alternative solutions you've considered.:

IMO we should avoid having two completely separate Basic/Delta implementations for dynamicresources.Snapshot like we do for ClusterSnapshotStore. This pattern leads to duplicating large portions of non-trivial code, and extending ClusterSnapshotStore is more painful than it should be because of it. The Delta/Chain implementation should use the Basic one internally instead.

Additional context.:

This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. An MVP of the support was implemented in #7530 (with the whole implementation tracked in kubernetes/kubernetes#118612). There are a number of post-MVP follow-ups to be addressed before DRA autoscaling is ready for production use - this is one of them.

Activity

added the kind/feature label on Jan 9, 2025

added the area/core-autoscaler and wg/device-management labels on Jan 9, 2025
towca (Collaborator, Author) commented on Feb 3, 2025:

/assign @towca

towca (Collaborator, Author) commented on Apr 11, 2025:

/assign @mtrqq

k8s-ci-robot (Contributor) commented on Apr 11, 2025:

@towca: GitHub didn't allow me to assign the following users: mtrqq.

Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @mtrqq

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
