Design for data mover node selection #7383

Merged 1 commit into vmware-tanzu:main on Mar 18, 2024

Conversation

Lyndon-Li (Contributor)

Add the design for node selection for data mover backup

github-actions bot added the Area/Design (Design Documents) label on Feb 5, 2024
Lyndon-Li marked this pull request as ready for review on February 5, 2024
codecov bot commented on Feb 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 61.62%. Comparing base (08a020e) to head (2f9d8ae).
Report is 52 commits behind head on main.

@@            Coverage Diff             @@
##             main    #7383      +/-   ##
==========================================
- Coverage   61.75%   61.62%   -0.13%     
==========================================
  Files         262      263       +1     
  Lines       28433    28681     +248     
==========================================
+ Hits        17558    17675     +117     
- Misses       9643     9758     +115     
- Partials     1232     1248      +16     


ywk253100 previously approved these changes on Feb 27, 2024
blackpiglet previously approved these changes on Feb 27, 2024
sseago previously approved these changes on Feb 27, 2024
design/node-agent-affinity.md (review thread, outdated):
As mentioned in the [Volume Snapshot Data Movement Design][2], the exposer decides where to launch the VGDP instances. At present, for volume snapshot data movement backups, the exposer creates backupPods, and the VGDP instances are initiated on the nodes where the backupPods are scheduled. So the loadAffinity will be translated (from `metav1.LabelSelector` to `corev1.Affinity`) and set on the backupPods.

It is possible that node-agent pods, as a daemonset, don't run on every worker node; users can achieve this by specifying `nodeSelector` or `nodeAffinity` in the node-agent daemonset spec. On the other hand, at present, VGDP instances must be assigned to nodes where node-agent pods are running. Therefore, if there is any node selection for node-agent pods, users must take it into account in the load affinity configuration, so as to guarantee that VGDP instances are always assigned to nodes where node-agent pods are available. This is left to users; we don't inherit any node selection configuration from the node-agent daemonset, because the daemonset scheduler works differently from the plain pod scheduler, and simply inheriting all the configurations may cause unexpected backupPod scheduling results.
Otherwise, if a backupPod is scheduled to a node where the node-agent pod is absent, the corresponding DataUpload CR will stay in the `Accepted` phase until the prepare timeout (30 minutes by default).
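
For illustration only, here is a minimal Go sketch of the kind of translation described above, mapping a `metav1.LabelSelector` loadAffinity into a `corev1.Affinity` that can be set on the backupPod. The package and function names are illustrative, not the actual Velero code.

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// toNodeAffinity converts a loadAffinity label selector into a required node
// affinity: matchLabels become "In" requirements and matchExpressions map to
// the corresponding node selector operators.
func toNodeAffinity(selector *metav1.LabelSelector) *corev1.Affinity {
	if selector == nil {
		return nil
	}

	requirements := []corev1.NodeSelectorRequirement{}
	for key, value := range selector.MatchLabels {
		requirements = append(requirements, corev1.NodeSelectorRequirement{
			Key:      key,
			Operator: corev1.NodeSelectorOpIn,
			Values:   []string{value},
		})
	}
	for _, expr := range selector.MatchExpressions {
		requirements = append(requirements, corev1.NodeSelectorRequirement{
			Key:      expr.Key,
			Operator: corev1.NodeSelectorOperator(expr.Operator),
			Values:   expr.Values,
		})
	}

	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{
					{MatchExpressions: requirements},
				},
			},
		},
	}
}
```

The resulting affinity would then be assigned to the backupPod's `spec.affinity` before the pod is created.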
Contributor:

Let me clarify, this is a possible situation in v1.13, right?

IMO it's possible to take the node-selector in the node-agent spec into account when scheduling the backupPod?

Lyndon-Li (Contributor, Author):

Yes, it is possible, because users can modify node-agent's yaml and add any configuration that affects scheduling, e.g., topologies, affinities/anti-affinities, node selectors, etc.

Node-agent's node selection is done by the Kubernetes scheduler, and the only way to control its behavior is to modify node-agent's yaml; on the other hand, the data mover node-selection configuration is in a configMap that is detected after node-agent starts.
Therefore, if we reflected the node-selection configuration into node-agent scheduling, we would have to dynamically edit node-agent's yaml after node-agent starts, which causes node-agent to restart one more time.
Moreover, users may be surprised by that extra restart, because they may not have realized that their node-selection configuration affects node-agent scheduling.

Besides, it is possible that the node-agent pod simply cannot run on a specific node; in that case, even if we change the node-agent spec, things still don't work.

Therefore, users must understand the relationship between node-agent scheduling and node selection in either case. So we'd better have users realize this from the beginning, and it is then easy for them to make both configurations, the node-agent spec and the node-selection configuration, correct.
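
As a rough illustration of the configMap point above, the sketch below shows how node-agent might read a node-selection configuration from a configMap after it starts. The configMap name (`node-agent-config`), the `loadAffinity` key, and the struct are assumptions for illustration, not the exact Velero implementation.

```go
package example

import (
	"context"
	"encoding/json"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// loadAffinityConfig mirrors the idea that the node-selection configuration is
// a list of label selectors stored in a configMap and picked up at runtime.
type loadAffinityConfig struct {
	LoadAffinity []*metav1.LabelSelector `json:"loadAffinity"`
}

// getLoadAffinity fetches the (assumed) node-agent-config configMap and parses
// the loadAffinity entry, if present.
func getLoadAffinity(ctx context.Context, client kubernetes.Interface, namespace string) (*loadAffinityConfig, error) {
	cm, err := client.CoreV1().ConfigMaps(namespace).Get(ctx, "node-agent-config", metav1.GetOptions{})
	if err != nil {
		return nil, err
	}

	config := &loadAffinityConfig{}
	if data, found := cm.Data["loadAffinity"]; found {
		if err := json.Unmarshal([]byte(data), &config.LoadAffinity); err != nil {
			return nil, err
		}
	}

	return config, nil
}
```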

Contributor:

OK, let's make sure this is covered in the documentation.

design/node-agent-affinity.md (review thread):

At present, as part of the expose operations, the exposer creates a volume, represented by backupPVC, from the snapshot. The backupPVC uses the same storageClass as the source volume. If the `volumeBindingMode` in the storageClass is `Immediate`, the volume is immediately allocated from the underlying storage without waiting for the backupPod. On the other hand, the loadAffinity is set to the backupPod's affinity. If the backupPod is scheduled to a node where the snapshot volume is not accessible, e.g., because of storage topologies, the backupPod won't get into the Running state and, consequently, the data movement won't complete.
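
To make the caveat concrete, below is a hypothetical guard (an editor's sketch, not existing Velero code) that inspects the storageClass and warns when load affinity is configured together with `Immediate` volume binding; the function name and the warning text are assumptions.

```go
package example

import (
	"context"
	"log"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// warnOnImmediateBinding checks the storage class used by the backupPVC and
// warns when a load affinity is configured while the class binds volumes
// immediately, since the volume may land in a topology the selected node
// cannot reach.
func warnOnImmediateBinding(ctx context.Context, client kubernetes.Interface, storageClassName string, affinityConfigured bool) error {
	sc, err := client.StorageV1().StorageClasses().Get(ctx, storageClassName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	if affinityConfigured && sc.VolumeBindingMode != nil && *sc.VolumeBindingMode == storagev1.VolumeBindingImmediate {
		// With Immediate binding the snapshot volume is allocated before the
		// backupPod is scheduled, so a conflicting affinity can leave the
		// backupPod Pending and the DataUpload stuck until the prepare timeout.
		log.Printf("storage class %s uses Immediate volume binding; verify the data mover load affinity does not conflict with the volume's topology", storageClassName)
	}

	return nil
}
```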
Contributor:

I think we can explicitly document that when the storageClass has `volumeBindingMode` set to `Immediate`, the user SHOULD NOT set a node selector for the data mover.

Lyndon-Li (Contributor, Author):

We can document this, but probably we just need to tell users to be careful when setting node selection for Immediate volumes, as not all volumes have constraints such as topologies. In environments with no such constraints, node selection works well with Immediate volumes.

Signed-off-by: Lyndon-Li <lyonghui@vmware.com>

There is a common solution to both problems:
- We have an existing logic that periodically enqueues the DataUpload CRs which are in the `Accepted` phase for timeout and cancel checks
- We add a new check to this existing logic to verify whether the corresponding backupPods are in an unrecoverable status (a rough sketch of this check follows below)
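
The sketch below illustrates the described check with hypothetical type and helper names (`dataUpload`, `isUnrecoverable`, `shouldCancel`); the real Velero controller logic differs in detail, and a production check would likely tolerate transient `Unschedulable` conditions rather than cancel immediately.

```go
package example

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// dataUpload is a stand-in for the DataUpload CR's fields relevant here.
type dataUpload struct {
	Phase        string
	AcceptedTime time.Time
}

// isUnrecoverable returns true when the backupPod can no longer reach Running,
// e.g. it failed outright or the scheduler reported it unschedulable.
func isUnrecoverable(pod *corev1.Pod) bool {
	if pod.Status.Phase == corev1.PodFailed {
		return true
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse && cond.Reason == corev1.PodReasonUnschedulable {
			return true
		}
	}
	return false
}

// shouldCancel mirrors the periodic check: an Accepted DataUpload is canceled
// when the prepare timeout expires or its backupPod is unrecoverable.
func shouldCancel(du *dataUpload, backupPod *corev1.Pod, prepareTimeout time.Duration) bool {
	if du.Phase != "Accepted" {
		return false
	}
	if time.Since(du.AcceptedTime) > prepareTimeout {
		return true
	}
	return backupPod != nil && isUnrecoverable(backupPod)
}
```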
Contributor:

Could we give more information about the definition of the unrecoverable state?
It's better to let the user know under which conditions DataUpload cancellation or failure is expected.

Lyndon-Li (Contributor, Author):

This Unrecoverable status check is an existing logic; here we add a further check to it.
At present, this check is not included in the doc; we can add it and cover all the checks for the backupPod/restorePod's Unrecoverable status.

Lyndon-Li (Contributor, Author):

There are several comments requesting documentation. We will create a separate PR to add a document for node selection, and we will cover all of those requests in that PR.

Lyndon-Li merged commit 6ec1701 into vmware-tanzu:main on Mar 18, 2024
65 of 66 checks passed