
Adding capability for tolerations in the agent #703

Closed
wants to merge 1 commit

Conversation

MbolotSuse

Related to rancher/rancher#34159. When the fleet controller is deployed, it deploys the fleet agent (both in the cluster where it is deployed and in downstream clusters that use fleet). Currently, neither of these deployments allows tolerations to be specified, making them difficult or impossible to use in a tainted cluster.

Goals of this PR:

  • For the agent deployed in the same cluster as the controller, use the same tolerations which were specified for the controller
  • For any agents deployed in the downstream cluster, allow custom specification of the tolerations. This custom specification is stored, like the agent env vars, on the spec of the imported fleet cluster.
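
To make the second goal concrete, a downstream Cluster object might look like this. This is a sketch only: the agentTolerations field name follows this PR's changes, but the metadata, the agentEnvVars entry, and the toleration values are illustrative, not taken from the actual schema.

```yaml
# Illustrative sketch: a fleet Cluster carrying agent tolerations on its
# spec, alongside the existing agent env vars. Field values are examples.
apiVersion: fleet.cattle.io/v1alpha1
kind: Cluster
metadata:
  name: downstream-1        # hypothetical cluster name
  namespace: clusters       # hypothetical namespace
spec:
  agentEnvVars:
  - name: HTTP_PROXY
    value: http://proxy.example.com:8080
  agentTolerations:         # standard Kubernetes Tolerations, per this PR
  - key: priority
    operator: Equal
    value: high
    effect: NoSchedule
```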

@MbolotSuse MbolotSuse changed the title [WIP] Adding capability for tolerations in the agent Adding capability for tolerations in the agent Jan 25, 2022
@MbolotSuse

Update: Was able to test this and fix a few bugs. I would now consider this PR ready for review.

Tests:

  • Build fleet-controller and push to test registry.
  • Deploy a 3 node cluster and taint each node (node 1: priority=low:NoSchedule, node 2: priority=mid:NoSchedule, node 3: priority=high:NoSchedule)
  • Add a key: priority, operator: Equal, value: high, effect: NoSchedule toleration to the chart for both fleet and the gitjob job (matching the taint on node 3; taint keys and values are case-sensitive)
  • Install fleet-crd by running helm -n fleet-system install --create-namespace --wait fleet-crd . from the charts/fleet-crd directory.
  • Install fleet by running helm -n fleet-system install --create-namespace --wait fleet . from the charts/fleet directory.
  • Ensure that all pods scheduled successfully using kubectl get pods -A
  • Create a GitRepo using the following description:
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: test
  namespace: fleet-local
spec:
  branch: master
  paths:
  - simple
  repo: https://github.com/rancher/fleet-examples
  • Ensure that the import job ran successfully. One note: the resources deployed (from fleet-examples) had no tolerations and could not be scheduled. I'm not concerned about this, since tolerations were never defined in fleet-examples, so I consider this behavior acceptable.
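
For context on why the tolerated pods land only on node 3 in the test above, the NoSchedule matching rule can be sketched in Python. This is a simplified model of Kubernetes taint/toleration semantics, not Fleet code; it covers only the Equal and Exists operators and the NoSchedule effect.

```python
# Simplified model of Kubernetes taint/toleration matching (NoSchedule only).
# Not Fleet code; illustrates the test setup from this PR comment.

def tolerates(toleration: dict, taint: dict) -> bool:
    """True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches any effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    op = toleration.get("operator", "Equal")
    if op == "Exists":
        # An empty key with Exists tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Equal: both key and value must match.
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint["value"])

def schedulable(pod_tolerations: list, node_taints: list) -> bool:
    """A pod can schedule only if every NoSchedule taint is tolerated."""
    return all(
        any(tolerates(t, taint) for t in pod_tolerations)
        for taint in node_taints
        if taint["effect"] == "NoSchedule"
    )

# The three-node test setup described above, one taint per node.
nodes = {
    "node1": [{"key": "priority", "value": "low", "effect": "NoSchedule"}],
    "node2": [{"key": "priority", "value": "mid", "effect": "NoSchedule"}],
    "node3": [{"key": "priority", "value": "high", "effect": "NoSchedule"}],
}
fleet_tolerations = [
    {"key": "priority", "operator": "Equal", "value": "high",
     "effect": "NoSchedule"},
]
print([n for n, taints in nodes.items()
       if schedulable(fleet_tolerations, taints)])
# → ['node3']
```

This also explains the note about the fleet-examples workloads: with no tolerations at all, schedulable returns False for every tainted node, so those pods stay pending.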

@MbolotSuse

Overview of Changes:

  • Changed the definition of a fleet Cluster (CRD) to include a field for agentTolerations (values are Tolerations as defined by the K8s api)
  • Added a value to the fleet-controller config map defining the tolerations which were set on the fleet controller. This was done so that the various parts of the code which need these tolerations don't have to attempt to look up the deployment
  • Added a parameter to the basic.Deployment function that allows callers to pass in additional tolerations (there is one preset toleration that always needs to be used).
  • Removed references to the cattle.io/os toleration, since that should now be passed in by the values set in the chart.
  • For the local cluster, set the AgentTolerations to what was defined for the fleet-controller.
  • Passed no additional tolerations to the controller deployment created by the operator (modules/cli/controllermanifest/template.go); I'm unsure whether this functionality is currently in use.
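
The second bullet might look roughly like this. This is a sketch: the config map name, namespace, and key layout are assumptions for illustration, not the exact schema from this PR.

```yaml
# Illustrative sketch: the fleet-controller config map carrying the
# controller's own tolerations, so other components can read them here
# instead of looking up the Deployment. Names and layout are assumed.
apiVersion: v1
kind: ConfigMap
metadata:
  name: fleet-controller
  namespace: fleet-system
data:
  config: |
    agentTolerations:
    - key: priority
      operator: Equal
      value: high
      effect: NoSchedule
```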

Commit message:

Changing agent deployments to use tolerations defined on the fleet cluster, and changing the local cluster creation to use the tolerations from the controller deployment.
@aiyengar2

re:

For the agent deployed in the same cluster as the controller, use the same tolerations which were specified for the controller

I'm not sure whether there might be use cases where a fleet-agent needs different tolerations than fleet itself, but I'm not too concerned about this.

re:

For any agents deployed in the downstream cluster, allow custom specification of the tolerations. This custom specification is stored, like the agent env vars, on the spec of the imported fleet cluster.

This is where my primary concern with the approach of this PR lies: I think tolerations should be specified in the downstream cluster, not the local cluster (where the fleet Cluster object lives), since the current approach only works well with Manager-Initiated registration (which is what Rancher uses).

In Agent-Initiated registration, where the Cluster object is auto-created upon seeing a ClusterRegistrationToken, the behavior is undefined: the Fleet Agent is deployed and managed via a Helm chart in the downstream cluster, so these fields on the spec of the Cluster in the management cluster would effectively need to be ignored, right?

@manno

manno commented Mar 8, 2023

Support for tolerations was added in PR #1154

@manno manno closed this Mar 8, 2023