
[BUG] rke2 clusters with invalid values for tolerations / affinity agent customization do not show error to user, stay in updating state on cluster create #41606

Open
slickwarren opened this issue May 18, 2023 · 7 comments
Labels: area/capr/rke2 (RKE2 Provisioning issues involving CAPR), area/provisioning-v2 (Provisioning issues that are specific to the provisioningv2 generating framework), kind/bug-qa (Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement), QA/XS, team/hostbusters (The team that is responsible for provisioning/managing downstream clusters + K8s version support)

@slickwarren (Contributor):
Rancher Server Setup

  • Rancher version: v2.7-head ()
  • Installation option (Docker install/Helm Chart): helm
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): rke1, 1.4.5
  • Proxy/Cert Details: valid certs

Information about the Cluster

  • Kubernetes version: tested with 1.25.9, 1.26.x
  • Cluster Type (Local/Downstream): downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): rke2 linode driver

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    • If custom, define the set of permissions: standard user, cluster owner

Describe the bug

When provisioning a cluster with invalid values for tolerations or affinity agent customizations (cluster agent or fleet agent), cluster creation is allowed to start. However, the cluster never shows any error state and hangs in an Updating state.

To Reproduce

  • deploy an rke2 cluster
  • for cluster agent and fleet agent customization, enter a bad label for tolerations and node affinity, e.g. badLabel 123"[];'{}-+=
  • wait for the cluster to provision

Result
Cluster is created, but stays in an Updating state (see screenshots).

Expected Result

The cluster should error out and bubble this state up to the user on the cluster management page; the error should be shown to the user.

If the user SSHes into the node and views the rke2-server logs, the following error is spammed every few seconds:


May 18 22:37:07 bad-custization1-dsa-3a9fb14b-s4rsd rke2[2054]: time="2023-05-18T22:37:07Z" level=error msg="Failed to process config: failed to process /var/lib/rancher/rke2/server/manifests/rancher/cluster-agent.yaml: failed to create cattle-system/cattle-cluster-agent apps/v1, Kind=Deployment for  kube-system/cluster-agent: Deployment.apps \"cattle-cluster-agent\" is invalid: [spec.template.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[1].key: Invalid value: \"notGood~ [];'/..,\": prefix part a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.template.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[1].key: Invalid value: \"notGood~ [];'/..,\": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]'), spec.template.spec.tolerations[3].key: Invalid value: \"notGood~ [];'/..,\": prefix part a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.template.spec.tolerations[3].key: Invalid value: \"notGood~ [];'/..,\": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')]"
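The two regexes in the log above are the standard Kubernetes qualified-name rules: an optional RFC 1123 subdomain prefix plus a name part of at most 63 characters. As a minimal sketch (helper name is illustrative; Kubernetes itself exposes this check as `IsQualifiedName` in apimachinery), the validation the apiserver is applying looks like this:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Regexes copied from the kube-apiserver error message above.
var (
	subdomainRe = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)
	nameRe      = regexp.MustCompile(`^[A-Za-z0-9]([-A-Za-z0-9_.]*[A-Za-z0-9])?$`)
)

// isValidKey reports whether a toleration/affinity key would pass the
// apiserver's qualified-name validation ("[prefix/]name").
func isValidKey(key string) bool {
	name := key
	if i := strings.LastIndex(key, "/"); i >= 0 {
		prefix := key[:i]
		name = key[i+1:]
		if len(prefix) == 0 || len(prefix) > 253 || !subdomainRe.MatchString(prefix) {
			return false
		}
	}
	return len(name) > 0 && len(name) <= 63 && nameRe.MatchString(name)
}

func main() {
	fmt.Println(isValidKey("node-role.kubernetes.io/worker")) // true
	fmt.Println(isValidKey(`notGood~ [];'/..,`))              // false
}
```

Any key that fails this check is rejected by the apiserver when the agent manifest is applied, which is exactly the error loop seen in the rke2-server logs.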

Screenshots

Screen Shot 2023-05-18 at 3 30 18 PM
Screen Shot 2023-05-18 at 3 30 12 PM
Screen Shot 2023-05-18 at 1 42 00 PM

Additional context

The error is bubbled up to the user for RKE1 clusters.

@slickwarren slickwarren added kind/bug-qa Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement area/capr/rke2 RKE2 Provisioning issues involving CAPR team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support labels May 18, 2023
@slickwarren slickwarren added this to the 2023-Q2-v2.7x milestone May 18, 2023
@slickwarren slickwarren self-assigned this May 18, 2023
@Oats87 Oats87 added the area/provisioning-v2 Provisioning issues that are specific to the provisioningv2 generating framework label May 19, 2023
@a-blender a-blender self-assigned this May 22, 2023
@snasovich (Collaborator) commented May 23, 2023:

Kicked it out to Q3 for now, as it's not a critical bug: it applies only if the user provides an invalid configuration, and is essentially poor UX where the error's root cause is not bubbled up in the UI.
This can be improved by adding format validation on these fields, ideally on both the UI and the backend. For the backend, options include adding a pattern to the generated CRD (which may require deeper changes, as CRDs are generated by Wrangler) or adding these validations to the webhook.
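For context on the CRD option: it would mean emitting an OpenAPI `pattern` for these fields. With kubebuilder-style markers that is a one-line annotation per field. This is an illustrative sketch only; the field below is hypothetical, and Rancher's CRDs are generated by Wrangler rather than from kubebuilder markers, which is exactly why this is a deeper change:

```go
// AgentToleration is a hypothetical field sketch, not the real
// cluster.provisioning.cattle.io type. The pattern is the apiserver's
// qualified-name rule ("[prefix/]name") quoted in the error log above.
type AgentToleration struct {
	// +kubebuilder:validation:Pattern=`^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?[A-Za-z0-9]([-A-Za-z0-9_.]*[A-Za-z0-9])?$`
	Key string `json:"key,omitempty"`
}
```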

Adding release-note label to document it as a known issue for this feature.

@snasovich snasovich removed their assignment May 23, 2023
@snasovich snasovich added the release-note Note this issue in the milestone's release notes label May 23, 2023
@slickwarren (Contributor, Author):
referencing UI issue: rancher/dashboard#8920

@snasovich (Collaborator) commented Aug 9, 2023:

We may want to follow something like 64d6bc5 (#41768) to add validation for these fields on cluster.provisioning.cattle.io, but given how large and core this CRD is, that should potentially be done as a larger effort rather than as part of fixing this (rather small) issue.

Kicking out to Q1.

@felipe-colussi (Contributor):
I was trying to take a look into this; it is not possible to switch to kubebuilder-based validation with simple changes. It would require a lot of effort and testing.

While testing:

  • Some fields that weren't required become required.
  • Fields of type rkev1.GenericMap don't work (data can't be unmarshalled into them).
  • Data that is not mapped on the struct is passed through, generating errors.
  • When embedding structs, the embedded fields don't seem to be created.

@a-blender (Contributor) commented Oct 27, 2023:

Would validation added to the CRD bubble up via RKE2? The fact that RKE1 bubbles up the error and RKE2 does not may be due to how RKE2 is designed (I've run into this before). cc @jakefhyde

This won't surface the rke2 logs, but another option is to add a validator/regex to the systemtemplate and check that agent customization fields are valid before setting them on the deployment:

```go
if appendTolerations := util.GetClusterAgentTolerations(cluster); appendTolerations != nil {
	agentAppendTolerations = templates.ToYAML(appendTolerations)
	if agentAppendTolerations == "" {
		return fmt.Errorf("error converting agent append tolerations to YAML")
	}
}
affinity, err := util.GetClusterAgentAffinity(cluster)
if err != nil {
	return err
}
agentAffinity = templates.ToYAML(affinity)
if agentAffinity == "" {
	return fmt.Errorf("error converting agent affinity to YAML")
}
if resourceRequirements := util.GetClusterAgentResourceRequirements(cluster); resourceRequirements != nil {
	agentResourceRequirements = templates.ToYAML(resourceRequirements)
	if agentResourceRequirements == "" {
		return fmt.Errorf("error converting agent resource requirements to YAML")
	}
}
```

This approach will alert the user a lot earlier if they try to add an invalid agent customization configuration via kubectl instead of the UI. It will also not require modifying the "large and core" cluster.provisioning.cattle.io CRD and has a lower chance of causing regressions.

@felipe-colussi (Contributor):
@a-blender, after talking with @snasovich we decided to add the validation in the webhook.
It is not the "recommended" Kubernetes way, but given the way we create the CRDs and the size/importance of the cluster CRD, it is much safer.
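A minimal sketch of what that webhook check could do (the function name and deny-message wording are illustrative, not the actual rancher/webhook code): enforce the same qualified-name rules at admission time, so the user gets a single deny message instead of rke2-server log spam on the node.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// The apiserver's qualified-name regexes, quoted in the rke2-server log above.
var (
	subdomainRe = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)
	nameRe      = regexp.MustCompile(`^[A-Za-z0-9]([-A-Za-z0-9_.]*[A-Za-z0-9])?$`)
)

// validateTolerationKeys returns a single admission-style error listing every
// invalid key, or nil if all keys pass qualified-name validation.
func validateTolerationKeys(keys []string) error {
	var bad []string
	for _, k := range keys {
		name, ok := k, true
		if i := strings.LastIndex(k, "/"); i >= 0 {
			prefix := k[:i]
			name = k[i+1:]
			ok = len(prefix) > 0 && len(prefix) <= 253 && subdomainRe.MatchString(prefix)
		}
		if !ok || len(name) == 0 || len(name) > 63 || !nameRe.MatchString(name) {
			bad = append(bad, k)
		}
	}
	if len(bad) > 0 {
		return fmt.Errorf("denied: invalid toleration key(s) %q: must be a qualified name ([prefix/]name)", bad)
	}
	return nil
}

func main() {
	fmt.Println(validateTolerationKeys([]string{"node-role.kubernetes.io/worker"})) // <nil>
	fmt.Println(validateTolerationKeys([]string{`badLabel 123"[];'{}-+=`}))
}
```

Rejecting the object at admission means the cluster never starts provisioning with a config the apiserver will later refuse, which addresses the "stuck in Updating" symptom directly.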

@slickwarren (Contributor, Author):
test plan automation considerations:
