
upgrade_strategy.timeout on upgraded Rancher clusters is set to 0 instead of 120 #27333

Closed · bentastic27 opened this issue May 29, 2020 · 10 comments
Labels: internal, kind/bug, team/ui
Milestone: v2.4.6

@bentastic27 (Contributor) commented May 29, 2020

What kind of request is this (question/bug/enhancement/feature request):
bug

Steps to reproduce (fewest steps possible):

  1. Start with Rancher 2.3.6 and a custom downstream cluster on any version.
  2. Update Rancher to 2.4.3.
  3. Update the cluster to a currently supported version (I did 1.17.5).
  4. Edit the cluster as YAML and hit save.

Result:

The UI will show the following error:

 Validation failed in API: rancherKubernetesEngineConfig upgradeStrategy=InvalidFormat 422: nodeDrainInput=InvalidFormat 422: timeout=MinLimitExceeded 422:

To fix, edit the cluster YAML and set upgrade_strategy.timeout to something other than 0, such as 120. The UI then allows you to save.
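For reference, a minimal sketch of the relevant part of the cluster YAML with the workaround applied; the key names follow the validation error and the workaround described later in this thread, but the exact nesting under the cluster spec is an assumption:

```yaml
# Workaround sketch: only the upgrade strategy section is shown;
# exact nesting under the cluster spec may differ in your environment (assumption).
rancher_kubernetes_engine_config:
  upgrade_strategy:
    drain: false
    node_drain_input:
      timeout: 120   # was 0, which fails validation (the minimum accepted value is 1)
```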

gzrancher/rancher#11317

@bentastic27 bentastic27 added the kind/bug and internal labels May 29, 2020
@sowmyav27 (Contributor) commented Jun 16, 2020

Reproduced on a fresh install of 2.4.5-rc6

  • Deploy a custom cluster through the API
  • Upgrade strategy: Drain - False
  • When the cluster comes up active
  • Do an Edit on the cluster. Select Drain - True (with default values)

[Screenshot: Screen Shot 2020-06-16 at 1.40.08 PM]

  • Enable Scheduled CIS scan on the cluster
  • Save the changes made
  • Do an edit on the cluster
  • Drain timeout is set to 0

[Screenshot: Screen Shot 2020-06-16 at 1.43.18 PM]

  • Click on Save
  • Error seen: Validation failed in API: rancherKubernetesEngineConfig upgradeStrategy=InvalidFormat 422: nodeDrainInput=InvalidFormat 422: timeout=MinLimitExceeded 422:
  • Choose Drain as True, set the Drain Timeout ("Keep trying for") to 120 seconds, then choose Drain as False
  • User is able to save without any error.

@StoneCut commented Jul 9, 2020

Same issue occurs on 2.4.5 after the first successful upgrade of a custom cluster. Doing it again (or simply choosing "edit" and then "save") results in the same error:
Validation failed in API: rancherKubernetesEngineConfig upgradeStrategy=InvalidFormat 422: nodeDrainInput=InvalidFormat 422: timeout=MinLimitExceeded 422:
Is there any workaround for this?

@jloisel commented Jul 17, 2020

Same issue here on Rancher v2.4.5, when trying to upgrade a cluster from v1.17.6 to v1.18.6:

[Screenshot]

@StoneCut

This is a frustrating bug.

As a workaround, edit the cluster configuration as a YAML file, find the "upgrade_strategy" section, and under "node_drain_input" change "timeout: 0" to "timeout: 120".

@Tejeev commented Jul 22, 2020

We saw this on v2.4.5 when trying to update to k8s v1.18.5

@jloisel commented Jul 22, 2020

The workaround works well; it's probably just that the old data contains the wrong value.

@maggieliu maggieliu added this to the v2.4.x milestone Jul 22, 2020
@maggieliu maggieliu modified the milestones: v2.4.x, v2.4.6 Jul 23, 2020
codyrancher added a commit to codyrancher/ui that referenced this issue Jul 24, 2020
@codyrancher

The backend appears to be setting timeout to 0 when we save changes with these two settings:

  • upgradeStrategy.drain = false
  • upgradeStrategy.nodeDrainInput.timeout = undefined

I put in a stopgap on the frontend to resolve this, but ultimately it should be fixed on the backend so that API users don't run into it.
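For illustration, a sketch in YAML form of the two states this describes; the field names come from the validation error above, and the exact payload shapes are assumptions based on this comment:

```yaml
# What is saved when drain is disabled and no timeout is supplied (assumed shape):
upgradeStrategy:
  drain: false
  nodeDrainInput: {}        # timeout left undefined by the UI
---
# What the backend then appears to persist, which fails validation on the next save:
upgradeStrategy:
  drain: false
  nodeDrainInput:
    timeout: 0              # timeout=MinLimitExceeded 422 (minimum accepted value is 1)
```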

@sowmyav27 (Contributor)

Verified on 2.4-head - commit id: 3e543f7c4

  • Deploy a custom cluster through the API - Upgrade strategy: Drain - False
  • When the cluster comes up active
  • Do an Edit on the cluster. Select Drain - True (with default values)
  • Enable Scheduled CIS scan on the cluster
  • Save the changes made
  • Do an edit on the cluster
  • Drain timeout is set to 1
  • Change the max-pods in the cluster.yml and click on Save
  • The cluster goes into updating state and an error is seen [controlPlane] Failed to upgrade Control Plane: [[error draining node ip-<>: error when waiting for pod "cattle-cluster-agent-66849d8fb9-kbqmr" terminating: global timeout reached: 1s]]
  • The default value of 1 second for Drain timeout causes this error.

Expected:

  • A Drain timeout of 1 second is too small for an upgrade to go through when Drain is set to true.
  • The Drain timeout should default to 120 seconds when the value is null or 0.

codyrancher added a commit to codyrancher/ui that referenced this issue Jul 28, 2020
Turns out that the min value that the backend accepts won't allow
upgrades to complete. This switches the value to the default value to
mitigate that issue.

rancher/rancher#27333
@sowmyav27 (Contributor)

Verified on 2.4-head - commit id: 3e543f7, ui tag: latest-2.4

  • Deploy a custom cluster through the API - Upgrade strategy: Drain - False
  • When the cluster comes up active
  • Do an Edit on the cluster. Select Drain - True (with default values)
  • Enable Scheduled CIS scan on the cluster
  • Save the changes made
  • Do an edit on the cluster
  • Drain timeout is seen set to 120
  • Edit the cluster as YAML, change the max-pods in the cluster.yml, and click on Save
  • Cluster is updated successfully.

On master-head commit id: e20f472d4 ui tag: latest2

  • Deploy a custom cluster through the API - Upgrade strategy: Drain - False
  • When the cluster comes up active
  • Do an Edit on the cluster. Select Drain - True (with default values)
  • Enable Scheduled CIS scan on the cluster
  • Save the changes made
  • Do an edit on the cluster
  • Drain timeout is seen set to 120
  • click on save
  • Error is seen: "Timeout" should be between 1 and 10800

[Screenshot: Screen Shot 2020-07-28 at 12.07.21 PM]

codyrancher added a commit to codyrancher/ui that referenced this issue Aug 5, 2020
If the appliedSpec is present it will be validated along with the rest of the
model. Unfortunately the backend is sometimes saving invalid models
which causes this validation to fail. We shouldn't be modifying or sending
this appliedSpec so I'm removing it.

rancher/rancher#27333 (comment)
codyrancher added a commit to codyrancher/ui that referenced this issue Aug 5, 2020
Unfortunately the backend is sometimes saving invalid models
which causes the validation of appliedSpec to fail. To avoid
this validation we're now ignoring the appliedSpec where this
can go wrong.

rancher/rancher#27333 (comment)
@sowmyav27 (Contributor)

Another way to reproduce the issue on an upgraded setup:

  • Deploy a cluster in 2.3.6 in k8s 1.17
  • Upgrade Rancher to 2.4.5
  • Upgrade k8s version to 1.17.9. Save changes made.
  • When the cluster comes back active, Edit cluster --> Enable drain, notice that the drain timeout field is "blank"
  • Save changes made.
  • When the cluster comes to active state, Edit cluster, notice drain timeout is now 0. Save the cluster.
  • Error seen on UI: Validation failed in API: rancherKubernetesEngineConfig upgradeStrategy=InvalidFormat 422: nodeDrainInput=InvalidFormat 422: timeout=MinLimitExceeded 422:
  • Set the Drain timeout to 120 seconds and save the changes. The cluster updates successfully.

On master-head - commit id: 9b0dd20b7 - the issue is fixed

  • Deploy a custom cluster through the API - Upgrade strategy: Drain - False
  • When the cluster comes up active
  • Do an Edit on the cluster. Select Drain - True (with default values)
  • Enable Scheduled CIS scan on the cluster
  • Save the changes made
  • Do an edit on the cluster
  • Drain timeout is seen set to 120
  • click on save. Cluster goes into updating state and no error is seen

Upgrade from 2.3.6 to 2.4-head (commit id: 2c7dc4ba8) - the issue is fixed

  • Deploy a cluster in 2.3.6 in k8s 1.17
  • Upgrade Rancher to 2.4-head
  • Upgrade k8s version to 1.17.9. Save changes made.
  • When the cluster comes back active, Edit cluster --> Enable drain, notice that the drain timeout field has value 120
  • Save changes made.
  • When the cluster comes to active state, Edit cluster, notice drain timeout is now 120. Save the cluster.
  • No error seen
