
upgrade_strategy.timeout on upgraded Rancher clusters is set to 0 instead of 120 #27333

Closed · bentastic27 opened this issue May 29, 2020 · 10 comments
Labels: internal, kind/bug, team/ui
Milestone: v2.4.6

@bentastic27 (Contributor) commented May 29, 2020

What kind of request is this (question/bug/enhancement/feature request):
bug

Steps to reproduce (fewest steps possible):

  1. Start with Rancher 2.3.6 and a custom downstream cluster on any version.
  2. Update Rancher to 2.4.3.
  3. Update the cluster to a currently supported version (I did 1.17.5).
  4. Edit the cluster as YAML and hit save.

Result:

The UI will show the following error:

 Validation failed in API: rancherKubernetesEngineConfig upgradeStrategy=InvalidFormat 422: nodeDrainInput=InvalidFormat 422: timeout=MinLimitExceeded 422:

To fix, edit the cluster YAML and set upgrade_strategy.timeout to something other than 0, such as 120. The UI then allows you to save.
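For reference, a minimal sketch of the relevant part of the cluster YAML with the workaround applied; the key names follow the validation error and the workaround described later in this thread, but the exact nesting under the cluster spec is an assumption:

```yaml
# Workaround sketch: only the upgrade strategy section is shown;
# exact nesting under the cluster spec may differ in your environment (assumption).
rancher_kubernetes_engine_config:
  upgrade_strategy:
    drain: false
    node_drain_input:
      timeout: 120   # was 0, which fails validation (the minimum accepted value is 1)
```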

gzrancher/rancher#11317

@bentastic27 bentastic27 added the kind/bug and internal labels May 29, 2020
@sowmyav27 (Contributor) commented Jun 16, 2020

Reproduced on a fresh install of 2.4.5-rc6

  • Deploy a custom cluster through the API
  • Upgrade strategy: Drain - False
  • When the cluster comes up active
  • Do an Edit on the cluster. Select Drain - True (with default values)

[Screenshot: Screen Shot 2020-06-16 at 1.40.08 PM]

  • Enable Scheduled CIS scan on the cluster
  • Save the changes made
  • Do an edit on the cluster
  • Drain timeout is set to 0

[Screenshot: Screen Shot 2020-06-16 at 1.43.18 PM]

  • Click on Save
  • Error seen: Validation failed in API: rancherKubernetesEngineConfig upgradeStrategy=InvalidFormat 422: nodeDrainInput=InvalidFormat 422: timeout=MinLimitExceeded 422:
  • Choose Drain as True, set the Drain Timeout ("Keep trying for") to 120 seconds, then choose Drain as False
  • User is able to save without any error.

@StoneCut commented Jul 9, 2020

Same issue occurs on 2.4.5 after the first successful upgrade of a custom cluster. Doing it again (or simply choosing "edit" and then "save") results in the same error:
Validation failed in API: rancherKubernetesEngineConfig upgradeStrategy=InvalidFormat 422: nodeDrainInput=InvalidFormat 422: timeout=MinLimitExceeded 422:
Is there any workaround for this?

@jloisel commented Jul 17, 2020

Same issue here on Rancher v2.4.5, when trying to upgrade a cluster from v1.17.6 to v1.18.6:

[Screenshot]

@StoneCut

This is a frustrating bug.

As a workaround, edit the cluster configuration as a YAML file, find the "upgrade_strategy" section, and under "node_drain_input" change "timeout: 0" to "timeout: 120".

@Tejeev commented Jul 22, 2020

We saw this on v2.4.5 when trying to update to k8s v1.18.5

@jloisel commented Jul 22, 2020

The workaround works well; it's probably just that the old data contains the wrong value.

@maggieliu maggieliu added this to the v2.4.x milestone Jul 22, 2020
@maggieliu maggieliu modified the milestones: v2.4.x, v2.4.6 Jul 23, 2020
codyrancher added a commit to codyrancher/ui that referenced this issue Jul 24, 2020
@codyrancher

The backend appears to be setting timeout to 0 when we save changes with these two settings:

  • upgradeStrategy.drain = false
  • upgradeStrategy.nodeDrainInput.timeout = undefined

I put in a stopgap on the frontend to resolve this, but ultimately it should be fixed on the backend so that API users don't run into it.
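For illustration, a sketch in YAML form of the two states this describes; the field names come from the validation error above, and the exact payload shapes are assumptions based on this comment:

```yaml
# What is saved when drain is disabled and no timeout is supplied (assumed shape):
upgradeStrategy:
  drain: false
  nodeDrainInput: {}        # timeout left undefined by the UI
---
# What the backend then appears to persist, which fails validation on the next save:
upgradeStrategy:
  drain: false
  nodeDrainInput:
    timeout: 0              # timeout=MinLimitExceeded 422 (minimum accepted value is 1)
```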

@sowmyav27 (Contributor)

Verified on 2.4-head - commit id: 3e543f7c4

  • Deploy a custom cluster through the API - Upgrade strategy: Drain - False
  • When the cluster comes up active
  • Do an Edit on the cluster. Select Drain - True (with default values)
  • Enable Scheduled CIS scan on the cluster
  • Save the changes made
  • Do an edit on the cluster
  • Drain timeout is set to 1
  • Change the max-pods in the cluster.yml and click on Save
  • The cluster goes into updating state and an error is seen [controlPlane] Failed to upgrade Control Plane: [[error draining node ip-<>: error when waiting for pod "cattle-cluster-agent-66849d8fb9-kbqmr" terminating: global timeout reached: 1s]]
  • The default value of 1 second for Drain timeout causes this error.

Expected:

  • A Drain timeout of 1 second is too small for an upgrade to go through when Drain is set to true.
  • The Drain timeout should default to 120 seconds when the value is null or 0.

codyrancher added a commit to codyrancher/ui that referenced this issue Jul 28, 2020
Turns out that the min value that the backend accepts won't allow
upgrades to complete. This switches the value to the default value to
mitigate that issue.

rancher/rancher#27333
@sowmyav27 (Contributor)

Verified on 2.4-head - commit id: 3e543f7, ui tag: latest-2.4

  • Deploy a custom cluster through the API - Upgrade strategy: Drain - False
  • When the cluster comes up active
  • Do an Edit on the cluster. Select Drain - True (with default values)
  • Enable Scheduled CIS scan on the cluster
  • Save the changes made
  • Do an edit on the cluster
  • Drain timeout is seen set to 120
  • Edit the cluster as YAML, change the max-pods in the cluster.yml, and click on Save
  • Cluster is updated successfully.

On master-head commit id: e20f472d4 ui tag: latest2

  • Deploy a custom cluster through the API - Upgrade strategy: Drain - False
  • When the cluster comes up active
  • Do an Edit on the cluster. Select Drain - True (with default values)
  • Enable Scheduled CIS scan on the cluster
  • Save the changes made
  • Do an edit on the cluster
  • Drain timeout is seen set to 120
  • click on save
  • Error is seen: "Timeout" should be between 1 and 10800

[Screenshot: Screen Shot 2020-07-28 at 12.07.21 PM]

codyrancher added a commit to codyrancher/ui that referenced this issue Aug 5, 2020
If the appliedSpec is present it will be validated along with the rest of the
model. Unfortunately the backend is sometimes saving invalid models
which causes this validation to fail. We shouldn't be modifying or sending
this appliedSpec so I'm removing it.

rancher/rancher#27333 (comment)
codyrancher added a commit to codyrancher/ui that referenced this issue Aug 5, 2020
Unfortunately the backend is sometimes saving invalid models
which causes the validation of appliedSpec to fail. To avoid
this validation we're now ignoring the appliedSpec where this
can go wrong.

rancher/rancher#27333 (comment)
@sowmyav27 (Contributor)

Another way to reproduce the issue on an upgraded setup:

  • Deploy a cluster in 2.3.6 in k8s 1.17
  • Upgrade Rancher to 2.4.5
  • Upgrade k8s version to 1.17.9. Save changes made.
  • When the cluster comes back active, Edit cluster --> Enable drain, notice that the drain timeout field is "blank"
  • Save changes made.
  • When the cluster comes to active state, Edit cluster, notice drain timeout is now 0. Save the cluster.
  • Error seen on UI: Validation failed in API: rancherKubernetesEngineConfig upgradeStrategy=InvalidFormat 422: nodeDrainInput=InvalidFormat 422: timeout=MinLimitExceeded 422:
  • Set the Drain timeout to 120 seconds and save the changes. The cluster updates successfully.

On master-head - commit id: 9b0dd20b7 - the issue is fixed

  • Deploy a custom cluster through the API - Upgrade strategy: Drain - False
  • When the cluster comes up active
  • Do an Edit on the cluster. Select Drain - True (with default values)
  • Enable Scheduled CIS scan on the cluster
  • Save the changes made
  • Do an edit on the cluster
  • Drain timeout is seen set to 120
  • click on save. Cluster goes into updating state and no error is seen

Upgrade from 2.3.6 to 2.4-head (commit id: 2c7dc4ba8) - the issue is fixed

  • Deploy a cluster in 2.3.6 in k8s 1.17
  • Upgrade Rancher to 2.4-head
  • Upgrade k8s version to 1.17.9. Save changes made.
  • When the cluster comes back active, Edit cluster --> Enable drain, notice that the drain timeout field has value 120
  • Save changes made.
  • When the cluster comes to active state, Edit cluster, notice drain timeout is now 120. Save the cluster.
  • No error seen
