Rancher server logs are spammed with logs when one of the node of a downstream RKE2 cluster is in Reconciling state (UI sends null) #8480

Open
dasarinaidu opened this issue Mar 17, 2023 · 18 comments · May be fixed by #10396
Labels
kind/bug QA/manual-test Indicates issue requires manually testing QA/S release-note size/2 Size Estimate 2 team/area2 Hostbusters

Comments

@dasarinaidu

dasarinaidu commented Mar 17, 2023

Rancher Server Setup

  • Rancher version: 2.7.0 upgraded to 2.7-head (2.7.2-rc6)
  • Installation option (Docker install/Helm Chart): Helm
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE2 custom cluster
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version: v1.23.14+rke2r1 to v1.25.7+rke2r1
  • Cluster Type (Local/Downstream): Downstream Custom cluster(RKE2)

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    • If custom, define the set of permissions: Admin

Describe the bug
Sometimes, when performing an HA upgrade of Rancher from v2.7.0 to v2.7.2-rc6 with an AWS RKE2 custom cluster, the cluster goes into an Updating state, the etcd nodes go into a Reconciling state, and the INFO logs are spammed with the same message over and over.

To Reproduce

  1. Create an HA Rancher server on v2.7.0
  2. Provision a downstream AWS RKE2 custom cluster (hardened) with Kubernetes version v1.23.14+rke2r1
  3. Perform an HA upgrade of Rancher to v2.7-head (v2.7.2-rc6)
  4. Make sure the upgrade succeeds and the downstream cluster comes back Active
  5. Upgrade the downstream cluster's Kubernetes version to v1.25.7+rke2r1 and save
  6. Check the system behavior, logs, and cluster state

Result
a. The downstream cluster initially came back Active, then went into an Updating state with an error and never returned to Active.
b. Rancher logs are spammed with INFO messages, and the same message keeps coming every second.
c. One of the nodes (etcd) is stuck in a Reconciling state and never recovers; the UI shows the message (Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet)

Expected Result
a. Rancher logs should not be spammed with the same message every second.

Logs
2023/03/17 18:36:58 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:36:58 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:03 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:03 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:08 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:08 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:13 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:13 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:18 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:18 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:23 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:23 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:28 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:28 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:33 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:33 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:38 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:38 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:43 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:43 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:48 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:48 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:51 [ERROR] error syncing 'fleet-default/custom-5e4c4afa8ddd': handler unmanaged-machine: the server was unable to return a response in the time allotted, but may still be processing the request (get nodes ip-172-31-2-93), requeuing
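
To quantify how repetitive a log excerpt like this is, the entries can be split on their leading timestamps and counted by message. The following is a minimal sketch, not Rancher tooling: the function name `count_repeated_messages` is hypothetical, and the sample messages are shortened from the excerpt above for readability.

```python
import re
from collections import Counter

# An entry starts wherever a "YYYY/MM/DD HH:MM:SS [" timestamp appears,
# so this also works when several entries were pasted onto one line.
_ENTRY_START = re.compile(r"(?=\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} \[)")
_TIMESTAMP = re.compile(r"^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ")


def count_repeated_messages(log_text: str) -> Counter:
    """Count log entries by message text, ignoring their timestamps."""
    counts: Counter = Counter()
    for entry in _ENTRY_START.split(log_text):
        entry = " ".join(entry.split())  # collapse line breaks and extra spaces
        if entry:
            counts[_TIMESTAMP.sub("", entry)] += 1
    return counts


# Shortened sample entries (illustrative, abbreviated from the excerpt).
sample = (
    "2023/03/17 18:36:58 [INFO] [planner] rkecluster fleet-default/dascrke2: "
    "waiting for probes: etcd, kubelet "
    "2023/03/17 18:37:03 [INFO] [planner] rkecluster fleet-default/dascrke2: "
    "waiting for probes: etcd, kubelet "
    "2023/03/17 18:37:51 [ERROR] error syncing 'fleet-default/custom-5e4c4afa8ddd': requeuing"
)

for message, count in count_repeated_messages(sample).most_common():
    print(count, message)
```

Running the same function over the full excerpt would show a single planner message accounting for all INFO entries.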

@dasarinaidu dasarinaidu added this to the v2.7.2 milestone Mar 17, 2023
@sowmyav27 sowmyav27 changed the title Rancher INFO Logs are panic when one of the node is in Reconciling state Rancher logs are spammed with logs when one of the node of a downstream RKE2 cluster is in Reconciling state Mar 17, 2023
@sowmyav27 sowmyav27 changed the title Rancher logs are spammed with logs when one of the node of a downstream RKE2 cluster is in Reconciling state Rancher server logs are spammed with logs when one of the node of a downstream RKE2 cluster is in Reconciling state Mar 17, 2023
@Oats87

Oats87 commented Mar 21, 2023

This is a UI bug.

The root cause is that we are delivering an invalid configuration to RKE2, namely profile: null in machineGlobalConfig.

This is very easily reproducible: create a new v2prov cluster in the UI, click "Edit as YAML", and note the profile: null under machineGlobalConfig.

Screenshot 2023-03-20 at 5 44 45 PM

See attached screenshot.
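
A quick way to check whether a cluster object carries such null entries is to walk down to machineGlobalConfig and list the keys whose value is null. This is a minimal sketch under stated assumptions: the cluster is already loaded into a Python dict, the config sits at spec.rkeConfig.machineGlobalConfig, and the function name and the example values (including `cni: calico`) are illustrative only.

```python
from typing import Any, Dict, List


def null_machine_global_config_keys(cluster: Dict[str, Any]) -> List[str]:
    """Return the machineGlobalConfig keys whose value is null (None)."""
    # "or {}" guards against sections that are missing or explicitly null.
    spec = cluster.get("spec") or {}
    rke_config = spec.get("rkeConfig") or {}
    config = rke_config.get("machineGlobalConfig") or {}
    return sorted(key for key, value in config.items() if value is None)


# Illustrative cluster object mirroring what "Edit as YAML" shows.
example_cluster = {
    "spec": {
        "kubernetesVersion": "v1.25.7+rke2r1",
        "rkeConfig": {
            "machineGlobalConfig": {"cni": "calico", "profile": None},
        },
    }
}

print(null_machine_global_config_keys(example_cluster))
```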

@Sahota1225 Sahota1225 transferred this issue from rancher/rancher Mar 21, 2023
@snasovich
Contributor

@nwmac @gaktive, this is likely a side effect of the recent profile behavior changes made to accommodate k8s 1.25, so it is possibly a regression that we don't want in the upcoming 2.7.2 release. Marking as regression and release-blocker.

@nwmac
Member

nwmac commented Mar 21, 2023

@snasovich Backend should ignore the null profile. If the configuration is invalid, backend should reject the request.
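
The two behaviors suggested here (ignore a null profile, reject a configuration that is actually invalid) could look roughly like the sketch below. It is not Rancher's backend code: the function name is hypothetical, and treating "profile must be a non-empty string" as the validity rule is an assumption made for illustration.

```python
from typing import Any, Dict, Optional


def normalize_machine_global_config(
    config: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Drop null entries, then reject a profile that is not a non-empty string."""
    # Ignore nulls: a key set to None is treated as if it were absent.
    cleaned = {key: value for key, value in (config or {}).items() if value is not None}

    # Reject invalid values (assumed rule: profile must be a non-empty string).
    if "profile" in cleaned:
        profile = cleaned["profile"]
        if not isinstance(profile, str) or not profile.strip():
            raise ValueError(f"invalid profile value: {profile!r}")
    return cleaned


print(normalize_machine_global_config({"cni": "calico", "profile": None}))
print(normalize_machine_global_config({"profile": "cis-1.23"}))
```

With this split, a null coming from the UI is harmless, while a malformed value fails fast at request time instead of surfacing later as a node stuck in Reconciling.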

@nwmac
Member

nwmac commented Mar 21, 2023

/forwardport v2.7.next2

@snasovich
Contributor

@snasovich Backend should ignore the null profile. If the configuration is invalid, backend should reject the request.

@nwmac, I agree with this in principle, but it's still a behavior change on the UI's part that we want reverted. I see it's already in test; thank you for the quick turnaround.

@Sahota1225

rancher/rancher#40942 - backend issue

@jameson-mcghee

jameson-mcghee commented Mar 23, 2023

Setting this ticket to Reopened until a Test Template can be provided.

cc @nwmac

@gaktive gaktive added the QA/dev-automation Issues that engineers have written automation around so QA doesn't have look at this label Apr 24, 2023
@gaktive gaktive changed the title Rancher server logs are spammed with logs when one of the node of a downstream RKE2 cluster is in Reconciling state Rancher server logs are spammed with logs when one of the node of a downstream RKE2 cluster is in Reconciling state (UI sends null) May 9, 2023
@sowmyav27 sowmyav27 removed the team/area2 Hostbusters label May 11, 2023
@nwmac nwmac removed the regression label May 22, 2023
@nwmac
Member

nwmac commented May 22, 2023

The remaining work is to comment out the profile key when editing as YAML. This is currently driven off of data that comes from the backend via the schema.

Bumping out of 2.7.next2: this requires further investigation and will require test automation.
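
As an illustration of what "commenting out the key" means at the text level, the sketch below rewrites lines such as `profile: null` into `# profile: null` while keeping indentation. It is a plain text transform under stated assumptions, not the schema-driven mechanism mentioned above: the function name is hypothetical, and only explicit `null` or `~` scalars on a single line are handled.

```python
import re
from typing import Set

# Matches "<indent><key>: null" or "<indent><key>: ~" on a single line.
_NULL_ENTRY = re.compile(
    r"^(?P<indent>\s*)(?P<entry>(?P<key>[\w.-]+):\s*(?:null|~))\s*$"
)


def comment_out_null_keys(yaml_text: str, keys: Set[str]) -> str:
    """Comment out the given keys wherever their value is an explicit null."""
    result = []
    for line in yaml_text.splitlines():
        match = _NULL_ENTRY.match(line)
        if match and match.group("key") in keys:
            result.append(f"{match.group('indent')}# {match.group('entry')}")
        else:
            result.append(line)
    return "\n".join(result)


example_yaml = "machineGlobalConfig:\n  cni: calico\n  profile: null"
print(comment_out_null_keys(example_yaml, {"profile"}))
```

The key stays visible to the person editing, but it is no longer sent as part of the configuration.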

@nwmac nwmac modified the milestones: v2.7.next2, v2.7.next3 May 22, 2023
@gaktive
Member

gaktive commented Jun 8, 2023

@Oats87 we should coordinate with you and/or your team about how UI & backend should handle this.

@gaktive gaktive added the team/area2 Hostbusters label Jun 8, 2023
@gaktive
Member

gaktive commented Jun 8, 2023

cc @Sahota1225

@gaktive gaktive modified the milestones: v2.7.next3, v2.8.next4 Jul 18, 2023
@gaktive
Member

gaktive commented Sep 6, 2023

This may be something the UI is doing when trying to populate fields we expect, so we should scrub upon save.
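
Scrubbing upon save could be as simple as removing every null-valued key from the payload just before it is submitted. The sketch below shows the idea in Python for brevity; the function name and payload are hypothetical, and whether every null key is safe to drop (as opposed to only profile) is an assumption that would need confirming.

```python
from typing import Any


def scrub_nulls(value: Any) -> Any:
    """Recursively drop dict keys whose value is None; other values are kept."""
    if isinstance(value, dict):
        return {
            key: scrub_nulls(item)
            for key, item in value.items()
            if item is not None
        }
    if isinstance(value, list):
        # List items are positional, so they are cleaned but never removed.
        return [scrub_nulls(item) for item in value]
    return value


# Illustrative payload as the form might populate it before saving.
payload = {
    "spec": {
        "rkeConfig": {
            "machineGlobalConfig": {"cni": "calico", "profile": None},
        }
    }
}

print(scrub_nulls(payload))
```

The input is left unmodified, so the form state can keep its placeholder fields while only the scrubbed copy is sent.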

@gaktive gaktive added size/2 Size Estimate 2 [zube]: Groomed and removed [zube]: Backlog labels Sep 6, 2023
@nwmac nwmac modified the milestones: v2.8.0, v2.8.next1 Sep 22, 2023
@gaktive gaktive assigned jordojordo and unassigned thaneunsoo Jan 24, 2024
@nwmac nwmac added QA/manual-test Indicates issue requires manually testing and removed QA/dev-automation Issues that engineers have written automation around so QA doesn't have look at this labels Jan 30, 2024