
[BUG] Creating/deleting management resources cause resources to be unsubscribed #40558

Closed
richard-cox opened this issue Feb 15, 2023 · 8 comments
Labels: kind/bug, priority/2, team/area1

@richard-cox (Member)

Rancher Server Setup

  • Rancher version: v2.7-head (1dd2af2)
  • Installation option (Docker install/Helm Chart):
    • docker install
    • docker run -d --restart=unless-stopped --privileged -e CATTLE_PASSWORD_MIN_LENGTH=8 --name $container --network host rancher/rancher:v2.7-head --acme-domain

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    • Admin

Describe the bug

I can't quite pin down the pattern of which resources cause the resource.stop messages; there are some findings below.

To Reproduce

  • Open browser 1 and navigate to the Dashboard's Cluster Management page
  • Open browser 1's Dev Tools --> Network tab --> WS Tab --> select v1/subscription --> Messages tab
  • Open browser 2 and navigate to the Dashboard's Cluster Management page
  • In browser 2 --> Import Existing --> Generic --> enter anything for the Name --> Create

Result

  • In browser 1's dev tools there will be multiple resource.error and resource.stop messages
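For reference, the frames seen on the v1/subscription socket look roughly like this (the payload shape is illustrative and field names may differ; the raw error strings are captured under Additional context below):

```js
// Illustrative only - the exact field names on the v1/subscription socket may differ.
// A healthy watch carries resource.create / resource.change / resource.remove frames;
// the bug is that frames like these arrive for unrelated resource types:
const stopFrame = {
  name:         'resource.stop',
  resourceType: 'management.cattle.io.fleetworkspace'
};

const errorFrame = {
  name:         'resource.error',
  resourceType: 'management.cattle.io.fleetworkspace',
  data:         {
    error: 'event watch error: an error on the server ("unable to decode an event from the watch stream: http2: response body closed") has prevented the request from succeeding'
  }
};
```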

Expected Result

  • No resource.stop message - aka resource subscriptions aren't dropped

Screenshots

Additional context

management.cattle.io.fleetworkspace

Specific resource messages

[screenshot]

Create - resource.error and resource.stop
Edit - Neither
Delete - resource.error and resource.stop

resource.error data

"event watch error: an error on the server (\"unable to decode an event from the watch stream: http2: response body closed\") has prevented the request from succeeding"
All messages

[screenshot]

management.cattle.io.settings

Create / Edit did not cause resource.stop or resource.error

provisioning.cattle.io.cluster

Create - resource.stop but not resource.error
Delete - resource.stop and resource.error

resource.error data 1 (many)

notifier watch error: clusters.provisioning.cattle.io \"asdsad\" not found: clusters.provisioning.cattle.io \"asdsad\" not found

resource.error data 2 (one)

event watch error: an error on the server (\"unable to decode an event from the watch stream: http2: response body closed\") has prevented the request from succeeding

namespace (via local cluster explorer)

Create / Edit / Delete in both v1 and k8s/clusters/local/v1/subscribe show correct events, no resource.stop or resource.error

management.cattle.io.project (via local cluster explorer)

Create - resource.error and resource.stop
Edit - Neither
Delete - resource.error and resource.stop

resource.error data 1

notifier watch error: projects.management.cattle.io \"p-mrqlr\" not found: projects.management.cattle.io \"p-mrqlr\" not found

resource.error data 2

event watch error: an error on the server (\"unable to decode an event from the watch stream: http2: response body closed\") has prevented the request from succeeding
richard-cox added the kind/bug label Feb 15, 2023
richard-cox added this to the v2.7.2 milestone Feb 15, 2023
KevinJoiner self-assigned this Feb 16, 2023
gaktive modified the milestones: v2.7.2, 2023-Q2-v2.7x Feb 16, 2023
@gaktive (Member) commented Feb 16, 2023

For 2.7.2, UI will do a client-side workaround. @MbolotSuse to link to an existing issue to help with buffering results to the UI.
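As a rough illustration of what "buffering results to the UI" could look like on the client (a sketch only, not the actual dashboard change; the URL and applyBatch are hypothetical):

```js
// Sketch of client-side buffering (not the actual dashboard change): queue
// incoming socket events and apply them in batches, so a burst of churn around
// a stop/re-subscribe cycle produces one UI update instead of many.
const ws = new WebSocket('wss://<rancher-host>/v1/subscribe'); // placeholder URL

const queue = [];

ws.addEventListener('message', (e) => {
  queue.push(JSON.parse(e.data));
});

setInterval(() => {
  if (queue.length) {
    // applyBatch is a hypothetical stand-in for whatever writes events into the store
    applyBatch(queue.splice(0, queue.length));
  }
}, 1000);
```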

@MbolotSuse (Contributor)

Backend Issue for Watch buffering: #39568

@richard-cox (Member, Author) commented Feb 17, 2023

Just tracking results.

I've validated that the same behaviour for fleet workspaces happens in 2.7.0 (as mentioned in the call).

[screenshot]

Similar resource messages for fleet workspaces are also seen in 2.6.9

Further testing

RKE2 DO cluster. Created/edited it and navigated to the cluster detail page.

Create Cluster

resource.stop/start cycles: 3
cluster state: updated ok
pool state: updated ok

Scale Pool

resource.stop/start cycles: 11
cluster state: updated ok
pool state: updated ok

Add Pool

Had to fix an undefined machine.status error before this worked: https://github.com/rancher/dashboard/blob/master/shell/models/provisioning.cattle.io.cluster.js#L460
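For context, the throw came from machines that don't have a status yet; a guard along these lines avoids it (a sketch only, the real getter's logic differs and the 'Ready' check is purely illustrative):

```js
// Sketch only, not the actual patch: machine.status can be undefined while a
// machine is still provisioning, so anything reading it needs a guard.
// `machines` stands in for the model's machine list; the 'Ready' check is
// illustrative, the real getter's logic differs.
function unavailableMachines(machines) {
  return machines.filter((machine) => {
    const conditions = machine?.status?.conditions || [];

    return !conditions.some((c) => c.type === 'Ready' && c.status === 'True');
  });
}
```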

resource.stop/start cycles: 10
cluster state: updated ok
pool state: new pool never shows new machine

The machines count is 3
[screenshot]

Remove Pool

resource.stop/start cycles: 4
cluster state: updated ok
pool state: the deleted pool's machine is never removed (though the pool itself is removed)

The resource.remove message for the machine is received by the UI

The machines count is 2

[screenshot]

Three nodes, 1 per role

As per rancher/dashboard#7819

Create Cluster

resource.stop/start cycles: 18
cluster state: updated ok
pool state: updated mostly ok (1 pool node summary graph did not update)

[screenshot]

Scale Pool

resource.stop/start cycles: 8
cluster state: stuck at updating
pool state: updated mostly ok (1 pool node summary graph did not update)

The last resource.change event received for the cluster has a metadata.state "updating"

[screenshot]

Other

Also tested master vs a local build with fixes to unavailableMachines and the old way we handled resource.stop events. Nothing conclusive there.

TL;DR

There are a number of cases of 'stale' screen content when creating an RKE2 DO cluster, adding/removing machine pools (deployments) and scaling a pool (deployment) up/down. Stale content covers:

  • Cluster State
  • Freshly created pool does not show new machine
  • Machine from a removed pool remained
  • Pool's Machine summary bar graph not updating
  • Pool scaled down still shows removed machine
  • Created cluster shows empty pools without deployments

I can't nail down a consistent way to reproduce these; I've tried all of the following and still see odd errors:

  • 2.7-head 62c5e32 (deployed)
  • local dashboard
  • local dashboard with previous way to handle resource.stop events (which should have solved the issue)
  • As above, but with fixes to unavailableMachines (see below)

From what I can tell there are three causes (see the sketch after this list for the re-subscribe gap):

  • We don't receive the relevant resource.create, resource.change, resource.remove message
    • Probably missed due to the lag between resource.stop and re-subscribing
  • We do receive the relevant resource.create, resource.change, resource.remove message
    • Mystery Vue reactivity issue
  • There are errors in shell/models/provisioning.cattle.io.cluster.js get unavailableMachines
    • Sometimes the machine we iterate over does not contain a status or status.condition
    • Must be 'crap in' somewhere
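A rough sketch of the re-subscribe gap behind the first cause (not the dashboard's actual subscribe handling; frame and field names are illustrative only):

```js
// Rough sketch (not the dashboard's actual subscribe handling) of the re-subscribe
// gap: on resource.stop the client must re-subscribe, and anything that changed
// before the new watch is established is never replayed unless the collection is
// re-fetched. Frame and field names are illustrative only.
const ws = new WebSocket('wss://<rancher-host>/v1/subscribe'); // placeholder URL

const lastRevision = {}; // hypothetical bookkeeping: last revision seen per resource type

ws.addEventListener('message', (e) => {
  const msg = JSON.parse(e.data);

  if (msg.name === 'resource.stop') {
    // Re-subscribe from the last revision we saw; anything that changed in the gap
    // between the stop and the new watch never reaches the UI.
    ws.send(JSON.stringify({
      resourceType:    msg.resourceType,
      resourceVersion: lastRevision[msg.resourceType]
    }));
  }
});
```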

@torchiaf (Contributor) commented Feb 20, 2023

Three nodes, 1 pool

As per rancher/dashboard#7819

Create Cluster

Driver: Digital Ocean
Machine Count: 3 (all roles)
Kubernetes version: v1.23.15+rke2r1

resource.stop/start cycles: n
cluster state: updated ok
node states: Machines graph did not update; only 1 or 2 nodes become available each time

[screenshots]

The issue is consistently reproduced in Chrome/Edge browsers.

@richard-cox (Member, Author)

Think I've come up with three reasonable improvements for 2.7.2 for the three issues listed in #40558 (comment) (see rancher/dashboard#8224).

As there are no decent reproduction steps anywhere, it might take some playing to see whether the improvements resolve all issues.

In terms of this specific issue though, as the functionality has been the same for a number of releases (including going back to 2.6) and we have something proposed to fix the linked issues, I'm going to close this.

If the improvements still require more work, we may need to look into a joint fix:

  1. Reducing the need for the frontend to re-subscribe given resource.stop events (30 minute refresh of kube socket, frequent permissions changes, etc)
  2. Reducing the impact of the re-subscribe. Could see issues with possibly stale local revision (though this won't cause spam), scale of fetching everything again, etc

Something concrete that would be grand is to smother the two resource.error messages that are harmless/to be expected (unable to decode an event.. & notifier watch error: clusters.provisioning.cattle.io \"asdsad\" not found.. more details above). Those appear a lot in logs.
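As an illustration only, smothering those two messages client-side could be as small as a filter like the one below (the msg.data.error shape is assumed, and the real fix may well belong on the backend instead):

```js
// Illustration only of 'smothering' the two known-harmless errors on the client;
// the payload shape (msg.data.error) is assumed, and the real fix may well belong
// on the backend instead.
const HARMLESS = [
  'unable to decode an event from the watch stream',
  'notifier watch error'
];

function shouldSurfaceResourceError(msg) {
  const err = msg?.data?.error || '';

  return !HARMLESS.some((fragment) => err.includes(fragment));
}
```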

@boltdog2022

Hi, after this issue happens, is there any way to recover?

@richard-cox (Member, Author)

> Hi, after this issue happens, is there any way to recover?

@boltdog2022 The user just needs to refresh their browser. That will get the latest resource info and subscribe to changes from that revision.

@boltdog2022

> Hi, after this issue happens, is there any way to recover?

> @boltdog2022 The user just needs to refresh their browser. That will get the latest resource info and subscribe to changes from that revision.

Thanks, by the way, can you have a look at my issue? I don't know how it happens and how to resolve it.
#41748
