
[BUG] Creating/deleting management resources cause resources to be unsubscribed #40558

Closed
richard-cox opened this issue Feb 15, 2023 · 8 comments
Labels: kind/bug, priority/2, team/area1

@richard-cox (Member)

Rancher Server Setup

  • Rancher version: v2.7-head (1dd2af2)
  • Installation option (Docker install/Helm Chart):
    • docker install
    • docker run -d --restart=unless-stopped --privileged -e CATTLE_PASSWORD_MIN_LENGTH=8 --name $container --network host rancher/rancher:v2.7-head --acme-domain

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    • Admin

Describe the bug

I can't quite pin down the pattern of which resources cause the resource.stop messages; there are some findings below.

To Reproduce

  • Open browser 1 and navigate to the Dashboard's Cluster Management page
  • Open browser 1's Dev Tools --> Network tab --> WS Tab --> select v1/subscription --> Messages tab
  • Open browser 2 and navigate to the Dashboard's Cluster Management page
  • In browser 2 --> Import Existing --> Generic --> enter anything for the Name --> Create

Result

  • In browser 1's dev tools there will be multiple resource.error and resource.stop messages
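For reference, the frames seen on the v1/subscription socket look roughly like this (the payload shape is illustrative and field names may differ; the raw error strings are captured under Additional context below):

```js
// Illustrative only - the exact field names on the v1/subscription socket may differ.
// A healthy watch carries resource.create / resource.change / resource.remove frames;
// the bug is that frames like these arrive for unrelated resource types:
const stopFrame = {
  name:         'resource.stop',
  resourceType: 'management.cattle.io.fleetworkspace'
};

const errorFrame = {
  name:         'resource.error',
  resourceType: 'management.cattle.io.fleetworkspace',
  data:         {
    error: 'event watch error: an error on the server ("unable to decode an event from the watch stream: http2: response body closed") has prevented the request from succeeding'
  }
};
```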

Expected Result

  • No resource.stop message - aka resource subscriptions aren't dropped

Screenshots

Additional context

management.cattle.io.fleetworkspace

Specific resource messages

[screenshot]

Create - resource.error and resource.stop
Edit - Neither
Delete - resource.error and resource.stop

resource.error data

"event watch error: an error on the server (\"unable to decode an event from the watch stream: http2: response body closed\") has prevented the request from succeeding"
All messages

[screenshot]

management.cattle.io.settings

Create / Edit did not cause resource.stop or resource.error

provisioning.cattle.io.cluster

Create - resource.stop but not resource.error
Delete - resource.stop and resource.error

resource.error data 1 (many)

notifier watch error: clusters.provisioning.cattle.io \"asdsad\" not found: clusters.provisioning.cattle.io \"asdsad\" not found

resource.error data 2 (one)

event watch error: an error on the server (\"unable to decode an event from the watch stream: http2: response body closed\") has prevented the request from succeeding

namespace (via local cluster explorer)

Create / Edit / Delete in both v1 and k8s/clusters/local/v1/subscribe show correct events, no resource.stop or resource.error

management.cattle.io.project (via local cluster explorer)

Create - resource.error and resource.stop
Edit - Neither
Delete - resource.error and resource.stop

resource.error data 1

notifier watch error: projects.management.cattle.io \"p-mrqlr\" not found: projects.management.cattle.io \"p-mrqlr\" not found

resource.error data 2

event watch error: an error on the server (\"unable to decode an event from the watch stream: http2: response body closed\") has prevented the request from succeeding
richard-cox added the kind/bug label Feb 15, 2023
richard-cox added this to the v2.7.2 milestone Feb 15, 2023
KevinJoiner self-assigned this Feb 16, 2023
gaktive modified the milestones: v2.7.2, 2023-Q2-v2.7x Feb 16, 2023
@gaktive (Member) commented Feb 16, 2023

For 2.7.2, UI will do a client-side workaround. @MbolotSuse to link to an existing issue to help with buffering results to the UI.
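As a rough illustration of what "buffering results to the UI" could look like on the client (a sketch only, not the actual dashboard change; the URL and applyBatch are hypothetical):

```js
// Sketch of client-side buffering (not the actual dashboard change): queue
// incoming socket events and apply them in batches, so a burst of churn around
// a stop/re-subscribe cycle produces one UI update instead of many.
const ws = new WebSocket('wss://<rancher-host>/v1/subscribe'); // placeholder URL

const queue = [];

ws.addEventListener('message', (e) => {
  queue.push(JSON.parse(e.data));
});

setInterval(() => {
  if (queue.length) {
    // applyBatch is a hypothetical stand-in for whatever writes events into the store
    applyBatch(queue.splice(0, queue.length));
  }
}, 1000);
```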

@MbolotSuse (Contributor)

Backend Issue for Watch buffering: #39568

@richard-cox (Member, Author) commented Feb 17, 2023

Just tracking results.

I've validated that the same behaviour for fleet workspaces happens in 2.7.0 (as mentioned in the call).

[screenshot]

Similar resource messages for fleet workspaces are also seen in 2.6.9

Further testing

RKE2 DO cluster. Created/edited it and navigated to the cluster detail page.

Create Cluster

resource.stop/start cycles: 3
cluster state: updated ok
pool state: updated ok

Scale Pool

resource.stop/start cycles: 11
cluster state: updated ok
pool state: updated ok

Add Pool

Had to fix an undefined machine.status error before this worked: https://github.com/rancher/dashboard/blob/master/shell/models/provisioning.cattle.io.cluster.js#L460
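For context, the throw came from machines that don't have a status yet; a guard along these lines avoids it (a sketch only, the real getter's logic differs and the 'Ready' check is purely illustrative):

```js
// Sketch only, not the actual patch: machine.status can be undefined while a
// machine is still provisioning, so anything reading it needs a guard.
// `machines` stands in for the model's machine list; the 'Ready' check is
// illustrative, the real getter's logic differs.
function unavailableMachines(machines) {
  return machines.filter((machine) => {
    const conditions = machine?.status?.conditions || [];

    return !conditions.some((c) => c.type === 'Ready' && c.status === 'True');
  });
}
```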

resource.stop/start cycles: 10
cluster state: updated ok
pool state: new pool never shows new machine

The machines count is 3
[screenshot]

Remove Pool

resource.stop/start cycles: 4
cluster state: updated ok
pool state: the deleted pool's machine is never removed (though the pool itself is removed)

The resource.remove message for the machine is received by the UI

The machines count is 2

[screenshot]

Three nodes, 1 per role

As per rancher/dashboard#7819

Create Cluster

resource.stop/start cycles: 18
cluster state: updated ok
pool state: updated mostly ok (1 pool node summary graph did not update)

[screenshot]

Scale Pool

resource.stop/start cycles: 8
cluster state: stuck at updating
pool state: updated mostly ok (1 pool node summary graph did not update)

The last resource.change event received for the cluster has a metadata.state "updating"

[screenshot]

Other

Also tested master vs a local build with fixes to unavailableMachines and the old way we handled resource.stop events. Nothing conclusive there.

TL;DR

There are a number of cases of 'stale' screen content when creating an RKE2 DO cluster, adding/removing machine pools (deployments) and scaling a pool (deployment) up/down. Stale content covers:

  • Cluster State
  • Freshly created pool does not show new machine
  • Machine from a removed pool remained
  • Pool's Machine summary bar graph not updating
  • Pool scaled down still shows removed machine
  • Created cluster shows empty pools without deployments

I can't nail down a consistent way to reproduce these; I've tried all of the following and still see odd errors:

  • 2.7-head 62c5e32 (deployed)
  • local dashboard
  • local dashboard with previous way to handle resource.stop events (which should have solved the issue)
  • As above, but with fixes to unavailableMachines (see below)

From what I can tell there are three causes (see the sketch after this list for the re-subscribe gap):

  • We don't receive the relevant resource.create, resource.change, resource.remove message
    • Probably missed due to the lag between resource.stop and re-subscribing
  • We do receive the relevant resource.create, resource.change, resource.remove message
    • Mystery Vue reactivity issue
  • There are errors in shell/models/provisioning.cattle.io.cluster.js get unavailableMachines
    • Sometimes the machine we iterate over does not contain a status or status.condition
    • Must be 'crap in' somewhere
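A rough sketch of the re-subscribe gap behind the first cause (not the dashboard's actual subscribe handling; frame and field names are illustrative only):

```js
// Rough sketch (not the dashboard's actual subscribe handling) of the re-subscribe
// gap: on resource.stop the client must re-subscribe, and anything that changed
// before the new watch is established is never replayed unless the collection is
// re-fetched. Frame and field names are illustrative only.
const ws = new WebSocket('wss://<rancher-host>/v1/subscribe'); // placeholder URL

const lastRevision = {}; // hypothetical bookkeeping: last revision seen per resource type

ws.addEventListener('message', (e) => {
  const msg = JSON.parse(e.data);

  if (msg.name === 'resource.stop') {
    // Re-subscribe from the last revision we saw; anything that changed in the gap
    // between the stop and the new watch never reaches the UI.
    ws.send(JSON.stringify({
      resourceType:    msg.resourceType,
      resourceVersion: lastRevision[msg.resourceType]
    }));
  }
});
```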

@torchiaf (Contributor) commented Feb 20, 2023

Three nodes, 1 pool

As per rancher/dashboard#7819

Create Cluster

Driver: Digital Ocean
Machine Count: 3 (all roles)
Kubernetes version: v1.23.15+rke2r1

resource.stop/start cycles: n
cluster state: updated ok
node states: Machines graph did not update; only 1 or 2 nodes become available each time

[screenshots]

The issue is consistently reproduced in Chrome/Edge browsers.

@richard-cox (Member, Author)

Think I've come up with three reasonable improvements for 2.7.2 for the three issues listed in #40558 (comment) (see rancher/dashboard#8224).

As there are no decent reproduction steps anywhere, it might take some playing to see whether the improvements resolve all issues.

In terms of this specific issue though, as the functionality has been the same for a number of releases (including going back to 2.6) and we have something proposed to fix the linked issues, I'm going to close this.

If the improvements still require more work, we may need to look into a joint fix:

  1. Reducing the need for the frontend to re-subscribe given resource.stop events (30 minute refresh of kube socket, frequent permissions changes, etc)
  2. Reducing the impact of the re-subscribe. Could see issues with possibly stale local revision (though this won't cause spam), scale of fetching everything again, etc

Something concrete that would be grand is to smother the two resource.error messages that are harmless/to be expected (unable to decode an event.. & notifier watch error: clusters.provisioning.cattle.io \"asdsad\" not found.. more details above). Those appear a lot in logs.
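As an illustration only, smothering those two messages client-side could be as small as a filter like the one below (the msg.data.error shape is assumed, and the real fix may well belong on the backend instead):

```js
// Illustration only of 'smothering' the two known-harmless errors on the client;
// the payload shape (msg.data.error) is assumed, and the real fix may well belong
// on the backend instead.
const HARMLESS = [
  'unable to decode an event from the watch stream',
  'notifier watch error'
];

function shouldSurfaceResourceError(msg) {
  const err = msg?.data?.error || '';

  return !HARMLESS.some((fragment) => err.includes(fragment));
}
```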

@boltdog2022

Hi, after this issue happens, is there any way to recover?

@richard-cox (Member, Author)

> Hi, after this issue happens, is there any way to recover?

@boltdog2022 The user just needs to refresh their browser. That will get the latest resource info and subscribe to changes from that revision.

@boltdog2022

> Hi, after this issue happens, is there any way to recover?

> @boltdog2022 The user just needs to refresh their browser. That will get the latest resource info and subscribe to changes from that revision.

Thanks, by the way, can you have a look at my issue? I don't know how it happens and how to resolve it.
#41748
