
cluster and rancher stores now use advanced worker #8717

Merged
30 commits merged into rancher:master May 17, 2023

Conversation

@ghost commented Apr 21, 2023

Still requires a queueing system, since some subscribes are fired before the advanced worker is created

Summary

Fixes #8837

Occurred changes and/or fixed issues

Technical notes summary

Areas or cases that should be tested

Areas which could experience regressions

Screenshot/Video

@ghost ghost requested a review from richard-cox April 21, 2023 15:13
@github-actions github-actions bot assigned ghost Apr 21, 2023
@ghost ghost marked this pull request as ready for review April 25, 2023 14:25
@ghost ghost changed the title WIP: cluster and rancher stores now use advanced worker cluster and rancher stores now use advanced worker Apr 25, 2023
@richard-cox (Member) commented

@Sean-McQ There are some test failures, can you take a look? I've added the milestone; can you create and link the issue as well?

@richard-cox richard-cox added this to the v2.7.next2 milestone Apr 27, 2023
@richard-cox (Member) left a comment

  • going from a context with no cluster (cluster manager, with a cluster of _) to one with a cluster (explorer) throws the below error
    (actual error) TypeError: Cannot read properties of undefined (reading 'revision')
      at Object.<anonymous> (subscribe.js:973:1)
    
    (in nextResourceVersion)
  • there might be an issue with the way stop requests are handled for steve based stores. At the moment the resource watcher deletes the cached revision and doesn't send one when starting the subscription again. This means we get a resource.create for every matching resource (which doesn't scale). We need the same approach as in current master, where we try to sub with a best-effort revision (which should normally be ok). The only way around this, given the current code base, is to bring in some of the sync worker's ui-thread requests (not store / getters, but the promises-over-thread-boundary stuff; see the sketch below)
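For context, the "promises over the thread boundary" idea looks roughly like the following (a minimal sketch with hypothetical names, not the actual sync worker code): the worker posts a request tagged with an id, and a pending-promise map resolves it when the UI thread replies.

```js
// Minimal sketch of a promise-based request across the worker/UI thread
// boundary. All names here are hypothetical.
const pendingRequests = new Map();
let nextRequestId = 0;

// Worker side: wrap a postMessage round trip in a promise
function requestLatestRevision(resourceType) {
  return new Promise((resolve) => {
    const requestId = nextRequestId++;

    pendingRequests.set(requestId, resolve);
    self.postMessage({ action: 'getLatestRevision', requestId, resourceType });
  });
}

// Worker side: resolve the matching promise when the UI thread replies
self.addEventListener('message', ({ data }) => {
  if (data.action === 'latestRevisionResponse' && pendingRequests.has(data.requestId)) {
    pendingRequests.get(data.requestId)(data.revision);
    pendingRequests.delete(data.requestId);
  }
});
```

The UI thread's message handler would answer getLatestRevision by reading the store and posting back latestRevisionResponse with the same requestId.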

As a rough guide, here's what I think should be happening given the following scenarios

steve socket unsubs everything (CATTLE_WATCH_TIMEOUT_SECONDS related)

  • we receive resource.stop per type
  • in the resource.stop handler, as we're still interested in the type, we call watch with no revision supplied
  • watch handles the missing revision by using a best effort to find one (cache.revision, or the highest resource revision from the store; see the sketch below)
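As a sketch of that best-effort lookup (hypothetical helper, assuming k8s-style metadata.resourceVersion fields):

```js
// Best-effort revision: prefer the cached revision, otherwise fall back to
// the highest resourceVersion among resources already in the store.
function bestEffortRevision(cache, resources) {
  if (cache?.revision) {
    return cache.revision;
  }

  const revisions = resources
    .map((r) => Number.parseInt(r.metadata?.resourceVersion, 10))
    .filter((n) => !Number.isNaN(n));

  return revisions.length ? `${ Math.max(...revisions) }` : undefined;
}
```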

ui subscribes with a revision that's too old

  • Roughly in parallel
    • resource.error received
      • set error state of type
      • call resync
        • resync starts afresh (http request & re-sub with new revision)
    • resource.stop received
      • as per resource.stop above... but as we're in an error state we don't rewatch

From testing it looks like we're handling the latter ok, so it's just the former to look at. A rough sketch of the intended error/stop handling follows.
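Put as code, the intended handling sketches out roughly like this (hypothetical shape, not the actual subscribe.js handlers; watch and resync stand in for the real actions):

```js
// resource.error handler: mark the type as errored and kick off a resync
function onResourceError(state, msg) {
  if (msg.error?.includes('too old')) {
    state.errors[msg.resourceType] = 'TOO_OLD'; // set error state of type
    resync(msg.resourceType);                   // http request & re-sub with new revision
  }
}

// resource.stop handler: rewatch only when we're not in an error state
function onResourceStop(state, msg) {
  if (state.errors[msg.resourceType]) {
    return; // TOO_OLD etc: the resync owns recovery, so don't rewatch here
  }

  watch(msg.resourceType); // still interested: rewatch with a best-effort revision
}
```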

Resolved review threads:
  • shell/plugins/steve/subscribe.js (several, some outdated)
  • shell/plugins/steve/worker/web-worker.basic.js (outdated)
  • shell/plugins/steve/resourceWatcher.js (outdated)
@richard-cox (Member) left a comment

Didn't get too far today. I need to resolve the existing review comments by just understanding your updates. There's one additional comment from this review, containing two points: one regarding an exception in the new batch mutator; the other is to ensure the resource.stop --> watch process works the same as the non-advanced worker (i.e. uses a revision from the store). That means fetching the revision across the worker boundary (or, alternatively, finding a way to add socket message resource revisions to the resource cache; sketched below).
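The second alternative could look something like this (a sketch, assuming socket messages carry a k8s-style metadata.resourceVersion; names are hypothetical):

```js
// Record the revision from every incoming socket message into the watcher's
// cache, so a later resource.stop --> watch can re-subscribe without having
// to fetch the store's revision across the worker boundary.
function onSocketMessage(watchCache, msg) {
  const revision = msg.data?.metadata?.resourceVersion;

  if (revision) {
    watchCache[msg.resourceType] = { ...watchCache[msg.resourceType], revision };
  }
}
```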

Resolved review thread: shell/plugins/steve/resourceWatcher.js
@richard-cox richard-cox self-requested a review May 9, 2023 17:23
bisht-richa and others added 17 commits May 10, 2023 10:44
Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
Remove dependency: @nuxtjs/proxy
Remove dependency: @nuxtjs/style-resources

Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
* Added placeholder in url input

* Added format checker for Url input

* Update test

* Fixed lint

* Fixed breaking for other providers

* Fixed lint
* Update vue options to preserve whitespace

* Bug fix with tidy-up
* Update extensions workflow to move assets into gh-pages branch

Fix gh-pages check

* Move bundle execution - add default basename and ext version

* Fix basename op

* PR changes
* upgrade @vue/cli-xxx

* Update Vue to latest 2.7

* Update eslint-plugin-vue

* Disable new linting rules

* Remove linting issue

* Pin Dom purify library version

* Add resolution to avoid conflicts with packages

* Update yarn/lock after the enforced resolution

* Exclude node 16 types resolution
* Fixed extra space bug in the generic cluster-page

* Fixed space issue in banner

* Revert changes

* Removed extra div
* Fixed selected rows counter

* Fixed lint

* Fixed counter in selection.js

* Small fix in toAdd condition

* Lints

* Fixed condition for selected row counter

* Changes in focusAdjacent function

* Fixed lints
This adds the following functionality to the violations list on the OPA Gatekeeper constraint detail page:

* Add a namespace column to the violations
* Make the violations list searchable
* Allow downloading the violations as a CSV, similar to CIS scanner violations

Signed-off-by: Bastian Hofmann <mail@bastianhofmann.de>
cnotv and others added 10 commits May 10, 2023 10:44
* Removed hasClass method

* Clean code
- supported store types now a const
- WORKERMODES --> WORKER_MODES
- Ensure advanced worker runs in cluster manager with no current cluster context
- Ensure `resource.stop` from advanced worker successfully kicks off watch


Also
- make rancher, management and cluster store names a const
- Fix some comments (jsdoc --> standard, todos)
- Made the web worker redispatch requirement clearer
- Persist watcher error state after resource.stop
  - this will ensure the resource.stop given alongside the resource.error TOO_OLD is ignored
- Ensure any errors are cleared when we successfully connect (given above)
  - Should fix the next resource.stop being ignored after recovering from TOO_OLD
- Fix resync watch params
  - these weren't correct, so the resync after TOO_OLD never worked
  - the format could be slimmer, but I think it matches how other sockets' data works

Note - I manufactured the TOO_OLD by setting a revision of 1 in subscribe resource.stop if it came from the advanced worker
@richard-cox (Member) left a comment

I've fixed some bugs

  • Ensured the advanced worker is used in cluster manager with no current cluster context
  • Ensured resource.stop from the advanced worker successfully kicks off watch
    • In current master socket state is tracked in the worker thread, but here we're checking state in the ui thread, so we need to work around it
  • Persist the error state instead of nuking all state when we receive a resource.stop
    • This will ensure the resource.stop given alongside the resource.error TOO_OLD is ignored
  • Fix resource.error TOO_OLD --> resync params
    • these weren't correct, so the resync after TOO_OLD never worked

With the above fixes, and a tweak to force TOO_OLD (setting a revision of 1 in subscribe resource.stop if it came from the advanced worker; see the sketch after the list below), I've seen a successful set of events

  1. resource.stop from BE
  2. resource.start (bad revision) to BE
  3. resource.stop & resource.error TOO_OLD from BE
  4. http request to BE
  5. resource.start (new revision) to BE
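For reference, the forced-TOO_OLD tweak amounted to something like this (a temporary test hack in hypothetical shape; fromAdvancedWorker and the watch signature here are illustrative):

```js
// Temporary test tweak (hypothetical shape): when handling a resource.stop
// that came from the advanced worker, deliberately re-subscribe with
// revision 1 so the backend immediately rejects it as "too old".
if (fromAdvancedWorker) {
  watch({ type: msg.resourceType, revision: '1' }); // guaranteed-stale revision
}
```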

[Screenshots: Screenshot_20230512_195038, Screenshot_20230512_195414]

I had to merge instead of rebase, so the history is crazy. When merging please squash & merge

@ghost ghost merged commit e256669 into rancher:master May 17, 2023
@git-ival commented Jun 6, 2023

@Sean-McQ @richard-cox Is there a description of what the advanced worker is meant to do and how to validate for QA? It's unclear what the current vs. expected behavior is.

@ghost (Author) commented Jun 6, 2023

@git-ival The advanced worker is a JavaScript web worker: essentially a new thread, outside of the main UI thread, that processes CPU-intensive work which would otherwise block UX responsiveness in the browser tab.

In the case of the "Advanced Worker" feature, we've offloaded the websockets that would typically be waiting for real-time updates from the cluster, rancher, and management APIs into separate worker threads (one for each socket), so that bursts of incoming messages are handled separately and don't slow down or block the user's interaction with the Dashboard app in the browser. A rough sketch of the pattern is below.
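As a rough illustration of that pattern (hypothetical file and message names, not the actual Dashboard code):

```js
// resource-socket.worker.js: the websocket lives inside the worker, so
// parsing bursts of messages never blocks the UI thread.
const sockets = {};

self.addEventListener('message', ({ data }) => {
  if (data.action === 'connect') {
    const socket = new WebSocket(data.url); // e.g. a cluster/rancher/management subscribe url

    socket.onmessage = (event) => {
      // CPU-heavy parse (and potentially batching) happens off the UI thread
      const msg = JSON.parse(event.data);

      self.postMessage({ action: 'resourceChange', msg });
    };

    sockets[data.url] = socket;
  }
});

// UI thread side:
//   const worker = new Worker('resource-socket.worker.js');
//   worker.postMessage({ action: 'connect', url: subscribeUrl });
//   worker.onmessage = ({ data }) => commitToStore(data.msg);
```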

Simple validation would probably just mean verifying that the three websockets in question do in fact open in web workers. You can do this in Chrome devtools under "Network", selecting the "WS" (websockets) filter, and checking that the sockets opened in web workers as opposed to in the main UI thread.
[Screenshot: Chrome devtools Network/WS view]
In the above screenshot, I can tell the websockets in question are running in web workers by the gear icon next to their "name", and I can tell they are separate worker threads because the filenames under the "Initiator" column are all different (and random).

Apart from that, you'll want to perform some actions in another browser whose changes should appear in the active browser tab under test to ensure that the messages being received via websocket make their way back into the UI thread and update the app that the user is interacting with.

You can also validate performance improvements by scaling up a large cluster that generates excessive websocket traffic to the point of slowing down the UX and then switching on the "advanced worker" to verify that responsiveness improves with the feature toggled on.

Finally, you should notice both a decrease in the amount of memory consumed by the browser for clusters with a lot of resources and significantly slower memory growth over time. You'll want to test this on a list page with a lot of resources (such as pods, or really anything under workloads).

Also, feel free to contact me on slack and I can send you a link to the recorded support training that Richard and I did recently in which we go into some detail for most of the recent performance features.

@git-ival commented Jun 6, 2023

Thanks for the clarification, Sean! Just had a couple of follow-up questions.

  • When you refer to scaling up a large cluster, are you referring to having a large # of workloads/objects, or a large # of downstream clusters in general?
  • When you refer to memory consumption and performance, are you referring to the local machine running the browser or to the local cluster itself?

@floatingman commented
@Sean-McQ Do you have answers to Iramis's questions?

@ghost (Author) commented Jun 9, 2023

Apologies, I thought I had answered these but I'm not seeing that here.

In this case, for scaling I'm referring to large clusters as those with a lot of resources, or at least those that produce a lot of events, since what we're really addressing here is the web app's ability to keep up with potentially hundreds or thousands of socket messages per second.

For memory consumption, we're specifically referring to memory used by the web browser itself, not the cluster. The only real effect we should see on the cluster is that it shouldn't have to close as many sockets due to a slow client (for which the memory impact would be negligible).

Development

Successfully merging this pull request may close these issues.

Extend Advanced Worker support for management and rancher stores