
cluster and rancher stores now use advanced worker #8717

Merged
30 commits merged into rancher:master May 17, 2023

Conversation

@ghost commented Apr 21, 2023

Still requires a queueing system, since some subscribes are fired before the advanced worker is created

Summary

Fixes #8837

Occurred changes and/or fixed issues

Technical notes summary

Areas or cases that should be tested

Areas which could experience regressions

Screenshot/Video

@ghost ghost requested a review from richard-cox April 21, 2023 15:13
@github-actions github-actions bot assigned ghost Apr 21, 2023
@ghost ghost marked this pull request as ready for review April 25, 2023 14:25
@ghost ghost changed the title WIP: cluster and rancher stores now use advanced worker cluster and rancher stores now use advanced worker Apr 25, 2023
@richard-cox (Member) commented

@Sean-McQ There are some test failures, can you take a look? I've added the milestone; can you create and link the issue as well?

@richard-cox richard-cox added this to the v2.7.next2 milestone Apr 27, 2023
@richard-cox (Member) left a comment

  • going from a context with no cluster (cluster manager, with a cluster of _) to one with a cluster (explorer) throws the below error
    (actual error) TypeError: Cannot read properties of undefined (reading 'revision')
      at Object.<anonymous> (subscribe.js:973:1)
    
    (in nextResourceVersion)
  • there might be an issue with the way stop requests are handled for steve based stores. At the moment the resource watcher deletes the cached revision and doesn't send one when starting the subscription again. This means we get a resource.create for every matching resource (which doesn't scale). We need the same approach as in current master, where we try to sub with a best-effort revision (which should normally be ok). The only way around this, given the current code base, is to bring in some of the sync worker's ui-thread requests (not store / getters, but the promises-over-thread-boundary stuff; see the sketch below)
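For context, the "promises over the thread boundary" idea looks roughly like the following (a minimal sketch with hypothetical names, not the actual sync worker code): the worker posts a request tagged with an id, and a pending-promise map resolves it when the UI thread replies.

```js
// Minimal sketch of a promise-based request across the worker/UI thread
// boundary. All names here are hypothetical.
const pendingRequests = new Map();
let nextRequestId = 0;

// Worker side: wrap a postMessage round trip in a promise
function requestLatestRevision(resourceType) {
  return new Promise((resolve) => {
    const requestId = nextRequestId++;

    pendingRequests.set(requestId, resolve);
    self.postMessage({ action: 'getLatestRevision', requestId, resourceType });
  });
}

// Worker side: resolve the matching promise when the UI thread replies
self.addEventListener('message', ({ data }) => {
  if (data.action === 'latestRevisionResponse' && pendingRequests.has(data.requestId)) {
    pendingRequests.get(data.requestId)(data.revision);
    pendingRequests.delete(data.requestId);
  }
});
```

The UI thread's message handler would answer getLatestRevision by reading the store and posting back latestRevisionResponse with the same requestId.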

As a rough guide, here's what I think should be happening given the following scenarios

steve socket unsubs everything (CATTLE_WATCH_TIMEOUT_SECONDS related)

  • we receive resource.stop per type
  • in the resource.stop handler, as we're still interested in the type, we call watch with no revision supplied
  • watch handles the missing revision by using a best effort to find one (cache.revision, or the highest resource revision from the store; see the sketch below)
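As a sketch of that best-effort lookup (hypothetical helper, assuming k8s-style metadata.resourceVersion fields):

```js
// Best-effort revision: prefer the cached revision, otherwise fall back to
// the highest resourceVersion among resources already in the store.
function bestEffortRevision(cache, resources) {
  if (cache?.revision) {
    return cache.revision;
  }

  const revisions = resources
    .map((r) => Number.parseInt(r.metadata?.resourceVersion, 10))
    .filter((n) => !Number.isNaN(n));

  return revisions.length ? `${ Math.max(...revisions) }` : undefined;
}
```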

ui subscribes with a revision that's too old

  • Roughly in parallel
    • resource.error received
      • set error state of type
      • call resync
        • resync starts afresh (http request & re-sub with new revision)
    • resource.stop received
      • as per resource.stop above... but as we're in an error state we don't rewatch

From testing it looks like we're handling the latter ok, so it's just the former to look at. A rough sketch of the intended error/stop handling follows.
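Put as code, the intended handling sketches out roughly like this (hypothetical shape, not the actual subscribe.js handlers; watch and resync stand in for the real actions):

```js
// resource.error handler: mark the type as errored and kick off a resync
function onResourceError(state, msg) {
  if (msg.error?.includes('too old')) {
    state.errors[msg.resourceType] = 'TOO_OLD'; // set error state of type
    resync(msg.resourceType);                   // http request & re-sub with new revision
  }
}

// resource.stop handler: rewatch only when we're not in an error state
function onResourceStop(state, msg) {
  if (state.errors[msg.resourceType]) {
    return; // TOO_OLD etc: the resync owns recovery, so don't rewatch here
  }

  watch(msg.resourceType); // still interested: rewatch with a best-effort revision
}
```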

Resolved review threads:
  • shell/plugins/steve/subscribe.js (several, some outdated)
  • shell/plugins/steve/worker/web-worker.basic.js (outdated)
  • shell/plugins/steve/resourceWatcher.js (outdated)
@richard-cox (Member) left a comment

Didn't get too far today. I need to resolve the existing review comments by just understanding your updates. There's one additional comment from this review, containing two points: one regarding an exception in the new batch mutator; the other is to ensure the resource.stop --> watch process works the same as the non-advanced worker (i.e. uses a revision from the store). That means fetching the revision across the worker boundary (or, alternatively, finding a way to add socket message resource revisions to the resource cache; sketched below).
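The second alternative could look something like this (a sketch, assuming socket messages carry a k8s-style metadata.resourceVersion; names are hypothetical):

```js
// Record the revision from every incoming socket message into the watcher's
// cache, so a later resource.stop --> watch can re-subscribe without having
// to fetch the store's revision across the worker boundary.
function onSocketMessage(watchCache, msg) {
  const revision = msg.data?.metadata?.resourceVersion;

  if (revision) {
    watchCache[msg.resourceType] = { ...watchCache[msg.resourceType], revision };
  }
}
```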

Resolved review thread: shell/plugins/steve/resourceWatcher.js
@richard-cox richard-cox self-requested a review May 9, 2023 17:23
bisht-richa and others added 17 commits May 10, 2023 10:44
Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
Remove dependency: @nuxtjs/proxy
Remove dependency: @nuxtjs/style-resources

Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
* Added placeholder in url input

* Added format checker for Url input

* Update test

* Fixed lint

* Fixed breaking for other providers

* Fixed lint
* Update vue options to preserve whitespace

* Bug fix with tidy-up
* Update extensions workflow to move assets into gh-pages branch

Fix gh-pages check

* Move bundle execution - add default basename and ext version

* Fix basename op

* PR changes
* upgrade @vue/cli-xxx

* Update Vue to latest 2.7

* Update eslint-plugin-vue

* Disable new linting rules

* Remove linting issue

* Pin Dom purify library version

* Add resolution to avoid conflicts with packages

* Update yarn/lock after the enforced resolution

* Exclude node 16 types resolution
* Fixed extra space bug in the generic cluster-page

* Fixed space issue in banner

* Revert changes

* Removed extra div
* Fixed selected rows counter

* Fixed lint

* Fixed counter in selection.js

* Small fix in toAdd condition

* Lints

* Fixed condition for selected row counter

* Changes in focusAdjacent function

* Fixed lints
This adds the following functionality to the violations list on the OPA Gatekeeper constraint detail page:

* Add a namespace column to the violations
* Make the violations list searchable
* Allow downloading the violations as a CSV, similar to CIS scanner violations

Signed-off-by: Bastian Hofmann <mail@bastianhofmann.de>
cnotv and others added 10 commits May 10, 2023 10:44
* Removed hasClass method

* Clean code
- supported store types now a const
- WORKERMODES --> WORKER_MODES
- Ensure advanced worker runs in cluster manager with no current cluster context
- Ensure `resource.stop` from advanced worker successfully kicks off watch


Also
- make rancher, management and cluster store names a const
- Fix some comments (jsdoc --> standard, todos)
- Made the web worker redispatch requirement clearer
- Persist watcher error state after resource.stop
  - this will ensure the resource.stop given alongside the resource.error TOO_OLD is ignored
- Ensure any errors are cleared when we successfully connect (given above)
  - Should fix the next resource.stop being ignored after recovering from TOO_OLD
- Fix resync watch params
  - these weren't correct, so the resync after TOO_OLD never worked
  - the format could be slimmer, but I think it matches how other sockets' data works

Note - I manufactured the TOO_OLD by setting a revision of 1 in subscribe resource.stop if it came from the advanced worker
@richard-cox (Member) left a comment

I've fixed some bugs

  • Ensured the advanced worker is used in cluster manager with no current cluster context
  • Ensured resource.stop from the advanced worker successfully kicks off watch
    • In current master socket state is tracked in the worker thread, but here we're checking state in the ui thread, so we need to work around it
  • Persist the error state instead of nuking all state when we receive a resource.stop
    • This will ensure the resource.stop given alongside the resource.error TOO_OLD is ignored
  • Fix resource.error TOO_OLD --> resync params
    • these weren't correct, so the resync after TOO_OLD never worked

With the above fixes, and a tweak to force TOO_OLD (setting a revision of 1 in subscribe resource.stop if it came from the advanced worker; see the sketch after the list below), I've seen a successful set of events

  1. resource.stop from BE
  2. resource.start (bad revision) to BE
  3. resource.stop & resource.error TOO_OLD from BE
  4. http request to BE
  5. resource.start (new revision) to BE
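For reference, the forced-TOO_OLD tweak amounted to something like this (a temporary test hack in hypothetical shape; fromAdvancedWorker and the watch signature here are illustrative):

```js
// Temporary test tweak (hypothetical shape): when handling a resource.stop
// that came from the advanced worker, deliberately re-subscribe with
// revision 1 so the backend immediately rejects it as "too old".
if (fromAdvancedWorker) {
  watch({ type: msg.resourceType, revision: '1' }); // guaranteed-stale revision
}
```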

[Screenshots: Screenshot_20230512_195038, Screenshot_20230512_195414]

I had to merge instead of rebase, so the history is crazy. When merging please squash & merge

@ghost ghost merged commit e256669 into rancher:master May 17, 2023
@git-ival commented Jun 6, 2023

@Sean-McQ @richard-cox Is there a description of what the advanced worker is meant to do and how to validate for QA? It's unclear what the current vs. expected behavior is.

@ghost (Author) commented Jun 6, 2023

@git-ival The advanced worker is a JavaScript web worker: essentially a new thread, outside of the main UI thread, that processes CPU-intensive work which would otherwise block UX responsiveness in the browser tab.

In the case of the "Advanced Worker" feature, we've offloaded the websockets that would typically be waiting for real-time updates from the cluster, rancher, and management APIs into separate worker threads (one for each socket), so that bursts of incoming messages are handled separately and don't slow down or block the user's interaction with the Dashboard app in the browser. A rough sketch of the pattern is below.
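As a rough illustration of that pattern (hypothetical file and message names, not the actual Dashboard code):

```js
// resource-socket.worker.js: the websocket lives inside the worker, so
// parsing bursts of messages never blocks the UI thread.
const sockets = {};

self.addEventListener('message', ({ data }) => {
  if (data.action === 'connect') {
    const socket = new WebSocket(data.url); // e.g. a cluster/rancher/management subscribe url

    socket.onmessage = (event) => {
      // CPU-heavy parse (and potentially batching) happens off the UI thread
      const msg = JSON.parse(event.data);

      self.postMessage({ action: 'resourceChange', msg });
    };

    sockets[data.url] = socket;
  }
});

// UI thread side:
//   const worker = new Worker('resource-socket.worker.js');
//   worker.postMessage({ action: 'connect', url: subscribeUrl });
//   worker.onmessage = ({ data }) => commitToStore(data.msg);
```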

Simple validation would probably just mean verifying that the three websockets in question do in fact open in web workers. You can do this in Chrome devtools under "Network", selecting the "WS" (websockets) filter, and checking that the sockets opened in web workers as opposed to in the main UI thread.
[Screenshot: Chrome devtools Network/WS view]
In the above screenshot, I can tell the websockets in question are running in web workers by the gear icon next to their "name", and I can tell they are separate worker threads because the filenames under the "Initiator" column are all different (and random).

Apart from that, you'll want to perform some actions in another browser whose changes should appear in the active browser tab under test to ensure that the messages being received via websocket make their way back into the UI thread and update the app that the user is interacting with.

You can also validate performance improvements by scaling up a large cluster that generates excessive websocket traffic to the point of slowing down the UX and then switching on the "advanced worker" to verify that responsiveness improves with the feature toggled on.

Finally, you should notice both a decrease in the amount of memory consumed by the browser for clusters with a lot of resources and significantly slower memory growth over time. You'll want to test this on a list page with a lot of resources (such as pods, or really anything under workloads).

Also, feel free to contact me on slack and I can send you a link to the recorded support training that Richard and I did recently in which we go into some detail for most of the recent performance features.

@git-ival commented Jun 6, 2023

Thanks for the clarification, Sean! Just had a couple of follow-up questions.

  • When you refer to scaling up a large cluster, are you referring to having a large # of workloads/objects, or a large # of downstream clusters in general?
  • When you refer to memory consumption and performance, are you referring to the local machine running the browser or to the local cluster itself?

@floatingman commented
@Sean-McQ Do you have answers to Iramis's questions?

@ghost (Author) commented Jun 9, 2023

Apologies, I thought I had answered these but I'm not seeing that here.

In this case, for scaling I'm referring to large clusters as those with a lot of resources, or at least those that produce a lot of events, since what we're really addressing here is the web app's ability to keep up with potentially hundreds or thousands of socket messages per second.

For memory consumption, we're specifically referring to memory used by the web browser itself, not the cluster. The only real effect we should see on the cluster is that it shouldn't have to close as many sockets due to a slow client (for which the memory impact would be negligible).

Development

Successfully merging this pull request may close these issues.

Extend Advanced Worker support for management and rancher stores