cluster and rancher stores now use advanced worker #8717
Conversation
@Sean-McQ There are some test failures, can you take a look? I've added the milestone as well; can you create and link the issue too?
- Going from a context with no cluster (cluster manager, with a cluster of `_`) to one with a cluster (explorer) throws the below error in `nextResourceVersion` (actual error): `TypeError: Cannot read properties of undefined (reading 'revision') at Object.<anonymous> (subscribe.js:973:1)`
- There might be an issue with the way stop requests are handled for steve-based stores. At the moment the resource watcher deletes the cached revision and doesn't send one when starting the subscription again. This means we get a `resource.create` for every matching resource (which doesn't scale). We need the same approach as in current master, where we try to subscribe with a best-effort revision (which should normally be ok). The only way around this, given the current code base, is to bring in some of the sync worker-to-UI-thread requests (not store/getters, but the promises-over-thread-boundary stuff). A sketch of the best-effort lookup follows this list.
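To make the "best effort revision" idea concrete, here's a minimal sketch; the `cache` and `store` shapes are assumptions for illustration, not the actual `subscribe.js`/resource watcher code:

```js
// Hypothetical best-effort revision lookup. `cache` and `store` are
// illustrative placeholders, not the dashboard's real objects.
function bestEffortRevision(type, cache, store) {
  // Prefer the revision cached from the previous subscription, if kept
  if (cache && cache.revision) {
    return cache.revision;
  }

  // Otherwise fall back to the highest resourceVersion of any resource
  // of this type already in the store
  const resources = store.getAll(type) || [];
  let max = 0;

  for (const r of resources) {
    const rev = parseInt(r.metadata?.resourceVersion, 10);

    if (Number.isFinite(rev) && rev > max) {
      max = rev;
    }
  }

  // Returning null means "subscribe from now" rather than replaying
  return max > 0 ? String(max) : null;
}
```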
As a rough guide, here's what I think should be happening given the following scenarios (a sketch of the handler appears after the lists):
**Steve socket unsubs everything (`CATTLE_WATCH_TIMEOUT_SECONDS` related)**
- we receive a `resource.stop` per type
- in the `resource.stop` handler, as we're still interested in the type, we call `watch` with no revision supplied
- `watch` handles the missing revision by using best effort to find one (`cache.revision` or the highest resource revision from the store)
**UI subscribes with a revision that's too old**
- in kind of parallel:
  - `resource.error` received
    - set the error state of the type
    - call `resync`
    - `resync` starts afresh (http request & re-sub with new revision)
  - `resource.stop` received
    - as per `resource.stop` above... but as we're in an error state we don't re-`watch`
From testing it looks like we're handling the latter ok, so it's just the former to look at.
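An illustrative-only handler combining the two scenarios above; `state` and `watch` are placeholders for the real subscription machinery, not the actual implementation:

```js
// Sketch of a resource.stop handler under the rules described above.
function onResourceStop(msg, state, watch) {
  const type = msg.resourceType;

  // No longer interested in this type: let the stop stand
  if (!state.isWatching(type)) {
    return;
  }

  // In an error state (e.g. after resource.error TOO_OLD) the resync
  // flow owns recovery, so don't re-watch from here
  if (state.hasError(type)) {
    return;
  }

  // Still interested and healthy: re-subscribe with no revision and let
  // `watch` make a best-effort choice (cache.revision or store max)
  watch({ type });
}
```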
Didn't get too far today. I need to resolve the existing review comments by just understanding your updates. There's one additional comment from this review, containing two points. One concerns an exception in the new batch mutator. The other is to ensure the `resource.stop` --> `watch` process works the same as the non-advanced worker (uses a revision from the store); that means fetching the revision across the worker boundary (or, alternatively, finding a way to add socket message resource revisions to the resource cache). A sketch of the cross-boundary request pattern follows.
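One common way to do "promises over the thread boundary": the UI thread posts a request with an id and awaits the matching reply. The action names and payload shapes here are illustrative assumptions, not the dashboard's actual message protocol:

```js
// UI-thread side of a request/response bridge to a web worker.
let nextId = 0;
const pending = new Map();

function requestFromWorker(worker, action, payload) {
  return new Promise((resolve, reject) => {
    const id = ++nextId;

    pending.set(id, { resolve, reject });
    worker.postMessage({ id, action, payload });
  });
}

// Wire this up once per worker: route replies back to their promise
function handleWorkerReply({ data }) {
  const req = pending.get(data.id);

  if (req) {
    pending.delete(data.id);
    data.error ? req.reject(data.error) : req.resolve(data.result);
  }
}

// e.g. worker.onmessage = handleWorkerReply;
//      const revision = await requestFromWorker(worker, 'getRevision', { type: 'pod' });
```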
Remove dependency: @nuxtjs/proxy; remove dependency: @nuxtjs/style-resources

Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
* Added placeholder in url input
* Added format checker for Url input
* Update test
* Fixed lint
* Fixed breaking for other providers
* Fixed lint

* Update vue options to preserve whitespace
* Bug fix with tidy-up

* Update extensions workflow to move assets into gh-pages branch; fix gh-pages check
* Move bundle execution - add default basename and ext version
* Fix basename op
* PR changes

* upgrade @vue/cli-xxx
* Update Vue to latest 2.7
* Update eslint-plugin-vue
* Disable new linting rules
* Remove linting issue
* Pin Dom purify library version
* Add resolution to avoid conflicts with packages
* Update yarn/lock after the enforced resolution
* Exclude node 16 types resolution

* Fixed extra space bug in the generic cluster page
* Fixed space issue in banner
* Revert changes
* Removed extra div

* Fixed selected rows counter
* Fixed lint
* Fixed counter in selection.js
* Small fix in toAdd condition
* Lints
* Fixed condition for selected row counter
* Changes in focusAdjacent function
* Fixed lints

This adds the following functionality to the violations list on the OPA Gatekeeper constraint detail page:
* Add a namespace column to the violations
* Make the violations list searchable
* Allow downloading the violations as a CSV, similar to CIS scanner violations

Signed-off-by: Bastian Hofmann <mail@bastianhofmann.de>

* Removed hasClass method
* Clean code
- supported store types are now a const
- WORKERMODES --> WORKER_MODES
- Ensure advanced worker runs in cluster manager with no current cluster context
- Ensure `resource.stop` from advanced worker successfully kicks off watch

Also:
- make rancher, management and cluster store names a const
…nced_worker_for_cluster_and_rancher_stores
- Persist watcher error state after resource.stop
  - this will ensure the resource.stop given alongside the resource.error TOO_OLD is ignored
- Ensure any errors are cleared when we successfully connect (given the above)
  - should fix the next resource.stop being ignored after recovering from TOO_OLD
- Fix resync watch params
  - these weren't correct, so the resync after TOO_OLD never worked
  - the format could be slimmer, but I think it matches how other sockets' data works

Note - I manufactured the TOO_OLD by setting a revision of 1 in subscribe resource.stop if it came from the advanced worker
I've fixed some bugs:
- Ensured advanced worker is used in cluster manager with no current cluster context
- Ensured `resource.stop` from advanced worker successfully kicks off watch
  - In current master socket state is tracked in the worker thread, but here we're checking state in the UI thread, so we need to work around it
- Persisted the error state instead of nuking all state when we receive a resource.stop
  - This will ensure the resource.stop given alongside the resource.error TOO_OLD is ignored
- Fixed the resource.error TOO_OLD --> resync params
  - These weren't correct, so the resync after TOO_OLD never worked

With the above fixes, and a tweak to force TOO_OLD (setting a revision of 1 in subscribe resource.stop if it came from the advanced worker), I've seen a successful set of events (a rough sketch of the recovery handler follows the list):
- resource.stop from BE
- resource.start (bad revision) to BE
- resource.stop & resource.error TOO_OLD from BE
- http request to BE
- resource.start (new revision) to BE
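A loose sketch of that recovery sequence; the helper names (`state`, `httpRefetch`, `watch`) and message shape are placeholders, not the actual subscribe.js implementation:

```js
// Hypothetical handler for resource.error TOO_OLD, following the
// event sequence above.
async function onResourceErrorTooOld(msg, state, watch, httpRefetch) {
  const type = msg.resourceType;

  // Mark the type errored so the accompanying resource.stop is ignored
  // (the "persist error state" fix above)
  state.setError(type, 'TOO_OLD');

  // Resync: http request to fetch a fresh copy of the list and a
  // current revision...
  const { revision } = await httpRefetch(type);

  // ...then clear the error and re-subscribe (resource.start) with it
  state.clearError(type);
  watch({ type, revision });
}
```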
I had to merge instead of rebase, so the history is crazy. When merging please squash & merge
@Sean-McQ @richard-cox Is there a description of what the advanced worker is meant to do and how to validate for QA? It's unclear what the current vs. expected behavior is.
@git-ival The advanced worker is a JavaScript web worker: essentially a new thread outside the main UI thread that processes CPU-intensive work which would otherwise block UX responsiveness in the browser tab. In the case of the "Advanced Worker" feature, we've offloaded the websockets that would typically be waiting for real-time updates from the cluster, rancher, and management APIs into separate worker threads (one for each socket), so that excessive incoming messages are handled separately and don't slow down or block the user's interaction with the Dashboard app in the browser.

Simple validation would probably just mean confirming that the three websockets in question do in fact open in web workers. You can do this in the Chrome devtools under "Network" by selecting the "WS" (Websockets) tab and verifying that the sockets opened in web workers as opposed to the main UI thread. Apart from that, you'll want to perform some actions in another browser whose changes should appear in the active browser tab under test, to ensure that the messages received via websocket make their way back into the UI thread and update the app the user is interacting with.

You can also validate performance improvements by scaling up a large cluster that generates enough websocket traffic to slow down the UX, then switching on the "advanced worker" to verify that responsiveness improves with the feature toggled on. Finally, you should notice a decrease in the amount of memory consumed by the browser for clusters with a lot of resources, as well as significantly slower memory growth over time. You'll want to test this on a list page with a lot of resources (such as pods, or really anything under workloads).

Also, feel free to contact me on Slack and I can send you a link to the recorded support training that Richard and I did recently, in which we go into some detail on most of the recent performance features.
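For a concrete picture of what "websocket in a web worker" means here, a stripped-down sketch; the url and message handling are placeholders, not the dashboard's actual subscription protocol:

```js
// socket-worker.js -- the socket lives in the worker so parsing
// happens off the UI thread.
let socket;

onmessage = ({ data }) => {
  if (data.action === 'connect') {
    socket = new WebSocket(data.url);

    socket.onmessage = (event) => {
      // The potentially expensive JSON.parse runs in the worker thread
      const msg = JSON.parse(event.data);

      // Hand the already-parsed message back to the UI thread
      postMessage(msg);
    };
  } else if (data.action === 'send' && socket) {
    socket.send(JSON.stringify(data.payload));
  }
};

// UI thread (this is where you'd verify it in devtools, as described):
//   const worker = new Worker('socket-worker.js');
//   worker.postMessage({ action: 'connect', url: 'wss://example/subscribe' });
//   worker.onmessage = ({ data }) => store.commit('ws.message', data);
```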
Thanks for the clarification Sean! Just had a couple of follow-up questions.
@Sean-McQ Do you have answers to Iramis's questions?
Apologies, I thought I had answered these but I'm not seeing that here. In this case, for scaling I'm referring to large clusters as those with a lot of resources, or at least those that produce a lot of events, since what we're really addressing here is the web app's ability to keep up with potentially hundreds or thousands of socket messages per second. For memory consumption, we're specifically referring to memory used by the web browser itself and not by the cluster; the only real effect we should see on the cluster is that it shouldn't have to close as many sockets due to a slow client (for which the memory impact would be negligible).
Still requires a queueing system in place, since some subscribes are fired prior to the advanced worker's creation (a possible shape for that queue is sketched below).
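A minimal sketch of such a queue, assuming a wrapper owns the worker handle; the class and method names are illustrative, not existing dashboard code:

```js
// Buffer anything posted before the worker exists; flush on creation.
class QueuedWorker {
  constructor() {
    this.worker = null;
    this.buffer = [];
  }

  // Safe to call before the real worker has been constructed
  postMessage(msg) {
    if (this.worker) {
      this.worker.postMessage(msg);
    } else {
      this.buffer.push(msg);
    }
  }

  // Call once the advanced worker is actually created
  attach(worker) {
    this.worker = worker;
    this.buffer.forEach((msg) => worker.postMessage(msg));
    this.buffer = [];
  }
}

// e.g. queued.postMessage({ action: 'watch', type: 'pod' }); // buffered
//      queued.attach(new Worker('advanced-worker.js'));      // flushed
```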
Summary
Fixes #8837
Occurred changes and/or fixed issues
Technical notes summary
Areas or cases that should be tested
Areas which could experience regressions
Screenshot/Video