feat: introduce various supervisor policy #241
Conversation
Is the cargo clippy check running on 1.75.0 or above?
Note: This PR does not include the memory leak fixes. (That change was split out into PR-240.)
…isolate correctly (cherry picked from commit 2108977)
…termination of worker (cherry picked from commit 84fafd1)
When the user sets the hard or soft CPU time to zero, the supervisor will not create the signal pair for the CPU alarm. (cherry picked from commit c4a87c2)
(cherry picked from commit af8287f)
When massive numbers of requests arrive simultaneously, the pool eventually creates extra workers because each worker needs initialization time. That means the pool must manage multiple workers per service path. Until now, extra workers created in this scenario were simply abandoned. This commit introduces a simple round-robin data structure to fix this problem. (cherry picked from commit 7167813)
(cherry picked from commit 7e643d3)
(cherry picked from commit 9d646d4)
(cherry picked from commit 4717889)
(cherry picked from commit ae5401e)
Actually... The timeout is no longer necessary since we have introduced a line that invokes `waker.wake()` directly from the supervisor routine ;) (cherry picked from commit c6fbf7c)
…sion strategy (cherry picked from commit 4796463)
(cherry picked from commit a29b7fc)
…ifetime with an outbound request Currently, a unix stream pair is used to communicate between the main thread, the main worker, and user workers. But each stream pair is independent of the others, so there is no guarantee that the stream pairs stay synchronized. If the streams are not synchronized, each stream has a different lifetime, so depending on the case, this causes unexpected connection breakage. On the surface this problem showed up as internal server errors or timeouts, which made it too hard to notice. It might be a substantial solution for the Pull Requests and Issues described below. supabase#50 supabase#56 supabase#69 supabase#75 supabase#198 🫠 (cherry picked from commit 2387c9e)
…hen entering the job (cherry picked from commit 6e00e48)
(cherry picked from commit c5b4013)
(cherry picked from commit b1334ef)
(cherry picked from commit 18470f0)
(cherry picked from commit ae430ca)
(cherry picked from commit d10bc64)
(cherry picked from commit 88f2d74)
(cherry picked from commit 47581d4)
This policy forces the supervisor to terminate the isolate immediately once the request is complete. Using this policy in development makes sense because the isolate is torn down as soon as the request finishes, so developers do not have to restart the runtime. This commit addresses cases such as supabase#192 and supabase#212 (cherry picked from commit 0b1ddd0)
(cherry picked from commit 6ff4802)
Awesome PR
So I tested everything, and it seems like it's all good to go. But before I approve, it would be good if we could get some tests that touch on the different supervisors, the limits, and so on. Like an integration test. cc @nyannyacha Thanks for the work!
@andreespirela As a continuation of that exploration, in the other branch I'm refactoring the native thread spawning parts that are used to run the runtime event loop, since that can be reused in the green thread pool. 😁
When you say integration testing, are you referring to integration testing in the context of Rust? If so, can you be more specific about what you need? I can't think of anything right now 😅
@nyannyacha Yes, so we have a couple of integration tests you can take a look at, for example,
@andreespirela Currently, I'm refactoring the integration tests to be compatible with the new supervisor policies in the green thread PR. I've added some flow control routines to the integration tests to reproduce the exact behaviors of these policies. A little later, I'll submit a few commits with the integration test changes in the green thread PR.
Yep, will review! @nyannyacha Thanks
Good! I'll submit the changes after testing on my local machine! I'd be happy if you could leave comments on the integration test changes, to better determine the direction of any integration tests that may be written in the future.
Thanks for this excellent contribution.
🎉 This PR is included in version 1.31.0 🎉 The release is available on GitHub release. Your semantic-release bot 📦🚀
awesome @nyannyacha! can you help migrate to the latest
Hi! @spence 😋 Well... I'm not a team member of Supabase, so I think any issues with core dependencies are probably out of my hands.
I guess I could create a PR if they decide to; I don't know right now.
@nyannyacha it's okay! I'm working on migrating to the newest Deno. Will keep you posted. As far as this PR goes, there shouldn't be any big issue.
What kind of change does this PR introduce?
Feature
Description
This PR introduces various supervisor policies.
`per_worker`
The behavior of this policy is the same as before this PR: it accumulates CPU time and wall-clock time across incoming requests until hitting the specified cap.
`per_request`
CPU time is calculated per request, and the wall-clock time limit is calculated based on the last request. However, the supervisor still terminates the isolate if any single incoming request reaches the hard CPU time limit. This policy allows the isolate to exist for longer periods, so it can maximize the benefits of JIT compilation.
`oneshot`
Development purpose only. This policy forces the supervisor to terminate the isolate immediately once the request is complete. It is most useful in a development environment: because the isolate lives only until a request is done, developers easily get feedback on every source code change.
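Conceptually, the three policies can be summarized as follows (a Rust sketch; the type and field names are illustrative, not the PR's actual code):

```rust
// Illustrative only; the real type and field names may differ from the PR.
enum SupervisorPolicy {
    /// Accumulates CPU and wall-clock time across requests until the cap.
    PerWorker,
    /// Budgets CPU time per request; the wall-clock limit is based on the
    /// last request. A hard CPU limit hit by any single request still
    /// terminates the isolate.
    PerRequest { request_wait_timeout_ms: u64 },
    /// Development only: terminate the isolate after each completed request.
    Oneshot,
}
```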
These policies can be specified through a CLI argument, and in some cases you can specify additional arguments that the selected policy supports.
Additional policy arguments
`max-parallelism` is the maximum number of workers that can exist in the worker pool simultaneously. It is counted per service path. (It applies to all policies.)
`request-wait-timeout` is the maximum time, in milliseconds, to wait to establish a connection with a worker. It is necessary because the `per_request` supervisor will not accept a request while the worker is still working on the previous one. (It applies only to the `per_request` and `oneshot` policies.)
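A minimal tokio sketch of how `request-wait-timeout` could gate a busy worker (the mutex stand-in and names are assumptions, not the PR's actual code):

```rust
use std::time::Duration;
use tokio::{sync::Mutex, time::timeout};

// `worker` stands in for a per-worker exclusive slot: under `per_request`,
// a new request may not enter while the previous one still holds the slot.
async fn acquire_worker(
    worker: &Mutex<()>,
    wait: Duration, // request-wait-timeout (milliseconds on the CLI)
) -> Result<tokio::sync::MutexGuard<'_, ()>, &'static str> {
    timeout(wait, worker.lock())
        .await
        .map_err(|_| "request-wait-timeout elapsed: worker still busy")
}
```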
Considerations of the `per_request` policy
As described earlier, under the `per_request` policy two or more requests can't coexist in the isolate at the same time. Thus, depending on the case, if the handler does not process requests fast enough, this can increase latency or spawn massive numbers of extra isolates. So it's necessary to limit the active user worker count by specifying an appropriate `max-parallelism` on the CLI to prevent excessive memory consumption. These characteristics show that it's not a complete alternative to the `per_worker` policy; there is clearly a trade-off between the two policies.[^1]
Notes on Major behavior changes
Pooler will strictly limit the number of Active Isolates
This PR changes the behavior of the pooler to minimize the memory consumed by isolate creation.
Previously, the pooler had no explicit limit on the active isolate count. This resulted in unbalanced memory usage per service path. Furthermore, allowing unlimited isolate creation could end in OOM, depending on the case.
This PR makes the pooler limit the active isolate count per service path by using a semaphore.
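A minimal sketch of the idea with tokio's `Semaphore` (illustrative; not the PR's actual pooler code):

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// One semaphore per service path, created with `max-parallelism` permits.
async fn create_worker(permits: Arc<Semaphore>) {
    // Waits while `max-parallelism` isolates are already active for this
    // service path; the permit is released when the guard is dropped.
    let _permit = permits.acquire_owned().await.expect("pool closed");
    // ... create and run the isolate while holding the permit ...
}
```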
The active isolate count can be adjusted by specifying `--max-parallelism` in the CLI arguments. It defaults to the host CPU (or core) count if the user does not specify it.
Additionally, under the `per_worker` policy, active workers going to a retired state shouldn't affect the active isolate count. That means even if all active workers go to a retired state at once, it MUST not affect request throughput or leave pending requests orphaned for a long time.
Each supervisor task is no longer assigned its own native thread
Creating native threads is expensive, and considering that isolates can be created and removed in large numbers, it is better to spawn the supervisor tasks as green threads managed by the tokio runtime to minimize resource consumption.
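As a sketch (illustrative; the real supervisor takes more parameters):

```rust
// Instead of std::thread::spawn per supervisor, each supervisor becomes a
// cheap task multiplexed over the tokio runtime's worker threads.
async fn supervise(/* policy, timers, termination channel, ... */) {
    // enforce the CPU / wall-clock budgets, then wake the pool on exit
}

fn spawn_supervisor() -> tokio::task::JoinHandle<()> {
    tokio::spawn(supervise())
}
```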
`per_worker` behavior change
If workers reach the retired state, they can't accept subsequent requests, but they continue to exist until the wall-clock timer times out, which in some cases consumes massive resources.
This PR mitigates this problem by terminating retired workers early: a routine that tracks the lifetime of each request was added to the supervisor.
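Conceptually (a sketch that assumes a channel signaling when a retired worker's last in-flight request finishes):

```rust
use std::time::Duration;
use tokio::{select, sync::mpsc, time::sleep};

async fn supervise_retired(
    wall_clock_budget: Duration,
    mut last_request_done: mpsc::Receiver<()>,
) {
    select! {
        // previously, this was the only way out for a retired worker
        _ = sleep(wall_clock_budget) => { /* wall-clock timeout */ }
        // new: fires once the tracked in-flight requests have drained
        _ = last_request_done.recv() => { /* early termination */ }
    }
    // ... terminate the retired worker's isolate here ...
}
```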
The synchronized connection lifetime of each unix stream pair
Before this PR, each unix stream pair used to process a request had a different connection lifetime. This connection lifetime mismatch led to unexpected EOFs, and sometimes to internal server errors or timeouts.
This PR fixes the problem by introducing a routine that preserves the lifetime of all unix stream pairs used in a request until the main thread has completely flushed the response to the requester. This resulted in some performance regression in request throughput for the `per_worker` policy.
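One way to picture that routine (a sketch; the `ConnToken` type is invented for illustration):

```rust
use std::sync::Arc;

// Invented for illustration: a token shared by every unix stream pair that
// services one request. While any clone is alive, no pair shuts down alone.
struct ConnToken;

fn attach_pair(token: &Arc<ConnToken>) -> Arc<ConnToken> {
    Arc::clone(token) // each stream pair keeps a clone for the request's life
}

async fn flush_response(token: Arc<ConnToken>) {
    // ... the main thread writes the full response to the requester ...
    drop(token); // the last clone dropping ends every pair's lifetime at once
}
```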
CPU timer can be turned off optionally
This PR lets the user turn off the CPU time limit by specifying zero for the corresponding CPU timer parameter.
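In sketch form (mirrors the behavior described above; illustrative):

```rust
// A zero for either CPU time parameter disables the CPU alarm entirely:
// the supervisor never creates the signal pair for it.
fn should_create_cpu_alarm(soft_ms: u64, hard_ms: u64) -> bool {
    soft_ms != 0 && hard_ms != 0
}
```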
Added `applyConnectionWatcher` function to the `EdgeRuntime` namespace
It was introduced to address the synchronization issues with unix stream pair connection lifetimes described above.
It keeps track of the connection lifetime by recording exactly which response object is associated with the cloned request object, and it should be called after the `Request.clone()` method is used.
If you do not pass a cloned request object to this function, the connection lifetime is not guaranteed, which can lead to the previous behavior: intermittent EOFs, internal server errors, or timeouts. (For backward compatibility, `Worker.fetch` emits console warnings instead of hard errors for cloned request objects that were not passed to this function.)
Usage
Benchmark[^2]
Running environment
example.json
examples/hello-world/index.ts
v1.30.0 vs PR-241 / per_worker / Normal
Limits
command
v1.30.0 (VCPU: 8, Max parallelism: unspecified)
PR-241 (VCPU: 8, Max parallelism: 8 by default)
v1.30.0 vs PR-241 / per_worker / Chunked Request (Transfer-Encoding: chunked)
Limits
payload.bin
command
v1.30.0
PR-241
Resolves supabase/supabase#19815
Resolves supabase/cli#247
Resolves #197
Resolves #198
Resolves #192
Resolves #103
Footnotes
[^1]: The figure is a brief flow chart of how the `per_request` policy works.
[^2]: `hey` was run on the same host, so the requests per second may not be accurate.