
feat: introduce various supervisor policy #241

Merged (40 commits) on Jan 16, 2024
Conversation

@nyannyacha (Collaborator) commented Jan 8, 2024:

What kind of change does this PR introduce?

Feature

Description

This PR introduces various supervisor policies.

  1. per_worker
    This policy behaves the same as before this PR: it accumulates CPU time and wall-clock time across incoming requests until the worker hits the specified cap.

  2. per_request
    CPU time is accounted per request, and the wall-clock limit is calculated from the last request. The supervisor still terminates the isolate if any single request reaches the hard CPU time limit. Because the limits reset per request, the isolate can live for longer periods, maximizing the benefits of JIT compilation.

  3. oneshot
    Development use only. This policy forces the supervisor to terminate the isolate immediately once a request is complete. It is most useful in a development environment: because the isolate lives only until the request is done, developers get quick feedback on each source code change.

A policy can be selected through a CLI argument, and some policies support additional arguments.

Additional policy arguments

  1. max-parallelism is the maximum number of workers that can exist in the worker pool simultaneously, counted per service path. (Applies to all policies.)

  2. request-wait-timeout is the maximum time, in milliseconds, that a request may wait to establish a connection with a worker. It is necessary because the per_request supervisor will not route a request to a worker that is still handling the previous one. (Applies only to the per_request and oneshot policies; see the example command below.)
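For illustration, a start command combining these options might look like the line below. The flag spellings are assumptions derived from the argument names above, so the actual CLI flags may differ:

edge-runtime start --main-service ./examples/main --policy per_request --max-parallelism 8 --request-wait-timeout 10000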

Considerations of the per_request policy

As described earlier, under the per_request policy two or more requests can't coexist in the same isolate at the same time.

Thus, depending on the case, if the handler does not serve requests fast enough, this may lead to increased latency or to massive numbers of extra isolates being created.

So it's necessary to limit the active user worker count by specifying an appropriate max-parallelism on the CLI to prevent excessive memory consumption.

These characteristics show that per_request is not a complete replacement for the per_worker policy; there is clearly a trade-off between the two.

[Figure: flow chart of how the per_request policy works]1

Notes on major behavior changes

Pooler will strictly limit the number of active isolates

This PR changes the behavior of the pooler to minimize the memory consumed by creating isolates.

Previously, the pooler had no explicit limit on the active isolate count. This led to unbalanced memory usage across service paths, and allowing unlimited isolates could end in OOM, depending on the case.

This PR makes the pooler limit the active isolate count per service path using a semaphore.

The active isolate count can be adjusted by specifying --max-parallelism as a CLI argument. It defaults to the host CPU (or core) count if the user does not specify it.

Additionally, under the per_worker policy, active workers entering the retired state shouldn't affect the active isolate count. That means even if all active workers retire at once, this MUST not affect request throughput or leave pending requests orphaned for a long time.
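The pooler itself is implemented in Rust on top of a tokio semaphore. Purely as a conceptual sketch, the per-service-path permit logic resembles the TypeScript below; the names (Semaphore, withIsolatePermit) are invented for illustration and do not appear in the codebase:

// Conceptual sketch only; the real pooler uses a tokio semaphore in Rust.
class Semaphore {
	private waiters: Array<() => void> = [];
	constructor(private permits: number) {}

	async acquire(): Promise<void> {
		if (this.permits > 0) {
			this.permits--;
			return;
		}
		// No permit available: park until release() hands one over.
		await new Promise<void>((resolve) => this.waiters.push(resolve));
	}

	release(): void {
		const next = this.waiters.shift();
		if (next) next(); // pass the permit directly to the next waiter
		else this.permits++;
	}
}

// One semaphore per service path, sized by --max-parallelism.
const maxParallelism = 8;
const perPath = new Map<string, Semaphore>();

async function withIsolatePermit<T>(servicePath: string, run: () => Promise<T>): Promise<T> {
	let sem = perPath.get(servicePath);
	if (!sem) {
		sem = new Semaphore(maxParallelism);
		perPath.set(servicePath, sem);
	}
	await sem.acquire();
	try {
		return await run(); // create and drive an isolate while holding a permit
	} finally {
		sem.release(); // frees the slot once the isolate is dropped
	}
}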

Each supervisor task is no longer assigned its own native thread

Creating native threads is expensive, and since isolates can be created and removed in large numbers, it is better to spawn each supervisor task as a green thread managed by the tokio runtime to minimize resource consumption.

per_worker behavior change

Once workers reach the retired state they can't accept subsequent requests, but they previously lived on until the wall-clock timer expired, which could consume significant resources in some cases.

This PR mitigates this problem by terminating retired workers early: the supervisor gains a routine that tracks the lifetime of each request.

Synchronized connection lifetimes for unix stream pairs

Before this PR, each unix stream pair used to process a request had a different connection lifetime. This lifetime mismatch led to unexpected EOFs, and sometimes to internal server errors or timeouts.

This PR fixes the problem by introducing a routine that preserves the lifetime of all unix stream pairs used in a request until the main thread has completely flushed the response to the requester. This results in some regression in request throughput for the per_worker policy.

CPU timer can optionally be turned off

This PR lets the user turn off the CPU time limit by specifying zero for each CPU timer parameter.

...
const cpuTimeSoftLimitMs = 0;
const cpuTimeHardLimitMs = 0;

return await EdgeRuntime.userWorkers.create({
    ...
    cpuTimeSoftLimitMs,
    cpuTimeHardLimitMs,
    ...
});
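As a fuller sketch, a create call with both timers disabled might look like the following. The option names other than the two CPU limits (servicePath, memoryLimitMb, workerTimeoutMs) are assumptions modeled on typical main-service examples and may differ:

// Sketch: spawn a user worker with the CPU time limits turned off.
// servicePath, memoryLimitMb, and workerTimeoutMs are assumed option names.
const worker = await EdgeRuntime.userWorkers.create({
	servicePath: 'examples/hello-world',
	memoryLimitMb: 150,
	workerTimeoutMs: 5 * 60 * 1000,
	cpuTimeSoftLimitMs: 0, // zero disables the soft CPU limit
	cpuTimeHardLimitMs: 0, // zero disables the hard CPU limit
});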

Added an applyConnectionWatcher function to the EdgeRuntime namespace

It was introduced to address the unix stream pair connection lifetime synchronization issues described above.

It keeps track of exactly which response object's connection lifetime is associated with a cloned request object, and it should be called after the Request.clone() method is used.

If you do not pass a cloned request object to this function, its connection lifetime is not guaranteed, which can lead to the previous behavior of intermittent EOFs, internal server errors, or timeouts. (For backward compatibility, Worker.fetch emits console warnings instead of hard errors for a cloned request object that was not passed to this function.)

Note: This is only needed when the user must clone the request. If the request does not need to be cloned, invoking this function is unnecessary, because the original request passed as an argument to the serve function is already assigned a reference to the connection watcher.

Note: This is only relevant in the main worker context, not the user worker context.

Object.defineProperty(globalThis, 'EdgeRuntime', {
	get() {
		return {
			userWorkers: SUPABASE_USER_WORKERS,
			// 👇
			applyConnectionWatcher: (src, dest) => {
				applyWatcherRid(src, dest);
			}
		};
	},
	configurable: true,
});

// Copies the connection watcher reference of `srcRequest` to `destRequest`.
// Throws `TypeError` if either argument is not a `Request`.
EdgeRuntime.applyConnectionWatcher(srcRequest: Request, destRequest: Request): void;

Usage

// NOTE: This covers the case where the user NEEDS to clone the request.
// If you don't need to clone the request, you don't have to call it.
Deno.serve(async (req: Request) => {
	// clone the request
	const req2 = new Request(req); // 👈 or `req.clone()`

	// do something...

	const worker = await createWorker();
	const controller = new AbortController();

	// ...

	EdgeRuntime.applyConnectionWatcher(req, req2); // 👈
	return await worker.fetch(req2, { signal: controller.signal });
});

Benchmark2

Running environment

Hardware: Macbook Air M1 2020, 16GB
Container: Docker/Colima(0.6.7, vz, no rosetta 2)
Guest OS: Debian/Bookworm
ARCH: aarch64/arm64
VCPU: 8
RAM: 8GiB (16GiB Swap)

example.json

{
    "name": "meow"
}

examples/hello-world/index.ts

Deno.serve((req: Request) => {
	const result = mySlowFunction(2);
	const data = {
		message: `Hello33 ${result}`,
		test: 'foo',
	};

	return Response.json(data);
});

function mySlowFunction(baseNumber: number) {
	// console.time('mySlowFunction');
	let result = 0;
	for (let i = Math.pow(baseNumber, 7); i >= 0; i--) {
		result += Math.atan(i) * Math.tan(i);
	}
	// console.timeEnd('mySlowFunction');
	return result;
}

v1.30.0 vs PR-241 / per_worker / Normal

Limits

const workerTimeoutMs = 20 * 1000;

// XXX(Nyannyacha): What CPU time limit is used in production on Supabase?
const cpuTimeSoftLimitMs = 1000 * 100;
const cpuTimeHardLimitMs = 1000 * 200;

command

hey -c 20 -z 10s -m POST -D ./example.json http://127.0.0.1:8888/hello-world

v1.30.0 (VCPU: 8, Max parallelism: unspecified)

Summary:
  Total:	10.0026 secs
  Slowest:	0.2269 secs
  Fastest:	0.0011 secs
  Average:	0.0044 secs
  Requests/sec:	4579.3274


Response time histogram:
  0.001 [1]	|
  0.024 [45723]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.046 [41]	|
  0.069 [12]	|
  0.091 [8]	|
  0.114 [0]	|
  0.137 [0]	|
  0.159 [0]	|
  0.182 [9]	|
  0.204 [1]	|
  0.227 [10]	|


Latency distribution:
  10% in 0.0026 secs
  25% in 0.0031 secs
  50% in 0.0038 secs
  75% in 0.0052 secs
  90% in 0.0064 secs
  95% in 0.0071 secs
  99% in 0.0096 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0000 secs, 0.0011 secs, 0.2269 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0052 secs
  resp wait:	0.0043 secs, 0.0011 secs, 0.2236 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0135 secs

Status code distribution:
  [200]	45777 responses
  [500]	28 responses

PR-241 (VCPU: 8, Max parallelism: 8 by default)

Summary:
  Total:	10.0047 secs
  Slowest:	0.1230 secs
  Fastest:	0.0014 secs
  Average:	0.0046 secs
  Requests/sec:	4361.3648


Response time histogram:
  0.001 [1]	|
  0.014 [43483]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.026 [102]	|
  0.038 [8]	|
  0.050 [0]	|
  0.062 [1]	|
  0.074 [2]	|
  0.087 [17]	|
  0.099 [0]	|
  0.111 [2]	|
  0.123 [18]	|


Latency distribution:
  10% in 0.0029 secs
  25% in 0.0033 secs
  50% in 0.0040 secs
  75% in 0.0055 secs
  90% in 0.0068 secs
  95% in 0.0075 secs
  99% in 0.0097 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0000 secs, 0.0014 secs, 0.1230 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0088 secs
  resp wait:	0.0045 secs, 0.0012 secs, 0.1201 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0134 secs

Status code distribution:
  [200]	43634 responses

v1.30.0 vs PR-241 / per_worker / Chunked Request (Transfer-Encoding: chunked)

Limits

const workerTimeoutMs = 20 * 1000;
const cpuTimeSoftLimitMs = 2000;
const cpuTimeHardLimitMs = 3000;

payload.bin

POST http://127.0.0.1:8888/hello-world
Content-Type: application/json
@./example.json

command

 cat payload.bin | vegeta attack -rate=1000 -duration=13s -chunked

v1.30.0

Requests      [total, rate, throughput]         13000, 1000.08, 991.46
Duration      [total, attack, wait]             13s, 12.999s, 1.136ms
Latencies     [min, mean, 50, 90, 95, 99, max]  911.917µs, 3.521ms, 1.182ms, 1.733ms, 4.452ms, 80.947ms, 253.677ms
Bytes In      [total, mean]                     690527, 53.12
Bytes Out     [total, mean]                     299000, 23.00
Success       [ratio]                           99.15%
Status Codes  [code:count]                      200:12889  500:111
Error Set:
500 Internal Server Error

PR-241

Requests      [total, rate, throughput]         13000, 1000.08, 999.99
Duration      [total, attack, wait]             13s, 12.999s, 1.2ms
Latencies     [min, mean, 50, 90, 95, 99, max]  949.625µs, 2.6ms, 1.258ms, 1.663ms, 3.17ms, 43.234ms, 158.649ms
Bytes In      [total, mean]                     689000, 53.00
Bytes Out     [total, mean]                     299000, 23.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:13000
Error Set:

Resolves supabase/supabase#19815
Resolves supabase/cli#247
Resolves #197
Resolves #198
Resolves #192
Resolves #103

Footnotes

  1. The figure is a brief flow chart of how the per_request policy works.

  2. hey was run on the same host, so the requests/sec figures may not be accurate.

@nyannyacha (Collaborator Author):

error: use of a fallible conversion when an infallible one could be used
   --> crates/node/ops/crypto/mod.rs:413:65
    |
413 |                     .verify_prehash(digest, &signature.to_vec().try_into()?)
    |                                                                 ^^^^^^^^ help: use: `into`
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#unnecessary_fallible_conversions
    = note: `-D clippy::unnecessary-fallible-conversions` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(clippy::unnecessary_fallible_conversions)]`

🧐

@nyannyacha (Collaborator Author):

running 2 tests
test test_file_upload ... ok
error: test failed, to rerun pass `-p base --test oak_user_worker_tests`

Caused by:
  process didn't exit successfully: `/home/runner/work/edge-runtime/edge-runtime/target/debug/deps/oak_user_worker_tests-f0e7ee40c14d05fb` (signal: 14, SIGALRM: alarm clock)
Error: Process completed with exit code 101.

🙄

@nyannyacha (Collaborator Author):

Is the cargo clippy check running on 1.75.0 or later?
Looking at the lint description, it was introduced in 1.75.0.
🤔

@nyannyacha (Collaborator Author):

...
test deno_runtime::test::test_read_file_user_rt ... ok

failures:

error: test failed, to rerun pass `-p base --lib`
---- deno_runtime::test::test_create_eszip_from_graph stdout ----
thread 'deno_runtime::test::test_create_eszip_from_graph' panicked at crates/base/src/deno_runtime.rs:497:14:
called `Result::unwrap()` on an `Err` value: Reading /home/runner/.cache/deno/npm/registry.npmjs.org/is-even/1.0.0/.deno_sync_lock

Caused by:
    No such file or directory (os error 2)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    deno_runtime::test::test_create_eszip_from_graph

test result: FAILED. 10 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 1.53s

🧐

@nyannyacha (Collaborator Author):

🫠

@nyannyacha marked this pull request as ready for review January 8, 2024 06:24
@nyannyacha (Collaborator Author) commented Jan 10, 2024:

Note: This PR does not include the memory leak fixes. (That change was split out into PR-240.)

…termination of worker

(cherry picked from commit 84fafd1)
When the user sets the hard or soft CPU time to zero, the supervisor
will not create the signal pair for the CPU alarm.

(cherry picked from commit c4a87c2)
When massive numbers of requests arrive simultaneously, the pool eventually
creates extra workers, because each worker needs initialization time. That
means the pool must manage multiple workers per service path.

Until now, extra workers created in this scenario were simply abandoned.
This commit introduces a simple round-robin data structure to fix this
problem.

(cherry picked from commit 7167813)
(cherry picked from commit 7e643d3)
(cherry picked from commit ae5401e)
Actually... The timeout is no longer necessary since we have introduced a line
that invokes `waker.wake()` directly from the supervisor routine ;)

(cherry picked from commit c6fbf7c)
…ifetime with an outbound request

Currently, unix stream pairs are used to communicate between the main thread,
the main worker, and user workers.

But each stream pair is independent of the others, so there is no guarantee
that the pairs stay synchronized.

If the streams are not synchronized, each stream will have a different
lifetime, which in some cases causes unexpected connection breakage.

On the surface this problem appeared as internal server errors or timeouts,
which made it very hard to notice.

It might be a substantial fix for the pull requests and issues listed below.

supabase#50
supabase#56
supabase#69
supabase#75
supabase#198

🫠

(cherry picked from commit 2387c9e)
…hen entering the job

(cherry picked from commit 6e00e48)
(cherry picked from commit ae430ca)
(cherry picked from commit 88f2d74)
This policy forces the supervisor to terminate the isolate immediately once the
request is complete.

Using this policy during development makes sense because the isolate is
terminated as soon as the request completes, so developers will not have to
restart the runtime.

This commit resolves cases such as supabase#192 and supabase#212.

(cherry picked from commit 0b1ddd0)
@andreespirela (Collaborator) left a comment:

Awesome PR

@nyannyacha (Collaborator Author):

test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.28s

     Running tests/oak_user_worker_tests.rs (target/debug/deps/oak_user_worker_tests-f0e7ee40c14d05fb)

running 2 tests
test test_oak_server ... ok
error: test failed, to rerun pass `-p base --test oak_user_worker_tests`

Caused by:
  process didn't exit successfully: `/home/runner/work/edge-runtime/edge-runtime/target/debug/deps/oak_user_worker_tests-f0e7ee40c14d05fb` (signal: 14, SIGALRM: alarm clock)
Error: Process completed with exit code 101.

🙄

@nyannyacha (Collaborator Author) commented Jan 10, 2024:

🤔
I changed the tests to turn off the CPU timer-related bits, so I have no idea why SIGALRM was triggered.

@nyannyacha (Collaborator Author) commented Jan 11, 2024:

Characteristics (per_worker vs per_request) / 3 min

Running environment

Hardware: Macbook Air M1 2020, 16GB
Container: Docker/Colima(0.6.7, vz, no rosetta 2)
Guest OS: Debian/Bookworm
ARCH: aarch64/arm64
VCPU: 8
RAM: 8GiB (16GiB Swap)

Limits

const workerTimeoutMs = 20 * 1000;
const cpuTimeSoftLimitMs = 500;
const cpuTimeHardLimitMs = 600;

Command

vegeta attack -rate=0 -duration=3m -max-workers=12 [-chunked]

[graph: per_worker vs per_request over 3 min]

per_worker

Requests      [total, rate, throughput]         708556, 3936.39, 3936.36
Duration      [total, attack, wait]             3m0s, 3m0s, 1.341ms
Latencies     [min, mean, 50, 90, 95, 99, max]  911.042µs, 2.956ms, 2.566ms, 4.816ms, 5.462ms, 7.062ms, 144.479ms
Bytes In      [total, mean]                     37553468, 53.00
Bytes Out     [total, mean]                     16296788, 23.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:708556
Error Set:

per_request

Requests      [total, rate, throughput]         715516, 3975.08, 3975.04
Duration      [total, attack, wait]             3m0s, 3m0s, 1.792ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.003ms, 2.958ms, 2.498ms, 5.002ms, 5.627ms, 7.152ms, 131.394ms
Bytes In      [total, mean]                     37922348, 53.00
Bytes Out     [total, mean]                     16456868, 23.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:715516

per_worker / chunked

Requests      [total, rate, throughput]         661733, 3676.28, 3676.24
Duration      [total, attack, wait]             3m0s, 3m0s, 1.953ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.088ms, 3.22ms, 2.8ms, 5.071ms, 5.7ms, 7.533ms, 120.342ms
Bytes In      [total, mean]                     35071849, 53.00
Bytes Out     [total, mean]                     15219859, 23.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:661733
Error Set:

per_request / chunked

Requests      [total, rate, throughput]         688821, 3826.77, 3826.74
Duration      [total, attack, wait]             3m0s, 3m0s, 1.694ms
Latencies     [min, mean, 50, 90, 95, 99, max]  999.458µs, 3.109ms, 2.65ms, 4.974ms, 5.672ms, 7.618ms, 110.429ms
Bytes In      [total, mean]                     36507513, 53.00
Bytes Out     [total, mean]                     15842883, 23.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:688821
Error Set:

@andreespirela (Collaborator):

So I tested everything, and it seems like it's all good to go. But before I approve, it would be good if we could get some tests that touch on the different supervisors, the limits, and so on. Like an integration test.

cc @nyannyacha

Thanks for the work!

@nyannyacha (Collaborator Author):

@andreespirela
Ah, that's funny 🤣 I was just exploring that topic at the right time.

As a continuation of that exploration, in another branch I'm refactoring the native thread spawning parts used to run the runtime event loop so they can be used in a green thread pool. 😁

@nyannyacha (Collaborator Author):

When you say integration testing, are you referring to integration testing in the context of Rust? If so, can you be more specific about what you need?

I can't think of anything right now 😅

@andreespirela (Collaborator):

@nyannyacha Yes, we have a couple of integration tests you can take a look at, for example oak_user_worker_tests.rs, which test some of the call behaviors. Here we want to test the behavior of the supervisors: just making sure the isolates get killed, and everything that goes on when the supervisor signals take place.

@nyannyacha (Collaborator Author):

@andreespirela
Hi! 😋

Currently, I'm refactoring the integration tests in the green thread PR to be compatible with the new supervisor policies.

I've added some flow control routines to the integration tests to reproduce the exact behaviors of these policies.

A little later, I'll submit a few commits with the integration test changes in the green thread PR.
Could you review them and leave some comments?

@andreespirela (Collaborator):

Yep, will review! @nyannyacha Thanks

@nyannyacha (Collaborator Author):

Good! I'll submit changes after testing on the local machine!

I'd be happy if you left comments on the integration test changes, to help determine the direction of any integration tests that may be written in the future.

@laktek (Contributor) left a comment:

Thanks for this excellent contribution.

@laktek merged commit a6bd18a into supabase:main on Jan 16, 2024 (3 checks passed)

🎉 This PR is included in version 1.31.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@nyannyacha deleted the feat-policy branch January 16, 2024 23:13

@spence commented Jan 18, 2024:

awesome @nyannyacha! can you help migrate to the latest deno_core 😅

@nyannyacha (Collaborator Author):

Hi! @spence 😋

Well... I'm not a team member of Supabase, so I think any issues with core dependencies are probably out of my hands.

@nyannyacha (Collaborator Author) commented Jan 18, 2024:

I guess I could create a PR if they decide to; I don't know right now.

@andreespirela (Collaborator):

@nyannyacha It's okay! I'm working on migrating to the newest Deno. I'll keep you posted; as far as this PR goes, there shouldn't be any big issue.
