
feat: introduce various supervisor policy #241

Merged (40 commits) on Jan 16, 2024
Conversation

@nyannyacha (Collaborator) commented Jan 8, 2024:

What kind of change does this PR introduce?

Feature

Description

This PR introduces various supervisor policies.

  1. per_worker
    This policy behaves the same as before this PR: it accumulates CPU time and wall-clock time across incoming requests until the worker hits the specified cap.

  2. per_request
    CPU time is accounted per request, and the wall-clock limit is calculated from the last request. The supervisor still terminates the isolate if any single request reaches the hard CPU time limit. Because the limits reset per request, the isolate can live for longer periods, maximizing the benefits of JIT compilation.

  3. oneshot
    Development use only. This policy forces the supervisor to terminate the isolate immediately once a request is complete. It is most useful in a development environment: because the isolate lives only until the request is done, developers get quick feedback on each source code change.

A policy can be selected through a CLI argument, and some policies support additional arguments.

Additional policy arguments

  1. max-parallelism is the maximum number of workers that can exist in the worker pool simultaneously, counted per service path. (Applies to all policies.)

  2. request-wait-timeout is the maximum time, in milliseconds, that a request may wait to establish a connection with a worker. It is necessary because the per_request supervisor will not route a request to a worker that is still handling the previous one. (Applies only to the per_request and oneshot policies; see the example command below.)
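For illustration, a start command combining these options might look like the line below. The flag spellings are assumptions derived from the argument names above, so the actual CLI flags may differ:

edge-runtime start --main-service ./examples/main --policy per_request --max-parallelism 8 --request-wait-timeout 10000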

Considerations of the per_request policy

As described earlier, under the per_request policy two or more requests can't coexist in the same isolate at the same time.

Thus, depending on the case, if the handler does not serve requests fast enough, this may lead to increased latency or to massive numbers of extra isolates being created.

So it's necessary to limit the active user worker count by specifying an appropriate max-parallelism on the CLI to prevent excessive memory consumption.

These characteristics show that per_request is not a complete replacement for the per_worker policy; there is clearly a trade-off between the two.

[Figure: flow chart of how the per_request policy works]1

Notes on major behavior changes

Pooler will strictly limit the number of active isolates

This PR changes the behavior of the pooler to minimize the memory consumed by creating isolates.

Previously, the pooler had no explicit limit on the active isolate count. This led to unbalanced memory usage across service paths, and allowing unlimited isolates could end in OOM, depending on the case.

This PR makes the pooler limit the active isolate count per service path using a semaphore.

The active isolate count can be adjusted by specifying --max-parallelism as a CLI argument. It defaults to the host CPU (or core) count if the user does not specify it.

Additionally, under the per_worker policy, active workers entering the retired state shouldn't affect the active isolate count. That means even if all active workers retire at once, this MUST not affect request throughput or leave pending requests orphaned for a long time.
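The pooler itself is implemented in Rust on top of a tokio semaphore. Purely as a conceptual sketch, the per-service-path permit logic resembles the TypeScript below; the names (Semaphore, withIsolatePermit) are invented for illustration and do not appear in the codebase:

// Conceptual sketch only; the real pooler uses a tokio semaphore in Rust.
class Semaphore {
	private waiters: Array<() => void> = [];
	constructor(private permits: number) {}

	async acquire(): Promise<void> {
		if (this.permits > 0) {
			this.permits--;
			return;
		}
		// No permit available: park until release() hands one over.
		await new Promise<void>((resolve) => this.waiters.push(resolve));
	}

	release(): void {
		const next = this.waiters.shift();
		if (next) next(); // pass the permit directly to the next waiter
		else this.permits++;
	}
}

// One semaphore per service path, sized by --max-parallelism.
const maxParallelism = 8;
const perPath = new Map<string, Semaphore>();

async function withIsolatePermit<T>(servicePath: string, run: () => Promise<T>): Promise<T> {
	let sem = perPath.get(servicePath);
	if (!sem) {
		sem = new Semaphore(maxParallelism);
		perPath.set(servicePath, sem);
	}
	await sem.acquire();
	try {
		return await run(); // create and drive an isolate while holding a permit
	} finally {
		sem.release(); // frees the slot once the isolate is dropped
	}
}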

Each supervisor task is no longer assigned its own native thread

Creating native threads is expensive, and since isolates can be created and removed in large numbers, it is better to spawn each supervisor task as a green thread managed by the tokio runtime to minimize resource consumption.

per_worker behavior change

Once workers reach the retired state they can't accept subsequent requests, but they previously lived on until the wall-clock timer expired, which could consume significant resources in some cases.

This PR mitigates this problem by terminating retired workers early: the supervisor gains a routine that tracks the lifetime of each request.

Synchronized connection lifetimes for unix stream pairs

Before this PR, each unix stream pair used to process a request had a different connection lifetime. This lifetime mismatch led to unexpected EOFs, and sometimes to internal server errors or timeouts.

This PR fixes the problem by introducing a routine that preserves the lifetime of all unix stream pairs used in a request until the main thread has completely flushed the response to the requester. This results in some regression in request throughput for the per_worker policy.

CPU timer can optionally be turned off

This PR lets the user turn off the CPU time limit by specifying zero for each CPU timer parameter.

...
const cpuTimeSoftLimitMs = 0;
const cpuTimeHardLimitMs = 0;

return await EdgeRuntime.userWorkers.create({
    ...
    cpuTimeSoftLimitMs,
    cpuTimeHardLimitMs,
    ...
});
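As a fuller sketch, a create call with both timers disabled might look like the following. The option names other than the two CPU limits (servicePath, memoryLimitMb, workerTimeoutMs) are assumptions modeled on typical main-service examples and may differ:

// Sketch: spawn a user worker with the CPU time limits turned off.
// servicePath, memoryLimitMb, and workerTimeoutMs are assumed option names.
const worker = await EdgeRuntime.userWorkers.create({
	servicePath: 'examples/hello-world',
	memoryLimitMb: 150,
	workerTimeoutMs: 5 * 60 * 1000,
	cpuTimeSoftLimitMs: 0, // zero disables the soft CPU limit
	cpuTimeHardLimitMs: 0, // zero disables the hard CPU limit
});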

Added an applyConnectionWatcher function to the EdgeRuntime namespace

It was introduced to address the unix stream pair connection lifetime synchronization issues described above.

It keeps track of exactly which response object's connection lifetime is associated with a cloned request object, and it should be called after the Request.clone() method is used.

If you do not pass a cloned request object to this function, its connection lifetime is not guaranteed, which can lead to the previous behavior of intermittent EOFs, internal server errors, or timeouts. (For backward compatibility, Worker.fetch emits console warnings instead of hard errors for a cloned request object that was not passed to this function.)

Note: This is only needed when the user must clone the request. If the request does not need to be cloned, invoking this function is unnecessary, because the original request passed as an argument to the serve function is already assigned a reference to the connection watcher.

Note: This is only relevant in the main worker context, not the user worker context.

Object.defineProperty(globalThis, 'EdgeRuntime', {
	get() {
		return {
			userWorkers: SUPABASE_USER_WORKERS,
			// 👇
			applyConnectionWatcher: (src, dest) => {
				applyWatcherRid(src, dest);
			}
		};
	},
	configurable: true,
});

// Copies the connection watcher reference of `srcRequest` to `destRequest`.
// Throws `TypeError` if either argument is not a `Request`.
EdgeRuntime.applyConnectionWatcher(srcRequest: Request, destRequest: Request): void;

Usage

// NOTE: This covers the case where the user NEEDS to clone the request.
// If you don't need to clone the request, you don't have to call it.
Deno.serve(async (req: Request) => {
	// clone the request
	const req2 = new Request(req); // 👈 or `req.clone()`

	// do something...

	const worker = await createWorker();
	const controller = new AbortController();

	// ...

	EdgeRuntime.applyConnectionWatcher(req, req2); // 👈
	return await worker.fetch(req2, { signal: controller.signal });
});

Benchmark2

Running environment

Hardware: Macbook Air M1 2020, 16GB
Container: Docker/Colima(0.6.7, vz, no rosetta 2)
Guest OS: Debian/Bookworm
ARCH: aarch64/arm64
VCPU: 8
RAM: 8GiB (16GiB Swap)

example.json

{
    "name": "meow"
}

examples/hello-world/index.ts

Deno.serve((req: Request) => {
	const result = mySlowFunction(2);
	const data = {
		message: `Hello33 ${result}`,
		test: 'foo',
	};

	return Response.json(data);
});

function mySlowFunction(baseNumber: number) {
	// console.time('mySlowFunction');
	let result = 0;
	for (let i = Math.pow(baseNumber, 7); i >= 0; i--) {
		result += Math.atan(i) * Math.tan(i);
	}
	// console.timeEnd('mySlowFunction');
	return result;
}

v1.30.0 vs PR-241 / per_worker / Normal

Limits

const workerTimeoutMs = 20 * 1000;

// XXX(Nyannyacha): What CPU time limit is used in production on Supabase?
const cpuTimeSoftLimitMs = 1000 * 100;
const cpuTimeHardLimitMs = 1000 * 200;

command

hey -c 20 -z 10s -m POST -D ./example.json http://127.0.0.1:8888/hello-world

v1.30.0 (VCPU: 8, Max parallelism: unspecified)

Summary:
  Total:	10.0026 secs
  Slowest:	0.2269 secs
  Fastest:	0.0011 secs
  Average:	0.0044 secs
  Requests/sec:	4579.3274


Response time histogram:
  0.001 [1]	|
  0.024 [45723]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.046 [41]	|
  0.069 [12]	|
  0.091 [8]	|
  0.114 [0]	|
  0.137 [0]	|
  0.159 [0]	|
  0.182 [9]	|
  0.204 [1]	|
  0.227 [10]	|


Latency distribution:
  10% in 0.0026 secs
  25% in 0.0031 secs
  50% in 0.0038 secs
  75% in 0.0052 secs
  90% in 0.0064 secs
  95% in 0.0071 secs
  99% in 0.0096 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0000 secs, 0.0011 secs, 0.2269 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0052 secs
  resp wait:	0.0043 secs, 0.0011 secs, 0.2236 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0135 secs

Status code distribution:
  [200]	45777 responses
  [500]	28 responses

PR-241 (VCPU: 8, Max parallelism: 8 by default)

Summary:
  Total:	10.0047 secs
  Slowest:	0.1230 secs
  Fastest:	0.0014 secs
  Average:	0.0046 secs
  Requests/sec:	4361.3648


Response time histogram:
  0.001 [1]	|
  0.014 [43483]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.026 [102]	|
  0.038 [8]	|
  0.050 [0]	|
  0.062 [1]	|
  0.074 [2]	|
  0.087 [17]	|
  0.099 [0]	|
  0.111 [2]	|
  0.123 [18]	|


Latency distribution:
  10% in 0.0029 secs
  25% in 0.0033 secs
  50% in 0.0040 secs
  75% in 0.0055 secs
  90% in 0.0068 secs
  95% in 0.0075 secs
  99% in 0.0097 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0000 secs, 0.0014 secs, 0.1230 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0088 secs
  resp wait:	0.0045 secs, 0.0012 secs, 0.1201 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0134 secs

Status code distribution:
  [200]	43634 responses

v1.30.0 vs PR-241 / per_worker / Chunked Request (Transfer-Encoding: chunked)

Limits

const workerTimeoutMs = 20 * 1000;
const cpuTimeSoftLimitMs = 2000;
const cpuTimeHardLimitMs = 3000;

payload.bin

POST http://127.0.0.1:8888/hello-world
Content-Type: application/json
@./example.json

command

 cat payload.bin | vegeta attack -rate=1000 -duration=13s -chunked

v1.30.0

Requests      [total, rate, throughput]         13000, 1000.08, 991.46
Duration      [total, attack, wait]             13s, 12.999s, 1.136ms
Latencies     [min, mean, 50, 90, 95, 99, max]  911.917µs, 3.521ms, 1.182ms, 1.733ms, 4.452ms, 80.947ms, 253.677ms
Bytes In      [total, mean]                     690527, 53.12
Bytes Out     [total, mean]                     299000, 23.00
Success       [ratio]                           99.15%
Status Codes  [code:count]                      200:12889  500:111
Error Set:
500 Internal Server Error

PR-241

Requests      [total, rate, throughput]         13000, 1000.08, 999.99
Duration      [total, attack, wait]             13s, 12.999s, 1.2ms
Latencies     [min, mean, 50, 90, 95, 99, max]  949.625µs, 2.6ms, 1.258ms, 1.663ms, 3.17ms, 43.234ms, 158.649ms
Bytes In      [total, mean]                     689000, 53.00
Bytes Out     [total, mean]                     299000, 23.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:13000
Error Set:

Resolves supabase/supabase#19815
Resolves supabase/cli#247
Resolves #197
Resolves #198
Resolves #192
Resolves #103

Footnotes

  1. The figure is a brief flow chart of how the per_request policy works.

  2. hey was run on the same host, so the requests/sec figures may not be accurate.

@nyannyacha (Collaborator Author):

error: use of a fallible conversion when an infallible one could be used
   --> crates/node/ops/crypto/mod.rs:413:65
    |
413 |                     .verify_prehash(digest, &signature.to_vec().try_into()?)
    |                                                                 ^^^^^^^^ help: use: `into`
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#unnecessary_fallible_conversions
    = note: `-D clippy::unnecessary-fallible-conversions` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(clippy::unnecessary_fallible_conversions)]`

🧐

@nyannyacha (Collaborator Author):

running 2 tests
test test_file_upload ... ok
error: test failed, to rerun pass `-p base --test oak_user_worker_tests`

Caused by:
  process didn't exit successfully: `/home/runner/work/edge-runtime/edge-runtime/target/debug/deps/oak_user_worker_tests-f0e7ee40c14d05fb` (signal: 14, SIGALRM: alarm clock)
Error: Process completed with exit code 101.

🙄

@nyannyacha (Collaborator Author):

Is the cargo clippy check running on 1.75.0 or later?
Looking at the lint description, it was introduced in 1.75.0.
🤔

@nyannyacha (Collaborator Author):

...
test deno_runtime::test::test_read_file_user_rt ... ok

failures:

error: test failed, to rerun pass `-p base --lib`
---- deno_runtime::test::test_create_eszip_from_graph stdout ----
thread 'deno_runtime::test::test_create_eszip_from_graph' panicked at crates/base/src/deno_runtime.rs:497:14:
called `Result::unwrap()` on an `Err` value: Reading /home/runner/.cache/deno/npm/registry.npmjs.org/is-even/1.0.0/.deno_sync_lock

Caused by:
    No such file or directory (os error 2)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    deno_runtime::test::test_create_eszip_from_graph

test result: FAILED. 10 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 1.53s

🧐

@nyannyacha (Collaborator Author):

🫠

@nyannyacha marked this pull request as ready for review January 8, 2024 06:24
@nyannyacha (Collaborator Author) commented Jan 10, 2024:

Note: This PR does not include the memory leak fixes. (That change was split out into PR-240.)

…termination of worker

(cherry picked from commit 84fafd1)
When the user sets the hard or soft CPU time to zero, the supervisor
will not create the signal pair for the CPU alarm.

(cherry picked from commit c4a87c2)
When massive numbers of requests arrive simultaneously, the pool eventually
creates extra workers, because each worker needs initialization time. That
means the pool must manage multiple workers per service path.

Until now, extra workers created in this scenario were simply abandoned.
This commit introduces a simple round-robin data structure to fix this
problem.

(cherry picked from commit 7167813)
(cherry picked from commit 7e643d3)
(cherry picked from commit ae5401e)
Actually... The timeout is no longer necessary since we have introduced a line
that invokes `waker.wake()` directly from the supervisor routine ;)

(cherry picked from commit c6fbf7c)
…ifetime with an outbound request

Currently, unix stream pairs are used to communicate between the main thread,
the main worker, and user workers.

But each stream pair is independent of the others, so there is no guarantee
that the pairs stay synchronized.

If the streams are not synchronized, each stream will have a different
lifetime, which in some cases causes unexpected connection breakage.

On the surface this problem appeared as internal server errors or timeouts,
which made it very hard to notice.

It might be a substantial fix for the pull requests and issues listed below.

supabase#50
supabase#56
supabase#69
supabase#75
supabase#198

🫠

(cherry picked from commit 2387c9e)
…hen entering the job

(cherry picked from commit 6e00e48)
(cherry picked from commit ae430ca)
(cherry picked from commit 88f2d74)
This policy forces the supervisor to terminate the isolate immediately once the
request is complete.

Using this policy during development makes sense because the isolate is
terminated as soon as the request completes, so developers will not have to
restart the runtime.

This commit resolves cases such as supabase#192 and supabase#212.

(cherry picked from commit 0b1ddd0)
@andreespirela (Collaborator) left a comment:

Awesome PR

@nyannyacha (Collaborator Author):

test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.28s

     Running tests/oak_user_worker_tests.rs (target/debug/deps/oak_user_worker_tests-f0e7ee40c14d05fb)

running 2 tests
test test_oak_server ... ok
error: test failed, to rerun pass `-p base --test oak_user_worker_tests`

Caused by:
  process didn't exit successfully: `/home/runner/work/edge-runtime/edge-runtime/target/debug/deps/oak_user_worker_tests-f0e7ee40c14d05fb` (signal: 14, SIGALRM: alarm clock)
Error: Process completed with exit code 101.

🙄

@nyannyacha (Collaborator Author) commented Jan 10, 2024:

🤔
I changed the tests to turn off the CPU timer-related bits, so I have no idea why SIGALRM was triggered.

@nyannyacha (Collaborator Author) commented Jan 11, 2024:

Characteristics (per_worker vs per_request) / 3 min

Running environment

Hardware: Macbook Air M1 2020, 16GB
Container: Docker/Colima(0.6.7, vz, no rosetta 2)
Guest OS: Debian/Bookworm
ARCH: aarch64/arm64
VCPU: 8
RAM: 8GiB (16GiB Swap)

Limits

const workerTimeoutMs = 20 * 1000;
const cpuTimeSoftLimitMs = 500;
const cpuTimeHardLimitMs = 600;

Command

vegeta attack -rate=0 -duration=3m -max-workers=12 [-chunked]

[graph: per_worker vs per_request over 3 min]

per_worker

Requests      [total, rate, throughput]         708556, 3936.39, 3936.36
Duration      [total, attack, wait]             3m0s, 3m0s, 1.341ms
Latencies     [min, mean, 50, 90, 95, 99, max]  911.042µs, 2.956ms, 2.566ms, 4.816ms, 5.462ms, 7.062ms, 144.479ms
Bytes In      [total, mean]                     37553468, 53.00
Bytes Out     [total, mean]                     16296788, 23.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:708556
Error Set:

per_request

Requests      [total, rate, throughput]         715516, 3975.08, 3975.04
Duration      [total, attack, wait]             3m0s, 3m0s, 1.792ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.003ms, 2.958ms, 2.498ms, 5.002ms, 5.627ms, 7.152ms, 131.394ms
Bytes In      [total, mean]                     37922348, 53.00
Bytes Out     [total, mean]                     16456868, 23.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:715516

per_worker / chunked

Requests      [total, rate, throughput]         661733, 3676.28, 3676.24
Duration      [total, attack, wait]             3m0s, 3m0s, 1.953ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.088ms, 3.22ms, 2.8ms, 5.071ms, 5.7ms, 7.533ms, 120.342ms
Bytes In      [total, mean]                     35071849, 53.00
Bytes Out     [total, mean]                     15219859, 23.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:661733
Error Set:

per_request / chunked

Requests      [total, rate, throughput]         688821, 3826.77, 3826.74
Duration      [total, attack, wait]             3m0s, 3m0s, 1.694ms
Latencies     [min, mean, 50, 90, 95, 99, max]  999.458µs, 3.109ms, 2.65ms, 4.974ms, 5.672ms, 7.618ms, 110.429ms
Bytes In      [total, mean]                     36507513, 53.00
Bytes Out     [total, mean]                     15842883, 23.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:688821
Error Set:

@andreespirela (Collaborator):

So I tested everything, and it seems like it's all good to go. But before I approve, it would be good if we could get some tests that touch on the different supervisors, the limits, and so on. Like an integration test.

cc @nyannyacha

Thanks for the work!

@nyannyacha (Collaborator Author):

@andreespirela
Ah, that's funny 🤣 I was just exploring that topic at the right time.

As a continuation of that exploration, in another branch I'm refactoring the native thread spawning parts used to run the runtime event loop so they can be used in a green thread pool. 😁

@nyannyacha (Collaborator Author):

When you say integration testing, are you referring to integration testing in the context of Rust? If so, can you be more specific about what you need?

I can't think of anything right now 😅

@andreespirela (Collaborator):

@nyannyacha Yes, we have a couple of integration tests you can take a look at, for example oak_user_worker_tests.rs, which test some of the call behaviors. Here we want to test the behavior of the supervisors: just making sure the isolates get killed, and everything that goes on when the supervisor signals take place.

@nyannyacha (Collaborator Author):

@andreespirela
Hi! 😋

Currently, I'm refactoring the integration tests in the green thread PR to be compatible with the new supervisor policies.

I've added some flow control routines to the integration tests to reproduce the exact behaviors of these policies.

A little later, I'll submit a few commits with the integration test changes in the green thread PR.
Could you review them and leave some comments?

@andreespirela (Collaborator):

Yep, will review! @nyannyacha Thanks

@nyannyacha (Collaborator Author):

Good! I'll submit changes after testing on the local machine!

I'd be happy if you left comments on the integration test changes, to help determine the direction of any integration tests that may be written in the future.

@laktek (Contributor) left a comment:

Thanks for this excellent contribution.

@laktek merged commit a6bd18a into supabase:main on Jan 16, 2024 (3 checks passed)

🎉 This PR is included in version 1.31.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@nyannyacha deleted the feat-policy branch January 16, 2024 23:13

@spence commented Jan 18, 2024:

awesome @nyannyacha! can you help migrate to the latest deno_core 😅

@nyannyacha (Collaborator Author):

Hi! @spence 😋

Well... I'm not a team member of Supabase, so I think any issues with core dependencies are probably out of my hands.

@nyannyacha (Collaborator Author) commented Jan 18, 2024:

I guess I could create a PR if they decide to; I don't know right now.

@andreespirela (Collaborator):

@nyannyacha It's okay! I'm working on migrating to the newest Deno. I'll keep you posted; as far as this PR goes, there shouldn't be any big issue.
