test: perf: add end-to-end benchmark for alternator #13121
Conversation
Ideally an external process could have also gotten the same counters using metrics? What worries me a bit about having the client in the same process as the server is that it mixes the performance of the client and server, and also might make it harder to include some things (like client authentication). But I guess it's a good start.
Looks good (I just left a few comments/questions). I am guessing that you'll probably want to improve this code together with doing Alternator optimizations, and understanding better what you really want to benchmark / profile.
main.cc
Outdated
@@ -1719,6 +1721,9 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
    });

    startlog.info("Scylla version {} initialization completed.", scylla_version());
    if(after_init_func) {
space after if
fixed
main.cc
Outdated
@@ -1768,6 +1773,7 @@ int main(int ac, char** av) {
    {"perf-row-cache-update", perf::scylla_row_cache_update_main, "run performance tests by updating row cache on this server"},
    {"perf-simple-query", perf::scylla_simple_query_main, "run performance tests by sending simple queries to this server"},
    {"perf-sstable", perf::scylla_sstable_main, "run performance tests by exercising sstable related operations on this server"},
    {"perf-alternator-workloads", perf::alternator_workloads(scylla_main, &after_init_func), "run performance tests on full alternator stack"}
A thought (feel free to discard): this after_init_func made the alternator_workloads function very weird - it needs to return a lambda, set another lambda, etc. Wouldn't it be easier for main to offer a promise to wait for initialization to complete (maybe we already have such a thing somehow?), and then the workload function would be just like the rest: it would start by running main and await initialization completing at the beginning?
Maybe it's doable to split it into two parts, but I wanted changes to scylla_main to be minimal as it's already very complex (or at least long).
It has something like supervisor::notify which could be extended, although this is "global complexity" vs "local complexity" to me. perf::alternator_workloads is more complex but it doesn't affect anything not related to it, while extending supervisor::notify (or something similar) would affect every usage.
The difference versus the other perf:: functions from above stems from the fact that it's the only one which needs scylla_main to be called; the others replace it.
req._headers["X-Amz-Target:"] = "DynamoDB_20120810." + operation;
req.write_body("application/x-amz-json-1.0", std::move(body));
co_await cli.make_request(std::move(req), [] (const http::reply& rep, input_stream<char>&& in_) -> future<> {
    auto in = std::move(in_);
what does this move achieve?
It's probably to avoid the lambda coroutine fiasco. Please see if coroutine::lambda() fixes it instead.
I thought about this, but wasn't this fiasco about captures, not parameters? (but maybe I'm misremembering).
Yes, it was added to keep it alive. Indeed the coroutine fiasco document mentions only captures. I am not sure why this is happening. It looks like in_ is freed after the first suspension point in the lambda.
Adding coroutine::lambda or the second solution with std::ref doesn't help here.
Perhaps there is something wrong with the caller code in seastar?
return do_with(std::move(rep), [&con, handle = std::move(handle)] (auto& rep) mutable {
return handle(rep, con.in(rep));
});
handle is my lambda from above. I've also tried putting it in do_with to keep it alive, without success:
return do_with(std::move(rep), std::move(handle), [&con] (auto& rep, auto& handle) mutable {
return handle(rep, con.in(rep));
});
So this trick with auto in = std::move(in_); is the only one which works for me (I saw it also here: https://github.com/scylladb/scylladb/blob/master/alternator/executor.cc#L91 and in one of Pavel's WIP patches).
The entire coroutine frame is lost. It applies equally to captures, parameters, and locals that are promoted to live in the coroutine frame.
So it's better to use coroutine::lambda(), it solves the problem rather than working around it and failing if someone adds a local variable.
Like just wrapping the lambda?
co_await cli.make_request(std::move(req), coroutine::lambda([] (const http::reply& rep, input_stream<char>&& in_) -> future<> {
    ...
}));
I tested this before and it doesn't solve it.
I think it can be solved but in seastar instead:
diff --git a/src/http/client.cc b/src/http/client.cc
index 7a66823c..46df99df 100644
--- a/src/http/client.cc
+++ b/src/http/client.cc
@@ -236,9 +236,9 @@ future<> client::make_request(request req, reply_handler handle, reply::status_t
return make_exception_future<>(std::runtime_error(format("request finished with {}", rep._status)));
}
- return do_with(std::move(rep), [&con, handle = std::move(handle)] (auto& rep) mutable {
- return handle(rep, con.in(rep));
- });
+ return do_with(std::move(rep), coroutine::lambda([&con, handle = std::move(handle)] (auto& rep) mutable -> future<> {
+ co_await handle(rep, con.in(rep));
+ }));
});
});
}
While this is documented:
/// lambda coroutine must complete (co_await) in the same statement.
I am not sure why it requires an immediate co_await to work.
fun_t fun = it->second;

auto results = time_parallel([&] {
    static thread_local auto sharded_cli = get_client(c.port); // for simplicity never closed as it lives for the whole process runtime
I don't understand what I'm seeing here. You have "concurrency", and yet just one "cli" object per shard. How does it work?
I think it was working because underneath connection::make_request doesn't yield, at least in this case.
Although I think it's more realistic for alternator to work on a higher number of connections, so I will add a pool, making the number of connections equal to the concurrency.
Force-pushed from 6180681 to ac1823d (v2).
Force-pushed from ac1823d to aeebf55 (v3).
Force-pushed from 9cadd30 to 7727f01.
Force-pushed from 7727f01 to 1af1e5f.
Force-pushed from 1af1e5f to f3dcc20 (v4).
…vice group

When a base write triggers an mv write and it needs to be sent to another shard, it used the same service group and we could end up with a deadlock. This fix also affects alternator's secondary indexes.

Testing was done using the (yet) uncommitted framework for easy alternator performance testing: scylladb#13121. I've changed the hardcoded max_nonlocal_requests config in scylla from 5000 to 500 and then ran:

./build/release/scylla perf-alternator-workloads --workdir /tmp/scylla-workdir/ --smp 2 --developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write_gsi --duration 60 --ring-delay-ms 0 --skip-wait-for-gossip-to-settle 0 --continue-after-error --concurrency 2000

Without the patch, when scylla is overloaded (i.e. the number of scheduled futures is close to max_nonlocal_requests), after a couple of seconds scylla hangs, cpu usage drops to zero, and no progress is made. We can confirm we're hitting this issue by seeing under gdb:

p seastar::get_smp_service_groups_semaphore(2,0)._count
$1 = 0

With the patch I wasn't able to observe the problem, even with 2x concurrency. I was able to make the process hang with 10x concurrency, but I think it's hitting a different limit, as there wasn't any depleted smp service group semaphore and it was happening also on non-mv loads.
🔴 CI State: FAILURE (✅ Build; Failed Tests: 1/21011)
Test failure could be #15334 - looks unrelated to this PR?
Fixes #15844
Closes #15845
🔴 CI State: FAILURE (✅ Build; Failed Tests: 2/30566)
It will be reused later by a new tool.
The code is based on a similar idea as perf_simple_query. The main differences are:
- it starts a full scylla process
- it communicates with alternator via http (localhost)
- it uses a richer table schema with all DynamoDB types instead of only strings

Testing code runs in the same process as scylla so we can easily get various perf counters (tps, instr, allocation, etc).

Results on my machine (with 1 vCPU):

> ./build/release/scylla perf-alternator-workloads --workdir ~/tmp --smp 1 --developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload read --duration 10 2> /dev/null
...
median 23402.59616090321
median absolute deviation: 598.77
maximum: 24014.41
minimum: 19990.34

> ./build/release/scylla perf-alternator-workloads --workdir ~/tmp --smp 1 --developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write --duration 10 2> /dev/null
...
median 16089.34211320635
median absolute deviation: 552.65
maximum: 16915.95
minimum: 14781.97

The above seem more realistic than results from perf_simple_query, which are 96k and 49k tps (per core).
@nyh I've synchronized this PR with the enterprise one. I think you can merge both.
How do you isolate client code from server code?
I don't. Anyway, what's interesting is the delta (i.e. before and after some patch). The absolute number has very little meaning.
Thanks! I'm waiting for the CI to finish.
🔴 CI State: FAILURE (✅ Build)
Seemingly unrelated test failed, filed https://github.com/scylladb/scylla-dtest/issues/4252, restarting.
@yarongilor clang-tidy is failing with
I don't see how this is related to the PR. Does it block the merge now?
🟢 CI State: SUCCESS (✅ Build)
Thanks @nuivall. I see CI passed, so I merged now.
Related: #12518