Coroutinize distributed loader #10609

bhalevy · 2022-05-19T12:03:07Z

Before touching any of this code for #9559,
which requires a change when loading sstables from the staging subdirectory,
simplify it using coroutines.

scylladb-promoter · 2022-05-19T18:12:39Z

CI state FAILURE - https://jenkins.scylladb.com/job/releng/job/Scylla-CI/441/

bhalevy · 2022-05-20T10:49:39Z

In v2 (b36d004):

- kept seastar threads where deferred_stop(directory) was required, to keep 'em simple
- fixed premature freeing of func in sstable_directory::parallel_for_each_restricted, which caused a use-after-free exposed by coroutinizing make_sstables_available
- rebased + retested

replica/distributed_loader.cc

scylladb-promoter · 2022-05-20T13:45:07Z

CI state SUCCESS - https://jenkins.scylladb.com/job/releng/job/Scylla-CI/457/

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

…ss calls Without that there's use-after-free when called from distributed_loader::make_sstables_available where func is turned into a coroutine and the shared_sstable parameter is not explicitly copied and captured for the continuation of sst->move_to_new_dir. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

bhalevy · 2022-05-20T14:45:52Z

In v3 (b3e2204):

- simplified process_sstable_dir as per #10609 (comment)
- kept parallel_for_each in populate_keyspace as per #10609 (comment)

scylladb-promoter · 2022-05-20T19:50:59Z

CI state SUCCESS - https://jenkins.scylladb.com/job/releng/job/Scylla-CI/467/

bhalevy · 2022-05-24T08:11:10Z

@xemul ping. please re-review.

xemul · 2022-05-24T09:23:09Z

replica/distributed_loader.cc

            }

-            return table.add_sstables_and_update_cache(new_sstables).handle_exception([&table] (std::exception_ptr ep) {
+            co_await table.add_sstables_and_update_cache(new_sstables).handle_exception([&table] (std::exception_ptr ep) {


Erm... I don't insist, but shouldn't it rather look like

try { co_await table.add_sstables_and_update_cache(); } catch (...) { dblog.error(...); abort(); }

?

It could, but why throw an exception when it can be handled elegantly as above?
I'd use try/catch if the exception handling had to cover a number of statements where we don't care which of them failed and they share common exception handling.

xemul · 2022-05-24T09:27:28Z

replica/distributed_loader.cc

-            });
-        }).then([start, total_size, ks_name, table_name] {
+            co_await d.reshard(std::move(info_vec), cm, table, max_threshold, creator, iop);
+                co_await d.move_foreign_sstables(dir);


Won't dir go out of scope because of this? There have been several bugs like that already, e.g. see my patch "Save coroutine's captured variable on stack" (to be fair -- I don't understand this problem completely)

dir is passed by reference and is instantiated in a seastar thread in distributed_loader::process_upload_dir
so it doesn't go out of scope.

> (to be fair -- I don't understand this problem completely)

I think the following explains it well.

https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#cp51-do-not-use-capturing-lambdas-that-are-coroutines

A lambda results in a closure object with storage, often on the stack, that will go out of scope at some point. When the closure object goes out of scope the captures will also go out of scope. Normal lambdas will have finished executing by this time so it is not a problem. Coroutine lambdas may resume from suspension after the closure object has destructed and at that point all captures will be use-after-free memory access.

The point is that the lambda object is temporary and goes out of scope once the lambda coroutine suspends and returns to its caller.
One way to extend its lifetime, if the calling function is a coroutine as well, is to keep the coroutine lambda in an automatic variable; that variable is then retained in the calling coroutine's frame.

I.e. instead of e.g.

co_await parallel_for_each([captures] () -> future<> { co_await async_call; });

do:

auto f = [captures] () -> future<> { co_await async_call; };
co_await parallel_for_each(f);

Does coroutine::parallel_for_each help?

> Does coroutine::parallel_for_each help?

Yes, since it moves the Func&& into the coroutine::parallel_for_each object, and so extends its lifetime until it's fully resolved. But I'm not sure all continuations do that.

Anyhow, it's not required for this particular use case.

bhalevy requested review from xemul and cmm May 19, 2022 12:04

bhalevy force-pushed the coroutinize-distributed_loader branch from 0bfd569 to b36d004 Compare May 20, 2022 10:44

bhalevy requested review from tgrabiec and nyh as code owners May 20, 2022 10:44

xemul reviewed May 20, 2022


bhalevy added 19 commits May 20, 2022 17:07

replica: distributed_loader: coroutinize process_sstable_dir

33179c8

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

replica: distributed_loader: reindent process_sstable_dir

8080e98

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

replica: distributed_loader: coroutinize collect_all_shared_sstables

84d528c

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

replica: distributed_loader: reindent collect_all_shared_sstables

ba1eb7a

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

replica: distributed_loader: coroutinize distribute_reshard_jobs

3baa4d4

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

replica: distributed_loader: reindent distribute_reshard_jobs

b65d55c

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

replica: distributed_loader: coroutinize run_resharding_jobs

29e51ed

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

replica: distributed_loader: reindent run_resharding_jobs

e1ba285

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

replica: distributed_loader: coroutinize reshard

cf0d0a1

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

replica: distributed_loader: coroutinize reshape

a1e663f

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

replica: distributed_loader: reindent reshape

b13f44c

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

replica: distributed_loader: coroutinize make_sstables_available

b3ebbf3

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

replica: distributed_loader: reindent make_sstables_available

5f4d202

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

replica: distributed_loader: coroutinize cleanup_column_family_temp_s…

8ba10db

…st_dirs Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

replica: distributed_loader: reindent cleanup_column_family_temp_sst_…

48122d3

…dirs Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

replica: distributed_loader: coroutinize handle_sstables_pending_delete

b8260c9

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

replica: distributed_loader: reindent handle_sstables_pending_delete

5b038af

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

replica: distributed_loader: coroutinize populate_keyspace

a3c1dc8

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

replica: distributed_loader: reindent populate_keyspace

b3e2204

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

bhalevy force-pushed the coroutinize-distributed_loader branch from b36d004 to b3e2204 Compare May 20, 2022 14:43

bhalevy requested a review from xemul May 20, 2022 14:46

xemul reviewed May 24, 2022

scylladb-promoter closed this in ed23e83 May 25, 2022

scylladb-promoter merged commit ed23e83 into scylladb:master May 25, 2022
