
mvcc: make schema upgrades gentle, old version of the PR #13104

Closed
wants to merge 35 commits into from

Conversation

michoecho
Contributor

@michoecho michoecho commented Mar 7, 2023

After a schema change, memtable and cache have to be upgraded to the new schema. Currently, they are upgraded (on the first access after a schema change) atomically, i.e. all rows of the entry are upgraded with one non-preemptible call. This is one of the last vestiges of the times when partitions were treated atomically, and it is a well-known source of numerous large stalls.

This series makes schema upgrades gentle (preemptible). This is done by co-opting the existing MVCC machinery.
Before the series, all partition_versions in the partition_entry chain have the same schema, and an entry upgrade replaces the entire chain with a single squashed and upgraded version.
After the series, each partition_version has its own schema. A partition entry upgrade happens simply by adding an empty version with the new schema to the head of the chain. Row entries are upgraded to the current schema on-the-fly by the cursor during reads, and upgraded permanently by the MVCC version merge ongoing in the background.

The series:

  1. Does some code cleanup in the mutation_partition area.
  2. Adds a schema field to partition_version and removes it from its containers (partition_snapshot, cache_entry, memtable_entry).
  3. Adds upgrading variants of constructors and apply() for row and its wrappers.
  4. Prepares partition_snapshot_row_cursor, mutation_partition_v2::apply_monotonically and partition_snapshot::merge_partition_versions for dealing with heterogeneous version chains.
  5. Modifies partition_entry::upgrade to perform upgrades by extending the version chain with a new schema instead of squashing it to a single upgraded version.
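
The scheme in point 5 can be sketched in a few lines. This is a hypothetical, heavily simplified model (the names schema_t, partition_version, partition_entry here are toy stand-ins; the real types carry much more state), only meant to show why the upgrade itself becomes O(1):

```cpp
#include <cassert>
#include <memory>

// Minimal sketch: each version carries its own schema pointer, and an entry
// upgrade just pushes an empty version with the new schema onto the head of
// the chain. No row is touched here; per-row work is deferred to readers
// (on-the-fly upgrade by the cursor) and to the background version merge.
struct schema_t { int id; };

struct partition_version {
    std::shared_ptr<const schema_t> schema;
    int row_count = 0;                        // stand-in for the real payload
    std::unique_ptr<partition_version> older; // next (older) version in chain
};

struct partition_entry {
    std::unique_ptr<partition_version> head;

    void upgrade(std::shared_ptr<const schema_t> new_schema) {
        auto v = std::make_unique<partition_version>();
        v->schema = std::move(new_schema);
        v->older = std::move(head);
        head = std::move(v);
    }
};
```

The cost of the upgrade no longer depends on the number of rows in the entry, which is what makes it gentle.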

I'm particularly worried about the correctness of the modifications to the version merging algorithm. It is already probably the trickiest and most complex piece of Scylla code, and these modifications make reasoning even harder.
Before the series, we always merge the newer version into the older version. After the series, we also merge in the other direction, because after the merge we want to end up with the newer schema, not the older schema. This changes some assumptions of apply_monotonically, especially those concerning eviction order and the "older versions are evicted first" rule.
Between preemption safety, exception safety, rows and dummies and last dummies, evictable and non-evictable entries, continuity, row tombstones, row compaction, and now also forward and backward merging, the complexity of this function is brain-melting. I have no idea how to cover it with trustworthy tests. The puny test I've added is barely more than a sanity check.

A possible performance problem with this method is that the same row entry might have to be upgraded many times by the cursor until it's finally upgraded permanently by a version merge, whereas before the series each entry only needed to be upgraded once. Note that this problem is only relevant for big partition entries (the version merge starts eagerly and is only backgrounded on first preempt, so for small entries it will usually complete synchronously). I didn't do any performance tests, but schema upgrades currently (both before and after this series) seem to be implemented inefficiently: for every cell in the upgraded entry, a schema upgrade looks up its column name in a hashmap. If this is a performance problem, we can follow up with an optimization which avoids this by preparing an integer-based column mapping for the schema pair instead.
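
To illustrate the proposed follow-up optimization, here is a sketch (invented names, not Scylla's actual API) of precomputing an integer-based column mapping once per schema pair, so that the per-cell work drops from a hashmap lookup to a vector index:

```cpp
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: compute, once per (old schema, new schema) pair, a
// vector mapping each old column id (the vector index) to the corresponding
// new column id, or nullopt if the column was dropped in the new schema.
// Upgrading a cell then only indexes into this vector.
using column_id = size_t;

std::vector<std::optional<column_id>> make_column_mapping(
        const std::vector<std::string>& old_columns,
        const std::unordered_map<std::string, column_id>& new_columns_by_name) {
    std::vector<std::optional<column_id>> mapping;
    mapping.reserve(old_columns.size());
    for (const auto& name : old_columns) {
        auto it = new_columns_by_name.find(name);
        if (it == new_columns_by_name.end()) {
            mapping.push_back(std::nullopt); // column dropped in new schema
        } else {
            mapping.push_back(it->second);
        }
    }
    return mapping;
}
```

The mapping would be built once and cached for the schema pair, amortizing the name lookups over all upgraded cells.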

Fixes #2577

In some mixed-schema apply helpers for tests, the source mutation
is accidentally copied with the target schema. Fix that.

Nothing seems to be currently affected by this bug; I found it when it
was triggered by a new test I was adding.

mutation_assertion gives no information whatsoever about *which*
assert in the test failed.

This patch adds a std::source_location argument to assert_that,
which can be used to supply the assert with its call location,
to make the error more informative.
Reads with multiple schema versions have a different code path now,
so add schema changes to the test, to test these paths too.
…ally()

The comment suggests that the order of sentinel insertion is meaningful because
of the resulting eviction order. But the sentinels are added to the tracker
with the two-argument version of insert(), which inserts the second argument
into the LRU right before the (more recent) first argument.
Thus the eviction order of sentinels is decided explicitly, and it doesn't
rely on insertion order.
apply_weak is just an alias for apply(), and most of its variants
are dead code. Get rid of it.
Most variants of apply() and apply_monotonically() in mutation_partition_v2
are leftovers from mutation_partition, and are unused. Thus they only
add confusion and maintenance burden. Since we will be modifying
apply_monotonically() in upcoming patches, let's clean them up, lest
the variants become stale.

This patch removes all unused variants of apply() and apply_monotonically()
and "manually inlines" the variants which aren't used often enough to carry
their own weight.

In the end, we are left with a single apply_monotonically() and two convenience
apply() helpers.

The single apply_monotonically() accepts two schema arguments. This facility
is unimplemented and unused as of this patch - the two arguments are always
the same - but it will be implemented and used in later parts of the series.

The comment refers to "other", but it means "pe". Fix that.

The patch also adds a bit of context to the mutation_partition jargon
("evictability" and "continuity"), by reminding how it relates
to the concrete abstractions: memtable and cache.

deletable_row::apply() and deletable_row::apply_monotonically() are
exchangeable. There is no reason for this aliasing.
This patch gets rid of the redundancy.
Arbitrarily, it leaves an apply() with a `const deletable_row&` argument
and an apply_monotonically() with a `deletable_row&&` argument.

Ditto for row::apply().
It will be used in upcoming patches.
Only a copying variant is provided. As an optimization, a moving
variant might be added in the future.
…tion_v2 constructor

We don't have a convention for when to pass `schema_ptr` and when to pass
`const schema&` around.
In general, IMHO the natural convention for such a situation is to pass the
shared pointer if the callee might extend the lifetime of the schema,
and pass a reference otherwise. But we convert between them willy-nilly
through shared_from_this().

While passing a reference to a function which actually expects a shared_ptr
can make sense (e.g. due to the fact that smart pointers can't be passed in
registers), the other way around is rather pointless.

This patch takes one occurrence of that and modifies the parameter to a reference.

Since enable_shared_from_this makes shared pointer parameters and reference
parameters interchangeable, this is a purely cosmetic change.
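
A small sketch (with an invented schema_t type, not Scylla's schema class) of why enable_shared_from_this makes the two parameter styles interchangeable:

```cpp
#include <memory>

// When a class derives from enable_shared_from_this, a callee that receives
// only `const T&` can still recover a shared_ptr and extend the object's
// lifetime, so reference and shared_ptr parameters are interchangeable for
// objects already owned by a shared_ptr.
struct schema_t : std::enable_shared_from_this<schema_t> {
    int id;
    explicit schema_t(int i) : id(i) {}
};

std::shared_ptr<const schema_t> keep_alive(const schema_t& s) {
    // Valid only if *s* is already managed by some shared_ptr; otherwise
    // shared_from_this() throws std::bad_weak_ptr.
    return s.shared_from_this();
}
```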
…n constructor

Cosmetic change. See the preceding commit for details.
…n::difference

Cosmetic change. See the preceding commit for details.
@michoecho michoecho marked this pull request as ready for review March 14, 2023 00:55
@michoecho
Contributor Author

Ready for review. Requesting review from @tgrabiec

@scylladb-promoter
Contributor

Currently, partition_version does not reference its schema.
All partition_versions reachable from an entry/snapshot have the same schema,
which is referenced in memtable_entry/cache_entry/partition_snapshot.

To enable gentle schema upgrades, we want to use the existing background
version merging mechanism. To achieve that, we will move the schema reference
into partition_version, and we will allow neighbouring MVCC versions to have
different schemas, and we will merge them on-the-fly during reads and
persistently during background version merges.
This way, an upgrade will boil down to adding a new empty version with
the new schema.

This patch adds the _schema field to partition_version and propagates
the schema pointer to it from the version's containers (entry/snapshot).
Subsequent patches will remove the schema references from the containers,
because they are now redundant.
After adding a _schema field to each partition version,
the field in partition_snapshot is redundant. It can always be recovered
from the latest version. Remove it.

After adding a _schema field to each partition version,
the field in cache_entry is redundant. It can always be recovered
from the latest version. Remove it.

After adding a _schema field to each partition version,
the field in memtable_entry is redundant. It can always be recovered
from the latest version. Remove it.
…ead()

partition_entry now contains a reference to its schema, so it no longer
needs to be supplied by the caller.
operator<< accepts a schema& and a partition_entry&. But since the latter
now contains a reference to its schema inside, the former is redundant.
Remove it.
It will be used in upcoming commits.

A factory function is used, rather than an actual constructor,
because we want to delegate the (easy) case of equal schemas
to the existing single-schema constructor.
And that's impossible (at least without invoking a copy/move
constructor) to do with only constructors.
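
A sketch (hypothetical toy types, not the real row/schema classes) of why a static factory can delegate the equal-schema case to an existing constructor while a constructor alone cannot branch between sibling constructors at run time:

```cpp
// A constructor cannot conditionally delegate to a different constructor,
// but a static factory can branch and invoke whichever constructor applies,
// returning the result by value (with copy elision).
struct Schema { int version; };

struct Row {
    bool upgraded;
    explicit Row(const Schema&) : upgraded(false) {}      // single-schema ctor
    Row(const Schema&, const Schema&) : upgraded(true) {} // upgrading ctor

    static Row construct(const Schema& ours, const Schema& theirs) {
        if (ours.version == theirs.version) {
            return Row(ours);     // delegate the easy, equal-schema case
        }
        return Row(ours, theirs); // genuinely different schemas: upgrade
    }
};
```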
It will be used during upcoming changes in partition_snapshot_row_cursor
to prepare it for multi-schema snapshots.
…otonically()

A helper which will be used during upcoming changes to
mutation_partition_v2::apply_monotonically(), which will extend it to merging
versions with different schemas.
A helper which will be used in upcoming commits.
A helper which will be used during upcoming changes to
mutation_partition_v2::apply_monotonically(), which will extend it to merging
versions with different schemas.
Adds a logalloc::region argument to upgrade_schema().
It's currently unused, but will be further propagated to
partition_entry::upgrade() in an upcoming patch.
m2.upgrade(new_schema);
cache.set_schema(new_schema);
}

Member

Regarding tests, I think we need a stresser that builds a model of the data outside the row cache and compares it with the real row_cache under a mix of writes, reads, schema changes, and evictions. The snapshot_source that backs the cache can read the model, no need for real sstables.

Contributor

test_concurrent_reads_and_eviction() is such a test. The role of external model is served by the "mutation" object. We use memtable_snapshot_source instead of sstables as underlying source.

@avikivity
Member

After a schema change, memtable and cache have to be upgraded to the new schema. Currently, they are upgraded (on the first access after a schema change) atomically, i.e. all rows of the entry are upgraded with one non-preemptible call. This is one of the last vestiges of the times when partitions were treated atomically, and it is a well-known source of numerous large stalls.

This series makes schema upgrades gentle (preemptible). This is done by co-opting the existing MVCC machinery. Before the series, all partition_versions in the partition_entry chain have the same schema, and an entry upgrade replaces the entire chain with a single squashed and upgraded version. After the series, each partition_version has its own schema. A partition entry upgrade happens simply by adding an empty version with the new schema to the head of the chain. Row entries are upgraded to the current schema on-the-fly by the cursor during reads, and upgraded permanently by the MVCC version merge ongoing in the background.

The series:

  1. Does some code cleanup in the mutation_partition area.
  2. Adds a schema field to partition_version and removes it from its containers (partition_snapshot, cache_entry, memtable_entry).

This is neutral in terms of non-transient memory footprint, yes?

  1. Adds upgrading variants of constructors and apply() for row and its wrappers.
  2. Prepares partition_snapshot_row_cursor, mutation_partition_v2::apply_monotonically and partition_snapshot::merge_partition_versions for dealing with heterogeneous version chains.
  3. Modifies partition_entry::upgrade to perform upgrades by extending the version chain with a new schema instead of squashing it to a single upgraded version.

I'm particularly worried about the correctness of the modifications to the version merging algorithm. It is already probably the trickiest and most complex piece of Scylla code, and these modifications make reasoning even harder. Before the series, we always merge the newer version into the older version. After the series, we also merge in the other direction, because after the merge we want to end up with the newer schema, not the older schema.

Could we not make things more orthogonal? For example, instead of an upgrading merge, create a clone of the version with the new schema, replace the old version with the upgraded schema, and then proceed with the version merge?

A different way to make things orthogonal is to create a row_cache object for every schema version and chain those row_cache objects, but that comes with its own problems. I admit the motivation for that was to avoid duplicating the schema everywhere, not a desire to simplify things.

This changes some assumptions of apply_monotonically, especially those concerning eviction order and the "older versions are evicted first" rule. Between preemption safety, exception safety, rows and dummies and last dummies, evictable and non-evictable entries, continuity, row tombstones, row compaction, and now also forward and backward merging, the complexity of this function is brain-melting. I have no idea how to cover it with trustworthy tests. The puny test I've added is barely more than a sanity check.

A possible performance problem with this method is that the same row entry might have to be upgraded many times by the cursor until it's finally upgraded permanently by a version merge, whereas before the series each entry only needed to be upgraded once.

I think it's reasonable, as long as we schedule the version merge to happen "soon".

Note that this problem is only relevant for big partition entries (the version merge starts eagerly and is only backgrounded on first preempt, so for small entries it will usually complete synchronously).

That's fine. For small partitions we get good behavior, and for large partitions anything is better than the stall we have today.

I didn't do any performance tests, but schema upgrades currently (both before and after this series) seem to be implemented inefficiently: for every cell in the upgraded entry, a schema upgrade looks up its column name in a hashmap. If this is a performance problem, we can follow up with an optimization which avoids this by preparing an integer-based column mapping for the schema pair instead.

Fixes #2577

Please annotate the cover letter with perf-simple-query results.

@avikivity
Member

Yes, cache_entry can destroy the schema.
schema is not supposed to be a managed object, so we don't care about
allocator context. But it contains a managed object (raw_value), and that's
a problem. raw_value should be switched to be non-managed. If we have no
alternative for managed_bytes, we can create a wrapper which destroys it in
std allocator context.

@avikivity Using managed_bytes as the common type for the buffer fragmentation project was a mistake.

Yes. I started a project to make bytes_ostream compatible with managed_bytes in its internal layout so it can use the same iterators, but not have to share the same class name. This avoids a bunch of templates but can also solve this problem.

We always copy across cache/memtable <-> rest of system boundaries, so using the same type only saves us effort in allowing to share the mutation representation between memtable/cache and non-cache, but perhaps it's not worth it.

@tgrabiec
Contributor

tgrabiec commented Mar 15, 2023 via email

@tgrabiec
Contributor

tgrabiec commented Mar 15, 2023 via email

@michoecho
Contributor Author

The series:
Does some code cleanup in the mutation_partition area.
Adds a schema field to partition_version and removes it from its containers (partition_snapshot, cache_entry, memtable_entry).

This is neutral in terms of non-transient memory footprint, yes?

Yes.

Could we not make things more orthogonal? For example, instead of an upgrading merge, create a clone of the version with the new schema, replace the old version with the upgraded schema, and then proceed with the version merge?

We could. It's on the table. In this scenario, we could have a simpler "upgrading merge" function, upgrade_monotonically(), which would be used only for heterogeneous neighbors, and only if the two versions cover disjoint ranges — it would be simpler, because the function would only need to move entries and insert sentinels on preempt, but wouldn't have to care about intersections, which are the hard part. (And if there were two heterogeneous versions which don't cover disjoint ranges, then an empty version with the newer schema would be inserted between them to enforce that).

I can try writing that variant too. Lowering the complexity of apply_monotonically can be worth it.

I think it's reasonable, as long as we schedule the version merge to happen "soon".

Version merge is scheduled as soon as the snapshot which owns the version is destroyed, but version merges happen in order (except for the "first task quota" of the merge, which happens synchronously with the destruction, until it's preempted). The merges happen in a special scheduling group, mem_compaction. (Not making a point, just reminding how it works).

Please annotate the cover letter with perf-simple-query results.

Will do.

@michoecho
Contributor Author

michoecho commented Mar 15, 2023

@tgrabiec

This code will be read more often than grepped.

In my case grepping is an essential part of reading. If I'm looking at some code, usually I also want to know where it's used, so I grep for its calls. If the name repeats too much, you need to fire up an IDE to find the call site, and I don't like that.

But since you insist, I will rename them back to just apply().

The clutter in tests weighs more because it's what you look at when trying
to understand the failed test. Pushing clutter down is good.

Are you saying that pushing clutter from test code to core code is good? I disagree with the sentiment, IMHO readability of core code takes priority over readability of tests. But I can add the helper back.

@avikivity
Member

There is a third option yet, flattening the versions. In this option we allow rows with duplicate clustering keys (and duplicate static rows), the duplicates are disambiguated by the maximum timestamp within the row's mutation. A version is an instruction not to merge rows across a particular timestamp. Reading a version means reading the partition and ignoring higher-timestamp rows. To remove a version, merge adjacent rows that cross the version's timestamp.

It only works for LWT and raft (since monotonic timestamps are required), but MVCC is useless anyway outside LWT/Raft.

@scylladb-promoter
Contributor

@michoecho
Contributor Author

michoecho commented Mar 15, 2023

CI failed on the ol' faithful ttl_test.TestDistributedTTL.test_ttl_is_respected_on_repair

Restarting. This time it will pass, I can feel it.

@avikivity
Member

CI failed on the ol' faithful ttl_test.TestDistributedTTL.test_ttl_is_respected_on_repair

Restarting. This time it will pass, I can feel it.

Was an issue filed/updated?

@michoecho
Contributor Author

michoecho commented Mar 15, 2023

Was an issue filed/updated?

It's #11684. It's failing some CI every day.

Edit: I confused it with #13137, but they are in one area.

@avikivity
Member

Asked to prioritize.

@scylladb-promoter
Contributor

@@ -97,6 +97,7 @@ public:
row();
~row();
row(const schema&, column_kind, const row&);
static row construct_row(const schema& our_schema, const schema& their_schema, column_kind, const row&);
Contributor

Why not just row::construct()?

Contributor Author

construct_row is greppable, construct is not.
In the absence of other arguments, I prefer the more greppable name.

Contributor

It's a static method, so users will call row::construct(), which is as greppable.

Contributor Author

@michoecho michoecho Mar 16, 2023

But from within the implementation of class row you can call it without the row:: scope — just construct(). So if you really want to grep all uses of a function in C++, you have to grep just its name, not the qualified name. (And there are also function pointers, and typedefs, and more). A static method with a generic name is more greppable than a non-static method with a generic name, but still not truly greppable.

Contributor

True. But should we go there? And rename row::size() to row::row_size() too, and apply() to mpv2_apply()? It seems like a slippery slope to me, which hurts readability. This problem is solved by IDEs with "Find Usages". @avikivity What's your take?

Contributor Author

@michoecho michoecho Mar 16, 2023

This problem is solved by IDEs with "Find Usages".

It is solved by IDEs until it isn't.

For example:

scylladb/serializer.hh

Lines 254 to 257 in 5705df7

template<typename T, typename Output>
inline void serialize(Output& out, const std::reference_wrapper<T> v) {
    serializer<T>::write(out, v.get());
}

clangd seems to have no idea where the implementations of write() are, even though this is statically resolvable information.
If it had a unique name, I could at least grep for it — but I can't.

The problem goes the other way too: the IDE won't show you a usage of a method if the object has templated type.

It's not like I like it — I don't think greppability is "the right thing". If "find references" actually worked, I wouldn't advocate it. But C++ tooling was retrofitted onto the language and is less reliable than plain text.


@@ -355,7 +353,7 @@ stop_iteration mutation_partition_v2::apply_monotonically(const schema& s, const
position_in_partition::after_key(s, src_e.position()), is_dummy::yes, is_continuous::no));
if (src_e.position().is_clustering_row()) {
s2 = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(s,
current_allocator().construct<rows_entry>(p_s,
Contributor

s2 is this_sentinel, so should use "s" (our schema). s1 should use p_s.

It currently doesn't matter because sentinels carry no cells, only positions, which, as a clustering key, can be interpreted using any schema version. key schema is immutable. But better be precise in case something in the rows_entry becomes version-dependent in the future.

Contributor Author

Right. I managed to confuse s1 and s2 multiple times when working on this code. This is a leftover of one such occasion. I must have thought that it was you who made this mistake and that I was fixing it. Must be some deep-seated instinct that 1 == "us" and 2 == "them".

Contributor Author

Must be some deep-seated instinct that 1 == "us" and 2 == "them".

Now that I think about it, this might be engrained by the grammatical notion of first person.

@@ -440,18 +447,26 @@ stop_iteration mutation_partition_v2::apply_monotonically(const schema& s, const
}
Contributor

s1 above should use p_s, for consistency.

Contributor Author

Ah, so this is what I had in mind when I changed s2 to p_s.

// the last dummy entry which holds the partition entry linked in LRU.
i->swap(src_e);
} else {
i->apply_monotonically(s, p_s, std::move(src_e));
Contributor

You can just drop src_e in this case due to "information monotonicity" ("i" is in the newer version).

Note, it wouldn't be safe to call apply_monotonically() if "i" was in the older version because apply_monotonically can fail mid-way and we would violate "information monotonicity" by leaving src_e with less information (some cells moved).

Contributor Author

True.

i->swap(src_e);
} else {
i->apply_monotonically(s, p_s, std::move(src_e));
}
tracker->remove(src_e);
} else {
// Avoid row compaction if no newer range tombstone.
do_compact = (src_e.range_tombstone() + src_e.row().deleted_at().regular()) >
(i->range_tombstone() + i->row().deleted_at().regular());
memory::on_alloc_point();
Contributor

For reverse order merging the comparison in do_compact should be inverted:

do_compact = (src_e.range_tombstone() + src_e.row().deleted_at().regular()) <
                            (i->range_tombstone() + i->row().deleted_at().regular());

The rationale is this. We want to compact if one version has a range tombstone which covers cells in the other version. To avoid expensive exact checks, we just assume that if the later version contains a range tombstone, then it probably covers cells in the older version (which is usually the case because timestamp values usually follow the order of apply). For !same_schema, it's "i" which is in the newer version, so we want to compact if i->range_tombstone() is later than src_e.range_tombstone().

current_allocator().destroy(prev);
if (merged_into->is_referenced()) {
merged_into->back_reference().release();
}
Contributor

This if block seems redundant to me and a bit confusing. If merged_into is _version, the assignment below takes care of it. If it's not _version, it's incorrect to just release it, but this case is taken care of by previous if (prev->is_referenced()), so not the case here.

Contributor Author

@michoecho michoecho Mar 17, 2023

The assignment below creates a reference to merged_into, and it's forbidden to create a reference to an already-referenced version. (This is enforced by this assert:)

assert(!_version->_backref);

So we have to explicitly release the existing reference to merged_into before we create the new one.
But I concur that this looks wrong. I'll add a comment which explains it.

@@ -949,6 +949,7 @@ SEASTAR_TEST_CASE(test_eviction_after_schema_change) {
rd.fill_buffer().get();
}

tracker.cleaner().drain().get0();
Contributor

What was the failure without this?

Background cleaning shouldn't prevent eviction.

Contributor Author

The failure is in BOOST_REQUIRE_EQUAL(tracker.get_stats().rows, 0);.
It's about last dummies — they are only removed when their version is finally destroyed by the mutation cleaner.
Since there is no guarantee that the code above doesn't leave any unreferenced versions, the cleaning has to be explicitly finished before the check happens.

That said, you made me realize that the code works in a different way than I intended. I wanted to start the version merge synchronously after the upgrade, so that small partitions are upgraded without leaving garbage. But now I see that it doesn't happen, because this check is blocking that, and the upgrading merge never starts synchronously:

// Allocating sections require the region to be reclaimable
// which means that they cannot be nested.
// It is, however, possible, that if the snapshot is taken
// inside an allocating section and then an exception is thrown
// this function will be called to clean up even though we
// still will be in the context of the allocating section.
if (!region.reclaiming_enabled()) {
return stop_iteration::no;
}

I wonder what to do about this. Maybe change it so that the worker allocating section is called only if reclaiming isn't already disabled by an outer section?

return _worker_state->alloc_section(region, [&] {

@tgrabiec
Contributor

tgrabiec commented Mar 16, 2023 via email

@tgrabiec
Contributor

tgrabiec commented Mar 16, 2023 via email

@tgrabiec
Contributor

@avikivity Using managed_bytes as the common type for the buffer fragmentation project was a mistake.

Please open an issue so that we don't forget about it.

@michoecho
Contributor Author

@avikivity Using managed_bytes as the common type for the buffer fragmentation project was a mistake.

Please open an issue so that we don't forget about it.

Opened #13216

@avikivity
Member

Let's move this forward. What's remaining?

@michoecho
Contributor Author

Let's move this forward. What's remaining?

Status reminder:

The PR aimed to implement gentle upgrades by making the schema a per-version property (instead of a per-entry property), and adding support for heterogeneous entries to reading and merging algorithms. This way an upgrade boils down to adding an empty version.
But this requires merging versions forward (moving rows from an older version into a newer version), which is the opposite of the regular merging direction. We ultimately decided that requiring the merge algorithm to work in both directions makes it too hard to comprehend (and believe in its correctness).
Tomek gave an alternative idea in #13104 (comment), but it's not that simple either. It avoids forward merges, but it adds much complexity to how the version chain is managed.

So, in short, progress was stalled mostly because no design seems good, and we (I) are afraid of pushing yet another bunch of complexity into an already very complicated section of code.

I was just planning to rewrite this into a variant of Tomek's idea (where during upgrade we add a newer version after the older version, and the two are treated from the outside as a unit, and they aren't merged from/to until their merge/upgrade finishes), but the tracing project (and other issues) diverted my time.

A secondary unsolved problem (#13104 (comment)) is that an assumption made by the PR — that a merge can be triggered on upgrade to make upgrades of small partition entries effectively synchronous — was wrong, because of allocating_section nesting rules (there can be no nesting). Since both the cache update and the merge have their own allocating section, the merge refuses to happen during the update, so it becomes asynchronous, rather than synchronous. We have to avoid this somehow, maybe by bypassing the merge section if we are inside the update section.

@avikivity
Member

@tgrabiec and @michoecho, please agree on a direction and move forward, it's a pity to let this rot.

@mykaul mykaul added the Eng-1 label May 1, 2023
@michoecho michoecho changed the title mvcc: make schema upgrades gentle mvcc: make schema upgrades gentle, old version of the PR May 4, 2023
@mykaul
Contributor

mykaul commented May 14, 2023

@tgrabiec and @michoecho, please agree on a direction and move forward, it's a pity to let this rot.

@tgrabiec , @michoecho - please move forward, I'd be happy to see it in before 5.3 branch.

@michoecho
Contributor Author

@tgrabiec and @michoecho, please agree on a direction and move forward, it's a pity to let this rot.

@tgrabiec , @michoecho - please move forward, I'd be happy to see it in before 5.3 branch.

Note: a new version of this PR is at #13761. It's awaiting @tgrabiec's review.

@avikivity
Member

Closing in favor of #13761 to avoid confusion.

@avikivity avikivity closed this May 14, 2023
5 participants