mvcc: make schema upgrades gentle, old version of the PR #13104
Conversation
In some mixed-schema apply helpers for tests, the source mutation is accidentally copied with the target schema. Fix that. Nothing seems to be currently affected by this bug; I found it when it was triggered by a new test I was adding.
mutation_assertion gives no information whatsoever about *which* assert in the test failed. This patch adds a std::source_location argument to assert_that, which can be used to supply the assert with its call location, to make the error more informative.
Reads with multiple schema versions now take a different code path, so add schema changes to the test, to exercise these paths too.
…ally() The comment suggests that the order of sentinel insertion is meaningful because of the resulting eviction order. But the sentinels are added to the tracker with the two-argument version of insert(), which inserts the second argument into the LRU right before the (more recent) first argument. Thus the eviction order of sentinels is decided explicitly, and it doesn't rely on insertion order.
apply_weak is just an alias for apply(), and most of its variants are dead code. Get rid of it.
Most variants of apply() and apply_monotonically() in mutation_partition_v2 are leftovers from mutation_partition, and are unused. Thus they only add confusion and maintenance burden. Since we will be modifying apply_monotonically() in upcoming patches, let's clean them up, lest the variants become stale. This patch removes all unused variants of apply() and apply_monotonically() and "manually inlines" the variants which aren't used often enough to carry their own weight. In the end, we are left with a single apply_monotonically() and two convenience apply() helpers. The single apply_monotonically() accepts two schema arguments. This facility is unimplemented and unused as of this patch - the two arguments are always the same - but it will be implemented and used in later parts of the series.
The comment refers to "other", but it means "pe". Fix that. The patch also adds a bit of context to the mutation_partition jargon ("evictability" and "continuity"), by reminding how it relates to the concrete abstractions: memtable and cache.
deletable_row::apply() and deletable_row::apply_monotonically() are interchangeable. There is no reason for this aliasing. This patch gets rid of the redundancy. Arbitrarily, it leaves an apply() with a `const deletable_row&` argument and an apply_monotonically() with a `deletable_row&&` argument. Ditto for row::apply().
It will be used in upcoming patches. Only a copying variant is provided. As an optimization, a moving variant might be added in the future.
…tion_v2 constructor We don't have a convention for when to pass `schema_ptr` and when to pass `const schema&` around. In general, IMHO the natural convention for such a situation is to pass the shared pointer if the callee might extend the object's lifetime, and to pass a reference otherwise. But we convert between them willy-nilly through shared_from_this(). While passing a reference to a function which actually expects a shared_ptr can make sense (e.g. due to the fact that smart pointers can't be passed in registers), the other way around is rather pointless. This patch takes one occurrence of that and changes the parameter to a reference. Since enable_shared_from_this makes shared pointer parameters and reference parameters interchangeable, this is a purely cosmetic change.
…n constructor Cosmetic change. See the preceding commit for details.
…n::difference Cosmetic change. See the preceding commit for details.
ec0e406 to ecc3217
Ready for review. Requesting review from @tgrabiec
CI state
Currently, partition_version does not reference its schema. All partition_versions reachable from an entry/snapshot have the same schema, which is referenced in memtable_entry/cache_entry/partition_snapshot. To enable gentle schema upgrades, we want to use the existing background version merging mechanism. To achieve that, we will move the schema reference into partition_version and allow neighbouring MVCC versions to have different schemas, merging them on-the-fly during reads and persistently during background version merges. This way, an upgrade will boil down to adding a new empty version with the new schema. This patch adds the _schema field to partition_version and propagates the schema pointer to it from the version's containers (entry/snapshot). Subsequent patches will remove the schema references from the containers, because they become redundant.
After adding a _schema field to each partition version, the field in partition_snapshot is redundant. It can always be recovered from the latest version. Remove it.
After adding a _schema field to each partition version, the field in cache_entry is redundant. It can always be recovered from the latest version. Remove it.
After adding a _schema field to each partition version, the field in memtable_entry is redundant. It can always be recovered from the latest version. Remove it.
…ead() partition_entry now contains a reference to its schema, so it no longer needs to be supplied by the caller.
operator<< accepts a schema& and a partition_entry&. But since the latter now contains a reference to its schema inside, the former is redundant. Remove it.
It will be used in upcoming commits. A factory function is used, rather than an actual constructor, because we want to delegate the (easy) case of equal schemas to the existing single-schema constructor. And that's impossible (at least without invoking a copy/move constructor) to do with only constructors.
It will be used during upcoming changes in partition_snapshot_row_cursor to prepare it for multi-schema snapshots.
…otonically() A helper which will be used during upcoming changes to mutation_partition_v2::apply_monotonically(), which will extend it to merging versions with different schemas.
A helper which will be used in upcoming commits.
A helper which will be used during upcoming changes to mutation_partition_v2::apply_monotonically(), which will extend it to merging versions with different schemas.
Adds a logalloc::region argument to upgrade_schema(). It's currently unused, but will be further propagated to partition_entry::upgrade() in an upcoming patch.
m2.upgrade(new_schema);
cache.set_schema(new_schema);
}
Regarding tests, I think we need a stresser that builds a model of the data outside the row cache and compares it with the real row_cache under a mix of writes, reads, schema changes, and evictions. The snapshot_source that backs the cache can read the model, no need for real sstables.
test_concurrent_reads_and_eviction() is such a test. The role of external model is served by the "mutation" object. We use memtable_snapshot_source instead of sstables as underlying source.
This is neutral in terms of non-transient memory footprint, yes?
Could we not make things more orthogonal? For example, instead of an upgrading merge, create a clone of the version with the new schema, replace the old version with the upgraded clone, and then proceed with the version merge? A different way to make things orthogonal is to create a row_cache object for every schema version and chain those row_cache objects, but that comes with its own problems. I admit the motivation for that was to avoid duplicating the schema everywhere, not a desire to simplify things.
I think it's reasonable, as long as we schedule the version merge to happen "soon".
That's fine. For small partitions we get good behavior, and for large partitions anything is better than the stall we have today.
Please annotate the cover letter with perf-simple-query results.
Yes. I started a project to make bytes_ostream compatible with managed_bytes in its internal layout, so it can use the same iterators without having to share the same class name. This avoids a bunch of templates but can also solve this problem. We always copy across the cache/memtable <-> rest-of-system boundary, so using the same type would only save us the effort of allowing the mutation representation to be shared between memtable/cache and non-cache code, and perhaps it's not worth it.
On Wed, Mar 15, 2023 at 2:52 PM Michał Chojnowski commented on this pull request, in mutation/mutation_partition_v2.cc (#13104 (comment)):
> @@ -114,17 +113,27 @@ std::ostream& operator<<(std::ostream& out, const apply_resume& res) {
return out << "{" << int(res._stage) << ", " << res._pos << "}";
}
-stop_iteration mutation_partition_v2::apply_monotonically(const schema& s, mutation_partition_v2&& p, cache_tracker* tracker,
- mutation_application_stats& app_stats, is_preemptible preemptible, apply_resume& res, is_evictable evictable) {
- return apply_monotonically(s, std::move(p), tracker, app_stats,
- preemptible ? default_preemption_check() : never_preempt(), res, evictable);
+void mutation_partition_v2::apply_v1(const schema& s, mutation_partition&& p) {
+ apply_v2(s, mutation_partition_v2(s, std::move(p)));
+}
+void mutation_partition_v2::apply_v2(const schema& s, mutation_partition_v2&& p, cache_tracker* tracker, is_evictable evictable) {
In this case, personal preference. Overloads break grepping for the name,
so I don't like giving the same name to two functions without a reason.
This code will be read more often than grepped. In this case it looks odd
because mutation_partition_v2 is the default (same object). v1 can be
justified.
On Wed, Mar 15, 2023 at 2:54 PM Michał Chojnowski commented on this pull request, in test/boost/mutation_test.cc (#13104 (comment)):
> @@ -2101,7 +2101,7 @@ SEASTAR_TEST_CASE(test_v2_merging_in_non_evictable_snapshot) {
auto to_apply = mutation_partition_v2(s, m2_v2);
apply_resume res;
- while (result_v2.apply_monotonically(s, std::move(to_apply), no_cache_tracker, app_stats,
+ while (result_v2.apply_monotonically(s, s, std::move(to_apply), no_cache_tracker, app_stats,
You are trading 10 chars of clutter in the test for >4 lines of clutter in
mutation_partition_v2.{hh,cc}. IMO the helper wasn't carrying its own
weight.
The clutter in tests weighs more because it's what you look at when trying
to understand the failed test. Pushing clutter down is good.
Yes.
We could. It's on the table. In this scenario, we could have a simpler "upgrading merge" function, upgrade_monotonically(), which would be used only for heterogeneous neighbors, and only if the two versions cover disjoint ranges — it would be simpler, because the function would only need to move entries and insert sentinels on preempt, but wouldn't have to care about intersections, which are the hard part. (And if there were two heterogeneous versions which don't cover disjoint ranges, then an empty version with the newer schema would be inserted between them to enforce that). I can try writing that variant too. Lowering the complexity of
Version merge is scheduled as soon as the snapshot which owns the version is destroyed, but version merges happen in order (except for the "first task quota" of the merge, which happens synchronously with the destruction, until it's preempted). The merges happen in a special scheduling group,
Will do.
In my case grepping is an essential part of reading. If I'm looking at some code, usually I also want to know where it's used, so I grep for its calls. If the name repeats too much, you need to fire up an IDE to find the call site, and I don't like that. But since you insist, I will rename them back to just apply().
Are you saying that pushing clutter from test code to core code is good? I disagree with the sentiment; IMHO readability of core code takes priority over readability of tests. But I can add the helper back.
There is a third option yet: flattening the versions. In this option we allow rows with duplicate clustering keys (and duplicate static rows); the duplicates are disambiguated by the maximum timestamp within the row's mutation. A version is an instruction not to merge rows across a particular timestamp. Reading a version means reading the partition and ignoring higher-timestamp rows. To remove a version, merge adjacent rows that cross the version's timestamp. It only works for LWT and Raft (since monotonic timestamps are required), but MVCC is useless outside LWT/Raft anyway.
CI state
CI failed on the ol' faithful ttl_test.TestDistributedTTL.test_ttl_is_respected_on_repair. Restarting. This time it will pass, I can feel it.
Was an issue filed/updated?
Asked to prioritize.
CI state
@@ -97,6 +97,7 @@ public:
row();
~row();
row(const schema&, column_kind, const row&);
static row construct_row(const schema& our_schema, const schema& their_schema, column_kind, const row&);
Why not just row::construct()?
construct_row is greppable, construct is not. In the absence of other arguments, I prefer the more greppable name.
It's a static method, so users will call row::construct(), which is as greppable.
But from within the implementation of class row you can call it without the row:: scope — just construct(). So if you really want to grep all uses of a function in C++, you have to grep for just its name, not the qualified name. (And there are also function pointers, and typedefs, and more.) A static method with a generic name is more greppable than a non-static method with a generic name, but still not truly greppable.
True. But should we go there? And rename row::size() to row::row_size() too, and apply() to mpv2_apply()? It seems like a slippery slope to me, and it hurts readability. This problem is solved by IDEs with "Find Usages". @avikivity What's your take?
This problem is solved by IDE's with "Find Usages".
It is solved by IDEs until it isn't.
For example:
Lines 254 to 257 in 5705df7
template<typename T, typename Output>
inline void serialize(Output& out, const std::reference_wrapper<T> v) {
    serializer<T>::write(out, v.get());
}
clangd seems to have no idea where the implementations of write() are, even though this is statically resolvable information.
If it had a unique name, I could at least grep for it — but I can't.
The problem goes the other way too: the IDE won't show you a usage of a method if the object has templated type.
It's not like I like it — I don't think greppability is "the right thing". If "find references" actually worked, I wouldn't advocate it. But C++ tooling was retrofitted onto the language and is less reliable than plain text.
@@ -355,7 +353,7 @@ stop_iteration mutation_partition_v2::apply_monotonically(const schema& s, const
position_in_partition::after_key(s, src_e.position()), is_dummy::yes, is_continuous::no));
if (src_e.position().is_clustering_row()) {
s2 = alloc_strategy_unique_ptr<rows_entry>(
-                current_allocator().construct<rows_entry>(s,
+                current_allocator().construct<rows_entry>(p_s,
s2 is this_sentinel, so it should use "s" (our schema). s1 should use p_s.
It currently doesn't matter, because sentinels carry no cells, only positions, which, as clustering keys, can be interpreted using any schema version (the key schema is immutable). But better to be precise, in case something in the rows_entry becomes version-dependent in the future.
Right. I managed to confuse s1 and s2 multiple times when working on this code. This is a leftover of one such occasion. I must have thought that it was you who made this mistake and that I was fixing it. Must be some deep-seated instinct that 1 == "us" and 2 == "them".
Must be some deep-seated instinct that 1 == "us" and 2 == "them".
Now that I think about it, this might be ingrained by the grammatical notion of first person.
@@ -440,18 +447,26 @@ stop_iteration mutation_partition_v2::apply_monotonically(const schema& s, const
}
s1 above should use p_s, for consistency.
Ah, so this is what I had in mind when I changed s2 to p_s.
// the last dummy entry which holds the partition entry linked in LRU.
i->swap(src_e);
} else {
i->apply_monotonically(s, p_s, std::move(src_e));
You can just drop src_e in this case due to "information monotonicity" ("i" is in the newer version).
Note, it wouldn't be safe to call apply_monotonically() if "i" was in the older version because apply_monotonically can fail mid-way and we would violate "information monotonicity" by leaving src_e with less information (some cells moved).
True.
    i->swap(src_e);
} else {
    i->apply_monotonically(s, p_s, std::move(src_e));
}
tracker->remove(src_e);
} else {
// Avoid row compaction if no newer range tombstone.
do_compact = (src_e.range_tombstone() + src_e.row().deleted_at().regular()) >
             (i->range_tombstone() + i->row().deleted_at().regular());
memory::on_alloc_point();
For reverse order merging the comparison in do_compact should be inverted:
do_compact = (src_e.range_tombstone() + src_e.row().deleted_at().regular()) <
(i->range_tombstone() + i->row().deleted_at().regular());
The rationale is this. We want to compact if one version has a range tombstone which covers cells in the other version. To avoid expensive exact checks, we just assume that if the later version contains a range tombstone, then it probably covers cells in the older version (which is usually the case because timestamp values usually follow the order of apply). For !same_schema, it's "i" which is in the newer version, so we want to compact if i->range_tombstone() is later than src_e.range_tombstone().
current_allocator().destroy(prev);
if (merged_into->is_referenced()) {
    merged_into->back_reference().release();
}
This if block seems redundant to me and a bit confusing. If merged_into is _version, the assignment below takes care of it. If it's not _version, it's incorrect to just release it, but that case is taken care of by the previous if (prev->is_referenced()), so it cannot occur here.
The assignment below creates a reference to merged_into, and it's forbidden to create a reference to an already-referenced version. This is enforced by this assert:
scylladb/mutation/partition_version.hh
Line 158 in aad2afd
assert(!_version->_backref);
So we have to explicitly release the existing reference to merged_into before we create the new one.
But I concur that this looks wrong. I'll add a comment which explains it.
@@ -949,6 +949,7 @@ SEASTAR_TEST_CASE(test_eviction_after_schema_change) {
    rd.fill_buffer().get();
}

tracker.cleaner().drain().get0();
What was the failure without this?
Background cleaning shouldn't prevent eviction.
The failure is in BOOST_REQUIRE_EQUAL(tracker.get_stats().rows, 0).
It's about last dummies — they are only removed when their version is finally destroyed by the mutation cleaner.
Since there is no guarantee that the code above doesn't leave any unreferenced versions, the cleaning has to be explicitly finished before the check happens.
That said, you made me realize that the code works in a different way than I intended. I wanted to start the version merge synchronously after the upgrade, so that small partitions are upgraded without leaving garbage. But now I see that it doesn't happen, because this check is blocking that, and the upgrading merge never starts synchronously:
scylladb/mutation/mutation_partition.cc
Lines 2554 to 2562 in 5705df7
// Allocating sections require the region to be reclaimable
// which means that they cannot be nested.
// It is, however, possible, that if the snapshot is taken
// inside an allocating section and then an exception is thrown
// this function will be called to clean up even though we
// still will be in the context of the allocating section.
if (!region.reclaiming_enabled()) {
    return stop_iteration::no;
}
I wonder what to do about this. Maybe change it so that the worker allocating section is called only if reclaiming isn't already disabled by an outer section?
scylladb/mutation/mutation_partition.cc
Line 2565 in 5705df7
return _worker_state->alloc_section(region, [&] {
On Wed, Mar 15, 2023 at 5:09 PM, in reply to the answers above:
The main increase in complexity for schema upgrade comes from the fact that the order of merging is reversed. We can avoid reversing the order, and avoid introducing a new upgrade_monotonically() algorithm, by doing this on version merge. When merging v2 (newer, s2) and v1 (older, s1), and !same_schema, insert v1' of schema s2 after v1 and merge v1 into v1' using the standard direction (upgrading rows on the fly). We only need to remember the fact that v1' is of the later schema, so when resuming we don't allocate yet another version. Then remove v1, and v2 and v1' both have the same schema version.
On Wed, Mar 15, 2023 at 5:46 PM, in reply to the above:
Depends on the kind of clutter. If it's test-specific, it can be a helper which is local to the test code.
Please open an issue so that we don't forget about it.
Opened #13216
Let's move this forward. What's remaining?
Status reminder: The PR aimed to implement gentle upgrades by making the schema a per-version property (instead of a per-entry property), and adding support for heterogeneous entries to the reading and merging algorithms. This way an upgrade boils down to adding an empty version. So, in short, progress stalled mostly because no design seems good, and we (I) are afraid of pushing yet another bunch of complexity into an already very complicated section of code. I was just planning to rewrite this into a variant of Tomek's idea (where during upgrade we add a newer version after the older version, and the two are treated from the outside as a unit, and aren't merged from/to until their merge/upgrade finishes), but the tracing project (and other issues) diverted my time. A secondary unsolved problem (#13104 (comment)) is that an assumption made by the PR — that a merge can be triggered on upgrade to make upgrades of small partition entries effectively synchronous — was wrong, because of allocating_section nesting rules (there can be no nesting). Since both the cache update and the merge have their own allocating section, the merge refuses to happen during the update, so it becomes asynchronous rather than synchronous. We have to avoid this somehow, maybe by bypassing the merge's section if we are inside the update section.
@tgrabiec and @michoecho, please agree on a direction and move forward, it's a pity to let this rot.
@tgrabiec , @michoecho - please move forward, I'd be happy to see it in before 5.3 branch. |
Note: a new version of this PR is at #13761. It's awaiting @tgrabiec's review. |
Closing in favor of #13761 to avoid confusion. |
After a schema change, memtable and cache have to be upgraded to the new schema. Currently, they are upgraded (on the first access after a schema change) atomically, i.e. all rows of the entry are upgraded in one non-preemptible call. This is one of the last vestiges of the times when partitions were treated atomically, and it is a well-known source of numerous large stalls.
This series makes schema upgrades gentle (preemptible). This is done by co-opting the existing MVCC machinery.
Before the series, all partition_versions in the partition_entry chain have the same schema, and an entry upgrade replaces the entire chain with a single squashed and upgraded version.
After the series, each partition_version has its own schema. A partition entry upgrade happens simply by adding an empty version with the new schema to the head of the chain. Row entries are upgraded to the current schema on-the-fly by the cursor during reads, and upgraded permanently by the MVCC version merge ongoing in the background.
The series:
- Does some code cleanup in the mutation_partition area.
- Adds a schema field to partition_version and removes it from its containers (partition_snapshot, cache_entry, memtable_entry).
- Adds two-schema variants of the merging helpers to row and its wrappers.

I'm particularly worried about the correctness of the modifications to the version merging algorithm. It is already probably the trickiest and most complex piece of Scylla code, and these modifications make reasoning even harder.
Before the series, we always merge the newer version into the older version. After the series, we also merge in the other direction, because after the merge we want to end up with the newer schema, not the older one. This changes some assumptions of apply_monotonically, especially those concerning eviction order and the "older versions are evicted first" rule.
Between preemption safety, exception safety, rows and dummies and last dummies, evictable and non-evictable entries, continuity, row tombstones, row compaction, and now also forward and backward merging, the complexity of this function is brain-melting. I have no idea how to cover it with trustworthy tests. The puny test I've added is barely more than a sanity check.
A possible performance problem with this method is that the same row entry might have to be upgraded many times by the cursor until it is finally upgraded permanently by a version merge, whereas before the series each entry only needed to be upgraded once. Note that this problem is only relevant for big partition entries (the version merge starts eagerly and is only backgrounded on first preempt, so for small entries it will usually complete synchronously). I didn't do any performance tests, but schema upgrades currently (both before and after this series) seem to be implemented inefficiently: for every cell in the upgraded entry, a schema upgrade looks up its column name in a hashmap. If this is a performance problem, we can follow up with an optimization which avoids it by preparing an integer-based column mapping for the schema pair instead.
Fixes #2577