
mvcc: make schema upgrades gentle, old version of the PR #13104

Closed
wants to merge 35 commits into from

Conversation

michoecho
Contributor

@michoecho michoecho commented Mar 7, 2023

After a schema change, memtable and cache have to be upgraded to the new schema. Currently, they are upgraded (on the first access after a schema change) atomically, i.e. all rows of the entry are upgraded with one non-preemptible call. This is one of the last vestiges of the times when partitions were treated atomically, and it is a well-known source of numerous large stalls.

This series makes schema upgrades gentle (preemptible). This is done by co-opting the existing MVCC machinery.
Before the series, all partition_versions in the partition_entry chain have the same schema, and an entry upgrade replaces the entire chain with a single squashed and upgraded version.
After the series, each partition_version has its own schema. A partition entry upgrade happens simply by adding an empty version with the new schema to the head of the chain. Row entries are upgraded to the current schema on-the-fly by the cursor during reads, and upgraded permanently by the MVCC version merge ongoing in the background.

The series:

  1. Does some code cleanup in the mutation_partition area.
  2. Adds a schema field to partition_version and removes it from its containers (partition_snapshot, cache_entry, memtable_entry).
  3. Adds upgrading variants of constructors and apply() for row and its wrappers.
  4. Prepares partition_snapshot_row_cursor, mutation_partition_v2::apply_monotonically and partition_snapshot::merge_partition_versions for dealing with heterogeneous version chains.
  5. Modifies partition_entry::upgrade to perform upgrades by extending the version chain with a new schema instead of squashing it to a single upgraded version.
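
The scheme in point 5 can be sketched in a few lines. This is a hypothetical, heavily simplified model (the names schema_t, partition_version, partition_entry here are toy stand-ins; the real types carry much more state), only meant to show why the upgrade itself becomes O(1):

```cpp
#include <cassert>
#include <memory>

// Minimal sketch: each version carries its own schema pointer, and an entry
// upgrade just pushes an empty version with the new schema onto the head of
// the chain. No row is touched here; per-row work is deferred to readers
// (on-the-fly upgrade by the cursor) and to the background version merge.
struct schema_t { int id; };

struct partition_version {
    std::shared_ptr<const schema_t> schema;
    int row_count = 0;                        // stand-in for the real payload
    std::unique_ptr<partition_version> older; // next (older) version in chain
};

struct partition_entry {
    std::unique_ptr<partition_version> head;

    void upgrade(std::shared_ptr<const schema_t> new_schema) {
        auto v = std::make_unique<partition_version>();
        v->schema = std::move(new_schema);
        v->older = std::move(head);
        head = std::move(v);
    }
};
```

The cost of the upgrade no longer depends on the number of rows in the entry, which is what makes it gentle.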

I'm particularly worried about the correctness of the modifications to the version merging algorithm. It is already probably the trickiest and most complex piece of Scylla code, and these modifications make reasoning even harder.
Before the series, we always merge the newer version into the older version. After the series, we also merge in the other direction, because after the merge we want to end up with the newer schema, not the older schema. This changes some assumptions of apply_monotonically, especially those concerning eviction order and the "older versions are evicted first" rule.
Between preemption safety, exception safety, rows and dummies and last dummies, evictable and non-evictable entries, continuity, row tombstones, row compaction, and now also forward and backward merging, the complexity of this function is brain-melting. I have no idea how to cover it with trustworthy tests. The puny test I've added is barely more than a sanity check.

A possible performance problem with this method is that the same row entry might have to be upgraded many times by the cursor until it's finally upgraded permanently by a version merge, whereas before the series each entry only needed to be upgraded once. Note that this problem is only relevant for big partition entries (the version merge starts eagerly and is only backgrounded on first preempt, so for small entries it will usually complete synchronously). I didn't do any performance tests, but schema upgrades currently (both before and after this series) seem to be implemented inefficiently: for every cell in the upgraded entry, a schema upgrade looks up its column name in a hashmap. If this is a performance problem, we can follow up with an optimization which avoids this by preparing an integer-based column mapping for the schema pair instead.
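
To illustrate the proposed follow-up optimization, here is a sketch (invented names, not Scylla's actual API) of precomputing an integer-based column mapping once per schema pair, so that the per-cell work drops from a hashmap lookup to a vector index:

```cpp
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: compute, once per (old schema, new schema) pair, a
// vector mapping each old column id (the vector index) to the corresponding
// new column id, or nullopt if the column was dropped in the new schema.
// Upgrading a cell then only indexes into this vector.
using column_id = size_t;

std::vector<std::optional<column_id>> make_column_mapping(
        const std::vector<std::string>& old_columns,
        const std::unordered_map<std::string, column_id>& new_columns_by_name) {
    std::vector<std::optional<column_id>> mapping;
    mapping.reserve(old_columns.size());
    for (const auto& name : old_columns) {
        auto it = new_columns_by_name.find(name);
        if (it == new_columns_by_name.end()) {
            mapping.push_back(std::nullopt); // column dropped in new schema
        } else {
            mapping.push_back(it->second);
        }
    }
    return mapping;
}
```

The mapping would be built once and cached for the schema pair, amortizing the name lookups over all upgraded cells.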

Fixes #2577

In some mixed-schema apply helpers for tests, the source mutation
is accidentally copied with the target schema. Fix that.

Nothing seems to be currently affected by this bug; I found it when it
was triggered by a new test I was adding.

mutation_assertion gives no information whatsoever about *which*
assert in the test failed.

This patch adds a std::source_location argument to assert_that,
which can be used to supply the assert with its call location,
to make the error more informative.
Reads with multiple schema versions have a different code path now,
so add schema changes to the test, to test these paths too.
…ally()

The comment suggests that the order of sentinel insertion is meaningful because
of the resulting eviction order. But the sentinels are added to the tracker
with the two-argument version of insert(), which inserts the second argument
into the LRU right before the (more recent) first argument.
Thus the eviction order of sentinels is decided explicitly, and it doesn't
rely on insertion order.
apply_weak is just an alias for apply(), and most of its variants
are dead code. Get rid of it.
Most variants of apply() and apply_monotonically() in mutation_partition_v2
are leftovers from mutation_partition, and are unused. Thus they only
add confusion and maintenance burden. Since we will be modifying
apply_monotonically() in upcoming patches, let's clean them up, lest
the variants become stale.

This patch removes all unused variants of apply() and apply_monotonically()
and "manually inlines" the variants which aren't used often enough to carry
their own weight.

In the end, we are left with a single apply_monotonically() and two convenience
apply() helpers.

The single apply_monotonically() accepts two schema arguments. This facility
is unimplemented and unused as of this patch - the two arguments are always
the same - but it will be implemented and used in later parts of the series.

The comment refers to "other", but it means "pe". Fix that.

The patch also adds a bit of context to the mutation_partition jargon
("evictability" and "continuity"), by reminding how it relates
to the concrete abstractions: memtable and cache.

deletable_row::apply() and deletable_row::apply_monotonically() are
exchangeable. There is no reason for this aliasing.
This patch gets rid of the redundancy.
Arbitrarily, it leaves an apply() with a `const deletable_row&` argument
and an apply_monotonically() with a `deletable_row&&` argument.

Ditto for row::apply().
It will be used in upcoming patches.
Only a copying variant is provided. As an optimization, a moving
variant might be added in the future.
…tion_v2 constructor

We don't have a convention for when to pass `schema_ptr` and when to pass
`const schema&` around.
In general, IMHO the natural convention for such a situation is to pass the
shared pointer if the callee might extend the lifetime of the schema,
and pass a reference otherwise. But we convert between them willy-nilly
through shared_from_this().

While passing a reference to a function which actually expects a shared_ptr
can make sense (e.g. due to the fact that smart pointers can't be passed in
registers), the other way around is rather pointless.

This patch takes one occurrence of that and modifies the parameter to a reference.

Since enable_shared_from_this makes shared pointer parameters and reference
parameters interchangeable, this is a purely cosmetic change.
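
A small sketch (with an invented schema_t type, not Scylla's schema class) of why enable_shared_from_this makes the two parameter styles interchangeable:

```cpp
#include <memory>

// When a class derives from enable_shared_from_this, a callee that receives
// only `const T&` can still recover a shared_ptr and extend the object's
// lifetime, so reference and shared_ptr parameters are interchangeable for
// objects already owned by a shared_ptr.
struct schema_t : std::enable_shared_from_this<schema_t> {
    int id;
    explicit schema_t(int i) : id(i) {}
};

std::shared_ptr<const schema_t> keep_alive(const schema_t& s) {
    // Valid only if *s* is already managed by some shared_ptr; otherwise
    // shared_from_this() throws std::bad_weak_ptr.
    return s.shared_from_this();
}
```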
…n constructor

Cosmetic change. See the preceding commit for details.
…n::difference

Cosmetic change. See the preceding commit for details.
@michoecho michoecho marked this pull request as ready for review March 14, 2023 00:55
@michoecho
Contributor Author

Ready for review. Requesting review from @tgrabiec

@scylladb-promoter
Contributor

Currently, partition_version does not reference its schema.
All partition_versions reachable from an entry/snapshot have the same schema,
which is referenced in memtable_entry/cache_entry/partition_snapshot.

To enable gentle schema upgrades, we want to use the existing background
version merging mechanism. To achieve that, we will move the schema reference
into partition_version, and we will allow neighbouring MVCC versions to have
different schemas, and we will merge them on-the-fly during reads and
persistently during background version merges.
This way, an upgrade will boil down to adding a new empty version with
the new schema.

This patch adds the _schema field to partition_version and propagates
the schema pointer to it from the version's containers (entry/snapshot).
Subsequent patches will remove the schema references from the containers,
because they are now redundant.
After adding a _schema field to each partition version,
the field in partition_snapshot is redundant. It can always be recovered
from the latest version. Remove it.

After adding a _schema field to each partition version,
the field in cache_entry is redundant. It can always be recovered
from the latest version. Remove it.

After adding a _schema field to each partition version,
the field in memtable_entry is redundant. It can always be recovered
from the latest version. Remove it.
…ead()

partition_entry now contains a reference to its schema, so it no longer
needs to be supplied by the caller.
operator<< accepts a schema& and a partition_entry&. But since the latter
now contains a reference to its schema inside, the former is redundant.
Remove it.
It will be used in upcoming commits.

A factory function is used, rather than an actual constructor,
because we want to delegate the (easy) case of equal schemas
to the existing single-schema constructor.
And that's impossible (at least without invoking a copy/move
constructor) to do with only constructors.
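
A sketch (hypothetical toy types, not the real row/schema classes) of why a static factory can delegate the equal-schema case to an existing constructor while a constructor alone cannot branch between sibling constructors at run time:

```cpp
// A constructor cannot conditionally delegate to a different constructor,
// but a static factory can branch and invoke whichever constructor applies,
// returning the result by value (with copy elision).
struct Schema { int version; };

struct Row {
    bool upgraded;
    explicit Row(const Schema&) : upgraded(false) {}      // single-schema ctor
    Row(const Schema&, const Schema&) : upgraded(true) {} // upgrading ctor

    static Row construct(const Schema& ours, const Schema& theirs) {
        if (ours.version == theirs.version) {
            return Row(ours);     // delegate the easy, equal-schema case
        }
        return Row(ours, theirs); // genuinely different schemas: upgrade
    }
};
```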
It will be used during upcoming changes in partition_snapshot_row_cursor
to prepare it for multi-schema snapshots.
…otonically()

A helper which will be used during upcoming changes to
mutation_partition_v2::apply_monotonically(), which will extend it to merging
versions with different schemas.
A helper which will be used in upcoming commits.
A helper which will be used during upcoming changes to
mutation_partition_v2::apply_monotonically(), which will extend it to merging
versions with different schemas.
Adds a logalloc::region argument to upgrade_schema().
It's currently unused, but will be further propagated to
partition_entry::upgrade() in an upcoming patch.
m2.upgrade(new_schema);
cache.set_schema(new_schema);
}

Member

Regarding tests, I think we need a stresser that builds a model of the data outside the row cache and compares it with the real row_cache under a mix of writes, reads, schema changes, and evictions. The snapshot_source that backs the cache can read the model, no need for real sstables.

Contributor

test_concurrent_reads_and_eviction() is such a test. The role of external model is served by the "mutation" object. We use memtable_snapshot_source instead of sstables as underlying source.

@avikivity
Member

After a schema change, memtable and cache have to be upgraded to the new schema. Currently, they are upgraded (on the first access after a schema change) atomically, i.e. all rows of the entry are upgraded with one non-preemptible call. This is one of the last vestiges of the times when partitions were treated atomically, and it is a well-known source of numerous large stalls.

This series makes schema upgrades gentle (preemptible). This is done by co-opting the existing MVCC machinery. Before the series, all partition_versions in the partition_entry chain have the same schema, and an entry upgrade replaces the entire chain with a single squashed and upgraded version. After the series, each partition_version has its own schema. A partition entry upgrade happens simply by adding an empty version with the new schema to the head of the chain. Row entries are upgraded to the current schema on-the-fly by the cursor during reads, and upgraded permanently by the MVCC version merge ongoing in the background.

The series:

  1. Does some code cleanup in the mutation_partition area.
  2. Adds a schema field to partition_version and removes it from its containers (partition_snapshot, cache_entry, memtable_entry).

This is neutral in terms of non-transient memory footprint, yes?

  1. Adds upgrading variants of constructors and apply() for row and its wrappers.
  2. Prepares partition_snapshot_row_cursor, mutation_partition_v2::apply_monotonically and partition_snapshot::merge_partition_versions for dealing with heterogeneous version chains.
  3. Modifies partition_entry::upgrade to perform upgrades by extending the version chain with a new schema instead of squashing it to a single upgraded version.

I'm particularly worried about the correctness of the modifications to the version merging algorithm. It is already probably the trickiest and most complex piece of Scylla code, and these modifications make reasoning even harder. Before the series, we always merge the newer version into the older version. After the series, we also merge in the other direction, because after the merge we want to end up with the newer schema, not the older schema.

Could we not make things more orthogonal? For example, instead of an upgrading merge, create a clone of the version with the new schema, replace the old version with the upgraded schema, and then proceed with the version merge?

A different way to make things orthogonal is to create a row_cache object for every schema version and chain those row_cache objects, but that comes with its own problems. I admit the motivation for that was to avoid duplicating the schema everywhere, not a desire to simplify things.

This changes some assumptions of apply_monotonically, especially those concerning eviction order and the "older versions are evicted first" rule. Between preemption safety, exception safety, rows and dummies and last dummies, evictable and non-evictable entries, continuity, row tombstones, row compaction, and now also forward and backward merging, the complexity of this function is brain-melting. I have no idea how to cover it with trustworthy tests. The puny test I've added is barely more than a sanity check.

A possible performance problem with this method is that the same row entry might have to be upgraded many times by the cursor until it's finally upgraded permanently by a version merge, whereas before the series each entry only needed to be upgraded once.

I think it's reasonable, as long as we schedule the version merge to happen "soon".

Note that this problem is only relevant for big partition entries (the version merge starts eagerly and is only backgrounded on first preempt, so for small entries it will usually complete synchronously).

That's fine. For small partitions we get good behavior, and for large partitions anything is better than the stall we have today.

I didn't do any performance tests, but schema upgrades currently (both before and after this series) seem to be implemented inefficiently: for every cell in the upgraded entry, a schema upgrade looks up its column name in a hashmap. If this is a performance problem, we can follow up with an optimization which avoids this by preparing an integer-based column mapping for the schema pair instead.

Fixes #2577

Please annotate the cover letter with perf-simple-query results.

@avikivity
Member

Yes, cache_entry can destroy the schema.
schema is not supposed to be a managed object, so we don't care about
allocator context. But it contains a managed object (raw_value), and that's
a problem. raw_value should be switched to be non-managed. If we have no
alternative for managed_bytes, we can create a wrapper which destroys it in
std allocator context.

@avikivity Using managed_bytes as the common type for the buffer fragmentation project was a mistake.

Yes. I started a project to make bytes_ostream compatible with managed_bytes in its internal layout so it can use the same iterators, but not have to share the same class name. This avoids a bunch of templates but can also solve this problem.

We always copy across cache/memtable <-> rest of system boundaries, so using the same type only saves us effort in allowing to share the mutation representation between memtable/cache and non-cache, but perhaps it's not worth it.

@tgrabiec
Contributor

tgrabiec commented Mar 15, 2023 via email

@tgrabiec
Contributor

tgrabiec commented Mar 15, 2023 via email

@michoecho
Contributor Author

The series:
Does some code cleanup in the mutation_partition area.
Adds a schema field to partition_version and removes it from its containers (partition_snapshot, cache_entry, memtable_entry).

This is neutral in terms of non-transient memory footprint, yes?

Yes.

Could we not make things more orthogonal? For example, instead of an upgrading merge, create a clone of the version with the new schema, replace the old version with the upgraded schema, and then proceed with the version merge?

We could. It's on the table. In this scenario, we could have a simpler "upgrading merge" function, upgrade_monotonically(), which would be used only for heterogeneous neighbors, and only if the two versions cover disjoint ranges — it would be simpler, because the function would only need to move entries and insert sentinels on preempt, but wouldn't have to care about intersections, which are the hard part. (And if there were two heterogeneous versions which don't cover disjoint ranges, then an empty version with the newer schema would be inserted between them to enforce that).

I can try writing that variant too. Lowering the complexity of apply_monotonically can be worth it.

I think it's reasonable, as long as we schedule the version merge to happen "soon".

Version merge is scheduled as soon as the snapshot which owns the version is destroyed, but version merges happen in order (except for the "first task quota" of the merge, which happens synchronously with the destruction, until it's preempted). The merges happen in a special scheduling group, mem_compaction. (Not making a point, just reminding how it works).

Please annotate the cover letter with perf-simple-query results.

Will do.

@michoecho
Contributor Author

michoecho commented Mar 15, 2023

@tgrabiec

This code will be read more often than grepped.

In my case grepping is an essential part of reading. If I'm looking at some code, usually I also want to know where it's used, so I grep for its calls. If the name repeats too much, you need to fire up an IDE to find the call site, and I don't like that.

But since you insist, I will rename them back to just apply().

The clutter in tests weighs more because it's what you look at when trying
to understand the failed test. Pushing clutter down is good.

Are you saying that pushing clutter from test code to core code is good? I disagree with the sentiment, IMHO readability of core code takes priority over readability of tests. But I can add the helper back.

@avikivity
Member

There is a third option yet, flattening the versions. In this option we allow rows with duplicate clustering keys (and duplicate static rows), the duplicates are disambiguated by the maximum timestamp within the row's mutation. A version is an instruction not to merge rows across a particular timestamp. Reading a version means reading the partition and ignoring higher-timestamp rows. To remove a version, merge adjacent rows that cross the version's timestamp.

It only works for LWT and raft (since monotonic timestamps are required), but MVCC is useless anyway outside LWT/Raft.

@scylladb-promoter
Contributor

@michoecho
Contributor Author

michoecho commented Mar 15, 2023

CI failed on the ol' faithful ttl_test.TestDistributedTTL.test_ttl_is_respected_on_repair

Restarting. This time it will pass, I can feel it.

@avikivity
Member

CI failed on the ol' faithful ttl_test.TestDistributedTTL.test_ttl_is_respected_on_repair

Restarting. This time it will pass, I can feel it.

Was an issue filed/updated?

@michoecho
Contributor Author

michoecho commented Mar 15, 2023

Was an issue filed/updated?

It's #11684. It's failing some CI every day.

Edit: I confused it with #13137, but they are in one area.

@avikivity
Member

Asked to prioritize.

@scylladb-promoter
Contributor

@@ -97,6 +97,7 @@ public:
row();
~row();
row(const schema&, column_kind, const row&);
static row construct_row(const schema& our_schema, const schema& their_schema, column_kind, const row&);
Contributor

Why not just row::construct()?

Contributor Author

construct_row is greppable, construct is not.
In the absence of other arguments, I prefer the more greppable name.

Contributor

It's a static method, so users will call row::construct(), which is as greppable.

Contributor Author

@michoecho michoecho Mar 16, 2023

But from within the implementation of class row you can call it without the row:: scope — just construct(). So if you really want to grep all uses of a function in C++, you have to grep just its name, not the qualified name. (And there are also function pointers, and typedefs, and more). A static method with a generic name is more greppable than a non-static method with a generic name, but still not truly greppable.

Contributor

True. But should we go there? And rename row::size() to row::row_size() too, and apply() to mpv2_apply()? It seems like a slippery slope to me, which hurts readability. This problem is solved by IDEs with "Find Usages". @avikivity What's your take?

Contributor Author

@michoecho michoecho Mar 16, 2023

This problem is solved by IDEs with "Find Usages".

It is solved by IDEs until it isn't.

For example:

scylladb/serializer.hh

Lines 254 to 257 in 5705df7

template<typename T, typename Output>
inline void serialize(Output& out, const std::reference_wrapper<T> v) {
    serializer<T>::write(out, v.get());
}

clangd seems to have no idea where the implementations of write() are, even though this is statically resolvable information.
If it had a unique name, I could at least grep for it — but I can't.

The problem goes the other way too: the IDE won't show you a usage of a method if the object has templated type.

It's not like I like it — I don't think greppability is "the right thing". If "find references" actually worked, I wouldn't advocate it. But C++ tooling was retrofitted onto the language and is less reliable than plain text.


@@ -355,7 +353,7 @@ stop_iteration mutation_partition_v2::apply_monotonically(const schema& s, const
position_in_partition::after_key(s, src_e.position()), is_dummy::yes, is_continuous::no));
if (src_e.position().is_clustering_row()) {
s2 = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(s,
current_allocator().construct<rows_entry>(p_s,
Contributor

s2 is this_sentinel, so should use "s" (our schema). s1 should use p_s.

It currently doesn't matter because sentinels carry no cells, only positions, which, as a clustering key, can be interpreted using any schema version. key schema is immutable. But better be precise in case something in the rows_entry becomes version-dependent in the future.

Contributor Author

Right. I managed to confuse s1 and s2 multiple times when working on this code. This is a leftover of one such occasion. I must have thought that it was you who made this mistake and that I was fixing it. Must be some deep-seated instinct that 1 == "us" and 2 == "them".

Contributor Author

Must be some deep-seated instinct that 1 == "us" and 2 == "them".

Now that I think about it, this might be engrained by the grammatical notion of first person.

@@ -440,18 +447,26 @@ stop_iteration mutation_partition_v2::apply_monotonically(const schema& s, const
}
Contributor

s1 above should use p_s, for consistency.

Contributor Author

Ah, so this is what I had in mind when I changed s2 to p_s.

// the last dummy entry which holds the partition entry linked in LRU.
i->swap(src_e);
} else {
i->apply_monotonically(s, p_s, std::move(src_e));
Contributor

You can just drop src_e in this case due to "information monotonicity" ("i" is in the newer version).

Note, it wouldn't be safe to call apply_monotonically() if "i" was in the older version because apply_monotonically can fail mid-way and we would violate "information monotonicity" by leaving src_e with less information (some cells moved).

Contributor Author

True.

i->swap(src_e);
} else {
i->apply_monotonically(s, p_s, std::move(src_e));
}
tracker->remove(src_e);
} else {
// Avoid row compaction if no newer range tombstone.
do_compact = (src_e.range_tombstone() + src_e.row().deleted_at().regular()) >
(i->range_tombstone() + i->row().deleted_at().regular());
memory::on_alloc_point();
Contributor

For reverse order merging the comparison in do_compact should be inverted:

do_compact = (src_e.range_tombstone() + src_e.row().deleted_at().regular()) <
                            (i->range_tombstone() + i->row().deleted_at().regular());

The rationale is this. We want to compact if one version has a range tombstone which covers cells in the other version. To avoid expensive exact checks, we just assume that if the later version contains a range tombstone, then it probably covers cells in the older version (which is usually the case because timestamp values usually follow the order of apply). For !same_schema, it's "i" which is in the newer version, so we want to compact if i->range_tombstone() is later than src_e.range_tombstone().

current_allocator().destroy(prev);
if (merged_into->is_referenced()) {
merged_into->back_reference().release();
}
Contributor

This if block seems redundant to me and a bit confusing. If merged_into is _version, the assignment below takes care of it. If it's not _version, it's incorrect to just release it, but this case is taken care of by previous if (prev->is_referenced()), so not the case here.

Contributor Author

@michoecho michoecho Mar 17, 2023

The assignment below creates a reference to merged_into, and it's forbidden to create a reference to an already-referenced version. (This is enforced by this assert:)

assert(!_version->_backref);

So we have to explicitly release the existing reference to merged_into before we create the new one.
But I concur that this looks wrong. I'll add a comment which explains it.

@@ -949,6 +949,7 @@ SEASTAR_TEST_CASE(test_eviction_after_schema_change) {
rd.fill_buffer().get();
}

tracker.cleaner().drain().get0();
Contributor

What was the failure without this?

Background cleaning shouldn't prevent eviction.

Contributor Author

The failure is in BOOST_REQUIRE_EQUAL(tracker.get_stats().rows, 0);.
It's about last dummies — they are only removed when their version is finally destroyed by the mutation cleaner.
Since there is no guarantee that the code above doesn't leave any unreferenced versions, the cleaning has to be explicitly finished before the check happens.

That said, you made me realize that the code works in a different way than I intended. I wanted to start the version merge synchronously after the upgrade, so that small partitions are upgraded without leaving garbage. But now I see that it doesn't happen, because this check is blocking that, and the upgrading merge never starts synchronously:

// Allocating sections require the region to be reclaimable
// which means that they cannot be nested.
// It is, however, possible, that if the snapshot is taken
// inside an allocating section and then an exception is thrown
// this function will be called to clean up even though we
// still will be in the context of the allocating section.
if (!region.reclaiming_enabled()) {
return stop_iteration::no;
}

I wonder what to do about this. Maybe change it so that the worker allocating section is called only if reclaiming isn't already disabled by an outer section?

return _worker_state->alloc_section(region, [&] {

@tgrabiec
Contributor

tgrabiec commented Mar 16, 2023 via email

@tgrabiec
Contributor

tgrabiec commented Mar 16, 2023 via email

@tgrabiec
Contributor

@avikivity Using managed_bytes as the common type for the buffer fragmentation project was a mistake.

Please open an issue so that we don't forget about it.

@michoecho
Contributor Author

@avikivity Using managed_bytes as the common type for the buffer fragmentation project was a mistake.

Please open an issue so that we don't forget about it.

Opened #13216

@avikivity
Member

Let's move this forward. What's remaining?

@michoecho
Contributor Author

Let's move this forward. What's remaining?

Status reminder:

The PR aimed to implement gentle upgrades by making the schema a per-version property (instead of a per-entry property), and adding support for heterogeneous entries to reading and merging algorithms. This way an upgrade boils down to adding an empty version.
But this requires merging versions forward (moving rows from an older version into a newer version), which is the opposite of the regular merging direction. We ultimately decided that requiring the merge algorithm to work in both directions makes it too hard to comprehend (and believe in its correctness).
Tomek gave an alternative idea in #13104 (comment), but it's not that simple either. It avoids forward merges, but it adds much complexity to how the version chain is managed.

So, in short, progress was stalled mostly because no design seems good, and we (I) are afraid of pushing yet another bunch of complexity into an already very complicated section of code.

I was just planning to rewrite this into a variant of Tomek's idea (where during upgrade we add a newer version after the older version, and the two are treated from the outside as a unit, and they aren't merged from/to until their merge/upgrade finishes), but the tracing project (and other issues) diverted my time.

A secondary unsolved problem (#13104 (comment)) is that an assumption made by the PR — that a merge can be triggered on upgrade to make upgrades of small partition entries effectively synchronous — was wrong, because of allocating_section nesting rules (there can be no nesting). Since both the cache update and the merge have their own allocating section, the merge refuses to happen during the update, so it becomes asynchronous, rather than synchronous. We have to avoid this somehow, maybe by bypassing the merge section if we are inside the update section.

@avikivity
Member

@tgrabiec and @michoecho, please agree on a direction and move forward, it's a pity to let this rot.

@mykaul mykaul added the Eng-1 label May 1, 2023
@michoecho michoecho changed the title mvcc: make schema upgrades gentle mvcc: make schema upgrades gentle, old version of the PR May 4, 2023
@mykaul
Contributor

mykaul commented May 14, 2023

@tgrabiec and @michoecho, please agree on a direction and move forward, it's a pity to let this rot.

@tgrabiec , @michoecho - please move forward, I'd be happy to see it in before 5.3 branch.

@michoecho
Contributor Author

@tgrabiec and @michoecho, please agree on a direction and move forward, it's a pity to let this rot.

@tgrabiec , @michoecho - please move forward, I'd be happy to see it in before 5.3 branch.

Note: a new version of this PR is at #13761. It's awaiting @tgrabiec's review.

@avikivity
Member

Closing in favor of #13761 to avoid confusion.

@avikivity avikivity closed this May 14, 2023
5 participants