
mv: handle different ERMs for base and view table #18816

Closed
wants to merge 1 commit

Conversation

wmitros
Contributor

@wmitros wmitros commented May 22, 2024

When calculating the base-view mapping while the topology is changing, we may encounter a situation where the base table noticed the change in its effective replication map while the view table hasn't, or vice-versa. This can happen because the ERM update may be performed during the preemption between taking the base ERM and view ERM, or, due to f2ff701, the update may have just been performed partially when we are taking the ERMs.

Until now, we assumed that the ERMs were synchronized when finding the base-view endpoint mapping, so in particular, we were using the topology from the base's ERM to check the datacenters of all endpoints. Now that the ERMs are more likely to differ, we may try to get the datacenter of a view endpoint that doesn't exist in the base's topology, causing us to crash.

This is fixed in the following patch by using the view table's topology for endpoints coming from the view ERM. The mapping resulting from the call might now be a temporary mapping between endpoints in different topologies, but it still maps base and view replicas 1-to-1.

Fixes: #17786
Fixes: #18709
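
For illustration, a hedged sketch of the shape of the fix (the view_topology lines appear in the diff below; the surrounding code is paraphrased and variable names may differ from the actual source):

// In get_view_natural_endpoint(): resolve endpoints taken from the view ERM
// against the view ERM's own topology. Previously a single topology (taken
// from the base ERM) was used for both, which can crash when one ERM has
// already seen a node the other hasn't.
auto& base_topology = base_erm->get_token_metadata_ptr()->get_topology();
auto& view_topology = view_erm->get_token_metadata_ptr()->get_topology();
// ... base_endpoints and view_endpoints are collected from base_erm and
// view_erm respectively, each filtered by datacenter via its own topology ...
auto replica = view_endpoints[base_it - base_endpoints.begin()];
return view_topology.get_node(replica).endpoint();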

@wmitros wmitros requested review from nyh and piodul as code owners May 22, 2024 08:09
@wmitros wmitros added the backport/none (Backport is not required) label May 22, 2024
@wmitros wmitros added the backport/5.2 and backport/5.4 (Issues that should be backported to 5.4 branch once they'll be fixed) labels and removed the backport/none label May 22, 2024
@wmitros
Contributor Author

wmitros commented May 22, 2024

We should probably backport this after all. Unless f2ff701 is backported, #17786 should not happen, but theoretically there's nothing preventing #18709 from happening.

@@ -1625,6 +1625,7 @@ get_view_natural_endpoint(
}
}

auto& view_topology = view_erm->get_token_metadata_ptr()->get_topology();
Contributor

This is fixed in the following patch by using

There is only one commit here, so I suppose you meant "in this patch"?

Contributor

Yes, I guess that's what he meant - the patch that follows this description ;-) But I agree it's ambiguous and it should be called "this patch".

Contributor Author

I fixed the message

# In the test, we create a keyspace with a table and a materialized view.
# We then start writing to the table, causing the materialized view to be updated.
# While the writes are in progress, we add nodes to the cluster.
# The test verifies that the node doesn't crash as a result of the node addition,
Contributor

The important thing here to reproduce the bug is that the node is from a new DC, right? If so, then it should be noted here.

Contributor

No. The important thing is for the replication strategy to be network topology (NetworkTopologyStrategy); the DC can be the same.

# and that the materialized view is eventually consistent.
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_add_node_during_mv_writes(manager: ManagerClient):
Contributor

Does this test alone reliably reproduce the issue with unchanged Scylla? I think we need to trigger a situation where the view ERM is updated but the base table ERM is not - either by introducing a sleep like in the issue description or by sleeping in the code that updates the ERMs. Can we do it with error injection?

Contributor

Right. Please explain in the commit message, or even in this discussion, how often this test fails before the fix. If it fails 100% of the time, that's fantastic. But to be realistic, even if it fails 10% of the time it's good enough - if the bug ever regresses, we'll see this test fail 10% of the time and it will be fine.

@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_add_node_during_mv_writes(manager: ManagerClient):
cfg = {'force_gossip_topology_changes': True}
Contributor

I know that you probably took the test from the issue description verbatim, but is legacy mode necessary to reproduce it? If not, can we use non-legacy mode?

Contributor Author

I can't seem to make the test fail consistently without it, maybe @gleb-cloudius knows more precisely why it was added in the initial reproducer?

Contributor

That is not my test either. It is from #18676. The timing with and without raft will be very different though. Maybe a barrier can even prevent the problem from happening.

Comment on lines 73 to 84
await asyncio.sleep(2)

logger.info(nums)
logger.info(sum(nums))

res = await cql.run_async("select pk, v from ks.t")
for r in res:
logger.info(f"pk: {r.pk}, v: {r.v}")

res = await cql.run_async("select v, pk from ks.t_view")
for r in res:
logger.info(f"v: {r.v}, pk: {r.pk}")
Contributor

Are these logs necessary for the reproducer? Or the sleep?

Contributor Author

I removed these from the test after all

# and that the materialized view is eventually consistent.
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_add_node_during_mv_writes(manager: ManagerClient):
Contributor

I think you should add sleep injections to increase the probability of encountering different e_r_ms:

  • one sleep in MV update code between obtaining the different e_r_ms
  • one sleep in the code which updates e_r_ms in storage_service

(i.e., the same injections that @gleb-cloudius did in his reproducers)

Otherwise this test is very unlikely to reproduce either problem.

You should verify that this test is really a regression test. I.e., run it before and after the fix, and verify that it reproduces both failures before the fix, but reproduces neither after the fix.

In fact, I believe we may need two tests. How can this test reproduce #17786? There is no decommission done in this test... only bootstraps.
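
To make the suggestion concrete, a minimal sketch of what such a sleep injection could look like on the C++ side; the injection name, placement and duration are illustrative assumptions, and the sleep overload of inject() is assumed to match Scylla's error-injection API:

// Hypothetical injection point between taking the base ERM and the view ERM,
// widening the window in which an ERM update can land between the two reads.
auto base_erm = base_table.get_effective_replication_map();
co_await utils::get_local_injector().inject("delay_between_base_and_view_erm",
                                            std::chrono::milliseconds(500));
auto view_erm = view_table.get_effective_replication_map();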

Contributor

One test would be enough, if it does both bootstrap and decommission (concurrent to doing MV updates), but you should verify that it reproduces both before the fix (e.g. by commenting out bootstrap, checking that it reproduces the decommission issue, then bringing bootstrap back but commenting out decommission, checking that it reproduces the bootstrap issue)

Contributor

Yes, please make it do both. As for sleeps, I agree with @nyh that if it reproduces even 10% of the time without them, it's enough.

Contributor

I remember @gleb-cloudius that you did not manage to reproduce the decommission case with hundreds of runs until you put the sleep in e_r_m update code.

So we do need at least that sleep to hit the decommission case. Not sure about the bootstrap case -- what was your reproducibility without sleep?

Contributor

I remember @gleb-cloudius that you did not manage to reproduce the decommission case with hundreds of runs until you put the sleep in e_r_m update code.

I do not remember all the attempts, but it was definitely not easily reproducible. Let's run the test 10 times, and if it does not fail, then we need sleeps like in my examples.


await manager.server_add()
await manager.server_add()
await manager.server_add()
Contributor

With the right sleep injections only a single bootstrap should be required, with no sleep injections I think there's significant risk that we won't hit it.

Contributor Author

Right, I left only one additional server in the updated test

@kbr-scylla
Contributor

Added backport 6.0 label

Contributor

@nyh nyh left a comment


Thanks for fixing this. In commit 4505a86 I fixed most of the places where we used the same ERM for base and view, but I guess I fixed the topology one wrong.


@wmitros
Contributor Author

wmitros commented May 27, 2024

Regarding #18816 (comment), #18816 (comment) and #18816 (comment):
The test did in fact require injections to be even remotely consistent. When testing them, I also found a small inconsistency in the solution: when the base ERM changes so that the node generating the view update is no longer a base replica, it shouldn't pair with itself even if there's a view replica on that node as well. This probably wouldn't cause any issues, but we should be aware of this situation, so I added a check for it in the code.
In the newest version, the test fails practically every time on the node addition in dev mode, but even with the large 8s delay, it's only failing ~10% of the time in debug mode.
On decommission, in both dev and debug, the test is failing almost 50% of the time - we can't really get it any higher without more intrusive injections or just more test iterations, because for the failure to happen we need the base ERM to be updated before the view ERM, and the order seems random.

@scylladb-promoter
Contributor

🔴 CI State: FAILURE

✅ - Build
❌ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 topology_custom/test_mv_topology_change

Failed Tests (4/300):

Build Details:

  • Duration: 54 min
  • Builder: spider1.cloudius-systems.com

@wmitros
Contributor Author

wmitros commented May 27, 2024

🔴 CI State: FAILURE

✅ - Build ❌ - Unit Tests Custom The following new/updated tests ran 100 times for each mode: 🔹 topology_custom/test_mv_topology_change

Failed Tests (4/300):

Build Details:

  • Duration: 54 min
  • Builder: spider1.cloudius-systems.com

Looks like I hit https://github.com/scylladb/seastar/blob/master/doc/lambda-coroutine-fiasco.md. I couldn't use coroutine::lambda here, so instead I named the _abort_source capture at the start of the lambda coroutine so that it isn't lost after the co_await
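
For readers unfamiliar with the fiasco, a hedged sketch of the workaround described above (the injection name and handler shape are illustrative):

// The lambda object owning the captures may be destroyed once the coroutine
// first suspends, unless the callee keeps the lambda alive. Workaround: bind
// the captured reference into a coroutine-frame local *before* the first
// co_await, so later uses don't go through the (possibly dead) lambda object.
co_await utils::get_local_injector().inject("some_injection",
        [this] (auto& handler) -> future<> {
    auto& as = _abort_source;   // named before the first co_await
    co_await seastar::sleep_abortable(std::chrono::seconds(5), as);
});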

@@ -1625,25 +1625,26 @@ get_view_natural_endpoint(
}
}
Member

ERMs

That's one of our worst acronyms

Contributor

We should pay Scylla employees for fast-typing lessons, so they won't have to use acronyms in their code :-)

Scylla could also sponsor the AAAAAAA - the American Association Against Abuse of Abbreviations And Acronyms.

Contributor

We should pay Scylla employees for fast-typing lessons, so they won't have to use acronyms in their code :-)

And for wider monitors!

@@ -1658,7 +1659,7 @@ get_view_natural_endpoint(
return {};
}
auto replica = view_endpoints[base_it - base_endpoints.begin()];
return topology.get_node(replica).endpoint();
return view_topology.get_node(replica).endpoint();
Member

I don't understand why we have two different topologies. The whole idea behind effective_replication_map_ptr is that while you hold it, topology can only change in a compatible way. So why not have view_erm be base_erm? Why have two variables in the first place?

Contributor

Topology is an object inside an ERM. ERMs are per table (for tablets, I guess, but they have to be at least per keyspace even for vnodes, since they are the result of applying a replication strategy to a topology).

Member

I don't think they're per table, I think they're per keyspace. @bhalevy

Member

They are per-keyspace with vnodes, and per-table with tablets.

Member

@tgrabiec is there a way to get to a "super effective replication map" that will cover both a base table and its views?

Contributor

@bhalevy maybe the function of "ERM" changed since I wrote this code (although it does still work). As I noted above, what I needed to query was base_erm->get_replicas(base_token) - the list of three replicas that currently hold this token. The result of this is different between different tables. So if this is not an "erm" and is rather a "tablet map that is only a part of the erm", then I guess this code needs two "tablet maps", not two "erm"s. I'd appreciate it if someone who understands how this ERM stuff evolved would change this code to use the right objects, or at least fix the terminology. But again, the current code does work (and if it breaks, there's a test that exercises this case of different tablets for the base and view, and would have caught such breakage).

Contributor

@nyh, as I mentioned in #18816 (comment), with tablets, the base and view have different tablet maps, but they can both live in the same erm (which is kept in the keyspace).

I do not see it from the code you posted. The code gets per_table_replication_strategy and creates an individual erm for each table based on the schema id. Maybe I read the code wrong. Yes, the topology will be the same for all erms if they were created from the same token metadata, but the replication for each tablet table is different.

Contributor

I have a test that reproduces it

Is it committed to the repo?

Contributor

@kbr-scylla yes, test/topology_experimental_raft/test_mv_tablets.py:test_tablet_mv_simple_6node, from 4505a86. Its comment:

A simple reproducer for a bug of forgetting that the view table has a different tablet mapping from the base: using the wrong tablet mapping for the base table or view table can cause us to send a view update to the wrong view replica - or not send a view update at all. A row that we write on the base table will not be readable in the view. We start a large-enough cluster (6 nodes) to increase the probability that the mapping is different for the one row we write, so that the test will fail if the bug exists. Reproduces #16227.

Member

@nyh, as I mentioned in #18816 (comment), with tablets, the base and view have different tablet maps, but they can both live in the same erm (which is kept in the keyspace).

I do not see it from the code you posted. The code gets per_table_replication_strategy and creates an individual erm for each table based on the schema id. Maybe I read the code wrong. Yes, the topology will be the same for all erms if they were created from the same token metadata, but the replication for each tablet table is different.

Yes, sorry. I mixed the terms.
They will indeed have different effective_replication_map:s but they do share the same token_metadata_ptr and locator::topology.
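
A small illustrative sketch of that distinction (names approximate the discussion above, not the exact API):

// Tablets: base and view tables have distinct ERMs, i.e. distinct replica
// sets per token (cf. base_erm->get_replicas(base_token) mentioned above).
auto base_erm = base_table.get_effective_replication_map();
auto view_erm = view_table.get_effective_replication_map();
// ERMs created from the same token_metadata version share one topology:
auto& base_topology = base_erm->get_token_metadata_ptr()->get_topology();
auto& view_topology = view_erm->get_token_metadata_ptr()->get_topology();
// During a topology change, however, the two ERMs may be taken from
// different token_metadata versions, so the topologies can differ -
// exactly the case this PR handles.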

@scylladb-promoter
Contributor

🔴 CI State: FAILURE

✅ - Build
❌ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 topology_custom/test_mv_topology_change

Failed Tests (2/300):

Build Details:

  • Duration: 2 hr 59 min
  • Builder: i-0898df8fcf0cfc7ee (m5ad.12xlarge)

When calculating the base-view mapping while the topology
is changing, we may encounter a situation where the base
table noticed the change in its effective replication map
while the view table hasn't, or vice-versa. This can happen
because the ERM update may be performed during the preemption
between taking the base ERM and view ERM, or, due to f2ff701,
the update may have just been performed partially when we are
taking the ERMs.

Until now, we assumed that the ERMs were synchronized when finding
the base-view endpoint mapping, so in particular, we were
using the topology from the base's ERM to check the datacenters of
all endpoints. Now that the ERMs are more likely to not be the same,
we may try to get the datacenter of a view endpoint that doesn't
exist in the base's topology, causing us to crash.

This is fixed in this patch by using the view table's topology for
endpoints coming from the view ERM. The mapping resulting from the
call might now be a temporary mapping between endpoints in different
topologies, but it still maps base and view replicas 1-to-1.

Fixes: scylladb#17786
Fixes: scylladb#18709
@wmitros
Contributor Author

wmitros commented May 27, 2024

🔴 CI State: FAILURE

✅ - Build ❌ - Unit Tests Custom The following new/updated tests ran 100 times for each mode: 🔹 topology_custom/test_mv_topology_change

Failed Tests (2/300):

Build Details:

  • Duration: 2 hr 59 min
  • Builder: i-0898df8fcf0cfc7ee (m5ad.12xlarge)

Looks like the lambda coroutine wasn't the (only) problem - indeed, the code in future<> inject(const std::string_view& name, waiting_handler_fun func, bool share_messages = true) seems like it should keep the lambda alive until completion.
I found the more likely problem - I was using the _abort_source from the storage_service where replicate_to_all_cores was called, even though the code was being executed on all cores. I fixed it in the rebase by taking the _abort_source from the corresponding storage_service on each shard for the sleep_abortable. I left the fiasco workaround anyway, because I'm still not completely sure it wasn't a contributor to the problem, and it's a small difference anyway.
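
A hedged sketch of the described fix; the member and method names are assumptions based on the discussion, not the exact patch:

// Wrong: this lambda runs on every shard, but `this` - and therefore
// _abort_source - belongs to the shard that called replicate_to_all_cores:
//   co_await seastar::sleep_abortable(delay, _abort_source);
// Fixed: take the abort source from the shard-local storage_service instead:
co_await container().invoke_on_all([] (storage_service& ss) -> future<> {
    co_await seastar::sleep_abortable(std::chrono::seconds(2), ss._abort_source);
});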

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 topology_custom/test_mv_topology_change
✅ - Container Test
✅ - dtest
✅ - dtest with topology changes
✅ - Unit Tests

Build Details:

  • Duration: 6 hr 45 min
  • Builder: i-07ccb190095e38dc3 (m5ad.12xlarge)

@piodul piodul requested a review from kbr-scylla May 28, 2024 06:14
@piodul
Contributor

piodul commented May 28, 2024

Queued

@gleb-cloudius
Contributor

Should be backported to 6.0, no?

@kbr-scylla
Contributor

Should be backported to 6.0, no?

Yes, in fact it has all the possible OSS backport labels

@gleb-cloudius
Contributor

Should be backported to 6.0, no?

Yes, in fact it has all the possible OSS backport labels

Ah, I missed that 6.0 was added as well. The branch was created while this PR was in review.

@nyh
Contributor

nyh commented May 28, 2024

Queued

This may have caused a promotion failure - #18941 :-(

Labels
area/materialized views · area/topology changes · backport/5.4 (Issues that should be backported to 5.4 branch once they'll be fixed) · backport/6.0 · promoted-to-master · status/regression · status/release blocker (Preventing a release from being promoted)
9 participants