
mv: handle different ERMs for base and view table #18816

Closed
wants to merge 1 commit

Conversation

wmitros
Contributor

@wmitros wmitros commented May 22, 2024

When calculating the base-view mapping while the topology is changing, we may encounter a situation where the base table noticed the change in its effective replication map while the view table hasn't, or vice-versa. This can happen because the ERM update may be performed during the preemption between taking the base ERM and view ERM, or, due to f2ff701, the update may have just been performed partially when we are taking the ERMs.

Until now, we assumed that the ERMs were synchronized when finding the base-view endpoint mapping, so in particular, we were using the topology from the base's ERM to check the datacenters of all endpoints. Now that the ERMs are more likely to differ, we may try to get the datacenter of a view endpoint that doesn't exist in the base's topology, causing us to crash.

This is fixed in the following patch by using the view table's topology for endpoints coming from the view ERM. The mapping resulting from the call might now be a temporary mapping between endpoints in different topologies, but it still maps base and view replicas 1-to-1.

Fixes: #17786
Fixes: #18709
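
For illustration, a hedged sketch of the shape of the fix (the view_topology lines appear in the diff below; the surrounding code is paraphrased and variable names may differ from the actual source):

// In get_view_natural_endpoint(): resolve endpoints taken from the view ERM
// against the view ERM's own topology. Previously a single topology (taken
// from the base ERM) was used for both, which can crash when one ERM has
// already seen a node the other hasn't.
auto& base_topology = base_erm->get_token_metadata_ptr()->get_topology();
auto& view_topology = view_erm->get_token_metadata_ptr()->get_topology();
// ... base_endpoints and view_endpoints are collected from base_erm and
// view_erm respectively, each filtered by datacenter via its own topology ...
auto replica = view_endpoints[base_it - base_endpoints.begin()];
return view_topology.get_node(replica).endpoint();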

@wmitros wmitros requested review from nyh and piodul as code owners May 22, 2024 08:09
@wmitros wmitros added the backport/none (Backport is not required) label May 22, 2024
@wmitros wmitros added the backport/5.2 and backport/5.4 (Issues that should be backported to 5.4 branch once they'll be fixed) labels and removed the backport/none label May 22, 2024
@wmitros
Contributor Author

wmitros commented May 22, 2024

We should probably backport this after all. Unless f2ff701 is backported, #17786 should not happen, but theoretically there's nothing preventing #18709 from happening.

@@ -1625,6 +1625,7 @@ get_view_natural_endpoint(
}
}

auto& view_topology = view_erm->get_token_metadata_ptr()->get_topology();
Contributor

This is fixed in the following patch by using

There is only one commit here, so I suppose you meant "in this patch"?

Contributor

Yes, I guess that's what he meant - the patch that follows this description ;-) But I agree it's ambiguous and it should be called "this patch".

Contributor Author

I fixed the message

# In the test, we create a keyspace with a table and a materialized view.
# We then start writing to the table, causing the materialized view to be updated.
# While the writes are in progress, we add nodes to the cluster.
# The test verifies that the node doesn't crash as a result of the node addition,
Contributor

The important thing here to reproduce the bug is that the node is from a new DC, right? If so, then it should be noted here.

Contributor

No. The important thing is for the replication strategy to be network topology (NetworkTopologyStrategy); the DC can be the same.

# and that the materialized view is eventually consistent.
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_add_node_during_mv_writes(manager: ManagerClient):
Contributor

Does this test alone reliably reproduce the issue with unchanged Scylla? I think we need to trigger a situation where the view ERM is updated but the base table ERM is not - either by introducing a sleep like in the issue description or by sleeping in the code that updates the ERMs. Can we do it with error injection?

Contributor

Right. Please explain in the commit message, or even in this discussion, how often this test fails before the fix. If it fails 100% of the time, that's fantastic. But to be realistic, even if it fails 10% of the time it's good enough - if the bug ever regresses, we'll see this test fail 10% of the time and it will be fine.

@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_add_node_during_mv_writes(manager: ManagerClient):
cfg = {'force_gossip_topology_changes': True}
Contributor

I know that you probably took the test from the issue description verbatim, but is legacy mode necessary to reproduce it? If not, can we use non-legacy mode?

Contributor Author

I can't seem to make the test fail consistently without it, maybe @gleb-cloudius knows more precisely why it was added in the initial reproducer?

Contributor

That is not my test either. It is from #18676. The timing with and without raft will be very different though. Maybe a barrier can even prevent the problem from happening.

Comment on lines 73 to 84
await asyncio.sleep(2)

logger.info(nums)
logger.info(sum(nums))

res = await cql.run_async("select pk, v from ks.t")
for r in res:
logger.info(f"pk: {r.pk}, v: {r.v}")

res = await cql.run_async("select v, pk from ks.t_view")
for r in res:
logger.info(f"v: {r.v}, pk: {r.pk}")
Contributor

Are these logs necessary for the reproducer? Or the sleep?

Contributor Author

I removed these from the test after all

# and that the materialized view is eventually consistent.
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_add_node_during_mv_writes(manager: ManagerClient):
Contributor

I think you should add sleep injections to increase the probability of encountering different e_r_ms:

  • one sleep in MV update code between obtaining the different e_r_ms
  • one sleep in the code which updates e_r_ms in storage_service

(i.e., the same injections that @gleb-cloudius did in his reproducers)

Otherwise this test is very unlikely to reproduce either problem.

You should verify that this test is really a regression test. I.e., run it before and after the fix, and verify that it reproduces both failures before the fix, but reproduces neither after the fix.

In fact, I believe we may need two tests. How can this test reproduce #17786? There is no decommission done in this test... only bootstraps.
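
To make the suggestion concrete, a minimal sketch of what such a sleep injection could look like on the C++ side; the injection name, placement and duration are illustrative assumptions, and the sleep overload of inject() is assumed to match Scylla's error-injection API:

// Hypothetical injection point between taking the base ERM and the view ERM,
// widening the window in which an ERM update can land between the two reads.
auto base_erm = base_table.get_effective_replication_map();
co_await utils::get_local_injector().inject("delay_between_base_and_view_erm",
                                            std::chrono::milliseconds(500));
auto view_erm = view_table.get_effective_replication_map();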

Contributor

One test would be enough, if it does both bootstrap and decommission (concurrent to doing MV updates), but you should verify that it reproduces both before the fix (e.g. by commenting out bootstrap, checking that it reproduces the decommission issue, then bringing bootstrap back but commenting out decommission, checking that it reproduces the bootstrap issue)

Contributor

Yes, please make it do both. As for sleeps, I agree with @nyh that if it reproduces even 10% of the time without them, it's enough.

Contributor

I remember @gleb-cloudius that you did not manage to reproduce the decommission case with hundreds of runs until you put the sleep in e_r_m update code.

So we do need at least that sleep to hit the decommission case. Not sure about the bootstrap case -- what was your reproducibility without sleep?

Contributor

I remember @gleb-cloudius that you did not manage to reproduce the decommission case with hundreds of runs until you put the sleep in e_r_m update code.

I do not remember all the attempts, but it was definitely not easily reproducible. Let's run the test 10 times, and if it does not fail, then we need sleeps like in my examples.


await manager.server_add()
await manager.server_add()
await manager.server_add()
Contributor

With the right sleep injections only a single bootstrap should be required, with no sleep injections I think there's significant risk that we won't hit it.

Contributor Author

Right, I left only one additional server in the updated test

@kbr-scylla
Contributor

Added backport 6.0 label

Contributor

@nyh nyh left a comment


Thanks for fixing this. In commit 4505a86 I fixed most of the places where we used the same ERM for base and view, but I guess I fixed the topology one wrong.


@wmitros
Contributor Author

wmitros commented May 27, 2024

Regarding #18816 (comment), #18816 (comment) and #18816 (comment):
The test did in fact require injections to be even remotely consistent. When testing them, I also found a small inconsistency in the solution: when the base ERM changes so that the node generating the view update is no longer a base replica, it shouldn't pair with itself even if there's a view replica on that node as well. This probably wouldn't cause any issues, but we should be aware of this situation, so I added a check for it in the code.
In the newest version, the test fails practically every time on the node addition in dev mode, but even with the large 8s delay, it's only failing ~10% of the time in debug mode.
On decommission, in both dev and debug, the test is failing almost 50% of the time - we can't really get it any higher without more intrusive injections or just more test iterations, because for the failure to happen we need the base ERM to be updated before the view ERM, and the order seems random.

@scylladb-promoter
Contributor

🔴 CI State: FAILURE

✅ - Build
❌ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 topology_custom/test_mv_topology_change

Failed Tests (4/300):

Build Details:

  • Duration: 54 min
  • Builder: spider1.cloudius-systems.com

@wmitros
Contributor Author

wmitros commented May 27, 2024

🔴 CI State: FAILURE

✅ - Build ❌ - Unit Tests Custom The following new/updated tests ran 100 times for each mode: 🔹 topology_custom/test_mv_topology_change

Failed Tests (4/300):

Build Details:

  • Duration: 54 min
  • Builder: spider1.cloudius-systems.com

Looks like I hit https://github.com/scylladb/seastar/blob/master/doc/lambda-coroutine-fiasco.md. I couldn't use coroutine::lambda here, so instead I named the _abort_source capture at the start of the lambda coroutine so that it isn't lost after the co_await
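
For readers unfamiliar with the fiasco, a hedged sketch of the workaround described above (the injection name and handler shape are illustrative):

// The lambda object owning the captures may be destroyed once the coroutine
// first suspends, unless the callee keeps the lambda alive. Workaround: bind
// the captured reference into a coroutine-frame local *before* the first
// co_await, so later uses don't go through the (possibly dead) lambda object.
co_await utils::get_local_injector().inject("some_injection",
        [this] (auto& handler) -> future<> {
    auto& as = _abort_source;   // named before the first co_await
    co_await seastar::sleep_abortable(std::chrono::seconds(5), as);
});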

@@ -1625,25 +1625,26 @@ get_view_natural_endpoint(
}
}
Member

ERMs

That's one of our worst acronyms

Contributor

We should pay Scylla employees for fast-typing lessons, so they won't have to use acronyms in their code :-)

Scylla could also sponsor the AAAAAAA - the American Association Against Abuse of Abbreviations And Acronyms.

Contributor

We should pay Scylla employees for fast-typing lessons, so they won't have to use acronyms in their code :-)

And for wider monitors!

@@ -1658,7 +1659,7 @@ get_view_natural_endpoint(
return {};
}
auto replica = view_endpoints[base_it - base_endpoints.begin()];
return topology.get_node(replica).endpoint();
return view_topology.get_node(replica).endpoint();
Member

I don't understand why we have two different topologies. The whole idea behind effective_replication_map_ptr is that while you hold it, topology can only change in a compatible way. So why not have view_erm be base_erm? Why have two variables in the first place?

Contributor

Topology is an object inside an ERM. ERMs are per table (for tablets, I guess, but they have to be at least per keyspace even for vnodes, since they are the result of applying a replication strategy to a topology).

Member

I don't think they're per table, I think they're per keyspace. @bhalevy

Member

They are per-keyspace with vnodes, and per-table with tablets.

Member

@tgrabiec is there a way to get to a "super effective replication map" that will cover both a base table and its views?

Contributor

@bhalevy maybe the function of "ERM" changed since I wrote this code (although it does still work). As I noted above, what I needed to query was base_erm->get_replicas(base_token) - the list of three replicas that currently hold this token. The result of this is different between different tables. So if this is not an "erm" and is rather a "tablet map that is only a part of the erm", then I guess this code needs two "tablet maps", not two "erm"s. I'd appreciate it if someone who understands how this ERM stuff evolved would change this code to use the right objects, or at least fix the terminology. But again, the current code does work (and if it breaks, there's a test that exercises this case of different tablets for the base and view, and would have caught such breakage).

Contributor

@nyh, as I mentioned in #18816 (comment), with tablets, the base and view have different tablet maps, but they can both live in the same erm (which is kept in the keyspace).

I do not see it from the code you posted. The code gets per_table_replication_strategy and creates an individual erm for each table based on the schema id. Maybe I read the code wrong. Yes, the topology will be the same for all erms if they were created from the same token metadata, but the replication for each tablet table is different.

Contributor

I have a test that reproduces it

Is it committed to the repo?

Contributor

@kbr-scylla yes, test/topology_experimental_raft/test_mv_tablets.py:test_tablet_mv_simple_6node, from 4505a86. Its comment:

A simple reproducer for a bug of forgetting that the view table has a different tablet mapping from the base: using the wrong tablet mapping for the base table or view table can cause us to send a view update to the wrong view replica - or not send a view update at all. A row that we write on the base table will not be readable in the view. We start a large-enough cluster (6 nodes) to increase the probability that the mapping is different for the one row we write, so that the test will fail if the bug exists. Reproduces #16227.

Member

@nyh, as I mentioned in #18816 (comment), with tablets, the base and view have different tablet maps, but they can both live in the same erm (which is kept in the keyspace).

I do not see it from the code you posted. The code gets per_table_replication_strategy and creates an individual erm for each table based on the schema id. Maybe I read the code wrong. Yes, the topology will be the same for all erms if they were created from the same token metadata, but the replication for each tablet table is different.

Yes, sorry. I mixed the terms.
They will indeed have different effective_replication_map:s but they do share the same token_metadata_ptr and locator::topology.
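
A small illustrative sketch of that distinction (names approximate the discussion above, not the exact API):

// Tablets: base and view tables have distinct ERMs, i.e. distinct replica
// sets per token (cf. base_erm->get_replicas(base_token) mentioned above).
auto base_erm = base_table.get_effective_replication_map();
auto view_erm = view_table.get_effective_replication_map();
// ERMs created from the same token_metadata version share one topology:
auto& base_topology = base_erm->get_token_metadata_ptr()->get_topology();
auto& view_topology = view_erm->get_token_metadata_ptr()->get_topology();
// During a topology change, however, the two ERMs may be taken from
// different token_metadata versions, so the topologies can differ -
// exactly the case this PR handles.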

@scylladb-promoter
Contributor

🔴 CI State: FAILURE

✅ - Build
❌ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 topology_custom/test_mv_topology_change

Failed Tests (2/300):

Build Details:

  • Duration: 2 hr 59 min
  • Builder: i-0898df8fcf0cfc7ee (m5ad.12xlarge)

When calculating the base-view mapping while the topology
is changing, we may encounter a situation where the base
table noticed the change in its effective replication map
while the view table hasn't, or vice-versa. This can happen
because the ERM update may be performed during the preemption
between taking the base ERM and view ERM, or, due to f2ff701,
the update may have just been performed partially when we are
taking the ERMs.

Until now, we assumed that the ERMs were synchronized when finding
the base-view endpoint mapping, so in particular, we were
using the topology from the base's ERM to check the datacenters of
all endpoints. Now that the ERMs are more likely to not be the same,
we may try to get the datacenter of a view endpoint that doesn't
exist in the base's topology, causing us to crash.

This is fixed in this patch by using the view table's topology for
endpoints coming from the view ERM. The mapping resulting from the
call might now be a temporary mapping between endpoints in different
topologies, but it still maps base and view replicas 1-to-1.

Fixes: scylladb#17786
Fixes: scylladb#18709
@wmitros
Contributor Author

wmitros commented May 27, 2024

🔴 CI State: FAILURE

✅ - Build ❌ - Unit Tests Custom The following new/updated tests ran 100 times for each mode: 🔹 topology_custom/test_mv_topology_change

Failed Tests (2/300):

Build Details:

  • Duration: 2 hr 59 min
  • Builder: i-0898df8fcf0cfc7ee (m5ad.12xlarge)

Looks like the lambda coroutine wasn't the (only) problem - indeed, the code in future<> inject(const std::string_view& name, waiting_handler_fun func, bool share_messages = true) seems like it should keep the lambda alive until completion.
I found the more likely problem - I was using the _abort_source from the storage_service where replicate_to_all_cores was called, even though the code was being executed on all cores. I fixed it in the rebase by taking the _abort_source from the corresponding storage_service on each shard for the sleep_abortable. I left the fiasco workaround anyway, because I'm still not completely sure it wasn't a contributor to the problem, and it's a small difference anyway.
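
A hedged sketch of the described fix; the member and method names are assumptions based on the discussion, not the exact patch:

// Wrong: this lambda runs on every shard, but `this` - and therefore
// _abort_source - belongs to the shard that called replicate_to_all_cores:
//   co_await seastar::sleep_abortable(delay, _abort_source);
// Fixed: take the abort source from the shard-local storage_service instead:
co_await container().invoke_on_all([] (storage_service& ss) -> future<> {
    co_await seastar::sleep_abortable(std::chrono::seconds(2), ss._abort_source);
});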

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 topology_custom/test_mv_topology_change
✅ - Container Test
✅ - dtest
✅ - dtest with topology changes
✅ - Unit Tests

Build Details:

  • Duration: 6 hr 45 min
  • Builder: i-07ccb190095e38dc3 (m5ad.12xlarge)

@piodul piodul requested a review from kbr-scylla May 28, 2024 06:14
@piodul
Contributor

piodul commented May 28, 2024

Queued

@gleb-cloudius
Contributor

Should be backported to 6.0, no?

@kbr-scylla
Contributor

Should be backported to 6.0, no?

Yes, in fact it has all the possible OSS backport labels

@gleb-cloudius
Contributor

Should be backported to 6.0, no?

Yes, in fact it has all the possible OSS backport labels

Ah, I missed that 6.0 was added as well. The branch was created while this PR was in review.

@nyh
Contributor

nyh commented May 28, 2024

Queued

This may have caused a promotion failure - #18941 :-(

Labels
area/materialized views · area/topology changes · backport/5.4 (Issues that should be backported to 5.4 branch once they'll be fixed) · backport/6.0 · promoted-to-master · status/regression · status/release blocker (Preventing a release from being promoted)
9 participants