mv: handle different ERMs for base and view table #18816
Conversation
@@ -1625,6 +1625,7 @@ get_view_natural_endpoint(
    }
}

auto& view_topology = view_erm->get_token_metadata_ptr()->get_topology();
This is fixed in the following patch by using
There is only one commit here, so I suppose you meant "in this patch"?
Yes, I guess that's what he meant - the patch that follows this description ;-) But I agree it's ambiguous and it should be called "this patch".
I fixed the message
# In the test, we create a keyspace with a table and a materialized view.
# We then start writing to the table, causing the materialized view to be updated.
# While the writes are in progress, we add nodes to the cluster.
# The test verifies that the node doesn't crash as a result of the node addition,
The important thing here to reproduce the bug is that the node is from a new DC, right? If so, then it should be noted here.
No. The important thing is for the replication strategy to be NetworkTopologyStrategy. The DC can be the same.
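To make the requirement above concrete, here is a hedged sketch of the keyspace DDL such a test would issue. The keyspace name and replication factor are illustrative, not taken from the PR; the point from the discussion is only that the replication class must be NetworkTopologyStrategy, while the DC layout can stay unchanged.

```python
# Illustrative sketch: build the keyspace DDL the test would run.
# The names and RF are hypothetical; only the replication class
# ('NetworkTopologyStrategy') matters for reproducing the bug.
def make_keyspace_ddl(ks: str = "ks", rf: int = 3) -> str:
    return (
        f"CREATE KEYSPACE {ks} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', 'replication_factor': {rf}}}"
    )

ddl = make_keyspace_ddl()
```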
# and that the materialized view is eventually consistent.
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_add_node_during_mv_writes(manager: ManagerClient):
Does this test alone reliably reproduce the issue with unchanged Scylla? I think we need to trigger a situation where the view ERM is updated but the base table ERM is not - either by introducing a sleep like in the issue description or by sleeping in the code that updates the ERMs. Can we do it with error injection?
Right. Please explain in the commit message or even in this discussion how often this test fails before the fix. If it fails 100% of the time, that's fantastic. But to be realistic, even if it fails 10% of the time it's good enough - if the bug ever regresses, we'll see this test fail 10% of the time and it will be fine.
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_add_node_during_mv_writes(manager: ManagerClient):
    cfg = {'force_gossip_topology_changes': True}
I know that you probably took the test from the issue description verbatim, but is legacy mode necessary to reproduce it? If not, can we use non-legacy mode?
I can't seem to make the test fail consistently without it, maybe @gleb-cloudius knows more precisely why it was added in the initial reproducer?
That is not my test either. It is from here: #18676. The timing with and without raft will be very different though. Maybe the barrier can even prevent the problem from happening.
await asyncio.sleep(2)

logger.info(nums)
logger.info(sum(nums))

res = await cql.run_async("select pk, v from ks.t")
for r in res:
    logger.info(f"pk: {r.pk}, v: {r.v}")

res = await cql.run_async("select v, pk from ks.t_view")
for r in res:
    logger.info(f"v: {r.v}, pk: {r.pk}")
Are these logs necessary for the reproducer? Or the sleep?
I removed these from the test after all
# and that the materialized view is eventually consistent.
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_add_node_during_mv_writes(manager: ManagerClient):
I think you should add sleep injections to increase the probability of encountering different e_r_ms:
- one sleep in MV update code between obtaining the different e_r_ms
- one sleep in the code which updates e_r_ms in storage_service
(i.e., the same injections that @gleb-cloudius did in his reproducers)
Otherwise this test is very unlikely to reproduce either problem.
You should verify that this test is really a regression test. I.e., run it before and after the fix, and verify that it reproduces both failures before the fix, but reproduces neither after the fix.
In fact I believe we may need two tests. How can it reproduce #17786? There is no decommission done in this test... only bootstraps.
One test would be enough, if it does both bootstrap and decommission (concurrent to doing MV updates), but you should verify that it reproduces both before the fix (e.g. by commenting out bootstrap, checking that it reproduces the decommission issue, then bringing bootstrap back but commenting out decommission, checking that it reproduces the bootstrap issue)
Yes, please make it do both. As for sleeps, I agree with @nyh that if it reproduces even 10% of the time without them, that is enough.
I remember @gleb-cloudius that you did not manage to reproduce the decommission case with hundreds of runs until you put the sleep in e_r_m update code.
So we do need at least that sleep to hit the decommission case. Not sure about the bootstrap case -- what was your reproducibility without sleep?
I remember @gleb-cloudius that you did not manage to reproduce the decommission case with hundreds of runs until you put the sleep in e_r_m update code.
I do not remember all attempts, but it was definitely not easily reproducible. Let's run the test 10 times, and if it does not fail then we need sleeps like in my examples.
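The "even 10% is good enough" and "run the test 10 times" arguments above can be sanity-checked with a quick calculation: the chance that at least one of n independent runs reproduces a failure with per-run rate p.

```python
def detection_probability(p: float, n: int) -> float:
    """Chance that at least one of n independent runs reproduces a
    failure that occurs with per-run probability p."""
    return 1.0 - (1.0 - p) ** n

# With a 10% per-run reproduction rate, ten CI runs catch a
# regression roughly 65% of the time; a 100% rate needs one run.
p10 = detection_probability(0.1, 10)
```

So a 10%-flaky reproducer, run repeatedly in CI, would surface a regression with high probability fairly quickly, which is the rationale behind the review comments above.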
await manager.server_add()
await manager.server_add()
await manager.server_add()
With the right sleep injections only a single bootstrap should be required, with no sleep injections I think there's significant risk that we won't hit it.
Right, I left only one additional server in the updated test
Added backport 6.0 label
Thanks for fixing this. In commit 4505a86 I fixed most of the places where we used the same ERM for base and view, but I guess I fixed the topology one wrong.
Regarding #18816 (comment), #18816 (comment) and #18816 (comment):
🔴 CI State: FAILURE - Failed Tests (4/300)
Build Details:
Looks like I hit https://github.com/scylladb/seastar/blob/master/doc/lambda-coroutine-fiasco.md. I couldn't use
@@ -1625,25 +1625,26 @@ get_view_natural_endpoint(
    }
}
ERMs
That's one of our worst acronyms
We should pay Scylla employees for fast-typing lessons, so they won't have to use acronyms in their code :-)
Scylla could also sponsor the AAAAAAA - the American Association Against Abuse of Abbreviations And Acronyms.
We should pay Scylla employees for fast-typing lessons, so they won't have to use acronyms in their code :-)
And for wider monitors!
@@ -1658,7 +1659,7 @@ get_view_natural_endpoint(
        return {};
    }
    auto replica = view_endpoints[base_it - base_endpoints.begin()];
-   return topology.get_node(replica).endpoint();
+   return view_topology.get_node(replica).endpoint();
I don't understand why we have two different topologies. The whole idea behind effective_replication_map_ptr is that while you hold it, topology can only change in a compatible way. So why not have view_erm be base_erm? Why have two variables in the first place?
Topology is an object inside an erm. Erms are per table (I guess for tablets, but they have to be at least per keyspace even for vnodes, since they are the result of applying a replication strategy to a topology).
I don't think they're per table, I think they're per keyspace. @bhalevy
They are per-keyspace with vnodes, and per-table with tablets.
@tgrabiec is there a way to get to a "super effective replication map" that will cover both a base table and its views?
@bhalevy maybe the function of "ERM" changed since I wrote this code (although, it does still work). As I noted above, what I needed to query was base_erm->get_replicas(base_token)
- the list of three replicas that currently hold this token. The result of this is different between different tables. So if this is not an "erm" and is rather a "tablet map that is only a part of the erm", then I guess this code needs two "tablet maps", not two "erm"s. I'd appreciate it if someone who understands how this ERM stuff evolved would change this code to use the right objects or at least fix the terminology. But again, the current code does work (and if it breaks, there's a test that exercises this case of different tablets for the base and view, and would have caught such breakage).
@nyh, as I mentioned in #18816 (comment), with tablets, the base and view have different tablet maps, but they can both live in the same erm (which is kept in the keyspace).
I do not see it from the code you posted. The code gets per_table_replication_strategy and creates an individual erm for each table based on the schema id. Maybe I read the code wrong. Yes, the topology will be the same for all erms if they were created from the same token metadata, but replication for each tablet table is different.
I have a test that reproduces it
Is it committed to the repo?
@kbr-scylla yes, test/topology_experimental_raft/test_mv_tablets.py::test_tablet_mv_simple_6node, from 4505a86. Its comment:
A simple reproducer for a bug of forgetting that the view table has a different tablet mapping from the base: using the wrong tablet mapping for the base table or view table can cause us to send a view update to the wrong view replica - or not send a view update at all. A row that we write on the base table will not be readable in the view. We start a large-enough cluster (6 nodes) to increase the probability that the mapping is different for the one row we write, so that the test will fail if the bug exists. Reproduces #16227.
@nyh, as I mentioned in #18816 (comment), with tablets, the base and view have different tablet maps, but they can both live in the same erm (which is kept in the keyspace).
I do not see it from the code you posted. The code gets per_table_replication_strategy and creates an individual erm for each table based on the schema id. Maybe I read the code wrong. Yes, the topology will be the same for all erms if they were created from the same token metadata, but replication for each tablet table is different.
Yes, sorry. I mixed the terms.
They will indeed have different effective_replication_maps, but they do share the same token_metadata_ptr and locator::topology.
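The pairing discussed in this thread can be sketched in pure Python (all names hypothetical): base and view replica lists are paired 1-to-1 by position, mirroring the `view_endpoints[base_it - base_endpoints.begin()]` indexing from the patch, and after the fix the returned replica is resolved against the view ERM's topology, since it may not exist in the base ERM's topology at all.

```python
# Pure-Python sketch (hypothetical names) of the pairing logic in
# get_view_natural_endpoint: the view replica paired with this node
# is the one at the same index as this node in the base replica list.
def view_natural_endpoint(this_node, base_endpoints, view_endpoints):
    if this_node not in base_endpoints:
        return None  # this node is not a base replica for the token
    # The returned endpoint must then be looked up in the *view*
    # ERM's topology, not the base's, per the fix in this PR.
    return view_endpoints[base_endpoints.index(this_node)]
```

This is only an illustration of the 1-to-1 mapping invariant the commit message describes; the real C++ code works on sorted endpoint vectors inside the two ERMs.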
🔴 CI State: FAILURE - Failed Tests (2/300)
Build Details:
When calculating the base-view mapping while the topology is changing, we may encounter a situation where the base table noticed the change in its effective replication map while the view table hasn't, or vice versa. This can happen because the ERM update may be performed during the preemption between taking the base ERM and view ERM, or, due to f2ff701, the update may have just been performed partially when we are taking the ERMs.

Until now, we assumed that the ERMs are synchronized when finding the base-view endpoint mapping, so in particular, we were using the topology from the base's ERM to check the datacenters of all endpoints. Now that the ERMs are more likely to not be the same, we may try to get the datacenter of a view endpoint that doesn't exist in the base's topology, causing us to crash.

This is fixed in this patch by using the view table's topology for endpoints coming from the view ERM. The mapping resulting from the call might now be a temporary mapping between endpoints in different topologies, but it still maps base and view replicas 1-to-1.

Fixes: scylladb#17786
Fixes: scylladb#18709
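The crash scenario in the commit message can be illustrated with a toy model (node and DC names are illustrative, not from the PR): the base ERM's topology was snapshotted before a bootstrap and the view ERM's after it, so a view endpoint may be missing from the base topology.

```python
# Toy model of the race described above. The base ERM snapshot
# predates a bootstrap, the view ERM snapshot follows it, so the
# new node "n3" exists only in the view topology.
base_topology = {"n1": "dc1", "n2": "dc1"}
view_topology = {"n1": "dc1", "n2": "dc1", "n3": "dc1"}

def dc_of(endpoint, topology):
    return topology[endpoint]  # KeyError for a node the topology doesn't know

# Pre-fix behavior: resolving a view endpoint in the base topology fails.
try:
    dc_of("n3", base_topology)
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Post-fix behavior: each endpoint is resolved in its own ERM's topology.
view_dc = dc_of("n3", view_topology)
```

In the real code the failed lookup aborts the node rather than raising a catchable error, which is why the fix resolves view endpoints against the view ERM's topology.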
Looks like the lambda coroutine wasn't the (only) problem - indeed, the code in
🟢 CI State: SUCCESS ✅
Build Details:
Queued
Should be backported to 6.0, no?
Yes, in fact it has all the possible OSS backport labels
Ah, I missed that 6.0 was added as well. The branch was created while this PR was in review.
This may have caused a promotion failure - #18941 :-(