Asserts and segmentation faults in topic creation under load #5558
Comments
Potential duplicate? #3335
@dotnwat - oh interesting. It's hard to say. It's the same assert but there's no decoded backtrace so it's difficult to assess whether In addition to my fix I'm adding
Note that this code was refactored to use coroutines in the last month or so. The bug existed in the prior version too, but perhaps the change affected timing or some other factor, making it more likely now? Earlier load tests of a similar nature didn't trigger it, as far as I noticed.
Return the allocation_units wrapped in a foreign pointer, since they need to be freed on the same core on which they were allocated. Fixes redpanda-data#5558.
Looks like I reopened it based on the last comment by John, which seemed to cast some doubt on whether it was fixed.
Const member prevents copy and move assignment, but copy and move assignment seem to have reasonable semantics (the destination shard id is replaced by the source) and we need it for allocation_units oncore tracking. Issue redpanda-data#5558.
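To illustrate the point about const members, here is a minimal sketch in plain C++. The type names `checker_const` and `checker_mutable` and the helper `assign_and_read` are hypothetical and are not taken from the Redpanda source; they only show why a const shard id blocks assignment, and what "the destination shard id is replaced by the source" looks like once the const is dropped.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-ins for a shard-tracking helper (not the real Redpanda type).
struct checker_const {
    const unsigned shard; // const member: implicit assignment operators are deleted
};

struct checker_mutable {
    unsigned shard; // non-const member: implicit assignment operators are available
};

static_assert(!std::is_copy_assignable<checker_const>::value, "const member blocks copy assignment");
static_assert(!std::is_move_assignable<checker_const>::value, "const member blocks move assignment");
static_assert(std::is_copy_assignable<checker_mutable>::value, "non-const member allows copy assignment");
static_assert(std::is_move_assignable<checker_mutable>::value, "non-const member allows move assignment");

// After assignment the destination carries the source's shard id.
unsigned assign_and_read(unsigned dst_shard, unsigned src_shard) {
    checker_mutable dst{dst_shard};
    checker_mutable src{src_shard};
    dst = src;
    return dst.shard;
}
```

A type that embeds such a helper inherits the restriction, so a movable `allocation_units` needs the assignable form.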
Const member prevents copy and move assignment, but copy and move assignment seem to have reasonable semantics (the destination shard id is replaced by the source) and we need it for allocation_units oncore tracking. Issue redpanda-data#5558. (cherry picked from commit c64fdbb)
Return the allocation_units wrapped in a foreign pointer, since they need to be freed on the same core on which they were allocated. Fixes redpanda-data#5558. (cherry picked from commit 413cbe3)
Version & Environment
Redpanda version: 3c02b03
What went wrong?
During a load test, the following assert was observed:
The relevant backtrace is:
Similarly, a segmentation fault on shard 24 occurred in another run of the same load test with the following backtrace:
What should have happened instead?
No asserts or segmentation faults.
How to reproduce the issue?
This issue is difficult to reproduce deterministically, but the problem can be seen by inspection at redpanda/src/v/cluster/topics_frontend.cc, lines 460 to 469 in a80aca9.
This code uses cluster::allocation_units on "this" shard, although the units were originally created on shard 0 (the invoke_on part). Internally, the units hold a _state pointer that points back to an internal structure of the partition_allocator on shard 0 (indeed, by design the partition_allocator exists only on shard 0).
These units are manipulated and destroyed (e.g., if there is an error in the highlighted section above, or in topics_frontend::replicate_create_topic otherwise) on this shard, which will call back into the allocator on shard 0 via the _state pointer, e.g. here.
This is a cross-shard race condition: multiple shards may manipulate this state on shard 0 in parallel.
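The ownership problem, and the foreign-pointer style fix named in the linked commits, can be modelled in plain C++ without Seastar. The sketch below is an assumption-laden illustration, not the Redpanda code: `owner_shard`, `allocator_state`, `units`, `foreign_units` and `count_wrong_thread_releases` are all hypothetical names, and a thread with a task queue stands in for shard 0. Instead of asserting, it counts how often units release their state on a thread other than the owner.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical stand-in for shard 0: one thread draining a FIFO task queue.
class owner_shard {
public:
    owner_shard() : _worker([this] { run(); }) {}
    ~owner_shard() {
        submit([this] { _stop = true; }); // runs after all earlier tasks
        _worker.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _tasks.push(std::move(task));
        }
        _cv.notify_one();
    }

private:
    void run() {
        while (!_stop) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(_mutex);
                _cv.wait(lock, [this] { return !_tasks.empty(); });
                task = std::move(_tasks.front());
                _tasks.pop();
            }
            task();
        }
    }
    std::mutex _mutex;
    std::condition_variable _cv;
    std::queue<std::function<void()>> _tasks;
    bool _stop = false;  // only touched on the worker thread
    std::thread _worker; // declared last so the queue exists before it starts
};

// Hypothetical allocator state that must only be touched by its owner.
struct allocator_state {
    std::thread::id owner;
    int allocated = 0;
    int wrong_thread_releases = 0;
    void release() {
        if (std::this_thread::get_id() != owner) ++wrong_thread_releases;
        --allocated;
    }
};

// Units keep a back-pointer to the allocator state and release in the destructor.
class units {
public:
    explicit units(allocator_state& s) : _state(&s) { ++_state->allocated; }
    units(const units&) = delete;
    units& operator=(const units&) = delete;
    ~units() { _state->release(); }
private:
    allocator_state* _state;
};

// Foreign-pointer style wrapper: destruction is forwarded to the owner.
class foreign_units {
public:
    foreign_units(owner_shard& owner, std::unique_ptr<units> u)
      : _owner(&owner), _units(std::move(u)) {}
    ~foreign_units() {
        if (units* raw = _units.release()) {
            _owner->submit([raw] { delete raw; }); // freed on the owner thread
        }
    }
private:
    owner_shard* _owner;
    std::unique_ptr<units> _units;
};

// Creates units on the owner, drops them on the caller, reports bad releases.
int count_wrong_thread_releases(bool use_foreign_wrapper) {
    allocator_state state;
    std::promise<std::unique_ptr<units>> made;
    std::future<std::unique_ptr<units>> fut = made.get_future();
    {
        owner_shard shard0;
        shard0.submit([&state, &made] {
            state.owner = std::this_thread::get_id();
            made.set_value(std::make_unique<units>(state));
        });
        std::unique_ptr<units> u = fut.get();
        assert(u != nullptr);
        if (use_foreign_wrapper) {
            foreign_units guarded(shard0, std::move(u)); // dropped at end of this block
        } else {
            u.reset(); // released on the calling thread: the problematic path
        }
    } // shard0 drains its queue and joins here
    return state.wrong_thread_releases;
}
```

Calling `count_wrong_thread_releases(false)` takes the unwrapped path, where the release runs on the calling thread; `count_wrong_thread_releases(true)` routes the release back to the owner. In Seastar the analogous tool is `foreign_ptr`, which is what the fix commits in this thread describe using. Building the sketch needs C++14 and thread support (for example `-pthread` with GCC or Clang).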