-
Notifications
You must be signed in to change notification settings - Fork 199
mitosis: bugfixes #2904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mitosis: bugfixes #2904
Conversation
418df1f to
5553ed7
Compare
|
Reasonably confident this also fixes: which i encountered while testing. |
Let's drop this commit from the PR - I don't think we've hit this issue except in synthetic cases. It also won't work because we reassign the cpus to other cells so there's no guarantee that those cells will get drained. |
5553ed7 to
2d80d63
Compare
dschatzberg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comments inline - some of these changes don't make sense (I indicated where) but a bunch of good stuff in here too. I love the test script
2d80d63 to
fd9dec7
Compare
In mitosis_stopping, the scheduler calculates vtime advancement by dividing by the task's weight. If a task somehow has a zero weight, this would cause a division by zero error. Add a defensive check to catch this case and report an error rather than crashing.
When allocating a new cell, the vtime_now field is not initialized, which means it contains garbage data. This can cause scheduling anomalies as tasks inherit this uninitialized vtime. Initialize vtime_now to 0 when allocating a cell to ensure consistent vtime tracking from the start.
Replace std::ptr::read_volatile with proper atomic load operations with Acquire ordering when reading fields shared between BPF and Rust code. This ensures proper memory ordering guarantees required for multi-threaded communication: - applied_configuration_seq: Used to synchronize configuration updates - in_use: Used to determine cell allocation status The Acquire ordering ensures that all writes made by the BPF side before releasing the data are visible to the Rust side after the load.
Replace expect() with unwrap_or_else() when looking up cell CPU assignments. If a cell is marked as in_use but has no CPUs assigned (which can happen during race conditions or transient states), log a warning and use an empty cpumask rather than panicking. This makes the scheduler more resilient to transient inconsistencies.
Add a check to ensure global_queue_decisions is non-zero before taking log10. If it's zero (which can happen during initialization or in periods of no activity), use MIN_DECISIONS_WIDTH directly instead of attempting to calculate log10(0), which returns negative infinity and causes formatting issues.
Add validation that the percpu map lookup returns at least NR_CPUS_POSSIBLE entries before indexing into it. If the map returns fewer entries than expected, fail with a clear error message rather than panicking with an out-of-bounds access.
Address several race conditions and lifecycle issues that were causing errors under load: BPF side (mitosis.bpf.c): - Fix cgroup_exit incorrectly using BPF_LOCAL_STORAGE_GET_F_CREATE flag. The exit handler should not create storage; if storage doesn't exist, the cgroup was never a cell owner and there's nothing to free. This was causing "cgrp_ctx creation failed" errors. Rust side (main.rs): - Fix TOCTOU race in cell discovery by inferring cell existence from CPU assignments rather than reading in_use flag separately. The in_use flag is set early in allocate_cell() but CPUs aren't assigned until later in update_timer_cb. Reading in_use with Acquire doesn't synchronize with CPU assignments (which are synchronized via applied_configuration_seq). This was causing "missing cell in cpu map" warnings. Known remaining issue: - Cells are not freed when a cgroup's cpuset is removed (only when cgroup exits). See TODO at line 870 in mitosis.bpf.c. With MAX_CELLS=16, systems with many dynamic cpuset changes can exhaust the cell pool, leading to "No available cells to allocate" errors. This may be an issue only present in synthetic tests.
Add a script to stress mitosis by creating and destroying cells. This has proven useful for identifying bugs.
fd9dec7 to
67be89c
Compare
Fix bugs in mitosis.
The test script added passes on the last commit in this PR.
It fails due to various issues on commits in this PR.
W/o this PR at all, mitosis will get kicked w/
missing cell in cpu map.W/o drain commit, mitosis will get kicked w/
No available cells to allocate.