-
Notifications
You must be signed in to change notification settings - Fork 3
Implement a fail save init function #804
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Warning Rate limit exceeded@jan-janssen has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 9 minutes and 17 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (2)
WalkthroughAdds try/except error handling and explicit success acknowledgments for the "init" branch in interactive backends (serial and parallel). The scheduler's init dispatch now uses send-and-receive and captures init-time exceptions to propagate them to pending Futures. Tests updated to assert the new initial acknowledgment message. No public API signatures changed. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant S as Scheduler/Client
participant I as Interface
participant B as Backend (Serial/Parallel)
rect #E8F3FF
S->>I: send_and_receive_dict({type: "init", ...})
I->>B: deliver init payload
alt Init succeeds
B->>I: {"result": True}
I-->>S: {"result": True}
else Init raises/exception
B->>I: {"error": {...}}
Note right of B: backend_write_error_file(apply_dict=input_dict)
I-->>S: {"error": {...}}
S->>S: scheduler records initialization exception
S->>S: pending Futures set with exception
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (3)
executorlib/task_scheduler/interactive/blockallocation.py (1)
231-235: Avoid indefinite wait on backend crashsend_and_receive_dict can block forever if the backend dies before replying. Consider adding a receive timeout or a watchdog/heartbeat at the interface layer and surfacing a TimeoutError here.
executorlib/backend/interactive_serial.py (1)
75-90: Catching broad Exception is intentional; silence the linterCatching Exception here is appropriate to surface user init errors. Add a noqa to satisfy BLE001 without narrowing the scope.
- except Exception as error: + except Exception as error: # noqa: BLE001 interface_send( socket=socket, result_dict={"error": error}, )executorlib/backend/interactive_parallel.py (1)
100-116: Optional: barrier after ack for stricter orderingAfter sending the success/error response, consider a Barrier to ensure all ranks completed init before next command, mirroring the shutdown path’s synchronization.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
executorlib/backend/interactive_parallel.py(1 hunks)executorlib/backend/interactive_serial.py(1 hunks)executorlib/task_scheduler/interactive/blockallocation.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
executorlib/task_scheduler/interactive/blockallocation.py (3)
executorlib/task_scheduler/file/shared.py (1)
result(24-38)executorlib/standalone/interactive/communication.py (1)
send_and_receive_dict(82-94)tests/test_singlenodeexecutor_noblock.py (1)
init_function(22-23)
executorlib/backend/interactive_serial.py (3)
executorlib/standalone/interactive/backend.py (2)
call_funct(5-33)funct(21-22)executorlib/standalone/interactive/communication.py (1)
interface_send(221-230)executorlib/standalone/error.py (1)
backend_write_error_file(4-21)
executorlib/backend/interactive_parallel.py (3)
executorlib/standalone/interactive/backend.py (2)
call_funct(5-33)funct(21-22)executorlib/standalone/interactive/communication.py (1)
interface_send(221-230)executorlib/standalone/error.py (1)
backend_write_error_file(4-21)
🪛 Ruff (0.12.2)
executorlib/task_scheduler/interactive/blockallocation.py
234-234: Unnecessary key check before dictionary access
Replace with dict.get
(RUF019)
executorlib/backend/interactive_serial.py
79-79: Do not catch blind exception: Exception
(BLE001)
executorlib/backend/interactive_parallel.py
104-104: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.11)
- GitHub Check: unittest_mpich (ubuntu-22.04-arm, 3.13)
- GitHub Check: unittest_openmpi (ubuntu-24.04-arm, 3.13)
- GitHub Check: unittest_openmpi (macos-latest, 3.13)
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.12)
- GitHub Check: unittest_openmpi (ubuntu-22.04-arm, 3.13)
- GitHub Check: unittest_mpich (ubuntu-latest, 3.11)
- GitHub Check: unittest_mpich (ubuntu-latest, 3.12)
- GitHub Check: unittest_mpich (macos-latest, 3.13)
- GitHub Check: unittest_mpich (ubuntu-latest, 3.13)
- GitHub Check: unittest_flux_openmpi
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.13)
- GitHub Check: unittest_flux_mpich
- GitHub Check: minimal
- GitHub Check: unittest_slurm_mpich
- GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-mpich.yml)
- GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-openmpi.yml)
- GitHub Check: unittest_old
- GitHub Check: notebooks_integration
- GitHub Check: notebooks
| try: | ||
| memory.update( | ||
| call_funct(input_dict=input_dict, funct=None, memory=memory) | ||
| ) | ||
| except Exception as error: | ||
| if mpi_rank_zero: | ||
| interface_send( | ||
| socket=socket, | ||
| result_dict={"error": error}, | ||
| ) | ||
| backend_write_error_file( | ||
| error=error, | ||
| apply_dict=input_dict, | ||
| ) | ||
| else: | ||
| if mpi_rank_zero: | ||
| interface_send(socket=socket, result_dict={"result": True}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: init failures on non-root ranks are ignored; client may receive success
If any non-root rank raises during init, rank 0 still sends {"result": True}. This leaves ranks with divergent memory and hides failures.
Aggregate errors across ranks and only acknowledge success if all ranks succeeded. Example patch:
- try:
- memory.update(
- call_funct(input_dict=input_dict, funct=None, memory=memory)
- )
- except Exception as error:
- if mpi_rank_zero:
- interface_send(
- socket=socket,
- result_dict={"error": error},
- )
- backend_write_error_file(
- error=error,
- apply_dict=input_dict,
- )
- else:
- if mpi_rank_zero:
- interface_send(socket=socket, result_dict={"result": True})
+ # Run init on all ranks and consolidate errors on rank 0
+ local_error = None
+ local_update = None
+ try:
+ local_update = call_funct(input_dict=input_dict, funct=None, memory=memory)
+ except Exception as error: # noqa: BLE001
+ local_error = error
+
+ from mpi4py import MPI # already available; keeps scope explicit
+ # Did any rank fail?
+ any_failed = MPI.COMM_WORLD.allreduce(bool(local_error), op=MPI.LOR)
+
+ if any_failed:
+ # Gather first non-None error to rank 0 and report
+ errors = MPI.COMM_WORLD.gather(local_error, root=0)
+ if mpi_rank_zero:
+ first_error = next((e for e in errors if e is not None), None)
+ interface_send(socket=socket, result_dict={"error": first_error})
+ backend_write_error_file(error=first_error, apply_dict=input_dict)
+ else:
+ # All ranks succeeded: update memory consistently and acknowledge
+ memory.update(local_update) # same update on all ranks
+ if mpi_rank_zero:
+ interface_send(socket=socket, result_dict={"result": True})📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try: | |
| memory.update( | |
| call_funct(input_dict=input_dict, funct=None, memory=memory) | |
| ) | |
| except Exception as error: | |
| if mpi_rank_zero: | |
| interface_send( | |
| socket=socket, | |
| result_dict={"error": error}, | |
| ) | |
| backend_write_error_file( | |
| error=error, | |
| apply_dict=input_dict, | |
| ) | |
| else: | |
| if mpi_rank_zero: | |
| interface_send(socket=socket, result_dict={"result": True}) | |
| # Run init on all ranks and consolidate errors on rank 0 | |
| local_error = None | |
| local_update = None | |
| try: | |
| local_update = call_funct(input_dict=input_dict, funct=None, memory=memory) | |
| except Exception as error: # noqa: BLE001 | |
| local_error = error | |
| from mpi4py import MPI # already available; keeps scope explicit | |
| # Did any rank fail? | |
| any_failed = MPI.COMM_WORLD.allreduce(bool(local_error), op=MPI.LOR) | |
| if any_failed: | |
| # Gather first non-None error to rank 0 and report | |
| errors = MPI.COMM_WORLD.gather(local_error, root=0) | |
| if mpi_rank_zero: | |
| first_error = next((e for e in errors if e is not None), None) | |
| interface_send(socket=socket, result_dict={"error": first_error}) | |
| backend_write_error_file(error=first_error, apply_dict=input_dict) | |
| else: | |
| # All ranks succeeded: update memory consistently and acknowledge | |
| memory.update(local_update) # same update on all ranks | |
| if mpi_rank_zero: | |
| interface_send(socket=socket, result_dict={"result": True}) |
🧰 Tools
🪛 Ruff (0.12.2)
104-104: Do not catch blind exception: Exception
(BLE001)
| result = interface.send_and_receive_dict( | ||
| input_dict={"init": True, "fn": init_function, "args": (), "kwargs": {}} | ||
| ) | ||
| if "error" in result and result["error"]: | ||
| raise result["error"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Init errors: also log to file and simplify error check
- You’re not passing error_log_file in the init payload, so backend_write_error_file can’t persist init failures. Add it.
- Replace the explicit key check with get(...) and raise a sensible RuntimeError if a non-exception is received (matches Ruff RUF019 hint).
- result = interface.send_and_receive_dict(
- input_dict={"init": True, "fn": init_function, "args": (), "kwargs": {}}
- )
- if "error" in result and result["error"]:
- raise result["error"]
+ result = interface.send_and_receive_dict(
+ input_dict={
+ "init": True,
+ "fn": init_function,
+ "args": (),
+ "kwargs": {},
+ "error_log_file": error_log_file,
+ }
+ )
+ error = result.get("error")
+ if error:
+ if isinstance(error, BaseException):
+ raise error
+ raise RuntimeError(f"Init failed with non-exception payload: {error!r}")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| result = interface.send_and_receive_dict( | |
| input_dict={"init": True, "fn": init_function, "args": (), "kwargs": {}} | |
| ) | |
| if "error" in result and result["error"]: | |
| raise result["error"] | |
| result = interface.send_and_receive_dict( | |
| input_dict={ | |
| "init": True, | |
| "fn": init_function, | |
| "args": (), | |
| "kwargs": {}, | |
| "error_log_file": error_log_file, | |
| } | |
| ) | |
| error = result.get("error") | |
| if error: | |
| if isinstance(error, BaseException): | |
| raise error | |
| raise RuntimeError(f"Init failed with non-exception payload: {error!r}") |
🧰 Tools
🪛 Ruff (0.12.2)
234-234: Unnecessary key check before dictionary access
Replace with dict.get
(RUF019)
🤖 Prompt for AI Agents
In executorlib/task_scheduler/interactive/blockallocation.py around lines 231 to
235, the init payload must include the error_log_file so
backend_write_error_file can persist init failures; update the input_dict to
pass error_log_file and then replace the explicit key check with using
result.get("error") — if error is truthy and is an Exception raise it directly,
otherwise raise a RuntimeError (including the received error representation) to
surface non-exception failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/test_backend_interactive_serial.py (1)
42-42: Avoid hangs in CI by setting a receive timeout (optional).
If the handshake regresses, these blocking recv() calls could hang. Consider setting a small RCVTIMEO on the test-side socket after creation.Example (apply in each test right after creating the socket):
socket.setsockopt(zmq.RCVTIMEO, 5000) # 5s timeout to fail fast instead of hangingAlso applies to: 55-55, 70-70, 83-83
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
executorlib/task_scheduler/interactive/blockallocation.py(1 hunks)tests/test_backend_interactive_serial.py(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- executorlib/task_scheduler/interactive/blockallocation.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: unittest_openmpi (ubuntu-24.04-arm, 3.13)
- GitHub Check: unittest_mpich (ubuntu-24.04-arm, 3.13)
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.13)
- GitHub Check: unittest_openmpi (macos-latest, 3.13)
- GitHub Check: unittest_slurm_mpich
- GitHub Check: unittest_flux_mpich
- GitHub Check: unittest_mpich (ubuntu-22.04-arm, 3.13)
- GitHub Check: unittest_win
- GitHub Check: unittest_mpich (ubuntu-latest, 3.13)
- GitHub Check: unittest_mpich (ubuntu-latest, 3.11)
- GitHub Check: unittest_mpich (macos-latest, 3.13)
- GitHub Check: unittest_mpich (ubuntu-latest, 3.12)
- GitHub Check: notebooks
- GitHub Check: minimal
- GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-mpich.yml)
🔇 Additional comments (4)
tests/test_backend_interactive_serial.py (4)
42-42: Init ACK assertion matches the new handshake — LGTM.
Validates that successful init now replies with {"result": True} before task results.
55-55: Error-path still ACKs init first — LGTM.
Confirms init success ACK precedes the error payload.
70-70: Submit-thread path: init ACK assertion — LGTM.
Covers the inverse threading direction with the same contract.
83-83: Submit-thread error path: init ACK assertion — LGTM.
Ensures consistent ordering (ACK → error) here as well.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #804 +/- ##
==========================================
+ Coverage 97.75% 97.76% +0.01%
==========================================
Files 32 32
Lines 1468 1479 +11
==========================================
+ Hits 1435 1446 +11
Misses 33 33 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
executorlib/task_scheduler/interactive/blockallocation.py (1)
232-237: Init errors are silently ignored; inspect response and pass error_log_fileYou discard the init response and only catch transport-layer failures. Backend init failures are returned as {"error": ...}, so they won't be propagated. Also include error_log_file in the init payload so the backend can persist the failure.
Apply:
- try: - _ = interface.send_and_receive_dict( - input_dict={"init": True, "fn": init_function, "args": (), "kwargs": {}} - ) - except Exception as init_exception: - interface_initialization_exception = init_exception + try: + result = interface.send_and_receive_dict( + input_dict={ + "init": True, + "fn": init_function, + "args": (), + "kwargs": {}, + "error_log_file": error_log_file, + } + ) + error = result.get("error") + if error: + if isinstance(error, BaseException): + raise error + raise RuntimeError(f"Init failed with non-exception payload: {error!r}") + except Exception as init_exception: + interface_initialization_exception = init_exceptionRun to confirm backends return {"result": True} or {"error": ...} and accept error_log_file:
#!/bin/bash rg -nC3 --glob 'executorlib/backend/interactive_*.*' -e 'error_log_file' -e '"error"' -e '"result"\s*:\s*True'
🧹 Nitpick comments (1)
executorlib/task_scheduler/interactive/blockallocation.py (1)
236-237: Document rationale for broad Exception catch
Add a brief comment or# noqa: BLE001aboveexcept Exception as init_exception: interface_initialization_exception = init_exceptionto note that
send_and_receive_dictmay raise multiple exception types (e.g. ExecutorlibSocketError, ZMQError, cloudpickle errors, or arbitrary remote exceptions) and thus a broad catch is required in this boundary layer.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
executorlib/task_scheduler/interactive/blockallocation.py(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
executorlib/task_scheduler/interactive/blockallocation.py (2)
executorlib/standalone/interactive/communication.py (1)
send_and_receive_dict(82-94)executorlib/task_scheduler/interactive/shared.py (1)
execute_task_dict(12-48)
🪛 Ruff (0.12.2)
executorlib/task_scheduler/interactive/blockallocation.py
236-236: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.12)
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.11)
- GitHub Check: unittest_openmpi (ubuntu-22.04-arm, 3.13)
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.13)
- GitHub Check: unittest_openmpi (macos-latest, 3.13)
- GitHub Check: notebooks_integration
- GitHub Check: unittest_slurm_mpich
- GitHub Check: unittest_flux_mpich
- GitHub Check: unittest_flux_openmpi
- GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-openmpi.yml)
- GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-mpich.yml)
| if interface_initialization_exception is not None: | ||
| f.set_exception(exception=interface_initialization_exception) | ||
| else: | ||
| execute_task_dict( | ||
| task_dict=task_dict, | ||
| future_obj=f, | ||
| interface=interface, | ||
| cache_directory=cache_directory, | ||
| cache_key=cache_key, | ||
| error_log_file=error_log_file, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Respect cancellation before setting exception to avoid InvalidStateError
Setting an exception on a cancelled Future can raise InvalidStateError and break the worker loop. Mirror execute_task_dict’s pattern.
- if interface_initialization_exception is not None:
- f.set_exception(exception=interface_initialization_exception)
+ if interface_initialization_exception is not None:
+ if (not f.done()) and f.set_running_or_notify_cancel():
+ f.set_exception(interface_initialization_exception)
else:
execute_task_dict(
task_dict=task_dict,
future_obj=f,
interface=interface,
cache_directory=cache_directory,
cache_key=cache_key,
error_log_file=error_log_file,
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if interface_initialization_exception is not None: | |
| f.set_exception(exception=interface_initialization_exception) | |
| else: | |
| execute_task_dict( | |
| task_dict=task_dict, | |
| future_obj=f, | |
| interface=interface, | |
| cache_directory=cache_directory, | |
| cache_key=cache_key, | |
| error_log_file=error_log_file, | |
| ) | |
| if interface_initialization_exception is not None: | |
| # Only set the exception if the future wasn't already done/cancelled | |
| if (not f.done()) and f.set_running_or_notify_cancel(): | |
| f.set_exception(interface_initialization_exception) | |
| else: | |
| execute_task_dict( | |
| task_dict=task_dict, | |
| future_obj=f, | |
| interface=interface, | |
| cache_directory=cache_directory, | |
| cache_key=cache_key, | |
| error_log_file=error_log_file, | |
| ) |
🤖 Prompt for AI Agents
In executorlib/task_scheduler/interactive/blockallocation.py around lines 248 to
258, avoid calling f.set_exception on a Future that may already be cancelled
(which can raise InvalidStateError); before setting the exception, check if
f.cancelled() (or otherwise mirror the guard used in execute_task_dict) and only
call set_exception when the future is not cancelled, otherwise skip setting the
exception (or handle via the same cancellation/ignore path used by
execute_task_dict).
Summary by CodeRabbit
New Features
Bug Fixes
Tests