Implement a fail save init function #804

jan-janssen · 2025-09-07T05:40:26Z

Summary by CodeRabbit

New Features
- Workers now send a clear success acknowledgment after initialization.
- Initialization failures are immediately reported back to the caller.
Bug Fixes
- Initialization errors are surfaced instead of failing silently.
- Scheduler propagates initialization failures to pending tasks to avoid hangs.
- Improved reliability across interactive serial and parallel modes.
Tests
- Added handshake assertions to validate the initialization acknowledgment.

coderabbitai · 2025-09-07T05:40:33Z

📥 Commits

Reviewing files that changed from the base of the PR and between bc18718 and 6163cca.

📒 Files selected for processing (2)

executorlib/task_scheduler/interactive/blockallocation.py (2 hunks)
tests/test_backend_interactive_serial.py (6 hunks)

Walkthrough

Adds try/except error handling and explicit success acknowledgments for the "init" branch in interactive backends (serial and parallel). The scheduler's init dispatch now uses send-and-receive and captures init-time exceptions to propagate them to pending Futures. Tests updated to assert the new initial acknowledgment message. No public API signatures changed.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Interactive backends: init handling<br>`executorlib/backend/interactive_parallel.py`, `executorlib/backend/interactive_serial.py` | Wrap the `init` memory update in try/except/else; on exception, send an error payload via the interface and call `backend_write_error_file` with `apply_dict=input_dict`; on success, send a `{"result": True}` acknowledgment back to the root/client. |
| Scheduler: init dispatch & propagation<br>`executorlib/task_scheduler/interactive/blockallocation.py` | Introduce `interface_initialization_exception`; replace the one-way `interface.send_dict(...)` with `interface.send_and_receive_dict(...)` in the init path and capture exceptions; if init failed, set that exception on the corresponding task Futures instead of executing the tasks. |
| Tests: handshake expectation<br>`tests/test_backend_interactive_serial.py` | Add assertions verifying that an initial `{"result": True}` message is received from the worker/socket after starting the worker or submitting a task; existing assertions are retained. |
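
To make the backend row of the table concrete, here is a minimal sketch of the try/except/else pattern it describes. The names `handle_init`, `send`, and `write_error_file` are hypothetical stand-ins chosen for this illustration; they are not executorlib's `interface_send` or `backend_write_error_file`, whose exact signatures are not reproduced here.

```python
from typing import Callable


def handle_init(
    init_fn: Callable[[], dict],
    memory: dict,
    send: Callable[[dict], None],
    write_error_file: Callable[[Exception], None],
) -> None:
    """Run an init function and send exactly one reply, success or error."""
    try:
        # Memory is only updated if the init function returns cleanly.
        memory.update(init_fn())
    except Exception as error:  # noqa: BLE001 - user init code may raise anything
        send({"error": error})
        write_error_file(error)
    else:
        send({"result": True})


def good_init() -> dict:
    return {"offset": 10}


def bad_init() -> dict:
    raise ValueError("init failed")


sent: list = []    # stands in for the socket
logged: list = []  # stands in for the error log file
memory: dict = {}

handle_init(good_init, memory, sent.append, logged.append)
handle_init(bad_init, memory, sent.append, logged.append)
```

After both calls, `sent` holds one acknowledgment followed by one error payload, and `memory` contains only the update from the successful init.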

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant S as Scheduler/Client
  participant I as Interface
  participant B as Backend (Serial/Parallel)

  rect #E8F3FF
    S->>I: send_and_receive_dict({type: "init", ...})
    I->>B: deliver init payload
    alt Init succeeds
      B->>I: {"result": True}
      I-->>S: {"result": True}
    else Init raises/exception
      B->>I: {"error": {...}}
      Note right of B: backend_write_error_file(apply_dict=input_dict)
      I-->>S: {"error": {...}}
      S->>S: scheduler records initialization exception
      S->>S: pending Futures set with exception
    end
  end
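
The scheduler side of the diagram can be sketched as follows. This is an illustration under stated assumptions: `FakeInterface` and `serve` are hypothetical names, and the sketch assumes the reply reaches the scheduler as a plain dict, as the diagram shows. The real `send_and_receive_dict` may behave differently, for example by raising on an error payload.

```python
from concurrent.futures import Future


class FakeInterface:
    """Hypothetical stand-in for the interface; returns a canned init reply."""

    def __init__(self, init_reply: dict):
        self.init_reply = init_reply

    def send_and_receive_dict(self, input_dict: dict) -> dict:
        if input_dict.get("init"):
            return self.init_reply
        result = input_dict["fn"](*input_dict["args"], **input_dict["kwargs"])
        return {"result": result}


def serve(interface, init_function, tasks):
    """Run the init handshake, then execute or fail each (task_dict, Future)."""
    init_exception = None
    try:
        reply = interface.send_and_receive_dict(
            {"init": True, "fn": init_function, "args": (), "kwargs": {}}
        )
        if reply.get("error") is not None:
            raise reply["error"]
    except Exception as exc:  # noqa: BLE001 - boundary layer, see review notes
        init_exception = exc
    for task_dict, future in tasks:
        if init_exception is not None:
            # Propagate the init failure instead of running the task.
            future.set_exception(init_exception)
        else:
            future.set_result(interface.send_and_receive_dict(task_dict)["result"])


task = {"fn": sum, "args": ([1, 2, 3],), "kwargs": {}}
ok_future, failed_future = Future(), Future()
serve(FakeInterface({"result": True}), None, [(task, ok_future)])
serve(FakeInterface({"error": ValueError("bad init")}), None, [(task, failed_future)])
```

In the failing case the pending Future carries the init exception, so a caller waiting on it is unblocked rather than left hanging.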

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Refactor interactive task scheduler #798 — Overlapping changes to interactive init error handling and task-scheduler refactor that touch the same init/control-flow paths.
Add worker ID #748 — Related edits to the "init" handling in interactive backends and memory initialization patterns.
Execute single task #796 — Related changes to blockallocation worker/execute_multiple_tasks entrypoint and init propagation.

Poem

A small soft paw taps at the queue,
I sniff the init and watch it through.
If something snaps, I catch the cry,
If it’s all well, I blink, "All right."
A carrot nod—ack sent, hop by hop. 🥕

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (3)

executorlib/task_scheduler/interactive/blockallocation.py (1)

231-235: Avoid indefinite wait on backend crash

send_and_receive_dict can block forever if the backend dies before replying. Consider adding a receive timeout or a watchdog/heartbeat at the interface layer and surfacing a TimeoutError here.

executorlib/backend/interactive_serial.py (1)

75-90: Catching broad Exception is intentional; silence the linter

Catching Exception here is appropriate to surface user init errors. Add a noqa to satisfy BLE001 without narrowing the scope.
-            except Exception as error:
+            except Exception as error:  # noqa: BLE001
                 interface_send(
                     socket=socket,
                     result_dict={"error": error},
                 )
executorlib/backend/interactive_parallel.py (1)

100-116: Optional: barrier after ack for stricter ordering

After sending the success/error response, consider a Barrier to ensure all ranks completed init before next command, mirroring the shutdown path’s synchronization.
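
The first nitpick above (avoiding an indefinite wait) could look roughly like this. It is a standard-library sketch only: `send_and_receive_with_timeout` and the callable `backend` are hypothetical, and a real fix would more likely use a receive timeout on the underlying socket at the interface layer.

```python
import queue
import threading


def send_and_receive_with_timeout(request, backend, timeout_s):
    """Call backend(request) in a thread and wait at most timeout_s for the reply."""
    replies = queue.Queue(maxsize=1)
    worker = threading.Thread(
        target=lambda: replies.put(backend(request)), daemon=True
    )
    worker.start()
    try:
        return replies.get(timeout=timeout_s)
    except queue.Empty:
        # Surface a TimeoutError instead of blocking forever.
        raise TimeoutError(f"no reply within {timeout_s} s") from None


def healthy_backend(request):
    return {"result": True}


stalled = threading.Event()


def dead_backend(request):
    stalled.wait()  # simulates a backend that never answers
    return {"result": True}


healthy_reply = send_and_receive_with_timeout({"init": True}, healthy_backend, 1.0)
try:
    send_and_receive_with_timeout({"init": True}, dead_backend, 0.1)
    timed_out = False
except TimeoutError:
    timed_out = True
finally:
    stalled.set()  # release the helper thread so it can exit
```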

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2aaca49 and 3530cf5.

📒 Files selected for processing (3)

executorlib/backend/interactive_parallel.py (1 hunks)
executorlib/backend/interactive_serial.py (1 hunks)
executorlib/task_scheduler/interactive/blockallocation.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (3)

executorlib/task_scheduler/interactive/blockallocation.py (3)

executorlib/task_scheduler/file/shared.py (1)

result (24-38)

executorlib/standalone/interactive/communication.py (1)

send_and_receive_dict (82-94)

tests/test_singlenodeexecutor_noblock.py (1)

init_function (22-23)

executorlib/backend/interactive_serial.py (3)

executorlib/standalone/interactive/backend.py (2)

call_funct (5-33)

funct (21-22)

executorlib/standalone/interactive/communication.py (1)

interface_send (221-230)

executorlib/standalone/error.py (1)

backend_write_error_file (4-21)

executorlib/backend/interactive_parallel.py (3)

executorlib/standalone/interactive/backend.py (2)

call_funct (5-33)

funct (21-22)

executorlib/standalone/interactive/communication.py (1)

interface_send (221-230)

executorlib/standalone/error.py (1)

backend_write_error_file (4-21)

🪛 Ruff (0.12.2)

executorlib/task_scheduler/interactive/blockallocation.py

234-234: Unnecessary key check before dictionary access

Replace with dict.get

(RUF019)

executorlib/backend/interactive_serial.py

79-79: Do not catch blind exception: Exception

(BLE001)

executorlib/backend/interactive_parallel.py

104-104: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)

GitHub Check: unittest_openmpi (ubuntu-latest, 3.11)
GitHub Check: unittest_mpich (ubuntu-22.04-arm, 3.13)
GitHub Check: unittest_openmpi (ubuntu-24.04-arm, 3.13)
GitHub Check: unittest_openmpi (macos-latest, 3.13)
GitHub Check: unittest_openmpi (ubuntu-latest, 3.12)
GitHub Check: unittest_openmpi (ubuntu-22.04-arm, 3.13)
GitHub Check: unittest_mpich (ubuntu-latest, 3.11)
GitHub Check: unittest_mpich (ubuntu-latest, 3.12)
GitHub Check: unittest_mpich (macos-latest, 3.13)
GitHub Check: unittest_mpich (ubuntu-latest, 3.13)
GitHub Check: unittest_flux_openmpi
GitHub Check: unittest_openmpi (ubuntu-latest, 3.13)
GitHub Check: unittest_flux_mpich
GitHub Check: minimal
GitHub Check: unittest_slurm_mpich
GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-mpich.yml)
GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-openmpi.yml)
GitHub Check: unittest_old
GitHub Check: notebooks_integration
GitHub Check: notebooks

coderabbitai · 2025-09-07T05:47:06Z

executorlib/backend/interactive_parallel.py

+            try:
+                memory.update(
+                    call_funct(input_dict=input_dict, funct=None, memory=memory)
+                )
+            except Exception as error:
+                if mpi_rank_zero:
+                    interface_send(
+                        socket=socket,
+                        result_dict={"error": error},
+                    )
+                    backend_write_error_file(
+                        error=error,
+                        apply_dict=input_dict,
+                    )
+            else:
+                if mpi_rank_zero:
+                    interface_send(socket=socket, result_dict={"result": True})


⚠️ Potential issue

Bug: init failures on non-root ranks are ignored; client may receive success

If any non-root rank raises during init, rank 0 still sends {"result": True}. This leaves ranks with divergent memory and hides failures.

Aggregate errors across ranks and only acknowledge success if all ranks succeeded. Example patch:

-            try:
-                memory.update(
-                    call_funct(input_dict=input_dict, funct=None, memory=memory)
-                )
-            except Exception as error:
-                if mpi_rank_zero:
-                    interface_send(
-                        socket=socket,
-                        result_dict={"error": error},
-                    )
-                    backend_write_error_file(
-                        error=error,
-                        apply_dict=input_dict,
-                    )
-            else:
-                if mpi_rank_zero:
-                    interface_send(socket=socket, result_dict={"result": True})
+            # Run init on all ranks and consolidate errors on rank 0
+            local_error = None
+            local_update = None
+            try:
+                local_update = call_funct(input_dict=input_dict, funct=None, memory=memory)
+            except Exception as error:  # noqa: BLE001
+                local_error = error
+            from mpi4py import MPI  # already available; keeps scope explicit
+            # Did any rank fail?
+            any_failed = MPI.COMM_WORLD.allreduce(bool(local_error), op=MPI.LOR)
+            if any_failed:
+                # Gather first non-None error to rank 0 and report
+                errors = MPI.COMM_WORLD.gather(local_error, root=0)
+                if mpi_rank_zero:
+                    first_error = next((e for e in errors if e is not None), None)
+                    interface_send(socket=socket, result_dict={"error": first_error})
+                    backend_write_error_file(error=first_error, apply_dict=input_dict)
+            else:
+                # All ranks succeeded: update memory consistently and acknowledge
+                memory.update(local_update)  # same update on all ranks
+                if mpi_rank_zero:
+                    interface_send(socket=socket, result_dict={"result": True})
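
The consolidation logic in the patch above can be checked without an MPI runtime. The sketch below simulates the per-rank outcomes with a plain list; `consolidate_init` is a hypothetical helper written for this illustration, and `any(...)` plays the role of the logical-OR all-reduce. It tests `is not None` rather than truthiness, so an exception type that defines `__bool__` or `__len__` would still count as a failure.

```python
def consolidate_init(rank_outcomes):
    """Decide the single reply rank 0 would send.

    rank_outcomes holds one entry per rank: either the update dict returned
    by the init function or the Exception it raised.
    """
    errors = [o if isinstance(o, Exception) else None for o in rank_outcomes]
    # Stand-in for allreduce(..., op=LOR): true if any rank failed.
    any_failed = any(e is not None for e in errors)
    if any_failed:
        # Stand-in for gather(..., root=0) followed by picking the first error.
        first_error = next(e for e in errors if e is not None)
        return {"error": first_error}
    return {"result": True}


rank_two_error = ValueError("rank 2 failed")
all_ok = consolidate_init([{"a": 1}, {"a": 1}, {"a": 1}])
one_failed = consolidate_init([{"a": 1}, {"a": 1}, rank_two_error])
```

With this rule, a failure on any non-root rank produces an error reply instead of a success acknowledgment.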

🧰 Tools

🪛 Ruff (0.12.2)

104-104: Do not catch blind exception: Exception

(BLE001)

coderabbitai · 2025-09-07T05:47:06Z

executorlib/task_scheduler/interactive/blockallocation.py

+        result = interface.send_and_receive_dict(
            input_dict={"init": True, "fn": init_function, "args": (), "kwargs": {}}
        )
+        if "error" in result and result["error"]:
+            raise result["error"]


🛠️ Refactor suggestion

Init errors: also log to file and simplify error check

You’re not passing error_log_file in the init payload, so backend_write_error_file can’t persist init failures. Add it.

Replace the explicit key check with get(...) and raise a sensible RuntimeError if a non-exception is received (matches Ruff RUF019 hint).

-        result = interface.send_and_receive_dict(
-            input_dict={"init": True, "fn": init_function, "args": (), "kwargs": {}}
-        )
-        if "error" in result and result["error"]:
-            raise result["error"]
+        result = interface.send_and_receive_dict(
+            input_dict={
+                "init": True,
+                "fn": init_function,
+                "args": (),
+                "kwargs": {},
+                "error_log_file": error_log_file,
+            }
+        )
+        error = result.get("error")
+        if error:
+            if isinstance(error, BaseException):
+                raise error
+            raise RuntimeError(f"Init failed with non-exception payload: {error!r}")

🧰 Tools

🪛 Ruff (0.12.2)

234-234: Unnecessary key check before dictionary access

Replace with dict.get

(RUF019)

🤖 Prompt for AI Agents

In executorlib/task_scheduler/interactive/blockallocation.py around lines 231 to 235, the init payload must include the error_log_file so backend_write_error_file can persist init failures; update the input_dict to pass error_log_file and then replace the explicit key check with using result.get("error") — if error is truthy and is an Exception raise it directly, otherwise raise a RuntimeError (including the received error representation) to surface non-exception failures.
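
The response check suggested in this comment can be isolated into a small helper, which makes the non-exception case easy to see: a bare `raise` on a string payload would itself fail with a `TypeError`. `raise_on_init_error` and `outcome` are hypothetical names used only for this sketch.

```python
def raise_on_init_error(result: dict) -> None:
    """Raise if the init reply carries an error; return silently otherwise."""
    error = result.get("error")
    if not error:
        return
    if isinstance(error, BaseException):
        raise error
    # A non-exception payload cannot be raised directly, so wrap it.
    raise RuntimeError(f"Init failed with non-exception payload: {error!r}")


def outcome(result: dict) -> str:
    """Report which exception type, if any, the reply produces."""
    try:
        raise_on_init_error(result)
    except Exception as exc:  # noqa: BLE001 - demo helper only
        return type(exc).__name__
    return "ok"


outcomes = [
    outcome({"result": True}),
    outcome({"error": ValueError("bad init")}),
    outcome({"error": "bad init"}),
]
```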

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

tests/test_backend_interactive_serial.py (1)

42-42: Avoid hangs in CI by setting a receive timeout (optional).

If the handshake regresses, these blocking recv() calls could hang. Consider setting a small RCVTIMEO on the test-side socket after creation.

Example (apply in each test right after creating the socket):

socket.setsockopt(zmq.RCVTIMEO, 5000)  # 5s timeout to fail fast instead of hanging

Also applies to: 55-55, 70-70, 83-83

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3530cf5 and e3e00f8.

📒 Files selected for processing (2)

executorlib/task_scheduler/interactive/blockallocation.py (1 hunks)
tests/test_backend_interactive_serial.py (4 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

executorlib/task_scheduler/interactive/blockallocation.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)

GitHub Check: unittest_openmpi (ubuntu-24.04-arm, 3.13)
GitHub Check: unittest_mpich (ubuntu-24.04-arm, 3.13)
GitHub Check: unittest_openmpi (ubuntu-latest, 3.13)
GitHub Check: unittest_openmpi (macos-latest, 3.13)
GitHub Check: unittest_slurm_mpich
GitHub Check: unittest_flux_mpich
GitHub Check: unittest_mpich (ubuntu-22.04-arm, 3.13)
GitHub Check: unittest_win
GitHub Check: unittest_mpich (ubuntu-latest, 3.13)
GitHub Check: unittest_mpich (ubuntu-latest, 3.11)
GitHub Check: unittest_mpich (macos-latest, 3.13)
GitHub Check: unittest_mpich (ubuntu-latest, 3.12)
GitHub Check: notebooks
GitHub Check: minimal
GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-mpich.yml)

🔇 Additional comments (4)

tests/test_backend_interactive_serial.py (4)

42-42: Init ACK assertion matches the new handshake — LGTM.
Validates that successful init now replies with {"result": True} before task results.

55-55: Error-path still ACKs init first — LGTM.
Confirms init success ACK precedes the error payload.

70-70: Submit-thread path: init ACK assertion — LGTM.
Covers the inverse threading direction with the same contract.

83-83: Submit-thread error path: init ACK assertion — LGTM.
Ensures consistent ordering (ACK → error) here as well.

codecov · 2025-09-07T06:20:19Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.76%. Comparing base (774aa56) to head (6163cca).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #804      +/-   ##
==========================================
+ Coverage   97.75%   97.76%   +0.01%     
==========================================
  Files          32       32              
  Lines        1468     1479      +11     
==========================================
+ Hits         1435     1446      +11     
  Misses         33       33

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

executorlib/task_scheduler/interactive/blockallocation.py (1)

232-237: Init errors are silently ignored; inspect response and pass error_log_file

You discard the init response and only catch transport-layer failures. Backend init failures are returned as {"error": ...}, so they won't be propagated. Also include error_log_file in the init payload so the backend can persist the failure.

Apply:

-        try:
-            _ = interface.send_and_receive_dict(
-                input_dict={"init": True, "fn": init_function, "args": (), "kwargs": {}}
-            )
-        except Exception as init_exception:
-            interface_initialization_exception = init_exception
+        try:
+            result = interface.send_and_receive_dict(
+                input_dict={
+                    "init": True,
+                    "fn": init_function,
+                    "args": (),
+                    "kwargs": {},
+                    "error_log_file": error_log_file,
+                }
+            )
+            error = result.get("error")
+            if error:
+                if isinstance(error, BaseException):
+                    raise error
+                raise RuntimeError(f"Init failed with non-exception payload: {error!r}")
+        except Exception as init_exception:
+            interface_initialization_exception = init_exception

Run to confirm backends return {"result": True} or {"error": ...} and accept error_log_file:

#!/bin/bash
rg -nC3 --glob 'executorlib/backend/interactive_*.*' -e 'error_log_file' -e '"error"' -e '"result"\s*:\s*True'

🧹 Nitpick comments (1)

executorlib/task_scheduler/interactive/blockallocation.py (1)
236-237: Document rationale for broad Exception catch
Add a brief comment or # noqa: BLE001 above
except Exception as init_exception:
    interface_initialization_exception = init_exception
to note that send_and_receive_dict may raise multiple exception types (e.g. ExecutorlibSocketError, ZMQError, cloudpickle errors, or arbitrary remote exceptions) and thus a broad catch is required in this boundary layer.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e3e00f8 and bc18718.

📒 Files selected for processing (1)

executorlib/task_scheduler/interactive/blockallocation.py (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

executorlib/task_scheduler/interactive/blockallocation.py (2)

executorlib/standalone/interactive/communication.py (1)

send_and_receive_dict (82-94)

executorlib/task_scheduler/interactive/shared.py (1)

execute_task_dict (12-48)

🪛 Ruff (0.12.2)

executorlib/task_scheduler/interactive/blockallocation.py

236-236: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)

GitHub Check: unittest_openmpi (ubuntu-latest, 3.12)
GitHub Check: unittest_openmpi (ubuntu-latest, 3.11)
GitHub Check: unittest_openmpi (ubuntu-22.04-arm, 3.13)
GitHub Check: unittest_openmpi (ubuntu-latest, 3.13)
GitHub Check: unittest_openmpi (macos-latest, 3.13)
GitHub Check: notebooks_integration
GitHub Check: unittest_slurm_mpich
GitHub Check: unittest_flux_mpich
GitHub Check: unittest_flux_openmpi
GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-openmpi.yml)
GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-mpich.yml)

coderabbitai · 2025-09-07T06:22:28Z

executorlib/task_scheduler/interactive/blockallocation.py

+            if interface_initialization_exception is not None:
+                f.set_exception(exception=interface_initialization_exception)
+            else:
+                execute_task_dict(
+                    task_dict=task_dict,
+                    future_obj=f,
+                    interface=interface,
+                    cache_directory=cache_directory,
+                    cache_key=cache_key,
+                    error_log_file=error_log_file,
+                )


🛠️ Refactor suggestion

⚠️ Potential issue

Respect cancellation before setting exception to avoid InvalidStateError

Setting an exception on a cancelled Future can raise InvalidStateError and break the worker loop. Mirror execute_task_dict’s pattern.

-            if interface_initialization_exception is not None:
-                f.set_exception(exception=interface_initialization_exception)
+            if interface_initialization_exception is not None:
+                if (not f.done()) and f.set_running_or_notify_cancel():
+                    f.set_exception(interface_initialization_exception)
             else:
                 execute_task_dict(
                     task_dict=task_dict,
                     future_obj=f,
                     interface=interface,
                     cache_directory=cache_directory,
                     cache_key=cache_key,
                     error_log_file=error_log_file,
                 )

🤖 Prompt for AI Agents

In executorlib/task_scheduler/interactive/blockallocation.py around lines 248 to 258, avoid calling f.set_exception on a Future that may already be cancelled (which can raise InvalidStateError); before setting the exception, check if f.cancelled() (or otherwise mirror the guard used in execute_task_dict) and only call set_exception when the future is not cancelled, otherwise skip setting the exception (or handle via the same cancellation/ignore path used by execute_task_dict).
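
The cancellation issue raised in this comment can be reproduced with the standard library alone. The sketch below uses a hypothetical helper, `fail_future`, that applies the guard from the suggested patch, and contrasts it with an unguarded `set_exception` call on a cancelled Future (which raises `InvalidStateError` on Python 3.8 and later).

```python
from concurrent.futures import Future, InvalidStateError


def fail_future(future: Future, exception: BaseException) -> bool:
    """Attach exception unless the future is cancelled or already finished.

    Returns True if the exception was set.
    """
    if not future.done() and future.set_running_or_notify_cancel():
        future.set_exception(exception)
        return True
    return False


init_error = RuntimeError("init failed")

pending = Future()
cancelled = Future()
cancelled.cancel()

pending_was_set = fail_future(pending, init_error)
cancelled_was_set = fail_future(cancelled, init_error)

# Without the guard, a cancelled future rejects the exception.
unguarded = Future()
unguarded.cancel()
try:
    unguarded.set_exception(init_error)
    unguarded_raised = False
except InvalidStateError:
    unguarded_raised = True
```

The `done()` check comes first so that `set_running_or_notify_cancel()` is only called on Futures that are still pending.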

jan-janssen and others added 2 commits September 7, 2025 07:40

Implement a fail save init function

3a8f7e2

[pre-commit.ci] auto fixes from pre-commit.com hooks

3530cf5

for more information, see https://pre-commit.ci

coderabbitai bot reviewed Sep 7, 2025

fixes

e3e00f8

coderabbitai bot reviewed Sep 7, 2025

jan-janssen and others added 3 commits September 7, 2025 08:15

handle error during initialization

07daae3

[pre-commit.ci] auto fixes from pre-commit.com hooks

bd190d6

for more information, see https://pre-commit.ci

refactor

bc18718

coderabbitai bot reviewed Sep 7, 2025

jan-janssen and others added 2 commits September 7, 2025 08:27

extend tests

8754881

Merge branch 'main' into fail_save_init_function

6163cca

jan-janssen merged commit 92cd615 into main Sep 7, 2025
116 of 119 checks passed

jan-janssen deleted the fail_save_init_function branch September 7, 2025 07:28
