Cluster error handling #621

jan-janssen · 2025-03-28T12:25:09Z

Summary by CodeRabbit

Bug Fixes
- Enhanced error handling and output formatting across task execution flows, ensuring that success and failure states are clearly and consistently reported.
Tests
- Updated test cases to validate the enhanced error detection, ensuring reliable evaluation of operation outcomes.
- Added new tests to specifically assess error handling scenarios, including tests for raised exceptions and error propagation in various executor contexts.

for more information, see https://pre-commit.ci

coderabbitai · 2025-03-28T12:25:17Z

Warning

Rate limit exceeded

@jan-janssen has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 2 minutes and 28 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between bc4c07d and bf10747.

📒 Files selected for processing (2)

executorlib/interactive/shared.py (1 hunks)
tests/test_singlenodeexecutor_cache.py (3 hunks)

Walkthrough

The changes enhance error handling and output formatting across multiple modules. Functions involved in result retrieval and file writing now utilize try-except blocks to manage exceptions. They also return an additional boolean flag to indicate whether execution was error-free. Consequently, the structure for logging output has been standardized to include either a "result" or an "error" key, and tests have been updated to reflect these modifications.

Changes

File(s)	Change Summary
`executorlib/.../backend/cache_parallel.py` `executorlib/.../cache/backend.py`	Introduced try-except blocks for MPI result gathering and function execution. Outputs are now logged with either a `"result"` key on success or an `"error"` key on failure.
`executorlib/.../cache/shared.py` `executorlib/.../interactive/shared.py`	Adjusted result retrieval to include an additional `no_error_flag` from `get_output`. Enhanced logic to raise exceptions or set future errors based on combined flag checks.
`executorlib/.../standalone/hdf.py` `tests/.../test_standalone_hdf.py`	Updated HDF5 file handling to return a tuple `(flag, no_error, output)` with an added `"error"` key. Tests were modified to assert the new `no_error` flag behavior.
`tests/.../test_cache_fileexecutor_serial.py`	Added a new function `get_error(a)` and a test method `test_executor_error(self)` to validate error handling in the executor.
`tests/.../test_singlenodeexecutor_cache.py`	Added a new function `get_error(a)` and a test method `test_cache_error(self)` to validate error handling in the SingleNodeExecutor.
`tests/.../test_cache_backend_execute.py`	Added a new function `get_error(a)` and a test method `test_execute_function_error(self)` to validate error handling in the backend execution.

Sequence Diagram(s)

sequenceDiagram
    participant Caller as Task Requester
    participant Executor as Task Executor
    participant Output as get_output

    Caller->>Executor: Call execute_task or get results
    Executor->>Output: Invoke get_output()
    alt Successful Execution<br/>(exec_flag & no_error_flag true)
        Output-->>Executor: (true, true, result)
        Executor-->>Caller: Return result
    else Execution Error<br/>(exec_flag true, no_error_flag false)
        Output-->>Executor: (true, false, error)
        Executor-->>Caller: Raise error / Set future exception
    end

Poem

I'm a rabbit hopping in code delight,
Errors now caught and handled just right.
Try-excepts leap like a joyful bound,
Logging results or errors safe and sound.
In the land of bits, our fixes set the pace,
Hopping forward with a bug-free grace.
🐰💻 Happy coding!

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai plan to trigger planning for file edits and PR creation.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

codecov · 2025-03-28T12:27:53Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.44%. Comparing base (8f1026f) to head (bf10747).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #621      +/-   ##
==========================================
+ Coverage   96.41%   96.44%   +0.03%     
==========================================
  Files          28       28              
  Lines        1254     1265      +11     
==========================================
+ Hits         1209     1220      +11     
  Misses         45       45

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (9)

executorlib/backend/cache_parallel.py (1)
42-59: Improved error handling in MPI result gathering.

You've enhanced error handling by wrapping the MPI gather operation in a try-except block, which allows for capturing and reporting errors. This makes the system more robust by ensuring errors are properly logged instead of causing crashes.

While using a broad except Exception is typically discouraged, it's appropriate here since you're explicitly handling the error by writing it to the output file. However, for added safety, consider adding specific exception types if certain errors are expected:
    try:
        result = (
            MPI.COMM_WORLD.gather(output, root=0) if mpi_size_larger_one else output
        )
-    except Exception as error:
+    except (ValueError, RuntimeError, MPI.Exception) as error:  # Specific types if known
        if mpi_rank_zero:
            backend_write_file(
                file_name=file_name,
                output={"error": error},
                runtime=time.time() - time_start,
            )
executorlib/standalone/hdf.py (1)
64-80: Enhanced error handling with additional status flag.

The function signature has been updated to return a tuple of [bool, bool, Any] instead of [bool, Any], with the second boolean indicating whether an error occurred. This provides a more robust approach to error handling.

The docstring should be updated to clarify what each return value represents:
    """
    Check if output is available in the HDF5 file

    Args:
        file_name (str): file name of the HDF5 file as absolute path

    Returns:
-        Tuple[bool, bool, object]: boolean flag indicating if output is available and the output object itself
+        Tuple[bool, bool, object]: 
+            - First boolean flag indicating if any output (result or error) is available
+            - Second boolean flag indicating if the output is error-free (True) or contains an error (False)
+            - The output object itself or error information
    """
executorlib/interactive/shared.py (1)
174-180: Improved error handling with no_error_flag.

The function now properly handles errors by using the no_error_flag from get_output to determine whether to set the result or exception on the future object. This provides a clear distinction between successful and erroneous outcomes.

Add a comment explaining what no_error_flag represents for clarity:
-        _, no_error_flag, result = get_output(file_name=file_name)
+        # Unpack get_output result: exec_flag indicates if any output exists, 
+        # no_error_flag indicates if the output is error-free
+        _, no_error_flag, result = get_output(file_name=file_name)
        future = task_dict["future"]
        if no_error_flag:
            future.set_result(result)
        else:
            future.set_exception(result)
tests/test_standalone_hdf.py (1)
101-105: Updated test to handle new return value structure.

The test has been properly updated to handle the new three-value return from get_output, including the new no_error flag.

All current tests verify the behavior when neither "output" nor "error" keys are present. Consider adding tests that verify the behavior when these keys are present:
def test_hdf_with_output():
    cache_directory = os.path.abspath("cache")
    os.makedirs(cache_directory, exist_ok=True)
    file_name = os.path.join(cache_directory, "test_output.h5")
    expected_output = {"result": "success"}
    dump(file_name=file_name, data_dict={"output": expected_output})
    flag, no_error, output = get_output(file_name=file_name)
    self.assertTrue(flag)
    self.assertTrue(no_error)
    self.assertEqual(output, expected_output)

def test_hdf_with_error():
    cache_directory = os.path.abspath("cache")
    os.makedirs(cache_directory, exist_ok=True)
    file_name = os.path.join(cache_directory, "test_error.h5")
    expected_error = ValueError("Test error")
    dump(file_name=file_name, data_dict={"error": expected_error})
    flag, no_error, output = get_output(file_name=file_name)
    self.assertTrue(flag)
    self.assertFalse(no_error)
    self.assertEqual(type(output), ValueError)
    self.assertEqual(str(output), "Test error")
executorlib/cache/backend.py (2)
32-44: Consider updating function docstring

The function signature and behavior have changed significantly to handle structured error output, but the docstring doesn't reflect this change. Consider updating it to mention that output is now expected to be a dictionary with either a "result" or an "error" key.
    """
    Write the output to an HDF5 file.

    Args:
        file_name (str): The name of the HDF5 file.
-        output (Any): The output to be written.
+        output (dict): A dictionary containing either a "result" or an "error" key.
        runtime (float): Time for executing function.

    Returns:
        None

    """
60-69: Update function docstring for error handling

The function now includes error handling with try-except, but the docstring doesn't reflect this behavior. Consider updating it to clarify that errors are caught and structured in the output.
    """
    Execute the task stored in a given HDF5 file.

    Args:
        file_name (str): The file name of the HDF5 file as an absolute path.

    Returns:
        None
+
+    Note:
+        Exceptions during task execution are caught and structured in the output
+        as an "error" key rather than being propagated.
    """
executorlib/cache/shared.py (3)
25-39: Consider updating method docstring

The result method now has different behavior when an error is encountered, but the docstring doesn't reflect this. Consider updating it to mention that exceptions can be raised.
    """
    Get the result of the future item.

    Returns:
        str: The result of the future item.

+   Raises:
+       Exception: If the task execution resulted in an error.
    """
185-199: Update function docstring for exception handling

The _check_task_output function now sets exceptions on future objects, but the docstring doesn't reflect this behavior. Consider updating it.
    """
    Check the output of a task and set the result of the future object if available.

    Args:
        task_key (str): The key of the task.
        future_obj (Future): The future object associated with the task.
        cache_directory (str): The directory where the HDF5 files are stored.

    Returns:
        Future: The updated future object.

+   Note:
+       If the task execution resulted in an error, the exception will be set
+       on the future object instead of the result.
    """
39-39: Consider adding a recursion limit

The recursive call to self.result() could potentially lead to a stack overflow if get_output consistently returns exec_flag as False. Consider adding a timeout mechanism or a maximum number of retries.
    def result(self) -> Any:
        """
        Get the result of the future item.

        Returns:
            str: The result of the future item.

        """
+       max_retries = 100  # Adjust as needed
+       self._retries = getattr(self, '_retries', 0) + 1
        exec_flag, no_error_flag, result = get_output(file_name=self._file_name)
        if exec_flag and no_error_flag:
            return result
        elif exec_flag:
            raise result
+       elif self._retries > max_retries:
+           raise TimeoutError(f"Maximum retries ({max_retries}) exceeded waiting for result")
        else:
            return self.result()

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f1026f and 6105865.

📒 Files selected for processing (6)

executorlib/backend/cache_parallel.py (1 hunks)
executorlib/cache/backend.py (2 hunks)
executorlib/cache/shared.py (2 hunks)
executorlib/interactive/shared.py (1 hunks)
executorlib/standalone/hdf.py (2 hunks)
tests/test_standalone_hdf.py (4 hunks)

🧰 Additional context used

🧬 Code Definitions (5)

executorlib/backend/cache_parallel.py (1)

executorlib/cache/backend.py (1)

backend_write_file (32-57)

executorlib/interactive/shared.py (2)

executorlib/cache/shared.py (1)

result (25-39)

executorlib/standalone/hdf.py (1)

get_output (64-80)

tests/test_standalone_hdf.py (1)

executorlib/standalone/hdf.py (2)

get_output (64-80)

get_runtime (83-97)

executorlib/cache/backend.py (2)

executorlib/standalone/hdf.py (1)

dump (19-34)

executorlib/cache/shared.py (1)

result (25-39)

executorlib/cache/shared.py (1)

executorlib/standalone/hdf.py (1)

get_output (64-80)

⏰ Context from checks skipped due to timeout of 90000ms (15)

GitHub Check: unittest_mpich (ubuntu-latest, 3.11)
GitHub Check: unittest_mpich (macos-latest, 3.13)
GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-openmpi.yml)
GitHub Check: unittest_mpich (ubuntu-latest, 3.13)
GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-mpich.yml)
GitHub Check: unittest_flux_mpich
GitHub Check: unittest_openmpi (ubuntu-latest, 3.11)
GitHub Check: unittest_openmpi (macos-latest, 3.13)
GitHub Check: unittest_openmpi (ubuntu-latest, 3.13)
GitHub Check: unittest_mpich (ubuntu-latest, 3.12)
GitHub Check: unittest_win
GitHub Check: unittest_old
GitHub Check: unittest_openmpi (ubuntu-latest, 3.12)
GitHub Check: unittest_flux_openmpi
GitHub Check: notebooks

🔇 Additional comments (8)

executorlib/standalone/hdf.py (1)

13-13: Added error key to group_dict.

Adding the "error" key to the group_dict provides a standardized way to represent errors in the HDF5 files, consistent with the improved error handling throughout the codebase.

tests/test_standalone_hdf.py (3)

42-46: Updated test to handle new return value structure.

The test has been properly updated to handle the new three-value return from get_output, including the new no_error flag.

59-63: Updated test to handle new return value structure.

The test has been properly updated to handle the new three-value return from get_output, including the new no_error flag.

85-89: Updated test to handle new return value structure.

The test has been properly updated to handle the new three-value return from get_output, including the new no_error flag.

executorlib/cache/backend.py (2)

47-56: Good improvement to error handling

The structured approach of distinguishing between result and error cases improves robustness. This change aligns with the modifications in get_output function (from the provided relevant code snippets).

72-80: Robust error handling implementation

The try-except block is a good addition that ensures errors are caught and structured consistently for the output. This works well with the changes in backend_write_file.

executorlib/cache/shared.py (2)

33-37: Improved error handling with no_error_flag

The updated code correctly handles the new three-tuple return format from get_output (as seen in the provided relevant code snippets). The conditional logic is well-structured to either return the result or raise it as an exception based on the error flag.

203-207: Consistent error handling implementation

This change maintains consistency with the result method by properly handling the no_error_flag to set either a result or an exception on the future object. Good implementation.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

tests/test_singlenodeexecutor_cache.py (1)
35-41: Good addition of error handling test.

This test correctly verifies that errors from submitted functions are properly propagated through the SingleNodeExecutor. The test submits a function that raises a ValueError and then asserts that the same error type is raised when calling result() on the future object.

Consider two minor improvements:

The print statement is unnecessary since we're only testing error propagation

A more thorough test would also verify the error message content
def test_cache_error(self):
    cache_directory = "./cache_error"
    with SingleNodeExecutor(cache_directory=cache_directory) as exe:
        f = exe.submit(get_error, a=1)
        with self.assertRaises(ValueError):
-            print(f.result())
+            f.result()

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6dd9817 and 73dca53.

📒 Files selected for processing (2)

tests/test_cache_fileexecutor_serial.py (2 hunks)
tests/test_singlenodeexecutor_cache.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

tests/test_cache_fileexecutor_serial.py

🧰 Additional context used

🧬 Code Definitions (1)

tests/test_singlenodeexecutor_cache.py (1)

tests/test_cache_fileexecutor_serial.py (2)

get_error (30-31)

tearDown (205-207)

⏰ Context from checks skipped due to timeout of 90000ms (15)

GitHub Check: unittest_openmpi (ubuntu-latest, 3.11)
GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-mpich.yml)
GitHub Check: unittest_openmpi (macos-latest, 3.13)
GitHub Check: notebooks
GitHub Check: unittest_old
GitHub Check: unittest_flux_mpich
GitHub Check: unittest_openmpi (ubuntu-latest, 3.12)
GitHub Check: unittest_openmpi (ubuntu-latest, 3.13)
GitHub Check: unittest_mpich (ubuntu-latest, 3.11)
GitHub Check: unittest_mpich (ubuntu-latest, 3.13)
GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-openmpi.yml)
GitHub Check: unittest_mpich (ubuntu-latest, 3.12)
GitHub Check: unittest_mpich (macos-latest, 3.13)
GitHub Check: unittest_flux_openmpi
GitHub Check: unittest_win

🔇 Additional comments (2)

tests/test_singlenodeexecutor_cache.py (2)

15-17: LGTM - This looks like a good utility function for error testing.

The get_error function has been implemented correctly to raise a ValueError with the provided argument. This aligns with the PR objectives to improve cluster error handling by allowing tests to verify proper error propagation.

45-46: LGTM - Proper cleanup of the new test directory.

Adding cleanup for the "cache_error" directory in the tearDown method ensures that test resources are properly removed after each test run.

jan-janssen and others added 3 commits March 28, 2025 13:21

cluster error handling

65bf1da

fix error read

c36d437

[pre-commit.ci] auto fixes from pre-commit.com hooks

6105865

for more information, see https://pre-commit.ci

coderabbitai bot reviewed Mar 28, 2025

View reviewed changes

jan-janssen added 2 commits March 28, 2025 13:36

Add additional tests

6dd9817

extend error handling

73dca53

coderabbitai bot reviewed Mar 28, 2025

View reviewed changes

jan-janssen added 6 commits March 28, 2025 13:53

folder error

f312b41

fix testing with flux

cea1a4e

another test

9d03edd

hit cache again

bc4c07d

in interactive mode only successful calculation are stored

733f157

clean up

bf10747

jan-janssen merged commit c4ce1e1 into main Mar 28, 2025
29 checks passed

jan-janssen deleted the cluster_error_handling branch March 28, 2025 13:51

coderabbitai bot mentioned this pull request Jul 16, 2025

Define error log file in resource_dict #736

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Cluster error handling #621

Cluster error handling #621

Uh oh!

jan-janssen commented Mar 28, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Mar 28, 2025 •

edited

Loading

Rate limit exceeded

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

Uh oh!

codecov bot commented Mar 28, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Cluster error handling #621

Cluster error handling #621

Uh oh!

Conversation

jan-janssen commented Mar 28, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Poem

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

codecov bot commented Mar 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jan-janssen commented Mar 28, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 28, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

codecov bot commented Mar 28, 2025 •

edited

Loading