-
Notifications
You must be signed in to change notification settings - Fork 3
Cluster error handling #621
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Warning Rate limit exceeded@jan-janssen has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 2 minutes and 28 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (2)
WalkthroughThe changes enhance error handling and output formatting across multiple modules. Functions involved in result retrieval and file writing now utilize try-except blocks to manage exceptions. They also return an additional boolean flag to indicate whether execution was error-free. Consequently, the structure for logging output has been standardized to include either a Changes
Sequence Diagram(s)sequenceDiagram
participant Caller as Task Requester
participant Executor as Task Executor
participant Output as get_output
Caller->>Executor: Call execute_task or get results
Executor->>Output: Invoke get_output()
alt Successful Execution<br/>(exec_flag & no_error_flag true)
Output-->>Executor: (true, true, result)
Executor-->>Caller: Return result
else Execution Error<br/>(exec_flag true, no_error_flag false)
Output-->>Executor: (true, false, error)
Executor-->>Caller: Raise error / Set future exception
end
Poem
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #621 +/- ##
==========================================
+ Coverage 96.41% 96.44% +0.03%
==========================================
Files 28 28
Lines 1254 1265 +11
==========================================
+ Hits 1209 1220 +11
Misses 45 45 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (9)
executorlib/backend/cache_parallel.py (1)
42-59: Improved error handling in MPI result gathering.You've enhanced error handling by wrapping the MPI gather operation in a try-except block, which allows for capturing and reporting errors. This makes the system more robust by ensuring errors are properly logged instead of causing crashes.
While using a broad
except Exceptionis typically discouraged, it's appropriate here since you're explicitly handling the error by writing it to the output file. However, for added safety, consider adding specific exception types if certain errors are expected:try: result = ( MPI.COMM_WORLD.gather(output, root=0) if mpi_size_larger_one else output ) - except Exception as error: + except (ValueError, RuntimeError, MPI.Exception) as error: # Specific types if known if mpi_rank_zero: backend_write_file( file_name=file_name, output={"error": error}, runtime=time.time() - time_start, )executorlib/standalone/hdf.py (1)
64-80: Enhanced error handling with additional status flag.The function signature has been updated to return a tuple of
[bool, bool, Any]instead of[bool, Any], with the second boolean indicating whether an error occurred. This provides a more robust approach to error handling.The docstring should be updated to clarify what each return value represents:
""" Check if output is available in the HDF5 file Args: file_name (str): file name of the HDF5 file as absolute path Returns: - Tuple[bool, bool, object]: boolean flag indicating if output is available and the output object itself + Tuple[bool, bool, object]: + - First boolean flag indicating if any output (result or error) is available + - Second boolean flag indicating if the output is error-free (True) or contains an error (False) + - The output object itself or error information """executorlib/interactive/shared.py (1)
174-180: Improved error handling with no_error_flag.The function now properly handles errors by using the
no_error_flagfromget_outputto determine whether to set the result or exception on the future object. This provides a clear distinction between successful and erroneous outcomes.Add a comment explaining what
no_error_flagrepresents for clarity:- _, no_error_flag, result = get_output(file_name=file_name) + # Unpack get_output result: exec_flag indicates if any output exists, + # no_error_flag indicates if the output is error-free + _, no_error_flag, result = get_output(file_name=file_name) future = task_dict["future"] if no_error_flag: future.set_result(result) else: future.set_exception(result)tests/test_standalone_hdf.py (1)
101-105: Updated test to handle new return value structure.The test has been properly updated to handle the new three-value return from
get_output, including the newno_errorflag.All current tests verify the behavior when neither "output" nor "error" keys are present. Consider adding tests that verify the behavior when these keys are present:
def test_hdf_with_output(): cache_directory = os.path.abspath("cache") os.makedirs(cache_directory, exist_ok=True) file_name = os.path.join(cache_directory, "test_output.h5") expected_output = {"result": "success"} dump(file_name=file_name, data_dict={"output": expected_output}) flag, no_error, output = get_output(file_name=file_name) self.assertTrue(flag) self.assertTrue(no_error) self.assertEqual(output, expected_output) def test_hdf_with_error(): cache_directory = os.path.abspath("cache") os.makedirs(cache_directory, exist_ok=True) file_name = os.path.join(cache_directory, "test_error.h5") expected_error = ValueError("Test error") dump(file_name=file_name, data_dict={"error": expected_error}) flag, no_error, output = get_output(file_name=file_name) self.assertTrue(flag) self.assertFalse(no_error) self.assertEqual(type(output), ValueError) self.assertEqual(str(output), "Test error")executorlib/cache/backend.py (2)
32-44: Consider updating function docstringThe function signature and behavior have changed significantly to handle structured error output, but the docstring doesn't reflect this change. Consider updating it to mention that
outputis now expected to be a dictionary with either a "result" or an "error" key.""" Write the output to an HDF5 file. Args: file_name (str): The name of the HDF5 file. - output (Any): The output to be written. + output (dict): A dictionary containing either a "result" or an "error" key. runtime (float): Time for executing function. Returns: None """
60-69: Update function docstring for error handlingThe function now includes error handling with try-except, but the docstring doesn't reflect this behavior. Consider updating it to clarify that errors are caught and structured in the output.
""" Execute the task stored in a given HDF5 file. Args: file_name (str): The file name of the HDF5 file as an absolute path. Returns: None + + Note: + Exceptions during task execution are caught and structured in the output + as an "error" key rather than being propagated. """executorlib/cache/shared.py (3)
25-39: Consider updating method docstringThe
resultmethod now has different behavior when an error is encountered, but the docstring doesn't reflect this. Consider updating it to mention that exceptions can be raised.""" Get the result of the future item. Returns: str: The result of the future item. + Raises: + Exception: If the task execution resulted in an error. """
185-199: Update function docstring for exception handlingThe
_check_task_outputfunction now sets exceptions on future objects, but the docstring doesn't reflect this behavior. Consider updating it.""" Check the output of a task and set the result of the future object if available. Args: task_key (str): The key of the task. future_obj (Future): The future object associated with the task. cache_directory (str): The directory where the HDF5 files are stored. Returns: Future: The updated future object. + Note: + If the task execution resulted in an error, the exception will be set + on the future object instead of the result. """
39-39: Consider adding a recursion limitThe recursive call to
self.result()could potentially lead to a stack overflow ifget_outputconsistently returnsexec_flagas False. Consider adding a timeout mechanism or a maximum number of retries.def result(self) -> Any: """ Get the result of the future item. Returns: str: The result of the future item. """ + max_retries = 100 # Adjust as needed + self._retries = getattr(self, '_retries', 0) + 1 exec_flag, no_error_flag, result = get_output(file_name=self._file_name) if exec_flag and no_error_flag: return result elif exec_flag: raise result + elif self._retries > max_retries: + raise TimeoutError(f"Maximum retries ({max_retries}) exceeded waiting for result") else: return self.result()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
executorlib/backend/cache_parallel.py(1 hunks)executorlib/cache/backend.py(2 hunks)executorlib/cache/shared.py(2 hunks)executorlib/interactive/shared.py(1 hunks)executorlib/standalone/hdf.py(2 hunks)tests/test_standalone_hdf.py(4 hunks)
🧰 Additional context used
🧬 Code Definitions (5)
executorlib/backend/cache_parallel.py (1)
executorlib/cache/backend.py (1)
backend_write_file(32-57)
executorlib/interactive/shared.py (2)
executorlib/cache/shared.py (1)
result(25-39)executorlib/standalone/hdf.py (1)
get_output(64-80)
tests/test_standalone_hdf.py (1)
executorlib/standalone/hdf.py (2)
get_output(64-80)get_runtime(83-97)
executorlib/cache/backend.py (2)
executorlib/standalone/hdf.py (1)
dump(19-34)executorlib/cache/shared.py (1)
result(25-39)
executorlib/cache/shared.py (1)
executorlib/standalone/hdf.py (1)
get_output(64-80)
⏰ Context from checks skipped due to timeout of 90000ms (15)
- GitHub Check: unittest_mpich (ubuntu-latest, 3.11)
- GitHub Check: unittest_mpich (macos-latest, 3.13)
- GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-openmpi.yml)
- GitHub Check: unittest_mpich (ubuntu-latest, 3.13)
- GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-mpich.yml)
- GitHub Check: unittest_flux_mpich
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.11)
- GitHub Check: unittest_openmpi (macos-latest, 3.13)
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.13)
- GitHub Check: unittest_mpich (ubuntu-latest, 3.12)
- GitHub Check: unittest_win
- GitHub Check: unittest_old
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.12)
- GitHub Check: unittest_flux_openmpi
- GitHub Check: notebooks
🔇 Additional comments (8)
executorlib/standalone/hdf.py (1)
13-13: Added error key to group_dict.Adding the "error" key to the
group_dictprovides a standardized way to represent errors in the HDF5 files, consistent with the improved error handling throughout the codebase.tests/test_standalone_hdf.py (3)
42-46: Updated test to handle new return value structure.The test has been properly updated to handle the new three-value return from
get_output, including the newno_errorflag.
59-63: Updated test to handle new return value structure.The test has been properly updated to handle the new three-value return from
get_output, including the newno_errorflag.
85-89: Updated test to handle new return value structure.The test has been properly updated to handle the new three-value return from
get_output, including the newno_errorflag.executorlib/cache/backend.py (2)
47-56: Good improvement to error handlingThe structured approach of distinguishing between result and error cases improves robustness. This change aligns with the modifications in
get_outputfunction (from the provided relevant code snippets).
72-80: Robust error handling implementationThe try-except block is a good addition that ensures errors are caught and structured consistently for the output. This works well with the changes in
backend_write_file.executorlib/cache/shared.py (2)
33-37: Improved error handling with no_error_flagThe updated code correctly handles the new three-tuple return format from
get_output(as seen in the provided relevant code snippets). The conditional logic is well-structured to either return the result or raise it as an exception based on the error flag.
203-207: Consistent error handling implementationThis change maintains consistency with the
resultmethod by properly handling theno_error_flagto set either a result or an exception on the future object. Good implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/test_singlenodeexecutor_cache.py (1)
35-41: Good addition of error handling test.This test correctly verifies that errors from submitted functions are properly propagated through the
SingleNodeExecutor. The test submits a function that raises aValueErrorand then asserts that the same error type is raised when callingresult()on the future object.Consider two minor improvements:
- The
- A more thorough test would also verify the error message content
def test_cache_error(self): cache_directory = "./cache_error" with SingleNodeExecutor(cache_directory=cache_directory) as exe: f = exe.submit(get_error, a=1) with self.assertRaises(ValueError): - print(f.result()) + f.result()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tests/test_cache_fileexecutor_serial.py(2 hunks)tests/test_singlenodeexecutor_cache.py(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_cache_fileexecutor_serial.py
🧰 Additional context used
🧬 Code Definitions (1)
tests/test_singlenodeexecutor_cache.py (1)
tests/test_cache_fileexecutor_serial.py (2)
get_error(30-31)tearDown(205-207)
⏰ Context from checks skipped due to timeout of 90000ms (15)
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.11)
- GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-mpich.yml)
- GitHub Check: unittest_openmpi (macos-latest, 3.13)
- GitHub Check: notebooks
- GitHub Check: unittest_old
- GitHub Check: unittest_flux_mpich
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.12)
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.13)
- GitHub Check: unittest_mpich (ubuntu-latest, 3.11)
- GitHub Check: unittest_mpich (ubuntu-latest, 3.13)
- GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-openmpi.yml)
- GitHub Check: unittest_mpich (ubuntu-latest, 3.12)
- GitHub Check: unittest_mpich (macos-latest, 3.13)
- GitHub Check: unittest_flux_openmpi
- GitHub Check: unittest_win
🔇 Additional comments (2)
tests/test_singlenodeexecutor_cache.py (2)
15-17: LGTM - This looks like a good utility function for error testing.The
get_errorfunction has been implemented correctly to raise aValueErrorwith the provided argument. This aligns with the PR objectives to improve cluster error handling by allowing tests to verify proper error propagation.
45-46: LGTM - Proper cleanup of the new test directory.Adding cleanup for the "cache_error" directory in the
tearDownmethod ensures that test resources are properly removed after each test run.
Summary by CodeRabbit
Bug Fixes
Tests