Skip to content

Add error path test #1469

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

Add error path test #1469

wants to merge 2 commits into from

Conversation

earhart
Copy link
Contributor

@earhart earhart commented Jun 14, 2025

Context

Adds a test for the graph error path logic

Contribution Checklist

  • If a Python package was changed, please run make fmt in the package directory.
  • If the server was changed, please run make fmt in server/.
  • Make sure all PR Checks are passing.

@earhart earhart requested review from diptanu and eabatalov June 14, 2025 06:27
@earhart earhart self-assigned this Jun 14, 2025
@earhart
Copy link
Contributor Author

earhart commented Jun 14, 2025

NB This test isn't expected to work until #1465 & tensorlakeai/tensorlake#175 land.

gm = graph.metadata()
self.assertEqual("Ready", gm.state.status)
im = graph.invocation_metadata(invocation_id)
self.assertEqual("Failed", im.state.status)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there is no difference in Invocation State between an invocation failed by raising the InvocationError exception and a runtime Exception - how does the user know the difference when they are getting the invocation response? In the former case the graph work as expected, it handled bad input and raised an exception, and in the later case the invocation failed because of a runtime error (which is not expected).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception info coming back makes it clear that it was an InvocationError, and the behavior is different (we won't retry the task, and it doesn't count as either an invocation success or a failure for the graph breakage check).

Is there anything we can/should add to make things more clear?

self.assertEqual("Failed", im.state.status)

def test_permanent_graph_failure(self):
graph = Graph(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior of this feature needs to change - we want the graph to be in a broken state if we couldn't load it. Invocations can be failing all the time, but we wouldn't still mark the graph as failed.

This feature is a contract between the platform and the Graph developers to not ship graphs that the platform can't load. Once the graph is loaded and the platform can call it, if invocation fails they are retried unless the exception type is InvocationException (as described by the test - test_permanent_invocation_failure)

Copy link
Collaborator

@diptanu diptanu Jun 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The right way to write this test would be to -

  1. Create a tensorlake compute class which throws an exception in the constructor
  2. Create a Tensorlake compute class which sleeps for 5 minutes and so the health check by executor fails

Show that the graph gets to broken state after we retry for 3 times.

@earhart earhart force-pushed the earhart/errors-test branch from 99e1036 to 033d7c7 Compare June 14, 2025 20:24
@eabatalov
Copy link
Contributor

@earhart earhart closed this Jun 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants