Cancel runs if no progress is made in the manifest #12

ayazhafiz · 2023-04-05T17:36:09Z

Presently, an active run can reach a state where all of its workers have died and no progress is being made on the test suite, yet the suite stays in the queue's memory. Since such runs may never be resumed, we'd like to reduce the pressure they place on a running queue.

This series of patches addresses the problem by running a job every hour that checks whether test runs have had any progress in their manifest. If either

- there is no manifest associated with the run after an hour, or
- no more items have been popped off the manifest since the last time progress was checked

then the run will be cancelled. If progress has been made, or the run was already done, the run is left untouched. If progress has been made and the run is not yet done, a job to check the progress again later is re-enqueued.
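
The decision described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `RunState`, `TimeoutAction`, and `on_progress_timeout` are hypothetical names, and the sketch assumes progress is tracked as a count of items popped off the manifest.

```rust
/// What the queue knows about a run when a progress timeout fires.
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunState {
    /// No manifest has been associated with the run yet.
    NoManifest,
    /// A manifest exists; `popped` counts the items popped off it so far.
    Active { popped: usize, done: bool },
}

/// What to do with the run after the check. Hypothetical type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimeoutAction {
    /// Cancel the run.
    Cancel,
    /// The run already finished; do nothing.
    LeaveUntouched,
    /// Progress was made; leave the run alone and check again later,
    /// remembering the popped count seen at this check.
    Recheck { popped_at_check: usize },
}

/// Decide what to do when the periodic progress check fires for a run.
pub fn on_progress_timeout(state: RunState, popped_at_last_check: usize) -> TimeoutAction {
    match state {
        // No manifest after the timeout period: cancel.
        RunState::NoManifest => TimeoutAction::Cancel,
        // A finished run is never cancelled by the check.
        RunState::Active { done: true, .. } => TimeoutAction::LeaveUntouched,
        // More items were popped since the last check: re-enqueue a check.
        RunState::Active { popped, done: false } if popped > popped_at_last_check => {
            TimeoutAction::Recheck { popped_at_check: popped }
        }
        // Not done and nothing popped since the last check: cancel.
        RunState::Active { .. } => TimeoutAction::Cancel,
    }
}
```

Checking `done` before comparing counts means a run that finished between two checks is left alone even if its popped count did not change, which is the case the finish-versus-cancel race test is concerned with.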

In the future, we'll likely want to adjust the behavior to not outright cancel a run, but to admit some way to re-launch the run from the last failure state.

Adds a `timeout` module to the queue for the purposes of configuring and acting on periodic timeouts. These timeouts will be used for deciding whether a test run should be cancelled because a manifest for a run made no progress.
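
For readers unfamiliar with the pattern, a periodic-timeout queue of this kind might look something like the sketch below. It is not the `timeout` module or the `RunTimeoutManager` from this PR: `TimeoutQueue`, `enqueue`, and `pop_fired` are hypothetical names, run IDs are assumed to be plain strings, and the caller is assumed to pass in the current time.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::time::{Duration, Instant};

/// A min-heap of (deadline, run id) pairs. Hypothetical type, for illustration.
pub struct TimeoutQueue {
    heap: BinaryHeap<Reverse<(Instant, String)>>,
    period: Duration,
}

impl TimeoutQueue {
    /// Create a queue whose timeouts fire `period` after they are enqueued.
    pub fn new(period: Duration) -> Self {
        Self { heap: BinaryHeap::new(), period }
    }

    /// Schedule a progress check for `run_id`, one period after `now`.
    pub fn enqueue(&mut self, run_id: &str, now: Instant) {
        self.heap.push(Reverse((now + self.period, run_id.to_owned())));
    }

    /// Remove and return the run ids whose deadline is at or before `now`,
    /// earliest deadline first.
    pub fn pop_fired(&mut self, now: Instant) -> Vec<String> {
        let mut fired = Vec::new();
        while let Some(Reverse((deadline, _))) = self.heap.peek() {
            if *deadline > now {
                break;
            }
            if let Some(Reverse((_, run_id))) = self.heap.pop() {
                fired.push(run_id);
            }
        }
        fired
    }
}
```

With a one-hour period, a run enqueued at time `t` is returned by the first `pop_fired` call made at or after `t` plus one hour. A run that made progress would then be passed to `enqueue` again to schedule its next check.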

github-actions · 2023-04-05T19:35:18Z

✅ Bigtest for 9537e32 (run)

Benchmarks:

RSpec: 11.24% overhead
- RSpec time: 17.79 seconds
- ABQ time: 19.79 seconds
RSpec parallel, 10 runs: max 15.29% overhead
- min 6.69% overhead
- standard deviation: 2.87%
Jest: 5.58% overhead
- Jest time: 21.16 seconds
- ABQ time: 22.341 seconds

Fuzz result sizes:

PASSED

github-actions · 2023-04-05T21:13:08Z

✅ Bigtest for 66dac76 (run)

Benchmarks:

RSpec: 15.57% overhead
- RSpec time: 17.73 seconds
- ABQ time: 20.49 seconds
RSpec parallel, 10 runs: max 12.63% overhead
- min 6.66% overhead
- standard deviation: 1.51%
Jest: 6.11% overhead
- Jest time: 21.362 seconds
- ABQ time: 22.667 seconds

Fuzz result sizes:

PASSED

ayazhafiz added 9 commits April 5, 2023 12:30

Add a module for enqueuing and processing timeouts

0b211ce

Wrap the RunTimeoutManager in a cheap-to-clone

44bd9b4

Thread timeout manager into the queue

3d25681

Stub out handling manifest-progress timeouts

2642539

Add a test to check for races in cancelling vs finishing a run

0578ac7

Avoid a box in timeouts

999aedb

Add an integration test for cancelling runs on timeout

1bf7fb3

Fix type errors

ff3cd60

Bad import

037cd13

ayazhafiz requested a review from doxavore April 5, 2023 17:36

ayazhafiz added 3 commits April 5, 2023 12:37

Check that log appears if test is cancelled

8ac6f40

Mark out-of-process retry workers when getting init context

9ad9ead

Typo

9537e32

ayazhafiz enabled auto-merge (squash) April 5, 2023 20:16

TAGraves approved these changes Apr 5, 2023

Merge branch 'main' into cancel-inactive-runs

66dac76

kylekthompson approved these changes Apr 5, 2023

ayazhafiz merged commit 3547f18 into main Apr 5, 2023
17 checks passed

ayazhafiz deleted the cancel-inactive-runs branch April 5, 2023 21:26
