Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for deleting scheduled tasks from the daemon #1417

Open
2 tasks
brdji opened this issue Aug 3, 2022 · 5 comments
Open
2 tasks

Add support for deleting scheduled tasks from the daemon #1417

brdji opened this issue Aug 3, 2022 · 5 comments

Comments

@brdji
Copy link
Contributor

brdji commented Aug 3, 2022

Description

The command to remove (terminate) tasks (ie. testground terminate) does not remove scheduled jobs. If the daemon is overloaded, or we wish to remove scheduled tasks, we have to remove them from the queue (ie. Redis) manually.

What defines this endeavour to be complete?

Support terminating any task by state, other than the ones in progress (that would be too complicated, and we have a timeout for that).

  • Support terminating tasks by state (ie. testground terminate --state complete)
  • Support terminating scheduled tasks
@laurentsenta
Copy link
Contributor

Nice, thanks for creating this,
We might have the feature already in the dashboard (http://localhost:8042 by default): each task has "kill/delete" buttons.

If we decide to implement a CLI command, a few notes:

(1) What would happen if a user ran testground terminate --state complete?
I understand it as "terminate the tasks that are already complete" which looks like a no-op.

(2) Having a command specialized to terminate tasks by state sounds quite specialized,
could we provide testground kill --task ID instead?

  • It's more cohesive with the rest of the API (testground status --task ID)
  • It's more versatile since you may combine commands (the unix way)
    • For example, with docker, you may write docker stop $(docker container ls -q --filter name='testground*')
    • Similarly, we could combine the tasks and kill commands.

@dektech
Copy link
Contributor

dektech commented Aug 8, 2022

I completely agree that termination per state is quite specialized, and not really necessary when I think about it.
Termination per task ID would give us the possibility to clear stuck/scheduled tasks without disrupting anything else.

Regarding the testground terminate --state complete, yeah it looks like a no-op, but do you think it would be beneficial if we could clear all completed tasks from the list after 3/5/7 days or keep only the last 50/70/100 tasks?

@laurentsenta
Copy link
Contributor

Regarding clearing history, is there any reason to focus on testground terminate?
I might lack some context.

If we want to remove old tasks' data from the logs,
I would stick to something similar to my previous comment:

  • Improve testground tasks to list the tasks we need. Something like testground tasks --status=completed --completed-after=$DATE,
  • Expose the testground delete operation we've seen in the dashboard into the CLI.

It seems like the effort would be the same, with a clearer UX and more opportunities for reuse.

Regarding benefits, is that a new feature or is this related to 3. Clean up the state after running tests, so performance is not affected if a test crashes or takes too longin Devgrant?

@dektech
Copy link
Contributor

dektech commented Aug 8, 2022

The new feature is partially related to what you just listed - not to clutter the system and affect other tests and overall performance.

But a more solid example of why would we need is what we are encountering while testing ping-pong (as described in #1416) - one pod goes into completed state, the other stays running indefinitely until removed from the cluster.
Even if we delete the pods, the daemon will still show the task as running, until the daemon deployment is deleted and then redeployed.
So the new feature would allow us to cancel/stop/terminate tasks without the need to restart the daemon pods.

@laurentsenta
Copy link
Contributor

laurentsenta commented Aug 8, 2022

Got it, thanks for sharing, I don't think we need a new feature here, we need to fix bugs:

  • There is a task timeout for every test run in testground, when a task reaches this timeout we should kill it and update the state. Look for task_timeout_min in the codebase.
  • There is a kill operation in the dashboard, that should force kill the instances and update the state, if that is not working as intended we should fix it.
  • That might be a symptom of a concurrency error like in fix: testground instability due to race condition on outcomes #1407 or a software design flaw (should we hook into some sort of k8s status update stream?)

Could you open issues for the first two?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants