queueworker: prevent stop event on WorkerSleepException (PROJQUAY-1857) #737

Merged: 1 commit into quay:master from prevent-stop-event-on-lock-exception, Apr 12, 2021

Conversation

Member

@kleesc kleesc commented Apr 9, 2021

Prevents the queueworker from setting the event that stops the poll_queue
job when a WorkerSleepException is raised. On WorkerSleepException,
the worker should instead skip the current iteration (go to sleep), e.g. when
the NamespaceGCWorker can't acquire a lock because it is already held
by another worker.
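
As a rough illustration of the behavior described above (not the actual queueworker code), the sketch below shows one poll iteration that treats WorkerSleepException as "skip and retry later" and leaves the stop event untouched. The names run_iteration and process_item are hypothetical, and the assumption that any other exception sets the stop event is made only for contrast.

```python
import threading


class WorkerSleepException(Exception):
    """Raised by a job to say: nothing can be done right now, retry on the next poll."""


def run_iteration(process_item, item, stop_event):
    """Run one poll iteration (hypothetical helper, not Quay's API).

    Returns "done", "sleep", or "stopped".
    """
    try:
        process_item(item)
        return "done"
    except WorkerSleepException:
        # Skip this iteration only. The stop event stays clear, so the
        # poll job keeps being scheduled.
        return "sleep"
    except Exception:
        # Assumption for this sketch: other failures stop the poll job.
        stop_event.set()
        return "stopped"


def lock_already_taken(_item):
    # Simulates a GC worker that could not acquire its lock.
    raise WorkerSleepException("lock held by another worker")


stop = threading.Event()
result = run_iteration(lock_already_taken, "namespace-1", stop)
print(result, stop.is_set())
```

The point of the split is that "could not get the lock" is an expected, recoverable condition, so it should not share a code path with the shutdown signal.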

Reverts the gcworkers job timeout from 24h to 3h. In case of a
deadlock between processes (for example, redeploying the app will not
clear the existing Redis keys), 24h is too long to wait for the locks to
expire so that the workers can resume work.

Adds the missing Counter increment on row deletion for the Manifest table.

@kleesc kleesc requested a review from alecmerdler April 9, 2021 21:50
alecmerdler
alecmerdler previously approved these changes Apr 10, 2021
@kleesc kleesc force-pushed the prevent-stop-event-on-lock-exception branch 5 times, most recently from b7e134e to 6368ee0 Compare April 12, 2021 16:28
Member Author

kleesc commented Apr 12, 2021

@alecmerdler Will need reapproval

@kleesc kleesc force-pushed the prevent-stop-event-on-lock-exception branch from 6368ee0 to 829cc56 Compare April 12, 2021 17:44
@kleesc kleesc merged commit 90f9ef9 into quay:master Apr 12, 2021
@kleesc kleesc deleted the prevent-stop-event-on-lock-exception branch April 12, 2021 18:43
kleesc added a commit to kleesc/quay that referenced this pull request Apr 12, 2021
…7) (quay#737)

kleesc added a commit that referenced this pull request Apr 12, 2021
* gc: fix GlobalLock ttl unit and increase gc workers lock timeout (#712)

Correctly converts the given ttl from seconds to milliseconds when
passed to Redis (redlock uses 'px', not 'ex'). Also increases the lock
timeout of gc workers to 1 day.
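
To make the unit issue concrete: the Redis SET command takes EX in seconds and PX in milliseconds, so a ttl given in seconds has to be multiplied by 1000 before it is used with PX. The helper below is a hypothetical sketch that only builds the command arguments; it is not the GlobalLock or redlock code.

```python
def redis_lock_set_args(name, token, ttl_seconds):
    """Build the arguments for `SET name token NX PX <milliseconds>`.

    Hypothetical helper for illustration. NX means "only set if the key
    does not exist", PX sets the expiry in milliseconds.
    """
    ttl_ms = int(ttl_seconds * 1000)  # seconds -> milliseconds for PX
    return ["SET", name, token, "NX", "PX", ttl_ms]


# A 15 minute ttl expressed in seconds.
args = redis_lock_set_args("gc-lock", "worker-a", 15 * 60)
print(args)
```

Passing the seconds value straight through to PX would make the lock expire 1000 times sooner than intended, which is the kind of bug the commit describes.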

Some iterations, for repos with large numbers of tags (1000s), will
take more than 15 minutes to complete. This change prevents multiple
workers from GCing the same repo, with one possibly preempting
another. GlobalLock's ttl makes the lock available again once it has
expired, but does not actually stop execution of the current GC
iteration until the GlobalLock context is done. A 1 day timeout
should be enough.

NOTE: The correct solution would be for GlobalLock to either renew
the lock until the caller is done, or signal to the caller that it is
no longer valid.
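
One way to picture the "signal that it is no longer valid" option from the note is a lock handle that tracks its own deadline, which long-running work checks between batches. This is a minimal sketch under stated assumptions: ExpiringLockHandle and gc_in_batches are hypothetical names, and a local clock only approximates the expiry that Redis actually enforces.

```python
import time


class ExpiringLockHandle:
    """Hypothetical handle that knows when its ttl-based lock stops being valid."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + ttl_seconds

    def is_valid(self):
        # True while the local deadline has not been reached.
        return self._clock() < self._deadline


def gc_in_batches(batches, lock):
    """Process batches in order, stopping early once the lock has expired."""
    processed = []
    for batch in batches:
        if not lock.is_valid():
            break  # another worker may now hold the lock, so stop here
        processed.append(batch)
    return processed


lock = ExpiringLockHandle(ttl_seconds=60)
print(gc_in_batches(["tags-1", "tags-2"], lock))
```

Checking between batches bounds how long a worker can keep running after its lock has expired, at the cost of leaving an iteration partially done.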

* gc: add metrics for deleted resources (#711)

Add counters for the number of resources deleted by the gc worker, the
repository gc worker and the namespace gc worker.

* queueworker: prevent stop event on WorkerSleepException (PROJQUAY-1857) (#737)
