Improve Housekeeper for distributed execution of tasks #21

lucasoares · 2023-06-26T15:44:08Z

We have successfully operated all our Deckard instances with a single housekeeper pod for years. To enhance scalability of the housekeeper tasks, I propose the following improvements for the housekeeper feature:

Implement a distributed locking mechanism for each task to support running multiple housekeeper pods simultaneously. While most tasks can run concurrently due to their atomic nature, running the same task in parallel can lead to resource waste.
Address potential issues, such as Prometheus metrics duplication. Currently, we expose numerous queue metrics in the /metrics endpoint of a Deckard instance with the housekeeper enabled. Since the housekeeper is responsible for measuring many of these metrics, duplication can occur if we deploy many housekeeper pods with the /metrics endpoint enabled. We can consider deploying a dedicated metrics pod or explore alternative solutions to mitigate this issue.

By incorporating these enhancements, we aim to achieve better scalability, improved fault tolerance, and better overall performance in our distributed Deckard setup.


lucasoares added the enhancement and good first issue labels Jun 26, 2023

lucasoares mentioned this issue Jun 29, 2023

Review the Housekeeper recovery task #30

Open
