
Track a job's memory usage and abort if the memory usage gets too high #271

Open

tfoote opened this issue Mar 22, 2016 · 4 comments

@tfoote
Member

tfoote commented Mar 22, 2016

We discovered that one of the reasons for Jenkins nodes going offline has been the OOM killer: #265

We run on high-performance machines and there's typically headroom available, but we should be able to set a maximum limit to make sure that we don't interfere with other jobs running on the same machine.

We already have a background script which uses psutil to make sure that all subprocesses are cleaned up. I think that script could be extended or a parallel script created to also monitor the total memory usage of the job and abort it if the job uses too much memory. This would allow us to set a maximum memory usage policy and enforce it.
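To make the idea concrete, a minimal sketch of the kind of watchdog that extended (or parallel) script could be, assuming psutil is available; the 8 GiB limit, poll interval, and command-line entry point are made-up placeholders, not an existing policy:

```python
#!/usr/bin/env python3
# Sketch only (not the existing cleanup script): watch the total RSS of a
# job's process tree with psutil and abort it if a configurable limit is hit.
import sys
import time

import psutil

MAX_RSS_BYTES = 8 * 1024 ** 3  # assumed policy: 8 GiB per job
POLL_SECONDS = 5


def total_rss(root: psutil.Process) -> int:
    """Sum resident memory of the job process and all of its descendants."""
    total = 0
    for p in [root] + root.children(recursive=True):
        try:
            total += p.memory_info().rss
        except psutil.NoSuchProcess:
            pass  # process exited between listing and sampling
    return total


def watch(job_pid: int) -> None:
    root = psutil.Process(job_pid)
    while root.is_running():
        rss = total_rss(root)
        if rss > MAX_RSS_BYTES:
            print(f"aborting job {job_pid}: {rss} bytes > limit {MAX_RSS_BYTES}",
                  file=sys.stderr)
            for p in [root] + root.children(recursive=True):
                try:
                    p.terminate()  # SIGTERM first; could escalate to kill()
                except psutil.NoSuchProcess:
                    pass
            return
        time.sleep(POLL_SECONDS)


if __name__ == "__main__":
    watch(int(sys.argv[1]))
```

It could run alongside the existing cleanup script, e.g. `python3 memory_watchdog.py $JOB_PID` (hypothetical invocation).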

Methods which would work from a quick search:

@dirk-thomas
Member

The pros and cons as I see them:

Pros:

  • a hard limit makes only specific jobs fail (when it hits 1/Nth of the memory of the slave)

Cons:

  • the specific jobs will always fail, even when the other executors on the same slave are idle
  • the system is not "self-healing" anymore: currently a job fails, but when it is retriggered overnight the chances are it passes (since it can likely use more than 1/Nth of the memory at that time)
  • the effort to implement and test this feature

Therefore I don't think it makes sense to implement this. Instead I think it would make sense to:

  • identify the memory footprint of jobs
  • provision the machines with the required memory per executor
  • try to reduce the memory footprint of jobs which exceed the provisioned memory per executor

@tfoote
Member Author

tfoote commented May 28, 2016

We can get access to memory usage of containers via cgroups, as well as put limits on them: https://docs.docker.com/v1.8/articles/runmetrics/ It's all exposed via the docker API: https://docs.docker.com/v1.8/reference/api/docker_remote_api_v1.21/
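For reference, reading those counters directly from the cgroup filesystem could look roughly like this; the paths assume cgroup v1 and the default `docker/<container-id>` hierarchy described in the runmetrics article, which varies by host configuration:

```python
# Sketch only: read a container's memory usage and limit from cgroup v1 files.
from pathlib import Path

CGROUP_MEM = Path("/sys/fs/cgroup/memory/docker")  # assumed cgroup v1 layout


def container_memory(container_id: str) -> tuple:
    """Return (usage_in_bytes, limit_in_bytes) for one container."""
    base = CGROUP_MEM / container_id
    usage = int((base / "memory.usage_in_bytes").read_text())
    limit = int((base / "memory.limit_in_bytes").read_text())
    return usage, limit


if __name__ == "__main__":
    import sys
    usage, limit = container_memory(sys.argv[1])
    print(f"{usage} / {limit} bytes ({100.0 * usage / limit:.1f}%)")
```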

I would count the fact that specific jobs fail consistently as a pro, not a con. Deterministic behavior is much better than relying on the "self-healing" of retrying later and hoping there's more memory available. Especially since we've seen the issue trigger repeatedly when things like the ros_comm devel jobs run in parallel. And hitting OOM conditions causes side effects across the whole machine, such as taking the Jenkins slave offline. If we set reasonable maximum allowed memory values and then give good errors when we go over, a failed build means something. This is yet another area where unrepeatability will lead to people ignoring the results and assuming that they will get better next time.

We can easily increase the memory available by either decreasing the number of executors per machine or paying for bigger machines. The challenge is to identify the memory footprint of the jobs.

@dirk-thomas
Member

Closing due to inactivity. Please consider contributing a PR if you are interested in this feature.

@tfoote
Member Author

tfoote commented Mar 12, 2020

There's now the ability to limit memory for docker containers.

https://docs.docker.com/config/containers/resource_constraints/

As well as tracking usage: https://docs.docker.com/config/containers/runmetrics/
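A sketch of how both pieces could be wired together from a job script, using the docker Python SDK rather than the raw API; the image, command, and 8g limit are placeholders, not actual buildfarm settings:

```python
# Sketch under assumptions: start a job container with a hard memory limit and
# poll its usage, per the resource_constraints and runmetrics docs.
import docker

client = docker.from_env()

container = client.containers.run(
    "ubuntu:focal",   # placeholder image
    "sleep 600",      # placeholder job command
    mem_limit="8g",   # hard limit; the kernel OOM-kills only this container
    detach=True,
)

stats = container.stats(stream=False)  # one-shot stats snapshot
mem = stats["memory_stats"]
print(f'{mem.get("usage", 0)} / {mem.get("limit", 0)} bytes')
```

The equivalent on the docker CLI is the `--memory` flag, e.g. `docker run --memory=8g …`, with `docker stats` for usage.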

Reviving this as we're running into issues with running out of RAM here:
ros-infrastructure/buildfarm_deployment#232
