
Track a job's memory usage and abort if the memory usage gets too high #271

Open

tfoote opened this issue Mar 22, 2016 · 4 comments

@tfoote
Member

tfoote commented Mar 22, 2016

We discovered that one of the reasons for Jenkins nodes going offline has been the OOM killer: #265

We run on high-performance machines and there's typically headroom available, but we should be able to set a maximum limit to make sure that we don't interfere with other jobs running on the same machine.

We already have a background script which uses psutil to make sure that all subprocesses are cleaned up. I think that script could be extended or a parallel script created to also monitor the total memory usage of the job and abort it if the job uses too much memory. This would allow us to set a maximum memory usage policy and enforce it.
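To make the idea concrete, a minimal sketch of the kind of watchdog that extended (or parallel) script could be, assuming psutil is available; the 8 GiB limit, poll interval, and command-line entry point are made-up placeholders, not an existing policy:

```python
#!/usr/bin/env python3
# Sketch only (not the existing cleanup script): watch the total RSS of a
# job's process tree with psutil and abort it if a configurable limit is hit.
import sys
import time

import psutil

MAX_RSS_BYTES = 8 * 1024 ** 3  # assumed policy: 8 GiB per job
POLL_SECONDS = 5


def total_rss(root: psutil.Process) -> int:
    """Sum resident memory of the job process and all of its descendants."""
    total = 0
    for p in [root] + root.children(recursive=True):
        try:
            total += p.memory_info().rss
        except psutil.NoSuchProcess:
            pass  # process exited between listing and sampling
    return total


def watch(job_pid: int) -> None:
    root = psutil.Process(job_pid)
    while root.is_running():
        rss = total_rss(root)
        if rss > MAX_RSS_BYTES:
            print(f"aborting job {job_pid}: {rss} bytes > limit {MAX_RSS_BYTES}",
                  file=sys.stderr)
            for p in [root] + root.children(recursive=True):
                try:
                    p.terminate()  # SIGTERM first; could escalate to kill()
                except psutil.NoSuchProcess:
                    pass
            return
        time.sleep(POLL_SECONDS)


if __name__ == "__main__":
    watch(int(sys.argv[1]))
```

It could run alongside the existing cleanup script, e.g. `python3 memory_watchdog.py $JOB_PID` (hypothetical invocation).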

Methods which would work from a quick search:

@dirk-thomas
Member

The pros and cons as I see them:

Pros:

  • a hard limit makes only specific jobs fail (when it hits 1/Nth of the memory of the slave)

Cons:

  • the specific jobs will always fail, even when the other executors on the same slave are idle
  • the system is not "self-healing" anymore: currently a job fails, but when it is retriggered overnight the chances are it passes (since it can likely use more than 1/Nth of the memory at that time)
  • the effort to implement and test this feature

Therefore I don't think it makes sense to implement this. Instead I think it would make sense to:

  • identify the memory footprint of jobs
  • provision the machines with the required memory per executor
  • try to reduce the memory footprint of jobs which exceed the provisioned memory per executor

@tfoote
Member Author

tfoote commented May 28, 2016

We can get access to memory usage of containers via cgroups, as well as put limits on them: https://docs.docker.com/v1.8/articles/runmetrics/ It's all exposed via the docker API: https://docs.docker.com/v1.8/reference/api/docker_remote_api_v1.21/
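For reference, reading those counters directly from the cgroup filesystem could look roughly like this; the paths assume cgroup v1 and the default `docker/<container-id>` hierarchy described in the runmetrics article, which varies by host configuration:

```python
# Sketch only: read a container's memory usage and limit from cgroup v1 files.
from pathlib import Path

CGROUP_MEM = Path("/sys/fs/cgroup/memory/docker")  # assumed cgroup v1 layout


def container_memory(container_id: str) -> tuple:
    """Return (usage_in_bytes, limit_in_bytes) for one container."""
    base = CGROUP_MEM / container_id
    usage = int((base / "memory.usage_in_bytes").read_text())
    limit = int((base / "memory.limit_in_bytes").read_text())
    return usage, limit


if __name__ == "__main__":
    import sys
    usage, limit = container_memory(sys.argv[1])
    print(f"{usage} / {limit} bytes ({100.0 * usage / limit:.1f}%)")
```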

I would count the fact that specific jobs fail consistently as a pro, not a con. Deterministic behavior is much better than relying on the "self-healing" of retrying later and hoping there's more memory available. Especially since we've seen the issue trigger repeatedly when things like the ros_comm devel jobs run in parallel. And hitting OOM conditions causes side effects across the whole machine, such as taking the Jenkins slave offline. If we set reasonable maximum allowed memory values and then give good errors when we go over, a failed build means something. This is yet another area where unrepeatability will lead to people ignoring the results and assuming that they will get better next time.

We can easily increase the memory available by either decreasing the number of executors per machine or paying for bigger machines. The challenge is to identify the memory footprint of the jobs.

@dirk-thomas
Member

Closing due to inactivity. Please consider contributing a PR if you are interested in this feature.

@tfoote
Member Author

tfoote commented Mar 12, 2020

There's now the ability to limit memory for docker containers.

https://docs.docker.com/config/containers/resource_constraints/

As well as tracking usage: https://docs.docker.com/config/containers/runmetrics/
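A sketch of how both pieces could be wired together from a job script, using the docker Python SDK rather than the raw API; the image, command, and 8g limit are placeholders, not actual buildfarm settings:

```python
# Sketch under assumptions: start a job container with a hard memory limit and
# poll its usage, per the resource_constraints and runmetrics docs.
import docker

client = docker.from_env()

container = client.containers.run(
    "ubuntu:focal",   # placeholder image
    "sleep 600",      # placeholder job command
    mem_limit="8g",   # hard limit; the kernel OOM-kills only this container
    detach=True,
)

stats = container.stats(stream=False)  # one-shot stats snapshot
mem = stats["memory_stats"]
print(f'{mem.get("usage", 0)} / {mem.get("limit", 0)} bytes')
```

The equivalent on the docker CLI is the `--memory` flag, e.g. `docker run --memory=8g …`, with `docker stats` for usage.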

Reviving this as we're running into issues with running out of RAM here:
ros-infrastructure/buildfarm_deployment#232
