Increase memory limits for build projects (autoscale workers) #4403

Closed
humitos opened this issue Jul 18, 2018 · 3 comments
Labels
Improvement (Minor improvement to code) · Needed: design decision (A core team decision is required) · Needed: documentation (Documentation is required)

Comments

@humitos
Member

humitos commented Jul 18, 2018

Lately we have had several projects that need more memory than the default value (1g). However, since memory is a resource that can cause different problems than CPU time, we have been very careful when increasing this limit for some projects.

At this time we have only 6 projects with a limit (1500m) different from the default, and only 2 of them use the maximum limit (2g).

When a project needs more memory resources, I usually suggest that the owner:

  • disable formats being built
  • reduce the number of dependencies installed (via pip or conda)

These two points can be found in our docs: https://docs.readthedocs.io/en/latest/guides/build-using-too-many-resources.html

To check the CPU and memory consumption of the processes executed by Read the Docs, there is an awesome comment in aiidateam/aiida-core#1472 (comment) that uses psrecord to make some nice plots.
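As a rough illustration (the build command and output filename below are placeholders, not anything Read the Docs runs), psrecord can be wrapped around a build like this:

```python
import subprocess

# Hypothetical example: profile a local Sphinx build with psrecord.
# "make html" and "build-resources.png" are placeholders for your own
# build command and output file.
subprocess.run(
    [
        "psrecord",
        "make html",             # command to profile (a PID also works)
        "--interval", "0.5",     # sample CPU/memory every half second
        "--include-children",    # include subprocesses spawned by the build
        "--plot", "build-resources.png",
    ],
    check=True,
)
```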

This issue, in particular, is meant to collect projects that are running out of memory when building, increase their limits, and track the results. It is also meant to discuss a long-term solution where increasing memory limits doesn't affect the builder servers.

Projects that are currently hitting the memory limit, whose limits I will start increasing:

We also need to discuss the steps the core team should follow to increase these limits safely (without causing other issues on the builders) and propose a solution around it.

Ideas for a solution

Use Celery autoscale

We have talked about using Celery's autoscaling option (http://docs.celeryproject.org/en/latest/userguide/workers.html#autoscaling), but instead of letting Celery decide when and how to increase/decrease the number of workers, we may want to define our own Autoscaler and point to it in the worker_autoscaler setting (http://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-worker_autoscaler).

Example of Autoscaling based on CPU and Memory: https://gist.github.com/speedplane/224eb551c51a74068011f4d776237513
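A rough sketch of what such an autoscaler could look like (this is not Read the Docs code; the memory threshold and class name are made up, and it assumes Celery 4.x plus psutil):

```python
import psutil
from celery.worker.autoscale import Autoscaler


class MemoryAwareAutoscaler(Autoscaler):
    """Refuse to scale up while the host is short on memory (example only)."""

    # Example threshold: stop adding worker processes once 80% of the
    # host memory is in use.
    MEMORY_HIGH_WATERMARK = 80.0

    def _maybe_scale(self, req=None):
        if psutil.virtual_memory().percent >= self.MEMORY_HIGH_WATERMARK:
            # Under memory pressure, only allow scaling down.
            procs = self.processes
            target = max(self.qty, self.min_concurrency)
            if target < procs:
                self.scale_down(procs - target)
                return True
            return False
        # Otherwise keep Celery's default load-based behaviour.
        return super()._maybe_scale(req=req)
```

It would then be enabled by pointing the worker_autoscaler setting to that class, e.g. worker_autoscaler = 'readthedocs.worker:MemoryAwareAutoscaler' (module path hypothetical).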

Scale workers manually

Another idea we had in mind was to scale the workers based on values we already know: container_time_limit and container_mem_limit. So, before the trigger_build function is called, we could decrease the number of workers if the task is going to consume too much memory.

Increasing the workers at that point is not possible because we don't have information about the kind of tasks the current builder is running. If we saved the task_id into the Build object, we could ask for all the tasks the builder is running, map them to their Build objects, and know which project is being built and how many resources it needs.

Another possibility, instead of saving the task_id in the Build object, would be to create a Celery chain that first decreases the workers to 1, then executes the build, and finally increases the workers back to the default value (see the sketch below).
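Something like the following could express that chain. None of these names exist in the Read the Docs code base: reduce_builder_pool, restore_builder_pool, and DEFAULT_POOL_SIZE are assumptions, and update_docs is only a placeholder for the real build task:

```python
from celery import chain, current_app, shared_task

DEFAULT_POOL_SIZE = 4  # example value, not the real builder configuration


@shared_task
def reduce_builder_pool(builder_hostname):
    # Shrink the builder's pool so that only one process is left for
    # the heavy build.
    current_app.control.pool_shrink(
        DEFAULT_POOL_SIZE - 1, destination=[builder_hostname])


@shared_task
def restore_builder_pool(builder_hostname):
    # Grow the pool back to its default size once the build has finished.
    current_app.control.pool_grow(
        DEFAULT_POOL_SIZE - 1, destination=[builder_hostname])


@shared_task
def update_docs(project_pk, version_pk=None):
    # Placeholder for the real build task.
    pass


def trigger_heavy_build(project_pk, version_pk, builder_hostname):
    # shrink -> build -> grow, executed sequentially. A real implementation
    # would also need an error callback so the pool is restored when the
    # build fails.
    workflow = chain(
        reduce_builder_pool.si(builder_hostname),
        update_docs.si(project_pk, version_pk=version_pk),
        restore_builder_pool.si(builder_hostname),
    )
    return workflow.apply_async()
```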

Use a specific queue for heavy mem usage projects

To avoid all this logic, we could have a builder with just one worker. Before trigger_build is called, the web server checks for custom limits and forces the task to be routed to this particular queue (see the sketch below).
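A rough sketch of that routing, assuming a hypothetical 'build-large' queue and treating a non-empty container_mem_limit as the marker for heavy projects (the task name and kwargs below are placeholders for the real build task):

```python
from celery import current_app


def queue_for(project):
    # Heavy projects go to the dedicated single-worker builder queue;
    # everything else keeps using the regular build queue.
    if getattr(project, 'container_mem_limit', None):
        return 'build-large'
    return 'build'


def trigger_build(project, version):
    # Placeholder task name and kwargs; the real build task lives in
    # readthedocs.projects.tasks.
    return current_app.send_task(
        'readthedocs.projects.tasks.update_docs',
        kwargs={'pk': project.pk, 'version_pk': version.pk},
        queue=queue_for(project),
    )
```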

@davidfischer
Contributor

In my limited testing, PyPy had much better build performance from a time perspective. See #3870.

@humitos
Member Author

humitos commented Aug 30, 2018

News on this: we are implementing this idea:

Use a specific queue for heavy mem usage projects

However, projects are manually added to that queue, and the resources available for their builds are manually increased.

Before trigger_build is called, the web server checks for custom limits and forces the task to be routed to this particular queue

Making this process automatic is part of #4573 and should be discussed there.

I'm closing this issue since there is no action to take for now. If we want to change our current implementation, we can consider reopening it.

humitos closed this as completed Aug 30, 2018
@humitos
Member Author

humitos commented Sep 11, 2018

I just want to add that the implemented solution, the build03 queue, has been working properly, and we haven't had more reports of projects with builds failing due to timeout or memory.

Use Celery autoscale

There is an issue that talks about this possible solution and is worth linking here: #3990
