
Reduce number of Puma processes and threads #630

Merged 1 commit into master on Oct 21, 2015

Conversation

jferris (Contributor) commented Oct 20, 2015

Using a simple Suspenders application, I profiled memory usage for our
default configuration, as well as a few others. I found the following:

  • A Puma cluster uses a master process and multiple worker processes.
    The amount of memory used by a cluster is equal to the memory usage of
    the master process plus the possibly bloated size of a worker process
    times the number of worker processes.
  • At boot, a simple Suspenders application uses about 117MB for the
    master process and 109MB for each worker.
  • After the first request is served, each worker process grows to around
    117MB, matching the master process.
  • The amount of potential bloat increases with each thread, because it's
    possible for every thread to be handling a bloated request at once.
  • Using siege (https://www.joedog.org/siege-home/), I determined that the
    expected bloat in a simple scenario is around 10MB per thread. This will
    be much worse in some applications.

This provides the following formula for maximum memory usage under load:

    master_usage + worker_count * (worker_usage + bloat * thread_count)

For this simple Suspenders application, this formula provides the
following worst-case usage:

    117 + 3 * (117 + 10 * 5) = 618

This is over the 512MB limit for a 1x Heroku dyno, and the application
is very simple.
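The formula above can be sketched as a small Ruby helper (the method name and keyword arguments are mine, for illustration only):

```ruby
# Worst-case memory (in MB) for a Puma cluster, per the formula above.
def max_memory(master:, worker:, bloat:, workers:, threads:)
  master + workers * (worker + bloat * threads)
end

# The default 3-worker, 5-thread configuration measured above:
max_memory(master: 117, worker: 117, bloat: 10, workers: 3, threads: 5) # => 618
```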

I recommend changing to a default of two worker processes and two
threads per dyno, changing the usage to:

    117 + 2 * (117 + 10 * 2) = 391

This provides reasonable performance with a high memory ceiling. When
applications begin to show troublesome performance characteristics under
load, developers can tune the application's process and thread count
according to its real-world memory usage, possibly upgrading the dyno
size as appropriate.
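As a rough sketch, a config/puma.rb reflecting these defaults might look like the following (the environment variable names are assumptions; the actual Suspenders template may differ):

```ruby
# config/puma.rb (sketch): 2 workers x 2 threads by default,
# tunable per-app via environment variables.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

threads_count = Integer(ENV.fetch("MAX_THREADS", 2))
threads threads_count, threads_count

preload_app!
```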

tute (Contributor) commented Oct 20, 2015

This looks good to me. It seems like we can keep a higher number of threads while still comfortably fitting in a dyno, although if you'd rather have suspenders be lighter by default I'm not against that.

👍

jferris (Contributor, Author) commented Oct 20, 2015

@tute

> This is explained in the code comment, removing.

I'm not sure that it is explained in the code comment, so maybe my message or comment isn't clear.

Increasing processes is very good for concurrency but very bad for memory usage. Increasing threads helps somewhat with concurrency but isn't as bad for memory usage. This is because processes don't efficiently share memory (even though Ruby 2.1+ purportedly has copy-on-write), and threads (in Ruby) don't efficiently share CPU, because of the global interpreter lock.

I think 2x2 is a good combination, because it means that:

  • If one of the workers on a dyno is fully occupied serving a slow, CPU-bound request (such as rendering complex ERB or JSON), another request routed to that dyno can be handled by the other worker.
  • If one of the threads on a worker is occupied serving a slow, IO-bound request (such as a slow database query), the other thread can still respond to requests.
  • More than two processes gets close to the memory ceiling, even with one thread, so two processes feels like a good max.

> It seems like we can keep a higher number of threads while still comfortably fitting in a dyno, although if you'd rather have suspenders be lighter by default I'm not against that.

More than two threads on those processes increases the risk of simultaneously serving multiple bloated requests. If the worst request bloats a process by 30MB, the worst-case scenario for a dyno in a Suspenders app will bloat past the 512MB limit. Two threads feels like a safe maximum while still providing at least some safety against IO-bound requests.
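Re-deriving those numbers with the formula from the description (hypothetical helper, written here just to check the arithmetic): at 30MB of bloat per thread, two threads per worker stays under the 512MB dyno limit, but a third thread pushes past it.

```ruby
def max_memory(master:, worker:, bloat:, workers:, threads:)
  master + workers * (worker + bloat * threads)
end

max_memory(master: 117, worker: 117, bloat: 30, workers: 2, threads: 2) # => 471, under 512
max_memory(master: 117, worker: 117, bloat: 30, workers: 2, threads: 3) # => 531, over 512
```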

Does that make more sense?

tute (Contributor) commented Oct 21, 2015

It does, thank you! The code comment did explain the relationship between workers and threads, but I appreciate the longer answer. Is this PR and #627 all we have in mind for now to improve memory usage for suspenders? If so, when we merge them I'll do a release.

jferris (Contributor, Author) commented Oct 21, 2015

@tute I'll take a quick look today to see if there's anything else we can do to slim it down and get closer to Rails bootup size. Most of the dependencies we add are development/test dependencies, so it seems like Suspenders size should be pretty close.

jferris (Contributor, Author) commented Oct 21, 2015

@tute I don't think there are any other major quick wins to remove from our Gemfile.

The startup for a vanilla Rails app using Postgres is ~78M for the master process and ~72M for each cold worker process. The startup for Suspenders is ~90M for the master process and ~83M for each worker process.

If we removed all the gems Suspenders uses that vanilla Rails doesn't, we'd gain around 35M of extra ceiling for each dyno. This isn't significant, since Rails apps under load will easily bloat more than that already.

If we were going to remove something, the biggest win would be to remove NewRelic, which would reduce our usage to 84M in the master and 77M for workers, saving 18M for each dyno.

For things that not every app uses, changing it so that ActionMailer and the mail gem are only loaded when needed would save a significant amount of memory (somewhere around 20M per process or 60M per dyno). However, that would involve disabling recipient_interceptor and delayed_job_active_record. That may be worth exploring, but I think it's worth cutting a release without worrying about it.

bernerdschaefer commented

I don't totally understand the relationship between the number of threads and the database pool, but should the default production pool size be adjusted to match these changes?

https://github.com/thoughtbot/suspenders/blob/master/templates/postgresql_database.yml.erb#L18

Otherwise, this looks great; thanks for diving deep on it! 👍

jferris (Contributor, Author) commented Oct 21, 2015

@bernerdschaefer The Heroku guide to concurrency and database connections recommends using a pool equal to the number of threads:

> If you are using the Puma web server we recommend setting the pool value to equal ENV['MAX_THREADS'].

This is the pool setting Heroku recommends in database.yml:

    ENV["DB_POOL"] || ENV['MAX_THREADS'] || 5

Ours is similar, but not quite the same:

    [Integer(ENV.fetch("MAX_THREADS", 5)), Integer(ENV.fetch("DB_POOL", 5))].max

I'm not sure why it's a little different, but it seems like we'll have at least one connection for each thread on the Puma server, so I think we're okay; we might be using more database connections for each process than we technically need to. It looks like @calebthompson introduced this line in the commit that introduced Puma.
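A quick sketch of how the two expressions diverge, with a plain hash standing in for ENV (the helper names here are hypothetical, not Suspenders code): with MAX_THREADS=2 and DB_POOL unset, Heroku's expression yields one connection per thread, while ours never drops below the default of 5.

```ruby
# Heroku's recommended pool expression
def heroku_pool(env)
  (env["DB_POOL"] || env["MAX_THREADS"] || 5).to_i
end

# Suspenders' current pool expression
def suspenders_pool(env)
  [Integer(env.fetch("MAX_THREADS", 5)), Integer(env.fetch("DB_POOL", 5))].max
end

env = { "MAX_THREADS" => "2" } # DB_POOL unset
heroku_pool(env)     # => 2: exactly one connection per thread
suspenders_pool(env) # => 5: more connections than the threads can use
```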

It probably makes sense to use the Heroku-recommended settings for database.yml, but I also think it makes sense to handle that in a different pull request.

Any thoughts?

bernerdschaefer commented

I think tackling the questions about database pool size indeed makes sense in a separate PR.

jferris force-pushed the jf-adjust-puma-defaults branch from 0c2c4f8 to bfd75f9 on October 21, 2015
jferris merged commit bfd75f9 into master on Oct 21, 2015
jferris deleted the jf-adjust-puma-defaults branch on October 21, 2015