
Reduce number of Puma processes and threads #630

Merged 1 commit into master on Oct 21, 2015

Conversation

jferris (Contributor) commented Oct 20, 2015

Using a simple Suspenders application, I profiled memory usage for our
default configuration, as well as a few others. I found the following:

  • A Puma cluster uses a master process and multiple worker processes.
    The amount of memory used by a cluster is equal to the memory usage of
    the master process plus the possibly bloated size of a worker process
    times the number of worker processes.
  • At boot, a simple Suspenders application uses about 117MB for the
    master process and 109MB for each worker.
  • After the first request is served, each worker process grows to around
    117MB, matching the master process.
  • The amount of potential bloat increases with each thread, because it's
    possible for every thread to be handling a bloated request at once.
  • Using siege (https://www.joedog.org/siege-home/), I determined that the
    expected bloat in a simple scenario is around 10MB per thread. This will
    be much worse in some applications.

This provides the following formula for maximum memory usage under load:

    master_usage + worker_count * (worker_usage + bloat * thread_count)

For this simple Suspenders application, this formula provides the
following worst-case usage:

    117 + 3 * (117 + 10 * 5) = 618

This is over the 512MB limit for a 1x Heroku dyno, and the application
is very simple.
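The formula above can be sketched as a small Ruby helper (the method name and keyword arguments are mine, for illustration only):

```ruby
# Worst-case memory (in MB) for a Puma cluster, per the formula above.
def max_memory(master:, worker:, bloat:, workers:, threads:)
  master + workers * (worker + bloat * threads)
end

# The default 3-worker, 5-thread configuration measured above:
max_memory(master: 117, worker: 117, bloat: 10, workers: 3, threads: 5) # => 618
```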

I recommend changing to a default of two worker processes and two
threads per dyno, changing the usage to:

    117 + 2 * (117 + 10 * 2) = 391

This provides reasonable performance with a high memory ceiling. When
applications begin to show troublesome performance characteristics under
load, developers can tune the application's process and thread count
according to its real-world memory usage, possibly upgrading the dyno
size as appropriate.
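As a rough sketch, a config/puma.rb reflecting these defaults might look like the following (the environment variable names are assumptions; the actual Suspenders template may differ):

```ruby
# config/puma.rb (sketch): 2 workers x 2 threads by default,
# tunable per-app via environment variables.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

threads_count = Integer(ENV.fetch("MAX_THREADS", 2))
threads threads_count, threads_count

preload_app!
```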

tute (Contributor) commented Oct 20, 2015

This looks good to me. It seems like we can keep a higher number of threads while still comfortably fitting in a dyno, although if you'd rather have suspenders be lighter by default I'm not against that.

👍

jferris (Contributor, Author) commented Oct 20, 2015

@tute

> This is explained in the code comment, removing.

I'm not sure that it is explained in the code comment, so maybe my message or comment isn't clear.

Increasing processes is very good for concurrency but very bad for memory usage. Increasing threads helps somewhat with concurrency but isn't as bad for memory usage. This is because processes don't efficiently share memory (even though Ruby 2.1+ purportedly has copy-on-write), and threads (in Ruby) don't efficiently share CPU, because of the global interpreter lock.

I think 2x2 is a good combination, because it means that:

  • If one of the workers on a dyno is fully occupied serving a slow, CPU-bound request (such as rendering complex ERB or JSON), another request routed to that dyno can be handled by the other worker.
  • If one of the threads on a worker is occupied serving a slow, IO-bound request (such as a slow database query), the other thread can still respond to requests.
  • More than two processes gets close to the memory ceiling, even with one thread, so two processes feels like a good max.

> It seems like we can keep a higher number of threads while still comfortably fitting in a dyno, although if you'd rather have suspenders be lighter by default I'm not against that.

More than two threads on those processes increases the risk of simultaneously serving multiple bloated requests. If the worst request bloats a process by 30MB, the worst-case scenario for a dyno in a Suspenders app will bloat past the 512MB limit. Two threads feels like a safe maximum while still providing at least some safety against IO-bound requests.
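Re-deriving those numbers with the formula from the description (hypothetical helper, written here just to check the arithmetic): at 30MB of bloat per thread, two threads per worker stays under the 512MB dyno limit, but a third thread pushes past it.

```ruby
def max_memory(master:, worker:, bloat:, workers:, threads:)
  master + workers * (worker + bloat * threads)
end

max_memory(master: 117, worker: 117, bloat: 30, workers: 2, threads: 2) # => 471, under 512
max_memory(master: 117, worker: 117, bloat: 30, workers: 2, threads: 3) # => 531, over 512
```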

Does that make more sense?

tute (Contributor) commented Oct 21, 2015

It does, thank you! The code comment did explain the relationship between workers and threads, but I appreciate the longer answer. Is this PR and #627 all we have in mind for now to improve memory usage for suspenders? If so, when we merge them I'll do a release.

jferris (Contributor, Author) commented Oct 21, 2015

@tute I'll take a quick look today to see if there's anything else we can do to slim it down and get closer to Rails bootup size. Most of the dependencies we add are development/test dependencies, so it seems like Suspenders size should be pretty close.

jferris (Contributor, Author) commented Oct 21, 2015

@tute I don't think there are any other major quick wins to remove from our Gemfile.

The startup for a vanilla Rails app using Postgres is ~78M for the master process and ~72M for each cold worker process. The startup for Suspenders is ~90M for the master process and ~83M for each worker process.

If we removed all the gems Suspenders uses that vanilla Rails doesn't, we'd gain around 35M of extra ceiling for each dyno. This isn't significant, since Rails apps under load will easily bloat more than that already.

If we were going to remove something, the biggest win would be to remove NewRelic, which would reduce our usage to 84M in the master and 77M for workers, saving 18M for each dyno.

For things that not every app uses, changing it so that ActionMailer and the mail gem are only loaded when needed would save a significant amount of memory (somewhere around 20M per process or 60M per dyno). However, that would involve disabling recipient_interceptor and delayed_job_active_record. That may be worth exploring, but I think it's worth cutting a release without worrying about it.

bernerdschaefer commented

I don't totally understand the relationship between the number of threads and the database pool, but should the default production pool size be adjusted to match these changes?

https://github.com/thoughtbot/suspenders/blob/master/templates/postgresql_database.yml.erb#L18

Otherwise, this looks great; thanks for diving deep on it! 👍

jferris (Contributor, Author) commented Oct 21, 2015

@bernerdschaefer The Heroku guide to concurrency and database connections recommends using a pool equal to the number of threads:

> If you are using the Puma web server we recommend setting the pool value to equal ENV['MAX_THREADS'].

This is the pool setting Heroku recommends in database.yml:

    ENV["DB_POOL"] || ENV['MAX_THREADS'] || 5

Ours is similar, but not quite the same:

    [Integer(ENV.fetch("MAX_THREADS", 5)), Integer(ENV.fetch("DB_POOL", 5))].max

I'm not sure why it's a little different, but it seems like we'll have at least one connection for each thread on the Puma server, so I think we're okay; we might be using more database connections for each process than we technically need to. It looks like @calebthompson introduced this line in the commit that introduced Puma.
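A quick sketch of how the two expressions diverge, with a plain hash standing in for ENV (the helper names here are hypothetical, not Suspenders code): with MAX_THREADS=2 and DB_POOL unset, Heroku's expression yields one connection per thread, while ours never drops below the default of 5.

```ruby
# Heroku's recommended pool expression
def heroku_pool(env)
  (env["DB_POOL"] || env["MAX_THREADS"] || 5).to_i
end

# Suspenders' current pool expression
def suspenders_pool(env)
  [Integer(env.fetch("MAX_THREADS", 5)), Integer(env.fetch("DB_POOL", 5))].max
end

env = { "MAX_THREADS" => "2" } # DB_POOL unset
heroku_pool(env)     # => 2: exactly one connection per thread
suspenders_pool(env) # => 5: more connections than the threads can use
```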

It probably makes sense to use the Heroku-recommended settings for database.yml, but I also think it makes sense to handle that in a different pull request.

Any thoughts?

bernerdschaefer commented

I think tackling the questions about database pool size indeed makes sense in a separate PR.

jferris force-pushed the jf-adjust-puma-defaults branch from 0c2c4f8 to bfd75f9 on October 21, 2015
jferris merged commit bfd75f9 into master on Oct 21, 2015
jferris deleted the jf-adjust-puma-defaults branch on October 21, 2015