Action Cable memory leak #26119

Closed
chrismccord opened this Issue Aug 11, 2016 · 9 comments

@chrismccord

chrismccord commented Aug 11, 2016

Steps to reproduce

$ redis-server &
$ git clone git@github.com:chrismccord/channelsac.git
$ cd channelsac/rails
$ bundle
$ bundle exec puma -e production -w 8

Next, visit http://localhost:3000 and refresh the app multiple times. On our hardware, we can watch the memory grow unbounded, as seen here:

This was after an apparent "warm up" period where Rails appeared to be lazy loading resources, but the first several dozen page loads show oddly high memory growth. I wasn't sure if this was lazy loading or the puma workers warming up, so it may be benign:

Either way, memory appears to grow unbounded following this high-growth period and is never reclaimed.

Expected behavior

The memory is reclaimed following connection cleanup.

Actual behavior

Memory grows with each new connection and is not reclaimed after connections die.

System configuration

Rails version:
Observed on both Rails 5.0.0 and the 5-0-stable branch containing this patch:
https://github.com/rails/rails/pull/25615/files

Ruby version:
2.3.1

OS:
Ubuntu 16.04.1 x64

Let me know if I can provide more information to help diagnose this further.

@matthewd matthewd self-assigned this Aug 11, 2016

@fxn

Member

fxn commented Aug 11, 2016

❤️

@tenderlove

Member

tenderlove commented Aug 13, 2016

Hi,

I tried to reproduce this, but I'm not getting any traction. My processes max out at around 110MB and will even reclaim memory if I let it sit. I used the same Ruby version and same Gemfile.lock. AFAICT, the only thing I'm doing differently is that I'm running on OS X.

I have tried reloading many times in both Safari and Chrome. I ran ab against the page, and didn't see anything. Since ab didn't exercise the websocket, I wrote a quick script to connect to the websocket using curl and keep the socket open for a bit:

require 'thread'

# Cap the number of in-flight curl connections at 100; the producer
# loop below blocks once the queue is full.
queue = SizedQueue.new 100

# Reap one connection per second so the sockets are constantly churning.
reaper = Thread.new do
  while thread = queue.pop
    thread.kill
    sleep 1
  end
end

# Open "upgraded" connections against /cable with curl and hand each
# one to the reaper.
loop do
  t = Thread.new do
    system 'curl -s -i -N -H "Connection: Upgrade" -H "Upgrade: websocket" -H "Host: localhost" -H "Origin: http://localhost:3000" http://localhost:3000/cable > /dev/null'
  end
  queue << t
end

Still no memory leaks.

I applied this patch so that I could get memory statistics:

diff --git a/rails/app/controllers/pages_controller.rb b/rails/app/controllers/pages_controller.rb
index bb7924d..90f4104 100644
--- a/rails/app/controllers/pages_controller.rb
+++ b/rails/app/controllers/pages_controller.rb
@@ -1,4 +1,11 @@
 class PagesController < ApplicationController
+  def gc
+    GC.start
+    GC.start
+    GC.start
+    render json: JSON.dump(GC.stat)
+  end
+
   def show
   end
 end
diff --git a/rails/config/routes.rb b/rails/config/routes.rb
index e0fbf08..984263b 100644
--- a/rails/config/routes.rb
+++ b/rails/config/routes.rb
@@ -2,4 +2,5 @@ Rails.application.routes.draw do
   # For details on the DSL available within this file, see http://guides.rubyonrails.org/routing.html
   root to: "pages#index"
   mount ActionCable.server => "/cable"
+  get "/pages/gc", :to => 'pages#gc'
 end

I observed heap_live_slots remain steady.
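
For reference, here's a minimal sketch of the kind of check I mean, assuming the /pages/gc route from the patch above and the app running on localhost:3000 (URL and interval are just placeholders):

require 'net/http'
require 'json'

# Poll the GC endpoint added above and print the live-slot count.
loop do
  stat = JSON.parse(Net::HTTP.get(URI('http://localhost:3000/pages/gc')))
  puts "heap_live_slots=#{stat['heap_live_slots']}"
  sleep 5
end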

My current hunch is that it is specific to the platform. The only difference I can see between our environments is that I'm running on OS X and the leak was observed on Ubuntu. @chrismccord does the script I pasted above reproduce the memory leak on your environment? I'd like to try to automate the repro steps as much as possible.

Thanks!

@chrismccord

chrismccord commented Aug 16, 2016

I applied the patches and put together a new branch, cm-diagnose-memory (https://github.com/chrismccord/channelsac/tree/cm-memory-diagnose), that automates the refreshing at least. When you visit the root page, it connects to Action Cable, posts a message to the room, and refreshes. Rinse and repeat. The issue with the raw Ruby WS script is that it won't exercise Action Cable. I also added the pages/gc endpoint.

My findings with this on OS X are similar to yours. My puma workers (4) spin up to 118MB quickly, then creep up slowly to ~130MB, but settle there after 10 minutes, which seems normal. I reprovisioned a new Ubuntu instance and can replicate the seemingly unbounded memory growth there again. The puma workers continue to creep up after 20 minutes of running a couple of browser tabs against the server, consuming 164MB when I quit.

In both OS X and Ubuntu, the GC statistics appeared to remain steady, so I don't have a good answer as to where that memory is going within the puma processes.

However:

If I run ./bench_rails.sh on my local OS X machine and wait for 5-10k connections to establish, I do observe the puma workers not freeing any memory after the tsung connections are killed. They remain at ~300MB 20 minutes after the tsung benchmark is killed. The odd thing is that heap_live_slots steadily recovers back to pre-tsung levels, but the puma memory is never freed. I'm not sure whether it's normal for puma to greedily hang onto memory like this, but I'm assuming not.
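
Something like this rough sketch is how one could watch a single puma worker's RSS on the Ubuntu box alongside the GC stats (Linux /proc only; the worker PID is whatever you pass in, so this is just an illustration, not something in the branch):

# Print a puma worker's resident set size every 10 seconds.
# Usage: ruby <script> <worker pid>
pid = Integer(ARGV.fetch(0))

loop do
  rss_kb = File.read("/proc/#{pid}/status")[/VmRSS:\s+(\d+)/, 1].to_i
  puts "pid=#{pid} rss=#{rss_kb / 1024}MB"
  sleep 10
end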

@tenderlove if you can send me your public key, I can give you access to the server to take a look. If not, let me know so I can tear it down. Thanks!

@tenderlove

Member

tenderlove commented Aug 17, 2016

if you can send me your public key, I can give you access to the server to take a look. If not, let me know so I can tear it down. Thanks!

I'd send you my keys, but I'm not going to be able to poke at it until next week. 😞

I reprovisioned a new Ubuntu instance and can replicate the seemingly unbounded memory growth there again.

If I run ./bench_rails.sh on my local OS X machine and wait for 5-10k connections to establish, I do observe the puma workers not freeing any memory after the tsung connections are killed. They remain at ~300MB 20 minutes after the tsung benchmark is killed. The odd thing is that heap_live_slots steadily recovers back to pre-tsung levels, but the puma memory is never freed.

It's possible that we have a malloc leak on Ubuntu but not OS X, or that we need to adjust some GC tuning variables. Do :heap_allocated_pages or :heap_sorted_length grow (and how about :total_freed_pages)? If you hit the GC endpoint after running the tsung benchmark, does it recover memory?
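
Something along these lines would show the before/after difference for those keys (just a sketch, assuming the /pages/gc endpoint from my earlier patch and MRI 2.3 GC.stat key names):

require 'net/http'
require 'json'

# GC.stat keys to compare around the benchmark run.
KEYS = %w[heap_allocated_pages heap_sorted_length total_freed_pages heap_live_slots]

def gc_snapshot
  JSON.parse(Net::HTTP.get(URI('http://localhost:3000/pages/gc')))
end

before = gc_snapshot
puts 'Run the tsung benchmark, then press enter...'
$stdin.gets
after = gc_snapshot

KEYS.each { |key| puts "#{key}: #{before[key]} -> #{after[key]}" }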

@chrismccord

This comment has been minimized.

chrismccord commented Aug 17, 2016

If you hit the GC endpoint after running the tsung benchmark, does it recover memory?

I hit the gc endpoint many times after running the tsung benchmark, but I only paid attention to heap_live_slots, which did steadily go back down, even though the puma processes maintained their memory usage.

I'd send you my keys, but I'm not going to be able to poke at it until next week. 😞

No worries! I'll kill off the instance, but I took a snapshot so I can reprovision quickly once you're ready :)

@jlduran

Contributor

jlduran commented Aug 23, 2016

@tenderlove

Member

tenderlove commented Oct 7, 2016

Hi. Sorry it's taken so long to get to this. I finally got the application up and running on an Ubuntu virtual machine. When I use tsung though, I get this error:

2016-10-07 20:13:27 +0100: HTTP parse error, malformed request (): #<Puma::HttpParserError: Invalid HTTP format, parsing fails.>
2016-10-07 20:13:27 +0100: ENV: {"rack.version"=>[1, 3], "rack.errors"=>#<IO:<STDERR>>, "rack.multithread"=>true, "rack.multiprocess"=>false, "rack.run_once"=>false, "SCRIPT_NAME"=>"", "QUERY_STRING"=>"", "SERVER_PROTOCOL"=>"HTTP/1.1", "SERVER_SOFTWARE"=>"puma 3.5.2 Amateur Raccoon Rocketry", "GATEWAY_INTERFACE"=>"CGI/1.2"}

I tried hitting the page and refreshing a bunch, but still can't reproduce it. All processes would get up to about 148M and just sit there.

@chrisarcand are you able to reproduce this without -w 8? Just single process mode? That would make debugging much easier.

@chrisarcand

Contributor

chrisarcand commented Oct 7, 2016

cc/ @chrismccord 😉

@tenderlove

Member

tenderlove commented Oct 11, 2016

@chrisarcand derp, thanks. 😉

@stale stale bot added the wontfix label Mar 27, 2017

@rafaelfranca rafaelfranca added stale and removed stale wontfix labels Mar 27, 2017

@stale stale bot closed this Apr 3, 2017
