riemann stops working properly following OOME #623

Closed
abailly opened this issue Nov 17, 2015 · 36 comments · Fixed by #640

@abailly
Contributor

abailly commented Nov 17, 2015

Hello,
I am running riemann 0.2.9 inside a docker container with java version

java version "1.7.0_75"
OpenJDK Runtime Environment (IcedTea 2.5.4) (7u75-2.5.4-1~trusty1)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)

Every week or so, my riemann server stops working properly following an OutOfMemoryError (OOME). I got the following stderr output:

Exception in thread "riemann task 3" Exception in thread "riemann task 0" Exception in thread "Thread-7" 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "riemann task 0"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "riemann task 3"
java.lang.OutOfMemoryError: Java heap space
Exception in thread "Thread-6" java.lang.OutOfMemoryError: Java heap space
Exception in thread "epollEventLoopGroup-4-1" java.lang.OutOfMemoryError: Java heap space
Exception in thread "epollEventLoopGroup-4-3" java.lang.OutOfMemoryError: Java heap space
Exception in thread "defaultEventExecutorGroup-2-2" java.lang.OutOfMemoryError: Java heap space
Exception in thread "defaultEventExecutorGroup-2-1" java.lang.OutOfMemoryError: Java heap space

and a lot of lines like the following in riemann's own logs:

WARN [2015-11-15 14:28:02,021] epollEventLoopGroup-4-2 - io.netty.channel.DefaultChannelPipeline - Failed to submit an exceptionCaught() event.
WARN [2015-11-15 14:28:02,022] epollEventLoopGroup-4-2 - io.netty.channel.DefaultChannelPipeline - The exceptionCaught() event that was failed to submit was:
WARN

I would expect riemann to simply stop (crash outright) in case of OOME rather than limp along, but I may be misunderstanding something. What should I do to prevent this from happening? The host has 4GB of RAM and the data input is really not heavy (collectd stats every minute or so, plus logs from our app, which amounts to a maximum of 5 messages/sec).

Thanks for any advice.

@faxm0dem
Contributor

How much memory do you assign to the heap? ps -p <pid> -o args should contain this information in the form of -Xmx and -Xms flags.

@abailly
Contributor Author

abailly commented Nov 17, 2015

I use the default configuration:

java -cp /usr/share/riemann/riemann.jar: riemann.bin start /etc/riemann/riemann.config

@aphyr
Collaborator

aphyr commented Nov 17, 2015

Can you get the process close to running out of memory again, take a heap dump using jmap, and get me a profile? I may be able to tell you what the problem is and make an optimization to give you more breathing room.
There's not a whole lot Riemann can do if it can't allocate memory. You're right that we should not crash and should recover gracefully, but I'm not exactly sure how to do that in a way that wouldn't require more allocations--e.g. expiring events from the index. If you have ideas I'd be delighted to see a patch.
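(For reference, a heap dump like the one requested here can usually be captured with the JDK's jmap tool; the process id and output path below are only placeholders:)

jmap -dump:format=b,file=/tmp/riemann-heap.hprof <riemann-pid>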

@abailly
Contributor Author

abailly commented Nov 17, 2015

Hi Kyle,
I will try to do that; maybe I can reproduce it locally, for example by constraining memory even more?

Regarding your second paragraph, I would be delighted to contribute as I am a great fan of riemann. I would like to be sure we are on the same page: My understanding is that we should "stop the world" in case of OOME so that the whole process crashes instead of staying around but not being able to process events. Am I correct?

If yes then I can try to have a look and propose something.

@aphyr
Collaborator

aphyr commented Nov 17, 2015

Welllll... If the process can't serve any requests it might as well be crashed, so crashing is probably the better option! Then at least a watchdog can restart it. However, there are some causes of memory exhaustion, like pathological clients, where we might be able to free memory and keep working by kicking the client offline.
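(For reference, one way to get that crash-and-restart behaviour from a stock HotSpot JVM is the OnOutOfMemoryError hook, which runs a command the first time an OOME is thrown; the invocation below just extends the default command line shown earlier and is only a sketch:)

java -XX:OnOutOfMemoryError="kill -9 %p" -cp /usr/share/riemann/riemann.jar: riemann.bin start /etc/riemann/riemann.config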

@abailly
Contributor Author

abailly commented Nov 17, 2015

Right. What would define a pathological client? Too many requests piling up?

@aphyr
Collaborator

aphyr commented Nov 17, 2015

Some clients don't read any of Riemann's responses and things pile up in their netty send buffer, for instance.

@abailly
Contributor Author

abailly commented Nov 17, 2015

Oh! Then that may be the actual cause of my problem: I wrote the client myself; it uses TCP for sending messages but never reads from the connection. Would that explain it all?
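(For what it's worth, a minimal sketch of what reading the responses could look like for a hand-rolled client, assuming riemann's TCP framing of a 4-byte big-endian length prefix followed by a protobuf Msg; the helper below is hypothetical, not part of any riemann client library:)

(import '[java.io DataInputStream DataOutputStream]
        '[java.net Socket])

;; Hypothetical helper: write one already-encoded protobuf Msg, then read
;; (and discard) the server's ack frame so acks never pile up in riemann's
;; outbound buffers.
(defn send-and-drain-ack
  [^Socket sock ^bytes encoded-msg]
  (let [out (DataOutputStream. (.getOutputStream sock))
        in  (DataInputStream. (.getInputStream sock))]
    (.writeInt out (alength encoded-msg))            ; 4-byte length prefix
    (.write out encoded-msg 0 (alength encoded-msg)) ; protobuf Msg body
    (.flush out)
    (let [len (.readInt in)                          ; ack uses the same framing
          buf (byte-array len)]
      (.readFully in buf)                            ; consume the ack body
      buf)))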

@aphyr
Collaborator

aphyr commented Nov 17, 2015

Maybe, haha, but either way we should be more robust to broken clients.

@abailly
Contributor Author

abailly commented Nov 17, 2015

Yeah, sure. OK then, I will fix my client, then have a look at how riemann handles client connections and see if I can submit a PR for it. I will leave the issue open for the moment as a reminder.

@abailly
Contributor Author

abailly commented Jan 3, 2016

So I modified my custom client to ensure riemann's responses are read, and it still crashes after a while. I also bumped -Xmx to 2G and I can definitely see memory slowly accumulating in the riemann process's heap... The only other client I am using is collectd, which comes with a Riemann plugin. I am wondering whether this one could also forget to read responses from the server.

Here is the output of jmap -heap a few minutes after a restart:

Attaching to process ID 868, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.91-b01

using thread-local object allocation.
Parallel GC with 2 thread(s)

Heap Configuration:
   MinHeapFreeRatio = 0
   MaxHeapFreeRatio = 100
   MaxHeapSize      = 2147483648 (2048.0MB)
   NewSize          = 1310720 (1.25MB)
   MaxNewSize       = 17592186044415 MB
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 21757952 (20.75MB)
   MaxPermSize      = 174063616 (166.0MB)
   G1HeapRegionSize = 0 (0.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 267386880 (255.0MB)
   used     = 50210328 (47.884300231933594MB)
   free     = 217176552 (207.1156997680664MB)
   18.778156953699447% used
From Space:
   capacity = 18350080 (17.5MB)
   used     = 12103576 (11.542869567871094MB)
   free     = 6246504 (5.957130432128906MB)
   65.95925467354911% used
To Space:
   capacity = 18350080 (17.5MB)
   used     = 0 (0.0MB)
   free     = 18350080 (17.5MB)
   0.0% used
PS Old Generation
   capacity = 42991616 (41.0MB)
   used     = 29363824 (28.003524780273438MB)
   free     = 13627792 (12.996475219726562MB)
   68.30127995188643% used
PS Perm Generation
   capacity = 50331648 (48.0MB)
   used     = 49848848 (47.53956604003906MB)
   free     = 482800 (0.4604339599609375MB)
   99.04076258341472% used

17657 interned Strings occupying 2131040 bytes.

I have a full heap dump but it is of course quite large, about 15MB bzip2-compressed. Can I attach it to this issue or is it better to share it with you on Google Drive?

@abailly
Contributor Author

abailly commented Jan 4, 2016

Here is the link to the compressed heap: https://drive.google.com/file/d/0BxxGw1_rva_tM1JiYTdxNm1XQmM/view?usp=sharing

@abailly
Contributor Author

abailly commented Jan 6, 2016

I checked collectd's riemann plugin source code and it IS reading acks from riemann, so this is not the problem. I can still see the memory slowly growing over time: it was 850MB this morning, now at 1GB+ and counting... I have not yet had time to revive my Java tooling to the point where I can analyze the heap dump.

@abailly
Contributor Author

abailly commented Jan 6, 2016

So it is clear from the following snapshot that @aphyr is right: It is the accumulation of unsent messages in netty's queues that's growing memory:
[screenshot: heap dump profile (2016-01-06 22:58) showing retained Netty outbound buffers full of unsent messages]

I don't know why this is the case though...

@aphyr
Collaborator

aphyr commented Jan 6, 2016

We should forcibly close connections that have too many outstanding acks, and log a message.

@mfournier

> I checked collectd's riemann plugin source code and it IS reading acks from riemann, so this is not the problem.

NB: early implementations of the collectd plugin did not honour these ACKs, leading to this sort of problem. You might want to make sure you're using an up to date collectd version (this was fixed in version 5.4.2).

@abailly
Contributor Author

abailly commented Jan 7, 2016

Oh, I was not aware of that, thank you very much! I will upgrade our servers to use this version; there does seem to be a PPA available for ubuntu:trusty: https://launchpad.net/~collectd/+archive/ubuntu/collectd-5.5

@abailly
Contributor Author

abailly commented Jan 7, 2016

@aphyr: Looking at riemann's source code and Netty's API, I would like to suggest the following solution: there is an isWritable method on Channel that returns false when (among other things) the send buffer is full. It would be simple to add a condition in tcp-handler that checks the writability of the channel before calling writeAndFlush, e.g.

        (let [response (handle core message)]
          (when (.. ctx channel isWritable)
            (.writeAndFlush (.channel ctx) response)))

@aphyr
Collaborator

aphyr commented Jan 7, 2016

That seeeeems helpful but I don't think we should accept operations if we can't ack them, nor should we just pause indefinitely--that'll just back up Riemann's recv queue. Maybe tighter buffer sizes are in order?
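(To make "tighter buffer sizes" concrete: a hedged sketch of lowering Netty 4's write-buffer water marks on a server bootstrap, so isWritable flips to false well before the 32 KiB / 64 KiB defaults are reached. The function name is made up and the thresholds are illustrative, not values riemann actually uses.)

(import '[io.netty.bootstrap ServerBootstrap]
        '[io.netty.channel ChannelOption])

;; Illustrative only: lower the outbound-buffer water marks so
;; channelWritabilityChanged fires (and isWritable returns false) early.
(defn tighten-write-buffers
  [^ServerBootstrap bootstrap]
  (doto bootstrap
    (.childOption ChannelOption/WRITE_BUFFER_LOW_WATER_MARK  (int 8192))
    (.childOption ChannelOption/WRITE_BUFFER_HIGH_WATER_MARK (int 16384))))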

@abailly
Contributor Author

abailly commented Jan 7, 2016

I understand your concern; however, according to netty/netty#1903, checking the writability of the channel is the way to go to prevent the buffer from filling up and memory from overflowing. Not acking when the buffer is full seems to me the smoothest route, as it won't prevent "malformed" clients from operating while still preventing memory from being clogged with unsendable ack messages.

As an alternative, we could intercept channelWritabilityChanged events in the adapter set up by gen-tcp-handler and do something there, but I don't see anything to do other than close the channel, which seems harsh.

  (proxy [ChannelInboundHandlerAdapter] []
    (channelActive [ctx]
      (.add channel-group (.channel ctx)))

    ;; to handle the outbound buffer filling up: close the channel
    ;; as soon as it becomes unwritable
    (channelWritabilityChanged [^ChannelHandlerContext ctx]
      (let [channel (.channel ctx)]
        (when-not (.isWritable channel)
          (.close channel))))

    (channelRead [^ChannelHandlerContext ctx ^Object message]
      (try
        (handler @core stats ctx message)
        (catch java.nio.channels.ClosedChannelException e
          (warn "channel closed"))))

    (exceptionCaught [^ChannelHandlerContext ctx ^Throwable cause]
      (warn cause "TCP handler caught")
      (.close (.channel ctx)))

    (isSharable [] true))

@aphyr
Collaborator

aphyr commented Jan 7, 2016

I'm OK with that. Could you run a test with a client that doesn't read acks and see if its memory still balloons?

@abailly
Contributor Author

abailly commented Jan 7, 2016

Actually I was aiming at writing a test for that :-) Can you tell me which one you are OK with, though:

  1. not writing the ack when the buffer is full
  2. closing the channel when it becomes unwritable

Thanks

@aphyr
Collaborator

aphyr commented Jan 7, 2016

Hehe I am fine with whatever prevents us from OOMing :-)

@abailly
Contributor Author

abailly commented Jan 8, 2016

I have deployed a patched version of riemann from #640 and will watch how it behaves over the weekend, to be sure RAM is not eaten up by netty buffers anymore.

@abailly
Contributor Author

abailly commented Jan 9, 2016

I can confirm that RAM consumption is now flat and collectd (using the "faulty" version 5.4.0) keeps feeding data that I can see in riemann's dashboard, so I guess the patch is working :-)

@aphyr
Collaborator

aphyr commented Jan 9, 2016

Cool! And I assume this won't kill fast clients that only allow n outstanding reqs on the wire, as long as n*msgsize is smaller than the recv buffer.

@abailly
Contributor Author

abailly commented Jan 9, 2016

I did not delve into the exact details of the implementation of isWritable or when netty triggers the channelWritabilityChanged event, but as long as the client consumes acks fast enough, it should be fine.

@abailly
Contributor Author

abailly commented Jan 11, 2016

I probably declared victory too early... The Riemann process's RAM use still grows; I am taking a heap dump and analyzing it now.

@abailly
Contributor Author

abailly commented Jan 11, 2016

Hmm, it might be an artifact of GC. Maybe when a channel is closed its buffer is still retained in the old generation of the heap and does not get collected until the next major GC, which occurs infrequently because there is not much pressure on RAM. I probably need to run this for a week or so to get more assurance that my fix actually works as expected.
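(A cheap way to test that hypothesis without waiting a week: jstat, which ships with the JDK, can sample old-generation occupancy over time; the pid placeholder and 5-second interval are just examples:)

jstat -gcutil <riemann-pid> 5000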

@abailly
Contributor Author

abailly commented Jan 18, 2016

Well, unfortunately I still got an OOME even after my fix. I tried to take a heap dump from the process but it apparently failed somewhere in the middle. I will retry launching the riemann process with an Oracle JDK and see if it goes better...

@pyr
Contributor

pyr commented Feb 9, 2016

@abailly are you trying with a "noop" config to ensure that the clients are still the problem after your fix? e.g:

(streams prn) 

would be a good start. At -Xmx64m you will find out rather quickly whether memory usage is still growing.
Thanks for the work on this issue so far!
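(For the record, a sketch of what such a test run could look like, based on the default invocation shown earlier: keep the server declarations from the stock config, replace the stream section with (streams prn), and constrain the heap on the command line:)

java -Xmx64m -cp /usr/share/riemann/riemann.jar: riemann.bin start /etc/riemann/riemann.config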

pyr added the bug label Feb 9, 2016
pyr self-assigned this Feb 9, 2016
@abailly
Contributor Author

abailly commented Feb 9, 2016

Thanks for the suggestion. No, I did not try that.
I definitely think the client is an issue, and more precisely the collectd client I am using (an oldish version that comes with Ubuntu 14.04): I noticed the memory consumed by the riemann process was stable for a while, then grew to more than 3.2GB, at which point I stopped all collectd clients and it stabilized at 2.9GB.
One thing that might hinder the proper flushing of TCP buffers is that I added one level of indirection through stunnel: the riemann process is accessed behind a couple of stunnel proxies. Not sure if this is relevant...

@pyr
Contributor

pyr commented Feb 9, 2016

It definitely might be. Sorry about that early collectd write_riemann client; it's causing headaches for a few people. The stunnel scenario might be at play here as well. It's going to be hard for us to act on this issue: do you want to do additional testing, or can we close it for now?

@abailly
Contributor Author

abailly commented Feb 9, 2016

Hmm, I guess you can close it. What about the associated PR #640?

@pyr
Contributor

pyr commented Feb 9, 2016

@abailly I intend to review it shortly and get it ready for a merge.

pyr closed this as completed Feb 9, 2016
@abailly
Contributor Author

abailly commented Feb 9, 2016

Thanks.
