riemann stops working properly following OOME #623

Closed
abailly opened this issue Nov 17, 2015 · 36 comments · Fixed by #640

@abailly
Contributor

abailly commented Nov 17, 2015

Hello,
I am running riemann 0.2.9 inside a docker container with java version

java version "1.7.0_75"
OpenJDK Runtime Environment (IcedTea 2.5.4) (7u75-2.5.4-1~trusty1)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)

Every week or so, my riemann server stops working properly following an OutOfMemoryError (OOME). I got the following stderr output:

Exception in thread "riemann task 3" Exception in thread "riemann task 0" Exception in thread "Thread-7" 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "riemann task 0"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "riemann task 3"
java.lang.OutOfMemoryError: Java heap space
Exception in thread "Thread-6" java.lang.OutOfMemoryError: Java heap space
Exception in thread "epollEventLoopGroup-4-1" java.lang.OutOfMemoryError: Java heap space
Exception in thread "epollEventLoopGroup-4-3" java.lang.OutOfMemoryError: Java heap space
Exception in thread "defaultEventExecutorGroup-2-2" java.lang.OutOfMemoryError: Java heap space
Exception in thread "defaultEventExecutorGroup-2-1" java.lang.OutOfMemoryError: Java heap space

and a lot of lines like the following in riemann's own logs:

WARN [2015-11-15 14:28:02,021] epollEventLoopGroup-4-2 - io.netty.channel.DefaultChannelPipeline - Failed to submit an exceptionCaught() event.
WARN [2015-11-15 14:28:02,022] epollEventLoopGroup-4-2 - io.netty.channel.DefaultChannelPipeline - The exceptionCaught() event that was failed to submit was:
WARN

I would expect riemann to simply stop (crash outright) in case of OOME rather than limp along, but I may be misunderstanding something. What should I do to prevent this from happening? The host has 4GB of RAM and the data input is really not heavy (collectd stats every minute or so, plus logs from our app, which amounts to a maximum of 5 messages/sec).

Thanks for any advice.

@faxm0dem
Contributor

How much memory do you assign to the heap? ps -p <pid> -o args should contain this information in the form of -Xmx and -Xms flags.

@abailly
Contributor Author

abailly commented Nov 17, 2015

I use the default configuration:

java -cp /usr/share/riemann/riemann.jar: riemann.bin start /etc/riemann/riemann.config

@aphyr
Collaborator

aphyr commented Nov 17, 2015

Can you get the process close to running out of memory again, take a heap dump using jmap, and get me a profile? I may be able to tell you what the problem is and make an optimization to give you more breathing room.
There's not a whole lot Riemann can do if it can't allocate memory. You're right that we should not crash and should recover gracefully, but I'm not exactly sure how to do that in a way that wouldn't require more allocations--e.g. expiring events from the index. If you have ideas I'd be delighted to see a patch.
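(For reference, a heap dump like the one requested here can usually be captured with the JDK's jmap tool; the process id and output path below are only placeholders:)

jmap -dump:format=b,file=/tmp/riemann-heap.hprof <riemann-pid>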

@abailly
Contributor Author

abailly commented Nov 17, 2015

Hi Kyle,
I will try to do that; maybe I can reproduce it locally, for example by constraining memory even more?

Regarding your second paragraph, I would be delighted to contribute as I am a great fan of riemann. I would like to be sure we are on the same page: My understanding is that we should "stop the world" in case of OOME so that the whole process crashes instead of staying around but not being able to process events. Am I correct?

If yes then I can try to have a look and propose something.

@aphyr
Collaborator

aphyr commented Nov 17, 2015

Welllll... If the process can't serve any requests it might as well be crashed, so crashing is probably the better option! Then at least a watchdog can restart it. However, there are some causes of memory exhaustion, like pathological clients, where we might be able to free memory and keep working by kicking the client offline.
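(For reference, one way to get that crash-and-restart behaviour from a stock HotSpot JVM is the OnOutOfMemoryError hook, which runs a command the first time an OOME is thrown; the invocation below just extends the default command line shown earlier and is only a sketch:)

java -XX:OnOutOfMemoryError="kill -9 %p" -cp /usr/share/riemann/riemann.jar: riemann.bin start /etc/riemann/riemann.config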

@abailly
Contributor Author

abailly commented Nov 17, 2015

Right. What would define a pathological client? Too many requests piling up?

@aphyr
Collaborator

aphyr commented Nov 17, 2015

Some clients don't read any of Riemann's responses and things pile up in their netty send buffer, for instance.

@abailly
Contributor Author

abailly commented Nov 17, 2015

Oh! Then that may be the actual cause of my problem: I wrote the client myself; it uses TCP for sending messages but never reads from the connection. Would that explain it all?
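(For what it's worth, a minimal sketch of what reading the responses could look like for a hand-rolled client, assuming riemann's TCP framing of a 4-byte big-endian length prefix followed by a protobuf Msg; the helper below is hypothetical, not part of any riemann client library:)

(import '[java.io DataInputStream DataOutputStream]
        '[java.net Socket])

;; Hypothetical helper: write one already-encoded protobuf Msg, then read
;; (and discard) the server's ack frame so acks never pile up in riemann's
;; outbound buffers.
(defn send-and-drain-ack
  [^Socket sock ^bytes encoded-msg]
  (let [out (DataOutputStream. (.getOutputStream sock))
        in  (DataInputStream. (.getInputStream sock))]
    (.writeInt out (alength encoded-msg))            ; 4-byte length prefix
    (.write out encoded-msg 0 (alength encoded-msg)) ; protobuf Msg body
    (.flush out)
    (let [len (.readInt in)                          ; ack uses the same framing
          buf (byte-array len)]
      (.readFully in buf)                            ; consume the ack body
      buf)))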

@aphyr
Collaborator

aphyr commented Nov 17, 2015

Maybe, haha, but either way we should be more robust to broken clients.

@abailly
Contributor Author

abailly commented Nov 17, 2015

Yeah, sure. OK then, I will fix my client, then have a look at how riemann handles client connections and see if I can submit a PR for it. I will leave the issue open for the moment as a reminder.

@abailly
Contributor Author

abailly commented Jan 3, 2016

So I modified my custom client to ensure riemann's responses are read, and it still crashes after a while. I also bumped -Xmx to 2G and I can definitely see memory slowly accumulating in the riemann process's heap... The only other client I am using is collectd, which comes with a Riemann plugin. I am wondering whether this one could also forget to read responses from the server.

Here is the output of jmap -heap a few minutes after a restart:

Attaching to process ID 868, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.91-b01

using thread-local object allocation.
Parallel GC with 2 thread(s)

Heap Configuration:
   MinHeapFreeRatio = 0
   MaxHeapFreeRatio = 100
   MaxHeapSize      = 2147483648 (2048.0MB)
   NewSize          = 1310720 (1.25MB)
   MaxNewSize       = 17592186044415 MB
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 21757952 (20.75MB)
   MaxPermSize      = 174063616 (166.0MB)
   G1HeapRegionSize = 0 (0.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 267386880 (255.0MB)
   used     = 50210328 (47.884300231933594MB)
   free     = 217176552 (207.1156997680664MB)
   18.778156953699447% used
From Space:
   capacity = 18350080 (17.5MB)
   used     = 12103576 (11.542869567871094MB)
   free     = 6246504 (5.957130432128906MB)
   65.95925467354911% used
To Space:
   capacity = 18350080 (17.5MB)
   used     = 0 (0.0MB)
   free     = 18350080 (17.5MB)
   0.0% used
PS Old Generation
   capacity = 42991616 (41.0MB)
   used     = 29363824 (28.003524780273438MB)
   free     = 13627792 (12.996475219726562MB)
   68.30127995188643% used
PS Perm Generation
   capacity = 50331648 (48.0MB)
   used     = 49848848 (47.53956604003906MB)
   free     = 482800 (0.4604339599609375MB)
   99.04076258341472% used

17657 interned Strings occupying 2131040 bytes.

I have a full heap dump but it is of course quite large, about 15MB bzip2-compressed. Can I attach it to this issue or is it better to share it with you on Google Drive?

@abailly
Contributor Author

abailly commented Jan 4, 2016

Here is the link to the compressed heap: https://drive.google.com/file/d/0BxxGw1_rva_tM1JiYTdxNm1XQmM/view?usp=sharing

@abailly
Contributor Author

abailly commented Jan 6, 2016

I checked collectd's riemann plugin source code and it IS reading acks from riemann, so this is not the problem. I can still see the memory slowly growing over time: it was 850MB this morning, now at 1GB+ and counting... I have not yet had time to revive my Java tooling to the point where I can analyze the heap dump.

@abailly
Contributor Author

abailly commented Jan 6, 2016

So it is clear from the following snapshot that @aphyr is right: It is the accumulation of unsent messages in netty's queues that's growing memory:
[screenshot: heap dump profile (2016-01-06 22:58) showing retained Netty outbound buffers full of unsent messages]

I don't know why this is the case though...

@aphyr
Collaborator

aphyr commented Jan 6, 2016

We should forcibly close connections that have too many outstanding acks, and log a message.

@mfournier

> I checked collectd's riemann plugin source code and it IS reading acks from riemann, so this is not the problem.

NB: early implementations of the collectd plugin did not honour these ACKs, leading to this sort of problem. You might want to make sure you're using an up to date collectd version (this was fixed in version 5.4.2).

@abailly
Contributor Author

abailly commented Jan 7, 2016

Oh, I was not aware of that, thank you very much! I will upgrade our servers to use this version; there does seem to be a PPA available for ubuntu:trusty: https://launchpad.net/~collectd/+archive/ubuntu/collectd-5.5

@abailly
Contributor Author

abailly commented Jan 7, 2016

@aphyr: Looking at riemann's source code and Netty's API, I would like to suggest the following solution: there is an isWritable method on Channel that returns false when (among other things) the send buffer is full. It would be simple to add a condition in tcp-handler that checks the writability of the channel before calling writeAndFlush, e.g.

        (let [response (handle core message)]
          (when (.. ctx channel isWritable)
            (.writeAndFlush (.channel ctx) response)))

@aphyr
Collaborator

aphyr commented Jan 7, 2016

That seeeeems helpful but I don't think we should accept operations if we can't ack them, nor should we just pause indefinitely--that'll just back up Riemann's recv queue. Maybe tighter buffer sizes are in order?
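(To make "tighter buffer sizes" concrete: a hedged sketch of lowering Netty 4's write-buffer water marks on a server bootstrap, so isWritable flips to false well before the 32 KiB / 64 KiB defaults are reached. The function name is made up and the thresholds are illustrative, not values riemann actually uses.)

(import '[io.netty.bootstrap ServerBootstrap]
        '[io.netty.channel ChannelOption])

;; Illustrative only: lower the outbound-buffer water marks so
;; channelWritabilityChanged fires (and isWritable returns false) early.
(defn tighten-write-buffers
  [^ServerBootstrap bootstrap]
  (doto bootstrap
    (.childOption ChannelOption/WRITE_BUFFER_LOW_WATER_MARK  (int 8192))
    (.childOption ChannelOption/WRITE_BUFFER_HIGH_WATER_MARK (int 16384))))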

@abailly
Contributor Author

abailly commented Jan 7, 2016

I understand your concern; however, according to netty/netty#1903, checking the writability of the channel is the way to go to prevent the buffer from filling up and memory from overflowing. Not acking when the buffer is full seems to me the smoothest route, as it won't prevent "malformed" clients from operating while still preventing memory from being clogged with unsendable ack messages.

As an alternative, we could intercept channelWritabilityChanged events in the adapter set up by gen-tcp-handler and do something there, but I don't see anything to do other than close the channel, which seems harsh.

  (proxy [ChannelInboundHandlerAdapter] []
    (channelActive [ctx]
      (.add channel-group (.channel ctx)))

    ;; to handle the outbound buffer filling up: close the channel
    ;; as soon as it becomes unwritable
    (channelWritabilityChanged [^ChannelHandlerContext ctx]
      (let [channel (.channel ctx)]
        (when-not (.isWritable channel)
          (.close channel))))

    (channelRead [^ChannelHandlerContext ctx ^Object message]
      (try
        (handler @core stats ctx message)
        (catch java.nio.channels.ClosedChannelException e
          (warn "channel closed"))))

    (exceptionCaught [^ChannelHandlerContext ctx ^Throwable cause]
      (warn cause "TCP handler caught")
      (.close (.channel ctx)))

    (isSharable [] true))

@aphyr
Collaborator

aphyr commented Jan 7, 2016

I'm OK with that. Could you run a test with a client that doesn't read acks and see if its memory still balloons?

@abailly
Contributor Author

abailly commented Jan 7, 2016

Actually I was aiming at writing a test for that :-) Can you tell me which one you are OK with, though:

  1. not writing the ack when the buffer is full
  2. closing the channel when it becomes unwritable

Thanks

@aphyr
Collaborator

aphyr commented Jan 7, 2016

Hehe I am fine with whatever prevents us from OOMing :-)

@abailly
Contributor Author

abailly commented Jan 8, 2016

I have deployed a patched version of riemann from #640 and will watch how it behaves over the weekend, to be sure RAM is not eaten up by netty buffers anymore.

@abailly
Contributor Author

abailly commented Jan 9, 2016

I can confirm that RAM consumption is now flat and collectd (using the "faulty" version 5.4.0) keeps feeding data that I can see in riemann's dashboard, so I guess the patch is working :-)

@aphyr
Collaborator

aphyr commented Jan 9, 2016

Cool! And I assume this won't kill fast clients that only allow n outstanding reqs on the wire, as long as n*msgsize is smaller than the recv buffer.

@abailly
Contributor Author

abailly commented Jan 9, 2016

I did not delve into the exact details of the implementation of isWritable or when netty triggers the channelWritabilityChanged event, but as long as the client consumes acks fast enough, it should be fine.

@abailly
Contributor Author

abailly commented Jan 11, 2016

I probably declared victory too early... The Riemann process's RAM use still grows; I am taking a heap dump and analyzing it now.

@abailly
Contributor Author

abailly commented Jan 11, 2016

Hmm, it might be an artifact of GC. Maybe when a channel is closed its buffer is still retained in the old generation of the heap and does not get collected until the next major GC, which occurs infrequently because there is not much pressure on RAM. I probably need to run this for a week or so to get more assurance that my fix actually works as expected.
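(A cheap way to test that hypothesis without waiting a week: jstat, which ships with the JDK, can sample old-generation occupancy over time; the pid placeholder and 5-second interval are just examples:)

jstat -gcutil <riemann-pid> 5000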

@abailly
Contributor Author

abailly commented Jan 18, 2016

Well, unfortunately I still got an OOME even after my fix. I tried to take a heap dump from the process but it apparently failed somewhere in the middle. I will retry launching the riemann process with an Oracle JDK and see if it goes better...

@pyr
Contributor

pyr commented Feb 9, 2016

@abailly are you trying with a "noop" config to ensure that the clients are still the problem after your fix? e.g:

(streams prn) 

would be a good start. At -Xmx64m you will find out rather quickly whether memory usage is still growing.
Thanks for the work on this issue so far!
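(For the record, a sketch of what such a test run could look like, based on the default invocation shown earlier: keep the server declarations from the stock config, replace the stream section with (streams prn), and constrain the heap on the command line:)

java -Xmx64m -cp /usr/share/riemann/riemann.jar: riemann.bin start /etc/riemann/riemann.config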

pyr added the bug label Feb 9, 2016
pyr self-assigned this Feb 9, 2016
@abailly
Contributor Author

abailly commented Feb 9, 2016

Thanks for the suggestion. No, I did not try that.
I definitely think the client is an issue, and more precisely the collectd client I am using (an oldish version that comes with Ubuntu 14.04): I noticed the memory consumed by the riemann process was stable for a while, then grew to more than 3.2GB, at which point I stopped all collectd clients and it stabilized at 2.9GB.
One thing that might hinder the proper flushing of TCP buffers is that I added one level of indirection through stunnel: the riemann process is accessed behind a couple of stunnel proxies. Not sure if this is relevant...

@pyr
Contributor

pyr commented Feb 9, 2016

It definitely might be. Sorry about that early collectd write_riemann client; it's causing headaches for a few people. The stunnel scenario might be at play here as well. It's going to be hard for us to act on this issue: do you want to do additional testing, or can we close it for now?

@abailly
Contributor Author

abailly commented Feb 9, 2016

Hmm, I guess you can close it. What about the associated PR #640?

@pyr
Contributor

pyr commented Feb 9, 2016

@abailly I intend to review it shortly and get it ready for a merge.

pyr closed this as completed Feb 9, 2016
@abailly
Contributor Author

abailly commented Feb 9, 2016

Thanks.
