riemann stops working properly following OOME #623
Comments
How much memory do you assign to the heap?
I use the default configuration: java -cp /usr/share/riemann/riemann.jar: riemann.bin start /etc/riemann/riemann.config
Can you get the process close to running out of memory again, take a heap dump using jmap, and get me a profile? I may be able to tell you what the problem is and make an optimization to give you more breathing room.
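(For anyone following along, a sketch of that capture with a standard JDK's jmap; the pid and file path below are placeholders, not values from this thread.)

```sh
# Dump live objects to a binary heap file that MAT/VisualVM can open,
# then print a quick class histogram for a first look.
jmap -dump:live,format=b,file=/tmp/riemann.hprof <riemann-pid>
jmap -histo:live <riemann-pid> | head -n 30
```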
Hi Kyle, regarding your second paragraph: I would be delighted to contribute, as I am a great fan of riemann. I would like to be sure we are on the same page: my understanding is that we should "stop the world" in case of OOME, so that the whole process crashes instead of staying around while unable to process events. Am I correct? If yes, then I can try to have a look and propose something.
Welllll... If the process can't serve any requests it might as well be crashed, so crashing is probably the better option! Then at least a watchdog can restart it. However, there are some causes of memory exhaustion, like pathological clients, where we might be able to free memory and keep working by kicking the client offline.
Right. What would define a pathological client? Too many requests piling up?
Some clients don't read any of Riemann's responses and things pile up in their netty send buffer, for instance.
Oh! Then that may be the actual cause of my problem: I wrote the client myself; it uses TCP to send messages but never reads from the connection. Would that explain it all?
Maybe, haha, but either way we should be more robust to broken clients.
Yeah sure. Ok then, I will fix my client, then have a look at how riemann handles client connections and see if I can submit a PR for it. I will leave the issue open for the moment as a reminder.
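A minimal sketch of that client-side fix, assuming Riemann's TCP framing of a 4-byte big-endian length prefix followed by a protobuf Msg; `send-and-drain` is a made-up name, and the message bytes are assumed to be encoded elsewhere:

```clojure
(import '[java.io DataInputStream DataOutputStream]
        '[java.net Socket])

(defn send-and-drain
  "Write one already-encoded Msg (a byte array) and consume the server's ack
   so that Riemann's response buffer for this connection cannot fill up."
  [^Socket sock ^bytes msg-bytes]
  (let [out (DataOutputStream. (.getOutputStream sock))
        in  (DataInputStream. (.getInputStream sock))]
    (.writeInt out (alength msg-bytes))          ; length prefix
    (.write out msg-bytes 0 (alength msg-bytes)) ; the encoded Msg
    (.flush out)
    (let [ack-len (.readInt in)                  ; ack length prefix...
          ack     (byte-array ack-len)]
      (.readFully in ack)                        ; ...and the ack itself
      ack)))
```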
So I modified my custom client, ensuring riemann's responses are read, and it still crashes after a while. I also bumped -Xmx to 2G and I can definitely see memory slowly accumulating in the riemann process's heap... The only other client I am using is collectd, which comes with a Riemann client. I am wondering whether this one could also forget to read responses from the server. Here is the output of
I have a full heap dump, but this is of course quite large: about 15MB bzip2-compressed. Can I attach it to this issue, or is it better to share it with you on Google Drive?
Here is the link to the compressed heap: https://drive.google.com/file/d/0BxxGw1_rva_tM1JiYTdxNm1XQmM/view?usp=sharing
I checked collectd's riemann plugin source code and it IS reading acks from riemann, so this is not the problem. I can still see the memory slowly growing over time: it was 850MB this morning, now at 1GB+ and counting... I have not yet had time to revive my java tooling to the point where I can analyze the heap dump.
So it is clear from the following snapshot that @aphyr is right: it is the accumulation of unsent messages in netty's queues that is growing memory. I don't know why this is the case though...
We should forcibly close connections that have too many outstanding acks, and log a message.
NB: early implementations of the collectd plugin did not honour these ACKs, leading to this sort of problem. You might want to make sure you're using an up-to-date collectd version (this was fixed in version 5.4.2).
Oh, I was not aware of that, thank you very much! I will upgrade our servers to use this version; there does seem to be a PPA available for ubuntu:trusty: https://launchpad.net/~collectd/+archive/ubuntu/collectd-5.5
@aphyr: Looking at riemann's source code and Netty's API, I would like to suggest the following solution: there is a writability flag on Netty channels (exposed through `isWritable`) that we could check before writing the ack, so that acks are simply not written while a client's send buffer is full.
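Roughly what that could look like (a hypothetical sketch, not Riemann's actual handler code; `maybe-ack` is a made-up name, and this assumes Netty 4, where `isWritable` flips to false once the outbound buffer passes its high-water mark):

```clojure
(import '[io.netty.channel ChannelHandlerContext])

(defn maybe-ack
  "Write the ack back to the client only while the channel is writable;
   otherwise drop it instead of queueing it in the outbound buffer."
  [^ChannelHandlerContext ctx response]
  (let [ch (.channel ctx)]
    (if (.isWritable ch)
      (.writeAndFlush ch response)
      (println "channel not writable, dropping ack"))))
```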
That seeeeems helpful, but I don't think we should accept operations if we can't ack them, nor should we just pause indefinitely--that'll just back up Riemann's recv queue. Maybe tighter buffer sizes are in order?
I understand your concern; however, according to netty/netty#1903, checking writability of the channel is the way to go to prevent the buffer from filling up and memory from overflowing. Not acking when the buffer is full seems to me the smoothest route, as it won't prevent "malformed" clients from operating while keeping memory from being clogged with unsendable ack messages. As an alternative, we could intercept writabilityChanged events in the adapter set up by gen-tcp-handler and do something there, but I don't think there is anything to do but close the channel, which seems harsh.
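For comparison, the "harsh" alternative might look roughly like this (a sketch only; `writability-watcher` is a made-up name, and it assumes Netty 4, where inbound adapters receive `channelWritabilityChanged` callbacks):

```clojure
(import '[io.netty.channel ChannelInboundHandlerAdapter])

(def writability-watcher
  (proxy [ChannelInboundHandlerAdapter] []
    (channelWritabilityChanged [ctx]
      ;; The outbound buffer crossed its high-water mark: the client is not
      ;; reading acks fast enough, so drop the connection rather than let
      ;; unsent responses accumulate on the heap.
      (when-not (.isWritable (.channel ctx))
        (.close ctx))
      (proxy-super channelWritabilityChanged ctx))))
```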
I'm OK with that. Could you run a test with a client that doesn't read acks and see if its memory still balloons?
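A sketch of such a misbehaving test client (`flood-without-reading` is a made-up name; `encoded-msg` is assumed to be a valid protobuf Msg byte array produced by an existing client library, and the socket is deliberately never read):

```clojure
(import '[java.io DataOutputStream]
        '[java.net Socket])

(defn flood-without-reading
  "Send n length-prefixed Msg frames to Riemann's TCP port without ever
   reading the acks, then watch the server's heap."
  [^String host port ^bytes encoded-msg n]
  (with-open [sock (Socket. host (int port))]
    (let [out (DataOutputStream. (.getOutputStream sock))]
      (dotimes [_ n]
        (.writeInt out (alength encoded-msg))
        (.write out encoded-msg 0 (alength encoded-msg))
        (.flush out)))))
```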
Actually I was aiming at writing a test for that :-) Can you tell me which one you are OK with, though:

- not writing the ack when the buffer is full
- closing the channel when it becomes not writable

Thanks
Hehe, I am fine with whatever prevents us from OOMing :-)
I have deployed a patched version of riemann from #640 and will watch how it behaves over the weekend, to be sure RAM is not eaten up by netty buffers anymore. |
I can confirm that RAM consumption is now flat and collectd (using the "faulty" version 5.4.0) keeps feeding data that I can see in riemann's dashboard, so I guess the patch is working :-)
Cool! And I assume this won't kill fast clients that only allow n outstanding reqs on the wire, as long as n*msgsize is smaller than the recv buffer.
I did not delve into the exact details of the implementation of …
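A back-of-the-envelope check of that concern, assuming Netty 4's default write-buffer high-water mark of 64 KiB and acks of roughly 100 bytes (both numbers are assumptions, not measurements from this setup):

```clojure
(let [high-water-mark (* 64 1024) ; assumed Netty 4 default, in bytes
      ack-size        100]        ; assumed rough size of one encoded ack
  (quot high-water-mark ack-size))
;; => 655, i.e. several hundred acks can sit unread before the channel
;; stops being writable, so modestly pipelined clients should be unaffected.
```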
I probably cried victory too early... The Riemann process's RAM use still grows; I am taking a heap dump and analyzing it now.
Hmm, this might be an artifact of GC. Maybe when a channel is closed its buffer is still retained in the old generation of the heap and does not get collected until the next major GC, which occurs infrequently because there is not much pressure on RAM. I probably need to run this for a week or so to get more assurance that my fix actually works as expected.
Well, unfortunately I still got an OOME even after my fix. I tried to take a heap dump from the process but it apparently failed somewhere in the middle. I will retry launching the riemann process with an Oracle JDK and see if it goes better...
@abailly are you trying with a "noop" config to ensure that the clients are still the problem after your fix? e.g. (streams prn) would be a good start. At -Xmx64m you will find out rather quickly if memory usage is still growing.
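A minimal sketch of such a riemann.config (default TCP listener, print every event), to be launched with -Xmx64m as suggested:

```clojure
; "noop" config: accept events over TCP with default settings and just
; print them, so any remaining memory growth points at the transport layer.
(tcp-server)
(streams prn)
```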
Thanks for the suggestion. No, I did not try that.
Definitely might be. Sorry about this early collectd write_riemann client; it's causing headaches for a few people. The stunnel scenario might be at play here also. It's going to be hard for us to act on this issue; do you want to do additional testing, or can we close it for now?
Hmm, I guess you can close it. What about the associated PR #640?
@abailly I intend to review it shortly and get it ready for a merge.
Thanks. |
Hello,
I am running riemann 0.2.9 inside a docker container with java version
Every week or so, my riemann server stops working properly following an OOME exception. I got the following stderr output:
and lots of the following lines in riemann's own logs:
I would expect riemann to simply stop working in case of OOME, but I may be misunderstanding something. What should I do to prevent this from happening? The host has 4GB of RAM and the data volume is really low (collectd stats every minute or so, plus logs from our app which amount to at most 5 messages/sec).
Thanks for any advice.
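As a stopgap while the buffering issue is investigated (not a fix for the root cause), the heap ceiling can be raised on the launch command quoted earlier in the thread, for example:

```sh
# Same invocation as the default one, with an explicit 2 GB heap cap.
java -Xmx2g -cp /usr/share/riemann/riemann.jar: riemann.bin start /etc/riemann/riemann.config
```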