This repository has been archived by the owner on Sep 2, 2020. It is now read-only.

Very divergent performance angled fork #40

Closed
tcoupland opened this issue Jul 1, 2014 · 29 comments

Comments

@tcoupland
Contributor

I've got a very divergent fork going that I've been working on for some time to improve performance. Some highlights from the changes are:

  • ES path lookup now queries for complete paths before breaking them down for creation (rough sketch below).
  • ES requests are now batched and use http-kit for async requests.
  • Cassandra updates are now done in asynchronous prepared batches.
  • String processing has been improved.
  • Removed Lamina and moved it all over to core.async.

I'd like to know what the appetite is for taking this rather large change set wholesale. That would obviously be simplest for me, but I appreciate that the original author might not want that. If not, then I'll see if I can separate out some bits to give back.
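
To give a flavour of the path lookup point above, here's a rough sketch of the breakdown step involved (illustrative only, not the fork's actual es_path code): a path is only split into the prefixes that need creating when the complete path isn't already known.

;; Rough sketch of the path breakdown involved (illustrative only,
;; not the fork's actual es_path code).
(require '[clojure.string :as str])

(defn path-prefixes
  "For \"a.b.c\" return (\"a\" \"a.b\" \"a.b.c\")."
  [path]
  (let [parts (str/split path #"\.")]
    (map #(str/join "." (take % parts))
         (range 1 (inc (count parts))))))

;; (path-prefixes "collectd.host1.cpu.idle")
;; => ("collectd" "collectd.host1" "collectd.host1.cpu" "collectd.host1.cpu.idle")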

@tcoupland
Contributor Author

Realised that saying I've 'improved performance' with no evidence is not right, so here are a couple of screenshots.

This is Cyanite running on its own EC2 m3.large instance. The changeover was performed by stopping the application, dropping in the HEAD uberjar and restarting.

[image: casandraqueriespersecond]

Here you can see the rate of inserts against Cassandra change when I replaced an instance of my fork with the current HEAD.

[image: oldversioncpuusage]

YourKit showing the CPU usage of the current HEAD.

[image: newversioncpuusage]

This is the same view, but with the fork running. No changes to the applications submitting metrics were made.

@chjohnst

chjohnst commented Jul 1, 2014

I like the idea of batched writes. Is this tunable, as every workload can be different?

I would also be happy to test this code. I am working on trying to increase performance and I am finding I can't get to 20k without falling behind. My next approach is to use haproxy to try and leverage several instances.

@tcoupland
Contributor Author

The Cassandra batches are set to 500 at the moment, but they use a custom channel that flushes a batch every 500 millis. The ElasticSearch batches are 1000 and flush every second. Hopefully the flushing should mean it won't need much tuning. See here for the channel: https://github.com/mantree/cyanite/blob/master/src/org/spootnik/cyanite/util.clj#L35
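
For anyone curious, the shape of that channel is roughly this (a simplified sketch, not the actual util.clj code linked above): items accumulate until either the batch size is hit or the flush timeout fires.

;; Simplified sketch of a count-or-time flushing batcher
;; (not the actual util.clj code).
(require '[clojure.core.async :as async :refer [go-loop chan alts! >! timeout close!]])

(defn batching-chan
  "Reads items from `in` and writes vectors of up to `max-size` items to the
   returned channel, flushing whatever has accumulated every `flush-ms`."
  [in max-size flush-ms]
  (let [out (chan)]
    (go-loop [batch [] t (timeout flush-ms)]
      (let [[v port] (alts! [in t])]
        (cond
          ;; input channel closed: flush what's left and stop
          (and (= port in) (nil? v))
          (do (when (seq batch) (>! out batch))
              (close! out))

          ;; timeout fired: flush and start a new window
          (= port t)
          (do (when (seq batch) (>! out batch))
              (recur [] (timeout flush-ms)))

          ;; got an item: add it, flush if the batch is full
          :else
          (let [batch (conj batch v)]
            (if (>= (count batch) max-size)
              (do (>! out batch)
                  (recur [] (timeout flush-ms)))
              (recur batch t))))))
    out))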

Would be really interested in seeing if you get enough of an improvement. Of course this is a prototype; performance has been my main objective, so correctness may have suffered. I've really only tried to change the flow of data and not any of the actual work done, so hopefully it will just work. If not, just let me know and I'm sure I can fix it up.

@tmonk42

tmonk42 commented Jul 1, 2014

Actually, any way to merge this work with #36? Bulk ops to Cassandra/ES AND an in-memory path cache?

@tcoupland
Contributor Author

I have got a cache in there, but instead of storing the full path it stores the first few elements of paths. It's set to depth 1 at the moment while I got the batches working, but it could be ramped up. It only gets interrogated after checking that the full path already exists though; then paths get broken up in the usual way, the cache filters out the first few elements before the missing elements are found and, finally, the bulk inserts are performed.

There's no reason that the cache couldn't be put in front of the initial full path check though; it makes good sense, repetition levels are extreme after all. Looking through that code though, I can't see any eviction policy on the cache, just a continuously growing set. We've got a lot of different metrics and I'd worry about not having some control over its size.
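
On the size worry: something like clojure.core.cache's LRU cache would at least bound it. A rough sketch of the idea (illustrative only, not code from either branch):

;; Rough sketch of a size-bounded path cache (illustrative only, not code
;; from either branch), using clojure.core.cache's LRU implementation.
(require '[clojure.core.cache :as cache])

(def path-cache
  ;; keep at most 100k known paths, evicting the least recently used
  (atom (cache/lru-cache-factory {} :threshold 100000)))

(defn seen-path?
  "True if `path` is already in the cache; records it either way."
  [path]
  (let [hit? (cache/has? @path-cache path)]
    (swap! path-cache
           (fn [c]
             (if (cache/has? c path)
               (cache/hit c path)
               (cache/miss c path true))))
    hit?))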

@pyr
Owner

pyr commented Jul 1, 2014

I'm happy to review and merge a PR @mantree, great work!

@chjohnst

chjohnst commented Jul 1, 2014

@mantree I tested your latest set of commits, and I think the TCP listener is not working as expected. All of my apps will open the socket on port 2003 and expect to hold it open happily while they send thousands of metrics a second continuously all day. I am seeing a TCP reset get sent from my cyanite node after a very short amount of time. Can you review your TCP code to see if something is broken here?

@tcoupland
Contributor Author

@chjohnst thanks for giving it a whirl. This is probably a hangover from the initial problem of hanging connections we had with our setup, so I made Netty close connections as quickly as possible once it has received a complete message. There's a good chance this was a red herring for us though; I'll have a play with this and let you know if I get anywhere.

@pyr excellent! I've got a bit more time for this, so I'll double-check stuff a bit today, sand off some of the really rough edges and get the connection handling improved.

@chjohnst

chjohnst commented Jul 2, 2014

That red herring would be a showstopper for most carbon protocol apps since they are all long-running connections; collectd is an example of that. I'll poke at the code as well to see what I can find. My Java/Clojure is not so good though.

@tcoupland
Contributor Author

Well, I think it's just a hangover from our initial problem. I've got a fix I'm just playing with now.

You can find the problem pretty easily by looking in tcp.clj: spot the (.close ctx) after it thinks it's got a whole message :)
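
For anyone reading along, the offending pattern is roughly this (simplified, not the literal tcp.clj code): the handler closes the connection as soon as it has read one complete message, which is exactly wrong for clients that hold the socket open and stream metrics.

;; Simplified illustration of the problem (not the literal tcp.clj code).
(import 'io.netty.channel.ChannelInboundHandlerAdapter)

(defn eager-close-handler [on-message]
  (proxy [ChannelInboundHandlerAdapter] []
    (channelRead [ctx msg]
      (on-message msg)
      ;; the culprit: fine for one-shot senders, fatal for long-lived
      ;; carbon connections that keep streaming metrics
      (.close ctx))))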

@chjohnst

chjohnst commented Jul 2, 2014

Yeah, I see what happens if I remove the .close, pretty nasty.

@tcoupland
Contributor Author

Just pushed a new version of the TCP server that should work a bit better here. I was trying to do too much before; this time I've just used Netty components. There is now an idle timeout set to 1 second, which should clear out anything that doesn't work in the streaming fashion.

This seems to be working OK in our scenario at least, although there are now minor CPU spikes every ~15 seconds that I'm trying to look into. If you can safely give it a test, that'd be grand.

@chjohnst

chjohnst commented Jul 2, 2014

Just gave it a try and I'm getting some errors. Want me to post the output?

What are you using to test this?

@tcoupland
Contributor Author

Well, I give it a little try locally with JMeter, just to flush out any silly errors. Then I drop it into our staging environment, where we have a subset of our metrics being posted.

Sure, post the output!

@chjohnst

chjohnst commented Jul 2, 2014

Try running this script

#!/usr/bin/env python

import sys, os, time
from socket import socket
from random import random

try:
    host = sys.argv[1]
    port = int(sys.argv[2])
    mpm = int(sys.argv[3])
except:
    print 'Usage: %s host port metrics-per-minute' % os.path.basename(sys.argv[0])
    sys.exit(1)

s = socket()
s.connect((host, port))
now = int(time.time())
now -= now % 60

while True:
    start = time.time()
    count = 0
    for i in xrange(0, mpm):
        metric = 'TEST.%d' % i
        value = random()
        s.sendall('%s %s %s\n' % (metric, value, now))
        count += 1

    print 'sent %d metrics in %.3f seconds' % (count, time.time() - start)
    now += 60

@chjohnst

chjohnst commented Jul 2, 2014

Here are the errors:

DEBUG [2014-07-02 11:36:45,612] pool-9-thread-1 - org.spootnik.cyanite.store - written batch
INFO [2014-07-02 11:36:45,618] async-dispatch-56 - org.spootnik.cyanite.carbon - Exception for metric [ ] : #
INFO [2014-07-02 11:36:45,619] async-dispatch-56 - org.spootnik.cyanite.carbon - Exception for metric [ ] : #
INFO [2014-07-02 11:36:45,623] async-dispatch-56 - org.spootnik.cyanite.carbon - Exception for metric [ ] : #
INFO [2014-07-02 11:36:45,627] async-dispatch-56 - org.spootnik.cyanite.carbon - Exception for metric [ ] : #
INFO [2014-07-02 11:36:45,637] async-dispatch-56 - org.spootnik.cyanite.carbon - Exception for metric [ ] : #
INFO [2014-07-02 11:36:45,641] async-dispatch-56 - org.spootnik.cyanite.carbon - Exception for metric [ ] : #
INFO [2014-07-02 11:36:45,642] async-dispatch-56 - org.spootnik.cyanite.carbon - Exception for metric [ ] : #
INFO [2014-07-02 11:36:45,643] async-dispatch-56 - org.spootnik.cyanite.carbon - Exception for metric [ ] : #
INFO [2014-07-02 11:36:45,644] async-dispatch-56 - org.spootnik.cyanite.carbon - Exception for metric [ ] : #
INFO [2014-07-02 11:36:45,645] async-dispatch-56 - org.spootnik.cyanite.carbon - Exception for metric [ ] : #
WARN [2014-07-02 11:36:46,752] nioEventLoopGroup-3-16 - io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.lang.AssertionError: Assert failed: No more than 1024 pending puts are allowed on a single channel. Consider using a windowed buffer.
(< (.size puts) impl/MAX-QUEUE-SIZE)
at clojure.core.async.impl.channels.ManyToManyChannel.put_BANG_(channels.clj:95)
at clojure.core.async$put_BANG_.invoke(async.clj:136)
at org.spootnik.cyanite.tcp$build_handler_factory$fn__37$fn__38.invoke(tcp.clj:40)
at org.spootnik.cyanite.tcp.proxy$io.netty.channel.ChannelInboundHandlerAdapter$ff19274a.channelRead(Unknown Source)
at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:341)
at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:327)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
at org.spootnik.cyanite.tcp.proxy$io.netty.handler.codec.ByteToMessageDecoder$ff19274a.channelRead(Unknown Source)
at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:341)
at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:327)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
.
.
.
.
WARN [2014-07-02 14:13:11,155] nioEventLoopGroup-2-1 - io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
at io.netty.channel.socket.nio.NioServerSocketChannel.doReadMessages(NioServerSocketChannel.java:135)
at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:68)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
WARN [2014-07-02 14:13:12,159] nioEventLoopGroup-2-1 - io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
at io.netty.channel.socket.nio.NioServerSocketChannel.doReadMessages(NioServerSocketChannel.java:135)
at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:68)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
WARN [2014-07-02 14:13:13,161] nioEventLoopGroup-2-1 - io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
at io.netty.channel.socket.nio.NioServerSocketChannel.doReadMessages(NioServerSocketChannel.java:135)
at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:68)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
DEBUG [2014-07-02 14:13:13,446] client-worker-1 - org.spootnik.cyanite.es_path - Failed bulk update with: nil
DEBUG [2014-07-02 14:13:13,446] client-worker-1 - org.spootnik.cyanite.es_path - Failed bulk update with: nil
DEBUG [2014-07-02 14:13:13,446] client-worker-1 - org.spootnik.cyanite.es_path - Failed bulk update with: nil
DEBUG [2014-07-02 14:13:13,447] client-worker-1 - org.spootnik.cyanite.es_path - Failed bulk update with: nil
DEBUG [2014-07-02 14:13:13,447] client-worker-1 - org.spootnik.cyanite.es_path - Failed bulk update with: nil
DEBUG [2014-07-02 14:13:13,447] client-worker-1 - org.spootnik.cyanite.es_path - Failed bulk update with: nil
DEBUG [2014-07-02 14:13:13,447] client-worker-1 - org.spootnik.cyanite.es_path - Failed bulk update with: nil

@tcoupland
Contributor Author

Interesting.

"Exception for metric [ ] : #" looks like empty strings are making it into the formatparser. That's easy to fix.

"Assert failed: No more than 1024 pending puts are allowed on a single channel. Consider using a windowed buffer." means that the channel from TCP into the formatparser has become flooded and it is spilling metrics. Making it 'windowed' here would mean metrics being silently dropped. This is a sign that it's overloaded in its current configuration. I noticed something this afternoon in the profiler and I'm thinking of changing it to hand over its puts to into-chan (a new go block really). This would free it up.
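
For reference, the 'windowed' buffers core.async offers both shed load silently once full, which is why I'd rather apply backpressure than use them here:

;; The buffering options core.async offers (for reference, not a proposed change).
(require '[clojure.core.async :as async])

;; fixed buffer: once it's full, put! calls queue up as pending puts, and
;; more than 1024 of those triggers the assertion in the log above
(def fixed-chan    (async/chan 1024))

;; dropping buffer: new items are silently discarded once full
(def dropping-chan (async/chan (async/dropping-buffer 1024)))

;; sliding buffer: the oldest items are silently discarded once full
(def sliding-chan  (async/chan (async/sliding-buffer 1024)))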

"java.io.IOException: Too many open files" speaks for itself! Bashed into the file handle limit.

Essentially it looks like you pushed it too hard and that's its noisy way of saying back off. I can't see any pause in that script (the indentation was lost in the comment), but it looks to me like it would just send messages in a single-minded loop.

How long did it take for this to happen? What kind of server is Cyanite on? What's the file handle limit on that server?

I'll give this script a go tomorrow too, see if I can reproduce and fix.

@chjohnst

chjohnst commented Jul 2, 2014

@mantree

The too-many-open-files issue was fairly easy to solve: I adjusted the ulimit params and it went away for some time. The biggest issue I'm seeing is that long-lasting TCP connections just get disconnected very quickly with a TCP reset.

The script does a sendall, so it will wait until all of the messages have been sent and then carry on with the loop. Usually if I run it against graphite carbon or the stable master branch of cyanite, we just end up blocking for a period of time (sometimes 12s with a setting of 20k message rate). With your new code we never get blocked and I start seeing the errors I posted from Java.

This can happen with as few as 1000 messages. The server is a 6-core Westmere, not the latest but it's what I have for testing. The file handle limit I eventually bumped up to 65k.

@tcoupland
Contributor Author

OK, I think I've got this licked.

I was using the wrong async function in the Netty code to get onto the channels: put! instead of >!!, thinking that put! was blocking. >!! means that when the server gets flooded, clients are held back.

I've set the read timeout handler to 30 seconds, which keeps it out of the way of the test for me. I think its existence needs to become a configuration element, defaulting to it not being there. Hopefully 30 seconds should keep it out of the way, but if you find connections getting closed, just delete line 55 in tcp.clj.
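
Roughly what those two changes look like (a simplified sketch, not the literal tcp.clj diff): the blocking put holds the Netty read path back when the channel is full, and the read-timeout handler is the pipeline entry that probably wants to become configurable.

;; Simplified sketch of the two changes described above (not the literal tcp.clj code).
(require '[clojure.core.async :refer [>!!]])
(import '(io.netty.channel ChannelInboundHandlerAdapter ChannelPipeline)
        '(io.netty.handler.timeout ReadTimeoutHandler))

(defn backpressure-handler
  "Uses a blocking >!! so that when `metric-chan` is full, flooded clients
   are held back instead of overflowing the pending-put queue."
  [metric-chan]
  (proxy [ChannelInboundHandlerAdapter] []
    (channelRead [ctx msg]
      (>!! metric-chan msg))))

(defn add-handlers
  "The 30 second read timeout closes quiet connections; this is the bit
   that should probably become a configuration element."
  [^ChannelPipeline pipeline metric-chan]
  (doto pipeline
    (.addLast "read-timeout" (ReadTimeoutHandler. 30))
    (.addLast "handler" (backpressure-handler metric-chan))))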

So, using the Python soak test with my code I get the following CPU usage:

[image: cyanitenew_soak2000] https://cloud.githubusercontent.com/assets/422106/3470784/9d3b0e1e-02bd-11e4-84c2-f1bf3237325d.png

and the output from the soaker is:

python soakTest.py localhost 2020 20000
sent 20000 metrics in 0.105 seconds
sent 20000 metrics in 0.134 seconds
sent 20000 metrics in 0.131 seconds
sent 20000 metrics in 0.098 seconds
sent 20000 metrics in 0.074 seconds
sent 20000 metrics in 0.102 seconds
sent 20000 metrics in 3.481 seconds
sent 20000 metrics in 1.922 seconds
sent 20000 metrics in 0.213 seconds
sent 20000 metrics in 0.366 seconds
sent 20000 metrics in 0.376 seconds
sent 20000 metrics in 0.475 seconds
sent 20000 metrics in 3.647 seconds
sent 20000 metrics in 5.198 seconds
sent 20000 metrics in 3.045 seconds
sent 20000 metrics in 2.433 seconds
sent 20000 metrics in 2.348 seconds
sent 20000 metrics in 5.864 seconds
sent 20000 metrics in 2.903 seconds
sent 20000 metrics in 2.749 seconds
sent 20000 metrics in 2.766 seconds
sent 20000 metrics in 2.454 seconds
sent 20000 metrics in 5.538 seconds
sent 20000 metrics in 2.123 seconds
sent 20000 metrics in 2.829 seconds
sent 20000 metrics in 4.987 seconds
sent 20000 metrics in 3.024 seconds
sent 20000 metrics in 2.606 seconds
sent 20000 metrics in 2.226 seconds
sent 20000 metrics in 4.447 seconds
sent 20000 metrics in 2.769 seconds
sent 20000 metrics in 2.563 seconds

It then settles down into that kind of pattern.

Swapping in the current HEAD I get:

[image: cyanitehead_soak20000] https://cloud.githubusercontent.com/assets/422106/3470805/c331a6f0-02bd-11e4-9b74-b48b75092f1b.png

And soaker output:

python soakTest.py localhost 2020 20000 !11772
sent 20000 metrics in 0.105 seconds
sent 20000 metrics in 0.135 seconds
sent 20000 metrics in 0.123 seconds
sent 20000 metrics in 0.085 seconds
sent 20000 metrics in 0.083 seconds
sent 20000 metrics in 265.909 seconds
sent 20000 metrics in 458.526 seconds
sent 20000 metrics in 428.706 seconds
sent 20000 metrics in 873.841 seconds
sent 20000 metrics in 449.697 seconds
sent 20000 metrics in 431.110 seconds
sent 20000 metrics in 862.906 seconds
sent 20000 metrics in 454.394 seconds
sent 20000 metrics in 867.787 seconds
sent 20000 metrics in 430.354 seconds
sent 20000 metrics in 466.546 seconds

Really interesting, that.

I've pushed the code, so give it a whirl. I'm going to leave mine overnight under high load to see if there are any memory gremlins; then, if it looks good, I'll create the pull request.

@chjohnst

chjohnst commented Jul 3, 2014

Great! I'll try this in a bit!

@tcoupland
Contributor Author

Think I've bashed into a thread-death problem. After 40 minutes of having 5 instances of the script hitting Cyanite, everything stopped getting processed. There were a lot of timeouts talking to ElasticSearch, and somehow these have killed the requesting threads of both http-kit and alia, so the whole thing's now dead in the water.

Going to start it back up with fewer soakers, but there's clearly a problem with very heavy sustained load. It must be something in the way http-kit is getting used; there's an issue on that project discussing something similar, but I can't see how I've ended up there. Need to dig into why it affected both the writers though, that seems very weird.

Be interested to know if you encounter the same problem.

@chjohnst

chjohnst commented Jul 3, 2014

@mantree

We are definitely getting closer, but it didn't take long for me to get cyanite to start acting up. I am firing a single instance at cyanite with about 20000 metrics (same as usual) and things started to fall apart pretty badly after about 30 seconds. Looks like the problem is now in the es_path code, where I start getting errors about "Failed bulk update with: nil". It seems to be saying it about the client-worker though. Is this an actual error to be concerned about?

DEBUG [2014-07-03 14:14:32,275] pool-9-thread-1 - org.spootnik.cyanite.store - written batch
DEBUG [2014-07-03 14:14:32,275] async-dispatch-46 - org.spootnik.cyanite.es_path - Full Paths Fnd 0 , creating 1000
DEBUG [2014-07-03 14:14:32,290] pool-9-thread-1 - org.spootnik.cyanite.store - written batch
DEBUG [2014-07-03 14:14:32,298] async-dispatch-31 - org.spootnik.cyanite.es_path - Fnd 0 , creating 1000
DEBUG [2014-07-03 14:14:32,298] async-dispatch-68 - org.spootnik.cyanite.es_path - Full Paths Fnd 0 , creating 1000
DEBUG [2014-07-03 14:14:32,310] async-dispatch-69 - org.spootnik.cyanite.es_path - Fnd 0 , creating 1000
DEBUG [2014-07-03 14:14:32,319] pool-9-thread-1 - org.spootnik.cyanite.store - written batch
DEBUG [2014-07-03 14:14:32,322] async-dispatch-43 - org.spootnik.cyanite.es_path - Full Paths Fnd 0 , creating 1000
DEBUG [2014-07-03 14:14:32,322] async-dispatch-71 - org.spootnik.cyanite.es_path - Fnd 0 , creating 1000
DEBUG [2014-07-03 14:14:32,322] client-worker-1 - org.spootnik.cyanite.es_path - Failed bulk update with: nil
DEBUG [2014-07-03 14:14:32,340] client-worker-1 - org.spootnik.cyanite.es_path - Failed bulk update with: nil
DEBUG [2014-07-03 14:14:32,342] async-dispatch-60 - org.spootnik.cyanite.es_path - Fnd 0 , creating 1000
DEBUG [2014-07-03 14:14:32,344] pool-9-thread-1 - org.spootnik.cyanite.store - written batch
DEBUG [2014-07-03 14:14:32,350] async-dispatch-32 - org.spootnik.cyanite.es_path - Fnd 0 , creating 1000
DEBUG [2014-07-03 14:14:32,354] async-dispatch-27 - org.spootnik.cyanite.es_path - Full Paths Fnd 0 , creating 1000
DEBUG [2014-07-03 14:14:32,358] async-dispatch-60 - org.spootnik.cyanite.es_path - Fnd 0 , creating 1000
DEBUG [2014-07-03 14:14:32,365] pool-9-thread-1 - org.spootnik.cyanite.store - written batch
DEBUG [2014-07-03 14:14:32,377] pool-9-thread-1 - org.spootnik.cyanite.store - written batch
DEBUG [2014-07-03 14:14:32,378] client-worker-1 - org.spootnik.cyanite.es_path - Failed bulk update with: nil
DEBUG [2014-07-03 14:14:32,382] async-dispatch-33 - org.spootnik.cyanite.es_path - Fnd 0 , creating 1000
DEBUG [2014-07-03 14:14:32,386] async-dispatch-60 - org.spootnik.cyanite.es_path - Full Paths Fnd 0 , creating

@chjohnst

chjohnst commented Jul 3, 2014

OK, I think my source of trouble with es_rest was an overloaded server. I moved ElasticSearch off to another node and things are behaving better. Will continue testing, but the user in my company here is quite happy with the results in the streaming of data he is sending. Now to just get things to graph up in graphite-api. Nice work so far, I think this is definitely a lot more scalable.

@tcoupland
Contributor Author

Excellent!

It does seem that something goes awry when ElasticSearch starts struggling. Those 'nils' are the status in the response from ES. Not sure why I thought only logging the status was enough... I've changed that now to log the full response map.

I just did the bare minimum with http-kit to see if this approach would work, so it's not too surprising it's a source of trouble. I'll take a look tomorrow at its configuration, see if I can't make the client behave a bit better under load.

@pyr
Owner

pyr commented Jul 3, 2014

Maybe it would be wise to test with a dummy Pathstore to eliminate ES from the picture, for instance:

(defn dummy-pathstore
  []
  (reify Pathstore
    (register [this tenant path])
    (prefixes [this tenant path] [])
    (search [this tenant path] [])))

@tcoupland
Contributor Author

I can't get it to fall into the stuck state again, even with my ES cluster playing up a bit. It ran happily overnight and there's no sign of any memory problems.

I'll leave it over the weekend with many requesters running and see if it happens again, but I think it might be hard to reproduce. If you're happy @pyr, I'll create the pull request.

@pyr
Owner

pyr commented Jul 4, 2014

Yes, please do!

@chjohnst

chjohnst commented Jul 4, 2014

@mantree @pyr I am testing both scenarios here; we have a steady stream of metrics that will flow at a rate of about 20-30k messages a second, so I want to make sure I am stressing both code paths. Thankfully I have a small cluster of ES nodes to send the data to, and that solved that issue. We had a stream running overnight that suddenly stopped at around midnight, but my other streams from collectd continued running just fine; I think my user may have shut down his metric stream. The recent code change is definitely a significant improvement though.

One more feature that would be nice is some configurable options for how much we should batch up for each async writer and/or how often it should attempt to push the batch into Cassandra.

@pyr
Owner

pyr commented Sep 5, 2014

Thanks for your awesome work @mantree

@pyr pyr closed this as completed Sep 5, 2014