Very divergent performance angled fork #40
Realised that saying I've 'improved performance' with no evidence is not right, so here are a couple of screenshots. This is Cyanite running on its own EC2 m3.large instance. The changeover was performed by stopping the application, dropping in the HEAD uberjar and restarting. Here you can see the rate of inserts against Cassandra change when I replaced an instance of my fork with the current HEAD.

YourKit showing the CPU usage of current HEAD.

This is the same view, but with the fork running.

No changes to the applications submitting metrics were made.
I like the idea of batched writes. Is this tunable, as every workload can be different? I would also be happy to test this code; I am working on trying to increase our metric throughput.
The Cassandra batches are set to 500 at the moment, but they use a custom channel that flushes a batch every 500 millis. Elasticsearch batches are 1000 and flush every second. Hopefully the flushing should mean it won't need much tuning. See here for the channel: https://github.com/mantree/cyanite/blob/master/src/org/spootnik/cyanite/util.clj#L35 Would be really interested in seeing whether you get enough of an improvement. Of course this is a prototype; performance has been my main objective, so correctness may have suffered. I've really only tried to change the flow of data and not any of the actual work done, so hopefully it will just work. If not, just let me know and I'm sure I can fix it up.
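For readers who want to see the shape of such a flush-on-size-or-timeout channel, here is a minimal core.async sketch. It is illustrative only and assumes nothing about the real util.clj code beyond what the comment above describes; the names batching-chan, max-size and flush-ms are invented for the example.

(require '[clojure.core.async :as async :refer [chan go-loop alts! timeout >!]])

;; Collect items from `in` into vectors of at most `max-size`, flushing
;; whatever has accumulated every `flush-ms` milliseconds.
(defn batching-chan
  [in max-size flush-ms]
  (let [out (chan)]
    (go-loop [batch [] t (timeout flush-ms)]
      (let [[v port] (alts! [in t])]
        (cond
          ;; input closed: flush the remainder and stop
          (and (= port in) (nil? v))
          (do (when (seq batch) (>! out batch))
              (async/close! out))

          ;; got an item: emit the batch once it reaches max-size
          (= port in)
          (let [batch (conj batch v)]
            (if (>= (count batch) max-size)
              (do (>! out batch) (recur [] (timeout flush-ms)))
              (recur batch t)))

          ;; timer fired: flush whatever accumulated and restart the timer
          :else
          (do (when (seq batch) (>! out batch))
              (recur [] (timeout flush-ms))))))
    out))

With the numbers given above, the Cassandra path would be wired up roughly as (batching-chan metric-ch 500 500) and the Elasticsearch path as (batching-chan path-ch 1000 1000), where metric-ch and path-ch are stand-ins for the real input channels.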
Actually, is there any way to merge this work with #36? Bulk ops to Cassandra/ES AND an in-memory path cache?
I have got a cache in there, but instead of storing the full path it stores the first few elements of paths. It's set to depth 1 at the moment, while I got the batches working, but could be ramped up. It only gets interrogated after the check that the full path already exists; then the paths get broken up in the usual way, the cache filters out the first few elements, the missing elements are found and, finally, the bulk inserts are performed. There's no reason the cache couldn't be put in front of the initial full-path check though; it makes good sense, repetition levels are extreme after all. Looking through that code though, I can't see any eviction policy on the cache, just a continuously growing set. We've got a lot of different metrics and I'd worry about not having some control over its size.
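As a hedged illustration of how the growing prefix set could be bounded, a size-limited cache such as clojure.core.cache's LRU cache would work; the names seen-prefix?, remember-prefix! and the threshold below are hypothetical and not taken from either fork.

(require '[clojure.core.cache :as cache])

;; A size-bounded cache of already-registered path prefixes, so the set
;; cannot grow without limit; least-recently-used entries are evicted.
(def seen-prefixes
  (atom (cache/lru-cache-factory {} :threshold 100000)))

(defn seen-prefix? [prefix]
  (cache/has? @seen-prefixes prefix))

(defn remember-prefix! [prefix]
  (swap! seen-prefixes cache/miss prefix true))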
I'm happy to review and merge a PR, @mantree. Great work!
@mantree I tested your latest set of commits, and I think the TCP listener is not working as expected. All of my apps will open the socket on port 2003 and expect to hold it open happily while they send thousands of metrics a second continuously, all day. I am seeing a TCP reset sent from my Cyanite node after a very short amount of time. Can you review your TCP code to see if something is broken here?
@chjohnst thanks for giving it a whirl. This is probably a holdover from the initial problem of hanging connections we had with our setup, so I made Netty close connections as quickly as possible once it received a complete message. There is a good chance this was a red herring for us though; I'll have a play with this and let you know if I get anywhere. @pyr excellent! I've got a bit more time for this, so I'll double-check stuff a bit today, sand off some of the really rough edges and get the connection handling improved.
That red herring would be a showstopper for most carbon protocol apps.
Well, I think it's just a holdover from our initial problem. I've got a fix I'm just playing with now. You can find the problem pretty easily by looking in tcp.clj: spot the (.close ctx) after it thinks it's got a whole message :)
Yeah, I see what happens if I remove the .close, pretty nasty.
Just pushed a new version of the TCP server that should work a bit better here. I was trying to do too much before; this time I've just used Netty components. There is now an idle timeout set to 1 second, which should clear out anything that doesn't work in the streaming fashion. This seems to be working OK in our scenario at least; all that remain now are minor CPU spikes every ~15 seconds that I'm trying to look into. If you can safely give it a test, that'd be grand.
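A rough sketch of the 'just used Netty components' approach, assuming the idle handling is done with Netty 4's ReadTimeoutHandler; the actual handler and pipeline wiring in tcp.clj may differ.

(import '(io.netty.channel ChannelPipeline)
        '(io.netty.handler.timeout ReadTimeoutHandler))

;; Close connections that go quiet for `timeout-seconds`, instead of
;; closing after every complete message as the earlier version did.
(defn add-read-timeout
  [^ChannelPipeline pipeline timeout-seconds]
  (.addLast pipeline "read-timeout" (ReadTimeoutHandler. (int timeout-seconds))))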
Just gave it a try and I'm getting some errors. Want me to post the output? What are you using to test this?
Well, I gave it a little try locally with JMeter, just to flush out any silly errors. Then I'm dropping it into our staging environment, where we have a subset of our metrics being posted. Sure, post the output!
Try running this script:

#!/usr/bin/env python
import sys, os, time
from socket import socket

try:
    s = socket()
    # ... connect to the host/port given on the command line ...
    while True:
        # ... build and send `count` metric lines, timing from `start` ...
        print 'sent %d metrics in %.3f seconds' % (count, time.time() - start)
Here are the errors:

DEBUG [2014-07-02 11:36:45,612] pool-9-thread-1 - org.spootnik.cyanite.store - written batch
Interesting.

'Exception for metric [ ] : #' looks like empty strings are making it into the format parser. That's easy to fix.

'Assert failed: No more than 1024 pending puts are allowed on a single channel. Consider using a windowed buffer.' means that the channel from TCP into the format parser has become flooded and is spilling metrics. Making it 'windowed' here would mean metrics being silently dropped. This is a sign that it's overloaded in its current configuration. I noticed something this afternoon in the profiler, and I'm thinking of changing it to hand over its puts to into-chan (a new go block, really). This would free it up.

'java.io.IOException: Too many open files' speaks for itself! Bashed into the file handle limit.

Essentially it looks like you pushed it too hard and that's its noisy way of saying back off. I can't see any pause in that script (the indentation is gone in the comment), but it looks to me like it would just send msgs in a single-minded loop. How long did it take for this to happen? What kind of server is Cyanite on? What's the file handle limit on that server? I'll give this script a go tomorrow too, to see if I can reproduce and fix it.
The too many open files issue was fairly easy to solve; I adjusted the ulimit params and it went away for some time. The biggest issue I'm seeing is that long-lasting TCP connections just get disconnected very quickly with a TCP reset. The script does a sendall, so it will wait until all of the messages have been sent and then carries on with the loop. Usually if I run it against graphite carbon or the stable master branch of Cyanite we just end up blocking for a period of time (sometimes 12s with a setting of a 20k message rate). With your new code we never get blocked and I start seeing the errors I posted from Java. This can happen with as low as 1000 messages. The server is a 6-core Westmere, not the latest but it's what I have for testing. The file handle limit I eventually bumped up to 65k.
Ok, I think I've got this licked. I was using the wrong async function in the Netty code to get into the channels: I was using put! instead of >!!, thinking that put! was blocking. >!! means that when the server gets flooded, clients are held back.

I've set the read timeout handler to 30 seconds; this keeps it out of the way of the test for me. I think its existence at all needs to become a configuration element, defaulting to it not being there. Hopefully 30 seconds should keep it out of the way, but if you find connections getting closed just delete line 55 in tcp.clj.

So, using the Python soak test with my code I get the following CPU usage, and the output from the soaker is:

python soakTest.py localhost 2020 20000

It then settles down into that kind of pattern. Swapping in current HEAD I get the following, with soaker output:

python soakTest.py localhost 2020 20000

Really interesting, that. I've pushed the code, so give it a whirl. I'm going to leave mine overnight under high load to see if there are any memory gremlins, then if it looks good I'll create the pull request.
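To make the put! versus >!! difference concrete, here is a small illustrative comparison; the channel and the metric string are made up for the example.

(require '[clojure.core.async :refer [chan put! >!!]])

(def metrics-ch (chan 100))

;; put! is asynchronous: it returns immediately and queues the value when the
;; buffer is full. Once more than 1024 puts are pending, core.async throws the
;; "No more than 1024 pending puts" assertion seen in the logs above.
(put! metrics-ch "some.metric 1 1404300000")

;; >!! blocks the calling thread (here, a Netty worker) until the channel can
;; accept the value, so a flooded server pushes back on clients instead of
;; accumulating pending puts.
(>!! metrics-ch "some.metric 1 1404300000")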
Great! I'll try this in a bit!
Think I've bashed into a thread-death problem. After 40 minutes of having 5 instances of the script hitting Cyanite, everything stopped getting processed. There were a lot of timeouts talking to Elasticsearch, and somehow these have killed the requesting threads of both http-kit and alia, so the whole thing's now dead in the water. Going to start it back up with a few less soakers, but there's clearly a problem with very heavy sustained load. Must be something in the way http-kit is getting used; there's an issue on the project discussing something similar, but I can't see how I've ended up there. Need to dig into why it affected both the writers though, that seems very weird. Be interested to know if you encounter the same problem.
We are definitely getting closer, but it didn't take long for me to get Cyanite to start acting up. I am firing a single instance at Cyanite with about 20000 metrics (same as usual) and things started to fall apart pretty badly after about 30 seconds. Looks like the problem is now in the es_path code, where I start getting errors about "Failed bulk update with: nil". It seems to be saying it about the client-worker though. Is this an actual error to be concerned about?

DEBUG [2014-07-03 14:14:32,275] pool-9-thread-1 - org.spootnik.cyanite.store - written batch
Ok, I think my source of trouble with es_rest was an overloaded server. I moved Elasticsearch off to another node and things are behaving better. Will continue testing, but the user in my company here is quite happy with the results in the streaming of data he is sending. Now to just get things to graph up in graphite-api. Nice work so far; I think this is definitely a lot more scalable.
Excellent! It does seem that something goes awry when Elasticsearch starts struggling. Those 'nils' are the status in the response from ES. Not sure why I thought only logging the status was enough... I've changed that now to log the full response map. I just did the bare minimum with http-kit to see if this approach would work, so it's not too surprising it's a source of trouble. I'll take a look tomorrow at its configuration and see if I can't make the client behave a bit better under load.
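A hedged sketch of what logging the full response map can look like with http-kit; the bulk-update name and the error handling are illustrative, not the actual es_path code.

(require '[org.httpkit.client :as http]
         '[clojure.tools.logging :as log])

;; http-kit delivers a response map with :status, :headers, :body and, on
;; connection failures, an :error key holding the exception. Logging the whole
;; map (rather than just :status, which is nil on such failures) shows why a
;; bulk update failed.
(defn bulk-update [url payload]
  (http/post url {:body payload}
             (fn [{:keys [status error] :as resp}]
               (when (or error (not= 200 status))
                 (log/error "Failed bulk update with:" resp)))))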
Maybe it would be wise to test with a dummy Pathstore to eliminate ES from the picture, for instance:

(defn dummy-pathstore
  []
  (reify Pathstore
    (register [this tenant path])
    (prefixes [this tenant path] [])
    (search [this tenant path] [])))
I can't get it to fall into the stuck state again, even with my ES cluster playing up a bit. It ran happily overnight and there's no sign of any memory problems. I'll leave it over the weekend with many requesters running and see if it happens again, but I think it might be hard to reproduce. If you're happy, @pyr, I'll create the pull request.
Yes, please do!
@mantree @pyr I am testing both scenarios here; we have a steady stream of metrics that flows at a rate of about 20-30k messages a second, so I want to make sure I am stressing both code paths. Thankfully I have a small cluster of ES nodes to send the data to, and that solved that issue. We had a stream running overnight that suddenly stopped at around midnight, but my other streams from collectd continued running just fine; I think my user may have shut down his metric stream. The recent code change is definitely a significant improvement though. One more feature that would be nice is some configurable options for how much we should batch up for each async writer and/or how often it should attempt to push the batch into Cassandra.
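A sketch of what such options could look like, reusing the hypothetical batching-chan helper from the earlier sketch; the :batch-size and :flush-ms keys are illustrative and not real Cyanite configuration settings.

;; Thread hypothetical configuration values through to the writer's channel,
;; falling back to the current hard-coded Cassandra defaults.
(defn make-write-chan
  [{:keys [batch-size flush-ms] :or {batch-size 500 flush-ms 500}} in]
  (batching-chan in batch-size flush-ms))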
Thanks for your awesome work @mantree
I've got a very divergent fork going that I've been working on for some time to improve performance. Some highlights from the changes are:
I'd like to know what the appetite is for taking this rather large change set wholesale. That would obviously be simplest for me, but I appreciate that the original author might not want that. If not, then I'll see if I can separate out some bits to give back.