Performance Related Changes #41

Merged: 23 commits from mantree/master into pyr:master on Sep 5, 2014

Conversation

mantree (Contributor) commented Jul 4, 2014

Pull request as discussed at length in this issue.

tmonk42 commented Jun 19, 2014

This series of commits looks great. Looking forward to seeing it rolled into the upstream :)

mantree commented Jul 1, 2014

A bit of a delay, but I'm back on this now :)

Wondering how to commit this stuff back, as I've made so many changes. The new stuff around ES path creation should go in, as should all the async bulk methods. I don't know whether the original author would be so happy with the move to async, though.

Tom Coupland added some commits Jul 1, 2014

Switched Cassandra bulk inserts away from hayt to a direct method: huge reduction in time spent string processing. Added a bulk insert method for ES for creating new paths. Overall, huge performance improvements again.

Rewrote the server to only close a connection if it has been idle for 1 second. Fixed the path-depth function, as it was always 1 less than the original version.
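
The path-depth fix mentioned in the second commit can be illustrated with a small sketch (a hedged sketch only, not the actual cyanite code): depth is the number of dot-separated segments, and walking the string with indexOf keeps the string processing light.

```clojure
;; Sketch, not the cyanite implementation: "vmstats.foo" has depth 2.
(defn path-depth
  "Number of segments in a dotted graphite path."
  [^String path]
  (loop [depth 1
         from  0]
    (let [i (.indexOf path "." (int from))]
      (if (neg? i)
        depth
        (recur (inc depth) (inc i))))))

;; (path-depth "vmstats")     => 1
;; (path-depth "vmstats.foo") => 2
```
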
mpenet commented Jul 3, 2014

Is the extend really necessary? The protocol extends Statement already (BatchStatement is a subclass of it: http://www.datastax.com/drivers/java/2.0/com/datastax/driver/core/BatchStatement.html) and makes it a pass-through.

https://github.com/mpenet/alia/blob/master/src/qbits/alia.clj#L207-L209
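
A sketch of mpenet's point, assuming alia 2.x over the DataStax 2.0 Java driver: a BatchStatement is itself a Statement, so alia's query protocol already passes it through and no extra extend is needed. The statements in the usage comment are placeholders.

```clojure
(require '[qbits.alia :as alia])
(import '(com.datastax.driver.core BatchStatement BatchStatement$Type
                                   SimpleStatement Statement))

;; Sketch only: build a driver-level batch and hand it straight to alia.
;; Because BatchStatement is a Statement, alia/execute accepts it as-is.
(defn execute-batch!
  [session statements]
  (let [batch (BatchStatement. BatchStatement$Type/UNLOGGED)]
    (doseq [^Statement s statements]
      (.add batch s))
    (alia/execute session batch)))

;; e.g. (execute-batch! session
;;                      [(SimpleStatement. "UPDATE metric SET ...")
;;                       (SimpleStatement. "UPDATE metric SET ...")])
```
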

mantree commented Jul 4, 2014

Good point. I must have just got carried away; removing that.

chjohnst commented Jul 4, 2014

@mantree I think your code changes the default store to devnull. We should perhaps leave this as Cassandra and let people make the change in cyanite.yaml if they want to point to devnull. Thoughts?

mantree (Contributor) commented Jul 4, 2014

I did that early on, just so I could get it starting up and get something working. It made things easier in those first 10 minutes.

That said, I don't mind either way and am happy to change it back to keep things the same.

chjohnst commented Jul 4, 2014

@mantree, the last bit of trouble I am running into with your code changes: writes are working fantastically, along with the batched inserts into ES via the async channels that were added. But have you tested retrieving data yet? I ran some tests against the current master branch and your current patch set, and I am not seeing any data being returned. If you could take a peek, it would be appreciated.

tmonk42 commented Jul 7, 2014

I also did a build from fad75b8: writes to Cassandra dropped to zero, and GETs to the Elasticsearch index "cyanite_paths" dropped to zero. When I browse the data (using Grafana), my top level includes items named 0-1,000 plus the names of the two second-level items I used to have (i.e. vmstats.foo and vmstats.bar; now foo, bar, and 1-1000 all appear at the top level). This gist shows the output I get when I hit graphite-api with /metrics/find?query=*: https://gist.github.com/tmonk42/8e5a82a698319c799883

addisonj (Contributor) commented Jul 8, 2014

Glad to see this work is ongoing.

I can also test this out in our production cluster in the next few days and give feedback. One thing that isn't clear to me (I don't have tons of time atm to grok the code) is whether the ES format changes.

Let me know if there is anything I can do to help with this. I'm super glad to see people taking this work further (and fixing my less-than-awesome first attempt!)

mantree (Contributor) commented Jul 8, 2014

Hmm, I must confess I've done little on the retrieval side of things. The format hasn't intentionally changed; the goal was just to tweak how stuff moves around, not to actually change anything.

I'll look into this today!

mantree (Contributor) commented Jul 8, 2014

OK, hopefully that data change was the problem. I haven't been able to totally confirm it, as I haven't been able to point a Graphite instance at it today.

I should be able to do that tomorrow, but if someone could give it a quick go, that'd be great.

tmonk42 commented Jul 8, 2014

Fired up 7b72c46. After kicking it around a bunch and then reading the http-kit docs, I had an aha moment and flipped my config from es-native to es-rest. Now I'm getting writes to Elasticsearch. Still no writes to Cassandra, and path lookup from ES appears off. Still digging at specifics, but it looks like the depth might be wrong, i.e. searching for "foo.*" returns "foo.*.*". Will see if I can track down exactly what's up.

tmonk42 commented Jul 8, 2014

A search for "*" at the top level is returning items at the second level. Similar behavior at deeper-level searches as well.

DEBUG [2014-07-08 20:13:26,095] New I/O worker #17 - org.spootnik.cyanite.http - got request: {:remote-addr X.X.X.X, :params {:query }, :headers {accept-encoding gzip, deflate, user-agent python-requests/2.3.0 CPython/3.4.1 Linux/2.6.32-358.6.2.el6.x86_64, accept */, host cyanite.example.com:8080}, :server-port 8080, :content-length nil, :keep-alive? true, :content-type nil, :character-encoding nil, :action :paths, :uri /paths, :server-name graphite-api.example.com, :query-string query=%2A, :body nil, :scheme :http, :request-method :get}
DEBUG [2014-07-08 20:13:26,096] New I/O worker #17 - org.spootnik.cyanite.http - query now: *
INFO [2014-07-08 20:13:26,111] New I/O worker #17 - org.spootnik.cyanite.es_path - HITS: ({:_index cyanite_paths, :_type path, :_id vmstats.test, :_score 1.0, :_source {:_id vmstats.test, :path vmstats.test, :depth 2, :tenant , :leaf false}})

mantree (Contributor) commented Jul 8, 2014

Thanks @tmonk42. I did change some of the depth handling to use less heavyweight string methods; I thought I'd got it right, but I must have missed something.

Not sure why you're not seeing anything going to Cassandra; it should be flushing the buffer every 500 millis (see the sketch below). Are you seeing any debug messages like "written batch"? That's currently logged for every request.

Which leads nicely on to me getting to the bottom of the weird lock state I got the other day: I had run out of disk space. Too much debug logging on a small disk, it seems!
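
The flush-every-500-ms batching described above is the usual core.async pattern of racing the input channel against a timeout. A minimal sketch of that pattern follows (an assumed shape, not the actual org.spootnik.cyanite.store code; write-batch! and insert-batch! are placeholders):

```clojure
(require '[clojure.core.async :as async])

;; Sketch: accumulate values from `in`, write them with write-batch! when the
;; batch is full, and flush whatever has accumulated whenever the timeout fires.
(defn batching-writer
  [write-batch! {:keys [max-batch-size flush-ms]
                 :or   {max-batch-size 500 flush-ms 500}}]
  (let [in (async/chan 10000)]
    (async/go-loop [batch []]
      (let [[v port] (async/alts! [in (async/timeout flush-ms)])]
        (if (and (= port in) (some? v))
          ;; got a value: flush if the batch is now full, else keep accumulating
          (let [batch (conj batch v)]
            (if (>= (count batch) max-batch-size)
              (do (write-batch! batch) (recur []))
              (recur batch)))
          ;; the timeout fired, or `in` was closed: flush the partial batch
          (do (when (seq batch) (write-batch! batch))
              (when (not= port in) (recur []))))))
    in))

;; usage sketch:
;; (def metrics-ch (batching-writer insert-batch! {:max-batch-size 500 :flush-ms 500}))
;; (async/>!! metrics-ch metric)
```
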

tmonk42 commented Jul 8, 2014

@mantree yup, I'm seeing log lines like the one below, but the writes aren't there (using OpsCenter to view Cassandra perf stats). Tomorrow I'll see about firing up Cassandra at debug logging level and see if anything jumps out.

DEBUG [2014-07-08 20:48:42,503] pool-9-thread-1 - org.spootnik.cyanite.store - written batch

mantree (Contributor) commented Jul 9, 2014

OK, I think that's it. At one point the depth calculation came out one too low, so I'd inc'd the depth on retrieval, but then when I fixed the depth I didn't remove the inc. This should fix the lookup problem (illustrated in the sketch below).

@tmonk42 that does sound a bit weird. The only time I saw this kind of thing was when I introduced the batching and the reduction in writes was so dramatic it made it look like nothing was happening. That's actually why I introduced the debug message. Have you tried retrieving any of your data via CQL?
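
The off-by-one can be illustrated as follows (illustrative only, reusing path-depth from the earlier sketch; this is not the actual org.spootnik.cyanite.es_path query builder). The stored documents carry a :depth field, as the HITS log above shows, so the lookup pairs a path pattern with an exact depth term, and the bug amounts to filtering on (inc (path-depth pattern)):

```clojure
(require '[clojure.string :as str])

;; Sketch of the lookup filter: a query pattern of depth N should only match
;; documents whose :depth is N, with no extra inc on the retrieval side.
(defn path-filter
  [pattern]
  {:bool
   {:must [{:regexp {:path (str/replace pattern "*" ".*")}}
           {:term   {:depth (path-depth pattern)}}]}})

;; (path-filter "*")         ;=> depth 1, i.e. only top-level paths
;; (path-filter "vmstats.*") ;=> depth 2, e.g. the vmstats.test document above
```
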

mantree (Contributor) commented Jul 9, 2014

OK, I've just got data coming through to our staging Graphite from Cyanite, so hopefully it should be working for you all now :)

tmonk42 commented Jul 9, 2014

@mantree 0e1c58b fixed up the depth problem I was seeing. I did more digging on the Cassandra end; it looks like I was having a facepalm issue. Batch writes were showing as one "write" in the graph, compared to the hundreds of individual writes before. Data is flowing :)

tmonk42 commented Jul 9, 2014

Oh, I should mention: this is handling 2.4 million metrics per minute sustained on a multinode cluster. Initial indexing of that many stats takes about 15 minutes, so I've got some tuning to do on the Elasticsearch end of things, but that's a topic for another thread. In short, this bulk-write fork rocks my socks off. I might do a little testing and tuning on the size of writes to ES and Cassandra, and it might be nice to be able to tune that via config, but that's no reason not to merge this up as-is in my book :)

chjohnst commented Jul 9, 2014

@pyr this is great news! I just did a ton of QA work on this set of patches over the last week for @mantree, and I am now able to push around 24k metrics a second on a single collector for testing. I need to get some faster machines for Cyanite and ES, but I am dispatching the writes into a 6-node Cassandra cluster and things are working out pretty well.

@tmonk42 I would like to know what sort of tuning you are doing for ES here; that would be useful for some of us, as I can see ES becoming a bit of a bottleneck when there is a lot of data coming in.

@mantree I think exposing more options for fine-tuning the timers, as well as the amount of data we batch up, as configuration options would be useful, so people can try different settings and see what works. The values you have may be good for now, but as more people use it we may find better tunings. What are your thoughts there? Is that something we want to do before a new release of Cyanite?

mantree (Contributor) commented Jul 11, 2014

Excellent stuff! Chuffed it's rocking the socks :)

Options for the batch sizes and timing sound good, but I'm wondering whether they'd really make that much difference. The flush times are only really there to stop things backing up, and the batches seem large enough without being too large for when the system is under load. It's tempting to just add options for everything, I know, but I'm not sure what real impact they'd have. I'm hoping to move more traffic over to Cyanite next week, so I can try some different batch sizes under higher load and see if there's any actual difference.

I've seen ES be the bottleneck as well. I think the full path cache (as in the other branch) in front of the first ES check is worth investing in. I'm thinking of just using a simple set, but with a size limit of some kind (sketched below); unbounded caches make me nervous.

Something that definitely needs exposing as an option is the idle connection timeout. We need this quite low, for instance, whereas most people seem to want it quite high (which will be the default).

One further improvement for multi-core boxes: I was thinking of an option, or cpus-1, for the number of go blocks performing the rollups. At the moment this is where all the CPU time is spent, and if you've got the DBs to handle the write load then getting multiple workers on that could make it even faster. I'm going to run an experiment next week to see if this idea works.

However, I'm on holiday till Tuesday now, so I won't be making any more changes till then :)
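
The "simple set with a size limit" idea can be sketched as below (an assumed shape, not code from this PR; write-path-to-es! in the usage note is a hypothetical placeholder). The point is only to skip the ES existence check for paths that have already been written, while capping memory use:

```clojure
;; Sketch of a bounded "seen paths" cache: returns true when the path was
;; already recorded (so the ES check/insert can be skipped), false when it is
;; new. Eviction is deliberately crude: once the cap is hit, start over.
(defn bounded-path-cache
  [max-size]
  (let [seen (atom #{})]
    (fn seen-path? [path]
      (if (contains? @seen path)
        true
        (do (swap! seen (fn [s]
                          (if (>= (count s) max-size)
                            #{path}
                            (conj s path))))
            false)))))

;; usage sketch:
;; (def seen? (bounded-path-cache 1000000))
;; (when-not (seen? "vmstats.foo")
;;   (write-path-to-es! "vmstats.foo"))
```
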

tmonk42 commented Jul 11, 2014

@mantree Following up on my comment from a few days ago: this works great when using es-rest, and doesn't look like it works at all when using es-native (i.e. no writes to Cassandra).

@chjohnst I was chewing on things like changing the replication type to async on path writes to Elasticsearch. I also haven't done a ton of tuning on my ES cluster, so I'm sure there's a lot more to be gained there.

pyr (Owner) commented Jul 14, 2014

Hi guys, this looks good on my side as well; I will probably get it in soon.

mantree (Contributor) commented Jul 15, 2014

I think I've fixed the es-native client now, at least for a small sample. However, it's worth noting that the implementation is very basic and doesn't do any of the batching or async'ing of requests that the HTTP version does. This is because we run our stuff in EC2, and getting the native client to work there is apparently troublesome.

So there's plenty of room for improvement in the native client :)
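
One way the native client could later pick up the same behaviour is to reuse the batching-writer helper sketched earlier (and the same core.async alias). bulk-insert-paths! below is a hypothetical stand-in, not an existing cyanite or Elastisch function; it just marks where a real native bulk call would slot in:

```clojure
;; Sketch: funnel native-client path writes through the same batching channel
;; as the HTTP version. bulk-insert-paths! is a placeholder stub.
(defn bulk-insert-paths!
  [paths]
  (println "would bulk-insert" (count paths) "paths via the native client"))

(def path-chan
  (batching-writer bulk-insert-paths! {:max-batch-size 1000 :flush-ms 500}))

;; individual path writes then become puts onto the channel:
;; (async/>!! path-chan {:path "vmstats.foo" :depth 2 :leaf true :tenant ""})
```
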

chjohnst commented Jul 16, 2014

@tmonk42 Yeah, what I am seeing is a bit of a hold-up from ES when doing writes, but also reads. It could also be that I am using a single node that's running on old kit alongside Cassandra. This is a POC cluster we built, so I will be looking at all-new equipment for this project.

Tom Coupland added some commits Jul 16, 2014

pyr (Owner) commented Jul 23, 2014

@mantree would you be willing to get the PR back into a mergeable state?

mantree (Contributor) commented Jul 23, 2014

Yep, it's there. I'm at home at the mo and not in a position to give it a test, but I'll find the time tomorrow to check that I got it right.

As an aside, I'm thinking about submitting a proposal to talk about these changes, in particular the core.async stuff, at FPDays, a functional programming conference in London. Just an idea, as I saw the CFP closes this Friday. First and most importantly, would you, @pyr, be happy for me to do that? It's your project after all; message me directly if you have concerns. Others, as a sample of people likely to be at the conference, would you think it something of interest?

Tom Coupland added some commits Jul 24, 2014

pyr (Owner) commented Jul 24, 2014

@mantree I'll merge shortly, and it's a great idea to speak at FPDays; I'll see if I can make the trip for the talk if it gets accepted.

Cheers,
pyr

pyr added a commit that referenced this pull request Sep 5, 2014

Merge pull request #41 from mantree/master
Performance Related Changes

pyr merged commit 0f7958d into pyr:master on Sep 5, 2014

1 check passed

continuous-integration/travis-ci: The Travis CI build passed