indexList() extremely slow under load #2071

Closed
danielmewes opened this issue Mar 7, 2014 · 27 comments

@danielmewes (Member)

New title: Everything becomes extremely slow under backfilling + sindex creation

While re-testing #1389, I noticed that executing r.table().indexList() took about 30 seconds while the cluster was under load (here: backfilling + creating indexes + some insert queries). indexStatus() was even worse, taking up to 2 minutes!

@danielmewes danielmewes added this to the 1.12-polish milestone Mar 7, 2014
@danielmewes danielmewes self-assigned this Mar 7, 2014
@danielmewes (Member Author)

OK, the original issue title was wrong (new title above).
Almost every query (including r.table().limit(1), but not r.expr(1)) became extremely slow, and I even got a heartbeat timeout between nodes again.

@danielmewes (Member Author)

Very low CPU utilization, moderate I/O.

@coffeemug (Contributor)

This is an issue that had gone away but was reintroduced by the merge of the new cache, correct? In that case it should be in 1.12, not 1.12-polish. A regression like this is really important to fix.

@coffeemug coffeemug modified the milestones: 1.12, 1.12-polish Mar 7, 2014
@danielmewes danielmewes modified the milestone: 1.12 Mar 7, 2014
@danielmewes (Member Author)

Yes, though I don't know whether it was reintroduced or had been there for longer. I originally thought this only affected indexList(), but it affects other things as well.
Moving to 1.12.

@danielmewes (Member Author)

Oh and two nodes just crashed:

info: Connected to server "electro1" fa8a36ac-324a-4f04-9128-6e8983e81187
info: Connected to server "magneto1" aba3f0df-a534-4e62-b5c5-873c96c634f6
info: Disconnected from server "electro1" fa8a36ac-324a-4f04-9128-6e8983e81187
info: Connected to server "electro1" fa8a36ac-324a-4f04-9128-6e8983e81187
Version: rethinkdb foo (GCC 4.6.3)
error: Error in src/rdb_protocol/lazy_json.cc at line 19:
error: Guarantee failed: [res == archive_result_t::SUCCESS] Deserialization of rdb value failed with error archive_result_t::SOCK_EOF.
error: Backtrace:
error: Thu Mar  6 17:30:29 2014

       1: .persist/rethinkdb() [0x9fb6c3] at 0x9fb6c3 ()
       2: .persist/rethinkdb() [0x9f34c4] at 0x9f34c4 ()
       3: .persist/rethinkdb() [0x90c200] at 0x90c200 ()
       4: .persist/rethinkdb() [0x918fc7] at 0x918fc7 ()
       5: .persist/rethinkdb() [0x459c98] at 0x459c98 ()
       6: .persist/rethinkdb() [0x457e52] at 0x457e52 ()
       7: .persist/rethinkdb() [0x457d28] at 0x457d28 ()
       8: .persist/rethinkdb() [0x457c94] at 0x457c94 ()
       9: .persist/rethinkdb() [0x43168e] at 0x43168e ()
error: Exiting.

That might just be a case where we don't handle disconnects (due to heartbeat timeouts) properly, and could be unrelated to the performance issues.

@danielmewes (Member Author)

Opened a new issue for the crash: #2073

@danielmewes (Member Author)

The reason for the slow queries appears to be transaction throttling in the cache.

Not sure how that could result in a heartbeat timeout though (it might be that the node was actually crashing as in #2073).
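
For illustration, here is a minimal sketch of how a cache-level dirty-block throttle can stall write transactions. This is not RethinkDB's actual cache code; the class name and the idea of a single global dirty-block budget are assumptions based on this thread.

    // Illustrative only -- not the RethinkDB cache implementation.
    #include <condition_variable>
    #include <mutex>

    class dirty_block_throttle_t {
    public:
        explicit dirty_block_throttle_t(int limit) : limit_(limit), dirty_(0) { }

        // A write transaction calls this each time it dirties a block.  If the
        // budget has been consumed (e.g. by backfill + sindex postconstruction),
        // the caller blocks here until the flusher has written something back.
        void acquire() {
            std::unique_lock<std::mutex> lock(mutex_);
            cond_.wait(lock, [this] { return dirty_ < limit_; });
            ++dirty_;
        }

        // The flusher calls this after a dirty block has been written to disk.
        void release() {
            std::lock_guard<std::mutex> lock(mutex_);
            --dirty_;
            cond_.notify_one();
        }

    private:
        std::mutex mutex_;
        std::condition_variable cond_;
        const int limit_;
        int dirty_;
    };

With a single global budget like this, even a small write transaction has to queue behind the bulk writers once the limit is reached.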

@danielmewes (Member Author)

Secondary index postconstruction also does some things that could interact very badly with the cache's throttling.

I will look into this issue once #2073 is closed, but I expect it to be comparatively easy to fix.
Well, except for the heartbeat timeouts, for which I still don't have an explanation.
The only thing I can imagine is that something queues up due to the throttling and the node stops processing network messages. That seems very unlikely though. Hmm...

@danielmewes (Member Author)

Yeah, OK. Some messages just hang in the message queue for way too long. I'm not sure why yet.

@danielmewes (Member Author)

I did a few more tests, and part of the problem (and possibly the cause of the heartbeat timeouts in this very write-intensive scenario) is #265. I measured ftruncate calls taking multiple seconds, in one case more than 14 seconds.
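
For reference, a standalone sketch of how an ftruncate() call can be timed. The file path and size are placeholders; on its own this won't reproduce the multi-second stalls, which only showed up under heavy concurrent write load.

    // Sketch: time a single ftruncate() call.
    #include <chrono>
    #include <cstdio>
    #include <fcntl.h>
    #include <unistd.h>

    int main() {
        // Placeholder file; the real measurements were against the server's data files.
        int fd = open("/tmp/ftruncate_probe", O_CREAT | O_RDWR, 0644);
        if (fd < 0) { std::perror("open"); return 1; }

        auto start = std::chrono::steady_clock::now();
        // Grow the file by several GB (size arbitrary for this example).
        if (ftruncate(fd, 1ll << 32) != 0) {
            std::perror("ftruncate");
        }
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start).count();

        std::printf("ftruncate took %lld ms\n", static_cast<long long>(ms));
        close(fd);
        return 0;
    }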

@srh: This means that we have to do CR 1214 for 1.12.

With that part fixed, simple queries still take an extraordinary amount of time during backfilling + sindex postconstruction. It doesn't seem to be due to overly high message queue latencies though. It could be a throttling / locking issue now...

@danielmewes (Member Author)

With #265 fixed, I'm not seeing heartbeat timeouts anymore.

@danielmewes (Member Author)

A partial fix is in code review 1335 by @AtnNn.

The impact on latency of simple operations under this workload is still highly unsatisfactory though: with those patches it is down from more than 1 minute to ~4 seconds.

More work is needed here.

@coffeemug (Contributor)

@danielmewes -- out of curiosity, why did you make @AtnNn the reviewer instead of @srh? (I'm not sure how @AtnNn can review code he isn't very familiar with.)

@danielmewes (Member Author)

@coffeemug: @AtnNn mentioned he was interested in reading the secondary index code, and these changes aren't in the cache. Apart from that, my decision was based on the fact that @srh has other open issues to work on for 1.12.

@coffeemug (Contributor)

Ah thanks -- just wondering.

@danielmewes (Member Author)

It seems that @srh wants to do the review after all. Reassigned.

@danielmewes (Member Author)

The partial fix is in next as of 5359930.

@danielmewes (Member Author)

Just a quick status update: this is still pretty bad. It was already bad in 1.11 (simple queries took on the order of 1-2 seconds there), but it's worse in next (2-15 seconds).

@srh (Contributor) commented Mar 21, 2014

What if we increase the dirty block throttling limit to, say, a much larger value?

@srh (Contributor) commented Mar 21, 2014

It's not a carefully thought-out value right now -- 200 is particularly low, considering how quickly that many blocks can be dumped to disk.

@danielmewes (Member Author)

@srh: I had previously experimented with that a little. Removing the throttling solves the issue (until the process runs out of memory), but increasing it to, for example, 2000 doesn't change all that much. I think it just takes a little longer to hit the limit, and from there on it's basically the same.
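
To spell out the intuition with assumed numbers (not measurements): if the workload dirties 5,000 blocks per second and the flusher retires 4,000 blocks per second, a limit of 200 is reached after 0.2 seconds and a limit of 2,000 after 2 seconds. From then on every writer is paced by the flusher in both cases, so raising the limit only delays the onset of the latency problem rather than removing it.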

I have a theory of why this is so slow. Let me test that first...

@danielmewes (Member Author)

@srh: Basically my hypothesis is that we have to do exactly this (from btree_store.cc):

    // SRH: Should we be able to "get in line" for the mutex and release the sindex
    // block or something?
    // JD: That could be a good optimization but I don't think there's any reason
    // we need to do it as part of the new cache. There's also an argument to
    // be made that this mutex could just go away and having write access to
    // the sindex block could fill the same role.

I'm now replacing the mutex with a fifo_enforcer to test whether that improves things...
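
To make the "get in line" idea concrete, here is a generic sketch in plain C++; it is not RethinkDB's fifo_enforcer or new_mutex_t. The caller reserves a place in a FIFO line without blocking, releases whatever else it holds (conceptually, the sindex block), and only then waits for its turn.

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    class in_line_mutex_t {
    public:
        class ticket_t {
        public:
            // Blocks until this ticket reaches the front of the line.
            void wait() {
                std::unique_lock<std::mutex> lock(parent_->mutex_);
                parent_->cond_.wait(lock, [this] {
                    return parent_->now_serving_ == number_;
                });
            }
            // Hands the mutex to the next ticket in line.
            void release() {
                std::lock_guard<std::mutex> lock(parent_->mutex_);
                ++parent_->now_serving_;
                parent_->cond_.notify_all();
            }
        private:
            friend class in_line_mutex_t;
            in_line_mutex_t *parent_;
            uint64_t number_;
        };

        // Reserves a place in line without blocking the caller.
        ticket_t get_in_line() {
            std::lock_guard<std::mutex> lock(mutex_);
            ticket_t t;
            t.parent_ = this;
            t.number_ = next_ticket_++;
            return t;
        }

    private:
        std::mutex mutex_;
        std::condition_variable cond_;
        uint64_t next_ticket_ = 0;
        uint64_t now_serving_ = 0;
    };

The point is what happens between get_in_line() and wait(): because the caller isn't blocked inside a mutex acquisition while still holding the sindex block, other writers can make progress in the meantime.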

@srh (Contributor) commented Mar 21, 2014

Or replace it with a new_mutex_t?

@danielmewes (Member Author)

Oh right! I forgot that we had that. That's actually easier. :-)

@danielmewes (Member Author)

That alone wasn't enough. A couple of different optimizations and parameter adjustments were required to get rid of the latency issues. The short version is that backfills and sindex postconstruction now get throttled if the cache can't flush their data fast enough.

Implemented in branch daniel_2071_2.

In code review 1350 by @srh.
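
For the record, a rough sketch of the throttling idea described above. This is not the code from daniel_2071_2; the names and the single soft limit are illustrative. Bulk writers such as backfill and sindex postconstruction pause between batches whenever the cache has more unflushed work than a soft limit, so they back off instead of starving interactive queries.

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>

    class flush_backpressure_t {
    public:
        explicit flush_backpressure_t(std::size_t soft_limit)
            : soft_limit_(soft_limit) { }

        // Bulk writers call this before every batch of writes and wait until
        // the flusher has caught up enough.
        void throttle_bulk_writer() {
            std::unique_lock<std::mutex> lock(mutex_);
            cond_.wait(lock, [this] { return unflushed_ < soft_limit_; });
        }

        // The cache calls these as blocks are dirtied and flushed.
        void on_block_dirtied() {
            std::lock_guard<std::mutex> lock(mutex_);
            ++unflushed_;
        }
        void on_block_flushed() {
            std::lock_guard<std::mutex> lock(mutex_);
            if (unflushed_ > 0) { --unflushed_; }
            cond_.notify_all();
        }

    private:
        std::mutex mutex_;
        std::condition_variable cond_;
        const std::size_t soft_limit_;
        std::size_t unflushed_ = 0;
    };

Unlike the global throttle sketched earlier, only the bulk producers wait here, so small interactive operations are no longer stuck in the same line.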

@danielmewes (Member Author)

indexList() is still occasionally slow, but other non-sindex-related queries are OK now (and much faster in this scenario than in 1.11).

@danielmewes (Member Author)

Closed by a4e7edc
