Extremely slow gets during insert workload #1820

Closed
danielmewes opened this issue Jan 6, 2014 · 17 comments

@danielmewes
Member

@underrun reports simple point gets taking up to 6 minutes. However, this is not consistent; apparently the time for a get varies between 16ms and 6 minutes. He has an insert workload running in the background.

I'll try to reproduce it. The only idea I have off the top of my head is that it is some kind of locking or starvation issue.

Here is the IRC log for details:

12:41 <underrun> hey guys ... i just did a .get on a table with about 80M docs and it took almost 6 minutes ...
12:43 <underrun> the swing is 16ms to 6 minutes ... which seems pretty bad
13:38 <neumino> Did it go in swap?
13:38 <underrun> no
13:39 <underrun> i have swappiness set to 0
13:39 <underrun> and 100GB of free ram
13:39 <underrun> rethink is using ~160 GB
13:40 <underrun> i have 4 shards running on the same box assigned to my 4 different numa nodes
13:41 <danielmewes> underrun: Just to clarify: By .get you mean a point get of a single document?
13:41 <danielmewes> 6 minutes sounds like something is really really wrong
13:42 <underrun> yes
13:42 <danielmewes> Also: If you get the same document multiple times in a row, do you still get some slow gets or is only the first one ever slow?
13:42 <underrun> subsequent gets are also slow
13:43 <underrun> sometimes getting a doc more than once will be fast the first time and slow subsequently
13:43 <underrun> it also seemed transient as it is now more responsive
13:43 <danielmewes> I see. Doesn't seem to be a (out-of) cache issue then
13:43 <underrun> but for about an hour there it was very non-optimal
13:44 <underrun> i looked at the query profile in the data explorer
13:44 <danielmewes> non-optimal is clearly a euphemism ;-)
13:44 <underrun> heh
13:45 <danielmewes> Which client are you using btw? Is it running on the same machine as the server?
13:46 <underrun> i'm using the data explorer from my box and my server is in a datacenter
13:46 <underrun> so js client
13:46 <underrun> not on the same machine
13:47 <danielmewes> ok
13:47 <danielmewes> Hmm...
13:48 <underrun> it looks like the profile says "Perform read." took the time. but none of its sub tasks took very long at all. "perform read on shard." took 77ms which is still higher than i would like but certainly more reasonable than 6 minutes.
13:48 <danielmewes> I'm just looking for clues in the dark here, so please excuse me asking lots of odd questions: Does the web interface in general feel fast?
13:48 <danielmewes> Oh ok
13:49 <danielmewes> Never mind about that then
13:49 <underrun> yes
13:49 <danielmewes> So the profile does account for the time, which means that it is spent on the server/cluster side
13:49 <danielmewes> perform read on shard would be the actual read operation
13:49 <danielmewes> The rest I believe is transmitting the result to whichever node you are connected to
13:49 <underrun> interesting
13:50 <underrun> i'm inserting data through and reading from the same node
13:50 <danielmewes> Were any backfilling or other background tasks going on after you started up the cluster?
13:51 <underrun> no - i was on vacation last week and i left it up
13:51 <underrun> so its been idle for many days
13:51 <underrun> but i did start a job that is inserting data using soft durability
13:53 <danielmewes> Ok and that was ongoing while you tested the gets?
13:53 <underrun> yes
13:54 <danielmewes> Can you tell me a bit more about the insert task? How many concurrent clients were you using (if any)? Did you use batched inserts? If yes, how many documents per batch?
13:54 <underrun> 8 threads inserting batches of 100 docs where each doc is about 0.5k
13:55 <underrun> plus or minus ~200 bytes
13:55 <danielmewes> Ok thanks
13:55 <danielmewes> And are your tables stored on rotational disks or SSDs?
13:55 <underrun> ssd
13:55 <danielmewes> Hmm ok
13:56 <underrun> raid 0
13:56 <danielmewes> I thought I was onto something, but no ;-)
13:56 <underrun> 4 disks
13:57 <danielmewes> So what you saw definitely should not happen. Unless there is a problem with your hardware or network or something, this is probably a bug in our server. I have no real clue yet what might cause this.
13:57 <danielmewes> underrun, I'll try to re-create your scenario on a test server and see if I can reproduce the issue
13:58 <danielmewes> How many servers do you have in your cluster?
13:58 <underrun> cool
13:59 <underrun> 1 physical box with 16 cores (4 numa nodes) run 4 instances of rethinkdb with 256GB RAM
13:59 <underrun> amd hardware if that matters
14:00 <underrun> using numactl to limit each rethinkdb instance to a single node
14:00 <danielmewes> Ok thanks for the info. Any special file systems or special system configuration?
14:00 <danielmewes> Oh ok hmm
14:00 <underrun> ext4
14:00 <danielmewes> Wonder if that might confuse our internal threading logic somehow. I will test it
14:01 <underrun> table has 32GB table cache
14:01 <danielmewes> RethinkDB 1.11?
14:01 <underrun> yes
14:01 <underrun> 1.11.2
14:02 <danielmewes> Ok one (I hope) last thing, just to make sure: Could you check the memory utilization of the RethinkDB server instances?
14:02 <danielmewes> (top or htop output is sufficient)
14:02 <underrun> each server is using ~42GB
14:02 <danielmewes> ok
14:02 <underrun> plus or minus a couple hundred MB
14:04 <underrun> i am always inserting to and reading from a single instance
14:05 <underrun> this is how i'm running the instances - for the node I'm inserting to i'm using default ports: numactl -N 0 -m 0 rethinkdb -n dbeval1 -c 4 --bind all --pid-file /space/rethinkdb/rethinkdb1.pid --log-file /space/rethinkdb/log/rethinkdb1.log --canonical-address 127.0.0.1 --daemon -d /space/rethinkdb/data1
14:05 <underrun> other nodes look like this: numactl -N 1 -m 1 rethinkdb -n dbeval2 -c 4 --bind all --no-http-admin -j localhost --pid-file /space/rethinkdb/rethinkdb2.pid --log-file /space/rethinkdb/log/rethinkdb2.log --canonical-address 127.0.0.1 --daemon -d /space/rethinkdb/data2 -o 10
14:06 <underrun> with different port offsets and numbers in filenames accordingly 
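
For reference, a minimal sketch of the workload described above, using the JS driver. This is not underrun's actual script: the table name, connection details, and the document id used for the point get are placeholders, and a driver version with promise support is assumed.

```js
// Rough reproduction sketch of the workload above (not underrun's actual
// script). Table name, connection details, and the document id used for the
// point get are placeholders; a JS driver version with promise support is
// assumed.
const r = require('rethinkdb');

// ~0.5 KB documents, as described in the log; ids are left to the server.
function makeDoc() {
  return {payload: 'x'.repeat(500)};
}

// One of 8 concurrent writers inserting batches of 100 docs with soft durability.
async function writer(conn) {
  for (;;) {
    const batch = Array.from({length: 100}, makeDoc);
    await r.table('eval').insert(batch, {durability: 'soft'}).run(conn);
  }
}

// Times point gets against the same table while the inserts are running.
async function reader(conn, knownId) {
  for (;;) {
    const start = Date.now();
    await r.table('eval').get(knownId).run(conn);
    console.log(`get took ${Date.now() - start} ms`);
  }
}

async function main() {
  const conn = await r.connect({host: 'localhost', port: 28015});
  for (let i = 0; i < 8; i++) writer(conn);  // 8 concurrent insert "threads"
  await reader(conn, 'some-known-id');       // placeholder document id
}

main();
```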
@ghost ghost assigned danielmewes Jan 6, 2014
@danielmewes
Member Author

underrun added the profiler output of a slow get: https://gist.github.com/underrun/68edbab57bcd4cfb4bf2

@underrun

underrun commented Jan 7, 2014

Here are two more profiler outputs using the mas_profiling branch:

slow outdated read: https://gist.github.com/underrun/ee270127ee06a3c3891a
slow not outdated read: https://gist.github.com/underrun/3b932f85923cbe401401

@danielmewes
Member Author

Quick update: There are probably several different things at play here. One of them is related to #1385. I've seen that problem (that is, the flush lock blocking things) account for delays in the range of close to a second, sometimes 2 seconds, in the scenario described here. That in itself is already too much, but it doesn't explain query times in the minutes.
Currently I'm looking into scheduler messages being delayed for too long. Internally, RethinkDB splits up query execution, as well as all internal processes, into a number of such messages. Each message can be thought of as an item on a todo list, and they are processed one at a time (per CPU core). It seems that the time between such a message being put on this "todo list" and the moment it gets processed is pretty long. The delay I could measure is only in the range of a few seconds, but these delays could very well accumulate and delay a given query for much longer. I'm now looking into what causes these scheduling delays.
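
To illustrate the effect (a JavaScript analogy only, not RethinkDB code): when all work shares one queue that is drained one item at a time, a single long-running item that never yields delays every item queued behind it, no matter how cheap those items are.

```js
// Analogy only, not RethinkDB code: one "todo list" drained one item at a
// time, like a single event queue per CPU core. A long item that never yields
// delays every item queued behind it, no matter how cheap those items are.
const queue = [];

function post(task) {
  queue.push(task);
}

function drain() {
  while (queue.length > 0) {
    const task = queue.shift();
    task();  // tasks run to completion; a slow one stalls the whole queue
  }
}

post(() => { let x = 0; for (let i = 0; i < 1e9; i++) x += i; });  // long, non-yielding work
post(() => console.log('cheap point get, but it had to wait its turn'));
drain();
```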

@underrun

This issue is happening even when I'm not inserting anything. The cluster is completely idle except for a r.db.table.get running on it, and it's still taking minutes.

@underrun

Without inserting, reads are consistently fast now... I didn't look around, but I suspect that it was slow during backfilling. It is definitely consistently fast now that it has been a while since the last insert or change to the layout of the cluster.

@danielmewes
Member Author

@underrun: Thank you for the additional information. I could also observe slow gets only while the cluster was either backfilling or running the inserts.

@danielmewes
Member Author

I finally found out where the stalls on the message queue come from. There are actually three different places that take on the order of multiple seconds to process without yielding, all in the serializer. Two of them become increasingly severe as the table size grows. They are:

  1. log_serializer_t::index_write(). This one might be difficult to break into smaller tasks. I think we can work around the issue with the new cache though: we would simply adjust the merger_serializer_t to launch a new index_write as soon as a certain number of index operations have accumulated, rather than never launching more than one at a time (a rough sketch of this batching idea follows below). With the current cache this wouldn't be effective, because the cache itself already accumulates multiple write transactions into a single huge index_write().
  2. LBA garbage collection: This one regenerates the whole on-disk LBA without yielding, as far as I can see. If the database is large, this can take a while. I think we also garbage-collect the LBA unreasonably often, but that's a different issue.
  3. LBA reconstruction: This one is definitely not the cause of this issue, but it is still an annoyance. It only affects the phase when a node with existing large tables is restarted. I've seen reconstruction times of up to 15 seconds with a table of ~60M documents, during which the process doesn't yield. I already have a fix for this.

These things definitely have to be fixed. Whether they are the sole cause of this issue remains to be seen.
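
A rough sketch of the batching idea from item 1, as a JavaScript analogy rather than RethinkDB code; the class name and threshold are made up for illustration.

```js
// Analogy only, not RethinkDB code: cap how many accumulated operations a
// single flush (think index_write) may contain, so no individual flush grows
// unboundedly large. The class name and threshold are made up for illustration.
class MergingWriter {
  constructor(flushFn, maxOpsPerFlush = 128) {
    this.flushFn = flushFn;             // stand-in for launching an index_write
    this.maxOpsPerFlush = maxOpsPerFlush;
    this.pending = [];
    this.flushing = false;
  }

  add(op) {
    this.pending.push(op);
    // Launch a new flush as soon as enough operations have accumulated,
    // rather than letting a single flush absorb an unbounded number of them.
    if (this.pending.length >= this.maxOpsPerFlush && !this.flushing) {
      this.flushLoop();
    }
  }

  async flushLoop() {
    this.flushing = true;
    while (this.pending.length > 0) {
      const batch = this.pending.splice(0, this.maxOpsPerFlush);
      await this.flushFn(batch);        // each write stays bounded in size
    }
    this.flushing = false;
  }
}
```

A real implementation would of course also need a trigger to flush leftovers below the threshold (for example on a timer or at transaction boundaries); the point here is only the size cap.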

@danielmewes
Member Author

I have fixed 2 and 3 in my branch daniel_1820_fixes.

There are still two problems left:

  1. Some issue with the master access throttling system, as @jdoliner noticed. This is probably the main cause of queries taking excessively long. I'm working on this now.
  2. One more long-running message that I haven't identified yet. It typically takes 0.5-1 seconds, which is very bad for low-latency applications.

@danielmewes
Member Author

Found the problem with the throttling system. Now looking into the remaining slow message.

@danielmewes
Member Author

LBA changes are in code review 1134 by @srh.
Throttling changes are in code review 1135 by @Tryneus.

@danielmewes
Member Author

It seems that the remaining slow message is part of the cache's writeback process. That will be gone when the new cache is merged in.

@danielmewes
Member Author

The LBA and throttling changes have been merged into next 54bab7f and will be part of RethinkDB 1.12. With these changes, I didn't get any slow gets anymore.

@srh
Contributor

srh commented Feb 12, 2014

@danielmewes, what about the index_write issue? Is there anything more to say about that? Do you still think it's true?

@danielmewes
Member Author

@srh: I think it is still a problem in theory. However, looking at the alt cache, I see that alt_memory_tracker_t puts a relatively hard limit on the number of outstanding changed blocks. That might be enough to ensure that each individual log_serializer_t::index_write() is sufficiently small for this not to be a problem.
It doesn't protect us from the case where someone (or some internal code) does a r.table('hugeTable').delete() in a single transaction. That would still generate a huge index_write() transaction as far as I can see. In that specific case we could probably just split up the transaction in the query code, though.
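
For comparison only, and separate from the server-side split suggested above: a client can already avoid issuing one huge delete by batching it. A rough sketch with the JS driver, where the batch size is arbitrary and promise support is assumed; it relies on limit() preserving the selection so that delete() can be applied to it.

```js
// Illustrative client-side batching, not the server-side fix discussed above.
// Deletes at most batchSize documents per write until the table is empty.
const r = require('rethinkdb');

async function batchedDelete(conn, batchSize) {
  for (;;) {
    const result = await r.table('hugeTable')
      .limit(batchSize)
      .delete()
      .run(conn);
    if (result.deleted === 0) break;  // nothing left to delete
  }
}
```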

@coffeemug
Contributor

As far as I know r.table('hugeTable').delete() executes via lots of individual point deletes.

@srh
Contributor

srh commented Feb 13, 2014

There exists a function erase_range in the code and it gets used for things.

@danielmewes
Member Author

It's used at least when a node used to be a replica for a shard and then loses that role. The delete also was just an example; a r.table().update() does it too.
