Large blocks of replica client output buffer could lead to psync loops and unnecessary memory usage #11666
Conversation
LGTM. Can you add some tests to cover it?
LGTM, but I'll ask @ShooterIT to have a look as well.
Done
There are a bunch of replication-related tests under tests/integration/. Shall we put this test along with those?
Not sure if this test qualifies as an integration test.
Before 7.0 we also counted the allocated memory as ClientOutputBufferMemoryUsage, but the backlog didn't share the replication buffer with replica clients, so this problem didn't exist, right? The current fix is OK, but I think we should also care about the callers of this function, such as …
Unless the client-output-buffer limits are smaller than the backlog, there's no problem, because the backlog and the replicas do not share the same buffer. The worst case, when a big key is written, would be triggering a full sync.
We could have a separate calculation for client eviction, but that would be overkill. IMO this fix is good enough.
> Unless the client-output-buffer limits are smaller than the backlog, there's no problem, because the backlog and the replicas do not share the same buffer. The worst case, when a big key is written, would be triggering a full sync.
This configuration doesn't make sense, so we have this in place (i.e. the effective replica buffer limit is the bigger of the two):
Lines 3705 to 3714 in eef29b6
```c
/* Note that it doesn't make sense to set the replica clients output buffer
 * limit lower than the repl-backlog-size config (partial sync will succeed
 * and then replica will get disconnected).
 * Such a configuration is ignored (the size of repl-backlog-size will be used).
 * This doesn't have memory consumption implications since the replica client
 * will share the backlog buffers memory. */
size_t hard_limit_bytes = server.client_obuf_limits[class].hard_limit_bytes;
if (class == CLIENT_TYPE_SLAVE && hard_limit_bytes &&
    (long long)hard_limit_bytes < server.repl_backlog_size)
    hard_limit_bytes = server.repl_backlog_size;
```
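To make the effect concrete, here is a tiny standalone illustration (not Redis code; the variable names merely mirror the snippet above): with a 1 MB backlog and a replica hard limit configured to 512 KB, the effective hard limit becomes 1 MB.

```c
#include <stdio.h>

int main(void) {
    long long repl_backlog_size = 1024 * 1024; /* repl-backlog-size: 1 MB */
    long long hard_limit_bytes = 512 * 1024;   /* configured replica hard limit */

    /* A replica obuf limit below the backlog size is ignored and the
     * backlog size is used instead, mirroring the logic quoted above. */
    if (hard_limit_bytes && hard_limit_bytes < repl_backlog_size)
        hard_limit_bytes = repl_backlog_size;

    printf("effective hard limit: %lld bytes\n", hard_limit_bytes); /* 1048576 */
    return 0;
}
```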
I'm sorry, only now have I had enough time to dive into this issue and think about it.
If memory consumption of a single node is a concern, we probably shouldn't allocate big nodes. But limiting the node size introduces a new problem: a backlog holding a partial key is useless, because no replica is going to need it and the memory will be wasted. Let's say we limit the node size to 16KB and the backlog to 1MB, and then write a 2MB key: the backlog will not contain useful replication data (this problem does exist today, since a backlog node can still hold a partial key, but normally alongside other keys). Another option would be to include the actual allocated memory size in the replica memory usage calculation but exclude the actual backlog size (which won't be freed anyway); that way the worst case would be a full sync rather than a psync loop.
Even when the backlog was a plain pre-allocated cyclic buffer, it could be that its beginning represented the end of a previous command and was useless; I don't think we want to worry about that. But with the new mechanism we could be aiming for a certain backlog size and miss it by a lot because the nodes are too big, affecting our trimming granularity. I could try to argue that any excess memory can be attributed to the backlog rather than the replicas, and use that excuse to avoid disconnecting them, but the fact is that the backlog config must be set lower than the replicas' obuf limit (I posted the code that enforces that in a previous comment). Do you guys think that just limiting the node size to a reasonable scale (maybe backlog size / 10 or such) can be a reasonable solution to all these issues?
Why limit the node size to backlog/10, rather than keep it simple and limit the node size to the default node size?
The min value of repl_backlog_size is 16k, which will be problematic in that case.
Why?
If we use …
@sundb you're referring to a case where the backlog size is set lower than 20k? Anyway, I don't think we need that level of granularity. So the question remains: will that solve all the problems, or do we have concerns that this will not solve?
This can happen when a key almost equal to or larger than the client output buffer limit of the replica is written. Example:
1. DB is empty
2. Backlog size is 1 MB
3. Client output buffer limit is 2 MB
4. Client writes a 3 MB key
5. The shared replication buffer will have a single node which contains the key written above, and it exceeds the backlog size.
At this point the client output buffer usage calculation will report the replica buffer to be 3 MB (or more) even after sending all the data to the replica. The primary drops the replica connection for exceeding the limits; the replica reconnects and successfully executes a partial sync, but the primary will drop the connection again because the buffer usage is still 3 MB. This happens over and over. To mitigate the problem, this fix limits the maximum size of a single backlog node to (repl_backlog_size/16). This way a single node can't exceed the limits of the COB (the COB has to be larger than the backlog).
I've limited the node size to backlog_size/16.
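For illustration, here is a minimal standalone sketch of that capping idea (this is not the actual feedReplicationBuffer code; the struct and constant names merely mirror Redis, and list handling, backlog bookkeeping and error handling are omitted):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PROTO_REPLY_CHUNK_BYTES (16 * 1024) /* Redis' normal reply chunk size */

typedef struct replBufBlock {
    size_t size, used;
    struct replBufBlock *next;
    char buf[];
} replBufBlock;

static size_t max_sz(size_t a, size_t b) { return a > b ? a : b; }
static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Append `len` bytes as a chain of nodes capped at backlog_size/16
 * (but never below the normal chunk size); returns the chain head. */
static replBufBlock *feed(const char *p, size_t len, size_t backlog_size) {
    replBufBlock *head = NULL, **tail = &head;
    size_t limit = max_sz(backlog_size / 16, PROTO_REPLY_CHUNK_BYTES);
    while (len > 0) {
        size_t size = min_sz(max_sz(len, PROTO_REPLY_CHUNK_BYTES), limit);
        replBufBlock *o = malloc(sizeof(*o) + size);
        o->size = size;
        o->used = min_sz(len, size); /* a node may end up partially filled */
        o->next = NULL;
        memcpy(o->buf, p, o->used);
        *tail = o;
        tail = &o->next;
        p += o->used;
        len -= o->used;
    }
    return head;
}

int main(void) {
    size_t key_len = 3 * 1024 * 1024; /* the "3 MB key" from the example */
    char *payload = calloc(1, key_len);
    int nodes = 0;
    for (replBufBlock *o = feed(payload, key_len, 1024 * 1024); o; o = o->next)
        nodes++;
    printf("3 MB write split into %d nodes (max 64 KB each)\n", nodes); /* 48 */
    return 0;
}
```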
@xbasel it took me some time to re-read the discussion and remember what this PR is about. I went over the unresolved comments and marked them all as resolved, since they're no longer relevant.
@xbasel due to the indentation change of a mass of code, it was quite impossible to review the PR in GH, and I ended up cloning it for a local review. I was trying to find a solution to that indentation problem, to avoid massive blame-log destruction (difficulty for future reviewers, and possibly more merge conflicts than necessary), but other than using the notorious … Other than that, I did some other cleanups:
Since I had already cloned the code and run the tests locally, I also did some cleanup changes in the tests:
Let me know if you see anything wrong.
lgtm |
It seems that some CI jobs failed due to this PR.
enjoy-binbin@2cbbf63 The reason, I guess, is that we update repl_backlog->histlen first, so in some cases we may trigger a trim in incrementalTrimReplicationBacklog. I'm not sure about it, but it looks like master_repl_offset and repl_backlog->histlen needed to be handled inside the while loop.
@enjoy-binbin Yes, we shouldn't have updated …
It doesn't really matter when we update …
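To illustrate what's being discussed (a standalone sketch assuming a simplified three-node split; not the actual Redis code): if master_repl_offset is advanced by the whole payload before the loop and then stamped onto each node, every node except the last gets a wrong repl_offset.

```c
#include <stdio.h>

int main(void) {
    long long master_repl_offset = 100; /* replication offset before this write */
    long long used[3] = {10, 10, 10};   /* a write split into three 10-byte nodes */

    /* Buggy order: bump the global offset for the whole payload up front,
     * then stamp every node with it -- only the last node ends up correct. */
    long long buggy = master_repl_offset + 30;

    /* Correct order: advance a running offset as each node is filled, so each
     * node's repl_offset marks the last byte that node actually holds. */
    long long run = master_repl_offset;
    for (int i = 0; i < 3; i++) {
        run += used[i];
        printf("node %d: buggy repl_offset=%lld correct=%lld\n", i, buggy, run);
    }
    return 0;
}
```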
Odd, my fix did not fix the CI failures; it still fails with the same problem. I will take another look later.
The problem, or at least one of them, is about NO_MALLOC_USABLE_SIZE, easy to reproduce locally with: …
Perhaps this is due to the overhead of the replBufBlock. The following is the last log I printed; I can see that jemalloc allocates 216 bytes more, so it's jemalloc: …
With libc (make MALLOC=libc CFLAGS=-DNO_MALLOC_USABLE_SIZE REDIS_CFLAGS='-Werror'): …
```c
size_t size = min(max(len, (size_t)PROTO_REPLY_CHUNK_BYTES), (size_t)server.repl_backlog_size / 16);
```
In this test the backlog size is 10k, so backlog / 16 is smaller than PROTO_REPLY_CHUNK_BYTES. This seems to solve the problem, at least for this specific pair of tests:
```diff
- size_t size = min(max(len, (size_t)PROTO_REPLY_CHUNK_BYTES), (size_t)server.repl_backlog_size / 16);
+ size_t limit = max((size_t)server.repl_backlog_size / 16, (size_t)PROTO_REPLY_CHUNK_BYTES);
+ size_t size = min(max(len, (size_t)PROTO_REPLY_CHUNK_BYTES), limit);
```
I suppose it's just a combination of a very low backlog size in that test, and some thresholds of that test that get violated because of the relatively high overhead of replBufBlock.
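A quick standalone check of that arithmetic (assuming the test's 10k backlog and Redis' 16k PROTO_REPLY_CHUNK_BYTES): backlog/16 is only 640 bytes, so the original formula capped every node at 640 bytes, and the per-node replBufBlock overhead then dominates; the adjusted formula falls back to normal 16 KB nodes.

```c
#include <stdio.h>

#define PROTO_REPLY_CHUNK_BYTES ((size_t)(16 * 1024))
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

int main(void) {
    size_t backlog = 10 * 1024; /* the test's tiny repl-backlog-size */
    size_t len = 100;           /* a small write */

    /* Original: cap at backlog/16 = 640, which is BELOW the chunk size. */
    size_t old_size = MIN(MAX(len, PROTO_REPLY_CHUNK_BYTES), backlog / 16);

    /* Fixed: never let the cap drop below the normal chunk size. */
    size_t limit = MAX(backlog / 16, PROTO_REPLY_CHUNK_BYTES);
    size_t new_size = MIN(MAX(len, PROTO_REPLY_CHUNK_BYTES), limit);

    printf("old node size: %zu, fixed node size: %zu\n", old_size, new_size);
    /* prints: old node size: 640, fixed node size: 16384 */
    return 0;
}
```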
@oranagra @enjoy-binbin Should we add tests to cover enjoy-binbin@2cbbf63?
I am testing it; it seems to work.
Yeah, I'll give it a try, but maybe not in time.
In #11666, we added a while loop that splits a big reply node into multiple nodes, and the update of tail->repl_offset may be wrong. Before #11666 we would have created at most one new reply node; now we create multiple nodes for a big reply node, and the tail->repl_offset of all the nodes except the last one is incorrect, because we update master_repl_offset at the beginning and then use it to set each tail->repl_offset. At the same time, the calculation of size was adjusted to fix the failed tests. We suppose it's just a combination of a very low backlog size in those tests, and some thresholds that get violated because of the relatively high overhead of replBufBlock. This size fix was provided by Oran.
```
*** [err]: slave buffer are counted correctly in tests/unit/maxmemory.tcl
Expected 814808 < 500000 && 814808 > -500000 (context: type eval line 80 cmd {assert {$delta < $delta_max && $delta > -$delta_max}} proc ::test)

*** [err]: replica buffer don't induce eviction in tests/unit/maxmemory.tcl
Expected [::redis::redisHandle147 dbsize] == 100 (context: type eval line 77 cmd {assert {[$master dbsize] == 100}} proc ::test)

*** [err]: All replicas share one global replication buffer in tests/integration/replication-buffer.tcl
Expected '1153280' to be equal to '1148928' (context: type eval line 18 cmd {assert_equal $repl_buf_mem [s mem_total_replication_buffers]} proc ::test)
```
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: sundb <sundbcn@gmail.com>
In #11666, we added a while loop that splits a big reply node into multiple nodes. The update of tail->repl_offset may be wrong: before #11666 we would have created at most one new reply node, and now we create multiple nodes for a big reply node, so the tail->repl_offset of all the nodes except the last one is incorrect, because we update master_repl_offset at the beginning and then use it to set each tail->repl_offset. This would have led to an assertion during PSYNC; a test was added to validate that case. Besides that, the calculation of size was adjusted to fix tests that failed due to a combination of a very low backlog size and some thresholds that get violated because of the relatively high overhead of replBufBlock. So now, if backlog size / 16 is too small, we take PROTO_REPLY_CHUNK_BYTES instead.
Co-authored-by: Oran Agra <oran@redislabs.com>
Large blocks of replica client output buffer could lead to psync loops and unnecessary memory usage (#11666)
This can happen when a key almost equal to or larger than the client output buffer limit of the replica is written. Example:
1. DB is empty
2. Backlog size is 1 MB
3. Client output buffer limit is 2 MB
4. Client writes a 3 MB key
5. The shared replication buffer will have a single node which contains the key written above, and it exceeds the backlog size.
At this point the client output buffer usage calculation will report the replica buffer to be 3 MB (or more) even after sending all the data to the replica. The primary drops the replica connection for exceeding the limits; the replica reconnects and successfully executes a partial sync, but the primary will drop the connection again because the buffer usage is still 3 MB. This happens over and over.
To mitigate the problem, this fix limits the maximum size of a single backlog node to (repl_backlog_size/16). This way a single node can't exceed the limits of the COB (the COB has to be larger than the backlog). It also means that if the backlog has some excessive data it can't trim, it would be at most about 6% overuse.
Other notes:
1. A loop was added in feedReplicationBuffer which caused a massive LOC change due to indentation; the actual changes are just the `min(max` and the loop.
2. An unrelated change in an existing test to speed up a server termination which took 10 seconds.
Co-authored-by: Oran Agra <oran@redislabs.com>
(cherry picked from commit 7be7834)