Large blocks of replica client output buffer could lead to psync loops and unnecessary memory usage #11666
Conversation
LGTM. Can you add some tests to cover it?
LGTM, but I'll ask @ShooterIT to have a look as well.
Done
There are a bunch of replication-related tests under tests/integration/. Shall we put this test along with those?
Not sure if this test qualifies as an integration test.
Before 7.0 we also counted the allocated memory as ClientOutputBufferMemoryUsage, but the backlog didn't share the replication buffer with replica clients, so this problem didn't exist, right? The current fix is OK, but I think we should also care about the callers of this function, such as …
Unless the client-output-buffer limits are smaller than the backlog, there's no problem, because the backlog and the replicas do not share the same buffer. The worst case, when a big key is written, would be triggering a full sync.
We could have a separate calculation for client eviction, but that would be overkill. IMO this fix is good enough.
> Unless the client-output-buffer limits are smaller than the backlog, there's no problem, because the backlog and the replicas do not share the same buffer. The worst case, when a big key is written, would be triggering a full sync.
This configuration doesn't make sense, so we have this in place (i.e. the effective replica buffer limit is the bigger of the two):
Lines 3705 to 3714 in eef29b6
```c
/* Note that it doesn't make sense to set the replica clients output buffer
 * limit lower than the repl-backlog-size config (partial sync will succeed
 * and then replica will get disconnected).
 * Such a configuration is ignored (the size of repl-backlog-size will be used).
 * This doesn't have memory consumption implications since the replica client
 * will share the backlog buffers memory. */
size_t hard_limit_bytes = server.client_obuf_limits[class].hard_limit_bytes;
if (class == CLIENT_TYPE_SLAVE && hard_limit_bytes &&
    (long long)hard_limit_bytes < server.repl_backlog_size)
    hard_limit_bytes = server.repl_backlog_size;
```
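To make the effect concrete, here is a tiny standalone illustration (not Redis code; the variable names merely mirror the snippet above): with a 1 MB backlog and a replica hard limit configured to 512 KB, the effective hard limit becomes 1 MB.

```c
#include <stdio.h>

int main(void) {
    long long repl_backlog_size = 1024 * 1024; /* repl-backlog-size: 1 MB */
    long long hard_limit_bytes = 512 * 1024;   /* configured replica hard limit */

    /* A replica obuf limit below the backlog size is ignored and the
     * backlog size is used instead, mirroring the logic quoted above. */
    if (hard_limit_bytes && hard_limit_bytes < repl_backlog_size)
        hard_limit_bytes = repl_backlog_size;

    printf("effective hard limit: %lld bytes\n", hard_limit_bytes); /* 1048576 */
    return 0;
}
```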
I'm sorry, only now have I had enough time to dive into this issue and think about it.
If memory consumption of a single node is a concern, we probably shouldn't allocate big nodes. But limiting the node size introduces a new problem: a backlog holding a partial key is useless, because no replica is going to need it and the memory will be wasted. Let's say we limit the node size to 16KB and the backlog to 1MB, and then write a 2MB key: the backlog will not contain useful replication data (this problem does exist today, since a backlog node can still hold a partial key, but normally alongside other keys). Another option would be to include the actual allocated memory size in the replica memory usage calculation but exclude the actual backlog size (which won't be freed anyway); that way the worst case would be a full sync rather than a psync loop.
Even when the backlog was a plain pre-allocated cyclic buffer, it could be that its beginning represented the end of a previous command and was useless; I don't think we want to worry about that. But with the new mechanism we could be aiming for a certain backlog size and miss it by a lot because the nodes are too big, affecting our trimming granularity. I could try to argue that any excess memory can be attributed to the backlog rather than the replicas, and use that excuse to avoid disconnecting them, but the fact is that the backlog config must be set lower than the replicas' obuf limit (I posted the code that enforces that in a previous comment). Do you guys think that just limiting the node size to a reasonable scale (maybe backlog size / 10 or such) can be a reasonable solution to all these issues?
Why limit the node size to backlog/10, rather than keep it simple and limit the node size to the default node size?
The min value of repl_backlog_size is 16k, which will be problematic in that case.
Why?
If we use …
@sundb you're referring to a case where the backlog size is set lower than 20k? Anyway, I don't think we need that level of granularity. So the question remains: will that solve all the problems, or do we have concerns that this will not solve?
This can happen when a key almost equal to or larger than the client output buffer limit of the replica is written. Example:
1. DB is empty
2. Backlog size is 1 MB
3. Client output buffer limit is 2 MB
4. Client writes a 3 MB key
5. The shared replication buffer will have a single node which contains the key written above, and it exceeds the backlog size.
At this point the client output buffer usage calculation will report the replica buffer to be 3 MB (or more) even after sending all the data to the replica. The primary drops the replica connection for exceeding the limits; the replica reconnects and successfully executes a partial sync, but the primary will drop the connection again because the buffer usage is still 3 MB. This happens over and over. To mitigate the problem, this fix limits the maximum size of a single backlog node to (repl_backlog_size/16). This way a single node can't exceed the limits of the COB (the COB has to be larger than the backlog).
I've limited the node size to backlog_size/16.
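For illustration, here is a minimal standalone sketch of that capping idea (this is not the actual feedReplicationBuffer code; the struct and constant names merely mirror Redis, and list handling, backlog bookkeeping and error handling are omitted):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PROTO_REPLY_CHUNK_BYTES (16 * 1024) /* Redis' normal reply chunk size */

typedef struct replBufBlock {
    size_t size, used;
    struct replBufBlock *next;
    char buf[];
} replBufBlock;

static size_t max_sz(size_t a, size_t b) { return a > b ? a : b; }
static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Append `len` bytes as a chain of nodes capped at backlog_size/16
 * (but never below the normal chunk size); returns the chain head. */
static replBufBlock *feed(const char *p, size_t len, size_t backlog_size) {
    replBufBlock *head = NULL, **tail = &head;
    size_t limit = max_sz(backlog_size / 16, PROTO_REPLY_CHUNK_BYTES);
    while (len > 0) {
        size_t size = min_sz(max_sz(len, PROTO_REPLY_CHUNK_BYTES), limit);
        replBufBlock *o = malloc(sizeof(*o) + size);
        o->size = size;
        o->used = min_sz(len, size); /* a node may end up partially filled */
        o->next = NULL;
        memcpy(o->buf, p, o->used);
        *tail = o;
        tail = &o->next;
        p += o->used;
        len -= o->used;
    }
    return head;
}

int main(void) {
    size_t key_len = 3 * 1024 * 1024; /* the "3 MB key" from the example */
    char *payload = calloc(1, key_len);
    int nodes = 0;
    for (replBufBlock *o = feed(payload, key_len, 1024 * 1024); o; o = o->next)
        nodes++;
    printf("3 MB write split into %d nodes (max 64 KB each)\n", nodes); /* 48 */
    return 0;
}
```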
@xbasel it took me some time to re-read the discussion and remember what this PR is about. I went over the unresolved comments and marked them all as resolved, since they're no longer relevant.
@xbasel due to the indentation change of a mass of code, it was quite impossible to review the PR in GH, and I ended up cloning it for a local review. I was trying to find a solution to that indentation problem, to avoid massive blame-log destruction (difficulty for future reviewers, and possibly more merge conflicts than necessary), but other than using the notorious … Other than that, I did some other cleanups:
Since I had already cloned the code and run the tests locally, I also did some cleanup changes in the tests:
Let me know if you see anything wrong.
lgtm |
It seems that some CI jobs failed due to this PR.
enjoy-binbin@2cbbf63 The reason, I guess, is that we update repl_backlog->histlen first, so in some cases we may trigger a trim in incrementalTrimReplicationBacklog. I'm not sure about it, but it looks like master_repl_offset and repl_backlog->histlen needed to be handled inside the while loop.
@enjoy-binbin Yes, we shouldn't have updated …
It doesn't really matter when we update …
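To illustrate what's being discussed (a standalone sketch assuming a simplified three-node split; not the actual Redis code): if master_repl_offset is advanced by the whole payload before the loop and then stamped onto each node, every node except the last gets a wrong repl_offset.

```c
#include <stdio.h>

int main(void) {
    long long master_repl_offset = 100; /* replication offset before this write */
    long long used[3] = {10, 10, 10};   /* a write split into three 10-byte nodes */

    /* Buggy order: bump the global offset for the whole payload up front,
     * then stamp every node with it -- only the last node ends up correct. */
    long long buggy = master_repl_offset + 30;

    /* Correct order: advance a running offset as each node is filled, so each
     * node's repl_offset marks the last byte that node actually holds. */
    long long run = master_repl_offset;
    for (int i = 0; i < 3; i++) {
        run += used[i];
        printf("node %d: buggy repl_offset=%lld correct=%lld\n", i, buggy, run);
    }
    return 0;
}
```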
Odd, my fix did not fix the CI failures; it still fails with the same problem. I will take another look later.
The problem, or at least one of them, is about NO_MALLOC_USABLE_SIZE, easy to reproduce locally with: …
Perhaps this is due to the overhead of the replBufBlock. The following is the last log I printed; I can see that jemalloc allocates 216 bytes more, so it's jemalloc: …
With libc (make MALLOC=libc CFLAGS=-DNO_MALLOC_USABLE_SIZE REDIS_CFLAGS='-Werror'): …
```c
size_t size = min(max(len, (size_t)PROTO_REPLY_CHUNK_BYTES), (size_t)server.repl_backlog_size / 16);
```
In this test the backlog size is 10k, so backlog / 16 is smaller than PROTO_REPLY_CHUNK_BYTES. This seems to solve the problem, at least for this specific pair of tests:
```diff
- size_t size = min(max(len, (size_t)PROTO_REPLY_CHUNK_BYTES), (size_t)server.repl_backlog_size / 16);
+ size_t limit = max((size_t)server.repl_backlog_size / 16, (size_t)PROTO_REPLY_CHUNK_BYTES);
+ size_t size = min(max(len, (size_t)PROTO_REPLY_CHUNK_BYTES), limit);
```
I suppose it's just a combination of a very low backlog size in that test, and some thresholds of that test that get violated because of the relatively high overhead of replBufBlock.
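A quick standalone check of that arithmetic (assuming the test's 10k backlog and Redis' 16k PROTO_REPLY_CHUNK_BYTES): backlog/16 is only 640 bytes, so the original formula capped every node at 640 bytes, and the per-node replBufBlock overhead then dominates; the adjusted formula falls back to normal 16 KB nodes.

```c
#include <stdio.h>

#define PROTO_REPLY_CHUNK_BYTES ((size_t)(16 * 1024))
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

int main(void) {
    size_t backlog = 10 * 1024; /* the test's tiny repl-backlog-size */
    size_t len = 100;           /* a small write */

    /* Original: cap at backlog/16 = 640, which is BELOW the chunk size. */
    size_t old_size = MIN(MAX(len, PROTO_REPLY_CHUNK_BYTES), backlog / 16);

    /* Fixed: never let the cap drop below the normal chunk size. */
    size_t limit = MAX(backlog / 16, PROTO_REPLY_CHUNK_BYTES);
    size_t new_size = MIN(MAX(len, PROTO_REPLY_CHUNK_BYTES), limit);

    printf("old node size: %zu, fixed node size: %zu\n", old_size, new_size);
    /* prints: old node size: 640, fixed node size: 16384 */
    return 0;
}
```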
@oranagra @enjoy-binbin Should we add tests to cover enjoy-binbin@2cbbf63?
I am testing it; it seems to work.
Yeah, I'll give it a try, but maybe not in time.
In #11666, we added a while loop that splits a big reply node into multiple nodes, and the update of tail->repl_offset may be wrong. Before #11666 we would have created at most one new reply node; now we create multiple nodes for a big reply node, and the tail->repl_offset of all the nodes except the last one is incorrect, because we update master_repl_offset at the beginning and then use it to set each tail->repl_offset. At the same time, the calculation of size was adjusted to fix the failed tests. We suppose it's just a combination of a very low backlog size in those tests, and some thresholds that get violated because of the relatively high overhead of replBufBlock. This size fix was provided by Oran.
```
*** [err]: slave buffer are counted correctly in tests/unit/maxmemory.tcl
Expected 814808 < 500000 && 814808 > -500000 (context: type eval line 80 cmd {assert {$delta < $delta_max && $delta > -$delta_max}} proc ::test)

*** [err]: replica buffer don't induce eviction in tests/unit/maxmemory.tcl
Expected [::redis::redisHandle147 dbsize] == 100 (context: type eval line 77 cmd {assert {[$master dbsize] == 100}} proc ::test)

*** [err]: All replicas share one global replication buffer in tests/integration/replication-buffer.tcl
Expected '1153280' to be equal to '1148928' (context: type eval line 18 cmd {assert_equal $repl_buf_mem [s mem_total_replication_buffers]} proc ::test)
```
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: sundb <sundbcn@gmail.com>
In #11666, we added a while loop that splits a big reply node into multiple nodes. The update of tail->repl_offset may be wrong: before #11666 we would have created at most one new reply node, and now we create multiple nodes for a big reply node, so the tail->repl_offset of all the nodes except the last one is incorrect, because we update master_repl_offset at the beginning and then use it to set each tail->repl_offset. This would have led to an assertion during PSYNC; a test was added to validate that case. Besides that, the calculation of size was adjusted to fix tests that failed due to a combination of a very low backlog size and some thresholds that get violated because of the relatively high overhead of replBufBlock. So now, if backlog size / 16 is too small, we take PROTO_REPLY_CHUNK_BYTES instead.
Co-authored-by: Oran Agra <oran@redislabs.com>
Large blocks of replica client output buffer could lead to psync loops and unnecessary memory usage (#11666)
This can happen when a key almost equal to or larger than the client output buffer limit of the replica is written. Example:
1. DB is empty
2. Backlog size is 1 MB
3. Client output buffer limit is 2 MB
4. Client writes a 3 MB key
5. The shared replication buffer will have a single node which contains the key written above, and it exceeds the backlog size.
At this point the client output buffer usage calculation will report the replica buffer to be 3 MB (or more) even after sending all the data to the replica. The primary drops the replica connection for exceeding the limits; the replica reconnects and successfully executes a partial sync, but the primary will drop the connection again because the buffer usage is still 3 MB. This happens over and over.
To mitigate the problem, this fix limits the maximum size of a single backlog node to (repl_backlog_size/16). This way a single node can't exceed the limits of the COB (the COB has to be larger than the backlog). It also means that if the backlog has some excessive data it can't trim, it would be at most about 6% overuse.
Other notes:
1. A loop was added in feedReplicationBuffer which caused a massive LOC change due to indentation; the actual changes are just the `min(max` and the loop.
2. An unrelated change in an existing test to speed up a server termination which took 10 seconds.
Co-authored-by: Oran Agra <oran@redislabs.com>
(cherry picked from commit 7be7834)