Large blocks of replica client output buffer could lead to psync loops and unnecessary memory usage #11666

Merged
merged 3 commits into redis:unstable on Mar 12, 2023

Conversation

xbasel
Contributor

@xbasel xbasel commented Dec 30, 2022

This can happen when a key nearly as large as, or larger than, the
client output buffer limit of the replica is written.

Example:

  1. DB is empty
  2. Backlog size is 1 MB
  3. Client output buffer limit is 2 MB
  4. Client writes a 3 MB key
  5. The shared replication buffer will have a single node which contains
    the key written above, and it exceeds the backlog size.

At this point the client output buffer usage calculation will report the
replica buffer to be 3 MB (or more) even after sending all the data to
the replica.
The primary drops the replica connection for exceeding the limits,
the replica reconnects and successfully executes partial sync but the
primary will drop the connection again because the buffer usage is still
3 MB. This happens over and over.

To mitigate the problem, this fix limits the maximum size of a single
backlog node to (repl_backlog_size/16). This way a single node can't
exceed the limits of the COB (the COB has to be larger than the
backlog).
It also means that if the backlog holds some excess data it can't trim,
the overuse is at most about 6% (1/16).
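A minimal standalone sketch of that clamping rule (illustrative only, not the actual feedReplicationBuffer() code; PROTO_REPLY_CHUNK_BYTES is Redis' 16KB default reply node size):

```c
#include <stdio.h>

#define PROTO_REPLY_CHUNK_BYTES (16*1024)  /* Redis' default reply node size */

/* Sketch of the per-node size rule described above:
 * min(max(len, PROTO_REPLY_CHUNK_BYTES), repl_backlog_size / 16). */
static size_t node_size(size_t len, size_t repl_backlog_size) {
    size_t size = len > PROTO_REPLY_CHUNK_BYTES ? len : PROTO_REPLY_CHUNK_BYTES;
    size_t cap = repl_backlog_size / 16;
    return size < cap ? size : cap;
}

int main(void) {
    /* A 3 MB write with a 1 MB backlog: each node is capped at 64 KB, so the
     * write is spread over 48 nodes instead of one 3 MB node. */
    printf("%zu\n", node_size(3*1024*1024, 1024*1024)); /* prints 65536 */
    return 0;
}
```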

other notes:

  1. a loop was added in feedReplicationBuffer, which caused a massive LOC
    change due to indentation; the actual changes are just the `min(max` and the loop.
  2. an unrelated change to an existing test to speed up a server termination which took 10 seconds.

@sundb
Collaborator

sundb commented Dec 30, 2022

LGTM. Can you add some tests to cover it?

@oranagra oranagra linked an issue Jan 1, 2023 that may be closed by this pull request
Member

@oranagra oranagra left a comment

LGTM. but i'll ask @ShooterIT to have a look as well.

@xbasel
Contributor Author

xbasel commented Jan 4, 2023

LGTM. Can you add some tests to cover it?

Done

Collaborator

@hpatro hpatro left a comment

There are a bunch of replication-related tests under tests/integration/. Shall we put this test alongside those?

@xbasel
Contributor Author

xbasel commented Jan 5, 2023

There are a bunch of replication-related tests under tests/integration/. Shall we put this test alongside those?

Not sure if this test qualifies as an integration test.

@ShooterIT
Collaborator

before 7.0, we also counted the allocated memory as ClientOutputBufferMemoryUsage, but the backlog didn't share the replication buffer with replica clients, so this problem didn't exist, right?

current fix is ok, but i think we should also look at the callers of this function, such as getClientMemoryUsage; it may not be accurate after applying this patch.

@xbasel
Contributor Author

xbasel commented Jan 7, 2023

before 7.0, we also counted the allocated memory as ClientOutputBufferMemoryUsage, but the backlog didn't share the replication buffer with replica clients, so this problem didn't exist, right?

Unless the client-output-buffer limits are smaller than the backlog, there's no problem, because the backlog and replicas do not share the same buffer. Worst case, writing a big key would trigger a full sync.

current fix is ok, but i think we should also look at the callers of this function, such as getClientMemoryUsage; it may not be accurate after applying this patch.

getClientMemoryUsage won't be accurate, but the inaccuracy will only be in the calculation of the head and tail of the relevant shared replication buffer nodes. This might affect client eviction (if the keys stored in the head are proportionally large compared to the memory).

We could have a separate calculation for client eviction, but that would be overkill. IMO this fix is good enough.

Member

@oranagra oranagra left a comment

Unless the client-output-buffer limits are smaller than the backlog, there's no problem, because the backlog and replicas do not share the same buffer. Worst case, writing a big key would trigger a full sync.

this configuration doesn't make sense, so we have this in place (i.e. the effective replica buffer limit is the bigger of the two):

redis/src/networking.c

Lines 3705 to 3714 in eef29b6

/* Note that it doesn't make sense to set the replica clients output buffer
* limit lower than the repl-backlog-size config (partial sync will succeed
* and then replica will get disconnected).
* Such a configuration is ignored (the size of repl-backlog-size will be used).
* This doesn't have memory consumption implications since the replica client
* will share the backlog buffers memory. */
size_t hard_limit_bytes = server.client_obuf_limits[class].hard_limit_bytes;
if (class == CLIENT_TYPE_SLAVE && hard_limit_bytes &&
    (long long)hard_limit_bytes < server.repl_backlog_size)
    hard_limit_bytes = server.repl_backlog_size;

@oranagra
Member

oranagra commented Jan 8, 2023

i'm sorry, only now i had enough time to dive into this issue and think about it.
i'm not comfortable with the solution, both because of the fact that it counts only used bytes, and not memory consumption, and also because in that case of huge nodes, we have other issues.
i'm not certain what's the right move, hoping to hear some ideas and views.

@xbasel
Contributor Author

xbasel commented Jan 8, 2023

i'm sorry, only now i had enough time to dive into this issue and think about it. i'm not comfortable with the solution, both because of the fact that it counts only used bytes, and not memory consumption, and also because in that case of huge nodes, we have other issues. i'm not certain what's the right move, hoping to hear some ideas and views.

If memory consumption of a single node is a concern, we probably shouldn't allocate big nodes. But limiting the node size introduces a new problem: a backlog with a partial key is useless, because no replica is going to need it and the memory is wasted. Let's say we limit the node size to 16KB and the backlog to 1MB, and then add a 2MB key; the backlog will not contain useful replication data. (This problem exists today too: a backlog node can still hold a partial key, but normally alongside other keys.)

Another option would be to include the actual allocated memory size in the replica memory usage calculation but exclude the backlog's actual size (which won't be freed anyway); that way the worst case would be a full sync rather than a psync loop.

@oranagra
Member

even when the backlog was a plain pre-allocated cyclic buffer, it could be that its beginning represents the end of the previous command and was useless, i don't think we wanna worry about that.

but instead, with the new mechanism, we could be aiming for a certain backlog size, and miss it by a lot because the nodes are too big, affecting our trimming granularity.
and also as you originally pointed out, it can cause replicas to get disconnected due to the obuf limit.

i could try to argue that any excess memory can be attributed to the backlog rather than the replicas, and use that excuse to avoid disconnecting them, but the fact is that the backlog config must be set lower than the replicas' obuf limit (i posted the code that's doing that in a previous comment).
and also, as i noted, i think the metrics should show consumed memory (with any implied overheads), and not just the net bytes used.

do you guys think that just limiting the node size to a reasonable scale (maybe backlog size / 10 or such) can be a reasonable solution to all these issues?
@ShooterIT @sundb please share your thoughts too.

@xbasel
Contributor Author

xbasel commented Jan 10, 2023

even when the backlog was a plain pre-allocated cyclic buffer, it could be that its beginning represents the end of the previous command and was useless, i don't think we wanna worry about that.

but instead, with the new mechanism, we could be aiming for a certain backlog size, and miss it by a lot because the nodes are too big, affecting our trimming granularity. and also as you originally pointed out, it can cause replicas to get disconnected due to the obuf limit.

i could try to argue that any excess memory can be attributed to the backlog rather than the replicas, and use that excuse to avoid disconnecting them, but the fact is that the backlog config must be set lower than the replicas' obuf limit (i posted the code that's doing that in a previous comment). and also, as i noted, i think the metrics should show consumed memory (with any implied overheads), and not just the net bytes used.

do you guys think that just limiting the node size to a reasonable scale (maybe backlog size / 10 or such) can be a reasonable solution to all these issues? @ShooterIT @sundb please share your thoughts too.

Why limit the node size to backlog/10, and not keep it simple and limit it to the default node size, PROTO_REPLY_CHUNK_BYTES, which is 16KB? The overhead isn't going to be significant.

@sundb
Collaborator

sundb commented Jan 10, 2023

Why limit the node size to backlog/10, and not keep it simple and limit it to the default node size, PROTO_REPLY_CHUNK_BYTES, which is 16KB? The overhead isn't going to be significant.

The minimum value of repl_backlog_size is 16k, which would be problematic in that case.
backlog_size/10 would make more sense.

@xbasel
Contributor Author

xbasel commented Jan 10, 2023

Why limit the node size to backlog/10, and not keep it simple and limit it to the default node size, PROTO_REPLY_CHUNK_BYTES, which is 16KB? The overhead isn't going to be significant.

The minimum value of repl_backlog_size is 16k, which would be problematic in that case. backlog_size/10 would make more sense.

Why is PROTO_REPLY_CHUNK_BYTES problematic?

@sundb
Collaborator

sundb commented Jan 10, 2023

Why is PROTO_REPLY_CHUNK_BYTES problematic?

If we use PROTO_REPLY_CHUNK_BYTES, the allocation size of a replBufBlock will be 20k (16K + sizeof(replBufBlock)), so the slave's memory usage may still exceed the backlog size.
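For context, the node header being discussed is the replBufBlock struct; roughly (paraphrased from src/server.h, the exact layout may differ between versions):

```c
/* Paraphrased sketch of the shared replication buffer node header
 * (see replBufBlock in src/server.h for the authoritative definition). */
typedef struct replBufBlock {
    int refcount;          /* number of replicas + backlog referencing this block */
    long long id;          /* unique incremental block id */
    long long repl_offset; /* replication offset of the block's first byte */
    size_t size, used;     /* capacity of buf[] and bytes actually used */
    char buf[];            /* the replication data itself */
} replBufBlock;
```

The allocator then rounds the header-plus-buffer allocation up to its size class (e.g. jemalloc rounds 16KB plus the header up to 20KB), which is the ~20k figure above.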

@oranagra
Member

@sundb you're referring to a case where the backlog size is set to lower than 20k?
note that either way we'll use the internal frag, so aiming for 16k we'll still consume (and use) 20.

but anyway, i don't think we need that level of granularity.
considering the replica obuf limit is bigger than the backlog, we can focus on the size of the backlog.
so if nodes are limited to 1/10 of the backlog size, at most the backlog will keep 110% of what it was meant to hold.
[on a second thought, 10 is such an ugly number... better use 8 or 16]

so the question remains, if that'll solve all problems, or do we have concerns that this will not solve?

@xbasel
Contributor Author

xbasel commented Mar 2, 2023

@sundb you're referring to a case where the backlog size is set to lower than 20k? note that either way we'll use the internal frag, so aiming for 16k we'll still consume (and use) 20.

but anyway, i don't think we need that level of granularity. considering the replica obuf limit is bigger than the backlog, we can focus on the size of the backlog. so if nodes are limited to 1/10 of the backlog size, at most the backlog will keep 110% of what it was meant to hold. [on a second thought, 10 is such an ugly number... better use 8 or 16]

so the question remains, if that'll solve all problems, or do we have concerns that this will not solve?

I've limited the node size to backlog_size/16.

@oranagra
Member

oranagra commented Mar 7, 2023

@xbasel it took me some time to re-read the discussion and remember what this PR is about...
i see you implemented my suggested solution, i'll review the new code shortly (sorry for the delay, i was busy elsewhere)

i went over the unresolved comments and marked them all as resolved since they're no longer relevant.
i see the top comment is outdated, and i notice you force-pushed a commit with an updated comment (i'll copy it to the top comment).
p.s. i see that force-push contains both a rebase, and a completely new approach (compared to the previous one).
i generally ask to avoid force-push (makes it harder to do incremental review), but since this is a completely new approach, and new tests (i think), that's ok.

@oranagra
Member

oranagra commented Mar 7, 2023

@xbasel due to the indentation change of a mass of code, it was quite impossible to review the PR in GH, and i ended up cloning it for a local review.
after i adjusted the indentation for my review, i noticed a few unintentional changes probably made by your editor, so i reverted them.
i also fixed a bug (i think) which would have caused us to use repl_backlog_size / 16 even if we only require a smaller size.

i was trying to find a solution to that indentation problem, and avoid massive blame-log destruction (difficulty for future reviewers, and possibly more merge conflicts than necessary), but other than using the notorious goto, i can't find one, so i accepted the change.
i'll note that in the top comment.

other than that i did some other cleanups:

  1. added a comment in the code explaining its logic.
  2. removed an unnecessary comment about internal implementation in the test

since i already cloned the code and ran the test locally, i did some other cleanup changes in the tests:

  • avoided the use of "slave" when possible in new code
  • generalized the client_field proc
  • moved that test out of the scope of the server of the previous test.
  • sped up a previous test which i noticed takes forever to finish.
  • negated the comment saying "and server cron to run and disconnect" (correct me if i'm wrong)

let me know if you see anything wrong.

@xbasel
Contributor Author

xbasel commented Mar 12, 2023

@xbasel due to the indentation change of a mass of code, it was quite impossible to review the PR in GH, and i ended up cloning it for a local review. after i adjusted the indentation for my review, i noticed a few unintentional changes probably made by your editor, so i reverted them. i also fixed a bug (i think) which would have caused us to use repl_backlog_size / 16 even if we only require a smaller size.

i was trying to find a solution to that indentation problem, and avoid massive blame-log destruction (difficulty for future reviewers, and possibly more merge conflicts than necessary), but other than using the notorious goto, i can't find one, so i accepted the change. i'll note that in the top comment.

other than that i did some other cleanups:

1. added a comment in the code explaining its logic.

2. removed an unnecessary comment about internal implementation in the test

since i already cloned the code and ran the test locally, i did some other cleanup changes in the tests:

* avoided the use of "slave" when possible in new code

* generalized the client_field proc

* moved that test out of the scope of the server of the previous test.

* sped up a previous test which i noticed takes forever to finish.

* negated the comment saying "and server cron to run and disconnect" (correct me if i'm wrong)

let me know if you see anything wrong.

lgtm

@oranagra oranagra changed the title from "Replica client output buffer usage calculation is problematic and can lead to buffer overrun/psync loops" to "Large blocks of replica client output buffer could lead to psync loops and unnecessary memory usage" on Mar 12, 2023
@oranagra oranagra merged commit 7be7834 into redis:unstable Mar 12, 2023
13 checks passed
@oranagra oranagra added the release-notes label (indication that this issue needs to be mentioned in the release notes) on Mar 12, 2023
@sundb
Collaborator

sundb commented Mar 13, 2023

It seems that some CI jobs failed due to this PR.
https://github.com/redis/redis/actions/runs/4400103760/jobs/7705087323

@enjoy-binbin
Collaborator

enjoy-binbin commented Mar 13, 2023

enjoy-binbin@2cbbf63
This is what I suspect, I'm running Daily to verify it: https://github.com/enjoy-binbin/redis/actions/runs/4402222711

the reason, i guess, is that we update repl_backlog->histlen first, and in some cases we may trigger a trim in incrementalTrimReplicationBacklog. i am not sure about it, but it looks like master_repl_offset and repl_backlog->histlen need to be handled inside the while loop

@sundb
Collaborator

sundb commented Mar 13, 2023

@enjoy-binbin Yes, we shouldn't have updated master_repl_offset at the beginning; before this PR we would have created at most one new reply node, but now there are multiple.
The repl_offset indicates the position of a reply node's first byte in the master replication stream, so when there are multiple nodes, it will always be wrong.

@oranagra
Member

it doesn't really matter when we update master_repl_offset, since we also update histlen.
the problem is that now we're creating more than one node, and the tail->repl_offset of all the nodes except the last one is incorrect.
enjoy-binbin@2cbbf63 looks right to me.
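To illustrate the invariant being discussed (a hypothetical toy sketch, not enjoy-binbin's actual patch): when one feed is split across several nodes, each node's repl_offset has to be derived from the offset as it was before the feed plus the bytes already placed in earlier nodes, not from the already-advanced master_repl_offset:

```c
#include <stdio.h>

int main(void) {
    /* Toy numbers, purely illustrative. */
    long long offset_before_feed = 1000; /* global offset before this write */
    long long len = 200;                 /* bytes being fed */
    long long node_cap = 64;             /* per-node capacity */

    for (long long written = 0; written < len; written += node_cap) {
        /* Each node's repl_offset is the global offset of its first byte. */
        long long node_repl_offset = offset_before_feed + written;
        printf("node repl_offset = %lld\n", node_repl_offset);
    }
    /* Deriving every node's offset from master_repl_offset *after* it was
     * already advanced by len is what made all but the last node wrong. */
    return 0;
}
```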

@enjoy-binbin
Collaborator

odd, my fix did not fix the CI failures. it still fails with the same problem,
and it looks like we do not have a test that covers my fix (the tail->repl_offset one)

i will take another look later...

@oranagra
Member

the problem, or at least one of them, is about NO_MALLOC_USABLE_SIZE, easy to reproduce locally with:

 make distclean; make MALLOC=libc CFLAGS=-DNO_MALLOC_USABLE_SIZE REDIS_CFLAGS='-Werror'
 ./runtest --single unit/maxmemory

@sundb
Collaborator

sundb commented Mar 13, 2023

Perhaps this is due to the overhead of the replBufBlock headers.

The following is the last log I printed; I can see that jemalloc allocates 216 more usable bytes per block, so the server.repl_buffer_blocks list is shorter and the total replBufBlock overhead is smaller.

jemalloc

malloc_size: 1024, usable_size: 1280, server.repl_buffer_blocks length: 40879

libc (make MALLOC=libc CFLAGS=-DNO_MALLOC_USABLE_SIZE REDIS_CFLAGS='-Werror')

malloc_size: 1024, usable_size: 1064, server.repl_buffer_blocks length: 49659

@oranagra
Member

oranagra commented Mar 13, 2023

            size_t size = min(max(len, (size_t)PROTO_REPLY_CHUNK_BYTES), (size_t)server.repl_backlog_size / 16);

in this test the backlog size is 10k, so backlog / 16 is smaller than PROTO_REPLY_CHUNK_BYTES. this seems to solve the problem, at least for this specific pair of tests.

-            size_t size = min(max(len, (size_t)PROTO_REPLY_CHUNK_BYTES), (size_t)server.repl_backlog_size / 16);
+            size_t limit = max((size_t)server.repl_backlog_size / 16, (size_t)PROTO_REPLY_CHUNK_BYTES);
+            size_t size = min(max(len, (size_t)PROTO_REPLY_CHUNK_BYTES), limit);
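To illustrate the effect of the adjustment (a toy sketch of the formula in the diff above, with made-up backlog values): the per-node cap no longer drops below PROTO_REPLY_CHUNK_BYTES when the backlog is tiny, as in the 10k-backlog test:

```c
#include <stdio.h>

#define PROTO_REPLY_CHUNK_BYTES (16*1024)

/* Sketch of the adjusted limit from the diff above:
 * limit = max(repl_backlog_size / 16, PROTO_REPLY_CHUNK_BYTES). */
static size_t node_limit(size_t repl_backlog_size) {
    size_t cap = repl_backlog_size / 16;
    return cap > PROTO_REPLY_CHUNK_BYTES ? cap : PROTO_REPLY_CHUNK_BYTES;
}

int main(void) {
    printf("backlog 10KB -> limit %zu\n", node_limit(10*1024));   /* 16384 */
    printf("backlog 1MB  -> limit %zu\n", node_limit(1024*1024)); /* 65536 */
    return 0;
}
```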

@oranagra
Member

i suppose it's just a combination of a very low backlog size in that test, and some thresholds of that test that get violated because of the relatively high overhead of replBufBlock.
@enjoy-binbin can you please mix my fix with yours and try a full CI (maybe limit it to just the units we saw failing).

@sundb
Collaborator

sundb commented Mar 13, 2023

@oranagra @enjoy-binbin Add tests to cover enjoy-binbin@2cbbf63?

@enjoy-binbin
Collaborator

can you please mix my fix with yours and try a full CI (maybe limit it to just the units we saw failing).

i am testing it, seems to work

Add tests to cover xxx

yeah, i'll give it a try, but maybe not in time

enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Mar 13, 2023
In redis#11666, we added a while loop that splits a big reply
node into multiple nodes. The update of tail->repl_offset may
be wrong: before redis#11666 we would have created at most
one new reply node, but now we create multiple nodes when
the reply is big.

Now that we are creating more than one node, the tail->repl_offset
of all the nodes except the last one is incorrect, because we
update master_repl_offset at the beginning and then use it to
set each tail->repl_offset.

At the same time, the calculation of size was adjusted to fix
the failed test. We suppose it's just a combination of a very low
backlog size in that test, and some thresholds of that test that
get violated because of the relatively high overhead of replBufBlock.
This size fix was provided by oran.
```
*** [err]: slave buffer are counted correctly in tests/unit/maxmemory.tcl
Expected 814808 < 500000 && 814808 > -500000 (context: type eval line 80 cmd {assert {$delta < $delta_max && $delta > -$delta_max}} proc ::test)

*** [err]: replica buffer don't induce eviction in tests/unit/maxmemory.tcl
Expected [::redis::redisHandle147 dbsize] == 100 (context: type eval line 77 cmd {assert {[$master dbsize] == 100}} proc ::test)

*** [err]: All replicas share one global replication buffer in tests/integration/replication-buffer.tcl
Expected '1153280' to be equal to '1148928' (context: type eval line 18 cmd {assert_equal $repl_buf_mem [s mem_total_replication_buffers]} proc ::test)
```

Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: sundb <sundbcn@gmail.com>
oranagra added a commit that referenced this pull request Mar 13, 2023
In #11666, we added a while loop that splits a big reply
node into multiple nodes. The update of tail->repl_offset may
be wrong: before #11666 we would have created at most
one new reply node, but now we create multiple nodes when
the reply is big.

Now that we are creating more than one node, the tail->repl_offset
of all the nodes except the last one is incorrect, because we
update master_repl_offset at the beginning and then use it to
set each tail->repl_offset. This would have led to an assertion
during PSYNC; a test was added to validate that case.

Besides that, the calculation of size was adjusted to fix
tests that failed due to a combination of a very low backlog size
and some test thresholds that get violated because of the relatively
high overhead of replBufBlock. So now, if backlog size / 16 is too
small, we take PROTO_REPLY_CHUNK_BYTES instead.

Co-authored-by: Oran Agra <oran@redislabs.com>
oranagra pushed a commit to oranagra/redis that referenced this pull request Mar 20, 2023
…s and unnecessary memory usage (redis#11666)
(cherry picked from commit 7be7834)
oranagra pushed a commit to oranagra/redis that referenced this pull request Mar 20, 2023
(cherry picked from commit 7997874)
@oranagra oranagra mentioned this pull request Mar 20, 2023
oranagra pushed a commit that referenced this pull request Mar 20, 2023
…s and unnecessary memory usage (#11666)
(cherry picked from commit 7be7834)
oranagra pushed a commit that referenced this pull request Mar 20, 2023
(cherry picked from commit 7997874)
Mixficsol pushed a commit to Mixficsol/redis that referenced this pull request Apr 12, 2023
…s and unnecessary memory usage (redis#11666)
Mixficsol pushed a commit to Mixficsol/redis that referenced this pull request Apr 12, 2023
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
…s and unnecessary memory usage (redis#11666)
Labels
release-notes: indication that this issue needs to be mentioned in the release notes
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[BUG] big keys can trigger replica buffer overrun/psync loops
6 participants