
Conversation

@yoav-steinberg yoav-steinberg commented Mar 23, 2021

Description

A mechanism for disconnecting clients when the sum of memory used by all connected clients is above a configured limit. This prevents eviction or OOM caused by memory accumulated across all clients. It's a complementary mechanism to the client-output-buffer-limit mechanism: it takes into account not just a single client, and not just output buffers, but all memory used by all clients.

Design

The general design is as follows:

  • We track memory usage of each client, taking into account all memory used by the client (query buffer, output buffer, parsed arguments, etc...). This is kept up to date after reading from the socket, after processing commands and after writing to the socket.
  • Based on the used memory we sort all clients into buckets. Each bucket contains all clients using up to 2x the memory of the clients in the bucket below it. For example: up to 1m clients, up to 2m clients, up to 4m clients, ...
  • Before processing a command and before sleep we check if we're over the configured limit. If we are, we start disconnecting clients from the larger buckets downwards until we're under the limit.
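As a toy model of that last step (all names and structures below are illustrative, not the actual Redis code), the eviction pass walks the buckets from the largest downwards:

```c
#include <stddef.h>

#define NUM_BUCKETS 4
#define MAX_CLIENTS_PER_BUCKET 8

/* Toy model: each bucket holds the memory usage of its clients. */
typedef struct {
    size_t client_mem[MAX_CLIENTS_PER_BUCKET];
    int nclients;
} mem_bucket;

static size_t total_clients_mem(const mem_bucket *b, int n) {
    size_t sum = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < b[i].nclients; j++)
            sum += b[i].client_mem[j];
    return sum;
}

/* Disconnect ("drop") clients from the largest bucket downwards until the
 * total memory of all clients is back under the configured limit.
 * Returns the number of evicted clients. */
static int evict_clients_if_needed(mem_bucket *b, int n, size_t limit) {
    int evicted = 0;
    if (limit == 0) return 0; /* 0 means no limit */
    for (int i = n - 1; i >= 0 && total_clients_mem(b, n) > limit; i--) {
        while (b[i].nclients > 0 && total_clients_mem(b, n) > limit) {
            b[i].nclients--; /* "disconnect" the last client in this bucket */
            evicted++;
        }
    }
    return evicted;
}
```

Note how only as many clients as needed are dropped, and always from the biggest bucket first, so small well-behaved clients survive.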

Config

maxmemory-clients: the maximum memory all clients are allowed to consume; above this threshold we disconnect clients.
This config can either be set to 0 (meaning no limit), a size in bytes (possibly with MB/GB suffix),
or as a percentage of maxmemory by using the % suffix (e.g. setting it to 10% would mean 10% of maxmemory).
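For illustration, the three accepted forms might look like this in redis.conf (the values here are arbitrary examples):

```
# No limit (the default):
maxmemory-clients 0

# Absolute limit, bytes with an optional size suffix:
maxmemory-clients 1gb

# Percentage of maxmemory:
maxmemory-clients 10%
```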

Important code changes

  • During development I encountered yet more situations where our io-threads access global vars, and needed to fix them. I also had to keep the clients sorted into the memory buckets (which are global) while their memory usage changes in the io-thread. To achieve this I decided to simplify how we check if we're in an io-thread and make it much more explicit. I removed the CLIENT_PENDING_READ flag used for checking if the client is in an io-thread (it wasn't used for anything else) and just used the global io_threads_op variable the same way to check during writes.
  • I optimized the cleanup of the client from the clients_pending_read list on client freeing. We now store a pointer in the client struct to this list so we don't need to search in it (pending_read_list_node).
  • Added evicted_clients stat to INFO command.
  • Added CLIENT NO-EVICT ON|OFF sub command to exclude a specific client from the client eviction mechanism. Added the corresponding 'e' flag in the client info string.
  • Added multi-mem field in the client info string to show how much memory is used up by buffered multi commands.
  • Client tot-mem now accounts for buffered multi-commands, pubsub patterns and channels (partially), tracking prefixes (partially).
  • CLIENT_CLOSE_ASAP flag is now handled in a new beforeNextClient() function so clients will be disconnected between processing different clients and not only before sleep. This new function can be used in the future for work we want to do outside the command processing loop but don't want to wait for all clients to be processed before we get to it. Specifically I wanted to handle output-buffer-limit related closing before we process client eviction in case the two race with each other.
  • Added a DEBUG CLIENT-EVICTION command to print out info about the client eviction buckets.
  • Each client now holds a pointer to the client eviction memory usage bucket it belongs to and a listNode to itself in that bucket for quick removal.
  • The global io_threads_op variable can now contain an IO_THREADS_OP_IDLE value indicating no io-threading is currently being executed.
  • In order to track memory used by each client in real time we can't rely on updating these stats in clientsCron() alone anymore. So now I call updateClientMemUsage() (previously clientsCronTrackClientsMemUsage()) after command processing, after writing data to pubsub clients, after writing the output buffer and after reading from the socket (and maybe other places too). The function is written to be fast.
  • Clients are evicted if needed (with appropriate log line) in beforeSleep() and before processing a command (before performing oom-checks and key-eviction).
  • All clients memory usage buckets are grouped as follows:
    • All clients using less than 64k.
    • 64K..128K
    • 128K..256K
    • ...
    • 2G..4G
    • All clients using 4g and up.
  • Added client-eviction.tcl with a bunch of tests for the new mechanism.
  • Extended maxmemory.tcl to test the interaction between maxmemory and maxmemory-clients settings.
  • Added an option to flag a numeric configuration variable as a "percent", meaning that if we encounter a '%' after the number in the config file (or CONFIG SET command) we consider it valid. Such a number is stored internally as a negative value. This way an integer value can be interpreted as either a percent (negative) or an absolute value (positive). This is useful, for example, if some numeric configuration can optionally be set to a percentage of something else.
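The doubling bucket boundaries listed above (64K, 128K, ..., 4G and up) map naturally to a floor(log2) computation. A hedged sketch of that mapping (hypothetical names; the real Redis code differs) using __builtin_clzl, whose use this PR introduced, assuming 64-bit unsigned long:

```c
#include <stddef.h>

/* Hypothetical sketch of mapping a client's memory usage to a bucket index:
 * bucket 0 holds clients under 64K, then each bucket covers a doubling
 * range (64K..128K, 128K..256K, ...), with one last bucket for 4G and up. */
#define MIN_BUCKET_LOG 16 /* 64K = 2^16 */
#define MAX_BUCKET_LOG 32 /* 4G  = 2^32 */

static int mem_usage_bucket(size_t mem) {
    if (mem < ((size_t)1 << MIN_BUCKET_LOG)) return 0;
    /* floor(log2(mem)) via count-leading-zeros */
    int log2floor = (int)(sizeof(unsigned long) * 8 - 1 - __builtin_clzl(mem));
    /* everything at 4G and above shares the last bucket */
    if (log2floor > MAX_BUCKET_LOG) log2floor = MAX_BUCKET_LOG;
    return log2floor - MIN_BUCKET_LOG + 1;
}
```

With these constants there are 18 buckets in total: one below 64K, sixteen doubling ranges, and one for 4G and up.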

Original PR description:

See #7676

Description

We track how much memory each client uses. The value is updated after each processInputBuffer and after writing to the socket. When the sum over all clients exceeds the configured limit we disconnect the fat clients. This is checked after each command.

This is more or less done given the current state of my discussions with @oranagra.
We need a larger forum to review this solution and decide if it's good enough.

Important/Open issues:

  1. @oranagra and I thought it won't be right to evict clients based on some rolling avg of memory usage, rather to look at the current state. This is because we evict clients until we're back under some threshold and looking at past memory usage while aiming to go below some current memory usage might cause unexpected results. Originally we thought some past avg will be needed to avoid disconnecting a bursting client which uses up a lot of memory for a short time, but in reality such a client might eventually also be disconnected and it might not be what we want if not disconnecting it will cause all other clients to disconnect as well. There's no one right solution here, but I think adding a past based rolling avg just complicates things and makes them less predictable and tougher to tune for the user.
  2. How does this relate to the already existing single-client COB (client output buffer) threshold? Do we still need it, or is it redundant and should be deprecated? From the user's point of view it's easier to simply define a threshold for maxmemory-clients which is also enforced on the individual client level (and also includes the query buffer size), and there might be no real reason to configure the client-output-buffer-limit. Also need to consider how this relates to replication client limits; currently maxmemory-clients ignores these clients.
  3. Soft vs hard thresholds: the old client-output-buffer-limit had a timer based soft threshold. What is it really good for? Isn't this just over complicating things? If not perhaps we want something like this for maxmemory-clients as well? I have a feeling this is an overkill and can probably be deprecated.
  4. One difference between the client-output-buffer-limit implementation and this is that here we check and evict clients only before processing a command and before sleep but not during the command processing. We might want to change this and add the code checking the limit and evicting clients inside _addReplyProtoToList where asyncCloseClientOnOutputBufferLimitReached is called. I think this won't affect performance that much. If we're thinking this is the future replacement for client-output-buffer-limit then we need to do this.
  5. Should client no-evict flag also guard against client-output-buffer-limit protection?

TODO:

  • remove client loop in info command: Client eviction #8687 (comment)
  • remove big todo comment in server.h: Client eviction #8687 (comment)
  • Account MULTI command buffer size as clients used memory.
  • There's the issue of accounting watched keys: they are kind of per-client but not really because in reality there's a list of clients per watched key. How do we handle this? Do we add another mechanism for limiting memory used by watched keys?

Tests to be added:

  • Decrease maxmemory-clients in runtime causes client eviction.
  • Only the required number of clients are evicted to achieve maxmemory-clients
  • First larger clients are evicted and then smaller ones.
  • Client eviction works on large query buffers, large args, large output buffers, large multi buffers and watched keys lists.

@yoav-steinberg yoav-steinberg added the state:needs-design (the solution is not obvious and some effort should be made to design it) and state:major-decision (requires core team consensus) labels Mar 31, 2021
oranagra previously approved these changes Apr 28, 2021

@oranagra oranagra left a comment


conceptual approval (just asked for some minor cleanup)

@oranagra oranagra added the approval-needed (waiting for core team approval to be merged) and release-notes (indication that this issue needs to be mentioned in the release notes) labels May 4, 2021

oranagra commented May 4, 2021

@redis/core-team please take a look at this new feature for redis 7.0 (details at the top)

@yoav-steinberg

@yossigo @oranagra There's the issue of tracking memory usage of watched keys:

  • A WATCH command adds the key name to a global dict of watched keys. Each entry in the dict contains a list of clients watching that key. This means that this isn't a per-client memory consumption. So we need to think of a mechanism of limiting how much memory watched keys can consume. Another config?
  • We also don't have any reporting of these global dicts. So mem overhead reporting should be updated accordingly.
  • In addition, each client contains a list of pointers to all the keys it's watching. This can be accounted for per-client, reported in CLIENT LIST and used for client eviction. This is already implemented in my last commits.

Any thoughts?

@yoav-steinberg

After talking with @oranagra about how to handle io-threads-do-reads, we came up with the following concept (to be tested):
To handle the eviction buckets being global and update them when filling data per client in the read threads, we can simply make sure all updates are atomic decrements or increments (we need decrements when moving a client from one bucket to another). We can also check (and update) the total memory usage sum. If we pass maxmemory-clients we can stop processing the client or even abort the thread. When we're back in the main thread we can safely assume all sums in the buckets are valid because of eventual consistency. And at this point handle any client evictions if needed.
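A minimal sketch of that idea using C11 atomics (all names here are hypothetical; the actual implementation differs): io-threads only touch the shared bucket sums and the global total through atomic add/sub, so once all io-threads have joined, the main thread sees consistent totals and can evict safely.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Per-bucket total memory of the clients it contains. */
typedef struct {
    atomic_size_t mem_usage_sum;
} mem_bucket;

static atomic_size_t total_clients_mem;

/* Called (possibly from an io-thread) when a client's memory usage changes
 * and it moves from one bucket to another: an atomic decrement of the old
 * bucket plus an atomic increment of the new one. */
static void account_client_mem(mem_bucket *from, mem_bucket *to,
                               size_t old_mem, size_t new_mem) {
    atomic_fetch_sub(&from->mem_usage_sum, old_mem);
    atomic_fetch_add(&to->mem_usage_sum, new_mem);
    /* unsigned wrap-around makes this correct for a negative delta too */
    atomic_fetch_add(&total_clients_mem, new_mem - old_mem);
}
```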

@oranagra

regarding the watched keys: i don't think the client eviction mechanism needs to be perfect and count all per-client overheads. it's ok that we solve the output buffer problem and other painful problems, and some edge cases remain unsolved (it's not a security feature).
So the things that are truly per client, and are easy to count, we'll count (no reason not to), but things that are shared between clients, we can skip.

we can however improve the total overhead reported in INFO MEMORY, and the detailed report in MEMORY STATS to include these WATCH, and maybe CSC (client side caching / tracking) overheads (for manual troubleshooting).

@madolson

Regarding the memory usage, I agree with oran that right now a best effort will catch most of the issues. If we get to the point in the future where we see issues, we can iterate on this solution.

@oranagra

@madolson i didn't understand your comment about the valid_fn (for some reason i can't respond to that comment)


madolson commented Jun 14, 2021

@oranagra I'm not entirely sure how that comment ended up there, it was a response to a sundb comment but somehow got duplicated as its own comment. I can't respond to it either, so I deleted it.

@yoav-steinberg yoav-steinberg added the state:needs-doc-pr (requires a PR to redis-doc repository) label Aug 12, 2021
@yoav-steinberg yoav-steinberg marked this pull request as ready for review August 17, 2021 12:15

@oranagra oranagra left a comment


the top comment needs an update. the current discussion in it can remain below some separator, but i'd like to add a better description at the top that explains what the PR eventually does.
i.e. purpose, design, and most importantly interface changes and any unrelated changes.

the ones i listed during my review are these:

  • new config
  • explain the refactor of CLIENT_PENDING_READ and io_threads_op (why and how)
  • c->pending_read_list_node
  • evicted_clients stat in INFO
  • new multi-mem in CLIENT LIST and CLIENT INFO, and also other existing client memory fields (previously untracked)
  • pubsub_patterns and pubsub_channels and parts of client_tracking_prefixes memory tracked in the above
  • CLIENT NO-EVICT sub-command
  • CLIENT_CLOSE_ASAP handled in beforeNextClient rather than beforeSleep
  • anything else i missed?

oranagra pushed a commit that referenced this pull request Sep 26, 2021
Fixing CI test issues introduced in #8687
- valgrind warnings in readQueryFromClient when client was freed by processInputBuffer
- adding DEBUG pause-cron for tests not to be time dependent.
- skipping a test that depends on socket buffers / events not compatible with TLS
- making sure client got subscribed by not using deferring client
enjoy-binbin added a commit to enjoy-binbin/redis-doc that referenced this pull request Jan 12, 2022
multi-mem: added in redis/redis#8687
resp: added in redis/redis#9508

Also adjust the order of fields to better match the output

The current output format is (redis unstable branch):
```
id=3 addr=127.0.0.1:50188 laddr=127.0.0.1:6379 fd=8 name= age=7 idle=0
flags=N db=0 sub=0 psub=0 multi=-1 qbuf=26 qbuf-free=20448 argv-mem=10
multi-mem=0 obl=0 oll=0 omem=0 tot-mem=40986 events=r cmd=client|list
user=default redir=-1 resp=2
```
itamarhaber pushed a commit to redis/redis-doc that referenced this pull request Jan 23, 2022
oranagra pushed a commit that referenced this pull request Mar 15, 2022
In a benchmark we noticed we spend a relatively long time updating the client
memory usage leading to performance degradation.
Before #8687 this was performed in the client's cron and didn't affect performance.
But since introducing client eviction we need to perform this after filling the input
buffers and after processing commands. This also led me to write this code to be
thread safe and perform it in the i/o threads.

It turns out that the main performance issue here is related to atomic operations
being performed while updating the total clients memory usage stats used for client
eviction (`server.stat_clients_type_memory[]`). This update needed to be atomic
because `updateClientMemUsage()` was called from the IO threads.

In this commit I make sure to call `updateClientMemUsage()` only from the main thread.
In case of threaded IO I call it for each client during the "fan-in" phase of the read/write
operation. This also means I could chuck the `updateClientMemUsageBucket()` function
which was called during this phase and embed it into `updateClientMemUsage()`.

Profiling shows this makes `updateClientMemUsage()` (on my x86_64 linux) roughly x4 faster.
@oranagra

@yoav-steinberg i got some failure with valgrind. maybe you have time to look into it

*** [err]: avoid client eviction when client is freed by output buffer limit in tests/unit/client-eviction.tcl
Expected 'obuf-client1' to match 'no client named obuf-client1 found*' (context: type eval line 38 cmd {assert_match {no client named obuf-client1 found*} $e} proc ::test) 

@yoav-steinberg

Not sure, the test seems fine. If it recreates, you can check the server logs to see why the two clients aren't being disconnected for reaching their output buffer limit.

oranagra pushed a commit that referenced this pull request Dec 28, 2022
…ors (#11657)

This call is introduced in #8687, but became irrelevant in #11348, and is currently a no-op.
The fact is that #11348 had an unintended side effect: even if the client eviction config
is enabled, there are certain types of clients for which memory consumption is not accurately
tracked, and so, unlike normal clients, their memory isn't reported correctly in INFO.
oranagra pushed a commit that referenced this pull request Jan 16, 2023

(cherry picked from commit af0a4fe)
oranagra pushed a commit that referenced this pull request Jul 25, 2023
A bug introduced in #11657 (7.2 RC1), causes client-eviction (#8687)
and INFO to have inaccurate memory usage metrics of MONITOR clients.

Because the type in `c->type` and the type in `getClientType()` are confusing
(in the latter, `CLIENT_TYPE_NORMAL` not `CLIENT_TYPE_SLAVE`), the comment
we wrote in `updateClientMemUsageAndBucket` was wrong, and in fact that function
didn't skip monitor clients.
And since it doesn't skip monitor clients, it was wrong to delete the call for it from
`replicationFeedMonitors` (it wasn't a NOP).
That deletion could mean that the monitor client memory usage is not always up to
date (updated less frequently, but still a candidate for client eviction).
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
oranagra pushed a commit to redis/redis-doc that referenced this pull request Oct 24, 2023
no-evict added in redis/redis#8687
no-touch added in redis/redis#11483

Co-authored-by: Binbin <binloveplau1314@qq.com>
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Nov 28, 2023
In the past, we did not call _dictNextExp frequently. It was only
called when the dictionary was expanded.

Later, dictTypeExpandAllowed was introduced in redis#7954, which is 6.2.
For the data dict and the expire dict, we can check maxmemory before
actually expanding the dict. This is a good optimization to avoid
maxmemory being exceeded due to the dict expansion.

And in redis#11692, we moved the dictTypeExpandAllowed check before the
threshold check, which caused a bit of performance degradation: every
time a key is added to the dict, dictTypeExpandAllowed is called to check.

The main reason for degradation is that in a large dict, we need to
call _dictNextExp frequently, that is, every time we add a key, we
need to call _dictNextExp once. Then the threshold is checked to see
if the dict needs to be expanded. We can see that the order of checks
here can be optimized.

So we moved the dictTypeExpandAllowed check back to after the threshold
check in redis#12789. In this way, before the dict is actually expanded (that
is, before the threshold is reached), we will not do anything extra
compared to before, that is, we will not call _dictNextExp frequently.

But note we'll still hit the degradation when we go over the thresholds.
When the threshold is reached, because of redis#7954, we may delay the dict
expansion due to maxmemory limitations. In this case, we will call
_dictNextExp every time we add a key during this period.

This PR uses CLZ in _dictNextExp to get the next power of two. CLZ (count
leading zeros) can easily give you the next power of two. It should be
noted that we actually introduced the use of __builtin_clzl in redis#8687,
which is 7.0. So I suppose all the platforms we use have it (even if the
CPU doesn't have an instruction).

We build 67108864 (2**26) keys through DEBUG POPULATE, which will use
approximately 5.49G memory (used_memory:5898522936). If expansion is
triggered, the additional hash table will consume approximately 1G
memory (2 ** 27 * 8). So we set maxmemory to 6871947673 (that is, 6.4G),
which will be less than 5.49G + 1G, so we will delay the dict rehash
while adding the keys.

After that, each time an element is added to the dict, an allow check
will be performed, that is, we can frequently call _dictNextExp to test
the comparison before and after the optimization. Using DEBUG HTSTATS 0 to
check and make sure that our dict expansion is delayed.

Using `./src/redis-benchmark -P 100 -r 1000000000 -t set -n 5000000`,
After ten rounds of testing:
```
unstable:           this PR:
769585.94           816860.00
771724.00           818196.69
775674.81           822368.44
781983.12           822503.69
783576.25           828088.75
784190.75           828637.75
791389.69           829875.50
794659.94           835660.69
798212.00           830013.25
801153.62           833934.56
```

We can see there is about 4-5% performance improvement in this case.
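As a rough illustration of the CLZ trick the commit above describes (the real _dictNextExp returns an exponent; this toy returns the rounded-up size itself, and assumes 64-bit unsigned long):

```c
#include <stddef.h>

/* Smallest power of two >= size, computed via count-leading-zeros instead
 * of a shift loop. The highest set bit of (size - 1) sits at position
 * 63 - __builtin_clzl(size - 1); one position above it is the answer. */
static unsigned long next_power_of_two(unsigned long size) {
    if (size <= 1) return 1;
    return 1UL << (sizeof(unsigned long) * 8 - __builtin_clzl(size - 1));
}
```

The `size - 1` keeps exact powers of two fixed (e.g. 64 stays 64 rather than rounding up to 128).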
oranagra pushed a commit that referenced this pull request Nov 29, 2023
oranagra pushed a commit that referenced this pull request Jan 9, 2024
(cherry picked from commit 22cc9b5)
oranagra pushed a commit that referenced this pull request Jan 9, 2024
(cherry picked from commit 22cc9b5)
warrick1016 pushed a commit to ctripcorp/Redis-On-Rocks that referenced this pull request Aug 29, 2025
A mechanism for disconnecting clients when the sum of all connected clients' memory usage is above a
configured limit. This prevents eviction or OOM caused by memory accumulated
across all clients. It's a complementary mechanism to the `client-output-buffer-limit`
mechanism, which considers only a single client and only its output buffers; this one
takes into account all memory used by all clients.

The general design is as follows:
* We track the memory usage of each client, taking into account all memory used by the
  client (query buffer, output buffer, parsed arguments, etc.). This is kept up to date
  after reading from the socket, after processing commands and after writing to the socket.
* Based on the used memory we sort all clients into buckets. Each bucket contains all
  clients using up to twice the memory of the clients in the bucket below it, for example
  clients using up to 1MB, up to 2MB, up to 4MB, ...
* Before processing a command and before sleep we check if we're over the configured
  limit. If we are, we start disconnecting clients from the larger buckets downwards until
  we're under the limit.

`maxmemory-clients` is the maximum memory all clients combined are allowed to consume;
above this threshold we disconnect clients.
This config can either be set to 0 (meaning no limit), a size in bytes (possibly with an
MB/GB suffix), or a percentage of `maxmemory` by using the `%` suffix (e.g. setting it
to `10%` means 10% of `maxmemory`).
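As a config fragment this looks like the following (illustrative values, following the syntax described above):

```
# redis.conf
maxmemory 4gb
maxmemory-clients 10%     # cap all clients at 10% of maxmemory (~410MB here)
# maxmemory-clients 1gb   # or an absolute size
# maxmemory-clients 0     # 0 disables the limit
```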

* During the development I encountered yet more situations where our io-threads access
  global vars and needed to fix them. I also had to keep the clients sorted into the
  memory buckets (which are global) while their memory usage changes in the io-thread.
  To achieve this I decided to simplify how we check if we're in an io-thread and make it
  much more explicit. I removed the `CLIENT_PENDING_READ` flag that was used for checking
  if the client is in an io-thread (it wasn't used for anything else) and instead use the
  global `io_threads_op` variable for the same check during writes as well.
* I optimized the cleanup of the client from the `clients_pending_read` list on client freeing.
  We now store a pointer in the `client` struct to this list so we don't need to search in it
  (`pending_read_list_node`).
* Added `evicted_clients` stat to `INFO` command.
* Added `CLIENT NO-EVICT ON|OFF` sub command to exclude a specific client from the
  client eviction mechanism. Added a corresponding `e` flag in the client info string.
* Added `multi-mem` field in the client info string to show how much memory is used up
  by buffered multi commands.
* Client `tot-mem` now accounts for buffered multi-commands, pubsub patterns and
  channels (partially), tracking prefixes (partially).
* The `CLIENT_CLOSE_ASAP` flag is now handled in a new `beforeNextClient()` function so
  clients will be disconnected between processing different clients and not only before sleep.
  This new function can be used in the future for work we want to do outside the command
  processing loop but don't want to wait for all clients to be processed before we get to it.
  Specifically I wanted to handle output-buffer-limit related closing before we process client
  eviction in case the two race with each other.
* Added a `DEBUG CLIENT-EVICTION` command to print out info about the client eviction
  buckets.
* Each client now holds a pointer to the client eviction memory usage bucket it belongs to
  and a listNode pointing to itself in that bucket for quick removal.
* The global `io_threads_op` variable can now contain an `IO_THREADS_OP_IDLE` value
  indicating no io-threading is currently being executed.
* In order to track the memory used by each client in real time we can't rely on updating
  these stats in `clientsCron()` alone anymore. So now I call `updateClientMemUsage()`
  (formerly `clientsCronTrackClientsMemUsage()`) after command processing, after
  writing data to pubsub clients, after writing the output buffer and after reading from the
  socket (and maybe other places too). The function is written to be fast.
* Clients are evicted if needed (with appropriate log line) in `beforeSleep()` and before
  processing a command (before performing oom-checks and key-eviction).
* The client memory usage buckets are grouped as follows:
  * All clients using less than 64KB.
  * 64KB..128KB
  * 128KB..256KB
  * ...
  * 2GB..4GB
  * All clients using 4GB and up.
* Added client-eviction.tcl with a bunch of tests for the new mechanism.
* Extended maxmemory.tcl to test the interaction between maxmemory and
  maxmemory-clients settings.
* Added an option to flag a numeric configuration variable as a "percent", meaning that
  if we encounter a '%' after the number in the config file (or CONFIG SET command) we
  consider it valid. Such a number is stored internally as a negative value. This way an
  integer value can be interpreted as either a percent (negative) or an absolute value
  (positive). This is useful, for example, if some numeric configuration can optionally be
  set to a percentage of something else.

Co-authored-by: Oran Agra <oran@redislabs.com>
warrick1016 pushed a commit to ctripcorp/Redis-On-Rocks that referenced this pull request Aug 29, 2025
…10401)

In a benchmark we noticed we spend a relatively long time updating the client
memory usage, leading to performance degradation.
Before redis#8687 this was performed in the client's cron and didn't affect performance.
But since introducing client eviction we need to perform this after filling the input
buffers and after processing commands. This also led me to write this code to be
thread safe and perform it in the i/o threads.

It turns out that the main performance issue here is related to atomic operations
being performed while updating the total clients memory usage stats used for client
eviction (`server.stat_clients_type_memory[]`). This update needed to be atomic
because `updateClientMemUsage()` was called from the IO threads.

In this commit I make sure to call `updateClientMemUsage()` only from the main thread.
In case of threaded IO I call it for each client during the "fan-in" phase of the read/write
operation. This also means I could chuck the `updateClientMemUsageBucket()` function
which was called during this phase and embed it into `updateClientMemUsage()`.

Profiling shows this makes `updateClientMemUsage()` (on my x86_64 linux) roughly x4 faster.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approval-needed Waiting for core team approval to be merged release-notes indication that this issue needs to be mentioned in the release notes state:major-decision Requires core team consensus state:needs-design the solution is not obvious and some effort should be made to design it state:needs-doc-pr requires a PR to redis-doc repository state:to-be-merged The PR should be merged soon, even if not yet ready, this is used so that it won't be forgotten
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[NEW] Client "eviction" - drop clients when their total buffer overhead is over a limit