New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hash Field Expiration - Draft #13172
Hash Field Expiration - Draft #13172
Conversation
we need to update db->hexpire when executing writable hash command or other commands that may affect hash.
|
I'm wondering why we introduced list ebucket which is only one per db. |
Hello @sundb,
You can review test tests at file: |
@sundb , I am not sure that i understand the comment. There is no relation between the fact that there is only one per db and the fact that it adds "complexity". In fact the code is rather simple, doing unregister-and-register, from source to destination, to BTW, currently there is no real reason why |
ohh, sorry, listpack-related code forgets to handle them. |
Abstract
This proposal advocates for enhancing Redis expiration functionality by introducing the option to set expiration on individual hash fields. In addition, it challenges the existing keyspace expiration implementation by presenting an alternative data structure known as
ebuckets
, initially designed for HFE usage.ebuckets
combine different recurring ideas that were raised up by the community, in the context of keyspace, potentially extending their relevance to hash fields as well. The new DS is optimized for active expiration, sorted by TTL, boasting a reduced memory cost of approximately ~20 bytes of metadata per item compared to the existing ~40 bytes in the dict for expiry. It uses an embedded expiry struct in the registered item which saves another pointer and gives along the way an O(1) TTL lookup.In that context, we also introduce a new structure, named
MSTR
, which stands for immutable STRing with Metadata. It comes as a replacement forSDS
-fields in a hash. WhereasSDS
only allows manipulation of strings, withMSTR
we can add and remove metadata attached to immutable strings generically.Overview
Redis is often used as a cache, and as such, one of its main features is being able to expire keys based on timeout. Over the years there was a recurring demand for more granular expiration control, such as to expire members in a set [1,#135], or members in sorted set [1,2], and particularly there are requests for expiration of fields of a hash [1,2,3,#167,#1042]. Several initiatives have been undertaken to address these needs, taking the form of third-party modules [1,2] and Redis forks [1].
The key aspects of this feature include:
EXPIRE
command for keyspace which only needs to decide whether a given object is expired. Here we need to fine-tune a few hash commands for corresponding action to be taken.EXPIRE
feature for keys, however, some commands adopt a slightly different approach.Active Expiration
When designing a data structure for hash field expiration, it might be a good idea to examine first the current Redis implementation of key expiration, considering its pros and cons. By evaluating the existing approach, we can determine whether adopting a similar strategy would be beneficial. This examination may not only aid in designing an effective data structure for hash field expiration but potentially, may also improve the mechanism of Redis for key expiration. Therefore, the terms “key” and “field” (or “hash field”) might be used interchangeably. In addition, even if eventually it might be implemented only for hash field expiration, it allows us to challenge the current implementation and provide a good reference.
Redis uses a hash table DS to implement key expiration, wherein the insertion of a new key with TTL into the table is determined by the key's hash value. The pros and cons are:
Pros:
Cons:
As can be seen, using a hash-table for expiration brings with it one major advantage of O(1) access time to a given item, but on the other hand it raises several severe issues regarding active expiration (as highlighted in previous issues [#8700,#10802,#9480]). An alternative DS obviously should be sorted by expiration time for efficient active expiration, and be able to scale from a few items to millions.
RAX
is a candidate that keeps reoccurring in that context [1,#9480], as it is already available and well tested. There are more advanced algorithms for caching and support for expiration [1,2] but they might require major change and they don't scale from a few items for the case of hash fields.One might claim that the downside of using
RAX
is log(n) access time. This claim can be mitigated by keeping no more than 6 bytes of the expiration time (which is sufficient until the date of 02 August, 10889) and as a result the depth of the RADIX tree will be limited to 6 levels. Another concern might be thatRAX
is wasting too much memory for metadata . This can be mitigated as well by holding a segment of multiple items, instead of holding a single item at the leaf of the tree. Those segments will be optimized in memory such that the items themselves will embed the structure of the segment as a linked list. Furthermore, it also reduces the frequency of updates of the rax tree and resolves conflicts of items with the same expiration time.Having said that, the idea to embed metadata attached to the hash fields (or keys) is problematic as long as the hash fields, or keys, are of type
SDS
. Hence, as an alternative, we introduce a new structure, namedMSTR
, which stands for immutable STRing with Metadata. WhereasSDS
allows to keep only strings, withMSTR
we can add and remove metadata attached to the hash field (or key) string in a generic way. It might seem overkill to have general purposeMSTR
just for TTL, but it is likely that other features down the road, require attaching their own metadata to hash fields (or keys) as well.New DS: ebuckets
ebuckets
, which stands for expiry-buckets, is used to store items that are set with expiration-time. It supports the basic API of add, remove, and active expiration. Its implementation is based on aRAX
tree, or a plain linked list when small.The expiration time of an item is used as the key to traverse the
RAX
tree.The
ebuckets
data structure is organized hierarchically as follows:RAX
. Each leaf in theRAX
tree is a bucket.RAX
tree for each bucket represents low bound expiration time for the items within this ebucket and the key of the next bucket represents the upper bound expiration time.EB_SEG_MAX_ITEMS
(currently it is 16) items as a linked list. If needed to add an item to a full segment, then the segment will try to split the bucket. To avoid wasting memory, it is a singly linked list (single pointer to next item). The list is cyclic to allow efficient removal of items from the middle of the segment without necessarily traversing theRAX
tree.ebuckets
should embed theExpireMeta
structure and supply getter function.ExpireMeta
holds the expiration time of the item and a few more fields that are used to maintain the segments data-structure (Described at the end of this doc).The following diagram summarizes those ideas of using
RAX
, segments, items and embeddedExpireMeta
. It also gives a hint toMSTR
layout and how hash of type dict is going to integrate with it:Splitting bucket
Each segment can hold up-to
EB_SEG_MAX_ITEMS
items. On insertion of a new item, it will try to split the segment. Here is an example For adding another item to segment number 1 that already reached its maximum capacity which will cause to split of the segment and in turn split of the bucket as well to a finer grained ranges:Extending bucket
In the example above, the reason it wasn't split evenly, is that the segment must have been holding items with the same TTL and they must reside together in the same bucket after the split. Which brings us to another important point. If there is a segment that reaches its maximum capacity and all the items have the same expiration-time key, then we cannot split the bucket but aggregate all the items, with the same expiration time key, by allocating an extended-segment and chain it to the first segment in the visited bucket. In that sense, extended segments will only hold items with the same expiration-time key.
The following diagram describes in more details the way to extend a full segment and chain a new segment:
Memory evaluation
Optimal memory utilization occurs when multiple items share the same expiration time, as they can be stored efficiently within a single bucket utilizing extended segments, akin to a cost-effective linked list structure. Initially, let's disregard this scenario and assume that each item possesses a unique expiration time, resulting in all buckets operating without extended segments.
Note that:
ExpireMeta
.Use case: ebuckets contains few items. No more than
EB_LIST_MAX_ITEMS
When ebuckets contains no more than
EB_LIST_MAX_ITEMS
(=16) items, it is optimized not to allocate any memory. It just uses embeddedExpireMeta
to maintain its own data-structure as a plain list. Therefore, memory overhead is 16 bytes per item.Use case: Most items are removed via Active-Expire
If most of the items are being removed by active expiration, then sizes of the segments (buckets) will be an outcome of split and will hold at least 8 items and no more than 16 (
=EB_SEG_MAX_ITEMS
). So, it is expected to have an average of 12 items per bucket.Average of 12 items in bucket: 16 + 16/12 + 40/12 = 20.6 bytes
Use case: Items removed NOT only by Active-Expire
If expected to have many "non-sequential" removal of items from buckets, then many buckets can shrink down to 4 items in a bucket. Below it,
ebuckets
will try merge the bucket with adjacent one. So, it is expected to have an average of 10 items per segment bucket.Average of 10 items in bucket: 16 + 16/10 + 40/10 = 21.6 bytes
Worst case of 4 items in bucket: 16 + 16/4 + 40/4 = 30 bytes
Use case: All items with same expiration-time
In that case
ebuckets
will have a single bucket of type extended-segments with an unbounded number of items in it.ExpireMeta
becomes the dominant value andNextSegHdr
, of size 24 bytes, will be allocated per segment. Assuming 16 items per segment, we get:16 + 24/16 = 17.5 bytes
Performance Evaluation
ebuckets
and then actively expiring all of them.ebuckets
DS).rax
memory along with segment headers.To run the benchmark (currently, it is configured to 1 month distribution):
MSTR (immutable string with metadata)
SDS
string is widely used across the system and serves as a general purpose container to hold data. The need to optimize memory and aggregate strings along with metadata and store it into Redis data-structures as single bulk keep reoccur. One thought might be to attach metadata toSDS
. The trouble is thatSDS
is a mutable string in its nature, with a wide API (split, join, etc.). Pushing metadata logic intoSDS
will make it very fragile, and complex to maintain.As an alternative, we introduce a new concept of immutable strings, simplified, with limited API, and with the option to attach metadata. This idea isn’t new and was suggested in different contexts. The new thing here is defining it as infrastructure to different kinds of use cases. One use case can be attaching TTL to hash fields. Another use case can be attaching TTL and encoding to keys. Etc.
The representation of the string, without any metadata, in its basic form, resembles
SDS
but without the API to manipulate the string. The following diagram shows the memory layout of mstring (mstrhdr8) when no metadata is attached:If metadata flag is set, depicted in diagram above as
m-bit
in the diagram, then the header will be preceded with additional 16 bits of metadata flags such that if i'th bit is set, then corresponding i'th metadata structure is attached to the mstring. The metadata layout and their sizes are defined bymstrKind
structure (More below). The following diagram shows the memory layout ofMSTR
(mstrhdr8
) when 3 bits inmFlags
are set to indicate that 3 fields of metadata are attached to the mstring at the beginning.Note: Initially
MSTR
was designed to support implict conversion fromSDS
toMSTR
. That is, the layout ofMSTR
without any metadata attached is aligned withSDS
layout. But later on it was decided to discard this approach and to strive for explicit conversion. This decision can be challenged in the future as the code evolves.Kinds of MSTR
The
MSTR
allows defining different kinds (classes) of mstrings, each with its own unique metadata layout. For example, in case of hash fields, all instances of it can optionally have TTL metadata attached to it. This is achieved by prototyping oncemstrKind
structure that defines the metadata layout and metadata sizes of this specific kind.Here is
mstrKind
prototype for hash fields:Note that each hash field instance still has the freedom to optionally attach the expiration metadata to it. Most of them will start their life without any metadata attached.
In the future, the keys of Redis keyspace can be another kind of
MSTR
that aggregate TTL, LRU, or even dictEntry metadata embedded into a single allocation. Here is a general idea of how it might look like:This idea is further elaborated in Appendix B.
Alignment issues
There are two types of alignments to take into consideration:
Alignment of the metadata - As the metadatas layout are reversed to their enumeration, it is recommended to put metadata with "better" alignment first in memory layout (enumerated last) and the worst, or those that simply don't require any alignment will be last in memory layout (enumerated first). This is similar to the applied consideration when defining a new struct in C. Note also that each metadata might either be attached to
MSTR
or not which complicates a little the design phase of a newmstrKind
. In the example above, HKEY_META_VAL_REF_COUNT, with the worst alignment of 4 bytes, is enumerated first, and therefore, will be last in memory layout.Alignment of returned MSTR pointer - Few optimizations in Redis rely on the fact that
SDS
address is always an odd pointer. We can achieve the same with a little effort. It was already taken care that all headers of typemstrhdrX
has odd size. With that in mind, if a new kind ofMSTR
is required to be limited to odd addresses, then we must make sure that sizes of all related metadatas that are defined inmstrKind
are even in size.Put the pieces together
Each hash instance will maintain its own set of HFEs in its private
ebuckets
. It can be attached to dict structure as metadata. In order to support active expiration across hash instances, hashes with associated HFE will be also registered in globalebuckets
, per db, with expiration time value that reflects their next minimum time to expire. The global active expiration of HFE will be triggered from theactiveExpireCycle()
function and will invoke "local" HFE Active expiration for each hash instance that has expired fields.In addition, current implementation of hashes by dict will be modified from using
SDS
as fields tohfield
, which is kind ofMSTR
, in order to be able to attachExpireMeta
structure to the hash field. A field of hash can initially be allocated withoutExpireMeta
and following setting of TTL, it will be reallocated withExpireMeta
and will be added to theebuckets
of the hash. If the new TTL of the hash field is also the new minimum TTL of the hash, then the hash will be updated in the globalebuckets
as well.Lazy Expiration
Conceptually, every time before a hash object is accessed, it is needed to delete all expired fields. This approach, even though will bring the object to its desired state, might freeze the main thread for a long time in case there considerable amount of items to expire in a short period of time. As an alternative, we can refine the action taken before each command, to better suit and reduce the risk of stucking with a long expiry operation.
Extend expireIfNeeded() to support HFE
Today function
expireIfNeeded()
handles the lazy-expiration logic of keys. The main way this function is called is vialookupKey*()
family of functions.It is required to extend it to identify expiration of hashs in case all their hash fields are with TTL and all got expired. Which means that the hash itself should be expired as well.
A new function
hashTypeIsEmpty()
will be introduced tot_hash.c
. It will apply the following steps (return 0 if false):dbSize = dictSize(d)
;trash
, thenreturn dbSize == 0;
return 0;
.ebuckets
) is less than dbSize, thenreturn 0;
ebuckets
is less than current timereturn 1
.Commands that require careful execution
Here is non-finite list of commands that requires careful execution:
HGETALL
Will not actually expire fields, but rather just filter them from the reply.
HLEN (OPEN)
As we would like to return the correct value that doesn't count expired fields, we can remove expired fields before running the command.
Alternatively,
ebuckets
can support dry-run for active-expire in which it can scan the buckets, starting from the first, until it reaches a bucket with an expiration time that is not below “now” and aggregate along the way the number of items that are expired. Note that it doesn’t need to iterate the items in each bucket except maybe the last one, in case it is not of type extended segment (i.e. not more than 16 items). This operation is highly efficient, typically completing no more than 50msec for 10 million items.HLEN
command will return hash size minusexExpireDryRun()
.HRANDFIELD (OPEN)
For now, we shall remove all expired fields and then run the command.
This function can be further optimized.
HSCAN
Will lazy-expire (if master, otherwise will just filter out) fields (may result in an empty array) - like SCAN.
This function can be further optimized.
API
HEXPIRE group
For each specified field: set the expiration time in sec/msec/unix-sec/unix-msec:
(Similar to EXPIRE, PEXPIRE, EXPIREAT, PEXPIREAT)
HPERSIST command
For each specified field: remove the expiration time.
HEXPIRETIME group
For each specified field: get the expiration time in sec/msec:
(Similar to EXPIRETIME, PEXPIRETIME)
HTTL group
For each specified field: get the remaining time to live in sec/msec:
(Similar to TTL, PTTL)
HGETF command (TODO)
For each specified field get its value and optionally set the field's expiration time in sec/msec /unix-sec/unix-msec:
(Similar to GETEX)
HSETF command (TODO)
For each specified field value pair: set field to value and optionally set the field's expiration time in sec/msec /unix-sec/unix-msec:
Inspired by both HSET and SET.
Appendix A: Structure ExpireMeta
ExpireMeta
struct should be embedded inside the item that needs to be stored inebuckets
:Appendix B: Enhance EXPIRE keyspace (Idea for future work)
Now that we have a functioning
MSTR
andebuckets
for hash fields, we can consider challenging the existing keyspace expiration implementation with the same approach and even take it further.As a first step, we can try
ebuckets
instead of hashtable DS forEXPIRE
. This also requires to modifySDS
-keys to be based on kind ofMSTR
and attaching itExpireMeta
, just like we did for hash fields in this suggestion.The next step might be to consider even to merge
dictEntry
androbj
into keys which are kind ofMSTR
, after the previous step.TODOs
As listpack not support yet, take care to
config set hash-max-listpack-entries 0
before trying HFE API.
ebuckets
required to support merging small adjacent buckets and segments.HSETF
andHGETF
db->hexpires
to cluster mode.