
Always replicate TTLs as absolute timestamps in milliseconds #8474

Merged
merged 2 commits into redis:unstable on May 30, 2021

Conversation

ny0312
Contributor

@ny0312 ny0312 commented Feb 8, 2021

Related to #8433


Part I

This commit makes Redis always replicate time-to-live (TTL) values as absolute Unix timestamps in milliseconds. With this commit, all mutations in Redis are propagated the same way to both AOF and the replication stream; there is no longer any special command rewrite/translation for AOF only.

This commit aims to mitigate two issues, as discussed in #8433:

  1. When TTLs are replicated as relative values, keys might outlive their intended expiration deadline on replicas due to significant replication lag. Suppose a two-node shard with one primary and one replica:
  • At T1, a client issues SET K V EX 10 on the primary, setting the TTL of key K to 10 seconds.
  • At T2, where T2 > T1, the replica receives the SET command, the gap being replication lag.
  • At T1+9s, 1 second before the key would expire, the primary dies. The replica takes over.
  • At T2+10s, the key is finally expired by the new primary.

In aggregate, the key lived for T2+10s-T1 of wall time, where T2-T1 is the replication lag between the old primary and the new primary. The larger the replication lag, the more the key outlives its intended lifetime.

  2. TTLs are replicated as absolute values in RDB but as relative values in the replication stream. As a result, after a full resynchronization, keys with similar TTLs could end up having very different lifespans on replicas. Suppose the following sequence:
  • At wall time T1, a client consecutively sets two keys A and B with the same relative TTL, say 3600 seconds (1 hour), on a primary, e.g. SET A a EX 3600 and SET B b EX 3600.
  • Now the primary creates an RDB to full-sync a replica. The RDB's boundary happens to fall between the two SETs, so A is captured in the RDB with absolute expiration timestamp T1+3600, but B is not.
  • Now suppose it takes the replica 2 hours to receive this RDB and start loading the data. A would be immediately expired, but B would live on for another hour.

This is a counter-intuitive experience for clients: the client set two keys to expire after 1 hour at roughly the same wall time, yet one outlived the other by an hour.
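The lag arithmetic in both scenarios can be made concrete with a small Python sketch (illustrative helper names, not Redis internals) contrasting relative vs. absolute TTL propagation:

```python
# Sketch: why relative TTL propagation lets a key outlive its deadline.
# All times are in seconds; the helpers are illustrative, not Redis internals.

def expiry_with_relative_ttl(t_set_on_primary, ttl, replication_lag):
    """The replica re-applies the relative TTL at its own receive time."""
    t_received = t_set_on_primary + replication_lag
    return t_received + ttl

def expiry_with_absolute_ttl(t_set_on_primary, ttl, replication_lag):
    """The replica stores the absolute deadline computed on the primary."""
    return t_set_on_primary + ttl  # replication lag no longer matters

# SET K V EX 10 issued at T1=100 with 5 seconds of replication lag:
print(expiry_with_relative_ttl(100, 10, 5))   # 115: outlives the deadline by 5s
print(expiry_with_absolute_ttl(100, 10, 5))   # 110: the intended deadline
```

After a failover to the replica, the relative scheme expires K at 115 while the client intended 110, which is exactly scenario 1 above.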


Part II

Introduce a new command, EXPIRETIME, that returns the absolute Unix timestamp at which a key will expire:

EXPIRETIME key TIME|PTIME

Available since 6.?

Time complexity: O(1)

Returns the absolute Unix timestamp (since January 1, 1970) at which the given key will expire.

Options

The EXPIRETIME command supports a set of options that modify its behavior:

  • TIME -- Returns the Unix timestamp in seconds.
  • PTIME -- Returns the Unix timestamp in milliseconds.

Return value

Integer reply: the absolute expiration Unix timestamp, or a negative value in order to signal an error:

  • The command returns -2 if the key does not exist.
  • The command returns -1 if the key exists but has no associated expire.
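For illustration, the return-value contract above can be modeled in a few lines of Python (a toy key store, not the actual Redis implementation):

```python
import time

# Toy model of the EXPIRETIME return-value contract described above;
# `store` maps key -> absolute expiration in ms, or None for no expiry.
def expiretime_ms(store, key):
    if key not in store:
        return -2      # key does not exist
    if store[key] is None:
        return -1      # key exists but has no associated expire
    return store[key]  # absolute Unix timestamp in milliseconds

deadline = int(time.time() * 1000) + 10_000
store = {"volatile": deadline, "persistent": None}
assert expiretime_ms(store, "missing") == -2
assert expiretime_ms(store, "persistent") == -1
assert expiretime_ms(store, "volatile") == deadline
```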

Contributor

@madolson madolson left a comment


Thanks!

@madolson madolson added the state:major-decision Requires core team consensus label Feb 10, 2021
@madolson
Contributor

@redis/core-team Ok, probably time for a wider review. Nan outlined the two major additions in the top comment. The first one should be mostly agreed upon. The second item is just a mechanism to test that the absolute-timestamp propagation is happening correctly; the current implementation minimizes the amount of code.

madolson
madolson previously approved these changes Feb 10, 2021
@oranagra
Member

i'm sorry, i didn't follow the discussion in this PR and issue (wasn't aware of them).
since this goes against a past decision, i would like to study both the correspondence here (didn't read it yet), and the past discussions in other issues and PRs (where Salvatore responded).
however i'm currently a bit short on time, and in any case i'm not sure we wanna merge such a change into 6.2 (already past its last RC). since this is not an urgent change, i wanna propose that we hold this for a future version.
does this make sense? i might be overthinking this, but why take the risk?

@madolson
Contributor

@oranagra, I was under the impression this would not be in 6.2 given that we said only bug fixes for the last RC. You can take as much time as you need to think it through :)

madolson
madolson previously approved these changes Feb 24, 2021
Contributor

@madolson madolson left a comment


Update LGTM, still waiting on the major change decision.

@QuChen88
Collaborator

@oranagra I just read through the other related issues on this topic, #8433 and #5171. I am not entirely convinced by Salvatore's original argument that the TTL should be relative to the moment the key was written onto the Redis server, which makes it subject to the replication delay on the replica. From the point of view of the Redis user, I am writing a key with a TTL onto the master, and the expectation is that the key should expire after that duration from this moment on. So I also lean towards having the replica attempt to expire the item at the same point in time as the master, regardless of the replication lag between master and replica.

I just looked at this PR from @ny0312 and I think the implementation mostly looks good. I raised a few minor comments, mainly around introducing the new EXPIRETIME command. Please take a look. Thanks!

@oranagra
Member

Sorry for the delay, also sorry to be the last to join the party.

TL;DR: i'm voting in favor of this PR (didn't read the code yet).

The main reason is that we're trying to fight two contradicting issues here:

  1. a significant clock de-sync between master and replica. i'm not sure such a thing really exists, and if it does, there are ways to fix it (NTP and others).
  2. significant replication delays (either due to PSYNC reading from the backlog, or due to large replica-buffers accumulated during full sync). these are very real, and there's no easy way to mitigate them.

Item 2 can also be said to be affected by trivial network latency, but that's not in the same ballpark as the other two replication-delay issues.

On top of that there's the concern of code simplicity and consistency, and it would certainly be better to replicate the same to AOF and replicas, and have the same type of (absolute) value in full-sync and replication command stream.

I don't agree with some of the arguments that were posted, but the bottom line is that these two concerns contradict each other, and one of these concerns can be eliminated while the other can't (at least not easily).

Now that we seem to agree about this change we need to decide when it's safe to merge. if we conclude that we don't wanna release this change in 6.2.x, it would be better to leave it out of unstable for a while, so that unstable doesn't diverge too far from 6.2, making it harder to cherry pick bug fixes.


Footnotes:

I think what matters here is actually the wall-clock of the client machine. Requesting a key to expire in 3 minutes means 3 minutes since the command was sent, or since the time it was processed by the master (the client machine's wall-clock at that time).

And as Salvatore mentioned, if the replica clock is 2 minutes ahead of the master's clock, and the client is using the replica to accelerate reads, then from the client's perspective the key will appear to expire on the replica after one minute (and remain in the master for another 2).
So putting aside full-sync, psync, and failovers and other complications, this is what we break with this PR.
But on the bright side, i think clock offsets can and should be resolved with tech, so this should not be a real problem.

If we really wanted to solve it anyway, what we can do with quite a lot of extra complexity is:

  1. when the replica first connects to the master, it sends a TIME command before it attempts to sync (this response will be processed right away, no backlog or replica-buffers delays). this way it knows both the clock difference between them, and also the network latency.
  2. when master replicates relative TTLs to replicas (goes into the backlog and replica buffers which can be received with a huge delay) it replicates both the relative TTL and its current clock.

this way the replica knows how to respect the client's intent (imagine it knows the client machine's wall-clock at the time the command was processed by the master).
The replica will need to translate both the absolute TTLs it gets from the RDB or command stream, and also the relative ones (which come together with the master's clock), using its own clock diff (add 2 mins in our example).

I don't think we wanna go this way...
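For what it's worth, the two-step scheme above reduces to a small translation on the replica; a minimal Python sketch (hypothetical names, assuming the TIME reply traveled for half the measured round trip):

```python
# Sketch of the hypothetical handshake-plus-translation scheme above.
# All times are in seconds; names are illustrative, not Redis internals.

def clock_offset(replica_now, master_time_reply, round_trip):
    """Estimate replica_clock - master_clock from a TIME handshake,
    assuming the reply traveled for half the measured round trip."""
    return replica_now - (master_time_reply + round_trip / 2)

def local_deadline(relative_ttl, master_clock_at_send, offset):
    """Translate a (relative TTL, master clock) pair into the replica's
    clock, honoring the client's intent despite replication delay."""
    return master_clock_at_send + relative_ttl + offset

# Replica clock runs 120s ahead of the master. A key is set with EX 180
# at master time 1000; the replica should expire it at its local 1300.
offset = clock_offset(replica_now=1120, master_time_reply=1000, round_trip=0)
print(offset)                              # 120.0
print(local_deadline(180, 1000, offset))   # 1300.0
```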

@ny0312
Contributor Author

ny0312 commented Mar 17, 2021

@oranagra Appreciate the thoughtful comments.

I agree. Two things:

  1. Re: when it's safe to merge. I understand your argument. The counter-argument is that merging it sooner gives us a longer runway before its release: more time and opportunity to see how it meshes with the rest of Redis and whether it has bugs or unforeseen consequences.

  2. Re: Out-of-sync clock on replica, the issue you described in "footnotes". This is a good point. I agree this is what this PR "breaks". If we do care about such issues, I see at least two options that could help:

(1) Stop replicas from expiring (hiding) keys independently based on their local clock. This is basically what Madelyn was referring to in #8433, which makes expiry "linearizable" across the cluster. So basically, all expiration is entirely driven by masters; replicas would keep serving TTL keys until they receive the DEL commands for them from the master.

With this scheme, TTL will essentially become a logical-time-based concept - A TTL key’s life span is measured in terms of replication offsets. Its starting offset is when a master first received the TTL. Its end offset is when a master expires it. The key lives for the same span of offsets on every node, regardless how much wall time it takes each node to replicate/traverse through that span.

(2) Another option is like what you suggested - replicate master's local time. E.g. master would periodically send its current local time to its replicas via replication stream. Upon receiving these timestamps, replicas would reset their local clock accordingly. This way, as long as time's "velocity" is roughly the same between master and replicas, TTL keys should live relatively the same lifetime on them.

Either way, even though it is related, I think it's better if we tackle this issue separately and leave it out of scope for this particular PR. I would love to hear what you think.
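To make option (1) concrete, here is a minimal Python sketch (hypothetical names, not Redis code) of master-driven expiry, where a replica never consults a clock and drops a key only when the replicated DEL reaches it:

```python
# Sketch of option (1): expiry driven entirely by the master. The master
# checks its clock and replicates DELs; replicas expire nothing on their own.

class Master:
    def __init__(self):
        self.expires = {}   # key -> absolute expiration deadline
        self.stream = []    # outgoing replication stream

    def set_with_ttl(self, key, deadline):
        self.expires[key] = deadline
        self.stream.append(("SET", key))

    def tick(self, now):
        for key, deadline in list(self.expires.items()):
            if now >= deadline:
                del self.expires[key]
                self.stream.append(("DEL", key))  # replicas expire via DEL

class Replica:
    def __init__(self):
        self.keys = set()

    def apply(self, op, key):
        if op == "DEL":
            self.keys.discard(key)
        else:
            self.keys.add(key)

master, replica = Master(), Replica()
master.set_with_ttl("K", deadline=10)
master.tick(now=11)                 # master expires K and emits a DEL
for op, key in master.stream:
    replica.apply(op, key)
assert "K" not in replica.keys      # replica dropped K only via the DEL
```

In this scheme a key's lifespan is bounded by replication offsets rather than each node's wall clock, as described above.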

@oranagra oranagra merged commit 53d1acd into redis:unstable May 30, 2021
@oranagra
Member

finally merged (not sure why it was left waiting for the last month).
@ny0312 would you care to create a doc PR?

@ny0312
Contributor Author

ny0312 commented May 30, 2021

Nice. I will create a doc PR this week and reference the link here.

Thanks again for the reviews.

@oranagra
Member

FYI: valgrind and our freebsd CI are so slow that more than 10 seconds passed from the time the command was sent to the time it was executed: #9010

@oranagra oranagra moved this from In Review to Done in 7.0 May 31, 2021
@ny0312 ny0312 deleted the absolute_expire branch May 31, 2021 20:14
@ny0312
Contributor Author

ny0312 commented May 31, 2021

@madolson @oranagra Follow-up PR for API documentation on EXPIRETIME and PEXPIRETIME commands: redis/redis-doc#1582

Please review. Thanks.

@oranagra
Member

oranagra commented Jun 2, 2021

i see the valgrind run still fails despite my fix (changing the timeout from 10s to 100s).
but more interestingly, note the timestamps (it's not as slow as i thought):

2021-06-02T00:40:04.1835934Z [ok]: All time-to-live(TTL) in commands are propagated as absolute timestamp in milliseconds in AOF
2021-06-02T00:40:04.8603956Z [err]: All TTL in commands are propagated as absolute timestamp in replication stream in tests/unit/expire.tcl
2021-06-02T00:40:04.8608919Z Expected 'del a' to match 'set foo1 bar PXAT *' (context: type source line 778 file /home/runner/work/redis/redis/tests/test_helper.tcl cmd {assert_match [lindex $patterns $j] [read_from_replication_stream $s]} proc ::assert_replication_stream level 1)

@ny0312 maybe you can please look into that?

@eliblight
Contributor

@oranagra I am taking a look at this

@ny0312
Contributor Author

ny0312 commented Jun 10, 2021

Found the root cause and fixed: #9069

@oranagra please take a look

JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Sep 8, 2021
…onds (redis#8474)

Till now, on replica full-sync we used to transfer absolute time for TTL,
however when a command arrived (EXPIRE or EXPIREAT),
we used to propagate it as is to replicas (possibly with relative time),
but always translate it to EXPIREAT (absolute time) to AOF.

This commit changes that and will always use absolute time for propagation.
see discussion in redis#8433

Furthermore, we introduce new commands: `EXPIRETIME`/`PEXPIRETIME`
that allow extracting the absolute TTL time from a key.
@oranagra oranagra added the breaking-change This change can potentially break existing application label Jan 23, 2022
oranagra added a commit that referenced this pull request Dec 9, 2022
There is overhead on the Redis 7.0 EXPIRE command that is not present on 6.2.7.

We could see that on the unstable profile around 7% of CPU cycles are
spent on rewriteClientCommandVector that are not present on 6.2.7.
This was introduced in #8474.
This PR reduces the overhead by using rewriteClientCommandArgument twice instead of
rewriteClientCommandVector. In this scenario rewriteClientCommandVector creates 4 arguments;
the above usage of rewriteClientCommandArgument cuts the overhead in half.

This PR should also improve PEXPIREAT performance by avoiding
rewriteClientCommandArgument usage entirely.

Co-authored-by: Oran Agra <oran@redislabs.com>
Labels
  • breaking-change: This change can potentially break existing applications
  • release-notes: this issue needs to be mentioned in the release notes
  • state:major-decision: Requires core team consensus
  • state:needs-doc-pr: requires a PR to the redis-doc repository
  • state:to-be-merged: The PR should be merged soon, even if not yet ready; this is used so that it won't be forgotten