
Always replicate TTLs as absolute timestamps in milliseconds #8474

Merged
merged 2 commits into redis:unstable on May 30, 2021

Conversation

ny0312
Contributor

@ny0312 ny0312 commented Feb 8, 2021

Related to #8433


Part I

This commit makes Redis always replicate time-to-live (TTL) values as absolute Unix timestamps in milliseconds. With this commit, all mutations in Redis are propagated the same way to both AOF and the replication stream; there is no longer any special command rewrite/translation for AOF only.

This commit aims to mitigate two issues, as discussed in #8433:

  1. When TTLs are replicated as relative values, keys might outlive their intended expiration deadline on replicas due to significant replication lag. Suppose a two-node shard with one primary and one replica:
  • At T1, a client issues SET K V EX 10 on the primary, setting the TTL of key K to 10 seconds.
  • At T2, where T2 > T1, the replica receives the SET command, the gap being replication lag.
  • At T1+9s, 1 second before the key would expire, the primary dies. The replica takes over.
  • At T2+10s, the key is finally expired by the new primary.

In aggregate, the key lived for T2+10s-T1 of wall time, where T2-T1 is the replication lag between the old primary and the new primary. The larger the replication lag, the more the key outlives its intended lifetime.

  2. TTLs are replicated as absolute values in RDB but as relative values in the replication stream. As a result, after a full resynchronization, keys with similar TTLs could end up having very different lifespans on replicas. Suppose the following sequence:
  • At wall time T1, a client consecutively sets two keys A and B with the same relative TTL, say 3600 seconds (1 hour), on a primary, e.g. SET A a EX 3600 and SET B b EX 3600.
  • Now the primary creates an RDB to full-sync a replica. The RDB's boundary happens to fall between the two SETs, so A is captured in the RDB with absolute expiration timestamp T1+3600, but B is not.
  • Now suppose it takes the replica 2 hours to receive this RDB and start loading the data. A would be immediately expired, but B would live on for another hour.

This is a counter-intuitive experience for clients: the client set two keys to expire after 1 hour at roughly the same wall time, yet one outlived the other by an hour.
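The lag arithmetic in both scenarios can be made concrete with a small Python sketch (illustrative helper names, not Redis internals) contrasting relative vs. absolute TTL propagation:

```python
# Sketch: why relative TTL propagation lets a key outlive its deadline.
# All times are in seconds; the helpers are illustrative, not Redis internals.

def expiry_with_relative_ttl(t_set_on_primary, ttl, replication_lag):
    """The replica re-applies the relative TTL at its own receive time."""
    t_received = t_set_on_primary + replication_lag
    return t_received + ttl

def expiry_with_absolute_ttl(t_set_on_primary, ttl, replication_lag):
    """The replica stores the absolute deadline computed on the primary."""
    return t_set_on_primary + ttl  # replication lag no longer matters

# SET K V EX 10 issued at T1=100 with 5 seconds of replication lag:
print(expiry_with_relative_ttl(100, 10, 5))   # 115: outlives the deadline by 5s
print(expiry_with_absolute_ttl(100, 10, 5))   # 110: the intended deadline
```

After a failover to the replica, the relative scheme expires K at 115 while the client intended 110, which is exactly scenario 1 above.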


Part II

Introduce a new command, EXPIRETIME, that returns the absolute Unix timestamp at which a key will expire:

EXPIRETIME key TIME|PTIME

Available since 6.?

Time complexity: O(1)

Returns the absolute Unix timestamp (since January 1, 1970) at which the given key will expire.

Options

The EXPIRETIME command supports a set of options that modify its behavior:

  • TIME -- Returns the Unix timestamp in seconds.
  • PTIME -- Returns the Unix timestamp in milliseconds.

Return value

Integer reply: the absolute expiration Unix timestamp, or a negative value in order to signal an error:

  • The command returns -2 if the key does not exist.
  • The command returns -1 if the key exists but has no associated expire.
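For illustration, the return-value contract above can be modeled in a few lines of Python (a toy key store, not the actual Redis implementation):

```python
import time

# Toy model of the EXPIRETIME return-value contract described above;
# `store` maps key -> absolute expiration in ms, or None for no expiry.
def expiretime_ms(store, key):
    if key not in store:
        return -2      # key does not exist
    if store[key] is None:
        return -1      # key exists but has no associated expire
    return store[key]  # absolute Unix timestamp in milliseconds

deadline = int(time.time() * 1000) + 10_000
store = {"volatile": deadline, "persistent": None}
assert expiretime_ms(store, "missing") == -2
assert expiretime_ms(store, "persistent") == -1
assert expiretime_ms(store, "volatile") == deadline
```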

Contributor

@madolson madolson left a comment


Thanks!

@madolson madolson added the state:major-decision Requires core team consensus label Feb 10, 2021
@madolson
Contributor

@redis/core-team Ok, probably time for a wider review. Nan outlined the two major additions in the top comment. The first one should be mostly agreed upon. The second item is just a mechanism to test that the absolute-timestamp propagation is happening correctly; the current implementation minimizes the amount of code.

madolson
madolson previously approved these changes Feb 10, 2021
@oranagra
Member

i'm sorry, i didn't follow the discussion in this PR and issue (wasn't aware of them).
since this goes against a past decision, i would like to study both the correspondence here (didn't read it yet), and the past discussions in other issues and PRs (where Salvatore responded).
however i'm currently a bit short on time, and in any case i'm not sure we wanna merge such a change into 6.2 (already past its last RC). since this is not an urgent change, i wanna propose that we hold this for a future version.
does this make sense? i might be overthinking this, but why take the risk?

@madolson
Contributor

@oranagra, I was under the impression this would not be in 6.2 given that we said only bug fixes for the last RC. You can take as much time as you need to think it through :)

madolson
madolson previously approved these changes Feb 24, 2021
Contributor

@madolson madolson left a comment


Update LGTM, still waiting on the major change decision.

@QuChen88
Collaborator

@oranagra I just read through the other related issues on this topic, #8433 and #5171. I am not entirely convinced by Salvatore's original argument that the TTL should be relative to the moment the key was written onto the Redis server, which makes it subject to the replication delay on the replica. From the point of view of the Redis user, I am writing a key with a TTL onto the master, and the expectation is that the key should expire after that duration from this moment on. So I also lean towards having the replica attempt to expire the item at the same point in time as the master, regardless of the replication lag between master and replica.

I just looked at this PR from @ny0312 and I think the implementation mostly looks good. I raised a few minor comments, mainly around introducing the new EXPIRETIME command. Please take a look. Thanks!

@oranagra
Member

Sorry for the delay, also sorry to be the last to join the party.

TL;DR: i'm voting in favor of this PR (didn't read the code yet).

The main reason is that we're trying to fight two contradicting issues here:

  1. a significant clock de-sync between master and replica. i'm not sure such a thing really exists, and if it does, there are ways to fix it (NTP and others).
  2. significant replication delays (either due to PSYNC reading from the backlog, or due to large replica-buffers accumulated during full sync). these are very real, and there's no easy way to mitigate them.

Item 2 can also be said to be affected by trivial network latency, but that's not in the same ballpark as the other two replication-delay issues.

On top of that there's the concern of code simplicity and consistency, and it would certainly be better to replicate the same to AOF and replicas, and have the same type of (absolute) value in full-sync and replication command stream.

I don't agree with some of the arguments that were posted, but the bottom line is that these two concerns contradict each other, and one of these concerns can be eliminated while the other can't (at least not easily).

Now that we seem to agree about this change we need to decide when it's safe to merge. if we conclude that we don't wanna release this change in 6.2.x, it would be better to leave it out of unstable for a while, so that unstable doesn't diverge too far from 6.2, making it harder to cherry pick bug fixes.


Footnotes:

I think what matters here is actually the wall-clock of the client machine. Requesting a key to expire in 3 minutes means 3 minutes since the command was sent, or since the time it was processed by the master (the client machine's wall-clock at that time).

And as Salvatore mentioned, if the replica clock is 2 minutes ahead of the master's clock, and the client is using the replica to accelerate reads, then from the client's perspective the key will appear to expire on the replica after one minute (and remain in the master for another 2).
So putting aside full-sync, psync, and failovers and other complications, this is what we break with this PR.
But on the bright side, i think clock offsets can and should be resolved with tech, so this should not be a real problem.

If we really wanted to solve it anyway, what we can do with quite a lot of extra complexity is:

  1. when the replica first connects to the master, it sends a TIME command before it attempts to sync (this response will be processed right away, no backlog or replica-buffers delays). this way it knows both the clock difference between them, and also the network latency.
  2. when master replicates relative TTLs to replicas (goes into the backlog and replica buffers which can be received with a huge delay) it replicates both the relative TTL and its current clock.

this way the replica knows how to respect the client's intent (imagine it knows the client machine's wall-clock at the time the command was processed by the master).
The replica will need to translate both the absolute TTLs it gets from the RDB or command stream, and also the relative ones (which come together with the master's clock), using its own clock diff (add 2 mins in our example).

I don't think we wanna go this way...
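For what it's worth, the two-step scheme above reduces to a small translation on the replica; a minimal Python sketch (hypothetical names, assuming the TIME reply traveled for half the measured round trip):

```python
# Sketch of the hypothetical handshake-plus-translation scheme above.
# All times are in seconds; names are illustrative, not Redis internals.

def clock_offset(replica_now, master_time_reply, round_trip):
    """Estimate replica_clock - master_clock from a TIME handshake,
    assuming the reply traveled for half the measured round trip."""
    return replica_now - (master_time_reply + round_trip / 2)

def local_deadline(relative_ttl, master_clock_at_send, offset):
    """Translate a (relative TTL, master clock) pair into the replica's
    clock, honoring the client's intent despite replication delay."""
    return master_clock_at_send + relative_ttl + offset

# Replica clock runs 120s ahead of the master. A key is set with EX 180
# at master time 1000; the replica should expire it at its local 1300.
offset = clock_offset(replica_now=1120, master_time_reply=1000, round_trip=0)
print(offset)                              # 120.0
print(local_deadline(180, 1000, offset))   # 1300.0
```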

@ny0312
Contributor Author

ny0312 commented Mar 17, 2021

@oranagra Appreciate the thoughtful comments.

I agree. Two things:

  1. Re: when it's safe to merge. I understand your argument. The counter-argument is that merging it sooner gives us a longer runway before its release: more time and opportunity to see how it meshes with the rest of Redis and whether it has bugs or unforeseen consequences.

  2. Re: Out-of-sync clock on replica, the issue you described in "footnotes". This is a good point. I agree this is what this PR "breaks". If we do care about such issues, I see at least two options that could help:

(1) Stop replicas from expiring (hiding) keys independently based on their local clock. This is basically what Madelyn was referring to in #8433, which makes expiry "linearizable" across the cluster. So basically, all expiration is entirely driven by masters; replicas would keep serving TTL keys until they receive the DEL commands for them from the master.

With this scheme, TTL will essentially become a logical-time-based concept - A TTL key’s life span is measured in terms of replication offsets. Its starting offset is when a master first received the TTL. Its end offset is when a master expires it. The key lives for the same span of offsets on every node, regardless how much wall time it takes each node to replicate/traverse through that span.

(2) Another option is like what you suggested - replicate master's local time. E.g. master would periodically send its current local time to its replicas via replication stream. Upon receiving these timestamps, replicas would reset their local clock accordingly. This way, as long as time's "velocity" is roughly the same between master and replicas, TTL keys should live relatively the same lifetime on them.

Either way, even though it is related, I think it's better if we tackle this issue separately and leave it out of scope for this particular PR. I would love to hear what you think.
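To make option (1) concrete, here is a minimal Python sketch (hypothetical names, not Redis code) of master-driven expiry, where a replica never consults a clock and drops a key only when the replicated DEL reaches it:

```python
# Sketch of option (1): expiry driven entirely by the master. The master
# checks its clock and replicates DELs; replicas expire nothing on their own.

class Master:
    def __init__(self):
        self.expires = {}   # key -> absolute expiration deadline
        self.stream = []    # outgoing replication stream

    def set_with_ttl(self, key, deadline):
        self.expires[key] = deadline
        self.stream.append(("SET", key))

    def tick(self, now):
        for key, deadline in list(self.expires.items()):
            if now >= deadline:
                del self.expires[key]
                self.stream.append(("DEL", key))  # replicas expire via DEL

class Replica:
    def __init__(self):
        self.keys = set()

    def apply(self, op, key):
        if op == "DEL":
            self.keys.discard(key)
        else:
            self.keys.add(key)

master, replica = Master(), Replica()
master.set_with_ttl("K", deadline=10)
master.tick(now=11)                 # master expires K and emits a DEL
for op, key in master.stream:
    replica.apply(op, key)
assert "K" not in replica.keys      # replica dropped K only via the DEL
```

In this scheme a key's lifespan is bounded by replication offsets rather than each node's wall clock, as described above.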

@oranagra oranagra merged commit 53d1acd into redis:unstable May 30, 2021
@oranagra
Member

finally merged (not sure why it was left waiting for the last month).
@ny0312 would you care to create a doc PR?

@ny0312
Contributor Author

ny0312 commented May 30, 2021

Nice. I will create a doc PR this week and reference the link here.

Thanks again for the reviews.

@oranagra
Member

FYI: valgrind and our freebsd CI are so slow that more than 10 seconds passed from the time the command was sent to the time it was executed: #9010

@oranagra oranagra moved this from In Review to Done in 7.0 May 31, 2021
@ny0312 ny0312 deleted the absolute_expire branch May 31, 2021 20:14
@ny0312
Contributor Author

ny0312 commented May 31, 2021

@madolson @oranagra Follow-up PR for API documentation on EXPIRETIME and PEXPIRETIME commands: redis/redis-doc#1582

Please review. Thanks.

@oranagra
Member

oranagra commented Jun 2, 2021

i see the valgrind run still fails despite my fix (changing the timeout from 10s to 100s).
but more interestingly, note the timestamps (it's not as slow as i thought):

2021-06-02T00:40:04.1835934Z [ok]: All time-to-live(TTL) in commands are propagated as absolute timestamp in milliseconds in AOF
2021-06-02T00:40:04.8603956Z [err]: All TTL in commands are propagated as absolute timestamp in replication stream in tests/unit/expire.tcl
2021-06-02T00:40:04.8608919Z Expected 'del a' to match 'set foo1 bar PXAT *' (context: type source line 778 file /home/runner/work/redis/redis/tests/test_helper.tcl cmd {assert_match [lindex $patterns $j] [read_from_replication_stream $s]} proc ::assert_replication_stream level 1)

@ny0312 maybe you can please look into that?

@eliblight
Contributor

@oranagra I am taking a look at this

@ny0312
Contributor Author

ny0312 commented Jun 10, 2021

Found the root cause and fixed: #9069

@oranagra please take a look

JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Sep 8, 2021
…onds (redis#8474)

Till now, on replica full-sync we used to transfer absolute time for TTL,
however when a command arrived (EXPIRE or EXPIREAT),
we used to propagate it as is to replicas (possibly with relative time),
but always translate it to EXPIREAT (absolute time) to AOF.

This commit changes that and will always use absolute time for propagation.
see discussion in redis#8433

Furthermore, we introduce new commands: `EXPIRETIME`/`PEXPIRETIME`
that allow extracting the absolute TTL time from a key.
@oranagra oranagra added the breaking-change This change can potentially break existing application label Jan 23, 2022
oranagra added a commit that referenced this pull request Dec 9, 2022
There is overhead on the Redis 7.0 EXPIRE command that is not present on 6.2.7.

We could see that on the unstable profile around 7% of CPU cycles are
spent on rewriteClientCommandVector that are not present on 6.2.7.
This was introduced in #8474.
This PR reduces the overhead by using rewriteClientCommandArgument twice instead of
rewriteClientCommandVector. In this scenario rewriteClientCommandVector creates 4 arguments;
the above usage of rewriteClientCommandArgument cuts the overhead in half.

This PR should also improve PEXPIREAT performance by avoiding
rewriteClientCommandArgument usage entirely.

Co-authored-by: Oran Agra <oran@redislabs.com>
Labels
  • breaking-change: This change can potentially break existing applications
  • release-notes: this issue needs to be mentioned in the release notes
  • state:major-decision: Requires core team consensus
  • state:needs-doc-pr: requires a PR to the redis-doc repository
  • state:to-be-merged: The PR should be merged soon, even if not yet ready; this is used so that it won't be forgotten