Optimization: Use either monotonic or wall-clock to measure command execution time, to regain up to 4% execution time #10502
Conversation
so you're saying this regression is from #7491 (part of redis 6.2), which started using the monotonic timer instead of mstime to measure command execution time stats?
void updateCachedTimeWithUs(int update_daylight_info, monotime ustime) {
    server.ustime = ustime;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think you're allowed to do that.
you're mixing a monotonic time (used to measure time deltas), with server.mstime (wall clock unix time).
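To make the mix-up concrete, here's a minimal standalone sketch (my illustration, not code from this PR): CLOCK_MONOTONIC counts from an arbitrary origin (often boot time), while gettimeofday() counts from the Unix epoch, so a monotime value stored into a field that consumers read as epoch microseconds yields nonsense timestamps.

```c
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

int main(void) {
    struct timespec mono;
    struct timeval wall;

    /* Monotonic clock: origin is arbitrary (often system boot). */
    clock_gettime(CLOCK_MONOTONIC, &mono);
    /* Wall clock: origin is the Unix epoch (1970-01-01). */
    gettimeofday(&wall, NULL);

    long long mono_us = (long long)mono.tv_sec * 1000000 + mono.tv_nsec / 1000;
    long long wall_us = (long long)wall.tv_sec * 1000000 + wall.tv_usec;

    /* The two values live on different scales (hours-since-boot vs.
     * decades-since-1970); assigning one where the other is expected,
     * as in server.ustime = <monotime>, corrupts the cached time. */
    printf("monotonic: %lld us, wall: %lld us\n", mono_us, wall_us);
    return 0;
}
```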
you're right @oranagra. Given we need ustime() I've gone the other way around and removed the monotonic clock call.
but now we're losing two things.
- the monotonic aspect of that time measurement (will be affected by wall clock shifts)
- the monotonic clock sampling can be much quicker (if we build redis to use the HW monotonic clock)
so we now reverted part of the "improvement" of #7491?
actually, even if not using the HW clock, i think clock_gettime(CLOCK_MONOTONIC) is a lot faster than gettimeofday.
but apparently two calls to gettimeofday are faster than one call to gettimeofday + two calls to clock_gettime,
i.e. gettimeofday * 2 < gettimeofday + 2 * clock_gettime 8-)
not sure what's the right way forward..
i suppose this change in #7491 was assumed to be an improvement and now when we realize the damage it did, we can revert it.
but ideally we should maybe find out at runtime which of them is faster? (maybe we can leave it for some future day)
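One way to find out on a given box which of them is cheaper is a crude micro-benchmark like the sketch below (my illustration, not from the thread; on Linux both calls usually go through the vDSO, which matches the observation that they cost about the same):

```c
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

#define ITERS 10000000

int main(void) {
    struct timespec ts;
    struct timeval tv, t0, t1;

    /* Time ITERS gettimeofday() calls. */
    gettimeofday(&t0, NULL);
    for (int i = 0; i < ITERS; i++) gettimeofday(&tv, NULL);
    gettimeofday(&t1, NULL);
    double gtod = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;

    /* Time ITERS clock_gettime(CLOCK_MONOTONIC) calls. */
    gettimeofday(&t0, NULL);
    for (int i = 0; i < ITERS; i++) clock_gettime(CLOCK_MONOTONIC, &ts);
    gettimeofday(&t1, NULL);
    double cgt = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;

    printf("gettimeofday: %.3fs, clock_gettime(CLOCK_MONOTONIC): %.3fs\n",
           gtod, cgt);
    return 0;
}
```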
As we discussed I've tried getMonotonicUs() (using https://github.com/filipecosta90/redis/tree/optimize.updateCachedTime.getMonotonicUs) vs ustime() (this branch).
On getMonotonicUs() I've tried both with/without using the HW clock (via make CFLAGS="-DUSE_PROCESSOR_CLOCK").
getMonotonicUs() default:
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets 0.00 --- --- --- --- --- --- 0.00
Gets 698452.68 698452.68 0.00 4.29455 3.66300 6.78300 7.29500 711335.51
Waits 0.00 --- --- --- --- --- --- ---
Totals 698452.68 698452.68 0.00 4.29455 3.66300 6.78300 7.29500 711335.51
getMonotonicUs() using HW clock (via make CFLAGS="-DUSE_PROCESSOR_CLOCK"):
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets 0.00 --- --- --- --- --- --- 0.00
Gets 711777.48 711777.48 0.00 4.21409 3.59900 6.65500 7.10300 724906.09
Waits 0.00 --- --- --- --- --- --- ---
Totals 711777.48 711777.48 0.00 4.21409 3.59900 6.65500 7.10300 724906.09
ustime()
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets 0.00 --- --- --- --- --- --- 0.00
Gets 697280.33 697280.33 0.00 4.30178 3.66300 6.84700 7.35900 710141.54
Waits 0.00 --- --- --- --- --- --- ---
Totals 697280.33 697280.33 0.00 4.30178 3.66300 6.84700 7.35900 710141.54
With no change to the compile flags they seem to be roughly equal @oranagra.
If we use the HW clock we can bump the numbers further.
@oranagra @filipecosta90 Did we benchmark the results of this patch (two calls to ustime()) vs the previous one call + two calls to getMonotonicUs() when USE_PROCESSOR_CLOCK is enabled?
@yoav-steinberg i think we didn't. but as i stated in my above message (considering the HW clock is nearly free), it should be somewhere in between.
i.e. between 697280.33 and 711777.48 (see the above benchmark).
considering a build without HW clock, this would be faster than what we currently have, since the monotonic clock seems to be as slow as the wall clock.
i.e. we'll get back the regression mentioned in #10460
Right. So I think we need to change this PR so that if USE_PROCESSOR_CLOCK is defined we use the HW clock for cmd duration, and if it's not defined we use the proposed logic.
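Something like this compile-time split, roughly (a sketch with stand-in helpers, not the final code; the PR eventually used a runtime check instead, see below):

```c
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

typedef uint64_t monotime;
typedef int64_t ustime_t;

/* Stand-ins for redis' clock helpers, simplified for this sketch. */
static ustime_t ustime(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (ustime_t)tv.tv_sec * 1000000 + tv.tv_usec;
}
static monotime getMonotonicUs(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (monotime)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* Compile-time selection of the command-duration clock. */
ustime_t timeCommand(void (*cmd)(void)) {
#ifdef USE_PROCESSOR_CLOCK
    /* HW TSC build: monotonic sampling is nearly free, keep using it. */
    monotime start = getMonotonicUs();
    cmd();
    return (ustime_t)(getMonotonicUs() - start);
#else
    /* POSIX build: clock_gettime() costs about as much as gettimeofday(),
     * so reuse the wall-clock sample needed anyway for updateCachedTime(). */
    ustime_t start = ustime();
    cmd();
    return ustime() - start;
#endif
}
```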
it'll uglify the code a bit, but i guess considering how central this is, maybe it would be ok.
@filipecosta90 can you do a POC and verify the impact is what we expect?
absolutely! Give me a couple of hours and I'll have the data.
PS: not sure why tests are still failing. Do you see a problem with the code change @oranagra?
Co-authored-by: Oran Agra <oran@redislabs.com>
This reverts commit d1b365d.
Let me see if I understand the conversation so far:
The logic is something like this:
This logic is correct. When the HW clock is used for monotonic, the extra monotonic samples are nearly free. However, if the HW clock is disabled, the monotonic samples cost as much as wall-clock reads. This would result in: a speedup for default (POSIX clock) builds, but giving up the cheap path for HW-clock builds.
If I understand correctly, the proposed change would choose to improve performance for some users at the expense of other users. I think that the only reasonable way to make this optimization would be to actually check the HW flag with code something like this:
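The code block of this comment didn't survive in this transcript; a plausible reconstruction, assuming the monotonicGetType()/MONOTONIC_CLOCK_HW helpers that this PR ended up adding to monotonic.h:

```c
#include "monotonic.h" /* monotonicGetType(), MONOTONIC_CLOCK_HW */

/* Only pay for monotonic sampling when it's backed by the HW TSC;
 * otherwise reuse the wall-clock sample call() needs anyway. */
static inline int useMonotonicForCommandTiming(void) {
    return monotonicGetType() == MONOTONIC_CLOCK_HW;
}
```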
This is not ideal. It's a bit ugly. More ideally, we would convert most of Redis to use the monotonic clock rather than needing a cached wall-clock time. If all of the code using the cached time was converted to use the monotonic clock, we wouldn't need any special logic here. (But this is a large task and unrealistic for a single update.)
i don't think we can / want to convert all of redis to use the monotonic clock. there are things that specifically require wall clock.
I agree completely. However it might become possible to remove the need to get the wall-clock on every command execution. This would make the need for wall-clock a more exceptional requirement. In the SLOWLOG case, we're not really concerned about the extra overhead of that call.
I think Jim's idea is really good. We should be able to start a monotonic clock on every command start. If we run into a case where we need a precise time, we can get the local time and then subtract off the time since the start with the monotonic clock. In the case where we don't have a HW clock, we won't do any math, we'll just return the monotonic clock value we collected at the start.
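A sketch of that scheme (my illustration with assumed helper names, not code from this PR): snapshot the monotonic clock once at command start; only when a precise wall-clock time is actually needed, read the wall clock once and subtract the monotonic delta to recover the start timestamp.

```c
#include <stdint.h>
#include <time.h>
#include <sys/time.h>

typedef uint64_t monotime;

/* Simplified stand-ins for redis' clock helpers. */
static monotime getMonotonicUs(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (monotime)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}
static int64_t ustime(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (int64_t)tv.tv_sec * 1000000 + tv.tv_usec;
}

/* Snapshot taken once, at command start. */
static monotime cmd_start_mono;

void commandStart(void) {
    cmd_start_mono = getMonotonicUs();
}

/* Derive the wall-clock time of command start only when actually needed
 * (e.g. for SLOWLOG): one wall-clock read minus the monotonic delta. */
int64_t commandStartWallClockUs(void) {
    return ustime() - (int64_t)(getMonotonicUs() - cmd_start_mono);
}
```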
@yoav-steinberg WRT the question above:
using the last commit on this PR and testing with/without HW clock here's the data:
ustime:
hw clock (monotonic clock: X86 TSC @ 2100 ticks/us) :
this means we get ~2% improvement by default @oranagra. the HW clock reduces the overhead of time tracking by 0.3%. a bit of detail on the test system:
Add INFO field to indicate the type of monotonic clock used
@redis/core-team i added a runtime check for the monotonic clock type being used, and an INFO field as i think we agreed in the meeting. please approve the "interface change" (INFO field). p.s. considering that we may drop the direct use of the HW clock in the future, it means we'll also drop the info field, i guess that doesn't count as a breaking change, but still, maybe we wanna re-consider it.
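For reference, emitting such a field is roughly a one-liner in genRedisInfoString(); a sketch (monotonicInfoString() is, to my understanding, the helper added alongside this change):

```c
/* Sketch: report the clock type in the INFO "server" section.
 * monotonicInfoString() returns a description such as
 * "POSIX clock_gettime" or "X86 TSC @ 2100 ticks/us". */
info = sdscatprintf(info, "monotonic_clock:%s\r\n", monotonicInfoString());
```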
i re-benchmarked this PR (using the commands listed at the top comment) on my machine. using USE_PROCESSOR_CLOCK, it showed 0.3% improvement (probably just a glitch since we do the same number and type of calls as we did before). I suppose that if i'd benchmark it on an Ubuntu / EC2 machine, the default build will show higher improvement, since we save one system call, and there it's much more expensive.
Agreed that in most cases, the performance improvement of direct TSC access is not very significant. I'd suggest that we might want to (by default) undefine USE_PROCESSOR_CLOCK. However, given that the code is isolated in a single file, behind a clean API, I'm not sure there's a good reason to remove the processor clock code. It does provide a slight performance improvement. On some systems/architectures, that improvement may be more meaningful. There are some places (like eviction) where the clock is used rapidly in a loop.
The complexity is one thing, and indeed it's (mostly) isolated in one place. The other thing is that it may be doing the wrong (or less right) thing, i.e. using TSC instead of RDTSCP. It's slightly faster, but we need to think about whether we may be badly affected.
I may be missing something, so feel free to explain, but I'm not seeing that the additional overhead of RDTSCP adds any value in our case. My understanding is that RDTSCP is the same as RDTSC, but introduces barriers which prevent instruction reordering. Within …
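For context, a sketch of the two reads on x86 (my illustration, not from the PR): both return the same counter, but RDTSCP waits for earlier instructions to finish before sampling, which matters for fine-grained micro-benchmarks more than for coarse command timing.

```c
#include <stdint.h>

/* Plain RDTSC: the read may be reordered relative to surrounding
 * instructions by the out-of-order core. */
static inline uint64_t read_tsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* RDTSCP: waits for prior instructions to complete before reading the
 * counter (a one-sided barrier); also writes the core id into ecx. */
static inline uint64_t read_tscp(void) {
    uint32_t lo, hi, aux;
    __asm__ __volatile__("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
    return ((uint64_t)hi << 32) | lo;
}
```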
I didn't even try to look into it beyond the basic idea that there is a difference. Your analysis seems correct to me, but I didn't bother to read about it or think about it, and with this kind of thing I always get surprised by things I didn't expect.
Agree with Jim that I'm not sure there is that much value in removing the processor build flags as of right now. The RDTSCP instruction is probably generally safer to use, but it seems unnecessary in our case as we don't really need the precision it provides.
I don't see a problem dropping the info field in the future, but also no urgency to remove the extra code.
As a (followup PR?) we should probably check the configured kernel clock source, warn about suboptimal configuration, and also report it as an additional INFO field when POSIX is used, now that we're aware of the performance impact it has.
that part is planned here: #10604
ok, so i'm merging it.
Optimization: Use either monotonic or wall-clock to measure command execution time, to regain up to 4% execution time (#10502)
In #7491 (part of redis 6.2), we started using the monotonic timer instead of mstime to measure command execution time for stats; apparently this meant sampling the clock 3 times per command rather than two (since we also need the wall-clock time). In some cases this causes a significant overhead.
This PR fixes that by avoiding the use of the monotonic timer, except for the cases where we know it should be extremely fast.
This PR also adds a new INFO field called `monotonic_clock` that shows which clock redis is using.
Co-authored-by: Oran Agra <oran@redislabs.com>
(cherry picked from commit 3cd8baf)
In #7491 (part of redis 6.2), we started using the monotonic timer instead of mstime to measure command execution time for stats; apparently this meant sampling the clock 3 times per command rather than two (since we also need the wall-clock time).
In some cases this causes a significant overhead.
This PR fixes that by avoiding the use of the monotonic timer, except for the cases where we know it should be extremely fast.
This PR also adds a new INFO field called monotonic_clock that shows which clock redis is using.

background
We saw in #10460 that updateCachedTime took ~1.5% of CPU cycles to get the time once again, when we already keep track of time within call() for the command latency tracking.
Given we need the UNIX timestamp for the updateCachedTime() call, we might as well measure time once before the command logic by using ustime(), instead of having to use ustime() + elapsedStart(). See the sketch below.
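In other words, roughly (a simplified sketch with stand-in helpers; the real logic lives in call()/updateCachedTime() in server.c):

```c
#include <stdint.h>
#include <sys/time.h>

typedef int64_t ustime_t;

static ustime_t ustime(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (ustime_t)tv.tv_sec * 1000000 + tv.tv_usec;
}

/* Before: one wall-clock read for the cached time, plus two monotonic
 * reads (elapsedStart/elapsedUs) for the command duration = 3 samples.
 *
 * After: the single ustime() sample is reused both as the cached time
 * and as the start of the duration measurement = 2 samples. */
void callSketch(void (*cmd)(void), ustime_t *cached_ustime,
                ustime_t *duration) {
    ustime_t start = ustime();
    *cached_ustime = start;   /* what updateCachedTimeWithUs() caches */
    cmd();
    *duration = ustime() - start;
}
```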
to reproduce:
before profile and results:
after results:
ustime:
hw clock (monotonic clock: X86 TSC @ 2100 ticks/us) :
this means we get ~2% improvement by default @oranagra. the HW clock reduces the overhead of time tracking by 0.3%.