
Optimization: Use either monotonic or wall-clock to measure command execution time, to regain up to 4% execution time #10502

Merged · 10 commits · Apr 20, 2022

Conversation

filipecosta90
Contributor

@filipecosta90 filipecosta90 commented Mar 31, 2022

In #7491 (part of redis 6.2), we started using the monotonic timer instead of mstime to measure command execution time for stats; apparently this meant sampling the clock 3 times per command rather than two (since we also need the wall-clock time).
In some cases this causes a significant overhead.

This PR fixes that by avoiding the use of the monotonic timer, except for the cases where we know it should be extremely fast.
This PR also adds a new INFO field called monotonic_clock that shows which clock redis is using.

background

We saw in #10460 that updateCachedTime took ~1.5% of CPU cycles to get the time once again, when we already keep track of time within call() for the command latency tracking.
Given we need the UNIX timestamp for updateCachedTime() call we might as well measure time once before the command logic by using ustime() instead of having to use ustime() + elapsedStart().

to reproduce:

# populate data
memtier_benchmark --ratio 1:0  -d 1000 --key-pattern P:P --key-maximum 1000000  --key-minimum 1 --pipeline 15 --hide-histogram

# benchmark
memtier_benchmark --ratio 0:1 --test-time 60  -d 1000  --key-maximum 1000000  --key-minimum 1 --pipeline 15 --hide-histogram

before profile and results:

ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---             ---             ---             ---             ---         0.00 
Gets       676114.65    676114.65         0.00         4.43638         3.74300         6.94300         7.48700    688585.47 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     676114.65    676114.65         0.00         4.43638         3.74300         6.94300         7.48700    688585.47 

(profiling screenshot)

after results:

ustime:

============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---             ---             ---             ---             ---         0.00 
Gets       689547.70    689547.70         0.00         4.34994         3.74300         7.29500         7.67900    702266.28 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     689547.70    689547.70         0.00         4.34994         3.74300         7.29500         7.67900    702266.28 

hw clock (monotonic clock: X86 TSC @ 2100 ticks/us) :

============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---             ---             ---             ---             ---         0.00 
Gets       691869.31    691869.31         0.00         4.33539         3.71100         7.23100         7.58300    704630.70 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     691869.31    691869.31         0.00         4.33539         3.71100         7.23100         7.58300    704630.70

this means we get ~2% improvement by default @oranagra. the HW clock reduces the overhead of time tracking by 0.3%.

Member

@oranagra oranagra left a comment


so you're saying this regression is from #7491 (part of redis 6.2), which started using the monotonic timer instead of mstime to measure command execution time stats?

}

void updateCachedTimeWithUs(int update_daylight_info, monotime ustime) {
server.ustime = ustime;
Member

i don't think you're allowed to do that.
you're mixing a monotonic time (used to measure time deltas), with server.mstime (wall clock unix time).

Contributor Author

you're right @oranagra . Given we need ustime() I've gone the other way around and removed the monotonic clock call.

Member

but now we're losing two things.

  1. the monotonic aspect of that time measurement (will be affected by wall clock shifts)
  2. the monotonic clock sampling can be much quicker (if we build redis to use the HW monotonic clock)

so we now reverted part of the "improvement" of #7491?

Member

actually, even if not using the HW clock, i think clock_gettime(CLOCK_MONOTONIC) is a lot faster than gettimeofday.
but apparently two calls to gettimeofday are faster than one call to gettimeofday + two calls to clock_gettime
i.e. gettimeofday * 2 < gettimeofday + 2 * clock_gettime 8-)

not sure what's the right way forward..
i suppose this change in #7491 was assumed to be an improvement and now when we realize the damage it did, we can revert it.
but ideally we should maybe find out at runtime which of them is faster? (maybe we can leave it for some future day)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed I've tried getMonotonicUs (using https://github.com/filipecosta90/redis/tree/optimize.updateCachedTime.getMonotonicUs) vs ustime (this branch)

On getMonotonicUs() I've tried both with/without using the HW clock (via make CFLAGS="-DUSE_PROCESSOR_CLOCK")

getMonotonicUs() default:

============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---             ---             ---             ---             ---         0.00 
Gets       698452.68    698452.68         0.00         4.29455         3.66300         6.78300         7.29500    711335.51 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     698452.68    698452.68         0.00         4.29455         3.66300         6.78300         7.29500    711335.51

getMonotonicUs() using HW clock (via make CFLAGS="-DUSE_PROCESSOR_CLOCK") :

============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---             ---             ---             ---             ---         0.00 
Gets       711777.48    711777.48         0.00         4.21409         3.59900         6.65500         7.10300    724906.09 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     711777.48    711777.48         0.00         4.21409         3.59900         6.65500         7.10300    724906.09 

ustime()

============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---             ---             ---             ---             ---         0.00 
Gets       697280.33    697280.33         0.00         4.30178         3.66300         6.84700         7.35900    710141.54 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     697280.33    697280.33         0.00         4.30178         3.66300         6.84700         7.35900    710141.54 

With no change on the compile flags they seem to be relatively equal @oranagra.
If we use the HW clock we can bump the numbers further.

Contributor

@oranagra @filipecosta90 Did we benchmark the results of this patch (two calls to ustime) vs the previous one call + two calls to getMonotonicUs when USE_PROCESSOR_CLOCK is enabled?

Member

@yoav-steinberg i think we didn't. but as i stated in my above message (considering the HW clock is nearly free), it should be somewhere in between.
i.e. between 697280.33 and 711777.48 (see the above benchmark).

considering a build without HW clock, this would be faster than what we currently have, since the monotonic clock seems to be as slow as the wall clock.
i.e. we'll get back the regression mentioned in #10460

Contributor

@yoav-steinberg yoav-steinberg Apr 7, 2022

Right. So I think we need to change this PR so that if USE_PROCESSOR_CLOCK is defined we use the HW clock for cmd duration and if it's not defined we use the proposed logic.

Member

it'll uglify the code a bit, but i guess considering how central this is, maybe it would be ok.
@filipecosta90 can you do a POC and verify the impact is what we expect?

Contributor Author

absolutely! Give me a couple of hours and I'll have the data.

@filipecosta90 filipecosta90 changed the title Avoid ustime() call within updateCachedTime when being called from call(). From 2.2% to 0.13% of cpu cycles Avoid monotonic timer given we need to call ustime() on call(). From ~6% to 2% of cpu cycles Apr 5, 2022
@filipecosta90 filipecosta90 changed the title Avoid monotonic timer given we need to call ustime() on call(). From ~6% to 2% of cpu cycles Avoid monotonic timer given we need to call ustime() on call(). From ~6% to ~2% of cpu cycles Apr 5, 2022
@filipecosta90
Contributor Author

so you're saying this regression is from #7491 (part of redis 6.2), which started using the monotonic timer instead of mstime to measure command execution time stats?

yep. given we really need the ustime() call, using the monotonic time tracking was a double effort.

@filipecosta90
Contributor Author

PS: not sure why tests are still failing. Do you see a problem with the code change @oranagra ?

src/server.c
Co-authored-by: Oran Agra <oran@redislabs.com>
@filipecosta90 filipecosta90 changed the title Avoid monotonic timer given we need to call ustime() on call(). From ~6% to ~2% of cpu cycles using getMonotonicUs() instead of ustime() and elapsedUs(). From ~6% to ~2% of cpu cycles Apr 6, 2022
@filipecosta90 filipecosta90 changed the title using getMonotonicUs() instead of ustime() and elapsedUs(). From ~6% to ~2% of cpu cycles using getMonotonicUs() on call() instead of ustime() and elapsedUs(). From ~6% to ~2% of cpu cycles Apr 6, 2022
@filipecosta90 filipecosta90 changed the title using getMonotonicUs() on call() instead of ustime() and elapsedUs(). From ~6% to ~2% of cpu cycles using ustime() on call() instead of ustime() and elapsedUs(). From ~6% to ~2% of cpu cycles Apr 6, 2022
@JimB123
Collaborator

JimB123 commented Apr 7, 2022

Let me see if I understand the conversation so far:

  • We need to call ustime() because there is a cached timestamp which matches the wall-clock. This is set at the beginning of every command and is used to support various functions which need this recent wall-clock time.
  • We also need to time the execution of the current command by measuring time before/after invoking the command. Measuring duration is something that we'd normally perform using the monotonic clock (not wall-clock).

The logic is something like this:

cached_ustime = ustime();                        /* wall clock */
...
command_start = getMonotonic();
/* execute command */
command_duration = getMonotonic() - command_start;

This logic is correct. When the HW clock is used for monotonic, getMonotonic() is essentially free and the total cost is basically 1 call to ustime().

However, if the HW clock is disabled, getMonotonic() is implemented in a similar way to calling ustime(). This results in essentially 3 calls to ustime(). We recognize that we could reduce this to 2 calls to ustime() by combining the caching of cached_ustime with the gathering of command_start from monotonic.

This would result in:

  • If the HW clock is enabled, a performance reduction as 1 call to ustime() would become 2 calls.
  • If the HW clock is disabled, a performance improvement as 3 calls to ustime() would become 2 calls.

If I understand correctly, the proposed change would choose to improve performance for some users at the expense of other users.

I think that the only reasonable way to make this optimization would be to actually check the HW flag with code something like this:

cached_ustime = ustime();                        /* wall clock */
...
#ifdef USE_PROCESSOR_CLOCK
    command_start = getMonotonic();
#endif
/* execute command */
#ifdef USE_PROCESSOR_CLOCK
    command_duration = getMonotonic() - command_start;
#else
    command_duration = ustime() - cached_ustime;
#endif

This is not ideal. It's a bit ugly, and the USE_PROCESSOR_CLOCK define would have to be relocated (into config.h?).

More ideally, we would convert most of Redis to use the monotonic clock rather than needing a cached wall-clock time. If all of the code using the cached time was converted to use the monotonic clock, we wouldn't need any special logic here. (But this is a large task and unrealistic for a single update.)

@oranagra
Member

oranagra commented Apr 7, 2022

i don't think we can / want to convert all of redis to use monotonic clock. there are things that specifically require wall clock.
all the places that measure time-deltas and latencies should use the monotonic clock, but things like expiry (specifically EXPIREAT) and SLOWLOG timestamps must use the wall clock.

@JimB123
Collaborator

JimB123 commented Apr 7, 2022

i don't think we can / want to convert all of redis to use monotonic clock. there are things that specifically require wall clock. all the places that measure time-deltas and latencies should use the monotonic clock, but things like expiry (specifically EXPIREAT) and SLOWLOG timestamps must use the wall clock.

I agree completely. However it might become possible to remove the need to get wall-clock on every command execution. This would make the need for wall-clock a more exceptional requirement. In the SLOWLOG case, we're not really concerned about the extra overhead to call ustime() if we've already been handling a slow command. expiry might have wider implications though. Ideally, we could execute most commands without ever having to call ustime().

@madolson
Contributor

madolson commented Apr 8, 2022

I think Jim's idea is really good. We should be able to start a monotonic clock on every command start. If we run into a case where we need a precise time, we can get the local time and then subtract off the time since the start with the monotonic clock. In the case where we don't have a HW clock, we won't do any math, we'll just return the monotonic clock value we collected at the start.

@filipecosta90
Contributor Author

@yoav-steinberg WRT:

Right. So I think we need to change this PR so that if USE_PROCESSOR_CLOCK is defined we use the HW clock for cmd duration and if it's not defined we use the proposed logic.

using the last commit on this PR and testing with/without HW clock here's the data:
unstable:

============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---             ---             ---             ---             ---         0.00 
Gets       676114.65    676114.65         0.00         4.43638         3.74300         6.94300         7.48700    688585.47 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     676114.65    676114.65         0.00         4.43638         3.74300         6.94300         7.48700    688585.47 

ustime:

============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---             ---             ---             ---             ---         0.00 
Gets       689547.70    689547.70         0.00         4.34994         3.74300         7.29500         7.67900    702266.28 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     689547.70    689547.70         0.00         4.34994         3.74300         7.29500         7.67900    702266.28 

hw clock (monotonic clock: X86 TSC @ 2100 ticks/us) :

============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---             ---             ---             ---             ---         0.00 
Gets       691869.31    691869.31         0.00         4.33539         3.71100         7.23100         7.58300    704630.70 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     691869.31    691869.31         0.00         4.33539         3.71100         7.23100         7.58300    704630.70

this means we get ~2% improvement by default @oranagra. the HW clock reduces the overhead of time tracking by 0.3%.

a bit of detail on the test system:

Linux hpe10 5.4.0-107-generic #121-Ubuntu SMP Thu Mar 24 16:04:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
CPU: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz (hyperthreading disabled)
used compiler: gcc-11 (11.1.0)

Add INFO field to indicate the type of monotonic clock used
@oranagra
Member

@redis/core-team i added a runtime check for the monotonic clock type being used, and an INFO field as i think we agreed in the meeting. please approve the "interface change" (INFO field).

p.s. considering that we may drop the direct use of the HW clock in the future, it means we'll also drop the info field, i guess that doesn't count as a breaking change, but still, maybe we wanna re-consider it.

@oranagra
Member

i re-benchmarked this PR (using the commands listed at the top comment) on my machine.
using default build flags, the before and after showed 3.3% improvement.

using USE_PROCESSOR_CLOCK, it showed 0.3% improvement (probably just a glitch since we do the same number and type of calls as we did before)

I suppose that if i'd benchmark it on an Ubuntu / EC2 machine, the default build will show higher improvement, since we save one system call, and that call is much more expensive there.
and the USE_PROCESSOR_CLOCK build will still show no change.

@JimB123
Collaborator

JimB123 commented Apr 19, 2022

Based on everything mentioned here, it does seem like we can drop the system check and just always use clock_gettime(CLOCK_MONOTONIC), and get rid of all the ifdefs.

Agreed that in most cases, the performance improvement of direct TSC access is not very significant. I'd suggest that we might want to (by default) undefine USE_PROCESSOR_CLOCK if it improves compatibility in the general case.

However given that the code is isolated in a single file, behind a clean API, I'm not sure there's a good reason to remove the processor clock code. It does provide a slight performance improvement. On some systems/architectures, that improvement may be more meaningful. There are some places (like eviction) where the clock is used rapidly in a loop.

@oranagra
Member

The complexity is one thing, and indeed (mostly) isolated in one place. The other thing is that it may be doing the wrong (or less right) thing, i.e. using RDTSC instead of RDTSCP. It's slightly faster, but we need to think if we may be badly affected.
Since it's under a compilation flag, maybe we care less.

@JimB123
Collaborator

JimB123 commented Apr 19, 2022

The other thing is that it may be doing the wrong (or less right) thing, i.e. using RDTSC instead of RDTSCP. It's slightly faster, but we need to think if we may be badly affected.

I may be missing something, so feel free to explain, but I'm not seeing that the additional overhead of RDTSCP adds any value in our case. My understanding is that RDTSCP is the same as RDTSC, but introduces barriers which prevent instruction reordering.

Within getMonotonicUs_x86(), the function is 1 line. So no compiler reordering can occur within the function. Since the function is accessed via function pointer, the compiler can't inline and no significant reordering in the caller is possible. At the processor level, instruction reordering is possible, however since the result is divided (reduced in precision) to micro-seconds, processor reordering is unimportant.
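For reference, the shape of such a fence-free reader (an illustrative sketch: the real getMonotonicUs_x86() in monotonic.c also divides by a ticks-per-microsecond value calibrated at startup, and the non-x86 fallback below is this sketch's own, not Redis code):

```c
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Raw monotonic tick counter: a plain RDTSC on x86, with no
 * serializing fences -- at microsecond precision, instruction
 * reordering around the read doesn't matter. Elsewhere, fall back
 * to clock_gettime(CLOCK_MONOTONIC) nanoseconds. */
static uint64_t monotonic_ticks(void) {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
#endif
}
```

RDTSCP would add ordering guarantees (and the core ID), but as argued above that extra cost buys nothing once the result is truncated to microseconds.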

@oranagra
Member

I didn't even try to look into it beyond the basic idea that there is a difference. Your analysis seems correct to me, but I didn't bother to read about it or think about it, and with this kind of thing I always get surprised by things I didn't expect.
On the bright side, a wrong measurement isn't critical for redis, and as you say, the chances that it'll be very wrong are very low.

Contributor

@madolson madolson left a comment

Agree with Jim that I'm not sure there is that much value in removing the processor build flags as of right now. The RDTSCP instruction is probably generally safer to use, but it seems unnecessary in our case as we don't really need the precision it provides.

Member

@yossigo yossigo left a comment

I don't see a problem dropping the info field in the future, but also no urgency to remove the extra code.

As a (followup PR?) we should probably check the configured kernel clock source, warn about suboptimal configuration, and also report it as an additional INFO field when POSIX is used, now that we're aware of the performance impact it has.

@oranagra
Member

that part is planned here: #10604

@oranagra oranagra added the state:major-decision Requires core team consensus label Apr 20, 2022
@oranagra
Member

ok, so i'm merging it.
note that the INFO field string will be one of:
X86 TSC @ %ld ticks/us
ARM CNTVCT @ %ld ticks/us
POSIX clock_gettime
i.e. not planned to be machine friendly, so i think i won't bother people by mentioning any of these changes in the release notes.
but i will backport that fix to 6.2, to speed it up a bit.

@oranagra oranagra changed the title using ustime() on call() instead of ustime() and elapsedUs(). From ~6% to ~2% of cpu cycles Optimization: Use either monotonic or wall-clock to measure command execution time, to regain up to 4% execution time Apr 20, 2022
@oranagra oranagra merged commit 3cd8baf into redis:unstable Apr 20, 2022
@filipecosta90 filipecosta90 deleted the optimize.updateCachedTime branch April 20, 2022 12:58
This was referenced Apr 27, 2022
oranagra pushed a commit that referenced this pull request Apr 27, 2022
…xecution time, to regain up to 4% execution time (#10502)

In #7491 (part of redis 6.2), we started using the monotonic timer instead of mstime to measure
command execution time for stats, apparently this meant sampling the clock 3 times per command
rather than two (since we also need the wall-clock time).
In some cases this causes a significant overhead.

This PR fixes that by avoiding the use of the monotonic timer, except for the cases where we know it
should be extremely fast.
This PR also adds a new INFO field called `monotonic_clock` that shows which clock redis is using.

Co-authored-by: Oran Agra <oran@redislabs.com>
(cherry picked from commit 3cd8baf)
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
Labels
action:run-benchmark Triggers the benchmark suite for this Pull Request state:major-decision Requires core team consensus