Change all durations and timestamps to times.nim #15

Closed
arnetheduck opened this issue Jan 2, 2019 · 14 comments

@arnetheduck
Member

There are several reasons to do this:

  • use the standard representation of time in Nim (times.Duration and friends) to integrate better with other libraries (see the sketch after this list)
  • avoid confusion about time units - the current use of bare milliseconds is unusual and arbitrary, which makes it easy to get the units wrong
  • use a monotonic clock for relative time offsets (in timers, for example)
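
A minimal sketch of what this could look like at the API level (illustrative proc names only, not the actual asyncdispatch2 API):

import times

proc sleepMs(ms: int) =
  ## stand-in for the current millisecond-based primitive
  echo "sleeping for ", ms, " ms"

proc sleep(d: Duration) =
  ## proposed style: the unit travels with the value
  sleepMs(int(d.inMilliseconds))

when isMainModule:
  sleep(initDuration(milliseconds = 100))  # unambiguous at the call site
  sleep(initDuration(seconds = 2))         # no way to confuse ms with s
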
@cheatfate
Collaborator

On Windows, GetQueuedCompletionStatus(), which is used both to poll completion events and to pause, supports only millisecond resolution, so how do you propose to use microsecond and nanosecond durations in that case?

I understand the problem with the monotonic clock, but currently the library uses the most performant primitives, which are 5x faster than their monotonic alternatives.

@arnetheduck
Member Author

The point is to allow the user to specify the time in an unambiguous way - the precision you get is then whatever the platform offers (see the sketch below). By fixing this in asyncdispatch2, we can also offer the same usability advantage in downstream libraries.
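
For example, reducing a Duration to the millisecond timeout that GetQueuedCompletionStatus() expects could look roughly like this (a sketch only, rounding up so a sub-millisecond wait doesn't silently become a zero, non-blocking wait):

import times

proc toTimeoutMs(d: Duration): int =
  ## round a Duration up to the millisecond granularity of the OS waiter
  let ns = d.inNanoseconds
  if ns <= 0:
    result = 0
  else:
    result = int((ns + 999_999) div 1_000_000)

when isMainModule:
  echo toTimeoutMs(initDuration(microseconds = 250))  # -> 1
  echo toTimeoutMs(initDuration(milliseconds = 100))  # -> 100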

When you use these functions, you necessarily have to take into account that these are estimates. For example, if you just blindly repeat a call with a 100ms delay, you'll get fewer than 10 calls per second regardless of the precision of the underlying clock - for example because other events are running at the same time.

I'm also curious about that 5x claim, and how much of the total time of calling, for example, GetQueuedCompletionStatus the extra 4x makes up.

@cheatfate
Collaborator

cheatfate commented Jan 4, 2019

If you check the source code of the poll() function https://github.com/status-im/nim-asyncdispatch2/blob/master/asyncdispatch2/asyncloop.nim#L316, you will find that fastEpochTime() is called twice.

The first call is used to handle already expired callbacks and to calculate the timeout for the system queue waiter; the second call is used to process callbacks that expired during the system queue wait.

The second call is required because you don't know how much time you spend in the system queue waiter (it can be equal to the timeout value, but if no timeout was set, it will wait indefinitely for FD events).
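
Schematically, that structure looks something like this (a simplified sketch - the standard library's monotonic clock and a plain sleep stand in for fastEpochTime() and the system queue waiter):

import std/monotimes, times, os

proc pollOnce(deadlineTicks: int64) =
  # first clock reading: used to expire already-due callbacks and to work out
  # how long the system waiter may block before the next timer is due
  let start = getMonoTime()
  let timeout = initDuration(nanoseconds = max(0'i64, deadlineTicks - start.ticks))
  # stand-in for the real waiter (GetQueuedCompletionStatus / epoll_wait)
  sleep(int(timeout.inMilliseconds))
  # second clock reading: the waiter may return early (I/O ready) or late
  # (scheduling), so due timers are checked against the clock, not against
  # the timeout that was requested
  let finish = getMonoTime()
  echo "asked to wait ", timeout, ", actually waited ", finish - start

when isMainModule:
  pollOnce(getMonoTime().ticks + 50_000_000)   # a timer due in 50 ms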

@arnetheduck
Member Author

I understand that the function is called - I'm asking a few things:

  • compared to other calls in the loop, what % of time is taken by the clock call? for example, if the clock call takes 0.01% of the total loop time and 99.99% is spent in GetQueuedCompletionStatus, it really doesn't matter if one clock takes 5x the other
  • for the above question, also take into account that this is mostly relevant when there is actually work to do, so you would need to measure under load - otherwise, if the loop is idling, again, it doesn't matter
  • where did you get the 5x figure from? on linux, CLOCK_REALTIME and CLOCK_MONOTONIC are near identical, measured in cycles

once you have the answers to those questions, you have to weigh that against the massive improvement in ergonomics:

  • you cannot make a mistake with the unit of time
  • on platforms with finer-grained clocks you get access to shorter loop times if you want them (no artificial timing barrier)
  • applications that actually make use of the std lib are not penalized by ugly casts - we're having this problem downstream in nim-eth-p2p and nim-beacon-chain, where we need to be careful with clocks or the system breaks down. imagine the embarrassment and massive cost of nimbus freezing on a leap second because we wanted to save a few cycles in an idle loop.

here's a quick benchmark comparing the two clocks:
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* read the CPU cycle counter - used here only to time the clock_gettime calls */
static uint64_t rdtsc(void) {
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

const int iters = 1024 * 1024;

int main(void) {
    while (1) {
        uint64_t start, mid, end;
        uint64_t t = 0;          /* sum of return values (stays 0 on success) */
        struct timespec tt;

        start = rdtsc();

        for (int i = 0; i < iters; ++i) {
            t += clock_gettime(CLOCK_REALTIME, &tt);
        }

        mid = rdtsc();

        for (int i = 0; i < iters; ++i) {
            t += clock_gettime(CLOCK_MONOTONIC, &tt);
        }

        end = rdtsc();

        /* average cycles per call for each clock */
        printf("%lu %lu\n",
               (unsigned long)((mid - start) / iters),
               (unsigned long)((end - mid) / iters));
    }
}

on my crappy laptop, I see more variation due to thermal throttling than between the two calls..

@cheatfate
Collaborator

@arnetheduck first of all, I'm not making asyncdispatch for Linux only, so if Linux has an advantage in some behavior, it doesn't matter for the library, because the library must produce equal behavior on all OSes.

compared to other calls in the loop, what % of time is taken by the clock call? for example, if the clock call takes 0.01% of the total loop time and 99.99% is spent in GetQueuedCompletionStatus, it really doesn't matter if one clock takes 5x the other

nim-lang/Nim#3909

These are my old benchmarks, but the data can still be used to understand the timer's impact on performance.

@arnetheduck
Member Author

The benefit is in the hardware, not the operating system, and the point is that you need to use the correct clock.

On windows, that's QueryPerformanceCounter for monotonic and fast relative time - windows clocks are a bit of a mystery because they tend to store hardware time in local time and generally use 100ns increments in their time structures. No idea what kind of conversions happen in the background when you call that filetime function (which is UTC) - but notably, your fastEpochTime and epochTime in nim call the same underlying windows function - clearly the performance difference on windows is not in the clock itself.

However, the benchmarks I see in that post for timers look mostly irrelevant - they measure timing without optimizations (no -d:release). The core difference on windows between your timer and the upstream timer is the float conversion - that conversion is unnecessary any way you look at it, and you'll notice a Duration doesn't use it either - and it's the kind of thing made worse by the lack of optimizations.

The second thing is that the incoming value in the API can be converted to whatever underlying clock you want - in fact, it should be (see the sketch below). The key is to have a clear, powerful and unambiguous API followed up by an efficient implementation that makes the best use of the hardware available. Using milliseconds is strange (the SI unit for time is the second) and suboptimal any way you look at it.
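
To make that concrete, a rough sketch of converting at the API boundary - the counter frequency here is an assumed placeholder, not a value queried from the OS:

import times

const counterFrequency = 10_000_000'i64   # assumed ticks per second, for illustration only

proc toTicks(d: Duration): int64 =
  ## convert once at the API boundary; past this point the hot path compares integers
  let secs = d.inSeconds
  let nanos = d.inNanoseconds - secs * 1_000_000_000'i64
  # split into whole seconds and leftover nanoseconds to avoid int64 overflow
  result = secs * counterFrequency + nanos * counterFrequency div 1_000_000_000'i64

when isMainModule:
  echo toTicks(initDuration(milliseconds = 100))   # 1_000_000 ticks at 10 MHz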

@cheatfate
Collaborator

The actual benefit is not in the hardware; I can't use rdtsc because it is not cross-platform. It depends on how the OS handles time: whether it switches to kernel mode to get the counters or not, and whether it uses some heavy calculations or not.

@arnetheduck
Member Author

I'm not saying we should use RDTSC - the proposal is to use Duration in the API (for developer friendliness) and a monotonic clock in the implementation (for correctness).. clock_gettime(CLOCK_MONOTONIC, &tt) or QueryPerformanceCounter - there's no perf difference between these and whatever you're using - probably, there's an improvement on windows.

I'm also claiming that we could use just a normal slow clock and probably wouldn't notice the difference, because the benchmark is kind of irrelevant - when there's no load, it doesn't really matter how many loops per second you can do because you're sleeping most of the time, and when there is load, the timing part is usually dwarfed by actual work (i.e. a beacon node packet arriving, in our case).. thus I'd focus on correctness and the use of good data structures (I see, for example, that we're using a heap for the timers, which in general seems much more sane than a seq or a linked list - see the sketch below) before worrying too much about micro-benchmarks
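
As an aside, a minimal timer heap on top of the standard library's HeapQueue could look like this (illustrative only, not the actual implementation; deadlines kept as raw monotonic nanosecond ticks):

import std/[heapqueue, monotimes], times

var timers = initHeapQueue[int64]()   # deadlines as monotonic ticks (ns)

proc schedule(d: Duration) =
  ## register a timer relative to now
  timers.push(getMonoTime().ticks + d.inNanoseconds)

proc nextTimeout(): Duration =
  ## how long the event loop may sleep before the earliest timer is due
  if timers.len == 0:
    result = initDuration(milliseconds = 500)   # arbitrary idle cap
  else:
    result = initDuration(nanoseconds = max(0'i64, timers[0] - getMonoTime().ticks))

when isMainModule:
  schedule(initDuration(milliseconds = 100))
  schedule(initDuration(milliseconds = 10))
  echo nextTimeout()   # close to 10 ms: the heap keeps the earliest deadline on top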

@cheatfate
Collaborator

Now I will show my measurements; to benchmark, I have used this code: https://gist.github.com/cheatfate/d184c9f11fd49e9d0c75d166bd2d2b05.

So the Linux benchmark (compiled with nim c -d:release) is:

100_000_000 calls to fastEpochTimeMono() takes 1545765406ns
100_000_000 calls to fastEpochTime() takes     1537696307ns
100_000_000 calls to fastEpochTimeRaw() takes  23333947291ns
100_000_000 calls to fastEpochTimeMono() takes 1538502155ns
100_000_000 calls to fastEpochTime() takes     1537760239ns
100_000_000 calls to fastEpochTimeRaw() takes  23364704565ns
100_000_000 calls to fastEpochTimeMono() takes 1537733110ns
100_000_000 calls to fastEpochTime() takes     1538136763ns
100_000_000 calls to fastEpochTimeRaw() takes  23364710793ns
100_000_000 calls to fastEpochTimeMono() takes 1540042299ns
100_000_000 calls to fastEpochTime() takes     1538359638ns
100_000_000 calls to fastEpochTimeRaw() takes  23374689710ns
100_000_000 calls to fastEpochTimeMono() takes 1539592896ns
100_000_000 calls to fastEpochTime() takes     1538834740ns
100_000_000 calls to fastEpochTimeRaw() takes  23360426822ns
100_000_000 calls to fastEpochTimeMono() takes 1538338091ns
100_000_000 calls to fastEpochTime() takes     1537669102ns
100_000_000 calls to fastEpochTimeRaw() takes  23369929946ns
100_000_000 calls to fastEpochTimeMono() takes 1538175000ns
100_000_000 calls to fastEpochTime() takes     1538563607ns
100_000_000 calls to fastEpochTimeRaw() takes  23368758867ns
100_000_000 calls to fastEpochTimeMono() takes 1540537125ns
100_000_000 calls to fastEpochTime() takes     1546923064ns
100_000_000 calls to fastEpochTimeRaw() takes  23364539543ns
100_000_000 calls to fastEpochTimeMono() takes 1538087034ns
100_000_000 calls to fastEpochTime() takes     1537477042ns
100_000_000 calls to fastEpochTimeRaw() takes  23364063367ns
100_000_000 calls to fastEpochTimeMono() takes 1539868275ns
100_000_000 calls to fastEpochTime() takes     1539015094ns
100_000_000 calls to fastEpochTimeRaw() takes  23364882319ns

The Windows benchmark is:

100_000_000 calls to fastEpochTimeMono() takes 1799527000ns
100_000_000 calls to fastEpochTime() takes     346997100ns
100_000_000 calls to fastEpochTimeMono() takes 1766980600ns
100_000_000 calls to fastEpochTime() takes     344437400ns
100_000_000 calls to fastEpochTimeMono() takes 1791302400ns
100_000_000 calls to fastEpochTime() takes     342270300ns
100_000_000 calls to fastEpochTimeMono() takes 1784696100ns
100_000_000 calls to fastEpochTime() takes     337763600ns
100_000_000 calls to fastEpochTimeMono() takes 1805359700ns
100_000_000 calls to fastEpochTime() takes     337840100ns
100_000_000 calls to fastEpochTimeMono() takes 1755878000ns
100_000_000 calls to fastEpochTime() takes     343742100ns
100_000_000 calls to fastEpochTimeMono() takes 1756078800ns
100_000_000 calls to fastEpochTime() takes     341183100ns
100_000_000 calls to fastEpochTimeMono() takes 1763183100ns
100_000_000 calls to fastEpochTime() takes     336385800ns
100_000_000 calls to fastEpochTimeMono() takes 1773583000ns
100_000_000 calls to fastEpochTime() takes     336432200ns
100_000_000 calls to fastEpochTimeMono() takes 1769893700ns
100_000_000 calls to fastEpochTime() takes     333117400ns

@cheatfate
Collaborator

As you can see, on Windows fastEpochTimeMono() is 500% slower than fastEpochTime().

@cheatfate
Collaborator

cheatfate commented Mar 21, 2019

Now about the Linux results; this is a quote from man clock_gettime:

CLOCK_MONOTONIC
       Clock that cannot be set and represents monotonic time since some unspecified starting point. This clock is not affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the clock), but is affected by the incremental adjustments performed by adjtime(3) and NTP.

CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific)
       Similar to CLOCK_MONOTONIC, but provides access to a raw hardware-based time that is not subject to NTP adjustments or the incremental adjustments performed by adjtime(3).

As you can see, CLOCK_MONOTONIC is also affected by adjustments, which the ntpd daemon can make silently. CLOCK_MONOTONIC_RAW is not affected by these changes, but for some reason it is 1500+% slower.

@cheatfate
Collaborator

So the usage of the MONOTONIC clock you are proposing is also not totally correct.

@arnetheduck
Member Author

so from adjtime:

NOTES
       The adjustment that adjtime() makes to the clock is carried out in such a manner that the clock is always monotonically increasing. Using adjtime() to adjust the time prevents the problems that can be caused for certain applications (e.g., make(1)) by abrupt positive or negative jumps in the system time.

the two work together to do (mostly) the right thing. specifically, they don't jump back in time, and don't cause large disruptions in timing, beyond what normally would happen anyway with non-realtime OS's.

for windows, you can try GetTickCount64, or you can try moving the frequency multiplication elsewhere.

all that said, this is not an argument about performance, primarily. you will be calling the clock function only when it doesn't matter, relatively speaking. it's possible to construct benchmarks that focus on the timer, but these will be far removed from reality.

@cheatfate
Collaborator

Fixed in #24
