Skip to content

Improve profiler timing on Windows/x86 using RDTSC#629

Closed
mvorisek wants to merge 3 commits intoxdebug:masterfrom
mvorisek:improve_windows_timing
Closed

Improve profiler timing on Windows/x86 using RDTSC#629
mvorisek wants to merge 3 commits intoxdebug:masterfrom
mvorisek:improve_windows_timing

Conversation

@mvorisek
Copy link
Copy Markdown
Contributor

@mvorisek mvorisek commented Aug 22, 2020

This PR further improves #611. The simplest php functions (is_int, count, ...) take about 20 ns per call, but the best Windows timer has resolution of 100 ns [1] which implies that this timer source is short to simple php functions. When a script/project under test contain a lot of calls of simple php functions, the xdebug profiler output is quite ambiguous (some calls are recorded with 10 ns duration (minimal time step) and some with 100 ns (minimal timer step)).

This PR solves this issue by using RDTSC as a clock source, which has resolution of about 0.25 ns and reading it is fast, it takes about 6 ns per call. On modern CPUs (Nehalen or newer), it is incremented at constant rate

However, there is no way to detect the RDTSC speed, therefore it needs a short calibration (2 ms) and this feature must by enabled explictly by xdebug.profiler_tsc_as_clock php ini option is enabled by default on Windows but only for profile mode. As php process takes about 10 ms to startup on Windows, I belive the calibration delay is acceptable against much more accurate results.

[1] https://stackoverflow.com/questions/63205226/better-than-100ns-resolution-timers-in-windows

@mvorisek
Copy link
Copy Markdown
Contributor Author

@derickr can we merge this PR? can you also please provide feedback for #628 ?

@derickr
Copy link
Copy Markdown
Contributor

derickr commented Aug 29, 2020

I'm not inclined to merge changes to timings until after Xdebug 3.0 is out, as it needs my full attention, and there is enough scope creep already.

PR #628 is still "in draft", so I didn't pay it any attention yet, as I thought it wasn't done yet.

@NattyNarwhal
Copy link
Copy Markdown
Contributor

This is architecture specific; it wouldn't work on ARM Win32.

@mvorisek
Copy link
Copy Markdown
Contributor Author

mvorisek commented Sep 28, 2020

This is architecture specific; it wouldn't work on ARM Win32.

thanks for your feedback - would you like to see there macro like __i386__ || __x86_64__ || _M_IX86 || _M_X64 or do you know a better condition to detect x86/x64 arch?

@mvorisek mvorisek force-pushed the improve_windows_timing branch 4 times, most recently from 85548a1 to 61ad26b Compare January 21, 2021 22:33
@mvorisek mvorisek force-pushed the improve_windows_timing branch 3 times, most recently from 1820d76 to cc71fc8 Compare January 28, 2021 23:35
@mvorisek mvorisek force-pushed the improve_windows_timing branch from cc71fc8 to b44d7e0 Compare January 28, 2021 23:41
@mvorisek mvorisek changed the title Improve timing on Windows using RDTSC Improve timing on Windows/x86 using RDTSC Jan 28, 2021
@mvorisek mvorisek force-pushed the improve_windows_timing branch 7 times, most recently from eaf7a10 to bda82c6 Compare January 29, 2021 17:35
@mvorisek mvorisek force-pushed the improve_windows_timing branch from bda82c6 to c55bae7 Compare January 29, 2021 19:06
@mvorisek mvorisek changed the title Improve timing on Windows/x86 using RDTSC Improve profiler timing on Windows/x86 using RDTSC Jan 30, 2021
@mvorisek mvorisek force-pushed the improve_windows_timing branch from 0e5402a to 0861545 Compare January 30, 2021 10:01
@mvorisek mvorisek force-pushed the improve_windows_timing branch from 0861545 to 8763cdd Compare January 30, 2021 10:02
@mvorisek mvorisek force-pushed the improve_windows_timing branch from 8763cdd to 2711607 Compare January 30, 2021 10:04
@mvorisek
Copy link
Copy Markdown
Contributor Author

mvorisek commented Mar 5, 2021

@derickr can we merge this PR?

  • RDTSC clock source is used only for Windows and profiler mode only
  • PR description is updated to summary the importance and the challenges

@mvorisek
Copy link
Copy Markdown
Contributor Author

@derickr can you please get to this PR?

@derickr
Copy link
Copy Markdown
Contributor

derickr commented Mar 22, 2021

I'm still not keen in adding this, as it's more code to maintain on a platform that I know little about. I haven't heard a good reason why this is needed, or what big benefit it has.

@mvorisek
Copy link
Copy Markdown
Contributor Author

I have been profiling very critical code parts which have a lot of very simple/fast calls.

Currently, on Windows, the time resolution is 100 ns. But, as mentioned in the description, the simple/fast calls take about 20 ns and without this PR the profiler result is more or less random :), thus I propose this PR to:
a) increase the time resolution above the fastest function execution time,
b) also decrease the time needed to read the current time (from about 50 ns to 6 ns).

I plan then to finish #628 which will remove the xdebug overhead and measure simple/fast calls below 10 ns.

In numbers, simple/fast call is currently recorded with 10 ns (minimal/fallback step) or 100 ns (minimal timer step). Even if the times should average to +- the real result, they don't as the time needed to read the current time is currently way larger than the fastest function call itself. Imagine a loop with 1k elements and simple code inside like is_string checks. The is_string call itself takes about 10 ns. The xdebug overhead is 10 ns. The read time overhead is 50 ns.

So with this PR, the measured time will decrease from 70 us to 20 us. With the later PR, it will be 10 us.

With more elements/calls it can be millisconds. And if the measured time is 7x greater than in the reality, it confuses the developer so much that he may think some heavy call like DB query is faster. It is not, only 50 ns overhead per 1 ms DB query is unimportant, but for the fastest functions, the results are almost an order of magnitude off.

To your point about Windows: it is tested in CI, feel free to break the code to verify (like by speeding the time by multiplying it by some constant)

@mvorisek
Copy link
Copy Markdown
Contributor Author

mvorisek commented Apr 7, 2021

@derickr would you be ok to add profiler_tsc_as_clock as optional experimental xdebug switch? I belive, I can make it working for linux as well and user can opt in when high performace profiling is desired. If there will be no negative feedback during longer time, we can make it default then.

@derickr
Copy link
Copy Markdown
Contributor

derickr commented Oct 6, 2021

Previous timing related PRs have given me a lot of trouble, so I won't be accepting this, or anything else related to timing any more.

@derickr derickr closed this Oct 6, 2021
@mvorisek
Copy link
Copy Markdown
Contributor Author

mvorisek commented Oct 6, 2021

Thank you for the feedback Derick, can you please be more specific if there was anything I could done better from my side?

@mvorisek
Copy link
Copy Markdown
Contributor Author

mvorisek commented Jul 9, 2022

Hi @derickr, I am still facing with inaccurate time measurements. Measuring the time in sense of CPU instructions may be a solution, although not all CPU instructions take the same time. What are your thoughts, would you accept a PR for measuring time based on CPU instructions? Or did you try this solution in the past and rejected it for some reason? Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants