Optimized stolarsky_mean#2274
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2274      +/-   ##
==========================================
- Coverage   96.80%   96.80%   -0.00%
==========================================
  Files         528      528
  Lines       42655    42660       +5
==========================================
+ Hits        41292    41295       +3
- Misses       1363     1365       +2
andrewwinters5000 left a comment
This is an interesting (and ingenious) way to rewrite the expression and possibly improve performance. Would it be worthwhile to also benchmark it on Roci (or some other machine) just to see its influence?
Co-authored-by: Andrew Winters <andrew.ross.winters@liu.se>
|
Hendrik made me realize that for integers the For real numbers: For integers So, basically just by preallocating the power functions a 50% speed up is gained. The |
|
Results on the university machine (Goldstein):
new version
old version:
There's still a roughly 17% improvement.
Benchmarks for Goldstein:
Integers: |
Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>
julia> @inline stolarsky_mean_1(x::Real, y::Real, gamma::Real) = stolarsky_mean_1(promote(x, y)..., gamma)
stolarsky_mean_1 (generic function with 2 methods)
julia> @inline function stolarsky_mean_1(x::RealT, y::RealT, gamma::Real) where {RealT <: Real}
           epsilon_f2 = convert(RealT, 1.0e-4)
           f2 = (x * (x - 2 * y) + y * y) / (x * (x + 2 * y) + y * y) # f2 = f^2
           if f2 < epsilon_f2
               # convenience coefficients
               c1 = convert(RealT, 1 / 3) * (gamma - 2)
               c2 = convert(RealT, -1 / 15) * (gamma + 1) * (gamma - 3) * c1
               c3 = convert(RealT, -1 / 21) * (2 * gamma * (gamma - 2) - 9) * c2
               return 0.5f0 * (x + y) * @evalpoly(f2, 1, c1, c2, c3)
           else
               if gamma isa Integer
                   yg = y^(gamma - 1)
                   xg = x^(gamma - 1)
               else
                   yg = exp((gamma - 1) * log(y)) # equivalent to y^(gamma - 1) but faster for non-integers
                   xg = exp((gamma - 1) * log(x)) # equivalent to x^(gamma - 1) but faster for non-integers
               end
               # yg * y == y^gamma and xg * x == x^gamma, so each power is evaluated only once
               return (gamma - 1) * (yg * y - xg * x) / (gamma * (yg - xg))
           end
       end
stolarsky_mean_1 (generic function with 2 methods)
julia> @inline stolarsky_mean_2(x::Real, y::Real, gamma::Real) = stolarsky_mean_2(promote(x, y)..., gamma)
stolarsky_mean_2 (generic function with 2 methods)
julia> @inline function stolarsky_mean_2(x::RealT, y::RealT, gamma::Real) where {RealT <: Real}
           epsilon_f2 = convert(RealT, 1.0e-4)
           f2 = (x * (x - 2 * y) + y * y) / (x * (x + 2 * y) + y * y) # f2 = f^2
           if f2 < epsilon_f2
               # convenience coefficients
               c1 = convert(RealT, 1 / 3) * (gamma - 2)
               c2 = convert(RealT, -1 / 15) * (gamma + 1) * (gamma - 3) * c1
               c3 = convert(RealT, -1 / 21) * (2 * gamma * (gamma - 2) - 9) * c2
               return 0.5f0 * (x + y) * @evalpoly(f2, 1, c1, c2, c3)
           else
               if gamma isa Integer
                   yg = y^(gamma - 1)
                   xg = x^(gamma - 1)
               else
                   yg = exp((gamma - 1) * log(y)) # equivalent to y^(gamma - 1) but faster for non-integers
                   xg = exp((gamma - 1) * log(x)) # equivalent to x^(gamma - 1) but faster for non-integers
               end
               # same quotient as in stolarsky_mean_1, only grouped differently
               return (gamma - 1) / gamma * (yg * y - xg * x) / (yg - xg)
           end
       end
stolarsky_mean_2 (generic function with 2 methods)
julia> @benchmark value = stolarsky_mean_1($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.0))[])
BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
Range (min … max): 22.289 ns … 34.858 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 22.791 ns ┊ GC (median): 0.00%
Time (mean ± σ): 24.045 ns ± 1.610 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▆██▃▂▂▂ ▁▂▁▂▁ ▁▁▂ ▂▁▂▂▂ ▁▂▁▂ ▂▂▂ ▂▅▅█▃ ▂
▃██████████████▇████████████████████████▅▃▁▆▅▆▇▆▄▃▅▅▅▃▁▅▃▅▅ █
22.3 ns Histogram: log(frequency) by time 28 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark value = stolarsky_mean_2($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.0))[])
BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
Range (min … max): 22.368 ns … 32.981 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 22.550 ns ┊ GC (median): 0.00%
Time (mean ± σ): 22.639 ns ± 0.466 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃██▆▆▃▁ ▂
▅███████▃▁▁▁▁▄▆█▆▆▄▃▁▁▁▁▃▄▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▆█▇▇▇▆▄▄▃▃▄▅ █
22.4 ns Histogram: log(frequency) by time 25.6 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark value = stolarsky_mean_1($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.7))[])
BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
Range (min … max): 22.409 ns … 320.979 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 22.590 ns ┊ GC (median): 0.00%
Time (mean ± σ): 22.693 ns ± 3.015 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
██
▆▄▇██▇█▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▂▂▂▂ ▃
22.4 ns Histogram: frequency by time 25.1 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark value = stolarsky_mean_2($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.7))[])
BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
Range (min … max): 22.430 ns … 318.380 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 22.560 ns ┊ GC (median): 0.00%
Time (mean ± σ): 22.678 ns ± 3.001 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▆█▆▅▂ ▂
███████▅▃▁▁▁▄▅█▇▅▅▄▁▁▁▁▃▃▁▁▁▁▁▁▁▁▁▃▃▃▃▃▁▄▁▁▁▁▁▁▆▇▇▇▇▅▄▄▅▄▁▅▇ █
22.4 ns Histogram: log(frequency) by time 25.6 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
There are no major differences between these two versions. Looking at the median, the second one looks slightly faster, so I chose the latter. |
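Since the two variants differ only in how the final quotient is grouped, a quick numerical check can confirm that they agree up to rounding. The following is a minimal sketch, not the code from the PR: the helper names `power_terms`, `quotient_grouped_1`, and `quotient_grouped_2` are hypothetical, and only the branch taken for sufficiently different `x` and `y` (with non-integer `gamma`) is reproduced.

```julia
# Hypothetical helper names for illustration; they are not part of Trixi.jl.
# Only the `else` branch (x and y sufficiently different, non-integer gamma)
# of the functions in this thread is reproduced here.
power_terms(x, y, gamma) = (exp((gamma - 1) * log(x)), exp((gamma - 1) * log(y)))

# Grouping used in stolarsky_mean_1
function quotient_grouped_1(x, y, gamma)
    xg, yg = power_terms(x, y, gamma)
    return (gamma - 1) * (yg * y - xg * x) / (gamma * (yg - xg))
end

# Grouping used in stolarsky_mean_2
function quotient_grouped_2(x, y, gamma)
    xg, yg = power_terms(x, y, gamma)
    return (gamma - 1) / gamma * (yg * y - xg * x) / (yg - xg)
end

# Same arguments as in the non-integer benchmark above
x, y, gamma = 300.1, 410.7, 1.7
m1 = quotient_grouped_1(x, y, gamma)
m2 = quotient_grouped_2(x, y, gamma)

# Direct evaluation with four calls to ^, used as a reference value
reference = (gamma - 1) / gamma * (y^gamma - x^gamma) / (y^(gamma - 1) - x^(gamma - 1))

@show m1 m2 reference
@show isapprox(m1, m2; rtol = 1e-12)
```

Both groupings reuse the same two power evaluations, so any difference between `m1` and `m2` comes only from floating-point rounding in the last multiplications and divisions.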
* first commit
* format
* Update src/auxiliary/math.jl
  Co-authored-by: Andrew Winters <andrew.ross.winters@liu.se>
* fix for integers
* format
* Update src/auxiliary/math.jl
  Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>
* update stolarsky mean
* fix typo
---------
Co-authored-by: Andrew Winters <andrew.ross.winters@liu.se>
Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>
The Stolarsky mean will come in handy in Trixi Atmo. Here is a faster version.
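For reference, the quantity evaluated by the benchmarked functions in this thread can be written out as follows. This is read off the code shown in the conversation rather than taken from separate documentation, so treat it as a summary of that code. Away from x ≈ y the closed form is used; for small f² the code switches to a truncated series to avoid cancellation in the quotient.

```latex
\[
  \operatorname{stolarsky\_mean}(x, y, \gamma)
  = \frac{\gamma - 1}{\gamma}\,
    \frac{y^{\gamma} - x^{\gamma}}{y^{\gamma - 1} - x^{\gamma - 1}}
\]
% Series branch, used when f^2 < 10^{-4}:
\[
  f^2 = \frac{x (x - 2 y) + y^2}{x (x + 2 y) + y^2}
      = \left( \frac{x - y}{x + y} \right)^{2},
  \qquad
  \operatorname{stolarsky\_mean}(x, y, \gamma)
  \approx \frac{x + y}{2} \left( 1 + c_1 f^2 + c_2 f^4 + c_3 f^6 \right)
\]
\[
  c_1 = \frac{\gamma - 2}{3}, \qquad
  c_2 = -\frac{(\gamma + 1)(\gamma - 3)}{15}\, c_1, \qquad
  c_3 = -\frac{2 \gamma (\gamma - 2) - 9}{21}\, c_2
\]
```

The optimization discussed here only touches the closed form: with y^{γ-1} and x^{γ-1} computed once, the numerator follows from one extra multiplication each, since y^γ = y^{γ-1} · y and x^γ = x^{γ-1} · x.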
Simulation of EC Polytropic Euler with the previous Stolarsky mean:
 ────────────────────────────────────────────────────────────────────────────────────
              Trixi.jl                      Time                    Allocations
                                   ───────────────────────   ────────────────────────
        Tot / % measured:               1.98s /  99.3%           2.34MiB /  90.4%

 Section                   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 rhs!                       1.42k    1.93s   98.1%  1.36ms   5.14KiB    0.2%    3.70B
   volume integral          1.42k    1.74s   88.4%  1.22ms     0.00B    0.0%    0.00B
   interface flux           1.42k    163ms    8.3%   115μs     0.00B    0.0%    0.00B
   surface integral         1.42k   15.3ms    0.8%  10.8μs     0.00B    0.0%    0.00B
   Jacobian                 1.42k   8.10ms    0.4%  5.70μs     0.00B    0.0%    0.00B
   reset ∂u/∂t              1.42k   3.72ms    0.2%  2.62μs     0.00B    0.0%    0.00B
   ~rhs!~                   1.42k   1.00ms    0.1%   705ns   5.14KiB    0.2%    3.70B
   boundary flux            1.42k   46.3μs    0.0%  32.6ns     0.00B    0.0%    0.00B
   source terms             1.42k   45.5μs    0.0%  32.0ns     0.00B    0.0%    0.00B
 calculate dt                 285   26.2ms    1.3%  92.0μs     0.00B    0.0%    0.00B
 analyze solution               4   7.33ms    0.4%  1.83ms    314KiB   14.5%  78.5KiB
 I/O                            5   4.19ms    0.2%   838μs   1.81MiB   85.3%   370KiB
   save solution                4   3.82ms    0.2%   956μs   1.80MiB   84.9%   460KiB
   ~I/O~                        5    365μs    0.0%  73.1μs   8.83KiB    0.4%  1.77KiB
   get element variables        4    730ns    0.0%   182ns     0.00B    0.0%    0.00B
   save mesh                    4    608ns    0.0%   152ns     0.00B    0.0%    0.00B
   get node variables           4   95.0ns    0.0%  23.8ns     0.00B    0.0%    0.00B
 ────────────────────────────────────────────────────────────────────────────────────
Results for the optimized version:
 ────────────────────────────────────────────────────────────────────────────────────
              Trixi.jl                      Time                    Allocations
                                   ───────────────────────   ────────────────────────
        Tot / % measured:               1.62s /  99.1%           2.34MiB /  90.4%

 Section                   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 rhs!                       1.42k    1.57s   97.7%  1.11ms   5.14KiB    0.2%    3.70B
   volume integral          1.42k    1.41s   87.8%   994μs     0.00B    0.0%    0.00B
   interface flux           1.42k    131ms    8.2%  92.4μs     0.00B    0.0%    0.00B
   surface integral         1.42k   14.9ms    0.9%  10.5μs     0.00B    0.0%    0.00B
   Jacobian                 1.42k   7.29ms    0.5%  5.13μs     0.00B    0.0%    0.00B
   reset ∂u/∂t              1.42k   3.83ms    0.2%  2.70μs     0.00B    0.0%    0.00B
   ~rhs!~                   1.42k    956μs    0.1%   673ns   5.14KiB    0.2%    3.70B
   boundary flux            1.42k   61.6μs    0.0%  43.3ns     0.00B    0.0%    0.00B
   source terms             1.42k   24.5μs    0.0%  17.3ns     0.00B    0.0%    0.00B
 calculate dt                 285   26.0ms    1.6%  91.2μs     0.00B    0.0%    0.00B
 analyze solution               4   6.36ms    0.4%  1.59ms    314KiB   14.5%  78.6KiB
 I/O                            5   5.35ms    0.3%  1.07ms   1.81MiB   85.3%   370KiB
   save solution                4   5.00ms    0.3%  1.25ms   1.80MiB   84.9%   461KiB
   ~I/O~                        5    352μs    0.0%  70.5μs   8.83KiB    0.4%  1.77KiB
   save mesh                    4    755ns    0.0%   189ns     0.00B    0.0%    0.00B
   get element variables        4    477ns    0.0%   119ns     0.00B    0.0%    0.00B
   get node variables           4   86.0ns    0.0%  21.5ns     0.00B    0.0%    0.00B
 ────────────────────────────────────────────────────────────────────────────────────
Thus, on my machine, there is an 18% improvement.
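The quoted improvement can be recomputed from the timer output. The snippet below is a small sketch that takes the total, `rhs!`, and volume integral times from the two tables above and prints the relative reduction for each; the helper name `relative_reduction` is introduced here for illustration only.

```julia
# Wall-clock times in seconds, copied from the two timer tables above:
# (section, previous version, optimized version)
timings = [
    ("total", 1.98, 1.62),
    ("rhs!", 1.93, 1.57),
    ("volume integral", 1.74, 1.41),
]

# Hypothetical helper: fraction of the old runtime that was saved
relative_reduction(old, new) = (old - new) / old

for (section, old, new) in timings
    percent = round(100 * relative_reduction(old, new); digits = 1)
    println(rpad(section, 16), " ", percent, " %")
end
```

The reduction in total runtime comes out at roughly 18%, in line with the figure quoted above, and the volume integral, where the mean is evaluated most often in this run, shows a similar or slightly larger reduction.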