Optimized stolarsky_mean#2274

Merged
ranocha merged 10 commits into trixi-framework:main from
MarcoArtiano:opt_stolarsky
Oct 9, 2025

@MarcoArtiano
Contributor

@MarcoArtiano MarcoArtiano commented Feb 10, 2025

The Stolarsky mean will come in handy in Trixi Atmo. Here is a faster version (stolarsky_mean_2), compared with the current one (stolarsky_mean):

julia> @inline function stolarsky_mean(x::RealT, y::RealT, gamma::RealT) where {RealT <: Real}
           epsilon_f2 = convert(RealT, 1.0e-4)
           f2 = (x * (x - 2 * y) + y * y) / (x * (x + 2 * y) + y * y) # f2 = f^2
           if f2 < epsilon_f2
               # convenience coefficients
               c1 = convert(RealT, 1 / 3) * (gamma - 2)
               c2 = convert(RealT, -1 / 15) * (gamma + 1) * (gamma - 3) * c1
               c3 = convert(RealT, -1 / 21) * (2 * gamma * (gamma - 2) - 9) * c2
               return 0.5f0 * (x + y) * @evalpoly(f2, 1, c1, c2, c3)
           else
               return (gamma - 1) / gamma * (y^gamma - x^gamma) /
                      (y^(gamma - 1) - x^(gamma - 1))
           end
       end
stolarsky_mean (generic function with 1 method)

julia> @inline function stolarsky_mean_2(x::RealT, y::RealT, gamma::RealT) where {RealT <: Real}
           epsilon_f2 = convert(RealT, 1.0e-4)
           f2 = (x * (x - 2 * y) + y * y) / (x * (x + 2 * y) + y * y) # f2 = f^2
           if f2 < epsilon_f2
               # convenience coefficients
               c1 = convert(RealT, 1 / 3) * (gamma - 2)
               c2 = convert(RealT, -1 / 15) * (gamma + 1) * (gamma - 3) * c1
               c3 = convert(RealT, -1 / 21) * (2 * gamma * (gamma - 2) - 9) * c2
               return 0.5f0 * (x + y) * @evalpoly(f2, 1, c1, c2, c3)
           else
               expy = exp(gamma*log(y))
               expx = exp(gamma*log(x))
               return (gamma - 1) / gamma * (expy - expx) /
                      (expy/y - expx/x)
           end
       end
stolarsky_mean_2 (generic function with 1 method)

julia> @benchmark value = stolarsky_mean($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.4))[])
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
 Range (min … max):  45.982 ns … 61.638 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     46.163 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   46.301 ns ±  0.665 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▅██▅             ▁▁                                        ▂
  █████▇▆▅▄▁▃▃▄▄▄▁▄▇█████▇▇▆▆▄▆▅▅▄▅▄▅▅▇▇▇▆▆▄▅▅▅▄▃▅▆▄▅▃▃▄▆▆█▇▇ █
  46 ns        Histogram: log(frequency) by time      49.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value = stolarsky_mean_2($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.4))[])
BenchmarkTools.Trial: 10000 samples with 997 evaluations per sample.
 Range (min … max):  19.342 ns … 630.474 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.566 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.007 ns ±   6.343 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█    ▂▂▄                                                    ▁
  ██▇▇▇█████▇▇▇█▇▄▅▄▁▃▁▄▃▁▄▄▄▁▄▁▃▃▁▁▃▄▃▃▄▄▄▃▄▅▅▅▅▅▅▆▆▆▆▅▅▆▆▅▅▅ █
  19.3 ns       Histogram: log(frequency) by time      30.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Simulation of EC Polytropic Euler with the previous stolarsky mean:

────────────────────────────────────────────────────────────────────────────────────
             Trixi.jl                      Time                    Allocations      
                                  ───────────────────────   ────────────────────────
        Tot / % measured:              1.98s /  99.3%           2.34MiB /  90.4%    

Section                   ncalls     time    %tot     avg     alloc    %tot      avg
────────────────────────────────────────────────────────────────────────────────────
rhs!                       1.42k    1.93s   98.1%  1.36ms   5.14KiB    0.2%    3.70B
  volume integral          1.42k    1.74s   88.4%  1.22ms     0.00B    0.0%    0.00B
  interface flux           1.42k    163ms    8.3%   115μs     0.00B    0.0%    0.00B
  surface integral         1.42k   15.3ms    0.8%  10.8μs     0.00B    0.0%    0.00B
  Jacobian                 1.42k   8.10ms    0.4%  5.70μs     0.00B    0.0%    0.00B
  reset ∂u/∂t              1.42k   3.72ms    0.2%  2.62μs     0.00B    0.0%    0.00B
  ~rhs!~                   1.42k   1.00ms    0.1%   705ns   5.14KiB    0.2%    3.70B
  boundary flux            1.42k   46.3μs    0.0%  32.6ns     0.00B    0.0%    0.00B
  source terms             1.42k   45.5μs    0.0%  32.0ns     0.00B    0.0%    0.00B
calculate dt                 285   26.2ms    1.3%  92.0μs     0.00B    0.0%    0.00B
analyze solution               4   7.33ms    0.4%  1.83ms    314KiB   14.5%  78.5KiB
I/O                            5   4.19ms    0.2%   838μs   1.81MiB   85.3%   370KiB
  save solution                4   3.82ms    0.2%   956μs   1.80MiB   84.9%   460KiB
  ~I/O~                        5    365μs    0.0%  73.1μs   8.83KiB    0.4%  1.77KiB
  get element variables        4    730ns    0.0%   182ns     0.00B    0.0%    0.00B
  save mesh                    4    608ns    0.0%   152ns     0.00B    0.0%    0.00B
  get node variables           4   95.0ns    0.0%  23.8ns     0.00B    0.0%    0.00B
────────────────────────────────────────────────────────────────────────────────────

Results for the optimized version:

────────────────────────────────────────────────────────────────────────────────────
             Trixi.jl                      Time                    Allocations      
                                  ───────────────────────   ────────────────────────
        Tot / % measured:              1.62s /  99.1%           2.34MiB /  90.4%    

Section                   ncalls     time    %tot     avg     alloc    %tot      avg
────────────────────────────────────────────────────────────────────────────────────
rhs!                       1.42k    1.57s   97.7%  1.11ms   5.14KiB    0.2%    3.70B
  volume integral          1.42k    1.41s   87.8%   994μs     0.00B    0.0%    0.00B
  interface flux           1.42k    131ms    8.2%  92.4μs     0.00B    0.0%    0.00B
  surface integral         1.42k   14.9ms    0.9%  10.5μs     0.00B    0.0%    0.00B
  Jacobian                 1.42k   7.29ms    0.5%  5.13μs     0.00B    0.0%    0.00B
  reset ∂u/∂t              1.42k   3.83ms    0.2%  2.70μs     0.00B    0.0%    0.00B
  ~rhs!~                   1.42k    956μs    0.1%   673ns   5.14KiB    0.2%    3.70B
  boundary flux            1.42k   61.6μs    0.0%  43.3ns     0.00B    0.0%    0.00B
  source terms             1.42k   24.5μs    0.0%  17.3ns     0.00B    0.0%    0.00B
calculate dt                 285   26.0ms    1.6%  91.2μs     0.00B    0.0%    0.00B
analyze solution               4   6.36ms    0.4%  1.59ms    314KiB   14.5%  78.6KiB
I/O                            5   5.35ms    0.3%  1.07ms   1.81MiB   85.3%   370KiB
  save solution                4   5.00ms    0.3%  1.25ms   1.80MiB   84.9%   461KiB
  ~I/O~                        5    352μs    0.0%  70.5μs   8.83KiB    0.4%  1.77KiB
  save mesh                    4    755ns    0.0%   189ns     0.00B    0.0%    0.00B
  get element variables        4    477ns    0.0%   119ns     0.00B    0.0%    0.00B
  get node variables           4   86.0ns    0.0%  21.5ns     0.00B    0.0%    0.00B
────────────────────────────────────────────────────────────────────────────────────

Thus, on my machine, the total runtime drops from 1.98 s to 1.62 s, an improvement of about 18%.
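
As a quick sanity check, the two closed-form branches above can be compared directly. This is only a sketch and not part of the PR: `pow_branch` and `explog_branch` are hypothetical helper names that isolate the `else` branches of `stolarsky_mean` and `stolarsky_mean_2`, and the tolerance is an arbitrary choice. It assumes x, y > 0 so that `log` is defined.

```julia
# Hypothetical helpers (not in Trixi.jl) isolating the `else` branches above.
pow_branch(x, y, gamma) = (gamma - 1) / gamma * (y^gamma - x^gamma) /
                          (y^(gamma - 1) - x^(gamma - 1))

function explog_branch(x, y, gamma)
    expy = exp(gamma * log(y)) # y^gamma for y > 0
    expx = exp(gamma * log(x)) # x^gamma for x > 0
    # y^(gamma - 1) == y^gamma / y, so no further power is needed
    return (gamma - 1) / gamma * (expy - expx) / (expy / y - expx / x)
end

x, y, gamma = 300.1, 410.7, 1.4
a = pow_branch(x, y, gamma)
b = explog_branch(x, y, gamma)

# Both variants should agree up to floating-point rounding
println(isapprox(a, b; rtol = 1e-10))
```

The inputs are the same as in the benchmarks above, so this checks that the faster variant computes the same value, not how fast it is.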

@github-actions
Contributor

Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

Purpose and scope

  • The PR has a single goal that is clear from the PR title and/or description.
  • All code changes represent a single set of modifications that logically belong together.
  • No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.

Code quality

  • The code can be understood easily.
  • Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
  • There are no redundancies that can be removed by simple modularization/refactoring.
  • There are no leftover debug statements or commented code sections.
  • The code adheres to our conventions and style guide, and to the Julia guidelines.

Documentation

  • New functions and types are documented with a docstring or top-level comment.
  • Relevant publications are referenced in docstrings (see example for formatting).
  • Inline comments are used to document longer or unusual code sections.
  • Comments describe intent ("why?") and not just functionality ("what?").
  • If the PR introduces a significant change or new feature, it is documented in NEWS.md with its PR number.

Testing

  • The PR passes all tests.
  • New or modified lines of code are covered by tests.
  • New or modified tests run in less than 10 seconds.

Performance

  • There are no type instabilities or memory allocations in performance-critical parts.
  • If the PR intent is to improve performance, before/after time measurements are posted in the PR.

Verification

  • The correctness of the code was verified using appropriate tests.
  • If new equations/methods are added, a convergence test has been run and the results
    are posted in the PR.

Created with ❤️ by the Trixi.jl community.

MarcoArtiano added the performance (We are greedy) label Feb 10, 2025
MarcoArtiano requested a review from ranocha February 10, 2025 14:52
@codecov

codecov bot commented Feb 10, 2025

Codecov Report

❌ Patch coverage is 75.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.80%. Comparing base (f4bbcd9) to head (24b441c).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/auxiliary/math.jl 75.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2274      +/-   ##
==========================================
- Coverage   96.80%   96.80%   -0.00%     
==========================================
  Files         528      528              
  Lines       42655    42660       +5     
==========================================
+ Hits        41292    41295       +3     
- Misses       1363     1365       +2     
Flag Coverage Δ
unittests 96.80% <75.00%> (-<0.01%) ⬇️

Member

@andrewwinters5000 andrewwinters5000 left a comment

This is an interesting (and ingenious) way to rewrite the expression and possibly improve performance. Would it be worthwhile to also benchmark it on Roci (or some other machine) just to see its influence?

Co-authored-by: Andrew Winters <andrew.ross.winters@liu.se>
Member

@ranocha ranocha left a comment

Thanks!

@MarcoArtiano
Contributor Author

MarcoArtiano commented Feb 11, 2025

Hendrik made me realize that exp(log(...)) is slower than ^ for integer exponents. I also tried a variant that avoids the division, but it makes no measurable difference. I added a specialization for integer exponents; the functions and results are as follows:
Functions

julia> @inline function stolarsky_mean(x::RealT, y::RealT, gamma::RealT) where {RealT <: Real}
           epsilon_f2 = convert(RealT, 1.0e-4)
           f2 = (x * (x - 2 * y) + y * y) / (x * (x + 2 * y) + y * y) # f2 = f^2
           if f2 < epsilon_f2
               # convenience coefficients
               c1 = convert(RealT, 1 / 3) * (gamma - 2)
               c2 = convert(RealT, -1 / 15) * (gamma + 1) * (gamma - 3) * c1
               c3 = convert(RealT, -1 / 21) * (2 * gamma * (gamma - 2) - 9) * c2
               return 0.5f0 * (x + y) * @evalpoly(f2, 1, c1, c2, c3)
           else
               return (gamma - 1) / gamma * (y^gamma - x^gamma) /
                      (y^(gamma - 1) - x^(gamma - 1))
           end
       end
stolarsky_mean (generic function with 1 method)

julia> @inline function stolarsky_mean_2(x::RealT, y::RealT, gamma::RealT) where {RealT <: Real}
           epsilon_f2 = convert(RealT, 1.0e-4)
           f2 = (x * (x - 2 * y) + y * y) / (x * (x + 2 * y) + y * y) # f2 = f^2
           if f2 < epsilon_f2
               # convenience coefficients
               c1 = convert(RealT, 1 / 3) * (gamma - 2)
               c2 = convert(RealT, -1 / 15) * (gamma + 1) * (gamma - 3) * c1
               c3 = convert(RealT, -1 / 21) * (2 * gamma * (gamma - 2) - 9) * c2
               return 0.5f0 * (x + y) * @evalpoly(f2, 1, c1, c2, c3)
           else
               expx = x^(gamma-1)
               expy = y^(gamma-1)
               return (gamma - 1) / gamma * (expy*y - expx*x) /
                      (expy - expx)
           end
       end
stolarsky_mean_2 (generic function with 1 method)

julia> @inline function stolarsky_mean_3(x::RealT, y::RealT, gamma::RealT) where {RealT <: Real}
           epsilon_f2 = convert(RealT, 1.0e-4)
           f2 = (x * (x - 2 * y) + y * y) / (x * (x + 2 * y) + y * y) # f2 = f^2
           if f2 < epsilon_f2
               # convenience coefficients
               c1 = convert(RealT, 1 / 3) * (gamma - 2)
               c2 = convert(RealT, -1 / 15) * (gamma + 1) * (gamma - 3) * c1
               c3 = convert(RealT, -1 / 21) * (2 * gamma * (gamma - 2) - 9) * c2
               return 0.5f0 * (x + y) * @evalpoly(f2, 1, c1, c2, c3)
           else
               expy = exp((gamma-1)*log(y))
               expx = exp((gamma-1)*log(x))
               return (gamma - 1) / gamma * (expy*y - expx*x) /
                      (expy - expx)
           end
       end
stolarsky_mean_3 (generic function with 1 method)

julia> @inline function stolarsky_mean_4(x::RealT, y::RealT, gamma::RealT) where {RealT <: Real}
           epsilon_f2 = convert(RealT, 1.0e-4)
           f2 = (x * (x - 2 * y) + y * y) / (x * (x + 2 * y) + y * y) # f2 = f^2
           if f2 < epsilon_f2
               # convenience coefficients
               c1 = convert(RealT, 1 / 3) * (gamma - 2)
               c2 = convert(RealT, -1 / 15) * (gamma + 1) * (gamma - 3) * c1
               c3 = convert(RealT, -1 / 21) * (2 * gamma * (gamma - 2) - 9) * c2
               return 0.5f0 * (x + y) * @evalpoly(f2, 1, c1, c2, c3)
           else
               if isinteger(gamma)
                   expy = y^(gamma-1)
                   expx = x^(gamma-1)
               else
                   expy = exp((gamma-1)*log(y))
                   expx = exp((gamma-1)*log(x))
               end
               return (gamma - 1) / gamma * (expy*y - expx*x) /
                      (expy - expx)
           end
       end
stolarsky_mean_4 (generic function with 1 method)

For non-integer gamma (gamma = 1.4):

julia> @benchmark value = stolarsky_mean($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.4))[])
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
 Range (min … max):  45.981 ns … 56.275 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     46.167 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   46.241 ns ±  0.440 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁▄▆██▇▄                      ▁▁▂▁                          ▂
  ▆███████▇▅▅▄▁▁▃▁▃▁▁▃▁▃▃▁▁▁▁▄▅▇████▇▇▆▆▆▄▅▅▅▅▄▅▄▃▄▄▁▄▃▄▃▁▃▃▄ █
  46 ns        Histogram: log(frequency) by time      48.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value = stolarsky_mean_2($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.4))[])
BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
 Range (min … max):  23.530 ns … 42.782 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.656 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.853 ns ±  0.654 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▅█▇▂                                 ▁▃▁                   ▁
  ▆████▆▅▅▅▄▄▄▅▄▄▅▄▅▄▅▄▅▄▃▅▇███▇▇▆▆▆▅▅▆▆████▇▆▅▅▄▃▅▅▅▄▄▄▄▅▃▅▅ █
  23.5 ns      Histogram: log(frequency) by time      26.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value = stolarsky_mean_3($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.4))[])
BenchmarkTools.Trial: 10000 samples with 997 evaluations per sample.
 Range (min … max):  19.433 ns … 25.629 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.512 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.601 ns ±  0.378 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▇█▅                                     ▁▁                 ▂
  █████▄▅▅▅▅▅▅▅▅▃▄▅▄▅▄▃▄▆▄▅▅▄▄▅▅▇▇███▆▇▆▆▆▇███▇▆▅▅▄▄▃▁▄▅▄▃▆▅▄ █
  19.4 ns      Histogram: log(frequency) by time      21.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value = stolarsky_mean_4($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.4))[])
BenchmarkTools.Trial: 10000 samples with 997 evaluations per sample.
 Range (min … max):  19.665 ns … 34.485 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.760 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.803 ns ±  0.360 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▂█▇                                                        
  ▂▄████▃▂▂▂▁▂▁▁▁▁▁▁▂▂▁▂▂▁▁▁▁▁▁▂▁▁▁▂▁▁▁▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▂
  19.7 ns         Histogram: frequency by time        21.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

For integer-valued gamma (gamma = 1.0):

julia> @benchmark value = stolarsky_mean($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.0))[])
BenchmarkTools.Trial: 10000 samples with 998 evaluations per sample.
 Range (min … max):  16.133 ns … 30.958 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     17.224 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.084 ns ±  0.718 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄▁  ██       ▃▄   ▄▁  ██ ▁ ▂   ▂▂   ▂   ▅▆   ▃   ▂▃         ▃
  ██▁▁███▅▄▆▃▃▁██▄▄▁██▅█████▇██▇▆██▆▄▅█▇▅▆██▇▇███▆▆██▆▅▅▇▅▅▅█ █
  16.1 ns      Histogram: log(frequency) by time        19 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value = stolarsky_mean_2($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.0))[])
BenchmarkTools.Trial: 10000 samples with 999 evaluations per sample.
 Range (min … max):  7.857 ns … 22.603 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.162 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.107 ns ±  0.444 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

            ▁        ▁█▄▄  ▂   ▁█▄  ▄▃▂  ▁▆   ▂▃         ▁▁  ▂
  ▇▆▁▁▃▆▃▁▁▁█▄▄▆▅▇▄▃▅████▆▇█▆▄▅███▆▆███▆▇██▅▅▄███▇▆▆▇▆▆▅▇██▇ █
  7.86 ns      Histogram: log(frequency) by time     10.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value = stolarsky_mean_3($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.0))[])
BenchmarkTools.Trial: 10000 samples with 997 evaluations per sample.
 Range (min … max):  19.421 ns … 25.844 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.508 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.540 ns ±  0.250 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▅█▁                                                       
  ▂▃▇███▄▂▂▂▂▁▂▁▁▁▂▁▁▂▁▁▁▂▁▁▁▁▂▁▁▁▁▁▁▁▂▂▁▁▂▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂ ▂
  19.4 ns         Histogram: frequency by time        20.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value = stolarsky_mean_4($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.0))[])
BenchmarkTools.Trial: 10000 samples with 999 evaluations per sample.
 Range (min … max):  7.203 ns … 30.064 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.531 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.757 ns ±  0.461 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

            ▂▁     ▁▁       █▇   ▂   █▆ ▁▂▁   ▅   ▂          ▂
  ▄▁▁▃▁▁▄▁▇████▄▃▁▄██▇▅▆█▅▄▅██▇▆▆█▅▆▅██▆███▆▅▅██▆▆█▇▆▅▆▆▇▆▇█ █
  7.2 ns       Histogram: log(frequency) by time       10 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

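So, basically, just computing each power once and reusing it (stolarsky_mean_2) gives a speedup of about 50%: the median drops from 46.2 ns to 23.7 ns for gamma = 1.4 and from 17.2 ns to 9.2 ns for gamma = 1.0. Replacing x^a by exp(a*log(x)) is a smaller additional improvement (23.7 ns to 19.5 ns) and only helps for non-integer gamma. Hendrik also pointed out that, for real exponents, Julia's x^a is essentially calling exp(a*log(x)) itself. I also noticed that in newer versions of Julia the difference between x^a and exp(a*log(x)) is less pronounced.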
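
The reuse trick relies on the identity y^gamma = y^(gamma - 1) * y, so only one power per argument is needed. The following sketch (not part of the PR) checks this numerically. `reuse_branch` and `direct_branch` are hypothetical helper names mirroring the `else` branches of `stolarsky_mean_4` and `stolarsky_mean`. It also uses the fact that for gamma = 2 the formula reduces to the arithmetic mean, since (y^2 - x^2) / (y - x) = x + y.

```julia
# Hypothetical helper mirroring the `else` branch of `stolarsky_mean_4` above.
function reuse_branch(x, y, gamma)
    if isinteger(gamma)
        yg = y^(gamma - 1)
        xg = x^(gamma - 1)
    else
        yg = exp((gamma - 1) * log(y)) # y^(gamma - 1) for y > 0
        xg = exp((gamma - 1) * log(x)) # x^(gamma - 1) for x > 0
    end
    # y^gamma == yg * y, so each power is computed only once
    return (gamma - 1) / gamma * (yg * y - xg * x) / (yg - xg)
end

# Hypothetical helper mirroring the original `else` branch of `stolarsky_mean`.
direct_branch(x, y, gamma) = (gamma - 1) / gamma * (y^gamma - x^gamma) /
                             (y^(gamma - 1) - x^(gamma - 1))

x, y = 300.1, 410.7

# gamma = 2.0 takes the `isinteger` branch and should give the arithmetic mean
mean_gamma2 = reuse_branch(x, y, 2.0)
println(isapprox(mean_gamma2, (x + y) / 2; rtol = 1e-12))

# gamma = 1.4 takes the exp/log branch and should match the direct formula
mean_gamma14 = reuse_branch(x, y, 1.4)
println(isapprox(mean_gamma14, direct_branch(x, y, 1.4); rtol = 1e-10))
```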

@MarcoArtiano
Contributor Author

MarcoArtiano commented Feb 11, 2025

Results on university machine (Goldstein):

new version

────────────────────────────────────────────────────────────────────────────────────────────────────
Trixi.jl simulation finished.  Final time: 2.0  Time steps: 284 (accepted), 284 (total)
────────────────────────────────────────────────────────────────────────────────────────────────────

────────────────────────────────────────────────────────────────────────────────────
             Trixi.jl                      Time                    Allocations      
                                  ───────────────────────   ────────────────────────
        Tot / % measured:              3.07s /  99.5%           2.35MiB /  90.4%    

Section                   ncalls     time    %tot     avg     alloc    %tot      avg
────────────────────────────────────────────────────────────────────────────────────
rhs!                       1.42k    2.93s   95.8%  2.06ms   5.14KiB    0.2%    3.70B
  volume integral          1.42k    2.65s   86.7%  1.86ms     0.00B    0.0%    0.00B
  interface flux           1.42k    236ms    7.7%   166μs     0.00B    0.0%    0.00B
  surface integral         1.42k   24.7ms    0.8%  17.4μs     0.00B    0.0%    0.00B
  Jacobian                 1.42k   12.2ms    0.4%  8.56μs     0.00B    0.0%    0.00B
  reset ∂u/∂t              1.42k   5.15ms    0.2%  3.63μs     0.00B    0.0%    0.00B
  ~rhs!~                   1.42k    863μs    0.0%   607ns   5.14KiB    0.2%    3.70B
  boundary flux            1.42k   33.6μs    0.0%  23.7ns     0.00B    0.0%    0.00B
  source terms             1.42k   30.3μs    0.0%  21.3ns     0.00B    0.0%    0.00B
calculate dt                 285   76.0ms    2.5%   267μs     0.00B    0.0%    0.00B
I/O                            5   40.0ms    1.3%  8.01ms   1.81MiB   85.2%   370KiB
  save solution                4   33.5ms    1.1%  8.37ms   1.80MiB   84.8%   461KiB
  ~I/O~                        5   6.57ms    0.2%  1.31ms   8.83KiB    0.4%  1.77KiB
  get element variables        4    744ns    0.0%   186ns     0.00B    0.0%    0.00B
  save mesh                    4    605ns    0.0%   151ns     0.00B    0.0%    0.00B
  get node variables           4    371ns    0.0%  92.8ns     0.00B    0.0%    0.00B
analyze solution               4   11.3ms    0.4%  2.83ms    316KiB   14.5%  79.0KiB
────────────────────────────────────────────────────────────────────────────────────

old version:

────────────────────────────────────────────────────────────────────────────────────────────────────
Trixi.jl simulation finished.  Final time: 2.0  Time steps: 284 (accepted), 284 (total)
────────────────────────────────────────────────────────────────────────────────────────────────────

────────────────────────────────────────────────────────────────────────────────────
             Trixi.jl                      Time                    Allocations      
                                  ───────────────────────   ────────────────────────
        Tot / % measured:              3.69s /  99.5%           2.34MiB /  90.4%    

Section                   ncalls     time    %tot     avg     alloc    %tot      avg
────────────────────────────────────────────────────────────────────────────────────
rhs!                       1.42k    3.54s   96.4%  2.49ms   5.14KiB    0.2%    3.70B
  volume integral          1.42k    3.20s   87.2%  2.25ms     0.00B    0.0%    0.00B
  interface flux           1.42k    298ms    8.1%   209μs     0.00B    0.0%    0.00B
  surface integral         1.42k   24.6ms    0.7%  17.3μs     0.00B    0.0%    0.00B
  Jacobian                 1.42k   12.3ms    0.3%  8.63μs     0.00B    0.0%    0.00B
  reset ∂u/∂t              1.42k   5.23ms    0.1%  3.68μs     0.00B    0.0%    0.00B
  ~rhs!~                   1.42k   1.01ms    0.0%   714ns   5.14KiB    0.2%    3.70B
  boundary flux            1.42k   35.8μs    0.0%  25.2ns     0.00B    0.0%    0.00B
  source terms             1.42k   30.8μs    0.0%  21.7ns     0.00B    0.0%    0.00B
calculate dt                 285   75.9ms    2.1%   266μs     0.00B    0.0%    0.00B
I/O                            5   42.0ms    1.1%  8.40ms   1.81MiB   85.3%   370KiB
  save solution                4   35.3ms    1.0%  8.82ms   1.80MiB   84.9%   461KiB
  ~I/O~                        5   6.69ms    0.2%  1.34ms   8.83KiB    0.4%  1.77KiB
  get element variables        4    559ns    0.0%   140ns     0.00B    0.0%    0.00B
  save mesh                    4    426ns    0.0%   106ns     0.00B    0.0%    0.00B
  get node variables           4    222ns    0.0%  55.5ns     0.00B    0.0%    0.00B
analyze solution               4   13.2ms    0.4%  3.30ms    314KiB   14.5%  78.6KiB
────────────────────────────────────────────────────────────────────────────────────

On this machine the total runtime drops from 3.69 s to 3.07 s, so there's still a roughly 17% improvement.
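
The quoted percentages can be reproduced from the total wall times in the timer tables (1.98 s and 1.62 s on the first machine, 3.69 s and 3.07 s on Goldstein). A minimal sketch, where `relative_improvement` is a hypothetical helper name:

```julia
# Fraction of the old runtime that is saved by the new version.
relative_improvement(t_old, t_new) = (t_old - t_new) / t_old

gain_first_machine = relative_improvement(1.98, 1.62) # totals from the first comment
gain_goldstein = relative_improvement(3.69, 3.07)     # totals from the Goldstein runs

println(round(100 * gain_first_machine; digits = 1), " %")
println(round(100 * gain_goldstein; digits = 1), " %")
```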

Benchmarks for Goldstein:

julia> @benchmark value = stolarsky_mean($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.4))[])
BenchmarkTools.Trial: 10000 samples with 960 evaluations per sample.
 Range (min … max):  87.431 ns …  1.192 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     88.289 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   88.624 ns ± 13.750 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▁▅▅▃▇█▇▅▆▅▅▆▅▄▁                                      
  ▁▂▃▅▆▇██████████████████▅▄▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  87.4 ns         Histogram: frequency by time        90.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value = stolarsky_mean_2($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.4))[])
BenchmarkTools.Trial: 10000 samples with 976 evaluations per sample.
 Range (min … max):  44.864 ns …  1.349 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     45.267 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.460 ns ± 13.072 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁▅▄▅▅▆▇█▆▄▄▄▃▁                                            
  ▂▃▅██████████████▇▅▄▃▂▂▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▂▁▁▂▂▂▂ ▄
  44.9 ns         Histogram: frequency by time        47.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value = stolarsky_mean_3($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.4))[])
BenchmarkTools.Trial: 10000 samples with 992 evaluations per sample.
 Range (min … max):  37.400 ns …  1.062 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     37.678 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.034 ns ± 10.298 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂█▅                                                          
  ███▇▆▆▄▄▃▃▃▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▁▁▁▂▁▁▁▁▁▁▁▁▁▂▁▁▁▂▂▂ ▃
  37.4 ns         Histogram: frequency by time        44.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value = stolarsky_mean_4($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.4))[])
BenchmarkTools.Trial: 10000 samples with 992 evaluations per sample.
 Range (min … max):  38.200 ns … 78.343 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     38.464 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.558 ns ±  0.867 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▁▂▅██▆▄▃▂▁▂▁                                             
  ▁▂▄▇████████████▇█▇▆▅▅▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  38.2 ns         Histogram: frequency by time        39.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Integer-valued gamma (gamma = 1.0):

julia> @benchmark value = stolarsky_mean($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.0))[])
BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
 Range (min … max):  24.793 ns … 69.045 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     24.805 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.874 ns ±  0.858 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆▇▆▂                       ▂▃▁▁                            ▂
  ██████▇▇▅▄▃▃▁▄▄▃▁▃▁▁▁▁▃▁▁▁▁▁████▇▁▁▁▁▁▁▁▁▃▁▃▁▁▅▅▁▃▁▄▃▁▄▁▇█▇ █
  24.8 ns      Histogram: log(frequency) by time      25.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value = stolarsky_mean_2($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.0))[])
BenchmarkTools.Trial: 10000 samples with 999 evaluations per sample.
 Range (min … max):  11.524 ns … 45.145 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.833 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.931 ns ±  0.569 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █             ▂                            ▁                 
  █▇▄▂▂▂▂▂▂▁▁▁▂▁█▅▃▂▂▂▁▂▂▂▂▂▂▂██▅▂▂▂▂▂▂▁▂▂▂▂▂█▄▂▂▂▂▂▂▂▂▂▂▂▁▃▂ ▃
  11.5 ns         Histogram: frequency by time        12.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value = stolarsky_mean_3($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.0))[])
BenchmarkTools.Trial: 10000 samples with 992 evaluations per sample.
 Range (min … max):  37.260 ns …  1.088 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     37.533 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   37.786 ns ± 10.568 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▂██▇████▆▅▃▂                                               
  ▂▅████████████▇▇▅▆▅▅▅▆▆▅▄▅▄▄▄▄▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁ ▄
  37.3 ns         Histogram: frequency by time        38.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value = stolarsky_mean_4($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.0))[])
BenchmarkTools.Trial: 10000 samples with 999 evaluations per sample.
 Range (min … max):  11.528 ns … 109.493 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.570 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.817 ns ±   1.132 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▇▅▃▂▂▁▁ ▁  ▁          ▆▆▄          ▁          ▁           ▃ ▂
  ███████████████▆▅▆▆▇▆▆▆████▆▅▃▆▅▅▅▁▇█▆▃▄▄▁▃▁▁▄▁██▅▃▃▁▃▁▁▄▁▄█ █
  11.5 ns       Histogram: log(frequency) by time        13 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

MarcoArtiano and others added 3 commits February 11, 2025 14:13
@MarcoArtiano
Contributor Author

julia> @inline stolarsky_mean_1(x::Real, y::Real, gamma::Real) = stolarsky_mean_1(promote(x, y)..., gamma)
stolarsky_mean_1 (generic function with 2 methods)

julia> @inline function stolarsky_mean_1(x::RealT, y::RealT, gamma::Real) where {RealT <: Real}
           epsilon_f2 = convert(RealT, 1.0e-4)
           f2 = (x * (x - 2 * y) + y * y) / (x * (x + 2 * y) + y * y) # f2 = f^2
           if f2 < epsilon_f2
               # convenience coefficients
               c1 = convert(RealT, 1 / 3) * (gamma - 2)
               c2 = convert(RealT, -1 / 15) * (gamma + 1) * (gamma - 3) * c1
               c3 = convert(RealT, -1 / 21) * (2 * gamma * (gamma - 2) - 9) * c2
               return 0.5f0 * (x + y) * @evalpoly(f2, 1, c1, c2, c3)
           else
               if gamma isa Integer
                   yg = y^(gamma - 1)
                   xg = x^(gamma - 1)
               else
                   yg = exp((gamma - 1) * log(y)) # equivalent to y^(gamma - 1) but faster for non-integers
                   xg = exp((gamma - 1) * log(x)) # equivalent to x^(gamma - 1) but faster for non-integers
               end
               return (gamma - 1) * (yg * y - xg * x) / (gamma * (yg - xg))
           end
       end
stolarsky_mean_1 (generic function with 2 methods)

julia> @inline stolarsky_mean_2(x::Real, y::Real, gamma::Real) = stolarsky_mean_2(promote(x, y)..., gamma)
stolarsky_mean_2 (generic function with 2 methods)

julia> @inline function stolarsky_mean_2(x::RealT, y::RealT, gamma::Real) where {RealT <: Real}
           epsilon_f2 = convert(RealT, 1.0e-4)
           f2 = (x * (x - 2 * y) + y * y) / (x * (x + 2 * y) + y * y) # f2 = f^2
           if f2 < epsilon_f2
               # convenience coefficients
               c1 = convert(RealT, 1 / 3) * (gamma - 2)
               c2 = convert(RealT, -1 / 15) * (gamma + 1) * (gamma - 3) * c1
               c3 = convert(RealT, -1 / 21) * (2 * gamma * (gamma - 2) - 9) * c2
               return 0.5f0 * (x + y) * @evalpoly(f2, 1, c1, c2, c3)
           else
               if gamma isa Integer
                   yg = y^(gamma - 1)
                   xg = x^(gamma - 1)
               else
                   yg = exp((gamma - 1) * log(y)) # equivalent to y^(gamma - 1) but faster for non-integers
                   xg = exp((gamma - 1) * log(x)) # equivalent to x^(gamma - 1) but faster for non-integers
               end
               return (gamma - 1) / gamma * (yg * y - xg * x) / (yg - xg)
           end
       end
stolarsky_mean_2 (generic function with 2 methods)

julia> @benchmark value = stolarsky_mean_1($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.0))[])

BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
 Range (min … max):  22.289 ns … 34.858 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     22.791 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.045 ns ±  1.610 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄▆██▃▂▂▂ ▁▂▁▂▁ ▁▁▂ ▂▁▂▂▂ ▁▂▁▂ ▂▂▂ ▂▅▅█▃                    ▂
  ▃██████████████▇████████████████████████▅▃▁▆▅▆▇▆▄▃▅▅▅▃▁▅▃▅▅ █
  22.3 ns      Histogram: log(frequency) by time        28 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value = stolarsky_mean_2($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.0))[])

BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
 Range (min … max):  22.368 ns … 32.981 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     22.550 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.639 ns ±  0.466 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃██▆▆▃▁                                                    ▂
  ▅███████▃▁▁▁▁▄▆█▆▆▄▃▁▁▁▁▃▄▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▆█▇▇▇▆▄▄▃▃▄▅ █
  22.4 ns      Histogram: log(frequency) by time      25.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value = stolarsky_mean_1($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.7))[])

BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
 Range (min … max):  22.409 ns … 320.979 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     22.590 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.693 ns ±   3.015 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ██                                                         
  ▆▄▇██▇█▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▂▂▂▂ ▃
  22.4 ns         Histogram: frequency by time         25.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark value = stolarsky_mean_2($(Ref(300.1))[], $(Ref(410.7))[], $(Ref(1.7))[])

BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
 Range (min … max):  22.430 ns … 318.380 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     22.560 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.678 ns ±   3.001 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▆█▆▅▂                                                       ▂
  ███████▅▃▁▁▁▄▅█▇▅▅▄▁▁▁▁▃▃▁▁▁▁▁▁▁▁▁▃▃▃▃▃▁▄▁▁▁▁▁▁▆▇▇▇▇▅▄▄▅▄▁▅▇ █
  22.4 ns       Histogram: log(frequency) by time      25.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

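There are no major differences between these two versions. Their median times are nearly identical (22.791 ns vs. 22.550 ns for gamma = 1.0, 22.590 ns vs. 22.560 ns for gamma = 1.7); the second is marginally faster in both cases, so I chose that one.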
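
As a quick sanity check alongside the timings, the two return expressions above can be compared with the direct power formula from the original implementation. This is only a minimal sketch: the names `stolarsky_ref`, `stolarsky_v1` and `stolarsky_v2` are hypothetical helpers introduced here, they cover only the `else` branch for non-integer `gamma`, and the tolerance is an arbitrary choice.

```julia
# Reference: direct power formula, as in the original `else` branch.
stolarsky_ref(x, y, gamma) =
    (gamma - 1) / gamma * (y^gamma - x^gamma) / (y^(gamma - 1) - x^(gamma - 1))

# Variant 1 (hypothetical name): a single division, as in stolarsky_mean_1.
function stolarsky_v1(x, y, gamma)
    yg = exp((gamma - 1) * log(y)) # y^(gamma - 1)
    xg = exp((gamma - 1) * log(x)) # x^(gamma - 1)
    return (gamma - 1) * (yg * y - xg * x) / (gamma * (yg - xg))
end

# Variant 2 (hypothetical name): two divisions, as in stolarsky_mean_2.
function stolarsky_v2(x, y, gamma)
    yg = exp((gamma - 1) * log(y)) # y^(gamma - 1)
    xg = exp((gamma - 1) * log(x)) # x^(gamma - 1)
    return (gamma - 1) / gamma * (yg * y - xg * x) / (yg - xg)
end

# Same x and y as in the benchmarks. gamma = 1.7 is used here because
# gamma = 1 makes these expressions evaluate to 0 / 0.
x, y, gamma = 300.1, 410.7, 1.7
ref = stolarsky_ref(x, y, gamma)
v1 = stolarsky_v1(x, y, gamma)
v2 = stolarsky_v2(x, y, gamma)

# Both variants should match the reference up to rounding.
println(isapprox(v1, ref; rtol = 1e-12))
println(isapprox(v2, ref; rtol = 1e-12))
```

The variants differ only in how the factor `(gamma - 1) / gamma` is grouped, so any difference in the results should be at the level of floating-point rounding.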

Member

@ranocha ranocha left a comment

Thanks!

@ranocha ranocha enabled auto-merge (squash) October 9, 2025 09:16
@ranocha ranocha disabled auto-merge October 9, 2025 12:17
@ranocha ranocha merged commit 3fc9db1 into trixi-framework:main Oct 9, 2025
89 of 93 checks passed
DanielDoehring pushed a commit to DanielDoehring/Trixi.jl that referenced this pull request Feb 19, 2026
* first commit

* format

* Update src/auxiliary/math.jl

Co-authored-by: Andrew Winters <andrew.ross.winters@liu.se>

* fix for integers

* format

* Update src/auxiliary/math.jl

Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>

* update stolarsky mean

* fix typo

---------

Co-authored-by: Andrew Winters <andrew.ross.winters@liu.se>
Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>
Labels

performance We are greedy

3 participants