Hi,
While optimizing a function, I found a case where reducing allocations (by passing in a preallocated cache struct) actually made the function slower.
The following minimal example:
```julia
using BenchmarkTools
using LinearAlgebra

function foo(m, n, k)
    A = Matrix{Float64}(undef, m, n)
    B = Matrix{Float64}(undef, n, k)
    C = Matrix{Float64}(undef, m, k)
    A .= 1.0
    B .= 1.0
    mul!(C, A, B)
    return C
end

function bar(A, B, C)
    A .= 1.0
    B .= 1.0
    mul!(C, A, B)
    return C
end

struct MyCache
    A::Array{Float64, 2}
    B::Array{Float64, 2}
    C::Array{Float64, 2}
    function MyCache(m, n, k)
        A = zeros(m, n)
        B = zeros(n, k)
        C = zeros(m, k)
        return new(A, B, C)
    end
end

function baz(cache::MyCache)
    cache.A .= 1.0
    cache.B .= 1.0
    mul!(cache.C, cache.A, cache.B)
    return cache.C
end

function test(m, n, k)
    A = zeros(m, n)
    B = zeros(n, k)
    C = zeros(m, k)
    mycache = MyCache(m, n, k)
    @btime foo($m, $n, $k)
    @btime bar($A, $B, $C)
    @btime baz($mycache)
    nothing
end
```
produces this output on my work computer:
```
julia> test(10, 1000, 100)
  94.676 μs (5 allocations: 867.47 KiB)
  143.076 μs (0 allocations: 0 bytes)
  126.666 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600 Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, znver1)
```
On my laptop I get much more intuitive results, with the allocating version being the slowest:
```
julia> test(10, 1000, 100)
  210.219 μs (5 allocations: 867.47 KiB)
  139.470 μs (0 allocations: 0 bytes)
  144.290 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-4210M CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, haswell)
```
@MasonProtter could reproduce similar timings on his AMD machine.
@giordano suggested it may be a code-generation/compiler bug, so I opened this issue.
Edit: Link to the Zulip discussion.
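
For anyone trying to reproduce this, it may help to first confirm that the three variants really do the same work, so that any timing gap cannot be blamed on different results. The following is only a minimal sketch: `check_equivalence` is a hypothetical helper that is not part of the original example, and the definitions of `foo`, `bar`, `baz` and `MyCache` are repeated so the snippet runs on its own. It relies on the fact that with `A` and `B` filled with ones, every entry of `C` should equal `n`.

```julia
using LinearAlgebra

# Allocating variant: creates A, B and C on every call.
function foo(m, n, k)
    A = Matrix{Float64}(undef, m, n)
    B = Matrix{Float64}(undef, n, k)
    C = Matrix{Float64}(undef, m, k)
    A .= 1.0
    B .= 1.0
    mul!(C, A, B)
    return C
end

# Preallocated variant: the caller passes the three buffers.
function bar(A, B, C)
    A .= 1.0
    B .= 1.0
    mul!(C, A, B)
    return C
end

# Preallocated variant with the buffers bundled in a struct.
struct MyCache
    A::Array{Float64, 2}
    B::Array{Float64, 2}
    C::Array{Float64, 2}
    MyCache(m, n, k) = new(zeros(m, n), zeros(n, k), zeros(m, k))
end

function baz(cache::MyCache)
    cache.A .= 1.0
    cache.B .= 1.0
    mul!(cache.C, cache.A, cache.B)
    return cache.C
end

# Hypothetical helper (an assumption for illustration, not from the report):
# returns true if all three variants produce an m-by-k matrix filled with n.
function check_equivalence(m, n, k)
    expected = fill(Float64(n), m, k)
    r_foo = foo(m, n, k)
    r_bar = bar(zeros(m, n), zeros(n, k), zeros(m, k))
    r_baz = baz(MyCache(m, n, k))
    return r_foo ≈ expected && r_bar ≈ expected && r_baz ≈ expected
end

check_equivalence(10, 1000, 100)
```

This only checks correctness. It says nothing about why the timings differ between the two machines.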