Allocating code faster than non-allocating #39566

Hi,
while optimizing a function I found a case where reducing allocations (by passing in preallocated arrays or a cache struct) actually made the function slower.
The following minimal example:

```julia
using BenchmarkTools
using LinearAlgebra

function foo(m,n,k)
    A = Matrix{Float64}(undef, m, n)
    B = Matrix{Float64}(undef, n, k)
    C = Matrix{Float64}(undef, m, k)
    A .= 1.0
    B .= 1.0
    mul!(C, A, B)
    return C
end

function bar(A,B,C)
    A .= 1.0
    B .= 1.0
    mul!(C, A, B)
    return C
end

struct MyCache
    A::Array{Float64, 2}
    B::Array{Float64, 2}
    C::Array{Float64, 2}
    function MyCache(m,n,k)
        A = zeros(m,n)
        B = zeros(n,k)
        C = zeros(m,k)
        return new(A,B,C)
    end
end

function baz(cache::MyCache)
    cache.A .= 1.0
    cache.B .= 1.0
    mul!(cache.C, cache.A, cache.B)
    return cache.C
end

function test(m, n, k)
    A = zeros(m,n)
    B = zeros(n,k)
    C = zeros(m,k)
    mycache = MyCache(m, n, k)

    @btime foo($m, $n, $k)
    @btime bar($A, $B, $C)
    @btime baz($mycache)
    nothing
end
```

produces this output on my work computer (timings for `foo`, `bar`, and `baz`, in that order):

```julia
julia> test(10, 1000, 100)
  94.676 μs (5 allocations: 867.47 KiB)
  143.076 μs (0 allocations: 0 bytes)
  126.666 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600 Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, znver1)
```

On my laptop I get much more intuitive results, with the allocating `foo` being the slowest:
```julia
julia> test(10, 1000, 100)
  210.219 μs (5 allocations: 867.47 KiB)
  139.470 μs (0 allocations: 0 bytes)
  144.290 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-4210M CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, haswell)
```

@MasonProtter could reproduce similar timings on his AMD machine.
@giordano suggested it may be a code-generation/compiler bug, so I opened this issue.
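
One way to narrow this down, as a rough sketch rather than a diagnosis, would be to time the broadcast fills and the `mul!` call separately on preallocated arrays. The helper names `fill_ones!` and `split_timings` below are made up for illustration and are not part of the example above; the sketch only assumes `BenchmarkTools` and `LinearAlgebra`, as before.

```julia
using BenchmarkTools
using LinearAlgebra

# Hypothetical helper: the two in-place fills used by `bar` and `baz`.
function fill_ones!(A, B)
    A .= 1.0
    B .= 1.0
    return nothing
end

# Hypothetical helper: time the fills and the multiplication separately
# on preallocated matrices of the given sizes.
function split_timings(m, n, k)
    A = zeros(m, n)
    B = zeros(n, k)
    C = zeros(m, k)

    # @belapsed returns the minimum observed time in seconds.
    t_fill = @belapsed fill_ones!($A, $B)
    t_mul = @belapsed mul!($C, $A, $B)

    return (fill = t_fill, mul = t_mul)
end
```

Calling `split_timings(10, 1000, 100)` on both machines would show whether the gap sits in the fills or in the matrix multiplication, which might help decide where to look next.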

Edit: [Link](https://julialang.zulipchat.com/#narrow/stream/225542-helpdesk/topic/Allocating.20Code.20faster.20than.20non-allocating.3F) to the Zulip discussion.
