Allocating code faster than non-allocating #39566

Closed
@fkastner

Hi,
while optimizing a function I found a case where reducing allocations (by providing a cache struct) actually made the function slower.
The following minimal example:

using BenchmarkTools
using LinearAlgebra

function foo(m,n,k)
    A = Matrix{Float64}(undef, m, n)
    B = Matrix{Float64}(undef, n, k)
    C = Matrix{Float64}(undef, m, k)
    A .= 1.0
    B .= 1.0
    mul!(C, A, B)
    return C
end

function bar(A,B,C)
    A .= 1.0
    B .= 1.0
    mul!(C, A, B)
    return C
end

struct MyCache
    A::Array{Float64, 2}
    B::Array{Float64, 2}
    C::Array{Float64, 2}
    function MyCache(m,n,k)
        A = zeros(m,n)
        B = zeros(n,k)
        C = zeros(m,k)
        return new(A,B,C)
    end
end

function baz(cache::MyCache)
    cache.A .= 1.0
    cache.B .= 1.0
    mul!(cache.C, cache.A, cache.B)
    return cache.C
end

function test(m, n, k)
    A = zeros(m,n)
    B = zeros(n,k)
    C = zeros(m,k)
    mycache = MyCache(m, n, k)

    @btime foo($m, $n, $k)
    @btime bar($A, $B, $C)
    @btime baz($mycache)
    nothing
end

produces this output on my work computer:

julia> test(10, 1000, 100)
  94.676 μs (5 allocations: 867.47 KiB)
  143.076 μs (0 allocations: 0 bytes)
  126.666 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600 Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, znver1)

On my laptop I get much more intuitive results:

julia> test(10, 1000, 100)
  210.219 μs (5 allocations: 867.47 KiB)
  139.470 μs (0 allocations: 0 bytes)
  144.290 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-4210M CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, haswell)

@MasonProtter could reproduce similar timings on his AMD machine.
@giordano suggested it may be a code-generation/compiler bug, so I opened this issue.

Edit: Link to the Zulip discussion.
