Euclidean Distance Transform #576
Conversation
I had time to analyze memory consumption even further and figured out that the earlier sub2ind sum computation can be replaced with a much less memory-hungry approach. The stats are now:
Hence, the runtime is now at about 0.105 seconds using 9.706 MB of memory.
Nice work, @muhlba91 (and @wkearn whose work this carries forward)! This performance is getting into the ballpark, so I'm willing to see this merged.
I've made a few suggestions below, mostly concerning whether it would make sense to switch from an `Int` representation of pixel locations to a `CartesianIndex{N}` representation. (See http://julialang.org/blog/2016/02/iteration if you're unfamiliar with this object.) You don't have to take this advice, as there are likely to be pluses and minuses to switching.
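For readers unfamiliar with the object, here is a minimal sketch (not code from this PR) of `CartesianIndex`-based iteration. Note that `CartesianIndices` is the modern spelling; in the 0.5-era API the iterator was called `CartesianRange`.

```julia
# Each CartesianIndex carries the full N-dimensional subscript tuple,
# so no manual ind2sub/sub2ind bookkeeping is needed while iterating.
function sum_cartesian(A)
    total = zero(eltype(A))
    for I in CartesianIndices(A)   # I isa CartesianIndex{2} for a matrix
        i, j = Tuple(I)            # recover plain subscripts when needed
        @assert A[I] == A[i, j]    # both forms address the same element
        total += A[I]
    end
    return total
end

A = reshape(collect(1:12), 3, 4)
```

The same loop body works unchanged for arrays of any dimensionality, which is the main attraction of this style.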
```julia
function permutedimsubs!{N}(F::AbstractArray{Int, N}, perm::Vector{Int}, stride::AbstractArray{Int}, sizeI::Tuple, B::AbstractArray{Int, N}, tempArray::AbstractArray{Int})
    permutedims!(B, F, perm)
    # ...
    @inbounds @simd for i = 1:length(B)
```
`@simd` won't help here because you have a branch in the inner loop. You might get it to vectorize if you use `ifelse(a, b, c)` instead of `a ? b : c`, but the price is that you always compute `b` and `c` no matter what value `a` has (so it may not be worth it).
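To illustrate the trade-off, here is a sketch (not code from this PR; the function names are hypothetical) of the two styles side by side:

```julia
# Branchy version: the ternary introduces a conditional jump in the
# loop body, which typically prevents SIMD vectorization.
function relu_branch!(y, x)
    @inbounds for i in eachindex(x, y)
        y[i] = x[i] < 0 ? zero(x[i]) : x[i]
    end
    return y
end

# Branch-free version: ifelse evaluates both arms unconditionally and
# then selects one, so the loop body is straight-line code that @simd
# can vectorize -- at the cost of always computing both arms.
function relu_ifelse!(y, x)
    @inbounds @simd for i in eachindex(x, y)
        y[i] = ifelse(x[i] < 0, zero(x[i]), x[i])
    end
    return y
end
```

Both produce identical results; whether the branch-free form is faster depends on how expensive the two arms are relative to the branch.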
```julia
Permute a tuple of subscripts given a permutation vector,
writing the permutation into `result`.
"""
function permutesubs!(subs, perm::Vector{Int}, result::AbstractArray{Int})
```
I'll wager we can do even better with Tuples, but this is good enough for me.
```julia
end
# ...
# Relevant distance functions copied from Distance.jl
```
I'm not against the idea of depending on Distance.jl, though of course sometimes it's better to copy when it's just a tiny part of the functionality. Either choice is fine with me.
```julia
end
# ...
"""
    stridedSub2Ind(stride::AbstractArray{Int}, i::AbstractArray{Int})
```
I don't know how this algorithm works in detail. Is the main motivation for `ind2sub` and `sub2ind` to store pixel locations as `Int`s in a list? If so, we could alternatively store the `Tuple{Int,Int}` or `CartesianIndex{2}` directly.
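A small sketch of the difference being discussed (illustrative only; `ind2sub`/`sub2ind` were later replaced by `CartesianIndices`/`LinearIndices`, which this snippet uses):

```julia
sz = (4, 5)

# Linear-Int encoding of pixel (2, 3): compact, but recovering the
# subscripts requires a conversion step on every use.
lin = LinearIndices(sz)[2, 3]
@assert lin == 10                              # column-major: (3-1)*4 + 2
@assert Tuple(CartesianIndices(sz)[lin]) == (2, 3)

# CartesianIndex encoding: N times the storage, but the subscripts
# are available directly, with no conversion round-trips.
ci = CartesianIndex(2, 3)
@assert Tuple(ci) == (2, 3)
```

The trade-off is exactly the one described: more memory per stored location versus fewer index conversions in the inner loops.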
""" | ||
\_computeft!(F::AbstractArray{Int, N}, I::AbstractArray{Bool, N}, stride::AbstractArray{Int}) | ||
|
||
Compute F_0 in the parlance of Maurer et al. 2003. |
If there's any kind of "meaning" to `F_0` that you can describe, it might be helpful. That said, this is an internal function, so the fact that you provided any kind of documentation at all is already winning you bonus points.
```julia
"""
    bwdist(I::AbstractArray{Bool, N})
```
Indicate the return information, e.g., `bwdist(I::AbstractArray{Bool, N}) -> F, D`, and give a brief statement explaining how to interpret `F` and `D`.
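Following that suggestion, a hypothetical wording (illustrative only, not the text that was eventually merged) might read:

```julia
"""
    bwdist(I::AbstractArray{Bool,N}) -> F, D

Compute the feature transform `F` and Euclidean distance transform `D`
of the boolean image `I`: `F[i]` holds the index of the nearest `true`
pixel of `I`, and `D[i]` holds the Euclidean distance from pixel `i`
to that nearest `true` pixel.
"""
function bwdist end   # stub so the docstring attaches to something
```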
```julia
stride = collect(strides(I))

# F and D, we use D as a temporary array for permutedimsubs
F = zeros(Int, sizeI)
```
IIUC this is the location of the nearest `true` value. Would it be better to declare this as `F = zeros(CartesianIndex{N}, sizeI)`? It takes more storage (`N` times as much) but it might avoid the need for all the `ind2sub` and `sub2ind` transformations, and perhaps simplify the code considerably.
Hi, thanks for your comments and suggestions. :) I'll look into the suggestions. Also, I'm still trying to figure out how to apply this to non-square arrays. If I increase the dimensions of …
Like I said, I don't know the algorithm well, but why does it need to be square? Is it because of the permutation? A few large allocations shouldn't hurt the runtime very much; the real performance hit comes from having lots of little allocations. So I don't necessarily think you have to do the permutation in-place.
Yep, exactly. The square requirement is caused by those permutations.
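A quick illustration (not code from the PR) of why the permutation passes force this: permuting the dimensions of a non-square array changes its size, so a buffer preallocated for the original shape no longer fits.

```julia
A = [1 2 3; 4 5 6]              # size (2, 3)
B = permutedims(A, (2, 1))      # size (3, 2): the shape changes
@assert size(B) == (3, 2)
@assert B[3, 1] == A[1, 3]      # elements are transposed

# A square array keeps its shape under any dimension permutation,
# which is why a fixed-shape reusable buffer only works in that case.
S = [1 2; 3 4]
@assert size(permutedims(S, (2, 1))) == size(S)
```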
Indeed, allowing those extra allocations by `permutedims` doesn't change anything about the runtime (still at ~0.105 seconds) and yields an allocation of ~9 MB again. I also adapted some comments and documentation so far. I am going to look into the …
You're doing some great work here. In case you're interested, permit me to show you some of the Joy of Tuples 😄:

```julia
# Your permutesubs! function
julia> function permutesubs!(subs::Tuple, perm::AbstractVector{Int}, result::AbstractArray{Int})
           n = length(subs)
           @inbounds @simd for i = 1:n
               result[i] = subs[perm[i]]
           end
           return result
       end
permutesubs! (generic function with 1 method)

# An alternative tuples-based approach. AFAICT you only need to
# circularly-permute the indices, so...
julia> circperm(t::Tuple) = _circperm(t)
circperm (generic function with 1 method)

julia> @inline _circperm(t::Tuple) = _circperm(t...)
_circperm (generic function with 1 method)

julia> @inline _circperm(t1, trest...) = (trest..., t1)
_circperm (generic function with 2 methods)

julia> circperm((1,2,3))
(2,3,1)

julia> using BenchmarkTools

julia> perm = [2,3,1]
3-element Array{Int64,1}:
 2
 3
 1

julia> result = Array{Int}(3)
3-element Array{Int64,1}:
 140662317212200
 140662317187080
 140662319415392

julia> t = (1,2,3)
(1,2,3)

julia> permutesubs!(t, perm, result)
3-element Array{Int64,1}:
 2
 3
 1

julia> @benchmark permutesubs!($t, $perm, $result)
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     5.00 ns (0.00% GC)
  median time:      5.00 ns (0.00% GC)
  mean time:        5.05 ns (0.00% GC)
  maximum time:     15.00 ns (0.00% GC)

julia> @benchmark circperm($t)
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     2.00 ns (0.00% GC)
  median time:      2.00 ns (0.00% GC)
  mean time:        2.02 ns (0.00% GC)
  maximum time:     13.00 ns (0.00% GC)

julia> @code_llvm circperm(t)

define void @julia_circperm_70788([3 x i64]* noalias sret, [3 x i64]*) #0 {
top:
  %2 = getelementptr inbounds [3 x i64], [3 x i64]* %1, i64 0, i64 1
  %3 = getelementptr inbounds [3 x i64], [3 x i64]* %1, i64 0, i64 2
  %4 = load i64, i64* %2, align 1
  %5 = load i64, i64* %3, align 1
  %6 = getelementptr inbounds [3 x i64], [3 x i64]* %1, i64 0, i64 0
  %7 = load i64, i64* %6, align 1
  %.sroa.0.0..sroa_idx = getelementptr inbounds [3 x i64], [3 x i64]* %0, i64 0, i64 0
  store i64 %4, i64* %.sroa.0.0..sroa_idx, align 8
  %.sroa.2.0..sroa_idx1 = getelementptr inbounds [3 x i64], [3 x i64]* %0, i64 0, i64 1
  store i64 %5, i64* %.sroa.2.0..sroa_idx1, align 8
  %.sroa.3.0..sroa_idx2 = getelementptr inbounds [3 x i64], [3 x i64]* %0, i64 0, i64 2
  store i64 %7, i64* %.sroa.3.0..sroa_idx2, align 8
  ret void
}

julia> @code_llvm permutesubs!(t, perm, result)

define %jl_value_t* @"julia_permutesubs!_70962"([3 x i64]*, %jl_value_t*, %jl_value_t*) #0 {
top:
  %3 = bitcast %jl_value_t* %1 to i64**
  %4 = load i64*, i64** %3, align 8
  %5 = bitcast %jl_value_t* %2 to i64**
  %6 = load i64*, i64** %5, align 8
  %7 = load i64, i64* %4, align 8
  %8 = add i64 %7, -1
  %9 = getelementptr [3 x i64], [3 x i64]* %0, i64 0, i64 %8
  %10 = load i64, i64* %9, align 8
  store i64 %10, i64* %6, align 8
  %11 = getelementptr i64, i64* %4, i64 1
  %12 = load i64, i64* %11, align 8
  %13 = add i64 %12, -1
  %14 = getelementptr [3 x i64], [3 x i64]* %0, i64 0, i64 %13
  %15 = load i64, i64* %14, align 8
  %16 = getelementptr i64, i64* %6, i64 1
  store i64 %15, i64* %16, align 8
  %17 = getelementptr i64, i64* %4, i64 2
  %18 = load i64, i64* %17, align 8
  %19 = add i64 %18, -1
  %20 = getelementptr [3 x i64], [3 x i64]* %0, i64 0, i64 %19
  %21 = load i64, i64* %20, align 8
  %22 = getelementptr i64, i64* %6, i64 2
  store i64 %21, i64* %22, align 8
  ret %jl_value_t* %2
}
```

So they're both great, but the tuples one is even faster. Those …
solve bug for 1D images
Hi, thanks a lot for pointing this out! :)
I also took a look at …
Regarding the storage (as written, this only works on julia 0.5 or higher):

```julia
circperm(idx::CartesianIndex) = CartesianIndex(circperm(idx.I))
circperm(t::Tuple) = _circperm(t)
@inline _circperm(t::Tuple) = _circperm(t...)
@inline _circperm(t1, trest...) = (trest..., t1)

circperm2(sz, i) = _circperm2(sz, i);
@inline _circperm2(sz, i) = sub2ind(sz, circperm(ind2sub(sz, i))...)

A = [CartesianIndex((rand(1:1000), rand(1:1000))) for i = 1:1000, j = 1:1000]
B = map(idx->sub2ind((1000,1000), idx.I...), A)
A .= circperm.(A);
f = i -> circperm2((1000,1000), i)
B .= f.(B);
@time A .= circperm.(A);
@time B .= f.(B);
nothing
```

Results:

```
0.001597 seconds (7 allocations: 224 bytes)
0.012136 seconds (7 allocations: 224 bytes)
```

Julia/LLVM are quite smart about not allocating memory when it comes to tuples (which `CartesianIndex` is merely a wrapper around). The array doesn't actually store a "`CartesianIndex` object," it's just a chunk of memory of size 2-by-1000-by-1000 `Int`s; the …

All that said, I have no idea whether it's this part of your computation that's a bottleneck (from the runtimes above, I suspect not). Only if profiling shows that calls to … I'd be fine with merging this as-is. Are you still playing with it, or are you good to go?
Oh cool! :) I'm not too familiar with …
remove unnecessary ind2sub calls; optimize square euclidean computation
@muhlba91, I can help you resolve the conflicts if you're unfamiliar with the process.
Just to keep you updated on the progress: I was able to use a few …

Internally, I changed my test suite to a problem size of 4000x4000 to make differences in the runtime easier to spot. Here are the current stats for that (before these optimizations, it was about 1 second slower ;) ):

@timholy - yes, that would be nice. I have never had to rebase/merge the master branch into my own fork before.
Wow, that's a pretty dramatic improvement! Nice work! The conflicts arose from the merge and release of Images 0.6.0. If you're done with this, I can clean up the conflicts and merge to master for you, giving you credit for your work (using …).
Thank you! :) I think making a new branch is not necessary. I have just one more idea left to try out tomorrow (although I'm not quite sure if it's a helpful one ^^), so we can clean up the conflicts and merge them to master tomorrow. ;) However, if you have a bit of time, I'd like to ask a short question. When I used …
How exactly did you use it? You can't use it at the "call site," you can only declare that certain functions should be inlined. Do you perhaps mean …?
I used it in lines 212, 171, and 103. There, I created functions that get inlined for addition and multiplication. For example, in line 103 I used …
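For context, a sketch of the distinction being discussed (with hypothetical helper names, not the PR's actual code): `@inline` is attached to a function's *definition*, never to a call site.

```julia
# Correct usage: declare at the definition that this small helper is a
# candidate for inlining.
@inline addmul(a, b, c) = a * b + c     # hypothetical helper

function scale_shift!(out, x, w, bias)
    @inbounds for i in eachindex(out, x)
        out[i] = addmul(x[i], w, bias)  # plain call; inlining was declared above
    end
    return out
end
```

Writing `@inline f(x)` at the point of the call has no effect on that call; the annotation only makes sense where `f` is defined.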
OK, I think we are good to go now. :) Can you clean up the conflicts and merge it into master, please?
See if you like #580 (if you don't, tell me what you don't like). By the way, you were right to be skeptical about the benefit of creating those …
Closed by #580 |
I needed a Euclidean Feature Transform (square matrices) for a project of mine and found implementation #99 for issue #65.
However, as noted in PR #99, its efficiency was lower than MATLAB's. I therefore took the code of #99, adapted it to the newest Julia version (0.5) using the Compat module, and reduced its memory consumption to cut the runtime even further.
As a reminder from #99:
I used the following test suite to retrieve measurements:
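The test-suite snippet itself did not survive extraction. As a hedged reconstruction of the methodology the surrounding numbers imply (a warm-up call, then `@time` on a square boolean image; the image size and the stand-in function are assumptions, not the author's actual code):

```julia
# edt_stub is a cheap placeholder so this sketch is self-contained;
# the PR's bwdist would be substituted here to reproduce the quoted
# runtime and memory figures.
edt_stub(img) = float.(img)

I = rand(Bool, 1000, 1000)     # assumed problem size
edt_stub(I)                    # warm-up call excludes JIT compilation
@time result = edt_stub(I)     # @time reports runtime and allocated memory
```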
For comparison, here are the results of running the code from #99, adapted for Julia v0.5, on my system:
More or less, I had a runtime of about 0.57 seconds using 175.767 MB memory.
After my changes regarding memory consumption the results are now:
More or less, I got a runtime of about 0.15 seconds using only 44.191 MB memory.
The tests I have run on 2D images so far yielded the correct feature transform, but please let me know if you encounter cases where it doesn't return the expected results.
Otherwise, I'd like to ask for a review and any further suggestions/ideas/... 😄