
Euclidean Distance Transform #576

Closed
wants to merge 10 commits into timholy:master from muhlba91:master

Conversation

muhlba91
Contributor

@muhlba91 muhlba91 commented Jan 23, 2017

I needed a Euclidean feature transform (for square matrices) for a project of mine and found implementation #99 for issue #65.
However, as noted in PR #99, its efficiency was lower than MATLAB's. I therefore took the code from #99, adapted it to the newest Julia version (0.5) using the Compat module, and reduced the memory consumption to cut the runtime even further.

As a reminder from #99:

Calling (F, D) = bwdist(I), where I is some binary image, returns two results: F is the feature transform (i.e., for each pixel, the closest feature pixel, given by its indices in I), and D is the Euclidean distance transform computed from F by taking the Euclidean distance from each pixel of interest to its feature pixel.
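For concreteness, a minimal usage sketch (hypothetical values; it assumes this PR's bwdist is loaded):

I = fill(false, 5, 5)
I[3, 3] = true        # single feature pixel
F, D = bwdist(I)
D[1, 1]               # Euclidean distance from (1, 1) to the nearest feature pixel (3, 3), i.e. sqrt(8) ≈ 2.83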

I used the following test suite to retrieve measurements:

A = rand(Bool, 500, 500)
(F, D) = bwdist(A);    # warm-up call so compilation is not included in the timings below

for i = 1:5
    @time (F, D) = bwdist(A);
end

For comparison, here are the results of running the code from #99 (adapted to Julia v0.5) on my system:

0.571396 seconds (6.00 M allocations: 175.767 MB, 4.29% gc time)
0.586315 seconds (6.00 M allocations: 175.767 MB, 4.78% gc time)
0.588186 seconds (6.00 M allocations: 175.767 MB, 4.73% gc time)
0.684327 seconds (6.00 M allocations: 175.767 MB, 18.52% gc time)
0.577263 seconds (6.00 M allocations: 175.767 MB, 4.40% gc time)

In short, the runtime was about 0.57 seconds, using 175.767 MB of memory.

After my changes to reduce memory consumption, the results are now:

0.147726 seconds (1.00 M allocations: 44.191 MB, 7.17% gc time)
0.167731 seconds (1.00 M allocations: 44.191 MB, 9.02% gc time)
0.148577 seconds (1.00 M allocations: 44.191 MB, 7.45% gc time)
0.148894 seconds (1.00 M allocations: 44.191 MB, 4.38% gc time)
0.150299 seconds (1.00 M allocations: 44.191 MB, 6.09% gc time)

In short, the runtime is now about 0.15 seconds, using only 44.191 MB of memory.

The tests I have run on 2D images so far yield the correct feature transform, but please let me know if you encounter a case where it doesn't return the expected result.
Otherwise, I'd welcome a review and any further suggestions/ideas/... 😄

@coveralls

Coverage Status

Coverage decreased (-2.4%) to 76.27% when pulling 8af7258 on muhlba91:master into b3d6f46 on timholy:master.

@muhlba91
Contributor Author

I had time to analyze the memory consumption further and found that the previous computation of the sub2ind sum can be replaced with a much less memory-hungry approach.
Furthermore, I now reuse temporary arrays wherever possible. I hope to spot more places where the memory consumption can be reduced, but I think we are getting closer to a usable efficiency. :)
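A sketch of the idea (in the spirit of the PR's stridedSub2Ind; the helper name and exact form are illustrative, not the PR's code): the linear index is accumulated directly from precomputed strides instead of going through sub2ind for every pixel.

# Illustrative helper, not the PR's exact code.
function strided_sub2ind(stride::AbstractVector{Int}, subs::AbstractVector{Int})
    ind = 1
    @inbounds for j in eachindex(subs)
        ind += (subs[j] - 1) * stride[j]    # offset contributed by dimension j
    end
    return ind
end

strided_sub2ind(collect(strides(zeros(500, 500))), [3, 7]) == sub2ind((500, 500), 3, 7)  # true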

Therefore, the stats are now:

0.102853 seconds (2.02 k allocations: 9.706 MB)
0.107617 seconds (2.02 k allocations: 9.706 MB, 2.83% gc time)
0.103794 seconds (2.02 k allocations: 9.706 MB)
0.109872 seconds (2.02 k allocations: 9.706 MB, 4.45% gc time)
0.103191 seconds (2.02 k allocations: 9.706 MB)

Hence, the runtime is now about 0.105 seconds, using 9.706 MB of memory.

@coveralls

Coverage Status

Coverage decreased (-2.5%) to 76.212% when pulling a6d99a7 on muhlba91:master into b3d6f46 on timholy:master.

@muhlba91 changed the title from "Euclidean Distance Transform" to "[WIP] Euclidean Distance Transform" on Jan 25, 2017
@coveralls

Coverage Status

Coverage decreased (-2.4%) to 76.27% when pulling 50b2b57 on muhlba91:master into b3d6f46 on timholy:master.

Member

@timholy timholy left a comment


Nice work, @muhlba91 (and @wkearn whose work this carries forward)! This performance is getting into the ballpark, so I'm willing to see this merged.

I've made a few suggestions below, mostly concerning whether it would make sense to switch from an Int representation of pixel locations to a CartesianIndex{N} representation. (See http://julialang.org/blog/2016/02/iteration if you're unfamiliar with this object.) You don't have to take this advice, as there are likely to be pluses and minuses to switching.

function permutedimsubs!{N}(F::AbstractArray{Int, N}, perm::Vector{Int}, stride::AbstractArray{Int}, sizeI::Tuple, B::AbstractArray{Int, N}, tempArray::AbstractArray{Int})
permutedims!(B, F, perm)

@inbounds @simd for i = 1:length(B)

@simd won't help here because you have a branch in the inner loop. You might get it to vectorize if you use ifelse(a, b, c) instead of a ? b : c, but the price is that you always compute b and c no matter what value a has (so it may not be worth it).
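An illustrative sketch of that trade-off (not code from this PR; the function names are made up):

function clamp_branch!(out, B)
    @inbounds @simd for i = 1:length(B)
        out[i] = B[i] > 0 ? B[i] : 0          # branch: usually prevents vectorization
    end
    return out
end

function clamp_ifelse!(out, B)
    @inbounds @simd for i = 1:length(B)
        out[i] = ifelse(B[i] > 0, B[i], 0)    # branch-free, but both arms are always evaluated
    end
    return out
end

B = rand(-5:5, 10^6)
clamp_branch!(similar(B), B) == clamp_ifelse!(similar(B), B)    # true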

Permute a tuple of subscripts given a permutation vector,
writing the permutation into `result`.
"""
function permutesubs!(subs, perm::Vector{Int}, result::AbstractArray{Int})

I'll wager we can do even better with Tuples, but this is good enough for me.

end


# Relevant distance functions copied from Distance.jl

I'm not against the idea of depending on Distance.jl, though of course sometimes it's better to copy when it's just a tiny part of the functionality. Either choice is fine with me.

end

"""
stridedSub2Ind(stride::AbstractArray{Int}, i::AbstractArray{Int})

I don't know how this algorithm works in detail. Is the main motivation for ind2sub and sub2ind to store pixel locations as Ints in a list? If so, we could alternatively store the Tuple{Int,Int} or CartesianIndex{2} directly.
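To make the alternative concrete (a sketch, not this PR's code; sub2ind/ind2sub are the Base functions available on the Julia 0.5/0.6 used here), the same pixel location stored both ways:

sz = (500, 500)

# Linear-index representation: a single Int, converted back to subscripts on demand.
lin = sub2ind(sz, 3, 7)    # 3003
ind2sub(sz, lin)           # (3, 7)

# CartesianIndex representation: stores the subscripts directly (N Ints per location),
# so no ind2sub/sub2ind round trips are needed.
ci = CartesianIndex(3, 7)
ci.I                       # (3, 7)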

"""
_computeft!(F::AbstractArray{Int, N}, I::AbstractArray{Bool, N}, stride::AbstractArray{Int})

Compute F_0 in the parlance of Maurer et al. 2003.

If there's any kind of "meaning" to F_0 that you can describe, it might be helpful. That said, this is an internal function so the fact that you provided any kind of documentation at all is already winning you bonus points.



"""
bwdist(I::AbstractArray{Bool, N})

Indicate the return information, e.g.,

    bwdist(I::AbstractArray{Bool, N}) -> F, D

and give a brief statement explaining how to interpret F and D.
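For example (wording is only a sketch, to be adapted to the final implementation):

"""
    bwdist(I::AbstractArray{Bool, N}) -> F, D

Compute the feature transform `F` and the Euclidean distance transform `D` of the
binary image `I`: `F[i]` identifies the nearest `true` ("feature") pixel of `I`, and
`D[i]` is the Euclidean distance from pixel `i` to that feature pixel.
"""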

stride = collect(strides(I))

# F and D, we use D as a temporary array for permutedimsubs
F = zeros(Int, sizeI)

IIUC this is the location of the nearest true value. Would it be better to declare this
F = zeros(CartesianIndex{N}, sizeI)? It takes more storage (N times as much) but it might avoid the need for all the ind2sub and sub2ind transformations, and perhaps simplify the code considerably.

@muhlba91
Contributor Author

muhlba91 commented Jan 25, 2017

Hi,

thanks for your comments and suggestions. :) I'll look into the CartesianIndex approach. With this implementation I think we're already close to the optimal memory usage: I got it down to a single temporary array of size N+1 (N being the number of dimensions), so the current consumption is just a bit above 5 MB with ~25 allocations, while the runtime stays at about 0.1 seconds. If CartesianIndex buys more performance at the cost of somewhat higher memory consumption, it could still pay off, as long as the additional memory needs don't hurt the runtime.

Also, I'm still trying to figure out how to apply this to non-square arrays. If I expand F to a square array using the maximum of I's dimensions, the results are fine, but I'd then need to crop the expanded dimensions of F at the end so they match those of I again.

@timholy
Member

timholy commented Jan 25, 2017

Like I said, I don't know the algorithm well, but why does it need to be square? Is it because of the permutation? A few large allocations shouldn't hurt the runtime very much; the real performance hit comes from having lots of little allocations. So I don't necessarily think you have to do the permutation in-place.

@muhlba91
Contributor Author

Yep, exactly. The square requirement is caused by those permutations.
I'll check tomorrow how the performance changes when not doing it in-place.

@muhlba91
Contributor Author

Indeed, allowing the extra allocations from permutedims doesn't change the runtime (still ~0.105 seconds), and the allocations are back at ~9 MB. I also updated some comments and documentation. I'm going to look into CartesianIndex in the coming days.
As for the suggestion regarding the perm Vector: it comes from circshift(d, 1), which itself returns an array/vector.
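A tiny illustration (not the PR's code) of why this yields a Vector:

d = [1, 2, 3]
perm = circshift(d, 1)    # allocates and returns a new Vector: [3, 1, 2]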

@coveralls

Coverage Status

Coverage decreased (-2.5%) to 76.183% when pulling aea16ef on muhlba91:master into b3d6f46 on timholy:master.

@coveralls

Coverage Status

Coverage decreased (-2.5%) to 76.241% when pulling e54d3e7 on muhlba91:master into b3d6f46 on timholy:master.

@coveralls

Coverage Status

Coverage decreased (-2.2%) to 76.531% when pulling cc05abe on muhlba91:master into b3d6f46 on timholy:master.

@coveralls

Coverage Status

Coverage increased (+0.7%) to 79.37% when pulling 8a24bd8 on muhlba91:master into b3d6f46 on timholy:master.

@timholy
Member

timholy commented Jan 26, 2017

You're doing some great work here. In case you're interested, permit me to show you some of the Joy of Tuples 😄 :

# Your permutesubs! function
julia> function permutesubs!(subs::Tuple, perm::AbstractVector{Int}, result::AbstractArray{Int})
           n = length(subs)
           @inbounds @simd for i = 1:n
               result[i] = subs[perm[i]]
           end
           return result
       end
permutesubs! (generic function with 1 method)                                                                                                                                                

# An alternative tuples-based approach. AFAICT you only need to circularly-permute the indices, so...
julia> circperm(t::Tuple) = _circperm(t)
circperm (generic function with 1 method)

julia> @inline _circperm(t::Tuple) = _circperm(t...)
_circperm (generic function with 1 method)

julia> @inline _circperm(t1, trest...) = (trest..., t1)
_circperm (generic function with 2 methods)

julia> circperm((1,2,3))
(2,3,1)

julia> using BenchmarkTools

julia> perm = [2,3,1]
3-element Array{Int64,1}:
 2
 3
 1

julia> result = Array{Int}(3)
3-element Array{Int64,1}:
 140662317212200
 140662317187080
 140662319415392

julia> t = (1,2,3)
(1,2,3)

julia> permutesubs!(t, perm, result)
3-element Array{Int64,1}:
 2
 3
 1

julia> @benchmark permutesubs!($t, $perm, $result)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     5.00 ns (0.00% GC)
  median time:      5.00 ns (0.00% GC)
  mean time:        5.05 ns (0.00% GC)
  maximum time:     15.00 ns (0.00% GC)

julia> @benchmark circperm($t)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     2.00 ns (0.00% GC)
  median time:      2.00 ns (0.00% GC)
  mean time:        2.02 ns (0.00% GC)
  maximum time:     13.00 ns (0.00% GC)

julia> @code_llvm circperm(t)

define void @julia_circperm_70788([3 x i64]* noalias sret, [3 x i64]*) #0 {
top:
  %2 = getelementptr inbounds [3 x i64], [3 x i64]* %1, i64 0, i64 1
  %3 = getelementptr inbounds [3 x i64], [3 x i64]* %1, i64 0, i64 2
  %4 = load i64, i64* %2, align 1
  %5 = load i64, i64* %3, align 1
  %6 = getelementptr inbounds [3 x i64], [3 x i64]* %1, i64 0, i64 0
  %7 = load i64, i64* %6, align 1
  %.sroa.0.0..sroa_idx = getelementptr inbounds [3 x i64], [3 x i64]* %0, i64 0, i64 0
  store i64 %4, i64* %.sroa.0.0..sroa_idx, align 8
  %.sroa.2.0..sroa_idx1 = getelementptr inbounds [3 x i64], [3 x i64]* %0, i64 0, i64 1
  store i64 %5, i64* %.sroa.2.0..sroa_idx1, align 8
  %.sroa.3.0..sroa_idx2 = getelementptr inbounds [3 x i64], [3 x i64]* %0, i64 0, i64 2
  store i64 %7, i64* %.sroa.3.0..sroa_idx2, align 8
  ret void
}

julia> @code_llvm permutesubs!(t, perm, result)

define %jl_value_t* @"julia_permutesubs!_70962"([3 x i64]*, %jl_value_t*, %jl_value_t*) #0 {
top:
  %3 = bitcast %jl_value_t* %1 to i64**
  %4 = load i64*, i64** %3, align 8
  %5 = bitcast %jl_value_t* %2 to i64**
  %6 = load i64*, i64** %5, align 8
  %7 = load i64, i64* %4, align 8
  %8 = add i64 %7, -1
  %9 = getelementptr [3 x i64], [3 x i64]* %0, i64 0, i64 %8
  %10 = load i64, i64* %9, align 8
  store i64 %10, i64* %6, align 8
  %11 = getelementptr i64, i64* %4, i64 1
  %12 = load i64, i64* %11, align 8
  %13 = add i64 %12, -1
  %14 = getelementptr [3 x i64], [3 x i64]* %0, i64 0, i64 %13
  %15 = load i64, i64* %14, align 8
  %16 = getelementptr i64, i64* %6, i64 1
  store i64 %15, i64* %16, align 8
  %17 = getelementptr i64, i64* %4, i64 2
  %18 = load i64, i64* %17, align 8
  %19 = add i64 %18, -1
  %20 = getelementptr [3 x i64], [3 x i64]* %0, i64 0, i64 %19
  %21 = load i64, i64* %20, align 8
  %22 = getelementptr i64, i64* %6, i64 2
  store i64 %21, i64* %22, align 8
  ret %jl_value_t* %2
}

So they're both great, but the tuples one is even faster. Those @inline statements are absolutely crucial; any place you have splatting (`...`) in an inner loop, it will be slow unless you @inline it (in which case the compiler will specialize for the specific tuple length).

@muhlba91
Contributor Author

muhlba91 commented Jan 27, 2017

Hi,

thanks a lot for pointing this out! :)
Now I use this function and tuples instead of permutesubs!, and I also got rid of the permutation vector perm, since it just transposes the whole array. The runtime gain isn't really noticeable at a problem size of 500x500, but I also ran tests at 2000x2000, where the runtime dropped by nearly 0.7 seconds compared to the older version (down to ~3 seconds from 3.7 seconds before). The number of required allocations is also reduced.
Current state with the test suite posted above:

0.099389 seconds (25 allocations: 9.538 MB)
0.104037 seconds (25 allocations: 9.538 MB, 2.17% gc time)
0.107043 seconds (25 allocations: 9.538 MB)
0.107534 seconds (25 allocations: 9.538 MB)
0.103300 seconds (25 allocations: 9.538 MB, 2.29% gc time)

I also took a look at CartesianIndex but, imo, the main problem is the subscript permutation circperm: since CartesianIndex is not mutable, I'd need to create new indices, increasing memory usage (and runtime).

@coveralls

Coverage Status

Coverage decreased (-74.4%) to 4.279% when pulling 35499a8 on muhlba91:master into b3d6f46 on timholy:master.

@coveralls

Coverage Status

Coverage increased (+0.6%) to 79.331% when pulling c13d47a on muhlba91:master into b3d6f46 on timholy:master.

@timholy
Member

timholy commented Jan 27, 2017

Regarding the storage (as written, this only works on Julia 0.5 or higher):

circperm(idx::CartesianIndex) = CartesianIndex(circperm(idx.I))
circperm(t::Tuple) = _circperm(t)
@inline _circperm(t::Tuple) = _circperm(t...)
@inline _circperm(t1, trest...) = (trest..., t1)

circperm2(sz, i) = _circperm2(sz, i);
@inline _circperm2(sz, i) = sub2ind(sz, circperm(ind2sub(sz, i))...)

A = [CartesianIndex((rand(1:1000), rand(1:1000))) for i = 1:1000, j = 1:1000]
B = map(idx->sub2ind((1000,1000), idx.I...), A)
A .= circperm.(A);
f = i -> circperm2((1000,1000), i)
B .= f.(B);
@time A .= circperm.(A);
@time B .= f.(B);
nothing

Results:

  0.001597 seconds (7 allocations: 224 bytes)
  0.012136 seconds (7 allocations: 224 bytes)

Julia/LLVM are quite smart about not allocating memory when it comes to tuples (which CartesianIndex is merely a wrapper around). The array doesn't actually store a "CartesianIndex object," it's just a chunk of memory of size 2-by-1000-by-1000 Ints; the CartesianIndex is just an interpretation of the meaning of those bits.

All that said, I have no idea whether it's this part of your computation that's a bottleneck (from the runtimes above, I suspect not). Only if profiling shows that calls to sub2ind or ind2sub result in an appreciable load is it worth worrying about this further.
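A minimal way to check that (illustrative only; on Julia 0.5/0.6 `@profile` and `Profile.print` come from Base, on Julia ≥ 0.7 add `using Profile` first):

A = rand(Bool, 500, 500)
bwdist(A)             # warm up / compile first
@profile bwdist(A)
Profile.print()       # look for sub2ind / ind2sub entries in the report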

I'd be fine with merging this as-is. Are you still playing with it, or are you good to go?

@muhlba91
Contributor Author

Oh cool! :) I'm not too familiar with CartesianIndex, but I'll profile it again next week to figure out where the bottleneck is, and then let you know whether we're good to merge or whether I've found something else to improve. ;)

remove unnecessary ind2sub calls;
optimize square euclidean computation
@timholy
Member

timholy commented Jan 30, 2017

@muhlba91, I can help you resolve the conflicts if you're unfamiliar with the process.

@muhlba91
Contributor Author

Just to keep you updated on the progress: I was able to use a few @inline functions and to get rid of some truly unnecessary ind2sub calls. With the test suite posted in the first post, the results are now:

0.079463 seconds (25 allocations: 9.538 MB)
0.096297 seconds (25 allocations: 9.538 MB, 9.64% gc time)
0.084291 seconds (25 allocations: 9.538 MB)
0.094686 seconds (25 allocations: 9.538 MB, 7.26% gc time)
0.079438 seconds (25 allocations: 9.538 MB)

Internally, I changed my test suite to a problem size of 4000x4000 to spot runtime differences more easily. Here are the current stats for that (before these optimizations it was about 1 second slower ;) ):

5.222761 seconds (25 allocations: 610.353 MB, 4.49% gc time)
5.184712 seconds (25 allocations: 610.353 MB, 4.56% gc time)
5.327673 seconds (25 allocations: 610.353 MB, 5.07% gc time)
5.246481 seconds (25 allocations: 610.353 MB, 4.86% gc time)
5.282295 seconds (25 allocations: 610.353 MB, 4.82% gc time)

@timholy - yes, that would be nice. I've never had to rebase/merge the master branch into my own fork before.

@timholy
Member

timholy commented Jan 30, 2017

Wow, that's a pretty dramatic improvement! Nice work!

The conflicts arose from the merge and release of Images 0.6.0. If you're done with this, I can clean up the conflicts and merge to master for you, giving you credit for your work (using --author). If you're still playing with it but want to continue using the new version of Images, I can create a new branch for you to git fetch. Which would you prefer?

@muhlba91
Contributor Author

Thank you! :)

I don't think a new branch is necessary. I have just one more idea to try out tomorrow (although I'm not quite sure it will help ^^), so we can clean up the conflicts and merge to master tomorrow. ;)

However, if you have a bit of time, I'd like to ask a short question. When I used @inline functions for adding and multiplying two numbers, I gained quite a bit of performance. I had expected that not using these @inline functions would be either more efficient or make no difference. Maybe you can spot why I gained at least about 0.15 seconds in the 4000x4000 case?

@timholy
Member

timholy commented Jan 30, 2017

How exactly did you use it? You can't use it at the "call site," you can only declare that certain functions should be inlined. Do you perhaps mean @inbounds?

@muhlba91
Contributor Author

I used it in lines 212, 171, and 103. There, I created inlined functions for addition and multiplication. For example, in line 103 I had x += (d_j - 1)*stride[j]; switching to the inlined version yielded a gain of 0.05 s. That's not much, but for the Euclidean distance (line 212) the gain was up to 0.10-0.15 s.
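For illustration, such a helper would look roughly like this (a sketch with a made-up name, not the PR's exact definition):

# Hypothetical inlined helper, not the PR's exact code.
@inline add_offset(x, d_j, stride_j) = x + (d_j - 1) * stride_j

# so that
#     x += (d_j - 1) * stride[j]
# becomes
#     x = add_offset(x, d_j, stride[j])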

@muhlba91
Contributor Author

OK, I think we are good to go now. :) Can you clean up the conflicts and merge it into master, please?

@muhlba91 changed the title from "[WIP] Euclidean Distance Transform" to "Euclidean Distance Transform" on Jan 31, 2017
@timholy
Member

timholy commented Feb 1, 2017

See if you like #580 (if you don't, tell me what you don't like).

By the way, you were right to be skeptical about the benefit of creating those add_ and mul_ inlined functions: I couldn't find any evidence of a performance benefit, so I got rid of them (and then did more). I'm not sure whether your system differs from mine, but I do recommend BenchmarkTools for benchmarking.
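For example (a minimal sketch, reusing the test image from the first post):

using BenchmarkTools

A = rand(Bool, 500, 500)
@benchmark bwdist($A)    # $ interpolates A so setup cost isn't measured; reports times and allocations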

@timholy
Member

timholy commented Feb 7, 2017

Closed by #580

@timholy timholy closed this Feb 7, 2017