
Euclidean Distance Transform #576

Closed
wants to merge 10 commits into timholy:master from muhlba91:master

Conversation

muhlba91
Contributor

@muhlba91 muhlba91 commented Jan 23, 2017

I needed a Euclidean feature transform (for square matrices) for a project of mine and found implementation #99 for issue #65.
However, as noted in PR #99, its efficiency was lower than MATLAB's. I therefore took the code from #99, adapted it to the newest Julia version (0.5) using the Compat module, and reduced the memory consumption to cut the runtime even further.

As a reminder from #99:

Calling (F, D) = bwdist(I), where I is some binary image, returns two results: F is the feature transform (i.e., for each pixel, the closest feature pixel, given by its indices in I), and D is the Euclidean distance transform computed from F by taking the Euclidean distance from each pixel of interest to its feature pixel.
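For concreteness, a minimal usage sketch (hypothetical values; it assumes this PR's bwdist is loaded):

I = fill(false, 5, 5)
I[3, 3] = true        # single feature pixel
F, D = bwdist(I)
D[1, 1]               # Euclidean distance from (1, 1) to the nearest feature pixel (3, 3), i.e. sqrt(8) ≈ 2.83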

I used the following test suite to retrieve measurements:

A = rand(Bool, 500, 500)
(F, D) = bwdist(A);    # warm-up call so compilation is not included in the timings below

for i = 1:5
    @time (F, D) = bwdist(A);
end

For comparison, here are the results of running the code from #99 (adapted to Julia v0.5) on my system:

0.571396 seconds (6.00 M allocations: 175.767 MB, 4.29% gc time)
0.586315 seconds (6.00 M allocations: 175.767 MB, 4.78% gc time)
0.588186 seconds (6.00 M allocations: 175.767 MB, 4.73% gc time)
0.684327 seconds (6.00 M allocations: 175.767 MB, 18.52% gc time)
0.577263 seconds (6.00 M allocations: 175.767 MB, 4.40% gc time)

In short, the runtime was about 0.57 seconds, using 175.767 MB of memory.

After my changes to reduce memory consumption, the results are now:

0.147726 seconds (1.00 M allocations: 44.191 MB, 7.17% gc time)
0.167731 seconds (1.00 M allocations: 44.191 MB, 9.02% gc time)
0.148577 seconds (1.00 M allocations: 44.191 MB, 7.45% gc time)
0.148894 seconds (1.00 M allocations: 44.191 MB, 4.38% gc time)
0.150299 seconds (1.00 M allocations: 44.191 MB, 6.09% gc time)

In short, the runtime is now about 0.15 seconds, using only 44.191 MB of memory.

The tests I have run on 2D images so far yield the correct feature transform, but please let me know if you encounter a case where it doesn't return the expected result.
Otherwise, I'd welcome a review and any further suggestions/ideas/... 😄

@coveralls

Coverage Status

Coverage decreased (-2.4%) to 76.27% when pulling 8af7258 on muhlba91:master into b3d6f46 on timholy:master.

@muhlba91
Contributor Author

I had time to analyze the memory consumption further and found that the previous computation of the sub2ind sum can be replaced with a much less memory-hungry approach.
Furthermore, I now reuse temporary arrays wherever possible. I hope to spot more places where the memory consumption can be reduced, but I think we are getting closer to a usable efficiency. :)
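A sketch of the idea (in the spirit of the PR's stridedSub2Ind; the helper name and exact form are illustrative, not the PR's code): the linear index is accumulated directly from precomputed strides instead of going through sub2ind for every pixel.

# Illustrative helper, not the PR's exact code.
function strided_sub2ind(stride::AbstractVector{Int}, subs::AbstractVector{Int})
    ind = 1
    @inbounds for j in eachindex(subs)
        ind += (subs[j] - 1) * stride[j]    # offset contributed by dimension j
    end
    return ind
end

strided_sub2ind(collect(strides(zeros(500, 500))), [3, 7]) == sub2ind((500, 500), 3, 7)  # true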

Therefore, the stats are now:

0.102853 seconds (2.02 k allocations: 9.706 MB)
0.107617 seconds (2.02 k allocations: 9.706 MB, 2.83% gc time)
0.103794 seconds (2.02 k allocations: 9.706 MB)
0.109872 seconds (2.02 k allocations: 9.706 MB, 4.45% gc time)
0.103191 seconds (2.02 k allocations: 9.706 MB)

Hence, the runtime is now about 0.105 seconds, using 9.706 MB of memory.

@coveralls

Coverage Status

Coverage decreased (-2.5%) to 76.212% when pulling a6d99a7 on muhlba91:master into b3d6f46 on timholy:master.

@muhlba91 changed the title from "Euclidean Distance Transform" to "[WIP] Euclidean Distance Transform" on Jan 25, 2017
@coveralls

Coverage Status

Coverage decreased (-2.4%) to 76.27% when pulling 50b2b57 on muhlba91:master into b3d6f46 on timholy:master.

Member

@timholy timholy left a comment


Nice work, @muhlba91 (and @wkearn whose work this carries forward)! This performance is getting into the ballpark, so I'm willing to see this merged.

I've made a few suggestions below, mostly concerning whether it would make sense to switch from an Int representation of pixel locations to a CartesianIndex{N} representation. (See http://julialang.org/blog/2016/02/iteration if you're unfamiliar with this object.) You don't have to take this advice, as there are likely to be pluses and minuses to switching.

function permutedimsubs!{N}(F::AbstractArray{Int, N}, perm::Vector{Int}, stride::AbstractArray{Int}, sizeI::Tuple, B::AbstractArray{Int, N}, tempArray::AbstractArray{Int})
permutedims!(B, F, perm)

@inbounds @simd for i = 1:length(B)

@simd won't help here because you have a branch in the inner loop. You might get it to vectorize if you use ifelse(a, b, c) instead of a ? b : c, but the price is that you always compute b and c no matter what value a has (so it may not be worth it).
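An illustrative sketch of that trade-off (not code from this PR; the function names are made up):

function clamp_branch!(out, B)
    @inbounds @simd for i = 1:length(B)
        out[i] = B[i] > 0 ? B[i] : 0          # branch: usually prevents vectorization
    end
    return out
end

function clamp_ifelse!(out, B)
    @inbounds @simd for i = 1:length(B)
        out[i] = ifelse(B[i] > 0, B[i], 0)    # branch-free, but both arms are always evaluated
    end
    return out
end

B = rand(-5:5, 10^6)
clamp_branch!(similar(B), B) == clamp_ifelse!(similar(B), B)    # true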

Permute a tuple of subscripts given a permutation vector,
writing the permutation into `result`.
"""
function permutesubs!(subs, perm::Vector{Int}, result::AbstractArray{Int})

I'll wager we can do even better with Tuples, but this is good enough for me.

end


# Relevant distance functions copied from Distance.jl

I'm not against the idea of depending on Distance.jl, though of course sometimes it's better to copy when it's just a tiny part of the functionality. Either choice is fine with me.

end

"""
stridedSub2Ind(stride::AbstractArray{Int}, i::AbstractArray{Int})

I don't know how this algorithm works in detail. Is the main motivation for ind2sub and sub2ind to store pixel locations as Ints in a list? If so, we could alternatively store the Tuple{Int,Int} or CartesianIndex{2} directly.
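To make the alternative concrete (a sketch, not this PR's code; sub2ind/ind2sub are the Base functions available on the Julia 0.5/0.6 used here), the same pixel location stored both ways:

sz = (500, 500)

# Linear-index representation: a single Int, converted back to subscripts on demand.
lin = sub2ind(sz, 3, 7)    # 3003
ind2sub(sz, lin)           # (3, 7)

# CartesianIndex representation: stores the subscripts directly (N Ints per location),
# so no ind2sub/sub2ind round trips are needed.
ci = CartesianIndex(3, 7)
ci.I                       # (3, 7)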

"""
_computeft!(F::AbstractArray{Int, N}, I::AbstractArray{Bool, N}, stride::AbstractArray{Int})

Compute F_0 in the parlance of Maurer et al. 2003.

If there's any kind of "meaning" to F_0 that you can describe, it might be helpful. That said, this is an internal function so the fact that you provided any kind of documentation at all is already winning you bonus points.



"""
bwdist(I::AbstractArray{Bool, N})

Indicate the return information, e.g.,

    bwdist(I::AbstractArray{Bool, N}) -> F, D

and give a brief statement explaining how to interpret F and D.
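For example (wording is only a sketch, to be adapted to the final implementation):

"""
    bwdist(I::AbstractArray{Bool, N}) -> F, D

Compute the feature transform `F` and the Euclidean distance transform `D` of the
binary image `I`: `F[i]` identifies the nearest `true` ("feature") pixel of `I`, and
`D[i]` is the Euclidean distance from pixel `i` to that feature pixel.
"""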

stride = collect(strides(I))

# F and D, we use D as a temporary array for permutedimsubs
F = zeros(Int, sizeI)

IIUC this is the location of the nearest true value. Would it be better to declare this
F = zeros(CartesianIndex{N}, sizeI)? It takes more storage (N times as much) but it might avoid the need for all the ind2sub and sub2ind transformations, and perhaps simplify the code considerably.

@muhlba91
Contributor Author

muhlba91 commented Jan 25, 2017

Hi,

thanks for your comments and suggestions. :) I'll look into the CartesianIndex approach. With this implementation I think we're already close to the optimal memory usage: I got it down to a single temporary array of size N+1 (N being the number of dimensions), so the current consumption is just a bit above 5 MB with ~25 allocations, while the runtime stays at about 0.1 seconds. If CartesianIndex buys more performance at the cost of somewhat higher memory consumption, it could still pay off, as long as the additional memory needs don't hurt the runtime.

Also, I'm still trying to figure out how to apply this to non-square arrays. If I expand F to a square array using the maximum of I's dimensions, the results are fine, but I'd then need to crop the expanded dimensions of F at the end so they match those of I again.

@timholy
Member

timholy commented Jan 25, 2017

Like I said, I don't know the algorithm well, but why does it need to be square? Is it because of the permutation? A few large allocations shouldn't hurt the runtime very much; the real performance hit comes from having lots of little allocations. So I don't necessarily think you have to do the permutation in-place.

@muhlba91
Contributor Author

Yep, exactly. The square requirement is caused by those permutations.
I'll check tomorrow how the performance changes when not doing it in-place.

@muhlba91
Contributor Author

Indeed, allowing the extra allocations from permutedims doesn't change the runtime (still ~0.105 seconds), and the allocations are back at ~9 MB. I also updated some comments and documentation. I'm going to look into CartesianIndex in the coming days.
As for the suggestion regarding the perm Vector: it comes from circshift(d, 1), which itself returns an array/vector.
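A tiny illustration (not the PR's code) of why this yields a Vector:

d = [1, 2, 3]
perm = circshift(d, 1)    # allocates and returns a new Vector: [3, 1, 2]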

@coveralls

Coverage Status

Coverage decreased (-2.5%) to 76.183% when pulling aea16ef on muhlba91:master into b3d6f46 on timholy:master.

@coveralls

Coverage Status

Coverage decreased (-2.5%) to 76.241% when pulling e54d3e7 on muhlba91:master into b3d6f46 on timholy:master.

@coveralls

Coverage Status

Coverage decreased (-2.2%) to 76.531% when pulling cc05abe on muhlba91:master into b3d6f46 on timholy:master.

@coveralls

Coverage Status

Coverage increased (+0.7%) to 79.37% when pulling 8a24bd8 on muhlba91:master into b3d6f46 on timholy:master.

@timholy
Member

timholy commented Jan 26, 2017

You're doing some great work here. In case you're interested, permit me to show you some of the Joy of Tuples 😄 :

# Your permutesubs! function
julia> function permutesubs!(subs::Tuple, perm::AbstractVector{Int}, result::AbstractArray{Int})
           n = length(subs)
           @inbounds @simd for i = 1:n
               result[i] = subs[perm[i]]
           end
           return result
       end
permutesubs! (generic function with 1 method)                                                                                                                                                

# An alternative tuples-based approach. AFAICT you only need to circularly-permute the indices, so...
julia> circperm(t::Tuple) = _circperm(t)
circperm (generic function with 1 method)

julia> @inline _circperm(t::Tuple) = _circperm(t...)
_circperm (generic function with 1 method)

julia> @inline _circperm(t1, trest...) = (trest..., t1)
_circperm (generic function with 2 methods)

julia> circperm((1,2,3))
(2,3,1)

julia> using BenchmarkTools

julia> perm = [2,3,1]
3-element Array{Int64,1}:
 2
 3
 1

julia> result = Array{Int}(3)
3-element Array{Int64,1}:
 140662317212200
 140662317187080
 140662319415392

julia> t = (1,2,3)
(1,2,3)

julia> permutesubs!(t, perm, result)
3-element Array{Int64,1}:
 2
 3
 1

julia> @benchmark permutesubs!($t, $perm, $result)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     5.00 ns (0.00% GC)
  median time:      5.00 ns (0.00% GC)
  mean time:        5.05 ns (0.00% GC)
  maximum time:     15.00 ns (0.00% GC)

julia> @benchmark circperm($t)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     2.00 ns (0.00% GC)
  median time:      2.00 ns (0.00% GC)
  mean time:        2.02 ns (0.00% GC)
  maximum time:     13.00 ns (0.00% GC)

julia> @code_llvm circperm(t)

define void @julia_circperm_70788([3 x i64]* noalias sret, [3 x i64]*) #0 {
top:
  %2 = getelementptr inbounds [3 x i64], [3 x i64]* %1, i64 0, i64 1
  %3 = getelementptr inbounds [3 x i64], [3 x i64]* %1, i64 0, i64 2
  %4 = load i64, i64* %2, align 1
  %5 = load i64, i64* %3, align 1
  %6 = getelementptr inbounds [3 x i64], [3 x i64]* %1, i64 0, i64 0
  %7 = load i64, i64* %6, align 1
  %.sroa.0.0..sroa_idx = getelementptr inbounds [3 x i64], [3 x i64]* %0, i64 0, i64 0
  store i64 %4, i64* %.sroa.0.0..sroa_idx, align 8
  %.sroa.2.0..sroa_idx1 = getelementptr inbounds [3 x i64], [3 x i64]* %0, i64 0, i64 1
  store i64 %5, i64* %.sroa.2.0..sroa_idx1, align 8
  %.sroa.3.0..sroa_idx2 = getelementptr inbounds [3 x i64], [3 x i64]* %0, i64 0, i64 2
  store i64 %7, i64* %.sroa.3.0..sroa_idx2, align 8
  ret void
}

julia> @code_llvm permutesubs!(t, perm, result)

define %jl_value_t* @"julia_permutesubs!_70962"([3 x i64]*, %jl_value_t*, %jl_value_t*) #0 {
top:
  %3 = bitcast %jl_value_t* %1 to i64**
  %4 = load i64*, i64** %3, align 8
  %5 = bitcast %jl_value_t* %2 to i64**
  %6 = load i64*, i64** %5, align 8
  %7 = load i64, i64* %4, align 8
  %8 = add i64 %7, -1
  %9 = getelementptr [3 x i64], [3 x i64]* %0, i64 0, i64 %8
  %10 = load i64, i64* %9, align 8
  store i64 %10, i64* %6, align 8
  %11 = getelementptr i64, i64* %4, i64 1
  %12 = load i64, i64* %11, align 8
  %13 = add i64 %12, -1
  %14 = getelementptr [3 x i64], [3 x i64]* %0, i64 0, i64 %13
  %15 = load i64, i64* %14, align 8
  %16 = getelementptr i64, i64* %6, i64 1
  store i64 %15, i64* %16, align 8
  %17 = getelementptr i64, i64* %4, i64 2
  %18 = load i64, i64* %17, align 8
  %19 = add i64 %18, -1
  %20 = getelementptr [3 x i64], [3 x i64]* %0, i64 0, i64 %19
  %21 = load i64, i64* %20, align 8
  %22 = getelementptr i64, i64* %6, i64 2
  store i64 %21, i64* %22, align 8
  ret %jl_value_t* %2
}

So they're both great, but the tuples one is even faster. Those @inline statements are absolutely crucial; any place you have splatting (`...`) in an inner loop, it will be slow unless you @inline it (in which case the compiler will specialize for the specific tuple length).

@muhlba91
Contributor Author

muhlba91 commented Jan 27, 2017

Hi,

thanks a lot for pointing this out! :)
Now I use this function and tuples instead of permutesubs!, and I also got rid of the permutation vector perm, since it just transposes the whole array. The runtime gain isn't really noticeable at a problem size of 500x500, but I also ran tests at 2000x2000, where the runtime dropped by nearly 0.7 seconds compared to the older version (down to ~3 seconds from 3.7 seconds before). The number of required allocations is also reduced.
Current state with the test suite posted above:

0.099389 seconds (25 allocations: 9.538 MB)
0.104037 seconds (25 allocations: 9.538 MB, 2.17% gc time)
0.107043 seconds (25 allocations: 9.538 MB)
0.107534 seconds (25 allocations: 9.538 MB)
0.103300 seconds (25 allocations: 9.538 MB, 2.29% gc time)

I also took a look at CartesianIndex but, imo, the main problem is the subscript permutation circperm: since CartesianIndex is not mutable, I'd need to create new indices, increasing memory usage (and runtime).

@coveralls

Coverage Status

Coverage decreased (-74.4%) to 4.279% when pulling 35499a8 on muhlba91:master into b3d6f46 on timholy:master.

@coveralls

Coverage Status

Coverage increased (+0.6%) to 79.331% when pulling c13d47a on muhlba91:master into b3d6f46 on timholy:master.

@timholy
Member

timholy commented Jan 27, 2017

Regarding the storage (as written, this only works on Julia 0.5 or higher):

circperm(idx::CartesianIndex) = CartesianIndex(circperm(idx.I))
circperm(t::Tuple) = _circperm(t)
@inline _circperm(t::Tuple) = _circperm(t...)
@inline _circperm(t1, trest...) = (trest..., t1)

circperm2(sz, i) = _circperm2(sz, i);
@inline _circperm2(sz, i) = sub2ind(sz, circperm(ind2sub(sz, i))...)

A = [CartesianIndex((rand(1:1000), rand(1:1000))) for i = 1:1000, j = 1:1000]
B = map(idx->sub2ind((1000,1000), idx.I...), A)
A .= circperm.(A);
f = i -> circperm2((1000,1000), i)
B .= f.(B);
@time A .= circperm.(A);
@time B .= f.(B);
nothing

Results:

  0.001597 seconds (7 allocations: 224 bytes)
  0.012136 seconds (7 allocations: 224 bytes)

Julia/LLVM are quite smart about not allocating memory when it comes to tuples (which CartesianIndex is merely a wrapper around). The array doesn't actually store a "CartesianIndex object," it's just a chunk of memory of size 2-by-1000-by-1000 Ints; the CartesianIndex is just an interpretation of the meaning of those bits.

All that said, I have no idea whether it's this part of your computation that's a bottleneck (from the runtimes above, I suspect not). Only if profiling shows that calls to sub2ind or ind2sub result in an appreciable load is it worth worrying about this further.
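A minimal way to check that (illustrative only; on Julia 0.5/0.6 `@profile` and `Profile.print` come from Base, on Julia ≥ 0.7 add `using Profile` first):

A = rand(Bool, 500, 500)
bwdist(A)             # warm up / compile first
@profile bwdist(A)
Profile.print()       # look for sub2ind / ind2sub entries in the report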

I'd be fine with merging this as-is. Are you still playing with it, or are you good to go?

@muhlba91
Contributor Author

Oh cool! :) I'm not too familiar with CartesianIndex, but I'll profile it again next week to figure out where the bottleneck is, and then let you know whether we're good to merge or whether I've found something else to improve. ;)

remove unnecessary ind2sub calls;
optimize square euclidean computation
@timholy
Member

timholy commented Jan 30, 2017

@muhlba91, I can help you resolve the conflicts if you're unfamiliar with the process.

@muhlba91
Contributor Author

Just to keep you updated on the progress: I was able to use a few @inline functions and to get rid of some truly unnecessary ind2sub calls. With the test suite posted in the first post, the results are now:

0.079463 seconds (25 allocations: 9.538 MB)
0.096297 seconds (25 allocations: 9.538 MB, 9.64% gc time)
0.084291 seconds (25 allocations: 9.538 MB)
0.094686 seconds (25 allocations: 9.538 MB, 7.26% gc time)
0.079438 seconds (25 allocations: 9.538 MB)

Internally, I changed my test suite to a problem size of 4000x4000 to spot runtime differences more easily. Here are the current stats for that (before these optimizations it was about 1 second slower ;) ):

5.222761 seconds (25 allocations: 610.353 MB, 4.49% gc time)
5.184712 seconds (25 allocations: 610.353 MB, 4.56% gc time)
5.327673 seconds (25 allocations: 610.353 MB, 5.07% gc time)
5.246481 seconds (25 allocations: 610.353 MB, 4.86% gc time)
5.282295 seconds (25 allocations: 610.353 MB, 4.82% gc time)

@timholy - yes, that would be nice. I've never had to rebase/merge the master branch into my own fork before.

@timholy
Member

timholy commented Jan 30, 2017

Wow, that's a pretty dramatic improvement! Nice work!

The conflicts arose from the merge and release of Images 0.6.0. If you're done with this, I can clean up the conflicts and merge to master for you, giving you credit for your work (using --author). If you're still playing with it but want to continue using the new version of Images, I can create a new branch for you to git fetch. Which would you prefer?

@muhlba91
Contributor Author

Thank you! :)

I don't think a new branch is necessary. I have just one more idea to try out tomorrow (although I'm not quite sure it will help ^^), so we can clean up the conflicts and merge to master tomorrow. ;)

However, if you have a bit of time, I'd like to ask a short question. When I used @inline functions for adding and multiplying two numbers, I gained quite a bit of performance. I had expected that not using these @inline functions would be either more efficient or make no difference. Maybe you can spot why I gained at least about 0.15 seconds in the 4000x4000 case?

@timholy
Member

timholy commented Jan 30, 2017

How exactly did you use it? You can't use it at the "call site," you can only declare that certain functions should be inlined. Do you perhaps mean @inbounds?

@muhlba91
Contributor Author

I used it in lines 212, 171, and 103. There, I created inlined functions for addition and multiplication. For example, in line 103 I had x += (d_j - 1)*stride[j]; switching to the inlined version yielded a gain of 0.05 s. That's not much, but for the Euclidean distance (line 212) the gain was up to 0.10-0.15 s.
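For illustration, such a helper would look roughly like this (a sketch with a made-up name, not the PR's exact definition):

# Hypothetical inlined helper, not the PR's exact code.
@inline add_offset(x, d_j, stride_j) = x + (d_j - 1) * stride_j

# so that
#     x += (d_j - 1) * stride[j]
# becomes
#     x = add_offset(x, d_j, stride[j])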

@muhlba91
Contributor Author

OK, I think we are good to go now. :) Can you clean up the conflicts and merge it into master, please?

@muhlba91 changed the title from "[WIP] Euclidean Distance Transform" to "Euclidean Distance Transform" on Jan 31, 2017
@timholy
Member

timholy commented Feb 1, 2017

See if you like #580 (if you don't, tell me what you don't like).

By the way, you were right to be skeptical about the benefit of creating those add_ and mul_ inlined functions: I couldn't find any evidence of a performance benefit, so I got rid of them (and then did more). I'm not sure whether your system differs from mine, but I do recommend BenchmarkTools for benchmarking.
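For example (a minimal sketch, reusing the test image from the first post):

using BenchmarkTools

A = rand(Bool, 500, 500)
@benchmark bwdist($A)    # $ interpolates A so setup cost isn't measured; reports times and allocations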

@timholy
Member

timholy commented Feb 7, 2017

Closed by #580

@timholy timholy closed this Feb 7, 2017