
Speedup _inpaint_biharmonic_single_channel #3489

Closed
wants to merge 2 commits into from

Conversation

GalAvineri

Description

This is a speedup for the inpaint method, specifically for the '_inpaint_biharmonic_single_channel' function, in the part where the masked points are iterated over.

I've added a bit of speedup in 2 steps, each in a separate commit:

  1. I made a bit of speedup by vectorizing a few operations.
  2. I added a (perhaps more significant) speedup by converting the iteration to run in parallel.
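Concretely, the second step could look roughly like the sketch below. This is a minimal illustration, not the actual PR code: `process_point` is a hypothetical stand-in for the per-point biharmonic coefficient work, and the point is only to show the Python loop being replaced by a `Pool.map`.

```python
from multiprocessing import Pool

def process_point(pt):
    # Stand-in for the per-point biharmonic coefficient computation.
    r, c = pt
    return r + c

def inpaint_points_serial(mask_points):
    # Original approach: plain Python loop over the masked points.
    return [process_point(pt) for pt in mask_points]

def inpaint_points_parallel(mask_points, processes=2):
    # Proposed approach: map the same helper over a process pool.
    with Pool(processes) as pool:
        return pool.map(process_point, mask_points)

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1)]
    assert inpaint_points_serial(pts) == inpaint_points_parallel(pts)
```

The two functions produce identical results; only the scheduling differs.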

Checklist

I'm new to contributing to open source projects, so before I start working on the checklist I would like to receive your thoughts about it :)

By the way, as I'm new to contributing, I'd love to hear your advice on my attempt to contribute, and I'd appreciate your help in completing this feature enhancement :)

References

This is an enhancement regarding issue #2008

For reviewers

(Don't remove the checklist below.)

  • Check that the PR title is short, concise, and will make sense 1 year
    later.
  • Check that new functions are imported in corresponding __init__.py.
  • Check that new features, API changes, and deprecations are mentioned in
    doc/release/release_dev.rst.
  • Consider backporting the PR with @meeseeksdev backport to v0.14.x

@pep8speaks

Hello @GalAvineri! Thanks for submitting the PR.

Line 37:80: E501 line too long (85 > 79 characters)
Line 38:80: E501 line too long (86 > 79 characters)
Line 65:80: E501 line too long (101 > 79 characters)

@emmanuelle
Member

@GalAvineri thanks for your PR! At the moment it seems that tests are not passing. Can you execute pytest skimage/restoration/tests/test_inpaint.py in your branch and try to understand where the problem comes from?

@soupault soupault self-assigned this Oct 22, 2018
# Create biharmonic coefficients ndarray
neigh_coef = np.zeros(b_hi - b_lo)
neigh_coef[tuple(mask_pt_idx - b_lo)] = 1
neigh_coef = laplace(laplace(neigh_coef))
Member

I think the call to laplace is the only one that is parallelizable. The others are all python and thus will get held up in the GIL.

Author

It seems that each masked point has some variables that are unique to it.
These include b_lo, b_hi, neigh_coef, and the variables it_inner, coefs, tmp_pt_idx and tmp_pt_is.
None of these variables are shared between the processes, so in my opinion computing them for each point can be parallelized, as the GIL will not intervene here. Do you agree with me on this point?

I do agree that the GIL might intervene with the modifications of the matrix_unknown and matrix_known variables.
So regarding this, I can suggest the following:
each process will produce a list of changes that should be made to these matrices, and the main thread will apply all of them.
This way the computation of which modifications should be made will happen in parallel, and the modifications themselves will be done serially. What do you say?
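The scheme being proposed could look roughly like the following sketch. All names here are illustrative stand-ins, not the actual scikit-image internals: workers return lists of (row, col, value) updates instead of mutating the shared matrices, and the main process applies them serially.

```python
import numpy as np

def compute_updates(pt):
    # Stand-in for the per-point work: return the changes this point would
    # make to the matrix, instead of writing them into shared state.
    r, c = pt
    return [(r, c, 1.0)]

def apply_updates(shape, all_updates):
    # Serial application in the main process: no shared-state races.
    m = np.zeros(shape)
    for updates in all_updates:
        for r, c, v in updates:
            m[r, c] += v
    return m

# With a Pool, all_updates would come from pool.map(compute_updates, points);
# here the map is done serially just to show the data flow.
points = [(0, 0), (1, 1)]
matrix = apply_updates((2, 2), [compute_updates(p) for p in points])
```

The design choice is that only the cheap, order-independent write-back stays serial, while the expensive per-point computation is free to run in parallel.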

@hmaarrfk
Member

@GalAvineri, @emmanuelle how much overhead does Pool(1) add? is it worth having two code paths?

@GalAvineri
Author

@emmanuelle I will happily do so once I find some spare time! :)
@hmaarrfk I'm not sure about the overhead; how do you recommend checking this?
And what two code paths are you referring to?

@hmaarrfk
Member

Hi Gal, please disregard my comment about the GIL. I thought I read "threads", but this is probably starting up different Python interpreters, each having their own GIL.

I would start by defining a benchmark for your particular use case. It could be a simple one using %timeit, but we can work on integrating it with ASV once it is more fleshed out. You might find that you get very different speedups on different workloads.
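A minimal benchmark harness along these lines, using the stdlib timeit module so it also works outside IPython. The `workload` function and the sizes are placeholders for the actual inpainting call and image shapes:

```python
import timeit

def workload(n):
    # Placeholder for e.g. inpainting an n x n image with a fixed mask.
    return sum(i * i for i in range(n))

for n in (100, 1000):
    # Average over 100 calls to smooth out timer noise.
    per_call = timeit.timeit(lambda: workload(n), number=100) / 100
    print(f"n={n}: {per_call:.2e} s per call")
```

Once numbers like these exist for several workloads, the same cases can be ported into an ASV benchmark file.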

@hmaarrfk
Member

The two code paths would be: one starts the pool, the other just calls the function directly without starting a pool of 1.

@GalAvineri
Author

The two code paths would be: one starts the pool, the other just calls the function directly without starting a pool of 1.

I'm still not entirely sure how you would have these two code paths together, as mentioned here:

@GalAvineri, @emmanuelle how much overhead does Pool(1) add? is it worth having two code paths?

In case you are asking whether the overhead of creating a pool is worth it, I would say it depends on the amount of time that can be parallelized later.
I had an example (quite an extreme one, perhaps) of a data structure of shape (50, 50, 96) with 99.3% of the pixels masked to be inpainted.
The result of the parallelization was a speedup of about 160%, which in my case was 2 minutes faster.
In that case the overhead of creating the pool was certainly negligible.

Did I answer your question? If not, can you try to elaborate more? :)

@GalAvineri
Author

Hi Gal, please disregard my comment about the GIL. I thought I read "threads", but this is probably starting up different Python interpreters, each having their own GIL.

I agree, although you might be correct about the modifications of the known and unknown matrices.

I would start by defining a benchmark for your particular use case. It could be a simple one using %timeit but we can work on integrating it with ASV once it is more fleshed out. You might find that you get very different speed ups on different workloads.

I'd love to do so once I find time for it :)

@hmaarrfk
Member

hmaarrfk commented Oct 23, 2018

@GalAvineri I'm always pretty excited about performance PRs. I'm curious to see what will come from this experiment of yours! Looking forward to seeing your modifications when you have time to make them.

You asked for thoughts on your implementation. As it stands, even when processes=1, you are starting a Pool. Therefore, this leads me to believe that this will cause a slowdown for most people, simply by virtue of the call to Pool.

Pool(1) is basically doing nothing. If you have a main process, you don't need a Pool of 1 process.
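The two code paths being discussed could be sketched like this (hypothetical names; the only point is to skip Pool creation entirely when one process is requested):

```python
from multiprocessing import Pool

def process_point(pt):
    # Stand-in for the real per-point computation.
    return pt * 2

def run(points, processes=1):
    if processes <= 1:
        # Direct path: avoids the Pool start-up cost entirely.
        return [process_point(p) for p in points]
    # Parallel path: only pay for the Pool when it can actually help.
    with Pool(processes) as pool:
        return pool.map(process_point, points)
```

Both branches return the same result, so callers don't need to know which path was taken.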

Here is a simple benchmark showing how costly Pool(1) might be.

In [1]: from multiprocessing import Pool                                      

In [2]: %timeit -n 10 -r 10 Pool(1)                                           
5.61 ms ± 1.67 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

In [3]: %timeit -n 10 -r 10 pass                                              
78.5 ns ± 41.5 ns per loop (mean ± std. dev. of 10 runs, 10 loops each)

This benchmark indicates that starting a Pool of 1 worker takes on the order of 5 ms. That isn't really negligible compared to the no-op of about 100 ns.

This is the kind of preliminary benchmark that shows us the use case you are considering.

Maybe more importantly than proving the performance speedup is proving the correctness as @emmanuelle pointed out.

@GalAvineri
Author

Thank you for the great elaboration!
As you suggested I'll start with the correctness and we can resume the speed up conversation afterwards :)

@soupault soupault changed the title speedup for inpaint Speedup _inpaint_biharmonic_single_channel Nov 1, 2018
@soupault
Member

soupault commented Nov 1, 2018

Hi @GalAvineri ,
Thanks for your interest in contributing and, particularly, in inpaint_biharmonic! 🙂

As already mentioned by the other dev team members, it would be very natural (and also very convenient for the reviewers) to have PRs such as yours supplemented with benchmarking info. A set of numbers for comparison (running time on small, medium, and large images / small, medium, and large inpainting domains) would set a very solid ground for a constructive discussion. Optionally, you may consider adding an ASV (AirSpeed Velocity) file, which implements an interface for automatic benchmarking of the function in our infrastructure. For the multiprocessing implementation, you may also look at another PR, e.g. #3233.

Now for the PR itself. I think (as mentioned in #2008 (comment)) that the suggested change is, indeed, one of the ways of accelerating the function. From brief profiling, it seems that a great amount of time comes from the laplace calls and the sparse matrix operations. My intuition is that Cythonizing the function would be the most efficient option, although I foresee a lot of trouble in writing 3D+ Cython code. So, I'd be really interested in seeing how fast we can go with pure multithreaded Python.
The algorithm, of course, has its own limits, but the current implementation is quite far from optimal. Just for reference, attached below are the profiling results of the gallery example.

I had an example (quite an extreme one perhaps) of an data structure of shape (50, 50, 96) with 99.3% pixels masked to be inpainted.

🤣 What kind of science are you doing?

[screenshot: profiling results of the gallery example]

@soupault soupault added this to the 0.15 milestone Nov 1, 2018
@soupault soupault modified the milestones: 0.15, 0.16 Apr 20, 2019
Base automatically changed from master to main February 18, 2021 18:23
@soupault
Member

soupault commented Oct 8, 2021

Superseded by #5240.
Thank you for the contribution, and please feel free to open a new PR if there is something else you would like to suggest implementation-wise.

@soupault soupault closed this Oct 8, 2021