
Speedup _inpaint_biharmonic_single_channel #3489

Closed
wants to merge 2 commits into from

Conversation

GalAvineri

Description

This is a speedup for the inpaint method, specifically for the '_inpaint_biharmonic_single_channel' function, in the part where the masked points are iterated over.

I've added a bit of speedup in 2 steps, each in a separate commit:

  1. I made a bit of speedup by vectorizing a few operations.
  2. I added a (perhaps more significant) speedup by converting the iteration to run in parallel.
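Concretely, the second step could look roughly like the sketch below. This is a minimal illustration, not the actual PR code: `process_point` is a hypothetical stand-in for the per-point biharmonic coefficient work, and the point is only to show the Python loop being replaced by a `Pool.map`.

```python
from multiprocessing import Pool

def process_point(pt):
    # Stand-in for the per-point biharmonic coefficient computation.
    r, c = pt
    return r + c

def inpaint_points_serial(mask_points):
    # Original approach: plain Python loop over the masked points.
    return [process_point(pt) for pt in mask_points]

def inpaint_points_parallel(mask_points, processes=2):
    # Proposed approach: map the same helper over a process pool.
    with Pool(processes) as pool:
        return pool.map(process_point, mask_points)

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1)]
    assert inpaint_points_serial(pts) == inpaint_points_parallel(pts)
```

The two functions produce identical results; only the scheduling differs.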

Checklist

I'm new to contributing to open source projects, so before I start working on the checklist I would like to receive your thoughts about it :)

By the way, as I'm new to contributing, I'd love to hear your advice on my attempt to contribute, and I'd appreciate your help in completing this feature enhancement :)

References

This is an enhancement regarding issue #2008

For reviewers

(Don't remove the checklist below.)

  • Check that the PR title is short, concise, and will make sense 1 year
    later.
  • Check that new functions are imported in corresponding __init__.py.
  • Check that new features, API changes, and deprecations are mentioned in
    doc/release/release_dev.rst.
  • Consider backporting the PR with @meeseeksdev backport to v0.14.x

@pep8speaks

Hello @GalAvineri! Thanks for submitting the PR.

Line 37:80: E501 line too long (85 > 79 characters)
Line 38:80: E501 line too long (86 > 79 characters)
Line 65:80: E501 line too long (101 > 79 characters)

@emmanuelle
Member

@GalAvineri thanks for your PR! At the moment it seems that tests are not passing. Can you execute pytest skimage/restoration/tests/test_inpaint.py in your branch and try to understand where the problem comes from?

@soupault soupault self-assigned this Oct 22, 2018
# Create biharmonic coefficients ndarray
neigh_coef = np.zeros(b_hi - b_lo)
neigh_coef[tuple(mask_pt_idx - b_lo)] = 1
neigh_coef = laplace(laplace(neigh_coef))
Member

I think the call to laplace is the only one that is parallelizable. The others are all python and thus will get held up in the GIL.

Author

It seems that each masked point has some variables that are unique to it.
These include b_lo, b_hi, neigh_coef, and the variables it_inner, coefs, tmp_pt_idx and tmp_pt_is.
None of these variables are shared between the processes, so in my opinion computing them for each point can be parallelized, as the GIL will not intervene here. Do you agree with me on this point?

I do agree that the GIL might intervene with the modifications of the matrix_unknown and matrix_known variables.
So regarding this, I can suggest the following:
each process will produce a list of changes that should be made to these matrices, and the main thread will apply all of them.
This way the computation of which modifications should be made will happen in parallel, and the modifications themselves will be done serially. What do you say?
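The scheme being proposed could look roughly like the following sketch. All names here are illustrative stand-ins, not the actual scikit-image internals: workers return lists of (row, col, value) updates instead of mutating the shared matrices, and the main process applies them serially.

```python
import numpy as np

def compute_updates(pt):
    # Stand-in for the per-point work: return the changes this point would
    # make to the matrix, instead of writing them into shared state.
    r, c = pt
    return [(r, c, 1.0)]

def apply_updates(shape, all_updates):
    # Serial application in the main process: no shared-state races.
    m = np.zeros(shape)
    for updates in all_updates:
        for r, c, v in updates:
            m[r, c] += v
    return m

# With a Pool, all_updates would come from pool.map(compute_updates, points);
# here the map is done serially just to show the data flow.
points = [(0, 0), (1, 1)]
matrix = apply_updates((2, 2), [compute_updates(p) for p in points])
```

The design choice is that only the cheap, order-independent write-back stays serial, while the expensive per-point computation is free to run in parallel.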

@hmaarrfk
Member

@GalAvineri, @emmanuelle how much overhead does Pool(1) add? is it worth having two code paths?

@GalAvineri
Author

@emmanuelle I will happily do so once I find some spare time! :)
@hmaarrfk I'm not sure about the overhead; how do you recommend checking this?
And what two code paths are you referring to?

@hmaarrfk
Member

Hi Gal, please disregard my comment about the GIL. I thought I read "threads", but this is probably starting up different Python interpreters, each having their own GIL.

I would start by defining a benchmark for your particular use case. It could be a simple one using %timeit, but we can work on integrating it with ASV once it is more fleshed out. You might find that you get very different speedups on different workloads.
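A minimal benchmark harness along these lines, using the stdlib timeit module so it also works outside IPython. The `workload` function and the sizes are placeholders for the actual inpainting call and image shapes:

```python
import timeit

def workload(n):
    # Placeholder for e.g. inpainting an n x n image with a fixed mask.
    return sum(i * i for i in range(n))

for n in (100, 1000):
    # Average over 100 calls to smooth out timer noise.
    per_call = timeit.timeit(lambda: workload(n), number=100) / 100
    print(f"n={n}: {per_call:.2e} s per call")
```

Once numbers like these exist for several workloads, the same cases can be ported into an ASV benchmark file.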

@hmaarrfk
Member

The two code paths would be: one starts the pool, the other just calls the function directly without starting a pool of 1.

@GalAvineri
Author

The two code paths would be: one starts the pool, the other just calls the function directly without starting a pool of 1.

I'm still not entirely sure how you would have these two code paths together, as mentioned here:

@GalAvineri, @emmanuelle how much overhead does Pool(1) add? is it worth having two code paths?

In case you are asking whether the overhead of creating a pool is worth it, I would say it depends on the amount of time that can be parallelized later.
I had an example (quite an extreme one, perhaps) of a data structure of shape (50, 50, 96) with 99.3% of the pixels masked to be inpainted.
The result of the parallelization was a speedup of about 160%, which in my case was 2 minutes faster.
In that case the overhead of creating the pool was certainly negligible.

Did I answer your question? If not, can you try to elaborate more? :)

@GalAvineri
Author

Hi Gal, please disregard my comment about the GIL. I thought I read "threads", but this is probably starting up different Python interpreters, each having their own GIL.

I agree, although you might be correct about the modifications of the known and unknown matrices.

I would start by defining a benchmark for your particular use case. It could be a simple one using %timeit but we can work on integrating it with ASV once it is more fleshed out. You might find that you get very different speed ups on different workloads.

I'd love to do so once I find time for it :)

@hmaarrfk
Member

hmaarrfk commented Oct 23, 2018

@GalAvineri I'm always pretty excited about performance PRs. I'm curious to see what will come from this experiment of yours! Looking forward to seeing your modifications when you have time to make them.

You asked for thoughts on your implementation. As it stands, even when processes=1, you are starting a Pool. Therefore, this leads me to believe that this will cause a slowdown for most people, simply by virtue of the call to Pool.

Pool(1) is basically doing nothing. If you have a main process, you don't need a Pool of 1 process.
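The two code paths being discussed could be sketched like this (hypothetical names; the only point is to skip Pool creation entirely when one process is requested):

```python
from multiprocessing import Pool

def process_point(pt):
    # Stand-in for the real per-point computation.
    return pt * 2

def run(points, processes=1):
    if processes <= 1:
        # Direct path: avoids the Pool start-up cost entirely.
        return [process_point(p) for p in points]
    # Parallel path: only pay for the Pool when it can actually help.
    with Pool(processes) as pool:
        return pool.map(process_point, points)
```

Both branches return the same result, so callers don't need to know which path was taken.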

Here is a simple benchmark showing how costly Pool(1) might be.

In [1]: from multiprocessing import Pool                                      

In [2]: %timeit -n 10 -r 10 Pool(1)                                           
5.61 ms ± 1.67 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

In [3]: %timeit -n 10 -r 10 pass                                              
78.5 ns ± 41.5 ns per loop (mean ± std. dev. of 10 runs, 10 loops each)

This benchmark indicates that starting a Pool of 1 worker takes on the order of 5 ms. That isn't really negligible compared to the no-op of about 100 ns.

This is the kind of preliminary benchmark that shows us the use case you are considering.

Maybe more importantly than proving the performance speedup is proving the correctness as @emmanuelle pointed out.

@GalAvineri
Author

Thank you for the great elaboration!
As you suggested I'll start with the correctness and we can resume the speed up conversation afterwards :)

@soupault soupault changed the title speedup for inpaint Speedup _inpaint_biharmonic_single_channel Nov 1, 2018
@soupault
Member

soupault commented Nov 1, 2018

Hi @GalAvineri ,
Thanks for your interest in contributing and, particularly, in inpaint_biharmonic! 🙂

As already mentioned by the other dev team members, it would be very natural (and also very convenient for the reviewers) to have PRs such as yours supplemented with benchmarking info. A set of numbers for comparison (running time on small, medium, and large images / small, medium, and large inpainting domains) would set a very solid ground for a constructive discussion. Optionally, you may consider adding an ASV (AirSpeed Velocity) file, which implements an interface for automatic benchmarking of the function in our infrastructure. For the multiprocessing implementation, you may also look at another PR, e.g. #3233.

Now for the PR itself. I think (as mentioned in #2008 (comment)) that the suggested change is, indeed, one of the ways of accelerating the function. From brief profiling, it seems that a great amount of time comes from the laplace calls and the sparse matrix operations. My intuition is that Cythonizing the function would be the most efficient option, although I foresee a lot of trouble in writing 3D+ Cython code. So, I'd be really interested in seeing how fast we can go with pure multithreaded Python.
The algorithm, of course, has its own limits, but the current implementation is quite far from optimal. Just for reference, attached below are the profiling results of the gallery example.

I had an example (quite an extreme one perhaps) of an data structure of shape (50, 50, 96) with 99.3% pixels masked to be inpainted.

🤣 What kind of science are you doing?

[screenshot: profiling results of the gallery example]

@soupault soupault added this to the 0.15 milestone Nov 1, 2018
@soupault soupault modified the milestones: 0.15, 0.16 Apr 20, 2019
Base automatically changed from master to main February 18, 2021 18:23
@soupault
Member

soupault commented Oct 8, 2021

Superseded by #5240.
Thank you for the contribution, and please feel free to open a new PR if there is something else you would like to suggest implementation-wise.

@soupault soupault closed this Oct 8, 2021