
Example batch processing #3386

Open · wants to merge 11 commits into main

Conversation

@hmaarrfk (Member) commented Sep 3, 2018

@jni I guess I was always working with large images (2048 x 2048) and saw big improvements with parallel processing.

More to come, but this is the example I'm converging towards.

Do we have a way to have these in notebook format?

Checklist

[It's fine to submit PRs which are a work in progress! But before they are merged, all PRs should provide:]

[For detailed information on these and other aspects see scikit-image contribution guidelines]

For reviewers

(Don't remove the checklist below.)

  • Check that the PR title is short, concise, and will make sense 1 year
    later.
  • Check that new functions are imported in corresponding __init__.py.
  • Check that new features, API changes, and deprecations are mentioned in
    doc/release/release_dev.rst.
  • Consider backporting the PR with @meeseeksdev backport to v0.14.x


@jni (Member) commented Sep 3, 2018

Do we have a way to have these in notebook format?

Not currently, but @matthew-brett has a library to do something of the sort: https://github.com/matthew-brett/nb2plots

I don't know whether sphinx/sphinx-gallery produces an rst intermediate from these that we could plug into, but it would be very valuable (imho) to have a "download as notebook" link on each example.

Regarding your example, (a) very interesting, but (b) I hope we find a solution for using large datasets in examples and we can make the example less artificial.

When I mentioned that I haven't got good results, it's not about the size of the images (they're big), but, maybe, about overlap. Here are two examples @mrocklin posted:

https://nbviewer.jupyter.org/gist/mrocklin/ec745d6c2a12dddddb125ef460a4da76
https://gist.github.com/mrocklin/7173bffbeca821bd14351889b8905a54

i.e. it's hard to get speedups for non-trivial tasks.

Nevertheless, your results here are surprising, because I think running ptime on a gaussian filter does not get you anywhere near a 4x speedup on 4 cores:

https://pypi.org/project/ptime/

So probably what's happening is that the rest of the pipeline is parallelising quite well...

@hmaarrfk (Member, Author) commented Sep 3, 2018

@jni, man, I definitely spent more time on this than I would have wanted to.

First, I now seem to be able to get a 2x improvement using dask, instead of the 30% improvement I was seeing earlier in 2018. I think some parts of dask have definitely improved.

That said, honestly, pickle seems like the bottleneck here. Maybe @mrocklin has already looked into this? https://matthewrocklin.com/blog/work/2018/07/23/protocols-pickle

http://nbviewer.jupyter.org/gist/hmaarrfk/b0ef570b36267a5e10c81bb0309a318c
https://gist.github.com/hmaarrfk/b0ef570b36267a5e10c81bb0309a318c

Issue on numpy here: numpy/numpy#7544

Even if parallelization were perfect, pickling and unpickling that array on the same processor takes tens of milliseconds.

Finally, copying into a contiguous C array seems to be another slowdown.
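To make that overhead concrete, here is a hypothetical micro-benchmark (not from the PR; the array size and names are my own) that times a pickle round-trip of a moderately large image stack on a single process:

```python
import pickle
import time

import numpy as np

# A hypothetical stack of 16 frames of 512x512 float64 (~32 MB).
frames = np.random.random((16, 512, 512))

start = time.perf_counter()
payload = pickle.dumps(frames, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)
elapsed = time.perf_counter() - start

# The round-trip is pure serialization overhead: no computation happened.
print(f"pickle round-trip: {elapsed * 1e3:.1f} ms "
      f"for {frames.nbytes / 1e6:.0f} MB")
```

On a typical machine this lands in the tens-of-milliseconds range discussed above, which is why inter-process schedulers that pickle arrays can eat the parallel speedup.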

@soupault (Member) commented Sep 4, 2018

but it would be very valuable (imho) to have a "download as notebook" link on each example.

@jni I'm glad to inform you that we already have this feature 🙂:
[screenshot: the "Download Jupyter notebook" link below each gallery example]

Apparently, we could also have a notebook rendering plugin for Sphinx (which I find quite fascinating):
https://nbsphinx.readthedocs.io/en/0.3.4/
https://github.com/QuantEcon/sphinxcontrib-jupyter

Having the above, should we go even further and set up a Tutorials/Examples section in our website?

Sorry for being slightly off-topic.

@jni (Member) commented Sep 4, 2018

@soupault haha so great! Tick!

Bonus points: launch it in binder. =P

Regarding tutorials, yes, absolutely, I have a medium term goal of porting the tutorials repo into a nice section in the website.

@hmaarrfk (Member, Author) commented Sep 4, 2018

How do we upload ipynb? Do we clear all outputs before posting them?

This kind of tutorial requires a machine more powerful than Travis, are the regenerated results going to suffer?

Regarding Binder, do you know how the performance might compare to the average student's laptop? Personally, I am very performance conscious.

@jni (Member) commented Sep 5, 2018

@hmaarrfk

How do we upload ipynb? Do we clear all outputs before posting them?

I presume sphinx-gallery builds the notebooks. Not sure whether the outputs are cleared or not, should be simple enough to try it (e.g. point nbviewer.jupyter.org at one of the download links).

This kind of tutorial requires a machine more powerful than Travis, are the regenerated results going to suffer?

Think small. =) The Travis machine has 2 cores, so that is sufficient to demonstrate parallel processing.

Regarding Binder, do you know how the performance might compare to the average student's laptop? Personally, I am very performance conscious.

The purpose of these links is not to get people doing their "real" scikit-image analyses on Binder or to demonstrate use with a 50GB dataset, but to demonstrate code that you could run on a 50GB dataset while being performant. If you can demonstrate that you get a 2x speedup by using dask properly on the Travis machine (where the docs are built), then you're done.
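A minimal sketch of the kind of demonstration meant here, assuming dask and scipy are available (the array sizes and parameters are illustrative, not from the PR): a gaussian filter applied per chunk with enough overlap that the result matches the unchunked filter, runnable with the threaded scheduler on as few as 2 cores.

```python
import dask.array as da
import numpy as np
from scipy import ndimage as ndi

image = np.random.random((1024, 1024))
chunked = da.from_array(image, chunks=(512, 512))

# depth must cover the kernel radius: sigma=2 with scipy's default
# truncate=4.0 gives a radius of 8 pixels.
smoothed = chunked.map_overlap(
    ndi.gaussian_filter, depth=8, boundary="reflect", sigma=2
)
result = smoothed.compute(scheduler="threads")
```

Away from the image border, this agrees with `ndi.gaussian_filter(image, sigma=2)` applied to the whole array, so the same pattern scales to arrays that do not fit in memory.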

@hmaarrfk (Member, Author) commented Sep 5, 2018

Think small. =) The Travis machine has 2 cores, so that is sufficient to demonstrate parallel processing.

I guess many people are on 13 inch machines that only have 2 cores too.

If you can demonstrate that you get a 2x speedup by using dask properly on the Travis machine (where the docs are built), then you're done.

My fear is that Travis is so unpredictable that running anything close to a benchmark on it is really difficult.

@soupault (Member) commented Sep 5, 2018

I presume sphinx-gallery builds the notebooks.

From the first paragraph following one of the above URLs: Un-evaluated notebooks – i.e. notebooks without stored output cells – will be automatically executed during the Sphinx build process.

@hmaarrfk (Member, Author) commented Sep 5, 2018

Thanks @soupault 😅

@jni, with @mrocklin's help, I reran the notebook you linked me to and can now get a 3.25× speedup on the example dataset using an experimental branch of dask.
On my 4-core machine, it takes 0.87 s instead of 2.8 s.

The two main optimizations are:

  1. when operating on chunked datasets, calling y.persist() is probably much better than y.compute() since it keeps the final result as a dask blocked array. This avoids a single copy of the large array at the end of the computation. The resulting array remains in a blocked form (in memory). If you need slices in [y, x], you can make that specific slice contiguous, which is much cheaper than making the whole array contiguous.
  2. The experimental branch of dask (Trim no expand dask/dask#3950) removed several unnecessary memory copies. For example, the length of the computation graph used to be 109; it is now 33. The removed operations were associated with padding the data before eventually re-slicing it. I think the non-optimized version was copying the array over as many as 3 times, and that this was a major holdup.
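A small sketch of point 1, assuming dask is installed (the shapes here are arbitrary): persist() keeps the computed result as a blocked dask array, so only the slice you actually ask for is ever gathered into a contiguous numpy array.

```python
import dask.array as da

x = da.random.random((4096, 4096), chunks=(1024, 1024))

# persist() triggers the computation but keeps the result block-by-block
# in memory, avoiding one full-array copy at the end.
y = (x * 2).persist()

# Only this 256x256 window is copied into a contiguous numpy array.
window = y[:256, :256].compute()
```

Calling `y.compute()` instead would concatenate every block into one big contiguous array up front, which is the extra copy the comment above is trying to avoid.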

There is also something to be said about how the data is stored on disk (or SSD).
I don't think rechunking + overlapping is free, but maybe it can be if the data is stored more optimally. I'm no expert in how to store data, so I'll defer that discussion to a later day. Making use of dask's optimized loading/storing subroutines might give you that extra 0.5×.

@emmanuelle (Member) left a comment

As mentioned by @soupault, sphinx-gallery provides links to download the notebooks. The example as it is now does not comply with the format required by sphinx-gallery (e.g. an rst title; see the longer examples in the xx_applications section of the gallery).

This tutorial is very interesting, but at the moment I feel there is a bit too much information. Would it be possible to make it shorter and simpler? Also, to have a meaningful thumbnail for the gallery, would it be possible to plot (for example) the execution times for the two methods as you change the size or number of images?

Review threads on doc/examples/batch_processing/dask_batch_processing.py (outdated, resolved)
@@ -4,3 +4,4 @@ sphinx>=1.3,!=1.7.8
numpydoc>=0.6
sphinx-gallery
scikit-learn
dask[delayed]
(Member) commented:

scikit-image depends on dask, so I don't think we need dask in docs requirements?

(Member, Author) replied:

I think the delayed extra is different from the array extra, though they both depend on toolz, so I guess we satisfy the requirements for dask[delayed] today.
https://github.com/dask/dask/blob/master/setup.py#L16

@emmanuelle (Member) commented:

Binder badges can be created by activating an option in sphinx-gallery, see https://sphinx-gallery.github.io/configuration.html?highlight=binder#binder-links

@emmanuelle (Member) commented:

Out of curiosity, have you tried comparing with a third method, i.e. a more explicit embarrassingly parallel loop (e.g. with joblib.Parallel, with either the loky or the threading backend; https://joblib.readthedocs.io/en/latest/parallel.html)?
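For reference, such a loop might look like the following sketch (joblib and the frame sizes are my assumptions, not part of this PR); the threading backend shares memory, so the arrays are never pickled, and scipy's filters do most of their work outside the GIL:

```python
import numpy as np
from joblib import Parallel, delayed
from scipy import ndimage as ndi

# Hypothetical batch of 8 frames standing in for files on disk.
frames = [np.random.random((256, 256)) for _ in range(8)]

# n_jobs=2 matches the 2-core CI machines discussed above.
smoothed = Parallel(n_jobs=2, backend="threading")(
    delayed(ndi.gaussian_filter)(frame, sigma=2) for frame in frames
)
```

Swapping `backend="threading"` for the default loky backend would run separate processes instead, at the cost of pickling each frame across the process boundary.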

@hmaarrfk (Member, Author) commented Sep 5, 2018

Thank you for your comments @emmanuelle. Much of this has been rewritten in notebook format while I was on a plane. Nothing better to do than benchmark, right?

The version here is rather long and obfuscated by function calls (I hate encapsulating simple tasks in functions). The notebook removes much of this. I should really try joblib, but I just haven't had the time to play with it personally.

For image processing, I think learning a tool like dask might be a better long term investment even if it chokes when you give it 50000+ tasks.

It should be pretty easy to rewrite this example for joblib, though I've learned the devil is always in the details.

I would like to create a graph of the results, though I don't want to make it harder to read the notebook.

I think it is much easier for beginners to read

N_files = 200
images = []
for i in range(N_files):
    images.append(imread(i))

than

N_files = 200
def read_images(N_files):
    images = []
    for i in range(N_files):
        images.append(imread(i))
    return images
images = read_images(N_files)

Where the latter has been optimized for making the plot of the results.
Finally, the second version seems much more intimidating to modify for your own use, at least for my taste.

Is there a way to programmatically run cells from another cell in jupyter?

@jni (Member) commented Sep 6, 2018

@hmaarrfk Wow! That's super awesome! =) Fantastic work on this issue.

I think it is much easier for beginners to read

This depends on context. As you wrote it, of course, but if you define read_images elsewhere, this is easier:

N_files = 200
images = read_images(N_files)

This is a bad example because it's not at all obvious what read_images is doing with the integer input, but for many functions, it's very readable to abstract a task away.

Is there a way to programmatically run cells from another cell in jupyter?

Why would you want to do such a thing?

@codecov-io commented Sep 6, 2018

Codecov report: no coverage uploaded for pull request base (master@ffdb7cc); diff coverage n/a. Last update ffdb7cc...a7be184.

@hmaarrfk (Member, Author) commented Sep 6, 2018

I mostly don't want to change the way the cells are laid out while running the benchmark programmatically to create the figure:
https://github.com/hmaarrfk/scikit-image/blob/example_batch_processing/doc/examples/batch_processing/batch_processing_with_dask.ipynb

I like not having too many levels of indentation.

@jni (Member) commented Sep 6, 2018

@hmaarrfk very nice guide!

You can use nb2plot to convert it to rst. Style-wise, I would say change stuff like "the author thinks" etc to "we think". Once it gets merged, at least two members of the scikit-image team must have endorsed your message. =)

Rather than put this in the gallery, I would put it as a separate page in the getting started section, similar to "image data types and what they mean" and "a guide on numpy for images".

@sciunto (Member) commented Sep 12, 2018

This should go in doc/source/user_guide/tutorial_parallelization.rst

@soupault (Member) commented:

@hmaarrfk @jni just for you guys to see how notebooks are automatically integrated into the Sphinx-powered documentation via nbsphinx plugin - https://mipt-oulu.github.io/solt/Helen_faces.html .

@hmaarrfk (Member, Author) commented:

WIP: Still can't generate the docs.

@sciunto sciunto added 📄 type: Documentation Updates, fixes and additions to documentation 💪 Work in progress labels Oct 4, 2018
@sciunto (Member) commented Oct 9, 2018

@hmaarrfk I successfully built the doc on my machine from your PR. What's not working for you? The rendering of build/html/user_guide/tutorial_parallelization_dask.html looks correct (maybe some missing syntax coloring...)

"outputs": [],
"source": [
"%%time\n",
"# Save\n",
(Member) commented:

I would remove this command and put a proper sentence instead.

@hmaarrfk (Member, Author) commented Jul 7, 2019

Anyway, I got rid of the magic commands. @jni, I would appreciate help turning this into just some python code that gets converted to html. nb2plots doesn't seem to execute the code on my local copy of the scripts.

@jni (Member) commented Jul 13, 2019

@jakirkham here is the example

@hmaarrfk (Member, Author) commented:

@jni, any idea on how to move forward with this PR? We are almost hitting the 1 year mark on it.

@jni (Member) commented Jul 29, 2019

@hmaarrfk to paraphrase @jakirkham from a different context: "1 year? That's not so bad..." 😛

I'm confused about why you have both the rst and the notebook in the PR right now.

How does the ipython directive work? Can that get executed? imho we can just have pre-executed outputs and call it a day. And my absolute preference would be to use a timeit-based context manager, although that does not currently seem to exist... =\ But I'm happy with the generated rst, as long as we remove all the ipynb stuff.

If we want to have notebooks in the docs, that's a major discussion and I expect it will take a long time to resolve.

@hmaarrfk (Member, Author) commented:

I have both options because I'm not going to force push and delete my only copy of this notebook.

Pre-executed outputs are less than ideal, since code gets stale and becomes non-functional; I always thought executing the examples was the coolest thing about the scikit-image docs. I could call timeit.timer() myself, tic-toc style, but that basically encourages bad benchmarking.

I'm not too sure how the ipython directive works. I think it calls timeit with well-chosen settings (why those aren't upstream is a little beyond me). I can look into it. But again, remember that no beginner will ever use timeit directly. %%timeit is just too cool!!!

@jni (Member) commented Jul 29, 2019

Actually, if with timeit.timer() as t: worked the same as %%timeit, I would use that all the time in preference to the magic (in particular because it's handy to have access to the timer, e.g. to programmatically compare timings and write speedup = t1.best_time / t0.best_time).
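No such context manager exists in timeit today, but a minimal single-run version is easy to sketch (the `timer` name and `best_time` attribute mirror the hypothetical API above; a real %%timeit replacement would also need repeats and loop calibration):

```python
import time


class timer:
    """Minimal stand-in for a hypothetical ``timeit.timer`` context manager."""

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        # Single run only; %%timeit would repeat and keep the best time.
        self.best_time = time.perf_counter() - self._start
        return False


with timer() as t0:
    sum(range(1_000_000))
with timer() as t1:
    sum(range(1_000_000))

speedup = t0.best_time / t1.best_time
```

Because the timer object survives the with block, timings from two blocks can be compared programmatically, which is exactly the convenience the magic doesn't offer.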

@jni (Member) commented Jul 29, 2019

I totally agree with you about the defaults not being in upstream, btw

@sciunto (Member) commented Jul 29, 2019

A short-term solution would be to move this into our tutorials and open a discussion for the inclusion of notebooks in the doc (I would also like to see how that looks).

@sciunto (Member) commented Aug 9, 2019

@hmaarrfk What do you think?

@hmaarrfk (Member, Author) commented Aug 9, 2019

That's a good idea.

@sciunto (Member) commented Mar 7, 2020

@hmaarrfk Would you like to open a PR in the tutorial repository, so that you get the authorship of the commit?

@sciunto sciunto modified the milestones: 0.16, 0.17 Mar 7, 2020
@hmaarrfk (Member, Author) commented Mar 8, 2020

Sure

@sciunto sciunto modified the milestones: 0.17, 0.18 Sep 5, 2020
Base automatically changed from master to main February 18, 2021 18:23
@grlee77 (Contributor) commented Mar 30, 2022

I think we should add some demonstration of batch processing to our docs. #5407 and #4214 are also related. I think for the CZI grant, we did say we would provide an example related to batch processing.

I'm confused about why you have both the rst and the notebook in the PR right now.

MyST-NB may be a good option? I haven't used it yet myself, but my understanding is that you can write the notebooks in Markdown making it easier to store in the repository, view diffs, etc.

@jarrodmillman jarrodmillman modified the milestones: 0.18, 0.20 Jun 4, 2022
@lagru lagru modified the milestones: 0.20, 0.21 Oct 25, 2022
@jarrodmillman jarrodmillman modified the milestones: 0.21, 0.22 May 19, 2023
@jarrodmillman jarrodmillman modified the milestones: 0.22, 0.23 Oct 24, 2023
@jarrodmillman jarrodmillman removed this from the 0.23 milestone Apr 4, 2024