
Example batch processing #3386

Open · wants to merge 11 commits into main

Conversation

@hmaarrfk (Member) commented Sep 3, 2018

@jni I guess I was always working with large images (2048 x 2048) and saw big improvements with parallel processing.

More to come, but this is the example I'm converging towards.

Do we have a way to have these in notebook format?

Checklist

[It's fine to submit PRs which are a work in progress! But before they are merged, all PRs should provide:]

[For detailed information on these and other aspects see scikit-image contribution guidelines]

For reviewers

(Don't remove the checklist below.)

  • Check that the PR title is short, concise, and will make sense 1 year
    later.
  • Check that new functions are imported in corresponding __init__.py.
  • Check that new features, API changes, and deprecations are mentioned in
    doc/release/release_dev.rst.
  • Consider backporting the PR with @meeseeksdev backport to v0.14.x


@jni (Member) commented Sep 3, 2018

Do we have a way to have these in notebook format?

Not currently, but @matthew-brett has a library to do something of the sort: https://github.com/matthew-brett/nb2plots

I don't know whether sphinx/sphinx-gallery produces an rst intermediate from these that we could plug into, but it would be very valuable (imho) to have a "download as notebook" link on each example.

Regarding your example, (a) very interesting, but (b) I hope we find a solution for using large datasets in examples and we can make the example less artificial.

When I mentioned that I haven't got good results, it's not about the size of the images (they're big), but, maybe, about overlap. Here are two examples @mrocklin posted:

https://nbviewer.jupyter.org/gist/mrocklin/ec745d6c2a12dddddb125ef460a4da76
https://gist.github.com/mrocklin/7173bffbeca821bd14351889b8905a54

i.e. it's hard to get speedups for non-trivial tasks.

Nevertheless, your results here are surprising, because I think running ptime on a gaussian filter does not get you anywhere near a 4x speedup on 4 cores:

https://pypi.org/project/ptime/

So probably what's happening is that the rest of the pipeline is parallelising quite well...

@hmaarrfk (Member, Author) commented Sep 3, 2018

@jni, man, I definitely spent more time on this than I would have wanted to.

First, I now seem to be able to get a 2x improvement using dask, instead of the 30% improvement I was seeing earlier in 2018. I think some parts of dask have definitely improved.

That said, honestly, pickle seems like the bottleneck here. Maybe @mrocklin has already looked into this? https://matthewrocklin.com/blog/work/2018/07/23/protocols-pickle

http://nbviewer.jupyter.org/gist/hmaarrfk/b0ef570b36267a5e10c81bb0309a318c
https://gist.github.com/hmaarrfk/b0ef570b36267a5e10c81bb0309a318c

Issue on numpy here: numpy/numpy#7544

Even if parallelization were perfect, pickling and unpickling that array on the same processor takes tens of milliseconds.

Finally, copying into a contiguous C array seems to be another slowdown.
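To make that overhead concrete, here is a hypothetical micro-benchmark (not from the PR; the array size and names are my own) that times a pickle round-trip of a moderately large image stack on a single process:

```python
import pickle
import time

import numpy as np

# A hypothetical stack of 16 frames of 512x512 float64 (~32 MB).
frames = np.random.random((16, 512, 512))

start = time.perf_counter()
payload = pickle.dumps(frames, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)
elapsed = time.perf_counter() - start

# The round-trip is pure serialization overhead: no computation happened.
print(f"pickle round-trip: {elapsed * 1e3:.1f} ms "
      f"for {frames.nbytes / 1e6:.0f} MB")
```

On a typical machine this lands in the tens-of-milliseconds range discussed above, which is why inter-process schedulers that pickle arrays can eat the parallel speedup.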

@soupault (Member) commented Sep 4, 2018

but it would be very valuable (imho) to have a "download as notebook" link on each example.

@jni I'm glad to inform you that we already have this feature 🙂:
[screenshot: the "Download Jupyter notebook" link below each gallery example]

Apparently, we could also have a notebook rendering plugin for Sphinx (which I find quite fascinating):
https://nbsphinx.readthedocs.io/en/0.3.4/
https://github.com/QuantEcon/sphinxcontrib-jupyter

Having the above, should we go even further and set up a Tutorials/Examples section in our website?

Sorry for being slightly off-topic.

@jni (Member) commented Sep 4, 2018

@soupault haha so great! Tick!

Bonus points: launch it in binder. =P

Regarding tutorials, yes, absolutely, I have a medium term goal of porting the tutorials repo into a nice section in the website.

@hmaarrfk (Member, Author) commented Sep 4, 2018

How do we upload ipynb? Do we clear all outputs before posting them?

This kind of tutorial requires a machine more powerful than Travis, are the regenerated results going to suffer?

Regarding Binder, do you know how the performance might compare to the average student's laptop? Personally, I am very performance conscious.

@jni (Member) commented Sep 5, 2018

@hmaarrfk

How do we upload ipynb? Do we clear all outputs before posting them?

I presume sphinx-gallery builds the notebooks. Not sure whether the outputs are cleared or not, should be simple enough to try it (e.g. point nbviewer.jupyter.org at one of the download links).

This kind of tutorial requires a machine more powerful than Travis, are the regenerated results going to suffer?

Think small. =) The Travis machine has 2 cores, so that is sufficient to demonstrate parallel processing.

Regarding Binder, do you know how the performance might compare to the average student's laptop? Personally, I am very performance conscious.

The purpose of these links is not to get people doing their "real" scikit-image analyses on Binder or to demonstrate use with a 50GB dataset, but to demonstrate code that you could run on a 50GB dataset while being performant. If you can demonstrate that you get a 2x speedup by using dask properly on the Travis machine (where the docs are built), then you're done.
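A minimal sketch of the kind of demonstration meant here, assuming dask and scipy are available (the array sizes and parameters are illustrative, not from the PR): a gaussian filter applied per chunk with enough overlap that the result matches the unchunked filter, runnable with the threaded scheduler on as few as 2 cores.

```python
import dask.array as da
import numpy as np
from scipy import ndimage as ndi

image = np.random.random((1024, 1024))
chunked = da.from_array(image, chunks=(512, 512))

# depth must cover the kernel radius: sigma=2 with scipy's default
# truncate=4.0 gives a radius of 8 pixels.
smoothed = chunked.map_overlap(
    ndi.gaussian_filter, depth=8, boundary="reflect", sigma=2
)
result = smoothed.compute(scheduler="threads")
```

Away from the image border, this agrees with `ndi.gaussian_filter(image, sigma=2)` applied to the whole array, so the same pattern scales to arrays that do not fit in memory.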

@hmaarrfk (Member, Author) commented Sep 5, 2018

Think small. =) The Travis machine has 2 cores, so that is sufficient to demonstrate parallel processing.

I guess many people are on 13 inch machines that only have 2 cores too.

If you can demonstrate that you get a 2x speedup by using dask properly on the Travis machine (where the docs are built), then you're done.

My fear is that Travis is so unpredictable that running anything close to a benchmark on it is really difficult.

@soupault (Member) commented Sep 5, 2018

I presume sphinx-gallery builds the notebooks.

From the first paragraph following one of the above URLs: Un-evaluated notebooks – i.e. notebooks without stored output cells – will be automatically executed during the Sphinx build process.

@hmaarrfk (Member, Author) commented Sep 5, 2018

Thanks @soupault 😅

@jni, with @mrocklin's help, I reran the notebook you linked me to and can now get a 3.25× speedup on the example dataset using an experimental branch of dask.
On my 4-core machine, it takes 0.87 s instead of 2.8 s.

The two main optimizations are:

  1. when operating on chunked datasets, calling y.persist() is probably much better than y.compute() since it keeps the final result as a dask blocked array. This avoids a single copy of the large array at the end of the computation. The resulting array remains in a blocked form (in memory). If you need slices in [y, x], you can make that specific slice contiguous, which is much cheaper than making the whole array contiguous.
  2. The experimental branch of dask (Trim no expand dask/dask#3950) removed several unnecessary memory copies. For example, the length of the computation graph used to be 109; it is now 33. The removed operations were associated with padding the data before eventually re-slicing it. I think the non-optimized version was copying the array over as many as 3 times, and that this was a major holdup.
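A small sketch of point 1, assuming dask is installed (the shapes here are arbitrary): persist() keeps the computed result as a blocked dask array, so only the slice you actually ask for is ever gathered into a contiguous numpy array.

```python
import dask.array as da

x = da.random.random((4096, 4096), chunks=(1024, 1024))

# persist() triggers the computation but keeps the result block-by-block
# in memory, avoiding one full-array copy at the end.
y = (x * 2).persist()

# Only this 256x256 window is copied into a contiguous numpy array.
window = y[:256, :256].compute()
```

Calling `y.compute()` instead would concatenate every block into one big contiguous array up front, which is the extra copy the comment above is trying to avoid.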

There is also something to be said about how the data is stored on disk (or SSD).
I don't think rechunking + overlapping is free, but maybe it can be if the data is stored more optimally. I'm no expert in how to store data, so I'll defer that discussion to a later day. Making use of dask's optimized loading/storing subroutines might give you that extra 0.5×.

@emmanuelle (Member) left a comment

As mentioned by @soupault, sphinx-gallery provides links to download the notebooks. The example as it is now does not comply with the format required by sphinx-gallery (e.g. an rst title; see the longer examples in the xx_applications section of the gallery).

This tutorial is very interesting, but at the moment I feel there is a bit too much information. Would it be possible to make it shorter and simpler? Also, to have a meaningful thumbnail for the gallery, would it be possible to plot (for example) the execution times for the two methods as you change the size or number of images?

Review threads on doc/examples/batch_processing/dask_batch_processing.py (outdated, resolved)
@@ -4,3 +4,4 @@ sphinx>=1.3,!=1.7.8
numpydoc>=0.6
sphinx-gallery
scikit-learn
dask[delayed]
(Member) commented:

scikit-image depends on dask, so I don't think we need dask in docs requirements?

(Member, Author) replied:

I think the delayed extra is different from the array extra, though they both depend on toolz, so I guess we satisfy the requirements for dask[delayed] today.
https://github.com/dask/dask/blob/master/setup.py#L16

@emmanuelle (Member) commented:

Binder badges can be created by activating an option in sphinx-gallery, see https://sphinx-gallery.github.io/configuration.html?highlight=binder#binder-links

@emmanuelle (Member) commented:

Out of curiosity, have you tried comparing with a third method, i.e. a more explicit embarrassingly parallel loop (e.g. with joblib.Parallel, with either the loky or the threading backend; https://joblib.readthedocs.io/en/latest/parallel.html)?
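For reference, such a loop might look like the following sketch (joblib and the frame sizes are my assumptions, not part of this PR); the threading backend shares memory, so the arrays are never pickled, and scipy's filters do most of their work outside the GIL:

```python
import numpy as np
from joblib import Parallel, delayed
from scipy import ndimage as ndi

# Hypothetical batch of 8 frames standing in for files on disk.
frames = [np.random.random((256, 256)) for _ in range(8)]

# n_jobs=2 matches the 2-core CI machines discussed above.
smoothed = Parallel(n_jobs=2, backend="threading")(
    delayed(ndi.gaussian_filter)(frame, sigma=2) for frame in frames
)
```

Swapping `backend="threading"` for the default loky backend would run separate processes instead, at the cost of pickling each frame across the process boundary.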

@hmaarrfk (Member, Author) commented Sep 5, 2018

Thank you for your comments @emmanuelle. Much of this has been rewritten in notebook format while I was on a plane. Nothing better to do than benchmark, right?

The version here is rather long and obfuscated by function calls (I hate encapsulating simple tasks in functions). The notebook removes much of this. I should really try joblib, but I just haven't had the time to play with it personally.

For image processing, I think learning a tool like dask might be a better long term investment even if it chokes when you give it 50000+ tasks.

It should be pretty easy to rewrite this example for joblib, though I've learned the devil is always in the details.

I would like to create a graph of the results, though I don't want to make it harder to read the notebook.

I think it is much easier for beginners to read

N_files = 200
images = []
for i in range(N_files):
    images.append(imread(i))

than

N_files = 200
def read_images(N_files):
    images = []
    for i in range(N_files):
        images.append(imread(i))
    return images
images = read_images(N_files)

Where the latter has been optimized for making the plot of the results.
Finally, the second version seems much more intimidating to modify for your own use, at least for my taste.

Is there a way to programmatically run cells from another cell in jupyter?

@jni (Member) commented Sep 6, 2018

@hmaarrfk Wow! That's super awesome! =) Fantastic work on this issue.

I think it is much easier for beginners to read

This depends on context. As you wrote it, of course, but if you define read_images elsewhere, this is easier:

N_files = 200
images = read_images(N_files)

This is a bad example because it's not at all obvious what read_images is doing with the integer input, but for many functions, it's very readable to abstract a task away.

Is there a way to programmatically run cells from another cell in jupyter?

Why would you want to do such a thing?

@codecov-io commented Sep 6, 2018

Codecov report: no coverage uploaded for pull request base (master@ffdb7cc); diff coverage n/a. Last update ffdb7cc...a7be184.

@hmaarrfk (Member, Author) commented Sep 6, 2018

I mostly don't want to change the way the cells are laid out while running the benchmark programmatically to create the figure:
https://github.com/hmaarrfk/scikit-image/blob/example_batch_processing/doc/examples/batch_processing/batch_processing_with_dask.ipynb

I like not having too many levels of indentation.

@jni (Member) commented Sep 6, 2018

@hmaarrfk very nice guide!

You can use nb2plot to convert it to rst. Style-wise, I would say change stuff like "the author thinks" etc to "we think". Once it gets merged, at least two members of the scikit-image team must have endorsed your message. =)

Rather than put this in the gallery, I would put it as a separate page in the getting started section, similar to "image data types and what they mean" and "a guide on numpy for images".

@sciunto (Member) commented Sep 12, 2018

This should go in doc/source/user_guide/tutorial_parallelization.rst

@soupault (Member) commented:

@hmaarrfk @jni just for you guys to see how notebooks are automatically integrated into the Sphinx-powered documentation via nbsphinx plugin - https://mipt-oulu.github.io/solt/Helen_faces.html .

@hmaarrfk (Member, Author) commented:

WIP: Still can't generate the docs.

@sciunto sciunto added 📄 type: Documentation Updates, fixes and additions to documentation 💪 Work in progress labels Oct 4, 2018
@sciunto (Member) commented Oct 9, 2018

@hmaarrfk I successfully built the doc on my machine from your PR. What's not working for you? The rendering of build/html/user_guide/tutorial_parallelization_dask.html looks correct (maybe some missing syntax coloring...)

"outputs": [],
"source": [
"%%time\n",
"# Save\n",
(Member) commented:

I would remove this command and put a proper sentence instead.

@hmaarrfk (Member, Author) commented Jul 7, 2019

Anyway, I got rid of the magic commands. @jni, I would appreciate help turning this into just some python code that gets converted to html. nb2plots doesn't seem to execute the code on my local copy of the scripts.

@jni (Member) commented Jul 13, 2019

@jakirkham here is the example

@hmaarrfk (Member, Author) commented:

@jni, any idea on how to move forward with this PR? We are almost hitting the 1 year mark on it.

@jni (Member) commented Jul 29, 2019

@hmaarrfk to paraphrase @jakirkham from a different context: "1 year? That's not so bad..." 😛

I'm confused about why you have both the rst and the notebook in the PR right now.

How does the ipython directive work? Can that get executed? imho we can just have pre-executed outputs and call it a day. And my absolute preference would be to use a timeit-based context manager, although that does not currently seem to exist... =\ But I'm happy with the generated rst, as long as we remove all the ipynb stuff.

If we want to have notebooks in the docs, that's a major discussion and I expect it will take a long time to resolve.

@hmaarrfk (Member, Author) commented:

I have both options because I'm not going to force push and delete my only copy of this notebook.

Pre-executed outputs are less than ideal, since code gets stale and becomes non-functional; I always thought executing the examples was the coolest thing about the scikit-image docs. I could call timeit.timer() myself, tic-toc style, but that basically encourages bad benchmarking.

I'm not too sure how the ipython directive works. I think it calls timeit with well-chosen settings (why those aren't upstream is a little beyond me). I can look into it. But again, remember that no beginner will ever use timeit directly. %%timeit is just too cool!!!

@jni (Member) commented Jul 29, 2019

Actually, if with timeit.timer() as t: worked the same as %%timeit, I would use that all the time in preference to the magic (in particular because it's handy to have access to the timer, e.g. to programmatically compare timings and write speedup = t1.best_time / t0.best_time).
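No such context manager exists in timeit today, but a minimal single-run version is easy to sketch (the `timer` name and `best_time` attribute mirror the hypothetical API above; a real %%timeit replacement would also need repeats and loop calibration):

```python
import time


class timer:
    """Minimal stand-in for a hypothetical ``timeit.timer`` context manager."""

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        # Single run only; %%timeit would repeat and keep the best time.
        self.best_time = time.perf_counter() - self._start
        return False


with timer() as t0:
    sum(range(1_000_000))
with timer() as t1:
    sum(range(1_000_000))

speedup = t0.best_time / t1.best_time
```

Because the timer object survives the with block, timings from two blocks can be compared programmatically, which is exactly the convenience the magic doesn't offer.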

@jni (Member) commented Jul 29, 2019

I totally agree with you about the defaults not being in upstream, btw

@sciunto (Member) commented Jul 29, 2019

A short-term solution would be to move this into our tutorials and open a discussion for the inclusion of notebooks in the doc (I would also like to see how that looks).

@sciunto (Member) commented Aug 9, 2019

@hmaarrfk What do you think?

@hmaarrfk (Member, Author) commented Aug 9, 2019

That's a good idea.

@sciunto (Member) commented Mar 7, 2020

@hmaarrfk Would you like to open a PR in the tutorial repository, so that you get the authorship of the commit?

@sciunto sciunto modified the milestones: 0.16, 0.17 Mar 7, 2020
@hmaarrfk (Member, Author) commented Mar 8, 2020

Sure

@sciunto sciunto modified the milestones: 0.17, 0.18 Sep 5, 2020
Base automatically changed from master to main February 18, 2021 18:23
@grlee77 (Contributor) commented Mar 30, 2022

I think we should add some demonstration of batch processing to our docs. #5407 and #4214 are also related. I think for the CZI grant, we did say we would provide an example related to batch processing.

I'm confused about why you have both the rst and the notebook in the PR right now.

MyST-NB may be a good option? I haven't used it yet myself, but my understanding is that you can write the notebooks in Markdown making it easier to store in the repository, view diffs, etc.

@jarrodmillman jarrodmillman modified the milestones: 0.18, 0.20 Jun 4, 2022
@lagru lagru modified the milestones: 0.20, 0.21 Oct 25, 2022
@jarrodmillman jarrodmillman modified the milestones: 0.21, 0.22 May 19, 2023
@jarrodmillman jarrodmillman modified the milestones: 0.22, 0.23 Oct 24, 2023
@jarrodmillman jarrodmillman removed this from the 0.23 milestone Apr 4, 2024