Example batch processing #3386
base: main
Conversation
Not currently, but @matthew-brett has a library that does something of the sort: https://github.com/matthew-brett/nb2plots. I don't know whether sphinx/sphinx-gallery produces an rst intermediate from these that we could plug into, but it would be very valuable (imho) to have a "download as notebook" link on each example. Regarding your example: (a) very interesting, but (b) I hope we find a solution for using large datasets in examples so we can make the example less artificial. When I mentioned that I hadn't got good results, it wasn't about the size of the images (they're big), but maybe about overlap. Here are two examples @mrocklin posted: https://nbviewer.jupyter.org/gist/mrocklin/ec745d6c2a12dddddb125ef460a4da76, i.e. it's hard to get speedups for non-trivial tasks. Nevertheless, your results here are surprising, because I think running ptime on a Gaussian filter does not get you anywhere near a 4x speedup on 4 cores: https://pypi.org/project/ptime/. So probably what's happening is that the rest of the pipeline is parallelising quite well...
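For context, the dask-based batch pattern discussed in this thread can be sketched minimally as follows. This is an editor's illustration, not code from the PR: the per-image step and array sizes are placeholders, and it assumes `dask` and `numpy` are installed.

```python
import numpy as np
import dask


def process(image):
    # Stand-in for a real per-image step (e.g. a filter); pure NumPy here.
    return image - image.mean()


# Stand-ins for images loaded from disk (e.g. via skimage.io.imread).
images = [np.ones((64, 64)) * i for i in range(8)]

# Build the task graph lazily, then run it on the threaded scheduler.
lazy = [dask.delayed(process)(im) for im in images]
results = dask.compute(*lazy, scheduler="threads")
```

Whether this yields a real speedup depends on whether the per-image step releases the GIL, which is exactly the point being debated above.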
@jni, man, I definitely spent more time on this than I would have wanted to. First, I seem to be able to get a 2x improvement, instead of the 30% improvement I got using dask in 2018, so some parts of dask have definitely improved. That said, honestly, pickle seems to be the bottleneck here. Maybe @mrocklin has already looked into this? https://matthewrocklin.com/blog/work/2018/07/23/protocols-pickle http://nbviewer.jupyter.org/gist/hmaarrfk/b0ef570b36267a5e10c81bb0309a318c Issue on numpy here: numpy/numpy#7544. Even if parallelization is perfect, pickling and unpickling that array on the same processor takes tens of milliseconds. Finally, copying into a contiguous C array seems to be another slowdown.
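The pickle cost mentioned above is easy to measure directly. A rough stdlib-plus-numpy illustration (the array size is arbitrary, chosen to be around 32 MB of float64):

```python
import pickle
import time

import numpy as np

# ~32 MB of float64; big enough for serialization cost to be visible.
arr = np.random.random((2048, 2048))

start = time.perf_counter()
payload = pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL)
roundtrip = pickle.loads(payload)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"pickle round-trip: {elapsed_ms:.1f} ms")
```

On typical hardware this round-trip takes on the order of tens of milliseconds, which is pure overhead when workers live in separate processes.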
@jni I'm glad to inform you that we already have this feature 🙂: Apparently, we could also have a notebook rendering plugin for Sphinx (which I find quite fascinating): Having the above, should we go even further and set up a Sorry for being slightly off-topic.
@soupault haha so great! Tick! Bonus points: launch it in binder. =P Regarding tutorials, yes, absolutely, I have a medium-term goal of porting the tutorials repo into a nice section in the website.
How do we upload ipynb files? Do we clear all outputs before posting them? This kind of tutorial requires a machine more powerful than Travis; will the regenerated results suffer? Regarding Binder, do you know how its performance might compare to the average student's laptop? Personally, I am very performance conscious.
I presume sphinx-gallery builds the notebooks. Not sure whether the outputs are cleared or not; it should be simple enough to try (e.g. point nbviewer.jupyter.org at one of the download links).
Think small. =) The Travis machine has 2 cores, so that is sufficient to demonstrate parallel processing.
The purpose of these links is not to get people doing their "real" scikit-image analyses on Binder or to demonstrate use with a 50GB dataset, but to demonstrate code that you could run performantly on a 50GB dataset. If you can demonstrate that you get a 2x speedup by using dask properly on the Travis machine (where the docs are built), then you're done.
I guess many people are on 13 inch machines that only have 2 cores too.
My fear is that Travis is so unpredictable that running anything close to a benchmark on it is really difficult.
From the first paragraph following one of the above URLs:
Thanks @soupault 😅 @jni, with @mrocklin's help, I reran the notebook you linked me to and can now get a 3.25x speedup on the example dataset using an experimental branch of dask. The two main optimizations are:
There is also something to be said about how the data is stored on disk (or SSD). |
As mentioned by @soupault, sphinx-gallery provides links to download notebooks. The example as it is now does not comply with the format required by sphinx-gallery (e.g. an rst title; see the longer examples in the xx_applications section of the gallery).
This tutorial is very interesting, but at the moment I feel there is a bit too much information. Would it be possible to make it shorter and simpler? Also, to have a meaningful thumbnail for the gallery, would it be possible to plot (for example) the execution times for the two different methods as you change the size or number of images?
requirements/docs.txt
Outdated
```diff
@@ -4,3 +4,4 @@ sphinx>=1.3,!=1.7.8
 numpydoc>=0.6
 sphinx-gallery
 scikit-learn
+dask[delayed]
```
scikit-image depends on dask, so I don't think we need dask in docs requirements?
I think delayed is different from array. They both depend on toolz, though, so I guess we satisfy the requirements for dask[delayed] today.
https://github.com/dask/dask/blob/master/setup.py#L16
Binder badges can be created by activating an option in sphinx-gallery, see https://sphinx-gallery.github.io/configuration.html?highlight=binder#binder-links
Out of curiosity, have you tried comparing with a third method, i.e. a more explicit embarrassingly parallel loop (e.g. with joblib.Parallel, using either the loky or the threading backend; https://joblib.readthedocs.io/en/latest/parallel.html)?
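For readers unfamiliar with the suggestion: the embarrassingly parallel pattern looks roughly like the standard-library sketch below. This is an editor's illustration with a placeholder work function; `joblib.Parallel(n_jobs=4, backend="threading")(delayed(process)(x) for x in items)` follows the same shape with extra conveniences (batching, backend selection).

```python
from concurrent.futures import ThreadPoolExecutor


def process(x):
    # Placeholder for a per-image operation; any pure function works here.
    return x * x


items = list(range(16))

# Fan the independent tasks out across a small thread pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, items))
```

As with dask's threaded scheduler, a thread pool only helps if the work releases the GIL; otherwise a process-based backend (like joblib's loky) is needed, at the cost of the pickling overhead discussed earlier.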
Thank you for your comments @emmanuelle. Much of this has been rewritten in notebook format while I was on a plane. Nothing better to do than benchmark, right! The version here is rather long, and obfuscated by function calls (I hate encapsulating simple tasks in functions). The notebook removes much of this. I should really try joblib, but I just haven't had the time to play with it personally. For image processing, I think learning a tool like dask might be a better long-term investment, even if it chokes when you give it 50000+ tasks. It should be pretty easy to rewrite this example for joblib, though I've learned the devil is always in the details. I would like to create a graph of the results, though I don't want to make it harder to read the notebook. I think it is much easier for beginners to read

```python
N_files = 200
images = []
for i in range(N_files):
    images.append(imread(i))
```

than

```python
N_files = 200

def read_images(N_files):
    images = []
    for i in range(N_files):
        images.append(imread(i))
    return images

images = read_images(N_files)
```

where the latter has been optimized for making the plot of the results. Is there a way to programmatically run cells from another cell in Jupyter?
@hmaarrfk Wow! That's super awesome! =) Fantastic work on this issue.

This depends on context. As you wrote it, of course, but if you define

```python
N_files = 200
images = read_images(N_files)
```

This is a bad example because it's not at all obvious what

Why would you want to do such a thing?
Codecov Report

```
@@           Coverage Diff            @@
##           master    #3386   +/-  ##
=======================================
  Coverage        ?   86.82%
=======================================
  Files           ?      340
  Lines           ?    27485
  Branches        ?        0
=======================================
  Hits            ?    23865
  Misses          ?     3620
  Partials        ?        0
```

Continue to review full report at Codecov.
I mostly don't want to change the way the cells are laid out while running the benchmark programmatically to create the figure. I like not having too many levels of indentation.
@hmaarrfk very nice guide! You can use nb2plots to convert it to rst. Style-wise, I would say change stuff like "the author thinks" etc. to "we think". Once it gets merged, at least two members of the scikit-image team will have endorsed your message. =) Rather than put this in the gallery, I would put it as a separate page in the getting started section, similar to "image data types and what they mean" and "a guide on numpy for images".
This should go in
@hmaarrfk @jni just so you guys can see how notebooks are automatically integrated into the Sphinx-powered documentation via
(force-pushed from 7107e25 to 5881fbf)
WIP: Still can't generate the docs.
@hmaarrfk I successfully built the docs on my machine from your PR. What's not working for you? The rendering of build/html/user_guide/tutorial_parallelization_dask.html looks correct (maybe some syntax coloring is missing...)
"outputs": [], | ||
"source": [ | ||
"%%time\n", | ||
"# Save\n", |
I would remove this command and put a proper sentence instead.
Anyway, I got rid of the magic commands. @jni, I would appreciate help turning this into just some Python code that gets converted to HTML. nb2plots doesn't seem to execute the code on my local copy of the scripts.
@jakirkham here is the example
@jni, any idea on how to move forward with this PR? We are almost hitting the 1 year mark on it.
@hmaarrfk to paraphrase @jakirkham from a different context: "1 year? That's not so bad..." 😛 I'm confused about why you have both the rst and the notebook in the PR right now. How does the ipython directive work? Can that get executed? imho we can just have pre-executed outputs and call it a day. And my absolute preference would be to use a timeit-based context manager, although that does not currently seem to exist... =\ But I'm happy with the generated rst, as long as we remove all the ipynb stuff. If we want to have notebooks in the docs, that's a major discussion and I expect it will take a long time to resolve.
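A timeit-style context manager along the lines wished for here could be sketched as below. This is a hypothetical helper written by the editor, not an existing API in the stdlib or in scikit-image; `timeit`'s adaptive repeat/number logic is deliberately omitted for brevity.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label="block"):
    # Measure wall-clock time for the enclosed block and print it on exit.
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.4f} s")


with timed("sum"):
    total = sum(range(1_000_000))
```

Unlike `%%time` cell magic, such a context manager works in plain Python scripts, which matters if the example is converted from a notebook to executable rst.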
I have both options because I'm not going to force push and delete my only copy of this notebook. Pre-executed outputs are less than ideal since code gets stale and becomes non-functional; I always thought that was the coolest thing about the docs of scikit-image. I could call I'm not too sure how the ipython directive works. I think it calls timeit with well-chosen settings (why those aren't upstream is a little beyond me). I can look into it. But again, remember that no beginner will ever use timeit directly.
Actually, if
I totally agree with you about the defaults not being in upstream, btw |
A short-term solution would be to move this into our tutorials and open a discussion about including notebooks in the docs (I would also like to see what that looks like)
@hmaarrfk What do you think? |
That's a good idea. |
@hmaarrfk Would you like to open a PR in the tutorial repository, so that you get the authorship of the commit? |
Sure |
I think we should add some demonstration of batch processing to our docs. #5407 and #4214 are also related. I think for the CZI grant, we did say we would provide an example related to batch processing.
MyST-NB may be a good option? I haven't used it yet myself, but my understanding is that you can write the notebooks in Markdown, making them easier to store in the repository, view diffs for, etc.
@jni I guess I was always working with large images (2048 x 2048) and saw big improvements with parallel processing.
More to come, but this is the example I'm converging towards.
Do we have a way to have these in notebook format?