Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor (WIP) #237

Closed
wants to merge 52 commits into from
Closed

Refactor (WIP) #237

wants to merge 52 commits into from

Conversation

freeman-lab
Copy link
Member

This is a huge refactoring of Thunder, and will the basis of an upcoming new release. We'd normally break it up into multiple PRs, but this touches so much of the code base that it was easier to do all at once.

There are three primary goals, based on a year of community experience and feedback, and consideration of the current ecosystem:

  1. Loosen the dependency on Spark. This is a big one. Many superficial issues, including installation issues, complexity for new users and contributors, etc are due to Thunder's hard dependence on Spark. We will definitely continue to support Spark, we also want to enable work seamlessly across local and distributed environments, and against a variety of execution engines, including Spark but also new libraries like Dask. This PR begins that effort through some fundemental but neccessary refactoring.
  2. Modularize the components. Thunder has started absorbing a wide variety of algorithms / analyses, especially with recent additions to image registration and spatiotemporal source extraction. These components are at different levels of maturity and specificity, and are better off as pluggable, composable pieces living in separate repos.
  3. Modernize the codebase, and make more friendly to the Python ecosystem, in particular by ensuring Python 3 compatibility, using py.test for unit tests, and Pythonic naming conventions.

refactoring

  • develop global context manager for backend
  • refactor data reading / writing
  • update reading / writing tests
  • remove executables
  • remove standalone scripts
  • use S3 for external data
  • use py.test for unit tests
  • update documentation
  • make python 3 compatible
  • use snakecase

new packages (inside thunder-project)

  • rime - source extraction
  • sleet - image registration
  • thundercloud - manage cluster on ec2

new packages (external)

  • station - context manager for distributed backends
  • checkist - minimal argument checking
  • showit - simple display of images and tiled images
  • serdeme - custom class serialization/deserialization

@freeman-lab freeman-lab changed the title Starting refactor (WIP) Refactor (WIP) Nov 15, 2015
@d-v-b
Copy link
Contributor

d-v-b commented Nov 15, 2015

Looks great so far. I'm curious to hear more about how Thunder will abstract over parallelization engines.

Regarding the new packages, would usage be something like from thunder.sleet import fancy_registration?

@freeman-lab
Copy link
Member Author

Thanks @d-v-b !

For engine switching, there's a new user-facing global context for switching between backends, and internally we condition on its state (it mainly only matters during loading).

For example, you'll be able to load using spark with

import thunder
thunder.setup(spark=True)
data = thunder.series.fromBinary('path')

and load locally with

import thunder
thunder.setup()
data = thunder.series.fromBinary('path')

or because local is the default you can just do

import thunder
data = thunder.series.fromBinary('path')

In all cases the returned object data will (eventually) know how to talk local or distributed. As of now only the Spark paths will work, but the scaffolding is there to support others.

@freeman-lab
Copy link
Member Author

Oh and I haven't decided yet about the external libraries, could alias them as you suggest in which case thunder acts a meta-package (on top of its core functionality), or just do manual imports as in

from rime.algorithms import NMF
from sleet.algorithms import CrossCorr

open to suggestions here!

@d-v-b
Copy link
Contributor

d-v-b commented Nov 16, 2015

Regarding the external libraries, I'm not sure what would be best. If you envision people using the registration and source extraction methods outside of Thunder, then separate libs makes sense. Is this your intention?

In the other case, I don't see a big difference between
from thunder.registration import registration_method
and
from thunder.sleet.algorithms import registration_method

@freeman-lab
Copy link
Member Author

@d-v-b we'll get the "using it outside Thunder" thing regardless, that's the nice thing about making them separate, assuming we structure them correctly 👍 the aliasing is just sugar in case people forget the names. I'll look at how some other projects do it...

@freeman-lab
Copy link
Member Author

@d-v-b awesome suggestion! Didn't realize it was included with skimage, that's perfect. Just made the change for reading, seems to work great. For writing, how do you think we should handle dimensions and bit-depth? After playing with the options, here's a rough proposal:

  • assume all channels are luminance channels, so an (x, y) image will be saved as a single grayscale image and an (x, y, n) image will always be n grayscale image pages, as opposed to treating it as RGB or RGBA in the special case of n == 3 or n == 4 (i'd rather just add proper support for color channels down the road)
  • for both tif and png, make bit-depth a parameter (probably 8 for png, and 8/16/32 for tif), scale the values between a min and a max (an optional parameter), and convert to the appropriate numerical type (uint8, uint16, uint32) before writing.

@freeman-lab
Copy link
Member Author

And while I'm renaming... the current plan is as follows:

for two-word names where the first word is four characters or less, make it one word, e.g.

  • thunder.images.frompng
  • thunder.series.fromarray
  • data.topng
  • data.tobinary

for names where the first word is longer than four characters, or there are more than two words, use snake case, e.g.

  • data.apply_values
  • data.filter_keys
  • data.group_by_window

Exceptions would only be to ensure consistency with closely associated Python libraries (e.g. numpy).

Let me know if anyone disagrees!

@d-v-b
Copy link
Contributor

d-v-b commented Nov 27, 2015

@freeman-lab
I agree that 3d arrays should be saved as stacks. How would a user save a 3D array as an RGB image under this regime? Perhaps by making a 4D array with dims = [t, z, y, x] where z is singleton? It's not critical -- in the worst case, the user can fall back to a custom implementation of tifffile.py.

Regarding bit-depth and data type, we should be sure to allow everything that tifs can hold (or everything fiji can read...) -- I think tifs can contain 32-bit floats, and for fractional data (e.g. dff timeseries) this is pretty critical. Also I'm fairly sure signed integer types are allowed, so the may need to be nuanced. Numpy allows arrays with dtype float16 and float64, neither of which can be read by fiji, although iirc I've saved tifs with these data before.

Perhaps for recasting data, there could be some kwargs to specify how this should be done, e.g.
data.totif(fname, dtype='uint16', rescale=True) if the user wants the data to be cast as int and rescaled to fill the bit depth. Rescale should probably default to False since the default, non-rescaling casting operation is what users should expect. Sorry to be predictably pedantic about dtypes 😄

@freeman-lab
Copy link
Member Author

Thanks @d-v-b , I was counting on your pedantry 👍

  • re: RGB, we could for now just add a rgb=True flag that treats 3D images as RGB
  • re: bit-depth, i like the rescale=True idea, we can put off deciding the default but I'm fine with making it false for now (which puts the oness on the user to do the right thing). The only thing will be the png handling will be different from tif, because as far as I can tell the most sane, dependency-free png saving method is scipy.misc.imsave, which itself does scaling, so there's no way to skip it.

@d-v-b
Copy link
Contributor

d-v-b commented Nov 28, 2015

@freeman-lab,
I completely support the renaming! Just expect confusion from lots of users who aren't active on github / gitter after the update...

@freeman-lab
Copy link
Member Author

@d-v-b 👍 careful documentation and explanation of changes (esp. breaking ones!) will be a high priority as soon as this is done

@auvipy
Copy link

auvipy commented Dec 8, 2015

thunder and scikit-image integration?

@freeman-lab
Copy link
Member Author

closing in favor of #243

@freeman-lab freeman-lab closed this Jan 8, 2016
@freeman-lab freeman-lab deleted the refactor branch April 5, 2016 22:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants