We are pleased to announce the release of Thunder 0.4.0.
This release introduces some major API changes, especially around loading and converting data types. It also brings some substantial updates to the documentation and tutorials, and better support for data sets stored on Amazon S3. While some big changes have been made, we feel that this new architecture provides a more solid foundation for the project, better supporting existing use cases, and encouraging contributions. Please read on for more!
- Data representation. Most data in Thunder now exists as subclasses of the new thunder.rdds.Data object. This wraps a PySpark RDD and provides several general convenience methods. Users will typically interact with two main subclasses of data, thunder.rdds.Images and thunder.rdds.Series, representing spatially- and temporally-oriented data sets, respectively. A common workflow will be to load image data into an Images object and then convert it to a Series object for further analysis, or just to convert
- Loading data. The main entry point for most users remains the thunder.utils.context.ThunderContext object, available in the interactive shell as tsc, but this class has many new, expanded, or renamed methods, in particular convertImagesToSeries(). Please see the Thunder Context tutorial and the API documentation for more examples and detail.
- New methods for manipulating and processing images and series data, including refactored versions of some earlier analyses (e.g. routines from the package previously known as
- Documentation has been expanded, and new tutorials have been added.
- Core API components are now exposed at the top level for simpler importing, e.g. from thunder import Series or from thunder import ICA.
- Improved support for loading image data directly from Amazon S3, using the boto AWS client library. The load* methods in ThunderContext now all support s3n:// scheme URIs as data path specifiers.
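To make the difference between the spatially- and temporally-oriented representations concrete, here is a minimal conceptual sketch in plain Python. It is not Thunder's implementation and uses none of its API: the function name images_to_series and the sample data are hypothetical, and ordinary lists stand in for the distributed records that Thunder keeps in a PySpark RDD. It assumes image-style records are keyed by time point and series-style records are keyed by pixel coordinate.

```python
# Conceptual sketch only: plain Python lists stand in for distributed records.
# None of these names are Thunder APIs.

def images_to_series(images):
    """Regroup (time, 2D image) records into ((row, col), time series) records."""
    # Sort by time index so every series comes out in temporal order.
    ordered = sorted(images, key=lambda record: record[0])
    series = {}
    for _, image in ordered:
        for row, pixels in enumerate(image):
            for col, value in enumerate(pixels):
                # Each pixel coordinate accumulates one value per time point.
                series.setdefault((row, col), []).append(value)
    return sorted(series.items())


# Three 2x2 "images", one per time point (hypothetical data).
images = [
    (0, [[1, 2], [3, 4]]),
    (1, [[5, 6], [7, 8]]),
    (2, [[9, 10], [11, 12]]),
]

series = images_to_series(images)
for key, values in series:
    print(key, values)
```

Each output record pairs one pixel coordinate with its values across all time points, which is the shape that per-pixel time series analyses work on.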
Notes about requirements and environments
- Spark 1.1.0 is required. Most functionality will be intact with earlier versions of Spark, with the exception of loading flat binary data.
- “Hadoop 1” jars as packaged with Spark are recommended, but Thunder should work fine if recompiled against the CDH4, CDH5, or “Hadoop 2” builds.
- Python 2 is required, version 2.6 or greater.
- PIL/pillow libraries are used to handle TIFF images. We have encountered some issues working with these libraries, particularly on OS X 10.9. Some errors related to image loading may be traceable to a broken PIL/pillow installation.
- This release has been tested most extensively in three environments: local usage, a private research compute cluster, and Amazon EC2 clusters stood up using the thunder-ec2 script packaged with the distribution.
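As a quick way to check the Python requirement listed above before installing, the following sketch inspects the running interpreter. The helper meets_python_requirement is a hypothetical name used for illustration and is not part of Thunder; it covers only the Python version, not Spark, Hadoop, or PIL/pillow.

```python
import sys


# Hypothetical helper, not part of Thunder: True when the given version tuple
# is a Python 2 release at 2.6 or later.
def meets_python_requirement(version_info):
    major, minor = version_info[0], version_info[1]
    return major == 2 and minor >= 6


if not meets_python_requirement(sys.version_info):
    print("Thunder 0.4.0 expects Python 2.6 or a later 2.x release; found %d.%d"
          % (sys.version_info[0], sys.version_info[1]))
```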
Thunder is still young, and will continue to grow. Now is a great time to get involved! While we will try to minimize changes to the API, it should not yet be considered stable, and may change in upcoming releases. That said, if you are using or contemplating using Thunder in a production environment, please reach out and let us know what you’re working on, or post to the mailing list.