Spark Alchemy is a collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive in our demanding petabyte-scale environment with rich data (thousands of columns).
Add the following to your libraryDependencies in SBT:

    resolvers += Resolver.bintrayRepo("swoop-inc", "maven")
    libraryDependencies += "com.swoop" %% "spark-alchemy" % "<version>"
You can find all released versions here.
For Spark users
For Spark framework developers
- Helpers for native function registration
Configuration Addressable Production (CAP), Automatic Lifecycle Management (ALM) and Just-in-time Dependency Resolution (JDR) as outlined in our Spark+AI Summit talk Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments.
Hundreds of productivity-enhancing extensions to the core user-level data types:
Data discovery and cleansing tools we use to ingest and clean up large amounts of dirty data from third parties.
Cross-cluster named lock manager, which simplifies data production by removing the need for workflow servers much of the time.
Versioned data source, which allows a new version to be written while the current version is being read.
case class code generation from Spark schema, with easy implementation customization.
Tools for deploying Spark ML pipelines to production.
Lots more, as we are constantly building up our internal toolset.
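To make the versioned data source idea above more concrete, here is a minimal sketch of one common way to let readers keep using the current version while a new one is written: each version gets its own directory, and a small pointer file is swapped once the write completes. This is not Swoop's implementation. `VersionedDir`, the `CURRENT` pointer file, and the `publish` method are hypothetical names chosen for illustration, and the sketch assumes a local filesystem where an atomic rename is available (object stores such as S3 would need a different mechanism).

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical helper, not part of spark-alchemy: each version lives in its own
// directory and a small CURRENT file names the version readers should use.
final class VersionedDir(root: Path) {
  private val pointer = root.resolve("CURRENT")

  /** Name of the published version, if any. */
  def current: Option[String] =
    if (Files.exists(pointer)) Some(new String(Files.readAllBytes(pointer), UTF_8).trim)
    else None

  /** Directory of the published version, if any. */
  def currentPath: Option[Path] = current.map(v => root.resolve(v))

  /** Writes a new version next to the existing ones, then publishes it. */
  def publish(version: String)(write: Path => Unit): Unit = {
    val dir = Files.createDirectories(root.resolve(version))
    write(dir) // readers still resolve the previous version while this runs
    val tmp = Files.createTempFile(root, "CURRENT", ".tmp")
    Files.write(tmp, version.getBytes(UTF_8))
    // Swap the pointer in a single step; atomicity depends on the filesystem.
    Files.move(tmp, pointer, StandardCopyOption.ATOMIC_MOVE)
  }
}

object VersionedDirDemo extends App {
  val store = new VersionedDir(Files.createTempDirectory("versioned"))
  store.publish("v1")(dir => Files.write(dir.resolve("data.txt"), "a".getBytes(UTF_8)))
  store.publish("v2") { dir =>
    // While v2 is being written, readers are still pointed at v1.
    assert(store.current.contains("v1"))
    Files.write(dir.resolve("data.txt"), "b".getBytes(UTF_8))
  }
  assert(store.current.contains("v2"))
}
```

In a Spark job, the `write` callback would be where a dataset is saved under the new version's path, and readers would load from `currentPath`. Old versions are left in place in this sketch, so cleanup is a separate concern.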
More from Swoop
- spark-records: bulletproof Spark jobs with fast root cause analysis in the case of failures
Community & contributing
Contributions and feedback of any kind are welcome. Please create an issue and/or a pull request.
Spark Alchemy is maintained by the team at Swoop. If you'd like to contribute to our open-source efforts, by joining our team or from your company, let us know at spark-interest at swoop dot com.
spark-alchemy is Copyright © 2018 Swoop, Inc. It is free software, and may be redistributed under the terms of the LICENSE.