
spark-alchemy

Spark Alchemy is a collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive in our demanding petabyte-scale environment with rich data (thousands of columns).

Installation

Add the following to your libraryDependencies in SBT:

resolvers += Resolver.bintrayRepo("swoop-inc", "maven")

libraryDependencies += "com.swoop" %% "spark-alchemy" % "<version>"

You can find all released versions here.

For Spark users

  • Native HyperLogLog functions that offer fast, reaggregatable approximate distinct counting, with capabilities far beyond those in OSS Spark and interoperability with Postgres and even JavaScript.
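
To make "reaggregatable" concrete, here is a minimal, self-contained sketch of the idea in plain Scala. It is a toy HyperLogLog written for illustration only, not spark-alchemy's implementation or API: the `ToyHll` object, its method names, the precision of 10 bits, and the sample user ids are all assumptions made for this example. The point it shows is that sketches built separately can be merged with a register-wise max, and the merged sketch is identical to one built over the combined data, so distinct counts can be rolled up without rescanning the raw rows.

```scala
import scala.util.hashing.MurmurHash3

// Toy HyperLogLog sketch for illustration only.
// This is NOT spark-alchemy's implementation; all names here are hypothetical.
object ToyHll {
  val P: Int = 10       // precision: number of hash bits used to pick a register
  val M: Int = 1 << P   // number of registers (1024)

  type Sketch = Vector[Int]
  val empty: Sketch = Vector.fill(M)(0)

  /** Adds one value: the top P hash bits select a register, the remaining bits give the rank. */
  def add(sketch: Sketch, value: String): Sketch = {
    val h = MurmurHash3.stringHash(value)
    val idx = h >>> (32 - P)
    // Rank = position of the first 1-bit in the remaining (32 - P) bits, capped when they are all zero.
    val rank = math.min(Integer.numberOfLeadingZeros(h << P), 32 - P) + 1
    if (rank > sketch(idx)) sketch.updated(idx, rank) else sketch
  }

  /** Builds a sketch from a collection of values. */
  def of(values: Iterable[String]): Sketch = values.foldLeft(empty)(add)

  /** Reaggregation step: merging two sketches is a register-wise max. */
  def merge(a: Sketch, b: Sketch): Sketch =
    a.zip(b).map { case (x, y) => math.max(x, y) }

  /** Standard HyperLogLog estimate with the small-range (linear counting) correction. */
  def estimate(sketch: Sketch): Double = {
    val alpha = 0.7213 / (1.0 + 1.079 / M)
    val raw = alpha * M * M / sketch.map(r => math.pow(2.0, -r)).sum
    val zeros = sketch.count(_ == 0)
    if (raw <= 2.5 * M && zeros > 0) M * math.log(M.toDouble / zeros) else raw
  }

  def main(args: Array[String]): Unit = {
    // Two overlapping days of made-up user ids; the exact union has 7,500 distinct ids.
    val monday = of((0 until 5000).map(i => s"user-$i"))
    val tuesday = of((2500 until 7500).map(i => s"user-$i"))
    val both = merge(monday, tuesday)
    println(f"monday ~ ${estimate(monday)}%.0f, tuesday ~ ${estimate(tuesday)}%.0f, combined ~ ${estimate(both)}%.0f")
  }
}
```

Because merging only takes a maximum per register, it is associative, commutative, and idempotent. That is what lets pre-aggregated sketches (for example, one per day) be combined later into weekly or monthly distinct counts.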

For Spark framework developers

What's coming

  • Configuration Addressable Production (CAP), Automatic Lifecycle Management (ALM) and Just-in-time Dependency Resolution (JDR) as outlined in our Spark+AI Summit talk Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments.

  • Hundreds of productivity-enhancing extensions to the core user-level data types: Column, Dataset, SparkSession, etc.

  • Data discovery and cleansing tools we use to ingest and clean up large amounts of dirty data from third parties.

  • Cross-cluster named lock manager, which simplifies data production by removing the need for workflow servers much of the time.

  • Versioned data source, which allows a new version to be written while the current version is being read.

  • case class code generation from Spark schema, with easy implementation customization.

  • Tools for deploying Spark ML pipelines to production.

  • Lots more, as we are constantly building up our internal toolset.

More from Swoop

  • spark-records: bulletproof Spark jobs with fast root cause analysis in the case of failures

Community & contributing

Contributions and feedback of any kind are welcome. Please create an issue or open a pull request.

Spark Alchemy is maintained by the team at Swoop. If you'd like to contribute to our open-source efforts, whether by joining our team or from your own company, let us know at spark-interest at swoop dot com.

License

spark-alchemy is Copyright © 2018 Swoop, Inc. It is free software, and may be redistributed under the terms of the LICENSE.
