spark-alchemy

Spark Alchemy is a collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive in our demanding petabyte-scale environment with rich data (thousands of columns).

Installation

Add the following resolver and library dependency to your SBT build definition (for example, build.sbt):

resolvers += Resolver.bintrayRepo("swoop-inc", "maven")

libraryDependencies += "com.swoop" %% "spark-alchemy" % "<version>"

You can find all released versions here.

For Spark users

  • Native HyperLogLog functions that offer fast, reaggregatable approximate distinct counting far beyond what OSS Spark provides, with interoperability to Postgres and even JavaScript.
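
To illustrate what "reaggregatable" means here, the following is a minimal, self-contained sketch of the HyperLogLog idea in plain Scala. It is not spark-alchemy's implementation or API: the `HllDemo` object, its function names, the precision of 12 index bits, and the MD5-based hash are all assumptions made for this example. The point it shows is that two sketches built independently can be merged with an element-wise max, and the merged sketch is identical to one built over the combined data.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Illustrative only: a toy HyperLogLog, NOT spark-alchemy's implementation.
// All names and parameters below are hypothetical.
object HllDemo {
  val P: Int = 12            // index bits (assumed precision for this sketch)
  val M: Int = 1 << P        // number of registers (4096)

  type Sketch = Vector[Int]  // one small "rank" counter per register
  val empty: Sketch = Vector.fill(M)(0)

  // 64-bit hash taken from the first 8 bytes of an MD5 digest.
  def hash64(value: String): Long =
    ByteBuffer.wrap(
      MessageDigest.getInstance("MD5").digest(value.getBytes(StandardCharsets.UTF_8))
    ).getLong

  // Route the value to a register using the top P bits of its hash and
  // keep the largest rank (leading zeros of the remaining bits, plus one).
  def add(sketch: Sketch, value: String): Sketch = {
    val h = hash64(value)
    val idx = (h >>> (64 - P)).toInt
    val rank = math.min(java.lang.Long.numberOfLeadingZeros(h << P), 64 - P) + 1
    if (rank > sketch(idx)) sketch.updated(idx, rank) else sketch
  }

  // Reaggregation: merging two sketches is an element-wise max of registers.
  def merge(a: Sketch, b: Sketch): Sketch =
    a.zip(b).map { case (x, y) => math.max(x, y) }

  // Standard HyperLogLog estimate with the small-range (linear counting) correction.
  def estimate(sketch: Sketch): Double = {
    val alpha = 0.7213 / (1.0 + 1.079 / M) // bias constant, valid for M >= 128
    val raw = alpha * M * M / sketch.map(r => math.pow(2.0, -r)).sum
    val zeros = sketch.count(_ == 0)
    if (raw <= 2.5 * M && zeros > 0) M * math.log(M.toDouble / zeros) else raw
  }
}
```

For example, a sketch per day of user IDs could be stored once and later merged into weekly or monthly sketches with `HllDemo.merge`, then passed to `HllDemo.estimate`, without rescanning the raw data. With 4096 registers, the expected relative error of such an estimate is on the order of 1.6%.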

For Spark framework developers

What's coming

  • Configuration Addressable Production (CAP), Automatic Lifecycle Management (ALM) and Just-in-time Dependency Resolution (JDR), as outlined in our Spark+AI Summit talk "Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments".

  • Hundreds of productivity-enhancing extensions to the core user-level data types: Column, Dataset, SparkSession, etc.

  • Data discovery and cleansing tools we use to ingest and clean up large amounts of dirty data from third parties.

  • Cross-cluster named lock manager, which simplifies data production by removing the need for workflow servers much of the time.

  • Versioned data source, which allows a new version to be written while the current version is being read.

  • case class code generation from Spark schema, with easy implementation customization.

  • Tools for deploying Spark ML pipelines to production.

  • Lots more, as we are constantly building up our internal toolset.

More from Swoop

  • spark-records: bulletproof Spark jobs with fast root cause analysis in the case of failures

Community & contributing

Contributions and feedback of any kind are welcome. Please create an issue and/or a pull request.

Spark Alchemy is maintained by the team at Swoop. If you'd like to contribute to our open-source efforts, by joining our team or from your company, let us know at spark-interest at swoop dot com.

License

spark-alchemy is Copyright © 2018 Swoop, Inc. It is free software, and may be redistributed under the terms of the LICENSE.