spark-hyperloglog

Algebird's HyperLogLog support for Apache Spark. This package can be used in concert with presto-hyperloglog to share HyperLogLog sets between Spark and Presto.


Example usage

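import org.apache.spark.sql.functions.expr
import com.mozilla.spark.sql.hyperloglog.aggregates._
import com.mozilla.spark.sql.hyperloglog.functions._
import sqlContext.implicits._ // provides toDF; already in scope in spark-shell

// Register the merge aggregate and the create/cardinality functions for use in SQL expressions.
val hllMerge = new HyperLogLogMerge
sqlContext.udf.register("hll_merge", hllMerge)
sqlContext.udf.register("hll_create", hllCreate _)
sqlContext.udf.register("hll_cardinality", hllCardinality _)

val frame = sc.parallelize(List("a", "b", "c", "c")).toDF("id")
// show() prints the result and returns Unit, so there is nothing useful to assign.
frame
  .select(expr("hll_create(id, 12) as hll"))
  .groupBy()
  .agg(expr("hll_cardinality(hll_merge(hll)) as count"))
  .show()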

yields:

+-----+
|count|
+-----+
|    3|
+-----+
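
Because hll_merge combines sketches, the same functions can also be applied per group and the partial results merged again later. The following is a minimal sketch of that pattern, assuming the three functions are registered under the names used in the example above; the events data and the day column are hypothetical and exist only for illustration.

```scala
import org.apache.spark.sql.functions.expr
import sqlContext.implicits._ // already in scope in spark-shell

// Hypothetical input: one row per (day, id) observation.
val events = sc.parallelize(List(
  ("2016-04-09", "a"),
  ("2016-04-09", "b"),
  ("2016-04-10", "b"),
  ("2016-04-10", "c")
)).toDF("day", "id")

// One merged sketch per day.
val daily = events
  .select($"day", expr("hll_create(id, 12) as hll"))
  .groupBy("day")
  .agg(expr("hll_merge(hll) as hll"))

// Approximate distinct ids per day.
daily.select($"day", expr("hll_cardinality(hll) as count")).show()

// Approximate distinct ids over all days, computed from the daily sketches
// without going back to the raw rows.
daily.agg(expr("hll_cardinality(hll_merge(hll)) as count")).show()
```

Keeping the merged sketch column rather than the final count is what allows the daily results to be combined afterwards. Counts derived from HyperLogLog are estimates, so they may differ from exact distinct counts on larger inputs.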

Deployment

  1. Configure your credentials for the Spark Packages repository in ~/.ivy2/.sbtcredentials, e.g.:

    realm=Spark Packages Realm
    host=spark-packages.org
    user=foo
    password=bar
    
  2. Publish a new release with sbt spPublish
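
For sbt to pick up the credentials file from step 1, the build typically has to point at it. The fragment below is a sketch of how that is commonly done in build.sbt with sbt's Credentials helper; this repository's actual build.sbt is not shown here, so treat the setting as an assumption rather than a description of the project's configuration.

```scala
// build.sbt (assumed fragment, not taken from this repository)
// Load the Spark Packages credentials written in step 1.
credentials += Credentials(Path.userHome / ".ivy2" / ".sbtcredentials")
```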