Skip to content
/ hypex Public

Fast HyperLogLog implementation for Elixir/Erlang

License

Notifications You must be signed in to change notification settings

whitfin/hypex

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Hypex

Build Status Coverage Status Hex.pm Version Documentation

Hypex is a fast HyperLogLog implementation in Elixir which provides an easy way to count unique values with a small memory footprint. This library is based on the paper documenting the algorithm written by Philippe Flajolet et al.

Installation

Hypex is available on Hex. You can install the package via:

  1. Add hypex to your list of dependencies in mix.exs:
def deps do
  [{ :hypex, "~> 1.1" }]
end
  1. Ensure hypex is started before your application:
def application do
  [applications: [:hypex]]
end

Usage

Hypex is extremely straightforward to use, you simply create a new Hypex instance and start adding values to it:

iex> hypex = Hypex.new(4)
{Hypex.Array, 4, {:array, 16, 0, 0, 100}}
iex> hypex = Hypex.update(hypex, "my term")
{Hypex.Array, 4,
 {:array, 16, 0, 0,
  {10, {0, 2, 0, 0, 0, 0, 0, 0, 0, 0}, 10, 10, 10, 10, 10, 10, 10, 10, 10}}}
iex> hypex |> Hypex.cardinality |> round
1

The 4 being passed to Hypex.new/1 is the width which determines the underlying memory structure of a Hypex instance. This value can be within the range 4 <= width <= 16, per the HyperLogLog algorithm. If you don't provide a width, it defaults to 16. Be aware that you should typically scale this number higher based upon the more unique values you expect to see.

For any other examples of how to use Hypex, please read the documentation.

Memory Optimization

As of v1.1.0, the default implementation has moved from a Bitstring to an Erlang Array. This is mainly due to Arrays performing faster on all operations when compared with Bitstrings. However in the case that you're operating in a low-memory environment (or simply want predictable memory usage), you might still wish to use the Bitstring implementation. You can do this by simply using Hypex.new(4, Bitstring) when creating a Hypex.

A rough memory estimate (in bytes) for a Bitstring Hypex can be calculated using the formula ((2 ^ width) * width) / 8 - although this will only include the memory of the registers and not the rest of the tuple structure (which should be minimal). This means that using the highest width available of 16, your memory usage will still only be 131,072 bytes.

At this point I don't know of a good way to measure the size of the Array implementation, but a rough estimate would suggest that it's probably within the range of 6-8 times more memory (if anyone can help measure, I'd appreciate it). Still, this amount of memory shouldn't pose an issue for most systems, and the throughput likely matters more to most users.

Rough Benchmarks

Below are some rough benchmarks for Hypex instances with the different underlying structures. Note that the update/2 tests are inserting a unique value - in the case a duplicate value is inserted, the operation is typically constant across widths at under 0.5 µs/op.

These tests use a width of 4, so it should be noted that larger widths will have slower performance. However, these benchmarks are for reference only and you should gauge which widths work best for the data you're operating with, rather than the performance shown below.

## Array Hypex

Array Hypex.new/1                0.53 µs/op
Array Hypex.update/2             2.13 µs/op
Array Hypex.cardinality/1        6.87 µs/op
Array Hypex.merge/2              16.61 µs/op

## Bitstring Hypex

Bitstring Hypex.new/1            0.46 µs/op
Bitstring Hypex.update/2         2.13 µs/op
Bitstring Hypex.cardinality/1    6.70 µs/op
Bitstring Hypex.merge/2          8.69 µs/op

Contributions

If you feel something can be improved, or have any questions about certain behaviours or pieces of implementation, please feel free to file an issue. Proposed changes should be taken to issues before any PRs to avoid wasting time on code which might not be merged upstream.

If you do make changes to the codebase, please make sure you test your changes thoroughly, and include any unit tests alongside new or changed behaviours. Hypex currently uses the excellent excoveralls to track code coverage.

$ mix test
$ mix coveralls
$ mix coveralls.html && open cover/excoveralls.html

About

Fast HyperLogLog implementation for Elixir/Erlang

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages