This project implements the HyperLogLog algorithm in Java.
HyperLogLog is a probabilistic data structure used to estimate the number of distinct elements (cardinality) in very large datasets.
Instead of storing all elements, HyperLogLog stores only small statistical information, allowing it to estimate the number of unique values using very little memory.
This algorithm is widely used in Big Data systems and distributed databases.
The HyperLogLog algorithm works in several steps:
- Each incoming element is passed through a hash function.
- The hash value is used to determine a bucket (register).
- The algorithm counts the number of leading zeros in the remaining hash bits.
- The register is updated if the new value is larger.
- A harmonic mean calculation is used to estimate the total number of distinct elements.
The cardinality estimation formula used by HyperLogLog:
E = αₘ * m² * ( Σ 2^-M[j] )⁻¹
Where:
- m → number of registers
- M[j] → value stored in register j
- αₘ → bias correction constant
| Operation | Complexity |
|---|---|
| Add element | O(1) |
| Estimate cardinality | O(m) |
| Merge two structures | O(m) |
O(m)
Where m = 2ᵖ represents the number of registers.
HyperLogLog provides approximate results with a known error bound.
Relative Error ≈ 1.04 / √m
Increasing the number of registers reduces the estimation error.
The Java implementation includes the following main functions:
hash()→ Generates a hash value using SHA-256leadingZeros()→ Counts leading zero bits in the hash valueadd()→ Adds a new element to the structureestimate()→ Estimates the number of unique elementsmerge()→ Merges two HyperLogLog structures
HyperLogLog hll = new HyperLogLog(10);
for (int i = 0; i < 100000; i++) {
hll.add("user" + i);
}
System.out.println("Estimated cardinality: " + hll.estimate());