-
Notifications
You must be signed in to change notification settings - Fork 0
ProbabilisticSet
A probabilistic set is a memory efficient set data structure which achieves its efficiency by trading memory consumption for accuracy. The probably (pun intended) most known variant of this data structure is a so called Bloom filter. The Claymore library provides two Bloom filter implementations: a StaticBloomFilter and a ScalableBloomFilter. In the following we'll focus on the latter as a scalable variant is more useful in most situations. If you want to know more about the scalable Bloom filter implementation and how well it performs, you can read our paper.
A Bloom filter does not store the actual elements in its set in memory, but rather just a couple of bits per element. These bits are computed by taking a hash value of each element. In order to compute a hash, the elements a Bloom filter should work with must be serialized. The most compatible way of doing this is to provide a Serializer object to one of the constructors of the ScalableBloomFilter. In the following examples we'll use Strings as elements for our filter. The simplest way to construct a ScalableBloomFilter is:
ProbabilisticSet<String> set = new ScalableBloomFilter<>(StringSerializer.forUTF8Encoding());The above example will construct a ScalableBloomFilter with an initial capacity of 10000 elements and a maximum allowed false positive probability of 1%. You can specify these parameters manually like this:
ProbabilisticSet<String> set = new ScalableBloomFilter<>(
StringSerializer.forUTF8Encoding(),
30000, //initial capacity
0.001); //max error rateThe above example will construct a ScalableBloomFilter with an initial capacity of 30000 elements and a maximum allowed false positive probability of 0.1%. The special characteristic of the constructed Bloom filter is that if more elements are added to it, it scales itself to adhere to the specified maximum error rate. This does not apply, however, to a StaticBloomFilter.
NOTE: There are more advanced parameters that you can adjust to fit your specific use case. See the Javadocs and paper for further details.
If you want a Bloom filter to work with custom objects, you have to provide your own serializer to the constructor. Your concrete class must implement the Serializer interface.
You can add an element to the Bloom filter with the add() method. Let's add a couple of elements to the filter from the previous example:
set.add("Tomato");
set.add("Banana");
set.add("Lettuce");After having added an element to a Bloom filter, it is guaranteed to be included in it.
A membership query tells you whether a specified element is in your set. You can do this with the contains() method:
set.contains("Tomato"); //returns true
set.contains("Cucumber"); //the probability that this call returns true is maximum 0.1%In the shown example, we first check whether the string "Tomato" is in our set. This is guaranteed to return true as we have added that string in the previous section. In other words, a Bloom filter will never deny the existence of an element in its set. On the other hand, we then check whether the string "Cucumber" is in our set. This call has a probability of error, i.e. the Bloom filter might affirm the existence of the string even though we have never added it (false positive). Again, you can adjust the maximum error rate you are willing to tolerate in the constructor. Be aware that a lower error rate generally requires more memory, though.
The ProbabilisticSet interface defines a method for computing the estimated size of a set, i.e. an approximation of how many elements are in a particular set. You can get an approximation with the approximateSize() method:
set.approximateSize();The estimation for the Bloom filter implementations is relatively accurate but be aware that the number might be a little off.