# **Group 2 - Work 1**

**Task :** 
Each group should choose between HLL or Bloom filters and produce an ipynb file with Python code and markdown, which:

- Briefly explains what the algorithm’s purpose is [15%];
- Discusses a possible (real) application of the algorithm [25%];
- Describes the packages/microservices you are going to use [15%];
- Implements the code simulating this real application [45%].

## **Students :**
- Daniel Filipe Vilhena Nunes (101220)
- Prosper Ablordeppey (106382)
- Richard Adolph Aires Jonker (109560)
- Roshan Poudel (109806)

# HyperLogLog Algorithm (HLL)


_All information in this section is based on (Flajolet et al., 2007)_

HLL is a probabilistic algorithm belonging to the family of algorithms derived from Flajolet-Martin (FM).
FM counts the trailing zeros of a hashed bit string and returns a probable count of distinct
values in a data stream. The LogLog algorithm is derived from FM, and HLL is in turn derived from LogLog.
In essence, HLL counts _leading_ zeros in part of a hashed bit string, stores the count in an appropriate register and
returns a probable count of distinct values.

The rationale is that if a bit string has length $L$, there are $2^L$ possible distinct values for it. Fixing the first
bit leaves only $2^{L-1}$ possible values, half of the total. So, if we denote by $N$ the number of leading
zeros of a bit string, exactly $2^{L-(N+1)}$ of the $2^L$ strings start with $N$ zeros followed by a 1, and a uniformly
hashed value has this form with probability $2^{-(N+1)}$. This is the basis for the HLL algorithm: observing $N$ leading
zeros suggests that roughly $2^{N+1}$ distinct values have been hashed. If we later find a string that has $N'>N$
leading zeros, we instead assume roughly $2^{N'+1}$ distinct values. In essence, we store the highest count of leading zeros.

If only one register is used, all we get is a single highest count of leading zeros, which leads to large errors. For
instance, a stream containing a single value whose hash happens to start with 8 zeros would be estimated at around $2^9$
distinct values. Following the procedure in FM and LogLog, HLL therefore uses several registers, which reduces the error.
As such, we define the HLL structure as being composed of $m$ registers, where each holds the highest count of leading
zeros among the bit strings assigned to it.
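
The counting argument above can be checked by brute force for a short string length. The following is a minimal sketch; the helper names `leading_zeros` and `count_strings_with_leading_zeros` and the choice $L = 8$ are illustrative assumptions, not part of the implementation described later.

```python
from itertools import product


def leading_zeros(bits: str) -> int:
    """Number of '0' characters before the first '1' (len(bits) if there is none)."""
    first_one = bits.find("1")
    return len(bits) if first_one == -1 else first_one


def count_strings_with_leading_zeros(length: int, n: int) -> int:
    """Count bit strings of the given length that start with exactly n zeros followed by a 1."""
    return sum(
        1
        for combo in product("01", repeat=length)
        if leading_zeros("".join(combo)) == n
    )


L = 8  # illustrative length, small enough to enumerate all 2**L strings
for n in range(L):
    # exactly 2**(L-(n+1)) of the 2**L strings have n leading zeros followed by a 1
    assert count_strings_with_leading_zeros(L, n) == 2 ** (L - (n + 1))
```

Each additional leading zero halves the number of matching strings, which is why a long run of zeros is taken as evidence of many distinct hashed values.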

Given an object $d$ of a data stream $\sigma$, we pass it as the argument of a hash function
($h: D \rightarrow \{0,1\}^L$), which returns a bit string $h(d)$ of length $L$.
The first $k$ bits of $h(d)$ indicate the register that will hold the count ($m=2^k$).
The remaining $L-k$ bits of $h(d)$ are the ones in which the leading zeros are counted.
Namely, the value stored corresponds to the position of the leftmost bit with value $1$.
Contrary to the usual programming convention, the first bit corresponds to position 1 and not 0.
If there are no bits with value $1$, the value held is $L-k+1$. Each register always keeps the highest count
assigned to it.
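
The split of a hashed bit string into a register index and a stored value can be sketched as follows. The function name `register_and_rank` is a hypothetical helper introduced for illustration, and it assumes $k \ge 1$.

```python
from typing import Tuple


def register_and_rank(hashed_bits: str, k: int) -> Tuple[int, int]:
    """Split a hashed bit string into (register index, position of the leftmost 1).

    The first k bits select the register; positions in the remaining bits are
    1-based. If the remaining bits contain no 1, the value is len(rest) + 1.
    """
    index = int(hashed_bits[:k], 2)  # first k bits choose one of m = 2**k registers
    rest = hashed_bits[k:]           # remaining L - k bits
    first_one = rest.find("1")
    rank = len(rest) + 1 if first_one == -1 else first_one + 1
    return index, rank


# '101' selects register 5; in '00100' the leftmost 1 is at position 3
print(register_and_rank("10100100", 3))
# no 1 in '00000', so the stored value is 5 + 1 = 6
print(register_and_rank("10100000", 3))
```

A register would then be updated only when the returned value exceeds the one it already holds.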

Afterwards, when the entire data stream has been processed, we compute the harmonic mean $z$ of the values $2^{N_i}$,
where $N_i$ is the count held by the $i$-th register. The product $m z$ gives a first estimation of the true count of
distinct values.
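
As a small worked example of this first estimation, the sketch below computes $m z$ from a list of register counts. The function name `uncorrected_estimate` and the sample register values are assumptions made for illustration only.

```python
import statistics


def uncorrected_estimate(register_counts):
    """First estimation m * z, where z is the harmonic mean of 2**N_i over all registers."""
    m = len(register_counts)
    z = statistics.harmonic_mean([2 ** n for n in register_counts])
    return m * z


# Two registers holding counts 1 and 3: the values are 2 and 8,
# their harmonic mean is 2 / (1/2 + 1/8) = 3.2, so the estimation is 2 * 3.2 = 6.4
print(uncorrected_estimate([1, 3]))
```

The harmonic mean is dominated by the smaller values, so a single register with an unusually high count has limited influence on the result.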

## Bias and Range Corrections

(Flajolet et al., 2007) also introduces two corrections to the result previously obtained:
a _bias correction_ ($\alpha_m$) and a _range correction_.

### Bias Correction

The parameter is given by:

$$\alpha_m = \left(m \int_0^\infty \left(\log_2\left(\frac{2+u}{1+u}\right)\right)^m du\right)^{-1}$$

For some commonly used values of $m$:  
$\alpha_{16} = 0.673$  
$\alpha_{32} = 0.697$  
$\alpha_{64} = 0.709$  
$\alpha_{m\ge128} = \frac{0.7213}{1+\frac{1.079}{m}}$

Multiplying this parameter by the previous estimation, a raw estimation ($E$) is obtained:
$$E = \alpha_m m z$$
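
The constants above translate directly into a lookup plus the closed-form expression for larger $m$. This is a minimal sketch; the function names `alpha` and `raw_estimate` are illustrative, and values of $m$ not listed above are rejected rather than guessed.

```python
def alpha(m: int) -> float:
    """Bias correction constant for the values of m listed above."""
    tabulated = {16: 0.673, 32: 0.697, 64: 0.709}
    if m in tabulated:
        return tabulated[m]
    if m >= 128:
        return 0.7213 / (1 + 1.079 / m)
    # other values of m are not covered by the constants given in the text
    raise ValueError(f"no bias correction constant listed for m={m}")


def raw_estimate(m: int, z: float) -> float:
    """Raw estimation E = alpha_m * m * z, with z the harmonic mean of 2**N_i."""
    return alpha(m) * m * z


print(alpha(16), alpha(128))
print(raw_estimate(16, 2.0))
```

For large $m$ the constant approaches 0.7213, so the correction changes little once a few hundred registers are used.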

### Range Correction

Performance of the algorithm is divided into three ranges (defined through $E$ and $m$), each with an associated
correction, which gives the final count estimation ($E^*$). Here $\log$ denotes the natural logarithm.

**Small range: $E \le \frac{5}{2}m$**  
> Let $V$ be the number of registers equal to 0. 
If $V \ne 0 \rightarrow E^* := m\log\left(\frac{m}{V}\right)$, otherwise $E^* := E$

**Intermediate range: $\frac{5}{2}m < E \le \frac{1}{30}2^{32}$**  
> No correction is applied, $E^* := E$

**Large range: $E > \frac{1}{30}2^{32}$**  
> Simply set $E^*:= -2^{32} \log\left(1-\frac{E}{2^{32}}\right)$
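
The three cases can be written as a single function. The sketch below is a standalone illustration of the rules above; the name `range_corrected` and its argument names are assumptions, with `empty_registers` playing the role of $V$.

```python
import math

TWO_32 = 2 ** 32


def range_corrected(raw: float, m: int, empty_registers: int) -> float:
    """Apply the small, intermediate or large range correction to the raw estimation E."""
    if raw <= 2.5 * m:
        # small range: use m * log(m / V) when some registers are still empty
        if empty_registers != 0:
            return m * math.log(m / empty_registers)
        return raw
    if raw <= TWO_32 / 30:
        # intermediate range: no correction
        return raw
    # large range: only defined for raw < 2**32, otherwise the logarithm has no real value
    return -TWO_32 * math.log(1 - raw / TWO_32)


print(range_corrected(10, 16, 8))    # small range, 8 of 16 registers empty
print(range_corrected(1000, 16, 0))  # intermediate range, returned unchanged
```

Note that the small range branch ignores the raw estimation entirely when empty registers exist and relies only on how many registers are still at 0.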

# Real world uses of HLL
HyperLogLog is mainly used to estimate the number of unique values within a very large dataset or stream using little
memory and time.

- Counting unique views on a Reddit post ([View Counting at Reddit](https://www.redditinc.com/blog/view-counting-at-reddit/)).
- Facebook uses HLL in Presto to speed up distinct-count queries, for example to determine the number of distinct people
visiting Facebook in the past week ([HLL in Presto](https://engineering.fb.com/2018/12/13/data-infrastructure/hyperloglog/)).
- Other possible uses include counting unique visitors on a website, counting people at an event (using ticket IDs),
counting total visitors at an airport, or counting the distinct cars passing a junction.

### Counting unique visitors on a website
One possible use of the HLL algorithm is counting the number of unique visitors on a website.
Because users often visit the same website multiple times and an exact number of visitors is usually not needed,
HLL is a good fit for this application. We can use the IP address of each user as the identifier when counting unique
visitors.

# Implementation of HLL

To implement the HLL algorithm, an object-based approach was taken. A class representing it was defined and its
methods provide the desired functionalities - from the actual distinct count to comparing the result with an exact
answer.

## Methods Description

**Methods included in the `HLLAlg` class are :**

`hashString(self, value_to_hash)`
> Takes as input `value_to_hash`, which is to say, an object of a data stream. If it is not already a string, it is converted to one, and it is then passed to a hash function (`sha1` was chosen). The last 24 bits of the hash are returned as a bit string of fixed length 24.

`initialise_registers(self)`
> From the defined error rate, $m$ and $k$ are determined. The variable `registers` is initialized as a dictionary of length $m$, with each key corresponding to a register number. Each register value is initially set to 0.

`alg(self)`
> Implementation of the HLL Algorithm previously described. Takes the data and updates the registers when needed.

`verify(self)`
> Verification of the result. Simply prints the exact number of distinct values and the error of the implemented algorithm relative to it. The former is found by converting the stream to a Python **`set`** and passing it to Python's `len()` (length) function.

`range_correction_results(self)`
> Range correction is performed as needed. Based on the value of $\alpha_m \times m \times z$, either the small, intermediate or large range correction is applied.

**Methods included in the `StreamAlg` class are :**

`exec(self)`
> This method takes the stream as input and passes the values one by one to the `alg` method inherited from the `HLLAlg` class. Range correction is then performed on the results, which provides the estimated distinct count.
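
The way `initialise_registers` derives $k$ and $m$ from the error rate can be sketched in isolation. The standalone function `registers_for_error` below is a hypothetical helper written for illustration; it follows the same rule of rounding $(1.04/\text{error})^2$ up to the next power of two.

```python
import math


def registers_for_error(error: float):
    """Return (k, m): the number of index bits and registers for a target error rate."""
    m_target = (1.04 / error) ** 2         # registers needed for the target error
    k = math.ceil(math.log2(m_target))     # round up to a whole number of index bits
    return k, 2 ** k


for error in (0.1, 0.05, 0.01, 0.001):
    k, m = registers_for_error(error)
    print(f"error={error}: k={k}, m={m}")
```

Because $m$ grows with the inverse square of the error rate, reducing the target error by a factor of 10 requires roughly 100 times as many registers.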


## Packages and Microservices

Our implementation is written from scratch, so it has few Python package dependencies:
1. `random`: used to generate the stream of random IP addresses; it is not part of the main solution.
2. `math`: used for the logarithms in range correction and register initialization, and
for the ceiling function when reporting the final result of the algorithm.
3. `hashlib`: provides the `sha1` hash function used to hash the input.
4. `statistics`: used to calculate the harmonic mean over the registers.

If we had not implemented our own HyperLogLog algorithm from scratch, we would have used `Redis`, running
in a Docker container, to take care of the HyperLogLog bookkeeping for us. It would be used together with the
`redis` Python package, which connects to the Redis service in the container to add values and retrieve the
HyperLogLog estimate.


## The Code

In [31]:
import random 
import math
import hashlib
import statistics

In [32]:
class HLLAlg:
    def __init__(self, error=0.01):
        """
        Initializes the variables for implementation in base class.
        Args:
            error (float) - the accepted error rate of the algorithm, defaults to 1%.
        Sets:
            self.registers - the register that hold all counts of values
            self.k - number of bits of the struct/register to consider
            self.m - actual number of struct/registers.
            self.error_rate - error rate
            self.results - final estimation
            self.init - whether the registers have been initialised
        """
        self.registers = None
        self.k = None
        self.m = None
        self.error_rate = error
        self.results = None
        self.init = False  # alg() creates the registers on its first call
        

    def hashString(self, value_to_hash):
        """
        Hashes the input using the sha1 standard algorithm and returns a 
        string of binary characters (0, 1) with fixed length (24).
        
        Parameters:
            value_to_hash (str,int) - represents the value to hash.
        Returns:
            padded_binary (str) - a hashed string of binary characters (0, 1)
            with fixed length (24)
            
        Example
            >> self.hashString(25)
            >> '011001000000010000000000'
            >> len(self.hashString(25))
            >> 24

        """

        if not isinstance(value_to_hash, str):
            value_to_hash = str(value_to_hash) # required by the encoding function
        
        hashcode = hashlib.sha1(value_to_hash.encode('utf-8')).hexdigest()
        # keep the last 24 bits of the 160-bit SHA-1 digest, zero-padded to 24 characters
        bin_code = format(int(hashcode, 16) & 0xFFFFFF, '024b')
        return bin_code

    
    def initialise_registers(self):
        """
        Creates the required registers for algorithm implementation with all
        of the registers initialized as 0. Number of registers are determined
        based on the error_rate.

        The lower the error_rate, the higher the amount of space or register 
        size needed which results on higher the precision to actual count.

        Sets:
            self.m - the actual size of the struct/register
            self.k - number of bits of the struct/register to consider
            self.registers - container/struct to hold the register of zeros in each bit position of the struct.
        """

        self.m = (1.04/self.error_rate)**2
        self.k = math.ceil(math.log(self.m, 2))
        self.m = 2**self.k # actual struct size computation
        self.registers = {r: 0 for r in range(self.m)}  # initialize the registers with 0
    

    def alg(self):
        """
        Processes the current stream element (self.x) and updates its register.
        The registers are initialised on the first call only, so that the counts
        accumulate across all elements of the data stream.

        Updates:
            self.registers - updates the register for approximation
        """
        
        # ensure initializer is called once 
        if not self.init:
            self.initialise_registers()
            self.init = True

        
        x_hashed = self.hashString(self.x) # obtains the hash of x based on the hashString function
        key = int(x_hashed[:self.k], 2)  # extracts first k bits of x_hashed as key
        q = x_hashed[self.k:] # obtains the last len(x_hashed)-k bit string of x_hashed

        
        # If the hashed string is 000...0, we register its length+1, 
        # as if the first 1 was just outside the string
        if q.find('1') == -1:
            value_count = len(q) + 1
        else:
            value_count = q.find('1') + 1
            
        
        # only replace the register value if it is higher max(register_val, count)
        if value_count > self.registers[key]:
            self.registers[key] = value_count



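    def range_correction_results(self):
        """
        Computes the raw estimation (alfa * m * z) and applies the range
        correction described in the article.

        Sets:
            self.results - final (range-corrected) estimation
        """
        
        counts = list(self.registers.values())
        non_empty = sum(1 for c in counts if c != 0)

        print(f"{non_empty} / {self.m} registers holding some count")

        # Empty registers are kept: they contribute 2**0 = 1 to the harmonic mean
        z = statistics.harmonic_mean([2**c for c in counts])
        
        # Bias correction constant as given in the article;
        # the closed-form expression is stated there for m >= 128
        alfa_dict = {16: 0.673, 32: 0.697, 64: 0.709}
        alfa = alfa_dict.get(self.m, 0.7213/(1+(1.079/self.m)))

        raw = alfa * self.m * z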

        if raw <= 2.5*self.m:
            print("Small range correction")
            u = len([x for x in self.registers.values() if x==0])
            if u != 0:
                self.results = self.m*math.log(self.m/u)
            else:
                self.results =  raw
                
        elif raw <= (1/30)*2**32:
            print("Intermediate range correction")
            self.results = raw
            
        else:
            print("Large range correction")
            self.results = -(2**32)*math.log(1-raw/(2**32))
            
        return 0


    def verify(self):
        """
        Computes the exact result using a deterministic algorithm.
        The computation is solely for comparison with the approximated value.
        """
        
        exact = len(set(self.stream))  # exact distinct count
        estimate = math.ceil(self.results)
        print(f"Actual distinct values: {exact}")
        print(f"Total stream values: {len(self.stream)}")
        print(f"Error: {round(abs(exact - estimate) / exact * 100, 7)}%")

In [33]:
class StreamAlg(HLLAlg):
    def __init__(self, stream, error = 0.01):

        """
        Parameters:
            stream [float, int, string]: The list of values to represent a stream
            error [float]: The error rate that should be passed to the algorithm, default is 1%
        """
        self.stream = stream
        HLLAlg.__init__(self, error)
        self.init = False
        

    def exec(self):
        """
        Passes each value of the data stream to the alg function for register update. 
        """
        for v in self.stream:
            self.x = v
            self.alg()
        self.range_correction_results()
        print(f"Estimated distinct values: {self.results}   ≡   {math.ceil(self.results)}")


# Simulating the counting of unique IP addresses

In [34]:
ip_stream = []
for i in range(1000000):
    
    # Generate random IP address following format XXX.XXX.XXX.XXX
    ip = ".".join(map(str, (random.randint(0, 255) 
                        for _ in range(4))))
    
    ip_stream.append(ip)
    
print(ip_stream[:3])
print()

errors = [0.1, 0.05, 0.01, 0.001]

for i in errors:
    print("-----------------------------------------------")
    print()
    print(f"Expected error: {i*100}%")
    print()
    hll_alg = StreamAlg(ip_stream, i)
    hll_alg.exec()
    hll_alg.verify()
    print()

['38.92.220.149', '119.52.145.246', '160.247.218.207']

-----------------------------------------------

Expected error: 10.0%

128 / 128 registers holding some count
Intermediate range correction
Estimated distinct values: 855013.464058412   ≡   855014
Actual distinct values: 999884
Total stream values: 1000000
Error: 14.4886807%

-----------------------------------------------

Expected error: 5.0%

512 / 512 registers holding some count
Intermediate range correction
Estimated distinct values: 946408.9796203562   ≡   946409
Actual distinct values: 999884
Total stream values: 1000000
Error: 5.3481204%

-----------------------------------------------

Expected error: 1.0%

16384 / 16384 registers holding some count
Intermediate range correction
Estimated distinct values: 997358.2940143074   ≡   997359
Actual distinct values: 999884
Total stream values: 1000000
Error: 0.2525293%

-----------------------------------------------

Expected error: 0.1%

795260 / 2097152 registers holding so

## Conclusion

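In the three runs shown in full, the observed errors were 14.49%, 5.35% and 0.25% for target error rates of 10%, 5% and
1%, respectively. The observed error is therefore of the same order as the target but is not bounded by it: the target
only sets the number of registers through the standard error of the estimator ($1.04/\sqrt{m}$), so a single run can
exceed it, as the first two runs do. Even so, the results show that the number of unique items in a stream can be
estimated with reasonable accuracy using a fixed number of registers, which makes the algorithm well suited to counting
the number of unique visitors to a website.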
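
To put the observed errors in context, the standard error $1.04/\sqrt{m}$ from (Flajolet et al., 2007) can be evaluated for the register counts reported in the runs above. This is a small sketch; the function name `standard_error` is an illustrative assumption.

```python
import math


def standard_error(m: int) -> float:
    """Standard error of the HLL estimate with m registers, 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(m)


# register counts printed by the completed runs
for m in (128, 512, 16384):
    print(f"m={m}: standard error = {standard_error(m) * 100:.2f}%")
```

Since this is a standard deviation rather than a bound, individual runs are expected to land both below and above it.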

## References


1. Flajolet, P., et al. (2007). HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. 2007 Conference on Analysis of Algorithms (AofA 07), DMTCS proc. AH, 127–146.

2. Heule, S., et al. (2013). HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm. Proceedings of the EDBT 2013 Conference.



