# **Group 2 - Work 1**

Each group should choose between HLL and Bloom filters and produce an .ipynb file with Python code and Markdown that:

- Briefly explains the algorithm's purpose [15%];
- Discusses a possible (real) application of the algorithm [25%];
- Describes the packages/microservices you are going to use [15%];
- Implements code simulating this real application [45%].

## **Students:**
- Daniel Filipe Vilhena Nunes (101220)
- Prosper Ablordeppey (106382)
- Richard Adolph Aires Jonker (109560)
- Roshan Poudel (109806)

# HyperLogLog Algorithm (HLL)

_All information in this section is based on (Flajolet et al., 2007)._

HLL is a probabilistic algorithm belonging to the family of algorithms derived from Flajolet-Martin (FM). FM counts the trailing zeros of a hashed bit string and returns a probable count of the distinct values in a data stream. The LogLog algorithm is derived from FM, and HLL is in turn derived from LogLog.
In essence, HLL counts the _leading_ zeros of part of a hashed bit string, stores the count in an appropriate register, and returns a probable count of distinct values.

The rationale is that a bit string of length $L$ has $2^L$ possible values. Fixing the first bit to 1 leaves only $2^{L-1}$ of them, half of the total. More generally, if $N$ denotes the number of leading zeros of a bit string, only $2^{L-(N+1)}$ of the $2^L$ values start with exactly $N$ zeros followed by a 1, so a uniformly hashed value shows this pattern with probability $2^{-(N+1)}$. Observing such a string therefore suggests that about $2^{N+1}$ distinct values have been hashed. This is the basis for the HLL algorithm. If we later find a string with $N' > N$ leading zeros, we raise the estimate to about $2^{N'+1}$. In essence, we store the highest count of leading zeros.

If only one register is used, all we get is the highest count of leading zeros, which leads to a large error. For instance, a single value whose 9-bit hash happens to have 8 leading zeros would by itself suggest $2^9 = 512$ distinct values, which is clearly not a safe assumption. Following the procedure in FM and LogLog, HLL also uses several registers, which reduces the error. As such, we define the HLL structure as being composed of $m$ registers, each of which holds the highest count of leading zeros among the bit strings assigned to it.
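The single-register idea can be illustrated with a short sketch. The helper names (`hash32`, `leading_zeros`, `single_register_estimate`) and the choice of SHA-256 truncated to 32 bits are assumptions made for this illustration only; they are not taken from the paper.

```python
import hashlib

def hash32(item: str) -> int:
    """Hypothetical helper: first 32 bits of SHA-256 as an integer."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def leading_zeros(value: int, width: int) -> int:
    """Number of leading zeros of `value` written with `width` bits."""
    return width - value.bit_length()

def single_register_estimate(items) -> int:
    """Keep only the highest leading-zero count N and read it as 2^(N+1)."""
    highest = max(leading_zeros(hash32(x), 32) for x in items)
    return 2 ** (highest + 1)

if __name__ == "__main__":
    items = [f"item-{i}" for i in range(1000)]
    print("true count:", len(set(items)))
    print("single-register estimate:", single_register_estimate(items))
```

Because the result is always a power of two and depends on one extreme value, it can be far from the true count, which is the motivation for using several registers.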

Given an object $d$ of a data stream $\sigma$, we pass it to a hash function $h: D \rightarrow \{0,1\}^L$, which returns a bit string $h(d)$ of length $L$. The first $k$ bits of $h(d)$ select the register that will hold the count ($m=2^k$). The leading zeros are counted in the remaining $L-k$ bits of $h(d)$. More precisely, the value stored is the position of the leftmost bit with value $1$. Contrary to the usual programming convention, the first bit has position 1 and not 0. If no bit has value $1$, the value stored is $L-k+1$. Each register always keeps the highest value assigned to it.
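A minimal sketch of this register update, operating directly on an $L$-bit integer hash value. The function names `index_and_rank` and `update` are hypothetical, and the hash function itself is left out so that the bit manipulation is easy to follow.

```python
def index_and_rank(h: int, L: int, k: int) -> tuple[int, int]:
    """Split an L-bit hash into (register index, position of leftmost 1)."""
    rest_bits = L - k
    index = h >> rest_bits                 # first k bits select the register
    rest = h & ((1 << rest_bits) - 1)      # remaining L-k bits
    # Positions start at 1; an all-zero remainder yields L-k+1.
    rank = rest_bits - rest.bit_length() + 1
    return index, rank

def update(registers: list[int], h: int, L: int, k: int) -> None:
    """Store the highest rank seen so far in the selected register."""
    index, rank = index_and_rank(h, L, k)
    registers[index] = max(registers[index], rank)

if __name__ == "__main__":
    registers = [0] * 4                    # m = 2^k with k = 2
    for h in (0b10001000, 0b10000001, 0b11000000):
        update(registers, h, L=8, k=2)
    print(registers)
```

With $L=8$ and $k=2$, the hash `0b10001000` goes to register 2 with value 3, since the leftmost 1 of the remaining bits `001000` is in position 3.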

Afterwards, when the entire data stream has been processed, we compute the harmonic mean $z$ of the values $2^{N_i}$, where $N_i$ is the count held by the $i$-th register, that is, $z = m / \sum_{i=1}^{m} 2^{-N_i}$. The product $m z$ gives a first estimate of the true count of distinct values.
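As a small worked sketch, the first estimate can be computed from a list of register values as follows. The function names are hypothetical, and the register contents in the example are made up purely to show the arithmetic.

```python
def harmonic_mean_of_powers(registers: list[int]) -> float:
    """Harmonic mean z of 2^{N_i} over all registers."""
    m = len(registers)
    return m / sum(2.0 ** -n for n in registers)

def first_estimate(registers: list[int]) -> float:
    """First (uncorrected) estimate: m times the harmonic mean z."""
    return len(registers) * harmonic_mean_of_powers(registers)

if __name__ == "__main__":
    # Four registers all holding 1: z = 4 / (4 * 0.5) = 2, estimate = 4 * 2 = 8
    print(first_estimate([1, 1, 1, 1]))
```

The harmonic mean is used because it is less sensitive than the arithmetic mean to a few registers with unusually high values.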

## Bias and Range Corrections

(Flajolet et al., 2007) also introduces two corrections to the result obtained above: a bias correction ($\alpha_m$) and a range correction.

### Bias Correction

The parameter is given by:

$$\alpha_m = \left(m \int_0^\infty \left(\log_2\left(\frac{2+u}{1+u}\right)\right)^m du\right)^{-1}$$

For some commonly used values of $m$:
$\alpha_{16} = 0.673$  
$\alpha_{32} = 0.697$  
$\alpha_{64} = 0.709$  
$\alpha_{m\ge128} = \frac{0.7213}{1+\frac{1.079}{m}}$

Multiplying the previous estimate by this parameter gives the _raw estimate_ ($E$):
$$E = \alpha_m m z$$
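The tabulated values of $\alpha_m$ and the raw estimate can be sketched as below. The function names `alpha` and `raw_estimate` are hypothetical, and rejecting values of $m$ not listed above is a choice made for this sketch.

```python
def alpha(m: int) -> float:
    """Bias-correction constant for the values of m listed above."""
    fixed = {16: 0.673, 32: 0.697, 64: 0.709}
    if m in fixed:
        return fixed[m]
    if m >= 128:
        return 0.7213 / (1 + 1.079 / m)
    raise ValueError("m must be 16, 32, 64 or at least 128 in this sketch")

def raw_estimate(registers: list[int]) -> float:
    """Raw estimate E = alpha_m * m * z, with z the harmonic mean of 2^{N_i}."""
    m = len(registers)
    z = m / sum(2.0 ** -n for n in registers)
    return alpha(m) * m * z

if __name__ == "__main__":
    # 16 registers all holding 1: z = 2, so E = 0.673 * 16 * 2
    print(raw_estimate([1] * 16))
```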

### Range Correction
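This section is still empty in the notebook. As a hedged sketch only, the function below follows the small-range and large-range corrections as we understand them from (Flajolet et al., 2007), assuming 32-bit hash values: for $E \le \frac{5}{2}m$ with $V > 0$ empty registers the estimate becomes $m \ln(m/V)$, and for $E > 2^{32}/30$ it becomes $-2^{32}\ln(1 - E/2^{32})$. The function name `range_correct` is hypothetical.

```python
import math

TWO_32 = 2 ** 32  # assumes 32-bit hash values

def range_correct(E: float, registers: list[int]) -> float:
    """Apply small- and large-range corrections to a raw estimate E."""
    m = len(registers)
    if E <= 2.5 * m:
        V = registers.count(0)            # registers that were never updated
        if V != 0:
            return m * math.log(m / V)    # small range: linear counting
        return E
    if E <= TWO_32 / 30:
        return E                          # intermediate range: no correction
    return -TWO_32 * math.log(1 - E / TWO_32)  # large range

if __name__ == "__main__":
    print(range_correct(10.0, [0] * 8 + [1] * 8))
```

A register value of 0 means "empty" here because stored positions start at 1.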

## Pseudo-Code Form



## Implementation

## How does HLL work?

![HLL Algorithm](images/hll_original.png "HyperLogLog Algorithm")

## Real world uses of HLL

## Packages used

## Implementation of HLL
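As a starting point, the steps described in the HLL section above can be combined into one small class. This is a sketch under stated assumptions, not a definitive implementation: the class name `HyperLogLogSketch`, the use of SHA-256 truncated to 64 bits, and the default of $k=10$ ($m=1024$ registers) are all choices made for illustration. With a 64-bit hash only the small-range correction is applied.

```python
import hashlib
import math
import random

class HyperLogLogSketch:
    """Illustrative HLL with m = 2^k registers (hypothetical helper class)."""

    def __init__(self, k: int = 10, hash_bits: int = 64):
        if k < 4:
            raise ValueError("this sketch supports k >= 4 only")
        self.m = 1 << k
        self.hash_bits = hash_bits
        self.rest_bits = hash_bits - k
        self.registers = [0] * self.m

    def _hash(self, item: str) -> int:
        # Assumption: truncated SHA-256 stands in for a uniform hash function.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(digest[: self.hash_bits // 8], "big")

    def add(self, item: str) -> None:
        h = self._hash(item)
        index = h >> self.rest_bits                    # first k bits
        rest = h & ((1 << self.rest_bits) - 1)         # remaining bits
        rank = self.rest_bits - rest.bit_length() + 1  # leftmost 1, from 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def _alpha(self) -> float:
        fixed = {16: 0.673, 32: 0.697, 64: 0.709}
        return fixed.get(self.m, 0.7213 / (1 + 1.079 / self.m))

    def estimate(self) -> float:
        z = self.m / sum(2.0 ** -r for r in self.registers)
        E = self._alpha() * self.m * z
        V = self.registers.count(0)
        if E <= 2.5 * self.m and V:
            return self.m * math.log(self.m / V)       # small-range correction
        return E

if __name__ == "__main__":
    # Random IPv4 addresses, in the style of the notebook's test stream.
    stream = [".".join(str(random.randint(0, 255)) for _ in range(4))
              for _ in range(50000)]
    hll = HyperLogLogSketch()
    for ip in stream:
        hll.add(ip)
    print("exact:", len(set(stream)), "estimate:", round(hll.estimate()))
```

With $m=1024$ registers the expected relative error is roughly $1.04/\sqrt{m} \approx 3\%$, so the estimate should normally land within a few percent of the exact count.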

## Implementation of a Real World Application of HLL

## References


Flajolet, P., Fusy, É., Gandouet, O., & Meunier, F. (2007). HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. 2007 Conference on Analysis of Algorithms (AofA 07), DMTCS Proceedings AH, 127–146.



In [9]:
import numpy as np
import random 
import math
import hashlib
import statistics as st

## Your test stream
stream = []
for i in range(50000):
    # Randomly generated IPv4 address
    ip = ".".join(map(str, (random.randint(0, 255) for _ in range(4))))
    stream.append(ip)
    
print(f"Size of our stream : {len(stream)}")

## Implement the algorithm in here
class BaseAlg:
    stream = []
    results = {}
    x = 0
    ## Initialise the algorithm variables
    ...

    def alg(self):
        ## Stream algorithm
        self.results["test"] = "HEY"
    
    def verify(self):
        ## Exact algorithm (using all the stream) for comparison
        self.results["Exact"] = "5000"

## Do not change
class StreamAlg(BaseAlg):
    def __init__(self, stream):
        self.stream = stream
        self.exec()
        self.verify()

    def exec(self):
        for v in self.stream:
            self.x = v
            self.alg()
        print('Results:',self.results) 
        
SA = StreamAlg(stream)

Size of our stream : 50000
Results: {'test': 'HEY'}
