# What is HyperLogLog (HLL)?

A brief introduction is needed to define the purpose and reason for the HLL algorithm properly. First, we briefly present a definition of the following essential terminologies:
- multiset and its multiplicity
- data stream and source
- the purpose of the HLL algorithm on a multiset

A set is a collection of well-defined, distinct objects, called the members or elements of the set. A multiset, on the other hand, is an unordered collection in which elements may repeat. Thus, a multiset can be thought of as a set that allows repeated elements.

The multiplicity of an element x in a multiset is the number of times that element appears in the multiset. In other words, an element of a multiset may have a multiplicity greater than 1. For instance, in the multiset {3, 3, 4, 5, 6}, element 3 has multiplicity 2, while the elements 4, 5, and 6 each have multiplicity 1. Order doesn’t matter, so {3, 3, 4, 5, 6} is the same as {3, 4, 6, 5, 3}.
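As a small illustration of multiplicity, the sketch below uses Python's `collections.Counter` as a stand-in for a multiset. The helper name `multiplicities` is hypothetical and only serves this example.

```python
from collections import Counter


def multiplicities(items):
    """Return a mapping from each element to the number of times it occurs."""
    return Counter(items)


m = multiplicities([3, 3, 4, 5, 6])
print(m[3])       # multiplicity of element 3
print(len(m))     # number of distinct elements
print(m == multiplicities([3, 4, 6, 5, 3]))  # order of elements does not matter
```

The number of distinct elements, `len(m)`, is the quantity that HLL approximates.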

Data collected in real life may be finite, where the total number of elements in the dataset is known, or effectively unbounded, where elements keep arriving and the total number is not known in advance.

A data stream is a countably infinite sequence of elements, used to represent data that becomes available over time. Examples are readings from sensors, financial transaction logs, and network data in computer monitoring applications (activity logs from web browsers, IP addresses).

Often, the number of distinct elements in a stream of data is needed to aid analysis and inform decisions. The literature offers algorithms that count distinct values efficiently in small datasets. For very large datasets, however, these algorithms fail or report values with intolerably high errors. In addition, calculating the exact number of distinct elements in a multiset or data stream requires memory proportional to that number, which is impractical for very large streams.
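To make the memory argument concrete, here is a minimal sketch of exact distinct counting. It remembers every distinct element in a set, so its memory grows with the number of distinct elements. The function name `exact_distinct_count` is hypothetical, and `sys.getsizeof` reports only the size of the set container itself (not the elements), so the printed sizes are a lower bound.

```python
import sys


def exact_distinct_count(stream):
    """Count distinct elements exactly by remembering every element seen."""
    seen = set()
    for item in stream:
        seen.add(item)
    return len(seen)


print(exact_distinct_count([3, 3, 4, 5, 6]))

# The set container grows as more distinct elements are stored.
for n in (10, 1_000, 100_000):
    print(n, sys.getsizeof(set(range(n))))
```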

For these reasons, the HLL algorithm was proposed to approximate the number of distinct elements in a very large multiset or data stream. HLL is a probabilistic cardinality estimator: it uses significantly less memory than exact counting to obtain an approximation of the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of > 10**9 with a typical accuracy (standard error) of 2%, using 1.5 kB of memory. HyperLogLog is an extension of the earlier LogLog algorithm, itself deriving from the 1984 Flajolet–Martin algorithm.
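The trade-off between memory and accuracy can be sketched with the commonly quoted relation that the standard error of HLL is about 1.04/sqrt(m) for m registers, which is also the relation used by `initialise_registers` in the implementation section. The helper names `standard_error` and `registers_for_error` are hypothetical and exist only for this illustration.

```python
import math


def standard_error(m):
    """Approximate standard error of HLL with m registers (1.04 / sqrt(m))."""
    return 1.04 / math.sqrt(m)


def registers_for_error(error_rate):
    """Return (k, m): the smallest power of two m = 2**k meeting the target error."""
    k = math.ceil(math.log2((1.04 / error_rate) ** 2))
    return k, 2 ** k


print(standard_error(2048))        # roughly 2.3% with 2048 registers
print(registers_for_error(0.001))  # k and m for a 0.1% target error
```

Halving the target error roughly quadruples the number of registers, since m grows with the inverse square of the error.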


# How does HLL work?

This presentation makes a slight modification to the original algorithm to include bias and range corrections which help significantly with the results and provide a better description of the algorithm than the one in the lecture slides.
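As a compact reference for the bias and range corrections mentioned above, the estimate can be written as follows. This is a sketch in the notation of this notebook's implementation: m is the number of registers, M_j the value stored in register j, V the number of registers equal to zero, and the thresholds assume a 32-bit hash space, as in the code.

```latex
\[
E = \alpha_m \, m^2 \left( \sum_{j=0}^{m-1} 2^{-M_j} \right)^{-1},
\qquad
\alpha_{16} = 0.673,\quad \alpha_{32} = 0.697,\quad \alpha_{64} = 0.709,\quad
\alpha_m = \frac{0.7213}{1 + 1.079/m} \ \ (m \ge 128)
\]

\[
E^{*} =
\begin{cases}
m \ln(m / V) & \text{if } E \le \tfrac{5}{2} m \text{ and } V > 0 \\[4pt]
E & \text{if } E \le \tfrac{1}{30}\, 2^{32} \text{ (including } E \le \tfrac{5}{2} m \text{ with } V = 0\text{)} \\[4pt]
-2^{32} \ln\!\left(1 - E / 2^{32}\right) & \text{otherwise}
\end{cases}
\]
```

The first case is the small-range correction (linear counting), and the last case compensates for hash collisions when the estimate approaches the size of the hash space.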

~to be continued.... 

# Libraries/Modules Employed

- numpy
- random 
- math
- hashlib
- statistics

# Implementation

In [1]:
# modules/libraries used
import numpy as np
import random 
import math
import hashlib
import statistics


# generates a random stream of 25000 integers; element i is drawn uniformly from [0, i], so values repeat.
stream = [random.randint(0, i) for i in range(25000)]


class BaseAlg:
    def __init__(self):
        """
        Initializes the variables for implementation in base class.
        
        Sets:
            self.registers - the struct to hold all counts
            self.k - number of hash bits used to select a register
            self.m - number of registers (2**k)
            self.results - the estimated number of distinct elements
            self.error_rate - target relative error used to size the registers
        """
        self.k = None 
        self.m = None
        self.registers = None
        self.results = None
        self.error_rate = None
        

    def hashString(self, value_to_hash):
        """
        Hashes the input using the SHA-1 algorithm and returns the last 24 bits of the digest as a binary string.

        Args:
            value_to_hash [str,int] - represents the value to hash.
        Returns:
            bin_code [str] - a 24-character string of '0' and '1', taken from the low-order bits of the 160-bit digest.
        Example
            >> self.hashString(25)
            >> '011001000000010000000000'
            >> len(self.hashString(25))
            >> 24

        """

        if not isinstance(value_to_hash, str):
            value_to_hash = str(value_to_hash) # required by the encoding function
        
        hashcode = hashlib.sha1(value_to_hash.encode('utf-8')).hexdigest() # 40 hex characters, i.e. 160 bits
        bin_code = format(int(hashcode, 16) & 0xFFFFFF, '024b') # keeps the last 24 bits, zero-padded to 24 characters
        return bin_code

        ## This hash function implementation returning all 160 bitstring(larger bitstring) results in larger error 
        # h = hashlib.sha1() 
        # h.update(value_to_hash.encode()) # hashes encoded string to produce hex representation of the digest
        # end_length = len(h.hexdigest()) * 4 # 160. h.hexdigest() is 40 hex characters, i.e. 160 bits
        # hex_as_int = int(h.hexdigest(), 16) # converts to integer the hexdigest()
        # hex_as_binary = bin(hex_as_int) # binary representation of integer representation of h.hexdigest
        # padded_binary = hex_as_binary[2:].zfill(end_length) # pads binarystring with zeros to 160 bits length
        # return(padded_binary)

    
    def initialise_registers(self):
        """
        Creates the registers required by the algorithm.
        If error_rate is defined, k is computed from it;
        otherwise the default k is used. In both cases m = 2**k registers are created.

        The lower the error_rate, the more registers are needed, and the closer the estimate is to the actual count.

        Args:
            self.k - number of hash bits used as the register index, updated if error_rate is defined.
            self.error_rate - target relative error used to compute the number of registers

        Sets:
            self.m - the number of registers
            self.registers - dict mapping each register index to its value, initially zero.
        """

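        self.k = 4 # default number of index bits (m = 2**4 = 16 registers)
        self.error_rate = 0.001 # None, 0.1, 0.01, 0.001 change to see effects

        if self.error_rate is not None:
            # the standard error of HLL is about 1.04/sqrt(m), so m = (1.04/error_rate)**2
            self.m = (1.04/self.error_rate)**2
            self.k = math.ceil(math.log(self.m, 2))  # k is updated if error_rate is defined
            # note: hashString returns 24 bits, so only 24 - k bits remain for the rank

        self.m = 2**self.k # number of registers, rounded up to a power of two
        self.registers = {r: 0 for r in range(self.m)}  # builds the registers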
    

    def alg(self):
        """
        Updates the registers with the current stream element self.x.
        The registers are initialised once, on the first call.

        Updates:
            self.registers - the register selected by the hash of self.x
        """
        
        # ensure initializer is called once 
        if not self.init:
            self.initialise_registers()
            self.init = True

        
        x_hashed = self.hashString(self.x) # obtains the hash of x based on the hashString function
        key = int(x_hashed[:self.k], 2)  # first k bits of x_hashed select the register
        q = x_hashed[self.k:] # remaining len(x_hashed)-k bits of x_hashed

        #(a)
        f1_index = q.find('1') # index of the first 1 in q, or -1 if q is all zeros
        if f1_index == -1:
            value_count = len(q) + 1 # all zeros: rank is one more than the number of bits
        else:
            value_count = f1_index + 1 # number of leading zeros in q, plus one
        if self.registers[key] < value_count: # keep the largest rank seen for this register
            self.registers[key] = value_count

        #(b)
        # if q.find('1') == -1:
        #     value_count = len(q) + 1
        # else:
        #     value_count = q.find('1') + 1
        # if value_count > self.registers[key]:
        #     self.registers[key] = value_count

        # either (a) or (b) produces the same result


    def range_correction_results(self):
        """
        Computes the raw HyperLogLog estimate from the registers and applies
        the small- and large-range corrections. The result is stored in self.results.
        """

        v = [2**x for x in self.registers.values()]

        # v has one entry per register, so this prints m/m
        print("%d/%d registers holding some count" % (len(v), self.m))

        z = statistics.harmonic_mean(v) # equals m / sum(2**-register)

        # bias-correction constant; the closed-form expression applies for m >= 128
        alfa_dict = {16: 0.673, 32: 0.697, 64: 0.709}
        alfa = alfa_dict.get(self.m, 0.7213/(1+(1.079/self.m)))

        raw = alfa*self.m*z # raw estimate: alfa * m**2 / sum(2**-register)

        if raw <= 2.5*self.m:
            print("Small range correction")
            u = len([x for x in self.registers.values() if x==0]) # number of empty registers
            if u != 0:
                self.results = self.m*math.log(self.m/u) # linear counting
            else:
                self.results = raw
        elif raw <= (1/30)*2**32: # this and the next threshold assume a 32-bit hash space
            print("Intermediate range correction")
            self.results = raw
        else:
            print("Large range correction")
            self.results = -(2**32)*math.log(1-raw/(2**32))
        return 0


    def verify(self):
        """
        Computes the exact result using a deterministic algorithm.
        The computation is solely for comparison with the approximated value.
        """

        print("Actual distinct values: %d" % len(set(self.stream)))


class StreamAlg(BaseAlg):
    def __init__(self, stream):
        super().__init__() # sets the BaseAlg attributes to their defaults
        self.stream = stream
        self.init = False
        self.exec()
        self.verify()

    def exec(self):
        """
        Passes each value of the data stream to the alg function for register update. 
        """
        for v in self.stream:
            self.x = v
            self.alg()
        self.range_correction_results()
        print("Estimated distinct values: %d" % self.results)

SA = StreamAlg(stream)

2097152/2097152 registers holding some count
Small range correction
Estimated distinct values: 12447
Actual distinct values: 12451
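
The gap between the two counts above can be expressed as a relative error and compared with the configured `error_rate` of 0.001. The helper `relative_error` is a hypothetical name used only for this check.

```python
def relative_error(estimate, actual):
    """Return |estimate - actual| / actual."""
    return abs(estimate - actual) / actual


err = relative_error(12447, 12451)
print("Relative error: %.4f%%" % (100 * err))
print("Within target error_rate of 0.001:", err <= 0.001)
```

For this run the estimate is off by 4 out of 12451, which is about 0.03% and within the configured target.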


In [None]:
# Tweak the value of error_rate in the initialise_registers method to see the appropriate results.
# Before doing this, make sure to comment out the stream variable to prevent generation of new data stream.
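
An alternative to commenting out the stream variable is to generate the stream from a seeded random generator, so every run sees the same data. This is a minimal sketch: `make_stream` and its `seed` parameter are hypothetical and not part of the implementation above.

```python
import random


def make_stream(n=25000, seed=0):
    """Generate the same pseudo-random stream on every call with the same seed."""
    rng = random.Random(seed)  # independent generator, leaves the global state untouched
    return [rng.randint(0, i) for i in range(n)]


stream = make_stream()
print(len(stream), len(set(stream)))  # stream length and exact number of distinct values
```

With a fixed seed, changes in the estimate between runs come only from the change in `error_rate`, not from a different stream.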