# Counting Number of Elements in a Set

You are tasked with counting how many times a particular value $a$ appears in a collection $A = \{a_1, a_2, \ldots, a_n\}$ that may contain repeated values (a multiset, or a stream of $n$ items).

Two exact algorithms solve this problem:
- Algorithm 1: Space complexity $O(n)$, time complexity $O(n)$ per query. This method stores the raw data, iterates through every element, and counts occurrences of $a$ directly.
- Algorithm 2: Space complexity $O(\text{distinct elements})$, time complexity $O(1)$ per query after an $O(n)$ preprocessing pass. This approach pre-processes the data to store the count of each distinct element in a hash map, allowing $O(1)$ expected lookup time for the count of any element.
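
The two approaches can be sketched in Python as follows. This is a minimal illustration only; the function names `count_by_scan` and `build_count_index` are chosen for this example and are not part of the original notes.

```python
from collections import Counter

def count_by_scan(items, target):
    """Algorithm 1: scan every element and count matches (O(n) time per query)."""
    return sum(1 for x in items if x == target)

def build_count_index(items):
    """Algorithm 2: one O(n) pass that stores a count per distinct element."""
    return Counter(items)

data = ["apple", "banana", "orange", "apple", "banana"]
index = build_count_index(data)

print(count_by_scan(data, "apple"))  # 2
print(index["apple"])                # 2, expected O(1) lookup
print(index["grape"])                # 0, Counter returns 0 for unseen keys
```

The hash map trades memory proportional to the number of distinct elements for constant-time queries, which is the cost the rest of these notes try to reduce.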


# Q. How can we reduce space complexity at the cost of accuracy?

- See `SamplingVSSketches.ipynb` for a naive solution.

# Count-Min Sketch

The Count-Min Sketch is a probabilistic data structure that summarizes item frequencies in a data stream. It is built from two main components: a set of hash functions and a data table.

### Hash Functions

- **Hash Functions**: The Count-Min Sketch utilizes $d$ independent hash functions, denoted as $h_0, h_1, ..., h_{d-1}$. Each hash function maps elements from the data stream to integers in the range $[0, m-1]$, where $m$ is the width of the sketch. These hash functions are designed to distribute the items uniformly across the hash space to minimize collisions and ensure even distribution of counts.

### Data Table Structure

- **Data Table**: A $d \times m$ matrix of counters, where $d$ is the depth and $m$ is the width of the sketch. Each row corresponds to a hash function, and each column represents a hash bucket. Initially, all entries are set to 0. When an item is processed, it is hashed $d$ times, once per hash function, and the corresponding counter in each row is incremented. The structure of the table is outlined below; since every item increments exactly one counter per row, each row sums to $n$, the number of items processed so far:

|             | 0   | 1   | ... | $m-1$ | **Sum** |
|-------------|-----|-----|-----|-------|---------|
| $h_0$       | 0   | 2   | ... |       | $n$     |
| $h_1$       |     |     | ... |       | $n$     |
| ...         |     |     |     |       | ...     |
| $h_{d-1}$   |     |     | ... |       | $n$     |


Each cell holds the total number of stream items that the row's hash function maps to that column's bucket. These counters are used to estimate the frequency of each item in the stream.

### Algorithm Overview

- **Initialization**: Initialize a $d \times m$ matrix $T$ with all elements set to 0.
- **Update**: For each item $a$ in the stream, compute $d$ hash values $h_0(a), h_1(a), ..., h_{d-1}(a)$. For each $i \in [0, d-1]$, increment $T[i][h_i(a)]$ by 1.
- **Query**: To estimate the frequency of an item $a$, compute the minimum value among its $d$ hash positions: $\min(T[0][h_0(a)], T[1][h_1(a)], ..., T[d-1][h_{d-1}(a)])$.
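
The three steps above can be traced by hand on a tiny example. The sketch below uses two hard-coded lookup tables in place of real hash functions (an assumption made purely so that the collisions are visible; `H`, `update`, and `query` are illustrative names).

```python
# Hypothetical lookup tables standing in for two hash functions (d = 2, m = 3).
H = [
    {"a": 0, "b": 0, "c": 1},  # h_0: "a" and "b" collide in bucket 0
    {"a": 2, "b": 1, "c": 2},  # h_1: "a" and "c" collide in bucket 2
]
d, m = len(H), 3

# Initialization: a d x m table of zeros.
T = [[0] * m for _ in range(d)]

def update(item):
    """Increment one counter per row, at the position given by that row's hash."""
    for i in range(d):
        T[i][H[i][item]] += 1

def query(item):
    """Return the minimum counter among the item's d positions."""
    return min(T[i][H[i][item]] for i in range(d))

for item in ["a", "a", "b", "c"]:
    update(item)

print(T)           # [[3, 1, 0], [0, 1, 3]]
print(query("a"))  # 3: true count is 2, but "a" collides in both rows
print(query("b"))  # 1: row 1 has no collision, so the minimum is exact
print(query("c"))  # 1: row 0 has no collision, so the minimum is exact
```

The estimate is exact whenever at least one row is collision-free for the queried item, and an overestimate otherwise.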

### Error Analysis
To understand how the error of a Count-Min Sketch is bounded, we first look at its expected value rather than at the tail probabilities that Chernoff-style bounds address. This analysis examines the average-case behavior of hash collisions in the sketch.

#### Setting the Stage

For an item $a$ with true count $|a|$ in a stream of $n$ items, the Count-Min Sketch estimates $|a|$ using $d$ hash functions, each mapping items to one of $m$ buckets. The error in the estimate comes from other items being hashed to the same bucket as $a$.

#### Expected Error Calculation

Let's denote:
- $n$: Total count of items in the stream.
- $|a|$: True count of the specific item $a$.
- $m$: Number of buckets in each row of the sketch.
- $d$: Number of hash functions (rows in the sketch).

For each hash function $h_i$, the expected number of counts for other items (not $a$) that collide with $a$ in the same bucket is $\frac{n - |a|}{m}$. This is because each of the $n - |a|$ other items is equally likely to hash to any of the $m$ buckets, and $h_i$ is assumed to distribute items uniformly at random.

#### Expected Error for a Single Hash Function

For a single hash function, the expected error introduced due to collisions with item $a$ is thus $\frac{n - |a|}{m}$. This error is the additional count in the bucket of item $a$ not attributable to $a$ itself.
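
The single-row expectation $\frac{n - |a|}{m}$ can be checked empirically. The simulation below is a rough sketch under a simplifying assumption: every one of the other occurrences picks its bucket independently and uniformly, whereas in a real sketch all occurrences of the same item share a bucket. By linearity of expectation the mean is the same in both cases. The function name and parameter values are illustrative.

```python
import random

def average_row_error(n_other, m, trials, rng):
    """Average count of other occurrences landing in the bucket of item a (one row)."""
    total = 0
    for _ in range(trials):
        # Bucket 0 stands in for h_i(a); each other occurrence picks a bucket uniformly.
        total += sum(1 for _ in range(n_other) if rng.randrange(m) == 0)
    return total / trials

rng = random.Random(42)   # fixed seed so the run is repeatable
n_other, m = 500, 10      # n - |a| = 500 other occurrences, m = 10 buckets
simulated = average_row_error(n_other, m, trials=1000, rng=rng)

print("simulated:", round(simulated, 2))
print("predicted:", n_other / m)  # (n - |a|) / m = 50.0
```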

#### Total Expected Error in the Sketch

Since the Count-Min Sketch takes the minimum count across all $d$ rows as the estimate for $|a|$, the per-row errors do not add up. Instead, the estimate equals $|a|$ plus the smallest of the $d$ row errors, so a single row with few collisions is enough to produce a good estimate.

Each row independently contributes an expected error of $\frac{n - |a|}{m}$, and the minimum of the row errors can never exceed the error of any single row. The expected error of the estimate is therefore at most $\frac{n - |a|}{m}$, and taking the minimum typically pushes it toward the lower end of the error distribution.

The exact expectation of the minimum requires a more involved probabilistic analysis, because it depends on how the counts are distributed across items and on the independence of the hash functions.

#### Clarification of the Error Bound

The statement that the expected error is bounded by $\frac{n - |a|}{md}$ is a simplification. The direct expected error per hash function is $\frac{n - |a|}{m}$. When considering $d$ hash functions, the Count-Min Sketch design aims to minimize this error, but the reduction is not linear with $d$. The $\frac{1}{d}$ factor might be interpreted as an intuitive or average-case reduction due to taking the minimum across $d$ independent estimates, but strictly speaking, the expected minimum error does not divide evenly by $d$.

#### Correct Interpretation

The correct interpretation is that the Count-Min Sketch aims to minimize the error introduced by collisions, and while increasing $m$ and $d$ improves accuracy, the expected error for a single item's count estimation is primarily influenced by $\frac{n - |a|}{m}$. This reflects the average additional counts introduced by other items hashing to the same buckets as $a$, not a precise division of error by $md$.
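
A small simulation can illustrate why the expected minimum error does not simply scale as $\frac{1}{d}$. This is a sketch under stated assumptions: all other items are distinct with count 1, and each row assigns them to buckets independently and uniformly. Other frequency distributions (for example, a few very heavy items) would give different numbers. The function name `expected_min_error` is illustrative.

```python
import random

def expected_min_error(n_other, m, d, trials, rng):
    """Average, over trials, of the minimum collision count across d independent rows."""
    total = 0
    for _ in range(trials):
        row_errors = []
        for _row in range(d):
            # Bucket 0 stands in for the bucket of item a in this row.
            collisions = sum(1 for _ in range(n_other) if rng.randrange(m) == 0)
            row_errors.append(collisions)
        total += min(row_errors)
    return total / trials

rng = random.Random(0)    # fixed seed so the run is repeatable
n_other, m = 200, 10      # single-row expected error is 200 / 10 = 20
results = {}
for d in (1, 2, 4, 8):
    results[d] = expected_min_error(n_other, m, d, trials=500, rng=rng)
    print(f"d={d}: simulated min error {results[d]:.2f}, "
          f"naive (n-|a|)/(md) = {n_other / (m * d):.2f}")
```

Under these assumptions the simulated minimum shrinks as $d$ grows, but stays well above the naive $\frac{n - |a|}{md}$ value, which matches the interpretation above.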

# Comparison with Sample-based Approaches

The efficiency of the Count-Min Sketch in estimating item frequencies within data streams presents distinct advantages over traditional sample-based approaches, particularly in terms of error dynamics and the ability to detect non-existent elements.

## Error Dynamics

In the Count-Min Sketch, for a fixed depth $d$, the expected error in frequency estimates decreases roughly inversely with the width $m$, and hence with the total number of counters $k = md$, where $m$ is the number of buckets per hash function and $d$ is the number of hash functions. This relationship suggests a favorable error reduction as the memory allocated to the sketch (represented by $k$) increases. In contrast, the expected error in a sample-based approach typically decreases in proportion to $\frac{1}{\sqrt{k}}$, where $k$ represents the sample size. This slower rate of improvement means that halving the error requires roughly four times as many samples, which may not be feasible for large datasets or streaming data.
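
The square-root rate for sampling follows from the variance of a binomial count. As a sketch, assume a uniform sample of size $k$ drawn with replacement from the $n$ items, let $p = \frac{|a|}{n}$, and let $X$ be the number of times $a$ appears in the sample (the symbols $p$ and $X$ are introduced here for illustration):

$$X \sim \mathrm{Binomial}(k, p), \qquad \widehat{|a|} = \frac{n}{k} X, \qquad E\left[\widehat{|a|}\right] = np = |a|$$

$$\mathrm{sd}\left(\widehat{|a|}\right) = \frac{n}{k}\sqrt{k\,p\,(1-p)} = n\sqrt{\frac{p\,(1-p)}{k}}$$

The standard deviation of the estimate therefore shrinks like $\frac{1}{\sqrt{k}}$, whereas the single-row collision error of the sketch, $\frac{n - |a|}{m}$, shrinks like $\frac{1}{m}$.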

## Detection of Non-existent Elements

One of the notable advantages of the Count-Min Sketch is its deterministic capability to identify the absence of elements. Specifically, if the Count-Min Sketch reports a zero count for an element, it definitively did not occur in the dataset. This contrasts sharply with sample-based approaches, where the absence of an element in the sample does not necessarily imply its absence in the entire dataset due to the inherent limitations of sampling. This property of the Count-Min Sketch is particularly valuable for anomaly detection, as it can reliably identify items that have not been observed.

## Estimation of Frequent vs. Rare Items

The Count-Min Sketch excels in estimating the counts of frequent elements within a dataset. Its structure and algorithm keep the relative error small for items that occur frequently, making it an ideal tool for applications focused on identifying and monitoring common patterns or behaviors. The collision error is additive and does not depend on the item's own count, so the more frequently an item appears, the smaller that error is relative to its true count.

In contrast, sample-based approaches inherently possess the ability to capture rare events more clearly. Since each item in the sample is directly observed, rare items that are included in the sample can be identified with certainty. However, the catch is in the "if": rare items must be part of the sample to be detected, which becomes increasingly unlikely as the rarity increases unless specific sampling strategies aimed at detecting rare events are employed.
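
To make the "if" concrete, the chance that a sample misses an item entirely can be computed directly. The sketch below assumes a uniform sample of size `k` drawn with replacement from `n` items; the function name and the parameter values are illustrative.

```python
def miss_probability(count, n, k):
    """Probability that an item occurring `count` times among n items is absent
    from a uniform sample of size k drawn with replacement."""
    return (1 - count / n) ** k

n = 1_000_000   # stream length (illustrative)
k = 10_000      # sample size (illustrative)
for count in (1, 10, 100, 1000):
    print(f"count={count}: P(missed) = {miss_probability(count, n, k):.4f}")
```

With these values, an item that occurs once is missed about 99% of the time, while an item that occurs 1000 times is almost never missed.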

## Conclusion

While the Count-Min Sketch offers substantial benefits in error reduction, non-existent element detection, and frequency estimation for common items, it is important to choose the data summarization strategy that best fits the specific needs of the application. For tasks requiring precise identification of rare items or events, a carefully designed sampling approach might be preferable. Conversely, for applications where the primary goal is to monitor frequent patterns or detect anomalies in large-scale or streaming data, the Count-Min Sketch provides a highly efficient and effective solution.

# Count-Min Sketch is NOT Robust

### Analysis
Let $n$ be the length of the stream and let $f_s$ denote the true frequency of element $s$. Fix an element $i$ and a hash function $h_j$, and assume that $h_j(i) = b$.

Let’s compute the expectation of the value of the bucket $b$ that $i$ hashes to using the $j$-th hash function:
$$E[C(h_j(i))] = E\left[\sum_{s:h_j(s)=b} f_s\right] \leq f_i + \frac{n - f_i}{m}$$

since the frequencies of all elements other than $i$ sum to $n - f_i$, and each of those elements has probability $\frac{1}{m}$ of mapping to a particular bucket. Since the count-min sketch only overestimates frequencies (i.e. $C(h_j(i)) - f_i \geq 0$), we may apply Markov’s inequality to the non-negative error $C(h_j(i)) - f_i$ in conjunction with the above inequality to get, for any $\epsilon > 0$:
$$P\left(C(h_j(i)) \geq f_i + \frac{\epsilon (n - f_i)}{m}\right) \leq \frac{1}{\epsilon}$$

Since we select each hash function $j \in [d]$ independently, and the estimate $\hat{f}_i = \min_{j \in [d]} C(h_j(i))$ exceeds a threshold only if every row does, we have that:
$$P\left(\hat{f}_i \geq f_i + \frac{\epsilon (n - f_i)}{m}\right) = \prod_{j \in [d]} P\left(C(h_j(i)) \geq f_i + \frac{\epsilon (n - f_i)}{m}\right) \leq \left(\frac{1}{\epsilon}\right)^d$$

Setting this bound equal to a target failure probability $\delta$ and fixing $d$, we can solve for how large $\epsilon$ gets: $\epsilon = \delta^{-1/d}$.

For example, if we set $\delta = 0.05$ and $d = 10$, then $0.05 = \left(\frac{1}{\epsilon}\right)^{10}$, which gives $\epsilon = \frac{1}{0.05^{\frac{1}{10}}} = 20^{\frac{1}{10}} \approx 1.35$. That means that, with probability at least $0.95$, the error in the returned count is below roughly $1.35 \cdot \frac{n - f_i}{m}$, and this analysis guarantees nothing tighter. Assuming $m \ll n$, the estimation error could be significant. We could say that **Count-Min Sketch is not robust.**
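
The arithmetic in this example is easy to reproduce. The helper below solves $\delta = \left(\frac{1}{\epsilon}\right)^d$ for $\epsilon$; the function name `epsilon_for` and the values of `n`, `m`, and `f_i` are assumptions chosen only for illustration.

```python
def epsilon_for(delta, d):
    """Solve delta = (1 / epsilon) ** d for epsilon."""
    return delta ** (-1.0 / d)

eps = epsilon_for(0.05, 10)
print(round(eps, 2))  # 1.35

# Illustrative stream: n items, m buckets per row, item i seen f_i times.
n, m, f_i = 1_000_000, 1_000, 100
error_bound = eps * (n - f_i) / m
print(f"error bound at 95% confidence: about {error_bound:.0f} counts")
```

For an item with a true count of 100, a bound of more than a thousand counts is far larger than the quantity being estimated, which is the sense in which the estimate can be unreliable when $m \ll n$.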


```python
import hashlib

class CountMinSketch:
    def __init__(self, w, d):
        self.w = w  # Width of the sketch (number of buckets per row, m)
        self.d = d  # Depth of the sketch (number of hash functions)
        self.table = [[0] * w for _ in range(d)]
        self.seed = list(range(d))  # Seeds for hash functions

    def hash(self, item, seed):
        # Derive one hash function per seed by salting an MD5 digest
        hash_value = int(hashlib.md5((str(item) + str(seed)).encode()).hexdigest(), 16)
        return hash_value % self.w

    def add(self, item):
        # Increment one counter per row, at the position given by that row's hash
        for i in range(self.d):
            index = self.hash(item, self.seed[i])
            self.table[i][index] += 1

    def count(self, item):
        # Estimate the count by taking the minimum value among all hash functions
        estimates = [self.table[i][self.hash(item, self.seed[i])] for i in range(self.d)]
        return min(estimates)

# Example usage
w = 10  # Width; a larger w lowers the collision error
d = 5   # Depth; a larger d lowers the failure probability

cms = CountMinSketch(w, d)

# Simulate adding elements
elements = ['apple', 'banana', 'orange', 'apple', 'banana']
for el in elements:
    cms.add(el)

# Query some counts
print("Count of 'apple':", cms.count('apple'))
print("Count of 'banana':", cms.count('banana'))
print("Count of 'orange':", cms.count('orange'))
print("Count of 'grape':", cms.count('grape'))  # Not added, so the estimate is 0 or a small overestimate
```

```text
Count of 'apple': 2
Count of 'banana': 2
Count of 'orange': 1
Count of 'grape': 0
```


Because a Count-Min Sketch can over-count but never under-counts the occurrences of an element, a zero estimate proves that the element was never added. The same idea underlies the Bloom filter, a closely related structure for checking set membership that stores single bits instead of counters. Consequently, a Bloom filter can produce false positives but never false negatives.
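
A minimal Bloom filter sketch shows how close the two structures are. It reuses the salted-MD5 hashing idea from the `CountMinSketch` class above, but, as is standard for Bloom filters, all hash functions share one bit array instead of each owning a row. The class and method names (`BloomFilter`, `might_contain`) and the parameter values are illustrative assumptions, not a reference implementation.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, d):
        self.m = m                  # Number of bits in the shared array
        self.d = d                  # Number of hash functions
        self.bits = [False] * m

    def _positions(self, item):
        # One position per seed, derived from a salted MD5 digest
        for seed in range(self.d):
            digest = hashlib.md5((str(item) + str(seed)).encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        # Set (rather than increment) every position the item hashes to
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means present or a false positive
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=64, d=3)
for word in ["apple", "banana", "orange"]:
    bf.add(word)

print(bf.might_contain("apple"))  # True: added items are always reported as present
print(bf.might_contain("grape"))  # usually False, but a false positive is possible
```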