# Problem

Given two sets (here, lists that may contain repeated elements), decide whether they are the same up to a permutation.

$$
S_1 = [1, 2, 6, 3, 2, 1, 7]
$$
$$
S_2 = [2, 1, 6, 3, 7, 1, 2]
$$

# Hash and Sort

|      | Hash | Sort |
|------|------|------|
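| Time | $O(\lvert S_1 \rvert + \lvert S_2 \rvert)$ | $O(\lvert S_1 \rvert \log \lvert S_1 \rvert + \lvert S_2 \rvert \log \lvert S_2 \rvert)$ |
| Space | $O(\min(\tilde{S_1}, \tilde{S_2}))$, taking $\tilde{S}$ as the number of distinct elements of $S$ | $O(1)$ with an in-place sort, $O(\lvert S_1 \rvert + \lvert S_2 \rvert)$ otherwise |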
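
As a minimal sketch of the two approaches in the table, assuming hashable and mutually comparable elements (the function names `same_by_hash` and `same_by_sort` are illustrative only):

```python
from collections import Counter

def same_by_hash(s1, s2):
    """Compare per-element counts stored in hash tables."""
    return Counter(s1) == Counter(s2)

def same_by_sort(s1, s2):
    """Compare sorted copies. sorted() allocates new lists, so this is the non-in-place variant."""
    return sorted(s1) == sorted(s2)

s1 = [1, 2, 6, 3, 2, 1, 7]
s2 = [2, 1, 6, 3, 7, 1, 2]
print(same_by_hash(s1, s2), same_by_sort(s1, s2))  # True True
```

This version builds a counter for each input, which is simpler to read but uses more space than counting one input and decrementing with the other.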

# Optimizations (Early Stopping)

* If one of the sets is smaller than the other, immediately return False.
* If the sums of the two sets differ, immediately return False.
* If the max or the min of the two sets differs, immediately return False.
...

These are all based on the comparison of frequency moments.
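
The checks listed above could be combined roughly as follows. This is only a sketch that assumes numeric elements; `quick_reject` is a hypothetical helper name, and a `False` result means "inconclusive", not "equal".

```python
def quick_reject(s1, s2):
    """Return True if a cheap check already proves the two inputs differ."""
    if len(s1) != len(s2):
        return True
    if not s1:  # both inputs are empty, nothing left to compare
        return False
    if sum(s1) != sum(s2):
        return True
    if min(s1) != min(s2) or max(s1) != max(s2):
        return True
    return False  # inconclusive: a full hash or sort comparison is still needed

print(quick_reject([1, 2, 3], [1, 2, 4]))        # True (sums differ)
print(quick_reject([1, 4, 4, 7], [1, 3, 5, 7]))  # False (all cheap checks pass)
```

The second call shows why these checks are only an optimization: the two inputs differ, yet size, sum, min, and max all agree.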


# Frequency Moments

$$ S = [x_1, x_2, ..., x_n], x_i \in X$$

$$ F_k(S) = \sum_{x \in X} f_x^k $$

where $f_x$ is the frequency of $x$ in $S$. For convenience, we define $0^0 = 0$.

For example, if $S = [a, a, b, c]$, then $f_a = 2$, $f_b = 1$, and $f_c = 1$.

$$F_0(S) = 2^0 + 1^0 + 1^0 = 3$$
This represents the number of distinct elements in $S$.

$$F_1(S) = 2^1 + 1^1 + 1^1 = 4$$
This represents the number of elements in $S$.

$$F_2(S) = 2^2 + 1^2 + 1^2 = 6$$
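This is a second-order statistic, of the same order as a variance: for a fixed number of elements, it grows as the frequencies become more uneven.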

Also, by convention, we define

$$F_{\infty}(S) = \max_{x \in X} f_x$$
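
For reference, the exact moments can be computed directly when all frequencies fit in memory. A small sketch (`frequency_moment` is an illustrative name, and `float("inf")` is used here to request $F_\infty$):

```python
from collections import Counter

def frequency_moment(s, k):
    """Exact F_k(S). Only elements that occur in s are summed, so 0**0 never arises."""
    freqs = Counter(s).values()
    if k == float("inf"):
        return max(freqs, default=0)
    return sum(f ** k for f in freqs)

s = ["a", "a", "b", "c"]
print([frequency_moment(s, k) for k in (0, 1, 2, float("inf"))])  # [3, 4, 6, 2]
```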


$F_1$ is easy to compute with a single counter, but $F_0, F_2, \dots$ are hard to compute exactly without storing per-element frequencies.


# How to measure $F_0$ and $F_2$ with limited space?

## Attempt 1: Random Sampling

We can estimate $F_0, F_2, ...$ by random sampling.

Let $M$ be the space budget of the algorithm and $|X|$ the number of distinct elements that can appear in $S$. Define a hash function $h: X \rightarrow [0, |X|-1]$. Then $F_0, F_2, \dots$ can be estimated as follows.

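```
from collections import defaultdict
f = defaultdict(int)  # f[j]: count of the element hashed to slot j
for x_i in S:
    # Track only the elements whose hash falls in the first M slots.
    if h(x_i) < M:
        f[h(x_i)] += 1
```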

Here `f[h(x)]` holds the frequency of $x$ in $S$. Only the frequencies of at most $M$ elements are kept, so the algorithm takes $O(M)$ space and $O(N)$ time, where $N$ is the length of $S$.

Estimated frequency moments are as follows.

$$\hat{F_k}(S) = \frac{|X|}{M} \sum_{i=0}^{M-1} f_i^k$$
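
A runnable sketch of this estimator, under stated assumptions: the elements are integers in `0..universe_size-1`, and the hash $h$ is simulated by a seeded random permutation. The names `estimate_moment`, `universe_size`, and `perm` are hypothetical. Storing the permutation itself takes $O(|X|)$ space, so a real implementation would need a compact hash function instead.

```python
import random
from collections import defaultdict

def estimate_moment(stream, universe_size, m, k, seed=0):
    """Estimate F_k by tracking only the elements whose hash falls below m."""
    # Assumption: a seeded random permutation of the universe stands in for h.
    perm = list(range(universe_size))
    random.Random(seed).shuffle(perm)
    f = defaultdict(int)
    for x in stream:
        if perm[x] < m:
            f[perm[x]] += 1
    # Only observed slots are summed, which matches the 0^0 = 0 convention.
    return universe_size / m * sum(c ** k for c in f.values())

stream = [0, 0, 1, 2]
# With m equal to the universe size nothing is dropped, so the estimate is exact.
print(estimate_moment(stream, universe_size=4, m=4, k=2))  # 6.0
```

With `m` smaller than `universe_size`, the result depends on which elements happen to be tracked, which is where the error discussed next comes from.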

If $k \ge 1$, then the error is as follows.

$$
Error^2 \propto \frac{Var(f_x^k)}{M}
$$

From Jensen's inequality 

$$
\Psi(E[X]) \le E[\Psi(X)]
$$

where $\Psi$ is a convex function. Taking $\Psi(x) = x^k$, which is convex on $x \ge 0$ for $k \ge 1$, we can derive the following.

$$ Var(f_x)^k \le Var(f_x^k)$$

Therefore, the error grows exponentially with $k$, which makes it hard to estimate $F_k$. For example, estimating $F_\infty$ (the maximum frequency) is very hard if $M$ is small.
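
To get a feel for this growth, the following sketch computes the population variance of $f_x^k$ for the small example $S = [a, a, b, c]$. It is a numerical illustration for one input, not a proof of the inequality.

```python
from collections import Counter
from statistics import pvariance

s = ["a", "a", "b", "c"]
freqs = list(Counter(s).values())  # frequencies 2, 1, 1

# Population variance of f_x**k for increasing k.
variances = [pvariance([f ** k for f in freqs]) for k in (1, 2, 3, 4)]
for k, v in zip((1, 2, 3, 4), variances):
    print(k, v)
```

Even with a maximum frequency of only 2, the variance increases at every step from $k = 1$ to $k = 4$.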

If $k < 1$, then $x^k$ is no longer convex, so Jensen's inequality does not apply in this form. Therefore, it is not mathematically guaranteed that the error is small.


## Attempt 2: Count-Min Sketch

$$
\hat{F_k}(S) = \sum_{x \in X} \hat{f_x}^k = \sum_{x \in X} (f_x + \epsilon_x)^k
$$

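If $k = 0$, the expression is not well behaved: $(f_x + \epsilon_x)^0 = 1$ whenever $f_x + \epsilon_x > 0$, so an absent element ($f_x = 0$) with a nonzero collision error $\epsilon_x$ is still counted as present.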
If $k \ge 1$, then the error grows exponentially with $k$.
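
For context, a minimal Count-Min sketch might look like the following. This is an illustrative sketch, not a tuned implementation: the class layout and the salted use of Python's built-in `hash` as a stand-in for independent hash functions are assumptions.

```python
import random

class CountMinSketch:
    """Minimal Count-Min sketch: `depth` rows of `width` counters each."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        # Assumption: one random salt per row stands in for independent hash functions.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, x):
        return hash((self.salts[row], x)) % self.width

    def add(self, x):
        for row in range(len(self.salts)):
            self.table[row][self._index(row, x)] += 1

    def estimate(self, x):
        # Collisions only add counts, so the minimum over rows is the tightest estimate.
        return min(self.table[row][self._index(row, x)]
                   for row in range(len(self.salts)))

cms = CountMinSketch(width=8, depth=3)
for x in ["a", "a", "b", "c"]:
    cms.add(x)
print(cms.estimate("a"))  # at least 2; larger only if "a" collides in every row
```

Because counters are only ever incremented, the estimate never falls below the true frequency, which is why $\epsilon_x \ge 0$ in the expression above.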

# Conclusion

With these approaches, estimating $F_k$ accurately for every $k$ is not possible in $O(M)$ space when $M \ll |X|$: the size of the error depends on $k$.