# Problem

Consider a continuous stream of elements arriving one at a time:
$$ A = \{a, b, a, a, c, b, a, c, a, d, d, d\} $$

We want to estimate the number of occurrences of each element using a Count-Min Sketch.

# Count-Min Sketch

Suppose we use two hash functions, each mapping an element to one of $m = 2$ buckets:

$$h_1 = \{(a, 1), (b, 2), (c, 2), (d, 1)\}$$
$$h_2 = \{(a, 2), (b, 1), (c, 1), (d, 1)\}$$

Each row of the table holds the $m = 2$ counters of one hash function:

|       | bucket 1 | bucket 2 |
|-------|---|---|
| $h_1$ | 8 | 4 |
| $h_2$ | 7 | 5 |

Thus, the estimated count of each element is the minimum over the counters it hashes to (the last value on each line is the true count):

$$ \hat{a} = \min(8, 5) = 5 = 5 $$
$$ \hat{b} = \min(4, 7) = 4 \neq 2 $$
$$ \hat{c} = \min(4, 7) = 4 \neq 2 $$
$$ \hat{d} = \min(8, 7) = 7 \neq 3 $$
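The update and query steps above can be sketched in a few lines of Python. This is a minimal illustration that hard-codes the two hash tables from the example as dictionaries; a real sketch would use actual hash functions, and the names `table` and `estimate` are chosen here for illustration only.

```python
from collections import Counter

# Stream and hash tables copied from the example above (buckets are 1-based).
stream = ["a", "b", "a", "a", "c", "b", "a", "c", "a", "d", "d", "d"]
h1 = {"a": 1, "b": 2, "c": 2, "d": 1}
h2 = {"a": 2, "b": 1, "c": 1, "d": 1}
hashes = [h1, h2]
m = 2  # buckets per row

# One row of counters per hash function.
table = [[0] * m for _ in hashes]
for x in stream:
    for row, h in zip(table, hashes):
        row[h[x] - 1] += 1  # shift to a 0-based index


def estimate(x):
    """Count-Min estimate: the smallest counter that x maps to."""
    return min(row[h[x] - 1] for row, h in zip(table, hashes))


true_counts = Counter(stream)
estimates = {x: estimate(x) for x in sorted(true_counts)}
print(table)      # [[8, 4], [7, 5]]
print(estimates)  # {'a': 5, 'b': 4, 'c': 4, 'd': 7}
```

Since counters are only ever incremented, an estimate can never fall below the true count. The error is always an overestimate.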

A weakness of Count-Min Sketch is that the error is acceptable in expectation, but individual estimates can be far off.
For a single hash function, the expected value of the counter that the $i$-th element hashes to is

$$ E[X_i] = |a_i| + \frac{|A| - |a_i|}{m} $$

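where $|a_i|$ is the true number of occurrences of the $i$-th element, $|A|$ is the length of the stream, and $m$ is the number of buckets.
The second term is the expected contribution of the other elements, assuming each of them lands in the same bucket with probability $1/m$, so the expected error for $|a_i|$ is $\frac{|A| - |a_i|}{m}$.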
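For the small example, the expectation formula can be checked exactly by enumerating every possible hash function from the four elements to $m = 2$ buckets and averaging the counter each element lands in. This is only a sanity check under the assumption that all $m^4 = 16$ hash functions are equally likely; the helper name `expected_counter` is made up for this sketch.

```python
from itertools import product

counts = {"a": 5, "b": 2, "c": 2, "d": 3}  # true counts from the stream above
m = 2                                       # number of buckets
total = sum(counts.values())                # |A| = 12
keys = list(counts)


def expected_counter(target):
    """Average, over all m**len(keys) hash functions, of target's counter."""
    acc = 0
    n_functions = 0
    for assignment in product(range(m), repeat=len(keys)):
        h = dict(zip(keys, assignment))
        # The counter holds every element that shares target's bucket.
        acc += sum(c for k, c in counts.items() if h[k] == h[target])
        n_functions += 1
    return acc / n_functions


for k in keys:
    formula = counts[k] + (total - counts[k]) / m
    print(k, expected_counter(k), formula)
```

For element `a` both columns give 8.5, that is, the true count 5 plus an expected error of $7/2$.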

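Using $d$ hash functions, we roughly model the error as

$$E[error] = \frac{|A| - |a_i|}{m d}$$

This is only a heuristic: taking the minimum over $d$ rows does not divide the expected error by exactly $d$. What the extra rows reliably buy is a smaller failure probability. By Markov's inequality, a single row overshoots by more than $e$ times its expected error with probability at most $1/e$, so with independent hash functions all $d$ rows do so with probability at most $e^{-d}$.

Any low-frequency element that collides with high-frequency elements in every row will have an over-reported frequency.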


See:

* <https://web.stanford.edu/class/archive/cs/cs166/cs166.1166/lectures/11/Slides11.pdf>
* <https://web.stanford.edu/class/cs166/>



# Linear Frequency Sketches

Count-Min Sketch can be modeled as a special case of Linear Frequency Sketches.

For example, $h_1$ can be expressed as a matrix with one row per bucket and one column per element $(a, b, c, d)$:

$$
\begin{bmatrix}
1 & 0 & 0 & 1 \\
0 & 1 & 1 & 0 \\
\end{bmatrix}
$$

Multiplying this matrix by the vector of true counts of $(a, b, c, d)$ gives the row of the table corresponding to the hash function $h_1$.

$$
\begin{bmatrix}
1 & 0 & 0 & 1 \\
0 & 1 & 1 & 0 \\
\end{bmatrix}
\begin{bmatrix}
5 \\
2 \\
2 \\
3 \\
\end{bmatrix}
=
\begin{bmatrix}
8 \\
4 \\
\end{bmatrix}
$$

Similarly, $h_2$ can be expressed as the matrix

$$
\begin{bmatrix}
0 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 \\
\end{bmatrix}
$$

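Therefore, the whole Count-Min Sketch can be obtained by stacking the per-hash matrices and multiplying the resulting $md \times N$ matrix by the $N \times 1$ vector of true counts, where $N$ is the number of distinct elements. Here $m = 2$, $d = 2$, and $N = 4$: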

$$
\begin{bmatrix}
1 & 0 & 0 & 1 \\
0 & 1 & 1 & 0 \\
0 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 \\
\end{bmatrix}
\begin{bmatrix}
5 \\
2 \\
2 \\
3 \\
\end{bmatrix}
=
\begin{bmatrix}
8 \\
4 \\
7 \\
5 \\
\end{bmatrix}
$$
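The stacked matrix can be built mechanically from the hash functions. The following sketch does so with plain Python lists; `sketch_matrix` and `matvec` are illustrative helper names, and the dictionaries stand in for real hash functions.

```python
keys = ["a", "b", "c", "d"]            # column order of G
f = [5, 2, 2, 3]                       # true frequency vector
h1 = {"a": 1, "b": 2, "c": 2, "d": 1}  # 1-based buckets, as above
h2 = {"a": 2, "b": 1, "c": 1, "d": 1}
m = 2


def sketch_matrix(hashes, keys, m):
    """Stack one m x len(keys) 0/1 block per hash function."""
    G = []
    for h in hashes:
        for bucket in range(1, m + 1):
            G.append([1 if h[k] == bucket else 0 for k in keys])
    return G


def matvec(G, v):
    """Plain matrix-vector product."""
    return [sum(g * x for g, x in zip(row, v)) for row in G]


G = sketch_matrix([h1, h2], keys, m)
f_hat = matvec(G, f)
print(G)      # [[1, 0, 0, 1], [0, 1, 1, 0], [0, 1, 1, 1], [1, 0, 0, 0]]
print(f_hat)  # [8, 4, 7, 5]
```

Each column of `G` contains exactly one 1 per hash function, which is what distinguishes a Count-Min matrix from a general sketching matrix.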

Usually, the number of categories is much larger than $md$, so the sketch can be viewed as a randomized dimensionality reduction. The Johnson-Lindenstrauss lemma states that a random linear map into a much lower dimension approximately preserves the pairwise distances between a finite set of points, so little information is lost.

$$
\hat{f} = G  f
$$

where

* $f \in \mathbb{R}^{N}$ is the true frequency vector and $N$ is the number of categories,
* $G \in \mathbb{R}^{md \times N}$ is the sketching matrix,
* $\hat{f} \in \mathbb{R}^{md}$ is the sketch vector.



# Histograms are also Linear Frequency Sketches

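A histogram is another Linear Frequency Sketch: each row of $G$ is a bin, each column is a distinct value, and a bin sums the counts of the values assigned to it.

For example, with 10 distinct values grouped into 4 bins, $G$ looks like: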

$$
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
\end{bmatrix}
$$
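A histogram matrix of this shape can be generated from the bin sizes alone. In the sketch below, `histogram_matrix` is an illustrative helper and the frequency vector `f` is made up for the example.

```python
def histogram_matrix(bin_sizes):
    """0/1 matrix whose row i sums the consecutive values that fall into bin i."""
    n = sum(bin_sizes)
    G, start = [], 0
    for size in bin_sizes:
        G.append([1 if start <= j < start + size else 0 for j in range(n)])
        start += size
    return G


G = histogram_matrix([3, 3, 3, 1])   # same shape as the matrix above
f = [4, 0, 1, 2, 2, 5, 0, 3, 1, 6]   # made-up counts for 10 distinct values
bins = [sum(g * x for g, x in zip(row, f)) for row in G]
print(bins)  # [5, 9, 4, 6]
```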

# Linearity of Linear Frequency Sketches

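Because the sketch is a linear map, the sketch of two combined streams is the sum of their sketches. For example, we can add two histograms built with the same $G$.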

$$
\hat{f} = G (f_1 + f_2) = G f_1 + G f_2 = \hat{f}_1 + \hat{f}_2
$$
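The same identity holds for the Count-Min matrix from the earlier example. The sketch below splits the example stream at an arbitrary point, sketches both parts separately, and checks that the element-wise sum matches the sketch of the full stream. The helper name `sketch` is chosen for illustration.

```python
from collections import Counter

keys = ["a", "b", "c", "d"]
G = [[1, 0, 0, 1],   # h1, bucket 1
     [0, 1, 1, 0],   # h1, bucket 2
     [0, 1, 1, 1],   # h2, bucket 1
     [1, 0, 0, 0]]   # h2, bucket 2


def sketch(stream):
    """Compute G f, where f is the frequency vector of the stream."""
    counts = Counter(stream)
    f = [counts[k] for k in keys]
    return [sum(g * x for g, x in zip(row, f)) for row in G]


stream = list("abaacbacaddd")
part1, part2 = stream[:6], stream[6:]   # arbitrary split point

merged = [s1 + s2 for s1, s2 in zip(sketch(part1), sketch(part2))]
print(sketch(part1), sketch(part2))  # [3, 3, 3, 3] [5, 1, 4, 2]
print(merged == sketch(stream))      # True
```

This is what makes such sketches convenient in distributed settings: each worker can sketch its own shard, and only the small sketch vectors need to be combined.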

# Answering Count Queries from $\hat{f}$

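We would like to answer count queries, that is, recover $f$, from the sketch $\hat{f}$.

$$
\hat{f} = G f
$$

If $G$ were square and invertible, we could simply compute

$$
f = G^{-1} \hat{f}
$$

In practice $G$ is an $md \times N$ matrix with $md$ much smaller than $N$, so it is not invertible and $G^T G$ is singular.
Adding a regularization term makes the problem well posed: $G^T G + \lambda I$ is invertible for any $\lambda > 0$.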

To find $f$, we solve

$$
\min_f ||G f - \hat{f}||^2 + \lambda ||f||^2
$$

where $\lambda > 0$ is the regularization strength (this is ridge regression).

The closed-form minimizer is
$$
f = (G^T G + \lambda I)^{-1} G^T \hat{f}
$$
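As a rough illustration, the closed form can be evaluated for the $4 \times 4$ Count-Min matrix from the earlier example. The sketch below uses only plain Python; `solve` is a small hand-written Gaussian elimination included for self-containment (in practice a linear algebra library would be used), and the value of `lam` is an arbitrary small choice.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def ridge_recover(G, f_hat, lam):
    """Return (G^T G + lam I)^{-1} G^T f_hat."""
    Gt = transpose(G)
    A = matmul(Gt, G)
    for i in range(len(A)):
        A[i][i] += lam
    rhs = [sum(g * y for g, y in zip(row, f_hat)) for row in Gt]
    return solve(A, rhs)


G = [[1, 0, 0, 1], [0, 1, 1, 0], [0, 1, 1, 1], [1, 0, 0, 0]]
f_hat = [8, 4, 7, 5]
estimate = ridge_recover(G, f_hat, lam=1e-6)  # lam: arbitrary small value
print([round(v, 3) for v in estimate])  # approximately [5.0, 2.0, 2.0, 3.0]
```

The columns for `b` and `c` are identical in this $G$, so the sketch cannot tell them apart. The ridge solution splits their combined count of 4 evenly, which happens to match the true counts here only because `b` and `c` both occur twice.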

Why does this work?

1) The power of randomized dimensionality reduction.
2) Distances between distributions are preserved in the lower dimension.

# Why does Randomized Dimensionality Reduction Work?

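* Concentration of measure

An orthogonal projection onto a lower-dimensional subspace can only shrink distances or leave them unchanged.
For example, if we project 3D data onto a 2D plane, the distance between two points in 2D is at most their distance in 3D.
In high dimensions, the shrinkage factor under a random projection concentrates tightly around its expected value, so after rescaling, distances are approximately preserved.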

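* Preservation of inner product

Assume $G$ has $m$ rows whose entries are independent with zero mean and unit variance (for example, standard Gaussian or $\pm 1$ entries), so that $E[G^T G] = m I$. Then

$$
\langle G x, G y \rangle = x^T G^T G y
$$
$$
E[x^T G^T G x] = x^T E[G^T G] x = x^T m I x = m ||x||^2
$$

The same argument with $y$ in place of the second $x$ gives $E[\langle G x, G y \rangle] = m \langle x, y \rangle$, so $G / \sqrt{m}$ preserves inner products and squared norms in expectation.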
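The expectation can be verified exactly on a tiny case. The sketch below assumes $G$ has independent $\pm 1$ entries (zero mean, unit variance), which is one choice satisfying $E[G^T G] = m I$, and enumerates all $2^{mN}$ such matrices instead of sampling. The vectors `x` and `y` are arbitrary.

```python
from itertools import product

m, N = 2, 3          # rows and columns of G (tiny, so we can enumerate)
x = [1, 2, 3]
y = [4, -1, 2]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


total_inner = 0
total_norm = 0
n_matrices = 0
for entries in product([-1, 1], repeat=m * N):
    G = [entries[r * N:(r + 1) * N] for r in range(m)]  # m rows of length N
    Gx = [dot(row, x) for row in G]
    Gy = [dot(row, y) for row in G]
    total_inner += dot(Gx, Gy)
    total_norm += dot(Gx, Gx)
    n_matrices += 1

print(total_inner / n_matrices, m * dot(x, y))  # 16.0 16
print(total_norm / n_matrices, m * dot(x, x))   # 28.0 28
```

Note that a single random $G$ can deviate substantially from these averages when $m$ is this small. The concentration argument is what makes one draw reliable for larger $m$.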

# Distances between Distributions are preserved in the lower dimension

* Kullback-Leibler divergence
* Hellinger distance

These distances are approximately preserved in the lower dimension.
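How well a particular sketch preserves these distances can be measured directly. The sketch below compares the two distances before and after applying the histogram bins from the earlier section; the distributions `p` and `q` are made up, and how much is preserved will depend on both the distributions and the choice of $G$.

```python
import math


def normalize(counts):
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hellinger(p, q):
    return math.sqrt(0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                               for pi, qi in zip(p, q)))


def apply_bins(bin_sizes, v):
    """Sum consecutive entries of v into bins (same as multiplying by G)."""
    out, start = [], 0
    for size in bin_sizes:
        out.append(sum(v[start:start + size]))
        start += size
    return out


p = normalize([1, 2, 3, 4, 5, 5, 4, 3, 2, 1])   # made-up distribution
q = normalize([3] * 10)                          # uniform distribution
bin_sizes = [3, 3, 3, 1]
p_hat, q_hat = apply_bins(bin_sizes, p), apply_bins(bin_sizes, q)

print("KL:       ", kl_divergence(p, q), kl_divergence(p_hat, q_hat))
print("Hellinger:", hellinger(p, q), hellinger(p_hat, q_hat))
```

For a histogram, which only merges categories, both distances can shrink but never grow (the data processing inequality), so the binned values are lower bounds on the original ones.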

See also: DistancesBetweenDistributions.ipynb