## Tracking Recent Popular Items



Given a stream of items $x_1, x_2, \ldots$, what are the most popular items?

Without constraints, we simply count the frequency of each item, sort the items by frequency, and keep track of the ones with the highest frequencies.
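As a minimal sketch of this unconstrained approach (the function name `top_k_exact` and the sample stream are illustrative assumptions, not part of the notes), exact counting can be done with a dictionary of counts:

```python
from collections import Counter

def top_k_exact(stream, k):
    """Count every item exactly, then return the k most frequent (item, count) pairs."""
    counts = Counter(stream)          # one counter per distinct item
    return counts.most_common(k)      # sorted by count, highest first

stream = ["a", "b", "a", "a", "c", "c", "b", "a"]
print(top_k_exact(stream, 1))
```

The memory cost is one counter per distinct item, which is exactly what becomes a problem on a large stream.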

What if we can't store all frequencies?

- Count frequencies of a small set of random items of the stream.
    - we'll have to keep track of all frequencies.
    - we'll have to keep track of top-10 highest frequencies.
 
- Another idea is to use an exponentially decaying window.

- At time t, the score of each item $x$ is defined as:

$score(x) = \sum_{i=0}^t a_i(1-c)^{t-i}$

Given an item $x$, $a_i = 1$ if the i-th item is $x$.  Else, $a_i = 0$.
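The definition can be evaluated directly, which is a useful reference point before looking at faster updates. The sketch below assumes the stream is held in a Python list and that $t$ is the index of the last item; `score_direct` is a hypothetical helper name.

```python
def score_direct(stream, x, c):
    """Evaluate score(x) = sum_i a_i * (1 - c)^(t - i) at the last time step t."""
    t = len(stream) - 1
    # a_i is 1 exactly when the i-th item equals x, so only those terms are summed.
    return sum((1 - c) ** (t - i) for i, item in enumerate(stream) if item == x)

stream = list("abaac")                 # times 0..4
print(score_direct(stream, "a", 0.1))  # decayed score of "a"
print(score_direct(stream, "a", 0))    # with c = 0 this is the plain frequency
```

This recomputes the whole sum each time, so it needs the full history of the stream.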

If $c=0$, then $score(x)$ is simply the frequency of $x$.



If $c$ is a small positive number (e.g. ${1 \over 10^6}$), then what does $score(x)$ mean?

At time 0, $score(x) = a_0$

At time 1, $score(x) = a_0(1-c) + a_1$ 

At time 2, $score(x) = a_0(1-c)^2 + a_1(1-c) + a_2$ 

At time 3, $score(x) = a_0(1-c)^3 + a_1(1-c)^2 + a_2(1-c) + a_3$ 

At time 4, $score(x) = a_0(1-c)^4 + a_1(1-c)^3 + a_2(1-c)^2 + a_3(1-c) + a_4$ 

Say at time 5, the item is b.  How do we compute score(b) at time 5?


At time 4, $score(b) = a_0(1-c)^4 + a_1(1-c)^3 + a_2(1-c)^2 + a_3(1-c) + a_4$ 

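If we had to compute it "from the beginning", score(b) at time 5 would be
$a_0(1-c)^5 + a_1(1-c)^4 + a_2(1-c)^3 + a_3(1-c)^2 + a_4(1-c) + a_5$ 

Every term except the last is the corresponding time-4 term multiplied by $(1-c)$, and $a_5 = 1$ because the item at time 5 is b. So the new score follows from the old one:

$score(b) = score(b) \cdot (1-c) + 1$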
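A short check that the incremental rule agrees with the sum computed from the beginning. This is only a sketch: the stream `abaacb` (where the item at time 5 is b) and the value `c = 0.1` are assumed for illustration, and `score_incremental` is a hypothetical name.

```python
def score_incremental(stream, x, c):
    """Apply score = score * (1 - c) + a_t once per arriving item."""
    score = 0.0
    for item in stream:
        a_t = 1 if item == x else 0
        score = score * (1 - c) + a_t
    return score

stream = list("abaacb")   # times 0..5; the item at time 5 is b
c = 0.1
t = len(stream) - 1

# Sum "from the beginning", term by term, as in the definition of score(x).
from_scratch = sum((1 - c) ** (t - i) for i, item in enumerate(stream) if item == "b")
incremental = score_incremental(stream, "b", c)

assert abs(from_scratch - incremental) < 1e-12
```

The incremental version only needs the previous score, not the history of the stream.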

Example:

0  1  2  3  4  5  6  7
a  b  a  a  c  c  b  a

time 0
- score(a) = 1
- score(b) = 0
- score(c) = 0

time 1
- score(a) = 1(1-c) + 0
- score(b) = 0 + 1
- score(c) = 0 + 0

time 2
- score(a) = 1(1-c)^2 + 0 + 1
- score(b) = 0 + 1(1-c) + 0
- score(c) = 0 + 0 + 0

time 3
- score(a) = 1(1-c)^3 + 0 + 1(1-c) + 1
    - score(a) = score(a) * (1-c) + 1 = (1(1-c)^2 + 0 + 1) * (1-c) + 1
- score(b) = 0 + 1(1-c)^2 + 0 + 0 = score(b) * (1-c) + 0
- score(c) = 0 + 0 + 0 + 0


time 4
- score(a) = 1(1-c)^4 + 0        + 1(1-c)^2 + 1(1-c) + 0
- score(b) = 0        + 1(1-c)^3 + 0        + 0      + 0
- score(c) = 0        + 0        + 0        + 0      + 1

Update scores:

For each item x:
    - If x_t == x, score(x) = score(x)(1-c) + 1
    - Else, score(x) = score(x)(1-c)
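The update rule can be sketched as a function over a dictionary of scores. Two assumptions here: items not yet seen are simply absent from the dictionary (equivalent to a score of 0), and the decay constant is set to 0.1 purely for illustration. `update_scores` is a hypothetical name.

```python
def update_scores(scores, item, c):
    """Decay every known score by (1 - c), then add 1 to the arriving item's score."""
    for x in scores:
        scores[x] *= (1 - c)
    scores[item] = scores.get(item, 0.0) + 1.0
    return scores

scores = {}
for item in "abaac":          # times 0..4 of the example stream
    update_scores(scores, item, c=0.1)

print(scores)                 # scores of a, b, c at time 4
```

With a decay constant of 0.1, the time-4 values match the expansions in the example: score(a) is $0.9^4 + 0.9^2 + 0.9$, score(b) is $0.9^3$, and score(c) is 1.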





If c = 0, scores are just frequencies. Highest scores represent most frequent items.

If c is very small, 1-c is very close to 1, so scores approximate frequencies while favoring recent items.  High scores represent recent and popular items.

When item x_t arrives, for each item x that is either already tracked or equal to x_t:
    - If x_t == x:
        - if score(x) doesn't exist, score(x) = 1
        - else: score(x) = score(x)(1-c) + 1
    - Else
        - if score(x) doesn't exist, do nothing
        - else: score(x) = score(x)(1-c)
    - If score(x) < 0.5:
        - remove score(x). Stop keeping track of x.
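A minimal sketch of the steps above, assuming the tracked scores live in a dictionary and using the 0.5 threshold from the notes. The name `update_tracked` and the small test stream are illustrative assumptions.

```python
def update_tracked(scores, item, c, threshold=0.5):
    """One time step: decay tracked scores, credit the arriving item, drop low scores."""
    for x in list(scores):
        scores[x] *= (1 - c)                     # every tracked item decays
    scores[item] = scores.get(item, 0.0) + 1.0   # a new item starts at 1
    for x in [x for x, s in scores.items() if s < threshold]:
        del scores[x]                            # stop tracking x
    return scores

scores = {}
for item in ["a", "b", "b"]:
    update_tracked(scores, item, c=0.5)

print(scores)   # "a" has decayed to 0.25 and is no longer tracked
```

An item that was dropped and later reappears starts again from a score of 1, so the scores of rarely seen items are approximations.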


The number of items we keep track of is less than ${2 \over c}$.

Because of the exponential decay, the sum of all scores converges to a constant, namely ${1 \over c}$.
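    - This is because the item that arrived $j$ steps ago contributes $(1-c)^j$ to the total, so the sum of all scores is $1+(1-c)+(1-c)^2+\cdots$, which converges to ${1 \over 1-(1-c)} = {1 \over c}$.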

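We keep track of high scores only, namely items with scores of at least 0.5.

Since all scores together sum to at most ${1 \over c}$ and every tracked item has a score of at least 0.5, at most ${2 \over c}$ items can be tracked at once. These are the most recent popular items.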

If c = 1/10, then at any time, we keep track of no more than 20 different items.

1-c = 0.9. This is the exponential decay factor.
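The bound can be checked empirically. The sketch below is an assumption-laden experiment, not a proof: it uses a seeded random stream over 1000 distinct item ids, c = 0.1, and a hypothetical helper `simulate` that records the largest number of tracked items and the largest total score seen.

```python
import random

def simulate(stream, c, threshold=0.5):
    """Run the decaying-score tracker; return (max items tracked, max sum of scores)."""
    scores = {}
    max_tracked = 0
    max_total = 0.0
    for item in stream:
        for x in scores:
            scores[x] *= (1 - c)
        scores[item] = scores.get(item, 0.0) + 1.0
        # Keep only items whose score is still at least the threshold.
        scores = {x: s for x, s in scores.items() if s >= threshold}
        max_tracked = max(max_tracked, len(scores))
        max_total = max(max_total, sum(scores.values()))
    return max_tracked, max_total

rng = random.Random(0)                                  # fixed seed for repeatability
stream = [rng.randrange(1000) for _ in range(10_000)]
max_tracked, max_total = simulate(stream, c=0.1)
print(max_tracked, max_total)
```

For c = 0.1 the tracked count should stay below 20 and the total score should not exceed 10, whatever the stream looks like.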