## Premise

- You have a stream of **unknown** size $n$ but a memory buffer of size only $k$, where $n \gg k$. How do we take a uniform random sample of $k$ elements from this stream?
 
- Notes
    - $n$ is unknown, so we cannot use the naive approach of picking $k$ random indices in advance.
    - $k$ is much smaller than $n$, so we cannot hold the whole stream in memory and sample from it afterwards.

- Solution overview:
    - Create a **reservoir** of size $k$, and call this $R$
    - For every element from stream $S$, we keep a zero-based counter $\text{index}$ of its position
    - When $0 \le \text{index} \le k-1$, add the element to $R$
    - For each element after the $k$-th element ($\text{index} \ge k$), generate a uniform random integer $i$ where $0 \le i \le \text{index}$
    - If $i \le k-1$, replace $R[i]$ with $S[\text{index}]$; otherwise discard the element
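
The steps above can be sketched as a function over any iterable, so the length of the stream never needs to be known. This is a minimal sketch under stated assumptions: the name `reservoir_sample` and the optional `rng` parameter are illustrative and not part of the original notes.

```python
import random
from itertools import islice

def reservoir_sample(stream, k, rng=random):
    """Return up to k items drawn uniformly at random from an iterable of unknown length."""
    it = iter(stream)
    # Fill phase: the first k elements (indices 0 .. k-1) go straight into the reservoir.
    reservoir = list(islice(it, k))
    # Replacement phase: the element at position `index` enters with probability k / (index + 1).
    for index, elem in enumerate(it, start=k):
        i = rng.randint(0, index)  # randint is inclusive on both ends
        if i < k:
            reservoir[i] = elem
    return reservoir

# Usage: sample 5 values from a generator without materializing it.
sample = reservoir_sample((x * x for x in range(1_000)), k=5)
```

If the stream holds fewer than $k$ elements, the sketch simply returns all of them.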

## Mathematical Proof

### What are we trying to prove?


- For the sample to be uniformly random, every element of the stream $S$ must end up in the reservoir with the same probability. We want to prove that this probability is $\frac{k}{n}$ for every element

### Cases

#### For stream elements between $k$ and $n-1$

- Consider the last element of the stream $S[n-1]$
    - From the procedure above, we generate $i$ uniformly such that $0 \le i \le n-1$
    - If $0 \le i \le k-1$, element $R[i]$ in the reservoir is replaced by $S[n-1]$
    - Else the element is not added to the reservoir
    - Since $k$ of the $n$ equally likely values of $i$ trigger a replacement, this happens with exactly $\frac{k}{n}$ probability

- Consider the second last element of the stream $S[n-2]$
    - When we first encounter $S[n-2]$, it has a probability $\frac{k}{n-1}$ of getting added to $R$ 
    - IF $S[n-2]$ is added to $R$
        - We next encounter the last element of the stream $S[n-1]$
        - This has a $\frac{1}{n}$ probability of replacing the $S[n-2]$ element (only one of the $n$ equally likely values of $i$ points at its slot), which implies $\frac{n-1}{n}$ probability of not replacing $S[n-2]$
    - IF $S[n-2]$ is NOT added to $R$, the last element does not affect its selection probability
    - Therefore, the probability that $S[n-2]$ remains in the reservoir at the end of the stream is $\frac{k}{n-1} \cdot \frac{n-1}{n} = \frac{k}{n}$

- Consider the third last element of the stream $S[n-3]$
    - Following the logic above
    - When we first encounter $S[n-3]$, it has a probability $\frac{k}{n-2}$ of getting added to $R$ 
    - IF $S[n-3]$ is added to $R$
        - We next encounter the second last element of the stream $S[n-2]$
        - This has a $\frac{1}{n-1}$ probability of replacing the $S[n-3]$ element, which implies $\frac{n-2}{n-1}$ probability of not replacing $S[n-3]$
    - IF $S[n-3]$ is added to $R$ AND $S[n-2]$ does not replace $S[n-3]$
        - We finally encounter the last element of the stream $S[n-1]$
        - This has a $\frac{1}{n}$ probability of replacing the $S[n-3]$ element, which implies $\frac{n-1}{n}$ probability of not replacing $S[n-3]$

    - Therefore, the probability that $S[n-3]$ remains in the reservoir at the end of the stream is $\frac{k}{n-2} \cdot \frac{n-2}{n-1} \cdot \frac{n-1}{n} = \frac{k}{n}$

- By induction, this logic holds for all stream elements with indices in $[k, n-1]$
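
The three cases above follow one pattern, which can be written as a single product for any index $j$ with $k \le j \le n-1$. The index names $j$ and $m$ are introduced here only for illustration: $S[j]$ enters the reservoir with probability $\frac{k}{j+1}$, and each later element $S[m]$ leaves it in place with probability $\frac{m}{m+1}$.

$$
P\big(S[j] \in R \text{ at the end}\big) = \frac{k}{j+1} \cdot \prod_{m=j+1}^{n-1} \frac{m}{m+1} = \frac{k}{j+1} \cdot \frac{j+1}{n} = \frac{k}{n}
$$

Setting $j = n-1$, $n-2$, and $n-3$ recovers the three worked cases.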

#### For stream elements between $0$ and $k-1$

- Consider the first element of the stream $S[0]$
    - It is added to $R$ with probability 1
    - Until the $(k-1)$-th element, $S[0]$ has no chance of getting removed
    - For each index $j$ between $k$ and $n-1$ of the stream, $S[0]$ is removed with probability $\frac{1}{j+1}$, since the random number is drawn from $j+1$ equally likely values and only one of them points at the slot holding $S[0]$
    - Starting from the $(n-1)$-th element, the probability of $S[0]$ **NOT** getting removed is $\frac{n-1}{n} \cdot \frac{n-2}{n-1} \cdots \frac{k}{k+1} = \frac{k}{n}$, since the product telescopes

- This logic repeats for every element from $S[0]$ to $S[k-1]$
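
Both cases of the proof can also be checked with exact rational arithmetic instead of simulation. The sketch below is illustrative only: the helper `inclusion_probability` is a hypothetical name, and it multiplies the entry probability by the per-step survival probabilities described above.

```python
from fractions import Fraction

def inclusion_probability(j, n, k):
    """Exact probability that S[j] (0-based) is in the reservoir after n elements."""
    # Entry probability: 1 for the first k elements, k / (j + 1) afterwards.
    p = Fraction(1) if j < k else Fraction(k, j + 1)
    # Each later element at index m leaves a given slot untouched with probability m / (m + 1).
    for m in range(max(j + 1, k), n):
        p *= Fraction(m, m + 1)
    return p

n, k = 100, 10
# Every position in the stream should have inclusion probability exactly k / n.
assert all(inclusion_probability(j, n, k) == Fraction(k, n) for j in range(n))
```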

## Simulation

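```python
import random
from collections import Counter

reservoir_size = 10  # k
val_range = 100      # n, the stream length
sim_size = 10_000    # number of independent runs

def get_simulation():
    """Run reservoir sampling once over the stream 0 .. val_range - 1."""
    stream = list(range(val_range))
    reservoir = [-1] * reservoir_size
    for index, elem in enumerate(stream):
        if index < reservoir_size:
            # Fill phase: the first k elements go straight into the reservoir.
            reservoir[index] = elem
        else:
            # randint is inclusive on both ends, so slot is uniform over 0 .. index.
            slot = random.randint(0, index)
            if slot < reservoir_size:
                reservoir[slot] = elem
    return reservoir

sims = [get_simulation() for _ in range(sim_size)]
```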

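```python
# Fraction of simulations in which each value ended up in the reservoir;
# each should be close to reservoir_size / val_range = 0.1.
{val: sum(val in sim for sim in sims) / sim_size for val in range(val_range)}
```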

{0: 0.1008,
 1: 0.0984,
 2: 0.1033,
 3: 0.0977,
 4: 0.1027,
 5: 0.0978,
 6: 0.0993,
 7: 0.0997,
 8: 0.1005,
 9: 0.0971,
 10: 0.0995,
 11: 0.1038,
 12: 0.0977,
 13: 0.1007,
 14: 0.0964,
 15: 0.0972,
 16: 0.1035,
 17: 0.097,
 18: 0.0969,
 19: 0.1055,
 20: 0.0979,
 21: 0.0986,
 22: 0.1032,
 23: 0.1058,
 24: 0.1001,
 25: 0.0993,
 26: 0.1032,
 27: 0.1023,
 28: 0.0985,
 29: 0.1003,
 30: 0.0998,
 31: 0.0981,
 32: 0.0984,
 33: 0.1011,
 34: 0.0978,
 35: 0.1019,
 36: 0.1068,
 37: 0.0978,
 38: 0.1024,
 39: 0.0991,
 40: 0.0983,
 41: 0.0951,
 42: 0.0966,
 43: 0.0975,
 44: 0.1055,
 45: 0.103,
 46: 0.1007,
 47: 0.1042,
 48: 0.0986,
 49: 0.1045,
 50: 0.1015,
 51: 0.0993,
 52: 0.1004,
 53: 0.1026,
 54: 0.1028,
 55: 0.0996,
 56: 0.1013,
 57: 0.0973,
 58: 0.0991,
 59: 0.1022,
 60: 0.0967,
 61: 0.1037,
 62: 0.0956,
 63: 0.1064,
 64: 0.0965,
 65: 0.0967,
 66: 0.1013,
 67: 0.0943,
 68: 0.0971,
 69: 0.1014,
 70: 0.1028,
 71: 0.101,
 72: 0.0991,
 73: 0.1008,
 74: 0.1004,
 75: 0.101,
 76: 0.1021,
 77: 0.1022,
