# Unknown Unknown Problem

A bucket contains balls that are each red, green, blue, black, or yellow, and the probabilities of drawing each color are $p_1, p_2, p_3, p_4, p_5$ respectively. You draw $N$ balls and observe $N'$ distinct colors. If you draw $R$ additional balls, how many more distinct colors will you observe? The facts below treat each draw as independent with fixed probabilities, which means drawing with replacement (or from a bucket so large that removing a ball does not change the probabilities).

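This problem is known as the "Species Estimation Problem" or "Unseen Species Problem". It is related to, but not the same as, the "German Tank Problem", which estimates the maximum of a numbered population from a sample.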

## Fact 1: Geometric Distribution

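Let $X$ be the number of trials needed to draw the first ball of a specific color, where each trial yields that color with probability $p$. With probability $p$ the first trial succeeds; otherwise one trial is spent and the process starts over:

$$E[X] = 1 \cdot p + (1 + E[X]) \cdot (1 - p)$$
Expanding the right-hand side gives $p \cdot E[X] = 1$. Hence,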

$$E[X] = \frac{1}{p}$$
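
As a quick numerical sanity check of $E[X] = 1/p$, the sketch below sums the series $\sum_{k \ge 1} k \, p \, (1-p)^{k-1}$ directly. The function name `expected_trials` and the truncation point `max_k` are illustrative choices, not part of the original notes, and the truncation assumes $p$ is not so small that the tail beyond `max_k` still matters.

```python
def expected_trials(p: float, max_k: int = 10_000) -> float:
    """Approximate E[X] = sum_{k>=1} k * p * (1-p)^(k-1) by truncating the series.

    Assumption: p is large enough that the tail beyond max_k is negligible.
    """
    total = 0.0
    miss_prob = 1.0  # (1 - p)^(k - 1): probability that the first k-1 trials all miss
    for k in range(1, max_k + 1):
        total += k * p * miss_prob
        miss_prob *= 1.0 - p
    return total


if __name__ == "__main__":
    for p in (0.5, 0.2, 0.01):
        print(p, expected_trials(p), 1.0 / p)
```

For each `p`, the truncated sum and `1 / p` should agree up to floating-point error.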

## Fact 2: Coupon Collector's Problem

Assume we know there are $C$ distinct colors in total, each equally likely on every draw, and we draw samples with replacement. How many samples do we need to draw to observe all $C$ colors?

Let $T_i$ be the number of trials needed to see the $i$-th distinct color after the first $i-1$ distinct colors have been seen. Each $T_i$ is geometric with success probability $\frac{C-i+1}{C}$, so Fact 1 gives $E[T_i] = \frac{C}{C-i+1}$.

$$E[T_1] = 1$$
$$E[T_2] = \frac{C}{C-1}$$ 
because the probability of drawing the second distinct color is $\frac{C-1}{C}$.
$$E[T_3] = \frac{C}{C-2}$$ 
because the probability of drawing the third distinct color is $\frac{C-2}{C}$.


The expected total number of trials needed to see all $C$ distinct colors is

$$E[T] = E[T_1] + E[T_2] + \dots + E[T_C] = C \times \left(1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{C}\right) \approx C \times \ln(C)$$
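
To see how close the $C \times \ln(C)$ approximation is, the following sketch compares it with the exact harmonic sum. The function name `coupon_collector_expectation` is an illustrative choice. The gap between the two grows roughly like $0.577 \times C$ (the Euler-Mascheroni constant times $C$), so $C \times \ln(C)$ captures the leading term only.

```python
import math


def coupon_collector_expectation(c: int) -> float:
    """Exact expected number of draws to see all c equally likely colors: c * H_c."""
    if c < 1:
        raise ValueError("c must be a positive integer")
    return c * sum(1.0 / k for k in range(1, c + 1))


if __name__ == "__main__":
    for c in (5, 100, 10_000):
        exact = coupon_collector_expectation(c)
        approx = c * math.log(c)
        print(f"C={c}: exact={exact:.2f}, C*ln(C)={approx:.2f}, gap/C={(exact - approx) / c:.3f}")
```

For the five colors in the opening example, the exact value is $5 \times \frac{137}{60} \approx 11.42$ draws.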


## Fact 3: Skewed Distribution

What if the probability of drawing each color is not equal? If some colors are more likely to be drawn than others, is the expected number of trials needed to see all $C$ distinct colors better or worse than $C \times \ln(C)$?

### Entropy is the measure of uncertainty

$$H(X) = - \sum_{i=1}^{C} p_i \log(p_i)$$

In this problem, we want to see all $C$ distinct colors in as few trials as possible, so we want each draw to be as "surprising" as possible. Entropy measures the average surprise of a draw, and it is largest for the uniform distribution. If the distribution is skewed, some colors are much more likely than others, the average surprise is lower, and the rare colors take a long time to show up. This means a skewed distribution needs more trials than the uniform case to see all $C$ distinct colors.
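
The claim can be checked numerically for small $C$. The sketch below is not from the original notes: it uses the standard inclusion-exclusion identity for the unequal-probability coupon collector, $E[T] = \sum_{S \neq \emptyset} (-1)^{|S|+1} / \sum_{i \in S} p_i$, which follows from writing $T$ as the maximum of the per-color waiting times. The function names and the example skewed distribution are illustrative assumptions, and the loop over all subsets is only practical for a handful of colors.

```python
import math
from itertools import combinations


def entropy(probs):
    """Shannon entropy in nats: -sum(p * ln p), skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def expected_draws_to_see_all(probs):
    """Expected draws (with replacement) until every color has appeared at least once.

    Uses inclusion-exclusion over all non-empty subsets of colors, so the cost
    is exponential in len(probs); intended for small examples only.
    """
    total = 0.0
    for size in range(1, len(probs) + 1):
        sign = 1.0 if size % 2 == 1 else -1.0
        for subset in combinations(probs, size):
            total += sign / sum(subset)
    return total


if __name__ == "__main__":
    uniform = [0.2] * 5
    skewed = [0.6, 0.1, 0.1, 0.1, 0.1]  # hypothetical skewed distribution
    for name, probs in (("uniform", uniform), ("skewed", skewed)):
        print(name, round(entropy(probs), 3), round(expected_draws_to_see_all(probs), 2))
```

In this comparison the skewed distribution has the lower entropy and the larger expected number of draws, consistent with the argument above.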


### "Skewed Distribution tends to be uniform somewhere"

More skew means that some colors are much more likely to be drawn than others. For example, a skewed distribution could look like {red: 0.989, green: 0.005, blue: 0.003, yellow: 0.001, black: 0.002}. Interestingly, the rare colors all have probabilities of similar magnitude, so among themselves they are relatively uniform. As a heuristic, more skew overall means more uniformity among the rare events, which makes the rare events roughly interchangeable.

When we think about the probability of drawing the $i$-th distinct color, that is, a color we have not seen yet, we can approximate it by the probability of drawing a rare color we have already seen. For example, if we have drawn {red: 6, green: 7, blue: 3, yellow: 1, black: 1}, then the probability that the next draw shows a new color can be estimated as $\frac{2}{18}$, because we have drawn 18 balls and 2 of them (yellow and black) are singletons, i.e. colors seen exactly once. This singletons-over-total ratio is known as the Good-Turing estimate of the unseen probability mass.
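
The singleton-based estimate from the example can be written as a few lines of Python. This is a minimal sketch: the function name `unseen_probability_estimate` is illustrative, and it implements only the simple ratio described above (number of colors seen exactly once, divided by the total number of draws), with no smoothing.

```python
def unseen_probability_estimate(counts):
    """Estimate the chance that the next draw is a new color.

    counts maps each observed color to how many times it was drawn.
    Returns (number of colors seen exactly once) / (total number of draws).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one observed draw")
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total


if __name__ == "__main__":
    observed = {"red": 6, "green": 7, "blue": 3, "yellow": 1, "black": 1}
    print(unseen_probability_estimate(observed))  # 2 singletons out of 18 draws
```

Note that the estimate drops to zero as soon as no color has been seen exactly once, which is one reason it is only a rough guide for small samples.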

<img src="images/skew_uniform.jpg" alt="Skew causes uniformity" width="400"/>

### How many trials do we need to run to see a rare event?

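Conjecture (unverified): $O(\log N)$ trials are needed to see a rare event. For comparison, Fact 1 says that a specific event with probability $p$ takes $\frac{1}{p}$ trials in expectation.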

<img src="images/rare_event.jpg" alt="Rare event" width="400"/>

