### 1. Each year, as part of a “Secret Santa” tradition, a group of 4 friends write their names on slips of paper and place the slips into a hat. Each member of the group draws a name at random from the hat and must buy a gift for that person. Of course, it is possible that they draw their own name, in which case they buy a gift for themselves. What is the expected number of people who draw their own name?

- Hint: Express this complicated random variable as a sum of indicator random variables (i.e., that only take on the values 0 or 1), and use linearity of expectation.

- Let $X_i$ be an indicator random variable that is 1 when person $i$ draws their own name, and 0 otherwise

- Let $Y$ be the total number of people who draw their own name
    - $Y = \sum_{i=1}^{4} X_i$, so $E[Y] = \sum_{i=1}^{4} E[X_i]$
    
- Note that the $X_i$ are not independent. For example, if three people draw their own names, the fourth person must draw their own name as well
    - BUT linearity of expectation does not require independence
    - Each person is equally likely to end up with any of the 4 names, so each $X_i \sim \text{Binom}(n=1, p=0.25)$ and $E[X_i] = 0.25$
    - By linearity of expectation, $E[Y] = \sum_{i=1}^{4} E[X_i]$
    - So $E[Y] = 0.25 + 0.25 + 0.25 + 0.25 = 1$

- The simulation below gives approximately the same answer

In [49]:
import numpy as np

names = np.arange(4)
n = 10_000  # number of simulated Secret Santa rounds
count_draw_own_name = []
for _ in range(n):
    draw = np.random.choice(names, 4, replace=False)
    count_draw_own_name.append(np.sum(draw == names))
np.mean(count_draw_own_name)

0.9792
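
With only 4 people, the simulation above can be cross-checked exactly by enumerating all $4! = 24$ assignments of names. This is a minimal sketch, assuming every permutation of the names is equally likely; the helper name `expected_fixed_points` is hypothetical and not part of the original notebook.

```python
from fractions import Fraction
from itertools import permutations


def expected_fixed_points(n_people):
    """Exact expected number of people who draw their own name."""
    total_matches = 0
    n_assignments = 0
    for draw in permutations(range(n_people)):
        # Person i draws their own name when draw[i] == i.
        total_matches += sum(1 for i, name in enumerate(draw) if i == name)
        n_assignments += 1
    return Fraction(total_matches, n_assignments)


print(expected_fixed_points(4))  # → 1
```

The enumeration agrees with the linearity argument: the average number of matches over all 24 permutations is exactly 1.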

### 2. McDonald’s decides to give a Pokemon toy with every Happy Meal. Each time you buy a Happy Meal, you are equally likely to get any one of the 6 types of Pokemon. What is the expected number of Happy Meals that you have to buy until you “catch ’em all”?

- Hint: Express this complicated random variable as a sum of geometric random variables, and use linearity of expectation.


- Let X be the number of happy meals you buy until you collect all 6
- This is too complicated to represent as a single distribution, so let $Y_{i}$ be the number of Happy Meals you buy to get your $i$-th new type of Pokemon, counted from the moment you hold $i-1$ distinct types
- $Y_{i}$ is a geometric random variable
    - $f_Y(y) = (1-p)^{y-1} p$
    - $\begin{align}
        E[Y] &= \sum_{y=1}^{\infty} (1-p)^{y-1} p \cdot y \\
        &= p \sum_{y=1}^{\infty} y (1-p)^{y-1} \\
        &= p \cdot \left(-\frac{\partial }{\partial p} \sum_{y=1}^{\infty} (1-p)^{y}\right) \\
        &= p \cdot \left(-\frac{\partial }{\partial p} \frac{1-p}{p}\right) & \text{By geometric sum relation proven in Q3 of section 24 questions}\\
        &= p \cdot \frac{\partial }{\partial p} \left(1 - \frac{1}{p}\right) \\
        &= p \cdot \frac{1}{p^2} \\
        &= \frac{1}{p}
        \end{align}$

- The success probability $p$ changes as the collection grows:
    - If you have nothing collected, it takes exactly 1 purchase to collect your first type, because any of the 6 options will be new
    - If you have collected 1 type, then each purchase has probability $\frac{6-1}{6}$ of giving you a type you still need
    - If you have collected 2 types, then each purchase has probability $\frac{6-2}{6}$ of giving you a type you still need
    - etc.
    - So 
        - $Y_1 \sim \text{Geom}(p=6/6)$
        - $Y_2 \sim \text{Geom}(p=5/6)$
        - $Y_3 \sim \text{Geom}(p=4/6)$
        - ...


- $X = Y_{1} + Y_{2} + \dots + Y_{6}$   
    - $\begin{align}
        E[X] &= E[Y_{1}] + E[Y_{2}] + \dots + E[Y_{6}] \\
        &= \frac{6}{6} + \frac{6}{5} + \frac{6}{4} + \frac{6}{3} + \frac{6}{2} + \frac{6}{1} \\
        &= 14.7
        \end{align}$

- Check by simulation

In [1]:
import numpy as np

types = [x for x in range(6)]
purchases_made = []

for _ in range(10_000):
    collected = set()
    purchases=0
    while len(collected) != 6:
        collected.add(np.random.choice(types, 1, replace=True)[0])
        purchases+=1
    purchases_made.append(purchases)
np.mean(purchases_made)

14.7202
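
The sum of geometric means above can also be evaluated exactly with rational arithmetic, which avoids any rounding in the 14.7 figure. This is a minimal sketch; the function name `expected_purchases` is hypothetical and not part of the original notebook.

```python
from fractions import Fraction


def expected_purchases(n_types):
    """Exact E[X] when each purchase yields one of n_types uniformly at random."""
    # With k types still missing, the wait for a new type is Geom(p = k / n_types),
    # whose mean is n_types / k.
    return sum(Fraction(n_types, k) for k in range(1, n_types + 1))


exact = expected_purchases(6)
print(exact, float(exact))  # → 147/10 14.7
```

Passing a different `n_types` gives the expected number of purchases for a collection of any size under the same equal-probability assumption.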

### 3. A group of 60 people are comparing their birthdays (as usual, assume that their birthdays are independent, all 365 days are equally likely, etc.). Find the expected number of days in the year on which at least two of these people were born.

- Hint: Express this complicated random variable as a sum of indicator random variables, and use linearity of expectation.

- Let $X$ be the number of days on which at least two people were born
- Let $Y_{i}$ be an indicator variable that is 1 when day $i$ is the birthday of at least 2 people, and 0 otherwise
    - So $X = \sum_{i=1}^{365} Y_i$ and $E[X] = \sum_{i=1}^{365} E[Y_i]$

- For any given day $i$, the number of people born on that day is $\text{Binom}(n=60, p=1/365)$, with pmf $f$. The probability that 2 or more people were born on day $i$ is the complement of the probability that at most 1 person was, i.e. $1 - f(0) - f(1)$
    - $f(0) = \frac{364^{60}}{365^{60}} $
    - $f(1) = \binom{60}{1} \cdot \frac{1}{365} \cdot \frac{364^{59}}{365^{59}} $
    - So $E[Y_i] = 1 - f(0) - f(1) \approx 1 - 0.848225 - 0.139817 = 0.011958$

- $E[X] = \sum_{i=1}^{365} E[Y_i] \approx 0.011958 \cdot 365 \approx 4.36$

In [70]:
import numpy as np
population = [x for x in range(1, 366)]
count_multi_birthdays=[]
for _ in range(10_000):
    sample=np.random.choice(population, 60, replace=True)
    days, counts = np.unique(sample, return_counts=True)
    count_multi_birthdays.append(len([x for x,y in zip(days, counts) if y > 1]))
np.mean(count_multi_birthdays)

4.352
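
Because the per-day probability is small, rounding it before multiplying by 365 shifts the final answer noticeably, so it helps to evaluate the formula without intermediate rounding. This is a minimal sketch of that calculation; the function name `expected_shared_days` and its parameters are hypothetical and not part of the original notebook.

```python
from math import comb


def expected_shared_days(n_people=60, n_days=365):
    """Expected number of days on which at least two people were born."""
    p = 1 / n_days
    # Binomial pmf at 0 and 1 for the number of people born on one fixed day.
    f0 = (1 - p) ** n_people
    f1 = comb(n_people, 1) * p * (1 - p) ** (n_people - 1)
    # Linearity of expectation: every day has the same indicator mean.
    return n_days * (1 - f0 - f1)


print(round(expected_shared_days(), 2))
```

The result should land close to the simulated mean above.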

### 4. A hash table is a commonly used data structure in computer science, allowing for fast information retrieval. For example, suppose we want to store some people’s phone numbers. Assume that no two of the people have the same name. For each name x, a hash function h is used, where $h(x)$ is the location to store x’s phone number. After such a table has been computed, to look up x’s phone number one just recomputes $h(x)$ and then looks up what is stored in that location.

### Typically, $h$ is chosen to be (pseudo)random. Suppose there are 100 people, with each person’s phone number stored in a random location (independently), represented by an integer between 1 and 1000. It then might happen that one location has more than one phone number stored there, if two different people x and y end up with the same random location for their information to be stored.

### Find the expected number of locations with no phone numbers stored, the expected number with exactly one phone number, and the expected number with more than one phone number.

- With no phone numbers
    - Let $X$ be the number of locations with no phone numbers
    - Let $Y_i$ be an indicator variable that is 1 when position $i$ is empty, and 0 otherwise
        - For a given individual, the probability that he/she is put into position $i$ is $\frac{1}{1000} = 0.001$
        - Since the 100 people are placed independently, the probability that position $i$ is empty is $(1-0.001)^{100} \approx 0.9048$, so $E[Y_i] \approx 0.9048$
        - The complement probability (that there is at least 1 number in position $i$) is $1 - 0.9048 = 0.0952$

    - $E[X] = E[Y_1] + E[Y_2] + \dots + E[Y_{1000}] \approx 0.9048 \cdot 1000 = 904.8$

- With 1 phone number
    - Let $X$ be the number of locations with exactly 1 phone number 
    - Let $Y_i$ be an indicator variable that is 1 when there is exactly 1 phone number in position $i$, and 0 otherwise

    - For a given position $i$, the probability of getting exactly 1 phone number is:
        $Pr(Y_i = 1) = E[Y_i] = \binom{100}{1} \cdot \frac{1}{1000} \cdot \frac{999^{99}}{1000^{99}} \approx 0.0906$

    - $E[X] = E[Y_1] + E[Y_2] + \dots + E[Y_{1000}] \approx 0.0906 \cdot 1000 = 90.6$

- With more than 1 phone number
    - Let $X$ be the number of locations with more than 1 phone number
    - Let $Y_i$ be an indicator variable that is 1 when there is more than 1 phone number in position $i$, and 0 otherwise
    - For a given position $i$, the probability of getting more than 1 phone number is the complement of the probability of no phone numbers plus the probability of exactly 1 phone number, both of which we computed above
    - Writing $N_i$ for the number of phone numbers stored in position $i$: $Pr(Y_i = 1) = E[Y_i] = 1 - Pr(N_i = 0) - Pr(N_i = 1) \approx 1 - 0.9048 - 0.0906 = 0.0046$
    - $E[X] = E[Y_1] + E[Y_2] + \dots + E[Y_{1000}] \approx 0.0046 \cdot 1000 = 4.6$

In [78]:
import numpy as np
population=[x for x in range(1000)]
locations_without_numbers = []
locations_with_one_number = []
locations_with_gt_one_numbers = []

for _ in range(1_000):
    sample=np.random.choice(population, 100, replace=True)
    locations_without_numbers.append(len([x for x in population if x not in sample]))

    positions, counts = np.unique(sample, return_counts=True)
    
    locations_with_one_number.append(len([x for x,y in zip(positions, counts) if y == 1]))
    locations_with_gt_one_numbers.append(len([x for x,y in zip(positions, counts) if y > 1]))

    

print(np.mean(locations_without_numbers))
print(np.mean(locations_with_one_number))
print(np.mean(locations_with_gt_one_numbers))

904.728
90.7
4.572
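
The three expectations can also be computed directly from the binomial probabilities, without rounding the per-location probabilities first. This is a minimal sketch; the function name `expected_location_counts` and its parameters are hypothetical and not part of the original notebook.

```python
def expected_location_counts(n_people=100, n_locations=1000):
    """Expected number of locations holding 0, exactly 1, and more than 1 number."""
    p = 1 / n_locations
    # Probability that one fixed location receives 0 or exactly 1 of the numbers.
    p_empty = (1 - p) ** n_people
    p_one = n_people * p * (1 - p) ** (n_people - 1)
    p_multi = 1 - p_empty - p_one
    # Linearity of expectation: multiply each probability by the number of locations.
    return tuple(n_locations * q for q in (p_empty, p_one, p_multi))


empty, one, multi = expected_location_counts()
print(round(empty, 1), round(one, 1), round(multi, 1))  # → 904.8 90.6 4.6
```

A useful sanity check is that the three expectations add up to the total number of locations, since every location falls into exactly one of the three categories.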


### 5. Calculate $E[X(X-1)]$ for a Hypergeometric($n$, $N_1$, $N_0$) random variable $X$ using linearity (Hint: Follow Example 26.4.)

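- $X(X-1)$ counts the ordered pairs of distinct draws $(i, j)$ in which both draws are 1s
- Let $Y_{ij}$ be an indicator variable that is 1 when draws $i$ and $j$ (with $i \neq j$) are both 1
    - $E[Y_{ij}] = \frac{N_1}{N_1+N_0} \cdot \frac{N_1-1}{N_1+N_0-1}$, since the draws are made without replacement
- As such, $X(X-1) = \sum_{i \neq j} Y_{ij}$. There are $n(n-1)$ ordered pairs, so $E[X(X-1)] = \sum_{i \neq j} E[Y_{ij}] = n(n-1) \cdot \frac{N_1}{N_1+N_0} \cdot \frac{N_1-1}{N_1+N_0-1}$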

In [103]:
import numpy as np
import scipy.stats

'''
Binomial baseline for comparison: draw 20 items WITH replacement from a box of
100 items, 40 of which are 1s, and count the number of 1s.
Do this `size` times
'''
X = scipy.stats.binom.rvs(20, 0.4, size=10_000) 
def compute_e_x_xminus1(X):
    gx = X * (X-1)
    return np.mean(gx)
print(compute_e_x_xminus1(X))
20 * 19 * 0.4**2

60.6372


60.80000000000001

In [132]:
# scipy's argument order is (M = total items, n = number of 1s, N = number of draws)
X = scipy.stats.hypergeom.rvs(100, 40, 20, size=10_000)
print(compute_e_x_xminus1(X))
20 * 19 * (40/100 * 39/99)

59.7222


59.87878787878788
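
The closed form can also be checked without sampling by summing $x(x-1)$ over the hypergeometric pmf in exact rational arithmetic. This is a minimal sketch; the function names `exact_e_x_xminus1` and `formula_e_x_xminus1` are hypothetical and not part of the original notebook.

```python
from fractions import Fraction
from math import comb


def exact_e_x_xminus1(n, n1, n0):
    """E[X(X-1)] summed directly over the Hypergeometric(n, N1, N0) pmf."""
    total = comb(n1 + n0, n)  # number of equally likely samples of size n
    return sum(
        Fraction(x * (x - 1) * comb(n1, x) * comb(n0, n - x), total)
        for x in range(max(0, n - n0), min(n, n1) + 1)
    )


def formula_e_x_xminus1(n, n1, n0):
    """Closed form from the indicator argument: n(n-1) * N1(N1-1) / (N(N-1))."""
    big_n = n1 + n0
    return Fraction(n * (n - 1) * n1 * (n1 - 1), big_n * (big_n - 1))


print(exact_e_x_xminus1(20, 40, 60) == formula_e_x_xminus1(20, 40, 60))  # → True
print(float(formula_e_x_xminus1(20, 40, 60)))
```

Using `Fraction` keeps both sides exact, so the equality check is not affected by floating-point error.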