<a href="https://colab.research.google.com/github/yardsale8/probability_simulations_in_R/blob/main/2_3_simulating_the_hypergeometric_distribution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
library(dplyr)
library(tidyr)
library(purrr)
library(devtools)
install_github('yardsale8/purrrfect', force = TRUE)
library(purrrfect)

# Simulating the Hypergeometric Distribution

In this notebook, we will explore another distribution related to counting successes. The hypergeometric distribution is similar to the binomial distribution, except that we are **sampling without replacement** from a finite population. We will simulate it in two ways:

1. Simulating the raw outcomes using `sample` and converting to the number of successes, and
2. Simulating the number of successes directly using `rhyper`.

Finally, we will illustrate how to use simulations to estimate various quantities of interest.

### Review - Urn Problem

The classic analogy of interest is drawing chips from an urn, such that

1. The chips consist of two colors (denoted success and failure),
2. A fixed number of chips are drawn, and
3. Chips are drawn without replacement.
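The third condition is what separates this setting from the binomial one. As a rough illustration, the sketch below compares the two probability mass functions using base R's `dhyper` and `dbinom`. The urn values (25 chips, 15 of them successes, 5 draws) are assumptions chosen to match the example later in this notebook.

```r
# Illustrative urn: N chips in total, r successes, n draws
N <- 25
r <- 15
n <- 5
x <- 0:n

comparison <- data.frame(
  x = x,
  # Sampling without replacement: hypergeometric
  without.replacement = dhyper(x, r, N - r, n),
  # Sampling with replacement: binomial with p = r/N
  with.replacement = dbinom(x, n, r / N)
)
print(round(comparison, 4))
```

Both columns sum to 1 and share the same mean, but the probabilities differ because each draw without replacement changes the makeup of the urn.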

### The hypergeometric distribution

Suppose that we draw $n$ chips without replacement from a population of $N$ chips that contains $r$ total successes.  If $X$ is the number of successes in the $n$ draws, then $X$ will have a hypergeometric distribution.  <font color="red"> NEED LINK</font>
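Writing $N$ for the total number of chips in the urn, the probability mass function is $P(X = x) = \binom{r}{x}\binom{N-r}{n-x} / \binom{N}{n}$. The sketch below checks this counting formula against base R's `dhyper`. The values of `N`, `r`, and `n` are illustrative assumptions.

```r
# Illustrative urn: N chips in total, r successes, n draws
N <- 25
r <- 15
n <- 5
x <- 0:n

# PMF from the counting formula
pmf.formula <- choose(r, x) * choose(N - r, n - x) / choose(N, n)

# PMF from base R; dhyper(x, m, n, k) takes successes, failures, draws
pmf.dhyper <- dhyper(x, r, N - r, n)

# Compare the two vectors up to floating-point tolerance
all.equal(pmf.formula, pmf.dhyper)
```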

## Strategies for simulating the hypergeometric distribution

1. Simulate the urn by using `sample` with `replace=FALSE`, then transform into the number of successes using either reshaping or `mutate` + `map`.
2. Simulate the number of successes directly using `rhyper`.

### Example - Drawing chips from an urn

Suppose that we are drawing $n = 5$ chips from an urn in which $r = 15$ of the $N = 25$ chips are blue, and recording the number of blue chips.  We can simulate the raw outcomes for this scenario by

1. Creating a vector of strings to model the urn, and
2. Using `sample` with `replace=FALSE` to draw chips.

In [3]:
N <- 25
r <- 15
n <- 5
urn <- c(rep('B', r),
         rep('W', N - r))
replicate(10, sample(urn, n, replace = FALSE))

.trial,.outcome
<dbl>,<list>
1,"W, W, B, B, W"
2,"B, W, B, W, B"
3,"W, B, W, W, B"
4,"B, B, B, B, W"
5,"B, W, B, B, W"
6,"B, B, W, B, W"
7,"B, W, B, W, B"
8,"B, W, B, B, B"
9,"B, W, B, B, B"
10,"B, B, W, B, B"


### Review - Converting raw outcomes to the number of successes

We will illustrate three ways to convert the raw outcomes to the number of successes.

1. Reshaping to compute number of successes,
2. Using `mutate` + `map` to compute number of successes, or
3. Using either `mutate` + `num_successes_int` or `col_num_successes`

#### Approach 1 - Simulate individual draws, reshape, and compute successes.

Comment out lines to explore the output in each step.

In [6]:
N <- 25
r <- 15
n <- 5
urn <- c(rep('B', r),
         rep('W', N - r))
num.trials <- 10
(replicate(num.trials, sample(urn, n, replace = FALSE), .reshape = 'stack')
 %>% mutate(is.success = ifelse(.outcome == 'B', 1, 0))
 %>% group_by(.trial) %>% summarise(num.successes = sum(is.success))
 )

.trial,num.successes
<dbl>,<dbl>
1,3
2,4
3,4
4,4
5,3
6,2
7,2
8,4
9,3
10,3


#### Approach 2 - Simulate individual draws, recode outcomes, and compute successes using `mutate` and `map`

Comment out lines to explore the output in each step.

In [5]:
N <- 25
r <- 15
n <- 5
urn <- c(rep('B', r),
         rep('W', N - r))
num.trials <- 10
(replicate(num.trials, sample(urn, n, replace = FALSE))
 %>% mutate(is.success = map(.outcome, \(x) ifelse(x == 'B', 1, 0)))
 %>% mutate(num.successes = map_int(is.success, sum))
 )

.trial,.outcome,is.success,num.successes
<dbl>,<list>,<list>,<int>
1,"B, W, B, B, B","1, 0, 1, 1, 1",4
2,"B, W, B, B, W","1, 0, 1, 1, 0",3
3,"B, B, W, B, W","1, 1, 0, 1, 0",3
4,"W, W, B, B, W","0, 0, 1, 1, 0",2
5,"W, B, W, B, B","0, 1, 0, 1, 1",3
6,"B, B, B, W, B","1, 1, 1, 0, 1",4
7,"W, W, B, B, B","0, 0, 1, 1, 1",3
8,"W, B, B, B, B","0, 1, 1, 1, 1",4
9,"B, B, W, W, W","1, 1, 0, 0, 0",2
10,"W, W, B, B, W","0, 0, 1, 1, 0",2


#### Approach 3 - Using the number of successes helper functions

In [8]:
N <- 25
r <- 15
n <- 5
urn <- c(rep('B', r),
         rep('W', N - r))
num.trials <- 10
(replicate(num.trials, sample(urn, n, replace = FALSE))
#   %>% mutate(.successes = num_successes_int(.outcome, 'B'))
  %>% col_num_successes(.outcome, 'B')
)

.trial,.outcome,.successes
<dbl>,<list>,<int>
1,"W, B, B, W, B",3
2,"W, B, B, B, B",4
3,"W, B, W, W, W",1
4,"B, W, B, B, B",4
5,"B, W, B, W, B",3
6,"B, B, W, B, W",3
7,"B, W, B, B, W",3
8,"W, B, B, B, B",4
9,"B, W, B, B, W",3
10,"W, W, B, W, B",2
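All three approaches reduce to counting the `'B'` entries in each sample. For comparison, here is a minimal base-R sketch of the same idea that returns a plain integer vector rather than a data frame. The seed is arbitrary, and `base::replicate` is called explicitly on the assumption that another loaded package may mask `replicate`.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility

N <- 25
r <- 15
n <- 5
urn <- c(rep('B', r),
         rep('W', N - r))
num.trials <- 10

# One trial = draw n chips without replacement and count the blue ones
num.successes <- base::replicate(
  num.trials,
  sum(sample(urn, n, replace = FALSE) == 'B')
)
num.successes
```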


### Simulate the number of successes directly using `rhyper`

Alternatively, we can use the base `R` function `rhyper` to generate the number of successes directly.  Note that the signature of `rhyper` is `rhyper(nn, m, n, k)` (use help!).  Here we will use `nn = 1` to mean one experiment per row, with an urn containing `m` blue chips (successes) and `n` non-blue chips (failures), from which a sample of `k` chips is drawn.

**Note.** The parameters used by `rhyper` are nonstandard, so be prepared to make generous use of `?`/`help`.
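Because the argument names of `rhyper` clash with the notation used in this notebook, it can help to spell the mapping out with named arguments. The sketch below uses the same urn values as the example. The seed and the choice of `nn = 10` are arbitrary.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

N <- 25  # total chips in the urn
r <- 15  # blue chips (successes)
n <- 5   # chips drawn per experiment

# rhyper's arguments: nn = number of experiments,
#                     m  = successes in the urn      (our r)
#                     n  = failures in the urn       (our N - r)
#                     k  = chips drawn per experiment (our n)
draws <- rhyper(nn = 10, m = r, n = N - r, k = n)
draws
```

Note that `rhyper`'s `n` is the number of failures in the urn, not the sample size.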

In [None]:
?rhyper

In [9]:
N <- 25
r <- 15
n <- 5
num.trials <- 10
(replicate_int(num.trials, rhyper(1, r, N - r, n))
)

.trial,.outcome
<dbl>,<int>
1,2
2,4
3,2
4,5
5,2
6,1
7,3
8,4
9,3
10,5


## Four Types of Simulation Tasks

When performing a simulation, we will generally be trying to

1. Estimate the probability of an event,
2. Estimate the cut off for some region of the distribution (i.e., inverse probability),
3. Estimate all the values for the probability mass function, or
4. Estimate various types of expectation (mean, variance, SD, function of $X$).

These strategies are demonstrated below, but are covered in more detail in the binomial notebook.

#### Estimating Probabilities

**Example Question.** What is the probability that 3 or more of the 5 chips will be blue?

In [11]:
N <- 25
r <- 15
n <- 5
num.trials <- 100000
(replicate_int(num.trials, rhyper(1, r, N - r, n))
 %>% mutate(three.or.more = .outcome >= 3)
 %>% estimate_all_prob
 %>% mutate(exact.prob = sum(dhyper(3:5, r, N - r, n)))
)


three.or.more,exact.prob
<dbl>,<dbl>
0.69765,0.6988142
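The exact probability in the comparison column can also be computed from the cumulative distribution function. As a small sketch using only base R, `phyper` with `lower.tail = FALSE` returns $P(X > 2)$, which equals $P(X \ge 3)$ for an integer-valued $X$.

```r
N <- 25
r <- 15
n <- 5

# Sum of the PMF over 3, 4, 5
exact.sum <- sum(dhyper(3:5, r, N - r, n))

# Upper tail of the CDF: P(X > 2) = P(X >= 3)
exact.tail <- phyper(2, r, N - r, n, lower.tail = FALSE)

all.equal(exact.sum, exact.tail)
```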


#### Estimating the Distribution

In [12]:
N <- 25
r <- 15
n <- 5
num.trials <- 100000
(replicate_int(num.trials, rhyper(1, r, N - r, n))
 %>% tabulate(.outcome)
)

X = 0,X = 1,X = 2,X = 3,X = 4,X = 5
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.00451,0.05968,0.23669,0.38508,0.2574,0.05664
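To judge how close these estimates are, they can be set beside the exact probabilities from `dhyper`. The sketch below redoes the simulation in base R with `table` and `prop.table`, which are stand-ins here and not the helper used above. The seed is arbitrary.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

N <- 25
r <- 15
n <- 5
num.trials <- 100000

# Simulate all trials in one call
sims <- rhyper(num.trials, r, N - r, n)

# Relative frequencies; factor() keeps values that were never observed
approx.pmf <- as.numeric(prop.table(table(factor(sims, levels = 0:n))))
exact.pmf <- dhyper(0:n, r, N - r, n)

comparison <- data.frame(x = 0:n, approx.pmf, exact.pmf)
print(round(comparison, 5))

# Largest absolute error across the support
max(abs(approx.pmf - exact.pmf))
```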


#### Estimating the cut off for the left tail of the distribution.

**Example Lower-Tail Task.** Find the cut off for the smallest 5% of the number of blue chips.

In [13]:
N <- 25
r <- 15
n <- 5
left.tail.percent <- 0.05
num.trials <- 100000
(replicate_int(num.trials, rhyper(1, r, N - r, n))
 %>% summarise(x.lower.5.percent = quantile(.outcome, left.tail.percent))
)

x.lower.5.percent
<dbl>
1
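The simulated cut off can be checked against the exact quantile function. As a sketch, base R's `qhyper(p, m, n, k)` returns the smallest $x$ with $P(X \le x) \ge p$, and `phyper` shows the cumulative probabilities behind that choice.

```r
N <- 25
r <- 15
n <- 5
left.tail.percent <- 0.05

# Exact lower-tail cut off
x.lower <- qhyper(left.tail.percent, r, N - r, n)

# Cumulative probabilities P(X <= x) for x = 0, ..., n
cumulative <- setNames(phyper(0:n, r, N - r, n), paste('X <=', 0:n))

x.lower
round(cumulative, 4)
```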


#### Example Upper-Tail Task - Find the cut off for the largest 10% of the number of blue chips

In [14]:
N <- 25
r <- 15
n <- 5
left.tail.percent <- 1 - 0.10
num.trials <- 100000
(replicate_int(num.trials, rhyper(1, r, N - r, n))
 %>% summarise(x.upper.10.percent = quantile(.outcome, left.tail.percent) + 1)
)

x.upper.10.percent
<dbl>
5
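The `+ 1` in the code above moves from the largest value still inside the lower 90% to the first value in the upper tail. One way to check that reasoning, sketched here with base R only, is to compute $P(X \ge x)$ for every $x$ and find the smallest $x$ whose upper-tail probability is at most 10%.

```r
N <- 25
r <- 15
n <- 5
upper.tail.percent <- 0.10
x <- 0:n

# P(X >= x) = P(X > x - 1)
upper.tail <- phyper(x - 1, r, N - r, n, lower.tail = FALSE)

# Smallest x whose upper-tail probability is at most 10%
x.upper.direct <- min(x[upper.tail <= upper.tail.percent])

# Same cut off via the quantile function plus one
x.upper.quantile <- qhyper(1 - upper.tail.percent, r, N - r, n) + 1

c(direct = x.upper.direct, via.quantile = x.upper.quantile)
```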


#### Estimating the mean, variance, and standard deviation of the number of blue chips and comparing to theoretical values

In [15]:
N <- 25
r <- 15
n <- 5
p <- r/N
finite.correct <- (N - n)/(N - 1)
num.trials <- 10000
(replicate_int(num.trials, rhyper(1, r, N - r, n))
  %>% summarise(approx.mu = mean(.outcome),
                approx.var = var(.outcome),
                approx.sd = sd(.outcome))
  %>% mutate(actual.mu = n*p,
             actual.var = n*p*(1-p)*finite.correct,
             actual.sd = sqrt(actual.var))
  %>% relocate(actual.mu, .after = approx.mu)
  %>% relocate(actual.var, .after = approx.var)
  %>% relocate(actual.sd, .after = approx.sd)
)



approx.mu,actual.mu,approx.var,actual.var,approx.sd,actual.sd
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
2.9889,3,0.991876,1,0.9959297,1
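The theoretical values used above come from the closed-form expressions $np$ and $np(1-p)\frac{N-n}{N-1}$. As a sanity check, the sketch below recomputes the mean and variance directly from the definition of expectation, summing over the exact PMF from `dhyper`.

```r
N <- 25
r <- 15
n <- 5
p <- r / N
x <- 0:n

pmf <- dhyper(x, r, N - r, n)

# Expectation and variance from their definitions
mu <- sum(x * pmf)
sigma2 <- sum((x - mu)^2 * pmf)

# Closed-form values, including the finite population correction
formula.mu <- n * p
formula.var <- n * p * (1 - p) * (N - n) / (N - 1)

c(mu = mu, formula.mu = formula.mu,
  sigma2 = sigma2, formula.var = formula.var)
```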


### <font color="red"> Exercise 2.3.1 - Hypergeometric problems </font>

In a small pond, there are 50 fish, 10 of which have been tagged.  A fisherman’s catch consists of seven fish, selected at random and without replacement.  Let $X$ represent the number of the seven fish that are tagged.

**Tasks.**
1. Explain why we should use the hypergeometric distribution here.
2. Estimate $P(X \ge 2)$ two ways: by generating raw outcomes and by using `rhyper`.
3. Find the cut-off for the largest 25% of the distribution.
4. Use `tabulate` to estimate the PMF of $X$.

In [None]:
# Your code here

### <font color="red"> Exercise 2.3.2 - Estimating the population size </font>

Use the same setup as the last problem (50 fish, 10 of which have been tagged; the catch consists of seven fish), and again let $X$ represent the number of the seven fish that are tagged.  Imagine that the fisherman doesn't know the total number of fish in the pond, but *does* know that there are 10 tagged fish.  She will use the following formula to estimate the total number of fish in the pond: $\hat{N} \approx r/\hat{p}$, where $\hat{p}$ is the sample proportion of tagged fish.

**Tasks.**
1. Note that the proportion of tagged fish in the pond is given by $p=\frac{r}{N}$. Use this fact to explain why the estimation formula makes sense.
2. Set up a simulation to estimate the PMF of $\hat{N}$.
3. Estimate the mean, variance, and SD of $\hat{N}$.
4. An estimate is said to be **unbiased** if the expected value of the estimate is the same as the actual population value.  Is this estimator unbiased?  Explain.

In [None]:
# Your code here