## The Urn Model
#### Before setting up an urn model, we need to make a few decisions:
* The number of marbles in the urn
* The color (or label) on each marble
* The number of marbles to draw from the urn

In [2]:
import numpy as np

In [4]:
# In the urn, there are 3 black marbles and 2 white ones.
urn = ["b", "b", "b", "w", "w"]
print("Sample 1:", np.random.choice(urn, size=2, replace=False))
print("Sample 2:", np.random.choice(urn, size=2, replace=False))

Sample 1: ['w' 'w']
Sample 2: ['b' 'w']


In [6]:
# Estimate the fraction of samples where both marbles that we draw match in color.
n = 10_000
samples = [np.random.choice(urn, size=2, replace=False) for _ in range(n)]
is_matching = [marble1 == marble2 for marble1, marble2 in samples]
print(f"Proportion of samples with matching marbles: {np.mean(is_matching)}")

Proportion of samples with matching marbles: 0.394
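
As a check on the simulated proportion, the exact probability can be found by enumerating every pair of marbles. This is a minimal sketch using only the standard library; the names `pairs` and `exact_match` are illustrative and not part of the original notebook.

```python
from fractions import Fraction
from itertools import combinations

urn = ["b", "b", "b", "w", "w"]

# Treat the marbles as distinct objects by pairing each with its position,
# so that every unordered pair of marbles is equally likely.
pairs = list(combinations(enumerate(urn), 2))

# Count the pairs whose two marbles share a color.
matching = sum(1 for (_, c1), (_, c2) in pairs if c1 == c2)
exact_match = Fraction(matching, len(pairs))
print(f"Exact probability of a match: {exact_match} = {float(exact_match)}")
```

The simulated proportion of 0.394 is close to this exact value of 2/5.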


#### Simple random sample
The urn model in which we do not replace the marbles between draws describes a common selection method called the **simple random sample**. Every possible sample of a given size has the same chance of being selected.

In [7]:
from itertools import combinations

In [8]:
all_samples = ["".join(sample) for sample in combinations("ABCDEFG", 3)]
print(all_samples)
print("Number of Samples:", len(all_samples))

['ABC', 'ABD', 'ABE', 'ABF', 'ABG', 'ACD', 'ACE', 'ACF', 'ACG', 'ADE', 'ADF', 'ADG', 'AEF', 'AEG', 'AFG', 'BCD', 'BCE', 'BCF', 'BCG', 'BDE', 'BDF', 'BDG', 'BEF', 'BEG', 'BFG', 'CDE', 'CDF', 'CDG', 'CEF', 'CEG', 'CFG', 'DEF', 'DEG', 'DFG', 'EFG']
Number of Samples: 35


The chance that the sample contains the marbles labeled A, B, and C, in any order, is:
P(ABC) = 1/35

In [9]:
from itertools import permutations

In [10]:
print(["".join(sample) for sample in permutations("ABC")])

['ABC', 'ACB', 'BAC', 'BCA', 'CAB', 'CBA']
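
These six orderings show where 1/35 comes from: every ordered sequence of three distinct marbles is equally likely, and six of those sequences produce the sample ABC. A minimal sketch of that arithmetic, with illustrative variable names:

```python
from fractions import Fraction
from math import comb, factorial, perm

# Number of ordered ways to draw 3 of the 7 marbles without replacement.
n_ordered = perm(7, 3)       # 7 * 6 * 5
# Number of orderings of one particular set of 3 marbles.
n_orderings = factorial(3)   # 3 * 2 * 1

p_abc = Fraction(n_orderings, n_ordered)
print(p_abc)

# The same answer follows from counting unordered samples directly.
assert p_abc == Fraction(1, comb(7, 3))
```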


#### Stratified sampling
This is like having a separate urn for each stratum and drawing marbles from each urn, independently. The strata do not have to be the same size, and we need not take the same number of marbles from each.
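
A minimal sketch of this idea, assuming two hypothetical strata (the stratum names, marble labels, and sample sizes are made up for illustration and do not come from the original notebook):

```python
import random

# Hypothetical strata of unequal size; each one acts as its own urn.
strata = {
    "small": ["A", "B", "C"],
    "large": ["D", "E", "F", "G"],
}
# Number of marbles to draw from each stratum; these need not be equal.
draws_per_stratum = {"small": 1, "large": 2}

rng = random.Random(42)  # seeded only to make the sketch reproducible

# Draw a simple random sample from each urn, independently of the others.
stratified_sample = {
    name: rng.sample(marbles, draws_per_stratum[name])
    for name, marbles in strata.items()
}
print(stratified_sample)
```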

#### Cluster sampling
We can think of this as a simple random sample from one urn that contains large marbles that are themselves containers of small marbles. (The large marbles need not have the same number of marbles in them.) When opened, the sample of large marbles turns into the sample of small marbles. (Clusters tend to be smaller than strata.)

For example, organize your marbles, labeled A-G, into three clusters (A, B), (C, D), and (E, F, G). Then, a cluster sample of size one has an equal chance of drawing any of the three clusters.

In this scenario, each marble has the same chance of being in the sample:

P(A in sample) = P(cluster(A, B) chosen) = 1/3

...

P(G in sample) = P(cluster(E, F, G) chosen) = 1/3
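
The equal inclusion chances can be checked by simulation. This is a rough sketch using the standard library; the variable names and the number of simulations are arbitrary choices.

```python
import random
from collections import Counter

clusters = [("A", "B"), ("C", "D"), ("E", "F", "G")]

rng = random.Random(0)  # seeded only to make the sketch reproducible
n_sims = 30_000

# Each simulation draws one cluster at random; opening it gives the sample.
counts = Counter()
for _ in range(n_sims):
    chosen = rng.choice(clusters)
    counts.update(chosen)

# Fraction of simulated samples that contain each marble.
inclusion = {marble: counts[marble] / n_sims for marble in sorted(counts)}
print(inclusion)
```

Each fraction should land near 1/3. Note that although every marble has the same chance of appearing, not every combination of marbles is possible: A and C, for example, can never appear in the same sample.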

#### Example
Suppose we have 7 fuel tanks, and 4 of them would fail the pressure test. We pressure-test a simple random sample of 3 tanks to estimate the failure rate.

In [15]:
# 1 for fail and 0 for pass
urn = [1, 1, 0, 1, 0, 1, 0]
sample = np.random.choice(urn, size=3, replace=False)
print(f"Sample: {sample}")
print(f"Prop Failures: {sample.mean()}")

Sample: [1 0 0]
Prop Failures: 0.3333333333333333


In a simulation study, we repeat the sampling process thousands of times to get thousands of proportions, and then we estimate the sampling distribution of the proportion from what we get in our simulation.

In [18]:
samples = [np.random.choice(urn, size=3, replace=False) for _ in range(10_000)]
prop_failures = [s.mean() for s in samples]

In [20]:
import pandas as pd

In [23]:
unique_els, counts_els = np.unique(prop_failures, return_counts=True)
pd.DataFrame({
    "Proportion of failures": unique_els,
    "Fraction of samples": counts_els / 10_000,
})


| | Proportion of failures | Fraction of samples |
|---|---|---|
| 0 | 0.0 | 0.0267 |
| 1 | 0.333333 | 0.3474 |
| 2 | 0.666667 | 0.5106 |
| 3 | 1.0 | 0.1153 |


In [26]:
# Alternative: build the same table from (proportion, fraction) pairs.
pd.DataFrame(list(zip(unique_els, counts_els / 10_000)),
             columns=["Proportion of failures", "Fraction of samples"])

| | Proportion of failures | Fraction of samples |
|---|---|---|
| 0 | 0.0 | 0.0267 |
| 1 | 0.333333 | 0.3474 |
| 2 | 0.666667 | 0.5106 |
| 3 | 1.0 | 0.1153 |


#### Hypergeometric distribution
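Instead of using np.random.choice, we can use np.random.hypergeometric to simulate drawing marbles from the urn and counting the number of failures. Here ngood is the number of marbles labeled 1 (the 4 failing tanks), nbad is the number labeled 0 (the 3 passing tanks), and nsample is the number of marbles drawn.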

In [28]:
simulations_fast = np.random.hypergeometric(
    ngood=4, nbad=3, nsample=3, size=10_000
)
print(simulations_fast)

[1 2 2 ... 1 2 1]


In [29]:
unique_els, counts_els = np.unique(simulations_fast, return_counts=True)
pd.DataFrame({
    "Number of failures": unique_els,
    "Fraction of samples": counts_els / 10_000,
})

| | Number of failures | Fraction of samples |
|---|---|---|
| 0 | 0 | 0.0297 |
| 1 | 1 | 0.3374 |
| 2 | 2 | 0.5171 |
| 3 | 3 | 0.1158 |


In [31]:
from scipy.stats import hypergeom

In [32]:
num_failures = [0, 1, 2, 3]
pd.DataFrame({
    "Number of failures": num_failures,
    # hypergeom.pmf(k, M, n, N): probability mass function, with
    # M = 7 tanks in total, n = 4 failing tanks, N = 3 tanks drawn
    "Fraction of samples": hypergeom.pmf(num_failures, 7, 4, 3)
})

| | Number of failures | Fraction of samples |
|---|---|---|
| 0 | 0 | 0.028571 |
| 1 | 1 | 0.342857 |
| 2 | 2 | 0.514286 |
| 3 | 3 | 0.114286 |


__When we draw without replacement, the number of failures in the sample follows the hypergeometric distribution; when we draw with replacement, it follows the binomial distribution.__
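
To make the contrast concrete, the following sketch computes both distributions for the fuel tank example using only the standard library. The helper names `hypergeom_pmf` and `binom_pmf` are illustrative and are not library functions.

```python
from math import comb

def hypergeom_pmf(k, total, n_fail, n_draw):
    """P(k failures) when drawing n_draw items without replacement."""
    return comb(n_fail, k) * comb(total - n_fail, n_draw - k) / comb(total, n_draw)

def binom_pmf(k, n_draw, p):
    """P(k failures) when drawing n_draw items with replacement."""
    return comb(n_draw, k) * p**k * (1 - p) ** (n_draw - k)

total, n_fail, n_draw = 7, 4, 3
for k in range(n_draw + 1):
    without = hypergeom_pmf(k, total, n_fail, n_draw)
    with_repl = binom_pmf(k, n_draw, n_fail / total)
    print(f"{k} failures: without replacement {without:.6f}, "
          f"with replacement {with_repl:.6f}")
```

With only 7 tanks in the urn the two distributions differ noticeably, because each draw without replacement changes the makeup of what remains.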