# Sampling Basics

Explore the 2 basic sampling methods of `multivar_hypergeom` by solving some simple problems in statistics. Both problems use the following scenario:

Say you are given a bag with different colored chips:

|Chip Color | Quantity|
|:---|:---|
|Red | 20 |
|Blue | 10|
|Yellow| 15|
|Green| 30 |

In [1]:
import numpy as np
from multivar_hypergeom.multivar_hypergeom import MultivarHypergeom as MV

## Problem 1

Drawing 1 chip at a time, what are the chances you draw a yellow chip, a blue chip, and a green chip in that order?

To solve the problem ***experimentally***, we can simulate many samples and estimate the likelihood of getting these 3 chips in the given order.

1. Learn how to take a sample using the library.

> Once you create a distribution object, simply call its `sample1()` method; it returns the type of the object sampled.
> 
> Run the code below several times to see how the sample varies.

In [7]:
# Create a distribution object
ex_dist1 = MV({'Red': 20, 'Blue':10, 'Yellow':15, 'Green':30})
# Draw a single chip
sample = ex_dist1.sample1()
print(sample)

Blue


2. How do we sample repeatedly?

> Once you create a distribution object, you can sample from it repeatedly. 
>
> Run the code below to see what happens when we sample from the same distribution many times.

In [9]:
# Create a distribution object
ex_dist2 = MV({'Red': 20, 'Blue':10, 'Yellow':15, 'Green':30})


# Try to sample constantly from the distribution
num_samples = 0
while True:
    try:
        sample = ex_dist2.sample1()
        num_samples += 1
    except Exception:
        print('Failed after {} samples'.format(num_samples))
        # Uncomment line below after running the code once
        # print(ex_dist2)
        break

Failed after 75 samples
Types: ['Red', 'Blue', 'Yellow', 'Green'] 
Counts: [0, 0, 0, 0] 
Total: 0


> The code failed after 75 samples. Why did this happen? Run the code again with line 14 uncommented.
>
> By printing the distribution, you should notice there are no objects left. When we call `sample1()` repeatedly on the same distribution, the distribution itself is changing because we sample ***without replacement***.
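>
> As a library-independent illustration of sampling without replacement, the sketch below models the bag as a plain Python list and removes one chip on every draw. It uses only the standard `random` module, not `multivar_hypergeom`, and the names `bag` and `draws` are made up for this example.

```python
import random

# Model the bag as a flat list of chips (an assumption for illustration;
# this is not necessarily how multivar_hypergeom stores its state).
counts = {'Red': 20, 'Blue': 10, 'Yellow': 15, 'Green': 30}
bag = [color for color, n in counts.items() for _ in range(n)]

draws = 0
while bag:
    # Remove one chip at random: the bag shrinks with every draw
    bag.pop(random.randrange(len(bag)))
    draws += 1

print('Bag empty after {} draws'.format(draws))
```

> The loop stops after as many draws as there were chips, which mirrors the failure seen above once the distribution runs out of objects.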

3. How do we repeatedly sample from the same conditions?

> To solve our problem, we cannot just take samples from the distribution until it is empty because that is not the question we are given. Instead, we need to repeatedly take a sequence of 3 samples from the distribution in its initial state. 
>
> To do this, we have 2 options:
>   - Repeatedly create a new distribution and take 3 samples from it.
>   - Create one distribution, then repeatedly sample from it and reset it.
>
> The second option is much more efficient, so let's try it.

In [10]:
# Create a distribution object
ex_dist3 = MV({'Red': 20, 'Blue':10, 'Yellow':15, 'Green':30})

# Take 20 sequences of 3 samples, resetting the dist after each sequence
for i in range(20):
    sample = []
    # Use a separate loop variable so the outer counter is not shadowed
    for _ in range(3):
        sample.append(ex_dist3.sample1())
    print(sample)
    ex_dist3.reset()

# After sampling, check the status of our distribution
print(ex_dist3)

['Blue', 'Blue', 'Yellow']
['Yellow', 'Red', 'Blue']
['Blue', 'Green', 'Green']
['Blue', 'Red', 'Yellow']
['Yellow', 'Green', 'Blue']
['Green', 'Yellow', 'Blue']
['Blue', 'Blue', 'Green']
['Green', 'Blue', 'Red']
['Green', 'Blue', 'Red']
['Green', 'Red', 'Green']
['Green', 'Red', 'Green']
['Green', 'Yellow', 'Red']
['Red', 'Blue', 'Green']
['Green', 'Green', 'Red']
['Blue', 'Red', 'Green']
['Green', 'Green', 'Yellow']
['Blue', 'Yellow', 'Yellow']
['Blue', 'Green', 'Red']
['Red', 'Yellow', 'Red']
['Green', 'Green', 'Yellow']
Types: ['Red', 'Blue', 'Yellow', 'Green'] 
Counts: [20, 10, 15, 30] 
Total: 75


> We successfully took 20 sequences of 3 samples from the distribution under the same conditions!

4. Estimating the solution by sampling.

> Now we have the tools to take many sequences of samples from the given distribution and count how many sequences meet the desired condition. We can then estimate the probability of this event as the fraction of sequences that meet the condition.

In [13]:
# Create a distribution object
ex_dist4 = MV({'Red': 20, 'Blue':10, 'Yellow':15, 'Green':30})

# Create variables to hold results
num_samples = 100
good_samples = 0
condition = ['Yellow', 'Blue', 'Green']

# Iterate through many trials
for i in range(num_samples):
    # Take 3 samples (separate loop variable so the outer counter is not shadowed)
    sample = []
    for _ in range(3):
        sample.append(ex_dist4.sample1())
    # Check the condition
    if sample == condition:
        good_samples += 1
    # Reset the distribution
    ex_dist4.reset()
    
# Compute probability
prob = good_samples/num_samples
print('Probability of Yellow, Blue, Green in order: ', prob)

Probability of Yellow, Blue, Green in order:  0.01


> If you run the code above several times, you will notice the estimated probability can change significantly between executions.
>
> To get a more accurate estimate, we increase the number of trials (`num_samples`).
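>
> To see why more trials help, the sketch below computes the standard error of an estimated proportion, sqrt(p(1 - p)/n), for a few trial counts. It takes the 0.01 printed by the run above as a rough stand-in for the true probability; the helper name `standard_error` is made up for this example.

```python
import math

def standard_error(p, n):
    """Standard error of a proportion estimated from n independent trials."""
    return math.sqrt(p * (1 - p) / n)

p = 0.01  # rough stand-in: the estimate printed by the run above
for n in [100, 10_000, 1_000_000]:
    print('n = {:>9,}: standard error ~ {:.5f}'.format(n, standard_error(p, n)))
```

> With 100 trials the standard error is about as large as the probability itself, so single runs are very noisy. Each 100-fold increase in trials shrinks it by a factor of 10.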

5. Improving accuracy.

> Try increasing the number of trials in the code below to get an improved estimate of the probability.
>
> ***Hint:*** *The true probability is roughly 0.011*

In [15]:
# Create a distribution object
ex_dist5 = MV({'Red': 20, 'Blue':10, 'Yellow':15, 'Green':30})

# Create variables to hold results
num_samples = 100
good_samples = 0
condition = ['Yellow', 'Blue', 'Green']

# Iterate through many trials
for i in range(num_samples):
    # Take 3 samples (separate loop variable so the outer counter is not shadowed)
    sample = []
    for _ in range(3):
        sample.append(ex_dist5.sample1())
    # Check the condition
    if sample == condition:
        good_samples += 1
    # Reset the distribution
    ex_dist5.reset()
    
# Compute probability
prob = good_samples/num_samples
print('Probability of Yellow, Blue, Green in order: ', prob)

Probability of Yellow, Blue, Green in order:  0.0111
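
> The value in the hint can be checked exactly. Because chips are drawn without replacement, each draw removes one chip from the bag, so the probability of yellow, then blue, then green is (15/75)(10/74)(30/73). The sketch below evaluates this with the standard `fractions` module; the variable name `exact` is made up for this example.

```python
from fractions import Fraction

# P(Yellow first) * P(Blue second | Yellow gone) * P(Green third | two chips gone)
# The denominator drops by 1 on each draw because the chip is not put back.
exact = Fraction(15, 75) * Fraction(10, 74) * Fraction(30, 73)

print('Exact probability:', exact, '~', round(float(exact), 4))
```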


## Problem 2

What are the chances you get one of each color in a sample of 4?

To solve the problem ***experimentally***, we can simulate many samples of size 4 and estimate the likelihood of getting a sample with one of each color.

1. Learn how to take a sample using the library. 

> Once you create a distribution object, simply call its `sample()` method with the desired sample size; it returns an array `[c1, c2, ..., cn]` where `ci` is the number of objects of type `i` in the sample.
> 
> Run the code below several times to see how the sample varies.


In [16]:
# Create a distribution object
ex_dist6 = MV({'Red': 20, 'Blue':10, 'Yellow':15, 'Green':30})
# Draw a sample of 4
sample = ex_dist6.sample(4)
print(sample)

[1 0 0 3]
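
> Assuming the entries follow the order of the `Types` list printed earlier (Red, Blue, Yellow, Green), the output above reads as 1 red chip and 3 green chips. The sketch below mimics this output format with the standard library only: it draws 4 chips and tallies them into one count per type. It is an illustration of the format, not the library's implementation, and `drawn` and `count_vector` are made-up names.

```python
import random
from collections import Counter

types = ['Red', 'Blue', 'Yellow', 'Green']
quantities = [20, 10, 15, 30]

# Draw 4 chips without replacement (the counts= argument needs Python 3.9+)
drawn = random.sample(types, k=4, counts=quantities)

# Tally the draw into one count per type, in the same order as `types`
tally = Counter(drawn)
count_vector = [tally[t] for t in types]

print(drawn, '->', count_vector)
```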


2. How do we sample repeatedly?

> Once you create a distribution object, you can sample from it repeatedly. 
>
> Run the code below to see what happens when we sample from the same distribution many times.

In [17]:
# Create a distribution object
ex_dist7 = MV({'Red': 20, 'Blue':10, 'Yellow':15, 'Green':30})


# Try to sample constantly from the distribution
num_samples = 0
while True:
    try:
        sample = ex_dist7.sample(4)
        num_samples += 1
    except Exception:
        print('Failed after {} samples'.format(num_samples))
        # Uncomment line below after running the code once
        # print(ex_dist7)
        break

Failed after 18 samples


> The code failed after 18 samples. Why did this happen? Run the code again with line 14 uncommented.
>
> By printing the distribution, you should notice there are only 3 objects left. When we call `sample()` repeatedly on the same distribution, the distribution itself is changing because we sample ***without replacement***.
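>
> The numbers above follow from simple integer division: 75 chips split into samples of 4 gives 18 full samples with 3 chips left over, which is too few for a 19th sample. A quick check (variable names are made up for this example):

```python
total_chips = 20 + 10 + 15 + 30  # Red + Blue + Yellow + Green
sample_size = 4

# divmod returns (number of full samples, chips left over)
full_samples, leftover = divmod(total_chips, sample_size)

print(full_samples, leftover)  # prints: 18 3
```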

3. How do we repeatedly sample from the same conditions?

> To solve our problem, we cannot just take samples of size 4 from the distribution until it is empty because that is not the question we are given. Instead, we need to repeatedly take samples of size 4 from the distribution in its initial state. 
>
> To do this, we have 2 options:
>   - Repeatedly create a new distribution and take a sample of size 4 from it.
>   - Create one distribution, then repeatedly sample from it and reset it.
>
> The second option is much more efficient, so let's try it.

In [18]:
# Create a distribution object
ex_dist8 = MV({'Red': 20, 'Blue':10, 'Yellow':15, 'Green':30})

# Take 20 samples, reset the dist each time
for i in range(20):
    sample = ex_dist8.sample(4)
    print(sample)
    ex_dist8.reset()

# After sampling, check the status of our distribution
print(ex_dist8)

[1 0 0 3]
[2 0 2 0]
[1 0 1 2]
[2 1 1 0]
[2 0 0 2]
[1 0 1 2]
[2 1 0 1]
[0 1 1 2]
[0 0 1 3]
[0 0 3 1]
[0 1 0 3]
[0 1 1 2]
[0 2 0 2]
[1 0 2 1]
[1 0 0 3]
[2 0 1 1]
[1 0 2 1]
[0 1 1 2]
[1 0 1 2]
[1 2 0 1]
Types: ['Red', 'Blue', 'Yellow', 'Green'] 
Counts: [20, 10, 15, 30] 
Total: 75


> We successfully took 20 samples from the distribution under the same conditions!

4. Estimating the solution by sampling.

> Now we have the tools to take many samples of 4 from the given distribution and count how many samples meet the desired condition. We can then estimate the probability of this event as the fraction of samples that meet the condition.

In [20]:
# Create a distribution object
ex_dist9 = MV({'Red': 20, 'Blue':10, 'Yellow':15, 'Green':30})

# Create variables to hold results
num_samples = 100
good_samples = 0
condition = np.ones(4)

# Iterate through many trials
for i in range(num_samples):
    # Take a sample of size 4
    sample = ex_dist9.sample(4)
    # Check the condition
    if np.array_equal(sample, condition):
        good_samples += 1
    # Reset the distribution
    ex_dist9.reset()
    
# Compute probability
prob = good_samples/num_samples
print('Probability of 1 of each type: ', prob)

Probability of 1 of each type:  0.05


> If you run the code above several times, you will notice the estimated probability can change significantly between executions.
>
> To get a more accurate estimate, we increase the number of trials (`num_samples`).
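>
> The effect of the trial count can also be seen without the library. The sketch below is a standard-library stand-in for the experiment above: it draws 4 chips from a full bag on every trial (which plays the role of `reset()`) and reports the estimate for several trial counts. The function name `estimate_one_of_each` and the fixed seed are assumptions made for this example.

```python
import random

TYPES = ['Red', 'Blue', 'Yellow', 'Green']
QUANTITIES = [20, 10, 15, 30]

def estimate_one_of_each(num_samples, rng):
    """Fraction of simulated 4-chip draws that contain one chip of each color."""
    good_samples = 0
    for _ in range(num_samples):
        # Each trial draws from the full bag; counts= requires Python 3.9+
        drawn = rng.sample(TYPES, k=4, counts=QUANTITIES)
        # Four distinct colors among four chips means one of each
        if len(set(drawn)) == 4:
            good_samples += 1
    return good_samples / num_samples

rng = random.Random(0)  # fixed seed so reruns are repeatable
for n in [100, 1_000, 20_000]:
    print('{:>6} trials -> estimate {:.4f}'.format(n, estimate_one_of_each(n, rng)))
```

> Estimates from the larger trial counts should cluster more tightly from run to run than those from 100 trials.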

5. Improving accuracy.

> Try increasing the number of trials in the code below to get an improved estimate of the probability.
>
> ***Hint:*** *The true probability is roughly 0.074*

In [21]:
# Create a distribution object
ex_dist10 = MV({'Red': 20, 'Blue':10, 'Yellow':15, 'Green':30})

# Create variables to hold results
num_samples = 100
good_samples = 0
condition = np.ones(4)

# Iterate through many trials
for i in range(num_samples):
    # Take a sample of size 4
    sample = ex_dist10.sample(4)
    # Check the condition
    if np.array_equal(sample, condition):
        good_samples += 1
    # Reset the distribution
    ex_dist10.reset()
    
# Compute probability
prob = good_samples/num_samples
print('Probability of 1 of each type: ', prob)

Probability of 1 of each type:  0.04
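
> The value in the hint can be checked exactly by counting. There are 20 × 10 × 15 × 30 ways to pick one chip of each color, out of C(75, 4) equally likely ways to pick any 4 chips. The sketch below evaluates this ratio with the standard `math` module (Python 3.8+); the variable names are made up for this example.

```python
from math import comb, prod

counts = [20, 10, 15, 30]  # Red, Blue, Yellow, Green

# Ways to choose exactly one chip of each color
favorable = prod(comb(c, 1) for c in counts)

# Ways to choose any 4 chips from the whole bag
total = comb(sum(counts), 4)

exact = favorable / total
print('Exact probability: {}/{} ~ {:.4f}'.format(favorable, total, exact))
```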
