In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# Lecture 14

In this lecture we will:
1. Simulate the Monty Hall problem
2. Demonstrate deterministic and random sampling
3. Compare probability distributions and empirical distributions
4. Illustrate the law of large numbers

## Addendum

In [None]:
p1 = 2*(1/100 * 1/99) + 2 * (1/100 * 98/99) + 2 * (98/100 * 1/99)

In [None]:
p2 = 1 - (98/100) * (97/99)

In [None]:
p1 == p2

In [None]:
p1, p2

In [None]:
round(p1, 10) == round(p2, 10)
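
The two expressions above reduce to the same fraction, so any mismatch reported by `p1 == p2` would come from floating-point rounding rather than a real difference. As a minimal sketch (the `fractions` and `math` imports and the `_exact` names are additions for illustration, not part of the original cells), exact rational arithmetic confirms the equality, and `math.isclose` is an alternative to rounding both sides:

```python
import math
from fractions import Fraction

# The same two expressions, built from exact rational numbers.
p1_exact = (2 * Fraction(1, 100) * Fraction(1, 99)
            + 2 * Fraction(1, 100) * Fraction(98, 99)
            + 2 * Fraction(98, 100) * Fraction(1, 99))
p2_exact = 1 - Fraction(98, 100) * Fraction(97, 99)

print(p1_exact, p2_exact)     # 197/4950 197/4950
print(p1_exact == p2_exact)   # True: exact arithmetic has no rounding error

# The float versions, compared with a tolerance instead of ==.
p1 = 2*(1/100 * 1/99) + 2*(1/100 * 98/99) + 2*(98/100 * 1/99)
p2 = 1 - (98/100) * (97/99)
print(math.isclose(p1, p2))   # True
```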

---

## The Monty Hall Problem 

Here we simulate the Monty Hall problem.  We break the process into three steps. 

1. Simulate the prize behind the door we picked (this is the only chance event):


In [None]:
prizes = make_array("goat", "goat", "car")

In [None]:
N = 10_000
outcomes = Table().with_column("My Choice", np.random.choice(prizes, N))
outcomes

2. Then Monty Hall reveals a goat behind one of the other doors.

In [None]:
outcomes = outcomes.with_column("Monty's Door", "goat")
outcomes

3. Finally, we compute the prize behind the remaining door.  Since Monty revealed one of the goats, the prize behind the remaining door depends only on our initial choice.  If we picked the car, then the remaining door has a goat.  Otherwise, it has the car.

In [None]:
def other_door(my_choice):
    if my_choice == "car":
        return "goat"
    else:
        return "car"

In [None]:
outcomes = outcomes.with_column("Other Door", outcomes.apply(other_door, "My Choice"))
outcomes

Notice that in the above table each row has two goats and a car.  Each row simulates an outcome of playing the game.

If we stayed with our initial choice, how often would we get the car?

In [None]:
outcomes.group("My Choice").barh("My Choice")

If we switched to the other door, how often would we win?

In [None]:
outcomes.group("Other Door").barh("Other Door")

Would you switch?
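
The bar charts can also be checked numerically. The following is a minimal sketch that repeats the three steps with Python's standard `random` module instead of a `Table`; the fixed seed and the names `my_choices`, `other_doors`, `stay_wins`, and `switch_wins` are assumptions added for illustration.

```python
import random

random.seed(0)  # assumption: fixed seed so the sketch is repeatable

prizes = ["goat", "goat", "car"]
N = 10_000

def other_door(my_choice):
    """Prize behind the remaining door after Monty reveals a goat."""
    return "goat" if my_choice == "car" else "car"

# Step 1: the prize behind our door is the only chance event.
my_choices = [random.choice(prizes) for _ in range(N)]
# Steps 2 and 3: Monty shows a goat, so the remaining door follows from our choice.
other_doors = [other_door(choice) for choice in my_choices]

stay_wins = my_choices.count("car") / N
switch_wins = other_doors.count("car") / N

print("Win rate if we stay:  ", stay_wins)    # expected to be near 1/3
print("Win rate if we switch:", switch_wins)  # expected to be near 2/3
```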

---
<center> Return to Slides </center>

---

## Random Sampling ##

Here we will use a dataset of all United Airlines flights from 6/1/15 to 8/9/15.  For each flight, the data records its destination and how long it was delayed, in minutes.

In [None]:
united = Table.read_table('data/united.csv')
united = ( # Adding row numbers so we can see samples more easily
    united
    .with_column('Row', np.arange(united.num_rows))
    .move_to_start('Row') 
)
united

For each of the following, is this a deterministic or random sampling strategy?

In [None]:
united.where('Destination', 'JFK')

<details><summary>Answer</summary>

**Deterministic**

</details>

In [None]:
united.sample(3, with_replacement=True)

<details><summary>Answer</summary>

**Random**

</details>

In [None]:
(
    united
    .where('Destination', 'JFK')
    .sample(3, with_replacement=True)
)

<details><summary>Answer</summary>

**Random**

</details>
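
The distinction can be sketched without the data file. The example below uses a hypothetical six-row list of flights (made-up values, not rows from `united.csv`) and the standard `random` module; `jfk_only` and `random_sample` are illustrative helper names, not `datascience` functions.

```python
import random

# Hypothetical miniature flight table: (row, destination, delay in minutes).
flights = [
    (0, "JFK", 12), (1, "LAX", -3), (2, "JFK", 45),
    (3, "ORD", 0), (4, "JFK", 7), (5, "SEA", 130),
]

def jfk_only(rows):
    """Deterministic: the same rows come back on every call."""
    return [row for row in rows if row[1] == "JFK"]

def random_sample(rows, k):
    """Random: k rows drawn with replacement, so results can vary between calls."""
    return random.choices(rows, k=k)

print(jfk_only(flights) == jfk_only(flights))  # True
print(random_sample(flights, 3))               # may differ on each run
print(random_sample(jfk_only(flights), 3))     # random, but only JFK rows can appear
```

Filtering first and sampling second is still a random strategy: the filter fixes which rows are eligible, and chance decides which of them are drawn.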

---
<center> Return to Slides </center>

---

## Distributions 

In [None]:
die = Table().with_column('Face', np.arange(1, 7))
die

What is the **Probability Distribution** of drawing each face assuming each face is equally likely (a "fair die")?

In [None]:
roll_bins = np.arange(0.5, 6.6, 1)
die.hist(bins=roll_bins)

We can sample from the die table many times with replacement:

In [None]:
die.sample(3)

We can construct an **Empirical Distribution** from our simulation:

In [None]:
die.sample(10).hist(bins=roll_bins)

If we increase the number of trials in our simulation, what happens to the distribution?

In [None]:
die.sample(100).hist(bins=roll_bins)

In [None]:
die.sample(100_000).hist(bins=roll_bins)
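
To put a number on what the histograms show, one option is to measure how far the empirical proportions are from the fair-die probability of 1/6. This is a minimal sketch using the standard library rather than `Table.sample`; the seed and the helper names `empirical_distribution` and `max_gap` are assumptions added for illustration.

```python
import random
from collections import Counter

random.seed(1)  # assumption: fixed seed so the sketch is repeatable

faces = list(range(1, 7))

def empirical_distribution(num_rolls):
    """Proportion of each face among num_rolls fair-die rolls (with replacement)."""
    counts = Counter(random.choices(faces, k=num_rolls))
    return {face: counts[face] / num_rolls for face in faces}

def max_gap(num_rolls):
    """Largest distance between an empirical proportion and the probability 1/6."""
    distribution = empirical_distribution(num_rolls)
    return max(abs(p - 1 / 6) for p in distribution.values())

# Same sample sizes as the histograms above.
for num_rolls in (10, 100, 100_000):
    print(num_rolls, round(max_gap(num_rolls), 4))
```

The gap is not guaranteed to shrink at every step, but with more rolls it tends toward zero, which is the pattern the law of large numbers describes.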

---
<center> Return to Slides </center>

---

## Large Random Samples 

The United flight delays dataset is relatively large:

In [None]:
united.num_rows

We can plot the distribution of delays for the population:

In [None]:
united.hist('Delay', bins = 50)

There appear to be some very delayed flights!

In [None]:
united.sort('Delay', descending=True)

Let's leave out the extreme delays by restricting the histogram to bins from -20 to 200 minutes. (More on why we do this later.)

In [None]:
united_bins = np.arange(-20, 201, 5)
united.hist('Delay', bins=united_bins)

What happens if we take a small sample from this population of flights and compute the distribution of delays?

In [None]:
united.sample(10).hist('Delay', bins=united_bins)

What if we increase the sample size?

In [None]:
united.sample(1000).hist('Delay', bins=united_bins)

In [None]:
united.sample(2000).hist('Delay', bins=united_bins)

---
<center> Return to Slides </center>

---

## Simulating Statistics ##

Because we have access to the population (this is rare!), we can compute the parameters directly from the data.  For example, suppose we wanted to know the median flight delay:


In [None]:
np.median(united.column('Delay'))

In practice, we will often have only a sample.  The median of the sample is a statistic that estimates the median of the population.

In [None]:
np.median(united.sample(10).column('Delay'))

But is it a good estimate?  

It depends on the sample size (and how close we want it to be).  Here we define a function to simulate the process of computing the median from a random sample of a given size:

In [None]:
def sample_median(size):
    return np.median(united.sample(size).column('Delay'))

In [None]:
sample_median(10)

We can then simulate this sampling process many times:

In [None]:
sample_medians = make_array()

for i in np.arange(1000):
    new_median = sample_median(10)
    sample_medians = np.append(sample_medians, new_median)

In [None]:
medians = Table().with_columns(
    "Sample Medians", sample_medians,
    "Sample Size", 10)
medians.hist("Sample Medians", bins = 50)

In [None]:
sample_medians2 = make_array()

for i in np.arange(1000):
    new_median = sample_median(1000)
    sample_medians2 = np.append(sample_medians2, new_median)

In [None]:
medians.append(Table().with_columns(
    "Sample Medians", sample_medians2,
    "Sample Size", 1000)).hist("Sample Medians", group="Sample Size", bins=50)
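
The overlaid histograms can be summarized by the spread of each set of sample medians. The sketch below is self-contained and makes a loud assumption: `population` is a synthetic, right-skewed stand-in generated with `random.expovariate`, not the United delays. The helper `simulate_medians` and the seed are also additions for illustration.

```python
import random
import statistics

random.seed(2)  # assumption: fixed seed so the sketch is repeatable

# Hypothetical stand-in for the population of delays (NOT the United data).
population = [random.expovariate(1 / 15) - 5 for _ in range(20_000)]
population_median = statistics.median(population)

def sample_median(size):
    """Median of `size` delays drawn with replacement from the population."""
    return statistics.median(random.choices(population, k=size))

def simulate_medians(size, repetitions=1000):
    """Repeat the sampling process and collect one median per repetition."""
    return [sample_median(size) for _ in range(repetitions)]

print("population median:", round(population_median, 2))
for size in (10, 1000):
    medians = simulate_medians(size)
    print("sample size", size,
          "| mean of medians", round(statistics.mean(medians), 2),
          "| spread (SD)", round(statistics.stdev(medians), 2))
```

Under these assumptions, the medians from samples of size 1000 should cluster much more tightly around the population median than those from samples of size 10.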