In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## Lecture 13: Probability and Sampling ##

## Monty Hall (Review)

The following cell defines a function to simulate a single round of the Monty Hall game. (Look at the lecture notebook from last class for more details on how this function works.)

In [None]:
doors = make_array('car', 'first goat', 'second goat')

def monty_hall():
    """
    Simulate one Monty Hall game.
    Returns a list containing:
        1. what was behind the contestant's original door
        2. what was behind the door the host opened
        3. what was behind the remaining door
    """
    
    # Step 1: the contestant picks a door
    # Since the goats / car are randomly assigned, it is reasonable
    # to assume this choice is random
    contestant_choice = np.random.choice(doors)
    
    # Step 2: the host opens one of the other two doors, to reveal a goat
    if contestant_choice == 'first goat':
        monty_choice = 'second goat'
        remaining_door = 'car'
        
    elif contestant_choice == 'second goat':
        monty_choice = 'first goat'
        remaining_door = 'car'
        
    elif contestant_choice == 'car': 
        monty_choice = np.random.choice(['first goat', 'second goat'])
        if monty_choice == 'first goat':
            remaining_door = 'second goat'
        else:
            remaining_door = 'first goat'
        
    return [contestant_choice, monty_choice, remaining_door]

Now we simulate the Monty Hall game many times and plot the results:

In [None]:
# Simulate the game
games = Table(['Original Door', 'Revealed', 'Remaining'])
for i in range(1000):
    game_i = monty_hall()
    games.append(game_i)
    
# Use the group method to count how often each item is behind the original door...
original = games.group('Original Door') 
# ...and the remaining door
remaining = games.group('Remaining') 

# Use a bar chart to visualize the outcome
joined = original.join('Original Door', remaining, 'Remaining')
joined = joined.relabeled(0, 'Item').relabeled(1, 'Original Door').relabeled(2, 'Remaining Door')
joined.barh('Item')

Roughly 2/3 of the time, the car is behind the remaining door, and the "switch doors" strategy wins! Can we explain this seemingly paradoxical result using probability theory?
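One way to answer that question is to enumerate the contestant's three equally likely first picks and ask, for each, what the host's reveal leaves behind the remaining door. The sketch below does this with exact fractions. The helper name `remaining_after_reveal` and the variable names are illustrative assumptions, not part of the lecture code.

```python
from fractions import Fraction

# The contestant's first pick is assumed to be uniformly random over the three doors.
first_picks = ['car', 'first goat', 'second goat']

def remaining_after_reveal(pick):
    """Hypothetical helper: what sits behind the unopened, unchosen door."""
    # If the contestant holds a goat, the host must reveal the other goat,
    # so the car is behind the remaining door. If the contestant holds the
    # car, whichever goat the host reveals, a goat remains.
    return 'car' if pick != 'car' else 'goat'

p_each = Fraction(1, len(first_picks))

# Switching wins exactly when the remaining door hides the car.
p_switch_wins = sum(p_each for pick in first_picks
                    if remaining_after_reveal(pick) == 'car')
# Staying wins exactly when the first pick was the car.
p_stay_wins = sum(p_each for pick in first_picks if pick == 'car')

print(p_switch_wins, p_stay_wins)  # prints: 2/3 1/3
```

Under these assumptions, switching wins in the two cases where the first pick was a goat, which matches the roughly 2/3 share seen in the simulation.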

## Sampling

Let's look at examples of different kinds of sampling, using a table of United Airlines flights.

In [None]:
united = Table.read_table('data/united.csv')
united

We could create a sample by selecting only flights to JFK:

In [None]:
united.where('Destination', 'JFK')

**Question:** is this a random sample or a deterministic sample?

In [None]:
# ...

We can define a sample based on specific rows, e.g. rows 34, 6321, and 10040:

In [None]:
united.take(make_array(34, 6321, 10040))

**Question:** is this a random sample or a deterministic sample?

In [None]:
# ...

A *systematic sample* starts from a random position, then selects evenly-spaced positions afterwards:

In [None]:
start = np.random.choice(np.arange(1000))
rows = np.arange(start, united.num_rows, 1000)
rows

In [None]:
systematic_sample = united.take(rows)
systematic_sample.show()

**Question:** is this a random sample or a deterministic sample?

In [None]:
# ...

A *simple random sample* is a random sample in which every individual has an equal probability of being selected and, more strongly, every possible group of the chosen size is equally likely to be the sample. Simple random samples are drawn without replacement, meaning that no individual can appear in the sample twice. We can draw a simple random sample using the `sample` table method with the argument `with_replacement=False`:

In [None]:
sample_size = 100
simple_random_sample = united.sample(sample_size, with_replacement=False)
simple_random_sample

We can also sample with equal probabilities *with replacement*. We refer to this as a "simple random sample with replacement."

In [None]:
sample_size = 100
simple_random_sample_wrp = united.sample(sample_size, with_replacement=True)
simple_random_sample_wrp
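The difference between the two kinds of sample can be sketched with plain NumPy on a small set of row indices. This is an illustration only: the 20 hypothetical indices stand in for table rows, and the seed is fixed just to make the sketch reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only so the sketch is reproducible
row_indices = np.arange(20)      # hypothetical row labels standing in for table rows

# Draw 10 indices each way, with equal probability for every index.
without = rng.choice(row_indices, size=10, replace=False)
with_rep = rng.choice(row_indices, size=10, replace=True)

# Without replacement, every drawn index is distinct.
print(len(np.unique(without)) == len(without))
# With replacement, repeats are possible, so the distinct count can be smaller.
print(len(np.unique(with_rep)) <= len(with_rep))
```

When the sample is small relative to the table, as with 100 rows drawn from the flights table, repeats are less likely, so the two kinds of sample tend to look alike.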

## Distributions ##

Let's examine distributions for rolling a 6-sided die.

In [None]:
die = Table().with_column('Face', np.arange(1, 7))
die

Since each face is equally likely, each face has probability $\frac{1}{6} \approx 16.67\%$ on any roll. We can visualize this using a histogram:

In [None]:
# Select bins of width 1, where the integer value of the roll is in the center of the bin
roll_bins = np.arange(0.5, 6.6, 1) 
die.hist(bins=roll_bins)

Let's draw some simple random samples with replacement (the default for `sample`), and see what the empirical distributions look like:

In [None]:
die.sample(10).hist(bins=roll_bins)

In [None]:
die.sample(1000).hist(bins=roll_bins)

In [None]:
die.sample(100000).hist(bins=roll_bins)

As we select larger and larger samples, the empirical distributions (usually) look more and more like the probability distribution!
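To put a number on "more and more like," one option is to measure how far the empirical proportions are from $\frac{1}{6}$ at each sample size. The sketch below uses plain NumPy rather than the `die` table, and the distance it reports (half the sum of absolute differences, often called total variation distance) is a choice made for illustration, not something defined in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)        # seeded only so the sketch is reproducible
faces = np.arange(1, 7)
true_probs = np.full(6, 1 / 6)        # probability distribution of a fair die

def empirical_proportions(n):
    """Roll the die n times and return the proportion of each face."""
    rolls = rng.choice(faces, size=n)               # with replacement by default
    counts = np.bincount(rolls, minlength=7)[1:]    # counts for faces 1..6
    return counts / n

distances = {}
for n in (10, 1000, 100000):
    props = empirical_proportions(n)
    # Half the sum of absolute differences between empirical and true proportions
    distances[n] = 0.5 * np.abs(props - true_probs).sum()
    print(n, round(distances[n], 4))
```

The distance typically shrinks as the sample size grows, though a particular small sample can land close to the probability distribution by chance.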