In [None]:
#: the usual suspects
import babypandas as bpd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Lecture 13

## Simulation

# Simulation

## Finding probabilities with computers

## Simulation

- What is the probability of getting 60 or more heads if I flip 100 coins?
- Approximation through simulation:
    1. Figure out how to do one experiment (i.e., flip 100 coins).
    2. Run the experiment a bunch of times.
    3. Find the fraction of times where number of heads >= 60.

## Making a random choice (e.g., flipping a coin)

- `np.random.choice(options)`
- Input `options` is a list or array to choose from.
- Returns a randomly chosen element of `options`.

In [None]:
# simulate a coin flip
np.random.choice(['Heads', 'Tails'])

## Making multiple random choices

- `np.random.choice(options, n)`

In [None]:
#: simulate 10 coin flips
np.random.choice(['Heads', 'Tails'], 10)

## Replacement vs. without replacement

- By default, this selects *with* replacement.
- That is, after making a selection, that option is still available.
- If an option can only be selected once, select *without* replacement.

In [None]:
#: choose three of the teams
teams = ["bunnies", "ducklings", "fawns", "joeys", 
         "lambs", "piglets", "porcupettes", "tadpoles" ]
np.random.choice(teams, 3, replace=False)

# Simulation

## Flipping coins

- What is the probability of getting 60 or more heads if I flip 100 coins?
- Approximation through simulation:
    1. Figure out how to do one experiment (i.e., flip 100 coins).
    2. Run the experiment a bunch of times.
    3. Find the fraction of times where number of heads >= 60.

## Running the experiment once...

- Use `np.random.choice` to flip 100 coins.
- Use `np.count_nonzero` to count the number of heads.
    - Counts the number of entries that are `True`.

In [None]:
coins = np.random.choice(['Heads', 'Tails'], 100)
coins

In [None]:
coins == 'Heads'

## Put it into a function

Make it easier to run the experiment again.

In [None]:
def coin_experiment():
    coins = np.random.choice(['Heads', 'Tails'], 100)
    return np.count_nonzero(coins == 'Heads')

In [None]:
coin_experiment()

## Repeating the experiment

- We can repeat this process many times by using a `for`-loop
- Need to store the results in an array... use `np.append`!

In [None]:
# make head_counts array
n_repetitions = 10000

head_counts = np.array([])

for i in np.arange(n_repetitions):
    head_count = coin_experiment()
    head_counts = np.append(head_counts, head_count)

In [None]:
# in how many trials was the number of heads >= 60?
at_least_60 = ...
at_least_60

In [None]:
# what is this as a proportion?
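
One possible way to fill in the two cells above, sketched as a self-contained block. It rebuilds `head_counts` the same way as earlier; the names `proportion` and `exact` are illustrative and not part of the lecture. The exact binomial value is included only as a sanity check on the simulated estimate.

```python
import numpy as np
from math import comb

def coin_experiment():
    # flip 100 fair coins and count the heads
    coins = np.random.choice(['Heads', 'Tails'], 100)
    return np.count_nonzero(coins == 'Heads')

n_repetitions = 10000
head_counts = np.array([])
for i in np.arange(n_repetitions):
    head_counts = np.append(head_counts, coin_experiment())

# number of trials in which the number of heads was >= 60
at_least_60 = np.count_nonzero(head_counts >= 60)

# the same count, as a proportion of all trials
proportion = at_least_60 / n_repetitions

# exact probability from the binomial formula, for comparison
exact = sum(comb(100, k) for k in range(60, 101)) / 2**100

print(at_least_60, proportion, exact)
```

The simulated proportion changes a little from run to run, but it should land close to the exact value.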


## Visualizing the distribution

In [None]:
#: visualize distribution of trial results
bpd.DataFrame().assign(
    Number_of_Heads=head_counts
).plot(kind='hist', bins=np.arange(30.5,70), density=True)
# plt.axvline(60, color='C1')

## The "Monty Hall" Problem

<img src="data/monty_1.svg" width=75% />

<img src="data/monty_2.svg" width=75% />

<img src="data/monty_3.svg" width=75% />

## Discussion question

- You originally selected door #2. The host reveals door #3 to have a goat behind it.
- What should you do?

    - A) Might as well stick with door #2; it has just as high a chance of winning as door #1.
    - B) Switch to door #1; it has a higher chance of winning than door #2.

## Let's see

- We'll compute:
    - probability of winning if we switch.
    - probability of winning if we stay.
        - it's just 1 - (probability of winning if we switch)
- Whichever strategy has higher probability of winning is best.

# Simulate

- *Simulate* the Monty Hall problem many times to *estimate* probability.

    1. Figure out how to simulate one game of Monty Hall.
    2. Play a bunch of games.
    3. Count the proportion of wins for each strategy (stay or switch).

## 1) Simulate a single game

When a contestant picks their door, there are three equally-likely outcomes:

1. Goat #1
2. Goat #2
3. Car

In [None]:
behind_picked_door = np.random.choice(['Car', 'Goat 1', 'Goat 2'])
behind_picked_door

## 1) Simulate a single game

Suppose we can see what is behind their door (but the contestant can't).

- If it is a car, they will win if they stay.
- If it is a goat, they will win if they switch.
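
The two bullets above can be checked by playing out every combination of car position and first pick. This is only a sketch: the door numbering and the names `doors`, `opened`, `switch_to`, and `outcomes` are assumptions made for illustration, and when Monty has two goat doors to choose from, the code simply opens the first one.

```python
doors = [1, 2, 3]
outcomes = []  # one (car, pick, stay_wins, switch_wins) tuple per case

for car in doors:
    for pick in doors:
        # Monty opens a door that is neither the contestant's pick nor the car
        opened = [d for d in doors if d != pick and d != car][0]
        # switching means taking the one door that is neither picked nor opened
        switch_to = [d for d in doors if d != pick and d != opened][0]
        outcomes.append((car, pick, pick == car, switch_to == car))

for outcome in outcomes:
    print(outcome)
```

In every case exactly one of the two strategies wins: staying wins when the first pick was the car, and switching wins when the first pick was a goat.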

## 1) Simulate a single game


In [None]:
#- determine winning_strategy ('Stay' or 'Switch') based on what is behind_picked_door
if behind_picked_door == 'Car':
    winning_strategy = 'Stay'
else:
    # a goat was behind the picked door.
    # Monty will reveal the other goat. 
    # Switching wins:
    winning_strategy = 'Switch'

## 1) Simulate a single game

Turn it into a function to make it easier to repeat:

In [None]:
def simulate_monty_hall():
    behind_picked_door = np.random.choice(['Car', 'Goat 1', 'Goat 2'])
    
    if behind_picked_door == 'Car':
        winning_strategy = 'Stay'
    else:
        winning_strategy = 'Switch'
        
    print(behind_picked_door, 'was behind the door. Winning strategy:', winning_strategy)
    return winning_strategy

In [None]:
simulate_monty_hall()

## 2) Play a bunch of times

In [None]:
n_repetitions = 100

for i in np.arange(n_repetitions):
    simulate_monty_hall()

## 2) Play a bunch of times

We should save the winning strategies. Use `np.append`:

In [None]:
#: many simulations

n_repetitions = 10000

winning_strategies = np.array([])
for i in np.arange(n_repetitions):
    winning_strategy = simulate_monty_hall()
    winning_strategies = np.append(winning_strategies, winning_strategy)

## 3) Count the proportion of wins for each strategy (stay or switch).

In [None]:
winning_strategies

In [None]:
np.count_nonzero(winning_strategies == 'Switch')

In [None]:
np.count_nonzero(winning_strategies == 'Switch') / n_repetitions
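
For comparison, here is a self-contained sketch that estimates the proportion of wins for both strategies. To avoid printing on every game, it draws all of the picked doors in a single `np.random.choice` call instead of calling `simulate_monty_hall` in a loop; that shortcut, and the names `switch_proportion` and `stay_proportion`, are assumptions rather than the lecture's code.

```python
import numpy as np

n_repetitions = 10000

# what is behind the contestant's first pick, in each of the simulated games
behind_picked_door = np.random.choice(['Car', 'Goat 1', 'Goat 2'], n_repetitions)

# staying wins exactly when the first pick was the car
stay_proportion = np.count_nonzero(behind_picked_door == 'Car') / n_repetitions

# switching wins in every other game
switch_proportion = np.count_nonzero(behind_picked_door != 'Car') / n_repetitions

print('stay:', stay_proportion, 'switch:', switch_proportion)
```

The two proportions always add up to 1, and the estimates should be close to 1/3 for staying and 2/3 for switching.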

## Marilyn vos Savant's column


<div style="display: flex; margin-top: .5in">
<div style="width: 45%;">
    <ul>
        <li>vos Savant asked the question in <i>Parade</i> magazine.</li>
        <li>She stated the correct answer: <i>switch</i>.</li>
        <li>Received over 10,000 letters in disagreement.</li>
        <li>Over 1,000 letters from people with Ph.D.s</li>
    </ul>
</div>
<div style="width: 50%;">
    <img src="data/vos_savant.jpg" width=75%>
</div>
</div>


# Simulation Summary

1. Make a function that runs the experiment once.
2. Run that function a bunch of times with a `for`-loop, save results in an array with `np.append`.
3. Count how many times an outcome occurs with `np.count_nonzero`.
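
As a sketch of the three steps applied to a different experiment, the block below estimates the chance of seeing at least one six in four rolls of a die. The experiment itself and the names `one_experiment`, `results`, and `estimate` are hypothetical and chosen only for illustration.

```python
import numpy as np

# 1. a function that runs the experiment once
def one_experiment():
    rolls = np.random.choice(np.arange(1, 7), 4)
    # True if at least one of the four rolls is a six
    return np.count_nonzero(rolls == 6) >= 1

# 2. run it many times with a for-loop, saving the results with np.append
n_repetitions = 10000
results = np.array([])
for i in np.arange(n_repetitions):
    results = np.append(results, one_experiment())

# 3. count how often the outcome occurred (True is stored as 1.0, False as 0.0)
estimate = np.count_nonzero(results) / n_repetitions
print(estimate)
```

The exact answer here is 1 - (5/6)^4, so the estimate should come out near 0.52.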

# Sampling

## Sampling

- What do people think of the new Star Wars movie?
- We can't ask *everyone* in the **population** at large.
- So we take a **sample**.
- Central question: what does the opinion of the sample say about the population?

## Population and Sample

- The **population** is the set of things being **sampled** from.
- Examples: all moviegoers, all voters, the faces of a die.

## Types of samples

### Deterministic sample:
* Sampling scheme doesn’t involve chance

### Probability (random) sample:
* Involves chance
* Before the sample is drawn, you can calculate the probability of selecting each subset of the **population**
* Not all individuals need to have an equal chance of being selected

### Example: deterministic sample

Sample of students: take 50% of students, alphabetically by last name

### Example: probability sample

Sample of students: flip a coin for each student in class (heads, keep; tails, don't keep)

### Example: a probability sample
* Population: 3 individuals (A, B, C)
* Select a sample of 2
    - A chosen with probability 1
    - Choose B or C based on coin toss
* Possible samples: AB, AC, BC
    - Chance of AB: ½
    - Chance of AC: ½
    - Chance of BC: 0
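
A short simulation of this scheme, as a sketch: A is always included and a fair coin decides between B and C. The function name `draw_sample` and the variable `proportions` are made up for illustration.

```python
import numpy as np

def draw_sample():
    # A is chosen with probability 1; a coin toss picks B or C
    second = np.random.choice(['B', 'C'])
    return 'A' + second

n_repetitions = 10000
samples = np.array([draw_sample() for i in np.arange(n_repetitions)])

# proportion of simulated samples equal to each possible subset of size 2
proportions = {}
for subset in ['AB', 'AC', 'BC']:
    proportions[subset] = np.count_nonzero(samples == subset) / n_repetitions

print(proportions)
```

The sample BC never appears, while AB and AC each show up about half the time. The scheme still counts as a probability sample, because the chance of every subset is known before drawing.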

## Example: movies

In [None]:
top = bpd.read_csv('data/top_movies.csv').set_index('Title')
top

### Example: deterministic or probabilistic sample?
* a sample of 3 specific rows

In [None]:
top.take([3,5,8])

### Example: deterministic or probabilistic sample?
* a sample via a selection

In [None]:
top[top.index.str.contains('and the')]

### Discussion question
Is the following sampling scheme a deterministic or probabilistic sample?
* Start with a random number; take every tenth row thereafter.

|Option|Answer|
|---|---|
|A| Deterministic|
|B| Probabilistic|

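### Answer
* Start with a random number; take every tenth row thereafter.
* This is a probability sample: the starting row is chosen at random, so any given row is equally likely to be picked. (But this is not true for groups of rows: some groups can never be chosen together!)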

In [None]:
start = np.random.choice(np.arange(10))
top.take(np.arange(start, 200, 10))
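
The claim about single rows versus groups of rows can be checked with a sketch that only tracks row positions. It assumes a table of 200 rows, matching the `200` used above; the counter names are illustrative.

```python
import numpy as np

n_rows = 200          # assumed number of rows in the table
n_repetitions = 10000

row_0 = 0             # times row 0 is in the sample
rows_0_and_1 = 0      # times rows 0 and 1 are both in the sample
rows_0_and_10 = 0     # times rows 0 and 10 are both in the sample

for i in np.arange(n_repetitions):
    start = np.random.choice(np.arange(10))
    chosen = np.arange(start, n_rows, 10)
    if 0 in chosen:
        row_0 = row_0 + 1
    if 0 in chosen and 1 in chosen:
        rows_0_and_1 = rows_0_and_1 + 1
    if 0 in chosen and 10 in chosen:
        rows_0_and_10 = rows_0_and_10 + 1

print(row_0 / n_repetitions, rows_0_and_1 / n_repetitions, rows_0_and_10 / n_repetitions)
```

Row 0 is picked about one time in ten, like every other row. Rows 0 and 1 are never picked together, while rows 0 and 10 are always picked together, so groups of rows are not equally likely.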

### Example: samples uniformly at random with(out) replacement
* `.sample()` method
* `replace=False` is default.

In [None]:
# without replacement
top.sample(5)

In [None]:
# with replacement
top.sample(5, replace=True)

## Sample of Convenience
* Example: sample consists of whoever walks by
    - Just because you think you’re sampling “at random”, doesn’t mean you are.
* If you can’t (in principle) figure out ahead of time
    * what’s the population
    * what’s the chance of selection, for each group in the population
* then you don’t have a random sample!

### Examples: sample of convenience

* Voluntary internet surveys
* Interviewing people on Library Walk
* The first 100 visits to a website after an email campaign begins

### Samples of convenience: pros and cons
* Pros: 
    - Easy and inexpensive
    - Most common type of sample
* Cons: 
    - Results may not generalize to the population as a whole
    - Results are likely biased

### Example: sample of convenience

* Study: determine the average age of gamblers at a casino 
* Methodology: survey conducted over three hours on a weekday afternoon
* Bias: Might overrepresent elderly people who have retired and underrepresent people of working age

# Distributions

## Probability Distribution
* Random quantity with various possible values
* Example: what we see when we roll a die.
* “Probability distribution”:
    - All the possible values of the quantity
    - The probability of each of those values

## Example: probability distribution of die roll

- Distribution is **uniform**.

In [None]:
die =  (
    bpd.DataFrame()
    .assign(face=np.arange(1, 7, 1))
)
die

In [None]:
bins =  np.arange(0.5, 6.6, 1)
die.plot(kind='hist', y='face', bins=bins, density=True)

## Empirical Distribution

* Based on observations
* Observations can be from repetitions of an experiment
* “Empirical Distribution”
    - All observed values
    - The proportion of observations that took each value

### Example: Dice
* Simulate a roll as a sample from a table
* Rolling a die = sampling with replacement.

In [None]:
n = 10

In [None]:
die.sample(n=n, replace=True).plot(kind='hist', y='face', bins=bins, density=True)
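
The histogram shows the empirical distribution as a picture; the same information can be computed as numbers. This is a sketch that uses `np.random.choice` on the faces directly instead of the `die` table, and the names `rolls` and `empirical` are illustrative.

```python
import numpy as np

n = 10
faces = np.arange(1, 7)

# rolling a die n times = sampling the faces with replacement (the default)
rolls = np.random.choice(faces, n)

# proportion of the n observed rolls that showed each face
empirical = np.array([])
for face in faces:
    empirical = np.append(empirical, np.count_nonzero(rolls == face) / n)

print(rolls)
print(empirical)
```

The six proportions always sum to 1. With only 10 rolls they can differ noticeably from the uniform probabilities of 1/6 each.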