In [None]:
#:
import babypandas as bpd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import warnings; warnings.simplefilter('ignore')

plt.style.use('fivethirtyeight')

# Lecture 14

## Sampling and Distributions

## Sampling

- What does the opinion of the **sample** say about the **population**?

## Types of samples

### Deterministic sample:
* Sampling scheme doesn’t involve chance

### Probability (random) sample:
* Involves chance
* Before the sample is drawn, you can calculate the probability of selecting each subset of the **population**
* Not all individuals need to have an equal chance of being selected
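
To make the definition concrete, here is a minimal sketch of a probability sample in which individuals do not all have the same chance of selection. The three-person population and the coin-flip scheme are made up for illustration.

```python
from fractions import Fraction

# Hypothetical scheme: flip a fair coin.
# Heads -> the sample is {A, B}; tails -> the sample is {B, C}.
# The probability of every possible sample is known before drawing.
scheme = {
    frozenset({'A', 'B'}): Fraction(1, 2),
    frozenset({'B', 'C'}): Fraction(1, 2),
}

population = ['A', 'B', 'C']

# Chance that each individual ends up in the sample.
chance = {
    person: sum(p for subset, p in scheme.items() if person in subset)
    for person in population
}
print(chance)
```

Here B is always selected while A and C are each selected half the time, yet this is still a probability sample because every subset's probability can be computed in advance.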

## Example: movies

In [None]:
top = bpd.read_csv('data/top_movies.csv')
top

###  Example
* Start with a random number; take every tenth row thereafter.
* Any given row is equally likely to be picked! (But not true for groups of rows!)
* This is a probability sample.

In [None]:
start = np.random.choice(np.arange(10))
top.take(np.arange(start, 200, 10))
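
The claim about single rows versus groups of rows can be checked by listing every sample this scheme can produce. A minimal sketch, working with row positions only and assuming a 200-row table, as the `200` in the cell above suggests:

```python
# All 10 possible samples: one per starting position 0..9.
n_rows = 200
samples = [set(range(start, n_rows, 10)) for start in range(10)]

# Each row appears in exactly one of the 10 equally likely samples,
# so every row has a 1/10 chance of being selected.
appearances = [sum(row in s for s in samples) for row in range(n_rows)]
print(set(appearances))

# Rows 0 and 1 never land in the same sample;
# rows 0 and 10 share the one sample that starts at 0.
together_0_1 = sum({0, 1} <= s for s in samples)
together_0_10 = sum({0, 10} <= s for s in samples)
print(together_0_1, together_0_10)
```

Under uniform sampling without replacement, every pair of rows would have the same chance of being picked together, which is not the case here.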

### Sampling uniformly at random with(out) replacement
* `.sample()` method
* `replace=False` is default. Note: different default than `np.random.choice`

In [None]:
# without replacement
top.sample(5)

In [None]:
# with replacement
top.sample(5, replace=True)
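
The difference between the two modes is easiest to see when the sample is as large as the table. A minimal sketch using Python's standard `random` module on a stand-in list of row labels (not the `top` table itself):

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
rows = list(range(5))  # stand-in for the row labels of a small table

# Without replacement: every row appears at most once.
without = random.sample(rows, 5)

# With replacement: the same row may be drawn more than once.
with_repl = random.choices(rows, k=5)

print(sorted(without))
print(sorted(with_repl))

# Without replacement, the sample cannot be larger than the table.
try:
    random.sample(rows, 6)
except ValueError:
    print('cannot draw 6 distinct rows from 5')
```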

## Sample of Convenience
* Example: sample consists of whoever walks by
    - Just because you think you’re sampling “at random”, doesn’t mean you are.
* You don’t have a random sample if you can’t (in principle) figure out ahead of time
    * what the population is
    * what the chance of selection is, for each group in the population

### Examples: sample of convenience

* Voluntary internet surveys
* Interviewing people on Library Walk
* The first 100 visits to a website after an email campaign begins

### Samples of convenience: pros and cons
* Pros: 
    - Easy and inexpensive (most common type of sample)
* Cons: 
    - Results may not generalize to the population as a whole
    - Results are likely biased

### Example: sample of convenience

* Study: determine the average age of gamblers at a casino 
* Methodology: conducted for three hours on a weekday afternoon 
* Bias: Might overrepresent retired elderly people and underrepresent people of working age

# Distributions

## Probability Distribution
* Random quantity with various possible values, each of which has some associated probability.
* “Probability distribution”:
    - All the possible values of the quantity
    - The theoretical probability of each value
* Example, for rolling a die:

| Value     |Probability |
| ----------- | ----------- |
| 1      | 1/6       |
| 2   | 1/6        |
| 3      | 1/6       |
| 4   | 1/6        |
| 5      | 1/6       |
| 6   | 1/6        |


## Example: probability distribution of die roll

- Distribution is **uniform**.

In [None]:
die =  (
    bpd.DataFrame()
    .assign(face=np.arange(1, 7, 1))
)
die

In [None]:
bins =  np.arange(0.5, 6.6, 1)
die.plot(kind='hist', y='face', bins=bins, density=True)

## Empirical Distribution

* Based on observations
* Observations can be from repetitions of an experiment
* “Empirical Distribution”
    - All observed values
    - The proportion of observations equal to each value

### Example: Die roll
* Simulate a roll as a sample from a table
* Rolling a die = sampling with replacement.

In [None]:
num_rolls = 10
die.sample(n=num_rolls, replace=True)

In [None]:
die.sample(n=num_rolls, replace=True).plot(kind='hist', y='face', bins=bins, density=True)
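
The empirical distribution can also be tabulated directly instead of plotted. A minimal sketch, using the standard library's `random.choices` and `collections.Counter` as a stand-in for `die.sample`:

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the run is repeatable
faces = [1, 2, 3, 4, 5, 6]
num_rolls = 10

# Rolling a die = sampling faces with replacement.
rolls = random.choices(faces, k=num_rolls)

# Empirical distribution: proportion of rolls showing each face.
counts = Counter(rolls)
empirical = {face: counts[face] / num_rolls for face in faces}

for face in faces:
    print(face, empirical[face], 'vs theoretical', round(1 / 6, 3))
```

With only 10 rolls, the empirical proportions are multiples of 0.1 and can sit far from 1/6; some faces may not appear at all.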

# Large Random Samples

## Law of Averages

If a chance experiment is repeated 
    - many times,
    - independently,
    - under the same conditions,
    
then the proportion of times that an event occurs gets closer to the theoretical probability of the event.


Example: As you roll a die repeatedly, the proportion of times you roll a 5 gets closer to 1/6.

In [None]:
for num_rolls in [10, 50, 100, 500, 1000, 5000, 10000]:
    die.sample(n=num_rolls, replace=True).plot(kind='hist', y='face', bins=bins, density=True)
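
The same idea in numbers rather than histograms: a minimal sketch that tracks the proportion of 5s as the number of rolls grows, again using `random.choices` as a stand-in for `die.sample`. Individual runs can wobble, so the proportion is not guaranteed to improve at every step.

```python
import random

random.seed(2)  # fixed seed so the run is repeatable
faces = [1, 2, 3, 4, 5, 6]

proportions = {}
for num_rolls in [10, 100, 1000, 10000, 100000]:
    rolls = random.choices(faces, k=num_rolls)
    # Proportion of rolls that came up 5.
    proportions[num_rolls] = rolls.count(5) / num_rolls
    print(num_rolls, proportions[num_rolls])

print('theoretical:', 1 / 6)
```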

## Large Random Samples

If the sample size is large, then the empirical distribution of a uniform random sample approximates the true distribution, with "high probability".

### Example: distribution of flight delays

* All United flights leaving SFO between 6/1/15 and 8/9/15.

In [None]:
united_full = bpd.read_csv('data/united_summer2015.csv')
united_full

## Only need delays...

In [None]:
united = united_full.get(['Delay'])
united

### Empirical distribution of flight delays

* This is our population.
* We will sample from this population of all United flights.

In [None]:
# population distribution
bins = np.arange(-20, 300, 10)
united.plot(kind='hist', y='Delay', bins=bins, density=True)

In [None]:
# empirical distribution
N = 10**2
united.sample(N, replace=True).plot(kind='hist', y='Delay', bins=bins, density=True)

### Average Flight Delay

- What is the average delay of United flights out of SFO?
- We'd love to know the average delay of the **population**, but we only have a **sample**.
- How does the mean of the **sample** compare to the mean of the **population**?

## Mean of the Population

In [None]:
# calculate the mean
united_mean = united.get('Delay').mean()
united_mean

## Mean of the Large Random Sample

- This is called the **sample mean**.
- Because the sample is random, the **sample mean** is too!

In [None]:
united.sample(100).get('Delay').mean()

## Mean of the Large Random Sample

- As the sample gets bigger, the sample mean tends to get closer to the mean of the population.

In [None]:
# the mean of a lot of samples
for n in np.arange(100, 10000, 200):
    m = united.sample(int(n)).get('Delay').mean()
    print('size of sample: ', n, '\t', 'mean of sample: ', m)
    
print('\n The population mean is', united_mean)
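
The same experiment can be run without the flight data. A minimal sketch on a made-up, right-skewed population (the values are synthetic, not United delays), sampling without replacement as `.sample` does by default:

```python
import random
import statistics

random.seed(3)  # fixed seed so the run is repeatable
# Synthetic population of 10,000 "delays" (hypothetical values, in minutes).
population = [random.expovariate(1 / 15) for _ in range(10000)]
population_mean = statistics.mean(population)

# Distance between the sample mean and the population mean, by sample size.
errors = {}
for n in [100, 1000, 10000]:
    sample = random.sample(population, n)  # without replacement
    errors[n] = abs(statistics.mean(sample) - population_mean)
    print(n, errors[n])
```

When the sample size equals the population size and sampling is without replacement, the sample is the whole population, so the error is zero.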

## How good is the **sample mean**?

- Is it close to the population mean?
- If the sample is small, there is a high chance that the sample mean is far from the population mean.
- If the sample is big, there is a small chance that the sample mean is far from the population mean.


## Small Random Sample

<img src="data/bullseye-high.png">

## Big Random Sample

<img src="data/bullseye-low.png">

### Distribution of sample means

- Repeatedly draw a bunch of samples.
- Record the mean of each & visualize.
    - "How different could the sample mean have been, if we'd drawn a different sample?"
- Try different sample sizes.

In [None]:
#sample one thousand flights, two thousand times
n_experiments = 2000
means = np.array([])

for n in np.arange(n_experiments):
    m = united.sample(1000, replace=True).get('Delay').mean()
    means = np.append(means, m)

bpd.DataFrame().assign(means=means).plot(kind='hist', bins=np.arange(0,40,.5), density=True)
plt.axvline(x=united_mean, c='r');

### Discussion Question

Above we sampled **one thousand** flights, two thousand times. If we now sample **one hundred** flights, two thousand times, how will the histogram change?

|Option|Answer|
|---|---|
|A| narrower|
|B| wider|
|C| shifted left|
|D| shifted right|
|E| unchanged|


### Answer: wider

In [None]:
#sample one hundred flights, two thousand times
n_experiments = 2000
means = np.array([])

for n in np.arange(n_experiments):
    m = united.sample(100, replace=True).get('Delay').mean()
    means = np.append(means, m)

bpd.DataFrame().assign(means=means).plot(kind='hist', bins=np.arange(0,40,.5), density=True)
plt.axvline(x=united_mean, c='r');
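
One way to quantify "wider" is to compare the standard deviation of the sample means at the two sample sizes. A minimal sketch on synthetic data (not the United delays), using 500 repetitions instead of 2000 to keep it quick; `sd_of_sample_means` is a helper name introduced here for illustration:

```python
import random
import statistics

random.seed(4)  # fixed seed so the run is repeatable
# Synthetic right-skewed population (hypothetical values, in minutes).
population = [random.expovariate(1 / 15) for _ in range(5000)]

def sd_of_sample_means(sample_size, n_experiments=500):
    """Spread of the sample mean across repeated samples with replacement."""
    means = [
        statistics.fmean(random.choices(population, k=sample_size))
        for _ in range(n_experiments)
    ]
    return statistics.stdev(means)

sd_100 = sd_of_sample_means(100)
sd_1000 = sd_of_sample_means(1000)
print(sd_100, sd_1000, sd_100 / sd_1000)
```

Since the spread of the sample mean shrinks roughly like 1/√n, the ratio should land somewhere near √10 ≈ 3.16: samples ten times smaller give a histogram about three times wider, centered in the same place.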

## How we sample matters

* So far, we've taken large uniform random samples, with replacement, from the full population.
* The sample mean, for samples like this, reflects the population mean.
* But this is not always the case if we sample differently.

### Different sampling scheme: uniform random sample of flights from Denver
* This sample is still a probability sample.
* Estimation of the mean is highly biased!

In [None]:
n_experiments = 2000
means = np.array([])

den = united_full[united_full.get('Destination') == 'DEN'].get(['Delay'])
for n in np.arange(n_experiments):
    m = den.sample(100, replace=True).get('Delay').mean()
    means = np.append(means, m)

bpd.DataFrame().assign(means=means).plot(kind='hist', bins=np.arange(0,40,.5), density=True)
plt.axvline(x=united_mean, c='r');
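
Why a perfectly random sample can still be biased: the randomness is over the wrong group. A minimal sketch with two made-up destinations whose delays differ (hypothetical values, not the real Denver data):

```python
import random
import statistics

random.seed(5)  # fixed seed so the run is repeatable
# Hypothetical delays: one destination tends to run early, the other late.
dest_a = list(range(0, 20)) * 50    # 1000 flights, mean 9.5
dest_b = list(range(20, 60)) * 25   # 1000 flights, mean 39.5
population = dest_a + dest_b
population_mean = statistics.fmean(population)  # 24.5

# Uniform random samples, but drawn only from destination A.
means = [
    statistics.fmean(random.choices(dest_a, k=100))
    for _ in range(500)
]
bias = statistics.fmean(means) - population_mean
print(population_mean, statistics.fmean(means), bias)
```

The sample means cluster tightly around destination A's mean, not the overall mean, and drawing more or larger samples does not close that gap.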

### Different sampling scheme: evenly spaced samples
* Randomly choose one of 20 places to start, take every 50th flight thereafter.
* This sample is also a probability sample.
* Why does the histogram look this way?

In [None]:
n_experiments = 2000
means = np.array([])
for n in np.arange(n_experiments):
    start = np.random.choice(np.arange(20))
    m = united.take(np.arange(start, united.shape[0], 50)).get('Delay').mean()
    means = np.append(means, m)

bpd.DataFrame().assign(means=means).plot(kind='hist', bins=np.arange(0,40,.5), density=True)
plt.axvline(x=united_mean, c='r');
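
One way to approach the question above is to count how many different samples this scheme can produce. A minimal sketch on a synthetic stand-in for the `Delay` column (the values are made up):

```python
import statistics

# Synthetic stand-in for the Delay column (hypothetical values).
delays = [(7 * i) % 53 for i in range(2000)]

# The only randomness is the starting row, and there are just 20 choices,
# so there are at most 20 different samples and 20 different sample means.
possible_means = {
    statistics.fmean(delays[start::50]) for start in range(20)
}
print(len(possible_means))
```

However many times the experiment is repeated, the histogram can only ever show these few values.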

### Different sampling scheme: sample the first 100 rows
* No variation (the sample is deterministic, so every repetition gives the same mean) and very high bias!

In [None]:
n_experiments = 2000
means = np.array([])
for n in np.arange(n_experiments):
    m = united.take(np.arange(100)).get('Delay').mean()
    means = np.append(means, m)

bpd.DataFrame().assign(means=means).plot(kind='hist', bins=np.arange(0,40,.5), density=True)
plt.axvline(x=united_mean, c='r');
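
A minimal sketch of the same effect on made-up data, assuming (hypothetically) that the rows are ordered so the first 100 look nothing like the rest:

```python
import statistics

# Synthetic stand-in for the Delay column, in increasing order
# (hypothetical values).
delays = list(range(1000))
population_mean = statistics.fmean(delays)  # 499.5

# "Sampling" the first 100 rows involves no chance at all,
# so every repetition produces exactly the same mean.
means = [statistics.fmean(delays[:100]) for _ in range(2000)]

print(set(means))
print(population_mean)
```

All 2000 repetitions collapse to a single value, far from the population mean.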