# Lecture 11

Sampling

### Deterministic sample:
* Sampling scheme doesn’t involve chance

### Probability (random) sample:
* Before the sample is drawn, you have to know the probability of selecting each group of people in the population
* Not all individuals need to have an equal chance of being selected

### Example: deterministic sample

Sample of students: take the first 50% of students, ordered alphabetically by last name

### Example: probability sample

Sample of students: flip a coin for each student in class (heads, keep; tails, leave)

### Example: a probability sample
* Population: 3 individuals (A, B, C)
* Select a sample of 2
    - A chosen with probability 1
    - Choose B or C based on coin toss
* Possible samples: AB, AC, BC
    - Chance of AB: ½
    - Chance of AC: ½
    - Chance of BC: 0
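
The chances above can be checked by writing down the outcomes of the coin toss. This is a minimal sketch in plain Python rather than the `datascience` API; the names `samples` and `inclusion` are made up for illustration, and which side of the coin picks B is an arbitrary assumption.

```python
from fractions import Fraction

# A is always chosen; a fair coin decides between B and C.
samples = {
    "AB": Fraction(1, 2),   # say heads: pick B
    "AC": Fraction(1, 2),   # say tails: pick C
    "BC": Fraction(0),      # impossible, because A is always in the sample
}

# Chance that each individual ends up in the sample.
inclusion = {
    person: sum(p for sample, p in samples.items() if person in sample)
    for person in "ABC"
}

print(inclusion)
```

The individuals do not have equal chances of selection (A is certain, B and C are not), yet every chance is known before the sample is drawn, so this is still a probability sample.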

In [None]:
#:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import warnings; warnings.simplefilter('ignore')

plt.style.use('fivethirtyeight')

In [None]:
#:
top = Table.read_table('top_movies.csv')
top = top.with_column('Row Index', np.arange(top.num_rows))
top = top.move_to_start('Row Index')

top

### Example: deterministic or probabilistic sample?
* a sample of 3 specific rows

In [None]:
top.take(make_array(3,5,8))

### Example: deterministic or probabilistic sample?
* a sample via a where statement

In [None]:
top.where('Title', are.containing('and the'))

### Discussion question
Is the following sampling scheme a deterministic or probabilistic sample?
* Start with a random number; take every tenth row thereafter.

|Option|Answer|
|---|---|
|A| Deterministic|
|B| Probabilistic|

### Answer
* Start with a random number; take every tenth row thereafter.
* B: this is a probability sample, because the starting row is chosen by chance.
* Any given row is equally likely to be picked! (But this is not true for groups of rows!)

In [None]:
start = np.random.choice(np.arange(10))
top.take(np.arange(start, 200, 10))
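
The claim about single rows versus groups of rows can be checked by listing every sample this scheme can produce. The sketch below uses plain Python and assumes a table of 200 rows, matching the sizes in the cell above; `chance` is a hypothetical helper, not part of any library.

```python
from fractions import Fraction

n_rows, step = 200, 10            # same sizes as the cell above
starts = range(step)              # each start is equally likely

# The 10 possible samples, one per starting row.
possible_samples = [set(range(start, n_rows, step)) for start in starts]

def chance(rows):
    """Chance that every row in `rows` lands in the same sample."""
    hits = sum(1 for sample in possible_samples if set(rows) <= sample)
    return Fraction(hits, len(possible_samples))

print(chance([7]))        # a single row: 1/10
print(chance([7, 17]))    # two rows exactly one step apart: 1/10
print(chance([7, 8]))     # two adjacent rows: 0
```

Under uniform sampling without replacement, every pair of rows would have the same chance of appearing together. Here that is not the case.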

### Example: samples uniformly at random with(out) replacement
* `Table.sample` method
* `with_replacement=True` is default.

In [None]:
# with replacement
top.sample(5)

In [None]:
# without replacement
top.sample(5, with_replacement=False)
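
To see what replacement changes, here is a small sketch using Python's standard `random` module instead of `Table.sample`. The five-element `population` is a stand-in for row indices, not real data.

```python
import random

rng = random.Random(0)            # fixed seed so the run is repeatable
population = list(range(5))       # stand-in for 5 row indices

# With replacement: each draw is made from the full population, so repeats
# are possible and the sample can even be larger than the population.
with_repl = rng.choices(population, k=8)

# Without replacement: every element appears at most once, so the sample
# can never be larger than the population.
without_repl = rng.sample(population, k=5)

print(with_repl)
print(sorted(without_repl))       # [0, 1, 2, 3, 4]
```

Drawing 8 items from 5 with replacement must produce at least one repeat, while drawing all 5 without replacement returns each item exactly once.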

## Sample of Convenience
* Example: sample consists of whoever walks by
    - Just because you think you’re sampling “at random”, doesn’t mean you are.
* If you can’t figure out ahead of time 
    * what’s the population
    * what’s the chance of selection, for each group in the population

then you don’t have a random sample!

### Examples: sample of convenience

* Voluntary internet surveys
* Interviewing people on Library Walk
* The first 100 visits to a website after an email campaign begins.

### Samples of convenience: pros and cons
* Pros: 
    - Easy and inexpensive
    - Most common type of sample
* Cons: 
    - Results won't generalize to the population as a whole
    - Results are likely biased

### Example: sample of convenience

* Study: determine the average age and the sex breakdown of gamblers at a casino 
* Methodology: survey conducted for three hours on a weekday afternoon 
* Bias: might overrepresent elderly people who have retired and underrepresent people of working age

# Distributions

## Probability Distribution
* Random quantity with various possible values
* “Probability distribution”:
    - All the possible values of the quantity
    - The probability of each of those values

## Empirical Distribution

* Based on observations
* Observations can be from repetitions of an experiment
* “Empirical Distribution”
    - All observed values
    - The proportion of observations equal to each value
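
The two definitions can be put side by side in code. This is a sketch in plain Python using a fair six-sided die; the `observed` list is typed in by hand for illustration and is not real data.

```python
from collections import Counter
from fractions import Fraction

# Probability distribution of a fair six-sided die:
# every possible value, paired with its probability.
probability_dist = {face: Fraction(1, 6) for face in range(1, 7)}

# Hypothetical observations, typed in by hand for illustration.
observed = [3, 1, 6, 3, 2, 3, 5, 6]

# Empirical distribution: every observed value,
# paired with the proportion of observations equal to it.
counts = Counter(observed)
empirical_dist = {value: Fraction(count, len(observed))
                  for value, count in sorted(counts.items())}

print(empirical_dist)
```

The face 4 has probability 1/6 but never shows up in these eight observations, so it is absent from the empirical distribution.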

### Example: Dice
* simulate a roll as a sample from a table

In [None]:
#:
die =  (
    Table()
    .with_column('face', np.arange(1, 7, 1))
)
die

In [None]:
# roll a single die!
die.sample(1)

### The true distribution is uniform

In [None]:
#
bins =  np.arange(0.5, 6.6, 1)
die.hist('face', bins=bins)

### Roll the die and plot the empirical distribution
* Try it for 10, 100, 1000, etc
* What does it converge to?

In [None]:
die.sample(10)

In [None]:
die.sample(10).hist('face', bins=bins)
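
One way to answer the convergence question without reading histograms is to measure how far the empirical proportions are from 1/6. The sketch below uses the standard `random` module rather than `die.sample`; `empirical` is a hypothetical helper.

```python
import random
from collections import Counter

rng = random.Random(42)           # fixed seed so the run is repeatable
faces = range(1, 7)

def empirical(n):
    """Proportion of each face in n simulated rolls of a fair die."""
    counts = Counter(rng.choice(faces) for _ in range(n))
    return {face: counts[face] / n for face in faces}

for n in (10, 100, 1000, 100000):
    props = empirical(n)
    # Largest gap between an observed proportion and the true value 1/6.
    worst = max(abs(p - 1 / 6) for p in props.values())
    print(n, round(worst, 4))
```

For large `n` the largest gap should be small, which is what a histogram that looks nearly flat is showing.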

# Large Random Samples

## Law of Averages

* If a chance experiment is repeated 
    - many times,
    - independently,
    - under the same conditions,
    
then the proportion of times that an event occurs gets closer to the theoretical probability of the event.


Example: As you roll a die repeatedly, the proportion of times you roll a 5 gets closer to 1/6.
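
The die example can be simulated directly by tracking the running proportion of 5s. This is a sketch with the standard `random` module; the checkpoints are arbitrary choices.

```python
import random

rng = random.Random(11)           # fixed seed so the run is repeatable
checkpoints = (10, 100, 1000, 10000, 100000)

fives = 0
proportions = {}
for roll_number in range(1, max(checkpoints) + 1):
    if rng.randint(1, 6) == 5:    # one independent roll of a fair die
        fives += 1
    if roll_number in checkpoints:
        proportions[roll_number] = fives / roll_number

for n, p in proportions.items():
    print(n, round(p, 4), "gap from 1/6:", round(abs(p - 1 / 6), 4))
```

The gap need not shrink at every checkpoint; the law of averages describes the long-run trend, not each individual step.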

## Large Random Samples

If the sample size is large, then the empirical distribution of a uniform random sample resembles the distribution of the population, with high probability.

### Example: distribution of flight delays
* All United flights leaving SFO between 6/1/15 and 8/9/15.
* The underlying distribution is not known.
* All we have is the observed data!

In [None]:
#:
united = Table.read_table('united_summer2015.csv')
united

### Empirical distribution of flight delays
* What is the population?

In [None]:
#: Plot empirical distribution of flight delays
bins = np.arange(-20, 300, 10)
united.hist('Delay', bins=bins, unit='minute')

In [None]:
#:
N = 10**2
united.sample(N).hist('Delay', bins=bins, unit='minute')

### Estimating a statistic: mean
* Calculate the mean of all delays
* Compare to the mean of uniform samples

In [None]:
# calculate the mean
united_mean = united.column('Delay').mean()

In [None]:
#:
for n in np.arange(100, 10000, 200):
    m = united.sample(n).column('Delay').mean()
    print('number of flights: ', n, 'mean of sample: ', m)

### Distribution of means from uniform samples with replacement
* The sample means form a nice curve around the population mean (the red line).
* Does the histogram skew one direction?

In [None]:
#:
n_experiments = 10000
means = make_array()
for n in np.arange(n_experiments):
    m = united.sample(100).column('Delay').mean()
    means = np.append(m, means)

Table().with_columns('mean', means).hist(bins=np.arange(0,40))
plt.axvline(x=united_mean, c='r');

### Distribution of means from uniform samples without replacement
* When sample size << population, sampling without replacement is similar to sampling with replacement.
* When sample size ~ population, this is *not* true.

In [None]:
#:
n_experiments = 10000
means = make_array()
for n in np.arange(n_experiments):
    m = united.sample(100, with_replacement=False).column('Delay').mean()
    means = np.append(m, means)

Table().with_columns('mean', means).hist(bins=np.arange(0,40))
plt.axvline(x=united_mean, c='r');
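
The difference between the two regimes shows up in how much the sample mean varies. In the sketch below, a made-up population of 500 integers stands in for the flight delays (it is not the `united` data), and `spread_of_means` is a hypothetical helper.

```python
import random
from statistics import pstdev

rng = random.Random(1)                                   # fixed seed
population = [rng.randint(0, 120) for _ in range(500)]   # made-up values

def spread_of_means(n, with_replacement, reps=1000):
    """Standard deviation of the sample mean over `reps` samples of size n."""
    means = []
    for _ in range(reps):
        if with_replacement:
            sample = rng.choices(population, k=n)
        else:
            sample = rng.sample(population, k=n)
        means.append(sum(sample) / n)
    return pstdev(means)

for n in (10, 500):
    print(n, spread_of_means(n, True), spread_of_means(n, False))
```

With `n = 10` the two spreads should be close. With `n = 500`, sampling without replacement returns the whole population every time, so the sample mean never varies, while sampling with replacement still does.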

### Distribution of means from uniform samples of flights to Denver
* This sample is a probability sample.
* As an estimate of the mean delay of all flights, it is highly biased!

In [None]:
#:
n_experiments = 10000
means = make_array()

den = united.where('Destination', 'DEN')
for n in np.arange(n_experiments):
    m = den.sample(100).column('Delay').mean()
    means = np.append(m, means)

Table().with_columns('mean', means).hist(bins=np.arange(0,40))
plt.axvline(x=united_mean, c='r');

### Distribution of means from evenly-spaced random samples
* This sample is a probability sample.
* Why does the histogram look this way?

In [None]:
#:
n_experiments = 10000
means = make_array()
for n in np.arange(n_experiments):
    start = np.random.choice(np.arange(20))
    m = united.take(np.arange(start, united.num_rows, 50)).column('Delay').mean()
    means = np.append(m, means)

Table().with_columns('mean', means).hist(bins=np.arange(0,40))
plt.axvline(x=united_mean, c='r');
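
A hint for the question above: the starting row is the only random choice, so only a handful of different samples can ever occur. The sketch below counts them, using a made-up population of 5000 integers in place of the `united` table.

```python
import random

rng = random.Random(3)                                    # fixed seed
population = [rng.randint(0, 120) for _ in range(5000)]   # made-up values

n_starts, step = 20, 50           # same numbers as the cell above

# The start is the only random choice, so listing every start lists
# every sample (and every sample mean) that can ever occur.
possible_means = set()
for start in range(n_starts):
    sample = population[start::step]
    possible_means.add(sum(sample) / len(sample))

print(len(possible_means))
```

However many experiments are run, the histogram can contain at most `n_starts` distinct values.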

### Distribution of means from repeated samples of the first 100 rows
* No variation (the sample is identical every time) and very high bias!

In [None]:
#:
n_experiments = 10000
means = make_array()
for n in np.arange(n_experiments):
    m = united.take(np.arange(100)).column('Delay').mean()
    means = np.append(m, means)

Table().with_columns('mean', means).hist(bins=np.arange(0,40))
plt.axvline(x=united_mean, c='r');

### Estimating probability: rolling a die $N$ times

### Discussion Question

If you roll a die 4 times, what's P(at least one 6)?

|Option|Answer|
|---|---|
|A| $5/6$|
|B| $1-5/6$|
|C| $1-(5/6)^4$|
|D| $1-(1/6)^4$|
|E| None of the above|


### Answer for 4 rolls
* P(at least one 6) = 1 - P(no 6) = $1-(5/6)^4$

### Answer for N rolls
* P(at least one 6) = 1 - P(no 6) = $1-(5/6)^N$
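
For small N the formula can be verified exactly by listing every possible sequence of rolls. This is a brute-force sketch in plain Python; `at_least_one_six` is a hypothetical helper, and it is only practical for small N because the number of outcomes grows as 6 to the power N.

```python
from fractions import Fraction
from itertools import product

def at_least_one_six(n_rolls):
    """Exact chance of at least one 6, by listing all 6**n_rolls outcomes."""
    outcomes = list(product(range(1, 7), repeat=n_rolls))
    favorable = sum(1 for rolls in outcomes if 6 in rolls)
    return Fraction(favorable, len(outcomes))

for n in range(1, 6):
    formula = 1 - Fraction(5, 6) ** n
    print(n, at_least_one_six(n), formula)
```

For 4 rolls, both columns give 671/1296, which is about 0.518.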

### Plot the true probability for each N

In [None]:
#:
rolls = np.arange(1, 51)
at_least_one = Table().with_columns('roll', rolls, 'Chance of getting at least one 6', 1-(5/6)**rolls)
at_least_one.scatter('roll')

### Simulate the probability for N=20
* What is the chance of getting at least one 6 in 20 rolls?

In [None]:
faces = np.arange(1, 7)
outcomes = np.random.choice(faces, 20) # pick random number from faces, 20 times
outcomes

In [None]:
# number of 6s among the 20 rolls
np.count_nonzero(outcomes == 6)

In [None]:
rolled6 = 0
trials = 100000
for i in np.arange(trials):
    outcomes = np.random.choice(faces, 20)
    if np.count_nonzero(outcomes == 6) >= 1:
        rolled6 = rolled6 + 1

# estimate the probability
rolled6/trials

### Simulate the probability for N=20
* wrap the experiment in a function
* run the experiment many times

In [None]:
#:
def roll_20(trials):
    rolled6 = 0
    for i in np.arange(trials):
        outcomes = np.random.choice(faces, 20)
        if np.count_nonzero(outcomes == 6) >= 1:
            rolled6 = rolled6 + 1

    return rolled6/trials

roll_20(1000)
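
The same experiment can also be written without a Python loop by simulating all trials at once in a 2-D array. This is one possible alternative to `roll_20`, not a replacement for it; `roll_20_vectorized` and its `seed` parameter are names invented here.

```python
import numpy as np

def roll_20_vectorized(trials, seed=None):
    """Estimate P(at least one 6 in 20 rolls) without a Python loop."""
    rng = np.random.default_rng(seed)
    # One row per trial, one column per roll; faces are 1..6 (7 is excluded).
    rolls = rng.integers(1, 7, size=(trials, 20))
    # For each trial, did any of the 20 rolls show a 6?
    has_six = (rolls == 6).any(axis=1)
    # The mean of a boolean array is the proportion of True values.
    return has_six.mean()

print(roll_20_vectorized(100000, seed=0))
print(1 - (5 / 6) ** 20)          # true probability, for comparison
```

The two printed numbers should agree to roughly two decimal places.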

In [None]:
#:
estimates = make_array()
for i in np.arange(500):
    estimates = np.append(roll_20(1000), estimates)
    
probs = Table().with_column('estimates', estimates)

In [None]:
#:
probs.hist()
true_prob = 1 - (5/6)**20
plt.axvline(x=true_prob, c='r');