## Lecture 15: Sampling ##

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

In [None]:
import warnings
warnings.simplefilter('ignore', FutureWarning)
from IPython.display import HTML
HTML('''<link href='http://fonts.googleapis.com/css?family=Lora:400,700,400i,700i' rel='stylesheet'><link href='https://fonts.googleapis.com/css?family=Lato:300,400,700,300i,400i,700i' rel='stylesheet'><link href='https://fonts.googleapis.com/css?family=Inconsolata:400' rel='stylesheet'><link rel="stylesheet" href="http://maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css"><style>h1, h2, h3, h4, h5 { font-family: 'Lato', sans-serif; } h5 { font-style: normal; } kbd { font-family: Lato, serif; } hr { border-width: 2px; border-color: #a9a9a9; } .cite { font-size: 85%; text-align: right; margin-top: 10px; } .note { font-family: Lora, serif; font-size: 10pt; font-weight: 400; margin-top: 0; margin-bottom: 0; } h5.prehead { font-family: Lato, serif; font-style: normal; font-size: 14pt; font-weight: 300; margin-bottom: 15px; margin-top: 30px; } h5.lesson { font-family: Lato, serif; font-weight: 400; font-size: 15pt; font-style: normal; margin-top: 0px; margin-bottom: 5px; } h1.lesson_title { font-family: Lato, serif; font-weight: 300; font-size: 32pt; line-height: 110%; color:#CD2305; margin-top: 0px; margin-bottom: 15px; } div.cell{ max-width: 1120px; margin-left: auto; margin-right: auto; } div.text_cell_render { font-family: Lora, serif; line-height: 160%; font-size: 13pt; } .rendered_html pre, .rendered_html code  { font-family: Inconsolata, monospace !important; font-size: 13pt; } div.CodeMirror, div.output_area pre, div.prompt { font-family: Inconsolata, monospace !important; font-size: 125%; } .rendered_html ul li { margin-top: 0.75em; margin-bottom: 0.75em; } .rendered_html ul li ul li { margin-top: 0.5em; margin-bottom: 0.5em; } .rred { color: #a00000; } </style> <script> MathJax.Hub.Config({ TeX: { extensions: ["AMSmath.js"] }, tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ], displayMath: [ ['$$','$$'], ["\\[","\\]"] ] }, displayAlign: 'center', // Change this to 'center' to center equations. 
"HTML-CSS": { styles: {'.MathJax_Display': {"margin": "0.75em 0"}} } }); </script>''')

In this lecture, we'll learn about sampling from a population and about empirical distributions.

## Random Sampling ##

Our data set contains all Southwest flights out of BWI in June 2019.  The data are taken from the [Bureau of Transportation Statistics website](https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGK&QO_fu146_anzr=b0-gvzr).  See `raw_flights.csv` for the original data file.

In [None]:
# Get table of Southwest flights 
# out of BWI in June 2019

swflights=Table.read_table('swflights.csv')
swflights

In [None]:
# Find range of delays

(swflights.column('Delay').min(),swflights.column('Delay').max())

In [None]:
# Generate histogram of delays

swflights.hist('Delay')

In [None]:
# Improved histogram of delays

swflights.hist('Delay', bins=np.arange(-40, 100, 5))

Note: The height of the [0,5) bar is about 3.8 percent per unit.  Since each bin width is 5 minutes, around 5 * 3.8 or 19% of the flights had a delay between 0 and 5 minutes.
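
The area reading above can be checked numerically. The sketch below uses a small made-up list of delays (not the flight data) and only the standard library; `bar_height` is a hypothetical helper that mirrors how a density histogram scales its bars.

```python
# Hypothetical delays in minutes (illustrative values, not from swflights.csv)
delays = [-7, -3, -1, 0, 1, 2, 4, 4, 6, 9, 12, 15, 18, 22, 31, 47, 3, -2, 8, 0]

def bar_height(values, left, right):
    """Height of the [left, right) bar, in percent per unit."""
    in_bin = sum(1 for v in values if left <= v < right)
    percent_in_bin = 100 * in_bin / len(values)
    return percent_in_bin / (right - left)

width = 5
height = bar_height(delays, 0, width)

# Area of the bar = height * width = percent of values that fall in the bin
percent = height * width
print(f"height = {height:.1f} percent per unit, area = {percent:.0f} percent")
```

The same multiplication applied to the real histogram gives the 19% quoted above: a height of 3.8 percent per unit times a width of 5 units.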

In [None]:
# Look at all flights to Phoenix

swflights.where('Destination', 'PHX') 


In [None]:
# We will take the collection of all flights to be our population.
# Take a deterministic sample of 9 flights (every 1000th row, starting at row 0):

swflights.take(np.arange(0, swflights.num_rows, 1000))

In [None]:
# Take systematic random sample of 9 flights

start = np.random.choice(np.arange(1000))
systematic_sample = swflights.take(np.arange(start, swflights.num_rows, 1000))
systematic_sample.show()
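
The two cells above differ only in where the row indices start. Here is a minimal sketch of that index arithmetic, assuming a hypothetical population of 9,000 rows (the real table's size is not assumed) and using Python's `random` module in place of `np.random.choice`:

```python
import random

population_size = 9000   # hypothetical; stands in for swflights.num_rows
step = 1000              # keep every 1000th row

# Deterministic sample: always starts at row 0, so no chance is involved
deterministic_rows = list(range(0, population_size, step))

# Systematic random sample: the start is chosen uniformly from 0..step-1,
# so every row has the same 1/step chance of being selected
start = random.randrange(step)
systematic_rows = list(range(start, population_size, step))

print(deterministic_rows)
print(systematic_rows)
```

Only the starting row is random. Once `start` is known, every other row in the systematic sample is determined, so the rows are not selected independently of each other.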

## Distributions ##

In [None]:
# Create table of outcomes for rolling
# 6-sided die

die = Table().with_column('Face', np.arange(1, 7))
die

In [None]:
# Get random sample of 'die' table

die.sample()

In [None]:
# Get random sample of size 300

die.sample(300)

In [None]:
# Histogram of 'die' table
die.hist()

In [None]:
# Make new bins for 'die' histograms

roll_bins = np.arange(0.5, 6.6, 1)

In [None]:
# Improved histogram for 'die' table

die.hist('Face', bins=roll_bins)

In [None]:
# Histogram of 20 die rolls

die.sample(20).hist(bins=roll_bins)

In [None]:
# Histogram of 1000 die rolls

die.sample(1000).hist(bins=roll_bins)

In [None]:
# Histogram of 100000 die rolls

die.sample(100000).hist(bins=roll_bins)

### Law of Large Numbers: 
### If a chance experiment is repeated many times, independently and under the same conditions, then the proportion of times that an event occurs gets closer to the theoretical probability of the event.
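
The histograms above show this visually. As a numerical sketch, the code below tracks the proportion of sixes for the same three sample sizes. It uses the standard library's `random` module rather than `die.sample`, and the helper name `proportion_of_sixes` is made up for this illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def proportion_of_sixes(num_rolls):
    """Roll a fair six-sided die num_rolls times; return the proportion of 6s."""
    sixes = sum(1 for _ in range(num_rolls) if random.randint(1, 6) == 6)
    return sixes / num_rolls

theoretical = 1 / 6
results = {n: proportion_of_sixes(n) for n in (20, 1000, 100000)}

for n, observed in results.items():
    print(f"{n:>6} rolls: proportion = {observed:.4f}, "
          f"distance from 1/6 = {abs(observed - theoretical):.4f}")
```

The law does not promise that each larger run is closer than every smaller one, only that large deviations become increasingly unlikely as the number of rolls grows.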

## Large Random Samples ##

In [None]:
swflights

In [None]:
# Histogram of delays for all flights in swflights

southwest_bins = np.arange(-20, 201, 5)
swflights.hist('Delay', bins = southwest_bins)

In [None]:
# Histogram of random sample of size 10
# from swflights

swflights.sample(10).hist('Delay', bins = southwest_bins)

In [None]:
# Histogram of random sample of size 1000
# from swflights

swflights.sample(1000).hist('Delay', bins = southwest_bins)

In [None]:
swflights.hist('Delay', bins = southwest_bins)


## Simulating Statistics ##

### Parameter: a numerical quantity associated to a population


* The percentage of voters who voted for a certain candidate
* The average height of all people in Maryland
* The *maximum* income of all wage earners in Baltimore County
* The median departure delay of all Southwest flights in our table



In [None]:
# Calculate median delay of all Southwest flights

np.median(swflights.column('Delay'))

In [None]:
# Proportion of all flights that leave
# on time or ahead of schedule (delay <= 0)

swflights.where('Delay', are.below_or_equal_to(0)).num_rows / swflights.num_rows

**Note**. The percent isn't exactly 50 because of "ties," that is, flights that had delays of exactly 0 minutes.

In [None]:
# Number of 'ties' 

swflights.where('Delay', are.equal_to(0)).num_rows

There were 682 such flights. Ties are quite common in data sets, and we will not worry about them in this course.
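
A toy example shows how ties move the proportion away from 50 percent. The values below are invented for illustration and are not taken from the flight table.

```python
from statistics import median

# Hypothetical delays with several values tied at the median
delays = [-4, -2, 0, 0, 0, 0, 3, 7, 12]

med = median(delays)
ties = sum(1 for d in delays if d == med)
at_or_below = sum(1 for d in delays if d <= med) / len(delays)
strictly_below = sum(1 for d in delays if d < med) / len(delays)

print("median:", med)
print("values tied with the median:", ties)
print("proportion at or below the median:", round(at_or_below, 3))
print("proportion strictly below the median:", round(strictly_below, 3))
```

With ties, the proportion at or below the median is above one half, while the proportion strictly below it is under one half.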

### Statistical Inference: Making conclusions based on data in random samples
  

### Statistic = any number computed using data from a sample

### Strategy: approximate the value of a population parameter by measuring the value of a sample statistic.
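
To make the distinction concrete, here is a sketch with a hypothetical population of simulated delays (not the real flight data). The parameter is computed once from the whole population; the statistic is recomputed for each random sample, drawn with replacement as `Table.sample` does by default.

```python
import random
from statistics import median

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population of 5,000 delays in minutes (simulated values)
population = [round(random.expovariate(1 / 12)) - 5 for _ in range(5000)]

# Parameter: one fixed number, computed from the entire population
population_median = median(population)
print("parameter (population median):", population_median)

# Statistic: computed from a sample, so it changes from sample to sample
sample_medians = []
for trial in range(3):
    sample = random.choices(population, k=10)  # sample of size 10, with replacement
    sample_medians.append(median(sample))
    print(f"statistic (median of sample {trial + 1}):", sample_medians[-1])
```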

In [None]:
# Median of sample of 10 flights

np.median(swflights.sample(10).column('Delay'))

In [None]:
# Function determines median of random sample of 
# given size 
def sample_median(size):
    return np.median(swflights.sample(size).column('Delay'))

In [None]:
sample_median(10)

In [None]:
# Generate array with 1000 sample medians

sample_medians = make_array()

for i in np.arange(1000):
    new_median = sample_median(10)
    sample_medians = np.append(sample_medians, new_median)

In [None]:
sample_medians


In [None]:
# Plot empirical histogram of sample medians

Table().with_column('Sample medians', sample_medians).hist(bins = np.arange(-10,31))

### Empirical Histogram: distribution of statistic for many random samples

### Probability Histogram: distribution of statistic for *all possible* random samples

# Law of Averages: 
## Empirical Histogram *approximates* Probability Histogram

## Probability distribution of a statistic:

* The values of a statistic vary, because random samples vary.  
* The *probability distribution* of the statistic is the distribution of probabilities of these statistic values.
* Hard to calculate, because we need to do math *or* generate *all possible samples*

## Empirical distribution of a statistic:

* Based on simulated or observed values of the statistic, and proportion of times each value appears
* Gives a good approximation to the *probability distribution* of the statistic, *if* the number of repetitions in the simulation is large (Law of Averages)
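
For a very small population, both distributions can be computed and compared. The sketch below assumes a made-up population of five delays and samples of size 3 drawn with replacement. It lists all 125 possible samples to get the probability distribution of the sample median, then simulates 20,000 samples to get an empirical distribution.

```python
import random
from collections import Counter
from itertools import product
from statistics import median

random.seed(2)  # fixed seed so the run is repeatable

# Tiny hypothetical population, so that every possible sample can be listed
population = [-3, 0, 0, 4, 10]
sample_size = 3

# Probability distribution: each ordered sample drawn with replacement is
# equally likely, so tally the median of all 5**3 = 125 possible samples
all_samples = list(product(population, repeat=sample_size))
exact_counts = Counter(median(s) for s in all_samples)
exact = {m: count / len(all_samples) for m, count in exact_counts.items()}

# Empirical distribution: tally the medians of many simulated samples
repetitions = 20000
simulated = Counter(
    median(random.choices(population, k=sample_size))
    for _ in range(repetitions)
)
empirical = {m: simulated[m] / repetitions for m in exact}

for m in sorted(exact):
    print(f"median {m:>3}: exact {exact[m]:.3f}, empirical {empirical[m]:.3f}")
```

Listing every sample is feasible here only because the population is tiny. For the flight table, the number of possible samples is far too large, which is why simulation is used instead.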
