<a href="https://colab.research.google.com/github/tmckim/materials-fa23-colab/blob/main/lectures/lec08.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Before you start - Save this notebook!

When you open a new Colab notebook from WebCampus (as you hopefully did for this one), you cannot save changes. So it's best to store a copy of the notebook in your personal Drive with `File > Save a copy in Drive...` **before** you do anything else.

The file will open in a new tab in your web browser and is automatically named something like "**Copy of lec08.ipynb**". You can rename it to just the title of the assignment, "**lec08.ipynb**". Keep an informative name (like the name of the assignment) so that you know which files to submit back to WebCampus for grading! More instructions on this are at the end of the notebook.


**Where does the notebook get saved in Google Drive?**

By default, the notebook will be copied to a folder called “Colab Notebooks” at the root (home directory) of your Google Drive. If you use this for other courses or personal code notebooks, I recommend creating a folder for this course and then moving the assignments AFTER you have completed them.

In [None]:
# Setup and add files needed to gdrive
# If you restart colab, start by rerunning this cell first!
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

#!mkdir -p '/content/gdrive/My Drive/colab-materials-NS499DataSci-notebooks/'
%cd /content/gdrive/My Drive/colab-materials-NS499DataSci-notebooks/
!rm -r materials-fa23-colab

!git clone https://github.com/tmckim/materials-fa23-colab '/content/gdrive/My Drive/colab-materials-NS499DataSci-notebooks/materials-fa23-colab/'

%cd /content/gdrive/MyDrive/colab-materials-NS499DataSci-notebooks/materials-fa23-colab/lectures/

In [None]:
# Import packages and other things needed
# Don't change this cell; Just run this cell
# If you restart colab, make sure to run this cell again after the first one above^

from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## Lecture 08 ##

In this lecture we will:
- Demonstrate deterministic and random sampling
- Plot Probability and Empirical Distributions
- Demonstrate the law of large numbers

## Random Sampling ##

We load in a dataset of all of United's domestic flights from 6/1/15 to 8/9/15, with each flight's destination and how long it was delayed, in minutes.

In [None]:
# columns:
# date
# flight number
# destination
# delay (in minutes)

In [None]:
# Load in our data
united = Table.read_table('united.csv')
united = united.with_column('Row', np.arange(united.num_rows)).move_to_start('Row') # add row numbers so we can see samples more easily
united

For each of the following, is this a deterministic or a random sampling strategy?

In [None]:
# Take a sample, like we've been doing already in this class
united.where('Destination', 'JFK')

In [None]:
# Sampling table method, with replacement
united.sample(3, with_replacement=True)

In [None]:
# sample using np.arange
united.take(np.arange(0, united.num_rows, 1000))

In [None]:
# Sample using take method
united.take(make_array(34, 6321, 10040))

In [None]:
# combination of methods
united.where('Destination', 'JFK').sample(3, with_replacement=True)

In [None]:
# A systematic sample example
start = np.random.choice(np.arange(1000))
systematic_sample = united.take(np.arange(start, united.num_rows, 1000))
systematic_sample.show()
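
The strategies above can be compared on a toy example. The sketch below is only an illustration: it uses Python's standard `random` module on a hypothetical list of 10,000 row numbers (a stand-in for the rows of `united`, not the real data). It is meant to show which parts of each strategy depend on chance.

```python
import random

# Hypothetical row numbers standing in for the rows of a table
rows = list(range(10000))

# Deterministic: the same rows are chosen every time this runs
deterministic = rows[0::1000]

# Random, with replacement: rows can repeat and change from run to run
random_sample = random.choices(rows, k=3)

# Systematic: a random starting point, then every 1000th row after it
start = random.randrange(1000)
systematic = rows[start::1000]

print(deterministic)
print(random_sample)
print(systematic)
```

Only the starting point of the systematic sample is random. Once it is chosen, the rest of the sample is fully determined.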

## Distributions ##

In [None]:
# A single, fair die
die = Table().with_column('Face', np.arange(1, 7))
die

What is the **Probability Distribution** of drawing each face assuming each face is equally likely (a 'fair die')?

In [None]:
# Probability distribution
roll_bins = np.arange(0.5,6.6,1)
die.hist(bins=roll_bins)
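
The histogram above has six bars of equal height because each face has chance 1/6 under the fair-die assumption. As a small sketch (plain Python with the standard `fractions` module, not the `datascience` table), the same distribution can be written out explicitly:

```python
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

# Under the fair-die assumption, every face has the same chance
probability = {face: Fraction(1, len(faces)) for face in faces}

for face, p in probability.items():
    print(face, p, round(float(p), 4))

# The probabilities of all possible outcomes add up to 1
total = sum(probability.values())
print(total)
```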

We can sample from the die table many times with replacement:

In [None]:
die.sample(5)

We can construct an **Empirical Distribution** from our simulation:

In [None]:
die.sample(10).hist(bins=roll_bins)

What happens if we increase the number of trials in our simulation? What happens to the distribution?

In [None]:
die.sample(1000).hist(bins=roll_bins)

In [None]:
die.sample(100000).hist(bins=roll_bins)
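
The histograms above suggest that the empirical distribution gets closer to the probability distribution as the number of rolls grows (the law of large numbers). One way to put a number on this, as a rough sketch using only the standard library, is to track the largest gap between any face's empirical proportion and 1/6. The helper names `empirical_proportions` and `largest_gap` are made up for this illustration.

```python
import random
from collections import Counter

random.seed(8)  # arbitrary seed so the run is repeatable

def empirical_proportions(num_rolls):
    """Roll a fair die num_rolls times; return the proportion of each face."""
    rolls = random.choices(range(1, 7), k=num_rolls)
    counts = Counter(rolls)
    return {face: counts[face] / num_rolls for face in range(1, 7)}

def largest_gap(num_rolls):
    """Largest distance between an empirical proportion and 1/6."""
    proportions = empirical_proportions(num_rolls)
    return max(abs(p - 1 / 6) for p in proportions.values())

for n in [10, 1000, 100000]:
    print(n, round(largest_gap(n), 4))
```

The gap tends to shrink as `n` grows, although any single run with a small `n` can land close to 1/6 by chance.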

## Large Random Samples ##

The United flight dataset is a relatively large dataset:

In [None]:
# Show how much data
united.num_rows

We can plot the distribution of delays for the population:

In [None]:
# Some very delayed flights
united.hist('Delay', bins = 50)

In [None]:
united.sort('Delay', descending = True)

Let's truncate the extreme flights with a histogram from -20 to 200. (More on why later)

In [None]:
# Show the adjusted histogram
united_bins = np.arange(-20, 201, 5) # 5-minute bins from -20 (left 20 minutes early) up to 200
united.hist('Delay', bins = united_bins)
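
To see what these bins cover, the sketch below rebuilds the same edges with plain Python's `range` and applies them to a handful of hypothetical delays (these numbers are invented for illustration and are not taken from `united.csv`). It assumes, as is usual for histograms drawn with explicit bins, that values outside the first and last edges are simply not drawn.

```python
# Same edges as np.arange(-20, 201, 5), built with plain Python
bin_edges = list(range(-20, 201, 5))

print(bin_edges[:3], bin_edges[-3:])  # first and last few edges
print(len(bin_edges) - 1)             # number of bins

# Hypothetical delays in minutes, invented for this example
delays = [-25, -3, 0, 12, 199, 200, 580]

# Keep only the delays that fall inside the range covered by the bins
shown = [d for d in delays if bin_edges[0] <= d <= bin_edges[-1]]
print(shown)
```

The stop value 201 is used only so that 200 itself is included as the last edge.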

What happens if we take a small sample from this population of flights and compute the distribution of delays?

In [None]:
united.sample(10).hist('Delay', bins = united_bins)

In [None]:
# Increase the sample size
united.sample(1000).hist('Delay', bins = united_bins)

## Simulating Statistics ##

Because we have access to the population (this is rare!) we can compute the parameters directly from the data. For example, suppose we wanted to know the median flight delay:

In [None]:
np.median(united.column('Delay'))

In practice, we will often have only a sample. The median of the sample is a statistic that estimates the median of the population.

In [None]:
np.median(united.sample(10).column('Delay'))
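
The difference between a parameter and a statistic can be illustrated on a toy population. The sketch below uses the standard `random` and `statistics` modules and a hypothetical population of 10,000 made-up delays (not the United data), so the specific numbers carry no meaning beyond the illustration.

```python
import random
import statistics

random.seed(8)  # arbitrary seed so the run is repeatable

# Hypothetical population of 10,000 delays in minutes (made up)
population = [random.randint(-10, 60) for _ in range(10000)]

# Parameter: computed from the whole population, so it is a fixed number
population_median = statistics.median(population)

# Statistic: computed from a random sample, so it varies between samples
sample = random.choices(population, k=10)  # drawn with replacement
sample_median_value = statistics.median(sample)

print(population_median, sample_median_value)
```

Rerunning the sampling lines without the seed gives a different sample median each time, while the population median stays the same.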

Here we define a function to simulate the process of computing the median from a random sample of a given size:

In [None]:
def sample_median(size):
    return np.median(united.sample(size).column('Delay'))

In [None]:
sample_median(10)

We can then simulate this sampling process many times:

In [None]:
sample_medians = make_array()

for i in np.arange(1000):
    new_median = sample_median(10)
    sample_medians = np.append(sample_medians, new_median)

In [None]:
medians = Table().with_columns('Sample medians', sample_medians, 'Sample size', 10)
medians.hist('Sample medians', bins = 50)

In [None]:
sample_medians2 = make_array()

for i in np.arange(1000):
    new_median = sample_median(1000)
    sample_medians2 = np.append(sample_medians2, new_median)

In [None]:
# Combine both samples into a table and plot
overall_tbl = medians.append(Table().with_columns(
    "Sample medians", sample_medians2,
    "Sample size", 1000))
overall_tbl.hist("Sample medians", group="Sample size", bins = 50)

#### Empirical Distributions of a Statistic (Overlaid)

In [None]:
sample_medians_10 = make_array()
sample_medians_100 = make_array()
sample_medians_1000 = make_array()

num_simulations = 2000

for i in np.arange(num_simulations):
    new_median_10 = sample_median(10)
    sample_medians_10 = np.append(sample_medians_10, new_median_10)
    new_median_100 = sample_median(100)
    sample_medians_100 = np.append(sample_medians_100, new_median_100)
    new_median_1000 = sample_median(1000)
    sample_medians_1000 = np.append(sample_medians_1000, new_median_1000)

In [None]:
sample_medians = Table().with_columns('Size 10', sample_medians_10,
                                      'Size 100', sample_medians_100,
                                      'Size 1000', sample_medians_1000)

In [None]:
sample_medians.hist(bins = np.arange(-5, 30))
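
The overlaid histograms get narrower as the sample size increases. A rough way to quantify that narrowing is to compute the standard deviation of the simulated medians for each sample size. The sketch below does this on a hypothetical population of made-up delays using only the standard library; `toy_sample_median` and `spread_of_medians` are names invented for this example and are separate from the notebook's `sample_median`.

```python
import random
import statistics

random.seed(8)  # arbitrary seed so the run is repeatable

# Hypothetical population of 10,000 delays in minutes (made up)
population = [random.randint(-10, 60) for _ in range(10000)]

def toy_sample_median(size):
    """Median of one random sample (with replacement) of the given size."""
    sample = random.choices(population, k=size)
    return statistics.median(sample)

def spread_of_medians(size, num_simulations=500):
    """Standard deviation of many simulated sample medians."""
    medians = [toy_sample_median(size) for _ in range(num_simulations)]
    return statistics.pstdev(medians)

for size in [10, 100, 1000]:
    print(size, round(spread_of_medians(size), 2))
```

A smaller spread means the sample median is more likely to land near the population median, which is why larger samples give more reliable estimates.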

## Mendel and Pea Flowers ##

In [None]:
## Mendel had 929 plants, of which 709 had purple flowers
observed_purples = 709 / 929
observed_purples

In [None]:
predicted_proportions = make_array(.75, .25)
sample_proportions(929, predicted_proportions)

In [None]:
def purple_flowers():
    return sample_proportions(929, predicted_proportions).item(0) * 100

In [None]:
purple_flowers()

In [None]:
purples = make_array()

for i in np.arange(10000):
    new_purple = purple_flowers()
    purples = np.append(purples, new_purple)

In [None]:
Table().with_column('Percent of purple flowers in sample of 929', purples).hist()

In [None]:
Table().with_column('Discrepancy in sample of 929 if the model is true', abs(purples - 75)).hist()

In [None]:
abs(observed_purples * 100 - 75)
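
One way to relate the observed discrepancy to the simulated ones is to count how often the model produces a discrepancy at least as large. The sketch below is an illustration under assumptions: it replaces `sample_proportions` with a hand-written simulation built on the standard `random` module, uses 2,000 repetitions instead of 10,000 to keep it quick, and the function name `simulated_percent_purple` is invented here.

```python
import random

random.seed(8)  # arbitrary seed so the run is repeatable

num_plants = 929
model_purple_chance = 0.75  # the model's predicted chance of purple

def simulated_percent_purple():
    """Percent of purple-flowering plants in one simulated sample of 929."""
    purple_count = sum(random.random() < model_purple_chance
                       for _ in range(num_plants))
    return 100 * purple_count / num_plants

# Discrepancies between simulated percents and the predicted 75%
simulated_discrepancies = [abs(simulated_percent_purple() - 75)
                           for _ in range(2000)]

# Discrepancy for the observed data: 709 purple out of 929
observed_discrepancy = abs(100 * 709 / 929 - 75)

at_least_as_large = sum(d >= observed_discrepancy
                        for d in simulated_discrepancies)
print(round(observed_discrepancy, 2))
print(at_least_as_large / len(simulated_discrepancies))
```

If a sizable fraction of simulated discrepancies are at least as large as the observed one, the observed data look consistent with the model.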

## Swain vs. Alabama ##

In [None]:
population_proportions = make_array(.26, .74)
population_proportions

In [None]:
sample_proportions(100, population_proportions)

In [None]:
def panel_proportion():
    return sample_proportions(100, population_proportions).item(0)

In [None]:
panel_proportion()

In [None]:
panels = make_array()

for i in np.arange(10000):
    new_panel = panel_proportion() * 100
    panels = np.append(panels, new_panel)

In [None]:
Table().with_column(
    'Number of Black Men on Panel of 100', panels
).hist(bins=np.arange(5.5,40.))

# Plotting details; ignore this code
plots.ylim(-0.002, 0.09)
plots.scatter(8, 0, color='red', s=30);
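
The red dot at 8 marks the count for the actual panel in the case, where 8 of the 100 men were Black. As a rough sketch, the simulation can be used to ask how often random selection from the eligible population gives a count that low. The code below is an illustration under assumptions: it swaps `sample_proportions` for a hand-written simulation with the standard `random` module, and `simulated_panel_count` is a name invented for this example.

```python
import random

random.seed(8)  # arbitrary seed so the run is repeatable

panel_size = 100
eligible_black_proportion = 0.26  # from population_proportions
observed_count = 8                # the value marked by the red dot

def simulated_panel_count():
    """Number of Black men on one randomly selected panel of 100."""
    return sum(random.random() < eligible_black_proportion
               for _ in range(panel_size))

counts = [simulated_panel_count() for _ in range(10000)]

# How many simulated panels have a count as low as the observed one?
as_low_as_observed = sum(c <= observed_count for c in counts)

print(min(counts), max(counts))
print(as_low_as_observed / len(counts))
```

A fraction at or very near zero means that a count of 8 is far outside what random selection typically produces.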