<a href="https://colab.research.google.com/github/tmckim/materials-sp24-colab/blob/main/lec_demos/lec13.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Before you start - Save this notebook!

When you open a new Colab notebook from WebCampus (as you hopefully did for this one), you cannot save changes. So it's best to store the Colab notebook in your personal drive using `"File > Save a copy in drive..."` **before** you do anything else.

The file will open in a new tab in your web browser, and it is automatically named something like: "**Copy of lec13.ipynb**". You can rename this to just the title of the assignment "**lec13.ipynb**". Make sure you do keep an informative name (like the name of the assignment) so that you know which files to submit back to WebCampus for grading! More instructions on this are at the end of the notebook.


**Where does the notebook get saved in Google Drive?**

By default, the notebook will be copied to a folder called “Colab Notebooks” at the root (home directory) of your Google Drive. If you use this for other courses or personal code notebooks, I recommend creating a folder for this course and then moving the assignments AFTER you have completed them. <br>

I also recommend you give the folder where you save your notebooks a different name than the folder we create below, which stores the resources you need each time you work through a course notebook. These resources include any data files you will need, links to the images that appear in the notebook, and the files associated with the autograder for answer checking.<br>
You should select a name other than '**NS499-DataSci-course-materials**'. <br>
This folder gets overwritten with each assignment you work on in the course, so you should **NOT** store your notebooks in this folder that we use for course materials! <br><br>For example, you could create a folder called 'NS499-**notebooks**' or something along those lines.
___

### We will now do the setup steps as separate cells to help with issues finding files in Google Drive/Colab. <br> If you restart Colab, you must rerun all **5** steps (one per cell below)!

In [None]:
# Step 1
# Setup and add files needed to access gdrive
from google.colab import drive                                   # these lines mount your gdrive to access the files we import below
drive.mount('/content/gdrive', force_remount=True)

In [None]:
# Step 2
# Change directory to the correct location in gdrive (modified way to do this from before)
import os
os.chdir('/content/gdrive/MyDrive/NS499-DataSci-course-materials/')

In [None]:
# Step 3
# Remove the files that were previously there- we will replace with all the old + new ones for this assignment
!rm -r materials-sp24-colab

In [None]:
# Step 4
# These lines clone (copy) all the files you will need from where I store the code+data for the course (github)
# Second part of the code copies the files to this location and folder in your own gdrive
!git clone https://github.com/tmckim/materials-sp24-colab '/content/gdrive/MyDrive/NS499-DataSci-course-materials/materials-sp24-colab/'

In [None]:
# Step 5
# Change directory into the folder where the resources for this assignment are stored in gdrive (modified way from before)
os.chdir('/content/gdrive/MyDrive/NS499-DataSci-course-materials/materials-sp24-colab/lec_demos/')

In [None]:
# Import packages and other things needed
# Don't change this cell; Just run this cell
# If you restart colab, make sure to run this cell again after the first ones above^

from datascience import *
import numpy as np
import warnings
warnings.simplefilter(action='ignore',category=np.VisibleDeprecationWarning)

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
plt.rcParams["patch.force_edgecolor"] = True

## Learning Objectives ##


Topics:
- review `for` loops
- review and simulate using `np.random.choice`
- demonstrate deterministic and random sampling
- plot probability and empirical distributions
- demonstrate the law of large numbers

## Simulation

We will start to use simulation in this class. A key element of simulation is leveraging randomness. The `numpy` Python library has many functions for generating random events. Today we will use the [`np.random.choice`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) function, which picks one or more items at random from an array.
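
As a quick orientation, here is a minimal sketch of a few common ways to call `np.random.choice`. It is not part of the original lecture code; the variable names (`one_roll`, `ten_rolls`, `each_face_once`, `loaded_rolls`) and the loaded-die weights are made up for illustration.

```python
import numpy as np

die_faces = np.arange(1, 7)                      # the faces 1 through 6

# A single draw returns one value
one_roll = np.random.choice(die_faces)

# A second argument asks for that many draws (with replacement by default)
ten_rolls = np.random.choice(die_faces, 10)

# replace=False draws without replacement, so no face can repeat
each_face_once = np.random.choice(die_faces, 6, replace=False)

# p= gives each item its own probability (hypothetical loaded die favoring 6)
loaded_rolls = np.random.choice(die_faces, 1000, p=[0.1, 0.1, 0.1, 0.1, 0.1, 0.5])

print(one_roll)
print(ten_rolls)
print(each_face_once)
```

Because the draws are random, the printed values will differ each time the cell is run.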

## Playing a Game of Chance

Let's play a game: we each roll a die.

If my number is bigger: you pay me a dollar.

If they're the same: we do nothing.

If your number is bigger: I pay you a dollar.

Steps:
1. Find a way to simulate two dice rolls.
2. Compute how much money we win/lose based on the result.
3. Do steps 1 and 2 10,000 times.

### Simulating the roll of a die

In [None]:
die_faces = np.arange(1, 7)
die_faces

In [None]:
np.random.choice(die_faces)

Implement a function that simulates a single round of play and returns my winnings for that round (1, -1, or 0):

In [None]:
def simulate_one_round():
    my_roll = np.random.choice(die_faces)
    your_roll = np.random.choice(die_faces)

    if my_roll > your_roll:
        return 1
    elif my_roll < your_roll:
        return -1
    else:
        return 0

In [None]:
simulate_one_round()

### Repeated Betting ###

In [None]:
results = make_array()

In [None]:
results = np.append(results, simulate_one_round())
results

In [None]:
results = np.append(results, simulate_one_round())
results

Use a `for` loop to simulate many plays of our game of chance and collect the outcome of each play:

In [None]:
game_outcomes = make_array()

for i in np.arange(5):
    game_outcomes = np.append(game_outcomes, simulate_one_round())

game_outcomes

In [None]:
game_outcomes = make_array()

for i in np.arange(10000):
    game_outcomes = np.append(game_outcomes, simulate_one_round())

game_outcomes

In [None]:
len(game_outcomes)

In [None]:
results = Table().with_column('My winnings', game_outcomes)

In [None]:
results

In [None]:
results.group('My winnings').barh('My winnings')
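
Two dice have only 36 equally likely outcomes, so the simulated bar chart can be checked against exact probabilities. The following is a minimal sketch, not part of the original lecture code; it uses NumPy broadcasting, and the names `payoff`, `p_win`, `p_tie`, `p_lose`, and `expected_winnings` are introduced here for illustration.

```python
import numpy as np

die_faces = np.arange(1, 7)

# Build all 36 (my_roll, your_roll) pairs with broadcasting:
# a 6x1 column minus a 1x6 row gives a 6x6 grid of differences
my_rolls = die_faces.reshape(6, 1)
your_rolls = die_faces.reshape(1, 6)

# np.sign turns each difference into 1 (I win), -1 (you win), or 0 (tie)
payoff = np.sign(my_rolls - your_rolls)

# Each of the 36 pairs is equally likely, so proportions are probabilities
p_win = np.mean(payoff == 1)
p_tie = np.mean(payoff == 0)
p_lose = np.mean(payoff == -1)
expected_winnings = payoff.mean()

print(p_win, p_tie, p_lose, expected_winnings)
```

If the simulation is behaving as expected, roughly 15/36 of the 10,000 rounds should land on 1, about 6/36 on 0, and about 15/36 on -1.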

### Another example: simulating heads in 100 coin tosses

In [None]:
# Example coin flip
coin = make_array('heads', 'tails')

In [None]:
# Use np.random.choice to pick
np.random.choice(coin)

In [None]:
# Pick multiple times
ten_picks = np.random.choice(coin, 10)
ten_picks

In [None]:
ten_picks == 'heads'

In [None]:
sum(ten_picks == 'heads')

In [None]:
# Count the number of heads among 10 new coin flips
sum(np.random.choice(coin, 10) == 'heads')

In [None]:
sum(np.random.choice(coin, 100) == 'heads')

In [None]:
# Simulate one outcome

def num_heads():
    return sum(np.random.choice(coin, 100) == 'heads')

In [None]:
# Decide how many times you want to repeat the experiment
repetitions = 10000

In [None]:
# Simulate that many outcomes
outcomes = make_array()

for i in np.arange(repetitions):
    outcomes = np.append(outcomes, num_heads())

In [None]:
heads = Table().with_column('Heads', outcomes)
heads.hist(bins = np.arange(29.5, 70.6))
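
The loop above grows the array one value at a time. As an alternative, here is a minimal sketch that draws every toss in one call by passing a tuple as the size. This is a swapped-in technique rather than the lecture's method, and the names `tosses` and `heads_counts` are made up for illustration.

```python
import numpy as np

repetitions = 10000
coin = np.array(['heads', 'tails'])

# One row per repetition of the experiment, 100 tosses in each row
tosses = np.random.choice(coin, size=(repetitions, 100))

# Comparing gives True/False; summing along each row counts the heads
heads_counts = np.sum(tosses == 'heads', axis=1)

print(len(heads_counts))                        # one count per repetition
print(heads_counts.mean())                      # should be close to 50
print(heads_counts.min(), heads_counts.max())
```

Almost all counts should fall between 30 and 70, which is why the histogram bins run from 29.5 to 70.5: each bin is centered on a whole number of heads.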

## Random Sampling ##

We load in a dataset of all United national flights from 6/1/15 to 8/9/15, with each flight's destination and how long it was delayed, in minutes.

In [None]:
# columns:
# date
# flight number
# destination
# delay (in minutes)

In [None]:
# Load in our data
united = Table.read_table('united.csv')
united = united.with_column('Row', np.arange(united.num_rows)).move_to_start('Row') # add row numbers so we can see samples more easily
united

For each of the following, is this a deterministic or a random sampling strategy?

In [None]:
# Take a sample, like we've been doing already in this class
united.where('Destination', 'JFK')

In [None]:
# Sampling table method, with replacement
united.sample(3, with_replacement= True)

In [None]:
# sample using np.arange
united.take(np.arange(0, united.num_rows, 1000))

In [None]:
# Sample using take method
united.take(make_array(34, 6321, 10040))

In [None]:
# combination of methods
united.where('Destination', 'JFK').sample(3,with_replacement= True)

In [None]:
# A systematic sample example
start = np.random.choice(np.arange(1000))
systematic_sample = united.take(np.arange(start, united.num_rows, 1000))
systematic_sample.show()
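
To see which part of a systematic sample is random, it can help to look at the row indices alone. This is a minimal sketch using only NumPy; `num_rows = 12000` is a hypothetical table size chosen for illustration, not the size of the United table.

```python
import numpy as np

num_rows = 12000      # hypothetical number of rows, for illustration only
step = 1000

# Deterministic: always rows 0, 1000, 2000, ... no matter how often we run it
deterministic_rows = np.arange(0, num_rows, step)

# Systematic: the starting row is random, but the spacing is fixed
start = np.random.choice(np.arange(step))
systematic_rows = np.arange(start, num_rows, step)

print(deterministic_rows)
print(start)
print(systematic_rows)
```

Rerunning the cell changes `start` and therefore every selected row, but consecutive rows are always exactly 1000 apart.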

## Distributions ##

In [None]:
# A single, fair die
die = Table().with_column('Face', np.arange(1, 7))
die

What is the **Probability Distribution** of drawing each face assuming each face is equally likely (a 'fair die')?

In [None]:
# Probability distribution
roll_bins = np.arange(0.5,6.6,1)
die.hist(bins=roll_bins)

We can sample from the die table many times with replacement:

In [None]:
die.sample(5)

We can construct an **Empirical Distribution** from our simulation:

In [None]:
die.sample(10).hist(bins=roll_bins)

What happens if we increase the number of trials in our simulation? What happens to the distribution?

In [None]:
die.sample(1000).hist(bins=roll_bins)

In [None]:
die.sample(100000).hist(bins=roll_bins)
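
The histograms suggest that the empirical distribution gets closer to the probability distribution as the number of rolls grows. Here is a minimal sketch that puts a number on that gap, assuming a fair die with probability 1/6 per face; the helper `empirical_proportions` is hypothetical and not part of the lecture code.

```python
import numpy as np

die_faces = np.arange(1, 7)

def empirical_proportions(num_rolls):
    """Roll a fair die num_rolls times and return the proportion of each face."""
    rolls = np.random.choice(die_faces, num_rolls)
    # bincount counts occurrences of 0..6; drop index 0 since no face is 0
    counts = np.bincount(rolls, minlength=7)[1:]
    return counts / num_rolls

for n in [10, 1000, 100000]:
    props = empirical_proportions(n)
    # Largest distance between any face's proportion and the fair value 1/6
    largest_gap = np.max(np.abs(props - 1/6))
    print(n, np.round(props, 3), round(float(largest_gap), 4))
```

The largest gap tends to shrink as `n` grows, which is the law of large numbers at work.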

## Large Random Samples ##

The United flight dataset is a relatively large dataset:

In [None]:
# Show how much data
united.num_rows

We can plot the distribution of delays for the population:

In [None]:
# Some very delayed flights
united.hist('Delay', bins = 50)

In [None]:
united.sort('Delay', descending = True)

Let's leave the extreme delays out of the plot by drawing the histogram only from -20 to 200 minutes. (More on why later)

In [None]:
# Show the adjusted histogram
united_bins = np.arange(-20,201,5) # bins 5 minutes wide, from -20 (left 20 minutes early) up to 200 minutes late
united.hist('Delay', bins = united_bins)

What happens if we take a small sample from this population of flights and plot the distribution of delays in the sample?

In [None]:
united.sample(10).hist('Delay', bins = united_bins)

In [None]:
# Increase the sample size
united.sample(1000).hist('Delay', bins = united_bins)

## Simulating Statistics ##

Because we have access to the population (this is rare!) we can compute the parameters directly from the data. For example, suppose we wanted to know the median flight delay:

In [None]:
np.median(united.column('Delay'))

In practice, we will often have a sample. The median of the sample is a statistic that estimates the median of the population.

In [None]:
np.median(united.sample(10).column('Delay'))

Here we define a function to simulate the process of computing the median from a random sample of a given size:

In [None]:
def sample_median(size):
    return np.median(united.sample(size).column('Delay'))

In [None]:
sample_median(10)

We can then simulate this sampling process many times:

In [None]:
sample_medians = make_array()

for i in np.arange(1000):
    new_median = sample_median(10)
    sample_medians = np.append(sample_medians, new_median)

In [None]:
medians = Table().with_columns('Sample medians', sample_medians, 'Sample size', 10)
medians.hist('Sample medians', bins = 50)

In [None]:
sample_medians2 = make_array()

for i in np.arange(1000):
    new_median = sample_median(1000)
    sample_medians2 = np.append(sample_medians2, new_median)

In [None]:
# Combine both samples into a table and plot
overall_tbl = medians.append(Table().with_columns(
    "Sample medians", sample_medians2,
    "Sample size", 1000))
overall_tbl.hist("Sample medians", group="Sample size", bins = 50)

#### Empirical Distributions of a Statistic (Overlaid)

In [None]:
sample_medians_10 = make_array()
sample_medians_100 = make_array()
sample_medians_1000 = make_array()

num_simulations = 2000

for i in np.arange(num_simulations):
    new_median_10 = sample_median(10)
    sample_medians_10 = np.append(sample_medians_10, new_median_10)
    new_median_100 = sample_median(100)
    sample_medians_100 = np.append(sample_medians_100, new_median_100)
    new_median_1000 = sample_median(1000)
    sample_medians_1000 = np.append(sample_medians_1000, new_median_1000)

In [None]:
sample_medians = Table().with_columns('Size 10', sample_medians_10,
                                      'Size 100', sample_medians_100,
                                      'Size 1000', sample_medians_1000)

In [None]:
sample_medians.hist(bins = np.arange(-5, 30))
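
The overlaid histograms get narrower as the sample size grows. The sketch below measures that narrowing with the standard deviation of the sample medians. It does not use the United data: `population` is a hypothetical, right-skewed set of made-up delays, and the seeded generator `rng` is there only to keep the example reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed "delay" population; NOT the United flight data
population = rng.exponential(scale=20, size=10000) - 5

def sample_median(size):
    """Median of a random sample (drawn with replacement) of the given size."""
    sample = rng.choice(population, size)
    return np.median(sample)

spreads = {}
for size in [10, 100, 1000]:
    # Repeat the sampling process 500 times for this sample size
    medians = np.array([sample_median(size) for _ in range(500)])
    spreads[size] = medians.std()
    print(size, round(float(medians.mean()), 2), round(float(spreads[size]), 2))

print(round(float(np.median(population)), 2))   # the parameter being estimated
```

With larger samples, the medians cluster more tightly around the population median, which matches the narrower histograms above.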

## Saving
Remember to save your notebook before closing. Choose **Save** from the **File** menu, and make sure you've already saved a copy in your drive.