<a href="https://colab.research.google.com/github/veyselberk88/Data-Science-Tools-and-Ecosystem/blob/main/lec16.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="./ccsf.png" alt="CCSF Logo" width=200px style="margin:0px -5px">

# Lecture 16: Sampling

Associated Textbook Section: [10.0](https://ccsf-math-108.github.io/textbook/chapters/10/Sampling_and_Empirical_Distributions.html)

---

## Outline

* [Sampling](#Sampling)
* [Sampling with Technology](#Sampling-with-Technology)

---

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

## Sampling

---

### Random Samples

<a href="https://en.wikipedia.org/wiki/Sampling_(statistics)"><img src="./simple_random_sampling.png" width=400px alt="A visual representation of selecting a simple random sample."/></a>

* Population: Set of all elements from which a sample will be drawn
* Deterministic sample: The sampling scheme doesn't involve chance
* Random (Probability) sample:
    * Before the sample is drawn, you have to know the selection probability of every group of individuals in the population
    * Not all individuals/groups have to have an equal chance of being selected
    * If the chances are equal, then the sample is a simple random sample.
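
The distinction between a simple random sample and a more general probability sample can be sketched with `np.random.choice`. This is only an illustration: the five-element `population` and the values in `unequal_chances` are made up, and the `p` argument is used here to make the selection chances explicit.

```python
import numpy as np

# Hypothetical population of five individuals (labels are illustrative only).
population = np.array(['A', 'B', 'C', 'D', 'E'])

# Simple random sample: every individual has the same chance of selection.
equal_chances = np.full(len(population), 1 / len(population))
srs = np.random.choice(population, size=2, replace=False, p=equal_chances)

# Probability sample with unequal, but known, chances (made-up values).
unequal_chances = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
prob_sample = np.random.choice(population, size=2, replace=False, p=unequal_chances)

print(srs, prob_sample)
```

Both are random samples because the chances are known before drawing; only the first is a simple random sample.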

---

### Sample of Convenience

* Example: sample consists of whoever walks by
* Just because you think you're sampling "randomly" doesn't mean you have a random sample.
* If you can't figure out the following ahead of time, then you don't have a random sample:
    * what the population is
    * what the chance of selection is for each group in the population

---

### With and Without Replacement

<a href="https://towardsdatascience.com/an-introduction-to-probability-sampling-methods-7a936e486b5/"><img src="./sampling.webp" width=700px alt="A visual representation of sampling with and without replacement."/></a>

* Sampling with Replacement:
    * One event happening does not impact the chance of another event happening
    * Associated with the concept called independent events.
* Sampling without Replacement:
    * One event happening may impact the chance of another event happening
    * Associated with the concept called dependent events.
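
A small worked example may help make the difference concrete. The sketch below assumes a hypothetical box of three tickets (two red, one blue) and computes the chance that the second draw is red given that the first draw was red, under each scheme.

```python
from fractions import Fraction
import numpy as np

# Hypothetical box of three tickets: two red, one blue.
box = np.array(['red', 'red', 'blue'])
n_red, n_total = 2, 3

# Chance the second draw is red, given the first draw was red.
with_replacement = Fraction(n_red, n_total)             # ticket goes back: 2/3
without_replacement = Fraction(n_red - 1, n_total - 1)  # ticket stays out: 1/2

print(with_replacement, without_replacement)

# Drawing all three tickets without replacement uses each ticket exactly once;
# with replacement, the same ticket can be drawn more than once.
no_repeat = np.random.choice(box, size=3, replace=False)
may_repeat = np.random.choice(box, size=3, replace=True)
```

With replacement, the chance of red on the second draw stays at 2/3 no matter what was drawn first. Without replacement, it changes to 1/2, so the draws depend on each other.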

---

## Sampling with Technology

---

### Sampling from Arrays and Tables

* Sampling from a table
    * You can use `take` to systematically sample data from a table
    * `tbl.sample(k=n, with_replacement=True)`:
        * Randomly samples with replacement `n` rows from `tbl` and creates a new table
        * `k` is `tbl.num_rows` by default
        * `with_replacement` is `True` by default
* Sampling from an array
    * `np.random.choice(a=an_array, size=n, replace=True)`:
        * Randomly samples with replacement `n` elements from the elements in `an_array`
        * `size` is `None` by default, which returns a single element
        * `replace` is `True` by default
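
The defaults of `np.random.choice` can be checked on a small array. This is a minimal sketch; `an_array` here is a hypothetical array of the integers 0 through 9, not data from the lecture.

```python
import numpy as np

an_array = np.arange(10)  # hypothetical array: 0 through 9

# With size omitted, a single element comes back rather than an array.
one_value = np.random.choice(an_array)

# With size given, an array of that length comes back; repeats are possible
# because replace defaults to True.
five_values = np.random.choice(an_array, size=5)

# replace=False guarantees that no position is chosen twice.
five_distinct = np.random.choice(an_array, size=5, replace=False)

print(one_value, five_values, five_distinct)
```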

---

### Demo: Sampling from Arrays and Tables

<a href="https://www.bts.gov/"><img src="./Tarmac.png" alt="An airplane landing on a tarmac."/></a>

Load the November 2023 flight delay data in `delays.csv`, sourced from the [Bureau of Transportation Statistics' Reporting Carrier On-Time Performance Data](https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr). The variable `ARR_DELAY` contains the difference in minutes between the scheduled and actual arrival times at the destination airport `DEST`. Early arrivals show negative numbers, and the airline code is stored in the variable `OP_CARRIER`.

In [None]:
delays = Table.read_table('delays.csv')
delays

FL_DATE,OP_CARRIER,ORIGIN,DEST,ARR_DELAY
11/1/2023 12:00:00 AM,9E,ABE,ATL,-22
11/1/2023 12:00:00 AM,9E,ABE,ATL,-7
11/1/2023 12:00:00 AM,9E,ABY,ATL,-21
11/1/2023 12:00:00 AM,9E,ABY,ATL,-11
11/1/2023 12:00:00 AM,9E,AEX,ATL,-27
11/1/2023 12:00:00 AM,9E,AEX,ATL,-10
11/1/2023 12:00:00 AM,9E,AGS,ATL,-18
11/1/2023 12:00:00 AM,9E,AGS,ATL,-15
11/1/2023 12:00:00 AM,9E,ALB,DTW,-13
11/1/2023 12:00:00 AM,9E,ALB,LGA,-15


---

Demonstrate how to use the `take` method to sample the data in a few ways.

In [None]:
# Take three hand-picked rows by their row indices
delays.take(make_array(34, 221, 4000))

FL_DATE,OP_CARRIER,ORIGIN,DEST,ARR_DELAY
11/1/2023 12:00:00 AM,9E,ATL,HPN,-15
11/1/2023 12:00:00 AM,9E,EWR,CVG,-27
11/1/2023 12:00:00 AM,B6,BOS,LGA,-18


In [None]:
# Take every 10,000th row, starting from the first row
delays.take(np.arange(0, delays.num_rows, 10_000))

FL_DATE,OP_CARRIER,ORIGIN,DEST,ARR_DELAY
11/1/2023 12:00:00 AM,9E,ABE,ATL,-22
11/1/2023 12:00:00 AM,OO,COS,ORD,-7
11/2/2023 12:00:00 AM,AA,DFW,MSP,-7
11/2/2023 12:00:00 AM,OO,MBS,DTW,-19
11/3/2023 12:00:00 AM,AA,DFW,VPS,-8
11/3/2023 12:00:00 AM,OO,ORD,JFK,-15
11/4/2023 12:00:00 AM,AA,MIA,TPA,-14
11/4/2023 12:00:00 AM,UA,LAX,SEA,3
11/5/2023 12:00:00 AM,DL,BOI,ATL,-39
11/5/2023 12:00:00 AM,WN,BWI,TPA,-11


In [None]:
# Pick a random starting row among the first 10,000 rows
start = np.random.choice(np.arange(10_000))
start

7930

In [None]:
# Take every 10,000th row, starting from the randomly chosen row
systematic_sample = delays.take(np.arange(start, delays.num_rows, 10_000))
systematic_sample.show()

FL_DATE,OP_CARRIER,ORIGIN,DEST,ARR_DELAY
11/1/2023 12:00:00 AM,MQ,CRP,DFW,-13
11/1/2023 12:00:00 AM,YX,EWR,BUF,-11
11/2/2023 12:00:00 AM,NK,EWR,IND,1
11/2/2023 12:00:00 AM,YX,LGA,BOS,-16
11/3/2023 12:00:00 AM,NK,MCO,CLE,28
11/3/2023 12:00:00 AM,YX,SDF,EWR,-3
11/4/2023 12:00:00 AM,OO,LAX,PHX,-11
11/5/2023 12:00:00 AM,AA,VPS,DFW,192
11/5/2023 12:00:00 AM,UA,EWR,SAN,-36
11/6/2023 12:00:00 AM,AS,SEA,SNA,-1


In [None]:
start = ...
systematic_sample = ...
systematic_sample.show()
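
The row indices behind a systematic sample with a random start can be inspected on their own. This sketch uses NumPy only; `num_rows` and `step` are hypothetical small values standing in for `delays.num_rows` and the spacing of 10,000 used in the demo.

```python
import numpy as np

num_rows = 95  # hypothetical table size, standing in for delays.num_rows
step = 10      # hypothetical spacing between sampled rows

# Pick the starting row at random from the first `step` rows.
start = np.random.choice(np.arange(step))

# Every `step`-th row index from the random start up to the end of the table.
row_indices = np.arange(start, num_rows, step)
print(row_indices)
```

An array like `row_indices` is what gets passed to `take` to pull out the corresponding rows.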

---

Demonstrate how to get a simple random sample of 12 flight delays using `np.random.choice` and `sample`.

In [None]:
delays_arr = delays.column('ARR_DELAY')
delays_arr

array([-22.,  -7., -21., ..., -25., -19.,  -5.])

In [None]:
# Without replacement: no array position is chosen twice (equal values can still appear)
np.random.choice(delays_arr, 12, replace=False)

array([ 13.,  11., -10.,  -5., -18.,  -8., -18.,  13., -17., -30.,   8.,
        15.])

In [None]:
# With replacement: the same array position can be chosen more than once
np.random.choice(delays_arr, 12, replace=True)

array([ 73., -10., -11., -23., -20.,   4.,  -3.,   1., -36.,  -3.,  21.,
        11.])

In [None]:
random_sample = delays.sample(12, with_replacement=False)
random_sample.show()

FL_DATE,OP_CARRIER,ORIGIN,DEST,ARR_DELAY
11/10/2023 12:00:00 AM,YX,DCA,CMH,13
11/14/2023 12:00:00 AM,OO,XNA,DEN,-24
11/24/2023 12:00:00 AM,NK,SJU,BWI,-15
11/22/2023 12:00:00 AM,AA,LAS,DFW,3
11/15/2023 12:00:00 AM,WN,OMA,LAS,8
11/5/2023 12:00:00 AM,NK,MCO,ORD,-17
11/27/2023 12:00:00 AM,OO,DEN,BIS,-17
11/22/2023 12:00:00 AM,MQ,EYW,MIA,-10
11/7/2023 12:00:00 AM,UA,SLC,SFO,123
11/25/2023 12:00:00 AM,HA,KOA,HNL,-1


In [None]:
random_sample = delays.sample(12, with_replacement=False)
random_sample.show()

FL_DATE,OP_CARRIER,ORIGIN,DEST,ARR_DELAY
11/30/2023 12:00:00 AM,F9,CLE,RSW,-25
11/26/2023 12:00:00 AM,AA,PHL,CMH,0
11/15/2023 12:00:00 AM,WN,SAT,DAL,-9
11/5/2023 12:00:00 AM,WN,MCO,ALB,-4
11/21/2023 12:00:00 AM,WN,ATL,GSP,46
11/22/2023 12:00:00 AM,9E,JFK,DTW,-34
11/16/2023 12:00:00 AM,B6,FLL,LAX,34
11/11/2023 12:00:00 AM,AA,CLT,PNS,-16
11/10/2023 12:00:00 AM,MQ,TLH,DFW,12
11/1/2023 12:00:00 AM,NK,BDL,MYR,14
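
One way to picture sampling rows without replacement is as a random choice of distinct row indices. The sketch below is an illustration of that idea, not a description of how `sample` is implemented; `num_rows` is a hypothetical row count standing in for `delays.num_rows`.

```python
import numpy as np

num_rows = 1000  # hypothetical row count, standing in for delays.num_rows

# Choose 12 distinct row positions; every set of 12 positions is equally likely.
row_indices = np.random.choice(np.arange(num_rows), size=12, replace=False)
print(np.sort(row_indices))
```

Passing such indices to `take` would give a table of 12 randomly chosen rows.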


---

## Attribution

This content is licensed under the <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)</a> and derived from the <a href="https://www.data8.org/">Data 8: The Foundations of Data Science</a> offered by the University of California, Berkeley.

<img src="./by-nc-sa.png" width=100px>