In [1]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

# Tables
Tables are:
1. A collection of **columns**. Each column is an array containing values. Methods for working with columns are:
    * `.with_columns(...)` adds a new column
    * `.column(...)` grabs a column and returns it as an array
    * `.select(...)` picks out one or more columns for a new table
    
2. A collection of **rows**. Each row is an entry or an entity in the dataset. Methods for working with rows are:
    * `.where(...)` selects only rows that fulfill a condition
    * `.take(...)` takes certain rows (by index)
    * `.sort(...)` sorts rows
    
## Table Pro-tips
1. When to use `.group(...)`?
    * If you find yourself calling `.where(...)` several times, once per category, chances are you're trying to group rows by the values in a column, in which case `.group(...)` comes in handy
    
    
2. When to use `.join(...)`?
    * If you need to combine 2 tables that share a column
    
3. When to use `.pivot(...)`?
    * If you need to use `.group(...)` to group by 2 columns at the same time
    * If you want the result laid out as a grid, with one column's values as the row labels and the other column's values as the column labels

## Voters
<img src = 'voters.jpg' width = 250/>
Let's say we have the table `voters`. How do we find:

1. The candidate that won? One way of doing it:
    * Group the rows by choice
    * Sort by count, largest first
    * Take the first entry of the `Choice` column
    * `voters.group('Choice').sort('count', descending=True).column('Choice').item(0)`
2. The distribution of votes for California ('CA')?
    * Use `.where(...)` to choose only California voters
    * Group them by choice
    * `voters.where('State', 'CA').group('Choice')`
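
The logic of both answers can be checked without the `datascience` library. The sketch below is plain Python over a made-up list of `(state, choice)` pairs standing in for `voters`; the data and variable names are assumptions for illustration only. `Counter` plays the role of `.group('Choice')`, and the filter inside the generator plays the role of `.where('State', 'CA')`.

```python
from collections import Counter

# Hypothetical stand-in for the `voters` table: one (State, Choice) pair per voter.
voters = [
    ("CA", "A"), ("CA", "B"), ("CA", "A"),
    ("NY", "B"), ("NY", "B"), ("TX", "B"),
]

# 1. Winner: count votes per choice (like .group), then take the largest count.
counts = Counter(choice for _, choice in voters)
winner, winner_votes = counts.most_common(1)[0]

# 2. California only: keep the CA rows (like .where), then count per choice.
ca_counts = Counter(choice for state, choice in voters if state == "CA")

print(winner, winner_votes)  # B 4
print(dict(ca_counts))       # {'A': 2, 'B': 1}
```

Note that the overall winner and the California favorite differ in this made-up data, which is why the filter has to come before the grouping.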

# Sampling
* Random samples let us approximate probability distributions
    * The larger the sample, the more closely its distribution resembles the population distribution
* When we have one sample, we can use it to form an estimate of a parameter
    * Example: Warplanes
    * If we take a sample of 20 planes and take the `max`, it'll give us one estimate of how many planes there are overall. 
* Multiple samples let us estimate the probability distribution of a statistic
    * If we want to find the distribution of that statistic, we need multiple samples of 20 planes.
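
The difference between one estimate and the distribution of a statistic can be sketched as follows. This uses only the standard library; the true number of planes `N`, the seed, and the choice to draw with replacement are assumptions made for the example.

```python
import random

rng = random.Random(0)            # fixed seed so the run is repeatable
N = 300                           # hypothetical true number of planes
serial_numbers = range(1, N + 1)  # planes are numbered 1, 2, ..., N

def one_estimate(sample_size=20):
    """Draw one random sample of serial numbers and return its max."""
    sample = rng.choices(serial_numbers, k=sample_size)  # with replacement (assumed)
    return max(sample)

# One sample gives one estimate of N.
single = one_estimate()

# Many samples give an empirical distribution of the statistic.
estimates = [one_estimate() for _ in range(1000)]

print(single)
print(min(estimates), max(estimates))
```

Every estimate is at most `N`, because the largest serial number in a sample can never exceed the largest one in the population.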

# Gambling
If you roll a fair die, and:
* If 1 or 6, pay $1
* If 2 or 5, get $1
* The rest, get nothing

Which would resemble the empirical histogram of the net winnings?
<img src = 'gambling.jpg' width = 500/>
**Ans**: Histogram (ii). If the die is fair, losing a dollar (rolling 1 or 6) and winning a dollar (rolling 2 or 5) are equally likely, each with chance 1/3, so wins and losses tend to cancel out and the net winnings stay close to zero
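
One way to check this reasoning is to simulate the game. The sketch below assumes the histogram is of the net winnings per roll; the number of rolls and the seed are arbitrary choices for the example.

```python
import random
from collections import Counter

rng = random.Random(1)  # fixed seed so the run is repeatable

def net_winning(face):
    """Net winnings in dollars for one roll, following the rules above."""
    if face in (1, 6):
        return -1   # pay a dollar
    if face in (2, 5):
        return 1    # get a dollar
    return 0        # 3 or 4: nothing happens

rolls = [rng.randint(1, 6) for _ in range(10_000)]
winnings = [net_winning(face) for face in rolls]

# Empirical distribution of the per-roll net winnings
tally = Counter(winnings)
proportions = {value: tally[value] / len(winnings) for value in sorted(tally)}
average = sum(winnings) / len(winnings)

print(proportions)
print(average)
```

Each of the three outcomes should appear in roughly a third of the rolls, and the average should sit near zero.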

# Hypothesis Test Recipe
* State null and alternative hypotheses!
    * These always depend on the context (depending on the question you're asking)
* Examine your alternative hypothesis and **pick a test statistic**
* Assume the null is true and **simulate the statistic** by taking random draws
* Mark your observed test statistic (real-life data) on the empirical distribution
* Compute the P-value: the proportion of the empirical distribution that is equal to the observed test statistic or more extreme in the direction of your alternative hypothesis
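
The last step can be sketched as a small helper. The function name, the `direction` argument, and the toy list of simulated statistics are assumptions made for illustration; in practice the list would come from the simulation step.

```python
def empirical_p_value(simulated, observed, direction="+"):
    """Proportion of simulated statistics at least as extreme as the observed one.

    direction="+" counts values >= observed (alternative says "larger");
    direction="-" counts values <= observed (alternative says "smaller").
    """
    if direction == "+":
        extreme = [s for s in simulated if s >= observed]
    elif direction == "-":
        extreme = [s for s in simulated if s <= observed]
    else:
        raise ValueError("direction must be '+' or '-'")
    return len(extreme) / len(simulated)

# Toy empirical distribution of a statistic under the null (made up)
simulated_stats = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5]

p_right = empirical_p_value(simulated_stats, 3, "+")  # 3, 3, 4, 5 -> 4 of 10
p_left = empirical_p_value(simulated_stats, 1, "-")   # 0, 1, 1    -> 3 of 10
print(p_right, p_left)  # 0.4 0.3
```

The comparison includes equality, so the observed value itself counts as "at least as extreme".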

## Picking a Test Statistic
If the alternative hypothesis is:
1. Is the instructor cooler than the class?
2. Is the instructor less cool than the class?
3. Is the instructor just different from the class?

For each alternative hypothesis, pick a test statistic and a direction to look in the distribution!

**Ans**:
1. Coolness level (+ direction)
2. Coolness level (- direction)
3. Distance from the class's average coolness level (assume it is 50)
    * abs(coolness - 50) (+ direction)
    
## Picking a Test Statistic (part 2)
If the alternative hypothesis is:
1. Coin is unfair (in 20 flips)
2. In the pea example, the proportion of purple flowers is less than 75%
3. In the jury ethnicity example, the jury panel is unfair

For each alternative hypothesis, pick a test statistic and a direction to look in the distribution!

**Ans**:
1. Measure how far the number of heads is from what a fair coin would give
    * abs(heads - 10) (+ direction)
    * Why subtract 10? Because we expect to see around 10 heads in 20 flips if the coin is fair
    * We can also count the tails instead!
2. Proportion of purple / total (- direction)
    * Why (-) direction? Because the alternative hypothesis is that the proportion of purple flowers is less than 75%, so small proportions are the evidence in its favor
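3. TVD (total variation distance) between the population distribution and the panel distribution (+ direction)
    * Why TVD? A jury panel is described by its distribution across ethnic groups. We're comparing a panel's distribution (actual or simulated) with the population's distribution. When comparing 2 distributions, use TVD.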
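
For reference, the TVD of two distributions over the same categories is half the sum of the absolute differences between their proportions. A minimal sketch follows; the function name and the two example distributions are made up for illustration and are not real jury data.

```python
def total_variation_distance(dist_a, dist_b):
    """Half the sum of absolute differences between two distributions.

    Both arguments list proportions for the same categories, in the same order.
    """
    if len(dist_a) != len(dist_b):
        raise ValueError("distributions must cover the same categories")
    return sum(abs(a - b) for a, b in zip(dist_a, dist_b)) / 2

# Hypothetical proportions for three categories
population = [0.5, 0.3, 0.2]
panel = [0.4, 0.3, 0.3]

tvd = total_variation_distance(population, panel)
print(round(tvd, 3))  # 0.1
```

The TVD is 0 when the two distributions are identical and 1 when they share no category, so larger values point toward the alternative, which is why the direction is (+).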

## Simulate Drawing a Sample
Write code to simulate one test statistic of the following:
* Mendel grew 900 pea plants and got 705 purple flowers
* Null hypothesis: 75% of pea plants will have purple flowers
* Alternative hypothesis: more than 75% of pea plants will have purple flowers

In [5]:
# Under the null hypothesis, purple and white flowers occur in a 3:1 ratio.
# Draw 900 plants at random (with replacement) from that model.
null = make_array('Purple', 'Purple', 'Purple', 'White')
sample = np.random.choice(null, 900)
# Count the purple-flowering plants in the sample
purple_sample = np.count_nonzero(sample == 'Purple')

# Test statistic: proportion of purple-flowering plants in the sample
test_stat = purple_sample / len(sample)
test_stat

0.7377777777777778

In [10]:
# Faster version using sample_proportions
model = [0.75, 0.25]
# Draw 900 plants from the model; item(0) is the sampled proportion of purple
sample_proportions(900, model).item(0)

0.7288888888888889
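
Each cell above produces a single simulated statistic. To finish the test, the simulation is repeated many times and compared with the observed proportion, 705 / 900. The sketch below uses only the standard library so that it runs outside the notebook; the seed, the number of repetitions, and the helper name are assumptions.

```python
import random

rng = random.Random(42)   # fixed seed so the run is repeatable
observed = 705 / 900      # Mendel's observed proportion of purple flowers

def simulate_purple_proportion(n_plants=900, p_purple=0.75):
    """Simulate one sample under the null and return its purple proportion."""
    purple = sum(1 for _ in range(n_plants) if rng.random() < p_purple)
    return purple / n_plants

repetitions = 2000
simulated = [simulate_purple_proportion() for _ in range(repetitions)]

# The alternative says "more than 75%", so look in the (+) direction.
p_value = sum(1 for s in simulated if s >= observed) / repetitions
print(p_value)
```

In the notebook, the body of the helper could be replaced by the `sample_proportions(900, model).item(0)` call shown above.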

## Simulate Drawing a Sample (Part 2)
Write code to simulate one test statistic of the following:
* A die was rolled 24 times and never showed a 6
* Null hypothesis: Die is fair
* Alt: Die is unfair

In [16]:
# Assuming the die is fair, simulate rolling it 24 times
model = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]
# item(5) is the sampled proportion of sixes
sample_prop = sample_proportions(24, model).item(5)
# Test statistic = abs(proportion - expected)
# The expected proportion is 1/6 because each face of a fair die
# is equally likely
expected = 1/6
test_stat = abs(sample_prop - expected)
test_stat

0.04166666666666666