# DSC 10 Discussion Week 7
---

Welcome to Discussion 7!

<img src="data/panda_smile.jpg" width="500">

## $\underline{Lecture\ 15 : Models\ and\ Statistics}$

### Model
- a set of assumptions about data
- assessing the quality of models $\rightarrow$ statistical inference!

### Terminology
- **Parameter** : a number associated with the *population* $\rightarrow$ rarely known exactly
- **Statistic** : a number calculated from the *sample* $\rightarrow$ estimate of a parameter
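
As a rough illustration of the two terms, the sketch below builds a made-up population and compares its mean (the parameter) with the mean of one random sample (the statistic). The population, the seed, and all variable names are assumptions chosen for this example, not data from the course.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is reproducible

# Hypothetical population: 10,000 made-up exam scores
population = rng.normal(loc=70, scale=10, size=10_000)
population_mean = population.mean()   # parameter (rarely known in practice)

# One random sample of 100 scores, drawn without replacement
sample = rng.choice(population, size=100, replace=False)
sample_mean = sample.mean()           # statistic: an estimate of the parameter

print(f"parameter: {population_mean:.2f}, statistic: {sample_mean:.2f}")
```

The two numbers should be close but not identical; a different sample would give a different statistic, while the parameter stays fixed.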

### Bias-Variance trade-off
- **Bias** : systematic error in one direction (too high or too low) $\rightarrow$ good estimates have *LOW bias*
- **Variance** : degree to which the value of an estimate varies $\rightarrow$ good estimates have *LOW variance*
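
One way to see the trade-off is to compare two estimators of the same parameter over many samples. The sketch below assumes a hypothetical population numbered 1 to `N` and estimates `N` either with the sample maximum or with twice the sample mean minus one. Both estimators and all names here are illustrative assumptions, not something taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, only so the sketch is reproducible

N = 300                               # hypothetical parameter we pretend not to know
population = np.arange(1, N + 1)

max_estimates = np.array([])
mean_based_estimates = np.array([])
for _ in range(2000):
    sample = rng.choice(population, size=30, replace=False)
    max_estimates = np.append(max_estimates, sample.max())
    mean_based_estimates = np.append(mean_based_estimates, 2 * sample.mean() - 1)

# Bias: how far the average estimate is from the parameter
bias_max = max_estimates.mean() - N
bias_mean_based = mean_based_estimates.mean() - N

# Variance: how much the estimates spread out from sample to sample
print(f"max:        bias {bias_max:.1f}, SD {max_estimates.std():.1f}")
print(f"2*mean - 1: bias {bias_mean_based:.1f}, SD {mean_based_estimates.std():.1f}")
```

In this setup the sample maximum can never exceed `N`, so it is biased low but varies little, while the mean-based estimate is centered near `N` but varies much more.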

### Simulation
- **Single experiment** : ```np.random.multinomial(sample_size, pop_distribution)```
- **A bunch of experiments** : iteration!
- **Visualize** : plot! $\rightarrow$ often *histogram* to show distribution
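
The three steps above can be sketched as follows, assuming a fair coin as the model. The values of `sample_size`, `pop_distribution`, and `num_simulations` are placeholders chosen for illustration.

```python
import numpy as np

np.random.seed(42)  # fixed seed, only so the sketch is reproducible

sample_size = 100
pop_distribution = [0.5, 0.5]   # hypothetical model: a fair coin

# Single experiment: counts of [heads, tails] in 100 flips
counts = np.random.multinomial(sample_size, pop_distribution)

# A bunch of experiments: repeat, recording the statistic each time
num_simulations = 1000
heads_counts = np.array([])
for i in np.arange(num_simulations):
    result = np.random.multinomial(sample_size, pop_distribution)
    heads_counts = np.append(heads_counts, result[0])

print(counts, heads_counts.mean())
```

To visualize the result, a call such as `plt.hist(heads_counts)` (with `matplotlib.pyplot` imported as `plt`) would show the empirical distribution of the number of heads.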

## $\underline{Lecture\ 16 : Hypothesis\ Testing}$

### Two Viewpoints
- **Null Hypothesis** : default view $\rightarrow$ must be simulatable
- **Alternative Hypothesis** : opposite of the Null Hypothesis

### Computing statistics under Null Hypothesis
- Choose a relevant *test statistic*
    - counts, ratios, differences, absolute differences, etc. depending on problem
    - **Total Variation Distance (TVD)** : measures how different two categorical distributions are
    - be careful with use of ```abs()```! Without it, positive and negative differences can cancel out
- Track experiment outcomes and compute the **empirical distribution of the statistic under the null hypothesis**
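
As a minimal sketch of the total variation distance and of why `abs()` matters, the code below uses the usual definition: half the sum of the absolute differences between two distributions over the same categories. The function name and both example distributions are made up for illustration.

```python
import numpy as np

def total_variation_distance(dist1, dist2):
    """Half the sum of absolute differences between two categorical distributions."""
    return np.abs(np.array(dist1) - np.array(dist2)).sum() / 2

model = np.array([0.5, 0.3, 0.2])        # hypothetical model distribution
observed = np.array([0.4, 0.35, 0.25])   # hypothetical observed proportions

tvd = total_variation_distance(observed, model)

# Without abs(), the differences cancel because both distributions sum to 1
signed_sum = (observed - model).sum()

print(tvd, signed_sum)
```

Here the signed differences add up to (approximately) zero no matter how different the two distributions are, which is why the absolute value is needed.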

### Drawing conclusions
- Compare the following : 
    - **observed test statistic** (red dot/line from class) 
    - **empirical distribution under the null hypothesis** (histograms from experiments)
- Determine if the observed value is consistent with the null hypothesis
    - by visualization or some other conventional quantitative measure
    - **p-value** : probability, assuming the null hypothesis is true, of seeing a result *at least* as extreme as the observed one
        - a common cutoff for statistical significance is 5%
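
The comparison described above can be sketched numerically. The example assumes a fair-coin null hypothesis, uses the absolute difference between the number of heads and 50 as the test statistic, and takes a made-up observation of 60 heads. All of these choices, and the variable names, are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)  # fixed seed, only so the sketch is reproducible

num_flips = 100
num_simulations = 10_000

# Simulate the statistic under the null hypothesis (fair coin)
heads = rng.multinomial(num_flips, [0.5, 0.5], size=num_simulations)[:, 0]
simulated_stats = np.abs(heads - num_flips / 2)

# Hypothetical observation: 60 heads out of 100 flips
observed_stat = abs(60 - num_flips / 2)

# p-value: fraction of simulated statistics at least as extreme as the observed one
p_value = np.count_nonzero(simulated_stats >= observed_stat) / num_simulations

print(p_value, p_value < 0.05)
```

Note the `>=`: the p-value counts simulated results that are *at least* as extreme as the observation, not only those equal to it.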

#### Extra
- You can find additional help on these topics in the course [textbook](https://eldridgejm.github.io/dive_into_data_science/front.html).
- [Here](https://ucsd-ets.github.io/dsc10-2020-fa/published/default/reference/babypandas-reference.pdf) is the babypandas reference sheet we saw last time.

In [None]:
import babypandas as bpd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Example 1: Fighting Professors

Two professors are fighting about who is the better teacher. To settle the matter, they decide to give each of their classes the same exam. The professor whose class performs better will be considered the better teacher.

## The Data

In [None]:
scores = bpd.read_csv('data/scores.csv')
scores

## Exploration

### Which professor (A or B) appears to have "won"?

In [None]:

won_prof = 

## Question 1
The winning professor claims that they are significantly better than the other professor -- and it isn't just due to random chance. What technique can we use to evaluate their claim?

**Answer**: 

## Question 2

What are the null and alternative hypotheses?

- **Null**:
- **Alternative**:

## Question 3

What test statistic can we use? Remember: it is usually better for *large* values of the test statistic to point towards the alternative hypothesis.

**Answer**:

## Question 4

What was the *observed value* of your test statistic?

In [None]:
obs = 
obs

In [None]:
import math
math.isclose(obs, 4.7411054460319235, rel_tol=1e-4)

## Question 5

Implement your chosen technique to test whether the null hypothesis should be rejected.

In [None]:
num_simulations = 1000
simulated_stats = 


simulated_stats

## Question 6

What is the probability that we see a value of the test statistic at least as extreme as our observed value if the null hypothesis is true?

In [None]:
p_val = 
p_val

In [None]:
p_val < .05

## Question 7

The "winning" professor claims that the results show that they are the better teacher. Is this correct?

In [None]:
claim_true_or_false = 
claim_true_or_false

# Example 2: Fun with Test Statistics

## Question 8

You want to test whether a coin is fair. Your hypotheses are:

- **Null**: the coin is fair
- **Alternative**: the coin is not fair

You'll flip the coin 100 times. What test statistic should you use to assess your claim?

In [None]:
# fill out the following code to set up this experiment

num_flips = 

# model the probability of our coin
model = 

# flip our coin ... times
flip_outcomes = 

# flip_outcomes = [num_heads, num_tails]
num_heads = flip_outcomes[0]

# What is our test statistic?
def test_statistic(num_heads):
    return 

# compute test statistic
print(f"Test statistic result : {test_statistic(num_heads)}")

## Question 9

In your experiment, you saw 61 heads. What is the observed value of your test statistic?

In [None]:
num_heads_experiment = 61

observed_test_statistic = 

print(f"Test statistic result : {observed_test_statistic}")

## Question 10

You want to test whether an *n*-sided die is fair. Your hypotheses are:

- **Null**: the die is fair
- **Alternative**: the die is not fair

You'll roll the die 100 times. What test statistic should you use to assess your claim?

In [None]:
# fill out the following code to set up this experiment


# specify number of sides
N = 20
num_rolls = 

# model the probability of our die
model_die = 

# roll our die ... times
roll_outcomes = 

# roll_outcomes = [count_num_side_1 ,..., ..., count_num_side_N]
# roll_outcomes_prob = [perc_num_side_1 ,..., ..., perc_num_side_N]

roll_outcomes_prob = 


# What is our test statistic?
def test_statistic_die(roll_outcomes_prob, model_die):
    return 

# compute test statistic
print(f"Test statistic result : {test_statistic_die(roll_outcomes_prob, model_die)}")

## Question 11

You rolled a 4-sided die 100 times and got "one" 20 times, "two" 30 times, "three" 40 times, and "four" 10 times. What is the observed value of your test statistic?

In [None]:
# specify number of sides
N = 4
num_rolls = 

# Given roll outcomes
roll_outcomes = np.array([20, 30, 40, 10]) 
roll_outcomes_prob = 

# model the probability of our die
model_die = 

# compute the test statistic
# (named observed_test_statistic so it does not overwrite the test_statistic function from Question 8)
observed_test_statistic = 

# display results
print(f"Test statistic result : {observed_test_statistic}")

## Question 12

You rolled a 2-sided die 100 times and got "one" 61 times and "two" 39 times. What is the observed value of your test statistic?

In [None]:
# specify number of sides
N = 2
num_rolls = 

# Given roll outcomes
roll_outcomes = np.array([61, 39]) 
roll_outcomes_prob = 

# model the probability of our die
model_die = 

# compute the test statistic
# (named observed_test_statistic so it does not overwrite the test_statistic function from Question 8)
observed_test_statistic = 

# display results
print(f"Test statistic result : {observed_test_statistic}")