# CSS Bootcamp

## Day 2 (NHST in practice): Lab

This lab is intended to accompany **Day 2** of the week on **Statistics**, which focuses on:

- **Explaining** the motivation for building statistical models in general.  
- **Explaining** the concept of *prediction error* and why it matters for statistics.
- **Describing** the basic assumptions (and limitations) of linear regression.  
- **Implementing** linear regression in Python using `statsmodels`.  

This lab has some "free response" questions, in which you are asked to describe or make some inference from a graph. 

It also has questions requiring you to program answers in Python. In some cases, this will use built-in functions we've discussed in class (either today or in previous weeks). In others, there'll be a built-in function that we *haven't* discussed, which you will have to look up in the documentation. 

Please reach out for help if anything is unclear!

#### Key imports

Here, we import some of the libraries that will be critical for the lab.

In [1]:
import matplotlib.pyplot as plt
import math
import numpy as np
import seaborn as sns
import scipy.stats as ss
import pandas as pd



In [2]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # makes figs nicer!

## Part 1: Philosophy of NHST

The notion of a sampling distribution is central to **Null Hypothesis Significance Testing**, or **NHST**. Combined with probability theory, this allows us to make inferences about the probability of obtaining a particular result (e.g., a test statistic), under some sampling distribution (which usually represents the null hypothesis).
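
To make the idea of a sampling distribution concrete, here is a minimal simulation sketch. The population parameters (`mu`, `sigma`) and the sample size `n` are hypothetical values chosen only for illustration; they are not taken from any exercise in this lab.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical population parameters, chosen only for illustration
mu, sigma, n = 50, 10, 25

# Draw 10,000 samples of size n and record the mean of each one
sample_means = rng.normal(loc=mu, scale=sigma, size=(10_000, n)).mean(axis=1)

# The spread of the sample means should be close to sigma / sqrt(n)
empirical_se = sample_means.std(ddof=1)
theoretical_se = sigma / np.sqrt(n)
print(empirical_se, theoretical_se)
```

The collection of sample means is an approximation of the sampling distribution of the mean, and its standard deviation is what we call the standard error.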

In this section, the exercises will focus on a few topics:

- Basic probability (with an emphasis on probability *functions*)  
- The philosophical nuts and bolts of NHST: null vs. alternate hypotheses  
- Signal detection theory

### 1A. Probability and normal distributions

Recall that the **standard normal distribution** is the result of **z-scoring** an already normal distribution. This results in a distribution with the following parameters:

- The **mean** ($\mu$) is 0.  
- The **standard deviation** ($\sigma$) is 1.
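
As a quick reminder of what z-scoring does, the sketch below standardizes a set of simulated scores. The scores (mean 100, SD 15) are hypothetical and serve only as an example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical normally distributed scores (mean 100, SD 15)
scores = rng.normal(loc=100, scale=15, size=5_000)

# z-score: subtract the mean, then divide by the standard deviation
z = (scores - scores.mean()) / scores.std()

# After z-scoring, the mean is (numerically) 0 and the SD is 1
print(round(z.mean(), 3), round(z.std(), 3))
```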

In [7]:
## Creates a standard normal distribution object
sn_dist = ss.norm(0, 1) 

###### Use the `cdf` function to calculate the cumulative probability at each of a set of points using `sn_dist`.

In [8]:
#### Your code here
points =  np.linspace(-3, 3, num = 50)
### ....

###### Plot the cumulative probability of those points against the points themselves.

In [10]:
#### Your code here

###### What is the probability of obtaining a value less than or equal to -1?

In [12]:
#### Your code here

###### What is the probability of obtaining a value less than or equal to 1?

In [14]:
#### Your code here

###### What is the probability of obtaining a value between 0 and 1?

In [16]:
#### Your code here

###### What is the probability of obtaining a value between -1 and 1?

In [16]:
#### Your code here

###### What is the probability of obtaining a value larger than 1?

In [21]:
#### Your code here

###### What is the probability of obtaining a value larger than 2 or smaller than -2? (This is called a two-tailed test.)

In [23]:
#### Your code here

### 1B. Null vs. alternate hypotheses

Now that you've had some hands-on practice with the underlying distributions, we'll turn to the theoretical or philosophical foundations of NHST.

NHST is centered around the contrasting notions of a **null hypothesis** vs. an **alternate hypothesis**.

Consider the following scenario:

> A researcher is interested in whether a new drug decreases blood pressure. They give the drug to 10 individuals, and find that after taking the drug their mean blood pressure is 75. The hypothesized **population mean** is 80, with a standard deviation of 4.5. Did the drug work?

###### First, what is the null hypothesis? 

In [26]:
#### Your response here

###### What is the alternative hypothesis?

In [27]:
#### Your response here

###### Is this a two-tailed or one-tailed test?

In [28]:
#### Your response here

###### What is the difference between the sample mean and population mean?

In [29]:
#### Your response here

###### What is the standard error of the mean?

In [31]:
#### Your response here

###### What is the z-statistic?

In [33]:
#### Your response here

###### What is the probability of obtaining a z-statistic less than or equal to this value?

In [35]:
#### Your response here

###### Assuming an alpha of 0.05, would we reject the null hypothesis?

In [37]:
#### Your response here

### 1C. Signal detection: false alarms and misses

NHST involves making relatively discrete **decisions**: do we **Reject** or **Fail to Reject** the null hypothesis?

As the last question implies, central to this decision is our value of **alpha**: essentially, our tolerance threshold for getting a false positive.
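
One way to build intuition for what alpha controls is to simulate many "studies" in which the null hypothesis is true. The sketch below assumes independent tests with continuous test statistics, so that each p-value is uniform on [0, 1] under the null. The values of `alpha` and `n_tests` are hypothetical and deliberately differ from those in the questions that follow.

```python
import numpy as np

rng = np.random.default_rng(2)

alpha = 0.10         # hypothetical threshold, not the one used in the questions
n_tests = 3          # number of tests run in each simulated study
n_studies = 100_000  # number of simulated studies

# Under a true null, each p-value is uniformly distributed on [0, 1]
p_values = rng.uniform(size=(n_studies, n_tests))

# A study contains a false positive if any of its p-values falls below alpha
any_false_positive = (p_values < alpha).any(axis=1)
print(any_false_positive.mean())
```

The estimated proportion should come out well above `alpha` itself, which is the pattern the next two questions ask you to work out analytically.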

###### Suppose you run 5 hypothesis tests, where the true underlying effect is zero (but of course, you don't know this yet). Given an alpha of .05, what is the probability that at least one of them will be a false positive?

In [38]:
#### Your response here

###### Create a graph showing that the probability of obtaining at least one false positive result increases with the number of tests that you run (from `1` to `100`), given three alpha levels: `.05`, `.01`, and `.001`.

In [40]:
#### Your response here

## Part 2: The t-distribution(s)

Central to a **t-test** is the notion of the [**t-distribution**](https://en.wikipedia.org/wiki/Student%27s_t-distribution).

Like the normal distribution, the t-distribution has a symmetric bell shape. Its tails are heavier than those of the normal distribution, but as the **degrees of freedom** increase, the t-distribution approaches the normal distribution.
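
One way to see the heavier tails is to compare critical values. The sketch below uses the `ppf` (quantile) function, which is not otherwise used in this lab, to find the 97.5th percentile under the standard normal and under t-distributions with a few arbitrarily chosen degrees of freedom.

```python
import scipy.stats as ss

# 97.5th percentile of the standard normal (the familiar value near 1.96)
z_crit = ss.norm.ppf(0.975)

# The same quantile under t-distributions with increasing degrees of freedom
t_crits = {df: ss.t.ppf(0.975, df) for df in [2, 30, 1000]}

for df, t_crit in t_crits.items():
    # Print df, the t critical value, and how far it sits above the normal one
    print(df, round(t_crit, 3), round(t_crit - z_crit, 3))
```

With few degrees of freedom, a more extreme value is needed to cut off the same tail area; the gap shrinks as the degrees of freedom grow.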

#### Use the `t.pdf` function to calculate the probability density for a set of points between `[-4, 4]`, with $df = 1$.

In [42]:
points =  np.linspace(-4, 4, num = 50)

In [3]:
#### Your code here

#### Use `matplotlib.pyplot.plot` to plot the probability density of these points against the points themselves.

In [4]:
#### Your code here

#### Now calculate (and plot) the probability density for that same set of points, using several different values of $df$ (e.g., `[1, 5, 10]`).  

**Hint**: Run the code in a for loop (iterating over values of $df$), and call `plt.plot` with each iteration of the loop.

In [5]:
#### Your code here

#### What do you notice about the shape of these distributions?

In [46]:
#### Your answer here

#### Use `cdf` to calculate the probability of obtaining $t ≤ -2$ with $df = 1$.  

In [6]:
#### Your code here

#### Use `cdf` to calculate the probability of obtaining $t ≤ -2$ with $df = 20$.  

In [7]:
#### Your code here

#### Use `cdf` to calculate the probability of obtaining $t > 2$ with $df = 20$.  

In [8]:
#### Your code here

#### Now plot the change in $p(t < -2)$ for $df \in [1, 20]$.

In [9]:
#### Your code here

#### What do you notice about the relative probability of obtaining a given value of $t$, given different $df$?  

In [42]:
#### Your answer here

#### What implications does this have for sample size? Is it easier or harder to detect an effect with a larger sample?

In [53]:
#### Your answer here

## Part 3: T-tests from scratch

Although `scipy.stats` has a very handy set of functions for running a t-test, it's also useful to learn how to code the test from scratch. That's what this part of the lab will focus on.

### Experimental scenario

Suppose a new curriculum for **reading comprehension** is developed. 10 students are assigned to the **Treatment** group and 10 are assigned to the **Control** group (which uses the previous curriculum). 

After the course, their scores on a reading test are collected, resulting in the following distributions.

We want to know: **Is the mean score in the Treatment population higher than the mean score in the Control population?**

In [54]:
np.random.seed(10)
T = np.random.normal(loc = 7, scale = 1, size = 10)
C = np.random.normal(loc = 5, scale = 1, size = 10)

### 3A. Understanding the distributions

##### Plot each distribution using a histogram.

(Note: You can either plot these in the same graph, making each distribution transparent with the `alpha` parameter; or you can plot them in separate cells.) 

In [10]:
#### Your code here

##### What is the mean of each distribution?

In [11]:
#### Your code here


In [12]:
#### Your code here


##### What is the difference in means between the distributions?

In [13]:
#### Your code here


##### What is the sum of squares (SS) for the treatment group?

In [14]:
#### Your code here


##### What is the sum of squares (SS) for the control group?

In [15]:
#### Your code here


### 3B. Components of the t-test. 

##### What is the pooled variance of our samples?

Recall the formula:

$s_p^2 = \frac{SS_1 + SS_2}{n_1 + n_2 - 2}$

In [16]:
#### Your code here


##### What is the standard error of the difference between means?

Recall the formula, where $s_p^2$ is the pooled variance from the previous question:

$S_{\bar{X_1} - \bar{X_2}} = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}$

In [17]:
#### Your code here


##### What is our value of $t$?  

Recall the formula:

$t = \frac{\bar{X_1} - \bar{X_2}}{S_{\bar{X_1} - \bar{X_2}}}$

In [18]:
#### Your code here


### 3C. NHST

##### How many degrees of freedom do we have in this sample?

In [19]:
#### Your code here


##### Is this a one-tailed or two-tailed test?

**Hint**: Think about whether the question is asking you if the scores are *different* or if one group is higher than the other.

In [65]:
#### Your answer here

##### What is the probability of obtaining a value of $t$ at least that large, under the null hypothesis?

In [20]:
#### Your code here


##### Based on this p-value, should we reject or fail to reject the null hypothesis?

In [21]:
#### Your answer here


## Part 4: T-tests in `scipy.stats`

Now we'll replicate the same analysis, but using the `ttest_ind` function in `scipy.stats`.
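
As a hedged illustration of the function's interface, the sketch below runs `ttest_ind` on two hypothetical samples (`a` and `b`, unrelated to `T` and `C` above). It assumes SciPy 1.6 or later, where the `alternative` argument is available for choosing between a one-tailed and a two-tailed test.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(3)

# Hypothetical samples, unrelated to T and C above
a = rng.normal(loc=1.0, scale=1.0, size=12)
b = rng.normal(loc=0.0, scale=1.0, size=12)

# Two-tailed: are the means different?
two_sided = ss.ttest_ind(a, b, alternative="two-sided")

# One-tailed: is the mean of `a` greater than the mean of `b`?
one_sided = ss.ttest_ind(a, b, alternative="greater")

# Compare the statistic and p-value returned by each call
print(two_sided.statistic, two_sided.pvalue)
print(one_sided.statistic, one_sided.pvalue)
```

By default, `ttest_ind` assumes equal population variances (`equal_var=True`), which corresponds to the pooled-variance test from Part 3.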

### 4A. Replicating previous analyses.

##### Use `scipy.stats.ttest_ind` to calculate $t$ for the independent samples above.

**NOTE**: Remember to set the correct value for one-tailed vs. two-tailed.

In [22]:
#### Your answer here


##### Now assume we are simply asking whether the samples are *different* (as opposed to whether one is larger than the other). What is $t$ now?

**NOTE**: Remember to set the correct value for one-tailed vs. two-tailed.

In [23]:
#### Your answer here


### 4B. Paired samples.

##### Let's switch up the scenario and instead assume that this is a pre/post measure: the "control" group is the same students before the intervention, and the "treatment" group is the same students after the intervention. What kind of t-test should we use for this?

In [70]:
#### Your answer here

##### Calculate $t$ using this t-test. (Assume that we're asking whether `T > C`, as in the original calculation.)

In [24]:
#### Your answer here


##### Is this $t$ statistic larger or smaller than when we assumed the samples were independent? Why?

In [73]:
#### Your answer here

##### Now calculate $t$ again, still assuming a paired (repeated-measures) design, but this time asking only whether `T ≠ C`. 

In [25]:
#### Your answer here


# Conclusion

Congratulations, you've now implemented several key hypothesis tests in Python!