In [None]:
from prob140 import *
from datascience import *
import numpy as np
from scipy import stats
from scipy import special
from itertools import combinations

import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.style.use('fivethirtyeight')

# Worksheet 7

You do not need to turn in any written work for this worksheet. Please answer all questions in the code or markdown cells provided. Please provide reasoning throughout, and answer open-ended questions thoughtfully. Turning in scrappy work will result in loss of credit.

\newpage
## 1.  Ranks
We will examine the Wilcoxon rank sum test by revisiting Deflategate, a storm in the world of American football and a topic familiar to us from Data 8.

Here are some extracts from the Data 8 textbook:

>On January 18, 2015, the Indianapolis Colts and the New England Patriots played the American Football Conference (AFC) championship game to determine which of those teams would play in the Super Bowl. After the game, there were allegations that the Patriots' footballs had not been inflated as much as the regulations required; they were softer. This could be an advantage, as softer balls might be easier to catch ...

>At half-time, all the game balls were collected for inspection. Two officials, Clete Blakeman and Dyrol Prioleau, measured the pressure in each of the balls. 

>Here are the data. Each row corresponds to one football. Pressure is measured in psi [pounds per square inch]. The Patriots ball that had been intercepted by the Colts was not inspected at half-time. Nor were most of the Colts' balls – the officials simply ran out of time and had to relinquish the balls for the start of second half play.

Each team had 12 footballs. Eleven of the Patriots' footballs were measured, and four of the Colts'.

In [None]:
football = Table.read_table('deflategate.csv')

In [None]:
football.show()

It is clear that the Patriots' footballs had less pressure than the Colts'. But that is not a fair comparison since the two sets of footballs started out at different pressures: all the Patriots' footballs at 12.5 psi and the Colts' at 13 psi, both levels allowed by NFL regulations. 

Pressure drops naturally during the game. The variable of interest, therefore, is the amount by which the pressure dropped. The Colts' allegation can be politely restated as saying that the drops in pressure among the Patriots' footballs were so large that something unusual had to have happened.

**(a)** Based on each of the columns `Blakeman` and `Prioleau`, calculate the drop in pressure for each football. Start by creating an array of the 15 starting values of pressure. Remember that `np.ones(n)` evaluates to an array of $n$ 1's, and `np.append(array_1, array_2)` evaluates to an array consisting of the elements of `array_1` followed by the elements of `array_2`.
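
As a quick illustration of these two functions, here is a minimal sketch on made-up values (not the football data). The numbers 2 and 5 and the lengths 3 and 2 are hypothetical, chosen only to show the shape of the result.

```python
import numpy as np

# Hypothetical values: three copies of 2.0 followed by two copies of 5.0
first_part = 2 * np.ones(3)
second_part = 5 * np.ones(2)

# np.append returns a new array; neither input is modified
combined = np.append(first_part, second_part)
print(combined)
```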

In [None]:
start = ...

blakeman_drops = start - ...
prioleau_drops = start - ...

Run the cell below and confirm a few of the drop values by mental math.

In [None]:
drops = football.drop(1, 2).with_columns(
    'Blakeman', blakeman_drops,
    'Prioleau', prioleau_drops
)

drops.show()

**(b)** It does look as though the pressure drop among the Colts' footballs was less than that among the Patriots'. To test whether this is truly the case, we'll need to deal with the fact that the two officials' measurements were different from each other. The key idea is this: since we are just interested in the ordering of the pressure drops and not their actual values, we should look at the ranks and see how the two officials' rankings compare.

The `stats` function `rankdata` takes a numerical array as its argument and returns the array of ranks.

In [None]:
data_1 = make_array(27, 32, 28, 35, 25)
stats.rankdata(data_1)

When we use rank-based methods we do have to face the issue of "ties," that is, data values that are equal. For what we are going to do in this worksheet, it doesn't matter how you rank tied values. We ask that you rank ties by using the `method = 'ordinal'` option of `rankdata`. It assigns distinct ranks to all the values, assigning consecutive ranks to equal values in the order in which they appear in the data.
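
To see what the option changes, the sketch below compares the default tie handling of `rankdata` (`method = 'average'`) with `method = 'ordinal'` on a small made-up array containing one tied pair. The array `values` is hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical data with one tied pair (two 20's)
values = np.array([10, 20, 30, 20])

average_ranks = stats.rankdata(values)                    # default: method='average'
ordinal_ranks = stats.rankdata(values, method='ordinal')  # distinct consecutive ranks

print(average_ranks)  # the tied values share the average of ranks 2 and 3
print(ordinal_ranks)  # the tied values get ranks 2 and 3, in order of appearance
```

Both methods use up the same ranks overall, so the total of all the ranks is the same either way.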

In [None]:
data_2 = np.append(data_1, 32 * np.ones(3))
data_2, stats.rankdata(data_2, method = 'ordinal')

Use `rankdata` with the `method = 'ordinal'` option to rank Blakeman's drop values, and, separately, Prioleau's drop values.

In [None]:
blakeman_ranks = stats.rankdata(...)
prioleau_ranks = stats.rankdata(...)

Look at the ranks below and do a quick mental check of a few of them for accuracy.

In [None]:
drops = drops.with_columns(
    'Blakeman Ranks', blakeman_ranks,
    'Prioleau Ranks', prioleau_ranks
)

drops.show()

**(c)** In which columns is it easier to compare consistency and inconsistency between the two officials: the ranks or the drop values? What consistencies and inconsistencies do you notice when you compare the ranks?

*Your answer here.*

\newpage
## 2. Wilcoxon's Rank Sum Statistic
Is the difference due to chance? More precisely, the question is whether the Colts' ranks are like a simple random sample of all 15 ranks or whether the Colts' ranks are generally smaller than the Patriots'. If the Colts' ranks are smaller, then it means that the pressure in the Patriots' footballs dropped by more than can be explained by random chance. That is what the Colts were alleging. 

In fact, the Colts were alleging even more, which is that the increased drop was deliberate. We can't assess that. But we can see whether the Colts' ranks are generally too low to be explained by chance.

It is now time to quantify "ranks are generally too low". We will do this by using the **[Wilcoxon](https://en.wikipedia.org/wiki/Frank_Wilcoxon) Rank Sum statistic**, which is just the sum of the Colts' ranks. A low rank sum corresponds to the Colts' ranks being "generally low". In general, the Wilcoxon rank sum statistic is the sum of the ranks of one of the two samples.

It is important to keep in mind that we are not interested in which of the Colts' footballs received which rank; we are just interested in the set of ranks received by those balls. That is, we are interested in an unordered sample of 4 out of the 15 ranks.
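
To make the definition concrete, here is a minimal sketch on made-up data: two hypothetical groups labeled `A` and `B`, not the football measurements. It ranks the pooled values and adds up the ranks that belong to group `A`.

```python
import numpy as np
from scipy import stats

# Hypothetical pooled sample: 3 values from group A, 4 from group B
values = np.array([1.2, 0.4, 0.9, 2.1, 1.7, 2.5, 1.4])
labels = np.array(['A', 'A', 'A', 'B', 'B', 'B', 'B'])

ranks = stats.rankdata(values, method='ordinal')  # ranks 1 through 7
rank_sum_A = ranks[labels == 'A'].sum()           # rank sum statistic for group A

print(ranks)
print(rank_sum_A)
```

In this toy example group `A` holds the three smallest values, so its rank sum is as small as it can possibly be for a group of size 3.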

**(a)** We'll start with Blakeman's rank sum.

**(i)** What is the rank sum statistic based on Blakeman's ranks? That is, what is the sum of the Colts' ranks as assigned by Blakeman?

*Your answer here.*

**(ii)** How many sets of four can be formed from among the numbers 1 through 15? Remember that `special.comb(n, k)` evaluates to $\binom{n}{k}$.
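
As a reminder of how `special.comb` behaves, here is a small sketch on a hypothetical case unrelated to the football data: the number of 2-element subsets of a 5-element set.

```python
from scipy import special

# Hypothetical small case: choose 2 out of 5
n_subsets = special.comb(5, 2)                     # a float by default
n_subsets_exact = special.comb(5, 2, exact=True)   # a Python int

print(n_subsets, n_subsets_exact)
```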

In [None]:
total_samples = ...
total_samples

**(iii)** What is the smallest possible sum that you can get from a subset of four numbers chosen from the integers 1 through 15? How many subsets have this sum?

*Your answer here.*

**(iv)** Based on the value of Blakeman's rank sum, should you conclude that the Colts' ranks are like a random sample of four ranks? Explain briefly.

*Your answer here.*

**(b)** For the remainder of the question, we'll use Prioleau's ranks.

In [None]:
prioleau = drops.select(0, 4)
prioleau.show()

You can of course calculate Prioleau's rank sum mentally, but for further applications it is useful to be able to do this using Python.

**(i)** Use `group` to find Prioleau's rank sum for both teams. Refer to the [Data 8 Python reference](http://data8.org/sp18/python-reference.html) if necessary. The table `both_sums` should contain both the rank sums, and `prioleau_colts_sum` should be the observed value of Prioleau's statistic.

In [None]:
both_sums = ...
prioleau_colts_sum = ...

both_sums

**(ii)** Use the cell below to show why the total of all the ranks is 120. Fill in the comment as an explanation, and then compute the sum **not by brute force but by using an appropriate formula that can easily be applied when the sample is larger.**

In [None]:
# The total of the ranks is the sum of ...

...

**(c)** The `combinations` function of `itertools` has been imported and is used below to display all the subsets of 4 out of the 15 ranks. These are all the possible samples of ranks that the Colts could have. Check that the table has the right number of rows.
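
For a sense of what `combinations` returns, the sketch below enumerates the subsets of size 2 from a hypothetical small population, the integers 1 through 5. The names `small_population` and `small_samples` are made up for this illustration.

```python
from itertools import combinations
import numpy as np
from scipy import special

# Hypothetical small population: the integers 1 through 5
small_population = np.arange(1, 6)

# Every unordered subset of size 2, as a list of tuples
small_samples = list(combinations(small_population, 2))

print(small_samples[:3])     # the first few subsets, in lexicographic order
print(len(small_samples))    # should agree with special.comb(5, 2)
```

Each subset appears exactly once, and no tuple repeats an element, which is why this enumeration corresponds to sampling without replacement with order ignored.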

In [None]:
population = np.arange(1, 16)

all_samples = Table().with_column(
    'Ranks', list(combinations(population, 4))
)

all_samples

Construct an array `rank_sums` consisting of the sums of the ranks in all the samples, and augment the table `all_samples` with a column `Rank Sum` containing the rank sums.

In [None]:
rank_sums = ...

all_samples = ...

all_samples

**(d)** Now we consider the probability distribution of the rank sum statistic.

**(i)** What are the smallest and largest values the rank sum can take? You should not need to use `sort`.

In [None]:
smallest = ...
largest = ...

smallest, largest

**(ii)** Draw a histogram of the rank sums, using bins of width 1 centered on each possible value of the rank sum. As this histogram is based on every possible sample, it displays the *sampling distribution* or equivalently the exact probability distribution of the rank sum statistic under the null hypothesis of random selection.

Note that you will need to offset your bins by 0.5 to ensure that the bars are centered properly.
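
The offset can be sketched on made-up integer data as follows. The array `toy_values` is hypothetical, and `np.histogram` is used here only so that the counts can be inspected without drawing a figure.

```python
import numpy as np

# Hypothetical integer-valued data taking the values 3 through 7
toy_values = np.array([3, 4, 5, 5, 6, 7])

# Bin edges at 2.5, 3.5, ..., 7.5, so each bin of width 1 is centered on an integer
toy_bins = np.arange(toy_values.min() - 0.5, toy_values.max() + 1.5)

counts, edges = np.histogram(toy_values, bins=toy_bins)
print(edges)
print(counts)
```

The same array of edges can be passed as the `bins` argument of a plotting call such as `plt.hist`. Without the offset, each integer would sit on a bin boundary instead of at a bin's center.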

The additional lines of code plot the observed rank sum on the horizontal axis.

In [None]:
...
plt.scatter(prioleau_colts_sum, 0, color='red', s=40)
plt.ylim(-0.005, 0.05);

**(e)** Using the exact distribution of Prioleau's rank sum statistic, we can conduct the Wilcoxon rank sum test.

**(i)** Compute the $p$ value of the test. This is an exact $p$ value, not an empirical or numerical approximation.

In [None]:
p_val = ...
p_val

**(ii)** What is your decision, based on the test? Which of the two hypotheses do you think is better supported by Prioleau's measurements?

*Your answer here.*

\newpage
## 3. Normal Curves
The probability distribution of the rank sum statistic looks very much like the normal distribution, but not exactly. For example, look at the peak of the histogram. You will see two flat bits on either side. 

Still, the distribution doesn't look too far from normal, so it is worth reminding ourselves about the normal curve. This exercise takes you quickly through some code that you can use to display normal curves and areas under them.

**(a)** As you know, the equation of this curve is one of the greatest hits of probability theory, mathematics, and statistics. The parameters of the curve are an expectation $\mu$ that can be any number, and a variance $\sigma^2$ that is a positive number. The equation is

$$
f(x) ~ = ~ \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\big{(}\frac{x-\mu}{\sigma}\big{)}^2}, ~~~~~~~ -\infty < x < \infty
$$
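
As a sanity check on the formula, the sketch below evaluates it directly and compares the result with `stats.norm.pdf` from SciPy. The helper `normal_density` and the parameter values are hypothetical, introduced only for this illustration.

```python
import numpy as np
from scipy import stats

def normal_density(x, mu, sigma):
    """Evaluate the normal curve with mean mu and SD sigma at x (hypothetical helper)."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Hypothetical parameters chosen only for illustration
x = np.array([8.0, 10.0, 13.0])
by_formula = normal_density(x, 10, 2)
by_scipy = stats.norm.pdf(x, loc=10, scale=2)

print(by_formula)
print(by_scipy)
```

Note that `stats.norm` is parametrized by the SD $\sigma$ (as `scale`), not by the variance $\sigma^2$.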

To plot the normal curve, use the `prob140` function `Plot_norm` with three arguments:
- the interval of values of $x$ over which to plot the curve
- the mean $\mu$
- the standard deviation $\sigma$

**(i)** Use `Plot_norm` to plot the standard normal curve. Plot the curve for four standard deviations about its mean.

In [None]:
Plot_norm(...)

**(ii)** Use `Plot_norm` to plot the normal curve with mean $68$ and standard deviation $3$. As in **(i)**, plot the curve for four standard deviations about its mean.

In [None]:
Plot_norm((56, 80), 68, 3)

**(b)** The gold area below is $\Phi(2)$, the value of the cdf of the standard normal at the point $x = 2$. Notice the use of `right_end = 2` to color the area; when the left end is not specified, it is assumed to be the leftmost value on the horizontal axis.
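
Numerically, values of a normal cdf can be found with `stats.norm.cdf`. The sketch below computes $\Phi(2)$ and, as a hypothetical non-standard example, the chance that a normal variable with mean 100 and SD 10 is at most 110.

```python
from scipy import stats

# Phi(2): area under the standard normal curve to the left of 2
phi_2 = stats.norm.cdf(2)

# Hypothetical example: P(X <= 110) when X is normal with mean 100 and SD 10.
# In standard units 110 is 1, so this is the same as Phi(1).
prob = stats.norm.cdf(110, loc=100, scale=10)

print(phi_2, prob)
```

The area over an interval is the difference between the cdf values at its two endpoints.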

In [None]:
Plot_norm((-4, 4), 0, 1, right_end = 2)

Find the numerical value of the area below.

In [None]:
Plot_norm((56, 80), 68, 3, left_end = 65, right_end = 71)

In [None]:
...

\newpage
## 4. Normal Approximation
Let's find the normal distribution that approximates the distribution of the rank sum statistic in Exercise 2. Why approximate a distribution we already know exactly? The answer is that we will need the method of approximation when the sample sizes are too large for us to be able to enumerate all possible samples. Finding the approximation in a case where we know the exact answer helps us see that the approximation is good.

**(a)** Under the null hypothesis of random selection, the distribution of our rank sum statistic is the distribution of the sum of a simple random sample of size 4 from the population of integers 1 through 15.

In general, let $W$ be the sum of $n$ ranks drawn at random without replacement from the integers 1 through $N$.
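
One way to build intuition for $W$ is to simulate it. The sketch below is an empirical illustration only, not the formula-based calculation asked for in this exercise. The function `simulate_ranksum`, the seed, and the number of repetitions are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(140)   # arbitrary seed, for reproducibility only

def simulate_ranksum(n, N, repetitions):
    """Simulate sums of n ranks drawn without replacement from 1 through N."""
    population = np.arange(1, N + 1)
    return np.array([
        rng.choice(population, size=n, replace=False).sum()
        for _ in range(repetitions)
    ])

simulated = simulate_ranksum(4, 15, 10_000)
print(np.mean(simulated), np.std(simulated))
```

The empirical mean and SD printed here should be close to, but not exactly equal to, the exact values of $E(W)$ and $SD(W)$.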

Refer to [Sections 12.1](http://prob140.org/textbook/content/Chapter_12/01_Definition.html) and [13.4](http://prob140.org/textbook/content/Chapter_13/04_Symmetry_and_Indicators.html) of the textbook for the formulas that you need in order to define the functions below. 

**(i)** Define a function `ev_ranksum` that takes $n$ and $N$ as its arguments and returns $E(W)$. **Do not** use arrays or `np.average` in your definition. Use the formulas derived in class.

In [None]:
def ev_ranksum(n, N):
    return ...

null_expectation = ev_ranksum(4, 15)
null_expectation

**(ii)** Now define a function `sd_ranksum` that takes $n$ and $N$ as arguments and returns $SD(W)$. **Do not** use arrays or `np.std` in your definition.

In [None]:
def sd_ranksum(n, N):
    return ...

null_sd = sd_ranksum(4, 15)
null_sd

To check your numerical answers, remember that you enumerated every possible sample and hence every possible rank sum. Run the cell below and confirm that its output is the same as the values returned by your functions.

In [None]:
np.average(rank_sums), np.std(rank_sums)

**(b)** Re-use both lines of code in the last cell of **2(d)** and `Plot_norm` appropriately to superpose the approximating normal distribution over the histogram. Ignore the last line in the cell. It just sets a vertical scale so you can see all the different aspects of the figure.

In [None]:
...
...

Plot_norm(...)
plt.ylim(-0.01, 0.06);

You can see that the curve overestimates near the center, then underestimates on both sides, and then overestimates again in the tails. But it's not bad. The approximation typically improves when the sample sizes get larger.

**(c)** Compute an approximate $p$ value based on the normal approximation from **(b)**. Is your approximate $p$ value an overestimate or underestimate of the true $p$ value?

In [None]:
approx_p_val = ...
p_val, approx_p_val