In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
from scipy import stats

# Pivot #

In [None]:
nba = Table.read_table('data/nba_salaries.csv')
nba = nba.relabeled(3, 'SALARY')

In [None]:
nba.show(20)

Each player has two categorical attributes, Position and Team (actually three, because their name is also categorical). 

In [None]:
# Count how many players are in each Position/Team pair
# Two required arguments:
# First is the column label of the attribute 
# whose values are the column labels of the pivot table
# Second is the label for the rows

# Each cell contains the number of players in that Position/Team category.
# Go back to nba and check that there are 3 Centers in the Atlanta Hawks.

nba.pivot('POSITION', 'TEAM')

In [None]:
nba.pivot('TEAM', 'POSITION')

In [None]:
# This one is easier to read

nba.pivot('POSITION', 'TEAM')

Optionally, you can ask pivot to do the following: for each Position/Team combo, find all the **values** of another attribute and **collect** them in some way; display this in the cell.

- The `values` argument has to be the column label of the new attribute
- The `collect` argument has to be the name of a function
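
To make the roles of `values` and `collect` concrete, here is a minimal plain-Python sketch of the grouping idea. It is not how the `datascience` library implements `pivot`. The rows, the salaries, and the helper name `pivot_sketch` are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (player, position, team, salary).
# These are invented values, not taken from nba_salaries.csv.
rows = [
    ("Player A", "C",  "Team X", 5.0),
    ("Player B", "C",  "Team X", 3.0),
    ("Player C", "PG", "Team X", 2.0),
    ("Player D", "C",  "Team Y", 4.0),
]

def pivot_sketch(rows, collect=len):
    """Group salaries by (team, position), then apply `collect` to each group."""
    groups = defaultdict(list)
    for _player, position, team, salary in rows:
        groups[(team, position)].append(salary)
    return {key: collect(salaries) for key, salaries in groups.items()}

counts = pivot_sketch(rows)               # number of players in each pair
totals = pivot_sketch(rows, collect=sum)  # total salary in each pair

print(counts)
print(totals)
```

Each key plays the part of one cell of the pivot table: the team picks the row, the position picks the column, and `collect` decides what goes in the cell.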

In [None]:
# List the players in each Position/Team combo

nba.pivot('POSITION', 'TEAM', values = 'PLAYER', collect = list)

In [None]:
# Total salary in each Position/Team combo

nba.pivot('POSITION', 'TEAM', values = 'SALARY', collect = sum)

In [None]:
# You don't have to type "values = ..." and "collect = ..."
# But you MUST put the arguments in the correct order

nba.pivot('POSITION', 'TEAM', 'SALARY', sum)

In [None]:
# Median salary in each Position/Team combo

nba.pivot('POSITION', 'TEAM', 'SALARY', np.median)

In [None]:
# This function returns the distance between the max and the min of a list/array

def data_range(x):
    return max(x) - min(x)

In [None]:
# You can use your own function as the collect
# Distance between the max salary and min salary in each Position/Team combo

nba.pivot('POSITION', 'TEAM', 'SALARY', data_range)

# Hypothesis Testing #

## Sample in Two Categories ##

## Example 1 ##
Jo: Every single day this bus has chance 70% of being late, regardless of other days.

Mo: Are you kidding? It's late more often than that!

Data: Watch bus for 200 days, note whether late or not

Null: Every single day this bus has chance 70% of being late, regardless of other days.

Alternative: Chance of "late" is more than 70%

Test statistic: All of the following are fine:
percent late - 70; number of days late - 140; number of days late; percent of days late; proportion of days late

(For the P-value) Direction that supports alternative: For each of these statistics, positive values or large values support the alternative. So look right.
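
One way to see why several statistics are "all fine" is that each is an increasing function of the number of days late, so the right-hand tail picks out the same simulated samples. The sketch below checks this with plain NumPy rather than the course's `sample_proportions`. The observed count of 150, the seed, and the helper name `right_tail_p` are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Number of late days in 200 days under the null, simulated 10,000 times
simulated_counts = rng.binomial(200, 0.7, size=10_000)
observed_count = 150  # illustrative observed value (an assumption)

def right_tail_p(simulated, observed):
    """Proportion of simulated statistics at least as large as the observed one."""
    return np.count_nonzero(simulated >= observed) / len(simulated)

# Three of the statistics listed above, applied to the same simulated samples
p_count = right_tail_p(simulated_counts, observed_count)
p_count_minus_140 = right_tail_p(simulated_counts - 140, observed_count - 140)
p_percent = right_tail_p(simulated_counts / 2, observed_count / 2)  # percent = 100 * count / 200

print(p_count, p_count_minus_140, p_percent)
```

The three printed values agree, because subtracting a constant or rescaling by a positive factor does not change which samples fall in the right-hand tail.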

## Example 2 ##
Jo: Every single day this bus has chance 70% of being late, regardless of other days.

Po: Jo, stop whining. It's not late that often.

Data: Watch bus for 200 days, note whether late or not

Null: Every single day this bus has chance 70% of being late, regardless of other days.

Alternative: Chance of "late" is less than 70%

Test statistic: Same as in Example 1:
percent late - 70; number of days late - 140; number of days late; percent of days late; proportion of days late

(For the P-value) Direction that supports alternative: For each of these statistics, negative values or small values support the alternative. So look left.

## Example 3 ##
Jo: Every single day this bus has chance 70% of being late, regardless of other days.

Bo: Jo, that's just not true.

Data: Watch bus for 200 days, note whether late or not

Null: Every single day this bus has chance 70% of being late, regardless of other days.

Alternative: Chance of late is not 70%

Test statistic: |number of days late - 140|; |percent of days late - 70|; |proportion of days late - 0.7|; also TVD, but see below.

(For the P-value) Direction that supports alternative: Big distances support the alternative. So look right.
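
The three alternatives differ only in which tail of the null distribution is counted. As a rough sketch, assuming an observed count of 150 late days out of 200 and using NumPy's binomial sampler in place of `sample_proportions`, the three P-values can be computed from one set of simulated counts. The variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable
simulated_counts = rng.binomial(200, 0.7, size=10_000)
observed_count = 150  # illustrative observed value (an assumption)

# Alternative "more than 70%": large counts support it, so look right
p_more = np.mean(simulated_counts >= observed_count)

# Alternative "less than 70%": small counts support it, so look left
p_less = np.mean(simulated_counts <= observed_count)

# Alternative "not 70%": big distances from 140 support it, so look right
observed_distance = abs(observed_count - 140)
p_not_equal = np.mean(np.abs(simulated_counts - 140) >= observed_distance)

print(p_more, p_less, p_not_equal)
```

With the same data, the left-tail value is close to 1 because 150 late days is not evidence that the chance is below 70%, while the two-sided value is larger than the right-tail value because it counts both tails.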

In [None]:
# The simulation will be under the null hypothesis.

null_proportions = make_array(0.7, 0.3)

In [None]:
# Suppose we choose |proportion late - .7| as the test statistic.
# Define a function that simulates ONE value of this statistic under the null

# This code depends on the null hypothesis and the choice of statistic.

def distance_under_null():
    proportion_late = sample_proportions(200, null_proportions).item(0)
    return abs(proportion_late - 0.7)

# Note: If you want to use counts instead of proportions, multiply by sample size
# If you want to use percents instead of proportions, multiply by 100

In [None]:
# Simulate 10,000 values of the test statistic
# and collect them in an array.
# This code always looks the same.

distances = make_array()
for i in np.arange(10000):
    distances = np.append(distances, distance_under_null())

In [None]:
# If the null is true, this is how the statistic should behave

distance_tbl = Table().with_column('Distance', distances)
distance_tbl.hist(bins=np.arange(0, 0.14, 0.01))

In [None]:
# Until now, we haven't needed to use what was actually observed.
# But at this point we have to compare the null prediction with the observed statistic.

# Suppose the data are 150 times late out of 200 times

observed_statistic = abs(150/200 - .7)
observed_statistic

In [None]:
empirical_p = np.count_nonzero(distances >= observed_statistic) / 10000
empirical_p

Interpreting the P-value:
It is the chance, assuming that the bus has chance 70% of being late each day regardless of other days, that the statistic |proportion late - 0.7| comes out 0.05 or greater in 200 days.

In the simulation that chance is about 12.4%, which is substantial. So if the null is true, there is a decent chance of getting the statistic that was actually observed or one that looks even more like the alternative. The data are therefore consistent with the null.
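
As a cross-check on the simulation, the same tail chance can be computed directly from the binomial distribution. The sketch below is an aside rather than part of the original analysis. It uses only `math.comb`, and the names `binom_pmf`, `p_from_counts`, and `p_from_floats` are introduced here. It also shows a floating-point subtlety: in Python, `abs(130/200 - 0.7)` evaluates to slightly less than `abs(150/200 - 0.7)`, so a `>=` comparison on proportions leaves out samples with exactly 130 late days, while the same comparison on integer counts includes them.

```python
from math import comb

n, p = 200, 0.7

def binom_pmf(k):
    """Chance of exactly k late days out of n under the null."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Tail chance using integer counts: |k - 140| >= 10
p_from_counts = sum(binom_pmf(k) for k in range(n + 1) if abs(k - 140) >= 10)

# Same event written with floating-point proportions, as in the simulation
observed_statistic = abs(150 / 200 - 0.7)
p_from_floats = sum(
    binom_pmf(k) for k in range(n + 1) if abs(k / n - 0.7) >= observed_statistic
)

print(p_from_counts, p_from_floats)
print(abs(130 / 200 - 0.7) >= observed_statistic)  # False because of rounding
```

The two totals differ by the chance of exactly 130 late days. Comparing counts, or rounding the distances before comparing, avoids this; either way the conclusion that the data are consistent with the null is unchanged.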

In [None]:
# Want to use the TVD as the statistic?
# Go ahead:

(abs(0.75 - .7) + abs(.25 - .3)) / 2

When there are just two categories of data, the TVD is equal to the distance between one of the proportions and the corresponding proportion in the null. In other words, the simple distance we calculated as our statistic is actually the TVD. 
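
The reason is that with two categories the distributions are (p, 1 - p) and (q, 1 - q), so both absolute differences equal |p - q| and halving their sum gives |p - q| again. A small sketch checks this numerically; the helper name `tvd` and the extra pairs of proportions are chosen here only for illustration.

```python
def tvd(dist_1, dist_2):
    """Total variation distance between two categorical distributions."""
    return sum(abs(a - b) for a, b in zip(dist_1, dist_2)) / 2

# (sample proportion, null proportion) pairs; the first is the bus example
for p, q in [(0.75, 0.7), (0.4, 0.7), (0.1, 0.9)]:
    two_category_tvd = tvd([p, 1 - p], [q, 1 - q])
    simple_distance = abs(p - q)
    print(p, q, two_category_tvd, simple_distance)
```

Up to floating-point rounding, the last two printed columns agree on every row.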

## Sample in Multiple Categories ##

Jo: Every single day this bus has a 50% chance of being a bit late, a 20% chance of being very late, and a 30% chance of being on time, regardless of other days.

Po: No it doesn't.

Data: Watch bus for 200 days, note arrivals in the three categories

Null: Every single day this bus has a 50% chance of being a bit late, a 20% chance of being very late, and a 30% chance of being on time, regardless of other days.

Alternative: The null model is wrong.

Test statistic: TVD. It has to be a distance, because the alternative only says the null is "wrong" without naming a direction, and it has to measure the distance between two categorical distributions, not two numbers.

(For the P-value) Direction that supports alternative: Big distances support the alternative. Look right.

In [None]:
null_proportions = make_array(0.5, 0.2, 0.3)

def tvd_under_null():
    in_sample = sample_proportions(200, null_proportions)
    return sum(abs(in_sample - null_proportions))/2

In [None]:
tvds = make_array()
for i in np.arange(10000):
    tvds = np.append(tvds, tvd_under_null())

In [None]:
tvd_tbl = Table().with_column('TVD', tvds)
tvd_tbl.hist(bins=np.arange(0, .12, 0.01))

In [None]:
# The histogram above is the prediction made by the null hypothesis.
# Compare with the data:
# Suppose the data are 90 times a bit late, 60 times very late, 50 times on time

observed_proportions = make_array(90, 60, 50)/200
observed_tvd = sum(abs(observed_proportions - null_proportions))/2
observed_tvd

In [None]:
empirical_p = np.count_nonzero(tvds >= observed_tvd)/10000
empirical_p

Conclusion of test: The data support the hypothesis that the null model is wrong.

Notice that the cruder model ("late 70% of the time") and this more detailed one ("a bit late 50% of the time, very late 20% of the time, on time 30% of the time") both put the chance of being late at 70%. The observed data were also consistent with each other: 150 late out of 200, compared to 90 "a bit late", 60 "very late", and 50 "on time". The latter is still 150 late.

The tests said the data are consistent with 70% late, but not with 50% a bit late, 20% very late, and 30% on time. That's not surprising. Often, the more detailed specifications you have in your model, the less likely you are to see all of that in the data even if the model is good.
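
The arithmetic behind this comparison can be sketched by merging the two "late" categories and recomputing the TVD. The helper names `tvd` and `collapse` are introduced here for illustration; the proportions are the ones used above.

```python
import numpy as np

def tvd(dist_1, dist_2):
    """Total variation distance between two categorical distributions."""
    return np.sum(np.abs(dist_1 - dist_2)) / 2

# Order of categories: a bit late, very late, on time
null_detailed = np.array([0.5, 0.2, 0.3])
observed_detailed = np.array([90, 60, 50]) / 200

def collapse(dist):
    """Merge 'a bit late' and 'very late' into a single 'late' category."""
    return np.array([dist[0] + dist[1], dist[2]])

tvd_detailed = tvd(observed_detailed, null_detailed)
tvd_collapsed = tvd(collapse(observed_detailed), collapse(null_detailed))

print(tvd_detailed)   # approximately 0.10
print(tvd_collapsed)  # approximately 0.05
```

Merging categories lets the shortfall in "a bit late" partly cancel the excess in "very late", so the collapsed distance is half the detailed one. The detailed model is held to account for discrepancies that the cruder model cannot see.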