# Lab 5: Randomness and Sampling

Welcome to Lab 5! In this lab we will learn about sampling strategies. More information about sampling can be found in the textbook, in [Chapter 9](http://sierra.ucsd.edu/dsc10-book/chapters/09/Randomness.html) and [Chapter 10](http://sierra.ucsd.edu/dsc10-book/chapters/10/Sampling_and_Empirical_Distributions.html). This lab is due on **Monday, 02/10 at 11:59pm.**


The data used in this lab will contain salary data and statistics for basketball players from the 2014-2015 NBA season. This data was collected from [basketball-reference](http://www.basketball-reference.com) and [spotrac](http://www.spotrac.com).

In [None]:
import numpy as np
import babypandas as bpd

# These lines set up graphing capabilities.
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

from client.api.notebook import Notebook
ok = Notebook('lab.ok')
_ = ok.auth(inline=True)

%matplotlib inline

## 1. Dungeons and Dragons and Sampling
In the game Dungeons & Dragons, each player plays the role of a fantasy character.

A player performs actions by rolling a 20-sided die, adding a "modifier" number to the roll, and comparing the total to a threshold for success.  The modifier depends on the character's competence in performing the action.

For example, suppose Alice's character, a barbarian warrior named Roga, is trying to knock down a heavy door.  She rolls a 20-sided die, adds a modifier of 11 to the result (because her character is good at knocking down doors), and succeeds if the total is greater than 15.
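
Before simulating anything, the exact chance of success in this example can be worked out by counting. The sketch below assumes the die is fair; the names `threshold`, `successful_rolls`, and `exact_success_chance` are ours for illustration and are not used elsewhere in the lab.

```python
import numpy as np

# All faces of a fair 20-sided die.
possible_rolls = np.arange(1, 21)

# Roga's modifier and the threshold from the example above.
modifier = 11
threshold = 15

# A roll succeeds when roll + modifier is strictly greater than the threshold.
successful_rolls = possible_rolls[possible_rolls + modifier > threshold]

# Every face is equally likely, so the chance of success is a simple ratio.
exact_success_chance = len(successful_rolls) / len(possible_rolls)
print(exact_success_chance)
```

A simulated success rate based on only a handful of rolls will usually differ from this exact value; the two should agree more closely as the number of rolls grows.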

**Question 1.1** 

Write code that simulates that procedure.  Compute three values: the result of Alice's roll (`roll_result`), the result of her roll plus Roga's modifier (`modified_result`), and a boolean value indicating whether the action succeeded (`action_succeeded`).  **Do not fill in any of the results manually**; the entire simulation should happen in code.

*Hint:* A roll of a 20-sided die is a number chosen uniformly from the array `np.array([1, 2, 3, 4, ..., 20])`. You can store these possibilities in `possible_rolls`.  So a roll of a 20-sided die *plus 11* is a number chosen uniformly from that array, plus 11.

In [None]:
possible_rolls = ...
roll_result = ...
modified_result = ...
action_succeeded = ...

# The next line just prints out your results in a nice way
# once you're done.  You can delete it if you want.
print("On a modified roll of {:d}, Alice's action {}.".format(modified_result, "succeeded" if action_succeeded else "failed"))

In [None]:
#DELETE
possible_rolls = np.arange(1, 21)
roll_result = np.random.choice(possible_rolls)
modified_result = roll_result + 11
action_succeeded = modified_result > 15

# The next line just prints out your results in a nice way
# once you're done.  You can delete it if you want.
print("On a modified roll of {:d}, Alice's action {}.".format(modified_result, "succeeded" if action_succeeded else "failed"))

In [None]:
_ = ok.grade('q1_1')

**Question 1.2** Run your cell 7 times. What fraction of times did Alice succeed at this action? Your answer should be a decimal number between 0 and 1.

In [None]:
#...rough_success_chance
rough_success_chance = np.sum(np.random.choice(possible_rolls, 7) + 11 > 15) / 7
rough_success_chance

In [None]:
_ = ok.grade('q1_2')

Suppose we don't know that Roga has a modifier of 11 for this action.  Instead, we observe the modified roll (that is, the die roll plus the modifier of 11) from each of 7 of her attempts to knock down doors.  We would like to estimate her modifier from these 7 numbers.

**Question 1.3** Write a Python function called `simulate_observations`.  It should take no arguments, and it should return an array of 7 numbers.  Each of the numbers should be the modified roll from one simulation.  **Then**, call your function once to compute an array of 7 simulated modified rolls.  Name that array `observations`.

In [None]:
modifier = 11
num_observations = 7

def simulate_observations():
    """Produces an array of 7 simulated modified die rolls"""
    ...
    
observations = ...
observations

In [None]:
#DELETE
modifier = 11
num_observations = 7

def simulate_observations():
    """Produces an array of 7 simulated modified die rolls"""
    return np.random.choice(possible_rolls, num_observations) + modifier
    
observations = simulate_observations()
observations

In [None]:
_ = ok.grade('q1_3')

**Question 1.4** Draw a histogram to display the *probability distribution* of the modified rolls we might see. 

In [None]:
# We suggest using these bins.
roll_bins = np.arange(1, modifier+2+20, 1)

In [None]:
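#- place your code here
# Each possible modified roll is equally likely, so plot each one exactly once.
plt.hist(possible_rolls + modifier, bins=roll_bins, density=True)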

Now let's imagine we don't know the modifier and try to estimate it from `observations`.

One straightforward way to do so is to find the smallest modified roll overall. The smallest number on a 20-sided die is 1, so if the smallest modified roll we see is 1, we know that the player's modifier must be zero. If the smallest modified roll is something larger (say, 12), we can't say for certain what the player's modifier is, but we'll guess that the player rolled a 1 and that their modifier is 11. This works because, if we see enough modified rolls, one of them will almost certainly have occurred when the player rolled a 1.
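
How many rolls count as "enough"? As a rough check, the chance that at least one of $n$ fair rolls is a 1 is $1 - (19/20)^n$. The sketch below evaluates that formula; the function name `chance_of_rolling_a_one` is hypothetical and is not part of the lab.

```python
def chance_of_rolling_a_one(num_observations, sides=20):
    """Probability that at least one of num_observations fair rolls is a 1."""
    # The chance that a single roll is NOT a 1 is (sides - 1) / sides,
    # and independent rolls multiply.
    return 1 - ((sides - 1) / sides) ** num_observations

for n in [7, 50, 100]:
    print(n, round(chance_of_rolling_a_one(n), 3))
```

With only 7 observations a roll of 1 is far from guaranteed, so the smallest-roll estimate will often be too high. It can never be too low, because no modified roll is smaller than the modifier plus 1.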

**Question 1.5** Using this method, estimate `modifier` from `observations` and name that estimate `min_estimate`.

In [None]:
#...min_estimate
min_estimate = observations.min() - 1
min_estimate

In [None]:
_ = ok.grade('q1_5')

Another way to estimate the modifier involves the mean of `observations`. If a player's modifier is zero, then the mean of a large number of their modified rolls will be close to the mean of 1, 2, ..., 20, which is 10.5. If their modifier is $m$, then the mean of their modified rolls will be close to the mean of $1 + m$, $2 + m$, ..., $20 + m$, which is $10.5 + m$.
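
The arithmetic behind this is easy to verify directly. The short sketch below uses an example modifier of 11 and variable names of our own choosing (`base_mean`, `shifted_mean`, `recovered_modifier`); it checks the claim about means only and does not involve any random rolls.

```python
import numpy as np

possible_rolls = np.arange(1, 21)

# Mean of 1, 2, ..., 20.
base_mean = possible_rolls.mean()

# Adding the same modifier to every face shifts the mean by that modifier.
m = 11
shifted_mean = (possible_rolls + m).mean()

# Subtracting the unshifted mean recovers the modifier.
recovered_modifier = shifted_mean - base_mean
print(base_mean, shifted_mean, recovered_modifier)
```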

**Question 1.6** Write a function named `mean_based_estimator` that computes your estimate using this method.  It should take an array of modified rolls (like the array `observations`) as its argument and return an estimate of `modifier` based on those numbers.

In [None]:
def mean_based_estimator(nums):
    """Estimate the roll modifier based on observed modified rolls in the array nums."""
    ...

# Here is an example call to your function.  It computes an estimate
# of the modifier from our 7 observations.
mean_based_estimate = mean_based_estimator(observations)
mean_based_estimate

In [None]:
#DELETE
def mean_based_estimator(nums):
    """Estimate the roll modifier based on observed modified rolls in the array nums."""
    return np.mean(nums) - 10.5

# Here is an example call to your function.  It computes an estimate
# of the modifier from our 7 observations.
mean_based_estimate = mean_based_estimator(observations)
mean_based_estimate

In [None]:
_ = ok.grade('q1_6')

## 2. Sampling

We'll use some NBA data to get some practice with sampling.
Run the cell below to load the player and salary data.

In [None]:
player_data = bpd.read_csv("player_data.csv").set_index('Name')
salary_data = bpd.read_csv("salary_data.csv").set_index('PlayerName')
full_data = salary_data.merge(player_data, left_index=True, right_index=True)

player_data

In [None]:
salary_data

In [None]:
full_data

Rather than getting data on every player, imagine that we had gotten data on only a smaller subset of the players.  For 492 players, it's not so unreasonable to expect to see all the data, but usually we aren't so lucky.  Instead, we often make *statistical inferences* about a large underlying population using a smaller sample.

A **statistical inference** is a statement about some statistic of the underlying population, such as "the average salary of NBA players in 2014 was $3 million".  You may have heard the word "inference" used in other contexts.  It's important to keep in mind that statistical inferences can be wrong.

A general strategy for inference using samples is to estimate statistics of the population by computing the same statistics on a sample.  This strategy sometimes works well and sometimes doesn't.  The degree to which it gives us useful answers depends on several factors, and we'll touch lightly on a few of those today.
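
As a minimal sketch of this strategy, the code below builds a synthetic population (made-up numbers, not the NBA data), draws a sample from it, and computes the same statistic on both. The seed, the exponential distribution, and the sizes 492 and 44 are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

# Hypothetical population of 492 salaries (synthetic values).
population = rng.exponential(scale=4_000_000, size=492)

# Draw 44 distinct individuals and compute the same statistic on the sample.
sample = rng.choice(population, size=44, replace=False)

population_mean = population.mean()
sample_mean = sample.mean()
print(population_mean, sample_mean)
```

The sample mean will generally be near the population mean but not equal to it; how near depends on the sample size and on how the sample was gathered.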

One very important factor in the utility of samples is how they were gathered.  We have prepared some example sample datasets to simulate inference from different kinds of samples for the NBA player dataset.  Later we'll ask you to create your own samples to see how they behave.

To save typing and increase the clarity of your code, we will package the loading and analysis code into two functions. This will be useful in the rest of the lab as we will repeatedly need to create histograms and collect summary statistics from that data.

**Question 2.1**. Complete the `histograms` function, which takes a table with columns `Age` and `Salary` and draws a histogram for each one. Use the min and max functions to pick the bin boundaries so that all data appears for any table passed to your function. Use bin widths of 1 year for `Age` and $1,000,000 for `Salary`.

*Hint:* Make sure that your bins **include** the maximum value.  Remember that bins include the left value but exclude the right value.
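
To see why the right-hand edge matters, here is a small sketch with made-up ages (not taken from the lab data). It builds edges that extend one full bin past the maximum, so the largest age gets a bin of its own.

```python
import numpy as np

ages = np.array([19, 22, 25, 40])  # made-up ages for illustration

# np.arange excludes its stop value, so max + 2 makes the last edge max + 1.
# The final bin then starts at 40 and ends at 41.
age_bins = np.arange(ages.min(), ages.max() + 2, 1)

counts, edges = np.histogram(ages, bins=age_bins)
print(edges[-2], edges[-1], counts[-1])
```

Note that `np.histogram` and `plt.hist` treat the final bin as closed on both sides, so stopping the edges at the maximum would not drop the largest value; it would instead lump it into the same bin as the value just below it.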

In [None]:
def histograms(t):
    ages = t.get('Age')
    salaries = t.get('Salary')
    age_bins = ...
    salary_bins = ...
    
    a = plt.figure(1)
    plt.hist(ages, bins=age_bins, density=True)
    plt.title('Distribution of Ages')
    a.show()
    s = plt.figure(2)
    plt.hist(salaries, bins=salary_bins, density=True)
    plt.title('Distribution of Salaries')
    s.show()
    return age_bins, salary_bins # Keep this statement so that your work can be checked
    
histograms(full_data)
print('Two histograms should be displayed below')

In [None]:
#DELETE
def histograms(t):
    ages = t.get('Age')
    salaries = t.get('Salary')
    # Add 2 so that the final bin, [max, max + 1), contains the maximum age.
    age_bins = np.arange(ages.min(), ages.max() + 2, 1)
    salary_bins = np.arange(salaries.min(), salaries.max()+1e6+1, 1e6)
    
    a = plt.figure(1)
    plt.hist(ages, bins=age_bins, density=True, alpha=.75)
    plt.title('Distribution of Ages')
    a.show()
    s = plt.figure(2)
    plt.hist(salaries, bins=salary_bins, density=True, alpha=.75)
    plt.title('Distribution of Salaries')
    s.show()
    return age_bins, salary_bins # Keep this statement so that your work can be checked
    
histograms(full_data)
# Two histograms should be displayed below

In [None]:
_ = ok.grade('q2_1') # Warning: Charts will be displayed while running this test

**Question 2.2**. Create a function called `compute_statistics` that takes a DataFrame containing ages and salaries and:
- Draws a histogram of ages
- Draws a histogram of salaries
- Returns a two-element array containing the average age and average salary

You can call your `histograms` function to draw the histograms!

In [None]:
def compute_statistics(age_and_salary_data):
    ...
    age = ...
    salary = ...
    ...
    

full_stats = compute_statistics(full_data)
full_stats

In [None]:
#DELETE

def compute_statistics(age_and_salary_data):
    histograms(age_and_salary_data)
    age = np.average(age_and_salary_data.get('Age').values)
    salary = np.average(age_and_salary_data.get('Salary').values)
    return np.array([age, salary])
    

full_stats = compute_statistics(full_data)
full_stats

In [None]:
_ = ok.grade('q2_2') # Warning: Charts will be displayed while running this test

### Convenience sampling
One sampling methodology, which is **generally a bad idea**, is to choose players who are somehow convenient to sample.  For example, you might choose players from one team that's near your house, since it's easier to survey them.  This is called, somewhat pejoratively, *convenience sampling*.

Suppose you survey only *relatively new* players with ages less than 22.  (The more experienced players didn't bother to answer your surveys about their salaries.)
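
To get a feel for why this kind of sample can mislead, consider the following sketch on synthetic data. The assumption that salary tends to rise with age, and all of the numbers, are invented for illustration and are not derived from the NBA dataset.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Synthetic league: ages 19 through 39, with salary loosely increasing with age.
ages = rng.integers(19, 40, size=492)
salaries = 500_000 + 400_000 * (ages - 19) + rng.exponential(1_000_000, size=492)

# Convenience sample: only players younger than 22.
is_young = ages < 22
young_mean = salaries[is_young].mean()
overall_mean = salaries.mean()
print(young_mean, overall_mean)
```

In this made-up population the young players' average salary falls well below the overall average, and taking a bigger convenience sample would not fix that.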

**Question 2.3**  Assign `convenience_sample` to a subset of `full_data` that contains only the rows for players under the age of 22.

In [None]:
#...convenience_sample
convenience_sample = full_data[full_data.get('Age') < 22]
convenience_sample

In [None]:
_ = ok.grade('q2_3')

**Question 2.4** Assign `convenience_stats` to an array of the average age and average salary of your convenience sample, using the `compute_statistics` function.  Since they're computed on a sample, these are called *sample averages*. 

In [None]:
#...convenience_stats
convenience_stats = compute_statistics(convenience_sample)
convenience_stats

In [None]:
_ = ok.grade('q2_4')

Next, we'll compare the convenience sample salaries with the full data salaries.

In [None]:
# just run this cell, don't change it
def compare_salaries(first, second, first_title, second_title):
    """Compare the salaries in two tables."""
    bins = np.arange(0, 25_000_000, 1_000_000)
    first.plot(kind='hist', y='Salary', bins=bins, density=True)
    plt.title(first_title)
    second.plot(kind='hist', y='Salary', bins=bins, density=True)
    plt.title(second_title)

compare_salaries(full_data, convenience_sample, 'All Players', 'Convenience Sample')

**Question 2.5** From what you see in the histogram above, does the convenience sample give us an accurate picture of the age and salary of the full population of NBA players in 2014-2015?  Would you expect it to, in general?  Assign either 1, 2, 3, or 4 to the variable `sampling_q5` below. 
1. Yes. The sample is large enough, so it is an accurate representation of the population.
2. No. The sample is too small, so it won't give us an accurate representation of the population.
3. No, but this was just an unlucky sample. Normally this would give us an accurate representation of the population.
4. No. This type of sample doesn't give us an accurate representation of the population.

In [None]:
#...sampling_q5
sampling_q5 = 4
sampling_q5

In [None]:
_ = ok.grade('q2_5')

### Simple random sampling
A more principled approach is to sample uniformly at random from the players.  If we ensure that each player is selected at most once, this is a *simple random sample without replacement*, sometimes abbreviated to "simple random sample" or "SRSWOR".  Imagine writing down each player's name on a card, putting the cards in a hat, and shuffling them.  Then, pull out cards one by one and set them aside, stopping when the specified *sample size* is reached.
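
The cards-in-a-hat procedure can be sketched in a few lines. The roster below is a list of placeholder names, and the seed and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

# Hypothetical roster; the names are placeholders.
names = np.array(['Player ' + str(i) for i in range(10)])

# Shuffle the "cards", then take the first few off the top.
shuffled = rng.permutation(names)
sample_size = 4
srswor = shuffled[:sample_size]
print(srswor)
```

Because each card appears exactly once in the shuffled order, no player can be drawn twice.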

We've produced two samples of the `salary_data` table in this way: `small_srswor_salary.csv` and `large_srswor_salary.csv` contain, respectively, a sample of size 44 (the same as the convenience sample) and a larger sample of size 100.  

The `load_data` function below loads a salary table and joins it with `player_data`.

In [None]:
def load_data(salary_file):
    return player_data.merge(bpd.read_csv(salary_file), left_index=True, right_on='PlayerName')

**Question 2.6** Run the same analyses on the small and large samples that you previously ran on the full dataset and on the convenience sample.  Compare the accuracy of the estimates of the population statistics that we get from the small simple random sample, the large simple random sample, and the convenience sample. 

**Note:** `small_srswor_data` and `large_srswor_data` should be DataFrames loaded from their respective `small_srswor_salary.csv` and `large_srswor_salary.csv`

In [None]:
# Original:
small_srswor_data = ...
small_stats = ...
large_srswor_data = ...
large_stats = ...
convenience_stats = ...
print('Full data stats:                 ', full_stats)
print('Small simple random sample stats:', small_stats)
print('Large simple random sample stats:', large_stats)
print('Convenience sample stats:        ', convenience_stats)

In [None]:
#DELETE

small_srswor_data = load_data('small_srswor_salary.csv')
small_stats = compute_statistics(small_srswor_data)
large_srswor_data = load_data('large_srswor_salary.csv')
large_stats = compute_statistics(large_srswor_data)
convenience_stats = compute_statistics(convenience_sample)

plt.figure(1).legend(['Small SRSWOR', 'Large SRSWOR', 'Convenience'])
plt.figure(2).legend(['Small SRSWOR', 'Large SRSWOR', 'Convenience'])
print('Full data stats:                 ', full_stats)
print('Small simple random sample stats:', small_stats)
print('Large simple random sample stats:', large_stats)
print('Convenience sample stats:        ', convenience_stats)

In [None]:
_ = ok.grade('q2_6')

### Producing simple random samples
Often it's useful to take random samples even when we have a larger dataset available.  The randomized response technique was one example we saw in [Chapter 10](http://sierra.ucsd.edu/dsc10-book/chapters/10/Sampling_and_Empirical_Distributions.html) and in lecture with United flight delays.  Another use is to help us understand how inaccurate other samples are.

DataFrames provide the method `sample()` for producing random samples.  Note that its default is to sample **without** replacement. To see how to call `sample()`, enter `full_data.sample?` into a code cell and run it.
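
Here is a small sketch of both modes. It uses plain pandas as a stand-in, on the assumption that the babypandas `sample` method behaves the same way for these arguments; the table `toy` is made up for illustration.

```python
import pandas as pd

# Tiny made-up table standing in for full_data.
toy = pd.DataFrame(
    {'Age': [20, 25, 30, 35], 'Salary': [1e6, 2e6, 3e6, 4e6]},
    index=['A', 'B', 'C', 'D'],
)

# Default behavior: each row can be chosen at most once.
without_replacement = toy.sample(3)

# With replace=True, the same row may be chosen more than once,
# so the sample can even be larger than the table.
with_replacement = toy.sample(6, replace=True)

print(without_replacement.index.is_unique, len(with_replacement))
```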

In [None]:
full_data.sample?

**Question 2.7** Produce a simple random sample of size 44 from `full_data` *without replacement*.  (You don't need to bother with a merge this time; just use `full_data.sample(...)` directly.  That will have the same result as sampling from `salary_data` and joining with `player_data`.)  Run your analysis on it again.

In [None]:
my_small_srswor_data = ...
my_small_stats = ...
my_small_stats

In [None]:
#DELETE

my_small_srswor_data = full_data.sample(44, replace=False)
my_small_stats = compute_statistics(my_small_srswor_data)
my_small_stats

Are your results similar to those in the small sample we provided you? Do things change a lot across separate samples? Run your code several times to get new samples. Assign either 1, 2, 3, or 4 to the variable `sampling_q7` below.
1. The results are very different from the small sample, and don't change at all across separate samples.
2. The results are very different from the small sample, and change a bit across separate samples.
3. The results are slightly different from the small sample, and change a bit across separate samples.
4. The results are not at all different from the small sample, and don't change at all across separate samples.

In [None]:
#...sampling_q7
sampling_q7 = 3
sampling_q7

In [None]:
_ = ok.grade('q2_7')

**Question 2.8** As in the previous question, analyze several simple random samples of size 100 from `full_data`.

In [None]:
my_large_srswor_data = ...
my_large_stats = ...
my_large_stats

In [None]:
#DELETE

my_large_srswor_data = full_data.sample(100, replace=False)
my_large_stats = compute_statistics(my_large_srswor_data)
my_large_stats

Do the average and histogram statistics seem to change more or less across samples of this size than across samples of size 44?  And are the sample averages and histograms closer to their true values for age or for salary?  Assign either 1, 2, 3, 4, or 5 to the variable `sampling_q8` below. 

Is this what you expected to see?
1. The statistics change *less* across samples of this size than across smaller samples. The statistics are closer to their true values for *age* than they are for *salary*.
2. The statistics change *less* across samples of this size than across smaller samples. The statistics are closer to their true values for *salary* than they are for *age*.
3. The statistics change *more* across samples of this size than across smaller samples. The statistics are closer to their true values for *age* than they are for *salary*.
4. The statistics change *more* across samples of this size than across smaller samples. The statistics are closer to their true values for *salary* than they are for *age*.
5. The statistics change an *equal amount* across samples of this size as across smaller samples. The statistics for age and salary are *equally close* to their true values.
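
One rough way to quantify "how much the statistics change" is to draw many samples of each size and measure the spread of the resulting sample means. The sketch below does this on a synthetic population (not the NBA data); the helper `spread_of_sample_means`, the seed, and the number of repetitions are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed

# Synthetic salaries standing in for a population of 492 players.
population = rng.exponential(scale=4_000_000, size=492)

def spread_of_sample_means(sample_size, repetitions=2000):
    """Standard deviation of the sample mean over many simple random samples."""
    means = np.array([
        rng.choice(population, size=sample_size, replace=False).mean()
        for _ in range(repetitions)
    ])
    return means.std()

spread_44 = spread_of_sample_means(44)
spread_100 = spread_of_sample_means(100)
print(spread_44, spread_100)
```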

In [None]:
#...sampling_q8
sampling_q8 = 1
sampling_q8

In [None]:
_ = ok.grade('q2_8')

# Finish Line

## Before submitting, select "Kernel" -> "Restart & Run All" from the menu!

Then make sure that all of your cells ran without error.

**Well Done!** You are done with Lab 5. Please run the cells below to ensure that you have passed all of your tests and to submit to okPy.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

In [None]:
_ = ok.submit()