# Homework 4: Iteration and Randomness

## Due Friday, July 16th at 11:59pm

Welcome to Homework 4! This week, we will go over probability, simulations using iteration, and functions. You can find additional help on these topics in Lecture 7 (Iteration portion) and Lecture 8 (Probability and Simulations) of the course material.

<span style="color:red"><b>Note!</b></span> To make this assignment a little shorter and easier, the tests for the first two problems are *correctness* tests -- if these pass, you'll get full credit for the problem. On the other hand, the last two problems are regular homework problems -- if the test passes, your answer may still not be correct!

### Instructions

This assignment is due Friday, July 16th at 11:59pm. You are given six slip days throughout the quarter, each of which can extend the deadline by one day. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

You should start early so that you have time to get help if you're stuck. A calendar with lab hour times and locations appears on [the course webpage](http://dsc10.com).

In [1]:
# please don't change this cell, but do make sure to run it

%matplotlib inline

import babypandas as bpd
import matplotlib.pyplot as plt
import numpy as np
import otter
grader = otter.Notebook()

## 1. Dungeons and Dragons and Sampling

<span style="color:red"><b>Note!</b></span> To make this assignment a little shorter and easier, the tests for this problem are *correctness* tests -- if these pass, you'll get full credit for the problem.


In the game Dungeons & Dragons, each player plays the role of a fantasy character.

A player performs actions by rolling an 18-sided die, adding a "modifier" number to the roll, and comparing the total to a threshold for success.  The modifier depends on her character's competence in performing the action.

For example, suppose Alice's character, a barbarian warrior named Roga, is trying to knock down a heavy door.  She rolls an 18-sided die, adds a modifier of 7 to the result (because her character is good at knocking down doors), and succeeds if the total is greater than 16.

**Question 1.1** 

Write code that simulates that procedure.  Compute three values: the result of Alice's roll (`roll_result`), the result of her roll plus Roga's modifier (`modified_result`), and a boolean value indicating whether the action succeeded (`action_succeeded`).  **Do not fill in any of the results manually**; the entire simulation should happen in code.

*Hint:* A roll of an 18-sided die is a number chosen uniformly from the array `np.array([1, 2, 3, 4, ..., 18])`. You can store these possibilities in `possible_rolls`.  So a roll of an 18-sided die *plus 7* is a number chosen uniformly from that array, plus 7.
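As a rough sketch of the idea in the hint, the snippet below rolls a smaller 6-sided die with made-up values for the modifier and threshold. The helper name `simulate_action` and its parameters are hypothetical and chosen only for illustration; they are not part of the assignment.

```python
import numpy as np

def simulate_action(num_sides, modifier, threshold):
    """Roll one die, add a modifier, and compare the total to a threshold.

    Hypothetical helper for illustration; not part of the assignment.
    """
    possible_rolls = np.arange(1, num_sides + 1)   # 1, 2, ..., num_sides
    roll = np.random.choice(possible_rolls)        # uniform draw from the array
    total = roll + modifier
    return roll, total, total > threshold

roll, total, succeeded = simulate_action(num_sides=6, modifier=2, threshold=5)
print(roll, total, succeeded)
```

Running the cell again gives a new roll each time, which is what makes the simulation random.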

In [2]:
possible_rolls = ...
roll_result = ...
modified_result = ...
action_succeeded = ...

# The next line just prints out your results in a nice way
# once you're done.  You can delete it if you want.
print("On a modified roll of {:d}, Alice's action {}.".format(modified_result, "succeeded" if action_succeeded else "failed"))

In [None]:
grader.check("q11")

**Question 1.2** Run your cell 7 times. What fraction of times did Alice succeed at this action? Your answer should be a decimal number between 0 and 1.

In [4]:
rough_success_chance = ...
rough_success_chance

In [None]:
grader.check("q12")

Suppose we don't know that Roga has a modifier of 7 for this action.  Instead, we observe the modified roll (that is, the die roll plus the modifier of 7) from each of 7 of her attempts to knock down doors.  We would like to estimate her modifier from these 7 numbers.

**Question 1.3** Write a Python function called `simulate_observations`.  It should take no arguments, and it should return an array of 7 numbers.  Each of the numbers should be the modified roll from one simulation.  **Then**, call your function once to compute an array of 7 simulated modified rolls.  Name that array `observations`.

In [6]:
modifier = 7
num_observations = 7

def simulate_observations():
    """Produces an array of 7 simulated modified die rolls"""
    ...
    
observations = ...
observations

In [None]:
grader.check("q13")

**Question 1.4** Draw a histogram to display the *probability distribution* of the modified rolls we might see. 

In [9]:
# We suggest using these bins.
roll_bins = np.arange(1, modifier+2+18, 1)

In [10]:
#- place your code here
...

Now let's imagine we don't know the modifier and try to estimate it from `observations`.

One straightforward way to do so is to find the smallest modified roll overall. The smallest number on an 18-sided die is 1, so if we see that the modified roll was 1, we know that the player's modifier must be zero. If we see that the modified roll is something larger -- say, 12 -- we can't say for certain what the player's modifier is, but we'll guess that the player rolled a 1 and that their modifier is 11. This works because, if we see enough modified rolls, one of them will have occurred when the player rolled a 1.
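To make the reasoning concrete, here is a small sketch on a made-up array of modified rolls (these numbers are invented, not taken from `observations`). Note that this kind of estimate can be too large, if no roll of 1 was observed, but never too small.

```python
import numpy as np

# Hypothetical modified rolls, made up for illustration.
example_rolls = np.array([12, 19, 15, 27, 12, 21])

# If the smallest observed total came from a die roll of 1,
# then the modifier is that total minus 1.
example_min_estimate = example_rolls.min() - 1
print(example_min_estimate)   # prints 11
```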

**Question 1.5** Using this method, estimate `modifier` from `observations` and name that estimate `min_estimate`.

In [11]:
min_estimate = ...
min_estimate

In [None]:
grader.check("q15")

Another way to estimate the modifier involves the mean of `observations`. If a player's modifier is zero, then the mean of a large number of their modified rolls will be close to the mean of 1, 2, ..., 18, which is 9.5. If their modifier is $m$, then the mean of their modified rolls will be close to the mean of $1 + m$, $2 + m$, ..., $18 + m$, which is $9.5 + m$.
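The arithmetic above can be checked numerically. The sketch below uses a hypothetical modifier of 4 (an arbitrary value chosen for illustration) to show that shifting every face of the die shifts the mean by the same amount.

```python
import numpy as np

die_faces = np.arange(1, 19)          # faces of an 18-sided die
print(die_faces.mean())               # 9.5

# Shifting every face by the same amount m shifts the mean by m.
m = 4                                 # hypothetical modifier
print((die_faces + m).mean())         # 13.5

# So subtracting 9.5 from the mean of the modified rolls recovers m.
print((die_faces + m).mean() - 9.5)   # 4.0
```

With real observations the mean is only *close* to $9.5 + m$, so the recovered value is an estimate rather than an exact answer.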

**Question 1.6** Write a function named `mean_based_estimator` that computes your estimate using this method.  It should take an array of modified rolls (like the array `observations`) as its argument and return an estimate of `modifier` based on those numbers.

In [13]:
def mean_based_estimator(nums):
    """Estimate the roll modifier based on observed modified rolls in the array nums."""
    ...

# Here is an example call to your function.  It computes an estimate
# of the modifier from our 7 observations.
mean_based_estimate = mean_based_estimator(observations)
mean_based_estimate

In [None]:
grader.check("q16")

## 2. Sampling

<span style="color:red"><b>Note!</b></span> To make this assignment a little shorter and easier, the tests for this problem are *correctness* tests -- if these pass, you'll get full credit for the problem.


We'll use some NBA data to get some practice with sampling.
Run the cell below to load the player and salary data.

In [17]:
player_data = bpd.read_csv("data/player_data.csv").set_index('Name')
salary_data = bpd.read_csv("data/salary_data.csv").set_index('PlayerName')
full_data = salary_data.merge(player_data, left_index=True, right_index=True)

player_data

In [18]:
salary_data

In [19]:
full_data

Rather than getting data on every player, imagine that we had gotten data on only a smaller subset of the players.  For 492 players, it's not so unreasonable to expect to see all the data, but usually we aren't so lucky.  Instead, we often make *statistical inferences* about a large underlying population using a smaller sample.

A **statistical inference** is a statement about some statistic of the underlying population, such as "the average salary of NBA players in 2014 was $3 million".  You may have heard the word "inference" used in other contexts.  It's important to keep in mind that statistical inferences can be wrong.

A general strategy for inference using samples is to estimate statistics of the population by computing the same statistics on a sample.  This strategy sometimes works well and sometimes doesn't.  The degree to which it gives us useful answers depends on several factors, and we'll touch lightly on a few of those today.
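Here is a minimal sketch of that strategy on synthetic data. The "population" below is randomly generated and is **not** the NBA salary data; the sizes 492 and 44 are borrowed from this problem only to make the scale familiar.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so the sketch is repeatable

# Synthetic "population" of 492 values; these are NOT the NBA salaries.
population = rng.exponential(scale=3.0, size=492)

# Estimate the population mean from a random sample of 44 values.
sample = rng.choice(population, size=44, replace=False)

print("population mean:", population.mean())
print("sample mean:    ", sample.mean())
```

The two printed means will usually be close but not equal, which is exactly the gap that statistical inference has to account for.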

One very important factor in the utility of samples is how they were gathered.  We have prepared some example sample datasets to simulate inference from different kinds of samples for the NBA player dataset.  Later we'll ask you to create your own samples to see how they behave.

To save typing and increase the clarity of your code, we will package the loading and analysis code into two functions. This will be useful in the rest of the homework as we will repeatedly need to create histograms and collect summary statistics from that data.

**Question 2.1**. Complete the `histograms` function, which takes a table with columns `Points` and `Salary` and draws a histogram for each one. Use the min and max functions to pick the bin boundaries so that all data appears for any table passed to your function. Use the same bin widths as before (100 points for `Points` and $1,000,000 for `Salary`). 

*Hint:* Make sure that your bins **include** the maximum value.  Remember that bins include the left value but exclude the right value.
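One possible way to build bin edges that are guaranteed to cover the maximum is sketched below on made-up data. The helper name `covering_bins` is hypothetical, and this is only one of several reasonable approaches.

```python
import numpy as np

def covering_bins(values, width):
    """Return bin edges of the given width whose last edge exceeds max(values).

    Hypothetical helper for illustration only.
    """
    # Stopping two widths past the max guarantees an edge strictly above it,
    # at the cost of at most one extra (possibly empty) bin.
    return np.arange(min(values), max(values) + 2 * width, width)

example_values = np.array([0, 250, 1000])   # made-up data
edges = covering_bins(example_values, 100)
counts, _ = np.histogram(example_values, bins=edges)
print(edges[-1], counts.sum())
```

Because the last edge is strictly larger than the largest value, every data point lands in some bin, so the counts add up to the number of values.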

In [20]:
def histograms(t):
    points = t.get('Points').values
    salaries = t.get('Salary').values
    points_bins = ...
    salary_bins = ...
    
    a = plt.figure(1)
    plt.hist(points, bins=points_bins, density=True, alpha=.75)
    plt.title('Distribution of Points')
    s = plt.figure(2)
    plt.hist(salaries, bins=salary_bins, density=True, alpha=.75)
    plt.title('Distribution of Salaries')
    return points_bins, salary_bins # Keep this statement so that your work can be checked
    
histograms(full_data)
print('Two histograms should be displayed below')

**Question 2.2**. Create a function called `compute_statistics` that takes a DataFrame containing points and salaries and:
- Draws a histogram of points
- Draws a histogram of salaries
- Returns a two-element array containing the average points and average salary

You can call your `histograms` function to draw the histograms!

In [21]:
def compute_statistics(point_and_salary_data):
    ...
    points = ...
    salary = ...
    ...
    

full_stats = compute_statistics(full_data)
full_stats

### Convenience sampling
One sampling methodology, which is **generally a bad idea**, is to choose players who are somehow convenient to sample.  For example, you might choose players from one team that's near your house, since it's easier to survey them.  This is called, somewhat pejoratively, *convenience sampling*.

Suppose you survey only players whose points are **smaller than 50**.  (The more experienced players didn't bother to answer your surveys about their salaries.)

**Question 2.3**  Assign `convenience_sample` to a subset of `full_data` that contains only the rows for players having points smaller than 50.

In [22]:
convenience_sample = ...
convenience_sample

In [None]:
grader.check("q23")

**Question 2.4** Assign `convenience_stats` to an array of the average points and average salary of your convenience sample, using the `compute_statistics` function.  Since they're computed on a sample, these are called *sample averages*. 

In [25]:
%%capture

convenience_stats = ...

In [26]:
convenience_stats

Next, we'll compare the convenience sample salaries with the full data salaries.

In [27]:
# just run this cell, don't change it
def compare_salaries(first, second, first_title, second_title):
    """Compare the salaries in two tables."""
    bins = np.arange(0, 25_000_000, 1_000_000)
    first.plot(kind='hist', y='Salary', bins=bins, density=True)
    plt.title(first_title)
    second.plot(kind='hist', y='Salary', bins=bins, density=True)
    plt.title(second_title)

compare_salaries(full_data, convenience_sample, 'All Players', 'Convenience Sample')

**Question 2.5** From what you see in the histogram above, does the convenience sample give us an accurate picture of the points and salary of the full population of NBA players in 2014-2015?  Would you expect it to, in general?  Assign either 1, 2, 3, or 4 to the variable `sampling_q5` below. 
1. Yes. The sample is large enough, so it is an accurate representation of the population.
2. No. The sample is too small, so it won't give us an accurate representation of the population.
3. No. But this was just an unlucky sample, normally this would give us an accurate representation of the population.
4. No. This type of sample doesn't give us an accurate representation of the population.

In [28]:
sampling_q5 = ...
sampling_q5

In [None]:
grader.check("q25")

### Simple random sampling
A more principled approach is to sample uniformly at random from the players.  If we ensure that each player is selected at most once, this is a *simple random sample without replacement*, sometimes abbreviated to "simple random sample" or "SRSWOR".  Imagine writing down each player's name on a card, putting the cards in a hat, and shuffling the hat.  Then, pull out cards one by one and set them aside, stopping when the specified *sample size* is reached.
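The "cards in a hat" picture corresponds to drawing with `replace=False`. The sketch below uses ten made-up single-letter names to contrast it with drawing with replacement.

```python
import numpy as np

# Hypothetical "cards in a hat": ten made-up player names.
names = np.array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])

# replace=False means a drawn card is set aside, so no name can repeat.
srswor = np.random.choice(names, 4, replace=False)

# replace=True puts each card back, so the same name may appear twice.
with_replacement = np.random.choice(names, 4, replace=True)

print(srswor)
print(with_replacement)
```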

We've produced two samples of the `salary_data` table in this way: `small_srswor_salary.csv` and `large_srswor_salary.csv` contain, respectively, a sample of size 44 (the same as the convenience sample) and a larger sample of size 100.  

The `load_data` function below loads a salary table and joins it with `player_data`.

In [34]:
def load_data(salary_file):
    return player_data.merge(bpd.read_csv(salary_file), left_index=True, right_on='PlayerName')

**Question 2.6** Run the same analyses on the small and large samples that you previously ran on the full dataset and on the convenience sample.  Compare the accuracy of the estimates of the population statistics that we get from the small simple random sample, the large simple random sample, and the convenience sample. 

**Note:** `small_srswor_data` and `large_srswor_data` should be DataFrames loaded from their respective `data/small_srswor_salary.csv` and `data/large_srswor_salary.csv`

In [35]:
small_srswor_data = ...
small_stats = ...
large_srswor_data = ...
large_stats = ...
convenience_stats = ...

plt.figure(1).legend(['Small SRSWOR', 'Large SRSWOR', 'Convenience'])
plt.figure(2).legend(['Small SRSWOR', 'Large SRSWOR', 'Convenience'])
print('Full data stats:                 ', full_stats)
print('Small simple random sample stats:', small_stats)
print('Large simple random sample stats:', large_stats)
print('Convenience sample stats:        ', convenience_stats)

In [None]:
grader.check("q26")

### Producing simple random samples
Often it's useful to take random samples even when we have a larger dataset available.  The randomized response technique was one example we saw in lecture with United flight delays.  Another is to help us understand how inaccurate other samples are.

DataFrames provide the method `sample()` for producing random samples.  Note that its default is to sample **without** replacement. To see how to call `sample()`, enter `full_data.sample?` into a code cell and run it.

In [40]:
full_data.sample?

**Question 2.7** Produce a simple random sample of size 44 from `full_data` *without replacement*.  (You don't need to bother with a merge this time -- just use `full_data.sample(...)` directly.  That will have the same result as sampling from `salary_data` and joining with `player_data`.)  Run your analysis on it again.

In [41]:
my_small_srswor_data = ...
my_small_stats = ...
my_small_stats

Are your results similar to those in the small sample we provided you? Do things change a lot across separate samples? Run your code several times to get new samples. Assign either 1, 2, 3, or 4 to the variable `sampling_q7` below.

*Hint:* Compare the results with the result you got from the small sample we provided you, as well as the large sample and the convenience sample so that you will have a better insight on the extent of difference.


1. The results are very different from the small sample, and don't change at all across separate samples.
2. The results are very different from the small sample, and change a bit across separate samples.
3. The results are slightly different from the small sample, and change a bit across separate samples.
4. The results are not at all different from the small sample, and don't change at all across separate samples.

In [42]:
sampling_q7 = ...
sampling_q7

In [None]:
grader.check("q27")

**Question 2.8** As in the previous question, analyze several simple random samples of size 100 from `full_data`.

In [48]:
my_large_srswor_data = ...
my_large_stats = ...
my_large_stats

Do the average and histogram statistics seem to change more or less across samples of this size than across samples of size 44?  And are the sample averages and histograms closer to their true values for points or for salary?  Assign either 1, 2, 3, 4, or 5 to the variable `sampling_q8` below. 

Is this what you expected to see?
1. The statistics change *less* across samples of this size than across smaller samples. The statistics are closer to their true values for *points* than they are for *salary*.
2. The statistics change *less* across samples of this size than across smaller samples. The statistics are closer to their true values for *salary* than they are for *points*.
3. The statistics change *more* across samples of this size than across smaller samples. The statistics are closer to their true values for *points* than they are for *salary*.
4. The statistics change *more* across samples of this size than across smaller samples. The statistics are closer to their true values for *salary* than they are for *points*.
5. The statistics change an *equal amount* across samples of this size as across smaller samples. The statistics for points and salary are *equally close* to their true values.

In [49]:
sampling_q8 = ...
sampling_q8

In [None]:
grader.check("q28")

## 3. Jupyter Notebook Cells

<span style="color:red"><b>Note!</b></span> 
**Unlike the first two problems, this is a regular homework problem -- if the test passes, your answer may still not be correct!**

**Question 3.1.** Suppose you found a super long Jupyter notebook file with 1000 cells. Some of the cells are Code cells, and the others are Markdown cells. The file `cells.csv` contains 1000 rows, with each row representing the type of a cell in the Jupyter notebook. Read `cells.csv` into a table called `cell_table`.

In [55]:
cell_table = ...
cell_table

In [None]:
grader.check("q3_1")

**Question 3.2.** You're interested in the proportion of Markdown cells in the file. Calculate the true proportion of Markdown cells and store it in the variable `md_true_prop`.

In [58]:
md_true_prop = ...
md_true_prop

In [None]:
grader.check("q3_2")

**Question 3.3.** Suppose you are only able to randomly sample 300 different cells. Which of the following would create a representative sample of the cells in the file? Assign 1, 2, 3, or 4 to `q3_3`.

1. `cell_table.take(np.arange(300))`
2. `cell_table.iloc[0:300]`
3. `cell_table.sample(300, replace=False)`
4. `cell_table[cell_table.get('Cell Type') == 'Markdown']`

In [62]:
q3_3 = ...

In [None]:
grader.check("q3_3")

**Question 3.4.** You decide to pick 300 different cells using the sampling method you chose in question 3.3 above. Write a function called `pick_300_cells` that simulates this. Specifically, the function should take *no* arguments and should return a table of the types of 300 cells.

In [66]:
def pick_300_cells():
    """Randomly select 300 different cells from cell_table."""
    ...
pick_300_cells()

In [None]:
grader.check("q3_4")

**Question 3.5.** You are interested in knowing the true proportion of Markdown cells of all the cells in the file, but suppose you can only look through 300 cells at a time. Hence, you simulate this experiment in 400 trials. For each trial, you decide to calculate the proportion of Markdown cells. Simulate the experiment and store the *array* of proportions in the variable `md_empirical_props`.

*Note*: Your proportions should be decimals between 0 and 1. Feel free to use functions and create new cells if necessary, but be sure to store the *array* of proportions in `md_empirical_props`.
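The general shape of such a simulation (repeat a trial many times and collect one statistic per trial in an array) is sketched below. The population here is a made-up array with 350 Markdown labels out of 1000, chosen arbitrarily; the real proportions come from `cells.csv`, and the variable names are illustrative.

```python
import numpy as np

# Made-up population of 1000 labels; the real proportions come from cells.csv.
cell_types = np.array(['Markdown'] * 350 + ['Code'] * 650)

num_trials = 400
sample_size = 300
proportions = np.array([])

for _ in np.arange(num_trials):
    # One trial: draw 300 distinct cells and record the Markdown share.
    one_sample = np.random.choice(cell_types, sample_size, replace=False)
    md_share = np.count_nonzero(one_sample == 'Markdown') / sample_size
    proportions = np.append(proportions, md_share)

print(proportions[:5])
```

Each pass through the loop appends one proportion, so the final array has one entry per trial.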

In [70]:
md_empirical_props = ...
md_empirical_props[:10] # only display the first 10 simulation results for convenience

In [None]:
grader.check("q3_5")

**Question 3.6.** You are wondering what the proportion of Code cells would be in each of the 400 trials. Store the *array* of the proportions of Code cells for each of the 400 trials from Question 3.5 in `code_empirical_props`.

*Note*: You **should not** run another simulation. Think about which operation you can use on `md_empirical_props` to find the corresponding proportions of Code cells, since you know that there are only Code or Markdown cells in `cell_table`.
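When there are only two categories, the two proportions in any one trial must add up to 1. A small sketch with invented proportions shows how a single elementwise operation converts one array into the other.

```python
import numpy as np

# Hypothetical Markdown proportions from three trials.
example_md_props = np.array([0.25, 0.5, 0.375])

# With only two cell types, the two proportions in each trial sum to 1,
# so one elementwise subtraction gives the Code proportions.
example_code_props = 1 - example_md_props
print(example_code_props)   # [0.75  0.5   0.625]
```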

In [74]:
code_empirical_props = ...
code_empirical_props[:10] # only display the first 10 simulation results for convenience

In [None]:
grader.check("q3_6")

**Question 3.7.** Now, compute the average of `md_empirical_props`. You claim that this average is a good estimate of the proportion of Markdown cells. Store your average in `md_claim_prop`.

In [78]:
md_claim_prop = ...
md_claim_prop

In [None]:
grader.check("q3_7")

**Question 3.8.**  How far away is your claim from the true proportion of Markdown cells? Compute the absolute difference between the two and store it in the variable `error`. Remember that you calculated the true proportion of Markdown cells in Question 3.2.

In [82]:
error = ...
error

In [None]:
grader.check("q3_8")

**Question 3.9.** When you ran your simulation 400 times, you got 400 different estimates for the proportion of Markdown cells. Plot the distribution of these estimates as a histogram.

## 4. Powerball

<span style="color:red"><b>Note!</b></span> 
Unlike the first two problems, this is a regular homework problem -- if the test passes, your answer may still not be correct!

You go to the nearest supermarket (or the gas station if you prefer) and buy a Powerball lottery ticket.

You pick five different numbers, one at a time, from 1 to 69. Then you separately pick a number from 1 to 26. These are your numbers, for example (59, 12, 53, 20, 3, 25).

The winning numbers are chosen by somebody drawing five balls, one at a time, from a collection of white balls numbered 1 to 69. Then they draw a red ball (the powerball) from a collection of red balls numbered 1 to 26.

We’ll assume for this problem that in order to win the biggest prize (the jackpot), all your numbers need to match the winning numbers and be in the exact same order. However, you can still win some money if you have some numbers that match the winning numbers and appear in the same position as the corresponding winning number.

**Question 4.1.** What is the probability that you win the jackpot? Calculate your answer by hand and assign it to `jackpot_chance`. It should be a decimal number between 0 and 1.

Hint: Since you are choosing five different numbers for the white balls, the denominator should be decreasing. The probability of getting the first number correct is 1/69, the second is 1/68, and so on...
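To see the shrinking denominators in action without giving away the answer, consider a much smaller toy lottery (invented for illustration, not Powerball): three ordered white balls drawn from 1 to 10, then one red ball drawn from 1 to 5.

```python
from fractions import Fraction

# Toy lottery (NOT Powerball): 3 ordered white balls from 1-10, then 1 red ball from 1-5.
# Each white ball drawn leaves one fewer ball, so the denominators shrink.
p_white = Fraction(1, 10) * Fraction(1, 9) * Fraction(1, 8)
p_red = Fraction(1, 5)

toy_jackpot_chance = p_white * p_red
print(toy_jackpot_chance, float(toy_jackpot_chance))
```

`Fraction` keeps the result exact (here 1/3600); converting with `float` gives the decimal form the question asks for.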

In [87]:
jackpot_chance = ...
jackpot_chance

In [None]:
grader.check("q4_1")

**Question 4.2.** Your chance of winning the jackpot is quite low, but you can still win some money if you have at least one number correct, in the same position as the winning number. What is the probability that you get at least one number correct and win some money? Assign your answer to `non_losing_prob`.

*Hint:* The probability of having at least one number correct is equivalent to $$(1 - (\textrm{probability of missing the first number} \times \textrm{probability of missing the second number} \times \textrm{...}))$$
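The complement rule in the hint can be sketched on a toy lottery (invented for illustration, not Powerball): three ordered white balls from 1 to 10, then one red ball from 1 to 5. This sketch assumes that the chance of missing each white ball uses a denominator that shrinks by one per draw, mirroring the Question 4.1 hint; that is a simplification that ignores the dependence between positions.

```python
from fractions import Fraction

# Toy lottery (NOT Powerball): 3 ordered white balls from 1-10, then 1 red ball from 1-5.
# Simplifying assumption: multiply the chances of missing each position in turn,
# with the white-ball denominator shrinking by one per draw.
p_miss_all = (Fraction(9, 10) * Fraction(8, 9) * Fraction(7, 8)   # white balls
              * Fraction(4, 5))                                   # red ball

toy_non_losing_prob = 1 - p_miss_all
print(toy_non_losing_prob, float(toy_non_losing_prob))
```

Under that assumption the toy lottery gives 11/25, that is, 0.44.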

In [90]:
non_losing_prob = ...
non_losing_prob

In [None]:
grader.check("q4_2")

**Question 4.3.** Write a function called `simulate_one_ticket`. It should take no arguments, and it should return an array with 6 random numbers. The first five numbers should all be randomly chosen (without replacement) from between 1 and 69. The last number should be between 1 and 26.
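The two-step draw described above (several distinct numbers from one range, then one number from another) might look like the sketch below. The helper `draw_numbers` and its parameter names are hypothetical, and the example call uses toy sizes rather than the Powerball ones.

```python
import numpy as np

def draw_numbers(num_white, white_max, red_max):
    """Draw `num_white` distinct numbers from 1..white_max, then one from 1..red_max.

    Hypothetical helper with made-up parameter names, for illustration only.
    """
    white = np.random.choice(np.arange(1, white_max + 1), num_white, replace=False)
    red = np.random.choice(np.arange(1, red_max + 1))
    return np.append(white, red)

toy_ticket = draw_numbers(num_white=3, white_max=10, red_max=5)
print(toy_ticket)
```

`replace=False` is what keeps the white numbers distinct, and `np.append` joins the two draws into a single array.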

In [93]:
def simulate_one_ticket():
    """Simulate one ticket that you buy."""
    ...

In [None]:
grader.check("q4_3")

**Question 4.4.** It's draw day, and you checked the lucky numbers posted, which happened to be (29, 40, 46, 63, 58, 2). Suppose you didn't win the jackpot, and you are quite unhappy. You want to remind yourself how unlikely it is to win a jackpot. Call the function `simulate_one_ticket` 100,000 times (this would cost at least $500,000 if you were to buy that many!). How many times did you win the jackpot? Assign your answer to `count_jackpot`.

Hint: Try it first with only buying 10 tickets. Once you are sure you have that figured out, change it to 100,000 tickets. It will take a little while (about a minute) for Python to perform the calculations when you are buying 100,000 tickets.

Hint 2: You'll have to count how many of the numbers you chose match the numbers that were drawn. One way to do this involves np.count_nonzero().
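As a small sketch of the counting idea in Hint 2, the snippet below compares one made-up ticket against the winning numbers from Question 4.4, position by position.

```python
import numpy as np

winning = np.array([29, 40, 46, 63, 58, 2])   # numbers from Question 4.4
ticket = np.array([29, 40, 1, 63, 2, 2])      # made-up ticket

# Elementwise == compares position by position, giving an array of booleans;
# np.count_nonzero counts the True entries.
num_matches = np.count_nonzero(ticket == winning)
print(num_matches)   # prints 4
```

The 2 in the fifth position of the made-up ticket does not count, because the winning number in that position is 58.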

In [98]:
count_jackpot = 0
...
count_jackpot

In [None]:
grader.check("q4_4")

**Question 4.5.** Suppose you can win a smaller prize if you match 1-5 numbers on the ticket. Simulate 100,000 tickets and find the greatest prize you would win. In other words, find the maximum number of matches between any one of your tickets and the winning numbers, and assign this to `wins`.

In [101]:
...

wins = ...
wins

In [None]:
grader.check("q4_5")

Suppose one draw costs $5.

The lottery advertises that you will never lose, thanks to the following prize scheme:

- Win $15 with a 1-number match

- Win $100 with a 2-number match

- Win $1,000 with a 3-number match

- Win $10,000 with a 4-number match

- Win $100,000 with a 5-number match

- Win $1,000,000 for the jackpot
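One way to turn a prize scheme like this into code is a lookup array indexed by the number of matches. The sketch below assumes that zero matches pay nothing and that six matches is the jackpot; the match counts are made up for five imaginary tickets.

```python
import numpy as np

ticket_cost = 5
# Prize for 0, 1, ..., 6 matching positions (6 matches is the jackpot).
# Assumption: a ticket with zero matches wins nothing.
prizes = np.array([0, 15, 100, 1_000, 10_000, 100_000, 1_000_000])

# Made-up match counts for five tickets, for illustration only.
example_matches = np.array([0, 1, 0, 2, 0])

total_winnings = prizes[example_matches].sum()   # index the table by match count
net = total_winnings - ticket_cost * len(example_matches)
print(total_winnings, net)   # prints 115 90
```

Comparing total winnings against total ticket cost is what tells you whether you actually came out ahead.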

**Question 4.6.** If you had the money to buy 100,000 tickets, how much money are you likely to win? Is it true that you won’t be losing money? Assign the amount to `winning_money`.

In [104]:
winning_money = ...
winning_money

In [None]:
grader.check("q4_6")

# Finish Line

To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.

In [107]:
grader.check_all()