In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw05.ipynb")

# Homework 05: Probability, Simulation, Estimation, and Assessing Models

**Reading**: 
* [Randomness](https://www.inferentialthinking.com/chapters/09/randomness.html) 
* [Sampling and Empirical Distributions](https://www.inferentialthinking.com/chapters/10/sampling-and-empirical-distributions.html)
* [Testing Hypotheses](https://www.inferentialthinking.com/chapters/11/testing-hypotheses.html)

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load the provided tests. Each time you start your server, you will need to execute this cell again to load the tests.

This assignment is due by the deadline listed in Canvas/Gradescope. Start early so that you can come to office hours if you're stuck. Check the course website for the office hours schedule. Late work will not be accepted as per the course expectations.

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Refer to the course expectations document to learn more about how to learn cooperatively.

For all problems that require written explanations and sentences, you **must** provide your answer in the designated space. Moreover, throughout this homework and all future ones, please do not reassign variables! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on.

In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## 1. Probability


We will be testing some probability concepts that were introduced in class. For all of the following problems, we will introduce a problem statement and give you a proposed answer. You must assign the provided variable to one of the following three integers, depending on whether the proposed answer is too low, too high, or correct. 

1. Assign the variable to 1 if you believe our proposed answer is too high.
2. Assign the variable to 2 if you believe our proposed answer is too low.
3. Assign the variable to 3 if you believe our proposed answer is correct.


You are more than welcome to create more cells across this notebook to use for arithmetic operations.
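A scratch cell for this kind of comparison might look like the minimal sketch below. It uses an unrelated, made-up practice problem (the chance of three heads in three flips of a fair coin, against a hypothetical proposed answer of 1/4). The names `true_chance`, `proposed`, and `coin_answer` are illustrative only and are not graded variables.

```python
from fractions import Fraction

# Made-up practice problem: chance of 3 heads in 3 flips of a fair coin.
true_chance = Fraction(1, 2) ** 3   # exact value, 1/8
proposed = Fraction(1, 4)           # hypothetical proposed answer

# Encode the comparison using the 1 / 2 / 3 convention described above.
if proposed > true_chance:
    coin_answer = 1   # proposed answer is too high
elif proposed < true_chance:
    coin_answer = 2   # proposed answer is too low
else:
    coin_answer = 3   # proposed answer is correct

print(true_chance, proposed, coin_answer)
```

Working with `Fraction` keeps the arithmetic exact, which avoids rounding surprises when two chances are close together.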

### Question 1.1.

You roll a 6-sided die 10 times. What is the chance of getting 10 sixes?

Our proposed answer: $$\left(\frac{1}{6}\right)^{10}$$

Assign `ten_sixes` to either 1, 2, or 3 depending on if you think our answer is too high, too low, or correct. 

<!--
BEGIN QUESTION
name: q1_1
points:
  - 0
  - 1
-->

In [None]:
ten_sixes = ...
ten_sixes

In [None]:
grader.check("q1_1")

### Question 1.2.

Take the same problem set-up as before, rolling a fair die 10 times. What is the chance that every roll is less than or equal to 5?

Our proposed answer: $$1 - \left(\frac{1}{6}\right)^{10}$$

Assign `five_or_less` to either 1, 2, or 3. 

<!--
BEGIN QUESTION
name: q1_2
points:
  - 0
  - 1
-->

In [None]:
five_or_less = ...
five_or_less

In [None]:
grader.check("q1_2")

### Question 1.3.

Assume we are picking a lottery ticket. We must choose three distinct numbers from 1 to 1000 and write them on a ticket. Next, someone draws three numbers one by one from a bowl containing the numbers 1 to 1000, without putting any drawn number back in. We win if our numbers are all called in order.

If we decide to play the game and pick our numbers as 12, 140, and 890, what is the chance that we win? 

Our proposed answer: $$\left(\frac{3}{1000}\right)^3$$

Assign `lottery` to either 1, 2, or 3. 

<!--
BEGIN QUESTION
name: q1_3
points:
  - 0
  - 1
-->

In [None]:
lottery = ...

In [None]:
grader.check("q1_3")

### Question 1.4.

Assume we have two lists, list A and list B. List A contains the numbers [20,10,30], while list B contains the numbers [10,30,20,40,30]. We choose one number from list A randomly and one number from list B randomly. What is the chance that the number we drew from list A is larger than or equal to the number we drew from list B?

Our proposed solution: $$1/5$$

Assign `list_chances` to either 1, 2, or 3. 

*Hint: Consider the different possible ways that the items in List A can be greater than or equal to items in List B. Try working out your thoughts with a pencil and paper. What do you think the correct solution will be close to?*

<!--
BEGIN QUESTION
name: q1_4
points:
  - 0
  - 1
-->

In [None]:
list_chances = ...

In [None]:
grader.check("q1_4")

## 2. Monkeys Typing Shakespeare

A monkey is banging repeatedly on the keys of a typewriter. Each time, the monkey is equally likely to hit any of the 26 lowercase letters of the English alphabet, the 26 uppercase letters of the English alphabet, or the digits 0-9 (inclusive), regardless of what it has hit before. There are no other keys on the keyboard.

This question is inspired by a mathematical theorem called the Infinite monkey theorem (<https://en.wikipedia.org/wiki/Infinite_monkey_theorem>), which postulates that if you put a monkey in the situation described above for an infinite time, they will eventually type out all of Shakespeare’s works.

### Question 2.1.

Suppose the monkey hits the keyboard 6 times.  Compute the probability that the monkey types the sequence of characters `ma4110`.  (Call this `ma4110_chance`.) Determine the theoretical probability (without simulation) and assign it as either a fraction or an arithmetic expression. For example, your assignment statement could be formatted like `ma4110_chance = 1/10000` or `ma4110_chance = (1/100) ** 2` and still be graded correctly.

<!--
BEGIN QUESTION
name: q2_1
-->

In [None]:
ma4110_chance = ...
ma4110_chance

In [None]:
grader.check("q2_1")

### Question 2.2.

Write a function called `simulate_key_strike`.  It should take **no arguments**, and it should return a random one-character string that is equally likely to be any of the 26 lower-case English letters, 26 upper-case English letters, or any number between 0-9 (inclusive). The provided code below will create a list called `keys` that contains all the lower-case English letters, upper-case English letters, and the digits 0-9 (inclusive).

<!--
BEGIN QUESTION
name: q2_2
-->

In [None]:
# Provided code, do not change
import string
keys = list(string.ascii_lowercase + string.ascii_uppercase + string.digits)

def simulate_key_strike():
    """Simulates one random key strike."""
    # Your code goes below this line
    ...

# An example call to your function:
simulate_key_strike()

In [None]:
grader.check("q2_2")

### Question 2.3.

Write a function called `simulate_several_key_strikes`.  It should take one argument: an integer specifying the number of key strikes to simulate. It should return a string containing that many characters, each one obtained from simulating a key strike by the monkey.

**Hint:** If you make a list or array of the simulated key strikes called `key_strikes_array`, you can convert that to a string by calling `"".join(key_strikes_array)`
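As a minimal sketch of how `"".join(...)` behaves, the snippet below joins a made-up list of one-character strings. The list `letters` is hypothetical and is not output from the monkey simulation.

```python
# Hypothetical list of one-character strings (not the monkey's output).
letters = ["d", "a", "t", "a", "8"]

# "".join glues the items together with nothing between them.
word = "".join(letters)
print(word)

# The separator can be any string; here it is a dash.
dashed = "-".join(letters)
print(dashed)
```

The string before `.join` is the separator, so an empty string produces one unbroken run of characters.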

<!--
BEGIN QUESTION
name: q2_3
-->

In [None]:
def simulate_several_key_strikes(num_strikes):
    ...

# An example call to your function:
simulate_several_key_strikes(11)

In [None]:
grader.check("q2_3")

### Question 2.4.

Call `simulate_several_key_strikes` 5000 times, each time simulating the monkey striking 6 keys.  Compute the proportion of times the monkey types `"ma4110"`, calling that proportion `ma4110_proportion`.

<!--
BEGIN QUESTION
name: q2_4
manual: false
-->

In [None]:
...
ma4110_proportion

In [None]:
grader.check("q2_4")

<!-- BEGIN QUESTION -->

### Question 2.5.

Check the value your simulation computed for `ma4110_proportion`.  Is your simulation a good way to estimate the chance that the monkey types `"ma4110"` in 6 strikes (the answer to question 2.1)?  Why or why not? Think about the theoretical probability of this event occurring and the number of simulations you ran when writing your response.

<!--
BEGIN QUESTION
name: q2_5
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Question 2.6.

Compute the chance that the monkey types the letter `"m"` at least once in the 6 strikes.  Call it `m_chance`. Provide your answer as an expression Python can evaluate to calculate the final probability. (For example, you should put `(1/6)**3` instead of `0.00462962962963`.)

<!--
BEGIN QUESTION
name: q2_6
-->

In [None]:
m_chance = ...
m_chance

In [None]:
grader.check("q2_6")

<!-- BEGIN QUESTION -->

### Question 2.7.

Do you think that a computer simulation with 5000 trials would be more or less effective to estimate `m_chance` when compared to when we tried to estimate `ma4110_chance` this way? Why or why not? What specific criteria did you consider when making your decision?

<!--
BEGIN QUESTION
name: q2_7
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## 3. Sampling Basketball Players


This exercise uses salary data and game statistics for basketball players from the 2019-2020 NBA season. The data was collected from [Basketball-Reference](http://www.basketball-reference.com).

Run the next cell to load the two datasets.

In [None]:
player_data = Table.read_table('player_data.csv')
salary_data = Table.read_table('salary_data.csv')
player_data.show(3)
salary_data.show(3)

### Question 3.1.

We would like to relate players' game statistics to their salaries.  Compute a table called `full_data` that includes one row for each player who is listed in both `player_data` and `salary_data`.  It should include all the columns from `player_data` and `salary_data`, except the `"Name"` column.

*Hint: A `.join` operation would be helpful here to combine your tables!*
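To illustrate the idea of keeping "one row for each player who is listed in both" tables, here is a plain-Python sketch of that matching logic. It uses made-up names and numbers rather than the contents of `player_data.csv` or `salary_data.csv`, and it deliberately uses dictionaries instead of the `datascience` table methods.

```python
# Made-up rows; these are not taken from player_data.csv or salary_data.csv.
stats = {"Player A": 10.5, "Player B": 7.2, "Player C": 3.1}   # name -> points
salaries = {"Player B": 2_000_000, "Player C": 900_000, "Player D": 5_000_000}  # name -> salary

# Keep only the names that appear in both collections, pairing their values.
joined = {name: (stats[name], salaries[name]) for name in stats if name in salaries}
print(joined)
```

Here "Player A" and "Player D" drop out because each appears in only one of the two collections.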

<!--
BEGIN QUESTION
name: q3_1
-->

In [None]:
full_data = ...
full_data

In [None]:
grader.check("q3_1")

Basketball team managers would like to hire players who perform well but don't command high salaries.  From this perspective, a very crude measure of a player's *value* to their team is the number of points the player scored in a season from 3-pointers and free throws for every **\$100000 of salary** (*Note*: the `Salary` column is in dollars, not hundreds of thousands of dollars). For example, Al Horford scored an average of 5.2 points from 3-pointers and free throws combined, and has a salary of **\$28 million.** This is equivalent to 280 hundreds of thousands of dollars, so his value is $\frac{5.2}{280}$.

The formula used to make this calculation is:

$$\frac{\text{"PTS"} - 2 \times \text{"2P"}}{\text{"Salary"}\ / \ 100000}$$
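As a quick check of the arithmetic, the sketch below applies the formula to the Al Horford figures quoted above (5.2 points and a \$28 million salary). The variable names are illustrative and are not columns of the tables.

```python
# Figures quoted in the text for Al Horford.
three_and_ft_points = 5.2      # points from 3-pointers and free throws, i.e. PTS - 2 * 2P
salary = 28_000_000            # dollars

salary_in_100k = salary / 100_000   # salary in hundreds of thousands of dollars
value = three_and_ft_points / salary_in_100k
print(salary_in_100k, round(value, 4))
```

Dividing the salary by 100000 first makes the units explicit: the result is points per \$100000 of salary.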

<!-- BEGIN QUESTION -->

### Question 3.2.

Create a table called `full_data_with_value` that's a copy of `full_data`, with an extra column called `"Value"` containing each player's value (according to our crude measure).  Then make a histogram of players' values.  Use the specified bins, as they've been chosen to make the histogram informative. Don't forget to include your units in the histogram! Remember that `hist()` takes in an optional third argument, `unit`, that allows you to specify the units of your data. Refer to the Python reference sheet to look at `tbl.hist(...)` if necessary.

*Just so you know:* Informative histograms contain a majority of the data and **exclude outliers**. The provided bins will intentionally exclude some data points that are considered outliers.

<!--
BEGIN QUESTION
name: q3_2
manual: true
-->

In [None]:
my_bins = np.arange(0, 0.7, .1) # Use these provided bins when you make your histogram
full_data_with_value = ...
...

<!-- END QUESTION -->



Now suppose we **weren't** able to find out every player's salary (perhaps it was too costly to interview each player).  Instead, we have gathered a *simple random sample* of 50 players' salaries.  The cell below will load a pre-made sample of 50 players to the table `sample_salary_data`.

In [None]:
sample_salary_data = Table.read_table("sample_salary_data.csv")
sample_salary_data.show(3)

<!-- BEGIN QUESTION -->

### Question 3.3.

Make a histogram of the values of the players in `sample_salary_data`, using the same method for measuring value we used in question 3.2. Make sure to specify the same bins and units that were used in question 3.2.

**Hint:** This will take several steps and perhaps several intermediate tables. Don't feel like you need to do this in a single line of code.

<!--
BEGIN QUESTION
name: q3_3
manual: true
-->

In [None]:
sample_data = player_data.join('Player', sample_salary_data, 'Name')
sample_data_with_value = ...
...

<!-- END QUESTION -->



Now let us summarize what we have seen.  To guide you, we have written most of the summary already.

### Question 3.4.

Complete the statements below by setting each relevant variable name to the value that correctly fills the blank.

* The plot in question 3.2 displayed a(n) [`distribution_1`] distribution of the population of [`player_count_1`] players.  The areas of the bars in the plot sum to [`area_total_1`].

* The plot in question 3.3 displayed a(n) [`distribution_2`] distribution of the sample of [`player_count_2`] players.  The areas of the bars in the plot sum to [`area_total_2`].

`distribution_1` and `distribution_2` should be set to one of the following strings: `"empirical"` or `"probability"`. 

`player_count_1`, `area_total_1`, `player_count_2`, and `area_total_2` should be set to integers.

Remember that areas are represented in terms of **percentages**, not proportions.

**Hint 1:** For a refresher on distribution types, check out [Section 10.1](https://www.inferentialthinking.com/chapters/10/1/empirical-distributions.html).

**Hint 2:** The `hist()` table method ignores data points outside the range of its bins, but you may ignore this fact and calculate the areas of the bars using what you know about histograms from lecture.

<!--
BEGIN QUESTION
name: q3_4
points:
  - 0
  - 0
  - 0
  - 0
  - .16
  - .17
  - .17
  - .16
  - .17
  - .17
-->

In [None]:
distribution_1 = ...
player_count_1 = ...
area_total_1 = ...

distribution_2 = ...
player_count_2 = ...
area_total_2 = ...

In [None]:
grader.check("q3_4")

<!-- BEGIN QUESTION -->

### Question 3.5.

For which range of values does the plot of the simple random sample in question 3.3 better represent the distribution of the **population's** player values: 0 to 0.3, or above 0.3? Explain your answer. 

<!--
BEGIN QUESTION
name: q3_5
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## 4. Earthquakes


The next cell loads a table containing information about **every earthquake with a magnitude above 5** in 2019 (smaller earthquakes are generally not felt, only recorded by very sensitive equipment), compiled by the US Geological Survey. (source: https://earthquake.usgs.gov/earthquakes/search/)

In [None]:
earthquakes = Table().read_table('earthquakes_2019.csv').select(['time', 'mag', 'place'])
earthquakes

If we were studying all human-detectable 2019 earthquakes and had access to the above data, we’d be in good shape - however, if the USGS didn’t publish the full data, we could still learn something about earthquakes from just a smaller subsample. If we gathered our sample correctly, we could use that subsample to get an idea about the distribution of magnitudes (above 5, of course) throughout the year! 

We'll create two different samples from the `earthquakes` table. Analyze the code for each sample to examine how the methods are similar and how they differ. You'll be asked to compare the results of each method, and understanding the differences between the sampling methods may help you explain any differences in the average value of `mag` that is computed.

### Sample #1

In [None]:
# First sample method
sample1 = earthquakes.sort('mag', descending = True).take(np.arange(100))

# Calculate the mean of the first sample
sample1_magnitude_mean = np.mean(sample1.column('mag'))

sample1_magnitude_mean

### Sample #2

In [None]:
# Second sample method
sample2 = earthquakes.take(np.arange(100))

# Calculate the mean of the second sample
sample2_magnitude_mean = np.mean(sample2.column('mag'))

sample2_magnitude_mean

<!-- BEGIN QUESTION -->

### Question 4.1.

Neither of these samples accurately represents the population from which it was drawn. Explain why each of the two samples would create a biased average of `mag`.

*Hint:* Consider the ordering of the rows in the tables from which the samples were constructed.

<!--
BEGIN QUESTION
name: q4_1
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Question 4.2.

Write code to produce a sample of size 200 that **is** representative of the population and assign it to a Table named `representative_sample`. Then, take the mean of the magnitudes of the earthquakes in this sample and assign it to `representative_mean`. 

**Hint:** In class, you've learned what type of sample should be used to best represent a population.

<!--
BEGIN QUESTION
name: q4_2
-->

In [None]:
representative_sample = ...
representative_mean = ...
representative_mean

In [None]:
grader.check("q4_2")

### Question 4.3.

Suppose we want to figure out what the biggest magnitude earthquake was in 2019, but we only have our representative sample of 200. Let’s see if trying to find the biggest magnitude in the population from a random sample of 200 is a reasonable idea!

In the cell below write code that uses simulation to create 5,000 random samples of size 200 from the `earthquakes` Table. For each sample determine the maximum value of `mag` in the sample. All 5,000 maximum values from the samples should be stored in an array named `maximums`.
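The general pattern of repeating a random draw and collecting one statistic per repetition can be sketched as below. This is only an illustration under stated assumptions: it uses Python's standard `random` module, a made-up population of the integers 1 to 100, and the sample mean as the statistic, none of which come from this homework.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable; not required by the homework

population = list(range(1, 101))   # made-up population: the integers 1 to 100
num_repetitions = 1000             # hypothetical number of repetitions
sample_size = 10                   # hypothetical sample size

sample_means = []                  # starts empty; one statistic is appended per repetition
for _ in range(num_repetitions):
    sample = random.sample(population, sample_size)    # draw without replacement
    sample_means.append(sum(sample) / sample_size)     # compute and store the statistic

print(len(sample_means), min(sample_means), max(sample_means))
```

With NumPy, the same pattern is usually written with an array that grows through `np.append` inside the loop.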

<!--
BEGIN QUESTION
name: q4_3
-->

In [None]:
maximums = ...
for i in np.arange(5000): 
    ...

In [None]:
grader.check("q4_3")

Run the cell below to create a histogram of the 5,000 maximums you simulated to view their distribution. You'll need this to help answer question 4.5.

In [None]:
# Histogram of your maximums
Table().with_column('Largest magnitude in sample', maximums).hist('Largest magnitude in sample') 

### Question 4.4.

Now find the magnitude of the **actual** strongest earthquake in 2019 using the `earthquakes` Table, not the maximum of just a sample.

<!--
BEGIN QUESTION
name: q4_4
points:
  - 0
  - 1
-->

In [None]:
strongest_earthquake_magnitude = ...
strongest_earthquake_magnitude

In [None]:
grader.check("q4_4")

<!-- BEGIN QUESTION -->

### Question 4.5.

Explain whether you believe you can accurately use a sample size of 200 to determine the maximum. Use the histogram and population maximum of `mag` determined in 4.3 and 4.4 to help answer this question. What is one problem with using the maximum function as your estimator that you don't have to worry about when using an average?

**Hint:** Use the histogram to estimate how many of the samples got the maximum correct.

<!--
BEGIN QUESTION
name: q4_5
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## 5. Assessing Jade's Models
#### Games with Jade

Our friend Jade comes over and asks us to play a game with her. The game works like this: 

> We will draw randomly with replacement from a simplified 13 card deck with 4 face cards (A, J, Q, K), and 9 numbered cards (2, 3, 4, 5, 6, 7, 8, 9, 10). If we draw cards with replacement 13 times, and if the number of face cards is greater than or equal to 4, we lose.
> 
> Otherwise, we win.

We play the game once and we lose, observing 8 total face cards. We are angry and accuse Jade of cheating! Jade is adamant, however, that the deck is fair.

#### Jade's model of the game
Jade claims that there is an equal chance of getting any of the cards (A, 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K), but we do not believe her. 

#### Our alternative model of the game
We believe that the deck is clearly rigged, with face cards (A, J, Q, K) being more likely than the numbered cards (2, 3, 4, 5, 6, 7, 8, 9, 10).

### Question 5.1.

Assign `deck_model_probabilities` to a two-item array containing the chance of drawing a face card as the first element, and the chance of drawing a numbered card as the second element under the assumptions of Jade's model. Since we're working with probabilities, make sure your values are between 0 and 1. Probabilities should be exact representations of the values (1/3 not 0.333).

<!--
BEGIN QUESTION
name: q5_1
points:
  - 0
  - 0
  - .5
  - .5
-->

In [None]:
deck_model_probabilities = ...
deck_model_probabilities

In [None]:
grader.check("q5_1")

### Question 5.2.

We believe Jade's model (every card is equally likely to be drawn) is incorrect. In particular, we believe the deck is rigged such that there is a larger chance of getting a face card.  Which of the following statistics can we calculate in our simulation to test for differences between the model and our alternative? Assign `statistic_choice` to the correct answer. 

**Hint:** Look back at the data we observed from the game we've already played. 

1. The actual number of face cards we get in 13 draws
2. The distance (absolute value) between the actual number of face cards in 13 draws and the expected number of face cards in 13 draws (4)
3. The expected number of face cards in 13 draws (4)

<!--
BEGIN QUESTION
name: q5_2
points:
  - 0
  - 1
-->

In [None]:
statistic_choice = ...
statistic_choice

In [None]:
grader.check("q5_2")

<!-- BEGIN QUESTION -->

### Question 5.3.

Define the function `deck_simulation_and_statistic`, which, given an integer sample size and an array of model proportions (like the one you created in Question 5.1), returns **the number of face cards** in one simulation of drawing cards under the model specified in `model_proportions`. The included final line of code in the cell below will call your function to simulate drawing 13 cards from the deck using the probabilities assigned to the array `deck_model_probabilities` and return how many face cards those 13 drawn cards contained.

**Hint:** Think about how you can use the function `sample_proportions` contained in the `datascience` library. 
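As a reminder of what a function like `sample_proportions` computes, the sketch below builds a rough stand-in from Python's standard `random` module. The helper `sketch_sample_proportions` is hypothetical and is not part of the `datascience` library, and the two-category model is a made-up fair coin rather than Jade's deck.

```python
import random

def sketch_sample_proportions(sample_size, probabilities):
    """Hypothetical stand-in: proportion of draws landing in each category."""
    categories = list(range(len(probabilities)))
    draws = random.choices(categories, weights=probabilities, k=sample_size)
    return [draws.count(c) / sample_size for c in categories]

random.seed(1)  # fixed seed so the sketch is repeatable
model = [0.5, 0.5]                                   # made-up two-category model (a fair coin)
proportions = sketch_sample_proportions(20, model)
counts = [round(p * 20) for p in proportions]        # proportions times sample size give counts
print(proportions, counts)
```

The returned proportions always sum to 1, and multiplying a proportion by the sample size recovers the count for that category.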

<!--
BEGIN QUESTION
name: q5_3
manual: true
-->

In [None]:
def deck_simulation_and_statistic(sample_size, model_proportions):
    ...

deck_simulation_and_statistic(13, deck_model_probabilities)

In [None]:
grader.check("q5_3")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 5.4.

Use your function from question 5.3 to run 5,000 simulations in which 13 cards are drawn under the proportions that you specified in Question 5.1. Store all of your statistics in an array named `deck_statistics`.

<!--
BEGIN QUESTION
name: q5_4
manual: true
-->

In [None]:
repetitions = 5000 
...

deck_statistics

In [None]:
grader.check("q5_4")

<!-- END QUESTION -->



Let’s take a look at the distribution of simulated statistics. The cell below will create a histogram of your results. How many face cards would seem to be the most common result when dealt 13 cards when assuming our proportions from question 5.1? You'll want to keep that in mind when answering the next question.

In [None]:
# Draw a distribution of statistics 
Table().with_column('Deck Statistics', deck_statistics).hist(bins=np.arange(-0.5,13.5,1))

<!-- BEGIN QUESTION -->

### Question 5.5.

Given the observed value of being dealt 8 face cards out of 13 total cards, do you believe that Jade's model is reasonable? Explain your answer using the distribution drawn in the previous problem. In particular, consider how likely such a result would be using the information in the histogram.
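One way to put a number on "how likely such a result would be" is to compute the proportion of simulated statistics that are at least as large as the observed one. The sketch below shows that calculation on made-up values; the list `simulated` and the value `observed` are hypothetical and are not results from this homework's model.

```python
# Made-up simulated statistics; these are not results from this homework's model.
simulated = [2, 3, 3, 4, 4, 4, 5, 5, 6, 7]
observed = 6   # hypothetical observed value

# Proportion of simulated statistics at least as large as the observed one.
at_least_observed = sum(1 for s in simulated if s >= observed)
tail_proportion = at_least_observed / len(simulated)
print(tail_proportion)
```

A small proportion means the observed value sits far out in the tail of what the model tends to produce.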

<!--
BEGIN QUESTION
name: q5_5
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Submitting your work
You're done with Graded HW 05! All assignments in the course will be distributed as notebooks like this one, and you will submit your work by doing the following:
* Save your notebook
* Restart the kernel and run up to this cell.
* Run all the tests by running the cell containing `grader.check_all()`. Make sure they pass the way you expect them to.
* Run the cell below with the code `grader.export(...)`.
* Download the file named `hw05.zip`, found in the explorer pane on the left side of the screen.
* Upload `hw05.zip` to the Graded HW 05 assignment on Canvas.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When done exporting, download the .zip file by finding it in the file browser on the left side of the screen, then right-click and select **Download**. You'll submit this .zip file for the assignment in Canvas to Gradescope for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)