# Homework 6: Probability, Simulation, Estimation, and Assessing Models

**Reading**: 
* [Randomness](https://inferentialthinking.com/chapters/09/Randomness.html)
* [Sampling and Empirical Distributions](https://inferentialthinking.com/chapters/10/Sampling_and_Empirical_Distributions.html)
* [Testing Hypotheses](https://inferentialthinking.com/chapters/11/Testing_Hypotheses.html)

This is the **sixth** of 12 homework assignments you'll complete in this course.  The homework assignments provide you with more opportunity to practice the skills we've learned in lecture.  This homework assignment is due by **11:59pm, Oct 11, 2021**. 

Start early so that you can ask for assistance if you're stuck. If you remain stuck on a question, please reach out to me, or post your questions on the course Question Board in Canvas. Feel free to start a new thread for your question, or check there to see if someone else has done so already. I encourage everyone to participate by offering their own explanations, and I'll do the same. But be sure to refrain from directly posting or sharing answers. It's important for everyone to arrive at their own answers.

For all problems that require written explanations, you **must** provide your answer in the designated space.  Moreover, throughout this homework and all future ones, please do not reassign variables!  For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Please also put your written answers only in the cells provided. 

You will submit this assignment by uploading your completed Jupyter notebook to Canvas.  All files will have a LASTNAME as part of their name.  So before you continue, rename your file so that your last name replaces LASTNAME.  Type your last name in ALL CAPS.  For example, this downloaded file has the name 

`HW 01-FA2021-LASTNAME.ipynb`

If your last name is Peralta, change the file name to 

`HW 01-FA2021-PERALTA.ipynb`
 

Next, fill in the cell below with your student ID number replacing the `...`  Your cell should look like: `IDs = [3141593]`.   Then click the "run cell" button at the top that looks like ▶| or hold down `shift` + `return`.

Please use the passcode **basis** for my [office hours](https://aacc-edu.zoom.us/j/92542256939?pwd=cGsveTlHdW9EWGVDUTJuT0dHc0pjdz09) on Mondays and Wednesdays, 10am-12pm and 8:45pm-9:45pm.  

In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)



In [None]:
global IDs # leave this line alone
IDs = [...] # replace ... with your student ID number

## 1. Probability


We will be testing some probability concepts that were introduced in lecture. For all of the following problems, we will introduce a problem statement and give you a proposed answer. You must assign the provided variable to one of the following three integers, depending on whether the proposed answer is too low, too high, or correct. 

1. Assign the variable to 1 if you believe our proposed answer is too high.
2. Assign the variable to 2 if you believe our proposed answer is too low.
3. Assign the variable to 3 if you believe our proposed answer is correct.


You are more than welcome to create additional cells anywhere in this notebook to use for arithmetic operations.
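If it helps to see the 1/2/3 convention in action, here is a minimal sketch on a practice problem that is not part of this homework: the chance of two heads in two tosses of a fair coin, checked against a made-up proposed answer of 1/2. The names `true_chance`, `proposed_answer`, and `two_heads` are hypothetical and exist only for this illustration.

```python
from fractions import Fraction

# Hypothetical practice problem (not a homework question):
# the chance of two heads in two tosses of a fair coin.
true_chance = Fraction(1, 2) ** 2      # tosses are independent, so multiply
proposed_answer = Fraction(1, 2)       # made-up proposed answer to check

if proposed_answer > true_chance:
    two_heads = 1   # proposed answer is too high
elif proposed_answer < true_chance:
    two_heads = 2   # proposed answer is too low
else:
    two_heads = 3   # proposed answer is correct

print(true_chance, two_heads)
```

In the homework itself you only assign the integer; the comparison above is just one way to organize your scratch work.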

**Question 1. (1 pt)** You roll a 6-sided die 10 times. What is the chance of getting 10 sixes?

Our proposed answer: $$\left(\frac{1}{6}\right)^{10}$$

Assign `ten_sixes` to either 1, 2, or 3 depending on if you think our answer is too high, too low, or correct. 


In [None]:
ten_sixes = ...
ten_sixes

**Question 2. (1 pt)** Take the same problem set-up as before, rolling a fair die 10 times. What is the chance that every roll is less than or equal to 5?

Our proposed answer: $$1 - \left(\frac{1}{6}\right)^{10}$$

Assign `five_or_less` to either 1, 2, or 3. 



In [None]:
five_or_less = ...
five_or_less

**Question 3. (1 pt)** Assume we are picking a lottery ticket. We must choose three distinct numbers from 1 to 1000 and write them on a ticket. Next, someone picks three numbers one by one from a bowl with numbers from 1 to 1000 each time without putting the previous number back in. We win if our numbers are all called in order. 

If we decide to play the game and pick our numbers as 12, 140, and 890, what is the chance that we win? 

Our proposed answer: $$\left(\frac{3}{1000}\right)^3$$

Assign `lottery` to either 1, 2, or 3. 



In [None]:
lottery = ...
lottery

**Question 4. (1 pt)** Assume we have two lists, list A and list B. List A contains the numbers [20,10,30], while list B contains the numbers [10,30,20,40,30]. We choose one number from list A randomly and one number from list B randomly. What is the chance that the number we drew from list A is larger than or equal to the number we drew from list B?

Our proposed solution: $$1/5$$

Assign `list_chances` to either 1, 2, or 3. 

*Hint: Consider the different possible ways that the items in List A can be greater than or equal to items in List B. Try working out your thoughts with pencil and paper. What do you think the correct solution will be close to?*
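One way to organize the pencil-and-paper work is to list every equally likely pair and count the favorable ones. The sketch below does this for two made-up lists, `list_x` and `list_y`, which are deliberately not the lists in this question. It assumes each item in a list is equally likely to be chosen.

```python
from fractions import Fraction
from itertools import product

# Made-up lists for illustration; these are not the lists in Question 4.
list_x = [1, 2]
list_y = [1, 3, 5]

# Every (x, y) pair is equally likely when each list is sampled uniformly.
pairs = list(product(list_x, list_y))

# Keep only the pairs where the first number is at least the second.
favorable = [(x, y) for x, y in pairs if x >= y]

chance = Fraction(len(favorable), len(pairs))
print(favorable, chance)
```

Note that a value appearing twice in a list counts as two separate items when you enumerate the pairs.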


In [None]:
list_chances = ...
list_chances 

## 2. Monkeys Typing Shakespeare
##### (...or at least the string "datascience")

A monkey is banging repeatedly on the keys of a typewriter. Each time, the monkey is equally likely to hit any of the 26 lowercase letters of the English alphabet, 26 uppercase letters of the English alphabet, and any number between 0-9 (inclusive), regardless of what it has hit before. There are no other keys on the keyboard.  

This question is inspired by a mathematical theorem called the Infinite monkey theorem (<https://en.wikipedia.org/wiki/Infinite_monkey_theorem>), which postulates that if you put a monkey in the situation described above for an infinite time, they will eventually type out all of Shakespeare’s works.

**Question 5. (1 pt)** Suppose the monkey hits the keyboard 5 times.  Compute the chance that the monkey types the sequence `HELLO`.  (Call this `data_chance`.) Use algebra and type in an arithmetic expression that Python can evaluate.


In [None]:
data_chance = ...
data_chance

**Question 6. (1 pt)** Write a function called `simulate_key_strike`.  It should take **no arguments**, and it should return a random one-character string that is equally likely to be any of the 26 lower-case English letters, 26 upper-case English letters, or any number between 0-9 (inclusive). 

*Hint: you will need the function* `np.random.choice` *in the body of your function.  [You can read more about this function here.](https://inferentialthinking.com/chapters/09/Randomness.html)*


In [None]:
# We have provided the code below to compute a list called keys,
# containing all the lower-case English letters, upper-case English letters, and the 
# digits 0-9 (inclusive).  Print it if you want to verify what it contains.

import string
keys = list(string.ascii_lowercase + string.ascii_uppercase + string.digits)

def simulate_key_strike():
    """Simulates one random key strike."""
    ...

# An example call to your function:
simulate_key_strike()

**Question 7. (1 pt)** Write a function called `simulate_several_key_strikes`.  It should take one argument: an integer specifying the number of key strikes to simulate. It should return a string containing that many characters, each one obtained from simulating a key strike by the monkey.

*Hints:* 

* *You can use a* `for` *loop to build an array one character at a time, each time calling the* `simulate_key_strike` *function.*

* *If you make a list or array of the simulated key strikes called* `key_strikes_array`, *you can convert that to a string by calling* `"".join(key_strikes_array)`.  *This joins the empty string (given here as* "" *) with the contents of* `key_strikes_array`.
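To see what the `join` hint does, here is a small sketch using a hand-written list in place of simulated key strikes. The name `example_strikes` and its contents are made up for illustration.

```python
# Hypothetical characters standing in for simulated key strikes.
example_strikes = ['a', 'B', '7']

# "".join glues the items together with nothing in between them.
joined = "".join(example_strikes)

# A non-empty separator is placed between the items instead.
dashed = "-".join(example_strikes)

print(joined, dashed)
```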



In [None]:
def simulate_several_key_strikes(num_strikes):
    strikes=make_array()
    for i in np.arange(num_strikes):
        one_strike = ...
        strikes = np.append( ... , ... )
    return ...        

# An example call to your function:
simulate_several_key_strikes(11)

**Question 8. (1 pt)** Call `simulate_several_key_strikes` 5000 times, each time simulating the monkey striking 5 keys.  Compute the proportion of times the monkey types `"HELLO"`, calling that proportion `data_proportion`.


In [None]:
num_simulations = 5000
num_HELLO=0
for i in np.arange(num_simulations):
    if ... == 'HELLO' :
        num_HELLO = num_HELLO + 1
    
data_proportion = ... / ... 
data_proportion

**Question 9. (2 pts)** Check the value your simulation computed for `data_proportion`.  Is your simulation a good way to estimate the chance that the monkey types `"HELLO"` in 5 strikes (the answer to question 5)?  Why or why not?



*Write your answer here, replacing this text.*

**Question 10. (1 pt)** Compute the chance that the monkey types the letter `"t"` at least once in the 5 strikes.  Call it `t_chance`. Use algebra and type in an arithmetic expression that Python can evaluate. 
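"At least once" questions are often easiest through the complement: first find the chance that the event never happens, then subtract from 1. The sketch below applies this to a practice example that is not the homework question: at least one head in 3 tosses of a fair coin. The variable names are hypothetical.

```python
from fractions import Fraction

# Practice example (not the homework question):
# at least one head in 3 tosses of a fair coin.
chance_no_heads = Fraction(1, 2) ** 3            # all three tosses are tails
chance_at_least_one_head = 1 - chance_no_heads   # complement of "no heads"

print(chance_at_least_one_head)
```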


In [None]:
t_chance = ...
t_chance

**Question 11. (2 pts)** Do you think that a computer simulation is more or less effective to estimate `t_chance` compared to when we tried to estimate `data_chance` this way? Why or why not? (You don't need to write a simulation, but it is an interesting exercise.)


*Write your answer here, replacing this text.*

## 3. Sampling Basketball Players


This exercise uses salary data and game statistics for basketball players from the 2019-2020 NBA season. The data was collected from [Basketball-Reference](http://www.basketball-reference.com).

Run the next cell to load the two datasets.

In [None]:
player_data = Table.read_table('player_data.csv')
salary_data = Table.read_table('salary_data.csv')
player_data.show(10)
salary_data.show(10)

The table `player_data` has four columns: 
* `Player`, the player name, 
* `3P`, the average number of 3-point field goals per game
* `2P`, the average number of 2-point field goals per game
* `PTS`, the average number of points scored per game

This table does *not* list the average number of free throws per game.  But this *could* be calculated from the given data, because the total number of points is equal to 3* (number of 3-pointers) + 2*(number of 2-pointers) + (number of free throws).
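As a quick check of that relationship, here is a sketch with made-up per-game averages for a hypothetical player. None of these numbers come from `player_data.csv`.

```python
# Made-up per-game averages for a hypothetical player (not from player_data.csv).
three_pointers = 2.0
two_pointers = 5.0
points = 20.0

# PTS = 3 * (3P) + 2 * (2P) + (free throws), so solve for free throws.
free_throws = points - 3 * three_pointers - 2 * two_pointers

print(free_throws)
```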

The table `salary_data` has two columns:
* `Name`, the player name, and 
* `Salary`, the player's annual salary (in dollars).

**Question 12. (1 pt)** We would like to relate players' game statistics to their salaries.  Compute a table called `full_data` that includes one row for each player who is listed in both `player_data` and `salary_data`.  It should include all the columns from `player_data` and `salary_data`, except the `"Name"` column.



In [None]:
full_data = player_data.join( ... , ... , ...)
full_data

Basketball team managers would like to hire players who perform well but don't command high salaries.  From this perspective, a very crude measure of a player's *value* to their team is the average number of points per game the player scored from 3-pointers and free throws for every **\$100000 of salary** (*Note*: the `Salary` column is in dollars, not hundreds of thousands of dollars). For example, Al Horford scored an average of 5.2 points per game from 3-pointers and free throws combined, and has a salary of **\$28 million.** This is equivalent to 280 hundreds of thousands of dollars, so his value is $\frac{5.2}{280}$. The formula is:

$$\frac{\text{"PTS"} - 2 * \text{"2P"}}{\text{"Salary"}\ / \ 100000}$$
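The arithmetic for a single player can be sketched with plain numbers before applying it to a whole column. The example below uses only the two Al Horford figures quoted above (5.2 points and a \$28 million salary); the variable names are made up for illustration.

```python
# Numbers quoted in the text for Al Horford.
points_from_threes_and_free_throws = 5.2   # this is "PTS" - 2 * "2P"
salary_in_dollars = 28_000_000

# Convert dollars to units of $100,000 before dividing.
salary_in_100k = salary_in_dollars / 100000
value = points_from_threes_and_free_throws / salary_in_100k

print(salary_in_100k, value)
```

The same expression works element-wise when the numbers are replaced by table columns.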

**Question 13. (2 pts)** Create a table called `full_data_with_value` that's a copy of `full_data`, with an extra column called `"Value"` containing each player's value (according to our crude measure).  Then make a histogram of players' values.  **Specify bins that make the histogram informative and don't forget your units.**  Remember that `hist()` takes an optional `unit` argument that allows you to specify the units - use 'pts/$100k' for your units.  Refer to the Python reference to look at `tbl.hist(...)` if necessary.

*Just so you know:* Informative histograms contain a majority of the data and **exclude outliers**.


In [None]:
bins = np.arange(0, 0.7, .1) # Use these provided bins when you make your histogram
full_data_with_value = full_data.with_column( ... , ... )

full_data_with_value.hist( ... , bins = ... , unit = ...)

Now suppose we weren't able to find out every player's salary (perhaps it was too costly to interview each player).  Instead, we have gathered a *simple random sample* of 50 players' salaries.  The cell below loads those data.

In [None]:
sample_salary_data = Table.read_table("sample_salary_data.csv")
sample_salary_data.show(3)

**Question 14. (2 pts)** Make a histogram of the values of the players in `sample_salary_data`, using the same method for measuring value we used in question 13. Make sure to specify the units again in the histogram as stated in the previous problem. **Use the same bins, too.**  

*Hint:* This will take several steps.


In [None]:
sample_data = player_data.join('Player', sample_salary_data, 'Name')
sample_data_with_value = ...

sample_data_with_value.hist( ... , bins = ... , unit = ...)

### Now let us summarize what we have seen.  To guide you, we have written most of the summary already.

**Question 15. (1 pt)** Complete the statements below by setting each relevant variable name to the value that correctly fills the blank.

`distribution_1` and `distribution_2` should be set to one of the following strings: `"empirical"` or `"probability"`. 

`player_count_1`, `area_total_1`, `player_count_2`, and `area_total_2` should be set to integers.



* The plot in question 13 displayed a(n) [`distribution_1`] distribution of the population of [`player_count_1`] players.  The areas of the bars in the plot sum to [`area_total_1`].

* The plot in question 14 displayed a(n) [`distribution_2`] distribution of the sample of [`player_count_2`] players.  The areas of the bars in the plot sum to [`area_total_2`].


Remember that areas are represented in terms of percentages.

*Hint 1:* For a refresher on distribution types, check out [Section 10.1](https://inferentialthinking.com/chapters/10/1/Empirical_Distributions.html).

*Hint 2:* The `hist()` table method ignores data points outside the range of its bins, but you may ignore this fact and calculate the areas of the bars using what you know about histograms from lecture.


In [None]:
distribution_1 = ...
player_count_1 = ...
area_total_1 = ...

distribution_2 = ...
player_count_2 = ...
area_total_2 = ...

#Leave this array here to hold your answers...
answers=make_array(distribution_1, player_count_1, area_total_1,distribution_2, player_count_2, area_total_2)
answers

**Question 16. (1 pt)** For which range of values does the plot in question 14 better depict the distribution of the **population's player values**: 0 to 0.3, or above 0.3? Explain your answer. 



*Write your answer here, replacing this text.*

## 4. Earthquakes


The next cell loads a table containing information about **every earthquake with a magnitude above 5** in 2019 (smaller earthquakes are generally not felt, only recorded by very sensitive equipment), compiled by the US Geological Survey. (source: https://earthquake.usgs.gov/earthquakes/search/)

In [None]:
earthquakes = Table.read_table('earthquakes_2019.csv').select(['time', 'mag', 'place'])
earthquakes

If we were studying all human-detectable 2019 earthquakes and had access to the above data, we’d be in good shape - however, if the USGS didn’t publish the full data, we could still learn something about earthquakes from just a smaller subsample. If we gathered our sample correctly, we could use that subsample to get an idea about the distribution of magnitudes (above 5, of course) throughout the year!

In the following lines of code, we take two different samples from the earthquake table, and calculate the mean of the magnitudes of these earthquakes.

In [None]:
sample1 = earthquakes.sort('mag', descending = True).take(np.arange(100))
sample1_magnitude_mean = np.mean(sample1.column('mag'))
sample2 = earthquakes.take(np.arange(100))
sample2_magnitude_mean = np.mean(sample2.column('mag'))
[sample1_magnitude_mean, sample2_magnitude_mean]

**Question 17. (2 pts)**  Are these samples representative of the population of earthquakes in the original table (that is, should we expect each sample mean to be close to the population mean)? 

*Hint:* Consider the ordering of the `earthquakes` table. 


*Write your answer here, replacing this text.*

**Question 18. (1 pt)** Write code to produce a sample of size 200 that is representative of the population. Then, take the mean of the magnitudes of the earthquakes in this sample. Assign these to `representative_sample` and `representative_mean` respectively. 

*Hint: you can use the* `.sample` *method here.  Read about this method in [Section 10.2](https://inferentialthinking.com/chapters/10/2/Sampling_from_a_Population.html) of our course text.* 
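The idea behind a random sample can be sketched without the `datascience` library. The example below uses Python's standard `random.sample` on a made-up list of magnitudes; it is a stand-in for the idea, not the table `.sample` method from the course text, and the two may differ in details such as whether values are drawn with replacement.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Made-up magnitudes, not values from earthquakes_2019.csv.
population = [5.1, 5.3, 5.0, 6.2, 5.7, 5.4, 5.9, 5.2, 6.8, 5.5]

# Every subset of size 4 is equally likely; no item is drawn twice.
random_sample = random.sample(population, 4)
sample_mean = sum(random_sample) / len(random_sample)

print(random_sample, sample_mean)
```

Because the items are chosen at random rather than by position or by size, no part of the population is systematically favored.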


In [None]:
representative_sample = ...
representative_mean = np.mean(...)
representative_mean

**Question 19. (1 pt)** Suppose we want to figure out what the biggest magnitude earthquake was in 2019, but we only have our representative sample of 200. Let’s see if trying to find the biggest magnitude in the population from a random sample of 200 is a reasonable idea!

Write code that takes many random samples from the `earthquakes` table and finds the maximum of each sample. You should take a random sample of size 200 and do this 5000 times. Assign the array of maximum magnitudes you find to `maximums`.


In [None]:
maximums=make_array()

for i in np.arange(5000): 
    sample = ...
    ...
    maximums = ...
maximums

In [None]:
#Histogram of your maximums
Table().with_column('Largest magnitude in sample', maximums).hist('Largest magnitude in sample') 

**Question 20. (1 pt)** Now find the magnitude of the actual strongest earthquake in 2019 (not the maximum of a sample). This will help us determine whether a random sample of size 200 is likely to help you determine the largest magnitude earthquake in the population.


In [None]:
strongest_earthquake_magnitude = max(...)
strongest_earthquake_magnitude

**Question 21. (2 pts)** 
Explain whether you believe you can accurately use a sample size of 200 to determine the maximum. What is one problem with using the maximum as your estimator? Use the histogram above to help answer. 


*Write your answer here, replacing this text.*

Congratulations, you're done with Homework 6! 