In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab06.ipynb")

# Lab 6: Probability, Simulation, Estimation, and Assessing Models

**Helpful Resources:**
- [Python Reference](http://www.cs.williams.edu/~cs104/auto/python-library-ref.html): Cheat sheet of helpful library methods.

**Recommended Readings**: 
* [Ch 9. Randomness](https://www.inferentialthinking.com/chapters/09/Randomness.html)
* [Ch 10. Sampling and Empirical Distributions](https://www.inferentialthinking.com/chapters/10/Sampling_and_Empirical_Distributions.html)
* [Ch 11. Testing Hypotheses](https://www.inferentialthinking.com/chapters/11/Testing_Hypotheses.html)

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute this cell again. For all problems that ask for written explanations, you **must** provide your answer in the designated space. **Moreover, in this lab and all future ones, please be sure not to reassign variables later in the notebook!** For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!
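To see why reassignment causes trouble, here is a minimal sketch. The name `max_temperature` comes from the example above; the temperature values and the `check` helper are hypothetical stand-ins for an autograder test and are not part of Otter.

```python
# Hypothetical example: the values and the check below are made up for illustration.
temperatures = [61, 75, 68]
max_temperature = max(temperatures)   # answer to an earlier question


def check(value):
    """Stand-in for an autograder test that expects the earlier answer."""
    return value == 75


passed_before = check(max_temperature)   # the variable still holds the right answer

# Later in the notebook, the same name is reused for something else.
max_temperature = max([50, 55])

passed_after = check(max_temperature)    # the earlier test now sees 55 and fails
print(passed_before, passed_after)
```

Using a new name for the later value (for example `max_temperature_week2`, also hypothetical) would keep both answers intact.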

**Note: This lab has hidden tests. That means even though tests may say 100% passed, your final grade may not be 100%. We will run more tests for correctness once everyone has turned in the lab.**

## 1. Roulette (50 pts)



In [None]:
# Run this cell to set up the notebook.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *

# These lines make plots look nice and hide some messy Python warnings.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
warnings.simplefilter('ignore', np.VisibleDeprecationWarning)

A Nevada roulette wheel has 38 pockets and a small ball that rests on the wheel. When the wheel is spun, the ball comes to rest in one of the 38 pockets. That pocket is declared the winner. 

The pockets are labeled 0, 00, 1, 2, 3, 4, ... , 36. Pockets 0 and 00 are green, and the other pockets are alternately red and black. The table `wheel` is a representation of a Nevada roulette wheel. Note that *both* columns consist of strings. Below is an example of a roulette wheel!

<img src="roulette_wheel.jpeg" width="330px">

In [None]:
wheel = Table.read_table('roulette_wheel.csv', dtype=str)
wheel

### Betting on Red ###
If you bet on *red*, you are betting that the winning pocket will be red. This bet *pays 1 to 1*. That means if you place a one-dollar bet on red, then:

- If the winning pocket is red, you gain 1 dollar. That is, you get your original dollar back, plus one more dollar.
- If the winning pocket is not red, you lose your dollar. In other words, you gain -1 dollars.

Let's see if you can make money by betting on red at roulette.

#### Part 1.1 (5 pts)


 Define a function `dollar_bet_on_red` that takes the name of a color and returns your gain in dollars if that color had won and you had placed a one-dollar bet on red. Remember that the gain can be negative. Make sure your function returns an integer. 

You can assume that the only colors that will be passed as arguments are red, black, and green. Your function doesn't have to check that.



In [None]:
...
    ...


In [None]:
grader.check("q1.1")

Run the cell below to make sure your function is working.

In [None]:
print(dollar_bet_on_red('green'))
print(dollar_bet_on_red('black'))
print(dollar_bet_on_red('red'))

#### Part 1.2 (5 pts)


 Add a column labeled `Winnings: Red` to the table `wheel`. For each pocket, the column should contain your gain in dollars if that pocket won and you had bet one dollar on red. 

Your code should use the function `dollar_bet_on_red`.



In [None]:
red_winnings = ...
wheel = ...
wheel

In [None]:
grader.check("q1.2")

### Simulating 10 bets on Red
Roulette wheels are set up so that each time they are spun, the winning pocket is equally likely to be any of the 38 pockets regardless of the results of all other spins. Let's see what would happen if we decided to bet one dollar on red each round.
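As a rough illustration of what "equally likely regardless of other spins" means, the sketch below draws 10 spins with replacement from the 38 pocket labels. It uses Python's standard `random` module rather than the `wheel` table, and the fixed seed is only an illustrative choice to make the run repeatable.

```python
import random

# The 38 pocket labels, stored as strings like the pockets in the `wheel` table.
pockets = ['0', '00'] + [str(n) for n in range(1, 37)]

rng = random.Random(0)   # fixed seed so the run is repeatable (illustrative choice)

# Sampling with replacement: every spin picks uniformly from all 38 pockets,
# and a pocket that has already won can win again.
spins = rng.choices(pockets, k=10)
print(spins)
```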

#### Part 1.3 (5 pts)


Create a table `ten_bets` by sampling the table `wheel` to simulate 10 spins of the roulette wheel. Your table should have the same three column labels as in `wheel`. Once you've created that table, set `sum_bets` to your net gain in all 10 bets, assuming that you bet one dollar on red each time.

*Hint:* It may be helpful to print out `ten_bets` after you create it!

In [None]:
ten_bets = ...
sum_bets = ...

sum_bets

In [None]:
grader.check("q1.3")

Run the cells above a few times to see how much money you would make if you made 10 one-dollar bets on red. Making a negative amount of money doesn't feel good, but it is a reality in gambling. Casinos are a business, and they make money when gamblers lose.

#### Part 1.4 (5 pts)


 Let's see what would happen if you made more bets. Define a function `net_gain_red` that takes the number of bets and returns the net gain in that number of one-dollar bets on red. 

*Hint:* You should use your `wheel` table within your function definition.



In [None]:
...
...


In [None]:
grader.check("q1.4")

Run the cell below a few times to make sure that the results are similar to those you observed in the previous exercise.

In [None]:
net_gain_red(10)

#### Part 1.5 (5 pts)


 Complete the cell below to simulate the net gain in 200 one-dollar bets on red, repeating the process 10,000 times. After the cell is run, `all_gains_red` should be an array with 10,000 entries, each of which is the net gain in 200 one-dollar bets on red. 



In [None]:
num_bets = ...
repetitions = ...

all_gains_red = ...
...

len(all_gains_red) # Do not change this line! Check that all_gains_red is length 10000.

In [None]:
grader.check("q1.5")

Run the cell below to visualize the results of your simulation.

In [None]:
gains = Table().with_columns('Net Gain on Red', all_gains_red)
gains.hist(bins = np.arange(-80, 41, 4))

#### Part 1.6 (5 pts)


 Using the histogram above, decide whether the following statement is true or false:

>If you make 200 one-dollar bets on red, your chance of losing money is more than 50%.

Assign `loss_more_than_50` to either `True` or `False` depending on your answer to the question. 



In [None]:
loss_more_than_50 = ...

In [None]:
grader.check("q1.6")

### Betting on a Split
If betting on red doesn't seem like a good idea, a gambler might want to try a different bet. A bet on a *split* is a bet on two consecutive numbers such as 5 and 6. This bet pays 17 to 1. That means if you place a one-dollar bet on the split 5 and 6, then:

- If the winning pocket is either 5 or 6, your gain is 17 dollars.
- If any other pocket wins, you lose your dollar, so your gain is -1 dollars.

#### Part 1.7 (5 pts)


Define a function `dollar_bet_on_split` that takes a pocket number and returns your gain in dollars if that pocket won and you had bet one dollar on the 5-6 split.

*Hint:* Remember that the pockets are represented as strings.

In [None]:
...
...


Run the cell below to check that your function is doing what it should.

In [None]:
print(dollar_bet_on_split('5'))
print(dollar_bet_on_split('6'))
print(dollar_bet_on_split('00'))
print(dollar_bet_on_split('23'))

#### Part 1.8 (5 pts)


 Add a column `Winnings: Split` to the `wheel` table. For each pocket, the column should contain your gain in dollars if that pocket won and you had bet one dollar on the 5-6 split. 



In [None]:
split_winnings = ...
wheel = ...
wheel.show(8) # Do not change this line.

In [None]:
grader.check("q1.8")

#### Part 1.9 (5 pts)


 Simulate the net gain in 200 one-dollar bets on the 5-6 split, repeating the process 10,000 times and saving your gains in the array `all_gains_split`. 

*Hint:* Your code from Parts 1.4 and 1.5 may be helpful here!

In [None]:
all_gains_split = ...

...

for _ in np.arange(0, 10000):
    all_gains_split = np.append(all_gains_split, net_gain_split(200))

# Do not change the two lines below
gains = gains.with_columns('Net Gain on Split', all_gains_split)
gains.hist(bins = np.arange(-200, 150, 20))

#### Part 1.10 (5 pts)


 Look carefully at the histograms above and say whether each of the following statements is `True` or `False`. 

1. If you bet one dollar 200 times on a split, your chance of losing money is more than 50%.
2. If you bet one dollar 200 times in roulette, your chance of making more than 50 dollars is greater if you bet on a split each time than if you bet on red each time.
3. If you bet one dollar 200 times in roulette, your chance of losing more than 50 dollars is greater if you bet on a split each time than if you bet on red each time.

Assign `statement_1`, `statement_2`, and `statement_3` to `True` or `False` according to whether the corresponding statement is true.

*Hint:* We've already seen one of these statements in a prior question.



In [None]:
statement_1 = ...
statement_2 = ...
statement_3 = ...

In [None]:
grader.check("q1.10")

If this exercise has put you off playing roulette, it has done its job. If you are still curious about other bets, [here](https://en.wikipedia.org/wiki/Roulette#Bet_odds_table) they all are, and [here](https://en.wikipedia.org/wiki/Roulette#House_edge) is the bad news. The house – that is, the casino – always has an edge over the gambler.
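The "bad news" can be checked with a little arithmetic. The sketch below computes the average gain per one-dollar bet for the two bets in this lab, using the pocket counts and payouts stated earlier. The helper name `expected_gain` is ours, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def expected_gain(winning_pockets, payout, total_pockets=38):
    """Average gain per one-dollar bet: win `payout` dollars or lose 1 dollar."""
    p_win = Fraction(winning_pockets, total_pockets)
    return p_win * payout + (1 - p_win) * (-1)


red = expected_gain(winning_pockets=18, payout=1)     # bet on red pays 1 to 1
split = expected_gain(winning_pockets=2, payout=17)   # bet on a split pays 17 to 1

print(red, split)   # both are -1/19, a bit over 5 cents lost per dollar on average
```

The two bets have the same average loss per dollar; what differs is how widely the results are spread around that average.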

## 2. Chances (35 pts)



Before you do this exercise, make sure you understand the logic behind all the examples in [Section 9.5](https://inferentialthinking.com/chapters/09/5/Finding_Probabilities.html). 

Good ways to approach probability calculations include:

- Thinking one trial at a time: What does the first one have to be? Then what does the next one have to be?
- Breaking up the event into distinct ways in which it can happen.
- Seeing if it is easier to find the chance that the event does not happen.
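As a small illustration of these approaches on a different problem (a fair six-sided die, not the roulette wheel), the sketch below finds the chance of at least one six in four rolls. It first finds the chance of no sixes, one roll at a time, and then takes the complement.

```python
from fractions import Fraction

p_not_six = Fraction(5, 6)          # one roll: the chance it is not a six

# Rolls are independent, so multiply the chances one trial at a time.
p_no_six_in_4 = p_not_six ** 4      # no six on any of the 4 rolls

# Complement: "at least one six" happens exactly when "no sixes" does not.
p_at_least_one_six = 1 - p_no_six_in_4

print(p_at_least_one_six, float(p_at_least_one_six))
```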

### Finding Chances

On each spin of a roulette wheel, all 38 pockets are equally likely to be the winner regardless of the results of other spins. Among the 38 pockets, 18 are red, 18 black, and 2 green. In each part below, write an expression that evaluates to the chance of the event described.

#### Part 2.1 (5 pts)


 The winning pocket is black on all of the first three spins. 



In [None]:
first_three_black = ...
first_three_black

In [None]:
grader.check("q2.1")

#### Part 2.2 (5 pts)


 The color green never wins in the first 10 spins. 



In [None]:
no_green = ...
no_green

#### Part 2.3 (5 pts)


 The color green wins at least once on the first 10 spins. 



In [None]:
at_least_one_green = ...
at_least_one_green

In [None]:
grader.check("q2.3")

#### Part 2.4 (5 pts)


 Two of the three colors never win in the first 10 spins. 

*Hint:* Imagine the event. What must happen on all 10 spins?



In [None]:
lone_winners = ...
lone_winners

In [None]:
grader.check("q2.4")

### Comparing Chances
In each of Parts 2.5-2.7, two events A and B are described. Choose one of the following three options and set each answer variable to a single integer:

1. Event A is more likely than Event B
2. Event B is more likely than Event A
3. The two events have the same chance.

You should be able to make the choices **without calculation**. Good ways to approach this exercise include imagining carrying out the chance experiments yourself, one trial at a time, and thinking about the [law of averages](https://inferentialthinking.com/chapters/10/1/Empirical_Distributions.html#the-law-of-averages).

#### Part 2.5 (5 pts)


 A child picks four times at random from a box that has four toy animals: a bear, an elephant, a giraffe, and a kangaroo. 

- Event A: all four different animals are picked (assuming the child picks without replacement)
- Event B: all four different animals are picked (assuming the child picks with replacement)



In [None]:
toys_option = ...

In [None]:
grader.check("q2.5")

#### Part 2.6 (5 pts)


 In a lottery, two numbers are drawn at random with replacement from the integers 1 through 1000. 

- Event A: The number 8 is picked on both draws
- Event B: The same number is picked on both draws



In [None]:
lottery_option = ...

In [None]:
grader.check("q2.6")

#### Part 2.7 (5 pts)


 A fair coin is tossed repeatedly. 

- Event A: There are 60 or more heads in 100 tosses
- Event B: There are 600 or more heads in 1000 tosses

*Hint*: Think in terms of proportions.



In [None]:
coin_option = ...

In [None]:
grader.check("q2.7")

## 3. Three Ways Python Draws Random Samples (10 pts)



You have learned three ways to draw random samples using Python:

- `tbl.sample` draws a random sample of rows from the table `tbl`. The output is a table consisting of the sampled rows. 

- `np.random.choice` draws a random sample from a population whose elements are in an array. The output is an array consisting of the sampled elements.

- `sample_proportions` draws from a categorical distribution whose proportions are in an array. The output is an array consisting of the sampled proportions in all the categories. 
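The difference between the three kinds of output can be sketched with toy data. The block below does not call `tbl.sample`, `np.random.choice`, or `sample_proportions`; it imitates the idea behind each one with Python's standard `random` module, and the fruit data and seed are made up for illustration.

```python
import random
from collections import Counter

rng = random.Random(1)  # seeded for repeatability (illustrative choice)

# 1. Sampling rows (the idea behind tbl.sample): each draw keeps the whole row.
rows = [('apple', 'red'), ('banana', 'yellow'), ('cherry', 'red')]
sampled_rows = rng.choices(rows, k=5)          # with replacement

# 2. Sampling elements (the idea behind np.random.choice): draws from one array.
colors = [color for _, color in rows]
sampled_colors = rng.choices(colors, k=5)

# 3. Sampling proportions (the idea behind sample_proportions): start from
#    category proportions and get back proportions, with no record of order.
categories = ['red', 'yellow']
model = [2 / 3, 1 / 3]
draws = rng.choices(categories, weights=model, k=5)
counts = Counter(draws)
sampled_proportions = [counts[c] / 5 for c in categories]

print(sampled_rows)
print(sampled_colors)
print(sampled_proportions)
```

Note that the third result only says how often each category appeared; it cannot tell you which draw came first or last.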

In [None]:
# Just run this cell; it will become more useful in Parts 3.1 and 3.2
top = Table.read_table('top_movies_2017.csv').select(0, 1)
top.show(3)

In [None]:
# Just run this cell; it will become more useful in Parts 3.1 and 3.2
studios_with_counts = top.group('Studio').sort('count', descending=True)
studios_with_counts.show(3)

In [None]:
# Just run this cell; it will become more useful in Parts 3.1 and 3.2
studios_of_all_movies = top.column('Studio')
distinct_studios = studios_with_counts.column('Studio')
studio_counts_only = studios_with_counts.column('count')
studio_counts_only

In Parts 3.1 and 3.2 we will present a scenario. Determine which of the following options are true for that scenario, and list every option that applies in the answer cell. If your answer includes any of (i)-(iii), state what you would fill in the blank to make it true: `top`, `studios_with_counts`, `studios_of_all_movies`, `distinct_studios`, or `studio_counts_only`.

(i) This can be done using `sample` and the table _________.

(ii) This can be done using `np.random.choice` and the array ________.

(iii) This can be done using `sample_proportions` and the array _______.

(iv) This cannot be done using `sample` and the data given.

(v) This cannot be done using `np.random.choice` and the data given.

(vi) This cannot be done using `sample_proportions` and the data given.

<!-- BEGIN QUESTION -->

#### Part 3.1 (5 pts)


Simulate a sample of 10 movies drawn at random with replacement from the 200 movies. Output `True` if Paramount appears more often than Warner Brothers among the studios that released the sampled movies, and `False` otherwise.

*Example Answer:* (i) with studios_of_all_movies, (iii) with top, (v)



_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Part 3.2 (5 pts)


Simulate a sample of 10 movies drawn at random with replacement from the 200 movies. Output `True` if the first sampled movie was released by the same studio as the last sampled movie.

*Example Answer:* (i) with studios_of_all_movies, (iii) with top, (v)



_Type your answer here, replacing this text._

<!-- END QUESTION -->


## 4. Earthquakes (25 pts)



The next cell loads a table containing information about **every earthquake with a magnitude above 5** in 2021 (smaller earthquakes are generally not felt, only recorded by very sensitive equipment), compiled by the US Geological Survey. (source: https://earthquake.usgs.gov/earthquakes/search/)

In [None]:
earthquakes = Table.read_table('earthquakes_2021.csv').select(['time', 'mag', 'place'])
earthquakes

If we were studying all human-detectable 2021 earthquakes and had access to the above data, we’d be in good shape. However, if the USGS didn’t publish the full data, we could still learn something about earthquakes from a smaller subsample. If we gathered our sample correctly, we could use that subsample to get an idea about the distribution of magnitudes (above 5, of course) throughout the year!

In the following lines of code, we take two different samples from the earthquake table, and calculate the mean of the magnitudes of these earthquakes.

In [None]:
sample1 = earthquakes.sort('mag', descending = True).take(np.arange(100))
sample1_magnitude_mean = np.mean(sample1.column('mag'))
sample2 = earthquakes.take(np.arange(100))
sample2_magnitude_mean = np.mean(sample2.column('mag'))
[sample1_magnitude_mean, sample2_magnitude_mean]

<!-- BEGIN QUESTION -->

#### Part 4.1 (5 pts)


Are these samples representative of the population of earthquakes in the original table (that is, should we expect each sample mean to be close to the population mean)?

*Hint:* Consider the ordering of the `earthquakes` table. 

_Type your answer here, replacing this text._

<!-- END QUESTION -->

#### Part 4.2 (5 pts)


 Write code to produce a sample of size 200 that is representative of the population. Then, take the mean of the magnitudes of the earthquakes in this sample. Assign these to `representative_sample` and `representative_mean` respectively. 

*Hint:* In class, we learned what kind of samples should be used to properly represent the population.

In [None]:
representative_sample = ...
representative_mean = ...
representative_mean

In [None]:
grader.check("q4.2")

#### Part 4.3 (5 pts)


 Suppose we want to figure out what the biggest magnitude earthquake was in 2021, but we only have our representative sample of 200. Let’s see if trying to find the biggest magnitude in the population from a random sample of 200 is a reasonable idea!

Write code that takes many random samples from the `earthquakes` table and finds the maximum of each sample. You should take a random sample of size 200 and do this 5000 times. Assign the array of maximum magnitudes you find to `maximums`.

In [None]:
maximums = ...
for i in np.arange(5000): 
    ...

In [None]:
grader.check("q4.3")

#### Part 4.4 (5 pts)


 Now find the magnitude of the actual strongest earthquake in 2021 (not the maximum of a sample). This will help us determine whether a random sample of size 200 is likely to help you determine the largest magnitude earthquake in the population.

In [None]:
strongest_earthquake_magnitude = ...
strongest_earthquake_magnitude

In [None]:
grader.check("q4.4")

#### Part 4.5 (5 pts)


Explain whether you believe you can accurately use a sample of size 200 to determine the maximum. What is one problem with using the maximum as your estimator? Use a histogram of `maximums` to help answer.

_Type your answer here, replacing this text._

## 5. Assessing Jade's Models (25 pts)


Before you begin, [Section 10.4](https://inferentialthinking.com/chapters/10/4/Random_Sampling_in_Python.html) of the textbook is a useful reference for this part.

Our friend Jade comes over and asks us to play a game with her. The game works like this: 

> We will draw randomly with replacement from a simplified 13 card deck with 4 face cards (A, J, Q, K), and 9 numbered cards (2, 3, 4, 5, 6, 7, 8, 9, 10). If we draw cards with replacement 13 times, and if the number of face cards is greater than or equal to 4, we lose.
> 
> Otherwise, Jade loses.

We play the game once and we lose, observing 8 total face cards. We are angry and accuse Jade of cheating! Jade is adamant, however, that the deck is fair.

Jade's model claims that there is an equal chance of getting any of the cards (A, 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K), but we do not believe her. We believe that the deck is clearly rigged, with face cards (A, J, Q, K) being more likely than the numbered cards (2, 3, 4, 5, 6, 7, 8, 9, 10).
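The general idea of assessing a model by simulation can be sketched on a different, made-up situation: a coin that someone claims is fair, after we observed 16 heads in 20 tosses. The numbers, the seed, and the use of Python's standard `random` module are all assumptions for illustration; this is not the card game above.

```python
import random

rng = random.Random(42)      # seeded so the sketch is repeatable

tosses = 20
observed_heads = 16          # made-up observation for this illustration
repetitions = 2000

# Simulate the statistic (number of heads) many times under the fair-coin model.
simulated = []
for _ in range(repetitions):
    heads = sum(rng.random() < 0.5 for _ in range(tosses))
    simulated.append(heads)

# How often does the model produce a result at least as large as what we saw?
share_at_least_observed = sum(h >= observed_heads for h in simulated) / repetitions
print(share_at_least_observed)
```

If the model rarely produces a value like the observed one, the data are inconsistent with the model; if such values are common, the model remains plausible.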

#### Part 5.1 (5 pts)


 Assign `deck_model_probabilities` to a two-item array containing the chance of drawing a face card as the first element, and the chance of drawing a numbered card as the second element under Jade's model. Since we're working with probabilities, make sure your values are between 0 and 1. 



In [None]:
deck_model_probabilities = ...
deck_model_probabilities

In [None]:
grader.check("q5.1")

#### Part 5.2 (5 pts)


We believe Jade's model is incorrect. In particular, we believe there is a larger chance of getting a face card. Which of the following statistics can we use during our simulation to test between the model and our alternative? Assign `statistic_choice` to the correct answer.

1. The distance (absolute value) between the actual number of face cards in 13 draws and the expected number of face cards in 13 draws (4)
2. The expected number of face cards in 13 draws (4)
3. The actual number of face cards we get in 13 draws



In [None]:
statistic_choice = ...
statistic_choice

In [None]:
grader.check("q5.2")

#### Part 5.3 (5 pts)


Define the function `deck_simulation_and_statistic`, which, given a sample size and an array of model proportions (like `deck_model_probabilities`), returns the number of face cards in one simulation of drawing cards under the model specified in `model_proportions`.

*Hint:* Think about how you can use the function `sample_proportions`. 



In [None]:
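def deck_simulation_and_statistic(sample_size, model_proportions):
    #BEGIN SOLUTION
    # sample_proportions takes the sample size first and returns proportions;
    # multiply the face-card proportion by sample_size to get a count.
    return sample_proportions(sample_size, model_proportions).item(0) * sample_size
    #END SOLUTION

# Arguments go in the order (sample_size, model_proportions).
deck_simulation_and_statistic(13, deck_model_probabilities)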

In [None]:
grader.check("q5.3")

#### Part 5.4 (5 pts)


Use your function from above to simulate the drawing of 13 cards 5000 times under the proportions in `deck_model_probabilities`. Keep track of all of your statistics in `deck_statistics`.



In [None]:
repetitions = 5000 
...

# Display the simulated statistics for debugging
deck_statistics

In [None]:
grader.check("q5.4")

Let’s take a look at the distribution of simulated statistics.

In [None]:
#Draw a distribution of statistics 
Table().with_column('Deck Statistics', deck_statistics).hist()

<!-- BEGIN QUESTION -->

#### Part 5.5 (5 pts)


 Given your observed value, do you believe that Jade's model is reasonable, or is our alternative (that our deck is rigged) more likely? Explain your answer using the histogram produced above. 



_Type your answer here, replacing this text._

<!-- END QUESTION -->

You're done with Lab 6!  

**Important submission information:** Follow these steps to submit your work:
* Run the tests and verify that they pass as you expect. 
* Choose **Save Notebook** from the **File** menu.
* **Run the final cell** and click the link below to download the zip file. 

Once you have downloaded that file, go to [Gradescope](https://www.gradescope.com/) and submit the zip file to the corresponding assignment. The name of this assignment is "Lab 6". **Be sure your work is saved before running the last cell!**

Once you have submitted, your Gradescope assignment should show you passing all the tests you passed in your assignment notebook.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()