<h1 style="text-align: center">
<div style="color: #DD3403; font-size: 60%">Data Science DISCOVERY MicroProject</div>
<span style="">MicroProject: Simulation for Ten Heads in a Row</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/simulation-for-ten-heads-in-a-row/">https://discovery.cs.illinois.edu/microproject/simulation-for-ten-heads-in-a-row/</a></div>
</h1>

<hr style="color: #DD3403;">

## Data Source: Your Simulation of Flipping a Coin

Simulation is a powerful tool that allows us to run an event with a probabilistic outcome millions of times in under a second.  In this MicroProject, you will use a simple simulation to flip a coin a million times and discover how to find trends in the simulated data.  After writing this simulation, you will do analysis that compiles data over multiple observations -- a simple form of "time-series analysis" -- to find out whether the statistical probability of an event matches its simulated probability.

### Create Your Simulation

Create a simulation of flipping a fair coin 1,000,000 times.  Record an `"H"` for heads (50%) and a `"T"` for tails (50%), store the coin flip result in a column called `flip`, and store all 1,000,000 observations in a DataFrame `df`:

In [106]:
import random
import pandas as pd
data = []
# Your Simulation:
for i in range(1000000):
    n = random.randint(1,2)
    if (n == 1):
        flip = "H"
    else:
        flip = "T"
    d = {"flip": flip}
    data.append(d)



# Store your simulation in the DataFrame called `df`:
df = pd.DataFrame(data)

In [107]:
df

Unnamed: 0,flip
0,H
1,T
2,T
3,T
4,T
...,...
999995,T
999996,H
999997,H
999998,T
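
As an aside, the same kind of simulation can be drawn in a single vectorized call instead of a Python loop. The following is a minimal sketch, not the notebook's required approach: it assumes NumPy is available (pandas depends on it), and the names `rng` and `df_fast` as well as the seed value are hypothetical choices made for this illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(seed=2024)

# Draw all 1,000,000 flips at once; "H" and "T" are equally likely.
flips = rng.choice(["H", "T"], size=1_000_000)

# `df_fast` is a hypothetical name, used so this sketch does not overwrite `df`.
df_fast = pd.DataFrame({"flip": flips})

# Roughly half of the rows should be heads and half tails.
print(df_fast["flip"].value_counts())
```

Either approach produces a `flip` column of `"H"` and `"T"` values; the loop version makes each step of the simulation explicit, while the vectorized version is usually faster.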


### 🔬 Checkpoint Tests 🔬

In [108]:
### TEST CASE for Part 1: Initial Simulation
tada = "\N{PARTY POPPER}"

assert("df" in vars())
assert("flip" in df)
assert(len(df[df.flip == "H"]) > 400000)
assert(len(df[df.flip == "T"]) > 400000)
assert(len(df[df.flip == "H"]) + len(df[df.flip == "T"]) == 1e6)

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Part 2: Finding the Two-Coin Sequence

The chance of getting a fair coin to land on "heads" is 50% -- but how common is it to get 2 heads in a row?  Before we write the Python code to find the simulated probability, let's calculate the true probability.

### Statistical Probability of Two Heads

In the cell below, calculate `P_twoheads` to be the probability that two fair coin flips both land on heads:

In [109]:
P_twoheads = 0.5*0.5
P_twoheads

0.25
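
One way to double-check this value is to list every equally likely outcome of two flips and count the favorable ones. This is a small sketch for illustration; the names `outcomes` and `p_enumerated` are hypothetical and not part of the notebook.

```python
from itertools import product

# All equally likely outcomes of two fair coin flips.
outcomes = ["".join(pair) for pair in product("HT", repeat=2)]

# Favorable outcomes divided by total outcomes.
p_enumerated = outcomes.count("HH") / len(outcomes)

print(outcomes)      # ['HH', 'HT', 'TH', 'TT']
print(p_enumerated)  # 0.25
```

Only one of the four outcomes is `HH`, which agrees with multiplying the two independent probabilities, 0.5 * 0.5.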

### Sequences of Two Heads

To find the simulated probability of two heads, we need to observe **sequences of two coin flips**.  Instead of writing a new simulation, we will calculate each two-coin sequence by evaluating every row and calculating the sequence that includes:

- The result of the current observation's coin flip
- The result of the previous observation's coin flip

For each observation of the simulation, create a new column called `seq2` that contains the two-coin sequence as described above (e.g., `HH`, `HT`, `TH`, or `TT`).

***Unsure about how to get the previous observation's coin flip?***  Read the DISCOVERY guide "Using Previous Observations when Computation Values in a DataFrame" to find out more on how to use a previous observation's value with the `shift` function in your formula:
- [Guide: "Using Previous Observations when Computation Values in a DataFrame"](https://discovery.cs.illinois.edu/guides/DataFrame-Fundamentals/Using-Previous-Observations-when-Computation-Values/)
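
To see what `shift` does before applying it to a million rows, it can help to try it on a few hand-written flips. The sketch below uses a hypothetical four-row DataFrame named `demo`; it is only an illustration of the idea, not the notebook's data.

```python
import pandas as pd

# A tiny hand-written example; `demo` is a hypothetical name.
demo = pd.DataFrame({"flip": ["H", "T", "T", "H"]})

# shift(1) moves every value down one row, so each row sees the flip before it.
demo["previous_flip"] = demo["flip"].shift(1)

# String addition joins the two flips. The first row has no previous flip,
# so its sequence is missing (NaN).
demo["seq2"] = demo["previous_flip"] + demo["flip"]

print(demo)
```

The first row's missing value is expected: there is no earlier observation to pair it with.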

In [110]:
df["previous_flip"]  = df["flip"].shift(1)
df["seq2"] = df['previous_flip'] + df['flip']

### Simulated Probability of Two Heads

Create a new DataFrame called `df_twoheads` that contains all the rows that contain a sequence of two heads:

In [111]:
df_twoheads = df[df["seq2"] == "HH"]
df_twoheads

Unnamed: 0,flip,previous_flip,seq2
7,H,H,HH
8,H,H,HH
15,H,H,HH
16,H,H,HH
37,H,H,HH
...,...,...,...
999986,H,H,HH
999987,H,H,HH
999993,H,H,HH
999994,H,H,HH


Finally, calculate the simulated probability of two heads:

In [112]:
P_sim_twoheads = len(df_twoheads) / len(df)
P_sim_twoheads

0.250522
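
Note that the first row has no two-coin sequence, so dividing by `len(df)` counts one row that can never be `HH`. With 1,000,000 rows the difference is negligible, but it is visible on small data. The sketch below uses a hypothetical six-flip DataFrame named `demo` to compare the two denominators.

```python
import pandas as pd

# Six hand-written flips; `demo` is a hypothetical name.
demo = pd.DataFrame({"flip": list("HHTHHH")})
demo["seq2"] = demo["flip"].shift(1) + demo["flip"]

# True where the two-coin sequence is "HH"; the missing first row compares as False.
is_hh = demo["seq2"] == "HH"

# Fraction of all rows, which matches len(matching rows) / len(demo).
p_all_rows = is_hh.mean()

# Fraction of rows that actually have a two-coin sequence.
p_valid_rows = is_hh.sum() / demo["seq2"].notna().sum()

print(p_all_rows)    # 3 of 6 rows
print(p_valid_rows)  # 3 of 5 sequences
```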

### Simulation Error

To find the error in a simulation result when we know the true probability, the following formula can be used:

$$error = \frac{|true\_probability - simulated\_probability|}{true\_probability}$$

Using the variables you created earlier, `P_twoheads` and `P_sim_twoheads`, calculate the simulation error in the result and store the error in the variable `error`.

- Useful Function: The `abs` function in Python can be used to find the absolute value.  For example, `abs(-5)` will return `5`.
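
The formula can also be wrapped in a small helper and tried on made-up numbers first. In this sketch the function name `relative_error` and the example values 0.5 and 0.52 are hypothetical, chosen only to show the arithmetic.

```python
def relative_error(true_probability, simulated_probability):
    """Absolute difference between the two values, as a fraction of the true value."""
    return abs(true_probability - simulated_probability) / true_probability

# Made-up example: true probability 0.5, simulated probability 0.52.
example_error = relative_error(0.5, 0.52)

# |0.5 - 0.52| / 0.5 is about 0.04, i.e. about a 4% error.
print(f"{example_error:.1%}")
```

Because the difference is divided by the true probability, the same absolute difference counts as a larger error when the event is rarer.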

In [113]:
error = abs(P_twoheads - P_sim_twoheads) / P_twoheads
error

0.00208800000000009

In [114]:
print(f"Your simulation error was: {(error * 100):.3f}%")

Your simulation error was: 0.209%


### 🔬 Checkpoint Tests 🔬

In [115]:
### TEST CASE for Part 2: Finding the Two-Coin Sequence
import math

tada = "\N{PARTY POPPER}"

assert((P_twoheads) == 25e-2)

assert("seq2" in df)
assert((df.loc[:, "seq2"].isnull().sum()) == 1)

assert(abs(P_sim_twoheads - P_twoheads) < 0.05)
assert(error >= 0)
assert(math.isclose(error - abs((P_sim_twoheads - P_twoheads)/P_twoheads), 0))

print(f"{tada} All Tests Passed! {tada}")


🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Part 3: Flipping Ten Heads in a Row!

It would be awesome to flip a coin and get **ten heads in a row**!

### Statistical Probability of Ten Heads

Before we look at our data, what is the statistical probability of flipping a fair coin ten times and getting ten heads?

In [116]:
P_tenheads = 0.5**10
P_tenheads

0.0009765625

### Simulation of Ten Heads in a Row

Using the same technique as above, create a new column `seq10` that contains the sequence of 10 coin flips made up of the previous nine coin flips followed by your current observation's coin flip:


In [117]:
# Build each longer sequence from the one before it: for k = 2..9, shift the
# flips back k rows, then prepend that earlier flip to the sequence of length k.
for k in range(2, 10):
    df[f"previous{k}_flip"] = df["flip"].shift(k)
    df[f"seq{k + 1}"] = df[f"previous{k}_flip"] + df[f"seq{k}"]


df

Unnamed: 0,flip,previous_flip,seq2,previous2_flip,seq3,previous3_flip,seq4,previous4_flip,seq5,previous5_flip,seq6,previous6_flip,seq7,previous7_flip,seq8,previous8_flip,seq9,previous9_flip,seq10
0,H,,,,,,,,,,,,,,,,,,
1,T,H,HT,,,,,,,,,,,,,,,,
2,T,T,TT,H,HTT,,,,,,,,,,,,,,
3,T,T,TT,T,TTT,H,HTTT,,,,,,,,,,,,
4,T,T,TT,T,TTT,T,TTTT,H,HTTTT,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999995,T,H,HT,H,HHT,H,HHHT,T,THHHT,H,HTHHHT,T,THTHHHT,T,TTHTHHHT,H,HTTHTHHHT,H,HHTTHTHHHT
999996,H,T,TH,H,HTH,H,HHTH,H,HHHTH,T,THHHTH,H,HTHHHTH,T,THTHHHTH,T,TTHTHHHTH,H,HTTHTHHHTH
999997,H,H,HH,T,THH,H,HTHH,H,HHTHH,H,HHHTHH,T,THHHTHH,H,HTHHHTHH,T,THTHHHTHH,T,TTHTHHHTHH
999998,T,H,HT,H,HHT,T,THHT,H,HTHHT,H,HHTHHT,H,HHHTHHT,T,THHHTHHT,H,HTHHHTHHT,T,THTHHHTHHT


In [118]:
df.head(10)

Unnamed: 0,flip,previous_flip,seq2,previous2_flip,seq3,previous3_flip,seq4,previous4_flip,seq5,previous5_flip,seq6,previous6_flip,seq7,previous7_flip,seq8,previous8_flip,seq9,previous9_flip,seq10
0,H,,,,,,,,,,,,,,,,,,
1,T,H,HT,,,,,,,,,,,,,,,,
2,T,T,TT,H,HTT,,,,,,,,,,,,,,
3,T,T,TT,T,TTT,H,HTTT,,,,,,,,,,,,
4,T,T,TT,T,TTT,T,TTTT,H,HTTTT,,,,,,,,,,
5,T,T,TT,T,TTT,T,TTTT,T,TTTTT,H,HTTTTT,,,,,,,,
6,H,T,TH,T,TTH,T,TTTH,T,TTTTH,T,TTTTTH,H,HTTTTTH,,,,,,
7,H,H,HH,T,THH,T,TTHH,T,TTTHH,T,TTTTHH,T,TTTTTHH,H,HTTTTTHH,,,,
8,H,H,HH,H,HHH,T,THHH,T,TTHHH,T,TTTHHH,T,TTTTHHH,T,TTTTTHHH,H,HTTTTTHHH,,
9,T,H,HT,H,HHT,H,HHHT,T,THHHT,T,TTHHHT,T,TTTHHHT,T,TTTTHHHT,T,TTTTTHHHT,H,HTTTTTHHHT


### Simulated Probability of Ten Heads

Create a new DataFrame `df_tenheads` that contains all sequences of ten heads in a row:

In [119]:
df_tenheads = df[df["seq10"] == "H"*10]
df_tenheads

Unnamed: 0,flip,previous_flip,seq2,previous2_flip,seq3,previous3_flip,seq4,previous4_flip,seq5,previous5_flip,seq6,previous6_flip,seq7,previous7_flip,seq8,previous8_flip,seq9,previous9_flip,seq10
3346,H,H,HH,H,HHH,H,HHHH,H,HHHHH,H,HHHHHH,H,HHHHHHH,H,HHHHHHHH,H,HHHHHHHHH,H,HHHHHHHHHH
3347,H,H,HH,H,HHH,H,HHHH,H,HHHHH,H,HHHHHH,H,HHHHHHH,H,HHHHHHHH,H,HHHHHHHHH,H,HHHHHHHHHH
3348,H,H,HH,H,HHH,H,HHHH,H,HHHHH,H,HHHHHH,H,HHHHHHH,H,HHHHHHHH,H,HHHHHHHHH,H,HHHHHHHHHH
5760,H,H,HH,H,HHH,H,HHHH,H,HHHHH,H,HHHHHH,H,HHHHHHH,H,HHHHHHHH,H,HHHHHHHHH,H,HHHHHHHHHH
5761,H,H,HH,H,HHH,H,HHHH,H,HHHHH,H,HHHHHH,H,HHHHHHH,H,HHHHHHHH,H,HHHHHHHHH,H,HHHHHHHHHH
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999857,H,H,HH,H,HHH,H,HHHH,H,HHHHH,H,HHHHHH,H,HHHHHHH,H,HHHHHHHH,H,HHHHHHHHH,H,HHHHHHHHHH
999858,H,H,HH,H,HHH,H,HHHH,H,HHHHH,H,HHHHHH,H,HHHHHHH,H,HHHHHHHH,H,HHHHHHHHH,H,HHHHHHHHHH
999859,H,H,HH,H,HHH,H,HHHH,H,HHHHH,H,HHHHHH,H,HHHHHHH,H,HHHHHHHH,H,HHHHHHHHH,H,HHHHHHHHHH
999860,H,H,HH,H,HHH,H,HHHH,H,HHHHH,H,HHHHHH,H,HHHHHHH,H,HHHHHHHH,H,HHHHHHHHH,H,HHHHHHHHHH
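
As a cross-check on the string-sequence approach, runs of heads can also be found by counting heads in a sliding window. This is an alternative technique, not the method this notebook asks for. The sketch uses a hypothetical six-flip DataFrame `demo` and a window of 3 so the result is easy to verify by hand; on the full simulation, a window of 10 would be expected to flag the same rows as `seq10 == "H" * 10`.

```python
import pandas as pd

# Six hand-written flips; `demo` and `window` are hypothetical names.
demo = pd.DataFrame({"flip": list("THHHHT")})
window = 3

# 1 for heads, 0 for tails.
is_heads = (demo["flip"] == "H").astype(int)

# A window made up entirely of heads sums to the window length.
# Rows without a full window give NaN, which compares as False.
run_ends_here = is_heads.rolling(window).sum() == window

print(run_ends_here.tolist())    # [False, False, False, True, True, False]
print(int(run_ends_here.sum()))  # 2
```

Overlapping runs are counted separately here, just as in `df_tenheads`: four heads in a row contain two runs of three.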


### Finding the Simulation Error

First, calculate the simulated probability of getting heads 10 times in a row and store it in `P_sim_tenheads`:

In [120]:
P_sim_tenheads = len(df_tenheads) / len(df)
P_sim_tenheads

0.001048


Using the variables you created earlier, `P_tenheads` and `P_sim_tenheads`, calculate the simulation error in the result and store the error in the variable `error10`.

In [121]:
error10 =  abs(P_tenheads - P_sim_tenheads) / P_tenheads
error10

0.0731520000000001

In [122]:
print(f"Analysis: In a simulation of {len(df)} coin flips, we flipped heads 2 times in a row a total of {len(df_twoheads)} times!")
print(f"Your simulation error for 2 heads in a row was: {(error * 100):.3f}%")
print()
print(f"Analysis: In a simulation of {len(df)} coin flips, we flipped heads 10 times in a row a total of {len(df_tenheads)} times!")
print(f"Your simulation error for 10 heads in a row was: {(error10 * 100):.3f}%")

Analysis: In a simulation of 1000000 coin flips, we flipped heads 2 times in a row a total of 250522 times!
Your simulation error for 2 heads in a row was: 0.209%

Analysis: In a simulation of 1000000 coin flips, we flipped heads 10 times in a row a total of 1048 times!
Your simulation error for 10 heads in a row was: 7.315%


### 🤔 Reflection: Observing the Simulation Error 🤔

Press the **"Run All"** button at the top of your notebook to run this entire notebook again (and again, about 5 times) and see how the analysis values above change.  Note the trends you see in the percentage errors.

Nearly every time you run this notebook (but not always), the percentage of **simulation error for flipping two coins will be MUCH SMALLER than the percentage error in flipping ten coins**.

Two significant factors that lower simulation error are:

1. Number of Simulations -- The more times you run a simulation, the lower your expected simulation error.
2. Frequency of Event -- The more common the event, the lower your expected simulation error.

Since we ran the simulation a large number of times (1,000,000), the error rate for common events (`"HH"` happens 25% of the time) will usually be very low.  However, for very rare events (there's only a 1:1024 chance of getting `"HHHHHHHHHH"`), the simulation error will often be quite high -- even with 1,000,000 simulations.
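
Both factors appear in a standard rough estimate of the typical relative error when an event with probability `p` is counted over `n` independent trials: `sqrt((1 - p) / (n * p))`. The sketch below evaluates that estimate for the two events. The function name `approx_relative_error` is hypothetical, and the estimate assumes independent trials; the overlapping sequences in this notebook are not independent, so the actual error for ten heads in a row can be noticeably larger than this estimate.

```python
import math

def approx_relative_error(p, n):
    """Rough typical relative error for an event of probability p over n independent trials."""
    return math.sqrt((1 - p) / (n * p))

n = 1_000_000

# Common event: two heads in a row (p = 0.25).
estimate_two = approx_relative_error(0.5**2, n)

# Rare event: ten heads in a row (p = 1/1024).
estimate_ten = approx_relative_error(0.5**10, n)

print(f"Two heads: about {estimate_two:.3%}")   # roughly 0.17%
print(f"Ten heads: about {estimate_ten:.3%}")   # roughly 3.2%
```

Increasing `n` or `p` makes the value under the square root smaller, which matches the two factors listed above.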

In [123]:
len(df_tenheads)
len(df[df.loc[:, "seq10"] == "H" * 10])

1048

### 🔬 Checkpoint Tests 🔬

In [124]:
### TEST CASE for Part 3: Flipping Ten Heads in a Row
import math
tada = "\N{PARTY POPPER}"

assert(math.isclose((P_tenheads), 9.765625e-4))

assert("seq10" in df)
assert((df.loc[:, "seq10"].isnull().sum()) == 9)

assert(len(df_tenheads) > 0)
assert(len(df_tenheads) < 10000)
assert(len(df[df.loc[:, "seq10"] == "H" * 10]) == len(df_tenheads))

assert(error10 > 0)
assert(error10 < 0.5)
assert(math.isclose(error10 - abs((P_sim_tenheads - P_tenheads)/P_tenheads), 0))

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Submission

You're almost done!  All you need to do is to commit your lab to GitHub and run the GitHub Actions Grader:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and return to https://discovery.cs.illinois.edu/microproject/simulation-for-ten-heads-in-a-row/ and complete the section **"Commit and Grade Your Notebook"**.

3. If you see a 100% grade result on your GitHub Action, you've completed this MicroProject! 🎉