<h1 style="text-align: center">
<div style="color: #DD3403; font-size: 60%">Data Science DISCOVERY MicroProject</div>
<span style="">MicroProject: Simulation for Ten Heads in a Row</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/simulation-for-ten-heads-in-a-row/">https://discovery.cs.illinois.edu/microproject/simulation-for-ten-heads-in-a-row/</a></div>
</h1>

<hr style="color: #DD3403;">

## Data Source: Your Simulation of Flipping a Coin

Simulation is a powerful tool that allows us to run an event with a probabilistic outcome millions of times in under a second.  In this MicroProject, you will use a simple simulation to flip a coin a million times and discover how to find trends in the simulated data.  After writing this simulation, you will do analysis that compiles data over multiple observations -- a simple form of "time-series analysis" -- to find out whether the simulated probability of an event matches its statistical probability.

### Create Your Simulation

Create a simulation of flipping a fair coin 1,000,000 times.  Record an `"H"` for heads (50%) and a `"T"` for tails (50%), store the coin flip result in a column called `flip`, and store all 1,000,000 observations in a DataFrame `df`:

In [149]:
# Simulation: flip a fair coin 1,000,000 times
import pandas as pd
import random

data = []
for i in range(1000000):
    flip = random.choice(["H", "T"])  # "H" and "T" are equally likely
    d = {"flip": flip}
    data.append(d)
df = pd.DataFrame(data)

In [150]:
df

Unnamed: 0,flip
0,H
1,T
2,T
3,H
4,T
...,...
999995,H
999996,H
999997,T
999998,H
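
The simulation above calls `random.choice` once per flip. As one possible alternative, the sketch below draws all of the flips in a single vectorized call. It assumes NumPy is installed, which this notebook does not otherwise require, and the names `rng`, `flips`, and `df_fast` are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical alternative: draw all 1,000,000 flips in one vectorized call.
rng = np.random.default_rng()
flips = rng.choice(["H", "T"], size=1_000_000)  # each outcome is equally likely
df_fast = pd.DataFrame({"flip": flips})

# Share of each outcome; both values should be close to 0.5.
print(df_fast.flip.value_counts(normalize=True))
```

Either approach produces a `flip` column with the same structure, so the rest of the analysis would not need to change.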


### 🔬 Checkpoint Tests 🔬

In [151]:
### TEST CASE for Part 1: Initial Simulation
tada = "\N{PARTY POPPER}"

assert("df" in vars())
assert("flip" in df)
assert(len(df[df.flip == "H"]) > 400000)
assert(len(df[df.flip == "T"]) > 400000)
assert(len(df[df.flip == "H"]) + len(df[df.flip == "T"]) == 1e6)

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Part 2: Finding the Two-Coin Sequence

The chance of getting a fair coin to land on "heads" is 50% -- but how common is it to get 2 heads in a row?  Before we write the Python code to find the simulated probability, let's calculate the true probability.

### Statistical Probability of Two Heads

In the cell below, calculate `P_twoheads` to be the probability of flipping two fair coins and having both land on heads:

In [152]:
P_twoheads = (1/2)*(1/2)
P_twoheads

0.25

### Sequences of Two Heads

To find the simulated probability of two heads, we need to observe **sequences of two coin flips**.  Instead of writing a new simulation, we will calculate each two-coin sequence by evaluating every row and calculating the sequence that includes:

- The result of the current observation's coin flip
- The result of the previous observation's coin flip

For each observation of the simulation, create a new column called `seq2` that contains the two-coin sequence as described above (e.g., `HH`, `HT`, `TH`, or `TT`).

***Unsure about how to get the previous observation's coin flip?***  Read the DISCOVERY guide "Using Previous Observations when Computation Values in a DataFrame" to find out more on how to use a previous observation's value with the `shift` function in your formula:
- [Guide: "Using Previous Observations when Computation Values in a DataFrame"](https://discovery.cs.illinois.edu/guides/DataFrame-Fundamentals/Using-Previous-Observations-when-Computation-Values/)
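
As a small illustration of what `shift` does, the sketch below uses a made-up five-flip DataFrame. The names `demo` and `previous` are hypothetical and are not part of the MicroProject:

```python
import pandas as pd

# A tiny five-flip example to show what shift(1) does.
demo = pd.DataFrame({"flip": ["H", "T", "T", "H", "H"]})

# shift(1) moves every value down one row, so each row sees the previous flip.
demo["previous"] = demo.flip.shift(1)

# Current flip followed by the previous flip.
# The first row has no previous flip, so its sequence is missing (NaN).
demo["seq2"] = demo.flip + demo.previous
print(demo)
```

The first row has a missing `seq2` value because there is no observation before it.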

In [153]:
df["seq2"] = df.flip+df.flip.shift(1)
df

Unnamed: 0,flip,seq2
0,H,
1,T,TH
2,T,TT
3,H,HT
4,T,TH
...,...,...
999995,H,HH
999996,H,HH
999997,T,TH
999998,H,HT


### Simulated Probability of Two Heads

Create a new DataFrame called `df_twoheads` that contains all the rows that contain a sequence of two heads:

In [154]:
df_twoheads = df[df.seq2 == "HH"]
df_twoheads

Unnamed: 0,flip,seq2
6,H,HH
11,H,HH
12,H,HH
13,H,HH
16,H,HH
...,...,...
999991,H,HH
999992,H,HH
999995,H,HH
999996,H,HH


Finally, we calculate the probability of two heads:

In [155]:
P_sim_twoheads = len(df_twoheads) / len(df)
P_sim_twoheads

0.250831

### Simulation Error

To find the error in a result, we can use the following formula:

$$\text{error} = \frac{|\text{actual} - \text{expected}|}{\text{expected}}$$

Using the variables you created earlier, `P_twoheads` and `P_sim_twoheads`, calculate the simulation error in the result and store the error in the variable `error`.

- Useful Function: The `abs` function in Python can be used to find the absolute value.  For example, `abs(-5)` will return `5`.
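
To see the formula at work on made-up numbers, here is a minimal sketch. The helper `relative_error` and the values 0.26 and 0.25 are hypothetical and chosen only for illustration:

```python
def relative_error(actual, expected):
    """Return |actual - expected| / expected."""
    return abs(actual - expected) / expected

# Hypothetical numbers: a simulated probability of 0.26 against a true value of 0.25.
example_error = relative_error(0.26, 0.25)
print(f"{example_error:.2%}")
```

Dividing by the expected value expresses the error as a fraction of the true probability, which makes errors for common and rare events comparable.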

In [156]:
error = abs(P_sim_twoheads - P_twoheads)/P_twoheads
error

0.0033240000000001046

In [157]:
print(f"Your simulation error was: {(error * 100):.3f}%")

Your simulation error was: 0.332%


### 🔬 Checkpoint Tests 🔬

In [158]:
### TEST CASE for Part 2: Finding the Two-Coin Sequence
import math

tada = "\N{PARTY POPPER}"

assert((P_twoheads) == 25e-2)

assert("seq2" in df)
assert((df.loc[:, "seq2"].isnull().sum()) == 1)

assert(abs(P_sim_twoheads - P_twoheads) < 0.05)
assert(error >= 0)
assert(math.isclose(error - abs((P_sim_twoheads - P_twoheads)/P_twoheads), 0))

print(f"{tada} All Tests Passed! {tada}")


🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Part 3: Flipping Ten Heads in a Row!

I think it would be awesome to flip a coin and get ten heads in a row!

### Statistical Probability of Ten Heads

Before we look at our data, what is the statistical probability of flipping a fair coin ten times and getting ten heads?

In [159]:
P_tenheads = (1/2)**10
P_tenheads

0.0009765625
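
Because each flip is independent, the probability of `n` heads in a row is `(1/2)` raised to the power `n`. The sketch below wraps that in a hypothetical helper, `p_all_heads`, to show how quickly the probability shrinks as the run gets longer:

```python
def p_all_heads(n):
    """Probability that n independent fair coin flips all land on heads."""
    return 0.5 ** n

# Compare a single flip, two in a row, and ten in a row.
for n in (1, 2, 10):
    p = p_all_heads(n)
    print(f"{n} heads in a row: p = {p}, or 1 in {round(1 / p)}")
```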

### Simulation of Ten Heads in a Row

Using the same technique as above, create a new column `seq10` that contains the sequence of 10 coin flips that starts with your current observation's coin flip and adds the previous nine coin flips:


In [160]:
df["seq10"] = df.flip+df.flip.shift(1)+df.flip.shift(2)+df.flip.shift(3)+df.flip.shift(4)+df.flip.shift(5)+df.flip.shift(6)+df.flip.shift(7)+df.flip.shift(8)+df.flip.shift(9)
df

Unnamed: 0,flip,seq2,seq10
0,H,,
1,T,TH,
2,T,TT,
3,H,HT,
4,T,TH,
...,...,...,...
999995,H,HH,HHTHHHTHTT
999996,H,HH,HHHTHHHTHT
999997,T,TH,THHHTHHHTH
999998,H,HT,HTHHHTHHHT
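
Chaining many `shift` calls works, but it gets long as the run length grows. One alternative, which is not the approach this notebook asks for, is to count heads in a rolling window. The sketch below uses a made-up eight-flip example and a run length of 3 so the output stays readable; `demo`, `run_length`, and `all_heads` are hypothetical names:

```python
import pandas as pd

# Tiny hypothetical example: a run length of 3 instead of 10, for readability.
demo = pd.DataFrame({"flip": list("HHHHTHHH")})
run_length = 3

# 1 where the flip is heads, 0 otherwise.
is_heads = (demo.flip == "H").astype(int)

# A window of run_length flips is all heads exactly when its sum equals run_length.
# The first run_length - 1 rows have incomplete windows and evaluate to False.
demo["all_heads"] = is_heads.rolling(run_length).sum() == run_length
print(demo)
print(int(demo.all_heads.sum()))
```

Setting `run_length = 10` on the full simulation would flag the rows that end a run of ten heads, though this variant does not build the `seq10` column that the checkpoint tests expect.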


### Simulated Probability of Ten Heads

Create a new DataFrame `df_tenheads` that contains all sequences of ten heads in a row:

In [161]:
df_tenheads = df[df.seq10=="HHHHHHHHHH"]
df_tenheads

Unnamed: 0,flip,seq2,seq10
175,H,HH,HHHHHHHHHH
2431,H,HH,HHHHHHHHHH
2580,H,HH,HHHHHHHHHH
2581,H,HH,HHHHHHHHHH
4187,H,HH,HHHHHHHHHH
...,...,...,...
993201,H,HH,HHHHHHHHHH
993202,H,HH,HHHHHHHHHH
998534,H,HH,HHHHHHHHHH
998535,H,HH,HHHHHHHHHH


### Finding the Simulation Error

First, calculate the simulated probability of getting heads 10 times in a row and store it in `P_sim_tenheads`:

In [162]:
P_sim_tenheads = len(df_tenheads)/len(df)
P_sim_tenheads

0.000971


Using the variables you created earlier, `P_tenheads` and `P_sim_tenheads`, calculate the simulation error in the result and store the error in the variable `error10`.

In [163]:
error10 = abs(P_sim_tenheads-P_tenheads)/P_tenheads
error10

0.005696000000000034

In [164]:
print(f"Analysis: In a simulation of {len(df)} coin flips, we flipped heads 2 times in a row a total of {len(df_twoheads)} times!")
print(f"Your simulation error for 2 heads in a row was: {(error * 100):.3f}%")
print()
print(f"Analysis: In a simulation of {len(df)} coin flips, we flipped heads 10 times in a row a total of {len(df_tenheads)} times!")
print(f"Your simulation error for 10 heads in a row was: {(error10 * 100):.3f}%")

Analysis: In a simulation of 1000000 coin flips, we flipped heads 2 times in a row a total of 250831 times!
Your simulation error for 2 heads in a row was: 0.332%

Analysis: In a simulation of 1000000 coin flips, we flipped heads 10 times in a row a total of 971 times!
Your simulation error for 10 heads in a row was: 0.570%


### 🤔 Reflection: Observing the Simulation Error 🤔

Press the **"Run All"** button at the top of your notebook to run this entire notebook again (and again, about 5 times) and see how the analysis values above change.  Note the trends you see in the percentage errors.

Nearly every time you run this notebook (but not always), the percentage simulation error for two heads in a row will be **much smaller** than the percentage error for ten heads in a row.

The two most significant factors that lower simulation error are:

1. Number of Simulations -- The more times you run a simulation, the lower your expected simulation error.
2. Frequency of Success -- The more common the event happens, the lower your expected simulation error.

Since we ran the simulation a large number of times (1,000,000), the simulation error for common events (`"HH"` happens 25% of the time) will usually be very low.  However, for very rare events (there's only a 1 in 1,024 chance of getting `"HHHHHHHHHH"`), the simulation error will often be quite high -- even with 1,000,000 simulations.
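
The two factors above can be made rough and quantitative. For `n` independent trials of an event with probability `p`, the standard error of the simulated proportion is $\sqrt{p(1-p)/n}$, so the error relative to `p` is about $\sqrt{(1-p)/(np)}$. The sketch below evaluates that expression. It is only an approximation here, because the overlapping sequences in this notebook are not independent, and `expected_relative_error` is a hypothetical helper:

```python
import math

def expected_relative_error(p, n):
    """Approximate relative standard error of a simulated proportion.

    Assumes n independent trials, which only approximates the
    overlapping sequences used in this notebook.
    """
    return math.sqrt((1 - p) / (n * p))

n = 1_000_000
for label, p in [("2 heads", 0.25), ("10 heads", 0.5 ** 10)]:
    print(f"{label}: about {expected_relative_error(p, n):.2%}")
```

Under these assumptions the typical error for ten heads in a row is more than ten times the typical error for two heads in a row, and both shrink as `n` grows.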

### 🔬 Checkpoint Tests 🔬

In [166]:
### TEST CASE for Part 3: Flipping Ten Heads in a Row
import math
tada = "\N{PARTY POPPER}"

assert(math.isclose((P_tenheads), 9.765625e-4))

assert("seq10" in df)
assert((df.loc[:, "seq10"].isnull().sum()) == 9)

assert(len(df_tenheads) > 0)
assert(len(df_tenheads) < 10000)
assert(len(df[df.loc[:, "seq10"] == "H" * 10]) == len(df_tenheads))

assert(error10 > 0)
assert(error10 < 0.5)
assert(math.isclose(error10 - abs((P_sim_tenheads - P_tenheads)/P_tenheads), 0))

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Submission

You're almost done!  All you need to do is commit your notebook to GitHub and run the GitHub Actions Grader:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and return to https://discovery.cs.illinois.edu/microproject/simulation-for-ten-heads-in-a-row/ and complete the section **"Commit and Grade Your Notebook"**.

3. If you see a 100% grade result on your GitHub Action, you've completed this MicroProject! 🎉