### Class #2: Probability - Examples
In this notebook we'll go over how we simulate data. First we import the necessary packages.

In [4]:
import random
import pandas as pd
import numpy as np

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go

How do we simulate random processes using Python?

In [2]:
dice_rolls = np.arange(1,7)
coin_flips = np.arange(1,3)

print(dice_rolls)
print(coin_flips)

[1 2 3 4 5 6]
[1 2]


Roll one die

In [6]:
all_rolls=[]
for i in range(100):
    roll = np.random.choice(dice_rolls)
    all_rolls.append(roll)

data = [go.Histogram(x=all_rolls, histnorm='probability')]
iplot(data)
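
The loop above draws one roll per iteration. As a possible shortcut, `np.random.choice` also accepts a `size` argument, so all 100 rolls can be drawn in one call. The sketch below assumes that approach; the names `rolls`, `faces`, `counts`, and `rel_freq` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

dice_rolls = np.arange(1, 7)

# Draw 100 rolls in a single call instead of looping.
rolls = np.random.choice(dice_rolls, size=100)

# Count how often each face came up and convert to relative frequencies.
faces, counts = np.unique(rolls, return_counts=True)
rel_freq = counts / counts.sum()

for face, freq in zip(faces, rel_freq):
    print("%d: %.2f" % (face, freq))
```

The relative frequencies are the same quantities the histogram shows with `histnorm='probability'`; with only 100 rolls they will usually scatter noticeably around 1/6.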

Roll two dice

In [7]:
all_rolls=[]
for i in range(10000):
    roll_1 = np.random.choice(dice_rolls)
    roll_2 = np.random.choice(dice_rolls)
    all_rolls.append(roll_1+roll_2)

data = [go.Histogram(x=all_rolls, histnorm='probability')]
iplot(data)
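
For comparison with the simulated histogram, the exact distribution of the sum can be obtained by enumerating all 36 equally likely outcomes. This is a minimal sketch that assumes two fair, independent dice; `sum_counts` and `exact_probs` are illustrative names.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Enumerate all 36 equally likely (die 1, die 2) outcomes and tally the sums.
sum_counts = Counter(a + b for a, b in product(faces, repeat=2))

# Exact probability of each possible sum, from 2 to 12.
exact_probs = {total: Fraction(count, 36)
               for total, count in sorted(sum_counts.items())}

for total, prob in exact_probs.items():
    print("%2d: %s (%.4f)" % (total, prob, float(prob)))
```

The most likely sum is 7 with probability 1/6, which is where the simulated histogram should peak.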

Pick a random sample from a non-uniform but known discrete distribution. In this case we have a die rigged to favor higher-numbered rolls.

In [9]:
all_rolls=[]
for i in range(100):
    roll = np.random.choice(np.arange(1,7),p=[0.05,  0.10,  0.15,  0.20,  0.25,  0.25])
    all_rolls.append(roll)

data = [go.Histogram(x=all_rolls, histnorm='probability')]
iplot(data)

We're going to use this plotting method several times, so let's define it as a function.

In [10]:
def plot_probs(x_, y_):
    trace = go.Scatter(
        x = x_,
        y = y_
    )

    layout = go.Layout(
        xaxis=dict(
            title='Number of Simulations'
        ),
        yaxis=dict(
            title='Probability'
        )
    )

    fig = go.Figure(data = [trace], layout = layout)
    return fig

Let's denote the probability of a single event (rolling a 4) as:
### P(A4)
How can we simulate and show that this is 1/6 (or 0.167)?

In [12]:
dice_rolls = np.arange(1,7)
count_hits = 0
x_data = []
y_data = []

for num_simulations in range(1, 10001):
    roll = np.random.choice(dice_rolls)
    if roll == 4:
        count_hits += 1
    if (num_simulations % 10) == 0:
        y_data.append(float(count_hits) / float(num_simulations))
        x_data.append(num_simulations)

# Plot the figure
fig = plot_probs(x_data, y_data)
iplot(fig)
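
The running estimate plotted above can also be computed without an explicit loop. The sketch below is one possible vectorized version, assuming the same 10,000 fair rolls; `running_estimate` is an illustrative name, while `x_data` and `y_data` mirror the lists built in the loop.

```python
import numpy as np

n = 10000
rolls = np.random.choice(np.arange(1, 7), size=n)

# Running count of 4s divided by the number of rolls so far.
running_estimate = np.cumsum(rolls == 4) / np.arange(1, n + 1)

# Keep every 10th point, matching the sampling used in the loop.
x_data = np.arange(10, n + 1, 10)
y_data = running_estimate[9::10]

print("estimate after %d rolls: %.4f" % (n, running_estimate[-1]))
```

Early estimates swing widely, then settle near 1/6 as the number of rolls grows, which is the behavior the plot is meant to show.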

For two events, the probability that at least one occurs is given by:
## P(A or B) = P(A) + P(B) - P(A and B)
For our dice example, we can see that:<br>
P(A4 or A3) = P(A4) + P(A3) - P(A4 and A3)<br>
P(A4 or A3) = 1/6 + 1/6 - 0<br>
P(A4 or A3) = 2/6 = 1/3 = 0.33


In [13]:
dice_rolls = np.arange(1,7)
count_hits = 0
x_data = []
y_data = []

for num_simulations in range(1, 10001):
    roll = np.random.choice(dice_rolls)
    if roll == 4 or roll == 3:
        count_hits += 1
    if (num_simulations % 10) == 0:
        y_data.append(float(count_hits) / float(num_simulations))
        x_data.append(num_simulations)

# Plot the figure
fig = plot_probs(x_data, y_data)
iplot(fig)
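
The addition rule can also be checked exactly by counting outcomes rather than simulating. This sketch assumes a single fair die; the helper `prob` is hypothetical and defined here only for illustration.

```python
from fractions import Fraction

faces = range(1, 7)

def prob(event):
    """Exact probability of an event (a predicate on one roll) for a fair die."""
    return Fraction(sum(1 for face in faces if event(face)), 6)

p_a4 = prob(lambda r: r == 4)
p_a3 = prob(lambda r: r == 3)
p_both = prob(lambda r: r == 4 and r == 3)   # impossible on a single roll
p_either = prob(lambda r: r == 4 or r == 3)

print(p_either)                             # 1/3
print(p_a4 + p_a3 - p_both == p_either)     # True
```

Because a single roll cannot be both a 3 and a 4, the events are mutually exclusive and the P(A and B) term is zero.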

When two events are independent, the probability that both occur is given by:
## P(A and B) = P(A) * P(B)
So, for our dice example, let's say we have two dice. The probability of rolling sixes on both can be stated as:<br>
P(A6 and B6) = P(A6) * P(B6)<br>
P(A6 and B6) = 1/6 * 1/6 = 1/36 = 0.0278

In [14]:
dice_rolls = np.arange(1,7)
count_hits = 0
x_data = []
y_data = []

for num_simulations in range(1, 10001):
    roll_1 = np.random.choice(dice_rolls)
    roll_2 = np.random.choice(dice_rolls)
    if roll_1 == 6 and roll_2 == 6:
        count_hits += 1
    if (num_simulations % 10) == 0:
        y_data.append(float(count_hits) / float(num_simulations))
        x_data.append(num_simulations)

# Plot the figure
fig = plot_probs(x_data, y_data)
iplot(fig)

Next, we can consider the probability of one event, given that another has been observed. We describe this as:
## P(A | B) = P(A and B) / P(B)
Let's look at this scenario for our dice:<br>
P(A6 | B6) = P(A6 and B6) / P(B6)<br>
P(A6 | B6) = (1/36) / (1/6) = 1/6 = 0.167<br>
This equals P(A6): the two dice are independent, so knowing that one die shows a 6 does not change the probability that the other does.

In [15]:
dice_rolls = np.arange(1,7)
effective_simulations = 0
count_hits = 0
x_data = []
y_data = []

while effective_simulations < 10000:
    roll_1 = np.random.choice(dice_rolls)
    roll_2 = np.random.choice(dice_rolls)
    if roll_1 == 6: 
        effective_simulations += 1
        if roll_2 == 6:
            count_hits += 1
        if (effective_simulations % 10) == 0:
            y_data.append(float(count_hits) / float(effective_simulations))
            x_data.append(effective_simulations)

# Plot the figure
fig = plot_probs(x_data, y_data)
iplot(fig)
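
The conditional probability can be confirmed exactly by enumerating the 36 outcomes of two dice. This is a sketch under the assumption that both dice are fair and independent; `outcomes` and the helper `prob` are illustrative names.

```python
from fractions import Fraction
from itertools import product

# Every (die A, die B) pair is equally likely.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event (a predicate on an (A, B) pair)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

p_a6 = prob(lambda o: o[0] == 6)
p_b6 = prob(lambda o: o[1] == 6)
p_a6_and_b6 = prob(lambda o: o[0] == 6 and o[1] == 6)

# Definition of conditional probability: P(A | B) = P(A and B) / P(B).
p_a6_given_b6 = p_a6_and_b6 / p_b6

print(p_a6_and_b6)              # 1/36
print(p_a6_given_b6)            # 1/6
print(p_a6_given_b6 == p_a6)    # True, as expected for independent events
```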

Simulating the cancer problem

In [19]:
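# Each list is paired with the outcomes [1, 0] in np.random.choice below,
# so the first entry is the probability of outcome 1.
pC = [0.50, 0.50]        # [P(cancer), P(no cancer)]
pTest_C = [0.20, 0.80]   # [P(positive | cancer), P(negative | cancer)]
pTest_NC = [0.14, 0.86]  # [P(positive | no cancer), P(negative | no cancer)]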

sum_C = 0
sum_NC = 0
sum_C_Pos = 0
sum_C_Neg = 0
sum_NC_Pos = 0
sum_NC_Neg = 0

for patient_simulation in range(1,100001):
    has_cancer = np.random.choice([1, 0], p=pC)
    if has_cancer == 1:
        sum_C += 1
        test_pos = np.random.choice([1, 0], p=pTest_C)
    else:
        sum_NC += 1
        test_pos = np.random.choice([1, 0], p=pTest_NC)
    
    if (has_cancer == 1) & (test_pos == 1):
        sum_C_Pos += 1
    elif (has_cancer == 1) & (test_pos == 0):
        sum_C_Neg += 1
    elif (has_cancer == 0) & (test_pos == 1):
        sum_NC_Pos += 1
    else:
        sum_NC_Neg += 1

# Cancer rate is the share of all simulated patients who have cancer.
cancer_rate = (100 * float(sum_C)) / (float(sum_C) + float(sum_NC))
cancer_given_pos_result = (100 * float(sum_C_Pos)) / (float(sum_C_Pos) + float(sum_NC_Pos))

print("%s Patients with cancer" % sum_C)
print("%s Patients without cancer" % sum_NC)
print("%.2f%% Cancer rate" % cancer_rate)
print("%s True Positives" % sum_C_Pos)
print("%s False Positives" % sum_NC_Pos)
print("%s True Negatives" % sum_NC_Neg)
print("%s False Negatives" % sum_C_Neg)
print("%.2f%% With cancer given positive result" % cancer_given_pos_result)


49757 Patients with cancer
50243 Patients without cancer
49.76% Cancer rate
10006 True Positives
7079 False Positives
43164 True Negatives
39751 False Negatives
58.57% With cancer given positive result
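
The simulated figure can be compared with the value Bayes' theorem gives directly. The sketch below reuses the probabilities from the simulation above and assumes the first entry of each list is the probability of outcome 1 (cancer, or a positive test); the variable names are illustrative.

```python
# Parameters copied from the simulation.
p_cancer = 0.50          # P(C)
p_pos_given_c = 0.20     # P(Pos | C), first entry of pTest_C
p_pos_given_nc = 0.14    # P(Pos | NC), first entry of pTest_NC

# Total probability of a positive test.
p_pos = p_pos_given_c * p_cancer + p_pos_given_nc * (1 - p_cancer)

# Bayes' theorem: P(C | Pos) = P(Pos | C) * P(C) / P(Pos).
p_c_given_pos = p_pos_given_c * p_cancer / p_pos

print("%.2f%% With cancer given positive result" % (100 * p_c_given_pos))
```

The exact value is 0.10 / 0.17, about 58.82%, close to the 58.57% from the simulated patients.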


### The M&M Problem:
Blue M&M’s were introduced in 1995. Before then, the color mix in a bag of plain M&M’s was 30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan. Afterward it was 24% Blue, 20% Green, 16% Orange, 14% Yellow, 13% Red, 13% Brown.

Suppose a friend of mine has two bags of M&M’s, and he tells me that one is from 1994 and one from 1996. He won’t tell me which is which, but he gives me one M&M from each bag. One is yellow and one is green. What is the probability that the yellow one came from the 1994 bag?
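
One way to approach this question is Bayes' theorem with two hypotheses: either the yellow M&M came from the 1994 bag (and the green from 1996), or the reverse. The sketch below assumes both hypotheses are equally likely beforehand and that the two draws are independent; the dictionary and variable names are illustrative.

```python
from fractions import Fraction

# Color mixes in percent, taken from the problem statement.
mix_1994 = {"brown": 30, "yellow": 20, "red": 20, "green": 10, "orange": 10, "tan": 10}
mix_1996 = {"blue": 24, "green": 20, "orange": 16, "yellow": 14, "red": 13, "brown": 13}

# Assumed prior: each hypothesis is equally likely before seeing the colors.
prior = Fraction(1, 2)

# Hypothesis A: yellow from the 1994 bag, green from the 1996 bag.
like_a = Fraction(mix_1994["yellow"], 100) * Fraction(mix_1996["green"], 100)
# Hypothesis B: yellow from the 1996 bag, green from the 1994 bag.
like_b = Fraction(mix_1996["yellow"], 100) * Fraction(mix_1994["green"], 100)

# Posterior probability of hypothesis A.
posterior_a = prior * like_a / (prior * like_a + prior * like_b)

print(posterior_a)                   # 20/27
print("%.3f" % float(posterior_a))   # 0.741
```

Under these assumptions the yellow M&M came from the 1994 bag with probability 20/27, roughly 74%.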