# Essential Math for Data Science

## 2. Probability

### Joint Probabilities

In [4]:
p_heads = 1/2
p_six   = 1/6

p_both  = p_heads * p_six

p_both

0.08333333333333333

In [5]:
for c in ('Heads', 'Tails'):
    for d in range(1, 6+1):
        print(c, d)

Heads 1
Heads 2
Heads 3
Heads 4
Heads 5
Heads 6
Tails 1
Tails 2
Tails 3
Tails 4
Tails 5
Tails 6
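As a cross-check on the product rule, the joint probability can also be obtained by counting outcomes in the enumerated sample space. The sketch below is illustrative only: it assumes a fair coin and a fair die, so all twelve pairs are equally likely, and the names `outcomes`, `favorable`, and `p_both_counted` are introduced here and are not part of the original notebook.

```python
from fractions import Fraction
from itertools import product

# Sample space: every (coin, die) pair, assumed equally likely.
outcomes = list(product(('Heads', 'Tails'), range(1, 6 + 1)))

# Outcomes where both events happen: heads AND a six.
favorable = [(c, d) for c, d in outcomes if c == 'Heads' and d == 6]

# Joint probability as favorable outcomes over total outcomes.
p_both_counted = Fraction(len(favorable), len(outcomes))

print(p_both_counted, float(p_both_counted))
```

Counting gives one favorable outcome out of twelve, which agrees with multiplying 1/2 by 1/6.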


### Union Probabilities

In [8]:
p_or = p_heads + p_six - p_both
p_or

0.5833333333333333

In [9]:
p_or = p_heads + p_six - p_heads * p_six
p_or

0.5833333333333333
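The subtraction of the joint probability avoids double-counting the outcome that is both heads and a six. A minimal sketch, assuming a fair coin and die, compares direct counting with the sum rule using exact fractions; the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space: every (coin, die) pair, assumed equally likely.
outcomes = list(product(('Heads', 'Tails'), range(1, 6 + 1)))

# Outcomes where at least one event happens: heads OR a six.
either = [(c, d) for c, d in outcomes if c == 'Heads' or d == 6]
p_or_counted = Fraction(len(either), len(outcomes))

# Sum rule: P(A or B) = P(A) + P(B) - P(A and B).
p_heads_exact = Fraction(1, 2)
p_six_exact = Fraction(1, 6)
p_or_formula = p_heads_exact + p_six_exact - p_heads_exact * p_six_exact

print(p_or_counted, p_or_formula, float(p_or_formula))
```

Both approaches give 7/12: six outcomes with heads plus the single extra outcome of tails with a six.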

### Conditional Probability and Bayes’ Theorem

In [21]:
p_coffee_drinker = 0.65
p_cancer = 0.005
p_coffee_drinker_given_cancer = 0.85

p_cancer_given_coffee_drinker = (p_coffee_drinker_given_cancer * p_cancer) / p_coffee_drinker

print(f'The probability of having cancer given that someone is a coffee drinker is '
      f'{round(p_cancer_given_coffee_drinker*100, 2)}%.')

The probability of having cancer given that someone is a coffee drinker is 0.65%.
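The same calculation can be wrapped in a small reusable helper. This is only a sketch: the function name `bayes` and its parameter names are assumptions made for illustration, not something defined in the notebook.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B) (hypothetical helper)."""
    if not 0 < p_b <= 1:
        raise ValueError('P(B) must be in (0, 1]')
    return p_b_given_a * p_a / p_b

# A = has cancer, B = is a coffee drinker (values from the cell above).
p = bayes(p_b_given_a=0.85, p_a=0.005, p_b=0.65)

print(f'{p:.2%}')
```

Note how the direction of the condition matters: P(coffee drinker | cancer) is 85%, while P(cancer | coffee drinker) is well under 1%.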


### Binomial Distribution

In [41]:
# Calculating the binomial distribution

from scipy.stats import binom

n = 10 # number of trials
p = 0.9 # probability of success for each trial

for k in range(n + 1): # k is the total number of successes over all n trials
    probability = binom.pmf(k, n, p) # probability mass function
    print(f'{k}: {round(probability*100, 2)}')

0: 0.0
1: 0.0
2: 0.0
3: 0.0
4: 0.01
5: 0.15
6: 1.12
7: 5.74
8: 19.37
9: 38.74
10: 34.87


In [47]:
n = 5     # number of trials
p = 0.5   # probability of success for each trial

for k in range(n + 1):   # number of successes
    probability = binom.pmf(k, n, p)
    print(f'{k}: {round(probability*100, 2)}')

0: 3.12
1: 15.62
2: 31.25
3: 31.25
4: 15.62
5: 3.12
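To see what `binom.pmf` is computing, the probability mass function can be written out directly as C(n, k) · p^k · (1 − p)^(n − k). The following sketch uses only the standard library; `binomial_pmf` is a name made up for this example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials), written out by hand."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 5     # number of trials
p = 0.5   # probability of success for each trial

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, probability in enumerate(pmf):
    print(f'{k}: {round(probability * 100, 2)}')

# The probabilities over all possible k should sum to 1.
print(sum(pmf))
```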


### Beta Distribution

In [1]:
from scipy.stats import beta

In [54]:
a = 8  # number of successes
b = 2  # number of failures

# Calculate the area under the curve up to x = 0.90 (90%)
p = beta.cdf(0.90, a, b)  # cdf is the cumulative distribution function

print(f'There is a {round(p*100, 2)}% chance the '
      f'underlying probability of success is 90% or less.') # Beta distribution

There is a 77.48% chance the underlying probability of success is 90% or less.


In [60]:
a = 8  # number of successes
b = 2  # number of failures

p = 1.0 - beta.cdf(0.90, a, b)

print(f'There is a {round(p*100, 2)}% chance the '
      f'underlying probability of success is 90% or greater.') # Beta distribution

There is a 22.52% chance the underlying probability of success is 90% or greater.


In [61]:
a = 30  # number of successes
b = 6  # number of failures

p = 1.0 - beta.cdf(0.90, a, b)

print(f'There is a {round(p*100, 2)}% chance the '
      f'underlying probability of success is 90% or greater.') # Beta distribution

There is a 13.16% chance the underlying probability of success is 90% or greater.


In [3]:
a = 8  # number of successes
b = 2  # number of failures

# Calculate the area between x = 0.80 and x = 0.90
p = beta.cdf(0.90, a, b) - beta.cdf(0.80, a, b)  # cdf is the cumulative distribution function

print(f'There is a {round(p*100, 2)}% chance the '
      f'underlying probability of success is between 80% and 90%.') # Beta distribution

There is a 33.86% chance the underlying probability of success is between 80% and 90%.
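The beta CDF is the area under the beta density from 0 to x. As a rough illustration, that area can be approximated with a midpoint-rule integral using only the standard library and compared with the closed form for a = 8, b = 2, where the density is 72·x⁷·(1 − x) and its integral is 9x⁸ − 8x⁹. The function names below are hypothetical, and the step count is an arbitrary choice.

```python
from math import gamma

def beta_pdf(x, a, b):
    """Beta density: x^(a-1) * (1-x)^(b-1) / B(a, b)."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * x**(a - 1) * (1 - x)**(b - 1)

def beta_cdf_numeric(x, a, b, steps=20_000):
    """Approximate the area under the density from 0 to x (midpoint rule)."""
    width = x / steps
    return sum(beta_pdf((i + 0.5) * width, a, b) for i in range(steps)) * width

def beta_cdf_8_2(x):
    """Closed-form CDF, valid only for a = 8, b = 2."""
    return 9 * x**8 - 8 * x**9

a, b = 8, 2
area_up_to_90 = beta_cdf_numeric(0.90, a, b)
area_80_to_90 = area_up_to_90 - beta_cdf_numeric(0.80, a, b)

print(round(area_up_to_90 * 100, 2), round(beta_cdf_8_2(0.90) * 100, 2))
print(round(area_80_to_90 * 100, 2))
```

In practice `beta.cdf` from SciPy is the better tool; the sketch is only meant to show what area it returns.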


## Exercises

<b></b>1. There is a 30% chance of rain today, and a 40% chance your umbrella order will arrive on time. You are eager to walk in the rain today and cannot do so without either!

What is the probability it will rain AND your umbrella will arrive?

In [13]:
p_r = 0.30
p_u = 0.40

p_r_and_u = p_r * p_u

print(f'There is a {int(p_r_and_u * 100)}% probability that it will rain '
      f'and that my umbrella will arrive on time.')

There is a 12% probability that it will rain and that my umbrella will arrive on time.


<b></b>2. There is a 30% chance of rain today, and a 40% chance your umbrella order will arrive on time.

You will be able to run errands only if it does <i>not</i> rain or your umbrella arrives.
    
What is the probability it will <i>not</i> rain OR your umbrella arrives?

In [17]:
p_r = 0.30
p_u = 0.40

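p_not_r = 1 - p_r

# Sum rule for "not rain OR umbrella": subtract the joint probability
# of those same two events, P(not rain) * P(umbrella)
p_not_r_or_u = p_not_r + p_u - (p_not_r * p_u)

print(f'There is a {round(p_not_r_or_u * 100)}% probability that it will not rain '
      f'or that my umbrella will arrive on time.')

There is a 82% probability that it will not rain or that my umbrella will arrive on time.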
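One way to sanity-check a union probability is the complement rule: "no rain OR umbrella arrives" fails only when it rains AND the umbrella does not arrive. The sketch below assumes, as the exercise does, that rain and delivery are independent; the variable names are introduced here for illustration.

```python
p_r = 0.30   # probability of rain
p_u = 0.40   # probability the umbrella arrives on time

# The only failing case: it rains AND the umbrella does not arrive.
p_fail = p_r * (1 - p_u)
p_via_complement = 1 - p_fail

# Sum rule applied to the events "not rain" and "umbrella arrives".
p_not_r = 1 - p_r
p_via_sum_rule = p_not_r + p_u - p_not_r * p_u

print(round(p_via_complement, 4), round(p_via_sum_rule, 4))
```

Both routes give 0.82, since 1 − 0.30 × 0.60 = 0.70 + 0.40 − 0.70 × 0.40.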


<b></b>3. There is a 30% chance of rain today, and a 40% chance your umbrella order will arrive on time.

However, you found out if it rains there is only a 20% chance your umbrella will arrive on time.

What is the probability it will rain AND your umbrella will arrive on time?

In [19]:
p_r = 0.30
p_u = 0.40
p_u_w_r = 0.20

p_r_and_u = p_r * p_u_w_r

print(f'There is a {int(p_r_and_u * 100)}% probability that it will rain '
      f'and that my umbrella will arrive on time.')

There is a 6% probability that it will rain and that my umbrella will arrive on time.


<b></b>4. You have 137 passengers booked on a flight from Las Vegas to Dallas. However, it is Las Vegas on a Sunday morning and you estimate each passenger is 40% likely to not show up.

You are trying to figure out how many seats to overbook so the plane does not fly empty.

How likely is it <i>at least</i> 50 passengers will not show up?

<i>That is, find the probability that 50 or more passengers will not show up.</i>

In [1]:
from scipy.stats import binom

In [38]:
n_booked = 137
n_no_show = 50
p_no_show = 0.40
probability = 0

for k in range(n_no_show, n_booked + 1):
    probability += binom.pmf(k, n_booked, p_no_show)

print(f'Probability of 50 or more passengers '
      f'not showing up is {round(probability * 100, 2)}%.')

Probability of 50 or more passengers not showing up is 82.21%.
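The upper tail can be computed either by summing the probabilities for 50 through 137 no-shows or as one minus the probability of 49 or fewer. The sketch below checks that the two agree, using a hand-written PMF from the standard library so it does not depend on SciPy; `binomial_pmf` is a hypothetical helper, and independence between passengers is assumed.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_booked = 137
n_no_show = 50
p_no_show = 0.40

# Direct sum over the upper tail: k = 50, 51, ..., 137.
tail = sum(binomial_pmf(k, n_booked, p_no_show)
           for k in range(n_no_show, n_booked + 1))

# Complement: 1 - P(49 or fewer no-shows).
complement = 1 - sum(binomial_pmf(k, n_booked, p_no_show)
                     for k in range(n_no_show))

print(round(tail * 100, 2), round(complement * 100, 2))
```

With SciPy, the survival function `binom.sf(n_no_show - 1, n_booked, p_no_show)` should express the same tail in a single call.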


In [32]:
n_flip = 10
n_number = 3
p_win = 0.50
probability = 0

print(f'Cumulative probability of getting k or fewer heads in {n_flip} coin flips:\n')

for k in range(0, n_flip + 1):
    probability += binom.pmf(k, n_flip, p_win)
    print(f'{k}: {round(probability * 100, 2)}%')

Cumulative probability of getting k or fewer heads in 10 coin flips:

0: 0.1%
1: 1.07%
2: 5.47%
3: 17.19%
4: 37.7%
5: 62.3%
6: 82.81%
7: 94.53%
8: 98.93%
9: 99.9%
10: 100.0%


<b></b>5. You flipped a coin 19 times and got heads 15 times and tails 4 times.

Do you think this coin has any good probability of being fair? Why or why not?

In [41]:
from scipy.stats import beta

In [65]:
n_flips = 19
n_heads = 15
n_tails = 4
p_win = 0.5

p = 1.0 - beta.cdf(p_win, n_heads, n_tails)

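print(f'There is a {round(p*100, 2)}% chance the underlying probability of heads '
      f'is above 50%, so the coin is unlikely to be fair.')

There is a 99.62% chance the underlying probability of heads is above 50%, so the coin is unlikely to be fair.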
