# Bayes Theorem 

*Bayes Theorem* is a famous and useful formula for flipping the direction of a conditional probability. It is important for many applications in machine learning, data science, and statistics. From discovering the effectiveness of a drug to Naive Bayes in machine learning, Bayes Theorem will be an inevitable encounter in your journey. 

We will learn about Bayes Theorem by studying the link (or lack thereof) between video games and homicidal behavior. We will also learn how it can apply to the confusion matrix, a critical validation tool in machine learning. 

## Bayes Theorem Formula

The formula for Bayes Theorem fundamentally switches a conditional probability's direction, such as turning $ P(B|A) $ into $ P(A|B) $.

$ \Large P(A|B) = \frac{P(B|A) \times P(A)}{P(B)} $ 

There are a few other pieces you need to perform this flip though. You need the $ P(A) $ and the $ P(B) $. As you will see in our example, Bayes Theorem is a game of proportions. 
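
To make the "game of proportions" idea concrete, here is a minimal sketch that wraps the formula in a helper and checks it against direct counting. The function name `bayes_flip` and the counts are hypothetical, made up purely for illustration.

```python
def bayes_flip(p_b_given_a, p_a, p_b):
    """Return P(A|B) from P(B|A), P(A), and P(B) using Bayes Theorem."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b

# Hypothetical counts for 100 items (illustration only, not real data)
n_total = 100
n_a = 10         # items where A is true
n_b = 40         # items where B is true
n_a_and_b = 8    # items where both A and B are true

p_a = n_a / n_total
p_b = n_b / n_total
p_b_given_a = n_a_and_b / n_a

via_bayes = bayes_flip(p_b_given_a, p_a, p_b)
via_counting = n_a_and_b / n_b   # share of B items that are also A

print(via_bayes, via_counting)
```

Both routes land on the same proportion (up to floating-point noise), which is all Bayes Theorem is doing: re-expressing the overlap of A and B as a share of B instead of a share of A.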


## Understanding Bayes Theorem

There have been several controversial allegations about a link between violence and video games, which have come and gone over the decades. The topic [cyclically falls into and out of the news cycle](https://youtu.be/0oPVxqCx1Lw?si=l3jaX_flri_9TJIa), seemingly after a tragic event involving violent youths. While first person shooters and "rated M for Mature" titles can indeed be uncomfortable or morally questionable, we have to consider the empirical evidence on whether these video games cause homicidal behavior. 

Let's say a special interest group cites some data that 85% of homicidal criminals have played violent video games in the United States. 

$ P(\text{gamer}|\text{homicidal}) = .85 $ 

While this might send politicians and media into a tizzy, we should never take any claim like this at face value. We could scrutinize for biased sampling or faulty assumptions in the study. Perhaps we would even isolate age ranges and exposure to violent video games at different ages. But even then, we will show how marginalized this number becomes in the bigger picture. 

Consider this: is the probability of someone being *a gamer given they are homicidal* the same as the probability of being *homicidal given they are a gamer*? The answer is a resounding no! These two conditional probabilities are very different, and we need to account for the latter. 

$  P(\text{gamer}|\text{homicidal}) = .85 $ 

$ P(\text{homicidal}|\text{gamer}) = \text{ ? } $ 


Let's gather some further statistics. According to the Federal Bureau of Investigation, there were 17,251 known homicidal offenders in the United States in 2017. There were approximately 324 million people in the United States that year as well. From my rough estimations based on industry research, 19% of the population plays violent video games. 

Let's write down these probabilities based on the data. 

$  P(\text{homicidal}) = \frac{17,251}{324,000,000} = .00005 $ 

$ P(\text{gamer}) = \frac{61,560,000}{324,000,000} = .19 $ 

Before we move on, ask yourself this: does something feel off about the two numbers above, given the claims about video games causing homicidal behavior? Think for a moment, then move on. 

Reason again that 19% of the population plays violent video games, and yet only .005% of the population is homicidal. If video games were making people homicidal, shouldn't we be seeing A LOT more of the population being homicidal... say 19% or so? If the gap between these two numbers feels wide to you, you are correct to think that way!

The way we can formalize this is to use Bayes Theorem which again flips a conditional probability $ P(B|A) $ into the $ P(A|B) $. 

$ P(A|B) = \frac{P(B|A) \times P(A)}{P(B)} $ 

If $ P(B) $ is $ P(\text{gamer}) $ and $ P(A) $ is $ P(\text{homicidal}) $, then we can substitute 

$ P(\text{gamer}|\text{homicidal}) = .85 $ 

$  P(\text{homicidal}) = .00005 $ 

$  P(\text{gamer}) = .19 $ 

$   P(\text{homicidal}|\text{gamer}) = \frac{P(\text{gamer}|\text{homicidal}) \times P(\text{homicidal})}{P(\text{gamer})} = \frac{.85 \times .00005}{.19} = .0002 $ 

We can also do this in Python using simple arithmetic operators. 

In [None]:
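# P(gamer | homicidal), P(homicidal), and P(gamer) from the figures above
p_gamer_given_homicidal = .85
p_homicidal = .00005
p_gamer = .19

# Bayes Theorem: flip the condition to get P(homicidal | gamer)
p_homicidal_given_gamer = p_gamer_given_homicidal * p_homicidal / p_gamer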

print(p_homicidal_given_gamer) # 0.0002236842105263158
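
The same flip can be seen by counting people instead of multiplying probabilities. This is a rough sketch using the figures quoted earlier (324 million people, 17,251 offenders, 19% gamers, 85% of offenders being gamers); the variable names are my own, and the head counts are only as good as those estimates.

```python
population = 324_000_000
homicidal = 17_251

gamers = 0.19 * population            # roughly 61.56 million gamers
homicidal_gamers = 0.85 * homicidal   # offenders who are also gamers

# Share of gamers who are homicidal, straight from the head counts
p_homicidal_given_gamer = homicidal_gamers / gamers
print(p_homicidal_given_gamer)
```

The result differs slightly in the later decimals from the arithmetic above because this version does not round $ P(\text{homicidal}) $ to .00005 first, but both round to about .02%.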

Wow, so let's talk about the elephant in the room. Sure, somebody can claim 85% of homicidal people are gamers. However, only .02% of gamers are homicidal. Let's take a look at this visualized animation below to understand why, and see why we get deceived by percentages. 


<video src="https://github.com/thomasnield/anaconda_probability_fundamentals/raw/main/media/02_VennDiagramBayes.mp4" controls="controls" style="max-width: 730px;">
</video>


Now of course, the media can spin this and say "gamers are 4x more likely to be homicidal," but the reality is we are comparing tiny numbers to tiny numbers. This is like saying homicidal people are 4x more likely to wear hats and glasses. Effectively, we are taking a common attribute and associating it with an uncommon one, which is [a common source of many fallacies](https://en.wikipedia.org/wiki/Base_rate_fallacy). It can also rear its ugly head in machine learning and study validation, which we will discuss next. 
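
To see why a "times more likely" headline can mislead, it helps to put the relative and absolute differences side by side. The sketch below reuses the probabilities from this section; the names `relative` and `absolute` are just illustrative labels.

```python
p_homicidal = 0.00005
p_gamer_given_homicidal = 0.85
p_gamer = 0.19

p_homicidal_given_gamer = p_gamer_given_homicidal * p_homicidal / p_gamer

relative = p_homicidal_given_gamer / p_homicidal   # the "times more likely" figure
absolute = p_homicidal_given_gamer - p_homicidal   # the actual change in probability

print(f"relative: {relative:.2f}x")
print(f"absolute: {absolute:.6f}")
print(f"per 100,000 people: {p_homicidal * 1e5:.0f} vs {p_homicidal_given_gamer * 1e5:.0f}")
```

The ratio looks dramatic, yet the absolute difference stays under two hundredths of a percentage point, which is the base rate fallacy in miniature.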


## Bayes Theorem and Confusion Matrices

Let's apply Bayes Theorem to machine learning. Let's say we have a fancy logistic regression or deep learning algorithm that was trained on some data and predicts whether someone has a disease. We want to evaluate the results using a tool called the **confusion matrix**, which tracks true positives, false positives, true negatives, and false negatives. 

Here is some simplified scikit-learn code below demonstrating how to use a confusion matrix. The `y_pred` would typically be the test output of a classification model (e.g. a logistic regression, decision tree, or neural network) and the `y_true` would be our _ground truth_ or actual values. 

In [None]:
import numpy as np 
from sklearn.metrics import confusion_matrix

y_pred = np.array([1,0,1,1,1,1,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,1,0,0,0,1,0])
y_true = np.array([1,0,1,1,1,1,0,0,0,1,1,1,1,0,1,0,0,0,1,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,1,0])

'''
scikit-learn puts the true labels on the rows and the predicted labels
on the columns, with class 0 listed first:
[[truenegatives falsepositives]
 [falsenegatives truepositives]]
'''
matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, normalize=None)
print(matrix)
'''
[[18  3]
 [ 1 14]]
'''

Here is the result rearranged so that rows are the test outcome and columns are the actual condition. 

|          | Disease | No Disease |
|----------|---------|------------|
| Positive | 14      | 3          |
| Negative | 1       | 18         |

We can calculate easily from this that if someone has the disease, they are 93.3% likely to test positive, which sounds promising. 

$ P(\text{positive}|\text{disease}) = \frac{14}{14+1} = .933 $

But what if this sample was deliberately balanced to deal with class imbalance, which is an open problem in machine learning? **Class imbalance** means one class has far fewer examples than the other. For example, if only 1% of the population has a disease, a machine learning algorithm can reach 99% accuracy by simply predicting that *nobody* has the disease. This is why we cannot take accuracy metrics at face value, and why we use confusion matrices and Bayes Theorem accordingly. 

Let's say we do find that statistic, that only 1% of the population has the disease. In our sample, 17 of the 36 predictions were positive, so $ P(\text{positive}) = \frac{14+3}{36} = .472 $. 

$ P(\text{disease}|\text{positive}) = \frac{P(\text{positive}|\text{disease}) \times P(\text{disease})}{P(\text{positive})} $ 

$ P(\text{disease}|\text{positive}) = \frac{.933 \times .01}{.472} = \textbf{.020} $

Shoot! So you mean to tell me that if we account for 1% of the population actually having the disease, then the probability of having the disease given a positive test is only about 2%? This just shows how quickly performance can nosedive, and that the confusion matrix does not account for the whole population, especially if the sample was deliberately biased! 
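
As a cross-check, the four cells can be counted directly from `y_true` and `y_pred` and fed into Bayes Theorem. This is a dependency-free sketch that treats label 1 as "disease" and "positive"; the 1% prevalence is the assumed figure from the text, and $ P(\text{positive}) $ is taken from the sample as above.

```python
y_pred = [1,0,1,1,1,1,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,1,0,0,0,1,0]
y_true = [1,0,1,1,1,1,0,0,0,1,1,1,1,0,1,0,0,0,1,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,1,0]

# Count each cell by hand; confusion_matrix(y_true, y_pred).ravel()
# returns the same counts in the order tn, fp, fn, tp.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

p_positive_given_disease = tp / (tp + fn)        # sensitivity in the sample
p_positive = (tp + fp) / len(pairs)              # share of positive predictions
p_disease = 0.01                                 # assumed population prevalence

p_disease_given_positive = p_positive_given_disease * p_disease / p_positive
print(tn, fp, fn, tp)
print(p_disease_given_positive)
```

Counting the cells explicitly is a useful habit, since it is easy to misread which corner of the printed matrix holds the true positives.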

> 3Blue1Brown has a great video on the [medical testing paradox and Bayes Theorem.](https://www.youtube.com/watch?v=lG4VkPoG3ko)

Whether we are talking violent video games, machine learning validation, or medical testing, Bayes Theorem shows us how easily we can get misled by percentages and we should always check ourselves accordingly. 

## Normalizing Constant

When working out conditional probability and Bayes Theorem problems, you may occasionally need to work with probabilities when an event DOES NOT happen. Thankfully, there is a useful conditional probability formula called the **normalizing constant** (also known as the law of total probability) that will help us identify those negated event probabilities. Here $ 'B $ denotes the event that $ B $ does not happen. 

$ \Large P(A) = P(A|B) \times P(B) + P(A|'B) \times P('B) $ 

Let's say there is a 30% chance of rain $ B $ today, and a 40% chance your umbrella order will arrive on time $ A $. But if it rains, there is only a 20% chance of your order arriving on time. To run errands, you need either no rain or your umbrella to arrive on time. 

So we are provided the following immediately from the word problem above. 

$  P(\text{rain}) = .3 $

$  P(\text{on time}) = .4 $

$  P(\text{on time}|\text{rain}) = .2 $

But we still need to find $ P(\text{on time}|\text{no rain}) $. Thankfully we can use the normalizing constant formula to solve for it.

$  P(A) = P(A|B) \times P(B) + P(A|'B) \times P('B) $ 

$  P(\text{on time}) = P(\text{on time}|\text{rain}) \times P(\text{rain}) + P(\text{on time}|\text{no rain}) \times P(\text{no rain}) $ 


Substitute the values we do have and then solve algebraically for $  P(\text{on time}|\text{no rain}) $ from there. 

$  .4 = .2 \times .3 + P(\text{on time}|\text{no rain}) \times (1 - .3) $ 

$  P(\text{on time}|\text{no rain}) = \mathbf{.486} $
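
The algebra above can be sketched in a few lines of Python by rearranging the normalizing constant formula for the unknown term. The variable names are my own; the inputs are the three values given in the word problem.

```python
p_rain = 0.3
p_on_time = 0.4
p_on_time_given_rain = 0.2

p_no_rain = 1 - p_rain

# Rearranged from:
# P(on time) = P(on time|rain) * P(rain) + P(on time|no rain) * P(no rain)
p_on_time_given_no_rain = (p_on_time - p_on_time_given_rain * p_rain) / p_no_rain

print(round(p_on_time_given_no_rain, 3))  # 0.486
```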

We now have everything we need to solve for our union probability. Remember, we wanted to find the probability of no rain or the umbrella arrives on time. 

$  P(\text{no rain} \cup \text{on time}) = P(\text{no rain}) + P(\text{on time}) - P(\text{no rain}) \times P(\text{on time}|\text{no rain}) $ 

$  P(\text{no rain} \cup \text{on time})  = .7 + .4 - .7 \times .486 = \mathbf{.7598} $ 

Therefore, there is a 75.98% probability we will be able to run our errands (it does not rain *OR* the umbrella arrives on time). 
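
Here is a small sketch of the union calculation, with a sanity check through the complement (the only way errands fail is rain together with a late umbrella). The names are illustrative. Carrying the unrounded conditional probability through gives .76; the .7598 figure comes from rounding to .486 first.

```python
p_rain = 0.3
p_on_time = 0.4
p_on_time_given_rain = 0.2

p_no_rain = 1 - p_rain
p_on_time_given_no_rain = (p_on_time - p_on_time_given_rain * p_rain) / p_no_rain

# Union: P(no rain) + P(on time) - P(no rain and on time)
p_both = p_no_rain * p_on_time_given_no_rain
p_either = p_no_rain + p_on_time - p_both

# Complement check: 1 - P(rain and umbrella late)
p_check = 1 - p_rain * (1 - p_on_time_given_rain)

print(round(p_either, 4), round(p_check, 4))
```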

## Exercise

An electric car company announced they have a new "AI system" that detects worn tires with 95% accuracy (that is, it flags 95% of tires that really are worn) and notifies the driver to replace them. 

At any given time though, only 5% of the tires on the road need to be replaced, and the system is flagging positives 7% of the time. 

What is the probability of a tire actually being worn if it tested positive? Complete the Python code below (replacing the question marks "?") to calculate your answer. 

In [None]:
p_positive_if_worn = .95
p_worn = .05
p_positive = .07

p_worn_if_positive = ? 

### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

The answer is a 67.86% probability that a positive test actually yields a worn tire. Use Bayes Theorem to flip that conditional probability as shown below. 

In [None]:
p_positive_if_worn = .95
p_worn = .05
p_positive = .07

p_worn_if_positive = p_positive_if_worn * p_worn / p_positive
print(p_worn_if_positive) # 0.6785714285714285
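
As an optional extension, the normalizing constant from the previous section can be rearranged to estimate how often the system flags a tire that is *not* worn. This is only a sketch under the exercise's stated numbers; `p_positive_if_not_worn` is a name I am introducing for illustration.

```python
p_positive_if_worn = .95
p_worn = .05
p_positive = .07

p_not_worn = 1 - p_worn

# Rearranged from:
# P(positive) = P(positive|worn) * P(worn) + P(positive|not worn) * P(not worn)
p_positive_if_not_worn = (p_positive - p_positive_if_worn * p_worn) / p_not_worn
print(p_positive_if_not_worn)

# Rebuilding P(positive) from its two pieces recovers the same Bayes answer
p_positive_rebuilt = p_positive_if_worn * p_worn + p_positive_if_not_worn * p_not_worn
p_worn_if_positive = p_positive_if_worn * p_worn / p_positive_rebuilt
print(p_worn_if_positive)
```

Even a false positive rate of a few percent is enough to pull the answer well below 95%, because unworn tires vastly outnumber worn ones.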