# Conditional Probability

## Dataset

The study "Adolescents Understanding of Social Class" examines teens' beliefs about social class. 
- Sample: 48 working class and 50 upper middle class 16-year-olds
- "objective" assignment to social class based on self-reported measures of parents' occupation, education and household income
- "subjective" association based on survey questions

**Contingency Table:**

|                |                    | Objective<br>Working class | Objective<br>Upper middle class | Total |
| :------------: | :----------------- | -------------------------: | ------------------------------: | ----: | 
|                | poor               |                          0 |                               0 |     0 |
|                | working class      |                          8 |                               0 |     8 |
| **subjective** | middle class       |                         32 |                              13 |    45 |
|                | upper middle class |                          8 |                              37 |    45 |
|                | upper class        |                          0 |                               0 |     0 |
|                | Total              |                         48 |                              50 |    98 |

For simplicity, I abbreviate the categories as follows:  
- For Subjective:
    - Poor: SP
    - Working class: SWC
    - Middle class: SMC
    - Upper middle class: SUMC
    - Upper class: SUP
- For Objective:
    - Working class: OWC
    - Upper middle class: OUMC

### Marginal Probabilities

*What is the probability that a student's __objective__ social class position is upper middle class (OUMC)?*  

$P(\text{OUMC}) = \frac{50}{98} = 0.5102$  

Note that the term **marginal probability** comes from the fact that the counts we use to calculate this probability came from the **margins of the contingency table**.

### Joint Probability

*What is the probability that a student's **objective position and subjective identity** are both upper middle class?*  

$P(\text{OUMC & SUMC}) = \frac{37}{98} = 0.3776$

The term **joint probability** comes from the fact that we're considering the students who are at the **intersection of the two events** of interest.

### Conditional Probability

*What is the probability that a student who is objectively in the working class associates with upper middle class?*  

$P(\text{SUMC | OWC}) = \frac{8}{48} = 0.1667$

We call this a conditional probability because we first conditioned on the working class and then calculated the probability based on counts only in this column. 
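
The three kinds of probability above can be reproduced directly from the table counts. The following is a minimal sketch; the dictionary layout and variable names are illustrative choices, not part of the original study.

```python
# Counts from the contingency table: rows are subjective identity,
# columns are objective class (OWC, OUMC).
counts = {
    "SP":   {"OWC": 0,  "OUMC": 0},
    "SWC":  {"OWC": 8,  "OUMC": 0},
    "SMC":  {"OWC": 32, "OUMC": 13},
    "SUMC": {"OWC": 8,  "OUMC": 37},
    "SUP":  {"OWC": 0,  "OUMC": 0},
}

# Grand total and column (objective class) totals, i.e. the table margins.
total = sum(sum(row.values()) for row in counts.values())
col_total = {c: sum(row[c] for row in counts.values()) for c in ("OWC", "OUMC")}

# Marginal: uses a margin count over the grand total.
p_oumc = col_total["OUMC"] / total
# Joint: uses a single cell over the grand total.
p_oumc_and_sumc = counts["SUMC"]["OUMC"] / total
# Conditional: uses a single cell over the total of the column we condition on.
p_sumc_given_owc = counts["SUMC"]["OWC"] / col_total["OWC"]

print(round(p_oumc, 4), round(p_oumc_and_sumc, 4), round(p_sumc_given_owc, 4))
# 0.5102 0.3776 0.1667
```

Note that only the denominator changes between the joint and the conditional probability: the grand total for the former, the column total for the latter.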

### Bayes' Theorem

We calculate conditional probabilities using Bayes' Theorem, which states that the probability of `A` given `B` is the probability of `A` and `B` divided by the probability of `B`. That is, the joint probability goes in the numerator and the probability of the event we are conditioning on goes in the denominator.

$$P(A\ |\ B) = \frac{P(A\ \text{&}\ B)}{P(B)}$$

Using the previous question, the probability of subjective upper middle class (SUMC) given objective working class (OWC) is going to be equal to the joint probability of subjective upper middle class and objective working class `P(SUMC & OWC)`, divided by probability of objective working class `P(OWC)`, what we're conditioning on.

$P(\text{SUMC | OWC}) = \frac{P(\text{SUMC & OWC})}{P(\text{OWC})}$<br>
$P(\text{SUMC | OWC}) = \frac{8/98}{48/98} = 0.1667$<br>
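
As a quick check, the same calculation can be done with exact fractions, which makes it visible that the common denominator of 98 cancels and the result equals the 8/48 read straight from the table. This is only a sketch; the variable names are illustrative.

```python
from fractions import Fraction

p_sumc_and_owc = Fraction(8, 98)   # joint probability P(SUMC & OWC)
p_owc = Fraction(48, 98)           # marginal probability P(OWC)

# Conditional probability = joint / marginal.
p_sumc_given_owc = p_sumc_and_owc / p_owc

print(p_sumc_given_owc)                    # 1/6
print(round(float(p_sumc_given_owc), 4))   # 0.1667
```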

**Practice:**

*The American Community Survey is an ongoing survey that provides data every year to give communities the current information they need to plan investments and services. The 2010 American community survey estimates that 14.6% of Americans live below the poverty line. 20.7% speak a language other than English at home, and 4.2% fall into both categories. Based on this information, what percent of Americans live below the poverty line given that they speak a language other than English at home?*

$P(\text{below poverty | speak other language}) = \frac{P(\text{below poverty & speak other language})}{P(\text{speak other language})}$<br>
$P(\text{below poverty | speak other language}) = \frac{0.042}{0.207} = 0.2029$<br>

One use of this information is to compare it to the general public. Recall that 14.6% of all Americans live below the poverty line, whereas about 20.3% of those who speak a language other than English at home do. So living below the poverty line appears to be more prevalent among people who speak a language other than English at home. This finding suggests that language spoken at home and poverty level may be dependent.  
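
The practice problem can be sketched in a few lines of Python, using the three percentages quoted from the survey. The variable names are illustrative.

```python
p_poverty = 0.146          # P(below poverty)
p_other_language = 0.207   # P(speak other language)
p_both = 0.042             # P(below poverty & speak other language)

# Conditional probability = joint / marginal of the conditioning event.
p_poverty_given_language = p_both / p_other_language

print(round(p_poverty_given_language, 4))    # 0.2029
# Compare the conditional probability with the marginal one.
print(p_poverty_given_language > p_poverty)  # True
```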

**General Rule**

Bayes' Theorem holds whether or not `A` and `B` are independent, so we can simply rearrange it and calculate the joint probability of `A` and `B` as the product of the conditional probability of `A` given `B`, `P(A | B)`, and the marginal probability of `B`, `P(B)`. All we have done is take Bayes' Theorem, shuffle things around, and come up with a general rule for calculating joint probabilities.

$$P(A\ |\ B) = \frac{P(A\ \text{&}\ B)}{P(B)}\ \ \ \rightarrow\ \ \ P(A\ \text{&}\ B) = P(A\ |\ B) \times P(B)$$

In general, if `P(A | B) = P(A)`, then the events `A` and `B` are said to be independent. We can explain this in two ways.  
- Conceptually: if `B` tells us nothing about `A`, then `A` and `B` are independent, meaning that the probability of `A` is exactly the same whether or not `B` is given.
- Mathematically: if events `A` and `B` are independent, then `P(A & B) = P(A) x P(B)`. Then,

$$P(A\ |\ B) = \frac{P(A\ \text{&}\ B)}{P(B)} = \frac{P(A) \times P(B)}{P(B)} = P(A)$$


### Example

Consider the following hypothetical distribution of gender and major of students in an introductory class. We have 100 students in this class. 60 of them are social science majors, and 40 of them are not.

|            |        | Major<br>Social Science | Major<br>Non-Social Science | Total |
| :--------- | :----- | ----------------------: | --------------------------: | ----: |
|            | female |                      30 |                          20 |    50 |
| **gender** | male   |                      30 |                          20 |    50 |
|            | Total  |                      60 |                          40 |   100 |

If I wanted to find the overall probability of social science majors in this class, that would be 60 out of 100, so the probability that a randomly-selected student is a social science major is 0.6.

$P(SS) = \frac{60}{100} = 0.6$<br>

Now let's condition on gender. What is the probability that a randomly-selected female in this class is a social science major?

$P(SS\ |\ F) = \frac{30}{50} = 0.6$<br>

What about the males? There are 50 males in the class, 30 of whom are social science majors. So once again, the probability of social science given male is 30 out of 50, 60% as well.

$P(SS\ |\ M) = \frac{30}{50} = 0.6$<br>

What we're seeing here is that all of these probabilities are exactly the same. This goes back to the rule that if `P(A | B) = P(A)`, then the events are independent. In this case, `P(SS) = P(SS | F)` and `P(SS) = P(SS | M)`. So we would conclude that the two variables, gender and major, are independent of each other, given this hypothetical distribution. 
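
The independence check in this example can be sketched as follows, comparing each conditional probability with the marginal one. Exact fractions are used so that the equality test is not affected by floating-point rounding; the data layout and names are illustrative.

```python
from fractions import Fraction

# Hypothetical class counts from the table above.
counts = {
    "female": {"SS": 30, "non_SS": 20},
    "male":   {"SS": 30, "non_SS": 20},
}

total = sum(sum(row.values()) for row in counts.values())

# Marginal probability of being a social science major.
p_ss = Fraction(sum(row["SS"] for row in counts.values()), total)

# Conditional probability of social science within each gender.
p_ss_given = {g: Fraction(row["SS"], sum(row.values())) for g, row in counts.items()}

# Independent if conditioning on gender never changes the probability.
independent = all(p == p_ss for p in p_ss_given.values())

print(p_ss, p_ss_given["female"], p_ss_given["male"], independent)
# 3/5 3/5 3/5 True
```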

### Questions

Consider for the next questions the following table:

<img src="images/exerc_2.1.png" align="center" width="700"/>

1) What is the probability that a student's subjective social class identity is upper middle class?  
&#9744; $\frac{8}{48} \approx 0.17$  
&#9745; $\frac{45}{98} \approx 0.46$  
&#9744; $\frac{8}{45} \approx 0.18$  
&#9744; $\frac{37}{50} \approx 0.74$  
&#9744; $\frac{37}{45} \approx 0.82$

*Solving:*

$P(\text{SUMC}) = \frac{45}{98} = 0.4592$

2) What is the probability that a student's objective and subjective class is working class?  
&#9744; $\frac{48}{98} \approx 0.49$  
&#9744; $\frac{8}{48} \approx 0.17$  
&#9745; $\frac{8}{98} \approx 0.08$  
&#9744; $\frac{8}{8} \approx 1$  

*Solving:*

$P(\text{SWC & OWC}) = \frac{8}{98} = 0.0816$

3) If a student's objective class position is upper middle class, what is the probability that they associate with middle class?  
&#9744; $\frac{32}{48} \approx 0.67$  
&#9744; $\frac{13}{45} \approx 0.29$  
&#9744; $\frac{13}{98} \approx 0.13$  
&#9745; $\frac{13}{50} \approx 0.26$  

*Solving:*

$P(\text{SMC | OUMC}) = \frac{13}{50} = 0.26$

4) Same data: "The American Community Survey is an ongoing survey that provides data every year to give communities the current information they need to plan investments and services. The 2010 American Community Survey estimates that 14.6% of Americans live below the poverty line, 20.7% speak a language other than English at home, and 4.2% fall into both categories." Based on this information, what percent of Americans who live below the poverty line also speak a language other than English at home?  
&#9744; $\frac{0.207}{0.146} \approx 1.42$  
&#9744; $\frac{0.042}{0.207} \approx 0.2$  
&#9744; $\frac{(0.146 * 0.207)}{0.146} \approx 0.207$  
&#9744; $\frac{0.146}{0.207} \approx 0.71$  
&#9745; $\frac{0.042}{0.146} \approx 0.29$  

*Solving:*

$P(\text{speak other languages | below poverty}) = \frac{P(\text{speak other languages & below poverty})}{P(\text{below poverty})}$<br>
$P(\text{speak other languages | below poverty}) = \frac{0.042}{0.146} = 0.2877$<br>

---
## Probability Trees

Consider the following problem:

*You have 100 emails in your inbox, 60 of them are spam, and 40 are not. Of the 60 spam emails, 35 contain the word "free". Of the rest, only 3 contain the word "free". If an email contains the word "free", what is the probability that it is spam?*

Given that information, we can build the probability tree by dividing our population (here, the inbox) in two, based on whether an email is spam or not: 60 emails are spam and 40 are not. We can then branch further and list how many emails in each group contain the word "free". Of the 60 spam emails, 35 contain the word "free" and the remaining 25 do not. Of the 40 non-spam emails, only 3 contain the word "free" and 37 do not. The resulting tree is:

<img src="https://cdn.jsdelivr.net/gh/rogergranada/MOOCs@e2c6dbc6/Coursera/Duke%20University/Probability-intro/Week%203/images/probability_tree.svg" align="center" width="500"/>

Calculating the probability that an email containing the word "free" is spam:

$P(\text{spam}) = \frac{60}{100} = 0.6$<br>
$P(\text{free}) = \frac{(35+3)}{100} = 0.38$<br>
$P(\text{spam & free}) = \frac{35}{100} = 0.35$<br>
$P(\text{spam | free}) = \frac{P(\text{spam & free})}{P(\text{free})}$<br>
$P(\text{spam | free}) = \frac{0.35}{0.38} = 0.9211$<br>
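
The same result can be obtained from the leaf counts of the tree. This is a minimal sketch with illustrative variable names.

```python
# Leaf counts of the probability tree (100 emails in total).
spam_free, spam_no_free = 35, 25
not_spam_free, not_spam_no_free = 3, 37
total = spam_free + spam_no_free + not_spam_free + not_spam_no_free

p_free = (spam_free + not_spam_free) / total   # marginal P(free)
p_spam_and_free = spam_free / total            # joint P(spam & free)
p_spam_given_free = p_spam_and_free / p_free   # conditional P(spam | free)

print(round(p_spam_given_free, 4))  # 0.9211
```

Because both probabilities share the denominator of 100, the answer is simply 35 out of the 38 emails that contain "free".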


Consider the following problem:  
*Swaziland has the highest HIV prevalence in the world: 25.9% of this country's population is infected with HIV. The ELISA test is one of the first and most accurate tests for HIV. For those who carry HIV, the ELISA test is 99.7% accurate. For those who do not carry HIV, the test is 92.6% accurate. If an individual from Swaziland has tested positive, what is the probability that he carries HIV?*

Let's write down the given probabilities: 

$P(\text{HIV}) = 0.259$<br>
$P(\text{positive | HIV}) = 0.997$<br>
$P(\text{negative | not HIV}) = 0.926$<br>

Then the question is what is $P(\text{HIV | positive})$? Thus, let's build the tree diagram: 

<img src="https://cdn.jsdelivr.net/gh/rogergranada/MOOCs@e2c6dbc6/Coursera/Duke%20University/Probability-intro/Week%203/images/probability_tree_hiv.svg" align="center" width="700"/>

Using Bayes' theorem, we know that:

$$P(\text{HIV | positive}) = \frac{P(\text{HIV & positive})}{P(\text{positive})}$$

To get joint probabilities, like the one in the numerator, from the probability tree, all we need to do is multiply along the branches. This is why a probability tree is useful: it organizes the information so that you no longer have to think about what to multiply with what. All you need to do is follow the branches and pick up the building blocks along the way. Thus: 

$P(\text{HIV & positive}) = P(\text{HIV}) \times P(\text{positive | HIV})$<br>
$P(\text{HIV & positive}) = 0.259 \times 0.997$<br>
$P(\text{HIV & positive}) = 0.2582$<br>

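We must also calculate the joint probability of *positive* and *not HIV* in order to find the marginal probability `P(positive)`, which is the sum of the joint probabilities of all branches that end in a positive result. Here `P(not HIV) = 1 - 0.259 = 0.741` and `P(positive | not HIV) = 1 - 0.926 = 0.074`.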

$P(\text{not HIV & positive}) = P(\text{not HIV}) \times P(\text{positive | not HIV})$<br>
$P(\text{not HIV & positive}) = 0.741 \times 0.074$<br>
$P(\text{not HIV & positive}) = 0.0548$<br>

$P(\text{positive}) = P(\text{HIV & positive}) + P(\text{not HIV & positive})$<br>
$P(\text{positive}) = 0.2582 + 0.0548 = 0.3130$<br>

Finally, we can calculate the probability asked in the question $P(\text{HIV | positive})$ as:

$P(\text{HIV | positive}) = \frac{P(\text{HIV & positive})}{P(\text{positive})}$<br>
$P(\text{HIV | positive}) = \frac{0.2582}{0.3130} = 0.8249 \approx 82\%$<br>
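
The tree calculation can be wrapped in a small helper, sketched below. The function `posterior` and the names `sensitivity` and `specificity` are labels chosen for this illustration and do not appear in the original problem. Carrying full precision gives about 0.8248 rather than the 0.8249 obtained above from rounded intermediate values; both round to 82%.

```python
def posterior(prior, p_pos_given_h, p_pos_given_not_h):
    """Return P(H | positive) by multiplying along the two 'positive' branches."""
    joint_h = prior * p_pos_given_h                # P(H & positive)
    joint_not_h = (1 - prior) * p_pos_given_not_h  # P(not H & positive)
    return joint_h / (joint_h + joint_not_h)       # divide by P(positive)

p_hiv = 0.259        # P(HIV)
sensitivity = 0.997  # P(positive | HIV)
specificity = 0.926  # P(negative | not HIV)

# P(positive | not HIV) is the complement of the specificity.
p_hiv_given_positive = posterior(p_hiv, sensitivity, 1 - specificity)

print(round(p_hiv_given_positive, 2))  # 0.82
```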

### Questions

1) Is the accuracy of the test dependent on or independent of whether the patient has the disease?  
&#9745; Dependent  
&#9744; Independent  

2) Given the probability tree below, what is the probability that a randomly chosen person from Swaziland tests negative?  

<img src="images/exerc_2.3.png" align="center" width="500"/>

&#9744; $0.003 + 0.926 = 0.929$  
&#9745; $0.0008 + 0.6862 = 0.687$  
&#9744; $\frac{0.0008}{(0.0008 + 0.6862)} \approx 0.0012$  
&#9744; $\frac{0.6862}{(0.0008 + 0.6862)} \approx 0.9988$  
&#9744; $0.2582 + 0.0548 = 0.313$  

*Solving:*

$P(\text{negative}) = P(\text{HIV and negative}) + P(\text{not HIV and negative})$<br>
$P(\text{negative}) = 0.0008 + 0.6862 = 0.687$
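
The two "negative" branches can be computed from the given probabilities in the same way as the "positive" ones. The following sketch uses illustrative variable names.

```python
p_hiv = 0.259                # P(HIV)
p_pos_given_hiv = 0.997      # P(positive | HIV)
p_neg_given_not_hiv = 0.926  # P(negative | not HIV)

# Multiply along each branch that ends in a negative result.
p_hiv_and_neg = p_hiv * (1 - p_pos_given_hiv)          # HIV but tests negative
p_not_hiv_and_neg = (1 - p_hiv) * p_neg_given_not_hiv  # no HIV and tests negative

# Marginal probability of a negative result.
p_negative = p_hiv_and_neg + p_not_hiv_and_neg

print(round(p_hiv_and_neg, 4), round(p_not_hiv_and_neg, 4), round(p_negative, 3))
# 0.0008 0.6862 0.687
```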

---
## Bayesian Inference

In this part, we will virtually play a game to introduce a Bayesian approach to inference.

**Setup**: I have a die in each hand, one of them is a six-sided die and the other one is a 12-sided die. The ultimate goal of the game is to guess which hand is holding which die, but this is more than just a guessing game. Before you make a final decision, you will be able to collect data by asking me to roll the die in one hand, and I'll tell you whether the outcome of the roll is greater than or equal to 4. 

*What is the probability of rolling a value greater than or equal to 4 with a six-sided die?*  
With a six-sided die, the sample space is made up of the numbers between 1 and 6. We're interested in an outcome greater than or equal to 4; the probability of getting such an outcome is then 3 out of 6, or 1 out of 2, or 50%.

$S_{6side} = \{1, 2, 3, 4, 5, 6\}\ \ \ \rightarrow\ \ \ P(\ge4) = 3/6 = 1/2$<br>

*What is that probability with a 12-sided die?*  
With a 12-sided die, the sample space is bigger: the numbers between 1 and 12. Once again, we're interested in outcomes of 4 or greater. The probability of getting such an outcome is 9 out of 12, or $\frac{3}{4} = 75\%$. 

$S_{12side} = \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12\}\ \ \ \rightarrow\ \ \ P(\ge4) = 9/12 = 3/4$<br>  

We chose the right hand, and we won (rolled a number $\ge4$). Having observed this data point, we have the probability tree:

<img src="images/bayesian_inference_tree.svg" align="center" width="500"/>

We want the probability that the good die (the 12-sided one, which is more likely to roll $\ge4$) is in the right hand, given that we rolled $\ge4$ with the right hand. This is a conditional probability, so we can make use of Bayes' theorem: to find `P(A | B)`, divide the joint probability `P(A and B)` by the marginal probability `P(B)`. So we have:

$P(\text{Good die on the right | we rolled }\ge4\text{ on the right hand}) = \frac{P(\text{Good Right & }\ge4\text{ Right})}{P(\ge4\text{ Right})}$<br>
$P(\text{Good die on the right | we rolled }\ge4\text{ on the right hand}) = \frac{0.375}{0.375+0.25}$<br>
$P(\text{Good die on the right | we rolled }\ge4\text{ on the right hand}) = 0.6$<br>
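
The numbers in this calculation come from multiplying the 50/50 prior by the probability of rolling 4 or more with each die. A sketch with exact fractions, using illustrative names:

```python
from fractions import Fraction

prior_good_right = Fraction(1, 2)  # before any data, either hand is equally likely
p_win_12_sided = Fraction(9, 12)   # P(roll >= 4) with the 12-sided (good) die
p_win_6_sided = Fraction(3, 6)     # P(roll >= 4) with the 6-sided (bad) die

# Joint probabilities of the two branches that end in a win on the right hand.
joint_good_right = prior_good_right * p_win_12_sided     # 3/8 = 0.375
joint_bad_right = (1 - prior_good_right) * p_win_6_sided  # 1/4 = 0.25

# Posterior: good die on the right, given a win on the right.
posterior_good_right = joint_good_right / (joint_good_right + joint_bad_right)

print(posterior_good_right, float(posterior_good_right))  # 3/5 0.6
```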

### Posterior

The probability we just calculated is also called the posterior probability:

$$P(\text{Good die on the right | we rolled }\ge4\text{ on the right hand})$$ 

Posterior probability is generally defined as probability of the hypothesis given the data. 

$$P(\text{hypothesis | data})$$

In other words, it's the probability of a hypothesis we set forth, given the data we just observed. It depends on both the prior probability we set and the observed data. This reverses the conditioning of `P(data | hypothesis)`, the probability of the observed (or more extreme) data given that the null hypothesis is true, which is what we called a `p-value`.

### Updating the Prior

In the Bayesian approach, we evaluate claims iteratively as we collect more data. Suppose we play one more round: you ask me to roll the die in either the right or the left hand again, and we calculate the posterior one more time. This time we get to take advantage of what we learned from the data. In other words, we **update** our prior with the posterior probability from the previous iteration. So, in the next iteration, our updated prior for the first hypothesis being true is 60%, the posterior from the previous iteration, and its complement, 40%, is the prior probability of the competing hypothesis. 

<img src="images/updated_prior.svg" align="center" width="300"/>
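
The updating loop can be sketched as below. The sequence of outcomes (win, win, loss, all rolled with the right hand) is a hypothetical illustration, not data from the game described above, and the function name `update` is likewise an assumption of this sketch.

```python
from fractions import Fraction

P_WIN_GOOD = Fraction(3, 4)  # P(roll >= 4) with the 12-sided die
P_WIN_BAD = Fraction(1, 2)   # P(roll >= 4) with the 6-sided die

def update(prior_good_right, won):
    """One Bayesian update after a roll of the die in the right hand."""
    like_good = P_WIN_GOOD if won else 1 - P_WIN_GOOD
    like_bad = P_WIN_BAD if won else 1 - P_WIN_BAD
    joint_good = prior_good_right * like_good
    joint_bad = (1 - prior_good_right) * like_bad
    return joint_good / (joint_good + joint_bad)

belief = Fraction(1, 2)  # initial prior: good die equally likely in either hand
history = []
for won in [True, True, False]:    # hypothetical outcomes of three right-hand rolls
    belief = update(belief, won)   # the posterior becomes the next prior
    history.append(belief)

print([str(b) for b in history])  # ['3/5', '9/13', '9/17']
```

Each win on the right hand raises the probability that the good die is there, and a loss lowers it again.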

### Questions

1) What is the probability of rolling ≥4 with a 6-sided die? What about with a 12-sided die?  
&#9744; 6-sided: 3/4; 12-sided: 1/2  
&#9744; 6-sided: 1/3; 12-sided: 2/3  
&#9744; 6-sided: 1/3; 12-sided: 3/4  
&#9745; 6-sided: 1/2; 12-sided: 3/4  
&#9744; 6-sided: 2/3; 12-sided: 1/3  

*Solving*:  
$S_{6side} = \{1, 2, 3, 4, 5, 6\}\ \ \ \rightarrow\ \ \ P(\ge4) = 3/6 = 1/2$<br>
$S_{12side} = \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12\}\ \ \ \rightarrow\ \ \ P(\ge4) = 9/12 = 3/4$<br>  

2) Say you're playing a game where the goal is to roll ≥ 4. If you could get your pick, which die would you prefer to play this game with?  
&#9744; 6-sided  
&#9745; 12-sided  

3) Before we collect any data, you have no idea if I am holding the good die (12-sided) on the right hand or the left hand. Then, what are the probabilities associated with the following hypotheses?  
- $\text{H}_1$: good die on the Right (bad die on the Left)  
- $\text{H}_2$: good die on the Left (bad die on the Right)  

&#9744; $P(\text{H}_1) = 0.33 \text{ and } P(\text{H}_2) = 0.67$  
&#9745; $P(\text{H}_1) = 0.5 \text{ and } P(\text{H}_2) = 0.5$  
&#9744; $P(\text{H}_1) = 0 \text{ and } P(\text{H}_2) = 1$  
&#9744; $P(\text{H}_1) = 0.25 \text{ and } P(\text{H}_2) = 0.75$

4) You chose the right hand, and you won (rolled a number ≥4). Having observed this data point, how, if at all, do the probabilities you assign to the same set of hypotheses change?  
- $\text{H}_1$: good die on the Right (bad die on the Left)  
- $\text{H}_2$: good die on the Left (bad die on the Right)  

&#9744; $P(\text{H}_1) = 0.5 \text{ and } P(\text{H}_2) = 0.5$  
&#9745; $P(\text{H}_1) \gt 0.5 \text{ and } P(\text{H}_2) \lt 0.5$  
&#9744; $P(\text{H}_1) \lt 0.5 \text{ and } P(\text{H}_2) \gt 0.5$  


## Examples of Bayesian Inference

*American Cancer Society estimates that about 1.7% of women have breast cancer.*  

*Susan G Komen for the Cure Foundation states that mammography correctly identifies about 78% of women who truly have breast cancer.*  

*An article published in 2003 suggests that up to 10% of all mammograms are false positive.*  

These probabilities are of course estimates, as they're very difficult to calculate precisely. But we're going to take these as givens for this example. As usual, let's first parse through the percentages we're given. 

$P(\text{cancer}) = 0.017$<br>
$P(\text{positive | cancer}) = 0.78$<br>
$P(\text{positive | not cancer}) = 0.10$<br>

Prior to any testing, and any information exchange between the patient and the doctor, what probability should a doctor assign to a female patient having breast cancer?

$P(\text{cancer}) = 0.017 \ \ \ \ \rightarrow \ \ \ \ \text{Prior Probability}$<br>

When a patient goes through breast cancer screening, there are two competing claims: the patient has cancer, or the patient doesn't have cancer. If a mammogram yields a positive result, what is the probability that the patient has cancer?

In terms of probability, the question is asking for `P(cancer | positive)`. So, let's first build the probability tree:

<img src="images/bayesian_example.svg" align="center" width="600"/>

We want to know the probability of having breast cancer given that the mammogram yields a positive result. Thus, using Bayes' theorem, we have:

$$P(\text{cancer | positive}) = \frac{P(\text{cancer and positive})}{P(\text{positive})}$$

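Multiplying along the branches of the probability tree gives $P(\text{cancer and positive}) = 0.017 \times 0.78 = 0.01326$ and $P(\text{not cancer and positive}) = 0.983 \times 0.10 = 0.0983$, so: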

$P(\text{cancer | positive}) = \frac{0.01326}{0.01326+0.0983}$<br>
$P(\text{cancer | positive}) = 0.1189 \approx 0.12 \ \ \ \ \rightarrow \ \ \ \ \text{Posterior Probability}$<br>

Since a positive mammogram doesn't necessarily mean that the patient actually has breast cancer, the doctor might decide to retest the patient. What is the probability of having breast cancer if the second mammogram also yields a positive result? 

Now that we have information from the previous iteration, we use the posterior from the first test, 0.12, as the new prior probability of having breast cancer. The prior probability of not having breast cancer is accordingly updated to its complement, 0.88. Multiplying along the branches gives $0.12 \times 0.78 = 0.0936$ and $0.88 \times 0.10 = 0.088$, so the new posterior probability is:

$P(\text{cancer | positive}) = \frac{0.0936}{0.0936+0.088}$<br>
$P(\text{cancer | positive}) = 0.5154 \approx 0.52$<br>
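
Both mammogram calculations can be sketched with one helper function. The function name `posterior` is an illustrative choice. The text rounds the first posterior to 0.12 before reusing it as the prior; the sketch also shows the result when the unrounded posterior is carried forward, which comes out slightly lower (roughly 0.51).

```python
def posterior(prior, p_pos_given_cancer, p_pos_given_no_cancer):
    """Return P(cancer | positive) from the two 'positive' branches of the tree."""
    joint_cancer = prior * p_pos_given_cancer
    joint_no_cancer = (1 - prior) * p_pos_given_no_cancer
    return joint_cancer / (joint_cancer + joint_no_cancer)

p_cancer = 0.017             # prior P(cancer)
p_pos_given_cancer = 0.78    # P(positive | cancer)
p_pos_given_no_cancer = 0.10 # P(positive | not cancer)

# First positive mammogram.
first = posterior(p_cancer, p_pos_given_cancer, p_pos_given_no_cancer)

# Second positive mammogram, with the prior rounded to 0.12 as in the text.
second_rounded = posterior(0.12, p_pos_given_cancer, p_pos_given_no_cancer)

# Second positive mammogram, carrying the unrounded first posterior forward.
second_unrounded = posterior(first, p_pos_given_cancer, p_pos_given_no_cancer)

print(round(first, 4), round(second_rounded, 4))  # 0.1189 0.5154
```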

### Questions

1) What should the new prior probability that this woman has cancer be, given that she has already tested positive once?  
&#9744; 0.017  
&#9745; 0.12  
&#9744; 0.0133  
&#9744; 0.88  

2) What is the probability of having breast cancer if this second mammogram also yields a positive result, i.e. what is the new posterior probability? Choose the closest answer.  

<img src="images/exerc_2.8.png" align="center" width="600"/>

&#9744; 0.0936  
&#9744; 0.088  
&#9744; 0.48  
&#9745; 0.52  

*Solving*:

$P(\text{cancer | positive}) = \frac{0.0936}{0.0936+0.088}$<br>
$P(\text{cancer | positive}) = 0.5154 \approx 0.52$<br>