In [1]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf #needed for models in this script
import pylab as pl
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

In [2]:
pd.set_option('display.notebook_repr_html', True) #see the dataframe in a more user friendly manner
%matplotlib inline

## Naive Bayes Overview

### Bayesian Probabilities

The Bayesian notion of probability can be expressed as follows:

posterior probability = (conditional probability x prior probability) / evidence

The posterior probability is basically the probability that an observation belongs to some class based on its observed feature values. To give an example: What is the probability that a person has diabetes given a pre-breakfast blood glucose measurement of x and a post-breakfast blood glucose measurement of y? In the Naive Bayes model, classifications are made based on the most probable hypothesis.
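
In symbols, the rule above can be written as follows. The notation is an assumption made here for illustration: $C$ stands for a class and $x$ for the observed feature values.

$$
P(C \mid x) = \frac{P(x \mid C)\, P(C)}{P(x)}
$$

Here $P(C \mid x)$ is the posterior probability, $P(x \mid C)$ the conditional probability, $P(C)$ the prior probability, and $P(x)$ the evidence. Choosing the most probable hypothesis then amounts to picking the class with the largest posterior:

$$
\hat{C} = \arg\max_{C} P(C \mid x)
$$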

Bayesian inference introduces an additional prior probability (or just prior) that can be interpreted as prior belief or a priori knowledge. Priors describe the general probability of encountering a particular class. For example, we might say that the prior probability that a person is male versus female is 0.5. We can obtain these class priors by consulting domain experts, or estimate them from data. If you estimate class priors from training data, it's important that the sample is drawn i.i.d. (independent and identically distributed) and is representative of the population.
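
As a rough sketch of estimating class priors from data, the relative frequency of each class label can be used. The labels below are hypothetical and invented purely for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Hypothetical training labels, invented purely for illustration.
labels = pd.Series(["healthy", "diabetic", "healthy", "healthy", "diabetic",
                    "healthy", "healthy", "healthy", "diabetic", "healthy"])

# Relative class frequencies serve as the estimated class priors.
class_priors = labels.value_counts(normalize=True)
print(class_priors)
```

With 7 of the 10 made-up labels being "healthy", the estimated priors are 0.7 and 0.3. These estimates are only as good as the sample is representative.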

Good video to watch: https://www.youtube.com/watch?v=BLcgeLALLnc

## Working Through a Story Problem

Problem:

100 out of 10,000 women at age forty who participate in routine screening have breast cancer. 80 of every 100 women with breast cancer will get a positive mammogram. 950 out of 9,900 women without breast cancer will also get a positive mammogram. If 10,000 women in this age group undergo a routine screening, about what fraction of women with positive mammograms will actually have breast cancer?

Correct Answer:

7.8%, obtained as follows: Out of 10,000 women, 100 have breast cancer; 80 of those 100 have positive mammograms. From the same 10,000 women, 9,900 will not have breast cancer and of those 9,900 women, 950 will also have positive mammograms. This makes the total number of women with positive mammograms 950 + 80 or 1,030. Of those 1,030 women with positive mammograms, 80 will have cancer. Expressed as a proportion, this is 80/1,030 or 0.07767 or 7.8%.
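
The arithmetic in this answer can be checked with a few lines of Python. This is a minimal sketch that uses only the counts from the problem statement; the variable names are chosen here for illustration.

```python
# Counts taken directly from the problem statement.
with_cancer_positive = 80       # women with cancer and a positive mammogram
without_cancer_positive = 950   # women without cancer and a positive mammogram

total_positive = with_cancer_positive + without_cancer_positive
fraction_with_cancer = with_cancer_positive / total_positive

print(total_positive)                    # 1030
print(round(fraction_with_cancer, 4))    # 0.0777
```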

### Step by Step

To put it another way, before the mammogram, the 10,000 women can be divided into <b>two groups:</b>
    
<b>Group 1:</b> 100 women with breast cancer   
<b>Group 2:</b>  9,900 women without breast cancer

Summing these two groups gives a total of 10,000 patients, confirming that none have been lost in the math.

After the mammogram, the women can be divided into <b>four groups:</b>

<b>Group A:</b> 80 women with breast cancer, and a positive mammogram.    
<b>Group B:</b> 20 women with breast cancer, and a negative mammogram.   
<b>Group C:</b> 950 women without breast cancer, and a positive mammogram.   
<b>Group D:</b> 8,950 women without breast cancer, and a negative mammogram.

As you can check, the sum of all four groups is still 10,000. The sum of groups A and B, the groups with breast cancer, corresponds to group 1; and the sum of groups C and D, the groups without breast cancer, corresponds to group 2; so administering a mammogram does not actually change the number of women with breast cancer. The proportion of the cancer patients (A + B) within the complete set of patients (A + B + C + D) is the same as the 1% prior chance that a woman has cancer: (80 + 20) / (80 + 20 + 950 + 8950) = 100 / 10000 = 1%.
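
One way to verify these sums is to lay the four groups out as a 2x2 table of counts. The sketch below assumes pandas is available; the row and column labels are chosen here for illustration.

```python
import pandas as pd

# Groups A-D from the text, arranged as a 2x2 table of counts.
groups = pd.DataFrame(
    {"positive": [80, 950], "negative": [20, 8950]},
    index=["cancer", "no cancer"],
)

row_totals = groups.sum(axis=1)     # Group 1 and Group 2
grand_total = groups.values.sum()   # all screened women

print(groups)
print(row_totals)
print(row_totals["cancer"] / grand_total)  # prior: 0.01
```

The row totals recover groups 1 and 2 (100 and 9,900), and the grand total stays at 10,000.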

The proportion of cancer patients with positive results, within the group of all patients with positive results, <b>is the proportion of (A) within (A + C):</b>

i)   80 / (80 + 950)     
ii)  80 / 1030    
iii) 0.0776699 or 7.8%   

If you administer a mammogram to 10,000 patients, then out of the 1,030 with positive mammograms, 80 will have cancer. This is the correct answer, the answer a doctor should give a patient with a positive mammogram if she asks about the chance she has breast cancer; if thirteen patients ask this question, roughly 1 out of those 13 will have cancer.

The <b>most common mistake</b> is to ignore the original fraction of women with breast cancer and the fraction of women without breast cancer who receive false positives, and focus only on the fraction of women with breast cancer who get positive results. It's a mistake to assume that if around 80% of women with breast cancer have positive mammograms, then the probability of a woman with a positive mammogram having breast cancer must be around 80%.

Figuring out the final answer always requires all three pieces of information:

i)   the percentage of women with breast cancer   
ii)  the percentage of women without breast cancer who receive false positives   
iii) the percentage of women with breast cancer who receive (correct) positives.
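
The three pieces of information listed above can be combined in a small helper. This is a sketch rather than a definitive implementation; the function name `posterior_given_positive` and its parameter names are hypothetical.

```python
def posterior_given_positive(prior, true_positive_rate, false_positive_rate):
    """Probability of cancer given a positive mammogram (Bayes' theorem)."""
    true_positives = prior * true_positive_rate
    false_positives = (1 - prior) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Rates from the original problem: 100/10,000 prior, 80/100 and 950/9,900.
p = posterior_given_positive(prior=100 / 10_000,
                             true_positive_rate=80 / 100,
                             false_positive_rate=950 / 9_900)
print(round(p, 4))  # 0.0777
```

Dropping any one of the three arguments leaves the result undetermined, which is why all three are required.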

To see that the final answer always depends on the original fraction of women with breast cancer, consider an alternate universe in which only one woman out of a million has breast cancer. Even if the mammogram in this world detects breast cancer in 8 out of 10 cases, while returning a false positive on a woman without breast cancer in only 1 out of 10 cases, there will still be roughly 125,000 false positives for every real case of cancer detected: out of a million women, about 100,000 get false positives, while on average only 0.8 real cases are detected. The original probability that a woman has cancer is so extremely low that, although a positive result on the mammogram does increase the estimated probability, the probability isn't increased to certainty or even "a noticeable chance"; the probability goes from 1:1,000,000 to roughly 1:125,000.

Similarly, in an alternate universe where only one out of a million women does not have breast cancer, a positive result on the patient's mammogram obviously doesn't mean that she has an 80% chance of having breast cancer! If this were the case, her estimated probability of having cancer would have been revised drastically downward after she got a positive result on her mammogram; an 80% chance of having cancer is a lot less than 99.9999%! If you administer mammograms to ten million women in this world, around eight million women with breast cancer will get correct positive results, while one woman without breast cancer will get a false positive result (the same 1-in-10 false positive rate, applied to the ten women without cancer). Thus, if you got a positive mammogram in this alternate universe, your chance of having cancer would go from 99.9999% up to 99.999987%. That is, your chance of being healthy would go from 1:1,000,000 down to 1:8,000,000.
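
Both alternate universes can be checked using the odds form of Bayes' rule: posterior odds equal prior odds times the likelihood ratio. This is a minimal sketch; the function name `posterior_odds` is hypothetical, and the rates (8 in 10 detected, 1 in 10 false positives) are the ones quoted in the text.

```python
def posterior_odds(prior_odds, true_positive_rate, false_positive_rate):
    """Odds of cancer after a positive result: prior odds x likelihood ratio."""
    likelihood_ratio = true_positive_rate / false_positive_rate
    return prior_odds * likelihood_ratio

# Odds are expressed as (cancer : no cancer).
rare = posterior_odds(1 / 999_999, 0.8, 0.1)      # one woman in a million has cancer
common = posterior_odds(999_999 / 1, 0.8, 0.1)    # one woman in a million is healthy

# Convert odds back to "1 in N" statements (N = odds against + 1).
print(f"rare universe: about 1 in {1 / rare + 1:,.0f} positives has cancer")
print(f"common universe: about 1 in {common + 1:,.0f} positives is healthy")
```

In both universes the positive result multiplies the odds of cancer by the same factor of 8. Under these assumptions the first universe comes out near 1 in 125,000 (on the order of one in a hundred thousand) and the second near 1 in 8,000,000.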

These two extreme examples help demonstrate that the mammogram result doesn't replace your old information about the patient's chance of having cancer; the mammogram slides the estimated probability in the direction of the result. A positive result slides the original probability upward; a negative result slides the probability downward. For example, in the original problem where 1% of the women have cancer, 80% of women with cancer get positive mammograms, and 9.6% of women without cancer get positive mammograms, a positive result on the mammogram slides the 1% chance upward to 7.8%.
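
The downward slide for a negative result can be sketched the same way, using the counts of groups B and D from the step-by-step breakdown. The variable names are chosen here for illustration.

```python
# Counts from groups B and D in the text: negative mammograms only.
with_cancer_negative = 20        # Group B
without_cancer_negative = 8950   # Group D

prior = 100 / 10_000
posterior_negative = with_cancer_negative / (
    with_cancer_negative + without_cancer_negative
)

print(round(prior, 4))               # 0.01
print(round(posterior_negative, 4))  # 0.0022
```

A negative mammogram therefore moves the 1% prior down to roughly 0.2%, while a positive one moves it up to 7.8%.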

The original proportion of patients with breast cancer is known as the <b>prior probability.</b> The chance that a patient with breast cancer gets a positive mammogram, and the chance that a patient without breast cancer gets a positive mammogram, are known as the two <b>conditional probabilities.</b>   
Collectively, this initial information is known as the priors. The final answer---the estimated probability that a patient has breast cancer, given that we know she has a positive result on her mammogram---is known as the revised probability or the <b>posterior probability.</b>
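
Putting these terms together for the mammogram problem gives the worked equation below. The event names are notation assumed here for illustration: $+$ denotes a positive mammogram.

$$
P(\text{cancer} \mid +) = \frac{P(+ \mid \text{cancer})\, P(\text{cancer})}{P(+ \mid \text{cancer})\, P(\text{cancer}) + P(+ \mid \text{no cancer})\, P(\text{no cancer})}
$$

$$
= \frac{0.8 \times 0.01}{0.8 \times 0.01 + \frac{950}{9900} \times 0.99} = \frac{0.008}{0.103} \approx 0.078
$$

The prior is $P(\text{cancer}) = 0.01$, the two conditional probabilities are $P(+ \mid \text{cancer}) = 0.8$ and $P(+ \mid \text{no cancer}) = 950/9900 \approx 0.096$, and the posterior is the left-hand side.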