In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import random
import matplotlib.pyplot as plt

# Bayes's Theorem

The question we would like to answer today is: how likely is a conclusion, given the data we have? The method we will learn provides a way to update probabilities based on new data. It also bridges hypothesis testing and confidence intervals to the question of decision making.

Today we will ask these questions of Chronic Kidney Disease, and then you will apply them to the penguins data set for Friday.

## Definitions: Probability and Conditional Probability

To begin, though, we need some definitions:

**P(A)** is going to denote the probability that event A happens. P(A) is between 0 and 1, with 0 representing an event that is impossible, and 1 an event that is certain. 

Consider the event *two six-sided dice are rolled and the result adds up to 7*. Note first that with two six-sided dice there are a total of 36 possible rolls (treating the order of the dice as mattering). We then count six ways for the result to add up to 7: (1, 6), (2, 5), (3, 4), (4, 3), (5, 2) and (6, 1).

$$ P(\mbox{two dice rolled add up to 7}) =\frac{6}{36} = \frac{1}{6} $$ 

Consider the event *one of the two dice rolled is a six* (that is, at least one of them). We count 6 rolls in which the first die is a six and 6 rolls in which the second die is a six, and then note that the roll with two sixes was counted twice. Subtracting that overcount gives 6 + 6 - 1 = 11 rolls in which at least one die is a six.

$$ P(\mbox{one of the two dice rolled is a six}) = \frac{11}{36} $$
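Both counts can be checked by brute force. The sketch below enumerates all 36 ordered rolls and keeps the probabilities exact with `fractions.Fraction`; the helper name `prob` is an illustrative choice, not something defined elsewhere in this notebook.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of rolling two six-sided dice
rolls = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a True/False function of a roll."""
    favorable = [roll for roll in rolls if event(roll)]
    return Fraction(len(favorable), len(rolls))

p_sum_is_7 = prob(lambda roll: roll[0] + roll[1] == 7)
p_has_six = prob(lambda roll: 6 in roll)

print(p_sum_is_7)  # 1/6
print(p_has_six)   # 11/36
```

Counting with `6 in roll` avoids the overcount automatically, because the roll (6, 6) appears only once in the list of outcomes.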

**Intersection**

Let A be the event that the sum of the dice is 7, and B the event that at least one of them is a 6. Consider the event that both happen. There are only two ways for this to occur: (1, 6) and (6, 1).

$$ P( A \quad\mbox{and}\quad B) = \frac{2}{36} = \frac{1}{18} $$


**Conditional Probability**

Now suppose though that we are told that one of the dice rolled was a six. What is the probability that the result we have adds up to a 7?

We can brute force this: we know that there are 11 rolls that give a six on at least one die (one of which has a six on both). Out of these 11, two add up to 7, so the conditional probability is:

$$ P( A | B) = \mbox{the probability of event A given the observation of event B} = \frac{2}{11} $$

Note that from here we can compute the intersection probability:

$$ P(A \quad\mbox{and}\quad B) = P(A|B) P(B) = \frac{2}{11} \cdot \frac{11}{36} = \frac{2}{36} = \frac{1}{18} $$
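The same brute-force idea works for conditional probability: restrict attention to the rolls where B happened, then count within that smaller set. The following is a minimal sketch with illustrative variable names.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))

# B: at least one die shows a six; A: the dice sum to 7
rolls_in_B = [roll for roll in rolls if 6 in roll]
rolls_in_A_and_B = [roll for roll in rolls_in_B if sum(roll) == 7]

p_B = Fraction(len(rolls_in_B), len(rolls))
p_A_given_B = Fraction(len(rolls_in_A_and_B), len(rolls_in_B))
p_A_and_B = Fraction(len(rolls_in_A_and_B), len(rolls))

print(rolls_in_A_and_B)                # [(1, 6), (6, 1)]
print(p_A_given_B)                     # 2/11
print(p_A_given_B * p_B == p_A_and_B)  # True
```

The denominator of `p_A_given_B` is the number of rolls in B, not 36. That change of denominator is the whole difference between P(A | B) and P(A and B).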

Note that we can reason out the following result:

$$ P(B | A) P(A) + P(B | \mbox{not}\quad A) P(\mbox{not} \quad A) = P(B) $$

This is the law of total probability: either A happened or A did not happen, and B is split across those two cases.
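As a sanity check, the total probability identity can be verified on the dice example, with A = the dice sum to 7 and B = at least one die is a six. This is a sketch; the helper `conditional` is a hypothetical name introduced only for this illustration.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))

def conditional(event, given):
    """P(event | given), counted within the rolls where `given` holds."""
    restricted = [roll for roll in rolls if given(roll)]
    hits = [roll for roll in restricted if event(roll)]
    return Fraction(len(hits), len(restricted))

def A(roll):
    return sum(roll) == 7

def not_A(roll):
    return not A(roll)

def B(roll):
    return 6 in roll

def always(roll):
    return True

p_A = conditional(A, always)
p_B = conditional(B, always)
p_B_given_A = conditional(B, A)
p_B_given_not_A = conditional(B, not_A)

total = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

print(p_B_given_A)      # 1/3
print(p_B_given_not_A)  # 3/10
print(total == p_B)     # True
```

Of the 6 rolls that sum to 7, 2 contain a six; of the 30 rolls that do not sum to 7, 9 contain a six. Weighting those two cases by P(A) and P(not A) recovers P(B) = 11/36.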



**Bayes's Theorem**

First note that 
$$ P(A \quad\mbox{and}\quad B) = P(A|B) P(B) = P(B|A) P(A) $$
by the symmetry of the intersection result above.

We then observe that 

$$ P(A | B) = \frac{P(B|A) P(A)}{P(B)} = \frac{P(B|A) P(A)}{P(B|A) P(A) + P(B | \mbox{not}\quad A) P(\mbox{not}\quad A)} $$

This is Bayes's Theorem. It is used to compute (or estimate) an unknown conditional probability from known ones.
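The formula translates directly into a small function. The sketch below is one possible way to write it; the name `bayes_posterior` and its argument names are assumptions made for illustration. It is checked against the dice example, where P(A) = 1/6, P(B | A) = 2/6 and P(B | not A) = 9/30 all come from counting rolls.

```python
from fractions import Fraction

def bayes_posterior(prior_A, p_B_given_A, p_B_given_not_A):
    """Return P(A | B) from P(A), P(B | A) and P(B | not A)."""
    # Denominator: P(B), expanded with the law of total probability
    evidence = p_B_given_A * prior_A + p_B_given_not_A * (1 - prior_A)
    return p_B_given_A * prior_A / evidence

# Dice check: A = the dice sum to 7, B = at least one die is a six
posterior = bayes_posterior(
    prior_A=Fraction(1, 6),
    p_B_given_A=Fraction(2, 6),
    p_B_given_not_A=Fraction(9, 30),
)
print(posterior)  # 2/11
```

The result matches the brute-force value of P(A | B) found earlier, which is a useful check before applying the same formula to data where direct counting is not possible.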

## Chronic Kidney Disease Example

The US CDC estimates that the prevalence of Chronic kidney disease is 15% of all adults. 

*Side note:* this seems like a really high prevalence for a disease. The CDC notes that most cases of chronic kidney disease go undiagnosed because the patient is not showing symptoms. It is possible to have chronic kidney disease for years without having symptoms.

So a question we might have is: if we know a patient is over a certain age, can we refine our estimate of how likely it is that they have chronic kidney disease?

In [5]:
# Load the chronic kidney disease data set

kidney_url = 'https://drive.google.com/uc?export=download&id=1R8H9Hno1_fu7ON6yQlk9jNQsTXG6loup'
data_names = ['age', 'blood pressure', 'specific gravity', 'albumin', 'sugar',
            'red blood cells', 'pus cell', 'pus cell clumps', 'bacteria',
            'blood glucose random', 'blood urea', 'serum creatinine',
            'sodium', 'potassium', 'hemoglobin', 'packed cell volume',
            'white blood cell count', 'red blood cell count', 'hypertension',
            'diabetes mellitus', 'coronary artery disease', 'appetite',
            'pedal edema', 'anemia', 'class']

# use the na_values flag to parse the '?' entries as NaN
kidney_data = pd.read_csv(kidney_url, names=data_names, na_values = ['?'])

# fix the white space in 'class'
kidney_data.loc[:, 'class'] = kidney_data.loc[:, 'class'].str.strip()

# This is a large dataset, so we check by just displaying the first few rows using .head()
kidney_data.head()

| | age | blood pressure | specific gravity | albumin | sugar | red blood cells | pus cell | pus cell clumps | bacteria | blood glucose random | ... | packed cell volume | white blood cell count | red blood cell count | hypertension | diabetes mellitus | coronary artery disease | appetite | pedal edema | anemia | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48.0 | 80.0 | 1.02 | 1.0 | 0.0 | NaN | normal | notpresent | notpresent | 121.0 | ... | 44 | 7800 | 5.2 | yes | yes | no | good | no | no | ckd |
| 1 | 7.0 | 50.0 | 1.02 | 4.0 | 0.0 | NaN | normal | notpresent | notpresent | NaN | ... | 38 | 6000 | NaN | no | no | no | good | no | no | ckd |
| 2 | 62.0 | 80.0 | 1.01 | 2.0 | 3.0 | normal | normal | notpresent | notpresent | 423.0 | ... | 31 | 7500 | NaN | no | yes | no | poor | no | yes | ckd |
| 3 | 48.0 | 70.0 | 1.005 | 4.0 | 0.0 | normal | abnormal | present | notpresent | 117.0 | ... | 32 | 6700 | 3.9 | yes | no | no | poor | yes | yes | ckd |
| 4 | 51.0 | 80.0 | 1.01 | 2.0 | 0.0 | normal | normal | notpresent | notpresent | 106.0 | ... | 35 | 7300 | 4.6 | no | no | no | good | no | no | ckd |


Note the problem with this dataset: it is not a truly random sample. It was drawn from patients who were at a hospital with symptoms that could be explained by chronic kidney disease. However, it is the data we have, so let's see if we can use it to update the CDC's estimate that 15% of all adults have CKD, based on additional information about the patients.

In [7]:
age_data = kidney_data.loc[:, ['age', 'class']].dropna()
age_data

| | age | class |
|---|---|---|
| 0 | 48.0 | ckd |
| 1 | 7.0 | ckd |
| 2 | 62.0 | ckd |
| 3 | 48.0 | ckd |
| 4 | 51.0 | ckd |
| ... | ... | ... |
| 395 | 55.0 | notckd |
| 396 | 42.0 | notckd |
| 397 | 12.0 | notckd |
| 398 | 17.0 | notckd |


In [8]:
age_data.groupby('class').mean()

| class | age |
|---|---|
| ckd | 54.541322 |
| notckd | 46.516779 |


In [9]:
age_data.groupby('class').count()

| class | age |
|---|---|
| ckd | 242 |
| notckd | 149 |


In [10]:
242/(242+149)

0.618925831202046

Again we note that the counts identify the problem with our dataset. While only 15% of all adults have chronic kidney disease, in our dataset about 62% of the patients had chronic kidney disease.

Let's consider two events:  

A = an adult has chronic kidney disease.

B = an adult is 54 years old or older.

I chose 54 because it is the mean age of the patients with chronic kidney disease in our sample, but really this is something we would like to come back to later and tune.

What we would like to do is estimate P(A | B), i.e. use our dataset to understand how prevalent chronic kidney disease is likely to be among patients with symptoms who are 54 years old or older.

### Computations

To find this we need estimates of:

- $P(A) = 0.15$
- $P(\mbox{not}\quad A) = 1 - 0.15 = 0.85$
- $P(B | A)$
- $P(B | \mbox{not}\quad A)$

In [13]:
# P(B|A) = probability a patient is 54 years old or older given they have ckd
# P(B| not A) = probability a patient is 54 years old or older given they do not have ckd

age_cutoff = 54

cutoff_count = age_data.loc[age_data.loc[:, 'age']>= age_cutoff, :].groupby('class').count()
cutoff_count

| class | age |
|---|---|
| ckd | 153 |
| notckd | 53 |


In [14]:
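PA = 0.15
PnotA = 0.85

# Each conditional probability is taken within its own class:
# P(B|A) = (ckd patients aged >= cutoff) / (all ckd patients), and likewise for notckd
class_count = age_data.groupby('class').count()
PB_givenA = cutoff_count.loc['ckd'] / class_count.loc['ckd']
PB_givennotA = cutoff_count.loc['notckd'] / class_count.loc['notckd']

In [17]:
PB_givennotA

age    0.355705
Name: notckd, dtype: float64

In [18]:
PA_givenB = PB_givenA * PA / ( PB_givenA * PA + PB_givennotA * PnotA)
PA_givenB

age    0.238768
dtype: float64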