# BAYESIAN STATISTICS

### Motivation

[Motivation](https://theconversation.com/bayes-theorem-the-maths-tool-we-probably-use-every-day-but-what-is-it-76140)

![225px-Thomas_Bayes.gif](attachment:225px-Thomas_Bayes.gif)

## Bayes' Formula

## $$P(A|B) = \dfrac{P(B|A)P(A)}{P(B)}$$
- $P(A)$ is called the **prior probability** of A, which refers to the probability of A prior to the knowledge of the occurrence of B.
- $P(A|B)$ is called the **posterior probability** of A, which refers to the probability of A after observing B.
- Bayes' Theorem can be viewed as a way of updating the probability of A in light of the knowledge about B.
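
One way to see where the formula comes from is to start from the definition of conditional probability. This is a short sketch that assumes $P(A) > 0$ and $P(B) > 0$:

$$P(A|B) = \dfrac{P(A \cap B)}{P(B)}, \qquad P(B|A) = \dfrac{P(A \cap B)}{P(A)}$$

Solving the second identity for $P(A \cap B) = P(B|A)P(A)$ and substituting it into the first gives $P(A|B) = \dfrac{P(B|A)P(A)}{P(B)}$.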

![bayes-1.png](attachment:bayes-1.png)

## Using Law of Total Probability

## $P(A_j|B)=\dfrac {P(B\cap A_j)}{P(B)}=\dfrac{P(A_j)P(B|A_j)}{\sum_{i=1}^{n} P(A_i)P(B|A_i)}$
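
The denominator relies on the law of total probability. As a brief sketch, assuming the events $A_1, \dots, A_n$ are mutually exclusive and together cover the whole sample space:

$$P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(A_i)P(B|A_i)$$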

### Example

- It is known that bowl $A_1$ contains 3 red and 7 blue chips and bowl $A_2$ contains 8 red and 2 blue chips. Chips are identical in size and shape.
- A die is cast and bowl $A_1$ is selected if a five or six shows up on the side that is up; otherwise bowl $A_2$ is selected.
- From this we conclude $P(A_1)=\dfrac{2}{6}$ and $P(A_2)=\dfrac{4}{6}$.
- The selected bowl is handed to another person and one chip is taken at random.
- Let's say that the chip is red and we denote the event of this selection as $B$
- It is reasonable to assign the conditional probabilities $P(B|A_1)=\dfrac{3}{10}$ and $P(B|A_2)=\dfrac{8}{10}$.
- The conditional probability that the chip was selected from bowl $A_1$, given that a red chip is drawn, is


$P(A_1|B)=\dfrac{P(A_1)P(B|A_1)}{P(A_1)P(B|A_1)+P(A_2)P(B|A_2)}=\dfrac{(\dfrac{2}{6})(\dfrac{3}{10})}{(\dfrac{2}{6})(\dfrac{3}{10})+(\dfrac{4}{6})(\dfrac{8}{10})}=\dfrac{3}{19}$
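
The arithmetic above can be checked with exact fractions. The following is a minimal sketch, not part of the original example: the helper name `posterior` and the list layout are assumptions chosen for illustration.

```python
from fractions import Fraction


def posterior(priors, likelihoods, j):
    """Return P(A_j | B) from priors P(A_i) and likelihoods P(B | A_i).

    `posterior` is a hypothetical helper written for this example.
    """
    # Law of total probability: P(B) = sum_i P(A_i) * P(B | A_i)
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    # Bayes' theorem for the j-th event (0-based index)
    return priors[j] * likelihoods[j] / evidence


priors = [Fraction(2, 6), Fraction(4, 6)]          # P(A_1), P(A_2)
likelihoods = [Fraction(3, 10), Fraction(8, 10)]   # P(B | A_1), P(B | A_2)

p_a1_given_b = posterior(priors, likelihoods, 0)
print(p_a1_given_b)  # 3/19
```

Using `Fraction` instead of floats keeps the result exact, so it can be compared directly with the hand calculation.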

### Classwork
- Calculate $P(A_2|B)$

## Note
- Since $A_2$ has a larger proportion of red chips than $A_1$, it makes sense that $P(A_2|B)$ should be larger than $P(A_2)$ and that $P(A_1|B)$ should be smaller than $P(A_1)$.

- Intuitively, the chances of having bowl $A_2$ are better once a red chip is observed than before a chip is taken.

## Monty Hall Problem

### Background

- https://youtu.be/_X5erR9LKUs

### Problem Statement
A TV game show called Let's Make a Deal was popular in the 1960s and 1970s. A contestant on the show was given a choice of three doors. Behind one door was a valuable prize such as a car; behind the other two doors were less valuable prizes. After the contestant chose a door, say Door 1, the host opened one of the other doors, say Door 3, showing a less valuable prize. He then gave the contestant the opportunity to switch from Door 1 to Door 2. **Would switching from Door 1 to Door 2 increase the contestant's chances of winning the car?**

### Problem Setup
Define the events $D_i=\{Door\ i\ conceals\ a\ car\}$ and $O_j=\{Host\ opens\ Door\ j\ after\ a\ contestant\ chooses\ Door\ 1 \}$. 


When a contestant makes their initial choice, the prior probabilities are $P(D_1)=P(D_2)=P(D_3)=\dfrac{1}{3}$.  


After the host shows that the car is not behind Door 3, the chances of winning the car are given by the posterior probabilities $P(D_1|O_3)$ and  $P(D_2|O_3)$. 


We can find these probabilities using Bayes' Theorem.

### Solution Setup

First evaluate $P(O_3|D_1)$, $P(O_3|D_2)$, $P(O_3|D_3)$ in light of the strategy of the show.


$P(O_3|D_1)=\dfrac{1}{2}$ (Door 3 is one of the two doors with the lesser prize that can be opened, given that Door 1 conceals the car).


$P(O_3|D_2)=1$ (The host can open only Door 3 if Door 2 conceals the car).


$P(O_3|D_3)=0$ (The host will not open Door 3 if Door 3 conceals the car).


### Solution

$P(D_1|O_3)=\dfrac{P(D_1)P(O_3|D_1)}{P(D_1)P(O_3|D_1)+P(D_2)P(O_3|D_2)+P(D_3)P(O_3|D_3)}=\dfrac{(\dfrac{1}{3})(\dfrac{1}{2})}{(\dfrac{1}{3})(\dfrac{1}{2})+(\dfrac{1}{3})(1)+(\dfrac{1}{3})(0)}=\dfrac{1}{3}$


$P(D_2|O_3)=\dfrac{P(D_2)P(O_3|D_2)}{P(D_1)P(O_3|D_1)+P(D_2)P(O_3|D_2)+P(D_3)P(O_3|D_3)}=\dfrac{(\dfrac{1}{3})(1)}{(\dfrac{1}{3})(\dfrac{1}{2})+(\dfrac{1}{3})(1)+(\dfrac{1}{3})(0)}=\dfrac{2}{3}$

### Observation

- Given the additional knowledge that the car is not behind Door 3, the chances of winning the car are doubled by switching from Door 1 to Door 2.
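
The posterior probabilities can also be checked empirically. The sketch below is a small Monte Carlo simulation under the same assumptions as the setup: the contestant always picks Door 1, the host never opens the contestant's door or the car's door, and the host chooses at random when two doors are available. The function name `simulate_monty_hall`, the trial count, and the seed are illustrative choices, not part of the original problem.

```python
import random


def simulate_monty_hall(n_trials=100_000, seed=0):
    """Estimate P(D_1 | O_3) and P(D_2 | O_3) by simulation."""
    rng = random.Random(seed)
    stay_wins = switch_wins = kept = 0
    for _ in range(n_trials):
        car = rng.choice([1, 2, 3])
        # Doors the host may open: not Door 1 (the pick) and not the car.
        options = [door for door in (2, 3) if door != car]
        opened = rng.choice(options)
        if opened != 3:
            continue  # keep only games where the event O_3 occurred
        kept += 1
        stay_wins += (car == 1)    # staying with Door 1 wins
        switch_wins += (car == 2)  # switching to Door 2 wins
    return stay_wins / kept, switch_wins / kept


p_stay, p_switch = simulate_monty_hall()
print(f"P(D_1 | O_3) is approximately {p_stay:.3f}")
print(f"P(D_2 | O_3) is approximately {p_switch:.3f}")
```

The estimates should land close to $\dfrac{1}{3}$ and $\dfrac{2}{3}$, with small deviations due to sampling noise.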

# Bayesian Thinking

![jon.jones_.20141.jpg](attachment:jon.jones_.20141.jpg)

- USADA is the national anti-doping organization in the United States for Olympic, Paralympic, Pan American, and Parapan American sport. 
- Before USADA testing began, the fighter pictured above never failed a drug test in 22 fights.
- Since USADA testing began, he has tested positive in 3 out of 5 fights.
- Is it likely that he was using PEDs throughout his pre-USADA career and never got caught?

## Naive Bayes Classifier
- The basic ("naive") assumption is that the features are statistically independent of one another given the label.

- Naive Bayes classifiers are built on Bayesian methods.

- In Naive Bayes classification, we're interested in finding the probability of a label given some observed features, which we can write as $P(L~|~{\rm features})$.
Bayes' theorem tells us how to express this in terms of quantities we can compute more directly:

$$
P(L~|~{\rm features}) = \frac{P({\rm features}~|~L)P(L)}{P({\rm features})}
$$
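
To show where the naive assumption enters, here is a short sketch that writes the features as $x_1, \dots, x_n$ (notation introduced here for illustration). If the features are independent given the label, the likelihood factorizes:

$$
P(x_1, \dots, x_n~|~L) = \prod_{i=1}^{n} P(x_i~|~L)
$$

Since $P({\rm features})$ is the same for every label, comparing labels only requires

$$
P(L~|~x_1, \dots, x_n) \propto P(L)\prod_{i=1}^{n} P(x_i~|~L)
$$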

- Adapted from Jake VanderPlas's Python Data Science Handbook. Code reused under the MIT license.
- https://github.com/jakevdp/PythonDataScienceHandbook

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd

# Real Example

## Naive Bayes Classifying Tumor

### Description
- Use cell nuclei categories to predict whether a breast cancer tumor is benign or malignant
- https://www.mldata.io/dataset-details/breast_cancer/#customize_download

| Name | Type | Description |
|---|---|---|
| clump_thickness | integer | Value range: 1-10 |
| uniformity_of_cell_size | integer | Value range: 1-10 |
| uniformity_of_cell_shape | integer | Value range: 1-10 |
| marginal_adhesion | integer | Value range: 1-10 |
| single_epithelial_cell_size | integer | Value range: 1-10 |
| bare_nuclei | integer | Value range: 1-10 |
| bland_chromatin | integer | Value range: 1-10 |
| normal_nucleoli | integer | Value range: 1-10 |
| mitosis | integer | Value range: 1-10 |
| class | integer | Target value: 2 for benign, 4 for malignant |

In [None]:
bc=pd.read_csv('breast_cancer_scikit_onehot_dataset.csv')

In [None]:
bc.head()

In [None]:
# Recode the class label: 4 (malignant) -> 1, anything else (benign) -> 0
target = (bc['class'] == 4).astype(int)
target.head()

In [None]:
predictor=bc.drop(columns=['class'])

In [None]:
target.value_counts(normalize=True)

In [None]:
predictor.head()

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(predictor, alpha=0.2, figsize=(20, 18), diagonal='kde')
plt.show()

In [None]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set (70%) and test set (30%)
X_train, X_test, y_train, y_test = train_test_split(predictor, target, test_size=0.3, random_state=9)

In [None]:
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier
gnb = GaussianNB()

#Train the model using the training sets
gnb.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = gnb.predict(X_test)
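
To make the idea behind a Gaussian naive Bayes model more concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: the function names `fit_gaussian_nb` and `predict_proba_sketch`, the fixed `1e-9` variance floor, and the toy data are all assumptions made for illustration. Each feature is modelled with its own normal distribution per class, and the per-feature log-likelihoods are summed, which is exactly where the independence assumption is used.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The 1e-9 floor keeps variances strictly positive (sketch assumption).
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_proba_sketch(model, x):
    """Return P(label | x) for every label via Bayes' theorem."""
    log_joint = {}
    for label, (prior, means, variances) in model.items():
        # log P(L) + sum_i log Normal(x_i; mean_i, var_i)
        total = math.log(prior)
        for xi, m, v in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        log_joint[label] = total
    # Normalise in log space (log-sum-exp) for numerical stability.
    top = max(log_joint.values())
    evidence = sum(math.exp(v - top) for v in log_joint.values())
    return {label: math.exp(v - top) / evidence for label, v in log_joint.items()}


# Hypothetical toy data: two features on a 1-10 scale, label 1 = malignant.
X_toy = [[1, 1], [2, 1], [1, 2], [8, 9], [9, 7], [10, 8]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_model = fit_gaussian_nb(X_toy, y_toy)
toy_probs = predict_proba_sketch(toy_model, [9, 8])
print(toy_probs)
```

On this toy data, a point near the high-valued cluster should receive almost all of its probability mass on label 1.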

In [None]:
# Calculate the probability of each test sample belonging to each class
yprob = gnb.predict_proba(X_test)
# Show the first five rows (columns follow gnb.classes_: benign, then malignant)
yprob[:5]

In [None]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

In [None]:
from sklearn.metrics import f1_score
f1_score(y_test, y_pred)
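
As a rough illustration of what the accuracy and F1 scores above measure, the sketch below computes both by hand from true and predicted labels. The helper name `accuracy_and_f1` and the toy labels are hypothetical; they are not taken from the breast cancer data.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and F1 for a binary problem from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Hypothetical labels: 1 = malignant, 0 = benign.
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

acc_toy, f1_toy = accuracy_and_f1(y_true_toy, y_pred_toy)
print(f"accuracy={acc_toy:.2f}, f1={f1_toy:.2f}")  # accuracy=0.75, f1=0.75
```

Accuracy counts all correct predictions, while F1 focuses on the positive (malignant) class, which matters when the classes are imbalanced.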

## When to Use Naive Bayes

Because naive Bayesian classifiers make such stringent assumptions about data, they will generally not perform as well as a more complicated model.
That said, they have several advantages:

- They are extremely fast for both training and prediction
- They provide straightforward probabilistic prediction
- They are often very easily interpretable
- They have very few (if any) tunable parameters

These advantages mean a naive Bayesian classifier is often a good choice as an initial baseline classification.
If it performs suitably, then congratulations: you have a very fast, very interpretable classifier for your problem.
If it does not perform well, then you can begin exploring more sophisticated models, with some baseline knowledge of how well they should perform.

Naive Bayes classifiers tend to perform especially well in one of the following situations:

- When the naive assumptions actually match the data (very rare in practice)
- For very well-separated categories, when model complexity is less important
- For very high-dimensional data, when model complexity is less important

The last two points seem distinct, but they actually are related: as the dimension of a dataset grows, it is much less likely for any two points to be found close together (after all, they must be close in *every single dimension* to be close overall).
This means that clusters in high dimensions tend to be more separated, on average, than clusters in low dimensions, assuming the new dimensions actually add information.
For this reason, simplistic classifiers like naive Bayes tend to work as well as or better than more complicated classifiers as the dimensionality grows: once you have enough data, even a simple model can be very powerful.
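
The claim that points drift apart as dimensions are added can be illustrated with a small experiment. The sketch below is an assumption-laden toy: it draws points uniformly at random from the unit hypercube and reports their mean pairwise Euclidean distance. The function name `mean_pairwise_distance`, the point count, and the seed are arbitrary choices. It shows only that distances grow with dimension, not that any particular classifier improves.

```python
import math
import random


def mean_pairwise_distance(dim, n_points=200, seed=0):
    """Mean Euclidean distance between uniform random points in [0, 1]^dim."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total, pairs = 0.0, 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            total += math.dist(points[i], points[j])
            pairs += 1
    return total / pairs


dims = [1, 2, 10, 100]
distances = [mean_pairwise_distance(d) for d in dims]
for d, dist in zip(dims, distances):
    print(f"dim={d:>3}  mean pairwise distance is about {dist:.2f}")
```

The printed distances should increase steadily with the number of dimensions.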