### Bayes' Rule
From Wikipedia: Bayes' theorem "describes the probability of an event, based on prior knowledge of conditions that might be related to the event. For example, if the risk of developing health problems is known to increase with age, Bayes' theorem allows the risk to an individual of a known age to be assessed more accurately by conditioning it relative to their age, rather than assuming that the individual is typical of the population as a whole."

$$P(A|B) = \frac{P(A) P(B|A)}{P(B)}$$

- $P(A|B)$ is the posterior
- $P(A)$ is the prior
- $P(B|A)$ is the likelihood
- $P(B)$ is the evidence
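
As a small illustration of how the four terms fit together, here is a minimal sketch with made-up numbers (the event names and all probabilities are hypothetical, not taken from any dataset). The evidence $P(B)$ is obtained from the law of total probability.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' rule: P(A|B) = P(A) * P(B|A) / P(B)."""
    return prior * likelihood / evidence

# Hypothetical numbers: A = "has the condition", B = "test is positive".
p_a = 0.01               # prior P(A)
p_b_given_a = 0.95       # likelihood P(B|A)
p_b_given_not_a = 0.05   # false-positive rate P(B|not A)

# Evidence P(B) via the law of total probability.
p_b = p_a * p_b_given_a + (1 - p_a) * p_b_given_not_a

p_a_given_b = posterior(p_a, p_b_given_a, p_b)
print(round(p_a_given_b, 4))  # 0.161
```

Even with a fairly accurate test, the posterior stays low here because the prior is small.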

### Naive Assumption
The "naive" part is the assumption that all features are conditionally independent given the class. This is rarely true in real-world problems, but the assumption works quite well in practice. For features $(X_1,X_2,...,X_n)$ with an unknown joint distribution, the posterior probability that $Y$ takes class $c$ is:
$$P(Y=c|X_1, X_2,..., X_n) = \frac{P(X_1,X_2,...,X_n|Y=c) P(Y=c)}{P(X_1,X_2,...,X_n)}$$

Under the naive assumption, the joint likelihood factorizes into a product over the features, so the formula becomes:
$$P(Y=c|X_1, X_2,..., X_n) = \frac{P(Y=c) \prod_{i=1}^{n} P(X_i|Y=c)}{P(X_1,X_2,...,X_n)}$$

The denominator does not depend on $c$, so for $c \in C$ the prediction is the class with the highest posterior probability:
$$\hat{y} = \underset{c}{\mathrm{argmax}} \; P(Y=c) \prod_{i=1}^{n} P(X_i|Y=c)$$
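
The decision rule is usually evaluated in log space, where the product becomes a sum and small probabilities do not underflow. A minimal sketch, assuming two classes and three features with hypothetical probability values:

```python
import numpy as np

# Hypothetical values for 2 classes and 3 observed feature values.
priors = np.array([0.6, 0.4])        # P(Y=c)
likelihoods = np.array([             # P(X_i = x_i | Y=c), one row per class
    [0.2, 0.5, 0.1],
    [0.3, 0.4, 0.6],
])

# log P(Y=c) + sum_i log P(X_i | Y=c)
log_scores = np.log(priors) + np.log(likelihoods).sum(axis=1)
y_hat = int(np.argmax(log_scores))

# Normalizing the scores recovers the posterior over classes.
posterior = np.exp(log_scores - np.logaddexp.reduce(log_scores))
print(y_hat, posterior)
```

The argmax only needs the unnormalized scores; the normalization step is there to show that the evidence term acts as a shared constant.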

### Hidden Markov Model
Another way to make the posterior tractable is to assume a chain of dependencies: $X_1$ depends on $X_2$, $X_2$ depends on $X_3$, and so on. The problem then turns into a Hidden Markov Model where the state is the target class and the outputs of the chain are $X_1, X_2, X_3, ..., X_n$.

__Todo:__ Check the feasibility and implement

### Likelihood of Features
#### a) Likelihood as Gaussian
The likelihood of each feature is assumed to be __Gaussian__. Treating the parameters as random variables, the formula is:
$$P(X|y=c) = P(X|\mu_{y=c}, \sigma_{y=c}) P(\mu = \mu_{y=c}, \sigma = \sigma_{y=c})$$

However, we do not have any prior knowledge about the parameters $\theta = (\mu, \sigma)$, so we assume that the distributions of $\mu$ and $\sigma$ are uniform. The second factor is then a constant that does not change the comparison between classes, and the likelihood becomes:
$$P(X|y=c) = P(X|\mu_{y=c}, \sigma_{y=c}) = \frac{1}{\sqrt{2 \pi \sigma^2_{y=c}}} \exp\left(-\frac{(X - \mu_{y=c})^2}{2 \sigma^2_{y=c}}\right)$$
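
As a quick numerical check of the Gaussian likelihood, the sketch below evaluates the density of $N(\mu, \sigma^2)$ by hand and compares it with `scipy.stats.norm.pdf`. The helper name `gaussian_pdf` and the sample values are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Arbitrary illustrative values.
x, mu, sigma = 1.5, 1.0, 2.0
by_hand = gaussian_pdf(x, mu, sigma)
from_scipy = stats.norm.pdf(x, loc=mu, scale=sigma)
print(by_hand, from_scipy)
```

Note that this is a density rather than a probability, so values above 1 are possible when $\sigma$ is small.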


#### b) Likelihood as Categorical Distribution
Each feature is assumed to have its own __categorical__ distribution. Given any class $y=c$, the probability that $X$ takes the value $x$ is:
$$P(X=x|y=c) = \frac{|\{ i \in D | X=x, y=c \}| + \alpha}{ |\{ i \in D | y = c \}| + \alpha K}$$
__Note:__ $\alpha$ is the smoothing parameter that prevents zero probabilities, and $K$ is the number of distinct values that $X$ can take, so the smoothed probabilities still sum to 1 over $x$.
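
The smoothed estimate can be sketched with a contingency table. This assumes the usual Laplace/Lidstone form, where the smoothing term in the denominator is $\alpha$ times the number of distinct values of the feature. The helper `categorical_likelihood` and the toy data are hypothetical, and only values that appear in the data are counted as categories.

```python
import pandas as pd

def categorical_likelihood(x_col, y, alpha=1.0):
    """Smoothed P(X=x | y=c): one row per class, one column per feature value."""
    counts = pd.crosstab(y, x_col)   # rows: classes, columns: observed feature values
    n_values = counts.shape[1]       # number of distinct values of X seen in the data
    return (counts + alpha).div(counts.sum(axis=1) + alpha * n_values, axis=0)

# Toy data: "green" never occurs with label 0, "red" never occurs with label 1.
toy = pd.DataFrame({
    "colour": ["red", "red", "blue", "green", "blue", "blue"],
    "label": [0, 0, 0, 1, 1, 1],
})
probs = categorical_likelihood(toy["colour"], toy["label"], alpha=1.0)
print(probs)
```

With $\alpha = 1$, the unseen pair (label 0, green) gets probability $1/6$ instead of 0, and each row still sums to 1.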

### Implementation

In [94]:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

In [89]:
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#### Using Likelihood as Gaussian
- Use MLE to estimate $\mu$ and $\sigma$
- $\mu = \frac{1}{n} \sum_{i=1}^n x_i$
- $\sigma = \sqrt{\frac{\sum_{i=1}^n (x_i - \mu)^2}{n}}$
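
One detail worth checking when turning these formulas into code: the MLE for $\sigma$ divides by $n$, while pandas' `std()` divides by $n - 1$ by default. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up sample with mean 5 and population standard deviation 2.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = values.mean()
sigma_mle = values.std(ddof=0)   # divides by n, i.e. the MLE above
sigma_sample = values.std()      # pandas default ddof=1 divides by n - 1

print(mu, sigma_mle, sigma_sample)
```

NumPy's `np.std` defaults to `ddof=0`, so it matches the MLE without extra arguments.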

In [70]:
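def compute_theta(X, y, cls):
    """Estimate the per-feature mean ("u") and standard deviation ("s") for one class."""
    T = X[y == cls]  # rows belonging to class `cls`
    m = T.mean().to_frame("u")
    # ddof=0 divides by n, matching the MLE formula; pandas defaults to ddof=1
    s = T.std(ddof=0).to_frame("s")
    return m.join(s).to_dict()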


In [101]:
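def predict_log_proba(X, thetas):
    """Return a copy of X with one `class_<c>` column of summed log-likelihoods per class."""
    X = X.copy()
    cols = X.columns  # feature columns, captured before the score columns are added

    def _udf(x, cols, params):
        # Sum the per-feature Gaussian log-densities (naive independence assumption).
        # logpdf avoids the underflow to 0 that pdf followed by log can hit.
        # Note: the class prior log P(Y=c) is not added here.
        log_proba = 0.0
        for col in cols:
            u, s = params["u"][col], params["s"][col]
            log_proba += stats.norm.logpdf(x[col], loc=u, scale=s)
        return log_proba

    for cls in thetas:
        params = thetas[cls]
        X[f"class_{cls}"] = X.apply(lambda x: _udf(x, cols, params), axis=1)
    return X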
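
For comparison, the same per-class score can be computed without a row-wise `apply`, because `scipy.stats.norm.logpdf` broadcasts over a 2-D array. The sketch below is an alternative, not the notebook's implementation: the helpers `fit_gaussian` and `log_scores` are hypothetical names, the data is synthetic, and the score includes the log class prior.

```python
import numpy as np
import pandas as pd
from scipy import stats

def fit_gaussian(X, y):
    """Per-class feature means, MLE standard deviations and log prior."""
    params = {}
    for cls in np.unique(y):
        T = X[y == cls]
        params[cls] = (
            T.mean().to_numpy(),
            T.std(ddof=0).to_numpy(),
            np.log(len(T) / len(X)),
        )
    return params

def log_scores(X, params):
    """One column per class: log prior + summed Gaussian log-densities."""
    values = X.to_numpy()
    return pd.DataFrame(
        {
            cls: log_prior + stats.norm.logpdf(values, loc=mu, scale=sd).sum(axis=1)
            for cls, (mu, sd, log_prior) in params.items()
        },
        index=X.index,
    )

# Synthetic, well-separated two-class data (assumption for illustration only).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "a": np.concatenate([rng.normal(0, 1, 50), rng.normal(4, 1, 50)]),
    "b": np.concatenate([rng.normal(0, 1, 50), rng.normal(4, 1, 50)]),
})
y_toy = pd.Series([0] * 50 + [1] * 50)

scores = log_scores(X_toy, fit_gaussian(X_toy, y_toy))
y_hat = scores.idxmax(axis=1)          # column label with the highest score
accuracy = (y_hat == y_toy).mean()
print(accuracy)
```

Broadcasting the `(n_samples, n_features)` array against the per-feature parameter vectors evaluates every density in one call, which tends to be much faster than a Python loop over rows.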


#### Our GaussianNB

In [102]:
thetas = {
    0: compute_theta(X_train, y_train, cls=0),
    1: compute_theta(X_train, y_train, cls=1)
}

thetas[0]["s"]["mean radius"], thetas[0]["u"]["mean radius"]
X_out = predict_log_proba(X_test, thetas)

In [103]:
# The column position of the larger score is the predicted class label (0 or 1)
y_pred_imp = X_out[["class_0", "class_1"]].to_numpy().argmax(axis=1)
print("Accuracy:", np.sum(y_pred_imp == y_test) / len(y_test))

Accuracy: 0.9385964912280702


#### Scikit-Learn GaussianNB

In [105]:
clf = GaussianNB()
clf.fit(X_train, y_train)
y_pred_sk = clf.predict(X_test)
print("Accuracy:", np.sum(y_pred_sk == y_test) / len(y_test))

Accuracy: 0.9298245614035088


#### Compare results

In [107]:
print("Same predictions", np.sum(y_pred_imp == y_pred_sk) / len(y_test))

Same predictions 0.9912280701754386
