### **Bayes' Theorem:**

Bayes' Theorem provides a way to update our beliefs about the probability of a hypothesis $H$ (such as a class label) given new evidence $E$ (such as a feature vector). Mathematically, it is expressed as:

$$
P(H | E) = \frac{P(E | H) \cdot P(H)}{P(E)}
$$

Where:
- $P(H | E)$ is the **posterior probability**: the probability of the hypothesis $H$ given the evidence $E$.
- $P(E | H)$ is the **likelihood**: the probability of the evidence $E$ given that the hypothesis $H$ is true.
- $P(H)$ is the **prior probability**: the initial probability of the hypothesis $H$ before seeing the evidence.
- $P(E)$ is the **marginal likelihood** or **evidence**: the total probability of the evidence under all possible hypotheses, $P(E) = \sum_i P(E | H_i) \cdot P(H_i)$ for mutually exclusive and exhaustive hypotheses $H_i$.
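
As a minimal numeric sketch of the formula above, the following Python snippet computes a posterior for a single binary hypothesis. The function names and all numbers are hypothetical, chosen only for illustration; $P(E)$ is obtained from the law of total probability over $H$ and its complement.

```python
def evidence_binary(prior, likelihood, likelihood_not):
    """P(E) = P(E | H) P(H) + P(E | not H) P(not H)."""
    return likelihood * prior + likelihood_not * (1.0 - prior)


def posterior(prior, likelihood, evidence):
    """P(H | E) = P(E | H) P(H) / P(E)."""
    if evidence <= 0.0:
        raise ValueError("P(E) must be positive")
    return likelihood * prior / evidence


# Hypothetical inputs (not taken from any dataset)
p_h = 0.3              # prior P(H)
p_e_given_h = 0.8      # likelihood P(E | H)
p_e_given_not_h = 0.1  # P(E | not H)

p_e = evidence_binary(p_h, p_e_given_h, p_e_given_not_h)
p_h_given_e = posterior(p_h, p_e_given_h, p_e)
print(f"P(E) = {p_e:.4f}, P(H | E) = {p_h_given_e:.4f}")
```

With these inputs the evidence raises the probability of $H$ from the prior of 0.3 to roughly 0.77.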

### **Bayes Classifier:**

In the context of classification, the Bayes classifier uses Bayes' Theorem to determine the probability that an observation $\mathbf{x}$ belongs to a particular class $C_k$. The goal is to assign the observation to the class with the highest posterior probability.

### **Mathematical Formulation:**

Given a set of classes $C_1, C_2, \dots, C_K$, and an observation $\mathbf{x}$, the Bayes classifier assigns $\mathbf{x}$ to the class $C_k$ that maximizes the posterior probability:

$$
\hat{C}(\mathbf{x}) = \underset{k \in \{1, 2, \dots, K\}}{\arg\max} \, P(C_k | \mathbf{x})
$$

Using Bayes' Theorem, this can be rewritten as:

$$
\hat{C}(\mathbf{x}) = \underset{k \in \{1, 2, \dots, K\}}{\arg\max} \, \frac{P(\mathbf{x} | C_k) \cdot P(C_k)}{P(\mathbf{x})}
$$

Since $P(\mathbf{x})$ is the same for all classes, it does not affect the maximization, so we can simplify the decision rule to:

$$
\hat{C}(\mathbf{x}) = \underset{k \in \{1, 2, \dots, K\}}{\arg\max} \, P(\mathbf{x} | C_k) \cdot P(C_k)
$$

Where:
- $P(\mathbf{x} | C_k)$ is the likelihood: the probability (or probability density, for continuous features) of observing $\mathbf{x}$ given that the true class is $C_k$.
- $P(C_k)$ is the prior probability: the probability that a randomly chosen observation belongs to class $C_k$.
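
The simplified decision rule can be sketched in a few lines of Python. The function name `bayes_classify`, the class names, and the numbers are assumptions made for illustration; the likelihoods are taken as given rather than estimated.

```python
def bayes_classify(priors, likelihoods):
    """Pick the class maximizing P(x | C_k) * P(C_k).

    priors:      dict mapping class -> P(C_k)
    likelihoods: dict mapping class -> P(x | C_k) for one fixed observation x
    """
    scores = {c: likelihoods[c] * priors[c] for c in priors}
    predicted = max(scores, key=scores.get)
    return predicted, scores


# Hypothetical three-class problem
priors = {"C1": 0.5, "C2": 0.3, "C3": 0.2}
likelihoods = {"C1": 0.1, "C2": 0.4, "C3": 0.3}

predicted, scores = bayes_classify(priors, likelihoods)

# Dividing every score by P(x) rescales them into posteriors
# but leaves the winning class unchanged.
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}
print(predicted, posteriors)
```

Note that the class with the largest prior (`C1`) does not win here: the higher likelihood under `C2` outweighs it.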

### **Interpretation:**

- **Prior $P(C_k)$:** Represents our initial belief about the frequency or probability of each class. For example, if we know that 70% of emails are not spam and 30% are spam, then $P(\text{Not Spam}) = 0.7$ and $P(\text{Spam}) = 0.3$.
- **Likelihood $P(\mathbf{x} | C_k)$:** Reflects how likely it is to observe the features $\mathbf{x}$ given that the observation belongs to class $C_k$. For example, the likelihood of certain words appearing in a spam email might be higher than in a non-spam email.
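
When priors are not known in advance, they are often estimated as relative class frequencies in the training labels. A minimal sketch, using a hypothetical toy label list built to match the 70/30 split in the email example:

```python
from collections import Counter

# Hypothetical training labels: 7 non-spam emails and 3 spam emails
labels = ["not spam"] * 7 + ["spam"] * 3

# Relative frequency of each class serves as the estimate of P(C_k)
counts = Counter(labels)
priors = {c: n / len(labels) for c, n in counts.items()}
print(priors)
```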

### **Gaussian Naive Bayes Example:**

In practice, a common implementation is the Gaussian Naive Bayes classifier. It assumes that the features $x_1, \dots, x_p$ are conditionally independent given the class $C_k$, and that each feature follows a Gaussian (normal) distribution within each class:

$$
P(x_j | C_k) = \frac{1}{\sqrt{2 \pi \sigma_{jk}^2}} \exp \left( -\frac{(x_j - \mu_{jk})^2}{2 \sigma_{jk}^2} \right)
$$

Where $\mu_{jk}$ and $\sigma_{jk}^2$ are the mean and variance of the feature $x_j$ for class $C_k$, typically estimated from the training data. For continuous features, $P(x_j | C_k)$ is a probability density rather than a probability.
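
The density above can be sketched in plain Python. The helper names `gaussian_pdf` and `fit_gaussian` and the sample values are assumptions for illustration; `fit_gaussian` uses the population (maximum-likelihood) variance, which is one common choice among several.

```python
import math
import statistics


def gaussian_pdf(x, mean, var):
    """Gaussian density of x for a given mean and variance (var > 0)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gaussian(values):
    """Estimate the mean and population variance of one feature within one class."""
    return statistics.fmean(values), statistics.pvariance(values)


# Hypothetical values of a single feature observed for a single class
mean, var = fit_gaussian([1.0, 2.0, 3.0])
density_at_mean = gaussian_pdf(mean, mean, var)

# A density is not a probability: it can exceed 1 when the variance is small
narrow_density = gaussian_pdf(0.0, 0.0, 0.01)
print(mean, var, density_at_mean, narrow_density)
```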

### **Final Decision Rule:**

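The final decision rule combines the prior and the per-feature likelihoods, which multiply under the naive independence assumption:

$$
\hat{C}(\mathbf{x}) = \underset{k \in \{1, 2, \dots, K\}}{\arg\max} \, \left[ \log P(C_k) + \sum_{j=1}^{p} \log P(x_j | C_k) \right]
$$

Here, $p$ is the number of features. Because the logarithm is strictly increasing, it does not change which class attains the maximum; it turns the product of probabilities into a sum and avoids floating-point underflow.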
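
To illustrate why the logarithm matters numerically, the sketch below compares a direct product of many small likelihoods with the equivalent sum of logs. The function name `log_score` and the inputs (1000 features, each with likelihood $10^{-5}$, and a prior of 0.5) are hypothetical.

```python
import math


def log_score(prior, feature_likelihoods):
    """log P(C_k) + sum_j log P(x_j | C_k) for one class."""
    return math.log(prior) + sum(math.log(p) for p in feature_likelihoods)


# Hypothetical inputs: many features, each with a small likelihood
prior = 0.5
feature_likelihoods = [1e-5] * 1000

# The direct product is far below the smallest positive double and collapses to 0.0
direct_product = prior * math.prod(feature_likelihoods)

# The log-space score stays finite and can still be compared across classes
score = log_score(prior, feature_likelihoods)
print(direct_product, score)
```

If every class underflows to zero, the direct product can no longer rank the classes, while the log scores still can.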

### **Summary:**

The Bayes classifier uses Bayes' Theorem to classify observations: it combines the prior probability of each class with the likelihood of the observed features under that class, and assigns the observation to the class with the highest posterior probability. Despite its simplicity, it can be very effective, especially when its modeling assumptions (such as conditional independence of the features in Naive Bayes) hold at least approximately.


### Example: Disease Diagnosis

In this example, we'll use a naive Bayes classifier (one that treats the features as conditionally independent given the class) to determine if a patient has a certain disease based on the presence of symptoms.

**Scenario:**

We have a dataset with the following features:
- **x1:** Presence of Symptom A (1 if present, 0 if not)
- **x2:** Presence of Symptom B (1 if present, 0 if not)

We want to classify the patient into:
- C1: No Disease
- C2: Disease

**Step 1: Calculate Prior Probabilities**

From the training data, we determine:
- $P(\text{No Disease}) = 0.8$
- $P(\text{Disease}) = 0.2$

**Step 2: Calculate Likelihoods**

We estimate the likelihoods from the data:
- $P(x1 = 1 \mid \text{No Disease}) = 0.2$
- $P(x1 = 1 \mid \text{Disease}) = 0.7$
- $P(x2 = 1 \mid \text{No Disease}) = 0.1$
- $P(x2 = 1 \mid \text{Disease}) = 0.6$

**Step 3: Compute Posterior Probabilities**

Consider a new patient with the following symptoms:
- $x1 = 1$ (has Symptom A)
- $x2 = 1$ (has Symptom B)

We need to compute the posterior probabilities for both classes:

**For No Disease:**

Compute the class-conditional likelihood, treating the two symptoms as conditionally independent given the class:
$$
P(x \mid \text{No Disease}) = P(x1 = 1 \mid \text{No Disease}) \cdot P(x2 = 1 \mid \text{No Disease})
$$

Calculate the unnormalized posterior (the posterior up to the constant factor $1 / P(x)$):
$$
P(\text{No Disease} \mid x) \propto P(x \mid \text{No Disease}) \cdot P(\text{No Disease})
$$

**For Disease:**

Compute the class-conditional likelihood, treating the two symptoms as conditionally independent given the class:
$$
P(x \mid \text{Disease}) = P(x1 = 1 \mid \text{Disease}) \cdot P(x2 = 1 \mid \text{Disease})
$$

Calculate the unnormalized posterior (the posterior up to the constant factor $1 / P(x)$):
$$
P(\text{Disease} \mid x) \propto P(x \mid \text{Disease}) \cdot P(\text{Disease})
$$

**Example Calculation:**

Let’s plug in the priors from Step 1 and the likelihoods from Step 2.

For a patient with:
- $x1 = 1$
- $x2 = 1$

**For No Disease:**

- $P(x1 = 1 \mid \text{No Disease}) = 0.2$
- $P(x2 = 1 \mid \text{No Disease}) = 0.1$

$$
P(x \mid \text{No Disease}) = 0.2 \cdot 0.1 = 0.02
$$

$$
P(\text{No Disease} \mid x) \propto 0.02 \cdot 0.8 = 0.016
$$

**For Disease:**

- $P(x1 = 1 \mid \text{Disease}) = 0.7$
- $P(x2 = 1 \mid \text{Disease}) = 0.6$

$$
P(x \mid \text{Disease}) = 0.7 \cdot 0.6 = 0.42
$$

$$
P(\text{Disease} \mid x) \propto 0.42 \cdot 0.2 = 0.084
$$

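Since $0.084$ (Disease) $>$ $0.016$ (No Disease), we classify the patient as having the disease. Normalizing by $P(x) = 0.016 + 0.084 = 0.1$ gives the actual posteriors: $P(\text{Disease} \mid x) = 0.84$ and $P(\text{No Disease} \mid x) = 0.16$.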
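
The worked example can be reproduced with a short Python sketch. The priors and symptom probabilities are the ones from Steps 1 and 2; the function name `unnormalized_posteriors` and the data layout are assumptions for illustration. For an absent symptom ($x_j = 0$) the sketch uses the complement $1 - P(x_j = 1 \mid C_k)$, which the example itself does not need but which follows from the features being binary.

```python
# Priors from Step 1
priors = {"No Disease": 0.8, "Disease": 0.2}

# P(x_j = 1 | class) for Symptom A and Symptom B, from Step 2
p_symptom = {"No Disease": (0.2, 0.1), "Disease": (0.7, 0.6)}


def unnormalized_posteriors(x):
    """Return P(x | C_k) * P(C_k) per class for a tuple x of 0/1 symptom flags."""
    scores = {}
    for c, prior in priors.items():
        likelihood = 1.0
        for x_j, theta in zip(x, p_symptom[c]):
            # Bernoulli likelihood: theta if the symptom is present, else 1 - theta
            likelihood *= theta if x_j == 1 else 1.0 - theta
        scores[c] = likelihood * prior
    return scores


# Patient with both symptoms present
scores = unnormalized_posteriors((1, 1))
evidence = sum(scores.values())  # P(x), the normalizing constant
posteriors = {c: s / evidence for c, s in scores.items()}
prediction = max(scores, key=scores.get)
print(scores, posteriors, prediction)
```

Calling `unnormalized_posteriors((0, 0))` for a patient with neither symptom gives the opposite decision, since both the prior and the likelihood then favor No Disease.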
