# Naive Bayes

Naive Bayes is a family of probabilistic classifiers based on Bayes' Theorem, with the "naive" assumption that every pair of features is conditionally independent given the class label.

It is particularly suited for classification tasks and is known for its simplicity, efficiency, and effectiveness in a variety of applications.

## Bayes' Theorem

Bayes' Theorem provides a way to calculate the probability of a hypothesis given some evidence. It is expressed as:

$ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$

Where:
- $P(A|B)$ is the posterior probability of hypothesis $ A $ given evidence $ B $.
- $ P(B|A) $ is the likelihood of evidence $ B $ given hypothesis $ A $.
- $ P(A) $ is the prior probability of hypothesis $ A $.
- $ P(B) $ is the marginal probability of evidence $ B $, which acts as a normalizing constant.
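
To make the formula concrete, here is a minimal sketch that evaluates it for a single hypothesis. The function name `posterior` and the spam numbers are hypothetical and chosen only for illustration; $ P(B) $ is expanded with the law of total probability.

```python
def posterior(prior: float, likelihood: float, likelihood_given_not: float) -> float:
    """Return P(A|B) from P(A), P(B|A) and P(B|not A)."""
    # Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 20% of mail is spam, and the word "offer" appears
# in 60% of spam messages and in 5% of non-spam messages.
p_spam_given_offer = posterior(prior=0.2, likelihood=0.6, likelihood_given_not=0.05)
print(round(p_spam_given_offer, 3))  # 0.75
```

With these made-up numbers, seeing the word raises the probability of spam from the 20% prior to 75%.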

## Types of Naive Bayes Classifiers

1. **Gaussian Naive Bayes:**
   - Assumes that the features follow a normal (Gaussian) distribution.
   - Suitable for continuous data.
   - The likelihood of the features is calculated using the Gaussian distribution:

     $ P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left( -\frac{(x_i - \mu_y)^2}{2\sigma_y^2} \right) $

     Where $ \mu_y $ and $ \sigma_y $ are the mean and standard deviation of the feature $ x_i $ for class $ y $.

2. **Multinomial Naive Bayes:**
   - Suitable for discrete data, especially for text classification tasks (e.g., spam detection).
   - Assumes that the features follow a multinomial distribution.
   - The likelihood of the features is calculated based on the frequency of terms:

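     $ P(x_i|y) = \frac{\text{count}(x_i, y) + \alpha}{\sum_{j=1}^{n} \text{count}(x_j, y) + \alpha \cdot n} $

     Where $ \text{count}(x_i, y) $ is the number of times feature $ x_i $ occurs in training samples of class $ y $, $ \alpha $ is the smoothing parameter ($ \alpha = 1 $ gives Laplace smoothing), and $ n $ is the number of distinct features (for text, the vocabulary size).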

3. **Bernoulli Naive Bayes:**
   - Suitable for binary/Boolean features (e.g., presence or absence of a word).
   - Assumes that the features follow a Bernoulli distribution.
   - The likelihood of the features is calculated as:

     $ P(x_i|y) = p_{i,y}^{x_i} \cdot (1 - p_{i,y})^{(1 - x_i)} $

     Where $ p_{i,y} $ is the probability of feature $ x_i $ being 1 given class $ y $.
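
The three likelihood formulas above can be sketched as plain Python functions. This is an illustrative sketch rather than a reference implementation: the function names and the input values are assumptions, and the parameters (mean, standard deviation, counts, $ p_{i,y} $) are taken as already estimated from training data.

```python
import math


def gaussian_likelihood(x, mean, std):
    """Gaussian density of a continuous feature value x for one class."""
    var = std ** 2
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def multinomial_likelihood(term_count, total_count, n_terms, alpha=1.0):
    """Smoothed probability of one term for one class.

    term_count:  occurrences of the term in training samples of the class
    total_count: occurrences of all terms in training samples of the class
    n_terms:     number of distinct terms (n in the formula above)
    """
    return (term_count + alpha) / (total_count + alpha * n_terms)


def bernoulli_likelihood(x, p):
    """Probability of a binary feature value x (0 or 1) when P(x = 1 | y) = p."""
    return (p ** x) * ((1.0 - p) ** (1 - x))


# Hypothetical values, chosen only to exercise each formula.
print(gaussian_likelihood(0.0, mean=0.0, std=1.0))           # peak of the standard normal
print(multinomial_likelihood(3, total_count=10, n_terms=5))  # (3 + 1) / (10 + 5)
print(bernoulli_likelihood(1, p=0.8))                        # feature present
print(bernoulli_likelihood(0, p=0.8))                        # feature absent
```

Note that the Bernoulli form also scores the absence of a feature, whereas the multinomial form only uses the terms that occur.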

## Assumptions

- **Feature Independence:** Naive Bayes assumes that all features are conditionally independent given the class label. This rarely holds in real-world data, but the classifier often performs well regardless.
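
Under this assumption the class score factorizes into the prior times one likelihood term per feature, so prediction reduces to picking the class with the largest product. The sketch below illustrates that rule for binary features, working in log space to avoid numerical underflow. The `predict` function, the class names, and all probabilities are hypothetical.

```python
import math


def predict(x, priors, feature_probs):
    """Return the best class and the log-scores for binary feature vector x.

    priors:        {class: P(class)}
    feature_probs: {class: [P(x_i = 1 | class) for each feature i]}
    """
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for xi, p in zip(x, feature_probs[label]):
            # Conditional independence: per-feature log-likelihoods add up.
            score += math.log(p if xi == 1 else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical model with two classes and three binary word features.
priors = {"spam": 0.4, "ham": 0.6}
feature_probs = {
    "spam": [0.8, 0.6, 0.1],
    "ham": [0.1, 0.3, 0.5],
}
label, scores = predict([1, 1, 0], priors, feature_probs)
print(label)  # spam
```

The scores are unnormalized log-posteriors. Dividing by $ P(B) $ is unnecessary for classification because it is the same for every class.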

## Advantages

- **Simple and Fast:** Easy to implement and computationally efficient.
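- **Handles Missing Data:** Because each feature contributes its own factor to the likelihood, a feature that is missing at prediction time can in principle be left out of the product.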
- **Effective with Small Data:** Performs well with relatively small datasets.
- **Scalable:** Training takes a single pass over the data, and the number of stored parameters grows linearly with the number of features and classes.
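
As an illustration of the missing-data point, the sketch below scores a sample with Gaussian features and skips any feature whose value is `None`. This shows a property of the model itself; library implementations may not do this automatically and may expect missing values to be imputed first. All names and parameters here are hypothetical.

```python
import math


def log_gaussian(x, mean, std):
    """Log of the Gaussian density at x."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)


def class_scores(x, priors, params):
    """Log-posterior scores, ignoring features whose value is None.

    params: {class: [(mean, std) for each feature]}
    """
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for xi, (mean, std) in zip(x, params[label]):
            if xi is None:
                continue  # a missing feature's factor is marginalized out
            score += log_gaussian(xi, mean, std)
        scores[label] = score
    return scores


# Hypothetical two-class model with two continuous features.
priors = {"a": 0.5, "b": 0.5}
params = {
    "a": [(0.0, 1.0), (5.0, 1.0)],
    "b": [(3.0, 1.0), (5.0, 1.0)],
}
scores = class_scores([0.5, None], priors, params)
best = max(scores, key=scores.get)
print(best)  # a
```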

## Disadvantages

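- **Feature Independence Assumption:** Assumes that features are conditionally independent given the class, which is rarely true in real-world data. Correlated features are effectively counted more than once, so the predicted probabilities can be poorly calibrated.
- **Zero Probability Problem:** If a feature value never occurs with a given class in the training set, its estimated likelihood is zero, which drives the whole product for that class to zero regardless of the other features. This can be mitigated with techniques like Laplace smoothing.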
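
The zero probability problem can be reproduced with a few made-up term counts. In this sketch (the function `class_likelihood`, the vocabulary, and the counts are hypothetical), the word "meeting" never appears in the spam training data, so without smoothing it wipes out the spam likelihood of the whole message.

```python
def class_likelihood(doc_terms, term_counts, vocabulary, alpha):
    """Product of smoothed per-term likelihoods of a document for one class."""
    total = sum(term_counts.get(t, 0) for t in vocabulary)
    n = len(vocabulary)
    likelihood = 1.0
    for term in doc_terms:
        likelihood *= (term_counts.get(term, 0) + alpha) / (total + alpha * n)
    return likelihood


vocabulary = ["free", "offer", "meeting"]
spam_counts = {"free": 6, "offer": 4}  # "meeting" was never seen in spam
doc = ["free", "meeting"]

unsmoothed = class_likelihood(doc, spam_counts, vocabulary, alpha=0.0)
smoothed = class_likelihood(doc, spam_counts, vocabulary, alpha=1.0)
print(unsmoothed)  # 0.0
print(smoothed)    # (7 / 13) * (1 / 13), small but non-zero
```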

## Applications

- **Text Classification:** Spam detection, sentiment analysis, document categorization.
- **Medical Diagnosis:** Predicting diseases based on symptoms.
- **Recommendation Systems:** Collaborative filtering.