# Introduction
Naive Bayes is a family of classification algorithms based on Bayes' theorem. It is a popular choice for classification tasks because of its simplicity, efficiency, and interpretability.

### Core principle
Naive Bayes classifiers rely on the assumption of conditional independence between features (predictors) given the class label (target variable). In simpler terms, once the class label is known, knowing the value of one feature tells you nothing more about the value of another feature. While this assumption does not always hold in reality, it often works well in practice for many classification problems.

### Classification process
- Training: The model learns from the labeled dataset where each data point has features and a corresponding class label.
- Prediction: For a new, unseen data point, the model calculates the probability of it belonging to each class. It does this by:
    1. Using Bayes' theorem to compute the posterior probability (probability of a class given the features).
    2. Assuming conditional independence between features, which simplifies the calculations.
    3. Multiplying the probabilities of each feature value given the class and multiplying by the prior probability of the class itself (learned from the training data).
- Assigning class label: The class with the highest posterior probability is assigned as the predicted class for the new data point.

### Example
Imagine emails are being classified as spam or not spam based on features like the presence of the word "money" and the presence of exclamation marks. Naive Bayes would assume that the presence of "money" doesn't influence the presence of exclamation marks (and vice versa) given the email class (spam or not spam).
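A minimal sketch of this example, assuming made-up probabilities (the priors and likelihoods below are hypothetical values chosen only for illustration, not estimates from real data). It multiplies the per-feature likelihoods by the class prior and normalizes the result:

```python
# Hypothetical probabilities, chosen only for illustration.
priors = {"spam": 0.4, "ham": 0.6}

# P(feature is present | class) for two binary features.
likelihoods = {
    "spam": {"money": 0.7, "exclamation": 0.8},
    "ham": {"money": 0.1, "exclamation": 0.3},
}

def posterior(observed):
    """Return P(class | features) under the naive independence assumption.

    `observed` maps each feature name to True (present) or False (absent).
    """
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature, present in observed.items():
            p = likelihoods[cls][feature]
            # Independence given the class lets us multiply per-feature terms.
            score *= p if present else (1.0 - p)
        scores[cls] = score
    total = sum(scores.values())  # evidence P(features), used to normalize
    return {cls: score / total for cls, score in scores.items()}

result = posterior({"money": True, "exclamation": True})
predicted = max(result, key=result.get)
print(predicted, result)
```

With these assumed numbers, an email containing both "money" and exclamation marks is assigned to spam with a posterior of roughly 0.93.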

### Advantages of Naive Bayes
- Simplicity and efficiency: Naive Bayes is easy to understand and implement, making it a good choice for beginners. It's also computationally efficient for training and prediction.
- Interpretability: Examining the per-class feature probabilities shows how each feature contributes to the classification.
- Performance: Naive Bayes can perform well for various classification tasks, especially when dealing with high-dimensional data (many features).

### Disadvantages of Naive Bayes
- Conditional independence assumption: The assumption of conditional independence between features might not always be valid, which can lead to suboptimal performance in some cases.
- Sensitivity to features: Naive Bayes can be sensitive to irrelevant features or features with many unique values. Feature selection or preprocessing techniques might be necessary for better performance.

# Bayes' Theorem
$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$

Where,
- $P(A|B)$ = Posterior. The probability of A being true, given B is true.
- $P(B|A)$ = Likelihood. The probability of B being true, given A is true.
- $P(A)$ = Prior. The probability of A being true before B is observed. This is prior knowledge.
- $P(B)$ = Evidence (marginal probability). The probability of B being true, obtained by marginalizing over A.
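As a small numeric illustration of the formula, the sketch below takes A to be "the email is spam" and B to be "the email contains the word money". All three input probabilities are hypothetical. The evidence $P(B)$ is obtained with the law of total probability:

```python
def bayes_posterior(likelihood, prior, evidence):
    """Apply Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood * prior / evidence

# Hypothetical inputs, for illustration only.
p_spam = 0.2               # P(A): prior
p_money_given_spam = 0.5   # P(B|A): likelihood
p_money_given_ham = 0.05   # P(B|not A)

# P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
p_money = p_money_given_spam * p_spam + p_money_given_ham * (1 - p_spam)

p_spam_given_money = bayes_posterior(p_money_given_spam, p_spam, p_money)
print(round(p_spam_given_money, 3))
```

Under these assumed values the posterior is roughly 0.71, well above the prior of 0.2, because "money" is ten times more likely in spam than in ham.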

# Naive Bayes Algorithm
### 1. Data preprocessing
- Text cleaning:
    - Remove punctuation, stop words (common words with little meaning), and potentially numbers depending on the task.
    - Convert text to lowercase for consistency.
    - Consider stemming or lemmatization (reducing words to base form) for improved accuracy (optional).
- Feature representation: Represent each document (sentence) as a feature vector. Common approaches:
    - Bag-of-Words (BoW): Each vocabulary word is a feature, with 0 or 1 indicating its absence or presence in the document (suits Bernoulli NB).
    - Term Frequency-Inverse Document Frequency (TF-IDF): Weights words based on their importance within the document and rarity across the corpus (commonly paired with Multinomial NB).
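The preprocessing steps above can be sketched with the standard library alone. The stop-word list, the sample documents, and the helper names (`clean`, `binary_features`, `count_features`) are assumptions made for this illustration. The sketch covers binary and count features only, not TF-IDF, stemming, or lemmatization:

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "at", "to", "of", "and"}

def clean(text):
    """Lowercase, replace punctuation and digits with spaces, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def binary_features(tokens, vocabulary):
    """1 if the word occurs in the document, else 0 (Bernoulli-style)."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def count_features(tokens, vocabulary):
    """Number of occurrences of each vocabulary word (multinomial-style)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

docs = ["Win money now, win BIG!", "The meeting is at 10."]
tokenized = [clean(d) for d in docs]
vocabulary = sorted({tok for toks in tokenized for tok in toks})

print(vocabulary)
print(binary_features(tokenized[0], vocabulary))
print(count_features(tokenized[0], vocabulary))
```

The two feature functions differ only for repeated words: "win" appears twice in the first document, so its count feature is 2 while its binary feature is 1.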

### 2. Model training
- Calculate class priors: Estimate the probability ($P(y = c)$) for each class ($c$) based on frequency in the training data.
- Calculate conditional probabilities:
    - Estimate the probability of each feature (word) appearing given a specific class ($P(w_i|y = c)$).
    - Use techniques like Laplace Smoothing to avoid zero probabilities for unseen words (especially for multinomial NB).
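A minimal training sketch for the multinomial case, assuming documents are already tokenized. The function name `train`, the smoothing parameter `alpha`, and the toy data are hypothetical. With `alpha=1.0` this is Laplace (add-one) smoothing: every word count is incremented by one so that no conditional probability is zero:

```python
from collections import Counter, defaultdict

def train(documents, labels, alpha=1.0):
    """Estimate class priors and smoothed P(word | class)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)

    # Prior: fraction of training documents in each class.
    n_docs = len(labels)
    priors = {c: class_counts[c] / n_docs for c in class_counts}

    # Likelihood: (count + alpha) / (total words in class + alpha * |V|).
    cond_probs = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        denom = total + alpha * len(vocabulary)
        cond_probs[c] = {
            w: (word_counts[c][w] + alpha) / denom for w in vocabulary
        }
    return priors, cond_probs

# Toy, hand-made training set.
docs = [["win", "money", "now"], ["meeting", "at", "noon"], ["win", "prize"]]
labels = ["spam", "ham", "spam"]
priors, cond_probs = train(docs, labels)

print(priors["spam"], cond_probs["spam"]["win"], cond_probs["spam"]["meeting"])
```

Note that "meeting" never occurs in a spam document here, yet its smoothed probability under spam is 1/12 rather than zero.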

### 3. Classification of new sentence
- Calculate posterior probability:
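    - Use the equation: $P(y = c | sentence) \propto \prod_i P(w_i | y = c) \cdot P(y = c)$. The evidence term $P(sentence)$ is the same for every class, so it can be dropped when comparing classes.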
    - Multiply the probabilities of each word ($w_i$) appearing in the sentence given its class ($c$).
    - Multiply by the prior probability of class ($c$).
- Class prediction: Assign the sentence to the class with the highest posterior probability ($P(y = c | sentence)$).
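The classification step might be sketched as follows. Two choices here are assumptions rather than part of the description above: the product is computed as a sum of logarithms, which avoids numerical underflow on long sentences, and words missing from the training vocabulary are skipped. The parameter tables are hypothetical, already-smoothed values:

```python
import math

def predict(tokens, priors, cond_probs):
    """Return the most probable class and the per-class log scores."""
    log_scores = {}
    for c, prior in priors.items():
        # log P(y = c) + sum_i log P(w_i | y = c)
        score = math.log(prior)
        for w in tokens:
            if w in cond_probs[c]:  # skip words unseen during training
                score += math.log(cond_probs[c][w])
        log_scores[c] = score
    best = max(log_scores, key=log_scores.get)
    return best, log_scores

# Hypothetical, already-smoothed parameters over a three-word vocabulary.
priors = {"spam": 0.5, "ham": 0.5}
cond_probs = {
    "spam": {"money": 0.5, "free": 0.4, "meeting": 0.1},
    "ham": {"money": 0.1, "free": 0.1, "meeting": 0.8},
}

label, scores = predict(["free", "money", "unknownword"], priors, cond_probs)
print(label, scores)
```

Because the logarithm is monotonic, the class with the highest log score is also the class with the highest unnormalized posterior.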

### Key points
- Naive Bayes assumes independence between words in a document given the class label (simplification).
- This assumption might not always hold true but offers computational efficiency and can be surprisingly effective.
- Multinomial NB can capture word frequency information but requires managing the feature space size.
- Laplace Smoothing helps address zero probabilities and improves model robustness.

# Spam classifier with Naive Bayes and Bag-of-Words
### Objective
Create a binary text classifier to distinguish spam emails from legitimate emails (ham).

### Challenges
- Raw text cannot be fed directly into most ML models.
- The text has to be converted into numerical features suitable for the model.

### Solution
- Feature extraction using Bag-of-Words: This technique represents documents as a collection of words, ignoring grammar and word order.
    - All the unique words from the entire email dataset are extracted.
    - Each email is then represented by a feature vector where each element indicates the frequency (count) of a particular word in that email.

### Classification using Naive Bayes
- Naive Bayes is well suited for text classification tasks.
- It assumes the independence of features (words) in a document, which might not be strictly true but often works well in practice for text data.
- The model calculates the probability of an email being spam or ham based on the presence and frequency of words associated with each category.

### Example
- Emails:
    - "Can you please look at the task ...?" (ham)
    - "Hi! I am a Nigerian Prince" (spam)
- Extracted Bag-of-Words: "Can", "you", "please", "look", "at", "the", "task", "...", "?", "Hi", "!", "I", "am", "a", "Nigerian", "Prince".
- Feature vectors:
    - Ham: (1 count of "Can", 1 count of "you", ..., 0 for "Nigerian", 0 for "Prince").
    - Spam: (0 for "Can", 0 for "you", ..., 1 count of "Nigerian", 1 count of "Prince").
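The feature vectors for these two emails can be built with a short sketch. The tokenizer is an assumption: it keeps alphabetic words only, so punctuation tokens such as "!" are dropped, and it is case-sensitive. The vocabulary is ordered by first appearance:

```python
import re
from collections import Counter

emails = [
    ("Can you please look at the task ...?", "ham"),
    ("Hi! I am a Nigerian Prince", "spam"),
]

def tokenize(text):
    """Keep alphabetic words only (punctuation is dropped in this sketch)."""
    return re.findall(r"[A-Za-z]+", text)

# Vocabulary: unique words in order of first appearance across all emails.
vocabulary = []
for text, _ in emails:
    for token in tokenize(text):
        if token not in vocabulary:
            vocabulary.append(token)

def to_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [(to_vector(text), label) for text, label in emails]

print(vocabulary)
for vector, label in vectors:
    print(label, vector)
```

Each vector has one entry per vocabulary word. Since the two emails share no words, the ham vector is nonzero only in the first seven positions and the spam vector only in the last six.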

Naive Bayes uses these feature vectors and their corresponding labels (spam/ham) to learn the probability distribution of words for each category. During prediction for a new email, the model calculates the probabilities of the email being spam and ham based on the word frequencies and classifies it accordingly.

### Summary
- Bag-of-Words transforms text data into numerical features for analysis.
- Naive Bayes leverages these features to classify emails as spam or ham based on word probabilities learned from the training data. This approach provides a simple and effective way to build a spam classifier.

### Note
Real-world spam classification can be more complex and might involve additional techniques such as stemming or lemmatization (reducing words to their root form), n-grams (considering sequences of words), and feature weighting based on importance.

# Bag-of-Words (BoW)
