# Introduction
Naive Bayes is a family of classification algorithms based on Bayes' theorem. It is a popular choice for classification tasks because of its simplicity, efficiency, and interpretability.

### Core principle
Naive Bayes classifiers work under the assumption of conditional independence between features (predictors) given the class label (target variable). In simpler terms, it assumes that knowing the value of one feature does not influence the probability of another feature's value, as long as the class label is already known. While this assumption does not always hold true in reality, it often works well in practice for many classification problems.

### Classification process
- Training: The model learns from the labeled dataset where each data point has features and a corresponding class label.
- Prediction: For a new unseen data point, the model calculates the probability of it belonging to each class. It achieves this by,
    1. Using Bayes' theorem to compute the posterior probability (probability of a class given the features).
    2. Assuming conditional independence between features, which simplifies the calculations.
    3. Multiplying the probabilities of each feature value given the class and multiplying by the prior probability of the class itself (learned from the training data).
- Assigning class label: The class with the highest posterior probability is assigned as the predicted class for the new data point.

### Example
Imagine emails are being classified as spam or not spam based on features like the presence of the word "money" and the presence of exclamation marks. Naive Bayes would assume that the presence of "money" doesn't influence the presence of exclamation marks (and vice versa) given the email class (spam or not spam).

### Advantages of Naive Bayes
- Simplicity and efficiency: Naive Bayes is easy to understand and implement, making it a good choice for beginners. It's also computationally efficient for training and prediction.
- Interpretability: The model makes it possible to understand how each feature contributes to the classification by examining the feature probabilities for each class.
- Performance: Naive Bayes can perform well for various classification tasks, especially when dealing with high-dimensional data (many features).

### Disadvantages of Naive Bayes
- Conditional independence assumption: The assumption of conditional independence between features might not always be valid, which can lead to suboptimal performance in some cases.
- Sensitivity to features: Naive Bayes can be sensitive to irrelevant features or features with many unique values. Feature selection or preprocessing techniques might be necessary for better performance.

# Bayes Theorem
$P(A|B) = \frac{P(B|A)*P(A)}{P(B)}$

Where,
- $P(A|B)$ = Posterior. The probability of A being true, given B is true.
- $P(B|A)$ = Likelihood. The probability of B being true, given A is true.
- $P(A)$ = Prior. The probability of A being true before B is observed. This is prior knowledge.
- $P(B)$ = Evidence (marginal probability). The probability of B being true, summed over all possible values of A.
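
As a minimal sketch of the formula above, the following snippet plugs numbers into Bayes' theorem for the spam setting. All probabilities are hypothetical values chosen for illustration, and the function name `posterior` is an assumption, not part of any library.

```python
def posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Hypothetical numbers, chosen only for illustration.
p_spam = 0.2                # P(A): prior probability that an email is spam
p_money_given_spam = 0.5    # P(B|A): "money" appears in a spam email
p_money_given_ham = 0.05    # P(B|not A): "money" appears in a ham email

# P(B) via the law of total probability over both classes.
p_money = p_money_given_spam * p_spam + p_money_given_ham * (1 - p_spam)

p_spam_given_money = posterior(p_money_given_spam, p_spam, p_money)
print(round(p_spam_given_money, 3))
```

With these assumed numbers, seeing the word "money" raises the probability of spam from the prior of 0.2 to roughly 0.71.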

# Naive Bayes Algorithm
### 1. Data preprocessing
- Text cleaning:
    - Remove punctuation, stop words (common words with little meaning), and potentially numbers depending on the task.
    - Convert text to lowercase for consistency.
    - Consider stemming or lemmatization (reducing words to base form) for improved accuracy (optional).
- Feature representation: Represent each document (sentence) as a feature vector.
- Common approaches:
    - Bag-of-Words (BoW): Each vocabulary word is a feature. A binary value (0 or 1 indicating absence or presence) suits Bernoulli NB, while a word count suits Multinomial NB.
    - Term Frequency-Inverse Document Frequency (TF-IDF): Weights words based on their importance within the document and rarity across the corpus (often paired with Multinomial NB).

### 2. Model training
- Calculate class priors: Estimate the probability ($P(y = c)$) for each class ($c$) based on frequency in the training data.
- Calculate conditional probabilities:
    - Estimate the probability of each feature (word) appearing given a specific class ($P(w_i|y = c)$)
    - Use techniques like Laplace Smoothing to avoid zero probabilities for unseen words (especially for multinomial NB).

### 3. Classification of new sentence
- Calculate posterior probability:
    - Use the equation: $P(y = c | \text{sentence}) \propto P(y = c) * \prod_i P(w_i | y = c)$.
    - Multiply the probabilities of each word ($w_i$) appearing in the sentence given its class ($c$).
    - Multiply by the prior probability of class ($c$).
- Class prediction: Assign the sentence to the class with the highest posterior probability ($P(y = c | \text{sentence})$).
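
The training and classification steps above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: it assumes the multinomial variant with Laplace Smoothing over the vocabulary, skips words that are not in the vocabulary, and sums logarithms instead of multiplying probabilities to avoid numerical underflow. The function names and the toy dataset are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(docs, labels, alpha=1.0):
    """Estimate class priors and smoothed word likelihoods from tokenized docs."""
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    likelihoods = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        # Assumed smoothing: (count + alpha) / (total words in class + alpha * |vocab|)
        likelihoods[c] = {
            w: (word_counts[c][w] + alpha) / (total + alpha * len(vocab))
            for w in vocab
        }
    return priors, likelihoods

def predict(tokens, priors, likelihoods):
    """Return the class with the highest unnormalized log posterior."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in tokens:
            if w in likelihoods[c]:  # words outside the vocabulary are skipped
                score += math.log(likelihoods[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

# Toy, already-tokenized training data (hypothetical).
docs = [
    ["win", "money", "now"],
    ["free", "money"],
    ["meeting", "at", "noon"],
    ["see", "you", "at", "lunch"],
]
labels = ["spam", "spam", "ham", "ham"]

priors, likelihoods = train(docs, labels)
print(predict(["free", "money", "now"], priors, likelihoods))
print(predict(["lunch", "at", "noon"], priors, likelihoods))
```

Working in log space does not change which class wins, because the logarithm is monotonically increasing.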

### Key points
- Naive Bayes assumes independence between words in a document given the class label (simplification).
- This assumption might not always hold true but offers computational efficiency and can be surprisingly effective.
- Multinomial NB can capture word frequency information but requires managing the feature space size.
- Laplace Smoothing helps address zero probabilities and improves model robustness.

# Spam Classifier With Naive Bayes, And Bag-of-Words
### Objective
Create a binary text classifier to distinguish spam email from legitimate emails (ham).

### Challenges
- Raw text cannot be fed directly into ML models, which expect numerical input.
- The text information has to be converted into features suitable for the model.

### Solution
- Feature extraction using Bag-of-Words: This technique represents documents as a collection of words, ignoring grammar and word order.
    - All the unique words from the entire email dataset are extracted.
    - Each email is then represented by a feature vector where each element indicates the frequency (count) of a particular word in that email.

### Classification using Naive Bayes
- Naive Bayes is well suited for text classification tasks.
- It assumes the independence of features (words) in a document, which might not be strictly true but often works well in practice for text data.
- The model calculates the probability of an email being spam or ham based on the presence and frequency of words associated with each category.

### Example
- Emails:
    - "Can you please look at the task ...?" (ham)
    - "Hi! I am a Nigerian Prince" (spam)
- Extracted Bag-of-Words: "Can", "you", "please", "look", "at", "the", "task", "...", "Hi", "!", "I", "am", "a", "Nigerian", "Prince".
- Feature vectors:
    - Ham: (1 count of "Can", 1 count of "you", ..., 0 for "Nigerian", 0 for "Prince").
    - Spam: (0 for "Can", 0 for "you", ..., 1 count of "Nigerian", 1 count of "Prince").

Naive Bayes uses these feature vectors and their corresponding labels (spam/ham) to learn the probability distribution of words for each category. During prediction for a new email, the model calculates the probabilities of the email being spam and ham based on the word frequencies and classifies it accordingly.

### Summary
- Bag-of-Words transforms text data into numerical features for analysis.
- Naive Bayes leverages these features to classify emails as spam or ham based on word probabilities learned from the training data. This approach provides a simple and effective way to build a spam classifier.

### Note
Real-world spam classification can be more complex and might involve additional techniques like stemming or lemmatization (reducing words to their root form), n-grams (considering sequences of words), and feature weighting based on importance.
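
To make the n-gram idea concrete, here is a small sketch that extracts word n-grams from a token list. The helper name `ngrams` is hypothetical and the tokens are assumed to be already lowercased.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["i", "am", "a", "nigerian", "prince"]
bigrams = ngrams(tokens, 2)
print(bigrams)
```

Each bigram such as `("nigerian", "prince")` can then be treated as one feature, which lets the model pick up on short phrases that single words miss.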

# Bag-of-Words (BoW)
Bag-of-Words (BoW) is a fundamental technique used in Natural Language Processing (NLP) for representing text data. It focuses on the occurrences of words within a document, ignoring grammar or word order.

### Core idea
- Imagine a bag filled with words, where each word appears as many times as it occurs in the document.
- The order or context in which the words appear is not considered.

### Creating a Bag-of-Words representation
1. Preprocessing: Text cleaning steps like removing punctuation, stop words (that is, common words like "a", "an", "the"), and converting text to lowercase are often performed.
2. Tokenization: The text is split into individual words (tokens).
3. Vocabulary creation: A list of unique words encountered across all documents in the corpus (collection of documents) is created. This is called the vocabulary.
4. Feature vector representation: Each document is represented by a feature vector. This vector has the same length as the vocabulary.

- Each element in the vector corresponds to a word in the vocabulary.
- The value at each element represents the number of times that particular word appears in the document (its frequency).

### Example
Consider 2 documents,
- Document 1: "The quick brown fox jumps over the lazy dog.".
- Document 2: "The dog is lazy. The fox is quick.".

After preprocessing and tokenization, the result would be,
- Vocabulary: ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog", "is"].

The feature vectors for these documents could be,
- Document 1: [2, 1, 1, 1, 1, 1, 1, 1, 0]
- Document 2: [2, 1, 0, 1, 0, 0, 1, 1, 2]
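
The same vectors can be reproduced with a short sketch. It assumes lowercasing and punctuation removal as the only preprocessing (stop words are kept so that the vocabulary matches the one listed above), and the helper names are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(docs):
    """Collect unique tokens in order of first appearance."""
    vocab = []
    for doc in docs:
        for token in tokenize(doc):
            if token not in vocab:
                vocab.append(token)
    return vocab

def bow_vector(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocab]

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog is lazy. The fox is quick.",
]
vocab = build_vocabulary(docs)
vectors = [bow_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Note that "is" occurs twice in the second document, so its count in that vector is 2.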

### Applications
Bag-of-Words is a simple and effective way to represent text data for various NLP tasks, including,
- Document classification (e.g., spam detection, sentiment analysis).
- Information retrieval (e.g., document search).
- Topic modeling (identifying groups of related words).

### Limitations
- BoW ignores word order and context, which can be crucial for understanding the meaning of a sentence.
- Words with similar meanings (synonyms) are treated differently.
- The effectiveness of BoW depends heavily on the quality of the preprocessing steps.

### Alternatives
- TF-IDF (Term Frequency-Inverse Document Frequency) is a popular extension of BoW that incorporates the importance of words within a document and across the corpus.
- Word embeddings, like word2vec and GloVe, capture semantic relationships between words and provide a more nuanced representation of text data.

Overall, Bag-of-Words is a foundational technique in NLP, offering a simple and efficient way to represent text data for various tasks. However, it's important to be aware of its limitations and consider alternative approaches depending on the specific application.

### Further reading
https://www.scaler.com/topics/nlp/text-representation-in-nlp/

# Text Vectorization And Feature Reduction In Spam Classification
### Text to vectors
- For machine learning models to work, text data needs to be converted into numerical features for processing.
- BoW is a common technique for text vectorization.
- BoW represents documents as vectors where each element corresponds to a unique word in the vocabulary.
- The value in each element represents the frequency (count) of that word in the document.

### Example
Consider the sentence: "Can you please look at the task...?".
- Vocabulary: ["Can", "you", "please", "look", "at", "the", "task", "..."].
- BoW vector: [1, 1, 1, 1, 1, 1, 1, 1] (Assuming each word appears once).

### Challenges with high dimensionality
- With a large corpus, the vocabulary size (the number of unique words) can become very large.
- This leads to high dimensional feature vectors (potentially tens of thousands of features).
- High dimensionality can pose problems for machine learning models,
    - Increased computational cost for training and prediction.
    - Potential for overfitting, where the model memorizes training data instead of learning the general patterns.

### Feature reduction techniques
- Text cleaning: Preprocessing steps like removing the stop words (common words like "the", "a", "an") and punctuation can significantly reduce the vocabulary size.
- Feature weighting and dimensionality reduction techniques:
    - Term Frequency-Inverse Document Frequency (TF-IDF): This method weights words based on their importance within a document and across the corpus. Words that are frequent in a document but rare overall (like "Nigeria" for spam) receive higher weights, leading to more informative features. Strictly speaking, TF-IDF reweights features rather than removing them, but words with very low weights can then be dropped.
    - Principal Component Analysis (PCA): This technique projects data points onto a lower-dimensional space while capturing most of the variance in the data. This reduces the number of features while preserving the most important information.
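
As a rough illustration of TF-IDF weighting, the sketch below uses one common variant, assumed here for simplicity: term frequency is the word count divided by the document length, and inverse document frequency is `log(N / df)`. Libraries often use slightly different smoothing, so the exact numbers will differ from theirs. The function name and toy documents are hypothetical.

```python
import math

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    Assumed variant: tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    vocab = {w for doc in docs for w in doc}
    # Document frequency: number of documents containing each word.
    df = {w: sum(1 for doc in docs if w in doc) for w in vocab}
    weights = []
    for doc in docs:
        weights.append({
            w: (doc.count(w) / len(doc)) * math.log(n_docs / df[w])
            for w in set(doc)
        })
    return weights

docs = [
    ["the", "prince", "of", "nigeria"],
    ["the", "task", "is", "done"],
    ["look", "at", "the", "task"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus "the" appears in every document, so its weight is 0, while "nigeria" appears in only one document and receives the highest weight there.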

### Summary
- BoW provides a basic way to convert text into numerical features.
- High dimensionality due to large vocabulary size can hinder model performance.
- Text cleaning, feature weighting (TF-IDF), and dimensionality reduction (PCA) help manage the feature space and improve model effectiveness.

# Text Cleaning For Text Classification
1. Convert sentences into words. This is technically called tokenization.
2. Convert all the text to lowercase. How will this help? This will remove duplicates like the, THE, The, etc.
3. Remove non-alphabetic characters. What does this mean? Punctuation such as the comma (,) and the full stop (.) can be removed, along with numbers. Removing numbers is not a hard rule; they can be left as is in the text.
4. Remove stopwords. Meaning, words such as the, how, where, etc., can be removed. Stopwords are words that do not add a lot of value to the classification.

NOTE: All of this text processing is optional.
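
The four steps can be sketched as a single function. This is a minimal illustration that assumes whitespace tokenization and a tiny, hypothetical stop-word list; real pipelines usually rely on a tokenizer and a much longer list.

```python
import re

# A tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "how", "where", "at", "is"}

def clean(text):
    tokens = text.split()                                # 1. tokenize on whitespace
    tokens = [t.lower() for t in tokens]                 # 2. convert to lowercase
    tokens = [re.sub(r"[^a-z]", "", t) for t in tokens]  # 3. drop non-alphabetic characters
    tokens = [t for t in tokens if t]                    #    discard tokens left empty
    return [t for t in tokens if t not in STOP_WORDS]    # 4. remove stop words

cleaned = clean("Can you please look at THE task...?")
print(cleaned)
```

Because every step is optional, any of the numbered lines in `clean` can be commented out to see how the resulting tokens change.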

# Mathematical Intuition For Naive Bayes
### Objective
Classify a new text message (sentence) as spam (class 1) or ham (class 0) based on the words it contains.

### Mathematical formulation
The posterior probability has to be calculated,

$P(y = c | w_1, w_2, ..., w_n)$ = Probability of the sentence belonging to class $c$ (spam or ham) given the set of words ($w_1$ to $w_n$).

Using Bayes' theorem,

$P(A | B) = \frac{P(B | A) * P(A)}{P(B)}$

Where,
- A = Class ($c$ = 0 for ham, and $c$ = 1 for spam).
- B = Set of words ($w_1$, $w_2$, ..., $w_n$).

### Challenges
1. Calculating likelihood ($P(B | A)$):
    - The probability of all the words appearing together given the class is needed (e.g., $P(w_1, w_2, ..., w_n | y = 1)$ for spam).
    - Directly estimating this joint probability is difficult due to the "curse of dimensionality": the number of possible word combinations grows exponentially with the number of words, so most combinations never appear in the training data.
2. Naive assumption: By the chain rule, the likelihood factorizes exactly as $P(w_1, w_2, ..., w_n | y = c) = P(w_1 | y = c) * P(w_2 | y = c, w_1) * ... * P(w_n | y = c, w_1, w_2, ..., w_{n - 1})$. Naive Bayes simplifies this by assuming independence between words in a document given the class label ($c$). This means,
    - $P(w_1, w_2, ..., w_n | y = c) ≈ P(w_1 | y = c) * P(w_2 | y = c) * ... * P(w_n | y = c)$.
    - We estimate the probability of each word individually given the class ($c$).

### Impact of the assumption
- This simplification makes the calculation of likelihood tractable.
- However, the independence assumption might not always hold true in natural language, where word order and context can influence meaning.

### Summary
Naive Bayes offers a computationally efficient approach to text classification by,
- Formulating the problem using Bayes' theorem and conditional probabilities.
- Making the simplifying assumption of word independence given the class label.
- Despite the assumption, Naive Bayes can be surprisingly effective for many text classification tasks.

# Naive Assumption in Naive Bayes
### The core assumption
Naive Bayes in text classification assumes independence between words in a document given the class label (spam or ham). This means,
- $P(w_1, w_2, ..., w_n | y = c) ≈ P(w_1 | y = c) * P(w_2 | y = c) * ... * P(w_n | y = c)$.
- We estimate the probability of each word individually given the class, ignoring the influence of other words in the sentence.

### Impact of the assumption
- Simplification: This assumption makes calculating the likelihood (probability of words given the class) tractable, avoiding the "curse of dimensionality" issue.
- Limitation: The assumption might not always hold true. Words can be related, and their presence can influence the probability of others (e.g., "happy" and "new" appearing together more frequently).

### Example
Consider $P(w_2 | y = 1, w_1)$. Naively, it becomes $P(w_2 | y = 1)$, ignoring the presence of $w_1$. In reality, the probability of "new" might depend on "happy" being present.

### Benefits of the assumption
- Computational efficiency: Easier to calculate individual word probabilities than complex joint probabilities.
- Surprisingly effective: Despite the simplification, Naive Bayes can achieve good performance in many text classification tasks.

### Justification for the assumption
- While word dependencies exist, their overall impact might average out across a large corpus.
- The simplicity of the model can sometimes compensate for the imperfect assumption.

### Summary
Naive Bayes takes a pragmatic approach. It acknowledges that word independence isn't entirely true but leverages the assumption for computational efficiency and achieves reasonable performance in many real-world scenarios. This trade-off between simplicity and accuracy is what makes Naive Bayes a popular choice for text classification tasks.

# Summary of Naive Bayes For Text Classification
### Objective
Classify a sentence as spam (class 1) or ham (class 0) based on the words it contains.

### Key equation
The posterior probability has to be found: $P(y = c | \text{sentence})$, which is the probability of the sentence belonging to class $c$ (spam or ham) given the words in the sentence.

### Naive Bayes approach
1. Leverages Bayes' theorem: $P(y = c | \text{sentence}) = \frac{P(\text{sentence} | y = c) * P(y = c)}{P(\text{sentence})}$.
2. Naive assumption: Assumes independence between words in the sentence given the class label ($c$). This simplifies the calculation of $P(\text{sentence} | y = c)$.
3. Simplified equation (for spam class, $c$ = 1): $P(y = 1 | \text{sentence}) \propto P(y = 1) * \prod_i P(w_i | y = 1)$. Where,
    - $\prod_i$ (product symbol) = Multiplies the probabilities of each word ($w_i$) appearing in the sentence given that it is spam ($y$ = 1).
    - $P(y = 1)$ = Prior probability of a message being spam.

### Classification
- Calculate $P(y = 1 | \text{sentence})$ and $P(y = 0 | \text{sentence})$ using the same approach for both spam and ham classes.
- The sentence is classified into the class with the higher posterior probability.

### Why doesn't the denominator matter?
- The denominator, $P(\text{sentence})$, cancels out when comparing $P(y = 1 | \text{sentence})$ and $P(y = 0 | \text{sentence})$ because it is the same for both calculations.
- Only the class with the higher probability is considered, so the constant denominator does not affect the final decision.
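
A quick numerical check makes this concrete. The word likelihoods and priors below are hypothetical values for a two-word sentence; the point is only that dividing both class scores by the same positive constant does not change which one is larger.

```python
# Hypothetical unnormalized scores: prior times the product of word likelihoods.
score_spam = 0.5 * 0.20 * 0.10  # P(y=1) * P(w_1|y=1) * P(w_2|y=1)
score_ham = 0.5 * 0.05 * 0.02   # P(y=0) * P(w_1|y=0) * P(w_2|y=0)

# Under the naive assumption, P(sentence) is taken as the sum of the class scores.
evidence = score_spam + score_ham
posterior_spam = score_spam / evidence
posterior_ham = score_ham / evidence

# Dividing both scores by the same positive constant preserves their order.
assert (score_spam > score_ham) == (posterior_spam > posterior_ham)
print(round(posterior_spam, 3), round(posterior_ham, 3))
```

Normalizing is still useful when calibrated probabilities are needed, but it is unnecessary for picking the predicted class.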

### Naive assumption trade-off
- The assumption simplifies calculations but might not always hold true (words can be related).
- Despite the simplification, Naive Bayes can be surprisingly effective for many text classification tasks.

### Summary
Naive Bayes offers a simple yet powerful approach for text classification. By leveraging Bayes' theorem and making a simplifying assumption, it efficiently estimates the probability of a sentence belonging to a class based on word probabilities. While the independence assumption is not perfect, it often provides good results in practice.


# Limitations Of Naive Bayes
While Naive Bayes offers a powerful approach, it has some limitations.

### Limited text understanding
- It analyzes words independently, ignoring their meaning or relationships within the sentence.
- New words encountered during prediction (not in the training vocabulary) can lead to issues.

### Assumption of order independence
- The model does not consider the word order, which can affect meaning.
- Sentences like "good, not bad" and "bad, not good" contain the same words and are therefore treated identically.

### Frequency insensitivity
When the bag-of-words representation is binary (as in Bernoulli NB), the model treats a word appearing once or multiple times the same. Information about word frequency is lost.

### Zero probability problem
- If a word from a new sentence is absent from the vocabulary, its probability becomes 0.
- This can lead to the entire equation for that class becoming 0, making classification impossible.

### Handling out-of-vocabulary (OOV) words
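- Simple approach: Ignore the unknown word, that is, leave it out of the product for every class.
- $P(\text{unknown word} | y = c)$: Equivalently, assign the same constant probability (often 1) to unseen words in every class, so that they do not affect the comparison.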
- Laplace Smoothing: A more sophisticated technique that adds a small value (e.g., 1) to the count of each word when estimating probabilities. This avoids zero probabilities and provides smoother estimates.

### Summary
Naive Bayes offers a trade-off between simplicity and accuracy. While it has limitations in understanding complex text relationships, it can be effective for many classification tasks. Techniques like Laplace Smoothing help address the zero probability problem and improve robustness.

# Laplace Smoothing For Naive Bayes
### Problem
Naive Bayes calculates the probability of words ($w_j$) appearing in a class (e.g., spam). If a word is absent from the training data for a specific class, its probability becomes 0. This can lead to,
1. Zero probability problem: The entire equation for that class becomes 0, making classification impossible.
2. Mathematical issues: Multiplication by 0 can cause problems in calculations.

### Solution: Laplace Smoothing
This technique adds a small value ($\alpha$) to the count of each word when estimating probabilities. The formula for Laplace Smoothing with Naive Bayes is, $P(w_j | y = 1) = \frac{\text{Count}(w_j, y = 1) + \alpha}{\text{Total spam emails} + \alpha * c}$. Where,
- $\text{Count}(w_j, y = 1)$ = Number of spam emails that contain $w_j$.
- $\alpha$ = Hyperparameter controlling smoothing (typically a small value like 1).
- $c$ = Number of possible values for $w_j$ (in this case 2, absent or present).

### Advantages
- Non-zero probabilities: Ensures all words have a non-zero probability, even if unseen in training data.
- Robustness: Prevents the model from breaking down due to zero probabilities.
- Smoother estimates: Reduces the impact of sparse data, leading to more stable and reliable probability estimates.

### Example
- $\alpha$ = 1.
- Word "important" not present in spam emails ($\text{Count}(\text{important}, y = 1) = 0$).

### Without Smoothing
$P(\text{important} | y = 1) = \frac{0}{\text{Total spam emails}} = 0$ (classification impossible).

### With Smoothing
$P(\text{important} | y = 1) = \frac{0 + 1}{\text{Total spam emails} + 1 * 2} = \frac{1}{\text{Total spam emails} + 2}$ (provides a valid probability).

Laplace Smoothing is a simple yet effective technique that improves the robustness and reliability of Naive Bayes for text classification by avoiding zero probabilities and providing smoother probability estimates.
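
The example above can be checked with a few lines of code. This sketch assumes the binary (present or absent) setting from the example, so `c` defaults to 2, and it uses a hypothetical training set of 98 spam emails.

```python
def smoothed_probability(count, total, alpha=1.0, c=2):
    """Laplace-smoothed estimate: (count + alpha) / (total + alpha * c)."""
    return (count + alpha) / (total + alpha * c)

total_spam_emails = 98  # hypothetical number of spam emails in the training set

# "important" never appears in a spam email, so its raw count is 0.
unsmoothed = 0 / total_spam_emails
smoothed = smoothed_probability(count=0, total=total_spam_emails)
print(unsmoothed, smoothed)
```

Setting `alpha=0` recovers the unsmoothed estimate, while larger values of `alpha` pull every estimate closer to the uniform value `1 / c`.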

# Bernoulli vs. Multinomial Naive Bayes
### Feature representation
The key difference between Bernoulli and Multinomial Naive Bayes lies in how they handle features (word occurrences) in text classification.

### Bernoulli Naive Bayes (Bernoulli NB)
- Suitable for features with only 2 possible values (typically 0 or 1).
- Example: "good" can be either present (1) or absent (0) in a document.

### Multinomial Naive Bayes (Multinomial NB)
- Applicable for features with multiple distinct values (k, where k > 2).
- Example: "good" can appear 0 times, 1 time, 2 times, and so on (represented by different values based on frequency).

### Impact on features
- Bernoulli NB:
    - Simpler model with fewer features (0 or 1 for each word).
    - Might miss information about word frequency.
- Multinomial NB:
    - More complex model with increased features for each word (representing frequency).
    - Captures word frequency information but leads to a larger feature space.
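
The difference in feature representation can be seen on a single toy document. The vocabulary and sentence below are hypothetical.

```python
vocab = ["good", "movie", "bad"]
tokens = "good good movie".split()

# Multinomial NB input: how many times each vocabulary word occurs.
multinomial_features = [tokens.count(word) for word in vocab]

# Bernoulli NB input: whether each vocabulary word occurs at all.
bernoulli_features = [int(word in tokens) for word in vocab]

print(multinomial_features, bernoulli_features)
```

The repeated "good" is visible only in the count vector, which is the frequency information the Bernoulli representation gives up.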

### Feature engineering considerations
While Multinomial NB can capture frequency, the increase in features can be problematic. Two common techniques are,
- Minimum frequency threshold: Ignore words appearing less than a certain number of times.
- Maximum frequency threshold: Cap the maximum value for frequent words (e.g., "the").

These techniques help manage the feature space size in Multinomial NB.
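
Both thresholds can be sketched as follows. The helper names, the threshold values, and the toy documents are assumptions made for illustration.

```python
from collections import Counter

def build_vocab(docs, min_count=2):
    """Keep only words that occur at least `min_count` times across the corpus."""
    totals = Counter(word for doc in docs for word in doc)
    return sorted(word for word, n in totals.items() if n >= min_count)

def capped_counts(doc, vocab, max_value=3):
    """Count vocabulary words in a document, capping each count at `max_value`."""
    counts = Counter(doc)
    return [min(counts[word], max_value) for word in vocab]

docs = [
    ["the", "the", "the", "the", "task"],
    ["the", "task", "prince"],
]
vocab = build_vocab(docs)
features = capped_counts(docs[0], vocab)
print(vocab, features)
```

Here "prince" is dropped because it appears only once in the corpus, and the four occurrences of "the" in the first document are capped at 3.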

### Summary
