# Introduction
Naive Bayes is a family of classification algorithms based on Bayes' theorem. It is a popular choice for various classification tasks due to its simplicity, efficiency, and interpretability.

### Core principle
Naive Bayes classifiers work under the assumption of conditional independence between features (predictors) given the class label (target variable). In simpler terms, the model assumes that knowing the value of one feature does not change the probability of another feature's value, as long as the class label is already known. While this assumption does not always hold in reality, it often works well in practice for many classification problems.

### Classification process
- Training: The model learns from the labeled dataset where each data point has features and a corresponding class label.
- Prediction: For a new unseen data point, the model calculates the probability of it belonging to each class. It does this by:
    1. Using Bayes' theorem to compute the posterior probability (probability of a class given the features).
    2. Assuming conditional independence between features, which simplifies the calculations.
    3. Multiplying the probabilities of each feature value given the class and multiplying by the prior probability of the class itself (learned from the training data).
- Assigning class label: The class with the highest posterior probability is assigned as the predicted class for the new data point.

### Example
Imagine emails are being classified as spam or not spam based on features like the presence of the word "money" and the presence of exclamation marks. Naive Bayes would assume that the presence of "money" doesn't influence the presence of exclamation marks (and vice versa) given the email class (spam or not spam).

### Advantages of Naive Bayes
- Simplicity and efficiency: Naive Bayes is easy to understand and implement, making it a good choice for beginners. It's also computationally efficient for training and prediction.
- Interpretability: The contribution of each feature to the classification can be understood by examining the feature probabilities for each class.
- Performance: Naive Bayes can perform well for various classification tasks, especially when dealing with high-dimensional data (many features).

### Disadvantages of Naive Bayes
- Conditional independence assumption: The assumption of conditional independence between features might not always be valid, which can lead to suboptimal performance in some cases.
- Sensitivity to features: Naive Bayes can be sensitive to irrelevant features or features with many unique values. Feature selection or preprocessing techniques might be necessary for better performance.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

In [2]:
pd.set_option("display.max_columns", None)
sns.set_theme(style = "whitegrid")
warnings.filterwarnings("ignore")
plt.rcParams['figure.figsize'] = (20, 10)

In [3]:
df = pd.read_csv("spam_clean.csv", encoding = "latin-1")
df.head()

Unnamed: 0,type,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


# Bayes Theorem
$P(A|B) = \frac{P(B|A)*P(A)}{P(B)}$

Where,
- $P(A|B)$ = Posterior. The probability of A being true, given B is true.
- $P(B|A)$ = Likelihood. The probability of B being true, given A is true.
- $P(A)$ = Prior. The probability of A being true. This is the knowledge held before observing B.
- $P(B)$ = Marginal (evidence). The probability of B being true, regardless of A.
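
As a quick numerical check of the formula, the sketch below plugs in made-up numbers. They are illustrative assumptions, not values taken from the dataset: 20% of messages are spam, and the word "free" appears in 60% of spam messages and 5% of ham messages.

```python
# Hypothetical numbers chosen only for illustration.
p_spam = 0.2               # prior P(A)
p_free_given_spam = 0.6    # likelihood P(B|A)
p_free_given_ham = 0.05    # P(B|not A)

# Marginal P(B) via the law of total probability.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Posterior P(A|B) from Bayes' theorem.
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))
```

Under these assumed numbers, seeing the word "free" raises the probability of spam from the prior of 0.2 to a much higher posterior.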

# Naive Bayes Algorithm
### 1. Data preprocessing
- Text cleaning:
    - Remove punctuation, stop words (common words with little meaning), and potentially numbers depending on the task.
    - Convert text to lowercase for consistency.
    - Consider stemming or lemmatization (reducing words to base form) for improved accuracy (optional).
- Feature representation: Represent each document (sentence) as a feature vector.
- Common approaches:
    - Bag-of-Words (BoW): Each vocabulary word is a feature. Binary values (0 or 1 indicating absence or presence) suit Bernoulli NB, while counts suit Multinomial NB.
    - Term Frequency-Inverse Document Frequency (TF-IDF): Weights words based on their importance within the document and rarity across the corpus (often used with Multinomial NB).

### 2. Model training
- Calculate class priors: Estimate the probability ($P(y = c)$) for each class ($c$) based on frequency in the training data.
- Calculate conditional probabilities:
    - Estimate the probability of each feature (word) appearing given a specific class ($P(w_i|y = c)$)
    - Use techniques like Laplace Smoothing to avoid zero probabilities for unseen words (especially for multinomial NB).

### 3. Classification of new sentence
- Calculate posterior probability:
    - Use the equation: $P(y = c | \text{sentence}) ≈ \Pi(P(w_i | y = c)) * P(y = c)$.
    - Multiply the probabilities of each word ($w_i$) appearing in the sentence given its class ($c$).
    - Multiply by the prior probability of class ($c$).
- Class prediction: Assign the sentence to the class with the highest posterior probability ($P(y = c | \text{sentence})$).
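
The three steps above can be sketched from scratch in a few lines. This is a minimal illustration, not a reference implementation: the toy training messages are hypothetical, the helper names `word_prob` and `predict` are invented for this example, words never seen in training are simply skipped, and scores are accumulated as sums of logarithms rather than as a raw product.

```python
import math
from collections import Counter

# Toy training data (hypothetical, already cleaned and lowercased).
train = [
    ("win free money now", "spam"),
    ("free entry win prize", "spam"),
    ("are you coming home now", "ham"),
    ("call me when you are free", "ham"),
    ("see you at home", "ham"),
]

# Step 2a: class priors P(y = c) from class frequencies.
class_counts = Counter(label for _, label in train)
priors = {c: n / len(train) for c, n in class_counts.items()}

# Step 2b: per-class word counts used to estimate P(w_i | y = c).
word_counts = {c: Counter() for c in class_counts}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def word_prob(word, c, alpha=1.0):
    """Laplace-smoothed multinomial estimate of P(word | y = c)."""
    total = sum(word_counts[c].values())
    return (word_counts[c][word] + alpha) / (total + alpha * len(vocab))

def predict(sentence):
    """Step 3: score each class in log space and return the best one."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in sentence.split():
            if w in vocab:  # words never seen in training are skipped
                score += math.log(word_prob(w, c))
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("win free prize"))
print(predict("are you home"))
```

Working in log space is a design choice: it ranks the classes exactly as the product would, while avoiding very small floating-point numbers for long sentences.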

### Key points
- Naive Bayes assumes independence between words in a document given the class label (simplification).
- This assumption might not always hold true but offers computational efficiency and can be surprisingly effective.
- Multinomial NB can capture word frequency information but requires managing the feature space size.
- Laplace Smoothing helps address zero probabilities and improves model robustness.

# Spam Classifier With Naive Bayes, And Bag-of-Words
### Objective
Create a binary text classifier to distinguish spam emails from legitimate emails (ham).

### Challenges
- Raw text data cannot be fed directly into ML models.
- The text information has to be converted into features suitable for the model.

### Solution
- Feature extraction using Bag-of-Words: This technique represents documents as a collection of words, ignoring grammar and word order.
    - All the unique words from the entire email dataset are extracted.
    - Each email is then represented by a feature vector where each element indicates the frequency (count) of a particular word in that email.

### Classification using Naive Bayes
- Naive Bayes is well suited for text classification tasks.
- It assumes the independence of features (words) in a document, which might not be strictly true but often works well in practice for text data.
- The model calculates the probability of an email being spam or ham based on the presence and frequency of words associated with each category.

### Example
- Emails:
    - "Can you please look at the task ...?" (ham)
    - "Hi! I am a Nigerian Prince" (spam)
- Extracted Bag-of-Words: "Can", "you", "please", "look", "at", "the", "task", "...", "Hi", "!", "am", "a", "Nigerian", "Prince".
- Feature vectors:
    - Ham: (1 count of "Can", 1 count of "you", ..., 0 for "Nigerian", 0 for "Prince").
    - Spam: (0 for "Can", 0 for "you", ..., 1 count of "Nigerian", 1 count of "Prince").

Naive Bayes uses these feature vectors and their corresponding labels (spam/ham) to learn the probability distribution of words for each category. During prediction for a new email, the model calculates the probabilities of the email being spam and ham based on the word frequencies and classifies it accordingly.

### Summary
- Bag-of-Words transforms text data into numerical features for analysis.
- Naive Bayes leverages these features to classify emails as spam or ham based on word probabilities learned from the training data. This approach provides a simple and effective way to build a spam classifier.

### Note
Real-world spam classification can be more complex and might involve additional techniques like stemming or lemmatization (reducing words to their root form), n-grams (considering sequences of words), and feature weighting based on importance.

# Bag-of-Words (BoW)
Bag-of-Words (BoW) is a fundamental technique used in Natural Language Processing (NLP) for representing text data. It focuses on the occurrences of words within a document, ignoring grammar or word order.

### Core idea
- Imagine a bag filled with words, where each word appears as many times as it occurs in the document.
- The order or context in which the words appear is not considered.

### Creating a Bag-of-Words representation
1. Preprocessing: Text cleaning steps like removing punctuation, stop words (that is, common words like "a", "an", "the"), and converting text to lowercase are often performed.
2. Tokenization: The text is split into individual words (tokens).
3. Vocabulary creation: A list of unique words encountered across all documents in the corpus (collection of documents) is created. This is called the vocabulary.
4. Feature vector representation: Each document is represented by a feature vector. This vector has the same length as the vocabulary.

- Each element in the vector corresponds to a word in the vocabulary.
- The value at each element represents the number of times that particular word appears in the document (its frequency).

### Example
Consider 2 documents,
- Document 1: "The quick brown fox jumps over the lazy dog.".
- Document 2: "The dog is lazy. The fox is quick.".

After preprocessing and tokenization, the result would be,
- Vocabulary: ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog", "is"].

The feature vectors for these documents could be,
- Document 1: [2, 1, 1, 1, 1, 1, 1, 1, 0]
- Document 2: [2, 1, 0, 1, 0, 0, 1, 1, 2]
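
The construction can be reproduced with the standard library alone. The sketch below is one possible way to do it, assuming a simple regular-expression tokenizer (the hypothetical helper `tokenize`) and keeping stop words so that the vocabulary matches the one listed above.

```python
import re
from collections import Counter

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog is lazy. The fox is quick.",
]

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary in order of first appearance across the corpus.
vocabulary = []
for doc in documents:
    for token in tokenize(doc):
        if token not in vocabulary:
            vocabulary.append(token)

# One count vector per document, aligned with the vocabulary.
vectors = []
for doc in documents:
    counts = Counter(tokenize(doc))
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Every vector has the same length as the vocabulary, and a word that does not occur in a document contributes a 0.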

### Applications
Bag-of-Words is a simple and effective way to represent text data for various NLP tasks, including,
- Document classification (e.g., spam detection, sentiment analysis).
- Information retrieval (e.g., document search).
- Topic modeling (identifying groups of related words).

### Limitations
- BoW ignores word order and context, which can be crucial for understanding the meaning of a sentence.
- Words with similar meanings (synonyms) are treated differently.
- The effectiveness of BoW depends heavily on the quality of the preprocessing steps.

### Alternatives
- TF-IDF (Term Frequency-Inverse Document Frequency) is a popular extension of BoW that incorporates the importance of words within a document and across the corpus.
- Word embeddings, like word2vec and GloVe, capture semantic relationships between words and provide a more nuanced representation of text data.

Overall, Bag-of-Words is a foundational technique in NLP, offering a simple and efficient way to represent text data for various tasks. However, it's important to be aware of its limitations and consider alternative approaches depending on the specific application.

### Further reading
https://www.scaler.com/topics/nlp/text-representation-in-nlp/

# Text Vectorization And Feature Reduction In Spam Classification
### Text to vectors
- For machine learning models to work, text data needs to be converted into numerical features for processing.
- BoW is a common technique for text vectorization.
- BoW represents documents as vectors where each element corresponds to a unique word in the vocabulary.
- The value in each element represents the frequency (count) of that word in the document.

### Example
Consider the sentence: "Can you please look at the task...?".
- Vocabulary: ["Can", "you", "please", "look", "at", "the", "task", "..."].
- BoW vector: [1, 1, 1, 1, 1, 1, 1, 1] (Assuming each word appears once).

### Challenges with high dimensionality
- With a large corpus, the vocabulary size (the number of unique words) can become very large.
- This leads to high dimensional feature vectors (potentially tens of thousands of features).
- High dimensionality can pose problems for machine learning models,
    - Increased computational cost for training and prediction.
    - Potential for overfitting, where the model memorizes training data instead of learning the general patterns.

### Feature reduction techniques
- Text cleaning: Preprocessing steps like removing the stop words (common words like "the", "a", "an") and punctuation can significantly reduce the vocabulary size.
- Dimensionality reduction techniques:
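    - Term Frequency-Inverse Document Frequency (TF-IDF): This method weights words based on their importance within a document and across the corpus. Words that are frequent in a document but rare overall (like "Nigeria" for spam) receive higher weights, leading to more informative features. Strictly speaking, TF-IDF re-weights features rather than reducing their number, but it down-weights uninformative words and is often combined with vocabulary pruning.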
    - Principal Component Analysis (PCA): This technique projects data points onto a lower-dimensional space while capturing most of the variance in the data. This reduces the number of features while preserving the most important information.
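
To make the TF-IDF weighting concrete, here is a small hand-rolled sketch. It assumes one common textbook variant, tf = count / document length and idf = ln(N / df); library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant by default, so their numbers will differ. The toy corpus and the helper `tf_idf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy corpus.
corpus = [
    ["free", "prize", "call", "now"],
    ["call", "me", "now"],
    ["nigeria", "prince", "free", "money"],
    ["see", "you", "now"],
]

n_docs = len(corpus)
# Document frequency: number of documents containing each word.
doc_freq = Counter(word for doc in corpus for word in set(doc))

def tf_idf(doc):
    """Return {word: tf * idf} with tf = count / len(doc), idf = ln(N / df)."""
    counts = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

weights = tf_idf(corpus[2])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {weight:.3f}")
```

In the third document, "nigeria" occurs in only one document of the corpus and therefore receives a higher weight than "free", which occurs in two.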

### Summary
- BoW provides a basic way to convert text into numerical features.
- High dimensionality due to large vocabulary size can hinder model performance.
- Text cleaning and dimensionality reduction techniques like TF-IDF and PCA help manage feature space and improve model effectiveness.

# Text Cleaning For Text Classification
1. Convert sentences into words. This is technically called tokenization.
2. Convert all the text to lowercase. How does this help? It removes duplicates such as the, THE, The, etc.
3. Remove non-alphabetical characters. What does this mean? Characters such as the comma (,) and full stop (.) can be removed, along with numbers. Removing numbers is not a hard rule; they can be left as is in the text.
4. Remove stopwords. That is, words such as the, how, where, etc., can be removed. Stopwords are words that do not add much value to the classification.

NOTE: All of this text processing is optional.
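
The four steps can be sketched with the standard library alone. This is a simplified illustration: `STOP_WORDS` is a tiny hypothetical list (real lists, such as NLTK's, are much longer), tokenization is a plain whitespace split, and the helper name `clean` is invented for this example.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "how", "where", "to", "at"}

def clean(sentence):
    """Tokenize, lowercase, drop non-alphabetic characters and stop words."""
    # Steps 1 and 2: lowercase, then split on whitespace.
    tokens = sentence.lower().split()
    # Step 3: strip everything that is not a letter (digits are removed too).
    tokens = [re.sub(r"[^a-z]", "", token) for token in tokens]
    # Step 4: drop empty strings and stop words.
    return [token for token in tokens if token and token not in STOP_WORDS]

print(clean("The task is DUE at 5pm, where is the report?"))
```

Note that stripping digits turns "5pm" into "pm"; whether that is desirable depends on the task, which is why number removal is optional.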

# Mathematical Intuition For Naive Bayes
### Objective
Classify a new text message (sentence) as spam (class 1) or ham (class 0) based on the words it contains.

### Mathematical formulation
The posterior probability has to be calculated,

$P(y = c | w_1, w_2, ..., w_n)$ = Probability of the sentence belonging to class $c$ (spam or ham) given the set of words ($w_1$ to $w_n$).

Using Bayes' theorem,

$P(A | B) = \frac{P(B | A) * P(A)}{P(B)}$

Where,
- A = Class ($c$ = 0 for ham, and $c$ = 1 for spam).
- B = Set of words ($w_1$, $w_2$, ..., $w_n$).

### Challenges
1. Calculating likelihood ($P(B | A)$):
    - The probability of all the words appearing together given the class is needed (e.g., $P(w_1, w_2, ..., w_n | y = 1)$ for spam).
    - Directly estimating this joint probability is difficult due to the "curse of dimensionality": the number of possible word combinations grows exponentially with the number of words, so most combinations never appear in the training data.
2. Naive assumption: By the chain rule, the likelihood factorizes exactly as,
    - $P(w_1, w_2, ..., w_n | y = c) = P(w_1 | y = c) * P(w_2 | y = c, w_1) * ... * P(w_n | y = c, w_1, w_2, ..., w_{n - 1})$.
    - Naive Bayes assumes independence between words in a document given the class label ($c$), so each factor simplifies to $P(w_i | y = c)$ and the probability of each word is estimated individually given the class.

### Impact of the assumption
- This simplification makes the calculation of likelihood tractable.
- However, the independence assumption might not always hold true in natural language, where word order and context can influence meaning.

### Summary
Naive Bayes offers a computationally efficient approach to text classification by,
- Formulating the problem using Bayes' theorem and conditional probabilities.
- Making the simplifying assumption of word independence given the class label.
- Despite the assumption, Naive Bayes can be surprisingly effective for many text classification tasks.

# Naive Assumption in Naive Bayes
### The core assumption
Naive Bayes in text classification assumes independence between words in a document given the class label (spam or ham). This means,
- $P(w_1, w_2, ..., w_n | y = c) ≈ P(w_1 | y = c) * P(w_2 | y = c) * ... * P(w_n | y = c)$.
- We estimate the probability of each word individually given the class, ignoring the influence of other words in the sentence.
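
One practical consequence of multiplying many per-word probabilities is numerical underflow. The sketch below uses a hypothetical 400-word message in which every word has likelihood 0.01, and compares the direct product with the equivalent sum of logarithms, which is how the product is usually computed in practice.

```python
import math

# Hypothetical per-word likelihoods P(w_i | y = c) for a 400-word message.
word_probs = [0.01] * 400

# Direct product: too small for double precision, so it underflows to 0.0.
product = 1.0
for p in word_probs:
    product *= p

# Sum of logs: the same quantity on a log scale, with no underflow.
log_likelihood = sum(math.log(p) for p in word_probs)

print(product)
print(log_likelihood)
```

Because the logarithm is monotonic, comparing log-likelihoods across classes gives the same ranking as comparing the products themselves.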

### Impact of the assumption
- Simplification: This assumption makes calculating the likelihood (probability of words given the class) tractable, avoiding the "curse of dimensionality" issue.
- Limitation: The assumption might not always hold true. Words can be related, and their presence can influence the probability of others (e.g., "happy" and "new" appearing together more frequently).

### Example
Consider $P(w_2 | y = 1, w_1)$. Naively, it becomes $P(w_2 | y = 1)$, ignoring the presence of $w_1$. In reality, the probability of "new" might depend on "happy" being present.

### Benefits of the assumption
- Computational efficiency: Easier to calculate individual word probabilities than complex joint probabilities.
- Surprisingly effective: Despite the simplification, Naive Bayes can achieve good performance in many text classification tasks.

### Justification for the assumption
- While word dependencies exist, their overall impact might average out across a large corpus.
- The simplicity of the model can sometimes compensate for the imperfect assumption.

### Summary
Naive Bayes takes a pragmatic approach. It acknowledges that word independence isn't entirely true but leverages the assumption for computational efficiency and achieves reasonable performance in many real-world scenarios. This trade-off between simplicity and accuracy is what makes Naive Bayes a popular choice for text classification tasks.

# Summary Of Naive Bayes For Text Classification
### Objective
Classify a sentence as spam (class 1) or ham (class 0) based on the words it contains.

### Key equation
The posterior probability has to be found: $P(y = c | \text{sentence})$, which is the probability of the sentence belonging to class $c$ (spam or ham) given the words in the sentence.

### Naive Bayes approach
1. Leverages Bayes' theorem: $P(y = c | \text{sentence}) = \frac{P(\text{sentence} | y = c) * P(y = c)}{P(\text{sentence})}$.
2. Naive assumption: Assumes independence between words in the sentence given the class label ($c$). This simplifies the calculation of $P(\text{sentence} | y = c)$.
3. Simplified equation (for spam class, $c$ = 1): $P(y = 1 | \text{sentence}) ≈ \Pi(P(w_i | y = 1)) * P(y = 1)$. Where,
    - $\Pi$ (product symbol) = Multiplies the probabilities of each word ($w_i$) appearing in the sentence given that it is spam ($y$ = 1).
    - $P(y = 1)$ = Prior probability of a message being spam.

### Classification
- Calculate $P(y = 1 | \text{sentence})$ and $P(y = 0 | \text{sentence})$ using the same approach for both spam and ham classes.
- The sentence is classified into the class with the higher posterior probability.

### Why doesn't the denominator matter?
- The denominator, $P(\text{sentence})$ cancels out when comparing $P(y = 1 | \text{sentence})$ and $P(y = 0 | \text{sentence})$ because it is the same for both calculations.
- Only the class with the higher probability is considered, so the constant denominator does not affect the final decision.
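
A short check of this argument, using hypothetical unnormalized scores (the product of word likelihoods times the prior) for the two classes:

```python
# Hypothetical unnormalized scores: product of word likelihoods times the prior.
unnormalized = {"spam": 0.0006, "ham": 0.0002}

# P(sentence) is the sum of the unnormalized scores over all classes.
evidence = sum(unnormalized.values())
posterior = {c: score / evidence for c, score in unnormalized.items()}

best_unnormalized = max(unnormalized, key=unnormalized.get)
best_posterior = max(posterior, key=posterior.get)

print(posterior)
print(best_unnormalized == best_posterior)
```

Dividing every score by the same positive constant rescales the values so that they sum to 1 but cannot change which class is largest.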

### Naive assumption trade-off
- The assumption simplifies calculations but might not always hold true (words can be related).
- Despite the simplification, Naive Bayes can be surprisingly effective for many text classification tasks.

### Summary
Naive Bayes offers a simple yet powerful approach for text classification. By leveraging Bayes' theorem and making a simplifying assumption, it efficiently estimates the probability of a sentence belonging to a class based on word probabilities. While the independence assumption is not perfect, it often provides good results in practice.


# Limitations Of Naive Bayes
While Naive Bayes offers a powerful approach, it has some limitations.

### Limited text understanding
- It analyzes words independently, ignoring their meaning or relationships within the sentence.
- New words encountered during prediction (not in the training vocabulary) can lead to issues.

### Assumption of order independence
- The model does not consider the word order, which can affect meaning.
- Sentences like "good movie" and "movie good" are treated identically, and a phrase such as "not good" loses the link between "not" and "good".

### Frequency insensitivity
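With a binary (presence or absence) bag-of-words representation, as used by Bernoulli NB, a word appearing once or many times is treated the same, so information about word frequency is lost. Count-based representations (Multinomial NB) retain it.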

### Zero probability problem
- If a word from a new sentence is absent from the vocabulary, its probability becomes 0.
- This can lead to the entire equation for that class becoming 0, making classification impossible.

### Handling out-of-vocabulary (OOV) words
- Simple approach: Ignore the unknown word, that is, leave it out of the product for every class.
- Constant probability: Assign the same fixed value to $P(\text{unknown word} | y = c)$ for every class (a value of 1 is equivalent to ignoring the word).
- Laplace Smoothing: A more sophisticated technique that adds a small value (e.g., 1) to the count of each word when estimating probabilities. This avoids zero probabilities and provides smoother estimates.

### Summary
Naive Bayes offers a trade-off between simplicity and accuracy. While it has limitations in understanding complex text relationships, it can be effective for many classification tasks. Techniques like Laplace Smoothing help address the zero probability problem and improve robustness.

# Laplace Smoothing For Naive Bayes
### Problem
Naive Bayes calculates the probability of words ($w_j$) appearing in a class (e.g., spam). If a word is absent from the training data for a specific class, its probability becomes 0. This can lead to,
1. Zero probability problem: The entire equation for that class becomes 0, making classification impossible.
2. Mathematical issues: Multiplication by 0 can cause problems in calculations.

### Solution: Laplace Smoothing
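This technique adds a small value ($\alpha$) to the count of each word when estimating probabilities. For binary (present or absent) word features, the formula for Laplace Smoothing with Naive Bayes is, $P(w_j | y = 1) = \frac{\text{Count}(w_j, y = 1) + \alpha}{\text{Total spam emails} + \alpha * c}$. Where,
- $\text{Count}(w_j, y = 1)$ = Number of spam emails that contain $w_j$.
- $\alpha$ = Hyperparameter controlling smoothing (typically a small value like 1).
- $c$ = Number of possible values for $w_j$ (in this case 2: absent or present). For word-count features (Multinomial NB), the denominator instead uses the total number of words in the class plus $\alpha$ times the vocabulary size.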

### Advantages
- Non-zero probabilities: Ensures all words have a non-zero probability, even if unseen in training data.
- Robustness: Prevents the model from breaking down due to zero probabilities.
- Smoother estimates: Reduces the impact of sparse data, leading to more stable and reliable probability estimates.

### Example
- $\alpha$ = 1.
- Word "important" not present in spam emails ($\text{Count}(\text{important}, y = 1) = 0$).

### Without Smoothing
$P(\text{important} | y = 1) = \frac{0}{\text{Total spam emails}} = 0$ (the whole product for the spam class becomes 0).

### With Smoothing
$P(\text{important} | y = 1) = \frac{0 + 1}{\text{Total spam emails} + 1 * 2} = \frac{1}{\text{Total spam emails} + 2}$ (provides a valid probability).

Laplace Smoothing is a simple yet effective technique that improves the robustness and reliability of Naive Bayes for text classification by avoiding zero probabilities and providing smoother probability estimates.
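
The worked example can be checked with a few lines of code. The function name `smoothed_probability` and the figure of 100 spam emails are assumptions made only for this illustration.

```python
def smoothed_probability(count, total, alpha=1.0, n_values=2):
    """Laplace-smoothed estimate: (count + alpha) / (total + alpha * n_values)."""
    return (count + alpha) / (total + alpha * n_values)

total_spam_emails = 100  # hypothetical number of spam emails in training
count_important = 0      # "important" never appears in a spam email

# alpha = 0 switches smoothing off and reproduces the zero probability.
without_smoothing = smoothed_probability(count_important, total_spam_emails, alpha=0.0)
with_smoothing = smoothed_probability(count_important, total_spam_emails, alpha=1.0)

print(without_smoothing)
print(with_smoothing)
```

With smoothing, the unseen word receives a small positive probability, so a single unseen word no longer forces the score of the whole class to zero.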

# Bernoulli vs. Multinomial Naive Bayes
### Feature representation
The key difference between Bernoulli and Multinomial Naive Bayes lies in how they handle features (word occurrences) in text classification.

### Bernoulli Naive Bayes (Bernoulli NB)
- Suitable for features with only 2 possible values (typically 0 or 1).
- Example: "good" can be either present (1) or absent (0) in a document.

### Multinomial Naive Bayes (Multinomial NB)
- Applicable for features with multiple distinct values (k, where k > 2).
- Example: "good" can appear 0 times, 1 time, 2 times, and so on (represented by different values based on frequency).
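
The difference in feature representation can be seen by encoding the same document both ways. The vocabulary and document below are hypothetical.

```python
from collections import Counter

vocabulary = ["good", "movie", "bad"]
tokens = "good good movie".split()  # hypothetical cleaned document

counts = Counter(tokens)

# Multinomial NB input: how many times each vocabulary word occurs.
count_vector = [counts[word] for word in vocabulary]

# Bernoulli NB input: whether each vocabulary word occurs at all.
binary_vector = [int(counts[word] > 0) for word in vocabulary]

print(count_vector)
print(binary_vector)
```

In scikit-learn, a comparable choice is made when vectorizing: `CountVectorizer()` produces counts, while `CountVectorizer(binary=True)` produces presence indicators, which would typically be paired with `MultinomialNB` and `BernoulliNB` respectively.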

### Impact on features
- Bernoulli NB:
    - Simpler model with fewer features (0 or 1 for each word).
    - Might miss information about word frequency.
- Multinomial NB:
    - More complex model with increased features for each word (representing frequency).
    - Captures word frequency information but leads to a larger feature space.

### Feature engineering considerations
While Multinomial NB can capture frequency, the increase in features can be problematic. Techniques that help include,
- Minimum frequency threshold: Ignore words appearing less than a certain number of times.
- Maximum frequency threshold: Cap the maximum value for frequent words (e.g., "the").

These techniques help manage the feature space size in Multinomial NB.

### Summary
- Choose Bernoulli NB for binary features (presence or absence).
- Choose Multinomial NB for features with multiple values (frequency).
- Be mindful of feature explosion in Multinomial NB and consider feature engineering techniques.

# Multi-Class Classification
Naive Bayes can be effectively used for multi-class classification problems (more than 2 classes).

### Key idea
- The core concept from binary classification is extended.
- The posterior probability ($P(y = c | \text{sentence})$) is calculated for each possible class ($c$).
- The class with the highest posterior probability is chosen for the sentence.

### Mathematical formulation
Similar to the binary case, the same basic equation can be used, but for multiple classes,

$P(y = c | \text{sentence}) ≈ \Pi(P(w_i | y = c)) * P(y = c)$. Where,
- $c$ iterates over all possible classes (0, 1, 2, etc.).
- $P(w_i | y = c)$ is the probability of word $w_i$ appearing, given class $c$.
- $P(y = c)$ is the prior probability of class $c$ (overall frequency of class $c$).

### Classification
1. Calculate the posterior probability for each class using the equation above.
2. Assign the sentence to the class with the highest posterior probability.

### Example (Multinomial NB)
Consider a scenario with 3 classes, spam ($c$ = 0), important ($c$ = 1), and advertisement ($c$ = 2). $P(y = 0 | \text{sentence})$, $P(y = 1 | \text{sentence})$, and $P(y = 2 | \text{sentence})$ are calculated for a new sentence. The class with the highest probability wins.
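
The three-class scenario can be sketched as follows. All priors and word likelihoods are hypothetical values chosen for illustration, and the helper `log_score` is not part of any library.

```python
import math

# Hypothetical priors and (already smoothed) word likelihoods for three classes.
priors = {"spam": 0.2, "important": 0.5, "advertisement": 0.3}
likelihoods = {
    "spam": {"sale": 0.05, "meeting": 0.01, "win": 0.10},
    "important": {"sale": 0.01, "meeting": 0.12, "win": 0.01},
    "advertisement": {"sale": 0.15, "meeting": 0.01, "win": 0.04},
}

def log_score(words, c):
    """Return log P(y = c) plus the sum of log P(w_i | y = c)."""
    return math.log(priors[c]) + sum(math.log(likelihoods[c][w]) for w in words)

sentence = ["sale", "win"]
scores = {c: log_score(sentence, c) for c in priors}
prediction = max(scores, key=scores.get)

print(scores)
print(prediction)
```

The procedure is identical to the binary case; only the number of scores being compared changes.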

### Summary
Naive Bayes handles multi-class classification efficiently by calculating posterior probabilities for each class and assigning the sentence to the class with the highest probability. This approach extends the core concepts from binary classification to handle more complex scenarios.

# Code Implementation Of Naive Bayes

In [4]:
df["message"][1]

'Ok lar... Joking wif u oni...'

In [5]:
df["type"].value_counts()

type
ham     4825
spam     747
Name: count, dtype: int64

In [6]:
df["type"].value_counts(normalize = True)

type
ham     0.865937
spam    0.134063
Name: proportion, dtype: float64

In [7]:
import ssl
import nltk

ssl._create_default_https_context = ssl._create_unverified_context
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/vidishsirdesai/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/vidishsirdesai/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vidishsirdesai/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
print(nltk.corpus.stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [9]:
# processing text
text_sample = "Hello, world!"

# tokenization
nltk.word_tokenize(text_sample)

['Hello', ',', 'world', '!']

In [10]:
text_sample.split()

['Hello,', 'world!']

In [11]:
# step 1: lowercase
lowercase_text_sample = [i.lower() for i in nltk.word_tokenize(text_sample)]
lowercase_text_sample

['hello', ',', 'world', '!']

In [12]:
# step 2: remove non-alpha tokens
import re

text_sample = ["123@gmail.com"]

for word in text_sample:
    # replace everything that is not (^) alphanumeric or whitespace with ""
    c_word = re.sub(r"[^\w\s]", "", word)
    if c_word not in nltk.corpus.stopwords.words("english"):
        print(word, "-->", c_word)

123@gmail.com --> 123gmailcom


In [13]:
# packages for text processing
# import re, nltk, ssl
# ssl._create_default_https_context = ssl._create_unverified_context
# nltk.download("punkt")
# nltk.download("punkt_tab")
# nltk.download("stopwords")
# from nltk import word_tokenize, sent_tokenize
# from nltk.corpus import stopwords

# build the stopword set once instead of reloading the list for every word
STOPWORDS = set(nltk.corpus.stopwords.words("english"))

# a simple text processing function
def clean_tokenized_sentence(sentence):
    """
    Tokenizes a sentence, lowercases each token, strips punctuation,
    and removes stopwords. Returns the cleaned tokens joined by spaces.
    """
    # a list to store the processed words
    cleaned_words = []
    for word in nltk.word_tokenize(sentence):
        # convert to lowercase
        cleaned_word = word.lower()
        # remove punctuation by substitution
        cleaned_word = re.sub(r"[^\w\s]", "", cleaned_word)

        # keep non-empty tokens that are not stopwords
        if cleaned_word != "" and cleaned_word not in STOPWORDS:
            cleaned_words.append(cleaned_word)

    return " ".join(cleaned_words)

sentence = "Ok lar... The Joking wif u oni..."
clean_tokenized_sentence(sentence)

'ok lar joking wif u oni'

In [14]:
# apply the text cleaning function to the dataset
df["cleaned_message"] = df["message"].apply(clean_tokenized_sentence)
df.sample(10)

Unnamed: 0,type,message,cleaned_message
1917,ham,We not leaving yet. Ok lor then we go elsewher...,leaving yet ok lor go elsewhere n eat u thk
2862,ham,"Ok that would b lovely, if u r sure. Think abo...",ok would b lovely u r sure think wot u want dr...
3032,ham,"Aight, lemme know what's up",aight lem know
4395,ham,Dear :-/ why you mood off. I cant drive so i b...,dear mood cant drive brother drive
1395,ham,Thats cool! I am a gentleman and will treat yo...,thats cool gentleman treat dignity respect
3840,ham,Howz pain.it will come down today.do as i said...,howz painit come todaydo said ystrdayice medicine
154,ham,"You are everywhere dirt, on the floor, the win...",everywhere dirt floor windows even shirt somet...
3921,ham,"Oh really? perform, write a paper, go to a mov...",oh really perform write paper go movie home mi...
4800,ham,The guy at the car shop who was flirting with ...,guy car shop flirting got phone number paperwo...
1871,ham,Dont know supports ass and srt i thnk. I think...,dont know supports ass srt thnk think ps3 play...


In [15]:
df["cleaned_message"] = df["cleaned_message"].astype(str)

In [16]:
# dropping the "message" column
df.drop(columns = ["message"], inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   type             5572 non-null   object
 1   cleaned_message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [17]:
# encoding the target column "type"
def encode_type(i):
    encode = {
        "spam": 1,
        "ham": 0
    }
    return encode[i]

df["type"] = df["type"].apply(encode_type)
# equivalent one-liner: df["type"] = df["type"].map({"spam": 1, "ham": 0})
df.sample(10)

| | type | cleaned_message |
|---|---|---|
| 696 | 0 | aight close still around alex place |
| 4449 | 0 | awesome minute |
| 2547 | 1 | text82228 get ringtones logos games wwwtxt8222... |
| 2861 | 1 | adult 18 content video shortly |
| 963 | 0 | yo chad gymnastics class wan na take site says... |
| 4216 | 0 | office around 4 pm going hospital |
| 2678 | 0 | playng 9 doors game gt racing phone lol |
| 2148 | 0 | get home |
| 1261 | 0 | thank much skyped wit kz sura didnt get pleasu... |
| 185 | 0 | hello handsome finding job lazy working toward... |


In [18]:
# performing train-test split
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df["cleaned_message"], df["type"], test_size = 0.25, random_state = 42)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((4179,), (1393,), (4179,), (1393,))

In [19]:
# converting the messages to a bag-of-words count matrix
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
# learn the vocabulary from the training messages and count word occurrences
x_train = vectorizer.fit_transform(x_train)
# reuse the training vocabulary for the test messages (no refitting)
x_test = vectorizer.transform(x_test)
# (number of messages, vocabulary size)
x_train.shape, x_test.shape

((4179, 7612), (1393, 7612))

In [20]:
vectorizer.vocabulary_

{'call': 1571,
 'tell': 6544,
 'headache': 3293,
 'want': 7144,
 'use': 6986,
 'hour': 3441,
 'sick': 5952,
 'time': 6682,
 'never': 4618,
 'try': 6836,
 'alone': 912,
 'take': 6480,
 'weight': 7208,
 'tear': 6530,
 'comes': 1879,
 'ur': 6971,
 'heart': 3303,
 'falls': 2712,
 'eyes': 2688,
 'always': 925,
 'remember': 5511,
 'stupid': 6341,
 'friend': 2954,
 'share': 5876,
 'bslvyl': 1498,
 'raji': 5384,
 'pls': 5098,
 'favour': 2744,
 'convey': 1963,
 'birthday': 1334,
 'wishes': 7293,
 'nimya': 4645,
 'today': 6717,
 'iï½ï½ï½m': 3697,
 'prob': 5260,
 'hi': 3351,
 'kate': 3819,
 'lovely': 4122,
 'see': 5798,
 'tonight': 6748,
 'ill': 3539,
 'phone': 5029,
 'tomorrow': 6737,
 'got': 3141,
 'sing': 5979,
 'guy': 3218,
 'gave': 3029,
 'card': 1624,
 'xxx': 7459,
 'usual': 6998,
 'iam': 3498,
 'fine': 2804,
 'happy': 3268,
 'amp': 938,
 'well': 7217,
 'nope': 4683,
 'watching': 7165,
 'tv': 6859,
 'home': 3396,
 'going': 3112,
 'bored': 1407,
 'ps': 5309,
 'grown': 3195,
 'right': 5606,
 

In [21]:
from sklearn.naive_bayes import BernoulliNB
# BernoulliNB binarizes the counts by default (binarize=0.0), so only word presence is used
bernoulli_nb = BernoulliNB()
bernoulli_nb.fit(x_train, y_train)

In [22]:
y_pred = bernoulli_nb.predict(x_test)

In [23]:
from sklearn.metrics import accuracy_score, f1_score

print(accuracy_score(y_test, y_pred))
print(f1_score(y_test, y_pred))

0.9712849964106246
0.8837209302325582


In [24]:
from sklearn.naive_bayes import MultinomialNB

multinomial_nb = MultinomialNB()
multinomial_nb.fit(x_train, y_train)

In [25]:
y_pred = multinomial_nb.predict(x_test)

In [26]:
print(accuracy_score(y_test, y_pred))
print(f1_score(y_test, y_pred))

0.9791816223977028
0.9214092140921409


# Hyperparameters In Naive Bayes
While Naive Bayes is often praised for its simplicity, it does have a few hyperparameters that can be tuned to improve performance.

### Smoothing parameter ($\alpha$)
- Used in Laplace Smoothing to address the zero probability problem.
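- A small value (e.g., $\alpha = 1$) is typically added to each feature count so that no conditional probability is exactly zero. Tuning $\alpha$ trades off two extremes: values near 0 let the estimates follow the training counts closely (risking overfitting), while large values pull all estimates toward a uniform distribution (risking underfitting).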
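
The effect of $\alpha$ can be illustrated with a minimal sketch of additive smoothing. It assumes a multinomial model over word counts; the function name `smoothed_prob` and the toy counts are illustrative and not taken from the notebook.

```python
def smoothed_prob(word_count, total_count, vocab_size, alpha=1.0):
    """Additive (Laplace/Lidstone) estimate of P(word | class)."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)


# Assumed toy numbers: a word never seen in the spam class,
# 1000 word occurrences in spam, and a vocabulary of 7612 words.
for alpha in (0.0, 0.1, 1.0, 10.0):
    print(alpha, smoothed_prob(0, 1000, 7612, alpha))
```

With `alpha=0.0` the unseen word gets probability 0; any positive `alpha` keeps it non-zero, and larger values move every estimate closer to `1 / vocab_size`.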

### Feature selection
- Naive Bayes assumes independence between features, which might not always be true.
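- Selecting a relevant subset of features that contribute most to classification can reduce variance (fewer noisy parameters to estimate) and sometimes bias (fewer strongly correlated features violating the independence assumption).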
- Techniques like chi-square tests, information gain, or feature importance analysis can help identify these features.
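
As a rough illustration of chi-square scoring, the sketch below computes the Pearson chi-square statistic (without continuity correction) for a 2x2 table of word presence against class. The helper `chi_square_2x2` and the counts are assumptions made for the example.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table.

    a: spam messages containing the word    b: ham messages containing it
    c: spam messages without the word       d: ham messages without it
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator if denominator else 0.0


# Assumed toy counts: "free" is concentrated in spam, "ok" is spread
# across both classes in proportion to their sizes.
scores = {
    "free": chi_square_2x2(a=80, b=10, c=20, d=890),
    "ok": chi_square_2x2(a=10, b=90, c=90, d=810),
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

A word whose presence is independent of the class scores 0, so keeping only the top-ranked words discards features that carry little class information.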

### Model selection
There are several variants of Naive Bayes, each suited for different data types:
- Bernoulli NB: Suitable for binary features (presence or absence).
- Multinomial NB: Suitable for count features (e.g., word frequencies).
- Gaussian NB: Suitable for continuous features (assumes a Gaussian distribution).

Choosing the appropriate variant based on the data's characteristics can significantly impact the performance.
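
To make the difference between the Bernoulli and Multinomial variants concrete, here is a small sketch of the two class-conditional log-likelihoods. The function names and the probabilities are hypothetical; in practice the probabilities would be estimated from training data.

```python
import math


def bernoulli_log_likelihood(doc_presence, p_present):
    """log P(doc | class) when every vocabulary word is a present/absent event."""
    total = 0.0
    for present, p in zip(doc_presence, p_present):
        # Absent words also contribute, through (1 - p).
        total += math.log(p if present else 1.0 - p)
    return total


def multinomial_log_likelihood(doc_counts, p_word):
    """log P(doc | class) up to a constant, weighting each word by its count."""
    return sum(count * math.log(p) for count, p in zip(doc_counts, p_word))


# Assumed vocabulary order: ["free", "call", "home"]
counts = [3, 1, 0]                      # "free" appears three times
presence = [1 if c > 0 else 0 for c in counts]

print(bernoulli_log_likelihood(presence, [0.6, 0.5, 0.1]))
print(multinomial_log_likelihood(counts, [0.5, 0.3, 0.2]))
```

The Bernoulli model ignores how often "free" occurs but penalizes the absence of "home"; the Multinomial model does the opposite.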

### Class priors
- Naive Bayes uses class priors ($P(y = c)$) to represent the probability of each class appearing in the data.
- By default, these are estimated from the class frequencies in the training data (approximately uniform for balanced datasets).
- In some cases, domain knowledge about the prior probabilities of different classes might be available.
- Setting informative priors based on this knowledge can potentially improve classification accuracy.
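
The influence of the prior can be seen in a short sketch that applies Bayes' theorem to one message under two different priors. The likelihood values and the `posterior` helper are assumptions chosen for illustration.

```python
def posterior(likelihoods, priors):
    """Normalize prior * likelihood over the classes (Bayes' theorem)."""
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Assumed toy likelihoods P(features | class) for a single message.
likelihoods = {"spam": 0.03, "ham": 0.01}

empirical = posterior(likelihoods, {"spam": 0.13, "ham": 0.87})  # class frequencies
uniform = posterior(likelihoods, {"spam": 0.5, "ham": 0.5})      # informative choice
print(empirical)
print(uniform)
```

The same evidence is labeled ham under the frequency-based prior and spam under the uniform one. In scikit-learn, `MultinomialNB` and `BernoulliNB` accept such values through the `class_prior` parameter.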

### Tuning techniques
- Techniques like `GridSearchCV` or `RandomizedSearchCV` can be used to explore different hyperparameter combinations and identify the best performing configuration for the specific dataset.
- Cross-validation is crucial to evaluate the model's performance on unseen data and avoid overfitting.
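
A possible way to tune the smoothing parameter with `GridSearchCV` is sketched below. The tiny corpus, the grid values, and the step names `vectorizer` and `nb` are assumptions for illustration; with the notebook's data, the cleaned messages and encoded labels would take their place.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny stand-in corpus (assumed), 1 = spam and 0 = ham.
texts = [
    "win free prize now", "free cash call now",
    "claim free ringtone", "urgent prize waiting call",
    "ok see you tonight", "going home now",
    "call me when home", "lunch tomorrow ok",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizing inside the pipeline means the vocabulary is learned
# from the training folds only, never from the validation fold.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])
param_grid = {
    "nb__alpha": [0.01, 0.1, 0.5, 1.0],
    "nb__fit_prior": [True, False],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1")
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

On a corpus this small the selected values are not meaningful; the sketch only shows the mechanics of combining cross-validation with a parameter grid.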

# Metrics To Evaluate Naive Bayes
### Classification accuracy
- The most basic metric, representing the proportion of correctly classified instances.
- $\text{Accuracy} = \frac{\text{Total number of correct predictions}}{\text{Total number of predictions}} = \frac{TP + TN}{TP + FP + TN + FN}$.

### Precision
- Measures the proportion of positive predictions that were actually correct.
- $\text{Precision} = \frac{TP}{TP + FP}$.
- Useful for understanding how many of the model's positive predictions were relevant.

### Recall
- Measures the proportion of actual positive cases that were correctly classified.
- $\text{Recall} = \frac{TP}{TP + FN}$.
- Useful for understanding how many relevant cases the model captured.

### F1-score
- Harmonic mean of precision and recall, combining both metrics into a single score.
- $\text{F1-score} = \frac{2 * \text{Precision} * \text{Recall}}{\text{Precision} + \text{Recall}}$.
- Provides a balanced view of both precision and recall.

### Confusion matrix
- A tabular summary of the model's predictions against the true labels on a classification task.
- Rows represent actual classes, and columns represent predicted classes.
- Values represent the number of instances in each category (True Positives, False Positives, True Negatives, False Negatives).
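
The metrics above can all be derived from the four confusion-matrix counts. The following sketch does this in plain Python; the helper names and the toy labels are assumptions, and the notebook itself uses `accuracy_score` and `f1_score` from scikit-learn instead.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from the four counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Assumed toy labels: 1 = spam, 0 = ham.
y_true = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts, metrics)
```

Returning 0.0 when a denominator is zero is a design choice made for this sketch; other conventions exist.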

### ROC curve (Receiver Operating Characteristic curve)
- Plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different classification thresholds.
- A good model will have an ROC curve that stays close to the top-left corner, indicating high TPR and low FPR.
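
The points of an ROC curve can be computed by sweeping a threshold over the predicted scores. This is a minimal sketch, assuming binary labels with both classes present and scores such as the spam probabilities returned by `predict_proba`; the function name `roc_points` and the toy data are illustrative.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct score threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives  # both classes must be present
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Assumed toy data: true labels and predicted spam probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

points = roc_points(y_true, scores)
print(points)
```

Lowering the threshold moves along the curve from (0, 0) to (1, 1); a curve that rises quickly toward TPR = 1 while FPR stays small indicates good ranking of positives above negatives.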

### Choosing the right metric
The choice of metric depends on the specific problem context.
- Accuracy: A good starting point but can be misleading in imbalanced datasets.
- Precision: Important when the cost of false positives is high (e.g., spam filtering).
- Recall: Important when missing relevant cases is critical (e.g., medical diagnosis).
- F1-score: A balanced view, useful when both precision and recall matter.

### Additional considerations
- Cross-validation: Evaluate the model's performance on unseen data using techniques like k-fold cross-validation.
- Error analysis: Analyze misclassified instances to understand potential weaknesses of the model.
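
The index bookkeeping behind k-fold cross-validation can be sketched in a few lines. The generator below is a simplified, unshuffled version written for illustration; in practice a shuffled and stratified splitter is usually preferable, especially for imbalanced labels.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample is used for testing exactly once, and the model is refit on the remaining samples for every fold.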

# Bias-Variance Trade-Off In Naive Bayes
Naive Bayes inherently introduces some bias due to its core assumption of independence between features. This means it assumes that the presence of one feature does not affect the presence or absence of any other feature given the class label. While this assumption can be a good starting point, it may not always hold true for real-world data, leading to underfitting when features interact in complex ways.

### Factors affecting the trade-off in Naive Bayes
- Number of features: 
    - Too few features: Might lead to underfitting due to limited information for classification.
    - Too many features: Can increase variance, especially if some features are irrelevant or redundant. Feature selection can help address this.
- Smoothing techniques: Laplace Smoothing or other techniques help prevent zero probabilities for unseen words, reducing variance and improving model robustness.
- Choice of model variant: Different variations of Naive Bayes (e.g., Gaussian Naive Bayes vs. Multinomial Naive Bayes) are suited for different data types and can affect the bias-variance balance. Choosing the appropriate variant can help mitigate some bias.

### Strategies to improve the trade-off
- Feature selection: Select the most relevant features that contribute significantly to classification. Techniques like information gain, chi-square tests, or feature importance analysis can help identify these features.
- Smoothing techniques: As mentioned earlier, smoothing techniques like Laplace Smoothing can prevent zero probabilities and reduce variance.
- Model selection and regularization: Explore variations of Naive Bayes that might handle specific data characteristics better (e.g., Gaussian Naive Bayes for continuous features). Naive Bayes has no explicit penalty term, so the smoothing parameter acts as its regularizer: stronger smoothing reduces variance.
- Hyperparameter tuning: Tuning hyperparameters (e.g., smoothing parameter in Laplace smoothing) can help find the sweet spot between bias and variance. Techniques like `GridSearchCV` can be used for this.

### Finding the optimal balance
The goal is to find the sweet spot between bias and variance. The strategies mentioned above can help achieve this. Evaluating the model's performance on unseen data (e.g., using cross-validation) is crucial to ensure it generalizes well and avoids overfitting.

# Underfitting And Overfitting In Naive Bayes
### Underfitting
- Symptoms:
    - Poor performance on both training and testing data.
    - The model fails to capture the underlying patterns in the data.
- Causes in Naive Bayes:
    - Limited features: The model might not have enough features to adequately represent the data complexity.
    - Overly strong independence assumption: The assumption of independence between features might be too strict, leading the model to miss important relationships.
- Solutions:
    - Feature engineering: Extract more features that capture relevant information from the data.
    - Relaxing independence assumption: Explore extensions of Naive Bayes (e.g., Tree-Augmented Naive Bayes) that allow for some feature dependencies.

### Overfitting
- Symptoms:
    - High performance on the training data but poor performance on unseen data.
    - The model memorized specific details of the training data that don't generalize well.
- Causes in Naive Bayes:
    - Too many features: Using irrelevant or noisy features can increase the model's complexity and lead to overfitting.
    - Too little smoothing (very low $\alpha$): With almost no smoothing, the probability estimates follow the training counts too closely, so rare words get extreme weights. (A very high $\alpha$ has the opposite effect: it flattens the estimates and leads to underfitting.)
- Solutions:
    - Feature selection: Select only the most relevant features that contribute significantly to classification.
    - Regularization: Naive Bayes has no L1 or L2 penalty term; the smoothing parameter ($\alpha$) plays the regularizing role, and increasing it reduces overfitting.
    - Hyperparameter tuning: Optimize the smoothing parameter ($\alpha$) in Laplace smoothing to avoid both under-smoothing and over-smoothing.
    - Cross-validation: Evaluate the model on unseen data using techniques like K-Fold Cross Validation to ensure it generalizes well.

### Strategies to prevent underfitting and overfitting
- Feature engineering: Explore creating new features or transforming existing features to better represent the data.
- Model selection: Consider different Naive Bayes variants or other classification algorithms based on the data characteristics.
- Hyperparameter tuning: Tune hyperparameters like the smoothing parameter to find the optimal balance.
- Data augmentation (for text data): Techniques like synonym replacement or back translation can increase the training data size and reduce overfitting.
- Early stopping: Naive Bayes is fit in a single pass rather than iteratively, so this applies only when the model is trained incrementally (e.g., with `partial_fit`); in that case, stop once validation performance starts to decline.

# Effect Of Outliers On Naive Bayes
Naive Bayes is susceptible to outliers in the data, which can negatively impact its performance. The following are ways in which outliers can impact the model:
1. Zero probability problem:
    - Outliers can be words or features rarely seen in the training data.
    - Naive Bayes estimates the probability of each feature (word) appearing given a specific class.
    - If a word never seen with a class during training appears in a new sentence (outlier), its estimated probability for that class is 0 unless smoothing is applied.
    - This makes the entire product for that class 0, making classification impossible.
2. Skewed class distributions:
    - Outliers can skew the distribution of features within a class.
    - The model learns the class probabilities based on these skewed distributions.
    - When encountering new data with outliers, the model might misclassify them due to the learned (inaccurate) class distributions.
3. Impact on feature importance:
    - Outliers might be assigned high importance based on their unique characteristics.
    - This can mislead the model into giving undue weightage to these features during classification.
    - The model might prioritize the outlier feature over more relevant but common features, leading to misclassification.

### How to mitigate the effects of outliers?
1. Data preprocessing:
    - Identify and remove outliers using techniques like outlier detection algorithms (statistical methods, z-scores).
    - Consider capping extreme values to a certain threshold.
2. Feature engineering: Apply techniques like feature scaling or normalization to reduce the influence of outliers on the feature space.
3. Robust estimators: Use smoothed probability estimates, such as Lidstone smoothing (additive smoothing with $\alpha < 1$), so that rare or unseen values do not produce extreme estimates.
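
For continuous features, the effect of capping extreme values can be seen on the per-class mean and standard deviation, which are the quantities a Gaussian Naive Bayes model estimates. The feature (message length), the values, and the cap of 100 are assumptions chosen for this sketch.

```python
import statistics


def clip(values, lower, upper):
    """Cap every value to the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


# Assumed toy feature: message lengths for one class, with one extreme value.
lengths = [42, 38, 45, 40, 41, 39, 44, 900]

raw_mean = statistics.mean(lengths)
raw_std = statistics.stdev(lengths)

capped = clip(lengths, lower=0, upper=100)
capped_mean = statistics.mean(capped)
capped_std = statistics.stdev(capped)

print(raw_mean, raw_std)
print(capped_mean, capped_std)
```

A single extreme value shifts the mean far from the typical lengths and inflates the spread; capping limits that distortion without discarding the sample.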

# Effects Of Data Imbalance On Naive Bayes
### Problem
In an imbalanced dataset, one class (the majority class) has significantly more samples than the other classes. Naive Bayes relies on class probabilities for prediction, and with imbalanced data, it can:
- Overestimate the majority class probability: Due to the skewed distribution, the model might learn a higher prior probability for the majority class. This can lead to the model favoring the majority class during prediction, even for borderline cases.
- Underestimate the minority class probability: With fewer examples, the model might struggle to learn accurate probabilities for the minority class. This can result in misclassifications, where minority class instances are incorrectly classified as the majority class.

### Impact on performance
- Misleading accuracy: The model might achieve high overall accuracy simply by classifying most majority class instances correctly. However, its performance on the minority class will likely be poor, so the overall figure gives a misleading picture.
- Difficulties in evaluation: Standard metrics like accuracy become less informative in imbalanced datasets. Metrics like precision, recall, and F1-score for the minority class become crucial to understand the model's performance.

### Mitigation strategies
1. Data preprocessing techniques:
    - Oversampling: Duplicate minority class instances to create a more balanced dataset. Techniques like SMOTE can be used for synthetic data generation.
    - Undersampling: Randomly remove instances from the majority class to match the size of the minority class. However, this discards potentially valuable data.
    - Cost-sensitive learning: Assign higher weights to misclassifications of the minority class during training, forcing the model to pay more attention to these instances.
2. Model selection: Consider alternative classification algorithms that are less sensitive to data imbalance, such as Random Forest or Support Vector Machines (SVMs) with appropriate cost functions.
3. Ensemble methods: Combine predictions from multiple Naive Bayes models trained on different resampled versions of the data (e.g., using oversampling and undersampling). This can improve overall performance and reduce bias towards the majority class.
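
Random oversampling, the simplest of the resampling options above, can be sketched as follows. This version only duplicates existing minority samples (it is not SMOTE), and the function name, the seed, and the toy messages are assumptions for illustration.

```python
import random


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw the missing number of samples with replacement.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Assumed toy data: 1 = spam (minority), 0 = ham (majority).
messages = ["ok", "home soon", "see you", "call later", "lunch?", "free prize"]
labels = [0, 0, 0, 0, 0, 1]

balanced_messages, balanced_labels = random_oversample(messages, labels)
print(balanced_labels)
```

Resampling should be applied to the training split only, so that duplicated samples never leak into the data used for evaluation.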

### Choosing the right approach
The best approach to address data imbalance depends on the specific data and the cost of misclassification for each class. Carefully evaluate the impact of different techniques on the model's performance before deploying it in a real-world setting.

### Additional considerations
- Understanding domain knowledge: Domain knowledge about the relative importance of different classes can be incorporated into the model through cost-sensitive learning or by prioritizing the minority class during evaluation.
- Active learning: Query the user for labels on unlabeled data points, focusing on likely minority class instances to improve the model's knowledge in that area.

# Time And Space Complexity Of Naive Bayes
### Time complexity
- Training time complexity: $O(n * d)$. Where,
    - $n$ = Number of training samples.
    - $d$ = Number of features.
- Explanation of training time complexity: The algorithm iterates over each training sample and updates the probability counts for each feature-class combination. This process is linear in the number of samples and features, leading to a time complexity of $O(n * d)$.
- Testing time complexity: $O(d * c)$. Where,
    - $d$ = Number of features.
    - $c$ = Number of classes.
- Explanation of testing time complexity: For each test instance, the algorithm calculates the probability of each class given the features. This involves iterating over each feature and looking up its probability for each class. This results in a time complexity of $O(d * c)$.

### Space complexity
- Space complexity: $O(d * c)$, to store the probability of each feature given each class.
- Explanation of space complexity: The algorithm stores the probability of each feature given a class. This requires space proportional to the number of features and classes, leading to a space complexity of $O(d * c)$.
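
The complexity figures above can be read directly off a bare-bones implementation. The sketch below is a simplified Multinomial Naive Bayes over dense count rows, written for illustration only; it assumes integer class labels `0..c-1`, at least one training sample per class, and toy data that is not from the notebook.

```python
import math


def train(X, y, n_classes, alpha=1.0):
    """Single pass over n samples and d features: O(n * d) time."""
    d = len(X[0])
    class_counts = [0] * n_classes
    feature_counts = [[0] * d for _ in range(n_classes)]  # O(d * c) space
    for row, label in zip(X, y):
        class_counts[label] += 1
        for j, value in enumerate(row):
            feature_counts[label][j] += value

    # Assumes every class occurs at least once (otherwise log(0) fails).
    log_prior = [math.log(count / len(y)) for count in class_counts]
    log_likelihood = []
    for c in range(n_classes):
        total = sum(feature_counts[c]) + alpha * d
        log_likelihood.append(
            [math.log((feature_counts[c][j] + alpha) / total) for j in range(d)]
        )
    return log_prior, log_likelihood


def predict(row, log_prior, log_likelihood):
    """One pass over d features for each of c classes: O(d * c) time."""
    scores = [
        log_prior[c] + sum(v * log_likelihood[c][j] for j, v in enumerate(row))
        for c in range(len(log_prior))
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Assumed toy counts for the vocabulary ["free", "home", "call"]; 1 = spam, 0 = ham.
X = [[2, 0, 1], [1, 0, 0], [0, 3, 1], [0, 2, 0]]
y = [1, 1, 0, 0]

log_prior, log_likelihood = train(X, y, n_classes=2)
print(predict([1, 0, 0], log_prior, log_likelihood))
print(predict([0, 1, 0], log_prior, log_likelihood))
```

With sparse inputs such as the `CountVectorizer` output, the work per sample depends on the number of non-zero entries rather than on the full vocabulary size, so training is typically cheaper than the dense bound suggests.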

### Key points
- Naive Bayes is known for its simplicity and efficiency, especially in text classification tasks.
- The low time and space complexity make it suitable for large datasets.
- The assumption of feature independence can be a limitation, but it often works well in practice.