
# Naive Bayes, Text Classification, and Sentiment

📝 SALP chapter 4

## 🤔 **What is Text Classification?**
  - Text classification involves assigning a text to a predefined category from a finite set of classes.
  - Examples include:
    - **Language identification** (detecting the language of the text).
    - **Authorship attribution** (determining the author of a document).
    - **Sentiment analysis** (positive/negative)
      - Used to determine a writer's emotional stance toward a product, topic, or service.
      - e.g., movie review text being classified as positive or negative
    - **Spam detection** (spam/not spam)
    - **Topic/subject categorization** (sports, tech, etc.)
  - **Linear classifiers** are commonly used for text classification tasks.


### **Linear Classifiers Overview**
- A **linear classifier** makes a classification decision based on a `linear combination of input features`.
- The classifier creates a `decision boundary` in the `feature space` (text features like word frequencies).
- Examples of linear classifiers include:
  - **Naive Bayes**
  - **Logistic Regression**
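
As a minimal sketch of the idea above, a linear classifier computes a weighted sum of the feature values plus a bias and compares the result with a threshold. The feature names, weights, and bias below are hypothetical values chosen for illustration; they are not learned from any data.

```python
# Hypothetical weights for a spam-vs-ham linear classifier (illustrative only).
weights = {"free": 1.5, "win": 1.2, "meeting": -2.0}
bias = -0.5

def linear_score(features, weights, bias):
    """Return the linear combination w . x + b for a dict of feature values."""
    return bias + sum(weights.get(name, 0.0) * value
                      for name, value in features.items())

def predict(features, weights, bias):
    """Classify as 'spam' when the score is positive, otherwise 'ham'."""
    return "spam" if linear_score(features, weights, bias) > 0 else "ham"

# Word counts of a made-up email: "free" appears twice, "win" once.
email_counts = {"free": 2, "win": 1}
print(linear_score(email_counts, weights, bias))
print(predict(email_counts, weights, bias))
```

The decision boundary is the set of feature vectors whose score is exactly zero.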

## 🤔 **What is Naive Bayes?**
- **Naive Bayes** is a family of probabilistic classifiers based on **Bayes’ Theorem**.
- It assumes **conditional independence** between features (words) given the class.
- Often used in **text classification** tasks like **spam detection** or **sentiment analysis**.
- **Bayes' Theorem**: $\displaystyle P(C|X) = \frac{P(X|C) P(C)}{P(X)}$
  - **P(C|X)**: Probability of class $C$ given input data $X$.
  - **P(X|C)**: Likelihood of data $X$ given class $C$.
  - **P(C)**: Prior probability of class $C$.
  - **P(X)**: Evidence, i.e., the marginal probability of data $X$ over all classes.
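
The theorem can be checked numerically with a small sketch. The probabilities below are made-up numbers for a single word feature ("free") and two classes; `posterior` is an illustrative helper name.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(C|X) = P(X|C) * P(C) / P(X)."""
    return likelihood * prior / evidence

# Hypothetical numbers: 20% of emails are spam, the word "free" shows up
# in 50% of spam emails and in 5% of non-spam emails.
p_spam = 0.2
p_free_given_spam = 0.5
p_free_given_ham = 0.05

# P(X) via the law of total probability over both classes.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

p_spam_given_free = posterior(p_spam, p_free_given_spam, p_free)
print(round(p_spam_given_free, 4))
```

Even though only 20% of emails are spam in this toy setup, seeing "free" raises the spam probability to about 71%.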



### **What is the Multinomial Naive Bayes?**
- The **Multinomial Naive Bayes** model is particularly well-suited for text classification.
- It is used when features represent **word counts** or **term frequencies**
  - i.e., the number of occurrences of each word
- Assumes that word occurrences follow a `multinomial distribution`.

### **What is a Multinomial Distribution?**
- The **Multinomial Distribution** is an extension of the binomial distribution.
- It models the probabilities of outcomes for multiple categories from a finite set, where each outcome has a certain probability.
- In a multinomial experiment, there are:
  1. **n trials** (fixed number of observations).
  2. **k possible outcomes** (categories).
  3. **p probabilities** associated with each outcome.
- The sum of probabilities for all categories equals 1.
  - 🍎 **Example 1**: Rolling a die multiple times where each face has an equal probability of showing up.

- **Multinomial Distribution Formula**:
  - $\displaystyle P(X_1 = x_1, X_2 = x_2, ..., X_k = x_k) = \frac{n!}{x_1! x_2! ... x_k!} p_1^{x_1} p_2^{x_2} ... p_k^{x_k}$
  - Where:
    - $n$: Total number of trials.
    - $x_1, x_2, ..., x_k$: Number of occurrences for each category.
    - $p_1, p_2, ..., p_k$: Probabilities of each category.

- 🍎 **Example 2**:
  - Suppose we roll a die 10 times, and the outcomes are as follows: 
    - Face 1: 2 times.
    - Face 2: 3 times.
    - Face 3: 1 time.
    - Face 4: 1 time.
    - Face 5: 2 times.
    - Face 6: 1 time.
  - Use the multinomial distribution to calculate the probability of this exact outcome:
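    - $\displaystyle P(2, 3, 1, 1, 2, 1) = \frac{10!}{2!\,3!\,1!\,1!\,2!\,1!} \times \left(\frac{1}{6}\right)^{2} \left(\frac{1}{6}\right)^{3} \left(\frac{1}{6}\right)^{1} \left(\frac{1}{6}\right)^{1} \left(\frac{1}{6}\right)^{2} \left(\frac{1}{6}\right)^{1} = 151200 \times \left(\frac{1}{6}\right)^{10} \approx 0.0025006$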

In [3]:
import numpy as np
from scipy.stats import multinomial

# Example 1: Rolling a die multiple times
def die_roll_experiment(num_rolls):
    # Probabilities for each face of the die
    p = [1/6] * 6
    
    # Simulate die rolls
    rolls = np.random.choice(6, size=num_rolls, p=p) + 1
    
    # Count occurrences of each face
    unique, counts = np.unique(rolls, return_counts=True)
    
    return dict(zip(unique, counts))

# Example 2: Calculate probability of specific outcome
def calculate_specific_outcome_probability():
    n = 10  # Total number of rolls
    x = [2, 3, 1, 1, 2, 1]  # Occurrences of each face
    p = [1/6] * 6  # Probability of each face
    
    probability = multinomial.pmf(x, n, p)
    return probability

# Run Example 1
num_rolls = 1000
results = die_roll_experiment(num_rolls)
print("Die Roll Experiment Results:")
print(results)

# Run Example 2
probability = calculate_specific_outcome_probability()
print("\nProbability of specific outcome:")
print(f"{probability}")

Die Roll Experiment Results:
{1: 176, 2: 176, 3: 187, 4: 146, 5: 146, 6: 169}

Probability of specific outcome:
0.00250057155921354


In [4]:
import math

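def multinomial_probability(trials, outcomes, probs):
    """Multinomial pmf computed directly from the formula."""
    if len(outcomes) != len(probs):
        raise ValueError("outcomes and probs must have the same length.")
    if trials != sum(outcomes):
        raise ValueError("The sum of outcomes must equal the number of trials.")

    # Start from n! and fold in 1/x_i! and p_i^x_i for each category.
    probability = math.factorial(trials)
    for outcome, prob in zip(outcomes, probs):
        probability /= math.factorial(outcome)
        probability *= math.pow(prob, outcome)

    return probability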

# Example usage:
trials = 10
outcomes = [2, 3, 1, 1, 2, 1]
probs = [1/6]*6
probability = multinomial_probability(trials, outcomes, probs)
print(probability)  # Output: approximately 0.002500571559213533

0.002500571559213533


### **Assumptions of Multinomial Naive Bayes**
- **Conditional Independence**: 
  - Each word is conditionally independent of every other word, given the class.
- **Bag-of-Words (BoW) Assumption**: 
  - The position of words does not matter, only their frequency counts.
- **Multinomial Distribution**: 
  - Words are drawn from a fixed vocabulary and follow a multinomial distribution.
- 🍎 Create a BoW from a document

In [8]:
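import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time setup: the tokenizer and stop-word list need NLTK data packages,
# e.g. nltk.download('punkt') and nltk.download('stopwords')
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt').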

def bag_of_words(document, stop_words=None):
    # Tokenize the document into words
    words = word_tokenize(document.lower())

    # Remove stop words if provided
    if stop_words is not None:
        # stop_words = set(stopwords.words('english'))
        words = [word for word in words if word not in stop_words]

    # Create a bag of words representation
    bag = {}
    for word in words:
        bag[word] = bag.get(word, 0) + 1

    return bag

# Example usage
document = "This is a sample document. It contains some words, and some other words."
bag = bag_of_words(document)
print(bag)
print(stopwords.words('english'))

{'this': 1, 'is': 1, 'a': 1, 'sample': 1, 'document': 1, '.': 2, 'it': 1, 'contains': 1, 'some': 2, 'words': 2, ',': 1, 'and': 1, 'other': 1}
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', '

### **Naive Bayes Classifiers**
- Naive Bayes is a **probabilistic model** for classification.
- It selects the class $\hat{c}$ that maximizes the **posterior probability** given a document $d$:
  - $\displaystyle\hat{c} = \arg\max_{c \in C} P(c|d)$
  - This formula uses **Bayesian inference** to predict the `most likely class` for the input.
- 🍎 An email $d$ being classified into two categories: 
  - $\displaystyle\hat{c} = \arg\max_{c \in \{S,H\}} P(c|d)$
  - where S = spam and H = ham (not spam)


### **Bayes’ Theorem**
- $P(c|d) = \dfrac{P(d|c)P(c)}{P(d)}$
  - $P(c|d)$: Probability of class $c$ given the document $d$.
  - $P(d|c)$: Likelihood of observing the document $d$ if the class is $c$.
  - $P(c)$: Prior probability of the class $c$.
  - $P(d)$: Evidence or probability of the document $d$ under all classes.


### **Simplifying Naive Bayes Formula**
- The denominator $P(d)$ is constant for all classes and can be ignored for classification.
  - $\displaystyle\hat{c} = \arg\max_{c \in C} P(d|c)P(c)$
- Naive Bayes is called a **generative model**:
  - First, a class $c$ is sampled from the prior $P(c)$.
  - Then the document is generated from the likelihood $P(d|c)$.


### **Naive Bayes Assumptions**
- With assumptions:
  1. **Bag-of-Words**: 
     - The order of words doesn’t matter.
  2. **Conditional Independence**: 
     - Each word’s occurrence is independent given the class $c$.
     - also called **naive Bayes assumption**.
  
- Then a document $d$ can be represented as a set of features $f_1, f_2, ..., f_n$:
  - $P(f_1, f_2, ..., f_n | c) = P(f_1 | c) \cdot P(f_2 | c) \cdot ... \cdot P(f_n | c)$
- The final equation for the class chosen by a naive Bayes classifier is thus:
  - $\displaystyle\hat{c} = c_{NB} = \arg\max_{c \in C} P(c) \prod_{f \in F} P(f|c)$


### **Applying Naive Bayes to Text**
- **Document as Features**:
  - For text classification, each word in the document is treated as a feature:
    - $c_{NB} = \arg\max_{c \in C} P(c) \prod_{i \in \text{positions}} P(w_i | c)$
    - positions ← all word positions in test document
  - The classifier computes the likelihood of the document given each class and picks the highest.
- Example: in **spam detection**, words like “free” or “win” are more likely under the spam class, so their presence raises the spam score.



### **Practical Challenges of Naive Bayes**
- **Computational Issues**:
  - Calculating the product of probabilities for long documents can lead to **underflow**.
  
- **Solution**:
  - Use `logarithms` to convert the product into a sum:
    - $c_{NB} = \arg\max_{c \in C} \log P(c) + \sum_{i \in \text{positions}} \log P(w_i | c)$
    - Naive Bayes is a **linear classifier** when expressed in log space
- This transformation makes the computation efficient and avoids underflow.
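
The underflow problem is easy to reproduce. The sketch below assumes a 400-word document in which every word has a hypothetical likelihood of $10^{-5}$; the numbers are illustrative only.

```python
import math

# 400 words, each with a hypothetical likelihood of 1e-5 under some class.
word_probs = [1e-5] * 400

# Naive product: 1e-2000 is far below the smallest positive float (~5e-324).
product = 1.0
for p in word_probs:
    product *= p

# Log-space sum: stays an ordinary, finite number.
log_sum = sum(math.log(p) for p in word_probs)

print(product)   # underflows to 0.0
print(log_sum)
```

Because the logarithm is monotonic, comparing log scores picks the same class as comparing the original products would, had they been representable.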


### **Training the Naive Bayes Classifier**
- **Key Idea**: Learn probabilities from data.
- **Class Prior $P(c)$**:
  - $\displaystyle \hat{P}(c) = \dfrac{N_c}{N_{\text{doc}}}$
  - $N_c$: Number of documents in class $c$.
  - $N_{\text{doc}}$: Total number of documents.
- **Word Likelihood $P(w_i|c)$**:
  - $\displaystyle \hat{P}(w_i|c) = \dfrac{\text{count}(w_i, c)}{\sum_{w \in V} \text{count}(w, c)}$
  - $V$: Vocabulary (all words in all classes).
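
The two estimates above amount to counting. Here is a minimal sketch on a tiny made-up corpus; the documents, labels, and the helper name `mle_likelihood` are illustrative assumptions.

```python
from collections import Counter

# Tiny hypothetical corpus of (label, text) pairs.
docs = [
    ("pos", "great fun film"),
    ("pos", "great plot"),
    ("neg", "boring film"),
]

# Class prior: fraction of documents carrying each label.
doc_counts = Counter(label for label, _ in docs)
priors = {c: n / len(docs) for c, n in doc_counts.items()}

# Word counts per class, pooled over all documents of that class.
word_counts = {c: Counter() for c in doc_counts}
for label, text in docs:
    word_counts[label].update(text.split())

def mle_likelihood(word, c):
    """Unsmoothed estimate: count(w, c) / total word tokens in class c."""
    total = sum(word_counts[c].values())
    return word_counts[c][word] / total

print(priors)
print(mle_likelihood("great", "pos"))  # 2 of the 5 "pos" tokens
print(mle_likelihood("great", "neg"))  # never seen with "neg"
```

The last line returns exactly zero, which is why unsmoothed estimates cannot be used directly in a product of likelihoods.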



### **The Problem with Zero Probabilities**
- **Zero Likelihood Problem**:
  - When a word does not appear in the training data for a class, its probability becomes zero.
    - $\displaystyle \hat{P}(\text{“word-not-exist-in-c”}|c) = 0$
  - Multiplied probabilities lead to zero for the entire class likelihood.
- **Impact**: 
  - If any word's likelihood is zero, the whole class probability becomes zero.
- **Solution**: 
  - Use **smoothing algorithms such as add-one (Laplace) smoothing**:
  - $\displaystyle \hat{P}(w_i|c) = \frac{\text{count}(w_i, c) + 1}{\sum_{w \in V} (\text{count}(w, c) + 1)}$
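
A short sketch of the add-one estimate, assuming every token counted for the class belongs to the vocabulary $V$. The counts are hypothetical and `laplace_likelihood` is an illustrative name.

```python
def laplace_likelihood(count_wc, total_c, vocab_size):
    """Add-one estimate: (count(w, c) + 1) / (total tokens in c + |V|).

    Summing (count + 1) over the vocabulary adds 1 once per word type,
    which is why the denominator grows by |V|.
    """
    return (count_wc + 1) / (total_c + vocab_size)

# Hypothetical class with 12 tokens and a 20-word vocabulary.
print(laplace_likelihood(3, 12, 20))  # word seen 3 times
print(laplace_likelihood(0, 12, 20))  # unseen word, but no longer zero
```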


### **Handling Unknown Words and Stop Words**
- **Unknown Words**:
  - Words in the test set but not in the training vocabulary are ignored.

- **Stop Words**:
  - High-frequency words like “the” or “a” can be removed.
  - Stop word removal does not always improve performance in text classification.



### **Naive Bayes Algorithm - Training**
- **Input:** the set of all training documents $D$, the set of all classes $C$
- For each class $c∈ C$:
  1. **Calculate P(c) terms**
     - $N_{\text{doc}}$ = number of documents in $D$
     - $N_c$ = number of documents from $D$ in class $c$
     - $\text{logprior}[c] \leftarrow \log \left( \frac{N_c}{N_{\text{doc}}} \right)$
     - $V \leftarrow$ vocabulary of $D$
     - $\text{bigdoc}[c] \leftarrow \text{append}(d)$ for $d \in D$ with class $c$

  2. For each word $w \in V$:
     - **Calculate P(w|c) terms**
       - $\text{count}(w, c) \leftarrow$ number of occurrences of $w$ in $\text{bigdoc}[c]$
       - $\displaystyle \text{loglikelihood}[w, c] \leftarrow \log \left( \frac{\text{count}(w, c) + 1}{\sum_{w' \in V} (\text{count}(w', c) + 1)} \right)$
- **Return**: Vocabulary $V$, log priors $\log P(c)$, and log likelihoods $\log P(w|c)$.


### **Naive Bayes Algorithm - Testing**
- **Input:** testdoc, logprior, loglikelihood, C, V
- For each class $c \in C$:
  1. $\text{sum}[c] \leftarrow \text{logprior}[c]$
  2. For each position $i$ in $\text{testdoc}$:
     - $\text{word} \leftarrow \text{testdoc}[i]$
     - If $\text{word} \in V$:
       - $\text{sum}[c] \leftarrow \text{sum}[c] + \text{loglikelihood}[\text{word}, c]$
- **Return:** $\arg\max_{c∈C} \, \text{sum}[c]$

### 🍎 Example 3: Classifying emails using naive Bayes classifier
#### 1. **Training Data**
Let's take a small training dataset with the following emails:
- Spam:
  - Email 1: "Buy cheap products now"
  - Email 2: "Cheap deals available today"
  - Email 3: "Get cheap tickets now"
- Ham:
  - Email 1: "Meeting is scheduled at 9"
  - Email 2: "Your appointment is confirmed"
  - Email 3: "Please confirm your attendance"

#### 2. **Vocabulary (V)**

We first extract the vocabulary, i.e., all unique words across all emails.

**Vocabulary (V):**
```
V = {"buy", "cheap", "products", "now", "deals", "available", "today", 
     "get", "tickets", "meeting", "is", "scheduled", "at", "9", 
     "your", "appointment", "confirmed", "please", "confirm", "attendance"}
```
**|V| (Size of vocabulary):** 20 words.

#### 3. **Prior Probabilities: $P(c)$**

To calculate the prior probability $P(c)$ of each class:

- $N_{\text{spam}} = 3$ (Spam has 3 emails)
- $N_{\text{ham}} = 3$ (Ham has 3 emails)
- $N_{\text{doc}} = 6$ (Total number of emails)

We compute $P(\text{spam})$ and $P(\text{ham})$:

$\displaystyle P(\text{spam}) = \frac{N_{\text{spam}}}{N_{\text{doc}}} = \frac{3}{6} = 0.5$

$\displaystyle P(\text{ham}) = \frac{N_{\text{ham}}}{N_{\text{doc}}} = \frac{3}{6} = 0.5$

So, the priors are:
$P(\text{spam}) = 0.5, \quad P(\text{ham}) = 0.5$

#### 4. **Word Counts in Each Class**

We now count the occurrences of each word in the **spam** and **ham** classes by concatenating all emails in each class.

- **Spam concatenated:**
  ```
  "buy cheap products now cheap deals available today get cheap tickets now"
  ```
- **Ham concatenated:**
  ```
  "meeting is scheduled at 9 your appointment is confirmed please confirm your attendance"
  ```

The counts of words for each class are:

| Word        | Spam (Count) | Ham (Count) |
|-------------|--------------|-------------|
| buy         | 1            | 0           |
| cheap       | 3            | 0           |
| products    | 1            | 0           |
| now         | 2            | 0           |
| deals       | 1            | 0           |
| available   | 1            | 0           |
| today       | 1            | 0           |
| get         | 1            | 0           |
| tickets     | 1            | 0           |
| meeting     | 0            | 1           |
| is          | 0            | 2           |
| scheduled   | 0            | 1           |
| at          | 0            | 1           |
| 9           | 0            | 1           |
| your        | 0            | 2           |
| appointment | 0            | 1           |
| confirmed   | 0            | 1           |
| please      | 0            | 1           |
| confirm     | 0            | 1           |
| attendance  | 0            | 1           |

#### 5. **Likelihood Calculation $P(w|c)$**

Now, let's calculate the likelihood $P(w|c)$ using **Laplace (add-one) smoothing**.

For a word $w$ and class $c$:
$\displaystyle P(w|c) = \frac{\text{count}(w, c) + 1}{\sum_{w'} (\text{count}(w', c) + 1)}$

We need the total number of words in each class (plus the vocabulary size for smoothing):

- **Total number of words in spam:** 12
- **Total number of words in ham:** 13
- **Vocabulary size:** $|V| = 20$

Now, let's compute $P(w| \text{spam})$ and $P(w| \text{ham})$ for some example words:

- **For word "cheap":**
  $\displaystyle P(\text{"cheap"} | \text{spam}) = \frac{3 + 1}{12 + 20} = \frac{4}{32} = 0.125$
  
  $\displaystyle P(\text{"cheap"} | \text{ham}) = \frac{0 + 1}{13 + 20} = \frac{1}{33} = 0.0303$

- **For word "meeting":**
  $\displaystyle P(\text{"meeting"} | \text{spam}) = \frac{0 + 1}{12 + 20} = \frac{1}{32} = 0.03125$
  
  $\displaystyle P(\text{"meeting"} | \text{ham}) = \frac{1 + 1}{13 + 20} = \frac{2}{33} = 0.0606$

#### 6. **Testing (Classifying a New Email)**

Let's classify the new email: `"Get cheap products"`

- **Tokenized email:** `["get", "cheap", "products"]`

We will compute the log-probabilities for both classes using the formula:

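$\log P(c | \text{email}) = \log P(c) + \sum_{w \in \text{email}} \log P(w | c)$

Strictly speaking, the right-hand side equals $\log P(c | \text{email})$ only up to the additive constant $-\log P(\text{email})$. That constant is the same for every class, so it does not change which class scores highest.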

##### For Spam:

$\log P(\text{spam}) = \log(0.5) = -0.6931$

$\displaystyle \log P(\text{"get"} | \text{spam}) = \log\left(\frac{2}{32}\right) = -2.7726$

$\log P(\text{"cheap"} | \text{spam}) = \log(0.125) = -2.0794$

$\displaystyle \log P(\text{"products"} | \text{spam}) = \log\left(\frac{2}{32}\right) = -2.7726$

Total log-probability for spam:
$\log P(\text{spam} | \text{email}) = -0.6931 + (-2.7726) + (-2.0794) + (-2.7726) = -8.3178$

##### For Ham:

$\log P(\text{ham}) = \log(0.5) = -0.6931$

$\displaystyle \log P(\text{"get"} | \text{ham}) = \log\left(\frac{1}{33}\right) = -3.4965$

$\displaystyle \log P(\text{"cheap"} | \text{ham}) = \log\left(\frac{1}{33}\right) = -3.4965$

$\displaystyle \log P(\text{"products"} | \text{ham}) = \log\left(\frac{1}{33}\right) = -3.4965$

Total log-probability for ham:
$\log P(\text{ham} | \text{email}) = -0.6931 + (-3.4965) + (-3.4965) + (-3.4965) = -11.1827$

#### 7. **Final Prediction**

Since $\log P(\text{spam} | \text{email}) = -8.3178$ is greater than $\log P(\text{ham} | \text{email}) = -11.1827$, the classifier predicts **spam** for the email `"Get cheap products"`.
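
The arithmetic of this worked example can be double-checked with a few lines of Python. The sketch hard-codes the counts from the tables above rather than deriving them from the emails; `log_score` is an illustrative helper name.

```python
import math

# Figures from the worked example: 12 spam tokens, 13 ham tokens, |V| = 20.
VOCAB_SIZE = 20
totals = {"spam": 12, "ham": 13}
# Raw counts of the three test words in each class.
counts = {
    "spam": {"get": 1, "cheap": 3, "products": 1},
    "ham": {"get": 0, "cheap": 0, "products": 0},
}
log_prior = {"spam": math.log(0.5), "ham": math.log(0.5)}

def log_score(words, c):
    """log P(c) plus the add-one-smoothed log likelihood of each word."""
    score = log_prior[c]
    for w in words:
        score += math.log((counts[c][w] + 1) / (totals[c] + VOCAB_SIZE))
    return score

email = ["get", "cheap", "products"]
scores = {c: log_score(email, c) for c in totals}
print({c: round(s, 4) for c, s in scores.items()})
print(max(scores, key=scores.get))
```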

#### 8. **Python implementation**

In [9]:
import math
from collections import defaultdict

# Tokenize documents into words
def tokenize(doc):
    return doc.lower().split()

# Training the Naive Bayes classifier
def train_naive_bayes(data):
    # Initialize variables
    logprior = {}
    loglikelihood = defaultdict(lambda: defaultdict(float))
    class_word_count = defaultdict(lambda: defaultdict(int))
    class_doc_count = defaultdict(int)
    vocabulary = set()
    Ndoc = len(data)
    
    # Count documents and word occurrences per class
    for label, doc in data:
        class_doc_count[label] += 1
        words = tokenize(doc)
        vocabulary.update(words)
        for word in words:
            class_word_count[label][word] += 1
    
    # Calculate log P(c)
    for label in class_doc_count:
        logprior[label] = math.log(class_doc_count[label] / Ndoc)
    
    # Calculate log P(w|c) with Laplace smoothing
    for label in class_doc_count:
        total_word_count = sum(class_word_count[label].values())
        for word in vocabulary:
            word_count = class_word_count[label][word] + 1  # Add-one smoothing
            loglikelihood[word][label] = math.log(word_count / (total_word_count + len(vocabulary)))
    
    return logprior, loglikelihood, vocabulary

# Classify a new document
def classify_naive_bayes(doc, logprior, loglikelihood, classes, vocabulary):
    words = tokenize(doc)
    scores = {label: logprior[label] for label in classes}
    
    for word in words:
        if word in vocabulary:
            for label in classes:
                scores[label] += loglikelihood[word][label]
    
    return max(scores, key=scores.get)

# Sample training data
training_data = [
    ("spam", "Buy cheap products now"),
    ("ham", "Meeting is scheduled at 9"),
    ("ham", "Your appointment is confirmed"),
    ("spam", "Cheap deals available today"),
    ("ham", "Please confirm your attendance"),
    ("spam", "Get cheap tickets now")
]

# Train the Naive Bayes classifier
logprior, loglikelihood, vocabulary = train_naive_bayes(training_data)

# Sample classes
classes = ["spam", "ham"]

# Test the classifier with a new email
test_email_1 = "Get cheap products"
test_email_2 = "Your meeting is confirmed"

predicted_class_1 = classify_naive_bayes(test_email_1, logprior, loglikelihood, classes, vocabulary)
predicted_class_2 = classify_naive_bayes(test_email_2, logprior, loglikelihood, classes, vocabulary)

# Output the predictions
print(f"The email '{test_email_1}' is classified as: {predicted_class_1}")
print(f"The email '{test_email_2}' is classified as: {predicted_class_2}")


The email 'Get cheap products' is classified as: spam
The email 'Your meeting is confirmed' is classified as: ham


### **Considerations in Sentiment Analysis**
- Standard Naive Bayes text classification works well for sentiment analysis.
- Performance can often be improved with a few small modifications, such as:
  - Clip word counts at 1 per document (ignore frequency) because whether a word occurs matters more than its frequency
    - This variant is called **Binary Multinomial Naive Bayes**
  - Prepend the prefix **NOT_** to every word after a negation word (e.g., *not*, *no*, *never*, *n't*) until the next punctuation mark, since **negation** can flip sentiment
  - Add features to Naive Bayes for words in **sentiment lexicons**
    - Sentiment Lexicons are pre-annotated lists of words with positive or negative sentiment
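
The first two optimizations can be sketched as small preprocessing steps. The negation cue list, the punctuation set, the `NOT_` prefix convention, and the regex tokenizer below are simplifying assumptions for illustration, not a fixed standard.

```python
import re
from collections import Counter

# Assumed negation cues and clause-ending punctuation (illustrative lists).
NEGATIONS = {"not", "no", "never", "n't", "didn't", "don't"}
PUNCTUATION = {".", ",", "!", "?", ";", ":"}

def mark_negation(tokens):
    """Prefix NOT_ to tokens after a negation cue, up to the next punctuation."""
    marked, negating = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negating = False
            marked.append(tok)
        elif negating:
            marked.append("NOT_" + tok)
        else:
            marked.append(tok)
            if tok in NEGATIONS:
                negating = True
    return marked

def binarized_counts(tokens):
    """Clip every word count at 1 within a single document."""
    return Counter(set(tokens))

text = "i didn't like this movie , but i like the cast ."
tokens = re.findall(r"[\w']+|[.,!?;:]", text)
print(mark_negation(tokens))
print(binarized_counts(tokens)["like"])  # 1, although "like" occurs twice
```

After marking, `like` and `NOT_like` are distinct features, so the classifier can learn different likelihoods for each.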


### **Language Identification using Naive Bayes**
- Words are not the best features in language identification.
- **Character n-grams** (2, 3, or 4 characters) are more effective.
- **Byte n-grams** treat text as raw bytes (ignoring Unicode characters).
  - They can model statistics about the beginning or ending of words since spaces count as bytes.
- A widely used Naive Bayes system, [langid.py](https://github.com/saffsd/langid.py), is trained on multilingual data (Wikipedia, Twitter, religious texts).
  - The system begins with all possible n-grams (1-4 length) and selects the 7000 most informative features.
- 🍎 Example
  ```bash
  # install langid first
  pip install langid
  ```

In [None]:
import langid

# Define some example texts in different languages
texts = [
    "This is an English sentence.",  # English
    "Esta es una oración en español.",  # Spanish
    "C'est une phrase en français.",  # French
    "Das ist ein deutscher Satz.",  # German
    "这是一个中文句子。",  # Chinese
    "これは日本語の文です。"  # Japanese
]

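# Iterate over each text and identify its language.
# Note: by default langid.classify returns an unnormalized log-probability
# score (usually a negative number), not a confidence between 0 and 1.
for text in texts:
    lang, score = langid.classify(text)
    print(f"Text: {text}")
    print(f"Predicted Language: {lang}, Score: {score:.2f}")
    print('-' * 40)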

- The output languages above are represented by their two-letter ISO 639-1 codes:
  - `en` → English
  - `es` → Spanish
  - `fr` → French
  - `de` → German
  - `zh` → Chinese
  - `ja` → Japanese
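
Character n-gram features of the kind described in this section can be extracted in a few lines. This is a sketch of the feature extraction only, not of how langid.py builds or selects its features; the function names are illustrative.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return all overlapping character n-grams of length n, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_counts(text, sizes=(2, 3, 4)):
    """Count character n-grams for each requested size."""
    counts = Counter()
    for n in sizes:
        counts.update(char_ngrams(text, n))
    return counts

print(char_ngrams("the cat", 3))
# ['the', 'he ', 'e c', ' ca', 'cat']
print(ngram_counts("the the")["the"])  # 2
```

Because spaces are kept, n-grams such as `'he '` and `' ca'` capture word endings and beginnings.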

## **Naive Bayes as a Language Model**

- Naive Bayes classifiers can use various features: 
  - dictionaries, URLs, emails, phrases, etc.
- When **only individual words** are used as features, Naive Bayes acts similarly to a **language model**.
- Used as **Unigram Language Models**:
  - Each class (e.g., positive, negative) has its own unigram language model.
  - The probability of a sentence $P(s|c)$ is the product of the probabilities $P(w_i|c)$ of its words:
  - $P(s|c) = \prod_{i \in \text{positions}} P(w_i|c)$
- This makes Naive Bayes classifiers a powerful tool for tasks involving text, such as sentiment analysis or document classification.



### **Naive Bayes Example for Sentence Classification**

- Example: Sentiment classification with positive (+) and negative (-) classes.
- Model Parameters:
  
| Word  | P(w\|+) | P(w\|-) |
|-------|--------------|--------------|
| I     | 0.1          | 0.2          |
| love  | 0.1          | 0.001        |
| this  | 0.01         | 0.01         |
| fun   | 0.05         | 0.005        |
| film  | 0.1          | 0.1          |

- Sentence: **"I love this fun film"**
  - $P(s|+) = 0.1 \times 0.1 \times 0.01 \times 0.05 \times 0.1 = 5 \times 10^{-7}$
  - $P(s|-) = 0.2 \times 0.001 \times 0.01 \times 0.005 \times 0.1 = 1.0 \times 10^{-9}$  
- Conclusion: The positive model assigns a **higher probability** to the sentence.
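
The same comparison can be written as a short sketch that treats each class as a unigram language model. The probabilities are copied from the table above; `sentence_probability` is an illustrative name, and class priors are left out, as in the example.

```python
import math

# Per-class unigram probabilities copied from the table above.
model = {
    "+": {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1},
    "-": {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1},
}

def sentence_probability(sentence, c):
    """P(s|c) as the product of the class-conditional unigram probabilities."""
    return math.prod(model[c][w] for w in sentence.split())

s = "I love this fun film"
p_pos = sentence_probability(s, "+")
p_neg = sentence_probability(s, "-")
print(p_pos, p_neg)
print("+" if p_pos > p_neg else "-")
```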

## **Evaluating Naive Bayes Classifiers**
- **Metrics**:
  - **Precision**: The proportion of true positives among all predicted positives.
  - **Recall**: The proportion of true positives among all actual positives.
  - **F1-Score**: The `harmonic mean` of precision and recall.
- **Evaluation Process**:
  - Use **training, validation, and test sets**.
  - Apply **cross-validation** to ensure the model generalizes well to unseen data.

### **Precision**:
- `Precision` measures the proportion of `correctly predicted positive instances` among `all instances predicted as positive`.
- $\text{Precision} = \dfrac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}$
- **Example**: If 100 emails are classified as spam (predicted positives) and 80 of them are actually spam (true positives), then:
  - $\text{Precision} = \dfrac{80}{100} = 0.8$

### **Recall**:
- `Recall` measures `the proportion of actual positives that were correctly predicted`.
  - $\text{Recall} = \dfrac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}$
- **Example**: If there are 90 actual spam emails and 80 were correctly identified, then:
  - $\text{Recall} = \dfrac{80}{90} = 0.89$

### **F1-Score**:
- The F1-Score is the `harmonic mean of precision and recall`, providing a balance between the two.
  - $\text{F1-Score} = 2 \times \dfrac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ 
- **Example**: If precision is 0.8 and recall is 0.89, then:
   - $\text{F1-Score} = 2 \times \dfrac{0.8 \times 0.89}{0.8 + 0.89} = 0.84$
- **High F1-Score**: Indicates a good balance between Precision and Recall.
- **Low F1-Score**: Indicates that the model struggles with either false positives or false negatives.
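
The three metrics follow directly from the TP, FP, and FN counts. The sketch below reuses the spam figures from the examples above (100 predicted spam, 80 of them correct, 90 actual spam) and does not guard against zero denominators.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that were found."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Spam example: 80 true positives, 20 false positives, 10 false negatives.
tp, fp, fn = 80, 20, 10
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 2), round(r, 2), round(f1_score(p, r), 2))
```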


#### **Harmonic Mean**:
- $\displaystyle\text{HarmonicMean}(a_1, a_2, a_3, \dots, a_n) = \frac{n}{\frac{1}{a_1} + \frac{1}{a_2} + \dots + \frac{1}{a_n}}$
- The `harmonic mean` is used because it is closer to the **minimum** of the values compared to the `arithmetic mean`.
- It weighs lower values more heavily, providing a **conservative** estimate
  - which is useful when combining precision and recall.
- **Example**: The harmonic mean of two values, precision $P$ and recall $R$:
  - $\displaystyle\text{HarmonicMean}(P, R) = \frac{2}{\frac{1}{P} + \frac{1}{R}}$
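
To see how much more conservative the harmonic mean is, compare it with the arithmetic mean on a lopsided pair of values. The precision and recall below are hypothetical.

```python
from statistics import harmonic_mean, mean

# A lopsided classifier: high precision, very low recall (hypothetical).
p, r = 0.9, 0.1

print(mean([p, r]))           # arithmetic mean
print(harmonic_mean([p, r]))  # harmonic mean, pulled toward the smaller value
```

The arithmetic mean reports 0.5, while the harmonic mean is about 0.18, much closer to the weak recall.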


### **F-measure as Harmonic Mean**
- With parameter $\alpha$, F-measure can be written as:
  - $\displaystyle F = \frac{1}{\alpha \cdot \frac{1}{P} + (1 - \alpha) \cdot \frac{1}{R}}$
  - **α**: Controls the balance between **Precision (P)** and **Recall (R)**.
  - $\displaystyle\beta^2 = \frac{1 - \alpha}{\alpha} $, making the F-measure dependent on the chosen tradeoff between P and R.
- **Simplified F-measure ($F_\beta$)**:
  - $\displaystyle F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 \cdot P + R}$
    - **P** is Precision.
    - **R** is Recall.
    - **β** is the weight factor:
      - **β > 1**: More weight to **Recall**, useful for minimizing false negatives.
      - **β < 1**: More weight to **Precision**, useful for minimizing false positives.
      - **β = 1**: Equal weight (this simplifies to the **F1-Score**).
  - It gives **more weight to the lower value** (either Precision or Recall), leading to a more conservative and balanced evaluation.
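
A small sketch of the $F_\beta$ formula shows how $\beta$ shifts the score toward precision or recall. The precision and recall values are hypothetical, and `f_beta` is an illustrative name.

```python
def f_beta(p, r, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

p, r = 0.8, 0.5  # hypothetical precision and recall
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(p, r, beta), 4))
```

With precision higher than recall, $\beta = 0.5$ gives the largest score (it leans toward precision) and $\beta = 2$ the smallest (it leans toward recall).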


## **Evaluation Process for Naive Bayes Classifiers**

### **Training, Validation, and Test Sets**:
- **Training Set**: Used to train the model by learning from labeled data.
- **Validation Set**: Used for tuning hyperparameters and selecting the best model during the training process.
- **Test Set**: Used to evaluate the final performance of the model on unseen data.

### **Cross-Validation**:
- **Definition**: Cross-validation divides the data into $k$ folds, trains the model on $k-1$ folds, and tests on the remaining fold. This is repeated $k$ times, with each fold serving as the test set once.
- **Example**: In **5-fold cross-validation**, the data is split into 5 parts:
  - Train on 4 folds, test on 1 fold, and repeat 5 times.
  - Average the results from each iteration to get a robust estimate of the model’s performance.
- **Advantages**:
  - Helps to ensure that the model generalizes well to unseen data.
  - Gives a more reliable performance estimate than a single train/test split, which makes overfitting easier to detect.
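
The fold bookkeeping can be sketched without any library. This version uses contiguous, unshuffled folds for readability; in practice the data would usually be shuffled first. `k_fold_indices` is an illustrative name.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in k_fold_indices(10, 5):
    print("train:", train, "test:", test)
```

A model would be trained and scored once per yielded pair, and the $k$ scores averaged.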

## **Multiple-Class Classifier**
- A classifier that predicts **more than two classes**.
  - Examples: Positive, Neutral, Negative.
- **Use Cases**:
  - Sentiment Analysis (Positive, Neutral, Negative)
  - Language Detection (English, French, Spanish)


### **Classification Methods for Multi-Class Problems**
- **1. One-vs-Rest (OvR)**:
  - Treats each class as a separate binary classification problem.
    - Train $n$ binary classifiers (one for each class).
    - For a class $c_i$, treat all other classes as "negative".
- **2. One-vs-One (OvO)**:
  - Creates a classifier for **every pair** of classes.
    - For $n$ classes, build $\dfrac{n(n-1)}{2}$ binary classifiers.
    - Classify by voting among the pairwise classifiers.
- **3. Softmax Regression (Multinomial Logistic Regression)**:
  - Models the probability of each class directly.
    - Outputs probabilities for each class and selects the class with the highest probability.
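
The number of binary problems each decomposition creates can be enumerated directly. The four class names are hypothetical, and the sketch only lists the binary tasks; it does not train any classifier.

```python
from itertools import combinations

classes = ["sports", "tech", "politics", "health"]

# One-vs-Rest: one binary problem per class, everything else is "negative".
ovr_tasks = [(c, [o for o in classes if o != c]) for c in classes]

# One-vs-One: one binary problem per unordered pair of classes.
ovo_tasks = list(combinations(classes, 2))

print(len(ovr_tasks), ovr_tasks)
print(len(ovo_tasks), ovo_tasks)
```

For $n = 4$ classes this gives 4 OvR tasks and $\frac{4 \cdot 3}{2} = 6$ OvO tasks.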


### **Confusion Table for Multi-Class Classification**
- A confusion matrix for multi-class classification has **rows** representing the **true classes** and **columns** representing the **predicted classes**.

|              | Predicted: Class 1(A) | Predicted: Class 2(B) | Predicted: Class 3(C) | **Total Actual** |
|--------------|-------------------|-------------------|-------------------|------------------|
| **Actual: Class 1(A)** | 50                | 10                | 5                 | **65**           |
| **Actual: Class 2(B)** | 8                 | 45                | 12                | **65**           |
| **Actual: Class 3(C)** | 5                 | 15                | 50                | **70**           |
| **Total Predicted** | **63**            | **70**            | **67**            | **200**          |

- **Metrics Derived from the Confusion Table**:
- **True Positives (TP)**: Correct predictions for each class.
  - Class 1: 50
  - Class 2: 45
  - Class 3: 50

- **False Positives (FP)**: Incorrect predictions for each class (predicted as that class but actually from another class).
  - Class 1: 8 + 5 = 13
  - Class 2: 10 + 15 = 25
  - Class 3: 5 + 12 = 17

- **False Negatives (FN)**: Instances of a class that were incorrectly predicted as another class.
  - Class 1: 10 + 5 = 15
  - Class 2: 8 + 12 = 20
  - Class 3: 5 + 15 = 20
- **Multi-Class Accuracy**:
  - $\text{Accuracy} = \dfrac{\text{Sum of True Positives}}{\text{Total Number of Instances}}$
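
The per-class counts and the accuracy can be derived mechanically from the confusion matrix. The sketch below hard-codes the table above (rows are actual classes, columns are predicted classes) and uses plain Python lists.

```python
# Confusion matrix from the table above: rows = actual, columns = predicted.
labels = ["A", "B", "C"]
matrix = [
    [50, 10, 5],
    [8, 45, 12],
    [5, 15, 50],
]

n = len(labels)
total = sum(sum(row) for row in matrix)

# True positives sit on the diagonal.
tp = {labels[i]: matrix[i][i] for i in range(n)}
# False positives: column sum minus the diagonal entry.
fp = {labels[i]: sum(matrix[r][i] for r in range(n)) - matrix[i][i]
      for i in range(n)}
# False negatives: row sum minus the diagonal entry.
fn = {labels[i]: sum(matrix[i]) - matrix[i][i] for i in range(n)}

accuracy = sum(tp.values()) / total
print(tp, fp, fn)
print(accuracy)
```

For this table the accuracy is $\frac{50 + 45 + 50}{200} = 0.725$.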



### **Evaluating Multi-Class Classifiers**
- For each class $c_i$, treat it as the "positive" class and all other classes as "negative".
- **Precision for $c_i$**:
  - $\displaystyle\text{Precision for } c_i = \frac{\text{True Positives for } c_i}{\text{True Positives for } c_i + \text{False Positives for } c_i}$
  
  - $\displaystyle\text{Precision}_A = \frac{TP_A}{TP_A + FP_A} = \frac{50}{50 + 13} = 0.7937$
  
  - $\displaystyle\text{Precision}_B = \frac{TP_B}{TP_B + FP_B} = \frac{45}{45 + 25} = 0.6429$
  
  - $\displaystyle\text{Precision}_C = \frac{TP_C}{TP_C + FP_C} = \frac{50}{50 + 17} = 0.7463$
- **Recall for $c_i$**:
  - $\displaystyle\text{Recall for } c_i = \frac{\text{True Positives for } c_i}{\text{True Positives for } c_i + \text{False Negatives for } c_i}$
  
  - $\displaystyle\text{Recall}_A = \frac{TP_A}{TP_A + FN_A} = \frac{50}{50 + 15} = 0.7692$
  
  - $\displaystyle\text{Recall}_B = \frac{TP_B}{TP_B + FN_B} = \frac{45}{45 + 20} = 0.6923$
  
  - $\displaystyle\text{Recall}_C = \frac{TP_C}{TP_C + FN_C} = \frac{50}{50 + 20} = 0.7143$
- **F1-Score for $c_i$**:
  - $\displaystyle\text{F1-Score}_A = \frac{2 \times \text{Precision}_A \times \text{Recall}_A}{\text{Precision}_A + \text{Recall}_A} = \frac{2 \times 0.7937 \times 0.7692}{0.7937 + 0.7692} = 0.7813$
  
  - $\displaystyle\text{F1-Score}_B = \frac{2 \times \text{Precision}_B \times \text{Recall}_B}{\text{Precision}_B + \text{Recall}_B} = 0.6667$
  
  - $\displaystyle\text{F1-Score}_C = \frac{2 \times \text{Precision}_C \times \text{Recall}_C}{\text{Precision}_C + \text{Recall}_C} = 0.7299$
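The per-class formulas above can be checked with a few lines of Python. This is a sketch, not a library API: the helper name `precision_recall_f1` is made up here, and returning 0.0 when a denominator is zero is one common convention that this example simply assumes.

```python
# (TP, FP, FN) per class, copied from the confusion table above.
counts = {"A": (50, 13, 15), "B": (45, 25, 20), "C": (50, 17, 20)}

def precision_recall_f1(tp, fp, fn):
    """One-vs-rest precision, recall, and F1 for a single class."""
    # Assumption: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

scores = {label: precision_recall_f1(*c) for label, c in counts.items()}
for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.4f} recall={r:.4f} F1={f1:.4f}")
```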

#### **Macro-Averaged Metrics**:
- **Macro-averaged Precision**:
  - $\text{Macro Precision} = \dfrac{\sum_{i=1}^{n} \text{Precision}_i}{n}=0.7276$

- **Macro-averaged Recall**:
  - $\text{Macro Recall} = \dfrac{\sum_{i=1}^{n} \text{Recall}_i}{n}=0.7253$
- **Macro-averaged F1-Score**:
  - $\text{Macro F1-Score} = \dfrac{\sum_{i=1}^{n} \text{F1-Score}_i}{n}=0.7259$

#### **Micro-Averaging**:
- Aggregates **True Positives**, **False Positives**, and **False Negatives** across all classes, then computes a single overall precision and recall.
  - Every instance counts equally, so frequent classes influence the result more than rare ones.
- **Micro-Averaged Precision** = $\dfrac{\sum_{i=1}^{n} \text{True Positives}_i}{\sum_{i=1}^{n} (\text{True Positives}_i + \text{False Positives}_i)}$

- **Micro-Averaged Recall** = $\dfrac{\sum_{i=1}^{n} \text{True Positives}_i}{\sum_{i=1}^{n} (\text{True Positives}_i + \text{False Negatives}_i)}$

- **Micro-Averaged F1-Score** = $\dfrac{2 \times \text{Micro Precision} \times \text{Micro Recall}}{\text{Micro Precision} + \text{Micro Recall}}$

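- In single-label multi-class classification, every misclassified instance counts as one false positive (for the predicted class) and one false negative (for the true class), so $FP_{\text{total}} = FN_{\text{total}}$. **Micro Precision** and **Micro Recall** are therefore equal, and the **Micro F1-Score** coincides with both (and with accuracy):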
  - $\text{Micro F1-Score} = \text{Micro Precision} = \text{Micro Recall}$ 
  
  - $\displaystyle =\frac{TP_{\text{total}}}{TP_{\text{total}} + FP_{\text{total}}} = \frac{145}{145 + 55} = 0.7250$
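To contrast the two averaging schemes, the sketch below computes both from the same per-class counts: macro scores each class and then averages, while micro pools the counts and scores once. The helper `prf` and the variable names are illustrative assumptions, and the helper assumes no denominator is zero.

```python
# (TP, FP, FN) per class, copied from the confusion table above.
counts = {"A": (50, 13, 15), "B": (45, 25, 20), "C": (50, 17, 20)}

def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (assumes nonzero denominators)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Macro: score each class separately, then take the unweighted mean per metric.
per_class = [prf(*c) for c in counts.values()]
macro = tuple(sum(metric) / len(per_class) for metric in zip(*per_class))

# Micro: pool TP, FP, and FN over all classes, then score once.
pooled = tuple(sum(column) for column in zip(*counts.values()))
micro = prf(*pooled)

print("macro P/R/F1:", [round(v, 4) for v in macro])
print("micro P/R/F1:", [round(v, 4) for v in micro])
```

Because the three classes here have similar sizes (65, 65, and 70 instances), the macro and micro values end up close; with a strongly imbalanced test set they can diverge considerably.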

## Comparing Classifier Performance
- **Problem**: How do we know if classifier A is better than classifier B?
  - Example: Comparing the F1 score of a logistic regression classifier vs. a naive Bayes classifier.
  - We observe the **effect size** δ(x):
    $\delta(x) = M(A, x) - M(B, x)$
    - Where $M(A, x)$ and $M(B, x)$ are the performance `metrics` (e.g., accuracy, F1) for classifiers A and B on test set $x$.
    - Larger $δ(x)$ indicates A might be better than B.


### **Statistical Hypothesis Testing**
- **Objective**: Is the observed performance difference statistically significant?
  - **Null hypothesis (H₀)**: A is not better than B $(δ(x) ≤ 0)$.
  - **Alternative hypothesis (H₁)**: A is better than B $(δ(x) > 0)$.
- **Question**: How likely is the observed $δ(x)$ to occur by chance?
  - The probability is formalized as the **p-value**:
    $P(\delta(X) \geq \delta(x) \mid H_0 \text{ is true})$, where $X$ is a random variable ranging over all test sets.
  - If the p-value is small enough (e.g., < 0.05), reject H₀ and conclude A is better than B.


### **Effect Size and p-Value**
- **Effect Size**: The difference in performance between A and B
  - e.g., $δ(x) = 0.2$ for accuracy.
- **p-Value**: Measures how likely it is to observe a $δ(x)$ as large as the one seen, assuming H₀ is true.
  - **Small p-value (< 0.05)**: Unlikely to observe such a large effect under H₀ → reject H₀.
  - **Large p-value**: The observed difference could occur by chance → fail to reject H₀.


### **Non-Parametric Tests in NLP**
- **Why Non-Parametric Tests?**
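  - Parametric tests such as the t-test or ANOVA assume that the test statistic follows a particular distribution (e.g., normal), which generally does not hold for NLP metrics.
  - Non-parametric tests based on sampling avoid this assumption by artificially creating many versions of the experimental setup from the observed data.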
- **Popular Tests** described in [Computer-Intensive Methods for Testing Hypotheses: An Introduction](https://www.wiley.com/en-us/Computer-Intensive+Methods+for+Testing+Hypotheses%3A+An+Introduction-p-9780471611363):
  - **Approximate Randomization**
  - **Bootstrap Test** (paired version, common in NLP)


### **The Paired Bootstrap Test**
- **Goal**: Create many virtual test sets ($b$ of them) from an observed test set to simulate different testing conditions.
- [**Bootstrap Method**](./codes/03/btt.py):
  1. Start with a test set $x$ of size $n$.
  2. Randomly sample with replacement from $x$ to create virtual test sets $x(i)$.
  3. Compute the difference $δ(x(i))$ between A and B on each test set.
  4. As a first attempt, count how often $δ(x(i))$ exceeds the value 0 expected under H₀ by $δ(x)$ or more, that is, how often $δ(x(i)) \geq δ(x)$.
     - $\displaystyle p\text{-value}(x) = \frac{1}{b} \sum_{i=1}^{b} 𝕋(\delta(x(i)) - \delta(x) \geq 0)$
     - Where $b$ is the number of bootstrap samples, and $𝕋(x)$ is an indicator function that is 1 if $x$ is true and 0 otherwise.


### **Paired Bootstrap Test**
- The virtual test sets are resampled from $x$, which itself favors A by $δ(x)$, so their differences are centered on $δ(x)$ rather than on 0. The test therefore counts how often $δ(x(i))$ exceeds `the expected value` $δ(x)$ by $δ(x)$ or more:
  - $\displaystyle p\text{-value}(x) = \frac{1}{b} \sum_{i=1}^{b} 𝕋(\delta(x(i)) - \delta(x) \geq \delta(x)) = \frac{1}{b} \sum_{i=1}^{b} 𝕋(\delta(x(i)) \geq 2\delta(x))$
- **Interpretation**: If very few bootstrap test sets show a difference of at least $2δ(x)$, the p-value is small, indicating that the observed advantage of A over B is unlikely to be due to chance.


### 🍎 **Bootstrap Example**
- Assume a test set with 10 documents.
  - Logistic regression A accuracy: 0.70
  - Naive Bayes B accuracy: 0.50
  - ∴ Observed $δ(x) = 0.20$
- **Bootstrap Sampling**:
  - Generate 10,000 virtual test sets by sampling with replacement.
  - Compute $δ(x(i))$ for each set.
  - Count the number of times $δ(x(i))$ is at least $2δ(x) = 0.40$.
- **Result**: If only 47 out of 10,000 test sets reach $2δ(x)$, p-value = 0.0047 → reject H₀.
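
The paired bootstrap procedure described above can be sketched in a few lines of Python. This is a minimal illustration for accuracy only, not the notebook's `btt.py`: the function name `paired_bootstrap`, the fixed seed, and the per-document outcomes are all assumptions made for the example (the text gives only the overall accuracies 0.70 and 0.50, not which documents each classifier got right).

```python
import random

def paired_bootstrap(correct_a, correct_b, b=10_000, seed=0):
    """Return (delta_x, p_value) for accuracy, using the 2 * delta(x) rule.

    correct_a[j] / correct_b[j] are 1 if classifier A / B labeled test
    document j correctly and 0 otherwise (same documents, same order).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("A and B must be scored on the same test set")
    n = len(correct_a)
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability

    # Per-document difference in correctness; the sum equals n * delta(x).
    diffs = [a - c for a, c in zip(correct_a, correct_b)]
    observed = sum(diffs)

    hits = 0
    for _ in range(b):
        # One virtual test set x(i): n documents drawn with replacement.
        sample = rng.choices(diffs, k=n)
        # delta(x(i)) >= 2 * delta(x), compared as integer sums to avoid rounding.
        if sum(sample) >= 2 * observed:
            hits += 1
    return observed / n, hits / b

# Invented per-document outcomes: A is correct on 7 of 10, B on 5 of 10.
correct_a = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
correct_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
delta_x, p_value = paired_bootstrap(correct_a, correct_b)
print(f"delta(x) = {delta_x:.2f}, estimated p-value = {p_value:.4f}")
```

Resampling per-document differences works here because accuracy is a mean over documents; for a metric such as F1, each virtual test set would need the metric recomputed from scratch. For this invented outcome pattern, each resampled document contributes a difference of 1 with probability 0.2, so the estimate should land near $P(\text{Binomial}(10, 0.2) \geq 4) \approx 0.12$, which would not reject H₀ at the 0.05 level. The 47-in-10,000 figure in the example is a separate hypothetical outcome.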

## Avoiding Harms in Classification  
- Classification algorithms, including naive Bayes, can perpetuate societal harms such as:
  - `Representational harms`: harms caused by a system that demeans a social group, for example by perpetuating negative stereotypes about them.
  - Other harms, such as silencing marginalized groups.


### **Representational Harms**
- Harms caused by systems that demean a social group, perpetuating negative stereotypes.
- **Example Study**: 
  - [Kiritchenko & Mohammad (2018)](https://aclanthology.org/S18-2005/): Sentiment analysis systems often assign more negative emotions to sentences with `African American names` than to sentences with `European American names`.
  - **Implications**: These biases reflect and perpetuate stereotypes.
  

### **Bias in Toxicity Detection**
- **Goal**: Detect hate speech, harassment, or toxic language.
- **Problem**: 
  - Toxicity classifiers may mistakenly flag non-toxic sentences that mention groups like women, blind people, or gay people.
  - These errors can lead to the silencing of discourse.
- **Example Studies**:
  - [Park et al. (2018)](https://doi.org/10.3115/v1/W14-2105): Classifiers flagging terms like “women” as toxic.
  - [Hutchinson et al. (2020)](https://doi.org/10.18653/v1/2020.acl-main.487): Misclassification of references to disabled people.
  - [Sap et al. (2019)](https://doi.org/10.18653/v1/P19-1163): Bias against African-American Vernacular English.


### **Causes of Model Bias**
- **Training Data**: Models replicate and amplify biases present in data.
- **Labels**: Biases in human labeling can affect model outputs.
- **Resources**: Lexicons, pretrained embeddings, or model components can introduce bias.
- **Model Architecture**: Design choices, such as what the model is trained to optimize, can also introduce bias.


## Mitigating Bias in Classification Models
- **Current Research**: Ongoing work in addressing bias through data curation and evaluation.
- **No General Solutions**: Each model needs to be critically evaluated.
  
### **Documenting Models: The Model Card**
- **What is a Model Card?**: A tool to document model development and evaluation.
  - It helps promote transparency and accountability.
- **Contents of a Model Card**:
  - Training algorithms and parameters.
  - Training data sources and preprocessing.
  - Evaluation data sources and motivation.
  - Intended use and user groups.
  - Model performance across demographic groups.
- **Example Model Card**
  - **Training Data**: Describe where the data comes from and any biases.
  - **Evaluation**: Provide details on model performance across different demographic groups.
  - **Intended Use**: Specify what the model should and should not be used for.
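
One way to produce the "performance across demographic groups" entry of a model card is to break an error metric down by group. The sketch below computes the false positive rate of a toxicity classifier per group, which is the kind of error behind non-toxic sentences being flagged. Everything in it is hypothetical: the function name, the record layout, the placeholder group names, and the toy records are invented for illustration.

```python
from collections import defaultdict

def false_positive_rate_by_group(records):
    """Return {group: FP / (FP + TN)} over each group's non-toxic examples.

    records: iterable of (group, true_label, predicted_label),
    where label 1 means toxic and 0 means not toxic.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 0:              # only non-toxic texts can be false positives
            negatives[group] += 1
            if y_pred == 1:          # flagged as toxic although it is not
                false_positives[group] += 1
    return {g: false_positives[g] / negatives[g] for g in negatives}

# Toy, invented records: (group mentioned in the text, true label, predicted label).
records = [
    ("group_1", 0, 0), ("group_1", 0, 0), ("group_1", 0, 1), ("group_1", 1, 1),
    ("group_2", 0, 1), ("group_2", 0, 1), ("group_2", 0, 0), ("group_2", 1, 1),
]
rates = false_positive_rate_by_group(records)
for group, rate in sorted(rates.items()):
    print(f"{group}: false positive rate = {rate:.2f}")
```

A large gap between groups in such a breakdown is a signal to investigate further, not a complete bias audit.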