# 10.01. Naive Bayes Classification with Feature Selection

## Table of Contents
1. [Introduction](#introduction)
2. [Theory: Naive Bayes](#theory)
3. [Feature Selection (Chi-Square)](#selection)
4. [Training the Classifier](#training)
5. [Summary](#summary)

---

## 1. Introduction <a name="introduction"></a>

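**Naive Bayes** is a probabilistic classifier that assigns a document to the class with the highest posterior probability, treating the document's words as independent of one another given the class. This notebook trains one on Nepali documents, using chi-square feature selection to shrink the vocabulary first.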

### Use Cases:
- **Spam filtering**: Classify emails as spam/not spam
- **Sentiment analysis**: Positive/negative reviews
- **Topic classification**: Categorize news articles
- **Language detection**: Identify document language

### For Nepali IR:
- Classify documents by topic (politics, sports, culture, etc.)
- Filter content by category
- Organize document collections

---

## 2. Theory: Naive Bayes <a name="theory"></a>

### Bayes' Theorem:
$$
P(c|d) = \frac{P(d|c) \cdot P(c)}{P(d)}
$$

Where:
- $P(c|d)$ = Probability of class $c$ given document $d$
- $P(d|c)$ = Probability of document given class (likelihood)
- $P(c)$ = Prior probability of class
- $P(d)$ = Probability of document (constant for all classes)

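### Classification Rule:
Because $P(d)$ is the same for every class, it can be dropped when comparing classes. The "naive" assumption is that the words $w_1, \dots, w_n$ of $d$ are conditionally independent given the class, so $P(d|c) = \prod_{i=1}^{n} P(w_i|c)$ and the maximum a posteriori (MAP) class is: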
$$
c_{MAP} = \arg\max_c P(c) \prod_{i=1}^{n} P(w_i|c)
$$
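
In practice the product of many small probabilities underflows floating-point arithmetic, so implementations usually compare sums of logarithms instead; the argmax is unchanged because the logarithm is monotonic. A minimal sketch, assuming a toy model whose priors and word probabilities are made up for illustration (the names `priors`, `word_probs`, and `map_class` are hypothetical, and every token is assumed to be present in the toy table):

```python
import math

# Hypothetical toy model: values are illustrative, not estimated from the corpus.
priors = {"sports": 0.5, "politics": 0.5}
word_probs = {
    "sports":   {"खेल": 0.30, "संविधान": 0.01},
    "politics": {"खेल": 0.02, "संविधान": 0.25},
}

def map_class(tokens):
    """Return the class maximising log P(c) + sum_i log P(w_i | c)."""
    scores = {}
    for cls, prior in priors.items():
        score = math.log(prior)
        for tok in tokens:
            score += math.log(word_probs[cls][tok])
        scores[cls] = score
    return max(scores, key=scores.get)

print(map_class(["खेल", "खेल", "संविधान"]))  # sports
print(map_class(["संविधान"]))                # politics

# Why logs: a raw product of 200 probabilities of 0.01 underflows to 0.0,
# while the equivalent log-sum stays finite.
print(0.01 ** 200, 200 * math.log(0.01))
```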

---


In [1]:
from pathlib import Path
from collections import Counter, defaultdict
import math

# Load data
DATA_DIR = Path('../data')

def load_documents(data_dir):
    documents = {}
    for file_path in sorted(data_dir.glob('doc*.txt')):
        with open(file_path, 'r', encoding='utf-8') as f:
            documents[file_path.stem] = f.read()
    return documents

def load_stopwords(file_path):
    stopwords = set()
    with open(file_path, 'r', encoding='utf-8') as f:
        next(f)
        for line in f:
            stopwords.add(line.strip())
    return stopwords

def load_stemming_dict(file_path):
    stem_dict = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        next(f)
        for line in f:
            parts = line.strip().split(',')
            if len(parts) == 2:
                stem_dict[parts[0]] = parts[1]
    return stem_dict

def tokenize(text):
    tokens = text.split()
    cleaned = []
    for token in tokens:
        token = token.strip('।,.!?;:"\'-()[]{}/')
        if token and any('\u0900' <= c <= '\u097F' for c in token):
            cleaned.append(token)
    return cleaned

def preprocess_text(text, stopwords, stem_dict):
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in stopwords]
    tokens = [stem_dict.get(t, t) for t in tokens]
    return tokens

documents = load_documents(DATA_DIR)
stopwords = load_stopwords(DATA_DIR / 'nepali_stopwords.csv')
stem_dict = load_stemming_dict(DATA_DIR / 'nepali_stemming.csv')

# Assign a topic label to each document from its ID.
# Demo heuristic only: swap in real tags (e.g. from 00_data_expansion) when available.
labels = {}
for doc_id in documents:
    if 'politics' in doc_id:
        labels[doc_id] = 'politics'
    elif 'sports' in doc_id:
        labels[doc_id] = 'sports'
    elif 'tech' in doc_id:
        labels[doc_id] = 'tech'
    elif 'doc001' <= doc_id <= 'doc003':  # lexicographic comparison on zero-padded IDs
        labels[doc_id] = 'politics'
    elif 'doc004' <= doc_id <= 'doc006':
        labels[doc_id] = 'sports'
    else:
        labels[doc_id] = 'culture'

preprocessed_docs = {}
for doc_id, text in documents.items():
    preprocessed_docs[doc_id] = preprocess_text(text, stopwords, stem_dict)

print(f"✓ Loaded {len(preprocessed_docs)} documents")
print(f"  Classes: {set(labels.values())}")

✓ Loaded 60 documents
  Classes: {'politics', 'culture', 'tech', 'sports'}


## 3. Feature Selection: Chi-Square ($\chi^2$) <a name="selection"></a>

Not all words are useful for classification; some only add noise. Feature selection keeps the top-$K$ terms whose occurrence depends most strongly on the class.

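We calculate the $\chi^2$ statistic for each term $t$ and class $c$ from a $2 \times 2$ table of document counts: $N_{11}$ documents contain $t$ and belong to $c$, $N_{10}$ contain $t$ but are not in $c$, $N_{01}$ lack $t$ but are in $c$, and $N_{00}$ lack $t$ and are not in $c$. With $N$ the total number of documents: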

$$ \chi^2(t, c) = \frac{N(N_{11}N_{00} - N_{10}N_{01})^2}{(N_{11} + N_{10})(N_{01} + N_{00})(N_{11} + N_{01})(N_{10} + N_{00})} $$
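
As a quick check of the formula, the sketch below scores two hypothetical $2 \times 2$ tables for a 60-document collection (the counts and the helper name `chi_square_2x2` are made up for illustration). A term concentrated in one class scores high, while a term spread proportionally across classes scores zero.

```python
def chi_square_2x2(n11, n10, n01, n00):
    """Chi-square statistic for one term/class contingency table."""
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    # A zero marginal leaves the statistic undefined; treat it as 0 here
    return numerator / denominator if denominator > 0 else 0.0

# Term in 12 of the 15 in-class documents and only 3 of the 45 others
concentrated = chi_square_2x2(n11=12, n10=3, n01=3, n00=42)
# Term in one third of the documents both inside and outside the class
independent = chi_square_2x2(n11=5, n10=15, n01=10, n00=30)

print(f"concentrated: {concentrated:.2f}")  # 32.27
print(f"independent:  {independent:.2f}")   # 0.00
```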

In [2]:
def chi_square_selection(docs, labels, top_k=50):
    # 1. Compute Contingency Tables
    # N11: doc has term, is in class
    # N10: doc has term, not in class
    # N01: doc no term, is in class
    # N00: doc no term, not in class
    
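    N = len(docs)
    # Convert token lists to sets so the membership tests in the inner loop are O(1)
    docs = {doc_id: set(terms) for doc_id, terms in docs.items()}
    all_terms = set(w for d in docs.values() for w in d)
    classes = set(labels.values())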
    
    chi_scores = defaultdict(float)
    
    print(f"Calculating Chi-Square for {len(all_terms)} terms...")
    
    for term in all_terms:
        term_score = 0
        
        for c in classes:
            N11, N10, N01, N00 = 0, 0, 0, 0
            
            for doc_id, doc_terms in docs.items():
                has_term = term in doc_terms
                is_class = labels[doc_id] == c
                
                if has_term and is_class: N11 += 1
                elif has_term and not is_class: N10 += 1
                elif not has_term and is_class: N01 += 1
                else: N00 += 1
            
            # Chi-square formula
            numerator = N * (N11*N00 - N10*N01)**2
            denominator = (N11 + N10) * (N01 + N00) * (N11 + N01) * (N10 + N00)
            
            if denominator > 0:
                score = numerator / denominator
                # Keep the maximum score over classes (averaging is an alternative)
                term_score = max(term_score, score)
        
        chi_scores[term] = term_score
        
    # Select Top K
    selected_features = sorted(chi_scores, key=chi_scores.get, reverse=True)[:top_k]
    return selected_features, chi_scores

# Select Top Features
selected_features, scores = chi_square_selection(preprocessed_docs, labels, top_k=100)
print(f"✓ Selected {len(selected_features)} features")
print(f"  Top 5: {selected_features[:5]}")

Calculating Chi-Square for 470 terms...
✓ Selected 100 features
  Top 5: ['मोबाइल', 'सफ्टवेयर', 'संविधान', 'खेल', 'प्रतियोगिता']


## 4. Training with Selected Features <a name="training"></a>
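The classifier's vocabulary is restricted to the selected features: terms outside it are ignored during both training and prediction, and word probabilities use add-one (Laplace) smoothing.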

In [3]:
class NaiveBayesClassifier:
    def __init__(self, vocabulary):
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set(vocabulary)
        self.total_docs = 0
    
    def train(self, documents, labels):
        self.total_docs = len(documents)
        
        for doc_id, terms in documents.items():
            cls = labels[doc_id]
            self.class_counts[cls] += 1
            
            for term in terms:
                if term in self.vocabulary: # ONLY use selected features
                    self.word_counts[cls][term] += 1
        
        print(f"✓ Training complete (Vocab Size: {len(self.vocabulary)})")
    
    def predict(self, terms):
        scores = {}
        vocab_size = len(self.vocabulary)
        
        for cls in self.class_counts:
            prior = math.log(self.class_counts[cls] / self.total_docs)
            total_words = sum(self.word_counts[cls].values())
            likelihood = 0
            
            for term in terms:
                if term in self.vocabulary:
                    word_count = self.word_counts[cls][term]
                    prob = (word_count + 1) / (total_words + vocab_size)
                    likelihood += math.log(prob)
            
            scores[cls] = prior + likelihood
        
        return max(scores, key=scores.get)

# Train
nb = NaiveBayesClassifier(selected_features)
nb.train(preprocessed_docs, labels)

✓ Training complete (Vocab Size: 100)
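
The `predict` method estimates $P(w|c)$ as $(\text{count}(w, c) + 1) / (\text{total}_c + |V|)$, so a selected term that never occurred in a class still gets a small non-zero probability instead of driving the log score to $-\infty$. A small sketch with hypothetical counts (the helper name `smoothed_prob` and the numbers are illustrative, not taken from the trained model):

```python
import math

def smoothed_prob(word_count, total_words, vocab_size):
    """Add-one (Laplace) estimate of P(w | c), mirroring the formula in predict()."""
    return (word_count + 1) / (total_words + vocab_size)

# Hypothetical class with 400 feature tokens and a 100-term vocabulary
total_words, vocab_size = 400, 100

seen = smoothed_prob(20, total_words, vocab_size)    # term seen 20 times in the class
unseen = smoothed_prob(0, total_words, vocab_size)   # term never seen in the class

print(f"seen:   {seen:.4f}  log = {math.log(seen):.3f}")
print(f"unseen: {unseen:.4f}  log = {math.log(unseen):.3f}")
```

With the trained model, a new document would be classified with `nb.predict(preprocess_text(new_text, stopwords, stem_dict))`, where `new_text` is a placeholder for any raw Nepali string. Since both feature selection and training here use all 60 documents, accuracy measured on those same documents would be optimistic.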


## 5. Summary <a name="summary"></a>

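Feature selection with $\chi^2$:
- ⬇️ Reduces model size (here, from 470 terms to 100)
- ⬇️ Removes noisy terms that carry little information about the class
- ⬆️ Can improve accuracy by reducing overfitting to rare terms
- ⬆️ Speeds up training and prediction, since fewer terms are counted and scored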