![alternative text](../../data/nb1_chatgpt.png)
![alternative text](../../data/nb2_chatgpt.png)
# Multinomial Naive Bayes

![alternative text](../../data/nb3_chatgpt.png)

Feature Likelihoods (Conditional Probabilities): For each class, you estimate the likelihood of observing each feature (word or term) given that class. In MNB, this means counting how often each feature appears in the documents of the class and normalizing by the total count of all features in that class. The resulting conditional probabilities indicate how likely each feature is to appear in documents of each class. Laplace smoothing can be applied here so that a feature never seen in a class does not get a probability of zero.

Once you've estimated the class priors and feature likelihoods for each class, the training process is complete, and you have a trained Naive Bayes classifier.
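
To make the two training quantities concrete, here is a minimal sketch on a toy corpus. The documents, labels, and variable names are invented for illustration and are not part of the 20 Newsgroups data; the code only uses the counting-and-normalizing rule described above.

```python
from collections import Counter

# Hypothetical toy corpus: (tokens, label) pairs, invented for illustration.
docs = [
    (["goal", "match", "goal"], "sport"),
    (["match", "team"], "sport"),
    (["vote", "election"], "politics"),
]
vocab = sorted({w for tokens, _ in docs for w in tokens})
alpha = 1.0  # additive smoothing strength

# Class priors: fraction of training documents in each class.
label_counts = Counter(label for _, label in docs)
priors = {c: n / len(docs) for c, n in label_counts.items()}

# Feature likelihoods: smoothed word counts, normalized within each class.
likelihoods = {}
for c in label_counts:
    word_counts = Counter(w for tokens, label in docs if label == c for w in tokens)
    total = sum(word_counts.values()) + alpha * len(vocab)
    likelihoods[c] = {w: (word_counts[w] + alpha) / total for w in vocab}

print(priors["sport"])                # 2 of 3 documents are "sport"
print(likelihoods["sport"]["goal"])   # (2 + 1) / (5 + 1 * 5)
print(likelihoods["politics"]["goal"])  # (0 + 1) / (2 + 1 * 5), non-zero thanks to smoothing
```

Within each class the smoothed likelihoods sum to 1 over the vocabulary, which is what makes them a valid distribution over words.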

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

# Load the 20 Newsgroups dataset
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))


In [182]:
# Split the dataset into features (text) and labels (newsgroup categories)
X = newsgroups.data
y = newsgroups.target
names = newsgroups.target_names

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"number of training examples {len(X_train)} and number of labels {len(set(y_train))}")

number of training examples 15076 and number of labels 20


In [53]:
import re
import numpy as np
from collections import Counter

class CountVectorizer:
    def __init__(self, max_features=None):
        self.max_features = max_features
        self.vocabulary_ = {}
    
    def fit(self, documents):
        # Tokenize and build the vocabulary
        token_pattern = r"(?u)\b\w\w+\b"
        words = re.findall(token_pattern, " ".join(documents).lower())
        counts = Counter(words)
        unique_words = sorted(set(words))  # sorted so the column order is deterministic
        if self.max_features is not None:
            unique_words = [x[0] for x in counts.most_common(self.max_features)]
        self.vocabulary_ = {word: index for index, word in enumerate(unique_words)}
        return self
    
    def transform(self, documents):
        # Transform documents into count vectors
        if not self.vocabulary_:
            raise ValueError("CountVectorizer has not been fitted.")
        
        feature_names = list(self.vocabulary_.keys())
        X = np.zeros((len(documents), len(feature_names)), dtype=int)
        
        for i, doc in enumerate(documents):
            words = re.findall(r"(?u)\b\w\w+\b", doc.lower())
            for word in words:
                if word in self.vocabulary_:
                    feature_index = self.vocabulary_[word]
                    X[i, feature_index] += 1
        
        return X
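
As a quick check of what the token pattern keeps, here is a small sketch that applies the same regular expression to one sentence. The sentence is a shortened line from the dataset, used here only as an illustration. Tokens must be at least two word characters long, so single-character words are dropped.

```python
import re
from collections import Counter

token_pattern = r"(?u)\b\w\w+\b"  # same pattern as in the class above
text = "I need to port OS/2 PM applications to X"

tokens = re.findall(token_pattern, text.lower())
print(tokens)
# ['need', 'to', 'port', 'os', 'pm', 'applications', 'to']

counts = Counter(tokens)
print(counts["to"])  # 2
```

Note that "I", "2" and "X" disappear, and "OS/2" is reduced to "os" because the slash is a word boundary.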


In [75]:
# Create and fit the CountVectorizer
num_vocab = 10000
vectorizer = CountVectorizer(max_features=num_vocab)
vectorizer.fit(X_train)
# Transform the documents into count vectors
X_train_tokens = vectorizer.transform(X_train)
X_test_tokens = vectorizer.transform(X_test)
print(X_train_tokens.shape)

(15076, 10000)


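The parameter "alpha" is the additive smoothing strength (Laplace smoothing when alpha = 1, Lidstone smoothing when 0 < alpha < 1). It is added to the count of every word in every class, so a word that was never observed in a specific class during training still gets a small non-zero probability instead of a zero that would wipe out the whole product of likelihoods.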
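
The effect of "alpha" on an unseen word can be sketched with a few lines of plain Python. The helper name `smoothed_log_prob` and the counts below are hypothetical, chosen only to illustrate the smoothing formula.

```python
import math

def smoothed_log_prob(count, class_total, vocab_size, alpha):
    """log P(word | class) with additive smoothing."""
    return math.log((count + alpha) / (class_total + alpha * vocab_size))

# Hypothetical setting: the word never occurs in this class,
# the class contains 1000 tokens, and the vocabulary has 10000 words.
for alpha in (1.0, 0.1, 0.01):
    print(alpha, smoothed_log_prob(0, 1000, 10000, alpha))

# Without smoothing the probability is exactly zero and its log is undefined.
try:
    smoothed_log_prob(0, 1000, 10000, 0.0)
except ValueError:
    print("alpha = 0 gives log(0), which is undefined")
```

A smaller alpha penalizes unseen words more strongly, while a larger alpha pulls all word probabilities towards a uniform distribution.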

Calculating the likelihoods for the features present in the document:
![alternative text](../../data/ll_nb.png)
![alternative text](../../data/ll_nb2.png)
![alternative text](../../data/ll_nb3.png)



In [172]:
import numpy as np

class MultinomialNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing parameter
        self.class_prior_ = None
        self.feature_log_prob_ = None
        self.classes_ = None

    def fit(self, X, y):
        # Calculate class priors
        # P(class) represents the prior probability of a class.
        # P(class) = (Number of samples in class) / (Total number of samples).
        unique_classes, class_counts = np.unique(y, return_counts=True)
        total_samples = len(y)
        self.class_prior_ = class_counts / total_samples
        self.classes_ = unique_classes

        # Calculate conditional probabilities (log probabilities)
        num_classes = len(unique_classes)
        num_features = X.shape[1]
        
        # for every class
        # P(word|class) represents the conditional probability of observing a word given a class.
        # P(word|class) = (Count of word occurrences in documents of the class + alpha) / (Total count of words in documents of the class + alpha * Vocabulary size).
        self.feature_log_prob_ = np.zeros((num_classes, num_features))
        for i, cls in enumerate(unique_classes):
            X_cls = X[y == cls] 
            total_word_count = X_cls.sum() + num_features * self.alpha
            # Smoothed log P(word | class) for every word in the vocabulary
            self.feature_log_prob_[i, :] = np.log((X_cls.sum(axis=0) + self.alpha) / total_word_count)

    def predict(self, X):
        # Predict class labels for input samples
        return self.classes_[np.argmax(self.predict_log_proba(X), axis=1)]

    def predict_log_proba(self, X):
        # Calculate the unnormalized log posterior of each class for the input samples
        # P(class|document) = (P(class) * P(document|class)) / P(document)
        # P(document) is the same for every class, so it is dropped:
        # log P(class|document) = log P(class) + log P(document|class) + const = prior + likelihood + const
        # likelihood: P(document|class) is proportional to P(word_1|class)^count_1 * ... * P(word_n|class)^count_n
        # X @ feature_log_prob_.T sums count_j * log P(word_j|class): the log-likelihood of each document under each class
        prior = np.log(self.class_prior_)
        likelihood = np.dot(X, self.feature_log_prob_.T)
        posterior_probs = prior + likelihood
        
        return posterior_probs
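
The scores returned by `predict_log_proba` above leave out the `P(document)` term, so they are good enough for `argmax` but do not exponentiate to probabilities that sum to 1. If normalized values are needed, one common approach is the log-sum-exp trick. The sketch below is a stdlib-only illustration with made-up scores and a hypothetical helper name; with NumPy arrays the same idea would be applied row by row.

```python
import math

def normalize_log_scores(joint_log_scores):
    """Turn unnormalized log P(class) + log P(doc|class) scores into log P(class|doc)."""
    m = max(joint_log_scores)  # subtract the max so exp() does not underflow
    log_evidence = m + math.log(sum(math.exp(s - m) for s in joint_log_scores))
    return [s - log_evidence for s in joint_log_scores]

# Hypothetical joint log scores for three classes (long documents give very negative values).
scores = [-1050.0, -1052.0, -1060.0]

log_post = normalize_log_scores(scores)
post = [math.exp(v) for v in log_post]

print(math.exp(scores[0]))  # 0.0, exponentiating directly underflows
print(sum(post))            # approximately 1.0 after normalization
```

Normalizing does not change which class has the highest score, so `predict` is unaffected.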


In [175]:
MNB = MultinomialNB(0.1)
MNB.fit(X_train_tokens,y_train)

print(f" model size {MNB.feature_log_prob_.shape} \n and class prior {MNB.class_prior_}")

 model size (20, 10000) 
 and class prior [0.04298222 0.05114089 0.05240117 0.05299814 0.05027859 0.05127355
 0.05187052 0.05266649 0.05492173 0.05193685 0.0531308  0.05240117
 0.05187052 0.05279915 0.05293181 0.05273282 0.04789069 0.05027859
 0.04085964 0.03263465]


In [167]:
test_accuracy = np.sum(MNB.predict(X_test_tokens) == y_test)/len(y_test)
test_accuracy


# Let's walk through an example

![alternative text](../../data/ll_nb4.png)
![alternative text](../../data/ll_nb5.png)
![alternative text](../../data/ll_nb6.png)


In [188]:
index = 20
print(X_test[index],'\n\n label : ',names[y_test[index]])

I need to port several OS/2 PM applications to X (OpenWindows or Motif),
and desperately need any information on how to go about doing this (short
of a complete rewrite.
 
Are there any tool to make porting easer?
Any References?
Any talent out there to hire to do this?
I will even take an OS/2 Presentation Mgr emulator for sun!
 
Any, and all replies (except flames) welcome!
 
 
Brian Colaric 

 label :  comp.windows.x


In [237]:
test_tokens = X_test_tokens[index]
model = MNB.feature_log_prob_ # log likelihood of each word given each class
print("test_tokens", np.shape(test_tokens), "model size ",model.shape)


test_tokens (10000,) model size  (20, 10000)


In [240]:
# log-likelihood of the test sample given each class + log prior = unnormalized log posterior of each class given the sample
posterior = model @ test_tokens + np.log(MNB.class_prior_)
names[np.argmax(posterior)]  # prediction

'comp.windows.x'
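
The prior has to enter this sum as log P(class), because the other term is a log-likelihood. A small sketch with made-up numbers (two hypothetical classes "A" and "B") shows that adding the raw probability instead can change the predicted class:

```python
import math

# Hypothetical two-class example; all numbers are invented for illustration.
log_likelihood = {"A": -10.0, "B": -11.0}  # log P(document | class)
prior = {"A": 0.1, "B": 0.9}               # P(class)

# Inconsistent: a raw probability added to a log-likelihood.
raw_scores = {c: log_likelihood[c] + prior[c] for c in prior}
# Consistent: both terms in log space.
log_scores = {c: log_likelihood[c] + math.log(prior[c]) for c in prior}

pred_raw = max(raw_scores, key=raw_scores.get)
pred_log = max(log_scores, key=log_scores.get)
print(pred_raw, pred_log)  # A B
```

On 20 Newsgroups the class priors are close to uniform, so the two variants usually agree, but the log-space version is the one that matches Bayes' rule.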