<div style="height:200px;">
    <img src="https://storage.googleapis.com/kaggle-media/competitions/nlp1-cover.jpg"/>
</div>

<h3 class="list-group-item list-group-item-action active" data-toggle="list" style="color:#02288b; background:#c50244; border:1px dashed;" role="tab" aria-controls="home"><center>Intro</center></h3>

# Goal

The goal of this notebook is to learn how the Naive Bayes algorithm works by implementing sklearn's MultinomialNB class from scratch.

# Definition

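Naive Bayes is a classification model based on Bayes' theorem. It is called "naive" because it assumes that the features (here, word counts) are conditionally independent given the class.
The classic example of Bayes' theorem in practice is classifying email messages as spam or ham.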

For this 'Real or Not?' task, the Naive Bayes equations are:

P(real | text) = (P(text | real) * P(real)) / P(text)<br>
P(fake | text) = (P(text | fake) * P(fake)) / P(text)
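
As a quick illustration of the two equations above, here is a minimal sketch with made-up numbers. The priors and likelihoods are hypothetical and not taken from the competition data; the point is only that both posteriors share the denominator P(text) and therefore sum to 1.

```python
# Hypothetical numbers, chosen only to illustrate the two equations above.
p_real, p_fake = 0.4, 0.6            # priors P(real), P(fake)
p_text_given_real = 0.02             # likelihood P(text | real)
p_text_given_fake = 0.005            # likelihood P(text | fake)

# P(text) is the shared denominator (law of total probability over both classes).
p_text = p_text_given_real * p_real + p_text_given_fake * p_fake

p_real_given_text = p_text_given_real * p_real / p_text
p_fake_given_text = p_text_given_fake * p_fake / p_text

print(round(p_real_given_text, 3), round(p_fake_given_text, 3))  # 0.727 0.273
```

Because P(text) is identical for both classes, comparing the numerators alone is enough to pick the more probable class.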

<h3 class="list-group-item list-group-item-action active" data-toggle="list" style="color:#02288b; background:#c50244; border:1px dashed;" role="tab" aria-controls="home"><center>Data</center></h3>

# Load the data

In [None]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils.validation import check_X_y, check_array

In [None]:
path = "/kaggle/input/nlp-getting-started/"
train_df = pd.read_csv(path + "train.csv")
submit_df = pd.read_csv(path + "test.csv")

y_train = train_df['target']
y_train.unique()

train_df.head()

## Create the vocabulary, the word count and prepare the train data

In [None]:
# collect every whitespace-separated token from the training texts
vocabulary = []
for text in train_df['text']:
    vocabulary.extend(text.split())

In [None]:
vocabulary = np.array(vocabulary)
vocab = np.unique(vocabulary)

print("Vocab:", len(vocab))

In [None]:
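# With a fixed vocabulary, CountVectorizer only counts the given terms.
# Note: by default it lowercases and re-tokenizes the text, so vocabulary
# entries containing uppercase letters or punctuation always get a zero count.
vectorizer = CountVectorizer(vocabulary=vocab)
word_counts = vectorizer.fit_transform(train_df.text.to_numpy()).toarray()

# dense (n_messages, n_vocab) matrix of word counts
X_train = word_counts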
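
To see how a fixed vocabulary interacts with CountVectorizer's default preprocessing, here is a small sketch on a toy vocabulary. The terms and the sentence are hypothetical; the behaviour shown relies on the defaults `lowercase=True` and the default `token_pattern`, which keeps only runs of two or more word characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-vocabulary mixing lowercase, capitalized and punctuated entries.
toy_vocab = ["fire", "Fire", "#fire", "forest"]
toy_vectorizer = CountVectorizer(vocabulary=toy_vocab)

# The text is lowercased and split into word tokens: forest, fire, near, fire.
counts = toy_vectorizer.transform(["Forest fire near #Fire"]).toarray()
print(counts)  # [[2 0 0 1]]
```

Only the lowercase, punctuation-free entries are ever matched, so a vocabulary built with a plain `str.split()` will likely contain many columns that stay at zero. Those columns do not break the model, but they do make the matrix larger than necessary.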

<h3 class="list-group-item list-group-item-action active" data-toggle="list" style="color:#02288b; background:#c50244; border:1px dashed;" role="tab" aria-controls="home"><center>Implementation</center></h3>

|  Variable      | Math          |  Description                                                        |
|----------------|---------------|---------------------------------------------------------------------|
| prior          | P(y)          |  Probability that a randomly selected message belongs to a class    |
| lk_word        | P(x_i&#124;y) |  Likelihood of each word, conditional on the message class          |
| lk_message     | P(x&#124;y)   |  Likelihood of an entire message, conditional on the message class  |
| normalize_term | P(x)          |  Likelihood of an entire message across all possible classes        |
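
The quantities in the table can be traced by hand on a tiny example. The sketch below uses a hypothetical 4-message, 3-word count matrix (not the competition data) and the same Laplace smoothing idea (`alpha = 1.0`) as the class that follows.

```python
import numpy as np

# Toy word-count matrix (hypothetical): 4 messages, 3 vocabulary words.
X = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 3, 0],
              [0, 1, 2]])
y = np.array([1, 1, 0, 0])
alpha = 1.0  # Laplace smoothing

classes = np.unique(y)                                      # [0, 1]
prior = np.array([(y == c).mean() for c in classes])        # P(y)
counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
lk_word = counts / counts.sum(axis=1, keepdims=True)        # P(x_i | y)

message = np.array([1, 0, 1])                               # word counts of a new message
lk_message = (lk_word ** message).prod(axis=1)              # P(x | y)
numerators = lk_message * prior                             # P(x | y) * P(y)
normalize_term = numerators.sum()                           # P(x)
posterior = numerators / normalize_term                     # P(y | x)

print(posterior.round(4))  # [0.2286 0.7714]
```

The multinomial coefficient is left out because it is the same for every class and cancels when dividing by `normalize_term`.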

In [None]:
class NaiveBayes():
    def __init__(self, alpha=1.0):
        self.prior = None
        self.word_counts = None
        self.lk_word = None
        self.alpha = alpha
        
    def fit(self, x, y):
        '''
        Fit the features and the labels
        Calculate prior, word_counts and lk_word
        '''
        x, y = check_X_y(x, y)
        n = x.shape[0]
        
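        # calculate the prior - fraction of texts belonging to each class (fake or real)
        # keep a plain list: the per-class arrays have different numbers of rows,
        # so wrapping them in np.array fails on recent NumPy versions
        x_per_class = [x[y == c] for c in np.unique(y)]
        self.prior = np.array([len(x_class) / n for x_class in x_per_class])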
        
        # calculate the likelihood for each word 'lk_word'
        self.word_counts = np.array([sub_arr.sum(axis=0) for sub_arr in x_per_class]) + self.alpha
        self.lk_word = self.word_counts / self.word_counts.sum(axis=1).reshape(-1, 1)
        
        return self
    
    def _get_class_numerators(self, x):
        '''
        Calculate, for each class, the likelihood of an entire message
        conditional on the class (fake or real), multiplied by the class prior
        '''
        n, m = x.shape[0], self.prior.shape[0]
        
        class_numerators = np.zeros(shape=(n, m))
        for i, counts in enumerate(x):
            # each row of x holds the word counts of one message
            word_exists = counts.astype(bool)
            # P(x_i | y) raised to the number of times word i appears in the message
            lk_words_present = self.lk_word[:, word_exists] ** counts[word_exists]
            lk_message = lk_words_present.prod(axis=1)
            class_numerators[i] = lk_message * self.prior
        
        return class_numerators
    
    def _normalized_conditional_probs(self, class_numerators):
        '''
        Conditional probabilities = class_numerators / normalize_term
        '''
        # normalize term is the likelihood of an entire message (sum of the numerators over all classes)
        normalize_term = class_numerators.sum(axis=1).reshape(-1,1)
        conditional_probs = class_numerators / normalize_term
        assert (np.abs(conditional_probs.sum(axis=1) - 1) < 0.001).all(), 'rows should sum to 1'
        
        return conditional_probs
    
    def predict_proba(self, x):
        '''
        Return the probabilities for each class (fake or real)
        '''
        class_numerators = self._get_class_numerators(x)
        conditional_probs = self._normalized_conditional_probs(class_numerators)
        
        return conditional_probs
    

    def predict(self, x):
        '''
        Return the index of the class with the highest probability (argmax).
        With the labels 0 and 1 used here, the index equals the label.
        '''
        return self.predict_proba(x).argmax(axis=1)
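
Multiplying many small per-word likelihoods, as `_get_class_numerators` does, can underflow to zero for long messages. A common alternative is to work in log space. The sketch below is one possible variant, not part of the class above: the helper names `log_class_numerators` and `normalized_probs` are hypothetical, and the toy values are chosen only to trigger the underflow.

```python
import numpy as np

def log_class_numerators(x, lk_word, prior):
    """log(P(x | y) * P(y)) for every row of x and every class."""
    # x: (n_messages, n_words) counts; lk_word: (n_classes, n_words); prior: (n_classes,)
    return x @ np.log(lk_word).T + np.log(prior)

def normalized_probs(log_numerators):
    """Softmax over classes, shifted by the row maximum for numerical stability."""
    shifted = log_numerators - log_numerators.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    return probs / probs.sum(axis=1, keepdims=True)

# Toy check with hypothetical values: one word repeated 400 times.
lk_word = np.array([[0.001, 0.999],
                    [0.002, 0.998]])
prior = np.array([0.5, 0.5])
x = np.array([[400, 0]])

direct = (lk_word ** x[0]).prod(axis=1) * prior   # both entries underflow to 0.0
stable = normalized_probs(log_class_numerators(x, lk_word, prior))

print(direct)             # [0. 0.]
print(stable.argmax(1))   # [1]
```

With the direct product, both numerators are zero and the normalization would divide zero by zero. The log-space version still identifies the second class as the more probable one. This relies on `alpha > 0`, which keeps every entry of `lk_word` strictly positive so the logarithm is finite.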

In [None]:
NaiveBayes().fit(X_train, y_train).predict(X_train)
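
Since the goal is to mirror sklearn's MultinomialNB, one way to sanity-check the formulas is to compare them against the library on a small input. The sketch below recomputes the prior and word likelihoods with NumPy on a hypothetical toy matrix (not the competition data) and compares them with the fitted `class_log_prior_` and `feature_log_prob_` attributes.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix (hypothetical): 4 messages, 3 vocabulary words.
X = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 3, 0],
              [0, 1, 2]])
y = np.array([1, 1, 0, 0])

clf = MultinomialNB(alpha=1.0).fit(X, y)

# Same formulas as in the from-scratch class, written out directly.
classes = np.unique(y)
prior = np.array([(y == c).mean() for c in classes])
counts = np.array([X[y == c].sum(axis=0) for c in classes]) + 1.0
lk_word = counts / counts.sum(axis=1, keepdims=True)

print(np.allclose(np.log(prior), clf.class_log_prior_))     # True
print(np.allclose(np.log(lk_word), clf.feature_log_prob_))  # True
```

The same comparison could be run between `NaiveBayes().predict_proba` and `MultinomialNB().predict_proba` on a slice of `X_train`, assuming the messages are short enough that the direct product does not underflow.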

<h3 class="list-group-item list-group-item-action active" data-toggle="list" style="color:#02288b; background:#c50244; border:1px dashed;" role="tab" aria-controls="home"><center>Submission</center></h3>

In [None]:
submit_df.head()

In [None]:
# word counts for the submission texts, reusing the vectorizer (and its fixed vocabulary)
# from training; transform() is used so nothing is re-fitted on the test data
submit_x = vectorizer.transform(submit_df.text.to_numpy()).toarray()

# use naive bayes to predict the submission
result = NaiveBayes().fit(X_train, y_train).predict(submit_x)

# one row per test message: its id and the predicted target
final_df = pd.DataFrame({'id': submit_df['id'], 'target': result})
final_df.to_csv("submission.csv", index=False)

Thank you and don't forget to up-vote in order to support the community ;)