# Naive Bayes

- **Real-time prediction**: Naive Bayes is an eager learner and is fast at both training and prediction, so it can be used to make predictions in **real time**.
- **Multi-class prediction**: The algorithm handles multi-class problems naturally: it can estimate the probability of **each class of the target variable**.
- **Text classification / spam filtering / sentiment analysis**: Naive Bayes classifiers are widely used in text classification, where they handle multi-class problems well and the independence assumption is a workable approximation, and they often perform competitively with other algorithms. As a result, they are common in spam filtering (identifying spam e-mail) and sentiment analysis (for example, identifying positive and negative customer sentiment in social media).
- **Recommendation systems**: A Naive Bayes classifier combined with collaborative filtering can be used to build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource.
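
For reference, the model behind these applications can be written compactly. This is the standard naive Bayes formulation, given here as background rather than taken from the notebook; the symbols $C_k$ (a class) and $x_1, \dots, x_n$ (the features) are notation introduced for illustration.

```latex
P(C_k \mid x_1, \dots, x_n) \;\propto\; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)
```

The product form is the "naive" conditional-independence assumption: each feature is treated as independent of the others given the class. The predicted class is the one for which this quantity is largest.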

In [19]:
import re
import math
from collections import Counter, defaultdict

def tokenize(message):
    message = message.lower()
    all_words = re.findall("[a-z0-9']+", message)
    return set(all_words)


print(tokenize("This is a test"))

{'a', 'test', 'is', 'this'}
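
Because `tokenize` returns a set, repeated words collapse to a single entry, so the classifier only sees whether a word is present, not how often it occurs. A small illustration with a made-up subject line (the function is repeated so the snippet runs on its own):

```python
import re

def tokenize(message):
    message = message.lower()
    all_words = re.findall("[a-z0-9']+", message)
    return set(all_words)

# Case differences and repeats collapse; punctuation is dropped,
# but apostrophes inside words are kept.
tokens = tokenize("FREE money!!! Free MONEY, don't wait")
print(sorted(tokens))
```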


In [7]:
def count_words(training_set):
    """training set consists of pairs (message, is_spam)"""
    counts = defaultdict(lambda: [0, 0])
    for message, is_spam in training_set:
        for word in tokenize(message):
            counts[word][0 if is_spam else 1] += 1
    return counts

def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
    """turn the word_counts into a list of triplets
    w, p(w | spam) and p(w | ~spam)"""
    return [(w,
             (spam + k) / (total_spams + 2 * k),
             (non_spam + k) / (total_non_spams + 2 * k))
             for w, (spam, non_spam) in counts.items()]


def spam_probability(word_probs, message):
    message_words = tokenize(message)
    log_prob_if_spam = log_prob_if_not_spam = 0.0

    for word, prob_if_spam, prob_if_not_spam in word_probs:

        # for each word in the message,
        # add the log probability of seeing it
        if word in message_words:
            log_prob_if_spam += math.log(prob_if_spam)
            log_prob_if_not_spam += math.log(prob_if_not_spam)

        # for each word that's not in the message
        # add the log probability of _not_ seeing it
        else:
            log_prob_if_spam += math.log(1.0 - prob_if_spam)
            log_prob_if_not_spam += math.log(1.0 - prob_if_not_spam)

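    # shift both log probabilities by their maximum before exponentiating,
    # so that at least one term is exp(0) = 1 and the sum cannot underflow to 0
    max_log = max(log_prob_if_spam, log_prob_if_not_spam)
    prob_if_spam = math.exp(log_prob_if_spam - max_log)
    prob_if_not_spam = math.exp(log_prob_if_not_spam - max_log)
    return prob_if_spam / (prob_if_spam + prob_if_not_spam)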


class NaiveBayesClassifier:

    def __init__(self, k=0.5):
        self.k = k
        self.word_probs = []

    def train(self, training_set):

        # count spam and non-spam messages
        num_spams = len([is_spam
                         for message, is_spam in training_set
                         if is_spam])
        num_non_spams = len(training_set) - num_spams

        # run training data through our "pipeline"
        word_counts = count_words(training_set)
        self.word_probs = word_probabilities(word_counts,
                                             num_spams,
                                             num_non_spams,
                                             self.k)

    def classify(self, message):
        return spam_probability(self.word_probs, message)
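
To make the smoothing and log-space steps concrete, here is a minimal standalone sketch. The counts and the helper name `smoothed` are invented for illustration; the formula is the one used in `word_probabilities`.

```python
import math

def smoothed(count, total, k=0.5):
    """Smoothed estimate (count + k) / (total + 2k), as in word_probabilities."""
    return (count + k) / (total + 2 * k)

# Hypothetical counts: a word seen in 0 of 98 spams and 10 of 200 hams.
p_if_spam = smoothed(0, 98)     # 0.5 / 99: small, but not zero
p_if_ham = smoothed(10, 200)    # 10.5 / 201

# Without smoothing (k=0) the spam estimate would be exactly 0,
# and math.log(0) would raise ValueError.
unsmoothed = smoothed(0, 98, k=0)

# Multiplying many small probabilities underflows to 0.0 in floating point,
# while summing their logarithms stays finite.
probs = [p_if_spam] * 200
product = math.prod(probs)
log_sum = sum(math.log(p) for p in probs)
print(unsmoothed, product, log_sum)
```

This is why `spam_probability` accumulates log probabilities instead of multiplying the raw values.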


## Loading the data and training the model

In [17]:
import glob
import random

path = r'../data/spam/**/*'

data = []

for fn in glob.glob(path):
    is_spam = 'ham' not in fn
    
    with open(fn, 'r') as file:
        try:
            for line in file:
                if line.startswith("Subject:"):
                    # strip the leading "Subject:" prefix and keep what's left
                    subject = re.sub(r"^Subject:", "", line).strip()
                    data.append((subject, is_spam))
        except UnicodeDecodeError:
            pass

def split_data(data, prob):
    """split data into fractions [prob, 1 - prob]"""
    results = [], []
    for row in data:
        results[0 if random.random() < prob else 1].append(row)
    return results


random.seed(0)

train_data, test_data = split_data(data, 0.75)

classifier = NaiveBayesClassifier()
classifier.train(train_data)

## Testing the model

In [20]:
classified = [(subject, is_spam, classifier.classify(subject)) for subject, is_spam in test_data]

# treat a spam probability above 0.5 as a spam prediction;
# each key is a pair (actually spam, predicted spam)
counts = Counter((is_spam, spam_prob > 0.5) for _, is_spam, spam_prob in classified)

In [21]:
print("Counts ", counts)

Counts  Counter({(False, False): 672, (True, True): 71, (True, False): 33, (False, True): 19})
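
Each key in the counter is a pair `(is_spam, predicted_spam)`, so the four entries are the cells of a confusion matrix. As a sketch, the usual summary metrics can be computed from the counts printed above (the variable names are illustrative):

```python
from collections import Counter

# Counts copied from the output above; keys are (actual_spam, predicted_spam).
counts = Counter({(False, False): 672, (True, True): 71,
                  (True, False): 33, (False, True): 19})

tp = counts[(True, True)]    # spam correctly flagged
fp = counts[(False, True)]   # ham wrongly flagged as spam
fn = counts[(True, False)]   # spam that slipped through
tn = counts[(False, False)]  # ham correctly passed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

With these counts, precision is about 0.79 and recall about 0.68. Accuracy is higher, about 0.93, largely because most test messages are ham.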


Look at the most badly misclassified messages:

In [22]:
classified.sort(key=lambda row: row[2])

In [29]:
spammiest_hams = list(filter(lambda row: not row[1], classified))[-5:]
hammiest_spams = list(filter(lambda row: row[1], classified))[:5]

In [33]:
def p_spam_given_word(word_prob):
    """uses Bayes's theorem to compute p(spam | message contains word),
    assuming spam and non-spam are equally likely a priori"""
    
    word, prob_if_spam, prob_if_not_spam = word_prob
    return prob_if_spam / (prob_if_spam + prob_if_not_spam)

words = sorted(classifier.word_probs, key=p_spam_given_word)

In [34]:
spammiest_words = words[-5:]
hammiest_words = words[:5]

In [35]:
spammiest_words

[('500', 0.02664576802507837, 0.00024142926122646064),
 ('systemworks', 0.029780564263322883, 0.00024142926122646064),
 ('money', 0.029780564263322883, 0.00024142926122646064),
 ('adv', 0.032915360501567396, 0.00024142926122646064),
 ('rates', 0.04231974921630094, 0.00024142926122646064)]

In [36]:
hammiest_words

[('spambayes', 0.001567398119122257, 0.04949299855142443),
 ('2', 0.001567398119122257, 0.04080154514727185),
 ('users', 0.001567398119122257, 0.0374215354901014),
 ('razor', 0.001567398119122257, 0.031144374698213424),
 ('zzzzteana', 0.001567398119122257, 0.02631578947368421)]
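
As a quick check on the ranking, here is `p_spam_given_word` applied to two of the triplets printed above. The zero-count probabilities in those triplets also match what the smoothing formula gives with `k = 0.5`, namely `0.5 / (total + 1)`; the totals 318 and 2070 used below are inferred from the printed values, not stated in the notebook.

```python
def p_spam_given_word(word_prob):
    """P(spam | message contains word), assuming equal class priors."""
    word, prob_if_spam, prob_if_not_spam = word_prob
    return prob_if_spam / (prob_if_spam + prob_if_not_spam)

# Triplets copied from the output above.
rates = ('rates', 0.04231974921630094, 0.00024142926122646064)
spambayes = ('spambayes', 0.001567398119122257, 0.04949299855142443)

print(p_spam_given_word(rates))      # close to 1: strong spam signal
print(p_spam_given_word(spambayes))  # close to 0: strong ham signal

# A word never seen in a class gets (0 + 0.5) / (total + 2 * 0.5).
# Inferred totals (an assumption): 318 training spams, 2070 training hams.
spam_total, ham_total = 318, 2070
print(abs(0.5 / (spam_total + 1) - spambayes[1]) < 1e-12)
print(abs(0.5 / (ham_total + 1) - rates[2]) < 1e-12)
```

So every word in the "spammiest" list never appeared in a training ham, and every word in the "hammiest" list never appeared in a training spam.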

How could we get better performance? One obvious way would be to get more data to train on. There are a number of ways to improve the model as well. Here are some possibilities that you might try:
- Look at the message content, not just the subject line. You’ll have to be careful how you deal with the message headers.
- Our classifier takes into account every word that appears in the training set, even words that appear only once. Modify the classifier to accept an optional `min_count` threshold and ignore tokens that don’t appear at least that many times.
- The tokenizer has no notion of similar words (e.g., “cheap” and “cheapest”). Modify the classifier to take an optional `stemmer` function that converts words to equivalence classes of words. For example, a really simple stemmer function might be:

In [37]:
def drop_final_s(word):
    return re.sub("s$", "", word)

Creating a good stemmer function is hard. People frequently use the Porter Stemmer.
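
As a starting point for the last two suggestions, the counting step could take both options as parameters. This is a sketch, not the notebook's implementation: the `min_count` and `stemmer` parameters and the toy training set are assumptions made for illustration.

```python
import re
from collections import defaultdict

def drop_final_s(word):
    return re.sub("s$", "", word)

def tokenize(message, stemmer=None):
    """Lowercase, split into words, optionally map each word through stemmer."""
    words = re.findall("[a-z0-9']+", message.lower())
    if stemmer is not None:
        words = [stemmer(w) for w in words]
    return set(words)

def count_words(training_set, min_count=1, stemmer=None):
    """Like count_words above, but drops tokens that appear in fewer than
    min_count messages in total (spam plus non-spam)."""
    counts = defaultdict(lambda: [0, 0])
    for message, is_spam in training_set:
        for word in tokenize(message, stemmer):
            counts[word][0 if is_spam else 1] += 1
    return {w: c for w, c in counts.items() if sum(c) >= min_count}

# Toy data, invented for this example.
training_set = [("cheap rates", True),
                ("cheap rate today", True),
                ("meeting notes", False)]

filtered = count_words(training_set, min_count=2, stemmer=drop_final_s)
print(filtered)
```

Here "rates" and "rate" are merged by the stemmer, and tokens seen in only one message are dropped. The returned dictionary can be passed to `word_probabilities` unchanged, since that function only calls `counts.items()`.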