## 4.1
Assume the following likelihoods for each word being part of a positive or negative movie review, and equal prior probabilities for each class.

|  | pos | neg |
| --- | --- | --- |
| I | 0.09 | 0.16 |
| always | 0.07 | 0.06 |
| like | 0.29 | 0.06 |
| foreign | 0.04 | 0.15 |
| films | 0.08 | 0.11 |

What class will naive Bayes assign to the sentence “I always like foreign films.”?

In [1]:
"""
P(c|d) ∝ P(d|c) P(c) = P(c) * prod_i P(w_i|c)
(proportional rather than equal: the constant denominator P(d) is dropped)
"""
import pandas as pd

df = pd.DataFrame(
    [[0.09, 0.16],
     [0.07, 0.06],
     [0.29, 0.06],
     [0.04, 0.15],
     [0.08, 0.11]],
    columns = ('pos', 'neg'),
    index=('I', 'always', 'like', 'foreign', 'films')
)

# Equal priors, as stated in the problem: P(pos) = P(neg) = 0.5.
prior = 0.5
prob_pos = prior * df.pos.prod()
prob_neg = prior * df.neg.prod()

print('positive' if prob_pos > prob_neg else 'negative')

negative
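
The arithmetic behind this answer can be written out by hand. This is a sketch that assumes the equal priors stated in the problem, P(pos) = P(neg) = 0.5, with the final values rounded:

```latex
\begin{align*}
P(\text{pos}) \prod_i P(w_i \mid \text{pos}) &= 0.5 \times 0.09 \times 0.07 \times 0.29 \times 0.04 \times 0.08 \approx 2.92 \times 10^{-6} \\
P(\text{neg}) \prod_i P(w_i \mid \text{neg}) &= 0.5 \times 0.16 \times 0.06 \times 0.06 \times 0.15 \times 0.11 \approx 4.75 \times 10^{-6}
\end{align*}
```

The negative score is the larger of the two, so the sentence is classified as negative. Because the priors are equal, they do not affect the comparison.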


## 4.2
Given the following short movie reviews, each labeled with a genre, either comedy or action:
1. fun, couple, love, love **comedy**
2. fast, furious, shoot **action**
3. couple, fly, fast, fun, fun **comedy**
4. furious, shoot, shoot, fun **action**
5. fly, fast, shoot, love **action**

and a new document D:  
&emsp;*fast, couple, shoot, fly*  
compute the most likely class for D. Assume a naive Bayes classifier and use
add-1 smoothing for the likelihoods.
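
For reference, add-1 (Laplace) smoothing of the likelihoods can be written as below. The notation is an assumption, since the exercise does not define it: count(w, c) is the number of times word w occurs in the training documents of class c, and V is the vocabulary of all word types in the training data.

```latex
\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}
```

For the five reviews above, |V| = 7, the comedy reviews contain 9 word tokens in total, and the action reviews contain 11.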

In [2]:
import math
from collections import defaultdict
from nltk.tokenize import word_tokenize


class NaiveBayes:
    def __init__(self, binary=False):
        self.binary = binary
        self.trained = False

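    def __call__(self, doc):
        if not self.trained:
            return ''

        # After training, _tokenize keeps only words seen in the vocabulary.
        words = self._tokenize(doc)
        if self.binary:
            # Binarized naive Bayes also counts each test word at most once.
            words = set(words)

        # Score each class in log space: log prior plus summed log likelihoods.
        sums = []
        for c in self.classes:
            sums.append(self.log_prior[c] + sum(self.log_likelihood[c][word] for word in words))

        return self.classes[sums.index(max(sums))]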

    def train(self, data, labels):
        self.classes = sorted(set(labels))  # sorted for a deterministic class order
        num_doc = 0
        num_c = {c: 0 for c in self.classes}
        self.vocab = set()
        big_doc = {c: defaultdict(int) for c in self.classes}

        for doc, label in zip(data, labels):
            num_doc += 1
            num_c[label] += 1

            words = set(self._tokenize(doc)) if self.binary else self._tokenize(doc)
            for word in words:
                self.vocab.add(word)
                big_doc[label][word] += 1

        # Log prior: fraction of training documents in each class.
        self.log_prior = {c: math.log(num / num_doc) for c, num in num_c.items()}

        # Log likelihood with add-1 (Laplace) smoothing over the whole vocabulary.
        self.log_likelihood = {c: {} for c in self.classes}
        for c in self.classes:
            sum_den = sum(big_doc[c].values())
            for word in self.vocab:
                self.log_likelihood[c][word] = math.log((big_doc[c][word] + 1) / (sum_den + len(self.vocab)))

        self.trained = True

    def _tokenize(self, doc):
        puncts = set(',.!?\'"')
        if self.trained:
            words = [word for word in word_tokenize(doc) if word in self.vocab]
        else:
            words = [word for word in word_tokenize(doc) if word not in puncts]
        return words

In [3]:
docs = ['fun, couple, love, love',
        'fast, furious, shoot',
        'couple, fly, fast, fun, fun',
        'furious, shoot, shoot, fun',
        'fly, fast, shoot, love']
labels = ['comedy', 'action', 'comedy', 'action', 'action']

nb = NaiveBayes()
nb.train(docs, labels)

nb('fast, couple, shoot, fly')

'action'
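
As a cross-check that does not depend on NLTK, the same result can be reproduced with exact fractions. This is a minimal sketch, not the class above: the names `train`, `test_doc`, and `score` are introduced here purely for illustration, and the reviews are entered as pre-split word lists.

```python
from collections import Counter
from fractions import Fraction

# Training reviews from the exercise, already split into word lists.
train = [
    (["fun", "couple", "love", "love"], "comedy"),
    (["fast", "furious", "shoot"], "action"),
    (["couple", "fly", "fast", "fun", "fun"], "comedy"),
    (["furious", "shoot", "shoot", "fun"], "action"),
    (["fly", "fast", "shoot", "love"], "action"),
]
test_doc = ["fast", "couple", "shoot", "fly"]

# Vocabulary, per-class document counts, and per-class word counts.
vocab = {w for words, _ in train for w in words}
doc_counts = Counter(label for _, label in train)
word_counts = {c: Counter() for c in doc_counts}
for words, label in train:
    word_counts[label].update(words)


def score(c):
    """Prior times add-1 smoothed likelihoods, as an exact fraction."""
    total = sum(word_counts[c].values()) + len(vocab)
    result = Fraction(doc_counts[c], len(train))
    for w in test_doc:
        result *= Fraction(word_counts[c][w] + 1, total)
    return result


scores = {c: score(c) for c in doc_counts}
best = max(scores, key=scores.get)
print(scores, best)
```

The comedy score works out to 3/40960 (about 7.3e-5) and the action score to 1/5832 (about 1.7e-4), which agrees with the `'action'` result above.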

## 4.3
Train two models, multinomial naive Bayes and binarized naive Bayes, both with add-1 smoothing, on the following document counts for key sentiment words, with positive or negative class assigned as noted.

| doc | “good” | “poor” | “great” | (class) |
| --- | --- | --- | --- | --- |
| d1. | 3 | 0 | 3 | pos |
| d2. | 0 | 1 | 2 | pos |
| d3. | 1 | 3 | 0 | neg |
| d4. | 1 | 5 | 2 | neg |
| d5. | 0 | 2 | 0 | neg |

Use both naive Bayes models to assign a class (pos or neg) to this sentence:  
&emsp;*A good, good plot and great characters, but poor acting.*  
Recall from page 6 that with naive Bayes text classification, we simply ignore (throw out) any word that never occurred in the training documents. (We don’t throw out words that appear in some classes but not others; that’s what add-one smoothing is for.) Do the two models agree or disagree?

In [4]:
import pandas as pd

docs = [('good ' * 3 + 'poor ' * 0 + 'great ' * 3).strip(),
        ('good ' * 0 + 'poor ' * 1 + 'great ' * 2).strip(),
        ('good ' * 1 + 'poor ' * 3 + 'great ' * 0).strip(),
        ('good ' * 1 + 'poor ' * 5 + 'great ' * 2).strip(),
        ('good ' * 0 + 'poor ' * 2 + 'great ' * 0).strip()]
labels = ['pos', 'pos', 'neg', 'neg', 'neg']

sentence = 'A good, good plot and great characters, but poor acting.'

nb1 = NaiveBayes()
nb1.train(docs, labels)
result1 = nb1(sentence)

nb2 = NaiveBayes(binary=True)
nb2.train(docs, labels)
result2 = nb2(sentence)

pd.DataFrame(
    [['multinomial naive Bayes', result1], ['binarized naive Bayes', result2]],
    columns=('model', 'result')
)

| model | result |
| --- | --- |
| multinomial naive Bayes | pos |
| binarized naive Bayes | neg |
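
The two models disagree, and the numbers can be checked by hand. After discarding unknown words, the sentence reduces to *good, good, great, poor*. The sketch below assumes priors estimated from the document counts (P(pos) = 2/5, P(neg) = 3/5) and a vocabulary of size 3. For the binarized model, the repeated "good" in the test sentence is counted once, following the usual definition of binarized naive Bayes; counting it twice gives the same class.

```latex
\begin{align*}
\text{Multinomial:}\quad
\text{pos} &= \tfrac{2}{5} \times \left(\tfrac{4}{12}\right)^{2} \times \tfrac{6}{12} \times \tfrac{2}{12} \approx 0.0037 \\
\text{neg} &= \tfrac{3}{5} \times \left(\tfrac{3}{17}\right)^{2} \times \tfrac{3}{17} \times \tfrac{11}{17} \approx 0.0021 \\[6pt]
\text{Binarized:}\quad
\text{pos} &= \tfrac{2}{5} \times \tfrac{2}{7} \times \tfrac{3}{7} \times \tfrac{2}{7} \approx 0.0140 \\
\text{neg} &= \tfrac{3}{5} \times \tfrac{3}{9} \times \tfrac{2}{9} \times \tfrac{4}{9} \approx 0.0198
\end{align*}
```

In each product the factors are ordered as prior, "good", "great", "poor". The multinomial model favors pos and the binarized model favors neg.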
