# Week 5, part III

# Sentiment, affect, and connotation

<img src="images/_1.jpeg" width="50%">

# Bayesian inference

The final equation for the class chosen by a naive Bayes classifier (NBC) is thus:

\begin{equation}
c_{NB} = \underset{c \in C}{\operatorname{argmax}} \; P(c) \prod_{f \in F} P(f|c)
\end{equation}

with $c \in C$ and $f \in F$

As Jurafsky and Martin (2019) note, Naive Bayes calculations, like calculations for 
language modeling, are done in log space to avoid underflow and increase speed. Hence,
Eq. [6] can be expressed as follows [7]:

\begin{equation}
c_{NB} = \underset{c \in C}{\operatorname{argmax}} \; \log P(c) + \sum_{i} \log P(w_{i}|c)
\end{equation}

with $c \in C$ and $i \in V$, where $V$ is the dictionary of the corpus of text
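
To see why log space matters, here is a minimal sketch. The likelihood values are hypothetical (1,000 words, each with probability $10^{-5}$) and are chosen only to trigger underflow in double precision.

```python
import math

# Hypothetical likelihoods: 1,000 words, each with P(w|c) = 1e-5.
likelihoods = [1e-5] * 1000

# Multiplying the probabilities directly underflows to exactly 0.0.
product = 1.0
for p in likelihoods:
    product *= p

# Summing logarithms keeps the score finite, so classes stay comparable.
log_score = sum(math.log(p) for p in likelihoods)

print(product)
print(log_score)
```

Because the logarithm is monotonic, the class with the highest log score is also the class with the highest probability.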

# Training the NBC ― prior probabilities

For the document prior $P(c)$ we ask what percentage of the documents in our training 
set are in each class c.

Let $N_{c}$ be the number of documents in our training data with class $c$ and $N_{doc}$ be 
the total number of documents:

$\hat P(c) = \frac{N_{c}}{N_{doc}}$
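
The prior estimate above could be computed as follows. This is a sketch: the function name `estimate_priors` and the four toy labels are illustrative assumptions, not part of the lecture material.

```python
from collections import Counter
from fractions import Fraction


def estimate_priors(labels):
    """Return {class: N_c / N_doc} for a list of training labels."""
    counts = Counter(labels)   # N_c for every class c
    n_doc = len(labels)        # N_doc
    return {c: Fraction(n_c, n_doc) for c, n_c in counts.items()}


# Hypothetical training labels: three negative documents, one positive.
priors = estimate_priors(["-", "-", "-", "+"])
print(priors)
```

`Fraction` is used here only to keep the ratios exact; plain floating-point division would work equally well.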

# Training the NBC ― conditional distributions of features

To learn the probability $P(f_{i}|c)$, we'll assume a feature is just the existence of a 
word in the document's bag of words (Jurafsky & Martin, 2019).

$P(w_{i}|c)$ is the fraction of times the word $w_{i}$ appears among all words in all 
documents of class $c$:
+ we first concatenate all documents with class $c$
+ then, we use the frequency of $w_{i}$ in this concatenated document to give a maximum
  likelihood estimate of the probability:
  
  $\hat P(w_{i}|c) = \frac{|w_{i}, c|}{\sum_{w \in V} |w, c|}$
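
The two steps above can be sketched as below. The helper `mle_likelihoods`, the whitespace tokenization, and the two toy documents are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def mle_likelihoods(docs, labels):
    """Return {class: {word: count(word, class) / tokens in class}}."""
    counts = defaultdict(Counter)
    for doc, c in zip(docs, labels):
        # Adding up counts per class is equivalent to concatenating
        # all documents of that class and counting once.
        counts[c].update(doc.split())
    likelihoods = {}
    for c, word_counts in counts.items():
        total = sum(word_counts.values())
        likelihoods[c] = {w: n / total for w, n in word_counts.items()}
    return likelihoods


# Hypothetical toy corpus, tokenized on whitespace.
toy = mle_likelihoods(["good good film", "bad film"], ["+", "-"])
print(toy)
```

Note that "good" never occurs in the negative class here, so its unsmoothed estimate for that class would be zero.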

# Training the NBC ― caveats

**Problem:** since the NBC multiplies all the feature likelihoods $P(w_{i}|c)$, 
a zero probability for any word in a test set document will cause the 
probability of the class to be zero!

**Solutions** (not mutually exclusive):

+ add-one (Laplace) smoothing
+ thorough NLP pipeline dealing with rare or out-of-vocabulary (OOV) words

# Example

+ Setup:
  * sentiment analysis domain with the two classes:
    - positive (+) and 
    - negative (-)
+ Data:
  * training and test documents simplified from actual movie reviews
  
| Set      | Class | Documents                             |
|----------|-------|---------------------------------------|
| Training | -     | just plain boring                     |
|          | -     | entirely predictable and lacks energy | 
|          | -     | no surprise and very few laughs       |
|          | +     | very powerful                         |
|          | +     | the most fun film of the summer       |
| Test     | ?     | predictable with no fun               |

Source: Jurafsky and Martin (2019, p. 62)

# Example: computing priors

| Set      | Class | Documents                             |
|----------|-------|---------------------------------------|
| Training | -     | just plain boring                     |
|          | -     | entirely predictable and lacks energy | 
|          | -     | no surprise and very few laughs       |
|          | +     | very powerful                         |
|          | +     | the most fun film of the summer       |
| Test     | ?     | predictable with no fun               |

$P(-) = \frac{3}{5}$

$P(+) = \frac{2}{5}$

# Example: computing word likelihood

Design:

+ OOV words are filtered out
  * we don't consider 'with'
+ we use 'add-one' smoothing
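  * $\hat P(w_{i}|c) = \frac{|w_{i}, c| + 1}{\sum_{w \in V} (|w, c| + 1)} = \frac{|w_{i}, c| + 1}{\left(\sum_{w \in V} |w, c|\right) + |V|}$
  * here $|V| = 20$, with 14 tokens in class $-$ and 9 tokens in class $+$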

| Word        | $c = -$              | $c = +$            |
| ----------- | -------------------- | -------------------|
| predictable |  (1 + 1) / (14 + 20) | (0 + 1) / (9 + 20) |
| no          |  (1 + 1) / (14 + 20) | (0 + 1) / (9 + 20) |
| fun         |  (0 + 1) / (14 + 20) | (1 + 1) / (9 + 20) |
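
The table entries can be reproduced with a short sketch. It assumes whitespace tokenization and the add-one estimate (count + 1) / (tokens in class + $|V|$); the variable and function names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Training documents from the example table.
train = [
    ("-", "just plain boring"),
    ("-", "entirely predictable and lacks energy"),
    ("-", "no surprise and very few laughs"),
    ("+", "very powerful"),
    ("+", "the most fun film of the summer"),
]

# Word counts per class.
counts = {"-": Counter(), "+": Counter()}
for label, doc in train:
    counts[label].update(doc.split())

# Vocabulary: distinct words across both classes.
vocab = set(counts["-"]) | set(counts["+"])


def smoothed(word, label):
    """Add-one estimate: (count + 1) / (tokens in class + |V|)."""
    total = sum(counts[label].values())
    return Fraction(counts[label][word] + 1, total + len(vocab))


for word in ["predictable", "no", "fun"]:
    print(word, smoothed(word, "-"), smoothed(word, "+"))
```

`Fraction` prints reduced ratios, so an entry such as (1 + 1) / (14 + 20) appears in lowest terms.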


# Example: estimating doc-to-class affiliations 

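$P(-)P(s|-) = \frac{3}{5} \times \frac{2 \times 2 \times 1}{34^{3}} \approx 
6.1 \times 10^{-5}$

$P(+)P(s|+) = \frac{2}{5} \times \frac{1 \times 1 \times 2}{29^{3}} \approx 
3.3 \times 10^{-5}$

Since $P(-)P(s|-) > P(+)P(s|+)$, the model assigns the test document $s$ to the 
negative class.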
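
The whole example can be checked end to end with the sketch below. It scores the test document in log space, assuming whitespace tokenization, add-one smoothing, and OOV words dropped; the function name `log_score` is illustrative.

```python
import math
from collections import Counter

train = [
    ("-", "just plain boring"),
    ("-", "entirely predictable and lacks energy"),
    ("-", "no surprise and very few laughs"),
    ("+", "very powerful"),
    ("+", "the most fun film of the summer"),
]
test_doc = "predictable with no fun"

labels = [label for label, _ in train]
counts = {c: Counter() for c in set(labels)}
for label, doc in train:
    counts[label].update(doc.split())
vocab = set().union(*counts.values())


def log_score(doc, c):
    """log P(c) plus the sum of log add-one likelihoods, skipping OOV words."""
    total = sum(counts[c].values())
    score = math.log(labels.count(c) / len(labels))
    for w in doc.split():
        if w in vocab:  # 'with' is out of vocabulary and is dropped
            score += math.log((counts[c][w] + 1) / (total + len(vocab)))
    return score


scores = {c: log_score(test_doc, c) for c in counts}
prediction = max(scores, key=scores.get)
print(prediction, {c: math.exp(s) for c, s in scores.items()})
```

Exponentiating the log scores recovers the two products computed by hand, and the larger score picks the predicted class.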
