# Week 5

# ―Sentiment, affect, and connotation

<img src="images/_1.jpeg" width="50%">

# Why is sentiment analysis (SA) so popular?

+ Extracting audiences' sentiment is a task that recurs across many domains:
  * a review of a movie, book, or product on the web expresses the author’s 
    sentiment toward the product
  * an editorial or political text expresses sentiment toward a candidate or 
    political action
+ Online behavior offers favorable conditions for SA: 
  * social media exhibit sharp community structures (e.g., 
    [Schmidt et al. 2017][1]) 
  * hence, opposing views are fostered (e.g., [Ball et al. 2018][2])
  * bots increase exposure to negative and inflammatory content 
    ([Stella, Ferrara, & Domenico 2018][3])
  * the marginal costs of engaging in social sanctioning $\rightarrow$ 0:
    - online firestorms emerge frequently ([Rost, Stahel & Frey, 2016][4])
  * ... and the costs of being sanctioned $\rightarrow$ 0

  [1]: https://www.pnas.org/content/114/12/3035.short
  [2]: https://www.pnas.org/content/115/37/9216?mod=article_inline
  [3]: https://www.pnas.org/content/115/49/12435.short
  [4]: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0155923

# NLP and classification tasks

The simplest version of sentiment analysis is a binary classification task, and
the words of the review provide excellent cues (Jurafsky & Martin, 2019).

Consider, for example, the following phrases extracted from positive and negative 
reviews of movies and restaurants:

```text
...awesome caramel sauce and sweet toasty almonds. I love this place! 
...awful pizza and ridiculously overpriced...
```

Counting an offering's promoters and/or detractors implies classifying reviews
into discrete classes, such as 'positive' and 'negative'.

# Classifiers in a nutshell

The task of supervised classification is to take an input $x$ and a fixed
set of output classes $Y = \{y_{1}, y_{2}, \ldots, y_{M}\}$ and return a predicted 
class $y \in Y$.

There are two broad families of classifiers:
+ **Generative classifiers**, like naive Bayes, build a model of how a class   
  could generate some input data:
  * given an observation, they return the class most likely to have generated
    the observation
+ **Discriminative classifiers**, like logistic regression, learn what features 
  from the input are most useful to discriminate between the different possible 
  classes

Naive Bayes has been widely applied to SA ― let's have a closer look...

# Naive Bayes Classifier (NBC) 

+ So called because it is a Bayesian classifier that makes a simplifying (naive) assumption about how the features interact.
+ The intuition of the classifier is shown in the figure below:
  * text documents are represented as bag-of-words:
    - the position of words doesn't matter
    - what matters is the frequency of words in the document

<img src="images/_2.png" width="100%">

Source: Jurafsky & Martin (2019)
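
The bag-of-words idea can be sketched in a few lines of Python. This is only an illustration: the helper name `bag_of_words`, the whitespace tokenizer, and the two example sentences are assumptions, not part of the original material.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(text.lower().split())

# Two made-up documents containing the same words in a different order.
doc_a = "the film was fun and the cast was great"
doc_b = "great cast and the film was fun was the"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

print(bow_a["the"], bow_a["was"], bow_a["fun"])  # word frequencies
print(bow_a == bow_b)  # identical bags: position does not matter
```

Both documents map to the same bag, which is exactly the information the classifier discards when it ignores word positions.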

# Intuition of the Naive Bayes Classifier

For a document $d$, out of all classes $c \in C$, the classifier returns the 
class $\hat c$ which has the maximum posterior probability given the document:

\begin{equation}\label{eq:}
\hat c = \underset{c \in C}{\operatorname{argmax}} \; P(c|d)
\end{equation}

# Bayesian inference

This idea of Bayesian inference has been known since the work of Bayes* (1763),
and was first applied to text classification by Mosteller and Wallace (1964). 

The intuition of Bayesian classification is to use Bayes’ rule to transform Eq. 
[1] into other probabilities that have some useful properties. Bayes’ rule 
breaks down any conditional probability $P(x|y)$ into three other probabilities:

\begin{equation}
P(x|y) = \frac{P(y|x)P(x)}{P(y)}
\end{equation}

\* He's interred in Bunhill Fields Cemetery

<img src="images/_3.jpeg" width="40%">


# Bayesian inference (cont'd)

Eq. [2] can be substituted into Eq. [1] to get Eq. [3]:

\begin{equation}
\hat c = \underset{c \in C}{\operatorname{argmax}} \; \frac{P(d|c)P(c)}{P(d)}
\end{equation}

Since $P(d)$ is the same for every class, it can be dropped from Eq. [3], which 
simplifies to:

\begin{equation}
\hat c = \underset{c \in C}{\operatorname{argmax}} \; P(d|c)P(c)
\end{equation}


# Bayesian inference (cont'd)

Eq. [4] contains two probabilities:

+ $P(c)$ is the prior probability of the class $c$
+ $P(d|c)$ is the likelihood of the document; representing the document as a 
  set of features $f_{1}, f_{2}, \ldots, f_{n}$, Eq. [4] can also be expressed as:
  
  \begin{equation}
  \hat c = \underset{c \in C}{\operatorname{argmax}} \; P(f_{1}, f_{2}, \ldots, f_{n}|c)P(c)
  \end{equation}
  
In practice, Eq. [5] cannot be estimated directly:
+ estimating the probability of every possible combination of features would require:
  * huge numbers of parameters
  * impossibly large training sets
+ NBCs therefore make two simplifying assumptions:
  * 'bag of words' assumption ― the order of words doesn't matter
    - the vector of features $F$ encodes word identities not positions
  * 'naive Bayes assumption': the features $f_{i}$ are conditionally independent
    given the class $c$, so their probabilities $P(f_{i}|c)$ can be multiplied:
    
    $P(f_{1}, f_{2}, \ldots, f_{n}|c) = P(f_{1}|c) \cdot P(f_{2}|c) \cdot \ldots 
      \cdot P(f_{n}|c)$
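
As a rough illustration of what the independence assumption saves, assume (hypothetically) that each of the $n$ features is binary. The full joint distribution $P(f_{1}, f_{2}, \ldots, f_{n}|c)$ then has $2^{n} - 1$ free parameters per class, whereas the factorized form needs only $n$:

$$2^{n} - 1 \quad \text{vs.} \quad n, \qquad \text{e.g., for } n = 30: \quad 1{,}073{,}741{,}823 \quad \text{vs.} \quad 30$$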

# Bayesian inference (cont'd)

The final equation for the class chosen by an NBC is thus:

\begin{equation}
c_{NB} = \underset{c \in C}{\operatorname{argmax}} \; P(c) \prod_{f \in F} P(f|c)
\end{equation}

As Jurafsky and Martin (2019) note, Naive Bayes calculations, like calculations for 
language modeling, are done in log space to avoid underflow and increase speed. Hence,
Eq. [6] is generally expressed as Eq. [7]:

\begin{equation}
c_{NB} = \underset{c \in C}{\operatorname{argmax}} \; \log P(c) + \sum_{i} \log P(w_{i}|c)
\end{equation}

where $i$ ranges over the word positions in the document and each $w_{i}$ belongs 
to $V$, the vocabulary of the text corpus
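
A small sketch of why log space matters. The numbers are made up for illustration: a hypothetical 400-word document whose words each have likelihood $10^{-4}$ under some class, and a prior of 0.5.

```python
import math

# Hypothetical per-word likelihoods P(w_i|c) for a 400-word document.
word_probs = [1e-4] * 400
prior = 0.5

# Multiplying raw probabilities: 1e-1600 is far below the smallest
# representable double, so the product underflows to exactly 0.0.
product_score = prior * math.prod(word_probs)

# Summing logs keeps the score finite; because log is monotonic,
# the ranking of classes is unchanged.
log_score = math.log(prior) + sum(math.log(p) for p in word_probs)

print(product_score)  # 0.0
print(log_score)      # a large negative, but finite, number
```

With the raw product every class would score 0.0 and the argmax would be meaningless; the log score remains usable for comparing classes.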

# Training the NBC ― prior probabilities

For the class prior $P(c)$ we ask what percentage of the documents in our training 
set are in each class $c$.

Let $N_{c}$ be the number of documents in our training data with class $c$ and $N_{doc}$ be 
the total number of documents:

$\hat P(c) = \frac{N_{c}}{N_{doc}}$
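
A minimal sketch of the prior estimate, assuming the training labels are available as a plain list. The function name `class_priors` and the four labels are hypothetical.

```python
from collections import Counter

def class_priors(labels):
    """Estimate P(c) as N_c / N_doc for every class c seen in the labels."""
    n_doc = len(labels)
    return {c: n_c / n_doc for c, n_c in Counter(labels).items()}

# Made-up labels for four training documents.
labels = ["+", "-", "+", "+"]
priors = class_priors(labels)
print(priors)
```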

# Training the NBC ― conditional distributions of features

To learn the probability $P(f_{i}|c)$, we'll assume a feature is just the existence of a 
word in the document's bag of words (Jurafsky & Martin, 2019).

$P(w_{i}|c)$ is the fraction of times the word $w_{i}$ appears among all words in all 
documents of class $c$:
+ we first concatenate all documents with class $c$
+ then, we use the frequency of $w_{i}$ in this concatenated document to give a maximum
  likelihood estimate of the probability:
  
  $\hat P(w_{i}|c) = \frac{|w_{i}, c|}{\sum_{w \in V} |w, c|}$
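
The two steps above can be sketched as follows. This is an unsmoothed estimate on three made-up documents; the function name `mle_likelihoods` and the whitespace tokenizer are assumptions.

```python
from collections import Counter, defaultdict

def mle_likelihoods(docs):
    """docs: list of (text, label) pairs.
    Returns {label: {word: P(word|label)}} using raw relative frequencies."""
    counts = defaultdict(Counter)
    for text, label in docs:
        # Updating one Counter per class is equivalent to concatenating
        # all documents of that class and counting words once.
        counts[label].update(text.lower().split())
    return {
        label: {w: n / sum(c.values()) for w, n in c.items()}
        for label, c in counts.items()
    }

# Hypothetical toy training set.
docs = [("good fun film", "+"), ("good acting", "+"), ("dull film", "-")]
likelihoods = mle_likelihoods(docs)
print(likelihoods["+"]["good"])      # 2 of the 5 words in class '+'
print("dull" in likelihoods["+"])    # False: 'dull' never occurs in class '+'
```

Words that never occur in a class are simply missing from its table, i.e., their estimated probability is zero.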

# Training the NBC ― caveats

**Problem:** since NBC multiplies the likelihoods $P(w_{i}|c)$ of all features, 
a zero probability for any word in a test set document will cause the 
probability of the whole class to be zero!

**Solutions** (not mutually exclusive):

+ add-one (Laplace) smoothing
+ a thorough NLP pipeline that deals with rare or out-of-vocabulary (OOV) words
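
A sketch of add-one smoothing, which adds 1 to every word count and therefore adds the vocabulary size $|V|$ to the denominator. The counts and the five-word vocabulary below are made up, and `smoothed_likelihood` is a hypothetical helper name.

```python
from collections import Counter

def smoothed_likelihood(word, class_counts, vocab):
    """Add-one estimate: (count(w, c) + 1) / (total words in c + |V|)."""
    total = sum(class_counts[w] for w in vocab)
    return (class_counts[word] + 1) / (total + len(vocab))

# Hypothetical word counts for the positive class; 'dull' was never seen in it.
pos_counts = Counter("good fun film good acting".split())
vocab = {"good", "fun", "film", "acting", "dull"}

p_dull = smoothed_likelihood("dull", pos_counts, vocab)
p_good = smoothed_likelihood("good", pos_counts, vocab)
print(p_dull, p_good)  # the unseen word now gets a small, non-zero probability
```

The smoothed estimates still sum to one over the vocabulary, so they remain a valid probability distribution.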

# Example

+ Setup:
  * sentiment analysis domain with the two classes:
    - positive (+) and 
    - negative (-)
+ Data:
  * training and test documents simplified from actual movie reviews
  
| Set      | Class | Documents                             |
|----------|-------|---------------------------------------|
| Training | -     | just plain boring                     |
|          | -     | entirely predictable and lacks energy | 
|          | -     | no surprise and very few laughs       |
|          | +     | very powerful                         |
|          | +     | the most fun film of the summer       |
| Test     | ?     | predictable with no fun               |

Source: Jurafsky and Martin (2019, p. 62)

# Example: computing priors

| Set      | Class | Documents                             |
|----------|-------|---------------------------------------|
| Training | -     | just plain boring                     |
|          | -     | entirely predictable and lacks energy | 
|          | -     | no surprise and very few laughs       |
|          | +     | very powerful                         |
|          | +     | the most fun film of the summer       |
| Test     | ?     | predictable with no fun               |

$P(-) = \frac{3}{5}$

$P(+) = \frac{2}{5}$

# Example: computing word likelihood

Design:

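+ out-of-vocabulary (OOV) words are filtered out
  * we don't consider 'with', which never occurs in the training set
+ we use 'add-one' smoothing
  * $\hat P(w_{i}|c) = \frac{|w_{i}, c| + 1}{\sum_{w \in V} (|w, c| + 1)} = \frac{|w_{i}, c| + 1}{\left(\sum_{w \in V} |w, c|\right) + |V|}$, with $|V| = 20$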

| Word        | $c = -$              | $c = +$            |
| ----------- | -------------------- | -------------------|
| predictable |  (1 + 1) / (14 + 20) | (0 + 1) / (9 + 20) |
| no          |  (1 + 1) / (14 + 20) | (0 + 1) / (9 + 20) |
| fun         |  (0 + 1) / (14 + 20) | (1 + 1) / (9 + 20) |
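
The numbers in the table (14 and 9 tokens per class, plus a vocabulary of 20 word types) can be recovered from the training documents. The sketch below assumes whitespace tokenization; `add_one` is a hypothetical helper that returns the numerator and denominator of each table cell.

```python
from collections import Counter

train = [
    ("just plain boring", "-"),
    ("entirely predictable and lacks energy", "-"),
    ("no surprise and very few laughs", "-"),
    ("very powerful", "+"),
    ("the most fun film of the summer", "+"),
]

# Word counts per class.
counts = {"-": Counter(), "+": Counter()}
for text, label in train:
    counts[label].update(text.split())

vocab = set(counts["-"]) | set(counts["+"])  # word types across both classes
n_neg = sum(counts["-"].values())            # tokens in the negative class
n_pos = sum(counts["+"].values())            # tokens in the positive class

def add_one(word, label):
    """Return (numerator, denominator) of the add-one estimate of P(word|label)."""
    return counts[label][word] + 1, sum(counts[label].values()) + len(vocab)

print(len(vocab), n_neg, n_pos)
for w in ["predictable", "no", "fun"]:
    print(w, add_one(w, "-"), add_one(w, "+"))
```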


# Example: estimating doc-to-class affiliations 

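$P(-)P(s|-) = \frac{3}{5} \times \frac{2 \times 2 \times 1}{34^{3}} \approx 
6.1 \times 10^{-5}$

$P(+)P(s|+) = \frac{2}{5} \times \frac{1 \times 1 \times 2}{29^{3}} \approx 
3.3 \times 10^{-5}$

Since $P(-)P(s|-) > P(+)P(s|+)$, the model assigns the test sentence $s$ to the negative class.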
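
The same comparison can be sketched in log space, reusing the priors and the smoothed likelihoods from the previous slides. The function name `log_score` is an assumption; the numbers are those of the worked example.

```python
import math

priors = {"-": 3 / 5, "+": 2 / 5}
likelihoods = {
    "-": {"predictable": 2 / 34, "no": 2 / 34, "fun": 1 / 34},
    "+": {"predictable": 1 / 29, "no": 1 / 29, "fun": 2 / 29},
}
test_doc = "predictable with no fun".split()

def log_score(c):
    """log P(c) plus the sum of log P(w|c) over the known words of the test document."""
    known = [w for w in test_doc if w in likelihoods[c]]  # drops the OOV word 'with'
    return math.log(priors[c]) + sum(math.log(likelihoods[c][w]) for w in known)

scores = {c: log_score(c) for c in priors}
predicted = max(scores, key=scores.get)

for c, s in scores.items():
    print(c, s, math.exp(s))  # exp() recovers the product form of the score
print(predicted)  # '-'
```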


# Over and beyond SA...

SA focuses on a very specific portion of a broader gamut of affective states.

Prior literature (Scherer, 2000) suggests the existence of several families of affective states that can be investigated with NLP theories and tools:

+ **emotion:** relatively brief episode of response to the evaluation of an 
  external or internal event as being of major significance (angry, sad, 
  joyful, fearful, ashamed, proud, elated, desperate)
+ **mood:** diffuse affect state, most pronounced as change in subjective feeling, 
  of low intensity but relatively long duration, often without apparent cause
  (cheerful, gloomy, supportive, contemptuous, friendly)
 

# Over and beyond SA...(cont'd)

+ **interpersonal stance:** affective stance taken toward another person in a 
  specific interaction, coloring the interpersonal exchange in a situation
  (distant, cold, warm, supportive, contemptuous, friendly)
+ **attitudes:** relatively enduring, affectively colored beliefs, preferences, and
  predispositions towards objects or persons (liking, loving, hating, valuing,
  desiring)
+ **personality traits:** emotionally laden, stable personality dispositions and
  behavior tendencies, typical for a person (nervous, anxious, reckless,   
  hostile, jealous)
 

# Affect lexicons

+ the study of affect states via NLP draws upon affect lexicons
+ the key premise of affect lexicons is that words have affect 
  meanings ― i.e., they have connotations
+ affect lexicons can be distinguished along several dimensions:
  * polar lexicons vs. cluster-based lexicons
  * human-annotated/supervised lexicons vs. semi-supervised lexicons

# Example of polar, human-annotated affect lexicon

<img src="images/_5.png" width="70%">

# Example of cluster-based, human-annotated affect lexicon

<img src="images/_6.png" width="60%">

# Semi-supervised induction of affect lexicons

There are two popular alternatives to create affect lexicons using a 
semi-supervised approach:
    
+ semantic axis methods
+ label propagation

# Semantic Axis Method (SAM)

The semantic axis method builds on five steps:

1. fixing seed words pole by pole (or aspect by aspect), a hand-curated task
2. getting the embedding for each word
3. getting the embedding for each pole (or aspect)
4. getting the semantic axis
5. locating the position of words along the semantic axis

# SAM: Step 1

Based on target affect states or poles (e.g., 'good' and 'bad'), a list
of seed words is compiled.

**Caveat**: word meanings are relational in nature:

+ meanings change based on the linguistic context for a lexical item
+ meanings change across space and time
+ thus, make sure your seed words 'make sense' given the dataset at hand

<img src="images/_7.png" width="50%">

# SAM: Step 2 ― getting word embeddings

Each target word has to be associated with a vector. Mainly, there are two
alternatives:

+ using off-the-shelf embeddings 
+ building your own embedding (desirable for idiosyncratic text corpora)

# SAM: Step 3 ― getting pole (aspect) embeddings

Once word vectors are available, we need to create embeddings for the poles (aspects).

This can be achieved by taking the centroid of the word vectors associated with
each pole (or state).

Let's consider two sets of word vectors:

$S^{+} = \{E(w_{1}^{+}), E(w_{2}^{+}), \ldots, E(w_{n}^{+})\}$

and

$S^{-} = \{E(w_{1}^{-}), E(w_{2}^{-}), \ldots, E(w_{m}^{-})\}$

where $S^{+}$ is the set of embeddings of, say, positive seed words,
and $S^{-}$ is the set of embeddings of negative seed words.

The embeddings of the positive and the negative poles are, respectively:

$V^{+} = \frac{1}{n} \sum_{i = 1}^{n} E(w_{i}^{+})$

$V^{-} = \frac{1}{m} \sum_{i = 1}^{m} E(w_{i}^{-})$

# SAM: Step 4 ― creating a semantic axis

The semantic axis can be computed by subtracting the vectors of the poles 
(aspects) as follows:

$V_{axis} = V^{+} - V^{-}$

# SAM: Step 5 ― locating word positions along the semantic axis

The position of a target word $w$ along the semantic axis can be computed
(for example) as the cosine similarity of $w$ and $V_{axis}$:

$\text{score}(w) = \frac{E(w) \cdot V_{axis}}{\|E(w)\| \, \|V_{axis}\|}$
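
Steps 3 to 5 can be sketched in plain Python. The 2-D embeddings below are made up for illustration only (real embeddings have hundreds of dimensions and would come from an off-the-shelf or custom model); the names `E`, `centroid`, `cosine`, and `score` are assumptions.

```python
import math

# Hypothetical 2-D word embeddings.
E = {
    "good": [1.0, 0.2], "great": [0.8, 0.4],   # positive seed words
    "bad": [-1.0, 0.1], "awful": [-0.8, 0.5],  # negative seed words
    "lovely": [0.9, 0.3], "dreadful": [-0.7, 0.2],  # target words
}

def centroid(words):
    """Step 3: pole embedding as the mean of the seed-word vectors."""
    vecs = [E[w] for w in words]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

v_pos = centroid(["good", "great"])
v_neg = centroid(["bad", "awful"])

# Step 4: the semantic axis is the difference between the two pole vectors.
v_axis = [p - n for p, n in zip(v_pos, v_neg)]

def score(word):
    """Step 5: position of a word along the axis (cosine similarity)."""
    return cosine(E[word], v_axis)

print(score("lovely"), score("dreadful"))
```

Under these toy vectors, a positive score places a word closer to the positive pole and a negative score closer to the negative pole.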

# Label propagation

<img src="images/_8.png" width="80%">