In [13]:
import re
from collections import Counter

# For the example later
example_text = open("review_polarity/txt_sentoken/pos/cv750_10180.txt").read()
bag_of_words=Counter({'and': 37, 'is': 26, 'he': 11, 'great': 10, 'carlito': 9, 'film': 8, 'but': 8, 'some': 7, 'pacino': 7, "carlito's": 7, 'palma': 5, 'well': 5, 'like': 5,  'woman': 4, 'amazing': 4, 'bias':1}) 

<center>
<h2>Text classification<br> with the perceptron</h2>
<p style="text-align:center">
Natural Language Processing<br>
(COM4513/6513)<br>
<br>
<a href="http://andreasvlachos.github.io">Andreas Vlachos</a><br>
a.vlachos@sheffield.ac.uk<br>
<small>Department of Computer Science<br>
University of Sheffield
</small>
</p>
</center>

<h3>Text classification</h3>

A very common problem in NLP:
<center>
<p style="border:3px; width: 500px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 1em;">
<i>Given a piece of text, assign a label from a predefined set</i>
</p>
</center>

<b>What could the labels be?</b>

<ul>
<li>positive vs negative (e.g. sentiment in reviews)</li>
<li>about world politics or not</li>
<li>author name (author identification)</li>
<li>pass or fail in essay grading</li>
</ul>

### In this lecture

We will see how to:
- represent documents as vectors
- learn a classifier using the perceptron rule

Ready for Lab in Week 4!

### Sentiment analysis on film reviews

<img src="images/imdb.jpg" style="width:100%;">

### Representing text

In [7]:
print(re.sub(r"[^\w']", " ", example_text).split()[:100])

["what's", 'shocking', 'about', "carlito's", 'way', 'is', 'how', 'good', 'it', 'is', 'having', 'gotten', 'a', 'bit', 'of', 'a', 'bad', 'rap', 'for', 'not', 'being', 'a', 'big', 'box', 'office', 'hit', 'like', "pacino's", 'previous', 'film', 'scent', 'of', 'a', 'woman', 'and', 'not', 'having', 'as', 'strong', 'a', 'performance', 'as', 'he', 'did', 'in', 'that', 'one', 'he', 'had', 'just', 'won', 'an', 'oscar', "carlito's", 'way', 'was', 'destined', 'for', 'underrated', 'heaven', "that's", 'what', 'it', 'is', 'an', 'underrated', 'gem', 'of', 'a', 'movie', 'and', 'what', 'a', 'shame', 'because', 'pacino', 'and', 'de', 'palma', 'both', 'do', 'amazing', 'jobs', 'with', 'it', 'and', 'turn', 'it', 'into', 'a', 'great', 'piece', 'of', 'a', 'pulpy', 'character', 'study', "carlito's", 'way', 'deals']


Ideas?

Let's represent text with vectors. Why?

That's what machine learning algorithms take as input

### Counting words

In [9]:
dictionary = Counter(re.sub(r"[^\w']", " ", example_text).split()[:100])
print(dictionary)

Counter({'a': 9, 'it': 4, 'and': 4, 'of': 4, 'is': 3, 'way': 3, "carlito's": 3, 'for': 2, 'he': 2, 'an': 2, 'what': 2, 'having': 2, 'underrated': 2, 'as': 2, 'not': 2, 'scent': 1, 'office': 1, 'previous': 1, 'movie': 1, 'amazing': 1, 'de': 1, 'one': 1, 'do': 1, 'that': 1, 'turn': 1, 'strong': 1, 'pacino': 1, "what's": 1, 'performance': 1, 'just': 1, 'being': 1, 'piece': 1, 'had': 1, 'into': 1, 'destined': 1, 'character': 1, 'pulpy': 1, 'hit': 1, 'how': 1, 'great': 1, 'won': 1, 'film': 1, 'bit': 1, 'palma': 1, 'woman': 1, 'shame': 1, 'like': 1, 'good': 1, "that's": 1, 'gem': 1, 'big': 1, 'rap': 1, 'gotten': 1, 'bad': 1, 'oscar': 1, 'did': 1, 'both': 1, 'box': 1, 'because': 1, 'study': 1, 'was': 1, 'with': 1, "pacino's": 1, 'heaven': 1, 'deals': 1, 'in': 1, 'shocking': 1, 'jobs': 1, 'about': 1})


### Bag of words representation

- The higher the counts for a word, the more important it is
- No document contains every word; most words have a count of 0 (implicitly)
- For a given vocabulary, every document is represented by a vector of the same length

Anything missing?

- which words to keep?
- how to value their presence/absence?
- word order is ignored, could we add bigrams?

Choice of representation (features) matters a lot!
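
As a rough illustration of these choices, the sketch below builds a bag of words that optionally includes bigrams. The function name `extract_features`, the `use_bigrams` flag and the lowercasing step are assumptions made for this example, not part of the lecture code.

```python
import re
from collections import Counter

def extract_features(text, use_bigrams=True):
    """Return a Counter of unigram (and optionally bigram) counts."""
    # Same tokenisation as above; lowercasing is an added assumption.
    tokens = re.sub(r"[^\w']", " ", text.lower()).split()
    features = Counter(tokens)
    if use_bigrams:
        # Pair each token with its successor to retain some word order.
        features.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return features

features = extract_features("not good, not good at all")
print(features["not"], features["not good"], features["at all"])
```

With bigrams, "not good" becomes a feature of its own, which unigram counts alone cannot distinguish from "good".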

### Our first classifier

We represent a document as counts over words/features, $\mathbf{x} \in \Re^N$.

How to predict if it has positive $(y=1)$ or negative $(y=-1)$ sentiment?

If each word $n$ has counts $x_n$ in the review and is associated with a weight ($w_n$), then:

$$\hat y = sign(\sum_{n=1}^N w_nx_n) = sign(\mathbf{w} \cdot \mathbf{x})$$

In [14]:
print(bag_of_words)

Counter({'and': 37, 'is': 26, 'he': 11, 'great': 10, 'carlito': 9, 'but': 8, 'film': 8, "carlito's": 7, 'pacino': 7, 'some': 7, 'palma': 5, 'like': 5, 'well': 5, 'woman': 4, 'amazing': 4, 'bias': 1})


In [15]:
weights = {'and': 0.0, 'is': 0.0, 'he': 0.0, 'great': 0.0,
           'carlito': 0.0, 'but': 0.0, 'film': 0.0, 'some': 0.0,
           "carlito's": 0.0, 'pacino': 0.0, 'like': 0.0,
           'palma': 0.0, 'well': 0.0, 'amazing': 0.0, 'woman': 0.0, 'bias': 0.0}

In [16]:
score = 0.0
for word, counts in bag_of_words.items():
    score += counts * weights[word]
print(score)
print("positive" if score >= 0.0 else "negative")

0.0
positive


<h3>Another view</h3>
<a href="https://blog.dbrgn.ch/2013/3/26/perceptrons-in-python/"><img src="images/perceptron.png" style="width:600px; background:none; border:none; box-shadow:none;" /></a>
<p class="fragment">
How to learn the weights $\mathbf{w}$?
</p>

<h3>The perceptron</h3>

<p><img style="float: left; width:40%" src="images/colorfulperceptron.jpg"/><img src="images/Rosenblatt-CAL1958.jpg" style="width:35%; float: right;"/>
</p>

<p>Proposed by Rosenblatt in 1958 and still in use by researchers</p>

### Supervised learning

Given training documents with the correct labels

$$D_{train} = \{(\mathbf{x}^1,y^1)...(\mathbf{x}^M,y^M)\}$$

Find the weights $\mathbf{w}$ for the linear classifier

$$\hat y = sign(\sum_{n=1}^N w_nx_n) = sign(\mathbf{w} \cdot \mathbf{x})$$

so that we can predict the labels of **unseen** documents


### Supervised learning


<img src="images/supervisedMLbyRaschka.jpg" style="width:100%;">

<h3>Learning with the perceptron</h3>
<p style="font-size: 100%; border:3px; width: 90%; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em;">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,y^1)...(\mathbf{x}^M,y^M)\}\\
& set\; \mathbf{w} = \mathbf{0} \\
& \mathbf{for} \; (\mathbf{x},y) \in D_{train} \; \mathbf{do}\\
& \quad predict  \; \hat y = sign(\mathbf{w}\cdot \phi(\mathbf{x}))\\
& \quad \mathbf{if} \; \hat y \neq y \; \mathbf{then}\\
& \quad \quad \mathbf{if} \; y\; \mathbf{is}\; 1 \; \mathbf{then}\\
& \quad \quad \quad update \; \mathbf{w} = \mathbf{w} + \phi(\mathbf{x})\\
& \quad \quad \mathbf{else}\\
& \quad \quad \quad update \; \mathbf{w} = \mathbf{w} - \phi(\mathbf{x})\\
& \mathbf{return} \; \mathbf{w}
\end{align}
</p>

<ul class="fragment">
<li>error-driven, online learning</li>
<li>$x$ is the document; $\phi(x)$ is its feature representation (bag of words, bigrams, etc.)</li>
</ul>
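
A minimal Python sketch of this learning rule, assuming each document is already a `Counter` of feature counts (including a `bias` feature) and labels are +1 or -1. The names `predict`, `train_perceptron` and `toy_data` are illustrative only. The update is written compactly as $\mathbf{w} + y\phi(\mathbf{x})$, and a score of exactly 0 is predicted as positive, as in the scoring cell earlier.

```python
from collections import Counter

def predict(weights, features):
    """Return +1 if the weighted sum of feature counts is >= 0, else -1."""
    score = sum(count * weights.get(name, 0.0) for name, count in features.items())
    return 1 if score >= 0.0 else -1

def train_perceptron(train_data):
    """One pass over (features, label) pairs, updating only on mistakes."""
    weights = {}
    for features, label in train_data:
        if predict(weights, features) != label:
            # Add the features for a positive label, subtract them for a negative one.
            for name, count in features.items():
                weights[name] = weights.get(name, 0.0) + label * count
    return weights

# Hypothetical two-document training set.
toy_data = [
    (Counter({"boring": 1, "film": 1, "bias": 1}), -1),
    (Counter({"great": 2, "film": 1, "bias": 1}), 1),
]
weights = train_perceptron(toy_data)
print(weights)
```

On this toy data both documents are misclassified when first seen, so each triggers one update; the shared features `film` and `bias` end up with weight 0 because they appear once with each label.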

### A little test

Given the following tweets labeled with sentiment:

| Label        | Tweet | 
| -------------|--------|
| negative     | Very sad about Iran. |
| negative     | No Sat off...Need to work 6 days a week. |
| negative     | I’m a sad panda today.|
| positive     | such a beautiful satisfying day of bargain shopping. loves it. |
| positive     | who else is in a happy mood?? |
| positive     | actually quite happy today. |

What features would the perceptron find indicative of positive/negative class?

Would they generalize to unseen test data?

### Sparsity and the bias

In NLP, no matter how large our training dataset, we will never see (enough of) all the words/features.
- features unseen in training are ignored in testing
- there are ways to ameliorate this issue (e.g. word clusters), but it never goes away
- there will be texts containing only unseen words

Bias: a feature that appears in every instance
- its value is hardcoded to 1
- it is the 1 in the perceptron diagram
- its weight effectively learns to predict the majority class
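
The sketch below illustrates both points under some assumptions: the weights are made up (as if learned from mostly negative data), and `score` and `add_bias` are hypothetical helper names. Features missing from the weights contribute nothing, so a text made up only of unseen words is scored by the bias weight alone.

```python
from collections import Counter

def score(weights, features):
    """Dot product in which features absent from `weights` contribute 0."""
    return sum(count * weights.get(name, 0.0) for name, count in features.items())

def add_bias(features):
    """Return a copy of the features with the always-on bias feature set to 1."""
    with_bias = Counter(features)
    with_bias["bias"] = 1
    return with_bias

# Made-up weights for illustration only.
weights = {"great": 2.0, "boring": -1.5, "bias": -0.5}

# A document containing only words never seen in training.
unseen_only = add_bias(Counter({"zeitgeisty": 1, "oeuvre": 2}))
print(score(weights, unseen_only))
```

Here the score equals the bias weight, so the prediction falls back to whichever class the bias favours.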

<h3>3 tricks for better perceptrons</h3>
<span style="font-size: 100%; color:blue">averaging</span>, <span style="font-size: 100%; color:green">multiple passes</span>, <span style="font-size: 100%; color:red">shuffling</span>
<p style="border:3px; width:900px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em; font-size: 100%;">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(x^1,y^1)...(x^M,y^M)\}\\
& set \; \mathbf{w}_{\color{blue}{0}} = \mathbf{0}; \color{blue}{c = 1} \\
& \color{green}{\mathbf{for} \; i=1 \; \mathbf{to} \; maxIter \; \mathbf{do}}\\
& \quad \color{red}{shuffle(D_{train})}\\					
& \quad \mathbf{for} \; (x,y) \in D_{train} \; \mathbf{do}\\
& \quad \quad predict  \; \hat y = sign(\mathbf{w}_{\color{blue}{c-1}}\cdot \phi(x))\\
& \quad \quad \mathbf{if} \; \hat y \neq y \; \mathbf{then}\\
& \quad \quad \quad update \; \mathbf{w}_{\color{blue}{c}} = \mathbf{w}_{\color{blue}{c-1}} + y\phi(x) \\
& \quad \quad \mathbf{else}\\
& \quad \quad \quad \mathbf{w}_{\color{blue}{c}} = \mathbf{w}_{\color{blue}{c-1}} \\
& \quad \quad \color{blue}{c = c + 1} \\
& \mathbf{return} \; \color{blue}{\frac{1}{c-1}\sum_{i=1}^{c-1} \mathbf{w}_i}
\end{align}
</p>
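
A possible Python rendering of the three tricks, assuming documents are `Counter`s of feature counts and labels are +1 or -1. The names `train_averaged_perceptron`, `max_iter` and `seed` are choices made for this sketch. It averages the weight vector held after each processed instance in the most direct way, which is simple but slow for large vocabularies; practical implementations usually keep running totals more cleverly.

```python
import random
from collections import Counter, defaultdict

def predict(weights, features):
    """Return +1 if the weighted sum of feature counts is >= 0, else -1."""
    score = sum(count * weights.get(name, 0.0) for name, count in features.items())
    return 1 if score >= 0.0 else -1

def train_averaged_perceptron(train_data, max_iter=10, seed=0):
    """Perceptron with multiple passes, shuffling and weight averaging."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    data = list(train_data)            # copy, so the caller's list is not reordered
    weights = defaultdict(float)
    weight_sums = defaultdict(float)   # running sum of the weights after each instance
    steps = 0
    for _ in range(max_iter):          # multiple passes
        rng.shuffle(data)              # shuffling
        for features, label in data:
            if predict(weights, features) != label:
                for name, count in features.items():
                    weights[name] += label * count
            # Averaging: accumulate the current weights after every instance.
            for name, value in weights.items():
                weight_sums[name] += value
            steps += 1
    return {name: total / steps for name, total in weight_sums.items()}

# Hypothetical two-document training set.
toy_data = [
    (Counter({"boring": 1, "film": 1, "bias": 1}), -1),
    (Counter({"great": 2, "film": 1, "bias": 1}), 1),
]
averaged = train_averaged_perceptron(toy_data)
print(averaged)
```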

### Binary to multiclass

A vector of weights per label $y \in \cal Y$:

$$\hat y = \mathop{\arg \max}\limits_{y \in \cal Y} (\mathbf{w}^y \cdot \phi(x))$$

Update rule:

\begin{align}
&\mathbf{if} \; \hat y \neq y \; \mathbf{then}\\
&\quad \quad update \; \mathbf{w^y} = \mathbf{w^y} + \phi(\mathbf{x})\\
&\quad \quad update \; \mathbf{w^{\hat y}} = \mathbf{w^{\hat y}} - \phi(\mathbf{x})\\
\end{align}

Equivalently, make label-specific representations:

$$\hat y = \mathop{\arg \max}\limits_{y \in \cal Y} (\mathbf{w} \cdot \phi(x,y))$$
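
A small sketch of the per-label variant, assuming one weight dictionary per label and documents given as `Counter`s. The names `predict_label` and `train_multiclass`, the label set and the toy documents are all invented for illustration. Ties are broken in favour of the label listed first.

```python
from collections import Counter, defaultdict

def predict_label(weights, features, labels):
    """Return the label whose weight vector scores the features highest."""
    def score(label):
        return sum(count * weights[label].get(name, 0.0)
                   for name, count in features.items())
    return max(labels, key=score)  # on ties, the earliest label in `labels` wins

def train_multiclass(train_data, labels, max_iter=5):
    """Multiclass perceptron: reward the correct label, penalise the predicted one."""
    weights = {label: defaultdict(float) for label in labels}
    for _ in range(max_iter):
        for features, gold in train_data:
            guess = predict_label(weights, features, labels)
            if guess != gold:
                for name, count in features.items():
                    weights[gold][name] += count
                    weights[guess][name] -= count
    return weights

# Hypothetical topic-classification data.
labels = ["politics", "sport", "film"]
train_data = [
    (Counter({"election": 1, "vote": 1}), "politics"),
    (Counter({"match": 1, "goal": 1}), "sport"),
    (Counter({"actor": 1, "film": 1}), "film"),
]
weights = train_multiclass(train_data, labels)
print([predict_label(weights, f, labels) for f, _ in train_data])
```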

### Evaluation

The standard way to evaluate our classifier is:

$$ Accuracy = \frac{correctLabels}{allInstances}$$

What could go wrong?

When one class is much more common than the other, predicting it always
gives high accuracy.

### Evaluation

| Predicted/Correct | MinorityClass | MajorityClass |
| ----------------- |:-------------:|:-------------:|
| **MinorityClass** | TruePositive  | FalsePositive |
| **MajorityClass** | FalseNegative | TrueNegative  |

$$ Precision = \frac{TruePositive}{TruePositive+FalsePositive}$$

$$ Recall = \frac{TruePositive}{TruePositive+FalseNegative}$$
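
To make the contrast with accuracy concrete, here is a short sketch that computes all three measures from lists of gold and predicted labels. The function name `evaluate`, the label strings and the 2-versus-8 class split are assumptions for this example; when a denominator is zero the sketch returns 0.0, which is one common convention rather than the only one.

```python
def evaluate(gold, predicted, positive_label):
    """Return (accuracy, precision, recall), treating `positive_label` as the minority class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive_label and g == positive_label)
    fp = sum(1 for g, p in pairs if p == positive_label and g != positive_label)
    fn = sum(1 for g, p in pairs if p != positive_label and g == positive_label)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    # Guard against empty denominators (e.g. the minority class is never predicted).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical imbalanced test set: 2 minority and 8 majority instances.
gold = ["minority"] * 2 + ["majority"] * 8
always_majority = ["majority"] * 10
print(evaluate(gold, always_majority, positive_label="minority"))
```

Always predicting the majority class reaches 80% accuracy here, yet precision and recall for the minority class are both 0.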

### Bibliography
- Manning, Raghavan and Schütze's [vector space chapter](http://nlp.stanford.edu/IR-book/pdf/06vect.pdf) from *Introduction to Information Retrieval*.
- Hal Daumé III's [chapter](http://ciml.info/dl/v0_9/ciml-v0_9-ch03.pdf) on the perceptron from his book on machine learning.
- For more background reading on classification, Kevin Murphy's [introduction](https://www.cs.ubc.ca/~murphyk/MLbook/pml-intro-22may12.pdf) touches upon the most important concepts in ML.

### Coming up next

How to use the perceptron for PoS tagging