# POS tagging on Twitter

*Notebook for COMP90042, Web Search and Text Analysis*

*Copyright The University of Melbourne, 2018*

In this notebook we will check the performance of the POS tagger from the last workshop on a different domain: Twitter. First, let's build the HMM tagger again.

In [None]:
import numpy as np
from nltk.corpus import treebank
corpus = treebank.tagged_sents()

word_numbers = {}
tag_numbers = {}

num_corpus = []
for sent in corpus:
    num_sent = []
    for word, tag in sent:
        wi = word_numbers.setdefault(word.lower(), len(word_numbers))
        ti = tag_numbers.setdefault(tag, len(tag_numbers))
        num_sent.append((wi, ti))
    num_corpus.append(num_sent)
    
word_names = [None] * len(word_numbers)
for word, index in word_numbers.items():
    word_names[index] = word
tag_names = [None] * len(tag_numbers)
for tag, index in tag_numbers.items():
    tag_names[index] = tag
    
S = len(tag_numbers)
V = len(word_numbers)

# initialise
eps = 0.1
pi = eps * np.ones(S)
A = eps * np.ones((S, S))
O = eps * np.ones((S, V))

# count
for sent in num_corpus:
    last_tag = None
    for word, tag in sent:
        O[tag, word] += 1
        if last_tag is None:
            pi[tag] += 1
        else:
            A[last_tag, tag] += 1
        last_tag = tag
        
# normalise
pi /= np.sum(pi)
for s in range(S):
    O[s,:] /= np.sum(O[s,:])
    A[s,:] /= np.sum(A[s,:])
    
    
def viterbi(params, observations):
    pi, A, O = params
    M = len(observations)
    S = pi.shape[0]
    
    alpha = np.zeros((M, S))
    alpha[:,:] = float('-inf')
    backpointers = np.zeros((M, S), 'int')
    
    # base case
    alpha[0, :] = pi * O[:,observations[0]]
    
    # recursive case
    for t in range(1, M):
        for s2 in range(S):
            for s1 in range(S):
                score = alpha[t-1, s1] * A[s1, s2] * O[s2, observations[t]]
                if score > alpha[t, s2]:
                    alpha[t, s2] = score
                    backpointers[t, s2] = s1
    
    # now follow backpointers to resolve the state sequence
    ss = []
    ss.append(np.argmax(alpha[M-1,:]))
    for i in range(M-1, 0, -1):
        ss.append(backpointers[i, ss[-1]])
        
    return list(reversed(ss)), np.max(alpha[M-1,:])


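The `viterbi` function above multiplies raw probabilities, which can underflow to zero on long sentences. A common remedy is to work with log probabilities and add instead of multiply. The following is a minimal sketch of that variant, not part of the original notebook: the name `viterbi_log` and the toy parameters are assumptions made for illustration, and plain Python lists are used so that it runs on its own.

```python
import math

def viterbi_log(params, observations):
    """Viterbi decoding in log space.

    params is (pi, A, O) given as nested lists of probabilities, laid out
    like the notebook's arrays: pi[s], A[s1][s2], O[s][word].
    """
    pi, A, O = params
    M = len(observations)
    S = len(pi)
    log = lambda p: math.log(p) if p > 0 else float('-inf')

    # alpha[t][s]: best log score of any path ending in state s at step t
    alpha = [[float('-inf')] * S for _ in range(M)]
    backpointers = [[0] * S for _ in range(M)]

    # base case
    for s in range(S):
        alpha[0][s] = log(pi[s]) + log(O[s][observations[0]])

    # recursive case: sums of logs replace products of probabilities
    for t in range(1, M):
        for s2 in range(S):
            for s1 in range(S):
                score = alpha[t-1][s1] + log(A[s1][s2]) + log(O[s2][observations[t]])
                if score > alpha[t][s2]:
                    alpha[t][s2] = score
                    backpointers[t][s2] = s1

    # follow backpointers from the best final state
    best = max(range(S), key=lambda s: alpha[M-1][s])
    ss = [best]
    for t in range(M-1, 0, -1):
        ss.append(backpointers[t][ss[-1]])
    return list(reversed(ss)), alpha[M-1][best]

# toy two-state, two-word model (made-up numbers)
toy_pi = [0.6, 0.4]
toy_A = [[0.7, 0.3], [0.4, 0.6]]
toy_O = [[0.9, 0.1], [0.2, 0.8]]
path, log_score = viterbi_log((toy_pi, toy_A, toy_O), [0, 1, 1])
print(path, math.exp(log_score))
```

The returned score is a log probability, so it has to be exponentiated before it can be compared with the value returned by `viterbi`.
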

## Reading a corpus of POS tagged tweets

Remember from the lecture: we always need some annotated data in order to evaluate our methods, even when they are unsupervised. In order to do this, we will use a dataset of tweets annotated with POS tags, which we will download automatically via Python.

The next step is to read the file. We will use it as a *test set* only: you are not allowed to use any of this data for training.

In [None]:
# Python 3: urlretrieve lives in urllib.request
from urllib.request import urlretrieve
urlretrieve("https://github.com/aritter/twitter_nlp/raw/master/data/annotated/pos.txt", "pos.txt")

test_inputs = []
test_outputs = []
with open('pos.txt') as f:
    words = []
    pos_tags = []
    for line in f:
        if line.strip() == '':
            test_inputs.append(words)
            test_outputs.append(pos_tags)
            words = []
            pos_tags = []
        else:
            word, pos = line.strip().split()
            words.append(word)
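            pos_tags.append(pos)
    # keep the last sentence if the file does not end with a blank line
    if words:
        test_inputs.append(words)
        test_outputs.append(pos_tags)

print(test_inputs[0])
print(test_outputs[0])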

## Tagging the corpus and evaluating it

Now that we have read our test set, let's try to tag it with the HMM tagger we trained above.

In [None]:
predictions = []
for sent in test_inputs:
    encoded_sent = [word_numbers[w] for w in sent]
    pred, _ = viterbi((pi, A, O), encoded_sent)
    predictions.append([tag_names[i] for i in pred])

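This raises a KeyError because the test set contains out-of-vocabulary (OOV) words, that is, words that never occurred in the training data and so are missing from our dictionary. A simple way to deal with OOVs is to reserve a token for unknown words and smooth the counts in the emission matrix. Let's do that.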

In [None]:
# Add an OOV token to our dictionary. Let's call it '<unk>'
unk_index = len(word_numbers)
word_numbers.setdefault('<unk>', unk_index)
word_names.append('<unk>')

V = len(word_numbers)

# initialise
eps = 0.1
O = eps * np.ones((S, V))

# add-one smoothing on top of eps, so every cell starts at 1.1
O += 1.0

# count
for sent in num_corpus:
    for word, tag in sent:
        O[tag, word] += 1
 
# normalise
for s in range(S):
    O[s,:] /= np.sum(O[s,:])

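To see what the smoothing does to an unseen word, the toy example below applies the same recipe as the cell above to a single tag: every cell starts at `eps + 1.0`, the observed counts are added, and the row is normalised. The counts are made up and the helper name `smoothed_row` is an assumption for illustration.

```python
def smoothed_row(counts, eps=0.1, add=1.0):
    """Turn raw emission counts for one tag into smoothed probabilities."""
    pseudo = [c + eps + add for c in counts]
    total = sum(pseudo)
    return [p / total for p in pseudo]

# counts for one tag over ['the', 'a', 'dog', '<unk>'];
# '<unk>' is never seen in training, and here neither is 'dog'
row = smoothed_row([8, 2, 0, 0])
print([round(p, 3) for p in row])
```

Both unseen entries end up with probability 1.1 / 14.4, roughly 0.076. Because the denominator is the tag's total count plus the pseudo-counts, tags with fewer training tokens assign a higher probability to '<unk>'.
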
Now to tag the sentence, we first replace any OOV words with our '<unk>' token.

In [None]:
predictions = []
for sent in test_inputs:
    encoded_sent = []
    for word in sent:
        if word in word_numbers:
            encoded_sent.append(word_numbers[word])
        else:
            encoded_sent.append(word_numbers['<unk>'])
    pred, _ = viterbi((pi, A, O), encoded_sent)
    # keep tag indices here; they are mapped back to names via tag_names when printing
    predictions.append(pred)

print(predictions[0])
print('%20s\t%5s\t%5s' % ('TOKEN', 'TRUE', 'PRED'))
for wi, ti, predi in zip(test_inputs[0], test_outputs[0], predictions[0]):
    print('%20s\t%5s\t%5s' % (wi, ti, tag_names[predi]))

There are quite a few errors here, many more than in the previous workshop example. Let's quantify this in terms of accuracy, so that we can compare with the Penn Treebank (PTB) numbers.

In [None]:
from sklearn.metrics import accuracy_score as acc

# flatten our data into single lists
all_test_tags = [tag for tags in test_outputs for tag in tags]
# for predictions, we need to obtain the original tag from the index
all_pred_tags = [tag_names[tag] for tags in predictions for tag in tags]

print(acc(all_test_tags, all_pred_tags))

An accuracy of 51.9% is quite low. Compare this with in-domain performance on the Penn Treebank, where taggers can reach 96.7% accuracy. One reason for such a low number is that we train on only a subset of the Penn Treebank (the part that is freely available in NLTK). But even state-of-the-art POS taggers reach only about 80% accuracy on this dataset (Ritter et al., EMNLP 2011).
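
A further factor worth checking is letter case: the training loop stores words in lower case (`word.lower()`), whereas the test-time lookup `word in word_numbers` is case-sensitive, so a capitalised token such as `The` is mapped to '<unk>' even if `the` is in the vocabulary. The sketch below shows one possible encoder that lowercases before the lookup and also reports the OOV rate. The name `encode` and the toy vocabulary are hypothetical, and using it in the notebook would change the accuracy figure quoted above.

```python
def encode(sent, word_numbers, unk='<unk>'):
    """Map tokens to vocabulary indices, lowercasing first; unknown tokens map to unk."""
    unk_index = word_numbers[unk]
    return [word_numbers.get(word.lower(), unk_index) for word in sent]

# toy vocabulary standing in for the notebook's word_numbers
toy_vocab = {'the': 0, 'cat': 1, '<unk>': 2}
encoded = encode(['The', 'cat', 'lol'], toy_vocab)
# fraction of tokens that fell back to the unknown-word index
oov_rate = sum(i == toy_vocab['<unk>'] for i in encoded) / len(encoded)
print(encoded, oov_rate)
```

In this toy case only `lol` is out of vocabulary, so one token in three is mapped to '<unk>'.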

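Overall accuracy hides which tags go wrong. As a rough sketch, per-tag accuracy can be computed with the standard library alone. The function name `per_tag_accuracy` and the toy tag lists are assumptions for illustration; in the notebook one would pass `all_test_tags` and `all_pred_tags` instead.

```python
from collections import Counter

def per_tag_accuracy(gold, pred):
    """Return {gold_tag: fraction of its tokens that were predicted correctly}."""
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    return {tag: correct[tag] / total[tag] for tag in total}

# made-up gold and predicted tags, aligned token by token
gold = ['USR', 'NN', 'VB', 'NN', 'USR']
pred = ['NN', 'NN', 'VB', 'JJ', 'NN']
for tag, score in sorted(per_tag_accuracy(gold, pred).items()):
    print('%5s\t%.2f' % (tag, score))
```

A gold tag that the tagger never predicts, such as `USR` in this toy data, shows up with an accuracy of 0.
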
Notice that the Twitter test set has some tags which are not defined in PTB, such as "USR" for user mentions (@paulwalk, in the example above). Since our tagger is trained only on PTB, it will never predict these additional tags. Can you come up with a solution for that?
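
One possible direction, sketched here under assumptions rather than as the intended answer, is to assign Twitter-specific tags with simple surface rules and fall back to the HMM prediction otherwise. Only "USR" is confirmed by the text above: the `HT` and `URL` tag names, the regular expressions, and the helper `apply_twitter_rules` are assumptions that should be checked against the tags actually present in `test_outputs`.

```python
import re

# (pattern, tag) pairs; tag names other than USR are assumed, not taken from the notebook
TWITTER_RULES = [
    (re.compile(r'^@\w+'), 'USR'),
    (re.compile(r'^#\w+'), 'HT'),
    (re.compile(r'^(https?://|www\.)'), 'URL'),
]

def apply_twitter_rules(tokens, hmm_tags):
    """Override the HMM tag wherever a surface rule matches the token."""
    final = []
    for token, hmm_tag in zip(tokens, hmm_tags):
        for pattern, tag in TWITTER_RULES:
            if pattern.match(token):
                final.append(tag)
                break
        else:
            # no rule matched, keep the HMM prediction
            final.append(hmm_tag)
    return final

tokens = ['@paulwalk', 'loves', '#nlp', 'http://example.com']
hmm_tags = ['NN', 'VBZ', 'NN', 'NN']
print(apply_twitter_rules(tokens, hmm_tags))
```

Because the rules only post-process the output, the HMM parameters stay untouched and no Twitter data is used for training.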