# POS tagging on Twitter

*Notebook for COMP90042, Web search and Text Analysis*

*Copyright The University of Melbourne, 2018*

In this notebook we will check the performance of the POS tagger from the last workshop on a different domain: Twitter. First, let's build the HMM tagger again.

In [None]:
import numpy as np
from nltk.corpus import treebank
corpus = treebank.tagged_sents()

word_numbers = {}
tag_numbers = {}

num_corpus = []
for sent in corpus:
    num_sent = []
    for word, tag in sent:
        wi = word_numbers.setdefault(word.lower(), len(word_numbers))
        ti = tag_numbers.setdefault(tag, len(tag_numbers))
        num_sent.append((wi, ti))
    num_corpus.append(num_sent)
    
word_names = [None] * len(word_numbers)
for word, index in word_numbers.items():
    word_names[index] = word
tag_names = [None] * len(tag_numbers)
for tag, index in tag_numbers.items():
    tag_names[index] = tag
    
S = len(tag_numbers)
V = len(word_numbers)

# initalise
eps = 0.1
pi = eps * np.ones(S)
A = eps * np.ones((S, S))
O = eps * np.ones((S, V))

# count
for sent in num_corpus:
    last_tag = None
    for word, tag in sent:
        O[tag, word] += 1
        if last_tag == None:
            pi[tag] += 1
        else:
            A[last_tag, tag] += 1
        last_tag = tag
        
# normalise
pi /= np.sum(pi)
for s in range(S):
    O[s,:] /= np.sum(O[s,:])
    A[s,:] /= np.sum(A[s,:])
    
    
def viterbi(params, observations):
    pi, A, O = params
    M = len(observations)
    S = pi.shape[0]
    
    alpha = np.zeros((M, S))
    alpha[:,:] = float('-inf')
    backpointers = np.zeros((M, S), 'int')
    
    # base case
    alpha[0, :] = pi * O[:,observations[0]]
    
    # recursive case
    for t in range(1, M):
        for s2 in range(S):
            for s1 in range(S):
                score = alpha[t-1, s1] * A[s1, s2] * O[s2, observations[t]]
                if score > alpha[t, s2]:
                    alpha[t, s2] = score
                    backpointers[t, s2] = s1
    
    # now follow backpointers to resolve the state sequence
    ss = []
    ss.append(np.argmax(alpha[M-1,:]))
    for i in range(M-1, 0, -1):
        ss.append(backpointers[i, ss[-1]])
        
    return list(reversed(ss)), np.max(alpha[M-1,:])



## Reading a corpus of POS tagged tweets

Remember from the lecture: we always need some annotated data in order to evaluate our methods, even when they are unsupervised. In order to do this, we will use a dataset of tweets annotated with POS tags, which we will download automatically via Python.

The next step is to read the file. We will use it as a *test set* only: you are not allowed to use any of this data for training.

In [10]:
import urllib
urllib.urlretrieve("https://github.com/aritter/twitter_nlp/raw/master/data/annotated/pos.txt","pos.txt")

test_inputs = []
test_outputs = []
with open('pos.txt') as f:
    words = []
    pos_tags = []
    for line in f:
        
        if line.strip() == '':
#             print line
            test_inputs.append(words)
            test_outputs.append(pos_tags)
            words = []
            pos_tags = []
        else:
            print line
#             print line.strip()
            word, pos = line.strip().split()
#             print word,pos
            words.append(word)
            pos_tags.append(pos)
print len(test_inputs)
print test_inputs[0]
print test_outputs[0]

@paulwalk USR

It PRP

's VBZ

the DT

view NN

from IN

where WRB

I PRP

'm VBP

living VBG

for IN

two CD

weeks NNS

. .

Empire NNP

State NNP

Building NNP

= SYM

ESB NNP

. .

Pretty RB

bad JJ

storm NN

here RB

last JJ

evening NN

. .

Small JJ

Biz NNP

Tech NNP

Tour NNP

2010 CD

Launches VBZ

Five CD

City NN

Tour NN

MONTCLAIR NNP

N.J. NNP

...: :

The DT

all DT

day NN

event NN

features VBZ

America's... NNP

http://tinyurl.com/28hd9fu URL

#fb HT

@MiSS_SOTO USR

I PRP

think VBP

that DT

's VBZ

when WRB

I PRP

'm VBP

gonna VBG

be VB

there RB

On IN

Thanksgiving NNP

after IN

you PRP

done VBN

eating VBG

its PRP

#TimeToGetOut HT

unless IN

you PRP

wanna VBP

help VB

with IN

the DT

dishes NNS

RT RT

@robmoysey USR

Eyeopener NNP

vs CC

. .

Ryerson NNP

Quidditch NN

team NN

this DT

Sunday NNP

at IN

4 CD

p.m. NN

Anyone NN

know VB

where WRB

to TO

get VB

cheap JJ

brooms NNS

? .

#Ryerson HT

@RUQuidditch USR

#Rams HT

RT RT

@Zodiac


did VBD

it PRP

+ SYM

@Strigy USR

got VBD

mine NN

in IN

bbt NNP

aintree NNP

today NN

. .

Played VBD

table NN

tennis NN

on IN

it PRP

in IN

store NN

! .

V RB

impressed VBD

. .

Did VB

you PRP

get VB

analogue JJ

controller NN

2 RB

? .

USM NNP

get VB

ready RB

yours PRP$

truly RB

tonight NN

at IN

frat NN

house NN

Hattiesburg NNP

on IN

the DT

1s CD

n CC

2s CD

!! .

Today NN

's POS

cheer NN

went VBD

from IN

awful JJ

to TO

awesome JJ

from IN

the DT

moment NN

I PRP

realized VBD

I PRP

did VBD

not RB

have VB

class NN

today NN

. .

@jonronson USR

But CC

a DT

mess NN

on IN

one CD

day NN

may MD

not RB

a DT

mess NN

eternally RB

be VB

. .

Driving VBG

, ,

driving VBG

, ,

driving VBG

away RB

to TO

Phil NNP

. .

Tasty JJ

dinner NN

tonight NN

with IN

the DT

Society NNP

of IN

Mining NNP

and CC

Metallurgy NNP

Engineers NNPS

. .

Have VBP

a DT

practice NN

session NN

@Cromwell USR

Field NN

today NN

at IN

6pm


! .

RT RT

@BuffSabresGrl63 USR

: :

RT RT

@LindyRuffsTie USR

: :

RT RT

@FakeDarcy USR

: :

No UH

Derek NNP

, ,

when WRB

guys NNS

get VBP

hurt VBN

, ,

" ''

kissing VBG

the DT

boo-boo NN

" ''

will MD

not RB

make VB

it PRP

better JJR

. .

#sab HT

... :

RT RT

@TRAEABN USR

Headed VBN

to TO

Shreveport NNP

then IN

Lake NNP

Charles NNP

-might MD

as IN

well RB

shoot VB

through IN

dat DT

#BMT HT

after IN

the DT

slab NN

show NN

in IN

Lake NNP

Charles NNP

.. .

RT RT

@WeSpazForJB USR

: :

http://twitpic.com/2nn4ee URL

&lt; SYM

---- SYM

still RB

the DT

funniest JJS

photo NN

. .

of IN

all DT

time NN

. .

ALL DT

TIME NN

. .

I PRP

have VBP

eaten VBN

a DT

large JJ

quantity NN

of IN

oranges NNS

this DT

week NN

... :

RT RT

@NickSilly USR

: :

Fun JJ

! .

RT RT

@JackFMDFW USR

: :

Put VB

on IN

your PRP$

Boogie NN

Shoes NNS

and CC

Get VB

Down RP

Tonight NN

with IN

KC NNP

and CC

The NNP

Sunshine NNP

Band NNP

. 

shorts NNS

to TO

the DT

party NN

tonight NN

, ,

so CC

I PRP

have VBP

to TO

wear VB

sweat NN

pants NNS

over IN

them PRP

and CC

change VB

when WRB

I PRP

get VBP

there RB

. .

SmackMyHead UH

. .

Such JJ

a DT

shiny JJ

morning NN

!!! .

Love VBP

It PRP

. .

Anyways UH

, ,

today NN

will MD

be VB

very RB

busy JJ

! .

Oh UH

lord NNP

here RB

we PRP

go VBP

again RB

Paranormal NNP

Activity NNP

2 CD

http://j.mp/ascgZr URL

I PRP

'm VBP

not RB

requesting VBG

it PRP

in IN

my PRP$

area NN

this DT

time NN

. .

Are VBP

you PRP

? .

RT RT

@HippoArmy USR

: :

Thousands NNS

of IN

angry JJ

masturbators NNS

marched VBD

against IN

Christine NNP

O'Donnell NNP

today NN

http://bit.ly/9nTWQw URL

Seriously UH

. .

Our PRP

Favorite JJ

YouTube NNP

Videos NNS

This DT

Week NN

: :

The DT

Drama NNP

Edition NNP

: :

Had VBD

any DT

epic JJ

meltdowns NNS

this DT

week NN

? .

You PRP

know VBP

, ,

of IN

the DT

cryin NN

... :

http://


Is VBZ

all DT

done VBN

getting VBG

things NNS

set VB

up RP

for IN

the DT

Ultimate NNP

Magic NNP

playtest NN

that WDT

starts VBZ

on IN

Monday NNP

. .

Ca MD

n't RB

wait VB

to TO

see VB

what WP

people NNS

think VBP

. .

I PRP

've VBP

never RB

found VBN

a DT

happy JJ

way NN

to TO

preserve VB

it PRP

but CC

it PRP

's VBZ

usually RB

affordable JJ

most JJS

of IN

the DT

year NN

. .

We PRP

eat VBP

a DT

lot NN

2 RB

, ,

grilled JJ

saute JJ

bake JJ

or CC

fried JJ

RT RT

@LilTwist USR

: :

RT RT

this DT

if IN

you PRP

want VBP

me PRP

to TO

go VB

back RB

live RB

on IN

Ustream NNP

later RB

tonight NN

Have VB

DJ NN

Got VBZ

Us PRP

Fallin VBG

in IN

Love NN

in IN

my PRP$

head NN

. .

" ''

Cause IN

baby NN

tonight NN

, ,

The DT

dj NN

got VBZ

us PRP

fallin VBG

in IN

love NN

again RB

" ''

@Littlesapling USR

You PRP

said VBD

that DT

last JJ

week NN

Ash NNP

. .

I PRP

think VBP

you PRP

need VBP

a DT

12 CD

! .

VISIONS NNP

LOUNGE NNP

( (

247 CD

1ST NNP

AVE NN

HKY NNP

) )

Today NN

I NNP

justt RB

felt VBD

special JJ

forr IN

no DT

reason NN

&lt; UH

3 UH

. .

It PRP

wass VBD

cool JJ

thou( RB

: :

RT RT

@PLLGirls USR

: :

Episodes NNS

1-10 CD

of IN

Pretty NNP

Little NNP

Liars NNP

will MD

air VB

in IN

the DT

UK NNP

on IN

October NNP

18 CD

! .

@Loserface_Laura USR

when WRB

mike NNP

lets VBZ

me PRP

know VBP

, ,

I NNP

will MD

let VB

you PRP

know VBP

. .

I PRP

mean VBP

everyone NN

might MD

just RB

switch VB

out RP

a DT

lot NN

. .

@CULTOFMIKEY USR

It PRP

's VBZ

too RB

long JJ

, ,

and CC

it PRP

just RB

... :

I PRP

do VBP

n't RB

know VB

, ,

it PRP

sounds VBZ

cliche NN

. .

And CC

dumb JJ

. .

My PRP$

sister NN

wants VBZ

to TO

cry VB

she PRP

hates VBZ

it PRP

so RB

much JJ

. .

Goin VBG

to TO

my PRP$

BEST JJS

friends NN

birthday NN

party NN

2day NN

. .

As IN

u PRP

can MD

see VB

her PRP

name NN

is 


in IN

my PRP$

city NN

oh UH

my PRP$

dream NN

and CC

sing VB

one CD

time NN

The DT

Basic JJ

Step NN

Before RB

You PRP

Even RB

Start VBP

Thinking VBG

Of IN

Making VBG

Your PRP$

...: :

Keyword NN

research NN

is VBZ

a DT

well RB

known JJ

subject NN

, ,

yet CC

so RB

... :

http://bit.ly/9XQgSr URL

@SnoopDogg USR

hey UH

snoop NNP

my PRP$

wife NN

Cath NNP

is VBZ

30 CD

today NN

, ,

any DT

chance NN

of IN

a DT

shout NN

out IN

to TO

her PRP

, ,

Today NN

I PRP

got VBD

a DT

promotion NN

at IN

work NN

, ,

and CC

tomorrow NN

I PRP

'm VBP

going VBG

home NN

to TO

Wisconsin NNP

for IN

a DT

few JJ

days NNS

. .

So UH

content JJ

with IN

life NN

right RB

now RB

. .

:) UH

It PRP

might MD

just RB

be VB

me PRP

but CC

when WRB

males NN

text VBP

" ''

K NN

" ''

or CC

cross VBP

there PRP$

legs NNS

while IN

sitting VBG

ai VBZ

n't RB

cool JJ

wit IN

me PRP

. .

@Foxy_Shoe_Thief USR

*purrs VBP

long RB

on IN

th

TX NNP

- :

DECEMBER NNP

30 CD

: :

Running NN

back NN

Cier NNP

... :

http://bit.ly/f3YPEB URL

@YasminBaildon USR

do VBP

you PRP

think VBP

its PRP

possible JJ

one CD

day NN

ill PRP

meet VB

you PRP

Cody NNP

and CC

Alli NNP

:D UH

&lt; UH

3dreaming VBG

big JJ

&lt; UH

3 UH

!!! .

So UH

I PRP

'm VBP

currently RB

looking VBG

at IN

about JJ

5 CD

years NNS

and CC

3 CD

months NNS

if IN

I PRP

can MD

get VB

out IN

of IN

there RB

within IN

the DT

next JJ

month NN

. .

RT RT

@TeenDreaming USR

: :

Retweet VB

if IN

you PRP

're VBP

being VBG

distracted VBN

by IN

twitter NNP

when WRB

you PRP

're VBP

supposed VBN

to TO

be VB

studying/doing VBG

homework NN

! .

:p UH

Mortgage NN

Advice NN

&gt; SYM

First JJ

time NN

home NN

buyer NN

with IN

several JJ

questions NNS

in IN

...: :

My PRP$

wife NN

and CC

I PRP

are VBP

beginning VBG

to TO

look VB

at IN

re NN

... :

http://bit.ly/hgMKjS URL

@justinbieber USR

I PRP

hav


: UH

RT RT

@JupiterStorm USR

: :

1 CD

. .

6 CD

. .

2011 CD

. .

The DT

day NN

@owlcity USR

went VBD

on IN

an DT

Emporer NNP

's NNP

New NNP

Groove NNP

rampage NN

through IN

twitter NN

. .

xD UH

good UH

morning UH

tweeps NNS

. .

have VB

'll MD

a DT

fabulous JJ

day NN

.... :

I PRP

know VBP

Deena NNP

watched VBD

this DT

show NN

, ,

now RB

shes PRP

trying VBG

to TO

get VB

with IN

Vinny NNP

after IN

Snooki NNP

been VBN

with IN

him PRP

? .

RT RT

@KyotoIS USR

: :

BEST JJS

SATURDAY NNP

PARTY NN

IN IN

KYOTO NNP

! .

BUTTERFLY NNP

SATURDAYS NNPS

@BUTTERFLYKYOTO USR

@DJLEADJP USR

@djshiotsu USR

@DJYMX USR

http://t.co/ifuwGCI URL

I PRP

typed VBD

an DT

essay NN

and CC

before IN

I PRP

saved VBD

it PRP

my PRP$

lap NN

top NN

froze VBD

and CC

died VBD

. .

-___- UH

Excellent JJ

dinner NN

@nostranapdx USR

, ,

now RB

home NN

and CC

drinking VBG

pinot NN

and CC

watching VBG

an DT

Ahhhhnold NNP

movie NN

. .



Art NN

of IN

Modern JJ

Mythmaking NN

." .

Still RB

smh UH

@ IN

last JJ

night NN

. .

Dointhemost URL

. URL

com URL

bruise NN

on IN

my PRP$

facee NN

... :

wild JJ

night NN

! .

I PRP

need VBP

to TO

go VB

get VB

some DT

music NN

mixed VBN

down RP

today NN

PREACH VB

!! .

RT RT

@DJRyan1der USR

: :

These DT

ppl NNS

do VBP

n't RB

understand VB

its PRP$

Monday NNP

and CC

I PRP

'm VBP

not RB

in IN

da DT

mood NN

to TO

deal VB

wit IN

their PRP$

bullshit NN

!! .

Fukk VB

outta IN

here RB

!!! .

@SMFG800 USR

is VBZ

not RB

morning NN

anymore RB

=p UH

A DT

wet JJ

and CC

cold JJ

Monday NNP

....: :

High JJ

temperatures NNS

will MD

be VB

reached VBN

this DT

morning NN

, ,

in IN

the DT

low JJ

to TO

mid JJ

40s CD

, ,

before IN

cooling VBG

... :

http://bit.ly/fkdrr6 URL

I PRP

'm VBP

at IN

Roll NNP

And NNP

Go NNP

Quality NNP

Kitchen NNP

( (

8th NNP

Ave NNP

, ,

W NNP

38th NNP

St NNP

, ,

New NNP

York NNP



tonight NN

#meteopau HT

give VB

me PRP

your PRP$

your PRP$

your PRP$

young JJ

minddddddd NN

RT RT

@CHINOXL USR

: :

Today NN

is VBZ

my PRP$

main JJ

goonie NN

@bobbyboutit USR

birthday NN

#teamchino HT

ca MD

n't RB

move VB

without IN

him PRP

, ,

please VB

send VB

him PRP

some DT

shouts NNS

!! .

While IN

he PRP

's VBZ

... :

HBCU NNP

's PRP

should MD

have VB

a DT

free JJ

JORDAN NNP

shoe NN

lottery NN

every DT

Tuesday NNP

but CC

only RB

at IN

7pm CD

. .

Ninjas NNS

would MD

skip VB

class NN

for IN

this DT

shit NN

EventsandHolidays NNS

with IN

" ''

cute JJ

easter NNP

egg NN

color NN

shower NN

Easter NNP

cards NNPS

", NNP

Zazzle NNP

's POS

Best JJS

Today NN

. .

http://bit.ly/ebZzGT URL

still RB

fly JJ

from IN

yesterday NN

.... :

same JJ

clothes NNS

from IN

yesterday NN

... :

lmao UH

RT RT

@iloveheartstoo USR

: :

Please VB

follow VB

me PRP

and CC

RT RT

for IN

a DT

chance NN

to TO

win VB

. .

End

so RB

you PRP

know VBP

what WP

that DT

means VBZ

, ,

right UH

? .

Mum NN

and CC

I PRP

... :

http://bit.ly/9uTYHJ URL

when WRB

i PRP

compliment VBP

her PRP

she PRP

wo MD

n't RB

believe VB

mee PRP

&lt; UH

3 UH

Jopulse NNP

: :

Get VB

This DT

Little JJ

Secret NN

When WRB

to TO

Avail VB

For IN

Mortgage NN

http://bit.ly/9yTa3x URL

@sweet_clockwork USR

lmao UH

! .

okay UH

then IN

:o UH

) )

im PRP

baking VBG

it PRP

tonight NN

, ,

so IN

'll MD

hit VB

you PRP

up RP

tomorrow NN

and CC

see VB

about IN

getting VBG

you PRP

some DT

yummies NNS

!! .

&lt; UH

3 UH

buy VB

it PRP

when WRB

it PRP

comes VBZ

out IN

. .

o_O UH

RT RT

@ReginaPearl USR

yo UH

is VBZ

there RB

another DT

way NN

to TO

listen VB

to TO

nicki NNP

album NNP

beside IN

fb NNP

? .

Gen NNP

pen NNP

n CC

inc NNP

orr NNP

ball NN

at IN

own JJ

46 CD

orr NNP

comp VBP

long JJ

pass NN

to TO

end VB

half NN

9-6 RB

Watching VBG

a DT

detective NN



Rift NNP

for IN

Manchester NNP

Salver NNP

: :

After IN

failing VBG

to TO

recapture VB

the DT

East NNP

Africa NNP

Chal NNP

... :

http://bit.ly/9HIJAp URL

#news HT

@UrbanDecay411 USR

so RB

stoked JJ

got VBD

some DT

lip NN

junkie NN

in IN

heavy JJ

__AND__ CC

am VBP

picking VBG

my PRP$

book NN

of IN

shadows NNS

II RB

tomorrow NN

! .

So RB

stoked VBN

! .

:D UH

@InuBoA_Kwon USR

i PRP

told VBD

her PRP

you PRP

r VBP

very RB

lucky JJ

to TO

know VB

me PRP

in IN

past JJ

year NN

ago RB

. .

i PRP

had VBD

1 CD

mention NN

from IN

my PRP$

fans NNS

she PRP

said VBD

i PRP

same JJ

characters NNS

as IN

@BoA_1105 USR

RT RT

@cstatus USR

: :

pet NN

peeve NN

when WRB

niggas NNS

do VBP

n't RB

tag VB

albums NNS

right RB

... :

lowercase JJ

track NN

names NNS

and CC

shit NN

@WriteRCastle USR

So UH

, ,

when WRB

do VBP

you PRP

start VBP

the DT

next JJ

book NN

? .

Surely RB

you PRP

want VBP

another DT

bestseller NN

time NN

these DT

days NNS

man NN

It PRP

was VBD

just RB

the DT

best JJS

day NN

of IN

my PRP$

life NN

.. :

thanks NNS

to TO

some DT

people NNS

and CC

the DT

unknown JJ

old JJ

lady NN

for IN

sharing VBG

her PRP$

experience NN

__AND__ CC

advice NN

... :

@pegasus_za USR

You PRP

're VBP

welcome JJ

. .

Yeah UH

, ,

it PRP

's VBZ

not RB

a DT

bad JJ

day NN

, ,

just RB

feeling VBG

a DT

little RB

tired JJ

and CC

ill PRP

. .

How WRB

's VBZ

it PRP

going VBG

with IN

you PRP

? .

Think VBP

I'ma PRP

make VB

@GrownWomanStuff USR

have VBP

a DT

shoegasm NN

today NN

hahahaha UH

RT RT

@politicalwind USR

: :

Congress NNP

punts NNS

tough JJ

choice NN

until IN

after IN

election NN

http://bit.ly/aEi59e URL

#oreillyfactor HT

Shower NN

time NN

the DT

flyin NNP

pilot NNP

I PRP

was VBD

at IN

last JJ

night NN

did VBD

n't RB

have VB

no DT

hot JJ

water NN

uggghh UH

@gabbysueurban USR

WELL UH

seems VBZ

like IN

you PRP




by-week NN

? .

How WRB

did VBD

little JJ

G NNP

do VBP

? .

" ''

when WRB

ur PRP

feelin VBG

good JJ

in IN

somebodys NN

spot NN

gettin VBG

hot JJ

dont VB

stop VB

.. :

just RB

dont VB

think VB

im PRP

not RB

.. :

cuz IN

im PRP

out IN

gettin VBG

mineS NNS

"! .

realest JJS

throwback NN

of IN

all DT

time NN

@IAMtheCOMMODORE USR

Today NN

is VBZ

my PRP$

20th JJ

birthday NN

and CC

it PRP

would MD

mean VB

the DT

world NN

to TO

me PRP

if IN

I PRP

got VBD

an DT

@reply NN

from IN

you PRP

today NN

! .

&lt; UH

3 UH

Happy JJ

Monday NNP

to TO

Everyone NN

! .

#education HT

can MD

never RB

be VB

taken VBN

away RB

... :

go VB

follow VB

your PRP$

#dreams HT

@taylorrhicks USR

Thunder NN

, ,

lightening NN

, ,

first JJ

rain NN

we PRP

've VBP

had VBN

in IN

months NNS

... :

love VBP

it PRP

! .

Chukchansi NNP

had VBD

a DT

pow NN

wow NN

this DT

weekend NN

. .

Think VBP

someone NN

did VBD

a DT

... :

Got VBD



## Tagging the corpus and evaluating it

Now that we read our test set, let's try to tag it using our HMM tagger trained before.

In [None]:
predictions = []
for sent in test_inputs:
    encoded_sent = [word_numbers[w] for w in sent]
    pred = viterbi((pi, A, O), encoded_sent)
    predictions.append([tag_names[i] for i in pred])

This will raise an error due to an OOV word. A simple way to deal with OOV's is to smooth the counts in the emission matrix. Let's do that.

In [None]:
# Add an OOV token to our dictionary. Let's call it '<unk>'
unk_index = len(word_numbers)
word_numbers.setdefault('<unk>', unk_index)
word_names.append('<unk>')

V = len(word_numbers)

# initalise
eps = 0.1
O = eps * np.ones((S, V))

# add one smoothing
O += 1.0

# count
for sent in num_corpus:
    for word, tag in sent:
        O[tag, word] += 1
 
# normalise
for s in range(S):
    O[s,:] /= np.sum(O[s,:])

Now to tag the sentence, we first replace any OOV words with our '<unk>' token.

In [None]:
predictions = []
for sent in test_inputs:
    encoded_sent = []
    for word in sent:
        if word in word_numbers:
            encoded_sent.append(word_numbers[word])
        else:
            encoded_sent.append(word_numbers['<unk>'])
    pred, _ = viterbi((pi, A, O), encoded_sent)
    #predictions.append([tag_names[i] for i in predicted]
    predictions.append(pred)
    

print predictions[0]
print('%20s\t%5s\t%5s' % ('TOKEN', 'TRUE', 'PRED'))
for wi, ti, predi in zip(test_inputs[0], test_outputs[0], predictions[0]):
    print('%20s\t%5s\t%5s' % (wi, ti, tag_names[predi]))

There are quite a few errors here, much more than in the previous workshop example. Let's try to quantify this in terms of accuracy, so we can compare with PTB numbers.

In [None]:
from sklearn.metrics import accuracy_score as acc

# flat our data into single lists
all_test_tags = [tag for tags in test_outputs for tag in tags]
# for predictions, we need to obtain the original tag from the index
all_pred_tags = [tag_names[tag] for tags in predictions for tag in tags]

print acc(all_test_tags, all_pred_tags)

51.9% accuracy is quite low. Compare this to the performance on Penn Treebank, which can reach 96.7% accuracy. One reason for such low numbers is the fact we are training only on a subset of Penn Treebank (since it is freely available on NLTK). But even state-of-the-art POS taggers reach only 80% accuracy on this dataset (Ritter et al., EMNLP 2011).

Notice that the twitter test set has some tags which are not defined in PTB, such as "USR" for user mentions (@paulwalk, in the example above). This means that these additional tags will never get predicted using the current tagger. Can you come up with a solution for that?