## POS Tagging - Lexicon and Rule Based Taggers

POS tagging maps every word in a document to its part of speech.

Let's look at the two most basic POS tagging techniques: lexicon-based (or unigram) and rule-based.

In the lexicon-based approach, we build a model in which every word is assigned the part of speech it most commonly has in the training set; the test set is then used to measure how well this works.
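As a minimal sketch of this idea, here is a lexicon built by counting tags per word and keeping the most frequent one. The toy data and the names `toy_train` and `build_lexicon` are made up for illustration; they are not part of the Treebank data used below.

```python
from collections import Counter, defaultdict

# hypothetical toy training data: a flat list of (word, tag) pairs
toy_train = [
    ("the", "DT"), ("bank", "NN"), ("can", "MD"), ("bank", "VB"),
    ("the", "DT"), ("bank", "NN"), ("can", "NN"), ("can", "MD"),
]

def build_lexicon(tagged_words):
    """Map each word to the tag it carries most often in tagged_words."""
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_counts[word][tag] += 1
    # keep only the single most frequent tag for every word
    return {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}

lexicon = build_lexicon(toy_train)
print(lexicon)
```

Here 'bank' is seen twice as NN and once as VB, so the lexicon stores NN for it; a word that is missing from the lexicon simply has no prediction.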

In the rule-based approach, we write the rules ourselves, based on our own analysis of the training set.

In the end, we will build a model that combines both methods.

This exercise is divided into the following sections:
1. Reading and understanding the tagged dataset
2. Exploratory analysis
3. Lexicon and Rule-Based Models for POS Tagging (individually and combined)

### 1. Reading and understanding the tagged dataset

In [1]:
# Importing libraries
import nltk
import numpy as np
import pandas as pd
import pprint, time
import random
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize
import math

In [2]:
# reading the Treebank tagged sentences
wsj = list(nltk.corpus.treebank.tagged_sents())

In [3]:
# samples: Each sentence is a list of (word, pos) tuples
wsj[:3]

[[('Pierre', 'NNP'),
  ('Vinken', 'NNP'),
  (',', ','),
  ('61', 'CD'),
  ('years', 'NNS'),
  ('old', 'JJ'),
  (',', ','),
  ('will', 'MD'),
  ('join', 'VB'),
  ('the', 'DT'),
  ('board', 'NN'),
  ('as', 'IN'),
  ('a', 'DT'),
  ('nonexecutive', 'JJ'),
  ('director', 'NN'),
  ('Nov.', 'NNP'),
  ('29', 'CD'),
  ('.', '.')],
 [('Mr.', 'NNP'),
  ('Vinken', 'NNP'),
  ('is', 'VBZ'),
  ('chairman', 'NN'),
  ('of', 'IN'),
  ('Elsevier', 'NNP'),
  ('N.V.', 'NNP'),
  (',', ','),
  ('the', 'DT'),
  ('Dutch', 'NNP'),
  ('publishing', 'VBG'),
  ('group', 'NN'),
  ('.', '.')],
 [('Rudolph', 'NNP'),
  ('Agnew', 'NNP'),
  (',', ','),
  ('55', 'CD'),
  ('years', 'NNS'),
  ('old', 'JJ'),
  ('and', 'CC'),
  ('former', 'JJ'),
  ('chairman', 'NN'),
  ('of', 'IN'),
  ('Consolidated', 'NNP'),
  ('Gold', 'NNP'),
  ('Fields', 'NNP'),
  ('PLC', 'NNP'),
  (',', ','),
  ('was', 'VBD'),
  ('named', 'VBN'),
  ('*-1', '-NONE-'),
  ('a', 'DT'),
  ('nonexecutive', 'JJ'),
  ('director', 'NN'),
  ('of', 'IN'),
  ('this'

In the output above, each element of the list is a sentence. Note that these sentences end with a full stop '.' whose POS tag is also '.', so the tag '.' marks the end of a sentence.

We do not need the corpus segmented into sentences for the analysis that follows; a flat list of (word, tag) tuples is enough. Let's flatten the list of sentences into such a list.

In [4]:
# converting the list of sents to a list of (word, pos tag) tuples
tagged_words = [tup for sent in wsj for tup in sent]
print(len(tagged_words))
tagged_words[:10]

100676


[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT')]

We now have a list of 100,676 (word, tag) tuples. Let's now do some exploratory analysis.

### 2. Exploratory Analysis

Let's now conduct some basic exploratory analysis to understand the tagged corpus. To start with, let's ask some simple questions:
1. How many unique tags are there in the corpus? 
2. Which is the most frequent tag in the corpus?
3. Which tag is most commonly assigned to the following words:
    - "bank"
    - "executive"


In [5]:
# question 1: Find the number of unique POS tags in the corpus
# you can use the set() function on the list of tags to get a unique set of tags, 
# and compute its length
tags =  [tag for word,tag in tagged_words]
# print(tags)
unique_tags = set(tags)
len(unique_tags)

46

In [6]:
# question 2: Which is the most frequent tag in the corpus
# to count the frequency of elements in a list, the Counter() class from collections
# module is very useful, as shown below

from collections import Counter
tag_counts = Counter(tags)
tag_counts.most_common()

[('NN', 13166),
 ('IN', 9857),
 ('NNP', 9410),
 ('DT', 8165),
 ('-NONE-', 6592),
 ('NNS', 6047),
 ('JJ', 5834),
 (',', 4886),
 ('.', 3874),
 ('CD', 3546),
 ('VBD', 3043),
 ('RB', 2822),
 ('VB', 2554),
 ('CC', 2265),
 ('TO', 2179),
 ('VBN', 2134),
 ('VBZ', 2125),
 ('PRP', 1716),
 ('VBG', 1460),
 ('VBP', 1321),
 ('MD', 927),
 ('POS', 824),
 ('PRP$', 766),
 ('$', 724),
 ('``', 712),
 ("''", 694),
 (':', 563),
 ('WDT', 445),
 ('JJR', 381),
 ('NNPS', 244),
 ('WP', 241),
 ('RP', 216),
 ('JJS', 182),
 ('WRB', 178),
 ('RBR', 136),
 ('-RRB-', 126),
 ('-LRB-', 120),
 ('EX', 88),
 ('RBS', 35),
 ('PDT', 27),
 ('#', 16),
 ('WP$', 14),
 ('LS', 13),
 ('FW', 4),
 ('UH', 3),
 ('SYM', 1)]

Thus, NN is the most common tag followed by IN, NNP, DT, -NONE- etc. You can read the exhaustive list of tags using the NLTK documentation as shown below.
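These counts also give a handy reference point: a tagger that labelled every token NN would only be right on the tokens that really are NN. The small calculation below reuses the two numbers printed above (13166 NN tokens out of 100676); the "always NN" tagger is a hypothetical baseline, not something built in this notebook.

```python
# counts copied from the outputs above
nn_count = 13166
total_tokens = 100676

# accuracy of a hypothetical "always predict NN" tagger over the whole corpus
majority_baseline = nn_count / total_tokens
print(round(majority_baseline, 4))
```

That is roughly 13%, which is worth keeping in mind when judging the accuracies reported later.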

In [7]:
# list of POS tags in NLTK
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [28]:
# question 3: Which tag is most commonly assigned to the word w?
# Collect the tags that appear for w, then count them with Counter().
# Try w = 'bank'
def most_common_tags(w):
    word_tags = [tag for word,tag in tagged_words if word==w]
    counts = Counter(word_tags)
    print(counts.most_common())
most_common_tags('bank')

[('NN', 38)]


In [9]:
# question 3: Which tag is most commonly assigned to the word w. Try 'executive' 
most_common_tags('executive')

executive = Counter([tag for word,tag in tagged_words if word == 'executive'])
executive

[('NN', 40), ('JJ', 28)]


Counter({'NN': 40, 'JJ': 28})
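'executive' shows that a single word can carry more than one tag. A rough way to see how common this is: collect the set of tags seen for each word and list the words with more than one. The sketch uses a hypothetical toy list (`toy_tagged`) so it runs without the corpus; passing `tagged_words` instead should work the same way.

```python
from collections import defaultdict

# hypothetical toy data standing in for tagged_words
toy_tagged = [
    ("executive", "NN"), ("executive", "JJ"), ("bank", "NN"),
    ("bank", "NN"), ("the", "DT"), ("will", "MD"), ("will", "NN"),
]

# collect the set of distinct tags seen for each word
tags_per_word = defaultdict(set)
for word, tag in toy_tagged:
    tags_per_word[word].add(tag)

# words that were seen with more than one tag
ambiguous = sorted(w for w, t in tags_per_word.items() if len(t) > 1)
ambiguous_share = len(ambiguous) / len(tags_per_word)
print(ambiguous)
print(ambiguous_share)
```

A lexicon tagger is necessarily wrong on the minority readings of such words, since it always predicts the most frequent tag.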

### 2. Exploratory Analysis Contd.

Until now, we were looking at the frequency of tags assigned to particular words, which is the basic idea used by lexicon or unigram taggers. Let's now try observing some rules which can potentially be used for POS tagging. 

To start with, let's see if the following questions reveal something useful:

4. What fraction of words with the tag 'VBD' (verb, past tense) end with the letters 'ed'?
5. What fraction of words with the tag 'VBG' (verb, present participle/gerund) end with the letters 'ing'?

In [10]:
# 4. what fraction of words with the tag 'VBD' (verb, past tense) end with 'ed'
# first get all the words tagged as VBD
past_tense_verbs = [word for word,tag in tagged_words if tag =='VBD']
# print(past_tense_verbs)
# subset the past tense verbs to the words ending with 'ed'
ed_verbs = [word for word in past_tense_verbs if word.endswith('ed')]

print(len(ed_verbs) / len(past_tense_verbs))
ed_verbs[:20]

0.3881038448899113


['reported',
 'stopped',
 'studied',
 'led',
 'worked',
 'explained',
 'imposed',
 'dumped',
 'poured',
 'mixed',
 'described',
 'ventilated',
 'contracted',
 'continued',
 'eased',
 'ended',
 'lengthened',
 'reached',
 'resigned',
 'approved']

In [11]:
# 5. what fraction of words with the tag 'VBG' end with 'ing'
participle_verbs = [word for word,tag in tagged_words if tag =='VBG']
ing_verbs = [word for word in participle_verbs if word.endswith('ing')]
print(len(ing_verbs) / len(participle_verbs))
ing_verbs[:20]

0.9972602739726028


['publishing',
 'causing',
 'using',
 'talking',
 'having',
 'making',
 'surviving',
 'including',
 'including',
 'according',
 'remaining',
 'according',
 'declining',
 'rising',
 'yielding',
 'waiving',
 'holding',
 'holding',
 'cutting',
 'manufacturing']
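The two fractions above tell us how many VBD and VBG words a suffix rule would cover. For a tagging rule the other direction matters too: of all the words ending in a suffix, how many actually carry the tag? Here is a sketch with a hypothetical helper `suffix_stats` and made-up toy data (replace `toy_tagged` with `tagged_words` to try it on the corpus).

```python
def suffix_stats(tagged_words, tag, suffix):
    """Return (coverage, precision) of the rule 'ends with suffix => tag'.

    coverage:  fraction of words tagged `tag` that end with `suffix`
    precision: fraction of words ending with `suffix` that are tagged `tag`
    """
    with_tag = [w for w, t in tagged_words if t == tag]
    with_suffix = [(w, t) for w, t in tagged_words if w.endswith(suffix)]
    hits = [w for w, t in with_suffix if t == tag]
    coverage = len(hits) / len(with_tag) if with_tag else 0.0
    precision = len(hits) / len(with_suffix) if with_suffix else 0.0
    return coverage, precision

# hypothetical toy data
toy_tagged = [
    ("reported", "VBD"), ("said", "VBD"), ("was", "VBD"), ("named", "VBN"),
    ("red", "JJ"), ("worked", "VBD"),
]
print(suffix_stats(toy_tagged, "VBD", "ed"))
```

In the toy data, 'named' (VBN) and 'red' (JJ) end in 'ed' without being VBD, which is the kind of error a bare suffix rule would make.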

### 2. Exploratory Analysis Continued

Let's now try observing some tag patterns, using the fact that some tags are more likely to appear after certain other tags. For example, nouns NN are often preceded by determiners DT ("The/DT constitution/NN"), adjectives JJ usually precede a noun NN ("A large/JJ building/NN"), etc.

Try answering the following questions:
1. What fraction of adjectives JJ are followed by a noun NN? 
2. What fraction of determiners DT are followed by a noun NN?
3. What fraction of modals MD are followed by a verb VB?
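One way to answer all three questions with the same code is to pair each tag with the tag that follows it using `zip`, which also avoids indexing past the end of the list. The helper name `followed_by_fraction` and the toy tag sequence are assumptions for illustration.

```python
def followed_by_fraction(tags, first, second):
    """Fraction of occurrences of `first` that are immediately followed by `second`."""
    pairs = zip(tags, tags[1:])                      # consecutive (tag, next_tag) pairs
    first_total = sum(1 for t in tags if t == first)
    matches = sum(1 for a, b in pairs if a == first and b == second)
    return matches / first_total if first_total else 0.0

# hypothetical toy tag sequence
toy_tags = ["DT", "JJ", "NN", "MD", "VB", "DT", "NN", "JJ", "JJ", "NNS"]
print(followed_by_fraction(toy_tags, "JJ", "NN"))
print(followed_by_fraction(toy_tags, "DT", "NN"))
print(followed_by_fraction(toy_tags, "MD", "VB"))
```

Note that on a flat tag list this counts pairs that straddle sentence boundaries as well.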

In [29]:
# question: what fraction of adjectives JJ are followed by a noun NN

# create a list of all tags (without the words)
tags = [tag for word,tag in tagged_words]

# create a list of JJ tags
jj_tags = [tag for tag in tags if tag=='JJ']

# create a list of (JJ, NN) tags
jj_nn_tags = [(tag,tags[index+1]) for index,tag in enumerate(tags) if tag=='JJ' and tags[index+1]=='NN']

print(len(jj_tags))

print(len(jj_nn_tags))
print(len(jj_nn_tags) / len(jj_tags))

5834
2611
0.4475488515598217


In [32]:
# question: what fraction of determiners DT are followed by a noun NN
dt_tags = [tag for tag in tags if tag=='DT']
dt_nn_tags = [(tag,tags[index+1]) for index,tag in enumerate(tags) if tag=='DT' and tags[index+1]=='NN']

print(len(dt_tags))
print(len(dt_nn_tags))
print(len(dt_nn_tags) / len(dt_tags))

8165
3844
0.470789957134109


In [34]:
# question: what fraction of modals MD are followed by a verb VB?
md_tags = [tag for tag in tags if tag=='MD']
md_vb_tags = [(tag,tags[index+1]) for index,tag in enumerate(tags) if tag=='MD' and tags[index+1]=='VB']

print(len(md_tags))
print(len(md_vb_tags))
print(len(md_vb_tags) / len(md_tags))

927
756
0.8155339805825242


# Lexicon based model

In [35]:
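# split into train and test sets
# note: random.seed() seeds Python's random module only; scikit-learn's
# train_test_split does not read it, so pass random_state=... for a repeatable split
random.seed(1234)
train_set, test_set = train_test_split(wsj, test_size=0.3)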
print(len(train_set))
print(len(test_set))

2739
1175


In [36]:
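# lexicon (unigram) tagger: tags each word with its most frequent tag in train_set
# note: evaluate() reports token-level accuracy; recent NLTK releases prefer accuracy()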
unigram_tagger = nltk.UnigramTagger(train_set)
unigram_tagger.evaluate(test_set)

0.8716698420899668

An accuracy of about 87% on the test set is pretty good for such a simple model.
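One limit on a unigram tagger is vocabulary: a word that never appeared in the training set has no entry in the lexicon. A quick way to gauge this is the share of test tokens that were unseen in training. The sketch uses hypothetical toy sentences in the same list-of-(word, tag) format as `train_set` and `test_set`.

```python
# hypothetical toy split, same shape as train_set / test_set
toy_train_set = [[('the', 'DT'), ('board', 'NN'), ('met', 'VBD'), ('.', '.')]]
toy_test_set = [[('the', 'DT'), ('director', 'NN'), ('resigned', 'VBD'), ('.', '.')]]

# vocabulary seen during training
train_vocab = {word for sent in toy_train_set for word, _ in sent}

# test tokens whose word form was never seen in training
test_words = [word for sent in toy_test_set for word, _ in sent]
unseen = [w for w in test_words if w not in train_vocab]
unseen_share = len(unseen) / len(test_words)

print(unseen)
print(unseen_share)
```

Every unseen token is one the lexicon alone cannot tag, which is where a fallback strategy becomes useful.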

# Rule based model

We will use regular expressions to determine which tag each word is mapped to.

In [37]:
patterns = [
    (r'.*ing$', 'VBG'),              # gerund
    (r'.*ed$', 'VBD'),               # past tense
    (r'.*es$', 'VBZ'),               # 3rd singular present
    (r'.*ould$', 'MD'),              # modals
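    # possessive nouns; note that 'NN$' is not one of the 46 Treebank tags
    # counted above, so tokens tagged by this rule are always scored as wrong
    (r'.*\'s$', 'NN$'),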
    (r'.*s$', 'NNS'),                # plural nouns
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'), # cardinal numbers
    (r'.*', 'NN')                    # nouns
]
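The order of the patterns matters, because the tag comes from the first pattern that matches. The sketch below imitates that first-match behaviour with plain `re`. The function `first_match_tag` and the shortened rule list are illustrative assumptions, not NLTK's implementation.

```python
import re

# a shortened, illustrative rule list in the same (regex, tag) format
toy_patterns = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*s$', 'NNS'),
    (r'.*', 'NN'),      # catch-all default
]

def first_match_tag(word, patterns):
    """Return the tag of the first pattern that matches the word."""
    for regex, tag in patterns:
        if re.match(regex, word):
            return tag
    return None

for w in ['running', 'reported', 'watches', 'shares', 'cars', 'board']:
    print(w, first_match_tag(w, toy_patterns))
```

Here 'shares' is tagged VBZ rather than NNS only because `.*es$` is listed before `.*s$`; swapping the two rules would flip that decision.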


In [40]:
regexp_tagger = nltk.RegexpTagger(patterns)

In [42]:
regexp_tagger.evaluate(train_set)


0.22010070435668277

In [43]:
regexp_tagger.evaluate(test_set)

0.2182000193754642

At roughly 22% on both the training and test sets, the rule-based tagger is weak on its own.

# Combining

We will combine the two taggers as follows:

Whenever the unigram model can't tag a word (because it doesn't occur in the training set), we fall back to the rule-based tagger.
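Conceptually, this is a chain of fallbacks: try the lexicon first and consult the rules only when the lexicon has no answer. Here is a pure-Python sketch with a hypothetical toy lexicon and rule list (again, not the NLTK implementation).

```python
import re

# hypothetical lexicon learned from training data
toy_lexicon = {'the': 'DT', 'board': 'NN', 'will': 'MD'}

# hypothetical fallback rules for words missing from the lexicon
toy_rules = [(r'.*ing$', 'VBG'), (r'.*ed$', 'VBD'), (r'.*', 'NN')]

def tag_with_backoff(word):
    """Use the lexicon when the word is known, otherwise the first matching rule."""
    if word in toy_lexicon:
        return toy_lexicon[word]
    for regex, tag in toy_rules:
        if re.match(regex, word):
            return tag
    return None

tagged = [(w, tag_with_backoff(w)) for w in ['the', 'board', 'restructuring', 'vetoed']]
print(tagged)
```

Known words keep their lexicon tag, while 'restructuring' and 'vetoed' are absent from the toy lexicon and get their tags from the suffix rules.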

In [45]:
regexp_tagger = nltk.RegexpTagger(patterns)

# without a backoff, words unseen in training get the tag None;
# here such words are passed to the rule-based tagger instead
combined_tagger = nltk.UnigramTagger(train_set, backoff = regexp_tagger)

combined_tagger.evaluate(test_set)

0.9047373009978364

That's much better: about 90% on the test set, up from about 87% for the unigram tagger alone.