## Rule-based tagging

There are several ways to assign part-of-speech (POS) tags to the words in a text. One is the lexicon-based approach, which relies on corpus statistics: each token receives the tag it was most frequently assigned in a tagged training corpus.

For instance, the word “run” is tagged as a verb if it appears in the training corpus as a verb more often than with any other tag.
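As a rough illustration of the idea, the sketch below builds a most-frequent-tag lexicon from a handful of made-up (word, tag) pairs. The data and the helper name `build_lexicon` are hypothetical and are not part of NLTK.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (hypothetical data, not taken from the Treebank).
toy_pairs = [
    ("run", "VB"), ("run", "VB"), ("run", "NN"),
    ("the", "DT"), ("the", "DT"),
    ("bank", "NN"),
]

def build_lexicon(pairs):
    """Map each lowercased word to the tag it carries most often in `pairs`."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

lexicon = build_lexicon(toy_pairs)
print(lexicon["run"])  # VB: seen twice as VB and once as NN
```

A word that never occurs in the training pairs has no entry in the lexicon, which is the gap the rules described next are meant to fill.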

Another approach is the rule-based method, which combines the lexicon-based approach with predefined rules. These rules cover cases the lexicon cannot handle on its own, such as words that never appeared in the training data.


### Reading the tagged corpus

Let’s get familiar with the tagged dataset by loading it from NLTK.

In [None]:
# Importing libraries
import nltk
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize

In [None]:
nltk.download('treebank')

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


True

In [None]:
# reading the Treebank tagged sentences
wsj = list(nltk.corpus.treebank.tagged_sents())
# samples: Each sentence is a list of (word, pos) tuples
wsj[:3]

In [None]:
# converting the list of sents to a list of (word, pos tag) tuples
tagged_words = [tup for sent in wsj for tup in sent]
print(len(tagged_words))
tagged_words[:10]

100676


[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT')]

## EDA

- How many unique POS tags does the corpus contain?
- Which tag is the most frequent in the corpus?
- Which tag is most commonly assigned to the word "bank"?
- Which tag is most commonly assigned to the word "executive"?

In [None]:
tags = [pair[1] for pair in tagged_words]
unique_tags = set(tags)
len(unique_tags)

46

In [None]:
from collections import Counter
tag_counts = Counter(tags)
tag_counts

In [None]:
tag_counts.most_common(5)

[('NN', 13166), ('IN', 9857), ('NNP', 9410), ('DT', 8165), ('-NONE-', 6592)]

In [None]:
bank = [pair for pair in tagged_words if pair[0].lower() == 'bank']
bank

In [None]:
executive = [pair for pair in tagged_words if pair[0].lower() == 'executive']
executive
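The two cells above list every occurrence of a word. To answer the "most commonly assigned tag" questions directly, the occurrences can be counted. The helper below is a minimal sketch with a hypothetical name (`most_common_tag`) and a made-up sample; in the notebook, `tagged_words` would be passed instead of `sample`.

```python
from collections import Counter

def most_common_tag(word, pairs):
    """Return (tag, count) for the tag most often paired with `word`, or None."""
    counts = Counter(tag for w, tag in pairs if w.lower() == word.lower())
    return counts.most_common(1)[0] if counts else None

# Hypothetical sample for illustration; not real Treebank counts.
sample = [("bank", "NN"), ("Bank", "NNP"), ("bank", "NN"), ("executive", "JJ")]
print(most_common_tag("bank", sample))
```

Note that the comparison is case-insensitive, matching the `.lower()` filter used in the cells above.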

## Lexicon and Rule-Based Models for POS Tagging

In [None]:
# Splitting into Train and Test Sets

train_set, test_set = train_test_split(wsj, test_size=0.3)
print(len(train_set))
print(len(test_set))
print(train_set[:2])


2739
1175
[[('She', 'PRP'), ('was', 'VBD'), ('untrained', 'JJ'), ('and', 'CC'), (',', ','), ('in', 'IN'), ('one', 'CD'), ('botched', 'JJ'), ('job', 'NN'), ('killed', 'VBD'), ('a', 'DT'), ('client', 'NN'), ('.', '.')], [('Similarly', 'RB'), (',', ','), ('Campbell', 'NNP'), ("'s", 'POS'), ('Italian', 'JJ'), ('biscuit', 'NN'), ('operation', 'NN'), (',', ','), ('D.', 'NNP'), ('Lazzaroni', 'NNP'), ('&', 'CC'), ('Co.', 'NNP'), (',', ','), ('has', 'VBZ'), ('been', 'VBN'), ('hurt', 'VBN'), ('*-40', '-NONE-'), ('by', 'IN'), ('overproduction', 'NN'), ('and', 'CC'), ('distribution', 'NN'), ('problems', 'NNS'), ('.', '.')]]


### Unigram lexicon tagger

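A unigram (lexicon) tagger looks at each word in isolation and assigns the tag that word carried most often in the training set. In NLTK, UnigramTagger() trains such a model.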

In [None]:
# Lexicon (or unigram tagger)
unigram_tagger = nltk.UnigramTagger(train_set)
unigram_tagger.accuracy(test_set)

0.8691994391400147
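The score above is a token-level accuracy: the fraction of test tokens whose predicted tag matches the gold tag. The sketch below recomputes that kind of score by hand on made-up data. The function `token_accuracy` and the toy lexicon are assumptions for illustration, not NLTK's implementation.

```python
def token_accuracy(tag_word, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        for word, gold_tag in sent:
            correct += tag_word(word) == gold_tag  # True counts as 1
            total += 1
    return correct / total if total else 0.0

# Hypothetical lexicon and gold sentence for illustration only.
toy_lexicon = {"the": "DT", "dog": "NN", "barks": "VBZ"}
gold = [[("the", "DT"), ("dog", "NN"), ("runs", "VBZ"), ("fast", "RB")]]

# Unknown words ("runs", "fast") get None from dict.get and count as errors.
print(token_accuracy(toy_lexicon.get, gold))  # 0.5
```

Under this measure, every word missing from the lexicon is an error, which is one reason a lexicon tagger with no fallback loses accuracy on unseen vocabulary.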

### Rule-Based (Regular Expression) Tagger

A rule-based tagger assigns tags using hand-written patterns. For example, we can specify regexes for grammatical forms such as gerunds, past tense verbs, third-person singular present verbs (e.g., creates, moves, makes), modal verbs (e.g., should, would, could), possessive nouns (e.g., partner’s, bank’s), plural nouns (e.g., banks, institutions), and cardinal numbers (CD). If none of these rules applies to a word, we assign it the most frequent tag, NN.

In [None]:
# specify patterns for tagging
# example from the NLTK book
patterns = [
    (r'.*ing$', 'VBG'),              # gerund
    (r'.*ed$', 'VBD'),               # past tense
    (r'.*es$', 'VBZ'),               # 3rd singular present
    (r'.*ould$', 'MD'),              # modals
    (r'.*\'s$', 'NN$'),              # possessive nouns (NN$ is a Brown-style tag; the Treebank tags "'s" as POS)
    (r'.*s$', 'NNS'),                # plural nouns
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'), # cardinal numbers
    (r'.*', 'NN')                    # nouns
]
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.accuracy(test_set)

0.21816785738131803
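A pattern list like the one above is tried from top to bottom, and the first regex that matches decides the tag, so the order of the rules matters. The sketch below mimics that behavior with the standard `re` module on a shortened, hypothetical pattern list. It is an illustration of the idea, not NLTK's own code.

```python
import re

# A shortened, hypothetical pattern list in the same (regex, tag) format.
toy_patterns = [
    (r".*ing$", "VBG"),  # gerund
    (r".*es$", "VBZ"),   # 3rd singular present
    (r".*s$", "NNS"),    # plural nouns
    (r".*", "NN"),       # catch-all default
]

def regex_tag(word, patterns):
    """Return the tag of the first pattern that matches `word`."""
    for regex, tag in patterns:
        if re.match(regex, word):
            return tag
    return None  # only reached if there is no catch-all pattern

for w in ["making", "makes", "banks", "boxes", "bank"]:
    print(w, regex_tag(w, toy_patterns))
```

Here "boxes" is tagged VBZ because the `.*es$` rule fires before the plural rule. Suffix rules misfire in this way fairly often, which may help explain why a regex-only tagger scores so much lower than the lexicon tagger.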

### Combining the Rule-Based Tagger with the Lexicon Tagger

NLTK provides a convenient way to combine taggers through the ‘backoff’ argument. In the following code, we create a regex tagger to act as a backoff for the lexicon tagger. In other words, if the lexicon tagger cannot tag a word (e.g., a new word not in its vocabulary), it falls back on the rule-based tagger to assign a tag. Note that the rule-based tagger itself falls back on the ‘NN’ tag through its final catch-all pattern.

In [None]:
# rule based tagger
rule_based_tagger = nltk.RegexpTagger(patterns)
# lexicon backed up by the rule-based tagger
lexicon_tagger = nltk.UnigramTagger(train_set, backoff=rule_based_tagger)
lexicon_tagger.accuracy(test_set)

0.9048541096347733
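The backoff chain can also be pictured without NLTK. The sketch below is a simplified, hypothetical version of the lookup order (lexicon first, then regex rules, then a default tag). The function name, lexicon, and rules are made up for illustration and do not reproduce NLTK's internals.

```python
import re

def make_backoff_tagger(lexicon, patterns, default="NN"):
    """Tag from the lexicon first, then regex rules, then a default tag."""
    def tag_word(word):
        if word in lexicon:              # 1. known word: the lexicon decides
            return lexicon[word]
        for regex, tag in patterns:      # 2. unknown word: first matching rule
            if re.match(regex, word):
                return tag
        return default                   # 3. no rule matched
    return tag_word

# Hypothetical lexicon and rules for illustration only.
tagger = make_backoff_tagger(
    lexicon={"the": "DT", "bank": "NN"},
    patterns=[(r".*ing$", "VBG"), (r".*ed$", "VBD")],
)
print([(w, tagger(w)) for w in ["the", "bank", "merging", "zebra"]])
```

Known words never reach the rules, so the rules only need to be reasonable guesses for unseen vocabulary. That division of labor is consistent with the combined tagger scoring higher than either tagger alone.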