<a href="https://colab.research.google.com/github/satuelisa/NLPF/blob/main/NLPF_04_P.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install gutenbergpy # do not redo this if on your own computer, we already used it in Session 1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
import gutenbergpy.textget # let's reuse what we did in Session 1 to get some example text
target = 2701 # this one is Herman Melville's Moby Dick
raw  = gutenbergpy.textget.get_text_by_id(target) # content
text = gutenbergpy.textget.strip_headers(raw) # remove header
s = text.decode("utf-8") # get a string from the byte sequence
content = s[s.rindex('CHAPTER 1. Loomings'):] # using the LAST time the string appears since it is rindex
import re 
nodigits = re.sub(r'[0-9]+', '', content)
clean = re.compile(r'\s+') 
ok = clean.sub(' ', nodigits)
potential = ok.split('CHAPTER ')
stripped = [ candidate.strip() for candidate in potential ] # remove leading and trailing space
chapter = [ s for s in stripped if len(s) > 0 ] # keep only the ones with content
print(len(chapter), ' chapters')
print(chapter[0][:100])

149  chapters
. Loomings. Call me Ishmael. Some years ago—never mind how long precisely—having little or no money 


OK, time to tag the parts of speech.

In [3]:
import nltk
nltk.download('punkt') # we need the Punkt sentence-splitting models in order to tokenize
nltk.download('averaged_perceptron_tagger') # and the tagger, too (just do these once per computer)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [22]:
tokens = nltk.word_tokenize(chapter[0])
tags = nltk.pos_tag(tokens)
print(tags[:12])
for special in ['hand', 'stream', 'deliberate', 'bathed', 'peep']:
  print([(word, tag) for (word, tag) in tags if word == special])

[('.', '.'), ('Loomings', 'NNS'), ('.', '.'), ('Call', 'VB'), ('me', 'PRP'), ('Ishmael', 'NNP'), ('.', '.'), ('Some', 'DT'), ('years', 'NNS'), ('ago—never', 'RB'), ('mind', 'VB'), ('how', 'WRB')]
[('hand', 'NN'), ('hand', 'NN'), ('hand', 'NN')]
[('stream', 'NN'), ('stream', 'NN')]
[('deliberate', 'VB')]
[('bathed', 'VBN')]
[('peep', 'NN')]


A partial list of what those tag acronyms mean is available in [Table 5-1 of the Python textbook](https://learning.oreilly.com/library/view/natural-language-processing/9780596803346/ch05s02.html#tab-simplified-tagset).
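
For quick reference, the tags that appear in the output above can be collected in a small lookup table. This is only a sketch: the names `TAG_MEANINGS` and `describe` are made up for illustration, and the glosses are short paraphrases of the Penn Treebank tag set that the default NLTK tagger uses.

```python
# Short glosses for the Penn Treebank tags seen in the output above.
# TAG_MEANINGS and describe() are illustrative names, not part of NLTK.
TAG_MEANINGS = {
    'NN': 'noun, singular or mass',
    'NNS': 'noun, plural',
    'NNP': 'proper noun, singular',
    'VB': 'verb, base form',
    'VBN': 'verb, past participle',
    'PRP': 'personal pronoun',
    'DT': 'determiner',
    'RB': 'adverb',
    'WRB': 'wh-adverb',
    'JJ': 'adjective',
    '.': 'sentence-final punctuation',
}

def describe(tagged):
    """Attach a gloss to each (word, tag) pair; unknown tags get '?'."""
    return [(word, tag, TAG_MEANINGS.get(tag, '?')) for (word, tag) in tagged]

for row in describe([('Call', 'VB'), ('me', 'PRP'), ('Ishmael', 'NNP')]):
    print(row)
```

With the notebook variables in scope, `describe(tags[:12])` would annotate the first twelve tokens of the chapter in the same way.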

Note that we now have information on which words are names, such as *Ishmael*, which got tagged as NNP.
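
As a small illustration of using that information, the names can be pulled out by filtering on the tag. The list `sample_tags` below is copied by hand from the output above so that the snippet runs on its own; the variable names are only for illustration.

```python
# A few (word, tag) pairs copied from the tagger output above.
sample_tags = [('Call', 'VB'), ('me', 'PRP'), ('Ishmael', 'NNP'),
               ('.', '.'), ('Some', 'DT'), ('years', 'NNS')]

# Penn Treebank marks proper nouns as NNP (singular) or NNPS (plural).
names = [word for (word, tag) in sample_tags if tag in ('NNP', 'NNPS')]
print(names)
```

Replacing `sample_tags` with the full `tags` list would collect every token of the chapter that the tagger considered a proper noun, including any mistakes it made.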

So, how could the tagger know that a word like *call* is a verb here instead of a noun? From *context*: the other words that surround it. The concept of "other surrounding words" brings us to *bigrams* (two-word sequences) and *n*-grams in general. Chapter 5 of the Python textbook discusses how to train taggers based on those.
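
To make the idea of *n*-grams concrete, here is a minimal plain-Python sketch that slides a window of length *n* over a token list. The helper name `ngrams` is chosen for illustration; NLTK ships comparable utilities (`nltk.bigrams` and `nltk.ngrams`), which would be the natural choice in practice.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest shifted copy, so incomplete windows are dropped
    return list(zip(*(tokens[i:] for i in range(n))))

words = ['Call', 'me', 'Ishmael', '.']
print(ngrams(words, 2))  # bigrams
print(ngrams(words, 3))  # trigrams
```

A bigram tagger looks at pairs like these (combined with the tag already assigned to the previous word) to decide which tag is most likely for the current word.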

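A classic way to deal with context in tagging is *transformation-based learning*: start from a simple initial tagging and learn rules that correct it based on the surrounding words and tags. One example of this is the *Brill tagger*. Let's try that.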
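
To give a feel for how a Brill-style tagger uses context: it begins with a crude initial tagging and then applies an ordered list of rewrite rules of the form "change tag A to tag B when the previous tag is C". The sketch below applies hand-written rules of that shape in plain Python. The function `apply_rules`, the rules, and the example sentence are all invented for illustration; a real Brill tagger learns its rules from a tagged corpus and supports richer rule templates.

```python
def apply_rules(tagged, rules):
    """Apply each (old, new, prev) rule in order across the whole sentence."""
    tags = [tag for (_, tag) in tagged]
    for (old, new, prev) in rules:
        for i in range(1, len(tags)):
            # rewrite the tag only when the preceding tag matches the rule
            if tags[i] == old and tags[i - 1] == prev:
                tags[i] = new
    return [(word, tag) for ((word, _), tag) in zip(tagged, tags)]

# Initial guess: a few function words are known, everything else defaults to NN.
initial = [('I', 'PRP'), ('want', 'NN'), ('to', 'TO'), ('call', 'NN'), ('him', 'PRP')]

# Hand-written rules: (old tag, new tag, required previous tag).
rules = [('NN', 'VB', 'TO'),    # a "noun" right after "to" is probably a base-form verb
         ('NN', 'VBP', 'PRP')]  # a "noun" right after a pronoun is probably a present-tense verb

print(apply_rules(initial, rules))
```

The order of the rules matters, since a later rule sees the tags as already rewritten by the earlier ones.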


In [5]:
import nltk.tag
from nltk.tag import brill # documentation at https://www.nltk.org/api/nltk.tag.brill.html
import nltk.corpus, itertools
nltk.download('brown') # an example corpus, just download once on your machine

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

We will now combine elements of Chapter 5 of the Python textbook with those from [an online tutorial](https://www.geeksforgeeks.org/nlp-brill-tagger/) to build an example. [Jacob's blog](https://streamhacker.com/2008/11/03/part-of-speech-tagging-with-nltk-part-1/) may also be informative, although not all of the code fragments in the tutorial or the blog are self-sufficient or compatible with the latest versions of the libraries involved.

In [16]:
from nltk.tag import brill, brill_trainer
from nltk.tag import DefaultTagger
from nltk.corpus import brown

# specify a model to train
init = DefaultTagger('NN') # a default starting point
templates = nltk.tag.brill.nltkdemo18() # use the demo templates
trainer = brill_trainer.BrillTaggerTrainer(init, templates)
data = brown.tagged_sents(categories = 'news') # some training data
n = len(data)
sample = int(n * 0.1) # the smaller the faster (but of course also worse in general)
print(n, '->', sample)
tiny = 5 # just a few rules so that this will not take too long
tagger = trainer.train(data[:sample], max_rules = tiny) # train the model with the data

4623 -> 462


Let's try it out (tweak the proportion in `sample` and the number of rules in `tiny` for improved results, albeit with increased training time).

In [21]:
ttags = tagger.tag(tokens)
part = 15
print('NEW', ttags[:part])
print('OLD', tags[:part])

NEW [('.', 'NN'), ('Loomings', 'NN'), ('.', 'NN'), ('Call', 'NN'), ('me', 'NN'), ('Ishmael', 'NN'), ('.', 'NN'), ('Some', 'NN'), ('years', 'NN'), ('ago—never', 'NN'), ('mind', 'NN'), ('how', 'NN'), ('long', 'NN'), ('precisely—having', 'NN'), ('little', 'NN')]
OLD [('.', '.'), ('Loomings', 'NNS'), ('.', '.'), ('Call', 'VB'), ('me', 'PRP'), ('Ishmael', 'NNP'), ('.', '.'), ('Some', 'DT'), ('years', 'NNS'), ('ago—never', 'RB'), ('mind', 'VB'), ('how', 'WRB'), ('long', 'JJ'), ('precisely—having', 'JJ'), ('little', 'JJ')]


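Note that the results are clearly different, but not in favor of the new tagger: trained on only 462 sentences and limited to five rules, the Brill tagger still assigns the default tag NN to every token shown. The pretrained tagger used earlier, in contrast, recognizes that *Call* is a verb in this context instead of a noun. The same applies to *mind*.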
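
To put a number on how different the two outputs are, one option is to count the positions at which both taggers assign the same tag. The helper `agreement` below is a hypothetical name for this sketch, and the two short lists are copied by hand from the outputs above so that the snippet runs on its own.

```python
def agreement(a, b):
    """Fraction of positions at which two taggings of the same tokens agree."""
    assert len(a) == len(b), 'taggings must cover the same tokens'
    if not a:
        return 0.0  # nothing to compare
    same = sum(1 for ((_, x), (_, y)) in zip(a, b) if x == y)
    return same / len(a)

# First few pairs copied from the two outputs above.
new = [('.', 'NN'), ('Loomings', 'NN'), ('.', 'NN'), ('Call', 'NN'), ('me', 'NN')]
old = [('.', '.'), ('Loomings', 'NNS'), ('.', '.'), ('Call', 'VB'), ('me', 'PRP')]
print(agreement(new, old))
```

In the notebook itself, `agreement(ttags, tags)` would compare the two taggings of the whole chapter. Keep in mind that this measures agreement between two taggers, not accuracy: neither output is a hand-checked reference.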