# TOC

  __Chapter 3 - Part of speech tagging__

1. [Import](#Import)
1. [Part of speech tagging](#Part-of-speech-tagging)
    1. [Stanford tagger](#Stanford-tagger)
    1. [N-gram tagger](#N-gram-tagger)
    1. [Regex tagger](#Regex-tagger)
1. [Named entity recognition (NER)](#Named-entity-recognition-(NER))

# Import

<a id = 'Import'></a>

In [1]:
# Standard library and settings
import os
import sys
import importlib
import itertools
import warnings

warnings.simplefilter("ignore")
from IPython.display import display, HTML

display(HTML("<style>.container { width:95% !important; }</style>"))

# Data extensions and settings
import numpy as np

np.set_printoptions(threshold=np.inf, suppress=True)
import pandas as pd

pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)
pd.options.display.float_format = "{:,.6f}".format

# Modeling extensions
import nltk

# Visualization extensions and settings
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
sns.set_style("whitegrid")

# Part of speech tagging

Part of speech (POS) tagging is the process of labeling each word in a text as a noun, adjective, verb, and so on. Current algorithms can predict POS tags with roughly 97 percent accuracy.

<a id = 'Part-of-speech-tagging'></a>

In [15]:
# identify POS in a simple sentence
from nltk import word_tokenize

s = "I was watching TV"
print(nltk.pos_tag(word_tokenize(s)))

[('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')]


In [17]:
# return all nouns; every noun tag (NN, NNS, NNP, NNPS) starts with "NN"
tagged = nltk.pos_tag(word_tokenize(s))
allnoun = [word for word, pos in tagged if pos.startswith("NN")]
allnoun

['TV']

In [18]:
# frequency distribution of POS tags in the news category of the Brown corpus
from nltk.corpus import brown

tags = [tag for (word, tag) in brown.tagged_words(categories="news")]
print(nltk.FreqDist(tags))

<FreqDist with 218 samples and 100554 outcomes>


In [20]:
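# baseline: tag every token as NN and score the result against the Brown corpus
# (recent NLTK releases deprecate evaluate() in favor of accuracy())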
brown_tagged_sents = brown.tagged_sents(categories="news")
default_tagger = nltk.DefaultTagger("NN")
print(default_tagger.evaluate(brown_tagged_sents))

0.13089484257215028
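
The low baseline follows from how accuracy is computed: a default tagger is right only on tokens whose gold tag happens to be the default. A minimal pure-Python sketch of that calculation, assuming a small hand-made sample (hypothetical, not drawn from the Brown corpus) and a hypothetical helper `default_accuracy`:

```python
# Hypothetical hand-tagged sample; not taken from the Brown corpus.
tagged_sents = [
    [("the", "AT"), ("dog", "NN"), ("barked", "VBD")],
    [("a", "AT"), ("cat", "NN"), ("saw", "VBD"), ("the", "AT"), ("bird", "NN")],
]


def default_accuracy(sents, default_tag):
    """Share of tokens whose gold tag equals default_tag."""
    gold = [tag for sent in sents for _, tag in sent]
    correct = sum(1 for tag in gold if tag == default_tag)
    return correct / len(gold)


# 3 of the 8 tokens are tagged NN
print(default_accuracy(tagged_sents, "NN"))
```

In this sample the score is simply the fraction of tokens tagged `NN`, which is why the default tagger serves as a floor for the taggers that follow.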


## Stanford tagger

To run the cell below, download the Stanford tagger from http://nlp.stanford.edu/software/stanford-postagger-full-2014-08-27.zip. Extract the archive into a folder, then pass the absolute paths of the model file and the jar to the tagger constructor.

<a id = 'Stanford-tagger'></a>

In [None]:
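# tag the sample sentence with the Stanford POS tagger (requires Java and the downloaded files)
# recent NLTK releases name this class StanfordPOSTagger
from nltk.tag.stanford import StanfordPOSTagger

# replace both paths with the absolute paths on your machine
stanford = StanfordPOSTagger(
    "models/english-bidirectional-distsim.tagger", "stanford-postagger.jar"
)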
tokens = nltk.word_tokenize(s)
print(stanford.tag(tokens))

## N-gram tagger

UnigramTagger considers the conditional frequency of tags for each token and simply predicts the most frequent tag for that token.

BigramTagger considers the current word together with the tag of the previous word. TrigramTagger considers the current word together with the tags of the two previous words.

Generally speaking, a trigram tagger is more precise on the contexts it has seen but covers fewer tokens, because longer contexts are sparser in the training data. The opposite is true for a unigram tagger.
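
The lookup-and-fall-back idea can be sketched in pure Python. This is an illustration under stated assumptions rather than NLTK's implementation: the toy training data and the helpers `train_unigram`, `train_bigram` and `tag_sentence` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data (not from the Brown corpus).
train = [
    [("I", "PRP"), ("can", "MD"), ("swim", "VB")],
    [("the", "DT"), ("can", "NN"), ("fell", "VBD")],
    [("I", "PRP"), ("can", "MD"), ("run", "VB")],
]


def train_unigram(sents):
    """Map each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def train_bigram(sents):
    """Map each (previous tag, word) context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in sents:
        prev = None  # marks the start of a sentence
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def tag_sentence(words, bigram, unigram, default="NN"):
    """Try the bigram context first, then the unigram table, then the default tag."""
    tagged, prev = [], None
    for word in words:
        tag = bigram.get((prev, word)) or unigram.get(word) or default
        tagged.append((word, tag))
        prev = tag
    return tagged


unigram = train_unigram(train)
bigram = train_bigram(train)
print(tag_sentence(["the", "can", "run"], bigram, unigram))
```

Here the unigram table alone would tag "can" as `MD`, its most frequent tag, while the bigram context `(DT, can)` yields `NN`. The context `(NN, run)` was never seen in training, so "run" falls back to the unigram table.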

<a id = 'N-gram-tagger'></a>

In [26]:
# utilize unigram, bigram and trigram tagger
from nltk.tag import UnigramTagger, DefaultTagger, BigramTagger, TrigramTagger

train_data = brown_tagged_sents[: int(len(brown_tagged_sents) * 0.9)]
test_data = brown_tagged_sents[int(len(brown_tagged_sents) * 0.9) :]

# combine taggers: each tagger uses the previous tagger as its backoff, which is invoked when a tag cannot be predicted
# this helps balance the tradeoff between precision and recall
unigram_tagger = UnigramTagger(train_data, backoff=default_tagger)
print(unigram_tagger.evaluate(test_data))

bigram_tagger = BigramTagger(train_data, backoff=unigram_tagger)
print(bigram_tagger.evaluate(test_data))

trigram_tagger = TrigramTagger(train_data, backoff=bigram_tagger)
print(trigram_tagger.evaluate(test_data))

0.8361407355726104
0.8452108043456593
0.843317053722715


## Regex tagger

The regex tagger assigns tags from hand-written patterns. It can be useful when a text contains many domain-specific words that are not covered by the corpora behind NLTK's stock taggers.
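
A minimal sketch of first-match-wins pattern tagging, written with the standard `re` module rather than NLTK. The rules and the helper `regex_tag` are hypothetical, chosen only for illustration:

```python
import re

# Hypothetical suffix rules; the first matching pattern wins.
patterns = [
    (r".*able$", "JJ"),  # adjectives such as "readable"
    (r".*ness$", "NN"),  # nouns such as "kindness"
    (r".*ing$", "VBG"),  # gerunds such as "running"
]
compiled = [(re.compile(pattern), tag) for pattern, tag in patterns]


def regex_tag(words, default=None):
    """Tag each word with the first rule that matches, else the default."""
    tagged = []
    for word in words:
        tag = next((t for rx, t in compiled if rx.match(word)), default)
        tagged.append((word, tag))
    return tagged


print(regex_tag(["readable", "kindness", "tablets", "running"]))
```

The trailing `$` matters: `re.match` anchors a pattern at the start of the word but not at the end, so `.*able` without it would also match "tablets".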


<a id = 'Regex-tagger'></a>

In [28]:
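# tag by suffix: "-able" as adjective (JJ), "-ness" as noun (NN); unmatched tokens get no tag
# patterns are matched from the start of each token; append "$" to require a true suffix
from nltk.tag.sequential import RegexpTagger

regex_tagger = RegexpTagger([(r".*able", "JJ"), (r".*ness", "NN")])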
print(regex_tagger.evaluate(test_data))

0.0026911193062892453


# Named entity recognition (NER)

Named entities are typically people, locations and organizations, among others.


<a id = 'Named-entity-recognition-(NER)'></a>

In [31]:
# Basic named entity tagging
from nltk import ne_chunk

sent = "Mark is studying at Stanford University in California"
print(ne_chunk(nltk.pos_tag(word_tokenize(sent)), binary=False))

(S
  (PERSON Mark/NNP)
  is/VBZ
  studying/VBG
  at/IN
  (ORGANIZATION Stanford/NNP University/NNP)
  in/IN
  (GPE California/NNP))
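
To work with the entities rather than the printed tree, the labeled subtrees can be collected into a list. The sketch below rebuilds the tree from the printed output above so that it runs without any NLTK data downloads; the helper `extract_entities` is hypothetical, and with a live `ne_chunk` result the same function should apply directly, since its leaves are `(word, tag)` tuples as well.

```python
from nltk.tag import str2tuple
from nltk.tree import Tree

# Rebuild the printed tree; read_leaf turns "Mark/NNP" into ("Mark", "NNP").
printed = """(S (PERSON Mark/NNP) is/VBZ studying/VBG at/IN
  (ORGANIZATION Stanford/NNP University/NNP) in/IN (GPE California/NNP))"""
tree = Tree.fromstring(printed, read_leaf=str2tuple)


def extract_entities(chunked):
    """Return (label, text) pairs for each entity subtree directly under the root."""
    entities = []
    for child in chunked:
        if isinstance(child, Tree):  # plain (word, tag) tuples are not entities
            text = " ".join(word for word, _ in child.leaves())
            entities.append((child.label(), text))
    return entities


print(extract_entities(tree))
```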


In [33]:
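# NLTK / Stanford NER: the Stanford NER jar and a classifier model must be downloaded first.
from nltk.tag.stanford import StanfordNERTagger

# the constructor requires the model file (and the jar unless it is on the CLASSPATH);
# both paths below are placeholders, replace them with the paths on your machine
st = StanfordNERTagger(
    "classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar"
)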