<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/19_Parts_of_Speech.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Parts of Speech**

This notebook discusses a very important lexical property: parts of speech. We have thus far dealt with searching and manipulating text based on the orthographic features of words (i.e., their written forms). That means we have remained focused on the forms of the words (i.e., types and tokens). This has given us the ability to measure basic yet important aspects of texts: length, word frequency, and also some considerations about lexical diversity.

You probably have some familiarity with how words are classified into different lexical and syntactic categories, such as nouns, verbs, adjectives, pronouns, etc. These categories are called *parts of speech* (POS), and can be used as another source of information which can be exploited during linguistic analysis of texts.

For NLP and Computational Linguistics, it is common to see reference made to POS **Tags**, which are essentially labels or annotations associated with a word to represent more information about that word. These tags can be counted and compared, and also provide critical information for building and understanding grammars of languages.

Let's explore how to tag texts and then use these tags to query information about texts.

## Using `nltk.pos_tag()` to automatically tag texts

Fortunately for us, NLTK includes a function which will automatically assign part of speech tags to a text. To use this function, we need to import NLTK and download some additional resources.

In [None]:
# import NLTK and download the necessary resources
import nltk
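# download resources for tokenizing and tagging
# ('punkt_tab' and 'averaged_perceptron_tagger_eng' are the resource names that newer NLTK releases look for)
nltk.download(['punkt', 'punkt_tab', 'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng', 'tagsets', 'treebank', 'universal_tagset', 'book'])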


The NLTK function expects a list of tokens and is used like this:

> `nltk.pos_tag(tokens)`

The result will be a list of `(word, tag)` pairs, each of which is a tuple.

The next three cells demonstrate an example of how to POS tag a text using nltk:

In [None]:
# step 1: have some text
rant = "You know, we're living in a society! We're supposed to act in a civilized way."

In [None]:
# step 2: tokenize
rant_tokens = nltk.word_tokenize(rant)

In [None]:
# step 3: tag
rant_pos = nltk.pos_tag(rant_tokens)

# look at the resulting (word, tag) pairs
rant_pos

POS tagging has created a list of tuples of our words, with `(word, tag)` format. The tags are informative and indeed go beyond broader word categories such as NOUN and VERB. For example,

```
VBP = verb, present tense, not 3rd person singular
```

while

```
VBG = verb, present participle or gerund.
```

These tags are from the Penn Treebank tagset, which is a very commonly used set of POS tags. You can see the full list by running the following cell or by [going to this link](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

Take a moment to scroll through these tags and explore their explanations and examples.

In [None]:
# full list of tags, with definitions and examples
nltk.help.upenn_tagset()

You can also look up one specific tag by supplying the tag as a string to `nltk.help.upenn_tagset()`:

In [None]:
# what is the NNP tag?
nltk.help.upenn_tagset('NNP')

### **Discussion**

Part of speech tags help make sense of words in the context of other words. Consider this example from the NLTK book - what is the difference in use for the instances of *refuse* and *permit*?

Think back to what we know about bigrams. Are there any clues provided by the words which come *before* refuse/permit that might facilitate tagging of the proper part of speech?

In [None]:
nltk.pos_tag(nltk.word_tokenize("They refuse to permit us to obtain the refuse permit"))

We can supply our own examples as well — let us compare two uses of the same word "comb":

In [None]:
# what pos tag does the word "comb" have in this example?
nltk.pos_tag(nltk.word_tokenize('Quick, comb the desert for droids!'))

In [None]:
# and what pos tag does the word "comb" have in this example?
nltk.pos_tag(nltk.word_tokenize('Where is my comb?'))

So, adding POS tags provides more information about a text, which becomes useful for more advanced NLP applications such as information extraction, text prediction, and so on. Because the tags are stored as strings, you can use your knowledge of Python to search or filter through the list in order to find specific words associated with specific tags.
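
As a minimal sketch of that kind of filtering, the cell below works on a hand-written list of `(word, tag)` pairs in the same shape that `nltk.pos_tag()` returns. The pairs are typed in by hand for illustration (they are not captured tagger output), and the variable names are hypothetical.

```python
# Hand-written (word, tag) pairs shaped like the output of nltk.pos_tag().
# The tags are illustrative, not actual tagger output.
tagged = [
    ("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
    ("us", "PRP"), ("to", "TO"), ("obtain", "VB"),
    ("the", "DT"), ("refuse", "NN"), ("permit", "NN"),
]

# Keep only the words whose tag is exactly 'NN'.
nouns = [word for (word, tag) in tagged if tag == "NN"]

# Penn verb tags all begin with 'VB', so a prefix test catches every verb form.
verbs = [word for (word, tag) in tagged if tag.startswith("VB")]

print(nouns)
print(verbs)
```

The same two comprehensions should work unchanged on a real tagged list such as `rant_pos`.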



### **Your Turn**

- Explore using `nltk.pos_tag()` on some texts.
- See if you can understand the different POS tags and what they mean about the words.
- Can you "break" the tagger or have it produce inaccurate results?
- The tagger has a rule that if it does not know the tag for a word, it will automatically assign a default POS tag. Can you figure out what this default tag is?

# Bigrams and Parts of Speech

The NLTK book reviews the major parts of speech, with examples. Regardless of what you think you might know (or not know) about grammar/language, you should carefully read these sections. Look at the patterns associated with different parts of speech – these patterns are crucial for training taggers. This is evidenced in the example showing how bigrams of POS tags reflect typical English word order. We can try the same with our own example.

Let's use `nltk.bigrams()` on a list of tagged tokens; this will create a sequence of `((word, tag), (word, tag))` pairs.

In [None]:
# create bigrams of our pos tagged example
rant_bigrams = list(nltk.bigrams(rant_pos))

# inspect the bigrams
rant_bigrams

Now that we have the part of speech information included with our words, we can shift our search patterns away from the orthographic forms of words and toward their parts of speech. This allows us to find more abstract patterns in language associated with word *categories* rather than with the forms of words themselves.

For instance, let's look for all words in our example which come before nouns. This requires a bit of indexing, because we are looping through two pairs nested within a single tuple:

```
((word, tag), (word, tag))
```

So to access the word in the first pair, we would first index the larger tuple using `[0]` to get the first `(word, tag)` pair, then index that pair using `[0]` to get the first part of `(word, tag)`, which would be the word. This is demonstrated in the next code cell.

In [None]:
for i in rant_bigrams:
  print(i[0][0])  # index the first nested tuple, then index that tuple's first value

Another strategy would be to follow the NLTK book's guide and unpack each bigram into two named pairs in the loop itself, allowing you to refer to each pair in a more transparent way.

```
[a for (a,b) in bigrams]
```

In [None]:
# you can select the first (word, tag) pair of each bigram
[a for (a, b) in rant_bigrams]

In [None]:
# or the second
[b for (a,b) in rant_bigrams]
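
Unpacking can also go one level deeper, so that each word and tag gets its own name and no numeric indexing is needed. The sketch below uses two hand-written bigrams shaped like the items in `rant_bigrams`. The tags are illustrative rather than real tagger output, and names such as `sample_bigrams` are hypothetical.

```python
# Hand-written bigrams of (word, tag) pairs, shaped like the items in rant_bigrams.
# The tags are illustrative, not actual tagger output.
sample_bigrams = [
    (("in", "IN"), ("a", "DT")),
    (("a", "DT"), ("society", "NN")),
]

# Unpack both inner tuples directly in the loop header.
for (word1, tag1), (word2, tag2) in sample_bigrams:
    print(word1, tag1, "->", word2, tag2)

# The same unpacking works inside a list comprehension.
word_pairs = [(word1, word2) for (word1, tag1), (word2, tag2) in sample_bigrams]
print(word_pairs)
```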

Let's steal the example from NLTK and find all the words which precede nouns in this example:

In [None]:
noun_preceders = [a for (a, b) in rant_bigrams if b[1] == 'NN']
noun_preceders
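
Note that the test `b[1] == 'NN'` matches only singular common nouns. The sketch below shows one possible variation: a prefix test that also catches plural and proper nouns, followed by a count of the tags that precede them. The tagged sentence is hand-written for illustration (not tagger output), and `zip()` stands in for `nltk.bigrams()` so that the cell runs without NLTK.

```python
from collections import Counter

# Hand-written (word, tag) pairs; the tags are illustrative, not tagger output.
tagged = [
    ("We", "PRP"), ("live", "VBP"), ("in", "IN"), ("a", "DT"),
    ("society", "NN"), ("with", "IN"), ("many", "JJ"), ("rules", "NNS"),
]

# Zipping a list with itself shifted by one yields the same adjacent pairs as nltk.bigrams().
bigrams = list(zip(tagged, tagged[1:]))

# 'NN' alone misses NNS, NNP and NNPS; a prefix test covers all Penn noun tags.
noun_preceders = [a for (a, b) in bigrams if b[1].startswith("NN")]

# Count which tags appear immediately before a noun.
preceding_tags = Counter(tag for (word, tag) in noun_preceders)

print(noun_preceders)
print(preceding_tags.most_common())
```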

### **Your Turn**

What do you notice about the words which come before nouns?

- Apply the strategy above to some longer texts of your choice (e.g., you could load in Brown?)
- Do you find the same words appearing in front of nouns? What patterns are you noticing?
- Do the findings make sense in terms of what you know about nouns?

# Frequency distributions and POS tags

The NLTK book demonstrates that we can use frequency distributions to find the most common words associated with a particular part of speech.

They do so by using a [tagged corpus](https://catalog.ldc.upenn.edu/LDC99T42) made up of articles from the *Wall Street Journal*. The corpus itself is annotated with the Penn Treebank tags we've been looking at thus far, but NLTK can map those tags onto a simpler, coarser set of categories known as the [universal tag set](https://universaldependencies.org/u/pos/).

We can access this corpus through NLTK.

In [None]:
# load in Penn Treebank corpus using universal pos tags (they are simpler)
wsj = nltk.corpus.treebank.tagged_words(tagset = 'universal')
wsj

The corpus is tagged, so if we use `nltk.FreqDist()`, we will get a frequency distribution of `(word, tag)` pairs.

In [None]:
# create a frequency distribution of the pairs
word_tag_fd = nltk.FreqDist(wsj)

# not surprisingly, the most common pairs are punctuation and function words.
word_tag_fd.most_common(10)

We can then run a conditional test on the freqdist to find most common words of a certain category, such as finding the most common verbs.

To do so, we run a list comprehension with a conditional test over the results of the frequency distribution. Note that in the loop we specify the `((word, tag),freq)` nature of each item being iterated over.

In [None]:
# ask for just verbs (calling most_common() with no argument returns every item)
# I included a slice to 25 just so it doesn't spam the screen
[word for ((word, tag), freq) in word_tag_fd.most_common() if tag == 'VERB'][:25]
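
The same idea can be turned around to count the tags themselves rather than the words. The sketch below uses `collections.Counter` (the standard-library class that `nltk.FreqDist` is built on) over a small hand-written sample, so it runs without downloading the corpus. The sample pairs are made up for illustration and are not drawn from `wsj`.

```python
from collections import Counter

# Hand-written (word, tag) pairs using universal-style tags; not taken from the corpus.
tagged = [
    ("The", "DET"), ("company", "NOUN"), ("said", "VERB"),
    ("the", "DET"), ("yield", "NOUN"), ("rose", "VERB"), (".", "."),
]

# Count only the tag from each pair, ignoring the word.
tag_counts = Counter(tag for (word, tag) in tagged)

print(tag_counts.most_common())
```

With the real corpus, passing the same generator expression over `wsj` to `nltk.FreqDist` should give the overall tag distribution.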

## Conditional Freq Dist and POS Tags

Finding the most common verbs is interesting, but we can also use a conditional frequency distribution to find the frequency of specific words among different POS tags. This allows us to find the frequency with which certain words appear under different parts of speech.

The ConditionalFreqDist will be constructed so that each word is a dictionary key, and the value for that key holds each part of speech the word occurs under, followed by its frequency:

```
- Word 1
  - POS Tag 1: Frequency
  - POS Tag 2: Frequency
  - etc.
- Word 2
  - POS Tag 1: Frequency
- etc.
```

In [None]:
# Create a conditional frequency distribution
wsj_cfd = nltk.ConditionalFreqDist(wsj)

Because words are the keys of the dictionary, we can query the conditional frequency distribution using the words as keys.

This part is really cool – you can see that words are not always used with just one part of speech tag. When the CFD has the words as the conditions (i.e., the first part of the pair), we can see how often different POS tags occur, as the examples below show.

In [None]:
wsj_cfd['yield']

In [None]:
# you can use .most_common() to access the values directly
wsj_cfd['yield'].most_common()

In [None]:
wsj_cfd['cut'].most_common()
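
To make the nested word-to-tag-to-frequency structure concrete, here is a plain-Python sketch that builds a comparable mapping with `defaultdict` and `Counter`, then lists the words that occur under more than one tag. It is a stand-in for illustration, not how NLTK implements `ConditionalFreqDist`, and the sample pairs are hand-written rather than taken from `wsj`.

```python
from collections import Counter, defaultdict

# Hand-written (word, tag) pairs using universal-style tags; not taken from the corpus.
tagged = [
    ("the", "DET"), ("yield", "NOUN"), ("rose", "VERB"),
    ("bonds", "NOUN"), ("yield", "VERB"), ("more", "ADJ"),
    ("the", "DET"), ("yield", "NOUN"), ("fell", "VERB"),
]

# Outer key: the word (the "condition"). Inner Counter: tag -> frequency.
tags_by_word = defaultdict(Counter)
for word, tag in tagged:
    tags_by_word[word][tag] += 1

# Words that were seen with more than one distinct tag.
ambiguous = sorted(word for word, tags in tags_by_word.items() if len(tags) > 1)

print(tags_by_word["yield"].most_common())
print(ambiguous)
```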

Some words are more restricted to specific parts of speech; these are the so-called function words. What do you think happened with the tags for `the` that are not `DET`?



In [None]:
wsj_cfd['the'].most_common()

You can obtain the original Penn Treebank tags by loading the corpus without the `tagset = 'universal'` argument. This tagset is more detailed.

### **Your Turn**

Spend some time querying the `wsj_cfd` for different words.

- Which words are more likely to appear in different POS?
- Which ones are not?

In [None]:
# Look at different POS tags here.


In [None]:
nltk.FreqDist(nltk.pos_tag(nltk.word_tokenize('Where is my comb? Please comb the desert for droids!')))

# Tagging less clean data

Let's briefly look at what it might be like to tag some of the data from The Current. We will load the data in and clean out the punctuation, but otherwise leave the comments as a single string.

In [None]:
# supermarkets should only sell sustainably caught fish
# load the TP002 data to the notebook environment
!wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/the-current/tp002.txt'

In [None]:
# read in the entire file
with open('tp002.txt') as f:
    tp002 = f.read().rstrip()

# remove any punctuation
import re
punctuation = '[#.,!\'"-]'
tp002 = re.sub(pattern = punctuation, repl = '', string = tp002)

# extract the comments
tp002_comments = ' '.join([comment.split('\t')[1] for comment in tp002.split('\n')])
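
To see what the cleaning steps do, the sketch below runs them on a tiny made-up string. It assumes each line of the file holds an identifier, a tab, and then the comment text, which is inferred from the `split('\t')[1]` call above. The sample text and variable names are hypothetical.

```python
import re

# Two made-up lines in the assumed "identifier<TAB>comment" layout.
sample = "001\tYes, absolutely!\n002\tNo, it's too expensive."

# Delete every character that appears in the punctuation class.
cleaned = re.sub(r'[#.,!\'"-]', '', sample)

# Split into lines, keep the text after the tab, and join everything with spaces.
comments = ' '.join(line.split('\t')[1] for line in cleaned.split('\n'))

print(comments)
```

Because the apostrophe is in the punctuation class, contractions such as "it's" lose it before tokenizing, which may affect how they are later tokenized and tagged.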

In [None]:
# create tokens from entire set of comments
tp002_tokens = nltk.word_tokenize(tp002_comments)

In [None]:
# tag the tokens
tp002_pos = nltk.pos_tag(tp002_tokens)

In [None]:
# look at first ten token/tag pairs
tp002_pos[:10]

Let's create a FreqDist of the token/tag pairs to locate some frequent and infrequent combinations:


In [None]:
tp002_fdist = nltk.FreqDist(tp002_pos)

Look at the top ten most frequent word/tag pairs. What do you notice here - are there any commonalities among these words/parts of speech?

In [None]:
# top ten most frequent word/tag pairs.
tp002_fdist.most_common(10)

The `.hapaxes()` method returns all items which occur only one time. Look at the first ten hapaxes. What does this tell us about the data, and/or any additional preprocessing we might need to do?

In [None]:
# first ten hapaxes in the data
tp002_fdist.hapaxes()[:10]
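
As a rough sketch of what counts as a hapax here, the cell below finds the pairs that occur exactly once in a hand-written sample, using `collections.Counter` in place of `nltk.FreqDist`. The sample pairs and their tags are made up for illustration and are not taken from the tp002 data.

```python
from collections import Counter

# Hand-written (word, tag) pairs; the tags are illustrative, not tagger output.
tagged = [
    ("fish", "NN"), ("Fish", "NNP"), ("should", "MD"),
    ("fish", "NN"), ("be", "VB"), ("sustainable", "JJ"),
]

counts = Counter(tagged)

# A hapax is any (word, tag) pair whose count is exactly 1.
hapaxes = [pair for pair, n in counts.items() if n == 1]

print(hapaxes)
```

Since whole pairs are being counted, the same word with different capitalization or a different tag is treated as a separate item.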

# **Adding your own tags**

You can manually tag your own text using a built-in NLTK function, `nltk.tag.str2tuple`. You supply a word and tag in the form of `word/tag`, and the result is a `(word, tag)` pair. For example, below I add the POS tag "NN" to the word "fly":


In [None]:
tagged_token = nltk.tag.str2tuple('fly/NN')
tagged_token

You can then quickly add a number of tags to a text in one go by writing a plain string of word/tag pairs:

In [None]:
# add tags to a longer text, then tokenize
raw_text = 'We/PRP live/VBP in/IN a/DT society/NN'

tokenized_text = nltk.word_tokenize(raw_text)
tokenized_text

In [None]:
# now split the tags
[nltk.tag.str2tuple(w) for w in tokenized_text]

## Why would you want to add your own tags?

You might wonder what the point of adding your own tags is, especially if you find you are not super confident about which tags should be used for a particular word! One of the reasons NLTK shows this to you is to provide a glimpse into how some corpora come tagged, and how NLTK reads those tags and provides them to you.

Another potential use is that you could in theory supply any tags you wanted to your text. So, instead of tagging each word with part of speech, you could devise your own coding scheme for other properties of words. For example, if you had an example of speech which mixed two languages, you could tag the language of each word:


In [None]:
# tag words in English/Mandarin
code_switch = 'Soda/EN is/EN a/EN 很/ZH 乖/ZH 的/ZH 狗/ZH'

In [None]:
# now my text can be searched for "EN" or "ZH"
[nltk.tag.str2tuple(w) for w in nltk.word_tokenize(code_switch)]
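
Once the text is in `(word, tag)` form, custom tags can be searched just like POS tags. The sketch below groups the words by language tag. To stay free of dependencies it splits each token at its last slash with plain Python instead of calling `nltk.tag.str2tuple`, and uses `str.split()` instead of `nltk.word_tokenize`. Both substitutions are assumptions made for this illustration, and the variable names are hypothetical.

```python
from collections import defaultdict

code_switch = 'Soda/EN is/EN a/EN 很/ZH 乖/ZH 的/ZH 狗/ZH'

# Split on whitespace, then cut each token at its last slash into (word, tag).
pairs = [tuple(token.rsplit('/', 1)) for token in code_switch.split()]

# Collect the words under each language tag.
words_by_language = defaultdict(list)
for word, language in pairs:
    words_by_language[language].append(word)

print(words_by_language['EN'])
print(words_by_language['ZH'])
```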

## **Your Turn**

- It might be useful to think about some other categories you might be interested in applying to words
- At the least, play around with `str2tuple` and make sure you have a handle on it