<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/13_bigrams_and_parts_of_speech.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Collocations

One of the most memorable quotes one learns when studying corpus linguistics is by [Firth](https://en.wikipedia.org/wiki/John_Rupert_Firth): "you shall know a word by the company it keeps". This quote embodies the concept of *collocational meaning*: the idea that the true meaning of words is only realised when used in the context of *other* words.

Here is the definition of collocation from Google Dictionary:

![](https://i.imgur.com/hz3L88z.png)

Note how the linguistic definition is essentially the same thing as the non-linguistic definition, but includes an additional qualification about habitual co-occurrence. A collocation is not just a word which occurs next to another word; it is a word which occurs next to another word *non-randomly*.

How do we know which words co-occur non-randomly? Think of the corpora that you've already seen thus far: large corpora such as Brown (and even larger ones) allow linguists to mine not just frequently occurring single words, but also frequently occurring word pairs, triplets, and so on. Statistical measures of word co-occurrence have tremendous potential to provide insight into natural language and are also responsible for the ever-increasing accuracy of automatic NLP algorithms used today.

Statistical knowledge of collocations is not just present in corpora, but is also part of becoming proficient in a language. This [Wikipedia article](https://en.wikipedia.org/wiki/English_collocations) contains a decent explanation of some collocations in English. Consider this table from the article, which shows how some word pairs seem natural / correct, whereas others do not:

![](https://i.imgur.com/cWE5iO5.png)

Note that none of the word combinations in the "unnatural English" column are ungrammatical; rather, they just seem to be odd combinations, particularly when compared to the versions in the "natural English" column. Collocations are extracted from large corpora of language and are thought to reflect language use, which is in turn reflected by our intuitions about which of these combinations seem right and which seem odd.


## Bigrams 

As was shown above, collocations consist of two or more co-occurring words. Collocations can stretch beyond two words, to three, four, or more word partners.

A related but crucially different term from collocations is the term **bigrams**. In NLP, the term `bigram` means a pair of words, and technically can mean *any pair of adjacent words in a text*. The more general terminology is `n-grams`, where the `n` can stand for any number. So you can have bigrams (two words), trigrams (three words), and so on. You may find that corpus linguists sometimes conflate the terms `collocation` with `ngram`, but what they probably are referring to are unusually frequent ngrams when compared to other ngrams, especially when taking into account individual word frequencies. 
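
To make the idea of "any run of `n` adjacent words" concrete, here is a minimal plain-Python sketch. The function name `simple_ngrams` is made up for this illustration and is not part of NLTK; it slides a window of width `n` across a list of tokens.

```python
def simple_ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple (hypothetical helper)."""
    # The range stops early enough that the final window still holds n items.
    # If n is longer than the token list, the range is empty and so is the result.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['we', 'live', 'in', 'a', 'society']
print(simple_ngrams(tokens, 2))  # bigrams
print(simple_ngrams(tokens, 3))  # trigrams
```

Changing `n` is all it takes to move from bigrams to trigrams and beyond.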

Simply put, the relationship between collocations and bigrams is that all collocations are bigrams (or, more accurately, n-grams), but not all n-grams are collocations. In order to be upgraded from n-gram to collocation, it must be shown that the n-gram occurs more frequently than would be expected by chance.
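
One common way to put a number on "more frequently than expected by chance" is pointwise mutual information (PMI), which compares how often a pair actually occurs with how often it would occur if the two words were independent. The sketch below is only an illustration on a made-up sentence: the function name `pmi_scores` and the toy text are assumptions, and PMI is just one of several association measures, not necessarily the one any particular NLTK function uses.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Return {bigram: PMI} for each adjacent pair (illustrative sketch only)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        p_pair = count / n_bigrams             # observed probability of the pair
        p_w1 = unigram_counts[w1] / n_tokens   # probability of each word alone
        p_w2 = unigram_counts[w2] / n_tokens
        # PMI is positive when the pair occurs more often than independence predicts
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

toy_tokens = 'new york is big and new york is old and the city is new'.split()
scores = pmi_scores(toy_tokens)
print(round(scores[('new', 'york')], 2), round(scores[('is', 'new')], 2))
```

In this toy text, `('new', 'york')` scores higher than `('is', 'new')`, even though both contain the frequent word `new`. Note that raw PMI also rewards pairs of very rare words, which is one reason collocation tools usually apply a minimum frequency filter as well.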

So, collocations are defined based on statistical co-occurrence frequencies: collocations are words which are more likely to occur with one another than with other words, controlling for each word's individual frequency. NLTK has two neat functions which help us here: `bigrams` and `collocations`. Let's try `bigrams` on some sample text first.

### nltk.bigrams()

The function to create bigrams in NLTK is relatively straightforward. We just need to call the function on a tokenized text (otherwise we would get bigrams of characters in a string!). Let's see an example. First download the needed resources.

In [None]:
# import the NLTK library and download tokenizer
import nltk
nltk.download('punkt')

In [None]:
# create a sample sentence
great_quote = 'we live in a society!'

# use nltk bigrams (wrapping it in list() provides us with the output right away)
list(nltk.bigrams(nltk.word_tokenize(great_quote)))

You should be able to inspect the output and get an idea for what's going on. This function is simply starting with the first word of the sentence, making a pair with the second word, then moving on to the second word, making a pair with the third word, and so on. We can conceptualise how this works with a pseudo formula:

First we would loop through a sentence:

> `for n:m in a sentence (where n = the first word and m = the final word)`

Then we would simply look ahead by one and pair that word with the current one:

> `output = n + n1, n1 + n2, n2 + n3..., m-1 + m`

Is it that simple? Can we produce the bigrams in the same way that NLTK does? One excellent function which can help us with this is `enumerate()`!

Check out the code below - you can see that it was relatively easy to get the same basic functionality from NLTK with our own code. Take a moment to study what I've had to do in order to prevent index errors. 





In [None]:
# use enumerate to make bigrams by asking for adjacent words until we get to the end of the sentence.
def bootleg_bigram(tokens):
  for i, word in enumerate(tokens):
    if i != len(tokens)-1: # what is the role of this line? 
      print((tokens[i], tokens[i + 1]))

In [None]:
# Test out bootleg_bigram on the same text. 
bootleg_bigram(nltk.word_tokenize(great_quote))

### **Your Turn**

Spend some time using `nltk.bigrams()` on some text/strings. Make sure you understand what it is doing, and also compare the output to my `bootleg_bigram()` function.

- Are there *any* differences between the two functions? (Hint: yes there are). 
- What could be done to `bootleg_bigram()` to improve it as a function?

## Finding collocations

The `collocations()` function will give us the bigrams which are unusually frequent when also considering the frequency of the individual words in the bigrams. If you look under the hood in the NLTK docs, you'll find they use calculations from [this paper](https://aclanthology.org/J90-1003.pdf) to determine strength of association (i.e., to distinguish collocations from bigrams). 

Let's return to some of the built-in texts and examine their collocations. We need to download the NLTK resources. 



In [None]:
# bring in the nltk resources
nltk.download('book')
from nltk.book import *

### **Activity and Discussion**

Examine the collocations for `text6` and `text9`. 
- What do you think it is about the text which is creating these collocations? 
- Do you think these same collocations would be found in the other texts? 
- What might this tell us about the power of using collocations / ngrams if we wanted to predict where documents came from (e.g., guessing the genre, guessing the author)?

In [None]:
# what are the collocations of Holy Grail?
print(text6, '\n')

text6.collocations()

In [None]:
# examine the collocations of text9. 
# What do you know about this book, without having read it?
print(text9, '\n')

text9.collocations()

## How do people use LOL?

Now, while collocations can tell us about a text in general, we can also use bigrams as a means to explore a targeted use of language we might be interested in.

Let's consider `text5`, the webchat corpus. Perhaps we want to know how people use "lol", regardless of whether "lol" is a collocation or not. To do so, we can conditionally sort through the bigrams of the text. 

We will first obtain all the bigrams of `text5`. Then we will print out the bigrams only if they contain the acronym `lol` or `LOL`. Perhaps this will tell us how LOL is used?

In [None]:
# first get the bigrams
webchat_bigrams = list(bigrams(text5))

# we can see that there is going to be a lot of them!
len(webchat_bigrams)

In [None]:
# inspect a random part of the bigrams
webchat_bigrams[1337:1350]

Now that we have obtained all of the bigrams, let's use a list comprehension to find the bigrams which contain variations of LOL:

In [None]:
# create a new object named lol_grams 
# which are the bigrams of webchat_bigrams only if they contain 'lol' or 'LOL'
lol_grams = [gram for gram in webchat_bigrams if 'lol' in gram or 'LOL' in gram]

# we have a good number!
len(lol_grams)

In [None]:
# you can examine the bigrams here
# what do you notice? 
sorted(set(lol_grams))

There's a lot of crap in that output, mainly because of the way usernames are represented. This is somewhat interesting/useful because it suggests 'lol/LOL' might be the first/only thing many people type (which makes sense). But we are more interested in seeing how 'lol/LOL' pairs with other words. There was also a lot of punctuation joined with 'lol/LOL'. 

Let's try cleaning it up a bit. We can add a condition that requires both words in the bigram to pass `.isalpha()`. Why might this work? Because `.isalpha()` only returns `True` if every character in a string is alphabetic (for English text, a-z/A-Z). Any punctuation *or* numbers will cause `.isalpha()` to evaluate to `False`.
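
As a quick sanity check of that idea, here is a small sketch on a handful of made-up bigrams (the `samples` list is invented for illustration and is not taken from `text5`):

```python
# Hypothetical bigrams in the style of the webchat data
samples = [('lol', 'haha'), ('lol', '!'), ('U121', 'lol'), ('LOL', 'JOIN')]

# Keep a bigram only if BOTH of its members are purely alphabetic
kept = [gram for gram in samples if gram[0].isalpha() and gram[1].isalpha()]
print(kept)
```

The pair with punctuation and the pair with a digit-bearing username are dropped, while an all-caps word such as `JOIN` still passes because it is purely alphabetic.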


In [None]:
# Description of str.isalpha
help(str.isalpha)

In [None]:
# now use isalpha() to only capture lol or LOL with other words
lol_grams2 = [gram for gram in lol_grams if gram[0].isalpha() and gram[1].isalpha()]
len(lol_grams2)

Doing this really cleans up the output, although we still have *some* words in there that probably aren't what we want (like the JOIN messages). 

In [None]:
# it's a lot easier to see these now
sorted(set(lol_grams2))

### forwards and backwards lol_grams

Let's now see if we can discern any interesting patterns with how lol/LOL is used. We'll create three sublists from `lol_grams2`. These sublists will be:

- all bigrams which start with lol/LOL and the second word is not lol/LOL
- all bigrams which end with lol/LOL and the first word is not lol/LOL
- all bigrams where the first and second words are either lol/LOL

I'll define all three below in one cell. 

In [None]:
targets = ['lol', 'LOL']

forward_lolgrams = [gram for gram in lol_grams2 if gram[0] in targets and gram[1] not in targets]
backwards_lolgrams = [gram for gram in lol_grams2 if gram[1] in targets and gram[0] not in targets]
double_lolgrams = [gram for gram in lol_grams2 if gram[0] in targets and gram[1] in targets]

In [None]:
# what does this distribution tell us? 
print('forward lols:', len(forward_lolgrams), 
      '\n', 'backwards lols:', len(backwards_lolgrams), 
      '\n', 'double lols:', len(double_lolgrams))

In [None]:
# explore the forward lolgrams
sorted(set(forward_lolgrams))

In [None]:
# explore the backwards lolgrams
sorted(set(backwards_lolgrams))

In [None]:
# explore the double lolgrams
# does it make sense to you why this sorted list is so short?
sorted(set(double_lolgrams))
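
Sorting the unique bigrams shows *which* partners occur, but not *how often*. One possible next step, sketched here on a made-up list standing in for `forward_lolgrams` (the list `toy_forward` is an assumption, not real corpus output), is to count the partner words with `collections.Counter`:

```python
from collections import Counter

# Hypothetical stand-in for forward_lolgrams
toy_forward = [('lol', 'that'), ('lol', 'i'), ('LOL', 'that'),
               ('lol', 'yeah'), ('lol', 'i'), ('lol', 'that')]

# Count the second word of each bigram, lowercased so 'That' and 'that' merge
partner_counts = Counter(second.lower() for (first, second) in toy_forward)
print(partner_counts.most_common(2))
```

The same pattern applied to the first word of each bigram would summarise what tends to come *before* lol/LOL.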

### **Your Turn**

Spend some time looking through the distribution of `lol/LOL` that I've created. 

- Can you draw any conclusions about how these words might be used in terms of how they pattern with other words?
- What tweaks or changes might you make to my code? 
- What other bigrams might be interesting to search for?

# **Parts of Speech**

Now that we've considered bigrams, let's turn our attention to a completely different lexical property: parts of speech. We have thus far dealt with searching and manipulating text based on the orthographic features of words (i.e., their written forms). That means we have remained focused on the forms of the words (i.e., types and tokens). This has given us the ability to measure basic yet important aspects of texts: length, word frequency, and also some considerations about lexical diversity.

You probably have some familiarity with how words are classified into different lexical and syntactic categories, such as nouns, verbs, adjectives, pronouns, etc. These categories are called *parts of speech* (POS), and are used as another source of information which can be exploited during linguistic analysis of texts.

For NLP and Computational Linguistics, it is common to see reference made to POS **Tags**, which are essentially labels or annotations associated with a word to represent more information about that word. These tags can be counted and compared, and also provide critical information for building and understanding grammars of languages.

Let's explore how to tag texts and then use these tags to query information about texts.




## Using `nltk.pos_tag()` to automatically tag texts.

Fortunately for us, NLTK includes a function which will automatically assign part of speech tags to a text. To use this function, we need to import NLTK and download some additional resources. 

In [None]:
# import NLTK and download the necessary resources
import nltk
# import resources for tokenizing and tagging
nltk.download(['punkt', 'averaged_perceptron_tagger', 'tagsets'])


The NLTK function expects a list of tokens and is used like this:

> `nltk.pos_tag(tokens)`

The result will be a list of `(word, tag)` pairs, each in the form of a tuple.

The next three cells demonstrate an example of how to POS tag a text using nltk:

In [None]:
# step 1: have some text
rant = "You know, we're living in a society! We're supposed to act in a civilized way."

In [None]:
# step 2: tokenize
rant_tokens = nltk.word_tokenize(rant)

In [None]:
# step 3: tag
rant_pos = nltk.pos_tag(rant_tokens)

# look at the resulting (word, tag) pairs
[tagged for tagged in rant_pos]

POS tagging has created a list of tuples of our words, with `(word, tag)` format. The tags are informative and indeed go beyond broader word categories such as NOUN and VERB. For example, 

```
VBP = verb, present tense, not 3rd person singular
``` 

while 

```
VBG = verb, present participle or gerund.
``` 

These tags are from the Penn Treebank tagset, which is a very commonly used set of POS tags. You can see the full list by running the following cell or by [going to this link](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

Take a moment to scroll through these tags and explore their explanations and examples.

In [None]:
# full list of tags, with definitions and examples
nltk.help.upenn_tagset()

You can also look up one specific tag by supplying the tag as a string to the `nltk.help.upenn_tagset()` function:

In [None]:
# what is the NNP tag?
nltk.help.upenn_tagset('NNP')

### **Discussion**

Part of speech tags help make sense of words in the context of other words. Consider this example from the NLTK book - what is the difference in use for the instances of *refuse* and *permit*?

Think back to what we know about bigrams. Are there any clues  provided by the words which come *before* refuse/permit that might facilitate tagging of the proper part of speech?  

In [None]:
nltk.pos_tag(nltk.word_tokenize("They refuse to permit us to obtain the refuse permit"))

We can supply our own examples as well — let us compare two uses of the same word "comb":

In [None]:
# what pos tag does the word "comb" have in this example?
nltk.pos_tag(nltk.word_tokenize('Quick, comb the desert for droids!'))

In [None]:
# and what pos tag does the word "comb" have in this example?
nltk.pos_tag(nltk.word_tokenize('Where is my comb?'))

So, adding POS tag information provides more information about a text, which becomes useful for more advanced NLP applications such as information extraction, text prediction, and so on. Because the tags are stored as strings, you can use knowledge of Python to search or filter through the list in order to find specific words associated with specific tags.
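
For instance, a list comprehension can pull out every word whose tag starts with a given prefix. The sketch below uses a hand-written list of `(word, tag)` pairs in the Penn style; these tags were typed in for illustration and are not the output of the tagger.

```python
# Hand-written (word, tag) pairs for illustration only, not tagger output
tagged = [('Where', 'WRB'), ('is', 'VBZ'), ('my', 'PRP$'), ('comb', 'NN'), ('?', '.')]

# Penn verb tags all begin with 'VB' and noun tags with 'NN',
# so a prefix test catches every subtype at once
verbs = [word for (word, tag) in tagged if tag.startswith('VB')]
nouns = [word for (word, tag) in tagged if tag.startswith('NN')]
print(verbs, nouns)
```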



### **Your Turn**

- Explore using `nltk.pos_tag()` on some texts. 
- See if you can understand the different POS tags and what they mean about the words. 
- Can you "break" the tagger or have it produce inaccurate results?
- The tagger has a rule that if it does not know the tag for a word, it will automatically assign a default POS tag. Can you figure out what this default tag is? 

# Bigrams and Parts of Speech

The NLTK book reviews major parts of speech, with examples. Regardless of what you think you might know (or not know) about grammar/language, you should carefully read these sections. Look at the patterns associated with different parts of speech: these patterns are crucial for training taggers. This is evidenced in the example showing how bigrams of POS tags show typical English word order. We can try the same with our own example.

Let's use `nltk.bigrams()` on a set of tagged tokens. This will create a sequence of `((word, tag), (word, tag))` pairs.

In [None]:
# create bigrams of our pos tagged example
rant_bigrams = [bigram for bigram in nltk.bigrams(rant_pos)]

# inspect the bigrams
rant_bigrams

Now that we have the part of speech information included with our words, we can shift our search patterns away from the orthographic forms of words to instead the part of speech of words. This allows us to find more abstract patterns in language associated with word *categories* rather than with the forms of words themselves. 

For instance, let's look for all words in our example which come before nouns. This requires a bit of indexing, because each bigram is a pair of pairs nested within a single tuple:

```
((word, tag), (word, tag))
```

So to access the word in the first pair, we would first index the larger tuple using `[0]` to get the first `(word, tag)` pair, then index that pair using `[0]` to get the first part of `(word, tag)`, which would be the word. This is demonstrated in the next code cell.

In [None]:
for i in rant_bigrams:
  print(i[0][0]) # index the first nested tuple, then index that tuple's first value

Another strategy would be to follow the NLTK book's guide and set the tuple pair as the iterator, allowing you to index the tuple in a more transparent way. 

```
[a for (a,b) in bigrams]
```

In [None]:
# you can select the first (word, tag) pair of each bigram
[a for (a, b) in rant_bigrams]

In [None]:
# or the second
[b for (a,b) in rant_bigrams]

Let's steal the example from NLTK and find all the words which precede nouns in this example: 

In [None]:
noun_preceders = [a for (a, b) in rant_bigrams if b[1] == 'NN']
noun_preceders
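
One thing to keep in mind is that `== 'NN'` matches only that exact tag, so plural nouns (`NNS`) and proper nouns (`NNP`) are skipped. The sketch below shows the difference on a small hand-tagged example; the list `tagged_toy` is made up for illustration and is not tagger output.

```python
from collections import Counter

# Hand-tagged toy sentence (Penn-style tags typed in for illustration)
tagged_toy = [('the', 'DT'), ('dog', 'NN'), ('chased', 'VBD'), ('a', 'DT'),
              ('cat', 'NN'), ('in', 'IN'), ('the', 'DT'), ('gardens', 'NNS')]

# Plain-Python bigrams: pair each item with the one that follows it
toy_bigrams = list(zip(tagged_toy, tagged_toy[1:]))

# Exact match: only singular common nouns count as "nouns"
strict = [a[0] for (a, b) in toy_bigrams if b[1] == 'NN']
# Prefix match: NN, NNS, NNP and NNPS all count
loose = [a[0] for (a, b) in toy_bigrams if b[1].startswith('NN')]

# Tally the TAGS (not the words) that come before any kind of noun
preceder_tags = Counter(a[1] for (a, b) in toy_bigrams if b[1].startswith('NN'))
print(strict, loose, preceder_tags)
```

Counting the tags of the preceding words, rather than the words themselves, is one way to see the more abstract pattern.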

### **Your Turn**

What do you notice about the words which come before nouns? 

- Apply the strategy above to some longer texts of your choice (e.g., you could load in Brown?)
- Do you find the same words appearing in front of nouns? What patterns are you noticing?
- Do the findings make sense in terms of what you know about nouns? 

# Frequency distributions and POS tags

The NLTK book demonstrates that we can use frequency distributions to find the most common words associated with a particular part of speech. 

They do so by using a [tagged corpus](https://catalog.ldc.upenn.edu/LDC99T42) comprised of articles from the *Wall Street Journal*. The corpus is natively annotated with the Penn Treebank tags we've been looking at thus far, but NLTK can also map them onto a simpler, coarser set of categories known as the [universal tag set](https://universaldependencies.org/u/pos/).

We can access this corpus through NLTK.

In [None]:
# load in Penn Treebank corpus using universal pos tags (they are simpler)
wsj = nltk.corpus.treebank.tagged_words(tagset = 'universal')
wsj

The corpus is tagged, so if we use `nltk.FreqDist()`, we will get a frequency distribution of `(word, tag)` pairs. 

In [None]:
# create a frequency distribution of the pairs
word_tag_fd = nltk.FreqDist(wsj)

# not surprisingly, the most common pairs are punctuation and function words. 
word_tag_fd.most_common(10)

We can then run a conditional test on the freqdist to find most common words of a certain category, such as finding the most common verbs. 

To do so, we run a list comprehension with a condition test over the results of the frequency distribution. Note that in the loop we specify the `((word, tag),freq)` nature of each item being iterated over. 

In [None]:
# ask for just verbs (leaving most_common empty means it returns all of the words)
# I included a slice to 25 just so it doesn't spam the screen
[wt[0] for (wt, freq) in word_tag_fd.most_common() if wt[1] == 'VERB'][:25]

## Conditional Freq Dist and POS Tags

Finding the most common verbs is interesting, but we can also use a conditional frequency distribution to find the frequency of specific words among different POS tags. This allows us to find the frequency with which certain words appear under different parts of speech.

The ConditionalFreqDist will be constructed so that each word is a dictionary key, and the value for that key holds each part of speech the word occurs under, followed by its frequency:

```
- Word 1
  - POS Tag 1: Frequency
  - POS Tag 2: Frequency
  - etc.
- Word 2
  - POS Tag 1: Frequency
- etc.
```
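
The same nested shape can be imitated in plain Python, which may help to see what is going on. This is only a sketch of the idea using the standard library and a few made-up `(word, tag)` pairs; it is not how NLTK implements its class.

```python
from collections import Counter, defaultdict

# Made-up (word, tag) pairs for illustration
toy_pairs = [('cut', 'VERB'), ('cut', 'NOUN'), ('cut', 'VERB'),
             ('the', 'DET'), ('yield', 'NOUN')]

# Outer dictionary: word -> Counter of the tags seen with that word
toy_cfd = defaultdict(Counter)
for word, tag in toy_pairs:
    toy_cfd[word][tag] += 1

print(toy_cfd['cut'].most_common())  # [('VERB', 2), ('NOUN', 1)]
print(len(toy_cfd['cut']))           # number of distinct tags seen with 'cut'
```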

In [None]:
# Create a conditional frequency distribution
wsj_cfd = nltk.ConditionalFreqDist(wsj)

Because words are the keys of the dictionary, we can query the conditional frequency distribution using the words as keys. 

This part is really cool – you can see that words are not always used with just one part of speech tag. When the CFD has the words as the conditions (i.e., the first part of the pair), we can see how often different POS tags occur, as the examples below show. 

In [None]:
wsj_cfd['yield']

In [None]:
# you can use .most_common() to access the values directly
wsj_cfd['yield'].most_common()

In [None]:
wsj_cfd['cut'].most_common()

Some words are more restricted to specific parts of speech; these are the so-called function words. What do you think happened with the tags for `the` that are not `DET`?



In [None]:
wsj_cfd['the'].most_common()

You can obtain the initial treebank pos tagset by loading in the corpus and not specifying that you need the "universal" tagset. This tagset is more detailed. 

### **Your Turn**

Spend some time querying the `wsj_cfd` for different words. 

- Which words are more likely to appear in different POS?
- Which ones are not? 

In [None]:
# Look at different POS tags here.


In [None]:
nltk.FreqDist(nltk.pos_tag(nltk.word_tokenize('Where is my comb? Please comb the desert for droids!')))

# **Finding ambiguous words**

How can a word be ambiguous based on its part of speech? The point is that if a word can be used either as a noun or a verb, the word by itself is ambiguous and needs some sort of co-text in order to determine the actual part of speech. 

An easy example to demonstrate this is 

- a comb : noun
- to comb: verb

In both cases, the word that comes before strongly predicts (or even dictates) the part of speech for the word. 

`a` is a determiner, whereas `to` is a preposition, and in this case, `to` is being used as part of the infinitive form of `comb`, which is different than using `to` in a phrase like `from work to school`. 

This small example should help reinforce how patterns in language, based on both word form *and* part of speech, can be exploited by linguists.

Let's explore this following the example from NLTK to find ambiguous words.

In [None]:
# we need the brown corpus
from nltk.corpus import brown

In [None]:
# Make a conditional frequency distribution of Brown POS tags
# using the universal tags, and lowercase the words
brown_news_tagged = brown.tagged_words(categories = 'news', tagset = 'universal')

brown_cfd = nltk.ConditionalFreqDist((word.lower(), tag)
                                for (word, tag) in brown_news_tagged)

After making the CFD, loop through each word (the words are the keys in the dictionary and are returned by `brown_cfd.conditions()`).

Then, checking the `len()` of the word's entry shows how many tags are associated with that word (each distinct tag adds 1 to the `len()`). This method thus allows you to locate words used with many different POS tags.



In [None]:
# find all words associated with more than a certain number of pos tags
for word in sorted(brown_cfd.conditions()):
  # if the entry has more than three POS tags
  if len(brown_cfd[word]) > 3:

    # get just the tag, not the frequency
    tags = [tag for (tag, _) in brown_cfd[word].most_common()]
    
    print(word, ' '.join(tags))

# **Your Turn**

If we have time, spend it now loading in your own text(s) and tagging them for part of speech. 

- Then, run some frequency distributions and conditional frequency distributions
- can you find the most frequent nouns, verbs, etc?
- can you find ambiguous words?
- can you find particular types of ambiguous words, such as words which are ambiguous between nouns and verbs? 

# **(Bonus) Adding your own tags**

You can manually tag your own text using a built-in NLTK function, `nltk.tag.str2tuple`. You supply a word and tag in the form of `word/tag`, and the result is a `(word, tag)` pair. For example, below I add the POS tag "NN" to the word "fly":


In [None]:
tagged_token = nltk.tag.str2tuple('fly/NN')
tagged_token
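
To get a feel for what is happening, here is a rough plain-Python approximation. The name `bootleg_str2tuple` is made up, and the details (splitting at the *last* separator, upper-casing the tag, returning `None` when no separator is found) reflect my understanding of the NLTK behaviour rather than its actual source, so treat them as assumptions.

```python
def bootleg_str2tuple(text, sep='/'):
    """Approximate word/tag splitting (illustrative sketch, not NLTK's code)."""
    # rpartition splits at the LAST separator, so 'and/or/CC' keeps 'and/or' as the word
    word, found_sep, tag = text.rpartition(sep)
    if not found_sep:
        # No separator at all: return the text with no tag
        return (text, None)
    return (word, tag.upper())

print(bootleg_str2tuple('fly/NN'))
print(bootleg_str2tuple('and/or/CC'))
print(bootleg_str2tuple('fly'))
```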

You can then quickly add a number of tags to a text in one go by writing raw strings with word/tag pairs:

In [None]:
# add tags to a longer text (using Penn-style tags throughout), then tokenize
raw_text = 'We/PRP live/VBP in/IN a/DT society/NN'

tokenized_text = nltk.word_tokenize(raw_text)
tokenized_text

In [None]:
# now split the tags
[nltk.tag.str2tuple(w) for w in tokenized_text]

## Why would you want to add your own tags?

You might wonder what the point of adding your own tags is, especially if you find you are not super confident about which tags should be used for a particular word! One of the reasons NLTK shows this to you is to provide a glimpse into how some corpora come tagged, and how NLTK reads those tags and provides them to you.

Another potential use is that you could in theory supply any tags you wanted to your text. So, instead of tagging each word with part of speech, you could devise your own coding scheme for other properties of words. For example, if you had an example of speech which mixed two languages, you could tag the language of each word:


In [None]:
# tag words in English/Mandarin
code_switch = 'Soda/EN is/EN a/EN 很/ZH 乖/ZH 的/ZH 狗/ZH'

In [None]:
# now my text can be searched for "EN" or "ZH"
[nltk.tag.str2tuple(w) for w in nltk.word_tokenize(code_switch)]
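
Once the text is in `(word, label)` form, searching by label works just like filtering by POS tag. Here is a minimal plain-Python sketch of that search; it splits on whitespace and on the last `/` instead of calling NLTK, and the variable names are made up for illustration.

```python
code_switch_toy = 'Soda/EN is/EN a/EN 很/ZH 乖/ZH 的/ZH 狗/ZH'

# Split on whitespace, then split each item once at its last '/'
pairs = [tuple(item.rsplit('/', 1)) for item in code_switch_toy.split()]

# Filter the (word, language) pairs by their label
zh_words = [word for (word, lang) in pairs if lang == 'ZH']
en_words = [word for (word, lang) in pairs if lang == 'EN']
print(en_words, zh_words)
```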

## **Your Turn**

- It might be useful to think about some other categories you might be interested in applying to words
- At the least, play around with `str2tuple` and make sure you have a handle on it