<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/07_word_frequencies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Frequencies

In the previous lesson we began exploring how important it can be to analyze the vocabulary of a text in terms of which types of words occur in a text. Now we will expand this exploration to look at numerical distributions of words in a text. `Word frequency` represents the overall frequency of a word in general language use. It is a very interesting property of language because it correlates with other constructs, such as word length (shorter words are more frequent) and word difficulty (more complex words are less frequent). 

One of the interesting things about frequency is a phenomenon called Zipf's law, which states that a word's frequency is roughly inversely proportional to its frequency rank: the most frequent word occurs about twice as often as the second most frequent word, about three times as often as the third, and so on down the list. You can read a [reddit post about it here](https://www.reddit.com/r/linguistics/comments/830nf5/zipfs_law_was_so_cool_that_i_performed_and/), or at least look at the graph the poster made to explain the phenomenon:


<img src = https://www.etymologynerd.com/uploads/1/5/8/8/15888322/website.png height = "300">
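
To put rough numbers on that curve, here is a minimal sketch of the idealised form of Zipf's law, where the expected count at a given rank is the top word's count divided by the rank. The function name `zipf_expected` and the starting count of 1000 are made up for illustration, and real corpora only approximate these values.

```python
def zipf_expected(top_frequency, max_rank):
    """Return idealised Zipf counts for ranks 1..max_rank (count = top count / rank)."""
    return [top_frequency / rank for rank in range(1, max_rank + 1)]


# hypothetical corpus in which the most frequent word occurs 1000 times
expected = zipf_expected(1000, 5)
print(expected)
```

Under this idealised assumption, the second-ranked word is expected about 500 times and the fourth about 250 times.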


Moreover, counting how frequently words occur with *other* words has proven very insightful for linguistics and NLP. The most basic insight is that words tend to co-occur with other specific words in predictable ways. Corpus linguists call these pairs of words `collocations`, and define them using a variety of different statistical measures. Finding these larger collocational patterns has given strength to functional linguistic theories of grammar such as construction grammar, which argue that both meaning and syntax determine the way a word is used in language (contrast this with a purely structural approach, which argues grammar rules exist independently of meaning).

Word co-occurrence statistics are also used to create co-occurrence distributions and vector spaces - these are what large-scale NLP algorithms and artificial intelligence applications rely on for word predictions in both processing and production.

The second half of NLTK Chapter 1 begins to introduce these important concepts.

## Frequency distributions


The simplest form of a frequency distribution is a count of how many times each word type appears in a text. It's worth pausing for a moment and considering how you might construct your own frequency distribution: what might be the steps for doing so? Here is one general approach you could take:

1. You start a loop over some words
2. At the first word, you note down the word and store it in a separate data container, alongside a value representing its frequency (1)
3. You then move to the next word and check whether it already exists in your data container:
      - If it does already exist, you increase its count by 1
      - If it does not exist, you add it to the data container and set an initial count of 1

Here is what that might look like using pseudocode:

```
output_container = []

for word in my_words:
  if word in output_container:
    increase count of word by 1
  else:
    add word to output_container
    set count of word to 1
```

Now, what kind of data container would make sense for this? A `list` might be able to work, but this would require some careful slicing and indexing and might become a pain. There is another data container better designed for this known as a dictionary. We will learn how to create dictionaries in a later lesson. But for now, we can rely on a built-in NLTK function named `FreqDist()`, which creates a dictionary of `value:frequency` pairs. 
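
As a preview, here is a minimal sketch of those same steps written as real Python using a dictionary. The function name `count_words` and the example sentence are made up for illustration, and this is not necessarily how `FreqDist()` works internally.

```python
def count_words(words):
    """Count how many times each word occurs, following the steps listed above."""
    counts = {}  # dictionary mapping each word to its running count
    for word in words:
        if word in counts:
            counts[word] += 1  # seen before: increase the count by 1
        else:
            counts[word] = 1   # first time: set an initial count of 1
    return counts


example_counts = count_words("the cat saw the other cat".split())
print(example_counts)
```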







### Using `nltk.FreqDist()`

We can pass a sequence to the `nltk.FreqDist()` function and it will count the number of times different values in the sequence occur. For example, we can count the frequency of letters in a word or words in a sentence.

To do so, we simply pass whatever sequence we want as an argument to `nltk.FreqDist()`. Ideally, save the results to a variable. 

Run the cell below as an example:


In [None]:
# import the FreqDist from nltk
from nltk import FreqDist

# will this become stuck in your head?
turtles = """teenage mutant ninja turtles
            teenage mutant ninja turtles 
            teenage mutant ninja turtles 
            heroes in a halfshell, turtle power"""


# save the frequency distribution to a variable
turtle_fdist = FreqDist(turtles.split())

# inspect the results
turtle_fdist

The resulting frequency distribution is another Python data object called a `dictionary` which stores `key:value` pairs. In this case, our keys are the words, and the values are the frequencies.

We can query a dictionary for specific `key:value` pairs using the following syntax:

> `dictionary['key']`

For example:

In [None]:
# how frequent is "turtles?"
turtle_fdist['turtles']

In [None]:
# how frequent is "turtle?"
turtle_fdist['turtle']

In [None]:
# what happens if we ask for a word not in the dictionary? 
turtle_fdist['shredder']
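
The lookup above does not raise an error because NLTK's `FreqDist` is built on Python's `collections.Counter`, which reports a count of 0 for anything it has not seen. A plain dictionary behaves differently. The sketch below uses `Counter` directly, with made-up counts, to show the contrast:

```python
from collections import Counter

# made-up counts for illustration
plain_counts = {"turtles": 3, "turtle": 1}
counter_counts = Counter(plain_counts)

# a Counter reports 0 for an unseen key instead of raising an error
print(counter_counts["shredder"])

# a plain dictionary raises KeyError for the same lookup
try:
    plain_counts["shredder"]
except KeyError:
    print("KeyError: 'shredder' is not a key in the plain dictionary")

# .get() with a default value is the safe way to ask a plain dictionary
print(plain_counts.get("shredder", 0))
```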

We can also ask for the most frequent N terms from a frequency distribution using the `.most_common()` method. We can specify the number of top results we want by putting a number in the brackets `()` used by `.most_common()`. Below I ask for the three most common words in our example:

In [None]:
# what are the three most common words?
turtle_fdist.most_common(3)

### Fine-tuning a search with frequency

Let's calculate word frequencies for a larger, more interesting data set. Create a frequency distribution of the webchat corpus, `text5`, using `FreqDist()`. You'll need to import `nltk` and download the book resource:

In [None]:
# import the main nltk module
import nltk

# download the nltk.book resources
nltk.download('book')

# import the resources
from nltk.book import *

In [None]:
# Now create a FreqDist of the webchat text
webchat_fdist = FreqDist(text5)

What are the 50 most common words in the webchat corpus? Examine the output - what do you see? Are there items in the output you did or did not expect? What do you think is happening?

In [None]:
webchat_fdist.most_common(50)

Let's now look at how people use the phrase "lol" - both the individual frequency and the overall percentage of "lol" in the corpus.

What do you think about the results? 1.5% might seem low, but is actually a rather strong result considering how many possible words *could* be in the corpus. 


In [None]:
# index the value by using the key (in this case, the word we want to check)
webchat_fdist['lol']

In [None]:
# divide the frequency of 'lol' by the total length of the corpus, then multiply by 100
webchat_fdist['lol']/len(text5)*100
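
The same calculation can be wrapped in a small helper so it is easy to reuse for other words. This is only a sketch: the function name `percentage_of_tokens` and the toy token list are made up, and it uses `collections.Counter` (which `FreqDist` is built on) so that it runs without downloading any corpus. `FreqDist` also offers a `.freq()` method that returns the proportion before it is multiplied by 100.

```python
from collections import Counter


def percentage_of_tokens(counts, word):
    """Return the share of all counted tokens taken up by `word`, as a percentage."""
    total = sum(counts.values())  # total number of tokens that were counted
    return 100 * counts[word] / total


# toy stand-in for a chat corpus
toy_tokens = "lol hi lol ok hi lol bye ok lol hi".split()
toy_counts = Counter(toy_tokens)
print(percentage_of_tokens(toy_counts, "lol"))
```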

We can now include word frequency as an additional condition when looking for certain words. Do you recall how list comprehensions and conditional for loops worked? For example, if we wanted to ask for all words which are three letters long:

In [None]:
# all tokens which are 3 letters long (list comprehension)
[w for w in text5 if len(w) == 3]

Not very readable, is it? We are getting every single token which is 3 characters long. We can reduce this firstly by wrapping the list comprehension in `set()` so that we get a set of types, rather than a list of tokens. 


In [None]:
# add set()
set([w for w in text5 if len(w) == 3])

If you look through that output, you can see that there are a lot of things that look like codes or other non-word stuff, usually in UPPERCASE. We can try removing those using `.islower()`

In [None]:
# all tokens which are 3 letters long and all characters are lowercase
set([w for w in text5 if len(w) == 3 and w.islower()])

Now it's getting more manageable. It's still quite a long list though. Let's add another condition - asking for the same output as the previous code, but this time setting a minimum frequency. We can embed a FreqDist as part of the condition.  Let's also adjust our length so that we let both 3 and 4 letter words appear.  



In [None]:
# adding minimum frequency, allow for both 3 and 4 letter words (how else could you write that conditional?)
set([w for w in text5 if len(w) <= 4 and len(w) >= 3 and w.islower() and webchat_fdist[w] > 100])

What do you see in that output? Any words stand out as representative of a chat corpus? What kinds of words do you think you will find using the same criteria but on a different corpus? The point, which was made in the NLTK book regarding the length of words, is that a single line of code with the right tuning can provide relatively precise insight into the nature of a text and/or corpus. 
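To make that kind of tuning easier to repeat on other texts, the conditions can be collected into one function. The sketch below is only one possible way to do it: the name `frequent_short_words`, its parameters, and the toy token list are all made up, and it uses `collections.Counter` in place of `FreqDist` so it runs on its own. It also shows a chained comparison as one alternative way of writing the length condition.

```python
from collections import Counter


def frequent_short_words(tokens, min_len=3, max_len=4, min_count=2):
    """Return lowercase word types within a length range that occur often enough."""
    counts = Counter(tokens)
    return {
        w for w in counts  # looping over the counts visits each type only once
        if min_len <= len(w) <= max_len  # chained comparison for the length range
        and w.islower()
        and counts[w] >= min_count
    }


# toy stand-in for a chat corpus
toy_chat = "lol hey lol JOIN hey hi hi PART lol what what sup".split()
print(sorted(frequent_short_words(toy_chat)))
```

Note that this sketch treats `min_count` as an inclusive minimum (`>=`), whereas the cell above uses a strict `>` comparison.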

### **Your Turn**

Spend some time playing around with one of the other built-in texts (`text1` through `text9`) from the NLTK data.

Your goal is to try and refine some search patterns to find words which seem to capture the nature of the different texts. For example, you could think about a minimum frequency and minimum or maximum length, such as I have done with `text5` above. 

You can see what the name of each text is by typing its variable name (such as `text6`) into a cell and running it, for example:

In [None]:
# typing just the text's id tells you the actual document. 
text6

## The importance of pre-processing

It's time to return to something we've already covered: tokenizing a text. So far we've been doing this with the `.split()` method, which has worked relatively well for us. But there is one issue: splitting on white space means that punctuation is sometimes included with our words. 

For example, running `.split()` on the example below will retain commas and exclamation marks as part of the words:




In [None]:
turtles = """teenage mutant ninja turtles, 
            teenage mutant ninja turtles, 
            teenage mutant ninja turtles, 
            heroes in a halfshell, turtle power!"""

turtles.split()

### Frequency and pre-processing

And let's see what happens if we subject that `.split()` list to a FreqDist:



In [None]:
# make a frequency distro of our turtles
tfdist = nltk.FreqDist(turtles.split())

In [None]:
# we know that the word "turtles" occurs in the song, so why don't we see it?
tfdist['turtles']

In [None]:
# because the comma has been saved as part of the word! ugh!
tfdist['turtles,']

Clearly `.split()` needs some help, which brings us to a fundamental topic in NLP and corpus linguistics: pre-processing or normalising a text. 

Why is this important? Well, consider the goal of this notebook, which is to calculate the frequency of a word in a corpus / text. In order to do this *properly*, we have to make sure all words are on a level playing field. Before we even get into punctuation, consider the following:

In [None]:
nltk.FreqDist('Victoria University of WELLINGTON is in Wellington'.split())

Although the word "Wellington" occurred twice in the string above, one version was in all capitals and one was not. `FreqDist` treated these as two separate words. Why? The answer reminds us about the way these strings are being compared by Python:

In [None]:
# These are two different values!
'WELLINGTON' == 'Wellington'

While we know that these are basically the same word, Python doesn't care because they are *not* the same word in terms of being 100% identical values. So, we want to consider performing some initial processing (i.e., *pre-processing*) on a text before counting the words as a means to normalize or control for these properties of words we might not care about. For example, we could solve the problem above by converting all of our words to lower case.

In [None]:
# Hey we're the same now!
'WELLINGTON'.lower() == 'Wellington'.lower()
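
Applied to counting, the same idea means lowercasing each token before it is counted. Here is a minimal sketch using `collections.Counter` (which `FreqDist` is built on) so that it runs without NLTK. The variable names are made up for illustration.

```python
from collections import Counter

sentence = 'Victoria University of WELLINGTON is in Wellington'

# counts on the raw tokens keep 'WELLINGTON' and 'Wellington' apart
raw_counts = Counter(sentence.split())

# lowercasing each token first merges them into one type
lower_counts = Counter(word.lower() for word in sentence.split())

print(raw_counts['Wellington'], lower_counts['wellington'])
```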

### Lexical diversity and pre-processing

As another example, let's consider how pre-processing affects a measure we've already explored: lexical diversity. Compare what capitalization will do to measures of lexical diversity on these two texts:

In [None]:
# create two texts that only differ based on capitalization
version1 = ['Soda', 'soda', 'Onion', 'onion']
version2 = ['soda', 'soda', 'onion', 'onion']

In [None]:
# remember how to measure ttr?
def lexical_diversity(text):
  ld = len(set(text))/len(text)
  return ld

In [None]:
lexical_diversity(version1)

In [None]:
lexical_diversity(version2)

We clearly would not want to think that version1 is more lexically diverse than version2. Hence, normalization is needed to address these issues. 

You might question this approach and wonder whether normalizing serves to remove important information about a text - perhaps capitalization matters? What if Soda is a proper name and soda is just the noun?

These are important things to take into consideration when doing any sort of NLP - the scope of your research questions and the nature of the linguistic features you are interested in (and how you measure them) should drive these decisions.
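
One way to keep that decision visible is to make normalisation an explicit option rather than something that always happens. The sketch below is only an illustration, and the function name `normalised_lexical_diversity` and its `lowercase` parameter are made up:

```python
def normalised_lexical_diversity(tokens, lowercase=False):
    """Type-token ratio, optionally lowercasing every token before counting types."""
    if lowercase:
        tokens = [token.lower() for token in tokens]
    return len(set(tokens)) / len(tokens)


version1 = ['Soda', 'soda', 'Onion', 'onion']

print(normalised_lexical_diversity(version1))                  # case kept: 4 types / 4 tokens
print(normalised_lexical_diversity(version1, lowercase=True))  # case removed: 2 types / 4 tokens
```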




### Cleaning punctuation

But our problem above with `turtles` was also caused by the use of punctuation and `.split()`. What could we do? Well, we *could* remove all of the punctuation before splitting the text, and this would provide a satisfactory solution (for now). 

Based on what we know now about Python, how could we remove all of the punctuation from a text? We can actually do this quite simply and quickly using a list comprehension. 

We would want to set a condition that inspects each character in a string, and as long as that character is *not* a punctuation mark, keep it. 

Here is some pseudocode that expresses our goal:


```
[character for character in string if character not punctuation]
```

To execute this code, we'd need to tell Python what we mean by "punctuation". One way is to define a string containing all the punctuation marks we don't want. 

At the same time, we can make sure to lower case everything in the same expression. 


In [None]:
# define a string containing punctuation we don't like, in this case just commas and exclamation marks
punctuation = ',!'

In [None]:
# write a list comprehension that only keeps characters that aren't in punctuation
# read on to the next section to see how to fix this output!
[character.lower() for character in turtles if character not in punctuation]

### `.join()`

Hrmm, not quite what we wanted, because the list comprehension has returned a list of *characters*, but we wanted to retain the whitespace and other properties of the texts. No worries, we can use the handy `.join()` method to join a list of characters back into one string!

`.join()` is sort of the bizarre cousin of `.split()`. `.join()` is a string method, meaning you need to attach a string to the front part of the `.join()`. The string that you attach to `.join()` represents the nature of the join...the character that you want to join everything by. Much like `.split()`, you can choose whatever you like to join stuff with. 

But, if we simply wanted to glue back together a list of characters *without* making any other changes, we would then attach an empty string to `.join()`, indicated with two string delimiters: `''`, in which case we would type `''.join()`.

Then, the thing that you want to join goes inside the `()` part of `''.join()`.

```
''.join([list of characters])
```
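
Here are a few quick examples of how the attached string changes the result. The example lists are made up for illustration.

```python
letters = ['n', 'i', 'n', 'j', 'a']

# an empty string glues the characters together with nothing in between
print(''.join(letters))

# any other string is placed between each pair of items
print('-'.join(letters))

# joining words with a space undoes a whitespace .split()
words = ['turtle', 'power']
print(' '.join(words))
```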


In [None]:
# we just wrap the whole list comprehension in ''.join
remove_punctuation = ''.join([character.lower() for character in turtles if character not in punctuation])

In [None]:
# it looks different now...but it's been reformed back into what we first had without punctuation
remove_punctuation

Now we can try to run the FreqDist on our preprocessed text. 

In [None]:
# create a new frequency distribution
cleaned_fdist = nltk.FreqDist(remove_punctuation.split())

In [None]:
# now we get proper results for turtles
cleaned_fdist['turtles']
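
Our `punctuation` string only covered commas and exclamation marks. For a wider net, Python's standard library provides `string.punctuation`, a ready-made string of ASCII punctuation characters. The sketch below wraps the same list-comprehension idea in a function. The name `strip_punctuation` and the example lyric are made up, and removing *every* punctuation mark is an assumption that will not suit every project.

```python
import string


def strip_punctuation(text, unwanted=string.punctuation):
    """Lowercase `text` and drop every character found in `unwanted`."""
    return ''.join(ch.lower() for ch in text if ch not in unwanted)


lyric = "Heroes in a halfshell, turtle power!"
cleaned = strip_punctuation(lyric)
print(cleaned)
print(cleaned.split())
```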

## `nltk.word_tokenize()`

Okay, we've been using `.split()` and found some potential solutions for the way that punctuation may interfere with our definition of words.

At the same time - what if we wanted to retain punctuation? Do you think it would be important to know the difference between words that come before / after punctuation? Could punctuation tell us something about the syntax of a sentence or the tone of voice of writing? These are questions without clear answers, but are worthy of consideration. Another more practical aspect of retaining punctuation is that punctuation markers could help with segmentation of strings into words and/or sentences. For this reason, using `.split()` can use some help.

NLTK has two built-in segmentation functions which are improvements upon using `.split()`. These functions are `nltk.word_tokenize()` and `nltk.sent_tokenize()`. They convert raw strings into tokens or sentences, respectively. Let's just focus on word tokenization for now. 

In the cells below, compare the difference between using `.split()` and `nltk.word_tokenize()`:

In [None]:
# What is the difference between using `.split()` and `nltk.word_tokenize()`?
pretzels = 'These pretzels are making me thirsty!'

split_tokens = pretzels.split()
nltk_tokens = nltk.word_tokenize(pretzels)

print(f"Using .split(): \n{split_tokens}\n\nUsing nltk: \n{nltk_tokens}")

The NLTK tokenizer has treated the punctuation as a separate word - so it is smart enough to recognise that words should be separated from punctuation. It does this using a set of additional rules as well as some splitting. This makes perfect sense for punctuation which occurs after words, such as commas, full stops, exclamation marks, and so on. 
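
To get a feel for what "rules plus splitting" could mean, here is a very rough stand-in built on a regular expression. To be clear, this is *not* NLTK's actual algorithm. The function name `rough_tokenize` is made up, and the pattern handles only the simple case of separating words from punctuation marks.

```python
import re


def rough_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


print(rough_tokenize('These pretzels are making me thirsty!'))
```

A pattern this simple knows nothing about contractions or abbreviations, which is exactly where a purpose-built tokenizer earns its keep.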

What's going on in the cell below? 

In [None]:
# What is different about these tokens? 
nltk.word_tokenize('I can\'t even.')

The word "can't" was split into two tokens (`ca` and `n't`)! Why is that? Well, if we think about it, "can't" actually stands for *two* words - "can" and "not." The tokenizer has an additional set of rules to find these contractions and split them accordingly. Using `.split()`, on the other hand, would result in "can't" being stored as a single word. Moreover, removing the punctuation *before* tokenization would turn "can't" into "cant", and then `nltk.word_tokenize()` would treat "cant" as a single word. Is this an issue? Well, considering the word "cant" is its own word separate in meaning from "can't", it certainly could be.


The point is that the order of pre-processing and normalisation steps is important, as are the different things you might want to do to a text. Many modern NLP libraries perform pre-processing automatically, and it is fundamental to understand how your data is being normalised in order to use these functions properly. 
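
The effect of ordering can be seen even without NLTK. In this small sketch, the variable names and the choice of which marks to remove are made up for illustration. It compares stripping punctuation before splitting with splitting alone:

```python
text = "I can't even."
marks_to_remove = "'."

# strip punctuation first, then split: the apostrophe is gone before tokenizing
stripped_first = ''.join(ch for ch in text if ch not in marks_to_remove).split()

# split only: the contraction survives, but the full stop sticks to "even"
split_only = text.split()

print(stripped_first)
print(split_only)
```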

## **Your Turn**

Spend some time becoming familiar with the differences between `.split()` and `nltk.word_tokenize()`. 

As part of your comparisons, create frequency distributions based on the results of `.split()` and `nltk.word_tokenize()` for the same strings. 


In [None]:
# compare the two tokenizers here
# make sure to compare frequency distributions!