<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/14_doing_more_with_dictionaries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Doing more with Dictionaries

If you've inspected `nltk.FreqDist` and `nltk.ConditionalFreqDist`, you'll have seen that they store data as dictionaries. We have already covered dictionaries in general. By now you should see that dictionaries are quite useful for storing linguistic information, which NLTK refers to as mapping from one thing to another (e.g., mapping a word to its POS tag). In this notebook we will return to dictionaries as a refresher, and also learn about extensions to dictionaries used by NLTK.


In [None]:
# download necessary resources for this notebook
import nltk
nltk.download(['punkt', 'averaged_perceptron_tagger', 'tagsets', 'treebank', 'universal_tagset', 'book'])

As a reminder, we can manually create a dictionary using a pair of curly braces `{}`. Below, I define a dictionary where words are keys and POS tags are the values:

In [None]:
# manually create a dictionary
pos = {}
pos['colorless'] = 'ADJ'

In [None]:
# add entries which are word:pos_tag key/value pairs
pos['ideas'] = 'N'
pos['sleep'] = 'V'
pos['furiously'] = 'ADV'

In [None]:
# inspect our dictionary
pos

In [None]:
# Using list() will give all the keys
list(pos)

## DefaultDict

A `defaultdict` is an extension of Python dictionaries from the `collections` module (the same module which gave us `namedtuple`). Learning how to use `defaultdict` is helpful when you want keys that do not yet exist in a dictionary to receive a value automatically.



First of all, consider what happens when we ask a dictionary for a key that does not exist. I'll create a dictionary with fictional frequency values for two words.
 

In [None]:
# make a small dictionary with word:frequency key/values
fake_frequencies = {'the': 1337, 'dog': 42}
fake_frequencies

In [None]:
# everything is fine when I ask for a key already in the dictionary. 
fake_frequencies['the']

In [None]:
# but if I ask for a key not in the dictionary, we get a key error
fake_frequencies['ran']

Using a `defaultdict` allows you to be a bit more defensive and avoid errors when looking up keys that might not exist. This is useful for things like frequency dictionaries: if a word does not exist in our frequency dictionary, we can assume that it has a frequency of zero. The same goes for POS tags or other lexical information. Rather than getting an error because something is not in the dictionary, it is often preferable to update the dictionary with a value indicating the lack of such information.

When you create a default dictionary, you specify a function (the default factory) which is called to produce a value whenever a key does not exist in the dictionary. In the cell below, I import `defaultdict` and then specify `int` as that function. Because calling `int()` with no arguments returns `0`, the default value will be `0`.



In [None]:
# import defaultdict
from collections import defaultdict

# create a defaultdictionary where default values are `int` of 0.
frequency = defaultdict(int)

# add a key/value to the dictionary
frequency['colorless'] = 4

# inspect key/value
frequency['colorless']

In [None]:
# now look up something not in the dictionary
# the key was not there, so was added with the DEFAULT value of int = 0
frequency['ideas']

In [None]:
# to understand why `int` gives us zero:
int()
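
One detail worth checking for yourself: looking up a missing key in a `defaultdict` does more than return the default value, it also stores the key. Here is a minimal sketch of that behaviour. The name `demo` and the words in it are made up for illustration.

```python
from collections import defaultdict

# hypothetical toy dictionary, just for illustration
demo = defaultdict(int)
demo['colorless'] = 4

# testing membership with `in` does NOT add the key
has_ideas_before = 'ideas' in demo
size_before = len(demo)

# looking the key up with [] DOES add it, with the default value 0
value = demo['ideas']
has_ideas_after = 'ideas' in demo
size_after = len(demo)

print(has_ideas_before, size_before)  # False 1
print(value)                          # 0
print(has_ideas_after, size_after)    # True 2
```

So if you only want to know whether a key is present, `in` is the safer check, because it leaves the dictionary unchanged.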

If you wanted to set the default value to be a specific number, you can use (`lambda: value`) as the default argument:


In [None]:
# I use an anonymous function (lambda) to set default value to a number that is not zero. 
any_integer = defaultdict(lambda: 1337)

any_integer['test']

We can extend this to other information such as parts of speech. Below, I indicate the `defaultdict` should have an empty `list` as the default value, and the same process occurs when we look up words not in the dictionary.

Words with no POS would be added to the dictionary, and functions such as `len()` could be used to determine whether they do or do not have POS and update them accordingly. 

In [None]:
# default is an empty list for tag sets
pos = defaultdict(list)
# add a key value
pos['sleep'] = ['NOUN', 'VERB']
# works as intended
pos['sleep']

In [None]:
# a word not in the dictionary is updated to default value (an empty list)
pos['green']
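
To make the `len()` idea concrete, here is a small sketch of how you might find the words that still have an empty tag list and then update one of them. The dictionary `tags` and its contents are toy data I made up, not anything from NLTK.

```python
from collections import defaultdict

# hypothetical toy data: two tagged words, then lookups of two untagged words
tags = defaultdict(list)
tags['sleep'] = ['NOUN', 'VERB']
tags['ideas'] = ['NOUN']
tags['green']      # not in the dictionary, so it is added with []
tags['furiously']  # same here

# an empty list has length 0, so len() separates tagged from untagged words
untagged = [word for word in tags if len(tags[word]) == 0]
print(untagged)  # ['green', 'furiously']

# update an untagged word once we know its tag
tags['green'].append('ADJ')
print(tags['green'])  # ['ADJ']
```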

For something like parts of speech, we might actually want to have a default POS tag instead of an empty list. Again we can use the `lambda` function to supply a default POS tag. 



In [None]:
# tell defaultdict that the default POS tag is "NOUN"
pos = defaultdict(lambda: 'NOUN')

pos['colorless'] = 'ADJ'
pos['colorless']

In [None]:
# if an entry doesn't exist it is added to the dictionary 
# there are other ways to do this using the basic dictionary type as well
pos['blog']
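
For comparison, here is a sketch of two of those other ways using a plain dictionary: `.get()` and `.setdefault()`. The dictionary `plain_pos` is a made-up example.

```python
# a plain dictionary with one known word (toy data for illustration)
plain_pos = {'colorless': 'ADJ'}

# .get() returns a fallback for a missing key but does NOT store it
guess = plain_pos.get('blog', 'NOUN')
in_dict_after_get = 'blog' in plain_pos
print(guess, in_dict_after_get)  # NOUN False

# .setdefault() returns the fallback AND stores it, much like defaultdict
stored = plain_pos.setdefault('blog', 'NOUN')
in_dict_after_setdefault = 'blog' in plain_pos
print(stored, in_dict_after_setdefault)  # NOUN True
```

The difference from `defaultdict` is that you have to supply the fallback value at every lookup, rather than once when the dictionary is created.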

In [None]:
# remember that list() can be used to inspect 
list(pos.items())

## **Tagging unknown words**

Now, instead of supplying a default tag such as "NOUN" (which could be dangerous), NLTK shows how labels such as "unknown" can be used to tag words which are "out of vocabulary." 

Below, we read in *Alice in Wonderland* from the Gutenberg data built into NLTK. We call `.words()` on the book to get the tokenized list, and then create a frequency distribution of those words using `nltk.FreqDist()`.




In [None]:
# create a frequency distribution of all the words from alice in wonderland
alice = nltk.corpus.gutenberg.words('carroll-alice.txt')
alice_vocab = nltk.FreqDist(alice)

# take a peek at some of the most common words
alice_vocab.most_common(100)[90:100]

In the next code cell, we use a `list comprehension` to extract the most frequent 1000 words, and we also toss away the frequency values (because we just want the words). 

In [None]:
# gather the most frequent 1000 tokens
alice_top_1000 = [word for (word, frequency) in alice_vocab.most_common(1000)]
alice_top_1000[90:100]

Now we will create a default dictionary in which the default tag will be 'UNK' for unknown. 

In [None]:
# create default dict with a default of "UNK" if not in the dict.
from collections import defaultdict
alice_known = defaultdict(lambda: 'UNK')

We will then add our top 1000 words to the dictionary, and in doing so, will make both the key and the value the word. We do so using a for loop which loops over each word and sets that word as the key and the value.

Why are we doing this? We are simulating a list of "known" words. Any word which has itself as the value is considered "known." 

In [None]:
# add each word from top 1000 as itself to the dictionary
for v in alice_top_1000:
  alice_known[v] = v

In [None]:
# make sure you understand what is going on here. 
list(alice_known.items())[90:100]

Now, we can perform a `list comprehension` which loops over all of the words in the book (contained in the variable `alice`). This will simulate reading the book one word at a time. 

The list comprehension is asking for the value of each word from our default dictionary `alice_known`. Remember, we've already added the top 1000 words to this dictionary, so if they are checked, they will return themselves. 

For the words in `alice` which are *not* in `alice_known`, the default dictionary will return our unknown tag "UNK".

The end result is a list of tokens the same length as `alice`, but any word not in the top 1000 most frequent words will be replaced with "UNK".
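
If the whole procedure feels abstract, it may help to see it on a sentence small enough to check by hand. The sketch below uses a made-up "book" and keeps only the 3 most frequent words. I use `collections.Counter` here instead of `nltk.FreqDist` so the example runs without any NLTK data.

```python
from collections import Counter, defaultdict

# hypothetical toy "book"
tokens = 'the cat sat on the mat and the cat sat by the cat'.split()

# keep only the 3 most frequent words as our "known" vocabulary
top_words = [word for word, count in Counter(tokens).most_common(3)]
print(top_words)  # ['the', 'cat', 'sat']

# known words map to themselves; everything else falls back to 'UNK'
known = defaultdict(lambda: 'UNK')
for word in top_words:
    known[word] = word

# "read" the book one token at a time
processed = [known[word] for word in tokens]
print(' '.join(processed))
# the cat sat UNK the UNK UNK the cat sat UNK the cat
```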

In [None]:
# create a new vocab which then looks at all the words in alice (not just the top 1000)
alice_complete = [alice_known[v] for v in alice]

# anything not already in the top 1000 words gets assigned to UNK
alice_complete[:100]

We can compare the two versions. Because the words `Lewis` and `Carroll` were not among the 1000 most frequent words, they are replaced with `UNK`.

In [None]:
# extract the first 10 tokens after the opening bracket at index 0
first_10_raw = ' '.join(alice[1:11])
first_10_processed = ' '.join(alice_complete[1:11])

# print them for comparison
print(f'Original version:\n {first_10_raw} \n Processed version:\n {first_10_processed}')



We can confirm the two lists of tokens are the same length (i.e., have the same number of words):

In [None]:
len(alice)

In [None]:
len(alice_complete)

So, in the end, this was a really convoluted method which replaced all of the words in the book that are *not* in the top 1000 most frequent words with the tag "UNK". 


You are probably wondering, why might we ever want to do something like this? It of course depends on the goals of your analysis. Extracting the most-frequent 1000 words from a text may in turn include a good portion of the text. In fact, if we ask that question about our text here, we can see that the 1000 most-frequent words account for over 90% of all the words in the book. 

In [None]:
# the lookups above added every word in the book as a KEY in alice_known,
# so we test against the top 1000 list itself (as a set, for fast lookups)
known_words = set(alice_top_1000)

# all tokens that are in the top 1000
in_vocab = [w for w in alice if w in known_words]

# all tokens that are not
out_vocab = [w for w in alice if w not in known_words]

In [None]:
# calculate proportion based on total words for top 1000
len(in_vocab) / len(alice) * 100

In [None]:
# same for those without
len(out_vocab) / len(alice) * 100
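
Because a frequency distribution already stores how often each word occurs, the same proportion can also be worked out from the counts alone. The sketch below shows the arithmetic on made-up tokens with a plain `Counter`. NLTK's `FreqDist` is built on `Counter`, so I would expect the same `most_common()` call to work on `alice_vocab`, but treat that as an assumption to check rather than a guarantee.

```python
from collections import Counter

# hypothetical toy tokens standing in for the book
tokens = 'the cat sat on the mat and the cat sat by the cat'.split()
vocab = Counter(tokens)

# add up the counts of the k most frequent words, then divide by all tokens
k = 3
covered = sum(count for word, count in vocab.most_common(k))
coverage = covered / len(tokens) * 100

print(covered, len(tokens))                              # 9 13
print(f'top {k} words cover {coverage:.1f}% of tokens')  # 69.2%
```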

By keeping only the 1000 most frequent words, we have reduced the search space for any task or procedure we might want to perform on this text. The NLTK book points out that a procedure such as this can help with part-of-speech tagging, because a tagger can work with a fixed vocabulary and treat every rare word as "UNK" instead of handling each one separately, which can improve accuracy and performance.

We can also use such information to compare the 1000 most frequent words in this text to other texts and/or word lists as a way to assess topics or similarity between two texts, or to assess the overall difficulty of the vocabulary by comparing this list to word frequency lists built on even larger corpora.
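
As one possible way to make such a comparison, the sketch below measures how much two lists of top words overlap, using set intersection and union (the Jaccard index). This is my own illustration rather than something from the NLTK book, and `top_words`, `text_a`, and `text_b` are hypothetical names with toy data.

```python
from collections import Counter

def top_words(tokens, n):
    """Return the n most frequent tokens as a set (hypothetical helper)."""
    return {word for word, count in Counter(tokens).most_common(n)}

# two hypothetical toy texts
text_a = 'the cat sat on the mat the cat sat'.split()
text_b = 'the dog sat on the log the dog sat'.split()

top_a = top_words(text_a, 3)
top_b = top_words(text_b, 3)

# words in both lists, and overlap as a proportion of all words in either list
shared = top_a & top_b
jaccard = len(shared) / len(top_a | top_b)

print(sorted(shared))  # ['sat', 'the']
print(jaccard)         # 0.5
```

A value of 1 would mean the two lists contain exactly the same words, and 0 would mean they share none.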

## Incrementally updating a dictionary

The NLTK book shows us how to use a very handy operator, `+=`, which increments a value by a set amount. Most commonly we see this used as `+= 1`, which means increase something by 1. For example:

In [None]:
# assign the value 1 to the variable a
a = 1

# increment the value by 1
a += 1
a

We can choose any number we like as the units of increment:

In [None]:
# same as above but increment by 1000 instead of 1
a = 1
a += 1000
a

This method for incrementing can be used to update information each time something is encountered, such as increasing the frequency counts of words, pos tags, or anything else we are interested in counting during a loop procedure. 

For example, the NLTK book shows us that we might be interested in counting the total incidence of different Part of Speech tags in the Brown corpus, and shows us how to do it. 

We create a `defaultdict` named `counts` and set the default to `int` (which means 0). 

We then import `brown`, and loop through the tagged version of the corpus. The loop checks the tag in `counts` and increments the value by 1. Because the default value of `counts` will be 0, the very first time a tag is found, that value will increment to 1, and so on. 
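
To see what the default value is saving us, here is a sketch that does the same counting twice on a made-up list of tags: once with a plain dictionary, where we have to handle the first sighting of each tag ourselves, and once with `defaultdict(int)`.

```python
from collections import defaultdict

# hypothetical toy list of tags
tags = ['NOUN', 'VERB', 'NOUN', 'DET', 'NOUN']

# with a plain dict, `+= 1` on a missing key raises a KeyError,
# so we must set the starting value of 0 ourselves
plain_counts = {}
for tag in tags:
    if tag not in plain_counts:
        plain_counts[tag] = 0
    plain_counts[tag] += 1

# with defaultdict(int), a missing key starts at 0 automatically
default_counts = defaultdict(int)
for tag in tags:
    default_counts[tag] += 1

print(plain_counts)          # {'NOUN': 3, 'VERB': 1, 'DET': 1}
print(dict(default_counts))  # {'NOUN': 3, 'VERB': 1, 'DET': 1}
```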



In [None]:
# create the default dictionary with default of 0
counts = defaultdict(int)

# read in brown corpus
from nltk.corpus import brown

# NLTK authors use the += operator to increment the tag count by 1 each time it is seen.
for (word, tag) in brown.tagged_words(categories = 'news', tagset = 'universal'):
  counts[tag] += 1

In [None]:
# the dictionary is frequency count of POS tags
counts.items()
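
The items come back in the order the tags were first seen, not in order of frequency. A sketch of one way to sort them by count, shown on toy data (`counts_demo` is a made-up stand-in for `counts`):

```python
from collections import defaultdict
from operator import itemgetter

# hypothetical toy counts, built the same way as above
counts_demo = defaultdict(int)
for tag in ['NOUN', 'VERB', 'NOUN', 'DET', 'NOUN', 'VERB']:
    counts_demo[tag] += 1

# sort the (tag, count) pairs by the count (item 1), largest first
by_frequency = sorted(counts_demo.items(), key=itemgetter(1), reverse=True)
print(by_frequency)  # [('NOUN', 3), ('VERB', 2), ('DET', 1)]
```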

In [None]:
# NOUN is the most frequent tag in this sample. Do you have any ideas why?
counts['NOUN']

## **Anagram dictionary**

Although it is very brief, the NLTK book talks about finding anagrams using a `defaultdict`. I've tried to add some more detail here to make it clear how this approach works and what it is doing to the words.

Basically, more than one word might include all of the same letters, albeit in a different order. By alphabetically sorting the letters of all words, any word with the same letters will be associated with the same alphabetic pattern, which can then be used as a common key for different words.

For example, the words `heart` and `earth` both use the same letters but in a different sequence. If we use `sorted()` on both strings, they give us the same result:

In [None]:
# both words are the same when you sort them!
print(''.join(sorted('earth')))
print(''.join(sorted('heart')))

In [None]:
# import NLTK's giant list of English words
words = nltk.corpus.words.words('en')
words[123456:123460]

In [None]:
# create anagrams as a default dict with list as the default
anagrams = defaultdict(list)

In [None]:
# loop over each word (e.g., 'earth')
for word in words:
  # the KEY will be the sorted version (e.g., 'aehrt')
  key = ''.join(sorted(word))
  # append the VALUE to that key (e.g., 'earth')
  anagrams[key].append(word) # each new word which has the same key will be added to this value (using list.append())

In [None]:
# this might help you see what's going on here 
# we can find any pattern which has more than a certain number of words - here I choose more than 6
# although, some of these words sure seem odd to me!
for key in anagrams:
  if len(anagrams[key]) > 6: # play around with this value to get different results. 
    print(key, ':', anagrams[key])
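
The full word list is large, so here is the same procedure on a handful of made-up words, which is small enough to verify by eye. `toy_words` and `toy_anagrams` are hypothetical names used only in this sketch.

```python
from collections import defaultdict

# hypothetical small word list
toy_words = ['earth', 'heart', 'hater', 'listen', 'silent', 'dog']

toy_anagrams = defaultdict(list)
for word in toy_words:
    # words with the same letters share the same sorted key
    key = ''.join(sorted(word))
    toy_anagrams[key].append(word)

# keep only the keys shared by more than one word
groups = {key: group for key, group in toy_anagrams.items() if len(group) > 1}
print(groups)
# {'aehrt': ['earth', 'heart', 'hater'], 'eilnst': ['listen', 'silent']}
```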

# `nltk.Index`

As the authors of NLTK seem fond of doing, they briefly introduce their own way of doing the same thing. Honestly, don't worry too much about this: `nltk.Index` is basically their own version of `defaultdict(list)`, built directly from (key, value) pairs.


In [None]:
help(nltk.Index)

In [None]:
anagrams = nltk.Index((''.join(sorted(w)), w) for w in words)
# choose one that's funny
anagrams['aeilnrt']
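
If you are curious what such an object might look like on the inside, here is a rough stand-in written with plain `defaultdict`. To be clear, `SimpleIndex` is a hypothetical class I wrote for illustration. It is not NLTK's source code, only a sketch of the behaviour we used above.

```python
from collections import defaultdict

class SimpleIndex(defaultdict):
    """Hypothetical stand-in: a defaultdict(list) filled from (key, value) pairs."""

    def __init__(self, pairs):
        # every missing key starts out as an empty list
        super().__init__(list)
        for key, value in pairs:
            self[key].append(value)

# toy word list for illustration
toy_words = ['earth', 'heart', 'dog']
index = SimpleIndex((''.join(sorted(w)), w) for w in toy_words)
print(index['aehrt'])  # ['earth', 'heart']
```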

# Conclusion

In this notebook we considered an extension to dictionaries which allows for default values. The main benefit of using such dictionaries for our purposes is that they make it easy to count various properties of words, such as word frequency and parts of speech. The convenience of using `defaultdict` is that we don't have to account for missing keys when constructing such objects.

There are a number of other demonstrations in this chapter which can be interesting to study but are not necessary to understand for our purposes. 