<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/19_NLTK_lexical_resources.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lexical Resources**

What are `lexical resources`, and why did Stephen separate this section out from when it was first introduced, in Chapter 02? 

Lexical resources are additional dictionaries or databases which provide specific types of information about words/language. Arguably, any piece of additional information assigned to a word (like part of speech or length of a word) can be considered a lexical resource. In this section, we turn to resources that can further add to this information for more specific uses. 

NLTK has resources which serve as tools to help with NLP tasks, such as cleaning function words from a text or identifying "unusual" words.

There are also resources which are more...analysis-y. Things like valence ratings (e.g., how positive or negative a word seems) are used for sentiment analysis, as are other "word rating" types of resources which give perceptions of how abstract or concrete a word seems. There are also WordNet synsets, which NLTK covers – these are used to calculate semantic similarity or find words with similar meanings, and plenty more.


I have separated these from the WordNet section because I found that the first two chapters throw so much at you, and I wanted to make sure we learn about WordNet and these resources after having a firmer grasp of NLTK.

Let's get started.

In [None]:
# import the necessary resources
import nltk
nltk.download('book')

In [None]:
from nltk.corpus import *

## **A list of words is a resource**

As NLTK points out, a set of vocabulary could be considered a resource. Spell checkers need to know the correct spelling of a word, and a word list could provide this spelling. 

You could also compare sets of vocabulary between two text categories in order to find whether certain words appear in one category but not another. 

You can also use a set of vocabulary to find words that aren't "real" or are considered "unusual," which is how NLTK uses the `wordlist` resource, which is literally just a list of words.

The NPS chat corpus is a good candidate for testing these options because there are a number of words used in the NPS chat corpus which would not be found in a "regular" spellchecker/dictionary. 

A word list can be used to check for incorrectly spelled or otherwise "unusual" words. Can you unpack the function NLTK provides to check for unusual words? Pay close attention to the subtraction which occurs in the line before the `return` statement. This works because you can use `set()` in Python to find differences between two sets: [here is a decent explanation](https://www.geeksforgeeks.org/python-set-difference/).
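
As a minimal sketch of that set subtraction, here is a toy example. The two vocabularies are made up for illustration and are not taken from any NLTK corpus.

```python
# Toy vocabularies (hypothetical, not taken from any NLTK corpus)
text_vocab = {"hello", "lol", "brb", "world"}
reference_vocab = {"hello", "world", "python"}

# Set difference: items in the left set that are absent from the right set
unusual = text_vocab - reference_vocab
print(sorted(unusual))
```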

In [None]:
def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)

The function therefore shows us all of the words in nps which are not in wordlist. We can see this returns a lot of "non-standard" words used in the chat corpus. 

In [None]:
print(unusual_words(nltk.corpus.nps_chat.words())[-15:])

## **Stop words**

Stop words are any list of words one would *not* want to be included in an NLP analysis. The most common type of stopword list includes function words (or closed class words), which in English include articles/determiners (e.g., `the`, `a`, `an`, `these`) and other function words like conjunctions, pronouns, etc. The logic behind doing this is that the main portions of meaning in any one document will be included in content words (e.g., nouns, verbs, adjectives) and not function words. Therefore, calculations on a text may be streamlined by removing function words, *or* a researcher may not want the values from function words included in any one analysis (e.g., removing frequency counts of function words from a text's lexical diversity).

At the same time, it's important to keep in mind that other functions and operations might best rely on stopwords, such as calculating part of speech (remember that determiners usually come before nouns). 

You can easily create your own list of stopwords by simply saving a list of words you don't want. There are also plenty of existing lists of stopwords you can find in various places on the internet, and NLTK has a built-in stopword list for English. Examine the list and this should give you a good idea of the words that I was discussing above. 

In [None]:
# NLTK's list of stopwords
print(stopwords.words('english'))
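
If you would rather build your own list, a sketch like the following might be enough. The stopword set and the tokens here are hypothetical choices for illustration, not a recommended list. The tokens are lowercased before the membership test so that `The` still matches `the`.

```python
# A hand-made stopword list (hypothetical; choose words to suit your own analysis)
my_stopwords = {"the", "a", "an", "of", "and"}

tokens = ["The", "cat", "and", "the", "dog", "sat"]

# Lowercase each token before the membership test so "The" matches "the"
kept = [t for t in tokens if t.lower() not in my_stopwords]
print(kept)
```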

Even if you do not want to remove all of the stopwords, you might still be interested in knowing the proportion of content words and stopwords in your texts. 

In [None]:
def content_fraction(text):
  # a set makes each membership test fast, which matters on large corpora
  stopwords = set(nltk.corpus.stopwords.words('english'))
  content = [w for w in text if w.lower() not in stopwords]
  return len(content) / len(text)

In [None]:
# what is the % of content words in the reuters corpus?

content_fraction(nltk.corpus.reuters.words())

In [None]:
# what is the % of content words in brown? 

content_fraction(nltk.corpus.brown.words())

We see that in both cases, the number of stopwords is a significant portion of the text. The function indicates that content words are about 73% of the Reuters corpus, and around 59% of the Brown corpus. What might account for these differences between the two corpora? 

## **Word puzzle**

The word puzzle is a fun thing to look at, but the larger point is that we are using a lexical resource to create a set of valid words to pull from. Beyond playing a word puzzle game, the applications for this might matter if you wanted to include sets of "valid" words in an analysis. 

NLTK shows us how to play boggle using a combination of frequency distributions and the word list of valid words - having a set of words that are "valid" to search from, outside of your problem / function, can also be seen as a lexical resource. 

In [None]:
# define the "random" set of letters that would appear in boggle. 
puzzle_letters = nltk.FreqDist('egivrvonl')

# always good to check what the object actually is - here we see it's a fancy dictionary
puzzle_letters.pprint()

In [None]:
# then define the letter that MUST be used in each solution
obligatory = 'r'

# then get a literal list of *all* the words
wordlist = nltk.corpus.words.words()

# then write a list comprehension which:
# loops through each word, keeping it only if it has at least 6 letters,
# checks to make sure that word contains the obligatory letter (in this case, 'r')
# checks that the word uses each letter no more often than it appears in the allowable letters
[w for w in wordlist if len(w) >= 6 and obligatory in w and nltk.FreqDist(w) <= puzzle_letters] # the final FreqDist counts the LETTERS in the word.

It might be a bit unclear what the authors are doing - but it should help if you compare what is going on with a FreqDist of a list of words versus a FreqDist of a single word. 

In [None]:
# reinforce difference between fdist of words...
nltk.FreqDist(['hello', 'hi']).pprint()

# ...versus fdist of letters.
nltk.FreqDist('hello').pprint()
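
One way to think about the `<=` comparison in the puzzle is as a check that a word never needs more copies of a letter than the puzzle supplies. The sketch below spells that check out with `collections.Counter` (which `FreqDist` subclasses) rather than with `FreqDist` itself. The helper name `can_spell` is made up for this example.

```python
from collections import Counter

def can_spell(word, letters):
    """Return True if `word` uses each letter no more often than `letters` supplies it."""
    available = Counter(letters)
    needed = Counter(word)
    # A Counter returns 0 for letters it has never seen
    return all(count <= available[ch] for ch, count in needed.items())

print(can_spell("glover", "egivrvonl"))
print(can_spell("error", "egivrvonl"))
```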

## **Names list**

This resource is a list of English names that are typically used for females and males. This list is probably showing its age today, particularly in the way that the authors ascribe gender to specific names. But, the point here is that we can use it as an example of a lexical resource.

In [None]:
# inspect the data set
names = nltk.corpus.names

# so really these are just two .txt files named as male or female. 
names.fileids()

In [None]:
# you can select the specific files in the `words()` method of the corpus
male_names = names.words('male.txt')
female_names = names.words('female.txt')

In [None]:
# If a name is in *both* lists, what does this mean? 
print([w for w in male_names if w in female_names])

NLTK raises an observation that the spelling of names might pattern with these two gendered categories. Specifically, the final letter of a name might predict whether it is associated with females or males. Using a CFD can help to answer this question for us. 

This conditional frequency distribution will use gender as the category, represented by `fileid`

In [None]:
# You can then exploit the name of the file to create a CFD based on the final letter of the name
cfd = nltk.ConditionalFreqDist(
 (fileid, name[-1])
 for fileid in names.fileids()
 for name in names.words(fileid))

cfd.tabulate()
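
If the CFD feels opaque, it may help to picture it as a dictionary that maps each condition to its own counter. The sketch below imitates that idea with the standard library only. The file names and name lists are hypothetical stand-ins, not the contents of the corpus.

```python
from collections import Counter, defaultdict

# Hypothetical miniature name lists, standing in for the two corpus files
toy_names = {
    "list_a.txt": ["Anna", "Maria", "Ella"],
    "list_b.txt": ["John", "Aaron", "Luca"],
}

# condition (the file id) -> Counter of final letters
final_letters = defaultdict(Counter)
for fileid, names_in_file in toy_names.items():
    for name in names_in_file:
        final_letters[fileid][name[-1]] += 1

print(dict(final_letters))
```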

In [None]:
import matplotlib.pyplot as plt
# define the size of the figure
plt.figure(figsize = (20, 10))

cfd.plot()

Are they right? Do certain letters pattern with certain name categories?

## **CMU pronunciation dictionary**

The CMU dictionary is quite different from the prior resources. This resource has been used for speech recognition, speech processing, etc., and attempts to encode the audible properties of language. CMU = Carnegie Mellon University in the USA, which has been producing a lot of great work in computational linguistics, NLP, and machine learning for years now. I suggest you take a look at the [CMU dictionary website](http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=C+M+U+Dictionary) so you understand how the dictionary works, and [the link to ARPABET](https://en.wikipedia.org/wiki/ARPABET) is also useful.

The structure of the CMU dictionary is in the form of `(word, [sound1, sound2, ...])`. So when accessing any one item within the CMU dictionary, you need to know how to access both parts of a single entry. The book explains how to do this - by using two loop variables in your loop (similar to using `enumerate()`).

In [None]:
# inspect some entries to get an idea for them 
cmu_entries = nltk.corpus.cmudict.entries()

for entry in cmu_entries[42371:42379]:
  print(entry)

In [None]:
# I found dog!
for entry in cmu_entries[33325:33335]:
  print(entry)

So, the CMU codes which are three characters long will be codes which include a number indicating syllable stress. 


In [None]:

# loop through the word (which is a string) as well as the pron (which is a list[])
for word, pron in cmu_entries:
  # if the word has exactly three sounds...
  if len(pron) == 3:

    # notice the tuple assignment trick to save something of a certain length to that many variables
    ph1, ph2, ph3 = pron
    # then only get words which begin with the P sound and end with the T sound
    if ph1 == 'P' and ph3 == 'T':
      print(word, ph2, end=' ')

In [None]:
# Find which words rhyme by constructing syllables reflecting the rhyming portion 
syllable = ['N', 'IH0', 'K', 'S']

print([word for word, pron in cmu_entries if pron[-4:] == syllable])
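
The same two ideas (unpacking each entry into two loop variables, then comparing the tail of the phone list) can be tried on a tiny hand-typed list. The entries below are written in the `(word, [phones])` shape for illustration only and are not copied from the corpus.

```python
# Hand-typed entries in the (word, [phones]) shape; illustrative, not copied from the corpus
toy_entries = [
    ("cat", ["K", "AE1", "T"]),
    ("bat", ["B", "AE1", "T"]),
    ("dog", ["D", "AO1", "G"]),
]

ending = ["AE1", "T"]

# Each entry is unpacked into two loop variables; the slice compares the final phones
rhymes = [word for word, pron in toy_entries if pron[-len(ending):] == ending]
print(rhymes)
```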

In [None]:
# Find some rather well-known discrepancies between English sound and spelling.

# Can you find others? 
[w for w, pron in cmu_entries if pron[-1] == 'M' and w[-1] == 'n']

In [None]:
# Again, they are looking at mismatches between PRON and the actual orthography. 
sorted(set(w[:2] for w, pron in cmu_entries if pron[0] == 'N' and w[0] != 'n'))

Depending on your knowledge of the possible sounds and stress patterns, you can use that information to find different patterns in language based on these features. For example, the CMU entries also encode syllable stress, indicating which syllables have primary, secondary, or no stress with a 1, 2, or 0 appended to the phone. 

In [None]:
# this extracts the stress digits (0, 1, or 2) from the phones of a given pronunciation
def stress(pron):
  return [char for phone in pron for char in phone if char.isdigit()]
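
To see what comes back, the same logic can be run on a single hand-typed pronunciation. The function is repeated here under the made-up name `stress_pattern` so the sketch runs on its own, and `toy_pron` is an illustrative phone list in the CMU style.

```python
def stress_pattern(pron):
    # Same idea as `stress` above: keep only the digit characters from each phone
    return [char for phone in pron for char in phone if char.isdigit()]

toy_pron = ["F", "AY1", "ER0"]
print(stress_pattern(toy_pron))
```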

In [None]:
# how many words have five syllables, with primary stress on the second syllable and secondary stress on the second last syllable?
[w for w, pron in cmu_entries if stress(pron) == ['0', '1', '0', '2', '0']]

In [None]:
# can you think of other stress patterns? 
# You can input them into the `stress` function and play around with it. 



You can also find sets of words which are minimally contrasting - they only differ by a single sound. 

In [None]:
# this is a relatively convoluted list comprehension.
# it's split over three lines
p3 = [(pron[0]+'-'+pron[2], word)
  for (word, pron) in cmu_entries
  if pron[0] == 'P' and len(pron) == 3]

cfd = nltk.ConditionalFreqDist(p3)

In [None]:
# Look at the conditions to understand which different sound combinations are being examined. 
print(cfd.conditions())

In [None]:
# now loop through the cfd, the nltk authors named the iterator "template"

# this code is looking for any one template which has more than ten words associated with it. 
for template in sorted(cfd.conditions()):
  if len(cfd[template]) > 10:
    words = sorted(cfd[template])
    wordstring = ' '.join(words)
    print(template, wordstring[:70] + "...")

Of course, a common application of the CMU dictionary might be to get the pronunciation of certain words. You can do that by using the word to access the entry.

In [None]:
# first save the cmu dictionary to a variable
prondict = nltk.corpus.cmudict.dict()

# then ask for different words
prondict['dog']

In [None]:
prondict['exclamation']

In [None]:
# you can pass a list and loop through the entries to parse entire sentences
vuw = ['victoria', 'university', 'of', 'wellington']

# do you get what this list comprehension is doing?
print([ph for w in vuw for ph in prondict[w][0]])
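
If the nested comprehension is hard to read, it might help to unroll it into two ordinary loops. The sketch below does that with a hypothetical miniature dictionary (`toy_prondict`) in place of the real one, and then checks that both versions give the same list.

```python
# Hypothetical miniature pronunciation dictionary: word -> list of alternative pronunciations
toy_prondict = {
    "cat": [["K", "AE1", "T"]],
    "dog": [["D", "AO1", "G"]],
}
sentence = ["cat", "dog"]

phones = []
for w in sentence:                    # outer loop: each word in turn
    first_pron = toy_prondict[w][0]   # take the first listed pronunciation
    for ph in first_pron:             # inner loop: each phone in that pronunciation
        phones.append(ph)

# The comprehension form reads its `for` clauses in the same outer-to-inner order
flat = [ph for w in sentence for ph in toy_prondict[w][0]]
print(phones == flat)
```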

The CMU dictionary is very cool — if you're interested in doing anything with sound / speech of English, this is probably one of the best resources you could start with. 

## **Swadesh / comparative wordlists**

The last resource to explore in this section is a list of words which have one-to-one translations among multiple languages, stored as the **swadesh** wordlists. The book describes these as **cognates**, but technically, cognates are words which are similar in both form and meaning between two languages, such as the word `animal` in English and Spanish. The level of similarity between cognates can range, and I'm not completely convinced this list includes only cognates. 

Moreover, false cognates are words similar in form but *not* meaning between two languages, such as `angel` in English (the heavenly creature) and Dutch (a stinger). 

Regardless, we can view this resource as a relatively simple translation dictionary. 

You can look up the language codes used by the resource [here](https://iso639-3.sil.org/code_tables/639/data/e)

In [None]:
# import the resource
from nltk.corpus import swadesh

In [None]:
# the fileids show all of the different languages in the list
# it's a bit of a guessing game 
print(swadesh.fileids())

The authors show you how to make comparative lists by choosing specific entries from the word list. Doing so will then create tuples (pairs), word-by-word, for the chosen languages. 

In [None]:
# create a french/english set
fr2en = swadesh.entries(['fr','en'])

# make a dictionary of the entries: french will be the keys, and english will be the values
translate = dict(fr2en)

In [None]:
# what is the english translation of "chien"?
translate['chien']

In [None]:
# why do you get a key error when trying to translate from english to french?
translate['dog']
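
A dictionary built from pairs can only be looked up by the first item of each pair. One possible way to go in the other direction is to swap the pairs before building the dictionary, as in this sketch. The two word pairs are hypothetical stand-ins for the real entries.

```python
# Hypothetical two-entry word list in (French, English) order
toy_pairs = [("chien", "dog"), ("chat", "cat")]

fr_to_en = dict(toy_pairs)                    # keys are the first item of each pair
en_to_fr = {en: fr for fr, en in toy_pairs}   # swap each pair to reverse the direction

print(fr_to_en["chien"])
print(en_to_fr["dog"])
print("dog" in fr_to_en)
```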

In [None]:
# you can see that these lists are aligned, so that words are at the same index. 
languages = ['en', 'de', 'nl', 'es', 'fr', 'pt', 'la']

# build the aligned list once rather than on every pass through the loop
entries = swadesh.entries(languages)
for i in range(130,135):
  print(entries[i])