
# Additional Lexical Resources

NLTK provides you with a lot of built-in resources. In this notebook, I show you how to read in some other resources that are also interesting, including some I have used in my own research. To do so, I load a variety of different lexicons, which are sets of words paired with associated values, scores, ratings, or similar.

To use this notebook, you can read the basic description about each resource and then copy the code in case you would like to incorporate these resources into your final project. You should be able to directly copy and paste this code if you'd like to use it!

I read each resource in as a dictionary and demonstrate the keys/values for you. If you would prefer to download them yourself, [you can grab them here](https://github.com/scs-vuw/LING226). This notebook also points you to the relevant sources to cite if you do use these resources. All of the resources here are for English, although some of these authors also provide data in other languages. You'll need to look those up yourself if you're interested in similar lexicons for languages other than English.

In [None]:
# make a helper function to create dictionaries. 

# to grab the resource by url, we'll import requests
import requests

def get_word_rating_resource(url):
  """helper function to get lexical resources for LING226 students
  resources are hosted on github as .txt in the form of Word\tValue\n
  """
  # read the raw text and split on newlines
  raw = requests.get(url).text.split('\n')
  
  # split each pair and convert value to rounded float
  # the if statement is there to avoid indexing errors when a row in a resource doesn't have complete data
  raw_list = [(pair.split('\t')[0], round(float(pair.split('\t')[1]), 3)) for pair in raw if len(pair.split('\t')) == 2]

  # create a dictionary and return it
  return dict(raw_list)

# Word Frequency Values

Word frequency refers to how frequently a word appears in general usage. One simple analysis is to associate less frequent words with more difficult vocabulary. There are a lot of different frequency lists out there. The resource I will give you here is the [SUBTLEXus frequency measures.](https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus)

These frequency measures are taken from a large corpus of subtitles in American television and movies. Using this method, the authors have created frequency lists for a variety of different languages, which you can explore on their website. 

I have only included Log10WF (the base-10 logarithm of word frequency), which is a log-transformed measure of frequency that takes into account the highly skewed distribution of word frequencies in corpora. A higher value means that the word is more frequent. Also, keep in mind this data includes both capitalized and lowercased versions of some words, because the authors wanted to take proper names into account.

You would want to use frequency for any comparison of vocabulary between texts in English. For example, you could compute the average frequency of the words in a text. 
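As a minimal sketch of that idea, the function below averages lexicon values over the words in a text. The name `mean_lexicon_score` is hypothetical, and `toy_freq` is a made-up stand-in for the frequency dictionary loaded later in this notebook; its values are invented for illustration and are not real SUBTLEXus scores.

```python
def mean_lexicon_score(tokens, lexicon):
    """Return the mean lexicon value over the tokens found in the lexicon.

    Tokens missing from the lexicon are skipped. Returns None if no token is found.
    """
    scores = [lexicon[token] for token in tokens if token in lexicon]
    if not scores:
        return None
    return sum(scores) / len(scores)

# made-up values standing in for freq_dict (NOT real SUBTLEXus scores)
toy_freq = {'the': 6.0, 'cat': 4.0, 'sat': 3.5}

tokens = 'the cat sat on the mat'.split()

# 'on' and 'mat' are not in the toy lexicon, so only four tokens are averaged
mean_freq = mean_lexicon_score(tokens, toy_freq)
print(mean_freq)
```

Note that this averages over tokens, so a repeated word counts each time it appears. To average over unique words instead, pass `set(tokens)`.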

## How are these values different from `FreqDist` or `ConditionalFreqDist`?

Note that this is *different* than calculating the frequency of the words in your corpus / texts. The frequency values from SUBTLEXus are more representative of the *general* frequency of a word, and this can be thought to be an expected frequency for any given word from general usage. 
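To make the contrast concrete, here is a small sketch that puts the observed counts from one text next to reference values. It uses `collections.Counter` as a stand-in for NLTK's `FreqDist` (which is itself a `Counter` subclass). The dictionary `toy_reference` is a made-up stand-in for the SUBTLEXus values, with numbers invented for illustration.

```python
from collections import Counter

# made-up reference values standing in for freq_dict (NOT real SUBTLEXus scores)
toy_reference = {'the': 6.0, 'whale': 2.5}

# a tiny "corpus" in which 'whale' is unusually common
tokens = 'the whale saw the other whale near the sea'.split()

# observed counts: how often each word occurs in THIS text
observed = Counter(tokens)

for word in ['the', 'whale', 'sea']:
    # .get() returns None when the word is missing from the reference list
    print(word, observed[word], toy_reference.get(word))
```

The observed count tells you about this particular text, while the reference value tells you how common the word is expected to be in general usage. A word can be frequent in one and rare in the other.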


In [None]:
# load in frequency values
freq_url = 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/lexical%20resources/subtlxus_frequency.txt'
freq_dict = get_word_rating_resource(freq_url)

In [None]:
# one of the most frequent words in the English language
freq_dict['the']

In [None]:
# what about some lower frequency words?
# (.477 is the lowest value in this resource)
freq_dict['Tyrannosaur']

In [None]:
# try some more words

word_targets = ['cabbage', 'Klondike', 'sconce', 'Yes', 'car', 'think']

for target in word_targets:
    if target in freq_dict.keys():
        print(f'Word frequency for {target} is {freq_dict[target]}')
    else:
        print(f'Sorry, {target} not found')



# Age of acquisition

Age of acquisition (AoA) is another measure of vocabulary complexity and sophistication. The AoA values represent the average age at which native English speakers think they first learned a particular word. These values were gathered via surveys.

A lower value means that, on average, people believed they learned the word earlier in life, suggesting the word is more frequent, more concrete, and less sophisticated.

You could include this in an analysis as a measure of the overall complexity or sophistication of vocabulary, but keep in mind it is only one dimension of this feature. You would also want to report the *coverage* of this resource - which is a measure of how many words in your corpus were included/excluded in this lexicon. 

You [can read the paper here.](https://link.springer.com/article/10.3758/s13428-012-0210-4)
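Coverage, as described above, can be computed in a few lines. This is only a sketch: `lexicon_coverage` is a hypothetical helper name, and `toy_aoa` holds invented values rather than real AoA ratings.

```python
def lexicon_coverage(tokens, lexicon):
    """Return the proportion of tokens that have an entry in the lexicon."""
    if not tokens:
        return 0.0
    found = sum(1 for token in tokens if token in lexicon)
    return found / len(tokens)

# made-up values standing in for aoa_dict (NOT real AoA ratings)
toy_aoa = {'dog': 3.0, 'run': 4.0, 'taxonomy': 13.0}

tokens = ['dog', 'run', 'quickly', 'taxonomy']

# three of the four tokens are in the toy lexicon
coverage = lexicon_coverage(tokens, toy_aoa)
print(f'{coverage:.0%} of tokens were found in the lexicon')
```

This version counts every token. If you would rather report coverage over unique words (types), pass `list(set(tokens))` instead, and say which one you used when you report the number.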

In [None]:
# read in aoa data

aoa_url = 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/lexical%20resources/AoA_Brysbart.txt'
aoa_dict = get_word_rating_resource(aoa_url)

In [None]:
# which words do people say they learned, on average, after 20 years of age?
for word in aoa_dict.keys():
  if aoa_dict[word] > 20:
    print(word, aoa_dict[word])

In [None]:

word_targets = ['cabbage', 'sconce', 'car', 'think', 'no']

for target in word_targets:
    if target in aoa_dict.keys():
        print(f'Age of acquisition for {target} is {aoa_dict[target]}')
    else:
        print(f'Sorry, {target} not found')


# Word Concreteness

Word concreteness is a measure of how abstract or concrete the concept associated with a lexical item is. Words are not necessarily only concrete or only abstract, but rather slide along a scale. For example, the meaning of the noun "tree" is more concrete than the meaning of the noun "peace." The list includes mostly single word items but also some compound, two word items. This list was also collected using survey methods. 

In general, words which are more concrete are more frequent and easier to learn, suggesting that language which is more concrete may be less complex. But this is not a hard and fast rule. Concreteness, AoA, and frequency are three measures you might want to consider together when investigating anything related to vocabulary complexity.

This resource is a list of average concreteness ratings for 40,000 English words. [You can find the paper here.](https://link.springer.com/article/10.3758/s13428-013-0403-5)

Annotators from Amazon Mechanical Turk were asked to rate how concrete a word was on a scale of 1-5, with a 1 meaning abstract and a 5 meaning concrete.
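One way to look at concreteness, AoA, and frequency side by side is to compute a mean score for each lexicon over the same text. The sketch below assumes each lexicon is a plain word-to-value dictionary, like the ones built in this notebook. The function name `vocabulary_profile` and all of the values in `toy_lexicons` are invented for illustration.

```python
def vocabulary_profile(tokens, lexicons):
    """Return {measure name: mean score} for each lexicon.

    Tokens missing from a lexicon are skipped for that measure.
    A measure with no matching tokens gets None.
    """
    profile = {}
    for name, lexicon in lexicons.items():
        scores = [lexicon[token] for token in tokens if token in lexicon]
        profile[name] = round(sum(scores) / len(scores), 3) if scores else None
    return profile

# made-up values standing in for freq_dict, aoa_dict, and concrete_dict
toy_lexicons = {
    'frequency': {'tree': 4.0, 'peace': 3.0},
    'aoa': {'tree': 4.0, 'peace': 8.0},
    'concreteness': {'tree': 5.0, 'peace': 1.0},
}

profile = vocabulary_profile(['tree', 'peace', 'unknownword'], toy_lexicons)
print(profile)
```

Because each lexicon covers a different set of words, the three means may be based on different subsets of your text. Reporting coverage for each lexicon alongside the means helps readers judge how comparable they are.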



In [None]:
# create concreteness dictionary
concrete_url = 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/lexical%20resources/concreteness.txt'

concrete_dict = get_word_rating_resource(concrete_url)

In [None]:
# you can now access mean concreteness ratings
for word in concrete_dict.keys():
  # I put a lot of conditions here just to limit the output. there are a LOT of words in here :)
  if concrete_dict[word] == 5 and ' ' not in word and len(word) > 10:
    print(word, concrete_dict[word])

In [None]:

word_targets = ['cabbage', 'sconce', 'car', 'think', 'no']

for target in word_targets:
    if target in concrete_dict.keys():
        print(f'Average concreteness for {target} is {concrete_dict[target]}')
    else:
        print(f'Sorry, {target} not found')

# Semantic Diversity Ratings

This lexicon is a computationally derived measure of word ambiguity. The authors analysed 1000-word spans (i.e. slices of text 1000 words in length) from the British National Corpus and calculated the probability of any one particular word appearing in the spans. A word which has a higher likelihood of appearing in more spans will be more semantically diverse, meaning it is likely to have more senses (i.e. to be more polysemous).

A word with a lower rating should have a more specific meaning.

[You can read the paper here.](https://link.springer.com/article/10.3758%2Fs13428-012-0278-x)

In [None]:
semd_url = 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/lexical%20resources/semantic_diversity.txt'
semd_dict = get_word_rating_resource(semd_url)

In [None]:
# what are some of the most restricted words in terms of the contexts they can appear in?
for word in semd_dict.keys():
  if semd_dict[word] < 0.2:
    print(word)

In [None]:
# are words with more specific meanings also associated with lower semd? 

word_targets = ['cabbage', 'sconce', 'car', 'think', 'no', 'the', 'pretentious']

for target in word_targets:
    if target in semd_dict.keys():
        print(f'Average semantic diversity for {target} is {semd_dict[target]}')
    else:
        print(f'Sorry, {target} not found')

# Humor Ratings

This is a list of humor ratings for almost 5000 English words. [You can read the paper here.](https://link.springer.com/article/10.3758%2Fs13428-017-0930-6) 

Basically, they asked ~800 people on Amazon Mechanical Turk to rate how humorous a word was using a 1-5 scale. The values in this lexicon are the mean humor ratings for each word, after the authors trimmed and cleaned the data.

You could use this resource to see whether a certain text/genre uses words with higher or lower individual humor ratings, or if a text even has words which are deemed to be funny. I'm not convinced this measure can actually represent how humorous a text is, but nonetheless this might be an interesting or fun resource to try.


In [None]:
# create humor dictionary
humor_url = 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/lexical%20resources/humor.txt'
humor_dict = get_word_rating_resource(humor_url)

In [None]:
# you can now access mean humor ratings
for word in humor_dict.keys():
  if humor_dict[word] > 4:
    print(word, humor_dict[word])

In [None]:

word_targets = ['cabbage', 'sconce', 'car', 'think', 'no', 'the', 'pretentious']

for target in word_targets:
    if target in humor_dict.keys():
        print(f'Average humour for {target} is {humor_dict[target]}')
    else:
        print(f'Sorry, {target} not found')

# Emotion Words

This resource is a bit different: rather than giving a single value for each word, it lists words and records whether or not each one is associated with a particular emotion. The emotions include:

- anger
- anticipation
- disgust 
- fear
- joy
- negative
- positive 
- sadness
- surprise
- trust

The values associated with the words in the dictionary are a 0 or 1 for each of these emotions. A 0 means the word does not have an association, whereas a 1 means it does. So it is binary - either or. With a resource like this, one would likely then want to see how many words with these associations a particular text has. Notice that it also has positive/negative, so in a way this can also be used as a sentiment analysis resource. 

This is just one of many resources provided [on this website.](https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm)
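Counting how many emotion-associated words a text contains could look something like the sketch below. It assumes a nested dictionary of the form word, then emotion, then 0 or 1, which is the shape the emotion dictionary takes in this notebook. The function name `count_emotions` is hypothetical, and the entries in `toy_emotions` are invented rather than taken from the real lexicon.

```python
from collections import Counter

def count_emotions(tokens, emotion_lexicon):
    """Count, per emotion, how many tokens are associated with it (value of 1)."""
    counts = Counter()
    for token in tokens:
        # tokens missing from the lexicon contribute nothing
        for category, value in emotion_lexicon.get(token, {}).items():
            if value == 1:
                counts[category] += 1
    return counts

# made-up entries standing in for the real emotion lexicon
toy_emotions = {
    'storm': {'fear': 1, 'negative': 1, 'joy': 0},
    'gift': {'joy': 1, 'positive': 1, 'fear': 0},
}

tokens = ['the', 'storm', 'ruined', 'the', 'gift', 'storm']
emotion_counts = count_emotions(tokens, toy_emotions)
print(emotion_counts)
```

Raw counts grow with text length, so when comparing texts you would probably want to divide each count by the number of tokens.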

In [None]:
# get emotion resource and split on newlines
emotion_url = 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/lexical%20resources/emotion_lexicon.txt'
raw_emotion = requests.get(emotion_url).text.split('\n')

In [None]:
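# create a list, but this time of triples
# the if statement skips rows without exactly three tab-separated fields (e.g. a trailing blank line)
emotion_list = [(triple.split('\t')[0], triple.split('\t')[1], round(float(triple.split('\t')[2]),2)) for triple in raw_emotion if len(triple.split('\t')) == 3]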

In [None]:
# create empty dictionary with defaultdict having another dictionary inside
from collections import defaultdict
emotion_dict = defaultdict(dict)

In [None]:
# add each entry to the new dictionary
for triple in emotion_list:
  word, category, value = triple
  emotion_dict[word][category] = value

In [None]:
# you can now look up words for their associations 
emotion_dict['nepotism']

In [None]:
# and you can index deeper to get specific categories
emotion_dict['nepotism']['negative']

In [None]:

word_targets = ['fight', 'play', 'shout', 'clown', 'tornado']

for target in word_targets:
    if target in emotion_dict.keys():
        print(f'Emotions for {target} are {emotion_dict[target]}')
    else:
        print(f'Sorry, {target} not found')

# General Inquirer List

This is another lexicon which groups words into particular categories. The different categories are [explained here](http://www.wjh.harvard.edu/~inquirer/homecat.htm).

There are a *lot* of categories and you should look through them in order to know if there is anything interesting in here for you. For example, this list includes words associated with different emotions, activities, feelings, etc. Each category is simply a list of the words thought to be associated with that particular emotion, activity, or feeling.

So if a word is in the list, you can assume it is associated with that concept. You would use a resource like this to find how many words of a particular category exist in a text. 
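A sketch of that kind of count is below. It assumes the resource has been read into a dictionary that maps each category name to a list of words. The function name `count_category_words` and the category `Strong_TOY` are made up for illustration. The sketch lowercases both the tokens and the category words; whether that is appropriate depends on how words are cased in the real file, so check before relying on it.

```python
def count_category_words(tokens, category_words):
    """Count how many tokens appear in a category's word list, ignoring case."""
    # a set makes each membership check fast
    word_set = {word.lower() for word in category_words}
    return sum(1 for token in tokens if token.lower() in word_set)

# made-up category standing in for an entry in the General Inquirer dictionary
toy_gi = {'Strong_TOY': ['power', 'force', 'lead']}

tokens = ['They', 'lead', 'with', 'force', 'and', 'FORCE']

# 'lead', 'force', and 'FORCE' all match after lowercasing
strong_count = count_category_words(tokens, toy_gi['Strong_TOY'])
print(strong_count)
```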



In [None]:
# GI list
gi_url = 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/lexical%20resources/inquirerbasic.txt'
raw_gi = requests.get(gi_url).text.split('\n')

In [None]:
# we need to do something a bit different for this resource
gi_dict = dict()

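for category in raw_gi:
  # skip blank lines (e.g. a trailing newline at the end of the file)
  if category.strip():
    # the first field is the category name, the rest are its words
    name, *words = category.split('\t')
    gi_dict[name] = words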

In [None]:
# which words are associated with strength?
gi_dict['Strong_GI']

In [None]:
# which words are associated with persistence?
gi_dict['Persist_GI']