**IST664: Week 3 Lab**

In the realm of natural language processing, the term "morphology" refers to the study of words and how they are formed. In previous labs we have seen that even the seemingly simple task of tokenization has many complications. In this lab, we take a deeper dive into the world of words by examining how stemmers work, some of the complexities of part of speech tagging, and how we can represent words as high dimensional vectors.

Although contemporary deep learning methods tend to hide a lot of these details behind the veil of the neural network, it is still critically important to have an essential understanding of morphology and how it can impact higher levels of representation. Your ability to create, debug, and successfully modify a natural language system will be enhanced by deepening your understanding of how we use code to assign meaning to various parts of speech.

This lab begins by reading a complete text from the Project Gutenberg website. We are downloading Dostoevsky's Crime and Punishment, as plain text, in a translation by Constance Garnett.

In [1]:
import nltk # We'll be using lots of facilities from this
nltk.download('punkt') # Download, as not included in basic colab

# text from online gutenberg
from urllib import request # We will need this to read from the URL

url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw), len(raw)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


(str, 1176812)

In [2]:
# Over one million characters. Let's look at the first few.
raw[:178]

'\ufeffThe Project Gutenberg eBook of Crime and Punishment, by Fyodor Dostoevsky\r\n\r\nThis eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world'

In [3]:
# We'll begin our processing with tokenization using punkt
crimetokens = nltk.word_tokenize(raw)
crimetokens[112:122]


['Release', 'Date', ':', 'March', ',', '2001', '[', 'eBook', '#', '2554']

In [4]:
# Let's keep track of how many unique tokens we're starting with.
len(set(crimetokens))

11516

In [5]:
crimetokens = [w.lower() for w in crimetokens]
crimetokens[112:122]

['release', 'date', ':', 'march', ',', '2001', '[', 'ebook', '#', '2554']

**Part 1 - Stemming and Lemmatization**

Our first task will be to examine stemming - the process of removing endings from words. Stemming can be considered a "data reduction" method that may be helpful for simplifying downstream analysis. For example, the goal is for tokens like eat, eats, eaten, and eating to all reduce to eat (though, as we will see, stemmers do not always achieve this). As with tokenization, there are many approaches to stemming and each yields somewhat different results. Stemming is also sometimes referred to as a "normalization" technique. Another, even simpler normalization technique we saw earlier in the course is lowercasing. One goal of all normalization techniques is to reduce the overall size of the vocabulary, with the intent of improving the performance of later tasks such as searching. We will compare three stemmers provided by NLTK.
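Before turning to the real stemmers, here is a minimal sketch of what "suffix stripping" means. This is a toy written for illustration only: the suffix list and the minimum stem length are assumptions, and it is not the Porter, Lancaster, or Snowball algorithm.

```python
# Toy suffix stripper: an illustrative sketch, NOT one of the NLTK stemmers.
# The suffix list and the minimum stem length are assumptions for this example.
TOY_SUFFIXES = ("ing", "es", "s")  # checked in order, longest first

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

tokens = ["eat", "eats", "eating", "eatery"]
stems = [toy_stem(t) for t in tokens]
print(stems)
print(len(set(tokens)), "unique tokens ->", len(set(stems)), "unique stems")
```

Even this crude rule set shrinks the vocabulary, which is the "data reduction" effect described above. The minimum stem length keeps very short words such as "is" from being chopped down to a single letter.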

In [7]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
snowball = nltk.stem.SnowballStemmer('english')
type(porter), type(lancaster), type(snowball)


(nltk.stem.porter.PorterStemmer,
 nltk.stem.lancaster.LancasterStemmer,
 nltk.stem.snowball.SnowballStemmer)

In [8]:
# Let's test each one
porter.stem('eatery')

'eateri'

In [9]:
lancaster.stem('eatery')

'eatery'

In [10]:
snowball.stem('eatery')

'eateri'

In [11]:
# What happened with the examples above? Can you think of another word
# that might be stemmed inconsistently by these stemmers? Add your word
# in a call to these stemmers:

# 3.1: Use the Porter stemmer to stem a new word
print(porter.stem('happiness') )


# 3.2: Use the Lancaster stemmer to stem a new word
print(lancaster.stem('happiness'))



happi
happy


Computer scientist Martin Porter wrote and published the Porter stemmer more than 40 years ago. The Porter stemmer is a rule-based algorithm (i.e., no dictionary) for "suffix stripping." The algorithm was subsequently implemented by other coders in more than two dozen different computer languages. Eventually, Porter got tired of hearing about the implementation errors in some of these other versions, so he rewrote the algorithm in C about 20 years ago. He also created a programming framework called "Snowball" that can be used to create additional stemmers, including the third one above, which is also known as the Porter2 stemmer. You can see the latest state of the Snowball language and rule-based stemmers here: https://snowballstem.org

The Lancaster stemmer, also known as the Paice/Husk stemmer, was created at Lancaster University and has the advantage that the "rule book" it uses is external to the algorithm itself and can therefore be adapted to languages other than English.

In [12]:
# From a data reduction standpoint, which stemmer results in the greatest
# reduction in the number of unique tokens? Remember that we started with
# 11516 unique tokens before lowercasing.
crimePstem = [porter.stem(t) for t in crimetokens]
crimeLstem = [lancaster.stem(t) for t in crimetokens]
crimeSstem = [snowball.stem(t) for t in crimetokens]

len(set(crimePstem)), len(set(crimeLstem)), len(set(crimeSstem))

(7363, 6399, 7174)

In [13]:
# What percentage reduction have we achieved with the Porter stemmer?
round( 100 - (len(set(crimePstem))/len(set(crimetokens)) * 100), 1)

31.1

In [14]:
# The Porter stemmer was the least aggressive of the three. Now calculate and
# show the percent reduction in the number of tokens for the MOST aggressive
# of the three stemmers.

# 3.3: Compute and display percent reduction
round( 100 - (len(set(crimeLstem))/len(set(crimetokens)) * 100), 1)


40.1
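The percent reduction calculation shows up several times in this lab, so it may help to wrap it in a small function. The sketch below is one way to do that; the function names and the toy token lists are made up for illustration.

```python
def percent_reduction(before, after):
    """Percentage by which a count shrank, rounded to one decimal place."""
    return round(100 * (1 - after / before), 1)

def vocabulary_reduction(original_tokens, normalized_tokens):
    """Compare the number of unique tokens before and after normalization."""
    return percent_reduction(len(set(original_tokens)),
                             len(set(normalized_tokens)))

# Toy data (hypothetical): four unique tokens collapse to two unique stems.
toy_tokens = ["run", "runs", "running", "ran"]
toy_stems = ["run", "run", "run", "ran"]

print(percent_reduction(200, 150))
print(vocabulary_reduction(toy_tokens, toy_stems))
```

The formula is the same one used in the cells above, 100 - (after / before * 100), just rearranged.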

In [15]:
# Let's compare the highest frequency tokens from the three stemmers
from nltk import FreqDist
pdist = FreqDist(crimePstem)
ldist = FreqDist(crimeLstem)
sdist = FreqDist(crimeSstem)

# zip() is a cool built-in function for zipping together two or
# more lists/tuples into a single iterator.
compare = zip(pdist.most_common(20),
              ldist.most_common(20),
              sdist.most_common(20))
print(" - Porter -    - Lancaster -   - Snowball -") # Make a heading
[i for i in compare] # Here we use the iterator to show the 3 sets of results

 - Porter -    - Lancaster -   - Snowball -


[((',', 16177), (',', 16177), (',', 16177)),
 (('.', 8908), ('.', 8908), ('.', 8908)),
 (('the', 8006), ('the', 8038), ('the', 8006)),
 (('and', 7031), ('and', 7031), ('and', 7031)),
 (('to', 5350), ('to', 5350), ('to', 5350)),
 (('he', 4769), ('he', 4769), ('he', 4769)),
 (('a', 4651), ('a', 4651), ('a', 4651)),
 (('i', 4397), ('i', 4397), ('i', 4397)),
 (('you', 4086), ('you', 4094), ('you', 4086)),
 (('’', 4039), ('’', 4039), ('’', 4039)),
 (('“', 3980), ('“', 3980), ('“', 3980)),
 (('”', 3929), ('”', 3929), ('”', 3929)),
 (('of', 3927), ('of', 3927), ('of', 3927)),
 (('it', 3474), ('it', 3474), ('it', 3474)),
 (('that', 3282), ('that', 3282), ('that', 3282)),
 (('in', 3248), ('in', 3261), ('in', 3248)),
 (('wa', 2826), ('was', 2826), ('was', 2826)),
 (('!', 2364), ('on', 2606), ('!', 2364)),
 (('?', 2275), ('!', 2364), ('?', 2275)),
 (('hi', 2114), ('?', 2275), ('his', 2113))]

There's a lot going on in the display just above. All three stemmers agree on commas, periods, the word "and," and the close double quote. Can you think of some hypotheses for why the word "the" has a different count for the Lancaster stemmer?

What's going on near the end of the list where we have the following output:

(('wa', 2826), ('was', 2826), ('was', 2826))

The counts match, but what has the Porter stemmer done differently? Even based on the small amount of evidence above, what conclusions can you draw about the advantages and disadvantages of stemmers?
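Scanning twenty rows by eye is error-prone, so one option is to let zip() and set() pick out the rows where the stemmers disagree. The sketch below uses three rows copied from the output above; the function name is made up for this example.

```python
# (token, count) pairs copied from three rows of the comparison output
porter_top = [("the", 8006), ("and", 7031), ("wa", 2826)]
lancaster_top = [("the", 8038), ("and", 7031), ("was", 2826)]
snowball_top = [("the", 8006), ("and", 7031), ("was", 2826)]

def disagreements(*ranked_lists):
    """Return the rows where the ranked lists do not all hold the same pair."""
    return [row for row in zip(*ranked_lists) if len(set(row)) > 1]

rows = disagreements(porter_top, lancaster_top, snowball_top)
for row in rows:
    print(row)
```

A row survives the filter when either the stem or the count differs, so both the "the" row (different counts) and the "wa"/"was" row (different stems) are reported.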

In [16]:
# Now, as an exercise, compare the hapaxes from the three FreqDist objects.
# Try to create a nice compact display that highlights some of the similarities
# and differences between the stemmers.

# 3.4: Compare lists of hapaxes between the three stemmers

p_hapaxes = pdist.hapaxes()
l_hapaxes = ldist.hapaxes()
s_hapaxes = sdist.hapaxes()

print(p_hapaxes[:20])
print(l_hapaxes[:20])
print(s_hapaxes[:20])
# For reminders on how to access hapaxes, check the Week 2 lab.

['\ufeffthe', '#', '2554', 'august', '6', '2021', 'encod', 'utf-8', 'bicker', 'dagni', 'david', 'widger', 'prefac', 'reader', 'hard-work', 'engin', 'folk.', 'nekrassov', 'review', 'acclam']
['\ufeffthe', '#', '2554', 'august', '6', '2021', 'encod', 'utf-8', 'bick', 'dagny', 'david', 'widg', 'prefac', 'hard-working', 'engin', 'folk.', 'nekrassov', 'review', 'acclam', '1849']
['\ufeffthe', '#', '2554', 'august', '6', '2021', 'encod', 'utf-8', 'bicker', 'dagni', 'david', 'widger', 'prefac', 'reader', 'hard-work', 'engin', 'folk.', 'nekrassov', 'review', 'acclam']
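Printing three long lists makes the differences hard to spot. A more compact display could use set operations to separate the hapaxes that all stemmers share from those unique to one stemmer. The sketch below is one possible approach, using a handful of hapaxes copied from the output above; the function name is an assumption.

```python
# A few hapaxes copied from the first 20 of each list shown above
hapaxes = {
    "porter": ["bicker", "dagni", "widger", "prefac", "reader"],
    "lancaster": ["bick", "dagny", "widg", "prefac"],
    "snowball": ["bicker", "dagni", "widger", "prefac", "reader"],
}

def summarize(hapaxes_by_stemmer):
    """Return (shared hapaxes, hapaxes unique to each stemmer)."""
    sets = {name: set(words) for name, words in hapaxes_by_stemmer.items()}
    shared = set.intersection(*sets.values())
    unique = {}
    for name, words in sets.items():
        # Union of every other stemmer's hapaxes
        others = set().union(*(s for n, s in sets.items() if n != name))
        unique[name] = sorted(words - others)
    return sorted(shared), unique

shared, unique = summarize(hapaxes)
print("Shared by all:", shared)
for name, words in unique.items():
    print(f"Only {name}:", words)
```

With the real data you would pass the full pdist.hapaxes(), ldist.hapaxes(), and sdist.hapaxes() lists instead of these short samples.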


A lemma is the root form of a word. In English, one of the most striking sets of lemmas comes from the verb "to be." The words "am," "is," "are," and "be," despite their unique spellings and pronunciations, all lemmatize to "be." Let's try this with the WordNet Lemmatizer:

In [17]:
nltk.download('wordnet')
nltk.download('omw-1.4')
wnl = nltk.WordNetLemmatizer()
type(wnl)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


nltk.stem.wordnet.WordNetLemmatizer

In [18]:
wnl.lemmatize("am", pos ="v")

'be'

In [19]:
# Now lemmatize am, is, and are. Also test what happens if you leave out the
# pos argument? Write a comment describing what the pos argument does.

# 3.5: Lemmatize am, is, and are
print(wnl.lemmatize("am", pos ="v"))
print(wnl.lemmatize("is", pos ="v"))
print(wnl.lemmatize("are", pos ="v"))


# 3.6: Test the lemmatize method without the pos argument
print(wnl.lemmatize("am"))
print(wnl.lemmatize("is"))
print(wnl.lemmatize("are"))

# "pos" defines the PART OF SPEECH of the word. For example, here we have defined "am" as a verb.



be
be
be
am
is
are
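Why does leaving out pos change the answer? When pos is omitted, the lemmatize() method treats the word as a noun, and "am," "is," and "are" have no noun entry that maps to "be." The toy lookup below imitates that behavior. The table is a hypothetical miniature written for this example; it is not how WordNet stores its data.

```python
# Hypothetical miniature lookup table keyed by (word, part of speech).
# WordNet's real data is far larger and organized differently.
TOY_LEMMAS = {
    ("am", "v"): "be",
    ("is", "v"): "be",
    ("are", "v"): "be",
    ("cats", "n"): "cat",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself if there is no entry."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("am", pos="v"))  # found under the verb key
print(toy_lemmatize("am"))           # default noun key has no entry
print(toy_lemmatize("cats"))         # default noun key works for nouns
```

The takeaway is that a dictionary-style lemmatizer is only as good as the part of speech it is given.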


In [20]:
# Let's lemmatize Crime and Punishment to see what we get:
crimelemma = [wnl.lemmatize(t) for t in crimetokens]

len(set(crimelemma))

9793

In [21]:
# What percentage reduction have we achieved with the lemmatizer?
round( 100 - (len(set(crimelemma))/len(set(crimetokens)) * 100), 1)

8.4

In [22]:
# Here we make a frequency distribution from crimelemma and
# then a list of the 100 most common words in the FreqDist.
# Then we match those words to the contents of the FreqDist
# dictionary for the Porter stemmer. Finally, we highlight
# tokens for which there is a mismatch in the counts.
wdist = FreqDist(crimelemma) # Make a new FreqDist
wlist = wdist.most_common(100) # List the 100 most common words

# This creates some tuples by storing each word from wlist
# alongside the corresponding frequency count from the Porter stemmer.
matchlist = [(w, pdist[w[0]]) for w in wlist]

# Finally, show us the list of mismatches.
[m for m in matchlist if m[0][1] != m[1] ] # Use indexing to pull out the counts

[(('a', 5865), 4651),
 (('his', 2113), 0),
 (('her', 1824), 1831),
 (('have', 1156), 1216),
 (('be', 1136), 1238),
 (('this', 724), 0),
 (('your', 675), 696),
 (('up', 644), 645),
 (('do', 591), 668),
 (('know', 574), 599),
 (('will', 552), 556),
 (('come', 495), 576),
 (('very', 466), 0),
 (('only', 458), 0),
 (('like', 457), 509),
 (('why', 445), 0),
 (('see', 393), 441),
 (('go', 369), 544),
 (('once', 346), 0),
 (('razumihin', 344), 345)]
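The nested indexing in the cell above (m[0][1] versus m[1]) can be hard to read. Since FreqDist is a subclass of Python's Counter, the same comparison can be sketched with plain Counter objects and tuple unpacking. The counts below are copied from the outputs in this lab; the function name is made up for illustration.

```python
from collections import Counter

# A few counts copied from the lab outputs above
lemma_counts = Counter({"the": 8006, "a": 5865, "his": 2113})
porter_counts = Counter({"the": 8006, "a": 4651, "hi": 2114})

def count_mismatches(reference, other, n=100):
    """List (token, reference_count, other_count) where the counts differ."""
    return [
        (token, count, other[token])  # a missing key in a Counter yields 0
        for token, count in reference.most_common(n)
        if count != other[token]
    ]

mismatches = count_mismatches(lemma_counts, porter_counts)
print(mismatches)
```

A count of 0 on the right usually means the stemmer produced a different string (here "hi" instead of "his"), not that the word vanished from the text.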

In [None]:
# Now compare mismatches between the top 100 lemmatized tokens with the
# frequencies for the Lancaster stemmer and the Snowball stemmer. Based on
# the results, add a comment saying which stemmer causes the greatest
# number of mismatches. Can you speculate on why this might be the case?
# Are there any additional diagnostics you could run to confirm your hypothesis?

# 3.7: Compare frequencies by matching the top 100 with Lancaster

wdist = FreqDist(crimelemma) # Make a new FreqDist
wlist = wdist.most_common(100) # List the 100 most common words

matchlist = [(w, ldist[w[0]]) for w in wlist]

print("Lancaster \n",[m for m in matchlist if m[0][1] != m[1] ])

# 3.8: Compare frequencies by matching the top 100 with Snowball

matchlist = [(w, sdist[w[0]]) for w in wlist]

print("SnowBall\n",[m for m in matchlist if m[0][1] != m[1] ])

# 3.9: Run additional tests to understand differences between stemmers and
# the lemmatizer.

Lancaster 
 [(('the', 8006), 8038), (('a', 5865), 4651), (('you', 4086), 4094), (('in', 3248), 3261), (('wa', 2826), 0), (('at', 2080), 2146), (('her', 1824), 2190), (('not', 1815), 2054), (('with', 1756), 1757), (('she', 1691), 1692), (('for', 1682), 1698), (('on', 1484), 2606), (('all', 1315), 0), (('have', 1156), 0), (('are', 868), 0), (('there', 804), 0), (('this', 724), 0), (('were', 712), 0), (('out', 679), 688), (('your', 675), 0), (('one', 663), 0), (('up', 644), 651), (('them', 584), 588), (('know', 574), 599), (('will', 552), 0), (('am', 545), 549), (('or', 506), 508), (('come', 495), 1), (('too', 491), 493), (('man', 474), 522), (('don', 464), 572), (('only', 458), 0), (('like', 457), 0), (('can', 451), 477), (('time', 441), 0), (('more', 413), 0), (('some', 405), 0), (('sonia', 399), 0), (('ha', 395), 8), (('see', 393), 444), (('go', 369), 360), (('here', 359), 0), (('once', 346), 0), (('razumihin', 344), 345)]
SnowBall
 [(('a', 5865), 4651), (('wa', 2826), 0), (('her', 182

**Part 2 - Part of Speech (POS) Tagging**

At this point in the lab it should be obvious that the WordNet lemmatizer does not work very well unless we already know the part of speech of the word we are trying to lemmatize. This is a significant limitation, which is also reflected in the fact that we only achieved an 8% reduction in the number of unique tokens using this lemmatizer.

Given the limitations of stemmers and simple lemmatizers, it is time to take a more serious look at part of speech tagging. For this, we are going to graduate from NLTK to our first effort with spaCy. Whereas NLTK was designed for teaching and research, spaCy was architected so that it can serve as the basis of a production-grade NLP pipeline. Unlike some other NLP toolkits (e.g., Stanford CoreNLP, which is written in Java), spaCy was written in Python and Cython, so it is convenient for use directly from the Jupyter notebook environment. We will do a thorough examination of many of spaCy's capabilities in a later lab session. For now, we will just try out a few techniques.


In [23]:
import spacy
nlp = spacy.load('en_core_web_sm')
type(nlp)

spacy.lang.en.English

In [24]:
# What bound methods are available?
[m for m in dir(nlp) if m[0] != '_']

['Defaults',
 'add_pipe',
 'analyze_pipes',
 'batch_size',
 'begin_training',
 'component',
 'component_names',
 'components',
 'config',
 'create_optimizer',
 'create_pipe',
 'create_pipe_from_source',
 'default_config',
 'default_error_handler',
 'disable_pipe',
 'disable_pipes',
 'disabled',
 'enable_pipe',
 'evaluate',
 'factories',
 'factory',
 'factory_names',
 'from_bytes',
 'from_config',
 'from_disk',
 'get_factory_meta',
 'get_factory_name',
 'get_pipe',
 'get_pipe_config',
 'get_pipe_meta',
 'has_factory',
 'has_pipe',
 'initialize',
 'lang',
 'make_doc',
 'max_length',
 'meta',
 'path',
 'pipe',
 'pipe_factories',
 'pipe_labels',
 'pipe_names',
 'pipeline',
 'rehearse',
 'remove_pipe',
 'rename_pipe',
 'replace_listeners',
 'replace_pipe',
 'resume_training',
 'select_pipes',
 'set_error_handler',
 'set_factory_meta',
 'to_bytes',
 'to_disk',
 'tokenizer',
 'update',
 'use_params',
 'vocab']

In [25]:
# Let's process a small example first
sentence = "The faster Harry got to the store, the faster Harry would get home."
spsent = nlp(sentence)
type(spsent), len(spsent)
print(spsent)

The faster Harry got to the store, the faster Harry would get home.


In [26]:
# But this is no ordinary set of string tokens:
# What bound methods and attributes are available for this object?
[m for m in dir(spsent) if m[0] != '_']

['cats',
 'char_span',
 'copy',
 'count_by',
 'doc',
 'ents',
 'extend_tensor',
 'from_array',
 'from_bytes',
 'from_dict',
 'from_disk',
 'from_docs',
 'from_json',
 'get_extension',
 'get_lca_matrix',
 'has_annotation',
 'has_extension',
 'has_unknown_spaces',
 'has_vector',
 'is_nered',
 'is_parsed',
 'is_sentenced',
 'is_tagged',
 'lang',
 'lang_',
 'mem',
 'noun_chunks',
 'noun_chunks_iterator',
 'remove_extension',
 'retokenize',
 'sentiment',
 'sents',
 'set_ents',
 'set_extension',
 'similarity',
 'spans',
 'tensor',
 'text',
 'text_with_ws',
 'to_array',
 'to_bytes',
 'to_dict',
 'to_disk',
 'to_json',
 'to_utf8_array',
 'user_data',
 'user_hooks',
 'user_span_hooks',
 'user_token_hooks',
 'vector',
 'vector_norm',
 'vocab']

In [27]:
# So there are quite a number of attributes and bound methods for
# this collection of tokens. We will learn more of them eventually
# but for now, let's just look at one attribute.
spsent.is_tagged # What does this one tell us?


True

In [28]:
# So spaCy has guessed the part of speech for each token. We can easily list
# all of the tags.
[(i, i.pos_) for i in spsent]

[(The, 'PRON'),
 (faster, 'ADV'),
 (Harry, 'PROPN'),
 (got, 'VERB'),
 (to, 'ADP'),
 (the, 'DET'),
 (store, 'NOUN'),
 (,, 'PUNCT'),
 (the, 'PRON'),
 (faster, 'ADV'),
 (Harry, 'PROPN'),
 (would, 'AUX'),
 (get, 'VERB'),
 (home, 'ADV'),
 (., 'PUNCT')]

In [29]:
# spaCy has also stored the lemmas for each token
# Let's show the lemmas and clean up our output. We can use the
# tabulate package to make simple display tables.
from tabulate import tabulate

# Make a little dataset for tabulate() to work on.
poslist = [ (i.text, i.lemma_, i.pos_) for i in spsent]

print(tabulate(poslist,  headers=["Token", "Lemma", "Tag"]))


Token    Lemma    Tag
-------  -------  -----
The      the      PRON
faster   fast     ADV
Harry    Harry    PROPN
got      get      VERB
to       to       ADP
the      the      DET
store    store    NOUN
,        ,        PUNCT
the      the      PRON
faster   fast     ADV
Harry    Harry    PROPN
would    would    AUX
get      get      VERB
home     home     ADV
.        .        PUNCT


The table above just scratches the surface, but there's still a lot of interesting stuff happening there. In the first column we have the token itself, which can be a word, a number, or punctuation. The second column has the lemma and the last is the simple part of speech tag. By the way, you can find an explanation of these tags here:

https://universaldependencies.org/docs/u/pos/

There is a function call that will provide information about any of the tags:

In [30]:
spacy.explain("DET")

'determiner'
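Once every token carries a tag, a quick frequency count of the tags gives a rough profile of a sentence. The sketch below works on the (token, tag) pairs copied from the Harry example above, so it runs without spaCy; with a live Doc object you could build the same list from i.text and i.pos_.

```python
from collections import Counter, defaultdict

# (token, tag) pairs copied from the tagging output shown earlier
tagged = [
    ("The", "PRON"), ("faster", "ADV"), ("Harry", "PROPN"), ("got", "VERB"),
    ("to", "ADP"), ("the", "DET"), ("store", "NOUN"), (",", "PUNCT"),
    ("the", "PRON"), ("faster", "ADV"), ("Harry", "PROPN"), ("would", "AUX"),
    ("get", "VERB"), ("home", "ADV"), (".", "PUNCT"),
]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in tagged)

# Which tokens received each tag?
tokens_by_tag = defaultdict(list)
for token, tag in tagged:
    tokens_by_tag[tag].append(token)

print(tag_counts.most_common(3))
print(tokens_by_tag["ADV"])
```

Keep in mind that these counts reflect the tagger's guesses, so any tagging mistakes carry straight through to the totals.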

In [31]:
# Let's practice by tagging another sentence. Here's some text extracted from
# Wikipedia's article on kites.
kites = """
A kite is a tethered heavier-than-air or lighter-than-air craft with wing surfaces that react against the air to create lift and drag forces.
A kite consists of wings, tethers and anchors. Kites often have a bridle and tail to guide the face of the kite so the wind can lift it.
Some kite designs don’t need a bridle; box kites can have a single attachment point.
A kite may have fixed or moving anchors that can balance the kite.
One technical definition is that a kite is “a collection of tether-coupled wing sets“.
The name derives from its resemblance to a hovering bird.
"""

spkites = nlp(kites)
type(spkites), len(spkites)

(spacy.tokens.doc.Doc, 132)

In [32]:
# Add code to conduct the following analyses:

# 3.10: Display tokens, lemmas, and parts of speech for spkites. Try using a
# nice, neat tabular format for the output.

poslist = [ (i.text, i.lemma_, i.pos_) for i in spkites]

print(tabulate(poslist,  headers=["Token", "Lemma", "Tag"]))




Token        Lemma        Tag
-----------  -----------  -----
                          SPACE
A            a            DET
kite         kite         NOUN
is           be           AUX
a            a            DET
tethered     tethered     ADJ
heavier      heavy        ADJ
-            -            PUNCT
than         than         ADP
-            -            PUNCT
air          air          NOUN
or           or           CCONJ
lighter      light        ADJ
-            -            PUNCT
than         than         ADP
-            -            PUNCT
air          air          NOUN
craft        craft        NOUN
with         with         ADP
wing         wing         NOUN
surfaces     surface      NOUN
that         that         PRON
react        react        VERB
against      against      ADP
the          the          DET
air          air          NOUN
to           to           PART
create       create       VERB
lift         lift         NOUN
and          and          CCONJ
drag        

In [33]:
# It might be more convenient to work with individual sentences:
kitespans = list(spkites.sents)

kitespans[0] # Let's view just the first sentence


A kite is a tethered heavier-than-air or lighter-than-air craft with wing surfaces that react against the air to create lift and drag forces.

Let's close the loop on the idea of data reduction by seeing how many unique lemmas spaCy creates for Crime and Punishment. Recall that we were unsatisfied with the lemmatizer from NLTK because - in order for it to work effectively - we needed to know the POS for each token before calling the lemmatizer. The spaCy nlp() call eats up our whole text, applies tags, and determines lemmas, all based on a swappable language model.

In [34]:
# Process Crime and Punishment with spaCy
nlp.max_length = 1200000 # Increase from the default of 1 million characters

# Note that this call takes a little less than a minute to complete.
crimespacy = nlp(raw) # We're going back to the original raw text data!

type(crimespacy), len(crimespacy)

(spacy.tokens.doc.Doc, 274697)

In [35]:
# Let's count unique lemmas
newcrimelemma = [l.lemma_ for l in crimespacy]
len(set(newcrimelemma))

7858

In [36]:
# What percentage reduction have we achieved with the lemmatizer?
round( 100 - (len(set(newcrimelemma))/len(set(crimetokens)) * 100), 1)

26.5

**Part 3 - WordNet**

The leading definition of the word semantics is, "the study of the meaning of words."

One of the earliest and most comprehensive efforts to explore semantics on a large scale arose from the work of George Miller at Princeton in the mid-1980s. The database arising from Miller's work, known as WordNet, was an award-winning effort to create a network of interconnected meanings of words. The WordNet project is alive and well in the present day. In fact, there is an international organization known as the Global WordNet Association that continues research and development of WordNet. Check it out here:

http://globalwordnet.org

GWA has an annual conference and offers some databases and documentation to the world community for free. These databases, now covering more than 200 languages, represent a massive amount of collective human effort, which is both amazing and illustrative of the core challenge with such resources: The maintenance of manually developed language resources requires lots of manual labor.

Possibly, some of the value of what WordNet provides has been or will eventually be superseded by approaches based on deep learning. We see inklings of this with GloVe word embeddings and more sophisticated embedding approaches such as BERT that are initially trained (in an unsupervised mode) on masses of unlabeled natural language text. Even so, having some understanding of how WordNet works and what it can do will set the stage for understanding newer approaches. So in this part of the lab, we explore some of the WordNet capabilities afforded by NLTK.

In [37]:
import nltk
nltk.download('wordnet') # Colab does not have it installed by default
from nltk.corpus import wordnet as wn

type(wn.synsets) # A key function call (method) that we will use

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


method

In [53]:
# Let's start by getting data on the word cat. A "synset" is a very basic
# data structure supported by NLTK that can be used to look up synonyms
# and related information for any word that the WordNet folks have included
# in the giant database.
syn = wn.synsets('cat')
type(syn), len(syn)


(list, 10)

In [39]:
# Each element in the list is a synset object. We have more than one whenever
# there is more than one sense of the word.

cat0 = syn[0] # Let's look at some of the details for the first synset

print ("Synset name :  ", cat0.name())

# Defining the word
print ("\nSynset meaning : ", cat0.definition())

# list of phrases that use the word in context; not all words have these
print ("\nSynset example : ", cat0.examples())

Synset name :   cat.n.01

Synset meaning :  feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats

Synset example :  []


Note that the synset name has interesting information in it: Of course the word itself comes first, but then the letter after the dot indicates the part of speech. The number after the second dot reveals the variant. So cat.n.01 would be read as "the first noun sense of cat." The fact that cat.n.01 appears as the first synset in the list indicates that linguists believed it to be the most common sense of the word.
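Because the synset name follows a fixed word.pos.number pattern, it is easy to take apart with ordinary string methods. The helper below is a small sketch (the function name is made up for this example); it splits from the right so that any dots inside the word itself would be left alone.

```python
def parse_synset_name(name):
    """Split a name like 'cat.n.01' into (word, part of speech, sense number)."""
    word, pos, sense = name.rsplit(".", 2)  # split on the last two dots only
    return word, pos, int(sense)

for name in ["cat.n.01", "vomit.v.01", "cat-o'-nine-tails.n.01"]:
    print(parse_synset_name(name))
```

The part of speech codes you will see most often are n for noun and v for verb.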

WordNet is organized as a kind of tree structure, where we can find more specific and more general terms related to a particular word by tracing up or down the branches and twigs of the tree. A "hypernym" - which you can think of as "higher level name" - is a more general term that encompasses the word we are focusing on. In the other direction, a "hyponym" is an example of a word that is more specific than the word we are focusing on. As a mnemonic, remember that "hyper" means "excess" or "above" as in "hyperactive." On the other hand, "hypo" means below, as in "hypothermia."
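To make the tree idea concrete, here is a toy hierarchy stored as a plain dictionary that maps each word to its hypernym. The entries are hand-written and heavily simplified for illustration; they are not WordNet data, and the function names are assumptions.

```python
# Hand-written toy hierarchy (word -> its hypernym); a simplification,
# not actual WordNet data.
TOY_HYPERNYM = {
    "domestic_cat": "cat",
    "wildcat": "cat",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "entity",
}

def hypernym_chain(word):
    """Walk upward from word until we reach a word with no hypernym."""
    chain = [word]
    while chain[-1] in TOY_HYPERNYM:
        chain.append(TOY_HYPERNYM[chain[-1]])
    return chain

def hyponyms(word):
    """Find the more specific words that sit directly below word."""
    return sorted(child for child, parent in TOY_HYPERNYM.items()
                  if parent == word)

print(hypernym_chain("cat"))  # more and more general terms
print(hyponyms("cat"))        # more specific terms
```

Going up the tree follows a single link per step, while going down requires searching for every word that points at the current one.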

In [45]:
print ("Synset name :  ", cat0.name()) # Let's show the name again

# Here is the "root" word - the highest level hypernym
print ("\nSynset root hypernym:  ", cat0.root_hypernyms())

# These are the more general terms
print ("\nSynset hypernyms:  ", cat0.hypernyms())

# These are the more specific terms
print ("\nSynset hyponyms:  ", cat0.hyponyms())



Synset name :   cat.n.01

Synset root hypernym:   [Synset('entity.n.01')]

Synset hypernyms:   [Synset('feline.n.01')]

Synset hyponyms:   [Synset('domestic_cat.n.01'), Synset('wildcat.n.03')]


The second and subsequent elements in the synset list (if any) are alternative word senses. If you're a music fan, you might be able to think of another use of the word "cat." In the first line of code below, we extract the second element of the synset list. Use it to show the name, definition, example, root hypernym, hypernyms, and hyponyms for this second sense of cat.

In [44]:
# Exercises: Explore the second synset for "cat."

cat1 = syn[1] # Let's look at some of the details for the second element

# 3.11: Print the name of cat1: What part of speech is it?
print ("Synset name :  ", cat1.name())
#NOUN

# 3.12: Print the definition of cat1, the examples of use of cat1 in context, the root hypernym of cat1, a list of hypernyms of cat1, and a list of hyponyms of cat1

print ("\nSynset meaning : ", cat1.definition())
print ("\nSynset root hypernym:  ", cat1.root_hypernyms())
print ("\nSynset hypernyms:  ", cat1.hypernyms())
print ("\nSynset hyponyms:  ", cat1.hyponyms())




Synset name :   guy.n.01

Synset meaning :  an informal term for a youth or man

Synset root hypernym:   [Synset('entity.n.01')]

Synset hypernyms:   [Synset('man.n.01')]

Synset hyponyms:   [Synset('sod.n.04')]


In [46]:
# Given what you saw above, does it make sense now why the root hypernym
# of cat is "entity" rather than something more specific like "animal?"

# Cat is such a common word in English that it has been reused to refer
# to many different kinds of things. Let's go back to the complete list
# to show all of the definitions:

[s.definition() for s in syn]

['feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats',
 'an informal term for a youth or man',
 'a spiteful woman gossip',
 'the leaves of the shrub Catha edulis which are chewed like tobacco or used to make tea; has the effect of a euphoric stimulant',
 'a whip with nine knotted cords',
 'a large tracked vehicle that is propelled by two endless metal belts; frequently used for moving earth in construction and farm work',
 'any of several large cats typically able to roar and living in the wild',
 'a method of examining body organs by scanning them with X rays and using a computer to construct a series of cross-sectional scans along a single axis',
 "beat with a cat-o'-nine-tails",
 'eject the contents of the stomach through the mouth']

In [47]:
# That's an amazing variety. Let's also glue the corresponding synset name
# to the definition so that we can see the parts of speech and numbering.
[ (s.name(), s.definition())  for s in syn]

[('cat.n.01',
  'feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats'),
 ('guy.n.01', 'an informal term for a youth or man'),
 ('cat.n.03', 'a spiteful woman gossip'),
 ('kat.n.01',
  'the leaves of the shrub Catha edulis which are chewed like tobacco or used to make tea; has the effect of a euphoric stimulant'),
 ("cat-o'-nine-tails.n.01", 'a whip with nine knotted cords'),
 ('caterpillar.n.02',
  'a large tracked vehicle that is propelled by two endless metal belts; frequently used for moving earth in construction and farm work'),
 ('big_cat.n.01',
  'any of several large cats typically able to roar and living in the wild'),
 ('computerized_tomography.n.01',
  'a method of examining body organs by scanning them with X rays and using a computer to construct a series of cross-sectional scans along a single axis'),
 ('cat.v.01', "beat with a cat-o'-nine-tails"),
 ('vomit.v.01', 'eject the contents of the stomach through the mouth')]

In [48]:
# That last one is British slang, probably arising from the propensity of
# domestic cats to retch hairballs. Anyway. . . We can also get lemmas for
# each synonym entry in our list of 10:
[ (s.name(), s.lemma_names())  for s in syn]

[('cat.n.01', ['cat', 'true_cat']),
 ('guy.n.01', ['guy', 'cat', 'hombre', 'bozo']),
 ('cat.n.03', ['cat']),
 ('kat.n.01',
  ['kat', 'khat', 'qat', 'quat', 'cat', 'Arabian_tea', 'African_tea']),
 ("cat-o'-nine-tails.n.01", ["cat-o'-nine-tails", 'cat']),
 ('caterpillar.n.02', ['Caterpillar', 'cat']),
 ('big_cat.n.01', ['big_cat', 'cat']),
 ('computerized_tomography.n.01',
  ['computerized_tomography',
   'computed_tomography',
   'CT',
   'computerized_axial_tomography',
   'computed_axial_tomography',
   'CAT']),
 ('cat.v.01', ['cat']),
 ('vomit.v.01',
  ['vomit',
   'vomit_up',
   'purge',
   'cast',
   'sick',
   'cat',
   'be_sick',
   'disgorge',
   'regorge',
   'retch',
   'puke',
   'barf',
   'spew',
   'spue',
   'chuck',
   'upchuck',
   'honk',
   'regurgitate',
   'throw_up'])]

In [49]:
# The elements of each list shown above (as the second part of the tuple)
# are plain words representing the synonym set. This could come in handy
# later, so let's make sure we know how to extract each synonym

[s.lemma_names()[0] for s in syn]

['cat',
 'guy',
 'cat',
 'kat',
 "cat-o'-nine-tails",
 'Caterpillar',
 'big_cat',
 'computerized_tomography',
 'cat',
 'vomit']

In [60]:
# Now repeat the process by finding the synset for an adjectival word, like
# good, bad, great, horrid, etc. Show the list of lemma names for that word.
# As a related task, reduce that list of lemma names to its unique set
# in order to eliminate duplicates. As a bonus challenge, can you figure out
# how to do all that with just one line of code?

# 3.13: Generate a unique set of lemmas for an adjective of your choice.

deliciousSyn = wn.synsets('Delicious')
[s.lemma_names() for s in deliciousSyn] # lemma_names() already returns a list




[['Delicious'],
 ['delightful', 'delicious'],
 ['delectable',
  'delicious',
  'luscious',
  'pleasant-tasting',
  'scrumptious',
  'toothsome',
  'yummy']]
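
One possible way to finish the "unique set" part of the task is a set comprehension that flattens the per-synset lists and keeps each lemma once. The sketch below is an illustration rather than the expected answer. It hard-codes the three lists printed above (as `lemma_lists`, a name introduced here) so that it runs without WordNet, and the case-folding step is an optional assumption that treats 'Delicious' and 'delicious' as the same lemma.

```python
# Lemma lists copied from the output above. With WordNet loaded, the same
# idea should work as a one-liner:
#   {name for s in wn.synsets('Delicious') for name in s.lemma_names()}
lemma_lists = [
    ['Delicious'],
    ['delightful', 'delicious'],
    ['delectable', 'delicious', 'luscious', 'pleasant-tasting',
     'scrumptious', 'toothsome', 'yummy'],
]

# Flatten the nested lists and drop duplicates in a single expression.
unique_lemmas = {name for names in lemma_lists for name in names}

# Optional: fold case so 'Delicious' and 'delicious' count only once.
unique_folded = {name.lower() for name in unique_lemmas}

print(sorted(unique_lemmas))
print(len(unique_lemmas), len(unique_folded))
```

A set is used because it discards duplicates automatically; wrap it in `sorted()` when a stable display order matters.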

In [65]:
# There are a couple more useful things we can do with a synset. First, we can
# ask WordNet for the part of speech for each entry:
from tabulate import tabulate # To make a neat table

takesyn = wn.synsets('take') # The word "take" has many senses - noun and verb

poslist = [(s.lemma_names()[0], s.pos(), s.definition()) for s in takesyn]

print(tabulate(poslist,  headers=["Word", "POS", "Definition"]))

Word         POS    Definition
-----------  -----  ----------------------------------------------------------------------------------------------
return       n      the income or profit arising from such transactions as the sale of land or other property
take         n      the act of photographing a scene or part of a scene without interruption
take         v      carry out
take         v      require (time or space)
lead         v      take somebody somewhere
take         v      get into one's hands, take physically
assume       v      take on a certain form, attribute, or aspect
take         v      interpret something in a certain way; convey a particular meaning or impression
bring        v      take something or somebody with oneself somewhere
take         v      take into one's possession
take         v      travel or go by means of a certain kind of transportation, or a certain route
choose       v      pick out, select, or choose from a number of alternatives
accept       v   

Having all of the most common words in a language organized based on their hypernyms and hyponyms leads to some interesting results. For example, the noun senses of "dog" and "cat" that refer to pets both have "mammal" as a hypernym, that is, a more general "container" word. So we can traverse our way upward from "cat" to find a common ancestor word and then traverse back down to "dog." If we started with "cat" and wanted to get to "doctor," it would probably take a lot more steps, because the common ancestor word would be much more general.

This leads to an interesting possibility: We can calculate the similarity between any pair of words by measuring the length of the "path" along the twigs and branches that connects two words. Here's an example to illustrate:

In [66]:
# Pay close attention: the "synset" method looks up ONE synset if it
# exists. We have to specify exactly which synset we are talking about,
# so that's why we use something like bird.n.01 to refer to the first
# noun sense of bird. Earlier in this lab we used the "synsets" method
# which will look up all of the available synsets for a word. So "synset"
# and "synsets" do slightly different jobs.
birdsyn = wn.synset('bird.n.01')
goatsyn = wn.synset('goat.n.01')
sheepsyn = wn.synset('sheep.n.01')

birdsyn.path_similarity(goatsyn) # Bird to goat
# These are similarity scores, not distances: they fall on a scale of
# 0 to 1, where values near 0 are least similar and 1 is most similar.


0.1111111111111111

In [67]:
# Does this value make sense?
birdsyn.path_similarity(sheepsyn) # Bird to sheep

0.1111111111111111

In [68]:
# How about goat to sheep?
goatsyn.path_similarity(sheepsyn)

0.3333333333333333
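
The three scores above are consistent with the usual definition of path similarity, 1 / (1 + d), where d is the number of edges on the shortest path between the two synsets: 0.111 corresponds to d = 8 and 0.333 to d = 2. The following is a minimal sketch of that calculation on a toy hierarchy. The `PARENT` table and both helper functions are hypothetical and far smaller than WordNet, so the numbers differ from those above.

```python
# Toy is-a hierarchy (child -> parent); hypothetical, NOT WordNet's real tree.
PARENT = {
    "goat": "mammal",
    "sheep": "mammal",
    "mammal": "animal",
    "bird": "animal",
    "animal": None,  # root of the toy hierarchy
}

def ancestors(word):
    """Return {node: edges from word} for word and every ancestor."""
    steps, dist = {}, 0
    while word is not None:
        steps[word] = dist
        word = PARENT[word]
        dist += 1
    return steps

def toy_path_similarity(a, b):
    """1 / (1 + shortest path length) through a common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    shortest = min(up_a[n] + up_b[n] for n in up_a if n in up_b)
    return 1 / (1 + shortest)

print(toy_path_similarity("goat", "sheep"))  # siblings: 2 edges apart
print(toy_path_similarity("bird", "goat"))   # 3 edges apart via "animal"
```

Even in this small tree the pattern matches the lab results: the two mammals score higher with each other than either does with the bird.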

In [69]:
# As with many things related to language, there is often an alternative way
# to do something. Leacock-Chodorow similarity also uses the path length,
# but scales it by the maximum depth of the hierarchy and takes a logarithm.
# Resnik similarity instead uses the information content of the most specific
# common ancestor, estimated from word frequencies in a corpus you provide.
# We repeat the display of path similarity here just for the sake of comparison.
nltk.download('wordnet_ic')
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')

birdsyn.path_similarity(goatsyn), birdsyn.lch_similarity(goatsyn), birdsyn.res_similarity(goatsyn, brown_ic)

[nltk_data] Downloading package wordnet_ic to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet_ic.zip.


(0.1111111111111111, 1.4403615823901665, 5.2175784741185165)
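
The Leacock-Chodorow figure above can be reproduced by hand. The measure is commonly defined as -log(p / (2D)), where p is the shortest path counted in nodes (edges + 1) and D is the maximum depth of the taxonomy. The sketch below assumes D = 19 for nouns, a value inferred from the outputs in this lab rather than looked up in WordNet, and uses the 8-edge path implied by the path similarity of 1/9. `lch_by_hand` is a hypothetical helper, not an NLTK function.

```python
import math

def lch_by_hand(edges, max_depth):
    """Leacock-Chodorow: -log(path length in nodes / (2 * taxonomy depth))."""
    return -math.log((edges + 1) / (2.0 * max_depth))

NOUN_DEPTH = 19  # assumption inferred from the outputs shown, not looked up

# A path similarity of 1/9 for bird vs. goat implies 8 edges between them.
bird_goat_edges = round(1 / 0.1111111111111111) - 1
print(bird_goat_edges)
print(lch_by_hand(bird_goat_edges, NOUN_DEPTH))
```

Under this definition the score tops out at log(2D) for identical synsets instead of 1, which is one reason it is not directly comparable to path similarity.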

In [71]:
# Obviously, these other similarity measures are calibrated on different
# scales from path similarity. Add code to produce the Leacock-Chodorow
# and the Resnik similarity for the other two pairs, specifically:
# birdsyn to sheepsyn, and
# sheepsyn to goatsyn

# 3.14: Compute L-C and Res similarity for birdsyn to sheepsyn

# One print call per line: joining print calls with commas builds a tuple,
# which makes the notebook echo a stray (None, None) after the output.
print("Bird - Sheep (lch_similarity): ", birdsyn.lch_similarity(sheepsyn))
print("Bird - Sheep (res_similarity): ", birdsyn.res_similarity(sheepsyn, brown_ic))
print("sheep - goat (lch_similarity): ", sheepsyn.lch_similarity(goatsyn))
print("sheep - goat (res_similarity): ", sheepsyn.res_similarity(goatsyn, brown_ic))

Bird - Sheep (lch_similarity):  1.4403615823901665
Bird - Sheep (res_similarity):  5.2175784741185165
sheep - goat (lch_similarity):  2.538973871058276
sheep - goat (res_similarity):  8.005695458684853

In [72]:
# OK, one final WordNet trick: Antonyms. If we want to find a word with
# the opposite meaning, WordNet can provide us with choices:
syn = wn.synsets('good') # Grab all of the synsets for good
[(s.name(), s.definition()) for s in syn] # Display them

[('good.n.01', 'benefit'),
 ('good.n.02', 'moral excellence or admirableness'),
 ('good.n.03', 'that which is pleasing or valuable or useful'),
 ('commodity.n.01', 'articles of commerce'),
 ('good.a.01',
  'having desirable or positive qualities especially those suitable for a thing specified'),
 ('full.s.06', 'having the normally expected amount'),
 ('good.a.03', 'morally admirable'),
 ('estimable.s.02', 'deserving of esteem and respect'),
 ('beneficial.s.01', 'promoting or enhancing well-being'),
 ('good.s.06', 'agreeable or pleasing'),
 ('good.s.07', 'of moral excellence'),
 ('adept.s.01', 'having or showing knowledge and skill and aptitude'),
 ('good.s.09', 'thorough'),
 ('dear.s.02', 'with or in a close or intimate relationship'),
 ('dependable.s.04', 'financially sound'),
 ('good.s.12', 'most suitable or right for a particular purpose'),
 ('good.s.13', 'resulting favorably'),
 ('effective.s.04', 'exerting force or influence'),
 ('good.s.15', 'capable of pleasing'),
 ('good.s.16',

In [73]:
# Let's use the first adjectival form:
goodsyn = wn.synset('good.a.01')

# Now get the antonym from the lemma
[l.antonyms() for l in goodsyn.lemmas()]

[[Lemma('bad.a.01.bad')]]

In [74]:
# Now you look up the antonym(s) for the adjectival sense of bad.

# 3.15: Look up the antonym for bad
# Let's use the first adjectival form:
badsyn = wn.synset('bad.a.01')

# Now get the antonym from the lemma
[l.antonyms() for l in badsyn.lemmas()]


[[Lemma('good.a.01.good')]]