In this notebook, we will explore WordNet synsets, presenting a simple method for finding all mentions of all hyponyms of a given node in the WordNet hierarchy (e.g., finding all buildings in a text).

Source code adapted from: https://github.com/dbamman/anlp21/blob/main/10.wordnet/ExploreWordNet.ipynb

# WordNet

In [None]:
import nltk, re, spacy
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import wordnet as wn
# skip the NER and parser components; only tokenization and lemmatization are needed here
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

Get the synsets for a given word. The synsets here are roughly ordered by frequency of use (in a small tagged dataset), so that more frequent senses occur first.

In [None]:
synsets=wn.synsets('blue')
for synset in synsets:
    print (synset, synset.definition())

In [None]:
for lemma in wn.synset("blue.n.01").lemmas():
    print (lemma.name())

# relation functions from http://www.nltk.org/howto/wordnet.html; each returns a synset's direct hyponyms/hypernyms,
# and passing one to closure() yields *all* of a synset's hyponyms/hypernyms
hypo = lambda s: s.hyponyms()
hyper = lambda s: s.hypernyms()

In [None]:
# find all the synsets that are hyponyms of the target synset (descendants in the WordNet hierarchy)
list(wn.synset("blue.n.01").closure(hypo))

In [None]:
# find all the synsets that are hypernyms (ancestors up the tree) of the target synset
list(wn.synset("blue.n.01").closure(hyper))
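
To make explicit what the two `closure` calls above compute, here is a minimal sketch of a transitive closure over a toy hierarchy. The dictionary `TOY_HYPONYMS`, the helper `toy_hypo`, and this standalone `closure` function are hypothetical stand-ins written for illustration; they are not WordNet data and not NLTK's implementation. The sketch assumes a breadth-first traversal that leaves out the starting node.

```python
from collections import deque

# Toy hierarchy (hypothetical, NOT real WordNet data): node -> direct hyponyms.
TOY_HYPONYMS = {
    "color": ["chromatic_color", "achromatic_color"],
    "chromatic_color": ["blue", "red"],
    "blue": ["azure", "navy"],
    "achromatic_color": ["gray"],
}

def toy_hypo(name):
    """Return the direct hyponyms of a node in the toy hierarchy."""
    return TOY_HYPONYMS.get(name, [])

def closure(node, rel):
    """Yield every node reachable from `node` by repeatedly applying `rel`.

    Traversal is breadth-first, each node is yielded once, and the
    starting node itself is not included.
    """
    seen = set()
    queue = deque(rel(node))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        yield current
        queue.extend(rel(current))

print(list(closure("blue", toy_hypo)))   # ['azure', 'navy']
print(list(closure("color", toy_hypo)))
```

Because the starting node is excluded, any code that wants the target itself as well as its descendants has to add it back explicitly.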

In [None]:
# return the set of words/phrases (lemma names) for a synset and all of its hyponyms
def get_words_in_hypo(synset):
    words=set()
    # the closure does not include the starting synset, so add it explicitly
    hyponym_synsets=list(synset.closure(hypo))
    hyponym_synsets.append(synset)
    for hyponym_synset in hyponym_synsets:
        for lemma in hyponym_synset.lemmas():
            # WordNet joins the words of a multi-word lemma with underscores
            words.add(lemma.name().replace("_", " "))

    return words

get_words_in_hypo(wn.synset("color.n.01"))

In [None]:
# for a given set of words, find each instance among a list of tokens already processed by spacy.  
# return a list of token indexes that match.
# note this only identifies single words, not multi-word phrases.
def find_all_words_in_text(words, spacy_tokens):
    all_matches=[]
    for idx, token in enumerate(spacy_tokens):
        if token.lemma_ in words:
            all_matches.append(idx)
    return all_matches

# for a given set of token indexes, print out a window of words around each match, in the style of a concordance.
def print_concordance(matches, spacy_tokens, window=3):
    RED="\x1b[31m"    # ANSI escape: red foreground
    BLACK="\x1b[0m"   # ANSI escape: reset to the terminal default
    
    spacing=window*10
    for match in matches:
        start=match-window
        end=match+window+1
        if start < 0:
            start=0
        if end > len(spacy_tokens):
            end=len(spacy_tokens)
        pre=' '.join([token.text for token in spacy_tokens[start:match]])
        post=' '.join([token.text for token in spacy_tokens[match+1:end]])
        print("%s %s%s%s %s" % (pre.rjust(spacing), RED, spacy_tokens[match].text, BLACK, post))

# read a text, replacing all whitespace sequences with a single space
def read_text(filename):
    with open(filename, encoding="utf-8") as file:
        return re.sub(r"\s+", " ", file.read())
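
As noted in the comments above, `find_all_words_in_text` only matches single tokens, so multi-word entries returned by `get_words_in_hypo` are never found. One possible extension is sketched below. The function `find_phrases`, its `max_len` parameter, and the sample data are hypothetical; the sketch works on a plain list of lemma strings (standing in for `[token.lemma_ for token in spacy_tokens]`) and assumes the longest phrase should win when several start at the same position.

```python
def find_phrases(phrases, lemmas, max_len=3):
    """Return (start, end) index pairs whose space-joined lemmas are in `phrases`.

    `end` is exclusive. At each position the longest candidate (up to
    `max_len` tokens) is tried first, and matched tokens are not reused.
    """
    matches = []
    i = 0
    while i < len(lemmas):
        for n in range(min(max_len, len(lemmas) - i), 0, -1):
            candidate = " ".join(lemmas[i:i + n])
            if candidate in phrases:
                matches.append((i, i + n))
                i += n
                break
        else:
            # no phrase starts at position i
            i += 1
    return matches

# Hypothetical sample data, not taken from WordNet or the novel.
targets = {"blue", "navy blue", "red"}
lemmas = ["she", "wear", "a", "navy", "blue", "coat", "and", "red", "shoe"]
print(find_phrases(targets, lemmas))   # [(3, 5), (7, 8)]
```

Returning spans rather than single indexes means a concordance printer would need to highlight `lemmas[start:end]` instead of one token.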

In [None]:
# use Pride and Prejudice as an example
book=read_text("Datasets/pride_and_prejudice.txt")
spacy_tokens=nlp(book)

# search through all the tokens in the spacy_tokens argument to find any mention of words in the synset or any of its hyponyms
def wordnet_search(synset, spacy_tokens):
    targets=get_words_in_hypo(synset)
    matches=find_all_words_in_text(targets, spacy_tokens)
    print(len(matches), "matches")
    print_concordance(matches, spacy_tokens)

Let's do a very coarse tagging of a document to find all of the mentions of a specific WordNet synset and all of its hyponyms. Using the functions above, find all the color terms in Pride and Prejudice.

In [None]:
wordnet_search(wn.synset("color.n.01"), spacy_tokens)
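
The concordance shows every hit in context; a frequency table can be a useful complement when there are many matches. The sketch below is one way to tally hits per term. The function `count_matches` and the sample data are hypothetical, and a plain list of lemma strings stands in for the lemmas of the spaCy tokens.

```python
from collections import Counter

def count_matches(words, lemmas):
    """Count how often each target word occurs among the lemmas."""
    return Counter(lemma for lemma in lemmas if lemma in words)

# Hypothetical sample data, not taken from WordNet or the novel.
targets = {"blue", "red", "green"}
lemmas = ["the", "blue", "coat", "and", "the", "red", "hat", "and", "a", "blue", "ribbon"]

counts = count_matches(targets, lemmas)
for word, n in counts.most_common():
    print(word, n)
```

With the notebook's own objects, the inputs could be built as `get_words_in_hypo(synset)` and `[token.lemma_ for token in spacy_tokens]`.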