# Week 6 (Part 1): Dictionary Methods for WSD

We have seen that many words have many different senses.  In order to make the correct decision about the meaning of a sentence or a document, an application often needs to be able to **disambiguate** individual words, that is, choose the correct sense given the context.

In this lab we will be looking at methods for word sense disambiguation (WSD) that make use of dictionaries or other lexical resources (also referred to as **knowledge-based methods** for WSD).  In particular, we will look at:
* simplified Lesk
* adapted Lesk
* minimising distance in a semantic hierarchy

As in the previous lab, we will be using WordNet as our lexical resource.  So, first, let's import it.

In [1]:
from google.colab import drive
drive.mount('/content/drive')
import nltk
nltk.download('wordnet')
nltk.download('wordnet_ic')
nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic as wn_ic
from nltk.stem.wordnet import WordNetLemmatizer
import sys
import operator

#make sure that the path to your utils.py file is correct for your computer
sys.path.append('/content/drive/My Drive/NLE Notebooks/Week4LabsSolutions/')
from utils import *
from sussex_nltk.corpus_readers import AmazonReviewCorpusReader


Mounted at /content/drive
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package wordnet_ic to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet_ic.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Sussex NLTK root directory is /content/drive/My Drive/NLE Notebooks/resources


## Simplified Lesk

The Lesk algorithm is based on the intuition that the correct combination of senses in a sentence is the one whose definitions share the most words.

It is computationally very expensive to compare all possible sense combinations of words in a sentence.  If each word has just 2 senses, then a sentence of $n$ words has $2^n$ possible sense combinations.
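To get a feel for how quickly the search space grows, the short sketch below multiplies per-word sense counts together. The counts are made up for illustration and are not looked up in WordNet.

```python
from math import prod

# Hypothetical sense counts for each word of a six-word sentence
# (illustrative numbers only, not taken from WordNet).
equal_counts = [2, 2, 2, 2, 2, 2]

# Each word's sense can combine with every other word's sense,
# so the number of combinations is the product of the counts.
combinations = prod(equal_counts)
print(combinations)  # 2**6

# With uneven counts the product is still the size of the search space.
uneven_counts = [1, 4, 10, 3, 7, 2]
uneven_combinations = prod(uneven_counts)
print(uneven_combinations)
```

Even for a short sentence, the number of combinations quickly becomes too large to score exhaustively, which is the motivation for treating each word independently.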

In the simplified Lesk algorithm, below, we consider each word in turn and choose the sense whose definition has the most **overlap** with the contextual words in the sentence.


In [3]:

def simplifiedLesk(word,sentence):
    '''
    Use the simplified Lesk algorithm to disambiguate word in the context of sentence
    word: a String which is the word to be disambiguated
    sentence: a String which is the sentence containing the word
    :return: a pair (chosen sense definition, overlap score)
    '''
    
    #construct the set of context word tokens for the sentence: all words in sentence - word itself
    contexttokens=set(word_tokenize(sentence))-{word}
    
    #get all the possible synsets for the word
    synsets=wn.synsets(word)
    scores=[]
    
    #iterate over synsets
    for synset in synsets:
        #get the set of tokens in the definition of the synset
        sensetokens=set(word_tokenize(synset.definition()))
        #find the size of the intersection of the sensetokens set with the contexttokens set
        scores.append((synset.definition(),len(sensetokens.intersection(contexttokens))))
    
    #sort the score list in descending order by the score (which is item with index 1 in the pair)
    sortedscores=sorted(scores,key=operator.itemgetter(1),reverse=True) 
    #print(sortedscores)
    return sortedscores[0]
    


Now let's test it on a couple of sentences containing the word *bank*.

In [4]:
banksentences=["he borrowed money from the bank","he sat on the bank of the river and watched the currents"]
for sentence in banksentences:
    print(sentence,":",simplifiedLesk("bank",sentence))

he borrowed money from the bank : ('a financial institution that accepts deposits and channels the money into lending activities', 2)
he sat on the bank of the river and watched the currents : ('sloping land (especially the slope beside a body of water)', 2)


It actually appears not to do too badly.  However, this is more by luck than anything else.   If you inspect the sentences and the definitions, you will notice that most of the overlap is currently generated by stopwords.
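The sketch below makes the overlapping words visible for the two sentences and the definitions chosen in the output above. It uses a simple regular-expression tokeniser as a stand-in for `word_tokenize` (an assumption made so that the example needs no NLTK data).

```python
import re

def tokens(text):
    """Lower-case a string and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

# Sentences and chosen definitions copied from the output above.
examples = [
    ("he borrowed money from the bank",
     "a financial institution that accepts deposits and channels "
     "the money into lending activities"),
    ("he sat on the bank of the river and watched the currents",
     "sloping land (especially the slope beside a body of water)"),
]

overlaps = []
for sentence, definition in examples:
    # context = all words in the sentence except the target word
    context = tokens(sentence) - {"bank"}
    shared = sorted(context & tokens(definition))
    overlaps.append(shared)
    print(sentence, "->", shared)
```

Only *money* in the first sentence is a content word. Every other shared token is a function word that says nothing about the sense of *bank*.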

### Exercise 1.1
Improve the `simplifiedLesk` function by carrying out:
* case and number normalisation 
* stopword filtering
* lemmatisation

You should find some useful functions for doing this in `utils.py` based on earlier labs.

Make sure you test it.  Unfortunately, whilst the first sentence is still disambiguated correctly (with an overlap of 1), you should now find 0 overlap between any of the senses and the second sentence.
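If you do not have `utils.py` to hand, the sketch below shows roughly what the normalisation and stopword-filtering helpers might look like. The function names, the `NUM` placeholder and the tiny stopword list are all assumptions made for illustration; the real helpers in `utils.py` may behave differently. Lemmatisation is left out because it needs the WordNet data.

```python
# Tiny illustrative stopword list (an assumption; the real
# filter_stopwords in utils.py may use a different, larger list).
TOY_STOPWORDS = {"a", "an", "and", "from", "he", "into", "of", "on", "that", "the"}

def normalise_sketch(tokenlist):
    """Lower-case tokens and map digit strings to the placeholder 'NUM'."""
    return ["NUM" if t.isdigit() else t.lower() for t in tokenlist]

def filter_stopwords_sketch(tokenlist):
    """Drop stopwords and any token that is not purely alphabetic."""
    return [t for t in tokenlist if t.isalpha() and t not in TOY_STOPWORDS]

raw = "He borrowed 200 pounds from the Bank".split()
cleaned = filter_stopwords_sketch(normalise_sketch(raw))
print(cleaned)
```

Normalising before filtering matters here: `He` only matches the lower-case stopword list once it has been lower-cased.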

In [5]:

def simplifiedLesk(word,sentence):
    '''
    Use the simplified Lesk algorithm to disambiguate word in the context of sentence
    word: a String which is the word to be disambiguated
    sentence: a String which is the sentence containing the word
    :return: a pair (chosen sense definition, overlap score)
    '''
    
    #construct the set of context word tokens for the sentence: all words in sentence - word itself
    lemma =WordNetLemmatizer()
    contexttokens=set((filter_stopwords(normalise(word_tokenize(sentence)))))-{word}
    contextlemmas={lemma.lemmatize(contexttoken) for contexttoken in contexttokens}
    
    #get all the possible synsets for the word
    synsets=wn.synsets(word)
    scores=[]
    
    #iterate over synsets
    for synset in synsets:
        #get the set of tokens in the definition of the synset
        sensetokens=set(filter_stopwords(normalise(word_tokenize(synset.definition()))))
        senselemmas={lemma.lemmatize(token) for token in sensetokens}
        #find the size of the intersection of the senselemmas set with the contextlemmas set
        scores.append((synset.definition(),len(senselemmas.intersection(contextlemmas))))
    
    #sort the score list in descending order by the score (which is item with index 1 in the pair)
    sortedscores=sorted(scores,key=operator.itemgetter(1),reverse=True) 
    #print(sortedscores)
    return sortedscores[0]
    

In [6]:
for sentence in banksentences:
    print(sentence,":",simplifiedLesk("bank",sentence))

he borrowed money from the bank : ('a financial institution that accepts deposits and channels the money into lending activities', 1)
he sat on the bank of the river and watched the currents : ('sloping land (especially the slope beside a body of water)', 0)


## Adapted Lesk
WordNet definitions are very short.  However, it is possible to create a bigger set of sense words by including information about the hypernyms and hyponyms of each sense.
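The effect of this expansion can be sketched with a toy lexicon. The entries below are hand-written and hypothetical, not real WordNet data; they only illustrate how words from neighbouring senses can create overlap where the sense's own definition has none.

```python
# Toy lexicon (hypothetical entries, not real WordNet data).
TOY_SENSES = {
    "bank.river": {"gloss": "sloping land beside water",
                   "lemmas": ["bank"],
                   "hypernyms": ["slope"], "hyponyms": ["riverbank"]},
    "slope": {"gloss": "an elevated geological formation",
              "lemmas": ["slope", "incline"],
              "hypernyms": [], "hyponyms": []},
    "riverbank": {"gloss": "the bank of a river",
                  "lemmas": ["riverbank", "riverside"],
                  "hypernyms": [], "hyponyms": []},
}

def signature(sense_id, expand=False):
    """Collect the words describing a sense, optionally adding its neighbours' words."""
    entry = TOY_SENSES[sense_id]
    words = set(entry["gloss"].split()) | set(entry["lemmas"])
    if expand:
        for neighbour in entry["hypernyms"] + entry["hyponyms"]:
            other = TOY_SENSES[neighbour]
            words |= set(other["gloss"].split()) | set(other["lemmas"])
    return words

# Context words after stopword filtering and removal of the target word.
context = {"sat", "river", "watched", "currents"}
plain_overlap = context & signature("bank.river")
expanded_overlap = context & signature("bank.river", expand=True)
print(len(plain_overlap), len(expanded_overlap), expanded_overlap)
```

In this toy setting the overlap comes entirely from the definition of a hyponym, which is the kind of evidence that the short definition of the sense itself cannot provide.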

### Exercise 2.1
Adapt the Lesk algorithm to include in `sensetokens`:
* all of the lemma_names for the sense itself
* all of the lemma_names for the hypernyms of the sense
* all of the lemma_names for the hyponyms of the sense
* all of the words from the definitions of the hypernyms of the sense
* all of the words from the definitions of the hyponyms of the sense

Make sure you carry out normalisation and lemmatisation of these words as before.

Test each adaptation you make on the bank sentences, recording the overlap observed with the chosen sense.

In [7]:
def adaptedLesk(word,sentence):
    '''
    Use the simplified Lesk algorithm to disambiguate word in the context of sentence, using standard WordNet adaptations
    word: a String which is the word to be disambiguated
    sentence: a String which is the sentence containing the word
    :return: a pair (chosen sense definition, overlap score)
    '''
    
    #construct the set of context word tokens for the sentence: all words in sentence - word itself
    
    lemma =WordNetLemmatizer()
    contexttokens=set((filter_stopwords(normalise(word_tokenize(sentence)))))-{word}
    contextlemmas={lemma.lemmatize(contexttoken) for contexttoken in contexttokens}
    #get all the possible synsets for the word
    synsets=wn.synsets(word)
    scores=[]
    
    #iterate over synsets
    for synset in synsets:
        #get the set of tokens in the definition of the synset
        sensetokens=word_tokenize(synset.definition())
        sensetokens+=synset.lemma_names()
        for hypernym in synset.hypernyms():
            sensetokens+=hypernym.lemma_names()
            sensetokens+=word_tokenize(hypernym.definition())
        for hyponym in synset.hyponyms():
            sensetokens+=hyponym.lemma_names()
            sensetokens+=word_tokenize(hyponym.definition())
        
        sensetokens=set(filter_stopwords(normalise(sensetokens)))
        senselemmas={lemma.lemmatize(token) for token in sensetokens}
        #find the size of the intersection of the senselemmas set with the contextlemmas set
        scores.append((synset.definition(),len(senselemmas.intersection(contextlemmas))))
    
    #sort the score list in descending order by the score (which is item with index 1 in the pair)
    sortedscores=sorted(scores,key=operator.itemgetter(1),reverse=True) 
    #print(sortedscores)
    return sortedscores[0]
    

In [8]:
for sentence in banksentences:
    print(sentence,":",adaptedLesk("bank",sentence))

he borrowed money from the bank : ('a financial institution that accepts deposits and channels the money into lending activities', 1)
he sat on the bank of the river and watched the currents : ('sloping land (especially the slope beside a body of water)', 1)


### Exercise 2.2
From a sample of 1000 sentences from the dvd category of the Amazon review corpus (use the `sample_raw_sents()` method), find sentences which contain the stem or lemma *film*.  Use your `adaptedLesk` algorithm to disambiguate them.  You may want to adapt it slightly so that it takes as input a list or a set of context tokens or stems rather than the sentence itself.  Record the number of instances of each sense of *film* predicted by this algorithm.
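One convenient way to record the number of predictions per sense is `collections.Counter`. The sketch below tallies a hypothetical list of predicted sense labels; the labels are placeholders, not output from the algorithm.

```python
from collections import Counter

# Hypothetical predicted sense labels, one per sentence (placeholders only).
predictions = ["movie", "movie", "photographic material", "movie", "thin sheet"]

# Counter builds the sense -> count mapping in one step.
tally = Counter(predictions)

# most_common() lists the senses from most to least frequent.
for sense, count in tally.most_common():
    print(count, sense)
```

A plain dictionary updated with `dict.get(key, 0) + 1` gives the same counts; `Counter` simply adds the sorting helper.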

In [9]:
filmsynsets=wn.synsets("film")
for s in filmsynsets:
    print(s.definition())

a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement
a medium that disseminates moving pictures
photographic material consisting of a base of celluloid covered with a photographic emulsion; used to make negatives or transparencies
a thin coating or layer
a thin sheet of (usually plastic and usually transparent) material used to wrap or cover things
make a film or photograph of something
record in film


In [10]:
dvd_reader = AmazonReviewCorpusReader().category("dvd")
sentences=dvd_reader.sample_raw_sents(1000)

In [11]:
def lemmatize(alist):
    lemma=WordNetLemmatizer()
    return [lemma.lemmatize(token) for token in alist]
tokenisedsentences=[lemmatize(normalise(word_tokenize(sentence))) for sentence in sentences]
filmsentences=[sentence for sentence in tokenisedsentences if "film" in sentence]
print(len(sentences),len(filmsentences))
print(filmsentences[0])

1000 99
['but', 'the', 'film', 'is', 'nevertheless', 'inspirational', ',', 'and', 'one', 'that', 'my', 'family', 'thoroughly', 'enjoyed', '.']


In [12]:

def adaptedLesk(word,contexttokens):
    '''
    Use the simplified Lesk algorithm to disambiguate word in the context of sentence, using standard WordNet adaptations
    word: a String which is the word to be disambiguated
    contexttokens: a list of context tokens which have already been normalised and lemmatised
    :return: a pair (chosen sense definition, overlap score)
    '''
    
    #construct the set of context word tokens for the sentence: all words in sentence - word itself
    lemma =WordNetLemmatizer()
    contexttokens=set(contexttokens)-{lemma.lemmatize(word)}
    
    #get all the possible synsets for the word
    synsets=wn.synsets(word)
    scores=[]
    
    #iterate over synsets
    for synset in synsets:
        #get the set of tokens in the definition of the synset
        sensetokens=word_tokenize(synset.definition())
        sensetokens+=synset.lemma_names()
        for hypernym in synset.hypernyms():
            sensetokens+=hypernym.lemma_names()
            sensetokens+=word_tokenize(hypernym.definition())
        for hyponym in synset.hyponyms():
            sensetokens+=hyponym.lemma_names()
            sensetokens+=word_tokenize(hyponym.definition())
        
        sensetokens=set(lemmatize(filter_stopwords(normalise(sensetokens))))
        #find the size of the intersection of the sensetokens set with the contexttokens set
        scores.append((synset.definition(),len(sensetokens.intersection(contexttokens))))
    
    #sort the score list in descending order by the score (which is item with index 1 in the pair)
    sortedscores=sorted(scores,key=operator.itemgetter(1),reverse=True) 
    #print(sortedscores)
    return sortedscores[0]

results={}
for sentence in filmsentences:
    res=adaptedLesk("film",sentence)
    results[res[0]]=results.get(res[0],0)+1

print(results)
    

{'a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement': 88, 'photographic material consisting of a base of celluloid covered with a photographic emulsion; used to make negatives or transparencies': 9, 'a thin sheet of (usually plastic and usually transparent) material used to wrap or cover things': 2}


### Exercise 2.3
Inspect some of the individual predictions for your film sentences (at least one for each sense predicted).  Do you agree with the sense prediction?

In [13]:
print(filmsentences[0])
adaptedLesk("film",filmsentences[0])

['but', 'the', 'film', 'is', 'nevertheless', 'inspirational', ',', 'and', 'one', 'that', 'my', 'family', 'thoroughly', 'enjoyed', '.']


('a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement',
 0)

In [14]:
results={}
for sentence in filmsentences:
  res=adaptedLesk("film",sentence)
  if res[0] not in results.keys():
    print(sentence)
    print(res)
  results[res[0]]=results.get(res[0],0)+1



['but', 'the', 'film', 'is', 'nevertheless', 'inspirational', ',', 'and', 'one', 'that', 'my', 'family', 'thoroughly', 'enjoyed', '.']
('a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement', 0)
['tim', 'burton', "'s", 'film', 'coupled', 'with', 'depp', "'s", 'remarkable', 'acting', 'ability', 'make', 'this', 'movie', 'a', 'classic', '.']
('photographic material consisting of a base of celluloid covered with a photographic emulsion; used to make negatives or transparencies', 2)
['there', 'wa', 'a', 'few', 'thing', 'that', 'went', 'wrong', 'with', 'pi', 'that', 'i', 'wont', 'go', 'into', ',', 'but', 'a', 'with', 'most', 'filmmaker', 'making', 'no', 'budget', 'film', 'all', 'their', 'mistake', 'suddenly', 'become', 'artistic', 'expression', '.']
('a thin sheet of (usually plastic and usually transparent) material used to wrap or cover things', 1)


## Minimising the Distance in the Semantic Hierarchy (**EXTENSION**)
This WSD method is based on the intuition that the concepts mentioned in a sentence will be close together in the hyponym hierarchy.

### Exercise 3.1 (**EXTENSION**)
Write a function `max_sim(word, contextlemmas, pos)` which will choose the sense of a *word* given the lemmas of its context sentence, using a WordNet-based semantic similarity measure (see Lab_5_1).  You can assume that the part of speech of the word is known and is supplied to the function as another argument.

For each **sense** of the word under consideration:
* compute its semantic similarity with each context **lemma** of the same part of speech 
* sum the semantic similarities over the sentence

Choose the **sense** with the maximum sum.
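The steps above can be sketched on a toy hierarchy. The taxonomy below is hypothetical and is not WordNet; similarity is computed as 1 / (1 + shortest path length), in the spirit of path similarity. For brevity the context is given directly as concepts, so the sketch skips the step of choosing the best sense for each context lemma.

```python
# Toy taxonomy (hypothetical, not WordNet): each concept maps to its parent.
PARENT = {
    "entity": None,
    "finance": "entity",
    "natural_feature": "entity",
    "bank.finance": "finance",
    "money": "finance",
    "bank.river": "natural_feature",
    "river": "natural_feature",
}

def path_to_root(concept):
    """Return the list of concepts from `concept` up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path

def toy_path_similarity(a, b):
    """1 / (1 + number of edges on the shortest path between a and b)."""
    up_a, up_b = path_to_root(a), path_to_root(b)
    # the first concept on a's path that also lies on b's path
    # is their lowest common ancestor in a tree
    for steps_a, concept in enumerate(up_a):
        if concept in up_b:
            return 1.0 / (1 + steps_a + up_b.index(concept))
    return 0.0

def toy_max_sim(senses, context_concepts):
    """Sum similarities over the context for each sense; return the best sense and all totals."""
    totals = {s: sum(toy_path_similarity(s, c) for c in context_concepts)
              for s in senses}
    return max(totals, key=totals.get), totals

senses = ["bank.finance", "bank.river"]
best_for_river, river_totals = toy_max_sim(senses, ["river"])
best_for_money, money_totals = toy_max_sim(senses, ["money"])
print(best_for_river, river_totals)
print(best_for_money, money_totals)
```

In a hierarchy this small the nearest sense wins cleanly. In WordNet, every context word contributes to every sense, so the totals can be dominated by words that are unrelated to the intended meaning.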

Test your function on the bank sentences.  You should find, disappointingly for the method, that the first sentence has a maximum score of 2.71 with "an arrangement of similar objects in a row or in tiers" and the second sentence has a maximum score of 4.68 with "an arrangement of similar objects in a row or in tiers".

In [15]:
def max_sim(word,contextlemmas,pos=wn.NOUN):
    
    synsets=wn.synsets(word,pos)
    scores=[]
    for synset in synsets:
        total=0
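        for lemma in contextlemmas:
            #find the best similarity between this sense and any sense of the context lemma
            sofar=0
            for synsetB in wn.synsets(lemma,pos):
                sim=wn.path_similarity(synset,synsetB)
                #path_similarity can return None when no path exists, so guard the comparison
                if sim is not None and sim>sofar:
                    sofar=sim
            #add the best similarity for this lemma to the running total for the sense
            total+=sofar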
        scores.append((synset.definition(),total))
    sortedscores=sorted(scores,key=operator.itemgetter(1),reverse=True) 
    #print(sortedscores)
    return sortedscores[0]


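#NB: each bank sentence is passed as a raw string, so max_sim iterates over its characters;
#pass a list of lemmas (as with filmsentences below) to score against words instead
for sentence in banksentences:
    print(sentence,":",max_sim("bank",sentence,wn.NOUN))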

he borrowed money from the bank : ('an arrangement of similar objects in a row or in tiers', 2.7126262626262627)
he sat on the bank of the river and watched the currents : ('an arrangement of similar objects in a row or in tiers', 4.683333333333332)


### Exercise 3.2 (**EXTENSION**)
* Run your max_sim function on all of your film sentences and record the number of predictions for each sense.
* Inspect some of the individual predictions.
* Compare the results with those from the AdaptedLesk algorithm and draw some conclusions.

In [16]:
results={}
for sentence in filmsentences:
    res=max_sim("film",sentence, wn.NOUN)
    results[res[0]]=results.get(res[0],0)+1

print(results)

{'a thin coating or layer': 70, 'a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement': 29}
