# Word Sense Disambiguation
## Lesk, Walker and Random Walk

### This post describes the concepts of word sense and Word Sense Disambiguation (WSD), and implements some WSD techniques in Python.

Let us start by importing NLTK and the functions we need. NLTK (Natural Language Toolkit) is an open-source Python library for analysing human language.

In [1]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize

import networkx as nx # for graph

## What is Word Sense?
Words often have multiple interpretations depending on how they are used, i.e. on the context.
These interpretations are referred to as senses.

#### WordNet is a machine-readable lexical database that holds information about senses and their properties.
#### We use WordNet for lexical information about words.

In [2]:
example_word = "bank"

synset_ = wn.synsets(example_word)
print("No of Senses of the word '{}' is {}\n".format(example_word, len(synset_)))

print("The different Senses of the word '{}' are\n".format(example_word))
for ws in synset_:
    print(ws.name(), " : ", ws.definition())

No of Senses of the word 'bank' is 18

The different Senses of the word 'bank' are

bank.n.01  :  sloping land (especially the slope beside a body of water)
depository_financial_institution.n.01  :  a financial institution that accepts deposits and channels the money into lending activities
bank.n.03  :  a long ridge or pile
bank.n.04  :  an arrangement of similar objects in a row or in tiers
bank.n.05  :  a supply or stock held in reserve for future use (especially in emergencies)
bank.n.06  :  the funds held by a gambling house or the dealer in some gambling games
bank.n.07  :  a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
savings_bank.n.02  :  a container (usually with a slot in the top) for keeping money at home
bank.n.09  :  a building in which the business of banking transacted
bank.n.10  :  a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)
bank.v.01  :

### What is the confusion about?
Often the correct sense of a word is not known in advance. The sense depends on the context words. For example:

In [3]:
sentence = "Soham gets interest on his money from bank"
ambiguous_word = "bank"

Here the word "bank" is ambiguous. It could take any of the 18 senses that WordNet lists for it.
For example, it could mean either the sloping edge of land or a financial institution.

#### In this notebook, three algorithms are described to find out the correct sense of a particular word in sentences.
The three algorithms, in the order they appear below, are:
1. Lesk's Algorithm
2. Modified variant of Walker's Algorithm for untagged data without a thesaurus
3. Random Walk

## Preprocessing the input

The Lesk algorithm uses a machine-readable dictionary to look up the gloss of each sense of the input words.

For these lookups to succeed, the words of the input sentence should be lemmatized.
Stop words add little semantic information to the sentence, so they are removed as well.

We therefore begin by defining a function that removes stop words from the input sentence and lemmatizes the remaining words.

In [4]:
def pre_process(sentence, porterstemmer=False):
    processed_sent = []
    # Tokenize the sentence into words
    words = word_tokenize(sentence)
    # Lemmatizer to reduce words to their root form
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    # Set of English stop words
    stop_words = set(stopwords.words("english"))

    # Drop stop words; optionally stem, then lemmatize what remains
    for w in words:
        if w not in stop_words:
            if porterstemmer:
                w = stemmer.stem(w)
            processed_sent.append(lemmatizer.lemmatize(w))
    return processed_sent
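
One thing to watch: the NLTK English stop word list is all lowercase, so a capitalised stop word such as "The" at the start of a sentence passes the `w not in stop_words` test. Below is a minimal sketch of case-insensitive filtering. The stop list `TOY_STOP_WORDS` and the helper `filter_stop_words` are hypothetical names used only for illustration, and plain `str.split` stands in for `word_tokenize`.

```python
# Toy stop list standing in for stopwords.words("english") (assumption for illustration).
TOY_STOP_WORDS = {"the", "on", "his", "from", "a", "of"}

def filter_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Drop stop words, comparing in lowercase but keeping the original token."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "The man gets interest on his money from the bank".split()
print(filter_stop_words(tokens))
# ['man', 'gets', 'interest', 'money', 'bank']
```

Comparing on `t.lower()` while appending the original token keeps proper nouns intact for later steps.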

## Get Context bag consisting of all senses of the context words

A gloss is the description (definition) of a particular sense of a word. For example:

In [5]:
example_gloss = "bank.n.10"
gloss_ = wn.synset(example_gloss)
print("Example Gloss i.e. definition for a sense is:\n")
print("The gloss for sense {} is :\n{}".format(gloss_.name(), gloss_.definition()))

Example Gloss i.e. definition for a sense is:

The gloss for sense bank.n.10 is :
a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)


## Find the glosses (definitions) of every sense of each word in the sentence, except the ambiguous word
Concatenate them to form the context bag.


In [6]:
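def lesk_context_bag(context_sentence, word):
    context_bag_list = []
    # Leave out the ambiguous word without mutating the caller's list
    context_words = [w for w in context_sentence if w != word]
    for w in context_words:
        for syn in wn.synsets(w):
            # Pre-process each gloss the same way as the input sentence
            gloss = pre_process(str(syn.definition()))
            context_bag_list.extend(gloss)
    return context_bag_list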

# lesk_context_bag(sent, word)

## Get Lesk score for each Sense of the Ambiguous word

Get all possible senses of the ambiguous word.
Compute the Lesk score of each sense by counting how many times the words of its gloss (definition) occur in the context bag.


In [7]:
def compute_lesk_score(context_sentence, word):
    lesk_scores = {}
    context_bag = lesk_context_bag(context_sentence, word)
    for sense in wn.synsets(word):
        count = 0
        for w_gloss in sense.definition().split():
            for w in context_bag:
                if w==w_gloss:
                    count += 1
        lesk_scores[sense.name()] = count
    return lesk_scores
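
To see the scoring on a small scale without WordNet, here is a sketch that applies the same pairwise counting to two hand-written glosses. The sense labels, glosses and context bag are made up for illustration, and `toy_lesk_scores` is a hypothetical helper.

```python
# Hand-written glosses for two hypothetical senses of "bank" (not taken from WordNet).
TOY_SENSE_GLOSSES = {
    "bank_river": "sloping land beside a body of water",
    "bank_money": "a financial institution that accepts deposits of money",
}

def toy_lesk_scores(context_bag, sense_glosses):
    """Count, for each sense, how many (gloss word, context word) pairs match."""
    scores = {}
    for sense, gloss in sense_glosses.items():
        gloss_words = gloss.split()
        scores[sense] = sum(1 for g in gloss_words for c in context_bag if g == c)
    return scores

# Context bag: words gathered from the glosses of the surrounding words (made up here).
context_bag = ["money", "deposits", "money", "payment", "financial"]
print(toy_lesk_scores(context_bag, TOY_SENSE_GLOSSES))
# {'bank_river': 0, 'bank_money': 4}
```

Because the context bag is a list, a word that occurs twice in it ("money" here) is counted twice.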

## Get the most apt sense of the ambiguous word, using the Lesk score as the metric

Return the word sense with the highest Lesk score.

In [8]:
def lesk(context_sentence, word):
    lesk_scores = compute_lesk_score(context_sentence, word)
    max_ = 0
    lesk_prediction = None
    for s in lesk_scores:
        if max_ < lesk_scores[s]:
            lesk_prediction = s
            max_ = lesk_scores[s]
    
    return lesk_prediction
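
Note that `lesk` returns `None` when no sense has a positive score, and passing `None` on to `wn.synset` would then fail. A hedged sketch of the same selection with an explicit fallback follows; `best_sense` and its `default` parameter are hypothetical names, not part of NLTK.

```python
def best_sense(scores, default=None):
    """Return the key with the highest positive score, or `default` if none is positive."""
    if not scores:
        return default
    # max() keeps the first key on ties, like the strict '<' comparison in the loop version
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

print(best_sense({"bank_river": 0, "bank_money": 4}))
# bank_money
print(best_sense({"bank_river": 0, "bank_money": 0}, default="bank_river"))
# bank_river
```

A common choice for the fallback is the first sense returned for the word, but that is a design decision rather than part of the algorithm described here.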

## Test for Lesk Algorithm

In [9]:
word = WordNetLemmatizer().lemmatize(PorterStemmer().stem(word=ambiguous_word))
# print(word)
sent = pre_process(sentence, porterstemmer=False)
# print(sent)

print("\n Sentence is: {}\n where the ambiguous word is {} \n". format(sentence, ambiguous_word))

prediction = lesk(sent, word)
w = wn.synset(prediction)
print(" The Predicted Sense is\n")
print(" Sense name: {}\n Sense Gloss: {}".format(w.name(), w.definition()))


 Sentence is: Soham gets interest on his money from bank
 where the ambiguous word is bank 

 The Predicted Sense is

 Sense name: depository_financial_institution.n.01
 Sense Gloss: a financial institution that accepts deposits and channels the money into lending activities


# Modified Walker Algorithm

In this modified version of Walker's algorithm we do not use thesaurus categories. Instead we use WordNet similarity measures: path similarity, LCH (Leacock-Chodorow) similarity and WUP (Wu-Palmer) similarity. See reference [1] for more details.

## Algorithm:
1. Build the context bag, i.e. the list of all words in the context.
2. For every sense of the ambiguous word, compute its similarity with every sense of each context word using one of the measures mentioned above, and sum these similarities into a cumulative score for that sense.
3. The sense with the maximum cumulative score is the predicted sense.

In [10]:
def modified_walker(sent, word, method="path_similarity"):
    max_ = 0
    predicted = None
    # Leave out the ambiguous word without mutating the caller's list
    context = [w for w in sent if w != word]
    for sense in wn.synsets(word):
        score = 0
        for w in context:
            for context_sense in wn.synsets(w):
                try:
                    if method == "path_similarity":
                        sim = sense.path_similarity(context_sense)
                    elif method == "lch_similarity":
                        sim = sense.lch_similarity(context_sense)
                    elif method == "wup_similarity":
                        sim = sense.wup_similarity(context_sense)
                    else:
                        sim = None
                except Exception:
                    # e.g. lch_similarity raises for senses with different parts of speech
                    continue
                # A measure may return None when the two senses are not connected
                if sim is not None:
                    score += sim
        if max_ < score:
            max_ = score
            predicted = sense
    return predicted
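
The cumulative scoring in the steps above can be sketched without WordNet by replacing the similarity measure with a lookup table. All sense labels and similarity values below are made up for illustration, and `toy_walker` is a hypothetical helper.

```python
# Made-up similarity values between sense labels (assumption; real values would
# come from a WordNet measure such as path_similarity).
TOY_SIMILARITY = {
    ("bank_money", "money_currency"): 0.5,
    ("bank_money", "interest_finance"): 0.4,
    ("bank_river", "money_currency"): 0.1,
    ("bank_river", "interest_hobby"): 0.1,
}

def toy_walker(target_senses, context_senses, similarity):
    """Sum each target sense's similarity to all context senses; return the best."""
    totals = {}
    for sense in target_senses:
        # Pairs missing from the table contribute nothing to the total
        totals[sense] = sum(similarity.get((sense, ctx), 0.0) for ctx in context_senses)
    best = max(totals, key=totals.get)
    return best, totals

best, totals = toy_walker(
    ["bank_river", "bank_money"],
    ["money_currency", "interest_finance", "interest_hobby"],
    TOY_SIMILARITY,
)
print(best)
# bank_money
```

Since every sense of every context word contributes, a context word with many senses can dominate the totals; this sketch makes no attempt to normalise for that.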

## Test for Walker's algorithm
The algorithm works better with tagged data.

In [11]:
word = WordNetLemmatizer().lemmatize(PorterStemmer().stem(word=ambiguous_word))
# print(word)
sent = pre_process(sentence, porterstemmer=False)
# print(sent)

print("\n Sentence is: {}\n where the ambiguous word is {} \n". format(sentence, ambiguous_word))
w = modified_walker(sent, word, method="wup_similarity")
print(" The Predicted Sense is\n")
print(" Sense name: {}\n Sense Gloss: {}".format(w.name(), w.definition()))


 Sentence is: Soham gets interest on his money from bank
 where the ambiguous word is bank 

 The Predicted Sense is

 Sense name: bank.v.05
 Sense Gloss: be in the banking business


# Random Walk Algorithm

### Get the context bag for a single sense, given its synset name

In [12]:
def lesk_context_bag_word(word):
    context_bag_list = []
    syn = wn.synset(word)
    gloss = pre_process(str(syn.definition()))
    for w_g in gloss:
        context_bag_list.append(w_g)
    return context_bag_list

# lesk_context_bag(sent, word)

### Get the Lesk score for a pair of senses
Count the distinct words that the gloss bags of the two senses have in common.

In [13]:
def compute_lesk_score_word(word1, word2):
    # word1 and word2 are synset names, e.g. "bank.n.01"
    context_bag_word1 = set(lesk_context_bag_word(word1))
    context_bag_word2 = set(lesk_context_bag_word(word2))
    # The score is the number of distinct words the two gloss bags share
    return len(context_bag_word1 & context_bag_word2)

## Create the graph for the Random Walk algorithm
Add an edge between senses of adjacent words whenever their Lesk score is positive, weighted by that score.

In [14]:
def random_walk_graph(sentence):
    G = nx.DiGraph()
    sent_list = pre_process(sentence)
#     print(sent_list)
    for i in range(len(sent_list)-1):
        w1 = sent_list[i]
        w2 = sent_list[i+1]
        for w1_sense in wn.synsets(w1):
            for w2_sense in wn.synsets(w2):
                lesk_score = compute_lesk_score_word(w1_sense.name(), w2_sense.name())
                if lesk_score>0:
                    G.add_edge(str(i)+'_'+w1_sense.name(), str(i+1)+ '_'+ w2_sense.name(), weight = lesk_score)
    page_rank = nx.pagerank( G, alpha = 0.9)
    return page_rank
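
To give a feel for how edge weights turn into node scores, here is a small weighted PageRank written as plain power iteration. It is a sketch in the spirit of `nx.pagerank`, not a reproduction of it: `toy_pagerank`, the fixed iteration count and the uniform handling of nodes without outgoing edges are assumptions made for this example, and the node labels are made up.

```python
def toy_pagerank(edges, alpha=0.9, iterations=100):
    """Weighted PageRank by power iteration.

    edges: dict mapping (source, target) -> positive weight.
    Rank held by nodes with no outgoing edges is spread uniformly over all nodes.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_weight = {n: 0.0 for n in nodes}
    for (src, _dst), w in edges.items():
        out_weight[src] += w

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        # Teleport share plus the share redistributed from dangling nodes
        new_rank = {n: (1 - alpha + alpha * dangling) / len(nodes) for n in nodes}
        for (src, dst), w in edges.items():
            # Each node passes its rank along its edges in proportion to their weights
            new_rank[dst] += alpha * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank

# One context sense linked to two candidate senses with different overlap weights.
toy_edges = {("0_money", "1_bank_money"): 3, ("0_money", "1_bank_river"): 1}
ranks = toy_pagerank(toy_edges)
print(max(ranks, key=ranks.get))
# 1_bank_money
```

The candidate reached through the heavier edge ends up with the higher score, which is the property the sense selection relies on.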

## Pick, for each word, the sense with the highest PageRank score

In [15]:
def random_walk(sentence):
    page_rank_scores = random_walk_graph(sentence)
    sent_list = pre_process(sentence)
    dict_pred_sense = {}
    # Initialise one entry per word position: [best score, node label, word]
    for j in range(len(sent_list)):
        dict_pred_sense[j] = [0, 'null', 'sense']
    for node in page_rank_scores:
        # Node labels look like "<position>_<synset name>"; positions may have several digits
        t = int(node.split('_', 1)[0])
        if page_rank_scores[node] > dict_pred_sense[t][0]:
            dict_pred_sense[t][0] = page_rank_scores[node]
            dict_pred_sense[t][1] = node
            dict_pred_sense[t][2] = sent_list[t]
    return dict_pred_sense
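
Node labels have the form `<position>_<synset name>`. Because synset names can themselves contain underscores and positions can run past a single digit, splitting once on the first underscore is a robust way to take a label apart. A small sketch follows; `split_node_label` is a hypothetical helper.

```python
def split_node_label(label):
    """Split '<position>_<synset name>' into (int position, synset name)."""
    # maxsplit=1 keeps underscores inside the synset name intact
    position, synset_name = label.split("_", 1)
    return int(position), synset_name

print(split_node_label("3_money.n.01"))
# (3, 'money.n.01')
print(split_node_label("12_depository_financial_institution.n.01"))
# (12, 'depository_financial_institution.n.01')
```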

## Test for Random walk

In [16]:
print("Sentence is: {}\n".format(sentence))
print("Predicted Word Sense for functional words:")
final_pgrnk = random_walk(sentence)
for i in final_pgrnk:
    if final_pgrnk[i][1]!='null':
        word = final_pgrnk[i][1]
        # Strip the "<position>_" prefix to recover the synset name
        w_syn = wn.synset(word.split('_', 1)[1])
        print("\nThe Sense for word '{}':\t {}\nThe definition is: {}".format(final_pgrnk[i][2], w_syn.name(), w_syn.definition() ))

Sentence is: Soham gets interest on his money from bank

Predicted Word Sense for functional words:

The Sense for word 'get':	 get.v.01
The definition is: come into the possession of something concrete or abstract

The Sense for word 'interest':	 interest.n.05
The definition is: (law) a right or legal share of something; a financial involvement with something

The Sense for word 'money':	 money.n.01
The definition is: the most common medium of exchange; functions as legal tender

The Sense for word 'bank':	 bank.n.07
The definition is: a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force


# Thanks
## References

[1] WordNet Interface. (2018). Nltk.org
    Link: http://www.nltk.org/howto/wordnet.html
    
[2] NetworkX — NetworkX. (2018). Networkx.github.io 
    Link: https://networkx.github.io/documentation/networkx-1.10/overview.html