# Practical exercise 4 - experiments with WordNet

WordNet is a large, hand-built digital lexicon. At its core are the so-called synsets, which can be understood as meanings. Each word belongs to one or more synsets, and each synset is made up of one or more words. Semantic relations such as hypernymy and hyponymy hold between synsets, not between words! Consequently, WordNet has no separate synonymy relation: if two words are synonymous, they share one or more synsets.<br>
WordNet can also be accessed via the web interface: http://wordnetweb.princeton.edu/perl/webwn. There we can see, for example, the synsets of a word.


# 1. WordNet in Python

The NLTK package offers some easy methods to access WordNet. Before you use WordNet for the first time, run the following code once:

In [3]:
import nltk

nltk.download("wordnet")
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/sruthysanthosh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/sruthysanthosh/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


True

How to access synsets:

In [4]:
from nltk.corpus import wordnet as wn

# get synsets of a word
synsets = wn.synsets("rock")
for s in synsets:
    print(s)
print()

# use synset identifier directly
dog = wn.synset("dog.n.01")
print(dog.hypernyms())
print(dog.hyponyms())
print(dog.lemmas())  # the words (lemmas) that make up this synset

Synset('rock.n.01')
Synset('rock.n.02')
Synset('rock.n.03')
Synset('rock.n.04')
Synset('rock_candy.n.01')
Synset('rock_'n'_roll.n.01')
Synset('rock.n.07')
Synset('rock.v.01')
Synset('rock.v.02')

[Synset('canine.n.02'), Synset('domestic_animal.n.01')]
[Synset('basenji.n.01'), Synset('corgi.n.01'), Synset('cur.n.01'), Synset('dalmatian.n.02'), Synset('great_pyrenees.n.01'), Synset('griffon.n.02'), Synset('hunting_dog.n.01'), Synset('lapdog.n.01'), Synset('leonberg.n.01'), Synset('mexican_hairless.n.01'), Synset('newfoundland.n.01'), Synset('pooch.n.01'), Synset('poodle.n.01'), Synset('pug.n.01'), Synset('puppy.n.01'), Synset('spitz.n.01'), Synset('toy_dog.n.01'), Synset('working_dog.n.01')]
[Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'), Lemma('dog.n.01.Canis_familiaris')]


An easy way to compute the similarity between two synsets is to measure the length of the path between them in the WordNet hierarchy formed by the hypernym relations. The method path_similarity returns 1/(d+1), where d is the number of edges on the shortest path between the two synsets, so identical synsets get a similarity of 1.
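
To make the path idea concrete, here is a minimal sketch on a toy hierarchy. The graph below is a hypothetical excerpt written by hand, not the real WordNet data, and the function names are made up for illustration. It uses the convention 1/(d+1), where d is the number of edges on the shortest path, so two synsets that are two edges apart get 1/3.

```python
from collections import deque

# Toy hypernym links (child -> parents); a hypothetical excerpt, not the real WordNet graph.
HYPERNYMS = {
    "ape": ["primate"],
    "monkey": ["primate"],
    "primate": ["mammal"],
    "dog": ["canine"],
    "canine": ["mammal"],
    "mammal": [],
}


def shortest_path_length(a, b):
    """Number of edges on the shortest path between a and b (links treated as undirected)."""
    neighbours = {node: set(parents) for node, parents in HYPERNYMS.items()}
    for child, parents in HYPERNYMS.items():
        for parent in parents:
            neighbours[parent].add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:  # breadth-first search
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # the two nodes are not connected


def path_similarity(a, b):
    """1 / (d + 1), where d is the shortest path length in edges."""
    d = shortest_path_length(a, b)
    return None if d is None else 1 / (d + 1)


print(path_similarity("ape", "monkey"))
print(path_similarity("ape", "dog"))
```

Here "ape" and "monkey" share the direct parent "primate" (two edges apart), while "ape" and "dog" only meet at "mammal" (four edges apart), so the second score is lower.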

In [5]:
ape = wn.synset("ape.n.01")
monkey = wn.synset("monkey.n.01")
zoo = wn.synset("zoo.n.01")

print( "Similarity between ape and monkey: ", ape.path_similarity(monkey))
print( "Similarity between ape and zoo: ", ape.path_similarity(zoo))

Similarity between ape and monkey:  0.3333333333333333
Similarity between ape and zoo:  0.07692307692307693


WordNet is not completely connected. The path similarity method therefore assumes a fake root node that connects all parts. A drawback of path similarity is that words come out as less similar when they sit in a part of the hierarchy that is worked out in more detail, because more intermediate nodes lie between them. In general we would assume that the first divisions at the top of the hierarchy imply large semantic differences, while a division at a very deep position in the hierarchy makes only small semantic distinctions. Therefore some alternative measures have been defined, e.g. the Wu-Palmer similarity and the Leacock-Chodorow similarity (feel free to read up on those measures).
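
As a rough illustration of how these two measures take depth into account, the sketch below implements their textbook definitions on a hand-made toy taxonomy. The taxonomy and all function names are hypothetical, and NLTK's own conventions (for example how depth is counted or how the simulated root is handled) may differ in detail. Wu-Palmer is taken as 2·depth(lcs) / (depth(a) + depth(b)), and Leacock-Chodorow as -log(p / (2·D)), where lcs is the lowest common subsumer, p the path length counted in nodes and D the maximum depth of the taxonomy.

```python
import math

# Hypothetical toy taxonomy (child -> parent); "entity" is the root.
PARENT = {
    "entity": None,
    "animal": "entity",
    "primate": "animal",
    "ape": "primate",
    "monkey": "primate",
    "place": "entity",
    "zoo": "place",
}


def ancestors(node):
    """Chain from node up to the root, starting with node itself."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(node))


def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of both a and b."""
    b_chain = set(ancestors(b))
    return next(n for n in ancestors(a) if n in b_chain)


def wup(a, b):
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def lch(a, b):
    lcs = lowest_common_subsumer(a, b)
    edges = (depth(a) - depth(lcs)) + (depth(b) - depth(lcs))
    max_depth = max(depth(n) for n in PARENT)
    return -math.log((edges + 1) / (2 * max_depth))


# Both pairs are siblings (two edges apart), but one pair sits deeper in the tree.
print(wup("ape", "monkey"), wup("animal", "place"))
print(lch("ape", "monkey"), lch("ape", "zoo"))
```

In this toy tree "ape"/"monkey" and "animal"/"place" are both two edges apart, so a pure path measure treats them alike, whereas the Wu-Palmer score is higher for the deeper pair.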

In [6]:
print("Wu−Palmer similarity between ape and monkey: ", ape.wup_similarity(monkey))
print("Wu−Palmer similarity between ape and zoo: ", ape.wup_similarity(zoo))

print("Leacock Chodorow similarity between ape and monkey: ", ape.lch_similarity(monkey))
print("Leacock Chodorow similarity between ape and zoo: ", ape.lch_similarity(zoo))

Wu−Palmer similarity between ape and monkey:  0.9230769230769231
Wu−Palmer similarity between ape and zoo:  0.4
Leacock Chodorow similarity between ape and monkey:  2.538973871058276
Leacock Chodorow similarity between ape and zoo:  1.072636802264849


Both measures give higher weight to distances between nodes that are closer to the root. However, how deep a synset sits in the hierarchy is itself a design decision, so a number of measures draw on other information sources as well. For example, the similarity measures of Resnik and Lin also use the frequency of words in a corpus.
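
The frequency-based idea can be sketched as follows, assuming the usual definitions: the information content of a concept is IC(c) = -log p(c), where p(c) is the relative corpus frequency of the concept and everything below it; Resnik similarity is the IC of the lowest common subsumer, and Lin similarity normalizes it by the IC of the two concepts. The taxonomy and the counts below are made up purely for illustration.

```python
import math

# Hypothetical toy taxonomy (child -> parent) and made-up corpus counts per concept.
PARENT = {"entity": None, "animal": "entity", "ape": "animal",
          "monkey": "animal", "zoo": "entity"}
OWN_COUNT = {"entity": 0, "animal": 10, "ape": 5, "monkey": 15, "zoo": 10}


def ancestors(node):
    """Chain from node up to the root, starting with node itself."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def information_content(node):
    """IC(c) = -log p(c); a concept's count includes all concepts below it."""
    total = sum(OWN_COUNT.values())
    subsumed = sum(c for n, c in OWN_COUNT.items() if node in ancestors(n))
    return -math.log(subsumed / total)


def lowest_common_subsumer(a, b):
    b_chain = set(ancestors(b))
    return next(n for n in ancestors(a) if n in b_chain)


def resnik(a, b):
    return information_content(lowest_common_subsumer(a, b))


def lin(a, b):
    return 2 * resnik(a, b) / (information_content(a) + information_content(b))


print(resnik("ape", "monkey"), lin("ape", "monkey"))
print(resnik("ape", "zoo"), lin("ape", "zoo"))
```

A pair that only meets at the root gets a Resnik score of zero, because the root subsumes every occurrence in the corpus and therefore carries no information.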

# 2. Exercise:

0. Read in the email dataset (see exercise 3). You may copy some of the code from that notebook.

1. Let us investigate the coverage of this data in Wordnet:
    - Count the unique words (types) in the data and store them in a list.
    - How many of those items have synsets in Wordnet? (calculate a percentage value)
    - What is the average number of synsets per type?

2. Not all words have lexical meaning. We can filter certain word classes. Apply POS-tagging (for example https://www.nltk.org/book/ch05.html) to extract only nouns (just NN - not proper nouns NNP). Check the coverage in Wordnet for these nouns. How many have synsets in Wordnet? (calculate a percentage value)

3. Experiments with the similarity of words:
    - Choose 10 out of the 50 most frequent nouns from the data set (they all should have at least one synset in Wordnet).
    - Now compute for each of the 10 words the Wu-Palmer or Leacock-Chodorow similarity to each of the other 9 words (you may use the first synset for each word for this calculation when words have multiple synsets). You might want to display the resulting numbers in a table. Which words are most similar to each other?
    - Check all sentences that contain the word 'Obama': how often does each of the 10 words you selected occur in these sentences? Do words with similar meanings also have similar co-occurrence counts with 'Obama'?

## 0. Reading the email dataset

In [112]:
import zipfile
from somajo import SoMaJo
import numpy as np

# Extract the zip file
with zipfile.ZipFile("emails-body.txt.zip", 'r') as zip_f:
    zip_f.extractall('.')

# Read the file and split it into email bodies at the '<cmail>' separator
with open('emails-body.txt') as f:
    texts = f.read().split('<cmail>\n')
print(f'number of email bodies: {len(texts)}')

# Tokenize the emails and split them into sentences with SoMaJo
somajo_tokenizer = SoMaJo(language="en_PTB",
                          split_camel_case=True)

data_tok = []
for sentence in somajo_tokenizer.tokenize_text(texts):
    data_tok.append([token.text for token in sentence])
print("Number of sentences-",len(data_tok))

print("Sample sentence-",data_tok[200])

number of email bodies: 6741
Number of sentences- 32290
Sample sentence- ['Unfortunately', ',', 'the', 'European', 'Intelligence', 'services', 'have', 'been', 'unable', 'to', 'confirm', 'or', 'discredit', 'these', 'reports', '.']


### Count the unique words (types) in the data and store them in a list.

In [113]:
# count words and their frequencies
from collections import Counter

sentences = data_tok

words = Counter(word for sentence in sentences for word in sentence)
# Note: "words" now contains a mapping of words to their frequencies.
# total number of types in the corpus
print(f'Total number of types (unique words): {len(words)}')

# total number of tokens in the corpus
print(f'Total number of tokens: {sum(words.values())}')


sorted_words = sorted(words, key=lambda word: words[word], reverse=True)

print('the most frequent words:')
print(sorted_words[:20])

Total number of types (unique words): 37340
Total number of tokens: 708310
the most frequent words:
[',', 'the', '.', 'to', 'and', 'of', 'a', 'in', '"', 'that', "'s", 'is', 'for', '-', 'I', 'on', 'with', ':', 'you', 'it']


### How many of those items have synsets in Wordnet? (calculate a percentage value)

In [114]:
count = 0
#For each unique word
for word in words:
    #checking if it has synset
    if(wn.synsets(word)):
        count+=1

In [115]:
print("Percentage of words having synsets-", (count/len(words)*100))

Percentage of words having synsets- 66.24799143010178


### What is the average number of synsets per type?

In [116]:
# Number of synsets for every type that has at least one synset
synset_counts = []
for word in words:
    n_synsets = len(wn.synsets(word))  # look each type up only once
    if n_synsets:
        synset_counts.append(n_synsets)

In [117]:
# Average over the types that WordNet covers (types without synsets are excluded)
avg = np.mean(synset_counts)
print("Average number of synsets per type-", round(avg,2))

Average number of synsets per type- 4.67
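
The average above is taken only over types that have at least one synset. The sketch below shows, on a tiny made-up lexicon, how the coverage and both possible averages (over covered types only, and over all types) can be computed in one pass. `toy_lookup` is a hypothetical stand-in for `wn.synsets`, and `synset_stats` is an illustrative helper, not part of NLTK.

```python
from statistics import mean

# Hypothetical stand-in lexicon: maps a word to a list of sense labels.
TOY_LEXICON = {"rock": ["rock.n.01", "rock.v.01"], "dog": ["dog.n.01"]}


def toy_lookup(word):
    """Stand-in for wn.synsets(word); returns an empty list for unknown words."""
    return TOY_LEXICON.get(word.lower(), [])


def synset_stats(types, lookup):
    """Return (coverage in %, mean synsets over covered types, mean over all types)."""
    counts = [len(lookup(t)) for t in types]  # one lookup per type
    covered = [c for c in counts if c > 0]
    coverage = 100 * len(covered) / len(counts)
    return coverage, mean(covered), mean(counts)


print(synset_stats(["rock", "dog", "hrod17", ","], toy_lookup))
```

Which of the two averages is reported changes the number considerably when many types (punctuation, names, OCR noise) have no synsets at all, so it is worth stating which one is meant.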


## Apply POS-tagging to extract only nouns (just NN).

In [118]:
# POS-tag all tokens, sentence by sentence.
# nltk.pos_tag needs NLTK's tagger model, downloaded once via nltk.download
# (the resource name depends on the NLTK version).
tagged = []
for sentence in sentences:
    tagged += nltk.pos_tag(sentence)

# Keep only the (token, tag) pairs tagged as common singular nouns (NN)
noun = [pair for pair in tagged if pair[1] == 'NN']
print(noun[10:20])

[('hrod17', 'NN'), ('meat', 'NN'), ('Sent', 'NN'), ('Subject', 'NN'), ('htte', 'NN'), ('maxbiumenthal', 'NN'), ('com12012', 'NN'), ('meet', 'NN'), ('extremist', 'NN'), ('musiim', 'NN')]


In [120]:
print("Number of nouns-",len(noun))

Number of nouns- 89073


### How many have synsets in Wordnet? (calculate a percentage value)

In [121]:

count = 0
noun_synset = []
# For each noun token, check whether WordNet has any synset for it
# (no POS restriction, so verb or adjective synsets also count)
for token, tag in noun:
    if wn.synsets(token):
        count += 1
        noun_synset.append(token)

In [122]:
percent = count/len(noun)*100
print("Percentage of nouns having synsets- ",round(percent,3))

Percentage of nouns having synsets-  88.346


In [123]:
noun_synset[:20] #Nouns having synsets

['H',
 'memo',
 'syria',
 'print',
 'print',
 'meat',
 'Sent',
 'Subject',
 'meet',
 'extremist',
 'PRODUCED',
 'H',
 'memo',
 'print',
 'direct',
 'Sent',
 'direct',
 'Sent',
 'piece',
 'page']
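
Note that the percentage above is computed over noun tokens, so a frequent noun such as 'print' is counted every time it occurs. A per-type figure can differ noticeably. The sketch below contrasts the two on made-up data; the token list and the lexicon are hypothetical and `coverage` is an illustrative helper.

```python
def coverage(items, has_entry):
    """Percentage of items for which has_entry(item) is true."""
    items = list(items)
    return 100 * sum(1 for item in items if has_entry(item)) / len(items)


# Hypothetical noun tokens and a stand-in lexicon (not real WordNet lookups).
noun_tokens = ["memo", "print", "print", "print", "hrod17", "htte"]
toy_lexicon = {"memo", "print"}

token_level = coverage(noun_tokens, toy_lexicon.__contains__)      # repeats count
type_level = coverage(set(noun_tokens), toy_lexicon.__contains__)  # each noun once

print(f"token-level: {token_level:.1f}%  type-level: {type_level:.1f}%")
```

With real data, the same comparison would use `wn.synsets` as the lookup (optionally restricted to noun synsets) instead of the toy lexicon.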

## Experiments with similarity

### Choose 10 out of the 50 most frequent nouns from the data set

In [124]:
#Finding frequency of all the nouns having synsets
freq = Counter(noun_synset)
freq["Subject"]

9

In [125]:
sorted_nouns = sorted(freq, key=lambda noun: freq[noun], reverse=True)
print("Ten most frequent nouns are - ",sorted_nouns[:10])

Ten most frequent nouns are -  ['pm', 'time', 'government', 'today', 'policy', 'way', 'president', 'world', 'year', 'security']


In [126]:
top_10 = sorted_nouns[:10]


### Compute for each of the 10 words the Wu-Palmer or Leacock-Chodorow similarity to each of the other 9 words

In [127]:
# Compute Wu-Palmer similarities between all pairs of the 10 nouns and store them in a matrix.
# Only the first synset of each word is used, which need not be the sense intended in the emails.
first_synsets = [wn.synsets(word)[0] for word in top_10]

sim = np.zeros((10, 10))
for i in range(10):
    for j in range(10):
        sim[i][j] = first_synsets[j].wup_similarity(first_synsets[i])

In [128]:
import pandas as pd

#Displaying the similarity matrix in table form using pandas
names = list(top_10)
df = pd.DataFrame(sim, index=names, columns=names)
print("The similarity of words in a tabular format-\n")
df

The similarity of words in a tabular format-



| | pm | time | government | today | policy | way | president | world | year | security |
|---|---|---|---|---|---|---|---|---|---|---|
| pm | 1.0 | 0.470588 | 0.235294 | 0.25 | 0.315789 | 0.266667 | 0.1 | 0.125 | 0.25 | 0.235294 |
| time | 0.470588 | 1.0 | 0.285714 | 0.307692 | 0.375 | 0.333333 | 0.117647 | 0.153846 | 0.307692 | 0.285714 |
| government | 0.235294 | 0.285714 | 1.0 | 0.307692 | 0.25 | 0.333333 | 0.117647 | 0.153846 | 0.307692 | 0.285714 |
| today | 0.25 | 0.307692 | 0.307692 | 1.0 | 0.266667 | 0.545455 | 0.125 | 0.166667 | 0.333333 | 0.461538 |
| policy | 0.315789 | 0.375 | 0.25 | 0.266667 | 1.0 | 0.285714 | 0.105263 | 0.133333 | 0.266667 | 0.25 |
| way | 0.266667 | 0.333333 | 0.333333 | 0.545455 | 0.285714 | 1.0 | 0.133333 | 0.181818 | 0.363636 | 0.5 |
| president | 0.1 | 0.117647 | 0.117647 | 0.125 | 0.105263 | 0.133333 | 1.0 | 0.421053 | 0.125 | 0.117647 |
| world | 0.125 | 0.153846 | 0.153846 | 0.166667 | 0.133333 | 0.181818 | 0.421053 | 1.0 | 0.166667 | 0.153846 |
| year | 0.25 | 0.307692 | 0.307692 | 0.333333 | 0.266667 | 0.363636 | 0.125 | 0.166667 | 1.0 | 0.307692 |
| security | 0.235294 | 0.285714 | 0.285714 | 0.461538 | 0.25 | 0.5 | 0.117647 | 0.153846 | 0.307692 | 1.0 |


In [140]:
# Words which are most similar to each other
max_row = 0
max_col = 0
max_val = 0
for i in range(10):
    for j in range(10):
        # keep the largest off-diagonal value (the diagonal compares each word with itself)
        if sim[i][j] > max_val and i != j:
            max_val = sim[i][j]
            max_row = i
            max_col = j

print("Maximum similarity is btw {} and {} with value {}".format(names[max_row],names[max_col], round(max_val,3)))


Maximum similarity is btw today and way with value 0.545
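
Since the similarity matrix is symmetric, it is enough to look at each unordered pair once. A minimal alternative sketch using `itertools.combinations` follows; `most_similar_pair` is a hypothetical helper, and the three words and their scores are copied from the table above.

```python
from itertools import combinations


def most_similar_pair(words, matrix):
    """Return (score, word_a, word_b) for the highest off-diagonal entry."""
    return max(
        (matrix[i][j], words[i], words[j])
        for i, j in combinations(range(len(words)), 2)
    )


# Three rows and columns copied from the table above.
sample_names = ["today", "way", "security"]
sample_sim = [
    [1.0, 0.545455, 0.461538],
    [0.545455, 1.0, 0.5],
    [0.461538, 0.5, 1.0],
]

print(most_similar_pair(sample_names, sample_sim))
```

Iterating over pairs with i < j skips the diagonal automatically, so no explicit index comparison is needed.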


### Check for all sentences which contain the word 'Obama'

In [129]:
# Collect the sentences that contain the token 'Obama' (exact, case-sensitive match)
Obama_sent = [sentence for sentence in sentences if 'Obama' in sentence]

print("Sample sentences containing Obama-", Obama_sent[0:2])

Sample sentences containing Obama- [['By', 'Anne', '-', 'Marie', 'Slaughter', 'PRESIDENT', 'Obama', 'says', 'the', 'noose', 'is', 'tightening', 'around', 'Col.', 'Muammar', 'al', '-', 'Qaddafi', '.'], ['Americans', 'in', 'turn', 'will', 'read', 'the', 'words', 'of', 'Mr.', 'Obama', "'s", 'June', '2009', 'speech', 'in', 'Cairo', '.', ',', 'with', 'its', 'lofty', 'promises', 'to', 'stand', 'for', 'universal', 'human', 'rights', ',', 'and', 'cringe', '.']]


In [142]:
print("Number of sentences having the word 'Obama'- ",len(Obama_sent))

Number of sentences having the word 'Obama'-  1156


### How often does each of the 10 words you selected occur in these sentences? Do words with similar meanings also have similar co-occurrence counts with 'Obama'?

In [144]:
count_each = np.zeros(10)
# For each 'Obama' sentence and each of the 10 nouns: count the sentence once
# if it contains the noun as an exact (case-sensitive) token
for sentence in Obama_sent:
    for k in range(10):
        if top_10[k] in sentence:
            count_each[k] += 1

In [149]:
for i in range(10):
    print("\n {} occurs {} times".format(top_10[i],int(count_each[i])))


 pm occurs 8 times

 time occurs 36 times

 government occurs 25 times

 today occurs 24 times

 policy occurs 73 times

 way occurs 19 times

 president occurs 76 times

 world occurs 23 times

 year occurs 38 times

 security occurs 39 times


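The word 'president' appears in the most 'Obama' sentences (76), closely followed by 'policy' (73). The match is case-sensitive, so capitalized forms such as 'President' or 'PRESIDENT' are not included in these counts.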
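
The counts above match tokens exactly, including case. If capitalized variants should count as well, one option is to lowercase the tokens before comparing. The sketch below shows this on made-up sentences; `sentence_counts` and the toy data are hypothetical.

```python
def sentence_counts(sentences, targets):
    """Count, per target word, the sentences that contain it (ignoring case)."""
    counts = dict.fromkeys(targets, 0)
    for sentence in sentences:
        tokens = {token.lower() for token in sentence}
        for target in targets:
            if target.lower() in tokens:
                counts[target] += 1  # each sentence counts at most once per word
    return counts


# Hypothetical tokenized sentences, for illustration only.
toy_sentences = [
    ["PRESIDENT", "Obama", "says", "so", "."],
    ["The", "president", "met", "the", "President", "of", "France", "."],
    ["Obama", "spoke", "about", "policy", "."],
]

print(sentence_counts(toy_sentences, ["president", "policy"]))
```

Like the cell above, this counts sentences rather than individual occurrences: a sentence that mentions a word twice still adds only one to its count.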

In [172]:
co_matrix = np.zeros((10,10))

# For each 'Obama' sentence, count every pair of distinct nouns that both occur in it
for sentence in Obama_sent:
    for j in range(10):
        for k in range(10):
            if j != k and top_10[j] in sentence and top_10[k] in sentence:
                co_matrix[j][k] += 1

In [165]:
co_matrix

array([[0., 1., 0., 0., 1., 0., 0., 0., 0., 2.],
       [1., 0., 0., 1., 1., 0., 4., 4., 1., 3.],
       [0., 0., 0., 0., 0., 0., 1., 1., 0., 0.],
       [0., 1., 0., 0., 2., 0., 0., 0., 1., 2.],
       [1., 1., 0., 2., 0., 0., 3., 3., 2., 2.],
       [0., 0., 0., 0., 0., 0., 2., 0., 0., 0.],
       [0., 4., 1., 0., 3., 2., 0., 1., 4., 0.],
       [0., 4., 1., 0., 3., 0., 1., 0., 1., 2.],
       [0., 1., 0., 1., 2., 0., 4., 1., 0., 0.],
       [2., 3., 0., 2., 2., 0., 0., 2., 0., 0.]])

In [166]:
#Displaying the co-occurrence matrix in table form using pandas
names = list(top_10)
df = pd.DataFrame(co_matrix, index=names, columns=names)
print("The co-occurrence counts of words in a tabular format-\n")
df

The co-occurrence counts of words in a tabular format-



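| | pm | time | government | today | policy | way | president | world | year | security |
|---|---|---|---|---|---|---|---|---|---|---|
| pm | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| time | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 4.0 | 4.0 | 1.0 | 3.0 |
| government | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| today | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| policy | 1.0 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 3.0 | 3.0 | 2.0 | 2.0 |
| way | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| president | 0.0 | 4.0 | 1.0 | 0.0 | 3.0 | 2.0 | 0.0 | 1.0 | 4.0 | 0.0 |
| world | 0.0 | 4.0 | 1.0 | 0.0 | 3.0 | 0.0 | 1.0 | 0.0 | 1.0 | 2.0 |
| year | 0.0 | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 | 4.0 | 1.0 | 0.0 | 0.0 |
| security | 2.0 | 3.0 | 0.0 | 2.0 | 2.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 |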


In [171]:
# Words which occur together the most
max_val = 0
for i in range(10):
    for j in range(10):
        # find the largest off-diagonal value in the matrix
        if co_matrix[i][j] > max_val and i != j:
            max_val = co_matrix[i][j]

# The matrix is symmetric, so every pair is printed in both orders
for i in range(10):
    for j in range(10):
        if co_matrix[i][j] == max_val:
            print("The words which occurred the most together are '{}' and '{}' which occurred {} times"
                  .format(names[i], names[j], int(max_val)))


The words which occurred the most together are 'time' and 'president' which occurred 4 times
The words which occurred the most together are 'time' and 'world' which occurred 4 times
The words which occurred the most together are 'president' and 'time' which occurred 4 times
The words which occurred the most together are 'president' and 'year' which occurred 4 times
The words which occurred the most together are 'world' and 'time' which occurred 4 times
The words which occurred the most together are 'year' and 'president' which occurred 4 times


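We observe that the pairs time-president, time-world and year-president co-occur most often (4 times each), yet their Wu-Palmer similarities are low (0.118, 0.154 and 0.125, respectively). Conversely, the most similar pair, today-way (0.545), never co-occurs in these sentences.

Hence, on this data, words with similar meanings do not have similar co-occurrence values.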
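
One way to back up this impression with a number is to correlate the similarity scores with the co-occurrence counts over word pairs. The sketch below computes a plain Pearson correlation for the six pairs among today, way, time and president, with values copied from the two tables above. The choice of these four words and the `pearson` helper are illustrative assumptions; six pairs are far too few for a reliable estimate, and the full analysis would use all 45 pairs.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Pair order: today-way, today-time, today-president,
#             way-time, way-president, time-president
wup_scores = [0.545455, 0.307692, 0.125, 0.333333, 0.133333, 0.117647]
cooccurrence = [0, 1, 0, 0, 2, 4]

r = pearson(wup_scores, cooccurrence)
print(f"Pearson r = {r:.2f}")
```

A value near +1 would mean that similar words also tend to co-occur; a value near zero or below would support the conclusion drawn above.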