# Word Similarity

Student Name: Shireen Hassann

## Overview

In this task, we quantify the similarity between pairs of words in a dataset using several methods based on word co-occurrence in the Brown corpus and the synset structure of WordNet.
First, we preprocess the dataset to filter out rare and ambiguous words.
Second, we calculate similarity scores for the word pairs in the filtered dataset using Lin similarity, NPMI and LSA.
Lastly, we quantify how well these methods work by comparing them to a human-annotated gold standard.

## 1. Preprocessing (2 marks)

<b>Instructions</b>: For this homework we will be comparing our methods against a popular dataset of word similarities called <a href="http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/">Similarity-353</a>. You need to first obtain this dataset, which is available on LMS. The file we will be using is called *set1.tab*. Make sure you save this in the same folder as the notebook. Except for the header (which should be stripped out), the file is tab-formatted, with the first two columns corresponding to two words and the third column representing a human-annotated similarity between the two words. <b>You should ignore the subsequent columns</b>.

The first six lines of the file are shown below:

```
Word 1	Word 2	Human (mean)	1	2	3	4	5	6	7	8	9	10	11	12	13	
love	sex	6.77	9	6	8	8	7	8	8	4	7	2	6	7	8	
tiger	cat	7.35	9	7	8	7	8	9	8.5	5	6	9	7	5	7	
tiger	tiger	10.00	10	10	10	10	10	10	10	10	10	10	10	10	10	
book	paper	7.46	8	8	7	7	8	9	7	6	7	8	9	4	9	
computer	keyboard	7.62	8	7	9	9	8	8	7	7	6	8	10	3	9	
```
    
You should load this file into a Python dictionary (NOTE: in Python, tuples of strings, e.g. ("tiger","cat"), can serve as the keys of a dictionary to map to their human-annotated similarity). This dataset contains many rare words: we need to filter this dataset in order for it to be better suited to the resources we will use in this assignment. So your first goal is to filter this dataset to generate a smaller test set where you will evaluate your word similarity methods.
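As a minimal sketch of the loading step, the snippet below parses a small in-memory sample laid out like the file excerpt above. The `sample` string and the `load_pairs` helper are illustrative names introduced here, not part of the assignment.

```python
import csv
import io

# Illustrative sample in the same layout as set1.tab: a header, then tab-separated rows.
sample = (
    "Word 1\tWord 2\tHuman (mean)\t1\t2\n"
    "love\tsex\t6.77\t9\t6\n"
    "tiger\tcat\t7.35\t9\t7\n"
)

def load_pairs(handle):
    """Map (word1, word2) tuples to the human score found in the third column."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    pairs = {}
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed lines
        pairs[(row[0], row[1])] = float(row[2])  # later columns are ignored
    return pairs

gold = load_pairs(io.StringIO(sample))
print(gold[("tiger", "cat")])
```

With the real file, the same helper could be given an open file handle instead of the `io.StringIO` wrapper.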

The first filtering is based on document frequencies in the Brown corpus, in order to remove rare words. In this task, we will be treating the paragraphs of the Brown corpus as our "documents". You can iterate over them by using the `paras` method of the corpus reader. You should remove tokens that are not alphabetic. Tokens should be lower-cased and lemmatized. Now calculate document frequencies for each word type, and use this to remove from your word similarity data any word pairs where at least one of the two words has a document frequency of less than 8 in this corpus.
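The document-frequency idea can be illustrated on toy data. The sketch below assumes a hypothetical `normalise` function that only lower-cases (a real run would also lemmatize) and uses a threshold of 2 instead of 8 so that the tiny example has something to filter.

```python
from collections import Counter

# Toy "paragraphs": each is a list of sentences, each sentence a list of tokens.
paragraphs = [
    [["The", "tiger", "sleeps", "."], ["A", "tiger", "hunts", "!"]],
    [["The", "cat", "sleeps", "."]],
    [["Tiger", "and", "cat", "play", "."]],
]

def normalise(token):
    # Stand-in for lower-casing plus lemmatization (assumption: no lemmatizer here).
    return token.lower()

doc_freq = Counter()
for para in paragraphs:
    # A set keeps one entry per word type, so repeats inside a paragraph count once.
    types = {normalise(tok) for sent in para for tok in sent if tok.isalpha()}
    doc_freq.update(types)

threshold = 2  # the assignment uses 8; 2 suits this tiny example
frequent = {word for word, df in doc_freq.items() if df >= threshold}
print(doc_freq["tiger"], sorted(frequent))
```

Note that "tiger" appears twice in the first paragraph but contributes only one to its document frequency.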

For this part, store all the word pair and similarity mappings in your filtered test set in a dictionary called *filtered_gold_standard*. 

(1 mark)

In [2]:
import nltk
from nltk.corpus import brown
from nltk.corpus import wordnet

nltk.download("brown")
nltk.download("wordnet")

# filtered_gold_standard stores the word pairs and their human-annotated similarity in your filtered test set
filtered_gold_standard = {}

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
gold_file = 'set1.tab'

# Load every word pair and its human-annotated similarity (third column).
gold_standard = {}
with open(gold_file) as f:
    next(f)  # skip the header line
    for line in f:
        fields = line.split('\t')
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        gold_standard[(fields[0], fields[1])] = float(fields[2])

# One set of lemmatized, lower-cased, alphabetic word types per Brown paragraph.
# wordType is reused by the NPMI and LSA cells below.
wordType = []
for para in brown.paras():
    types = set()
    for sentence in para:
        for word in sentence:
            if word.isalpha():
                types.add(lemmatizer.lemmatize(word.lower()))
    wordType.append(types)

# Document frequency: the number of paragraphs in which each word type occurs.
docFreq = {}
for types in wordType:
    for word in types:
        docFreq[word] = docFreq.get(word, 0) + 1

# Keep only the pairs in which both words occur in at least 8 paragraphs.
filtered_gold_standard = {pair: score for pair, score in gold_standard.items()
                          if docFreq.get(pair[0], 0) >= 8 and docFreq.get(pair[1], 0) >= 8}

print(len(filtered_gold_standard))
print(filtered_gold_standard)



94
{('love', 'sex'): 6.77, ('tiger', 'cat'): 7.35, ('tiger', 'tiger'): 10.0, ('book', 'paper'): 7.46, ('plane', 'car'): 5.77, ('train', 'car'): 6.31, ('telephone', 'communication'): 7.5, ('television', 'radio'): 6.77, ('drug', 'abuse'): 6.85, ('bread', 'butter'): 6.19, ('doctor', 'nurse'): 7.0, ('professor', 'doctor'): 6.62, ('student', 'professor'): 6.81, ('smart', 'student'): 4.62, ('smart', 'stupid'): 5.81, ('company', 'stock'): 7.08, ('stock', 'market'): 8.08, ('stock', 'phone'): 1.62, ('stock', 'egg'): 1.81, ('stock', 'live'): 3.73, ('stock', 'life'): 0.92, ('book', 'library'): 7.46, ('bank', 'money'): 8.12, ('wood', 'forest'): 7.73, ('money', 'cash'): 9.08, ('king', 'queen'): 8.58, ('bishop', 'rabbi'): 6.69, ('holy', 'sex'): 1.62, ('football', 'basketball'): 6.81, ('football', 'tennis'): 6.63, ('tennis', 'racket'): 7.56, ('law', 'lawyer'): 8.38, ('movie', 'star'): 7.38, ('movie', 'critic'): 6.73, ('movie', 'theater'): 7.92, ('space', 'chemistry'): 4.88, ('alcohol', 'chemistry'): 

<b>For your testing:</b>

In [3]:
assert(len(filtered_gold_standard) > 50 and len(filtered_gold_standard) < 100)

In [4]:
assert(filtered_gold_standard[('love', 'sex')] == 6.77)

<b>Instructions</b>: Here, you apply the second filtering. The second filtering is based on words with highly ambiguous senses and involves using the NLTK interface to WordNet. Here, you should remove any words which do not have a *single primary sense*. We define single primary sense here as either a) having only one sense (i.e. only one synset), or b) where the count (as provided by the WordNet `count()` method for the lemmas associated with a synset) of the most common sense is at least 4 times larger than the next most common sense. Note that a synset can be associated with multiple lemmas. You should only consider the count of your lemma. Also, you should remove any words where the primary sense is not a noun (this information is also in the synset). Store the synset corresponding to this primary sense in a dictionary for use in the next section. Given this definition, remove the word pairs from the test set where at least one of the words does not meet the above criteria.
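The "single primary sense" rule can be sketched independently of WordNet. The function below is a hypothetical helper that takes one usage count per sense (in the assignment these would be the counts of the word's own lemma in each synset) and returns the index of the primary sense, if any.

```python
def primary_sense_index(counts, ratio=4):
    """Return the index of the single primary sense, or None if there is none.

    `counts` holds one usage count per sense of a word (hypothetical input).
    """
    if not counts:
        return None  # the word has no senses at all
    if len(counts) == 1:
        return 0  # case (a): only one sense
    ranked = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # Case (b): the top count is at least `ratio` times the next one.
    # Read literally, two zero counts also pass this test.
    if counts[best] >= ratio * counts[runner_up]:
        return best
    return None

print(primary_sense_index([40, 3, 10]), primary_sense_index([10, 5]))
```

The part-of-speech check (the primary sense must be a noun) is a separate test applied to whichever sense this rule selects.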

When you have applied the two filtering steps, you should store all the word pair and similarity mappings in your filtered test set in a dictionary called *final_gold_standard*. 

(1 mark)

In [5]:
# final_gold_standard stores the word pairs and their human-annotated similarity in your final filtered test set
final_gold_standard = {}

###
# Your answer BEGINS HERE
###

wordSynset = {}

def countLemmas(word):
    """Return (synset name, count) pairs for the senses of `word`, most frequent first."""
    list_wordSynset = []
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            if lemma.name() == word:  # only count our own lemma, not its synonyms
                list_wordSynset.append((synset.name(), lemma.count()))
    return sorted(list_wordSynset, key=lambda x: x[1], reverse=True)

def checkSynset(word):
    """Record the primary noun sense of `word` in wordSynset; return False if it has none."""
    lemma_list = countLemmas(word)
    if not lemma_list:
        return False  # no synset lists this exact lemma
    if len(lemma_list) == 1:
        # a) only one sense, which must be a noun
        synset = wordnet.synsets(word)[0]
        if synset.pos() == 'n':
            wordSynset[word] = synset.name()
            return True
        return False
    # b) the most common sense is at least 4 times as frequent as the next one
    top, runner_up = lemma_list[0], lemma_list[1]
    if top[1] >= 4 * runner_up[1] and top[0].split('.')[1] == 'n':
        wordSynset[word] = top[0]
        return True
    return False

# Keep only the pairs in which both words have a single primary noun sense.
# Building a new dictionary leaves filtered_gold_standard unchanged.
final_gold_standard = {pair: score for pair, score in filtered_gold_standard.items()
                       if checkSynset(pair[0]) and checkSynset(pair[1])}

###
# Your answer ENDS HERE
###

print(len(final_gold_standard))
print(final_gold_standard)

27
{('bread', 'butter'): 6.19, ('professor', 'doctor'): 6.62, ('student', 'professor'): 6.81, ('stock', 'egg'): 1.81, ('money', 'cash'): 9.08, ('king', 'queen'): 8.58, ('bishop', 'rabbi'): 6.69, ('football', 'basketball'): 6.81, ('football', 'tennis'): 6.63, ('alcohol', 'chemistry'): 5.54, ('baby', 'mother'): 7.85, ('car', 'automobile'): 8.94, ('journey', 'voyage'): 9.29, ('coast', 'shore'): 9.1, ('furnace', 'stove'): 8.79, ('brother', 'monk'): 6.27, ('journey', 'car'): 5.85, ('coast', 'hill'): 4.38, ('forest', 'graveyard'): 1.85, ('monk', 'slave'): 0.92, ('coast', 'forest'): 3.15, ('psychology', 'doctor'): 6.42, ('psychology', 'mind'): 7.69, ('psychology', 'health'): 7.23, ('psychology', 'science'): 6.71, ('planet', 'moon'): 8.08, ('planet', 'galaxy'): 8.11}


<b>For your testing:</b>

In [6]:
assert(len(final_gold_standard) > 10 and len(final_gold_standard) < 40)

In [7]:
assert(final_gold_standard[('professor', 'doctor')] == 6.62)

## 2. Word similarity scores with Lin similarity, NPMI and LSA (3 marks)

<b>Instructions</b>: Now you will create several dictionaries with similarity scores for pairs of words in your test set derived using the techniques discussed in class. The first of these is the Lin similarity for your word pairs using the information content of the Brown corpus, which you should calculate using the primary sense for each word derived above. You can use the built-in method included in the NLTK interface; you don't have to implement your own.

When you're done, you should store the word pair and similarity mappings in a dictionary called *lin_similarities*.
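For intuition about what the built-in method computes, Lin similarity is twice the information content (IC) of the lowest common subsumer divided by the sum of the ICs of the two senses, where IC is the negative log probability of a concept. The sketch below works directly from made-up probabilities; the function names and numbers are illustrative assumptions, not NLTK's API.

```python
from math import log

def information_content(prob):
    """IC of a concept with corpus probability `prob` (0 < prob <= 1)."""
    return -log(prob)

def lin(p_lcs, p_a, p_b):
    """Lin similarity from the probabilities of the subsumer and the two senses."""
    return 2 * information_content(p_lcs) / (
        information_content(p_a) + information_content(p_b)
    )

# Made-up probabilities: the shared ancestor is more general (more probable)
# than either sense, so the score falls strictly between 0 and 1.
print(lin(0.05, 0.01, 0.02))

# If the only shared ancestor has probability 1 (for example the root), its IC is 0.
print(lin(1.0, 0.01, 0.02))
```

The second case is consistent with the `-0.0` values that appear in the output of the next cell, since `-log(1.0)` evaluates to negative zero in floating point.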

(1 mark)

In [8]:
from nltk.corpus import wordnet_ic
nltk.download('wordnet_ic')

# lin_similarities stores the word pair and Lin similarity mappings
lin_similarities = {}

###
# Your answer BEGINS HERE
###

brown_ic = wordnet_ic.ic('ic-brown.dat')

for key, value in final_gold_standard.items():
    firstWord = wordnet.synset(wordSynset[key[0]])
    secondWord = wordnet.synset(wordSynset[key[1]])
    lin_similarities[key] = firstWord.lin_similarity(secondWord,brown_ic)
    
###
# Your answer ENDS HERE
###

print(lin_similarities)



{('bread', 'butter'): 0.711420490146294, ('professor', 'doctor'): 0.7036526610448273, ('student', 'professor'): 0.26208607023317687, ('stock', 'egg'): -0.0, ('money', 'cash'): 0.7888839126424345, ('king', 'queen'): 0.25872135992145145, ('bishop', 'rabbi'): 0.6655650900427844, ('football', 'basketball'): 0.7536025025710653, ('football', 'tennis'): 0.7699955045932811, ('alcohol', 'chemistry'): 0.062235427146896456, ('baby', 'mother'): 0.6315913189894092, ('car', 'automobile'): 1.0, ('journey', 'voyage'): 0.6969176573027711, ('coast', 'shore'): 0.9632173804623256, ('furnace', 'stove'): 0.22813808925013807, ('brother', 'monk'): 0.24862817480738675, ('journey', 'car'): -0.0, ('coast', 'hill'): 0.5991131628821826, ('forest', 'graveyard'): -0.0, ('monk', 'slave'): 0.2543108201944307, ('coast', 'forest'): -0.0, ('psychology', 'doctor'): -0.0, ('psychology', 'mind'): 0.304017384194818, ('psychology', 'health'): 0.06004979886905243, ('psychology', 'science'): 0.8474590505736942, ('planet', 'moon

<b>For your testing:</b>

In [9]:
assert(lin_similarities[('professor', 'doctor')] > 0.5 and lin_similarities[('professor', 'doctor')] < 1)

**Instructions:** Next, you will calculate Normalized PMI (NPMI) for your word pairs using word frequencies derived from the Brown corpus.

PMI is defined as:

\begin{equation*}
PMI = \log_2\left(\frac{p(x,y)}{p(x)p(y)}\right)
\end{equation*}

where

\begin{equation*}
p(x,y) = \frac{\text{Number of paragraphs with the co-occurrence of x and y}}{\sum_i \text{Number of word types in paragraph}_i}
\end{equation*}

\begin{equation*}
p(x) = \frac{\text{Number of paragraphs with the occurrence of x}}{\sum_i \text{Number of word types in paragraph}_i}
\end{equation*}

\begin{equation*}
p(y) = \frac{\text{Number of paragraphs with the occurrence of y}}{\sum_i \text{Number of word types in paragraph}_i}
\end{equation*}

with the sum over $i$ ranging over all paragraphs. Note that there are other ways PMI could be formulated.

NPMI is defined as:

\begin{equation*}
NPMI = \frac{PMI}{-\log_2(p(x,y))} = \frac{\log_2(p(x)p(y))}{\log_2(p(x,y))} - 1
\end{equation*}

Thus, when there is no co-occurrence, NPMI is -1. NPMI is normalized between [-1, +1].
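A small worked example can confirm that the two forms of the NPMI equation agree. The counts below are invented and chosen as powers of two so that the arithmetic is easy to follow; `npmi` is a hypothetical helper that assumes the co-occurrence count is smaller than the normalizing total.

```python
from math import log2, isclose

def npmi(n_xy, n_x, n_y, total):
    """NPMI from paragraph counts and the normalizing total (assumes n_xy < total)."""
    if n_xy == 0:
        return -1.0  # no co-occurrence
    p_xy, p_x, p_y = n_xy / total, n_x / total, n_y / total
    pmi = log2(p_xy / (p_x * p_y))
    return pmi / -log2(p_xy)

# Invented counts: p(x) = 2^-6, p(y) = 2^-7, p(x,y) = 2^-8.
total, n_x, n_y, n_xy = 1024, 16, 8, 4
first_form = npmi(n_xy, n_x, n_y, total)

# Second form of the definition: log2(p(x)p(y)) / log2(p(x,y)) - 1.
second_form = log2((n_x / total) * (n_y / total)) / log2(n_xy / total) - 1

print(first_form, second_form)
```

Here PMI is 5 and the normalizer is 8, so both forms give 13/8 - 1 = 0.625. When two words only ever occur together, the same function returns +1.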

You should use the same setup as you did to calculate document frequency above: paragraphs as documents, lemmatized, lower-cased, and with term frequency information removed by conversion to Python sets. You need to use the basic method for calculating PMI introduced in class (and also in the reading), which is appropriate for any possible definition of co-occurrence (here, there is co-occurrence when a word pair appears in the same paragraph), but you should only calculate PMI for the words in your test set. You must avoid building the entire co-occurrence matrix; instead, you should keep track of the sums you need for the probabilities as you go along.
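One way to keep running sums without a co-occurrence matrix is a single pass over the paragraph sets that updates only the counters for the test words and test pairs. The sketch below uses toy data; all variable names are illustrative.

```python
from collections import Counter

# Toy paragraphs, already reduced to sets of word types.
paragraph_types = [
    {"bread", "butter", "knife"},
    {"bread", "oven"},
    {"butter", "bread"},
    {"coast", "shore"},
]
pairs = [("bread", "butter"), ("coast", "shore"), ("bread", "coast")]

targets = {word for pair in pairs for word in pair}  # only words we need
word_counts = Counter()  # paragraphs containing each target word
pair_counts = Counter()  # paragraphs containing both words of a pair
total_types = 0          # sum over paragraphs of the number of word types

for types in paragraph_types:
    total_types += len(types)
    present = targets & types  # target words found in this paragraph
    word_counts.update(present)
    for pair in pairs:
        if pair[0] in present and pair[1] in present:
            pair_counts[pair] += 1

print(total_types, word_counts["bread"], pair_counts[("bread", "butter")])
```

Memory use grows with the number of test words and pairs rather than with the square of the vocabulary size.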

When you have calculated NPMI for all the pairs, you should store the word pair and NPMI-similarity mappings in a dictionary called *NPMI_similarities*.

(1 mark)

In [10]:
# NPMI_similarities stores the word pair and NPMI similarity mappings
NPMI_similarities = {}

###
# Your answer BEGINS HERE
###
from math import log2

# Normalizing constant: total number of word types summed over all paragraphs.
sumOfWordType = sum(len(types) for types in wordType)

# p(x) for every word that has a primary sense.
probIndividual = {}
for word in wordSynset:
    probIndividual[word] = sum(1 for types in wordType if word in types) / sumOfWordType

# p(x, y) for every pair in the final test set.
probCooccurrence = {}
for pair in final_gold_standard:
    count = sum(1 for types in wordType if pair[0] in types and pair[1] in types)
    probCooccurrence[pair] = count / sumOfWordType

for pair in final_gold_standard:
    if probCooccurrence[pair] <= 0:
        # When there is no co-occurrence, NPMI is -1.
        NPMI_similarities[pair] = -1
    else:
        NPMI_similarities[pair] = (log2(probIndividual[pair[0]] * probIndividual[pair[1]])
                                   / log2(probCooccurrence[pair])) - 1
###
# Your answer ENDS HERE
###


print(NPMI_similarities)


{('bread', 'butter'): 0.6538151518893889, ('professor', 'doctor'): -1, ('student', 'professor'): 0.5365071315202603, ('stock', 'egg'): 0.3751921743542559, ('money', 'cash'): 0.44845734833722983, ('king', 'queen'): 0.41920030980112544, ('bishop', 'rabbi'): -1, ('football', 'basketball'): 0.7167622049649203, ('football', 'tennis'): -1, ('alcohol', 'chemistry'): 0.6253212412399309, ('baby', 'mother'): 0.5164247675130536, ('car', 'automobile'): 0.5440375908153885, ('journey', 'voyage'): -1, ('coast', 'shore'): 0.5910001797885329, ('furnace', 'stove'): -1, ('brother', 'monk'): 0.4309699607308475, ('journey', 'car'): -1, ('coast', 'hill'): 0.34402837373003714, ('forest', 'graveyard'): -1, ('monk', 'slave'): -1, ('coast', 'forest'): 0.4626209010620377, ('psychology', 'doctor'): 0.46517044587124756, ('psychology', 'mind'): 0.44789746893872007, ('psychology', 'health'): -1, ('psychology', 'science'): 0.5916853413144052, ('planet', 'moon'): 0.6587081059716964, ('planet', 'galaxy'): -1}


<b>For your testing:</b>

In [11]:
assert(NPMI_similarities[('professor', 'doctor')] == -1)

**Instructions:** As the PMI matrix is very sparse and can be approximated well by a dense representation via singular value decomposition (SVD), you will derive similarity scores using the Latent Semantic Analysis (LSA) method, i.e. apply SVD and truncate to get a dense vector representation of a word type and then calculate cosine similarity between the two vectors for each word pair. You can use the Distributed Semantics notebook as a starting point, but note that since you are interested here in word semantics, you will be constructing a matrix where the (non-sparse) rows correspond to words in the vocabulary, and the (sparse) columns correspond to the texts where they appear (this is the opposite of the notebook). Again, use the Brown corpus, in the same format as with PMI and document frequency. After you have a matrix in the correct format, use `TruncatedSVD` in `sklearn` to produce dense vectors of length k = 500, and then use cosine similarity to produce similarities for your word pairs.

When you are done, you should store the word pair and LSA-similarity mappings in a dictionary called *LSA_similarities*. 
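The matrix orientation and the cosine step can be illustrated without any libraries. The sketch below builds one row per word and one column per toy document, then compares rows directly; it deliberately omits the truncated SVD step, and all names and data are illustrative.

```python
from math import sqrt, isclose

# Toy documents, already reduced to sets of word types.
docs = [
    {"car", "road", "engine"},
    {"automobile", "road", "engine"},
    {"moon", "planet"},
    {"planet", "orbit", "moon"},
]
vocab = sorted(set().union(*docs))

# One row per word, one column per document: the transpose of the usual
# document-by-word layout.
rows = {word: [1.0 if word in doc else 0.0 for doc in docs] for word in vocab}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine(rows["moon"], rows["planet"]))     # always appear together
print(cosine(rows["car"], rows["automobile"]))  # never share a document
```

On these raw rows, "car" and "automobile" score 0 because they never share a document, even though both occur with "road" and "engine". Projecting the rows into a lower-dimensional space is the part of LSA intended to bring such words closer together.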

(1 mark)

In [12]:
# LSA_similarities stores the word pair and LSA similarity mappings
LSA_similarities = {}

###
# Your answer BEGINS HERE
###
from sklearn.feature_extraction import DictVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np

brownList_paras = []
for i in wordType:
    para_dict = {}
    for j in i:
        para_dict[j]=1
    brownList_paras.append(para_dict)

vectorizer = DictVectorizer()
matrixBrown = vectorizer.fit_transform(brownList_paras)
matrixBrown = matrixBrown.transpose()
svd = TruncatedSVD(n_components=500)
matrixBrown = svd.fit_transform(matrixBrown)

# vocabulary_ maps each word to its row index in the transposed matrix.
for wordPair in final_gold_standard:
    wordOne = matrixBrown[vectorizer.vocabulary_[wordPair[0]], :]
    wordTwo = matrixBrown[vectorizer.vocabulary_[wordPair[1]], :]
    # Cosine similarity between the two dense word vectors.
    LSA_similarities[wordPair] = (np.dot(wordOne, wordTwo)
                                  / np.sqrt(np.dot(wordOne, wordOne) * np.dot(wordTwo, wordTwo)))
    
###
# Your answer ENDS HERE
###

print(LSA_similarities)

{('bread', 'butter'): 0.2951162988823682, ('professor', 'doctor'): 0.06269017001792164, ('student', 'professor'): 0.2686080127056509, ('stock', 'egg'): 0.12591854232962824, ('money', 'cash'): 0.1537288213670708, ('king', 'queen'): 0.15916084802614966, ('bishop', 'rabbi'): 0.03705881676102094, ('football', 'basketball'): 0.23351843666126051, ('football', 'tennis'): 0.13928676956933891, ('alcohol', 'chemistry'): 0.08256727968871291, ('baby', 'mother'): 0.3190152022014363, ('car', 'automobile'): 0.3391150376533186, ('journey', 'voyage'): 0.1266235650396778, ('coast', 'shore'): 0.4080920153585178, ('furnace', 'stove'): 0.10149253563173559, ('brother', 'monk'): 0.06820177432287158, ('journey', 'car'): -0.00013027645995288068, ('coast', 'hill'): 0.22204958105994993, ('forest', 'graveyard'): 0.06292790189564043, ('monk', 'slave'): -0.038042130943285395, ('coast', 'forest'): 0.11081581358544729, ('psychology', 'doctor'): 0.1798311442495496, ('psychology', 'mind'): 0.11205451947910017, ('psycho

<b>For your testing:</b>

In [13]:
assert(LSA_similarities[('professor', 'doctor')] > 0 and LSA_similarities[('professor', 'doctor')] < 0.4)

## 3. Comparison with the Gold Standard (1 mark)


**Instructions:** Finally, you should compare all the similarities you've created to the gold standard you loaded and filtered in the first step. For this, you can use the Pearson correlation coefficient (`pearsonr`), which is included in scipy (`scipy.stats`). Be careful when converting your dictionaries to lists for this purpose: the data for the two datasets needs to be in the same order for a correct comparison using correlation. Write a general function, then apply it to each of the similarity score dictionaries.

When you are done, you should put the result in a dictionary called *pearson_correlations* (use the keys: 'lin', 'NPMI', 'LSA').  
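The ordering caveat can be demonstrated on toy data. The sketch below uses a hand-written Pearson function (a stand-in for `pearsonr`, so that the example has no dependencies) and two dictionaries with the same keys in different insertion order; the names and scores are invented.

```python
from math import sqrt, isclose

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (stand-in for scipy's pearsonr)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlate(gold, scores):
    """Align both dictionaries on the same key order before correlating."""
    keys = sorted(gold)
    return pearson([gold[k] for k in keys], [scores[k] for k in keys])

gold = {("a", "b"): 1.0, ("c", "d"): 2.0, ("e", "f"): 3.0}
scores = {("e", "f"): 0.9, ("a", "b"): 0.1, ("c", "d"): 0.5}  # different insertion order

aligned = correlate(gold, scores)
misaligned = pearson(list(gold.values()), list(scores.values()))
print(aligned, misaligned)
```

Aligned by key, the toy scores correlate perfectly with the gold values; paired by insertion order, the same numbers give a correlation of about -0.5.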

<b>Hint:</b> All of the methods used here should be markedly above 0, but also far from 1 (perfect correlation); if you're not getting reasonable results, go back and check your code for bugs!  

(1 mark)


In [14]:
from scipy.stats import pearsonr

# pearson_correlations stores the pearson correlations with the gold standard of 'lin', 'NPMI', 'LSA'
pearson_correlations = {}

###
# Your answer BEGINS HERE
###
def getCorrelation(data, similarity):
    # Look both scores up by the same key so the two lists are aligned pair by pair,
    # rather than relying on the dictionaries sharing an insertion order.
    pairs = list(data)
    listOne = [data[pair] for pair in pairs]
    listTwo = [similarity[pair] for pair in pairs]
    return pearsonr(listOne, listTwo)

# pearsonr returns (correlation, p-value); only the correlation is stored.
lin_correlations = getCorrelation(final_gold_standard, lin_similarities)
pearson_correlations['lin'] = lin_correlations[0]

NPMI_correlations = getCorrelation(final_gold_standard, NPMI_similarities)
pearson_correlations['NPMI'] = NPMI_correlations[0]

LSA_correlations = getCorrelation(final_gold_standard, LSA_similarities)
pearson_correlations['LSA'] = LSA_correlations[0]
   
###
# Your answer ENDS HERE
###

print(pearson_correlations)

{'lin': 0.44906965748915684, 'NPMI': 0.1268490481420886, 'LSA': 0.4353158562504137}


<b>For your testing:</b>

In [15]:
assert(pearson_correlations['lin'] > 0.4 and pearson_correlations['lin'] < 0.8)

## A final word

Normally, we would not use a corpus as small as the Brown corpus for the purposes of building distributional word vectors. Also, note that filtering our test set to just the words we are likely to do well on would typically be considered cheating.