# SHW5: WordNet
In this homework you will be exploring WordNet by finding hyponyms of a synset throughout a text and building synset clusters.  

This homework is due **Tuesday, November 13 at 11:59pm**.

First, we'll import a few modules and define a few functions that parse the WordNet files for us. **Please do not import any additional libraries beyond the ones below.**

In [None]:
import numpy as np
import os
from IPython.display import Markdown, display
from nltk.stem import WordNetLemmatizer


def read_index_file(filename):
    
    words_to_first_synsets={}
    first_synsets_to_words={}
    with open(filename) as file:
        for i in range(29):
            file.readline()
            
        for line in file:
            cols = line.rstrip().split(" ")
            term = cols[0]
            p_cnt = int(cols[3], 16)
            first_synset_index = 6+p_cnt

            # A word (like "bank") can belong to multiple synsets, so select just one;
            # the first is typically the most frequently used for that word
            first_synset = cols[first_synset_index]
            words_to_first_synsets[term] = first_synset

            if first_synset not in first_synsets_to_words:
                first_synsets_to_words[first_synset] = set()

            first_synsets_to_words[first_synset].add(term)
            
    return words_to_first_synsets, first_synsets_to_words


def read_data_file(filename):

    hyponyms = {}
    with open(filename) as file:
    
        # skip header
        for i in range(29):
            file.readline()

        for line in file:        
            words = []
            cols = line.rstrip().split(" ")
            synset_id = cols[0]
            numWords = int(cols[3], 16)

            numptr_index = 6+((numWords-1) * 2)
            numPtrs = int(cols[numptr_index])

            for i in range(0, numPtrs):
                pointer_symbol = cols[numptr_index+(i * 4) + 1]
                pointed_synset = cols[numptr_index+(i * 4) + 2]
                
                if pointer_symbol == '~': # hyponym relation
                    if synset_id not in hyponyms:
                        hyponyms[synset_id] = set()
                    hyponyms[synset_id].add(pointed_synset)

    return hyponyms

## Problem 1: Hyponym Identification
For this problem, we will use the WordNet hyponym tree to identify all occurrences of a hyponym of a given synset in a piece of text. To begin, we call the functions defined above to get the relevant information.  

`word2first_synset` is a dictionary that maps a word to its first synset, and `first_synset2word` is a dictionary that maps a synset to all words that have that synset as their first synset. Note that for this homework, we are considering only the first (which is usually the most common) synset for each word.  
`hyponyms` is a dictionary that maps a synset to a set of synsets which are direct hyponyms of the given synset. This gives the tree structure of hypernym/hyponym relationships.  
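
As a rough illustration of the shape of these three dictionaries, here is a toy sketch. The synset IDs and the `_toy` names below are made up for illustration; the real IDs are the strings read from `index.noun` and `data.noun`.

```python
# Toy stand-ins for the real dictionaries; every ID here is hypothetical.
word2first_synset_toy = {
    "mammal": "00000001",
    "dog": "00000002",
    "domestic_dog": "00000002",
    "poodle": "00000003",
}

# Invert the mapping: each synset maps to the set of words
# that have it as their first synset.
first_synset2word_toy = {}
for term, synset in word2first_synset_toy.items():
    first_synset2word_toy.setdefault(synset, set()).add(term)

# Direct hyponym edges only (parent synset -> set of child synsets).
hyponyms_toy = {
    "00000001": {"00000002"},
    "00000002": {"00000003"},
}

# Look up the words of the *direct* hyponyms of 'mammal'.
mammal_id = word2first_synset_toy["mammal"]
direct_children = hyponyms_toy.get(mammal_id, set())
direct_child_words = set()
for child in direct_children:
    direct_child_words |= first_synset2word_toy[child]

print(sorted(direct_child_words))  # ['dog', 'domestic_dog']
```

Note that 'poodle' does not appear in the output: in this toy tree it is a hyponym of 'dog', so it is only an indirect hyponym of 'mammal'.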

In [None]:
# Dictionary mapping word to the first synset it is contained in, and synset to words in the synset
word2first_synset, first_synset2word = read_index_file('index.noun')

# Dictionary of synset to a set of synsets that are direct hyponyms of the synset
hyponyms = read_data_file('data.noun')

### Problem 1.1
Implement `get_hyponym_terms`, which returns all the terms belonging to any hyponym (direct or indirect) of the designated synset.

In [None]:
def get_hyponym_terms(synset_id, hyponyms, first_synsets_to_words):
    
    terms = set()
    """ YOUR CODE HERE """

    return terms

With this function, we are able to identify whether a particular word or phrase is a hyponym of a given synset. Now we can move on to using this function to help identify the locations of hyponyms in text.

### Problem 1.2
Implement `get_synset_locations`, which takes a specified text (represented as multiple lines of tokenized words) and returns the locations of all hyponyms of a given word. Each location is a nested tuple of the format `(line, (start_index, end_index))`, where the start index is inclusive and the end index is exclusive.

For example, if the word is 'mammal' and the 5th line of the text is "Dogs , such as the poodle and german shepherd make wonderful pets .", you should add `(4, (0, 1))`, `(4, (5, 6))`, and `(4, (7, 9))` to the locations list, as `dog`, `poodle`, and `german_shepherd` are hyponyms of 'mammal'. Note that both the line index and the token indices are zero-based.  

Assume `text` is a list of lines, where each line is a word-tokenized (by spaces) string representation of a paragraph of text, and `word` is the word we want to find the hyponyms of.
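
To make the location format concrete, the sketch below slices the example line with the three locations listed in the example. It only illustrates the tuple format and the half-open span; it does not find hyponyms. Joining lowercased tokens with `_` is an assumption about how multiword spans could be compared against WordNet terms such as `german_shepherd`.

```python
# The example line, tokenized by spaces.
line = "Dogs , such as the poodle and german shepherd make wonderful pets ."
tokens = line.split(" ")

# Locations from the example; 4 is the zero-based index of the 5th line.
example_locations = [(4, (0, 1)), (4, (5, 6)), (4, (7, 9))]

candidates = []
for line_index, (start, end) in example_locations:
    span = tokens[start:end]  # end is exclusive, like a Python slice
    # Assumption: multiword terms are compared by joining lowercased tokens with "_".
    candidates.append("_".join(t.lower() for t in span))

print(candidates)  # ['dogs', 'poodle', 'german_shepherd']
```

The first candidate is still the plural 'dogs', so some normalization is needed before it can match `dog`; the `WordNetLemmatizer` imported at the top of the notebook is one way to do that.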

In [None]:
def get_synset_locations(text, word, hyponyms, first_synset2word):
    
    locations = []
    """ YOUR CODE HERE """
    
    return locations

The function defined below will help visualize where the hyponyms have been located.

In [None]:
def print_text_with_bolded_hyponyms(text, word, hyponyms, first_synset2word):
    locations = get_synset_locations(text, word, hyponyms, first_synset2word)
    
    text_print = [t.split() for t in text]
    for line_index, word_index in locations:
        text_print[line_index][word_index[0]] = '**' + text_print[line_index][word_index[0]]
        text_print[line_index][word_index[1]-1] = text_print[line_index][word_index[1]-1] + '**'
    
    for l in text_print:
        display(Markdown((' '.join(l))))

With the above functions implemented, we can now see how well we're able to identify hyponyms in text. Run the cell immediately below to read the text file, and then the following cell to display the text, in which all hyponyms of the given word (in this case, 'mammal') will be bolded.

In [None]:
with open('literary.texts.txt', 'r') as f:
    lines = [l.rstrip() for l in f.readlines()]

In [None]:
print_text_with_bolded_hyponyms(lines, 'mammal', hyponyms, first_synset2word)

Feel free to change 'mammal' in the above cell to see different hyponyms being identified.

## Problem 2: Synset Clustering
In this next problem, we will generate clusters from synsets in order to find which synsets are most similar to some words that are not contained in WordNet.  

We begin by reading in our GloVe word embeddings, trained on a Twitter dataset.

In [None]:
glove_dict = {}
with open('glove.twitter.27B.25d.txt', 'r') as f:
    for line in f:
        parts = line.split()
        glove_dict[parts[0]] = np.array(parts[1:], dtype=np.float32)

### Problem 2.1
Create a cluster for each synset by finding the point that maximizes the cosine similarity to the embeddings of all the words in the synset. The resulting dictionary should map each synset ID to its optimal point.
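
As a reminder of the quantity involved, here is a minimal sketch of cosine similarity between two vectors. The helper `cosine_similarity` is hypothetical and not part of the starter code, and it assumes neither vector is all zeros.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (hypothetical helper; assumes non-zero vectors)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0], dtype=np.float32)
b = np.array([2.0, 0.0], dtype=np.float32)  # same direction as a, different length
c = np.array([0.0, 3.0], dtype=np.float32)  # orthogonal to a

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
```

Because cosine similarity ignores vector length, only the direction of a candidate point affects its score.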

In [None]:
def create_clusters(first_synset2word, glove_dict):
    
    clusters = {}
    """ YOUR CODE HERE """
    
    return clusters

Below we have a few words outside of WordNet's vocabulary. For each one, we'll find the most similar synset among the synset clusters.

In [None]:
out_of_wordnet = ['minigame', 'grandmama', 'nocebo', 'crazycatlady', 'blogoversary', 'self-motivation',
                  'bioshock', 'horcrux', 'pokemon', 'allnighter', 'belieber', 'facebook', 'ransomware',
                  'bokeh', 'crowdfunding']

synset_clusters = create_clusters(first_synset2word, glove_dict)

for word in out_of_wordnet:
    dists = []
    for key, val in synset_clusters.items():
        dists.append((key, np.dot(glove_dict[word], val) / np.linalg.norm(val)))
    closest = sorted(dists, key=lambda x: x[1], reverse=True)[0][0]
    print('Closest synset to %s: %s' % (word, ', '.join(first_synset2word[closest])))
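
The score in the loop above divides the dot product by the norm of the cluster point only, not by the norm of the word's embedding. For a fixed word this still ranks synsets the same way as full cosine similarity, since the word's norm is the same positive constant for every candidate. A small sketch with random stand-in vectors (not real GloVe embeddings) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
query = rng.normal(size=25)            # stand-in for glove_dict[word]
candidates = rng.normal(size=(5, 25))  # stand-ins for the cluster points

# Score used in the loop: dot product divided by the candidate's norm only.
partial = candidates @ query / np.linalg.norm(candidates, axis=1)

# Full cosine similarity also divides by the query's norm.
cosine = partial / np.linalg.norm(query)

# Both scores pick the same best candidate.
print(np.argmax(partial) == np.argmax(cosine))  # True
```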

### Problem 2.2
Choose three of the out-of-WordNet words above and write a comment about each of them, answering the following questions: Was the most similar synset what you expected, or did it surprise you? Why do you think that synset was the most similar, based on what you know about WordNet, word embeddings, and the data that the embeddings were trained on?

(write your response here)