# lexsub: default program

In [1]:
from default import *
import os

## Run the default solution on dev

In [2]:
lexsub = LexSub(os.path.join('data','glove.6B.100d.magnitude'))
output = []
with open(os.path.join('data','input','dev.txt')) as f:
    for line in f:
        fields = line.strip().split('\t')
        output.append(" ".join(lexsub.substitutes(int(fields[0].strip()), fields[1].strip().split())))
print("\n".join(output[:10]))

sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner


## Evaluate the default output

In [3]:
from lexsub_check import precision
with open(os.path.join('data','reference','dev.out'), 'rt') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
print("Score={:.2f}".format(100*precision(ref_data, output)))

Score=27.89


## Documentation
Write some beautiful documentation of your program here.


## Explanation of Retrofit Implementation

In this assignment, we have used retrofitting method to train a word vector to predict the synonyms of a given word. Retrofitting utilizes two type of data to train the word embeddings: the pretrained word embedding (GloVe for this exercise) and ontology file which provide undirected graphs of words connected to other words having similar context. The trained word embeddings should be located in similar vector space to other words in the undirected graph.  

In [58]:
from lexsub import *
import os
from lexsub_check import precision

Initial Retrofit __init__ function

The constructor creates and initializes three important variables - vocabulary, wvecs, and ontology_dic. Vocabulary is a set of words that exist as the pretrained word embedding. Wvecs is the dictionary variable containing the pymagnitude vector and the corresponding key created for faster access to data. Lastly ontology_dic is the dictionary object that contains list of ontology words of the key.



In [None]:
def __init__(self, wvec_file, ontology_file, topn=10):
    if os.getcwd().split('/')[-1] == 'answer':
        self.path = os.path.dirname(os.getcwd())
    else:
        self.path = os.getcwd()

    self.vocabulary = set()
    self.wvecs = pymagnitude.Magnitude(wvec_file)
    self.topn = topn

    self.Beta = {} # hyperparameter matrix key: (key1, key2); value: number of times the pair appears in ontology
    self.wvecs_dic = {}
    self.Q = {}
    for key, vector in self.wvecs:
        self.vocabulary.add(key)
        self.Q[key] = vector

    # store text data in ontolgy file
    self.ontology_dic = {}
    for line in open(ontology_file, 'r'):
        words = line.lower().strip().split()
        target = words[0]
        contexts = words[1:]

        if words[0] not in self.ontology_dic:
            self.ontology_dic[words[0]] = [word for word in words[1:]]
        else:
            self.ontology_dic[words[0]] = self.ontology_dic[words[0]] + [word for word in words[1:]]

This function returns list of ontology words of the target word stored in the ontology_dic. If no key matches the target, the function simply returns nothing

In [44]:
def find_qjs(self, target):
    if target in self.ontology_dic:
        contexts = self.ontology_dic[target]
        return contexts
    else:
        return None

This function calculates the new values for the vector qi. The constants alpha and beta are a hyperparameter which can be changed to improve the performance of the model. In this implementation, beta is set as the number of occurence of target, ontology word pairs in the onotlogy document. For alpha, we have tried __a = 1__ which resulted dev.out = __44.55__ and __a = 2__ resulted the dev.out score of __46.10__. As we kept increseing the alpha values, as __a = 3__ resulted dev.out score __48.50__ , __a = 15__ resulted dev.out score __52.32__ and lastly __a = 20__ dev.out score __52.73__, we figured out that increasing the alpha value gave us the better dev.out score. Increased accuracy due to increasing alpha can be explained as with the "quality" of ontology document. As it will be shown in the Alternative Methods section, the alternative methods with increased usage of ontology, suffered from decreased model accuracy. Meaning ontology could be acting as a noise to the model. By increasing the alpha we increase the weight applied to the pretrained vector and prevent the noise from ontology from sabotaging the updated word embeddings. 

In [34]:
def update_qi(self):
    a = 20
    B = 1
    for index, target in enumerate(self.vocabulary): 
        qjs = self.find_qjs(target) 
        if qjs == None:
            continue
        else:     
            new_qi = np.zeros(self.Q[target].shape)
            count = 0
            for context in qjs: 
                if context in self.Q:
                    new_qi += B * self.Q[context] 
                    count += 1
            new_qi = (new_qi + a * self.Q[target]) / (count + a)
            self.Q[target] = new_qi

This function creates a glove.6B.100d.retrofit.txt file and glove.6B.100d.retrofit.magnitude.

In [35]:
def create_retrofit_txt(self):
    text_file_name = os.path.join(self.path, 'data', 'glove.6B.100d.retrofit.txt')
    with open(text_file_name, 'w') as f:
        for key, vector in self.Q.items():
            line = key
            for num in vector:
                line = line + " " + str(num)
            line = line + '\n'
            f.write(line)

    destination_file_name = os.path.join(self.path, 'data', 'glove.6B.100d.retrofit.magnitude')
    converter.convert(text_file_name, destination_file_name)

    return destination_file_name

This function creates set of vocabularies and assign it.

In [45]:
def retrofit(self):
    for t in range(self.topn):
        self.update_qi()
    Q_pymag_dest = self.create_retrofit_txt()
    self.Q_pymag = pymagnitude.Magnitude(Q_pymag_dest)

## Output after retrofitting implementation

In [None]:
lexsub = LexSub(os.path.join(os.path.dirname(os.getcwd()), 'data','glove.6B.100d.retrofit.magnitude'), os.path.join(os.path.dirname(os.getcwd()), 'data','lexicons','ppdb-xl.txt'))
lexsub.retrofit()
output = []
with open(os.path.join(os.path.dirname(os.getcwd()), 'data','input','dev.txt')) as f:
    for line in f:
        fields = line.strip().split('\t')
        output.append(" ".join(lexsub.substitutes(int(fields[0].strip()), fields[1].strip().split())))
print("\n".join(output[:10]))

In [None]:
sides aside edge bottom under hand part close below away
sides aside edge bottom under hand part close below away
sides aside edge bottom under hand part close below away
sides aside edge bottom under hand part close below away
sides aside edge bottom under hand part close below away
sides aside edge bottom under hand part close below away
sides aside edge bottom under hand part close below away
sides aside edge bottom under hand part close below away
sides aside edge bottom under hand part close below away
sides aside edge bottom under hand part close below away

In [None]:
with open(os.path.join(os.path.dirname(os.getcwd()), 'data','reference','dev.out'), 'rt') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
print("Score={:.2f}".format(100*precision(ref_data, output)))

Score=52.73

## Analysis

Do some analysis of the results. What ideas did you try? What worked and what did not?

The main hurdle of retrofitting implementation was creating efficient algorithm. Since GloVe datasets are consisted with 400,000 word vectors of dimensionality 100, and each vector has find its “neighbours” in to update itself ten times the running time of the program can differ dramatically depending on how it is implemented. After investing many hours and creating several implementations, the solution settled on utilizing python set to store the neighbouring context words for each target word. Also, there are subtle detail one should consider when one of the context word does not exist in the pertained vector. Our baseline implementation handles this issue by simply ignoring the vector and removing corresponding term during the calculation. 
We have tried running with different types of txt files in lexicons. As a result, we have implemented retrofitting to combine the information about word senses from ppdb-xl.txt in order to modify the default word vectors. After some tests, we find that the retrofitting with ppdb-xl.txt has the best result which was __52.73__ dev.out score.

## Alternative Methods

Although following methods did not improve the F-score of the solution (in fact they yielded lower F-score), these methods applies heuristic to the baseline model. These methods add the out-of-vocabulary handling feature to the baseline method.

1. First alternative methods, solves the aforementioned non-existing context word case by utilizing pymagnitude’s __“out-of-vocabulary”__ feature. Pymagnitude generates vector for __out-of-vocabulary__ words by observing the similar words in the pre-trained vector such that word with similar spellings will be assigned with high similarity value. By including word embeddings of unseen word during the vector update process, the new calculated vector will incorporate more information of its context words.

### Second alternative methods 

In [73]:
def __init__(self, wvec_file, ontology_file, topn=10):
    self.vocabulary = set()
    self.wvecs = pymagnitude.Magnitude(wvec_file)
    self.topn = topn
    self.Q = {} # updated vectors for pre_existing_vectors 
    self.context_out_of_words = set()

    self.Beta = {} # hyperparameter matrix key: (key1, key2); value: number of times the pair appears in ontology

    for key, vector in self.wvecs: 
        self.vocabulary.add(key)
        self.Q[key] = vector

    # look through the ontology and check for out_of_words
    num_ontology_line = 0
    with open(ontology_file) as f:
        for i, l in enumerate(f):
            pass
        num_ontology_line = i+1

    self.out_of_word = {} #  dictionary containing key and vector of newly added vocabulary
    self.ontology_dic = {} # dictionary containing context words of a target word

    ont = open(ontology_file, 'r')
    for l in range(num_ontology_line):
        line = ont.readline(l)
        words = line.lower().strip().split() # line in ontology
        for target in words: 
            contexts = [word for word in words if word!=target]
            if target not in self.vocabulary: # check if the word vector exists # there is no vector for this word
                new_vec = np.zeros(self.wvecs.query(target).shape)
                sum_count = 0 
                for context in contexts:
                    if context in self.wvecs:
                        sum_count += 1
                        new_vec += self.wvecs.query(target)
                    else:
                        self.context_out_of_words.add(context)
                if sum_count > 0 :
                    new_vec = new_vec / sum_count
                    self.Q[target] = new_vec    
                    self.vocabulary.add(target) 
            if target not in self.ontology_dic:
                self.ontology_dic[target] = set(contexts)
            else:
                self.ontology_dic[target] |= set(contexts)
            for context in contexts:
                if frozenset([target, context]) in self.Beta: 
                    self.Beta[frozenset([target, context])] += 1
                else:
                    self.Beta[frozenset([target, context])] = 1

2. Second method enhances the context word dictionary (ontology_dic in the source code) and attempts to create Beta weight dictionary where its key is a frozenset of two words and value is the number of occurrence of the word pair in the ontology file. 
With frozenset key, key with different order still yields same value making the order of key ineffective __(dic[frozenset(A,B)] == dic[frozenset(B,A)])__. In the baseline implementation, the code only treats the zeroth word of the ontology line as a target and saved all latter words as context words for faster computation. However the ontology line is an undirected, fully-connected graph. Hense, the second method creates accurate ontology dictionary which reflects the definition of the ontology file. Also, this method takes different approach for treating the out-of-vocabulary word. Instead of utilizing he pymagnitude’s feature, it computes the average of its context word’s vector embedding if the context exist as the pertained vector. This method worsened dev.out score as __27.00__ which is similar to the default solution.