# lexsub: default program

In [1]:
import sys
sys.path.append("..")
from default import *
import os

## Run the default solution on dev

In [2]:
lexsub = LexSub(os.path.join('../data','glove.6B.100d.magnitude'))
output = []
with open(os.path.join('../data','input','dev.txt')) as f:
    for line in f:
        fields = line.strip().split('\t')
        output.append(" ".join(lexsub.substitutes(int(fields[0].strip()), fields[1].strip().split())))
print("\n".join(output[:10]))

sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner


## Evaluate the default output

In [3]:
import sys
sys.path.append("..")
from lexsub_check import precision
with open(os.path.join('../data','reference','dev.out'), 'rt') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
print("Score={:.2f}".format(100*precision(ref_data, output)))

Score=27.89


## Analysis

The default solution uses pre-trained word vectors to find the nearest neighbors of the target word. This approach disregards the valuable information contained in semantic lexicons such as WordNet, and it does not perform well on our lexical substitution task (27.89 on dev). To make use of semantic relations, we need to retrofit the vectors.
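
To make the nearest-neighbour idea concrete, here is a minimal sketch that ranks words by cosine similarity. The `nearest_neighbours` helper and the 2-d `toy_vectors` are invented for illustration; they are not the `LexSub` implementation or the GloVe data.

```python
import numpy as np

def nearest_neighbours(word, vectors, topn=3):
    """Return the topn words closest to `word` by cosine similarity."""
    query = vectors[word] / np.linalg.norm(vectors[word])
    scores = {}
    for other, vec in vectors.items():
        if other == word:
            continue  # the target word is not its own substitute
        scores[other] = float(query @ (vec / np.linalg.norm(vec)))
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# Toy 2-d vectors, invented for illustration only.
toy_vectors = {
    "side": np.array([1.0, 0.0]),
    "edge": np.array([0.9, 0.1]),
    "club": np.array([0.5, 0.5]),
    "banana": np.array([0.0, 1.0]),
}
neighbours = nearest_neighbours("side", toy_vectors, topn=2)
print(neighbours)
```

The ranking depends only on the vectors, so nothing from a lexicon can influence it unless the vectors themselves are changed.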


# Baseline

In [4]:
from lexsub import *
import os

In [5]:
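# Explicit imports, so this cell does not depend on names re-exported by `from lexsub import *`.
import math
import re
from copy import deepcopy

import numpy

def build_wvec_dict(file):
    """Map each word to its L2-normalised vector; `file` yields (word, vector) pairs."""
    wordVectors = {}
    for word, vec in file:
        wordVectors[word] = numpy.array(vec, dtype=float)
        # The small constant guards against division by zero for all-zero vectors.
        wordVectors[word] /= math.sqrt((wordVectors[word]**2).sum() + 1e-6)
    return wordVectors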

isNumber = re.compile(r"\d+.*")
def norm_word(word):
    if isNumber.search(word.lower()):
        return "---num---"
    elif re.sub(r"\W+", "", word) == "":
        return "---punc---"
    else:
        return word.lower()

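def build_lexicon_dict(filename):
    """Read a lexicon file: the first token on each line is a word, the rest are its neighbours."""
    lexicon = {}
    with open(filename, "r") as f:
        for line in f:
            words = line.lower().strip().split()
            lexicon[norm_word(words[0])] = [norm_word(word) for word in words[1:]]
    return lexicon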

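def retrofit_vector(wordVecs, lexicon, alpha, beta, iteration):
    """Pull each word vector towards its lexicon neighbours, anchored to its original value."""
    newwordVecs = deepcopy(wordVecs)
    wvVocab = set(wordVecs.keys())

    # Only words that have both a vector and a lexicon entry are updated.
    loopVocab = wvVocab.intersection(set(lexicon.keys()))
    for i in range(iteration):
        for word in loopVocab:
            wordNeighbours = set(lexicon[word]).intersection(wvVocab)
            num = len(wordNeighbours)

            # Words with no in-vocabulary neighbours keep their current vector.
            if num == 0:
                continue
            # alpha weights the original vector; beta weights each neighbour's current vector.
            numerator = alpha * wordVecs[word]
            for neighbour in wordNeighbours:
                numerator += beta * newwordVecs[neighbour]
            denominator = num * beta + alpha
            newwordVecs[word] = numerator / denominator
    return newwordVecs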

## Run baseline on dev

In [6]:
#lexsub = LexSub(os.path.join('../data','glove.6B.100d.magnitude'))
lexsub = LexSub(os.path.join('../data','glove.6B.100d.retrofit.magnitude'),10)
output = []
with open(os.path.join('../data','input','dev.txt')) as f:
    for line in f:
        fields = line.strip().split('\t')
        output.append(" ".join(lexsub.substitutes(int(fields[0].strip()), fields[1].strip().split())))
print("\n".join(output[:10]))

sides edge bottom part hand place close tip under below
sides edge bottom part hand place close tip under below
sides edge bottom part hand place close tip under below
sides edge bottom part hand place close tip under below
sides edge bottom part hand place close tip under below
sides edge bottom part hand place close tip under below
sides edge bottom part hand place close tip under below
sides edge bottom part hand place close tip under below
sides edge bottom part hand place close tip under below
sides edge bottom part hand place close tip under below


## Evaluate the baseline output

In [7]:
import sys
sys.path.append("..")
from lexsub_check import precision
with open(os.path.join('../data','reference','dev.out'), 'rt') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
print("Score={:.2f}".format(100*precision(ref_data, output)))

Score=53.02


## Analysis


The idea of our baseline is to retrofit the word vectors with a graph-based approach. We treat words as vertices and connect words that share a semantic relation. As the given graph shows, edges exist between semantically related words, and also between each word's vector in Q (the inferred word vector representations) and its counterpart in Q_hat (the observed, pre-trained word vector representations). Our objective is to keep Q close to Q_hat in vector space while also pulling together words that are linked in the semantic lexicon. To do so, we define the loss L(Q) as the weighted sum of squared Euclidean distances over all edges, and take its derivative with respect to each element of Q (q_1, ..., q_n) to obtain an update that we apply for a number of iterations.

$$
L(Q) = \sum_{i=1}^{n} \left[ \alpha_i \lVert q_i - \hat{q}_i \rVert^2 + \sum_{(i,j) \in E} \beta_{ij} \lVert q_i - q_j \rVert^2 \right]
$$
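
One way to see where the update rule comes from, as a sketch: hold every other vector fixed, attribute each edge (i, j) in E to q_i once, and set the gradient of L(Q) with respect to q_i to zero.

$$
\frac{\partial L}{\partial q_i} = 2\alpha_i (q_i - \hat{q}_i) + 2 \sum_{j:(i,j) \in E} \beta_{ij} (q_i - q_j) = 0
$$

$$
q_i = \frac{\alpha_i \hat{q}_i + \sum_{j:(i,j) \in E} \beta_{ij} q_j}{\alpha_i + \sum_{j:(i,j) \in E} \beta_{ij}}
$$

With a single alpha and a single beta shared by all words, as in retrofit_vector, the denominator reduces to alpha plus beta times the number of neighbours.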

    For iterations = 1 to T:
        For i = 1 to n:
            if q_i has no neighbours:
                continue
            q_i = (sum over j:(i,j) in E of (beta_ij * q_j) + alpha_i * q_hat_i) / (sum over j:(i,j) in E of (beta_ij) + alpha_i)

In each iteration, we visit every vector q_i and update its value with this formula. If a word (vertex) has no edges, we skip it.

This procedure is implemented in the retrofit_vector function above.
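
As a sanity check on the update, the following self-contained sketch runs one iteration on a three-word toy vocabulary. The `retrofit` function is a compact stand-in written for this example (it mirrors the logic of retrofit_vector but is not the notebook's code), and the vectors and lexicon are invented.

```python
from copy import deepcopy

import numpy as np

def retrofit(word_vecs, lexicon, alpha=1.0, beta=1.0, iterations=1):
    """Compact stand-in for the retrofitting update, for illustration only."""
    new_vecs = deepcopy(word_vecs)
    for _ in range(iterations):
        for word, entries in lexicon.items():
            # Keep only neighbours that actually have a vector.
            neighbours = set(entries) & word_vecs.keys()
            if word not in word_vecs or not neighbours:
                continue
            total = alpha * word_vecs[word]
            for n in neighbours:
                total = total + beta * new_vecs[n]
            new_vecs[word] = total / (alpha + beta * len(neighbours))
    return new_vecs

# Invented toy data: "side" is linked to "edge"; "club" has no lexicon entry.
toy_vecs = {
    "side": np.array([1.0, 0.0]),
    "edge": np.array([0.0, 1.0]),
    "club": np.array([1.0, 1.0]),
}
toy_lexicon = {"side": ["edge"]}

result = retrofit(toy_vecs, toy_lexicon)
print(result["side"])
```

With alpha = beta = 1 and one neighbour, the new vector for "side" is the midpoint of its original vector and the vector of "edge"; words without lexicon entries are left untouched.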



## Tune parameters

After implementing the baseline, we tried retrofitting the vectors with different lexicons. In our implementation, ppdb-xl.txt gave the highest score: 52.96 against dev.out.

We tried to tune three parameters: alpha, beta and the number of iterations.

When we changed alpha or beta to other values (e.g. alpha = 1, 1.5, or 0.5; beta = 2, 0.9, or 1.5), the scores were always lower than with the original combination (alpha = 1 and beta = 1).

When we tuned the number of iterations, the score went up with 15 iterations and went down with 20 and 25. After many tries, we settled on 15 iterations as the best value for our implementation.
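
A search like this can be organised as a small grid search. The sketch below is hypothetical: `grid_search` and `toy_score` are invented names, and `toy_score` is a stand-in that simply peaks at alpha = 1, beta = 1, 15 iterations. A real scorer would have to retrofit the vectors, rebuild the vector file, and evaluate precision on dev, none of which is shown here.

```python
from itertools import product

def grid_search(score_fn, alphas, betas, iteration_counts):
    """Try every (alpha, beta, iterations) combination and keep the best score."""
    best_score, best_params = float("-inf"), None
    for alpha, beta, iterations in product(alphas, betas, iteration_counts):
        score = score_fn(alpha, beta, iterations)
        if score > best_score:
            best_score, best_params = score, (alpha, beta, iterations)
    return best_params, best_score

# Stand-in scorer, invented for illustration; it does not evaluate any model.
def toy_score(alpha, beta, iterations):
    return -abs(alpha - 1.0) - abs(beta - 1.0) - abs(iterations - 15) / 10

best_params, best_score = grid_search(
    toy_score,
    alphas=[0.5, 1.0, 1.5],
    betas=[0.9, 1.0, 1.5, 2.0],
    iteration_counts=[10, 15, 20, 25],
)
print(best_params, best_score)
```

Passing the scorer in as a callable keeps the search loop independent of how the dev score is actually computed.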

With these settings (ppdb-xl.txt, alpha = 1, beta = 1, 15 iterations), we reached 53.02 against dev.out.

Tuning the three parameters (alpha, beta, and iterations) did not improve performance significantly: the score moved only from 52.96 to 53.02. We think other loss functions or other distance functions are worth trying to improve performance further.