# lexsub: default program

In [1]:
from default import *
import os

## Run the default solution on dev

In [2]:
lexsub = LexSub(os.path.join('data','glove.6B.100d.magnitude'))
output = []
with open(os.path.join('data','input','dev.txt')) as f:
    for line in f:
        fields = line.strip().split('\t')
        output.append(" ".join(lexsub.substitutes(int(fields[0].strip()), fields[1].strip().split())))
print("\n".join(output[:10]))

sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner
sides edge bottom front club line both back place corner


## Evaluate the default output

In [3]:
from lexsub_check import precision
with open(os.path.join('data','reference','dev.out'), 'rt') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
print("Score={:.2f}".format(100*precision(ref_data, output)))

Score=27.89


## Documentation

### Some explanation about our code

We implemented a script, ***retrofit.py***, to do the retrofitting. The script checks whether the retrofitted word vectors already exist, and ***lexsub.py*** calls it to obtain the latest word vectors and lexicon.
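
The existence check can be sketched as follows. This is only a minimal illustration of the idea, not the actual contents of ***retrofit.py***: the names `ensure_retrofitted` and `build_fn` are hypothetical, and `build_fn` stands in for whatever code writes the retrofitted vectors to disk.

```python
import os

def ensure_retrofitted(retrofit_path, build_fn):
    """Return retrofit_path, rebuilding the file only when it is missing.

    Hypothetical helper: build_fn(retrofit_path) is assumed to run the
    retrofitting and write the resulting vectors to retrofit_path.
    """
    if not os.path.exists(retrofit_path):
        build_fn(retrofit_path)
    return retrofit_path
```

With a check like this, repeated runs reuse the retrofitted file instead of repeating the iterations.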

The best score we obtained is ***52.97***. We tried several methods to improve on it, without success. Below we discuss the results of these experiments and why we think they did not help.

In [1]:
from lexsub import *
import os
from lexsub_check import precision

### Retrofit implementation

We implemented retrofitting following the code and the concepts provided on the websites. We also tested several combinations of T, alpha and beta, which are discussed in the Analysis part.

We use retrofitting to incorporate the information about word senses from ***ppdb-xl.txt*** into the default word vectors. After some tests, we found that retrofitting with ***ppdb-xl.txt*** gives the best result.

For this retrofitting we use T = 20 iterations, alpha = 2.0 and beta = 1.0.
We did not test further settings, since the results showed no significant improvement.
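
For reference, the per-word update applied in each iteration can be written as below. The notation is an assumption introduced here for illustration: $\hat{q}_i$ is the original vector of word $i$, $q_i$ its current retrofitted vector, and $N(i)$ the set of lexicon neighbours of $i$ that have a word vector. With beta = 1.0, as used here, the denominator reduces to $\alpha + |N(i)|$.

\begin{equation}
q_i \leftarrow \frac{\alpha \, \hat{q}_i + \beta \sum\nolimits_{j \in N(i)} q_j}{\alpha + \beta \, |N(i)|}
\end{equation}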

In [None]:
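import numpy as np

def retrofit(wvecs, lexicon, T = 20, alpha = 2.0, beta = 1.0):
    # Start from the original vectors; iterating wvecs yields (word, vector) pairs.
    newVecs = {}
    for word, vec in wvecs:
        newVecs[word] = vec
    vocab = set(newVecs)

    for i in range(T):
        for word in vocab:
            # Words without a lexicon entry keep their original vector.
            if word in lexicon:
                tmpVec = np.zeros(newVecs[word].shape)
                count = 0
                for w in lexicon[word]:
                    if w in newVecs:
                        tmpVec += beta * newVecs[w]
                        count += 1

                # Weighted average of the neighbours (weight beta each) and the
                # original vector (weight alpha). This equals dividing by
                # (count + alpha) when beta = 1.0, the value used here.
                newVecs[word] = (tmpVec + alpha * wvecs.query(word)) / (beta * count + alpha)

    return newVecs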
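
To see what the update in `retrofit` does, here is a small self-contained sketch on made-up two-dimensional vectors. Everything in it is an assumption for illustration: `retrofit_toy` works on a plain dict instead of the Magnitude object, and the toy lexicon simply declares "side" and "team" to be neighbours.

```python
import numpy as np

def retrofit_toy(vectors, lexicon, T=20, alpha=2.0, beta=1.0):
    """Retrofit a plain dict of vectors (a stand-in for the Magnitude object)."""
    new_vecs = {w: v.copy() for w, v in vectors.items()}
    for _ in range(T):
        for word in vectors:
            neighbors = [n for n in lexicon.get(word, []) if n in new_vecs]
            if not neighbors:
                continue  # no lexicon information: keep the current vector
            total = beta * sum(new_vecs[n] for n in neighbors)
            # Same weighted average as above; with beta = 1.0 the denominator
            # is simply len(neighbors) + alpha.
            new_vecs[word] = (total + alpha * vectors[word]) / (beta * len(neighbors) + alpha)
    return new_vecs

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors: "side" and "team" start out orthogonal.
vectors = {
    "side": np.array([1.0, 0.0]),
    "team": np.array([0.0, 1.0]),
    "edge": np.array([1.0, 0.2]),
}
lexicon = {"side": ["team"], "team": ["side"]}

retro = retrofit_toy(vectors, lexicon)
print("before:", cosine(vectors["side"], vectors["team"]))
print("after: ", cosine(retro["side"], retro["team"]))
```

The two lexicon neighbours are pulled towards each other while staying anchored to their original vectors through alpha, and "edge", which has no lexicon entry, is left unchanged.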

### Some output after retrofitting

In [2]:
lexsub = LexSub(os.path.join(os.path.dirname(os.getcwd()), 'data','glove.6B.100d.retrofit.magnitude'))
output = []
with open(os.path.join(os.path.dirname(os.getcwd()), 'data','input','dev.txt')) as f:
    for line in f:
        fields = line.strip().split('\t')
        output.append(" ".join(lexsub.substitutes(int(fields[0].strip()), fields[1].strip().split())))
print("\n".join(output[:10]))

sides bottom edge part hand place under close near along
sides bottom edge part hand place under close near along
sides bottom edge part hand place under close near along
sides bottom edge part hand place under close near along
sides bottom edge part hand place under close near along
sides bottom edge part hand place under close near along
sides bottom edge part hand place under close near along
sides bottom edge part hand place under close near along
sides bottom edge part hand place under close near along
sides bottom edge part hand place under close near along


In [3]:
with open(os.path.join(os.path.dirname(os.getcwd()), 'data','reference','dev.out'), 'rt') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
print("Score={:.2f}".format(100*precision(ref_data, output)))

Score=52.97


## Analysis

### Explanations about Analysis part

We tried two different methods to improve the performance:

1. Adjust iterations, alpha and beta
2. Incorporating context words

The first brings little improvement. The second actually makes the result worse.

### 1. Adjust iterations, alpha and beta

The best score we obtained is ***52.97***, with T = 20, alpha = 2.0 and beta = 1.0. The original settings were T = 10, alpha = 1.0 and beta = 1.0, which give ***52.91***, as shown below. We increased the number of iterations hoping that the word vectors would absorb more information from the lexicon. However, the score only increased by 0.06.

We think the improvement is small because the pre-trained word vectors are already well trained, so more iterations or adjustments to other parameters such as alpha do not change the original word vectors much.

In [2]:
lexsub = LexSub(os.path.join(os.path.dirname(os.getcwd()), 'data','glove.6B.100d.retrofit.magnitude'))
output = []
with open(os.path.join(os.path.dirname(os.getcwd()), 'data','input','dev.txt')) as f:
    for line in f:
        fields = line.strip().split('\t')
        output.append(" ".join(lexsub.substitutes(int(fields[0].strip()), fields[1].strip().split())))
print("\n".join(output[:10]))

sides edge bottom part hand place close tip under below
sides edge bottom part hand place close tip under below
sides edge bottom part hand place close tip under below
sides edge bottom part hand place close tip under below
sides edge bottom part hand place close tip under below
sides edge bottom part hand place close tip under below
sides edge bottom part hand place close tip under below
sides edge bottom part hand place close tip under below
sides edge bottom part hand place close tip under below
sides edge bottom part hand place close tip under below


In [3]:
with open(os.path.join(os.path.dirname(os.getcwd()), 'data','reference','dev.out'), 'rt') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
print("Score={:.2f}".format(100*precision(ref_data, output)))

Score=52.91


### 2. Incorporating context words

We use the ***Add*** measure described in the second paper on the website, which takes the arithmetic mean of the cosine similarities. The equation is given below, where C denotes the set of context words we use, s a candidate substitute word, c one of the context words, and t the target word.

\begin{equation}
\frac{\cos(s, t) + \sum\nolimits_{c \in C} \cos(s, c)}{|C| + 1}
\end{equation}
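
As a toy illustration of this formula, the sketch below scores two made-up candidates against a made-up target and context. The vectors and the names `add_score`, `target`, `context` and `candidates` are assumptions for the example and are unrelated to the real word vectors.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def add_score(sub_vec, target_vec, context_vecs):
    """Arithmetic mean of cos(s, t) and cos(s, c) for every context vector c."""
    total = cosine(sub_vec, target_vec) + sum(cosine(sub_vec, c) for c in context_vecs)
    return total / (len(context_vecs) + 1)

# Made-up 2-d vectors: candidate "a" is identical to the target,
# candidate "b" lies between the target and the context words.
target = np.array([1.0, 0.0])
context = [np.array([0.0, 1.0]), np.array([1.0, 1.0])]
candidates = {"a": np.array([1.0, 0.0]), "b": np.array([1.0, 1.0])}

scores = {name: add_score(vec, target, context) for name, vec in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Here the candidate closest to the target alone ("a") is outranked by the one that also fits the context ("b"), which is the effect the measure is meant to capture.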

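Below is the code we implemented for this equation. As context we take up to two words on each side of the target. The code skips the division by |C| + 1, since it is the same for every candidate within a sentence and does not change the ranking. We limit the number of candidate words to 50, since there are 400000 words in the vocabulary. If every word were treated as a candidate, most would be useless because they cannot plausibly be substitutes, and the program would take too long to run.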

However, the result is worse rather than better. The score is only ***25.54***, which is even lower than the default solution (27.89). After comparing our output with the reference answer (shown at the end of this notebook), we try to explain this result.

The limit on the number of candidate words means that ***team*** is never even considered as a substitute for ***side***: its cosine similarity to ***side*** is only about 0.5, which does not place it among the 50 most similar words in the word vectors. As a result, ***team*** cannot appear anywhere in the output.

After figuring this out, we tried removing the limit. However, the running time became so long that we could not obtain a result in a reasonable time, because the algorithm has to sort all words by their mean score for every sentence to get the top 10 guesses.

Still, we found that incorporating context words does change the 10 guesses, and we think that running the program with all 400000 words as candidates could give better results.
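
One possible way to make full-vocabulary scoring cheaper is sketched below. It is not part of the submitted code and has not been run on the real vectors. It assumes all word vectors fit in memory as a single NumPy matrix (`matrix`, one row per word, a hypothetical layout), and the function name `top_k_add` is made up for the example. The idea is to replace the full sort with one matrix-vector product followed by `np.argpartition`, so that only the best k scores are ever sorted.

```python
import numpy as np

def top_k_add(matrix, target_idx, context_idx, k=10):
    """Return the indices of the k rows with the highest Add score, best first."""
    # Normalise rows so that dot products are cosine similarities.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_idx = [target_idx] + list(context_idx)
    # Summing the unit query vectors first means a single matrix-vector
    # product gives each row's summed cosine to the target and context words.
    scores = unit @ unit[query_idx].sum(axis=0) / len(query_idx)
    scores[target_idx] = -np.inf  # never propose the target itself
    top = np.argpartition(scores, -k)[-k:]      # unordered top k
    return top[np.argsort(scores[top])[::-1]]   # sort only those k

# Random stand-in for the real vectors: 5000 "words" of dimension 100.
rng = np.random.default_rng(0)
vocab_vectors = rng.normal(size=(5000, 100))
print(top_k_add(vocab_vectors, target_idx=0, context_idx=[1, 2, 3, 4]))
```

Whether this would be fast enough on all 400000 words, and whether it would actually improve the score, remains untested.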

In [None]:
def substitutes(self, index, sentence):
    "Return ten guesses that are appropriate lexical substitutions for the word at sentence[index]."
    # Default solution, kept for reference:
    # return (list(map(lambda k: k[0], self.wvecs.most_similar(sentence[index], topn=self.topn))))

    # Context: up to two words on each side of the target (assumes numpy is imported as np).
    context = []
    for i in range(5):
        if (i != 2) and (index - 2 + i >= 0) and (index - 2 + i <= len(sentence) - 1):
            context.append(sentence[index - 2 + i])

    # Candidates: the 50 nearest neighbours of the target, scored initially by cos(s, t).
    candidate_sub = self.wvecs.most_similar(sentence[index], topn = 50)
    words = []
    nums = []
    for (word, num) in candidate_sub:
        words.append(word)
        nums.append(num)
    nums = np.array(nums)
    # Add cos(s, c) for every context word c.
    for w in context:
        tmp = np.array(self.wvecs.similarity(w, words))
        nums = np.add(nums, tmp)

    # Indices of the ten highest scores (argpartition does not order them).
    ind = np.argpartition(nums, -10)[-10:]
    return np.array(words)[ind]

In [4]:
lexsub = LexSub(os.path.join(os.path.dirname(os.getcwd()), 'data','glove.6B.100d.retrofit.magnitude'))
output = []
with open(os.path.join(os.path.dirname(os.getcwd()), 'data','input','dev.txt')) as f:
    for line in f:
        fields = line.strip().split('\t')
        output.append(" ".join(lexsub.substitutes(int(fields[0].strip()), fields[1].strip().split())))
print("\n".join(output[:10]))

behind on into through out further both under over during
during part out way both under further over through on
over further hand point during part on both through under
around into towards out during over through both under on
into across place under during towards part on both through
across into hand on during through out over both under
point further place over during part through on under both
towards between over on during into under through across both
towards between over on during into under through across both
around on across both during through under between into towards


In [7]:
with open(os.path.join(os.path.dirname(os.getcwd()), 'data','reference','dev.out'), 'rt') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
print("Score={:.2f}".format(100*precision(ref_data, output)))

Score=25.54


#### Output of the candidate words for the target word in the first sentence 

In [5]:
output[0]

'sides edge bottom part hand place close tip under below'

#### The substitutes provided by the human annotators for the given target word in the first sentence

In [8]:
ref_data[0].replace('\t', ' ')

'side.n team'