# A2: Vector Semantics

By Nikolai Ilinykh, Mehdi Ghanimifard, Wafia Adouane and Simon Dobnik. Updated in 2025 by Ricardo Muñoz Sánchez

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Write all your answers and the code in the appropriate boxes below.

In this lab we will look at how to build distributional semantic models from corpora and use the semantic similarity captured by these models to do semantic tasks. We are also going to examine how well different vector composition functions approximate the semantic similarity of phrases when compared to human judgements.

This lab uses code from a file called `dist_erk.py` which contains functions similar to those shown in the lecture. You can use either set of functions to solve these tasks.

In [1]:
# The code for dist_erk.py uses both spaCy and NLTK, so make sure to have them installed!
# Our code also uses SciPy and scikit-learn, so you'll need to install them as well.
# If you're unsure how to do this, check out these websites:
### https://scipy.org/beginner-install/
### https://scikit-learn.org/stable/install.html
### https://spacy.io/usage
### https://www.nltk.org/install.html


# We also need to make sure we have the necessary models and datasets for Spacy
import spacy
#spacy.cli.download('en_core_web_sm')

# You only need to run this cell once
# You *need* to restart the kernel after downloading the model!

In [2]:
# the following command simply imports all the methods from the dist_erk file
from dist_erk import *

## 1. Loading a corpus

To train a distributional model, we first need a sufficiently large collection of texts which contain different words used frequently enough in different contexts. Here we will use a section of the Wikipedia corpus `wikipedia.txt` stored in `wikipedia.zip`. This file has been borrowed from another lab by [Richard Johansson](http://www.cse.chalmers.se/~richajo/).

When unpacked, the file is 151 MB, so if you are using the MLT servers you should store it in a temporary folder outside your home directory and adjust the `corpus_dir` path below. It may already exist in `/srv/data/computational-semantics/`.

In [3]:
corpus_dir = 'wikipedia'

## 2. Building a model

Now you are ready to build the model.  
Using the methods from the code imported above, build three word matrices with 1000 dimensions as follows:  

(i) with raw counts (saved to a variable `space_1k`);  
(ii) with PPMI (`ppmispace_1k`);  
(iii) with dimensions reduced by SVD (`svdspace_1k`).  
For the latter use `svddim=5`. **[5 marks]**

Your task is to replace `...` with function calls to functions from `dist_erk.py` which are similar to functions shown during the lecture.

Do not despair if the code takes a bit long to run!
It took me about 9 minutes for the cell below.
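To make the PPMI and SVD steps more concrete, here is a minimal sketch on a tiny made-up count matrix. It is not the implementation in `dist_erk.py`: the helper names `ppmi` and `truncated_svd`, the choice of base-2 logarithm and the toy counts are all assumptions made for illustration.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI of a target-by-context count matrix (illustrative only)."""
    counts = np.asarray(counts, dtype=float)
    p_joint = counts / counts.sum()
    p_target = p_joint.sum(axis=1, keepdims=True)    # row marginals
    p_context = p_joint.sum(axis=0, keepdims=True)   # column marginals
    expected = p_target * p_context                  # probability under independence
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log2(p_joint / expected)
    pmi[~np.isfinite(pmi)] = 0.0                     # zero counts give -inf; treat as 0
    return np.maximum(pmi, 0.0)                      # keep only positive associations

def truncated_svd(matrix, k):
    """Project the rows of `matrix` onto its first k singular directions."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

# Made-up counts: 3 target words (rows) x 3 context words (columns)
toy_counts = np.array([[10, 0, 2],
                       [ 8, 1, 0],
                       [ 0, 6, 5]])
toy_ppmi = ppmi(toy_counts)
toy_svd = truncated_svd(toy_ppmi, k=2)
print(toy_ppmi.round(2))
print(toy_svd.shape)
```

The sketch shows the order of operations used in this lab: raw counts first, then PPMI reweighting, then dimensionality reduction applied to the PPMI matrix.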

In [4]:
numdims = 1000
svddim = 5

# Which words to use as targets and context words?
# We need to count the words and keep only the N most frequent ones
# Which function would you use here with which variable?
ktw = do_word_count(corpus_dir, numdims)

wi = make_word_index(ktw) # word index
words_in_order = sorted(wi.keys(), key=lambda w:wi[w]) # sorted words

# Create different spaces (the original matrix space, the ppmi space, the svd space)
# Which functions with which arguments would you use here?
print('create count matrices')
space_1k = make_space(corpus_dir, wi, numdims)
print('ppmi transform')
ppmispace_1k = ppmi_transform(space_1k, wi)
print('svd transform')
svdspace_1k = svd_transform(ppmispace_1k, numdims, svddim)
print('done.')

reading file wikipedia.txt
create count matrices
reading file wikipedia.txt


1145485it [01:40, 11427.26it/s]


ppmi transform
svd transform
done.


Comment:

For `words_in_order` we use `sorted`, but it may not be needed, as the function `do_word_count` already sorts the words using the method `most_common`.

In [5]:
# now, to test the space, you can print vector representation for some words
print('house:', space_1k['house'])

house: [2551 3714 3104  567  962  627  443  185  311  189  131   28   93  169
   81  125  151  408  194   89   79   29  217  184   62   15   31   70
   10    1   41   21    1   31   37    1   30    5   25    7    3   20
   11    1   32   36    2    5   65    4    0   46    8   18   28    0
   20    7    8   16   10   40    0  175   10    2    7   19    1  174
   11    3    1    6    0    0    0   10    9   11    7   24    4    4
   14   23   58    7    0   10    2    3   10    6   18    6   13    3
   22    0    3    5    3    7   14    3   40   20   19   15    6    8
   23    4    5    1   19    0    3    1    0   14    0   14   53    7
    7   11    6    5    5    4   12    6   53    1    1  433    4    0
    5    7    7   12    1    1    3    4   17    8   16    1    2   31
    1   12   14    1   44    6   14    9   38    7    2    6    8    1
   10    6   10    1    9    7    9    4    3    9    0   11    3    2
    0    2   11   37    2    0    2    1    5    9   10   16    4    6

The Oxford Advanced Dictionary has 185,000 words, hence 1,000 words is not representative. We trained a model with 10,000 words and 50 dimensions using truncated SVD. All matrices are available in the folder `pretrained` of the `wikipedia.zip` file. These are `ktw_wikipediaktw.npy`, `raw_wikipediaktw.npy`, `ppmi_wikipediaktw.npy`, `svd50_wikipedia10k.npy`. Make sure they are in your path as we load them below.

In [6]:
import numpy as np

numdims = 10000
svddim = 50

print('Please wait...')
ktw_10k       = np.load('wikipedia/pretrained/ktw_wikipediaktw.npy', allow_pickle=True)
space_10k     = np.load('wikipedia/pretrained/raw_wikipediaktw.npy', allow_pickle=True).tolist()
ppmispace_10k = np.load('wikipedia/pretrained/ppmi_wikipediaktw.npy', allow_pickle=True).tolist()
svdspace_10k  = np.load('wikipedia/pretrained/svd50_wikipedia10k.npy', allow_pickle=True).tolist()
print('Done.')


Please wait...
Done.


In [7]:
# testing semantic space
print('house:', space_10k['house'])

house: [2554 3774 3105 ...    0    0    0]


## 3. Testing semantic similarity

The file `similarity_judgements.txt` contains 7,576 pairs of words and their lexical and visual similarities (based on the pictures) collected through crowd-sourcing using Mechanical Turk as described in [1]. The scores range from 1 (highly dissimilar) to 5 (highly similar). Note: this is a different dataset from the phrase similarity dataset we discussed during the lecture [2]. You can find more details about how they were collected in the papers.

The following code will transform similarity scores into a Python-friendly format:

In [8]:
word_pairs = [] # test suite word pairs
semantic_similarity = []
visual_similarity = []
test_vocab = set()

for index, line in enumerate(open('similarity_judgements.txt')):
    data = line.strip().split('\t')
    if index > 0 and len(data) == 3:
        w1, w2 = tuple(data[0].split('#'))
        # Checks if both words from each pair exist in the word matrix.
        if w1 in ktw_10k and w2 in ktw_10k:
            word_pairs.append((w1, w2))
            test_vocab.update([w1, w2])
            semantic_similarity.append(float(data[1]))
            visual_similarity.append(float(data[2]))

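# Note: `ktw` is the 1,000-word vocabulary from In [4], so this counts the test words
# that are also in that smaller vocabulary; len(test_vocab) gives the count for `ktw_10k`.
print('number of available words to test:', len(test_vocab-(test_vocab-set(ktw))))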
print('number of available word pairs to test:', len(word_pairs))
#list(zip(word_pairs, visual_similarity, semantic_similarity))

number of available words to test: 12
number of available word pairs to test: 774


We are going to test how the cosine similarity between vectors of each of the three spaces (normal space, ppmi, svd) compares with the human similarity judgements for the words in the similarity dataset. Which of the three spaces best approximates human judgements?
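As a reminder of what cosine similarity measures, here is a minimal sketch on plain NumPy vectors. The function name `cosine_similarity` and the toy vectors are assumptions for illustration; the `cosine` function in `dist_erk.py` takes two words and a space instead.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    if norm == 0:
        return 0.0
    return float(np.dot(u, v) / norm)

a = [1, 2, 0]
b = [2, 4, 0]   # same direction as a, twice the length
c = [0, 0, 3]   # orthogonal to a

print(cosine_similarity(a, b))   # close to 1: direction matters, not magnitude
print(cosine_similarity(a, c))   # 0: no shared non-zero dimensions
```

Because only the direction counts, a frequent word and a rare word can still be highly similar if they occur in proportionally similar contexts.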

For comparison of several scores, we can use [the Spearman correlation coefficient](https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient) which is implemented in `scipy.stats.spearmanr` [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html). The values of the Spearman correlation coefficient range from -1 to 1, where 0 indicates no correlation, 1 perfect positive correlation and -1 perfect negative correlation. Hence, the greater the number, the better the similarity scores align. The p-value tells us whether the coefficient is statistically significant. For this to be the case, it must be below the conventional threshold, $p < 0.05$.

Here is how you can calculate the Spearman correlation coefficient between the scores of visual similarity and semantic similarity of the available words in the test suite:
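As a small illustration of what the coefficient captures, the sketch below uses made-up numbers in which one list is a monotonic but non-linear function of the other. Spearman's rho only looks at the ordering, so it is (numerically) 1, while Pearson's r is lower.

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]            # same ordering as x, but not a straight line

rho, p_rho = stats.spearmanr(x, y)
r, p_r = stats.pearsonr(x, y)

print('Spearman rho = {:.4f} (p = {:.4f})'.format(rho, p_rho))
print('Pearson r    = {:.4f} (p = {:.4f})'.format(r, p_r))
```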

In [9]:
from scipy import stats

rho, pval = stats.spearmanr(semantic_similarity, visual_similarity)
print("""Visual Similarity vs. Semantic Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))


Visual Similarity vs. Semantic Similarity:
rho     = 0.7122
p-value = 0.0000


Let's now calculate the cosine similarity scores of all word pairs in an ordered list using all three matrices. **[6 marks]**

In [10]:
raw_similarities  = [cosine(w1, w2, space_10k) for w1, w2 in word_pairs]
ppmi_similarities = [cosine(w1, w2, ppmispace_10k) for w1, w2 in word_pairs]
svd_similarities  = [cosine(w1, w2, svdspace_10k) for w1, w2 in word_pairs]

Calculate correlation coefficients between the lists of similarity scores and the real semantic similarity scores from the experiment. The scores of which model correlate best with them? Is this expected? **[6 marks]**

In [11]:
# your code should go here
# for raw similarities
rho_raw, pval_raw = stats.spearmanr(semantic_similarity, raw_similarities)
print("""Raw Similarity vs. Semantic Similarity:
\trho     = {:.4f}
\tp-value = {:.4f}""".format(rho_raw, pval_raw))
# for PPMI similarities
rho_ppmi, pval_ppmi = stats.spearmanr(semantic_similarity, ppmi_similarities)
print("""PPMI Similarity vs. Semantic Similarity:
\trho     = {:.4f}
\tp-value = {:.4f}""".format(rho_ppmi, pval_ppmi))
# for SVD similarities
rho_svd, pval_svd = stats.spearmanr(semantic_similarity, svd_similarities)
print("""SVD Similarity vs. Semantic Similarity:
\trho     = {:.4f}
\tp-value = {:.4f}""".format(rho_svd, pval_svd))

Raw Similarity vs. Semantic Similarity:
	rho     = 0.1522
	p-value = 0.0000
PPMI Similarity vs. Semantic Similarity:
	rho     = 0.4547
	p-value = 0.0000
SVD Similarity vs. Semantic Similarity:
	rho     = 0.4232
	p-value = 0.0000


**Your answer should go here:**

The scores of the PPMI model correlate best with the real semantic similarity scores. The PPMI model is indeed expected to be better than the raw one because it extracts more information (mutual information) from the counts. SVD is a reduction of the PPMI space, so some information is lost, but SVD is supposed to be better when there is a lot of noise in the data. Perhaps here, with relatively little data, the noise is already removed at the PPMI level, which could explain why the PPMI model performs best.

We can also calculate correlation coefficients between the lists of cosine similarity scores and the real visual similarity scores from the experiment. Which similarity model correlates best with them? How do the correlation coefficients compare with those from the previous comparison, and can you speculate why we get such results? **[7 marks]**

In [12]:
# Your code should go here...
# for raw similarities
rho_raw_vis, pval_raw_vis = stats.spearmanr(visual_similarity, raw_similarities)
print("""Raw Similarity vs. Visual Similarity:
\trho     = {:.4f}
\tp-value = {:.4f}""".format(rho_raw_vis, pval_raw_vis))
# for PPMI similarities
rho_ppmi_vis, pval_ppmi_vis = stats.spearmanr(visual_similarity, ppmi_similarities)
print("""PPMI Similarity vs. Visual Similarity:
\trho     = {:.4f}
\tp-value = {:.4f}""".format(rho_ppmi_vis, pval_ppmi_vis))
# for SVD similarities
rho_svd_vis, pval_svd_vis = stats.spearmanr(visual_similarity, svd_similarities)
print("""SVD Similarity vs. Visual Similarity:
\trho     = {:.4f}
\tp-value = {:.4f}""".format(rho_svd_vis, pval_svd_vis))

Raw Similarity vs. Visual Similarity:
	rho     = 0.1212
	p-value = 0.0007
PPMI Similarity vs. Visual Similarity:
	rho     = 0.3838
	p-value = 0.0000
SVD Similarity vs. Visual Similarity:
	rho     = 0.3097
	p-value = 0.0000


**Your answer should go here:**

For visual similarity, it is also the PPMI model that correlates best.

The correlation coefficients obtained with the real visual similarity scores are lower than those obtained with the real semantic similarity scores for every model. Note that, for each model, the coefficients for visual and semantic similarity are quite close, and the variation from model to model follows the same pattern for both. This may indicate a relationship between semantic and visual similarity.

## 4. Operations on similarities

We can perform mathematical operations on vectors to derive meaning predictions.

For example, we can perform `king - man` and add the resulting vector to `woman` and we hope to get the vector for `queen`. What would be the result of `stockholm - sweden + denmark`? Why? **[3 marks]**

If you want to learn more about vector differences between words (and words in analogy relations), check this paper [4].
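The arithmetic behind such analogies can be illustrated with a hand-made toy space. The three dimensions below (roughly "royal", "male", "female") and all values are invented for illustration; real distributional vectors are far noisier, so the result is rarely this clean.

```python
import numpy as np

# Hand-made toy vectors: [royal, male, female]
toy_space = {
    'king':  np.array([1.0, 1.0, 0.0]),
    'queen': np.array([1.0, 0.0, 1.0]),
    'man':   np.array([0.0, 1.0, 0.0]),
    'woman': np.array([0.0, 0.0, 1.0]),
}

def cos(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Remove the "male" component from king, then add the "female" component
target = toy_space['king'] - toy_space['man'] + toy_space['woman']

# Rank all words in the toy space by similarity to the resulting vector
ranked = sorted(toy_space, key=lambda w: cos(target, toy_space[w]), reverse=True)
print(target)
print(ranked)
```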

**Your answer should go here:**

`stockholm - sweden + denmark` should result in `copenhagen`, because the vector `stockholm - sweden` should represent the meaning of "capital city". Combined with `denmark`, it should therefore give the Danish capital.

In [13]:
vec1 = ppmispace_10k['king'] - ppmispace_10k['man'] + ppmispace_10k['woman']
vec2 = ppmispace_10k['queen']
print(ppmispace_10k['king'])
print(ppmispace_10k['man'])
print(ppmispace_10k['woman'])
print(ppmispace_10k['queen'])
print(vec1)
print(vec2)
print("Cosine similarity between 'queen' and 'king' - 'man' + 'woman':", np.sum(vec1 * vec2) / (veclen(vec1) * veclen(vec2))) #cosine similarity vec1 / vec2


vec3 = ppmispace_10k['stockholm'] - ppmispace_10k['sweden'] + ppmispace_10k['denmark']
vec4 = ppmispace_10k['copenhagen']
print(vec3)
print(vec4)
print("Cosine similarity between 'copenhagen' and 'stockholm' - 'sweden' + 'denmark':", np.sum(vec3 * vec4) / (veclen(vec3) * veclen(vec4))) #cosine similarity vec3 / vec4

[0.03552611 0.545724   1.0181931  ... 0.         0.         0.        ]
[0.         0.32761374 0.14667151 ... 0.         1.65460267 0.        ]
[0.         0.24733743 0.         ... 0.         0.         0.        ]
[0.07510489 0.53714324 0.81981624 ... 0.         0.         0.        ]
[ 0.03552611  0.4654477   0.87152159 ...  0.         -1.65460267
  0.        ]
[0.07510489 0.53714324 0.81981624 ... 0.         0.         0.        ]
Cosine similarity between 'queen' and 'king' - 'man' + 'woman': 0.24685292900604752
[ 0.77140793  0.         -0.30190277 ...  0.          0.
  0.        ]
[0.43039703 0.         0.3714493  ... 0.         0.         0.        ]
Cosine similarity between 'copenhagen' and 'stockholm' - 'sweden' + 'denmark': 0.10368265641034352


Here is some code that allows us to calculate such comparisons.

In [14]:
from scipy.spatial import distance

def normalize(vec):
    return vec / veclen(vec)

def find_similar_to(vec1, space):
    # vector similarity function
    #sim_fn = lambda a, b: 1-distance.euclidean(normalize(a), normalize(b))
    #sim_fn = lambda a, b: 1-distance.correlation(a, b)
    #sim_fn = lambda a, b: 1-distance.cityblock(normalize(a), normalize(b))
    #sim_fn = lambda a, b: 1-distance.chebyshev(normalize(a), normalize(b))
    #sim_fn = lambda a, b: np.dot(normalize(a), normalize(b))
    sim_fn = lambda a, b: 1-distance.cosine(a, b)

    sims = [
        (word2, sim_fn(vec1, space[word2]))
        for word2 in space.keys()
    ]
    return sorted(sims, key = lambda p:p[1], reverse=True)

Here is how you apply this code. Comment on the results you get. **[3 marks]**

In [15]:
short = normalize(svdspace_10k['short'])
light = normalize(svdspace_10k['light'])
long = normalize(svdspace_10k['long'])
heavy = normalize(svdspace_10k['heavy'])

find_similar_to(light - (heavy - long), svdspace_10k)[:10]

[('long', np.float64(0.8733111261346902)),
 ('above', np.float64(0.8259671977311956)),
 ('around', np.float64(0.8030776291120686)),
 ('sun', np.float64(0.7692439111243974)),
 ('just', np.float64(0.767848197477811)),
 ('wide', np.float64(0.7672574319922534)),
 ('each', np.float64(0.7665960260861158)),
 ('circle', np.float64(0.7647746702909335)),
 ('length', np.float64(0.7601066921319761)),
 ('almost', np.float64(0.7542351860536627))]

In [16]:
queen = normalize(ppmispace_10k['queen'])
king = normalize(ppmispace_10k['king'])
man = normalize(ppmispace_10k['man'])
woman = normalize(ppmispace_10k['woman'])
find_similar_to(king - man + woman, ppmispace_10k)[:10]


[('king', np.float64(0.6491443577785698)),
 ('woman', np.float64(0.5197809644507281)),
 ('queen', np.float64(0.24465515993898057)),
 ('prince', np.float64(0.2401281327079562)),
 ('emperor', np.float64(0.23675776346331323)),
 ('ii', np.float64(0.22459826786171)),
 ('son', np.float64(0.21607422141963617)),
 ('iii', np.float64(0.21333759857917067)),
 ('louis', np.float64(0.20969687973075513)),
 ('charles', np.float64(0.2058020637334398))]

In [17]:
denmark = normalize(ppmispace_10k['denmark'])
stockholm = normalize(ppmispace_10k['stockholm'])
sweden = normalize(ppmispace_10k['sweden'])
copenhagen = normalize(ppmispace_10k['copenhagen'])
find_similar_to(stockholm - sweden + denmark, ppmispace_10k)[:15]


[('stockholm', np.float64(0.6471741063767443)),
 ('denmark', np.float64(0.5270482981264609)),
 ('paris', np.float64(0.14192180704346102)),
 ('prague', np.float64(0.13330803087198895)),
 ('moscow', np.float64(0.12723873038417588)),
 ('london', np.float64(0.12179613535983802)),
 ('berlin', np.float64(0.11239954334145585)),
 ('munich', np.float64(0.11111614973189443)),
 ('oslo', np.float64(0.10943906436877393)),
 ('vienna', np.float64(0.10922784597519919)),
 ('copenhagen', np.float64(0.10702413228724272)),
 ('montreal', np.float64(0.10121429569043294)),
 ('philadelphia', np.float64(0.10036811707045457)),
 ('toronto', np.float64(0.09830080285008835)),
 ('1876', np.float64(0.09723346442225633))]

**Your answer should go here:**

For `light - heavy + long` we don't get the expected result 'short'. The model may be confused because 'light' and 'heavy' each have several meanings. Instead we get words associated with space, dimension or location (above, length, around). 'sun' is a surprising result here, but it might be related to the second meaning of 'light'.

For `stockholm - sweden + denmark`, 'copenhagen' appears late in the list of similar vectors (eleventh), but the list mainly contains large cities, many of them capitals, which shows that the model captures the relevant category.

For `king - man + woman`, 'queen' is the third result in the list, after the query words 'king' and 'woman', which is fine. The list also contains words related to rulers (emperor, prince, louis...).



Find 5 similar pairs of pairs of words and test them. Hint: google for `word analogies examples`. You can also construct analogies that are not only lexical but also express other relations such as grammatical relations, e.g. `see, saw, leave, ?`, or analogies that are based on world knowledge as in `questions-words.txt` from the [Google analogy dataset](http://download.tensorflow.org/data/questions-words.txt) described in [3]. Does the resulting vector similarity confirm your expectations? Remember you can only do this test if the words are contained in our vector space with 10,000 dimensions. **[10 marks]**
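In the outputs above, the query words themselves ('king', 'woman', 'stockholm', 'denmark') tend to rank highest. A common convention when evaluating analogies is to exclude the query words from the candidate list. The sketch below shows one way to do that; `find_similar_excluding` is a hypothetical helper, not part of `dist_erk.py`, and the toy vectors are invented.

```python
import numpy as np

def find_similar_excluding(query_vec, space, exclude=()):
    """Rank words by cosine similarity to query_vec, skipping words in `exclude`."""
    results = []
    q_norm = np.linalg.norm(query_vec)
    for word, vec in space.items():
        if word in exclude:
            continue
        denom = q_norm * np.linalg.norm(vec)
        sim = float(np.dot(query_vec, vec) / denom) if denom else 0.0
        results.append((word, sim))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Invented toy vectors, only to make the example runnable
toy_space = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'man':   np.array([0.1, 0.9, 0.1]),
    'woman': np.array([0.1, 0.1, 0.9]),
    'queen': np.array([0.8, 0.1, 0.8]),
    'apple': np.array([0.0, 0.2, 0.1]),
}
query = toy_space['king'] - toy_space['man'] + toy_space['woman']
top = find_similar_excluding(query, toy_space, exclude={'king', 'man', 'woman'})
print(top)
```

With the real spaces, the same idea would be applied by passing, for example, `ppmispace_10k` as `space` and the three query words as `exclude`.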

In [18]:
# Your code should go here...
see = normalize(svdspace_10k['see'])
saw = normalize(svdspace_10k['saw'])
leave = normalize(svdspace_10k['leave'])
left = normalize(svdspace_10k['left'])

find_similar_to(saw - (see - leave), svdspace_10k)[:15]

[('leave', np.float64(0.7717628100683235)),
 ('stay', np.float64(0.7701972693040484)),
 ('abandon', np.float64(0.7593780673146007)),
 ('move', np.float64(0.7573350540949787)),
 ('shut', np.float64(0.755817876374274)),
 ('resign', np.float64(0.7502871358861033)),
 ('return', np.float64(0.7482934387903277)),
 ('meet', np.float64(0.7471493847747047)),
 ('put', np.float64(0.7469297945261502)),
 ('lose', np.float64(0.742779025865288)),
 ('stand', np.float64(0.741304842113655)),
 ('break', np.float64(0.7398152892159425)),
 ('bring', np.float64(0.7334528313285422)),
 ('fight', np.float64(0.7301413971995372)),
 ('lay', np.float64(0.7285960910279293))]

In [19]:
# Your code should go here...
go = normalize(svdspace_10k['go'])
went = normalize(svdspace_10k['went'])
work = normalize(svdspace_10k['work'])
worked = normalize(svdspace_10k['worked'])

find_similar_to(went - (go - work), svdspace_10k)[:15]

[('works', np.float64(0.7682132682899971)),
 ('career', np.float64(0.7647854220454858)),
 ('work', np.float64(0.7363374544705507)),
 ('worked', np.float64(0.7356698664459044)),
 ('appeared', np.float64(0.7315200520007924)),
 ('began', np.float64(0.7243143729711581)),
 ('wrote', np.float64(0.7186745926162428)),
 ('became', np.float64(0.7167849898531065)),
 ('went', np.float64(0.7120800159218981)),
 ('started', np.float64(0.7055818131919861)),
 ('met', np.float64(0.6950394491706384)),
 ('role', np.float64(0.6916456155990831)),
 ('took', np.float64(0.6882249479502374)),
 ('era', np.float64(0.6840303739142773)),
 ('followed', np.float64(0.6812720145732776))]

About the grammatical relation between infinitive and preterite forms: after the first test we initially concluded that the model cannot perform well because of stemming, as mentioned in a comment in the library `dist_erk.py`. However, looking at the code in more detail, we noticed that our models do not actually work with stems only, which confused us. After some discussion among ourselves and with Mattias, we decided to run the same test with other words that are not ambiguous. The second test shows that the model actually performs well, as the expected result 'worked' is returned (in fourth position). In conclusion, in the first example the model was most likely confused by the different meanings of 'saw' and 'left'.


In [20]:
# Your code should go here...
safe = normalize(svdspace_10k['safe'])
safely = normalize(svdspace_10k['safely'])
slow = normalize(svdspace_10k['slow'])
slowly = normalize(svdspace_10k['slowly'])

find_similar_to(safely - (safe - slow), svdspace_10k)[:10]

[('safely', np.float64(0.8367927222953683)),
 ('drop', np.float64(0.7724152355781814)),
 ('slowly', np.float64(0.7584178211700946)),
 ('sink', np.float64(0.7328318296579761)),
 ('leg', np.float64(0.7315775451698037)),
 ('plane', np.float64(0.7278822154760428)),
 ('intentionally', np.float64(0.7276022163631448)),
 ('suddenly', np.float64(0.7255999410004005)),
 ('fires', np.float64(0.7172154718762171)),
 ('discharge', np.float64(0.716919892696525))]

About the adjective/adverb relation: the model performs quite well, as we get the expected result 'slowly' in third position.

In [21]:
# Your code should go here...
clear = normalize(svdspace_10k['clear'])
unclear = normalize(svdspace_10k['unclear'])
possible = normalize(svdspace_10k['possible'])
impossible = normalize(svdspace_10k['impossible'])

find_similar_to(unclear - (clear - possible), svdspace_10k)[:15]

[('unclear', np.float64(0.8019858714685726)),
 ('unlikely', np.float64(0.7635318354914318)),
 ('likely', np.float64(0.7587057038069621)),
 ('uncertain', np.float64(0.7531067951102806)),
 ('fatal', np.float64(0.7472101357720266)),
 ('possible', np.float64(0.737763108554002)),
 ('inevitable', np.float64(0.7344989416834508)),
 ('probable', np.float64(0.7323442226054725)),
 ('incorrect', np.float64(0.726845381672453)),
 ('autism', np.float64(0.7265489567003465)),
 ('worse', np.float64(0.7193373085385575)),
 ('beneficial', np.float64(0.7148626113260488)),
 ('obvious', np.float64(0.7055427617364421)),
 ('documented', np.float64(0.7026494233525161)),
 ('hiv', np.float64(0.7025376156186143))]

In [22]:
# Your code should go here...
clear = normalize(svdspace_10k['clear'])
unclear = normalize(svdspace_10k['unclear'])
certain = normalize(svdspace_10k['certain'])
uncertain = normalize(svdspace_10k['uncertain'])

find_similar_to(unclear - (clear - certain), svdspace_10k)[:15]

[('unclear', np.float64(0.7417766270923434)),
 ('certain', np.float64(0.740289056174162)),
 ('uncertain', np.float64(0.7372605182197638)),
 ('exist', np.float64(0.7300752063052249)),
 ('differing', np.float64(0.7229690702234336)),
 ('related', np.float64(0.7123033756302716)),
 ('common', np.float64(0.7099805095087283)),
 ('associated', np.float64(0.7079283489014209)),
 ('beneficial', np.float64(0.7066936210552571)),
 ('everyday', np.float64(0.7005789816594914)),
 ('unrelated', np.float64(0.6999577323586899)),
 ('vary', np.float64(0.6993012080198612)),
 ('arise', np.float64(0.6986075346912508)),
 ('these', np.float64(0.698591241492787)),
 ('autism', np.float64(0.6981623819043589))]

About adjective antonyms: testing different prefixes shows that the model is better at recognizing antonyms built with the same prefix than antonyms built with different prefixes ('impossible' is not obtained from 'clear' and 'unclear', whereas 'uncertain' is obtained from the same word pair).

In [23]:
# Your code should go here...
bad = normalize(svdspace_10k['bad'])
worse = normalize(svdspace_10k['worse'])
big = normalize(svdspace_10k['big'])
bigger = normalize(svdspace_10k['bigger'])

find_similar_to(worse - (bad - big), svdspace_10k)[:15]

[('bigger', np.float64(0.7914575619512246)),
 ('worse', np.float64(0.7553937803306197)),
 ('looking', np.float64(0.7269805214567301)),
 ('aging', np.float64(0.7239652963405067)),
 ('grows', np.float64(0.7234485733586011)),
 ('sharp', np.float64(0.7213339295938627)),
 ('big', np.float64(0.7208705963298278)),
 ('rising', np.float64(0.7164153233012861)),
 ('riding', np.float64(0.7144412898508123)),
 ('unfortunately', np.float64(0.711821020483255)),
 ('interestingly', np.float64(0.7098070541244442)),
 ('bug', np.float64(0.7094753414147278)),
 ('dropping', np.float64(0.7082867934036383)),
 ('gets', np.float64(0.7060384528744815)),
 ('broken', np.float64(0.7036461939988118))]

About positive/comparative forms of adjectives: the model performs as expected, 'bigger' being the first result in the list.

In [24]:
# Your code should go here...
mouse = normalize(svdspace_10k['mouse'])
mice = normalize(svdspace_10k['mice'])
bird = normalize(svdspace_10k['bird'])
birds = normalize(svdspace_10k['birds'])

find_similar_to(mice - (mouse - bird), svdspace_10k)[:15]

[('insects', np.float64(0.8239032697493532)),
 ('rats', np.float64(0.8077913450941908)),
 ('birds', np.float64(0.8038784714162265)),
 ('ants', np.float64(0.7899042170335102)),
 ('whales', np.float64(0.7863088380070324)),
 ('animals', np.float64(0.7778530868904784)),
 ('bird', np.float64(0.7762879802801479)),
 ('mice', np.float64(0.7753622048372169)),
 ('mammals', np.float64(0.771623363038249)),
 ('whale', np.float64(0.7510836335792803)),
 ('bees', np.float64(0.7477381760454581)),
 ('bats', np.float64(0.7453179443078095)),
 ('vegetation', np.float64(0.7293494912806123)),
 ('dogs', np.float64(0.7217067271991996)),
 ('pigs', np.float64(0.7152532509202817))]

About singular/plural forms of nouns: we get the expected result 'birds' as the third result, which is rather good. We can also notice that most of the returned words are animals in the plural form.

General comment: Mathematical operations on vectors for testing meaning predictions are not perfect, but considering the size of our model we can still say it performs quite well, as we see some consistency between the expectations and the results.

## 5. Semantic composition and phrase similarity **[20 marks]**

In this task, we are going to examine how the vectors of phrases composed by the different semantic composition functions/models introduced in [2] correlate with human judgements of similarity between phrases. We will use the dataset from this paper, which is stored in `mitchell_lapata_acl08.txt`. If you are interested in further details about this task, refer to that paper as well.

(i) Process the dataset. The dataset contains human judgements of similarity between phrases recorded one per line. The first column indicates the id of a participant making a judgement (`participant`), the next column is `verb`, followed by `noun` and `landmark`. From these three columns we can construct phrases that were compared by human informants, namely `verb noun` vs `verb landmark`. The next column `input` indicates a similarity score a participant assigned to a pair of such phrases on a scale from 1 to 7 where 1 is lowest and 7 is highest. The last column `hilo` groups the phrases into two sets: phrases where we expect low and phrases where we expect high similarity scores. This is because we want to test our compositional functions on two tasks and examine whether a function is discriminative between them. Correlation between scores could also be due to other reasons than semantic similarity and hence good prediction on both tasks simultaneously shows that a function is truly discriminating the phrases using some semantic criteria.

For extracting information you can use the code from the lecture to start with. How to structure this data is up to you - a dictionary-like format would be a good choice. Remember that each example was judged by several participants and phrases will repeat in the dataset. Therefore, you have to collect all judgments for a particular set of phrases and average them. This will become useful in step (iii).
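The averaging described above could be sketched as follows. The lines in `toy_lines` are invented (they only mimic the column layout described in step (i)), and the `high`/`low` labels are placeholders for whatever values the `hilo` column actually contains.

```python
from collections import defaultdict

# Invented lines that follow the column layout: participant verb noun landmark input hilo
toy_lines = [
    "participant verb noun landmark input hilo",
    "p1 stray thought roam 7 high",
    "p2 stray thought roam 5 high",
    "p1 stray thought depart 2 low",
]

ratings = defaultdict(list)   # (verb, noun, landmark) -> list of individual ratings
group = {}                    # (verb, noun, landmark) -> hilo label

for line in toy_lines[1:]:    # skip the header line
    _, verb, noun, landmark, rating, hilo = line.split()
    key = (verb, noun, landmark)
    ratings[key].append(int(rating))
    group[key] = hilo

# One averaged score per phrase pair
averages = {key: sum(vals) / len(vals) for key, vals in ratings.items()}
print(averages)
print(group)
```

Keeping the `hilo` label in a separate dictionary with the same keys makes it easy to split the averaged scores into the two tasks later.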

(ii) Compose the vectors of the extracted word pairs by testing different compositional functions. In the lecture we introduced simple additive, simple multiplicative and combined models (details are described in [2]). Your task is to take a pair of phrases, e.g. the first example in the dataset `stray thought` and `stray roam` and for each phrase compute a composition of the vectors of their words using these functions, using one function per experiment run. For each phrase you will get a single vector. You can encode the words with any vector space introduced earlier (standard space, ppmi or svd) but your code should be structured in a way that it will be easy to switch between them. Finally, take the resulting (composed) vectors of phrase pairs in the dataset and calculate a cosine similarity between them.
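One possible way to organise step (ii) is sketched below. The function names, the dictionary `COMPOSERS`, the weights in `combined` and the toy vectors are all assumptions made for illustration; the combined model is written here as a weighted sum plus a weighted element-wise product, and the exact formulation and weights should be checked against [2].

```python
import numpy as np

def additive(u, v):
    return u + v

def multiplicative(u, v):
    return u * v                      # element-wise product

def combined(u, v, alpha=1.0, beta=1.0, gamma=1.0):
    # Placeholder weights; see [2] for how the weights are chosen there
    return alpha * u + beta * v + gamma * (u * v)

COMPOSERS = {'additive': additive, 'multiplicative': multiplicative, 'combined': combined}

def phrase_similarity(phrase_a, phrase_b, space, compose):
    """Cosine between the composed vectors of two two-word phrases."""
    p1 = compose(space[phrase_a[0]], space[phrase_a[1]])
    p2 = compose(space[phrase_b[0]], space[phrase_b[1]])
    denom = np.linalg.norm(p1) * np.linalg.norm(p2)
    return float(np.dot(p1, p2) / denom) if denom else 0.0

# Invented toy vectors; in the lab, `space` would be one of the 10k spaces
toy_space = {
    'stray':   np.array([1.0, 2.0, 0.0]),
    'thought': np.array([0.0, 1.0, 3.0]),
    'roam':    np.array([2.0, 1.0, 1.0]),
}
sims = {
    name: phrase_similarity(('stray', 'thought'), ('stray', 'roam'), toy_space, fn)
    for name, fn in COMPOSERS.items()
}
print(sims)
```

Passing both the space and the composition function as arguments keeps the experiment loop the same when switching between the raw, PPMI and SVD spaces.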

(iii) Now you have cosine similarity scores between vectors of phrases, but how do they compare with the average human scores that you calculated from the individual judgements from the `input` column of the dataset for the same phrases? Calculate the Spearman rank correlation coefficient between the two lists of scores for both the `high` and the `low` task.

We use the Spearman rank correlation coefficient (or Spearman's rho) rather than Pearson's correlation coefficient because we cannot compare cosine scores with human judgements directly. Cosine is a continuous measure and human judgements are expressed as ranks. Also, we cannot say whether a difference from 0.28 to 1 in cosine means the same as a difference from 6 to 7 in the human scores. The Spearman rank correlation coefficient first turns the scores for all examples within each group into ranks, and then these ranks are correlated (or approximated to a linear function).
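The rank-then-correlate idea can be checked on a few made-up numbers: converting both lists to ranks with `scipy.stats.rankdata` and computing Pearson's r on the ranks gives the same value as `scipy.stats.spearmanr` on the original scores. The scores below are invented and include one tie.

```python
from scipy import stats

cosine_scores = [0.28, 0.91, 0.55, 0.40, 0.73]   # made-up model scores
human_scores = [2.0, 6.5, 3.5, 3.5, 5.0]         # made-up averaged ratings (one tie)

rho, _ = stats.spearmanr(cosine_scores, human_scores)

# Ties receive the average of the ranks they span
ranks_cos = stats.rankdata(cosine_scores)
ranks_hum = stats.rankdata(human_scores)
r_on_ranks, _ = stats.pearsonr(ranks_cos, ranks_hum)

print(ranks_cos, ranks_hum)
print(rho, r_on_ranks)
```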

In the end you should get a table similar to the one below from the paper. What is the best compositional function from those that you evaluated with your vector spaces and why?

<img src="res.png" alt="drawing" width="500"/>

Note that you might not get results in the same range as those in the paper.
That is ok, a good interpretation of results and discussion why sometimes they are not as good as you would expect is better than giving the best performing results with little to no analysis.


In [25]:
# (i) - Process the data
# your code should go here
## this part of code is from the example provided
# load the task dataset
with open('mitchell_lapata_acl08.txt', 'r') as f:
    phrase_dataset = f.read().splitlines()

for line in phrase_dataset[:10]:
    print(line)

# get all unique words
words = []
for line in phrase_dataset[1:]:
    _, verb, noun, landmark, _, _ = line.split()
    if verb not in words:
        words.append(verb)
    if noun not in words:
        words.append(noun)
    if landmark not in words:
        words.append(landmark)

# are there any words in the task dataset which do not appear in the reference corpus?
to_remove = []
for w in words:
    if w not in svdspace_10k.keys():
        print(w)
        to_remove.append(w)
# if some words are not found in the reference corpus, it makes sense to ignore whole phrases containing such words

# how many words does our task dataset have in total? and how many remain after removing the missing ones?
print(len(words), len(words) - len(to_remove))

# pre-processing the task dataset
# we are removing all phrases which contain words which are not in the reference corpus
preprocessed_phrase_dataset = []
for line in phrase_dataset:
    _, verb, noun, landmark, _, _ = line.split()
    if verb in to_remove or noun in to_remove or landmark in to_remove:
        continue
    preprocessed_phrase_dataset.append(line)

target_words = []
for line in preprocessed_phrase_dataset[1:]:
    _, verb, noun, landmark, _, _ = line.split()
    if verb not in target_words:
        target_words.append(verb)
    if noun not in target_words:
        target_words.append(noun)
    if landmark not in target_words:
        target_words.append(landmark)
# how many words do we have after pre-processing
len(target_words)

## this part is our code
# get average human judgements for each set of phrases
processed_phrase_dataset = {}
temp_phrase_dataset = {}
for line in preprocessed_phrase_dataset[1:]:
    # `score` is the value of the `input` column (named differently to avoid shadowing the built-in input())
    participant, verb, noun, landmark, score, hilo = line.split()
    key = (verb, noun, landmark)
    if key not in temp_phrase_dataset:
        temp_phrase_dataset[key] = {'input': [], 'hilo': hilo}
    temp_phrase_dataset[key]['input'].append(int(score))
print(temp_phrase_dataset)
# average the individual judgements for each (verb, noun, landmark) triple
for key, entry in temp_phrase_dataset.items():
    average_input = sum(entry['input']) / len(entry['input'])
    processed_phrase_dataset[key] = {'average input': average_input, 'hilo': entry['hilo']}
print(processed_phrase_dataset)

participant verb noun landmark input hilo
participant20 stray thought roam 7 low
participant20 stray discussion digress 6 high
participant20 stray eye roam 7 high
participant20 stray child digress 1 low
participant20 throb body pulse 5 high
participant20 throb head shudder 2 low
participant20 throb voice shudder 3 low
participant20 throb vein pulse 6 high
participant20 chatter machine click 4 high
stray
roam
digress
throb
pulse
shudder
vein
chatter
gabble
tooth
rebound
ricochet
optimism
flicker
waver
flick
subside
lessen
symptom
slump
slouch
stoop
erupt
burst
temper
flare
recoil
flinch
prosper
fluctuate
falter
cigarette
reel
whirl
stagger
glow
cigar
95 58
{('bow', 'butler', 'submit'): {'input': [3, 2, 2, 3, 4, 5, 4, 2, 2, 2, 3, 5, 4, 2, 2, 4, 6, 3, 7, 1, 2, 2, 5, 2, 2, 1, 2, 3, 3, 1, 1, 1, 4, 5], 'hilo': 'low'}, ('bow', 'company', 'submit'): {'input': [5, 6, 6, 4, 6, 5, 5, 4, 6, 2, 2, 5, 6, 6, 2, 4, 4, 6, 7, 6, 6, 2, 3, 3, 2, 4, 2, 5, 6, 2, 5, 2, 5, 3], 'hilo': 'high'}, ('boom', 'sale'

In [26]:
# (ii) - Compose the vectors of the extracted word pairs by testing different compositional functions
# your code should go here

# simple additive compositional function
def add_comp_function(word1, word2, space, alpha, beta, gamma):
    return space[word1] + space[word2]

# simple multiplicative compositional function
def mult_comp_function(word1, word2, space, alpha, beta, gamma):
    return space[word1] * space[word2]

# combined model for compositional function
def comb_comp_function(word1, word2, space, alpha, beta, gamma):
    return alpha * space[word1] + beta * space[word2] + gamma * space[word1] * space[word2]

# function to compose the vectors, with the compositional function as a parameter
def compose_vectors(comp_function, space, alpha, beta, gamma):
    similarities = {}
    for (verb, noun, landmark) in processed_phrase_dataset.keys():
        similarities[(verb, noun, landmark)] = {}
        vec_comp_noun = comp_function(verb, noun, space, alpha, beta, gamma)
        vec_comp_landmark = comp_function(verb, landmark, space, alpha, beta, gamma)
        similarities[(verb, noun, landmark)]['average input'] = processed_phrase_dataset[(verb, noun, landmark)]['average input']
        similarities[(verb, noun, landmark)]['hilo'] = processed_phrase_dataset[(verb, noun, landmark)]['hilo']
        similarities[(verb, noun, landmark)]['comp vector noun'] = vec_comp_noun
        similarities[(verb, noun, landmark)]['comp vector landmark'] = vec_comp_landmark
        similarities[(verb, noun, landmark)]['comp vectors cosine similarity'] = np.sum(vec_comp_noun * vec_comp_landmark) / (veclen(vec_comp_noun) * veclen(vec_comp_landmark))
    return similarities

# composing the vectors using the simple additive compositional function and the svdspace_10k
processed_phrase_dataset_add = compose_vectors(add_comp_function, svdspace_10k, 0, 0, 0)

# composing the vectors using the simple multiplicative compositional function and the svdspace_10k
processed_phrase_dataset_mult = compose_vectors(mult_comp_function, svdspace_10k, 0, 0, 0)

# composing the vectors using the combined compositional function with the same parameters as in the paper and the svdspace_10k
processed_phrase_dataset_comb = compose_vectors(comb_comp_function, svdspace_10k, 0.95, 0, 0.05)

# composing the vectors using the combined compositional function with the values of alpha and beta switched (as we build a VP and not a sentence in our case) and the svdspace_10k
processed_phrase_dataset_comb_rev = compose_vectors(comb_comp_function, svdspace_10k, 0, 0.95, 0.05)

In [27]:
# (iii) - Compare the cosine similarity scores between vectors of phrases with the average human scores
# your code should go here

sorted_keys = sorted(list(processed_phrase_dataset.keys()))

# building the lists of human scores for each task
average_human_similarity_high = [processed_phrase_dataset[(verb, noun, landmark)]['average input']
                                 for (verb, noun, landmark) in sorted_keys if processed_phrase_dataset[(verb, noun, landmark)]['hilo'] == 'high']
average_human_similarity_low = [processed_phrase_dataset[(verb, noun, landmark)]['average input']
                                 for (verb, noun, landmark) in sorted_keys if processed_phrase_dataset[(verb, noun, landmark)]['hilo'] == 'low']

# Using simple additive compositional function to build the list of similarities for each task

comp_vectors_similarity_add_high = [processed_phrase_dataset_add[(verb, noun, landmark)]['comp vectors cosine similarity']
                                 for (verb, noun, landmark) in sorted_keys if processed_phrase_dataset_add[(verb, noun, landmark)]['hilo'] == 'high']
comp_vectors_similarity_add_low = [processed_phrase_dataset_add[(verb, noun, landmark)]['comp vectors cosine similarity']
                                 for (verb, noun, landmark) in sorted_keys if processed_phrase_dataset_add[(verb, noun, landmark)]['hilo'] == 'low']

rho_add_high, pval_add_high = stats.spearmanr(average_human_similarity_high, comp_vectors_similarity_add_high)
print('For the simple additive function')
print("""Cosine Similarity vs. Average human score for the high task:
\trho     = {:.4f}
\tp-value = {:.4f}""".format(rho_add_high, pval_add_high))

rho_add_low, pval_add_low = stats.spearmanr(average_human_similarity_low, comp_vectors_similarity_add_low)
print("""Cosine Similarity vs. Average human score for the low task:
\trho     = {:.4f}
\tp-value = {:.4f}""".format(rho_add_low, pval_add_low))

# Using simple multiplicative compositional function to build the list of similarities for each task

comp_vectors_similarity_mult_high = [processed_phrase_dataset_mult[(verb, noun, landmark)]['comp vectors cosine similarity']
                                 for (verb, noun, landmark) in sorted_keys if processed_phrase_dataset_mult[(verb, noun, landmark)]['hilo'] == 'high']
comp_vectors_similarity_mult_low = [processed_phrase_dataset_mult[(verb, noun, landmark)]['comp vectors cosine similarity']
                                 for (verb, noun, landmark) in sorted_keys if processed_phrase_dataset_mult[(verb, noun, landmark)]['hilo'] == 'low']

rho_mult_high, pval_mult_high = stats.spearmanr(average_human_similarity_high, comp_vectors_similarity_mult_high)
print('For the simple multiplicative function')
print("""Cosine Similarity vs. Average human score for the high task:
\trho     = {:.4f}
\tp-value = {:.4f}""".format(rho_mult_high, pval_mult_high))

rho_mult_low, pval_mult_low = stats.spearmanr(average_human_similarity_low, comp_vectors_similarity_mult_low)
print("""Cosine Similarity vs. Average human score for the low task:
\trho     = {:.4f}
\tp-value = {:.4f}""".format(rho_mult_low, pval_mult_low))

# Using combined compositional function with parameters as in the paper to build the list of similarities for each task

comp_vectors_similarity_comb_high = [processed_phrase_dataset_comb[(verb, noun, landmark)]['comp vectors cosine similarity']
                                 for (verb, noun, landmark) in sorted_keys if processed_phrase_dataset_comb[(verb, noun, landmark)]['hilo'] == 'high']
comp_vectors_similarity_comb_low = [processed_phrase_dataset_comb[(verb, noun, landmark)]['comp vectors cosine similarity']
                                 for (verb, noun, landmark) in sorted_keys if processed_phrase_dataset_comb[(verb, noun, landmark)]['hilo'] == 'low']

rho_comb_high, pval_comb_high = stats.spearmanr(average_human_similarity_high, comp_vectors_similarity_comb_high)
print('For the combined function with parameters as in the paper')
print("""Cosine Similarity vs. Average human score for the high task:
\trho     = {:.4f}
\tp-value = {:.4f}""".format(rho_comb_high, pval_comb_high))

rho_comb_low, pval_comb_low = stats.spearmanr(average_human_similarity_low, comp_vectors_similarity_comb_low)
print("""Cosine Similarity vs. Average human score for the low task:
\trho     = {:.4f}
\tp-value = {:.4f}""".format(rho_comb_low, pval_comb_low))

# Using combined compositional function with parameters alpha and beta switched compared to the paper to build the list of similarities for each task

comp_vectors_similarity_comb_rev_high = [processed_phrase_dataset_comb_rev[(verb, noun, landmark)]['comp vectors cosine similarity']
                                 for (verb, noun, landmark) in sorted_keys if processed_phrase_dataset_comb_rev[(verb, noun, landmark)]['hilo'] == 'high']
comp_vectors_similarity_comb_rev_low = [processed_phrase_dataset_comb_rev[(verb, noun, landmark)]['comp vectors cosine similarity']
                                 for (verb, noun, landmark) in sorted_keys if processed_phrase_dataset_comb_rev[(verb, noun, landmark)]['hilo'] == 'low']

rho_comb_rev_high, pval_comb_rev_high = stats.spearmanr(average_human_similarity_high, comp_vectors_similarity_comb_rev_high)
print('For the combined function with parameters alpha and beta switched compared to the paper')
print("""Cosine Similarity vs. Average human score for the high task:
\trho     = {:.4f}
\tp-value = {:.4f}""".format(rho_comb_rev_high, pval_comb_rev_high))

rho_comb_rev_low, pval_comb_rev_low = stats.spearmanr(average_human_similarity_low, comp_vectors_similarity_comb_rev_low)
print("""Cosine Similarity vs. Average human score for the low task:
\trho     = {:.4f}
\tp-value = {:.4f}""".format(rho_comb_rev_low, pval_comb_rev_low))

For the simple additive function
Cosine Similarity vs. Average human score for the high task:
	rho     = 1.0000
	p-value = 0.0000
Cosine Similarity vs. Average human score for the low task:
	rho     = -0.6000
	p-value = 0.4000
For the simple multiplicative function
Cosine Similarity vs. Average human score for the high task:
	rho     = 1.0000
	p-value = 0.0000
Cosine Similarity vs. Average human score for the low task:
	rho     = -0.8000
	p-value = 0.2000
For the combined function with parameters as in the paper
Cosine Similarity vs. Average human score for the high task:
	rho     = 1.0000
	p-value = 0.0000
Cosine Similarity vs. Average human score for the low task:
	rho     = -0.4000
	p-value = 0.6000
For the combined function with parameters alpha and beta switched compared to the paper
Cosine Similarity vs. Average human score for the high task:
	rho     = 1.0000
	p-value = 0.0000
Cosine Similarity vs. Average human score for the low task:
	rho     = 0.8000
	p-value = 0.2000
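
The four blocks in the cell above differ only in the model dictionary they read from, so the same kind of report could be produced with one helper and a loop. This is a minimal sketch, assuming dictionaries shaped like the output of `compose_vectors` (keys `'hilo'`, `'average input'` and `'comp vectors cosine similarity'`); the function name `spearman_by_task` and the toy data are hypothetical.

```python
from scipy import stats

def spearman_by_task(results, task):
    """Return Spearman's rho and p-value between human and model scores for one task.

    `results` maps a phrase triple to a dict with the keys 'hilo',
    'average input' and 'comp vectors cosine similarity'.
    """
    keys = sorted(k for k, entry in results.items() if entry['hilo'] == task)
    human = [results[k]['average input'] for k in keys]
    model = [results[k]['comp vectors cosine similarity'] for k in keys]
    rho, pval = stats.spearmanr(human, model)
    return rho, pval

# Made-up results for one hypothetical model (not values from the lab).
toy_results = {
    ('v1', 'n1', 'l1'): {'hilo': 'high', 'average input': 5.5, 'comp vectors cosine similarity': 0.80},
    ('v1', 'n2', 'l1'): {'hilo': 'high', 'average input': 4.0, 'comp vectors cosine similarity': 0.60},
    ('v2', 'n3', 'l2'): {'hilo': 'high', 'average input': 6.2, 'comp vectors cosine similarity': 0.90},
    ('v1', 'n1', 'l3'): {'hilo': 'low', 'average input': 1.5, 'comp vectors cosine similarity': 0.70},
    ('v1', 'n2', 'l3'): {'hilo': 'low', 'average input': 2.5, 'comp vectors cosine similarity': 0.50},
    ('v2', 'n3', 'l4'): {'hilo': 'low', 'average input': 3.0, 'comp vectors cosine similarity': 0.30},
}

for model_name, results in {'toy model': toy_results}.items():
    for task in ('high', 'low'):
        rho, pval = spearman_by_task(results, task)
        print('{} / {} task: rho = {:.4f}, p-value = {:.4f}'.format(model_name, task, rho, pval))
```

Filtering and sorting the keys inside the helper keeps the human and model lists aligned without relying on a shared `sorted_keys` variable.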


In [28]:
# calculate correlations over the whole task, without separating the high and low items
average_human_similarity_all = average_human_similarity_high + average_human_similarity_low

comp_vectors_similarity_add_all = comp_vectors_similarity_add_high + comp_vectors_similarity_add_low
comp_vectors_similarity_mult_all = comp_vectors_similarity_mult_high + comp_vectors_similarity_mult_low
comp_vectors_similarity_comb_all = comp_vectors_similarity_comb_high + comp_vectors_similarity_comb_low
comp_vectors_similarity_comb_rev_all = comp_vectors_similarity_comb_rev_high + comp_vectors_similarity_comb_rev_low

rho_add_all, pval_add_all = stats.spearmanr(average_human_similarity_all, comp_vectors_similarity_add_all)
print('For the simple additive function (whole task)')
print("""Cosine Similarity vs. Average human score for the whole task:
\trho     = {:.4f}
\tp-value = {:.4f}""".format(rho_add_all, pval_add_all))

rho_mult_all, pval_mult_all = stats.spearmanr(average_human_similarity_all, comp_vectors_similarity_mult_all)
print('For the simple multiplicative function (whole task)')
print("""Cosine Similarity vs. Average human score for the whole task:
\trho     = {:.4f}
\tp-value = {:.4f}""".format(rho_mult_all, pval_mult_all))

rho_comb_all, pval_comb_all = stats.spearmanr(average_human_similarity_all, comp_vectors_similarity_comb_all)
print('For the combined function with parameters as in the paper (whole task)')
print("""Cosine Similarity vs. Average human score for the whole task:
\trho     = {:.4f}
\tp-value = {:.4f}""".format(rho_comb_all, pval_comb_all))

rho_comb_rev_all, pval_comb_rev_all = stats.spearmanr(average_human_similarity_all, comp_vectors_similarity_comb_rev_all)
print('For the combined function with parameters alpha and beta switched compared to the paper (whole task)')
print("""Cosine Similarity vs. Average human score for the whole task:
\trho     = {:.4f}
\tp-value = {:.4f}""".format(rho_comb_rev_all, pval_comb_rev_all))

For the simple additive function (whole task)
Cosine Similarity vs. Average human score for the whole task:
	rho     = -0.0238
	p-value = 0.9554
For the simple multiplicative function (whole task)
Cosine Similarity vs. Average human score for the whole task:
	rho     = 0.2857
	p-value = 0.4927
For the combined function with parameters as in the paper (whole task)
Cosine Similarity vs. Average human score for the whole task:
	rho     = -0.5952
	p-value = 0.1195
For the combined function with parameters alpha and beta switched compared to the paper (whole task)
Cosine Similarity vs. Average human score for the whole task:
	rho     = 0.7381
	p-value = 0.0366


In [29]:
# building the table comparing results for all models
import pandas as pd

high_means = [
    sum(comp_vectors_similarity_add_high) / len(comp_vectors_similarity_add_high),
    sum(comp_vectors_similarity_mult_high) / len(comp_vectors_similarity_mult_high),
    sum(comp_vectors_similarity_comb_high) / len(comp_vectors_similarity_comb_high),
    sum(comp_vectors_similarity_comb_rev_high) / len(comp_vectors_similarity_comb_rev_high)
]

low_means = [
    sum(comp_vectors_similarity_add_low) / len(comp_vectors_similarity_add_low),
    sum(comp_vectors_similarity_mult_low) / len(comp_vectors_similarity_mult_low),
    sum(comp_vectors_similarity_comb_low) / len(comp_vectors_similarity_comb_low),
    sum(comp_vectors_similarity_comb_rev_low) / len(comp_vectors_similarity_comb_rev_low)
]

rho_vals = [rho_add_all, rho_mult_all, rho_comb_all, rho_comb_rev_all]
p_vals = [pval_add_all, pval_mult_all, pval_comb_all, pval_comb_rev_all]

table_data = {
    "Model": ["Additive", "Multiplicative", "Combined", "Combined Rev"],
    "High": high_means,
    "Low": low_means,
    "rho": rho_vals,
    "p-value": p_vals
}

df = pd.DataFrame(table_data)

df["High"] = df["High"].round(4)
df["Low"] = df["Low"].round(4)
df["rho"] = df["rho"].round(4)
df["p-value"] = df["p-value"].round(4)

print(df)

            Model    High     Low     rho  p-value
0        Additive  0.6996  0.7416 -0.0238   0.9554
1  Multiplicative  0.9282  0.9328  0.2857   0.4927
2        Combined  0.8783  0.9797 -0.5952   0.1195
3    Combined Rev  0.1691 -0.0229  0.7381   0.0366


**Any comments/thoughts should go here:**

The combined function with reversed weights (alpha = 0, beta = 0.95, gamma = 0.05) showed the best overall performance, with the highest correlation with human judgements over the whole task (rho = 0.7381). This suggests that weighting the second word of each phrase (the noun or the landmark) more heavily than the verb helps capture semantic similarity more effectively. It may be related to the difference between the setup used in the paper and the one in our study: in the paper the composed vectors represent a sentence built from an NP (subject) and a VP, whereas here they represent a VP built from a V and an (object) NP.

However, this result should be interpreted with caution. The reversed combined model is also the only one whose correlation is statistically significant at the 0.05 level (p = 0.0366), so the comparison with the other models is difficult to interpret. In addition, 37 of the 95 task words were missing from `svdspace_10k`, so the correlations are computed over a reduced set of phrases.
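
One possible way to read the p-values above: the (rho, p) pairs reported for the low task (-0.6/0.4, -0.8/0.2, -0.4/0.6 and 0.8/0.2) are what `scipy.stats.spearmanr` returns for lists of only four items, where the t-based two-sided p-value works out to 1 - |rho|. The sketch below reproduces one such pair with made-up ranks; the data are illustrative and not taken from the lab.

```python
from scipy import stats

# Made-up scores for four items (illustrative only).
human = [1, 2, 3, 4]
model = [3, 4, 1, 2]

rho, pval = stats.spearmanr(human, model)
print(round(rho, 4), round(pval, 4))

# With four items the t-based two-sided p-value equals 1 - |rho|,
# so even a fairly strong correlation stays far from significance.
print(round(1 - abs(rho), 4))
```

With so few items per group, the per-task correlations say little about which composition function is better.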

# Literature

[1] C. Silberer and M. Lapata. Learning grounded meaning representations with autoencoders. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 721–732, Baltimore, Maryland, USA, June 23–25 2014. Association for Computational Linguistics.

[2] J. Mitchell and M. Lapata. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT, pages 236–244. Association for Computational Linguistics, 2008.
  
[3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[4] E. Vylomova, L. Rimell, T. Cohn, and T. Baldwin. Take and took, gaggle and goose, book and read: Evaluating the utility of vector differences for lexical relation learning. arXiv, arXiv:1509.01692 [cs.CL], 2015.

## Statement of contribution

Briefly state how many times you have met for discussions, who was present, to what degree each member contributed to the discussion and the final answers you are submitting.

Before meeting, each of us worked on the assignment individually.
We met once on the morning of 24 April for around two hours, with three of us on site and one online. We discussed our answers, mainly those for part 5.

We met again on the 28th to finalize our code and comments.


## Marks

The assignment is marked on a 7-level scale where 4 is sufficient to complete the assignment; 5 is good solid work; 6 is excellent work, covers most of the assignment; and 7: creative work.

This assignment has a total of 60 marks. These translate to grades as follows: 1 = 17%, 2 = 34%, 3 = 50%, 4 = 67%, 5 = 75%, 6 = 84%, 7 = 92%, where the percentages are interpreted as lower bounds for achieving that grade.