# A2: Vector Semantics

Nikolai Ilinykh, Mehdi Ghanimifard, Wafia Adouane and Simon Dobnik


The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Write all your answers and the code in the appropriate boxes below.

In this lab we will look at how to build distributional semantic models from corpora and use the semantic similarity captured by these models to perform semantic tasks. We will also examine how well different vector composition functions approximate the semantic similarity of phrases when compared to human judgements.

This lab uses code from `dist_erk.py`, which contains functions similar to those shown in the lecture. You can use either set of functions to solve these tasks.

In [1]:
# the following command simply imports all the methods from that code.
from dist_erk import *

## 1. Loading a corpus

To train a distributional model, we first need a sufficiently large collection of texts which contain different words used frequently enough in different contexts. Here we will use a section of the Wikipedia corpus `wikipedia.txt` stored in `wikipedia.zip`. This file has been borrowed from another lab by [Richard Johansson](http://www.cse.chalmers.se/~richajo/).

When unpacked, the file is 151 MB; hence, if you are using the MLT servers, you should store it in a temporary folder outside your home directory and adjust the `corpus_dir` path below. It may already exist in `/srv/data/computational-semantics/`.

In [2]:
# run on pc
# import zipfile
# # unpack .zip
# with zipfile.ZipFile('wikipedia.zip', 'r') as zip_ref:
#     zip_ref.extractall('wikipedia')
corpus_dir = 'wikipedia/wikipedia/'

## 2. Building a model

Now you are ready to build the model.  
Using the methods from the code imported above build three word matrices with 1000 dimensions as follows:  

(i) with raw counts (saved to a variable `space_1k`);  
(ii) with PPMI (`ppmispace_1k`);  
(iii) with reduced dimensions SVD (`svdspace_1k`).  
For the latter use `svddim=5`. **[5 marks]**

Your task is to replace `...` with function calls. The functions are imported from `dist_erk.py`, and they are similar to the functions shown during the lecture.
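
To make the first step concrete, here is a minimal sketch of how a raw count space can be built from a tokenised corpus. It is not the code from `dist_erk.py`: the function name `count_cooccurrences`, the toy corpus and the symmetric window of 2 tokens are all assumptions made for illustration.

```python
from collections import Counter

def count_cooccurrences(sentences, vocab, window=2):
    """Count how often each target word co-occurs with each context word.

    `sentences` is a list of token lists, `vocab` the words to keep and
    `window` the number of tokens considered on each side (an assumption;
    dist_erk.py may define contexts differently).
    """
    index = {w: i for i, w in enumerate(vocab)}
    counts = {w: [0] * len(vocab) for w in vocab}
    for tokens in sentences:
        for pos, target in enumerate(tokens):
            if target not in index:
                continue
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for ctx_pos in range(lo, hi):
                context = tokens[ctx_pos]
                # Skip the target position itself and out-of-vocabulary words.
                if ctx_pos != pos and context in index:
                    counts[target][index[context]] += 1
    return counts

toy_corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
# Keep the N most frequent words as both targets and context words.
freq = Counter(tok for sent in toy_corpus for tok in sent)
toy_vocab = [w for w, _ in freq.most_common(3)]
toy_space = count_cooccurrences(toy_corpus, toy_vocab, window=2)

for word in toy_vocab:
    print(word, toy_space[word])
```

Each row of `toy_space` plays the same role as `space_1k['house']` in the lab: a vector of counts, one per context word, in the order given by the vocabulary.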

In [3]:
numdims = 1000
svddim = 5

# Which words to use as targets and context words?
# We need to count the words and keep only the N most frequent ones
# Which function would you use here with which variable?
ktw = do_word_count(corpus_dir, numdims) # 1000 most common words

wi = make_word_index(ktw) # word:index pair
words_in_order = sorted(wi.keys(), key=lambda w: wi[w]) # sorted words by index

# Create different spaces (the original matrix space, the ppmi space, the svd space)
# Which functions with which arguments would you use here?
print('create count matrices')
space_1k = make_space(corpus_dir, wi, numdims)
print('ppmi transform')
ppmispace_1k = ppmi_transform(space_1k, wi)
print('svd transform')
svdspace_1k = svd_transform(space_1k, numdims, svddim)
print('done.')

reading file wikipedia.txt
create count matrices
reading file wikipedia.txt


1145485it [02:18, 8258.53it/s] 


ppmi transform
svd transform
done.
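
The PPMI step can be illustrated on a tiny count matrix. The sketch below assumes the usual definition, PPMI(w, c) = max(0, log2(P(w, c) / (P(w) P(c)))); the logarithm base and the handling of zero counts in `dist_erk.py` may differ, and the function name `ppmi` and the toy counts are made up for this example.

```python
import math

def ppmi(counts):
    """Turn a {word: [counts]} matrix into positive PMI values."""
    words = list(counts)
    total = sum(sum(row) for row in counts.values())
    row_sums = {w: sum(counts[w]) for w in words}
    n_cols = len(next(iter(counts.values())))
    col_sums = [sum(counts[w][j] for w in words) for j in range(n_cols)]
    result = {}
    for w in words:
        row = []
        for j, c in enumerate(counts[w]):
            if c == 0:
                # log(0) is undefined; PPMI treats unseen pairs as 0.
                row.append(0.0)
                continue
            pmi = math.log2((c * total) / (row_sums[w] * col_sums[j]))
            row.append(max(0.0, pmi))  # clip negative associations to 0
        result[w] = row
    return result

# Hypothetical counts for three target words over three context words.
toy_counts = {
    "cat": [8, 0, 2],
    "dog": [6, 1, 3],
    "car": [0, 9, 1],
}
toy_ppmi = ppmi(toy_counts)
for word, row in toy_ppmi.items():
    print(word, [round(v, 3) for v in row])
```

Pairs that co-occur less often than chance (such as `dog` with the second context word) get a weight of 0, while frequent but uninformative co-occurrences are down-weighted relative to the raw counts.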


In [4]:
# now, to test the space, you can print vector representation for some words
print('house:', space_1k['house'])

house: [2551 3714 3104  567  962  627  443  185  311  189  131   28   93  169
   81  125  151  408  194   89   79   29  217  184   62   15   31   70
   10    1   41   21    1   31   37    1   30    5   25    7    3   20
   11    1   32   36    2    5   65    4    0   46    8   18   28    0
   20    7    8   16   10   40    0  175   10    2    7   19    1  174
   11    3    1    6    0    0    0   10    9   11    7   24    4    4
   14   23   58    7    0   10    2    3   10    6   18    6   13    3
   22    0    3    5    3    7   14    3   40   20   19   15    6    8
   23    4    5    1   19    0    3    1    0   14    0   14   53    7
    7   11    6    5    5    4   12    6   53    1    1  433    4    0
    5    7    7   12    1    1    3    4   17    8   16    1    2   31
    1   12   14    1   44    6   14    9   38    7    2    6    8    1
   10    6   10    1    9    7    9    4    3    9    0   11    3    2
    0    2   11   37    2    0    2    1    5    9   10   16    4    6

The Oxford Advanced Dictionary has 185,000 words, hence 1,000 words is not representative. We trained a model with 10,000 words and a truncated SVD with 50 dimensions. All matrices are available in the folder `pretrained` of the `wikipedia.zip` file. These are `ktw_wikipediaktw.npy`, `raw_wikipediaktw.npy`, `ppmi_wikipediaktw.npy`, `svd50_wikipedia10k.npy`. Make sure they are in your path as we load them below.

In [5]:
import numpy as np

numdims = 10000
svddim = 50

print('Please wait...')
ktw_10k       = np.load('wikipedia/wikipedia/pretrained/ktw_wikipediaktw.npy', allow_pickle=True)
space_10k     = np.load('wikipedia/wikipedia/pretrained/raw_wikipediaktw.npy', allow_pickle=True).all()
ppmispace_10k = np.load('wikipedia/wikipedia/pretrained/ppmi_wikipediaktw.npy', allow_pickle=True).all()
svdspace_10k  = np.load('wikipedia/wikipedia/pretrained/svd50_wikipedia10k.npy', allow_pickle=True).all()
print('Done.')


Please wait...
Done.


In [6]:
# testing semantic space
print('house:', space_10k['house'])

house: [2554 3774 3105 ...    0    0    0]


## 3. Testing semantic similarity

The file `similarity_judgements.txt` contains 7,576 pairs of words and their lexical and visual similarities (based on the pictures) collected through crowd-sourcing using Mechanical Turk as described in [1]. The scores range from 1 (highly dissimilar) to 5 (highly similar). Note: this is a different dataset from the phrase similarity dataset we discussed during the lecture [2]. You can find more details about how they were collected in the papers.

The following code will transform similarity scores into a Python-friendly format:

In [7]:
word_pairs = [] # test suite word pairs
semantic_similarity = [] 
visual_similarity = []
test_vocab = set()

for index, line in enumerate(open('similarity_judgements.txt')):
    data = line.strip().split('\t')
    if index > 0 and len(data) == 3:
        w1, w2 = tuple(data[0].split('#'))
        # Checks if both words from each pair exist in the word matrix.
        if w1 in ktw_10k and w2 in ktw_10k:
            word_pairs.append((w1, w2))
            test_vocab.update([w1, w2])
            semantic_similarity.append(float(data[1]))
            visual_similarity.append(float(data[2]))
        
print('number of available words to test:', len(test_vocab & set(ktw_10k))) # size of the intersection of test_vocab and ktw_10k
print('number of available word pairs to test:', len(word_pairs))
#list(zip(word_pairs, visual_similarity, semantic_similarity))

number of available words to test: 155
number of available word pairs to test: 774


We are going to test how the cosine similarity between vectors of each of the three spaces (normal space, ppmi, svd) compares with the human similarity judgements for the words in the similarity dataset. Which of the three spaces best approximates human judgements?

To compare several sets of scores, we can use the [Spearman correlation coefficient](https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient), which is implemented in `scipy.stats.spearmanr` [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html). The values of the Spearman correlation coefficient range from -1 to 1, where 0 indicates no correlation, 1 a perfect positive correlation and -1 a perfect negative correlation. Hence, the greater the number, the better the similarity scores align. The p-value tells us whether the coefficient is statistically significant. For this to be the case, it must be less than or equal to $0.05$.

Here is how you can calculate Spearman's correlation coefficient between the scores of visual similarity and semantic similarity of the available words in the test suite:
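
As an illustration of what the coefficient measures, here is a minimal pure-Python sketch: both lists are converted to ranks (tied values share the mean of their ranks) and Pearson's correlation is then computed on the ranks. The helper names and the toy scores are assumptions for this example, and the sketch does not compute a p-value.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Made-up human ratings and model scores for five word pairs.
human = [1.0, 2.5, 2.5, 4.0, 5.0]
model = [0.10, 0.30, 0.20, 0.80, 0.90]
rho = spearman_rho(human, model)
print(round(rho, 4))
```

Because only the ranks matter, the two lists do not need to be on the same scale, which is why ratings from 1 to 5 can be compared with cosine scores.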

In [8]:
from scipy import stats

rho, pval = stats.spearmanr(semantic_similarity, visual_similarity)
print("""Visual Similarity vs. Semantic Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))


Visual Similarity vs. Semantic Similarity:
rho     = 0.7122
p-value = 0.0000


Let's now calculate the cosine similarity scores of all word pairs in an ordered list using all three matrices. **[6 marks]**

In [9]:
from scipy.spatial import distance

raw_similarities  = [1 - distance.cosine(space_10k[w1], space_10k[w2]) for w1, w2 in word_pairs] 
ppmi_similarities = [1 - distance.cosine(ppmispace_10k[w1], ppmispace_10k[w2]) for w1, w2 in word_pairs]
svd_similarities  = [1 - distance.cosine(svdspace_10k[w1], svdspace_10k[w2]) for w1, w2 in word_pairs]
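# distance.cosine returns the cosine distance (1 - cosine similarity), with range [0, 2],
# so 1 - distance.cosine(...) recovers the cosine similarity, with range [-1, 1].
# Pay attention to this when interpreting the scores!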
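
The relation between cosine similarity and cosine distance can be checked with a small, self-contained sketch. The function names below are chosen for this example and are not taken from `dist_erk.py` or SciPy.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])            # no shared dimensions
opposite = cosine_similarity([1, 2], [-1, -2])            # opposite directions

print(same_direction, orthogonal, opposite)
print(cosine_distance([1, 2], [-1, -2]))
```

Raw count and PPMI vectors contain no negative values, so their cosine similarities fall between 0 and 1; SVD vectors can have negative components, so their similarities can be negative.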

Calculate correlation coefficients between the lists of similarity scores and the real semantic similarity scores from the experiment. Which model's scores correlate best with them? Is this expected? **[6 marks]**

In [10]:
# your code should go here
# Calculate correlation coefficients for raw similarities
raw_correlation = stats.spearmanr(raw_similarities, semantic_similarity)

# Calculate correlation coefficients for PPMI similarities
ppmi_correlation = stats.spearmanr(ppmi_similarities, semantic_similarity)

# Calculate correlation coefficients for SVD similarities
svd_correlation = stats.spearmanr(svd_similarities, semantic_similarity)

# Determine which model has the highest correlation coefficient
best_model = max([(abs(raw_correlation[0]), 'Raw Similarities'),
                  (abs(ppmi_correlation[0]), 'PPMI Similarities'),
                  (abs(svd_correlation[0]), 'SVD Similarities')])

# Print correlation coefficients and the best model
print("Correlation Coefficients:")
print("Raw Similarities:", raw_correlation[0])
print("PPMI Similarities:", ppmi_correlation[0])
print("SVD Similarities:", svd_correlation[0])
print("\nBest Model(based on absolute correlation coefficient):", best_model[1])

Correlation Coefficients:
Raw Similarities: 0.1522361772810631
PPMI Similarities: 0.45474430383677467
SVD Similarities: 0.42322095953128175

Best Model(based on absolute correlation coefficient): PPMI Similarities


**Your answer should go here:**

We can also calculate correlation coefficients between the lists of cosine similarity scores and the real visual similarity scores from the experiment. Which similarity model best correlates with them? How do the correlation coefficients compare with those from the previous comparison, and can you speculate why we get such results? **[7 marks]**

In [11]:
# Your code should go here...
# Calculate correlation coefficients for cosine similarities
raw_visual_correlation = stats.spearmanr(raw_similarities, visual_similarity)
ppmi_visual_correlation = stats.spearmanr(ppmi_similarities, visual_similarity)
svd_visual_correlation = stats.spearmanr(svd_similarities, visual_similarity)

# Print correlation coefficients
print("Correlation Coefficients with Visual Similarity:")
print("Raw Similarities:", raw_visual_correlation[0])
print("PPMI Similarities:", ppmi_visual_correlation[0])
print("SVD Similarities:", svd_visual_correlation[0])

# Determine which model has the highest correlation coefficient
best_model_visual = max([(abs(raw_visual_correlation[0]), 'Raw Similarities'),
                        (abs(ppmi_visual_correlation[0]), 'PPMI Similarities'),
                        (abs(svd_visual_correlation[0]), 'SVD Similarities')])

# Print the best model for visual similarity
print("\nBest Model for Visual Similarity (based on correlation coefficient):", best_model_visual[1])

Correlation Coefficients with Visual Similarity:
Raw Similarities: 0.12117933714649261
PPMI Similarities: 0.3837712858901843
SVD Similarities: 0.3096551629349149

Best Model for Visual Similarity (based on correlation coefficient): PPMI Similarities


**Your answer should go here:**

Pearson vs. Spearman correlation coefficients

Pearson: suitable for linear relationships between continuous (metric) variables.

Spearman: more widely applicable; it is computed on the ranks of the data, so it captures any monotonic relationship and is less sensitive to outliers.
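
The difference between the two coefficients shows up on data that is monotonic but not linear. The sketch below uses made-up values and a simplified ranking helper that assumes there are no ties.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """Ranks 1..n for values without ties (a simplification for this example)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic, but far from a straight line

pearson_r = pearson(x, y)
spearman_r = pearson(ranks(x), ranks(y))
print(round(pearson_r, 4), round(spearman_r, 4))
```

Spearman's coefficient is exactly 1 here because the ordering of `x` and `y` is identical, whereas Pearson's coefficient is lower because the relationship is not linear.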

## 4. Operations on similarities

We can perform mathematical operations on vectors to derive meaning predictions.

For example, we can perform `king - man` and add the resulting vector to `woman` and we hope to get the vector for `queen`. What would be the result of `stockholm - sweden + denmark`? Why? **[3 marks]**

If you want to learn more about vector differences between words (and words in analogy relations), check this paper [4].
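
The idea behind such vector arithmetic can be illustrated in a hand-made two-dimensional space. The vectors below are invented so that one dimension loosely stands for "royalty" and the other for "gender"; real distributional spaces are far noisier, so this is only a sketch of the mechanism.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical space: dimension 0 ~ "royalty", dimension 1 ~ "gender".
toy_space = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.1,  1.0],
    "woman": [0.1, -1.0],
    "apple": [-1.0, 0.1],
}

def analogy(a, b, c, space):
    """Rank all words by cosine similarity to the vector a - b + c."""
    target = [x - y + z for x, y, z in zip(space[a], space[b], space[c])]
    scored = [(w, cosine_similarity(target, v)) for w, v in space.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)

ranking = analogy("king", "man", "woman", toy_space)
for word, score in ranking:
    print(word, round(score, 3))
```

Subtracting `man` removes the "gender" component of `king` while leaving "royalty" almost untouched, and adding `woman` puts the opposite gender value back in, which lands on `queen` in this idealised space.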

**Your answer should go here:**

Here is some code that allows us to calculate such comparisons.

In [12]:
def normalize(vec):
    return vec / veclen(vec)

def find_similar_to(vec1, space):
    # vector similarity function
    #sim_fn = lambda a, b: 1-distance.euclidean(normalize(a), normalize(b))
    #sim_fn = lambda a, b: 1-distance.correlation(a, b)
    #sim_fn = lambda a, b: 1-distance.cityblock(normalize(a), normalize(b))
    #sim_fn = lambda a, b: 1-distance.chebyshev(normalize(a), normalize(b))
    #sim_fn = lambda a, b: np.dot(normalize(a), normalize(b))
    sim_fn = lambda a, b: 1-distance.cosine(a, b)

    sims = [
        (word2, sim_fn(vec1, space[word2]))
        for word2 in space.keys()
    ]
    return sorted(sims, key = lambda p:p[1], reverse=True)

Here is how you apply this code. Comment on the results you get. **[3 marks]**

In [13]:
# short = normalize(svdspace_10k['short'])
light = normalize(svdspace_10k['light'])
long = normalize(svdspace_10k['long'])
heavy = normalize(svdspace_10k['heavy'])

find_similar_to(light - (heavy - long), svdspace_10k)[:10] # expected answer: short

[('long', 0.8733111261346901),
 ('above', 0.8259671977311955),
 ('around', 0.8030776291120685),
 ('sun', 0.7692439111243973),
 ('just', 0.7678481974778111),
 ('wide', 0.767257431992253),
 ('each', 0.7665960260861158),
 ('circle', 0.7647746702909336),
 ('length', 0.7601066921319761),
 ('almost', 0.7542351860536628)]

In [14]:
stockholm = normalize(svdspace_10k['stockholm'])
sweden = normalize(svdspace_10k['sweden'])
denmark = normalize(svdspace_10k['denmark'])

find_similar_to(stockholm - sweden + denmark, svdspace_10k)[:10] # expected answer: copenhagen

[('stockholm', 0.9571449438177746),
 ('prague', 0.9255634442739451),
 ('amsterdam', 0.8997980639287134),
 ('brussels', 0.8969830643584371),
 ('oslo', 0.892292311243968),
 ('hamburg', 0.8911355385331753),
 ('cologne', 0.8904376966250224),
 ('milan', 0.8895360380024667),
 ('munich', 0.8874478811027279),
 ('frankfurt', 0.8872747257104567)]

In [15]:
# svdspace_10k['copenhagen'] - the expected word does exist

**Your answer should go here:**
I expected the most similar word to be "short". However, "short" does not appear in the top results, although the returned words are intuitively related in topic. Instead, the function returns the words most similar to "long". (The same happens with "stockholm - sweden + denmark", where "stockholm" itself ranks first.)

Possible reasons for this:

Vector representation: in this word vector space, the similarity between "short" and "long" may be greater than the similarity between "light" and "short". In other words, "short" and "long" are closer in some semantic sense in this space, so the result vector stays near "long".

Characteristics of word vector spaces: word vector spaces are built from statistical information in the corpus, so the results depend on the corpus. Using a different word vector space, or fine-tuning the space, could help us get the desired results.

In [16]:
print("short vs. long:",1-distance.cosine(svdspace_10k['short'],svdspace_10k['long']))
print("light vs. short:",1-distance.cosine(svdspace_10k['light'],svdspace_10k['short']))

short vs. long: 0.808121714766419
light vs. short: 0.6356790412973713


In [17]:
# find_similar_to(denmark - sweden + stockholm, svdspace_10k)[:10] 
## The operands are in a different order here, but the results appear to be the same.

In [18]:
## Try a different representation method: with PPMI, "short" appears in the results
# PPMI
light = normalize(ppmispace_10k['light'])
long = normalize(ppmispace_10k['long'])
heavy = normalize(ppmispace_10k['heavy'])

find_similar_to(light - (heavy - long), ppmispace_10k)[:10] # expected answer: short

[('long', 0.6455903190370874),
 ('light', 0.5756309144130762),
 ('short', 0.22665051391680424),
 ('about', 0.2062232575609404),
 ('around', 0.20341064658972596),
 ('than', 0.1997623801104953),
 ('longer', 0.19775518996857122),
 ('through', 0.19530678790677225),
 ('each', 0.19315737065635885),
 ('a', 0.1897425288444996)]

In [19]:
stockholm = normalize(ppmispace_10k['stockholm'])
sweden = normalize(ppmispace_10k['sweden'])
denmark = normalize(ppmispace_10k['denmark'])

find_similar_to(stockholm - sweden + denmark, ppmispace_10k)[:10] # expected answer: copenhagen

[('stockholm', 0.6471741063767443),
 ('denmark', 0.527048298126461),
 ('paris', 0.14192180704346102),
 ('prague', 0.13330803087198895),
 ('moscow', 0.12723873038417588),
 ('london', 0.12179613535983802),
 ('berlin', 0.11239954334145585),
 ('munich', 0.11111614973189443),
 ('oslo', 0.10943906436877393),
 ('vienna', 0.10922784597519919)]

In [20]:
print("sm vs. ch:",1-distance.cosine(svdspace_10k['stockholm'],svdspace_10k['copenhagen']))
print("ch vs. dk:",1-distance.cosine(svdspace_10k['copenhagen'],svdspace_10k['denmark']))

sm vs. ch: 0.9166679278319493
ch vs. dk: 0.5358611624270834


In [21]:
print("sm vs. ch:",1-distance.cosine(ppmispace_10k['stockholm'],ppmispace_10k['copenhagen']))
print("ch vs. dk:",1-distance.cosine(ppmispace_10k['copenhagen'],ppmispace_10k['denmark']))

sm vs. ch: 0.1607946310921181
ch vs. dk: 0.12763734813237693
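
In the rankings above, the query words themselves (`long`, `light`, `stockholm`, `denmark`) occupy the top positions. A common convention in analogy evaluation is to leave the query words out of the candidate list. The sketch below shows one way to do this; `find_similar_excluding` is a hypothetical variant of `find_similar_to`, and the toy vectors are invented so that the effect is visible.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def find_similar_excluding(vec, space, exclude=()):
    """Rank words by cosine similarity to vec, skipping words in `exclude`."""
    excluded = set(exclude)
    sims = [
        (word, cosine_similarity(vec, other))
        for word, other in space.items()
        if word not in excluded
    ]
    return sorted(sims, key=lambda p: p[1], reverse=True)

# Hypothetical space: dimensions ~ "capital", "Swedish", "Danish".
toy_space = {
    "stockholm":  [1.0, 1.0, 0.0],
    "sweden":     [0.0, 0.4, 0.0],
    "denmark":    [0.0, 0.0, 0.4],
    "copenhagen": [1.0, 0.0, 1.0],
    "house":      [0.0, 0.2, 0.1],
}
query_words = ("stockholm", "sweden", "denmark")
query = [
    s - w + d
    for s, w, d in zip(*(toy_space[q] for q in query_words))
]

unfiltered = find_similar_excluding(query, toy_space)
filtered = find_similar_excluding(query, toy_space, exclude=query_words)
print(unfiltered[0][0], filtered[0][0])
```

In this toy space the offset `denmark - sweden` is small compared to `stockholm`, so the result vector stays closest to `stockholm` unless the query words are filtered out.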


Find 5 similar pairs of pairs of words and test them. Hint: google for `word analogies examples`. You can also construct analogies that are not only lexical but also express other relations such as grammatical relations, e.g. `see, saw, leave, ?`, or analogies that are based on world knowledge as in `questions-words.txt` from the [Google analogy dataset](http://download.tensorflow.org/data/questions-words.txt) described in [3]. Does the resulting vector similarity confirm your expectations? Remember you can only do this test if the words are contained in our vector space with 10,000 dimensions. **[10 marks]**

In [22]:
# Your code should go here...
# Ref: https://www.enchantedlearning.com/sampletests/verbalanalogies/

# 1. "love" is to "hate" as "sweet" is to ? expected: sour(candy, gift, nice, Valentine,...)
similar_pairs_1 = find_similar_to(ppmispace_10k['love'] - ppmispace_10k['hate'] + ppmispace_10k['sweet'], ppmispace_10k)[:10]

# 2. "north" is to "south" as "east" is to ? expected: west(up, left, down, center,...)
similar_pairs_2 = find_similar_to(ppmispace_10k['north'] - ppmispace_10k['south'] + ppmispace_10k['east'], ppmispace_10k)[:10]

# 3. "see" is to "saw" as "leave" is to ? expected: left
similar_pairs_3 = find_similar_to(ppmispace_10k['see'] - ppmispace_10k['saw'] + ppmispace_10k['leave'], ppmispace_10k)[:10]

# 4. "father" is to "son" as "mother" is to ? expected: daughter
similar_pairs_4 = find_similar_to(ppmispace_10k['father'] - ppmispace_10k['son'] + ppmispace_10k['mother'], ppmispace_10k)[:10]

# 5. "milk" is to "cow" as "eggs" is to ? expected: chicken(feather, lamb, pie, orange,...)
similar_pairs_5 = find_similar_to(ppmispace_10k['milk'] - ppmispace_10k['cow'] + ppmispace_10k['eggs'], ppmispace_10k)[:10]

# Print results
print("Similar Pairs for Analogy 1 (love - hate + sweet):\n", similar_pairs_1)
print("\nSimilar Pairs for Analogy 2 (north - south + east):\n", similar_pairs_2)
print("\nSimilar Pairs for Analogy 3 (see - saw + leave):\n", similar_pairs_3)
print("\nSimilar Pairs for Analogy 4 (father - son + mother):\n", similar_pairs_4)
print("\nSimilar Pairs for Analogy 5 (milk - cow + eggs):\n", similar_pairs_5)

Similar Pairs for Analogy 1 (love - hate + sweet):
 [('sweet', 0.6772237829945578), ('love', 0.6213929874619027), ('s', 0.20137267557036642), ('my', 0.195772511699837), ('song', 0.1931407595156005), ('like', 0.1885200595257016), ('album', 0.17900034659202524), ('her', 0.17816860524654987), ('me', 0.1762242431173946), ('your', 0.17532521314546579)]

Similar Pairs for Analogy 2 (north - south + east):
 [('north', 0.6994557089885227), ('east', 0.6935156652115283), ('west', 0.4435474065908418), ('41', 0.3682912039801547), ('34', 0.36688584215396913), ('59', 0.3640116694504689), ('39', 0.36251349163878877), ('57', 0.35771572136525065), ('38', 0.3567062560473737), ('residing', 0.35547113637504346)]

Similar Pairs for Analogy 3 (see - saw + leave):
 [('leave', 0.6183099577877832), ('see', 0.5041967031219105), ('make', 0.17141552301577323), ('take', 0.16858952627217827), ('go', 0.16635135035955062), ('give', 0.1642441561134178), ('history', 0.15961478631610337), ('use', 0.15912477588721963), (

In [23]:
test_target = ['left', 'sour', 'nice', 'daughter']
for word in test_target:
    check_existence = lambda w: f"target '{w}' exists" if w in ppmispace_10k else f"target '{w}' does not exist"
    print(check_existence(word))

target 'left' exists
target 'sour' does not exist
target 'nice' exists
target 'daughter' exists


In [24]:
## try different representation methods
# 1. "love" is to "hate" as "sweet" is to ?
similar_pairs_1 = find_similar_to(svdspace_10k['love'] - svdspace_10k['hate'] + svdspace_10k['sweet'], svdspace_10k)[:10]

# 2. "north" is to "south" as "east" is to ?
similar_pairs_2 = find_similar_to(svdspace_10k['north'] - svdspace_10k['south'] + svdspace_10k['east'], svdspace_10k)[:10]

# 3. "see" is to "saw" as "leave" is to ?
similar_pairs_3 = find_similar_to(svdspace_10k['see'] - svdspace_10k['saw'] + svdspace_10k['leave'], svdspace_10k)[:10]

# 4. "father" is to "son" as "mother" is to ?
similar_pairs_4 = find_similar_to(svdspace_10k['father'] - svdspace_10k['son'] + svdspace_10k['mother'], svdspace_10k)[:10]

# 5. "milk" is to "cow" as "eggs" is to ?
similar_pairs_5 = find_similar_to(svdspace_10k['milk'] - svdspace_10k['cow'] + svdspace_10k['eggs'], svdspace_10k)[:10]

# Print results
print("Similar Pairs for Analogy 1 (love - hate + sweet):\n", similar_pairs_1)
print("\nSimilar Pairs for Analogy 2 (north - south + east):\n", similar_pairs_2)
print("\nSimilar Pairs for Analogy 3 (see - saw + leave):\n", similar_pairs_3)
print("\nSimilar Pairs for Analogy 4 (father - son + mother):\n", similar_pairs_4)
print("\nSimilar Pairs for Analogy 5 (milk - cow + eggs):\n", similar_pairs_5)

Similar Pairs for Analogy 1 (love - hate + sweet):
 [('sweet', 0.9019111082574507), ('love', 0.8692463975350873), ('song', 0.8521931783633939), ('girl', 0.8518121319082917), ('baby', 0.8409831300845901), ('cat', 0.8345027264639273), ('boy', 0.8216282342389759), ('dream', 0.8117562713554443), ('soul', 0.7995940671557389), ('little', 0.7962444038618905)]

Similar Pairs for Analogy 2 (north - south + east):
 [('north', 0.9843013133089371), ('west', 0.9491403863140437), ('94', 0.8416520305403512), ('interstate', 0.8280754682547768), ('59', 0.8227276112113504), ('81', 0.8199631551445681), ('highway', 0.8182442121450889), ('83', 0.8181702612740784), ('77', 0.8177665381106839), ('47', 0.8163972112219406)]

Similar Pairs for Analogy 3 (see - saw + leave):
 [('leave', 0.7113980550779404), ('see', 0.7039598056847666), ('enter', 0.6867888303663497), ('find', 0.6719347730389507), ('learn', 0.6570663216260377), ('go', 0.6528884257332832), ('call', 0.6519290174549311), ('discover', 0.651761601747027

In [25]:
for word in test_target:
    check_existence = lambda w: f"target '{w}' exists" if w in svdspace_10k else f"target '{w}' does not exist"
    print(check_existence(word))

target 'left' exists
target 'sour' does not exist
target 'nice' exists
target 'daughter' exists


The resulting vector similarity confirms my expectations ***to some extent***.

## 5. Semantic composition and phrase similarity **[20 marks]**

In this task, we are going to examine how the phrase vectors composed by the different semantic composition functions/models introduced in [2] correlate with human judgements of similarity between phrases. We will use the dataset from this paper, which is stored in `mitchell_lapata_acl08.txt`. If you are interested in further details about this task, also refer to this paper.

(i) Process the dataset. The dataset contains human judgements of similarity between phrases recorded one per line. The first column indicates the id of a participant making a judgement (`participant`), the next column is `verb`, followed by `noun` and `landmark`. From these three columns we can construct phrases that were compared by human informants, namely `verb noun` vs `verb landmark`. The next column `input` indicates a similarity score a participant assigned to a pair of such phrases on a scale from 1 to 7 where 1 is lowest and 7 is highest. The last column `hilo` groups the phrases into two sets: phrases where we expect low and phrases where we expect high similarity scores. This is because we want to test our compositional functions on two tasks and examine whether a function is discriminative between them. Correlation between scores could also be due to other reasons than semantic similarity and hence good prediction on both tasks simultaneously shows that a function is truly discriminating the phrases using some semantic criteria.

For extracting information you can use the code from the lecture to start with. How to structure this data is up to you - a dictionary-like format would be a good choice. Remember that each example was judged by several participants and phrases will repeat in the dataset. Therefore, you have to collect all judgments for a particular set of phrases and average them. This will become useful in step (iii).

(ii) Compose the vectors of the extracted word pairs by testing different compositional functions. In the lecture we introduced simple additive, simple multiplicative and combined models (details are described in [2]). Your task is to take a pair of phrases, e.g. the first example in the dataset `stray thought` and `stray roam` and for each phrase compute a composition of the vectors of their words using these functions, using one function per experiment run. For each phrase you will get a single vector. You can encode the words with any vector space introduced earlier (standard space, ppmi or svd) but your code should be structured in a way that it will be easy to switch between them. Finally, take the resulting (composed) vectors of phrase pairs in the dataset and calculate a cosine similarity between them.

(iii) Now you have cosine similarity scores between vectors of phrases, but how do they compare with the average human scores that you calculated from the individual judgements in the `input` column of the dataset for the same phrases? Calculate the Spearman rank correlation coefficient between the two lists of scores for both the `high` and the `low` task.

We use the Spearman rank correlation coefficient (or Spearman's rho) rather than Pearson's correlation coefficient because we cannot compare cosine scores with human judgements directly. Cosine is a continuous measure and human judgements are expressed as ranks. Also, we cannot say whether a difference of 0.28 to 1 in cosine is the same as (or different from) a difference of 6 to 7 in the human scores. The Spearman rank correlation coefficient first turns the scores for all examples within each group into ranks, and then these ranks are correlated (or approximated by a linear function).

In the end you should get a table similar to the one below from the paper. What is the best compositional function from those that you evaluated with your vector spaces and why?

<img src="res.png" alt="drawing" width="500"/>
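
Steps (i) and (ii) can be sketched end to end on toy data before working with the real file. Everything below is illustrative: the vectors and the two judgement rows are made up, and the weights of the combined model (`alpha`, `beta`, `gamma`) are arbitrary placeholders rather than the values tuned in [2].

```python
import math
from statistics import mean

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def add(u, v):
    """Simple additive model: p = u + v."""
    return [x + y for x, y in zip(u, v)]

def multiply(u, v):
    """Simple multiplicative model: p = u * v (element-wise)."""
    return [x * y for x, y in zip(u, v)]

def combined(u, v, alpha=0.3, beta=0.2, gamma=0.5):
    """Combined model: p = alpha*u + beta*v + gamma*(u * v)."""
    return [alpha * x + beta * y + gamma * x * y for x, y in zip(u, v)]

# Hypothetical 3-dimensional word vectors.
toy_space = {
    "stray":   [1.0, 2.0, 0.0],
    "thought": [0.0, 3.0, 1.0],
    "roam":    [2.0, 1.0, 0.0],
}

# (participant, verb, noun, landmark, score) rows with made-up scores.
rows = [
    ("p1", "stray", "thought", "roam", 6),
    ("p2", "stray", "thought", "roam", 4),
]

# Step (i): collect all judgements per phrase pair and average them.
judgements = {}
for _, verb, noun, landmark, score in rows:
    judgements.setdefault((verb, noun, landmark), []).append(score)
mean_scores = {key: mean(scores) for key, scores in judgements.items()}

# Step (ii): compose each phrase and compare the two composed vectors.
def phrase_similarity(verb, noun, landmark, space, compose):
    phrase1 = compose(space[verb], space[noun])
    phrase2 = compose(space[verb], space[landmark])
    return cosine_similarity(phrase1, phrase2)

models = {"add": add, "multiply": multiply, "combined": combined}
results = {
    name: phrase_similarity("stray", "thought", "roam", toy_space, fn)
    for name, fn in models.items()
}
print(mean_scores)
print({name: round(score, 4) for name, score in results.items()})
```

Passing the space and the composition function as arguments keeps it easy to switch between the raw, PPMI and SVD spaces and between models, as the task requires. For step (iii), the averaged human scores and the cosine similarities of all phrase pairs in a group would be passed to a rank correlation function.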


In [26]:
# your code should go here
# Step 1: Processing the dataset
dataset = {}  # Dictionary to store dataset
with open('mitchell_lapata_acl08.txt', 'r') as file:
    next(file)  # skip the first line (column name)
    for line in file:
        parts = line.strip().split(' ')
        participant_id = parts[0]
        verb = parts[1]
        noun = parts[2]
        landmark = parts[3]
        similarity_score = float(parts[4])
        similarity_group = parts[5]

        # Construct phrases
        phrase1 = f"{verb} {noun}"
        phrase2 = f"{verb} {landmark}"

        # Add entry to dataset
        key = (phrase1, phrase2)
        if key not in dataset:
            dataset[key] = {'scores': [], 'group': similarity_group}
        dataset[key]['scores'].append(similarity_score)

In [27]:
# Step 2: Composing vectors
# Implement encoding of words using chosen vector space representation
# Function to encode a word using a selected vector space representation
def encode_word(word, vector_space):
    if word in vector_space:
        return vector_space[word]
    else: 
        return None

# Define semantic composition functions (additive, multiplicative, combined)
# Semantic composition functions
def additive_composition(vec1, vec2):
    if vec1 is not None and vec2 is not None:
        return [x + y for x, y in zip(vec1, vec2)]
    else:
        return None

def multiplicative_composition(vec1, vec2):
    if vec1 is not None and vec2 is not None:
        return [x * y for x, y in zip(vec1, vec2)]
    else:
        return None

def combined_composition(vec1, vec2, a=0.3, b=0.2, c=0.5): # the weights are set manually here
    if vec1 is not None and vec2 is not None:
        weighted_sum = [a*x + b*y + c*x*y for x, y in zip(vec1, vec2)]
        return weighted_sum
    else:
        return None

# Compose vectors for each pair of phrases
def compose_vectors(word1, word2, vector_space, composition_function):
    
    vec1 = encode_word(word1, vector_space)
    vec2 = encode_word(word2, vector_space)
    
    composed_vec = composition_function(vec1, vec2)
    
    return composed_vec

# # Example usage: compose the two words of the phrase "stray thought"
# word1, word2 = "stray thought".split()

# additive_result = compose_vectors(word1, word2, svdspace_10k, additive_composition)
# multiplicative_result = compose_vectors(word1, word2, svdspace_10k, multiplicative_composition)
# combined_result = compose_vectors(word1, word2, svdspace_10k, combined_composition)

# print("Additive composition result:", additive_result)
# print("Multiplicative composition result:", multiplicative_result)
# print("Combined composition result:", combined_result)

In [28]:
# Step 3: Calculating cosine similarity
cosine_similarities = {}
# vector_space = svdspace_10k
vector_space = ppmispace_10k # switch between different vector spaces
functionDict = {"Add":additive_composition, "Multiply":multiplicative_composition, "Combined":combined_composition}
for pair, data in dataset.items():
    # Get composed vectors for the current pair of phrases
    phrase1, phrase2 = pair
    word1_phrase1, word2_phrase1 = phrase1.split()
    word1_phrase2, word2_phrase2 = phrase2.split()

    # Get vector representations for different composition methods respectively
    for model, composition_function in functionDict.items():
        composed_vector1 = compose_vectors(word1_phrase1, word2_phrase1, vector_space, composition_function)
        composed_vector2 = compose_vectors(word1_phrase2, word2_phrase2, vector_space, composition_function)
    
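        # Compare the composed vectors of the two phrases in each pair.
        # Note: distance.cosine returns the cosine distance (1 - cosine similarity),
        # so the values stored below are distances, not similarities.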
        if model not in cosine_similarities:
            cosine_similarities[model] = {}
        if composed_vector1 is not None and composed_vector2 is not None:
            cosine_similarities[model][pair] = distance.cosine(composed_vector1, composed_vector2)

In [29]:
cosine_similarities

{'Add': {('bow butler', 'bow submit'): 0.4594588325838541,
  ('bow company', 'bow submit'): 0.49624922273390726,
  ('boom sale', 'boom thunder'): 0.4466601985143872,
  ('boom gun', 'boom thunder'): 0.4817832750951593,
  ('bow head', 'bow submit'): 0.44922096015252266,
  ('bow government', 'bow submit'): 0.4552515625213709,
  ('boom noise', 'boom thunder'): 0.4539308360850818,
  ('boom export', 'boom thunder'): 0.4449971255937708},
 'Multiply': {('bow butler', 'bow submit'): 0.9916378075126581,
  ('bow company', 'bow submit'): 0.8990508255691259,
  ('boom sale', 'boom thunder'): 0.9969015128462371,
  ('boom gun', 'boom thunder'): 0.9622922349435084,
  ('bow head', 'bow submit'): 0.8961702458199792,
  ('bow government', 'bow submit'): 0.7369626185927327,
  ('boom noise', 'boom thunder'): 0.8754942491516382,
  ('boom export', 'boom thunder'): 0.9827516201004473},
 'Combined': {('bow butler', 'bow submit'): 0.5380984130508589,
  ('bow company', 'bow submit'): 0.4934194076943256,
  ('boom s

In [39]:
# Step 4: Calculating Spearman rank correlation coefficient
# Extract cosine similarities and human scores for high and low similarity groups
high_cosine_similarities = {}
low_cosine_similarities = {}

for model in functionDict:
    high_similarity_scores = []  # reset for each model so human scores are not repeated across models
    low_similarity_scores = []
    
    for pair, data in dataset.items():
        if pair in cosine_similarities[model]:
            similarity_score = np.mean(data['scores'])
            cosine_similarity = cosine_similarities[model][pair]

            if model not in high_cosine_similarities:
                high_cosine_similarities[model] = []
            if model not in low_cosine_similarities:
                low_cosine_similarities[model] = []

            if data['group'] == 'high':
                high_similarity_scores.append(similarity_score)
                high_cosine_similarities[model].append(cosine_similarity)
            else:
                low_similarity_scores.append(similarity_score)
                low_cosine_similarities[model].append(cosine_similarity)

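# Calculate the overall Spearman rank correlation coefficients (high and low groups together).
# The human-score lists hold the values from the last model iteration; they are identical
# for every model as long as each model covers the same phrase pairs.
similarity_scores = high_similarity_scores + low_similarity_scores # concatenate the two groups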
correlationDict = {}
for model in functionDict:
    rho, pval = stats.spearmanr(similarity_scores, high_cosine_similarities[model] + low_cosine_similarities[model])
    
    if model not in correlationDict:
        correlationDict[model] = {}
     
    correlationDict[model]['rho'] = rho
    correlationDict[model]['pval'] = pval
    
    print(f"Spearman Rank Correlation Coefficients for Model {model}:")
    print("""  rho     = {:.4f}
  p-value = {:.4f}""".format(rho, pval))

Spearman Rank Correlation Coefficients for Model Add:
  rho     = 0.5714
  p-value = 0.1390
Spearman Rank Correlation Coefficients for Model Multiply:
  rho     = -0.7143
  p-value = 0.0465
Spearman Rank Correlation Coefficients for Model Combined:
  rho     = -0.4048
  p-value = 0.3199
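
When reading the sign of ρ it matters whether the model scores are distances or similarities. The sketch below uses made-up numbers and a small NumPy-only Spearman helper (valid for data without ties; it is a stand-in for `stats.spearmanr`, not the function used above) to show that converting distances to similarities flips the sign of the coefficient and leaves its magnitude unchanged.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rho for data without ties: Pearson correlation of the ranks."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Made-up values for illustration only
human_scores = [6.0, 5.0, 3.0, 2.0, 1.5]      # higher = judged more similar
distances = [0.10, 0.25, 0.40, 0.70, 0.90]    # lower = closer vectors
similarities = [1.0 - d for d in distances]

rho_dist = spearman_rho(human_scores, distances)
rho_sim = spearman_rho(human_scores, similarities)
print(f"rho with distances:    {rho_dist:.2f}")
print(f"rho with similarities: {rho_sim:.2f}")
```

In this toy case the distances are perfectly inversely ranked against the human scores, so the first coefficient is negative and the second is positive with the same magnitude.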


In [50]:
# Construct a similar table
import pandas as pd

high_cosine_sim_averages = {key: sum(values) / len(values) for key, values in high_cosine_similarities.items()}
low_cosine_sim_averages = {key: sum(values) / len(values) for key, values in low_cosine_similarities.items()}

data = {
    'Model': ['Add', 'Multiply', 'Combined'],
    'High': list(map(lambda x: '{:.2f}'.format(x), high_cosine_sim_averages.values())),
    'Low': list(map(lambda x: '{:.2f}'.format(x), low_cosine_sim_averages.values())),
    'ρ': list(map(lambda x: '{:.2f}'.format(x), [value['rho'] for value in correlationDict.values()])),
    'p-value': [value['pval'] for value in correlationDict.values()]
}

df = pd.DataFrame(data)
df['ρ'] = df.apply(lambda row: f"{row['ρ']} *" if row['p-value'] < 0.05 else str(row['ρ']), axis=1)


In [51]:
df

|   | Model | High | Low | ρ | p-value |
|---|---|---|---|---|---|
| 0 | Add | 0.47 | 0.45 | 0.57 | 0.13896 |
| 1 | Multiply | 0.87 | 0.97 | -0.71 * | 0.046528 |
| 2 | Combined | 0.50 | 0.55 | -0.40 | 0.319889 |
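
A LaTeX version of this table can be generated from the same numbers rather than typed by hand, which avoids copy errors when the vector space is switched. The following is a minimal sketch in plain Python; `to_latex_array` and the `rows` list are hypothetical helpers, with the values copied from the table above.

```python
# Rows copied from the table above: (model, high, low, rho, p-value)
rows = [
    ("Add", 0.47, 0.45, 0.57, 0.13896),
    ("Multiply", 0.87, 0.97, -0.71, 0.046528),
    ("Combined", 0.50, 0.55, -0.40, 0.319889),
]

def to_latex_array(rows, alpha=0.05):
    """Build a LaTeX array, starring rho when its p-value is below alpha."""
    lines = [
        r"\begin{array}{|lllc|}",
        r"\hline",
        r"{Model} & {High} & {Low} & {\rho} \\",
        r"\hline",
    ]
    for model, high, low, rho, pval in rows:
        star = "*" if pval < alpha else ""
        lines.append(rf"{{{model}}} & {high:.2f} & {low:.2f} & {{{rho:.2f}{star}}} \\")
    lines += [r"\hline", r"\end{array}"]
    return "\n".join(lines)

latex_table = to_latex_array(rows)
print(latex_table)
```

The resulting string can be passed to `Latex(...)` from `IPython.display` in the same way as a hand-written one.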


In [3]:
from IPython.display import display, Latex

latex_code = r"""\begin{array}{|lllc|}
\hline
{Model} & {High} & {Low} & {\rho} \\
\hline
{Add} & 0.47 & 0.45 & {0.57} \\
{Multiply} & 0.87 & 0.97 & {-0.71*} \\
{Combined} & 0.50 & 0.55 & {-0.40}\\
\hline
\end{array}"""

display(Latex(latex_code))

<IPython.core.display.Latex object>

In [4]:
# latex_code = r"""\begin{tabular}{llll}
# \hline
# \multicolumn{1}{|l}{Model} & {High} & {Low} & \multicolumn{1}{c|}{$\rho$} \\
# \hline
# \hline
# \multicolumn{1}{|l}{Add} & 0.47 & 0.45 & \multicolumn{1}{c|}{0.57} \\
# \multicolumn{1}{|l}{Multiply} & 0.87 & 0.97 & \multicolumn{1}{c|}{-0.71*} \\
# \multicolumn{1}{|l}{Combined} & 0.50 & 0.55 & \multicolumn{1}{c|}{-0.40}\\
# \hline
# \end{tabular}"""

# display(Latex(latex_code))

<IPython.core.display.Latex object>

**Any comments/thoughts should go here:**

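I tried the SVD and PPMI vector representations in turn, and the PPMI results appear better.

From my results, I tend to conclude that the Multiply composition method is the best one, because it is the only one whose correlation is significant (p < 0.05). Note that its ρ is negative (-0.71): if `distance.cosine` is SciPy's function, the model scores are cosine distances (1 - similarity) rather than similarities, in which case a negative ρ against the human similarity ratings is the expected direction.

For the Combined model, the weighting constants are hyperparameters, so there may exist a better combination for composing the vectors.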
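
One way to look for better weighting constants is a small grid search. The sketch below assumes the combined model has the form αu + βv + γ(u ⊙ v), which may differ from the `combined_composition` function used above; the vectors and human ratings are random or made-up stand-ins, and all helper names are hypothetical.

```python
import itertools
import numpy as np

def combined(u, v, alpha, beta, gamma):
    """Weighted sum of additive and multiplicative composition (assumed form)."""
    return alpha * u + beta * v + gamma * (u * v)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def spearman_rho(x, y):
    """Spearman's rho for data without ties: Pearson correlation of the ranks."""
    rx, ry = np.argsort(np.argsort(x)), np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

rng = np.random.default_rng(0)
# Toy stand-ins: 6 phrase pairs, each made of four 5-dimensional word vectors
pairs = [tuple(rng.random(5) for _ in range(4)) for _ in range(6)]
human = [5.5, 4.0, 3.2, 2.8, 2.1, 1.0]  # made-up ratings

def score(alpha, beta, gamma):
    """Rank correlation between human ratings and model similarities."""
    sims = [cosine_similarity(combined(a, b, alpha, beta, gamma),
                              combined(c, d, alpha, beta, gamma))
            for a, b, c, d in pairs]
    return spearman_rho(human, sims)

grid = [0.0, 0.5, 1.0]
# Skip the all-zero setting, which would produce zero vectors
candidates = [w for w in itertools.product(grid, repeat=3) if any(w)]
best = max(candidates, key=lambda w: score(*w))
print("best (alpha, beta, gamma):", best, "rho:", round(score(*best), 4))
```

Weights tuned on the same pairs that are used for evaluation will overfit, so in practice the search would be run on a held-out portion of the data.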


P.S.
We also tried another way of building the dataset, to make sure that all the words in `mitchell_lapata_acl08.txt` are included:

word_pairs = [] # test suite word pairs
high_group = []
low_group = []

# participant -> similarity score, for each pair, an average score of all participants
semantic_similarity = {}  # key: (w1, w2), value: list of scores
# make two separate dicts from semantic_similarity dict based on 'high' and 'low' group
sem_sim_high = {}
sem_sim_low = {}

with open('/Users/gengtianyi/Downloads/MLT2023/LT2213/02-vector-semantics/mitchell_lapata_acl08/mitchell_lapata_acl08.txt') as f:
    for index, line in enumerate(f):
        data = line.strip().split()
        if index > 0 and len(data) == 6:  # skip the first (header) line and malformed rows
            # w1 = (verb, noun), w2 = (verb, landmark)
            w1, w2 = tuple(data[1:3]), (data[1], data[3])
            word_pairs.append((w1, w2))
            # collect this participant's score; the per-pair average is computed below
            score = float(data[4])
            if (w1, w2) not in semantic_similarity:
                semantic_similarity[(w1, w2)] = []
            semantic_similarity[(w1, w2)].append(score)
            if data[5] == 'high':
                high_group.append((w1, w2))
            else:
                low_group.append((w1, w2))

word_pairs = list(set(word_pairs)) # remove duplicates
# calculate the average similarity score for each pair
for key in semantic_similarity:
    semantic_similarity[key] = sum(semantic_similarity[key]) / len(semantic_similarity[key])
high_group = list(set(high_group)) # remove duplicates
low_group = list(set(low_group)) # remove duplicates

# generate two separate dicts from semantic_similarity dict based on 'high' and 'low' group
for key in semantic_similarity:
    if key in high_group:
        sem_sim_high[key] = semantic_similarity[key]
    else:
        sem_sim_low[key] = semantic_similarity[key]

#test print
print(word_pairs[:3])
print(semantic_similarity[('stray', 'thought'), ('stray', 'roam')])

# Step 2: Compose the vectors of the extracted word pairs by testing different compositional functions.
# build new vector spaces based on the vocabulary of the new txt file "mitchell_lapata_acl08.txt"

# the new "ktw" should be all the unique words in the new txt file
new_ktw = []
for word in word_pairs:
    new_ktw.extend(word[0])
    new_ktw.extend(word[1])
new_ktw = list(set(new_ktw)) # remove duplicates
# new_ktw = do_word_count(corpus_dir, numdims)

new_wi = make_word_index(new_ktw)
new_words_in_order = sorted(new_wi.keys(), key=lambda w:new_wi[w])

# make 3 matrices
new_space = make_space(corpus_dir, new_wi, numdims)
new_ppmispace = ppmi_transform(new_space, new_wi)
new_svdspace = svd_transform(new_ppmispace, numdims, svddim)


# test print
print(new_ktw[:10])
print("stray:",new_space["stray"])

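# for high task
# calculate the cosine values for all the pairs in the high group (additive model),
# mirroring the low-task code below; without this step the *_similarities dicts are undefined here
raw_similarities = {}
ppmi_similarities = {}
svd_similarities = {}
for key in sem_sim_high:
    p1, p2 = key
    raw_similarities[key] = distance.cosine(new_space[p1[0]] + new_space[p1[1]],
                                            new_space[p2[0]] + new_space[p2[1]])
    ppmi_similarities[key] = distance.cosine(new_ppmispace[p1[0]] + new_ppmispace[p1[1]],
                                             new_ppmispace[p2[0]] + new_ppmispace[p2[1]])
    svd_similarities[key] = distance.cosine(new_svdspace[p1[0]] + new_svdspace[p1[1]],
                                            new_svdspace[p2[0]] + new_svdspace[p2[1]])

# calculate Spearman rank correlation coefficient between two lists of the scores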
rho_raw, pval_raw = stats.spearmanr(list(sem_sim_high.values()), list(raw_similarities.values()))
rho_ppmi, pval_ppmi = stats.spearmanr(list(sem_sim_high.values()), list(ppmi_similarities.values()))
rho_svd, pval_svd = stats.spearmanr(list(sem_sim_high.values()), list(svd_similarities.values()))

print("High task:")
print("""Semantic Similarity vs. Raw Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho_raw, pval_raw))
print("""Semantic Similarity vs. PPMI Similarity:
      rho     = {:.4f}
      p-value = {:.4f}""".format(rho_ppmi, pval_ppmi))
print("""Semantic Similarity vs. SVD Similarity:
        rho     = {:.4f}
        p-value = {:.4f}""".format(rho_svd, pval_svd))

# for low task
# calculate the cosine values for all the pairs in the low group
# raw similarity scores & additive model
raw_similarities = {}
for key in sem_sim_low:
    p1, p2 = key
    vec_p1 = new_space[p1[0]] + new_space[p1[1]]
    vec_p2 = new_space[p2[0]] + new_space[p2[1]]
    raw_similarities[key] = distance.cosine(vec_p1, vec_p2)

# ppmi similarity scores & additive model
ppmi_similarities = {}
for key in sem_sim_low:
    p1, p2 = key
    vec_p1 = new_ppmispace[p1[0]] + new_ppmispace[p1[1]]
    vec_p2 = new_ppmispace[p2[0]] + new_ppmispace[p2[1]]
    ppmi_similarities[key] = distance.cosine(vec_p1, vec_p2)

# svd similarity scores & additive model
svd_similarities = {}
for key in sem_sim_low:
    p1, p2 = key
    vec_p1 = new_svdspace[p1[0]] + new_svdspace[p1[1]]
    vec_p2 = new_svdspace[p2[0]] + new_svdspace[p2[1]]
    svd_similarities[key] = distance.cosine(vec_p1, vec_p2)


# calculate Spearman rank correlation coefficient between two lists of the scores
rho_raw, pval_raw = stats.spearmanr(list(sem_sim_low.values()), list(raw_similarities.values()))
rho_ppmi, pval_ppmi = stats.spearmanr(list(sem_sim_low.values()), list(ppmi_similarities.values()))
rho_svd, pval_svd = stats.spearmanr(list(sem_sim_low.values()), list(svd_similarities.values()))

print("Low task:")
print("""Semantic Similarity vs. Raw Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho_raw, pval_raw))
print("""Semantic Similarity vs. PPMI Similarity:
      rho     = {:.4f}
      p-value = {:.4f}""".format(rho_ppmi, pval_ppmi))
print("""Semantic Similarity vs. SVD Similarity:
        rho     = {:.4f}
        p-value = {:.4f}""".format(rho_svd, pval_svd))
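
The three near-identical loops above could be folded into one helper that works for any of the spaces. The following is a sketch under the assumption that a space is a dict from word to NumPy vector, as the indexing `new_space["stray"]` suggests; `additive_phrase_distances` and the toy space are hypothetical and not part of `dist_erk.py`.

```python
import numpy as np

def additive_phrase_distances(pairs, space):
    """Cosine distance between additively composed phrases, keyed by phrase pair."""
    result = {}
    for (w1, w2), (w3, w4) in pairs:
        if not all(w in space for w in (w1, w2, w3, w4)):
            continue  # skip pairs containing out-of-vocabulary words
        p1 = space[w1] + space[w2]
        p2 = space[w3] + space[w4]
        cos = np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2))
        result[((w1, w2), (w3, w4))] = 1.0 - float(cos)
    return result

# Toy space with made-up vectors; "wander" is deliberately missing
toy_space = {
    "stray": np.array([1.0, 0.0, 1.0]),
    "thought": np.array([0.0, 1.0, 1.0]),
    "roam": np.array([1.0, 0.0, 0.0]),
}
toy_pairs = [
    (("stray", "thought"), ("stray", "roam")),
    (("stray", "thought"), ("stray", "wander")),
]
dists = additive_phrase_distances(toy_pairs, toy_space)
print(dists)
```

Because pairs with missing words are skipped, the human scores passed to the rank correlation should be looked up by the keys of the returned dict so that the two lists stay aligned.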


# Literature

[1] C. Silberer and M. Lapata. Learning grounded meaning representations with autoencoders. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 721–732, Baltimore, Maryland, USA, June 23–25 2014. Association for Computational Linguistics.

[2] J. Mitchell and M. Lapata. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT, pages 236–244, 2008. Association for Computational Linguistics.
  
[3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[4] E. Vylomova, L. Rimell, T. Cohn, and T. Baldwin. Take and took, gaggle and goose, book and read: Evaluating the utility of vector differences for lexical relation learning. arXiv, arXiv:1509.01692 [cs.CL], 2015.

## Statement of contribution

Briefly state how many times you have met for discussions, who was present, to what degree each member contributed to the discussion and the final answers you are submitting.

## Marks

The assignment is marked on a 7-level scale where 4 is sufficient to complete the assignment; 5 is good solid work; 6 is excellent work, covers most of the assignment; and 7: creative work. 

This assignment has a total of 60 marks. These translate to grades as follows: 1 = 17%, 2 = 34%, 3 = 50%, 4 = 67%, 5 = 75%, 6 = 84%, 7 = 92%, where the percentages are interpreted as lower bounds for achieving that grade.