# Distributional semantics

Mehdi Ghanimifard, Adam Ek, Wafia Adouane and Simon Dobnik

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Before starting, please read the instructions on how to work on group assignments.

Write all your answers and the code in the appropriate boxes below.

---

In this lab we will look at how to build distributional semantic models from corpora and use the semantic similarity captured by these models to do some simple semantic tasks. We are going to use the code that we discussed in class last time.

The following command simply imports all the methods from that code.

In [1]:
from dist_erk import *

## 1. Loading a corpus

To train a distributional model, we first need a sufficiently large collection of text in which different words are used frequently enough in different contexts. Here we will use a section of the Wikipedia corpus which you can download from [here](https://linux.dobnik.net/oc/index.php/s/9NTlpOJfPWGS56t/download?path=%2Flab4-distributional-data&files=wikipedia.txt.zip) (Linux and Mac) or [here](https://linux.dobnik.net/oc/index.php/s/9NTlpOJfPWGS56t/download?path=%2Flab4-distributional-data&files=wikipedia-for-windows.zip) (Windows). (This file has been borrowed from another lab by [Richard Johansson](http://www.cse.chalmers.se/~richajo/).) When unpacked, the file is 151 MB, so if you are using the lab computers you should store it in a temporary folder outside your home directory and adjust the `corpus_dir` path below.
<It may already exist in `/opt/mlt/courses/cl2015/a5`.>


In [2]:
corpus_dir = '/home/guszarzmo@GU.GU.SE/LT2213-v20/lab1/lt2213-lab-1-group-3/zarzouram/problem-set-3/wikipedia'

## 2.1 Building a count-based model

Now you are ready to build a count-based model. The functions for building word spaces can be found in `dist_erk.py`. We will build a model that creates count-based vectors for 1000 words. Using the methods from the code imported above, build three word matrices with 1000 dimensions as follows: (i) with raw counts (saved to a variable `space_1k`); (ii) with PPMI (`ppmispace_1k`); and (iii) with dimensions reduced by SVD (`svdspace_1k`). For the latter use `svddim=5`. **[5 marks]**


---
---AE: Marks=5
    
---

In [3]:
word_to_keep = 1000
num_dims = 1000
svddim = 5

# which words to use as targets and context words?
ktw = do_word_count(corpus_dir, word_to_keep)

wi = make_word_index(ktw) # word index
words_in_order = sorted(wi.keys(), key=lambda w:wi[w]) # sorted words

print('create count matrices')
space_1k = make_space(corpus_dir, wi, num_dims)
print('ppmi transform')
ppmispace_1k = ppmi_transform(space_1k, wi)
print('svd transform')
svdspace_1k = svd_transform(space_1k, num_dims, svddim)
print('done.')

reading file wikipedia.txt
create count matrices
reading file wikipedia.txt
ppmi transform
svd transform
done.
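
The PPMI weighting used above can be illustrated on a tiny hand-made co-occurrence table. The following is a minimal sketch in plain Python, not the implementation of `ppmi_transform` in `dist_erk.py`; the function name `ppmi`, the dict-of-dicts layout and the toy counts are assumptions made for illustration. It computes $\mathrm{PPMI}(w, c) = \max\left(0, \log_2 \frac{P(w, c)}{P(w)\,P(c)}\right)$.

```python
import math

def ppmi(counts):
    """Return PPMI weights for a dict-of-dicts co-occurrence table (illustrative helper)."""
    total = sum(sum(row.values()) for row in counts.values())
    row_sums = {w: sum(row.values()) for w, row in counts.items()}
    col_sums = {}
    for row in counts.values():
        for context, n in row.items():
            col_sums[context] = col_sums.get(context, 0) + n

    weights = {}
    for w, row in counts.items():
        weights[w] = {}
        for context, n in row.items():
            if n == 0:
                # log(0) is undefined; unseen pairs get weight 0
                weights[w][context] = 0.0
                continue
            p_joint = n / total
            p_expected = (row_sums[w] / total) * (col_sums[context] / total)
            # negative PMI values are clipped to 0
            weights[w][context] = max(0.0, math.log2(p_joint / p_expected))
    return weights

# Hypothetical toy counts: "the" co-occurs with everything, "bark" only with "dog"
toy_counts = {
    "dog": {"the": 50, "bark": 10},
    "cat": {"the": 50, "bark": 0},
}
toy_ppmi = ppmi(toy_counts)
print(toy_ppmi["dog"])
```

In this toy table the frequent but uninformative context "the" ends up with weight 0 for "dog", while the rarer context "bark" keeps a positive weight.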


## 2.2 Building a word2vec model

We will also build a continuous-bag-of-words (CBOW) word2vec model using gensim (https://radimrehurek.com/gensim/index.html). Build a CBOW word2vec model where each word has 300 dimensions and, as above, limit the vocabulary size to the most common 1000 words. **[5 marks]**

Documentation for the Word2Vec class can be found here: https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec

In [4]:
from gensim import utils
from gensim.models import Word2Vec

# gensim requires a restartable iterable (not a one-shot generator) over the corpus
class CorpusReader():
    def __init__(self, corpus_path):
        self.corpus_path = corpus_path

    def __iter__(self):
        # re-open the file on every pass; gensim iterates over the corpus several times
        with open(self.corpus_path) as corpus_file:
            for line in corpus_file:
                sentence = utils.simple_preprocess(line)
                if sentence:
                    yield sentence

corpus = CorpusReader(corpus_dir+'/wikipedia.txt')
w2v_model = Word2Vec(sentences=corpus,
                     # training options go here (CBOW is the default, sg=0)
                     size=300, max_final_vocab=num_dims, window=2)
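
The keyword arguments in this cell follow the gensim 3.x API. In gensim 4 the `size` parameter was renamed to `vector_size`. The sketch below shows one way the options could be assembled for either version; `word2vec_kwargs` is a hypothetical helper that is not part of gensim, and the major version number is passed in by hand rather than read from the installed library.

```python
def word2vec_kwargs(gensim_major, dims, vocab_size, window):
    """Build Word2Vec keyword arguments for gensim 3.x or 4.x (hypothetical helper)."""
    # gensim 4 renamed `size` to `vector_size`
    size_key = "vector_size" if gensim_major >= 4 else "size"
    return {
        size_key: dims,
        "max_final_vocab": vocab_size,
        "window": window,
        "sg": 0,  # 0 selects CBOW, 1 would select skip-gram
    }

options_v3 = word2vec_kwargs(3, dims=300, vocab_size=1000, window=2)
options_v4 = word2vec_kwargs(4, dims=300, vocab_size=1000, window=2)
print(options_v3)
print(options_v4)
```

The resulting dictionary would then be passed as `Word2Vec(sentences=corpus, **options_v4)`.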

---
---AE: Marks=5
    
---

In [5]:
print('house:', space_1k['house'])
print('house:', w2v_model.wv['house'])

house: [2554 3774 3105  567  962  631  443  185  311  189  131   28   93  169
   81  125  151  408  194   90   79   29  217  184   62   15   31   70
   10    1   41   21    1   31   37    1   30    5   25    7    3   20
   11    1   32   36    2    5   66    4    0   46    8   18   28    0
   20    7    8   16   10   40    0  175   10    2    7   19    1  174
   11    3    1    6    0    0    0   10    9   11    7   24    4    4
   14   23   58    7    0   10    2    3   10    6   18    6   13    3
   22    0    3    5    3    7   14    3   40   20   19   15    6    8
   24    4    5    1   19    0    3    1    0   14    0   14   53    7
    7   11    6    5    5    4   12    6   53    1    1  433    4    0
    5    7    7   12    1    1    3    4   17    8   16    1    2   31
    1   12   14    1   44    6   14    9   38    7    2    6    8    1
   10    6   10    1    9    7    9    4    3   10    0   11    3    2
    0    2   11   37    2    0    2    1    5    9   10   16   88    6

The Oxford Advanced Dictionary has 185,000 words, hence a vocabulary of 1,000 words is not representative. We trained a model with 10,000 words and 50 dimensions after truncated SVD. It took 40 minutes on a laptop.

Additionally, we trained a word2vec model on the same data. The vocabulary size is also 10,000 with 300 dimensions for the words, and truncated to 50 dimensions in SVD. It took about 15 minutes on a desktop.

We saved all five matrices [here](https://gubox.app.box.com/folder/75208243314) ([alternative/old link](https://linux.dobnik.net/oc/index.php/s/9NTlpOJfPWGS56t/download?path=%2Flab4-distributional-data&files=pretrained.zip)) which you can load as follows:

In [6]:
import numpy as np

numdims = 10000
svddim = 50

basepath = "/home/guszarzmo@GU.GU.SE/LT2213-v20/lab1/lt2213-lab-1-group-3/zarzouram/problem-set-3/"

print('Please wait..')
ktw_10k       = np.load(basepath+'pretrained/ktw_wikipediaktw.npy', allow_pickle=True)
space_10k     = np.load(basepath+'pretrained/raw_wikipediaktw.npy', allow_pickle=True).all()
ppmispace_10k = np.load(basepath+'pretrained/ppmi_wikipediaktw.npy', allow_pickle=True).all()
svdspace_10k  = np.load(basepath+'pretrained/svd50_wikipedia10k.npy', allow_pickle=True).all()
w2v_space     = np.load(basepath+'pretrained/w2v.npy', allow_pickle=True)
w2v_svd_space = np.load(basepath+'pretrained/w2v_svd.npy', allow_pickle=True)
print('Done.')


Please wait..
Done.


Each vector space can be queried as a dictionary `{word_form: vector, ...}`:

In [7]:
print('house:', space_10k['house'])
print('house:', w2v_space['house'])

house: [2554 3774 3105 ...    0    0    0]
house: [-0.22926676  0.84213424  0.41197702  0.8781869   1.2811967  -1.4847856
  1.4102424   0.7267851  -0.5590538   0.04928471  1.8261132  -0.4911551
  2.6236389  -0.62284136 -1.4621106   1.1592358   1.0392265  -0.07465155
  1.0108253   1.1842203  -1.5743443  -1.3098637  -0.04264146 -0.1076067
  0.5574365   0.7599903   0.11031609  0.16449381 -0.40311787 -0.68341875
  0.48706874 -0.73431605 -0.2089108  -0.10828558 -0.6296254   1.3785347
 -0.2206072  -1.0867819  -0.2650222  -0.18507054 -1.6295078  -1.0952461
  1.2633797   0.29369423 -0.10325834  1.2930017   0.83000755 -0.14103375
  1.786327    0.49764258 -2.0428705   0.64002794 -0.3000837   0.03268864
 -0.0933575   0.76802623 -0.1682042   1.8946133  -0.10339233  0.78187567
 -0.28241557 -1.0668939   2.4631667   1.0492538   0.10093345  0.5764743
 -0.24940039 -0.27094615 -0.5501715  -0.07181013  0.830345    0.06051366
 -0.75200856  0.03423605  0.12481829  0.35145602 -0.5419142   0.62099475
 -0.059

## 3. Operations on similarities

We can perform mathematical operations on word vectors to derive meaning predictions. For example, we can subtract the normalised vectors for `king` minus `queen` and add the resulting vector to `man` and we hope to get the vector for `woman`. Why? **[3 marks]**

----------------------------
### Answer:

We expect that the relation between `king` and `queen` is the same as `man` and `woman`.  In other words, the similarity (quantified as distance in vector space) between the `king_vector` and the `queen_vector` should be equal to the similarity between the `man_vector` and the `woman_vector`. Thus, we expect that adding the difference between `king_vector` and `queen_vector` to the `man_vector` will move the `man_vector` to be close to `woman_vector`.

However, although the `woman` vector appears among the nearest neighbours of `king - queen + man`, further analysis suggests that the calculated vector does not coincide with the `woman` vector. We think that what the algebraic manipulation described above gives us is a region of the space that contains vectors related to `king`, `queen`, `man` and `woman`. The context-word "features" that differentiate `king` from `queen`, and which are then added to `man`, are the main factors that affect the output vector. Please see our analysis below for a more detailed discussion.

**Note:** The cell that contains the functions `normalize()` and `find_similar_to()` has been moved so that we can use it in our answer to question 3.

--------------------------------------------------------------------

---
---AE: Clear answer! Marks=3
    
---

Here is some helpful code that allows us to calculate such comparisons.

In [40]:
from scipy.spatial import distance

def normalize(vec):
    return vec / veclen(vec)

def find_similar_to(vec1, space):
    # vector similarity functions
    sim_fn = lambda a, b: 1-distance.euclidean(normalize(a), normalize(b))
    # sim_fn = lambda a, b: 1-distance.correlation(a, b)
    # sim_fn = lambda a, b: 1-distance.cityblock(normalize(a), normalize(b))
    # sim_fn = lambda a, b: 1-distance.chebyshev(normalize(a), normalize(b))
    # sim_fn = lambda a, b: np.dot(normalize(a), normalize(b))
    # sim_fn = lambda a, b: 1-distance.cosine(a, b)

    sims = [
        (word2, sim_fn(vec1, space[word2]))
        for word2 in space.keys()
    ]
    return sorted(sims, key = lambda p:p[1], reverse=True)
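
The default `sim_fn` above is `1 - euclidean` on normalised vectors. For unit vectors the squared Euclidean distance equals $2 - 2\cos(a, b)$, so this score ranks neighbours in the same order as cosine similarity, but it ranges from 1 down to -1 and turns negative as soon as the cosine falls below 0.5. The following plain-Python sketch checks the identity numerically; the helper names and the two example vectors are made up for illustration.

```python
import math

def unit(vec):
    """Scale a vector to length 1."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def cosine_similarity(a, b):
    a, b = unit(a), unit(b)
    return sum(x * y for x, y in zip(a, b))

def euclidean_on_unit_vectors(a, b):
    a, b = unit(a), unit(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Arbitrary example vectors
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

cos_ab = cosine_similarity(a, b)
dist_ab = euclidean_on_unit_vectors(a, b)

# ||a - b||^2 = 2 - 2 * cos(a, b) for unit vectors
identity_gap = abs(dist_ab ** 2 - (2 - 2 * cos_ab))
print(cos_ab, 1 - dist_ab, identity_gap)
```

This is why many of the scores printed in the cells below are negative even for clearly related words: a score of 0 corresponds to a cosine of 0.5, not to orthogonal vectors.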

In [114]:
# Your code should go here...
king = w2v_space['king'];               king_norm = normalize(king)
queen = w2v_space['queen'];             queen_norm = normalize(queen)
man = w2v_space['man'];                 man_norm = normalize(man)
woman = w2v_space['woman'];             woman_norm = normalize(woman)
# boy = w2v_space['boy'];                 boy_norm = normalize(boy)

# Calculate king - queen + man
woman_cal = king - queen + man
woman_cal_norm = normalize(woman_cal)

# Find similar vectors in the space
similar_vectors = find_similar_to(woman_cal_norm, w2v_space)
max_dis = similar_vectors[-1][0]
display(similar_vectors[:10])
# display(similar_vectors[-10:])

# Distance function (lower means more similar): plain Euclidean distance
mysim_fn = lambda a, b: distance.euclidean(a, b)

# get the word that has the maximum distance from woman calculated vector
max_vector = w2v_space[max_dis];       
max_norm = normalize(max_vector)

s_max_norm = mysim_fn(woman_cal_norm, max_norm)
s_max = mysim_fn(woman_cal, max_vector)

# Compare different vectors
# Using Normalized Vectors
print("{:.^19} {:^15} {:^13} {:^13} {:^13} {:^13}".format(".", "WomanC-Woman", "womanC-man", "woman-man", "queen-king", "WomanC-max"))
s0 = mysim_fn(woman_cal_norm, woman_norm) / s_max_norm
s1 = mysim_fn(woman_cal_norm, man_norm) / s_max_norm
# s2 = mysim_fn(woman_cal_norm, boy_norm) / s_max_norm
s3 = mysim_fn(woman_norm, man_norm) / s_max_norm
s4 = mysim_fn(queen_norm, king_norm) / s_max_norm
s5 = mysim_fn(woman_cal_norm, max_norm) / s_max_norm

print("{:^19} {:^15.4f} {:^13.4f} {:^13.4f} {:^13.4f} {:^13.4f}".format("Normalized", s0, s1, s3, s4, s5))

# Using non normalized vectors
s10 = mysim_fn(woman_cal, woman) / s_max
s11 = mysim_fn(woman_cal, man) / s_max
# s12 = mysim_fn(woman_cal, boy) / s_max
s13 = mysim_fn(woman, man) / s_max
s14 = mysim_fn(queen, king) / s_max
s15 = mysim_fn(woman_cal, max_vector) / s_max
print("{:^19} {:^15.4f} {:^13.4f} {:^13.4f} {:^13.4f} {:^13.4f}".format("Not Normalized", s10, s11, s13, s14, s15))



[('man', 0.2804240584373474),
 ('king', 0.010552525520324707),
 ('himself', -0.09425520896911621),
 ('boy', -0.11881411075592041),
 ('woman', -0.12057375907897949),
 ('son', -0.12237465381622314),
 ('prophet', -0.12315809726715088),
 ('hero', -0.12349879741668701),
 ('nephew', -0.1237407922744751),
 ('gentleman', -0.13512647151947021)]

...................  WomanC-Woman    womanC-man     woman-man    queen-king    WomanC-max  
    Normalized          0.6988         0.4487        0.5402        0.5764        1.0000    
  Not Normalized        0.7118         0.4874        0.4372        0.4874        1.0000    


#### Further discussion of the algebraic manipulation "`king - queen + man`"

In the code above, we use the Euclidean distance as the basis of the similarity function. The "`calculated woman`" is equal to `king - queen + man`.

We then search for words whose vectors are similar to the "`calculated woman`" vector. As expected, the `woman` vector appears among the nearest neighbours of the "`calculated woman`". Interestingly, other vectors rank above it, namely `man`, `king`, `himself` and `boy`, with `man` being the closest word vector to the "`calculated woman`". Words like `son`, `prophet` and `gentleman` also appear. These words have a masculine gender feature. This result is inconsistent with the intuition of getting `woman`, presumably because of the influence of gender on the contexts in which these words occur.

We compared the `calculated woman` and the `woman` vectors to examine to what degree they are similar; see the table in the cell above. We normalized each Euclidean distance by the maximum distance found in our vector space. As expected, the distance between `king` and `queen` is roughly equal to the distance between `man` and `woman`. We noticed that the `calculated woman` is closer to `man` than to the original `woman` vector.

The last result, together with the words that are found to be closest to the calculated vector, suggests that the calculated vector may not represent the `woman` vector.
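
One reason the input word `man` ranks first is that the offset added to it is small compared with the vector itself. A common convention, followed for example by gensim's `most_similar`, is to leave the query words out of the returned neighbours. The sketch below illustrates the effect on a hand-made toy space; the vectors, the `analogy` function and its `exclude_inputs` flag are assumptions for illustration and are not part of `dist_erk.py`.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(space, a, b, c, exclude_inputs=True):
    """Rank words by cosine similarity to space[a] - space[b] + space[c]."""
    target = [x - y + z for x, y, z in zip(space[a], space[b], space[c])]
    skip = {a, b, c} if exclude_inputs else set()
    ranked = [(w, cosine(target, v)) for w, v in space.items() if w not in skip]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Hypothetical 3-dimensional vectors, chosen by hand
toy_space = {
    "king":  [1.0, 0.0, 0.2],
    "queen": [0.8, 0.3, -0.3],
    "man":   [0.0, 1.0, 0.2],
    "woman": [0.0, 1.0, -0.1],
    "apple": [0.1, 0.2, 0.9],
}

with_inputs = analogy(toy_space, "king", "man", "woman", exclude_inputs=False)
without_inputs = analogy(toy_space, "king", "man", "woman")
print(with_inputs[0][0], without_inputs[0][0])
```

With the inputs kept, the toy query returns the input word `king` first; with the inputs excluded, `queen` moves to the top.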

-----


---
--- AE: So, what to consider when working with distributed vectors is that they won't give the EXACT location of woman, they will give an approximate location (kinda: "woman" should be around here). Then of course, based on the content of the corpora and the method used, other words may lie in this area, or region, which the algebraic manipulation points at.

I'm also wondering a bit about your division by s_max/s_max_norm, and the "max_distance" thing in general. I don't really understand your reasoning for dividing the similarity by this; what does this add to your analysis?

The results are a bit confusing, so the similarity of the calculated vector for woman (`womanC`) is higher with `Woman` (.69) than with `man` (.44) in your table? So, `woman` is more similar to `womanC` than `man` is?

In general tho, I think it's an interesting idea for analysing vector spaces (which is kinda fun!), but you need to motivate the method behind the analysis a bit more.

---

Here is how you apply this code. Compare the count-based method with the word2vec method and comment on the results you get. **[4 marks]**

In [71]:
king = normalize(w2v_space['king'])
queen = normalize(w2v_space['queen'])
man = normalize(w2v_space['man'])
woman = normalize(w2v_space['woman'])

w2v = find_similar_to(king - man + woman, w2v_space)[:10]

king = normalize(space_10k['king'])
queen = normalize(space_10k['queen'])
man = normalize(space_10k['man'])
woman = normalize(space_10k['woman'])

count = find_similar_to(king - man + woman, space_10k)[:10]

print("word2vec")
display(w2v)
print("count based")
display(count)

word2vec


[('king', 0.30582571029663086),
 ('isabella', 0.05993384122848511),
 ('queen', 0.05481141805648804),
 ('regent', 0.024451017379760742),
 ('consort', 0.006222903728485107),
 ('princess', -0.006409406661987305),
 ('aragon', -0.02746891975402832),
 ('woman', -0.040259361267089844),
 ('prince', -0.044527292251586914),
 ('throne', -0.04945361614227295)]

count based


[('king', 0.7203524622658145),
 ('master', 0.6998342452614426),
 ('group', 0.6798995313375658),
 ('legacy', 0.678797007381283),
 ('bishop', 0.6705391829560815),
 ('wizard', 0.6702145341927735),
 ('great', 0.6683195163949522),
 ('chronology', 0.6651789427826227),
 ('shadow', 0.6631569577811421),
 ('prophecy', 0.660960082400398)]

----------------------------
### Answer:

The word2vec model is better at capturing words related to `king - man + woman`: its top 10 results include "queen", "isabella" (presumably Queen Isabella), "princess", "prince" and "throne", which are the meaning relations we expect. Interestingly, "king" remains in first place, with a considerably higher score than the second-highest word, "isabella". In contrast, the count-based method does not confirm our intuition about meaning here at all and instead gives words that often occur in contexts similar to those of "king", such as "master", "bishop", or "great". These word relations seem to be taken from the context of fantasy books, films, or computer games and don't correspond to the typical associations one would have with "real-world" kings (or queens). Nevertheless, these results have relatively high similarity scores.

Word2vec is a prediction-based model that predicts a word from the context words surrounding it. Thus, the model is better at learning which words are related to each other than the count-based model, which relies on counting how often words co-occur.

---


---
---AE: Marks=4
    
---

Find 2 similar pairs of pairs of words and test them. Does the resulting vector similarity confirm your expectations? But remember you can only do this if the words are contained in our vector space. **[2 marks]**

In [115]:
noon = normalize(w2v_svd_space['noon'])
sun = normalize(w2v_svd_space['sun'])
moon = normalize(w2v_svd_space['moon'])

find_similar_to(noon - sun + moon, w2v_svd_space)[:10]


[('noon', 0.3371831394944971),
 ('moon', 0.1947421521774182),
 ('midnight', 0.18643548055835057),
 ('night', 0.1732825223598521),
 ('nights', 0.15294529026537618),
 ('dawn', 0.11412533982372974),
 ('morning', 0.10290676720109992),
 ('pm', 0.09467113661912308),
 ('mars', 0.08986970536185568),
 ('lunar', 0.07473034257418476)]

----------------------------
### Answer:

Word2Vec finds vectors for words that are related to night such as night, midnight, pm and dawn.

---


---
---AE: Marks=1
    
---

Try changing the number of dimensions and the window size in the models you built in (2.1) and (2.2). Comment on the new results you get in comparison to the first results. [**4 marks**]



In [154]:
# print(words_in_order[:100])
print("Window = 2, Number of Word = 1000, Reduced Dimention = 300, Word3Vec size = 300")

myword = "mother"

# make the size of the count-based word embeddings equal to the size of Word2Vec
svdspace1_1k = svd_transform(space_1k, 1000, 300)
# find the 10 words most similar to myword using the count-based model
word_v = normalize(svdspace1_1k[myword])

########################################
### AE note to self: comparison between count-1000-context + svd-300 and original w2v

print("Count-based Reduced")
display(find_similar_to(word_v, svdspace1_1k)[:10])

# find 10 most words that are similar to myword using word2vec model
# convert w2vecmodel to dict 
w2v_dict = dict({})
for key in w2v_model.wv.vocab:
    w2v_dict[key] = w2v_model.wv[key]

word_v = normalize(w2v_dict[myword])
print("Word2Vec")
display(find_similar_to(word_v, w2v_dict)[:10])


print('----'*20)
print()

########################################
### AE note to self: comparison between
### count-3000-context + svd-500 (w=5) and
### w2v-dim-500 (w=5)

# New count-based and Word2Vec models with new configuration
num_dims = 3000
svddim = 500
win = 5

ktw = do_word_count(corpus_dir, num_dims)
wi = make_word_index(ktw) # word index

# Create new models
print('create count matrices')
space_3k = make_space(corpus_dir, wi, num_dims, win)
print('svd transform')
svdspace_3k = svd_transform(space_3k, num_dims, svddim)
print('create Word2Vec')
corpus = CorpusReader(corpus_dir+'/wikipedia.txt')
w2v_new = Word2Vec(sentences=corpus,
                     # training options goes here
                     size=svddim, max_final_vocab=num_dims, window=win)

print('Finished creating new models.')
print('-'*10)
print("\nWindow = 5, Number of Word = 3000, Reduced Dimention = 500, Word3Vec size = 500")

# find 10 most words that are similar to myword using count-based model
word_v = normalize(svdspace_3k[myword])
print("\nCount-based Reduced")
display(find_similar_to(word_v, svdspace_3k)[:10])

# find 10 most words that are similar to myword using word2vec model
# convert w2vecmodel to dict 
w2vnew_dict = dict({})
for key in w2v_new.wv.vocab:
    w2vnew_dict[key] = w2v_new.wv[key]

word_v = normalize(w2vnew_dict[myword])
print("Modified Word2Vec")
display(find_similar_to(word_v, w2vnew_dict)[:10])


Window = 2, Number of Word = 1000, Reduced Dimention = 300, Word3Vec size = 300
Count-based Reduced


[('mother', 1.0),
 ('father', 0.7849403673208615),
 ('brother', 0.7500760451729602),
 ('death', 0.6416708056325708),
 ('wife', 0.6275464846826474),
 ('son', 0.619609781934459),
 ('life', 0.6078508816798458),
 ('daughter', 0.570165359578929),
 ('personal', 0.5597811250978723),
 ('works', 0.5572185235230924)]

Word2Vec


[('mother', 1.0),
 ('father', 0.3792811632156372),
 ('daughter', 0.26224279403686523),
 ('wife', 0.24355071783065796),
 ('son', 0.1880810260772705),
 ('brother', 0.12326663732528687),
 ('child', 0.08178287744522095),
 ('death', -0.03311812877655029),
 ('queen', -0.035846829414367676),
 ('her', -0.04848647117614746)]

--------------------------------------------------------------------------------

reading file wikipedia.txt
create count matrices
reading file wikipedia.txt
svd transform
create Word2Vec
Finished creating new models.
----------

Window = 5, Number of Word = 3000, Reduced Dimention = 500, Word3Vec size = 500

Count-based Reduced


[('mother', 0.9999999999999998),
 ('father', 0.8140425061064158),
 ('brother', 0.7550573380088922),
 ('sister', 0.7314683289735222),
 ('wife', 0.7021246708442017),
 ('death', 0.6833624612820793),
 ('parents', 0.660287021048755),
 ('writings', 0.65710866333699),
 ('successor', 0.657003946296038),
 ('life', 0.6241045227685684)]

Modified Word2Vec


[('mother', 1.0),
 ('father', 0.34551578760147095),
 ('daughter', 0.2779615521430969),
 ('wife', 0.21182197332382202),
 ('son', 0.17281436920166016),
 ('child', 0.15633940696716309),
 ('parents', 0.13318002223968506),
 ('sister', 0.11209118366241455),
 ('brother', 0.07305717468261719),
 ('friend', 0.05141681432723999)]

----------------------------
### Answer:

We tried to set up comparisons based on algebraic manipulation of `queen`, `king`, `man`, and `woman`. However, we did not find the word `woman` in the count-based model, so we designed another experiment: we compare how well the models retrieve words that are semantically similar to the word `mother`.

We set the length of the word embeddings of the reduced count-based model to 300 (instead of 5), the same as for Word2Vec. The original configuration has a relatively small vocabulary and context window. In the modified configuration, we increased the vocabulary to 3000 unique words, the context window size to 5, and the embedding length to 500.

In both cases, the Word2Vec model performs better than the count-based model: Word2Vec retrieved more words that are semantically related to our target word `mother`.

For the count-based model, the performance did not change noticeably with the new configuration. On the other hand, the performance of the Word2Vec model improved when we applied the new settings. All words retrieved by Word2Vec under the new configuration are semantically related to the word `mother`.

We noticed that with a small context window, some words that are syntactically rather than semantically related to our target word appear, such as `her` and `works`. No such words appeared when we increased the window size.

----


---
---AE: Some comments: the experiments you do are reasonable, but I'm not sure about the comparison between word2vec and count-based methods in this regard. Because word2vec and count-based methods are very different methods, using the same number of dimensions is not very meaningful. The dimensions in word2vec indicate "anonymous features", while in count-based methods the features are not anonymous, they are distinct words. Thus, reasonably, the two different methods would have different "optimal" numbers of dimensions. My main point is that "dimensions" have different meanings in the two methods, so trying to equate them with each other is not very indicative of anything.

In general, it is hard to compare the methods using the same type of modification. Something which has roughly the same meaning in the two models is "window size", so only changing that would be a good contender for comparison (between count-based and word2vec). However, if you compare two different word2vec models, changing the dimension size makes sense (since it has the same meaning in both models).

But the experiment you are doing (only using woman) is interesting and much simpler (which is good!). Marks=3
    
---

## 4. Testing semantic similarity

The file `similarity_judgements.txt` (a copy is included with this notebook or you can download it from [here](https://linux.dobnik.net/oc/index.php/s/9NTlpOJfPWGS56t/download?path=%2Flab4-distributional-data&files=similarity_judgements.txt.zip)) contains 7,576 pairs of words and their lexical and visual similarities (based on the pictures) collected in an online crowd-sourcing data collection using Mechanical Turk as described in [1]. The scores range from 1 (highly dissimilar) to 5 (highly similar).

The following code will import them into python lists below:

In [161]:
word_pairs = [] # test suite word pairs
semantic_similarity = [] 
visual_similarity = []
test_vocab = set()

filepath = "/home/guszarzmo@GU.GU.SE/LT2213-v20/lab1/lt2213-lab-1-group-3/zarzouram/problem-set-3/pretrained/similarity_judgements.txt"

# for index, line in enumerate(open('similarity_judgements.txt')):
for index, line in enumerate(open(filepath)):
    data = line.strip().split('\t')
    if index > 0 and len(data) == 3:
        w1, w2 = tuple(data[0].split('#'))
        # it will check if both words from each pair exist in the word matrix.
        if w1 in space_1k and w2 in space_1k:
            word_pairs.append((w1, w2))
            test_vocab.update([w1, w2])
            semantic_similarity.append(float(data[1]))
            visual_similarity.append(float(data[2]))
        
print("number of available words to test:", len(test_vocab-(test_vocab-set(ktw))))
print("number of available word pairs to test:", len(word_pairs))
gold_standard = list(zip(word_pairs, visual_similarity, semantic_similarity))
print(gold_standard[0:3])

number of available words to test: 10
number of available word pairs to test: 13
[(('book', 'bureau'), 1.0, 1.4), (('church', 'radio'), 1.0, 1.0), (('book', 'house'), 1.2, 1.4)]


Now we can test how the cosine similarity between vectors of each of the five spaces compares with the human judgements on the words collected in the previous step. Which of the five spaces best approximates human judgements?

For comparison of several scores we can use [Spearman correlation coefficient](https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient) which is implemented in `scipy.stats.spearmanr` [here](https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.spearmanr.html). The values of the Spearman correlation coefficient range from -1 to 1, where 0 indicates no correlation, 1 perfect positive correlation and -1 perfect negative correlation. Hence, the greater the number the better. The $p$-value tells us whether the coefficient is statistically significant. For this to be the case, it must be below the conventional threshold of $0.05$.

Here is how you can calculate Spearman's correlation coefficient between the scores of visual similarity and semantic similarity of the available words in the test suite:

In [160]:
from scipy import stats

rho, pval = stats.spearmanr(semantic_similarity, visual_similarity)
print("""Visual Similarity vs. Semantic Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))


Visual Similarity vs. Semantic Similarity:
rho     = 0.8136
p-value = 0.0007
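
Spearman's coefficient depends only on the rank order of the two lists, which is why human ratings on a 1 to 5 scale can be compared directly with cosine scores on a different scale. The following from-scratch sketch computes it as the Pearson correlation of the ranks, with ties given their average rank. It is meant only to illustrate the idea and does not compute a $p$-value; the function names and the toy scores are assumptions, and in the notebook itself `scipy.stats.spearmanr` should be used.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman_rho(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical scores: different scales, same ordering
human_scores = [1.0, 1.4, 2.5, 3.2, 4.5]
model_scores = [0.01, 0.02, 0.30, 0.31, 0.99]

rho_same_order = spearman_rho(human_scores, model_scores)
rho_reversed = spearman_rho(human_scores, model_scores[::-1])
print(rho_same_order, rho_reversed)
```

Because only the ordering matters, the first comparison yields a coefficient of (numerically) 1 even though the two lists have very different value ranges, and reversing one list yields -1.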


Let's now calculate the cosine similarity scores of all word pairs in an ordered list using all five spaces. **[2 marks]**

In [198]:
# Similarity function based on cosine distance
mysim_fn = lambda a, b: 1-distance.cosine(a, b)

raw_similarities  = [mysim_fn(space_10k[w1], space_10k[w2]) for w1, w2 in word_pairs]
ppmi_similarities = [mysim_fn(ppmispace_10k[w1], ppmispace_10k[w2])  for w1, w2 in word_pairs]
svd_similarities  = [mysim_fn(svdspace_10k[w1], svdspace_10k[w2]) for w1, w2 in word_pairs]

w2v_similarities     = [mysim_fn(w2v_space[w1], w2v_space[w2]) for w1, w2 in word_pairs]
w2v_svd_similarities = [mysim_fn(w2v_svd_space[w1], w2v_svd_space[w2]) for w1, w2 in word_pairs]

---
---AE: Marks=2
    
---

Now, calculate correlation coefficients between the lists of similarity scores and the real semantic similarity scores from the experiment. Which model's scores correlate best with them? Is this expected? **[4 marks + 2 marks]**

In [203]:
# Your code should go here...
similarities = [("raw", raw_similarities), ("ppmi", ppmi_similarities), ("svd", svd_similarities), ("w2v", w2v_similarities), ("w2v_svd", w2v_svd_similarities)]

print("{:-^10} {:^10s} {:^13s}".format("Similarity", "rho", "p-value"))
for name, similarity in similarities:
    rho, pval = stats.spearmanr(similarity, semantic_similarity)
    print("{:<10} {:^10.4f} {:^13.4f}".format(name, rho, pval))

Similarity    rho        p-value   
raw          0.3024      0.3153    
ppmi         0.7501      0.0031    
svd          0.6513      0.0159    
w2v          0.6978      0.0080    
w2v_svd      0.6716      0.0119    


----------------------------
### Answer:


Reducing the dimensions of the count-based model increases the correlation (from 0.30 for raw counts to 0.65 with SVD). This suggests that SVD successfully removed noise from the term-term matrix.

PPMI has a higher correlation than the raw count-based model. This is expected, as PPMI addresses a problem of the raw count-based model: raw counts are not descriptive and are skewed. Words that do not convey information but occur a lot (such as "the" and "of") receive a high weight. PPMI normalizes the probability of two words (`a`, `b`) co-occurring by the probability of observing each of them individually. For example, the PPMI of "the dog" will be penalized for containing the word "the", while the PPMI of "old dog" will not. In that sense, PPMI performed better than the raw count-based model.

Surprisingly, PPMI, which is based on word counting, outperforms the prediction-based model (Word2Vec).

---


---
---AE: Marks=6
    
---

We can also calculate correlation coefficients between lists of cosine similarity scores and the real visual similarity scores from the experiment. Which similarity model best correlates with them? How do the correlation coefficients compare with those from the previous comparison, and can you speculate why we get such results? **[2 marks + 6 marks]**

In [204]:
# Your code should go here...
similarities = [("raw", raw_similarities), ("ppmi", ppmi_similarities), ("svd", svd_similarities), ("w2v", w2v_similarities), ("w2v_svd", w2v_svd_similarities)]

print("{:-^10} {:^10s} {:^13s}".format("Similarity", "rho", "p-value"))
for name, similarity in similarities:
    rho, pval = stats.spearmanr(similarity, visual_similarity)
    print("{:<10} {:^10.4f} {:^13.4f}".format(name, rho, pval))

Similarity    rho        p-value   
raw          0.5882      0.0345    
ppmi         0.8445      0.0003    
svd          0.7571      0.0027    
w2v          0.7396      0.0039    
w2v_svd      0.6989      0.0079    


**Your answer should go here:**

The correlations with the visual similarity scores are higher than those with the semantic similarity scores for every model, especially for the raw count-based model (rho rises from 0.30 to 0.59). PPMI again correlates best (rho = 0.84). If we look at the table below, we find that a word pair can have a fairly high semantic similarity while its visual similarity is low.  

| Word 1  | Word 2 | Semantic similarity | Visual similarity |
|:-------:|:------:|:-------------------:|:-----------------:|
| chimp   | horse  |        3.20         |       1.40        |
| pajamas | socks  |        3.20         |       1.40        |
| cat     | rabbit |        4.50         |       2.75        |


We suspected that the raw count-based model's failure to capture semantic similarity, combined with the relatively high rate of disagreement between the semantic and visual similarity scores, led to this increase in `rho` values for the raw count-based model. However, our further analysis (see the code below) indicates that this comparison is not indicative. The number of samples is too low, and all of the samples except one belong to a single category (no semantic similarity and no visual similarity).

The raw count-based model fails to capture the similarities for nearly all samples. Yet there is a noticeable improvement in its correlation when we compare against the visual similarity. PPMI fails to capture one sample; however, this sample forms a class of its own.

Thus, with the current experimental setup, we believe that these correlation values cannot be used to evaluate the quality of the models under test.

In [253]:
# Print the model similarities next to the human visual and semantic scores for each word pair
print("{:-^44}".format("Types of similarity"))
print("{:^11}{:^11}{:^11}{:^11}".format("raw", "ppmi", "visual", "semantic"))
rows = zip(raw_similarities, ppmi_similarities, visual_similarity, semantic_similarity)
for raw, ppmi, visual, semantic in rows:
    print("{:^11.2f}{:^11.2f}{:^11.2f}{:^11.2f}".format(raw, ppmi, visual, semantic))


------------Types of similarity-------------
    raw       ppmi      visual    semantic  
   0.60       0.02       1.00       1.40    
   0.74       0.09       1.00       1.00    
   0.96       0.15       1.20       1.40    
   0.92       0.20       4.75       4.25    
   0.97       0.15       1.20       1.40    
   0.85       0.10       1.00       1.00    
   0.87       0.13       1.20       1.40    
   0.97       0.18       1.50       1.75    
   0.97       0.22       1.20       1.40    
   0.96       0.11       1.20       1.20    
   0.86       0.13       1.00       1.40    
   0.95       0.19       2.40       2.00    
   0.93       0.10       1.00       1.00    
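
With only 13 pairs, a single pair may move the coefficient noticeably. One rough way to check this is to recompute Spearman's rho while leaving out one pair at a time. The sketch below is an illustration under stated assumptions: the `raw` and `visual` values are copied from the printed table above, rounded to two decimals, so the results can differ slightly from those computed on the full-precision scores, and the helper functions (`average_ranks`, `pearson`, `spearman`, `leave_one_out`) are hypothetical stand-ins written in plain Python rather than the `stats.spearmanr` call used in the lab.

```python
from math import sqrt

# Values copied from the printed table above (rounded to two decimals)
raw = [0.60, 0.74, 0.96, 0.92, 0.97, 0.85, 0.87, 0.97, 0.97, 0.96, 0.86, 0.95, 0.93]
visual = [1.00, 1.00, 1.20, 4.75, 1.20, 1.00, 1.20, 1.50, 1.20, 1.20, 1.00, 2.40, 1.00]


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def leave_one_out(x, y):
    """Spearman's rho recomputed with each pair removed in turn."""
    return [spearman(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(len(x))]


full_rho = spearman(raw, visual)
loo = leave_one_out(raw, visual)
print("rho on all pairs: {:.4f}".format(full_rho))
print("leave-one-out range: {:.4f} to {:.4f}".format(min(loo), max(loo)))
```

A wide leave-one-out range would support the concern that the coefficients depend on a handful of pairs rather than on a stable trend.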


---
---AE: Good analysis, and you are correct, there are very few data points so the generalizability of the models is low. Marks=8
    
---

## 5. Discussion

What are the limitations of our approach in this lab? Suggest three ways in which the results could be improved. **[6 marks]**

**Your answer should go here:**

1. Text preprocessing
    1. Removing function words: words like `the`, `a`, etc. are not indicative of meaning.
    2. Lemmatization: the co-occurrences `old dog` and `old dogs` should be counted as the same.
2. Use a larger corpus. Many word pairs in the human judgment data are not found in the corpus, which limits our ability to evaluate our models properly. Pretrained word embeddings could be used as well.
3. Instead of computing the correlation on a continuous scale of human judgments, convert the continuous scores into categories; how meaningful is the difference between 4.25 and 4.75? The same can be done for the model output. A fuzzy membership function could handle values that lie on the border between two categories.
4. Dimensionality reduction:
    1. Consider a dimensionality reduction method other than SVD for the count-based model.
    2. Select the number of reduced dimensions that gives the lowest error when the original vector is reconstructed from the reduced vector.
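
The last suggestion can be sketched with NumPy's SVD. Because the reconstruction error can only shrink as more dimensions are kept, the sketch picks the smallest number of dimensions whose relative error falls below a tolerance; that stopping rule, the tolerance value, the function names and the random toy matrix are all assumptions made for illustration, not part of the lab code.

```python
import numpy as np


def reconstruction_errors(matrix, max_k):
    """Relative Frobenius error of the rank-k SVD reconstruction, for k = 1..max_k."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    norm = np.linalg.norm(matrix)
    errors = {}
    for k in range(1, max_k + 1):
        # Keep only the k largest singular values and their vectors
        approx = (u[:, :k] * s[:k]) @ vt[:k, :]
        errors[k] = np.linalg.norm(matrix - approx) / norm
    return errors


def smallest_k_below(errors, tolerance):
    """Smallest k whose error is within the tolerance (largest k if none is)."""
    for k in sorted(errors):
        if errors[k] <= tolerance:
            return k
    return max(errors)


# Toy stand-in for a word space: roughly rank 3 plus a little noise
rng = np.random.default_rng(0)
toy_space = rng.random((20, 3)) @ rng.random((3, 15)) + 0.01 * rng.random((20, 15))

errors = reconstruction_errors(toy_space, max_k=10)
chosen_k = smallest_k_below(errors, tolerance=0.05)
for k, err in errors.items():
    print("k={:<3d} relative error={:.4f}".format(k, err))
print("chosen k:", chosen_k)
```

A low reconstruction error does not guarantee better similarity judgments, so the chosen number of dimensions would still need to be checked against the human scores.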

---
---AE: Reasonable suggestions! Although I'm not sure what you mean in (3): the difference between 4.25 and 4.75 is precisely 0.5, whatever "category" the words are in. So a small difference between the similarity judgments means there is not much difference, while a large gap indicates a large difference. This approach has problems of its own, for example that people imagine different things when "visually" comparing dog and cat: if a human imagines both the "dog" and the "cat" as black, they are quite similar, but in shape or size they are typically very different. 

Marks=6
    
---

---
---AE: 43 points out of 45. I really liked your enthusiasm and analysis in the lab, good job! Something to consider in general: when you are proposing some "new" type of analysis or method, motivating the method is very important. Otherwise, someone reading it will have a hard time finding the latent factors behind the development, and assessing something without grasping the ins and outs of the method is rather dangerous haha

Best,
Adam
    
---

# Literature

[1] C. Silberer and M. Lapata. Learning grounded meaning representations with autoencoders. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 721–732, Baltimore, Maryland, USA, June 23–25, 2014. Association for Computational Linguistics.

[2] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.

[3] O. Levy, Y. Goldberg, and I. Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225, 2015.
