# Vector Semantics and Semantic Composition

Nikolai Ilinykh (with modifications of notebooks by Katrin Erk)

Here, we are going to look at  
    (i) how a word can be represented through the counts of its neighboring words,  
    (ii) how we can construct different semantic compositions for the phrase similarity task,  
    (iii) how we can reduce the computation space of our meaning representations

## 1. Choose your task dataset and your reference corpus

The task dataset is basically the reason why we are computing word vectors. We are going to work with the dataset of phrase similarity, proposed in [1].  
The idea is to build a model that is able to capture differences between different combinations of intransitive verbs and different nouns.  
These phrases can be of high and low similarity.  
In addition, we have human ratings of how similar/dissimilar phrases are.  

The reference phrase is a combination of a noun and intransitive verb. Two landmark phrases (with high and low similarity) are paired with the reference phrase to produce a single item. Example:

![title](lapataexample.png)

Reference phrase: the fire glowed (noun + reference)

Landmark phrases:  
High similarity phrase: the fire burned (noun + verb from high) (this phrase is HIGHLY similar to the reference phrase)  
Low similarity phrase: the fire beamed (noun + verb from low) (this phrase is NOT REALLY similar to the reference)

We need to construct semantic space for our task dataset. The best solution would be to use the dataset used in [1], which is British National Corpus.  
But for simplicity and purpose of this lecture, we are going to work with Gutenberg project, offered by nltk.

Let's get our Gutenberg corpus and import necessary packages!

In [232]:
from nltk.corpus import stopwords
import nltk
import string
import numpy as np
import math

import pickle
import glob
import json
import tqdm

gutenberg_files = glob.glob('./lecture7-code-vt2017/gutenberg/*.txt')
gutenberg_files = gutenberg_files

In [233]:
gutenberg_files

['./lecture7-code-vt2017/gutenberg/blake-poems.txt',
 './lecture7-code-vt2017/gutenberg/carroll-alice.txt',
 './lecture7-code-vt2017/gutenberg/whitman-leaves.txt',
 './lecture7-code-vt2017/gutenberg/milton-paradise.txt',
 './lecture7-code-vt2017/gutenberg/bible-kjv.txt',
 './lecture7-code-vt2017/gutenberg/austen-persuasion.txt',
 './lecture7-code-vt2017/gutenberg/melville-moby_dick.txt',
 './lecture7-code-vt2017/gutenberg/edgeworth-parents.txt',
 './lecture7-code-vt2017/gutenberg/chesterton-thursday.txt',
 './lecture7-code-vt2017/gutenberg/burgess-busterbrown.txt',
 './lecture7-code-vt2017/gutenberg/austen-emma.txt',
 './lecture7-code-vt2017/gutenberg/chesterton-brown.txt',
 './lecture7-code-vt2017/gutenberg/shakespeare-hamlet.txt',
 './lecture7-code-vt2017/gutenberg/austen-sense.txt',
 './lecture7-code-vt2017/gutenberg/shakespeare-macbeth.txt',
 './lecture7-code-vt2017/gutenberg/bryant-stories.txt']

In [234]:
def preprocess(s):
    '''
    split text into words, lowercase them, remove punctuation and stopwords
    we need to keep words which makes sense and can be informative for our task
    '''
    return [w.lower().strip(string.punctuation) for w in s.split() if w.lower() not in stopwords.words('english')]

# testing our function
preprocess("I am making a very important test.!")

['making', 'important', 'test']

In [4]:
def do_word_count(corpus):
    '''
    let's count all words in our texts
    we will need this information to ignore the least frequent words
    why? least frequent words are typically not that informative
    '''
    word_count = nltk.FreqDist()
    for filename in corpus:
        print('reading file', filename)
        with open(filename, 'r') as f1:
            text = f1.read()
            word_count.update(preprocess(text))
    return word_count

In [5]:
dataset = do_word_count(gutenberg_files)

reading file ./lecture7-code-vt2017/gutenberg/blake-poems.txt
reading file ./lecture7-code-vt2017/gutenberg/carroll-alice.txt
reading file ./lecture7-code-vt2017/gutenberg/whitman-leaves.txt
reading file ./lecture7-code-vt2017/gutenberg/milton-paradise.txt
reading file ./lecture7-code-vt2017/gutenberg/bible-kjv.txt
reading file ./lecture7-code-vt2017/gutenberg/austen-persuasion.txt
reading file ./lecture7-code-vt2017/gutenberg/melville-moby_dick.txt
reading file ./lecture7-code-vt2017/gutenberg/edgeworth-parents.txt
reading file ./lecture7-code-vt2017/gutenberg/chesterton-thursday.txt
reading file ./lecture7-code-vt2017/gutenberg/burgess-busterbrown.txt
reading file ./lecture7-code-vt2017/gutenberg/austen-emma.txt
reading file ./lecture7-code-vt2017/gutenberg/chesterton-brown.txt
reading file ./lecture7-code-vt2017/gutenberg/shakespeare-hamlet.txt
reading file ./lecture7-code-vt2017/gutenberg/austen-sense.txt
reading file ./lecture7-code-vt2017/gutenberg/shakespeare-macbeth.txt
reading

In [6]:
# save our dataset (good to keep files)
#import pickle
pickle.dump(dataset, open('./gutenberg_dataset.txt', 'wb'))

In [235]:
our_dataset = pickle.load(open('./gutenberg_dataset.txt', 'rb'))

In [237]:
reference_corpus = our_dataset

In [238]:
reference_corpus

FreqDist({'shall': 11499, 'unto': 9010, 'said': 8738, 'lord': 8387, 'thou': 6622, 'one': 5724, 'him': 5713, 'thy': 5544, 'god': 5028, 'it': 4868, ...})

Since our task dataset is different from reference corpus, we need to make sure that all words from the task dataset can be found in our reference corpus.  
Otherwise, we will not be able to build vectors based on co-occurrences simply because our words of interest have never appeared in the reference texts.

In [240]:
# load the task dataset
with open('./mitchell_lapata_acl08.txt', 'r') as f:
    phrase_dataset = f.read().splitlines()

for line in phrase_dataset[:10]:
    print(line)
    
# get all unique words
words = []
for line in phrase_dataset[1:]:
    _, verb, noun, landmark, _, _ = line.split()
    if verb not in words:
        words.append(verb)
    if noun not in words:
        words.append(noun)
    if landmark not in words:
        words.append(landmark)

participant verb noun landmark input hilo
participant20 stray thought roam 7 low
participant20 stray discussion digress 6 high
participant20 stray eye roam 7 high
participant20 stray child digress 1 low
participant20 throb body pulse 5 high
participant20 throb head shudder 2 low
participant20 throb voice shudder 3 low
participant20 throb vein pulse 6 high
participant20 chatter machine click 4 high


participant20 throb body pulse 5 high

participant id = participant 20
the task for the human: evaluate landmark phrase and reference phrase and say on the scale from 1 to 7, how similar they are? 1 is not similar, 7 is very similar

reference phrase: body throb
landmark phrase: budy pulse (this phrase is HIGH in similarity to the reference, and we expect human to rate it accordingly)

now, human gives us rating of 5; kind of high


input: HUMAN judgements (1 - 7)
hilo (high/low): some known truth about similarity between these phrases


In [241]:
# simply check if all words that we have in our task dataset can be found in the reference corpus (the result should return nothing)
to_remove = []
for w in words:
    if w not in our_dataset:
        print(w)
        to_remove.append(w)
# if something is not found, makes sense to ignore phrases with such non-present words

ricochet
flick
slump
erupt
export
fluctuate


In [11]:
# how many words do we have before cleaning?
print(len(words))

95


In [242]:
# cleaning the task dataset (we might call it phrase dataset from now on)
# we are removing all phrases which contain non-found words
# this would probably remove other words as well (those, which are paired with the non-found words)

cleaned_phrase_dataset = []
for line in phrase_dataset:
    _, verb, noun, landmark, _, _ = line.split()
    if verb in to_remove or noun in to_remove or landmark in to_remove:
        continue
    cleaned_phrase_dataset.append(line)

target_words = []
for line in cleaned_phrase_dataset[1:]:
    _, verb, noun, landmark, _, _ = line.split()
    if verb not in target_words:
        target_words.append(verb)
    if noun not in target_words:
        target_words.append(noun)
    if landmark not in target_words:
        target_words.append(landmark)

In [243]:
# how many words do we have after cleaning?
len(target_words)

80

now our task dataset (phrase similarity dataset) matches our semantic space: all words in the task dataset OCCUR in our semantic space

In [14]:
all_words = []
for filename in gutenberg_files:
    print('reading file', filename)
    with open(filename, 'r') as f1:
        for line in f1:  
            words = [w.lower().strip(string.punctuation) for w in line.split()]
            if words != []:
                for elem in words:
                    all_words.append(elem)

reading file ./lecture7-code-vt2017/gutenberg/blake-poems.txt
reading file ./lecture7-code-vt2017/gutenberg/carroll-alice.txt
reading file ./lecture7-code-vt2017/gutenberg/whitman-leaves.txt
reading file ./lecture7-code-vt2017/gutenberg/milton-paradise.txt
reading file ./lecture7-code-vt2017/gutenberg/bible-kjv.txt
reading file ./lecture7-code-vt2017/gutenberg/austen-persuasion.txt
reading file ./lecture7-code-vt2017/gutenberg/melville-moby_dick.txt
reading file ./lecture7-code-vt2017/gutenberg/edgeworth-parents.txt
reading file ./lecture7-code-vt2017/gutenberg/chesterton-thursday.txt
reading file ./lecture7-code-vt2017/gutenberg/burgess-busterbrown.txt
reading file ./lecture7-code-vt2017/gutenberg/austen-emma.txt
reading file ./lecture7-code-vt2017/gutenberg/chesterton-brown.txt
reading file ./lecture7-code-vt2017/gutenberg/shakespeare-hamlet.txt
reading file ./lecture7-code-vt2017/gutenberg/austen-sense.txt
reading file ./lecture7-code-vt2017/gutenberg/shakespeare-macbeth.txt
reading

Now, let's compute the semantic space for our target words!

In [15]:
def compute_space(window_size, corpus):
    '''
    this function builds semantic space for each word in the corpus
    the space is limited by the window_size : how many words on both sides of the target word should be used
    the space is counting frequency of context words, e.g. the idea of co-occurence
    '''
    
    space = nltk.ConditionalFreqDist()
    #
    for index in tqdm.tqdm(range(len(corpus))):
        # current word
        current = corpus[index]
        if current in target_words:
        
            # get future context for the first word only
            # because there is no past context for the first word
            if index == 0:
                for cxword_index_after in range(window_size):
                    cxword_after = corpus[cxword_index_after + 1]
                    space[current].update([cxword_after])      
            # context before and after the current word: specified by context size
            if index > 0:
                # extract words within the context window for the current word that occur BEFORE
                for cxword_index_before in range(max(index - window_size, 0), index):
                    # range is inclsuive the first value but exclusive the second
                    cxword_before = corpus[cxword_index_before]
                    # In a ConditionalFreqDist, if 'current' is not a condition yet,
                    # then accessing it creates a new empty FreqDist for 'current'
                    # The FreqDist method inc() increments the count for the given item by one.
                    space[current].update([cxword_before])

                # extract words within the context window for the current word that occur AFTER
                if index + window_size < len(corpus):
                    for cxword_index_after in range(index, max(index + window_size, 0)):
                        cxword_after = corpus[cxword_index_after + 1]
                        space[current].update([cxword_after])

                # if window_size AFTER exceeds the length of the corpus (needed for the words in the end)
                else:
                   for cxword_index_after in range(index + 1, len(corpus)):
                        cxword_after = corpus[cxword_index_after]
                        space[current].update([cxword_after])

    return space

In [16]:
sp = compute_space(5, all_words)

100%|██████████| 2033185/2033185 [00:02<00:00, 708980.67it/s]


In [17]:
sp

<ConditionalFreqDist with 80 conditions>

In [18]:
target_words[:10]

['stray',
 'thought',
 'roam',
 'discussion',
 'digress',
 'eye',
 'child',
 'throb',
 'body',
 'pulse']

In [19]:
print('stray:\n', sp['stray'].most_common(10), '\n')
print('thought:\n', sp['thought'].most_common(10), '\n')
print('discussion:\n', sp['discussion'].most_common(10), '\n')

stray:
 [('the', 8), ('from', 5), ('to', 5), ('me', 3), ('and', 3), ('of', 3), ('or', 3), ('a', 3), ('might', 2), ('out', 2)] 

thought:
 [('i', 498), ('the', 481), ('of', 421), ('and', 339), ('he', 317), ('to', 316), ('it', 276), ('a', 246), ('that', 223), ('she', 209)] 

discussion:
 [('of', 12), ('and', 9), ('the', 8), ('to', 5), ('in', 4), ('a', 4), ('one', 4), ('it', 3), ('their', 3), ('an', 3)] 



The problem: a lot of context words do not carry much sense in them  
What happens if we increase the window_size?

In [20]:
print('stray:\n', sp['stray'].most_common(100), '\n')
print('thought:\n', sp['thought'].most_common(100), '\n')
print('discussion:\n', sp['discussion'].most_common(100), '\n')

stray:
 [('the', 8), ('from', 5), ('to', 5), ('me', 3), ('and', 3), ('of', 3), ('or', 3), ('a', 3), ('might', 2), ('out', 2), ('with', 2), ('you', 2), ('side', 2), ('into', 2), ('near', 2), ('wish', 1), ('church', 1), ('then', 1), ('parson', 1), ('preach', 1), ('them', 1), ('that', 1), ('would', 1), ('pedler', 1), ('sweats', 1), ('his', 1), ('yet', 1), ('who', 1), ('can', 1), ('i', 1), ('follow', 1), ('through', 1), ('groves', 1), ('coral', 1), ('sporting', 1), ('quick', 1), ('glance', 1), ('thy', 1), ('henceforth', 1), ("where'er", 1), ('our', 1), ("day's", 1), ('work', 1), ('lies', 1), ('purpose', 1), ('popping', 1), ('off', 1), ('narwhales', 1), ('vagrant', 1), ('sea', 1), ('unicorns', 1), ('whenever', 1), ('oar', 1), ('bit', 1), ('plank', 1), ('fast', 1), ('lest', 1), ('guinea-hen', 1), ('fall', 1), ('again', 1), ('only', 1), ('four', 1), ('finally', 1), ('last', 1), ('merry-maker', 1), ('ran', 1), ('house', 1), ('not', 1), ('understand', 1), ('let', 1), ('little', 1), ('too', 1), 

We can remove stop words since they are not adding anything useful to our representations.

In [21]:
#nltk.download('stopwords')

filtered_words = [word for word in all_words if word.lower() not in stopwords.words('english')]

print('excluding stopwords from the semantic space...')

sp2 = compute_space(5, filtered_words)

  5%|▌         | 52501/1002514 [00:00<00:01, 524977.96it/s]

excluding stopwords from the semantic space...


100%|██████████| 1002514/1002514 [00:02<00:00, 463026.98it/s]


In [22]:
#pickle.dump(sp2, open('./vector-space-sp2.p', 'wb'))

In [26]:
sp2 = pickle.load(open('vector-space-sp2.p', 'rb'))

In [27]:
print('stray:\n', sp2['stray'].most_common(50), '\n')
print('thought:\n', sp2['thought'].most_common(50), '\n')
print('discussion:\n', sp2['discussion'].most_common(50), '\n')

stray:
 [('side', 3), ('might', 2), ('sun', 2), ('hands', 2), ('little', 2), ('near', 2), ('livelong', 1), ('day', 1), ('ever', 1), ('wish', 1), ('church', 1), ('parson', 1), ('preach', 1), ('drink', 1), ('sing', 1), ('drover', 1), ('watching', 1), ('drove', 1), ('sings', 1), ('would', 1), ('pedler', 1), ('sweats', 1), ('pack', 1), ('back', 1), ('purchaser', 1), ('keep', 1), ('teach', 1), ('straying', 1), ('yet', 1), ('follow', 1), ('whoever', 1), ('present', 1), ('hour', 1), ('words', 1), ('graze', 1), ('sea-weed', 1), ('pasture', 1), ('groves', 1), ('coral', 1), ('sporting', 1), ('quick', 1), ('glance', 1), ('show', 1), ('forth', 1), ('never', 1), ('thy', 1), ('henceforth', 1), ("where'er", 1), ("day's", 1), ('work', 1)] 

thought:
 [('would', 176), ('could', 116), ('said', 106), ('one', 91), ('never', 86), ('little', 77), ('must', 63), ('thought', 60), ('much', 59), ('time', 57), ('well', 57), ('good', 57), ('mr', 57), ('man', 55), ('might', 53), ('alice', 51), ('shall', 50), ('like

What about punctuation? We can also remove punctuation from our corpus.

In [28]:
real_words = [w for w in filtered_words if w not in string.punctuation]
print('excluding punctuation from the semantic space...')

sp3 = compute_space(5, real_words)

 11%|█         | 105940/1001816 [00:00<00:01, 508795.97it/s]

excluding punctuation from the semantic space...


100%|██████████| 1001816/1001816 [00:01<00:00, 544639.56it/s]


In [29]:
print('stray:\n', sp3['stray'].most_common(100), '\n')
print('thought:\n', sp3['thought'].most_common(100), '\n')
print('discussion:\n', sp3['discussion'].most_common(100), '\n')

stray:
 [('side', 3), ('might', 2), ('sun', 2), ('hands', 2), ('little', 2), ('near', 2), ('livelong', 1), ('day', 1), ('ever', 1), ('wish', 1), ('church', 1), ('parson', 1), ('preach', 1), ('drink', 1), ('sing', 1), ('drover', 1), ('watching', 1), ('drove', 1), ('sings', 1), ('would', 1), ('pedler', 1), ('sweats', 1), ('pack', 1), ('back', 1), ('purchaser', 1), ('keep', 1), ('teach', 1), ('straying', 1), ('yet', 1), ('follow', 1), ('whoever', 1), ('present', 1), ('hour', 1), ('words', 1), ('graze', 1), ('sea-weed', 1), ('pasture', 1), ('groves', 1), ('coral', 1), ('sporting', 1), ('quick', 1), ('glance', 1), ('show', 1), ('forth', 1), ('never', 1), ('thy', 1), ('henceforth', 1), ("where'er", 1), ("day's", 1), ('work', 1), ('lies', 1), ('though', 1), ('powder', 1), ('flask', 1), ('shot', 1), ('purpose', 1), ('popping', 1), ('narwhales', 1), ('vagrant', 1), ('sea', 1), ('unicorns', 1), ('infesting', 1), ('feeling', 1), ('flukes', 1), ('whenever', 1), ('oar', 1), ('bit', 1), ('plank', 1)

Let's save our semantic space.

In [30]:
pickle.dump(sp3, open('./vector-space.p', 'wb'))

In [244]:
semantic_space = pickle.load(open('vector-space.p', 'rb'))

print('discussion:\n', semantic_space['discussion'].most_common(100))

discussion:
 [('one', 4), ('book', 2), ('given', 2), ('least', 2), ('could', 2), ('would', 2), ('every', 2), ('must', 2), ('perpetually', 1), ('printed', 1), ("preach'd", 1), ('discussed', 1), ('eludes', 1), ('print', 1), ('put', 1), ('whoever', 1), ('harden', 1), ('nerves', 1), ('sufficiently', 1), ('feel', 1), ('continual', 1), ('crofts', 1), ('business', 1), ('evil', 1), ('assisted', 1), ('however', 1), ('persuasion', 1), ("evening's", 1), ('indulgence', 1), ('subjects', 1), ('usual', 1), ('companions', 1), ('probably', 1), ('concern', 1), ('meet', 1), ('even', 1), ('lady', 1), ('russell', 1), ('merits', 1), ('anne', 1), ('understand', 1), ('nearest', 1), ('police', 1), ('station', 1), ("night's", 1), ('detective', 1), ('know', 1), ('tone', 1), ('talk', 1), ('terrible', 1), ('purport', 1), ('deep', 1), ('actual', 1), ('immediate', 1), ('plot', 1), ('waiter', 1), ('downstairs', 1), ('secretary', 1), ('sorry', 1), ('cut', 1), ('short', 1), ('cultured', 1), ('said', 1), ('colonel', 1),

In [40]:
len(semantic_space['discussion'])

150

We also need to make sure that all words have the same number of dimensions.

In [41]:
columns = {}
for item in target_words:
    # similar to [1],
    # we identify the most frequent co-occurring words for all target words
    # we need to take those, which seem to be appearing very often in different words' context
    fixed_sem_space = dict(semantic_space[item])#.most_common(2000))
    
    for word, freq in fixed_sem_space.items():
        if word not in columns:
            columns[word] = freq
        else:
            if columns[word] < freq:
                columns[word] = freq

In [42]:
sorted_columns = sorted(columns.items(), key=lambda item: item[1], reverse=True)

In [43]:
# every target word will be represented through these dimensions (most frequent and sensible words)
dims = dict(sorted_columns[:2000])

In [45]:
# building our final semantic space for our target words
target_semantic_space = {}
for item in tqdm.tqdm(target_words):
    target_semantic_space[item] = {}
    this_sem_space = dict(semantic_space[item])
    for other_w in dims.keys():
        if other_w not in this_sem_space.keys():
            target_semantic_space[item][other_w] = [0]
        else:
            target_semantic_space[item][other_w] = [this_sem_space[other_w]]

100%|██████████| 80/80 [00:00<00:00, 279.43it/s]


In [46]:
import pandas as pd
df = pd.DataFrame.from_dict(target_semantic_space['fire'])
m2 = (df != 0).all()
print(df.loc[:, m2])

   shall  man  unto  said  every  god  lord  hand  thou  son  ...  edward  \
0    248   22   103    32     21   33   154    26    51   13  ...       1   

   indignation  cease  deeds  half-chick  leaves  rage  prison  gifts  \
0            5      2      1           5       2     2       1      2   

   accepted  
0         1  

[1 rows x 1103 columns]


In [47]:
df = pd.DataFrame.from_dict(target_semantic_space['flame'])
m2 = (df != 0).all()
print(df.loc[:, m2])

   shall  unto  said  every  god  lord  hand  thou  son  one  ...  breast  \
0     18     4     1      1    1     4     1     1    2    3  ...       2   

   friendship  yield  hollow  generations  female  grown  expect  tower  \
0           1      1       1            1       1      1       1      2   

   indignation  
0            1  

[1 rows x 343 columns]


In [48]:
df = pd.DataFrame.from_dict(target_semantic_space['beam'])
m2 = (df != 0).all()
print(df.loc[:, m2])

   shall  man  unto  said  every  hand  thou  one  thy  old  ...  2:11  \
0      3    4     1     2      2     1     8    1    2    2  ...     1   

   plough  farewell  gath  cubits  glance  eternal  canst  pull  latter  
0       3         1     2       1       1        1      2     1       1  

[1 rows x 132 columns]


In [129]:
df = pd.DataFrame.from_dict(target_semantic_space['skin'])
m2 = (df != 0).all()
print(df.loc[:, m2])

   shall  man  unto  said  every  lord  hand  thou  one  old  ...  girded  \
0     39    7     2     7      2     2     2     6    3    2  ...       1   

   canst  raiment  yield  hollow  fit  simply  cover  entire  giant  
0      1        2      1       1    1       4      4       1      5  

[1 rows x 331 columns]


In [50]:
df = pd.DataFrame.from_dict(target_semantic_space['head'])
m2 = (df != 0).all()
print(df.loc[:, m2])

   shall  man  unto  said  every  god  lord  hand  thou  son  ...  edward  \
0    146   47    52   126     34   19    46    55    71   14  ...       3   

   engaged  indignation  latter  half-chick  leaves  rage  prison  giant  \
0        1            1       1           1       5     2       5      2   

   accepted  
0         2  

[1 rows x 1436 columns]


In [53]:
#with open('./target_semantic_space.json', 'w') as f2:
#    json.dump(target_semantic_space, f2)

In [54]:
with open('./target_semantic_space.json', 'r') as f3:
    our_space = json.load(f3)

## 1. Define the task

We want to make a model that is able to see the differences between phrases / sentences, not solely individual words.  
Then, we will need to test this model and check whether its performance correlates with human judgements.

Important steps:  
1. Our semantic space needs to cover a lot of words; this is why it is important to train it on the large corpus, e.g. Gutenberg corpus.  
2. We need the dataset of phrases that have human judgements of how similar/dissimilar these phrases are. It is needed to later see which of our models performs the closest to the way humans deal with the task.  
3. Therefore, we need to decide how we are going to combine representations from single words into a single phrase representation. Since we have every word represented through counts of words from its context, we can do the following operations:  

In [None]:
u = [15, 16, 5, 6] (stands for house, window_size = N)
v = [4, 5, 3, 4] (stands for burn)

house + burn = house burn


![title](simple_additive.png)

![title](simple_multiplicative.png)

![title](combined.png)

alpha, beta, gamma = three variables, which control how much each constituent contributes to the result

if alpha is 0.0, it means that no contribution will be given by u
then higher the variable is, the more contribution the count has

alpha = 0, beta = 0.95, gamma - 0.05
->>>> for 'house burn', if we go with the last method we are saying that meaning of 'house' is not needed, while mostly the meaning of
the verb is needed, and the meaning of noun multiplied with verb is also needed a bit


Let's test these representations and see how different they are!

In [245]:
cleaned_phrase_dataset[-10:]

['participant50 chatter child gabble 6 high',
 'participant50 chatter tooth click 2 high',
 'participant50 reel head whirl 5 high',
 'participant50 reel mind stagger 4 low',
 'participant50 reel industry stagger 5 high',
 'participant50 reel man whirl 3 low',
 'participant50 glow fire beam 7 low',
 'participant50 glow face burn 3 low',
 'participant50 glow cigar burn 5 high',
 'participant50 glow skin beam 7 high']

In [56]:
len(target_words)

80

In [246]:
# the first line is the column name line, we ignore it
column_names = phrase_dataset[0].split()
print(column_names)

dataset = {}
references = []

for line in phrase_dataset[1:]:
    participant_id, reference, noun, landmark, rating, hilo = line.split()
        
    reference_phrase = [noun, reference]

    if reference_phrase not in references:
        references.append(reference_phrase)

    landmark_phrase = [noun, landmark]

    if participant_id not in dataset:
        dataset[participant_id] = []
    else:
        dataset[participant_id].append((reference_phrase, landmark_phrase, rating, hilo))

['participant', 'verb', 'noun', 'landmark', 'input', 'hilo']


In [58]:
#for item in dataset:
#    for judgement in dataset[item]:
#        if judgement[0][0] == 'face':
#            print(judgement)

In [59]:
dataset['participant1'][:10]

[(['discussion', 'stray'], ['discussion', 'digress'], '6', 'high'),
 (['eye', 'stray'], ['eye', 'roam'], '1', 'high'),
 (['child', 'stray'], ['child', 'digress'], '1', 'low'),
 (['head', 'reel'], ['head', 'stagger'], '4', 'low'),
 (['mind', 'reel'], ['mind', 'whirl'], '5', 'high'),
 (['industry', 'reel'], ['industry', 'whirl'], '2', 'low'),
 (['man', 'reel'], ['man', 'stagger'], '5', 'high'),
 (['cigarette', 'flare'], ['cigarette', 'erupt'], '1', 'low'),
 (['eye', 'flare'], ['eye', 'flame'], '2', 'high'),
 (['argument', 'flare'], ['argument', 'erupt'], '6', 'high')]

In [60]:
references[-7:]

[['mind', 'reel'],
 ['head', 'reel'],
 ['man', 'reel'],
 ['fire', 'glow'],
 ['face', 'glow'],
 ['cigar', 'glow'],
 ['skin', 'glow']]

How we are going to combine representations of words into a single phrase representations?  

In [61]:
example_phrase = references[-3]
print(example_phrase)

['face', 'glow']


In [62]:
subject_space = our_space[example_phrase[0]]
verb_space = our_space[example_phrase[1]]

In [250]:
reference = ['face', 'glow']
landmark_high = ['face', 'beam']
landmark_low = ['face', 'burn']

#face glow vs face beam (reference vs high similarity landmark), we also have a human rating
#face glow vs face burn (reference vs low similarity landmark)


## How do we access similarity between two phrases?

The main way in which distributional vectors are used is for estimating similarity between words. The central idea is that words that appear in similar contexts tend to be similar in meaning. So what we need to do to estimate word similarity from distributional vectors is to compute a similarity measure that determines how similar the vectors of two words are. There are many similarity measures, but the one that is used the most is cosine similarity. If we consider the vector of a word as an arrow from the origin, then two words should be similar if their vectors go in roughly the same direction. Cosine similarity measures this as the cosine of the angle between the two vectors. 

$$cosine(u, v) = \frac{\sum_i {u_i v_i}}{|u| |v|}$$

$$|v| = \sqrt{\sum_i v_i^2}$$


In [251]:
##
# similarity measure: cosine
#                           sum_i vec1_i * vec2_i
# cosine(vec1, vec2) = ------------------------------
#                        veclen(vec1) * veclen(vec2)
# where
#
# veclen(vec) = squareroot( sum_i vec_i*vec_i )
#

def veclen(vector):
    return math.sqrt(np.sum(np.square(vector)))

def cosine(vector1, vector2):
    veclen1 = veclen(vector1)
    veclen2 = veclen(vector2)
    if veclen1 == 0.0 or veclen2 == 0.0:
        # one of the vectors is empty, the cosine is 0
        return 0.0
    else:
        # we could also simply do:
        dotproduct = np.dot(vector1, vector2)
        return dotproduct / (veclen1 * veclen2)

In [267]:
def build_phrase_space(phrase, x_names):

    # first we get representations for verb and noun
    subject_space = our_space[phrase[0]]
    verb_space = our_space[phrase[1]]
    # representation for house and burn
    
    representation = np.zeros(len(x_names))

    for n, word in enumerate(x_names.keys()):

        # I get v^Ith element from each of the vectors
        subject_value = subject_space[word][0]
        verb_value = verb_space[word][0]

        #out = subject_value + verb_value
        #out = subject_value * verb_value
        
        # 6 and 0, if we do summation, we are getting 6
        # if we do mulitplication, we are getting 0
        
        #out = subject_value * 0.2 + verb_value * 0.8
        #out = subject_value * 0.0 + verb_value * 0.95 + (0.05 * subject_value * verb_value)

        representation[n] = out

    return representation

In [268]:
reference = ['face', 'glow']
landmark_high = ['face', 'beam']
landmark_low = ['face', 'burn']

ref = build_phrase_space(reference, dims)

lhigh = build_phrase_space(landmark_high, dims)

llow = build_phrase_space(landmark_low, dims)


print(ref, ref.shape)
print(lhigh)
print(llow)

[119.  85.   0. ...   0.   0.   0.] (2000,)
[357. 340.  83. ...   0.   0.   0.]
[11186.   340.  2490. ...     0.     0.     0.]


In [265]:
cosine(ref, lhigh)

0.26726492389325335

In [266]:
cosine(ref, llow)

0.22604548518015988

Now we can test how the cosine similarity between vectors of each of our spaces compares with the human judgements on the words collected in the previous step. Which of the three spaces best approximates human judgements?

For comparison of several scores we can use [Spearman correlation coefficient](https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient) which is implemented in `scipy.stats.spearmanr` [here](https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.spearmanr.html). The values of the Sperman correlation coefficient range from -1, 0 to 1, where 0 indicates no correlation, 1 perfect correaltion and -1 negative correlation. Hence, the greater the number the better. The p values tells us if the coefficient is statistically significant. For this to be the case, it must be less than or equal to $< 0.05$.

Some guidelines for implementing the correlation:

In [204]:
j = {}
j['landmark_high'] = []
j['landmark_low'] = []

for participant in dataset:
    #if participant == 'participant8':
    judgements = dataset[participant]
    for ref, landmark, score, gt in judgements:
        if ['face', 'glow'] == ref and landmark == ['face', 'beam']:
            j['landmark_high'].append(score)
        if ['face', 'glow'] == ref and landmark == ['face', 'burn']:
            j['landmark_low'].append(score)

#dataset['participant1'][-10:]

In [211]:
print(len(j['landmark_high']))
print(len(j['landmark_low']))

34
26


In [228]:
high = np.array([int(elem) for elem in j['landmark_high']])
high

array([7, 7, 6, 2, 6, 4, 5, 7, 3, 4, 6, 5, 5, 7, 6, 6, 7, 6, 6, 2, 5, 5,
       6, 7, 6, 4, 6, 6, 7, 4, 5, 6, 5, 5])

In [230]:
low = np.array([int(elem) for elem in j['landmark_low']])
low

array([5, 7, 3, 3, 4, 4, 5, 2, 3, 1, 1, 7, 2, 5, 4, 1, 4, 1, 6, 7, 4, 7,
       3, 4, 7, 3])

Important: try to correlate model's predictions with each human individually. For example, for model A (simple additive), you would have two conditions: high and low rankings. First, gather human judgements for N-high-rated pairs. Let's say, you have one judgement from a single human for N=5 pairs. For the same pairs, gather model A cosine scores. In the end, tou should have two vectors of same size. Now, compute Spearman correlation on these two vectors and the result will be the correlation between model's predictions and human judgements for highly rated phrases. Similar procedure should be done for pairs with low ratings.  

Question: what would you do to get one single number for *all* human judgements vs. each model?

## Transforming counts to association weights

All words will have high co-occurrence counts with the most frequent context items. In our demo dataset, these are i-PR, the-DT, man-NN, on-CD, could-MD. This will falsely inflate all our similarity estimates. What we want to know instead is how strongly a target word is associated with a context item: Does it appear with the context item more often than we could expect at random? Less often? About as often as we would expect? 

There are multiple options for computing degree of association:

- tf-idf (term frequency / inverse document frequency)
- pointwise mutual information (PMI)
- positive mutual information (PPMI): just change negative PMI values to zero
- local mutual information (LMI)

We do PPMI here. The PMI of a target word t and context item c is defined as:
 
$$PMI(t, c) = log \frac{P(t, c)}{P(t) P(c)}$$

All the probabilities are computed from the table of counts. We need:

- $\#(t, c)$: the co-occurrence count of t with c
- $\#(\_, \_)$: the sum of counts in the whole table, across all targets
- $\#(t, \_)$: the sum of counts in the row of target t
- $\#(\_, c)$: the sum of counts in the column of context item c

Then we have: 

- $P(t, c) = \frac{\#(t, c)}{\#(\_,\_)}$
- $P(t) = \frac{\#(t,\_)}{\#(\_,\_)}$
- $P(c) = \frac{\#(\_,c)}{\#(\_,\_)}$

Here is the code for computing PPMI:

In [174]:
#########
# transform the space using positive pointwise mutual information

# target t, dimension value c, then
# PMI(t, c) = log ( P(t, c) / (P(t) P(c)) )
# where
# P(t, c) = #(t, c) / #(_, _)
# P(t) = #(t, _) / #(_, _)
# P(c) = #(_, c) / #(_, _)
#
# PPMI(t, c) =   PMI(t, c) if PMI(t, c) > 0
#                0 else

def ppmi_transform(space, word):
    
    row_sums = {}
    col_sums = {}
    context_word = {}
    
    pmi_return = {}
    
    #(_, _): overall count of occurrences
    overall = 0
    for _, vectors in space.items():
        for _, f in vectors.items():
            overall += f[0]    
    
    for t, vectors in space.items():
        
        if t == word:
            # #(t, _): for each target word, sum up all its counts.
            # row_sums is a dictionary mapping from target words to row sums
            # how many time t appears in the context
            t_sum = 0
            for _, f in vectors.items():
                t_sum += f[0]
            row_sums[t] = t_sum
            
        # #(_, c): for each context word, sum up all its counts
        # col_sums is a dictionary mapping from context word indices to column sums
        #col_sums = {}
        #for c, f in vectors.items():
        #    col_sums[c] = sum([elem[0] for elem in f])
        #print(col_sums)
        
        # #(_, c): for each context word, sum up all its counts
        # col_sums is a dictionary mapping from context word indices to column sums
        for c, f in vectors.items():
            if c not in col_sums:
                col_sums[c] = f[0]
            else:
                col_sums[c] += f[0]

    #print(col_sums)
    #print(row_sums)
    #print(context_word)
    pmi_return[word] = {}
    
    for context_word, context_sums in col_sums.items():
        target_pmi = np.log2((context_sums / overall) / (row_sums[word] / overall) * (col_sums[context_word] / overall))

        pmi_return[word][context_word] = target_pmi
        
    return pmi_return


In [179]:
for w in target_words:
    
    ppmispace = ppmi_transform(target_semantic_space, w)
    
    for k, v in ppmispace.items():
        print(k)
        for c in list(v)[:5]:
            print(v[c], c)


stray
-0.35357644493511 shall
-2.2496496016922087 man
-1.8865614322739177 unto
-1.2478531298268456 said
-3.1559022722260726 every
thought
-7.2851463147162 shall
-9.181219471473298 man
-8.818131302055008 unto
-8.179422999607935 said
-10.087472142007162 every
roam
1.0412831724061036 shall
-0.8547899843509952 man
-0.491701814932704 unto
0.14700648751436796 said
-1.7610426548848588 every
discussion
-0.5613813300485115 shall
-2.4574544868056107 man
-2.0943663173873195 unto
-1.4556580149402472 said
-3.363707157339474 every
digress
2.988815752511968 shall
1.0927425957548689 man
1.4558307651731601 unto
2.094539067620232 said
0.1864899252210054 every
eye
-5.65091255164008 shall
-7.546985708397179 man
-7.183897538978888 unto
-6.545189236531816 said
-8.453238378931042 every
child
-6.046572506162086 shall
-7.942645662919184 man
-7.579557493500893 unto
-6.940849191053822 said
-8.848898333453048 every
throb
0.6668876576246057 shall
-1.2291854991324933 man
-0.8660973297142021 unto
-0.2273890272671301

## Dimensionality reduction

Dimensionality reduction is a method that does exactly this: It takes a space where each word has a vector of, say, 10,000 dimensions and reduces it to a space where each word has a vector of something like 300 or 500 dimensions, making the space more manageable.

The new dimensions can be seen as groupings (soft clusterings) of the old dimensions, or as latent semantic classes underlying the old dimensions. A popular choice of dimensionality reduction method is singular value decomposition (SVD). SVD involves representing a set of points in a different space (that is, through a new set of dimensions) in such a way that it brings out the underlying structure of the data.

Here is how we can do this in Python.

In [78]:
space_to_reduce = {}
for item in target_words:
    space_to_reduce[item] = np.zeros(2000)
    this_sem_space = dict(semantic_space[item])
    for n, other_w in enumerate(dims.keys()):
        if other_w not in this_sem_space.keys():
            space_to_reduce[item][n] = 0
        else:
            space_to_reduce[item][n] = this_sem_space[other_w]

In [79]:
def svd_transform(space, original_dim, dim_to_keep):
    
    # space is a dictionary mapping words to vectors
    # combine those into a big matrix
    spacematrix = np.empty((len(space.keys()), original_dim))
    rowlabels = sorted(space.keys())

    for index, word in enumerate(rowlabels):
        spacematrix[index] = space[word]

    # start SVD
    umatrix, sigmavector, vmatrix = np.linalg.svd(spacematrix)

    # remove the last few dimensions of u and sigma
    utrunc = umatrix[:, :dim_to_keep]
    sigmatrunc = sigmavector[ :dim_to_keep]

    # new space: U %matrixproduct% Sigma_as_diagonal_matrix   
    newspacematrix = np.dot(utrunc, np.diag(sigmatrunc))

    # transform back to a dictionary mapping words to vectors
    newspace = {}
    for index, word in enumerate(rowlabels):
        newspace[word] = newspacematrix[index]
        
    return newspace


In [80]:
new_space = svd_transform(space_to_reduce, 2000, 10)

In [81]:
new_space['fire']

array([-324.59862697,   93.51490343,  -34.27373355,   76.59531525,
        176.35701049, -125.31640658, -132.07118252,   -8.28644964,
        -47.31506928,  -23.47823178])

In [82]:
def build_phrase_svd_space(phrase, svd_space):

    subject_space = svd_space[phrase[0]]
    verb_space = svd_space[phrase[1]]
        
    representation = np.zeros(len(svd_space))

    #out = subject_space + verb_space
    #out = subject_space * verb_space
    #out = subject_space * 0.2 + verb_space * 0.8
    out = subject_space * 0.0 + verb_space * 0.95 + (0.05 * subject_space * verb_space)

    return out

In [83]:
ref = build_phrase_svd_space(reference, new_space)
lhigh = build_phrase_svd_space(landmark_high, new_space)
llow = build_phrase_svd_space(landmark_low, new_space)
print(ref)
print(lhigh)
print(llow)

[90.44279811 -5.14456791 36.35814754 -2.86711681  4.66359159  2.90481913
 14.41698183 -0.80209509  4.09211922 -0.96749855]
[ 1.75745363e+02  9.59732021e+00  3.82883270e+00 -3.13018079e+00
  8.04708853e+00  1.62232226e-01  1.85807947e+01  1.05341584e+00
 -6.57953719e+00 -2.36594184e+00]
[ 1.57255728e+03  1.02070122e+02 -1.01357118e+02  1.99985536e+01
  2.26357612e+02  1.91022709e+02 -4.64670873e+02 -7.26842922e-01
 -1.09547613e+01 -4.56855382e+00]


In [84]:
cosine(ref, lhigh)

0.9271333248815603

In [85]:
cosine(ref, llow)

0.8024780785051394

# References

1. Mitchell, J., & Lapata, M. (2008). Vector-based Models of Semantic Composition. In Proceedings of ACL-08: HLT (pp. 236–244). Association for Computational Linguistics.