Question 4
----


In [14]:
from collections import defaultdict
import pandas as pd
import numpy as np
from numpy.linalg import norm

np.set_printoptions(suppress=True)

stop_words = ['is', 'a', 'of', 'and']

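def co_occurrence(sentences, window_size):
    """Build a symmetric word-by-word co-occurrence DataFrame.

    Two words co-occur when they are at most `window_size` positions apart
    in the same sentence, after stop words are removed.
    """
    histogram = defaultdict(int)
    vocab = set()
    for sentence in sentences:
        words = [w for w in sentence.split(' ') if w not in stop_words]
        for i, word in enumerate(words):
            vocab.add(word)
            # Look only to the right of the current word; sorting the pair
            # makes the key order-independent, so each pair is counted once.
            rest_window = words[i + 1 : i + 1 + window_size]
            for neighbor_word in rest_window:
                key = tuple(sorted([neighbor_word, word]))
                histogram[key] += 1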

    vocab = sorted(vocab)
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in histogram.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df

**1) Co-occurrence matrix.**

In [15]:
sentences = [
    'John likes NLP',
    'He likes Mary',
    'John likes machine learning',
    'Deep learning is a subfield of machine learning',
    'John wrote a post about NLP and got likes'
]

co_occurrence_df = co_occurrence(sentences, 1)
np.set_printoptions(linewidth=300)
print(co_occurrence_df)

          Deep  He  John  Mary  NLP  about  got  learning  likes  machine  \
Deep         0   0     0     0    0      0    0         1      0        0   
He           0   0     0     0    0      0    0         0      1        0   
John         0   0     0     0    0      0    0         0      2        0   
Mary         0   0     0     0    0      0    0         0      1        0   
NLP          0   0     0     0    0      1    1         0      1        0   
about        0   0     0     0    1      0    0         0      0        0   
got          0   0     0     0    1      0    0         0      1        0   
learning     1   0     0     0    0      0    0         0      0        2   
likes        0   1     2     1    1      0    1         0      0        1   
machine      0   0     0     0    0      0    0         2      1        0   
post         0   0     0     0    0      1    0         0      0        0   
subfield     0   0     0     0    0      0    0         1      0        1   
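
To see where the counts come from, here is a minimal sketch that lists the pairs produced by a single sentence with a window of 1. The helper `window_pairs` is a hypothetical name used only for this illustration; it mirrors the counting loop in `co_occurrence` but is not part of the notebook.

```python
from collections import Counter

STOP_WORDS = {'is', 'a', 'of', 'and'}  # same stop words as the notebook


def window_pairs(sentence, window_size):
    """Count unordered word pairs that are at most `window_size` apart."""
    words = [w for w in sentence.split() if w not in STOP_WORDS]
    pairs = Counter()
    for i, word in enumerate(words):
        # Only look to the right, so every pair is visited exactly once.
        for neighbor in words[i + 1 : i + 1 + window_size]:
            pairs[tuple(sorted((word, neighbor)))] += 1
    return pairs


pairs = window_pairs('Deep learning is a subfield of machine learning', 1)
for pair, count in sorted(pairs.items()):
    print(pair, count)
```

Note that the stop words are dropped before the window is applied, so `learning` and `subfield` count as neighbors even though `is a` sits between them in the original sentence.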

**2) Singular Value Decomposition and eigenvalues.**

In [16]:
co_occurrence_matrix = co_occurrence_df.to_numpy()
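# np.linalg.svd returns V already transposed, so `v` below holds V^T.
u, s, v = np.linalg.svd(co_occurrence_matrix)
print(u, "\n\n", s, "\n\n", v, "\n\n")
# The squared singular values are the eigenvalues of X^T X.
print("eigenvalues:", s**2)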

[[-0.09525618 -0.078084   -0.20409309 -0.30660432  0.10840231  0.06262278 -0.00986875  0.11843932 -0.22723703  0.01384834  0.69775862 -0.52978814  0.        ]
 [-0.16408222 -0.20187228  0.11351611  0.08518163 -0.11456372 -0.16601357  0.0282392  -0.14589432 -0.11169003 -0.22682727 -0.28939733 -0.44915337 -0.70710678]
 [-0.36361678 -0.45759106  0.28167501  0.26909965 -0.1014125  -0.07594124  0.45395368  0.19533118 -0.1045044   0.41073465  0.18797743  0.17719595  0.        ]
 [-0.16408222 -0.20187228  0.11351611  0.08518163 -0.11456372 -0.16601357  0.0282392  -0.14589432 -0.11169003 -0.22682727 -0.28939733 -0.44915337  0.70710678]
 [-0.25987227 -0.18258385  0.23872932  0.1509113   0.43876455  0.17183454 -0.52683758 -0.50615785 -0.14151891  0.12692599  0.11614814  0.11806938 -0.        ]
 [-0.09294414  0.08314097  0.1288352  -0.15246158 -0.53761061  0.54461547 -0.29465061  0.09041336 -0.04576066  0.43260971 -0.19017074 -0.19639523  0.        ]
 [-0.2390606  -0.14409564  0.2034698   0.01369
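
As a sanity check on the `s**2` line, here is a small sketch that compares the squared singular values with the eigenvalues of X^T X computed directly. The 3x3 matrix is a made-up symmetric stand-in, not the co-occurrence matrix from the notebook.

```python
import numpy as np

# Made-up symmetric matrix standing in for a co-occurrence matrix.
x = np.array([[0, 2, 1],
              [2, 0, 0],
              [1, 0, 0]], dtype=float)

u, s, vh = np.linalg.svd(x)

# eigvalsh returns eigenvalues in ascending order; reverse them to match
# the descending order of the singular values.
gram_eigenvalues = np.linalg.eigvalsh(x.T @ x)[::-1]
print(np.allclose(s ** 2, gram_eigenvalues))

# For a symmetric matrix the singular values are the absolute values
# of the matrix's own eigenvalues.
x_eigenvalues = np.linalg.eigvalsh(x)
print(np.allclose(np.sort(np.abs(x_eigenvalues))[::-1], s))
```

So "eigenvalues" here means the eigenvalues of X^T X; the eigenvalues of the symmetric matrix X itself can be negative, while singular values never are.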

**3) Reduced matrix.**

In [17]:
clipped_size = int(0.3 * s.shape[0])
u_tag = u[:, :clipped_size]
s_tag = np.diag(s[:clipped_size])
v_tag = v[:clipped_size, :]
x_tag = np.matmul(np.matmul(u_tag, s_tag), v_tag)
print(x_tag)

[[ 0.12272743 -0.05712672 -0.14543279 -0.05712672 -0.08856279 -0.01858134 -0.06683841  0.46327138  0.1820022   0.28075788 -0.0551122   0.27973583  0.00400971]
 [-0.05712672 -0.00127236 -0.00027247 -0.00127236  0.10323074  0.14471053  0.1053263   0.1820022   0.82116207 -0.1297564   0.01536375  0.03687718  0.22212452]
 [-0.14543279 -0.00027247  0.00711724 -0.00027247  0.24194435  0.33367263  0.24501427  0.36801412  1.86444865 -0.33413496  0.03970333  0.05363304  0.50933345]
 [-0.05712672 -0.00127236 -0.00027247 -0.00127236  0.10323074  0.14471053  0.1053263   0.1820022   0.82116207 -0.1297564   0.01536375  0.03687718  0.22212452]
 [-0.08856279  0.10323074  0.24194435  0.10323074  0.27996985  0.21331318  0.26109289  0.09658246  1.07119889 -0.11271466  0.07521302 -0.0160058   0.30073777]
 [-0.01858134  0.14471053  0.33367263  0.14471053  0.21331318  0.0521477   0.18444068 -0.14367498  0.11859449  0.08351377  0.07640111 -0.06643536  0.0444587 ]
 [-0.06683841  0.1053263   0.24501427  0.10532

One practical advantage is that we need far fewer numbers to express the co-occurrence matrix: instead of the full matrix we store only the truncated factors U', S' and V' (it's like [JPEG compression](https://en.wikipedia.org/wiki/JPEG#JPEG_compression) in a way - we keep the x% most important frequencies).
The real advantage, however, is the reduced dimension, which makes our data easier to work with (e.g. to visualize or compute on) and much "smoother": the values are continuous rather than discrete counts.
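
To put rough numbers on the compression claim, the sketch below counts the stored values and measures what the truncation throws away. It uses a random symmetric 13x13 matrix as a made-up stand-in for our co-occurrence matrix, so only the counts (not the error value) carry over to the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 3, size=(13, 13))
x = (a + a.T).astype(float)  # made-up symmetric stand-in matrix

u, s, vh = np.linalg.svd(x)
k = int(0.3 * s.shape[0])  # same 30% rule as the notebook, gives k = 3
x_k = u[:, :k] @ np.diag(s[:k]) @ vh[:k, :]

# Values needed for the full matrix vs. the truncated factors U', S', V'.
full_count = x.size
truncated_count = u[:, :k].size + k + vh[:k, :].size
print(full_count, truncated_count)

# The Frobenius reconstruction error equals the root of the sum of the
# squared singular values that were dropped.
error = np.linalg.norm(x - x_k, 'fro')
expected = np.sqrt(np.sum(s[k:] ** 2))
print(np.isclose(error, expected))
```

With 13 words and k = 3 the factors take 81 numbers instead of 169; the saving grows quickly with the vocabulary size, since the factors scale with k times the vocabulary rather than with its square.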





**4) Cosine similarity.**

Now we're in the latent space (looking at U'), where every word is described by only 3 features.

In [18]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

john_vector = u_tag[2]
he_vector = u_tag[1]
subfield_vector = u_tag[11]
deep_vector = u_tag[0]
machine_vector = u_tag[9]

print("John-he:", cosine_similarity(john_vector, he_vector))
print("John-subfield:", cosine_similarity(john_vector, subfield_vector))
print("Deep-machine:", cosine_similarity(deep_vector, machine_vector))

John-he: 0.999241482928557
John-subfield: -0.15497560206457914
Deep-machine: 0.9329156098788491


As we can see, our toy model captures semantic similarity to some extent. Since our dataset is so small, the results might not be meaningful, but we used the words `John` and `He` interchangeably (both appear right before `likes`) and our model learned it! This is exciting.

On the other hand, our model knows that the words `John` and `subfield` are not related, because they are really far away from each other in our dataset - only a few other words connect them.
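
The hard-coded row indices used above (`u_tag[2]` for `John`, `u_tag[11]` for `subfield`, ...) depend on the sorted vocabulary order. A possible way to avoid counting rows by hand is to look words up by name. The sketch below uses a tiny made-up `vocab` and `embeddings` array; in the notebook they would presumably be `list(co_occurrence_df.index)` and `u_tag`. The names `row_of` and `word_similarity` are hypothetical.

```python
import numpy as np
from numpy.linalg import norm


def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))


# Made-up stand-ins for the notebook's vocabulary and U' matrix.
vocab = ['Deep', 'He', 'John']
embeddings = np.array([[1.0, 0.0],
                       [0.6, 0.8],
                       [0.6, 0.8]])

# Map each word to its row so we never count rows manually.
row_of = {word: i for i, word in enumerate(vocab)}


def word_similarity(word_a, word_b):
    """Cosine similarity between the embedding rows of two words."""
    return cosine_similarity(embeddings[row_of[word_a]],
                             embeddings[row_of[word_b]])


print(word_similarity('John', 'He'))
print(word_similarity('John', 'Deep'))
```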



**6) Cosine similarity - special case.**

In [21]:
wrote_vector = u_tag[12]
post_vector = u_tag[10]
likes_vector = u_tag[8]
print("wrote-post:", cosine_similarity(wrote_vector, post_vector))
print("likes-likes:", cosine_similarity(likes_vector, likes_vector))

wrote-post: 0.243076522917071
likes-likes: 1.0


This might be a problem, since `wrote` and `post` are semantically related. Maybe we can add more examples with these words so the smoothing introduced by the dimension reduction process won't butcher their similarity.



**7)** We would expect these two words to have a similarity of 1, because they are the same token, and that is what we get. But the corpus outsmarts the model here - we used `likes` with 2 different meanings (a verb in `John likes NLP`, a noun in `got likes`), and the model squeezes both into a single vector. So we're losing information here (it feels like quantization). Maybe we can use a POS-tagged corpus (where the millennial `likes` is a noun) and define an entity also by its tag. This way we will be able to differentiate between the two meanings.
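
A minimal sketch of the POS-tag idea, assuming a hand-tagged toy corpus: the tags below were written by hand for illustration and are not the output of a real tagger. Each entity becomes `word_TAG`, so the verb and the noun get separate vocabulary entries (and would get separate rows in the co-occurrence matrix).

```python
from collections import Counter

# Hand-tagged toy corpus; the tags are assumptions made for illustration.
tagged_sentences = [
    [('John', 'NOUN'), ('likes', 'VERB'), ('NLP', 'NOUN')],
    [('John', 'NOUN'), ('wrote', 'VERB'), ('post', 'NOUN'), ('about', 'ADP'),
     ('NLP', 'NOUN'), ('got', 'VERB'), ('likes', 'NOUN')],
]


def tagged_tokens(sentence):
    """Fuse each word with its tag, e.g. ('likes', 'VERB') -> 'likes_VERB'."""
    return [f'{word}_{tag}' for word, tag in sentence]


pairs = Counter()
for sentence in tagged_sentences:
    tokens = tagged_tokens(sentence)
    for left, right in zip(tokens, tokens[1:]):  # window size 1
        pairs[tuple(sorted((left, right)))] += 1

vocab = sorted({token for pair in pairs for token in pair})
print([token for token in vocab if token.startswith('likes_')])
```

The price is a larger and sparser vocabulary, which matters for a dataset as small as ours.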

![](https://imgs.xkcd.com/comics/python.png)

*Created with Jupyter using vscode. Not everything in 2020 sucks.*