# Experimenting and Understanding Word Embedding/Vectors
# Using the GloVe Embeddings


Word embeddings (also known as word vectors) are a way to encode the meaning of words into a set of numbers.

These embeddings are created by training a neural network model using many examples of the use of language.  These examples could be the whole of Wikipedia or a large collection of news articles. 

To start, we will explore a set of word embeddings that someone else took the time and computational power to create. One of the most commonly used sets of pre-trained word embeddings is the **GloVe embeddings**.

## GloVe Embeddings

You can read about the GloVe embeddings here: https://nlp.stanford.edu/projects/glove/, and read the original paper describing how they work here: https://nlp.stanford.edu/pubs/glove.pdf.

There are several variations of GloVe embeddings. They differ in the text used to train the embedding, and the *size* of the embeddings.

Throughout this course we'll use a package called `torchtext`, which is part of the PyTorch project. We will be using it in most assignments and in your project.

We'll begin by loading a set of GloVe embeddings. The first time you run the code below, it will download a large file (862MB) containing the embeddings.

In [2]:
import torch
import torchtext

# The first time you run this, it will download a ~862MB file
glove = torchtext.vocab.GloVe(name="6B", # trained on Wikipedia 2014 + Gigaword 5
                              dim=50)    # embedding size = 50

In [3]:
glove1 = torchtext.vocab.GloVe(name="6B", # trained on Wikipedia 2014 + Gigaword 5
                               dim=300)   # embedding size = 300

In [4]:
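# Despite the variable name, these are FastText vectors, not GloVe vectors.
# torchtext's default FastText vectors are English and 300-dimensional,
# and they are a separate download from the GloVe file above.
glove3 = torchtext.vocab.FastText()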

First, let's look at what the embedding of the word "apple" looks like:

In [3]:
glove['apple']

tensor([ 0.5204, -0.8314,  0.4996,  1.2893,  0.1151,  0.0575, -1.3753, -0.9731,
         0.1835,  0.4767, -0.1511,  0.3553,  0.2591, -0.7786,  0.5218,  0.4769,
        -1.4251,  0.8580,  0.5982, -1.0903,  0.3357, -0.6089,  0.4174,  0.2157,
        -0.0742, -0.5822, -0.4502,  0.1725,  0.1645, -0.3841,  2.3283, -0.6668,
        -0.5818,  0.7439,  0.0950, -0.4787, -0.8459,  0.3870,  0.2369, -1.5523,
         0.6480, -0.1652, -1.4719, -0.1622,  0.7986,  0.9739,  0.4003, -0.2191,
        -0.3094,  0.2658])

You can see that it is a torch tensor with shape `(50,)`. We don't know what each individual number means, but the embeddings do have properties that we can observe. For example, *distances between embeddings* are meaningful.

## Measuring Distance

Let's consider one specific measure of distance between two embedding vectors, the **Euclidean distance**. The Euclidean distance between two vectors $x = [x_1, x_2, \ldots, x_n]$ and
$y = [y_1, y_2, \ldots, y_n]$ is just the 2-norm of their difference $x - y$:

$$ \|x - y\|_2 = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$
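
To make the formula concrete, here is a minimal plain-Python sketch that evaluates it term by term. The helper name `euclidean_distance` and the two short vectors are made up for illustration. They are not GloVe values.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical 3-dimensional vectors, chosen so the arithmetic is easy to check by hand.
x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

# The differences are (-3, -4, 0), so the distance is sqrt(9 + 16 + 0) = 5.
print(euclidean_distance(x, y))  # 5.0
```

The length check mirrors a constraint that matters later in this notebook: two embeddings can only be compared if they have the same number of dimensions.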

The PyTorch function `torch.norm` computes the 2-norm of a vector for us, so we 
can compute the Euclidean distance between two vectors like this:

In [3]:
x = glove['cat']
y = glove['dog']
torch.norm(y - x)

tensor(1.8846)

In [4]:
a = glove['apple']
b = glove['orange']
torch.norm(b - a)

tensor(4.9094)

In [5]:
torch.norm(glove['good'] - glove['bad'])

tensor(3.3189)

In [66]:
torch.norm(glove1['good'] - glove1['bad'])

tensor(4.8563)

In [6]:
torch.norm(glove['good'] - glove['water'])

tensor(5.3390)

In [7]:
torch.norm(glove['good'] - glove['well'])

tensor(2.7703)

In [8]:
torch.norm(glove['good'] - glove['perfect'])

tensor(2.8834)

## Cosine Similarity

An alternative and more commonly used measure is the **cosine similarity**. The cosine similarity measures the cosine of the *angle* between two vectors, and has the property that it only considers the *direction* of the vectors, not their magnitudes. It is computed as follows for two vectors A and B:
<img src="cosine_sim.png" width="50%" align = "middle">
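
Written out, the standard definition is $\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}$: the dot product divided by the product of the two 2-norms. The sketch below evaluates that definition in plain Python so each piece is visible. The function name and the toy vectors are assumptions made for this example only.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their 2-norms."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors: v and w point the same way, u points the opposite way.
v = [1.0, 1.0, 1.0]
w = [2.0, 2.0, 2.0]
u = [-1.0, -1.0, -1.0]

print(round(cosine_similarity(v, w), 4))  # 1.0: same direction, different length
print(round(cosine_similarity(v, u), 4))  # -1.0: opposite direction
```

Doubling the length of `v` to get `w` leaves the result unchanged, which is the "direction only" property described above.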

In [20]:
x = torch.tensor([1., 1., 1.]).unsqueeze(0) # cosine similarity wants at least 2-D inputs
y = torch.tensor([2., 2., 2.]).unsqueeze(0)
torch.cosine_similarity(x, y) # should be one because x and y point in the same "direction"

tensor([1.0000])

The cosine similarity is actually a *similarity* measure rather than a *distance* measure, and gives a result between -1 and 1. Thus, the larger the similarity (the closer to 1), the "closer in meaning" the word embeddings are to each other.

In [21]:
z = torch.tensor([-1., -1., -1.]).unsqueeze(0)
torch.cosine_similarity(x, z) # should be -1 because x and z point in opposite "directions"

tensor([-1.0000])

In [22]:
x = glove['cat']
y = glove['dog']
torch.cosine_similarity(x.unsqueeze(0), y.unsqueeze(0))

tensor([0.9218])

In [75]:
x = glove1['cat']
y = glove1['dog']
torch.cosine_similarity(x.unsqueeze(0), y.unsqueeze(0))

tensor([0.6817])

In [23]:
a = glove['apple']
b = glove['banana']
torch.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0))

tensor([0.5608])

In [71]:
a = glove1['apple']
b = glove1['banana']
torch.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0))

tensor([0.3924])

In [24]:
torch.cosine_similarity(glove['good'].unsqueeze(0), 
                        glove['bad'].unsqueeze(0))

tensor([0.7965])

In [67]:
torch.cosine_similarity(glove1['good'].unsqueeze(0), 
                        glove1['bad'].unsqueeze(0))

tensor([0.6445])

In [74]:
torch.cosine_similarity(glove['good'].unsqueeze(0), 
                        glove['well'].unsqueeze(0))

tensor([0.8511])

In [72]:
torch.cosine_similarity(glove['good'].unsqueeze(0), 
                        glove['perfect'].unsqueeze(0))

tensor([0.8376])

Note: `torch.cosine_similarity` compares vectors along dimension 1 by default, so each 1-D embedding needs an extra leading dimension. The `unsqueeze(0)` call adds it, as illustrated in more detail below:

In [27]:
x = glove['good']
print(x.shape) # [50]
y = x.unsqueeze(0) # [1, 50]
print(y.shape)

torch.Size([50])
torch.Size([1, 50])


## Word Similarity

Now that we have notions of distance and similarity in our embedding space, we can talk about words that are "close" to each other in the embedding space. For now, let's use Euclidean distances to look at how close various words are to the word "cat".

In [5]:
word = 'cat'
other = ['pet', 'dog', 'bike', 'kitten', 'puppy', 'kite', 'computer', 'neuron']
for w in other:
    dist = torch.norm(glove[word] - glove[w]) # euclidean distance
    print(w, "\t%5.2f" % float(dist))

pet 	 3.04
dog 	 1.88
bike 	 5.05
kitten 	 3.51
puppy 	 3.06
kite 	 4.21
computer 	 6.03
neuron 	 6.23


Let's do the same thing with cosine similarity:

In [6]:
word = 'cat'
other = ['pet', 'dog', 'bike', 'kitten', 'puppy', 'kite', 'computer', 'neuron']
for w in other:
    sim = torch.cosine_similarity(glove[word].unsqueeze(0), glove[w].unsqueeze(0)) # cosine similarity
    print(w, "\t%5.2f" % float(sim))

pet 	 0.78
dog 	 0.92
bike 	 0.44
kitten 	 0.64
puppy 	 0.76
kite 	 0.49
computer 	 0.35
neuron 	 0.21
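
The two measures run in opposite directions: a *small* Euclidean distance and a *large* cosine similarity both mean "close". As a quick check, the sketch below copies the numbers printed in the two tables above into plain dictionaries (the dictionary names are just for this example) and ranks the words both ways.

```python
# Values copied from the two printed tables above for the word "cat".
euclidean = {"pet": 3.04, "dog": 1.88, "bike": 5.05, "kitten": 3.51,
             "puppy": 3.06, "kite": 4.21, "computer": 6.03, "neuron": 6.23}
cosine = {"pet": 0.78, "dog": 0.92, "bike": 0.44, "kitten": 0.64,
          "puppy": 0.76, "kite": 0.49, "computer": 0.35, "neuron": 0.21}

# Closest first: smallest distance, but largest similarity.
by_distance = sorted(euclidean, key=euclidean.get)
by_similarity = sorted(cosine, key=cosine.get, reverse=True)

print(by_distance)
print(by_similarity)
print(by_distance == by_similarity)  # True for this particular word list
```

For these eight words the two rankings happen to agree. That is not guaranteed in general, because Euclidean distance also depends on vector length while cosine similarity does not.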


We can look through the entire **vocabulary** for words that are closest to a point in the embedding space -- for example, we can look for words that are closest to another word such as "cat".

In [5]:
def print_closest_words(vec, n=5):
    dists = torch.norm(glove.vectors - vec, dim=1)     # compute distances to all words
    lst = sorted(enumerate(dists.numpy()), key=lambda x: x[1]) # sort by distance
    for idx, difference in lst[1:n+1]:                         # skip the nearest entry (usually the query word itself), then take n
        print(glove.itos[idx], "\t%5.2f" % difference)

print_closest_words(glove["cat"], n=10)

dog 	 1.88
rabbit 	 2.46
monkey 	 2.81
cats 	 2.90
rat 	 2.95
beast 	 2.99
monster 	 3.00
pet 	 3.04
snake 	 3.06
puppy 	 3.06
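
One detail of `print_closest_words` is easy to miss: the slice `lst[1:n+1]` always drops the single nearest entry. When `vec` is the embedding of a vocabulary word, that entry is the word itself at distance 0. When `vec` is a computed vector, such as an average of two embeddings, the dropped entry is simply whichever word is nearest. The plain-Python sketch below shows one alternative, under the assumption that we would rather pass the lookup table in as an argument and name the words to skip explicitly. The function `closest_words` and the 2-dimensional `toy` vectors are hypothetical and not part of `torchtext`.

```python
import math

def closest_words(vec, table, n=5, exclude=()):
    """Return the n (word, distance) pairs in `table` nearest to `vec`.

    `table` maps each word to a list of floats; words in `exclude` are skipped.
    """
    scored = []
    for word, other in table.items():
        if word in exclude:
            continue
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, other)))
        scored.append((dist, word))
    scored.sort()  # smallest distance first
    return [(word, dist) for dist, word in scored[:n]]

# Hypothetical 2-dimensional "embeddings", invented for this sketch.
toy = {
    "cat": [1.0, 1.0],
    "dog": [1.0, 1.5],
    "kitten": [2.0, 1.0],
    "car": [5.0, 5.0],
}

print(closest_words(toy["cat"], toy, n=2))                   # [('cat', 0.0), ('dog', 0.5)]
print(closest_words(toy["cat"], toy, n=2, exclude={"cat"}))  # [('dog', 0.5), ('kitten', 1.0)]
```

Passing the table as a parameter also avoids tying the function to one particular set of embeddings.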


In [7]:
def print_closest_words1(vec, n=5):
    dists = torch.norm(glove1.vectors - vec, dim=1)     # compute distances to all words
    lst = sorted(enumerate(dists.numpy()), key=lambda x: x[1]) # sort by distance
    for idx, difference in lst[1:n+1]:                         # skip the nearest entry (usually the query word itself), then take n
        print(glove1.itos[idx], "\t%5.2f" % difference)

print_closest_words1(glove1["cat"], n=10)

RuntimeError: The size of tensor a (50) must match the size of tensor b (300) at non-singleton dimension 1

In [8]:
def print_closest_words3(vec, n=5):
    dists = torch.norm(glove3.vectors - vec, dim=1)     # compute distances to all words
    lst = sorted(enumerate(dists.numpy()), key=lambda x: x[1]) # sort by distance
    for idx, difference in lst[1:n+1]:                         # skip the nearest entry (usually the query word itself), then take n
        print(glove3.itos[idx], "\t%5.2f" % difference)

print_closest_words3(glove3["cat"], n=10)

cats 	 3.23
dog 	 3.48
kitten 	 3.55
kittens 	 3.66
fluffykittens 	 3.80
👯 	 3.83
spoodle 	 3.86
puppy/kitten 	 3.87
kitteneatkitten 	 3.87
feline 	 3.87


In [19]:
print_closest_words(glove['dog'])

cat 	 1.88
dogs 	 2.65
puppy 	 3.15
rabbit 	 3.18
pet 	 3.23


In [76]:
print_closest_words(glove1['dog'])

RuntimeError: The size of tensor a (50) must match the size of tensor b (300) at non-singleton dimension 1

This call fails because `print_closest_words` always compares against the 50-dimensional `glove.vectors`, while `glove1['dog']` has 300 dimensions. To search the 300-dimensional embeddings, use `print_closest_words1` instead.

In [28]:
print_closest_words(glove['nurse'])

doctor 	 3.13
dentist 	 3.13
nurses 	 3.27
pediatrician 	 3.32
counselor 	 3.40


In [29]:
print_closest_words(glove['computer'])

computers 	 2.44
software 	 2.93
technology 	 3.19
electronic 	 3.51
computing 	 3.60


In [30]:
print_closest_words(glove['elizabeth'])

margaret 	 2.01
mary 	 2.27
anne 	 2.30
catherine 	 2.62
katherine 	 2.72


In [31]:
print_closest_words(glove['michael'])

peter 	 2.92
moore 	 2.93
david 	 2.94
steven 	 2.99
murphy 	 3.02


In [32]:
print_closest_words(glove['health'])

care 	 2.64
medical 	 3.24
welfare 	 3.62
prevention 	 3.76
education 	 3.76


In [33]:
print_closest_words(glove['anxiety'])

persistent 	 3.23
experiencing 	 3.25
discomfort 	 3.29
nervousness 	 3.29
anxieties 	 3.30


We could also look at which words are closest to the midpoints of two words:

In [34]:
print_closest_words((glove['happy'] + glove['sad']) / 2)

happy 	 1.92
feels 	 2.36
sorry 	 2.50
hardly 	 2.53
imagine 	 2.57


In [35]:
print_closest_words((glove['lake'] + glove['building']) / 2)

surrounding 	 3.07
nearby 	 3.11
bridge 	 3.16
along 	 3.16
shore 	 3.16


In [36]:
print_closest_words((glove['bravo'] + glove['michael']) / 2)

farrell 	 2.80
anderson 	 2.85
jacobs 	 2.85
boyle 	 2.86
slater 	 2.87


In [37]:
print_closest_words((glove['one'] + glove['ten']) / 2)

ten 	 1.57
only 	 1.88
three 	 2.03
five 	 2.05
four 	 2.11


## Analogies

One surprising aspect of word embeddings is that the *directions* in the embedding space can be meaningful. For example, analogy-like relationships such as this one tend to hold:

$$ king - man + woman \approx queen $$

In [8]:
print_closest_words(glove['king'] - glove['man'] + glove['woman'])

queen 	 2.84
prince 	 3.66
elizabeth 	 3.72
daughter 	 3.83
widow 	 3.85
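
To see why vector arithmetic can behave like an analogy, it helps to look at a case small enough to check by hand. The sketch below uses invented 2-dimensional vectors in which one axis stands for "royalty" and the other for "gender". Everything here (the `toy` table, its values, and the `analogy` helper) is a hypothetical illustration, not GloVe data. Unlike `print_closest_words`, this helper excludes all three input words from the candidates.

```python
import math

# Hypothetical 2-D vectors: axis 0 is "royalty", axis 1 is "gender". Invented for illustration.
toy = {
    "king":   [1.0,  1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0,  1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.8,  1.0],
    "apple":  [-1.0, 0.0],
}

def distance(p, q):
    """Euclidean distance between two equal-length lists of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def analogy(a, b, c, table):
    """Return the word nearest to table[a] - table[b] + table[c], excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(table[a], table[b], table[c])]
    candidates = [w for w in table if w not in (a, b, c)]
    return min(candidates, key=lambda w: distance(target, table[w]))

print(analogy("king", "man", "woman", toy))   # queen
print(analogy("queen", "woman", "man", toy))  # king
```

Subtracting `man` and adding `woman` moves the point along the "gender" axis only, so `king` lands exactly on `queen`. Real embeddings have no such labelled axes, which is why the relationship holds only approximately there.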


In [64]:
print_closest_words(glove['greater'] - glove['great'] + glove['fine'])

limits 	 4.18
minimum 	 4.23
requires 	 4.28
amounts 	 4.30
limiting 	 4.31


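For the king analogy, the top result is the expected answer, "queen", and the third result, "elizabeth", is the name of the Queen of England. The comparative analogy (greater - great + fine) does not work nearly as well: none of its top results is a comparative form of "fine".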

We can flip the analogy around and it works:

In [39]:
print_closest_words(glove['queen'] - glove['woman'] + glove['man'])

king 	 2.84
prince 	 3.25
crown 	 3.45
knight 	 3.56
coronation 	 3.62


Or, we can try different but related analogies along a gender axis:

In [40]:
print_closest_words(glove['king'] - glove['prince'] + glove['princess'])

queen 	 3.18
king 	 3.91
bride 	 4.29
lady 	 4.30
sister 	 4.42


In [None]:
print_closest_words(glove['uncle'] - glove['man'] + glove['woman'])

In [None]:
print_closest_words(glove['grandmother'] - glove['mother'] + glove['father'])

In [None]:
print_closest_words(glove['old'] - glove['young'] + glove['father'])

We can also move an embedding towards the direction of "goodness" or "badness":

In [61]:
print_closest_words(glove['good'] - glove['bad'] + glove['programmer'])

versatile 	 4.38
creative 	 4.57
entrepreneur 	 4.63
enables 	 4.72
intelligent 	 4.73


In [62]:
print_closest_words(glove['bad'] - glove['good'] + glove['programmer'])

hacker 	 3.84
glitch 	 4.00
originator 	 4.04
hack 	 4.05
serial 	 4.23


## Bias in Word Vectors

It may appear that machine learning models have an implicit air of "fairness" about them, because the models
make decisions without human intervention. However, models can and do learn whatever bias is present in the training data. In this case, the bias is present in the text that the vectors were trained on.

Below are some examples that show that the structure of the GloVe vectors encodes the everyday biases present in the texts that they are trained on.

We'll start with an example analogy:

$$doctor - man + woman \approx ??$$

Using GloVe vectors to find the answer to the above analogy:

In [9]:
print_closest_words(glove['doctor'] - glove['man'] + glove['woman'])

nurse 	 3.14
pregnant 	 3.78
child 	 3.78
woman 	 3.86
mother 	 3.92


In [11]:
print_closest_words3(glove3['doctor'] - glove3['man'] + glove3['woman'])

doctoress 	 4.27
woman 	 4.32
doctors 	 4.35
doctor/physician 	 4.39
doctory 	 4.43


The analogy $doctor - man + woman \approx nurse$ is very concerning.
Just to verify, the same result does not appear if we flip the gender terms:

In [42]:
print_closest_words(glove['doctor'] - glove['woman'] + glove['man'])

man 	 3.93
colleague 	 3.98
himself 	 3.98
brother 	 4.00
another 	 4.03


We see similar types of gender bias with other professions.

In [43]:
print_closest_words(glove['programmer'] - glove['man'] + glove['woman'])

prodigy 	 3.67
psychotherapist 	 3.81
therapist 	 3.81
introduces 	 3.91
swedish-born 	 4.12


Beyond the first result, none of the other words are even related to
programming! In contrast, if we flip the gender terms, we get very
different results:

In [44]:
print_closest_words(glove['programmer'] - glove['woman'] + glove['man'])

setup 	 4.00
innovator 	 4.07
programmers 	 4.17
hacker 	 4.23
genius 	 4.36


In [12]:
print_closest_words3(glove3['programmer'] - glove3['woman'] + glove3['man'])

programmer/developer 	 3.94
programmering 	 4.07
programmers 	 4.15
programmer,drums 	 4.21
designer/programmer 	 4.24


Here are the results for "engineer":

In [45]:
print_closest_words(glove['engineer'] - glove['man'] + glove['woman'])

technician 	 3.69
mechanic 	 3.92
pioneer 	 4.15
pioneering 	 4.19
educator 	 4.23


In [46]:
print_closest_words(glove['engineer'] - glove['woman'] + glove['man'])

builder 	 4.35
mechanic 	 4.40
engineers 	 4.48
worked 	 4.53
replacing 	 4.60
