# B - A Closer Look at Word Embeddings

We have very briefly covered how word embeddings (also known as word vectors) are used in the tutorials. In this appendix we'll have a closer look at these embeddings and find some (hopefully) interesting results.

Embeddings transform a one-hot encoded vector (a vector that is 0 in all elements except one, which is 1) into a much lower-dimensional vector of real numbers. The one-hot encoded vector is also known as a *sparse vector*, whilst the real-valued vector is known as a *dense vector*.
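
To make the sparse/dense relationship concrete, here is a minimal pure-Python sketch with made-up numbers. The vocabulary size of 4, the 2-dimensional embeddings and the helper names are illustrative assumptions, not values from this appendix. It shows that multiplying a one-hot vector by an embedding matrix simply selects one row of that matrix, which is why an embedding layer can be thought of as a lookup table.

```python
# Toy embedding matrix: 4 words in the vocabulary, 2 dimensions per word.
# All numbers are made up for illustration.
embedding_matrix = [
    [0.1, 0.9],   # row 0
    [0.4, 0.3],   # row 1
    [0.8, 0.2],   # row 2
    [0.5, 0.5],   # row 3
]

def one_hot(index, size):
    """Return a sparse vector that is 1 at `index` and 0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_matrix_product(vector, matrix):
    """Multiply a row vector of length V by a V x D matrix."""
    n_cols = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

sparse = one_hot(2, len(embedding_matrix))
dense = vector_matrix_product(sparse, embedding_matrix)

print(sparse)                        # [0.0, 0.0, 1.0, 0.0]
print(dense == embedding_matrix[2])  # True
```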

The key concept in these word embeddings is that words that appear in similar _contexts_ appear nearby in the vector space, i.e. the Euclidean distance between their word vectors is small. By context here, we mean the surrounding words. For example, in the sentences "I purchased some items at the shop" and "I purchased some items at the store", the words 'shop' and 'store' appear in the same context and thus should be close together in vector space.
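
As a small illustration of what "close together" means, the sketch below computes the Euclidean distance between a few hand-written 3-dimensional vectors. The numbers are invented so that 'shop' and 'store' sit near each other; real embeddings are learned from a corpus and have many more dimensions.

```python
import math

# Made-up 3-dimensional vectors, purely for illustration.
toy_vectors = {
    'shop':   [0.9, 0.1, 0.3],
    'store':  [0.8, 0.2, 0.3],
    'banana': [0.1, 0.9, 0.7],
}

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d_near = euclidean(toy_vectors['shop'], toy_vectors['store'])
d_far = euclidean(toy_vectors['shop'], toy_vectors['banana'])

print(d_near < d_far)  # True
```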

You may have also heard about *word2vec*. *word2vec* is an algorithm (actually a bunch of algorithms) that calculates word vectors from a corpus. In this appendix we use *GloVe* vectors, *GloVe* being another algorithm to calculate word vectors. If you want to know how *word2vec* works, check out a two part series [here](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) and [here](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/), and if you want to find out more about *GloVe*, check the website [here](https://nlp.stanford.edu/projects/glove/).

In PyTorch, we use word vectors with the `nn.Embedding` layer, which takes a _**[sentence length, batch size]**_ tensor and transforms it into a _**[sentence length, batch size, embedding dimensions]**_ tensor.

In tutorial 2 onwards, we also used pre-trained word embeddings (specifically the GloVe vectors) provided by TorchText. These embeddings have been trained on a gigantic corpus. We can use these pre-trained vectors within any of our models, with the idea that as they have already learned the context of each word they will give us a better starting point for our word vectors. This usually leads to faster training time and/or improved accuracy.

In this appendix we won't be training any models, instead we'll be looking at the word embeddings and finding a few interesting things about them.

A lot of the code from the first half of this appendix is taken from [here](https://github.com/spro/practical-pytorch/blob/master/glove-word-vectors/glove-word-vectors.ipynb). For more information about word embeddings, go [here](https://monkeylearn.com/blog/word-embeddings-transform-text-numbers/). 

## Loading the GloVe vectors

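First, we'll load the word vectors. With TorchText's `GloVe` class, the `name` argument specifies what the vectors have been trained on (`6B` means a corpus of 6 billion tokens) and the `dim` argument specifies the dimensionality of the word vectors. The `6B` GloVe vectors are available in 50, 100, 200 and 300 dimensions; the `42B` and `840B` vectors are only available in 300 dimensions. In the cell below, however, we load a custom vector file, `model.txt`, through the generic `torchtext.vocab.Vectors` class, using `./vectors/` as the cache directory. We keep the variable name `glove` for the rest of the appendix.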

In [1]:
import torchtext.vocab

# glove = torchtext.vocab.Vectors('hongkong.vec', cache='./vectors/')
# glove = torchtext.vocab.Vectors('cc.zh.300.vec', cache='./vectors/')
glove = torchtext.vocab.Vectors('model.txt', cache='./vectors/')

print(f'There are {len(glove.itos)} words in the vocabulary')

There are 1935477 words in the vocabulary


As shown above, there are 1,935,477 words in this vocabulary. By comparison, the GloVe `6B` vectors contain 400,000 words, the most common words found in the corpus they were trained on. **In the GloVe `6B` vectors, every single word is lower-case only.**
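
Text vector files of this kind usually hold one token per line followed by its vector components, separated by spaces, sometimes preceded by a "count dimension" header line as in the word2vec text format. As a rough illustration of that layout (not TorchText's actual parser), the sketch below reads a tiny in-memory file. The sample contents and the `parse_vectors` helper are hypothetical.

```python
import io

# A tiny stand-in for a vector file: a "count dim" header line followed by
# one token per line. The contents are made up for illustration.
sample_file = io.StringIO(
    "3 2\n"
    "香港 0.1 0.2\n"
    "的 0.3 0.4\n"
    "the 0.5 0.6\n"
)

def parse_vectors(handle):
    """Build itos / stoi / vectors structures from a text vector file."""
    itos, stoi, vectors = [], {}, []
    expected_dim = None
    for line in handle:
        parts = line.rstrip().split(' ')
        token, values = parts[0], parts[1:]
        if expected_dim is None and len(values) == 1:
            # Treat a leading "count dim" line as a header, not as a word.
            expected_dim = int(values[0])
            continue
        if expected_dim is None:
            expected_dim = len(values)
        if len(values) != expected_dim:
            raise ValueError(f'{token!r} has {len(values)} values, '
                             f'expected {expected_dim}')
        stoi[token] = len(itos)
        itos.append(token)
        vectors.append([float(v) for v in values])
    return itos, stoi, vectors

itos, stoi, vectors = parse_vectors(sample_file)
print(itos)        # ['香港', '的', 'the']
print(stoi['的'])   # 1
print(vectors[0])  # [0.1, 0.2]
```

The three return values mirror the `itos`, `stoi` and `vectors` attributes used throughout this appendix, with plain lists standing in for the tensor.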

`glove.vectors` is the actual tensor containing the values of the embeddings.

In [2]:
glove.vectors.shape

torch.Size([1935503, 100])

We can see what word is associated with each row by checking the `itos` (int to string) list. 

The output below shows that row 0 is the vector associated with the token '</s>', row 1 with ',' (comma), row 2 with '的', etc.

In [3]:
glove.itos[:10]

['</s>', ',', '的', '。', '(', ')', ':', '、', '-', '一']

We can also use the `stoi` (string to int) dictionary, in which we input a word and receive the associated integer/index. If you try to get the index of a word that is not in the vocabulary, you receive a `KeyError`.

In [4]:
glove.stoi['the']

154
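
Because `stoi` is a dictionary, an out-of-vocabulary word can also be handled without an exception. The sketch below uses a small made-up mapping in place of `glove.stoi`; `toy_stoi` and `lookup_index` are hypothetical names used only for illustration.

```python
# A toy string-to-index mapping standing in for `glove.stoi`.
toy_stoi = {'香港': 0, '的': 1, 'the': 2}

def lookup_index(stoi, word, default=None):
    """Return the index of `word`, or `default` if it is out of vocabulary."""
    return stoi.get(word, default)

print(lookup_index(toy_stoi, 'the'))      # 2
print(lookup_index(toy_stoi, 'missing'))  # None

# Plain indexing raises KeyError for an unknown word.
try:
    toy_stoi['missing']
except KeyError as err:
    print(f'KeyError: {err}')             # KeyError: 'missing'
```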

We can get the vector of a word by first getting the integer associated with it and then indexing into the word embedding tensor with that index.

In [5]:
glove.vectors[glove.stoi['the']].shape

torch.Size([100])

We'll be doing this a lot, so we'll create a function that takes in word embeddings and a word and returns the associated vector. It'll also throw an error if the word doesn't exist in the vocabulary.

In [6]:
def get_vector(embeddings, word):
    assert word in embeddings.stoi, f'*{word}* is not in the vocab!'
    return embeddings.vectors[embeddings.stoi[word]]

As before, we use a word to get the associated vector.

In [7]:
get_vector(glove, 'the').shape

torch.Size([100])

## Similar Contexts

Now we can start looking at the context of different words.

If we want to find the words similar to a certain input word, we first find the vector of this input word, then we scan through our vocabulary finding any vectors similar to this input word vector.

The function below returns the closest 10 words to an input word vector:

In [8]:
import torch

def closest_words(embeddings, vector, n=10):
    distances = [(w, torch.dist(vector, get_vector(embeddings, w)).item()) for w in embeddings.itos]
    return sorted(distances, key = lambda w: w[1])[:n]

Let's try it out with 'Hong Kong':

In [9]:
closest_words(glove, get_vector(glove, '香港'))

[('香港', 0.0),
 ('內地', 2.314028739929199),
 ('本港', 2.353198528289795),
 ('港青', 2.3729612827301025),
 ('bookfest', 2.374790906906128),
 ('寶御', 2.3775370121002197),
 ('黃浩潮', 2.39192795753479),
 ('ourradio', 2.4064459800720215),
 ('金鐘萬豪', 2.4302167892456055),
 ('京澳', 2.432112216949463)]
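
`closest_words` computes one distance per vocabulary entry in a Python loop and then sorts the whole list, which can be slow for a vocabulary of this size. As one possible variation, the sketch below keeps only the `n` best candidates with `heapq.nsmallest` instead of sorting everything. It works on made-up 2-dimensional vectors in a plain dictionary rather than on the `glove` object, and `closest_words_toy` is a hypothetical name.

```python
import heapq
import math

# Made-up 2-dimensional vectors, purely for illustration.
toy_vectors = {
    'shop':   [1.0, 0.0],
    'store':  [0.9, 0.1],
    'market': [0.7, 0.3],
    'banana': [0.0, 1.0],
}

def closest_words_toy(vectors, query, n=3):
    """Return the n (word, distance) pairs nearest to `query`."""
    distances = ((word, math.dist(query, vec)) for word, vec in vectors.items())
    # nsmallest avoids sorting the full vocabulary when n is small.
    return heapq.nsmallest(n, distances, key=lambda pair: pair[1])

result = closest_words_toy(toy_vectors, toy_vectors['shop'])
print([word for word, _ in result])  # ['shop', 'store', 'market']
```

As with `closest_words`, the query word itself comes back first with a distance of zero.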

We can also try '特首', the Chief Executive (leader) of Hong Kong:

In [10]:
closest_words(glove, get_vector(glove, '特首'))

[('特首', 0.0),
 ('振英', 1.8918646574020386),
 ('政改', 2.1343867778778076),
 ('蔭權', 2.151686191558838),
 ('林瑞麟', 2.192889928817749),
 ('鈺成', 2.3309617042541504),
 ('家傑', 2.438174247741699),
 ('泛民', 2.460930824279785),
 ('退聯', 2.4835751056671143),
 ('君堯', 2.513657331466675)]

Some colloquial Hong Kong Cantonese words:

In [11]:
closest_words(glove, get_vector(glove, '毒男'))

[('毒男', 0.0),
 ('港女', 1.9794985055923462),
 ('影衰', 2.3026230335235596),
 ('除褲', 2.3101210594177246),
 ('肯搏', 2.337437868118286),
 ('cctvb', 2.3419041633605957),
 ('玩波', 2.347959280014038),
 ('夜鬼', 2.348475694656372),
 ('野片', 2.370290517807007),
 ('單野', 2.3723373413085938)]

In [12]:
closest_words(glove, get_vector(glove, '冇'))

[('冇', 0.0),
 ('唔', 1.4685471057891846),
 ('咁', 1.4884002208709717),
 ('乜', 1.5145183801651),
 ('唔係', 1.645288348197937),
 ('喎', 1.690563678741455),
 ('只係', 1.7539080381393433),
 ('仲有', 1.767637848854065),
 ('但係', 1.8374937772750854),
 ('知係', 1.8415424823760986)]

We'll also create another function that will nicely print out the tuples returned by our `closest_words` function.

In [13]:
def print_tuples(tuples):
    for w, d in tuples:
        print(f'({d:.4f}) {w}')

Next, a word associated with the 'Umbrella Movement', a large-scale demonstration that lasted 79 days in Hong Kong in 2014:

In [14]:
print_tuples(closest_words(glove, get_vector(glove, '佔中')))

(0.0000) 佔中
(2.3043) 港獨
(2.4098) 反佔
(2.4620) 退聯
(2.5240) 愛港
(2.5489) 倒梁
(2.5644) 警權
(2.5821) 挺梁
(2.5914) 耀廷
(2.5993) 左膠


In [15]:
print_tuples(closest_words(glove, get_vector(glove, '六四')))

(0.0000) 六四
(2.3659) 民運
(2.6713) 柴玲
(2.7292) 港共
(2.7525) 六四屠殺
(2.7604) 習馬
(2.7628) 死難者
(2.7986) 國恥
(2.8081) 天安
(2.8131) 挺梁


In [16]:
print_tuples(closest_words(glove, get_vector(glove, '港獨')))

(0.0000) 港獨
(2.1488) 左膠
(2.3043) 佔中
(2.3460) 愛港
(2.3493) 假普選
(2.3781) 退聯
(2.4165) 警權
(2.4274) 黨性
(2.4783) 禍港
(2.5025) 傾中


In [17]:
print_tuples(closest_words(glove, get_vector(glove, '左膠')))

(0.0000) 左膠
(1.8530) 政棍
(1.9499) 港共
(1.9519) 媚共
(1.9719) 假普選
(1.9807) 黑泛
(1.9918) 禍港
(1.9991) 貪曾
(2.0050) 蟲論
(2.0406) 搶咪


## Analogies

Another property of word embeddings is that they can be operated on just as any standard vector and give interesting results.

We'll show an example of this first, and then explain it:

In [18]:
def analogy(embeddings, word1, word2, word3, n=5):
    
    candidate_words = closest_words(embeddings, get_vector(embeddings, word2) - get_vector(embeddings, word1) + get_vector(embeddings, word3), n+3)
    
    candidate_words = [x for x in candidate_words if x[0] not in [word1, word2, word3]][:n]
    
    print(f'{word1} is to {word2} as {word3} is to...')
    
    return candidate_words

In [19]:
print_tuples(analogy(glove, '男', '王帝', '女'))

男 is to 王帝 as 女 is to...
(2.8135) 天照大
(2.8200) 宮守
(2.8291) 因幡
(2.8543) 花倉
(2.8642) 豐臣秀


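This follows the canonical example used to show off this property of word embeddings: 'man is to king as woman is to queen'. With these vectors the nearest candidates above do not look like a word for 'queen', but the reasoning behind the example still explains what the arithmetic is trying to do. Why should the vector of 'woman' added to the vector of 'king' minus the vector of 'man' give us 'queen'?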

If we think about it, the vector calculated from 'king' minus 'man' gives us a "royalty vector". This is the vector associated with traveling from a man to his royal counterpart, a king. If we add this "royalty vector" to 'woman', this should travel to her royal equivalent, which is a queen!
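
The "royalty vector" idea can be checked on a toy example. In the sketch below the vectors are hand-built so that one axis loosely encodes royalty and the other gender; these numbers are invented for illustration and are not taken from the loaded embeddings, and `analogy_toy` is a hypothetical helper rather than the `analogy` function defined above.

```python
import math

# Hand-built 2-d vectors: axis 0 loosely encodes "royalty", axis 1 "gender".
# These numbers are invented so that the arithmetic is easy to follow.
toy_vectors = {
    'man':   [0.0, 1.0],
    'woman': [0.0, -1.0],
    'king':  [1.0, 1.0],
    'queen': [1.0, -1.0],
    'apple': [-1.0, 0.0],
}

def analogy_toy(vectors, word1, word2, word3):
    """Solve 'word1 is to word2 as word3 is to ?' by vector arithmetic."""
    target = [b - a + c for a, b, c in
              zip(vectors[word1], vectors[word2], vectors[word3])]
    # Exclude the three input words, as the `analogy` function does.
    candidates = {w: v for w, v in vectors.items()
                  if w not in (word1, word2, word3)}
    return min(candidates, key=lambda w: math.dist(target, candidates[w]))

# The "royalty vector": king minus man.
royalty = [k - m for k, m in zip(toy_vectors['king'], toy_vectors['man'])]
print(royalty)                                           # [1.0, 0.0]
print(analogy_toy(toy_vectors, 'man', 'king', 'woman'))  # queen
```

Learned embeddings are never this clean, which is why the real function returns several ranked candidates instead of a single answer.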

We can do this with other analogies too, for example the languages of Taiwan and Hong Kong:

In [20]:
print_tuples(analogy(glove, '台灣', '台語', '香港'))

台灣 is to 台語 as 香港 is to...
(2.7639) 粵
(2.9972) 薄情歌
(3.0301) 粵唱
(3.0631) 國語
(3.0636) 劇名曲


Food, of course:

In [21]:
print_tuples(analogy(glove, '香港', '點心', '台灣'))

香港 is to 點心 as 台灣 is to...
(2.9720) 冰品
(3.1338) 家咖
(3.1583) 饅
(3.1906) 貝果
(3.1946) 泡麵


Let's see whether it captures the mapping between Cantonese words and their standard written Chinese equivalents:

In [22]:
print_tuples(analogy(glove, '係', '是', '嘅'))

係 is to 是 as 嘅 is to...
(2.6261) 的
(2.8228) 有此種
(2.8249) 地域遼闊
(2.8286) 能夠
(2.8514) 經歷那


In [23]:
print_tuples(analogy(glove, '什麼', '乜', '他'))

什麼 is to 乜 as 他 is to...
(2.6637) 俾
(2.8348) 吓
(2.8990) 哂
(2.9857) 諗
(2.9891) 好似


A "capital city vector":

In [24]:
print_tuples(analogy(glove, '香港', '西環', '中國'))

香港 is to 西環 as 中國 is to...
(2.7801) 錦普
(2.7942) 至砵
(2.8525) 河經
(2.8704) 頭地區
(2.8716) 派亞


A "political party grouping vector":

In [25]:
print_tuples(analogy(glove, '民主黨', '泛民', '民建聯'))

民主黨 is to 泛民 as 民建聯 is to...
(3.1028) 英下台
(3.1348) 青政
(3.1392) 做騷
(3.1415) 落區
(3.1594) 連政


And a "leader of country vector":

In [26]:
print_tuples(analogy(glove, '中國', '近平', '美國'))

中國 is to 近平 as 美國 is to...
(3.0961) 歐巴馬
(3.1883) 奧巴馬
(3.3149) 拜登
(3.3811) 希拉里
(3.4041) 耶倫
