# B - A Closer Look at Word Embeddings

We have very briefly covered how word embeddings (also known as word vectors) are used in the tutorials. In this appendix we'll have a closer look at these embeddings and find some (hopefully) interesting results.

Embeddings transform a one-hot encoded vector (a vector that is 0 in all elements except one, which is 1) into a much lower-dimensional vector of real numbers. The one-hot encoded vector is also known as a *sparse vector*, whilst the real valued vector is known as a *dense vector*.
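
As a rough illustration of the sparse-to-dense step, the sketch below multiplies a one-hot vector by a small embedding matrix and checks that the result is simply one row of that matrix. The vocabulary, the matrix values and the helper names `one_hot` and `matvec` are all made up for this example.

```python
# Toy vocabulary and embedding matrix; every value here is invented for illustration.
vocab = ['the', 'shop', 'store', 'banana']
embedding_matrix = [
    [0.1, 0.3],    # 'the'
    [0.8, -0.2],   # 'shop'
    [0.7, -0.1],   # 'store'
    [-0.5, 0.9],   # 'banana'
]

def one_hot(index, size):
    """Sparse vector: 0 everywhere except a single 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(vector, matrix):
    """Multiply a row vector by a matrix using plain-Python dot products."""
    n_cols = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

index = vocab.index('store')
sparse = one_hot(index, len(vocab))        # length 4 (vocabulary size)
dense = matvec(sparse, embedding_matrix)   # length 2 (embedding dimension)

# Multiplying by a one-hot vector just selects one row of the matrix,
# which is why embedding layers are implemented as table lookups.
print(sparse, dense, dense == embedding_matrix[index])
```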

The key concept in these word embeddings is that words that appear in similar _contexts_ appear nearby in the vector space, i.e. the Euclidean distance between these two word vectors is small. By context here, we mean the surrounding words. For example in the sentences "I purchased some items at the shop" and "I purchased some items at the store" the words 'shop' and 'store' appear in the same context and thus should be close together in vector space.
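
To make "close together in vector space" concrete, here is a minimal sketch using hand-picked three-dimensional vectors. The numbers are hypothetical and only chosen so that 'shop' and 'store' sit near each other; real embeddings are learned from a corpus.

```python
import math

# Hypothetical 3-dimensional vectors, chosen by hand for illustration only.
toy_vectors = {
    'shop':   [0.9, 0.1, 0.3],
    'store':  [0.8, 0.2, 0.3],
    'banana': [-0.4, 0.9, -0.6],
}

# Euclidean distance between two word vectors.
d_near = math.dist(toy_vectors['shop'], toy_vectors['store'])
d_far = math.dist(toy_vectors['shop'], toy_vectors['banana'])

# Words used in similar contexts should end up with the smaller distance.
print(f'shop-store:  {d_near:.4f}')
print(f'shop-banana: {d_far:.4f}')
```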

You may have also heard about *word2vec*. *word2vec* is an algorithm (actually a bunch of algorithms) that calculates word vectors from a corpus. In this appendix we use *GloVe* vectors, *GloVe* being another algorithm to calculate word vectors. If you want to know how *word2vec* works, check out a two part series [here](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) and [here](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/), and if you want to find out more about *GloVe*, check the website [here](https://nlp.stanford.edu/projects/glove/).

In PyTorch, we use word vectors with the `nn.Embedding` layer, which takes a _**[sentence length, batch size]**_ tensor and transforms it into a _**[sentence length, batch size, embedding dimensions]**_ tensor.
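
The shape change can be checked with a small sketch. The vocabulary size, embedding dimension, sentence length and batch size below are arbitrary toy values, and the layer is randomly initialized rather than loaded from pre-trained vectors.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 100, 8    # toy sizes, not taken from the tutorials
sentence_length, batch_size = 5, 3

embedding = nn.Embedding(vocab_size, embedding_dim)

# Each entry is a word index in the range [0, vocab_size).
indices = torch.randint(0, vocab_size, (sentence_length, batch_size))

embedded = embedding(indices)

print(indices.shape)    # torch.Size([5, 3])
print(embedded.shape)   # torch.Size([5, 3, 8])
```

Each index is replaced by the matching row of `embedding.weight`, so the layer adds one trailing dimension of size `embedding_dim`.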

In tutorial 2 onwards, we also used pre-trained word embeddings (specifically the GloVe vectors) provided by TorchText. These embeddings have been trained on a gigantic corpus. We can use these pre-trained vectors within any of our models, with the idea that as they have already learned the context of each word they will give us a better starting point for our word vectors. This usually leads to faster training time and/or improved accuracy.
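
One possible way to give a model this starting point is `nn.Embedding.from_pretrained`, sketched below. The random `pretrained` tensor is only a stand-in for a real matrix such as `glove.vectors`, and this is not necessarily how the tutorials themselves copy the weights.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained [vocab size, embedding dim] matrix (assumption:
# in practice this would be something like `glove.vectors`).
pretrained = torch.randn(10, 4)

# freeze=False keeps the vectors trainable, so they act as a starting point
# that can still be fine-tuned; freeze=True (the default) would fix them.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)

print(embedding.weight.shape)           # torch.Size([10, 4])
print(embedding.weight.requires_grad)   # True
```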

In this appendix we won't be training any models; instead, we'll be looking at the word embeddings and finding a few interesting things about them.

A lot of the code from the first half of this appendix is taken from [here](https://github.com/spro/practical-pytorch/blob/master/glove-word-vectors/glove-word-vectors.ipynb). For more information about word embeddings, go [here](https://monkeylearn.com/blog/word-embeddings-transform-text-numbers/). 

## Loading the GloVe vectors

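First, we'll load the word vectors. In TorchText, the `torchtext.vocab.GloVe` class takes a `name` argument that specifies what the vectors have been trained on (for example, `6B` means a corpus of 6 billion words) and a `dim` argument that specifies the dimensionality of the word vectors. GloVe vectors are available in 50, 100, 200 and 300 dimensions. There are also `42B` and `840B` GloVe vectors, however they are only available at 300 dimensions.

The cell below instead uses the more general `torchtext.vocab.Vectors` class to load 300-dimensional vectors from a file, `toastynews.vec`, with `./vectors/` as the cache directory. The variable is still called `glove`, but as the outputs below show, the vocabulary is largely Chinese rather than English.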

In [1]:
import torchtext.vocab

glove = torchtext.vocab.Vectors('toastynews.vec', cache='./vectors/')
# glove = torchtext.vocab.Vectors('cc.zh.300.vec', cache='./vectors/')
# glove = torchtext.vocab.Vectors('model.txt', cache='./vectors/')

print(f'There are {len(glove.itos)} words in the vocabulary')

There are 320261 words in the vocabulary


As shown above, there are 320,261 unique words in this vocabulary. These are the most common words found in the corpus the vectors were trained on. **By comparison, the GloVe `6B` vectors contain 400,000 words, every one of which is lower-case only.**

`glove.vectors` is the actual tensor containing the values of the embeddings.

In [2]:
glove.vectors.shape

torch.Size([320261, 300])

We can see what word is associated with each row by checking the `itos` (int to string) list. 

The output below shows that row 0 is the vector associated with '，' (full-width comma), row 1 with '的', row 2 with '。' (full-width period), etc.

In [3]:
glove.itos[:10]

['，', '的', '。', '「', '」', '、', '在', '</s>', '是', '有']

We can also use the `stoi` (string to int) dictionary, in which we input a word and receive the associated integer/index. If you try to get the index of a word that is not in the vocabulary, you receive a `KeyError`.

In [4]:
glove.stoi['the']

71
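
If an out-of-vocabulary word should not stop the program, the lookup can be guarded. The sketch below assumes `stoi` behaves like a regular Python dictionary, as described above; the three-entry `stoi` and the helper name `lookup` are stand-ins invented for this example.

```python
# Toy stand-in for `glove.stoi`, using the first three vocabulary entries.
stoi = {'，': 0, '的': 1, '。': 2}

def lookup(stoi, word, default=None):
    """Return the index of `word`, or `default` if it is not in the vocabulary."""
    return stoi.get(word, default)

print(lookup(stoi, '的'))         # 1
print(lookup(stoi, 'unknown'))    # None

# Plain indexing raises KeyError for a missing word.
try:
    stoi['unknown']
except KeyError as error:
    print(f'not in the vocab: {error}')
```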

We can get the vector of a word by first getting the integer associated with it and then indexing into the word embedding tensor with that index.

In [5]:
glove.vectors[glove.stoi['the']].shape

torch.Size([300])

We'll be doing this a lot, so we'll create a function that takes in word embeddings and a word and returns the associated vector. It'll also throw an error if the word doesn't exist in the vocabulary.

In [6]:
def get_vector(embeddings, word):
    assert word in embeddings.stoi, f'*{word}* is not in the vocab!'
    return embeddings.vectors[embeddings.stoi[word]]

As before, we use a word to get the associated vector.

In [7]:
get_vector(glove, 'the').shape

torch.Size([300])

## Similar Contexts

Now to start looking at the context of different words. 

If we want to find the words similar to a certain input word, we first find the vector of this input word, then we scan through our vocabulary finding any vectors similar to this input word vector.

The function below returns the closest 10 words to an input word vector:

In [8]:
import torch

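def closest_words(embeddings, vector, n=10):
    # Euclidean distance from `vector` to every word vector in the vocabulary
    distances = [(w, torch.dist(vector, get_vector(embeddings, w)).item()) for w in embeddings.itos]
    # Sort by ascending distance and keep the n closest words
    return sorted(distances, key = lambda w: w[1])[:n]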

Let's try it out with 'Hong Kong':

In [9]:
closest_words(glove, get_vector(glove, '香港'))

[('香港', 0.0),
 ('中国香港', 12.423322677612305),
 ('中國大陸', 13.041645050048828),
 ('香港政府', 13.063764572143555),
 ('香港衆志', 13.340444564819336),
 ('中國內地', 13.392632484436035),
 ('香港旅游', 13.608757019042969),
 ('當港', 13.69633960723877),
 ('香港特區政府', 13.734424591064453),
 ('港埠', 13.851877212524414)]

We can also try the leader of Hong Kong:

In [10]:
closest_words(glove, get_vector(glove, '特首'))

[('特首', 0.0),
 ('行政長官', 13.655364036560059),
 ('梁振英', 15.498384475708008),
 ('林鄭月娥', 16.51620864868164),
 ('政務司司長', 17.143537521362305),
 ('曾俊華', 17.417217254638672),
 ('雙特首', 17.533527374267578),
 ('梁特首', 17.8826904296875),
 ('選特首', 18.15694236755371),
 ('政司', 18.33683204650879)]

Some Hong Kongese words:

In [11]:
closest_words(glove, get_vector(glove, '毒男'))

[('毒男', 0.0),
 ('宅男', 15.281902313232422),
 ('毒撚', 15.647896766662598),
 ('醜男', 15.931194305419922),
 ('宅女', 16.067916870117188),
 ('電車男', 16.27805519104004),
 ('剩男', 16.485713958740234),
 ('毒女', 16.514097213745117),
 ('當男', 16.543933868408203),
 ('孝男', 16.557912826538086)]

In [12]:
closest_words(glove, get_vector(glove, '冇'))

[('冇', 0.0),
 ('係無', 20.875301361083984),
 ('無', 24.18804168701172),
 ('無咩', 25.047468185424805),
 ('仲無', 25.383453369140625),
 ('唔係', 26.072914123535156),
 ('有冇', 26.328956604003906),
 ('係有', 26.658960342407227),
 ('唔會', 26.949352264404297),
 ('唔到', 27.772043228149414)]

We'll also create another function that will nicely print out the tuples returned by our `closest_words` function.

In [13]:
def print_tuples(tuples):
    for w, d in tuples:
        print(f'({d:.4f}) {w}')

Another word to look at is '佔中' ('Occupy Central'), closely tied to the Umbrella Movement, a large-scale demonstration that lasted 79 days in Hong Kong in 2014:

In [14]:
print_tuples(closest_words(glove, get_vector(glove, '佔中')))

(0.0000) 佔中
(14.8871) 佔領運動
(16.1307) 佔領中環
(17.2010) 雨傘運動
(17.7764) 佔中三子
(17.8046) 雨傘革命
(18.8373) 佔領
(18.8780) 反佔中
(19.4318) 佔領國
(19.8205) 戴耀廷


In [15]:
print_tuples(closest_words(glove, get_vector(glove, '六四')))

(0.0000) 六四
(18.3861) 六四晚會
(19.8915) 平反六四
(20.1039) 天安門事件
(20.1559) 燭光晚會
(20.5140) 民運
(20.6942) 支聯會
(20.7485) 八九
(20.7691) 六四年
(21.1745) 悼念


In [16]:
print_tuples(closest_words(glove, get_vector(glove, '港獨')))

(0.0000) 港獨
(19.6626) 香港獨立
(19.7685) 港獨論
(20.1616) 港獨派
(21.1078) 明獨
(21.9146) 講獨
(22.0966) 暗獨
(22.1052) 禁獨
(22.1878) 民主自決
(22.2187) 播獨


In [17]:
print_tuples(closest_words(glove, get_vector(glove, '左膠')))

(0.0000) 左膠
(16.8034) 大中華膠
(17.5668) 和理非非
(17.8395) 和理非
(17.8493) 勇武派
(17.9251) 反骨仔
(18.0924) 左賊
(18.1387) 右膠
(18.3090) 反膠
(18.3883) 柒人
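
The `closest_words` function above computes one distance per Python loop iteration, which can be slow for a vocabulary of this size. A possible vectorized variant is sketched below. It assumes `embeddings.vectors` is a 2-D tensor with one row per entry in `embeddings.itos`; the name `closest_words_fast` and the tiny `toy` embeddings are invented for this example.

```python
from types import SimpleNamespace

import torch

def closest_words_fast(embeddings, vector, n=10):
    # Broadcasting subtracts `vector` from every row at once;
    # the norm over dim=1 then gives one Euclidean distance per word.
    distances = (embeddings.vectors - vector).norm(dim=1)
    # topk with largest=False returns the n smallest distances in ascending order.
    values, indices = torch.topk(distances, k=n, largest=False)
    return [(embeddings.itos[i], d) for i, d in zip(indices.tolist(), values.tolist())]

# Tiny stand-in exposing the same `itos`, `stoi` and `vectors` attributes.
toy = SimpleNamespace(
    itos=['a', 'b', 'c', 'd'],
    vectors=torch.tensor([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]),
)
toy.stoi = {w: i for i, w in enumerate(toy.itos)}

result = closest_words_fast(toy, toy.vectors[toy.stoi['a']], n=3)
print(result)   # [('a', 0.0), ('b', 1.0), ('c', 2.0)]
```

The returned list has the same `(word, distance)` shape as before, so it can be passed to `print_tuples`. When two words are exactly the same distance away, their relative order may differ from the sorted version.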


## Analogies

Another property of word embeddings is that they can be operated on just as any standard vector and give interesting results.

We'll show an example of this first, and then explain it:

In [18]:
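def analogy(embeddings, word1, word2, word3, n=5):
    # Target vector: word2 - word1 + word3
    target = (get_vector(embeddings, word2)
              - get_vector(embeddings, word1)
              + get_vector(embeddings, word3))

    # Fetch a few extra candidates, as the input words themselves may rank among the closest
    candidate_words = closest_words(embeddings, target, n+3)

    # Drop the three input words and keep the top n
    candidate_words = [x for x in candidate_words if x[0] not in [word1, word2, word3]][:n]

    print(f'{word1} is to {word2} as {word3} is to...')

    return candidate_words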

In [19]:
print_tuples(analogy(glove, '男', '王帝', '女'))

男 is to 王帝 as 女 is to...
(32.5372) 女帝
(32.5834) 恭帝
(32.8806) 明帝
(32.9668) 獻帝
(33.1555) 宋帝


This is a version of the canonical example which shows off this property of word embeddings. In English it is usually written as "man is to king as woman is to queen"; here, using '男' (man), '王帝' (standing in for 'king') and '女' (woman), the closest result is '女帝' (empress). So why does it work? Why does the vector of 'woman' added to the vector of 'king' minus the vector of 'man' give us 'queen'?

If we think about it, the vector calculated from 'king' minus 'man' gives us a "royalty vector". This is the vector associated with traveling from a man to his royal counterpart, a king. If we add this "royalty vector" to 'woman', this should travel to her royal equivalent, which is a queen!
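
The arithmetic can be traced by hand with a deliberately simple sketch. The two-dimensional vectors below are hypothetical: the first axis is meant to stand for "royalty" and the second for "gender". Real embeddings do not have such neatly labelled axes.

```python
import math

# Hand-made 2-D vectors (hypothetical): axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    'man':   [0.0, 1.0],
    'woman': [0.0, -1.0],
    'king':  [1.0, 1.0],
    'queen': [1.0, -1.0],
    'apple': [-1.0, 0.0],
}

# "Royalty vector": the step from 'man' to 'king'.
royalty = [k - m for k, m in zip(toy['king'], toy['man'])]    # [1.0, 0.0]

# Take the same step starting from 'woman'.
target = [w + r for w, r in zip(toy['woman'], royalty)]       # [1.0, -1.0]

# Nearest remaining word to the target, excluding the three input words.
candidates = {w: v for w, v in toy.items() if w not in ('man', 'king', 'woman')}
best = min(candidates, key=lambda w: math.dist(candidates[w], target))

print(best)   # queen
```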

We can do this with other analogies too. For example, the languages of Taiwan and Hong Kong:

In [20]:
print_tuples(analogy(glove, '台灣', '台語', '香港'))

台灣 is to 台語 as 香港 is to...
(17.2777) 港語
(17.6501) 廣東話
(17.8055) 滬語
(17.9160) 粤語
(18.0530) 侗語


Food, of course:

In [21]:
print_tuples(analogy(glove, '香港', '點心', '台灣'))

香港 is to 點心 as 台灣 is to...
(21.0641) 台菜
(21.7415) 小點心
(21.8444) 烤餅
(21.8960) 小餅
(21.8977) 菜餅


See if it understands uniquely Hong Kongese words:

In [22]:
print_tuples(analogy(glove, '係', '是', '嘅'))

係 is to 是 as 嘅 is to...
(14.7562) 的
(20.1876) 極是
(20.2065) 也是
(20.9165) 這是
(21.0760) 並是


In [23]:
print_tuples(analogy(glove, '什麼', '乜', '他'))

什麼 is to 乜 as 他 is to...
(27.9548) 佢
(28.2934) 乜姐
(28.5232) 乜仲
(28.5642) 佢哋
(28.7793) 人哋


A "capital city vector":

In [24]:
print_tuples(analogy(glove, '香港', '西環', '中國'))

香港 is to 西環 as 中國 is to...
(22.7017) 北京
(22.7217) 中聯辦
(22.9240) 京派
(23.0354) 中京
(23.4145) 西環路


A "political party grouping vector":

In [25]:
print_tuples(analogy(glove, '民主黨', '泛民', '民建聯'))

民主黨 is to 泛民 as 民建聯 is to...
(16.1597) 建制派
(18.8834) 泛民主派
(18.9309) 建制
(18.9661) 非建制
(18.9940) 民主派


And a "leader of country vector":

In [26]:
print_tuples(analogy(glove, '中國', '習近平', '美國'))

中國 is to 習近平 as 美國 is to...
(15.7739) 美國總統
(15.9481) 奧巴馬
(16.3379) 特朗普
(17.4217) 歐巴馬
(17.6101) 俄總統
