# B - A Closer Look at Word Embeddings

We have very briefly covered how word embeddings (also known as word vectors) are used in the tutorials. In this appendix we'll have a closer look at these embeddings and find some (hopefully) interesting results.

Embeddings transform a one-hot encoded vector (a vector that is 0 in all elements except one, which is 1) into a vector of real numbers with far fewer dimensions. The one-hot encoded vector is also known as a *sparse vector*, whilst the real-valued vector is known as a *dense vector*.
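To make the sparse/dense distinction concrete, here is a minimal sketch in plain Python. The four-word vocabulary and the 3-dimensional values are invented for illustration; real embeddings are learned from data and are much larger.

```python
# Toy vocabulary and embedding table; every value here is invented for illustration.
vocab = ['the', 'shop', 'store', 'cat']
embedding_table = [
    [0.1, 0.3, -0.2],   # the
    [0.7, -0.1, 0.4],   # shop
    [0.6, -0.2, 0.5],   # store
    [-0.5, 0.9, 0.0],   # cat
]

def one_hot(index, size):
    """Sparse representation: all zeros except a single 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed(one_hot_vector, table):
    """Multiply the one-hot row vector by the table, giving a dense vector."""
    dims = len(table[0])
    return [sum(one_hot_vector[i] * table[i][d] for i in range(len(table)))
            for d in range(dims)]

sparse = one_hot(vocab.index('shop'), len(vocab))
dense = embed(sparse, embedding_table)

print(sparse)  # [0.0, 1.0, 0.0, 0.0]
print(dense)   # [0.7, -0.1, 0.4]
```

Multiplying by a one-hot vector simply selects one row of the table, which is why an embedding layer can be implemented as a row lookup rather than a full matrix product.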

The key concept in these word embeddings is that words that appear in similar _contexts_ appear nearby in the vector space, i.e. the Euclidean distance between these two word vectors is small. By context here, we mean the surrounding words. For example, in the sentences "I purchased some items at the shop" and "I purchased some items at the store", the words 'shop' and 'store' appear in the same context and thus should be close together in vector space.
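As a rough illustration of "close together in vector space", the sketch below computes Euclidean distances between hand-made vectors. The words and values are hypothetical and chosen so that 'shop' and 'store' sit near each other; they are not taken from any trained model.

```python
import math

# Hypothetical 3-dimensional vectors; real embeddings have hundreds of dimensions.
toy_vectors = {
    'shop':  [0.7, -0.1, 0.4],
    'store': [0.6, -0.2, 0.5],
    'cat':   [-0.5, 0.9, 0.0],
}

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.dist(a, b)

d_similar = euclidean(toy_vectors['shop'], toy_vectors['store'])
d_different = euclidean(toy_vectors['shop'], toy_vectors['cat'])

print(f'shop-store: {d_similar:.4f}')    # shop-store: 0.1732
print(f'shop-cat:   {d_different:.4f}')  # shop-cat:   1.6125
```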

You may have also heard about *word2vec*. *word2vec* is an algorithm (actually a bunch of algorithms) that calculates word vectors from a corpus. In this appendix we use *GloVe* vectors, *GloVe* being another algorithm to calculate word vectors. If you want to know how *word2vec* works, check out a two part series [here](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) and [here](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/), and if you want to find out more about *GloVe*, check the website [here](https://nlp.stanford.edu/projects/glove/).

In PyTorch, we use word vectors with the `nn.Embedding` layer, which takes a _**[sentence length, batch size]**_ tensor and transforms it into a _**[sentence length, batch size, embedding dimensions]**_ tensor.
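The shape transformation can be checked with a small sketch. The sizes below (vocabulary of 100, 8 embedding dimensions, sentence length 5, batch size 3) are arbitrary values picked for illustration.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; none of these come from the tutorials.
vocab_size, embedding_dim = 100, 8
sentence_length, batch_size = 5, 3

embedding = nn.Embedding(vocab_size, embedding_dim)

# Each entry is a word index in [0, vocab_size).
indices = torch.randint(0, vocab_size, (sentence_length, batch_size))

# The layer adds a trailing dimension holding each word's vector.
embedded = embedding(indices)

print(indices.shape)   # torch.Size([5, 3])
print(embedded.shape)  # torch.Size([5, 3, 8])
```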

In tutorial 2 onwards, we also used pre-trained word embeddings (specifically the GloVe vectors) provided by TorchText. These embeddings have been trained on a gigantic corpus. We can use these pre-trained vectors within any of our models, with the idea that as they have already learned the context of each word they will give us a better starting point for our word vectors. This usually leads to faster training time and/or improved accuracy.
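One way to start a model from pre-trained vectors is `nn.Embedding.from_pretrained`. The sketch below is an assumption about how this could be done, not necessarily how the tutorials do it, and the small `pretrained` tensor is an invented stand-in for a real matrix such as `glove.vectors`.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained matrix such as `glove.vectors`:
# 4 words, 3 dimensions, values invented for illustration.
pretrained = torch.tensor([
    [0.1, 0.3, -0.2],
    [0.7, -0.1, 0.4],
    [0.6, -0.2, 0.5],
    [-0.5, 0.9, 0.0],
])

# freeze=False lets the vectors be fine-tuned during training;
# freeze=True (the default) keeps them fixed.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)

indices = torch.tensor([[1], [2]])  # [sentence length = 2, batch size = 1]
output = embedding(indices)
print(output.shape)                 # torch.Size([2, 1, 3])
```

Whether to freeze the vectors is a design choice: frozen vectors train faster and keep the pre-trained geometry, while unfrozen ones can adapt to the task.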

In this appendix we won't be training any models; instead, we'll be looking at the word embeddings and finding a few interesting things about them.

A lot of the code from the first half of this appendix is taken from [here](https://github.com/spro/practical-pytorch/blob/master/glove-word-vectors/glove-word-vectors.ipynb). For more information about word embeddings, go [here](https://monkeylearn.com/blog/word-embeddings-transform-text-numbers/). 

## Loading the GloVe vectors

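First, we'll load the vectors. When loading GloVe through `torchtext.vocab.GloVe`, the `name` argument specifies what the vectors have been trained on (`6B` means a corpus of 6 billion words) and the `dim` argument specifies the dimensionality of the word vectors. The `6B` GloVe vectors are available in 50, 100, 200 and 300 dimensions. There are also `42B` and `840B` GloVe vectors, but they are only available in 300 dimensions.

In the cell below we instead load a vector file by name with `torchtext.vocab.Vectors`, here `cc.zh.300.vec`, which contains 300-dimensional vectors; the `cache` argument is the directory where the file is stored. The variable is still called `glove`, so the rest of the code is unchanged.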

In [1]:
import torchtext.vocab

# glove = torchtext.vocab.Vectors('toastynews.vec', cache='./vectors/')
glove = torchtext.vocab.Vectors('cc.zh.300.vec', cache='./vectors/')
# glove = torchtext.vocab.Vectors('model.txt', cache='./vectors/')

print(f'There are {len(glove.itos)} words in the vocabulary')

There are 2000000 words in the vocabulary


As shown above, there are 2,000,000 unique words in the vocabulary. These are the most common words found in the corpus the vectors were trained on. **In the `6B` GloVe vectors, every single word is lower-case only**; that is not guaranteed for other vector files such as the one loaded here.

`glove.vectors` is the actual tensor containing the values of the embeddings.

In [2]:
glove.vectors.shape

torch.Size([2000000, 300])

We can see what word is associated with each row by checking the `itos` (int to string) list. 

The output below shows that row 0 is the vector associated with '，' (full-width comma), row 1 with '的', row 2 with '。' (full-width period), etc.

In [3]:
glove.itos[:10]

['，', '的', '。', '</s>', '、', '是', '一', '在', '：', '了']

We can also use the `stoi` (string to int) dictionary, in which we input a word and receive the associated integer/index. If you try to get the index of a word that is not in the vocabulary, you receive a `KeyError`.

In [4]:
glove.stoi['the']

587
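The relationship between `itos` and `stoi`, and one way to avoid the error for unknown words, can be sketched with toy data. The lists below are stand-ins for `glove.itos` and `glove.stoi`, the helper `lookup_index` is a hypothetical name, and the sketch assumes `stoi` behaves like a plain Python dictionary.

```python
# Toy stand-ins for `glove.itos` and `glove.stoi`; the words are illustrative.
itos = ['，', '的', '。', 'the']
stoi = {word: index for index, word in enumerate(itos)}

def lookup_index(stoi, word):
    """Return the index of `word`, or None if it is out of vocabulary."""
    return stoi.get(word)

print(lookup_index(stoi, 'the'))      # 3
print(lookup_index(stoi, 'missing'))  # None

# Direct indexing raises KeyError for unknown words.
try:
    stoi['missing']
except KeyError as error:
    print(f'KeyError: {error}')       # KeyError: 'missing'
```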

We can get the vector of a word by first getting the integer associated with it and then indexing into the word embedding tensor with that index.

In [5]:
glove.vectors[glove.stoi['the']].shape

torch.Size([300])

We'll be doing this a lot, so we'll create a function that takes in word embeddings and a word and returns the associated vector. It'll also throw an error if the word doesn't exist in the vocabulary.

In [6]:
def get_vector(embeddings, word):
    assert word in embeddings.stoi, f'*{word}* is not in the vocab!'
    return embeddings.vectors[embeddings.stoi[word]]

As before, we use a word to get the associated vector.

In [7]:
get_vector(glove, 'the').shape

torch.Size([300])

## Similar Contexts

Now to start looking at the context of different words. 

If we want to find the words similar to a certain input word, we first find the vector of this input word, then we scan through our vocabulary finding any vectors similar to this input word vector.

The function below returns the closest 10 words to an input word vector:

In [8]:
import torch

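def closest_words(embeddings, vector, n=10):
    # Euclidean distance from `vector` to every word vector in the vocabulary
    distances = [(w, torch.dist(vector, get_vector(embeddings, w)).item()) for w in embeddings.itos]
    # Sort ascending by distance and keep the n nearest
    return sorted(distances, key=lambda w: w[1])[:n]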
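Looping over the whole vocabulary in Python can be slow for large vocabularies. As an alternative sketch (not the method used in the rest of this appendix), the distances can be computed for all rows at once with tensor operations. The function name `closest_words_vectorized` and the toy data are hypothetical.

```python
import torch

def closest_words_vectorized(vectors, itos, vector, n=10):
    """Return the n (word, distance) pairs nearest to `vector`.

    `vectors` is a [vocab size, dims] tensor and `itos` the matching word list.
    """
    # Broadcasting subtracts `vector` from every row, then takes each row's L2 norm.
    distances = torch.linalg.norm(vectors - vector, dim=1)
    # largest=False selects the smallest distances, returned in ascending order.
    values, indices = torch.topk(distances, k=min(n, len(itos)), largest=False)
    return [(itos[i], d) for i, d in zip(indices.tolist(), values.tolist())]

# Toy data standing in for `glove.vectors` and `glove.itos`.
toy_itos = ['shop', 'store', 'cat']
toy_vectors = torch.tensor([
    [0.7, -0.1, 0.4],
    [0.6, -0.2, 0.5],
    [-0.5, 0.9, 0.0],
])

result = closest_words_vectorized(toy_vectors, toy_itos, toy_vectors[0], n=2)
print(result)
```

Note that `vectors - vector` creates an intermediate tensor the same size as the full embedding matrix, so this trades memory for speed.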

Let's try it out with 'Hong Kong':

In [9]:
closest_words(glove, get_vector(glove, '香港'))

[('香港', 0.0),
 ('新加坡', 1.626781702041626),
 ('中國內', 1.7208738327026367),
 ('香港人', 1.7438627481460571),
 ('新加波', 1.7662696838378906),
 ('中文大學及', 1.7802550792694092),
 ('城市大學', 1.7878292798995972),
 ('經香港', 1.7943989038467407),
 ('加拿大', 1.7960593700408936),
 ('浸會大學聯', 1.800727128982544)]

We can also try '特首' (the Chief Executive, the leader of Hong Kong):

In [10]:
closest_words(glove, get_vector(glove, '特首'))

[('特首', 0.0),
 ('梁振英', 2.6690258979797363),
 ('曾蔭權', 2.898160696029663),
 ('林郑', 2.899057388305664),
 ('曾俊華', 2.9461543560028076),
 ('曾俊华', 2.9567720890045166),
 ('特首選', 2.95729398727417),
 ('梁特', 2.9788360595703125),
 ('爾江南', 2.9918124675750732),
 ('林郑月娥', 3.002098560333252)]

Some Hong Kongese words:

In [11]:
closest_words(glove, get_vector(glove, '毒男'))

[('毒男', 0.0),
 ('北姑', 2.480846881866455),
 ('偽人', 2.4922149181365967),
 ('毒撚', 2.4923672676086426),
 ('高登仔', 2.5160903930664062),
 ('中佬', 2.5286834239959717),
 ('獨男', 2.5298116207122803),
 ('大叔控', 2.538130760192871),
 ('麻甩佬', 2.5443649291992188),
 ('做宅', 2.5537912845611572)]

In [12]:
closest_words(glove, get_vector(glove, '冇'))

[('冇', 0.0),
 ('佢有', 2.2973246574401855),
 ('有乜', 2.304213047027588),
 ('冇咩', 2.3347063064575195),
 ('就唔', 2.3467519283294678),
 ('過唔', 2.4058477878570557),
 ('係都', 2.407468557357788),
 ('無乜', 2.408259153366089),
 ('嗰啲', 2.4190146923065186),
 ('乜事', 2.423032283782959)]

We'll also create another function that will nicely print out the tuples returned by our `closest_words` function.

In [13]:
def print_tuples(tuples):
    for w, d in tuples:
        print(f'({d:.4f}) {w}')

Another word to look at is '佔中' ('Occupy Central'), associated with the Umbrella Movement, a large-scale demonstration that lasted 79 days in Hong Kong in 2014:

In [14]:
print_tuples(closest_words(glove, get_vector(glove, '佔中')))

(0.0000) 佔中
(3.3302) 戴耀廷
(3.4616) 占中
(3.6070) 佔領運動
(3.6187) 雨傘運動
(3.6291) 去飲
(3.6380) 驅蝗
(3.6788) 罷學
(3.6850) 倒梁
(3.6937) 指佔


In [15]:
print_tuples(closest_words(glove, get_vector(glove, '六四')))

(0.0000) 六四
(2.6874) 六四事件
(2.7190) 六.四
(2.7375) 天安門
(2.7521) 八九六四
(2.7970) 陆肆
(2.8614) 八九年
(2.9207) 赵紫阳
(2.9313) 天安门
(2.9736) 對六四


In [16]:
print_tuples(closest_words(glove, get_vector(glove, '港獨')))

(0.0000) 港獨
(3.7869) 亂港
(3.8753) 港独
(3.9013) 賣港
(3.9145) 台獨
(3.9551) 歸英
(3.9745) 港共
(3.9798) 反蝗
(3.9987) 本土派
(4.0149) 歪論


In [17]:
print_tuples(closest_words(glove, get_vector(glove, '左膠')))

(0.0000) 左膠
(2.9313) 港豬
(3.0193) 左仔
(3.1136) 亂港
(3.1424) 膠人
(3.1552) 賣港
(3.1658) 左報
(3.1849) 糞青
(3.1910) 盲毛
(3.1988) 梁營


## Analogies

Another property of word embeddings is that they can be operated on just as any standard vector and give interesting results.

We'll show an example of this first, and then explain it:

In [18]:
def analogy(embeddings, word1, word2, word3, n=5):
    # Target vector: word2 - word1 + word3
    target = get_vector(embeddings, word2) - get_vector(embeddings, word1) + get_vector(embeddings, word3)

    # Fetch n+3 candidates so that n remain after the input words are filtered out
    candidate_words = closest_words(embeddings, target, n+3)
    candidate_words = [x for x in candidate_words if x[0] not in [word1, word2, word3]][:n]

    print(f'{word1} is to {word2} as {word3} is to...')

    return candidate_words

In [19]:
print_tuples(analogy(glove, '男', '王帝', '女'))

男 is to 王帝 as 女 is to...
(2.4312) 亲封
(2.4456) 大智者
(2.4501) 君臣们
(2.4644) 大周
(2.4653) 朕把


This mirrors the canonical example that shows off this property of word embeddings: 'man is to king as woman is to queen'. With these vectors the nearest results are court-related terms rather than a word for 'queen', so the analogy only partly carries over. Still, why should the vector of 'woman' added to the vector of 'king' minus the vector of 'man' give us 'queen'?

If we think about it, the vector calculated from 'king' minus 'man' gives us a "royalty vector". This is the vector associated with traveling from a man to his royal counterpart, a king. If we add this "royalty vector" to 'woman', this should travel to her royal equivalent, which is a queen!
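The "royalty vector" idea can be worked through by hand with a toy example. The 2-D vectors below are invented so that one axis loosely encodes gender and the other royalty; real embeddings are not this tidy, and `toy_analogy` is a hypothetical helper, not part of this appendix's code.

```python
import math

# Invented 2-D vectors: axis 0 loosely encodes "gender", axis 1 "royalty".
toy_vectors = {
    'man':   [1.0, 0.0],
    'woman': [-1.0, 0.0],
    'king':  [1.0, 1.0],
    'queen': [-1.0, 1.0],
    'apple': [0.0, -1.0],
}

def toy_analogy(vectors, word1, word2, word3):
    """Solve word1 : word2 :: word3 : ? by vector arithmetic on `vectors`."""
    target = [b - a + c for a, b, c in
              zip(vectors[word1], vectors[word2], vectors[word3])]
    # Exclude the input words, as the `analogy` function does.
    candidates = [w for w in vectors if w not in (word1, word2, word3)]
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

# 'king' minus 'man' isolates the royalty axis.
royalty = [k - m for k, m in zip(toy_vectors['king'], toy_vectors['man'])]
answer = toy_analogy(toy_vectors, 'man', 'king', 'woman')

print(royalty)  # [0.0, 1.0]
print(answer)   # queen
```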

We can do this with other analogies too. For example, the languages of Taiwan and Hong Kong:

In [20]:
print_tuples(analogy(glove, '台灣', '台語', '香港'))

台灣 is to 台語 as 香港 is to...
(2.8218) 廣東話
(2.8723) 閩南語
(2.8839) 滬語
(2.9314) 菲語
(2.9348) 國台語


Food, of course:

In [21]:
print_tuples(analogy(glove, '香港', '點心', '台灣'))

香港 is to 點心 as 台灣 is to...
(3.2133) 甜點
(3.2172) 餅乾
(3.3962) 媞免
(3.4331) 妍免
(3.4484) 飲料


Let's see whether it understands uniquely Hong Kongese words:

In [22]:
print_tuples(analogy(glove, '係', '是', '嘅'))

係 is to 是 as 嘅 is to...
(2.0955) 的
(2.2166) 和
(2.2381) 那就是
(2.2829) ，
(2.3001) 尤其是


In [23]:
print_tuples(analogy(glove, '什麼', '乜', '他'))

什麼 is to 乜 as 他 is to...
(3.0310) 唔会
(3.0377) 梗系
(3.0441) 仲系
(3.0720) 系咪
(3.0952) 系唔


A "capital city vector":

In [24]:
print_tuples(analogy(glove, '香港', '西環', '中國'))

香港 is to 西環 as 中國 is to...
(3.9395) 堅尼地城
(3.9403) 黃浦區
(3.9434) 銅鑼灣
(3.9606) 維壹
(3.9751) 藍城


A "political party grouping vector":

In [25]:
print_tuples(analogy(glove, '民主黨', '泛民', '民建聯'))

民主黨 is to 泛民 as 民建聯 is to...
(3.6718) 民主派
(3.8216) 本土派
(3.8433) 梁營
(3.8765) 箍票
(3.8828) 泛民派


And a "leader of country vector":

In [26]:
print_tuples(analogy(glove, '中國', '習近平', '美國'))

中國 is to 習近平 as 美國 is to...
(2.0773) 歐巴馬
(2.2688) 川普總
(2.2774) 奧巴馬
(2.2847) 白宮與
(2.2919) 奥巴马
