In this notebook we will load a pre-trained word embedding model and explore it with [Gensim](https://radimrehurek.com/gensim/).

The word embeddings are created with GloVe and are available on the [project website](https://nlp.stanford.edu/projects/glove/). In particular, we will use the model trained on Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download).

In [None]:
!wget https://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

The archive contains one file per model, with the only difference being the number of dimensions. Each line is made of a token followed by whitespace-separated numbers (the n-dimensional vector components).

In [None]:
!head -n1 glove.6B.300d.txt
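
To make the file format concrete, here is a minimal sketch of how one such line could be parsed by hand. The helper name `parse_glove_line` and the shortened sample line are illustrative assumptions, not part of GloVe or Gensim.

```python
def parse_glove_line(line):
    """Split one GloVe text line into (token, vector)."""
    # The token comes first; every remaining field is a vector component.
    parts = line.rstrip().split(" ")
    token = parts[0]
    vector = [float(x) for x in parts[1:]]
    return token, vector


# Made-up 4-dimensional line, shortened for illustration
# (the real file used here has 300 components per line).
sample = "the 0.04656 0.21318 -0.0074364 -0.45854"
token, vector = parse_glove_line(sample)
print(token, len(vector))
```

In practice Gensim does this parsing for us, so the helper is only useful for understanding or debugging the raw file.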

The format understood by Gensim is the same as the default Word2vec text format, where the first line of the file must contain the number of vectors (one per line) and the number of dimensions. The `glove2word2vec` utility function adds this header line to the GloVe embedding file.

In [None]:
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

glove2word2vec("glove.6B.300d.txt", "glove_gensim.6B.300d.txt")

In [None]:
!head glove_gensim.6B.300d.txt
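
The conversion amounts to counting vectors and dimensions and writing them as a header. The sketch below reproduces that idea with the standard library only. `add_word2vec_header` is a hypothetical helper written for illustration, not Gensim's implementation, and it runs on a tiny made-up file rather than the real embeddings.

```python
import os
import tempfile


def add_word2vec_header(glove_path, output_path):
    """Copy a non-empty GloVe text file, prepending '<num_vectors> <dimensions>'."""
    # First pass: read the dimensionality from line one and count the vectors.
    with open(glove_path, encoding="utf8") as src:
        first = src.readline()
        num_dims = len(first.rstrip().split(" ")) - 1
        num_vectors = 1 + sum(1 for _ in src)
    # Second pass: write the header, then the unchanged vectors.
    with open(glove_path, encoding="utf8") as src, \
            open(output_path, "w", encoding="utf8") as dst:
        dst.write(f"{num_vectors} {num_dims}\n")
        for line in src:
            dst.write(line)
    return num_vectors, num_dims


# Tiny made-up file with two 3-dimensional vectors.
tmp_dir = tempfile.mkdtemp()
glove_file = os.path.join(tmp_dir, "toy_glove.txt")
converted_file = os.path.join(tmp_dir, "toy_word2vec.txt")
with open(glove_file, "w", encoding="utf8") as f:
    f.write("rock 0.1 0.2 0.3\nschool 0.4 0.5 0.6\n")

print(add_word2vec_header(glove_file, converted_file))
with open(converted_file, encoding="utf8") as f:
    header = f.readline().strip()
print(header)
```

Gensim 4.0 and later can also read the headerless GloVe file directly by passing `no_header=True` to `KeyedVectors.load_word2vec_format`, which makes the conversion step optional.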

The embeddings are loaded into a KeyedVectors object, essentially a dictionary with words as keys and vectors as values.

In [None]:
model = KeyedVectors.load_word2vec_format("glove_gensim.6B.300d.txt", binary=False)

Once the embeddings are loaded, the model object can be queried like a dictionary, while still exposing its other methods.

In [None]:
print(model['rock'])

Finding the N words most similar to an input word, or calculating the similarity of a word pair, works the same way as with Word2vec.

In [None]:
for word, similarity in model.most_similar('school', topn=10):
    print(f"{similarity:.2f} {word}")
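
Gensim ranks neighbours by cosine similarity. As a rough sketch of what happens under the hood, the following standard-library code computes cosine similarity and a top-N ranking over a handful of made-up 3-dimensional vectors. The values and the helper names are assumptions chosen for illustration; the real implementation is vectorized with NumPy.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=10):
    """Rank every other word by cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors, not real GloVe values.
toy_vectors = {
    "school": [0.9, 0.1, 0.0],
    "college": [0.8, 0.2, 0.1],
    "teacher": [0.6, 0.5, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

for word, similarity in most_similar(toy_vectors, "school", topn=2):
    print(f"{similarity:.2f} {word}")
```

With the loaded model, the equivalent pairwise score is available through `model.similarity(word_a, word_b)`.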

GloVe is known for its ability to perform analogical (as in *analogy*) reasoning. With Gensim, analogies are implemented with simple geometric operations. For an analogy of the form: 

    A:B = C:?

we can look for vectors similar to B and C, and at the same time dissimilar from A.

In [None]:
# Paris is to France as Berlin is to ...
# Passing words rather than raw vectors lets Gensim exclude the query words from the results.
print(model.most_similar(positive=['france', 'berlin'], negative=['paris'])[0][0])

# Man is to actor as woman is to ...
print(model.most_similar(positive=['actor', 'woman'], negative=['man'])[0][0])
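
The geometric operation behind these queries can be sketched as follows: add the unit-normalized vectors of B and C, subtract that of A, and return the nearest remaining word by cosine similarity. The toy 3-dimensional vectors below are invented so that the arithmetic is easy to follow, and `analogy` is a hypothetical helper, not a Gensim function.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def analogy(vectors, a, b, c):
    """Solve a : b = c : ? by finding the word closest to b + c - a."""
    va, vb, vc = (normalize(vectors[w]) for w in (a, b, c))
    target = normalize([y + z - x for x, y, z in zip(va, vb, vc)])
    best_word, best_score = None, -2.0
    for word, vec in vectors.items():
        # Skip the query words so they cannot be returned as the answer.
        if word in (a, b, c):
            continue
        # Dot product of unit vectors equals their cosine similarity.
        score = sum(p * q for p, q in zip(target, normalize(vec)))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Made-up toy vectors with axes loosely meaning (capital, French, German).
toy_vectors = {
    "paris": [1.0, 1.0, 0.0],
    "france": [0.0, 1.0, 0.0],
    "berlin": [1.0, 0.0, 1.0],
    "germany": [0.0, 0.0, 1.0],
    "italy": [0.0, 1.0, 0.2],
    "rome": [1.0, 0.2, 0.2],
}

# Paris is to France as Berlin is to ...
print(analogy(toy_vectors, "paris", "france", "berlin"))
```

Excluding the query words matters: without that step, the nearest vector to the target is often one of the inputs themselves.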