# Machine Reading: Advanced Topics in Word Vectors
## Part II. Word Vectors via Word2Vec (50 mins)

This is a 4-part series of Jupyter notebooks on the topic of word embeddings, originally created for a workshop during the Digital Humanities 2018 Conference in Mexico City. Each part consists of a mix of theoretical explanations and fill-in-the-blank activities of increasing difficulty.

Instructors:
- Eun Seo Jo, <a href="mailto:eunseo@stanford.edu">*eunseo@stanford.edu*</a>, Stanford University
- Javier de la Rosa, <a href="mailto:versae@stanford.edu">*versae@stanford.edu*</a>, Stanford University
- Scott Bailey, <a href="mailto:scottbailey@stanford.edu">*scottbailey@stanford.edu*</a>, Stanford University

This unit focuses on Word2Vec as an example of neural-net-based approaches to vector encodings, starting with a conceptual overview of the algorithm itself and ending with an activity in which participants train their own vectors.

● 0:00 - 0:15 Conceptual explanation of Word2Vec

● 0:15 - 0:30 Word2Vec Visualization and Vectorial Features and Math

● 0:30 - 0:50 [Activity 2] Word2Vec Construction [using Gensim] and Visualization (from part 1) [We provide corpus]

In [None]:
%%capture --no-stderr
import sys
!pip install Cython  # needed to compile fasttext
!pip install -r requirements.txt
!python -m nltk.downloader all
print("All done!", file=sys.stderr)

In [None]:
import gensim
from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer

OK, before we go into Word2Vec in practice, let's talk about what it is.

Word2Vec is a neural-network-based (or deep-learning-based) approach to generating word vectors. There are many resources out there that go into the heavy details of deep learning in general or deep learning for NLP, such as Yoav Goldberg's Neural Network Methods in Natural Language Processing (Morgan & Claypool Publishers, 2017). In this unit, we will give you a high-level overview, just enough for you to understand what w2v really means.

In [None]:
from IPython.display import Image
# Image from Jurafsky & Martin, Speech and Language Processing, 2016
Image("./neuralnet.png")

Neural nets are basically a bunch of weights in the form of matrices. If you multiply lots of these matrices in a row, you get layers that make your network 'deep', hence the name deep learning. Usually, if your network has more than one hidden (or projection) layer, it's called a 'deep' network. The 'neurons' are just functions that transform your data non-linearly. Each layer of the network transforms your data, so the representation becomes more sophisticated (and meaningful) with each layer.
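
As a rough illustration of "matrices in a row", here is a minimal sketch of a two-layer forward pass in plain Python. The weights `W1` and `W2` and the input `x` are made up for this example; a real network would learn its weights from data.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector (a list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def tanh_layer(weights, inputs):
    """One layer: a matrix multiplication followed by a non-linear function."""
    return [math.tanh(z) for z in matvec(weights, inputs)]

# Toy weights chosen by hand for illustration (an assumption, not trained values).
W1 = [[0.5, -0.2, 0.1],
      [0.3, 0.8, -0.5]]   # maps 3 inputs to 2 hidden units
W2 = [[1.0, -1.0]]        # maps 2 hidden units to 1 output

x = [1.0, 2.0, 3.0]
hidden = tanh_layer(W1, x)    # first transformation of the data
output = matvec(W2, hidden)   # second matrix applied to the hidden layer
print(hidden, output)
```

Stacking more `tanh_layer` calls between the input and the output is what would make this toy network deeper.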

What happens in all deep learning tasks is a prediction of some sort. In the case of word2vec, we predict words, given other words. The information for making this prediction is in your weights -- matrices. Based on whether this prediction is correct, the model will calculate the cost and alter your weights, matrices, so that you can do better on the next prediction. This is done iteratively through all of your 'training' data. 
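
The predict, measure, and update loop described above can be sketched in a few lines. This is a simplified illustration, not how Word2Vec itself is implemented: the three-word vocabulary is invented, and a single list of scores stands in for the weight matrices.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Cost: small when the correct word receives a high probability."""
    return -math.log(probs[target])

vocab = ["whale", "sea", "tea"]   # hypothetical candidate words
weights = [0.0, 0.0, 0.0]         # one score per word; stands in for the matrices
target = vocab.index("sea")       # the word that actually occurred
learning_rate = 0.5

losses = []
for step in range(5):
    probs = softmax(weights)                  # 1. predict
    losses.append(cross_entropy(probs, target))  # 2. calculate the cost
    # Gradient of the cost with respect to each score:
    # (prob - 1) for the correct word, prob for every other word.
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    weights = [w - learning_rate * g for w, g in zip(weights, grads)]  # 3. update

print(losses)
```

Each pass through the loop lowers the cost a little, which is the behavior the training procedure relies on.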

In W2V, your actual predictions are not the end product you want. Remember, we are predicting neighboring or co-occurring words. The actual performance is just an overall accuracy number. For our purposes, we take the weights (the coefficients) that allow you to make the best predictions. These become your word vectors. Intuitively, these are the numerical representations that differentiate one word from another word in the prediction task.
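
One way to see why the weights can serve as word vectors: if each word enters the network as a one-hot vector, multiplying it by the weight matrix simply picks out one row. The sketch below assumes a made-up three-word vocabulary and an invented "trained" matrix `W`.

```python
vocab = ["whale", "sea", "ship"]

# Hypothetical trained input weights: one row per word, 2 dimensions each.
W = [[0.9, 0.1],
     [0.8, 0.3],
     [0.2, 0.7]]

def one_hot(word):
    """A vector of zeros with a single 1 at the word's position."""
    return [1.0 if w == word else 0.0 for w in vocab]

def vecmat(vector, matrix):
    """Multiply a row vector by a matrix (a list of rows)."""
    n_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(n_cols)]

# Reading the rows off the matrix gives one vector per word.
word_vectors = {word: W[i] for i, word in enumerate(vocab)}

projected = vecmat(one_hot("sea"), W)
print(projected, word_vectors["sea"])
```

The projection of the one-hot vector for "sea" and the row stored for "sea" are the same list of numbers.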

In [None]:
from IPython.display import Image
Image("./cbow_skipgram.png")

The main difference between skip-gram and CBOW, two different methods of w2v, is that skip-gram learns vectors by predicting the context words $c$ that come before and after a given word $w$, while CBOW predicts the center word $w$ given the context words $c$.
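
To make the difference concrete, here is a sketch of the training examples each method would draw from the same sentence. The function names are ours, and the fixed-size window is a simplification of what real implementations do.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: the center word predicts each context word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """Return (context words, center) pairs: the context predicts the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs

tokens = ["call", "me", "ishmael"]
print(skipgram_pairs(tokens, window=1))
# [('call', 'me'), ('me', 'call'), ('me', 'ishmael'), ('ishmael', 'me')]
print(cbow_pairs(tokens, window=1))
# [(['me'], 'call'), (['call', 'ishmael'], 'me'), (['me'], 'ishmael')]
```

Skip-gram produces one example per (word, neighbor) pair, while CBOW produces one example per word with all of its neighbors bundled together.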

You may have heard of negative sampling. This is just a shortcut around calculating the denominator needed for the probabilities, which sums over every word in the vocabulary. Because it is costly to calculate that denominator exactly every time, negative sampling sidesteps it by contrasting each observed word pair with a small sample of random 'negative' words drawn from a word-frequency distribution observed in the corpus.
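
A minimal sketch of the idea follows. The helper names and the toy token list are ours; the 0.75 exponent on the word counts follows the original Word2Vec paper (Mikolov et al., 2013), and the code is an illustration rather than Gensim's implementation.

```python
import math
import random
from collections import Counter

def noise_distribution(tokens, power=0.75):
    """Word counts raised to a power, then normalized into probabilities."""
    counts = Counter(tokens)
    weighted = {w: c ** power for w, c in counts.items()}
    total = sum(weighted.values())
    return {w: v / total for w, v in weighted.items()}

def sample_negatives(dist, k, exclude, rng):
    """Draw k 'noise' words, never returning the true context word."""
    words = [w for w in dist if w != exclude]
    weights = [dist[w] for w in words]
    return rng.choices(words, weights=weights, k=k)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """Cost for one observed pair plus k sampled negatives (no sum over the vocabulary)."""
    loss = -math.log(sigmoid(dot(center_vec, context_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, neg)))
    return loss

tokens = "the whale the sea the ship a whale".split()
dist = noise_distribution(tokens)
rng = random.Random(0)   # fixed seed so the sample is repeatable
negatives = sample_negatives(dist, k=2, exclude="sea", rng=rng)
print(dist, negatives)
```

The cost only touches the observed pair and the `k` sampled words, which is why it is so much cheaper than normalizing over the whole vocabulary.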

In [None]:
### reimporting and reloading materials from part 1
from nltk.corpus import gutenberg

In [None]:
mobydick = gutenberg.raw('melville-moby_dick.txt')
emma = gutenberg.raw('austen-emma.txt')
alice = gutenberg.raw('carroll-alice.txt')

In [None]:
corpus = [mobydick, emma, alice]

Let's split our corpus into sentences. 

In [None]:
sentences = sent_tokenize(corpus[0])
sentences

In [None]:
tokenizer = TreebankWordTokenizer()

Let's define a function that takes a list of texts and converts it into the format gensim's word2vec expects. The function will lower-case the text and tokenize it by sentence and by word.

In [None]:
# sentences = [['hi', 'there'], ['this', 'is', 'a', 'sentence']]

def make_sentences(list_txt):
    all_txt = []
    for txt in list_txt:
        lower_txt = txt.lower()
        sentences = sent_tokenize(lower_txt)
        sentences = [tokenizer.tokenize(sent) for sent in sentences]
        all_txt += sentences
        print(len(sentences))  # let's check how many sentences there are per item
    return all_txt

In [None]:
sentences = make_sentences(corpus)
#Looking at the number of sentences per novel

To train our vectors we call the function below. It has a couple dozen parameters, some of which are more important than others.
We will explain a few major parameters here. The ones you will almost always set yourself are marked with an asterisk:
1. `sentences*`: This is where you provide your data. It must be an iterable of iterables: a sequence of sentences, each of which is a list of tokens.
2. `sg`: Your choice of training algorithm. There are two standard ways of training W2V vectors -- 'skipgram' and 'CBOW'. If you enter 1 here, skip-gram is applied; otherwise, the default is CBOW.
3. `size*`: This is the length of your resulting word vectors (Gensim's default is 100). If you have a large corpus (more than a few billion tokens), you can go up to 100-300 dimensions. Generally, word vectors with more dimensions give better results, as long as there is enough data to train them. In Gensim 4.0 and later, this parameter is named `vector_size`.
4. `window`: This is the window of context words you are training on. In other words, how many words before and after your given word count as context. A good number here is 4 (Gensim's default is 5), but this can vary depending on what you are interested in. For instance, if you are more interested in embeddings that embody syntactic meaning, smaller window sizes work better.
5. `alpha`: The learning rate of your model. If you are interested in machine learning experimentation with your vectors, you may want to tune this parameter.
6. `seed` (int): This is the random seed for your random initialization. All deep learning models initialize the weights with random floats before training. This is a useful field if you want to replicate your experiments, because giving it a seed makes the 'random' initialization deterministic. For a fully reproducible run, you also need to limit training to a single worker thread (`workers=1`).
7. `min_count`: This is the minimum frequency threshold. If a word appears fewer times than this threshold, it will be ignored. This is here because there is too little evidence to train good vectors for words with very low frequency.
8. `iter`: This is the number of passes (epochs) over the corpus. Usually anything between 1 and 10 is OK. The trade-off is that more iterations take longer to train and the model may overfit your dataset; on the other hand, longer training lets your vectors perform better on tasks relevant to your dataset. In Gensim 4.0 and later, this parameter is named `epochs`.

Overall, most of these settings will not concern you unless you are interested in very specific usages of word vectors.
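
To make the `sentences` format and the `min_count` threshold concrete, here is a small sketch in plain Python. `build_vocab` is a hypothetical helper written for this illustration, not a Gensim function.

```python
from collections import Counter

def build_vocab(sentences, min_count=5):
    """Keep only the words that occur at least min_count times."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word: n for word, n in counts.items() if n >= min_count}

# An iterable of iterables: each inner list is one tokenized sentence.
toy_sentences = [["the", "whale", "swam"],
                 ["the", "whale", "dived"],
                 ["the", "ship", "sank"]]

print(build_vocab(toy_sentences, min_count=2))
# {'the': 3, 'whale': 2}
```

With `min_count=2`, the words that appear only once are dropped before training, so no vector would be learned for them.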

In [None]:
from IPython.display import Image
# Image from Pennington, et al. 2014
Image("./semantic_syntactic.png")

In [None]:
#Creating word vectors of length 100
#(in Gensim 4.0 and later, pass vector_size=100 instead of size=100)
model_example = gensim.models.Word2Vec(sentences, min_count=5, size=100)

Another way of training word2vec vectors with gensim is to use the `LineSentence` class. You can use it if your text file is formatted so that each line is one sentence, with tokens separated by whitespace.

In [None]:
!unzip text8.zip text8

In [None]:
# provide the name of the corpus text you want to train on
linesentence_example = gensim.models.word2vec.LineSentence('text8')

In [None]:
model = gensim.models.Word2Vec(linesentence_example, min_count=5, size=100)

In [None]:
model.wv.vocab  # in Gensim 4.0 and later, use model.wv.key_to_index instead

It's often useful to save your trained model to disk so that you can reload it as needed. 

In [None]:
model.save('our_model')

In [None]:
our_model = gensim.models.Word2Vec.load('our_model') 

In [None]:
model.wv['joy']

In [None]:
model.wv['sympathy']

Given the size of word2vec models, we'll often load just the vectors into memory, and delete the full model to save RAM. 

In [None]:
my_model = our_model.wv #keep just the word vectors 

In [None]:
del our_model 

In [None]:
print(type(my_model))

In [None]:
len(my_model.vocab) #the number of words in our model (in Gensim 4.0 and later, use len(my_model.key_to_index))

Gensim includes corpora and pretrained vectors that we can access and use.

In [None]:
import gensim.downloader as pretrained

Let's see what corpora it has available. 

In [None]:
#You can work with their corpus ... or models (below)
pretrained.info()['corpora'].keys()

In [None]:
pretrained.info('fake-news')

We can also check available pre-trained models.

In [None]:
#pretrained models/ word vectors
pretrained.info()['models'].keys()

We'll work with word2vec vectors trained on Google News. Before loading them, let's look at the metadata for one of the available models.

In [None]:
pretrained.info('glove-twitter-50')

In [None]:
news_model = pretrained.load('word2vec-google-news-300')

In [None]:
my_model = news_model.wv
del news_model

In [None]:
my_model['news']

Given these vectors, let's explore some similarity tasks. 

In [None]:
my_model.similarity('beautiful','sublime')  # Using Cosine-similarity

What do you think will be the similarity measure between 'sublime' and 'sublime'?

In [None]:
my_model.similarity('sublime','sublime')

We're using a similarity measure included in Gensim here, but we could also use similarity measures from scikit-learn, which you've seen before.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(my_model['beautiful'].reshape(1,-1), my_model['sublime'].reshape(1,-1))
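
Both calls compute the same quantity. As a reminder of what it is, here is a minimal sketch of cosine similarity in plain Python, using made-up three-dimensional vectors.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, twice as long
c = [-3.0, 0.0, 1.0]   # orthogonal to a

print(cosine(a, a), cosine(a, b), cosine(a, c))
```

A vector compared with itself, or with any rescaled copy of itself, scores 1 (up to floating-point rounding), and orthogonal vectors score 0. Only direction matters, not length.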

Words that are used in similar contexts should appear closer to each other than those that do not.

In [None]:
print(my_model.similarity('potato', 'leek')) 
print(my_model.similarity('anger', 'potato'))

Gensim gives us a number of handy methods, such as this one that returns a list of most similar words to a given word.

In [None]:
my_model.most_similar('democracy'), my_model.most_similar('liberalism')

In [None]:
my_model.most_similar('pluralism', topn=20)
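
Conceptually, a query like this ranks every other word in the vocabulary by its cosine similarity to the query word. The sketch below shows that idea on invented two-dimensional vectors; it is not Gensim's implementation, which is optimized for large vocabularies.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-dimensional vectors, invented for illustration.
toy_vectors = {
    "democracy": [0.9, 0.1],
    "liberalism": [0.8, 0.3],
    "pluralism": [0.7, 0.2],
    "potato": [0.1, 0.9],
}

def most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

for neighbor, score in most_similar("democracy", toy_vectors, topn=2):
    print(neighbor, round(score, 3))
```

With these toy vectors, "pluralism" ranks first and "liberalism" second, while "potato" points in a very different direction and falls outside the top two.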

Given a list of words, we can identify the most similar word to one we provide.

In [None]:
candidates = ['sweet','sour','bitter','nice']
my_model.most_similar_to_given('blueberry', candidates)

Let's look at the similarity of each word.

In [None]:
for c in candidates:
    print(c, my_model.similarity('blueberry',c))

If we want a list of words that are closer to a given word than some other word of interest is, there's an easy method for it. You could read the code below as "words closer to cold than the word dry is".

In [None]:
my_model.words_closer_than('cold','dry')

You can also play with analogy tasks. The commonly seen task is:

'London is to England as Baghdad is to ____?'


'A is to A\* as B is to B\*'

Gensim provides two different ways of implementing this task. You may be more familiar with the additive version, also called the 3CosAdd method:

$$\underset{b^* \in V}{\textrm{arg max}} \left(\cos(b^*, b) - \cos(b^*, a) + \cos(b^*, a^*)\right)$$

This reflects the vector arithmetic Baghdad - London + England. In this maximization, we search for the word vector that produces the highest value of this expression.

The second is a more balanced approach proposed by Levy & Goldberg 2014 (http://www.aclweb.org/anthology/W14-1618), also called the 3CosMul method.

We find B\* by going through every candidate B\* in the vocabulary (V) and identifying which one returns the highest value. In other words, we find the argument that maximizes the following expression, where the epsilon is added only to avoid division by zero:

$$\underset{b^* \in V}{\textrm{arg max}} \frac{\cos(b^*, b)\cos(b^*, a^*)}{\cos(b^*, a) + \epsilon}$$

We can implement these methods with provided functions. Positive here refers to words that contribute positively to the similarity (the numerator in 3CosMul), and negative refers to words that contribute negatively (the denominator). Here's the additive method.

In [None]:
my_model.most_similar(positive=['woman','king'], negative=['man'])

Here's the multiplicative method. 

In [None]:
my_model.most_similar_cosmul(positive=['england','baghdad'], negative=['london'])
#This is not correct! Why?

Unfortunately, in this example we see that this returns Afghanistan, when Baghdad is the capital of Iraq! This is an example of how the corpus can bias our findings.
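
To see the two objectives side by side without a trained model, here is a sketch that applies 3CosAdd and 3CosMul to hand-made toy vectors. The vectors and function names are invented for illustration. Following Levy & Goldberg, the multiplicative version first shifts cosines from [-1, 1] to [0, 1] so that the ratio stays non-negative. This is not Gensim's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up vectors; the dimensions stand for England, Iraq, France, "is a capital".
toy = {
    "england": [1.0, 0.0, 0.0, 0.0],
    "london":  [1.0, 0.0, 0.0, 1.0],
    "iraq":    [0.0, 1.0, 0.0, 0.0],
    "baghdad": [0.0, 1.0, 0.0, 1.0],
    "france":  [0.0, 0.0, 1.0, 0.0],
    "paris":   [0.0, 0.0, 1.0, 1.0],
}

def candidates(vectors, *query_words):
    """Every word in the vocabulary except the query words themselves."""
    return [w for w in vectors if w not in query_words]

def three_cos_add(a, a_star, b, vectors):
    """arg max of cos(b*, b) - cos(b*, a) + cos(b*, a*)."""
    def score(word):
        v = vectors[word]
        return cosine(v, vectors[b]) - cosine(v, vectors[a]) + cosine(v, vectors[a_star])
    return max(candidates(vectors, a, a_star, b), key=score)

def three_cos_mul(a, a_star, b, vectors, eps=0.001):
    """arg max of cos(b*, b) * cos(b*, a*) / (cos(b*, a) + eps), on shifted cosines."""
    def shifted(u, v):
        return (cosine(u, v) + 1) / 2   # map cosines from [-1, 1] to [0, 1]
    def score(word):
        v = vectors[word]
        return shifted(v, vectors[b]) * shifted(v, vectors[a_star]) / (shifted(v, vectors[a]) + eps)
    return max(candidates(vectors, a, a_star, b), key=score)

# 'London is to England as Baghdad is to ____?'
print(three_cos_add("london", "england", "baghdad", toy))
print(three_cos_mul("london", "england", "baghdad", toy))
```

On these idealized vectors both methods return "iraq". Real embeddings are much noisier, which is one reason the trained model above can give a different answer.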

We know enough to start asking some questions. What are good vectors? What are bad vectors? How much training/data do we need?

Gensim has a lot of built-in tools. Check the documentation here: https://radimrehurek.com/gensim/models/keyedvectors.html



If you've been reading about word vectors, you may have heard about GloVe vectors. Gensim can work with those too! Let's take a look at GloVe vectors so that we can understand the difference and see how to use them in Gensim.

How are GloVe and W2V different?

GloVe is trained in a similar way, by iteratively adjusting vectors to reduce a cost. The main difference is that GloVe is fit to corpus-wide co-occurrence counts, the same kind of information behind the PMI matrix you saw in part 1. Rather than sliding a window over the text during training as W2V does, GloVe constructs a co-occurrence matrix beforehand. This is why GloVe is sometimes faster, but W2V can continue training on new data, whereas GloVe must be trained from scratch (constructing a new matrix from the beginning).
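
As a rough picture of the matrix GloVe starts from, the sketch below counts how often pairs of words appear near each other in a toy corpus. The function name and the sentences are ours, and the plain unweighted counts are a simplification for illustration.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered pair of words appears within `window` tokens."""
    counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, sent[j])] += 1
    return counts

toy_sentences = [["the", "white", "whale"],
                 ["the", "white", "ship"]]

counts = cooccurrence_counts(toy_sentences, window=1)
print(counts[("the", "white")], counts[("white", "whale")], counts[("the", "whale")])
# 2 1 0
```

The whole corpus is summarized in this table once, up front. Adding new text changes the counts, which is why the matrix has to be rebuilt before retraining.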

In [None]:
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors
glove_file = "./glove/glove.6B.300d.txt"
glove2word2vec_file = "glove2word2vec.txt"
glove2word2vec(glove_file, glove2word2vec_file) #we simply call this function to reformat it a bit
glove_model = KeyedVectors.load_word2vec_format(glove2word2vec_file, binary=False) #read in the same file 
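
The "reformatting" is small: the word2vec text format starts with a header line giving the number of vectors and their dimensionality, which GloVe's plain-text files lack. The sketch below shows the idea on two made-up vectors; `add_word2vec_header` is a hypothetical helper, not Gensim's code.

```python
import io

def add_word2vec_header(glove_lines):
    """Return word2vec-format lines for an iterable of GloVe-format lines."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    dimensions = len(rows[0].split(" ")) - 1   # the first field is the word itself
    header = f"{len(rows)} {dimensions}"
    return [header] + rows

# Two made-up 3-dimensional vectors in GloVe's plain-text layout.
glove_text = io.StringIO("joy 0.1 0.2 0.3\nwhale 0.4 0.5 0.6\n")
converted = add_word2vec_header(glove_text)
print("\n".join(converted))
# 2 3
# joy 0.1 0.2 0.3
# whale 0.4 0.5 0.6
```

In Gensim 4.0 and later, `load_word2vec_format` also accepts `no_header=True`, which should let you read a GloVe file directly without the conversion step.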

In [None]:
glove_model['joy']

Given the same test with England and Baghdad from above, let's see how GloVe vectors, which were trained on a different corpus, perform.

In [None]:
glove_model.most_similar_cosmul(positive=['england','baghdad'], negative=['london'])
#Success!

### PCA visualizations

Principal Component Analysis (PCA) is a way of analyzing your data's principal components: the directions along which the data varies the most.
Like SVD from part 1, PCA returns dimensions ordered by how much of the variance in your data they capture, highest first.
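
To give a feel for what "the direction of highest variance" means, here is a small sketch that finds the first principal component of some made-up two-dimensional points using power iteration. This is for illustration only; scikit-learn's `PCA`, used below, computes the components differently.

```python
import math

def center(points):
    """Subtract each column's mean so the data is centered at zero."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    return [[p[j] - means[j] for j in range(d)] for p in points]

def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(p[i] * p[j] for p in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=100):
    """Power iteration: repeatedly multiply by the covariance matrix and renormalize."""
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        vec = [sum(row[j] * vec[j] for j in range(len(vec))) for row in cov]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
    return vec

# Made-up points lying roughly along the line y = 2x.
points = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]]
cov = covariance(center(points))
pc1 = first_component(cov)

# Share of the total variance captured along the first component.
variance_along_pc1 = sum(pc1[i] * cov[i][j] * pc1[j] for i in range(2) for j in range(2))
total_variance = cov[0][0] + cov[1][1]
print(pc1, variance_along_pc1 / total_variance)
```

Because the points sit almost on a line, the first component points along that line and accounts for nearly all of the variance. This is why projecting onto a few components can preserve most of the structure in the data.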

In [None]:
import numpy as np
from sklearn.decomposition import PCA

In [None]:
countries = ["china", "russia", "france", "germany","greece","japan","italy"]

In [None]:
capitals = ["beijing","moscow","paris","berlin","athens","tokyo","rome"]

In [None]:
X = []

for loc in countries+capitals:
    X.append(glove_model[loc])

In [None]:
pca = PCA(n_components=2)
xy_coords = pca.fit_transform(X)
loc_x, loc_y = zip(*xy_coords)

In [None]:
loc_x

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt


fig, ax = plt.subplots(figsize=(16, 8))
ax.scatter(loc_x, loc_y)

for i, location in enumerate(countries+capitals):
    ax.annotate(location, (loc_x[i]+.05, loc_y[i]-.05))

plt.title("Countries and their Capitals")
plt.show()