# NLP Seminar 3: static word embeddings (word2vec, fasttext, GloVe)

In this seminar, we will use the `gensim` package, as it provides unified, easy-to-use implementations and pretrained models for word2vec, fasttext, and GloVe.

In [None]:
#!pip install --upgrade gensim

In [None]:
import numpy as np
import pandas as pd

In [None]:
import multiprocessing
cores = multiprocessing.cpu_count()
cores

In [None]:
from nltk.tokenize import word_tokenize

# Data preparation

In [None]:
simpsons = pd.read_csv("data/simpsons_script_lines.csv",
                       usecols=["raw_character_text", "raw_location_text", "spoken_words", "normalized_text"],
                       dtype={'raw_character_text':'string', 'raw_location_text':'string',
                              'spoken_words':'string', 'normalized_text':'string'})
simpsons.head()

In [None]:
simpsons.info()

In [None]:
simpsons = simpsons.dropna().drop_duplicates().reset_index(drop=True)

In [None]:
corpus_tok = simpsons['normalized_text'].str.split().to_list()
corpus_tok[1]

In [None]:
# If you don't know the Simpsons TV show,
# you can e.g. use the Wikipedia subset corpus instead
# (and try different words when evaluating the vectors and similarities in the next sections):

#import gensim.downloader as gensim_api
#gensim_api.info('text8')['description']
#corpus_tok = gensim_api.load('text8')

## Phraser

https://radimrehurek.com/gensim/models/phrases.html

In [None]:
from gensim.models.phrases import Phrases, Phraser
phrases = Phrases(corpus_tok, min_count=30)
phraser = Phraser(phrases)
del(phrases)

In [None]:
phraser[["homer", "simpson", "eats", "chocolate"]]

In [None]:
corpus_phrased = phraser[corpus_tok]

In [None]:
corpus_phrased[1]

# Word2vec

Word2vec has two sub-methods for training the word embeddings: continuous bag of words (CBOW) and skip-gram.
In both cases, a shallow neural network is trained to predict either

- a word given a context (CBOW), or
- the context of a given word (skip-gram).

The context is defined as the other words surrounding the target word within a given window. The word embedding vectors are then obtained, for each word in the vocabulary, from the trained weight matrices (in practice, usually the input-side matrix).
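
To make the two prediction tasks concrete, here is a minimal sketch of how training examples could be built from a tokenized sentence. The function name `training_pairs` and the toy sentence are only illustrative assumptions, and real implementations add further details (for example subsampling of frequent words or randomly shrinking the window) that are left out here.

```python
def training_pairs(tokens, window=2):
    """Yield (context_words, center_word) for every position in a tokenized sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before the center
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after the center
        context = left + right
        if context:
            yield context, center

# Hypothetical toy sentence, for illustration only
sentence = ["bart", "rides", "his", "skateboard"]

# CBOW: the whole context is used to predict the center word
cbow_examples = list(training_pairs(sentence, window=1))

# Skip-gram: the center word is used to predict each context word separately
skipgram_examples = [(center, ctx) for context, center in cbow_examples for ctx in context]

print(cbow_examples)
print(skipgram_examples)
```

In `gensim`, the choice between the two tasks corresponds to the `sg` parameter (`sg=0` for CBOW, `sg=1` for skip-gram).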


Official website: https://code.google.com/archive/p/word2vec/

Original papers: http://arxiv.org/abs/1301.3781 and http://arxiv.org/abs/1310.4546

### Training word2vec on the Simpsons scripts

In [None]:
from gensim.models import Word2Vec, KeyedVectors

In [None]:
w2v_s = Word2Vec(corpus_phrased, vector_size=150, window=3, min_count=2, sg=0, negative=5, ns_exponent=0.75,
                 alpha=0.025, min_alpha=0.0001, workers=cores-1, epochs=30)
#1st line: Method's hyperparameters
#2nd line: Optimization (gradient descent) hyperparameters

Can also be done in separate steps:

    w2v_s = Word2Vec(vector_size=150, window=3, min_count=2, sg=0, negative=5, ns_exponent=0.75,
                     alpha=0.025, min_alpha=0.0001, workers=cores-1)
    w2v_s.build_vocab(corpus_phrased, progress_per=10000)
    w2v_s.train(corpus_phrased, total_examples=w2v_s.corpus_count, epochs=30, report_delay=1)

Word embedding vectors can then be obtained from the trained model, for each word in the training vocabulary.

In [None]:
homer_vector = w2v_s.wv.get_vector("homer", norm=True) # Father of the Simpsons
homer_vector

For a given word or vector, one can query the other most similar word vectors, in terms of cosine similarity.

In [None]:
w2v_s.wv.most_similar("homer") # Marge is Homer's wife

In [None]:
w2v_s.wv.most_similar(homer_vector)

In [None]:
w2v_s.wv.most_similar("homer_simpson") # name bigram

In [None]:
w2v_s.wv.most_similar("bart") # Bart is the son, Lisa his sister and Milhouse his best friend

One can also compute the cosine similarity between two word vectors

In [None]:
w2v_s.wv.similarity('bart', 'lisa')

In [None]:
w2v_s.wv.similarity('bart', 'bart')
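
For reference, the cosine similarity between two vectors is their dot product divided by the product of their norms. The following is a minimal NumPy sketch of that formula; the helper name `cosine_similarity` and the toy vectors are assumptions made for illustration and are not part of `gensim`.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 2-dimensional vectors, for illustration only
a = np.array([1.0, 0.0])
b = np.array([0.0, 2.0])
c = np.array([3.0, 3.0])

print(cosine_similarity(a, a))  # identical direction
print(cosine_similarity(a, b))  # orthogonal vectors
print(cosine_similarity(a, c))  # 45 degree angle
```

The value only depends on the directions of the vectors, not on their lengths, which is why the similarity of a word with itself is 1.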

Odd-one-out identification:

In [None]:
w2v_s.wv.doesnt_match(['homer', 'patty', 'selma']) # Patty and Selma are Marge's twin sisters
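
One plausible way to find the odd one out is to pick the word whose vector is least similar to the average of all the given vectors. The sketch below illustrates this idea with NumPy; the function `odd_one_out` and the 2-dimensional toy vectors are hypothetical, and the sketch should be read as an approximation of the idea rather than as the exact `gensim` code.

```python
import numpy as np

def odd_one_out(words, vectors):
    """Return the word whose unit vector has the lowest cosine similarity to the mean vector."""
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    center = unit.mean(axis=0)
    center = center / np.linalg.norm(center)   # normalize the mean direction
    sims = unit @ center                       # cosine similarity of each word to the mean
    return words[int(np.argmin(sims))]

# Hypothetical toy vectors: "patty" and "selma" point in a similar direction
toy_vectors = {
    "patty": np.array([1.0, 0.1]),
    "selma": np.array([0.9, 0.2]),
    "homer": np.array([0.1, 1.0]),
}

print(odd_one_out(["homer", "patty", "selma"], toy_vectors))
```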

Word analogies: how well do embedding vectors capture intuitive semantic and syntactic analogies?

In [None]:
# " Homer - man + woman = ? " - i.e. " man:Homer :: woman:? "
w2v_s.wv.most_similar(positive=["homer", "woman"], negative=["man"], topn=3) # Marge is Homer's wife

In [None]:
# " woman - Marge + Homer = ? " - i.e. " Marge:Homer :: woman:? " 
w2v_s.wv.most_similar(positive=["woman", "homer"], negative=["marge"], topn=3)
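
The analogy queries above boil down to vector arithmetic followed by a nearest-neighbour search. Here is a minimal sketch of that idea, assuming unit-normalized vectors, cosine similarity, and exclusion of the query words from the candidates. The function `analogy` and the 2-dimensional toy vectors are hypothetical and only serve as an illustration.

```python
import numpy as np

def analogy(positive, negative, vectors, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative), excluding query words."""
    unit = {w: v / np.linalg.norm(v) for w, v in vectors.items()}
    query = sum(unit[w] for w in positive) - sum(unit[w] for w in negative)
    query = query / np.linalg.norm(query)
    candidates = [w for w in unit if w not in positive and w not in negative]
    ranked = sorted(candidates, key=lambda w: float(unit[w] @ query), reverse=True)
    return ranked[:topn]

# Hypothetical toy vectors: the second coordinate loosely encodes "gender"
toy_vectors = {
    "homer": np.array([1.0, 1.0]),
    "marge": np.array([1.0, -1.0]),
    "man": np.array([0.1, 1.0]),
    "woman": np.array([0.1, -1.0]),
    "bart": np.array([-1.0, 0.5]),
}

print(analogy(positive=["homer", "woman"], negative=["man"], vectors=toy_vectors))
```

Excluding the query words matters: without that step, the closest vector to the result is often one of the input words themselves.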

### Sentence embedding

In [None]:
def document2vec(tokens, embedding_wv, phraser=None, normalize=True):
    """Returns the embedding of a sentence or document as the mean of its tokens/words embeddings."""
    if phraser:
        tokens = phraser[tokens]
    sent_mean = np.array([embedding_wv.get_vector(tok, norm=normalize) for tok in tokens]).mean(axis=0)
    return sent_mean

In [None]:
document2vec(["bart", "is", "grounded"], w2v_s.wv, phraser=phraser)
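
Note that with word2vec vectors, looking up a token that is not in the training vocabulary raises a `KeyError`, so averaging fails as soon as one token is unknown. One possible variant, sketched below, simply skips unknown tokens. The function `mean_embedding` and the toy dictionary are assumptions for illustration; a plain `dict` stands in for the word vectors here, since `KeyedVectors` objects support the same `in` and `[]` operations.

```python
import numpy as np

def mean_embedding(tokens, vectors, dim):
    """Average the vectors of the known tokens; return a zero vector if none is known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical toy vocabulary: "is" is deliberately missing
toy_vectors = {
    "bart": np.array([1.0, 0.0]),
    "grounded": np.array([0.0, 1.0]),
}

print(mean_embedding(["bart", "is", "grounded"], toy_vectors, dim=2))
print(mean_embedding(["unknown", "tokens"], toy_vectors, dim=2))
```

Returning a zero vector for fully unknown documents is only one design choice; depending on the task, dropping such documents may be preferable.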

### Pretrained word2vec vectors

https://github.com/RaRe-Technologies/gensim-data#models

In [None]:
import gensim.downloader as gensim_api

In [None]:
v2w_pret = gensim_api.load('word2vec-google-news-300')

In [None]:
# Or from downloaded source (e.g. https://code.google.com/archive/p/word2vec/):
#v2w_pret = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [None]:
v2w_pret

In [None]:
v2w_pret.most_similar(positive=["eat"])

In [None]:
v2w_pret.similarity("eat", 'consume')

In [None]:
v2w_pret.doesnt_match(["eat", 'dance', 'drink'])

In [None]:
v2w_pret.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car'])

In [None]:
v2w_pret.most_similar(positive=["king", "woman"], negative=["man"], topn=3)

In [None]:
vectcalc = v2w_pret.get_vector("king", norm=True) - v2w_pret.get_vector("man", norm=True) + v2w_pret.get_vector("woman", norm=True)
v2w_pret.most_similar(vectcalc)

# Fasttext

Fasttext is a static word embedding method that is very similar to word2vec. The main difference is that it uses character-level n-gram vectors together with the word vectors.
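
As an illustration of what character-level n-grams look like, the sketch below extracts them from a word after adding the boundary markers `<` and `>`, as described in the original paper. The function name `char_ngrams` is an assumption, and the hashing of n-grams into a fixed number of buckets done by real implementations is left out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of the word wrapped in boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

print(char_ngrams("where", min_n=3, max_n=3))
print(char_ngrams("eat"))
```

The vector of a word is then built from the vectors of its n-grams, which is what makes it possible to compute an embedding for a word that was never seen during training.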

Advantages:
+ Embeddings can be obtained for words outside the training vocabulary.
+ Better representation for rare words (that are morphologically similar to others).
+ Tends to perform better for syntactic tasks.
+ Is more useful in morphologically rich languages (such as German, Arabic and Russian) compared to English (German example: 'table tennis' -> 'Tischtennis'), but it heavily depends on the task.
+ Might work better for small datasets.
+ The "official" implementation is quite efficient, and allows training the embedding and classifier at once (see the official `fasttext` package documentation https://fasttext.cc/docs/en/python-module.html).

Disadvantages:
- Can overfit more easily, and is a bit harder to tune because of the additional character n-gram hyperparameters.
- Tends to perform more poorly for semantic tasks.
- May favor morphologically close words over semantically closer synonyms.
- Can be heavier to train.

However, the differences between fasttext and word2vec tend to decrease as the size of the training corpus increases.

Official website: https://fasttext.cc/

Original paper: https://arxiv.org/abs/1607.04606

### Training fasttext on the Simpsons scripts

In [None]:
from gensim.models import FastText

In [None]:
fst_s = FastText(corpus_phrased, vector_size=150, window=3, min_count=5, sg=0, negative=5, ns_exponent=0.75,
                 min_n=3, max_n=6, # Additional fasttext hyperparameters
                 alpha=0.025, min_alpha=0.0001, workers=cores-1, epochs=30)

Can also be performed in separate steps:

    fst_s = FastText(vector_size=150, window=3, min_count=5, sg=0, negative=5, ns_exponent=0.75,
                     min_n = 1, max_n = 4,
                     alpha=0.025, min_alpha=0.0001, workers=cores-1)
    fst_s.build_vocab(corpus_phrased)
    print(len(fst_s.wv.key_to_index))  # vocabulary size (gensim >= 4.0)
    fst_s.train(corpus_phrased, total_examples=fst_s.corpus_count, epochs=100)

In [None]:
"unige" in fst_s.wv.index_to_key, "unige" in w2v_s.wv.index_to_key

In [None]:
try:
    print(w2v_s.wv.get_vector("unige", norm=True))
except KeyError:
    print("KeyError: the given token is not present in the vocabulary.")

In [None]:
try:
    print(fst_s.wv.get_vector("unige", norm=True))
except KeyError:
    print("KeyError: the given token is not present in the vocabulary.")

In [None]:
fst_s.wv.most_similar("homer", topn = 10)

In [None]:
fst_s.wv.most_similar("marge", topn = 10)

In [None]:
w2v_s.wv.most_similar("marge", topn = 10)

In [None]:
fst_s.wv.most_similar("eat", topn = 10)

In [None]:
w2v_s.wv.most_similar("eat", topn = 10)

### Pretrained fasttext vectors

In [None]:
fst_pret = gensim_api.load('fasttext-wiki-news-subwords-300')

In [None]:
# or from downloaded source (e.g. https://fasttext.cc/docs/en/english-vectors.html):
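# (requires a .bin file in Facebook's fastText format; gensim >= 4.0)
# from gensim.models.fasttext import load_facebook_vectors
# fst_pret = load_facebook_vectors('fasttext_file.bin')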

In [None]:
fst_pret.most_similar("eat", topn = 10)

In [None]:
v2w_pret.most_similar("eat", topn = 10)

In [None]:
fst_pret.most_similar("consume", topn = 10)

In [None]:
v2w_pret.most_similar("consume", topn = 10)

# GloVe

Contrary to word2vec and fasttext, GloVe doesn't use skipgram or CBOW networks. GloVe relies on word-context co-occurrence matrix factorization to obtain the embedded word vectors.
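
To illustrate the starting point of this approach, the sketch below counts word-context co-occurrences within a symmetric window on a toy corpus. The function `cooccurrence_counts` and the corpus are hypothetical, and plain counts are a simplification: the reference GloVe implementation additionally weights each co-occurrence by the inverse of the distance between the two words.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context_word) pair occurs within the given window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:                      # a word is not its own context
                    counts[(word, tokens[j])] += 1
    return counts

# Hypothetical toy corpus, for illustration only
toy_corpus = [["homer", "eats", "donuts"], ["bart", "eats", "donuts"]]
counts = cooccurrence_counts(toy_corpus, window=1)

print(counts[("eats", "donuts")])
print(counts[("homer", "eats")])
print(counts[("homer", "donuts")])
```

GloVe then fits the word vectors so that their dot products (plus bias terms) approximate the logarithm of these co-occurrence counts.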

- GloVe can take longer to train on larger corpora, compared to word2vec.
- It has fewer hyperparameters, so it's much easier to tune, but it then cannot be fine-tuned as much for a specific task.
- In comparison, word2vec and fasttext are much more sensitive to the choice of hyperparameters, and their results can thus vary much more.

Official website: https://nlp.stanford.edu/projects/glove/

Original paper: https://nlp.stanford.edu/pubs/glove.pdf


In [None]:
glv_pret = gensim_api.load("glove-wiki-gigaword-200")

# GloVe models already available through the gensim downloader:
#'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300',
#'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200'

From downloaded source (e.g. https://nlp.stanford.edu/projects/glove/), one can do:

    from gensim.models import KeyedVectors
    glove_file = 'DOWNLOADED_GLOVE_VECTORS.txt'  # path to the downloaded GloVe text file
    # GloVe text files have no header line, hence no_header=True (gensim >= 4.0)
    model = KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True)

In [None]:
glv_pret.most_similar(positive=["better", "fast"], negative=["good"], topn=3)

In [None]:
glv_pret.most_similar("eat", topn = 10)

In [None]:
glv_pret.most_similar("consume", topn = 10)

### Remark: training GloVe

Training GloVe vectors is not possible with gensim. If interested, one can use the [official GloVe code](https://nlp.stanford.edu/projects/glove/) (command line interface).

For a Python interface, see for example the ("toy implementation") [`glove_python`](https://github.com/maciejkula/glove-python) package:

    !pip install glove_python

### See also other vector embeddings...

https://github.com/RaRe-Technologies/gensim-data#models

In [None]:
gensim_api.info('conceptnet-numberbatch-17-06-300')['description']
#conceptnet = gensim_api.load("conceptnet-numberbatch-17-06-300")

# Saving gensim models and word vectors

One can save either the entire model (if further training is expected).

In [None]:
w2v_s.save('word2vec_simpson_model')
w2v_s = Word2Vec.load('word2vec_simpson_model')

Or only the word vectors (the `KeyedVectors`-type attribute) if the vectors are final. They take much less space to save and much less memory to load.

In [None]:
w2v_s.wv.save('word2vec_simpson_word_vectors')
w2v_s_wv = KeyedVectors.load('word2vec_simpson_word_vectors')

# Exercise: ML classification using advanced static embeddings

Compare the performance of the logistic regression classifier on the 20 Newsgroups dataset using word2vec, GloVe or fasttext to the performance achieved in the previous seminar using TF-IDF.