# Machine Reading: Advanced Topics in Word Vectors
## Part III. Pre-trained Models and Extended Vector Algorithms (50 mins)

This is a 4-part series of Jupyter notebooks on the topic of word embeddings, originally created for a workshop during the Digital Humanities 2018 Conference in Mexico City. Each part consists of a mix of theoretical explanations and fill-in-the-blanks activities of increasing difficulty.

Instructors:
- Eun Seo Jo, <a href="mailto:eunseo@stanford.edu">*eunseo@stanford.edu*</a>, Stanford University
- Javier de la Rosa, <a href="mailto:versae@stanford.edu">*versae@stanford.edu*</a>, Stanford University
- Scott Bailey, <a href="mailto:scottbailey@stanford.edu">*scottbailey@stanford.edu*</a>, Stanford University

This unit will explore the various flavors of word embeddings specifically tailored to sentences, word meanings, paragraphs, or entire documents. We will give an overview of pre-trained embeddings, including where they can be found, how to use them, and what they're effective for.

- 0:00 - 0:20 Pre-trained word embeddings (where to find them, which are good, configurations, trained corpus, etc., e.g. https://github.com/facebookresearch/fastText)
- 0:20 - 0:35 Overview of other 2Vecs & other vector engineering: Paragraph2Vec, Sense2Vec, Doc2Vec, etc.
- 0:35 - 0:50 [Activity 3] Choose, download, and use a pre-trained model

---

### 0. Setting Up 

Before we get started, let's go ahead and set up our notebook. We will start by importing a few Python libraries that we will use throughout the workshop.

#### What are these libraries?

1. NumPy: This is a package for scientific computing in Python. For us, NumPy is useful for vector operations.
2. NLTK: Easy to use Python package for text processing (lemmatization, tokenization, POS-tagging, etc.)
3. matplotlib, seaborn, and Plotly: Plotting packages for visualization
4. scikit-learn: Easy to use Python package for machine learning algorithms and preprocessing tools
5. gensim: Built-in word2vec and other NLP algorithms
6. fastText: Super fast word embeddings library

We will be working with a few sample texts using NLTK's corpus package.

In [None]:
%%capture --no-stderr
import sys
!pip install Cython  # needed to compile fasttext
!pip install -r requirements.txt
!python -m nltk.downloader all
print("All done!", file=sys.stderr)

If all went well, we should now be able to import the following packages into our workspace.

In [None]:
import numpy as np
import nltk
# import plotly.plotly as py
import sklearn
import matplotlib.pyplot as plt
import gensim
import fasttext



---



### 1. Out-of-vocabulary words and pre-trained embeddings

So far, we've seen the power of word embeddings and how easy they are to obtain from your own corpus. In most cases, however, we do not have access to the millions of unlabelled documents in our target domain that would allow for training good embeddings from scratch. Training word embeddings is resource intensive and may require relatively large corpora for the geometric relationships to be semantically meaningful. Even then, regular word-oriented embeddings have some issues. To illustrate this, consider the following code, which trains a model on the text of _Alice in Wonderland_.

In [None]:
print(nltk.corpus.gutenberg.raw('carroll-alice.txt')[0:200])

We'll use the handy `.words()` method in NLTK to access just the words.

In [None]:
words = list(map(str.lower, nltk.corpus.gutenberg.words('carroll-alice.txt')))
words[:10]

And now let's train a very simple `word2vec` model.

In [None]:
documents = [words]
model = gensim.models.Word2Vec(
    documents,
    size=25,  # this parameter is called `vector_size` in gensim >= 4.0
    window=5,
    min_count=1,
    workers=10
)
model.train(documents, total_examples=len(documents), epochs=10)
model.wv['alice']

Whether or not this model is able to capture semantic similarities, word vectors have been computed. However, if you try to look up a word that is not in the vocabulary, you'll get an error.

In [None]:
try:
    model.wv['google']
except KeyError as e:
    print(e)

This is known as the Out-Of-Vocabulary (OOV) issue in Word2Vec and similar approaches.
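To get a feel for how often this happens, one can measure the share of tokens in a text that fall outside a vocabulary. The following is a minimal sketch: the helper name `oov_rate` and the toy vocabulary are hypothetical, and with a trained model the vocabulary would come from the model itself.

```python
def oov_rate(tokens, vocabulary):
    """Return the fraction of tokens that are not in the vocabulary."""
    if not tokens:
        return 0.0
    missing = [token for token in tokens if token not in vocabulary]
    return len(missing) / len(tokens)


# Hypothetical toy vocabulary standing in for a trained model's vocabulary
toy_vocabulary = {"alice", "was", "beginning", "to", "get", "very", "tired"}
toy_tokens = ["alice", "was", "tired", "of", "google"]

rate = oov_rate(toy_tokens, toy_vocabulary)
print(rate)  # "of" and "google" are missing, so 2 of 5 tokens are OOV
```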

Now, you may think, I could get synonyms of the OOV words using something like WordNet, and then look for those words' embeddings. And while that might work in some cases, in others it is not that simple. Two such cases are new-ish words like `facebook` and `google`, or proper names of places, like `Teotihuacan`.

One way to solve this issue is to use a different unit of atomicity in your algorithm. In Word2Vec-like approaches, including GloVe, the word is the minimum unit, so a word that is not in the vocabulary has no vector at all. A different approach is to train on sub-word units, for example character 3-grams. This does not guarantee that every word will be covered, but many will be, because a large enough corpus is far more likely to contain all the trigrams of a word than every possible word. This is the approach taken by Facebook's fastText.
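As a rough illustration of sub-word units, the sketch below splits a word into character 3-grams after wrapping it in `<` and `>` boundary markers, as fastText does. The function name `char_ngrams` is hypothetical, and the sketch uses a single n-gram length for simplicity.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


print(char_ngrams("where"))

# Two different words can still share several n-grams
shared = sorted(set(char_ngrams("google")) & set(char_ngrams("googling")))
print(shared)
```

Because an unseen word usually shares n-grams with words that were seen during training, the model has something to build a vector from.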

In [None]:
from gensim.models import FastText

fasttext_model = FastText(documents, size=25, min_count=1)  # `size` is `vector_size` in gensim >= 4.0
fasttext_model.wv['alice']

In [None]:
fasttext_model.wv['google']

fastText also distributes word vectors pre-trained on [Common Crawl](http://commoncrawl.org/) and [Wikipedia](https://www.wikipedia.org/) for 157 languages. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. They come in binary and text format: the binary format includes a model ready to use, while the text format only contains the actual vectors associated with each word in the training vocabulary.

Gensim is soon to include a dedicated method to load these fastText embeddings (not working as of 3.4.0). Keep in mind that only the `.bin` format allows for OOV word vectors. For the regular, and usually lighter, `.vec` format you would still need to load the vectors, save a binary Gensim model, and load it back in.
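To make the text format concrete: a `.vec` file starts with a header line holding the number of words and the vector dimension, followed by one line per word with its values. The sketch below parses that layout from an in-memory sample; the function name `read_vec` and the sample contents are hypothetical, and for real files a library loader is the better choice.

```python
import io


def read_vec(stream):
    """Parse the text format: a 'count dimension' header, then one word per line."""
    count, dimension = map(int, stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dimension:
            raise ValueError("unexpected vector length for " + word)
        vectors[word] = [float(value) for value in values]
    if len(vectors) != count:
        raise ValueError("header count does not match number of rows")
    return vectors


# Hypothetical two-word file with 3-dimensional vectors, held in memory
sample = io.StringIO("2 3\nalice 0.1 0.2 0.3\nrabbit 0.4 0.5 0.6\n")
vectors = read_vec(sample)
print(vectors["rabbit"])
```

Since this format only stores whole-word vectors, nothing in it can be used to build a vector for an unseen word.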

In [None]:
import io
import os

filename = 'wiki.so.vec'
if not os.path.isfile(filename):
    !echo "Downloading $filename"
    !curl --progress-bar -Lo $filename https://s3-us-west-1.amazonaws.com/fasttext-vectors/$filename

somali_model = gensim.models.KeyedVectors.load_word2vec_format(filename, binary=False)
somali_model['xiddiga'][:25]  # 'xiddiga' means 'star' in Somali

In [None]:
# This might take a while
filename = 'wiki.simple.zip'
if not os.path.isfile(filename):
    !echo "Downloading $filename"
    !curl --progress-bar -Lo $filename https://s3-us-west-1.amazonaws.com/fasttext-vectors/$filename
    !unzip $filename
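# Note: wiki.simple is the Simple English Wikipedia model, not a Somali one
simple_model_oov = FastText.load_fasttext_format('wiki.simple')
simple_model_oov.wv['google']  # the .bin model can build vectors from sub-word n-grams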

The fastText English embeddings are also included in Gensim's `downloader` feature. They are loaded as plain keyed vectors, that is, **without** sub-word information, so out-of-vocabulary lookups are not supported.

In [None]:
import gensim.downloader as pretrained

pretrained.info()['models']['fasttext-wiki-news-subwords-300']

In [None]:
fasttext_english = pretrained.load('fasttext-wiki-news-subwords-300')

In [None]:
fasttext_english['alice'][:25]

In [None]:
!du -h ../data/GoogleNews-vectors-negative300.bin.gz

Out-of-vocabulary words are handled by splitting the word into its character n-grams, looking up the embedding of each n-gram, and averaging them to produce the final word vector for the OOV word.
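The composition step can be sketched in a few lines. This is a simplification under stated assumptions: the n-gram table `toy_ngram_vectors` is hypothetical, only 3-grams are used, and the helper names are made up for illustration, so it should not be read as fastText's actual implementation.

```python
import numpy as np


def char_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


def oov_vector(word, ngram_vectors, dimension):
    """Average the vectors of the word's known n-grams; zeros if none are known."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return np.zeros(dimension)
    return np.mean(known, axis=0)


# Hypothetical 2-dimensional n-gram table; real tables are learned during training
toy_ngram_vectors = {
    "<go": np.array([1.0, 0.0]),
    "goo": np.array([0.0, 1.0]),
    "le>": np.array([1.0, 1.0]),
}
vector = oov_vector("google", toy_ngram_vectors, dimension=2)
print(vector)  # mean of the three n-gram vectors found for "google"
```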

While Gensim provides a way to create fastText embeddings with sub-word information and even load fastText pre-trained word embeddings, there is also a standalone tool, `fasttext`, and an accompanying Python library to do the same.

The list of pre-trained word vectors grows every day, and it's impractical to enumerate them all. Several collections are linked in the notes at the end of this notebook.



### 2. Extending embeddings

The same technique used for OOV words in fastText can also be used to produce embeddings for sentences, paragraphs and even entire documents.

After Bengio et al.’s initial efforts in neural language models, research in word embeddings stalled as computational power and algorithms were not yet at a level that enabled the training of a large vocabulary.

In 2008, Collobert and Weston [4] (hence C&W) demonstrated that word embeddings trained on an adequately large dataset carry syntactic and semantic meaning and improve performance on downstream tasks.

[FROM http://blog.aylien.com/overview-word-embeddings-history-word2vec-cbow-glove/]

Word2Vec is arguably the most popular of the word embedding models. Because word embeddings are a key element of deep learning models for NLP, it is generally assumed to belong to the same group. However, word2vec is not technically considered a component of deep learning, with the reasoning being that its architecture is neither deep nor uses non-linearities (in contrast to Bengio’s model and the C&W model).

Mikolov et al. recommend two architectures for learning word embeddings that, when compared with previous models, are computationally less expensive.

Unlike a language model that can only base its predictions on past words, as it is assessed based on its ability to predict each next word in the corpus, a model that only aims to produce accurate word embeddings is not subject to such restriction. Mikolov et al. therefore use both the n words before and after the target word to predict it. This is known as a continuous bag of words (CBOW), owing to the fact that it uses continuous representations whose order is of no importance.
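To make the CBOW setup concrete, the sketch below builds the training examples such a model would see: for each target word, the words within a window on both sides form the context. The function name `cbow_pairs` is hypothetical, and the sketch covers only example generation, not the network that learns from the examples.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) pairs for a CBOW-style objective."""
    for position, target in enumerate(tokens):
        left = tokens[max(0, position - window):position]
        right = tokens[position + 1:position + 1 + window]
        yield left + right, target


sentence = ["alice", "was", "beginning", "to", "get", "very", "tired"]
pairs = list(cbow_pairs(sentence, window=2))
print(pairs[2])  # context of "beginning": two words before and two after
```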

While CBOW can be seen as a precognitive language model, skip-gram turns the language model objective on its head: rather than using the surrounding words to predict the centre word as with CBOW, skip-gram uses the centre word to predict the surrounding words.
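The skip-gram direction can be illustrated the same way: each centre word is paired with every word inside its window, and each pair becomes one prediction to learn. The function name `skipgram_pairs` is hypothetical, and as before only example generation is shown.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (centre_word, context_word) pairs for a skip-gram-style objective."""
    for position, centre in enumerate(tokens):
        start = max(0, position - window)
        stop = min(len(tokens), position + window + 1)
        for context_position in range(start, stop):
            if context_position != position:
                yield centre, tokens[context_position]


sentence = ["alice", "was", "beginning", "to", "get"]
pairs = list(skipgram_pairs(sentence, window=1))
print(pairs)
```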

In contrast to word2vec, GloVe seeks to make explicit what word2vec does implicitly: Encoding meaning as vector offsets in an embedding space — seemingly only a serendipitous by-product of word2vec — is the specified goal of GloVe.

To be specific, the creators of GloVe illustrate that the ratio of the co-occurrence probabilities of two words (rather than their co-occurrence probabilities themselves) is what contains information and so look to encode this information as vector differences.
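In symbols, and using notation that the text above does not define (so treat it as an assumption): let $X_{ik}$ be the number of times word $k$ occurs in the context of word $i$, and $X_i = \sum_k X_{ik}$. The idea can then be sketched as a co-occurrence probability and a function $F$ of vector differences that should match the ratio of two such probabilities:

```latex
P_{ik} = \frac{X_{ik}}{X_i},
\qquad
F\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}
```

Here $w_i$ and $w_j$ are word vectors and $\tilde{w}_k$ is a context vector. The ratio is large when word $k$ is related to $i$ but not to $j$, small in the opposite case, and close to 1 when $k$ is related to both or to neither.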

Distributional Semantic Models can be seen as count models as they “count” co-occurrences among words by operating on co-occurrence matrices. Neural word embedding models, in contrast, can be viewed as predict models, as they try to predict surrounding words.
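A count model starts from a table like the one built below, which records how often each word appears within a window of another. This is a minimal sketch with a hypothetical function name and a toy sentence; real pipelines typically reweight these counts and reduce their dimensionality afterwards.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered (word, context_word) pair occurs within the window."""
    counts = Counter()
    for position, word in enumerate(tokens):
        start = max(0, position - window)
        stop = min(len(tokens), position + window + 1)
        for context_position in range(start, stop):
            if context_position != position:
                counts[(word, tokens[context_position])] += 1
    return counts


tokens = ["the", "cat", "sat", "on", "the", "mat"]
counts = cooccurrence_counts(tokens, window=1)
print(counts[("the", "cat")], counts[("cat", "mat")])
```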

In 2014, Baroni et al. demonstrated that, in nearly all tasks, predict models consistently outperform count models, and therefore provided us with a comprehensive verification for the supposed superiority of word embedding models.

Later comparisons, however, suggest that it typically makes little difference whether word embeddings or distributional methods are used. What really matters is that your hyperparameters are tuned and that you use the appropriate pre-processing and post-processing steps.

Recent studies by Jurafsky’s group [13], [14] reflect these findings and illustrate that SVD, rather than SGNS, is often the preferred choice when accurate word representations are important.

[13]: Hamilton, W. L., Clark, K., Leskovec, J., & Jurafsky, D. (2016). Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Retrieved from http://arxiv.org/abs/1606.02820

[14]: Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. arXiv Preprint arXiv:1605.09096

#### Where to find them, which are good, configurations, training corpus, etc. (e.g. https://github.com/facebookresearch/fastText)

In [None]:
from IPython.display import HTML  # needed to render the activity box

HTML("""
<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Your turn: Generate a vector of size 10 containing integers between 4 and 8
<br>
<em>
<strong>Hint</strong>: Use the numpy functions
</em>
</p>
</div>
""")

Solution:
```python
np.random.randint(...)
```

In [None]:
# Enter your code here

# English fastText Wikipedia embeddings (the leading `!` runs the command in a shell)
!curl -Lo data/wiki.en.vec https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.vec
# Spanish fastText Wikipedia embeddings
!curl -Lo data/wiki.es.vec https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.es.vec

### Doc2Vec

A key strength of the word2vec approach is that operations on the vectors approximately preserve the characteristics of the words, so averaging the vectors of the words in a sentence produces a vector that is likely to represent the general topic of the sentence.
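A minimal sketch of this averaging, assuming a dictionary of word vectors: the name `sentence_vector` and the toy vectors are hypothetical, and words without a vector are simply skipped.

```python
import numpy as np


def sentence_vector(tokens, word_vectors, dimension):
    """Average the vectors of the tokens that have one; zeros if none do."""
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return np.zeros(dimension)
    return np.mean(known, axis=0)


# Hypothetical 3-dimensional vectors standing in for a pre-trained model
toy_vectors = {
    "good": np.array([1.0, 0.0, 0.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
}
vector = sentence_vector(["a", "good", "movie"], toy_vectors, dimension=3)
print(vector)  # "a" has no vector, so only "good" and "movie" are averaged
```

Plain averaging ignores word order, which is one reason dedicated document-level models exist.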

A lot of pre-trained word2vec models exist, and some of them were trained on huge volumes of data. For the purpose of this analysis, the one trained on over 2 billion tweets with 200 dimensions (one vector consists of 200 numbers) is used. The pre-trained model can be downloaded here: https://github.com/3Top/word2vec-api

Overview of other 2Vecs & other vector engineering: Paragraph2Vec, Sense2Vec, Doc2Vec, etc.

While having pre-trained models at the level of words or even below, as we've seen with the sub-word approaches, is helpful, sometimes we want to consider the document as our atomic unit. This might be useful for document classification tasks, such as authorship attribution, sentiment analysis, etc.

https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e

In this section we describe the datasets, embeddings, and word lists used, as well as how bias is quantified. More detail, including descriptions of additional embeddings and the full word lists, are in SI Appendix, section A. All of our data and code are available on GitHub (https://github.com/nikhgarg/EmbeddingDynamicStereotypes), and we link to external data sources as appropriate.

Embeddings.
This work uses several pretrained word embeddings publicly available online; refer to the respective sources for in-depth discussion of their training parameters. These embeddings are among the most commonly used English embeddings, vary in the datasets on which they were trained, and between them cover the best-known algorithms to construct embeddings. One finding in this work is that, although there is some heterogeneity, gender and ethnic bias is generally consistent across embeddings. Here we restrict descriptions to embeddings used in the main exposition. For consistency, only single words are used, all vectors are normalized by their l2 norm, and words are converted to lowercase.

Google News word2vec vectors.
Vectors trained on about 100 billion words in the Google News dataset (24, 25). Vectors are available at https://code.google.com/archive/p/word2vec/.

Google Books/COHA.
Vectors trained on a combined corpus of genre-balanced Google Books and the COHA (48) by the authors of ref. 26. For each decade, a separate embedding is trained from the corpus data corresponding to that decade. The dataset is specifically designed to enable comparisons across decades, and the creators take special care to avoid selection bias issues. The vectors are available at https://nlp.stanford.edu/projects/histwords/, and we limit our analysis to the SVD and skip-gram with negative sampling (SGNS) (also known as word2vec) embeddings in the 1900s. Note that the Google Books data may include some non-American sources and the external metrics we use are American. However, this does not appreciably affect results. In the main text, we exclusively use SGNS embeddings; results with SVD embeddings are in SI Appendix and are qualitatively similar to the SGNS results. Unless otherwise specified, COHA indicates these embeddings trained using the SGNS algorithm.

New York Times.
We train embeddings over time from The New York Times Annotated Corpus (28), using 1.8 million articles from the New York Times between 1988 and 2005. We use the GloVe algorithm (27) and train embeddings over 3-year windows (so the 2000 embeddings, for example, contain articles from 1999 to 2001).

In SI Appendix we also use other embeddings available at https://nlp.stanford.edu/projects/glove/.

Models
- Word2Vec Model of ECCO, “Literature and Language,” 1700-99 (1.9 billion words; skip-gram size of 10 words): http://ryanheuser.org/data/word2vec.ECCO.skipgram_n=10.model.txt.gz
- Word2Vec Models for Twenty-year Periods of 18C (ECCO, “Literature and Language,” 1700-99) (150 million words each; skip-gram size of 10 words): https://archive.org/details/word-vectors-18c-word2vec-models-across-20-year-periods
- Word2Vec Model of ECCO-TCP, 1700-99 (80 million words; skip-gram size of 10 words): http://ryanheuser.org/data/word2vec.ECCO-TCP.skipgram_n=10.txt.zip
- Word2Vec Model of ECCO-TCP, 1700-99 (80 million words; skip-gram size of 5 words): http://ryanheuser.org/data/word2vec.ECCO-TCP.txt.zip

Code
- Code to evaluate a word2vec model against the Miller Analogies Test
- Code to produce a semantic network from a gensim word2vec model
- Code for aligning two gensim word2vec models using Procrustes matrix alignment

Notes and links:
https://github.com/versae/word_vectors_dh2018
https://arxiv.org/pdf/1310.4546.pdf
http://www.pnas.org/content/early/2018/03/30/1720347115.full#sec-17
https://github.com/nikhgarg/EmbeddingDynamicStereotypes
http://ryanheuser.org/word-vectors/
http://ryanheuser.org/word2vec-vs-the-mat/
http://ahogrammer.com/2017/01/20/the-list-of-pretrained-word-embeddings/
https://github.com/Kyubyong/wordvectors
https://github.com/facebookresearch/fastText
https://pypi.org/project/fasttext/
https://gist.github.com/bhaettasch/d7f4e22e79df3c8b6c20
https://towardsdatascience.com/using-fasttext-and-svd-to-visualise-word-embeddings-instantly-5b8fa870c3d1
https://radimrehurek.com/gensim/models/fasttext.html#module-gensim.models.fasttext
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/FastText_Tutorial.ipynb
https://towardsdatascience.com/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c

https://docs.google.com/document/d/1nKEPA-jKvIkJyRhi2Ok_v3Hjnpv5pzsOhdSTSDNoYSo/edit?ts=5aa7ef5f