# Machine Reading: Advanced Topics in Word Vectors
## Part III. Pre-trained Models and Extended Vector Algorithms (50 mins)

This is a 4-part series of Jupyter notebooks on the topic of word embeddings originally created for a workshop during the Digital Humanities 2018 Conference in Mexico City. Each part is comprised of a mix of theoretical explanations and fill-in-the-blanks activities of increasing difficulty.

Instructors:
- Eun Seo Jo, <a href="mailto:eunseo@stanford.edu">*eunseo@stanford.edu*</a>, Stanford University
- Javier de la Rosa, <a href="mailto:versae@stanford.edu">*versae@stanford.edu*</a>, Stanford University
- Scott Bailey, <a href="mailto:scottbailey@stanford.edu">*scottbailey@stanford.edu*</a>, Stanford University

This unit will explore the various flavors of word embeddings specifically tailored to sentences, word meaning, paragraph, or entire documents. We will give an overview of pre-trained embeddings including where they can be found, how to use them, and what they're effective for.

- 0:00 - 0:20 Pre-trained word embeddings (where to find them, which are good, configurations, trained corpus, etc., e.g. https://github.com/facebookresearch/fastText)
- 0:20 - 0:35 Overview of other 2Vecs & other vector engineering: Paragraph2Vec, Sense2Vec, Doc2Vec, etc.
- 0:35 - 0:50 [Activity 3] Choose, download, and use a pre-trained model

---

### 0. Setting Up 

Before we get started, let's go ahead and set up our notebook. We will start by importing a few Python libraries that we will use throughout the workshop.

#### What are these libraries?

1. NumPy: A package for scientific computing in Python. For us, NumPy is useful for vector operations.
2. NLTK: An easy-to-use Python package for text processing (lemmatization, tokenization, POS-tagging, etc.)
3. matplotlib, seaborn, and Plotly: Plotting packages for visualization
4. scikit-learn: An easy-to-use Python package for machine learning algorithms and preprocessing tools
5. gensim: Built-in word2vec and other NLP algorithms
6. fastText: Super fast word embeddings library

We will be working with a few sample texts using NLTK's corpus package.

In [1]:
%%capture --no-stderr
import sys
!pip install Cython  # needed to compile fasttext
!pip install -r requirements.txt
!python -m nltk.downloader all
print("All done!", file=sys.stderr)

All done!


If all went well, we should now be able to import the following packages into our workspace.

In [2]:
import io
import pickle
import os

import numpy as np
import nltk
# import plotly.plotly as py
import sklearn
import matplotlib.pyplot as plt
import gensim
import fasttext



---



### 1. Out-of-vocabulary words and pre-trained embeddings

So far, we've seen the power of word embeddings and how easy they are to obtain from your own corpus. In most cases, however, we do not have access to millions of unlabelled documents in our target domain that would allow for training good embeddings from scratch. Training word embeddings is very resource intensive and it may require relatively large corpora for the geometric relationships to be semantically meaningful. Still, there are some issues with regular word-oriented embeddings. To illustrate this, consider the next code that trains on the text from _Alice in Wonderland_.

In [3]:
print(nltk.corpus.gutenberg.raw('carroll-alice.txt')[0:200])

[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once


We'll use the handy `.words()` method in NLTK to access just the words.

In [4]:
words = list(map(str.lower, nltk.corpus.gutenberg.words('carroll-alice.txt')))
words[:10]

['[',
 'alice',
 "'",
 's',
 'adventures',
 'in',
 'wonderland',
 'by',
 'lewis',
 'carroll']

And now let's train a very simple `word2vec` model.

In [5]:
documents = [words]
model = gensim.models.Word2Vec(
    documents,
    size=25,
    window=5,
    min_count=1,
    workers=10
)
model.train(documents, total_examples=len(documents), epochs=10)
model.wv['alice']

array([ 0.63055193, -0.53092706,  1.0810341 , -0.9113886 ,  0.09329262,
        0.3915109 , -1.3311906 ,  0.2732119 ,  0.88221633,  0.0536467 ,
       -1.2630957 ,  1.4644554 , -0.02393864,  0.7984751 , -0.70463336,
        0.6561172 , -0.9002941 , -0.9409493 ,  0.09993464, -0.27338758,
       -1.8265436 , -2.2210784 ,  0.52780426,  0.3881184 , -0.04105153],
      dtype=float32)

Regardless of whether this model is able to compute semantic similarities or not, word vectors have been computed. However, if you try to look for words that are not in the vocabulary you'll get an error.

In [6]:
try:
    model.wv['google']
except KeyError as e:
    print(e)

"word 'google' not in vocabulary"


This is known as the Out-Of-Vocabulary (OOV) issue in Word2Vec and similar approaches.

Now, you may think, I could get synonyms of the OOV words using something like WordNet, and then look for those words' embeddings. And while that might work in some cases, in others it is not that simple. Two such cases are new-ish words like `facebook` and `google`, or proper names of places, like `Teotihuacan`.

One way to solve this issue is to use a different measure of atomicity in your algorithm. In Word2Vec-like approaches, including GloVe, the word is the minimum unit, so there is no vector information at all for words that are not in the vocabulary. In contrast, a different approach could train on sub-word units, for example character 3-grams. While this does not guarantee that all words will be covered, a good number of them might be, because a large enough corpus is more likely to include all possible trigrams than all possible words. This is the approach taken by Facebook's fastText.
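
To make the sub-word idea concrete, here is a minimal sketch of how a word can be broken into character trigrams. The `char_ngrams` helper is hypothetical (it is not part of Gensim or fastText). The `<` and `>` boundary markers follow the convention used by fastText, which in practice works with a range of n-gram lengths rather than a single one.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word`, including boundary markers."""
    padded = f"<{word}>"  # mark the beginning and the end of the word
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


google_ngrams = char_ngrams("google")
print(google_ngrams)

# Trigrams that the unseen word "google" shares with another word, "goose"
shared = set(google_ngrams) & set(char_ngrams("goose"))
print(shared)
```

Even though `google` never appears in the text, some of its trigrams do appear inside other words, and those are the pieces a sub-word model can fall back on.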

In [7]:
from gensim.models import FastText

fasttext_model = FastText(documents, size=25, min_count=1)
fasttext_model.wv['alice']

array([-0.5733724 ,  0.2103539 , -0.2240825 ,  0.45873976, -0.7554858 ,
       -0.66532815,  0.5954479 , -0.37850264,  1.0341942 ,  0.3553371 ,
        0.8246058 ,  1.2549787 , -0.00430943, -0.5939834 , -0.19272962,
       -1.3864826 ,  0.31367445, -1.2969799 , -0.28795922, -0.12565622,
       -0.19130923,  0.14566159, -1.6182916 ,  0.714505  , -0.12413616],
      dtype=float32)

In [8]:
fasttext_model.wv['google']

array([-0.49562618,  0.17970474, -0.21764563,  0.4175806 , -0.6930514 ,
       -0.5833665 ,  0.5357436 , -0.33867478,  0.93922836,  0.30197605,
        0.72848064,  1.1304808 ,  0.00360376, -0.5359104 , -0.15725143,
       -1.2278355 ,  0.26921815, -1.1647774 , -0.23449065, -0.10836578,
       -0.19098943,  0.14359646, -1.4649066 ,  0.6385228 , -0.09072962],
      dtype=float32)

fastText also distributes word vectors pre-trained on [Common Crawl](http://commoncrawl.org/) and [Wikipedia](https://www.wikipedia.org/) for 157 languages. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. They come in binary and text formats: the binary file includes a ready-to-use model, while the text file only contains the vectors associated with each word in the training set.

Gensim is soon to include a special method to load these fastText embeddings (not working as of 3.4.0). Just take into account that only the `.bin` format allows for OOV word vectors. For the regular and usually lighter `.vec` format you would still need to load the vectors, save a binary Gensim model, and load it back in.
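
The text (`.vec`) format is simple enough to read by hand, which helps explain why it cannot produce OOV vectors: it is just a header line with the number of words and the dimension, followed by one line per word with its components. The sketch below parses a made-up two-word file. `load_vec` is a hypothetical helper written for illustration only; for real files, Gensim's `load_word2vec_format()` is the better choice.

```python
import io

import numpy as np


def load_vec(handle):
    """Parse the word2vec/fastText text format into a {word: vector} dict."""
    n_words, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        word, values = parts[0], [float(value) for value in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}")
        vectors[word] = np.array(values, dtype=np.float32)
    if len(vectors) != n_words:
        raise ValueError("header does not match the number of words read")
    return vectors


# Made-up three-dimensional vectors, for illustration only
sample = io.StringIO("2 3\nstar 0.1 0.2 0.3\nmoon 0.4 0.5 0.6\n")
toy_vectors = load_vec(sample)
print(toy_vectors['star'])
```

Nothing in such a file describes character n-grams, so a word that is missing from it simply has no vector.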

Let's see a couple of examples of using `.vec` files from the Somali and the Simplified English Wikipedia corpora available for fastText. These files are loaded using the regular Gensim `KeyedVectors` word2vec model (`.load_word2vec_format()`), and vectors for out-of-vocabulary words cannot be computed.

In Somali, the word `xiddigta` (meaning *the star*) should have its own vector available since the word is present in the corpus.

In [9]:
filename = 'wiki.so.vec'
if not os.path.isfile(filename):
    !echo "Downloading $filename"
    !curl --progress-bar -Lo $filename https://s3-us-west-1.amazonaws.com/fasttext-vectors/$filename

somali_model = gensim.models.KeyedVectors.load_word2vec_format(filename, binary=False)
somali_model.wv['xiddigta'][:25]  # it means 'the star' in Somali

array([ 0.156    , -0.11243  , -0.19999  , -0.0082059,  0.053104 ,
        0.0062253, -0.15436  , -0.086891 ,  0.049946 , -0.0084536,
       -0.1547   , -0.041276 ,  0.34115  ,  0.049262 , -0.099698 ,
       -0.092703 , -0.15162  ,  0.011775 , -0.0048607, -0.0026743,
       -0.11588  , -0.051329 , -0.22717  ,  0.069633 , -0.0051629],
      dtype=float32)

But the word `ciyaalsuuq` (meaning *unruly youth*) raises a `KeyError` in the word vectors dictionary.

In [10]:
try:
    somali_model.wv['ciyaalsuuq'][:25]
except KeyError as e:
    print(e)

"word 'ciyaalsuuq' not in vocabulary"


  


And the same thing occurs in English: while words like `star` are certainly available, words such as `bibliopole` (meaning *a person who buys and sells books, especially rare ones*) are not.

In [11]:
# This might take a while
filename = 'wiki.simple.zip'
if (not os.path.isfile(filename)
        and not os.path.isfile(filename.replace('.zip', '.vec'))
        and not os.path.isfile(filename.replace('.zip', '.bin'))):
    !echo "Downloading $filename"
    !curl --progress-bar -Lo $filename https://s3-us-west-1.amazonaws.com/fasttext-vectors/$filename
if (os.path.isfile(filename)
        and (not os.path.isfile(filename.replace('.zip', '.vec'))
                 or not os.path.isfile(filename.replace('.zip', '.bin')))):
    !unzip $filename

In [12]:
english_model = gensim.models.KeyedVectors.load_word2vec_format(
    filename.replace('.zip', '.vec'), binary=False)

In [13]:
english_model.wv['star'][:25] 

array([-0.51891  , -0.50084  , -0.0019202, -0.27244  , -0.29538  ,
        0.53932  , -0.64673  , -0.071279 , -0.037663 ,  0.12372  ,
        0.12885  ,  0.17083  , -0.44653  , -0.15452  , -0.16488  ,
        0.27257  , -0.06937  ,  0.20336  , -0.035001 ,  0.69188  ,
        0.054626 , -0.18631  , -0.26735  ,  0.14229  ,  0.0026101],
      dtype=float32)

In [14]:
try:
    english_model.wv['bibliopole'][:25] 
except KeyError as e:
    print(e)

"word 'bibliopole' not in vocabulary"


  


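The fastText English embeddings are also included in Gensim's `downloader` feature, although only as plain word vectors: the sub-word information needed to build vectors for OOV words is **not** part of what gets loaded.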

In [15]:
import gensim.downloader as pretrained

pretrained.info()['models']['fasttext-wiki-news-subwords-300']

{'num_records': 999999,
 'file_size': 1005007116,
 'base_dataset': 'Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens)',
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/fasttext-wiki-news-subwords-300/__init__.py',
 'license': 'https://creativecommons.org/licenses/by-sa/3.0/',
 'parameters': {'dimension': 300},
 'description': '1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).',
 'read_more': ['https://fasttext.cc/docs/en/english-vectors.html',
  'https://arxiv.org/abs/1712.09405',
  'https://arxiv.org/abs/1607.01759'],
 'checksum': 'de2bb3a20c46ce65c9c131e1ad9a77af',
 'file_name': 'fasttext-wiki-news-subwords-300.gz',
 'parts': 1}

In [16]:
fasttext_english = pretrained.load('fasttext-wiki-news-subwords-300')

In [17]:
fasttext_english.wv['star'][:25]

array([-0.023621  , -0.043329  , -0.021747  ,  0.00054497, -0.038798  ,
       -0.062416  ,  0.048514  , -0.11514   ,  0.058782  ,  0.059644  ,
       -0.018478  ,  0.080147  ,  0.078849  ,  0.074862  , -0.14981   ,
        0.028318  ,  0.090226  , -0.051512  ,  0.07596   ,  0.077579  ,
        0.081135  , -0.064339  , -0.038981  ,  0.10396   ,  0.030344  ],
      dtype=float32)

By contrast, when using the `.bin` file and loading it in Gensim using the special `FastText.load_fasttext_format()` method, out-of-vocabulary words suddenly have embeddings available.

In [18]:
english_oov = FastText.load_fasttext_format('wiki.simple')

In [19]:
english_oov.wv['bibliopole'][:25]

array([ 0.37078428, -0.24126193,  0.11180832, -0.34659448,  0.48570928,
        0.20414576, -0.193517  ,  0.0696585 ,  0.09108197, -0.20096627,
        0.10924414, -0.3564498 , -0.02265201,  0.16185692, -0.2664784 ,
       -0.16940327, -0.17111772,  0.17861073, -0.01629919, -0.16885415,
        0.09249207,  0.42600164, -0.2559174 , -0.09749936, -0.09310414],
      dtype=float32)

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
<strong>Activity</strong>
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Could you find a word in the `english_oov` model for which there is no embedding? And in the `english_model` model? Would the embedding for `ciyaalsuuq` be available in any of these models?
<br>
<em>
<!--
<strong>Hint</strong>: Use the numpy functions
-->
</em>
</p>
</div>

As we've seen, vectors also become available for words that do not exist in the language of the model, such as the Somali `ciyaalsuuq` in the English model, so this is a feature we must use with care.

In [20]:
english_oov.wv['ciyaalsuuq'][:25]

array([ 0.2712076 , -0.04004084, -0.00314274, -0.13407175, -0.07640645,
        0.24314082, -0.21620868,  0.0062825 ,  0.00542721,  0.22700639,
        0.15218401, -0.0434717 , -0.09604143,  0.04694841, -0.01871864,
       -0.06880959, -0.09646718,  0.09673564, -0.19055995, -0.09663271,
        0.29467133,  0.5106731 , -0.10587808, -0.04973989,  0.14547734],
      dtype=float32)

Unsurprisingly, if we check what other words are similar in English to the Somali word `ciyaalsuuq` we get a bunch of words that are not really from English. To be completely fair, the Simple English corpus might not be as reliable as the full English one for finding semantic similarities.

In [21]:
english_model.similar_by_vector(english_oov.wv['ciyaalsuuq'])

[('staatsangehörigkeit', 0.6766067743301392),
 ('vvv', 0.6733159422874451),
 ('aarwangen', 0.6648105382919312),
 ('wyrzysk', 0.6636943817138672),
 ('herzogenbuchsee', 0.6629930138587952),
 ('waalwijk', 0.661628782749176),
 ('pfäffikon', 0.6590408682823181),
 ('rijkersstraat', 0.6584482192993164),
 ('verkhnekolymsky', 0.6578351855278015),
 ('распутина', 0.655958890914917)]
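
The ranking above is based on cosine similarity, and it helps to remember that any vector has nearest neighbours, whether or not it means anything. The following sketch ranks made-up two-dimensional vectors against a query vector. The `cosine` and `rank_by_cosine` helpers are hypothetical stand-ins for what `similar_by_vector()` does conceptually, not Gensim's actual implementation.

```python
import numpy as np


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def rank_by_cosine(query, vectors, topn=3):
    """Return the `topn` words whose vectors are closest to `query`."""
    scores = [(word, cosine(query, vector)) for word, vector in vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors, for illustration only
toy = {
    'star': np.array([1.0, 0.0]),
    'sun': np.array([0.9, 0.1]),
    'book': np.array([0.0, 1.0]),
}
ranking = rank_by_cosine(np.array([1.0, 0.2]), toy)
print(ranking)
```

A high score only says that two vectors point in a similar direction. Whether that direction carries meaning depends on how the query vector was produced.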

#### fastText package

While Gensim provides a way to create fastText embeddings with sub-word information and even load fastText pre-trained word embeddings, there is also a standalone tool, `fasttext`, and an accompanying Python library to do the same. Unfortunately, the Python bindings haven't been updated, and they seem to be broken when trying to load binary models generated with newer versions of the fastText command line tool.

In [22]:
import fasttext

try:
    fasttext.load_model("wiki.simple.bin")
except Exception as e:
    print(e)

fastText: Cannot load wiki.simple.bin due to C++ extension failed to allocate the memory


Other functionalities, such as building embeddings from your own corpus using either Skip-gram or CBOW, are available, as well as methods to create text classifiers very easily.

In [23]:
fasttext.skipgram(nltk.corpus.gutenberg.abspath('carroll-alice.txt'), 'alice_model')

<fasttext.model.WordVectorModel at 0x1b31436e80>

In [24]:
fasttext.cbow(nltk.corpus.gutenberg.abspath('carroll-alice.txt'), 'alice_model')

<fasttext.model.WordVectorModel at 0x1a217cf0b8>

In [25]:
text = """
__label__pos This is some wonderful positive text.
__label__neg This is some awful negative text.
"""
with open('sentiment_train.txt', 'w') as f:
    f.write(text.strip())
test = """
__label__pos This is wonderful.
__label__neg This is awful.
"""
with open('sentiment_test.txt', 'w') as f:
    f.write(test.strip())

classifier = fasttext.supervised('sentiment_train.txt', 'sentiment_model')
result = classifier.test('sentiment_test.txt')
print('P@1:', result.precision)
print('R@1:', result.recall)
print('Number of examples:', result.nexamples)

P@1: 1.0
R@1: 1.0
Number of examples: 2
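
The training and test files above follow a simple convention: each line starts with `__label__` plus the class name, followed by the text. When the examples live in Python data structures, a small helper can produce those lines. `to_fasttext_line` is a hypothetical name used only in this sketch.

```python
def to_fasttext_line(label, text):
    """Format one labelled example as a single `__label__<label> <text>` line."""
    # Collapse whitespace so that each example stays on one line
    return f"__label__{label} {' '.join(text.split())}"


examples = [
    ('pos', 'This is some wonderful positive text.'),
    ('neg', 'This is some awful\nnegative text.'),
]
training_data = '\n'.join(to_fasttext_line(label, text) for label, text in examples)
print(training_data)
```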


#### Pre-trained vectors

The list of pre-trained word vectors grows every day, and while it's impractical to enumerate them all, some of them are listed below.

- English
  - fastText. Embeddings (300 dimensions) by Facebook [with](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M-subword.vec.zip) and [without sub-word information](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M.vec.zip) trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens), and on [Common Crawl (600B tokens)](https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip).
  - [Google News](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/). Embeddings (300 dimensions) by Google trained on Google News (100B) using word2vec with negative sampling and context window BoW with size ~5 ([link](http://code.google.com/p/word2vec/)). There are also fastText versions from 2016 with and without sub-word information for Wikipedia and with no sub-word information for Common Crawl.
  - [LexVec](https://github.com/alexandres/lexvec). Embeddings (300 dimensions) trained using LexVec with and without sub-word information trained on Common Crawl, and on Wikipedia 2015 + NewsCrawl.
  - Freebase [IDs](https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit?usp=sharing) and [names](https://docs.google.com/file/d/0B7XkCwpI5KDYeFdmcVltWkhtbmM/edit?usp=sharing). Embeddings (1000 dimensions) by Google trained on Google News (100B) using word2vec, skip-gram and context window BoW with size ~10 ([link](http://code.google.com/p/word2vec/)).
  - [Wikipedia 2014 + Gigaword 5](http://nlp.stanford.edu/data/glove.6B.zip). Embeddings (50, 100, 200, and 300 dimensions) by GloVe trained on Wikipedia data from 2014 and newswire data from the mid 1990s through 2011 using GloVe with AdaGrad and context window 10+10 ([link](http://nlp.stanford.edu/projects/glove/)).
  - Common Crawl [42B](http://nlp.stanford.edu/data/glove.42B.300d.zip) and [840B](http://nlp.stanford.edu/data/glove.840B.300d.zip). Embeddings (300 dimensions) by GloVe trained on Common Crawl (42B and 840B) using GloVe and AdaGrad ([link](http://nlp.stanford.edu/projects/glove/)).
  - [Twitter (2B Tweets)](http://www-nlp.stanford.edu/data/glove.twitter.27B.zip). Embeddings (25, 50, 100, and 200 dimensions) by GloVe trained on Twitter (27B) using GloVe and AdaGrad ([link](http://nlp.stanford.edu/projects/glove/)).
  - [Wikipedia dependency](http://u.cs.biu.ac.il/~yogo/data/syntemb/deps.words.bz2). Embeddings (300 dimensions) by Levy & Goldberg trained on Wikipedia 2015 using a modified word2vec with syntactic dependencies as context ([link](https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/)).
  - [DBPedia vectors (wiki2vec)](https://github.com/idio/wiki2vec/raw/master/torrents/enwiki-gensim-word2vec-1000-nostem-10cbow.torrent). Embeddings (1000 dimensions) by Idio trained on Wikipedia (?) using word2vec, skip-gram and context window BoW with size 10 ([link](https://github.com/idio/wiki2vec#prebuilt-models)).
  - [60 Wikipedia embeddings with 4 kinds of context](http://vsmlib.readthedocs.io/en/latest/tutorial/getting_vectors.html#). Embeddings (25, 50, 100, 250, and 500 dimensions) by Li, Liu et al. trained on Wikipedia using Skip-Gram, CBOW, and GloVe (original and modified) with context window 2 ([link](http://vsmlib.readthedocs.io/en/latest/tutorial/getting_vectors.html#)).
- Multi-lingual
  - [fastText](https://fasttext.cc/docs/en/crawl-vectors.html). Embeddings for 157 languages trained using fastText on Wikipedia 2016 and Common Crawl using CBOW with position-weights, 300 dimensions, with character n-grams of length 5, a window of size 5 and 10 negatives. Both vectors and binary models for OOV are available. There is an old version of these embeddings trained only on Wikipedia 2016 for almost [300 languages](https://fasttext.cc/docs/en/pretrained-vectors.html).
  - [BPEemb](https://github.com/bheinzerling/bpemb). Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) on Wikipedia 2018 with sub-word information.
  - [Kyubyong's wordvectors](https://github.com/Kyubyong/wordvectors#pre-trained-models). Embeddings with and without sub-word information trained on Wikipedia dumps from 2017 for +30 languages.
  - [Polyglot](https://sites.google.com/site/rmyeid/projects/polyglot#h.p_ID_98). Embeddings for more than 100 languages trained on their Wikipedias from 2013. Provides competitive performance with near state-of-art methods in English, Danish and Swedish.

There is even a tool, [`chakin`](https://github.com/chakki-works/chakin#supported-vectors), that makes it easy to download word vectors with and without sub-word information for 11 languages.

In [28]:
import chakin

chakin.search(lang='Japanese')

                         Name  Dimension     Corpus VocabularySize  \
6                fastText(ja)        300  Wikipedia           580K   
22  word2vec.Wiki-NEologd.50d         50  Wikipedia           335K   

                Method  Language                 Author  
6             fastText  Japanese               Facebook  
22  word2vec + NEologd  Japanese  Shiroyagi Corporation  


#### Historical Word Vectors

In the Humanities, despite the value of word embeddings, we usually want to train our own models or to have access to models that are related to a specific time period of study. It might not be of much help to analyze 19th Century literature with word vectors trained on a Google News corpus, especially since the semantics of words have been shown to change over time.

There is, however, a collection of [historical word vectors](https://nlp.stanford.edu/projects/histwords/) made available by the Stanford NLP Group. The embeddings (300 dimensions) are generated using word2vec skip-gram with negative sampling and trained on Google N-Grams for English, English Fiction, French, German, and Simplified Chinese, and on the Corpus of Historical American English (COHA):
- English:
  - [All English](http://snap.stanford.edu/historical_embeddings/eng-all_sgns.zip) (1800s-1990s)
  - [English Fiction](http://snap.stanford.edu/historical_embeddings/eng-fiction-all_sgns.zip) (1800s-1990s)
  - [Genre-Balanced American English](http://snap.stanford.edu/historical_embeddings/coha-word_sgns.zip) (1830s-2000s) (COHA)
  - [Genre-Balanced American English, word lemmas](http://snap.stanford.edu/historical_embeddings/coha-lemma_sgns.zip) (1830s-2000s) (COHA)
- Multi-lingual:
  - [French](http://snap.stanford.edu/historical_embeddings/fre-all_sgns.zip) (1800s-1990s)
  - [German](http://snap.stanford.edu/historical_embeddings/ger-all_sgns.zip) (1800s-1990s)
  - Simplified Chinese (1950-1990s)

Let's download and prepare some of these pre-trained word vectors.

In [102]:
# Downloading and preparing the pre-trained embeddings
pretrained.load('word2vec-google-news-300',
                return_path=True)  # return_path avoids loading the model into memory

for filename, dirname in (('eng-fiction-all_sgns.zip', 'fiction'),
                          ('coha-word_sgns.zip', 'coha')):
    if (not os.path.isfile(filename)
            and not os.path.isdir(dirname)):
        print(f'Downloading {filename}')
        !curl --progress-bar -Lo $filename http://snap.stanford.edu/historical_embeddings/$filename
    if (os.path.isfile(filename)
            and not os.path.isdir(dirname)):
        print(f'Uncompressing {filename}')
        !unzip -q -o $filename -d $dirname

In [155]:
from tqdm import tqdm  # progress bars

for corpus, years in (('fiction', (1850, 1900, 1950)),  # range(1800, 1991, 10)
                      ('coha', (1850, 1900, 1950))):  # range(1810, 2001, 10)
    for year in tqdm(list(years), desc=f'Generating vector files - {corpus}'):
        if os.path.isfile(f'{corpus}/{year}.vec'):
            continue  # already converted
        # Each decade ships as a NumPy matrix plus a pickled vocabulary
        vectors = np.load(f'{corpus}/sgns/{year}-w.npy')
        with open(f'{corpus}/sgns/{year}-vocab.pkl', 'rb') as vocab_file:
            vocab = pickle.load(vocab_file)
        with open(f'{corpus}/{year}.vec', 'w') as vector_file:
            # word2vec text format: a "<rows> <dimensions>" header, then one word per line
            vector_file.write("{} {}".format(*vectors.shape))
            for index, word in enumerate(vocab):
                vector = np.array2string(vectors[index],
                                         formatter={'float_kind':'{0:.9f}'.format})[1:-1]
                vector = vector.replace('\n', '')
                vector_file.write(f'\n{word} {vector}')

Generating vector files - fiction: 100%|██████████| 3/3 [00:00<00:00, 2399.95it/s]
Generating vector files - coha: 100%|██████████| 3/3 [00:00<00:00, 5318.22it/s]


<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
<strong>Activity</strong>
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Word embeddings allow for analogy checking. For example, `man is to king as woman is to queen`, expressed as `man:king :: woman:queen`, is reflected in the vector representations of the words `man`, `king`, `woman`, `queen` in such a way that $\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}$. However, this can also highlight some biases in the specific corpora the model has been trained on. Compare the next pairs of words with analogies to `she-he` to spot biases in Google News (2015), English Fiction (1850, 1900, 1950) and Genre-Balanced American English (1850, 1900, 1950) embeddings:
`sewing-carpentry`,
`housewife-shopkeeper`,
`nurse-surgeon`,
`softball-baseball`,
`blond-burly`,
`feminism-conservatism`,
`cosmetics-pharmaceuticals`,
`giggle-chuckle`,
`vocalist-guitarist`,
`petite-lanky`,
`sassy-snappy`,
`diva-superstar`,
`charming-affable`,
`volleyball-football`,
`cupcakes-pizzas`,
`hairdresser-barber`.
<br>
<em>
<strong>Hint</strong>: Use Gensim's `model.wv.most_similar_cosmul()`/`model.wv.most_similar()`  functions
</em>
</p>
</div>
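
As a warm-up for the activity, the analogy arithmetic can be sketched with NumPy on made-up two-dimensional vectors, where the first dimension loosely stands for "royalty" and the second for "gender". The `analogy` helper and the `toy` vectors are hypothetical and exist only for this illustration. Gensim's `most_similar()` follows the same idea, but its implementation differs in details such as vector normalization.

```python
import numpy as np

# Made-up vectors: (royalty, gender)
toy = {
    'king': np.array([1.0, 1.0]),
    'queen': np.array([1.0, -1.0]),
    'man': np.array([0.0, 1.0]),
    'woman': np.array([0.0, -1.0]),
    'apple': np.array([0.2, 0.1]),
}


def analogy(a, b, c, vectors):
    """Solve a:b :: c:? by finding the word closest to b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    best_word, best_score = None, -np.inf
    for word, vector in vectors.items():
        if word in (a, b, c):
            continue  # the input words themselves are not valid answers
        score = np.dot(target, vector) / (np.linalg.norm(target) * np.linalg.norm(vector))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


answer = analogy('man', 'king', 'woman', toy)
print(answer)
```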

In [104]:
fiction_1850 = gensim.models.KeyedVectors.load_word2vec_format(
    'fiction/1850.vec', binary=False)

In [113]:
fiction_1850.most_similar(positive=['woman', 'king'], negative=['man'])[0]

('queen', 0.6996211409568787)

In [147]:
fiction_1850.most_similar(positive=['woman', 'gallant'], negative=['man'])

[('female', 0.6027143001556396),
 ('venerable', 0.5537256598472595),
 ('nobleman', 0.5382981896400452),
 ('jolly', 0.5375019311904907),
 ('nurse', 0.5263566374778748),
 ('warrior', 0.522939145565033),
 ('clad', 0.4778549373149872),
 ('beautiful', 0.4676932394504547),
 ('sailor', 0.46269071102142334),
 ('flag', 0.4588888883590698)]

In [148]:
fiction_1850.similarity('gallant', 'female')

0.6151452630072427

In [149]:
google_news = pretrained.load('word2vec-google-news-300')

In [150]:
google_news.most_similar(positive=['woman', 'gallant'], negative=['man'])

[('valiant', 0.6204848885536194),
 ('gallantly', 0.5338876247406006),
 ('courageous', 0.5054805278778076),
 ('gutsy', 0.4830174148082733),
 ('fought_valiantly', 0.478645384311676),
 ('fought_gallantly', 0.47833332419395447),
 ('brave', 0.4743400514125824),
 ('heroic', 0.4726385772228241),
 ('Jadah_Rose', 0.4713868498802185),
 ('fought_bravely', 0.4672378599643707)]

In [153]:
google_news.similarity('gallant', 'female')

0.14988984633117677

In [154]:
google_news.similarity('gallant', 'valiant')

0.7676422903598752

---

### 2. Extending Vector Algorithms

Vectors for out-of-vocabulary words are obtained by splitting the word into its character n-grams, looking up the embedding of each n-gram, and then averaging them to produce the final word vector for the OOV word.
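
A minimal sketch of that composition step follows. It assumes a hypothetical table of trigram vectors (`ngram_vectors`), filled here with random numbers standing in for learned values, and a hypothetical `oov_vector` helper. The real fastText implementation uses several n-gram lengths and hashes n-grams into buckets, which this sketch leaves out.

```python
import numpy as np


def char_ngrams(word, n=3):
    """Return the character n-grams of `word`, including boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


rng = np.random.RandomState(0)
known_words = ['goose', 'eagle', 'looking']
# Hypothetical "trained" table: random vectors stand in for learned ones
ngram_vectors = {gram: rng.rand(4)
                 for word in known_words
                 for gram in char_ngrams(word)}


def oov_vector(word, table):
    """Average the vectors of the n-grams of `word` that are in `table`."""
    grams = [gram for gram in char_ngrams(word) if gram in table]
    if not grams:
        raise KeyError(f"no known n-grams for {word!r}")
    return np.mean([table[gram] for gram in grams], axis=0)


google_vector = oov_vector('google', ngram_vectors)
print(google_vector)
```

Here `google` gets a vector from the trigrams it shares with the known words, while a word with no known n-grams at all still fails with a `KeyError`.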

The same technique used for OOV words in fastText can also be used to produce embeddings for sentences, paragraphs and even entire documents.

After Bengio et al.’s initial efforts in neural language models, research in word embeddings stalled as computational power and algorithms were not yet at a level that enabled the training of a large vocabulary.

In 2008, Collobert and Weston [4] (hereafter C&W) demonstrated that word embeddings trained on an adequately large dataset carry syntactic and semantic meaning and improve performance on downstream tasks.

[FROM http://blog.aylien.com/overview-word-embeddings-history-word2vec-cbow-glove/]

Word2Vec is arguably the most popular of the word embedding models. Because word embeddings are a key element of deep learning models for NLP, it is generally assumed to belong to the same group. However, word2vec is not technically considered a component of deep learning, with the reasoning being that its architecture is neither deep nor uses non-linearities (in contrast to Bengio’s model and the C&W model).

Mikolov et al. recommend two architectures for learning word embeddings that, when compared with previous models, are computationally less expensive.

Unlike a language model that can only base its predictions on past words, as it is assessed based on its ability to predict each next word in the corpus, a model that only aims to produce accurate word embeddings is not subject to such restriction. Mikolov et al. therefore use both the n words before and after the target word to predict it. This is known as a continuous bag of words (CBOW), owing to the fact that it uses continuous representations whose order is of no importance.

While CBOW can be seen as a precognitive language model, skip-gram turns the language model objective on its head: rather than using the surrounding words to predict the centre word as with CBOW, skip-gram uses the centre word to predict the surrounding words.
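
The difference between the two architectures is easiest to see in the training examples each one extracts from a sentence. The sketch below only builds those examples, with no neural network involved. The function names and the window size of 2 are choices made for this illustration.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, target_word) pairs: the context predicts the centre."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield context, target


def skipgram_examples(tokens, window=2):
    """Yield (centre_word, context_word) pairs: the centre predicts each neighbour."""
    for context, centre in cbow_examples(tokens, window):
        for word in context:
            yield centre, word


sentence = ['alice', 'was', 'beginning', 'to', 'get']
cbow = list(cbow_examples(sentence))
skipgram = list(skipgram_examples(sentence))
print(cbow[2])
print(skipgram[:3])
```

CBOW produces one example per position, with the whole context as input, whereas skip-gram produces one example per (centre, neighbour) pair.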

In contrast to word2vec, GloVe seeks to make explicit what word2vec does implicitly: Encoding meaning as vector offsets in an embedding space — seemingly only a serendipitous by-product of word2vec — is the specified goal of GloVe.

To be specific, the creators of GloVe illustrate that the ratio of the co-occurrence probabilities of two words (rather than their co-occurrence probabilities themselves) is what contains information and so look to encode this information as vector differences.

Distributional Semantic Models can be seen as count models as they “count” co-occurrences among words by operating on co-occurrence matrices. Neural word embedding models, in contrast, can be viewed as predict models, as they try to predict surrounding words.
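
To illustrate what "counting" means here, the following sketch builds symmetric co-occurrence counts from a toy sentence using a sliding window. `cooccurrence_counts` is a hypothetical helper. Real count models go on to reweight and factorize such counts, which is beyond this example.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count how often two words appear within `window` positions of each other."""
    counts = Counter()
    for i, word in enumerate(tokens):
        for neighbour in tokens[i + 1:i + 1 + window]:
            # Store each unordered pair under a single, sorted key
            counts[tuple(sorted((word, neighbour)))] += 1
    return counts


tokens = 'the cat sat on the mat'.split()
counts = cooccurrence_counts(tokens)
print(counts.most_common(3))
```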

In 2014, Baroni et al. demonstrated that, in nearly all tasks, predict models consistently outperform count models, and therefore provided us with a comprehensive verification for the supposed superiority of word embedding models.

Subsequent comparisons, however, suggest that it typically makes little difference whether word embeddings or distributional methods are used. What really matters is that your hyperparameters are tuned and that you use the appropriate pre-processing and post-processing steps.

Recent studies by Jurafsky’s group [13], [14] reflect these findings and illustrate that SVD, rather than SGNS, is commonly the preferred choice when accurate word representations are important.

[13]: Hamilton, W. L., Clark, K., Leskovec, J., & Jurafsky, D. (2016). Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Retrieved from http://arxiv.org/abs/1606.02820

[14]: Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. arXiv Preprint arXiv:1605.09096

#### Where to find them, which are good, configurations, trained corpus, etc. (e.g. https://github.com/facebookresearch/fastText)

In [None]:
HTML("""
<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Your turn: Generate a vector of size 10 containing random integers between 4 and 8
<br>
<em>
<strong>Hint</strong>: Use the numpy functions
</em>
</p>
</div>
""")

Solution:
```python
np.random.randint(...)
```

In [None]:
# Enter your code here

# English fastText Wikipedia embeddings
curl -Lo data/wiki.en.vec https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.vec
# Spanish fastText Wikipedia embeddings
curl -Lo data/wiki.es.vec https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.es.vec
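The `.vec` files are plain text: a header line with the vocabulary size and the number of dimensions, followed by one line per word with the word and its vector components. The sketch below parses that layout with a small helper of our own (`load_vec` is hypothetical, not a library function) and runs on a tiny made-up sample so that it works without the download.

```python
import io
import numpy as np


def load_vec(handle, limit=None):
    """Parse the word2vec/fastText text format: '<count> <dim>' header, then one word per line."""
    _n_words, dim = map(int, handle.readline().split())
    vectors = {}
    for line_no, line in enumerate(handle):
        if limit is not None and line_no >= limit:
            break
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors, dim


# Tiny made-up stand-in for data/wiki.en.vec so the sketch runs without the download.
sample = io.StringIO(
    "3 4\n"
    "king 0.1 0.2 0.3 0.4\n"
    "queen 0.1 0.2 0.3 0.5\n"
    "apple 0.9 0.1 0.0 0.2\n"
)
vectors, dim = load_vec(sample)
print(dim, sorted(vectors))
```

For the real files you would pass `open("data/wiki.en.vec", encoding="utf-8")` and probably a `limit`, since the full vocabulary is large. In practice, gensim's `KeyedVectors.load_word2vec_format` reads the same format and is likely the more convenient option.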

### Doc2Vec

A key strength of the word2vec approach is that operations on the vectors approximately preserve the characteristics of the words, so that averaging the vectors of the words in a sentence produces a vector that is likely to represent the general topic of the sentence.
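A minimal sketch of this averaging idea follows. The 3-dimensional vectors are made up for illustration (a real pre-trained model would supply them), and `sentence_vector` and `cosine` are our own helper names.

```python
import numpy as np

# Toy 3-dimensional vectors, made up for illustration.
word_vectors = {
    "cats":   np.array([0.9, 0.1, 0.0]),
    "dogs":   np.array([0.8, 0.2, 0.0]),
    "bark":   np.array([0.7, 0.3, 0.1]),
    "stocks": np.array([0.0, 0.1, 0.9]),
    "fell":   np.array([0.1, 0.2, 0.8]),
}


def sentence_vector(tokens, vectors):
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token of the sentence is in the vocabulary")
    return np.mean(known, axis=0)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


pets = sentence_vector("dogs bark".split(), word_vectors)
pets2 = sentence_vector("cats and dogs".split(), word_vectors)
finance = sentence_vector("stocks fell".split(), word_vectors)

print(cosine(pets, pets2) > cosine(pets, finance))  # True
```

Averaging ignores word order, which is one motivation for the document-level models discussed in this section.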

A lot of pre-trained word2vec models exist, and some of them were trained on huge volumes of data. For the purpose of this analysis, the one trained on over 2 billion tweets with 200 dimensions (one vector consists of 200 numbers) is used. The pre-trained model can be downloaded here: https://github.com/3Top/word2vec-api

Overview of other 2Vecs & other vector engineering: Paragraph2Vec, Sense2Vec, Doc2Vec, etc.

Pre-trained models at the level of words, or even of sub-words as we've seen, are helpful, but sometimes we want to treat the document as our atomic unit. This might be useful for document classification tasks, such as authorship attribution, sentiment analysis, etc.

https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e

In this section we describe the datasets, embeddings, and word lists used, as well as how bias is quantified. More detail, including descriptions of additional embeddings and the full word lists, are in SI Appendix, section A. All of our data and code are available on GitHub (https://github.com/nikhgarg/EmbeddingDynamicStereotypes), and we link to external data sources as appropriate.

Embeddings.
This work uses several pretrained word embeddings publicly available online; refer to the respective sources for in-depth discussion of their training parameters. These embeddings are among the most commonly used English embeddings, vary in the datasets on which they were trained, and between them cover the best-known algorithms to construct embeddings. One finding in this work is that, although there is some heterogeneity, gender and ethnic bias is generally consistent across embeddings. Here we restrict descriptions to embeddings used in the main exposition. For consistency, only single words are used, all vectors are normalized by their l2 norm, and words are converted to lowercase.
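The preprocessing described above (single words only, lowercasing, l2 normalization) could look roughly like the sketch below. The vocabulary and vectors are made up, and two details are our own assumptions rather than anything stated in the text: that multi-word entries are recognizable by `_` or a space, and that the first entry wins when lowercasing makes two words collide.

```python
import numpy as np

# Made-up raw vocabulary and vectors standing in for a pre-trained model.
raw_words = ["King", "queen", "New_York", "king"]
raw_vectors = np.array([
    [3.0, 4.0],
    [0.0, 2.0],
    [1.0, 1.0],
    [6.0, 8.0],
])

words, rows = [], []
for word, vec in zip(raw_words, raw_vectors):
    if "_" in word or " " in word:  # assumption: multi-word entries contain "_" or " "
        continue
    word = word.lower()
    if word in words:               # assumption: keep the first entry on collision
        continue
    words.append(word)
    rows.append(vec)

matrix = np.array(rows)
# Divide each row by its Euclidean (l2) norm so every vector has length 1.
matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

print(words)
print(matrix)
```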

Google News word2vec vectors.
Vectors trained on about 100 billion words in the Google News dataset (24, 25). Vectors are available at https://code.google.com/archive/p/word2vec/.

Google Books/COHA.
Vectors trained on a combined corpus of genre-balanced Google Books and the COHA (48) by the authors of ref. 26. For each decade, a separate embedding is trained from the corpus data corresponding to that decade. The dataset is specifically designed to enable comparisons across decades, and the creators take special care to avoid selection bias issues. The vectors are available at https://nlp.stanford.edu/projects/histwords/, and we limit our analysis to the SVD and skip-gram with negative sampling (SGNS) (also known as word2vec) embeddings in the 1900s. Note that the Google Books data may include some non-American sources and the external metrics we use are American. However, this does not appreciably affect results. In the main text, we exclusively use SGNS embeddings; results with SVD embeddings are in SI Appendix and are qualitatively similar to the SGNS results. Unless otherwise specified, COHA indicates these embeddings trained using the SGNS algorithm.

New York Times.
We train embeddings over time from The New York Times Annotated Corpus (28), using 1.8 million articles from the New York Times between 1988 and 2005. We use the GloVe algorithm (27) and train embeddings over 3-year windows (so the 2000 embeddings, for example, contain articles from 1999 to 2001).

In SI Appendix we also use other embeddings available at https://nlp.stanford.edu/projects/glove/.

Models
- Word2Vec Model of ECCO, “Literature and Language,” 1700-99 (1.9 billion words; skip-gram size of 10 words): http://ryanheuser.org/data/word2vec.ECCO.skipgram_n=10.model.txt.gz
- Word2Vec Models for Twenty-year Periods of 18C (ECCO, “Literature and Language,” 1700-99) (150 million words each; skip-gram size of 10 words): https://archive.org/details/word-vectors-18c-word2vec-models-across-20-year-periods
- Word2Vec Model of ECCO-TCP, 1700-99 (80 million words; skip-gram size of 10 words): http://ryanheuser.org/data/word2vec.ECCO-TCP.skipgram_n=10.txt.zip
- Word2Vec Model of ECCO-TCP, 1700-99 (80 million words; skip-gram size of 5 words): http://ryanheuser.org/data/word2vec.ECCO-TCP.txt.zip

Code
- Code to evaluate a word2vec model against the Miller Analogies Test
- Code to produce a semantic network from a gensim word2vec model
- Code for aligning two gensim word2vec models using Procrustes matrix alignment
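For intuition about the last item, here is a minimal NumPy sketch of orthogonal Procrustes alignment. It is not the code referred to above: it assumes you have already extracted two matrices whose rows correspond to the same shared words in the same order, and `procrustes_align` is our own helper name. The data is random and only serves to check that a known rotation is recovered.

```python
import numpy as np


def procrustes_align(source, target):
    """Return the orthogonal matrix R minimising ||source @ R - target|| (Frobenius norm)."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


rng = np.random.default_rng(0)
target = rng.normal(size=(50, 5))                    # stand-in for model B: 50 shared words, 5 dims
rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # a random orthogonal matrix
source = target @ rotation.T                         # model A = model B in a rotated space

R = procrustes_align(source, target)
aligned = source @ R

print(np.allclose(aligned, target))  # True
```

With real models the two spaces differ by more than a rotation, so the aligned vectors will only approximately match; the remaining distance for a given word is what studies of semantic change tend to examine.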

Notes and links:
https://github.com/versae/word_vectors_dh2018
https://arxiv.org/pdf/1310.4546.pdf
http://www.pnas.org/content/early/2018/03/30/1720347115.full#sec-17
https://github.com/nikhgarg/EmbeddingDynamicStereotypes
http://ryanheuser.org/word-vectors/
http://ryanheuser.org/word2vec-vs-the-mat/
http://ahogrammer.com/2017/01/20/the-list-of-pretrained-word-embeddings/
https://github.com/Kyubyong/wordvectors
https://github.com/facebookresearch/fastText
https://pypi.org/project/fasttext/
https://gist.github.com/bhaettasch/d7f4e22e79df3c8b6c20
https://towardsdatascience.com/using-fasttext-and-svd-to-visualise-word-embeddings-instantly-5b8fa870c3d1
https://radimrehurek.com/gensim/models/fasttext.html#module-gensim.models.fasttext
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/FastText_Tutorial.ipynb
https://towardsdatascience.com/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c