# Word2Vec

This lesson is designed to explore features of word embeddings produced through the word2vec model.

The primary corpus we use consists of the <a href="http://txtlab.org/?p=601">150 English-language novels</a> made available by the <em>.txtLab</em> at McGill University. We also look at a <a href="http://ryanheuser.org/word-vectors-1/">Word2Vec model trained on the ECCO-TCP corpus</a> of 2,350 eighteenth-century literary texts made available by Ryan Heuser. (Note that I have cut the number of terms in the model by half in order to conserve memory.)

For further background on Word2Vec's mechanics, I suggest this <a href="https://www.tensorflow.org/versions/r0.8/tutorials/word2vec/index.html">brief tutorial</a> by Google, especially the sections "Motivation," "Skip-Gram Model," and "Visualizing."

### Workshop Agenda
<ol>
<li>Import & Pre-Processing</li>
<li>Word2Vec</li>
<ol><li>Training</li>
<li>Embeddings</li>
<li>Visualization</li>
</ol>
<li>Saving/Loading Models</li>
</ol>

# 0. Prep

### Visualization parameters

In [None]:
%pylab inline
matplotlib.style.use('ggplot')

### Import Packages

In [None]:
# Data Wrangling

import os
import numpy as np
import pandas
from scipy.spatial.distance import cosine
from sklearn.metrics import pairwise
from sklearn.manifold import MDS, TSNE

In [None]:
# Natural Language Processing

import gensim
import nltk
#nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize

In [None]:
# Custom Tokenizer for Classroom Use

def fast_tokenize(text):
    
    # Get a list of punctuation marks
    from string import punctuation
    
    lower_case = text.lower()
    
    # Iterate through text removing punctuation characters
    no_punct = "".join([char for char in lower_case if char not in punctuation])
    
    # Split text over whitespace into list of words
    tokens = no_punct.split()
    
    return tokens

# 1. Import & Pre-Processing

### Corpus Description
English-language subset of Andrew Piper's novel corpus, totaling 150 novels by British and American authors spanning the years 1771-1930. These texts reside on disk, each in a separate plaintext file. Metadata is contained in a spreadsheet distributed with the novel files.

### Metadata Columns
<ol><li>Filename: Name of file on disk</li>
<li>ID: Unique ID in Piper corpus</li>
<li>Language: Language of novel</li>
<li>Date: Initial publication date</li>
<li>Title: Title of novel</li>
<li>Gender: Authorial gender</li>
<li>Person: Textual perspective</li>
<li>Length: Number of tokens in novel</li></ol>

## Import Metadata

In [None]:
# Import Metadata into Pandas Dataframe

meta_df = pandas.read_csv('resources/txtlab_Novel450_English.csv')

In [None]:
# Check Metadata

meta_df

## Import Corpus

In [None]:
# Set location of corpus folder

fiction_folder = 'txtlab_Novel450_English/'

In [None]:
# Collect the text of each file in the 'fiction_folder' on the hard drive

# Create empty list, each entry will be the string for a given novel
novel_list = []

# Iterate through filenames in 'fiction_folder'
for filename in os.listdir(fiction_folder):
    
    # Read novel text as single string
    with open(fiction_folder + filename, 'r') as file_in:
        this_novel = file_in.read()
    
    # Add novel text as single string to master list
    novel_list.append(this_novel)

In [None]:
# Inspect first item in novel_list

novel_list[0]

## Pre-Processing
Word2Vec learns about the relationships among words by observing them in context. This means that we want to split our texts into word-units. However, we want to maintain sentence boundaries as well, since the last word of the previous sentence might skew the meaning of the next sentence.

Since novels were imported as single strings, we'll first use <i>sent_tokenize</i> to divide them into sentences, and second, we'll split each sentence into its own list of words.
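
As a small illustration of the data structure we are aiming for, the sketch below turns a two-sentence string into a list of token lists. It uses a naive regular-expression splitter (`naive_sent_split`, a hypothetical stand-in for `sent_tokenize` so that the example runs without NLTK's downloaded models) together with a compact copy of `fast_tokenize`.

```python
import re
from string import punctuation

def naive_sent_split(text):
    # Hypothetical stand-in for nltk's sent_tokenize:
    # split on whitespace that follows '.', '!' or '?'
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def fast_tokenize(text):
    # Same logic as the classroom tokenizer: lowercase, drop punctuation, split
    no_punct = "".join(char for char in text.lower() if char not in punctuation)
    return no_punct.split()

toy_text = "Call me Ishmael. Some years ago, I went to sea!"
toy_sentences = naive_sent_split(toy_text)
toy_words_by_sentence = [fast_tokenize(s) for s in toy_sentences]

print(toy_words_by_sentence)
# [['call', 'me', 'ishmael'], ['some', 'years', 'ago', 'i', 'went', 'to', 'sea']]
```

Each inner list is one sentence, so a context window never reaches across a sentence boundary.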

In [None]:
# Split each novel into sentences

sentences = [sentence for novel in novel_list for sentence in sent_tokenize(novel)]

In [None]:
# Inspect first sentence

sentences[0]

In [None]:
# Split each sentence into tokens

words_by_sentence = [fast_tokenize(sentence) for sentence in sentences]

In [None]:
# Remove any sentences that contain zero tokens

words_by_sentence = [sentence for sentence in words_by_sentence if sentence != []]

In [None]:
# Inspect first sentence

words_by_sentence[0]

# 2. Word2Vec

### Word Embedding
Word2Vec is one of the most prominent word embedding algorithms. Word embedding generally attempts to identify semantic relationships between words by observing them in context.

Imagine that each word in a novel has its meaning determined by the words that surround it within a limited window. For example, in Moby Dick's first sentence, “me” is flanked by “Call” and “Ishmael.” After observing the windows around every word in the novel (or many novels), the computer will notice that “me” tends to fall between pairs of words similar to those that surround “her,” “him,” or “them.” Of course, the computer has gone through the same process for the words “Call” and “Ishmael,” for which “me” is reciprocally part of their contexts. This chaining of signifiers to one another mirrors some of humanists' most sophisticated interpretative frameworks of language.

The two main flavors of Word2Vec are CBOW (Continuous Bag of Words) and Skip-Gram, which can be distinguished partly by their input and output during training. Skip-Gram takes a word of interest as its input (e.g. "me") and tries to learn how to predict its context words ("Call","Ishmael"). CBOW does the opposite: it takes the context words ("Call","Ishmael") as a single input and tries to predict the word of interest ("me").
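
To make the difference in inputs and outputs concrete, here is a minimal sketch that lists the training examples each flavor would draw from the sentence "Call me Ishmael" with a window of one word. The helper names (`context_window`, `skipgram_pairs`, `cbow_pairs`) are hypothetical, and the sketch leaves out parts of the real training procedure such as subsampling of frequent words and negative sampling.

```python
def context_window(tokens, i, window):
    # Words within `window` positions of index i, excluding the word itself
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def skipgram_pairs(tokens, window=1):
    # Skip-Gram: one (input word, context word to predict) pair per context word
    return [(tokens[i], context)
            for i in range(len(tokens))
            for context in context_window(tokens, i, window)]

def cbow_pairs(tokens, window=1):
    # CBOW: one (all context words, word to predict) pair per position
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

tokens = ['call', 'me', 'ishmael']

print(skipgram_pairs(tokens))
# [('call', 'me'), ('me', 'call'), ('me', 'ishmael'), ('ishmael', 'me')]

print(cbow_pairs(tokens))
# [(['me'], 'call'), (['call', 'ishmael'], 'me'), (['me'], 'ishmael')]
```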

In general, CBOW is faster and does well with frequent words, while Skip-Gram potentially represents rare words better.

### Word2Vec Features
<ul>
<li>Size: Number of dimensions for word embedding model</li>
<li>Window: Number of context words to observe in each direction</li>
<li>min_count: Minimum frequency for words included in model</li>
<li>sg (Skip-Gram): '0' indicates CBOW model; '1' indicates Skip-Gram</li>
<li>Alpha: Learning rate (initial); prevents model from over-correcting, enables finer tuning</li>
<li>Iterations: Number of passes through dataset</li>
<li>Batch Size (batch_words): Target number of words in each batch of examples passed to worker threads</li>
</ul>

Note: The training script uses gensim's default value for each argument except min_count (default 5) and sg (default 0, i.e. CBOW).
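
As a rough illustration of what `min_count` does, the sketch below counts tokens in a three-sentence toy corpus and keeps only those that meet a frequency threshold. The corpus and the threshold of 2 are made up for this example, and this is only the counting idea, not gensim's internal vocabulary code.

```python
from collections import Counter

# Hypothetical toy corpus, already split into sentences and tokens
toy_sentences = [['the', 'whale', 'swam'],
                 ['the', 'whale', 'dove'],
                 ['the', 'ship', 'sank']]

# Count every token across all sentences
counts = Counter(token for sentence in toy_sentences for token in sentence)

# Keep only words that occur at least `min_count` times
min_count = 2  # assumed threshold for this toy corpus
vocab = sorted(word for word, n in counts.items() if n >= min_count)

print(vocab)
# ['the', 'whale']
```

Words below the threshold receive no vector at all, which is why rare words will raise a `KeyError` when looked up in a trained model.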

## Training

In [None]:
# Train word2vec model from txtLab corpus

model = gensim.models.Word2Vec(words_by_sentence, size=100, window=5, \
                               min_count=25, sg=1, alpha=0.025, iter=5, batch_words=10000)

## Embeddings

In [None]:
# Return dense word vector

model.wv['whale']
#model['whale'] # deprecated

## Vector-Space Operations

### Similarity
Since words are represented as dense vectors, we can ask how similar words' meanings are based on their cosine similarity (essentially how closely their vectors point in the same direction). <em>gensim</em> has a few out-of-the-box functions that enable different kinds of comparisons.
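
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it with NumPy on three made-up two-dimensional vectors; the function name `cosine_similarity` is our own, not part of gensim.

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the vectors' lengths
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 2-D vectors for illustration
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 1.0])

print(cosine_similarity(a, a))  # identical direction: similarity of 1
print(cosine_similarity(a, b))  # 45 degrees apart: about 0.707
print(cosine_similarity(a, c))  # orthogonal: similarity of 0
```

Because the measure depends only on direction, a long vector and a short vector pointing the same way count as identical.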

In [None]:
# Find cosine similarity between two given word vectors

model.similarity('pride','prejudice')

In [None]:
# Find nearest word vectors by cosine similarity

model.most_similar('pride')

In [None]:
# Given a list of words, we can ask which doesn't belong

# Finds mean vector of words in list
# and identifies the word furthest from that mean

model.doesnt_match(['pride','prejudice', 'whale'])
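
The idea behind the odd-one-out test can be sketched in a few lines of NumPy: normalize each vector, average them, and report the word least similar to that average. The vectors below are invented for illustration and `odd_one_out` is a hypothetical helper, so treat this as a sketch of the idea rather than gensim's exact code.

```python
import numpy as np

def odd_one_out(words, vectors):
    # Scale every vector to unit length so that only direction matters
    unit = np.array([v / np.linalg.norm(v) for v in vectors])
    # Mean direction of the group, also scaled to unit length
    mean = unit.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    # Cosine similarity of each word to the mean; the lowest is the outlier
    sims = unit @ mean
    return words[int(np.argmin(sims))]

# Invented 2-D vectors: two words point one way, the third another
toy_words = ['pride', 'prejudice', 'whale']
toy_vectors = [np.array([1.0, 0.1]),
               np.array([0.9, 0.2]),
               np.array([0.1, 1.0])]

print(odd_one_out(toy_words, toy_vectors))
# whale
```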

### Multiple Valences
A word embedding may encode primary and secondary meanings at the same time. To isolate a word's secondary meanings, we can subtract the vectors of primary (or simply unwanted) meanings. For example, we may wish to remove the sense of <em>river bank</em> from the word <em>bank</em>. This would be written mathematically as <em>BANK - RIVER</em>, which in <em>gensim</em>'s interface lists <em>BANK</em> as a positive meaning and <em>RIVER</em> as a negative one.

In [None]:
# Get most similar words to BANK, in order
# to get a sense for its primary meaning

model.most_similar('bank')

In [None]:
# Remove the sense of "river bank" from "bank" and see what is left

model.most_similar(positive=['bank'], negative=['river'])

### Analogy
Analogies are rendered as simple mathematical operations in vector space. For example, the canonic word2vec analogy <em>MAN is to KING as WOMAN is to ??</em> is rendered as <em>KING - MAN + WOMAN</em>. In the gensim interface, we designate <em>KING</em> and <em>WOMAN</em> as positive terms and <em>MAN</em> as a negative term, since it is subtracted from those.
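
The arithmetic can be checked by hand on a toy vocabulary. In the sketch below, the four-dimensional vectors are invented so that the dimensions loosely mean "royal," "male," "female," and "fruit"; `analogy` is a hypothetical helper that adds and subtracts raw vectors and ranks the remaining words by cosine similarity. As far as I know, gensim normalizes vectors to unit length before combining them, so its scores would differ slightly from these.

```python
import numpy as np

# Invented vectors: dimensions loosely mean (royal, male, female, fruit)
toy = {
    'man':   np.array([0.0, 1.0, 0.0, 0.0]),
    'woman': np.array([0.0, 0.0, 1.0, 0.0]),
    'king':  np.array([1.0, 1.0, 0.0, 0.0]),
    'queen': np.array([1.0, 0.0, 1.0, 0.0]),
    'apple': np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(positive, negative, vectors):
    # Add the positive terms and subtract the negative ones
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    # Rank every word that was not part of the query by similarity to the target
    candidates = [w for w in vectors if w not in positive + negative]
    scored = [(w, cosine(vectors[w], target)) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

result = analogy(positive=['king', 'woman'], negative=['man'], vectors=toy)
print(result[0][0])
# queen
```

Note that the query words themselves are excluded from the ranking; otherwise <em>KING</em> would often be its own nearest neighbor.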

In [None]:
# Get most similar words to KING, in order
# to get a sense for its primary meaning

model.most_similar('king')

In [None]:
# The canonic word2vec analogy: King - Man + Woman -> Queen

model.most_similar(positive=['woman', 'king'], negative=['man'])

### Gendered Vectors
Can we find gender a la Schmidt (2015)? (Note that this method uses vector projection, whereas Schmidt had used rejection.)
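
For readers unfamiliar with the two terms: projecting a vector onto an axis keeps only the part that lies along that axis, while rejection keeps what is left after that part is removed. The sketch below shows both on made-up two-dimensional vectors. The names `project`, `reject`, and `gender_axis` are hypothetical, and this illustrates the geometry only, not Schmidt's code or gensim's internals.

```python
import numpy as np

def project(v, axis):
    # Component of v that lies along `axis`
    return (np.dot(v, axis) / np.dot(axis, axis)) * axis

def reject(v, axis):
    # What remains of v once its component along `axis` is removed
    return v - project(v, axis)

# Made-up vectors: the first dimension plays the role of a "gender" axis
gender_axis = np.array([1.0, 0.0])
word = np.array([3.0, 4.0])

print(project(word, gender_axis))  # [3. 0.]
print(reject(word, gender_axis))   # [0. 4.]
```

The two parts always sum back to the original vector, and the rejection is orthogonal to the axis.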

In [None]:
# Feminine Vector

model.most_similar(positive=['she','her','hers','herself'], negative=['he','him','his','himself'])

In [None]:
# Masculine Vector

model.most_similar(positive=['he','him','his','himself'], negative=['she','her','hers','herself'])

### Exercises

In [None]:
## EX. Use the most_similar method to find the tokens nearest to 'car' in our model.
##     Do the same for 'motorcar'.

## Q.  What characterizes each word in our corpus? Does this make sense?

In [None]:
model.most_similar('car')

In [None]:
model.most_similar('motorcar')

In [None]:
## EX. How does our model answer the analogy: MADRID is to SPAIN as PARIS is to __________

## Q.  What has our model learned about nation-states?

In [None]:
model.most_similar(positive=['spain', 'paris'], negative=['madrid'])

In [None]:
## EX. Perform the canonic Word2Vec addition again but leave out a term:
##     Try 'king' - 'man', 'woman' - 'man', 'woman' + 'king'

## Q.  What do these indicate semantically?

In [None]:
model.most_similar(positive=['king'], negative=['man'])

In [None]:
model.most_similar(positive=['woman'], negative=['man'])

In [None]:
model.most_similar(positive=['woman', 'king'])

## Visualization

In [None]:
# Dictionary of words in model

model.wv.vocab
#model.vocab # deprecated

In [None]:
# Visualizing the whole vocabulary would make it hard to read

len(model.wv.vocab)
#len(model.vocab) # deprecated

In [None]:
# For interpretability, we'll select words that already have a semantic relation

her_tokens = [token for token,weight in model.most_similar(positive=['she','her','hers','herself'], \
                                                       negative=['he','him','his','himself'], topn=50)]

In [None]:
# Inspect list

her_tokens

In [None]:
# Get the vector for each sampled word

vectors = [model.wv[word] for word in her_tokens]

In [None]:
# Calculate distances among words in vector space

dist_matrix = pairwise.pairwise_distances(vectors, metric='cosine')
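
For intuition about what the distance matrix contains, here is a NumPy sketch that computes cosine distance (one minus cosine similarity) between every pair of rows. The helper name `cosine_distance_matrix` and the toy vectors are our own.

```python
import numpy as np

def cosine_distance_matrix(vectors):
    X = np.asarray(vectors, dtype=float)
    # Scale each row to unit length
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Dot products of unit vectors are cosine similarities
    return 1.0 - unit @ unit.T

# Made-up vectors for three "words"
toy_vectors = [[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]]

toy_dist = cosine_distance_matrix(toy_vectors)
print(toy_dist.shape)
# (3, 3)
```

The result is symmetric with zeros on the diagonal (up to floating-point error), which is the form `MDS` expects when `dissimilarity='precomputed'`.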

In [None]:
# Multi-Dimensional Scaling (Project vectors into 2-D)

mds = MDS(n_components = 2, dissimilarity='precomputed')
embeddings = mds.fit_transform(dist_matrix)

In [None]:
# Fussing with matplotlib

_, ax = plt.subplots(figsize=(10,10))
ax.scatter(embeddings[:,0], embeddings[:,1], alpha=0)
for i in range(len(vectors)):
    ax.annotate(her_tokens[i], ((embeddings[i,0], embeddings[i,1])))

In [None]:
# For comparison, here is the same graph using a masculine-pronoun vector

his_tokens = [token for token,weight in model.most_similar(positive=['he','him','his','himself'], \
                                                       negative=['she','her','hers','herself'], topn=50)]
vectors = [model.wv[word] for word in his_tokens]
dist_matrix = pairwise.pairwise_distances(vectors, metric='cosine')
mds = MDS(n_components = 2, dissimilarity='precomputed')
embeddings = mds.fit_transform(dist_matrix)
_, ax = plt.subplots(figsize=(10,10))
ax.scatter(embeddings[:,0], embeddings[:,1], alpha=0)
for i in range(len(vectors)):
    ax.annotate(his_tokens[i], ((embeddings[i,0], embeddings[i,1])))

In [None]:
## Q. What kinds of semantic relationships exist in the diagram above?
##    Are there any words that seem out of place?

<img src="resources/fem_vectors.png" width="50%" style="float: left;" /><img src="resources/masc_vectors.png" width="50%" />

# 3. Saving/Loading Models

In [None]:
# Save current model for later use

model.wv.save_word2vec_format('resources/word2vec.txtlab_Novel150_English.txt')
#model.save_word2vec_format('resources/word2vec.txtlab_Novel150_English.txt') # deprecated

In [None]:
# Load up models from disk

# Model trained on Eighteenth Century Collections Online corpus (~2500 texts)
# Made available by Ryan Heuser: http://ryanheuser.org/word-vectors-1/

ecco_model = gensim.models.KeyedVectors.load_word2vec_format('resources/word2vec.ECCO-TCP.txt')
#ecco_model = gensim.models.Word2Vec.load_word2vec_format('resources/word2vec.ECCO-TCP.txt') # deprecated
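
The word2vec text format is simple enough to read by eye: a header line giving the vocabulary size and the number of dimensions, followed by one line per word with the word and its vector components separated by spaces. The sketch below writes and re-reads a two-word toy model in that layout using an in-memory buffer. The helper names `write_w2v_text` and `read_w2v_text` are hypothetical; in practice, gensim's own save and load functions handle this.

```python
import io
import numpy as np

def write_w2v_text(vectors, handle):
    # Header: vocabulary size and vector dimensionality
    dim = len(next(iter(vectors.values())))
    handle.write(f"{len(vectors)} {dim}\n")
    # One line per word: the word followed by its components
    for word, vec in vectors.items():
        handle.write(word + " " + " ".join(repr(float(x)) for x in vec) + "\n")

def read_w2v_text(handle):
    n_words, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for _ in range(n_words):
        parts = handle.readline().rstrip("\n").split(" ")
        vec = np.array([float(x) for x in parts[1:]])
        if len(vec) != dim:
            raise ValueError("vector length does not match header")
        vectors[parts[0]] = vec
    return vectors

# Made-up two-word model
toy_model = {'whale': np.array([0.5, -1.0]), 'ship': np.array([0.25, 2.0])}

buffer = io.StringIO()
write_w2v_text(toy_model, buffer)
print(buffer.getvalue())

buffer.seek(0)
reloaded = read_w2v_text(buffer)
```

Only the final vectors are stored in this format, so a model loaded this way can be queried but not trained further.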

In [None]:
# What are similar words to BANK?

ecco_model.most_similar('bank')

In [None]:
# What if we remove the sense of "river bank"?

ecco_model.most_similar(positive=['bank'], negative=['river'])

### Exercise

In [None]:
## EX. Heuser's blog post explores an analogy in eighteenth-century thought that
##     RICHES are to VIRTUE what LEARNING is to GENIUS. How true is this in
##     the ECCO-trained Word2Vec model? Is it true in the one we trained?

##  Q. How might we compare word2vec models more generally?

In [None]:
# ECCO model: RICHES are to VIRTUE what LEARNING is to ??

ecco_model.most_similar(positive=['virtue', 'learning'], negative=['riches'])

In [None]:
# txtLab model: RICHES are to VIRTUE what LEARNING is to ??

model.most_similar(positive=['virtue', 'learning'], negative=['riches'])

# 4. Open Questions
At this point, we have seen a number of mathematical operations that we may use to explore word2vec's word embeddings. These enable us to answer a set of new, interesting questions dealing with semantics, yet many other questions remain unanswered.

For example:
<ol>
<li>How to compare word usages in different texts (within the same model)?</li>
<li>How to compare word meanings in different models? How to compare whole models?</li>
<li>What about the space “in between” words?</li>
<li>Do we agree with the Distributional Hypothesis that words with the same contexts share their meanings?</li>
<ol><li>If not, then what information do we think is encoded in a word’s context?</li></ol>
<li>What good, humanistic research questions do analogies shed light on?</li>
<ol><li>shades of meaning?</li><li>context similarity?</li></ol>
</ol>