# Word Vectors

*This lesson is based on the [Advanced Topics in Word Vectors workshop](https://dh2018.adho.org/en/machine-reading-part-ii-advanced-topics-in-word-vectors/) at DH 2018 as well as tutorials by [Radim Rehurek](https://rare-technologies.com/word2vec-tutorial/) and [Chris McCormick](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/).*

TK: EXPLAIN WHY WORD VECTORS?

## Install gensim

In [2]:
# gensim is already installed on JupyterHub
#!pip install --upgrade gensim


## Import gensim and nltk tokenizers

In [9]:
import gensim # remember this! 
from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer
import nltk
nltk.download('punkt')
import glob
from pathlib import Path

[nltk_data] Downloading package punkt to /Users/dsinyki/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Read in our corpus

In [4]:
# next, we need to create our corpus.
# in this case, we can start by creating the same list of docs as in our previous class

import os

base_dir = "../docs/NYT-Obituaries/" # NOTE: Your path may be different!!!

all_docs = [] # our list which will store the text of each doc; empty for now

docs = os.listdir(base_dir) # get a list of all the files in the directory

for doc in docs: # iterate through the docs
    if not doc.startswith('.'): # skip hidden files (e.g. .DS_Store)
        with open(base_dir + doc, "r", encoding="utf-8") as file: # force unicode conversion to keep PCs happy
            text = file.read() # read in the file as a single text string
            all_docs.append(text) # append it to the all_docs list

# lastly, just check the length of all_docs to confirm how many docs were read in
len(all_docs)

378

## Preprocessing

We need each doc tokenized by sentence. 

So let's define a function that takes a list of texts (e.g. our `all_docs` list) and converts it into the format that gensim's word2vec expects. The function will lower-case each text, split it into sentences, and split each sentence into word tokens.
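To make the target format concrete, here is a minimal sketch of the "list of sentences, each a list of lower-cased word tokens" shape. It uses two simple regular expressions as stand-ins for NLTK's `sent_tokenize` and `TreebankWordTokenizer`, so the exact tokens will differ from what the real tokenizers produce. `toy_make_sentences` and `toy_docs` are hypothetical names used only for illustration.

```python
import re

def toy_make_sentences(list_txt):
    """Return a list of sentences, each a list of lower-cased word tokens.

    Assumption: regex splitting stands in for the NLTK tokenizers here,
    so this only illustrates the output shape, not NLTK's behavior.
    """
    all_txt = []
    for txt in list_txt:
        # naive sentence split: break after ., ! or ? followed by whitespace
        for sent in re.split(r"(?<=[.!?])\s+", txt.lower().strip()):
            # naive word split: runs of letters, digits and apostrophes
            tokens = re.findall(r"[a-z0-9']+", sent)
            if tokens:
                all_txt.append(tokens)
    return all_txt

# made-up example texts
toy_docs = ["He was born in Ohio. He died in New York.", "She wrote three novels."]
toy_sentences = toy_make_sentences(toy_docs)
print(toy_sentences)
```

Note that the sentences from all documents end up in one flat list: word2vec only looks at words within a sentence, so document boundaries are not kept.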

In [13]:
# need our handy nltk tokenizer 
tokenizer = TreebankWordTokenizer()

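# and we'll get titles
# NOTE: make_sentences below matches these titles to all_docs by position,
# which assumes glob returns the files in the same order as os.listdir did above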
directory = "../docs/NYT-Obituaries/"
files = glob.glob(f"{directory}/*.txt")
obit_titles = [Path(file).stem for file in files]

# and the function
def make_sentences(list_txt):
    all_txt = []
    counter = 0
    for txt in list_txt:
        lower_txt = txt.lower()
        sentences = sent_tokenize(lower_txt)
        sentences = [tokenizer.tokenize(sent) for sent in sentences]
        all_txt += sentences
        print(obit_titles[counter])
        print(len(sentences))  # let's check how many sentences there are per item
        print("\n")
        counter += 1
    return all_txt

In [14]:
# now let's run it

sentences = make_sentences(all_docs)

1945-Adolf-Hitler
597


1915-F-W-Taylor
15


1975-Chiang-Kai-shek
240


1984-Ethel-Merman
110


1953-Jim-Thorpe
79


1964-Nella-Larsen
44


1955-Margaret-Abbott
52


1984-Lillian-Hellman
159


1959-Cecil-De-Mille
65


1928-Mabel-Craty
14


1973-Eddie-Rickenbacker
125


1989-Ferdinand-Marcos
53


1991-Martha-Graham
118


1997-Deng-Xiaoping
274


1938-George-E-Hale
31


1885-Ulysses-Grant
1619


1909-Sarah-Orne-Jewett
20


1957-Christian-Dior
43


1987-Clare-Boothe-Luce
94


1976-Jacques-Monod
41


1954-Getulio-Vargas
56


1979-Stan-Kenton
58


1990-Leonard-Bernstein
203


1972-Jackie-Robinson
164


1998-Fred-W-Friendly
81


1991-Leo-Durocher
44


1915-B-T-Washington
54


1997-James-Stewart
101


1981-Joe-Louis
156


1983-Muddy-Waters
38


1942-George-M-Cohan
160


1989-Samuel-Beckett
160


1962-Marilyn-Monroe
84


2000-Charles-M-Schulz
296


1967-Gregory-Pincus
59


1894-R-L-Stevenson
106


1978-Bruce-Catton
8


1982-Arthur-Rubinstein
174


1875-Andrew-Johnson
103


1974-Charles-Lindber

1981-Robert-Moses
186


1989-Robert-Penn-Warren
84


1901-William-McKinley
161


1970-Walter-Reuther
23


1930-Balfour
141


1984-Indira-Gandhi
309


1978-Golda-Meir
242


1983-Earl-Hines
25


1974-Katharine-Cornell
102


1982-Lee-Strasberg
86


1939-Pope-Pius-XI
112


1886-Mary-Ewing-Outerbridge
49


1993-Dizzy-Gillespie
95


1910-Florence-Nightingale
75


1960-Richard-Wright
36


1986-The-Challenger
38


1992-Menachem-Begin
150


1998-Galina-Ulanova
32


1976-Max-Ernst
98


1993-Cesar-Chavez
78


1965-Adlai-Ewing-Stevenson
187


1935-Adolph-S-Ochs
446


1941-Lou-Gehrig
8


1961-Carl-G-Jung
92


1963-Robert-Frost
108


1965-Edward-R-Murrow
197


1971-Dean-Acheson
166


1986-Jorge-Luis-Borges
102


1966-Walt-Disney
142


1996-Carl-Sagan
58


1959-Ross-G-Harrison
14


1945-Jerome-Kern
62


1991-Frank-Capra
69


1987-Andres-Segovie
85


1987-Rita-Hayworth
62


1993-William-Golding
42


1932-Florenz-Ziegfeld
66


1938-Constantin-Stanislavsky
55




## Train vectors

To train our vectors we call `gensim.models.Word2Vec`, as shown below. It has a couple dozen parameters, some of which are more important than others.
Here are a few major ones. Strictly speaking every one of them has a default value, so the asterisk marks the ones you will almost always want to set yourself:
1. `sentences*`: This is where you provide your data. It must be an iterable of iterables: for example, a list of sentences, each of which is a list of word tokens.
2. `sg`: Your choice of training algorithm. There are two standard ways of training word2vec vectors: skip-gram and CBOW (continuous bag of words). If you enter 1 here, skip-gram is applied; otherwise, the default is CBOW.
3. `size*`: This is the length of your resulting word vectors; the default is 100. If you have a large corpus (more than a few billion tokens) you can go up to 100-300 dimensions. Word vectors with more dimensions generally give better results, provided there is enough text to train them. In gensim 4.0 and later this parameter is called `vector_size`.
4. `window`: This is the window of context words you are training on, in other words, how many words before and after a given word count as its context. A good number is 4 here (the default is 5), but this can vary depending on what you are interested in. For instance, smaller windows tend to group words that behave alike and can stand in for one another, while larger windows tend to capture broader topical relatedness.
5. `alpha`: The initial learning rate of your model. If you are interested in machine learning experimentation with your vectors, you may experiment with this parameter.
6. `seed` (int): This is the seed for the random number generator. Like other neural models, word2vec initializes its weights with random floats before training. Setting a seed is useful if you want to replicate your experiments, because the 'random' initialization then becomes deterministic. gensim's documentation notes that a fully reproducible run additionally needs a single worker thread (`workers=1`).
7. `min_count`: This is the minimum frequency threshold. Any word that appears fewer times than this in the corpus is ignored (the default is 5). This is here because words with very low frequency do not provide enough examples to train a reliable vector.
8. `iter`: This is the number of iterations (complete passes) over the corpus, also known as epochs. Usually anything between 1 and 10 is OK (the default is 5). The trade-off is that more iterations take longer to train and the model may overfit on your dataset; on the other hand, longer training will allow your vectors to perform better on tasks relevant to your dataset. In gensim 4.0 and later this parameter is called `epochs`.
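
To see what `window` controls, the sketch below lists the (center word, context word) pairs that a window of a given size produces for one tokenized sentence. These pairs are, roughly, what skip-gram trains on; CBOW uses the same context words but predicts the center word from all of them at once. `context_pairs` is a hypothetical helper, not part of gensim, and it ignores details of real implementations such as randomly shrinking the window or subsampling frequent words.

```python
def context_pairs(tokens, window):
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield center, tokens[j]

# made-up example sentence, already tokenized
sentence = ["she", "wrote", "three", "novels"]
pairs_w1 = list(context_pairs(sentence, window=1))
pairs_w2 = list(context_pairs(sentence, window=2))
print(pairs_w1)
print(len(pairs_w1), len(pairs_w2))
```

A larger window yields more pairs per sentence, and those extra pairs link words that sit further apart.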

Overall, most of these settings will not concern you unless you are interested in very specific uses of word vectors.

In [None]:
# let's train our model!

ccp_model = gensim.models.Word2Vec(
    sentences,
    min_count=2, # default is 5; 2 drops words that appear only once
    size=200, # length of each word vector; default is 100 (vector_size in gensim >= 4.0)
    workers=5) # number of worker threads; parallel training needs Cython
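
The `min_count=2` setting above can be pictured as a simple frequency filter applied before training. The sketch below counts tokens across a toy corpus and keeps only those that occur at least `min_count` times. It is a plain-Python illustration with made-up data, not gensim's own vocabulary-building code, and `trim_vocab` is a hypothetical helper.

```python
from collections import Counter

# made-up toy corpus in the "list of token lists" format
toy_sentences = [
    ["freedom", "and", "justice"],
    ["freedom", "and", "liberty"],
    ["dinner", "and", "justice"],
]

def trim_vocab(sentences, min_count):
    """Return the set of words that appear at least `min_count` times."""
    counts = Counter(token for sent in sentences for token in sent)
    return {word for word, n in counts.items() if n >= min_count}

kept = trim_vocab(toy_sentences, min_count=2)
print(sorted(kept))  # words seen only once ("liberty", "dinner") are dropped
```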

In [None]:
# take a quick look at the vocab (in gensim 4.0 and later, use ccp_model.wv.key_to_index)

ccp_model.wv.vocab

It's often useful to save your trained model to disk so that you can reload it as needed. 

In [None]:
ccp_model.save('ccp_model')

And you can load a saved model back in the same way (remember this from our topic model).

In [None]:
old_model = gensim.models.Word2Vec.load('ccp_model') 

In [None]:
# testing some basic functions

# basic similarity
ccp_model.wv.most_similar("freedom", topn=10)

In [None]:
# similarity b/t two words

print(ccp_model.wv.similarity(w1="freedom",w2="justice"))
print(ccp_model.wv.similarity(w1="freedom",w2="dinner"))
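
The `similarity` scores above are cosine similarities: the dot product of the two word vectors divided by the product of their lengths, which ranges from -1 to 1. Here is a minimal sketch with made-up three-dimensional vectors (the vectors in this model have 200 dimensions). `cosine_similarity` is a hypothetical helper written for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy vectors, invented for illustration only
freedom = [1.0, 2.0, 0.0]
justice = [2.0, 4.0, 0.0]   # points in the same direction as `freedom`
dinner = [0.0, 0.0, 3.0]    # orthogonal to `freedom`

sim_close = cosine_similarity(freedom, justice)
sim_far = cosine_similarity(freedom, dinner)
print(sim_close)
print(sim_far)
```

Because the score depends only on the angle between the vectors, `justice` scores as maximally similar to `freedom` here even though it is twice as long.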

You can also play with analogy tasks. The commonly seen task is:

'Man is to King as Woman is to ____?'


'A is to A\* as B is to B\*'
                         
Gensim provides two different ways of implementing this task. You may be more familiar with the additive version, also called the 3CosAdd method (the other is the multiplicative 3CosMul method):

$$\underset{b^* \in V}{\textrm{arg max}} \left( \cos(b^*, b) - \cos(b^*, a) + \cos(b^*, a^*) \right)$$

Here $a$ is Man, $a^*$ is King, and $b$ is Woman, so this reflects the vector arithmetic Woman - Man + King. The maximization searches the vocabulary $V$ for the word $b^*$ whose vector produces the highest value of this expression.
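
As a worked illustration of the 3CosAdd objective, the sketch below scores every candidate word in a tiny hand-made vocabulary and returns the best one. The two-dimensional vectors are invented so that one axis loosely tracks "royalty" and the other "gender". `three_cos_add` is a hypothetical helper, not a gensim function. Like gensim's `most_similar`, it leaves the three query words out of the candidates.

```python
import math

def cos(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# invented toy vectors: (royalty, gender)
vectors = {
    "man":    [0.1, 1.0],
    "woman":  [0.1, -1.0],
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "dinner": [-1.0, 0.1],
}

def three_cos_add(a, a_star, b, vectors):
    """Return the word b* maximizing cos(b*, b) - cos(b*, a) + cos(b*, a*)."""
    candidates = [w for w in vectors if w not in (a, a_star, b)]

    def score(w):
        v = vectors[w]
        return cos(v, vectors[b]) - cos(v, vectors[a]) + cos(v, vectors[a_star])

    return max(candidates, key=score)

# "man is to king as woman is to ???"
answer = three_cos_add("man", "king", "woman", vectors)
print(answer)
```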

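We can implement this method with a provided function, `most_similar`. Here `positive` refers to the words whose similarity counts in favor of a candidate ($b$ and $a^*$, which are added), and `negative` refers to the words whose similarity counts against it ($a$, which is subtracted). In the multiplicative version these become the numerator and the denominator, respectively. Here's the additive method.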

In [None]:
# analogies (performed via the additive 3CosAdd method)
# format is: "freedom is to slavery as liberty is to ???"

ccp_model.wv.most_similar(positive=['liberty', 'slavery'], negative=['freedom'])
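
The second, multiplicative method (3CosMul) multiplies and divides the similarities instead of adding and subtracting them:

$$\underset{b^* \in V}{\textrm{arg max}} \frac{\cos(b^*, b)\,\cos(b^*, a^*)}{\cos(b^*, a) + \varepsilon}$$

In gensim this is available as `most_similar_cosmul`, which takes the same `positive` and `negative` arguments as `most_similar`. The sketch below is a plain-Python illustration on invented toy vectors, not gensim's implementation. `three_cos_mul` is a hypothetical helper, and the similarities are shifted from [-1, 1] to [0, 1] so that the ratio is well defined.

```python
import math

def cos01(u, v):
    """Cosine similarity rescaled from [-1, 1] to [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    c = dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
    return (c + 1) / 2

# invented toy vectors, for illustration only
vectors = {
    "freedom": [1.0, 1.0],
    "slavery": [1.0, -1.0],
    "liberty": [0.9, 1.0],
    "bondage": [0.9, -1.0],
    "dinner":  [-1.0, 0.2],
}

def three_cos_mul(positive, negative, vectors, eps=1e-6):
    """Return the word maximizing prod(sim to positives) / (prod(sim to negatives) + eps)."""
    candidates = [w for w in vectors if w not in positive and w not in negative]

    def score(w):
        num = math.prod(cos01(vectors[w], vectors[p]) for p in positive)
        den = math.prod(cos01(vectors[w], vectors[n]) for n in negative) + eps
        return num / den

    return max(candidates, key=score)

# "freedom is to slavery as liberty is to ???"
answer = three_cos_mul(positive=["liberty", "slavery"], negative=["freedom"], vectors=vectors)
print(answer)
```

With the real model, the equivalent call would be `ccp_model.wv.most_similar_cosmul(positive=['liberty', 'slavery'], negative=['freedom'])`.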

Gensim has quite a few built-in tools, and it's worth taking some time to see what's available. Check the documentation here: [https://radimrehurek.com/gensim/models/keyedvectors.html](https://radimrehurek.com/gensim/models/keyedvectors.html)


In [None]:
### Let's do some visualization ###

import numpy as np

# Get the interactive Tools for Matplotlib
%matplotlib notebook
import matplotlib.pyplot as plt
plt.style.use('ggplot')

from sklearn.decomposition import PCA

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors

In [None]:
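def display_pca_scatterplot(model, words=None, sample=0):
    # choose the words to plot: an explicit list, a random sample, or the whole vocab
    if words is None:
        if sample > 0:
            words = np.random.choice(list(model.wv.vocab.keys()), sample)
        else:
            words = [ word for word in model.wv.vocab ]
        
    # look up each word's vector via model.wv (indexing the model directly is deprecated)
    word_vectors = np.array([model.wv[w] for w in words])

    # project the vectors onto their first two principal components
    twodim = PCA().fit_transform(word_vectors)[:,:2]
    
    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
    # label each point with its word, offset slightly from the dot
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)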

In [None]:
display_pca_scatterplot(ccp_model, ['freedom','liberty','slavery','abolition','emancipation'])

# display_pca_scatterplot(ccp_model, sample=20)