# Machine Reading: Advanced Topics in Word Vectors
## Part II. Word Vectors via Word2Vec (50 mins)

This is a 4-part series of Jupyter notebooks on word embeddings, originally created for a workshop at the Digital Humanities 2018 Conference in Mexico City. Each part combines theoretical explanations with fill-in-the-blanks activities of increasing difficulty.

Instructors:
- Eun Seo Jo, <a href="mailto:eunseo@stanford.edu">*eunseo@stanford.edu*</a>, Stanford University
- Javier de la Rosa, <a href="mailto:versae@stanford.edu">*versae@stanford.edu*</a>, Stanford University
- Scott Bailey, <a href="mailto:scottbailey@stanford.edu">*scottbailey@stanford.edu*</a>, Stanford University

This unit focuses on Word2Vec as an example of neural-network-based approaches to vector encodings. It starts with a conceptual overview of the algorithm itself and ends with an activity in which participants train their own vectors.

● 0:00 - 0:15 Conceptual explanation of Word2Vec

● 0:15 - 0:30 Word2Vec Visualization and Vectorial Features and Math

● 0:30 - 0:50 [Activity 2] Word2Vec Construction [using Gensim] and Visualization (from part 1) [We provide corpus]

In [None]:
!pip install -r requirements.txt

In [1]:
import gensim
from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer

In [2]:
### reimporting and reloading materials from part 1
from nltk.corpus import gutenberg

In [3]:
mobydick = gutenberg.raw('melville-moby_dick.txt')
emma = gutenberg.raw('austen-emma.txt')
parents = gutenberg.raw('edgeworth-parents.txt')

In [4]:
corpus = [mobydick, emma, parents]

In [5]:
#Let's split our corpus into sentences. Here is an example of doing this on Moby Dick.
sentences = sent_tokenize(corpus[0])


In [6]:
tokenizer = TreebankWordTokenizer()

In [7]:
#Takes as input a list of text and converts it for gensim word2vec 
# (lower case, sentence tokenization, tokenization)
# sentences = [['hi', 'there'], ['this', 'is', 'a', 'sentence']]

def makeSentences(list_txt):
  all_txt = []
  for txt in list_txt:
    lower_txt = txt.lower()
    sentences = sent_tokenize(lower_txt)
    sentences = [tokenizer.tokenize(sent) for sent in sentences]
    all_txt += sentences
    print(len(sentences))# let's check how many sentences there are per item
  return all_txt

In [8]:
sentences = makeSentences(corpus)

9822
7489
10054


In [9]:
#To train our vectors we call the function below. It has a couple dozen parameters, some more important than others.
#We will explain a few major parameters here. The only field you normally have to supply is marked with an asterisk; all others have defaults.
# 1. sentences*: This is where you provide your data. It must be an iterable of iterables (for example, a list of tokenized sentences).
# 2. sg: Your choice of training algorithm. There are two ways of training W2V vectors: skip-gram and CBOW.
#        If you enter 1 here, skip-gram is applied; otherwise, the default is CBOW.
# 3. size: This is the length of your resulting word vectors (default 100; renamed `vector_size` in gensim 4.0 and later).
#          If you have a large corpus (>1 billion tokens) you can go up to 100-300 dimensions.
#          Word vectors with more dimensions generally give better results, provided there is enough data to train them.
# 4. window: This is the window of context words you are training on. In other words, how many words before and after your given word are used.
#          A common starting point is 5 (gensim's default), but this can vary depending on what you are interested in. For instance, smaller windows
#          tend to capture how a word functions (which words are interchangeable), while larger windows tend to capture broader topical relatedness.
# 5. alpha: The learning rate of your model. If you are interested in machine learning experiments with your vectors, you may experiment with this parameter.
# 6. seed (int): This is the random seed for your random initialization. Neural models initialize their weights with random floats before training.
#          This is a useful field if you want to replicate your experiments, because a fixed seed makes the 'random' initialization deterministic.
# 7. min_count: This is the minimum frequency threshold (default 5). A word that appears fewer times than this is ignored.
#          This is here because vectors for very rare words are hard to train.
# 8. iter: This is the number of iterations (entire runs) over the corpus, also known as epochs (renamed `epochs` in gensim 4.0 and later).
#          Usually anything between 1 and 10 is ok. The trade-off is that more iterations take longer to train and the model may overfit your dataset.
#          However, longer training can help your vectors perform better on tasks relevant to your dataset.

# Overall, most of these settings will not concern you unless you are interested in very specific usages of word vectors.

model_example = gensim.models.Word2Vec(sentences, min_count=1, size=100)
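To make the `window` and `sg` parameters more concrete, here is a minimal sketch of how training examples could be derived from a single tokenized sentence. The helpers `skipgram_pairs` and `cbow_examples` are hypothetical names, not part of gensim, and the sketch leaves out details of real implementations such as subsampling of frequent words and random shrinking of the window.

```python
def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs: skip-gram predicts each context word from the center word."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


def cbow_examples(tokens, window):
    """Yield (context_words, center): CBOW predicts the center word from its surrounding words."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            yield context, center


# Toy sentence, invented for illustration only.
toy_sentence = ['call', 'me', 'ishmael', 'today']
sg_pairs = list(skipgram_pairs(toy_sentence, window=1))
cbow = list(cbow_examples(toy_sentence, window=1))
print(sg_pairs)
print(cbow)
```

With `window=1`, each word is paired only with its immediate neighbours; a larger window produces many more pairs per sentence, which is one reason training slows down as the window grows.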


In [10]:
#Another way of training word2vec vectors with gensim is to stream the corpus from a text file with the LineSentence class
linesentence_example = gensim.models.word2vec.LineSentence('text8') #provide the name of the corpus text you want to train on
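As a rough picture of what such a streaming corpus does, the sketch below reads a file with one sentence per line and yields each line as a list of whitespace-separated tokens. `SimpleLineSentence` is a hypothetical stand-in written for this illustration, not gensim's implementation; gensim's class additionally splits very long lines into chunks (its `max_sentence_length` argument).

```python
import os
import tempfile


class SimpleLineSentence:
    """Hypothetical stand-in: yields one whitespace-tokenized sentence per line of a file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-opening the file on every iteration makes the corpus restartable,
        # which matters because training passes over the data several times.
        with open(self.path, encoding='utf-8') as handle:
            for line in handle:
                tokens = line.split()
                if tokens:
                    yield tokens


with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_corpus.txt')
    with open(toy_path, 'w', encoding='utf-8') as handle:
        handle.write('call me ishmael\nsome years ago\n')
    toy_corpus = SimpleLineSentence(toy_path)
    first_pass = list(toy_corpus)
    second_pass = list(toy_corpus)  # a second pass works because __iter__ reopens the file

print(first_pass)
```

The point of the design is memory: sentences are read from disk one at a time instead of being held in a list.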

In [11]:
model = gensim.models.Word2Vec(linesentence_example, min_count=1, size=100)
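The effect of `min_count` on the vocabulary can be illustrated without gensim. The sketch below counts tokens and keeps only those that meet the threshold; `filter_vocab` is a hypothetical helper and the sentences are invented toy data.

```python
from collections import Counter


def filter_vocab(sentences, min_count):
    """Return {token: frequency} for tokens that occur at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}


toy_sentences = [
    ['the', 'whale', 'swam'],
    ['the', 'whale', 'dived'],
    ['the', 'ship', 'sank'],
]
vocab_all = filter_vocab(toy_sentences, min_count=1)
vocab_frequent = filter_vocab(toy_sentences, min_count=2)
print(sorted(vocab_all))
print(sorted(vocab_frequent))
```

With `min_count=1` every token is kept, including words that occur only once, so the vocabulary can become very large.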

In [12]:
model.wv.vocab

{'anarchism': <gensim.models.keyedvectors.Vocab at 0x161997048>,
 'originated': <gensim.models.keyedvectors.Vocab at 0x161997080>,
 'as': <gensim.models.keyedvectors.Vocab at 0x1619970b8>,
 'a': <gensim.models.keyedvectors.Vocab at 0x1619970f0>,
 'term': <gensim.models.keyedvectors.Vocab at 0x161997128>,
 'of': <gensim.models.keyedvectors.Vocab at 0x161997160>,
 'abuse': <gensim.models.keyedvectors.Vocab at 0x161997198>,
 'first': <gensim.models.keyedvectors.Vocab at 0x1619971d0>,
 'used': <gensim.models.keyedvectors.Vocab at 0x161997208>,
 'against': <gensim.models.keyedvectors.Vocab at 0x161997240>,
 'early': <gensim.models.keyedvectors.Vocab at 0x161997278>,
 'working': <gensim.models.keyedvectors.Vocab at 0x1619972b0>,
 'class': <gensim.models.keyedvectors.Vocab at 0x1619972e8>,
 'radicals': <gensim.models.keyedvectors.Vocab at 0x161997320>,
 'including': <gensim.models.keyedvectors.Vocab at 0x161997358>,
 'the': <gensim.models.keyedvectors.Vocab at 0x161997390>,
 'diggers': <gensi

In [13]:
# You can save your trained model to disk
model.save('/tmp/our_model')

In [14]:
#Then you can reload your trained model
our_model = gensim.models.Word2Vec.load('/tmp/our_model') 

In [15]:
model.wv['is']

array([-7.4309945e-01, -1.1010030e-01,  3.0996794e-01,  1.0076294e+00,
        2.7159126e+00,  2.5857800e-01, -6.3871849e-01,  1.2006485e+00,
       -3.8759594e+00,  9.6304402e-02, -3.0266291e-01,  1.9082183e+00,
        4.0493245e+00, -1.7267005e+00, -3.2712541e+00, -1.8884847e+00,
       -1.4533138e+00,  2.6137164e+00,  1.0045251e-01,  6.1564702e-01,
        1.0300859e+00,  8.2182145e-01,  1.3351685e+00, -1.6277651e+00,
       -1.2571299e-01, -4.8283296e+00, -3.6199135e-01, -1.2635465e+00,
        3.5480592e-02, -5.8880095e+00, -1.1384472e+00, -2.6918564e+00,
        1.9459590e+00,  7.6544100e-01, -1.7917299e-01,  1.7154934e+00,
        1.1321136e+00, -2.3434646e-01, -2.9924850e+00, -1.9023378e-01,
        1.9284687e+00, -2.2516952e+00,  2.1658063e+00, -1.2017798e+00,
       -4.1064134e-01, -1.3657644e+00,  2.9619279e+00,  8.2029527e-01,
       -1.7149898e+00, -1.8886505e+00,  1.3428019e+00, -1.4975270e+00,
       -3.3876422e-01, -1.2149334e+00,  9.3175536e-01, -5.6950706e-01,
      

In [16]:
model.wv['you']

array([-2.8520675 , -2.2473452 ,  2.1454644 ,  1.5337406 ,  1.0659196 ,
        3.997966  ,  1.5338069 ,  1.5298592 , -3.5749621 , -1.933821  ,
        3.38861   , -3.7548635 ,  2.2442281 ,  2.9169722 ,  2.3960907 ,
        1.434006  ,  1.2322818 , -2.1850078 , -3.0352428 ,  0.2722496 ,
        0.04412484,  0.96946913, -1.9357928 ,  2.0264134 ,  0.16475327,
       -3.4373128 , -0.56125987, -0.06174254, -2.1029367 , -1.8574634 ,
       -0.59033763, -0.08494068, -0.36253834,  0.13625075,  3.5136797 ,
       -1.2064016 , -1.4663715 , -0.36017847,  1.8220122 , -1.9693489 ,
        1.8932991 , -1.6807297 ,  2.9503462 , -1.4895293 ,  1.2889229 ,
       -0.7056401 , -1.011475  ,  3.4393313 , -3.5616174 , -4.035255  ,
       -1.4711683 ,  0.13222745, -1.0308939 , -0.20575708,  0.8291327 ,
        0.6665716 , -1.4794097 , -2.5726752 ,  2.6220715 ,  4.0666747 ,
        1.7878091 ,  1.3967648 ,  1.4467835 , -2.2753108 ,  1.830323  ,
       -1.9258434 ,  3.007969  , -0.43829307,  2.7473843 , -1.77

In [17]:
type(model.wv)

gensim.models.keyedvectors.Word2VecKeyedVectors

In [18]:
my_model = our_model.wv #save just the vectors from your model

In [19]:
del our_model #this is to save RAM

In [20]:
print(type(my_model))

<class 'gensim.models.keyedvectors.Word2VecKeyedVectors'>


In [21]:
len(my_model.vocab) #the number of words in our model

253854

In [22]:
#If you are interested in using pretrained vectors, you can load them through gensim's downloader API
import gensim.downloader as pretrained

In [23]:
#all corpora available are here.
#You can use this if you want to train from the given corpora.
pretrained.info()['corpora'].keys()

dict_keys(['semeval-2016-2017-task3-subtaskBC', 'semeval-2016-2017-task3-subtaskA-unannotated', 'patent-2017', 'quora-duplicate-questions', 'wiki-english-20171001', 'text8', 'fake-news', '20-newsgroups', '__testing_matrix-synopsis', '__testing_multipart-matrix-synopsis'])

In [24]:
#all pretrained models available are here.
#You can use this if you want to use the trained models.
pretrained.info()['models'].keys()

dict_keys(['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis'])

In [25]:
#We will work with the word2vec model trained on Google News below.
#First, let's look at the description of the corpus named 'text8'
pretrained.info('text8')

{'num_records': 1701,
 'record_format': 'list of str (tokens)',
 'file_size': 33182058,
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py',
 'license': 'not found',
 'description': 'First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.',
 'checksum': '68799af40b6bda07dfa47a32612e5364',
 'file_name': 'text8.gz',
 'read_more': ['http://mattmahoney.net/dc/textdata.html'],
 'parts': 1}

In [26]:
news_model = pretrained.load('word2vec-google-news-300')

In [27]:
my_model = news_model.wv
del news_model



In [28]:
my_model['test']

array([-1.42578125e-01, -3.68652344e-02,  1.35742188e-01, -6.20117188e-02,
        7.95898438e-02,  1.90429688e-02, -8.15429688e-02, -1.27929688e-01,
       -2.95410156e-02,  2.36328125e-01, -1.21582031e-01, -2.14843750e-01,
        1.29882812e-01, -2.70996094e-02, -5.20019531e-02,  2.15820312e-01,
       -1.81640625e-01,  5.10253906e-02, -1.60156250e-01, -1.76757812e-01,
        1.83105469e-02, -4.12597656e-02, -2.32421875e-01, -1.03149414e-02,
        1.45507812e-01,  5.24902344e-02, -3.96484375e-01, -1.92871094e-02,
        2.51770020e-03, -1.26953125e-02, -4.39453125e-02,  3.07617188e-02,
        9.57031250e-02, -1.75781250e-01,  1.04370117e-02,  1.89453125e-01,
       -2.36328125e-01,  4.37011719e-02,  2.81250000e-01, -2.07519531e-02,
       -1.81640625e-01, -2.17773438e-01,  2.33398438e-01,  5.29785156e-02,
       -1.13769531e-01,  9.39941406e-03, -1.49414062e-01,  1.99218750e-01,
       -1.75781250e-01,  3.16406250e-01,  8.10546875e-02, -6.12792969e-02,
       -1.52343750e-01, -

In [None]:
#similarity tasks

In [29]:
my_model.similarity('beautiful','sublime') #Using Cosine-similarity


0.44833773461372856

In [30]:
#What do you think will be the similarity measure between 'sublime' and 'sublime'?
my_model.similarity('sublime','sublime')

0.9999999999999999

In [31]:
#Another way of doing the same thing would be
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(my_model['beautiful'].reshape(1,-1), my_model['sublime'].reshape(1,-1))

array([[0.44833776]], dtype=float32)
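Cosine similarity itself is simple enough to write by hand: it is the dot product of two vectors divided by the product of their lengths. The sketch below uses invented toy vectors and the standard library only; `cosine` is a name chosen for this illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]  # same direction as u, twice as long
w = [0.0, 0.0, 3.0]  # orthogonal to u

print(cosine(u, v))  # very close to 1: direction matters, length does not
print(cosine(u, w))
```

Because the computation involves square roots and division in floating point, a vector compared with itself can come out marginally below 1, which is why the cell above prints 0.9999999999999999 for 'sublime' and 'sublime'.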

In [32]:
#We can see that words used in similar contexts appear closer to each other than words that are not!
print(my_model.similarity('potato', 'leek')) 
print(my_model.similarity('anger', 'potato'))

0.47892690592010023
0.0155280012408357


In [33]:
#There is also a built-in tool for returning a list of the words most similar to a given word.
my_model.most_similar('democracy'), my_model.most_similar('liberalism')

([('democratic', 0.864448070526123),
  ('participatory_democracy', 0.7170747518539429),
  ('democracies', 0.710315465927124),
  ('democratization', 0.7038753628730774),
  ('pluralism', 0.6955404281616211),
  ('multiparty_democracy', 0.6909584403038025),
  ('democractic', 0.6702337265014648),
  ('democratic_ideals', 0.6659173369407654),
  ('pluralist_democracy', 0.6640663743019104),
  ('constitutionalism', 0.6593047380447388)],
 [('conservatism', 0.8018674850463867),
  ('progressivism', 0.7581014633178711),
  ('leftism', 0.7086032629013062),
  ('libertarianism', 0.7042546272277832),
  ('statism', 0.693485677242279),
  ('liberal', 0.6907880902290344),
  ('liberal_internationalism', 0.6821509003639221),
  ('Liberalism', 0.6777896285057068),
  ('classical_liberalism', 0.6775410771369934),
  ('Conservatism', 0.677297830581665)])

In [34]:
my_model.most_similar('pluralism', topn=20)

[('pluralistic', 0.7280019521713257),
 ('pluralist', 0.7193902730941772),
 ('democracy', 0.6955404281616211),
 ('pluralist_democracy', 0.6928413510322571),
 ('democratic', 0.6840701699256897),
 ('religious_pluralism', 0.6559314131736755),
 ('pluralist_society', 0.6442460417747498),
 ('pluralism_tolerance', 0.631464421749115),
 ('democratization', 0.6295616030693054),
 ('pluralistic_democracy', 0.6286231279373169),
 ('secularism', 0.6188281774520874),
 ('multireligious_society', 0.6062011122703552),
 ('pluralistic_democratic', 0.603635311126709),
 ('religious_tolerance', 0.6026619076728821),
 ('democratic_polity', 0.5967641472816467),
 ('democratic_ideals', 0.5910589098930359),
 ('pluralistic_society', 0.5879446864128113),
 ('religious_toleration', 0.5857032537460327),
 ('democratic_freedoms', 0.5824508666992188),
 ('illiberal_democracy', 0.5818257331848145)]
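Conceptually, a nearest-neighbour query of this kind scores every other word in the vocabulary by cosine similarity and returns the top `topn`. The following is a minimal sketch under that assumption, with invented 3-dimensional toy vectors; `nearest` is a hypothetical helper, and gensim's own implementation is vectorized and much faster.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word` and keep the best topn."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented toy vectors, for illustration only.
toy_vectors = {
    'democracy': [0.9, 0.1, 0.0],
    'democratic': [0.8, 0.2, 0.0],
    'pluralism': [0.6, 0.4, 0.1],
    'potato': [0.0, 0.1, 0.9],
}
neighbours = nearest('democracy', toy_vectors, topn=2)
for neighbour, score in neighbours:
    print(neighbour, round(score, 3))
```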

In [35]:
#You can also identify the word with the most similar vector from a list of candidates
candidates = ['sweet','sour','bitter','nice']
my_model.most_similar_to_given('blueberry', candidates)

'sweet'

In [36]:
#You can see that, of the candidates, 'sweet' has the highest similarity measure with 'blueberry', at least in our corpus.
for c in candidates:
    print(c, my_model.similarity('blueberry',c))

sweet 0.3033926176016938
sour 0.20343720426644013
bitter 0.18500672031976945
nice 0.07704492863613943


In [37]:
#You can generate a list of words that are closer to 'cold' than 'dry' is
my_model.words_closer_than('cold','dry')

['winter',
 'warm',
 'temperatures',
 'wet',
 'freezing',
 'warmer',
 'icy',
 'chill',
 'chilly',
 'Cold',
 'windy',
 'colder',
 'snowy',
 'chilled',
 'frigid',
 'humid',
 'coldest',
 'freezing_temperatures',
 'cold_snap',
 'foggy',
 'frosty',
 'shivering',
 'arctic',
 'balmy',
 'wintry',
 'bitterly_cold',
 'frigid_temperatures',
 'toasty',
 'hot_humid',
 'unseasonably_warm',
 'colder_temperatures',
 'bone_chilling',
 'chilly_temperatures',
 'cold_winters',
 'subzero_temperatures',
 'unseasonably_cold',
 'wintery',
 'COLD',
 'unusually_warm',
 'coldest_winter',
 'sub_freezing_temperatures',
 'bone_chilling_cold',
 'Bitter_cold',
 'subzero',
 'arctic_blast',
 'unseasonably_hot',
 'warm_humid',
 'frigid_weather',
 'bone_chilling_temperatures',
 'teeth_chattering',
 'chilliest',
 'toasty_warm',
 'subfreezing',
 'cooler_temps',
 'Sweltering',
 'unseasonable_cold',
 'freezing_temps',
 'frost_bitten',
 'subzero_weather',
 'Brrrr',
 'unseasonably_chilly',
 'warmish',
 'frigid_temps',
 'biting

You can also play with analogy tasks. The commonly seen task is:

'London is to England as Baghdad is to ____?'


'A is to A\* as B is to B\*'

Gensim provides two different ways of implementing this task. You may be more familiar with the additive version, also called the 3CosAdd method:

$$\underset{b*\in V}{\textrm{arg max}} (cos(b*,b) - cos(b*,a) + cos(b*,a*))$$

This reflects the abstraction of Baghdad - London + England, with a = London, a\* = England and b = Baghdad. In this maximization, we search the vocabulary for the word vector b\* that produces the highest value of this expression.
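The 3CosAdd score can be computed by hand on a tiny vocabulary. The sketch below uses invented 3-dimensional toy vectors (not taken from any trained model), and `three_cos_add` is a hypothetical helper that, like most implementations, skips the three query words when searching for the best candidate.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


# Invented toy vectors; the axes loosely stand for "UK", "Iraq" and "capital city".
toy_vectors = {
    'london': [1.0, 0.0, 1.0],
    'england': [1.0, 0.0, 0.0],
    'baghdad': [0.0, 1.0, 1.0],
    'iraq': [0.0, 1.0, 0.0],
    'france': [0.4, 0.3, 0.0],
    'paris': [0.4, 0.3, 1.0],
}


def three_cos_add(a, a_star, b, vectors):
    """Score each candidate x with cos(x, b) - cos(x, a) + cos(x, a*)."""
    scores = {}
    for word, vec in vectors.items():
        if word in (a, a_star, b):
            continue  # the query words themselves are not valid answers
        scores[word] = (cosine(vec, vectors[b])
                        - cosine(vec, vectors[a])
                        + cosine(vec, vectors[a_star]))
    return max(scores, key=scores.get), scores


best_add, add_scores = three_cos_add('london', 'england', 'baghdad', toy_vectors)
print(best_add)
for word, score in sorted(add_scores.items(), key=lambda item: -item[1]):
    print(word, round(score, 3))
```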

The second is a more balanced approach proposed by Levy & Goldberg 2014 (http://www.aclweb.org/anthology/W14-1618)

We find B* by going through all of the possible B* in the set of vocabulary (V) and identifying which returns the highest value. In other words, finding the argument that maximizes the following equation where the epsilon is added only to avoid division by zero. This is also called the 3CosMul method:

$$\underset{b*\in V}{\textrm{arg max}} \frac{cos(b*,b)cos(b*,a*)}{cos(b*,a)+\epsilon}$$
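The multiplicative score can be sketched in the same way on invented toy vectors. One assumption is worth flagging: this sketch first shifts every cosine from [-1, 1] to [0, 1] via (1 + cos) / 2 so that products and ratios stay non-negative, a transformation Levy and Goldberg suggest; the helper names `shifted_cosine` and `three_cos_mul` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def shifted_cosine(u, v):
    """Map the cosine from [-1, 1] to [0, 1] (an assumption of this sketch)."""
    return (1.0 + cosine(u, v)) / 2.0


# Invented toy vectors; the axes loosely stand for "UK", "Iraq" and "capital city".
toy_vectors = {
    'london': [1.0, 0.0, 1.0],
    'england': [1.0, 0.0, 0.0],
    'baghdad': [0.0, 1.0, 1.0],
    'iraq': [0.0, 1.0, 0.0],
    'france': [0.4, 0.3, 0.0],
    'paris': [0.4, 0.3, 1.0],
}


def three_cos_mul(a, a_star, b, vectors, epsilon=1e-6):
    """Score each candidate x with cos(x, b) * cos(x, a*) / (cos(x, a) + epsilon)."""
    scores = {}
    for word, vec in vectors.items():
        if word in (a, a_star, b):
            continue  # the query words themselves are not valid answers
        numerator = shifted_cosine(vec, vectors[b]) * shifted_cosine(vec, vectors[a_star])
        denominator = shifted_cosine(vec, vectors[a]) + epsilon
        scores[word] = numerator / denominator
    return max(scores, key=scores.get), scores


best_mul, mul_scores = three_cos_mul('london', 'england', 'baghdad', toy_vectors)
print(best_mul)
for word, score in sorted(mul_scores.items(), key=lambda item: -item[1]):
    print(word, round(score, 3))
```

Because the terms are multiplied rather than added, a single large similarity cannot dominate the score as easily, which is the sense in which this method is more balanced.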



In [38]:
#We can implement the analogy task with the provided functions.
#`positive` lists the words that contribute positively to the score (added in 3CosAdd, the numerator in 3CosMul);
#`negative` lists the words that contribute negatively (subtracted in 3CosAdd, the denominator in 3CosMul).
# This is the addition method.
my_model.most_similar(positive=['woman','king'], negative=['man'])

[('queen', 0.7118192911148071),
 ('monarch', 0.6189674735069275),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.549946129322052),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411999702454)]

In [39]:
#This is the multiplication method.
my_model.most_similar_cosmul(positive=['england','baghdad'], negative=['london'])
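#Unfortunately, in this example the top result is 'afganistan' (a misspelling of Afghanistan), even though Baghdad is the capital of Iraq.
#This is an example of how the corpus can bias our findings. The vocabulary is also case-sensitive, so the lowercase query words may play a role.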

[('afganistan', 0.8269765973091125),
 ('afghanistan', 0.8165150284767151),
 ('iraqis', 0.8074595332145691),
 ('iraqi', 0.7976524829864502),
 ('taliban', 0.7952468395233154),
 ('al_queda', 0.7876250147819519),
 ('iraq', 0.7850340008735657),
 ('gadhafi', 0.7798774838447571),
 ('sri_lanka', 0.778986394405365),
 ('al_qaida', 0.7776364088058472)]

What are good vectors? What are bad vectors? How much training/data do we need?

Question! What would happen if you retrained your model on the same corpus? Would you get the same vectors?
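One way to approach this question is to look at initialization, since the weights start out as random numbers. The sketch below is only an illustration: `init_vectors` is a hypothetical helper, and the scaling by the vector size follows the original word2vec C code rather than any particular gensim internals.

```python
import random


def init_vectors(vocab, size, seed):
    """Give every word a small random starting vector, reproducibly for a given seed."""
    rng = random.Random(seed)
    return {word: [(rng.random() - 0.5) / size for _ in range(size)] for word in vocab}


toy_vocab = ['whale', 'ship', 'sea']
run_a = init_vectors(toy_vocab, size=4, seed=1)
run_b = init_vectors(toy_vocab, size=4, seed=1)
run_c = init_vectors(toy_vocab, size=4, seed=2)

print(run_a == run_b)  # same seed, same starting point
print(run_a == run_c)  # different seed, different starting point
```

In gensim, fixing `seed` alone is usually not enough to obtain identical vectors: the documentation also asks for a single worker thread (`workers=1`) and, in Python 3, a fixed `PYTHONHASHSEED` environment variable.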

In [None]:
# see documentation here for more built-in tools! https://radimrehurek.com/gensim/models/keyedvectors.html



In [None]:
#What about GloVe? You can use GloVe vectors with Gensim too!
#Overview of GloVe. What's the difference? How to use GloVe vectors with Gensim

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors
glove_file = "./glove.6B/glove.6B.300d.txt"
glove2word2vec_file = "glove2word2vec.txt"
glove2word2vec(glove_file, glove2word2vec_file) #we simply call this function to reformat it a bit
glove_model = KeyedVectors.load_word2vec_format(glove2word2vec_file, binary=False) #read in the same file 
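The reformatting step is small: the word2vec text format begins with a header line giving the vocabulary size and the vector dimensionality, which GloVe text files lack. The sketch below shows the idea on a tiny invented file; `add_word2vec_header` is a hypothetical helper, not gensim's code, and it reads the whole file into memory, which would not suit a full GloVe download. In gensim 4.0 and later, `load_word2vec_format` also accepts `no_header=True`, which should make the conversion unnecessary.

```python
import os
import tempfile


def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file, prepending the '<vocab_size> <dimensions>' header line."""
    with open(glove_path, encoding='utf-8') as source:
        lines = [line for line in source if line.strip()]
    dimensions = len(lines[0].split()) - 1  # first field is the word itself
    with open(output_path, 'w', encoding='utf-8') as target:
        target.write(f'{len(lines)} {dimensions}\n')
        target.writelines(lines)
    return len(lines), dimensions


with tempfile.TemporaryDirectory() as tmp_dir:
    toy_glove = os.path.join(tmp_dir, 'toy_glove.txt')
    toy_w2v = os.path.join(tmp_dir, 'toy_w2v.txt')
    with open(toy_glove, 'w', encoding='utf-8') as handle:
        handle.write('the 0.1 0.2 0.3\nwhale 0.4 0.5 0.6\n')
    shape = add_word2vec_header(toy_glove, toy_w2v)
    with open(toy_w2v, encoding='utf-8') as handle:
        converted = handle.read().splitlines()

print(converted[0])
```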

In [None]:
glove_model['test']

In [None]:
#Let's see what happens if we run the same task on this new set of vectors
glove_model.most_similar_cosmul(positive=['england','baghdad'], negative=['london'])

In [None]:
#PCA visualizations

In [None]:
import numpy as np
from sklearn.decomposition import PCA




In [None]:
countries = ["china", "russia", "france", "germany","greece","japan","italy"]

In [None]:
capitals = ["beijing","moscow","paris","berlin","athens","tokyo","rome"]

In [None]:
X = []

for loc in countries+capitals:
    X.append(glove_model[loc])

In [None]:
pca = PCA(n_components=2)
xy_coords = pca.fit_transform(X)
loc_x, loc_y = zip(*xy_coords)
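For readers curious about what the projection does, PCA can be sketched with NumPy alone: center the data, take a singular value decomposition, and project onto the first two right singular vectors. The function `pca_2d` and the toy matrix are assumptions made for this illustration; scikit-learn's result may differ in the sign of each axis.

```python
import numpy as np


def pca_2d(rows):
    """Project rows onto their first two principal components via SVD."""
    data = np.asarray(rows, dtype=float)
    centered = data - data.mean(axis=0)  # PCA works on mean-centered data
    _, _, components = np.linalg.svd(centered, full_matrices=False)
    return centered @ components[:2].T  # coordinates along the top two directions


# Invented toy data: four points in three dimensions.
toy_X = [
    [2.0, 0.0, 1.0],
    [0.0, 2.0, 1.0],
    [3.0, 1.0, 0.0],
    [1.0, 3.0, 0.0],
]
toy_coords = pca_2d(toy_X)
print(toy_coords.shape)
print(toy_coords.round(3))
```

The first column always carries at least as much variance as the second, which is why the horizontal axis of such plots tends to show the most prominent contrast in the data.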

In [None]:
loc_x

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots()
ax.scatter(loc_x, loc_y)

for i, location in enumerate(countries+capitals):
    ax.annotate(location, (loc_x[i], loc_y[i]))

plt.title("Countries and their Capitals")
plt.show()