2. Word Vectors via Word2Vec (50 mins)

This unit focuses on Word2Vec as an example of neural network-based approaches to vector encoding. It starts with a conceptual overview of the algorithm and ends with an activity in which participants train their own vectors.

● 0:00 - 0:15 Conceptual explanation of Word2Vec

● 0:15 - 0:30 Word2Vec Visualization and Vectorial Features and Math

● 0:30 - 0:50 [Activity 2] Word2Vec Construction [using Gensim] and Visualization (from part 1) [We provide corpus]

https://arxiv.org/pdf/1310.4546.pdf
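
As a rough illustration for the conceptual overview: the skip-gram variant described in the paper linked above learns from (center word, context word) pairs drawn from a sliding window over each sentence. The sketch below only generates those pairs. The function name `skipgram_pairs` and the fixed `window` are illustrative assumptions, not Gensim's API, and the training step itself is left out.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context positions: up to `window` tokens on each side of the center.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

example_pairs = skipgram_pairs(['call', 'me', 'ishmael'], window=1)
print(example_pairs)
```

During training, the model adjusts the vectors so that each center word becomes good at predicting its context words, which is why words used in similar contexts end up with similar vectors.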

In [1]:
! pip install gensim

distributed 1.21.8 requires msgpack, which is not installed.


In [2]:
import gensim
from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer

In [3]:
### reimporting and reloading materials from part 1
from nltk.corpus import gutenberg

In [4]:
mobydick = gutenberg.raw('melville-moby_dick.txt')
emma = gutenberg.raw('austen-emma.txt')
parents = gutenberg.raw('edgeworth-parents.txt')



In [6]:
corpus = [mobydick, emma, parents]

In [9]:
#Let's split our corpus into sentences. Example of doing this on mobydick.
sentences = sent_tokenize(corpus[0])
sentences

['[Moby Dick by Herman Melville 1851]\r\n\r\n\r\nETYMOLOGY.',
 '(Supplied by a Late Consumptive Usher to a Grammar School)\r\n\r\nThe pale Usher--threadbare in coat, heart, body, and brain; I see him\r\nnow.',
 'He was ever dusting his old lexicons and grammars, with a queer\r\nhandkerchief, mockingly embellished with all the gay flags of all the\r\nknown nations of the world.',
 'He loved to dust his old grammars; it\r\nsomehow mildly reminded him of his mortality.',
 '"While you take in hand to school others, and to teach them by what\r\nname a whale-fish is to be called in our tongue leaving out, through\r\nignorance, the letter H, which almost alone maketh the signification\r\nof the word, you deliver that which is not true."',
 '--HACKLUYT\r\n\r\n"WHALE.',
 '... Sw. and Dan.',
 'HVAL.',
 'This animal is named from roundness\r\nor rolling; for in Dan.',
 'HVALT is arched or vaulted."',
 '--WEBSTER\'S\r\nDICTIONARY\r\n\r\n"WHALE.',
 '...',
 'It is more immediately from the Dut.',
 '

In [10]:
tokenizer = TreebankWordTokenizer()

In [11]:
# Takes a list of raw texts and converts them into the input format for gensim word2vec
# (lower-casing, sentence tokenization, word tokenization), i.e. a list of token lists:
# sentences = [['hi', 'there'], ['this', 'is', 'a', 'sentence']]

def makeSentences(list_txt):
  all_txt = []
  for txt in list_txt:
    lower_txt = txt.lower()
    sentences = sent_tokenize(lower_txt)
    sentences = [tokenizer.tokenize(sent) for sent in sentences]
    all_txt += sentences
    print(len(sentences))# let's check how many sentences there are per item
  return all_txt

In [12]:
sentences = makeSentences(corpus)

9822
7489
10054


In [13]:
model = gensim.models.Word2Vec(sentences, min_count=1, size=100)
#Talk about parameters here: what is the min_count? size?
#Note: in gensim 4.0 and later, `size` is named `vector_size`.
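
As a starting point for that discussion: `size` sets how many dimensions each word vector has, and `min_count` drops every word whose total frequency in the corpus falls below the threshold. The sketch below imitates that vocabulary filtering in plain Python. It is not Gensim's own implementation, and `filter_vocab` and the toy sentences are hypothetical.

```python
from collections import Counter

def filter_vocab(tokenized_sentences, min_count=1):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(word for sent in tokenized_sentences for word in sent)
    return {word for word, n in counts.items() if n >= min_count}

# Made-up corpus in the same list-of-token-lists format as `sentences`.
toy_sentences = [['the', 'whale', 'swam'], ['the', 'whale', 'dived'], ['the', 'ship']]
vocab_all = filter_vocab(toy_sentences, min_count=1)
vocab_common = filter_vocab(toy_sentences, min_count=2)
print(sorted(vocab_all))
print(sorted(vocab_common))
```

With `min_count=1`, as in the cell above, every word gets a vector, including words that appear only once and therefore have very little context to learn from.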

In [15]:
model.save('/tmp/literature_model')

In [18]:
our_model = gensim.models.Word2Vec.load('/tmp/literature_model')

In [19]:
model.wv['is']

array([ 0.6028927 , -0.7777762 ,  0.18949626, -0.629372  ,  0.05698225,
        1.0007701 , -0.8298375 ,  1.3167062 , -0.01267223, -0.7367658 ,
       -0.07186026, -1.2743895 , -2.3586578 ,  0.06197103,  0.5547771 ,
        0.15549038,  0.4583095 , -0.07889899, -0.1209641 ,  0.9516043 ,
       -0.0771857 , -0.5141985 ,  0.08315182,  0.84877944,  1.5110035 ,
       -0.9500215 , -1.2991146 , -0.31448227, -0.49781078,  1.2881632 ,
        1.774961  ,  0.28109646, -0.34057522, -0.5788013 ,  0.93758994,
       -0.44558266,  1.3082162 ,  0.02055817,  0.13914266,  0.1281044 ,
        0.9656632 ,  1.875368  , -0.57465404,  1.6738157 , -0.9099401 ,
        0.9441285 , -1.1578401 ,  0.45846286,  0.41242424, -0.9207564 ,
        1.6598226 ,  0.30932137, -0.80886215, -1.3936989 ,  0.09418197,
        0.17340904,  0.4744462 , -1.9613353 ,  0.01563737, -0.37881997,
        0.41141257, -0.5703081 ,  1.6024007 ,  0.58433425,  0.5611721 ,
       -0.36372542,  0.3625475 , -0.67996097,  0.5167259 ,  1.21

In [20]:
model.wv['you']

array([-4.6664506e-01, -6.1134720e-04, -2.2705736e+00, -9.6369076e-01,
        5.5901313e-01,  1.6279762e+00, -4.2503488e-01,  2.3781908e+00,
       -1.0107208e+00,  9.0882623e-01,  1.9658873e+00, -1.1651001e+00,
       -5.5454576e-01,  1.0288202e+00, -7.0896173e-01,  1.7924075e+00,
        2.0090660e-02, -9.6196002e-01,  2.5438491e-02, -2.7405298e-01,
        4.5699549e-01, -1.2269855e+00,  8.3364069e-01, -3.6479169e-01,
        2.8435841e-01, -4.8917735e-01,  9.7392404e-01,  6.6269588e-01,
       -1.6208788e+00, -5.4389322e-01,  3.0305231e-01, -6.1421186e-01,
        1.5095739e-01,  6.3901871e-01,  2.3028688e-01,  2.2322680e-01,
        7.7685434e-01, -2.8923240e-01, -9.3599886e-01, -3.1400007e-01,
        1.3671759e-01,  4.4095689e-01,  2.3665874e+00,  1.3656111e+00,
       -1.7311021e+00,  1.1489569e+00, -1.1607057e+00,  7.2174460e-01,
       -6.2450558e-02, -4.8682761e-01,  2.3278627e+00,  2.9252622e-01,
        1.4020764e+00,  1.0131857e+00,  1.0370536e+00,  6.2263483e-01,
      

In [21]:
type(model.wv)

gensim.models.keyedvectors.Word2VecKeyedVectors

In [22]:
my_model = model.wv

In [23]:
del model #this is to save RAM

In [27]:
print(type(my_model))

<class 'gensim.models.keyedvectors.Word2VecKeyedVectors'>


In [25]:
len(my_model.vocab) #the number of words in our model (gensim 4.0 and later: len(my_model.key_to_index))

24387

In [26]:
#similarity tasks

In [28]:
my_model.similarity('woman','man')

0.916291764312068

In [29]:
my_model.similarity('woman', 'dance')

0.5285298301141894
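
`similarity` returns the cosine similarity of the two word vectors: their dot product divided by the product of their lengths. A minimal pure-Python sketch, using made-up 3-dimensional vectors rather than vectors from the trained model (the helper name `cosine_similarity` is illustrative):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a, twice as long
vec_c = [-2.0, 1.0, 0.0]  # perpendicular to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Scores lie between -1 and 1. Vector length is ignored, so only the direction of the two vectors matters.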

In [30]:
# see documentation here for more built-in tools! https://radimrehurek.com/gensim/models/keyedvectors.html



In [31]:
my_model.most_similar(positive=['love','happy'], negative=['murder'])

[('too', 0.8632845878601074),
 ('enough', 0.8624353408813477),
 ('quite', 0.8545994758605957),
 ('yet', 0.8403531312942505),
 ('because', 0.8398067951202393),
 ('though', 0.8375355005264282),
 ('possible', 0.8366643190383911),
 ('infidel', 0.8266156911849976),
 ('known', 0.8263053894042969),
 ('so', 0.8255939483642578)]

In [32]:
my_model.most_similar_cosmul(positive=['woman','king'], negative=['man'])

[('scuttle', 1.1406124830245972),
 ('advancing', 1.138426423072815),
 ('heaved', 1.137731909751892),
 ('hiding', 1.1376347541809082),
 ('tears', 1.1362838745117188),
 ('sullen', 1.135901689529419),
 ('eager', 1.1351274251937866),
 ('aside', 1.1343854665756226),
 ('lips', 1.133859634399414),
 ('hilarious', 1.1337435245513916)]
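
Both calls above search for the word whose vector best matches a combination of the positive and negative words. Below is a rough sketch of the additive idea behind `most_similar`: add the unit-length positive vectors, subtract the negative ones, then rank the remaining words by cosine similarity. The 2-dimensional vectors and the helper `analogy` are invented for illustration. `most_similar_cosmul` combines the similarities multiplicatively instead, which is not shown here.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def analogy(vectors, positive, negative):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    target = unit(target)
    exclude = set(positive) | set(negative)  # query words are never returned
    scores = [
        (word, sum(t * x for t, x in zip(target, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hand-made vectors: first axis ~ "royalty", second axis ~ "female".
toy_vectors = {
    'man': [1.0, 0.0],
    'woman': [1.0, 1.0],
    'king': [3.0, 0.5],
    'queen': [3.0, 3.0],
    'whale': [-1.0, 0.2],
}
ranked = analogy(toy_vectors, positive=['woman', 'king'], negative=['man'])
print(ranked[0][0])
```

The toy vectors are built so that 'queen' comes out on top. The output above shows that vectors trained on only three novels do not behave so neatly.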

Question! What would happen if you retrained your model on the same corpus? Would you get the same vectors?

TODO: Theory on different parameters of Word2Vec that you can tune!

TODO: What are good vectors? What are bad vectors? How much training/data do we need?

In [37]:
#We will work with pretrained vectors and uploaded corpora (from gensim) for the rest of this section
import gensim.downloader as pretrained

In [38]:
#all corpora available
pretrained.info()['corpora'].keys()

dict_keys(['semeval-2016-2017-task3-subtaskBC', 'semeval-2016-2017-task3-subtaskA-unannotated', 'patent-2017', 'quora-duplicate-questions', 'wiki-english-20171001', 'text8', 'fake-news', '20-newsgroups', '__testing_matrix-synopsis', '__testing_multipart-matrix-synopsis'])

In [39]:
#all pretrained models available
pretrained.info()['models'].keys()

dict_keys(['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis'])

In [42]:
#We will work with the word2vec model trained on Google News
#First, let's look at the description of the corpus named 'text8'
pretrained.info('text8')

{'num_records': 1701,
 'record_format': 'list of str (tokens)',
 'file_size': 33182058,
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py',
 'license': 'not found',
 'description': 'First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.',
 'checksum': '68799af40b6bda07dfa47a32612e5364',
 'file_name': 'text8.gz',
 'read_more': ['http://mattmahoney.net/dc/textdata.html'],
 'parts': 1}

In [43]:
news_model = pretrained.load('word2vec-google-news-300')



While this is downloading... (it should take just a few minutes)

In [44]:
del news_model