# 1. Exploring the Gutenberg corpus
Project Gutenberg (PG) is a volunteer effort to digitize and archive cultural works. Most of the items in its collection are full texts of public domain books.

In [1]:
from nltk.corpus import gutenberg
gutenberg.readme().replace('\n', ' ')

'Project Gutenberg Selections http://gutenberg.net/  This corpus contains etexts from from Project Gutenberg, by the following authors:  * Jane Austen (3) * William Blake (2) * Thornton W. Burgess * Sarah Cone Bryant * Lewis Carroll * G. K. Chesterton (3) * Maria Edgeworth * King James Bible * Herman Melville * John Milton * William Shakespeare (3) * Walt Whitman  The beginning of the body of each book could not be identified automatically, so the semi-generic header of each file has been removed, and included below. Some source files ended with a line "End of The Project Gutenberg Etext...", and this has been deleted.  Information about Project Gutenberg (one page)  We produce about two million dollars for each hour we work.  The fifty hours is one conservative estimate for how long it we take to get any etext selected, entered, proofread, edited, copyright searched and analyzed, the copyright letters written, etc.  This projected audience is one hundred million readers.  If our value

In [2]:
gutenberg.fileids()


['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [3]:
bible_kjv_sents = gutenberg.sents('bible-kjv.txt')
len(bible_kjv_sents)

30103

# 2. Implementing Word2Vec

In [4]:
# Keep only purely alphabetic tokens and lowercase them.
# str.isalpha() already rejects punctuation and digits, so no separate punctuation check is needed.
discard_punctuation_and_lowercased_sents = [[word.lower() for word in sent if word.isalpha()]
                                            for sent in bible_kjv_sents]
discard_punctuation_and_lowercased_sents[3]

['in',
 'the',
 'beginning',
 'god',
 'created',
 'the',
 'heaven',
 'and',
 'the',
 'earth']
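
The filtering step can be checked in isolation on a single tokenized sentence. The sketch below is only an illustration: the `raw` token list is an assumed example of what the tokenizer might return for this verse, and `clean_sentence` is a hypothetical helper name, not part of NLTK.

```python
from string import punctuation

def clean_sentence(tokens):
    """Lowercase the tokens and keep only purely alphabetic ones."""
    return [tok.lower() for tok in tokens if tok.isalpha()]

# Assumed raw tokens for the verse shown above (verse numbers and punctuation included).
raw = ['1', ':', '1', 'In', 'the', 'beginning', 'God', 'created',
       'the', 'heaven', 'and', 'the', 'earth', '.']
cleaned = clean_sentence(raw)

# No punctuation character is alphabetic, so isalpha() also drops
# multi-character tokens such as '--' that a membership test would miss.
punctuation_is_never_alpha = not any(ch.isalpha() for ch in punctuation)
dashes_removed = clean_sentence(['--', 'amen']) == ['amen']
```

Filtering on `isalpha()` also removes verse numbers such as `'1'`, which is why the cleaned sentence starts at `'in'`.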

In [7]:
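from gensim.models import word2vec

# Train 200-dimensional vectors, ignoring words that occur fewer than 5 times.
# Note: this cell uses the gensim 3.x API. In gensim 4+, `size` is named
# `vector_size`, and `wv.vocab` is replaced by `wv.key_to_index`.
bible_kjv_word2vec_model = word2vec.Word2Vec(discard_punctuation_and_lowercased_sents, min_count=5, size=200)
bible_kjv_word2vec_model.save('bible_word2vec_gensim')
# To reload the full model later: word2vec.Word2Vec.load('bible_word2vec_gensim')
word_vectors = bible_kjv_word2vec_model.wv
# Once training is finished, the full model can be deleted; the word vectors alone are enough for querying.
del bible_kjv_word2vec_model
# Export the vectors in word2vec format, plus an optional vocabulary file.
word_vectors.save_word2vec_format('bible_word2vec_org', 'bible_word2vec_vocabulary')
len(word_vectors.vocab)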

5279
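
The vocabulary is much smaller than the number of distinct words in the text because `min_count=5` drops rare words before training. A minimal sketch of that filtering idea on toy data follows; `build_vocab` and the toy sentences are hypothetical and do not reproduce gensim's internal code.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Return the set of words whose total frequency is at least min_count."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, n in counts.items() if n >= min_count}

# Toy corpus (illustrative only).
toy_sentences = [
    ['god', 'created', 'the', 'heaven'],
    ['the', 'earth', 'was', 'without', 'form'],
    ['god', 'said', 'let', 'there', 'be', 'light'],
    ['the', 'light', 'was', 'good'],
]
vocab = build_vocab(toy_sentences, min_count=2)
```

Words seen only once, such as `'heaven'` in the toy corpus, get no vector at all, so querying them on a real model raises a `KeyError`.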

In [8]:
# most_similar ranks words by the cosine similarity of their vectors to the query vector.
# Word2vec rests on the distributional hypothesis: words that appear in similar
# contexts tend to have similar meanings, so they end up with similar vectors.
# A famous consequence is the analogy "man is to woman as king is to X", which
# can often be answered ("queen") with simple vector arithmetic.
word_vectors.most_similar(['god'])


[('lord', 0.7753700017929077),
 ('salvation', 0.7529996037483215),
 ('truth', 0.7514059543609619),
 ('spirit', 0.7420793771743774),
 ('hosts', 0.7316104769706726),
 ('faith', 0.701904296875),
 ('fear', 0.6927851438522339),
 ('mercy', 0.6920644640922546),
 ('grace', 0.6917527914047241),
 ('christ', 0.6876660585403442)]
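
To make "similar contexts" concrete, the sketch below lists the (center, context) word pairs that a fixed window produces for one sentence. It is a simplified illustration with a hypothetical helper `context_pairs`; gensim's actual training loop differs in details such as how the window size is sampled.

```python
def context_pairs(tokens, window=2):
    """Pair every token with the tokens at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ['in', 'the', 'beginning', 'god', 'created',
            'the', 'heaven', 'and', 'the', 'earth']
pairs = context_pairs(sentence, window=2)

# Context words seen around 'heaven' in this one sentence.
heaven_contexts = [ctx for center, ctx in pairs if center == 'heaven']
```

Words that accumulate similar context lists over the whole corpus are pushed toward similar vectors during training.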

In [9]:
word_vectors.most_similar(['heaven'], topn=3)


[('earth', 0.7465330958366394),
 ('heavens', 0.7058630585670471),
 ('darkness', 0.6699602603912354)]

In [10]:
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)


[('governor', 0.6332509517669678)]

In [11]:
word_vectors.most_similar_cosmul(positive=['woman', 'king'], negative=['man'], topn=1)


[('queen', 0.9654183387756348)]
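
The two calls above answer the same analogy with different scoring rules: `most_similar` adds and subtracts unit vectors and ranks by cosine similarity to the result, while `most_similar_cosmul` multiplies and divides shifted cosine similarities. The pure-Python sketch below illustrates both rules on made-up 3-dimensional vectors; the function names and the toy vectors are assumptions for illustration, not gensim code.

```python
import math

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def analogy_add(vectors, positive, negative):
    """Additive rule: rank by cosine to (sum of positives - sum of negatives)."""
    units = {w: unit(v) for w, v in vectors.items()}
    signed = [(units[w], 1.0) for w in positive] + [(units[w], -1.0) for w in negative]
    dim = len(signed[0][0])
    target = [sum(sign * u[i] for u, sign in signed) for i in range(dim)]
    excluded = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in excluded]
    return max(candidates, key=lambda w: cosine(units[w], target))

def analogy_mul(vectors, positive, negative, eps=1e-6):
    """Multiplicative rule: product of shifted cosines to positives over negatives."""
    def shifted(w, q):
        return (1 + cosine(vectors[w], vectors[q])) / 2  # maps [-1, 1] to [0, 1]
    def score(w):
        num = math.prod(shifted(w, q) for q in positive)
        den = math.prod(shifted(w, q) for q in negative) + eps
        return num / den
    excluded = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in excluded]
    return max(candidates, key=score)

# Made-up vectors with axes (royal, male, female), for illustration only.
toy = {
    'king': [1.0, 1.0, 0.0],
    'man': [0.0, 1.0, 0.0],
    'woman': [0.0, 0.0, 1.0],
    'queen': [1.0, 0.0, 1.0],
    'governor': [1.0, 0.2, 0.2],
}
add_answer = analogy_add(toy, positive=['woman', 'king'], negative=['man'])
mul_answer = analogy_mul(toy, positive=['woman', 'king'], negative=['man'])
```

On these toy vectors both rules pick the same word. On the trained model the rankings can differ, as the `governor` and `queen` results above show, because the multiplicative rule keeps any single large term from dominating the score.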

In [12]:
word_vectors.similarity('lord', 'god')


0.77537006
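
This value agrees with the top score in the `most_similar(['god'])` output because both are cosine similarities between the same two vectors. A minimal sketch of the computation and of ranking by it, using made-up vectors and hypothetical helper names:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(query, vectors, topn=3):
    """Return the topn words closest to `query`, excluding the query itself."""
    scores = [(w, cosine(vectors[query], v)) for w, v in vectors.items() if w != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    'god': [0.9, 0.8, 0.1],
    'lord': [0.8, 0.9, 0.2],
    'heaven': [0.5, 0.2, 0.6],
    'food': [0.1, 0.0, 0.9],
}
ranked = rank_by_cosine('god', toy_vectors, topn=3)
ranked_words = [word for word, _ in ranked]
```

Cosine similarity depends only on the angle between vectors, not their lengths, so a word compared with itself always scores 1.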

In [13]:
word_vectors.doesnt_match("lord god salvation food spirit".split())


'food'
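
One way to picture what `doesnt_match` does: average the unit vectors of the given words and return the word whose vector is least similar to that average. The sketch below follows that idea on made-up vectors; `odd_one_out` is a hypothetical name and the toy vectors are assumptions, so treat it as an illustration rather than the library's implementation.

```python
import math

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is farthest from the group mean."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    sims = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[sims.index(min(sims))]

# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    'lord': [0.8, 0.9, 0.2],
    'god': [0.9, 0.8, 0.1],
    'salvation': [0.7, 0.9, 0.1],
    'food': [0.1, 0.0, 0.9],
    'spirit': [0.6, 0.9, 0.1],
}
outlier = odd_one_out("lord god salvation food spirit".split(), toy_vectors)
```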