# Intro and setup

This notebook demonstrates word2vec in Python using the [gensim library](https://github.com/RaRe-Technologies/gensim). The [gensim website](https://radimrehurek.com/gensim/index.html) has documentation and tutorials, along with the [API reference](https://radimrehurek.com/gensim/apiref.html).

You can install it on your local machine with:
```
pip install --upgrade gensim
```

In [1]:
# turn off pretty printing to get horizontal display - optional, but I'm saving space for display
%pprint

Pretty printing has been turned OFF


In [2]:
import os

# Data

In [3]:
with open('shakespeare.txt', 'r') as f:
    raw_data = f.read()

In [4]:
type(raw_data)

<class 'str'>

In [5]:
# what does it look like?
raw_data[:1000]

"A MIDSUMMER-NIGHT'S DREAM\n\nNow , fair Hippolyta , our nuptial hour \nDraws on apace : four happy days bring in \nAnother moon ; but O ! methinks how slow \nThis old moon wanes ; she lingers my desires ,\nLike to a step dame , or a dowager \nLong withering out a young man's revenue .\n\nFour days will quickly steep themselves in night ;\nFour nights will quickly dream away the time ;\nAnd then the moon , like to a silver bow \nNew-bent in heaven , shall behold the night \nOf our solemnities .\n\nGo , Philostrate ,\nStir up the Athenian youth to merriments ;\nAwake the pert and nimble spirit of mirth ;\nTurn melancholy forth to funerals ;\nThe pale companion is not for our pomp .\n\nHippolyta , I woo'd thee with my sword ,\nAnd won thy love doing thee injuries ;\nBut I will wed thee in another key ,\nWith pomp , with triumph , and with revelling .\n\n\nHappy be Theseus , our renowned duke !\n\nThanks , good Egeus : what's the news with thee ?\n\nFull of vexation come I , with complain

In [6]:
words = raw_data.split()

In [7]:
len(words)

980637

In [8]:
# how many unique words?
len(set(words))

33505

In [9]:
# the kinds of words you might expect from Shakespeare (set order is arbitrary)
list(set(words))[:100]

['mots', "'Suffer", 'plot-proof', 'varying', "'Convey", 'duck', 'handful', 'chambermaids', 'sterner', "requir'd", 'love-a', 'sour-cold', 'tempers', 'ecce', 'charmeth', 'troubled', 'Subdued', 'gird', 'Yond', 'differences', 'reaching', 'Promotion', 'hair', "alehouse'", 'rhetoric', "new-ta'en", 'out-sleep', 'unyoke', 'plucks', 'threes', 'strives', 'thereabout', "Berowne's", 'Pandar', 'lege', 'devote', 'nave', 'wipes', 'Legitimate', 'promise-crammed', 'tristful', 'fides', 'hams', 'Raise', 'childish', "scratch'd", 'horseback', 'ballad-mongers', 'lately', 'Whips', 'replication', "Profess'd", 'melts', "Determin'd", 'louts', "Patroclus'", 'All-worthy', 'inhuman', 'favourites', 'quakes', 'fornications', 'animals', 'agen', 'bowget', 'fallen', "upon't", 'immortally', 'distinction', 'springs', 'Hobbididance', "o'er-walk", "peck'd", "you'", 'compass', "still'd", 'immediate', "ladies'", "Neptune's", 'whipster', 'Lymoges', 'Siward', 'handless', 'attract', 'self-affrighted', 'summer-swelling', 'sea-mo

# Gensim

In [10]:
import gensim

ModuleNotFoundError: No module named 'gensim'

## Setup data
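`gensim.models.Word2Vec` takes a corpus broken into sentences, where each sentence is a list of tokens. Here I split the raw text into sentences with `split_sentences` from `gensim.summarization` and then split each sentence on whitespace. You can use any other data as long as you create an iterable that yields tokenized sentences. Note that this notebook uses the gensim 3.x API: gensim 4.0 removed the `gensim.summarization` module and renamed the `size` and `iter` parameters to `vector_size` and `epochs`.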

In [None]:
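from gensim.summarization.textcleaner import split_sentences

# one list of whitespace-separated tokens per sentence
sentences = [sentence.split() for sentence in split_sentences(raw_data)]

model = gensim.models.Word2Vec(
    sentences,
    size=150,                # dimensionality of the word vectors
    window=10,               # max distance between a word and its context words
    min_count=2,             # drop words seen fewer than 2 times
    workers=os.cpu_count(),  # worker threads; must be positive (-1 is not a supported value)
    iter=10)                 # number of passes over the corpus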
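
If you want to feed in your own data, here is a minimal sketch of an iterable that yields tokenized sentences. The class name `LineSentences` and the one-sentence-per-line rule are my own assumptions for illustration, not part of gensim. The point is that the object can be iterated more than once, which matters because training makes several passes over the corpus (a plain generator would be exhausted after the first pass).

```python
class LineSentences:
    """Hypothetical re-iterable corpus: one list of tokens per non-empty line."""

    def __init__(self, text):
        self.text = text

    def __iter__(self):
        # A fresh pass starts every time the object is iterated
        for line in self.text.splitlines():
            tokens = line.split()
            if tokens:  # skip blank lines
                yield tokens


sample = (
    "Now , fair Hippolyta , our nuptial hour \n"
    "Draws on apace : four happy days bring in \n"
    "\n"
    "Another moon ;"
)
corpus = LineSentences(sample)
first_pass = list(corpus)
second_pass = list(corpus)  # iterating again yields the same sentences
print(first_pass[0])
```

An object like `corpus` could be passed as the first argument to `Word2Vec` in place of the list built above.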

## Run model

In [None]:
# save the trained model to disk (reload later with gensim.models.Word2Vec.load)
model.save('demo-model')

In [None]:
print(model)

## Vocabulary

In [None]:
# get the list of words in the model's vocabulary
model_words = list(model.wv.vocab)

In [None]:
len(model_words)

In [None]:
# get the vocabulary words ordered by index (most frequent first)
words_indexes = list(model.wv.index2word)

In [None]:
len(words_indexes)

In [None]:
words_indexes[:20]

In [None]:
# check the index for a word
model.wv.vocab['one'].index
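
To make the vocabulary cells above less of a black box, here is a rough sketch of how a frequency-ordered vocabulary with a `min_count` cutoff can be built. The function `build_vocab` and the toy sentences are hypothetical, and the alphabetical tie-breaking is my own choice for determinism; gensim's internal bookkeeping is more involved.

```python
from collections import Counter


def build_vocab(sentences, min_count=2):
    """Hypothetical helper: keep frequent words, ordered most frequent first."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = [word for word, count in counts.items() if count >= min_count]
    # Descending frequency; ties broken alphabetically (an assumption, for determinism)
    kept.sort(key=lambda word: (-counts[word], word))
    word_to_index = {word: i for i, word in enumerate(kept)}
    return kept, word_to_index


toy_sentences = [
    ["the", "moon", "wanes"],
    ["the", "moon", "like", "a", "bow"],
    ["the", "night"],
]
index_to_word, word_to_index = build_vocab(toy_sentences, min_count=2)
print(index_to_word)
print(word_to_index["moon"])
```

Words seen only once ("wanes", "night", and so on) never get an index, which is why the model vocabulary is smaller than the raw unique-word count.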

## Vectors

In [None]:
model.wv?

In [None]:
model.wv.get_vector('one')

In [None]:
len(model.wv.get_vector('one'))

### Distance from mean
<a id="distance-from-mean"></a>

In [None]:
model.wv.doesnt_match?

In [None]:
# find word in list that is farthest from the mean
model.wv.doesnt_match("breakfast cereal dinner lunch".split())

In [None]:
model.wv.doesnt_match("cook janitor pilot sport teacher".split())

In [None]:
model.wv.doesnt_match("kill dead knife love".split())

In [None]:
model.wv.doesnt_match("insect animal cat tree".split())

In [None]:
model.wv.doesnt_match("dog cat parrot lion".split())
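
Since gensim is not needed to see the idea, here is a plain-Python sketch of "farthest from the mean": normalize each vector, average them, and pick the word whose cosine similarity to that average is lowest. The helper names and the 2-d toy vectors are made up for illustration; they are not output from the Shakespeare model.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors):
    """Return the word least similar to the mean of the unit-normalized vectors."""
    names = list(vectors)
    normed = [unit(vectors[name]) for name in names]
    dim = len(normed[0])
    mean = unit([sum(v[i] for v in normed) / len(normed) for i in range(dim)])
    sims = [sum(a * b for a, b in zip(v, mean)) for v in normed]
    return names[sims.index(min(sims))]


# Hypothetical toy vectors: three point one way, one points another
toy = {
    "breakfast": [0.9, 0.1],
    "dinner": [0.8, 0.2],
    "lunch": [0.85, 0.15],
    "cereal": [0.1, 0.9],
}
result = odd_one_out(toy)
print(result)
```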

## Similarity
<a id="similarity"></a>

### Cosine similarity

In [None]:
model.wv.similarity?

In [None]:
model.wv.similarity('angry', 'happy')

In [None]:
model.wv.similarity('woman', 'tree')

In [None]:
model.wv.similarity('tree', 'shrub')

In [None]:
model.wv.similarity('tree', 'bush')

In [None]:
# distance is 1 - cosine similarity
model.wv.distance('woman', 'tree')

In [None]:
# so distance and similarity for the same pair should sum to 1
model.wv.distance('woman', 'man') + model.wv.similarity('woman', 'man')
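
For reference, here is a small self-contained sketch of the two quantities used above, written with plain lists rather than gensim. The function names and the sample vectors are my own; the relationship shown is just distance = 1 - similarity.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance, defined here as 1 minus cosine similarity."""
    return 1 - cosine_similarity(u, v)


# Hypothetical vectors, not taken from the trained model
u = [1.0, 2.0, 3.0]
v = [2.0, 0.5, 1.0]
total = cosine_similarity(u, v) + cosine_distance(u, v)
print(total)
```

Similarity ranges from -1 (opposite directions) to 1 (same direction), so this distance ranges from 0 to 2.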

In [None]:
# closest by cosine similarity
model.wv.similar_by_word('woman', topn=10)

In [None]:
# closest by cosine similarity
model.wv.similar_by_word('she', topn=10)

In [None]:
model.wv.most_similar?

In [None]:
model.wv.most_similar(positive=['woman'], topn=10)

In [None]:
model.wv.most_similar(negative=['woman'], topn=10)

In [None]:
model.wv.most_similar(positive=['woman', 'king'], topn=10)

In [None]:
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)
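
The `positive`/`negative` query is the classic "king - man + woman" analogy. As a rough sketch of the additive idea (not gensim's actual code), you can add the unit vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The function `analogy` and the 2-d toy vectors (roughly "royalty" and "gender" axes) are invented for illustration.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    normed = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(normed.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, normed[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, normed[word])]
    query = unit(query)
    inputs = set(positive) | set(negative)  # query words are excluded from results
    scored = [
        (word, sum(a * b for a, b in zip(vec, query)))
        for word, vec in normed.items()
        if word not in inputs
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors: [royalty, maleness]
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 1.0],
    "tree": [-1.0, 0.0],
}
ranked = analogy(toy, positive=["woman", "king"], negative=["man"])
print(ranked[0][0])
```

With a corpus as small as Shakespeare, the real model may not give results this clean.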

### Multiplicative combination

In [None]:
model.wv.most_similar_cosmul?

In [None]:
model.wv.most_similar_cosmul(positive=['woman', 'king'], negative=['man'], topn=10)

In [None]:
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)

In [None]:
model.wv.most_similar_cosmul(positive=['woman', 'king'], topn=10)

In [None]:
model.wv.most_similar(positive=['woman', 'king'], topn=10)
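
The multiplicative variant combines the same similarities by multiplying instead of adding. Here is a sketch of that scoring rule as I understand it: shift each cosine into the range 0 to 1, multiply the positive terms, and divide by the product of the negative terms plus a small epsilon. The function `cosmul`, the epsilon value, and the toy vectors are assumptions for illustration, not gensim's implementation.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def shifted_cosine(u, v):
    """Cosine similarity of two unit vectors, mapped from [-1, 1] to [0, 1]."""
    return (1 + sum(a * b for a, b in zip(u, v))) / 2


def cosmul(vectors, positive, negative, topn=3, eps=1e-6):
    """Score = product of positive terms / (product of negative terms + eps)."""
    normed = {word: unit(vec) for word, vec in vectors.items()}
    inputs = set(positive) | set(negative)  # query words are excluded from results
    scored = []
    for word, vec in normed.items():
        if word in inputs:
            continue
        numerator = math.prod(shifted_cosine(vec, normed[p]) for p in positive)
        denominator = math.prod(shifted_cosine(vec, normed[n]) for n in negative) + eps
        scored.append((word, numerator / denominator))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors: [royalty, maleness]
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 1.0],
    "tree": [-1.0, 0.0],
}
ranked = cosmul(toy, positive=["woman", "king"], negative=["man"])
print(ranked[0][0])
```

Because the terms are multiplied, one large similarity cannot dominate the score the way it can in the additive version.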