# Deep NLP - Word Embeddings

Think back to NLP as we've understood it so far.

If we've had some luck with NLP modeling, likely with a Naive Bayes algorithm, we were able to identify some correlations between words and some other feature of interest.

But to whatever extent our models made connections and picked up on correlations, they did so *without any understanding of the **meaning** of the words in question*.

Let's think for a minute about words and objective meanings!

We can make sense of meaning for computational purposes by thinking about meaning in terms of similarity, i.e. thinking about meaning *holistically*.

Q. Is there any precedent for this way of thinking about meaning? <br/>
A. [Yes](https://plato.stanford.edu/entries/meaning-holism/#ArgForMeaHol)

So what will this look like for us?

*Remember cosine similarity?*

$\rightarrow$ We'll have much the same idea here: associate each word with values along particular dimensions in a multi-dimensional space. If we had a dimension for *softness*, for example, then pillows and marshmallows would score higher on it than rocks and bricks.
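To make the *softness* picture concrete, here is a minimal sketch with made-up numbers. The two dimensions (softness, heaviness) and every score below are assumptions chosen for illustration; they are not values learned by any model.

```python
import math

# Hypothetical hand-made "embeddings": (softness, heaviness).
# Every score is invented for illustration only.
toy_vectors = {
    "pillow":      (0.9, 0.1),
    "marshmallow": (0.8, 0.05),
    "rock":        (0.1, 0.9),
    "brick":       (0.05, 0.8),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two soft things point in nearly the same direction...
soft_pair = cosine_similarity(toy_vectors["pillow"], toy_vectors["marshmallow"])
# ...while a soft thing and a hard thing point in different directions.
mixed_pair = cosine_similarity(toy_vectors["pillow"], toy_vectors["rock"])
```

Note that cosine similarity compares directions only, so a second dimension is needed here: with softness alone, every vector would point the same way.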

In [None]:
# ! pip install --upgrade gensim

In [1]:
import gensim
import numpy as np

In [2]:
# Reading in the data

import json

with open('data/JEOPARDY_QUESTIONS1.json') as f:
    data = json.load(f)

In [3]:
# Let's check the datatype of our data
type(data)


list

In [4]:
# And the length
len(data)


216930

In [5]:
# Let's look at the first element in our list

data[0]

In [None]:
# How many words do we have in our first question?



In [None]:
# Let's try that again!




In [6]:
# Let's count the total number of
# clue words we have.
length = 0

for clue in data:
    length += len(clue['question'].split(' '))
    
length

3169994

## Using Word2Vec

In [7]:
import string

In [8]:
# Word2Vec requires that our text have the form of a list
# of 'sentences', where each sentence is itself a list of
# words. In other words, it takes in lists of lists.
# How can we put our _Jeopardy!_ clues in that shape?

# Take out punctuation, split on the spaces, and lowercase each word
text = []
for clue in data:
    sentence = clue['question'].translate(str.maketrans('', '', string.punctuation)).split(' ')
    
    new_sent = []
    for word in sentence:
        new_sent.append(word.lower())
        
    text.append(new_sent)
    

In [9]:
# Let's check the new structure of our first clue
text[0]


['for',
 'the',
 'last',
 '8',
 'years',
 'of',
 'his',
 'life',
 'galileo',
 'was',
 'under',
 'house',
 'arrest',
 'for',
 'espousing',
 'this',
 'mans',
 'theory']
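The same cleaning steps can be wrapped in a small helper. This is a sketch, not the notebook's own code: the name `tokenize` and the `sample` string (adapted from the first clue, with a doubled space added on purpose) are hypothetical. It also shows two side effects worth knowing: stripping punctuation turns "man's" into "mans", and `split(' ')` produces empty strings wherever two spaces meet.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def tokenize(clue_text):
    """Lowercase a clue, drop punctuation, and split on whitespace."""
    cleaned = clue_text.translate(PUNCT_TABLE).lower()
    # split() with no argument collapses runs of whitespace,
    # whereas split(' ') would emit '' for every doubled space.
    return cleaned.split()

# Hypothetical sample with a doubled space after 'arrest'.
sample = "Galileo was under house arrest  for espousing this man's theory"
tokens = tokenize(sample)

# For comparison: the split(' ') variant keeps an empty token.
tokens_with_gaps = sample.translate(PUNCT_TABLE).lower().split(' ')
```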

In [10]:
# Constructing the model is simply a matter of
# instantiating a Word2Vec object.

# Word2Vec arguments: the vector dimensionality ('size' in gensim 3.x,
#  'vector_size' in 4.x) should be 50 or more. alpha is the initial learning rate.
#  window=5 (the default) means the context is up to 5 words before and after the target word.
#  sg specifies the model type: sg=0 (the default) is continuous bag of words (CBOW),
#  sg=1 (used here) is skip-gram.
model = gensim.models.Word2Vec(text, sg=1)
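To see what the context window means for skip-gram, here is a simplified sketch of how (center, context) training pairs could be enumerated. The function name `skipgram_pairs` is hypothetical, and the sketch leaves out details of real Word2Vec training, such as randomly shrinking the window and subsampling frequent words.

```python
def skipgram_pairs(sentence, window):
    """Return (center, context) pairs for one tokenized sentence.

    Each word is paired with up to `window` words on either side.
    """
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs

sentence = ['galileo', 'was', 'under', 'house', 'arrest']
pairs = skipgram_pairs(sentence, window=2)
```

With `window=2`, 'under' is paired with all four of its neighbors, while 'galileo' and 'house' are three positions apart and never form a pair.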


Consider word analogies:

King + Woman - Man = Queen <br/>
Brother + Woman - Man = Sister
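The arithmetic can be tried by hand on toy vectors. Everything in this sketch is an assumption made for illustration: the three dimensions (royalty, femaleness, siblinghood), the scores, and the helper `analogy`. A real model learns its dimensions from data, and gensim normalizes the vectors before combining them, which this sketch skips.

```python
import math

# Hypothetical vectors: (royalty, femaleness, siblinghood). Invented values.
toy = {
    "king":    (0.9, 0.1, 0.1),
    "queen":   (0.9, 0.9, 0.1),
    "man":     (0.1, 0.1, 0.1),
    "woman":   (0.1, 0.9, 0.1),
    "brother": (0.1, 0.1, 0.9),
    "sister":  (0.1, 0.9, 0.9),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, and return
    the remaining word whose vector is most similar to the result."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    excluded = set(positive) | set(negative)  # never return an input word
    candidates = [w for w in vectors if w not in excluded]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

royal = analogy(["king", "woman"], ["man"], toy)
sibling = analogy(["brother", "woman"], ["man"], toy)
```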


In [11]:
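# Passing 'text' to the constructor above already trained the model.
# Calling 'train()' again runs additional passes over the corpus.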

model.train(text, total_examples=model.corpus_count, epochs=model.epochs)


(11336095, 15849970)

In [14]:
# Checking the word count. This gensim version has no
# 'corpus_total_words' attribute, so we count the tokens directly.
sum(len(sentence) for sentence in text)

## model.wv

In [None]:
# The '.wv' attribute stores the word vectors

model.wv

In [15]:
# The vectors are keyed by the words

model.wv['child']

array([ 0.01286276,  0.04223821,  0.17784935, -0.4020581 , -0.03477514,
       -0.13485502,  0.31926864,  0.35658953, -0.24488859,  0.3407168 ,
       -0.8403199 , -0.19810818, -0.3942617 ,  0.22080459,  0.03283539,
        0.07617025,  0.03391942, -0.08365641, -0.37056348,  0.09479607,
        0.2290656 , -0.06891888, -0.15716352,  0.09259291, -0.1883377 ,
        0.28286704, -0.20068912,  0.31527257, -0.02373368,  0.31248242,
       -0.29146606,  0.09534586,  0.02951768, -0.12433862, -0.17158964,
        0.290063  , -0.35654885, -0.6427458 ,  0.39467543,  0.64272785,
       -0.22312053, -0.04415512, -0.04854603,  0.0439701 , -0.26158178,
       -0.08112038,  0.06152572, -0.22170934, -0.04389291,  0.4030124 ,
       -0.42717826,  0.21808307,  0.22687049,  0.30979577,  0.30841038,
        0.23838502,  0.6918088 ,  0.27131388,  0.851383  ,  0.38442045,
        0.22831273,  0.43928698,  0.33453947, -0.21112314, -0.02513519,
        0.22559677, -0.03640351, -0.09103386,  0.28360656, -0.17

### model.wv methods
#### 'most_similar()' and 'similarity()'

In [16]:
model.wv.most_similar('furniture')

[('bicycles', 0.6993882656097412),
 ('pottery', 0.6992767453193665),
 ('linen', 0.695056676864624),
 ('artwork', 0.6905595064163208),
 ('flooring', 0.6805717945098877),
 ('jewelry', 0.6717463731765747),
 ('chippendale', 0.6682672500610352),
 ('drip', 0.6680567264556885),
 ('decorative', 0.6673911809921265),
 ('integral', 0.6633995771408081)]

In [17]:
model.wv.similarity('furniture', 'jewelry')

0.6717463545422054

In [18]:
# What's most similar to 'cat'?

model.wv.most_similar('cat')

[('cheetah', 0.7204302549362183),
 ('terrier', 0.6932100057601929),
 ('hound', 0.6894108653068542),
 ('dog', 0.6886221170425415),
 ('pup', 0.6780186891555786),
 ('furry', 0.6779803037643433),
 ('possum', 0.6769487261772156),
 ('scavenger', 0.6756904721260071),
 ('pachyderm', 0.6727346181869507),
 ('shorthaired', 0.6715091466903687)]

In [19]:
# Let's try the familiar example: King - Man + Woman = Queen

model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=10)

[('queen', 0.6939201354980469),
 ('consort', 0.6213732957839966),
 ('monarch', 0.6076021790504456),
 ('aquitaine', 0.6039831638336182),
 ('throne', 0.6023117303848267),
 ('tudor', 0.5974721908569336),
 ('margrethe', 0.5876715779304504),
 ('noor', 0.5870194435119629),
 ('princess', 0.5794650316238403),
 ('elizabeth', 0.5712262392044067)]

In [20]:
# Shakespeare
model.wv.most_similar(['shakespeare'])


[('sophocles', 0.7362139225006104),
 ('shakespeares', 0.722474217414856),
 ('shakespearean', 0.6990102529525757),
 ('falstaff', 0.6911706328392029),
 ('romeo', 0.6647429466247559),
 ('euripides', 0.6606671810150146),
 ('shaws', 0.6579442620277405),
 ('moliere', 0.6521010398864746),
 ('ibsen', 0.6501968502998352),
 ('hussy', 0.6451749801635742)]

In [21]:
# Greg
model.wv.most_similar(['greg'])

[('kinnear', 0.8386141061782837),
 ('connors', 0.8029658794403076),
 ('shoeless', 0.7988706827163696),
 ('kareem', 0.794715166091919),
 ('abduljabbar', 0.7913160920143127),
 ('baxter', 0.791248083114624),
 ('langham', 0.788703441619873),
 ('bebe', 0.7873073220252991),
 ('hartman', 0.784852147102356),
 ('dennehy', 0.7836828231811523)]

In [22]:
# Washington

model.wv.most_similar(['washington'])


[('dc', 0.8136782646179199),
 ('arlington', 0.6723763346672058),
 ('dcs', 0.6637563705444336),
 ('dca', 0.6506284475326538),
 ('washingtons', 0.641762912273407),
 ('p3', 0.6323374509811401),
 ('virginia', 0.6281608939170837),
 ('missouri', 0.6256818771362305),
 ('newseum', 0.614162802696228),
 ('hw', 0.6129159331321716)]

#### 'doesnt_match()' returns the word with the least cosine similarity to the group's mean vector

In [23]:
model.wv.doesnt_match(['breakfast', 'lunch', 'frog', 'dinner'])


'frog'

In [24]:
model.wv.doesnt_match(['tree', 'flower', 'bush', 'plant', 'toothbrush'])


'bush'
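A rough sketch of the idea behind this kind of odd-one-out test: normalize each vector, average them, and return the word least aligned with that average. The helper names and the two toy dimensions (food-ness, animal-ness) are assumptions for illustration, not gensim's actual implementation.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    normed = {w: unit(vectors[w]) for w in words}
    dims = len(normed[words[0]])
    mean = unit([sum(normed[w][d] for w in words) / len(words) for d in range(dims)])
    # For unit vectors, the dot product equals the cosine similarity.
    return min(words, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))

# Hypothetical vectors: (food-ness, animal-ness). Invented values.
toy = {
    "breakfast": (0.9, 0.1),
    "lunch":     (0.8, 0.2),
    "dinner":    (0.85, 0.1),
    "frog":      (0.1, 0.9),
}
outlier = odd_one_out(["breakfast", "lunch", "frog", "dinner"], toy)
```

Since the answer depends entirely on the learned vectors, a surprising result such as 'bush' above plausibly reflects how the word is used in the training clues rather than a flaw in the method.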

#### 'closer_than()'

In [25]:
# Which words are closer to 'king' than 'queen' is?
# 'closer_than()' takes two single words, not lists.

model.wv.closer_than('king', 'queen')
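The logic of a "closer than" query can be sketched on toy data: measure how similar the reference word is to the target, then keep every other word that beats that score. The function below and its vectors are hypothetical stand-ins, not gensim's code.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def closer_than(word, reference, vectors):
    """Words more similar to `word` than `reference` is."""
    threshold = cosine(vectors[word], vectors[reference])
    return [
        other for other in vectors
        if other != word and cosine(vectors[word], vectors[other]) > threshold
    ]

# Hypothetical vectors: (royalty, femaleness). Invented values.
toy = {
    "king":    (0.9, 0.1),
    "monarch": (0.85, 0.2),
    "queen":   (0.8, 0.6),
    "frog":    (0.1, 0.9),
}
closer = closer_than("king", "queen", toy)
```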

#### 'distance()'

In [26]:
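# 'distance()' is cosine distance (1 - cosine similarity), so it
# already ignores vector length. Unit-normalizing the vectors is
# still handy for comparing them with plain dot products.
# (Assumes the raw vectors are stored in 'model.wv.vectors'.)

norms = np.linalg.norm(model.wv.vectors, axis=1, keepdims=True)
norm_vecs = model.wv.vectors / norms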

In [27]:
model.wv.distance('king', 'king')

1.1102230246251565e-16

In [28]:
model.wv.distance('joy', 'happiness')

0.47844749159746347
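The distances above behave like cosine distance, that is, one minus cosine similarity. A minimal sketch with invented three-dimensional vectors (the values for `joy` and `happiness` are assumptions, not the model's) shows why a word's distance to itself is zero up to floating-point error, and why rescaling a vector changes nothing.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(u, v):
    """1 - cosine similarity, computed as a dot product of unit vectors."""
    return 1.0 - sum(a * b for a, b in zip(unit(u), unit(v)))

# Hypothetical vectors, invented for illustration.
joy = (0.8, 0.3, 0.1)
happiness = (0.7, 0.5, 0.2)

self_distance = cosine_distance(joy, joy)             # zero, up to rounding
pair_distance = cosine_distance(joy, happiness)       # small but positive
scaled_distance = cosine_distance(joy, [10 * x for x in joy])  # length is ignored
```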

#### 'evaluate_word_analogies()'

Check out [this text file](https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt)!

In [30]:
# [1] is the list of per-section results; [4] picks the 'family' section.
relatives = model.wv.evaluate_word_analogies(
    'https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt')[1][4]

In [31]:
len(relatives['correct'])

In [None]:
len(relatives['incorrect'])

In [None]:
relatives['correct'][:5]

In [None]:
relatives['incorrect'][:5]
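The analogy file is plain text: lines beginning with `:` name a section, and every other line holds four words, where the fourth is the expected answer. The sketch below parses a few sample lines in that format and sorts them into 'correct' and 'incorrect' lists. The helper names and the lookup-table predictor `toy_predict` are hypothetical; with a trained model, the predictor would presumably wrap a call such as `model.wv.most_similar(positive=[b, c], negative=[a], topn=1)`.

```python
def parse_analogies(lines):
    """Group four-word analogy lines under their ': section' headers."""
    sections = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(':'):
            current = line[1:].strip()
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(tuple(line.lower().split()))
    return sections

def score_section(questions, predict):
    """Split questions by whether predict(a, b, c) returns the expected word."""
    result = {'correct': [], 'incorrect': []}
    for a, b, c, expected in questions:
        key = 'correct' if predict(a, b, c) == expected else 'incorrect'
        result[key].append((a, b, c, expected))
    return result

# Sample lines in the same format as the analogy file.
sample_lines = [
    ": family",
    "boy girl brother sister",
    "king queen man woman",
    ": capital-common-countries",
    "Athens Greece Berlin Germany",
]

# Hypothetical predictor: a lookup table standing in for a trained model.
toy_answers = {('boy', 'girl', 'brother'): 'sister', ('king', 'queen', 'man'): 'wife'}

def toy_predict(a, b, c):
    return toy_answers.get((a, b, c))

sections = parse_analogies(sample_lines)
family = score_section(sections['family'], toy_predict)
```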