# GloVe (Gensim)

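For looking at word vectors, we'll use **Gensim**. **Gensim** isn't really a deep learning package. It's a package for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. But it's efficient and scalable, and quite widely used. Pretrained **GloVe** embeddings can be downloaded from [the GloVe page](https://nlp.stanford.edu/projects/glove/); they're inside [this zip file](https://nlp.stanford.edu/data/glove.6B.zip). Note that the cells below don't load that file: they train a small Word2Vec model on NLTK's Brown corpus and query it through the same `KeyedVectors` interface. Brown is small (about a million words), so expect the neighbours to be noisier than with pretrained GloVe vectors.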

In [1]:
import nltk

import gensim
from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

from nltk.corpus import brown
from nltk.corpus import stopwords
from string import punctuation
from collections import Counter
nltk.download('stopwords')
nltk.download('brown')

corpus = brown.sents()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\swara\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\swara\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [2]:
# Train a Word2Vec model on the Brown sentences (100-dimensional vectors)
w2v_model = Word2Vec(corpus, vector_size=100, window=5, min_count=2, workers=4)

# Save the model in Word2Vec format for later use
w2v_model.wv.save_word2vec_format("brown_corpus_word2vec_format.txt", binary=False)

# Load the model back for testing
model = KeyedVectors.load_word2vec_format("brown_corpus_word2vec_format.txt", binary=False)
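
If you want the pretrained GloVe vectors rather than the Brown-trained model, the unzipped files are plain text: one token per line, followed by its space-separated vector components. Here is a minimal sketch of parsing that layout in plain Python. The function name `load_glove_text` and the tiny sample file are made up for illustration. With Gensim 4, `KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)` should read the same layout directly into a `KeyedVectors` object.

```python
import os
import tempfile


def load_glove_text(path):
    """Read a GloVe-style text file into a dict mapping word -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny made-up file in the same layout, only to exercise the parser.
sample = "coffee 0.1 0.2 0.3\ntea 0.2 0.1 0.4\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

glove = load_glove_text(tmp_path)
os.remove(tmp_path)

print(len(glove), len(glove["coffee"]))
```

For the real data you would pass the path of one of the unzipped files, e.g. `glove.6B.100d.txt`, instead of the temporary file.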

In [3]:
# look up the vector for a word; its length matches vector_size=100
model['coffee'].shape

(100,)

### Similarity

In [4]:
model.most_similar('coffee')

[('beer', 0.9718437790870667),
 ('mud', 0.9710042476654053),
 ('cloth', 0.9670085906982422),
 ('bid', 0.966858983039856),
 ('elephants', 0.9642542004585266),
 ('putt', 0.9636809825897217),
 ('bone', 0.961922824382782),
 ('target', 0.9612301588058472),
 ('cab', 0.9603677988052368),
 ('command', 0.9600244164466858)]
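
Conceptually, `most_similar` ranks every other word in the vocabulary by cosine similarity to the query vector and returns the top few. The sketch below does this in plain Python on made-up three-dimensional vectors; `nearest_neighbours` and `toy_vectors` are hypothetical names, and this is not Gensim's actual (vectorised) implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(vectors, query, topn=3):
    """Return the topn (word, similarity) pairs closest to `query`."""
    q = vectors[query]
    scored = [
        (word, cosine_similarity(q, vec))
        for word, vec in vectors.items()
        if word != query  # the query itself is excluded, as in most_similar
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up toy vectors, chosen only so the ranking is easy to follow.
toy_vectors = {
    "coffee": [0.9, 0.1, 0.0],
    "tea": [0.8, 0.2, 0.1],
    "beer": [0.6, 0.5, 0.1],
    "cab": [0.0, 0.2, 0.9],
}

ranked = nearest_neighbours(toy_vectors, "coffee")
for word, score in ranked:
    print(f"{word}: {score:.3f}")
```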

In [5]:
model.most_similar('language')

[('comparison', 0.9648534655570984),
 ('congregation', 0.9621796011924744),
 ('character', 0.9620857834815979),
 ('danger', 0.9600976705551147),
 ('analysis', 0.9596229791641235),
 ('proof', 0.9583678841590881),
 ('tradition', 0.9559996128082275),
 ('ultimate', 0.9554035663604736),
 ('humanity', 0.9540858268737793),
 ('theory', 0.9535051584243774)]

In [6]:
# "plant" has multiple meanings (e.g. factory vs. vegetation), and the neighbours can mix them
model.most_similar("plant")

[('energy', 0.9676889181137085),
 ('frame', 0.9576639533042908),
 ('annual', 0.9556906819343567),
 ('location', 0.9556302428245544),
 ('central', 0.954703152179718),
 ('forming', 0.9532434940338135),
 ('enterprise', 0.9522539377212524),
 ('diffusion', 0.9509708881378174),
 ('aid', 0.9498393535614014),
 ('farm', 0.9496110081672668)]

In [7]:
#woman + king - man
result = model.most_similar(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

followers: 0.9489


In [8]:
#woman + code - man
result = model.most_similar(positive=['woman', 'code'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

organized: 0.9656


### Cosine Similarity

We have talked about this in the last class.  Here we can conveniently use `distance` to find the cosine distance between two words. Note that distance = 1 - similarity.
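
Written out for two word vectors $u$ and $v$ (the symbols are chosen here for illustration):

$$
\text{distance}(u, v) = 1 - \text{similarity}(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$

Since cosine similarity lies in $[-1, 1]$, the distance lies in $[0, 2]$: 0 for vectors pointing the same way, 1 for orthogonal vectors, and 2 for vectors pointing in opposite directions.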

In [9]:
w1 = "dog"
w2 = "cat"
w3 = "fruit"
w1_w2_dist = model.distance(w1, w2)
w1_w3_dist = model.distance(w1, w3)

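# we'd expect dog to be much closer to cat than to fruit, but in the run shown below
# this small Brown-trained model gives the smaller distance to fruit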
w1_w2_dist, w1_w3_dist

(0.1325734257698059, 0.08407062292098999)

In [10]:
w1 = "happy" # synonym 1
w2 = "cheerful" # synonym 2
w3 = "sad" # antonym
w1_w2_dist = model.distance(w1, w2)
w1_w3_dist = model.distance(w1, w3)

#$w_1$="happy" is closer to its antonym $w_3$="sad" than to its synonym $w_2$="cheerful"!
#this kind of similarity does not handle antonyms: they tend to occur in similar contexts
w1_w2_dist, w1_w3_dist

(0.2810414433479309, 0.08063018321990967)

### Bias

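You guys, one very important thing: NLP models are biased. Embeddings are learned from human-written text, so they pick up the stereotypes in that text (for example, about gender and occupations). The analogy queries below probe for this.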

In [11]:
import pprint

pprint.pprint(model.most_similar(positive=['woman', 'worker'], negative=['man']))

[('tract', 0.9730052351951599),
 ('Control', 0.966152548789978),
 ('African', 0.9653077125549316),
 ('Model', 0.9647234082221985),
 ('spectacular', 0.9641784429550171),
 ('megatons', 0.9639370441436768),
 ('biggest', 0.9634506702423096),
 ('splendid', 0.9634196162223816),
 ('assembly', 0.9632116556167603),
 ('WTV', 0.9631566405296326)]


In [12]:
pprint.pprint(model.most_similar(positive=['man', 'worker'], negative=['woman']))

[('originate', 0.8782269358634949),
 ('interfere', 0.8769674897193909),
 ('Everyone', 0.8759981393814087),
 ('promise', 0.8742619156837463),
 ('stress', 0.8734277486801147),
 ('stain', 0.8725233674049377),
 ('concentrate', 0.8714022040367126),
 ('Hiroshima', 0.8682627081871033),
 ('greatness', 0.8671994209289551),
 ('sign', 0.8631095290184021)]


In [13]:
pprint.pprint(model.most_similar(positive=["woman", "doctor"], negative=["man"]))

[('device', 0.9400931000709534),
 ('Lord', 0.9389525055885315),
 ('abandoned', 0.9359262585639954),
 ('accepted', 0.9350138902664185),
 ('ended', 0.9346182346343994),
 ('conductor', 0.9342072010040283),
 ('magnificent', 0.9320111870765686),
 ('won', 0.9295842051506042),
 ('generation', 0.9290319085121155),
 ('released', 0.9280744791030884)]


In [14]:
pprint.pprint(model.most_similar(positive=["man", "code"], negative=["woman"]))

[('mankind', 0.9052137732505798),
 ('radiopasteurization', 0.8901543617248535),
 ('stain', 0.8867220282554626),
 ('justice', 0.8858324289321899),
 ('conformity', 0.8846319913864136),
 ('plumbing', 0.8841564059257507),
 ('promise', 0.8839837312698364),
 ('Madden', 0.8839766979217529),
 ('fellowship', 0.8827660083770752),
 ('innovation', 0.8827584981918335)]


### Analogy

In [15]:
def analogy(x1, x2, y1):
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]
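
The idea behind `analogy(x1, x2, y1)` is vector arithmetic: form `y1 + x2 - x1` and return the word nearest to that point, skipping the three input words. Below is a minimal sketch on made-up two-dimensional vectors. `toy_analogy` and `toy` are hypothetical names, and the arithmetic is simplified: as far as I know, Gensim normalises the vectors and averages them rather than summing raw vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_analogy(vectors, x1, x2, y1):
    """x1 is to x2 as y1 is to ?  (nearest word to y1 + x2 - x1)."""
    target = [
        b - a + c
        for a, b, c in zip(vectors[x1], vectors[x2], vectors[y1])
    ]
    # Exclude the query words themselves, as most_similar does.
    candidates = [w for w in vectors if w not in {x1, x2, y1}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Made-up vectors: first axis ~ "royalty", second axis ~ "gender".
toy = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "prince": [1.0, 0.8],
    "apple": [-1.0, 0.0],
}

answer = toy_analogy(toy, "man", "king", "woman")
print(answer)
```

In these toy vectors `woman + king - man` lands exactly on `queen`. Real embeddings only approximate this, and a model trained on a corpus as small as Brown often misses it entirely.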

In [16]:
analogy('good', 'fantastic', 'bad')

'specifically'

In [17]:
analogy('bird', 'fly', 'human')

'development'

In [21]:
#which word in the list does not belong
print(model.doesnt_match("coke pepsi sprite water".split()))

coke
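
As far as I understand it, `doesnt_match` normalises each word vector to unit length, averages them, and reports the word least similar to that average. The sketch below follows that description on made-up two-dimensional vectors built so that `water` sits apart from the sodas. `odd_one_out` and `drinks` are hypothetical names, and this is not Gensim's code.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors, words):
    """Return the word whose unit vector is least aligned with the mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(units[words[0]])
    mean = [
        sum(u[i] for u in units.values()) / len(units)
        for i in range(dim)
    ]
    # Smallest dot product with the mean vector marks the outlier.
    return min(
        words,
        key=lambda w: sum(a * b for a, b in zip(units[w], mean)),
    )


# Made-up toy vectors for illustration only.
drinks = {
    "coke": [0.9, 0.1],
    "pepsi": [0.85, 0.15],
    "sprite": [0.8, 0.2],
    "water": [0.2, 0.9],
}

odd = odd_one_out(drinks, "coke pepsi sprite water".split())
print(odd)
```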


In [22]:
# Create a pickle of the model
import pickle

with open('glove_gensim.pkl', 'wb') as f:
    pickle.dump(model, f)