# GloVe (Gensim)

For looking at word vectors, we'll use **Gensim**. **Gensim** isn't really a deep learning package. It's a package for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. But it's efficient and scalable, and quite widely used.

We are going to use **GloVe** embeddings, downloaded from [the GloVe page](https://nlp.stanford.edu/projects/glove/). They're inside [this zip file](https://nlp.stanford.edu/data/glove.6B.zip). Note that the cells below do not load that file: they train a small Word2Vec model on the NLTK Brown corpus, so the neighbours they print are noisy.
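To make the file format concrete, here is a minimal sketch of reading GloVe-style text, where each line holds a word followed by its vector components separated by spaces, with no header line. The helper `parse_glove_lines` and the three-dimensional sample are hypothetical and only for illustration; the real files in the zip (for example `glove.6B.100d.txt`) are much larger. In Gensim 4.x the usual route is probably `KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)`, but check the documentation for your version.

```python
import io


def parse_glove_lines(lines):
    """Parse GloVe-style text lines into a {word: [floats]} dict (hypothetical helper)."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        vectors[word] = [float(v) for v in values]
    return vectors


# Tiny made-up sample standing in for a few lines of a real GloVe file.
sample = io.StringIO(
    "coffee 0.1 0.2 0.3\n"
    "tea 0.1 0.25 0.28\n"
)
toy_vectors = parse_glove_lines(sample)
print(len(toy_vectors), len(toy_vectors["coffee"]))
```

A plain dict like this is fine for inspecting a handful of words; for nearest-neighbour queries over the full vocabulary, Gensim's `KeyedVectors` is the better fit.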

In [1]:
import nltk

from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors

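# Download the Brown corpus (a no-op if it is already present)
nltk.download('brown')
from nltk.corpus import brown

# The corpus as a sequence of tokenized sentences
corpus = brown.sents()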

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\swara\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [2]:
# Train a Word2Vec model on the Brown sentences
trained_model = Word2Vec(corpus, vector_size=100, window=5, min_count=2, workers=4)

# Save the word vectors in word2vec text format for later use
trained_model.wv.save_word2vec_format("brown_corpus_word2vec_format.txt", binary=False)

# Load the vectors back as KeyedVectors for querying
model = KeyedVectors.load_word2vec_format("brown_corpus_word2vec_format.txt", binary=False)

In [3]:
# look up the vector for a word; its length equals vector_size (100)
model['coffee'].shape

(100,)

### Similarity

In [4]:
model.most_similar('coffee')

[('beer', 0.971849262714386),
 ('mud', 0.9710080027580261),
 ('cloth', 0.9670559167861938),
 ('bid', 0.9668721556663513),
 ('elephants', 0.9642710089683533),
 ('putt', 0.9636533856391907),
 ('bone', 0.9619432091712952),
 ('target', 0.961246132850647),
 ('cab', 0.9603875875473022),
 ('command', 0.960010826587677)]

In [5]:
model.most_similar('language')

[('comparison', 0.9648732542991638),
 ('congregation', 0.9622362852096558),
 ('character', 0.9620834589004517),
 ('danger', 0.9600833058357239),
 ('analysis', 0.9596493244171143),
 ('proof', 0.958351194858551),
 ('tradition', 0.9560423493385315),
 ('ultimate', 0.9554263949394226),
 ('humanity', 0.9541069865226746),
 ('theory', 0.9535269141197205)]

In [6]:
# "plant" has multiple meanings (e.g. a factory vs. vegetation)
model.most_similar("plant")

[('energy', 0.9677100777626038),
 ('frame', 0.957679271697998),
 ('annual', 0.9556729197502136),
 ('location', 0.9556114077568054),
 ('central', 0.9546933770179749),
 ('forming', 0.9532230496406555),
 ('enterprise', 0.9522820711135864),
 ('diffusion', 0.9509390592575073),
 ('aid', 0.9498111009597778),
 ('farm', 0.9496405720710754)]

In [7]:
#woman + king - man
result = model.most_similar(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

followers: 0.9490


In [8]:
#woman + code - man
result = model.most_similar(positive=['woman', 'code'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

organized: 0.9657


### Cosine Similarity

We talked about this in the last class. Here we can conveniently use `distance` to get the cosine distance between two words. Note that distance = 1 - similarity, so a smaller distance means the two words are more similar.
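As a quick refresher on what `distance` computes, here is a minimal pure-Python sketch of cosine similarity and cosine distance. The function names and the two-dimensional toy vectors are made up for illustration; this is not how Gensim is implemented internally.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """distance = 1 - similarity"""
    return 1.0 - cosine_similarity(u, v)


u = [1.0, 0.0]
v = [0.0, 1.0]  # orthogonal to u
w = [2.0, 0.0]  # same direction as u, different length

print(cosine_distance(u, v))  # orthogonal vectors: 1.0
print(cosine_distance(u, w))  # same direction: 0.0
```

Note that only the direction of the vectors matters, not their length, which is why `u` and `w` come out at distance 0.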

In [9]:
w1 = "dog"
w2 = "cat"
w3 = "fruit"
w1_w2_dist = model.distance(w1, w2)
w1_w3_dist = model.distance(w1, w3)

#we would expect dog to be much closer to cat than to fruit,
#but with this model the dog-fruit distance comes out smaller
w1_w2_dist, w1_w3_dist

(0.13251203298568726, 0.08375269174575806)

In [10]:
w1 = "happy" # synonym 1
w2 = "cheerful" # synonym 2
w3 = "sad" # antonym
w1_w2_dist = model.distance(w1, w2)
w1_w3_dist = model.distance(w1, w3)

#"happy" is closer to "sad" (antonym) than to "cheerful" (synonym)!!
#this kind of similarity does not distinguish antonyms from synonyms....
w1_w2_dist, w1_w3_dist

(0.28125160932540894, 0.08064490556716919)

### Bias

You guys, one very important thing to keep in mind: NLP models are biased. They pick up whatever associations are in the text they were trained on, and that can be very bad.

In [11]:
import pprint

pprint.pprint(model.most_similar(positive=['woman', 'worker'], negative=['man']))

[('tract', 0.9729620814323425),
 ('Control', 0.9661584496498108),
 ('African', 0.9654223918914795),
 ('Model', 0.9647645950317383),
 ('spectacular', 0.9641609787940979),
 ('megatons', 0.963944137096405),
 ('splendid', 0.9634275436401367),
 ('biggest', 0.9634273052215576),
 ('assembly', 0.9632208943367004),
 ('WTV', 0.9631377458572388)]


In [12]:
pprint.pprint(model.most_similar(positive=['man', 'worker'], negative=['woman']))

[('originate', 0.878252387046814),
 ('interfere', 0.8769871592521667),
 ('Everyone', 0.8759522438049316),
 ('promise', 0.8743034601211548),
 ('stress', 0.8733978271484375),
 ('stain', 0.8725126385688782),
 ('concentrate', 0.8714606165885925),
 ('Hiroshima', 0.868313193321228),
 ('greatness', 0.8671536445617676),
 ('sign', 0.8631444573402405)]


In [13]:
pprint.pprint(model.most_similar(positive=["woman", "doctor"], negative=["man"]))

[('device', 0.9400398135185242),
 ('Lord', 0.9387309551239014),
 ('abandoned', 0.9359316229820251),
 ('accepted', 0.935127317905426),
 ('ended', 0.9344491362571716),
 ('conductor', 0.9341519474983215),
 ('magnificent', 0.9319649934768677),
 ('won', 0.929638147354126),
 ('generation', 0.9288925528526306),
 ('released', 0.9281613230705261)]


In [14]:
pprint.pprint(model.most_similar(positive=["human", "code"], negative=["woman"]))

[('economic', 0.9032350182533264),
 ('national', 0.8749502897262573),
 ('assistance', 0.869625985622406),
 ('social', 0.8682600855827332),
 ('management', 0.86331707239151),
 ('facilities', 0.8590260744094849),
 ('industrial', 0.8562296032905579),
 ('practical', 0.8549009561538696),
 ('discrimination', 0.8546333909034729),
 ('economies', 0.8498783707618713)]


### Analogy

In [15]:
def analogy(x1, x2, y1):
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]
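To see what the vector arithmetic behind `analogy` is doing, here is a small sketch on hand-made three-dimensional vectors. It forms `y1 + x2 - x1` on unit-length vectors and returns the remaining word with the highest cosine similarity. The helper names and the toy vectors are hypothetical, and this only approximates what `most_similar` does; it is not Gensim's actual code.

```python
import math


def unit(v):
    """Scale v to length 1."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def toy_analogy(vectors, x1, x2, y1):
    """x1 is to x2 as y1 is to ?  (target = y1 + x2 - x1)"""
    ux1, ux2, uy1 = unit(vectors[x1]), unit(vectors[x2]), unit(vectors[y1])
    target = unit([c + b - a for a, b, c in zip(ux1, ux2, uy1)])
    best_word, best_score = None, -2.0
    for word, vec in vectors.items():
        if word in (x1, x2, y1):
            continue  # never return one of the query words
        score = sum(a * b for a, b in zip(target, unit(vec)))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Made-up dimensions: [royalty, female, person]
toy = {
    "man": [0.0, 0.0, 1.0],
    "woman": [0.0, 1.0, 1.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [-1.0, 0.5, 0.2],
}

print(toy_analogy(toy, "man", "king", "woman"))
```

With clean vectors like these the answer is "queen"; the quality of real answers depends entirely on how good the trained embeddings are.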

In [16]:
analogy('good', 'fantastic', 'bad')

'specifically'

In [17]:
analogy('bird', 'fly', 'human')

'development'

In [18]:
#which word in the list does not belong
print(model.doesnt_match("coke pepsi sprite water".split()))

coke
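One common way to implement this kind of odd-one-out query is to average the unit-length vectors of all the words and return the word that is least similar to that average. The sketch below does this on made-up two-dimensional vectors. The helper names and the numbers are hypothetical, and it is an assumption that this matches Gensim's `doesnt_match` in every detail.

```python
import math


def unit(v):
    """Scale v to length 1."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def toy_doesnt_match(vectors, words):
    """Return the word whose unit vector is least similar to the mean direction."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Made-up vectors: the three sodas point roughly the same way, water does not.
toy = {
    "coke": [1.0, 0.1],
    "pepsi": [0.9, 0.2],
    "sprite": [0.8, 0.1],
    "water": [0.1, 1.0],
}

print(toy_doesnt_match(toy, "coke pepsi sprite water".split()))
```

On these toy vectors the odd one out is "water". The model above picked "coke" instead, which suggests its vectors for these words are not very reliable.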


In [19]:
# Create a pickle of the model
import pickle

with open('glove_gensim.pkl', 'wb') as f:
    pickle.dump(model, f)