# GloVe (Gensim)

For looking at word vectors, we'll use **Gensim**. **Gensim** isn't really a deep learning package. It's a package for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. But it's efficient and scalable, and quite widely used. We are going to use **GloVe** embeddings, downloaded from [the GloVe page](https://nlp.stanford.edu/projects/glove/). They're inside [this zip file](https://nlp.stanford.edu/data/glove.6B.zip). Note that the cells below, as written, train a small Word2Vec model on the Brown corpus instead of loading the GloVe file, so the outputs shown come from that much smaller model.
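
The GloVe files inside the zip (for example `glove.6B.100d.txt`) are plain text: each line holds a word followed by its vector components, separated by spaces, with no header line. As a minimal sketch of that layout, the hypothetical helper `load_glove_text` below parses it into a dictionary; the two sample lines are made-up values, not real GloVe vectors. Gensim 4.x should also be able to read the same file directly via `KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)`, though that call is not exercised here.

```python
import io


def load_glove_text(handle):
    """Parse GloVe-style text: one word per line, followed by its float components."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny made-up sample in the same layout as a glove.6B.*.txt file (illustrative values only)
sample = io.StringIO("coffee 0.1 0.2 0.3\ntea 0.1 0.25 0.28\n")
glove = load_glove_text(sample)
print(len(glove), len(glove["coffee"]))
```

For the real file, the same function can be passed an open file handle (opened with `encoding="utf8"`) instead of the `StringIO` sample.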

In [1]:
import nltk
import pprint

from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors

from nltk.corpus import brown
nltk.download('brown')

corpus = brown.sents()

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\swara\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [2]:
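# Train a small Word2Vec model on the Brown corpus (100-dimensional vectors,
# context window of 5, ignoring words that appear fewer than 2 times)
trained_model = Word2Vec(corpus, vector_size=100, window=5, min_count=2, workers=4)

# Save the word vectors in word2vec text format for later use
trained_model.wv.save_word2vec_format("brown_corpus_word2vec_format.txt", binary=False)

# Load the vectors back as KeyedVectors for testing
model = KeyedVectors.load_word2vec_format("brown_corpus_word2vec_format.txt", binary=False)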

In [3]:
# look up the vector for a word and check its dimensionality
model['coffee'].shape

(100,)

### Similarity

In [4]:
model.most_similar('coffee')

[('beer', 0.9718764424324036),
 ('mud', 0.9709972143173218),
 ('cloth', 0.9670196771621704),
 ('bid', 0.9667823314666748),
 ('elephants', 0.9642915725708008),
 ('putt', 0.963595986366272),
 ('bone', 0.9619939923286438),
 ('target', 0.9612566828727722),
 ('cab', 0.9603590965270996),
 ('command', 0.960012674331665)]

In [5]:
model.most_similar('language')

[('comparison', 0.9648641347885132),
 ('congregation', 0.962188720703125),
 ('character', 0.9620252251625061),
 ('danger', 0.9601103663444519),
 ('analysis', 0.9596117734909058),
 ('proof', 0.9583585858345032),
 ('tradition', 0.9559952020645142),
 ('ultimate', 0.9554356336593628),
 ('humanity', 0.9540989995002747),
 ('theory', 0.9535123705863953)]

In [6]:
# "plant" has multiple meanings (factory vs. vegetation), so its neighbours can mix senses
model.most_similar("plant")

[('energy', 0.9676952958106995),
 ('frame', 0.9576594829559326),
 ('annual', 0.9557315111160278),
 ('location', 0.9555957913398743),
 ('central', 0.9546601176261902),
 ('forming', 0.953223466873169),
 ('enterprise', 0.9522941708564758),
 ('diffusion', 0.9510025978088379),
 ('aid', 0.949894368648529),
 ('farm', 0.9496124386787415)]

In [7]:
#woman + king - man
result = model.most_similar(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

followers: 0.9490
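
To make the `positive`/`negative` arithmetic concrete, here is a minimal pure-Python sketch of the idea: add the unit vectors of the positive words, subtract those of the negative words, then rank the remaining words by cosine similarity to the result. The toy 3-dimensional vectors and the `most_similar` function below are hypothetical stand-ins, not Gensim's implementation (Gensim averages rather than sums, which gives the same direction and hence the same ranking).

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the query words themselves
    scored = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors; dimensions loosely mean (royalty, female, person)
toy = {
    "man": [0.0, 0.0, 1.0],
    "woman": [0.0, 1.0, 1.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "prince": [1.0, 0.0, 0.5],
}

result = most_similar(toy, positive=["woman", "king"], negative=["man"])
print("{}: {:.4f}".format(*result[0]))
```

With these hand-built vectors the top answer is "queen"; the Brown-trained model above is too small to recover that.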


In [8]:
#woman + code - man
result = model.most_similar(positive=['woman', 'code'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

organized: 0.9657


### Cosine Similarity

We talked about this in the last class. Here we can conveniently use `distance` to find the cosine distance between two words. Note that distance = 1 - similarity, so a smaller distance means the two words are more similar.
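
As a quick reminder of what is being computed, here is a minimal sketch of cosine similarity and the corresponding distance on hand-picked 2-dimensional vectors. The function names and the vectors are illustrative assumptions, not part of Gensim.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """distance = 1 - similarity"""
    return 1.0 - cosine_similarity(u, v)


a = [1.0, 0.0]
b = [1.0, 1.0]  # 45 degrees away from a
c = [0.0, 1.0]  # orthogonal to a

print(cosine_distance(a, b), cosine_distance(a, c))
```

Vectors pointing in the same direction have distance 0, orthogonal vectors have distance 1, and opposite vectors have distance 2.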

In [9]:
w1 = "dog"
w2 = "cat"
w3 = "fruit"
w1_w2_dist = model.distance(w1, w2)
w1_w3_dist = model.distance(w1, w3)

#we expect dog to be much closer to cat than to fruit,
#but with this small Brown-trained model the distance to fruit comes out smaller
w1_w2_dist, w1_w3_dist

(0.13264435529708862, 0.08378010988235474)

In [10]:
w1 = "happy" # synonym 1
w2 = "cheerful" # synonym 2
w3 = "sad" # antonym
w1_w2_dist = model.distance(w1, w2)
w1_w3_dist = model.distance(w1, w3)

#"happy" is closer to "sad" than to "cheerful"!
#this kind of similarity does not handle antonyms: they appear in similar contexts, so they end up close together
w1_w2_dist, w1_w3_dist

(0.2810835838317871, 0.08059042692184448)

### Bias

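One very important thing to keep in mind is that NLP models are biased: embeddings pick up whatever associations (for example gender stereotypes) are present in the training text. The analogy queries below probe for this.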

In [11]:
pprint.pprint(model.most_similar(positive=['woman', 'worker'], negative=['man']))

[('tract', 0.9729741215705872),
 ('Control', 0.9661687016487122),
 ('African', 0.9653110504150391),
 ('Model', 0.9647584557533264),
 ('spectacular', 0.964256227016449),
 ('megatons', 0.9639296531677246),
 ('biggest', 0.9634395837783813),
 ('splendid', 0.9634092450141907),
 ('WTV', 0.9632601737976074),
 ('assembly', 0.9631929993629456)]


In [12]:
pprint.pprint(model.most_similar(positive=['man', 'worker'], negative=['woman']))

[('originate', 0.8781259059906006),
 ('interfere', 0.8769797682762146),
 ('Everyone', 0.8759639859199524),
 ('promise', 0.8743672370910645),
 ('stress', 0.8733637928962708),
 ('stain', 0.872747004032135),
 ('concentrate', 0.8715300559997559),
 ('Hiroshima', 0.8685178160667419),
 ('greatness', 0.8672943115234375),
 ('sign', 0.8632235527038574)]


In [13]:
pprint.pprint(model.most_similar(positive=["woman", "doctor"], negative=["man"]))

[('device', 0.9401828646659851),
 ('Lord', 0.9385852217674255),
 ('abandoned', 0.9358891248703003),
 ('accepted', 0.9350557327270508),
 ('ended', 0.9345287084579468),
 ('conductor', 0.9343954920768738),
 ('magnificent', 0.9321774840354919),
 ('won', 0.9295370578765869),
 ('generation', 0.9290923476219177),
 ('released', 0.9281633496284485)]


In [14]:
pprint.pprint(model.most_similar(positive=["human", "code"], negative=["woman"]))

[('economic', 0.9033142924308777),
 ('national', 0.8750497102737427),
 ('assistance', 0.8694512844085693),
 ('social', 0.868399441242218),
 ('management', 0.8633675575256348),
 ('facilities', 0.8589795231819153),
 ('industrial', 0.8562678098678589),
 ('practical', 0.8548918962478638),
 ('discrimination', 0.8547709584236145),
 ('economies', 0.8499754667282104)]


### Analogy

In [15]:
def analogy(x1, x2, y1):
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]

In [16]:
analogy('good', 'fantastic', 'bad')

'specifically'

In [17]:
analogy('bird', 'fly', 'human')

'development'

In [18]:
#which word in the list does not belong
print(model.doesnt_match("coke pepsi sprite water".split()))

coke
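
One way to think about `doesnt_match` is: take the average of the (unit-length) vectors of all the words, then return the word that is least similar to that average. The sketch below follows that idea in pure Python; the function body and the toy 2-dimensional vectors are assumptions for illustration and are not copied from Gensim.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_match(vectors, words):
    """Return the word whose vector is least similar to the mean of all the words."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(units[words[0]])
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # cosine similarity reduces to a dot product because both sides are unit length
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Made-up vectors: the three sodas point one way, water points another
toy = {
    "coke": [1.0, 0.1],
    "pepsi": [0.9, 0.2],
    "sprite": [0.95, 0.15],
    "water": [0.1, 1.0],
}

odd_one = doesnt_match(toy, "coke pepsi sprite water".split())
print(odd_one)
```

On these toy vectors the odd one out is "water", which is the answer one would hope for from well-trained embeddings; the Brown-trained model above returns "coke" instead.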


In [19]:
# Create a pickle of the model
import pickle

with open('../../app/models/glove_gensim/glove_gensim.pkl', 'wb') as f:
    pickle.dump(model, f)
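
To use the pickled model later (for example from the app), it has to be read back with `pickle.load`. The sketch below shows the round trip using a plain dictionary as a hypothetical stand-in for the `KeyedVectors` object and a temporary directory instead of the real path. Only unpickle files you trust, since loading a pickle can execute arbitrary code. Gensim also offers its own `model.save(path)` and `KeyedVectors.load(path)`, which may be the safer choice across Gensim versions.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the KeyedVectors object saved above
stand_in_model = {"coffee": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "glove_gensim.pkl")

    # Write the object to disk
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Read it back, as the consuming code would do
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```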