
### Package Version
- gensim==3.6.0

# GloVe: Global Vectors for Word Representation
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
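
To make "global word-word co-occurrence statistics" concrete, here is a minimal sketch of how such counts might be gathered from a tokenized corpus. The function name `cooccurrence_counts`, the window size, and the toy sentence are illustrative assumptions, and the `1/distance` weighting is one common choice rather than a statement of how any particular pre-trained vectors were built.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, window=2):
    """Accumulate symmetric, distance-weighted co-occurrence counts."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        # Look only to the right and update both directions,
        # which keeps the table symmetric.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            weight = 1.0 / (j - i)  # assumption: closer pairs count more
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight
    return counts


tokens = "the cat sat on the mat".split()  # toy corpus, not real data
counts = cooccurrence_counts(tokens, window=2)
print(counts[("cat", "sat")])
print(counts[("the", "sat")])
```

GloVe then fits word vectors so that their dot products approximate the logarithm of these counts; that training step is not shown here.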

### Download pre-trained word vectors

The pre-trained word vectors are made available under the Public Domain Dedication and License v1.0, whose full text can be found at: http://www.opendatacommons.org/licenses/pddl/1.0/.

- Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download): [glove.6B.zip](http://nlp.stanford.edu/data/glove.6B.zip)
- Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download): glove.42B.300d.zip
- Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download): glove.840B.300d.zip
- Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download): glove.twitter.27B.zip

### Load GloVe Embeddings
- The GloVe text file must first be converted to the word2vec text format
- The conversion writes a new file in word2vec format and leaves the original untouched

In [None]:
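from gensim.scripts.glove2word2vec import glove2word2vec

# Extracted from glove.6B.zip; must be in the working directory
glove_file = 'glove.6B.300d.txt'
output_file_word2vec = 'glove.6B.300d.txt.word2vec'
# Writes a copy of the GloVe file in word2vec text format
glove2word2vec(glove_file, output_file_word2vec)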
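
For context, the two text formats differ only in that a word2vec file begins with a header line giving the vocabulary size and the vector dimension. The sketch below shows that idea on a tiny in-memory example; `add_word2vec_header` and the two toy rows are hypothetical and are not part of gensim.

```python
import io


def add_word2vec_header(glove_lines):
    """Return the rows preceded by a '<vocab_size> <dim>' header line."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    vocab_size = len(rows)
    # Each row is "<word> <v1> <v2> ...", so dim = number of fields - 1.
    dim = len(rows[0].split(" ")) - 1
    return ["{} {}".format(vocab_size, dim)] + rows


# Toy stand-in for a GloVe file (made-up numbers, not real embeddings)
toy_glove = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
converted = add_word2vec_header(toy_glove)
print("\n".join(converted))
```

For the real 300-dimensional file, `glove2word2vec` remains the simpler option.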

- We have created a file named `glove.6B.300d.txt.word2vec`
- Now we can load this like a word2vec file

In [None]:
from gensim.models import KeyedVectors

file_name = 'glove.6B.300d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(file_name, binary=False)

### Check out what the embedding looks like
- Let's take one word and check its embedding

In [None]:
word_embedding = model['cat']
print("Shape: {}".format(word_embedding.shape))
print("Word Embedding: {}".format(word_embedding))

### What happens if a word is not in the vocabulary?

In [None]:
word = 'pending'
# Membership test avoids the KeyError that model[word] raises for unknown words
if word in model:
    print('{} is in the model'.format(word))
else:
    print('{} is NOT in the model'.format(word))
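
One way to handle missing words in practice is to wrap the lookup in a small helper. The sketch below uses a plain dictionary as a stand-in for the loaded model; `lookup_or_default` and the zero-vector fallback are assumptions made for illustration, not a gensim feature. It also lowercases the word first, since the `glove.6B` vectors are uncased.

```python
def lookup_or_default(vectors, word, dim):
    """Return the vector for `word`, or a zero vector if it is unknown."""
    key = word.lower()  # glove.6B is uncased, so normalize the query
    if key in vectors:
        return vectors[key]
    return [0.0] * dim  # assumption: zeros are an acceptable fallback


# Toy stand-in for the model (made-up numbers)
toy_vectors = {"cat": [0.1, 0.2, 0.3]}
print(lookup_or_default(toy_vectors, "Cat", 3))
print(lookup_or_default(toy_vectors, "zzyzx", 3))
```

Whether a zero vector, a random vector, or skipping the word is the right choice depends on the downstream task.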

### Check the words most similar to a particular word

In [None]:
model.most_similar(positive=['learning'])

### Check the words that are like X but unlike Y

In [None]:
model.most_similar(positive=['woman', 'king'], negative=['man'])
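
The call above can be read as vector arithmetic: add the `positive` vectors, subtract the `negative` ones, and rank the remaining words by cosine similarity to the result. The following is a simplified sketch on hand-made 3-dimensional vectors; the `analogy` helper and the toy numbers are assumptions, and gensim's own implementation differs in details such as normalization.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    # Query words are excluded from the candidates.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(target, v))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy axes: [royalty, male, female]; made-up numbers, not GloVe values
toy = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "prince": [0.8, 1.0, 0.0],
}
print(analogy(toy, positive=["woman", "king"], negative=["man"]))
```

With these toy vectors, `queen` ranks first because it points in the same direction as `king - man + woman`.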

### Which word from the given list doesn’t go with the others?

In [None]:
model.doesnt_match("breakfast cereal dinner lunch".split())
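
A rough way to think about this query: normalize each word vector, average them, and report the word whose vector is least aligned with that average. The sketch below follows that idea on toy 2-dimensional vectors; `odd_one_out` and the numbers are illustrative assumptions rather than gensim code.

```python
import math


def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors, words):
    """Return the word least similar to the mean of the unit vectors."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[k] for u in units.values()) / len(units)
                 for k in range(dim)])
    # The lowest dot product with the mean marks the outlier.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Toy vectors (made-up numbers): three meals cluster, "cereal" does not
toy = {
    "breakfast": [1.0, 0.1],
    "lunch": [0.9, 0.2],
    "dinner": [1.0, 0.2],
    "cereal": [0.2, 1.0],
}
odd = odd_one_out(toy, "breakfast cereal dinner lunch".split())
print(odd)
```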

### Compute cosine similarity between two words

In [None]:
model.similarity('woman', 'man')

In [None]:
model.similarity('woman', 'woman')

In [None]:
model.similarity('apple', 'google')

In [None]:
model.similarity('apple', 'orange')
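
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction. The short sketch below uses made-up vectors to show why comparing a word with itself yields a similarity of 1; `cosine_similarity` is a hypothetical helper, not the gensim method.

```python
import math


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors, not real embeddings
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [-3.0, 0.0, 1.0]  # orthogonal to a

print(cosine_similarity(a, a))
print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Identical or parallel vectors score (up to floating-point rounding) 1, and orthogonal vectors score 0.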

### Compute cosine similarity between two sets of words

In [None]:
model.n_similarity(['man', 'woman'], ['girl', 'boy'])

In [None]:
model.n_similarity(['google', 'amazon'], ['facebook', 'microsoft'])
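
Set-to-set similarity can be understood as averaging the vectors in each set and then taking the cosine similarity of the two averages. The sketch below illustrates that reading with toy 2-dimensional vectors; `set_similarity` is a hypothetical helper and the numbers are made up.

```python
import math


def mean_vector(vectors, words):
    dim = len(vectors[words[0]])
    return [sum(vectors[w][k] for w in words) / len(words) for k in range(dim)]


def set_similarity(vectors, words_a, words_b):
    """Cosine similarity between the mean vectors of two word sets."""
    u = mean_vector(vectors, words_a)
    v = mean_vector(vectors, words_b)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (made-up numbers)
toy = {
    "man": [1.0, 0.0],
    "woman": [0.0, 1.0],
    "person": [2.0, 2.0],
}
score = set_similarity(toy, ["man", "woman"], ["person"])
print(score)
```

Here the average of `man` and `woman` points in the same direction as `person`, so the score is approximately 1 even though neither word alone matches it exactly.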

### Find the top-N most similar words by vector

In [None]:
model.similar_by_vector(model['king'] - model['queen'] + model['woman'], topn=3)