# Getting started with Word2Vec in Gensim and making it work!

<li> Word2Vec is a model that provides a numerical vector for a given word.
<li> The idea behind Word2Vec is pretty simple. We are making an assumption that you can tell the meaning of a word by the company it keeps.
<li> This is analogous to the saying *show me your friends, and I'll tell you who you are*. So if two words have very similar neighbors (i.e. the usage context is about the same), then these words are probably quite similar in meaning or are at least highly related.
<li> For example, the words `shocked`, `appalled` and `astonished` are typically used in a similar context.
<li> Using Gensim's downloader API, you can download pre-built word embedding models like word2vec, fastText, GloVe and ConceptNet. These are built on large corpora of commonly occurring text data such as Wikipedia, Google News etc.
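
To make the "company it keeps" idea concrete, here is a minimal sketch in plain Python that collects the neighbors of each word within a fixed window. It is only an illustration of the assumption, not Gensim's internals; the function name `context_counts`, the toy sentences and the window size of 2 are all made up for this example.

```python
from collections import Counter, defaultdict

def context_counts(sentences, window=2):
    """For every word, count the words that appear within `window` positions of it."""
    contexts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own neighbor
                    contexts[word][tokens[j]] += 1
    return contexts

# Hypothetical toy corpus: two sentences that differ in a single word
toy_sentences = [
    ["i", "was", "shocked", "by", "the", "news"],
    ["i", "was", "appalled", "by", "the", "news"],
]

contexts = context_counts(toy_sentences)
shared = set(contexts["shocked"]) & set(contexts["appalled"])
print(sorted(shared))
```

In this toy corpus `shocked` and `appalled` share all of their neighbors, which is the kind of signal Word2Vec picks up on when it places such words close together.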
    
In this tutorial, you will learn how to use the Gensim implementation of Word2Vec and actually get it to work! I have heard a lot of complaints about poor performance, but it's really a combination of two things: (1) your input data and (2) your parameter settings. Note that the training algorithms in this package were ported from the [original Word2Vec implementation by Google](https://arxiv.org/pdf/1301.3781.pdf) and extended with additional functionality.

In [1]:
from gensim.models.word2vec import Word2Vec
from multiprocessing import cpu_count
import gensim.downloader as api

### Dataset 
Next is our dataset. The secret to getting Word2Vec really working for you is to have lots and lots of text data.

However, if you are working in a specialized niche such as technical documents, the pre-built models may not have word embeddings for all of your words. In such cases it's desirable to train your own model.

Gensim's Word2Vec implementation lets you train your own word embedding model for a given corpus.

In [2]:
api.info("text8") 

{'num_records': 1701,
 'record_format': 'list of str (tokens)',
 'file_size': 33182058,
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py',
 'license': 'not found',
 'description': 'First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.',
 'checksum': '68799af40b6bda07dfa47a32612e5364',
 'file_name': 'text8.gz',
 'read_more': ['http://mattmahoney.net/dc/textdata.html'],
 'parts': 1}

In [3]:
# Download the dataset and materialize it as a list of token lists
dataset = api.load("text8")
data = list(dataset)

In [4]:
print(len(data))

1701


In [5]:
# Split the data into 2 parts. Part 2 will be used later to update the model
data_part1 = data[:1000]
data_part2 = data[1000:]

## Training the Word2Vec model

Training the model is fairly straightforward. You just instantiate Word2Vec and pass it the records we prepared in the previous step (`data_part1`). So we are essentially passing a list of lists, where each inner list contains the tokens of one record. Word2Vec uses all these tokens to build a vocabulary internally, and by vocabulary I mean the set of unique words.

Behind the scenes we are actually training a simple neural network with a single hidden layer. But we are not going to use the neural network itself after training. Instead, the goal is to learn the weights of the hidden layer, because these weights are essentially the word vectors that we're trying to learn.
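
To see why the hidden-layer weights can serve as word vectors, note that multiplying a one-hot input by the weight matrix simply selects one row of that matrix. The sketch below shows this in plain Python; the three-word vocabulary, the 3-dimensional weights and the names `hidden_weights`, `one_hot` and `hidden_activation` are made up for illustration and do not come from a trained model.

```python
# Hypothetical vocabulary and weight matrix (one row per word)
vocab = ["cat", "dog", "france"]
hidden_weights = [
    [0.1, 0.2, 0.3],  # row for "cat"
    [0.4, 0.5, 0.6],  # row for "dog"
    [0.7, 0.8, 0.9],  # row for "france"
]

def one_hot(word):
    """Encode a word as a vector with a single 1.0 at the word's index."""
    return [1.0 if w == word else 0.0 for w in vocab]

def hidden_activation(word):
    """Multiply the one-hot input by the weight matrix, column by column."""
    x = one_hot(word)
    return [sum(xi * wi for xi, wi in zip(x, column))
            for column in zip(*hidden_weights)]

# The hidden activation for "dog" is exactly the "dog" row of the matrix
print(hidden_activation("dog"))
print(hidden_weights[vocab.index("dog")])
```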

In [6]:
# Train the Word2Vec model. By default, the resulting vectors have 100 dimensions.
model = Word2Vec(data_part1, min_count=0, workers=cpu_count())

## Now, let's look at some output 
This first example looks up the vector for the word `dirty` and then the words most similar to it. For the latter, all we need to do is call the `most_similar` function and provide the word `dirty` as the positive example. By default, this returns the top 10 similar words.

In [7]:
# Get the word vector for given word
model.wv['dirty']

array([-0.27412024,  0.23787312,  0.3729944 , -0.08957253, -0.02634603,
        0.19135708, -0.25258502, -0.45909408,  0.20109582, -0.3300618 ,
        0.20776694,  0.370255  ,  0.4254122 , -0.02167911, -0.13819449,
       -0.04677937,  0.05262958, -0.23195945, -0.279957  ,  0.38832876,
       -0.17701848,  0.330062  , -0.22526477,  0.6756087 , -0.18569396,
        0.15856412,  0.21807009, -0.07632352, -0.08821549, -0.07251844,
        0.27780274, -0.08379967,  0.26500228,  0.5077719 ,  0.24512932,
        0.0725575 ,  0.34582794,  0.07727274,  0.0108995 , -0.10273159,
        0.58718973,  0.12509617, -0.12708673,  0.35315925,  0.39208713,
       -0.03298496, -0.17413591, -0.3658723 , -0.2539985 , -0.00218889,
       -0.09447198, -0.1655599 ,  0.0114888 , -0.13456692,  1.1735516 ,
       -0.40947646, -0.12112419,  0.8118783 , -0.0702237 ,  0.2821765 ,
       -0.0229708 , -0.1083468 ,  0.41518754, -0.34095782,  0.5043838 ,
        0.01344461,  0.23361465, -0.17059818, -0.06331053, -0.16

In [8]:
model.wv.most_similar('dirty')

[('crazy', 0.7999100685119629),
 ('grim', 0.7908047437667847),
 ('microhexura', 0.775342583656311),
 ('candy', 0.7723380923271179),
 ('whipped', 0.7654463648796082),
 ('gobbler', 0.7650245428085327),
 ('crying', 0.7625658512115479),
 ('goodbye', 0.7610893249511719),
 ('hungry', 0.7608863115310669),
 ('trash', 0.7570898532867432)]

In [9]:
# look up top 6 words similar to 'france'
w1 = ["france"]
model.wv.most_similar(positive=w1, topn=6)

[('spain', 0.8390151262283325),
 ('italy', 0.806950569152832),
 ('belgium', 0.7989537715911865),
 ('portugal', 0.7839831709861755),
 ('austria', 0.7657570838928223),
 ('hungary', 0.7552558183670044)]

In [22]:
# look up top 6 words similar to 'shocked'
w1 = ["shocked"]
model.wv.most_similar(positive=w1, topn=6)

[('outraged', 0.8761366605758667),
 ('beaten', 0.8317650556564331),
 ('surprised', 0.8276345729827881),
 ('upset', 0.8237363696098328),
 ('disappointed', 0.8218487501144409),
 ('expecting', 0.8026745915412903)]

In [23]:
# get words related to things on a bed, steering away from 'couch'
w1 = ["bed", "sheet", "pillow"]
w2 = ["couch"]
model.wv.most_similar(positive=w1, negative=w2, topn=10)

[('conveyor', 0.7666807174682617),
 ('plate', 0.7572683095932007),
 ('spinning', 0.7428174614906311),
 ('bag', 0.7353559136390686),
 ('coating', 0.7335289120674133),
 ('shaft', 0.7316291332244873),
 ('mist', 0.7315459251403809),
 ('quadrangular', 0.7294576168060303),
 ('surface', 0.7294378876686096),
 ('wind', 0.7287774085998535)]
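
A rough way to picture what `positive` and `negative` do: the positive vectors are added together, the negative ones are subtracted, and the remaining words are ranked by cosine similarity to the result. The sketch below follows that idea in plain Python with made-up 2-dimensional vectors. The names `cosine` and `rank_by_query` are hypothetical, and Gensim's actual implementation differs in detail (for instance in how it normalizes and averages the vectors).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_by_query(vectors, positive, negative, topn=2):
    """Add positive vectors, subtract negative ones, rank the other words by cosine."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    exclude = set(positive) | set(negative)  # do not return the query words themselves
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors, for illustration only
toy_vectors = {
    "bed": [1.0, 0.0],
    "pillow": [0.9, 0.1],
    "couch": [0.5, 0.5],
    "sheet": [0.8, 0.0],
    "blanket": [0.9, 0.2],
    "car": [0.0, 1.0],
}

results = rank_by_query(toy_vectors, positive=["bed", "pillow"], negative=["couch"])
print(results)
```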

### Similarity between two words in the vocabulary
You can even use the Word2Vec model to return the similarity between two words that are present in the vocabulary. 

In [29]:
# similarity between two different words
model.wv.similarity(w1="france", w2="spain")

0.84698147
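
The score returned by `similarity` is the cosine similarity of the two word vectors. Here is a minimal sketch of that calculation in plain Python, using made-up 3-dimensional vectors rather than vectors from the trained model; the function name `cosine_similarity` is an assumption for this example.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, for illustration only
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # points in the same direction as vec_a
vec_c = [-3.0, 0.0, 1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Vectors that point in the same direction score close to 1 regardless of their length, while orthogonal vectors score close to 0.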

### Find the odd one out
You can even use Word2Vec to find odd items given a list of items.

In [30]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["cat","dog","france"])

'france'
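
One way to picture `doesnt_match`: average the unit-length vectors of all the words in the list, then pick the word whose vector is least similar to that average. The sketch below follows that idea in plain Python with made-up 2-dimensional vectors. The helper names `unit` and `odd_one_out` are hypothetical, and Gensim's implementation may differ in detail.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """Return the word whose unit vector is least similar to the mean direction."""
    normed = {word: unit(vec) for word, vec in vectors.items()}
    dims = len(next(iter(normed.values())))
    mean = unit([sum(vec[i] for vec in normed.values()) / len(normed)
                 for i in range(dims)])
    # For unit vectors, the dot product with the unit mean is the cosine similarity
    return min(normed, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))

# Made-up vectors, for illustration only
toy_vectors = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "france": [0.1, 0.9],
}

print(odd_one_out(toy_vectors))
```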

In [9]:
# Save model
model.save('newmodel')

In [10]:
# Load Model
model = Word2Vec.load('newmodel')

# GloVe: Global Vectors for Word Representation
https://nlp.stanford.edu/projects/glove/

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.


In [10]:
import gensim.downloader as api
model = api.load("glove-twitter-25")  # load glove vectors

In [11]:
model.most_similar("cat") 

[('dog', 0.9590819478034973),
 ('monkey', 0.9203578233718872),
 ('bear', 0.9143137335777283),
 ('pet', 0.9108031392097473),
 ('girl', 0.8880630135536194),
 ('horse', 0.8872727155685425),
 ('kitty', 0.8870542049407959),
 ('puppy', 0.886769711971283),
 ('hot', 0.8865255117416382),
 ('lady', 0.8845518827438354)]