## Building word2vec model using Gensim

Now that we understand how the word2vec model works, let us see how to build one using the gensim library. Gensim is a popular open-source Python package widely used for building vector space models. It can be installed via pip by typing the following command in the terminal:

pip install -U gensim

Now, we will build the word2vec model using gensim, starting with the required imports:

In [None]:
import warnings
warnings.filterwarnings('ignore')

#data processing
import pandas as pd
import re
from nltk.corpus import stopwords
stopWords = stopwords.words('english')

#modelling
from gensim.models import Word2Vec
from gensim.models import Phrases
from gensim.models.phrases import Phraser

## Load the Data

Load the dataset. The dataset used in this section is available in the data folder as text.zip.

In [None]:
data = pd.read_csv('text.csv',header=None)

Let us see what we got in our data:

In [None]:
data.head()

Unnamed: 0,0
0,room kind clean strong smell dogs. generally a...
1,stayed crown plaza april april . staff friendl...
2,booked hotel hotwire lowest price could find. ...
3,stayed husband sons way alaska cruise. loved h...
4,girlfriends stayed celebrate th birthdays. pla...


## Preprocess and prepare the dataset

Define a function for preprocessing the data:

In [None]:
def pre_process(text):

    #convert to lowercase
    text = str(text).lower()

    #remove special characters; keep only alphanumeric characters, whitespace, and periods (periods are used later to split sentences)
    text = re.sub(r'[^A-Za-z0-9\s.]',r'',text)

    #remove new lines
    text = re.sub(r'\n',r' ',text)

    # remove stop words
    text = " ".join([word for word in text.split() if word not in stopWords])

    return text

Let us see what the preprocessed text looks like:

In [None]:
pre_process(data[0][50])

'agree fancy. everything needed. breakfast pool hot tub nice shuttle airport later checkout time. noise issue tough sleep through. awhile forget noisy door nearby noisy guests. complained management later email credit compd us amount requested would return.'
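
The cleaning steps can also be tried in isolation, without downloading the NLTK stop-word list. The following is a minimal sketch only: `pre_process_sketch` and the tiny `STOP_WORDS` set are hypothetical stand-ins, not part of the notebook, and the sample sentence is made up.

```python
import re

# Hypothetical, deliberately tiny stop-word set (the notebook uses NLTK's English list)
STOP_WORDS = {"the", "was", "and", "a", "it", "we", "to"}

def pre_process_sketch(text, stop_words=STOP_WORDS):
    # convert to lowercase
    text = str(text).lower()
    # keep letters, digits, whitespace and periods (periods mark sentence ends)
    text = re.sub(r"[^a-z0-9\s.]", "", text)
    # split() breaks on any whitespace, so newlines are handled here as well
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "The room was CLEAN, and the staff was friendly!\nWe'd return."
print(pre_process_sketch(sample))  # room clean staff friendly wed return.
```

Note that the apostrophe is stripped before stop words are removed, so "We'd" turns into "wed". The same effect explains tokens such as "youre" in the preprocessed reviews.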

Preprocess the whole dataset:

In [None]:
data[0] = data[0].map(lambda x: pre_process(x))

After preprocessing, our dataset looks like this:

In [None]:
data[0].head()

0    room kind clean strong smell dogs. generally a...
1    stayed crown plaza april april . staff friendl...
2    booked hotel hotwire lowest price could find. ...
3    stayed husband sons way alaska cruise. loved h...
4    girlfriends stayed celebrate th birthdays. pla...
Name: 0, dtype: object

Gensim requires its input in the form of a list of lists, i.e.:

text = [ [word1, word2, word3], [word1, word2, word3] ]

Each row in our data contains several sentences, so we first split a row on '.' to get a list of sentences:

In [None]:
data[0][1].split('.')[:5]

['stayed crown plaza april april ',
 ' staff friendly attentive',
 ' elevators tiny ',
 ' food restaurant delicious priced little high side',
 ' course washington dc']

Now we have the sentences in a list, but we need a list of lists. So we split each sentence again on whitespace. That is, we first split the data on '.' and then split each resulting sentence into words:

In [None]:
corpus = []
for line in data[0][1].split('.'):
    words = [x for x in line.split()]
    corpus.append(words)

As you can see below, we have our inputs in the form of lists of lists:

In [None]:
corpus[:2]

[['stayed', 'crown', 'plaza', 'april', 'april'],
 ['staff', 'friendly', 'attentive']]

Now convert the whole dataset into a list of lists to build the corpus. Here, the corpus is simply the collection of all tokenized sentences:

In [None]:
data = data[0].map(lambda x: x.split('.'))

corpus = []
for i in (range(len(data))):
    for line in data[i]:
        words = [x for x in line.split()]
        corpus.append(words)

corpus[:2]

[['room', 'kind', 'clean', 'strong', 'smell', 'dogs'],
 ['generally', 'average', 'ok', 'overnight', 'stay', 'youre', 'fussy']]
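
One detail of splitting on '.' is that a trailing period produces an empty string, which in turn becomes an empty list in the corpus. The sketch below shows the same two-level split with such empty sentences skipped. The helper name `build_corpus` and the sample documents are hypothetical, and skipping empty sentences is a suggested refinement rather than something the notebook does.

```python
def build_corpus(documents):
    """Split each document into sentences on '.', then each sentence into words."""
    corpus = []
    for doc in documents:
        for sentence in doc.split("."):
            words = sentence.split()
            if words:  # skip empty sentences, e.g. the one after a trailing period
                corpus.append(words)
    return corpus

docs = ["room kind clean. generally average ok.", "staff friendly attentive"]
print(build_corpus(docs))
# [['room', 'kind', 'clean'], ['generally', 'average', 'ok'], ['staff', 'friendly', 'attentive']]
```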

The problem now is that our corpus contains only unigrams, so the model will not return results when we query it with a bigram such as 'san francisco'.

So we use gensim's Phrases class, which detects words that frequently occur together and joins them with an underscore, so that 'san francisco' becomes 'san_francisco'. We set min_count to 25, which means words and bigrams that appear fewer than 25 times are ignored, and threshold to 50, which means only word pairs whose co-occurrence score exceeds 50 are joined.

In [None]:
phrases = Phrases(sentences=corpus,min_count=25,threshold=50)
bigram = Phraser(phrases)

In [None]:
for index,sentence in enumerate(corpus):
    corpus[index] = bigram[sentence]

As you can see below, an underscore now joins the bigrams in our corpus:

In [None]:
corpus[111]

['connected', 'rivercenter', 'mall', 'downtown', 'san_antonio']

In [None]:
corpus[9]

['course', 'washington_dc']
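
To get a feel for how min_count and threshold decide which pairs are joined, here is a rough sketch of a count-based bigram score in the spirit of the one from the original word2vec phrase work: (count(a, b) - min_count) / (count(a) * count(b)), scaled by the number of distinct words. The function names are hypothetical and the scaling factor is an assumption, so gensim's exact scores may differ.

```python
from collections import Counter

def bigram_scores(sentences, min_count):
    """Score every adjacent word pair; higher means the pair behaves more like a phrase."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    scale = len(unigrams)  # assumption: scale by the number of distinct words
    return {
        (a, b): (n_ab - min_count) * scale / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
    }

def merge_bigrams(sentence, scores, threshold):
    """Join adjacent words with '_' when their score exceeds the threshold."""
    merged, i = [], 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged

toy = [["san", "francisco", "hotel"], ["san", "francisco", "trip"], ["nice", "hotel"]]
scores = bigram_scores(toy, min_count=1)
print(scores[("san", "francisco")])  # 1.25
print(merge_bigrams(["san", "francisco", "hotel"], scores, threshold=1.0))
# ['san_francisco', 'hotel']
```

Pairs that occur only min_count times score zero, which is why 'francisco hotel' is left alone in this toy corpus.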

## Build the Model

Now let us build the model. Let us define some of the important hyperparameters that the model needs.


* vector_size (stored in the variable size below) is the dimensionality of the vector that represents each word. It can be chosen according to the size of the data: for a very small dataset a small value works, while for a significantly large dataset we can set it to 300. In our case, we set it to 100.

* window is the maximum distance between the target word and its neighboring words. Words farther than the window size from the target word are not considered for learning. Typically, a small window size is preferred.

* min_count is the minimum frequency of a word: if a word occurs fewer than min_count times, it is ignored.

* workers specifies the number of worker threads used to train the model.

* sg=1 means we use the skip-gram method for training; sg=0 means we use CBOW.

* epochs is the number of passes over the corpus during training.

In [None]:
size = 100
window_size = 2
epochs = 100
min_count = 2
workers = 4
sg = 1
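
To make window and min_count concrete, the sketch below lists the (target, context) pairs a skip-gram model would see for a toy corpus. It is a simplification under stated assumptions: rare words are dropped before the window is applied, and the full fixed window is always used, whereas gensim also varies the effective window and can subsample frequent words. The function name and the toy sentences are hypothetical.

```python
from collections import Counter

def skipgram_pairs(sentences, window, min_count):
    """Return (target, context) pairs after dropping words rarer than min_count."""
    counts = Counter(w for s in sentences for w in s)
    pairs = []
    for s in sentences:
        kept = [w for w in s if counts[w] >= min_count]
        for i, target in enumerate(kept):
            lo, hi = max(0, i - window), min(len(kept), i + window + 1)
            pairs.extend((target, kept[j]) for j in range(lo, hi) if j != i)
    return pairs

toy = [
    ["staff", "friendly", "attentive", "helpful"],
    ["staff", "friendly", "helpful", "rude"],
]
pairs = skipgram_pairs(toy, window=1, min_count=2)
print(len(pairs))                        # 8
print(("friendly", "helpful") in pairs)  # True
```

Because 'attentive' occurs only once, it is removed first, so 'friendly' and 'helpful' become neighbors in the first sentence as well.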

Train the model:

In [None]:
model = Word2Vec(corpus, sg=sg, window=window_size, vector_size=size, min_count=min_count, workers=workers, epochs=epochs)

To save and load the model, we can simply use the save and load functions, respectively.

Save the model:

In [None]:
model.save('word2vec.model')

Load the saved word2vec model:

In [None]:
model = Word2Vec.load('word2vec.model')

## Evaluate the Embeddings

After training the model, we evaluate it. Let us see what the model has learned and how well it has understood the semantics of words. Gensim provides a most_similar function, which returns the words most similar to a given word.

As you can see below, given san_diego as input, we get other city names as the most similar words:

In [None]:
model.wv.most_similar('san_diego')

[('san_antonio', 0.7965883016586304),
 ('austin', 0.7555802464485168),
 ('san_francisco', 0.7504624128341675),
 ('boston', 0.7418094277381897),
 ('dallas', 0.7365104556083679),
 ('indianapolis', 0.7337252497673035),
 ('memphis', 0.7308087944984436),
 ('seattle', 0.73016756772995),
 ('la', 0.7283918857574463),
 ('phoenix', 0.7246084809303284)]
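
The scores in this output are cosine similarities between word vectors. As an illustration only, the following sketch ranks neighbors by cosine similarity over a few made-up 3-dimensional vectors. The helper names and the toy vectors are hypothetical and do not come from the trained model.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to the given word."""
    scored = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]

# Made-up vectors, for illustration only
toy_vectors = {
    "san_diego": [0.9, 0.1, 0.0],
    "san_antonio": [0.8, 0.2, 0.1],
    "breakfast": [0.0, 0.1, 0.9],
}
top = most_similar("san_diego", toy_vectors)
print([word for word, _ in top])  # ['san_antonio', 'breakfast']
```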

We can also apply arithmetic operations to the vectors to check how well they capture relationships between words. For instance, woman + king - man should be close to queen:

In [None]:
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

[('queen', 0.7528093457221985)]
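
The arithmetic behind such a query can be sketched by hand: add the unit vectors of the positive words, subtract those of the negative words, and pick the remaining word closest to the result. This is a simplified reading of what the query does, not gensim's implementation, and the three-dimensional vectors are made up so that the axes roughly mean royalty, maleness, and femaleness.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for w in positive:
        query = [q + x for q, x in zip(query, unit(vectors[w]))]
    for w in negative:
        query = [q - x for q, x in zip(query, unit(vectors[w]))]
    query = unit(query)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates,
               key=lambda w: sum(q * x for q, x in zip(query, unit(vectors[w]))))

# Made-up vectors; axes are roughly [royalty, maleness, femaleness]
toy = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "hotel": [0.1, 0.1, 0.1],
}
print(analogy(positive=["woman", "king"], negative=["man"], vectors=toy))  # queen
```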

We can also find the word that does not belong in a given set of words. For instance, in the list called text below, every word except holiday is a city name. Since our word2vec model has learned the semantics of each word, it returns holiday as the word that does not match the others in the list.

In [None]:
text = ['los_angeles','indianapolis', 'holiday', 'san_antonio','new_york']

model.wv.doesnt_match(text)

'holiday'
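
One way to picture this check, as far as I understand it, is to average the unit vectors of all the words and return the word least aligned with that average. The sketch below follows that idea with made-up 2-dimensional vectors. The helper names are hypothetical, and gensim's implementation may differ in detail.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def doesnt_match(words, vectors):
    """Return the word whose unit vector is least aligned with the mean direction."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[d] for u in units.values()) / len(units) for d in range(dims)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Made-up vectors: the three cities point one way, 'holiday' another
toy = {
    "los_angeles": [0.9, 0.1],
    "new_york": [0.8, 0.2],
    "san_antonio": [0.85, 0.15],
    "holiday": [0.1, 0.9],
}
print(doesnt_match(list(toy), toy))  # holiday
```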

Thus, with the word2vec model, we can generate useful word embeddings that capture the syntactic and semantic meanings of words. In the next section, we will learn how to visualize the word embeddings generated by the word2vec model in TensorBoard.

## Visualizing Word Embeddings in TensorBoard



In the last section, we learned how to build a word2vec model for generating word embeddings using gensim.
Now, we will see how to visualize those embeddings using TensorBoard. Visualizing word embeddings helps us understand the projection space and makes it easier to validate the embeddings. TensorBoard provides a built-in visualizer called the embedding projector for interactively visualizing and analyzing high-dimensional data such as our word embeddings. We will learn, step by step, how to use TensorBoard's projector to visualize the word embeddings.


Import the required libraries:

In [None]:
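import warnings
warnings.filterwarnings(action='ignore')
import tensorflow as tf
#the projector module is used below; this import path is for TensorFlow 1.x
from tensorflow.contrib.tensorboard.plugins import projector
import numpy as np
import gensim
import os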

Load the saved model:

In [None]:
file_name = "word2vec.model"
#the file holds a full Word2Vec model (we access model.wv and model.layer1_size below)
model = gensim.models.Word2Vec.load(file_name)

After loading the model, we save the number of words to export into a variable called max_size. Note that the code sets it to the vocabulary size minus one, so the last word in the vocabulary is left out:

In [None]:
max_size = len(model.wv.index_to_key)-1

We learned that the dimension of the word vectors is $V \times N$, that is, the length of the vocabulary ($V$) $\times$ the number of neurons in the hidden layer ($N$). So, we initialize a matrix named w2v with max_size rows, one per exported word, and layer1_size columns, which is the number of neurons in the hidden layer:

In [None]:
w2v = np.zeros((max_size,model.layer1_size))

Now we create a new file called metadata.tsv, where we save all the words in our model, one per line. In the same loop, we store the embedding of each word in the w2v matrix:

In [None]:
if not os.path.exists('projections'):
    os.makedirs('projections')

with open("projections/metadata.tsv", 'w+') as file_metadata:

    for i, word in enumerate(model.wv.index_to_key[:max_size]):

        #store the embeddings of the word
        w2v[i] = model.wv[word]

        #write the word to a file
        file_metadata.write(word + '\n')
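
As a side note, the standalone Embedding Projector can also load embeddings from two plain TSV files: one with a vector per line and one with the matching label per line. The sketch below writes both files from a dictionary of vectors. It is an alternative shown for illustration, not the approach used in this section, and the function name, file names, and toy vectors are hypothetical.

```python
import os
import tempfile

def export_for_projector(word_vectors, out_dir):
    """Write vectors.tsv (tab-separated values) and metadata.tsv (one word per line)."""
    os.makedirs(out_dir, exist_ok=True)
    vec_path = os.path.join(out_dir, "vectors.tsv")
    meta_path = os.path.join(out_dir, "metadata.tsv")
    with open(vec_path, "w", encoding="utf-8") as f_vec, \
         open(meta_path, "w", encoding="utf-8") as f_meta:
        for word, vector in word_vectors.items():
            f_vec.write("\t".join(str(x) for x in vector) + "\n")
            f_meta.write(word + "\n")  # line i labels the vector on line i
    return vec_path, meta_path

# Made-up vectors, written to a temporary directory
toy = {"hotel": [0.1, 0.2], "room": [0.3, 0.4]}
vec_path, meta_path = export_for_projector(toy, tempfile.mkdtemp())
print(open(meta_path, encoding="utf-8").read().split())  # ['hotel', 'room']
```

The key invariant in both approaches is that the order of the words in the metadata file matches the order of the rows in the embedding matrix.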

Next, we initialize the tensorflow session:

In [None]:
sess = tf.InteractiveSession()

Initialize the TensorFlow variable called embedding, which holds the word embeddings:

In [None]:
with tf.device("/cpu:0"):
    embedding = tf.Variable(w2v, trainable=False, name='embedding')

Initialize all variables:

In [None]:
tf.global_variables_initializer().run()

Create an instance of the Saver class, which is used for saving variables to and restoring them from our checkpoints:

In [None]:
saver = tf.train.Saver()

Using FileWriter, we save our summaries and events to our event file:

In [None]:
writer = tf.summary.FileWriter('projections', sess.graph)

Initialize the projectors and add the embeddings:

In [None]:
config = projector.ProjectorConfig()
embed= config.embeddings.add()

Next, we specify our tensor_name as embedding and metadata_path to the metadata.tsv file where we have the words:

In [None]:
embed.tensor_name = 'embedding'
embed.metadata_path = 'metadata.tsv'

And finally, save the model:

In [None]:
projector.visualize_embeddings(writer, config)

saver.save(sess, 'projections/model.ckpt', global_step=max_size)

'projections/model.ckpt-28070'

Now, open the terminal and type the following command to launch TensorBoard:

tensorboard --logdir=projections --port=8000

Thus, visualizing word embeddings in TensorBoard helps us validate them easily. In the next section, we will learn how to convert paragraphs and documents to vectors using two different algorithms called PV-DM and PV-DBOW.

# Finding similar documents using Doc2Vec

We just learned how PV-DM and PV-DBOW convert documents into vectors. Now, we will see how to use Doc2Vec to find similar documents.

In this section, we will use the 20 newsgroups dataset. It consists of 20,000 documents over 20 different
news categories. We will use only four categories: Computer, Politics, Science, and Sports. We have 1000 documents under each of these four categories.

We rename the documents with a prefix, category_. For example, all science documents are renamed as
Science_1, Science_2, and so on. After renaming them, we combine all the documents and
place them in a single folder. The combined data is available in the data folder as new_dataset.zip.


## Import the required libraries

In [None]:
import warnings
warnings.filterwarnings('ignore')

import os
import gensim
from gensim.models.doc2vec import TaggedDocument

from nltk import RegexpTokenizer
from nltk.corpus import stopwords

tokenizer = RegexpTokenizer(r'\w+')
stopWords = set(stopwords.words('english'))

## Data Preparation

Load all the documents and save the document names in docLabels list and document content in a list called data:

In [None]:
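#list the document names
docLabels = [f for f in os.listdir('data/news_dataset') if f.endswith('.txt')]

#read the content of each document, closing each file after reading
data = []
for doc in docLabels:
    with open(os.path.join('data/news_dataset', doc)) as f:
        data.append(f.read())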

As shown below, docLabels has names of our documents:

In [None]:
docLabels[:5]

['Electronics_827.txt',
 'Electronics_848.txt',
 'Science_377.txt',
 'Science_24.txt',
 'Politics_38.txt']

Define a class called DocIterator, which acts as an iterator that runs over all the documents:

In [None]:
class DocIterator(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            yield TaggedDocument(words=doc.split(), tags=[self.labels_list[idx]])

Create an instance of the DocIterator class called 'it':

In [None]:
it = DocIterator(data, docLabels)
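
A class with an `__iter__` method is useful here because training typically passes over the corpus several times, and a plain generator is exhausted after the first pass. The sketch below shows the difference. `Tagged` is a hypothetical stand-in for TaggedDocument so that the example runs without gensim, and the documents and labels are made up.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument so the sketch runs without gensim installed
Tagged = namedtuple("Tagged", ["words", "tags"])

class ReiterableDocs:
    """Yields a fresh stream of tagged documents every time it is iterated."""
    def __init__(self, docs, labels):
        self.docs, self.labels = docs, labels

    def __iter__(self):
        for label, doc in zip(self.labels, self.docs):
            yield Tagged(words=doc.split(), tags=[label])

docs = ["cpu board circuit", "orbit launch nasa"]
labels = ["Electronics_1.txt", "Science_1.txt"]

reiterable = ReiterableDocs(docs, labels)
one_shot = (Tagged(d.split(), [l]) for d, l in zip(docs, labels))

passes_class = [len(list(reiterable)) for _ in range(2)]
passes_generator = [len(list(one_shot)) for _ in range(2)]
print(passes_class, passes_generator)  # [2, 2] [2, 0]
```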

## Build the model

Now let us build the model. Let us define some of the important hyperparameters of the model.

* size is our embedding size, that is, the dimensionality of each document vector.

* alpha is our initial learning rate.

* min_alpha is the value to which the learning rate alpha decays during training.

* dm=1 means we use ‘distributed memory’ (PV-DM) for training; dm=0 means we use ‘distributed bag of words’ (PV-DBOW).

* min_count is the minimum frequency of a word: if a word occurs fewer than min_count times, it is ignored.

In [None]:
size = 100
alpha = 0.025
min_alpha = 0.025
dm = 1
min_count = 1
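
The decay from alpha to min_alpha can be pictured as a simple schedule. The sketch below assumes a linear, per-epoch interpolation. The function name and the value min_alpha=0.005 are hypothetical, and gensim's internal schedule may differ in detail.

```python
def linear_decay(alpha, min_alpha, epochs):
    """Learning rate for each epoch, moving linearly from alpha down to min_alpha."""
    if epochs == 1:
        return [alpha]
    step = (alpha - min_alpha) / (epochs - 1)
    return [alpha - step * e for e in range(epochs)]

schedule = linear_decay(alpha=0.025, min_alpha=0.005, epochs=5)
print([round(a, 3) for a in schedule])  # [0.025, 0.02, 0.015, 0.01, 0.005]

# With min_alpha equal to alpha, as in the values above, the rate stays constant
print(linear_decay(alpha=0.025, min_alpha=0.025, epochs=3))  # [0.025, 0.025, 0.025]
```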

Define the model:

In [None]:
model = gensim.models.Doc2Vec(vector_size=size, min_count=min_count, alpha=alpha, min_alpha=min_alpha, dm=dm)
model.build_vocab(it)

Train the model:

In [None]:
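for epoch in range(100):
    #total_examples must match the number of documents seen by build_vocab
    model.train(it, total_examples=model.corpus_count, epochs=model.epochs)
    #decay the learning rate in small steps so that it stays positive over all 100 rounds
    model.alpha -= 0.0002
    model.min_alpha = model.alpha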

Save the model:

In [None]:
#make sure the target folder exists before saving
os.makedirs('model', exist_ok=True)
model.save('model/doc2vec.model')


We can load the saved model using load function:

In [None]:
d2v_model = gensim.models.doc2vec.Doc2Vec.load('model/doc2vec.model')

## Evaluate the model

After training, we evaluate the model. As shown below, when we pass the Electronics_724.txt document as input, it returns the most similar documents along with their similarity scores:

In [None]:
model.dv.most_similar('Electronics_724.txt')

[('Electronics_407.txt', 0.9127770662307739),
 ('Electronics_163.txt', 0.8796253800392151),
 ('Science_480.txt', 0.8787260055541992),
 ('Science_769.txt', 0.8782669305801392),
 ('Science_627.txt', 0.8712874054908752),
 ('Science_737.txt', 0.8702232241630554),
 ('Electronics_461.txt', 0.8684250116348267),
 ('Science_377.txt', 0.8677175045013428),
 ('Electronics_786.txt', 0.867066502571106),
 ('Politics_167.txt', 0.8663994669914246)]
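
Since each file name starts with its category, a quick sanity check is to count how many of the returned neighbors share the query's category. The sketch below does this for the ten results shown above, with the scores rounded. The helper `category_of` is hypothetical.

```python
from collections import Counter

def category_of(label):
    """'Science_377.txt' -> 'Science' (the prefix before the first underscore)."""
    return label.split("_", 1)[0]

# The ten neighbors from the output above (scores rounded)
neighbours = [
    ("Electronics_407.txt", 0.9128), ("Electronics_163.txt", 0.8796),
    ("Science_480.txt", 0.8787), ("Science_769.txt", 0.8783),
    ("Science_627.txt", 0.8713), ("Science_737.txt", 0.8702),
    ("Electronics_461.txt", 0.8684), ("Science_377.txt", 0.8677),
    ("Electronics_786.txt", 0.8671), ("Politics_167.txt", 0.8664),
]

query = "Electronics_724.txt"
counts = Counter(category_of(name) for name, _ in neighbours)
same_category = counts[category_of(query)] / len(neighbours)
print(dict(counts))   # {'Electronics': 4, 'Science': 5, 'Politics': 1}
print(same_category)  # 0.4
```

Here four of the ten neighbors are Electronics documents and five are Science documents.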

We learned how to generate embeddings for documents using the doc2vec algorithms. In the next section, we will learn how to generate sentence embeddings using the skip-thoughts and quick-thoughts algorithms.