## Building a word2vec model using Gensim

Now that we have understood how the word2vec model works, let us see how to build a word2vec model using the gensim library. Gensim is one of the popular libraries widely used for building vector space models. It can be installed via pip, so we can just type the following command in our terminal:

pip install -U gensim

We start by importing the necessary libraries:

In [1]:
import warnings
warnings.filterwarnings('ignore')

#data processing
import pandas as pd
import re
from nltk.corpus import stopwords
#the stop word list must be downloaded once with nltk.download('stopwords')
stopWords = stopwords.words('english')

#modelling
from gensim.models import Word2Vec
from gensim.models import Phrases
from gensim.models.phrases import Phraser

## Load the Data

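Load the dataset. The dataset used in this section is available in the data folder as text.zip. The next cell reads data/text.csv, so extract the archive first if that file is not already present.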

In [2]:
data = pd.read_csv('data/text.csv',header=None)

Let us see what we got in our data:

In [3]:
data.head()

Unnamed: 0,0
0,room kind clean strong smell dogs. generally a...
1,stayed crown plaza april april . staff friendl...
2,booked hotel hotwire lowest price could find. ...
3,stayed husband sons way alaska cruise. loved h...
4,girlfriends stayed celebrate th birthdays. pla...


## Preprocess and prepare the dataset

Define a function for preprocessing the data:

In [4]:
def pre_process(text):
    
    #convert to lowercase
    text = str(text).lower()
    
    #remove special characters; keep only alphanumeric characters, whitespace and periods
    text = re.sub(r'[^A-Za-z0-9\s.]',r'',text)
    
    #remove new lines
    text = re.sub(r'\n',r' ',text)
    
    # remove stop words
    text = " ".join([word for word in text.split() if word not in stopWords])
    
    return text

Let us see what the preprocessed text looks like:

In [5]:
pre_process(data[0][50])

'agree fancy. everything needed. breakfast pool hot tub nice shuttle airport later checkout time. noise issue tough sleep through. awhile forget noisy door nearby noisy guests. complained management later email credit compd us amount requested would return.'
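To make the individual steps easier to follow, here is a minimal sketch that applies the same sequence of operations to a single made-up review. The function name pre_process_toy, the sample text and the tiny stop word set are assumptions for illustration only, chosen so that the example runs without downloading the NLTK data:

```python
import re

# Hypothetical, tiny stop word set so the example runs without NLTK data
TOY_STOP_WORDS = {"the", "was", "a", "and", "we", "it", "to"}

def pre_process_toy(text, stop_words=TOY_STOP_WORDS):
    """Apply the same steps as pre_process to a single string."""
    # convert to lowercase
    text = str(text).lower()
    # drop everything except letters, digits, whitespace and periods
    text = re.sub(r"[^a-z0-9\s.]", "", text)
    # replace new lines with spaces
    text = re.sub(r"\n", " ", text)
    # remove stop words
    return " ".join(word for word in text.split() if word not in stop_words)

sample = "The room was clean. We LOVED it!\nStaff was friendly."
cleaned = pre_process_toy(sample)
print(cleaned)
```

Note that only periods survive as punctuation. Sentences that end with '!' or '?' therefore lose their boundary and get merged with the following sentence when we later split on '.'.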

Preprocess the whole dataset:

In [6]:
data[0] = data[0].map(lambda x: pre_process(x))

After preprocessing, our dataset looks like this:

In [7]:
data[0].head()

0    room kind clean strong smell dogs. generally a...
1    stayed crown plaza april april . staff friendl...
2    booked hotel hotwire lowest price could find. ...
3    stayed husband sons way alaska cruise. loved h...
4    girlfriends stayed celebrate th birthdays. pla...
Name: 0, dtype: object

The gensim library requires its input in the form of a list of lists, that is:

text = [ [word1, word2, word3], [word1, word2, word3] ]

We know that each row in our data contains a set of sentences. So we split each row by '.' to get a list of sentences:

In [8]:
data[0][1].split('.')[:5]

['stayed crown plaza april april ',
 ' staff friendly attentive',
 ' elevators tiny ',
 ' food restaurant delicious priced little high side',
 ' course washington dc']

Now we have the data in a list, but we need a list of lists. So we split each sentence again, this time on whitespace. That is, first we split the data by '.' and then we split each piece into words, so that we get our data as a list of lists:

In [9]:
corpus = []
for line in data[0][1].split('.'):
    words = [x for x in line.split()]
    corpus.append(words)

As you can see below, we have our inputs in the form of a list of lists:

In [10]:
corpus[:2]

[['stayed', 'crown', 'plaza', 'april', 'april'],
 ['staff', 'friendly', 'attentive']]

Convert the whole text in our dataset to a list of lists and build a corpus. Here, the corpus is simply the collection of all tokenized sentences in our dataset:

In [11]:
data = data[0].map(lambda x: x.split('.'))

corpus = []
for i in (range(len(data))):
    for line in data[i]:
        words = [x for x in line.split()]
        corpus.append(words)

corpus[:2]

[['room', 'kind', 'clean', 'strong', 'smell', 'dogs'],
 ['generally', 'average', 'ok', 'overnight', 'stay', 'youre', 'fussy']]
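The two-level split above can also be wrapped in a small helper. The following is only a sketch on made-up documents, and the name build_corpus is hypothetical. Unlike the loop above, it skips empty token lists, which otherwise appear whenever a row ends with a period:

```python
def build_corpus(documents):
    """Split each document into sentences on '.', then each sentence into words."""
    corpus = []
    for document in documents:
        for sentence in document.split("."):
            words = sentence.split()
            if words:  # skip empty sentences, e.g. the piece after a trailing period
                corpus.append(words)
    return corpus

# Made-up documents in the same style as the preprocessed reviews
toy_documents = [
    "room kind clean. generally average.",
    "staff friendly attentive. elevators tiny .",
]
toy_corpus = build_corpus(toy_documents)
print(toy_corpus)
```

Empty sentences carry no training signal, so dropping them does not change what the model learns; it only keeps the corpus tidy.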

Now the problem is that our corpus contains only unigrams, so the model will not give us results when we give a bigram as an input, for example, 'san francisco'.

So we use gensim's Phrases class, which detects words that frequently occur together and joins them with an underscore. So now 'san francisco' becomes 'san_francisco'. We set the min_count parameter to 25, which means that we ignore all words and bigrams that appear fewer than 25 times. We set the threshold parameter to 50, which is the score a bigram must exceed to be accepted as a phrase; a higher threshold yields fewer phrases.

In [12]:
phrases = Phrases(sentences=corpus,min_count=25,threshold=50)
bigram = Phraser(phrases)

In [13]:
for index,sentence in enumerate(corpus):
    corpus[index] = bigram[sentence]

As you can see below, an underscore has been added to the bigrams in our corpus:

In [14]:
corpus[111]

[u'connected', u'rivercenter', u'mall', u'downtown', u'san_antonio']

In [15]:
corpus[9]

[u'course', u'washington_dc']
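To get an intuition for how min_count and threshold interact, here is a simplified, pure Python sketch of bigram scoring. It assumes the scoring formula from the original word2vec phrase detection, (count(a b) - min_count) / (count(a) * count(b)) * vocabulary size, and uses only the number of distinct unigrams as the vocabulary size. The function names score_bigrams and merge_bigrams and the toy sentences are hypothetical, and gensim's own implementation differs in its details:

```python
from collections import Counter

def score_bigrams(sentences, min_count, threshold):
    """Return the bigrams whose score exceeds the threshold."""
    unigram_counts = Counter(word for s in sentences for word in s)
    bigram_counts = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigram_counts)  # simplification: distinct unigrams only
    accepted = {}
    for (a, b), n_ab in bigram_counts.items():
        score = (n_ab - min_count) / (unigram_counts[a] * unigram_counts[b]) * vocab_size
        if score > threshold:
            accepted[(a, b)] = score
    return accepted

def merge_bigrams(sentence, accepted):
    """Join accepted bigrams in a sentence with an underscore."""
    merged, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in accepted:
            merged.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged

toy_sentences = [
    ["san", "francisco", "hotel"],
    ["san", "francisco", "airport"],
    ["nice", "hotel"],
    ["nice", "airport"],
]
accepted = score_bigrams(toy_sentences, min_count=1, threshold=1.0)
print(accepted)
print(merge_bigrams(toy_sentences[0], accepted))
```

In this toy corpus only 'san francisco' occurs more often than min_count, so it is the only pair with a positive score. Pairs such as 'nice hotel' score zero and stay as separate words.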

## Build the Model

Now let us build the model. Let us define some of the important hyperparameters that the model needs.


* size represents the size of the vector, that is, the number of dimensions used to represent a word. The size can be chosen according to our data size. If our data is very small, then we can set the size to a small value, but if we have a significantly large dataset, then we can set the vector size to 300. In our case, we set the size to 100.

* window represents the maximum distance between the target word and its neighboring words. Words farther from the target word than the window size will not be considered for learning. Typically, a small window size is preferred.

* min_count represents the minimum frequency of words, that is, if a particular word occurs fewer than min_count times, then we simply ignore that word.

* workers specifies the number of worker threads used to train the model.

* sg=1 implies that we use the skip-gram method for training; sg=0 implies that we use CBOW.

* iter specifies the number of epochs, that is, the number of passes over the corpus during training.
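To illustrate what the window size means for the skip-gram method, the following sketch lists the (target, context) pairs that a fixed window produces for one sentence. The function name skipgram_pairs and the example sentence are assumptions for illustration. Real implementations typically add refinements such as shrinking the window at random, which are omitted here:

```python
def skipgram_pairs(sentence, window):
    """List (target, context) pairs for a fixed symmetric window."""
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, sentence[j]))
    return pairs

sentence = ["staff", "friendly", "attentive", "helpful"]
pairs = skipgram_pairs(sentence, window=1)
for pair in pairs:
    print(pair)
```

With window=1, each word is paired only with its immediate neighbors. Increasing the window adds more distant words as context, which yields more training pairs per sentence.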

In [16]:
size = 100
window_size = 2
epochs = 100
min_count = 2
workers = 4
sg = 1

Train the model:

In [17]:
#note: gensim 4.0 and later renamed size to vector_size and iter to epochs
model = Word2Vec(corpus,sg=sg,window=window_size,size=size, min_count=min_count,workers=workers,iter=epochs)

To save and load the model, we can simply use the save and load functions, respectively.

Save the model (the model directory must already exist):

In [18]:
model.save('model/word2vec.model')

Load the saved word2vec model:

In [19]:
model = Word2Vec.load('model/word2vec.model')

## Evaluate the Embeddings

After training the model, we evaluate it. Let us see what the model has learned and how well it has understood the semantics of words. Gensim provides a most_similar function on the model's wv attribute, which gives us the top similar words related to the given word.

As you can see below, given san_diego as an input, we get other city names as the most similar words:

In [20]:
model.wv.most_similar('san_diego')

[(u'san_antonio', 0.7988572120666504),
 (u'san_francisco', 0.7679016590118408),
 (u'dallas', 0.7511192560195923),
 (u'austin', 0.7496083974838257),
 (u'memphis', 0.7448670864105225),
 (u'seattle', 0.739619255065918),
 (u'phoenix', 0.7395424842834473),
 (u'boston', 0.738990068435669),
 (u'indianapolis', 0.7376643419265747),
 (u'la', 0.7074114084243774)]
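The scores in the output above are cosine similarities between word vectors. As a minimal sketch of the idea, here is the same kind of ranking computed in pure Python. The three-dimensional vectors and the name most_similar_toy are made up for illustration and are not taken from the trained model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional embeddings, for illustration only
toy_vectors = {
    "san_diego": [0.9, 0.1, 0.0],
    "san_antonio": [0.8, 0.2, 0.1],
    "boston": [0.7, 0.4, 0.1],
    "breakfast": [0.0, 0.2, 0.9],
}

def most_similar_toy(word, vectors, topn=2):
    """Rank all other words by cosine similarity to the given word."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = most_similar_toy("san_diego", toy_vectors)
print(ranking)
```

Words whose vectors point in nearly the same direction as the query get a score close to 1, while unrelated words score close to 0.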

We can also apply arithmetic operations on our vectors to check how well they capture relationships between words. For instance, woman + king - man should be close to queen:

In [21]:
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

[(u'queen', 0.7070209980010986)]
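The arithmetic behind this query can be sketched as follows: add the unit-length vectors of the positive words, subtract those of the negative words, and return the nearest remaining word by cosine similarity. The two-dimensional vectors and the name analogy_toy are hypothetical, and this is a simplified illustration rather than gensim's exact procedure:

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Made-up 2-dimensional embeddings: roughly (royalty, gender)
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "hotel": [-0.5, 0.1],
}

def analogy_toy(positive, negative, vectors, topn=1):
    """Combine positive and negative words, then rank the remaining words."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # input words are never returned
    scored = [(word, sum(q * x for q, x in zip(query, unit(vec))))
              for word, vec in vectors.items() if word not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

result = analogy_toy(["woman", "king"], ["man"], toy_vectors)
print(result)
```

Subtracting man and adding woman flips the gender component while keeping the royalty component, so the combined vector lands closest to queen.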

We can also find the word that does not match the others in a given set of words. For instance, in the list called text below, all the words except holiday are city names. Since our word2vec model has understood the semantics of each word, it returns holiday as the one that does not match the other words in the list:

In [22]:
text = ['los_angeles','indianapolis', 'holiday', 'san_antonio','new_york']

model.wv.doesnt_match(text)

'holiday'
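One way to find the odd word out is to average the unit-length vectors of all the words and pick the word that is least similar to that average. The sketch below follows this idea with made-up two-dimensional vectors; the name doesnt_match_toy is hypothetical, and gensim's own implementation may differ in its details:

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Made-up 2-dimensional embeddings, for illustration only
toy_vectors = {
    "los_angeles": [0.9, 0.1],
    "indianapolis": [0.8, 0.2],
    "new_york": [0.85, 0.15],
    "holiday": [0.1, 0.9],
}

def doesnt_match_toy(words, vectors):
    """Return the word whose vector is least similar to the mean vector."""
    units = {word: unit(vectors[word]) for word in words}
    dims = len(next(iter(units.values())))
    mean = [sum(vec[d] for vec in units.values()) / len(units) for d in range(dims)]
    mean = unit(mean)
    similarity = {word: sum(m * x for m, x in zip(mean, vec))
                  for word, vec in units.items()}
    return min(similarity, key=similarity.get)

odd_one = doesnt_match_toy(["los_angeles", "indianapolis", "holiday", "new_york"], toy_vectors)
print(odd_one)
```

The three city vectors dominate the average, so the word that points in a different direction ends up with the lowest similarity.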

Thus, with the word2vec model, we can generate useful word embeddings that capture the syntactic and semantic meaning of words. In the next section, we will learn how to visualize the word embeddings generated by the word2vec model in TensorBoard.