
In [1]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import nltk
pd.options.display.max_colwidth = 200
%matplotlib inline

In [2]:
corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
          'I love green eggs, ham, sausages and bacon!',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'    
]
labels = ['weather', 'weather', 'animals', 'food', 'food', 'animals', 'weather', 'animals']

corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus, 
                          'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df

,Document,Category
0,The sky is blue and beautiful.,weather
1,Love this blue and beautiful sky!,weather
2,The quick brown fox jumps over the lazy dog.,animals
3,"A king's breakfast has sausages, ham, bacon, eggs, toast and beans",food
4,"I love green eggs, ham, sausages and bacon!",food
5,The brown fox is quick and the blue dog is lazy!,animals
6,The sky is very blue and the sky is very beautiful today,weather
7,The dog is lazy but the brown fox is quick!,animals


In [None]:
nltk.download('stopwords')
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
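    # remove special characters, then lower-case and trim whitespace
    # (flags must be passed by keyword; the fourth positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)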
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
norm_corpus = normalize_corpus(corpus)
norm_corpus

array(['sky blue beautiful', 'love blue beautiful sky',
       'quick brown fox jumps lazy dog',
       'kings breakfast sausages ham bacon eggs toast beans',
       'love green eggs ham sausages bacon',
       'brown fox quick blue dog lazy', 'sky blue sky beautiful today',
       'dog lazy brown fox quick'], dtype='<U51')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

##The Word2Vec Model
This model was created by Google in 2013 and is a predictive deep learning based model that computes dense, continuous vector representations of words which capture contextual and semantic similarity. It is unsupervised: it takes in a large text corpus, builds a vocabulary of the words it contains and learns a dense embedding for each word in that vocabulary. You specify the size of the word embedding vectors, and the total number of vectors equals the size of the vocabulary. The dimensionality of this dense vector space is therefore much lower than that of the high-dimensional sparse vector space built by traditional Bag of Words models.
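
To make the sparse-versus-dense contrast concrete, here is a minimal NumPy sketch. The five-word vocabulary, the embedding size of 3 and the random matrix `E` are illustrative assumptions, not values taken from this notebook.

```python
import numpy as np

# Hypothetical toy vocabulary; a real corpus would have thousands of words.
vocab = ['sky', 'blue', 'fox', 'dog', 'eggs']
vocab_size, embed_size = len(vocab), 3

# Sparse representation: one dimension per vocabulary word.
one_hot = np.zeros(vocab_size)
one_hot[vocab.index('fox')] = 1.0

# Dense representation: one row of a (vocab_size x embed_size) matrix.
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_size))
dense = E[vocab.index('fox')]

print(one_hot.shape, dense.shape)       # (5,) (3,)
# Multiplying the one-hot vector by E selects the same row.
print(np.allclose(one_hot @ E, dense))  # True
```

The sparse vector grows with the vocabulary, while the dense vector stays at whatever embedding size we choose.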

####The Continuous Bag of Words (CBOW) Model
The CBOW model architecture tries to predict the current target word (the center word) from the source context words (the surrounding words). Take the simple sentence “the quick brown fox jumps over the lazy dog”. It can be broken into pairs of (context_window, target_word); with a context window of size 2 (one word on each side of the target), we get examples like ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy) and so on. The model then tries to predict the target_word from the context_window words.
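
The pairs listed for the example sentence can be reproduced with a few lines of plain Python. This is only a sketch: the function name `cbow_pairs` and the parameter `half_window` (the number of words taken on each side of the target) are hypothetical names chosen for illustration.

```python
def cbow_pairs(tokens, half_window):
    """Yield (context_words, target_word) using half_window words per side."""
    for i, target in enumerate(tokens):
        lo = max(0, i - half_window)
        # words before the target + words after it, clipped at sentence edges
        context = tokens[lo:i] + tokens[i + 1:i + 1 + half_window]
        yield context, target

sentence = 'the quick brown fox jumps over the lazy dog'.split()
pairs = list(cbow_pairs(sentence, half_window=1))

print(pairs[2])  # (['quick', 'fox'], 'brown')
print(pairs[1])  # (['the', 'brown'], 'quick')
print(pairs[7])  # (['the', 'dog'], 'lazy')
```

Words at the start and end of the sentence get a shorter context, which is why the Keras generator later in this notebook pads its contexts.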

####Build the corpus vocabulary
To start off, we build our corpus vocabulary: we extract each unique word from our corpus and map a unique numeric identifier to it.

In [None]:
from keras.preprocessing import text
from keras.utils import np_utils
from keras.preprocessing import sequence

tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(norm_corpus)
word2id = tokenizer.word_index

# build vocabulary of unique words
word2id['PAD'] = 0
id2word = {v:k for k, v in word2id.items()}
wids = [[word2id[w] for w in text.text_to_word_sequence(doc)] for doc in norm_corpus]

vocab_size = len(word2id)
embed_size = 30
window_size = 2 # context window size

print('Vocabulary Size:', vocab_size)
print('Vocabulary Sample:', list(word2id.items())[:10])

Vocabulary Size: 21
Vocabulary Sample: [('sky', 1), ('blue', 2), ('beautiful', 3), ('quick', 4), ('brown', 5), ('fox', 6), ('lazy', 7), ('dog', 8), ('love', 9), ('sausages', 10)]


####Build a CBOW (context, target) generator
We need pairs which consist of a target center word and its surrounding context words. In our implementation, a target word is of length 1 and the surrounding context is of length 2 x window_size, where we take window_size words before and after the target word in our corpus. This will become clearer with the following example.

In [None]:
def generate_context_word_pairs(corpus, window_size, vocab_size):
    context_length = window_size*2
    for words in corpus:
        sentence_length = len(words)
        for index, word in enumerate(words):
            context_words = []
            label_word   = []            
            start = index - window_size
            end = index + window_size + 1
            
            context_words.append([words[i] 
                                 for i in range(start, end) 
                                 if 0 <= i < sentence_length 
                                 and i != index])
            label_word.append(word)

            x = sequence.pad_sequences(context_words, maxlen=context_length)
            y = np_utils.to_categorical(label_word, vocab_size)
            yield (x, y)
            
            
# Test this out for some samples
i = 0
for x, y in generate_context_word_pairs(corpus=wids, window_size=window_size, vocab_size=vocab_size):
    if 0 not in x[0]:
        print('Context (X):', [id2word[w] for w in x[0]], '-> Target (Y):', id2word[np.argwhere(y[0])[0][0]])
    
        if i == 10:
            break
        i += 1

Context (X): ['quick', 'brown', 'jumps', 'lazy'] -> Target (Y): fox
Context (X): ['brown', 'fox', 'lazy', 'dog'] -> Target (Y): jumps
Context (X): ['kings', 'breakfast', 'ham', 'bacon'] -> Target (Y): sausages
Context (X): ['breakfast', 'sausages', 'bacon', 'eggs'] -> Target (Y): ham
Context (X): ['sausages', 'ham', 'eggs', 'toast'] -> Target (Y): bacon
Context (X): ['ham', 'bacon', 'toast', 'beans'] -> Target (Y): eggs
Context (X): ['love', 'green', 'ham', 'sausages'] -> Target (Y): eggs
Context (X): ['green', 'eggs', 'sausages', 'bacon'] -> Target (Y): ham
Context (X): ['brown', 'fox', 'blue', 'dog'] -> Target (Y): quick
Context (X): ['fox', 'quick', 'dog', 'lazy'] -> Target (Y): blue
Context (X): ['sky', 'blue', 'beautiful', 'today'] -> Target (Y): sky
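
For reference, the two Keras helpers used in the generator do simple things: `pad_sequences` left-pads each context with zeros by default, and `to_categorical` one-hot encodes the target id. Below is a rough pure-Python sketch of that behaviour. The helper names `left_pad` and `one_hot` are hypothetical and are not part of Keras.

```python
def left_pad(ids, maxlen, pad_id=0):
    """Pad on the left with pad_id, keeping the last maxlen ids if too long."""
    ids = ids[-maxlen:]
    return [pad_id] * (maxlen - len(ids)) + ids

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position index."""
    return [1 if i == index else 0 for i in range(size)]

# First word of a sentence with window_size=2: only two right-hand neighbours.
print(left_pad([2, 3], maxlen=4))  # [0, 0, 2, 3]
print(one_hot(1, size=5))          # [0, 1, 0, 0, 0]
```

This is also why the sampling loop skips any context containing 0: those are padded windows at sentence boundaries.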


####Build the CBOW model architecture
We now leverage keras on top of tensorflow to build our deep learning architecture for the CBOW model. Our inputs are the context words, which are passed to an embedding layer (initialized with random weights). The word embeddings are propagated to a lambda layer where we average them (hence the name CBOW: the order of the context words is lost once they are averaged), and we pass this averaged context embedding to a dense softmax layer which predicts our target word. We match this with the actual target word, compute the loss using categorical_crossentropy and backpropagate the error on every training step, updating the embedding layer in the process. The following code shows our model architecture.

In [None]:
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda

# build CBOW architecture
cbow = Sequential()
cbow.add(Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=window_size*2))
cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(embed_size,)))
cbow.add(Dense(vocab_size, activation='softmax'))
cbow.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# view model summary
cbow.summary()  # summary() prints the table itself and returns None

To summarize the core concepts of this model in simple terms: the input is a set of context words of dimensions (2 x window_size). We pass them to an embedding layer of size (vocab_size x embed_size), which gives us a dense word embedding for each context word (1 x embed_size per word). Next we use a lambda layer to average these embeddings into a single dense embedding (1 x embed_size), which is sent to the dense softmax layer. That layer outputs a probability for every word in the vocabulary, and the most likely word is the predicted target. We compare this with the actual target word, compute the loss, backpropagate the errors to adjust the weights (in the embedding layer) and repeat this process for all (context, target) pairs for multiple epochs. The following figure tries to explain the same.
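
The shapes in this summary can be checked with a plain NumPy forward pass. This is a sketch only: the random matrices `E`, `W` and `b` stand in for the weights Keras would learn, and the context and target ids are arbitrary.

```python
import numpy as np

vocab_size, embed_size, window_size = 21, 30, 2  # values used in this notebook

rng = np.random.default_rng(42)
E = rng.normal(size=(vocab_size, embed_size))  # stand-in for the Embedding weights
W = rng.normal(size=(embed_size, vocab_size))  # stand-in for the Dense kernel
b = np.zeros(vocab_size)                       # stand-in for the Dense bias

context_ids = np.array([4, 5, 7, 8])           # 2 * window_size arbitrary word ids
context_vecs = E[context_ids]                  # one embedding per context word
avg = context_vecs.mean(axis=0)                # averaged context embedding

logits = avg @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax over the vocabulary

target_id = 6                                  # arbitrary target word id
loss = -np.log(probs[target_id])               # categorical cross-entropy for one pair

print(context_vecs.shape, avg.shape, probs.shape)  # (4, 30) (30,) (21,)
```

Training adjusts `E`, `W` and `b` so that this loss shrinks; the rows of `E` are the word embeddings we extract later.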

####Train the Model
Training one (context, target) pair at a time takes a fair bit of time on a large corpus. Our toy corpus is tiny, so the loop below runs for 99 epochs (range(1, 100)); adjust the range to train for more or fewer epochs.

In [None]:
for epoch in range(1, 100):
    loss = 0.
    i = 0
    for x, y in generate_context_word_pairs(corpus=wids, window_size=window_size, vocab_size=vocab_size):
        i += 1
        loss += cbow.train_on_batch(x, y)
        if i % 100000 == 0:
            print('Processed {} (context, word) pairs'.format(i))

    print('Epoch:', epoch, '\tLoss:', loss)
    print()

####Get Word Embeddings
To get word embeddings for our entire vocabulary, we can extract them from our embedding layer with the following code. We don’t take the embedding at position 0 since it belongs to the padding (PAD) term, which is not really a word of interest.

In [None]:
weights = cbow.get_weights()[0]
weights = weights[1:]
print(weights.shape)

# label each row by word id: row 0 of the sliced weights is id 1.
# (list(id2word.values())[1:] is shifted by one word because PAD was added to the dict last)
pd.DataFrame(weights, index=[id2word[i] for i in range(1, vocab_size)]).head()

(20, 30)


,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
sky,-1.056325,0.073911,0.398076,1.045604,0.087906,-0.647943,-0.755652,0.314727,0.366467,-0.394197,...,0.73463,-0.311624,-1.041767,-0.807258,-0.05628,0.727573,-0.08036,0.097368,-0.608945,-1.049852
blue,-0.00949,0.661241,-0.474563,0.409241,1.350211,0.154913,-0.607648,0.066068,0.908472,-0.964226,...,0.56887,-0.918025,-1.000859,-0.795024,-0.595539,-1.008232,-0.485314,-0.312635,-0.006286,0.127771
beautiful,-1.001877,-0.383566,1.523314,0.107873,-0.662151,0.005115,-0.333009,1.096868,-0.029924,-0.412661,...,0.226992,-0.615409,-0.872711,-0.526169,-0.287452,0.439517,-0.18792,-0.420505,-0.150454,-0.66411
quick,0.486977,-1.368317,-0.684468,-0.146448,0.835967,-1.198132,-1.412988,-0.080053,-1.237034,-0.042402,...,0.603156,0.068608,0.64252,0.40817,-1.34379,1.039482,0.881929,0.767231,-1.247231,0.808812
brown,1.226469,-0.987464,-0.933312,0.051496,0.98751,0.292376,-0.923795,0.959272,-0.02527,0.166759,...,-0.321244,-0.330172,0.625232,-0.545134,-1.019887,-0.957494,0.492093,0.963302,0.252083,1.2237


Thus you can clearly see that each word has a dense embedding of size (1x30), as depicted in the preceding output. Let’s try to find some contextually similar words for specific words of interest based on these embeddings. For this, we build a pairwise distance matrix over all the words in our vocabulary from the dense embedding vectors and then find the n nearest neighbors of each word of interest by shortest (euclidean) distance.

In [None]:
from sklearn.metrics.pairwise import euclidean_distances

# compute pairwise distance matrix
distance_matrix = euclidean_distances(weights)
print(distance_matrix.shape)

# view contextually similar words
similar_words = {search_term: [id2word[idx] for idx in distance_matrix[word2id[search_term]-1].argsort()[1:3]+1] 
                   for search_term in ['blue', 'beautiful', 'quick', 'brown', 'fox']}

similar_words

(20, 20)


{'beautiful': ['sky', 'love'],
 'blue': ['today', 'sky'],
 'brown': ['jumps', 'fox'],
 'fox': ['dog', 'jumps'],
 'quick': ['jumps', 'lazy']}
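
The -1 and +1 in the lookup above convert between word ids (which start at 1 because id 0 is PAD) and row indices of the weight matrix (which start at 0 once the PAD row is dropped). Here is a small NumPy sketch of the same nearest-neighbour search on made-up 2-D vectors. The words, ids and coordinates are illustrative assumptions, and `nearest` is a hypothetical helper.

```python
import numpy as np

toy_word2id = {'sky': 1, 'blue': 2, 'fox': 3, 'dog': 4}  # id 0 reserved for PAD
toy_id2word = {i: w for w, i in toy_word2id.items()}

# Row r holds the vector for word id r + 1 (the PAD row is already removed).
toy_weights = np.array([[0.0, 0.0],   # sky
                        [0.0, 1.0],   # blue
                        [5.0, 5.0],   # fox
                        [5.0, 7.0]])  # dog

# Pairwise Euclidean distances via broadcasting: shape (4, 4).
diff = toy_weights[:, None, :] - toy_weights[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

def nearest(word, k=1):
    row = toy_word2id[word] - 1                 # word id -> row index
    order = dist[row].argsort()[1:k + 1]        # skip position 0 (the word itself)
    return [toy_id2word[r + 1] for r in order]  # row index -> word id -> word

print(nearest('sky'))  # ['blue']
print(nearest('fox'))  # ['dog']
```

Skipping position 0 of the sorted order relies on each word being at distance 0 from itself and from nothing else, which holds here.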