In [None]:
# Download the data

import os
import urllib.request  # the request submodule must be imported explicitly for urlretrieve

def maybe_download(filename, url, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urllib.request.urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified', filename)
    else:
        print(statinfo.st_size)
        raise Exception(
            'Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename

url = 'http://mattmahoney.net/dc/'
filename = maybe_download('text8.zip', url, 31344016)

In [None]:
# Read the data into a list of strings.
import zipfile
import tensorflow as tf

def read_data(filename):
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data

vocabulary = read_data(filename)

In [None]:
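# vocabulary already holds the word list returned by read_data(); data is only
# defined later by build_dataset(), so there is nothing to reassign here.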

In [None]:
print(vocabulary[:7])

In [None]:
"""
As you can observe, the returned vocabulary data contains a list of plain English words, 
ordered as they are in the sentences of the original extracted text file.  Now that we have 
all the words extracted in a list, we have to do some further processing to enable us to create 
our skip-gram batch data.  These further steps are:

1. Extract the top 10,000 most common words to include in our embedding vector.
2. Gather together all the unique words and index them with a unique integer value. This is what 
   is required to create an equivalent one-hot type input for the word.  We’ll use a dictionary to do this.
3. Loop through every word in the dataset (the vocabulary variable) and assign it the unique integer 
   identifier created in Step 2 above.  This will allow easy lookup and processing of the word data stream.
"""

In [22]:
# build dataset
"""
The first step is setting up a “counter” list called count, which will store the number of times a word is 
found within the data-set.  Because we are restricting our vocabulary to only 10,000 words, any 
words not within the top 10,000 most common words will be marked with an “UNK” designation, standing 
for “unknown”.  The initialized count list is then extended, using the Counter() class from the Python 
collections module and its most_common() method.  Together these count the occurrences of each word in 
the given argument (words) and return the n most common words, with their counts, as a list.

The next part of this function creates a dictionary, called dictionary which is populated by keys 
corresponding to each unique word.  The value assigned to each unique word key is simply an increasing 
integer count of the size of the dictionary.  So, for instance, the most common word will receive the 
value 1, the second most common the value 2, the third most common word the value 3, and so on 
(the integer 0 is assigned to the ‘UNK’ words).   This step creates a unique integer value for each 
word within the vocabulary – accomplishing the second step of the process which was defined above.

Next, the function loops through each word in our full words data set, which is the output 
of the read_data() function.  A list called data is created, which will be the same length as words 
but, instead of being a list of individual words, will be a list of integers, with each word 
now being represented by the unique integer that was assigned to this word in dictionary.  So the 
first sentence of our data-set, [‘anarchism’, ‘originated’, ‘as’, ‘a’, ‘term’, ‘of’, ‘abuse’], now looks 
like this in the data variable: [5234, 3081, 12, 6, 195, 2, 3134].  This part of the function addresses 
step 3 in the list above.

Finally, the function creates a dictionary called reversed_dictionary that allows us to look up a word based 
on its unique integer identifier, rather than looking up the identifier based on the word, which is what the 
original dictionary does.

The code for build_dataset() is:
"""
import collections
import itertools

def build_dataset(words, n_words):
    """Process raw inputs into a dataset.
    @param words - the list of words, in the order they appear in the original document
    @param n_words - the vocabulary size to keep, including the 'UNK' entry
    """
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))

    # build our 10,000 word vocabulary
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
        
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0
            unk_count +=1
            
        data.append(index)
    count[0][1] = unk_count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary
        
# print(vocabulary)
n_words = 10000    # collect the 10,000 most common words, since we are building a vocabulary of 10,000 words
data, count, dictionary, reversed_dictionary = build_dataset(vocabulary, n_words)

print(data[:7])
print(count[:7])

# The dictionaries hold 10,000 entries each, so only preview the first few items.
# print(dict(itertools.islice(dictionary.items(), 5)))
# print(dict(itertools.islice(reversed_dictionary.items(), 5)))
        

[5234, 3081, 12, 6, 195, 2, 3134]
[['UNK', 1737307], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764), ('in', 372201), ('a', 325873)]
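
The three processing steps are easier to follow on a tiny corpus. The sketch below is a simplified stand-in for build_dataset(), not a replacement for it: the names toy_words, toy_vocab_size, toy_dictionary and toy_data are made up for illustration, and the nine-word corpus is invented.

```python
import collections

# Toy corpus standing in for the text8 word list (hypothetical data).
toy_words = "the cat sat on the mat the cat ran".split()
toy_vocab_size = 3  # 'UNK' plus the 2 most common words

# Step 1: find the most common words, leaving one slot for 'UNK'.
most_common = collections.Counter(toy_words).most_common(toy_vocab_size - 1)

# Step 2: give each kept word a unique integer; 0 is reserved for 'UNK'.
toy_dictionary = {"UNK": 0}
for word, _ in most_common:
    toy_dictionary[word] = len(toy_dictionary)

# Step 3: replace every word in the stream with its integer (0 if unknown).
toy_data = [toy_dictionary.get(word, 0) for word in toy_words]

print(toy_dictionary)
print(toy_data)
```

Every word outside the kept vocabulary collapses to index 0, which is why the 'UNK' count reported for the real data set is so large.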


In [None]:
# The final aspect of setting up our data is now to create a data set comprising our input 
# words and associated grams, which can be used to train our Word2Vec embedding system. 
"""
This function will generate mini-batches to use during our training.  These batches will consist 
of input words (stored in batch) and random associated context words within the gram as the labels 
to predict (stored in context).  
For instance, in the 5-gram “the cat sat on the”, the input word will be the center word, i.e. “sat”, 
and the context words that will be predicted will be drawn randomly from the remaining words of 
the gram: [‘the’, ‘cat’, ‘on’, ‘the’].  In this function, the number of words drawn randomly from 
the surrounding context is defined by the argument num_skips.  The size of the window of context 
words to draw from around the input word is defined in the argument skip_window. In the example 
above (“the cat sat on the”), we have a skip window width of 2 around the input word “sat”.

In the function below, first the batch and context outputs are defined as arrays of size batch_size.  
Then the span size is defined, which is basically the size of the word list that the input word and 
context samples will be drawn from.  In the example sub-sentence above “the cat sat on the”, 
the span is 5 = 2 x skip window + 1.  After this a buffer is created: a deque with maxlen=span that 
holds the current window of words and slides forward one word at a time.
"""
import random
import collections

import numpy as np

data_index = 0
def generate_batch(data, batch_size, num_skips, skip_window):
    global data_index 
    assert batch_size % num_skips == 0
    assert num_skips <= 2*skip_window
    batch = np.ndarray(shape=(batch_size,), dtype=np.int32)
    context = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # input word at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]  # this is the input word
            context[i * num_skips + j, 0] = buffer[target]  # these are the context words
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    # Backtrack a little bit to avoid skipping words in the end of a batch
    data_index = (data_index + len(data) - span) % len(data)
    return batch, context
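
To make the windowing concrete, here is a minimal deterministic sketch that lists every (input, context) pair inside a skip window. It is not what generate_batch() does exactly: that function samples only num_skips context words at random per input word and wraps around the data, whereas the hypothetical helper all_skipgram_pairs() below enumerates all of them and simply clips the window at the ends of the list.

```python
def all_skipgram_pairs(ids, skip_window):
    """Return every (input, context) pair within skip_window of each position."""
    pairs = []
    for center in range(len(ids)):
        lo = max(0, center - skip_window)
        hi = min(len(ids), center + skip_window + 1)
        for ctx in range(lo, hi):
            if ctx != center:  # a word is never its own context
                pairs.append((ids[center], ids[ctx]))
    return pairs

toy_ids = [10, 11, 12, 13, 14]  # made-up ids standing in for "the cat sat on the"
pairs = all_skipgram_pairs(toy_ids, skip_window=2)
center_pairs = [p for p in pairs if p[0] == 12]  # pairs whose input is the center word
print(center_pairs)
```

With skip_window=2 the center word has four candidate context words, which is why generate_batch() asserts num_skips <= 2*skip_window.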


In [None]:
# word2vec in Keras
# Constants and the validation set
"""
The first constant, window_size, is the window of words around the target word that will be 
used to draw the context words from.  The second constant, vector_dim, is the size of each of 
our word embedding vectors – in this case, our embedding layer will be of size 10,000 x 300.  
Finally, we have a large epochs variable – this designates the number of training iterations we are 
going to run.  Word embedding, even with negative sampling, can be a time-consuming process.

The next set of commands relates to the validation set: the words whose nearest neighbours we will 
inspect as training progresses. During training, we will check which words begin to be deemed 
similar by the word embedding vectors and make sure these line up with our understanding of the 
meaning of these words.  In this case, we will select 16 words to check, and pick these words 
randomly from the top 100 most common words in the data-set (build_dataset has assigned the most 
common words in the data set integers in ascending order i.e. the most common word is assigned 1, 
the next most common 2, etc.).

Next, we are going to look at a handy function in Keras which does all the skip-gram / context processing for us.
"""

import numpy as np

window_size = 3
vector_dim = 300
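epochs = 1000000
# vocabulary size used by the later cells; must match the n_words value passed to build_dataset
vocab_size = 10000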

valid_size = 16
valid_window = 100
valid_examples = np.random.choice(valid_window, valid_size, replace=False)



In [None]:
"""
The skip-gram function in Keras - 
To train our data set using negative sampling and the skip-gram method, we need to create data samples 
for both valid context words and for negative samples. This involves scanning through the data set and 
picking target words, then randomly selecting context words from within the window of words around the 
target word (i.e. if the target word is “on” from “the cat sat on the mat”, with a window size of 2 the words 
“cat”, “sat”, “the”, “mat” could all be randomly selected as valid context words).  It also involves 
randomly selecting negative samples outside of the selected target word context. Finally, we also need 
to set a label of 1 or 0, depending on whether the supplied context word is a true context word or a 
negative sample.  Thankfully, Keras has a function (skipgrams) which does all that for us. Consider 
the following code:
"""
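from keras.preprocessing import sequence
from keras.preprocessing.sequence import skipgrams

# sampling_table[i] is the probability of sampling the i-th most common word,
# which down-weights very frequent words such as "the"
sampling_table = sequence.make_sampling_table(vocab_size)
couples, labels = skipgrams(data, vocab_size, window_size=window_size, sampling_table=sampling_table)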
word_target, word_context = zip(*couples)
word_target = np.array(word_target, dtype="int32")
word_context = np.array(word_context, dtype="int32")

print(couples[:10], labels[:10])
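
The labelling scheme can be sketched without Keras. The helper below, label_pairs(), is hypothetical and much simpler than the real skipgrams function: it takes ready-made true pairs, labels them 1, and adds randomly drawn word ids labelled 0. It does not use a sampling table, and it makes no attempt to check that a random word is not a genuine context word.

```python
import random

def label_pairs(positive_pairs, vocab_size, negatives_per_positive=1, seed=0):
    """Label true (target, context) pairs 1 and random pairs 0."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    couples, labels = [], []
    for target, context in positive_pairs:
        couples.append((target, context))
        labels.append(1)
        for _ in range(negatives_per_positive):
            # id 0 is reserved for 'UNK', so draw from 1..vocab_size-1
            couples.append((target, rng.randint(1, vocab_size - 1)))
            labels.append(0)
    return couples, labels

toy_couples, toy_labels = label_pairs([(5, 7), (7, 5)], vocab_size=10,
                                      negatives_per_positive=2)
print(toy_couples)
print(toy_labels)
```

The resulting couples and labels have the same structure as the arrays unpacked into word_target, word_context and labels above.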



In [26]:
! pip3 install Keras

Collecting Keras
  Downloading Keras-2.3.1-py2.py3-none-any.whl (377 kB)
Installing collected packages: Keras
Successfully installed Keras-2.3.1


In [31]:
# create some input variables
"""
The Keras functional API and the embedding layers
In this Word2Vec Keras implementation, we’ll be using the Keras functional API.  In my previous 
Keras tutorial, I used the Keras sequential layer framework. This sequential layer framework allows 
the developer to easily bolt together layers, with the tensor outputs from each layer flowing easily 
and implicitly into the next layer.  In this case, we are going to do some things which are a little 
tricky – the sharing of a single embedding layer between two tensors, and an auxiliary output to measure 
similarity – and therefore we can’t use a straightforward sequential implementation.

Thankfully, the functional API is also pretty easy to use.  I’ll introduce it as we move through the code. 
The first thing we need to do is specify the structure of our model, as per the architecture diagram which 
I have shown above. As an initial step, we’ll create our input variables and embedding layer:


First off, we need to specify what tensors are going to be input to our model, along with their size. 
In this case, we are just going to supply individual target and context words, so the input size for each 
input variable is simply (1,).  Next, we create an embedding layer, which Keras already has specified as a 
layer for us: Embedding().  The first argument to this layer definition is the number of rows of our 
embedding layer, which is the size of our vocabulary (10,000).  The second is the size of each word’s 
embedding vector (the columns), in this case 300. We also specify the input length to the layer, which 
matches our input variables i.e. 1.  Finally, we give it a name, as we will want to access the weights 
of this layer after we’ve trained it, and we can easily access the layer weights using the name.

The weights for this layer are initialized automatically, but you can also specify an optional 
embeddings_initializer argument whereby you supply a Keras initializer object.  Next, as per our 
architecture, we need to look up an embedding vector (length = 300) for our target and context words, 
by supplying the embedding layer with the word’s unique integer value:


"""
# take Input and Embedding from the same keras package rather than mixing keras with tf.keras
from keras.layers import Input, Embedding


input_target = Input((1,)) # instantiate a keras tensor
input_context = Input((1,))

embedding = Embedding(vocab_size, vector_dim, input_length=1, name='embedding')




In [None]:
"""
As can be observed in the code below, the embedding vector is easily retrieved by supplying the word 
integer (i.e. input_target and input_context) in brackets to the previously created embedding operation/layer. 
For each word vector, we then use a Keras Reshape layer to reshape it ready for our upcoming dot product 
and similarity operation, as per our architecture.

The next layer involves calculating our cosine similarity between the supplied word vectors:
"""

from keras.layers import Reshape

target = embedding(input_target)
target = Reshape((vector_dim, 1))(target)
context = embedding(input_context)
context = Reshape((vector_dim, 1))(context)

In [None]:
"""
As can be observed, Keras supplies a dot operation with a normalize argument which we can set 
to True. This gives the cosine similarity between the two word vectors, target and context. 
(Older Keras 1.x code wrote this as merge(..., mode='cos'), which Keras 2.3.1 no longer provides.) 
This similarity operation will be returned via the output of a secondary model, but more on how 
this is performed later.

The next step is to continue on with our primary model architecture, and the dot product as our 
measure of similarity which we are going to use in the primary flow of the negative sampling architecture:
"""
from keras.layers import dot

# setup a cosine similarity operation which will be output in a secondary model
similarity = dot([target, context], axes=1, normalize=True)
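
The difference between the two similarity measures is easy to check by hand. The following plain-Python sketch uses made-up three-dimensional vectors (a, b, c) in place of real 300-dimensional embeddings, and the helper names dot_product_of and cosine_similarity are illustrative only.

```python
import math

def dot_product_of(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector lengths."""
    norm_u = math.sqrt(dot_product_of(u, u))
    norm_v = math.sqrt(dot_product_of(v, v))
    return dot_product_of(u, v) / (norm_u * norm_v)

a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]   # same direction as a, twice the length
c = [2.0, -1.0, 0.0]  # at right angles to a

print(dot_product_of(a, b), cosine_similarity(a, b))
print(dot_product_of(a, c), cosine_similarity(a, c))
```

Cosine similarity ignores vector length and depends only on direction, which makes it a convenient score for ranking nearest words, while the raw dot product is what feeds the sigmoid in the training model.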


In [None]:
"""
Again, we use the Keras dot operation and apply it to our target and context word vectors, this time 
without normalization, to get the simple dot product.  We then do another Reshape layer, and take the 
reshaped dot product value (a single data point/scalar) and apply it to a Keras Dense layer, with the 
activation function of the layer set to ‘sigmoid’.  This is the output of our Word2Vec Keras architecture.

Next, we need to gather everything into a Keras model and compile it, ready for training:
"""
from keras.layers import dot, Reshape, Dense

# now perform the dot product operation to get a similarity measure
dot_product = dot([target, context], axes=1)
dot_product = Reshape((1,))(dot_product)

# add the sigmoid output layer
output = Dense(1, activation='sigmoid')(dot_product)
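
As a rough illustration of what this output computes for a single word pair, the sketch below squashes a dot product through a sigmoid. The weight and bias arguments stand in for the single trainable weight and bias of the Dense(1) layer; the default values 1.0 and 0.0 are placeholders for illustration, not trained values, and pair_probability() is a hypothetical helper.

```python
import math

def sigmoid(x):
    """Map any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def pair_probability(target_vec, context_vec, weight=1.0, bias=0.0):
    """Score a word pair: dot product, then a one-unit dense layer with sigmoid."""
    score = sum(t * c for t, c in zip(target_vec, context_vec))
    return sigmoid(weight * score + bias)

aligned = pair_probability([1.0, 2.0], [1.0, 2.0])     # vectors point the same way
opposed = pair_probability([1.0, 2.0], [-1.0, -2.0])   # vectors point opposite ways
unrelated = pair_probability([0.0, 0.0], [1.0, 1.0])   # dot product of zero

print(aligned, opposed, unrelated)
```

Pairs whose embeddings point the same way score above 0.5 and are treated as likely true context pairs; pairs pointing in opposite directions score below 0.5.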


In [None]:
"""
Here, we create the functional API based model for our Word2Vec Keras architecture.  What the model 
definition requires is a specification of the input tensors to the model (at training time these are 
fed with numpy arrays) and an output tensor, supplied as per the previously explained architecture.  
We then compile the model, by supplying a loss function that we are going to use (in this case, binary 
cross entropy i.e. cross entropy when the labels are either 0 or 1) and an optimizer (in this case, 
rmsprop).  The loss function is applied to the output variable.

The question now is, if we want to use the similarity operation which we defined in the architecture 
to allow us to check on how things are progressing during training, how do we access it? We could output 
it via the model definition (i.e. outputs=[similarity, output]) but then Keras would be trying to apply 
the loss function and the optimizer to this value during training and this isn’t what we created the 
operation for.

There is another way, which is quite handy: we create another model, defined in the next cell.
"""
from keras.models import Model

# create the primary training model
model = Model(inputs=[input_target, input_context], outputs=output)
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
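
For reference, binary cross entropy for a single prediction can be written in a few lines. This is a plain-Python sketch of the standard formula, not the Keras implementation; the eps clipping value is an assumption added here to keep log() away from zero.

```python
import math

def binary_crossentropy(label, prob, eps=1e-7):
    """Loss for one prediction: label is 0 or 1, prob is the sigmoid output."""
    prob = min(max(prob, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

confident_right = binary_crossentropy(1, 0.9)  # true pair scored high
confident_wrong = binary_crossentropy(1, 0.1)  # true pair scored low
undecided = binary_crossentropy(1, 0.5)        # model sits on the fence

print(confident_right, confident_wrong, undecided)
```

The loss is small when the sigmoid output agrees with the 0/1 label and grows quickly when the model is confidently wrong, which is what pushes true context pairs together and negative samples apart.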



In [None]:
"""
We can now use this validation_model to access the similarity operation, and this model will actually 
share the embedding layer with the primary model.  Note, because this model won’t be involved in 
training, we don’t have to run a Keras compile operation on it.

Now we are ready to train the model, but first let’s set up a function to print out the words with the 
closest similarity to our validation examples (valid_examples).
"""
from keras.models import Model

# create a secondary validation model to run our similarity checks during training
validation_model = Model(inputs=[input_target, input_context], outputs=similarity)

In [None]:
"""
The similarity callback
We want to create a “callback” which we can use to figure out which words are closest in similarity 
to our validation examples, so we can monitor the training progress of our embedding layer.

This class runs through all the valid_examples and gets the similarity score between the given 
validation word and all the other words in the vocabulary.  It gets the similarity score by running 
_get_sim(), which features a loop which runs through each word in the vocabulary, and runs a 
predict_on_batch() operation on the validation model. This basically looks up the embedding vectors 
for the two supplied words (the valid_example and the looped vocabulary example) and returns the 
similarity operation result.  The main loop then sorts the similarity in descending order and creates 
a string to print out the top 8 words with the closest similarity to the validation example.

The output of this callback will be seen during our training loop, which is presented below.

"""
class SimilarityCallback:
    def run_sim(self):
        for i in range(valid_size):
            # build_dataset() returns this lookup under the name reversed_dictionary
            valid_word = reversed_dictionary[valid_examples[i]]
            top_k = 8  # number of nearest neighbors
            sim = self._get_sim(valid_examples[i])
            # position 0 of the sorted order is normally the word itself, so skip it
            nearest = (-sim).argsort()[1:top_k + 1]
            log_str = 'Nearest to %s:' % valid_word
            for k in range(top_k):
                close_word = reversed_dictionary[nearest[k]]
                log_str = '%s %s,' % (log_str, close_word)
            print(log_str)

    @staticmethod
    def _get_sim(valid_word_idx):
        sim = np.zeros((vocab_size,))
        in_arr1 = np.zeros((1,))
        in_arr2 = np.zeros((1,))
        for i in range(vocab_size):
            in_arr1[0,] = valid_word_idx
            in_arr2[0,] = i
            out = validation_model.predict_on_batch([in_arr1, in_arr2])
            sim[i] = np.squeeze(out)  # collapse the single-element model output to a scalar
        return sim
sim_cb = SimilarityCallback()
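
The ranking step inside run_sim() can be tried out on its own. The sketch below uses a hand-written dictionary of two-dimensional vectors, toy_embeddings, whose values are invented for illustration, and a hypothetical helper nearest_words() that ranks by cosine similarity in plain Python instead of calling the validation model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot_uv = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot_uv / (norm_u * norm_v)

# Invented embeddings: the number words point in roughly the same direction.
toy_embeddings = {
    "six": [0.9, 0.1],
    "seven": [0.7, 0.3],
    "eight": [0.85, 0.15],
    "fire": [-0.2, 0.9],
}

def nearest_words(query, embeddings, top_k):
    """Return the top_k words most similar to query, excluding query itself."""
    scores = [(cosine(embeddings[query], vec), word)
              for word, vec in embeddings.items() if word != query]
    scores.sort(reverse=True)  # highest similarity first
    return [word for _, word in scores[:top_k]]

print(nearest_words("eight", toy_embeddings, top_k=2))
```

The callback above does the same thing at full scale, except that each similarity score comes from a predict_on_batch() call, which is why it is only run every 10,000 iterations.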


In [None]:
"""
The training loop
The main training loop of the model is:

In this loop, we run through the total number of epochs.  First, we select a random index from our 
word_target, word_context and labels arrays and place the values in dummy numpy arrays.  Then we supply 
the inputs (the values taken from word_target and word_context) and the output (the label) to the primary 
model and run a train_on_batch() operation.  This returns the current loss evaluation, loss, of the model, 
which is printed every 100 iterations. Every 10,000 iterations we also run the run_sim() function of the 
SimilarityCallback.

Here are some of the word similarity outputs for the validation example word “eight” as we progress through the training iterations:

Iterations = 0:

Nearest to eight: much, holocaust, representations, density, fire, senators, dirty, fc

Iterations = 50,000:

Nearest to eight: six, finest, championships, mathematical, floor, pg, smoke, recurring

Iterations = 200,000:

Nearest to eight: six, five, two, one, nine, seven, three, four

As can be observed, at the start of the training, all sorts of random words are associated with “eight”.  
However, as the training iterations increase, other number words are slowly associated with “eight” until 
finally all of the closest 8 words are number words.

There you have it. In this Word2Vec Keras tutorial, I’ve shown you how the Word2Vec methodology works 
with negative sampling, and how to implement it in Keras using its functional API.  In the next tutorial, 
I will show you how to reload trained embedding weights into both Keras and TensorFlow. You can also 
check out how embedding layers work in LSTM networks in this tutorial.
"""

arr_1 = np.zeros((1,))
arr_2 = np.zeros((1,))
arr_3 = np.zeros((1,))
for cnt in range(epochs):
    idx = np.random.randint(0, len(labels))  # the upper bound is exclusive, so every index can be drawn
    arr_1[0,] = word_target[idx]
    arr_2[0,] = word_context[idx]
    arr_3[0,] = labels[idx]
    loss = model.train_on_batch([arr_1, arr_2], arr_3)
    if cnt % 100 == 0:
        print("Iteration {}, loss={}".format(cnt, loss))
    if cnt % 10000 == 0:
        sim_cb.run_sim()