### Skip-Gram and CBOW  
Word2Vec comes in two variants: Skip-Gram and CBOW (Continuous Bag-of-Words). Algorithmically, the two models are similar.  
CBOW predicts a target word from its neighborhood (context), whereas Skip-Gram does the inverse: it predicts the context words from the target word. For example, take the sentence "the quick brown fox jumped over the lazy dog". If we define the context as the words immediately to the left and right of the target word, CBOW is trained on the dataset:  
([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox)...  
where CBOW tries to predict the target word quick from the bracketed context words [the, brown], brown from [quick, fox], and so on. With Skip-Gram, the dataset becomes:  
(quick, the), (quick, brown), (brown, quick), (brown, fox), ...  
where Skip-Gram predicts the context words the and brown from the target word quick. Statistically, CBOW smooths over much of the distributional information by treating an entire context as one example. For the most part, this turns out to be useful for smaller datasets.  
On the other hand, Skip-Gram treats each context-target pair as a new observation and has been shown to capture semantics better when we have a large dataset.
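
The two datasets above can be reproduced with a few lines of plain Python. This is a minimal sketch, not part of the notebook's pipeline: the helper names `cbow_examples` and `skipgram_examples` are made up for illustration, and the sketch assumes (as the example in the text does) that words without a full window on both sides are skipped.

```python
def cbow_examples(tokens, window=1):
    """Yield (context_words, target) pairs for targets with a full window on both sides."""
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        yield context, tokens[i]


def skipgram_examples(tokens, window=1):
    """Yield one (target, context_word) pair per context word."""
    for context, target in cbow_examples(tokens, window):
        for word in context:
            yield target, word


sentence = "the quick brown fox jumped over the lazy dog".split()
cbow = list(cbow_examples(sentence))
skipgram = list(skipgram_examples(sentence))

print(cbow[:3])      # first three CBOW examples
print(skipgram[:4])  # first four Skip-Gram examples
```

Each CBOW example with a window of one word on each side expands into two Skip-Gram pairs, which is why Skip-Gram sees more (and less smoothed) training observations from the same text.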

### Preparing training data  
Several functions, defined below, are used to generate batches for training.  
First, we read the data into memory and build the vocabulary from the most frequent words.  
Meanwhile, we keep two dictionaries: one that translates words to indices and another that does the reverse.  
Then, every word in the text is taken in turn as the center word and paired with each context word inside a randomly sized window around it.  
Finally, a Python generator yields batches of center-target pairs.
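
To make the vocabulary step concrete before the full script, here is a toy sketch on a nine-word corpus. The function name `toy_vocab` and the sample sentence are invented for illustration. The one convention taken from the script below is that index 0 is reserved for the out-of-vocabulary token `UNK`.

```python
from collections import Counter


def toy_vocab(tokens, vocab_size):
    """Map the (vocab_size - 1) most frequent tokens to indices 1.., reserving 0 for 'UNK'."""
    words = ['UNK'] + [w for w, _ in Counter(tokens).most_common(vocab_size - 1)]
    word_to_index = {w: i for i, w in enumerate(words)}
    index_to_word = {i: w for w, i in word_to_index.items()}
    return word_to_index, index_to_word


tokens = "the cat sat on the mat the cat ran".split()
word_to_index, index_to_word = toy_vocab(tokens, vocab_size=3)

# Words outside the vocabulary fall back to index 0 ('UNK').
encoded = [word_to_index.get(w, 0) for w in tokens]
decoded = [index_to_word[i] for i in encoded]

print(encoded)  # [1, 2, 0, 0, 1, 0, 1, 2, 0]
print(decoded)
```

With a vocabulary of three entries, only `the` and `cat` keep their identity. Every rarer word collapses to `UNK`, which is the same effect visible in the text8 printout further down.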

In [15]:
"""The content of process_data.py"""

from collections import Counter
import random
import os
import sys
sys.path.append('..')
import zipfile

import numpy as np
from six.moves import urllib
import tensorflow as tf


# Parameters for downloading data
DOWNLOAD_URL = 'http://mattmahoney.net/dc/'
EXPECTED_BYTES = 31344016
DATA_FOLDER = ''
FILE_NAME = 'text8.zip'


def make_dir(path):
    """ Create a directory if there isn't one already. """
    try:
        os.mkdir(path)
    except OSError:
        pass

def download(file_name, expected_bytes):
    """ Download the dataset text8 if it's not already downloaded """
    file_path = DATA_FOLDER + file_name
    if os.path.exists(file_path):
        print("Dataset ready")
        return file_path
    file_name, _ = urllib.request.urlretrieve(DOWNLOAD_URL + file_name, file_path)
    file_stat = os.stat(file_path)
    if file_stat.st_size == expected_bytes:
        print('Successfully downloaded the file', file_name)
    else:
        raise Exception(
              'File ' + file_name +
              ' might be corrupted. You should try downloading it with a browser.')
    return file_path    
    
    
def read_data(file_path):
    """ Read the first file in the zip archive into a list of tokens. """
    with zipfile.ZipFile(file_path) as f:
        # tf.compat.as_str() converts the raw bytes into a str
        words = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return words

def build_vocab(words, vocab_size):
    """ Build a vocabulary of the vocab_size most frequent words.

    Returns two dictionaries: word -> index and index -> word.
    """
    dictionary = dict()
    count = [('UNK', -1)]
    count.extend(Counter(words).most_common(vocab_size - 1))
    index = 0
    make_dir('processed')
    with open('processed/vocab_1000.tsv', "w") as f:
        for word, _ in count:
            dictionary[word] = index
            if index < 1000:
                f.write(word + "\n")
            index += 1
    index_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return dictionary, index_dictionary

def convert_words_to_index(words, dictionary):
    """ Replace each word in the dataset with its index in the dictionary """
    return [dictionary[word] if word in dictionary else 0 for word in words]

def process_data(vocab_size):
    """ Read data, build the vocabulary, and convert the words to indices. """
    file_path = download(FILE_NAME, EXPECTED_BYTES)
    words = read_data(file_path)
    dictionary, index_dictionary = build_vocab(words, vocab_size)
    index_words = convert_words_to_index(words, dictionary)
    del words # to save memory
    return index_words, dictionary, index_dictionary

def generate_sample(index_words, context_window_size):
    """ Form training pairs according to the skip-gram model. """
    for index, center in enumerate(index_words):
        # sample a window size between 1 and context_window_size (inclusive)
        context = random.randint(1, context_window_size)
        # pair the center word with every word in the window before it
        for target in index_words[max(0, index - context): index]:
            yield center, target
        # pair the center word with every word in the window after it
        for target in index_words[index + 1: index + context + 1]:
            yield center, target

def get_batch(iterator, batch_size):
    """ Group a numerical stream into batches and yield them as Numpy arrays. """
    while True:
        center_batch = np.zeros(batch_size, dtype=np.int32)
        target_batch = np.zeros([batch_size, 1], dtype=np.int32)  # word indices, so keep them integers
        for index in range(batch_size):
            center_batch[index], target_batch[index] = next(iterator)
        yield center_batch, target_batch

def get_batch_gen(index_words, context_window_size, batch_size):
    """ Return a python generator that generates batches"""
    single_gen = generate_sample(index_words, context_window_size)
    batch_gen = get_batch(single_gen, batch_size)
    return batch_gen
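
One caveat about `get_batch`: it calls `next(iterator)` inside a generator. On Python 3.7 and later, a `StopIteration` raised inside a generator body is turned into a `RuntimeError` (PEP 479), so running past the end of the pair stream would end with an error and not a clean stop. The sketch below shows one possible alternative under that assumption. The names `toy_pairs` and `finite_batches` are hypothetical and not part of process_data.py, and dropping the incomplete final batch is a design choice made here, not something the original code specifies.

```python
import itertools
import random

import numpy as np


def toy_pairs(indices, max_window):
    """Same idea as generate_sample: a random window size per center word."""
    for i, center in enumerate(indices):
        window = random.randint(1, max_window)
        for target in indices[max(0, i - window):i] + indices[i + 1:i + window + 1]:
            yield center, target


def finite_batches(pairs, batch_size):
    """Yield full (centers, targets) batches and stop cleanly when pairs run out."""
    while True:
        chunk = list(itertools.islice(pairs, batch_size))
        if len(chunk) < batch_size:
            return  # drop the incomplete final batch
        centers, targets = zip(*chunk)
        yield (np.array(centers, dtype=np.int32),
               np.array(targets, dtype=np.int32).reshape(-1, 1))


random.seed(0)  # only to make the toy run repeatable
batches = list(finite_batches(toy_pairs(list(range(20)), max_window=2), batch_size=8))

for centers, targets in batches[:2]:
    print(centers.shape, targets.shape)
```

Because the toy "corpus" is just `0..19`, every target differs from its center by at most `max_window`, which makes it easy to check by eye that the window logic behaves as intended.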


In [27]:
vocab_size = 10000
window_sz = 5
batch_sz = 64

index_words, dictionary, index_dictionary = process_data(vocab_size)
batch_gen = get_batch_gen(index_words, window_sz, batch_sz)
X, y = next(batch_gen)  # X = center words, y = context words (window size is random, 1 to window_sz)

print(X.shape)
print(y.shape)

Dataset ready
(64,)
(64, 1)


In [32]:
for i in range(50): # print out the first 50 words in the text
    print(index_dictionary[index_words[i]], end=' ')

anarchism originated as a term of abuse first used against early working class UNK including the UNK of the english revolution and the UNK UNK of the french revolution whilst the term is still used in a UNK way to describe any act that used violent means to destroy the 

In [31]:
for i in range(len(X)): # print out the pairs
    data = index_dictionary[X[i]]
    label = index_dictionary[y[i,0]]
    print('(', data, label,')')

( anarchism originated )
( anarchism as )
( anarchism a )
( anarchism term )
( anarchism of )
( originated anarchism )
( originated as )
( as anarchism )
( as originated )
( as a )
( as term )
( a originated )
( a as )
( a term )
( a of )
( term anarchism )
( term originated )
( term as )
( term a )
( term of )
( term abuse )
( term first )
( term used )
( term against )
( of term )
( of abuse )
( abuse as )
( abuse a )
( abuse term )
( abuse of )
( abuse first )
( abuse used )
( abuse against )
( abuse early )
( first a )
( first term )
( first of )
( first abuse )
( first used )
( first against )
( first early )
( first working )
( used term )
( used of )
( used abuse )
( used first )
( used against )
( used early )
( used working )
( used class )
( against term )
( against of )
( against abuse )
( against first )
( against used )
( against early )
( against working )
( against class )
( against UNK )
( against including )
( early abuse )
( early first )
( early used )
( early agains

### Using the Dataset API

In [33]:
BATCH_SIZE = 128
dataset = tf.data.Dataset.from_tensor_slices((X, y))
dataset = dataset.repeat()          # Repeat the input indefinitely.
dataset = dataset.batch(BATCH_SIZE) # stack BATCH_SIZE elements into one
iterator = dataset.make_one_shot_iterator() # iterator
next_batch = iterator.get_next()    # an operation that gives the next batch

In [34]:
with tf.Session() as sess:
    data, label = sess.run(next_batch)
    print(data.shape)
    print(label.shape)

(128,)
(128, 1)
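
Note that `X` and `y` hold only 64 pairs, yet the batch has 128 rows. That is the effect of calling `repeat()` before `batch()`: the dataset cycles back to its first element and the batch is filled across the boundary. The following is a plain NumPy imitation of that behavior, meant only as an illustration. It is not how `tf.data` is implemented, and `repeat_and_batch` is a made-up helper name.

```python
import itertools

import numpy as np


def repeat_and_batch(features, labels, batch_size):
    """Cycle through (feature, label) rows forever and stack batch_size of them."""
    rows = itertools.cycle(zip(features, labels))
    while True:
        chunk = list(itertools.islice(rows, batch_size))
        xs, ys = zip(*chunk)
        yield np.stack(xs), np.stack(ys)


# Stand-ins for the 64 center/context pairs from the earlier cell.
X = np.arange(64, dtype=np.int32)
y = np.arange(64, dtype=np.int32).reshape(64, 1)

data, label = next(repeat_and_batch(X, y, batch_size=128))
print(data.shape)   # (128,)
print(label.shape)  # (128, 1)
```

In this imitation the second half of the batch is an exact copy of the first half, since 128 is twice 64. In a real training loop the dataset would normally be much larger than one batch, so such wrap-around only happens at epoch boundaries.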
